InfiniBand: Improving Communications for Large-scale Computing

InfiniBand (IB) is a popular network communications link for large-scale computing applications. It offers low latency (short transmission delay), high bandwidth, good scalability, and inexpensive hardware, and it is based on an open standard. Livermore Computing (LC), one of the largest supercomputing centers in the world, uses IB networks in many of its supercomputers and file storage systems. However, deploying and operating IB on LC’s systems presents some challenges. To address them, Lawrence Livermore National Laboratory (LLNL) works closely with IB hardware vendors and focuses its efforts on developing a common software stack. Adopting industry solutions and a common software stack helps LC streamline the configuration, monitoring, and troubleshooting of each of its networks.

Since 2007, LC has deployed many IB networks within its commodity cluster computing systems and data storage networks. IB encompasses both hardware and software. While LC does not produce the IB hardware, it maintains close relationships with hardware vendors, and LC’s system administrators test new hardware during preproduction. After large deployments, LC also provides feedback to vendors, particularly about scalability issues, through LLNL’s membership in the InfiniBand Trade Association and the OpenFabrics Alliance (OFA), an industry association chartered to develop a unified open-source Linux-based software stack for IB deployment.

In 2012, LLNL deployed a new IB organization scheme for Sequoia’s storage network. Overall, it is functioning well, providing approximately 850 gigabytes per second of data transfer. The IB specification has also allowed greater flexibility and better network performance with less expensive hardware. However, these advances also rely on more complex software. LC has partnered with the IB industry through OFA to use the alliance’s open IB software stack for Sequoia. Additionally, LLNL is integrating its software changes into the open software stack to take advantage of the testing and maintenance provided by the community.

LLNL uses an open-source subnet manager called OpenSM to manage, monitor, and troubleshoot the IB system. In 2012, LC added three major features to OpenSM. First, a new master configuration file enables the management software to compare the current operating network with the desired network and flag improper connections, invalid speeds, and missing nodes. Second, LC integrated congestion control configuration management into OpenSM, which should eventually help achieve higher overall throughput. Third, LC implemented an OpenSM plug-in that allows unprecedented access to subnet management data, which is useful, for instance, for finding and resolving errors. As LC system administrators and computer scientists continue to maintain, enhance, and expand the use of IB infrastructure at LLNL, they will also benefit from and contribute to the standardization efforts of the open-source IB community.
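The comparison of the operating network against the desired network can be illustrated with a minimal sketch. This is not OpenSM's code or its configuration file format: the Link record, the compare_fabric function, and the speeds in Gb/s are hypothetical names and units chosen only to show the idea of flagging missing nodes, invalid speeds, and improper connections.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """One cable between two ports (hypothetical record, not an OpenSM type)."""
    node_a: str
    port_a: int
    node_b: str
    port_b: int
    speed_gbps: float


def endpoints(link):
    """Order-independent key for a cable: its two (node, port) ends."""
    return frozenset({(link.node_a, link.port_a), (link.node_b, link.port_b)})


def describe(link):
    return f"{link.node_a}:{link.port_a} <-> {link.node_b}:{link.port_b}"


def nodes_of(links):
    return {name for link in links for name in (link.node_a, link.node_b)}


def compare_fabric(desired, observed):
    """Return (kind, detail) tuples for differences between the two networks."""
    issues = []
    wanted = {endpoints(link): link for link in desired}
    seen = {endpoints(link): link for link in observed}

    # Nodes that the desired network names but that were never observed.
    for node in sorted(nodes_of(desired) - nodes_of(observed)):
        issues.append(("missing_node", node))

    # Expected links that are present but running at the wrong speed.
    # Expected links that are absent altogether are not reported separately
    # in this sketch.
    for key, want in wanted.items():
        got = seen.get(key)
        if got is not None and got.speed_gbps != want.speed_gbps:
            detail = f"{describe(want)}: expected {want.speed_gbps}, got {got.speed_gbps}"
            issues.append(("invalid_speed", detail))

    # Observed links that the desired network does not contain.
    for key, got in seen.items():
        if key not in wanted:
            issues.append(("improper_connection", describe(got)))

    return issues
```

In this sketch a cable is identified by its two endpoints regardless of the order in which they are listed, so a link reported from either side matches the same entry in the desired network.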

For more information, contact Jim Silva.