PDL receives computing cluster from Los Alamos

by Marika Yang

Carnegie Mellon University’s Parallel Data Lab has received a supercomputer from Los Alamos National Lab (LANL) that will be reconstructed into a computing cluster and play an important role in educating the next generation of computer science professionals, researchers, and educators.

Carnegie Mellon University has received a supercomputer from Los Alamos National Lab (LANL) that will be reconstructed into a computing cluster operated by the Parallel Data Lab (PDL) and housed in Carnegie Mellon’s Data Center Observatory. The new computing cluster will augment the existing Narwhal cluster, which also came from LANL and is built from parts of the decommissioned Roadrunner supercomputer, the fastest supercomputer in the world from June 2008 to June 2009.

This new supercomputer, tentatively named Wolf, will be an important part of educating the next generation of computer science professionals, researchers, and educators at Carnegie Mellon. The system was recently retired from LANL’s open institutional computing environment and, while no longer efficient for simulation science, it still has high value as a training tool and for computer science research. Wolf is made up of 616 computing nodes, each containing two eight-core Intel Xeon Sandy Bridge processors, totaling 9,856 processing cores across the entire cluster. The cluster interconnect is QDR InfiniBand, providing a network that is 30 times faster than Narwhal’s. Altogether, it will have a capability of about 200 teraflops, where a teraflop represents one trillion floating-point operations per second.
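For readers who want to check the arithmetic, the short Python sketch below reproduces the core count from the node figures quoted above and gives a rough estimate of peak performance. The clock speed and the number of floating-point operations per cycle are assumptions chosen for illustration, not figures from LANL or PDL, so the teraflops result should be read as a back-of-envelope estimate only.

```python
# Figures taken from the article.
NODES = 616
PROCESSORS_PER_NODE = 2
CORES_PER_PROCESSOR = 8

# Assumptions for illustration only (not stated in the article):
ASSUMED_CLOCK_GHZ = 2.6       # hypothetical per-core clock speed
ASSUMED_FLOPS_PER_CYCLE = 8   # double-precision operations per core per cycle


def total_cores(nodes, processors_per_node, cores_per_processor):
    """Return the number of processing cores in the whole cluster."""
    return nodes * processors_per_node * cores_per_processor


def peak_teraflops(cores, clock_ghz, flops_per_cycle):
    """Estimate theoretical peak performance in teraflops.

    cores * GHz * operations-per-cycle yields gigaflops,
    so divide by 1,000 to convert to teraflops.
    """
    return cores * clock_ghz * flops_per_cycle / 1000.0


if __name__ == "__main__":
    cores = total_cores(NODES, PROCESSORS_PER_NODE, CORES_PER_PROCESSOR)
    print(f"Total cores: {cores}")
    estimate = peak_teraflops(cores, ASSUMED_CLOCK_GHZ, ASSUMED_FLOPS_PER_CYCLE)
    print(f"Estimated peak: {estimate:.0f} teraflops")
```

Under these assumed values the estimate lands close to the roughly 200 teraflops cited for Wolf; different assumptions about clock speed would shift the result.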

A group picture of nine men outside — The PDL and LANL researchers involved in the effort.

A row of the supercomputer — Ten of the 13 racks that comprise Wolf.

Four men posing with the supercomputer — The principal investigators of the LANL-CMU collaboration with Wolf in the background.

“Wolf’s processing cores are each significantly faster than the previous system, and it consists of about 50 percent more computing nodes,” said George Amvrosiadis, assistant research professor of electrical and computer engineering and the Parallel Data Lab (PDL). “We will be retiring the Narwhal nodes. Our experienced PDL team, with Jason Boles leading the installation effort, is doing this gradually to make sure everything works as expected.”

In the five years since they received Narwhal from LANL, the researchers of the Parallel Data Lab have developed several projects with the computing cluster in service of educating the world’s next thought leaders in areas of computer science including scalable storage, cloud computing, machine learning, and operating systems.

“Standing up and operating a reasonably large supercomputer is no small feat,” said Brad Settlemyer, senior scientist at LANL. “One of the many reasons that Los Alamos partners with PDL in finding a place for our retired machines is their commitment to providing the staff and resources required to fully utilize this system as an important educational tool.”

For example, under the DeltaFS project, PDL researchers designed and built a new distributed file system that enables scientists to create trillions of files in minutes. With students, faculty, relevant problems, and the right tools, PDL has been able to conduct world-class research and training and to produce outcomes such as DeltaFS.

“The PDL infrastructure enabled us to develop such an ambitious project in-house on Narwhal,” said Amvrosiadis. “We were able to use hundreds of nodes to test the scalability of our code, which significantly sped up development and increased our confidence that we could run DeltaFS on Trinity, Los Alamos’s fastest supercomputer, before we finally did.”

Another major benefit of Narwhal, and now Wolf, is having direct access to its hardware on Carnegie Mellon’s campus. While some projects at the Parallel Data Lab use cloud resources to conduct experiments, training future researchers and working toward the future of systems often requires hands-on access to every layer of the machine, from the hardware to all of the software. Having the computing cluster physically on campus gives the researchers this control.

The transition from Narwhal to Wolf is currently underway in the Data Center Observatory on the first floor of the Robert Mehrabian Collaborative Innovation Center (CIC). It is a careful and gradual undertaking to ensure that all of the equipment works as expected, from cables and fans to processors and memory modules, since components can be damaged during delivery.

The Parallel Data Lab plans to use the new computing cluster for ongoing projects in research areas such as distributed systems, cluster computing, and parallel file systems. Amvrosiadis also anticipates that new projects will become possible with the computing power of Wolf.

“Over the years, we often found ourselves limited by the computational and network capabilities of Narwhal. With Wolf, I expect our experiments will be able to uncover interesting performance trends that are more realistic of contemporary hardware in data centers around the world, making these retired LANL systems a realistic training tool,” he said. “Narwhal enabled PDL to conduct training of world-class researchers for many years, and I am looking forward to the research and training that will be made possible by Wolf.”