How to build software for a computer 50 times faster than anything in the world

This image is a small portion of an output from the “Q Continuum” cosmology simulation; the full simulation evolves more than half a trillion particles. Exascale systems will further enable researchers to run advanced simulations like this to shed more light on the key ingredients that make up our universe.
Credit: Image courtesy of the Hardware/Hybrid Accelerated Cosmology Code (HACC) team

Imagine you were able to solve a problem 50 times faster than you can now. With this ability, you would have the potential to come up with answers to even the most complex problems faster than ever before.

Researchers behind the U.S. Department of Energy’s (DOE) Exascale Computing Project want to make this capability a reality, and are doing so by creating tools and technologies for exascale supercomputers — computing systems at least 50 times faster than those used today. These tools will advance researchers’ ability to analyze and visualize complex phenomena such as cancer and nuclear reactors, which will accelerate scientific discovery and innovation.

Developing layers of software that support and connect hardware and applications is critical to making these next-generation systems a reality.

“These software environments have to be robust and flexible enough to handle a broad spectrum of applications, and be well integrated with hardware and application software so that applications can run and operate seamlessly,” said Rajeev Thakur, a computer scientist at the DOE’s Argonne National Laboratory and the director of software technology for the Exascale Computing Project (ECP).

Researchers in Argonne’s Mathematics and Computer Science Division are collaborating with colleagues from five other core ECP DOE national laboratories — Lawrence Berkeley, Lawrence Livermore, Sandia, Oak Ridge and Los Alamos — in addition to other labs and universities.

Their goal is to create new software technologies, and adapt existing ones, to operate at exascale by overcoming challenges in several key areas, such as memory, power and computational resources.


Argonne computer scientist Franck Cappello leads an ECP project focused on advanced checkpoint/restart, a defense mechanism for withstanding failures that happen when applications are running.

“Given their complexity, faults in high-performance systems are a common occurrence, and some of them lead to failures that cause parallel applications to crash,” Cappello said.

“Many ECP applications already feature checkpoint/restart, but because we’re moving towards an even more complex system at exascale, we need more sophisticated methods for it. For us, that means providing an effective and efficient checkpoint/restart for ECP applications that lack it, and providing other applications a more efficient and scalable checkpoint/restart.”
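The idea behind checkpoint/restart can be illustrated with a minimal sketch. It is not the ECP project's software: the function names, the `fail_at` parameter and the use of a single local file are assumptions made purely for illustration. The program periodically saves its state, and after a crash it resumes from the last saved state instead of starting over.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old checkpoint.

    The rename is atomic, so a crash during the write cannot corrupt
    the last good checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Toy computation: sum the step numbers 0 .. total_steps - 1.

    fail_at (hypothetical, for demonstration only) simulates a crash
    just before the given step is executed.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If `run` is called with a simulated failure and then called again on the same path, the second call picks up at the last checkpointed step and repeats only the work done since that checkpoint. Production systems face the much harder problem of doing this consistently across many nodes and storage layers.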

Cappello also leads a project focused on reducing the large volumes of data these machines generate, which are expensive to store and to communicate.

“We’re developing techniques that can reduce data volume by at least a factor of 10. The problem with this is that you add some margin of error when you reduce the data,” Cappello said.

“The focus then is on controlling the margin of error; you want to control the error so it doesn’t affect the scientific result in the end while still being efficient at reduction, and this is one of the challenges we are looking at.”
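The trade-off Cappello describes can be sketched with one simple technique, uniform quantization under an absolute error bound. This is an illustrative assumption, not the project's actual algorithm: every value is replaced by the index of a bin of width twice the error bound, so the reconstructed value never differs from the original by more than the bound (up to floating-point rounding), and the resulting small integers compress far better than raw floating-point numbers. The sample signal and the bound of 1e-3 are made up for the example.

```python
import math
import struct
import zlib


def quantize(values, error_bound):
    """Map each value to the index of a bin of width 2 * error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def reconstruct(codes, error_bound):
    """Recover approximate values from bin indices (the bin centers)."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def compressed_size(codes):
    """Bytes needed after packing the codes as 32-bit ints and applying zlib."""
    packed = struct.pack(f"<{len(codes)}i", *codes)
    return len(zlib.compress(packed))


# Hypothetical input: a smooth signal stored as 64-bit floats.
signal = [math.sin(0.01 * i) for i in range(1000)]
error_bound = 1e-3

codes = quantize(signal, error_bound)
restored = reconstruct(codes, error_bound)

# Largest pointwise difference between the original and restored data.
max_error = max(abs(a - b) for a, b in zip(signal, restored))
raw_bytes = 8 * len(signal)
```

Here `max_error` stays within `error_bound`, while `compressed_size(codes)` is smaller than `raw_bytes`. Loosening the bound shrinks the data further at the cost of accuracy, which is the balance the researchers are trying to control.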


For information that is stored on exascale systems, researchers need data management controls for memory, power and processing cores. Argonne computer scientist Pete Beckman is investigating methods for managing all three through a project known as Argo.

“The efficiency of memory and storage have to keep up with the increase in computation rates and data movement requirements that will exist at exascale,” Beckman said.

“But how memory is arranged in systems and the technology used for it is also changing, and has more layers,” he said. “So we have to account for these changes, in addition to anticipating and designing around the future needs of the applications that will use these systems.”

With added layers of memory on exascale systems, researchers must develop complementary software that regulates these memory technologies and gives users control over the process.

“Having controls in place is important because where you choose to store information affects how quickly you can retrieve it,” Beckman said.


Another key resource that Beckman and Argo Project researchers are studying is power. As with memory, methods for allocating power resources could speed up or slow computation within a high-performance system. Researchers are interested in developing software technologies that could enhance users’ control over this resource.

“Power limits may not be at the top of the list when you’re dealing with smaller systems, but when you’re talking about tens of megawatts of power, which is what we’ll need in the future, how an application uses that power becomes an important distinguishing characteristic,” Beckman said.

“The goal for us is to achieve a level of control that maximizes the user’s abilities while maintaining efficiency and minimizing cost,” he said.