Arm is changing the landscape of HPC today. Following an exciting ISC 2019, and recent ecosystem announcements set to deliver energy-efficient supercomputing based on Arm technology across three continents, Eric Van Hensbergen, Fellow and Director of HPC Research at Arm, looks back at the Arm HPC journey so far, and the next steps for research and development.
I first got into High Performance Computing at the dawn of the petascale era, working on the DARPA High Productivity Computing Systems (HPCS) program, which aimed to develop and build the first petascale machine. At the start of the program there were five different system integrators (IBM, Cray, Sun, SGI, and HPE) with five different chip architectures (Power, x86, SPARC, MIPS, and PA-RISC) all competing. Over the course of the different phases, the program was reduced to two system integrators with two different architectures, and ten years later half of the original architectures no longer exist and only three of the companies are still standing. This history highlights the appeal of the Arm ecosystem for high performance computing. Rather than being a single company, Arm represents a standardized architecture with many diverse implementations, allowing it to be competitive in multiple market segments while maintaining a competitive environment within the architecture that strengthens innovation.
While the DARPA HPCS program did eventually produce a machine, the petascale milestone had been reached years earlier by a hybrid architecture built by IBM for Los Alamos National Lab called Roadrunner. It was based on standard x86 servers with a Cell processor accelerator (which was itself composed of general-purpose Power cores coupled with many smaller SPU accelerators). It was one of the first applications of the hybrid computing model to high performance computing at scale. While IBM was working on the HPCS and Roadrunner programs, its researchers were also developing the first manycore HPC systems based on embedded processors: the Blue Gene line of supercomputers, which held top spots on the Top500 for many years.
While the Cell processor found application outside of HPC in the Sony PlayStation 3, it was never commercially successful in IBM's main line server business, a problem that ultimately killed both the Blue Gene and Cell processor product lines. But these two types of system would dominate supercomputing for some time to come, whether through the evolution of manycore in the Intel Knights series of processors and accelerators, or the eventual embrace of Nvidia GPUs for dense compute. However, many still believed that well-balanced general-purpose processors, power-efficient by design, were the path that would take the world from petascale to exascale.
It was this belief that led me to join Arm, in order to pursue large scale systems by leveraging embedded designs that were commercially viable in multiple market segments such as mobile computing. By that time, Arm had already started a foray into server computing and was making the transition from 32-bit to 64-bit processors. Arm had a different kind of business model than IBM or Intel: it didn't build systems or manufacture processors. Instead, it made money from licensing a standardized instruction set (the Arm architecture), implementations of that instruction set in the form of processor core designs, and supporting system components (things like interrupt controllers, on-chip interconnect, and so on). Arm partners take these pieces of intellectual property and incorporate their own components (sometimes in lieu of Arm-designed components) in order to build chips.
As a result, there are hundreds of suppliers of Arm-based chips, covering everything from microcontrollers the size of a grain of sand through to large-scale multi-core systems targeted at networking or the data center. The standardized architecture meant that even though there were many different manufacturers, all of whom could choose different optimization points and even their own micro-architectures for their products, they could all run the same software – a common software ecosystem on top of diverse hardware, with components that could be combined in different designs. That diversity of choice, opportunity for customization, and worldwide supply of companies able to build Arm-based silicon were primary drivers of Arm's success in many market segments, and among the reasons Arm was pulled into the HPC segment.
Europe’s desire to re-establish a local high performance computing supply chain naturally gravitated towards Arm. After all, Arm was started in a barn in the Cambridgeshire countryside: a European-designed architecture and processor used throughout the world. Arm’s power efficiency was another key motivator, as data center power delivery limits were viewed as one of the principal challenges in moving from petascale to exascale. Scaling the systems of the time from petascale to exascale would have taken 100-200MW, where most data centers could support only a small fraction of that. The US and China were setting 20MW budgets, but Europe set an even more aggressive target of 10MW.
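To put those budgets in perspective, a rough back-of-envelope calculation (derived purely from the figures above, not an official target) shows the efficiency leap they imply:

\[
\frac{10^{18}\ \text{FLOP/s}}{20\ \text{MW}} = 50\ \text{GFLOP/s per watt},
\qquad
\frac{10^{18}\ \text{FLOP/s}}{10\ \text{MW}} = 100\ \text{GFLOP/s per watt},
\]

compared with only roughly 5-10 GFLOP/s per watt for a machine that would have needed 100-200MW to reach exascale.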
Figure 1: Early BSC Arm Prototype Node
Figure 2: Deployed Mont-Blanc Prototype
The Barcelona Supercomputing Center (BSC) and several partners (including Arm) set out to do an initial evaluation of whether Arm could be used for HPC, with a plan to build the system out of existing 32-bit mobile chipsets. The Mont-Blanc project built a 128-node cluster based on Nvidia embedded boards with Arm Cortex-A9 processing cores. The project partners ported a foundational HPC software stack to the platform and showed scaling to the maximum number of cores in the system. A follow-on prototype was a 1,080-node cluster based on Samsung mobile phone chips with more powerful and more performant Arm cores. These early efforts analyzed the power efficiency potential of the Arm-based platform and its viability for running large scale high performance computing codes. They also uncovered several gaps in the hardware available at the time: high performance networking, 64-bit addressability, and the enterprise RAS features necessary for large scale clusters. The importance of this and future work led to the establishment of a joint Arm-BSC Centre of Excellence, announced in May 2019.
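For readers curious what "porting a foundational HPC software stack" looks like at its most basic level, the first sanity check on such a cluster is usually an ordinary MPI program built with the newly ported toolchain. The sketch below is purely illustrative and is not code from the Mont-Blanc project:

```c
/* Minimal MPI smoke test of the kind used to sanity-check a newly ported
 * HPC stack on an Arm cluster. Illustrative only, not Mont-Blanc code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    MPI_Get_processor_name(host, &len);     /* which node we landed on */

    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```

The point of the exercise is that the same source compiles and runs unmodified once an MPI library and compiler exist for the architecture, which is exactly what a standardized instruction set and common software stack provide.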
Meanwhile in the US, the Department of Energy (DoE) had already kicked off research programs aimed at accelerating its exascale ambitions – and began to take notice of the early success the Europeans were having with the Arm architecture. A seedling program funded by the DoE National Nuclear Security Administration (NNSA) brought Arm together with Sandia National Labs and other partners, with an explicit focus on balanced system design, given the importance of data movement to the performance and efficiency of HPC systems. Building on that early collaboration, Arm Research also joined a team working on the second phase of the DoE FastForward program, part of a line of fundamental research intended to accelerate the development of exascale computing. This coupled Arm and one of its server silicon providers, Broadcom, with Cray – an HPC system integrator with a track record of building some of the largest and highest performing supercomputers on the planet. Working together with Cray and other partners, Arm developed, evaluated and evolved the Scalable Vector Extension (SVE) to the architecture, with the goal of increasing computational density while allowing different vector lengths to be used in different core implementations without changing the underlying software.
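To give a flavour of what "different vector lengths without changing the underlying software" means in practice, here is a vector-length-agnostic loop written with the publicly documented SVE C intrinsics (ACLE). It is a minimal sketch, not code from the FastForward collaboration; the same binary runs on any SVE implementation, whether its vectors are 128 or 2048 bits wide:

```c
/* Vector-length-agnostic SAXPY using the Arm C Language Extensions for SVE.
 * A minimal sketch; nothing here hard-codes the hardware vector width. */
#include <arm_sve.h>
#include <stdint.h>

void saxpy(float a, const float *x, float *y, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {       /* svcntw(): 32-bit lanes per vector */
        svbool_t pg = svwhilelt_b32(i, n);             /* predicate masks off the loop tail */
        svfloat32_t vx = svld1_f32(pg, &x[i]);
        svfloat32_t vy = svld1_f32(pg, &y[i]);
        vy = svmla_f32_x(pg, vy, vx, svdup_f32(a));    /* y = y + a * x */
        svst1_f32(pg, &y[i], vy);
    }
}
```

Because the loop asks the hardware how many lanes it has rather than assuming a width, each core design is free to pick the vector length that suits its power and area budget while running the same software.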
At roughly the same time, another supercomputing vendor, Fujitsu, approached Arm to see if they could replace their existing SPARC architecture with Arm. Their principal motivation was that since the acquisition of Sun by Oracle, the SPARC software ecosystem had weakened to the point where critical software packages were no longer being updated or optimized for SPARC. They started their own core design based on the Arm architecture, and worked together with Arm to review and enhance the architecture for the HPC market, including SVE. A few years earlier, a Chinese company named Phytium had also licensed Arm products for use in their own line of high performance chips and systems. A 64-core, 64-bit Arm server processor named FT-2000/64 was announced at Hot Chips 2015.
While these new partners started their Arm-based explorations and designs, 64-bit Arm servers entered the world. While not designed specifically for HPC, the Applied Micro X-Gene servers made their debut at the International Supercomputing Conference in Germany, followed a year later by AMD's Seattle Arm server processors and the multi-socket, 48-core Cavium ThunderX. Though not intended for supercomputing, these early designs were scooped up and built into testbeds both in Europe and the US, coupled with high performance networking and GPGPU accelerators.
A year later, Broadcom announced their 256-thread Vulcan server processor, which was intended to compete head to head with Intel’s high performance general purpose CPUs. It was the Vulcan design team that had collaborated with Cray and Arm during the DoE FastForward program, and aspects of their design were heavily influenced by that collaboration as well as by feedback from application developers within the Department of Energy. However, just as Vulcan-based silicon started coming online, Broadcom was purchased by Avago, who decided to discontinue the Vulcan server product. An ecosystem-wide collaboration across governments, system integrators, and other interested parties managed to facilitate the transfer of the IP and design team to Cavium, who eventually productized the Vulcan design as the ThunderX2, the first system-on-chip implementing the Arm architecture to target HPC. A key factor benefiting the ThunderX2 was the decision to build a more balanced system with eight memory channels per socket, giving it a clear advantage on real-world benchmarks when compared to contemporary chips. Isambard, an early prototype Cray system based on the ThunderX2, was deployed at the University of Bristol, which carried out performance and energy comparisons against other contemporary HPC systems. It is currently being used as a production HPC system.
Figure 3: Isambard, an early prototype Cray system
Figure 4: Astra system at Sandia
Within a year, Cray, HPE, and Atos had announced supercomputer products based on the ThunderX2, and a year after that HPE brought the Astra system online at Sandia National Labs, whilst Cray brought the Thunder system online at Los Alamos National Labs. Unlike previous Arm deployments at US national labs, both of these systems were intended for production use. The original plan had been to start with Astra as a testbed to help work out problems with Arm systems that could only be detected at scale – but the system exceeded all expectations and the schedule for production deployment was accelerated. The ThunderX2 system at LANL had the shortest acceptance period of any system the lab had ever deployed. Applications were up and running in record time thanks to the general-purpose nature of the ThunderX2, with its support for familiar Linux environments and InfiniBand-based networking technologies.
Figure 5: Thunder system at Los Alamos National Labs
As impressive as the ThunderX2 was, its compute was still based on the traditional Arm SIMD architecture, NEON, since the ThunderX2 was developed before the SVE architecture was released. However, six months before Astra went online, Fujitsu showed the first silicon for their SVE-enabled A64FX processor developed for RIKEN, and, concurrent with Astra and Thunder coming online, announced that they had begun running applications on their test silicon. Meanwhile, the European Commission funded the European Processor Initiative, with the intention of building silicon for general purpose compute and accelerators in Europe to support its own exascale program. While no final architecture or configuration has been announced, Arm-based processors are being considered for the general-purpose compute portion of the design.
While Arm has always focused on general-purpose supercomputing, we have long recognized the importance of GPUs, particularly for certain classes of applications. At ISC 2019, Nvidia announced they were porting CUDA and their entire AI stack to Arm-based systems, providing even more opportunity for Arm to play a broader role in HPC and data analytics.
With production systems online, or in the process of coming online, in the US, Europe, China and Japan, Arm is increasingly taking a lead role in production supercomputing. At the same time, Arm Research is starting to look beyond exascale, where the slowing of Moore’s Law will create an even more challenging environment in which to continue to extend performance gains. Going from cell phones to petascale supercomputers in the span of six years has made this an exciting time to be at Arm, and it has been a privilege to be part of bringing diversity back into the HPC ecosystem. None of this would be possible without the Arm Research HPC team and our partners across Arm.
So, what’s next? The prominence of Arm discussions at ISC 2019 is just one marker of the exciting future ahead of the Arm HPC Ecosystem. If you’d like to find out more, visit our ecosystem pages below – and watch this space for more news!
Find Out More About the Arm HPC Ecosystem
Interested in finding out more about the Arm ecosystem? Why not join us at this year's Arm Research Summit, which will include HPC-focused sessions including the ever-popular Arm User Group workshop. We have a huge number of inspirational speakers, workshops, demos and plenty of networking opportunities for you to meet other academics, researchers and industry experts from around the world.