The accelerating power of coherence

June 14, 2024

Technological advances are data and compute hungry. Progress in many scientific fields, from machine learning to genomics, depends on innovators having access to high-performance computing resources. Accelerators can enhance performance and power efficiency for particular computing workloads. In recent years, there has been a surge of interest in the role that accelerators can play in the design of future computers for industry and academia.

Researchers at the Barcelona Supercomputing Center (BSC) have been working in collaboration with Arm to investigate how custom accelerators with a direct coherent connection to the memory hierarchy can improve on traditional processors. Coherence, say the team, will ultimately support the development of powerful systems that can deploy multiple accelerators and processors simultaneously.

We spoke to Guillem Lopez Paradis, a PhD student at BSC, to find out more.

My research focuses on the communication between hardware accelerators and traditional processors or CPUs; that is, the memory hierarchy, or the way you connect these accelerators. I am working on improving these connections.

Until recently, most of these connections have not been coherent. This means that the processor sends some data to the accelerator, the accelerator does its work, and then it hands the result back to the processor. What we are working on are systems that allow accelerators and processors to work at the same time, on the same data. This opens up a new paradigm: new systems with many accelerators working simultaneously, and processors that can communicate with them at the same time without costly interactions.
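
As a rough illustration of the difference described here, the following toy model counts how many elements are explicitly moved between a processor buffer and an accelerator buffer in each style. It is only a sketch: the function names are hypothetical, Python lists stand in for memory, and nothing here reflects the actual BSC or Arm design.

```python
# Toy model only. The names below are hypothetical and do not correspond
# to any Arm or BSC interface; Python lists stand in for memory buffers.

def noncoherent_offload(data, kernel):
    """Copy in, compute on a private buffer, copy the result back."""
    device_buf = list(data)                    # explicit copy to the accelerator
    result = [kernel(x) for x in device_buf]   # accelerator works on its own copy
    data[:] = result                           # explicit copy back to the processor
    return 2 * len(data)                       # elements moved, both directions


def coherent_offload(data, kernel):
    """Accelerator updates the shared buffer in place; no staging copies."""
    for i, x in enumerate(data):
        data[i] = kernel(x)                    # both sides see the same memory
    return 0                                   # no explicit copies in this model
```

In this model both functions leave the same values in `data`; what differs is the amount of explicit data movement, which is the "costly interaction" a coherent connection aims to avoid.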

It’s about creating ‘neo-systems’, the neo-computing devices that we need to build to continue with the development of newer systems with higher performance. We’re using technology that is already in the systems and trying to offer more flexible connections to the end user, in this case the hardware designer and the accelerator designer.

We are moving towards standardizing, more-or-less, the connection between many cores. You usually have a very regular network-on-chip, or different processors connected through a mesh. We’re going into a world where instead of having many processors, we will have many fitted routine ‘neos’; processors in different shapes, bigger ones, smaller ones, but also with accelerators already connected in the same memory hierarchy.

I have always been interested in understanding how processors work. Accelerators are the new paradigm; they are one of the hot topics of this decade. Memory hierarchy is also a very interesting topic to me.

Working with Arm, I have been able to touch an industry-level product from the inside. Usually, unless you are working with the company, you can only touch the outside. But I have been able to get the full product code, play with it, and check the code and documentation, to understand it very deeply. Arm Academic Access gives you a lot of hardware designs and you can interact with them. My previous research has been in simulation, but for this project we were able to work on a real product and execute everything on real hardware.

The collaboration with Arm was ideal because in most cases the FPGA platforms that we use to prototype new designs have an Arm CPU. So, it’s perfect for connecting anything to these Arm CPUs in FPGAs. It’s very easy to prototype with the frameworks that Arm has been able to provide us. I would say it is much easier than doing it in other technologies.

Exploiting the power of memory coherence

The reason that some accelerators can benefit from a direct coherent connection with the memory hierarchy is that certain applications have a lot of re-use of data, or they have what we call ‘chaining’ or ‘lining’. So, one accelerator does one task and then the next one does some other task. If there is enough re-use, if there is another access in the meantime, or if some of the accelerators are working on the same data, then it’s necessary, or at least useful, to have this memory coherence.
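
To make the chaining argument concrete, here is a minimal sketch of a chain of stages, each standing in for one accelerator. It assumes a simplified cost model in which a non-coherent stage copies its input in and its output back, while a coherent stage reads the previous stage's output directly. The function and parameter names are hypothetical.

```python
# Simplified cost model; `run_chain` and its parameters are hypothetical.

def run_chain(data, stages, coherent):
    """Apply each stage in order; return (result, elements_copied)."""
    copied = 0
    buf = list(data)
    for stage in stages:
        if coherent:
            work = buf                # next stage reads the previous output directly
        else:
            work = list(buf)          # stage receives a private copy of its input
            copied += len(buf)
        buf = [stage(x) for x in work]
        if not coherent:
            copied += len(buf)        # result is copied back before the next stage
    return buf, copied
```

Under these assumptions the copy count grows with both the data size and the number of chained stages, which is why longer chains and heavier data re-use make coherence more attractive.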

I think it’s the natural, cost-efficient way of designing chips. We are more-or-less in the post-Moore’s Law era, and the problem is that with newer process technologies at nanometer transistor sizes, we are not getting the same performance gains that we were before. People will always want more.

Many accelerators and workloads could benefit from a direct coherent connection with the memory hierarchy. System services on Linux could be interesting, because they need to be coherent. Applications in machine learning and genomics are interesting. So too is any application that uses parallelization techniques, such as OpenMP. If you are working with some of these, being coherent and having an accelerator that does some of the tasks would be extremely interesting. We’re not claiming it’s better for everything. Nevertheless, we think there are many applications that can benefit from this.

There is interest in creating new protocols for how accelerators and processors communicate. Right now, industry is going in the direction of creating a standard for communications. So, you have a server, you put your accelerator in, and then, because there is this standard in communication, it will work without you doing anything, like plug and play. We would like to help with this standard, or create new standards, and improve this protocol. Not only for off-the-shelf accelerators, but also for accelerators that are inside the chip. The standard right now is going with accelerators that are outside of the chip, but we still believe there is room for improvement inside the chip.

The main benefit of standardizing a common framework for accelerators is usability and verification. It would be extremely easy to add a new hardware system if you don’t need to verify a new custom protocol or a new connection. This would really benefit industry, but also academia. If everyone speaks the same framework or the same protocol, then it’s easier to do research or to try new things.
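
A software analogy may help show why a shared protocol eases both usability and verification. In the sketch below, host code is written once against a common interface, and any accelerator that implements it can be dropped in. The `Accelerator` interface and the two example devices are hypothetical and do not represent any real Arm or industry standard.

```python
from typing import List, Protocol, Sequence


class Accelerator(Protocol):
    """Hypothetical common interface; not a real Arm or industry standard."""

    def run(self, data: Sequence[int]) -> List[int]: ...


class Doubler:
    """Example device that doubles every element."""

    def run(self, data):
        return [2 * x for x in data]


class Incrementer:
    """Example device that adds one to every element."""

    def run(self, data):
        return [x + 1 for x in data]


def offload(acc: Accelerator, data):
    """Host code depends only on the shared interface, never on one device."""
    return acc.run(data)
```

Calling `offload(Doubler(), data)` or `offload(Incrementer(), data)` needs no change to the host code, so the interface only has to be checked once rather than once per device.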
