To meet modern compute demands, accelerators have been busy speeding up processing in everything from machine learning to servers. These accelerators have succeeded in making compute faster – but chips are now struggling to move data fast enough to keep up. Andreas Gerstlauer, a Professor of Electrical and Computer Engineering specializing in embedded systems at The University of Texas at Austin, explains how his team has worked with Arm technology to clear the data bottleneck.
Embedded computing is all about efficiency. As researchers in the field, we work to very tight constraints around energy consumption and costs. We are always looking to optimize compute performance using accelerators, so that systems can run key tasks extremely well.
We have been very successful in accelerating compute, but this has created a knock-on problem: data has become a bottleneck, especially for data-intensive applications. We cannot move it quickly enough. The accelerator itself may be 10x or 100x faster, but once we include the data movement overhead, we may lose all the benefits.
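To see how quickly the benefits evaporate, consider a simple back-of-the-envelope model (the kernel and transfer times below are illustrative assumptions, not measurements from our work):

```python
# Back-of-the-envelope model of accelerator speedup eroded by data movement.
# All numbers below are illustrative assumptions, not measured results.

def effective_speedup(t_compute, accel_speedup, t_transfer):
    """End-to-end speedup once data movement to/from the accelerator is included."""
    t_accel = t_compute / accel_speedup + t_transfer
    return t_compute / t_accel

t_compute = 10.0   # ms to run the kernel on the CPU (assumed)
t_transfer = 8.0   # ms to move inputs and outputs to/from the accelerator (assumed)

for speedup in (10, 100):
    print(f"{speedup:>3}x accelerator -> "
          f"{effective_speedup(t_compute, speedup, t_transfer):.2f}x end to end")
#  10x accelerator -> 1.11x end to end
# 100x accelerator -> 1.23x end to end
```

Once the transfer time dominates, even a 100x accelerator buys barely more end-to-end speedup than a 10x one.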
We must find new ways to feed the data beast. That is where we have been working together with Arm – optimizing hardware-assisted data movement for heterogeneous system architectures.
“Thanks to the slowing of Moore’s Law, and the potential end of semiconductor scaling as we hit physical limits […] efficiency is no longer just a problem for specialist embedded computing. It is now an issue for general purpose computing, such as servers, too.”
Computing has always been fascinating to me. As a teenager, I used to build and sell PCs with my dad out of our home. In the old days, you had one CPU, some peripherals, and maybe one accelerator, max. Now you can have hundreds of these on one chip.
Thanks to the slowing of Moore’s Law, and the potential end of semiconductor scaling as we hit physical limits, transistors are getting to the range of a few atoms, which is incredible. Efficiency is no longer just a problem for specialist embedded computing. It is now an issue for general purpose computing, such as servers, too.
Developers in these areas have discovered they need to specialize their architectures more and more. Companies such as Meta and Google are bringing out specialized chips to run in their data centers for the types of workloads they typically experience, such as audio/video transcoding and machine learning.
Everyone needs to move to specialization and acceleration. The main challenge lies in moving data between the different components on a chip. The more you accelerate the compute, the more you have this problem of getting the data into the accelerator. Certain applications may be data-intensive without having a lot of compute relative to the amount of data they need to move. For example, some forms of machine learning today, including neural networks, are much more data-intensive than classical compute-dominated applications.
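One standard way to quantify this is arithmetic intensity: the number of operations performed per byte of data moved. The sketch below uses textbook first-order counts for two kernels (the element size and ideal-reuse assumptions are illustrative):

```python
# Arithmetic intensity (FLOPs per byte moved) for two textbook kernels,
# assuming 4-byte (float32) elements and first-order operand counts only.

BYTES_PER_ELEM = 4

def vector_add_intensity(n):
    flops = n                             # one add per element
    bytes_moved = 3 * n * BYTES_PER_ELEM  # read a, read b, write c
    return flops / bytes_moved

def matmul_intensity(n):
    flops = 2 * n**3                           # n^3 multiply-adds
    bytes_moved = 3 * n**2 * BYTES_PER_ELEM    # read A and B, write C (ideal reuse)
    return flops / bytes_moved

n = 1024
print(f"vector add: {vector_add_intensity(n):.3f} FLOP/byte")  # ~0.083: data-bound
print(f"matmul:     {matmul_intensity(n):.1f} FLOP/byte")      # ~170.7: compute-bound
```

A kernel doing a fraction of a floating-point operation per byte moved will be limited by data movement no matter how fast the compute engine is.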
Working with Arm technology was a natural choice. I have a lot of experience using Arm processors in my research, dating back to my PhD. I moved to Austin in 2008, when I became a professor here at The University of Texas at Austin, and there is a large research group from Arm in the area.
In 2020, Mochamad Asri, a PhD student who had received gift funding from Arm, worked with Arm researchers on near-memory and in-memory compute, using Arm CPUs. Here, the data sits in a particular memory, and we move the compute engine that is processing that data closer to the memory, rather than having it at the other end of the chip.
In this work, we were trying to develop concepts that could be applied in a wide variety of contexts, but we prototyped them in Arm environments. Arm is also the de facto market leader in the embedded space, as its processors were always designed to be efficient.
We work with a simulator called gem5. Arm contributes to the maintenance of gem5, but it is all open source, which suits our open research approach. There are no IP restrictions, non-disclosure agreements, or anything like that, and we can ask questions of the people who developed it.
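For readers unfamiliar with it: gem5 simulations are driven by Python configuration scripts. Below is a minimal sketch of a syscall-emulation (SE) mode script, loosely following gem5's learning-gem5 tutorial; class and port names change between gem5 versions, so treat it as illustrative rather than a drop-in script.

```python
# Minimal gem5 syscall-emulation (SE) config, loosely following the
# learning-gem5 tutorial. Names and ports vary across gem5 versions.
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="1GHz",
                                   voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("512MB")]

system.cpu = TimingSimpleCPU()   # simple in-order timing CPU model
system.membus = SystemXBar()     # system crossbar
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports
system.cpu.createInterruptController()

system.mem_ctrl = MemCtrl()
system.mem_ctrl.dram = DDR3_1600_8x8()
system.mem_ctrl.dram.range = system.mem_ranges[0]
system.mem_ctrl.port = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports

binary = "tests/test-progs/hello/bin/arm/linux/hello"  # path in the gem5 tree
system.workload = SEWorkload.init_compatible(binary)   # gem5 v21+ API
process = Process(cmd=[binary])
system.cpu.workload = process
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
event = m5.simulate()
print(f"Exited @ tick {m5.curTick()}: {event.getCause()}")
```

A script like this is run against an Arm build of the simulator, e.g. `build/ARM/gem5.opt <this script>`.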
The key here is to reduce the overhead. We have an application computing something on the CPU. We want to ship off some computation to the accelerator, then get the result back and process it on the CPU. So we go back and forth between CPU and accelerator.
This presents significant challenges, such as how to deal with dependencies. How can we make sure the CPU can keep running, and doing something useful, while the accelerator is running too?
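Purely as a software analogy (the thread pool and function names below are invented for illustration; our actual mechanisms are in hardware), the overlap problem looks like this: submit the offload, keep the CPU doing independent work, and block only when the dependency is actually needed.

```python
# Software analogy of CPU/accelerator overlap using a worker thread.
# The "accelerator" here is just a thread pool; in our research the
# offload and synchronization are hardware mechanisms, not futures.
from concurrent.futures import ThreadPoolExecutor

def accelerator_kernel(data):
    # stands in for work shipped to an accelerator
    return [x * x for x in data]

def independent_cpu_work():
    # work with no dependency on the offloaded result
    return sum(range(1_000_000))

with ThreadPoolExecutor(max_workers=1) as accel:
    future = accel.submit(accelerator_kernel, range(10))  # offload, non-blocking
    partial = independent_cpu_work()                      # CPU keeps running
    result = future.result()             # block only at the actual dependency
    print(partial, result[:3])
```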
“As researchers, it would be easy for us to maneuver ourselves into a completely unrealistic corner. Industrial experts can point out how things are not going to work, for reasons we did not even think about.”
We looked at mechanisms for a very lightweight offload of tasks to accelerators. Instead of using complex software mechanisms based on drivers, we built hardware mechanisms, integrating that support into the CPU itself.
This enables a very efficient offload of computations onto the accelerators, with very little overhead.
We published a paper in 2020 [1], showing how the new approach significantly reduced overhead compared to driver-based offload and boosted performance by up to 2.6x. We worked closely with Arm researchers and Arm technology on joint research projects and co-authored publications. It was a very natural and mutually beneficial collaboration.
As researchers, it would be easy for us to maneuver ourselves into a completely unrealistic corner. The industrial experts are the kind of people who eventually must turn this research into products, and they can give us a reality check. They can point out how things are not going to work, for reasons we did not even think about.
It has been decades since I started working in embedded compute, and it does not seem like we are running out of problems to solve. Examples of our ongoing work include researching fast, low-overhead hardware-assisted context switching on memory stalls, and virtualizing the register file.
The ending of Moore's Law is actually a boon for research, because Moore’s Law was in some ways a cause of laziness. As a researcher or practitioner, you could just wait another 18 months, and your system would get faster.
Now we must become more creative again, in how we architect these systems. Arm and others have realized that it is not just the embedded market that needs our expertise, but general compute too. It is a very interesting time for computing.
Andreas Gerstlauer is a Professor of Electrical and Computer Engineering at The University of Texas at Austin
Arm makes a wide range of IP and tools available at no charge for academic research use. To find out more, please visit our website.
Explore Research Enablement
[1] M. Asri, C. Dunham, R. Rusitoru, A. Gerstlauer, and J. Beard, "The Non-Uniform Compute Device (NUCD) Architecture for Lightweight Accelerator Offload," 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Västerås, Sweden, 2020, pp. 38-45, doi: 10.1109/PDP50117.2020.00013.