The answer is when it’s a computer. Deep neural networks (DNNs) and systolic arrays are bringing us closer to replicating complex human thinking in hardware. Hai ‘Helen’ Li and her Duke University students are working on ways to accelerate DNN compute in low-power edge devices. She tells us what that work entails, and why they chose Arm to help them.
We may not always be conscious of deep neural networks (DNNs), but they are all around us. The obvious example is ChatGPT, which uses a DNN to analyze and generate text. Another is autonomous cars, where companies rely heavily on neural networks for image processing and on LiDAR data to detect obstacles.
Right now, the performance and accuracy of DNNs are generally meeting people's expectations. But their execution relies on the analysis of a huge amount of high-quality data.
For example, GPUs are very powerful for neural network training and inference. But running GPU cores consumes a huge amount of power. Their fabrication takes time. And there’s a global shortage of GPUs, which makes them expensive. The industry is pushing up against these limitations right now.
Developers are desperately looking for innovative hardware solutions. We can't advance by following the current path. The question is how to accelerate execution to support the high demands of these workloads.
Our group has been using Arm IP to redesign the compute core so it can accelerate DNN operations. This will boost its applicability for a wide range of important use cases.
"We wanted to study whether systolic array could be applied to edge devices. We need to provide the fastest speed with the minimum power consumption, with the flexibility to meet the needs of different users."
We’re particularly interested in systolic arrays, a type of compute architecture in which data pulses rhythmically through a grid of simple processing elements, the way blood pulses through the heart (the name comes from ‘systole’). This isn’t a new concept. It was first described by H.T. Kung and Charles E. Leiserson in a 1979 paper. A machine called the Colossus Mark II used a similar technique as far back as 1944. In those days, the compute requirements of contemporary workloads were not that high. It was a beautiful idea, but it wasn’t practical for the time.
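To make that concrete, here is a minimal Python sketch (ours, not taken from any of the papers mentioned) of an output-stationary systolic array computing a matrix product, the core operation in DNN inference. The skewing scheme and variable names are illustrative choices, not a description of any particular chip.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle sketch of an output-stationary systolic array.

    Each cell (i, j) accumulates one element of C = A @ B. On every
    beat, a cell multiplies the A value arriving from its left
    neighbor by the B value arriving from above, adds the product to
    its local accumulator, and passes both values on. Inputs are
    skewed at the edges so the right operands meet in the right cell
    at the right time.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))          # one accumulator per cell
    a_reg = np.zeros((M, N))      # A values held in the cells this beat
    b_reg = np.zeros((M, N))      # B values held in the cells this beat
    for t in range(M + N + K - 2):         # beats until the array drains
        a_reg = np.roll(a_reg, 1, axis=1)  # values hop one cell right
        b_reg = np.roll(b_reg, 1, axis=0)  # values hop one cell down
        for i in range(M):                 # inject skewed A at the left edge
            k = t - i                      # row i starts i beats late
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0.0
        for j in range(N):                 # inject skewed B at the top edge
            k = t - j                      # column j starts j beats late
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0.0
        C += a_reg * b_reg                 # every cell MACs in parallel
    return C

A, B = np.random.rand(4, 3), np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The appeal for hardware is visible in the sketch: every operand moves only one hop per beat, so wires stay short, data gets reused across cells, and no global memory traffic is needed while the array is pumping.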
More recently, the technology was picked up by a team within Google, which used systolic arrays for its Tensor Processing Unit (TPU), the key driver behind the company’s DNN execution.
Our interest is different from Google's. Google focuses on large-batch data execution, with high performance and very fast response, all of which are critical in cloud server design. In our research, we wanted to study whether these systolic array ideas could be applied to edge devices, which have a limited battery supply. We need to provide the fastest speed with the minimum power consumption, with the flexibility to meet the needs of different users.
For example, an individual may want to use their phone for taking pictures, while the military may be looking to develop advanced instruments to track vehicles. Each scenario has different performance requirements in terms of precision, efficiency and response time. Here, we practice what’s known as hardware and software co-design: optimizing the algorithm or application for the hardware in question, and vice versa. The aim is to marry these two parts perfectly so they’re not in conflict, and so they can meet different needs.
I joined Duke in 2017. At the time, we did not have a lot of circuit design capability. We're researchers, not engineers. So, I started talking to Arm, asking if they could help. We desperately needed their libraries and core IP to facilitate our research. Arm soon gave us access to their IP for an initial chip fabrication and proof of concept. We also gained access to foundries and process design kits, and started work.
If this were a set of toy building blocks, the parts that would interest us most would be the smallest, most fundamental pieces. When building accelerators, we start with the accelerator core. We may then build 100 cores into an accelerator chip and integrate lots of chips together on a printed circuit board.
“Arm didn’t just hand us the libraries for us to use and then leave us to it. They also provided an ecosystem to help [*]… Our students could ask questions and learn according to their own needs.”
For this, we need standard library cells, such as AND and OR logic gates. We need memory cells, because systolic arrays are built around memory structures; we’ve been using Arm’s SRAM library. We also needed I/O libraries: because we were fabricating the chip, we needed to get data in and out of it to test its performance. These were the three most important components used in the project.
The relationship has been perfect so far. Arm’s library is very reliable.
But Arm didn’t just hand us the libraries for us to use and then leave us to it. They also provided an ecosystem to help [*]. We learned how to explain things to the students: how to select the right library, how to integrate the libraries into our environments, and how to optimize the designs. Arm provided a lot of tutorials and online courses, where the students could ask questions and learn according to their own needs. That accelerated the training process and was extremely helpful for us.
We published a paper on this research in 2021 [1]. It showed our accelerator delivering 26.7 TOPS/W within a 17.8 milliwatt peak power budget. We’re comparing our work with equivalents that consume more than 1,000 milliwatts, so we have far lower power consumption. And our energy efficiency, measured in tera operations per second per watt (TOPS/W), is higher. That precisely demonstrates our goal: minimizing power consumption while shipping data out faster.
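As a back-of-the-envelope check on what those headline numbers imply (our arithmetic, not a figure quoted from the paper), energy efficiency multiplied by the power budget gives the peak throughput:

```python
# Energy efficiency x power budget = peak throughput.
tops_per_watt = 26.7    # reported efficiency, TOPS/W
peak_power_w = 17.8e-3  # reported peak power, 17.8 mW
print(f"{tops_per_watt * peak_power_w:.3f} TOPS")  # ~0.475 TOPS (~475 GOPS)
```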
So far, we’ve been very successful in integrating computing and storage in small areas within the accelerator cores.
But we recently had a new chip come back, also built with Arm’s library, that has 32 cores. Our measurements there show that its latency is higher than our simulation results predicted. This could be because the network designs didn't meet our design specification. We’re still exploring whether that’s the case.
In designing how data moves from A to B, and then from B to C, we have to be careful. Just like on a highway, traffic is always an issue. You want to make sure every lane is fully occupied, but also that all the cars keep to the given speed and move to the correct spot precisely.
In this case, there could be some internal scheduling jams. We’re still trying to figure it out.
Our next fundamental challenge is data communication, and network-on-chip (NoC) designs that support communication between a range of cores. Our group has been working very hard on this. When we integrate one hundred cores on a single chip, the data can't just fly from one spot to another. It has to be routed along specific paths. And, again, there’s a large volume of data in transit.
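To illustrate what routing along specific paths means in practice, here is a minimal sketch of dimension-order (XY) routing on a 2D mesh, a common baseline NoC scheme; we are not claiming it is the scheme used in the chips described here.

```python
def xy_route(src, dst):
    """Dimension-order (XY) routing on a 2D mesh NoC.

    A packet first travels along the X axis until its column matches
    the destination, then along the Y axis. Deterministic and
    deadlock-free on a mesh, at the cost of zero path diversity:
    every source/destination pair always uses the same links.
    """
    x, y = src
    dx, dy = dst
    hops = [(x, y)]
    while x != dx:                  # move horizontally first
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                  # then vertically
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

# A packet crossing part of a mesh of cores:
print(xy_route((0, 0), (3, 2)))
# [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]
```

Because every source/destination pair always takes the same path under this scheme, shared links can saturate, which is one way the traffic jams described above can arise.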
This is an issue that is not going to go away.
Years ago, many questioned what people would be able to do with a phone. Now look at what we can do. Later, you may remember, there was an attempt at smart glasses. A fundamental challenge there is battery life and the functionality they can offer. We’re still working in that direction.
Hai ‘Helen’ Li is the Clare Boothe Luce Professor and Department Chair of the Electrical and Computer Engineering Department at Duke University.
Arm makes a wide range of IP and tools available at no charge for academic research use.
Explore Research Enablement
[1] Q. Yang and H. Li, "BitSystolic: A 26.7 TOPS/W 2b~8b NPU With Configurable Data Flows for Edge Devices," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 68, no. 3, pp. 1134–1145, March 2021, doi: 10.1109/TCSI.2020.3043778.
[*] The Arm Academic Access program offers online training seats to its members as well as peer support through SoC Labs: the new global academic community for Arm-based projects. SoC Labs provides a space for information exchange and mutual support from fellow academics and researchers for hardware and software developments around Arm IP.