Architecture simulators of various kinds are a key tool in the computer architecture toolbox. They provide a convenient model of real hardware, such as a CPU or even a whole System on Chip (SoC), at a level of abstraction that makes them faster and more flexible than low-level circuit simulation. This comes at the cost of some loss of accuracy, which must be traded off against speed and flexibility. Flexibility refers to the ability to modify the simulator to perform “what-if?” experiments on new features, or to understand the sensitivity of a performance metric to a certain architectural parameter, such as the size of a cache memory. The availability of high-quality simulators can really drive research progress, as they are such an effective tool for experimental work.
In the traditional world of CPUs, there are some very well-established simulators, such as the venerable Gem5. However, in the brave new world of neural network (NN) workloads, neural processing units (NPUs) are currently the focus of a huge amount of computer architecture research. Sadly, there are scant simulator options for those working on NPU architecture. This is a real limitation on architecture research, as NPU concepts are evolving rapidly, and the community is crying out for straightforward methodologies with which to evaluate them.
Enter SCALE-Sim: a simple architecture simulator, written in Python, that specifically targets NPUs. SCALE-Sim allows you to model the performance and power consumption of running an arbitrary NN workload on a parameterized NPU accelerator. The NPU that it simulates is based around a well-established idea known as a systolic array. Systolic arrays work really well for matrix multiplication operations, which are the main building block of most NNs, and have also been widely used in industry; for example, in the Google Tensor Processing Unit (TPU).
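To make that concrete, here is a minimal sketch of the first-order cycle count for a weight-stationary systolic array computing a matrix multiplication. This is our own illustration, not SCALE-Sim's actual code; the fold-based mapping and the fill/drain terms are simplifying assumptions.

```python
import math

def systolic_matmul_cycles(M, K, N, rows, cols):
    """Toy first-order cycle estimate for C[M,N] = A[M,K] @ B[K,N]
    on a rows x cols weight-stationary systolic array.

    The K x N weight matrix is tiled into rows x cols blocks
    ("folds"). For each fold: preload the weights (~rows cycles),
    then stream the M input rows through the pipelined array
    (~M + rows + cols - 2 cycles to fill, compute, and drain).
    """
    folds = math.ceil(K / rows) * math.ceil(N / cols)
    cycles_per_fold = rows + (M + rows + cols - 2)
    return folds * cycles_per_fold

# Once streaming, the array retires rows * cols MACs per cycle, so a
# large matmul keeps a 32x32 array well utilized:
print(systolic_matmul_cycles(M=256, K=256, N=256, rows=32, cols=32))
```

The key property is that the pipeline fill/drain overhead is amortized over the M streamed rows, which is why utilization is high exactly when the matrices are large relative to the array.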
The design modeled is widely parameterized, so it’s possible to explore small and large NPUs, as well as everything in between, targeting applications from IoT microcontrollers, through mobile devices, all the way up to datacenter computing. The simulator will run a wide range of NN workloads and report the runtime, along with data on utilization and on-chip and off-chip memory usage. In fact, the DRAM request traces it generates are a particularly cool feature, great for memory system research beyond the NPU itself.
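To give a flavor of how the simulator is driven, the sketch below shows the two inputs it consumes: an architecture config describing the array geometry, SRAM sizes (in KB), and dataflow, and a CSV topology file listing the workload one layer per row. The layout and key names here are illustrative, based on the public releases; the README in the GitHub repository is the authoritative reference for the version you are using.

```
# scale.cfg -- architecture parameters (illustrative)
[architecture_presets]
ArrayHeight : 32
ArrayWidth : 32
IfmapSramSz : 64
FilterSramSz : 64
OfmapSramSz : 64
Dataflow : ws

# topology.csv -- the NN workload, one layer per row (illustrative)
Layer name, IFMAP Height, IFMAP Width, Filter Height, Filter Width, Channels, Num Filter, Strides,
Conv1, 224, 224, 7, 7, 3, 64, 2,
Conv2, 56, 56, 3, 3, 64, 64, 1,
```

From these two files, the simulator produces the per-layer runtimes, utilization figures, on-chip memory usage, and DRAM request traces described above.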
Figure 1: Overview of the SCALE-Sim NPU simulator
The really exciting part is that SCALE-Sim is open source. Yes, that is right, you can use it to evaluate your work too. For ML researchers, SCALE-Sim allows you to estimate the throughput (inferences per second) and efficiency (inferences per Watt) that your model can achieve. For computer architecture researchers, it provides deep insight into the sensitivity of performance to architectural parameters over a range of models. Finally, for computer systems researchers, it provides full DRAM request traces to help understand memory system requirements for NPUs. Figure 1 gives an illustration of the simulator; check out this paper for more details. We are also looking for collaborators to contribute to the project; check out the GitHub repository for details.
SCALE-Sim was developed primarily by Georgia Tech PhD student Ananda Samajdar, during his internship at the Arm ML Research Lab in Boston. Since his internship, Ananda has used SCALE-Sim in his ongoing PhD work, in collaboration with Arm Research. As a result, he will be presenting a paper at the 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) this summer that extensively leverages SCALE-Sim.
The paper is entitled “A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim”. The main goal of the work is to better understand the optimal design of NPUs. The essential question it addresses is whether it is better to have one big NPU (scale-up) or numerous smaller ones (scale-out). This may sound like a simple question, but it is challenging to answer convincingly, as it requires analyzing a range of different neural networks with cycle-accurate simulation. This is where SCALE-Sim comes in.
We simulated NPUs for both scale-up and scale-out systems, capturing on-chip memory accesses, runtime, and DRAM bandwidth requirements for each workload. Using these simulations, we were able to construct an analytical model to estimate the optimal scale-up versus scale-out strategy for a given set of hardware constraints.
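As a toy illustration of the kind of trade-off the analytical model captures (this is our simplified stand-in, not the model from the paper), the sketch below compares one big array against four smaller ones built from the same MAC budget, reusing the first-order cycle estimate from earlier. The scale-out case splits the M (input row) dimension across the partitions, which then run in parallel.

```python
import math

def systolic_matmul_cycles(M, K, N, rows, cols):
    """Toy first-order cycle estimate (same model as the sketch above)."""
    folds = math.ceil(K / rows) * math.ceil(N / cols)
    return folds * (rows + M + rows + cols - 2)

def scale_up_vs_scale_out(M, K, N, total_macs, num_parts):
    """Compare one big square array against num_parts smaller square
    arrays with the same total number of MACs (illustrative only)."""
    big = int(math.sqrt(total_macs))                # e.g. one 128x128
    small = int(math.sqrt(total_macs / num_parts))  # e.g. four 64x64
    up_cycles = systolic_matmul_cycles(M, K, N, big, big)
    # Scale-out: each partition handles M / num_parts input rows, and
    # all partitions run in parallel on independent output tiles.
    out_cycles = systolic_matmul_cycles(math.ceil(M / num_parts),
                                        K, N, small, small)
    return up_cycles, out_cycles

# A layer whose output dimension (N = 64) is narrower than a 128x128
# array leaves half the columns idle; four 64x64 arrays do not:
up, out = scale_up_vs_scale_out(M=512, K=512, N=64,
                                total_macs=128 * 128, num_parts=4)
print(f"scale-up: {up} cycles, scale-out: {out} cycles")  # out wins
```

The real analysis is considerably richer, since it must account for the dataflow, SRAM capacity, and the DRAM bandwidth each configuration demands, which is where the full simulations come in.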
As a quick example, Figure 2 shows the kind of results produced from this work, enabled by SCALE-Sim. It shows the runtime for various types of layers in two important NN tasks: image classification and language modeling. The runtime is shown as the improvement of the scale-out strategy relative to the scale-up approach. As we increase the size of the NPU by increasing the number of hardware MAC units (shown in the plots by the colored bars), the relative gains for scale-out increase. This is essentially a result of a drop in utilization for large scaled-up NPUs. In fact, we found that the optimal choice of scaling strategy can lead to performance improvements as high as 50x for some models, within the available DRAM bandwidth.
Figure 2: Ratio of runtimes in best scaled-up NPU configuration vs best scaled-out (partitioned) configuration for various layers in (left) image classification and (right) language model tasks, for a range of MAC unit values.
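To see where that utilization drop comes from, here is a compact, self-contained sweep using the same toy cycle model as above (again ours, not the paper's; the numbers in Figure 2 come from full SCALE-Sim runs). Once the single big array outgrows the layer's weight dimensions, the extra MACs sit idle, and the partitioned configuration pulls ahead.

```python
import math

# Condensed toy model from above: cycles on one rows x cols array.
def cycles(M, K, N, rows, cols):
    folds = math.ceil(K / rows) * math.ceil(N / cols)
    return folds * (rows + M + rows + cols - 2)

# A layer with small weight dimensions (K = N = 64) but a long
# activation stream (M = 4096), swept over growing MAC budgets:
for macs in (32 * 32, 64 * 64, 128 * 128, 256 * 256):
    big = int(math.sqrt(macs))  # one big array (scale-up)
    small = big // 2            # four arrays, same MAC total (scale-out)
    up = cycles(4096, 64, 64, big, big)
    out = cycles(4096 // 4, 64, 64, small, small)  # rows split 4 ways
    print(f"{macs:6d} MACs: scale-up / scale-out = {up / out:.2f}x")
```

In this toy setting, the two strategies are roughly on par until the array dimensions exceed K and N, at which point scale-out wins by around 3.5x. The real crossover points depend on the workload and dataflow, which is exactly what the simulations in the paper map out.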
The paper is co-authored by long-time collaborator Prof. Yuhao Zhu, Jan Moritz Joseph (Georgia Tech and Otto-von-Guericke University Magdeburg), Prof. Tushar Krishna (Ananda’s PhD advisor, of the Synergy Lab at Georgia Tech), Matthew Mattina (Sr. Director of ML Research at Arm), and myself. The ISPASS conference will be virtual this year, but please do get in touch if you’d like to know more or discuss this work. And please check out SCALE-Sim for your own research projects.
Read the Full Paper
Contact Paul Whatmough