Training DNN IoT Applications for Deployment on Analog NVM Crossbars

Fernando García Redondo
August 3, 2020
9 minute read time.

We are experiencing a transformation in both how and where computing happens, with the myriad of sensors surrounding us changing the way information is processed and consumed. Driven by this paradigm shift, the growing TinyML community is extensively exploring computing at the edge, with research efforts across both academia and industry. At the very heart of this revolution we find efficient, low-power signal processors and ML accelerators, processing the information sent by the network of sensors 24/7. This ubiquitous, always-ON computing enables intelligent sensors that compress all the available information in the most efficient way.

Figure 1: A battery-powered 24/7 system monitoring patient health

Let us consider the scenario in Figure 1, where a battery-powered system constantly monitors the health of a patient. A low-power ML accelerator processes the information from the bio-sensor network, and the moment it detects an anomaly, it wakes up a high-end processor to analyze it further, or even alert a doctor.

The key challenge behind efficient, always-ON, low-power accelerators running in constrained devices is data movement: how to efficiently move heavy data inputs and kernel weights to and from the different memories in the system. Three main solutions have been proposed to overcome this data-wall problem.

Figure 2: Algorithm optimization techniques, Whatmough et al., ISSCC 2017. 3D architecture integration, Shulaker et al., 2017. Devices and technology comparison, Western Digital press release, 2016.

First, we could explore new algorithmic solutions that take advantage of sparsity in the network, reduce the precision of operands, or prune kernels. Second, we could re-engineer the whole accelerator architecture, moving the memories closer to the logic, where computation traditionally happens. And third, we could make use of emerging technologies and architectures, such as resistive crossbars, for Computation in Memory.

Computation in Memory (CIM) Using Resistive Crossbars

CIM radically changes where and how data is stored and processed: computation takes place in the memory elements themselves. By removing most of the memory transactions and reducing the power consumed by compute operations, we avoid the energy and extra time required for costly data movements and linear algebra operations.

Within the EU-funded MNEMOSENE project, we explore how CIM using emerging Non-Volatile Memory (NVM) resistive crossbars leads to significant energy and efficiency improvements when accelerating well-known kernels. In particular, in the IJCNN article, Training DNN IoT Applications for Deployment on Analog NVM Crossbars, we explore how to efficiently train Deep Neural Networks (DNNs) to overcome common challenges when accelerating Multiply-Accumulate (MAC) operations in NVM crossbars.

Figure 3: Computation-in-memory using programmable resistive devices: the vector-matrix multiplication outputs appear at the bitlines in the form of currents. The matrix is encoded as conductances in the crossbar array, while the input vector is fed into the crossbar as voltages.

The acceleration of the MAC or vector-matrix operation (arguably the most common operation in ML and DSP kernels) occurs naturally thanks to Kirchhoff's current law at the bitlines of the crossbar. With the input operand encoded as voltages, and the matrix components encoded as conductances in the programmable resistive devices, the current flowing through each of the bottom paths translates to:

$$I_j = \sum_i V_i \cdot G_{ij}$$

This means that through a very low-power operation we obtain a full vector-matrix multiplication in a single parallel step. The speed, high density, and low-power profile of this analog primitive make it extremely well suited for inference at the edge. The basic primitive is composed of the NVM crossbar interfaced by a Digital-to-Analog Converter (DAC) and an Analog-to-Digital Converter (ADC). For simplicity, imagine that each layer of the NN can be unrolled and deployed in a single crossbar. We then digitally interconnect different crossbars in the same way that we interconnect different layers in the NN.
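To make this primitive concrete, here is a minimal NumPy sketch of an ideal crossbar (our own illustrative code and names, not from the paper): quantized input voltages multiply a quantized conductance matrix, the bitline currents implement the dot products, and a uniform quantizer stands in for the ADC. Device non-idealities such as variation and noise are deliberately ignored.

```python
import numpy as np

def quantize(x, n_bits, x_min, x_max):
    """Uniformly quantize x onto 2**n_bits levels in [x_min, x_max]."""
    levels = 2 ** n_bits - 1
    step = (x_max - x_min) / levels
    x = np.clip(x, x_min, x_max)
    return np.round((x - x_min) / step) * step + x_min

def crossbar_vmm(v_in, g, adc_bits=8, i_max=None):
    """Ideal crossbar MAC: bitline current I_j = sum_i V_i * G_ij, then ADC."""
    i_out = v_in @ g              # Kirchhoff's current law at each bitline
    if i_max is None:             # fixed by the ADC design in real hardware;
        i_max = len(v_in)         # here: worst case with v, g in [0, 1]
    return quantize(i_out, adc_bits, 0.0, i_max)

# Toy example: 4 inputs (DAC voltages), 3 bitlines, 4-bit conductances.
rng = np.random.default_rng(0)
g = quantize(rng.uniform(0.0, 1.0, (4, 3)), n_bits=4, x_min=0.0, x_max=1.0)
v = quantize(rng.uniform(0.0, 1.0, 4), n_bits=8, x_min=0.0, x_max=1.0)
print(crossbar_vmm(v, g))         # approximately v @ g, up to ADC quantization
```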

Figure 4: Deployment of a DNN: layers are mapped to individual crossbars, digitally interconnected to form the network. On the right-hand side, deployment of negative weights in the resistive elements.

Challenges for Resistive Crossbar CIM Applications

There are several challenges with this approach, related to the immaturity of the technologies, such as variations, temperature dependence, or noise intrinsic to the analog nature of the computation. However, a proper training scheme including appropriate ‘noisy’ elements allows us to overcome these problems.

In this work, we focus on the challenges related to the architecture design that are intrinsic to resistive crossbars: low precision, non-uniform signal ranges, and bipolar weight deployment.

Low precision. The analog accelerator works with a quantized set of analog signals: the weights are quantized as conductances, the inputs are encoded as quantized voltages, and the currents are digitized based on quantized steps.

Non-uniform dynamic ranges. Deploying the many filters of a single convolutional layer of a CNN means mapping different weight matrices to the same set of conductances. Consequently, there is a disparity in the currents accumulated in the bitlines, and the ADC requires appropriate scaling before the analog-to-digital conversion occurs. However, by finely tuning the periphery surrounding each crossbar, we lose the ability to reconfigure our accelerator for a different application.

Deployment of bipolar weights. When mapping the weight matrix, we need to map both positive and negative values to a limited set of strictly positive conductances. To emulate the negative weights, we traditionally use a second column that computes the negative contributions, which are later subtracted. This approach doubles the crossbar area and its power consumption.
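The following sketch illustrates this traditional differential scheme (illustrative code, not the paper's): each signed weight is split between a positive and a negative column, and the two bitline currents are subtracted.

```python
import numpy as np

def map_bipolar(w):
    """Split a signed weight matrix across two positive-only conductance arrays."""
    g_pos = np.maximum(w, 0.0)    # holds the positive contributions
    g_neg = np.maximum(-w, 0.0)   # holds the magnitudes of negative weights
    return g_pos, g_neg           # note: twice the columns, twice the power

def vmm_bipolar(v, g_pos, g_neg):
    """MAC with bipolar weights: subtract the negative-column currents."""
    return v @ g_pos - v @ g_neg

w = np.array([[0.5, -0.25], [-0.75, 1.0]])
v = np.array([1.0, 0.5])
g_pos, g_neg = map_bipolar(w)
assert np.allclose(vmm_bipolar(v, g_pos, g_neg), v @ w)
```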

Training DNN IoT Applications for Deployment on Analog NVM Crossbars

To optimize the deployment of NNs on resistive crossbars while ensuring reconfigurability, we propose a quantized training algorithm that takes the underlying analog HW characteristics into account. It also trains the DNN with uniform dynamic ranges across layers, enabling a reconfigurable deployment.

Figure 5: Top: traditional HW implementation of per-layer/per-filter highly tuned periphery. Bottom: proposed scheme where the periphery is shared, saving area.

By ensuring dynamic range uniformity, we are able to multiplex and share the area- and energy-expensive DAC/ADC periphery.

To achieve uniform dynamic ranges in the circuitry, we ensure (quantized) uniform ranges for the DNN activations and weights. To do so, we alter the DNN training graph, introducing quantization elements in the forward stages of both weights and activations. This scheme is widely used in quantization-aware training, with one peculiarity: instead of defining the ranges of the quantized variables per layer, or even per filter, we constrain them globally, so that the sets of input (X) and output (Y) activations, weights (W), and biases (B) are shared by every hidden layer in the NN.

The X, Y, W, and B sets of values are computed given a pre-set precision along with the boundary limits (x0, x1, and so on). These are computed differentially through the global variable control operations, based on the characteristics of the hidden layers.
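As a rough PyTorch-style sketch of this idea (our simplification; the ranges and bit-widths below are illustrative, not the paper's values), a fake-quantization function with a Straight-Through Estimator (STE) can share one global range across all layers:

```python
import torch

def fake_quant_ste(x, n_bits, x0, x1):
    """Quantize on the forward pass; let gradients pass straight through."""
    levels = 2 ** n_bits - 1
    step = (x1 - x0) / levels
    xq = torch.round((torch.clamp(x, x0, x1) - x0) / step) * step + x0
    return x + (xq - x).detach()   # STE: forward sees xq, backward sees x

# One global range per variable set, shared by every hidden layer,
# instead of per-layer or per-filter ranges (values are illustrative):
W_RANGE = (-1.0, 1.0)   # weights (W) and biases (B)
A_RANGE = (0.0, 4.0)    # input (X) / output (Y) activations

def forward_layer(x, weight, w_bits=4, a_bits=8):
    wq = fake_quant_ste(weight, w_bits, *W_RANGE)
    y = torch.relu(x @ wq)
    return fake_quant_ste(y, a_bits, *A_RANGE)
```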

To aid convergence during the training stage, custom losses are added to the graph that dynamically (depending on the epoch) penalize deviations from our targets.
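One plausible shape for such a regularizer (our formulation, not necessarily the paper's exact loss) penalizes values falling outside the shared target range, with a coefficient that ramps up over the epochs so that early training can explore freely:

```python
import torch

def range_loss(tensor, lo, hi, epoch, max_epochs, base_weight=1e-3):
    """Epoch-dependent penalty on values outside the target range [lo, hi]."""
    ramp = min(1.0, epoch / max_epochs)               # grows from 0 to 1
    overflow = torch.relu(tensor - hi) + torch.relu(lo - tensor)
    return base_weight * ramp * overflow.pow(2).mean()
```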

Figure 6: Overview of the proposed training graph, including traditional quantization methods using Straight-Through Estimator (STE) blocks, global variables and control to handle uniform dynamic ranges, and custom regularizers.

With uniform dynamic ranges, we can explore the possibility of having just positive weight matrices, solving the problem introduced by the deployment of bipolar weights.

By adding an extra constraint to the training stage, we further limit the range of weights available to train the NN, forcing them to be positive. To aid network convergence during training, we propose an extra loss term that progressively shifts the weight matrix, penalizing the existence of negative values as the training epochs progress:
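The paper's exact formulation is not reproduced here; one plausible form for such a progressive penalty (our notation) is

$$\mathcal{L}_{\text{unipolar}} = \lambda(e) \sum_{w \in W} \max(0, -w),$$

where the coefficient $\lambda(e)$ grows with the epoch $e$, so negative weights are tolerated early in training and progressively driven out.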

At the end of the process, we end up with a weight matrix containing only positive values, which greatly simplifies the circuitry required to deploy the weights.
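For concreteness, the progressive penalty sketched above could be implemented as follows (again our formulation, not the paper's code):

```python
import torch

def unipolar_loss(weight, epoch, max_epochs, base_weight=1e-2):
    """Progressive penalty on negative weights: lambda(e) * sum(max(0, -w))."""
    lam = base_weight * (epoch / max_epochs)   # ramps up across epochs
    return lam * torch.relu(-weight).sum()
```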

Human Activity Recognition and Image Classification Examples

We consider different representative examples of inference-at-the-edge applications:

  1. An image classification problem (CIFAR10), with a simple CNN composed of six convolutional layers, and,
  2. A Human Activity Recognition (HAR) problem, with a Fully Connected (FC) network classifying the activity based on accelerometer and magnetic-sensor data coming from just one limb (simulating a smart watch).

Accuracy. We consider the case of a PCM NVM element achieving up to 16 different conductance levels, or 4-bit weights. We evaluate our proposal, which ensures uniform dynamic ranges and reconfigurability, against traditional quantized methods (STE), and observe a negligible loss of accuracy in both applications.

Figure 7: Left: CIFAR10 accuracy evolution using standard STE approaches (8-bit and 4-bit) and the proposed one (4-bit). Right: HAR accuracy varying weight/activation precision.

More interesting are the results of training the NN with unipolar weights. In the case of the smaller HAR FC network, we can see how, even with very low precision in the weights, and therefore in the NVM elements, the unipolar weight matrix approach is able to deliver extremely high accuracy.

Figure 8: Left: HAR accuracy varying weight/activation precision, comparing standard STE against the proposed bipolar/unipolar methods. Right: CIFAR10 accuracy vs. the percentage of channels forced to be unipolar, for different quantization methods.

Larger CNNs exhibit a trade-off between the accuracy achieved and the degree of unipolarity. We analyze the impact of constraining a certain percentage of the filters in the convolutional layers to unipolar devices. The larger this percentage, the more noticeable the improvements in area and energy, but of course there is a trade-off with the final classification accuracy. Competitive accuracies can be achieved by forcing 40% of the filters to be unipolar, while still ensuring uniform dynamic ranges across the NN. This experiment shows that there is a trade-off between PPA and the accuracy delivered by this kind of accelerator.

Area and Energy Benefits. We compute the energy and area per inference for both the CIFAR10 and HAR applications, considering PCM as the NVM element and custom-designed 4-bit and 8-bit ADCs and DACs (in 55 nm technology). Thanks to sharing the periphery resources, we see area savings of up to 80% in the case of CIFAR10, and up to a 55% reduction in energy consumption in HAR, compared to traditional schemes.

Conclusions

Today's computing is shifting towards the edge, where power-efficient signal processors and ML accelerators take advantage of the swarm of sensors surrounding us. Analog accelerators have captured the attention of the research community as enablers of efficient processing systems, but their intrinsic challenges are still to be addressed. In our IJCNN 2020 conference paper, we establish initial mechanisms to train a DNN to be aware of the analog hardware it is going to be deployed on. These mechanisms minimize the reconfigurability limitations caused by the disparity of dynamic ranges in the electrical domain, and we propose a method to achieve totally or partially unipolar weight matrices, with considerable area and energy benefits.

Please do get in touch if you have any questions.

Contact Fernando Garcia Redondo

Read the full paper
