We are experiencing a transformation in both how and where computing happens: the myriad of sensors surrounding us is changing the way information is processed and consumed. Driven by this paradigm shift, the growing TinyML community is extensively exploring computing at the edge, with research efforts across both academia and industry. At the heart of this revolution we find efficient, low-power signal processors and ML accelerators, processing the information sent by the network of sensors 24/7. This ubiquitous, always-ON computing enables intelligent sensors that compress the available information as efficiently as possible.
Figure 1: A battery-powered 24/7 system monitoring patient health
Let us consider the scenario in Figure 1, where a battery-powered system constantly monitors the health of a patient. A low-power ML accelerator processes the information from the bio-sensor network, and the moment it detects an anomaly, it wakes up a high-end processor to analyze it further, or even alert a doctor.
The key challenge behind efficient, always-ON, low-power accelerators running on constrained devices is data movement: how to efficiently move heavy data inputs and kernel weights to and from the different memories in the system. To overcome this 'data wall' problem, three main solutions have been proposed.
Figure 2: Algorithm optimization techniques, Whatmough et al., ISSCC 2017. 3D architecture integration, Shulaker et al., 2017. Devices and technology comparison, Western Digital press release, 2016.
First, we could explore new algorithmic solutions that take advantage of sparsity in the network, reduce the precision of operands, or prune kernels. Second, we could re-engineer the whole accelerator architecture, moving the memories closer to the logic, where computation traditionally happens. And third, we could make use of emerging technologies and architectures, such as resistive crossbars, for Computation in Memory (CIM).
CIM radically changes where and how data are stored and processed: computation takes place in the memory elements themselves. By eliminating most memory transactions and reducing the power consumed by the compute operations, we alleviate the energy and extra time required by costly data movement and linear algebra operations.
Within the EU-funded MNEMOSENE project, we explore how CIM, using emerging Non-Volatile Memory (NVM) resistive crossbars, leads to significant energy and efficiency improvements when accelerating well-known kernels. In particular, in the IJCNN article, Training DNN IoT Applications for Deployment on Analog NVM Crossbars, we explore how to efficiently train Deep Neural Networks (DNNs) to overcome common challenges when accelerating Multiply-Accumulate (MAC) operations in NVM crossbars.
Figure 3: Computation-in-memory using programmable resistive devices: the vector-matrix multiplication outputs appear at the bitlines in the form of currents. The matrix is encoded as conductances in the crossbar array, while the input vector is fed into the crossbar as voltages.
The acceleration of the MAC or vector-matrix operation (arguably the most common operation in ML and DSP kernels) occurs naturally thanks to Kirchhoff's Current Law at the bitlines of the crossbar. With the input operands encoded as voltages, and the matrix components encoded as conductances in the programmable resistive devices, the current flowing through each bitline translates to:
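(Written out as an equation, with $V_i$ the input voltages applied to the rows and $G_{i,j}$ the conductances programmed into the devices of bitline $j$:)

$$I_j \;=\; \sum_i G_{i,j}\,V_i$$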
This means that, through a single very low-power operation, we obtain a full vector-matrix multiplication. The speed, high density, and low-power profile of this analog primitive make it extremely well suited for inference at the edge. The basic primitive is composed of the NVM crossbar interfaced by a Digital-to-Analog Converter (DAC) and an Analog-to-Digital Converter (ADC). For simplicity, imagine that each layer of the NN can be unrolled and deployed in a single crossbar. We then digitally interconnect the different crossbars in the same way that we interconnect the different layers of the NN.
Figure 4: Deployment of a DNN: layers are mapped to individual crossbars, digitally interconnected to form the network. On the right-hand side, the deployment of negative weights in the resistive elements.
There are several challenges with this approach related to the immaturity of the technologies, such as device variations, temperature dependence, or the noise intrinsic to the analog nature of the computation. However, a proper training scheme that includes appropriate 'noisy' elements allows us to overcome these problems.
In this work, we focus on the challenges intrinsic to the architecture design of resistive crossbars: low precision, non-uniform signal ranges, and the deployment of bipolar weights.
Low precision. The analog accelerator works with a quantized set of analog signals: the weights are quantized as conductances, the inputs are encoded as quantized voltages, and the currents are digitized based on quantized steps.
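As an illustration of these three quantization points, the following minimal NumPy sketch emulates one crossbar MAC with a quantized DAC, quantized conductance levels, and a quantized ADC. It is not the paper's simulator; the function names, signal ranges, and bit-widths are illustrative assumptions only.

```python
# Minimal sketch of one analog MAC with quantized interfaces.
# All names, ranges, and bit-widths are illustrative assumptions.
import numpy as np

def quantize(x, lo, hi, bits):
    """Uniformly quantize x onto 2**bits levels spanning [lo, hi]."""
    step = (hi - lo) / (2 ** bits - 1)
    return lo + step * np.clip(np.round((x - lo) / step), 0, 2 ** bits - 1)

def crossbar_mac(v_in, g, dac_bits=8, w_bits=4, adc_bits=8, i_full_scale=1.0):
    """Bitline currents I_j = sum_i G_ij * V_i, with DAC/ADC quantization."""
    v_q = quantize(v_in, 0.0, 1.0, dac_bits)   # inputs encoded as quantized voltages
    g_q = quantize(g, 0.0, 1.0, w_bits)        # weights as discrete conductance levels
    i_out = v_q @ g_q                          # analog accumulation (Kirchhoff's law)
    return quantize(i_out, 0.0, i_full_scale, adc_bits)  # currents digitized by the ADC

v = np.random.rand(128)       # one input activation vector
G = np.random.rand(128, 64)   # one layer's weight matrix mapped to conductances
y = crossbar_mac(v, G, i_full_scale=128.0)  # ADC full scale matched to worst-case current
```

Note that the ADC full scale has to be matched to the range of currents expected on the bitlines, which is exactly the dynamic-range issue discussed next.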
Non-uniform dynamic ranges. The deployment of the many filters of a single convolutional layer in a CNN implies that we are mapping different weight matrices to the same set of conductances. Consequently, there is a disparity in the currents accumulated in the bitlines, and the ADC requires appropriate scaling before the analog-to-digital conversion occurs. If we finely tune the periphery surrounding the crossbar, however, we lose the ability to reconfigure our accelerator for a different application.
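Concretely, assuming input voltages in $[0, V_{\max}]$, the full-scale current that each ADC must resolve follows directly from the bitline equation above:

$$I_j^{\max} \;=\; V_{\max}\sum_i G_{i,j}$$

which depends on the particular weights mapped to that column: a column holding large conductances needs a different ADC scaling than one holding small conductances.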
Deployment of bipolar weights. When mapping the weight matrix, we need to map both positive and negative values to a limited set of only positive conductances. To emulate the negative weights, we traditionally use a second column computing the negative contributions, which are later subtracted. This approach doubles the area on the crossbar and its power consumption.
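In other words, each signed weight is traditionally realized differentially by a pair of devices or columns, and the signed contribution is recovered by subtraction at the periphery:

$$w_{i,j} \;\propto\; G^{+}_{i,j} - G^{-}_{i,j}$$

which is why both the crossbar area and the power consumption roughly double.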
To optimize the deployment of NNs in resistive crossbars while ensuring reconfigurability, we propose a quantized training algorithm that takes the underlying analog HW characteristics into account. It also trains the DNN to have uniform dynamic ranges across layers, enabling a reconfigurable deployment.
Figure 5: Top: Traditional HW implementation of per-layer/per-filter highly tuned periphery. Bottom: proposed scheme where the periphery is shared, saving area.
By ensuring dynamic range uniformity, we are able to multiplex and share the area- and energy-expensive DAC/ADC periphery.
To achieve uniform dynamic ranges in the circuitry, we ensure (quantized) uniform ranges for the DNN activations and weights. To do this, we alter the DNN training graph, introducing quantization elements in the forward stages of both weights and activations. This scheme is widely used in quantization-aware training, but here it has the peculiarity that, instead of defining the ranges of the quantized variables per layer, or even per filter, it constrains them globally: the set of input (X)/output (Y) activations, weights (W), and biases (B) is shared by every hidden layer in the NN.
The X, Y, W, and B sets of values are computed from a pre-set precision together with the boundary limits (x0, x1, and so on). These limits are computed dynamically by the global-variable control operations, based on the characteristics of the hidden layers.
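A minimal TensorFlow-style sketch of this idea is shown below. It is not the paper's implementation: the variable names, ranges, and the choice of a dense layer are assumptions for illustration. The two key points are the straight-through estimator (quantize in the forward pass, pass gradients through unchanged in the backward pass) and the fact that the range variables are global, shared by every hidden layer rather than defined per layer or per filter.

```python
# Illustrative sketch only; names and ranges are assumptions, not the paper's code.
import tensorflow as tf

def ste_quantize(x, lo, hi, bits):
    """Forward: uniform quantization onto 2**bits levels in [lo, hi].
    Backward: straight-through estimator (gradients pass as identity)."""
    step = (hi - lo) / (2.0 ** bits - 1.0)
    x_q = tf.round((tf.clip_by_value(x, lo, hi) - lo) / step) * step + lo
    return x + tf.stop_gradient(x_q - x)

# Global range variables, shared by EVERY hidden layer (cf. the X, Y, W, B sets above).
w_lo, w_hi = tf.Variable(-1.0), tf.Variable(1.0)   # weight range -> conductances
a_lo, a_hi = tf.Variable(0.0),  tf.Variable(6.0)   # activation range -> voltages/currents

def quantized_dense(x, w, b, w_bits=4, a_bits=8):
    """One hidden layer quantized against the shared global ranges."""
    w_q = ste_quantize(w, w_lo, w_hi, w_bits)
    y = tf.matmul(x, w_q) + b
    return ste_quantize(tf.nn.relu(y), a_lo, a_hi, a_bits)
```

Because every layer is quantized against the same ranges, the DAC/ADC periphery can be designed once and shared, as described below.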
To aid convergence during the training stage, custom losses are added to the graph that dynamically (depending on the epoch) penalize deviations from our targets.
Figure 6: Overview of the proposed training graph, including traditional quantization methods using Straight Through Estimator (STE) blocks, global variables and control to handle uniform dynamic ranges, and custom regularizers.
With uniform dynamic ranges, we can explore the possibility of having just positive weight matrices, solving the problem introduced by the deployment of bipolar weights.
By adding an extra constraint to the training stage, we further limit the range of weights available to train the NN with, forcing them to be positive. To aid network convergence during training, we propose an extra loss term that progressively shifts the weight matrix, penalizing the existence of negative values over the training epochs:
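(One plausible formulation, for illustration; the exact term is defined in the paper:)

$$\mathcal{L}_{\text{unipolar}}(e) \;=\; \lambda(e)\sum_{w\in W}\max(0,\,-w)^{2}$$

Here $\lambda(e)$ is a coefficient that grows with the training epoch $e$, so negative values are tolerated early in training and progressively penalized as training advances.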
At the end of the process, we end up with a weight matrix containing only positive values, which greatly simplifies the circuitry required for weight deployment.
We consider two representative inference-at-the-edge applications: image classification on CIFAR10 using a CNN, and Human Activity Recognition (HAR) using a smaller fully-connected (FC) network.
Accuracy. We consider the case of a PCM NVM element achieving up to 16 different conductance levels, that is, 4-bit weights. We evaluate our proposal, which ensures uniform dynamic ranges and reconfigurability, against traditional quantized training methods (STE), observing a negligible loss of accuracy in both applications.
Figure 7: Left: CIFAR10 accuracy evolution using standard STE approaches (8-bit and 4-bit) and the proposed one (4-bit). Right: HAR accuracy varying weight/activation precision.
More interesting are the results of training the NN with unipolar weights. In the case of the smaller HAR FC network, we can see how, even with very low precision in the weights, and therefore in the NVM elements, the unipolar weight matrix approach is able to deliver extremely high accuracy.
Figure 8: Left: HAR accuracy varying weight/activation precision, comparing standard STE against the proposed bipolar/unipolar methods. Right: CIFAR10 accuracy vs. percentage of channels forced to be unipolar, for different quantization methods.
Larger CNNs exhibit a trade-off between the accuracy achieved and the degree of unipolarity. We analyze the impact of constraining a certain percentage of the filters in the convolutional layers to unipolar devices. The larger this percentage, the more noticeable the improvements in area and energy, but of course there is a trade-off with the final classification accuracy. Competitive accuracies can be achieved by forcing 40% of the filters to be unipolar, while still ensuring uniform dynamic ranges across the NN. This experiment shows that there is a trade-off between the PPA (power, performance, and area) and the accuracy delivered by this kind of accelerator.
Area and Energy Benefits. We compute the energy and area per inference for both the CIFAR10 and HAR applications, considering PCM as the NVM element and custom-designed 4-bit and 8-bit ADCs and DACs (in 55nm technology). Thanks to sharing the periphery resources, we see an area saving of up to 80% in the case of CIFAR10, and up to a 55% reduction in energy consumption for HAR, compared to traditional schemes.
Today's computing is shifting towards the edge, where power-efficient signal and ML accelerators take advantage of the swarm of sensors surrounding us. Analog accelerators have captured the attention of the research community as enablers of efficient processing systems, but their intrinsic challenges are still to be addressed. In our IJCNN 2020 conference paper, we establish initial mechanisms to train a DNN to be aware of the analog hardware it is going to be deployed on. These minimize the reconfigurability limitations caused by the disparity of dynamic ranges in the electrical domain, and we propose a mechanism to achieve totally or partially unipolar weight matrices, with considerable area and energy benefits.
Please do get in touch if you have any questions.
Contact Fernando Garcia Redondo
Read the full paper