Tutorial: Low Power Deep Learning on the OpenMV Cam

This article is part of the Arm Innovator Program, a series created to highlight the work of key technical leaders who are pushing the boundaries of how Arm architecture can enable next-generation solutions.

Meet Ibrahim Abdalkader, an embedded systems programmer, and vice president and co-founder of OpenMV. Ibrahim created the OpenMV project back in 2013 while searching for a better serial camera. His goal was to make machine vision more accessible to beginners by developing an open-source, low-cost machine vision platform. In this blog, he discusses Low Power Deep Learning on the OpenMV Cam, powered by the Arm Cortex-M7 Processor.

OpenMV Cam

About the OpenMV Cam

The OpenMV Cam was created to be a highly programmable, MicroPython-powered colour tracking sensor for hobbyist projects. However, the incredible performance of the on-board Cortex-M7 processor has enabled additional features like AprilTag Detection/Decoding, DataMatrix Detection/Decoding, QR Code Decoding, Bar Code Decoding, and now CNN Inference.

Recently Arm released the CMSIS-NN library, an efficient neural network library optimized for Cortex-M based microcontrollers. The CMSIS-NN library brings deep learning to low-power microcontrollers, such as the Cortex-M7-based OpenMV camera. In this blog post we'll go through training a custom neural network using Caffe on a PC, and deploying the network on the OpenMV Cam.


The CMSIS-NN library consists of a number of optimized neural network functions that use SIMD and DSP instructions and separable convolutions. Most importantly, it supports an 8-bit fixed-point representation. Using fixed-point avoids costly floating-point operations, reduces the memory footprint, and uses less power when running inference. However, this means that models have to be quantized before being used with CMSIS-NN.

Quantization, simply put, is the mapping of a range of numbers to a more compact range of numbers, or in this case, the mapping of 32-bit floats to 8-bit fixed-point numbers. The hardest part of quantizing a model is finding the min and max ranges of each layer's inputs and outputs so that the floating-point values are distributed evenly across the 8-bit representation. Fortunately, Arm also provides a script to quantize Caffe model weights and activations. If you're interested in more details about the CMSIS-NN library and the quantization process, please see this paper published by Arm Machine Learning researchers.
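To make the idea concrete, here is a minimal sketch of 8-bit fixed-point quantization in plain Python. It assumes a power-of-two scale (a Qm.n style format) chosen only from the largest magnitude in the data. The function names and sample weights are illustrative and are not part of the Arm scripts, which also evaluate the network when choosing a format.

```python
import math

def choose_frac_bits(values, total_bits=8):
    """Pick the number of fractional bits so the largest magnitude still fits."""
    max_abs = max(abs(v) for v in values)
    # Integer bits needed to cover max_abs (the sign bit is counted separately).
    int_bits = math.ceil(math.log2(max_abs)) if max_abs > 0 else 0
    return (total_bits - 1) - int_bits

def quantize(values, frac_bits, total_bits=8):
    """Map floats to signed fixed-point integers, saturating at the range limits."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    scale = 2.0 ** frac_bits
    return [max(lo, min(hi, int(round(v * scale)))) for v in values]

def dequantize(q_values, frac_bits):
    """Recover approximate floats from the fixed-point integers."""
    return [q / (2.0 ** frac_bits) for q in q_values]

# Hypothetical layer weights, used only for illustration.
weights = [0.75, -0.5, 0.1, -1.9]
frac_bits = choose_frac_bits(weights)
q = quantize(weights, frac_bits)
approx = dequantize(q, frac_bits)
```

With more fractional bits the rounding error shrinks, but the representable range shrinks too, which is why each layer's format depends on the range of its values.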

The default CMSIS-NN library comes with a CNN example trained on the CIFAR-10 dataset. However, this example is hard-coded, meaning it must be compiled and linked with the main application, so we extended the CMSIS-NN library and supporting scripts. Our code allows users to convert Caffe models to a quantized binary format which can be loaded from the file-system (SD card or internal flash) at run-time. Additionally, our code takes care of pre-processing the input image, subtracting the mean, and scaling the data if required.

Next, I'll demonstrate how to train a simple CNN model on a smile detection dataset and run it on the OpenMV camera with the CMSIS-NN library. The model achieves ~93% accuracy, and the camera consumes about 150mA @ 3.3V while running the network.

1. Training a Network with Caffe

First, if you're just getting started with neural networks and Caffe, I highly recommend this tutorial on deep learning using Caffe and Python.

Note that the CMSIS-NN library has a small and focused set of operators, chosen to help reduce model complexity to work within the memory and compute budgets found in M-Class systems. This means that your model should be simple.

2. Dataset

The smile dataset that we used can be found on GitHub. The dataset consists of ~3000 positive images and ~9000 negative images. The number of positive and negative images needs to be close; otherwise the network will be biased towards one class (class imbalance). To fix this, we can run this augmentation script on the positive images to increase the number of positive examples by 4x. The image augmentation script can be used like this:

python2 augment_images.py --input images/train/ --output images/train_aug/ --count 4
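As a quick sanity check on the balance, the arithmetic can be sketched as follows. The counts are the approximate figures quoted above, the helper name is illustrative, and the factor of 4 is assumed to be the total multiplier applied to the positive set.

```python
def class_ratio(n_positive, n_negative, augment_factor=1):
    """Return the positive-to-negative ratio after scaling the positive set."""
    return (n_positive * augment_factor) / n_negative

# Approximate dataset sizes from the text.
before = class_ratio(3000, 9000)                   # roughly 1 positive per 3 negatives
after = class_ratio(3000, 9000, augment_factor=4)  # roughly 4 positives per 3 negatives
```

After augmentation the two classes are within the same order of size, which is what matters for avoiding a biased classifier.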

3. Training the Network

You can use any deep learning library to train the network. However, if you're not using Caffe you need to convert the network output to a Caffe format to work with the Arm scripts. In the future, Arm will provide more conversion scripts to accept models from TensorFlow.

4. Quantizing the Model

The first step after training the network is to use the quantization script provided by Arm to convert the Caffe model weights and activations from floating point to fixed point format. As mentioned before, quantization is performed to reduce the size of the network and avoid floating point computations.

The NN quantizer script works by testing the network and figuring out the best format for the dynamic fixed-point representation. The output of this script is a serialized Python (.pkl) file which includes the network's model, quantized weights and activations, and the quantization format of each layer. Running this command generates the quantized model:

python2 nn_quantizer.py --model models/smile/smile_train_test.prototxt --weights models/smile/smile_iter_*.caffemodel --save models/smile/smile.pkl

5. Converting the Model to Binary

The next step is to use our NN converter script to convert the model into a binary format, runnable by the OpenMV Cam. The converter script outputs a code for each layer type, followed by the layer's dimensions and weights (if any).

On the OpenMV Cam, our firmware reads the binary file and builds the network in memory using a linked list data structure.
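The sketch below illustrates the general idea of such a format: a type code per layer, followed by its dimensions and weights, parsed back into a linked list. The byte layout, type codes, and names here are assumptions made up for illustration; they are not the actual layout written by nn_convert.py or read by the firmware.

```python
import struct

# Hypothetical layer type codes, for illustration only.
LAYER_CONV, LAYER_RELU = 1, 2

class Layer:
    """One node of the linked list that represents the network."""
    def __init__(self, type_code, dims, weights):
        self.type_code = type_code
        self.dims = dims
        self.weights = weights
        self.next = None

def write_layer(type_code, dims, weights=()):
    """Serialize a layer as: type, dim count, dims, weight count, int8 weights."""
    data = struct.pack("<II", type_code, len(dims))
    data += struct.pack("<%dI" % len(dims), *dims)
    data += struct.pack("<I", len(weights))
    data += struct.pack("<%db" % len(weights), *weights)
    return data

def read_network(blob):
    """Parse the serialized layers and link them in file order."""
    head = tail = None
    offset = 0
    while offset < len(blob):
        type_code, n_dims = struct.unpack_from("<II", blob, offset)
        offset += 8
        dims = struct.unpack_from("<%dI" % n_dims, blob, offset)
        offset += 4 * n_dims
        (n_weights,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        weights = struct.unpack_from("<%db" % n_weights, blob, offset)
        offset += n_weights
        layer = Layer(type_code, dims, weights)
        if head is None:
            head = layer
        else:
            tail.next = layer
        tail = layer
    return head

# Round trip: a layer with weights followed by a layer without any.
blob = write_layer(LAYER_CONV, (3, 3, 3, 8), [1, -2, 3]) + write_layer(LAYER_RELU, (32, 32, 8))
net = read_network(blob)
```

A linked list suits this use because the firmware can append layers as it reads them, without knowing the layer count in advance.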

Running this command generates the binary model:

python2 nn_convert.py --model models/smile/smile.pkl --mean /path/to/mean.binaryproto --output smile.network

6. Deployment on an OpenMV Camera

The CNN Inference operation on the camera downscales the region of interest it is called on to the input size of the network, then runs the network on that downscaled image. To find a particular object in the image, the detection window must therefore be slid over the image at multiple scales.
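A multi-scale sliding window can be sketched in plain Python as a generator of regions of interest. The frame size, minimum window size, scale step, and stride below are illustrative assumptions, not values used by the firmware.

```python
def sliding_windows(img_w, img_h, min_size=32, scale_step=1.5, stride_frac=0.5):
    """Yield square (x, y, w, h) ROIs that cover the image at several scales."""
    size = min_size
    while size <= min(img_w, img_h):
        # Overlap neighbouring windows by moving a fraction of the window size.
        stride = max(1, int(size * stride_frac))
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield (x, y, size, size)
        size = int(size * scale_step)

# Enumerate the windows for an assumed 320x240 frame.
rois = list(sliding_windows(320, 240))
```

On the camera, each ROI would need its own forward pass through the network, so the cost grows with the number of windows.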

While it's possible to slide the detection window over the entire image, to do so would be very slow. Instead, we use the built-in Haar cascade face detector to extract faces from the image, which is much faster. We then pass the region of interest (ROI) to the CNN to detect smiles. The first part of the smile detection code loads the network into memory and loads the built-in face detection Haar Cascade.

import sensor, image, nn  # OpenMV firmware modules used by the snippets in this section

# Load Smile Detection network
net = nn.load('/smile.network')
# Load Face Detection Haar Cascade
face_cascade = image.HaarCascade("frontalface", stages=25)

The next step is capturing a snapshot and finding all the faces.

# Capture snapshot 
img = sensor.snapshot() 
# Find faces. 
objects = img.find_features(face_cascade, threshold=0.75, scale_factor=1.25) 

Finally, for each detected face, the region of interest is slightly cropped and passed to the neural network. Note that the smile detection network is trained on tightly cropped faces, so we have to reduce the size of the ROI.

# Detect smiles 
for r in objects: 
  # Resize and center detection area 
  r = [r[0]+10, r[1]+25, int(r[2]*0.70), int(r[2]*0.70)] 
  out = net.forward(img, roi=r, softmax=True) 
  img.draw_string(r[0], r[1], ':)' if (out[0] > 0.8) else ':(', color=0, scale=2) 


Moving Forward

The OpenMV Cam uses the Cortex-M7 processor with only its internal SRAM; no external DRAM is attached. At any point in time, the processor can go into a low-power mode that draws about 50 µA while retaining all state, then wake up again on an interrupt, take a picture, and run the neural network before turning off again.

For example, on the upcoming OpenMV Cam H7, we're able to run a Lenet-6 CNN trained on the MNIST data set at 50 FPS while using only 3mA @ 3.3V per inference. With a 1Ah 3.7V LiPo battery, you can deploy a CNN in the field that runs once every minute and lasts for over a year.
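A rough duty-cycle estimate shows why a lifetime of this order is plausible. This is a back-of-the-envelope sketch under stated assumptions: the 3 mA figure is read as the current drawn during one inference, the active time is a single frame at 50 FPS, the 50 µA sleep current quoted above is assumed to apply here as well, and wake-up time, image capture, regulator losses, and battery self-discharge are all ignored.

```python
def average_current_ma(active_ma, active_s, sleep_ma, period_s):
    """Time-weighted average current over one wake/sleep period."""
    sleep_s = period_s - active_s
    return (active_ma * active_s + sleep_ma * sleep_s) / period_s

def battery_life_days(capacity_mah, avg_ma):
    """Idealized runtime in days for a battery of the given capacity."""
    return capacity_mah / avg_ma / 24.0

# One inference per minute: 1/50 s active at 3 mA, asleep at 0.050 mA otherwise.
avg_ma = average_current_ma(active_ma=3.0, active_s=1.0 / 50, sleep_ma=0.050, period_s=60.0)
days = battery_life_days(1000.0, avg_ma)
```

Under these assumptions the sleep current dominates the average, so the real-world figure depends mostly on how low the sleep current can be kept.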

CNN support on Cortex-M7 processors lines up particularly well with the deployment of smart thermal vision-based cameras that can accurately detect people in low-resolution thermal images. Smart sensors powered by CMSIS-NN and Cortex-M7 processors are coming soon!

For an MNIST demonstration, check out the video below: