Cartoonifying Images on Raspberry Pi with the Compute Library

Hi folks!

Here we are with the first hands-on guide to Compute Library, the new Computer Vision and Machine Learning software library developed at Arm!

Compute Library is a rich collection of functions for image processing, computer vision and machine learning optimized through NEON on Arm Cortex CPUs and through OpenCL on Arm Mali GPUs. The library has been designed to target a wide variety of use-cases and it is completely free of charge under the MIT open-source license.  

With this first blog post (and more to come!) you will learn how to use the Compute Library, along with the main steps to write sample code to "cartoonify" your images on Raspberry Pi!

Introduction to Compute Library

The era of intelligent vision applications has been progressing rapidly over the last few years. Thanks to recent advances in mobile computing performance and in deep learning, smart vision applications are landing on our smartphones more and more frequently, with capabilities that were unthinkable until a few years ago.

Just think of the evolution of text messaging into smart image messaging, or of the incredible progress of intelligent personal assistants.

Deploying these applications still poses challenges, such as:

  1. Code/performance portability: most of the time the algorithm has to be rewritten from scratch to reach the desired performance on a different platform.
  2. Code optimization on specific architectures: Does the architecture support SIMD acceleration? Does it support FP16 acceleration? Is it 32-bit or 64-bit? These are just a few of the questions to keep in mind when we want to considerably boost the performance of our algorithms.

Compute Library was born mainly to address these two challenges.

Developed over years of experience working closely with partners and developers in the sphere of imaging and vision products, the library aims to make the deployment of intelligent vision applications easy and performant on Arm-based platforms, reducing both cost and programming effort.


In its current state the library has roughly 60 functions, accelerated for both Arm Cortex-A CPUs (both aarch32 and aarch64 with NEON support) and Arm Mali GPUs (both Midgard and Bifrost architectures).

The functions implemented so far mainly cover the areas of image processing, computer vision and machine learning needed to develop a smart vision application.

Just to name a few: 

  • Image processing: Convolution, Gaussian filtering, Sobel filtering, Warp, Remap...
  • Computer Vision: Canny Edge, Harris Corner, HOG, Optical Flow...
  • Machine Learning: S/H/LOWP/GEMM, Convolution Layer, Activation Layer, Fully Connected Layer, Pooling Layer...  

Although it is still early days and the library contains no hand-written assembly code (it currently uses just NEON intrinsics), Compute Library already delivers a significant performance uplift compared to other well-known libraries and has FP16 and fixed-point acceleration for some key functions.

More details about the library can also be found in Roberto Mijat's blog post: Arm Compute Library for computer vision and machine learning now publicly available

Time to have fun!

The only prerequisites to complete this tutorial successfully are:

  • Basic knowledge of Linux
  • Remote access through SSH

The tutorial has been tested on an x86-64 host machine with Ubuntu Linux 14.04 but should work on other Linux distributions as well.  

Let’s see the list of things we need:

  1. Raspberry Pi 2 or 3 with Ubuntu Mate 16.04.2 (Note: Raspbian OS can't be used because it is based on an armv6 filesystem, whereas an armv7 filesystem is required for NEON to work)
  2. A blank microSD card: we highly recommend a Class 6 or Class 10 microSDHC card with 8 GB (minimum 6 GB)
  3. Router + Ethernet cable  

Enabling remote access on Raspberry Pi

Assuming Ubuntu Mate has been correctly installed on the Raspberry Pi (if not, you can follow the instructions described here), we need to enable SSH connections on the device, as the OpenSSH server is disabled by default on Ubuntu Mate 16.04.2. This will be necessary when we cross-compile the library.

For this purpose you can use raspi-config.

Open a terminal on your Raspberry Pi: 

sudo raspi-config

  1. Select Interfacing Options.  
  2. Navigate to and select SSH.  
  3. Choose Yes.  
  4. Select Ok.  
  5. Choose Finish and reboot your Raspberry Pi

Now let's find the IP address assigned to the device and try to connect over SSH from our host machine:

  1. Plug your Raspberry Pi into your router with the ethernet cable
  2. Open a terminal on your Raspberry Pi and type:

ifconfig eth0 | grep 'inet addr' | cut -d: -f2 | awk '{print $1}'

Note: If the above command returns "eth0: error fetching interface information: Device not found", the Ethernet port is not named eth0 on your device. In this case, list the available interfaces with ifconfig -a and use the name of the Ethernet interface in place of eth0:

ifconfig <ethernet_interface> | grep 'inet addr' | cut -d: -f2 | awk '{print $1}'

The above command should return the IP address assigned to your device.

Once we know the IP address of the Raspberry Pi, we can establish an SSH connection from the host machine:

ssh <username_raspberrypi>@<ip_addr_raspberrypi>


  1. <username_raspberrypi>: username used on your Raspberry Pi 
  2. <ip_addr_raspberrypi>: IP address of your Raspberry Pi

Getting the Compute Library source code

Before starting to see how to build the library, let's have a look at its structure.

The latest available version of Compute Library can be downloaded from the GitHub repository at Arm Developer.

Within the GitHub repository you should find the following structure:

The 3 main folders to consider for this tutorial are:

  1. arm_compute: contains all the library's header files 
  2. examples: contains a few examples to compile
  3. src: contains the library's source files   

In terms of building blocks, the library is essentially made up of 2 main parts:

The first is the core, which includes the kernels.

The kernels are low-level algorithms designed to be embedded in existing projects, since they:

  1. Do not allocate any memory, so memory allocation must be handled by the caller
  2. Do not perform any type of multi-threading, but provide the caller with the necessary information about how the workload could be split between threads

The second is the runtime, which contains the functions: the actual wrappers around the kernels.

The functions:  

  1. Can allocate the memory for the tensors (for instance the function can allocate the memory for the temporary tensors needed) 
  2. Can perform multi-threading as they can use the information provided by the kernels 

Hint: To get a clear view of the distinction between the "core" and "runtime" blocks, you could take a look at the NEGaussian5x5 function. As you will notice, this function calls 3 kernels, allocates 1 temporary tensor and splits the task between threads using the Arm Compute scheduler.

Building natively on Raspberry Pi

Building natively on Raspberry Pi is much more straightforward than cross-compiling. 

The requirements for our Raspberry Pi are just 3: 

  1. g++ 
  2. git
  3. scons 2.3 or above  

# Install dependencies (g++, git and scons)
sudo apt-get install g++ git scons 

# Clone Compute Library
git clone 

# Enter ComputeLibrary folder
cd ComputeLibrary 

# Build the library and the examples
scons Werror=1 debug=0 asserts=0 neon=1 opencl=0 examples=1 build=native -j2

The scons command should return “scons: done building targets” once the library has been successfully compiled. 

Before continuing, just a few comments about the arguments passed to the build command:

  • Werror=1: enables the -Werror compilation flag
  • debug=0 & asserts=0: all optimizations are enabled and no validation is performed on the arguments passed to the functions. This means that if the application misuses the library, it is likely to crash.
  • neon=1 & opencl=0: enables just the NEON acceleration. There is no Arm Mali GPU on Raspberry Pi, so we cannot benefit from OpenCL acceleration.
  • build=native: compiles the library natively
  • examples=1: compiles the examples

All the binaries (library + examples) will be inside the build/ folder. 

Once you have built the library you should be able to run the examples by executing the following command:

# Run convolution example on NEON
LD_LIBRARY_PATH=build/ ./build/neon_convolution 

If everything is working properly, the example should return "Test passed". 

Note: If, while installing the dependencies, you get an error like "dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.", try removing all files inside the /var/lib/dpkg/updates directory and then installing the dependencies again:

# Enter /var/lib/dpkg/updates
cd /var/lib/dpkg/updates 

# Remove all files
sudo rm * 

# Install dependencies (g++, git and scons)
sudo apt-get install g++ git scons 

Cross-compiling the Compute Library

Now let's see how to cross-compile the library on your Linux host machine.

Also in this case, the requirements for your Linux host machine are just 3: 

  1. Arm Cross compiler toolchain (4.9+) 
  2. Git
  3. scons 2.3 or above

# Install dependencies (scons, Arm cross-compiler toolchain)
sudo apt-get install git scons gcc-arm-linux-gnueabihf g++-arm-linux-gnueabihf 

# Clone Compute Library
git clone 

# Enter ComputeLibrary folder
cd ComputeLibrary 

# Cross compile the library and the examples
scons Werror=1 debug=0 asserts=0 neon=1 opencl=0 os=linux arch=armv7a examples=1

Once again all the binaries will be inside the build/ folder.

Note: The above command is valid for both Raspberry Pi 2 and Raspberry Pi 3 as Ubuntu Mate 16.04.2 is built for aarch32. In case your operating system was built for aarch64, you should replace arch=armv7a with arch=arm64-v8a.

In order to run the examples, we only need to copy the examples' binaries and the Compute Library shared library to the Raspberry Pi.

Open a terminal on the host machine and, inside the ComputeLibrary folder:

# Copy the examples' binaries and the shared library to the Raspberry Pi
scp build/neon_convolution build/neon_scale build/ <username_raspberrypi>@<ip_addr_raspberrypi>:Desktop

# Open the ssh session to Raspberry PI 
ssh <username_raspberrypi>@<ip_addr_raspberrypi>  

# Within the ssh session, enter the Desktop folder
cd Desktop

# Run convolution example
LD_LIBRARY_PATH=. ./neon_convolution

Cartoon effect with the Compute Library

We will now write sample code that applies a cartoon effect to our images.

The sample code will show how to use the Compute Library and how to convert an image so that it looks hand drawn.

When it comes to developing a cartoon effect, there are essentially just two main computation blocks:

  1. Region smoothing (for instance with Gaussian Filter 5x5)
  2. Edge detection (for instance with Canny Edge algorithm) 

In order to achieve the basic cartoon effect, we need to apply the Gaussian filter 5x5 and the Canny edge over the input image. The region smoothing will reduce the color palette whilst the edge detection will produce the sketch effect. Combining the outputs of these two stages with an arithmetic subtraction, we will be able to achieve the desired result.

#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/core/Types.h"
#include "utils/Utils.h"

using namespace arm_compute;
using namespace utils;

int main(int argc, const char **argv)
{
    Image src_img;
    Image dst_img;
    Image gaus5x5_img;
    Image canny_edge_img;

    if(argc < 2)
    {
        // Print help
        std::cerr << "Usage: ./build/neon_cartoon_effect [input_image.ppm]\n\n";
        std::cerr << "No input_image provided\n";
        return -1;
    }

    // Open PPM file
    PPMLoader ppm;
    ppm.open(argv[1]);

    // Initialize just the dimensions and format of your buffers:
    ppm.init_image(src_img, Format::U8);

    // Initialize just the dimensions and format of the images:
    gaus5x5_img.allocator()->init(*src_img.info());
    canny_edge_img.allocator()->init(*src_img.info());
    dst_img.allocator()->init(*src_img.info());

    NEGaussian5x5             gaus5x5;
    NECannyEdge               canny_edge;
    NEArithmeticSubtraction   sub;

    // Configure the functions to call
    gaus5x5.configure(&src_img, &gaus5x5_img, BorderMode::REPLICATE);
    canny_edge.configure(&src_img, &canny_edge_img, 100, 80, 3, 1, BorderMode::REPLICATE);
    sub.configure(&gaus5x5_img, &canny_edge_img, &dst_img, ConvertPolicy::SATURATE);

    // Now that the padding requirements are known we can allocate the images:
    src_img.allocator()->allocate();
    dst_img.allocator()->allocate();
    gaus5x5_img.allocator()->allocate();
    canny_edge_img.allocator()->allocate();

    // Fill the input image with the content of the PPM image
    ppm.fill_image(src_img);

    // Execute the functions:
    gaus5x5.run();
    canny_edge.run();
    sub.run();

    // Save the result to file:
    save_to_ppm(dst_img, "cartoon_effect.ppm");
}

A closer look at the code

Step 0: Header files

To implement the example correctly we need just 3 header files:

// Contains the definitions of all the NEON functions 
#include "arm_compute/runtime/NEON/NEFunctions.h"

// Contains the definition of all types used in the library 
#include "arm_compute/core/Types.h"

// Contains the definition for the PPMLoader 
#include "utils/Utils.h"

Step 1: Image definitions

// Input image
Image src_img; 
// Output image
Image dst_img; 
// Output of Gaussian Filter 5x5
Image gaus5x5_img; 
// Output of Canny Edge
Image canny_edge_img;

Step 2: Input image initialization

The following step loads the PPM image using the PPMLoader class and sets the image's format.

The image's format is set to Format::U8, as the NEON functions Gaussian5x5 and CannyEdge support only single-channel images with data type DataType::U8.

An important aspect to highlight concerns the initialization: it doesn't fill the image with the content of the PPM file, as the memory is not yet allocated at this point.

It only sets the dimensions of the image (width and height) and the format to use.

PPMLoader ppm;

// Open image
ppm.open(argv[1]);

// Initialize width, height and format
ppm.init_image(src_img, Format::U8);

Step 3: Initialization of the images

The other images must also be initialized before configuring the NEON functions. Since all images have the same dimensions and format as the input image, we can simply use the TensorInfo of src_img.

// Initialize the output image of Gaussian5x5
gaus5x5_img.allocator()->init(*src_img.info());

// Initialize the output image of Canny Edge
canny_edge_img.allocator()->init(*src_img.info());

// Initialize the output image
dst_img.allocator()->init(*src_img.info());

Also in this case, the initialization doesn't allocate the memory.  

Why can't the memory be allocated during the initialization of the image?

The answer lies in the implementation of the kernels.

Most of the NEON and OpenCL kernels use vector load/store instructions to access the data in buffers. In order to avoid having special cases to handle the borders (when for instance the image's width is not a multiple of the width of the SIMD instruction used), all images use padding bytes.

In this library the padding bytes are defined just for the first 2 dimensions of the image/tensor.

Since the configure methods will update the padding bytes requirements for each image, it is important to allocate the memory only when all the functions have been configured.

Step 4: Function configuration

Once all the images have been initialized, we can proceed with the configuration of the functions.  

// Configure Gaussian 5x5
gaus5x5.configure(&src_img, &gaus5x5_img, BorderMode::REPLICATE); 

// Configure Canny Edge
canny_edge.configure(&src_img, &canny_edge_img, 100, 80, 3, 1, BorderMode::REPLICATE); 

// Configure arithmetic subtraction
sub.configure(&gaus5x5_img, &canny_edge_img, &dst_img, ConvertPolicy::SATURATE); 

Step 5: Memory allocation

Once the functions have been configured, the padding requirements are known and we can allocate the memory for all the images.


Step 6: Fill the input image!

Now that we have allocated the memory for the input image, we can fill it with the content of the PPM file.


Cross-compiling the cartoon effect sample code

Assuming the Compute Library has already been built, we can cross-compile the sample code with the following command:

# Note: We assume that the input file to compile is inside examples/ and called neon_cartoon_effect.cpp
arm-linux-gnueabihf-g++ examples/neon_cartoon_effect.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -mfpu=neon -L. -larm_compute -o build/neon_cartoon_effect


Congratulations! You have completed this hands-on guide, where we have started playing with the Compute Library.  

In this first blog of the series, we have shown how to work with the Compute Library and illustrated the main steps to make our images look hand drawn.

Good news! This is just the beginning of building awesome smart vision applications on Arm-based platforms such as Raspberry Pi with the Compute Library. In upcoming blogs, we will see how to enrich our applications through a traditional computer vision pipeline (HOG/SVM + OpenCV) and through the revolutionary and powerful Convolutional Neural Networks.


Gian Marco

Graphics & Multimedia blog