One of our goals at Arm is to enable more people to create and deploy their own machine learning (ML) enabled apps. Whether you are an experienced developer or this is your first venture into the world of ML applications, this blog highlights some ideas and experiences that might spark your interest.
We are going to focus on Android mobile deployment, but most of this information applies to iOS too. Why mobile? Because mobile phones are accessible to most people worldwide; most of us have one in our pocket. In this blog, we start from the beginning of making an ML application: choosing the framework, creating the model and following good practices. After that we cover training and datasets, and finally the deployment of the app. For more information and examples, and to follow along, you can find all the code at:
[CTAToken URL = "https://github.com/Pablololo12/ML_playground" target="_blank" text="Download Sample Code Now" class ="green"]
Artificial intelligence (AI) and ML applications are becoming more pervasive in our everyday lives. These applications cover many different use-cases, such as virtual personal assistants, commuting predictions, product recommendations, security, and even health. Working in the background, they can help us even when we do not know they are there. I am sure most people have received a notification on their phone about the best time to leave work based on traffic conditions from a maps application. This is precisely what Arm embodies as a company, highlighted by our vision: "the technology that invisibly enables opportunity for a globally connected population".
The first choice that we face in our ML developer journey is the selection of a framework to build and train our own model. Among all the frameworks available, TensorFlow and PyTorch are two of the most used due to their large communities, flexibility and ease of use. So, for this project, we are going to use TensorFlow 2.0 since it also includes TensorFlow Lite (TFLite), one of the most used frameworks for inference on mobile devices.
TensorFlow 2.0 (https://www.tensorflow.org/install) was officially released on 30th September 2019, adding several improvements over TF 1.x, such as eager execution by default, tighter integration of Keras as the high-level API, and a generally simpler, more Pythonic workflow.
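If you want to check that the right version is active in your environment (assuming a standard pip installation), a quick sanity check is:

import tensorflow as tf

# TensorFlow 2.x runs eagerly by default, so operations execute immediately.
print(tf.__version__)          # should print 2.0.0 or later
print(tf.executing_eagerly())  # True on TensorFlow 2.x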
Depending on our goal, there are a few different paths we can take when selecting a model. The first is to take an already trained model and retrain it with our own data, which is useful because it needs far less training data overall. For example, we can take a model that was trained to identify dogs and retrain it to identify cats. This works because in many models the first layers end up doing similar feature extraction, so we do not need to repeat that training to get the same result. This is common in convolutional neural networks (CNNs). We can also take a model from a paper or from the web and train it with our own data from scratch.
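As an illustrative sketch of the first path (the model, dataset and layer sizes here are placeholders, not code from the project), we can take a MobileNetV2 base pre-trained on ImageNet, freeze it, and train only a small classification head on our own images:

import tensorflow as tf

# Pre-trained feature extractor, without its original classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # reuse the learned feature-extraction layers as they are

# Our own small head for the new task, for example cat vs. not-cat.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_dataset, epochs=5)  # train_dataset would be our own labelled images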
But what if the overall goal is to learn, or we want to try out some new ideas of our own? In that case we have to create our own model. Since we want to run it on mobile devices, we can follow some general advice to improve execution efficiency.
It is good practice to look at what has already been done for mobile devices and use it as a guide for our own ideas. MobileNets is a family of efficient models whose architecture is based on depth-wise separable convolutions, which improve performance and reduce the memory footprint at execution time. But wait, what is a convolution?
Convolutions are one of the fundamental building blocks of machine learning, although they are also used in many other applications such as image filters and blurs. The mechanism is simple. On one side we have our input in the form of a multidimensional matrix: for greyscale images it is a 2-dimensional matrix in which each element is one pixel; for RGB images it is three 2-dimensional matrices (or one 3-dimensional matrix) in which each channel holds one of the RGB values for each pixel. On the other side we have a kernel, which is also a multidimensional matrix, but smaller than the input. We take this kernel and slide it across the image, performing a convolution at each position. We can control the size of the kernel, the number of pixels the kernel moves each step (the stride), and how many extra pixels we add around the border of the image (the padding). Each convolution is a weighted sum: multiply each element of the kernel with its corresponding element of the image and add everything together.
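To make this concrete, here is a minimal sketch (the image and kernel values are arbitrary) that slides a single 3x3 kernel over a greyscale image and shows how stride and padding change the output size:

import tensorflow as tf

# A fake 8x8 greyscale "image": batch of one, one channel.
image = tf.random.uniform([1, 8, 8, 1])

# A single 3x3 kernel; all ones simply sums each 3x3 neighbourhood.
kernel = tf.ones([3, 3, 1, 1])

# 'SAME' padding keeps the spatial size; 'VALID' shrinks it.
# The stride controls how many pixels the kernel moves each step.
same = tf.nn.conv2d(image, kernel, strides=1, padding="SAME")
valid = tf.nn.conv2d(image, kernel, strides=2, padding="VALID")

print(same.shape)   # (1, 8, 8, 1)
print(valid.shape)  # (1, 3, 3, 1)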
In the image below we can see how a separable convolution works. We have three 3x3 kernels, one applied to each channel (the depth-wise step), followed by a 1x1x3 point-wise kernel that combines the channels. A standard convolution achieves a similar effect with a single 3x3x3 kernel, but with more operations and parameters.
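As a rough sketch of why this matters (the layer sizes are only an example), we can build both versions in Keras and compare their parameter counts:

import tensorflow as tf
from tensorflow.keras import layers

# Standard convolution: each of the 16 output channels uses a full 3x3x3 kernel.
standard = tf.keras.Sequential([layers.Conv2D(16, 3, padding="same")])

# Depth-wise separable version: one 3x3 kernel per input channel,
# followed by 1x1 point-wise kernels that mix the channels together.
separable = tf.keras.Sequential([
    layers.DepthwiseConv2D(3, padding="same"),
    layers.Conv2D(16, 1, padding="same"),
])

standard.build((None, 64, 64, 3))
separable.build((None, 64, 64, 3))
print(standard.count_params())   # 448 parameters
print(separable.count_params())  # 94 parameters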
Every time we perform a convolution, we increase the amount of memory needed to hold the data. This means more information for the model, but also a larger memory footprint, more memory accesses and more operations for the next layer. We need to find the spot where our model is accurate enough but also fast enough for our application. Depth-wise convolutions and max-pooling layers help us reduce the overall size of the data, so it is important to make use of them in our models. Favour 3x3 and 1x1 kernels: not only do they have a small footprint, but Arm, as a contributor to TensorFlow, has put particular effort into making them extremely efficient, with instructions that make full use of the processor. Finally, keep away from dense (fully connected) layers, since they generate a huge number of random memory accesses that will hurt the performance of your app. As an example, the code below defines a model called DeepHotDog with a convolutional layer and a max-pooling layer.
import tensorflow as tf
import tensorflow.keras.layers as layers

class DeepHotDog(tf.keras.Model):
    def __init__(self):
        super(DeepHotDog, self).__init__()
        # One 3x3 convolution producing 8 feature maps, followed by a 2x2 max-pool.
        self.down1 = layers.Conv2D(8, kernel_size=3, activation='relu', name="Conv1")
        self.pool1 = layers.MaxPooling2D(pool_size=(2, 2), name="Pooling1")

    def call(self, x):
        x = self.down1(x)
        x = self.pool1(x)
        return x
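As a quick sanity check, we can instantiate the model and push a dummy image through it to see the effect of the convolution and the pooling layer (the 64x64 input size is just an example):

model = DeepHotDog()
dummy = tf.random.uniform([1, 64, 64, 3])  # batch of one 64x64 RGB image
out = model(dummy)
print(out.shape)  # (1, 31, 31, 8): 8 feature maps, shrunk by the convolution and halved by the max-pool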
Another important technique to consider is quantization. When we train a model we use floating-point operations, which means we can work with precise decimal values, but it also means that operations on those values take more time and energy. Quantization allows us to reduce the size of the values from 32-bit floats to 16-bit floats, or even to 16- or 8-bit integers. Generally we lose some accuracy because of the reduced range of values, but we gain performance, so once again we face the decision of whether we prefer accuracy or speed in our application. In the image below we can see a comparison of how the different values are represented on the device.
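To make the idea concrete, the small sketch below (the numbers are purely illustrative) maps a range of float values onto 8-bit integers using a scale and a zero point, which is essentially what the converter does for each tensor:

import numpy as np

# Illustrative float values, for example the weights of one layer.
weights = np.array([-1.3, -0.2, 0.0, 0.7, 2.1], dtype=np.float32)

# Affine quantization: map the [min, max] range of the tensor onto int8.
scale = (weights.max() - weights.min()) / 255.0
zero_point = np.round(-128 - weights.min() / scale)

q = np.clip(np.round(weights / scale + zero_point), -128, 127).astype(np.int8)
dq = (q.astype(np.float32) - zero_point) * scale

print(q)   # the 8-bit values stored on the device
print(dq)  # the recovered values: close to the originals, but not exact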
TensorFlow 2.0 currently only supports post-training quantization, but in the future it will also include quantization-aware training for improved accuracy. With it, instead of just performing a simple linear transformation from a large range of values into a small one, the quantizer can adapt the quantization to the range of values actually used. To perform the conversion, we open the model saved from training and call the TensorFlow Lite (TFLite) converter, as shown below.
converter = tf.lite.TFLiteConverter.from_saved_model("model/1/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
open("converted_model_quantized.tflite", "wb").write(tflite_model)
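If we want to push further and quantize the weights and activations to 8-bit integers, the converter also accepts a representative dataset so it can calibrate realistic value ranges. A hedged sketch, assuming sample_images holds a few typical inputs for our model, looks like this:

import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield a handful of typical inputs so the converter can estimate
    # the real range of values flowing through each tensor.
    for image in sample_images[:100]:  # sample_images: our own example inputs
        yield [np.expand_dims(image.astype(np.float32), axis=0)]

converter = tf.lite.TFLiteConverter.from_saved_model("model/1/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
open("converted_model_int8.tflite", "wb").write(tflite_model)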
The dataset we use for training has a huge impact on how our application performs. As one of the companies pushing for engineering ethics into AI, it is massively important for Arm to show leadership in ensuring AI is ethical by design. For example, when building our dataset we must be careful to ensure the application has no inherent bias. To increase the effectiveness of our training data and make the model more resilient, we can use data augmentation techniques: from each image we can generate several new training images by applying transformations such as flipping, rotating, cropping, or adjusting the brightness and contrast.
We can also apply these techniques on top of each other to create an even more varied training dataset, as sketched below.
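A few of these transformations can be applied on the fly with tf.image; the sketch below (the parameters are arbitrary) produces a slightly different variant of an image every time it is called:

import tensorflow as tf

def augment(image):
    # Randomly flip the image and jitter its brightness and contrast,
    # then rotate it by a random multiple of 90 degrees.
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    return image

# Example: build an augmented dataset from our existing images.
# dataset = tf.data.Dataset.from_tensor_slices(images).map(augment)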
Now, with our trained network, we face the challenge of deploying it on a mobile device. There are many solutions on the market, but since we are using TensorFlow we will use TensorFlow Lite for deployment. It has great support on Android and can also be used on iOS, although on iOS we can also use Core ML. Arm NN is also a great tool, but if this is your first project of this kind we recommend gaining some experience with TensorFlow before taking the leap into a more flexible tool.
To understand the situation on Android, we can look at the following diagram. If we want to deploy a model, we can choose TFLite or Arm NN. TFLite runs on the CPU by default, on the GPU via the GPUDelegate, or even on Arm NN on compatible devices via the Android NNAPI. We can also use Arm NN directly, in which case it handles the work for the CPU and GPU using the Arm Compute Library, and for Arm Ethos NPUs (Neural Processing Units).
Whether you choose Arm NN, TFLite or Core ML, the first step of deployment is always converting the model from checkpoints or pb files into TFLite or Core ML files. This step is not only a conversion between formats: a lot of improvements and tuning are applied to make the model even more efficient, such as pruning or clustering to reduce the size of the weights, and layer fusion to combine several operations into one.
After loading the model, we can change the options of each framework to make it run faster. They all offer similar options, starting with the number of threads to use on the CPU. Nowadays most Android devices use Arm big.LITTLE technology, with four powerful cores and four efficient cores. To run our network it is usually enough to select four threads; that way Android allocates the work to the big cores and performance is as good as it can be. We can also experiment with a different number of threads to check whether four really is the optimal number. On TFLite, we change the number of threads by passing an options object to the TFLite interpreter. In Java, the instructions are the following:
Interpreter.Options tfliteOptions = new Interpreter.Options();
tfliteOptions.setNumThreads(4);
Interpreter tfliteInterpreter = new Interpreter(model, tfliteOptions);
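For quick experiments away from the phone, for example on a development machine, the TFLite Python interpreter exposes the same setting in recent TensorFlow versions (the model path here is the file produced earlier, and is only an example):

import numpy as np
import tensorflow as tf

# Load the converted model and run it on the CPU with four threads.
interpreter = tf.lite.Interpreter(
    model_path="converted_model_quantized.tflite", num_threads=4)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
output = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])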
On Arm NN it is also possible to change the number of threads used for inference, but we need to access the scheduler and change the number of threads it is allowed to use:
arm_compute::Scheduler::get().set_num_threads(4);
Apart from the number of threads, we can use other compute units available in the device, such as the GPU or the NPU. All the frameworks work in a similar way. The default compute unit is the CPU, and it is always the fallback in case a layer cannot be executed on the other compute units. On top of the CPU we can add the GPU, which is great for highly parallel workloads like convolutional layers, and then specialized accelerators like the NPU. However, it is important to always leave the CPU as a fallback in case some layers cannot be executed or are not yet supported.
On TFLite we have the GPUDelegate for the GPU and the NNAPI for the accelerators. The NNAPI does not let us pick a specific accelerator ourselves; instead it hands the work to whichever driver the device provides, such as Arm NN. If the application developer is looking for maximum device compatibility, the CPU remains the most universally supported compute unit in phones; as NNAPI support from GPUs and NPUs becomes more widespread, the improved performance offered by these specialized compute units should be used where available. Changing this setting works much like changing the number of threads: the same options object has methods to add delegates.
tfliteOptions.addDelegate([delegate]);
Using this method we can add the GPU delegate or the NNAPI delegate, but never both at the same time, since the NNAPI can also use the GPU on its own.
new GpuDelegate();    // runs supported layers on the GPU
new NnApiDelegate();  // hands supported layers to the NNAPI
On iOS the procedure is similar: we create a configuration object that is then used when creating the model.
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuOnly
let mymodel = try MyModel(configuration: config)
Instead of cpuOnly we can use cpuAndGPU to enable the GPU, or all if we want to use the accelerators as well.
Arm NN is more explicit: it lets you select which compute units to use and in which order of preference. In this case we need to make sure to add the CPU, otherwise our app can crash if it cannot find a fallback for an operation. Here we create a vector of backends, which can be the accelerated CPU (CpuAcc), the reference CPU (CpuRef) or the GPU (GpuAcc), since the NPU backend is not yet available.
std::vector<armnn::BackendId> computeDevices;
computeDevices.push_back(armnn::Compute::GpuAcc);  // GPU first
computeDevices.push_back(armnn::Compute::CpuAcc);  // then the accelerated CPU
computeDevices.push_back(armnn::Compute::CpuRef);  // reference CPU as the final fallback
We can adjust these options to find the combination that gives the best performance for our model. In the project with all the code, we also include a useful tool that lets you benchmark the different TFLite options and Arm NN.
The tool uses a compiled version of Arm NN and of the TFLite benchmark tool (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark), which you can swap out simply by substituting the files. With it, you can select different experiments in a simple way and the Python script handles all the execution. We also include a dashboard, built with Electron, that lets you view and compare plots of the results. In the image above we can see the execution time per layer, comparing one thread with four. The tool is especially useful for confirming whether the GPUDelegate or the NNAPI is actually being used, because all the layers executed through those delegates appear as delegate layers.
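If you prefer to drive the underlying TFLite benchmark binary by hand, a rough sketch (the paths, file names and flags shown are assumptions to adapt to your own setup) of running it over adb from Python is:

import subprocess

MODEL = "converted_model_quantized.tflite"
REMOTE = "/data/local/tmp"

# Push the model and the prebuilt benchmark binary to the device.
subprocess.run(["adb", "push", MODEL, REMOTE], check=True)
subprocess.run(["adb", "push", "benchmark_model", REMOTE], check=True)
subprocess.run(["adb", "shell", "chmod", "+x", f"{REMOTE}/benchmark_model"], check=True)

# Time the model on the CPU with four threads; adding --use_nnapi=true
# or --use_gpu=true tries the other compute units instead.
subprocess.run([
    "adb", "shell", f"{REMOTE}/benchmark_model",
    f"--graph={REMOTE}/{MODEL}",
    "--num_threads=4",
], check=True)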
ML is an interesting and active field of study for new applications. Developers are starting to use ML in many different applications, and the growing community of developer knowledge, along with more usable frameworks, will make us all more aware of how to use current ML technology and how to improve it. Arm is investing a lot of effort in making ML a better experience for everyone on mobile devices, which is why we want to encourage you to join us on this exciting ML journey.