Deep learning has emerged as the dominant approach for a wide range of classification tasks, from computer vision to speech processing. Relying on deep convolutional neural network (CNN) architectures, these systems owe their success to their ability to learn abstract, high-level feature representations from large amounts of training data in an end-to-end fashion.
However, top-performing systems usually involve deep and wide architectures and therefore come at the cost of increased storage and computational requirements, while the trend is to keep increasing the depth of the networks.
At the same time, there is an increasing need to run deep CNNs in applications on embedded devices. These devices, especially in the Internet of Things (IoT) era, are equipped with limited storage and low computational capabilities, and as such cannot easily cope with the computational complexity of a deep and wide CNN.
Deploying such networks on embedded hardware generally requires inspiration and intense effort from very specialized programming teams. Such a team must be capable of producing high-efficiency code for the target platform while also being familiar with the details of CNN computations and the algorithmic state of the art, so that it can tweak a given architecture if necessary.
This is the exact case with the Irida Labs team.
Our team comprises highly skilled scientists and engineers with diverse yet complementary skills, ranging from high-performance, low-level programming to algorithmic development and optimization.
In this article, we present the development and implementation of a typical deep-learning network on a number of embedded devices based on ARM/MALI processing units.
The network under consideration is a modified CaffeNet, as shown in the following figure. It consists of the full CaffeNet with an extra fully connected (FC) layer appended at the end, and is similar to the well-known AlexNet.
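As a quick sanity check on such an architecture, the spatial dimensions flowing through the standard CaffeNet convolution and pooling stack can be traced with a few lines of arithmetic. This is an illustrative sketch assuming the usual published CaffeNet hyperparameters (227×227 input, conv1 11×11/4, 3×3/2 max pooling, etc.), not Irida Labs' exact configuration:

```python
# Trace the spatial sizes through a CaffeNet-style layer stack.
# Assumes the standard published CaffeNet/AlexNet hyperparameters.
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

size = 227                       # input image: 227x227x3
size = conv_out(size, 11, 4, 0)  # conv1 -> 55
size = conv_out(size, 3, 2, 0)   # pool1 -> 27
size = conv_out(size, 5, 1, 2)   # conv2 -> 27
size = conv_out(size, 3, 2, 0)   # pool2 -> 13
size = conv_out(size, 3, 1, 1)   # conv3 -> 13
size = conv_out(size, 3, 1, 1)   # conv4 -> 13
size = conv_out(size, 3, 1, 1)   # conv5 -> 13
size = conv_out(size, 3, 2, 0)   # pool5 -> 6
flat = size * size * 256         # 6*6*256 = 9216 inputs to fc6
print(size, flat)                # prints: 6 9216
```

The FC layers (and any extra FC layer appended at the end) then operate on this flattened 9216-element vector.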
In the following, we present an implementation of this network on ARM/MALI SoCs.
Our analysis has been conducted on two different classification problems: a food recognition problem, using the FOOD 101 dataset, which comprises food images organized into 101 categories, and a general image recognition problem, using the ImageNet ILSVRC 2012 dataset, which comprises images organized into 1000 categories. The measured top-1 recognition accuracy is 57.27% (80.62% for top-5) on the ILSVRC 2012 dataset and 68.54% (88.44% for top-5) on the FOOD 101 dataset.
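For readers unfamiliar with the metrics, top-k accuracy counts a prediction as correct when the true label appears among the k highest-scoring classes. A minimal sketch of how such figures are computed from classifier scores (with toy data, not the actual datasets):

```python
# Compute top-k accuracy from per-sample class scores.
def top_k_correct(scores, label, k):
    """True if `label` is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return label in ranked[:k]

def accuracy(all_scores, labels, k):
    hits = sum(top_k_correct(s, y, k) for s, y in zip(all_scores, labels))
    return hits / len(labels)

# Toy example with 4 classes and 2 samples.
scores = [[0.1, 0.6, 0.2, 0.1],   # highest score: class 1
          [0.5, 0.1, 0.3, 0.1]]   # highest score: class 0
labels = [1, 2]                   # second sample's label ranks 2nd
print(accuracy(scores, labels, 1))  # 0.5  (only sample 1 is a top-1 hit)
print(accuracy(scores, labels, 2))  # 1.0  (both are top-2 hits)
```

Top-5 accuracy is always at least as high as top-1, which is why the 80.62% and 88.44% figures above exceed their top-1 counterparts.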
Two different ARM/MALI-equipped platforms have been considered: the Xiaomi Redmi Note 4 and the Samsung S7 Edge. For each implementation, the inference time per image has been measured; the results are shown in the following table.
Our implementation follows a heterogeneous programming approach: the device's multiple ARM/NEON CPU cores are used mainly for housekeeping and data feeding, while the bulk of the computation runs on the MALI GPU, which is programmed in OpenCL in a highly optimized way.
On all these platforms the computations have been offloaded to the GPU, while the CPU is mostly used for housekeeping functions. The GPU is programmed in OpenCL with hand-optimizations that avoid any pre- and post-processing operations at the convolution layers, minimize memory usage by avoiding temporary buffers, and reduce data transfers to and from the GPU as much as possible.
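To make the "avoiding pre-processing and temporary buffers" point concrete: a common CNN implementation strategy (im2col) first unrolls each input window into a large intermediate matrix, which costs both memory and bandwidth. A direct convolution instead accumulates each output straight from the input, which is the style of computation an optimized OpenCL kernel can use. The sketch below illustrates the idea in plain Python for a single channel; it is not Irida Labs' actual kernel code:

```python
# Direct (valid) 2-D convolution for one channel: each output value is
# accumulated straight from the input window, with no im2col-style
# intermediate matrix allocated.
def direct_conv2d(inp, kernel):
    h, w = len(inp), len(inp[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * (w - kw + 1) for _ in range(h - kh + 1)]
    for y in range(h - kh + 1):
        for x in range(w - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += inp[y + i][x + j] * kernel[i][j]
            out[y][x] = acc
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box = [[1, 1],
       [1, 1]]                    # 2x2 box filter
print(direct_conv2d(image, box))  # [[12.0, 16.0], [24.0, 28.0]]
```

On a GPU, the two outer loops map naturally onto the OpenCL work-item grid, so each work-item computes one (or a small tile of) output values directly from global memory, with no intermediate buffer in between.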
You can also check Irida Labs' Publications page for a comparison of the inference speed of ARM/MALI with Qualcomm SnapDragon.
IRIDA Labs (Patras, Greece) is bridging the gap between a camera and the human eye by bringing visual perception to any device. We develop Computer Vision software, using Image Processing and Machine Learning techniques, for any CPU, GPU or DSP/ASP platform, or a combination of them via Heterogeneous Programming techniques.
IRIDA Labs portfolio addresses the challenge of delivering innovative computer vision solutions while keeping optimal system requirements in terms of power consumption, memory and processing speed.
Our product and technology portfolio includes applications in Computational Photography and Visual Perception/Analytics, addressing markets such as mobile devices, action cameras, drones, surveillance, automotive, and industrial and robot vision.