People are spending more time than ever interacting with their digital TVs (DTVs) at home. A variety of video streaming providers, TV accessories, and applications have emerged to serve the significant interest that has grown in recent years. Among the latest trends, built-in cameras on DTVs and attachable USB cameras for DTVs or set-top boxes (STBs) have become features of the market because of the many potential use cases they unlock. These use cases include teleconferencing, interactive gaming, augmented reality (AR) filters, trying on clothes virtually, and even immersive exercising at home with prompt visual feedback.
Access to cameras brings these applications to life. By enabling gesture control and face recognition, these cameras have the potential to revolutionize how people interact with their TVs. Gesture control replaces the physical navigation buttons, while face recognition can simplify the tedious account log-in and switching process. Nobody is a fan of typing their password using a remote on a large screen in front of everyone! However, the possibilities are not limited to this.
With face recognition, applications can provide more relevant and valuable recommendations by identifying the users in front of the screen. The TVs can be smarter by understanding users’ behaviors better, for example, pausing the video when they leave. Moreover, if the TV detects that there are registered minors in front of it, then parental mode can be automatically turned on to filter out inappropriate content and track their screen time.
Compared with common online solutions, on-device face recognition provides stronger privacy protection because the user's data never leaves the device. Most DTVs on the market have a System on Chip (SoC) with an Arm Mali GPU for better graphics performance, and run the Android operating system. In this blog, we take a deep look into the process of designing an Android application that enables on-device face recognition with RGB cameras on lightweight devices. We start with the application's background and the key considerations and challenges of this use case, then dive into the Machine Learning (ML) approach with model selection, pipeline design, and implementation.
First, let's start by clarifying the idea of face recognition.
What is face recognition? You might have heard this term used together with face detection. Face recognition is the process of determining the identity of the face's owner. It can roughly be divided into the following stages: face detection, face alignment, face embedding, and face matching. Face detection first segments the face area from the background. The face is then aligned using landmarks detected in the first stage to produce a better-positioned frontal face image. Landmarks are the locations of key facial features, like the eyes. In the face-embedding stage, the high-dimensional face image is converted into a vector in a low-dimensional manifold (vector space). Finally, with simple vector operations in that manifold, we can calculate the similarities of faces to match or verify an identity.
Researchers have never stopped pursuing more accurate and faster face embedding algorithms. The GaussianFace algorithm, proposed by a research team from the Chinese University of Hong Kong (CUHK) in 2014, was the first computer vision algorithm to outperform humans in recognizing faces. In 2015, FaceNet by Google Research became the first deep learning network to do so. Although accurate inference results are reported in research, most of these networks are not suitable for mobile platforms due to their large memory requirements or network sizes. In recent years, a significant trend in academia has been the development of lightweight networks like MobileNet V1/V2 and techniques like knowledge distillation. This has led to many networks that achieve results comparable to heavy networks, but with much smaller model sizes and far fewer computations. This trend encouraged us to revisit the potential of performing on-device face recognition on lightweight devices in 2021.
Although there are aspects to face recognition that are similar on DTVs and smartphones, four key considerations set them apart:
With these considerations in mind, we have a rough idea of the ML models we are looking for. The algorithms for each stage of face recognition need to perform well on DTVs. TensorFlow's benchmark tool puts us in a position to better understand the trade-off between performance and accuracy.
For face detection, there are multiple ways to obtain fast and accurate results. For example, the BlazeFace model proposed by Google Research is a small network using SSD-like (Single Shot MultiBox Detector) anchors, optimized for GPUs on mobile devices, leading to impressive performance on lightweight devices. For face embedding, MobileFaceNet, proposed in 2018, achieves accuracy comparable to heavy networks with a model size of only 4.0MB.
Before diving into the pipeline design, one defining requirement is the size of the smallest face to be detected from the input image. The face box size can be approximated from the average human face size and appropriate margins. We then calculate corresponding face projection sizes on the image sensor by using the relationships between the user's distance from the camera and the focal length, as shown in the diagram below. Finally, the pixel size of faces can be determined with other camera parameters, including sensor size and supported resolutions. The following diagrams demonstrate the principles of the calculations here.
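As a rough sketch of these calculations, the pinhole-camera model below projects an assumed average face width onto the image sensor and converts the result to pixels. All the numeric camera parameters in the example are illustrative assumptions, not measured values from a particular device.

```python
def face_pixel_width(face_width_m, distance_m, focal_length_mm,
                     sensor_width_mm, image_width_px):
    """Pinhole-camera projection: how many pixels wide a face appears.

    The face's projection on the sensor scales with focal_length / distance;
    dividing by the sensor width and multiplying by the horizontal
    resolution converts millimetres on the sensor into pixels.
    """
    projection_mm = focal_length_mm * (face_width_m * 1000.0) / (distance_m * 1000.0)
    return projection_mm / sensor_width_mm * image_width_px

# Illustrative numbers: a 0.16 m-wide face, 4 m from the camera,
# 3.6 mm lens, 4.8 mm-wide sensor, 1920 px horizontal resolution.
px = face_pixel_width(0.16, 4.0, 3.6, 4.8, 1920)  # ≈ 58 px
```

Running the numbers the other way tells you the resolution needed to keep the smallest face above your detector's minimum input size.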
The pipeline design focuses on supporting the smallest faces to be detected and removing redundant face embedding processes by actively tracking the movement of the faces. For example, using a combination of a full-range BlazeFace face detector for spotting new faces and a short-range BlazeFace face detector for tracking existing ones in the local area is a way to achieve performance improvements. It takes advantage of the faster detection for large faces in the short-range model to effectively reduce the computational cost. Face tracking also makes the recognition results more robust when the faces are partially covered in some frames, which are not suitable for face embeddings.
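The scheduling logic behind this dual-detector idea can be sketched as below: run the cheap short-range detector on already-tracked faces every frame, and pay for a full-range scan only periodically (or when nothing is tracked) to spot new faces. The function name and the scan interval are assumptions for illustration, not part of any released API.

```python
def plan_detection(frame_idx, tracked_boxes, full_scan_interval=10):
    """Decide which detector(s) to run on this frame.

    tracked_boxes: face boxes carried over from the tracker.
    full_scan_interval: how often to run a full-frame scan (an
    assumption; tune it per device and frame rate).
    """
    # Cheap short-range detection around each tracked face, every frame.
    tasks = [("short_range", box) for box in tracked_boxes]
    # Periodic (or initial) full-range scan to pick up new faces.
    if frame_idx % full_scan_interval == 0 or not tracked_boxes:
        tasks.append(("full_range", None))
    return tasks
```

Because most frames then touch only small crops around known faces, the average per-frame cost drops well below that of running the full-range detector everywhere.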
In designing the pipeline, we use Google’s benchmark tool for TF Lite models to understand the inference time of different detector models. Arm's Streamline Performance Analyzer is a helpful tool to measure the breakdown of the actual time taken for each part of the pipeline using annotations in your code. This helps to verify the design and analyze the bottlenecks of the pipeline. Florent Lebeau's blog is useful in helping to understand the Streamline functionality for ML applications in more detail.
Before face embedding, we align the faces based on the face landmarks. This provides stable and robust inputs for the embedding network and improves the recognition accuracy significantly. A common approach is to use similarity transformation to fix the location of eyes to the pre-defined positions.
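Since two point correspondences determine a 2D similarity transform (rotation, uniform scale, and translation) exactly, the eye-based alignment can be sketched with complex-number arithmetic as below. The template eye coordinates for a 112x112 crop are illustrative assumptions; use whatever template your embedding network was trained on.

```python
def eye_alignment_transform(src_eyes, dst_eyes):
    """Similarity transform mapping detected eye centres to fixed template
    positions. Treating (x, y) points as complex numbers, the map is
    z' = a*z + b, where a encodes rotation + scale and b the translation.
    """
    s0, s1 = (complex(*p) for p in src_eyes)
    d0, d1 = (complex(*p) for p in dst_eyes)
    a = (d1 - d0) / (s1 - s0)   # rotation + uniform scale
    b = d0 - a * s0             # translation
    def apply(pt):
        z = a * complex(*pt) + b
        return (z.real, z.imag)
    return apply

# Map eyes detected at (40, 60) and (80, 60) onto assumed template
# positions in a 112x112 aligned crop.
warp = eye_alignment_transform([(40, 60), (80, 60)],
                               [(38.3, 51.7), (73.6, 51.7)])
```

Applying the same transform to the whole crop (for example via a warp on the GPU) yields the stabilized input that the embedding network expects.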
For the face record management system, we separate the operations on the user dataset from the face embedding process to provide a better level of security. The binary face data recorded by the application is stored within the app, which is encrypted by Android on Android 10 and above. The similarity of faces is calculated from the Euclidean distance or the cosine distance of the embedding vectors. The appropriate threshold is then worked out from the receiver operating characteristic (ROC) curve during testing.
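The distance computations themselves are straightforward; a minimal NumPy sketch is below. The 0.4 cosine-distance threshold is a placeholder assumption — in practice it comes from the ROC analysis on your own test data.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def is_same_person(embedding, enrolled, threshold=0.4):
    """threshold is an illustrative assumption; derive yours from the ROC curve."""
    return cosine_distance(embedding, enrolled) < threshold
```

With L2-normalized embeddings the two metrics are monotonically related, so either can be used as long as the threshold is calibrated for the one you pick.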
There are several key considerations in achieving a smooth recognition process. Firstly, to avoid an uneven distribution of workload, we implement a queue that holds the face boxes from the detection stage awaiting recognition, and apply flow control. Secondly, depending on the tracking algorithm used, two face boxes might overlap or merge if one face passes behind another. In this case, it is important to discard the previous identification results and send a new request for recognition.
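Both points can be sketched together: a bounded queue evens out the embedding workload, and an intersection-over-union (IoU) check spots when a newly submitted box overlaps a pending one, in which case the stale entry is replaced so the face is re-identified. The class name, capacity, and merge threshold are assumptions for illustration.

```python
from collections import deque

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

class RecognitionQueue:
    """Bounded queue smoothing the embedding workload; drops the oldest
    pending face when full (capacity is an assumption to tune)."""
    def __init__(self, capacity=4):
        self.pending = deque(maxlen=capacity)

    def submit(self, box, merge_iou=0.5):
        # If this box overlaps/merges with a queued one, the earlier
        # identity is stale: drop it and queue a fresh recognition request.
        self.pending = deque((b for b in self.pending if iou(b, box) < merge_iou),
                             maxlen=self.pending.maxlen)
        self.pending.append(box)
```

The embedding worker then drains `pending` at its own pace, so detection bursts never stall the render loop.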
Once the application is ready, its performance can be measured again through the Streamline Performance Analyzer with different setups, such as the number of people on screen, or different backends. The Streamline Performance Analyzer can provide you with detailed hardware counters for the different units in the chosen system. This helps to profile the performance characteristics for the application on a given device like a DTV development board.
The major limitation of the application lies in its inherent sensitivity to poor lighting conditions due to its RGB camera-only nature. For sensitive use cases like account log-in, face recognition can be combined with other biometric verification approaches, like voice recognition, to provide multimodal biometric verification. As an extension, anti-spoofing algorithms and liveness detection can be useful tools to defend against fake-face attacks. As developers, it is also important to be aware of bias inherited from the training dataset if certain groups are underrepresented.
Another limitation comes from the image resolution requirement for faces at a distance and the image quality requirement. Up to 4K resolution may be needed for wide-angle cameras. High-resolution images impose a significant challenge to a DTV's computational power. As the distance increases, the image quality also deteriorates. This can be noticed from more blurred edges and noise points, which effectively alter the embedding results. Image augmentation with filters has the potential to address this issue.
Finally, it is worth reflecting on security. Arm is at the forefront of meeting the ever-increasing need for secure and private solutions. We are working hard to build a deep understanding of key security- and privacy-sensitive use cases, and to develop and deploy solutions that provide security protection without compromising performance. On-device face recognition is a fantastic feature, but it means more personal data being stored across different devices and systems. The security of current systems is very good, but even better protection will be provided in the future through Arm Realm technology, which protects data and code in use. This ensures that personal information is safeguarded end-to-end across different compute platforms.
Hopefully this blog provides enough ideas to get people inspired, along with a greater understanding of the challenges and key considerations in building an on-device face recognition application for DTVs. As cameras gradually become ubiquitous in the living room, we also hope this use case provides some guidance on the requirements for developing DTV camera applications. We are already seeing promising results from the application. As the computational units on DTVs continue to evolve, we expect to see even lower latency and improved accuracy with higher-resolution, wide-angle cameras in the near future.
As a side note, if you are a student software developer who, like me, is keen to get hands-on with projects that solve real-world challenges, keep it up. I am more than excited to hear your story of building deep learning algorithms on Arm platforms to solve everyday problems. Do not miss the resources and blogs available under the AI and ML section on developer.arm.com.
[CTAToken URL = "https://developer.arm.com/ip-products/processors/machine-learning" target="_blank" text="Learn more about ML" class ="green"]