The COVID-19 pandemic has forced people to spend more time indoors. Together with recent advances in Smart TVs, this creates both the need and the opportunity for new experiences in people’s living rooms. Many people have changed their lifestyles and now work out at home. With the right technology, this can be a very convenient and effective approach, and remote workouts are likely to remain an attractive option for many consumers going forward. Cameras are a key part of this technology, and their return to the digital television (DTV) market creates opportunities. Given the high resolution and large screen size of modern TVs, these user experiences have the potential to be very immersive while also bringing health benefits.
In this blog, we discuss how to enable people to exercise in the comfort of their homes using an application on their large screen Smart TV as a guide. We start with an overview of the application, followed by the deep learning approaches for body pose estimation and tracking suitable for Android TVs. We discuss the main considerations in choosing the right Convolutional Neural Network (CNN) models and what their capabilities and limitations are.
Let us start with a brief description of the application. Our goal is to estimate and continuously track the body position of the user and compare it against the reference body position of a fitness instructor. We want to locate each body joint of the student in the camera stream and of the teacher in a pre-recorded video. The application should then provide feedback to the user on how similar their position is to the reference. This can be a simple score value or perhaps a more user-friendly approach that highlights the parts of the body that need to be corrected. The scoring function should take into account that people have different body shapes. It can use the probability score for each joint to calculate a weighted distance, or completely discard low-confidence estimates. For more information, read this blog.
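To make the scoring idea concrete, here is a minimal sketch in Python. It is not the scoring function used in our application. The function names, the `min_conf` threshold, the `max_dist` scale, and the mapping to a 0-100 score are all assumptions chosen for illustration. Each pose is assumed to be a list of `(x, y, confidence)` tuples with joints in the same order.

```python
import math

def _normalize(points):
    """Center points on their centroid and scale them to unit RMS radius."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    scale = math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)
    if scale == 0.0:
        return None  # all joints coincide, so the pose has no measurable size
    return [((x - cx) / scale, (y - cy) / scale) for x, y in points]

def pose_score(student, teacher, min_conf=0.3, max_dist=1.0):
    """Return a 0-100 similarity score for two poses of (x, y, confidence) joints.

    min_conf and max_dist are illustrative assumptions, not tuned values.
    """
    # Discard joints that either model output is unsure about.
    shared = [i for i, (s, t) in enumerate(zip(student, teacher))
              if s[2] >= min_conf and t[2] >= min_conf]
    if len(shared) < 2:
        return 0.0
    # Remove position and overall size so that people of different heights,
    # or at different distances from the camera, can still be compared.
    s_pts = _normalize([student[i][:2] for i in shared])
    t_pts = _normalize([teacher[i][:2] for i in shared])
    if s_pts is None or t_pts is None:
        return 0.0
    # Weight each joint distance by the lower of the two confidences.
    weights = [min(student[i][2], teacher[i][2]) for i in shared]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    dist = sum(w * math.dist(p, q)
               for w, p, q in zip(weights, s_pts, t_pts)) / total
    return 100.0 * max(0.0, 1.0 - dist / max_dist)
```

Note that this normalization only removes differences in position and overall size. Differences in body proportions, such as relative limb lengths, would need something more, for example comparing joint angles instead of joint positions.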
Beyond the challenge of quantifying the differences between two body poses, we have to work out how to correctly identify the corresponding frames from the two video streams. Inevitably, there is some latency between the instructor and the student performing the exercise. To handle that latency and to account for differences in exercise pace, we have to perform a search to determine exactly which pair of images has to be compared.
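One simple way to perform such a search is to compare the student's current pose against a window of instructor frames around the current playback position and keep the best match. The sketch below assumes this windowed approach. The function names, the `look_back` and `look_ahead` window sizes, and the plain mean-distance similarity are hypothetical and only serve to illustrate the idea.

```python
import math

def similarity(pose_a, pose_b):
    """Negative mean joint distance between two poses of (x, y) points.

    Higher values mean the poses are more alike. A real application would
    plug in its own scoring function here.
    """
    return -sum(math.dist(p, q) for p, q in zip(pose_a, pose_b)) / len(pose_a)

def best_matching_frame(student_pose, teacher_poses, current_index,
                        look_back=15, look_ahead=5):
    """Find the instructor frame near current_index that best matches the student.

    look_back covers the student lagging behind the instructor, and
    look_ahead covers a student who exercises at a faster pace.
    Returns (frame_index, similarity), or (None, -inf) if there are no frames.
    """
    start = max(0, current_index - look_back)
    stop = min(len(teacher_poses), current_index + look_ahead + 1)
    best_index, best_score = None, -math.inf
    for i in range(start, stop):
        score = similarity(student_pose, teacher_poses[i])
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score
```

The window sizes trade robustness against cost and ambiguity: a wider window tolerates more lag, but it costs more comparisons per frame and makes it more likely that a repeated movement is matched to the wrong repetition.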
As you can see in the following image, our application visualizes both video streams overlaid with skeleton data and a score shown as a percentage.
Figure 1: Our Fitness application using BlazePose model
In this blog, we are focusing on Smart TVs as we see great potential in this space. However, a lot of the software models and tools we discuss here also apply to other consumer devices that you might be targeting.
It has been almost a decade since Microsoft developed a Random Forest model for body pose estimation using the Kinect sensor. This highlighted the importance of this task for many end-user applications. Since then, there has been a steady stream of Machine Learning (ML) research focusing on 2D and 3D human body pose estimation. In recent years, deep learning approaches have shown great potential and now lead the field.
There are many things to consider when using deep learning to estimate and track body position. A solution for Smart TVs, mobile phones, and home devices in general has to be very performant, so choosing a suitable CNN model can be a challenging task. The lightweight models for pose estimation usually take RGB images from the camera as input and output the 2D or 3D locations of key points of the body. This can be performed with a single end-to-end model, or split between two models where the first one detects a person and the second one locates the joints or landmarks. An example of the single-model approach is PoseNet (based on MobileNetV1 or ResNet50), and an example of the two-model approach is BlazePose (MobileNetV2-like with customized blocks).
Figure 2: Results from BlazePose model, red colour denotes the detection box
The most important considerations are accuracy and performance. To better understand the accuracy, we need to look at the training datasets and error metrics, but often that will not give us a complete picture. Commonly, the datasets are manually annotated, which might introduce large errors due to self-occlusions and low resolution. Even if we use an existing CNN model, we should consider creating our own small dataset to evaluate on. This gives a better understanding of how it performs on our particular use-case.
In addition, we have to constantly evaluate the performance. This is the ultimate battle: accuracy versus performance. A good starting point for understanding the performance of different models and inference engines is the benchmarking tool from TensorFlow.
For TFLite models, there are many options for running inference on Android devices, both in terms of Software (SW) and Hardware (HW), and it can quickly get confusing. On the SW side, there are NNAPI, TFLite CPU, and the GPU delegate. On the HW side, you have a choice between multiple computation units for inference, such as the CPU, GPU, and NPU. For Arm platforms, a good option is the ArmNN TFLite delegate, which provides a higher level of abstraction. Alternatively, you can directly target Arm Compute Library (ACL) and ArmNN, which give the user more control. In our case, we achieved the best performance on Mali GPUs with both models, but that might not be the case on your device.
Figure 3: ArmNN inference flow chart
The performance of the model is only part of the equation: there are also pre- and post-processing operations to consider. For example, in the case of PoseNet, the model takes a 257 x 257 RGB input image and outputs heatmaps and offset vectors. These then have to be processed to locate the final position of each joint in the original camera image. BlazePose’s landmark model, on the other hand, outputs x, y, z coordinates, as well as visibility and presence for each joint. While this sounds more straightforward, there are actually more processing stages involved. This is because the output coordinates of the landmark model have to be projected back to the original frame, which means reverting the pre-processing and post-processing stages related to the detection model.
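As an illustration of the PoseNet post-processing, the sketch below decodes one keypoint per heatmap channel. It rests on several assumptions that you should check against the model you actually use: an output stride of 32 (so a 257 x 257 input gives a 9 x 9 grid), heatmaps laid out as `[row][col][keypoint]` with raw logits, and offsets laid out as `[row][col][channel]` with the y offsets first and the x offsets second. The function names are hypothetical, and plain nested lists stand in for the output tensors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_keypoints(heatmaps, offsets, output_stride=32):
    """Decode (x, y, score) per keypoint in model-input pixel coordinates.

    Assumed layout: heatmaps[row][col][k] holds raw logits, and
    offsets[row][col][k] is the y offset while
    offsets[row][col][k + num_keypoints] is the x offset.
    """
    rows, cols = len(heatmaps), len(heatmaps[0])
    num_keypoints = len(heatmaps[0][0])
    keypoints = []
    for k in range(num_keypoints):
        # Pick the grid cell with the strongest response for this keypoint.
        r, c = max(((r, c) for r in range(rows) for c in range(cols)),
                   key=lambda rc: heatmaps[rc[0]][rc[1]][k])
        # Refine the coarse cell position with the predicted offset vector.
        y = r * output_stride + offsets[r][c][k]
        x = c * output_stride + offsets[r][c][k + num_keypoints]
        keypoints.append((x, y, sigmoid(heatmaps[r][c][k])))
    return keypoints

def to_frame(keypoints, frame_width, frame_height, input_size=257):
    """Scale keypoints from model-input pixels to camera-frame pixels.

    Assumes the frame was simply resized to input_size x input_size,
    with no cropping or letterboxing.
    """
    return [(x * frame_width / input_size, y * frame_height / input_size, s)
            for x, y, s in keypoints]
```

If the pre-processing crops or pads the camera frame instead of resizing it, `to_frame` has to undo that transform as well. That is the same kind of projection back to the original frame that the BlazePose pipeline requires.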
Figure 4: Examples with BlazePose model
These processing stages along with other features in your application also have to be very performant and as a developer you want to utilize your system as much as possible. You want to focus on optimizing the parts which provide you with the most significant performance increases. For that purpose, Arm’s Streamline Performance Analyzer can be just the right tool. It provides you with detailed hardware counters for the different units in your system. Then, if you add annotations in your code, you can see the exact impact of each of the software stages in your pipeline. Florent’s blog provides a great overview of Streamline functionality for ML applications.
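Before reaching for a full profiler, a rough per-stage timer can already show where the time goes. The sketch below is not Streamline's annotation API and gives no access to hardware counters. It is a hypothetical helper, `StageTimer`, that only measures wall-clock time for each named stage of the pipeline.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Accumulate wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)  # seconds spent in each stage
        self.counts = defaultdict(int)    # number of times each stage ran

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised an exception.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return (stage, mean milliseconds) pairs, slowest stage first."""
        means = {name: 1000.0 * self.totals[name] / self.counts[name]
                 for name in self.totals}
        return sorted(means.items(), key=lambda item: item[1], reverse=True)
```

In a frame loop, you would wrap each step in its own `with timer.stage("pre-processing"):` or `with timer.stage("inference"):` block and print `timer.report()` every few seconds. The stage at the top of the report is the first candidate for optimization.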
Since we have both a video stream and a camera stream in our application, we have to run inference on both. Luckily, the instructor’s video can be processed in advance. We can write the skeleton positions to a file offline and then, in real time, read from it, calculate the score, and simply draw the skeleton, which provides a valuable speed-up.
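A minimal sketch of this offline caching could look as follows. The file format (one JSON record per line), the field names `t` and `kp`, the function names, and the sample values are all assumptions made for illustration. Timestamps are assumed to be in milliseconds and written in increasing order.

```python
import bisect
import json
import os
import tempfile

def save_skeletons(path, frames):
    """Write (timestamp_ms, keypoints) pairs as one JSON record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for timestamp_ms, keypoints in frames:
            f.write(json.dumps({"t": timestamp_ms, "kp": keypoints}) + "\n")

def load_skeletons(path):
    """Load the cached skeletons as parallel lists of timestamps and poses."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return [r["t"] for r in records], [r["kp"] for r in records]

def pose_at(timestamps, poses, playback_ms):
    """Return the most recent cached pose at or before playback_ms."""
    i = bisect.bisect_right(timestamps, playback_ms) - 1
    return poses[max(i, 0)]

# Offline step: in practice these keypoints come from running the model on
# every frame of the instructor video. The values here are made-up samples.
frames = [(0, [[0.5, 0.2, 0.9]]),
          (40, [[0.5, 0.25, 0.8]]),
          (80, [[0.5, 0.3, 0.85]])]
path = os.path.join(tempfile.gettempdir(), "instructor_skeletons.jsonl")
save_skeletons(path, frames)

# Real-time step: load once, then look up the pose for each displayed frame.
timestamps, poses = load_skeletons(path)
current = pose_at(timestamps, poses, playback_ms=50)
```

With the instructor's poses cached this way, only the camera stream needs inference at run time, and the lookup per displayed frame is a binary search over the timestamps.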
As briefly mentioned before, the CNN models we use have to be lightweight due to the hardware limitations of Smart TVs and home devices, which impacts the overall accuracy. But perhaps even more importantly, these models rely solely on RGB input data. This constraint stems from the limited computational budget, but also from the lack of depth sensors in those products. A depth sensor increases the price of the device and also adds to the bandwidth and processing requirements. Of course, on a smaller device, the power budget and the available space would also be major considerations. Ultimately, though, body pose estimation is a use-case that can greatly benefit from depth information.
Due to their RGB-only nature, these models are particularly sensitive to changes in lighting conditions and to the colors of the background and outfits. The type of application also means we can expect limitations due to the camera angle, camera frame rate, and pace of the exercise. Additionally, some of the models have been trained on still images rather than videos and are particularly vulnerable to motion blur. Even the most robust ones struggle with a considerable amount of motion blur. You have to carefully review your camera capabilities, types of exercises, and post-processing of the images to handle such cases. Of course, there are also self-occlusions: not all parts of the body are visible at all times. One last thing to note is that many of the models were trained on datasets where the person is predominantly facing the camera, which is a difficult constraint for a fitness app.
Figure 5: Results with BlazePose with blur and occlusions
At this point, we hope we have shared enough information for the reader to understand the challenges in body pose estimation using deep learning. We have discussed some of the models, tools, and limitations. We have seen some excellent results from the BlazePose model and understood better what it takes to build a fitness application for Android Smart TVs. But this is only one part of the equation. Future implementations of such solutions for DTVs can be significantly improved with better HW compute capabilities. Similar to current high-end mobile devices, DTVs could also feature new interactive experiences and bring them to your living room screen. Adding depth sensors would further improve the accuracy and robustness of pose estimation and allow for 3D reconstruction and scene understanding for even more immersive experiences.
I am looking forward to hearing about your experiences implementing deep learning applications on Arm platforms. And do not forget to visit the ML section on developer.arm.com where you can find more help and useful guides.