Augmented Reality (AR) mobile applications are very popular these days. You can find nice-looking filters in messaging or photo and video hosting apps like Instagram or Snapchat. These filters allow you to alter the real-life image from your front or back camera and add some eye-catching and ‘interesting’ effects. Some of them are simple and require just post-processing. But more complex filters, like beautification or adding a funny beard and mustache to someone’s face, rely on feature recognition in the camera image. That means the algorithm must be aware of the position of the user’s nose and eyes, the rotation of the head, and the distance from the camera.
Implementing such a filter can be a challenging task, especially when you are working with mobile devices and want to achieve real-time performance. Therefore, we built a filter app to see first-hand the challenges that developers can face.
The app allows you to replace the background and attach a virtual object to a human face, which makes it possible to put the user into several virtual environments or scenes. Here is what we found.
Screenshots of the app
To get the necessary data from the camera video stream and allow the program to see the world in the way we humans do, machine learning (ML) is used. There are different types of neural network models trained on different datasets, and each of them serves a particular purpose. For an AR filter app, the most important ones are key point detection and tracking, segmentation, and style transfer.
With the recognition and tracking of key points, it is possible to attach an object to the face or hand, control an avatar or assign a certain behavior to a gesture.
Segmentation provides a pixel mask, highlighting important areas. We can then use these masks to replace or blur the background, or to change the shade of someone’s hair.
Style transfer takes some stylistic features from a reference image and applies them to the source image. This can make a person look like an anime character or the entire picture look like it was painted by a well-known artist.
An example of style transfer using Prisma Photo Editor. The source image is on the left, and the result is on the right.
When designing or choosing an existing model, quality and performance should be considered. This choice may depend on whether you want to process a video stream in real time or apply a filter and save or send the picture.
Using on-device inference is usually preferred over sending data to the server and receiving the processing results. This is because it does not require an internet connection or a costly server and is less likely to expose any user data.
We used two models from MediaPipe for face and landmark detection. MediaPipe is a high-level framework that provides trained models, a flexible graph system for designing AR pipelines, and a few ready-to-use pipelines. It uses TensorFlow for inference and supports multiple platforms. This is useful not only for delivering the app, but also during the development process, allowing you to test and debug on the development machine.
For rendering, we used Unity. Unity also provides a new engine for executing neural networks called Barracuda. It can be easily integrated into a project and has a simple interface, which allows you to load a network, configure the runtime, and run inference. Barracuda is still in the development stage.
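As a rough illustration, here is a minimal sketch of what running a model with Barracuda can look like in C#. The class name, the model asset field, and the three-channel input are assumptions made for this example; the Barracuda calls themselves come from its public API.

    using Unity.Barracuda;
    using UnityEngine;

    public class NetworkRunner : MonoBehaviour
    {
        // The .onnx model imported into the project as an NNModel asset (assigned in the Inspector).
        public NNModel modelAsset;

        private IWorker worker;

        void Start()
        {
            var model = ModelLoader.Load(modelAsset);
            // A GPU compute worker; Barracuda also offers CPU back ends.
            worker = WorkerFactory.CreateWorker(WorkerFactory.Type.ComputePrecompiled, model);
        }

        public Tensor Run(Texture2D frame)
        {
            // Convert the camera frame into a 3-channel input tensor and run inference.
            using (var input = new Tensor(frame, 3))
            {
                worker.Execute(input);
                // The returned tensor is owned by the worker; copy it if you need to keep it.
                return worker.PeekOutput();
            }
        }

        void OnDestroy()
        {
            worker?.Dispose();
        }
    }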
Another way to add ML functionality to a Unity project is to write a custom C++ plug-in and use a neural network inference library like TensorFlow Lite or the Android NNAPI. For our project, we used our own technology – Arm NN.
Arm NN is a neural network inference engine that relies on the lower-level Arm Compute Library (ACL). Both technologies help to utilize Arm GPUs and CPUs in the most efficient way, and this optimized interconnect is part of the Arm Total Compute approach.
In our app, several neural networks are used together.
The input image from the camera is cropped to its central square region and then downscaled.
The segmentation network produces a mask image which is used for background replacement.
The landmark detection network finds 3D coordinates of the key points of a human face (nose, eyes, cheeks). It only works well when the face occupies most of the frame and is vertically aligned. To meet these conditions, we run a face detection model first and then crop the region of the frame that contains the face.
You can see a schematic diagram of the whole pipeline, which is applied to the input image of each frame.
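To make the flow more concrete, here is a simplified sketch of how such a per-frame pipeline can be wired together. All method and field names (CropCenterSquare, Downscale, segmentation, faceDetector, landmarkModel, Crop, Compose) and the 256x256 input resolution are hypothetical placeholders standing in for the corresponding steps, not calls from a real library.

    // Hypothetical per-frame pipeline mirroring the diagram above.
    Texture2D ProcessFrame(Texture2D cameraFrame)
    {
        // 1. Crop the central square region and downscale it to the network input resolution.
        Texture2D cropped = CropCenterSquare(cameraFrame);
        Texture2D input = Downscale(cropped, 256, 256);

        // 2. Segmentation produces a per-pixel mask used to replace the background.
        float[] mask = segmentation.Run(input);

        // 3. Face detection gives a rough bounding box of the face...
        Rect faceBox = faceDetector.Run(input);

        // 4. ...which is used to crop the frame so the landmark model sees a well-framed, upright face.
        Texture2D faceCrop = Crop(input, faceBox);
        Vector3[] landmarks = landmarkModel.Run(faceCrop);

        // 5. Composite the result: swap the background using the mask and render
        //    virtual objects positioned by the landmarks.
        return Compose(cropped, mask, landmarks);
    }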
Once the desired frame features are recognized, some additional work needs to be done to get a nice-looking result.
Virtual objects like glasses, rabbit ears or hats can be rendered on top of the face, if we know the face landmark coordinates. It can be a simple 2D sprite with transparency, or a 3D model.
In the case of 3D rendering, we can use either an orthographic or a perspective projection. An orthographic projection is easier to implement, as we only need the X and Y coordinates and the scale of the object to position it correctly. With a perspective projection, we can achieve a better-looking result, but it requires some additional effort to calculate the proper translation, rotation, and scale for 3D objects.
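For example, with an orthographic-style set-up, placing a prop can be as simple as mapping the landmark into the camera's viewport, as in the small helper below (assumed to live inside a MonoBehaviour). It assumes the landmark arrives as normalized image coordinates in [0, 1] with the origin in the top-left corner, which is why the Y coordinate is flipped; adjust this to your model's convention.

    // Place a virtual object at a face landmark using a simple orthographic-style mapping.
    // 'landmark' is assumed to be normalized image coordinates in [0, 1], origin at the top-left.
    void PlaceProp(Transform prop, Vector2 landmark, Camera cam, float distanceFromCamera, float scale)
    {
        Vector3 viewportPoint = new Vector3(landmark.x, 1.0f - landmark.y, distanceFromCamera);
        prop.position = cam.ViewportToWorldPoint(viewportPoint);
        prop.localScale = Vector3.one * scale;
    }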
Also, rendering 3D objects requires taking occlusion into account. Imagine a cap rendered on top of the head. The back side should not be visible because in real life it would be occluded. The front side, on the other hand, should be visible, so it is not possible to simply use the segmentation mask as a depth buffer. To solve this problem, an invisible occluder can be used. This is an object shaped like a human head (a sphere, a capsule, or something more accurate) that is rendered only to the depth buffer.
Mesh for occlusion
Finding or creating assets that work well for an AR filter can be difficult. It is important to make sure that they not only look nice but also fit the shape of a human head or face.
Positioning the objects relative to the head can require some tweaking. A proxy object or a mannequin can be useful to set up the transformations in the default neutral pose.
Even with a good model and a high-quality input picture, you can expect the output values to differ from frame to frame even when the actual images are very similar. This problem is most relevant for models that do not take the previous frame into consideration and process each frame completely independently. The coordinates we were getting from the face landmark detection model were not very stable, causing the attached 3D model to wobble. This can be solved by averaging the results from multiple frames: the values from the current and a few previous frames are combined using a weighted sum. It is also important to take the velocity of change into account. If the user moves their head fast, we recognize it and discard the values from the previous frames, allowing the 3D model to move to the new position immediately. The snippet below shows the interpolation function we used.
    // State kept between calls (class members assumed by this method):
    //   previousValues: Queue<Entry> of the last few (distance, duration) pairs;
    //   lastValue, lastTimeStamp, lastResult: values from the previous call;
    //   velocityScale, maxWindowSize: tuning constants.
    float Interpolate(float timeStamp, float value)
    {
        float distance = value - lastValue;
        float duration = timeStamp - lastTimeStamp;

        // Find the sum of the values for multiple frames to estimate the average velocity.
        float totalDistance = distance;
        float totalDuration = duration;
        foreach (var entry in previousValues)
        {
            totalDistance += entry.distance;
            totalDuration += entry.duration;
        }
        float velocity = totalDistance / totalDuration;

        // velocityScale is a constant, which regulates how much the velocity affects the interpolation.
        float alpha = 1.0f - 1.0f / (1.0f + velocityScale * Mathf.Abs(velocity));

        previousValues.Enqueue(new Entry { distance = distance, duration = duration });
        // We only need to keep track of a few previous frames.
        if (previousValues.Count > maxWindowSize)
        {
            previousValues.Dequeue();
        }

        float result = value * alpha + lastResult * (1.0f - alpha);

        // Save these values for the next frame.
        lastResult = result;
        lastValue = value;
        lastTimeStamp = timeStamp;

        return result;
    }
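Each smoothed scalar needs its own history, so in practice it is convenient to wrap this function and its state into a small class, with one instance per coordinate. The Interpolator class below is a hypothetical wrapper around the method above, shown only as a usage example.

    // One smoother per coordinate of each landmark, so every value keeps its own history.
    Interpolator[] xSmoothers, ySmoothers, zSmoothers;

    Vector3 SmoothLandmark(int index, Vector3 landmark, float timeStamp)
    {
        return new Vector3(
            xSmoothers[index].Interpolate(timeStamp, landmark.x),
            ySmoothers[index].Interpolate(timeStamp, landmark.y),
            zSmoothers[index].Interpolate(timeStamp, landmark.z));
    }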
Any face-specific parts of the frame (like an object attached to the face) must be rendered only if there is a face in the frame and we are confident about its features. If the neural network provides a confidence value, we can set a threshold and skip parts of the pipeline when the value falls below it. Make sure the threshold is not too high, otherwise the effects may disappear in some frames, which does not look very good.
When using the output from the segmentation model as a mask, we can treat it as a binary (“hard”) mask or as a “soft” mask. Let us say there are two classes of pixels, and each pixel is assigned a probability p, which tells how likely it is that the pixel belongs to one of the classes. This can be used, for example, to replace the original background with a picture:
output = frame_color × p + background_picture_color × (1 – p)
The mask can be left as it is. In this case, we have a “soft” mask, which will blur the edges of the detected object slightly and make the transition between the object and the background less noticeable. We can also sharpen the mask by making low values even lower. For example, a simple square operation can be used:
p' = p²
The hard mask is similar to the Unity cutout shader in the sense that the image will not have partially transparent areas: everything is either fully opaque or fully transparent. This is why, to get a “hard” mask, we need to binarize it with a certain threshold a:
p' = p if p > a else 0
Which of these approaches is better to use? It is a matter of experimenting to find a good result. If the neural network gives a stable and precise output, then a hard mask may look more natural. Otherwise, a soft mask can hide some imperfections.
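As an illustration, background replacement with either a soft or a hard mask boils down to a simple per-pixel blend like the one below. The function signature and parameter names are our own example; p holds the probabilities from the segmentation output, and Color is UnityEngine.Color.

    // Blend the camera frame with a background picture using the segmentation probabilities.
    // If hardMask is true, the probabilities are binarized with the given threshold first;
    // otherwise they can optionally be sharpened by squaring.
    Color[] ReplaceBackground(Color[] frame, Color[] background, float[] p,
                              bool hardMask, float threshold, bool sharpen)
    {
        var result = new Color[frame.Length];
        for (int i = 0; i < frame.Length; i++)
        {
            float weight = p[i];
            if (hardMask)
            {
                weight = weight > threshold ? weight : 0.0f;   // "hard" mask: binarize
            }
            else if (sharpen)
            {
                weight = weight * weight;                      // "soft" mask: push low values lower
            }
            // output = frame_color * p + background_picture_color * (1 - p)
            result[i] = frame[i] * weight + background[i] * (1.0f - weight);
        }
        return result;
    }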
The lighting used for rendering the objects can have a huge impact on the overall perception. If the light in the scene is too different from that of the original camera frame, the result looks unnatural. A flexible approach is to add multiple light sources with different directions or to set a high level of ambient light. In this case, there are no noticeable dark regions, which keeps the scene neutral, while the shading still shows that the objects are not flat.
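In Unity, a neutral set-up like this can be as simple as raising the flat ambient light and adding a couple of directional lights, as in the sketch below. The light names, angles, colors, and intensities are just illustrative values, not the settings we shipped.

    using UnityEngine;
    using UnityEngine.Rendering;

    void SetUpNeutralLighting()
    {
        // Flat ambient light keeps shadowed areas from going too dark.
        RenderSettings.ambientMode = AmbientMode.Flat;
        RenderSettings.ambientLight = new Color(0.6f, 0.6f, 0.6f);

        // Two directional lights from different directions so the shading still reads as 3D.
        CreateDirectionalLight("KeyLight", new Vector3(50.0f, -30.0f, 0.0f), 0.8f);
        CreateDirectionalLight("FillLight", new Vector3(20.0f, 140.0f, 0.0f), 0.4f);
    }

    void CreateDirectionalLight(string name, Vector3 eulerAngles, float intensity)
    {
        var go = new GameObject(name);
        var lightComponent = go.AddComponent<Light>();
        lightComponent.type = LightType.Directional;
        lightComponent.intensity = intensity;
        go.transform.rotation = Quaternion.Euler(eulerAngles);
    }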
Another way to improve the perception or create a certain atmosphere is to use color filters during post-processing.
The easiest way to find what causes a problem in a camera-based application is to dump failure frames. They can be used to reproduce the bug and check each step of the pipeline. After a fix is found, you can verify it with the same inputs and under the same conditions.
The dump may contain the input image, as well as the same image after some initial pre-processing and raw outputs from the neural networks or intermediate calculation results.
Dumping a frame is also useful while tweaking settings, to compare the results before and after and see if there is an improvement. If the inference engine does not support the platform of the development machine (as in our case), you can also save the outputs from the neural networks on the device for a certain frame. You can then use these values to emulate the inference during development and debugging without installing the application each time you make a change.
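A minimal way to dump a frame on the device is to write the pre-processed image and the raw network outputs to the application's persistent data folder, from where they can be pulled and replayed on the development machine. The file names and the flat float-array output format here are assumptions made for the example.

    using System.IO;
    using UnityEngine;

    // Save the pre-processed camera frame and the raw network output for later inspection or replay.
    void DumpFrame(Texture2D frame, float[] networkOutput, int frameIndex)
    {
        string folder = Path.Combine(Application.persistentDataPath, "dumps");
        Directory.CreateDirectory(folder);

        // The frame as a PNG, exactly as it was fed to the network.
        File.WriteAllBytes(Path.Combine(folder, $"frame_{frameIndex}.png"), frame.EncodeToPNG());

        // The raw output as text, so it can be loaded back to emulate inference on the desktop.
        File.WriteAllLines(Path.Combine(folder, $"output_{frameIndex}.txt"),
                           System.Array.ConvertAll(networkOutput, v => v.ToString("R")));
    }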
Building a mobile app with AR functionality, or adding AR features to an existing app, may seem tricky at first. And in some ways it is. But with the right frameworks and tools, you can save a lot of time. The tips described above can help you achieve better visual quality.
When you need to focus on good-looking assets, on assembling the entire pipeline, or simply on tweaking parameters to achieve more realism, it is better to save time on things like graphics, working with the camera, and ML inference by using existing solutions. Unity is a good option in this case, especially considering its upcoming Barracuda technology, which is still in the development stage but already showing good results.
If your goal is to get maximum performance on Arm devices, check out Arm NN – the technology we used in our project.
[CTAToken URL = "https://developer.arm.com/ip-products/processors/machine-learning/arm-nn" target="_blank" text="Learn more about Arm NN" class ="green"]