Video conferencing is a ubiquitous tool for communication, especially for remote work and social interactions. However, it is not always a straightforward plug-and-play experience, as adjustments may be needed to get the audio and video setup right. Lighting is one factor that can be tricky to get right. A well-lit video feed looks presentable in a meeting, whereas poor lighting can come across as unprofessional and distracting to other participants. Improving the physical lighting is not always possible, especially in the dark winter months or in locations with an inadequate lighting setup. In this blog post, we cover how we built a demo mobile app that improves video lighting in low-light conditions. We will cover the neural network model powering the application, the app’s ML pipeline, performance optimizations and more.
We chose a neural network-based approach to improving the lighting of video, so the core of this work depended on finding a suitable neural network for the task at hand. There are excellent open-source models available, and finding a suitable candidate was key for this project. Below are the three main requirements we looked for when assessing models.
The target is real-time mobile inference, which means a tight performance budget of just 33ms per frame to achieve 30fps. This budget must cover the pre- and post-processing steps as well as running the neural network itself. Video enhancement quality is another important criterion: the model should intelligently enhance a dark image to recover detail, and it should be temporally consistent between video frames to prevent flickering.
The chosen model was from a 2021 research paper, Semantic-Guided Zero-Shot Learning for Low-Light Image/Video Enhancement. The low-light enhancement quality from this model was excellent when tested on a highly challenging dataset of mixed exposure and lighting conditions. Detail and structure that were not clear in dark images became visible. Another positive was the tiny model size of only ten thousand network parameters, which translated into fast inference speed. In terms of the model architecture, the input image tensors are downscaled and passed to a stack of convolutional layers. These layers predict a pixel-wise enhancement factor, which the model’s post-processing module then applies multiplicatively to the original image pixels to produce the enhanced result.
Model architecture and enhancement module visualization
As a bonus, a tiny dataset of two thousand synthetic images was used for training, suggesting even more scope for improvement with a larger dataset. Ground-truth images were augmented to generate a range of uniform exposure values, lightening and darkening the image. The model was trained in an unsupervised manner, without labels, and learned how to enhance low-light images purely from the loss function guiding the training. The loss was a combination of multiple separate losses responsible for various aspects of an image such as color, brightness and semantic information.
The app’s ML pipeline starts with the input camera frame, which is pre-processed to fit the model’s input tensor requirements, run through inference, and finally displayed to the user as enhanced output. Intercepting camera frames for inference is built-in functionality provided by the ImageAnalysis API in Android’s CameraX library.
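As an illustration, a minimal sketch of intercepting frames with CameraX’s ImageAnalysis use case might look as follows; the helper name here is an assumption rather than the app’s actual code:

```kotlin
import androidx.camera.core.ImageAnalysis
import androidx.camera.core.ImageProxy
import java.util.concurrent.Executors

// Hypothetical placeholder for the resize + inference + display path.
fun enhanceLowLightFrame(frame: ImageProxy) { /* pre-process, run model, render */ }

// Minimal sketch: build an ImageAnalysis use case that hands every camera frame
// to our analyzer. It is later bound to the camera lifecycle alongside the preview.
fun buildImageAnalysis(): ImageAnalysis {
    val analysisExecutor = Executors.newSingleThreadExecutor()
    val imageAnalysis = ImageAnalysis.Builder()
        // Always analyze the most recent frame and drop stale ones.
        .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
        .build()
    imageAnalysis.setAnalyzer(analysisExecutor) { frame: ImageProxy ->
        enhanceLowLightFrame(frame)
        frame.close() // must be closed so the next frame can be delivered
    }
    return imageAnalysis
}
```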
Two different ML inference engines were targeted: ONNX Runtime and TensorFlow Lite. Exporting the PyTorch model to ONNX is a built-in feature of the PyTorch library; exporting to TensorFlow Lite, however, was much more difficult. The most successful exporter for this model is called Nobuco, which works by creating a Keras model that can then be converted to TFLite.
The output format resulting from model inference depended on the ML runtime: ONNX Runtime produced NCHW (batch, channels, height, width), while TFLite uses NHWC, with channels last. This influenced how the post-processing step unpacks the output buffer into integer RGB values to create the final bitmap displayed on screen.
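As a rough illustration, unpacking a TFLite-style NHWC output into a bitmap could look like the sketch below, assuming a float output normalized to [0, 1]; the app’s actual buffer types and helper names may differ:

```kotlin
import android.graphics.Bitmap
import java.nio.FloatBuffer

// Sketch: convert an NHWC float output (values assumed in 0..1) into an ARGB_8888 bitmap.
fun nhwcToBitmap(output: FloatBuffer, width: Int, height: Int): Bitmap {
    val pixels = IntArray(width * height)
    for (i in 0 until width * height) {
        // Channels-last layout: R, G, B are adjacent for each pixel.
        val r = (output.get(i * 3) * 255f).toInt().coerceIn(0, 255)
        val g = (output.get(i * 3 + 1) * 255f).toInt().coerceIn(0, 255)
        val b = (output.get(i * 3 + 2) * 255f).toInt().coerceIn(0, 255)
        pixels[i] = (0xFF shl 24) or (r shl 16) or (g shl 8) or b
    }
    return Bitmap.createBitmap(pixels, width, height, Bitmap.Config.ARGB_8888)
}
// For an NCHW (ONNX-style) output, the same pixel's channels are separated by
// width * height elements, so the indexing changes accordingly.
```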
Before and after low light enhancement comparison
Converting RGBA bitmaps to RGB is computationally expensive in Kotlin; the conversion alone was taking tens of milliseconds against the tight 33ms budget. Making this faster meant using C++ with full compiler optimizations, and interfacing between Kotlin and C++ goes through the JNI (Java Native Interface). Crossing the JNI bridge with a float buffer of 3x512x512 elements was expensive, as two copies must be made: one into C++ and another when returning the results. The solution is to use a Java direct buffer. A traditional buffer has its memory allocated by the Android Runtime on the managed heap and is therefore not easily accessible from C++. A direct buffer must be allocated with the system's native byte order, but once that is done, its memory is allocated in a way that is easily accessible by the operating system and C++. We therefore avoid copies across the JNI boundary and take advantage of highly optimized C++ code.
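A simplified sketch of the Kotlin side of this direct-buffer approach is shown below; the native library name and function signature are illustrative assumptions rather than the app's actual code:

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

object NativeConverter {
    init { System.loadLibrary("imageconvert") } // assumed native library name

    // Implemented in C++. Because both buffers are direct, the native side can
    // obtain their addresses with GetDirectBufferAddress and work in place,
    // avoiding copies across the JNI bridge.
    external fun rgbaToRgbFloat(input: ByteBuffer, output: ByteBuffer, pixelCount: Int)
}

// Allocate the direct buffers once, in the platform's native byte order, and reuse them.
val rgbaInput: ByteBuffer = ByteBuffer
    .allocateDirect(512 * 512 * 4)       // RGBA, one byte per channel
    .order(ByteOrder.nativeOrder())

val rgbFloatOutput: ByteBuffer = ByteBuffer
    .allocateDirect(3 * 512 * 512 * 4)   // 3x512x512 float32 values
    .order(ByteOrder.nativeOrder())
```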
The model was optimized using a technique known as quantization. Quantization uses lower precision for a neural network’s weights and activations, giving an increase in inference speed at a slight cost to model quality. A smaller datatype such as INT8 represents the information at a quarter of the size of the traditional 32-bit floating-point (FP32) format. There are two strategies for quantizing models: dynamic and static. Dynamic quantization only quantizes the model weights ahead of time, with the quantization parameters for the activations decided at runtime. Static quantization is much faster for inference, as the weights and activations are both quantized beforehand using a representative dataset. For this model, static quantization increases inference speed, and the output lighting enhancement is slightly darker, which is a worthwhile trade-off.
Model inference time Pixel 7
The previous graph compares the model inference time for the low-light enhancement model using ONNX Runtime and TensorFlow Lite, at a resolution of 512x512 on a Pixel 7. We started with ONNX Runtime as the first inference engine. Running on CPU, the FP32 model inference time was 40ms, reduced to 32ms when quantized to INT8. We expected a larger improvement; however, analyzing the model file in the Netron visualization tool showed that extra quantization/dequantization operators had been added to the model graph, increasing computational overhead. TensorFlow Lite on CPU using XNNPack and an INT8 model was slower than ONNX Runtime, at just shy of 70ms. All of the prior combinations of inference engine and model type were surpassed by TensorFlow Lite with the GPU delegate. At just 11ms for inference on a 512x512 input image, we chose this as our backend for real-time lighting enhancement.
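For reference, switching TensorFlow Lite to the GPU delegate is a small change to the interpreter setup. The sketch below is illustrative, assuming the model has already been loaded into a MappedByteBuffer:

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import java.nio.MappedByteBuffer

// Sketch: create a TFLite interpreter that runs supported ops on the GPU delegate.
fun createGpuInterpreter(model: MappedByteBuffer): Interpreter {
    val options = Interpreter.Options().apply {
        addDelegate(GpuDelegate()) // ops the delegate cannot handle fall back to the CPU
    }
    return Interpreter(model, options)
}
// interpreter.run(inputBuffer, outputBuffer) then executes the model each frame.
```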
Benchmarking the demo app is not repeatable unless an ADB command is used to enable Android’s fixed performance mode. Different benchmarking runs may otherwise execute at different CPU frequencies, and the ADB command stabilizes the CPU frequency. Using this fixed performance mode, we saw frame times decrease. This posed a dilemma: the app developer has no control over the CPU frequency, and you cannot expect the end user to use ADB. However, there was a solution in the form of the Android performance hints API. Mainly used by games, it works by setting a target frame time and reporting the measured frame time back to Android, which then adjusts the clock speed to try to hit the target. This resulted in a good improvement in frame times, equivalent to the fixed performance mode.
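As a rough sketch, using the performance hints API (available from Android 12, API level 31) could look like the following; the 33ms target mirrors the per-frame budget discussed earlier, and the thread selection here is an assumption:

```kotlin
import android.content.Context
import android.os.PerformanceHintManager
import android.os.Process

// Sketch: create a hint session targeting ~33 ms of work per frame.
fun createHintSession(context: Context): PerformanceHintManager.Session? {
    val hintManager = context.getSystemService(PerformanceHintManager::class.java)
    val targetWorkDurationNanos = 33_000_000L // 33 ms frame budget
    return hintManager?.createHintSession(
        intArrayOf(Process.myTid()), // threads doing the per-frame work
        targetWorkDurationNanos
    )
}

// Each frame, report the measured duration so the OS can adjust CPU clocks:
// session?.reportActualWorkDuration(actualFrameTimeNanos)
```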
The app displays the frame timings when low-light enhancement is turned on. Even though our pipeline ran at around 37 FPS, the camera frame rate is capped depending on the hardware and lighting situation (in extreme low light, Android reduces the camera FPS to boost brightness). On a Pixel 7, the frame rate displayed to the user via the default Android camera API was a maximum of 30 FPS. Faster inference would therefore not improve the user experience, leaving a performance cushion of around 7 FPS.
Whilst the lighting enhancement of the original model for dark scenes was quite good, there were instances of the image appearing washed out. This was resolved by training on a larger dataset of 20k synthetic images, versus the 2k images used in the research paper.
When training on the larger dataset, training time slowed down. Investigating the performance hit revealed that the batch size of 8 exceeded the GPU’s VRAM capacity and spilled over into system memory. A technique to increase the effective batch size without increasing VRAM usage is gradient accumulation: rather than updating the weights after every batch, you accumulate gradients over multiple smaller batches and then apply a single update. In our case, the largest batch size that fit in VRAM was 6, and by accumulating gradients over 10 batches we reached an effective batch size of 60.
In this blog post, we have presented a working demo mobile app that improves the lighting of video in real time on mobile. Optimizing and running ML models on Arm was a smooth process, as techniques such as quantization and the choice of inference engine allowed the model to run within a tight performance budget of 33ms per frame.
Read Neural Network Model Quantization