The Rise of Depth on Mobile

Introduction

Images, along with video and image sensors, have been at the center of mobile hardware and software development, and of users’ interest, since the first phone with a built-in camera was released in 2000. Today, mobile images play a key role in our lives and we use them extensively every day. This has been possible thanks to the progressive improvement of mobile cameras. The mobile devices we carry in our pockets can easily take high-resolution pictures of never-before-seen quality and record HD video for hours.

Nevertheless, in the race for the best mobile picture and video, a key factor has been ignored. Mobile cameras translate the 3D world into a 2D image and this has been enough so far, but now we need the missing third dimension.

We are on the verge of a visual revolution. Ultimately, we want devices to see and understand the world as we do. This will change the way we use computing and the way we interact with computing devices. No longer will we launch apps using icons and our fingers. Apps in Mixed Reality (MR) will be launched just by looking at a given object, or by a gesture or verbal command. The way we interact with computing devices will be more natural and human, much like the way we interact with each other, based on verbal and visual cues. MR experiences, games and apps will be closely tied to the context we are in, and they will be recommended to us according to that context. For all this we need to know where we are, which objects surround us and where they are relative to us. We cannot achieve this without depth data.

Mobile SLAM technology, already present in millions of phones thanks to Apple ARKit and Google ARCore, can provide device position and orientation using only the camera feed and the readings from the accelerometer and gyroscope as input. When this information is combined with accurate depth data, we can build a virtual representation of the environment. This is a necessary step in MR to seamlessly integrate virtual objects with real objects and produce “realistic” interactions between them.

Our brain can understand depth thanks to our stereoscopic vision. We need devices to perceive depth as well and this is achieved by means of sensors. In mobile devices sensors are the bridge between the real world and its virtual representation. Unfortunately, while image sensors have experienced a remarkable development during the last few years, we can’t say the same about mobile depth sensors.

In this blog I will explain why depth data is so important for several relevant use cases. Then I will analyze in more detail why we need depth in AR/MR and how we use depth to build a virtual representation of the environment. Next, I will briefly explain the main types of mobile depth sensors, with their strengths and limitations. I will also discuss the alternatives to mobile depth sensors, and finally present my conclusions and the recommendations that follow from them.

I hope this blog will raise awareness of just how relevant depth data is for mobile AR, MR and the other important use cases. I expect the reader will gain an understanding of the current state of mobile depth sensors and of the challenges of integrating depth into the whole process of enabling mobile devices to see and understand the world as we do.

Where do we need depth info?

Depth is relevant for many use cases. Below I briefly list some of the areas where accurate mobile depth data will have the most impact.

AR/MR

We need depth data to produce a 3D reconstruction of the environment. This allows us to properly render virtual-to-real and real-to-virtual occlusions, collisions, shadows and the other features needed to make virtual objects rendered on top of the camera feed indistinguishable from real ones. Depth info also helps improve object recognition, enabling more specific virtual-to-real object interactions. I discuss this use case in more detail in the next section.

Navigation and tracking 

Depth data is crucial for mapping and navigating the environment and for obstacle avoidance during navigation. Autonomous robots are already used in planetary exploration and, here on Earth, in warehouses to manipulate and move loads. Drones are now used in many activities, from business to critical missions. The automotive industry uses depth sensors for intelligent parking assistance and collision detection, and autonomous vehicles will rely heavily on them. In all these cases, depth sensing is needed to allow the device to locate itself and to move and interact safely in the environment.

To learn more on SLAM please read some of the blogs posted on the Arm Community: Implementing mobile indoor navigation using SLAM, Mobile inside-out VR Tracking and How SLAM will transform five key areas.

Identity recognition 

Mobile technology has already started to use depth info for face recognition. Some of the latest smartphones use a front-facing depth camera to unlock the device. When combined with standard photo recognition, depth greatly improves recognition, making it more robust and more resistant to fraud.

Gesture recognition 

Gestures will be an important means of communication with the coming MR devices. Our gestures are complex, and depth info allows for easier and more robust gesture recognition. HoloLens, Magic Leap and other MR devices already recognize and process gestures together with speech for user interaction.

Image segmentation and object recognition 

Combining RGB images with depth helps improve the performance and robustness of segmentation, i.e. identifying the different meaningful blocks in an image. Deep learning-based segmentation algorithms are more power-efficient when they take depth data into account, which makes them suitable for mobile devices where energy efficiency matters. Depth also helps improve object recognition when it is based on segmentation that uses depth input.

Digital photography 

Adding 3D data to photos and video will open up new options when editing digital content, for example removing and replacing the background or segmenting a specific object. The Halide mobile app uses depth info from the iPhone X to apply clever effects to pictures. The Samsung A9 smartphone comes with a rear quad-camera array, one of which is a depth sensor for bokeh effects. Depth is set to become the fourth pixel data component, alongside RGB, in image processing solutions.

Fashion 

Apparel fit represents a multibillion-dollar problem for retailers. A significant part of their online sales is returned due to sizing issues. Depth sensors in smartphones could provide accurate sizing and custom fitting by means of 3D body shape scanning.

Product design and 3D printing 

The 3D printer market is estimated to grow substantially in the coming years. Mobile depth sensors would allow quick 3D scanning of objects and people. Artists could then design, print and manufacture personalized products at scale.

Depth in Augmented and Mixed Reality

Most current mobile AR experiences use the SLAM technology provided by Google ARCore and Apple ARKit to determine device position and orientation, using as input the camera feed and the readings from the Inertial Measurement Units (IMUs). This allows 3D virtual objects to be rendered on top of the camera feed as if they were really there.

This is only the beginning! We want to properly render virtual-to-real and real-to-virtual occlusions and produce realistic virtual-to-real collisions and shadows. Furthermore, we want to identify the objects surrounding us to interact more specifically with them in 3D. For all this, we need to get information about the topology of the environment, its structure and how far objects are from the device. Here is where depth comes in to play a key role.

One of the main limitations of the AR experiences shown at 2018 games and graphics events such as GDC and SIGGRAPH is that developers can’t make effective use of the real environment in their apps. The most you can do is detect planes (table, floor) and then render some virtual assets on top of them, or show some object flying around the room. The reason is that the apps have no knowledge of the surrounding geometry or its structure.

Look at the picture of the hotel room on the right and assume it is an AR game. You want to exploit the real-world geometry by adding a bridge between the left bed and the desk, and another between the two beds, so that the character can move around the room and interact with the scene. How can you do this if you have no information about the geometry of the main objects in the scene? If you could work out that there are three main geometry blocks, formed by the two beds and the desk, then you could add assets related to these objects: bridges between them, for example, to allow the character to move around.

Having a basic knowledge of the surrounding scene (being able to figure out the basic topology) gives enough info to make more effective use of the real objects in AR apps, i.e. start truly incorporating real objects into the AR experiences.

How do we obtain this information? If we know the camera position in the 3D world and we can obtain a depth map from the camera viewpoint then, for example, we can build a voxel representation of the environment and, if needed, a mesh representation of it. The figure below shows one possible way of achieving this.

A voxel representation approximates the surface of a volume using small cubes that represent occupancy in 3D space. The picture below (left) shows a voxel approximation of a winged statue in a synthetic scene. The image on the right zooms in on the arm and wings, where the voxel structure of the surface can be seen.

AR on mobile is based on SLAM solutions that provide an accurate camera pose in 3D. If, additionally, the phone has a depth sensor, it is possible to build a voxel representation of the surrounding environment using the so-called Truncated Signed Distance Function (TSDF) algorithm.

Depth maps provide more than just a surface location in an image. They also give us information about the free space between the camera and the surface. The Signed Distance Function (SDF) enables a very useful surface representation of this free space. The SDF defines the signed distance from a point in a volume to the nearest point on the surface, where the sign indicates whether that region of space lies in front of the surface (positive distance) or behind it (negative distance). The truncated version, TSDF, only defines the SDF in a limited band near the surface and truncates the value where the unsigned distance is above a specified threshold, as shown in the picture on the right.
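As a minimal sketch of the truncation just described, consider a single camera ray. The function names, the normalization to the range [-1, 1] and the running weighted average are my own illustrative assumptions (they follow a common convention), not the API of any particular library.

```python
def tsdf_value(surface_depth, voxel_depth, truncation):
    """Truncated signed distance for one voxel along a camera ray.

    surface_depth: depth measured by the sensor along this ray (metres).
    voxel_depth:   depth of the voxel centre along the same ray (metres).
    truncation:    width of the band around the surface (metres).

    Positive values are in front of the surface (free space), negative
    values are behind it. Returns None when the voxel lies farther behind
    the surface than the truncation band, because the measurement says
    nothing about what is there.
    """
    sdf = surface_depth - voxel_depth
    if sdf < -truncation:
        return None
    # Normalize and clamp the free-space side to 1.0 (assumed convention).
    return min(1.0, sdf / truncation)


def fuse(old_value, old_weight, new_value, new_weight=1.0):
    """Merge a new observation into a voxel with a running weighted average."""
    weight = old_weight + new_weight
    value = (old_value * old_weight + new_value * new_weight) / weight
    return value, weight
```

A voxel well in front of the surface saturates at 1.0, a voxel inside the band gets a fractional value whose sign tells us on which side of the surface it lies, and the surface itself sits where the fused values cross zero.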

A naïve 3D voxel mapping implementation will have a heavy memory footprint as we increase the voxel resolution, so it is important to implement efficient data structures that reduce this footprint to acceptable levels. For example, if we use a voxel grid resolution of 512³ to voxelize and colour a living room with a volume of ~5 m x 4 m x 3 m, we will need ~650 MB. Device memory is limited, and at some point while scanning the environment we would need to upload the excess data to the cloud.
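The arithmetic behind a figure of this size can be checked in a few lines. The per-voxel layout below (2 bytes of TSDF plus 3 bytes of RGB colour) is my assumption, chosen because it lands close to the number quoted above; a real implementation may store more or less per voxel.

```python
def voxel_grid_bytes(resolution, bytes_per_voxel):
    """Memory needed by a dense cubic voxel grid with `resolution` voxels per side."""
    return resolution ** 3 * bytes_per_voxel


# Assumption: 2 bytes of TSDF plus 3 bytes of RGB colour per voxel.
BYTES_PER_VOXEL = 2 + 3

total = voxel_grid_bytes(512, BYTES_PER_VOXEL)
print(f"{total / 1e6:.0f} MB")  # prints "671 MB"
```

Because the cost grows with the cube of the resolution, doubling the grid to 1024 voxels per side multiplies the footprint by eight, which is why sparse structures matter on mobile.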

Let’s come back to the pipeline outlined previously. The camera feed and the readings from the IMUs are used by a SLAM library (for example, Google ARCore) to find the device pose (position and orientation). The library also provides a sparse point cloud. The camera pose and the depth map from the depth sensor are used as input to the 3D reconstruction block to build a voxel representation of the environment. If we want to use any of the popular game engines to render the virtual objects on top of the camera feed, it is convenient to build a mesh representation on top of the voxels, as game engines work with meshes. We could then use the built-in functionality of the game engine to properly render occlusions and shadows, produce collisions and trigger events, among many other possible virtual-to-real interactions. The Structure Core depth sensor by Occipital directly produces a high-quality mesh in real time while scanning the surroundings. It is in fact a 3D scanner that can be attached to smartphones.
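The step that joins the camera pose and the depth map can be sketched as follows. This is a simplified illustration that assumes a pinhole camera model; the intrinsics (fx, fy, cx, cy), the pose given as a rotation matrix plus translation, and all function names are hypothetical and are not taken from ARCore, ARKit or any reconstruction library.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point, rotation, translation):
    """Apply the camera pose: world = R * point + t (R is a 3x3 nested list)."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def voxel_index(point, voxel_size):
    """Integer index of the voxel that contains a world-space point."""
    return tuple(int(c // voxel_size) for c in point)


# Hypothetical example: identity rotation, camera 1 m along the world x axis.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = backproject(420, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = camera_to_world(p_cam, IDENTITY, (1.0, 0.0, 0.0))
occupied = voxel_index(p_world, voxel_size=0.5)
```

Repeating this for every depth pixel of every frame gives the set of voxels touched by observed surfaces, which is the raw material for the voxel and mesh representations discussed above.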

The future of AR games and apps is closely tied to our capacity to acquire knowledge of the structure and nature of the surrounding objects. AR apps won’t progress much until we create the conditions for that. Fortunately, we have just started to see the results of the work carried out in this direction over the last few years. For example, one of the exciting presentations at SIGGRAPH 2018 was the InfiniTAM mobile solution, which is capable of performing a 3D reconstruction of the environment in real time.

The 3D reconstruction of the surrounding environment will allow devices to attain the topology of the scene. Further segmentation and recognition of the most relevant objects in the scene will allow us to identify potential relationships between them and the MR app/game based on their specifics.

Having more advanced knowledge of the surrounding scene (what kinds of objects make up the scene, and where are we?) gives additional info that lets us make even more effective use of real objects in AR experiences, i.e. fully integrate real objects into AR games and apps. In the pipeline outlined previously, the images from the camera and the depth map from the depth sensor feed the Neural Network (NN) engine block. This block identifies the “real objects” surrounding us, helping us acquire a semantic, high-level understanding of the scene. We have recognized a toaster, a coffee maker and a fridge, so we must be in a kitchen. We can expect this block to interact with inference services in the cloud when the capabilities at the edge are not enough.

Look at the picture on the right. If we had enough knowledge of the scene, we could tell that we are in a bathroom. We could then fill the bath with water, add some floating objects to it (in this example, a boat) and make virtual water fall from the shower.

The idea I am trying to convey here is that the future of AR games and experiences is closely tied to our capacity to obtain information about the structure and nature of the surrounding objects in order to incorporate them into AR experiences. We won’t see AR games and apps widely used until we create the conditions to make this happen. For this, we need access to the depth info of the surrounding environment.

If we want to integrate virtual objects realistically into the environment, we also need to handle lighting properly. Virtual objects should be able to cast realistic shadows on real objects, and for this we again need information about the environment. At this point we can foresee the convergence of several technologies: SLAM, graphics, AI and computer vision (CV). Any solution in this field will have to come from collaboration between different specialized teams. From the HW point of view, we should expect common acceleration elements to be identified and reused efficiently across different tasks.

Mobile depth sensors

Depth cameras became widely known in 2010, when Microsoft launched the “Kinect” accessory for the Xbox 360. We have also seen depth cameras in Google Tango tablets and in the Lenovo Phab 2 Pro and Asus Zenfone AR devices. More recently this technology has been integrated into smartphones’ front-facing cameras for secure face identification. MR headsets such as Microsoft HoloLens, Meta 2 and Magic Leap, among others, also integrate depth sensors to understand the environment and to track gestures. Qualcomm offers a passive depth sensing module and an active one as part of its Spectra ISP program.

The picture below shows a depth map (left) produced by a depth sensor. The colors in the depth image encode depth values from the nearest to the most distant points. The corresponding RGB image is on the right.

Mobile depth cameras (sensors) can be grouped into four well-defined types: Passive Stereo (PS), Active Stereo (AS), Structured Light (SL) and Time of Flight (ToF), as shown in the picture below.

PS systems are formed by a pair of cameras that determine depth by triangulation, in the same way our binocular vision works. The distance between the two cameras is known as the “baseline”, and it defines the depth resolution this type of system can achieve at a given distance. The quality of the results depends on the density of visually distinguishable feature points that the matching algorithm can identify in both views. Here, any source of natural or artificial texture will significantly improve accuracy.
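The triangulation behind a rectified stereo pair reduces to one formula, Z = f · B / d, where f is the focal length in pixels, B the baseline and d the disparity (the horizontal shift of a feature between the two views). The sketch below assumes an ideal pinhole model; the function name and the numbers in the example are illustrative only.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Pinhole stereo triangulation: Z = f * B / d (depth in metres)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera: 700 px focal length, 6.5 cm baseline.
near = depth_from_disparity(700.0, 0.065, 91.0)  # large disparity, close object
far = depth_from_disparity(700.0, 0.065, 9.1)    # small disparity, distant object
```

Since depth is inversely proportional to disparity, distant objects produce only a few pixels of shift, which is why a longer baseline or a longer focal length is needed to resolve depth far from the camera.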

AS systems are formed by two cameras and a projector that projects a random pattern. The main goal of the projector is to provide an additional source of texture that adds detail outside the visible spectrum. It also provides an additional light source that improves system reliability in poor lighting conditions.

SL is an alternative approach to depth from stereo. It combines a single camera and a single projector and relies on recognizing a specific projected pattern in a single image. The projector casts a known pattern in the infrared spectrum onto the scene. The pattern is deformed when it is reflected by surfaces, and the camera registers this deformation, from which depth and surface information are calculated.

ToF sensors emit an ultra-short pulse of light and a detector then measures the arrival time of the beam reflecting back from the objects in the environment. The farther away the objects, the longer the return time measured.
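For a pulsed ToF sensor the conversion from arrival time to distance is simply d = c · t / 2, since the light travels to the object and back. The following sketch shows only this relationship, under my assumption of a direct pulsed measurement; it ignores how a real sensor actually times the return.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def tof_distance(round_trip_seconds):
    """Distance to the object from the measured round-trip time of the pulse."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0


def round_trip_time(distance_m):
    """Round-trip time the sensor has to resolve for an object at distance_m."""
    return 2.0 * distance_m / SPEED_OF_LIGHT
```

An object at 5 m returns the pulse after roughly 33 nanoseconds, and a 1 cm difference in depth changes the arrival time by only about 67 picoseconds, which gives an idea of the timing precision these sensors need.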

Stereo depth sensors estimate depth by using two different points of view. This includes SL, where one of the points of view is the projector. Only sensors based on Time of Flight (ToF) technology measure depth directly.

The different sensor types compare as follows across a number of relevant features.

  • Accuracy: SL sensors have the best depth accuracy, while stereo camera arrays tend to have the largest depth error. For example, the Qualcomm SL module, when used in conjunction with the company’s Spectra image signal processor, delivers a claimed depth accuracy of ~0.1 mm. ToF system accuracy is in the sub-centimeter range. The effective range of stereo systems depends on the optics and the baseline distance. For a standard camera with a baseline of ~6.5 cm the minimum distance is ~0.5 m. At this distance the error of the system is ~0.1 cm. This error increases to ~10 cm at 5 m, and at 10 m it is as high as ~40 cm. This calculator shows how image, lens and stereo geometry metrics are related.
  • Range: SL has the shortest sensing range, while the range of ToF depends on the emitting power of the light source. Active depth sensors operate in the infrared spectrum because there are few other infrared sources in everyday life that would interfere with them. Their maximum range is limited to ~5 m because the power of the IR pulse is restricted for safety reasons. Some depth solutions are starting to offer wider working ranges. The Orbbec Astra depth camera, according to its spec, can reach up to 8 m, although its optimal range is 0.6-5 m. The Intel® RealSense active stereo IR depth cameras D415 and D435 list a maximum range of 10 m in their specs.
  • Cost: The cost of stereo camera systems is typically the lowest, and the development effort is mainly on the SW side. Dual camera arrays are already present in many smartphone models, although to improve picture quality rather than to measure depth. The cost of ToF systems is moderate, while SL sensors have the highest cost. Nevertheless, we can expect costs to come down with mass production of these sensors. Prices of currently available solutions range from ~$150 (Intel® RealSense D415 and D435, Orbbec Astra) to a few hundred dollars (Structure Sensor, PMD Technologies Pico Flexx ToF camera, StereoLabs ZED stereo depth camera, and the Duo cameras from DUO3D).
  • Scalability: ToF is the best depth sensor technology in terms of scalability, as it is based completely on semiconductor technology. SL technology is also scalable, but the optical system doesn’t scale as fast as semiconductor technology. Finally, the scalability of stereo camera arrays relies mainly on the SW side.
  • Power: Power consumption is lowest in stereo systems. In SL and ToF sensors the power scales with distance. The software processing load is highest for stereo systems, followed by SL sensors. ToF sensors require the least processing power.
  • Robustness: Besides their range limitations for outdoor depth sensing, SL and ToF systems perform poorly outdoors in sunlight. They are also adversely affected by the reflective properties of some materials (e.g. translucent surfaces, water). ToF sensors may experience interference from other ToF cameras. PS systems perform poorly in low light and don’t work well with textureless surfaces.
  • Refresh rate: The refresh rate of a depth sensor depends mainly on the resolution of the depth image it provides. Most available sensors for smartphones support 640*480 (VGA) at 30 FPS, and we are starting to see new sensors supporting 1280*720 at 30 FPS (e.g. Intel® RealSense D415 and D435). At a lower resolution of 320*240 (QVGA), sensors can achieve 60 FPS. Some sensors offer a configurable refresh rate depending on the working range.
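The stereo error figures in the Accuracy item follow a quadratic pattern, which is what the triangulation geometry predicts: differentiating Z = f · B / d gives a depth uncertainty of roughly Z² / (f · B) times the disparity error. The sketch below is a first-order approximation; the focal length and disparity error are hypothetical values I picked for illustration, and only the ratios matter.

```python
def depth_error(depth_m, focal_px, baseline_m, disparity_error_px):
    """First-order stereo depth uncertainty: dZ ~= Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


# Hypothetical camera: 700 px focal length, 6.5 cm baseline, 0.2 px matching error.
F, B, DD = 700.0, 0.065, 0.2

ratio_5m = depth_error(5.0, F, B, DD) / depth_error(0.5, F, B, DD)    # ~100
ratio_10m = depth_error(10.0, F, B, DD) / depth_error(0.5, F, B, DD)  # ~400
```

Scaling the ~0.1 cm error at 0.5 m by these ratios gives ~10 cm at 5 m and ~40 cm at 10 m, in line with the figures quoted above. The same formula shows that a longer baseline reduces the error proportionally.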

Mobile depth camera requirements

Mobile depth cameras should satisfy some basic requirements:

  • Power consumption ~ 500 mW or less
  • Low processing of sensor output
  • Limited projection power for safety reasons
  • Thickness < 5mm due to mobile form factor
  • Robustness against permanent smartphone manipulation stress that affects factory calibration
  • AR/MR projection distance ranges
    • Environment scanning: 0.6 – 5 m
    • Gesture recognition: 0.2 – 1 m
  • Low cost, within acceptable limits for sensors

These requirements pose serious challenges for sensor manufacturers. In the case of stereo systems, a baseline of ~7-8 cm is required to guarantee quality measurements at ranges up to 5 m. Tablets and AR/MR headsets can accommodate this distance between cameras, but it is challenging for smartphones. Additionally, depth solutions based on stereo matching require highly specialized HW, such as an ASIC or a DSP, for low-power real-time processing.

These technical challenges also make the technology expensive. In large quantities, the average price of mobile depth devices is estimated at ~$10. In the iPhone X, for example, the cost of the depth sensing system is estimated at $16.70. We can expect these figures to come down as the technology increases its market penetration.

Until recently, there has been no market that would support large scale and low-cost production of depth sensors, but this could change as AR/MR evolves and we could then expect depth sensor prices to come down.  

As ToF cameras don’t work on the stereo principle, they don’t require a baseline and can be very small. Calibration is simpler, and they commonly support mixed modes, i.e. they can cover both short and long ranges. Nevertheless, ToF sensors also have shortcomings. Current sensors have a relatively low resolution of ~100K pixels or less, although we can already see new developments achieving 300K. Extra processing is required to overcome this and other limitations.

The alternative to depth sensors: Neural Networks to the rescue

As we have seen, depth data will be relevant for many use cases, and there are different types of depth sensors suitable for mobile devices. We can expect depth sensors to be gradually integrated into smartphones in the near future, thanks to the very fast product refresh cycle of the smartphone industry. But how can we get reliable access to depth data on mobile if current mobile depth sensors have limited specs and are pricey?

We have to look for the answer in another new technology. Fortunately, although it is difficult to believe, it is possible to generate a depth map from a single RGB image (see image below) using Neural Networks (NN). Work by a number of different teams points to encouraging results. For example, we can train a network with RGB-depth data (supervised approach) so that it learns to produce depth from a monocular RGB image. An alternative is to train the network with stereo RGB images (unsupervised approach), imposing consistency between the disparities produced relative to the left and right images. This approach removes the need for ground truth depth data when training the network. Both approaches can deliver impressive results, as the network can predict the depth map from a monocular RGB image, in some cases with even better quality than depth maps obtained from depth sensors. More recently, a semi-supervised learning approach has been proposed to train a stereo deep neural network (DNN), along with an architecture containing a machine-learned argmax layer and a custom runtime that enables a smaller version of the stereo DNN to run on an embedded GPU.
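To give a feel for the consistency constraint used in the unsupervised approach, here is a much simplified sketch of a left-right disparity check. It is not the loss from any specific paper: I assume rectified images, disparities expressed in pixels, and nearest-neighbour lookup, whereas a trainable version would use differentiable (bilinear) sampling inside a deep learning framework.

```python
def left_right_consistency(disp_left, disp_right):
    """Mean absolute disagreement (in pixels) between two disparity maps.

    disp_left[y][x]  : shift of left-image pixel x towards its match in the right image.
    disp_right[y][x] : shift of right-image pixel x towards its match in the left image.
    A left pixel x is matched to right pixel x - d_left, and both maps should
    report the same disparity for that correspondence.
    """
    total, count = 0.0, 0
    for row_l, row_r in zip(disp_left, disp_right):
        width = len(row_l)
        for x, d_left in enumerate(row_l):
            x_right = int(round(x - d_left))
            if 0 <= x_right < width:  # skip matches that fall outside the image
                total += abs(d_left - row_r[x_right])
                count += 1
    return total / count if count else 0.0
```

During training, a term like this is minimized together with an image reconstruction term, so the network is pushed towards disparities that agree from both viewpoints without ever seeing ground truth depth.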

The picture below shows a depth map (right) produced by means of monocular depth estimation using a supervised approach for neural network training.

Conclusions

We are on the verge of a visual revolution that will drastically change the way we do computing and interact with computing devices. Augmented and mixed reality experiences will seamlessly integrate virtual content into the real world. We will no longer launch applications as we do now. Instead, applications will be contextually suggested to us by real objects, just by looking or pointing at them. Nevertheless, to make this vision happen we need devices capable of seeing and understanding the world as we do, and this is where depth is set to play a key role.

Scanning the environment is the first step; as humans, we do this unconsciously every single day. We look around, we build a mental map of where we are, and then we identify the objects surrounding us before we start interacting with them. We need devices to follow a similar process, so that they know where objects are positioned in the scene and what they are. For any of this to be feasible, we need access to depth info.

As we have seen in this blog, if we know where the device is and we have a depth map of the surroundings, then we can build a 3D representation of that environment. This 3D digital reconstruction will allow us to integrate virtual objects on top of the real-world environment and generate virtual-to-real interactions as if they were really happening.

In principle, mobile image sensors together with depth sensors and IMUs provide the info needed to perform a 3D reconstruction of the environment. However, mobile depth sensors are not yet as developed in terms of performance, depth quality and power consumption. Sensor manufacturers will have to work to improve depth sensor accuracy, stability and performance while reducing dimensions, cost and power consumption, so that the sensors can be integrated into mobile devices and headsets. Additional research can provide sensor manufacturers with valuable data on the accuracy, stability, frequency and range needed to achieve quality 3D reconstruction in real time.

Is it really possible to achieve this? The video below from early 2014, which shows the first Google Tango phone, “Peanut”, mapping the environment in real time, demonstrates that it can be done.

Sensors are only the front end of a complex pipeline needed to fuse different data. 3D environment reconstruction, object recognition, environment lighting and topology extraction should all be fused to seamlessly combine virtual and real content, both when rendering virtual assets on top of the camera feed and when producing virtual-to-real interactions. For this, mobile devices and headsets will need to run a number of complex and computationally demanding geometry, CV and AI algorithms. This represents a challenge for future HW IP development. Mobile-friendly algorithms need to be identified or developed for each relevant stage, and we then need to decide which of them must be HW accelerated. When designing accelerators, processing steps common to several algorithms should be identified so that they can be shared and efficiency maximized.

Depth generation from monocular RGB images represents an alternative to depth sensors but, again, further studies are needed, as most of the development so far has been done on desktop HW. Existing approaches must prove themselves on mobile in terms of performance, depth accuracy, power consumption and the lighting conditions under which they can provide reliable results.

I enjoyed researching this topic and I hope you enjoy reading about it too. More importantly, I hope you find this blog useful. As always, I will appreciate your comments and feedback.

Anonymous
Graphics & Multimedia blog