Debuted at GDC 2017, CircuitVR is our latest mobile VR demo. It showcases multiview and foveated rendering while letting you explore the inner workings of a mobile phone and see how its various components work and fit together.
Mobile Multiview is a powerful extension supported in Unreal Engine 4.14 that mitigates one of the inherent problems of mobile VR: the need to draw two views that differ only slightly in camera position. Rendering those views naively means executing every drawcall twice (once per eye), which increases CPU load and can become a bottleneck when your game contains many objects that end up as separate drawcalls (you can check the drawcalls executed by the engine by enabling the SceneRendering statistics in UE4). This is where Mobile Multiview comes into play: it allows the engine to execute a single drawcall that renders into 2 separate views (or even more!), each with its own view parameters. For a more in-depth description of the mobile multiview extension please refer to this blog and this tutorial.
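To give a feel for what happens underneath the engine, here is a minimal sketch of how a multiview render target is typically created with the underlying GL_OVR_multiview extension. This is plain GLES 3 C++ for illustration only, not Unreal Engine's actual renderer code: a single framebuffer is backed by a texture array with one layer per view, so every draw call issued to it is broadcast to all views, with the vertex shader selecting per-view matrices through gl_ViewID_OVR.

```cpp
// Sketch only: raw GLES 3 + GL_OVR_multiview, not UE4's internal implementation.
#include <EGL/egl.h>
#include <GLES3/gl3.h>

typedef void (*PFNGLFRAMEBUFFERTEXTUREMULTIVIEWOVR)(GLenum, GLenum, GLuint, GLint, GLint, GLsizei);

GLuint CreateMultiviewTarget(int width, int height, int numViews)
{
    // Load the extension entry point (GL_OVR_multiview / GL_OVR_multiview2).
    auto glFramebufferTextureMultiviewOVR =
        (PFNGLFRAMEBUFFERTEXTUREMULTIVIEWOVR)eglGetProcAddress("glFramebufferTextureMultiviewOVR");

    // Colour attachment: one 2D texture array with one layer per view.
    GLuint colorArray = 0;
    glGenTextures(1, &colorArray);
    glBindTexture(GL_TEXTURE_2D_ARRAY, colorArray);
    glTexStorage3D(GL_TEXTURE_2D_ARRAY, 1, GL_RGBA8, width, height, numViews);

    GLuint fbo = 0;
    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo);

    // All 'numViews' layers become render targets of the same framebuffer:
    // a single draw call issued on this FBO is broadcast to every view,
    // and the vertex shader picks the right matrices via gl_ViewID_OVR.
    glFramebufferTextureMultiviewOVR(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                                     colorArray, 0 /*level*/, 0 /*baseViewIndex*/, numViews);
    return fbo;
}
```

In Unreal Engine none of this plumbing is exposed to you; the engine drives the extension itself once the option below is enabled.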
Enabling Mobile Multiview is pretty simple in Unreal Engine 4.14: head to the Project Settings window and you will find the option under the VR category of the Rendering tab.
Fig 1. Mobile Multiview option
Just enabling Mobile Multiview made the time per frame spent in the Renderer Thread (the biggest in terms of execution time) drop from 28ms to 22ms. That is a 22% reduction in CPU time (the reduction is not larger because only part of the execution time is actually spent on drawcalls) and a big step forward for a single tick box. Figures 2 and 3 below show the CPU time difference using graphs captured with the ARM Streamline tool ( https://developer.arm.com/products/software-development-tools/graphics-development-tools ).
Of course more optimization was still needed to reach the ~16ms frame time required for 60FPS, but by using other optimization techniques explained in other blogs and presentations (e.g. our GDC 2017 talk High Quality Mobile VR with Unreal Engine and Oculus) we managed to get below that threshold.
Fig 2. Renderer thread CPU load
Fig 3. Renderer thread CPU load after enabling Mobile Multiview
If you are not familiar with the concept of Foveated Rendering, it briefly means rendering detail where your eyes can see it and avoiding it where they cannot. This goes far beyond classic view-based optimization techniques such as frustum culling and occlusion culling, which avoid rendering entire objects that are occluded or fall outside the view frustum. The technique derives from the structure of the human eye, which has an area called the fovea that captures the most detailed part of your vision and corresponds to the point your eyes are focusing on. The peripheral area, from here on simply called the periphery, still provides important visual information (such as movement, light changes, etc.) but with less resolution and detail.
Exploiting these characteristics of the human visual system is what makes this technique really interesting, since it allows us to reduce the amount of shading executed for the periphery while keeping full resolution in the foveal area. A lot of research is currently being done on Foveated Rendering to better support it in future products and hardware. For now, we can use the existing Mobile Multiview extension to build a possible implementation which, in the case of CircuitVR, gave us an overall 20% GPU load reduction, which in turn translated into longer battery life and less heat.
Fig 4. Foveated rendering in CircuitVR. The high-resolution area is the fovea, while the pixelated area is the periphery. (The picture has been modified to make the difference more noticeable than what is rendered on the device.)
The technique I'm going to explain below is not a panacea and is beneficial only when the application has a low triangle count since, as will be explained later, we render the geometry not 2 times (one per eye) but 4 times (2 per eye).
Eye tracking is an important part of the technique since we need to know where in the scene you are currently focusing so that we can render it at the proper resolution. For this reason, we partnered with SensoMotoric Instruments ( SMI https://www.smivision.com/ ), who provided us with an SMI-modified GearVR headset with integrated eye tracking support, which allowed us to track what you are looking at in the demo and provide the best image for it.
The idea behind the technique is quite simple (see this for more in-depth information). Instead of rendering 2 views at full resolution (1024x1024 per eye) and then blitting them to the GearVR eye buffers (which have the same resolution), we render 4 views at lower resolution (360x360 per eye in our experiments, roughly a 65% reduction per dimension) and then compose them into the final GearVR eye buffers (upscaling the views). The important part of the technique is that while 2 of the views are rendered as usual (left and right periphery), the other two (left and right fovea) render the scene with a smaller field of view and with a view direction that matches the user's gaze. This makes the scene look zoomed (like looking through a sniper scope in an FPS) and makes the details in that small part of the scene appear more clearly.
Fig 5. Left eye periphery (wide FOV)
Fig 6. Left eye fovea (narrow FOV)
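As the two figures suggest, a fovea view is essentially the periphery view with a narrower field of view aimed at the gaze point. The sketch below is my own simplification (the struct and the scale factor are assumptions, not the demo's code): because the 360x360 fovea image is later pasted 1:1 into the 1024x1024 eye buffer, it only needs to cover roughly 360/1024 of the eye buffer's extent, so the tangent of its half-FOV is scaled by the same ratio.

```cpp
// Sketch of deriving the fovea projection from the periphery projection.
// Handles the gaze-centred case; an off-centre gaze needs an asymmetric frustum.
struct PerspectiveFov { float tanHalfX, tanHalfY, zNear, zFar; };

PerspectiveFov MakeFoveaFov(const PerspectiveFov& periphery,
                            float foveaScale /* e.g. 360.0f / 1024.0f */)
{
    PerspectiveFov fovea = periphery;
    // Shrinking the half-FOV tangent "zooms in" on the gaze direction, like a
    // scope: same pixel count over a much smaller solid angle, hence higher
    // angular resolution exactly where the user is looking.
    fovea.tanHalfX = periphery.tanHalfX * foveaScale;
    fovea.tanHalfY = periphery.tanHalfY * foveaScale;
    return fovea;
}
```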
With this technique the simple blitting step becomes a composition step in which the fovea and periphery images are fused together to produce the final image seen on screen. Even so, the composition algorithm is not that complex: it is mainly a scaling/offsetting process needed to cope with the movable centre of focus as well as with different amounts of resolution reduction (which translate to different field of view angles for the fovea scene).
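To make the scaling/offsetting concrete, here is a small CPU-side reference of the per-pixel decision the composition pass performs. This is a sketch using my own names and the CircuitVR resolutions, not the shader actually shipped in the demo: output pixels that fall inside the fovea rectangle, centred on the gaze point, take their colour 1:1 from the fovea view, while everything else samples the periphery view stretched over the whole eye buffer.

```cpp
#include <algorithm>

struct Float2 { float x, y; };

// Sizes used in the CircuitVR experiments: 1024x1024 eye buffer, 360x360 views.
// gazePx is the gaze point in eye-buffer pixel coordinates (centre of the fovea).
struct CompositeParams {
    float eyeBufferSize = 1024.0f;
    float viewSize      = 360.0f;   // both fovea and periphery are 360x360
    Float2 gazePx       = {512.0f, 512.0f};
};

// Which source texture to sample and the normalized UV to sample it at.
struct SampleInfo { bool useFovea; Float2 uv; };

SampleInfo ComposePixel(Float2 outPx, const CompositeParams& p)
{
    const float half = p.viewSize * 0.5f;
    // Fovea rectangle in eye-buffer pixels, clamped so it stays inside the buffer.
    const float x0 = std::clamp(p.gazePx.x - half, 0.0f, p.eyeBufferSize - p.viewSize);
    const float y0 = std::clamp(p.gazePx.y - half, 0.0f, p.eyeBufferSize - p.viewSize);

    if (outPx.x >= x0 && outPx.x < x0 + p.viewSize &&
        outPx.y >= y0 && outPx.y < y0 + p.viewSize) {
        // Inside the fovea: 1:1 mapping, so full detail is preserved here.
        return { true, { (outPx.x - x0) / p.viewSize, (outPx.y - y0) / p.viewSize } };
    }
    // Outside: the 360x360 periphery is stretched over the whole 1024x1024 buffer.
    return { false, { outPx.x / p.eyeBufferSize, outPx.y / p.eyeBufferSize } };
}
```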
A key property of the algorithm is that during composition the fovea view always maps 1:1 to the GearVR eye buffer, while only the periphery gets stretched to fill the rest of the GearVR eye buffer. This means the area of focus represented in the fovea view has the same detail as if we were rendering the scene into the original 1024x1024 render target. To make the lower resolution in the periphery less visible, and also to increase the quality in the fovea, we enabled 4xMSAA, which mitigates the edge aliasing caused by the low resolution. An example of the final result is shown in Fig. 7.
Fig 7. CircuitVR Foveated Rendering result. Picture taken from a Samsung Galaxy S7 running on GearVR. Click on the picture to enlarge it to full resolution and see the difference between the fovea and the periphery.
A noticeable problem with the technique is that the difference in resolution between the fovea and the periphery is clearly visible when the user moves her eyes to focus on various parts of the scene, which breaks the illusion of full resolution.
This is where Eye Tracking comes into play. If we can track where the eye is looking at each moment, we can use that information to orient the fovea views correctly and always render at high resolution the part of the scene the user is focusing on.
Thanks to SMI's Eye Tracking technology we managed to achieve exactly that. The eye tracking information is sent to Unreal Engine through a plugin and then used to modify the view and projection matrices of the fovea views, as well as to tell the compositing stage where the fovea view should be placed in the final image.
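In rough terms the per-frame flow looks like the sketch below. Every type and function here is a placeholder for illustration, not the SMI plugin's or Unreal Engine's real API; the point is simply that the same gaze sample drives both the fovea view direction and the placement of the fovea rectangle in the compositor.

```cpp
// Illustration only: all names below are hypothetical stand-ins.
struct Vec3 { float x, y, z; };
struct GazeSample {
    Vec3 gazeDirection;   // gaze direction in HMD space, from the eye tracker
    Vec3 gazePointPx;     // gaze point projected into eye-buffer pixel coordinates
};

// Stub standing in for the eye-tracking plugin query.
GazeSample GetLatestGazeSample() { return { {0.0f, 0.0f, -1.0f}, {512.0f, 512.0f, 0.0f} }; }

// Stubs standing in for the renderer hooks the gaze sample drives.
void SetFoveaViewDirection(int /*eye*/, const Vec3& /*dir*/) {}
void SetFoveaCompositeCenter(int /*eye*/, const Vec3& /*px*/) {}

void UpdateFoveation()
{
    // One gaze sample per frame (in CircuitVR it was sampled on the Game Thread).
    const GazeSample gaze = GetLatestGazeSample();

    for (int eye = 0; eye < 2; ++eye) {
        // The narrow-FOV fovea view is aimed along the gaze direction...
        SetFoveaViewDirection(eye, gaze.gazeDirection);
        // ...and the compositor is told where to paste it 1:1 into the eye buffer.
        SetFoveaCompositeCenter(eye, gaze.gazePointPx);
    }
}
```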
Before going through the results we achieved, it is worth explaining the savings and costs of the technique for the various parts of the system. Specifically, we are going to look at the costs in terms of CPU load, GPU vertex load and GPU fragment load.
The CPU is the easiest and shortest one: nothing changes on the CPU side. The cost of adding the logic to render 4 views is negligible and, thanks to the Mobile Multiview extension, rendering 4 views instead of the classic 2 comes at almost no CPU cost.
The GPU vertex load increases. That is expected since we are now rendering 4 views instead of 2, which means the vertex shaders have to run four times. This gave us a 53% increase in vertex shading cost. The increase is less than 100% (going from 2 to 4 views) because the Mobile Multiview extension is clever enough to execute four times only the parts of the vertex shaders that change per view, while executing the common parts just once.
This can be seen in Figures 8 and 9, which are profiling traces taken with the ARM Streamline tool on a Samsung Galaxy S7 (see this for more information about hardware counters on Mali). In the "Mali Job Manager Cycles" section, the GPU cycles counter represents the overall GPU load, the JS0 cycles represent fragment load and the JS1 cycles represent vertex load. As you can see from the pictures, JS1 goes from 129M cycles in the Mobile Multiview case to 192M cycles in the Foveated Rendering case (~53% more).
The GPU fragment load instead shows around a 40% reduction, since we now render 75% fewer pixels (4 views at 360x360 are roughly 0.5M pixels, versus roughly 2.1M pixels for 2 views at 1024x1024). This could be improved further: as you can see from the pictures, the part of the periphery scene that is also covered by the fovea view could be skipped, since it will be overwritten during the composition phase.
Again, from the Streamline counters we can see that the Mobile Multiview case spends 394M cycles in JS0 (fragment) while the Foveated Rendering technique spends only 239M. That corresponds to a 40% reduction in fragment processing.
Fig 8. Streamline GPU hardware counters for Mobile Multiview
Fig 9. Streamline GPU hardware counters for Foveated Rendering at 35% of the original resolution
Our current experiments showed that using 35% of the original resolution to render 4 views gave us a 20% GPU load reduction, which allowed us to lower the GPU frequency and reduce heat and power consumption. Looking at the pictures above, the GPU Cycles counters for Mobile Multiview add up to 488M cycles while for Foveated Rendering we reach 397M cycles. Note that the GPU Cycles counter is not simply the sum of JS0 and JS1, since on the Mali architecture vertex jobs and fragment jobs run in parallel most of the time (if you want to know more, refer to Frame Pipelining).
Even though the fragment load saving percentage may seem modest, games usually spend more time on fragment than on vertex processing, so a 40% reduction in fragment shading lowers the overall frame cost enough to outweigh the vertex load increase.
Another way to see the improvement is to think about the per-frame timeline of the CPU, vertex and fragment tasks and consider the amount of time the GPU can stay idle.
Fig 10. Timeline for Mobile Multiview and Foveated Rendering
When foveated rendering is used, the amount of time the GPU can go idle is much larger, which allows the GPU to sleep during that time. The timeline above assumes the GPU frequency is fixed, which is not the case on mobile, where the GPU frequency varies with load and the temperature of the phone. That means that instead of running at a high frequency and then going idle, the GPU can smartly adjust its frequency so that it is able to complete the same work at a lower speed. If you look at the Streamline traces you can see that with Mobile Multiview the GPU requires ~489M cycles to render the scene at 60FPS, while with Foveated Rendering enabled it needs only ~397M cycles to achieve the same.
Foveated Rendering is a really interesting technique that can bring huge savings for mobile VR, and the freed resources can be used to improve the quality of VR applications. When we showed the demo to the public at GDC, most people didn't notice the difference in resolution until told where to look.
Eye Tracking helps a lot in creating the illusion of full resolution and it is going to be an important part of future VR headsets. Even without eye tracking, though, the technique performs well enough to be used in current VR applications if you are willing to sacrifice a bit of quality for performance. The amount of resolution reduction can be tuned to give both a performance improvement and a less visible difference in resolution (a smaller reduction makes the fovea view cover a bigger part of the scene and stretches the periphery less). The lens distortion also plays in favour of the technique: the fovea matches the part of the lenses with the highest pixel density, while the peripheral area usually has a lower pixel density, which means a lower perceived resolution even when rendering at full resolution.
Eye tracking latency plays an important role in the illusion of full resolution. Due to deadlines and the structure of Unreal Engine we decided to sample the eye tracking location in the Game Thread. Even though this works fine, there is at least a frame of latency between the Game Thread and the Render Thread, plus the time it takes for the GPU to render the scene and display it. This could be improved by sampling the eye tracking information in the Render Thread, similarly to the correction Unreal Engine does for the GearVR, so that the latency is reduced.
Since this was an experiment there are still problems to be solved. One of them is improving Mobile Multiview so that we can avoid executing the vertex shaders 4 times. That would reduce the GPU load even more, or allow VR games to have more geometry in the scene, which is important for creating richer and more detailed worlds.