At this year’s Game Developers Conference in San Francisco, I presented our work with Vulkan Multipass at the Khronos Group’s DevDay. A video of the presentation is available on YouTube.
Before we dive into detail on what Vulkan Multipass is, let’s consider how deferred shading can be done optimally on mobile GPUs.
A tile-based architecture renders a frame in two stages. First, the geometry is processed to create a geometry working set. This includes the varyings written by the vertex shaders as well as the data structure produced by the primitive tiling process.
In the fragment shading stage, we can then look at one isolated region of the framebuffer (a tile) at a time and complete all of its shading. The advantage is that operations such as blending and depth testing can access on-chip tile memory instead of going to main memory. Only the final, completed tile is written out to main memory.
Given how deferred shading works, the optimal approach on tile-based GPUs is to keep the G-Buffer data in tile memory. This is possible because each lighting shader invocation only needs to read the G-Buffer data for its own pixel.
Before Vulkan, there was no standardised way to express this data flow.
Historically, making full use of tile-based GPUs in OpenGL ES has required developers to use various extensions to reach the more optimal paths which tile-based GPUs support. The rendering paths for desktop and mobile would therefore look quite different. For Mali GPUs, these paths are exposed through GL_ARM_shader_framebuffer_fetch, GL_EXT_shader_pixel_local_storage and GL_ARM_shader_framebuffer_fetch_depth_stencil.
In Vulkan, mobile GPUs are now on an equal footing with desktop GPU architectures, because the API takes into consideration how a tile-based GPU works. Vulkan render passes are split into subpasses, and each subpass usually performs one particular task. In terms of deferred shading, we can consider two subpasses: the first does the traditional G-Buffer pass, and the second, the Lighting pass, applies lighting to the scene.
The main reason for putting these subpasses into one render pass is that we can express a per-pixel dependency between the G-Buffer and Lighting passes. This means that we can keep the G-Buffer data in on-chip tile memory instead of moving it out to main memory. This saves a lot of external memory bandwidth, which is very important on mobile, and it is the main motivation for multipass in the first place.
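To make the two-subpass structure concrete, here is a minimal sketch of how such a render pass could be described in C++. It is not the code from our SDK sample: the helper name createDeferredRenderPass, the attachment order and the formats are assumptions chosen for illustration, and a real application would query format support and add error handling.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Hypothetical helper: describes a two-subpass deferred render pass.
// Attachment order and formats are illustrative assumptions.
VkRenderPass createDeferredRenderPass(VkDevice device)
{
    // 0: light buffer, 1: albedo, 2: normals, 3: depth.
    const VkFormat formats[4] = {
        VK_FORMAT_R8G8B8A8_UNORM, VK_FORMAT_R8G8B8A8_UNORM,
        VK_FORMAT_A2B10G10R10_UNORM_PACK32,
        VK_FORMAT_D32_SFLOAT, // assumed supported; query this in real code
    };

    VkAttachmentDescription attachments[4] = {};
    for (uint32_t i = 0; i < 4; i++) {
        const bool isLight = (i == 0), isDepth = (i == 3);
        attachments[i].format = formats[i];
        attachments[i].samples = VK_SAMPLE_COUNT_1_BIT;
        attachments[i].loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
        // Only the light buffer has to reach main memory.
        attachments[i].storeOp = isLight ? VK_ATTACHMENT_STORE_OP_STORE
                                         : VK_ATTACHMENT_STORE_OP_DONT_CARE;
        attachments[i].stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
        attachments[i].stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
        attachments[i].initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
        // Assumes the light buffer is sampled by a later pass.
        attachments[i].finalLayout =
            isLight ? VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
          : isDepth ? VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
                    : VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    }

    // Subpass 0 (G-Buffer) writes albedo, normals and depth.
    const VkAttachmentReference gbufferColor[2] = {
        { 1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
        { 2, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
    };
    const VkAttachmentReference gbufferDepth = {
        3, VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL };

    // Subpass 1 (Lighting) reads the G-Buffer as input attachments
    // and writes the light buffer.
    const VkAttachmentReference lightingInputs[3] = {
        { 1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL },
        { 2, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL },
        { 3, VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL },
    };
    const VkAttachmentReference lightColor = {
        0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };

    VkSubpassDescription subpasses[2] = {};
    subpasses[0].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[0].colorAttachmentCount = 2;
    subpasses[0].pColorAttachments = gbufferColor;
    subpasses[0].pDepthStencilAttachment = &gbufferDepth;
    subpasses[1].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[1].inputAttachmentCount = 3;
    subpasses[1].pInputAttachments = lightingInputs;
    subpasses[1].colorAttachmentCount = 1;
    subpasses[1].pColorAttachments = &lightColor;

    // Per-pixel dependency: lighting at a pixel waits only for the
    // G-Buffer writes to that same framebuffer region.
    VkSubpassDependency dependency = {};
    dependency.srcSubpass = 0;
    dependency.dstSubpass = 1;
    dependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT |
                              VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT |
                              VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT;
    dependency.dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    dependency.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT |
                               VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT;
    dependency.dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
    dependency.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;

    VkRenderPassCreateInfo info = {};
    info.sType = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
    info.attachmentCount = 4;
    info.pAttachments = attachments;
    info.subpassCount = 2;
    info.pSubpasses = subpasses;
    info.dependencyCount = 1;
    info.pDependencies = &dependency;

    VkRenderPass renderPass = VK_NULL_HANDLE;
    if (vkCreateRenderPass(device, &info, nullptr, &renderPass) != VK_SUCCESS)
        return VK_NULL_HANDLE;
    return renderPass;
}
```

The two details that matter for tile-based GPUs are the VK_DEPENDENCY_BY_REGION_BIT flag, which states that the dependency is local to a framebuffer region, and VK_ATTACHMENT_STORE_OP_DONT_CARE on the G-Buffer attachments, which tells the driver that their contents need not be written back. In the Lighting subpass, the fragment shader reads the input attachments with GLSL's subpassLoad, which can only fetch the value at the current pixel. The G-Buffer images can additionally be created with VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT and backed by VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT memory where the implementation offers it.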
Having implemented deferred renderers with Multipass, we wanted to validate their performance characteristics against a more traditional multiple render target (MRT) solution, where you render to several textures and then read them back.
The first test is a bare-bones sample, taken from our recently released Vulkan SDK, which implements a Multipass renderer.
We measured both overall performance and the external memory traffic generated by the GPU when rendering this scene at 4K.
Here we got a good performance improvement as well as a massive bandwidth improvement. The bandwidth saving comes from the fact that the multipass version only needs to write the final light buffer out to memory, whereas classic MRT needs to write 4 textures out to memory and then read all of them back again in the lighting pass.
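A rough back-of-the-envelope model shows why the gap is so large. The sketch below is not our measurement method and its numbers are assumptions: it takes 4K to mean 3840x2160, assumes 4 bytes per pixel for every attachment, treats the 4 textures as the G-Buffer with the final light buffer written in addition, and ignores geometry traffic, texture fetches and framebuffer compression.

```cpp
#include <cstdint>

// Illustrative frame description; all values are assumptions.
struct FrameConfig {
    std::uint64_t width;
    std::uint64_t height;
    std::uint64_t bytesPerPixel;    // per attachment
    std::uint64_t gbufferTextures;  // targets written by the G-Buffer pass
};

// Size of one full-screen attachment in bytes.
std::uint64_t attachmentBytes(const FrameConfig& c)
{
    return c.width * c.height * c.bytesPerPixel;
}

// Classic MRT: each G-Buffer texture is written once and read back once,
// and the final light buffer is written once.
std::uint64_t mrtBytesPerFrame(const FrameConfig& c)
{
    return attachmentBytes(c) * c.gbufferTextures * 2 + attachmentBytes(c);
}

// Multipass: the G-Buffer stays in tile memory, so only the final
// light buffer is written to main memory.
std::uint64_t multipassBytesPerFrame(const FrameConfig& c)
{
    return attachmentBytes(c);
}
```

Under these assumptions, a frame described by FrameConfig{3840, 2160, 4, 4} moves about 299 MB of framebuffer data per frame with classic MRT and about 33 MB with multipass, a factor of nine. Real savings depend on the attachment formats and on everything else the frame does.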
This can be seen as the “optimal” case for multipass.
We also tested this on a more complex scene with many lights and more complex shading.
The bandwidth gains are still really solid, and we still see a decent performance improvement as well.
At last year’s GDC, we presented a tech demo showcasing our Vulkan support on Mali GPUs. Later in the year, we showed a great CPU overhead reduction in Vulkan compared to OpenGL ES.
This time we have added a deferred renderer to our Vulkan engine using Multipass. The main motivation for doing this was to test the performance against a classic multiple render target implementation.
We achieved significant performance improvements as well as a great bandwidth reduction, even though this scene is quite taxing. Going forward, I think Vulkan Multipass will be the key to enabling efficient deferred renderers on mobile hardware. We now have a standard API for expressing tile-based deferred rendering, which will be important for getting deferred techniques implemented in a wider range of applications on mobile.