Vulkan is changing the landscape of graphics, ushering in a new age of visual fidelity for Android devices. While powerful, the Vulkan API can be quite complex for mobile developers. Therefore, at GDC 2019, Arm released a set of Vulkan samples that illustrated a comprehensive list of best practice recommendations. Since then, these have been donated to the Khronos Group and improved with contributions from other GPU vendors, as well as the well-known samples from Sascha Willems.
The main motivation behind the samples is to make it easy to experiment with different ways of doing the same thing in Vulkan. You can toggle options at run-time while monitoring the impact that they have on performance, thanks to the on-screen hardware counters. Finally, every sample is released alongside an article that explains the theory behind every best practice recommendation, as well as a guide to profiling using Arm Mobile Studio and other tools.
The main features that underlie all the samples in the framework are worth highlighting:
The repository contains samples that provide a reference for API features and new Vulkan extensions, as well as performance samples focusing on CPU and GPU optimizations. This blog covers a few of these performance samples, focusing on considerations around limited power and therefore bandwidth on mobile devices with tile-based rendering.
For immediate mode GPUs, found in desktop and console architectures, all the geometry is first processed and then added to a queue:
Fragment processing then proceeds from this queue one draw call at a time. For every pixel, it performs depth testing and color shading, reading from and writing to main memory as much as necessary. These operations can require exceptionally high bandwidth, which is energy intensive.
Compare this to tile-based architectures commonly found in mobile:
Here we divide the screen into regions of pixels known as tiles. The GPU rendering is then split into two phases:
The Vulkan API will let us optimize our rendering to take advantage of tile-local memory and save power on tile-based renderers.
A collection of attachments (e.g. depth and color) and the way they are used (e.g. shaders) is known in Vulkan as a Renderpass:
Renderpasses consist of one or more Subpasses. In Vulkan, when defining a Renderpass and its attachments, we need to specify load and store operations, i.e. what to do with the attachment before and after rendering. We define these for each attachment.
The load operation refers to what to do with the attachment before rendering. The available options are:
For immediate mode GPUs, these might all perform similarly, but on mobile LOAD_OP_CLEAR is far more efficient than LOAD_OP_LOAD. Tilers can efficiently clear all tile values before rendering, whereas loading the values from main memory requires expensive read operations. Most of the time LOAD_OP_LOAD is also unnecessary, since the values are going to be replaced anyway.
Note that LOAD_OP_CLEAR is different from using Vulkan’s vkCmdClearAttachments, which will instruct the GPU to explicitly write out a clear value to the attachment in main memory. This is again wasteful if we are going to write out a new value over it afterwards. Therefore, avoid using LOAD_OP_LOAD and vkCmdClearAttachments and use LOAD_OP_CLEAR or LOAD_OP_DONT_CARE whenever possible.
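In code, the load operation is part of the attachment description of the Renderpass. A minimal sketch using the standard Vulkan identifiers (format and layouts are illustrative):

```cpp
// A color attachment that is cleared in tile-local memory on load,
// rather than read back from main memory.
VkAttachmentDescription color{};
color.format        = VK_FORMAT_R8G8B8A8_UNORM;
color.samples       = VK_SAMPLE_COUNT_1_BIT;
color.loadOp        = VK_ATTACHMENT_LOAD_OP_CLEAR;  // cheap on tilers
color.storeOp       = VK_ATTACHMENT_STORE_OP_STORE;
color.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
color.finalLayout   = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
```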
Similarly, store operations define what to do with the attachment at the end of the Renderpass:
Usually the depth attachment is no longer needed after the Renderpass is complete. Therefore, we can save bandwidth if we avoid writing it out to main memory by using STORE_OP_DONT_CARE.
An attachment (such as depth) that does not need to be loaded from or stored to main memory can live entirely in tile-local memory, without being allocated in main memory at all. We refer to such attachments as transient. Our tutorial covers in detail which additional Vulkan flags are required to request this optimization. Our sample shows reductions of up to 36% and 62% in external read and write bytes respectively when using LOAD_OP_CLEAR and STORE_OP_DONT_CARE:
Render Passes video
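The transient depth set-up described above can be sketched as follows (a minimal fragment with the standard Vulkan identifiers; the surrounding Renderpass and full image creation are omitted):

```cpp
// A transient depth attachment: cleared on load, discarded on store,
// so it never touches main memory.
VkAttachmentDescription depth{};
depth.format        = VK_FORMAT_D32_SFLOAT;
depth.samples       = VK_SAMPLE_COUNT_1_BIT;
depth.loadOp        = VK_ATTACHMENT_LOAD_OP_CLEAR;
depth.storeOp       = VK_ATTACHMENT_STORE_OP_DONT_CARE; // never written out
depth.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
depth.finalLayout   = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;

// When creating the image itself, request transient usage...
VkImageCreateInfo imageInfo{VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO};
imageInfo.usage = VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT |
                  VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT;
// ...and back it with VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT memory,
// which lets the driver skip the main-memory allocation entirely.
```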
Consider now a slightly more complex example, with multiple subpasses and more attachments. For instance, deferred rendering using a G-buffer:
In this case, all the G-buffer attachments may be transient, since they are not needed after the Renderpass. This is possible because subpasses with a per-pixel dependency may be merged by the GPU, which processes all subpasses for a given tile and writes out only the final lit result to main memory. Our tutorial describes the necessary set-up and the G-buffer size limitations to consider in order to achieve subpass merging. Using merged subpasses rather than two separate Renderpasses achieves bandwidth savings of 45% and 56% in read and write bytes respectively, as it avoids writing out the G-buffer to main memory:
Render Subpasses video
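The key to merging is that the lighting subpass reads the G-buffer through input attachments, which keeps the per-pixel data in tile-local memory. A minimal sketch (attachment indices and the two-attachment G-buffer are illustrative):

```cpp
// The lighting subpass reads the G-buffer written by the previous
// subpass via input attachments instead of sampled textures.
VkAttachmentReference gbufferInputs[] = {
    {1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL}, // e.g. albedo
    {2, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL}, // e.g. normal
};
VkAttachmentReference swapchainRef{0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

VkSubpassDescription lighting{};
lighting.pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
lighting.inputAttachmentCount = 2;
lighting.pInputAttachments    = gbufferInputs;
lighting.colorAttachmentCount = 1;
lighting.pColorAttachments    = &swapchainRef;
```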
For deferred rendering examples like the previous one, we should emphasize the dependency between the G-buffer generation pass and the lighting pass. Within a Renderpass, we define a Vulkan subpass dependency for this. When synchronizing separate Renderpasses, however, the mechanism is slightly more complex, since their execution order cannot be assumed to match their submission order. For this case, Vulkan offers pipeline barriers.
Pipeline barriers work with pipeline stages. There is a Vulkan enumeration listing all the possible stages of the graphics pipeline. For this example, we will consider a subset of these:
Every command that we submit to a queue goes through some of these stages. TOP_OF_PIPE and BOTTOM_OF_PIPE are helper stages. These signal that a command has been parsed or that a command is retired, respectively. In Vulkan we do not synchronize individual commands. Instead we synchronize the work using these stages.
With a barrier, we are dividing the command stream into two parts. This means that all commands after the barrier must wait at a certain destination stage until all commands before the barrier have gone through a certain source stage.
Consider, for example, a barrier with source stage BOTTOM_OF_PIPE and destination stage TOP_OF_PIPE. This sort of barrier is inefficient for GPUs such as Mali, where we have two processing slots to do vertex and fragment work in parallel. Since commands must wait at TOP_OF_PIPE (i.e. at the very first stage of the pipeline) until all previous commands have reached BOTTOM_OF_PIPE, the work is heavily serialized and we introduce bubbles: no new work can start in the vertex stage until all previous commands have finished their fragment stage.
Compare this to a more relaxed barrier:
In this case, commands after the barrier must wait at the fragment stage until commands before the barrier have also gone through the fragment stage. This avoids bubbles since the vertex work from one Renderpass can proceed in parallel to the fragment work from a previous Renderpass.
Therefore, try to avoid BOTTOM->TOP dependencies and find the barrier that best matches your use case. Here, since the vertex work does not depend on the fragment work, the fragment-to-fragment barrier is the minimal correct barrier for the use case, and hence avoids over-synchronizing.
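A fragment-to-fragment barrier of this kind might be recorded as follows. This is a sketch: `cmd` and `image` are assumed to be a valid command buffer and the attachment shared between the two Renderpasses.

```cpp
// Transition the first pass's color output so the next pass can sample it,
// waiting only at the fragment stage rather than at TOP_OF_PIPE.
VkImageMemoryBarrier barrier{VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER};
barrier.srcAccessMask       = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;
barrier.oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
barrier.newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image               = image;
barrier.subresourceRange    = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};

vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // src: fragment writes done
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         // dst: wait only here
    0, 0, nullptr, 0, nullptr, 1, &barrier);
```

Because the destination stage is the fragment shader, vertex work for the second Renderpass is free to start while the first Renderpass is still shading fragments.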
Our sample includes graphs for vertex and fragment processing. In the case of a fragment to fragment barrier, these graphs show most of the vertex peaks occurring in parallel to the fragment plateau, maximizing throughput:
Pipeline Barriers video
As you can see in this detail from one of our scenes below, GPU rendering can sometimes result in jagged lines at model edges. Multisample anti-aliasing (MSAA) helps smooth these out:
With no MSAA, when rasterizing a polygon into pixels, a pixel is only shaded if its center lies within the rasterized primitive:
With 4x MSAA, the rasterizer instead defines four sample locations per pixel. The GPU shades the pixel if any of these four samples lies within the polygon, and the final color blends the fragment shader result with the existing color in proportion to the number of covered samples. This results in a smoother transition at the edges:
Note that the fragment shader is still only run once, calculating the color at the center of the pixel. It is assigned to the samples that lie within the polygon, whereas the remaining samples will keep the clear value. The value of all the samples is then averaged in a step that is known as the resolve operation.
As we saw earlier with the depth and G-buffer attachments, multisampled attachments should also be transient and live in tile-local memory. At the end of tile rendering, the GPU should write out only the final, resolved value, thus saving bandwidth. To do this in Vulkan, we need to again specify the right load/store operations and memory flags. Additionally, we need to configure the Subpass’ pResolveAttachments attribute for it to point to the single-sample attachment. This will contain the result of the resolve operation, in this case the Swapchain image. Resolving on writeback avoids writing out a large multi-sampled attachment to main memory:
Avoid using vkCmdResolveImage, since it instructs the GPU to store the large multisampled attachment to main memory and then read it back in a separate pass to resolve it. This uses a great deal of bandwidth and power unnecessarily:
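The resolve-on-writeback set-up can be sketched as follows (attachment indices are illustrative; attachment 0 is the transient 4x color attachment, attachment 1 the single-sample swapchain image):

```cpp
// Only the resolved single-sample image reaches main memory; the 4x
// attachment is transient and uses STORE_OP_DONT_CARE.
VkAttachmentReference msaaRef   {0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};
VkAttachmentReference resolveRef{1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

VkSubpassDescription subpass{};
subpass.pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpass.colorAttachmentCount = 1;
subpass.pColorAttachments    = &msaaRef;
subpass.pResolveAttachments  = &resolveRef; // resolved during tile writeback
```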
Our sample again uses the read and write hardware counters to show performance deltas of up to 440% when correctly configuring on-writeback resolve. It also includes a post-processing use case to explain how to resolve the depth attachment efficiently by using a new feature introduced in Vulkan 1.2:
Arm is working with Roblox, a leading games studio, to utilize the Vulkan Best Practices described above to target better performance for its games on Android. In a recorded talk for GDC 2020, Arseny Kapoulkine, a Technical Fellow at Roblox, and I talk in-depth about these Best Practices and how they were applied to Roblox’s leading mobile games. It’s great to work with leading games studios like Roblox to apply these Best Practices for spectacular gaming experiences!
Watch Arm's joint talk with Roblox