Vulkan is changing the landscape of graphics by ushering in a new age of visual fidelity for Android devices. While powerful, the Vulkan API can be quite complex for mobile developers. Therefore, at GDC 2019, Arm released a set of Vulkan samples that illustrated a comprehensive list of best practice recommendations. Since then, these have been donated to the Khronos Group and improved with contributions from other GPU vendors, as well as with the well-known samples from Sascha Willems.
Vulkan Samples
The main motivation behind the samples is to make it easy to experiment with different ways of doing the same thing in Vulkan. You can toggle options at run-time while monitoring the impact they have on performance, thanks to on-screen hardware counters. Finally, every sample is released alongside an article that explains the theory behind each best-practice recommendation, together with a guide to profiling using Arm Mobile Studio and other tools.
All the samples in the framework share these main features: run-time configurability, live hardware counters, and accompanying tutorials.
The repository contains samples that provide a reference for API features and new Vulkan extensions, and performance samples focusing on CPU and GPU optimizations. This blog covers a few of these performance samples. It focuses on the considerations around limited power and bandwidth on mobile devices with tile-based rendering.
For immediate mode GPUs, which are found in desktop and console architectures, the geometry is first processed and then added to a queue:
Fragment processing then proceeds from this queue one draw call at a time. For every pixel, it performs depth testing and color shading, reading from and writing to main memory as much as necessary. These operations can require exceptionally high bandwidth, which is energy intensive.
Compare this to tile-based architectures commonly found in mobile:
Here we divide the screen into regions of pixels known as tiles. The GPU rendering is then split into two phases: a geometry phase, which processes all the vertices and bins the resulting primitives into the tiles they cover, and a fragment phase, which shades each tile to completion using fast tile-local memory and writes the finished tile out to main memory only once at the end.
The Vulkan API will let us optimize our rendering to take advantage of tile-local memory and save power on tile-based renderers.
A collection of attachments, for example depth and color, together with the way they are used by the rendering work (shaders), is known in Vulkan as a Renderpass:
Renderpasses consist of one or more Subpasses. In Vulkan, when defining a Renderpass and its attachments, we need to specify load and store operations, that is, what to do with the attachment before and after rendering. We define these for each attachment.
The load operation refers to what to do with the attachment before rendering. The available options are LOAD_OP_LOAD, which preserves the previous contents of the attachment by reading them back from memory, LOAD_OP_CLEAR, which clears the attachment to a specified value, and LOAD_OP_DONT_CARE, which leaves the previous contents undefined because they are not needed.
For immediate mode GPUs, these might all perform similarly, but on mobile LOAD_OP_CLEAR is far more efficient than LOAD_OP_LOAD. Tilers can efficiently clear all tile values before rendering, whereas loading the values from main memory requires expensive read operations. Also, most of the time it is not necessary to use LOAD_OP_LOAD, since the values are going to be replaced anyway.
Note that LOAD_OP_CLEAR is different from using Vulkan’s vkCmdClearAttachments, which will instruct the GPU to explicitly write out a clear value to the attachment in main memory. This is again wasteful if we are going to write out a new value over it afterwards. Therefore, avoid using LOAD_OP_LOAD and vkCmdClearAttachments and use LOAD_OP_CLEAR or LOAD_OP_DONT_CARE whenever possible.
Similarly, store operations define what to do with the attachment at the end of the Renderpass: STORE_OP_STORE writes the attachment contents out to main memory, whereas STORE_OP_DONT_CARE discards them because they are not needed after rendering.
Usually the depth attachment is no longer needed after the Renderpass is complete. Therefore, we can save bandwidth if we avoid writing it out to main memory by using STORE_OP_DONT_CARE.
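As an illustration, not taken verbatim from the samples, a depth attachment that is cleared at the start of the Renderpass and discarded at the end might be described as follows; the format and layouts are assumptions for the example:

```cpp
// Sketch: a depth attachment that is cleared on load and discarded on store,
// so its contents never have to travel to or from main memory.
VkAttachmentDescription depth_attachment{};
depth_attachment.format         = VK_FORMAT_D32_SFLOAT;              // assumed depth format
depth_attachment.samples        = VK_SAMPLE_COUNT_1_BIT;
depth_attachment.loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR;       // cheap for tilers
depth_attachment.storeOp        = VK_ATTACHMENT_STORE_OP_DONT_CARE;  // not needed after the pass
depth_attachment.stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
depth_attachment.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
depth_attachment.initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED;
depth_attachment.finalLayout    = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;
```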
An attachment, such as depth, that does not need to be loaded from or stored to main memory can live entirely in tile-local memory and does not need to be backed by an allocation in main memory at all. We refer to such attachments as transient. Our tutorial covers in detail which additional Vulkan flags are required to request this optimization. Our sample shows improvements of up to 36% and 62% for external read and write bytes respectively when using LOAD_OP_CLEAR and STORE_OP_DONT_CARE:
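For reference, this is roughly how such a transient attachment can be requested; a minimal sketch, assuming a 2D depth image with an example extent, and noting that a lazily allocated memory type may not be available on every implementation:

```cpp
// Sketch: creating a transient depth attachment. TRANSIENT_ATTACHMENT tells the
// driver that the image only ever needs to live in tile-local memory.
uint32_t width = 1280, height = 720;   // example extent

VkImageCreateInfo image_info{VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO};
image_info.imageType   = VK_IMAGE_TYPE_2D;
image_info.format      = VK_FORMAT_D32_SFLOAT;
image_info.extent      = {width, height, 1};
image_info.mipLevels   = 1;
image_info.arrayLayers = 1;
image_info.samples     = VK_SAMPLE_COUNT_1_BIT;
image_info.tiling      = VK_IMAGE_TILING_OPTIMAL;
image_info.usage       = VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT |
                         VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT;

// When binding memory, prefer a LAZILY_ALLOCATED memory type so that no backing
// storage is ever committed in main memory.
VkMemoryPropertyFlags preferred_flags = VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT;
```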
Render Passes video
Consider now a slightly more complex example, with multiple subpasses and more attachments. For instance, deferred rendering using a G-buffer:
In this case, all the G-buffer attachments may be transient, since they are not needed after the Renderpass. This is possible because subpasses that have a per-pixel dependency may be merged by the GPU, which processes all subpasses for a given tile and writes out only the final lit result to main memory. Our tutorial describes the necessary set-up and the G-buffer size limitations to consider in order to achieve subpass merging. Using merged subpasses rather than two separate Renderpasses achieves bandwidth savings of 45% and 56% in read and write bytes respectively, since it avoids writing the G-buffer out to main memory:
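The exact requirements are covered in the tutorial, but the core of the set-up can be sketched as follows: the lighting subpass reads the G-buffer through input attachments rather than sampled textures, which is what expresses the per-pixel dependency. The attachment indices here are illustrative only:

```cpp
// Sketch: subpass 0 writes two G-buffer attachments, subpass 1 reads them back
// as input attachments (straight from tile memory) and writes the lit result.
VkAttachmentReference gbuffer_writes[] = {
    {1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},
    {2, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},
};
VkAttachmentReference gbuffer_reads[] = {
    {1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
    {2, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
};
VkAttachmentReference swapchain_write{0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

VkSubpassDescription subpasses[2]{};
subpasses[0].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpasses[0].colorAttachmentCount = 2;
subpasses[0].pColorAttachments    = gbuffer_writes;

subpasses[1].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpasses[1].inputAttachmentCount = 2;
subpasses[1].pInputAttachments    = gbuffer_reads;    // per-pixel reads of the G-buffer
subpasses[1].colorAttachmentCount = 1;
subpasses[1].pColorAttachments    = &swapchain_write;
```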
Render Subpasses video
For deferred rendering examples like the previous one, we should emphasize the dependency that exists between the G-buffer generation pass and the lighting pass. In this instance, we need to define a Vulkan subpass dependency. When synchronizing separate Renderpasses, however, the mechanism is slightly more complex, since their execution order cannot be assumed to match their submission order. For this case, Vulkan offers pipeline barriers.
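Before moving on to pipeline barriers, here is roughly what the subpass dependency for the deferred example could look like. The stage and access masks are assumptions based on the lighting subpass reading the G-buffer in its fragment shader, and VK_DEPENDENCY_BY_REGION_BIT declares the dependency as per-pixel, which is what allows the subpasses to be merged:

```cpp
// Sketch: subpass 1 (lighting) waits only on the per-pixel results produced by
// subpass 0 (G-buffer), so the whole chain can stay inside a single tile.
VkSubpassDependency gbuffer_to_lighting{};
gbuffer_to_lighting.srcSubpass      = 0;
gbuffer_to_lighting.dstSubpass      = 1;
gbuffer_to_lighting.srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
gbuffer_to_lighting.dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
gbuffer_to_lighting.srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
gbuffer_to_lighting.dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
gbuffer_to_lighting.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
```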
Pipeline barriers work with pipeline stages. There is a Vulkan enumeration listing all the possible stages of the graphics pipeline. For this example, we will consider a subset of these:
Every command that we submit to a queue goes through some of these stages. TOP_OF_PIPE and BOTTOM_OF_PIPE are helper stages. These signal that a command has been parsed or that a command is retired, respectively. In Vulkan we do not synchronize individual commands. Instead we synchronize the work using these stages.
With a barrier, we are dividing the command stream into two parts. This means that all commands after the barrier must wait at a certain destination stage until all commands before the barrier have gone through a certain source stage.
For example, a barrier whose source stage is BOTTOM_OF_PIPE and whose destination stage is TOP_OF_PIPE is inefficient for GPUs such as Mali, where we have two processing slots to run vertex and fragment work in parallel. Since commands must wait at TOP_OF_PIPE, that is, at the very first stage of the pipeline, until all previous commands have reached BOTTOM_OF_PIPE, the work is heavily serialized and we introduce bubbles. No new work can start in the vertex stage until all previous commands have finished going through their fragment stage.
Compare this to a more relaxed barrier:
In this case, commands after the barrier must wait at the fragment stage until commands before the barrier have also gone through the fragment stage. This avoids bubbles since the vertex work from one Renderpass can proceed in parallel to the fragment work from a previous Renderpass.
Therefore, try to avoid BOTTOM->TOP dependencies and find the barrier that best applies to your use case. Here, since the vertex work does not depend on the fragment work, the fragment-to-fragment barrier is the minimal correct barrier that covers the use case and hence avoids over-synchronizing.
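As a concrete sketch, consider a render-to-texture pass whose output is sampled by the fragment shader of a later pass. The command buffer and image handles below are hypothetical, and the exact masks depend on the use case, but a minimal barrier along these lines keeps vertex work flowing:

```cpp
// Sketch: synchronize only what is needed. Fragment shading of the next pass waits
// for the color output of the previous pass; vertex work is free to run in parallel.
// 'cmd' and 'offscreen_color' are assumed to exist already.
VkImageMemoryBarrier barrier{VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER};
barrier.srcAccessMask       = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;
barrier.oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
barrier.newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image               = offscreen_color;
barrier.subresourceRange    = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};

vkCmdPipelineBarrier(cmd,
                     VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,  // source: previous fragment work
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,          // destination: next fragment work
                     0, 0, nullptr, 0, nullptr, 1, &barrier);
```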
Our sample includes graphs for vertex and fragment processing. In the case of a fragment to fragment barrier, these graphs show most of the vertex peaks occurring in parallel to the fragment plateau, maximizing throughput:
Pipeline Barriers video
As you can see in this close-up from one of the scenes shown later, GPU rendering can sometimes result in jagged lines at model edges. Multisample anti-aliasing (MSAA) helps smooth these out:
With no MSAA, when rasterizing a polygon into pixels, pixels are only shaded if the center of the pixel lies within the rasterized primitive:
With 4x MSAA, the rasterizer defines the location of four samples. The GPU then shades a pixel if any of these four samples lies within the polygon by proportionally blending the result of the fragment shader at the center with the existing color. This results in a smoother transition at the edges:
Note that the fragment shader is still only run once, calculating the color at the center of the pixel. It is assigned to the samples that lie within the polygon, whereas the remaining samples will keep the clear value. The value of all the samples is then averaged in a step that is known as the resolve operation.
As we saw earlier with the depth and G-buffer attachments, multisampled attachments should also be transient and live in tile-local memory. At the end of tile rendering, the GPU should write out only the final, resolved value, thus saving bandwidth. To do this in Vulkan, we again need to specify the right load and store operations and memory flags. Additionally, we need to set the Subpass's pResolveAttachments member so that it points to the single-sample attachment that will contain the result of the resolve operation, in this case the Swapchain image. Resolving on writeback avoids writing out a large multisampled attachment to main memory:
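Sketched in code, the relevant part of the subpass set-up might look like this; attachment 0 is assumed to be the multisampled (transient) color attachment and attachment 1 the single-sample Swapchain image:

```cpp
// Sketch: resolve on writeback. The multisampled attachment stays in tile memory
// and only the resolved Swapchain image is written out to main memory.
VkAttachmentReference ms_color{0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};
VkAttachmentReference resolve_target{1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

VkSubpassDescription subpass{};
subpass.pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpass.colorAttachmentCount = 1;
subpass.pColorAttachments    = &ms_color;
subpass.pResolveAttachments  = &resolve_target;   // resolved as the tile is written back

// Matching store operations on the attachment descriptions:
//   attachment 0 (4x MSAA, transient): storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE
//   attachment 1 (Swapchain image):    storeOp = VK_ATTACHMENT_STORE_OP_STORE
```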
Avoid using vkCmdResolveImage, since it instructs the GPU to store the large multisampled attachment to main memory and then read it back in a separate pass to resolve it. This wastes bandwidth and power unnecessarily:
Our sample again uses the read and write hardware counters to show performance deltas of up to 440% when correctly configuring on-writeback resolve. It also includes a post-processing use case to explain how to resolve the depth attachment efficiently by using a new feature introduced in Vulkan 1.2:
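The Vulkan 1.2 feature referred to here is the depth/stencil resolve functionality (promoted from VK_KHR_depth_stencil_resolve), which is used with the VkRenderPassCreateInfo2 family of structures. A rough sketch follows, with an illustrative attachment index and a resolve mode that should be checked against device support:

```cpp
// Sketch (Vulkan 1.2 / vkCreateRenderPass2): resolve the multisampled depth
// attachment on writeback instead of storing it at full sample count.
VkAttachmentReference2 resolved_depth{VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2};
resolved_depth.attachment = 2;                                            // illustrative index
resolved_depth.layout     = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;
resolved_depth.aspectMask = VK_IMAGE_ASPECT_DEPTH_BIT;

VkSubpassDescriptionDepthStencilResolve depth_resolve{
    VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_DEPTH_STENCIL_RESOLVE};
depth_resolve.depthResolveMode               = VK_RESOLVE_MODE_SAMPLE_ZERO_BIT;  // required to be supported for depth
depth_resolve.pDepthStencilResolveAttachment = &resolved_depth;

VkSubpassDescription2 subpass{VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_2};
subpass.pNext = &depth_resolve;   // chain the resolve info onto the subpass
```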
MSAA video
Arm is working with Roblox, a leading games studio, to apply the Vulkan Best Practices described above and improve the performance of its games on Android. In a recorded talk for GDC 2020, Arseny Kapoulkine, a Technical Fellow at Roblox, and I talk in depth about these Best Practices and how they were applied to Roblox's leading mobile games. It is great to work with leading games studios like Roblox to apply these best practices for spectacular gaming experiences.
[CTAToken URL = "https://youtu.be/BXlo09Kbp2k" target="_blank" text="Watch Arm's joint talk with Roblox" class ="green"]