Using asynchronous compute on Arm Mali GPUs: A practical sample

June 16, 2021


Asynchronous compute has proven itself to be an effective optimization technique, but pinning down how to apply it can be difficult. The idea started its life on last-generation console hardware but has since been made available on modern graphics APIs like Vulkan and D3D12. It is now part of a graphics programmer’s toolbox.

In this post, we will highlight a new Vulkan Sample that was added to Khronos’ sample repository which demonstrates how to use async compute. Check it out here.

“Async compute” is not necessarily a technique on its own, but it is a way to efficiently utilize hardware resources available on a modern GPU by submitting multiple streams of commands to the GPU at the same time. As we will explore later, it takes a fair bit of implementation-specific knowledge to make effective use of async compute.

The sample is an iteration of a blog post from 2018, which is available to view here. In that post, I demonstrated that compute-based post-processing could be feasible, and sometimes even lead to performance gains. There is now a sample you can run on your phone.

What is even the point of async compute?

Modern GPUs have several queues which can feed shader cores. The queue topologies between desktop and Arm Mali GPUs look different, and this difference will change how we approach async compute on Arm Mali.

On desktop-class hardware, we might see a lone GRAPHICS queue which can do everything, and many COMPUTE queues which can only run compute workloads. On the other hand, Arm Mali is a tile-based GPU and thus has a queue layout that maps to this style of rendering. Here the rendering pipeline is split in two, where vertex shading and tiling is one hardware queue, and fragment shading is another hardware queue. Compute workloads happen alongside vertex and tiling, as vertex shading is just compute shaders if you squint hard enough.

It is important to note here that a VkQueue on Arm Mali does not map to just one hardware queue; it maps to both the vertex/tiling and fragment queues. Multiple queues on Arm Mali do not map to different kinds of hardware. Rather, they just represent separate streams of commands. In Vulkan terminology, there is just one queue family to worry about.

The critical importance of keeping hardware queues fed

Tile-based rendering is essentially a two-stage pipeline, and in a pipeline we really do not want pipeline stalls. The happiest GPU is a GPU where the fragment queue is busy 100% of the time.

Vertex shading and tiling are some of the least efficient kinds of shading available because of intense bandwidth consumption per thread. Therefore, it is important that FRAGMENT queues can putter along while this work is going on. Having shader cores filled up with just geometry work will most likely just stall on external bandwidth.

Having dependencies like VERTEX / COMPUTE → FRAGMENT is perfectly fine. Problems arise when we start introducing FRAGMENT → VERTEX / COMPUTE dependencies, which is what this sample will explore and solve.

The case study: Compute for post effects

Using compute shaders for post-effects is getting more popular, and modern game engines are moving to a world where main pass rasterization occupies a smaller and smaller part of the rendering budget. By post-effects, we mean any compute pass which depends on any fragment shading from the current frame, for example, High Dynamic Range (HDR) bloom, depth-of-field, blurs.

Traditionally, post-effects would be (and should be) implemented as a series of render passes. However, using compute shaders is attractive for operations that are awkward to implement with fragment shaders. A common operation that comes to mind when mentioning HDR is a reduction pass. A long chain of render passes ending with a 1x1 render pass is not very fun.

The problematic bubble

When using post-effects, we can easily end up in a situation which breaks pipelining and significantly reduces performance.

VERTEX → FRAGMENT (scene render) → COMPUTE (post-magic) → how do we get on screen?

To get on screen, we must eventually do something in FRAGMENT, and we get the dreaded FRAGMENT → COMPUTE → FRAGMENT. With this barrier, we starve FRAGMENT shading, which is something we do not want.

Can we not just complete the entire frame in compute and present?

It is theoretically possible in Vulkan, and a couple of desktop games do this. But a significant stumbling block for mobile is how we are going to handle UI rendering. Rendering UI in a render pass, only to write that back to memory, and composite it in a compute pass later, is very wasteful from a bandwidth point of view. We should absolutely avoid this scenario if we can.

Starting point of sample

This is the sample we start out with. The scene composition is quite simple, which serves as a proxy for a larger compute-post heavy application. The resolutions are cranked up to make it easier to see performance differences:

Shadow map, 8K (VERTEX / FRAGMENT)
Main pass rendering, forward shading, 4K (VERTEX / FRAGMENT)
Threshold + Bloom blur (COMPUTE) – A proxy for complex post effects
Tonemap + UI (VERTEX / FRAGMENT) – Represents how we end the frame in fragment

As we can see from the performance metrics, there are issues. The GPU is active for 787 M cycles / second, but fragment shading is only active for 600 M cycles / second, or roughly 76% of that time. If we are not CPU bound, and not hitting V-Sync, this is a good sign we have a bubble to pop. It is also telling that when Vertex/Compute cycles shoot up, Fragment dips. This dip is the Threshold + Bloom blur pass.

How do you get those hardware stats?

For Arm Mali, there is a library available on GitHub. The Vulkan Samples framework can make use of this library to read hardware counters in real time, which is quite nifty indeed. These are the same counters that Arm Mobile Studio would give you.

Going async

Here we go with async, which allows us to pop the bubble. Finally, we see a nice, fully saturated Fragment queue.

The primary reason we get a decent gain here is that we can now run two things in parallel:

Shadow map for next frame (FRAGMENT)
Bloom (COMPUTE)

Rendering shadow maps is extremely rasterization bound, that is, fixed function hardware is heavily used, and the shader cores are mostly twiddling thumbs. This is the perfect time for us to inject some compute workloads. Vertex workloads would work great here as well, but we do not necessarily have enough vertex shading work to keep the GPU busy. Shifting some fragment work to compute makes sense here.

In this particular sample, we got a ~5% FPS gain on a Mali-G77 GPU, but these results are extremely content specific. It is important to note that even if Fragment cycles go up, performance does not scale linearly, since Vertex and Fragment still share the same shader cores. An active queue just means the GPU is ready to start dispatching work immediately if there are idle threads on a shader core. Any dips in activity can be filled in by the shader core schedulers.

The technique

The idea here is to realize that if there is no pipeline, we can conjure a pipeline into existence with the power of multiple VkQueues. Thus, we’re not just doing async compute, we’re also doing async graphics.

Implementation specifics

The technique will exploit some ideas:

Queue priorities can be used on Arm Mali, and higher priority queues can pre-empt lower priority queues. We can thank VR for making this feature a thing.
Queues break up dependency chains in the Vulkan API.

To explain how queues break up dependency chains, we must first understand how barriers work in Vulkan. A pipeline barrier splits all commands in two: what came before and what comes after. Those two halves are then ordered based on stage masks. Semaphores operate on a similar idea: a semaphore is signaled when everything that came before the signal is complete. Waiting means everything after the wait is blocked on the semaphore, subject to stage masks.

A FRAGMENT → COMPUTE → FRAGMENT barrier creates a situation where it is impossible to avoid a pipeline bubble. Barriers only affect ordering within a single VkQueue. The key here is to split the frame into two segments, and pipeline those instead:

All work that is required to render main pass → VkQueue #1 (lower prio)
Signal semaphore in VkQueue #1, wait in VkQueue #0
All work that comes after main render pass + present → VkQueue #0 (higher prio)

In this scheme, we never observe the dreaded FRAGMENT → COMPUTE barrier in VkQueue #1, so while VkQueue #0 is busy completing the frame for presentation, VkQueue #1 can happily power through and start rendering the next frame. This way we achieve proper pipelining.
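To make the effect of the split concrete, here is a minimal sketch of a toy timeline model. It is not taken from the sample: the tick costs, the `FrameCost` and `Timeline` types, the two-frames-in-flight cap, and the assumption that higher priority work on VkQueue #0 always wins the fragment hardware are all illustrative, and vertex work is ignored entirely.

```cpp
#include <cassert>

// Hypothetical per-frame costs in abstract "ticks" (made up, not measured).
struct FrameCost {
    int shadow_and_main; // fragment work submitted to VkQueue #1
    int bloom;           // compute work submitted to VkQueue #0
    int tonemap_ui;      // fragment work submitted to VkQueue #0
};

struct Timeline {
    int total_ticks;         // time until the last frame is presented
    int fragment_busy_ticks; // ticks where fragment hardware had work
};

// One VkQueue: the FRAGMENT -> COMPUTE -> FRAGMENT barriers serialise the
// frame, so fragment hardware idles for the whole bloom pass.
Timeline simulate_single_queue(const FrameCost& c, int frames) {
    assert(frames > 0);
    const int fragment = (c.shadow_and_main + c.tonemap_ui) * frames;
    return {fragment + c.bloom * frames, fragment};
}

// Two VkQueues: queue #1 renders shadow + main pass for upcoming frames,
// queue #0 (higher priority) runs bloom and tonemap for the current frame.
Timeline simulate_two_queues(const FrameCost& c, int frames) {
    assert(frames > 0 && c.shadow_and_main > 0 && c.bloom > 0 && c.tonemap_ui > 0);
    int rendered = 0, bloomed = 0, presented = 0; // frames past each step
    int render_left = c.shadow_and_main;
    int bloom_left = c.bloom;
    int tonemap_left = c.tonemap_ui;
    Timeline t{0, 0};
    while (presented < frames) {
        // Decide what can run from the state at the start of this tick.
        const bool can_bloom = bloomed < rendered;     // semaphore from queue #1
        const bool can_tonemap = presented < bloomed;  // COMPUTE -> FRAGMENT
        const bool can_render = rendered < frames && rendered - presented < 2;

        // Compute hardware runs bloom alongside whatever fragment is doing.
        if (can_bloom && --bloom_left == 0) { ++bloomed; bloom_left = c.bloom; }

        // Fragment hardware: queue #0 work pre-empts queue #1 work.
        if (can_tonemap) {
            ++t.fragment_busy_ticks;
            if (--tonemap_left == 0) { ++presented; tonemap_left = c.tonemap_ui; }
        } else if (can_render) {
            ++t.fragment_busy_ticks;
            if (--render_left == 0) { ++rendered; render_left = c.shadow_and_main; }
        }
        ++t.total_ticks;
    }
    return t;
}
```

With costs of 10, 3 and 2 ticks over three frames, the single-queue model takes 45 ticks while the two-queue model takes 39, and the only remaining fragment idle time is the bloom pass of the very last frame. Real hardware will not behave this cleanly, since vertex, compute and fragment work share shader cores.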

The final trick is to use queue priorities. VkQueue #0 needs to have a higher priority than #1, since queue #0 is always going to be closer to having a complete frame, and we really do not want queue #1 to block #0 from doing work. If that happens, we risk missing V-Blank.

Queue priorities must be declared up front in Vulkan. This is done during device creation:

VkDeviceCreateInfo device_info = { VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO };
VkDeviceQueueCreateInfo queue_info = { VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO };
device_info.queueCreateInfoCount = 1;
device_info.pQueueCreateInfos = &queue_info;
queue_info.queueFamilyIndex = 0; // Query with vkGetPhysicalDeviceQueueFamilyProperties
// One priority per queue: queue #0 is high priority, queue #1 is normal priority.
static const float prios[] = { 1.0f, 0.5f };
queue_info.pQueuePriorities = prios;
queue_info.queueCount = 2; // Query with vkGetPhysicalDeviceQueueFamilyProperties
vkCreateDevice(gpu, &device_info, nullptr, &device); // Check the returned VkResult in real code
vkGetDeviceQueue(device, 0, 0, &high_prio_queue);
vkGetDeviceQueue(device, 0, 1, &normal_prio_queue);
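The snippet above hardcodes the family index and queue count. As a rough sketch of the selection logic behind the "query" comments, the helper below picks a family that supports both graphics and compute and falls back to a single queue when only one is exposed. To stay self-contained it does not include the Vulkan headers: `QueueFamilyInfo`, `QueuePlan` and `plan_queues` are hypothetical names, with `QueueFamilyInfo` standing in for the `queueFlags` and `queueCount` fields of VkQueueFamilyProperties, and the two bit constants mirroring the values of VK_QUEUE_GRAPHICS_BIT and VK_QUEUE_COMPUTE_BIT.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for the relevant VkQueueFamilyProperties fields.
struct QueueFamilyInfo {
    uint32_t flags;       // mirrors queueFlags
    uint32_t queue_count; // mirrors queueCount
};

constexpr uint32_t kGraphicsBit = 0x1; // same value as VK_QUEUE_GRAPHICS_BIT
constexpr uint32_t kComputeBit = 0x2;  // same value as VK_QUEUE_COMPUTE_BIT

// What to put into VkDeviceQueueCreateInfo (hypothetical helper type).
struct QueuePlan {
    uint32_t family_index;
    std::vector<float> priorities; // one entry per queue to create
};

// Prefer a graphics + compute family with two queues so the frame can be
// split; otherwise fall back to one queue and skip the async path.
std::optional<QueuePlan> plan_queues(const std::vector<QueueFamilyInfo>& families) {
    constexpr uint32_t needed = kGraphicsBit | kComputeBit;
    std::optional<QueuePlan> fallback;
    for (uint32_t i = 0; i < families.size(); ++i) {
        const QueueFamilyInfo& f = families[i];
        if ((f.flags & needed) != needed || f.queue_count == 0) continue;
        if (f.queue_count >= 2) {
            QueuePlan plan{i, {1.0f, 0.5f}}; // high priority first
            // Never request more queues than the family exposes.
            assert(plan.priorities.size() <= f.queue_count);
            return plan;
        }
        if (!fallback) fallback = QueuePlan{i, {1.0f}};
    }
    return fallback;
}
```

In a real application the input would be filled from vkGetPhysicalDeviceQueueFamilyProperties, and `priorities.size()` and `priorities.data()` would feed `queueCount` and `pQueuePriorities`. When the plan contains a single priority, the whole frame goes to one queue as usual.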

What if I reorder submissions across frames instead?

That is certainly possible, but that means holding back frames so that reordering can take place, which typically increases input latency by one frame. This is not desirable for interactive applications, such as games.

Downsides with compute workloads

Is it time to compute all the things? Not necessarily. There are some issues with it that need to be considered before going all in. The general idea is that assuming equivalent work, a fragment thread is a bit more efficient than a compute thread, for various reasons:

Loss of framebuffer compression - With storage images, AFBC (Arm Frame Buffer Compression) is lost, meaning that bandwidth is hammered a bit harder than it could be.
Loss of Transaction Elimination - Another bandwidth-saving feature is eliminating redundant tile write-backs, which cannot be used with storage images.
Indirect starvation of fragment shaders - Earlier, I mentioned that VERTEX and COMPUTE workloads use the same hardware queue. If a very large part of the frame is spent running COMPUTE workloads, we can end up starving VERTEX from doing its job. If VERTEX is starved, we also starve FRAGMENT workloads indirectly.

Further best practices

This best practices document has a section about compute for image processing. In general, its advice still holds:

Do not use compute to process images generated by fragment shading. Doing so creates a backwards dependency that can cause a bubble. If fragment shader outputs are consumed by fragment shaders of later render passes, then render passes go through the pipeline more cleanly.

The goal of this study is to demonstrate that we can escape this unfortunate fate if we’re very particular about how we use the Vulkan API. As expected, we cannot optimize like this in OpenGL ES, as that API has no concept of multiple queues.

Conclusion

As the sample shows, there are ways to take advantage of compute shaders effectively on Arm Mali GPUs. They do require a fair amount of consideration, and measuring the results is critical. Async compute is a temperamental optimization that can squeeze out the last few percent of performance when done correctly. I hope this post will inspire some optimization ideas. Vulkan can take advantage of the hardware in ways older APIs cannot, and it would be a shame not to try making use of it.

