Using Compute Post-Processing in Vulkan on Mali

Vulkan fully supports compute shaders; however, they are quite underutilized on mobile. The reasons might be not having content which can take advantage of them, not knowing how to use compute, or not knowing how to use it efficiently on Mali. Our architecture has some quirks which we need to consider in order to use compute optimally. Our best practices guide says to avoid FRAGMENT -> COMPUTE workloads due to the bubbles which will inevitably follow. This is generally good advice, and this blog post will explain some of the reasons why, but I have recently stumbled upon a method which can make this practice obsolete, if you know what you are doing and are willing to go the extra mile to make it work well.

This blog post will outline how we can use two Vulkan queues with different priorities to enable optimal scheduling for compute post-processing workloads. This is an example where explicit control of work submission in Vulkan can enable things which would be very hard to achieve in GLES.

The two hardware queues

On current hardware, we have two hardware queues which we can submit work to. Shader cores can handle work from both queues concurrently. Making use of this is critical to get optimal performance.

Vertex / Tiling / Compute

This is a catch-all queue which deals with geometry and compute. Vertex shading for a render pass happens here, and so do compute dispatches. The first nugget of knowledge which becomes important later is:

Vertex and compute workloads cannot run concurrently.

Fragment

This queue only deals with the fragment side of a render pass. This queue tends to be - and should be - the bottleneck. For optimal pipelining, this queue should be active 100% of the time, unless there is no more work to do, e.g. you are rendering at higher than 60 FPS and the GPU can go to sleep.

Fragment workloads should always run on the shader cores if possible. Run compute or vertex workloads in parallel with fragment.

Tile-based deferred renderer pipelining

The common pipelining scheme for graphics on TBDR is a two-stage pipeline where geometry is always shaded first, then the fragment stage kicks in later and completes the render pass. Vertex and tiling for multiple render passes can complete their work ahead of time to keep the fragment stage constantly fed with work.

Tile-Based Render Data Flow

Pre-emption

While a workload is running in a hardware queue, it can be pre-empted by more “important” work. This is typically how asynchronous time-warp in VR is done. You need to be able to stop what the GPU is doing and insert something more important to do.

The Vulkan queues

Hardware queues in Vulkan are abstracted through the VkQueue object. On our current drivers, we expose two queues, which you request when calling vkCreateDevice. You might think they map to the two hardware queues respectively, but this is not the case. A single VkQueue will submit work to either hardware queue, depending on which pipeline stage is used. Anything related to VERTEX_SHADER_BIT or COMPUTE_SHADER_BIT goes into the vertex/tiling/compute hardware queue, and FRAGMENT_SHADER_BIT and friends go into the fragment hardware queue.

So why do we expose two queues? An important feature to consider is priority. If pQueuePriorities for one queue is higher than for the other, we now have a distinction between a “low” priority queue and a “high” priority one, which can be useful in many different scenarios. We get the pre-emption effect if we have two VkQueues with different priorities, and we can and will abuse this for our use case.

The two kinds of compute workloads

For completeness, I would first like to discuss the simpler case for compute, which is very easy to map efficiently to Mali.

Compute early in the pipeline

Certain compute workloads happen logically very early in the frame, and the results are consumed later by vertex and fragment shaders. Some examples are:

  • Clustered shading, a compute shader can update the clusters ahead of time
  • Heightmap generation for terrain, ocean, etc
  • Compute-based skinning

Because compute happens early in the pipeline, just like vertex shading, this is easy to deal with. Compute is not waiting for fragment, so there is no reason why fragment cannot run 100% of the time. There is an edge case with write-after-read hazards. If compute writes to the same resource every frame, we need to make sure that fragment work from the previous frame is not still reading that data, as only one stage can access the shared resource at a time once there are writers in the picture. However, there are two easy approaches to avoid creating pipeline bubbles here.

  • Ring-buffer the resource. You can break up the write-after-read hazard by cycling through multiple copies of the resource. Compute then only needs to wait for a render pass which happened several frames ago, which breaks the bubble (see the sketch after this list).
  • Have enough fragment work queued up after the last render pass which consumes the resource. This way, we will not drain the fragment stage while working on compute next frame.
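
As an illustration of the ring-buffer approach, here is a minimal sketch. The names (FRAME_RING_SIZE, FrameResources, beginFrame) and the ring size are hypothetical; the point is simply that compute writes a different copy of the resource each frame, so it never waits on recent fragment work.

    #include <vulkan/vulkan.h>

    // Hypothetical sketch of ring-buffering a compute-written resource.
    // With N copies, compute in frame F only waits for the fragment work of
    // frame F - N, which completed long ago, so no bubble is introduced.
    constexpr uint32_t FRAME_RING_SIZE = 4; // illustrative value

    struct FrameResources
    {
        VkBuffer computeOutput;      // written by compute, read by vertex/fragment
        VkDeviceMemory memory;
        VkDescriptorSet descriptors; // points at this frame's copy
    };

    FrameResources ring[FRAME_RING_SIZE];
    uint32_t frameIndex = 0;

    void beginFrame()
    {
        // Pick this frame's copy; the copies still being read by fragment work
        // from the previous few frames are left untouched.
        FrameResources &frame = ring[frameIndex % FRAME_RING_SIZE];
        // ... record the compute dispatch writing frame.computeOutput,
        //     then the render passes which read it ...
        frameIndex++;
    }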

Compute late in the pipeline

This is the awkward part, and something which we generally discourage. Let’s explore why in the context of post-processing.

Generally, most uses of compute for post-processing will have a frame looking like this:

  • Render auxiliary buffers, like shadow maps, reflection maps, etc
  • Fragment shader renders the main scene
  • Compute post-processes the results from main scene
  • Composite and render UI in fragment
  • Present

We need to end with a render pass, because compositing and rendering UI in compute is not very practical. It is certainly possible to avoid using fragment at the end of the frame, but it is quirky at best. Relying on the STORAGE_BIT usage flag for the swap chain is asking for trouble, as it is not guaranteed to be supported. Plus, we lose opportunities for framebuffer compression and transaction elimination on the swap chain.
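
As an aside, whether the swap chain can be written from compute at all is queryable up front. A minimal sketch of that check, assuming physicalDevice and surface come from the usual swap chain setup:

    #include <vulkan/vulkan.h>

    // Sketch: query whether the presentation surface supports STORAGE usage.
    // Assumes physicalDevice and surface were created during normal setup.
    bool surfaceSupportsStorage(VkPhysicalDevice physicalDevice, VkSurfaceKHR surface)
    {
        VkSurfaceCapabilitiesKHR caps = {};
        if (vkGetPhysicalDeviceSurfaceCapabilitiesKHR(physicalDevice, surface, &caps) != VK_SUCCESS)
            return false;
        // If this bit is missing, writing the swap chain image from compute is not an option.
        return (caps.supportedUsageFlags & VK_IMAGE_USAGE_STORAGE_BIT) != 0;
    }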

The problem case is the bubble we have created. First, we have a pipeline barrier or semaphore which implements a FRAGMENT -> COMPUTE dependency. We do the post-processing, and to perform the composite/UI pass, we need a COMPUTE -> FRAGMENT barrier. You might think “I’ll just use async compute!”, but no, because there is nothing we can truly run asynchronously here. The chain of FRAGMENT -> COMPUTE -> FRAGMENT is serial even if we use multiple queues. Because of this dependency chain, we create a scenario where the fragment hardware queue will go idle no matter what.
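
To make that dependency chain concrete, a single-queue frame ends up recording roughly the following. This is a simplified sketch of my own (plain memory barriers instead of per-image barriers, a placeholder function name), not the exact code measured later in this post:

    #include <vulkan/vulkan.h>

    // Sketch of the serial FRAGMENT -> COMPUTE -> FRAGMENT chain in a single queue.
    // Real code would use image barriers with proper layouts and subresource ranges.
    void recordPostProcessingChain(VkCommandBuffer cmd)
    {
        VkMemoryBarrier barrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };

        // ... render the main scene (fragment) ...

        // FRAGMENT -> COMPUTE: post-processing waits for the scene color buffer.
        barrier.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
        barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                             0, 1, &barrier, 0, nullptr, 0, nullptr);

        // ... dispatch post-processing (compute) ...

        // COMPUTE -> FRAGMENT: the UI/composite pass waits for the post result.
        barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
        barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                             VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                             0, 1, &barrier, 0, nullptr, 0, nullptr);

        // ... render UI / composite (fragment), then present ...
    }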

The biggest risk we take by letting fragment stall is that vertex/tiling work can end up running alone on the shader cores. This is very undesirable. Vertex/tiling work is far more “sluggish” than most workloads, and really needs to run concurrently with fragment to hide this.

We also lose out on a lot of fixed-function work - rasterization, depth testing, varying interpolation, tile memory write-out, and so on - which can run in parallel with shading. We should run this fixed-function workload alongside some “real” work to maximize hardware utilization. Fragment work can make use of all the fixed-function blocks of the hardware; compute cannot.

Another problem with having FRAGMENT -> COMPUTE is that we are now in a situation where fragment will eventually starve. If we have long-running COMPUTE workloads, this can block VERTEX work from happening later, and that in turn will block FRAGMENT from working. Remember that VERTEX work runs in the same hardware queue as COMPUTE.

Now, with compute shaders you can certainly saturate the entire GPU and write very fast code with clever use of shared memory and the like, but you will still miss out on the fixed-function work which could have run in parallel. For example, consider shadow maps, which are almost entirely fixed-function rasterization and depth testing.

So, to summarize, the rationale for discouraging FRAGMENT -> COMPUTE style workloads is that it is very easy to end up in a situation with bubbles where FRAGMENT processing goes idle.

Breaking the bubble with manual interleaving

Now, given our prototypical frame of:

  • Shadow maps
  • FRAGMENT -> FRAGMENT barrier
  • Scene
  • FRAGMENT -> COMPUTE barrier
  • Post
  • COMPUTE -> FRAGMENT barrier
  • UI

We might try to break the bubble using a manual interleaving method. We could try to schedule something like this:

  • Shadow map (frame #0)
  • FRAGMENT -> FRAGMENT barrier
  • Scene (frame #0)
  • FRAGMENT -> COMPUTE | FRAGMENT
  • Post (frame #0) + Shadow map (frame #1)
  • COMPUTE | FRAGMENT -> FRAGMENT
  • UI (frame #0)

This would indeed break the bubble, but there are problems with this approach.

Longer latency

By interleaving frames like this, we will almost certainly increase the latency of the application. We cannot submit frame #0 to the GPU before we know what frame #1 is going to look like. The only sensible option here is to add a frame of input latency, which is not great. We could simply defer submission of the command buffers, but then we would need a deeper swap chain to deal with the fact that we haven’t called vkQueuePresentKHR yet. Having two frames in flight on the CPU is quite awkward in general.

A cleaner solution

So far, I’ve been building up how and why bubbles occur. Let’s see how we can solve the problem more elegantly using two queues.

We touched upon a flawed solution using a manual interleaving scheme, but can we replicate that without adding a lot of complexity and submission latency?

The key to understanding this is the barrier which causes the bubble: COMPUTE -> FRAGMENT. Pipeline barriers wait for “everything before” and block “everything after”. The “everything after” part is significant, because it means the next frame’s fragment work has no way to interleave behind the barrier.

Crucially, pipeline barriers only apply within the VkQueue they are submitted to. If we can avoid submitting that particular COMPUTE -> FRAGMENT barrier in the main rendering queue, we technically allow interleaving of render passes without doing it manually.

By moving both post and UI compositing to the second, high-priority queue, we can do some rather interesting things. Our two queues are universal queues with GRAPHICS support, and they can both present.

We are going to use a lot more semaphores than normal, but there isn’t any extra overhead to really worry about.

Let queue #0 be a low priority queue, and #1 be a high priority one. As mentioned earlier, just make sure pQueuePriorities for queue #1 is greater than that of queue #0 and you will be fine. I used 0.5 for queue #0 and 1.0 for queue #1.
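
A minimal sketch of that queue setup, under the assumption that queueFamilyIndex refers to the universal graphics/compute family and that it exposes at least two queues (enabled extensions and features are omitted):

    #include <vulkan/vulkan.h>

    // Sketch: create the device with two queues of different priorities.
    void createDeviceAndQueues(VkPhysicalDevice physicalDevice, uint32_t queueFamilyIndex,
                               VkDevice &device, VkQueue &lowPriorityQueue, VkQueue &highPriorityQueue)
    {
        const float priorities[2] = { 0.5f, 1.0f }; // queue #0 = low, queue #1 = high

        VkDeviceQueueCreateInfo queueInfo = { VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO };
        queueInfo.queueFamilyIndex = queueFamilyIndex;
        queueInfo.queueCount = 2;
        queueInfo.pQueuePriorities = priorities;

        VkDeviceCreateInfo deviceInfo = { VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO };
        deviceInfo.queueCreateInfoCount = 1;
        deviceInfo.pQueueCreateInfos = &queueInfo;
        // ... enabled extensions (e.g. VK_KHR_swapchain) and features go here ...

        vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);

        vkGetDeviceQueue(device, queueFamilyIndex, 0, &lowPriorityQueue);  // queue #0
        vkGetDeviceQueue(device, queueFamilyIndex, 1, &highPriorityQueue); // queue #1
    }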

Basically, we will use queue #0 and queue #1 as a pipeline. The early stages of a frame will go into queue #0, and the later stages will go into queue #1. The pipeline barriers in queue #0 will not pollute the barriers in queue #1 and vice versa.

The frames will look something like this:

Frame #0

Queue #0

  • Render shadows
  • FRAGMENT -> FRAGMENT
  • Render main pass
  • Signal semaphore #0

Queue #1

  • Wait semaphore #0 in COMPUTE
  • Post
  • COMPUTE -> FRAGMENT
  • UI/compositing
  • Signal semaphore #1
  • Present

Frame #1

Queue #0

  • FRAGMENT -> FRAGMENT (write-after-read hazard)
  • Render shadows
  • Wait semaphore #1 (write-after-read hazard)
  • FRAGMENT -> FRAGMENT
  • Render main pass
  • Signal semaphore #2

Queue #1

  • Wait semaphore #2 in COMPUTE
  • Post
  • COMPUTE -> FRAGMENT
  • UI/compositing
  • Signal semaphore #3
  • Present

This kind of scheme allows us to submit the entire frame up front, and still lets the driver interleave the shadow rendering pass when we submit the next frame. The driver has an easy time interleaving workloads when they live in different queues.

We still express work such that the post processing can pipeline nicely with the shadow mapping pass.
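
For illustration, the per-frame submissions might look roughly like this. This is a sketch under my own naming assumptions (cmdEarly holds shadows plus the main pass, cmdLate holds post plus UI); the swap chain acquire semaphore and the extra write-after-read semaphore for the next frame are left out for brevity:

    #include <vulkan/vulkan.h>

    // Sketch: one frame split across the two queues.
    // cmdEarly: shadows + main pass. cmdLate: post + UI/composite.
    void submitFrame(VkQueue lowPriorityQueue, VkQueue highPriorityQueue,
                     VkCommandBuffer cmdEarly, VkCommandBuffer cmdLate,
                     VkSemaphore mainPassDone, VkSemaphore renderFinished,
                     VkSwapchainKHR swapchain, uint32_t imageIndex)
    {
        // Queue #0 (low priority): early part of the frame, signals once the main pass is done.
        VkSubmitInfo earlySubmit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
        earlySubmit.commandBufferCount = 1;
        earlySubmit.pCommandBuffers = &cmdEarly;
        earlySubmit.signalSemaphoreCount = 1;
        earlySubmit.pSignalSemaphores = &mainPassDone;
        vkQueueSubmit(lowPriorityQueue, 1, &earlySubmit, VK_NULL_HANDLE);

        // Queue #1 (high priority): waits for the main pass in the COMPUTE stage only,
        // so the barriers recorded in cmdLate never block queue #0's next-frame work.
        VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
        VkSubmitInfo lateSubmit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
        lateSubmit.waitSemaphoreCount = 1;
        lateSubmit.pWaitSemaphores = &mainPassDone;
        lateSubmit.pWaitDstStageMask = &waitStage;
        lateSubmit.commandBufferCount = 1;
        lateSubmit.pCommandBuffers = &cmdLate;
        lateSubmit.signalSemaphoreCount = 1;
        lateSubmit.pSignalSemaphores = &renderFinished;
        vkQueueSubmit(highPriorityQueue, 1, &lateSubmit, VK_NULL_HANDLE);

        // Present from queue #1 once the UI/composite pass has finished.
        VkPresentInfoKHR present = { VK_STRUCTURE_TYPE_PRESENT_INFO_KHR };
        present.waitSemaphoreCount = 1;
        present.pWaitSemaphores = &renderFinished;
        present.swapchainCount = 1;
        present.pSwapchains = &swapchain;
        present.pImageIndices = &imageIndex;
        vkQueuePresentKHR(highPriorityQueue, &present);
    }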

Another very cool side effect of this is the lower GPU latency we can achieve. Using the pre-emption mechanism, it is possible for the UI pass to pre-empt the shadow mapping pass of the next frame, because the queue it is running in has higher priority. This reduces GPU latency, because we can present earlier than we would otherwise be able to. The GPU does not have to wait for shadow mapping to complete if it takes longer than the post-processing for some reason. We are also certain that when running post, we do not end up running vertex or tiling work for the next frame either; the compute workload will take priority. Overall, this is a pretty neat system which enables modern compute-based post-processing pipelines on Mali.

Results

Let’s stare at Streamline a bit, which I captured on a Galaxy S9 (Exynos variant). I experimented with my simple renderer which implements the frame shown above. (The screenshots below are for illustration; I rendered something more complex when collecting the numbers.)

  • Render shadow map

Render Shadow Map

  • Render scene with lighting (HDR)

Render Scene with lighting

  • Bloom threshold and pyramid as my post-process

Bloom Threshold

Dramatic Threshold

Overkill bloom for dramatic effect!

  • In the final pass, tonemap and render the UI on top in the same pass.

I tried three things:

  1. Plain implementation using fragment for bloom, which is what I was doing initially. Doing a lot of back-to-back smaller render passes can be quite bad for performance as well, so this blog post started out as “what if I move bloom to compute instead?”
  2. Then, move the bloom as-is to compute in the same queue, creating a bubble.
  3. Finally, use two queues and the async method above to try to get 100% fragment utilization.

I verified that GPU clocks were the same for the test, to avoid any power management shenanigans.

Fragment-based method

Mali Graphics Debugger Fragment method

Looking at Mali Core Cycles, we can see when fragment shading and compute (which includes vertex shading) are active on the shader cores. The fragment cycles counter is active all the time, as we would expect, but it is not at 100%. This is because of the back-to-back small render passes which drain and fill the shader cores all the time. We do get pipelining with vertex work as expected.

The average frame time for this implementation is 21.5 ms on my device.

Compute-based method

Mali Graphics Debugger compute method

Here I’ve moved all the bloom pyramid work over to compute, but without any overlap with fragment. To my surprise, it is significantly faster: 19.8 ms average on my device. The most likely reason is that back-to-back compute dispatches in the same command buffer do not have to perform heavy cache maintenance, while back-to-back render passes do. This calls for further investigation.

As we can see from Core Cycles, the fragment cycles counter drops to 0 for a while, and compute kicks in to do the bloom pyramid. This is the bubble we would ideally like to avoid. Fortunately, we were not hurt by vertex work coming in right before we were working on bloom. The driver can interleave VERTEX and COMPUTE work in the same queue most of the time.

Async-compute based method

Mali Graphics Debugger: Async method

Finally, I tried using the two-queue method to see if overlapping shadow rendering with bloom would help, and indeed it did. I get 19.4 ms with this method, a 0.4 ms gain over the regular compute method. Fragment cycles are at almost 100% the whole time here, and we see the blip as the bloom workload comes in. As expected, the vertex shading workload parallelizes nicely with the rest of the frame.

Conclusion

It is possible to use compute for post-processing if you are careful with scheduling, and in some cases you might find that you get better results than you expected. Using multiple queues can be helpful on mobile as well. And as always, measure!

