Vulkan fully supports compute shaders; however, they are quite underutilized on mobile. The reasons might be a lack of content which can take advantage of them, not knowing how to use compute, or not knowing how to use compute efficiently on Mali. Our architecture has some quirks which we need to consider in order to use compute optimally. Our best practices guide says to avoid FRAGMENT -> COMPUTE workloads due to the bubbles which will inevitably follow. This is generally good advice, and this blog post will explain some of the reasons why, but recently I stumbled upon a method which can make this practice obsolete if you know what you are doing and are willing to go the extra mile to make it work well.
This blog post will outline how we can use two Vulkan queues with different priorities to enable optimal scheduling for compute post-processing workloads. This is an example where explicit control of work submission in Vulkan can enable things which would be very hard to achieve in GLES.
On current hardware, we have two hardware queues which we can submit work to. The shader cores can handle work from both queues concurrently, and making use of this is critical for optimal performance.
The first is the vertex/tiling/compute queue, a catch-all queue which deals with geometry and compute. Vertex shading for a render pass happens here, but so does compute work. The first nugget of knowledge, which becomes important later, is:
Vertex and compute workloads cannot run concurrently.
The second is the fragment queue, which only deals with the fragment side of a render pass. This queue tends to be - and should be - the bottleneck. For optimal pipelining, it should be active 100% of the time, unless there is no more work to do, e.g. you are rendering at higher than 60 FPS and the GPU can go to sleep.
Fragment workloads should always run on the shader cores if possible. Run compute or vertex workloads in parallel with fragment.
The common pipelining scheme for graphics on TBDR is a two-stage pipeline where geometry is always shaded first, then the fragment stage kicks in later and completes the render pass. Vertex and tiling for multiple render passes can complete their work ahead of time to keep the fragment stage constantly fed with work.
While a workload is running in a hardware queue, it can be pre-empted by more “important” work. This is typically how asynchronous time-warp in VR is done. You need to be able to stop what the GPU is doing and insert something more important to do.
Hardware queues in Vulkan are abstracted through the VkQueue object. On our current drivers, we expose two queues, which you can query from the queue family properties and request when creating the device with vkCreateDevice. You might think they map to the two hardware queues respectively, but this is not the case. A single VkQueue will submit work to either one of the hardware queues; it just depends on which pipeline stage is used. Anything related to VERTEX_SHADER_BIT or COMPUTE_SHADER_BIT goes into the vertex/tiling/compute hardware queue, and FRAGMENT_SHADER_BIT and friends go into the fragment hardware queue.
So why do we expose two queues? An important feature to consider is priority. If pQueuePriorities for one queue is higher than the other, we now have a distinction between a “low” priority and a “high” priority one, which can be useful in many different scenarios. We will get the pre-emption effect if we have two VkQueues with different priorities, and we can and will abuse this for our use case.
For the sake of completeness, I would first like to discuss the simpler cases for compute, which are very easy to map efficiently to Mali.
Certain compute workloads happen logically very early in the frame, and the results are consumed later by vertex and fragment shaders. Some examples here can be:
Because compute happens early in the pipeline, just like vertex shading, this is easy to deal with. Compute is not waiting for fragment, so there is no reason why fragment cannot run 100% of the time. There is an edge case with write-after-read hazards: if compute writes to the same resource every frame, we need to make sure that fragment work from the previous frame is not still reading that data, since only one stage can access the shared resource concurrently once writers are in the picture. However, there are two easy approaches to avoid creating pipeline bubbles here.
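One simple option is to double-buffer the resource which compute writes, so the write-after-read hazard never exists in the first place. Here is a minimal sketch of the idea, with hypothetical names (this is an illustration, not necessarily how your engine organizes its resources):

```c
#include <vulkan/vulkan.h>

/* Illustrative sketch only: double-buffer the resource that compute writes
 * each frame, so the dispatch for frame N never touches the copy that
 * fragment work from frame N - 1 might still be reading. All names here
 * (particle_buffers, select_compute_write_target) are hypothetical. */
#define NUM_BUFFERED_FRAMES 2

extern VkBuffer particle_buffers[NUM_BUFFERED_FRAMES]; /* created elsewhere */

VkBuffer select_compute_write_target(unsigned frame_index)
{
    /* Compute writes this frame's copy while fragment shading for the
     * previous frame reads the other one, so no FRAGMENT -> COMPUTE
     * dependency is required at all. */
    return particle_buffers[frame_index % NUM_BUFFERED_FRAMES];
}
```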
FRAGMENT -> COMPUTE workloads are the awkward part, and something which we generally discourage. Let's explore why in the context of post-processing.
Generally, most uses of compute for post-processing will have a frame looking like this: render the scene in one or more fragment render passes, run compute passes over the results (e.g. bloom), then finish with a fragment render pass which composites and renders the UI.
We need to end with a render pass, because compositing and rendering UI in compute is not very practical. It is certainly possible to avoid using fragment at the end of the frame, but it is quirky at best. Relying on the STORAGE usage flag for the swap chain is asking for trouble, as it is not guaranteed to be supported. Plus, we lose opportunities for framebuffer compression and transaction elimination on the swap chain.
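For reference, here is a minimal sketch of how you would check whether the presentation engine even allows STORAGE usage on swap chain images (the function name is hypothetical):

```c
#include <stdbool.h>
#include <vulkan/vulkan.h>

/* Returns true if the presentation engine allows STORAGE usage on swap chain
 * images, i.e. whether a compute shader could write to them at all. */
bool swapchain_supports_storage(VkPhysicalDevice gpu, VkSurfaceKHR surface)
{
    VkSurfaceCapabilitiesKHR caps;
    if (vkGetPhysicalDeviceSurfaceCapabilitiesKHR(gpu, surface, &caps) != VK_SUCCESS)
        return false;
    /* Even when supported, STORAGE usage tends to cost you framebuffer
     * compression, so ending the frame with a fragment pass is usually
     * the better option on Mali anyway. */
    return (caps.supportedUsageFlags & VK_IMAGE_USAGE_STORAGE_BIT) != 0;
}
```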
The problem case is the bubble we have created. First, we have a pipeline barrier or semaphore which will implement a FRAGMENT -> COMPUTE barrier. We do the post processing, and to perform the composite/UI pass, we will need COMPUTE -> FRAGMENT. You might think “I’ll just use async compute!”, but no, because there is nothing we can truly run async here. The chain of FRAGMENT -> COMPUTE -> FRAGMENT is serial even if we use multiple queues. Because of this dependency chain, we create a scenario where the fragment hardware queue will go idle no matter what.
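To make the dependency chain concrete, here is a minimal sketch of how the FRAGMENT -> COMPUTE -> FRAGMENT barriers might be recorded in a single command buffer. The image and pipeline names are hypothetical, and descriptor set binding is omitted:

```c
#include <vulkan/vulkan.h>

/* Sketch of the dependency chain that creates the bubble. "hdr_image" is a
 * hypothetical color attachment read by the post-processing dispatch; the
 * storage image it writes is assumed to stay in GENERAL layout. */
void record_post_processing(VkCommandBuffer cmd, VkImage hdr_image,
                            VkPipeline post_pipeline,
                            uint32_t groups_x, uint32_t groups_y)
{
    /* FRAGMENT -> COMPUTE: wait for the scene render pass before sampling it. */
    VkImageMemoryBarrier to_compute = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
        .oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
        .newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .image = hdr_image,
        .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
    };
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         0, 0, NULL, 0, NULL, 1, &to_compute);

    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, post_pipeline);
    vkCmdDispatch(cmd, groups_x, groups_y, 1);

    /* COMPUTE -> FRAGMENT: the composite/UI pass must wait for the dispatch.
     * Everything after this barrier in the queue is blocked, including work
     * that belongs to the next frame. */
    VkMemoryBarrier to_fragment = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
    };
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                         0, 1, &to_fragment, 0, NULL, 0, NULL);
}
```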
The biggest risk we take by letting fragment stall is that vertex/tiling work can end up running alone on the shader cores. This is very undesirable. Vertex/tiling work is far more “sluggish” than most workloads, and really needs to run concurrently with fragment to hide this.
We also lose out on a lot of fixed-function work, such as rasterization, depth testing, varying interpolation, and tile memory write-out, which can run in parallel with shading. We should run this fixed-function workload alongside some “real” work to maximize hardware utilization. Fragment work can make use of all the fixed-function blocks of the hardware; compute cannot.
Another problem with having FRAGMENT -> COMPUTE is that we are now in a situation where fragment will eventually starve. If we have long-running COMPUTE workloads, this can block VERTEX work from happening later, and that in turn will block FRAGMENT from working. Remember that VERTEX work runs in the same hardware queue as COMPUTE.
Now, with compute shaders you can certainly saturate the entire GPU and write very fast code with clever use of shared memory and the like, but you will still miss out on the fixed-function work which could have run in parallel. For example, consider shadow maps, which are almost entirely fixed-function rasterization and depth testing.
So, to summarize, the rationale for discouraging FRAGMENT -> COMPUTE like workloads is that it is very easy to end up in a situation with bubbles where FRAGMENT processing goes idle.
Now, given our prototypical frame of scene render passes (FRAGMENT), followed by post-processing (COMPUTE) and a final composite/UI pass (FRAGMENT), we might try to break the bubble using a manual interleaving method: for example, submitting the next frame's early render passes in between the current frame's post-processing compute and its final composite pass, so the fragment queue always has work available.
This would indeed break the bubble, but there are problems with this approach.
By interleaving frames like this, we will almost certainly increase the latency of the application. We cannot submit frame #0 to the GPU before we know what frame #1 is going to look like. The only sensible option here is to add a frame of input latency, which is not great. We could simply defer submission of the command buffers, but we would need a deeper swap chain to deal with the fact that we haven't called vkQueuePresentKHR yet. Having two frames in flight on the CPU is quite awkward in general.
So far, I’ve been building up how and why bubbles occur. Let’s see how we can solve it more elegantly using two queues.
We touched upon a flawed solution using a manual interleaving scheme, but can we replicate that without adding a lot of complexity and submission latency?
The key to understanding this is the barrier which is causing the bubble: COMPUTE -> FRAGMENT. Pipeline barriers wait for “everything before” and block “everything after”. The “everything after” part is significant, because it means the next frame's fragment work has no way to interleave behind the barrier.
The key thing here is that pipeline barriers only apply within the VkQueue they are submitted to. If we can avoid submitting that particular COMPUTE -> FRAGMENT barrier in the main rendering queue, we effectively allow interleaving of render passes without doing it manually.
By moving both post and UI compositing to the second, high-priority queue, we can do some rather interesting things. Our two queues are universal queues with GRAPHICS support, and they can both present.
We are going to use a lot more semaphores than normal, but there isn’t any extra overhead to really worry about.
Let queue #0 be a low priority queue, and #1 be a high priority one. As mentioned earlier, just make sure pQueuePriorities for queue #1 is greater than #0 and you will be fine. I used 0.5 for queue #0 and 1.0 for queue #1.
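A minimal sketch of that queue setup, assuming both queues come from the same universal queue family (the helper name is hypothetical):

```c
#include <vulkan/vulkan.h>

/* Request both queues from the same family at device creation, with queue #0
 * at priority 0.5 and queue #1 at priority 1.0. "queue_family_index" is
 * assumed to be the universal GRAPHICS | COMPUTE family reported by
 * vkGetPhysicalDeviceQueueFamilyProperties. */
VkDevice create_device_with_two_queues(VkPhysicalDevice gpu,
                                       uint32_t queue_family_index,
                                       VkQueue *low_priority_queue,
                                       VkQueue *high_priority_queue)
{
    static const float priorities[2] = { 0.5f, 1.0f };

    VkDeviceQueueCreateInfo queue_info = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
        .queueFamilyIndex = queue_family_index,
        .queueCount = 2,
        .pQueuePriorities = priorities,
    };

    VkDeviceCreateInfo device_info = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
        .queueCreateInfoCount = 1,
        .pQueueCreateInfos = &queue_info,
        /* Swap chain extension, features, etc. omitted for brevity. */
    };

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(gpu, &device_info, NULL, &device);

    vkGetDeviceQueue(device, queue_family_index, 0, low_priority_queue);
    vkGetDeviceQueue(device, queue_family_index, 1, high_priority_queue);
    return device;
}
```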
Basically, we will use queue #0 and queue #1 as a pipeline. The early stages of a frame will go into queue #0, and the later stages will go into queue #1. The pipeline barriers in queue #0 will not pollute the barriers in queue #1 and vice versa.
The frames will look something like this: the shadow and main scene render passes for a frame are submitted to queue #0, while the post-processing compute and the UI/composite pass are submitted to queue #1, with semaphores tying the two halves together.
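A minimal sketch of what one frame's submission could look like under this scheme follows. The command buffers and semaphores are hypothetical objects created elsewhere, and swap chain acquisition and presentation are omitted:

```c
#include <vulkan/vulkan.h>

/* Hypothetical per-frame objects:
 *   scene_cmd  - shadow + main render passes, recorded for queue #0
 *   post_cmd   - compute post-processing + UI/composite, for queue #1
 *   scene_done - semaphore: queue #0 -> queue #1
 *   acquire    - semaphore signaled by vkAcquireNextImageKHR
 *   release    - semaphore waited on by vkQueuePresentKHR (called on queue #1)
 */
void submit_frame(VkQueue low_priority_queue, VkQueue high_priority_queue,
                  VkCommandBuffer scene_cmd, VkCommandBuffer post_cmd,
                  VkSemaphore scene_done, VkSemaphore acquire,
                  VkSemaphore release, VkFence frame_fence)
{
    /* Queue #0: early part of the frame. No COMPUTE -> FRAGMENT barrier lives
     * here, so the next frame's render passes can interleave freely. */
    VkSubmitInfo scene_submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &scene_cmd,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores = &scene_done,
    };
    vkQueueSubmit(low_priority_queue, 1, &scene_submit, VK_NULL_HANDLE);

    /* Queue #1: wait for the scene (at COMPUTE) and the swap chain image
     * (at the final color write), then run post + UI and present from here. */
    VkSemaphore waits[2] = { scene_done, acquire };
    VkPipelineStageFlags wait_stages[2] = {
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    };
    VkSubmitInfo post_submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount = 2,
        .pWaitSemaphores = waits,
        .pWaitDstStageMask = wait_stages,
        .commandBufferCount = 1,
        .pCommandBuffers = &post_cmd,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores = &release,
    };
    vkQueueSubmit(high_priority_queue, 1, &post_submit, frame_fence);
}
```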
This kind of scheme allows us to submit the entire frame and still allow the driver to interleave the shadow rendering pass when we submit the next frame. The driver has an easy time interleaving work when the submissions live in different queues.
We still express work such that the post processing can pipeline nicely with the shadow mapping pass.
Another very cool side effect of this is the lower GPU latency we can achieve. Using the pre-emption mechanism, it is possible for the UI pass to pre-empt the shadow mapping pass from the next frame, because the queue it is running in has higher priority. This reduces GPU latency, because we can present earlier than we would otherwise be able to. The GPU does not have to wait for shadow mapping to complete if it takes longer than the post-processing for some reason. We are also certain that when running post-processing, we do not end up running vertex or tiling work for the next frame either; the compute workload will take priority. Overall, this is a pretty neat system which enables modern compute-based post-processing pipelines on Mali.
Let's stare at Streamline a bit. I measured on a Galaxy S9 (Exynos variant), experimenting with my simple renderer which implements the frame shown above. (The screenshots below are for illustration; I rendered something more complex when gathering the numbers.)
Overkill bloom for dramatic effect!
In the final pass, we tonemap and render the UI on top in the same pass.
I tried three things: implementing the bloom pyramid entirely with fragment render passes, moving it to compute in a single queue, and finally using the two-queue method described above.
I verified that GPU clocks were the same for the test, to avoid any power management shenanigans.
Looking at Mali Core Cycles, we can see when fragment shading and compute (which includes vertex shading) are active on the shader cores. The fragment cycles counter is active all the time, as we would expect, but it is not at 100%. This is because of the back-to-back small render passes which drain and fill the shader cores all the time. We do get pipelining with vertex work as expected.
This implementation averages 21.5 ms per frame on my device.
Here I have moved all the bloom pyramid work over to compute, but without any overlap with fragment. To my surprise, it is significantly faster: 19.8 ms on average on my device. The most likely reason is that back-to-back compute dispatches in the same command buffer do not have to perform heavy cache maintenance, while back-to-back render passes do. This calls for further investigation.
As we can see from Core Cycles, the fragment cycles counter drops to 0 for a while, and compute kicks in to do the bloom pyramid. This is the bubble we would ideally like to avoid. Fortunately, we were not hurt by vertex work coming in right before we were working on bloom. The driver can interleave VERTEX and COMPUTE work in the same queue most of the time.
Finally, I tried using the two-queue method to see if overlapping shadow rendering with bloom would help, and indeed it did: I get 19.4 ms with this method, a 0.4 ms gain over the single-queue compute method. The fragment cycles counter is at almost 100% the whole time here, and we see the blip as the bloom workload comes in. As expected, the vertex shading workload parallelizes nicely with the rest of the frame.
It is possible to use compute for post-processing if you are careful with scheduling, and in some cases you might find that you get better results than you expected. Using multiple queues can be helpful on mobile as well. And as always, measure!