One way to take advantage of all the CPU cores available to us is to implement multi-threaded recording of draw calls. In tile-based renderers (found in most mobile GPUs), the best approach to split the draw calls is to record them in secondary command buffers. This way they can all be submitted to the same render pass and take advantage of tile-local memory.
With Vulkan, applications are responsible for managing resources and synchronizing access to them. Since the application knows better than the driver when and how each resource will be used, this management can be less conservative, avoiding unnecessary locks and idle time.
To record commands across several threads, the application must correctly manage memory accesses and the usage of related resources, such as buffers and descriptor sets. One option is to manage resource pools per frame and per thread. This follows from the Vulkan specification: command pools are externally synchronized, so a pool (and the command buffers allocated from it) must not be accessed concurrently by multiple threads.
This means that each frame in the queue (e.g. three frames in the case of triple buffering) manages a thread pool and a collection of resources for each thread:
Therefore, each frame will use a set number of threads - as many as the cores in the system - to concurrently record commands. For more detail I recommend “Writing an efficient Vulkan renderer” (from "GPU Zen 2: Advanced Rendering Techniques") by Arseny Kapoulkine.
The Khronos Vulkan Samples project, available on GitHub, includes a command buffer usage sample that shows multi-threaded recording. The screenshot below shows the sample in action on a mobile device.
To test the sample for yourself, make sure to build it in release mode and without validation layers. Both these factors can significantly affect the results.
Note that since state is not reused across command buffers, a reasonable number of draw calls should be submitted per command buffer, to avoid the GPU going idle while processing commands. Having many secondary command buffers with few draw calls each can therefore hurt performance. In any case, there is no advantage in exceeding the CPU parallelism level, that is, in using more command buffers than threads. Similarly, having more threads than buffers may also cost performance; to keep all threads busy, the sample resizes the thread pool when the number of buffers is low. The sample slider can help illustrate these trade-offs and their impact on performance.
In this case, we ran the sample on a high-end mobile device, rendering a scene with a high number of draw calls. Dividing the workload among 8 secondary command buffers recorded across 8 threads yielded a 15% performance improvement.
Multi-threaded command recording has the potential to improve CPU time significantly, but it also opens up several pitfalls. In the worst case, it can perform worse than a single-threaded approach.
Our general recommendation is to use a profiler and figure out the bottleneck for your application, while keeping a close eye on common pain points regarding threading in general. The issues that we have encountered most often are the following:
Vulkan provides different ways to manage and allocate command buffers:
Our sample provides options to compare them and monitor their efficiency. This can be done directly on the device by monitoring frame time.
Command buffers are allocated from a command pool with vkAllocateCommandBuffers. They can then be recorded and submitted to a queue for the Vulkan device to execute them.
A possible approach to managing the command buffers for each frame in our application would be to free them once they are executed, using vkFreeCommandBuffers.
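The allocate-and-free pattern described above looks roughly like this. It is shown as an isolated fragment: `device` and `pool` are assumed to be a valid `VkDevice` and `VkCommandPool` created elsewhere, and error handling is omitted.

```cpp
// Fragment; assumes <vulkan/vulkan.h>, a valid VkDevice `device`
// and a valid VkCommandPool `pool`.
VkCommandBufferAllocateInfo alloc_info{};
alloc_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
alloc_info.commandPool = pool;
alloc_info.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
alloc_info.commandBufferCount = 1;

VkCommandBuffer cmd = VK_NULL_HANDLE;
vkAllocateCommandBuffers(device, &alloc_info, &cmd);

// ... record the buffer, submit it, wait for it to complete ...

// Freeing after every use causes frequent allocator churn:
vkFreeCommandBuffers(device, pool, 1, &cmd);
```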
The command pool will not automatically recycle memory from deleted command buffers if the command pool was created without the RESET_COMMAND_BUFFER_BIT flag. This flag, however, will force separate internal allocators to be used for each command buffer in the pool, which can increase CPU overhead compared to a single pool reset.
This is the worst-performing method of managing command buffers, as it involves a significant CPU overhead for allocating and freeing memory frequently.
Rather than freeing and re-allocating the memory used by a command buffer, it is more efficient to recycle it for recording new commands. There are two ways of resetting a command buffer: 1) individually, with vkResetCommandBuffer; or 2) indirectly by resetting the command pool with vkResetCommandPool.
In order to reset command buffers individually with vkResetCommandBuffer, the pool must have been created with the RESET_COMMAND_BUFFER_BIT flag set. The buffer will then return to a recordable state and the command pool can reuse the memory it allocated for it.
However, frequent calls to vkResetCommandBuffer are more expensive than a command pool reset.
Resetting the pool with vkResetCommandPool automatically resets all the command buffers allocated by it. Doing this periodically will allow the pool to reuse the memory allocated for command buffers with lower CPU overhead.
Resetting the pool does not require the RESET_COMMAND_BUFFER_BIT flag. In fact, it is better to avoid this flag, since it prevents the use of a single large allocator for all buffers in the pool, thus increasing memory overhead.
The sample offers two options: recording all drawing operations into a single command buffer, or dividing the opaque object draw calls among a given number of secondary command buffers. The second option allows multi-threaded command buffer construction; however, the number of secondary command buffers should be kept low, since invoking them is expensive. Our sample lets you adjust the number of command buffers. Using a high number of secondary command buffers makes the application CPU-bound and makes the differences between the described memory allocation approaches more pronounced.
All command buffers in this sample are initialized with the ONE_TIME_SUBMIT_BIT flag set. This indicates to the driver that the buffer will not be resubmitted after execution and allows it to optimize accordingly.
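For reference, setting this flag when beginning recording looks like the fragment below. `command_buffer` is assumed to be a valid `VkCommandBuffer`; for secondary command buffers, a valid `pInheritanceInfo` pointer would also be required.

```cpp
// Fragment; assumes <vulkan/vulkan.h> and a valid VkCommandBuffer.
VkCommandBufferBeginInfo begin_info{};
begin_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
// Tell the driver this buffer is recorded, submitted once, then reset:
begin_info.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
// begin_info.pInheritanceInfo must be set for secondary command buffers.
vkBeginCommandBuffer(command_buffer, &begin_info);
```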
We hope this sample will help you implement multi-threaded command buffer recording in your applications. We also recommend reusing command buffers via vkResetCommandPool instead of freeing them, which avoids the cost of repeated allocation.
This can improve performance (if your application is CPU-limited) and most importantly for mobile devices, significantly reduce system power consumption. This frees up thermal budget, which can be reallocated to useful rendering workloads.
We would encourage you to check out the project on the Vulkan Samples GitHub page and try the sample for yourself. The project has just been donated to The Khronos Group. You can tweak the number of command buffers and the allocation strategy directly on the screen, showing the performance impact through real-time hardware counter graphs. You are also warmly invited to contribute to the project by providing feedback and fixes and creating additional samples.
You may also read the other posts in this series:
And our previous blog on Multithreading in Vulkan.
Vulkan Best Practices