Vulkan Mobile Best Practices - Management of Command Buffers and Multi-Threaded Recording

March 25, 2020


One way to take advantage of all the CPU cores available to us is to implement multi-threaded recording of draw calls. In tile-based renderers (found in most mobile GPUs), the best approach to split the draw calls is to record them in secondary command buffers. This way they can all be submitted to the same render pass and take advantage of tile-local memory.

With Vulkan, applications are responsible for managing resources and synchronizing access to them. Because the application knows more than the driver about when and how those resources will be used, its synchronization can be less conservative, avoiding unnecessary locks and idle time.

Multi-threaded recording

To record commands across several threads, the application must correctly manage memory accesses and the usage of related resources, such as buffers and descriptor sets. One option is to manage resource pools per frame and per thread. According to the Vulkan specification, command pools are externally synchronized, so a command pool must not be used concurrently from multiple threads.

This means that each frame in the queue (e.g. three frames in case of triple buffering) manages a thread pool and a collection of resources for each thread:

A command pool
A descriptor pool cache
A descriptor set cache
A buffer pool

Therefore, each frame will use a set number of threads - as many as the cores in the system - to concurrently record commands. For more detail I recommend “Writing an efficient Vulkan renderer” (from "GPU Zen 2: Advanced Rendering Techniques") by Arseny Kapoulkine.
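As a rough illustration of this layout, the sketch below indexes one resource slot per frame in flight and per worker thread. The PerFrameThreadStore name and its interface are hypothetical and are not taken from the sample; T stands in for whatever type bundles the command pool, descriptor caches and buffer pool.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: one slot of type T per (frame in flight, worker thread).
// T must be default-constructible.
template <typename T>
class PerFrameThreadStore {
public:
    PerFrameThreadStore(std::size_t frames_in_flight, std::size_t thread_count)
        : frames_(frames_in_flight),
          threads_(thread_count),
          slots_(frames_in_flight * thread_count) {}

    // frame_number may count up forever; it wraps onto the ring of frames in flight.
    T& get(std::size_t frame_number, std::size_t thread_index) {
        assert(thread_index < threads_);
        return slots_[(frame_number % frames_) * threads_ + thread_index];
    }

private:
    std::size_t frames_;
    std::size_t threads_;
    std::vector<T> slots_;
};
```

With three frames in flight, frame 3 maps back onto the slots of frame 0. Each thread only touches the slot for its own index, so the pools themselves need no locking, provided the application has waited for the GPU to finish the earlier use of that frame.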

The sample

The Khronos Vulkan Samples project (available on GitHub) includes a command buffer usage sample that shows multi-threaded recording. The screenshot below shows the sample in action on a mobile device.

To test the sample for yourself, make sure to build it in release mode and without validation layers. Both these factors can significantly affect the results.

Multi-threaded recording of commands

Note that since state is not reused across command buffers, a reasonable number of draw calls should be submitted per command buffer. This avoids having the GPU go idle while processing commands. Therefore, having many secondary command buffers with few draw calls can negatively affect performance. In any case, there is no advantage in exceeding the CPU parallelism level, that is, in using more command buffers than threads. Similarly, having more threads than buffers may hurt performance. To keep all threads busy, the sample resizes the thread pool when the number of buffers is low. The sample slider can help illustrate these trade-offs and their impact on performance.
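The partitioning described above can be sketched as follows. This is a minimal illustration and not the sample's own code: split_draws and worker_count are hypothetical helpers that divide the draw calls evenly among the secondary command buffers and cap the number of worker threads at the number of buffers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns [first, last) draw-call index ranges, one per secondary command buffer.
std::vector<std::pair<std::size_t, std::size_t>>
split_draws(std::size_t draw_count, std::size_t buffer_count) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (draw_count == 0 || buffer_count == 0) {
        return ranges;
    }
    // Never create a command buffer that would hold no draw calls.
    buffer_count = std::min(buffer_count, draw_count);
    const std::size_t base = draw_count / buffer_count;
    const std::size_t extra = draw_count % buffer_count;  // first `extra` ranges get one more
    std::size_t first = 0;
    for (std::size_t i = 0; i < buffer_count; ++i) {
        const std::size_t size = base + (i < extra ? 1 : 0);
        ranges.emplace_back(first, first + size);
        first += size;
    }
    assert(first == draw_count);  // every draw call is assigned exactly once
    return ranges;
}

// Threads beyond the number of buffers would have nothing to record.
std::size_t worker_count(std::size_t buffer_count, std::size_t hardware_threads) {
    return std::max<std::size_t>(1, std::min(buffer_count, hardware_threads));
}
```

For example, 10 draw calls over 4 buffers give the ranges [0, 3), [3, 6), [6, 8) and [8, 10), and 2 buffers on an 8-core device would use 2 worker threads.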

In this case, we run the sample on a high-end mobile device, rendering a scene with a high number of draw calls. This shows a 15% improvement in performance when dividing the workload among 8 buffers across 8 threads.

Relative performance

Multi-threaded command recording has the potential to improve CPU time significantly, but it also opens up several pitfalls. In the worst-case scenario, it can lead to worse performance than single-threaded recording.

Our general recommendation is to use a profiler and figure out the bottleneck for your application, while keeping a close eye on common pain points regarding threading in general. The issues that we have encountered most often are the following:

Thread spawning causing a significant overhead. This could happen if you use std::async directly to spawn your threads, as STL implementations usually do not pool threads in that case. We recommend using a thread pool library instead, or implementing thread pooling yourself.
Synchronization overhead might be significant. If you are using mutexes to guard all your map accesses, the code might end up running in a serialized fashion, with extra overhead for lock acquisition and release. Alternative approaches are to use a read/write mutex such as std::shared_mutex, or to go lock-free by ensuring that the map is read-only while executing multi-threaded code.
In the lock-free approach, each thread can keep a list of entries to add to the map. These per-thread lists of entries are then inserted into the map after all the threads have returned.
Having few meshes per thread. Multi-threaded command recording has some performance overhead both on the CPU side (cost of threading) and on the GPU side (executing secondary command buffers). Therefore, using the full parallelism available is not always a good choice. As a rule of thumb, only go parallel if you measure that draw call recording is taking a significant portion of your frame time.
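The lock-free approach with per-thread lists can be sketched as below. Everything here is an illustrative assumption: the Cache and Pending types, the record_worker and record_frame functions, and the string values standing in for real cached objects. For brevity the sketch spawns a std::thread per list, which is exactly the overhead the first point warns about; a real renderer would hand these tasks to a thread pool.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using Cache = std::unordered_map<int, std::string>;
using Pending = std::vector<std::pair<int, std::string>>;

// Worker: only reads the shared cache. Nobody writes to it during this phase,
// so no mutex is needed. Misses are queued in this thread's own list.
void record_worker(const Cache& cache, const std::vector<int>& keys, Pending& pending) {
    for (int key : keys) {
        if (cache.find(key) == cache.end()) {
            pending.emplace_back(key, "created-for-" + std::to_string(key));
        }
    }
}

// Runs one worker per key list, then merges the pending entries on one thread.
void record_frame(Cache& cache, const std::vector<std::vector<int>>& keys_per_thread) {
    std::vector<Pending> pending(keys_per_thread.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < keys_per_thread.size(); ++i) {
        workers.emplace_back(record_worker, std::cref(cache),
                             std::cref(keys_per_thread[i]), std::ref(pending[i]));
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
    // All workers have returned: the map can now be modified safely.
    for (Pending& list : pending) {
        for (auto& entry : list) {
            cache.emplace(entry.first, std::move(entry.second));  // first insertion wins
        }
    }
}
```

If two threads miss on the same key, both queue an entry and the merge keeps the first one, so the duplicated creation work is the price paid for avoiding locks.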

Recycling strategies

Vulkan provides different ways to manage and allocate command buffers:

Allocate and free
Resetting individual command buffers
Resetting the command pool

Our sample provides options to compare them and monitor their efficiency. This can be done directly on the device by monitoring frame time.

Allocate and free

Command buffers are allocated from a command pool with vkAllocateCommandBuffers. They can then be recorded and submitted to a queue for the Vulkan device to execute them.

A possible approach to managing the command buffers for each frame in our application would be to free them once they are executed, using vkFreeCommandBuffers.

The command pool will not automatically recycle memory from deleted command buffers if the command pool was created without the RESET_COMMAND_BUFFER_BIT flag. This flag, however, will force separate internal allocators to be used for each command buffer in the pool, which can increase CPU overhead compared to a single pool reset.

This is the worst-performing method of managing command buffers, as it involves a significant CPU overhead for allocating and freeing memory frequently.

Allocate and free command buffers every frame (not recommended)

Rather than freeing and re-allocating the memory used by a command buffer, it is more efficient to recycle it for recording new commands. There are two ways of resetting a command buffer: 1) individually, with vkResetCommandBuffer; or 2) indirectly by resetting the command pool with vkResetCommandPool.

Resetting individual command buffers

In order to reset command buffers individually with vkResetCommandBuffer, the pool must have been created with the RESET_COMMAND_BUFFER_BIT flag set. The buffer will then return to a recordable state and the command pool can reuse the memory it allocated for it.

Reset command buffers individually every frame (not recommended)

However, frequent calls to vkResetCommandBuffer are more expensive than a command pool reset.

Resetting the command pool

Resetting the pool with vkResetCommandPool automatically resets all the command buffers allocated by it. Doing this periodically will allow the pool to reuse the memory allocated for command buffers with lower CPU overhead.

To reset the pool, the RESET_COMMAND_BUFFER_BIT flag is not required. In fact, it is better to avoid it, since it prevents using a single large allocator for all buffers in the pool, thus increasing memory overhead.

Reset the command buffer pool (this is the most efficient approach)

Relative performance

The sample offers two options. First, recording all drawing operations on a single command buffer. Second, dividing the opaque object draw calls among a given number of secondary command buffers. The second option allows multi-threaded command buffer construction. However, the number of secondary command buffers should be kept low since their invocations are expensive. Our sample lets you adjust the number of command buffers. Using a high number of secondary command buffers causes the application to become CPU bound and makes the differences between the described memory allocation approaches more pronounced.

All command buffers in this sample are initialized with the ONE_TIME_SUBMIT_BIT flag set. This indicates to the driver that the buffer will not be resubmitted after execution and allows it to optimize accordingly.

Conclusion

We hope this sample will help you implement multi-threaded command buffer recording in your applications. We also recommend that, instead of freeing command buffers, you reuse them by resetting their pool with vkResetCommandPool. This avoids having to allocate them again.

This can improve performance (if your application is CPU-limited) and, most importantly for mobile devices, significantly reduce system power consumption. This frees up thermal budget, which can be reallocated to useful rendering workloads.

Get involved

We would encourage you to check out the project on the Vulkan Samples GitHub page and try the sample for yourself. The project has just been donated to The Khronos Group. You can tweak the number of command buffers and the allocation strategy directly on the screen, showing the performance impact through real-time hardware counter graphs. You are also warmly invited to contribute to the project by providing feedback and fixes and creating additional samples.

You may also read the other posts in this series, and our previous blog on Multithreading in Vulkan.
