Multi-Threading in Vulkan

Marius Bjørge
April 19, 2016

In my previous blog post I explained some of the key concepts of Vulkan and how we implemented them in our internal graphics engine. In this post I will go into a bit more detail about how we implemented multi-threading and some of the caveats to watch out for.

Quick background

Vulkan was created from the ground up to be thread-friendly, and the spec goes into a great deal of detail about thread safety and the consequences of function calls. In OpenGL, for instance, the driver might have a number of background threads working while waiting for API calls from the application. In Vulkan, this responsibility has moved up to the application level, so it's now up to you to ensure correct and efficient multi-threading behavior. This is a good thing, since the application often has better visibility of what it wants to achieve.

Command pools

In Vulkan, command buffers are allocated from command pools. Typically you pin each command pool to a thread and only use that thread when writing to command buffers allocated from its pool. Otherwise you need to externally synchronize every access to the command pool, including recording into any command buffer allocated from it, which adds overhead.
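A minimal sketch of the one-pool-per-thread pattern, written against the C++ standard library only so that it runs without a Vulkan device. `ThreadCommandPool` and `record_in_parallel` are hypothetical stand-ins, not Vulkan API; in a real renderer each slot would hold a VkCommandPool and the command buffers allocated from it. The point it illustrates is that each worker writes only to its own pool, so recording needs no lock.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a VkCommandPool plus the command buffers
// allocated from it. Not part of the Vulkan API.
struct ThreadCommandPool {
    std::vector<std::string> recorded;  // commands recorded by the owning thread
};

// Records `draws_per_thread` fake draw commands on each of `thread_count`
// worker threads. Every thread writes only to its own pool, so no mutex is
// needed while recording.
std::vector<ThreadCommandPool> record_in_parallel(std::size_t thread_count,
                                                  std::size_t draws_per_thread) {
    // Sized up front and never resized while the workers are running.
    std::vector<ThreadCommandPool> pools(thread_count);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);

    for (std::size_t t = 0; t < thread_count; ++t) {
        workers.emplace_back([&pools, t, draws_per_thread] {
            ThreadCommandPool& my_pool = pools[t];  // touched by this thread only
            for (std::size_t d = 0; d < draws_per_thread; ++d) {
                my_pool.recorded.push_back("draw " + std::to_string(t) + ":" +
                                           std::to_string(d));
            }
        });
    }

    // Recording is finished once every worker has joined; after this point
    // the main thread can read all pools and submit their contents.
    for (std::thread& w : workers) {
        w.join();
    }
    return pools;
}
```

Calling `record_in_parallel(4, 3)` returns four pools holding three recorded commands each. The only synchronization is the join before the pools are read.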

[Figure: commandpool.png]

For graphics use cases you also typically pin a command pool per frame. This has the nice side effect that you can simply reset the entire command pool once the device has finished the work for that frame. You can also reset individual command buffers, but it's often more efficient to just reset the entire command pool.
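Combining the two rules gives one pool per thread per frame. The sketch below shows one way the indexing could look, assuming a fixed number of frames in flight (an assumption of this example, not something the post specifies). `PoolSlot` and `FramePoolRing` are hypothetical names; in real code `begin_frame` is where vkResetCommandPool would be called for each pool, after confirming the device has finished with it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a VkCommandPool. `live_buffers` models the
// command buffers currently recorded from this pool.
struct PoolSlot {
    std::size_t live_buffers = 0;
    std::size_t resets = 0;
};

// One pool per (frame in flight, thread) pair, stored as a flat ring.
class FramePoolRing {
public:
    FramePoolRing(std::size_t frames_in_flight, std::size_t thread_count)
        : frames_(frames_in_flight),
          threads_(thread_count),
          slots_(frames_in_flight * thread_count) {
        assert(frames_in_flight > 0 && thread_count > 0);
    }

    // Pool owned by `thread` for frame number `frame`. Frame numbers wrap,
    // so frame N and frame N + frames_in_flight share the same pools.
    PoolSlot& pool(std::size_t frame, std::size_t thread) {
        assert(thread < threads_);
        return slots_[(frame % frames_) * threads_ + thread];
    }

    // Call only after the device has finished the frame that last used
    // these pools. Resets every pool for this frame slot in one go instead
    // of resetting command buffers one by one.
    void begin_frame(std::size_t frame) {
        for (std::size_t t = 0; t < threads_; ++t) {
            PoolSlot& p = pool(frame, t);
            p.live_buffers = 0;
            ++p.resets;
        }
    }

private:
    std::size_t frames_;
    std::size_t threads_;
    std::vector<PoolSlot> slots_;
};
```

With two frames in flight and three threads, frames 0 and 2 map to the same three pools, while frame 1 uses a separate set that the device may still be reading.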

Coordinating work

In OpenGL, work is executed implicitly behind the scenes. In Vulkan, this is explicit: the application submits command buffers to queues for execution.

[Figure: blog_diagrams.png]

Vulkan has the following synchronization primitives:

  • Semaphores - used to synchronize work across queues or across coarse-grained submissions to a single queue
  • Events and barriers - used to synchronize work within a command buffer or a sequence of command buffers submitted to a single queue
  • Fences - used to synchronize work between the device and the host

Queues have simple sync primitives for ordering the execution of command buffers. You can tell the driver to wait on a semaphore before processing the submitted work, and you can have it signal a semaphore or a fence when the submitted work has completed. This synchronization is particularly important when submitting work that targets the swap chain. The following diagram shows how work can be recorded and submitted to the device queue for execution before we finally tell the device to present our frame to the display.

[Figure: swap1.png]

In the above sequence there is no overlap of work between different frames. Therefore, even though we're recording work to command buffers in multiple threads, we still have a certain amount of time where the CPU threads sit idle waiting for a signal in order to start work on the next frame.
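To put rough numbers on that idle time, the sketch below compares this serial schedule with a pipelined one in which recording of the next frame starts right after the current frame is submitted. It is a simplified timing model with made-up per-frame costs, not a measurement: it ignores presentation and any limit on the number of frames in flight. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Total time for `frames` frames when the CPU waits for the device to
// finish frame N before it starts recording frame N+1 (no overlap).
double serial_time_ms(std::size_t frames, double record_ms, double gpu_ms) {
    assert(record_ms >= 0.0 && gpu_ms >= 0.0);
    return static_cast<double>(frames) * (record_ms + gpu_ms);
}

// Total time when recording of frame N+1 starts immediately after frame N
// has been submitted, so CPU recording overlaps device execution.
double overlapped_time_ms(std::size_t frames, double record_ms, double gpu_ms) {
    assert(record_ms >= 0.0 && gpu_ms >= 0.0);
    double cpu_done = 0.0;  // time at which the CPU finished recording
    double gpu_done = 0.0;  // time at which the device finished executing
    for (std::size_t i = 0; i < frames; ++i) {
        cpu_done += record_ms;  // the CPU never waits in this model
        // The device starts a frame once it is recorded and the previous
        // frame has finished executing.
        gpu_done = std::max(cpu_done, gpu_done) + gpu_ms;
    }
    return gpu_done;
}
```

With 4 ms of recording and 6 ms of device time per frame, three frames take 30 ms in the serial schedule and 22 ms in the overlapped one, because after the first frame the device never has to wait for the CPU.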

[Figure: swap2.png]

This is much better. Here we start recording work for the next frame immediately after submitting the current frame to the device queue. All synchronization here is done using semaphores: vkAcquireNextImageKHR signals a semaphore once the swap chain image is ready, and vkQueueSubmit waits for this semaphore before processing any of the commands, then signals another semaphore once the submitted commands have completed. Finally, vkQueuePresentKHR presents the image to the display, but only after waiting for the semaphore signaled by vkQueueSubmit.
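The acquire, submit, present chain can be modeled on the host to show how two semaphores enforce the ordering. This is an analogy only: real VkSemaphore objects order work on the device, whereas here three host threads stand in for the three operations. `BinarySemaphore` and `run_frame` are hypothetical names introduced for this sketch.

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Minimal host-side binary semaphore, used as a stand-in for a VkSemaphore.
class BinarySemaphore {
public:
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        cv_.notify_one();
    }

    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return signaled_; });
        signaled_ = false;  // the wait consumes the signal
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Runs one frame and returns the order in which the three steps executed.
std::vector<std::string> run_frame() {
    BinarySemaphore image_available;  // signaled by the acquire step
    BinarySemaphore render_finished;  // signaled when submitted work is done
    std::vector<std::string> log;
    std::mutex log_mutex;
    auto record = [&](const char* step) {
        std::lock_guard<std::mutex> lock(log_mutex);
        log.push_back(step);
    };

    // Threads are started in reverse on purpose: the ordering comes from
    // the semaphores, not from the order in which the work was issued.
    std::thread present([&] {
        render_finished.wait();    // like vkQueuePresentKHR waiting
        record("present");
    });
    std::thread submit([&] {
        image_available.wait();    // like vkQueueSubmit waiting
        record("submit");
        render_finished.signal();  // submitted commands completed
    });
    std::thread acquire([&] {
        record("acquire");         // like vkAcquireNextImageKHR
        image_available.signal();  // swap chain image is ready
    });

    acquire.join();
    submit.join();
    present.join();
    return log;
}
```

Whatever order the threads are scheduled in, `run_frame()` returns acquire, submit, present, because each step blocks until the previous one has signaled.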

Summary

In this blog post I have given a brief overview of how to get overlap between CPU threads that record commands into command buffers over multiple frames. For our own internal implementation we found this really useful as it allowed us to start preparing work for the next frame very early on, ensuring the GPU is kept busy.
