Vulkan Mobile Best Practices: Frequently Asked Questions - Part 1

February 25, 2020

8 minute read time.

After one year of development, the Vulkan best practices have seen massive change. From an idea in the heads of our engineering team to an official donation to Khronos. There are roughly 4,000 visitors a week to the best practices and we get tons of great feedback and questions. The following are a collection of those frequently asked questions. For our series on Vulkan best practices, please see:

Transitioning the swapchain image on acquisition

How should I transition the swapchain image to the layout I need? Is the implicit transition (initialLayout → layout) good enough?

The default transition may not be enough if you want to acquire in the most efficient way possible. VkSubmitInfo allows to pass an acquisition semaphore with a pWaitDstStageMask. The optimal value is usually VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, because only need the swapchain image to be ready when we are going to write to it.

The problem is that the implicit transition (initialLayout → layout) of the image will not wait for that stage. If there is a mismatch, the GPU might try to transition the image before it is fully acquired, with undefined results.

There are two approaches to solve this:

Giving up on optimal acquisition, by passing TOP_OF_PIPE as pWaitDstStageMask (not recommended);
Replace the implicit subpass dependency with an explicit one, taking the correct stage mask into account.

The acquisition semaphore's pWaitDstStageMask guarantees that the image acquisition happens before VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, but we do not know exactly when. Thus we need an external dependency which fixes the stage for the transition to VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT. Otherwise, the GPU might try to transition the image before it is acquired from the presentation engine.

This is an example subpass dependency:

VkSubpassDependency dependency = { 0 };
dependency.srcSubpass = VK_SUBPASS_EXTERNAL;
dependency.dstSubpass = 0;
dependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dependency.dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
// Since we changed the image layout, we need to make the memory visible to
// color attachment to modify.
dependency.srcAccessMask = 0;
dependency.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;

Another approach, in case you find it simpler, is to disable the implicit dependency (that is, set initialLayout = layout) and do the layout transitions VIA pipeline barriers. This way you can ensure that the transition happens at the correct stage, when the image has already been acquired.

Implicit image transitions via initialLayout / finalLayout

I am transitioning images between render passes using initialLayout → layout → finalLayout but I am getting rendering artifacts. Why?

This question comes out of a real-life example we had when setting up a single post-processing pipeline. The screen was filled with tile-sized artifacts.

I was running a first render pass which would transition the image (finalLayout) to SHADER_READ_ONLY_OPTIMAL, then the following pass would read from it and render to the swapchain. Those rendering artifacts appeared inconsistently between devices, also depending on swapchain size, thus suggesting some sort of synchronization issue.

Vulkan requires explicit synchronization, even when it might seem that it could be inferred. The GPU can execute the render pass in any order, unless we explicitly mark the dependencies. The problem here was that we asked Vulkan to transition the image, but we did not really say by when we wanted it. In Vulkan terms:

Automatic layout transitions into finalLayout happens-before the visibility operations for all dependencies with a dstSubpass equal to VK_SUBPASS_EXTERNAL, where srcSubpass uses the attachment that is transitioned. For attachments created with VK_ATTACHMENT_DESCRIPTION_MAY_ALIAS_BIT, automatic layout transitions into finalLayout happen-before the visibility operations for all dependencies with a dstSubpass equal to VK_SUBPASS_EXTERNAL, where srcSubpass uses any aliased attachment.

It is not guaranteed that the next render pass will see the image in its new layout, unless we add a subpass dependency with VK_SUBPASS_EXTERNAL. Subpass dependencies are baked into the render pass description, which means you will lose flexibility in how you can run your render passes.

A solution is to give up on using the implicit transitions and set initialLayout = layout = finalLayout. You can then handle the transition with a pipeline barrier, which is easier to use than subpass dependencies and can be declared at command buffer creation time.

You might be wondering if this is an issue with swapchain images as well, when setting initialLayout = UNDEFINED and finalLayout = PRESENT_SRC_KHR.
The final transition to PRESENT_SRC_KHR is safe as long as we pass the signalSemaphore from queue submission to vkQueuePresentKHR. The initial transition requires further considerations like in the answered question above on toolchains.

Debugging a DEVICE_LOST

I am getting a DEVICE_LOST error when calling either vkQueueSubmit or vkWaitForFences. Validation is clean. Why could that be?

While developing a complex Vulkan application, you might encounter a VK_ERROR_DEVICE_LOST after seemingly normal usage. This is relatively expensive to deal with, as it is a sticky error flag and requires the VkDevice to be rebuilt, and quite hard to debug too. The Vulkan spec does not currently provide the driver with a straightforward way to communicate the cause of the error, so some trial and error might be required.

There are two main reasons why a DEVICE_LOST might arise:

Out of memory (OOM)
Resource corruption

We covered OOM conditions in greater detail in this blog. If your application is within a reasonable vertex budget for mobile (around 2 million vertices under normal usage), it is worth looking for resource corruption due to missing synchronization.

Common signs for synchronization issues are flickering and inconsistencies between devices. An application with incorrect API usage might run fine on some platforms and fail on others. If GPU resources are corrupted due to missing synchronization, a VK_ERROR_DEVICE_LOST usually occurs.

It should be noted that missing synchronization does not necessarily result in a lost device. For example, if your rendering pipeline depends on the ordering of render passes you need to add some synchronization, such as pipeline barriers, between them. Just issuing the render passes in order is not enough to guarantee that they will be executed in order. Applications with unsynchronized render passes might run as expected in some platforms and show flickering in some others, without any Vulkan errors or validation messages. This is because the API usage is technically correct, but it does not correspond to your intentions.

Synchronization bugs are tricky to identify, reproduce, and track down. Validation layers do not cover all cases, but they can help in some situations. Having a mental model of the data dependencies in your rendering pipeline is critical too. An approach to debugging synchronization issues is to temporarily add more synchronization (for example, extra pipeline barriers, wait idle). This will narrow down the point where the missing synchronization happens.

Meaning of srcAccessMask = 0

Does srcAccessMask = 0 mean "can't read or write"? Shouldn't we specify a READ srcAccessMask in the following code?

VkSubpassDependency dependency = { 0 };
dependency.srcSubpass = VK_SUBPASS_EXTERNAL;
dependency.dstSubpass = 0;
dependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dependency.dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
// Since we changed the image layout, we need to make the memory visible to
// color attachment to modify.
dependency.srcAccessMask = 0;
dependency.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;

srcAccessMask = 0 refers to the access scope before the transition happens. There are two reasons that explain why that scope can be 0:

The image was just acquired from the presentation engine. It is in VK_IMAGE_LAYOUT_PRESENT_SRC_KHR layout, but we use it as VK_IMAGE_LAYOUT_UNDEFINED because we do not need to preserve previous results. Our rendering pipeline does not use the image before the transition (no reads or writes), so the access mask can be empty from this point of view.
We still want to make sure that any memory accesses from the presentation engine are visible after the barrier, but we do not have to specify anything in srcAccessMask: the acquisition semaphore already guarantees that any external accesses are made visible when the semaphore is signaled. This is not the case for dstAccessMask, so we need to specify what we want to do with the image (read and write in this case) so the right caches can be flushed.

Usage of a single image in multiple frames

Why does the multisampling sample (from the Arm Vulkan SDK) use a single multisampled color image in multiple frames? The same thing happens with the depth image.

Does not this break GPU synchronization?

This is due to an optimization for tiled GPUs when you have an attachment that will only be used in a single render pass and does not need to be stored. Taking the multisampled render target as an example:

// This image will only be used as a transient render target.
// Its purpose is only to hold the multisampled data before resolving the render pass.
info.usage = VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT | VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT;
 
[...]
 
alloc.memoryTypeIndex =
    findMemoryTypeFromRequirementsWithFallback(memReqs.memoryTypeBits, VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT);

With a usage of TRANSIENT_ATTACHMENT_BIT and a memory type with LAZILY_ALLOCATED_BIT the GPU can avoid allocating the image at all, the data will only reside in tile-local memory. At the end of the render pass the multisampled data are resolved to the swapchain image (see pResolveAttachments). Using the multisampled image does not involve any actual memory accesses, so we can use the same one for all framebuffers without hazards.

The same reasoning applies to the depth image you mentioned - it is only used in that render pass and never stored, so we can avoid allocating it at all.

This optimization saves a significant amount of memory bandwidth, you can find more information in the tutorial: https://arm-software.github.io/vulkan-sdk/multisampling.html.

Using multiple queues

In which cases should you use more than one queue?

Multiple queues could be used for more complex applications such as asynchronous compute. See here:

https://gpuopen.com/concurrent-execution-asynchronous-queues/

and this is a mobile-friendly application of async compute:

https://community.arm.com/developer/tools-software/graphics/b/blog/posts/using-compute-post-processing-in-vulkan-on-mali.

Get involved

We would encourage you to check out the project on Vulkan Mobile Best Practice GitHub page and try the sample for yourself. The tutorials have just been donated to The Khronos Group. The sample code gives developers on-screen control to demonstrate multiple ways of using the feature. It also shows the performance impact of the different approaches through real-time hardware counters on the display. You are also warmly invited to contribute to the project by providing feedback and fixes and creating additional samples.

Vulkan Best Practices

0 comments
0 members are here

Mobile, Graphics, and Gaming blog

Optimizing 3D scenes in Godot on Arm GPUs

Clay John

In part 1 of this series, learn how we utilized Arm Performance Studio to identify and resolve major performance issues in Godot’s Vulkan-based mobile renderer.
- June 11, 2025
Bringing realistic clothing simulation to mobile: A new frontier for game developers

Mina Dimova

Realistic clothing simulation on mobile—our neural GAT model delivers lifelike cloth motion without heavy physics or ground-truth data.
- June 6, 2025
Join the Upscaling Revolution with Arm Accuracy Super Resolution (Arm ASR)

Lisa Sheckleford

With Arm ASR you can easily improve frames per second, enhance visual quality, and prevent thermal throttling for smoother, longer gameplay.
- March 18, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog