Vulkan Mobile Best Practices: Frequently Asked Questions - Part 2

March 10, 2020


After one year of development, the Vulkan best practices have seen massive change, going from an idea in the heads of our engineering team to an official donation to Khronos. The best practices get roughly 4,000 visitors a week, and we receive a lot of great feedback and questions. The following is a collection of those frequently asked questions. For our series on Vulkan best practices, please see:

Changing a buffer's contents dynamically

Can we update a buffer while it is in flight? Do we need barriers to do it?

Changing a uniform while the GPU is using it is dangerous from a synchronization standpoint: you cannot know if the GPU will read the data before or after the update, so the behavior of your app would be inconsistent.

The spec says:

The descriptor set contents bound by a call to vkCmdBindDescriptorSets may be consumed during host execution of the command. This can also happen during shader execution of the resulting draws, or anytime in between. Thus, the contents must not be altered (overwritten by an update command, or freed) between when the command is recorded and when the command completes executing on the queue. The contents of pDynamicOffsets are consumed immediately during execution of vkCmdBindDescriptorSets. Once all pending uses have completed, it is legal to update and reuse a descriptor set.

If you want to change the uniform buffer data across frames without breaking synchronization, you will have to replicate the data in some way. One way to do so without major changes to your code is to create a larger uniform buffer (for example, 3x the size for 3 frames) and bind it as a dynamic uniform buffer. Each frame then writes to its own slice of the buffer, and you select that slice by changing the dynamic offset for each frame.

Since you cannot update a part of a buffer that is in use, pipeline barriers will not help. If you have a single buffer, the update on the CPU side has to wait for the GPU to finish using the buffer, so you would end up serializing frames.
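As a rough illustration of the larger-buffer approach, the sketch below computes the buffer size and the per-frame dynamic offset. UniformRing and align_up are hypothetical helper names, not part of Vulkan or of the samples, and the alignment value is assumed to come from the device's minUniformBufferOffsetAlignment limit.

```cpp
#include <cassert>
#include <cstdint>

// Round `value` up to the next multiple of `alignment`.
// Assumes `alignment` is a power of two, which the Vulkan spec requires
// for minUniformBufferOffsetAlignment.
constexpr std::uint64_t align_up(std::uint64_t value, std::uint64_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: one large uniform buffer split into per-frame slices.
struct UniformRing
{
    std::uint64_t slice_size;   // size of one frame's slice, padded to the alignment
    std::uint32_t frame_count;  // number of frames that may be in flight

    UniformRing(std::uint64_t data_size, std::uint64_t min_alignment, std::uint32_t frames)
        : slice_size(align_up(data_size, min_alignment)), frame_count(frames)
    {
        assert(frames > 0);
    }

    // Total size to request when creating the buffer.
    std::uint64_t total_size() const { return slice_size * frame_count; }

    // Offset to pass in pDynamicOffsets (and to write at on the CPU side)
    // for the given frame number.
    std::uint32_t dynamic_offset(std::uint64_t frame_number) const
    {
        return static_cast<std::uint32_t>((frame_number % frame_count) * slice_size);
    }
};
```

With 200 bytes of uniform data, a 256-byte alignment and 3 frames, each slice is padded to 256 bytes and the offsets cycle through 0, 256 and 512. The CPU still needs a per-frame fence to know that the slice it is about to overwrite is no longer in use.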

Allocating and mapping memory for a buffer

What's the best practice for allocating and mapping buffer memory?

Allocating memory for each buffer via vkAllocateMemory can be really slow, and there is a cap on the total number of allocations (the maxMemoryAllocationCount device limit). Mapping memory via vkMapMemory is also a costly operation.
The intended usage for an app is to allocate a big chunk of memory, keep it mapped, and sub-allocate from it for individual buffers.

If you want a drop-in replacement for memory management which follows these best practices, check out VMA (Vulkan Memory Allocator). Its API is similar to Vulkan's, so it probably will not require any major changes to your code.
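To make the "allocate a big chunk and manage it" idea concrete, here is a minimal sketch of a linear sub-allocator that hands out offsets inside one large allocation. LinearSubAllocator and kAllocationFailed are hypothetical names used only for illustration. A real allocator such as VMA also handles memory types, freeing individual resources, and fragmentation.

```cpp
#include <cassert>
#include <cstdint>

// Sentinel returned when the block cannot fit the request (illustrative only).
constexpr std::uint64_t kAllocationFailed = ~std::uint64_t{0};

// Hypothetical linear (bump) sub-allocator over one large memory block.
// It only tracks offsets; the block itself would be a single device
// allocation that stays mapped for its whole lifetime.
class LinearSubAllocator
{
public:
    explicit LinearSubAllocator(std::uint64_t capacity) : capacity_(capacity) {}

    // Returns the offset at which a resource of `size` bytes can be bound,
    // or kAllocationFailed if it does not fit.
    // Assumes `alignment` is a non-zero power of two, as reported in
    // VkMemoryRequirements::alignment.
    std::uint64_t allocate(std::uint64_t size, std::uint64_t alignment)
    {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uint64_t start = (head_ + alignment - 1) & ~(alignment - 1);
        if (start > capacity_ || size > capacity_ - start)
        {
            return kAllocationFailed;
        }
        head_ = start + size;
        return start;
    }

    // Releases everything at once, for example when unloading a scene.
    void reset() { head_ = 0; }

    std::uint64_t used() const { return head_; }

private:
    std::uint64_t capacity_;
    std::uint64_t head_ = 0;
};
```

The returned offset is what you would pass as the memory offset when binding a buffer to the shared allocation, and also the offset into the persistently mapped pointer when writing from the CPU.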

Understanding barrier scope

Supposing we are only using one queue and we have the following code:

// Set of commands - A 
vkCmdDraw(...) 
... 
vkCmdDraw(...) 
// Barrier 1 
vkCmdPipelineBarrier(...) 
// Set of commands - B 
vkCmdDraw(...) 
... 
vkQueueSubmit(...) 
vkQueuePresentKHR(...) 
// Barrier 2 
vkCmdPipelineBarrier(...) 
// Set of commands - C 
vkCmdDraw(...) 
... 
vkQueueSubmit(...) 
vkQueuePresentKHR(...)

How do the two barriers interact with each set of commands?

A pipeline barrier always acts on two sets of commands, those which come before the barrier and those which come after.

Since you do not mention render passes, we assume that the calls to vkCmdPipelineBarrier are outside of a render pass instance. The spec says:

If vkCmdPipelineBarrier is called outside a render pass instance, then the first set of commands is all prior commands submitted to the queue and recorded in the command buffer. The second set of commands is all subsequent commands recorded in the command buffer and submitted to the queue.

The main difference between the two barriers is that the first one is in the middle of a command buffer. The second one is after the first commands are submitted and presented (so it is likely to be in another command buffer).
This difference does not really matter according to the spec, because commands previously submitted and previously recorded in the current command buffer are treated the same way.

This is a breakdown of the two barriers:

Barrier 1
- Before: set A and everything that comes prior to that
- After: sets B, C, and everything that comes afterwards
Barrier 2
- Before: sets A, B and everything that comes prior to them
- After: set C and everything that comes afterwards

Number of descriptor pools

How many descriptor pools should you have? Just a large one or one per frame?

Using one descriptor pool per frame is not strictly necessary, but it is still very good to have. If you create your descriptor pool without the FREE_DESCRIPTOR_SET_BIT flag, you cannot free individual descriptor sets: you can only release them all at once by resetting the pool via vkResetDescriptorPool.
If you use only a single pool for all frames, you have to wait for the device to go idle before resetting it. If you use several descriptor pools instead, you can reset the ones belonging to frames that are not currently in flight.

Avoiding the FREE_DESCRIPTOR_SET_BIT flag can let the driver use a simpler allocator, ultimately improving performance.

You can also check out our blog on descriptor management for more information.
If you are performing multithreaded rendering, you may need to allocate more descriptor pools, as discussed in the tutorial on multithreading.

Synchronizing texture transfers

How should I synchronize texture transfers without calling vkQueueWaitIdle? What happens if I don't specify any synchronization?

If you don't specify any synchronization, there is a concurrency risk. You have no guarantee that the transfer will be complete when the rendering begins. You could add a pipeline barrier between the transfer and the shader stage in which you are going to use the image. You need a pipeline barrier for the layout transition anyway.

If you are uploading many textures at once, for example when loading a new scene, it might be easier to submit all the transfers and wait idle.

Meaning of signaling a fence

When a fence is signaled, does it mean that all commands are transferred to the GPU or that all commands have completed?

If it is the fence you pass to vkQueueSubmit, it means that the commands have completed execution, not merely that they have been transferred to the GPU.

Actually it means even more than that. If the fence is signaled, all commands from all previous submissions to that queue have completed execution as well. The spec says:

When a fence is submitted to a queue as part of a queue submission command, it defines a memory dependency on the batches that were submitted as part of that command. This defines a fence signal operation which sets the fence to the signaled state.

The first synchronization scope includes every batch submitted in the same queue submission command. Fence signal operations that are defined by vkQueueSubmit additionally include in the first synchronization scope all commands that occur earlier in submission order.

Struct alignment for uniform buffers and push constants

How do you pass data from a C/C++ struct to a uniform buffer? The data I'm passing is not read correctly from the shader.

Uniform buffer alignment is not straightforward due to structure packing rules: a struct in C++ will not match a struct in GLSL unless you structure them carefully. You can find more information on the std140 packing here, which is the default layout for uniform buffers. Push constant blocks default to the slightly more relaxed std430 rules, but a struct that follows the guidelines below is laid out the same way under both. Debugging it might be hard: if you are lucky, the validation layers will complain about offsets you were not expecting; otherwise you will just see weird values being passed to the shaders.

The golden rule is that struct and array elements must be aligned as multiples of 16 bytes (the size of a vec4). Thus:

- vec4 and mat4 are safe, feel free to use them
- Do not use vec3; use a vec4 and pack some other information in the 4th component if possible
- If you need to use float / int32_t, you will need to add a vec3's worth of padding (12 bytes) after them in the C++ struct; try to pack basic types in a vec4 whenever possible
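The rules above can be checked at compile time on the C++ side. The sketch below is an illustration under assumptions: SceneUniform and its members are made-up names, and the GLSL block in the comment is a hypothetical counterpart with the std140 offsets it would get.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical GLSL counterpart (std140), shown for reference only:
//   layout(set = 0, binding = 0) uniform Scene {
//       mat4  view_proj;        // offset  0
//       vec4  light_pos_radius; // offset 64 (xyz = position, w = radius)
//       float exposure;         // offset 80
//       vec4  tint;             // offset 96 (vec4 must start on a 16-byte boundary)
//   };                          // total size 112

struct alignas(16) SceneUniform
{
    float view_proj[16];       // mat4
    float light_pos_radius[4]; // vec4, with extra data packed in the 4th component
    float exposure;            // lone float...
    float pad0[3];             // ...followed by 12 bytes of explicit padding
    float tint[4];             // vec4, now correctly placed at offset 96
};

// If any of these fail, the shader would read the wrong bytes.
static_assert(offsetof(SceneUniform, view_proj) == 0, "view_proj offset");
static_assert(offsetof(SceneUniform, light_pos_radius) == 64, "light_pos_radius offset");
static_assert(offsetof(SceneUniform, exposure) == 80, "exposure offset");
static_assert(offsetof(SceneUniform, tint) == 96, "tint offset");
static_assert(sizeof(SceneUniform) == 112, "struct size");
```

Without pad0, the C++ compiler would place tint at offset 84 while the shader expects it at 96, which is exactly the kind of mismatch that produces weird values in the shader.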

Dynamic uniform buffers have an additional alignment requirement for the dynamic offset. You might need to further pad your uniform buffer data so that the offset is an exact multiple of that limit. You can check the limit as minUniformBufferOffsetAlignment in VkPhysicalDeviceLimits (the limits member of VkPhysicalDeviceProperties), with common values ranging between 16 bytes and 256 bytes.

Crashes with no backtrace on Android

My Android app crashes without any message or backtrace on logcat. What could it be?

Your app may be running out of memory. Look for a message like this in logcat:

07-13 17:10:37.788 19132 19132 V threaded_app: LowMemory: 0x7926307ec0

If you are running out of memory, debugging the app in Android Studio Profiler may help. It lets you track the memory usage of your app and may let you trace it down to individual allocations.

Shader variants

How can I set up shader variants in Vulkan? Should I use specialization constants?

A first approach to shader variants is to use #ifdef directives in your shaders, like in this one. You can then compile different variants by running glslangValidator with the -D option, like this:

%VULKAN_SDK%\bin\glslangValidator.exe -V pbr.vert -o variants\pbr_vert_.spv
%VULKAN_SDK%\bin\glslangValidator.exe -V pbr.vert -o variants\pbr_vert_N.spv -DHAS_NORMALS
%VULKAN_SDK%\bin\glslangValidator.exe -V pbr.vert -o variants\pbr_vert_T.spv -DHAS_TANGENTS
%VULKAN_SDK%\bin\glslangValidator.exe -V pbr.vert -o variants\pbr_vert_NT.spv -DHAS_NORMALS -DHAS_TANGENTS

This can be done either offline, when you build your app, or at runtime, by building glslang into your app.

A different approach to shader variants is to use specialization constants. They are efficient because they are still compile-time constants, specified at pipeline creation time, and you don't need to compile separate variants with glslangValidator or shaderc. Specialization constants do have some limitations, however. The main one is that you can't use if statements while defining your shader's interface, such as vertex attributes or texture samplers:

// valid GLSL
#ifdef HAS_BASECOLORMAP
    layout(binding = 0) uniform texture2D baseColorT;
#endif
 
 
// invalid GLSL
if (specialization_constant) {
    layout(binding = 0) uniform texture2D baseColorT;
}

So the interface for your shaders will be fixed, but you can use if statements based on specialization constants in your main() function. These are resolved when the pipeline is compiled, much like a #define. Even though you cannot modify the shader interface variables, the compiler may optimize out the ones you do not need if you remove all references to them.

Get involved

We encourage you to check out the project on the Vulkan Mobile Best Practice GitHub page and try the samples for yourself. The tutorials have just been donated to The Khronos Group. The sample code gives developers on-screen control to demonstrate multiple ways of using each feature. It also shows the performance impact of the different approaches through real-time hardware counters on the display. You are also warmly invited to contribute to the project by providing feedback and fixes and by creating additional samples.

Vulkan Best Practices: https://github.com/KhronosGroup/Vulkan-Samples
