An application using Vulkan will have to implement a system to manage descriptor sets. The most straightforward and flexible approach is to re-create them for each frame, but this can be inefficient and problematic on mobile platforms.
The underlying issue of descriptor management is intertwined with that of buffer management, that is, choosing how to pack data in VkBuffer objects. This blog explores a few of the options available to improve both descriptor and buffer management.
Some of the approaches presented here are covered in greater detail, along with some more options, in “Writing an efficient Vulkan renderer” by Arseny Kapoulkine (from "GPU Zen 2: Advanced Rendering Techniques").
Samsung also covered descriptor set caching in a GDC 2019 presentation about bringing Fortnite to mobile platforms.
When rendering dynamic objects, the application will need to push some amount of per-object data to the GPU, such as the MVP matrix. This data may not fit into the push constant limit for the device. So, it becomes necessary to send it to the GPU by putting it into a VkBuffer and binding a descriptor set that points to it.
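To make the push constant limit concrete, here is a minimal sketch of the size check. The PerObjectData layout and the fits_in_push_constants() helper are hypothetical and only serve as illustration; the 128-byte figure is the minimum that the Vulkan specification guarantees for VkPhysicalDeviceLimits::maxPushConstantsSize, and a real application would query the actual limit of the device.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-object data: three 4x4 float matrices, 64 bytes each.
struct PerObjectData {
    float mvp[16];     // model-view-projection matrix
    float model[16];   // model matrix
    float normal[16];  // normal matrix
};

// Minimum push constant size guaranteed by the Vulkan specification
// (VkPhysicalDeviceLimits::maxPushConstantsSize); devices may offer more.
constexpr std::uint32_t kGuaranteedPushConstantBytes = 128;

// Hypothetical helper: true if the data can be sent as push constants,
// false if it has to go into a VkBuffer referenced by a descriptor set.
constexpr bool fits_in_push_constants(std::size_t data_size,
                                      std::uint32_t device_limit) {
    return data_size <= device_limit;
}
```

With this layout a single MVP matrix (64 bytes) fits within the guaranteed limit, while the full 192-byte structure does not and would have to be placed in a buffer.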
Materials also need their own descriptor sets, which point to the textures they use. We can either bind per-material and per-object descriptor sets separately or collate them into a single set. Either way, complex applications will have a large number of descriptor sets that may need to change on the fly, for example when textures are streamed in or out.
The simplest approach to circumvent the issue is to have one or more VkDescriptorPools per frame, reset them at the beginning of the frame and allocate the required descriptor sets from them. This approach consists of a vkResetDescriptorPool() call at the beginning, followed by a series of vkAllocateDescriptorSets() and vkUpdateDescriptorSets() calls to fill the sets with data.
The issue is that these calls can add significant overhead to the CPU frame time, especially on mobile. For example, when vkUpdateDescriptorSets() is called for each draw call, updating the descriptors can take longer than the draws themselves.
Figure 1: Base case with no descriptor caching.
Our Vulkan Best Practice for Mobile Developers project (available on GitHub) has a descriptor management sample that shows the advantages of better descriptor management. Figure 1 above shows the sample in action on a phone.
The descriptor set management issue is highlighted with a draw-call intensive scene. Frame time is around 44 ms, corresponding to 23 FPS, using the simplest descriptor management scheme.
If you want to test the sample yourself, make sure to build it in release mode and run it without validation layers. Both these factors can significantly affect the results.
A major way to reduce descriptor set updates is to reuse descriptor sets as much as possible. Instead of calling vkResetDescriptorPool() every frame, the app keeps the VkDescriptorSet handles and looks them up through some caching mechanism.
The cache could be a hash map with the contents of the descriptor set (images, buffers) as the key. This approach is used in our framework by default. It is possible to remove another level of indirection by storing descriptor set handles directly in the materials and meshes.
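The hash map idea can be sketched as follows. This is not the framework's actual implementation: Handle is a plain integer standing in for VkDescriptorSet, VkImageView and VkBuffer handles, and Binding, SetContents and DescriptorSetCache are hypothetical names. The create callback represents the expensive path, which in a real renderer would call vkAllocateDescriptorSets() and vkUpdateDescriptorSets().

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Stand-in for Vulkan handles (assumption: no Vulkan headers are used here).
using Handle = std::uint64_t;

// One resource bound at one binding slot of a descriptor set.
struct Binding {
    std::uint32_t slot;
    Handle resource;
    bool operator==(const Binding& other) const {
        return slot == other.slot && resource == other.resource;
    }
};

using SetContents = std::vector<Binding>;

// Combines the hashes of all bindings so that the contents act as the key.
struct SetContentsHash {
    std::size_t operator()(const SetContents& contents) const {
        std::size_t seed = contents.size();
        for (const Binding& b : contents) {
            seed ^= std::hash<std::uint32_t>{}(b.slot) + 0x9e3779b9u + (seed << 6) + (seed >> 2);
            seed ^= std::hash<Handle>{}(b.resource) + 0x9e3779b9u + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};

class DescriptorSetCache {
public:
    // Returns the cached set for these contents, creating it only on a miss.
    Handle request(const SetContents& contents,
                   const std::function<Handle(const SetContents&)>& create) {
        auto it = sets_.find(contents);
        if (it != sets_.end()) {
            ++hits_;
            return it->second;
        }
        ++misses_;
        Handle set = create(contents);  // allocate + update happens only here
        sets_.emplace(contents, set);
        return set;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    std::unordered_map<SetContents, Handle, SetContentsHash> sets_;
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```

In a steady-state frame every request should be a hit, so no descriptor updates are issued at all. Storing the returned handle directly in the material or mesh would skip even the hash lookup.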
Caching descriptor sets has a dramatic effect on frame time for our CPU-heavy scene, as seen below in Figure 2.
Figure 2: Scene with descriptor caching
The frame time is now around 27 ms, corresponding to 37 FPS. This is a 38% decrease in frame time.
We can confirm this behavior using Streamline Performance Analyzer, as shown in Figure 3 below.
Figure 3: CPU speedup with descriptor caching
The first part of the trace until the marker is without descriptor set caching. We can see that the app is CPU bound, since the GPU is idling between frames while the CPU is fully utilized.
After the marker we enable descriptor set caching and we can see that frames are processed faster. GPU frame time does not change much, and the app is still CPU bound, so the speedup is related to CPU-side improvements.
This system is reasonably easy to implement for a static scene, but it becomes harder when you need to delete descriptor sets. Complex engines may implement techniques to figure out which descriptor sets have not been accessed for a certain number of frames, so that they can be removed from the map.
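One possible way to track unused descriptor sets is to stamp each cache entry with the last frame in which it was requested. The sketch below assumes integer stand-ins for the handle and the cache key, and AgingSetMap is a hypothetical name rather than anything from the sample. The age threshold should be at least the number of frames in flight, since a set may still be referenced by a command buffer the GPU has not finished executing.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Handle = std::uint64_t;  // stand-in for VkDescriptorSet
using Key = std::uint64_t;     // stand-in for a hash of the set contents

class AgingSetMap {
public:
    // Records that the set was used in the given frame (inserting it if new).
    void touch(Key key, Handle set, std::uint64_t frame) {
        entries_[key] = Entry{set, frame};
    }

    // Removes every entry unused for more than max_age frames and returns
    // the handles, so the caller can free or recycle them.
    // Assumes current_frame is never smaller than a recorded frame.
    std::vector<Handle> evict(std::uint64_t current_frame, std::uint64_t max_age) {
        std::vector<Handle> evicted;
        for (auto it = entries_.begin(); it != entries_.end();) {
            if (current_frame - it->second.last_used > max_age) {
                evicted.push_back(it->second.set);
                it = entries_.erase(it);
            } else {
                ++it;
            }
        }
        return evicted;
    }

    std::size_t size() const { return entries_.size(); }

private:
    struct Entry {
        Handle set;
        std::uint64_t last_used;
    };
    std::unordered_map<Key, Entry> entries_;
};
```

Running evict() once per frame, or every few frames, keeps the map from growing without bound when resources are streamed in and out.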
Removing them may correspond to calling vkFreeDescriptorSets(), but this solution poses another issue: to free individual descriptor sets, the pool has to be created with the VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT flag. Mobile implementations may use a simpler allocator if that flag is not set, relying on the fact that pool memory will only be recycled as a whole block.
It is possible to avoid using that flag by updating descriptor sets instead of deleting them. The application can keep track of recycled descriptor sets and reuse one of them when a new one is requested. The render subpass sample uses this approach when it re-creates the G-buffer images.
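A recycling scheme along these lines could look like the sketch below. It is an illustration under assumptions and not the code of the render subpass sample: Handle stands in for VkDescriptorSet, SetRecycler is a hypothetical name, and the allocate callback represents vkAllocateDescriptorSets(). All sets handled by one recycler are assumed to share the same descriptor set layout, and a retired set is assumed to be no longer in use by the GPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Handle = std::uint64_t;  // stand-in for VkDescriptorSet

class SetRecycler {
public:
    // Marks a set as no longer needed; it is kept for reuse, never freed.
    void retire(Handle set) { free_.push_back(set); }

    // Hands out a retired set if one exists, otherwise allocates a new one.
    // The caller then overwrites its contents (vkUpdateDescriptorSets() in a
    // real renderer) instead of freeing and re-allocating.
    Handle acquire(const std::function<Handle()>& allocate) {
        if (!free_.empty()) {
            Handle set = free_.back();
            free_.pop_back();
            return set;
        }
        return allocate();
    }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<Handle> free_;
};
```

Because nothing is ever returned to the pool individually, the pool can be created without VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT.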
Going back to the initial case, we will now explore an alternative approach that is in some ways complementary to descriptor caching. Buffer management is another lever for optimizing performance, especially for applications in which descriptor caching is not quite feasible.
Each rendered object will typically need some uniform data that has to be pushed to the GPU somehow. A straightforward approach is to store a VkBuffer per object and update that data for each frame.
This already poses an interesting question: is one buffer enough? The problem is that this data changes dynamically, while the previous contents may still be in use by the GPU for a frame in flight.
Since we do not want to flush the GPU pipeline between each frame, we will need to keep several copies of each buffer, one for each frame in flight. Another similar option is to use just one buffer per object, but with a size equal to num_frames * buffer_size, then offset it dynamically based on the frame index.
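The offset arithmetic for the second option can be sketched as below. The function names are hypothetical. One detail goes beyond the num_frames * buffer_size formula: dynamic uniform buffer offsets must be a multiple of VkPhysicalDeviceLimits::minUniformBufferOffsetAlignment, so this sketch rounds each per-frame slice up to that alignment, which is assumed to be a power of two.

```cpp
#include <cstdint>

// Rounds value up to the next multiple of alignment (a power of two).
constexpr std::uint64_t align_up(std::uint64_t value, std::uint64_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

// Distance in bytes between the slices of two consecutive frames.
constexpr std::uint64_t slice_stride(std::uint64_t buffer_size,
                                     std::uint64_t min_alignment) {
    return align_up(buffer_size, min_alignment);
}

// Total size of the per-object buffer holding one slice per frame in flight.
constexpr std::uint64_t total_size(std::uint64_t num_frames,
                                   std::uint64_t buffer_size,
                                   std::uint64_t min_alignment) {
    return num_frames * slice_stride(buffer_size, min_alignment);
}

// Dynamic offset to use for a given frame; slices are reused round-robin.
constexpr std::uint64_t frame_offset(std::uint64_t frame_number,
                                     std::uint64_t num_frames,
                                     std::uint64_t buffer_size,
                                     std::uint64_t min_alignment) {
    return (frame_number % num_frames) * slice_stride(buffer_size, min_alignment);
}
```

For example, with 192 bytes of data, an alignment of 256 bytes and three frames in flight, each slice takes 256 bytes, the buffer is 768 bytes in total, and frame number 4 writes at offset 256.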
A similar approach is used in the default configuration of the sample. For each frame, one buffer per object is created and filled with data. This means that we will have many descriptor sets to create, since every object will need one that points to its VkBuffer. Furthermore, we will have to update many buffers separately, meaning we cannot control their memory layout and we might lose some optimization opportunities with caching.
We can address both problems by inverting the approach: instead of having a VkBuffer per object containing per-frame data, we will have a VkBuffer per frame containing per-object data. The buffer will be cleared at the beginning of the frame. Then each object will record its data and will receive a dynamic offset to be used at vkCmdBindDescriptorSets() time.
With this approach we will need fewer descriptor sets. This is because more objects can share one: they will all reference the same VkBuffer, but at different dynamic offsets. Furthermore, we can control the memory layout within the buffer.
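A per-frame buffer of this kind behaves like a linear allocator. The sketch below is a simplified model under assumptions: a std::vector stands in for the mapped memory of the per-frame VkBuffer, FrameUniformBuffer is a hypothetical class, and the alignment passed in is assumed to be the device's minUniformBufferOffsetAlignment. The value returned by push() is what a real renderer would pass as the dynamic offset to vkCmdBindDescriptorSets().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

class FrameUniformBuffer {
public:
    FrameUniformBuffer(std::size_t capacity, std::size_t alignment)
        : storage_(capacity), alignment_(alignment) {}

    // Called at the beginning of the frame: all previous data is discarded.
    void reset() { head_ = 0; }

    // Copies one object's data into the next aligned slot and returns the
    // dynamic offset at which the shared descriptor set should be bound.
    std::uint32_t push(const void* data, std::size_t size) {
        std::size_t offset = (head_ + alignment_ - 1) / alignment_ * alignment_;
        assert(offset + size <= storage_.size() && "per-frame buffer is full");
        std::memcpy(storage_.data() + offset, data, size);
        head_ = offset + size;
        return static_cast<std::uint32_t>(offset);
    }

private:
    std::vector<unsigned char> storage_;  // stand-in for mapped VkBuffer memory
    std::size_t alignment_;
    std::size_t head_ = 0;
};
```

All objects drawn in the frame can then share a single descriptor set of type VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC, and their data ends up contiguous in memory in submission order.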
Figure 4: Scene using a single large VkBuffer
Using a single large VkBuffer in this case shows a performance improvement similar to descriptor set caching.
For this relatively simple scene, stacking the two approaches does not provide a further performance boost, but for a more complex case they do stack nicely.
We would encourage you to check out the project on the Vulkan Mobile Best Practice GitHub page and try the sample for yourself. The tutorials have just been donated to The Khronos Group. The sample code gives developers on-screen control to demonstrate multiple ways of using the feature. It also shows the performance impact of the different approaches through real-time hardware counters on the display. You are also warmly invited to contribute to the project by providing feedback and fixes and creating additional samples.
Vulkan Best Practices: https://github.com/khronosGroup/Vulkan-samples