The new Vulkan API from Khronos is the first to be designed to work across all platform types. Vulkan is a very low-level API that gives developers explicit control of *everything*. In this blog we want to share with developers our experience of porting a graphics engine to the Vulkan API.
The engine we decided to port to Vulkan already had support for various other standard APIs such as OpenGL 4.5, OpenGL ES 2.0 and OpenGL ES 3.2. Although these APIs are all different, they have some characteristics in common, for example that an application cannot efficiently talk to the driver from multiple threads. Implementing multi-threading on top of those APIs would not improve performance, as it would require a lot of context switches which would eat up most of the performance budget. The GL APIs were designed for previous generations of GPU hardware, and whilst the capabilities of hardware and technology have evolved, the APIs haven’t done so at the same rate.

Vulkan therefore comes at just the right time to address most of the issues of the existing APIs. It also provides developers with a lot more control and flexibility over resources, and over how and when “jobs” are executed in the graphics pipeline. It does, however, mean that there isn’t the level of support developers might have come to expect from working with OpenGL ES, but there are Vulkan debug (validation) layers that can be plugged in to sanity-check your usage and keep you on the right track.
Another really important advantage of Vulkan is memory management. You might have wondered how buffers and textures are handled by the GPU pipeline and at what points they are actually allocated and destroyed. If you assumed they are allocated and deallocated at the moment you call glTexImage2D, glBufferData or glDeleteTextures, then you might be surprised. The behavior of memory management depends on the vendor, the GPU and the pipeline architecture. A desktop pipeline, which is typically immediate, is different from a mobile one (deferred). In OpenGL the memory management is hidden from the programmer and the driver decides when is the best time to allocate memory for a resource. While an immediate-mode architecture may be relatively simple to handle memory for, a deferred architecture is more complicated, as it requires retaining buffers across multiple frames.
If you want to learn more about deferred pipeline architecture please see Peter Harris' blog on frame pipelining.
For example, if you have a dynamic buffer that you allocate and deallocate within one frame, the driver would have to keep a reference to this data until it’s no longer used in the pipeline. For an immediate mode architecture this might be 1-2 frames, while a deferred mode architecture might need to keep track of this for 2-4 frames. Handling textures may require keeping a reference for even more frames, since the processing sits deeper in the pipeline. This requires maintaining more memory and complicates tracking the data. In the Vulkan API, all memory management is more explicit. The developer allocates and deallocates memory which needs to be bound to Vulkan objects in order to be useful. This also opens up the possibility to alias memory between, for instance, a texture and a buffer.
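To make the explicit model concrete, here is a minimal sketch of the allocate-then-bind flow for a buffer. Error handling and memory-type selection are omitted, and the helper and parameter names are illustrative rather than part of the engine:

```cpp
#include <vulkan/vulkan.h>

// Hypothetical helper: allocate device memory for a buffer and bind it.
// memoryTypeIndex must be a type allowed by reqs.memoryTypeBits, usually
// found by searching VkPhysicalDeviceMemoryProperties for suitable flags.
VkDeviceMemory allocateAndBind(VkDevice device, VkBuffer buffer,
                               uint32_t memoryTypeIndex)
{
    VkMemoryRequirements reqs;
    vkGetBufferMemoryRequirements(device, buffer, &reqs);

    VkMemoryAllocateInfo allocInfo = {};
    allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    allocInfo.allocationSize = reqs.size;
    allocInfo.memoryTypeIndex = memoryTypeIndex;

    VkDeviceMemory memory = VK_NULL_HANDLE;
    vkAllocateMemory(device, &allocInfo, nullptr, &memory);

    // The buffer is not usable until memory is bound to it.
    vkBindBufferMemory(device, buffer, memory, 0 /* offset */);
    return memory;
}
```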
It’s also up to the developer to ensure that memory that is used by the pipeline is valid throughout execution – this means that you cannot free memory that is bound to the pipeline. Luckily Vulkan also provides a means to check when a given command buffer has been completed and decide whether to free or recycle memory.
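One common way to do this is with fences. Below is a sketch, with hypothetical names, of tracking a submission so that the memory it references can be freed or recycled once the GPU has finished with it:

```cpp
#include <vulkan/vulkan.h>

// Sketch: associate a fence with each submission so we know when the GPU
// has finished with the memory referenced by that command buffer.
void submitAndTrack(VkDevice device, VkQueue queue, VkCommandBuffer cmd)
{
    VkFenceCreateInfo fenceInfo = {};
    fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;

    VkFence fence = VK_NULL_HANDLE;
    vkCreateFence(device, &fenceInfo, nullptr, &fence);

    VkSubmitInfo submit = {};
    submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.commandBufferCount = 1;
    submit.pCommandBuffers = &cmd;
    vkQueueSubmit(queue, 1, &submit, fence);

    // Later, typically a few frames down the line:
    if (vkGetFenceStatus(device, fence) == VK_SUCCESS) {
        // The GPU is done with this command buffer, so it is now safe to
        // free or recycle any memory it referenced.
        // recycleFrameMemory();  // hypothetical allocator hook
        vkDestroyFence(device, fence, nullptr);
    }
}
```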
Okay, let’s get to the nitty gritty of the porting experience and how we approached it. As mentioned earlier in the context of multi-threading, our engine was single threaded. Therefore, in the first approach we designed the Vulkan backend on the same model as our existing OpenGL backend. Effectively we just implemented another API without changing any of the existing structure of the engine. In order to achieve that we created a default command buffer to which we internally recorded commands, but all still on one thread. We also had to create several objects such as pipelines and pipeline layouts, which were reused for subsequent similar draw calls. We did, however, call vkQueueWaitIdle after every vkQueueSubmit, which caused stalls and compromised performance. That wasn’t our priority at this stage, as the main goal was to just get the engine backend up and running on Vulkan.
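The naive pattern looked roughly like the sketch below (hypothetical names, error handling omitted). It is correct, but it serializes the CPU and GPU and throws away most of Vulkan’s benefits:

```cpp
#include <vulkan/vulkan.h>

// "Get it running" pattern: submit the default command buffer and
// immediately wait for the queue to drain.
void submitAndStall(VkQueue queue, VkCommandBuffer cmd)
{
    VkSubmitInfo submit = {};
    submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.commandBufferCount = 1;
    submit.pCommandBuffers = &cmd;

    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
    vkQueueWaitIdle(queue); // stalls until the GPU has finished everything
}
```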
In the engine we had a custom shader language that we could translate to different standards: HLSL, GLSL and GLSL ES; basically just yet another shader cross compiler. The Vulkan API consumes shaders in the new SPIR-V intermediate representation, so that was another significant chunk of work to be done. First we had to cross compile using our custom compiler, then compile this down to SPIR-V and feed it to the Vulkan driver. What we found interesting about compiling down to SPIR-V was being able to extract binding information directly from the SPIR-V binary blob in order to create compatible pipeline layouts.
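Feeding the resulting SPIR-V blob to the driver is straightforward. A minimal sketch of creating a shader module from the cross-compiled binary (an illustrative helper, not the engine’s actual code):

```cpp
#include <vulkan/vulkan.h>
#include <vector>
#include <cstdint>

// A SPIR-V blob is a sequence of 32-bit words produced by our
// cross-compilation chain.
VkShaderModule createShaderModule(VkDevice device,
                                  const std::vector<uint32_t>& spirv)
{
    VkShaderModuleCreateInfo info = {};
    info.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
    info.codeSize = spirv.size() * sizeof(uint32_t); // size in bytes
    info.pCode = spirv.data();

    VkShaderModule module = VK_NULL_HANDLE;
    vkCreateShaderModule(device, &info, nullptr, &module);
    return module;
}
```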
In GL, rendering to a framebuffer doesn’t have a clear beginning and end. The only hints the driver gets are calls like SwapBuffers, clears/discards, or a switch of framebuffer objects. The Vulkan API addresses this by having the developer explicitly declare when rendering begins and ends. Our initial implementation mimicked the GL behavior: whenever we called SwapBuffers or changed framebuffer, we ended the current render pass and started a new one. This was obviously a temporary solution and definitely not what we would call best practice, but it helped us get the engine working to the same level as OpenGL with any content, except maybe for performance!
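For reference, a minimal sketch of what that explicit begin/end looks like when recording a command buffer (names and clear values are illustrative):

```cpp
#include <vulkan/vulkan.h>

// In Vulkan the start and end of rendering to a framebuffer are explicit,
// rather than inferred from SwapBuffers or an FBO switch.
void recordPass(VkCommandBuffer cmd, VkRenderPass renderPass,
                VkFramebuffer framebuffer, VkExtent2D extent)
{
    VkClearValue clear = {};
    clear.color = { { 0.0f, 0.0f, 0.0f, 1.0f } };

    VkRenderPassBeginInfo begin = {};
    begin.sType = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO;
    begin.renderPass = renderPass;
    begin.framebuffer = framebuffer;
    begin.renderArea.extent = extent;
    begin.clearValueCount = 1;
    begin.pClearValues = &clear;

    vkCmdBeginRenderPass(cmd, &begin, VK_SUBPASS_CONTENTS_INLINE);
    // ... vkCmdDraw* calls for everything targeting this framebuffer ...
    vkCmdEndRenderPass(cmd);
}
```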
Having the engine working on the Vulkan API as described above doesn’t give you better performance or the benefit of any of the cool features of Vulkan. In order to improve upon the process described above, we took the following steps to make the engine work more efficiently and to properly utilize Vulkan features.
We made the following key decisions:
The first step was redesigning the engine backend and, more specifically, the structures and functionality responsible for interfacing with the driver. Unfortunately, once we had ported to Vulkan, content running on OpenGL performed slightly worse than previously. This is mainly due to the descriptor set emulation that we implemented in the OpenGL backend, meaning that for our OpenGL backend we added support for multiple descriptor sets in the shaders. The SPIR-V cross compiler removes the sets in the OpenGL GLSL output and replaces them with a serialized list of binding locations together with a remapping table. The descriptor set emulation in OpenGL then has to rebind and remap textures, samplers, uniform buffers, etc. depending on the remapping table. This means that we get a slight performance hit if we have lots of shaders using wildly different binding tables, because of the API overhead of rebinding.
But this is a decision that you will have to make for your graphics engine. We wanted to support descriptor sets at a high level, but this can also be hidden at a lower level in your engine so that you don’t have to think about handling descriptor set remapping in your OpenGL implementation.
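To illustrate the kind of work the emulation has to do, here is a sketch of applying a remapping table in a GL backend; the table layout and names are hypothetical and not our engine’s actual structures:

```cpp
#include <glad/glad.h> // or any other GL function loader, already initialized
#include <vector>
#include <cstdint>

// Hypothetical remapping entry: (set, binding) in the original shader maps
// to a flattened GL binding point produced by the cross compiler.
struct BindingRemap {
    uint32_t set;
    uint32_t binding;
    GLuint   flatBinding;     // serialized location in the GL backend
    GLuint   resource;        // GL texture or buffer name to bind
    bool     isUniformBuffer;
};

void applyRemapTable(const std::vector<BindingRemap>& table)
{
    for (const BindingRemap& entry : table) {
        if (entry.isUniformBuffer) {
            // Rebind the UBO to the flattened binding point.
            glBindBufferBase(GL_UNIFORM_BUFFER, entry.flatBinding, entry.resource);
        } else {
            // Rebind the texture to the flattened texture unit.
            glActiveTexture(GL_TEXTURE0 + entry.flatBinding);
            glBindTexture(GL_TEXTURE_2D, entry.resource);
        }
    }
}
```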
We added a block based memory allocator which allows two types of allocations: “static” and “transient.” Static memory is allocated for either the scene or the engine lifetime. The transient allocation type, on the other hand, is memory which lasts one frame at most. Transient memory is allocated and tracked by our memory manager. We decided to track this memory for the same number of frames as there are images in our swap chain. This means that we only need to track entire blocks of memory, which simplifies a lot of things.
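As a rough illustration only (the interface and names are hypothetical, not the engine’s actual code), such an allocator might expose something like this:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Illustrative allocator interface.
enum class AllocationType {
    Static,    // lives for the scene or engine lifetime
    Transient  // lives for at most one frame
};

struct Allocation {
    VkDeviceMemory memory;  // the backing block
    VkDeviceSize   offset;  // sub-allocation offset within the block
    VkDeviceSize   size;
};

class BlockAllocator {
public:
    // Sub-allocates from large VkDeviceMemory blocks. Transient allocations
    // are tagged with the current frame index.
    Allocation allocate(VkDeviceSize size, VkDeviceSize alignment,
                        AllocationType type, uint32_t frameIndex);

    // Called once per frame: blocks whose frame index is older than the
    // swap-chain length are known to be idle and can be reused wholesale.
    void recycleTransientBlocks(uint32_t completedFrameIndex);
};
```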
We also implemented constant data memory updates using the block based memory allocator but for this we made ring frame allocators that can be pinned to specific threads in order to allow efficient uniform data updates across multiple threads.
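A simplified sketch of what a per-thread ring allocator for uniform data might look like, assuming persistently mapped host-visible memory; the names are illustrative rather than the engine’s real interface:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <cstring>

// Each thread owns one of these, so no locking is needed on the hot path.
struct RingFrameAllocator {
    uint8_t*     mappedBase;  // persistently mapped host-visible memory
    VkDeviceSize capacity;
    VkDeviceSize head = 0;
    VkDeviceSize alignment;   // minUniformBufferOffsetAlignment from the device limits

    // Copies 'size' bytes of constant data into the ring and returns the
    // offset to use as a dynamic uniform-buffer offset for this draw.
    VkDeviceSize push(const void* data, VkDeviceSize size)
    {
        VkDeviceSize offset = (head + alignment - 1) & ~(alignment - 1);
        // A real implementation would handle wrap-around and ensure the GPU
        // has finished with the region being overwritten.
        std::memcpy(mappedBase + offset, data, size);
        head = offset + size;
        return offset;
    }
};
```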
Device memory is memory that can be bound to Vulkan objects. You may be familiar with VBOs, PBOs, UBOs and so on in the OpenGL API; those were designed to be mapped and unmapped on the CPU, so that once the memory was mapped the CPU could read and write the data allocated by the GPU. In Vulkan this is more generic: the same device memory can back a vertex buffer, a pixel buffer, a texture, or even a framebuffer attachment, for instance.
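A minimal sketch of mapping device memory and writing to it from the CPU, assuming the memory type is host-visible (the helper name is illustrative):

```cpp
#include <vulkan/vulkan.h>
#include <cstring>

// Mapping works on VkDeviceMemory itself, regardless of whether the memory
// currently backs a vertex buffer, a staging buffer for a texture upload,
// or anything else.
void uploadData(VkDevice device, VkDeviceMemory memory,
                const void* src, VkDeviceSize size)
{
    void* dst = nullptr;
    vkMapMemory(device, memory, 0 /* offset */, size, 0 /* flags */, &dst);
    std::memcpy(dst, src, size);

    // Needed unless the memory type is HOST_COHERENT.
    VkMappedMemoryRange range = {};
    range.sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
    range.memory = memory;
    range.offset = 0;
    range.size = VK_WHOLE_SIZE;
    vkFlushMappedMemoryRanges(device, 1, &range);

    vkUnmapMemory(device, memory);
}
```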
As Vulkan is capable of recording to command buffers on multiple threads, we made a decision to make explicit abstractions for command buffers. We implemented this as a thin wrapper which pushes calls to the Vulkan command buffer object.
The command buffer abstraction also keeps track of state locally per command buffer, and the wrapper flushes the commands when required. By flushing we do not mean flushing the graphics pipeline as was the case in OpenGL; what we mean is creating any necessary Vulkan state objects. In other words, we group all draw calls with similar state, and as long as the draw calls share the same state we put them into one batch. Once a draw call arrives with different state, we flush the current state, which means creating new Vulkan objects. Note that we only create new pipeline objects if we don’t already have an equivalent one in the cache.
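A simplified sketch of that batching and flushing logic, with hypothetical names and the actual pipeline creation omitted:

```cpp
#include <vulkan/vulkan.h>
#include <unordered_map>
#include <cstdint>

// Illustrative wrapper logic: state changes are batched, and a new
// VkPipeline is only created when the accumulated state has no equivalent
// entry in the cache.
struct DrawState {
    uint64_t hash = 0;  // hash of shaders, blend/depth state, vertex layout, ...
};

class CommandBufferWrapper {
public:
    void setState(const DrawState& state) { pending = state; }

    void draw(VkCommandBuffer cmd, uint32_t vertexCount)
    {
        if (pending.hash != bound.hash) {
            flushState(cmd); // create-or-lookup happens only on a state change
            bound = pending;
        }
        vkCmdDraw(cmd, vertexCount, 1, 0, 0);
    }

private:
    void flushState(VkCommandBuffer cmd)
    {
        auto it = pipelineCache.find(pending.hash);
        VkPipeline pipeline = (it != pipelineCache.end())
            ? it->second
            : createPipelineForState(pending); // calls vkCreateGraphicsPipelines
        pipelineCache[pending.hash] = pipeline;
        vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline);
    }

    VkPipeline createPipelineForState(const DrawState&); // omitted for brevity

    DrawState pending, bound;
    std::unordered_map<uint64_t, VkPipeline> pipelineCache;
};
```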
There are a lot of Vulkan objects that are created on-demand, such as vkPipeline, vkPipelineLayout, vkDescriptorSetLayout, vkRenderPass and vkFramebuffer.
Some of these (like vkPipeline) need to be created during command buffer recording, which means the creation path needs to be thread-safe and efficient. In order to keep it running efficiently we reduced lock contention by spreading the caches over a number of threads. This ensured that we could create state objects during runtime without hitting performance bottlenecks.
vkFramebuffer and vkRenderPass need to be created depending on the render target to which you’re rendering so there are dependencies between different Vulkan objects that you need to track and ensure you handle correctly. The ultimate solution would be to create all state upfront, as that would mean you could just focus the graphics engine on feeding the GPU with existing data.
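For example, creating a vkFramebuffer requires both the render pass and the image views of the render target, which is exactly where that dependency shows up (a sketch with illustrative names):

```cpp
#include <vulkan/vulkan.h>

// A VkFramebuffer can only be created against a compatible VkRenderPass and
// the image views it will render into, so these objects have to be created
// (and cached) together.
VkFramebuffer createFramebuffer(VkDevice device, VkRenderPass renderPass,
                                VkImageView colorView, VkImageView depthView,
                                VkExtent2D extent)
{
    VkImageView attachments[] = { colorView, depthView };

    VkFramebufferCreateInfo info = {};
    info.sType = VK_STRUCTURE_TYPE_FRAMEBUFFER_CREATE_INFO;
    info.renderPass = renderPass;     // dependency on the render pass
    info.attachmentCount = 2;
    info.pAttachments = attachments;  // dependency on the render target views
    info.width = extent.width;
    info.height = extent.height;
    info.layers = 1;

    VkFramebuffer framebuffer = VK_NULL_HANDLE;
    vkCreateFramebuffer(device, &info, nullptr, &framebuffer);
    return framebuffer;
}
```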
We added an explicit render pass interface. The render pass is one of the most interesting features of the Vulkan API, as it basically defines where the rendered content is going to end up. You also explicitly declare what to load at the start of a render pass and what to store at the end. It also has more powerful capabilities such as rendering sub-passes. Sub-passes are basically a more vendor-neutral version of Pixel Local Storage: they allow a shader to read attachments written by a previous sub-pass at the same pixel location. You may read more about Pixel Local Storage in this blog.
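A minimal sketch of declaring those load and store operations when creating a single-subpass render pass (illustrative values; a real pass would also describe depth/stencil attachments and any sub-pass dependencies):

```cpp
#include <vulkan/vulkan.h>

// The render pass declares up front what happens to an attachment at the
// start (loadOp) and end (storeOp) of the pass. CLEAR avoids fetching old
// contents from memory; DONT_CARE as a storeOp would avoid writing back.
VkRenderPass createColorPass(VkDevice device, VkFormat colorFormat)
{
    VkAttachmentDescription color = {};
    color.format = colorFormat;
    color.samples = VK_SAMPLE_COUNT_1_BIT;
    color.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;   // don't load the old contents
    color.storeOp = VK_ATTACHMENT_STORE_OP_STORE; // keep the result after the pass
    color.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
    color.finalLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;

    VkAttachmentReference colorRef = {};
    colorRef.attachment = 0;
    colorRef.layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;

    VkSubpassDescription subpass = {};
    subpass.pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpass.colorAttachmentCount = 1;
    subpass.pColorAttachments = &colorRef;

    VkRenderPassCreateInfo info = {};
    info.sType = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
    info.attachmentCount = 1;
    info.pAttachments = &color;
    info.subpassCount = 1;
    info.pSubpasses = &subpass;

    VkRenderPass renderPass = VK_NULL_HANDLE;
    vkCreateRenderPass(device, &info, nullptr, &renderPass);
    return renderPass;
}
```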
We found it beneficial to write a library for this, as we could then parse and retrieve information about input and output bindings. We then reuse this information for pipeline layout creation. It is also very useful for re-using already existing pipeline layouts which are compatible across different shader buckets.
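We rolled our own parsing, but as an illustration of the idea, the same kind of reflection can be done with the open-source SPIRV-Cross library; this sketch is not the engine's code:

```cpp
#include "spirv_cross.hpp" // from the SPIRV-Cross project
#include <vector>
#include <cstdint>
#include <cstdio>

// Pulls (set, binding) pairs out of a SPIR-V blob; this is exactly the
// information needed to build compatible descriptor set and pipeline layouts.
void reflectBindings(const std::vector<uint32_t>& spirv)
{
    spirv_cross::Compiler compiler(spirv);
    spirv_cross::ShaderResources resources = compiler.get_shader_resources();

    for (const auto& ubo : resources.uniform_buffers) {
        uint32_t set     = compiler.get_decoration(ubo.id, spv::DecorationDescriptorSet);
        uint32_t binding = compiler.get_decoration(ubo.id, spv::DecorationBinding);
        std::printf("uniform buffer %s: set=%u binding=%u\n",
                    ubo.name.c_str(), set, binding);
    }
    for (const auto& tex : resources.sampled_images) {
        uint32_t set     = compiler.get_decoration(tex.id, spv::DecorationDescriptorSet);
        uint32_t binding = compiler.get_decoration(tex.id, spv::DecorationBinding);
        std::printf("sampled image %s: set=%u binding=%u\n",
                    tex.name.c_str(), set, binding);
    }
}
```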
As a result of porting the engine to Vulkan, we have created the Metropolis demonstration which showcases Vulkan’s benefits. The demo is at a very early stage of development. As we implement more features for GDC and then SIGGRAPH, you will see the obvious benefits of the API as we issue 2-3k draw calls per frame while simultaneously producing dynamic cascaded shadows.
There are more features of Vulkan that we will cover in a future blog post.
Note: This demo includes assets which have been created under licence from TurboSquid Inc.
If you enjoyed this blog, you may wish to read the next article in the series: Multi-threading in Vulkan. Click on the link below.
[CTAToken URL = "https://community.arm.com/graphics/b/blog/posts/multi-threading-in-vulkan" target="_blank" text="Read Multi-threading in Vulkan" class ="green"]