In this blog, we briefly look at two examples of how to use Vulkan to maximize the graphics performance in your game. We will walk you through a few key Vulkan performance samples that demonstrate common optimizations and best practices to follow in your mobile games, so you can start squeezing every last drop of performance out of the device and give your fans the game they absolutely need to play, through the power of Vulkan APIs.
As an Android game developer, you have two choices for graphics APIs: OpenGL ES and Vulkan. In this article, we are going to look at Vulkan. Designed to push 3D graphics on mobile devices, Vulkan acts as a super-thin abstraction layer. This gives you much more control, lower CPU overhead, a smaller memory footprint, and greater stability.
Vulkan's approach to high-performance, cross-platform graphics can be summed up in one line: “With great power comes great responsibility.” To enable maximum graphics performance, Vulkan gives you more control over hardware resources than OpenGL ES, in exchange for requiring more explicit memory management and synchronization. To achieve lower CPU overhead, the Vulkan API supports multithreaded command recording, which takes advantage of the four to eight cores built into mainstream mobile devices.
For more detail, Vulkan Essentials is a great resource with an in-depth explanation of how Vulkan works under the hood.
There are tons of great resources and examples available to learn how to use the Vulkan API. The two examples we will look at are Render Passes and Wait Idle, which demonstrate some of the most useful optimizations you can take advantage of in your own mobile game. These performance samples show recommended best practices for enhancing performance with the Vulkan APIs, and provide real-time profiling information to help you identify and understand bottlenecks in your application. The full set of samples and tutorials, open-sourced by Arm and administered by the Khronos Group, can be found here.
This article assumes you are familiar with 3D render pipelines and Vulkan API basics. If you are new to Vulkan, this Vulkan Guide and introductory tutorial helps you get your first triangles rendered. For additional examples, refer to these API samples that cover topics such as HDR, instancing, texture loading, and tessellation.
To work with the Vulkan samples, you need to have the right tools and dependencies. For Android, you can check out the Android section of the Build Guide.
The main prerequisites are:
Render pass attachments are how Vulkan keeps track of your input and output render targets. It might make sense to think of them as references to color or depth buffers. Configuring them optimally is a simple but effective way to gain precious milliseconds during the render pass.
Let us start by taking a look at this performance tutorial and sample code.
You will see an app rendering a 3D scene in a single pass with a GUI showing render stats and options to switch between load operations for the color attachment and store operations for the depth attachment.
Knowing whether the contents of an attachment need to be cleared, read from, or written to can greatly affect draw performance, because you can configure the attachment to minimize the number of read and write operations.
For example, because you do not need to read the previous contents of the final color buffer drawn to the screen, you can set the load operation in its attachment description to VK_ATTACHMENT_LOAD_OP_DONT_CARE and speed up your render pass.
You can test this by selecting Load for the color attachment load operation and watching the External Read Bytes value increase. With Load selected, the GPU has to read the existing contents of the color buffer from external memory before it draws the scene, even though the scene overwrites them.
Changing the depth attachment store operation has a similar effect on External Write Bytes, because it indicates whether you want to spend time writing the depth information back to the buffer at the end of the pass.
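To get a feel for the numbers behind External Read Bytes and External Write Bytes, here is a rough back-of-the-envelope sketch. It assumes an uncompressed, single-sampled attachment that is transferred in full once per frame. The helper names, the resolution, and the pixel size are illustrative assumptions, not values taken from the sample.

```cpp
#include <cstdint>

// Bytes moved between external memory and the GPU when a whole attachment
// is loaded (or stored) once. Assumption: no framebuffer compression and
// no multisampling, so every pixel costs exactly bytesPerPixel.
constexpr std::uint64_t attachmentBytes( std::uint32_t width,
                                         std::uint32_t height,
                                         std::uint32_t bytesPerPixel )
{
    return static_cast<std::uint64_t>( width ) * height * bytesPerPixel;
}

// Approximate traffic per second when that transfer happens once per frame.
constexpr std::uint64_t bytesPerSecond( std::uint64_t bytesPerFrame,
                                        std::uint32_t framesPerSecond )
{
    return bytesPerFrame * framesPerSecond;
}
```

Under these assumptions, an unnecessary load of a 1920 x 1080 color attachment at 4 bytes per pixel costs `attachmentBytes( 1920, 1080, 4 )` = 8,294,400 bytes per frame, or about 498 MB per second at 60 frames per second. Real devices may move less data than this because of framebuffer compression.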
Here is a typical setup for how you could optimally use render pass attachments when drawing a 3D scene in your own code:
// Zero-initialize so that unused fields such as flags hold valid values
VkAttachmentDescription attachments[ 2 ] = {};

// Color attachment: previous contents are not needed, the result is presented
attachments[ 0 ].format = colorFormat;
attachments[ 0 ].samples = VK_SAMPLE_COUNT_1_BIT;
attachments[ 0 ].loadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
attachments[ 0 ].storeOp = VK_ATTACHMENT_STORE_OP_STORE;
attachments[ 0 ].stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
attachments[ 0 ].stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
attachments[ 0 ].initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
attachments[ 0 ].finalLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;

// Depth attachment: cleared at the start of the pass, discarded at the end
attachments[ 1 ].format = depthFormat;
attachments[ 1 ].samples = VK_SAMPLE_COUNT_1_BIT;
attachments[ 1 ].loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
attachments[ 1 ].storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
attachments[ 1 ].stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
attachments[ 1 ].stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
attachments[ 1 ].initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
attachments[ 1 ].finalLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;

VkAttachmentReference colorReference = {};
colorReference.attachment = 0;
colorReference.layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;

VkAttachmentReference depthReference = {};
depthReference.attachment = 1;
depthReference.layout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;

VkSubpassDescription subpass = {};
subpass.pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpass.colorAttachmentCount = 1;
subpass.pColorAttachments = &colorReference;
subpass.pDepthStencilAttachment = &depthReference;

VkRenderPassCreateInfo renderPassInfo = {};
renderPassInfo.sType = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
renderPassInfo.attachmentCount = 2;
renderPassInfo.pAttachments = attachments;
renderPassInfo.subpassCount = 1;
renderPassInfo.pSubpasses = &subpass;

vkCreateRenderPass( g_device, &renderPassInfo, nullptr, &renderPass );
One final option demonstrated in this sample is the Use vkCmdClear checkbox, which will explicitly clear the color attachment, and demonstrates how doing so can negatively affect performance. Resetting the whole buffer by using the load operation is more efficient. Using this explicit clear function is better reserved for other scenarios, such as when you need to specify an inner rectangular region to be cleared.
For instance, if you want to keep a 10px border intact, you could record the following in your command buffer while the render pass is active:
VkClearAttachment clearAttachment = {};
clearAttachment.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
clearAttachment.colorAttachment = 0;
// VkClearColorValue is a union, so the clear color is given per channel
clearAttachment.clearValue.color = { { 0.0f, 0.0f, 0.0f, 0.0f } };

VkClearRect clearRect = {};
clearRect.baseArrayLayer = 0;
clearRect.layerCount = 1;
clearRect.rect.offset = { 10, 10 };
// Assumes width and height are both greater than 20
clearRect.rect.extent = { width - 20, height - 20 };

vkCmdClearAttachments( g_cmdBuffer, 1, &clearAttachment, 1, &clearRect );
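One detail worth guarding against in the snippet above: the extent fields of a Vulkan rectangle are unsigned, so `width - 20` wraps around to a huge value if the render target is ever narrower than the border. The following is a minimal sketch of a helper that computes the inner region safely. `InnerRect` and `insetRect` are hypothetical names used for illustration only; `InnerRect` is a stand-in for the offset and extent pair of `VkRect2D`, not a Vulkan type.

```cpp
#include <cstdint>

// Hypothetical stand-in mirroring the offset/extent pair of VkRect2D.
struct InnerRect
{
    std::int32_t  x;
    std::int32_t  y;
    std::uint32_t width;
    std::uint32_t height;
};

// Returns the region left after trimming `border` pixels from every edge.
// If the border does not fit, the result has zero width and height, and the
// caller should skip the clear rather than record it.
inline InnerRect insetRect( std::uint32_t width, std::uint32_t height, std::uint32_t border )
{
    // Computed in 64 bits so that a large border cannot wrap around.
    const std::uint64_t trim = 2ull * border;
    if ( trim >= width || trim >= height )
    {
        return { 0, 0, 0, 0 };
    }
    return { static_cast<std::int32_t>( border ),
             static_cast<std::int32_t>( border ),
             static_cast<std::uint32_t>( width - trim ),
             static_cast<std::uint32_t>( height - trim ) };
}
```

For a 1920 x 1080 target and a 10px border this yields an offset of (10, 10) and an extent of 1900 x 1060, which could then be copied into `clearRect.rect`.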
Optimization Tip: Identifying specifically how you are using each render pass attachment will help make sure you get the best read/write throughput. Remember to use VK_ATTACHMENT_LOAD_OP_CLEAR when you need to clear a render target. And when you do not need to read an attachment’s contents, set VK_ATTACHMENT_LOAD_OP_DONT_CARE to avoid unnecessary operations.
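The tip above can be written down as a small decision rule. This is only a sketch: `LoadChoice` and `chooseLoadOp` are hypothetical names, and the enum is a stand-in for the `VK_ATTACHMENT_LOAD_OP_*` values so the snippet compiles without the Vulkan headers. It also assumes one extra condition that the tip leaves implicit: `VK_ATTACHMENT_LOAD_OP_DONT_CARE` leaves the attachment contents undefined, so it is only a safe choice when the pass writes every pixel.

```cpp
// Hypothetical stand-in for the VK_ATTACHMENT_LOAD_OP_* values:
//   Load     -> VK_ATTACHMENT_LOAD_OP_LOAD
//   Clear    -> VK_ATTACHMENT_LOAD_OP_CLEAR
//   DontCare -> VK_ATTACHMENT_LOAD_OP_DONT_CARE
enum class LoadChoice { Load, Clear, DontCare };

// Picks a load operation from how the pass uses the attachment.
// needsPreviousContents: the pass reads or blends with what is already there.
// everyPixelOverwritten: the pass is guaranteed to write every pixel.
constexpr LoadChoice chooseLoadOp( bool needsPreviousContents, bool everyPixelOverwritten )
{
    return needsPreviousContents ? LoadChoice::Load
         : everyPixelOverwritten ? LoadChoice::DontCare
                                 : LoadChoice::Clear;
}
```

A full-screen scene drawn over a skybox would map to `DontCare`, a depth buffer to `Clear`, and a UI layer composited over an earlier pass to `Load`.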
Some of Vulkan’s render pipeline computations are done on the CPU, like creating command buffers, and others are done on the GPU, such as shaders and render targets. Processing them in the correct order means that the CPU and GPU need to work with each other with proper timing.
An easy and reliable way to accomplish this with the Vulkan API is to call vkQueueWaitIdle, which simply waits for the queue to be empty before the CPU adds new commands to hand off to the GPU. However, one of the biggest gains in render throughput can come from making sure your GPU and CPU are not sitting idle waiting for each other, so that both can start working on the next frame right away.
You can see how this makes a difference in the Wait Idle performance tutorial and sample code.
Running this sample shows a scene with two options, Wait Idle and Fences, and text showing the frame times (the average time it took to render the frame). This sample demonstrates how efficiently queuing up the next frame (or the next command buffer for more complex passes) can improve performance.
When you run the sample, you will notice that the frame times are much higher with the Wait Idle option selected, and lower when the Fences option is selected.
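Why the difference? A toy timeline model makes it visible. The sketch below is not taken from the sample: `totalTimeMs` is a hypothetical helper, and it assumes that every frame costs a fixed amount of CPU recording time and GPU execution time, that the GPU runs frames in order, and that the CPU may only start frame i once the GPU has finished frame i minus `framesInFlight`. With `framesInFlight` set to 1 this behaves like draining the queue before every frame. With 2 or more it behaves like waiting on a separate fence per frame.

```cpp
#include <algorithm>
#include <vector>

// Returns the time (in ms) at which the GPU finishes the last frame.
// cpuMs: time the CPU needs to record one frame (assumed constant).
// gpuMs: time the GPU needs to execute one frame (assumed constant).
inline double totalTimeMs( int frames, double cpuMs, double gpuMs, int framesInFlight )
{
    if ( frames <= 0 )
    {
        return 0.0;
    }
    std::vector<double> gpuEnd( frames, 0.0 );
    double cpuFree = 0.0; // when the CPU finished recording the previous frame
    for ( int i = 0; i < frames; ++i )
    {
        // The CPU waits until the frame that last used this slot is done on the GPU.
        const double slotFree = ( i >= framesInFlight ) ? gpuEnd[ i - framesInFlight ] : 0.0;
        const double cpuStart = std::max( cpuFree, slotFree );
        cpuFree = cpuStart + cpuMs;

        // The GPU starts once the commands are submitted and the previous frame is done.
        const double gpuFree = ( i > 0 ) ? gpuEnd[ i - 1 ] : 0.0;
        gpuEnd[ i ] = std::max( cpuFree, gpuFree ) + gpuMs;
    }
    return gpuEnd[ frames - 1 ];
}
```

With made-up costs of 4 ms on the CPU and 6 ms on the GPU, three frames take `totalTimeMs( 3, 4.0, 6.0, 1 )` = 30 ms when the CPU waits for the GPU to go idle, but only `totalTimeMs( 3, 4.0, 6.0, 2 )` = 22 ms when two frames may be in flight, because CPU recording overlaps with GPU execution.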
Here is how you could set up your render loop to use fences in your own code:
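void render()
{
    // Wait for the previous submission that used this fence to finish.
    // Create the fence with VK_FENCE_CREATE_SIGNALED_BIT so the first frame does not block.
    vkWaitForFences( g_device, 1, &g_renderFence, VK_TRUE, UINT64_MAX );
    vkResetFences( g_device, 1, &g_renderFence );

    // Update frame with new commands
    setCmdBuffer( g_cmdBuffer );

    uint32_t imageIndex;
    vkAcquireNextImageKHR( g_device, g_swapchain, UINT64_MAX, g_imageSemaphore, VK_NULL_HANDLE, &imageIndex );

    // Rendering waits for the acquired image, then signals g_renderSemaphore for presentation
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    VkSubmitInfo submitInfo = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    submitInfo.waitSemaphoreCount = 1;
    submitInfo.pWaitSemaphores = &g_imageSemaphore;
    submitInfo.pWaitDstStageMask = &waitStage;
    submitInfo.commandBufferCount = 1;
    submitInfo.pCommandBuffers = &g_cmdBuffer;
    submitInfo.signalSemaphoreCount = 1;
    submitInfo.pSignalSemaphores = &g_renderSemaphore;
    vkQueueSubmit( g_queue, 1, &submitInfo, g_renderFence );

    // Presentation waits until rendering has signaled g_renderSemaphore
    VkPresentInfoKHR presentInfo = { VK_STRUCTURE_TYPE_PRESENT_INFO_KHR };
    presentInfo.waitSemaphoreCount = 1;
    presentInfo.pWaitSemaphores = &g_renderSemaphore;
    presentInfo.swapchainCount = 1;
    presentInfo.pSwapchains = &g_swapchain;
    presentInfo.pImageIndices = &imageIndex;
    vkQueuePresentKHR( g_queue, &presentInfo );
}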
Optimization Tip: Keep your render queue moving by avoiding vkQueueWaitIdle and vkDeviceWaitIdle and using VkFence objects and vkWaitForFences. You need to make sure that each fence works independently of the others without overlap (separate render frames, for example). Also, if you have multiple commands within a single frame on the GPU that don’t need to be synchronized with the CPU, you might consider using VkSemaphore objects instead.
To see a more detailed example on synchronizing the CPU and GPU, you can also take a look at this Vulkan tutorial for Frames in Flight.
We briefly looked at two examples of how to use Vulkan to maximize the graphics performance in your game. Vulkan provides some low-level optimizations that require you to manage processes in your app on a more granular level. But as you have seen, adopting a few individual Vulkan practices is an easy way to get started and can pay performance dividends immediately.
That is only the beginning. There are many more open source tutorials and samples available here to help you optimize the drawing of polygons and do more with your render passes in your mobile game.
Here are a few more performance samples we recommend if you are developing for Android devices with Vulkan:
We would encourage you to check out the project on the Vulkan Samples GitHub page and try the sample for yourself. The project has just been donated to The Khronos Group. You can tweak the number of command buffers and the allocation strategy directly on the screen, showing the performance impact through real-time hardware counter graphs. You are also warmly invited to contribute to the project by providing feedback and fixes and creating additional samples.
You may also read the other posts in this series:
And here are some other useful resources:
Vulkan Samples: https://github.com/KhronosGroup/Vulkan-Samples
This article was originally posted on CodeProject as a sponsored article by Arm. It was written by Raphael Munn and you can find the link to the CodeProject article here