Recently I have been working with my colleague, Darío, at W4 Games on improving the performance of Godot when using the Vulkan backend on mobile devices. Before setting out on this adventure, neither Darío nor I had much experience optimizing renderers specifically for mobile devices, so this was a learning opportunity for us. Thankfully, the documentation for Arm Performance Studio and the team at Arm were extremely helpful.
In this series I am going to show you how we utilized Arm Performance Studio to identify and resolve major performance issues in Godot’s Vulkan-based mobile renderer. Godot is open source, so you can read, run, and copy all the changes discussed in this blog post series into your project.
This is a two-part series. In the first part, I show how we resolved a major bandwidth bottleneck in Godot using Streamline and Performance Advisor. The second part covers how we improved performance step by step using Streamline and the Mali Offline Compiler.
Godot is a powerful, cross-platform, free, and open-source game engine that is widely used for making games on mobile devices (among other platforms). It includes an intuitive, feature-rich editor that runs anywhere the engine does. Weighing in at under 100 MB, it is a great tool for making games that run practically anywhere.
In 2023, Godot 4 introduced its biggest update yet. Godot 4 came with several ground-breaking new features, among them the move to a Vulkan backend for advanced rendering. The new Vulkan-based renderer came with substantial improvements to both quality and performance as well as several exciting new features (like real-time dynamic global illumination). However, the new renderer and new features were optimized for desktop architectures, which left the performance of Godot using Vulkan on mobile lagging behind the simpler, less feature-rich OpenGL backend.
To set the foundation: coming into this project, we knew that something was wrong with our Mobile renderer. On Android it uses Vulkan through our RenderingDevice abstraction (which currently abstracts D3D12, Metal, and Vulkan, with WebGPU planned). The Mobile renderer was split off from the Forward+ renderer (our desktop-focused renderer) several years ago and has many architectural changes that make it more suitable for mobile such as:
It still shares a lot of code with the Forward+ renderer especially:
Finally, when we designed our RenderingDevice abstraction we had only implemented the Forward+ renderer and did not have much experience using Vulkan on Mobile. In short, we knew improvements were needed but were not sure where to begin.
The changes in this post have been merged into Godot. View them on our GitHub Page.
We will start from a build of Godot that corresponds to Godot 4.4 dev 5 with Git hash 9e6098432aac35bae42c9089a29ba2a80320d823 and then incrementally cherry-pick the subsequent changes to Godot that addressed the issues I identify here.
If you want to validate the findings of this post, you can build Godot yourself by following the Android build instructions in the official Godot documentation.
We initially tested with a complex scene but noticed unexpected performance scaling. Suspecting bandwidth limitations, we created a simple test scene to isolate the issue. The scene consists of a single gray cube lit by a directional light. However, the scene is rendered at a resolution of 2688 x 5984 (2x the native resolution of the Pixel 8 Pro) with 4x MSAA. It looks deceptively simple.
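To get a feel for why bandwidth was our first suspect, here is a rough footprint calculation for this scene. It is only a sketch: the 4 bytes per sample and the 60 FPS target are assumptions chosen for illustration, not values taken from Godot's renderer or from the captures discussed later.

```python
# Back-of-the-envelope framebuffer footprint for the test scene.
WIDTH, HEIGHT = 2688, 5984   # render resolution stated in the post
MSAA_SAMPLES = 4             # 4x MSAA, as stated in the post
BYTES_PER_SAMPLE = 4         # ASSUMPTION: 32-bit color format (e.g. RGBA8)
TARGET_FPS = 60              # ASSUMPTION: a 60 Hz v-sync target

pixels = WIDTH * HEIGHT
resolved_bytes = pixels * BYTES_PER_SAMPLE      # single-sampled color image
msaa_bytes = resolved_bytes * MSAA_SAMPLES      # multisampled color image


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


# Cost of writing the multisampled image to main memory once per frame.
msaa_store_bytes_per_second = msaa_bytes * TARGET_FPS

print(f"pixels per frame:        {pixels:,}")
print(f"resolved color image:    {to_mib(resolved_bytes):.1f} MiB")
print(f"4x MSAA color image:     {to_mib(msaa_bytes):.1f} MiB")
print(f"MSAA store at {TARGET_FPS} FPS:    "
      f"{msaa_store_bytes_per_second / 1e9:.1f} GB/s")
```

Under these assumptions the multisampled color image alone is about 245 MiB, so writing it out to main memory every frame would cost on the order of 15 GB per second before any actual shading work is counted.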
This simple scene should, in theory, stay v-sync locked on most mobile devices. However, defying all expectations this runs at 50 FPS on a Pixel 8 Pro. Clearly there is a significant bottleneck, so let us see where it is.
If you want to follow along, you can download this minimal project.
First, we start with Performance Advisor to get some quick insights into why this scene is performing so poorly.
To use Performance Advisor with Godot, ensure you are running a debug build. The Performance Advisor docs say that you need a “debuggable build” of the application. For Godot, that means having “debug enabled” checked in the export menu when you export your game. If you are building the engine from source, you also need to ensure that you have built a “debug_template” and that it is selected in the “custom template” field.
To get the most out of Performance Advisor, you need to run a special Python script that is provided by Arm Performance Studio. This script connects Streamline to your device and exposes valuable counters and frame information that Streamline and Performance Advisor can use.
From the command line, navigate to `<install_directory>/streamline/bin/android` inside the Arm Performance Studio directory and run the Python script contained there:
`python3 streamline_me.py --lwi-mode counters`
Follow the instructions that pop up and select the application you want to debug. Once you have done that, you can open Streamline.
First, select your device. Ensure USB Debugging is enabled in the Developer Options. Then, select the same application you selected in the terminal. It is very important that you select the same application.
Finally, run the application. Streamline will launch the application and then give you an overview of the counters that looks like this. You can customize which counters are shown using the menu on the left-hand side.
For now, we are just going to let this run for ten seconds and then stop the recording.
We will come back to this later, but for now, we have captured the data we need. Performance Advisor can take this and provide some context and advice.
Exit out of the terminal you opened previously.
Navigate to the folder where your Streamline capture is saved. From there, run the Performance Advisor on the capture. Performance Advisor takes your Streamline capture and creates an easy-to-read HTML report that you can open in any browser.
Generate the report using this command:
`Streamline-cli -pa <filename>.apc [options]`
Test scene report
Performance Advisor correctly flags the fragment stage as a significant problem. That is not too surprising given that we are rendering at 2x scale and with 4x MSAA.
Scrolling down, you can see that GPU bandwidth per frame indeed looks suspiciously high. Godot writes 30× more data to memory than it reads, which is unexpected. Also notice the near-perfect negative correlation between bandwidth and FPS. We need to do something to bring bandwidth usage down.
The most likely suspect here is load/store ops (to use the Vulkan terminology). These are properties of render passes in Vulkan that instruct the GPU driver what to do with your framebuffer attachments. For loading, you can either load the attachment, clear the attachment, or tell the GPU driver that you do not care what happens, in which case the contents are undefined. For storing you can either store the attachment or specify do not care, in which case it is up to the driver whether to store the data or not.
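The options described above can be modeled in a few lines. The sketch below is not Godot or Vulkan code: the `LoadOp`, `StoreOp`, and `Attachment` types and the two example configurations are hypothetical, and only the `VK_ATTACHMENT_*` strings are real Vulkan enum names. It simply counts how many attachments force a transfer to or from main memory under each configuration.

```python
from dataclasses import dataclass
from enum import Enum


class LoadOp(Enum):
    LOAD = "VK_ATTACHMENT_LOAD_OP_LOAD"            # read old contents from memory
    CLEAR = "VK_ATTACHMENT_LOAD_OP_CLEAR"          # start from a clear value
    DONT_CARE = "VK_ATTACHMENT_LOAD_OP_DONT_CARE"  # starting contents undefined


class StoreOp(Enum):
    STORE = "VK_ATTACHMENT_STORE_OP_STORE"           # write results to memory
    DONT_CARE = "VK_ATTACHMENT_STORE_OP_DONT_CARE"   # driver may discard results


@dataclass(frozen=True)
class Attachment:
    """Hypothetical stand-in for one framebuffer attachment description."""
    name: str
    load_op: LoadOp
    store_op: StoreOp

    def must_read_from_memory(self) -> bool:
        return self.load_op is LoadOp.LOAD

    def must_write_to_memory(self) -> bool:
        return self.store_op is StoreOp.STORE


def memory_traffic(attachments):
    """Return (forced loads, forced stores) for a list of attachments."""
    loads = sum(a.must_read_from_memory() for a in attachments)
    stores = sum(a.must_write_to_memory() for a in attachments)
    return loads, stores


# ILLUSTRATIVE configuration: load and store everything.
wasteful = [
    Attachment("msaa_color", LoadOp.LOAD, StoreOp.STORE),
    Attachment("depth", LoadOp.LOAD, StoreOp.STORE),
    Attachment("resolved_color", LoadOp.LOAD, StoreOp.STORE),
]

# ILLUSTRATIVE configuration: clear on chip, keep only the resolved image.
frugal = [
    Attachment("msaa_color", LoadOp.CLEAR, StoreOp.DONT_CARE),
    Attachment("depth", LoadOp.CLEAR, StoreOp.DONT_CARE),
    Attachment("resolved_color", LoadOp.DONT_CARE, StoreOp.STORE),
]

print("wasteful (loads, stores):", memory_traffic(wasteful))
print("frugal   (loads, stores):", memory_traffic(frugal))
```

The second configuration forces a single store, the resolved image, and no loads at all, which is the kind of outcome you want on a bandwidth-limited GPU.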
Load/store ops are often forgotten when targeting desktop devices, where they rarely have much of an impact on performance. On mobile devices, however, bandwidth is limited and must be used efficiently. One of the most costly operations is moving data on and off chip, so you want to avoid both loading and storing attachments as much as possible.
Godot’s renderer was originally crafted with desktop devices in mind. Subpasses are a relatively recent addition. Unfortunately, our higher-level handling of render passes was designed around desktop architectures which means that:
On desktop architectures, these simplifications were relatively harmless, but on mobile devices they led us to the problem I highlighted above. Further, the result of this is that both the MSAA version of the texture and the resolve target are copied back into main memory. Ouch.
The solution is naturally to be more careful about assigning load/store ops and to only load/store what is needed. That is easier said than done! Godot is a general-purpose engine, and we do not control user content. Further, users can drop down to the lower-level rendering API anytime they want. Currently they have to specify load/store ops on a per-render-pass basis. We could extend that to have more granularity, but then the rendering API becomes difficult to use and too much complexity is pushed onto the user. Godot is designed to be easy to use and to take care of the low-level tasks that game developers do not want to think about. So we want our solution to be elegant and easy to use.
Fortunately, Godot already has a system that simplifies this process. We already use an acyclic render graph to reorder render passes and insert barriers between them as necessary. We can re-use a lot of the same logic to detect how attachments are used and then conditionally enable or disable load/store ops as needed. This allows the engine to automatically optimize bandwidth usage based on what users do, without them having to manually specify load/store ops.
For more details, check out the Pull Request on GitHub.
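To make the idea concrete, here is a minimal sketch of deriving load/store ops from how attachments are used across a frame. This is not Godot's render graph code: `RenderPass`, `AttachmentUse`, `derive_ops`, and the example passes are all hypothetical, and the rules are a simplification that assumes passes run in list order.

```python
from dataclasses import dataclass, field


@dataclass
class AttachmentUse:
    name: str
    clear: bool = False          # the pass asks for a clear at its start


@dataclass
class RenderPass:
    name: str
    attachments: list            # AttachmentUse entries rendered to by this pass
    samples: list = field(default_factory=list)  # textures read in shaders


def needed_later(name, later_passes, kept_after_frame):
    """True if the attachment's contents are consumed after the current pass."""
    for later in later_passes:
        if name in later.samples:
            return True          # sampled as a texture: contents required
        for use in later.attachments:
            if use.name == name:
                # Re-used as an attachment: required unless it is cleared first.
                return not use.clear
    return name in kept_after_frame  # e.g. an image that is presented


def derive_ops(passes, kept_after_frame):
    """Map (pass name, attachment name) -> (load op, store op)."""
    ops = {}
    written = set()              # attachments with defined contents so far
    for index, render_pass in enumerate(passes):
        later = passes[index + 1:]
        for use in render_pass.attachments:
            if use.clear:
                load = "CLEAR"
            elif use.name in written:
                load = "LOAD"        # earlier results must be preserved
            else:
                load = "DONT_CARE"   # nothing meaningful to load
            keep = needed_later(use.name, later, kept_after_frame)
            store = "STORE" if keep else "DONT_CARE"
            ops[(render_pass.name, use.name)] = (load, store)
            written.add(use.name)
    return ops


# ILLUSTRATIVE frame: an MSAA scene pass followed by a post-process pass.
frame = [
    RenderPass("scene", [AttachmentUse("msaa_color", clear=True),
                         AttachmentUse("depth", clear=True),
                         AttachmentUse("resolved_color")]),
    RenderPass("post", [AttachmentUse("swapchain")],
               samples=["resolved_color"]),
]

for key, value in derive_ops(frame, kept_after_frame={"swapchain"}).items():
    print(key, "->", value)
```

In this example only the resolved color image and the swapchain image end up with a store, while the multisampled color and depth attachments are cleared and then discarded. The user never specifies any of this; it falls out of the usage information the graph already has.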
In the repo, this change is in commit 6d5ac8f7ef4a3ddaf50720ab473b9dffece21674. Let us look at another Performance Advisor capture with this change applied.
You can see the full report here.
First, you can see that we are now v-sync locked and GPU utilization has dropped to 26%. Since we are hitting our performance targets, Performance Advisor has no more advice for us.
Let us see what happened to GPU bandwidth per frame.
Notice 3 things:
In the next part of this blog series, we look at a more complex scene using Streamline and the Mali Offline Compiler to see how we can improve performance by reducing fragment operations.