This is the second part of a two-part blog series exploring how my colleague and I at W4 Games used Arm Performance Studio to optimize Godot’s mobile renderer for Android devices.
In part 1, I walked through the process of using Streamline and the Performance Advisor to confirm a bandwidth bottleneck and validate that our bandwidth optimization worked as intended.
In this part, we are going to look at a more complex scene and use a combination of Streamline and the Mali Offline Compiler to identify high-impact optimizations that improve performance without adding significant complexity to the renderer.
Godot is a fully featured, cross-platform, free and open-source game engine that is widely used for making games on mobile devices (among other platforms). It features an easy-to-use and incredibly capable editor that can run almost anywhere the engine can run. Weighing in at under 100 MB, it’s a great tool for making games that run practically anywhere.
In 2023, Godot released its biggest update yet, version 4. Godot 4 came with several groundbreaking new features, among them a move to a Vulkan API backend for advanced rendering. The new Vulkan-based renderer brought substantial improvements to both quality and performance, as well as several exciting new features (like real-time dynamic global illumination). However, the new renderer and its features were optimized for desktop architectures, which left the performance of Godot using Vulkan on mobile lagging behind the simpler, less feature-rich OpenGL backend.
Here is a screenshot of the scene we used for testing. It is a modified version of the official Godot Third Person Shooter demo, which is available on GitHub.
Note the complexity of the geometry, dynamic lights with shadows, multiple particle systems, multiple animated characters, and decals. This was intended to be a performance-heavy scene that challenges even the best mobile GPUs.
This scene is unoptimized for mobile devices and does not represent a typical mobile scene. However, we decided to profile with it anyway for 3 reasons:
A core part of the Godot philosophy is to avoid performance cliffs. Testing on a scene like this helps us cover a wide range of potential performance problems and ensures that our worst-case performance is as good as possible, even if the scene itself may never reach acceptable framerates.
All the changes mentioned in this post are actual changes that have been merged into Godot. You can see them on our GitHub page. As before, we will start with a specific commit hash. In this case it will be e2c6daf7eff6e0b7e2e8d967e95a9ad56e948231, the first commit to introduce our “ubershader” system (I will explain what that means below). The reason for starting from a slightly older base commit is that it is one of the latest commits on top of which the other changes apply cleanly.
Godot’s codebase changes very quickly, so it can be challenging to pull out commits and view them in isolation. By starting from a bit further back, it makes it easier to see the impact of individual changes.
If you want to validate the findings of this post, you can build Godot yourself by following the Android build instructions in the official Godot documentation.
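For reference, compiling the Android export templates looks roughly like this, based on those instructions (the exact targets, architectures, and Gradle steps depend on your Godot version and setup):

scons platform=android target=template_release arch=arm64
cd platform/android/java
./gradlew generateGodotTemplates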
To measure performance, I used a Pixel 8 Pro provided to me by Arm for this purpose.
Running our demo on the device results in a frame rate of about 33 FPS (or about 30 milliseconds per frame).
Already things look quite bad. I expected that the Pixel 8 Pro would have no problem with this scene (despite my comments above). To understand why this is happening, let’s start by looking at the Performance Advisor report.
It’s no surprise that fragment cost still dominates the frame, given how much is going on. Now let’s look at our Streamline capture to see what further details we can learn.

You can see right away that a little over half of the frame is spent exclusively on fragment shader work. Further, we can see that during the fragment shader stage we have a high number of arithmetic and load/store operations. This shows us that we are doing too much work in the fragment shader.
Clearly, we need to do less work in the fragment shader in our main pass. Godot uses one template shader to do most 3D rendering. When a user authors a GDShader, it gets inserted into the middle of our template shader. This template shader needs to support every feature and thus it can be quite big. Further, the shader needs to dynamically switch between features at run time.
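To make this concrete, here is a minimal user-authored GDShader (the values are just illustrative). Even a shader this small ends up embedded in the middle of the full template shader:

shader_type spatial;

void fragment() {
    // Only these two lines are user code; the surrounding template
    // supplies lighting, shadows, fog, and every other engine feature.
    ALBEDO = vec3(0.8, 0.1, 0.1);
    ROUGHNESS = 0.4;
}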
Because Godot is so flexible, such a template shader can be difficult to optimize. It needs to support a lot of features to keep things simple on the user side, but we also want it to run as fast as possible.
To learn a bit more, we are going to turn to another tool, the Mali Offline Compiler.
The Mali Offline Compiler is an amazing tool to have in your toolbelt. It can be run without a device connected, which makes it very convenient. Not only does it provide a host of useful statistics, it also makes iteration fast: you can get a quick sense of the performance impact of a change without having to recompile your code and transfer it to the device.
For this sample, I am going to take the fragment shader from one of the structures in the scene. It doesn’t really matter which shader I choose, since they all use the same template scene shader, and it’s that template shader that we want to optimize today.
To retrieve the shader from Godot, we select the material in the inspector, then select “Inspect Native Shader Code”.
Doing so will open a popup menu where you can select either the vertex or fragment shader and which shader variant you would like to use. For this analysis, I selected the first shader variant of the fragment shader since it is our default opaque color variant.
The Mali Offline Compiler is a command-line tool, so the next step is to copy that shader into a local file and run the compiler on it. I used the command:
./malioc --vulkan --fragment old.frag
Main shader
===========
Work registers: 64 (100% used at 50% occupancy)
Uniform registers: 128 (100% used)
Stack spilling: 42 bytes
16-bit arithmetic: 0%

                              A       LS      V      T    Bound
Total instruction cycles: 10.94   124.00   0.22   2.00       LS
Shortest path cycles:      1.47    25.00   0.22   0.38       LS
Longest path cycles:        N/A      N/A    N/A    N/A      N/A

A = Arithmetic, LS = Load/Store, V = Varying, T = Texture

Shader properties
=================
Has uniform computation: true
Has side-effects: false
Modifies coverage: false
Uses late ZS test: false
Uses late ZS update: false
Reads color buffer: false
The number of Load/Store operations was unexpectedly high, even in the shortest path. In practice, we expect to follow a path that is close to the shortest path, so seeing such high numbers there was shocking. Ideally the values would all be in the single digits for a shader like this.
The next step in our analysis was to enable and disable different code paths and evaluate their impact on ALUs and Load/Store operations.
As expected, the high number of dynamic branches and the dynamic loops for lights are contributing to a high number of fragment shader operations, especially Load/Store operations. These control-flow paths are currently selected by uniform values, which do not change within a single draw call. Because these values are already known when the pipeline is created, they are good candidates for replacement by specialization constants.
Specialization constants are a useful tool for optimizing pipelines. They allow you to supply constant values to already-compiled shader modules at pipeline creation time, which the GPU driver can use to further optimize the shader code inside the pipeline (for example, by pruning dead branches or unrolling loops). You have to be careful when relying on specialization constants, as they do require the pipeline to go through an optimization step again, and that isn’t free. Overuse can lead to load-time issues and stuttering at run time. Luckily, we have a fantastic ubershader system that compiles one version of the shader with all dynamic values, which can be used as a fallback while the optimized versions are compiled in the background. You can read more about the ubershader technique in the Godot documentation here.
By replacing complex branches and loops with specialization constants, the GPU driver can optimize out a lot of code that it was previously unable to. Additionally, for loops, it can make more informed decisions about whether to unroll.
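Here is a minimal Vulkan GLSL sketch of the idea. The constant names and the stubbed helper functions are purely illustrative, not Godot’s actual identifiers:

#version 450

// Specialization constants: the real values are supplied at pipeline
// creation time, so the driver sees them as compile-time constants.
layout(constant_id = 0) const bool sc_use_shadows = false;
layout(constant_id = 1) const uint sc_omni_light_count = 0u;

layout(location = 0) out vec4 frag_color;

// Stubs standing in for the real lighting and shadow helpers.
vec3 shade_omni_light(uint i) { return vec3(0.1); }
float sample_shadow() { return 1.0; }

void main() {
    vec3 color = vec3(0.0);
    // A constant trip count lets the compiler decide whether to unroll.
    for (uint i = 0u; i < sc_omni_light_count; i++) {
        color += shade_omni_light(i);
    }
    // When sc_use_shadows is specialized to false, this branch and the
    // work behind it are removed from the pipeline entirely.
    if (sc_use_shadows) {
        color *= sample_shadow();
    }
    frag_color = vec4(color, 1.0);
}

Because the driver knows these values when it builds the pipeline, an untaken branch costs nothing at run time, unlike a uniform-driven branch that every fragment must evaluate.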
Another change that yielded great results was skipping light and shadow calculations when the distance or angle attenuation of a light is very low. This optimization can have a significant impact on OmniLight3Ds with a long range, and on SpotLight3Ds. We use a traditional forward light list in the mobile renderer, pairing lights with objects using Axis-Aligned Bounding Boxes (AABBs). Since lights are not boxes, it is surprisingly common for a light to be paired with an object while contributing little or nothing to the final image. With this optimization, those lights are skipped in the fragment shader, and we avoid doing any expensive shadow sampling or lighting calculations. You can read the source of the change here.
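As a rough sketch of what this early-out looks like inside the per-light loop (again with illustrative names rather than Godot’s actual code):

#version 450

layout(constant_id = 0) const uint sc_omni_light_count = 4u;

layout(location = 0) out vec4 frag_color;

// Stubs standing in for the real helpers (hypothetical names).
float omni_attenuation(uint i) { return 0.5; } // distance/angle falloff
float sample_shadow(uint i) { return 1.0; }    // expensive shadow-map taps
vec3 light_brdf(uint i) { return vec3(0.2); }  // lighting calculations

void main() {
    vec3 light = vec3(0.0);
    for (uint i = 0u; i < sc_omni_light_count; i++) {
        float att = omni_attenuation(i);
        // An AABB-paired light can still contribute almost nothing to
        // this fragment: skip its shadow sampling and lighting entirely.
        if (att < 1e-4) {
            continue;
        }
        light += att * sample_shadow(i) * light_brdf(i);
    }
    frag_color = vec4(light, 1.0);
}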
Taking another look at the Mali Offline Compiler, we see that our optimizations indeed made a huge difference.
Main shader
===========
Work registers: 64 (100% used at 50% occupancy)
Uniform registers: 128 (100% used)
Stack spilling: false
16-bit arithmetic: 1%

                              A       LS      V      T    Bound
Total instruction cycles:  6.95    15.00   0.22   1.38       LS
Shortest path cycles:      1.54     0.00   0.22   0.38        A
Longest path cycles:       6.06    21.00   0.22   1.38       LS

A = Arithmetic, LS = Load/Store, V = Varying, T = Texture

Shader properties
=================
Has uniform computation: true
Has side-effects: false
Modifies coverage: false
Uses late ZS test: false
Uses late ZS update: false
Reads color buffer: false
Note especially that there is no more stack spilling. Importantly, our Load/Store operations are now down to a much more reasonable level. In practice, it is unclear whether we are Load/Store or ALU bound now since each invocation of the shader will lie somewhere between the shortest path and the longest path.
Now that we have validated these changes in the Mali Offline Compiler, we can test them out on device.
With these optimizations, the demo now runs at about 43 FPS (23 milliseconds per frame). That’s approximately a 7 ms improvement without any content changes! All users will benefit from this improvement to varying degrees.
Looking again at the Performance Advisor confirms the improvement but also confirms that we are still fragment shader bound and have a way to go to meet our performance goals.
All these changes shipped in Godot 4 and are running in real games today! Since then, we have continued to optimize our Vulkan renderer with the goal of making it as fast as possible.
Based on our profiling, we identified a number of areas that could still be improved.
As a sneak peek, here is the output of the Mali Offline Compiler in Godot 4.5-beta1 for the same shader analyzed previously:

Main shader
===========
Work registers: 59 (92% used at 50% occupancy)
Uniform registers: 128 (100% used)
Stack spilling: false
16-bit arithmetic: 50%

                              A       LS      V      T    Bound
Total instruction cycles:  5.59    11.00   0.25   1.00       LS
Shortest path cycles:      1.23     0.00   0.25   0.38        A
Longest path cycles:       4.81     4.00   0.25   1.00        A

A = Arithmetic, LS = Load/Store, V = Varying, T = Texture

Shader properties
=================
Has uniform computation: true
Has side-effects: false
Modifies coverage: false
Uses late ZS test: false
Uses late ZS update: false
Reads color buffer: false
Clearly the results are even better. But we still have more room to improve!
This covers a few of the major optimizations that we have worked on recently. There have been many more, and there will be more to come as work continues. Feel free to follow development on GitHub if you want to see what else we are up to.