This is the second part of a two-part blog series exploring how my colleague and I at W4 Games used Arm Performance Studio to optimize Godot’s mobile renderer for Android devices.
In part 1, I walked through the process of using Streamline and the Performance Advisor to confirm a bandwidth bottleneck and validate that our bandwidth optimization worked as intended.
In this part, we are going to look at a more complex scene and use a combination of Streamline and the Mali Offline Compiler to identify high-impact optimizations that improve performance without adding significant complexity to the renderer.
Godot is a fully featured, cross-platform, free and open-source game engine that is widely used for making games on mobile devices (among other platforms). It features an easy-to-use and incredibly capable editor that can run almost anywhere the engine can run. Weighing in at under 100 MB, it’s a great tool for making games that run practically anywhere.
In 2023, Godot released its biggest update yet, version 4. Godot 4 came with several groundbreaking new features, among them a move to a Vulkan API backend for advanced rendering. The new Vulkan-based renderer brought substantial improvements to both quality and performance, as well as several exciting new features (like real-time dynamic global illumination). However, the new renderer and its features were optimized for desktop architectures, which left the performance of Godot using Vulkan on mobile lagging behind the simpler, less feature-rich OpenGL backend.
Here is a screenshot of the scene we used for testing. It is a modified version of the official Godot Third Person Shooter demo, which is available on GitHub.
Note the complexity of the geometry, dynamic lights with shadows, multiple particle systems, multiple animated characters, and decals. This was intended to be a performance-heavy scene that challenges even the best mobile GPUs.
This scene is unoptimized for mobile devices and does not represent a typical mobile scene. However, we decided to profile with it anyway for 3 reasons:
A core part of the Godot philosophy is to avoid performance cliffs. Testing on a scene like this helps us cover a wide range of potential performance problems and ensures that our worst-case performance is as good as possible, even if the scene itself may never reach acceptable framerates.
All the changes mentioned in this post are actual changes that have been merged into Godot. You can see them on our GitHub page. As before, we will start with a specific commit hash. In this case it will be e2c6daf7eff6e0b7e2e8d967e95a9ad56e948231, the first commit to introduce our “ubershader” system (I will explain what that means below). The reason for starting from a slightly older base commit is that it is one of the latest commits on top of which the other changes apply cleanly.
Godot’s codebase changes very quickly, so it can be challenging to pull out commits and view them in isolation. By starting from a bit further back, it makes it easier to see the impact of individual changes.
If you want to validate the findings of this post, you can build Godot yourself by following the Android build instructions in the official Godot documentation.
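For reference, compiling the Android export templates looks roughly like this, based on those instructions (the exact targets, architectures, and Gradle steps depend on your Godot version and setup):

scons platform=android target=template_release arch=arm64
cd platform/android/java
./gradlew generateGodotTemplates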
To measure performance, I used a Pixel 8 Pro provided to me by Arm for this purpose.
Running our demo on the device results in a frame rate of about 33 FPS (or about 30 milliseconds per frame).
Already things look quite bad. I expected that the Pixel 8 Pro would have no problem with this scene (despite my comments above). To understand why this is happening, let’s start by looking at the Performance Advisor report.
It’s no surprise that fragment cost still dominates the frame, given how much is going on. Now let’s look at our Streamline capture to see what further details we can learn.

You can see right away that a little over half of the frame is spent exclusively on fragment shader work. Further, we can see that during the fragment shader stage we have a high number of arithmetic and load/store operations. This shows us that we are doing too much work in the fragment shader.
Clearly, we need to do less work in the fragment shader in our main pass. Godot uses one template shader to do most 3D rendering. When a user authors a GDShader, it gets inserted into the middle of our template shader. This template shader needs to support every feature and thus it can be quite big. Further, the shader needs to dynamically switch between features at run time.
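To make this concrete, here is a minimal user-authored GDShader (the values are just illustrative). Even a shader this small ends up embedded in the middle of the full template shader:

shader_type spatial;

void fragment() {
    // Only these two lines are user code; the surrounding template
    // supplies lighting, shadows, fog, and every other engine feature.
    ALBEDO = vec3(0.8, 0.1, 0.1);
    ROUGHNESS = 0.4;
}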
Because Godot is so flexible, such a template shader can be difficult to optimize. It needs to support a lot of features to keep things simple on the user side, but we also want it to run as fast as possible.
To learn a bit more, we are going to turn to another tool, the Mali Offline Compiler.
The Mali Offline Compiler is an amazing tool to have in your toolbelt. It can be run without a device connected, which makes it very convenient. Not only does it provide a host of useful statistics, it also makes iteration fast: you can get a quick sense of the performance impact of a change without having to recompile your code and transfer it to the device.
For this sample, I am going to take the fragment shader from one of the structures in the scene. It doesn’t really matter which shader I choose, since they all use the same template scene shader, and it’s that template shader that we want to optimize today.
To retrieve the shader from Godot, we select the material in the inspector, then select “Inspect Native Shader Code”.
Doing so will open a popup menu where you can select either the vertex or fragment shader and which shader variant you would like to use. For this analysis, I selected the first shader variant of the fragment shader since it is our default opaque color variant.
The Mali Offline Compiler is a command-line tool, so the next step is to copy that shader into a local file and run the compiler on it. I used the command:
./malioc --vulkan --fragment old.frag
Main shader
===========
Work registers: 64 (100% used at 50% occupancy)
Uniform registers: 128 (100% used)
Stack spilling: 42 bytes
16-bit arithmetic: 0%

                              A       LS      V      T    Bound
Total instruction cycles: 10.94   124.00   0.22   2.00       LS
Shortest path cycles:      1.47    25.00   0.22   0.38       LS
Longest path cycles:        N/A      N/A    N/A    N/A      N/A

A = Arithmetic, LS = Load/Store, V = Varying, T = Texture

Shader properties
=================
Has uniform computation: true
Has side-effects: false
Modifies coverage: false
Uses late ZS test: false
Uses late ZS update: false
Reads color buffer: false
The number of Load/Store operations was unexpectedly high, even in the shortest path. In practice, we expect to follow a path that is close to the shortest path, so seeing such high numbers there was shocking. Ideally the values would all be in the single digits for a shader like this.
The next step in our analysis was to enable and disable different code paths and evaluate their impact on ALUs and Load/Store operations.
As expected, the high number of dynamic branches and the dynamic loops for lights are contributing to a high number of fragment shader operations, especially Load/Store operations. These control-flow paths are currently selected by uniform values, which do not change within a single draw call. Because these values are already known when the pipeline is created, they are good candidates for replacement by specialization constants.
Specialization constants are a useful tool for optimizing pipelines. They allow you to supply constant values to already-compiled shader modules at pipeline creation time, which the GPU driver can use to further optimize the shader code inside the pipeline (for example, by pruning dead branches or unrolling loops). You have to be careful when relying on specialization constants, as they do require the pipeline to go through an optimization step again, and that isn’t free. Overuse can lead to load-time issues and stuttering at run time. Luckily, we have a fantastic ubershader system that compiles one version of the shader with all dynamic values, which can be used as a fallback while the optimized versions are compiled in the background. You can read more about the ubershader technique in the Godot documentation here.
By replacing complex branches and loops with specialization constants, the GPU driver can optimize out a lot of code that it was previously unable to. Additionally, for loops, it can make more informed decisions about whether to unroll.
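Here is a minimal Vulkan GLSL sketch of the idea. The constant names and the stubbed helper functions are purely illustrative, not Godot’s actual identifiers:

#version 450

// Specialization constants: the real values are supplied at pipeline
// creation time, so the driver sees them as compile-time constants.
layout(constant_id = 0) const bool sc_use_shadows = false;
layout(constant_id = 1) const uint sc_omni_light_count = 0u;

layout(location = 0) out vec4 frag_color;

// Stubs standing in for the real lighting and shadow helpers.
vec3 shade_omni_light(uint i) { return vec3(0.1); }
float sample_shadow() { return 1.0; }

void main() {
    vec3 color = vec3(0.0);
    // A constant trip count lets the compiler decide whether to unroll.
    for (uint i = 0u; i < sc_omni_light_count; i++) {
        color += shade_omni_light(i);
    }
    // When sc_use_shadows is specialized to false, this branch and the
    // work behind it are removed from the pipeline entirely.
    if (sc_use_shadows) {
        color *= sample_shadow();
    }
    frag_color = vec4(color, 1.0);
}

Because the driver knows these values when it builds the pipeline, an untaken branch costs nothing at run time, unlike a uniform-driven branch that every fragment must evaluate.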
Another change that yielded great results was skipping light and shadow calculations when the distance or angle attenuation of a light is very low. This optimization can have a significant impact on OmniLight3Ds with a long range, and on SpotLight3Ds. We use a traditional forward light list in the mobile renderer, pairing lights with objects using Axis-Aligned Bounding Boxes (AABBs). Since lights are not boxes, it is surprisingly common for a light to be paired with an object while contributing little or nothing to the final image. With this optimization, those lights are skipped in the fragment shader, and we avoid doing any expensive shadow sampling or lighting calculations. You can read the source of the change here.
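As a rough sketch of what this early-out looks like inside the per-light loop (again with illustrative names rather than Godot’s actual code):

#version 450

layout(constant_id = 0) const uint sc_omni_light_count = 4u;

layout(location = 0) out vec4 frag_color;

// Stubs standing in for the real helpers (hypothetical names).
float omni_attenuation(uint i) { return 0.5; } // distance/angle falloff
float sample_shadow(uint i) { return 1.0; }    // expensive shadow-map taps
vec3 light_brdf(uint i) { return vec3(0.2); }  // lighting calculations

void main() {
    vec3 light = vec3(0.0);
    for (uint i = 0u; i < sc_omni_light_count; i++) {
        float att = omni_attenuation(i);
        // An AABB-paired light can still contribute almost nothing to
        // this fragment: skip its shadow sampling and lighting entirely.
        if (att < 1e-4) {
            continue;
        }
        light += att * sample_shadow(i) * light_brdf(i);
    }
    frag_color = vec4(light, 1.0);
}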
Taking another look at the Mali Offline Compiler, we see that our optimizations indeed made a huge difference.
Main shader
===========
Work registers: 64 (100% used at 50% occupancy)
Uniform registers: 128 (100% used)
Stack spilling: false
16-bit arithmetic: 1%

                              A       LS      V      T    Bound
Total instruction cycles:  6.95    15.00   0.22   1.38       LS
Shortest path cycles:      1.54     0.00   0.22   0.38        A
Longest path cycles:       6.06    21.00   0.22   1.38       LS

A = Arithmetic, LS = Load/Store, V = Varying, T = Texture

Shader properties
=================
Has uniform computation: true
Has side-effects: false
Modifies coverage: false
Uses late ZS test: false
Uses late ZS update: false
Reads color buffer: false
Note especially that there is no more stack spilling. Importantly, our Load/Store operations are now down to a much more reasonable level. In practice, it is unclear whether we are Load/Store or ALU bound now since each invocation of the shader will lie somewhere between the shortest path and the longest path.
Now that we have validated these changes in the Mali Offline Compiler, we can test them out on device.
With these optimizations, the demo now runs at about 43 FPS (23 milliseconds per frame). That’s approximately a 7 ms improvement without any content changes! All users will benefit from this improvement to varying degrees.
Looking again at the Performance Advisor confirms the improvement but also confirms that we are still fragment shader bound and have a way to go to meet our performance goals.
All these changes shipped in Godot 4 and are running in real games today! Since then, we have continued to optimize our Vulkan renderer with the goal of making it as fast as possible.
Based on our profiling, we identified a number of areas that could still be improved.
As a sneak peek, here is the output of the Mali Offline Compiler in Godot 4.5-beta1 for the same shader analyzed previously:

Main shader
===========
Work registers: 59 (92% used at 50% occupancy)
Uniform registers: 128 (100% used)
Stack spilling: false
16-bit arithmetic: 50%

                              A       LS      V      T    Bound
Total instruction cycles:  5.59    11.00   0.25   1.00       LS
Shortest path cycles:      1.23     0.00   0.25   0.38        A
Longest path cycles:       4.81     4.00   0.25   1.00        A

A = Arithmetic, LS = Load/Store, V = Varying, T = Texture

Shader properties
=================
Has uniform computation: true
Has side-effects: false
Modifies coverage: false
Uses late ZS test: false
Uses late ZS update: false
Reads color buffer: false
Clearly the results are even better. But we still have more room to improve!
This covers a few of the major optimizations that we have worked on recently. There have been many more, and there will be more to come as work continues. Feel free to follow development on GitHub if you want to see what else we are up to.