Shader analysis and more in Arm Performance Studio 2024.4

October 2, 2024


We are very excited to announce our latest Frame Advisor enhancement, which gives mobile developers a new way to identify inefficient shader programs used in a scene. We have made a few other enhancements to the Arm Performance Studio suite too, so let’s look at what’s new.

Shader analysis with Frame Advisor

From a performance point of view, shaders are among the most critical inputs that a developer provides. Whether your goal is to increase FPS, reduce memory bandwidth, or free up power budget, the ability to understand and optimize shaders is a powerful tool. By reducing complexity and ensuring that shader programs are efficient, you can make significant cost savings.

If you are familiar with Mali Offline Compiler, you will understand how useful it can be to get shader program statistics, such as cycle cost, register usage, and the precision used for arithmetic operations, when your shader program runs on an Arm-based GPU. In this release, Frame Advisor reports the following metrics for all the shader programs used in your captured frame:

Cycle cost for each functional unit
Number of work registers used, and the corresponding impact on shader core thread occupancy
Number of uniform registers used
Number of bytes of stack spills
Percentage of arithmetic operations performed at 16-bit precision, or lower

A screenshot of Frame Advisor filter by render pass.

The metrics are shown in tabular format in the Programs view. If you select the frame in the Frame Hierarchy view, you can see all the shaders used in the frame. You can then sort the table columns by each metric, to find the most expensive shaders.

Note that for Vulkan applications, the metrics are generated without using the application pipeline state, so the numbers might not exactly match those produced by the runtime driver.

Cycle cost

The Programs view table includes an approximate cycle cost breakdown for the major functional units in the shader core: the arithmetic unit, the load/store unit, the varying unit, and the texture unit. Look for the functional unit with the highest cycle cost in the shortest and longest path cycles, and consider how you could optimize the shader to reduce the cost for that functional unit first.

Shortest path: An estimate of the number of cycles for the shortest control flow path through the shader program.
Longest path: An estimate of the number of cycles for the longest control flow path through the shader program.
Total emitted: The cumulative number of cycles for all instructions that are generated for the program, irrespective of program control flow.

Note that the table shows the highest cost per thread. However, Frame Advisor does not know how many threads were executed in total, therefore this number does not represent the total shader cost per frame.
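
As a rough illustration of that triage, the sketch below picks the functional unit with the highest estimated cost on each path. The dictionary keys, the cycle values, and the bottleneck_unit helper are all hypothetical and are not part of Frame Advisor. In practice you would read these numbers from the Programs view.

```python
from typing import Dict

# Hypothetical per-thread cycle estimates for one shader program.
# The keys mirror the functional units named above; the values are made up.
CycleCosts = Dict[str, float]


def bottleneck_unit(costs: CycleCosts) -> str:
    """Return the functional unit with the highest estimated cycle cost."""
    return max(costs, key=lambda unit: costs[unit])


longest_path = {"arithmetic": 2.4, "load_store": 1.0, "varying": 0.5, "texture": 3.0}
shortest_path = {"arithmetic": 1.1, "load_store": 1.0, "varying": 0.5, "texture": 0.5}

print(bottleneck_unit(longest_path))   # texture
print(bottleneck_unit(shortest_path))  # arithmetic
```

Because these are per-thread estimates, turning them into a per-frame cost would also need a thread count, which Frame Advisor does not have.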

Register usage

Occupancy shows the maximum possible number of threads as a percentage of shader core capacity. For example, if a shader core can physically run a maximum of 2048 threads, a shader with 50% occupancy can only run 1024 threads at a time.
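
The arithmetic behind that example can be sketched as follows. The function name is made up for illustration, and the 2048-thread capacity is just the figure used in the example, not a property of any particular GPU.

```python
def max_resident_threads(core_capacity: int, occupancy_percent: float) -> int:
    """Threads that can run at once, given core capacity and shader occupancy."""
    return int(core_capacity * occupancy_percent / 100)


print(max_resident_threads(2048, 100))  # 2048
print(max_resident_threads(2048, 50))   # 1024
```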

Work registers are general purpose read-write registers that are allocated to each running thread. The available physical register pool is divided among the shader threads that are executing. Therefore, reducing work register usage can increase the number of threads that can be executed simultaneously. This helps to keep the GPU busy because there are more threads to choose from when some threads are stalled waiting for data to load from memory. Look for shaders below 100% occupancy, and optimize them to use fewer work registers. The most effective way to do this is to reduce variable precision from 32-bit to 16-bit, which enables the GPU to store twice as many variables per register.
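
To see why halving precision helps, here is a deliberately simplified model that assumes 32-bit-wide registers and ideal packing of scalar values. Both are assumptions made for this sketch. A real compiler also has to deal with temporaries, vector alignment, and values that must stay at 32-bit, so treat the result as a best case, not a prediction.

```python
import math

REGISTER_BITS = 32  # assumption: one register holds one 32-bit value


def registers_needed(num_scalars: int, bits_per_scalar: int) -> int:
    """Registers required to hold the scalars, assuming ideal packing."""
    return math.ceil(num_scalars * bits_per_scalar / REGISTER_BITS)


print(registers_needed(40, 32))  # 40
print(registers_needed(40, 16))  # 20
```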

Uniform registers are read-only registers that are allocated to each running program. They are used to store uniform and literal constant values. Shaders that run out of uniform storage need to fall back to per-thread memory loads for additional values. Look for shaders that use the highest numbers of uniform registers and aim to reduce them by using 16-bit data types, or by reducing the number of uniforms and constants in the shader program.

Stack spills

Stack is a form of thread local storage that is used by compiler-generated memory allocations and register spills. The stack spilling metric shows the number of bytes of data that are spilled to the stack. Stack spills are very expensive for a GPU to process because of the high number of running threads.

To prevent stack spills, try reducing register pressure by:

Reducing variable precision
Reducing the live ranges of variables
Simplifying the shader program

Precision of computation

The FP16 usage column in the Programs view reports the percentage of arithmetic operations that are performed at 16-bit precision or lower for each shader program. A higher number here is better, because 16-bit precision is twice as fast as 32-bit precision. Sort the table by the lowest use of FP16 to find shaders that could be optimized by reducing precision.
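
The following sketch mimics that sort on made-up data. The program names and operation counts are hypothetical. The percentage formula is an assumption based on the column description above, not a statement of how Frame Advisor computes it internally.

```python
from typing import NamedTuple


class ProgramStats(NamedTuple):
    name: str        # hypothetical program label
    ops_16bit: int   # arithmetic operations at 16-bit precision or lower
    ops_total: int   # all arithmetic operations

    @property
    def fp16_percent(self) -> float:
        """Share of arithmetic operations at 16-bit precision or lower."""
        if self.ops_total == 0:
            return 0.0
        return 100.0 * self.ops_16bit / self.ops_total


programs = [
    ProgramStats("program_a", 30, 40),
    ProgramStats("program_b", 0, 25),
    ProgramStats("program_c", 12, 48),
]

# Lowest FP16 usage first: these are the candidates for precision reduction.
by_lowest_fp16 = sorted(programs, key=lambda p: p.fp16_percent)
for p in by_lowest_fp16:
    print(f"{p.name}: {p.fp16_percent:.1f}%")
```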

Reducing precision to 16-bit can double performance, and it reduces both energy consumption and register pressure. There are situations where 32-bit precision is always required, such as for position and depth calculations. However, for many mobile use cases, there is no noticeable difference on-screen when precision is reduced to 16-bit.

Tip: To reduce precision to 16-bit for OpenGL ES applications, set precision to mediump. For Vulkan, use RelaxedPrecision. Alternatively, use explicit 16-bit type extensions to set precision.
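
For OpenGL ES, a minimal sketch of what this looks like is shown below, with the GLSL ES source held in a Python string as a build or test script might store it. The variable names (u_texture, v_uv, frag_color) are hypothetical. Whether mediump is safe for a given calculation depends on your content, so check the result visually.

```python
# GLSL ES 3.00 fragment shader that defaults float arithmetic to mediump.
# All identifiers inside the shader are illustrative, not from the article.
FRAGMENT_SHADER_SOURCE = """\
#version 300 es
precision mediump float;          // default precision for float types

uniform mediump sampler2D u_texture;
in mediump vec2 v_uv;             // texture coordinate from the vertex shader
out mediump vec4 frag_color;

void main() {
    frag_color = texture(u_texture, v_uv);
}
"""

print(FRAGMENT_SHADER_SOURCE)
```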

View shader source code

To view the source code for a shader program, double-click anywhere along a row to open Frame Advisor’s Source view.

A screenshot of Frame Advisor viewing the shader source code.

Enabling shader metrics in Frame Advisor

Before capturing a trace, you need to set the path to the Mali Offline Compiler directory so that Frame Advisor can generate and display metrics for shaders in the Programs view.
To do this, select Configure > Open preferences in the Frame Advisor menu, click Browse, and select the mali_offline_compiler directory in the Arm Performance Studio install directory.

More enhancements in Arm Performance Studio

In addition to the shader analysis features in Frame Advisor, we’ve made a few other changes to the tools in Arm Performance Studio. Here is a roundup.

Streamline enhancements

In Streamline, we have added support for the latest Cortex and Neoverse CPUs:

Cortex-A520AE
Cortex-A720AE
Cortex-R52AE
Neoverse V3AE

The streamline_me.py connection script now installs both the OpenGL ES and Vulkan API layers at the same time. This is now the default behavior, so there’s no longer any need to use the --lwi-api=vulkan option to specify the Vulkan API. The --lwi-api option now has a new default value of all.

Debug Android applications with RenderDoc for Arm GPUs

Earlier this year, we began adding Arm extensions to RenderDoc, the popular open-source graphics API debugger. We ship this version of RenderDoc with Arm Performance Studio, and we will be upstreaming the majority of our changes. You can see the full list of extensions on our new RenderDoc for Arm GPUs web page.

Additionally, we have created a new User Guide to help you learn how to capture frames from Android applications with RenderDoc.

Deprecated functionality

Note that Graphics Analyzer is no longer part of the Arm Performance Studio tools suite. We recommend using Frame Advisor and RenderDoc for performance profiling and graphics debugging. If you prefer to use Graphics Analyzer, it is still available in earlier versions of Arm Performance Studio. The latest version that included Graphics Analyzer is Arm Performance Studio 2024.2.

In Streamline, we have removed support for energy profiling using a hardware energy probe, for example the Arm Energy Probe or a National Instruments DAQ.

Join our user panel

The teams behind Arm Performance Studio are always looking for ways to improve our tools and prioritize the features that matter to you. If you would like to help shape the future of Arm Performance Studio and the free tools within it, join our exclusive user panel here.

Get the latest version of Arm Performance Studio

If you have not already done so, try out the new features available with Frame Advisor by installing the latest version of Arm Performance Studio.

Download Arm Performance Studio

Peter Harris 10 months ago in reply to sindney

The Frame Advisor shader analysis is from the Mali Offline Compiler, compiling for the GPU in the device you captured from.
sindney over 1 year ago

Hey, i wonder what's the data source of "Shader analysis" data in PA? Is it based on offline compiler or comes from real device's counters? Very useful feature btw.
