We are very excited to announce our latest Frame Advisor enhancement, which gives mobile developers a new way to identify inefficient shader programs used in a scene. We have made a few other enhancements to the Arm Performance Studio suite too, so let’s look at what’s new.
Shaders are one of the critical inputs that a developer provides from the point of view of performance. Whether your goal is to increase FPS, reduce memory bandwidth or free up power budget, the ability to understand and optimize shaders is a powerful tool. By reducing complexity, and ensuring that shader programs are efficient, you can make significant cost savings.
If you are familiar with Mali Offline Compiler, you will understand how useful it can be to get shader program statistics, such as cycle cost, register usage, and the precision used for arithmetic operations, when your shader program runs on an Arm-based GPU. In this release, Frame Advisor reports the following metrics for all the shader programs used in your captured frame:
The metrics are shown in tabular format in the Programs view. If you select the frame in the Frame Hierarchy view, you can see all the shaders used in the frame. You can then sort the table columns by each metric, to find the most expensive shaders.
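The same sort-and-inspect workflow can be sketched outside the tool. The following is a minimal illustration, assuming you have copied a few rows from the Programs view into a list by hand. The program names, field names, and values are all hypothetical and do not reflect any export format provided by Frame Advisor.

```python
# Hypothetical rows, as if transcribed by hand from the Programs view.
# Names, keys, and values are made up for illustration only.
programs = [
    {"name": "program_3", "longest_path_cycles": 2.0, "work_registers": 24},
    {"name": "program_7", "longest_path_cycles": 6.5, "work_registers": 48},
    {"name": "program_12", "longest_path_cycles": 4.5, "work_registers": 32},
]


def most_expensive(rows, metric, top_n=3):
    """Return the top_n rows with the largest value for the given metric."""
    return sorted(rows, key=lambda row: row[metric], reverse=True)[:top_n]


# List the two programs with the highest longest-path cycle cost.
for row in most_expensive(programs, "longest_path_cycles", top_n=2):
    print(row["name"], row["longest_path_cycles"])
```

Sorting by a different key, such as `work_registers`, gives a different ranking, which is why it helps to look at each metric in turn.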
Note that for Vulkan applications, the metrics are generated without using the application pipeline state, so the numbers might not exactly match those from the runtime driver.
The Programs view table includes an approximate cycle cost breakdown for the major functional units in the shader core: the arithmetic unit, the load/store unit, the varying unit, and the texture unit. Look for the functional unit with the highest cycle cost in the shortest and longest path cycles. Consider how you could optimize the shader to reduce cost for that functional unit first.
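As a rough illustration of that first step, the sketch below picks the functional unit with the highest cycle cost on each path. The dictionary keys and cycle values are hypothetical, chosen only to show the comparison.

```python
# The four functional units named in the cycle cost breakdown.
# The key spellings are an assumption made for this sketch.
UNITS = ("arithmetic", "load_store", "varying", "texture")


def dominant_unit(cycles):
    """Return (unit, cost) for the functional unit with the highest cycle cost."""
    unit = max(UNITS, key=lambda name: cycles[name])
    return unit, cycles[unit]


# Hypothetical per-thread cycle costs for one shader program.
longest_path = {"arithmetic": 1.2, "load_store": 4.0, "varying": 0.5, "texture": 2.0}
shortest_path = {"arithmetic": 0.9, "load_store": 1.5, "varying": 0.5, "texture": 1.0}

print("longest path:", dominant_unit(longest_path))
print("shortest path:", dominant_unit(shortest_path))
```

In this made-up case the load/store unit dominates both paths, so memory accesses would be the first thing to look at.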
Note that the table shows the highest cost per thread. Frame Advisor does not know how many threads were executed in total, so this number does not represent the total shader cost per frame.
Occupancy shows the maximum number of threads that can run concurrently, as a percentage of shader core capacity. For example, if a shader core can physically run a maximum of 2048 threads, a shader with 50% occupancy can only run 1024 threads at a time.
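The arithmetic in that example can be written out as a small helper. The function name is hypothetical, and the 2048-thread capacity is simply the figure used in the example above, not a property of any particular GPU.

```python
def max_resident_threads(core_capacity, occupancy_percent):
    """Return how many threads can run at once at the given occupancy."""
    return core_capacity * occupancy_percent // 100


# The example from the text: a 2048-thread shader core at 50% occupancy.
print(max_resident_threads(2048, 50))
print(max_resident_threads(2048, 100))
```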
Work registers are general purpose read-write registers that are allocated to each running thread. The available physical register pool is divided among the shader threads that are executing. Therefore, reducing work register usage can increase the number of threads that can be executed simultaneously. This helps to keep the GPU busy because there are more threads to choose from when some threads are stalled waiting for data to load from memory. Look for shaders below 100% occupancy, and optimize them to use fewer work registers. The most effective way to do this is to reduce variable precision from 32-bit to 16-bit, which enables the GPU to store twice as many variables per register.
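To see why 16-bit variables ease register pressure, here is a back-of-the-envelope sketch. It assumes a 32-bit work register that holds either one 32-bit value or two 16-bit values, and it ignores everything a real compiler does (vectorization, variable lifetimes, alignment), so treat the numbers as illustrative only.

```python
import math

# Assumption for this sketch: one work register holds 32 bits of data.
REGISTER_BITS = 32


def registers_needed(scalar_count, bits_per_scalar):
    """Estimate registers needed to hold scalar_count live scalar values."""
    scalars_per_register = REGISTER_BITS // bits_per_scalar
    return math.ceil(scalar_count / scalars_per_register)


# Twenty live scalar values at 32-bit and at 16-bit precision.
print(registers_needed(20, 32))
print(registers_needed(20, 16))
```

Under these assumptions, halving the precision halves the estimated register count, which is the effect that can lift a shader back to a higher occupancy.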
Uniform registers are read-only registers that are allocated to each running program. They are used to store uniform and literal constant values. Shaders that run out of uniform storage need to fall back to per-thread memory loads for additional values. Look for shaders that use the highest numbers of uniform registers and aim to reduce them by using 16-bit data types, or by reducing the number of uniforms and constants in the shader program.
Stack is a form of thread local storage that is used by compiler-generated memory allocations and register spills. The stack spilling metric shows the number of bytes of data that are spilled to the stack. Stack spills are very expensive for a GPU to process because of the high number of running threads.
To prevent stack spilling, try reducing register pressure by:
The FP16 usage column in the Programs view reports the percentage of arithmetic operations that are performed at 16-bit precision or lower for each shader program. A higher number here is better, because 16-bit operations can run up to twice as fast as 32-bit operations. Sort the table by the lowest use of FP16 to find shaders that could be optimized by reducing precision.
Reducing precision to 16-bit can double performance and reduce both energy consumption and register pressure. There are situations where 32-bit precision is always required, such as for position and depth calculations. However, for many mobile use cases, there is no noticeable difference on-screen when precision is reduced to 16-bit.
Tip: To reduce precision to 16-bit for OpenGL ES applications, set precision to mediump. For Vulkan, use RelaxedPrecision. Alternatively, use explicit 16-bit type extensions to set precision.
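For OpenGL ES, the simplest form of this tip is a default precision statement at the top of the fragment shader. The sketch below holds a small GLSL ES 3.0 fragment shader as a Python string, as it might be passed to a GL binding, and uses a helper to report the declared default float precision. The shader, its variable names, and the helper function are all hypothetical.

```python
import re

# A minimal GLSL ES 3.0 fragment shader. The uniform, input, and output
# names are made up for this sketch.
FRAGMENT_SHADER = """#version 300 es
precision mediump float;   // default float precision for this shader

uniform sampler2D u_texture;
in vec2 v_uv;
out vec4 frag_color;

void main() {
    frag_color = texture(u_texture, v_uv);
}
"""


def default_float_precision(source):
    """Return the default float precision declared in GLSL ES source, or None."""
    pattern = r"^\s*precision\s+(lowp|mediump|highp)\s+float\s*;"
    match = re.search(pattern, source, re.MULTILINE)
    return match.group(1) if match else None


print(default_float_precision(FRAGMENT_SHADER))
```

Individual variables that need full precision, such as those used in position calculations, can still be declared `highp` where the target supports it.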
To view the source code for a shader program, double-click anywhere along a row to open Frame Advisor’s Source view.
Before capturing a trace, you need to set the path to the Mali Offline Compiler directory so that Frame Advisor can generate and display metrics for shaders in the Programs view. To do this, select Configure > Open preferences in the Frame Advisor menu, click Browse, and select the mali_offline_compiler directory in the Arm Performance Studio install directory.
In addition to the shader analysis features in Frame Advisor, we’ve made a few other changes to the tools in Arm Performance Studio. Here is a roundup.
In Streamline, we have added support for the latest Cortex and Neoverse CPUs:
The streamline_me.py connection script now installs both the OpenGL ES and Vulkan API layers at the same time. This is now the default behavior, so there’s no longer any need to use the --lwi-api=vulkan option to specify the Vulkan API. The --lwi-api option now has a new default value of all.
Earlier this year, we began adding Arm extensions to RenderDoc, the popular open-source graphics API debugger. We ship this version of RenderDoc with Arm Performance Studio, and we will be upstreaming the majority of our changes. You can see the full list of extensions on our new RenderDoc for Arm GPUs web page.
Additionally, we have created a new User Guide to help you learn how to capture frames from Android applications with RenderDoc.
Note that Graphics Analyzer is no longer part of the Arm Performance Studio tools suite. We recommend using Frame Advisor and RenderDoc for performance profiling and graphics debugging. If you prefer to use Graphics Analyzer, it is still available in earlier versions of Arm Performance Studio. The latest version that included Graphics Analyzer is Arm Performance Studio 2024.2.
In Streamline, we have removed support for energy profiling using a hardware energy probe, for example the Arm Energy Probe or a National Instruments DAQ.
The teams behind Arm Performance Studio are always looking for ways to improve our tools and prioritize the features that matter to you. If you would like to help shape the future of Arm Performance Studio and the free tools within it, join our exclusive user panel here.
If you have not already done so, try out the new features in Frame Advisor by installing the latest version of Arm Performance Studio.
Download Arm Performance Studio
Hey, i wonder what's the data source of "Shader analysis" data in PA? Is it based on offline compiler or comes from real device's counters? Very useful feature btw.