Shader programs for OpenGL ES and Vulkan are one of the most important inputs an application provides to render a scene because they define the processing operations executed by the GPU shader core hardware. They are also one of the hardest aspects of the system to profile because the details of the generated program binary, and how it executes on the target hardware, are hidden below the level of the API and are therefore difficult for developers to interpret.
The Mali Offline Compiler, included in the Arm Mobile Studio 2019.2 suite of Android performance analysis tools, is a host tool that provides offline performance analysis for shader programs and OpenCL compute kernels. It gives easy visibility of the expected performance and the likely performance bottlenecks on any of the available Mali GPU targets. It also provides full syntax checking and error reporting for the shader source code. Using an offline tool for these activities enables efficient development iteration, validating source code fixes and optimizations without needing the whole application to be built and executed on the target device.
The new Mali Offline Compiler 7.0 release includes support for the latest compiler back-end for the Mali Bifrost family of GPUs, adding support for the Mali-G52 and Mali-G76 products, and a new instruction cost model which provides much improved data accuracy for all of the Mali Bifrost GPUs.
The following shader implements the horizontal pass of a 5-tap separable Gaussian blur, with an optional tone mapping stage implemented using a matrix multiply:
    #version 310 es
    #define WINDOW_SIZE 5

    precision highp float;
    precision highp sampler2D;

    uniform bool toneMap;
    uniform sampler2D texUnit;
    uniform mat4 colorModulation;
    uniform float gaussOffsets[WINDOW_SIZE];
    uniform float gaussWeights[WINDOW_SIZE];

    in vec2 texCoord;
    out vec4 fragColor;

    void main() {
        fragColor = vec4(0.0);

        // For each gaussian sample
        for (int i = 0; i < WINDOW_SIZE; i++) {
            // Create sample texture coord
            vec2 offsetTexCoord = texCoord + vec2(gaussOffsets[i], 0.0);

            // Load data and perform tone mapping
            vec4 data = texture(texUnit, offsetTexCoord);
            if (toneMap) {
                data *= colorModulation;
            }

            // Accumulate result
            fragColor += data * gaussWeights[i];
        }
    }
Compiling this shader for Mali-G76 generates the following performance report:
    Mali Offline Compiler v7.0.0 (Build bc7a3e)
    Copyright 2007-2019 Arm Limited, all rights reserved

    Configuration
    =============

    Hardware: Mali-G76 r0p0
    Driver: Bifrost r19p0-00rel0
    Shader type: OpenGL ES Fragment

    Main shader
    ===========

    Work registers: 32
    Uniform registers: 34
    Stack spilling: False

                                     A      LS       V       T   Bound
    Total Instruction Cycles:      4.5     0.0     0.2     2.5       A
    Shortest Path Cycles:          1.0     0.0     0.2     2.5       T
    Longest Path Cycles:           4.5     0.0     0.2     2.5       A

    A = Arithmetic, LS = Load/Store, V = Varying, T = Texture

    Shader properties
    =================

    Uniform computation: False
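The most interesting part of this report is the performance table for the Main shader, which gives an approximate cycle cost breakdown for the major functional units in the design. For this shader we can see that the longest path is bound by arithmetic (A) at 4.5 cycles, well above the 2.5 cycles needed for texturing (T), while the shortest path is bound by texturing.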
The hardware units run in parallel, so identifying which unit is on the critical path is the first step in optimization, as it tells you which part of your shader code you should look to optimize.
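As a rough illustration of that first step, the bound unit for each row of the report is simply the functional unit with the highest cycle count. The Python sketch below hard-codes the values from the Mali-G76 report above; the `bound_unit` helper and the dictionary layout are illustrative assumptions, not part of the Mali Offline Compiler.

```python
# Illustrative only: cycle counts copied by hand from the Mali-G76 report.
REPORT = {
    "Total Instruction Cycles": {"A": 4.5, "LS": 0.0, "V": 0.2, "T": 2.5},
    "Shortest Path Cycles": {"A": 1.0, "LS": 0.0, "V": 0.2, "T": 2.5},
    "Longest Path Cycles": {"A": 4.5, "LS": 0.0, "V": 0.2, "T": 2.5},
}


def bound_unit(cycles):
    """Return (unit, cycles) for the functional unit with the highest cost."""
    unit = max(cycles, key=cycles.get)
    return unit, cycles[unit]


if __name__ == "__main__":
    for row, cycles in REPORT.items():
        unit, cost = bound_unit(cycles)
        print(f"{row}: bound by {unit} at {cost} cycles")
```

The result should match the "Bound" column that the tool already prints; the point of the sketch is only to make the rule behind that column explicit.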
For full details of all of the reported sections and fields please see the Mali Offline Compiler User Guide.
Now that we know our critical path, let’s look at what can be done to make the tone mapping part of this shader faster. The first change we should make is to reduce precision. Currently the tone mapping is using a highp (fp32) matrix operation, which has much more precision than we need for generating an 8-bit per channel color output. We therefore drop to “mediump” (fp16) float and sampler precision by changing these two lines at the top of the shader:
    precision mediump float;
    precision mediump sampler2D;
Just these two simple changes significantly reduce the cost of the longest path, as Mali GPUs can process twice as many fp16 operations per clock as fp32 operations.
                                     A      LS       V       T   Bound
    Longest Path Cycles:           2.7     0.0     0.2     2.5       A
However, despite this change arithmetic is still our longest path. One final change would be to move the tone mapping out of the accumulation loop, applying it to the final color rather than the individual samples. This gives the final shader structure shown below:
    // For each gaussian sample
    for (int i = 0; i < WINDOW_SIZE; i++) {
        vec2 offsetTexCoord = texCoord + vec2(gaussOffsets[i], 0.0);
        vec4 data = texture(texUnit, offsetTexCoord);
        fragColor += data * gaussWeights[i];
    }

    // Tone map the final color
    if (toneMap) {
        fragColor *= colorModulation;
    }
This reduces the arithmetic cost of the longest path to just a single shader core cycle, even if tone mapping is used. The slowest path is now texturing, which needs 2.5 cycles per fragment to load the 5 samples needed. We cannot make this any faster; this is the architectural performance of the shader core.
                                     A      LS       V       T   Bound
    Total Instruction Cycles:      1.0     0.0     0.2     2.5       T
    Shortest Path Cycles:          0.5     0.0     0.2     2.5       T
    Longest Path Cycles:           1.0     0.0     0.2     2.5       T
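Moving the tone mapping out of the loop is safe because the matrix multiply is linear: weighting and summing the samples first and then transforming once gives the same color as transforming every sample, up to floating-point rounding. The Python sketch below checks this numerically. The sample colors, weights, and matrix are made-up values, and the helper names are illustrative rather than anything taken from the shader.

```python
import math


def vec_mat(v, m):
    """Row vector times 4x4 matrix, like GLSL `v * m` (m is a list of columns)."""
    return [sum(v[r] * m[c][r] for r in range(4)) for c in range(4)]


def blur_tone_map_per_sample(samples, weights, m):
    """Original structure: tone map every sample inside the loop."""
    acc = [0.0] * 4
    for sample, weight in zip(samples, weights):
        mapped = vec_mat(sample, m)
        acc = [a + x * weight for a, x in zip(acc, mapped)]
    return acc


def blur_tone_map_once(samples, weights, m):
    """Optimized structure: accumulate first, tone map the final color once."""
    acc = [0.0] * 4
    for sample, weight in zip(samples, weights):
        acc = [a + x * weight for a, x in zip(acc, sample)]
    return vec_mat(acc, m)


def colors_match(a, b):
    """Compare two colors component by component, allowing for rounding."""
    return all(math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12) for x, y in zip(a, b))


# Made-up inputs for illustration only.
WEIGHTS = [0.0625, 0.25, 0.375, 0.25, 0.0625]
SAMPLES = [[0.1 * i, 0.2, 0.9 - 0.1 * i, 1.0] for i in range(5)]
MATRIX = [
    [1.1, 0.0, 0.1, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.05, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

if __name__ == "__main__":
    per_sample = blur_tone_map_per_sample(SAMPLES, WEIGHTS, MATRIX)
    once = blur_tone_map_once(SAMPLES, WEIGHTS, MATRIX)
    print("results match:", colors_match(per_sample, once))
```

The hoisted version performs one matrix multiply instead of five, which is where the arithmetic saving comes from. At fp16 precision on a real GPU the two orderings may differ slightly in the low bits.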
It’s worth noting that although the last optimization reduced the arithmetic cost from 2.7 cycles to 1.0 cycles, the shader throughput only improved from 2.7 cycles to 2.5 cycles per fragment because the bottleneck changed from “A” to “T”. Reducing the load on any pipeline will improve energy efficiency and prolong battery life, so these types of optimization are still worth making, even if they do not improve the headline performance.
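The arithmetic behind that observation can be sketched in a few lines of Python. Because the units run in parallel, the estimated cost per fragment is taken here as the maximum across the units; the cycle counts are the longest-path values from the two reports above, and the function name is illustrative.

```python
def cycles_per_fragment(cycles):
    """Estimated cost per fragment: the slowest unit, as units run in parallel."""
    return max(cycles.values())


# Longest-path values copied from the reports above.
BEFORE = {"A": 2.7, "LS": 0.0, "V": 0.2, "T": 2.5}  # mediump, tone map in loop
AFTER = {"A": 1.0, "LS": 0.0, "V": 0.2, "T": 2.5}   # tone map moved out of loop

# Fraction of arithmetic work removed by the optimization.
arithmetic_saving = 1.0 - AFTER["A"] / BEFORE["A"]

# Fraction of per-fragment cycles removed once the bottleneck moves to texturing.
throughput_saving = 1.0 - cycles_per_fragment(AFTER) / cycles_per_fragment(BEFORE)

if __name__ == "__main__":
    print(f"arithmetic cycles saved: {arithmetic_saving:.0%}")
    print(f"cycles per fragment saved: {throughput_saving:.0%}")
```

With these numbers the arithmetic load falls by roughly 63%, but the cycles per fragment fall by only about 7%, because texturing sets the limit once arithmetic drops below 2.5 cycles.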
Different models of Mali GPU are tuned for different target markets, so not all of them have the same performance ratios between the functional units. The Mali Offline Compiler allows developers to test the performance of their shaders on different target GPUs. For example, the Mali-G31 is designed for embedded use cases and so has a lower ratio of arithmetic to texture performance. Targeting the final shader above at this GPU gives the performance shown below:
                                     A      LS       V       T   Bound
    Total Instruction Cycles:      2.9     0.0     0.2     2.5       A
    Shortest Path Cycles:          1.6     0.0     0.2     2.5       T
    Longest Path Cycles:           2.9     0.0     0.2     2.5       A
…which shows that on this GPU the shader is still arithmetic bound, despite our optimizations.
The performance reports generated by the Mali Offline Compiler are based only on the shader source code visible to the compiler. They are not aware of the actual uniform values or texture sampler configuration for any specific draw call, or of any data-centric effects such as cache miss overheads.
Texture unit performance in particular can be impacted by the texture format and filtering type used. Trilinear filtering (GL_LINEAR_MIPMAP_LINEAR) takes twice as long as bilinear filtering (GL_LINEAR_MIPMAP_NEAREST), and the cost of anisotropic filtering scales with both the probe type and the number of anisotropic sample probes made. The Mali Offline Compiler reports assume simple bilinear filtering for all samples, which is the fastest type supported by the hardware. If you know a draw call is using trilinear filtering for texture samples, you should double the cycle cost of the texture accesses reported in the performance report.
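As a hedged worked example of that adjustment, the Python sketch below doubles the reported texture cost when every sample is trilinear. Scaling the cost linearly when only a fraction of the samples are trilinear is an assumption made here for illustration and is not something the report itself states. The input values are the longest-path cycles of the mediump shader on Mali-G76, and `adjust_texture_cycles` is a hypothetical helper.

```python
def adjust_texture_cycles(cycles, trilinear_fraction):
    """Scale reported texture cycles for samples that use trilinear filtering.

    A trilinear sample is taken to cost twice a bilinear one, so the texture
    cost grows by `trilinear_fraction` of its reported value. The linear
    scaling for partial fractions is an assumption of this sketch.
    """
    if not 0.0 <= trilinear_fraction <= 1.0:
        raise ValueError("trilinear_fraction must be between 0 and 1")
    adjusted = dict(cycles)
    adjusted["T"] = cycles["T"] * (1.0 + trilinear_fraction)
    return adjusted


# Longest-path values for the mediump shader, copied from the report above.
REPORTED = {"A": 2.7, "LS": 0.0, "V": 0.2, "T": 2.5}

all_trilinear = adjust_texture_cycles(REPORTED, 1.0)

if __name__ == "__main__":
    bound = max(all_trilinear, key=all_trilinear.get)
    print(f"texture cycles: {all_trilinear['T']}, bound unit: {bound}")
```

In this case the texture cost rises from 2.5 to 5.0 cycles, which moves the bottleneck from arithmetic to texturing even though the offline report lists the shader as arithmetic bound.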
The Arm Streamline profiler, also included in Arm Mobile Studio, can sample performance data from the Mali GPU hardware while it is running your application. This data can be used to supplement Mali Offline Compiler performance reports, for example by providing a direct measure of the number of multi-cycle texture operations being performed. A significant number of multi-cycle operations is a clear warning that the assumption that all accesses are bilinear is not valid for that application.
The Mali Offline Compiler provides a means to see underneath the API and measure the performance of your shader programs and compute kernels for all of the APIs supported by the Mali GPU family. Performance reports provide rapid identification of the critical path pipelines, allowing developers to accurately target optimizations at the parts of their shaders which are the performance bottleneck.
The latest release of the Mali Offline Compiler can be downloaded today in Arm Mobile Studio 2019.2, which includes a range of OpenGL ES and Vulkan sample shaders to get you started quickly.
Download the Mali Offline Compiler in Arm Mobile Studio: https://developer.arm.com/tools-and-software/graphics-and-gaming/arm-mobile-studio
If you have any feedback, feature requests, or questions please post on our developer forum.