Accelerate your shaders with Mali Offline Compiler 7.0

Peter Harris
November 6, 2019

Shader programs for OpenGL ES and Vulkan are among the most important inputs an application provides to render a scene, because they define the processing operations executed by the GPU shader core hardware. They are also among the hardest aspects of the system to profile: the details of the generated program binary, and how it executes on the target hardware, are hidden below the level of the API and are therefore difficult for developers to interpret.

The Mali Offline Compiler, included in the Arm Mobile Studio 2019.2 suite of Android performance analysis tools, is a host tool that provides offline performance analysis for shader programs and OpenCL compute kernels. It shows the expected performance and the likely performance bottlenecks on any of the available Mali GPU targets. It also checks the shader source code for syntax errors and reports any that it finds. Using an offline tool for these activities enables efficient development iteration: source code fixes and optimizations can be validated without building the whole application and running it on the target device.

The new Mali Offline Compiler 7.0 release includes support for the latest compiler back-end for the Mali Bifrost family of GPUs, adding support for the Mali-G52 and Mali-G76 products, and a new instruction cost model that provides much more accurate data for all of the Mali Bifrost GPUs.

Profiling using the Mali Offline Compiler

The following shader implements the horizontal pass of a 5-tap separable Gaussian blur, with an optional tone mapping stage implemented using a matrix multiply:

#version 310 es
#define WINDOW_SIZE 5

precision highp float;
precision highp sampler2D;

uniform bool toneMap;
uniform sampler2D texUnit;
uniform mat4 colorModulation;
uniform float gaussOffsets[WINDOW_SIZE];
uniform float gaussWeights[WINDOW_SIZE];

in vec2 texCoord;
out vec4 fragColor;

void main() {
    fragColor = vec4(0.0);

    // For each gaussian sample
    for (int i = 0; i < WINDOW_SIZE; i++) {
        // Create sample texture coord
        vec2 offsetTexCoord = texCoord + vec2(gaussOffsets[i], 0.0);

        // Load data and perform tone mapping
        vec4 data = texture(texUnit, offsetTexCoord);
        if (toneMap) {
            data *= colorModulation;
        }

        // Accumulate result
        fragColor += data * gaussWeights[i];
    }
}

Compiling this shader for Mali-G76 generates the following performance report:

Mali Offline Compiler v7.0.0 (Build bc7a3e) 
Copyright 2007-2019 Arm Limited, all rights reserved

Configuration
=============

Hardware: Mali-G76 r0p0
Driver: Bifrost r19p0-00rel0
Shader type: OpenGL ES Fragment

Main shader
===========

Work registers: 32
Uniform registers: 34
Stack spilling: False

                            A     LS    V     T     Bound
Total Instruction Cycles:   4.5   0.0   0.2   2.5   A
Shortest Path Cycles:       1.0   0.0   0.2   2.5   T
Longest Path Cycles:        4.5   0.0   0.2   2.5   A

A = Arithmetic, LS = Load/Store, V = Varying, T = Texture

Shader properties
=================

Uniform computation: False

The most interesting part of this report is the performance table for the Main shader, which gives us an approximate cycle cost breakdown for the major functional units in the design. For this shader we can see that:

  • The shader is texture bound when not using tone mapping; “T” is the highest value for the shortest path, at 2.5 cycles. That is 0.5 cycles per sample for our 5-sample blur, which is as fast as the hardware texture filtering unit can go.
  • The shader is arithmetic bound when using matrix-based tone mapping; “A” is the highest value for the longest path when the conditional tone mapping block is executed.

The hardware units run in parallel, so identifying which unit is on the critical path is the first step in optimization: it tells you which part of your shader code to optimize.
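As a minimal sketch of how the “Bound” column can be read, the snippet below picks the unit with the highest cycle count for a path. The `bound_unit` helper is hypothetical and not part of the Mali Offline Compiler; the cycle values are copied from the report above, and the sketch ignores how the tool itself handles ties between units.

```python
# Hypothetical helper for illustration; only the cycle values come from
# the performance report shown above.
def bound_unit(cycles):
    """Return the name of the functional unit with the highest cycle cost."""
    return max(cycles, key=cycles.get)


# A = Arithmetic, LS = Load/Store, V = Varying, T = Texture
shortest_path = {"A": 1.0, "LS": 0.0, "V": 0.2, "T": 2.5}
longest_path = {"A": 4.5, "LS": 0.0, "V": 0.2, "T": 2.5}

print(bound_unit(shortest_path))  # T
print(bound_unit(longest_path))   # A
```

Because the units run in parallel, the largest value approximates the cost of the path, and lowering any other column does not change the headline number.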

For full details of all of the reported sections and fields please see the Mali Offline Compiler User Guide.

Optimizing using the Mali Offline Compiler

Now that we know our critical path, let’s look at what can be done to make the tone mapping part of this shader faster. The first change we should make is to reduce precision. Currently the tone mapping uses a highp (fp32) matrix operation, which has much more precision than we need to generate an 8-bit per channel color output. We therefore drop to “mediump” (fp16) float and sampler precision by changing these two lines at the top of the shader:

precision mediump float; 
precision mediump sampler2D;

Just these two simple changes significantly reduce the cost of the longest path, as Mali GPUs can process twice as many fp16 operations per clock as fp32 operations.

                            A     LS    V     T     Bound
Longest Path Cycles:        2.7   0.0   0.2   2.5   A
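The claim that fp16 has enough precision for an 8-bit per channel output can be checked on the host with a rough sketch. The code below uses Python's `struct` half-precision format to round every 8-bit level through fp16 and back. The helper names are illustrative assumptions, and the check covers only storing a final color value; it does not model rounding that accumulates across intermediate shader operations.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def survives_fp16(level):
    """True if an 8-bit level is recovered after a round trip through fp16."""
    stored = to_fp16(level / 255.0)
    return round(stored * 255.0) == level


# Check every representable 8-bit channel value, 0 through 255.
all_levels_survive = all(survives_fp16(level) for level in range(256))
print(all_levels_survive)  # True
```

Every level survives because fp16 carries 11 significant bits, so the rounding error on a value in the range 0.0 to 1.0 stays well below half of one 8-bit step.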

However, despite this change arithmetic is still our longest path. One final change would be to move the tone mapping out of the accumulation loop, applying it to the final color rather than the individual samples. This gives the final shader structure shown below:

// For each gaussian sample
for (int i = 0; i < WINDOW_SIZE; i++) {
    vec2 offsetTexCoord = texCoord + vec2(gaussOffsets[i], 0.0);
    vec4 data = texture(texUnit, offsetTexCoord);
    fragColor += data * gaussWeights[i];
}

// Tone map the final color
if (toneMap) {
    fragColor *= colorModulation;
}

This reduces the arithmetic cost of the longest path to just a single shader core cycle, even if tone mapping is used. The slowest path is now texturing, which needs 2.5 cycles per fragment to load the 5 samples needed. We cannot make this any faster; this is the architectural performance of the shader core.

                            A     LS    V     T     Bound
Total Instruction Cycles:   1.0   0.0   0.2   2.5   T
Shortest Path Cycles:       0.5   0.0   0.2   2.5   T
Longest Path Cycles:        1.0   0.0   0.2   2.5   T
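Moving the tone mapping out of the loop preserves the result because a matrix multiply is linear: the weighted sum of the transformed samples equals the transform of the weighted sum. The sketch below checks this on the host in double precision. The sample colors, weights, and matrix are made-up illustrative values, and the function names are hypothetical; on the GPU the two forms can still differ by small rounding errors, especially at mediump.

```python
import math


def vec_times_mat(v, columns):
    """Row vector times matrix, mirroring GLSL `v * m` with m stored as columns."""
    return [sum(a * b for a, b in zip(v, col)) for col in columns]


def tone_map_per_sample(samples, weights, columns):
    """Original structure: one matrix multiply for each blur sample."""
    acc = [0.0] * 4
    for data, weight in zip(samples, weights):
        mapped = vec_times_mat(data, columns)
        acc = [a + m * weight for a, m in zip(acc, mapped)]
    return acc


def tone_map_once(samples, weights, columns):
    """Optimized structure: accumulate first, then one matrix multiply."""
    acc = [0.0] * 4
    for data, weight in zip(samples, weights):
        acc = [a + d * weight for a, d in zip(acc, data)]
    return vec_times_mat(acc, columns)


# Illustrative inputs only; none of these values come from the article.
weights = [0.06, 0.24, 0.40, 0.24, 0.06]
samples = [[0.1 * i, 0.2, 0.9 - 0.1 * i, 1.0] for i in range(5)]
columns = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.2, 0.0],
    [0.1, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

per_sample = tone_map_per_sample(samples, weights, columns)
once = tone_map_once(samples, weights, columns)
results_match = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for a, b in zip(per_sample, once)
)
print(results_match)
```

The optimized form performs one matrix multiply per fragment instead of five, which is consistent with the drop in the arithmetic column.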

It’s worth noting that although the last optimization reduced the arithmetic cost from 2.7 cycles to 1.0 cycles, the shader throughput only improved from 2.7 cycles to 2.5 cycles per fragment because the bottleneck changed from “A” to “T”. Reducing the load on any pipeline will improve energy efficiency and prolong battery life, so these types of optimization are still worth making, even if they do not improve the headline performance.

Target aware profiling

Different models of Mali GPU are tuned for different target markets, so not all of them have the same performance ratios between the functional units. The Mali Offline Compiler allows developers to test the performance of their shaders on different target GPUs. For example, the Mali-G31 is designed for embedded use cases and has a lower ratio of arithmetic to texture performance. If we target the final shader above at a Mali-G31, we get the performance shown below:

                            A     LS    V     T     Bound
Total Instruction Cycles:   2.9   0.0   0.2   2.5   A
Shortest Path Cycles:       1.6   0.0   0.2   2.5   T
Longest Path Cycles:        2.9   0.0   0.2   2.5   A

…which shows that on this GPU the shader is still arithmetic bound, despite our optimizations.

Performance report limitations

The performance reports generated by the Mali Offline Compiler are based only on the shader source code visible to the compiler. They are not aware of the actual uniform values or texture sampler configuration for any specific draw call, or of data-centric effects such as cache miss overheads.

Texture unit performance in particular can be impacted by the texture format and filtering type used; trilinear filtering (GL_LINEAR_MIPMAP_LINEAR) takes twice as long as bilinear filtering (GL_LINEAR_MIPMAP_NEAREST), and anisotropic filtering can be scaled by both the probe type and the number of anisotropic sample probes made. The Mali Offline Compiler reports assume simple bilinear filtering for all samples, which is the fastest type supported by the hardware. If you know a draw call is using trilinear filtering for texture samples, you should double the cycle cost of the texture accesses reported in the performance report.
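The doubling rule can be applied to a report by hand, or with a small sketch like the one below. The `adjust_for_trilinear` helper and its `trilinear_fraction` parameter are assumptions made for illustration: the text only states the all-trilinear case, so treating a mixed workload as a linear blend is an extrapolation. The cycle values are the longest path of the original, unoptimized shader on Mali-G76.

```python
# Hypothetical helper; the Mali Offline Compiler does not provide this.
def adjust_for_trilinear(cycles, trilinear_fraction=1.0):
    """Scale the texture cost, assuming a trilinear sample costs twice a bilinear one.

    trilinear_fraction is the assumed share of samples that use trilinear
    filtering, from 0.0 (none) to 1.0 (all).
    """
    if not 0.0 <= trilinear_fraction <= 1.0:
        raise ValueError("trilinear_fraction must be between 0.0 and 1.0")
    adjusted = dict(cycles)  # copy so the reported values are not modified
    adjusted["T"] = cycles["T"] * (1.0 + trilinear_fraction)
    return adjusted


# Longest path of the original highp shader, as reported for Mali-G76.
reported = {"A": 4.5, "LS": 0.0, "V": 0.2, "T": 2.5}
adjusted = adjust_for_trilinear(reported)

print(adjusted["T"])                       # 5.0
print(max(adjusted, key=adjusted.get))     # T
```

With every sample trilinear, the texture cost rises from 2.5 to 5.0 cycles and overtakes the 4.5 arithmetic cycles, so the same shader would be texture bound rather than arithmetic bound.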

The Arm Streamline profiler, also included in Arm Mobile Studio, can sample performance data from the Mali GPU hardware while it is running your application. This data can be used to supplement Mali Offline Compiler performance reports, for example by providing a direct measure of the number of multi-cycle texture operations being performed. A significant number of multi-cycle texture operations is a warning that the assumption that all accesses are bilinear is not valid for the application.

Summary

The Mali Offline Compiler provides a means to see underneath the API and measure the performance of your shader programs and compute kernels for all of the APIs supported by the Mali GPU family. Performance reports provide rapid identification of the critical path pipelines, allowing developers to accurately target optimizations at the parts of their shaders which are the performance bottleneck.

The latest release of the Mali Offline Compiler can be downloaded today in Arm Mobile Studio 2019.2, which includes a range of OpenGL ES and Vulkan sample shaders to get you started quickly.

Download the Mali Offline Compiler in Arm Mobile Studio

If you have any feedback, feature requests, or questions please post on our developer forum.
