
Accelerating Mali GPU analysis using Arm Mobile Studio

Peter Harris
March 20, 2019
19 minute read time.

Japanese version - 日本語版 
Korean version - 한국어판

The Streamline performance analyzer is a sample-based profiler which can present detailed performance information about the Arm CPU and Mali GPU present in a device. Recent versions of Streamline have included a set of predefined templates which can be used to easily select a set of data sources to use, and control how they are visualized. The latest release of Streamline, included in Arm Mobile Studio 2019.0 and Arm Development Studio 2019.0, includes a number of improvements to the Mali GPU templates for the Mali Bifrost GPU family. This article walks through the use of the template for a Mali-G72 GPU.

This blog assumes that the reader has familiarity with graphics terminology, in particular relating to tile-based rendering GPU architectures. Some useful quick-start guides on these topics can be found here:

  • Understanding Render Passes
  • Understanding GPU Pipelining
  • Understanding Tile-based Rendering
  • Introduction to the Mali Bifrost Shader Core

Selecting counters

Once you have followed the Quick Start Guide to set up your application and install the gator daemon to the target, it is time to select some data sources and start profiling. Connect to your device and bring up the Counter Selection dialog. In the Counter Selection dialog select the appropriate template for your device from the drop-down menu.

Selecting a template in Streamline Counter Configuration.

This will automatically select all of the data sources necessary to render the template's visualization. Click save, and then capture a trace of your application. Once the initial data analysis has completed the default Timeline visualization will be presented:

Streamline default Timeline view.

This shows an alphabetical list of the charts and data series that were captured. The first thing we therefore need to do is select, for visualization, the same template that we used for the capture:

Streamline selecting a new template in the Timeline.

This will change the Timeline to display a pre-defined visualization, designed by our in-house performance analysis team. This will order the charts in a more methodical sequence, and make use of mathematical expressions to combine multiple raw counters to derive more readable metrics such as percentage utilization of a functional unit.

Finding frames

The initial view that the Timeline presents gives us 1 second per on-screen sample, which is too coarse for debugging graphics content; we are most interested in viewing how well we are processing frames, which are typically between 16 and 32 milliseconds long. The first step in the analysis is therefore to zoom in until single frames become distinct.

Zooming in to find a single frame

In the application shown in this example we have added instrumentation to the source code to generate a Streamline marker annotation whenever the application calls eglSwapBuffers(). These are visible as the red ticks on the time track above the charts.

Once you can see individual frames it is possible to make an initial assessment of the current system behavior:

  • Measure the time between frames to determine the achieved framerate.
  • Measure the CPU thread load to determine whether we are CPU bound.
  • Measure the GPU load to determine whether we are GPU bound.
  • Inspect the pipelining of CPU and GPU workloads to determine if the application logic is optimally feeding the graphics pipeline without scheduling bubbles.
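The first two checks above can be sketched as simple arithmetic over the capture data. A minimal sketch, assuming you have read the eglSwapBuffers() marker timestamps and the GPU active time off the Timeline by hand (the function names and the 95% busy threshold are illustrative, not part of Streamline):

```python
# Illustrative sketch: derive framerate and a GPU-bound hint from
# hypothetical values read off the Streamline Timeline.

def achieved_fps(marker_times_s):
    """Average framerate from consecutive eglSwapBuffers() marker times."""
    deltas = [b - a for a, b in zip(marker_times_s, marker_times_s[1:])]
    return len(deltas) / sum(deltas)

def is_gpu_bound(gpu_active_s, frame_time_s, threshold=0.95):
    """A GPU that is busy for nearly the whole frame suggests a
    GPU-bound workload; 0.95 is an arbitrary illustrative cut-off."""
    return gpu_active_s / frame_time_s >= threshold

# Markers ~16.7 ms apart correspond to roughly 60 FPS.
markers = [0.0, 0.0167, 0.0334, 0.0501]
```

The same pattern applies to the CPU-bound check, substituting the busiest thread's active time for the GPU active time.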

In our example above we can see that the CPUs are all going completely idle for a significant proportion of the frame, so we are not CPU bound. We can also see that the GPU is active all of the time, so the GPU is highly likely to be the processor limiting this application's performance.

In terms of breaking down the GPU workload further we can see that the fragment shading queue is the one active all of the time, with the non-fragment queue used for all geometry and compute processing going idle for most of the frame. You would therefore look to optimize fragment workload for this application if you wanted to improve performance.

The following sections in this tutorial work through each of the charts in the template, and explain what they mean and what changes they could imply for an application developer looking to improve performance.

CPU workload

The CPU charts show the overall usage of the CPUs in the system.

Streamline CPU chart extract.

The CPU Activity charts show the per-CPU utilization, computed as a percentage of time the CPU was active, split by processor type if you have big.LITTLE clustering present. This is based on OS scheduling event data. The CPU Cycles chart shows the number of cycles that each CPU was active, measured using the CPU performance monitoring unit (PMU). By considering both of these together we can assess the overall application software load; a high utilization and a high CPU cycle count indicate that the CPU is both very busy and running at a high clock frequency.
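Combining the two charts gives an estimate of the clock frequency the CPU was running at while active. A minimal sketch, assuming you have the active cycle count and the utilization fraction for one core over a known window (the function name is hypothetical):

```python
def effective_cpu_frequency_hz(active_cycles, utilization, wall_time_s):
    """Approximate clock frequency while the CPU was scheduled.

    utilization is the fraction of wall time the CPU was active (0..1),
    so active time = utilization * wall_time_s.
    """
    active_time_s = utilization * wall_time_s
    return active_cycles / active_time_s

# A core that retired 1G cycles while scheduled 50% of a 1 s window
# was running at roughly 2 GHz while active.
```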

Streamline CPU chart expanded to show cores and threads.

The process view at the bottom of the Timeline tab shows the application thread activity, allowing an identification of which threads are causing the measured load. Selecting one or more threads from the list will filter the CPU related charts so that only the load from the selected threads is shown. When a thread-level filter is active the chart title background changes to a blue-tinted color to indicate that not all of the measured load is currently visible.

If the application is not hitting its performance target and has a single CPU thread which is active all of the time then it is likely to be CPU bound. Improvements to frame time will require software optimizations to reduce the cost of this thread's workload. Streamline provides native software profiling via program counter sampling, in addition to the performance counter views. Software profiling is beyond the scope of this tutorial, so please refer to the Streamline User Guide for more information.

GPU workload

The GPU workload charts show the overall usage of the GPU.

Streamline GPU Job Manager chart extract.

The Mali Job Manager Cycles chart shows the number of GPU cycles spent with work running, both for the GPU as a whole and for the two parallel hardware work queues handling Non-fragment and Fragment work. The Mali Job Manager Utilization charts show the same data normalized as a percentage against GPU active cycles.
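The normalization is a straightforward ratio, and the same numbers identify which queue dominates. A minimal sketch under the assumption that you have the raw per-queue active cycle counts (the function names are illustrative):

```python
def queue_utilization_pct(queue_active_cycles, gpu_active_cycles):
    """Normalize a work queue's active cycles against GPU active cycles,
    as the Utilization charts do."""
    return 100.0 * queue_active_cycles / gpu_active_cycles

def dominant_queue(cycles_by_queue):
    """Name of the queue that was busiest over the capture window."""
    return max(cycles_by_queue, key=cycles_by_queue.get)

# e.g. {"fragment": 980, "non_fragment": 310} -> fragment-dominated,
# matching the fragment-bound example discussed above.
```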

For GPU bound content the dominant work queue should be active all of the time, with the other queue running in parallel to it. If a GPU bound application is not achieving good parallelism check for API calls which drain the rendering pipeline, such as glFinish() or synchronous use of glReadPixels(), or Vulkan dependencies which are too conservative to allow for stage overlap of multiple render passes (including overlap across frames).

The Tiler active counter in this chart is not always directly useful, as the tiler is normally active for the entire duration of geometry processing, but it can give an indication of how much compute shading is present. Any large gap between Non-fragment active and Tiler active may be caused by application compute shaders.

The IRQ active counter shows the number of cycles the GPU has an interrupt pending with the CPU. An IRQ pending rate of ~2% of GPU cycles is normal, but applications can cause a higher rate of interrupts by enqueuing a large number of small render passes or compute dispatches.

Note: A high IRQ overhead can also be indicative of a system integration issue, such as CPU interrupts being masked for a long time by privileged kernel activity. It is not usually possible to fix a high IRQ overhead using application changes.

GPU memory system

The memory system charts show the behavior seen at the GPU memory interface, both in terms of memory traffic generated by the GPU and how effectively the system is handling that traffic.

Streamline GPU Memory System chart extract.

The Mali External Bus Bandwidth chart shows the total read and write bandwidth generated by the application. Reducing memory bandwidth can be an effective application optimization goal, as external DDR memory accesses are very energy intensive. Later charts can help identify which application resource types are the cause of the traffic.

The Mali External Bus Stall Rate chart shows the percentage of GPU cycles with a bus stall, indicating how much back-pressure the GPU is getting from the external memory system. Stall rates of up to 5% are considered normal; a stall rate much higher than this is indicative of a workload which is generating more traffic than the memory system can handle. Stall rates can be reduced by reducing overall memory bandwidth, or improving access locality.

The Mali External Bus Read Latency chart shows a stacked histogram of the response latency of external memory accesses. Mali GPUs are designed for an external memory latency of up to 170 GPU cycles, so seeing a high percentage of reads in the slower bins may indicate a memory system performance issue. DDR performance is not constant, and latency will increase when the DDR is under high load, so reducing bandwidth can be an effective method to reduce latency.

Note: It is expected that a small proportion of memory accesses will be in the slower bins, as the DDR is a shared resource and there will be competing traffic from other parts of the system.

The Mali External Bus Outstanding Reads/Writes charts show another set of stacked histograms, this time showing the percentage of allowed memory accesses the GPU has queued in the memory system. If a high percentage of the histogram is in the 75-100% bin, the GPU may be running out of available transactions. This will stall new memory requests until an older request has retired. Reducing memory bandwidth or improving access locality in DDR may improve performance.
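The memory-system health checks described above reduce to a couple of simple tests. A minimal sketch, assuming raw stall-cycle counts and a histogram exported as bin-label-to-percentage pairs; the 5% stall threshold comes from the text, while the 50% histogram trigger is an arbitrary placeholder you should tune:

```python
def bus_stall_rate_pct(stall_cycles, gpu_active_cycles):
    """Percentage of GPU cycles with a bus stall; ~5% or less is normal."""
    return 100.0 * stall_cycles / gpu_active_cycles

def transactions_saturated(outstanding_histogram_pct, limit_pct=50.0):
    """True if too many samples sit in the 75-100% outstanding bin,
    suggesting the GPU is exhausting its in-flight transactions."""
    return outstanding_histogram_pct.get("75-100%", 0.0) > limit_pct
```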

GPU geometry

The geometry charts show the amount of geometry being processed by the GPU, and the behavior of the primitive culling unit.

Streamline GPU Geometry chart extract.

The Mali Primitive Culling chart shows the absolute number of primitives being processed, how many are killed by each culling stage, and how many are visible. A single vertex is much more expensive to process than a single fragment because vertices have high memory bandwidth requirements, so you should aim to reduce the total primitive count per frame as much as possible.

The Mali Primitive Culling Rate chart shows the percentage of primitives entering each culling stage that are killed by it, and the percentage of visible primitives. The culling pipeline runs as a number of serial stages:

Mali Bifrost GPU culling pipeline

For a 3D scene it is expected that ~50% of the primitives are back-facing and killed by the facing test culling unit. If the Culled by facing test rate is much lower than this, review whether the facing test is correctly enabled.

It is standard best practice for an application to cull out-of-frustum draw calls on the CPU, so the Culled by frustum test rate should be kept as low as possible. If more than 10% of input primitives are getting killed at this stage, review the effectiveness of the CPU-side culling. In addition, it can be worth reviewing batch sizes, as overly large batches of objects can reduce culling efficiency.

The final culling rate, Culled by sample test, measures the percentage of primitives killed because they are so small that they will hit no rasterization sample points. Dense geometry is very expensive, both in terms of the direct vertex processing cost and reduced fragment shading efficiency, so this number should be kept as close to 0% as possible. If a high number of primitives are being killed here review both static mesh density, and the effectiveness of any dynamic level-of-detail selection.
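Because the stages run serially, each kill rate applies to the primitives that survived the previous stage, in the order listed above. A minimal sketch of that composition (the function name is illustrative):

```python
def visible_primitives(input_prims, facing_kill, frustum_kill, sample_kill):
    """Apply the serial culling stages in order; each rate is the
    fraction of primitives *entering that stage* which it kills."""
    after_facing = input_prims * (1.0 - facing_kill)
    after_frustum = after_facing * (1.0 - frustum_kill)
    return after_frustum * (1.0 - sample_kill)

# 1000 input primitives with a healthy ~50% facing kill and a 25%
# frustum kill leave 375 visible primitives for rasterization.
```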

The Mali Geometry Threads chart shows the absolute number of shading requests generated by Mali's index-driven vertex shading algorithm. This design splits the application's vertex shader into two pieces, one piece which computes position and one which computes the other varyings. The varying shader is only run for vertices belonging to primitives which survive clipping and culling. It is possible to review multiple things at this point.

  • Compare the total number of position shader invocations with the application index buffers. If the GPU is shading more indices than the application submitted then this may be indicative of poor index locality, causing thrashing of the position cache and forcing reshading.
  • Compare the total number of position shader invocations with the total number of input primitives. For most content a single vertex should be used by multiple adjacent primitives to amortize the cost as much as possible, so aim for an average of less than one vertex per primitive.
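Both comparisons above are simple ratios over the counter values. A minimal sketch, assuming you know the index count the application submitted and the input primitive count from the culling chart (the function names are illustrative):

```python
def position_reshading_ratio(position_invocations, submitted_indices):
    """A ratio above 1.0 suggests poor index locality thrashing the
    position cache and forcing vertices to be reshaded."""
    return position_invocations / submitted_indices

def vertices_per_primitive(position_invocations, input_primitives):
    """Aim for an average below 1.0 shaded vertex per primitive, so
    vertex cost is amortized across adjacent primitives."""
    return position_invocations / input_primitives
```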

GPU shader front-end

The shader front-end charts show the behavior of the fixed-function unit which turns primitives into fragment threads to be shaded.

Streamline GPU Shader Core Frontend chart extract.

The Mali Core Primitives chart shows the number of primitives that are loaded for rasterization. Note that Mali will load large primitives once per tile, so a single primitive will be included in this count once for every tile it intersects.

The Mali Early ZS Testing Rate chart shows the depth (Z) and stencil (S) testing and culling rates in the front-end. Early ZS testing is much less expensive than late ZS testing, so aim for nearly all fragments to be Early ZS tested by minimizing use of shader discard, alpha-to-coverage, and shader generated depth values. The FPK killed counter reports the proportion of quads killed by Mali's Forward Pixel Kill hidden surface removal scheme. A high proportion of quads being killed by FPK indicates a back-to-front render order; reversing this to a front-to-back render order will kill quads earlier during early-ZS testing which will reduce energy consumption.

The Mali Late ZS Testing Rate chart shows the depth and stencil testing and culling rates in the back-end after fragment shading. Killing a high proportion of quads during late ZS testing indicates a potential efficiency issue because these fragments are killed after they have been shaded.

Note: A render pass which starts from an existing depth or stencil attachment, rather than from a clear, will trigger a late ZS operation as part of the reload process. This may not be avoidable, but aim to minimize the number of render passes that start without clearing all attachments.

The Mali Core Warps chart shows the number of warps created by the compute front-end (which includes all non-fragment workloads) and the fragment front-end. Note that the warp width can vary from product to product.

The final two charts show the average shader core processing cost per thread. For GPU bound content there are two possible optimization objectives for shader workloads:

  • reduce the number of warps created by simplifying the scene content, or
  • optimize the shader programs to reduce the cost per thread.
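The per-thread cost in those charts is derived from the warp count and the warp width. A minimal sketch of the relationship, with the warp widths given only as examples since the width varies per Mali product:

```python
def cycles_per_thread(core_active_cycles, warps, warp_width):
    """Average shader core cycles spent per thread. warp_width varies
    from product to product (e.g. 4, 8, or 16 lanes - check the
    documentation for your specific GPU)."""
    return core_active_cycles / (warps * warp_width)

# Halving either the warp count (simpler scene) or the per-thread
# cost (simpler shaders) halves the total shader core load.
```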

GPU shader frontend pixels

This set of charts looks at the rate at which the shader core is producing pixels.

Streamline GPU Shader Core Pixels chart extract.

The Mali Pixels chart shows the total number of pixels shaded by all shader cores, allowing an assessment of the total number of pixels required to produce a frame.

The Mali Overdraw chart shows the average number of fragments shaded per output pixel. High levels of overdraw can reduce performance, even if the cost per fragment is low. Aim to minimize the number of layers of transparent fragments in use to reduce overdraw.
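The overdraw metric is simply fragments shaded divided by output pixels. A minimal sketch for one frame at a known resolution (the function name is illustrative):

```python
def average_overdraw(fragments_shaded, width, height):
    """Average fragments shaded per output pixel for one frame; high
    values mean many transparent layers are being shaded per pixel."""
    return fragments_shaded / (width * height)

# Shading ~4.15M fragments for a 1920x1080 frame is 2x overdraw:
# on average every pixel was shaded twice.
```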

GPU shader core

The shader core is the heart of the GPU, so it should be no surprise that we have a large number of counters allowing inspection of shader core workloads. The first set of charts aim to give an at-a-glance view of the overall shader core utilization.

Streamline GPU Shader Core Usage chart extract.

Note: The shader core "compute" data path is used for processing all non-fragment workloads, so any compute related counter will also include vertex shading workloads.

The Mali Core Utilization chart shows the percentage utilization of the three major parts of the shader core.

  • The Compute utilization and Fragment utilization series show the percentage of time that the shader core is processing a workload of that type, including time spent in any fixed-function logic such as rasterization and tile writeback.
  • The Execution core utilization series shows the percentage of time that the programmable core itself is active; if this is lower than 100% for long periods this may be indicative of a problem keeping the programmable core fed with work.
  • The Fragment FPK utilization series in this chart shows the percentage of time that quads are queued waiting to be turned into fragment threads. If this is lower than 100% for long periods it may indicate that we are failing to generate new fragments fast enough for the shader core. This may be caused by a high volume of microtriangles which generate only a small number of fragments per primitive, or by a workload with a large number of empty tiles which contain no geometry at all, such as common types of shadow map.

The Mali Core Unit Utilization chart shows the percentage utilization of the major pipelines inside the Execution core.

  • The Execution engine utilization series shows the percentage of time that the shader core arithmetic units are active.
  • The Varying unit utilization series shows the percentage of time that the fixed-function interpolation unit is active.
  • The Texture unit utilization series shows the percentage of time that the fixed-function texture sampling and filtering unit is active.
  • The Load/store unit utilization series shows the percentage of time that the general purpose memory access unit is active.

For content which is shader core bound, identifying the most heavily loaded unit using this chart is a good way to determine where to target optimizations.

Workload properties

The Mali Workload Properties chart contains a variety of component series which indicate interesting behaviors of the workload.

Streamline GPU Shader Core Workload Quirks chart extract.

The Warp divergence rate series reports the percentage of instructions executed when there is control flow divergence across the warp, causing some execution lanes to be masked out. Aim to minimize control flow divergence, as it can rapidly erode shader execution efficiency.

The Partial warp rate series reports the percentage of warps which contain thread slots with no coverage. This occurs due to a fragment quad intersecting the edge of a primitive, resulting in a fragment with no hit samples. A high percentage of partial warps may be indicative of an application with a high number of microtriangles, or triangles which are very thin. Aim to minimize the number of partial warps, as they can also rapidly erode shader execution efficiency.

The Tile CRC kill rate series reports the percentage of tiles which are killed due to a CRC match, indicating that the computed color matches the color already in memory. A high kill rate may indicate an optimization opportunity, if the application is able to identify and draw only the changed parts of the screen.

Arithmetic unit

The Execution Engine is used for executing all shader instructions, including all of the arithmetic workloads. The Execution engine utilization series in the Mali Core Unit Utilization chart described earlier can be used to determine whether the application is arithmetic processing limited.

Varying unit

The Mali Core Varying Cycles chart reports the usage of the fixed-function varying interpolator, broken down by data precision.

Streamline GPU Shader Core Varying Usage chart extract.

For content which is varying bound there are three possible optimization opportunities:

  • Direct workload reduction: reducing the overall number of varyings that must be loaded per frame.
  • Precision reduction: switching from 32-bit highp to 16-bit mediump varyings will halve interpolator requirements, and this often causes knock-on improvements in shader logic.
  • Varying packing: packing 16-bit varyings into vectors which are a multiple of 32-bits minimizes lost cycles due to unused interpolator lanes. For example, a packed vec4 will interpolate one cycle faster than a float and a separate vec3.
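The packing and precision effects above can be captured in a small cost model. This is an illustrative sketch only, not a hardware specification: it assumes the interpolator processes 64 bits of varying data per cycle and rounds each separately-declared varying up to whole cycles, which is what makes unused lanes wasteful:

```python
import math

INTERP_BITS_PER_CYCLE = 64  # assumed throughput; illustrative only

def interpolation_cycles(varyings):
    """Model interpolation cost for a list of (components, bits) pairs.
    Each varying is rounded up to whole cycles separately, so packing
    small varyings into one vector avoids wasted lanes."""
    return sum(math.ceil(components * bits / INTERP_BITS_PER_CYCLE)
               for components, bits in varyings)

# highp vec4 packed: 2 cycles; highp float + vec3 separately: 3 cycles,
# matching the "one cycle faster" packing example in the text. A
# mediump (16-bit) vec4 costs 1 cycle - half the highp cost.
```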

Texture unit

The texture unit is a complex unit which handles all texture sampling and filtering. Its performance can vary depending upon the texture format and filtering mode being used.

Streamline GPU Shader Core Texture Usage chart extract.

The Mali Core Texture Cycles chart reports the total usage of the fixed-function texture filtering unit.

The Mali Core Texture CPI chart reports the average number of texture cycles per request, giving some idea of the number of multi-cycle operations being triggered by use of more complex filtering modes. For texture bound content reducing CPI by using simpler filtering modes can be an effective means to improve performance.

The Mali Core Texture Usage Rate chart reports statistics about the types of texture access being made.

  • The Compressed access series reports the percentage of the texture accesses which are using block compressed texture formats such as ASTC and ETC. Game rendering should be using block compressed textures as much as possible to reduce bandwidth.
  • The Mipmapped access series reports the percentage of the texture accesses which are using mipmapped textures. Game rendering should be using mipmapped textures for all 3D scenes to improve both performance and image quality.
  • The Trilinear filtered access series reports the percentage of texture samples which are using trilinear filtering. These accesses run at half the rate of the bilinear accesses.
  • The 3D access series reports the percentage of texture samples which are making a sample into a volumetric texture. These accesses run at half the rate of 2D texture accesses.
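The half-rate behaviors above compose: a trilinear sample from a 3D texture pays both penalties. A minimal cost sketch under that assumption (the multipliers follow the rates stated in the text; the function name is illustrative):

```python
def texture_filter_cycles(samples, trilinear=False, volumetric=False):
    """Model filtering cycles relative to a bilinear 2D access (cost 1).
    Trilinear filtering and 3D (volumetric) access each run at half
    rate, i.e. double the cycles, and the penalties multiply."""
    cost_per_sample = 1
    if trilinear:
        cost_per_sample *= 2
    if volumetric:
        cost_per_sample *= 2
    return samples * cost_per_sample
```

This is why the Texture CPI chart rises with heavier filtering modes, and why simpler modes can recover performance for texture-bound content.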

The Mali Core Texture Bytes/Cycle chart reports how many bytes need to be fetched from the L2 and from external memory for each filtering cycle. External accesses are particularly energy intensive, so it is recommended to use compressed textures and mipmaps, as well as ensuring good sample locality to reduce cache pressure.

Load/store unit

The load/store unit provides generic read/write data memory access, as well as image and atomic access.

Streamline GPU Shader Core Load/Store Usage chart extract.

The Mali Core Load/Store Cycles chart reports the access types made to the load/store cache. Reads and Writes are either "full" or "short", where a short access does not make full use of the available data bus width. Reducing the number of short accesses by making vectorized memory accesses, and accessing adjacent data in spatially adjacent threads in a compute shader, can help improve performance.

The Mali Core Load/Store Bytes/Cycle chart reports how many bytes need to be fetched from the L2 and from external memory per read, and how many bytes are written per write. Interpreting this counter can be difficult without knowledge of the algorithms being used, but it can be useful for investigating compute shader performance where the memory access and data usage pattern is known. For example, identifying content where most loads are coming from external memory rather than the L2 cache could indicate that the working set is too large.

Memory bandwidth

The final set of charts show the memory bandwidth generated by a shader core, broken down by the unit which is generating the traffic.

Streamline GPU Shader Core Memory Source chart extract.

For content which has an overall bandwidth problem, these counters can help identify which data resource is responsible for generating the most traffic.

  • Load/store unit traffic is related to all types of buffer access, and access to data through an image() accessor.
  • Texture unit traffic is related to all types of shader texture() access, including implicit loads needed to restore tile buffer contents at the start of a render pass if attachments are not cleared or invalidated.
  • Tile buffer traffic is related to all framebuffer attachment writes back to memory at the end of a render pass.

Summary

This article has shown how to perform an initial performance review of a graphical application running on an Arm CPU and a Mali GPU, using performance counter information to identify dominant workloads and possible causes of performance slowdown. Predefined templates built into Streamline can be used to quickly and efficiently capture the necessary counters to support a methodical review workflow, allowing a step-by-step review of the various blocks and behaviors in the design.

Comments

JPJ, over 5 years ago:

Great article Pete! Quite insightful, thanks!

Peter Harris, over 5 years ago, in reply to JPJ:

Glad you found it useful, shout if you have any questions.