
Mali GPU Tools: A Case Study, Part 2 — Frame Analysis with Mali Graphics Debugger

Lorenzo Dal Col
June 29, 2015
11 minute read time.

In the first part of this series we used ARM® DS-5 Streamline to profile the Epic Citadel demo in order to understand the workload on the CPU and GPU, and the bandwidth usage. In this article we are going to analyze the same frame with the ARM Mali™ Graphics Debugger. Profiling the demo with DS-5 Streamline told us that:

  1. There is very high GPU fragment activity, which seems to be the culprit for non-optimal performance while rendering the selected scene.
  2. The Load/Store pipeline is active for more than 90% of the time (408m L/S cycles / 448m GPU cycles), which means it is probably the bottleneck.
  3. The cycles per instruction (CPI) metric for the Load/Store pipeline is poor: on average, it takes more than two cycles for each instruction to be completed.
  4. The bandwidth usage is fairly high (around 3 GB/s), probably due to the high Load/Store pipeline activity.

To investigate more deeply, we need to know what the application is doing at the API level. While the hardware counters are useful to see the big picture and find the bottleneck, the Mali Graphics Debugger will show us the reason why a certain issue is happening, and what we could do about it.

Analyzing the Application with Mali Graphics Debugger

The Mali Graphics Debugger is an application that traces all OpenGL® ES activity at the API level. It helps content developers debug graphics and understand issues and their causes at the frame level. The tool lets the user observe OpenGL ES and OpenCL™ API call arguments and return values, and interact with a running target application to investigate the effect of individual calls on the target. Attempted misuse of the API is highlighted, as are recommendations for improvement on a Mali-based system. Trace information can also be captured to a file on one system and analyzed later, and the state of the underlying GPU subsystem can be observed at any point. The tool consists of two main components: the GUI application and the target interceptor components. Additional information can be found in the Mali Graphics Debugger User Guide.

A nice thing about the Mali Graphics Debugger is that not only does it help debug a problem, it also provides a lot of useful information for improving performance. Its frame analysis features can be used for:

  • Debugging: for example, understanding why a particular object does not appear in the scene, or why a light does not look correct.
  • Analyzing: visualizing all the resources used to render a frame, and getting statistics about the number of vertices, fragments, binary instructions, etc.

Here we are focusing on the latter. We are going to inspect the resources that are used to render the frame we are analyzing. In this case the Epic Citadel demo is running on the same device as in Part 1, a Google Nexus 10 with Android™ OS 4.4.2. The device has been rooted and the target components of the Mali Graphics Debugger have been installed. They consist of the API interceptor library and mgddaemon, a small application that communicates with the GUI through USB.

epic_mgd.png

Mali Graphics Debugger showing an Epic Citadel frame and some of the available features

Understanding the Fragment Activity

We are going to understand why the fragment activity is so high and whether there are optimizations we can apply to reduce the memory accesses. The high fragment activity could be because of a combination of the following cases:

  1. Too many fragments to draw
    This is possibly because of:
    • Too high a resolution
    • Overdraw
    • Too many render targets.

  2. Too many instructions per shader
    In this case we need to find the shader that is being used the most and optimize it with the Mali Offline Shader Compiler.
  3. Too many stalls
    The fragment shader cannot run at full speed because too many stalls are happening. In this case the shader core will waste cycles waiting for memory, probably because of poor cache utilization.

Fragment Activity Budget

The first thing we can do is calculate the average number of cycles we can spend on fragment activity. This will give us a guideline for analyzing shaders and overdraw.

Since we are running the demo in High Quality mode, which renders the scene directly to framebuffer zero, the theoretical number of pixels that we are drawing is 2560×1600, namely 4,096,000. If we are targeting display synchronization, sixty frames must be rendered every second. We are running our demo on a Google Nexus 10, which contains an ARM Mali-T604 GPU with four cores running at up to 450 MHz. With these figures we can calculate the theoretical average number of cycles per pixel that we have available:


(4 cores × 450m cycles/s) / (2560 × 1600 pixels × 60 fps) = 7.32 cycles/pixel

This does not take into account stalls or the overhead of any setup work the driver may need to do, and it assumes that the vertex activity is negligible or fully parallelized. In reality we may be able to dedicate fewer cycles per pixel to fragment activity, and we do not want to keep the GPU fully utilized for long periods (in order to save energy).
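As a quick sanity check of this budget, here is a minimal sketch of the same arithmetic as a reusable function; the device parameters are the Nexus 10 figures quoted above, everything else is illustrative:

#include <cstdio>

// Rough per-pixel cycle budget for fragment work, ignoring vertex load,
// driver overhead and stalls (see the caveats above).
double fragmentCycleBudget(int cores, double gpuHz,
                           int width, int height, double fps) {
    double cyclesPerSecond = cores * gpuHz;                 // total shader-core cycles per second
    double pixelsPerSecond = double(width) * height * fps;  // pixels to shade per second
    return cyclesPerSecond / pixelsPerSecond;
}

int main() {
    // Mali-T604 on the Nexus 10: 4 cores at 450 MHz, 2560x1600 at 60 fps.
    std::printf("%.2f cycles/pixel\n",
                fragmentCycleBudget(4, 450e6, 2560, 1600, 60.0));  // prints ~7.32
    return 0;
}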

Overdraw

Since the objects in the scene are not always sorted by depth in an optimal way, and there are some semi-transparent textures to draw, the GPU often has to process multiple fragments for the same pixel in order to blend them correctly. We call this overdraw, and while some of it is unavoidable, it should at least be reduced. The overdraw factor is the average number of times each pixel is drawn. For example, an overdraw factor of 1.5x means that, on average, each pixel is drawn 1.5 times; for instance, half of the pixels might require two fragments to be processed.

epic-overdraw-generic.png

We have two ways to calculate the overdraw factor for a particular scene:

  1. Combining some Mali GPU hardware counters in DS-5 Streamline.
  2. Using the Fragment Count feature in the Mali Graphics Debugger.

Here we are going to use both, so that we can validate our computations.

Calculating the Overdraw Factor with DS-5 Streamline

Using the same capture that we made in Part 1, for the same selected second of activity in the timeline, we can add a custom counter to get the overdraw factor:

epic-overdraw-streamline.jpg

Overdraw factor calculated using the Mali GPU hardware counters in Streamline

We need to know the size of the tiles (Mali GPUs are tile-based), which is 16 by 16 pixels in the case of the Mali-T604. Such a custom counter is calculated with the following formula:

(Mali Core Threads ➞ Fragment threads) / (Mali Fragment Tasks ➞ Tiles rendered × Tile Size) = 90.7m / (143k × 16 × 16) = 2.48 fragment threads/pixel

This figure represents an average for the selected second, which includes around 30 frames.

This formula is valid for the Mali-T604 GPU. For other Mali GPUs, please check the document http://community.arm.com/docs/DOC-10182 (see section 3.4.4 for the meaning of the counter on the Mali-T760).
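The same counter arithmetic, written out as a small sketch with the tile size as a parameter (16×16 is the Mali-T604 value; other GPUs may differ, as noted above); the figures in main() are the ones from the capture:

#include <cstdio>

// Overdraw factor from the Streamline counters described above:
//   fragmentThreads = "Mali Core Threads -> Fragment threads"
//   tilesRendered   = "Mali Fragment Tasks -> Tiles rendered"
double overdrawFromCounters(double fragmentThreads, double tilesRendered,
                            int tileWidth, int tileHeight) {
    return fragmentThreads / (tilesRendered * tileWidth * tileHeight);
}

int main() {
    // 90.7m fragment threads and 143k tiles rendered, 16x16 tiles on Mali-T604.
    std::printf("%.2f fragment threads/pixel\n",
                overdrawFromCounters(90.7e6, 143e3, 16, 16));  // prints ~2.48
    return 0;
}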

Calculating the Overdraw Factor with Mali Graphics Debugger

The Mali Graphics Debugger takes a different approach to analyzing the 3D application. It works at the API level, seeing all the OpenGL ES 2.0 function calls that the application makes and reading the framebuffers produced by the GPU. To analyze a particular frame from a fragment workload perspective, a feature called Fragment Count is available. This feature replaces the fragment shaders used in all the draw calls of one frame with special shaders that count how many fragments are affected by each draw call. It is important to remember that it counts all the fragments emitted by the rasterizer, including those that would normally be discarded by the shader and those that would end up fully transparent. We believe this is the right thing to do, since such fragments would still have to run through the fragment shader.

epic-overdraw-mgd.png

Statistics for the fragment shaders: the number of fragments (instances) is the result of the Fragment Count mode

Once we capture a frame with the Fragment Count mode activated, we get statistics about how many fragments each shader has processed and how many cycles it has taken. This is useful for highlighting the heaviest shaders and focusing optimization on them. We can also sum the number of fragments (called instances in the tool) that each shader processes to get the total number of fragments drawn for this particular frame.

The total number of fragments for a frame takes overdraw into account. At this point, we can simply divide this number by the number of pixels on the screen to get the overdraw factor:

(Total number of fragment instances) / (Size of the framebuffer) = (7,537,773 + 1,459,254 + 415,710 + 197,329 + 279,555 + ...) / (2560 × 1600) ≈ 10m / 4,096,000 = 2.44 fragment instances/pixel

This figure is very close to the one calculated with Streamline, 2.48 fragments/pixel. That, however, was an average of multiple frames.
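For reference, the same division written out as a short sketch; only the largest instance counts visible in the screenshot are listed, so the remaining shaders (the "..." above) account for the gap up to the rounded 10m total:

#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Per-shader fragment instance counts reported by the Fragment Count mode.
    std::vector<double> instances = {7537773, 1459254, 415710, 197329, 279555};
    double total = std::accumulate(instances.begin(), instances.end(), 0.0);

    const double framebufferPixels = 2560.0 * 1600.0;   // 4,096,000 pixels
    std::printf("overdraw factor >= %.2f fragments/pixel\n",
                total / framebufferPixels);              // ~2.41 from the listed shaders alone
    return 0;
}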

Understanding the Overdraw

An overdraw factor of around 2.5x is quite high. For a scene like the one we are analyzing, which does not have many semi-transparent objects, it is definitely too much. A lot of wasted fragment activity could be avoided simply by sorting the draw calls for the opaque objects by depth, from front to back, and drawing the semi-transparent ones last. After a fully opaque fragment has been drawn, any later fragment at the same position that lies behind it is discarded by the early Z-test before the fragment shader runs. That is why it is so important to sort draw calls in a sensible way, as sketched in the example below.
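What such sorting can look like in an application's render loop is shown in this sketch; the DrawCall structure and viewDepth field are hypothetical, the point is simply to submit opaque geometry front to back so the early Z-test can reject hidden fragments, and blended geometry last, back to front:

#include <algorithm>
#include <vector>

// Hypothetical per-draw-call record; viewDepth is the distance from the camera.
struct DrawCall {
    float viewDepth;
    bool  transparent;
    // ... shader, buffers, uniforms, etc.
};

void sortForSubmission(std::vector<DrawCall>& calls) {
    // Opaque draws first, then blended draws.
    auto firstTransparent = std::partition(
        calls.begin(), calls.end(),
        [](const DrawCall& c) { return !c.transparent; });

    // Opaque: front to back, so fragments behind already-drawn geometry fail the Z-test.
    std::sort(calls.begin(), firstTransparent,
              [](const DrawCall& a, const DrawCall& b) { return a.viewDepth < b.viewDepth; });

    // Semi-transparent: back to front, so blending composites correctly.
    std::sort(firstTransparent, calls.end(),
              [](const DrawCall& a, const DrawCall& b) { return a.viewDepth > b.viewDepth; });
}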

With the Mali Graphics Debugger we can use the Frame Capture feature to understand how each draw call affects the image on screen. In this mode, a full-resolution snapshot of the framebuffer is taken after every draw call. Capturing can take up to a few minutes, but the resulting image sequence is very valuable. Take a look at the following video to see how the investigated frame is drawn:

7180.42682cc1e8b90_fb_0.mp4

The captured frame, showing what every draw call draws.

We have built a dedicated feature into the tool to understand overdraw. It's called Overdraw Map, and it replaces all the fragment shaders used to draw a frame with our overdraw shaders. The resulting image drawn to the screen is a grayscale map of overdraw: the darkest shade means a pixel has been drawn only once, while full white means a pixel has been overdrawn ten times. At that point the brightness saturates, but 10x overdraw is definitely too much. With this mode on, it is normal not to see all the objects in the image; in fact, if all the objects were perfectly sorted from front to back, we would only see a uniform dark area on screen.
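The tool performs this shader replacement automatically, but a comparable visualization can be hand-rolled in an OpenGL ES 2.0 application. A rough sketch, assuming every draw call is redirected to a flat gray shader with additive blending and depth writes disabled so overlapping fragments are not rejected (the exact shaders the tool uses are not documented here):

#include <GLES2/gl2.h>

// Fragment shader used in place of the application's shaders: each fragment
// adds 0.1, so a pixel covered once stays dark gray and saturates to white
// at roughly 10x overdraw.
static const char* kOverdrawFragmentShader =
    "precision mediump float;\n"
    "void main() {\n"
    "    gl_FragColor = vec4(0.1, 0.1, 0.1, 1.0);\n"
    "}\n";

// Call once before replaying the frame's draw calls with the shader above.
void enableOverdrawVisualization() {
    glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);   // additive: brightness accumulates with each layer
    glDepthMask(GL_FALSE);         // keep depth writes off so nothing occludes later fragments
}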

epic-overdraw-map.jpg

Overdraw map for the selected frame: large areas show objects drawn 3-5 times on top of each other.

In the next part we will analyze the vertex activity for the same frame and we will focus on shader optimization, using the Mali Graphics Debugger and the Mali Offline Shader Compiler. Part 3 is now available.


Lorenzo Dal Col is the Product Manager of Mali GPU Tools. He first used ARM technology when, in 2007, he created a voice-controlled robot at university. He has experience in machine learning, image processing and computer vision. He moved into a new dimension when he joined ARM in 2011 to work on 3D graphics, developing performance analysis and debug tools for software running on ARM Mali GPUs.

  • Sean Lumly over 10 years ago

    I recall the LPDDR2 memory bandwidth power approximation from one of your posts, and it is very helpful (albeit an approximation, as you allude, likely a step or two from reality). Still, I am very thankful for these approximations as they help constrain the types of things that look possible when considering raw specs and thinking about algorithm implementation.

  • Peter Harris over 10 years ago

    70% of maximum bandwidth is generally a good rule of thumb for any memory technology - performance goes awry fairly quickly above that, as memory latency starts to climb steeply due to congestion (which also limits the amount of clever stuff the DDR controller can do to access DDR pages optimally).

    In terms of thermally usable memory bandwidth, that is an easier one to answer as the power cost is generally dominated by driving the external PHY which links the AP SoC to the DDR, which tends to be more predictable. Our rule of thumb is between 100pJ (LPDDR4) and 150pJ (LPDDR2) per byte of memory access - which is a horrible approximation which tries to account for all access power (AXI, memory controller, PHY, and DDR) - but is generally a good sanity check number. You would therefore expect 1GB/second of access bandwidth to cost between 100-150mW, although as always YMMV.

  • Sean Lumly over 10 years ago

    I figured as much. I was hoping for something incredibly vague (and perhaps overly cautious) like "generally aim for 50% of external bandwidth on smartphones", or "keep GPU cycle utilization around 60% of the theoretical maximum on smartphones". The advice in previous blog posts to only count on having access to ~70% of maximum external bandwidth, or ~80% of maximum cycles, was very helpful!

    It will probably be better to test on a device-by-device basis, or to collect data from shipping code.

  • Peter Harris over 10 years ago

    Really an impossible question to answer from the IP standpoint - so much depends on the silicon implementation (process node, top frequency target used in synthesis, transistor choice, number of DVFS operating points), device form factor (tablets are physically bigger than phones so can dissipate more heat energy per second, and therefore have higher thermal limits), etc.

  • Sean Lumly over 10 years ago

    Here are some interesting questions:

    1) What is a general rule-of-thumb for sustainable rendering before running into thermal limits?

    Obviously sustaining the external bandwidth (e.g. huge, uncompressed, non-mipmapped textures) at the theoretical limit (e.g. 25.6 GB/s) is not realistic, so what would be a more realistic target?

    2) Similarly for computation: what percentage of load on the GPU is likely to be sustainable?
