Mali GPU Tools: A Case Study, Part 1 — Profiling Epic Citadel

July 17, 2014


Performance analysis and optimization have always been key topics for mobile applications, and for game developers in particular. Bringing console-class graphics to mobile platforms is challenging: users expect very high quality content on any kind of device, from smartphones and tablets to laptops and TVs. The latest mobile devices on the market often have a screen resolution beyond HD, from 2560×1600 (4 Megapixel) on the Nexus 10 to the latest TVs and set top boxes with 4K resolution (8 Megapixel). Animations, games and UIs must run smoothly on a low-power ARM chip, within a limited energy and thermal budget. While hardware and driver engineers are working to improve the efficiency and performance of the system, application developers can do a lot to optimize games and UIs: avoiding bottlenecks and writing code that takes advantage of the features of the target device.

Fortunately we are building tools that developers can use to profile and debug their games. The scope of this series is to show how these tools can be used to analyze a well known graphics demo in the industry: Epic Citadel, from Epic Games.

The tools we are presenting are available free of charge on our Mali Developer Center, and can be used on any Android™ or Linux-based device with an ARM® Mali™ GPU. In this series we are going to use:

ARM DS-5 Streamline to analyse the performance and profile the Mali GPU hardware counters;
Mali Graphics Debugger to trace the demo and understand how the OpenGL® ES API is used in this demo in order to find areas of improvement;
Mali Offline Shader Compiler to optimize the shaders' source code.

Profiling Epic Citadel via ARM DS-5 Development Studio

The first thing we usually do when approaching the analysis of an application is to profile it with DS-5 Streamline. This shows us what the overall activity looks like, so we can focus on one component at a time. In this case we are profiling the introduction animation of Epic Citadel running on a Google Nexus 10 with an instrumented version of Android 4.4.2. The device has been loaded with a custom kernel and gator, which is the kernel module that communicates with Streamline (see Using DS-5 Streamline with Mali on Google Nexus 10).

One of the frames from Epic Citadel that we decided to analyze

We decide to focus on a particular scene, the one which seems to be one of the most complex to render in the animation. It takes 29ms to render a frame (36 fps), which means that the animation is not running at the best speed the display is capable of. To explore this, we are going to analyze the timeline counters to understand what is limiting the performance for this particular frame and what we could do to optimize it.

Initially we look at the activity timelines, to get a grasp of the overall activity of CPU and GPU, so we can narrow down the following analysis.

In the timeline view, we have selected a range of one second, during which around 36 frames of the same scene are rendered. The frames, of course, are not identical, but they are similar enough. This way, all the figures we are going to read refer to a period of one second, which is convenient.
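The relation between the frame count in the selected window and the per-frame budget can be sketched in a few lines of Python. The 60 Hz display refresh rate is an assumption of this sketch, not a figure taken from the capture, and the function names are purely illustrative.

```python
def frame_time_ms(frames: int, window_s: float = 1.0) -> float:
    """Average time per frame, in milliseconds, over a profiling window."""
    return window_s * 1000.0 / frames


def vsync_budget_ms(refresh_hz: float = 60.0) -> float:
    """Time available per frame if every display refresh shows a new frame."""
    return 1000.0 / refresh_hz


avg = frame_time_ms(36)         # around 36 frames in the selected second
budget = vsync_budget_ms(60.0)  # assumed 60 Hz panel (not stated in the text)

print(f"average frame time: {avg:.1f} ms")
print(f"V-SYNC budget:      {budget:.1f} ms")
print(f"over budget by:     {avg - budget:.1f} ms")
```

Averaging over the whole window gives a frame time close to, but not exactly, the one measured for the single frame above, since the frames in the window are similar rather than identical.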

ARM DS-5 Streamline showing the timeline activity and Mali GPU hardware counters for Epic Citadel

CPU Activity

We notice that the CPU activity (CPU Activity ➞ User) averages 24% over one second, which is a reasonable figure for a complex demo like this. The Google Nexus 10 contains a chip with two ARM Cortex®-A15 cores, which means that this demo should not be CPU limited - and it isn't. Even if the application were entirely single-threaded, CPU activity would have to be over 50% (one of the two cores fully busy) for it to be considered CPU limited.
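As a rough illustration of that reasoning, the sketch below computes the aggregate load at which a single fully busy thread would saturate one core. It assumes the CPU Activity figure is an average across all cores; the function names are hypothetical and not part of Streamline.

```python
def single_thread_ceiling(num_cores: int) -> float:
    """Aggregate CPU load (%) reached when one thread keeps one core fully busy.

    Assumes the reported load is averaged across all cores.
    """
    return 100.0 / num_cores


def may_be_cpu_limited(avg_load_pct: float, num_cores: int) -> bool:
    """True if the average load is high enough for one thread to be saturated."""
    return avg_load_pct >= single_thread_ceiling(num_cores)


load = 24.0  # CPU Activity -> User, averaged over the selected second
cores = 2    # two Cortex-A15 cores on the Nexus 10

print(f"single-thread ceiling: {single_thread_ceiling(cores):.0f}%")
print(f"possibly CPU limited:  {may_be_cpu_limited(load, cores)}")
```

This is only a first filter: a load below the ceiling rules out a saturated core, while a load above it would call for a per-core and per-thread look in the timeline.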

On the other hand, there is a clear burst in GPU activity, especially fragment processing (GPU Fragment ➞ Activity), which is busy almost 100% of the time. Vertex processing is also significant, reaching 42% on average, but we will focus on the fragment activity, which is indeed limiting the speed in this case.

GPU Activity

At this point we can begin diving deeper into the ARM Mali GPU hardware counters, which are fully available in Streamline. When configuring Streamline we selected the subset of the available counters that we find particularly useful for this kind of analysis.

Over the highlighted time of one second the GPU was active for 448m cycles (Mali Job Manager Cycles ➞ GPU cycles). With this hardware, the maximum number of cycles is 450m.

We can use the Mali Job Slots counters to understand how many cycles are spent doing vertex and fragment processing, remembering that the two kinds of activity may be happening simultaneously:

186m on vertex processing (Mali Job Manager Cycles ➞ JS1 cycles)
448m on fragment processing (Mali Job Manager Cycles ➞ JS0 cycles)
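The counters above can be turned into utilization percentages with a short calculation. This is a minimal sketch using the figures quoted in the text; the dictionary keys are just labels, and `utilization` is an illustrative helper, not a Streamline API.

```python
MAX_CYCLES = 450e6  # maximum GPU cycles in one second on this hardware

# Counter values read from Streamline over the selected one-second range.
counters = {
    "GPU cycles": 448e6,
    "JS0 cycles (fragment)": 448e6,
    "JS1 cycles (vertex)": 186e6,
}


def utilization(cycles: float, max_cycles: float = MAX_CYCLES) -> float:
    """Share of the available cycles, in percent, during which a unit was busy."""
    return 100.0 * cycles / max_cycles


for name, cycles in counters.items():
    print(f"{name:<24} {utilization(cycles):5.1f}%")
```

Because vertex and fragment jobs can overlap, the two job slot percentages are not expected to add up to 100%.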

A first pass of optimization would lead to a higher frame rate. After reaching V-SYNC, optimization can lead to saving energy and to a longer play time.

The graphics processor in this device is the ARM Mali-T604 GPU, the first implementation of ARM’s Midgard architecture. This GPU has four shader cores. Since Midgard is a unified shader core architecture, each shader core is capable of executing vertex, fragment and compute activity. An important characteristic of this architecture is that different kinds of instructions can be executed at the same time in the same core, thanks to the tripipe design. Additional information about the Midgard architecture is available at The Mali GPU: An Abstract Machine, Part 3 - The Shader Core.

Activity timeline, showing CPU, GPU fragment and vertex activity and job cycles

Inspect the tripipe counters

Each shader core in a Mali Midgard GPU has three different kinds of pipeline:

Load/Store Pipeline responsible for all the memory accesses that are not related to textures;
Texture Pipeline which loads and filters texture data;
Arithmetic Pipeline (there are two arithmetic pipelines in the ARM Mali-T604 GPU) responsible for the math.

Pipelines in each shader core of the ARM Mali-T604 and Mali-T628 GPUs

At this point we can inspect the tripipe counters to understand whether the bottleneck is arithmetic, load/store or textures. It is important to highlight that the three pipelines cannot be seen as completely independent. The busy-ness of each pipeline may actually depend on different kinds of activity. For example, by reducing the arithmetic instructions we may drastically reduce the load/store activity, if the shader was spilling over the available registers.

ARM Mali GPU pipeline counters (Load/Store, Texture and Arithmetic cycles)

In the case we are inspecting, a breakdown of the tripipe activity shows that over 448m GPU cycles:

408m cycles were spent in the load/store pipeline (Mali Load/Store Pipe ➞ LS instruction issues),
105m cycles in the arithmetic pipeline (Mali Arithmetic Pipe ➞ A instructions),
197m cycles in the texture pipeline (Mali Texture Pipe ➞ T instruction issues).

We notice that in this case the GPU is able to run a significant number of instructions in parallel: a total of 710m tripipe cycles over 448m effective cycles.

Unfortunately the L/S pipeline is busy 91% of the time, which means that there are a lot of memory accesses, and this may lead to high bandwidth utilization. This is important to bear in mind, because it determines what we will focus on in the following stages of the analysis.
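The tripipe figures above can be reproduced with a small calculation. This is only a sketch based on the counter values quoted in the text; the variable names are illustrative.

```python
GPU_CYCLES = 448e6  # active GPU cycles over the selected second

# Issue cycles per pipeline, as read from the Streamline counters.
tripipe = {
    "load/store": 408e6,
    "arithmetic": 105e6,
    "texture": 197e6,
}

total_pipe_cycles = sum(tripipe.values())
overlap = total_pipe_cycles / GPU_CYCLES  # above 1.0 means the pipes run in parallel

print(f"total tripipe cycles: {total_pipe_cycles / 1e6:.0f}m")
print(f"overlap factor:       {overlap:.2f}")
for name, cycles in tripipe.items():
    print(f"{name:<12} busy {100.0 * cycles / GPU_CYCLES:4.0f}% of GPU cycles")
```

The pipeline with the highest busy percentage is the first candidate for the bottleneck, keeping in mind that the three pipelines are not fully independent.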

Cycles per instruction metrics

Another interesting piece of information we can get from the tripipe counters is the cycles per instruction (CPI) metric. Not all instructions can be executed in a single cycle; sometimes the GPU has to wait several cycles to complete an instruction. These extra cycles can be considered stalls, and we should aim to reduce their number.

For the texture pipeline:

(Mali Texture Pipe ➞ T instruction issues) / (Mali Texture Pipe ➞ T instructions) = 197m / 169m = 1.16 cycles/instruction

This number is very good because it means that almost all the texture instructions can be completed in a single cycle. This is probably because the application is using textures in the right way: all the big and medium size textures are compressed in ETC1 format and mipmapping is used extensively. ETC1 was the only standard format available at the time this demo was made; today ASTC would reduce the texture bandwidth even more while maintaining the same visual quality.

For the load/store pipeline we have:

(Mali Load/Store Pipe ➞ LS instruction issues) / (Mali Load/Store Pipe ➞ LS instructions) = 408m / 195m = 2.09 cycles/instruction

This metric may be showing a problem. For gaming content we consider any number below 1.8 cycles/instruction to be acceptable; in this case the GPU is stalling too often waiting for memory. This is probably because of:

Too many attributes, uniforms and varyings
Bad cache utilization
Register spilling in the GPU cores.
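The two CPI figures can be computed from the counters as follows. This is a hedged sketch: the 1.8 threshold is the one quoted above for the load/store pipe, and applying it to the texture pipe as well is a simplification made here for illustration. Because the inputs are rounded counter readings, the last digit may differ slightly from the values quoted in the text.

```python
TARGET_CPI = 1.8  # threshold quoted in the text for gaming content


def cycles_per_instruction(issue_cycles: float, instructions: float) -> float:
    """Average number of issue cycles needed to complete one instruction."""
    return issue_cycles / instructions


# (instruction issues, instructions) per pipeline, from the Streamline counters.
pipes = {
    "texture": (197e6, 169e6),
    "load/store": (408e6, 195e6),
}

for name, (issues, instrs) in pipes.items():
    cpi = cycles_per_instruction(issues, instrs)
    verdict = "acceptable" if cpi < TARGET_CPI else "worth investigating"
    print(f"{name:>10}: {cpi:.2f} cycles/instruction ({verdict})")
```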

Afterwards we will use the Mali Graphics Debugger and the Mali Offline Shader Compiler to understand whether there is room for improvement when it comes to shader optimization.

Cycles per instruction metrics: showing the number of cycles spent executing instructions compared to the number of cycles the GPU was waiting

Bandwidth utilization

When creating embedded graphics applications bandwidth is a scarce resource. Devices like the one we are using can handle 4 to 8 gigabytes per second of data, but transferring that amount for a long period of time would easily consume all the energy we have available (see How low can you go? Building low-power, low-bandwidth ARM Mali GPUs).

Here we are going to check two things:

Whether the performance in terms of frames per second is limited by bandwidth; in this case the CPU and GPU would be wasting cycles waiting for data to be transferred from and to memory.
Even if the application is not bandwidth limited, we want to check if there are more efficient ways to achieve the same with fewer data transfers.

To calculate the memory bandwidth used by the GPU we can use two counters in Streamline, but we also need to know the size of the bus. In our case it's 128 bits, or 16 bytes, so:

GPU bandwidth = (Mali L2 Cache ➞ External read beats + Mali L2 Cache ➞ External write beats) × Bus Size
= (96.0m + 90.7m) × 16 bytes ≈ 2.9 GB/s

This shows that for the selected range of time the application is not limited by bandwidth on this device. However, almost three gigabytes per second is a lot of data to be transferred and it could be a problem on a device with less available bandwidth. Besides, since bandwidth usage is directly related to energy consumption, it’s always worth optimizing it.
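The same calculation can be written as a short Python sketch. The counter values and bus width are the ones quoted above; the 4 to 8 GB/s range is the device capability mentioned earlier, and `gpu_bandwidth_bytes_per_s` is an illustrative name. With these rounded inputs the product comes out at just under three gigabytes per second.

```python
BUS_WIDTH_BYTES = 16  # 128-bit bus: each beat transfers 16 bytes


def gpu_bandwidth_bytes_per_s(read_beats: float, write_beats: float,
                              window_s: float = 1.0,
                              bus_bytes: int = BUS_WIDTH_BYTES) -> float:
    """External memory traffic generated by the GPU, in bytes per second."""
    return (read_beats + write_beats) * bus_bytes / window_s


bandwidth = gpu_bandwidth_bytes_per_s(96.0e6, 90.7e6)
print(f"GPU bandwidth: {bandwidth / 1e9:.2f} GB/s")

# Compare against the range the text gives for devices of this class.
for capacity_gb in (4.0, 8.0):
    share = 100.0 * bandwidth / (capacity_gb * 1e9)
    print(f"share of {capacity_gb:.0f} GB/s: {share:.0f}%")
```

Splitting the read and write beats before summing them also shows how the traffic is balanced between the two directions, which can hint at where the traffic originates.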

Mali L2 Cache counters can be used to calculate total GPU bandwidth utilization

In the following part of this series we will use the Mali Graphics Debugger to understand why this demo is fragment bound while rendering the selected scene and what could be done to improve its performance. Part 2 and Part 3 are now available.

Lorenzo Dal Col is the Product Manager of Mali GPU Tools. He first used ARM technology when, in 2007, he created a voice-controlled robot at university. He has experience in machine learning, image processing and computer vision. He moved into a new dimension when he joined ARM in 2011 to work on 3D graphics, developing performance analysis and debug tools for software running on ARM Mali GPUs.

Sean Lumly over 8 years ago

That was extremely helpful, and while some of it is a bit esoteric, it gives me a useful high-level understanding of concepts that I can do some searches on with the goal of understanding more clearly!
Thanks again,
Sean
Peter Harris over 8 years ago

AXI has independent read and write data channels, so the GPU can do both in parallel provided AXI can accept the data (i.e. no backpressure). At some point you have to multiplex reads and writes onto a single physical DDR pin, so whether you can actually sustainably do both depends on the relative bandwidths of the AXI and DDR controller (and thermal capacity of the design - external DDR bandwidth is power hungry)

Any partitioning depends on the memory controller design; I would hope most are dynamic (it's pretty rare to have a use case where reads and writes are entirely equal, so a static 50:50 split would leave performance on the table).
HTH,
Pete
Sean Lumly over 8 years ago

Thanks Pete,
It seems there is no real substitute for profiling!
Am I correct in understanding that modern SoCs that have dual-channel LPDDR4x can do reads *or* writes up to the advertised peak bandwidth? Or is half of the bandwidth reserved for reads and the other half for writes?
Cheers,
Sean
Peter Harris over 8 years ago

Hi Sean,
Unfortunately this method won't work; frame time tells you how many GPU cycles you needed, but doesn't really tell you anything about how hard the AXI bus is having to work. It's also worth remembering that in most real devices, where you're not maxing out the bandwidth capability, the AXI and DDR clocks are both probably changing dynamically to reduce the energy cost of each access, which can have a massive impact on effective latency.
Cheers,
Pete
Sean Lumly over 8 years ago
This is a wonderful resource! Thank you!
Assuming the following:
we have a stable frame rate,
external bandwidth reads/writes are evenly distributed per frame,
we are within limits to avoid memory bandwidth stalls due to over saturation of the bus,
is finding an estimate for memory bandwidth latency as simple as multiplying the frame time by the maximum of frame read-bandwidth divided by peak read-bandwidth, and frame write-bandwidth divided by peak write-bandwidth? Put another way, is the following true:
estimated_per_frame_memory_latency = frame_time_in_ms * max (frame_read_bandwidth/peak_read_bandwidth, frame_write_bandwidth/peak_write_bandwidth)
Or are reads/writes aggregated when considering total external bandwidth, and if so, would this be true:
estimated_per_frame_memory_latency = frame_time_in_ms * (frame_read_bandwidth + frame_write_bandwidth) / peak_read_bandwidth
I'm trying to get a general estimate for per-frame external memory latency, and purposefully neglecting the GPU scheduling of ALU operations (et al.) as a function of per-frame time.
Is there something that I am neglecting in this estimate?
Sean
