Performance analysis and optimization have always been key topics when it comes to mobile applications, in particular for game developers. The task of bringing console-class graphics to mobile platforms is challenging: users expect very high quality content on any kind of device, from smartphones and tablets to laptops and TVs. The latest mobile devices we are seeing in the market often have a screen resolution that is beyond HD, from 2560×1600 (4 Megapixel) on the Nexus 10 to the latest TVs and set-top boxes with 4K resolution (8 Megapixel). Animations, games and UIs must run smoothly on a low-power ARM chip, within a limited energy and thermal budget. While hardware and driver engineers are working to improve the efficiency and the performance of the system, a lot can be done by application developers to optimize games and UIs, avoid bottlenecks and write code that takes advantage of the features of the target device.
Fortunately, we are building tools that developers can use to profile and debug their games. The aim of this series is to show how these tools can be used to analyze a graphics demo that is well known in the industry: Epic Citadel, from Epic Games.
The tools we are presenting are available free of charge on our Mali Developer Center, and can be used on any Android™ or Linux-based device with an ARM® Mali™ GPU. In this series we are going to use:
The first thing we usually do when approaching the analysis of an application is to profile it with DS-5 Streamline. This helps us understand what the overall activity looks like so we can focus on one component at a time. In this case we are profiling the introduction animation of Epic Citadel running on a Google Nexus 10 with an instrumented version of Android 4.4.2. The device has been loaded with a custom kernel and gator, which is the kernel module that communicates with Streamline (see Using DS-5 Streamline with Mali on Google Nexus 10).
One of the frames from Epic Citadel that we decided to analyze
We decide to focus on a particular scene, the one which seems to be one of the most complex to render in the animation. It takes 29ms to render a frame (36 fps), which means that the animation is not running at the best speed the display is capable of. To explore this, we are going to analyze the timeline counters to understand what is limiting the performance for this particular frame and what we could do to optimize it.
Initially we look at the activity timelines to get a grasp of the overall activity of the CPU and GPU, so we can narrow down the analysis that follows.
In the timeline view, we have selected a range of one second, during which around 36 frames of the same scene are rendered. The frames, of course, are not identical, but they are similar enough; this way all the figures we read refer to a period of one second, which is convenient.
ARM DS-5 Streamline showing the timeline activity and Mali GPU hardware counters for Epic Citadel
We notice that the CPU activity (CPU Activity ➞ User) averages 24% over one second, which is a reasonable figure for a complex demo like this. The Google Nexus 10 contains a chip with two ARM Cortex®-A15 cores, which means that this demo should not be CPU limited - and it isn't. Even if the application were only single-threaded, CPU activity would have to be over 50% to be considered limited.
On the other hand, there is a clear burst in GPU activity, especially related to fragment processing (GPU Fragment ➞ Activity), which is active almost 100% of the time. Vertex processing is also significant, reaching 42% on average, but we will focus on the fragment activity, which is indeed limiting the speed in this case.
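The 50% figure follows from simple arithmetic, which can be sketched as below. This is only an illustration: the function names are hypothetical, and it assumes that the CPU activity figure is an average across all cores, so that one fully busy thread on a dual-core chip appears as roughly 50%.

```python
def single_thread_ceiling_pct(num_cores: int) -> float:
    """Aggregate CPU activity (in percent) produced by one fully busy thread,
    assuming the activity figure is averaged across all cores."""
    return 100.0 / num_cores


def may_be_cpu_limited(avg_activity_pct: float, num_cores: int) -> bool:
    """True if the average activity reaches the level at which a single
    thread could be saturating one core."""
    return avg_activity_pct >= single_thread_ceiling_pct(num_cores)


# Figures quoted in the text: 24% average activity on two Cortex-A15 cores.
print(single_thread_ceiling_pct(2))   # 50.0
print(may_be_cpu_limited(24.0, 2))    # False
```

A result of False only says that the CPU is unlikely to be the limiting factor under this assumption; a per-core view would still be needed to rule out a single saturated thread with certainty.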
At this point we can begin diving deeper into the ARM Mali GPU hardware counters, which are fully available in Streamline. When configuring Streamline we selected the subset of the available counters that we find particularly useful for this kind of analysis.
Over the highlighted time of one second the GPU was active for 448m cycles (Mali Job Manager Cycles ➞ GPU cycles). With this hardware, the maximum number of cycles is 450m.
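As a quick check of how close the GPU is to saturation, the two figures above can be turned into a utilization ratio. This is a minimal sketch; the function name is hypothetical and the inputs are simply the cycle counts quoted in the text.

```python
def gpu_utilization(active_cycles: float, max_cycles: float) -> float:
    """Fraction of the available GPU cycles during which the GPU was active."""
    return active_cycles / max_cycles


# 448m active cycles out of a maximum of 450m over one second.
utilization = gpu_utilization(448e6, 450e6)
print(f"{utilization:.1%}")  # 99.6%
```

A value this close to 100% is consistent with the GPU, rather than the CPU, being the limiting component for this scene.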
We can use the Mali Job Slots counters to understand how many cycles are spent doing vertex and fragment processing, remembering that the two kinds of activity may be happening simultaneously:
A first pass of optimization would lead to a higher frame rate. After reaching V-SYNC, optimization can lead to saving energy and to a longer play time.
The graphics processor that this device houses is the ARM Mali-T604 GPU, the first implementation of ARM’s Midgard architecture. This GPU has four shader cores. Since Midgard is a unified shader core architecture, each shader core is capable of executing vertex, fragment and compute activity. An important characteristic of this architecture is that different kinds of instructions can be executed at the same time in the same core, thanks to the tripipe design. Additional information about the Midgard architecture is available at The Mali GPU: An Abstract Machine, Part 3 - The Shader Core.
Activity timeline, showing CPU, GPU fragment and vertex activity and job cycles
Each shader core in a Mali Midgard GPU has three different kinds of pipeline:
Pipelines in each shader core of the ARM Mali-T604 and Mali-T628 GPUs
At this point we can inspect the tripipe counters to understand whether the bottleneck is arithmetic, load/store or textures. It is important to highlight that the three pipelines cannot be seen as completely independent. The busy-ness of each pipeline may actually depend on different kinds of activity. For example, by reducing the arithmetic instructions we may drastically reduce the load/store activity, if the shader was spilling over the available registers.
ARM Mali GPU pipeline counters (Load/Store, Texture and Arithmetic cycles)
In the case we are inspecting, a breakdown of the tripipe activity shows that over 444m GPU cycles:
We notice that in this case the GPU is able to run a significant number of instructions in parallel: a total of 710m tripipe cycles over 448m effective cycles.
Unfortunately the L/S pipeline is busy 91% of the time, which means that there are a lot of memory accesses, and this may lead to high bandwidth utilization. This fact is important to bear in mind, because it will determine what we focus on in the following stages of the analysis.
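The degree of overlap between the pipelines can be expressed as a simple ratio of total tripipe cycles to GPU cycles. The sketch below uses the two totals quoted above; the function name is hypothetical, and the ratio is only a rough indicator, not an official Streamline metric.

```python
def tripipe_overlap(total_tripipe_cycles: float, gpu_cycles: float) -> float:
    """Average number of pipelines busy per GPU cycle.

    A value of 1.0 would mean the pipelines never overlap; higher values
    mean that more work is being done in parallel.
    """
    return total_tripipe_cycles / gpu_cycles


# 710m tripipe cycles over 448m effective GPU cycles.
overlap = tripipe_overlap(710e6, 448e6)
print(f"{overlap:.2f}")  # 1.58
```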
Another interesting piece of information we can get from the tripipe counters is the cycles per instruction (CPI) metric. Not all instructions can be executed in a single cycle; sometimes the GPU has to wait additional cycles to complete an instruction. These can be considered stalls, and we should aim to reduce their number.
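The CPI metric itself is just the ratio between the cycles a pipeline was busy and the instructions it issued. The sketch below illustrates the calculation; the function names are hypothetical, and the counter values are made-up numbers for illustration only, not measurements from this capture. It also assumes that an ideal instruction completes in one cycle, so any cycles beyond the instruction count are treated as waiting time.

```python
def cycles_per_instruction(pipe_cycles: float, instructions: float) -> float:
    """Average number of cycles a pipeline spends per issued instruction."""
    return pipe_cycles / instructions


def waiting_cycles(pipe_cycles: float, instructions: float) -> float:
    """Cycles beyond the ideal of one cycle per instruction (an assumption)."""
    return pipe_cycles - instructions


# Illustrative values only: 300m busy cycles for 150m issued instructions.
example_cycles = 300e6
example_instructions = 150e6
print(cycles_per_instruction(example_cycles, example_instructions))  # 2.0
print(waiting_cycles(example_cycles, example_instructions))          # 150000000.0
```

A CPI close to 1.0 means the pipeline rarely waits, while larger values point to stalls worth investigating.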
For the texture pipeline:
This number is very good because it means that almost all the texture instructions can be completed in a single cycle. This is probably due to the fact that the application is using textures in the right way: all the big and medium size textures are compressed in ETC1 format and mipmapping is used extensively. ETC1 was the only standard format available at the time this demo was made; now ASTC would reduce the texture bandwidth even more, while maintaining the same visual quality.
For the load/store pipeline we have:
This metric may be showing a problem. For gaming content we consider any number below 1.8 cycles/instruction to be acceptable; in this case the GPU is stalling too often while waiting for memory. This is probably because of:
Afterwards we will use the Mali Graphics Debugger and the Mali Offline Shader Compiler to understand whether there is room for improvement when it comes to shader optimization.
Cycles per instruction metrics: showing the number of cycles spent executing instructions compared to the number of cycles the GPU was waiting
When creating embedded graphics applications, bandwidth is a scarce resource. Devices like the one we are using can handle 4 to 8 gigabytes per second of data, but transferring that amount for a long period of time would easily consume all the energy we have available (see How low can you go? Building low-power, low-bandwidth ARM Mali GPUs).
Here we are going to check two things:
To calculate the memory bandwidth used by the GPU we can use two counters in Streamline, but we also need to know the size of the bus. In our case it's 128 bits, or 16 bytes, so:
This shows that for the selected range of time the application is not limited by bandwidth on this device. However, almost three gigabytes per second is a lot of data to be transferred and it could be a problem on a device with less available bandwidth. Besides, since bandwidth usage is directly related to energy consumption, it’s always worth optimizing it.
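The bandwidth calculation described above can be sketched as follows. The 16-byte bus width comes from the text; the function and parameter names are hypothetical, and treating the two counters as external read and write beats is an assumption. The counter values are illustrative numbers chosen to land near three gigabytes per second, not the figures measured in this capture.

```python
BUS_WIDTH_BYTES = 128 // 8  # 128-bit bus, i.e. 16 bytes per beat


def gpu_bandwidth_bytes_per_sec(read_beats: float,
                                write_beats: float,
                                seconds: float = 1.0) -> float:
    """Total external memory traffic, assuming each beat moves one bus width."""
    return (read_beats + write_beats) * BUS_WIDTH_BYTES / seconds


# Illustrative values only: 100m read beats and 80m write beats in one second.
bandwidth = gpu_bandwidth_bytes_per_sec(100e6, 80e6)
print(f"{bandwidth / 1e9:.2f} GB/s")  # 2.88 GB/s
```

Comparing the result against the 4 to 8 gigabytes per second quoted earlier gives a rough idea of how much headroom is left on a given device.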
Mali L2 Cache counters can be used to calculate total GPU bandwidth utilization
Lorenzo Dal Col is the Product Manager of Mali GPU Tools. He first used ARM technology when, in 2007, he created a voice-controlled robot at university. He has experience in machine learning, image processing and computer vision. He moved into a new dimension when he joined ARM in 2011 to work on 3D graphics, developing performance analysis and debug tools for software running on ARM Mali GPUs.
Okay that makes sense about the DDR transaction. And Thanks for confirming my first doubt. Cheers!
still there is 128 KB L2 Cache and 128 bit per clock read and write, just because I am using only 4-cores of the GPU?
Yes
The cache line size, as per your other blog is 64 Bytes, so in order to fill in one Cache line, that would mean 4 trips to the memory... Did I get it right?
Not 4 trips - it will be a single AXI transaction to DDR, returning data on 4 consecutive clock cycles.
Hey Thanks peterharris .. Just to confirm the GPU itself has 6 cores, arranged in a cluster of 4 and 2. So, are you saying that still there is 128 KB L2 Cache and 128 bit per clock read and write, just because I am using only 4-cores of the GPU?
The cache line size, as per your other blog is 64 Bytes, so in order to fill in one Cache line, that would mean 4 trips to the memory... Did I get it right? or am I missing something? Thanks!
For 4 cores you will have 128-KB L2 cache available with 128-bit per clock read and write to the external bus.
HTH, Pete
Hi Lorenzo,
I have been using the Odroid XU3 platform with Exynos 5422 which contains a Mali T628 GPU. I am using this platform to execute a few OpenCL applications and wanted to calculate the bandwidth requirement for each application. I have been using your formula for bandwidth comparison purposes, since I am not sure about the bus width of the GPU on this SoC. Since I am using only 4 cores of the GPU for OpenCL, could you please confirm the bus width size, 128 or 256 bits, that I should use for the BW calculation?