Performance analysis and optimization have always been key topics for mobile applications, particularly for game developers. The task of bringing console-class graphics to mobile platforms is challenging: users expect very high quality content on any kind of device, from smartphones and tablets to laptops and TVs. The latest mobile devices on the market often have a screen resolution beyond HD, from 2560×1600 (4 megapixels) on the Nexus 10 to the latest TVs and set-top boxes with 4K resolution (8 megapixels). Animations, games and UIs must run smoothly on a low-power ARM chip, within a limited energy budget and thermal envelope. While hardware and driver engineers are working to improve the efficiency and performance of the system, a lot can be done by application developers to optimize games and UIs, avoid bottlenecks and write code that takes advantage of the features of the target device.
Fortunately, we are building tools that developers can use to profile and debug their games. The scope of this series is to show how these tools can be used to analyze a well-known graphics demo: Epic Citadel, from Epic Games.
The tools we are presenting are available free of charge on our Mali Developer Center, and can be used on any Android™ or Linux-based device with an ARM® Mali™ GPU. In this series we are going to use:

- ARM DS-5 Streamline
- Mali Graphics Debugger
- Mali Offline Shader Compiler
The first thing we usually do when approaching the analysis of an application is to profile it with DS-5 Streamline. This gives us an idea of what the overall activity looks like, so we can focus on one component at a time. In this case we are running the introduction animation of Epic Citadel on a Google Nexus 10 with an instrumented version of Android 4.4.2. The device has been loaded with a custom kernel and gator, the kernel module that communicates with Streamline (see Using DS-5 Streamline with Mali on Google Nexus 10).
One of the frames from Epic Citadel that we decided to analyze
We decide to focus on a particular scene, which appears to be one of the most complex to render in the animation. It takes 29ms to render a frame (around 34 fps), which means that the animation is not running at the best speed the display is capable of. To explore this, we are going to analyze the timeline counters to understand what is limiting the performance for this particular frame and what we could do to optimize it.
Initially we look at the activity timelines to get a grasp of the overall CPU and GPU activity, so we can narrow down the subsequent analysis.
In the timeline view, we have selected a range of one second, during which around 36 frames of the same scene are rendered. The frames, of course, are not identical, but they are similar enough; in this way, all the figures we are going to read will refer to a period of one second, which is convenient.
ARM DS-5 Streamline showing the timeline activity and Mali GPU hardware counters for Epic Citadel
We notice that the CPU activity (CPU Activity ➞ User) averages 24% over one second, which is a reasonable figure for a complex demo like this. The Google Nexus 10 contains a chip with two ARM Cortex®-A15 cores, which means that this demo should not be CPU limited - and it isn't. Even if the application were only single-threaded, CPU activity would have to be over 50% for the demo to be considered CPU limited.
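The reasoning above can be sketched in a few lines (a toy calculation; the 24% figure comes from the capture, the rest is simple arithmetic):

```python
# A single thread saturating one core shows up as (100 / num_cores)% of
# total CPU activity, so on a dual-core chip the ceiling is 50%.
def single_thread_ceiling(num_cores):
    return 100.0 / num_cores

measured_activity = 24.0  # average of CPU Activity -> User over one second
cpu_limited = measured_activity >= single_thread_ceiling(2)
print(cpu_limited)  # -> False: not CPU limited, even if single-threaded
```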
On the other hand, there is a clear burst in GPU activity, especially related to fragment processing (GPU Fragment ➞ Activity), which is happening almost 100% of the time. Vertex processing is also significant, reaching 42% on average, but we will focus on the fragment activity, which is indeed limiting the speed in this case.
At this point we can begin diving deeper into understanding the ARM Mali GPU hardware counters, which are fully available in Streamline. When configuring Streamline we had selected the subset of the available counters that we find particularly useful for this kind of analysis.
Over the highlighted time of one second the GPU was active for 448m cycles (Mali Job Manager Cycles ➞ GPU cycles). With this hardware, the maximum number of cycles is 450m.
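As a quick sanity check, GPU utilization over the selected second follows directly from these two numbers (a sketch using the figures above):

```python
gpu_active_cycles = 448_000_000  # Mali Job Manager Cycles -> GPU cycles
gpu_max_cycles = 450_000_000     # cycles available in one second on this device

utilization = gpu_active_cycles / gpu_max_cycles
print(f"GPU utilization: {utilization:.1%}")  # -> GPU utilization: 99.6%
```

A utilization this close to 100% confirms that the GPU, not the CPU, is the limiting factor.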
We can use the Mali Job Slots counters to understand how many cycles are spent doing vertex and fragment processing, remembering that the two kinds of activity may happen simultaneously:
A first pass of optimization would lead to a higher frame rate. Once V-SYNC is reached, further optimization can save energy and lead to a longer play time.
The graphics processor that this device houses is the ARM Mali-T604 GPU, the first implementation of ARM’s Midgard architecture. This GPU has four shader cores. Since Midgard is a unified shader core architecture, each shader core is capable of executing vertex, fragment and compute activity. An important characteristic of this architecture is that different kinds of instruction can be executed at the same time in the same core, thanks to the tripipe design. Additional information about the Midgard architecture is available at The Mali GPU: An Abstract Machine, Part 3 - The Shader Core.
Activity timeline, showing CPU, GPU fragment and vertex activity and job cycles
Each shader core in a Mali Midgard GPU has three different kinds of pipeline:

- the arithmetic pipeline (A), which executes the shader arithmetic;
- the load/store pipeline (LS), which handles general memory accesses, including attributes, varyings and uniforms;
- the texture pipeline (T), which handles texture lookups and filtering.
Pipelines in each shader core of the ARM Mali-T604 and Mali-T628 GPUs
At this point we can inspect the tripipe counters to understand whether the bottleneck is arithmetic, load/store or textures. It is important to highlight that the three pipelines cannot be seen as completely independent: the load on each pipeline may depend on different kinds of activity. For example, by reducing the arithmetic instructions we may drastically reduce the load/store activity, if the shader was spilling over the available registers.
ARM Mali GPU pipeline counters (Load/Store, Texture and Arithmetic cycles)
In the case we are inspecting, a breakdown of the tripipe activity shows that over 448m GPU cycles:
We notice that in this case the GPU is able to execute a significant number of instructions in parallel: a total of 710m tripipe cycles over 448m effective cycles.
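Dividing one figure by the other gives the average number of pipelines kept busy per active cycle (a sketch with the numbers above):

```python
tripipe_cycles = 710_000_000  # arithmetic + load/store + texture cycles combined
gpu_cycles = 448_000_000      # cycles the GPU was actually active

# Average number of pipelines kept busy on each active GPU cycle
parallelism = tripipe_cycles / gpu_cycles
print(f"{parallelism:.2f} pipeline-cycles per GPU cycle")  # -> 1.58
```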
Unfortunately, the L/S pipeline is busy 91% of the time, which means there are a lot of memory accesses, and this may lead to high bandwidth utilization. This fact is important to bear in mind, because it will determine what we focus on in the following stages of the analysis.
Additional interesting information we can get from the tripipe counters is the cycles per instruction (CPI) metric. Not all instructions can be executed in a single cycle; sometimes the GPU has to spend additional cycles to complete an instruction. These extra cycles can be considered stalls, and we should aim to reduce their number.
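The metric itself is just pipeline-active cycles divided by instructions completed; a minimal sketch (the values below are illustrative, not taken from the capture):

```python
def cpi(active_cycles, instructions):
    """Cycles per instruction for one pipeline: active cycles divided by
    the number of instructions that pipeline completed."""
    return active_cycles / instructions

# A pipe completing nearly one instruction per cycle -> very few stalls
print(round(cpi(180_000_000, 175_000_000), 2))  # -> 1.03
# A pipe spending about two cycles per instruction -> stalling on memory
print(round(cpi(410_000_000, 200_000_000), 2))  # -> 2.05
```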
For the texture pipeline:
This number is very good, because it means that almost all the texture instructions can be completed in a single cycle. This is probably due to the fact that the application is using textures in the right way: all the big and medium size textures are compressed in ETC1 format and mipmapping is used extensively. ETC1 was the only standard format available at the time this demo was made; today, ASTC would reduce the texture bandwidth even further while maintaining the same visual quality.
For the load/store pipeline we have:
This metric may be showing a problem: for gaming content we consider any number below 1.8 cycles/instruction to be acceptable, and in this case the GPU is stalling too many times waiting for memory. This is probably because of:
Afterwards we will use the Mali Graphics Debugger and the Mali Offline Shader Compiler to understand whether there is room for improvement when it comes to shader optimization.
Cycles per instruction metrics: showing the number of cycles spent executing instructions compared to the number of cycles the GPU was waiting
When creating embedded graphics applications, bandwidth is a scarce resource. Devices like the one we are using can handle 4 to 8 gigabytes per second of data, but transferring that amount for a long period of time would easily consume all the energy we have available (see How low can you go? Building low-power, low-bandwidth ARM Mali GPUs).
Here we are going to check two things:
To calculate the memory bandwidth used by the GPU we can use two counters in Streamline, but we also need to know the size of the bus. In our case it's 128 bits, or 16 bytes, so:
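A sketch of that calculation (the beat counts here are illustrative; the real values come from the Mali L2 Cache external read/write beats counters over the selected second):

```python
BUS_WIDTH_BYTES = 128 // 8  # 128-bit bus -> 16 bytes transferred per beat

external_read_beats = 120_000_000   # illustrative, not the captured value
external_write_beats = 62_000_000   # illustrative, not the captured value

bandwidth_bytes_per_s = (external_read_beats + external_write_beats) * BUS_WIDTH_BYTES
print(f"{bandwidth_bytes_per_s / 1e9:.2f} GB/s")  # -> 2.91 GB/s
```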
This shows that for the selected range of time the application is not limited by bandwidth on this device. However, almost three gigabytes per second is a lot of data to be transferred and it could be a problem on a device with less available bandwidth. Besides, since bandwidth usage is directly related to energy consumption, it’s always worth optimizing it.
Mali L2 Cache counters can be used to calculate total GPU bandwidth utilization
Lorenzo Dal Col is the Product Manager of Mali GPU Tools. He first used ARM technology when, in 2007, he created a voice-controlled robot at university. He has experience in machine learning, image processing and computer vision. He moved into a new dimension when he joined ARM in 2011 to work on 3D graphics, developing performance analysis and debug tools for software running on ARM Mali GPUs.
That was extremely helpful, and while some of it is a bit esoteric, it gives me a useful high-level understanding of concepts that I can do some searches on with the goal of understanding more clearly!
AXI has independent read and write data channels, so the GPU can do both in parallel provided AXI can accept the data (i.e. no backpressure). At some point you have to multiplex reads and writes onto the same physical DDR pins, so whether you can actually sustain both depends on the relative bandwidths of the AXI and DDR controller (and the thermal capacity of the design - external DDR bandwidth is power hungry).
Any partitioning depends on the memory controller design; I would hope most are dynamic (it's pretty rare to have a use case where reads and writes are entirely equal, so a static 50:50 split would leave performance on the table).
It seems there is no real substitute for profiling!
Am I correct in understanding that modern SoCs that have dual-channel LPDDR4x can do reads *or* writes up to the advertised peak bandwidth? Or is half of the bandwidth reserved for reads and the other half for writes?
Unfortunately this method won't work; frame time tells you how many GPU cycles you needed, but doesn't really tell you anything about how hard the AXI bus is having to work. It's also worth remembering that in most real devices, where you're not maxing out the bandwidth capability, the AXI and DDR clocks are both probably changing dynamically to reduce the energy cost of each access, which can have a massive impact on effective latency.
This is a wonderful resource! Thank you!
Assuming the following:
is finding an estimate for memory bandwidth latency as simple as multiplying the frame time by the maximum of frame read-bandwidth divided by peak read-bandwidth, and frame write-bandwidth divided by peak write-bandwidth? Put another way, is the following true:
estimated_per_frame_memory_latency = frame_time_in_ms * max(frame_read_bandwidth / peak_read_bandwidth, frame_write_bandwidth / peak_write_bandwidth)
Or are reads/writes aggregated when considering total external bandwidth, and if so, would this be true:
estimated_per_frame_memory_latency = frame_time_in_ms * (frame_read_bandwidth + frame_write_bandwidth) / peak_read_bandwidth
I'm trying to get a general estimate for per-frame external memory latency, and purposefully neglecting the GPU scheduling of ALU operations (etc.) as a function of per-frame time.
Is there something that I am neglecting in this estimate?