In the first part of this series we used ARM® DS-5 Streamline to profile the Epic Citadel demo in order to understand the workload on CPU and GPU, and the bandwidth usage. In this article we are going to analyze the same frame with the ARM Mali™ Graphics Debugger. By profiling the demo with DS-5 Streamline we learned that the GPU workload was dominated by fragment processing and that memory bandwidth usage was high.
To investigate more deeply, we need to know what the application is doing at the API level. While the hardware counters are useful to see the big picture and find the bottleneck, the Mali Graphics Debugger will show us the reason why a certain issue is happening, and what we could do about it.
The Mali Graphics Debugger is an application that traces all the OpenGL® ES activity at the API level. It helps content developers debug graphics problems and understand issues and their causes at frame level. The tool allows the user to observe OpenGL ES and OpenCL™ API call arguments and return values, and to interact with a running target application in order to investigate the effect of individual calls on the target. Attempted misuse of the API is highlighted, as are recommendations for improvement on a Mali-based system. Trace information can also be captured to a file on one system and analyzed later, and the state of the underlying GPU subsystem can be observed at any point. The tool consists of two main components: the GUI application and the target interceptor. Additional information can be found in the Mali Graphics Debugger User Guide.
A nice thing about the Mali Graphics Debugger is that not only does it help debug a problem, it also provides a lot of useful information for improving performance. Its frame analysis features can be used both for debugging rendering issues and for analyzing and optimizing performance.
Here we are focusing on the latter. We are going to inspect the resources that are used to render the frame we are analyzing. In this case the Epic Citadel demo is running on the same device as in Part 1, a Google Nexus 10 with Android™ OS 4.4.2. The device has been rooted and the target components of the Mali Graphics Debugger have been installed. They consist of the API interceptor library and mgddaemon, a small application that communicates with the GUI through USB.
Mali Graphics Debugger showing an Epic Citadel frame and some of the available features
We are going to investigate why the fragment activity is so high and whether there are optimizations we can apply to reduce the memory accesses. The high fragment activity could be caused by a combination of expensive fragment shaders and overdraw (the same pixel being shaded multiple times).
The first step is to calculate the average number of cycles we can spend on fragment activity per pixel. This will give us a guideline for analyzing shaders and overdraw.
Since we are running the demo in High Quality mode, which renders the scene directly to framebuffer zero, the theoretical number of pixels that we are drawing is 2560×1600, namely 4,096,000. If we are targeting display synchronization, we must render sixty frames per second. We are running the demo on a Google Nexus 10, which contains an ARM Mali-T604 GPU with four cores running at 450 MHz. With these figures we can calculate the theoretical average number of cycles per pixel that we have available:
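(4 cores × 450 MHz) / (4,096,000 pixels × 60 fps) = 1,800,000,000 / 245,760,000 ≈ 7.3 cycles per pixel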
This does not take into account stalls or the overhead of the setup work that the driver may need to do, and it assumes that the vertex activity is negligible or fully parallelized. In practice we may be able to dedicate fewer cycles per pixel to fragment activity, and we do not want to keep the GPU fully utilized for long periods (in order to save energy).
Since the objects in the scene are not always sorted by depth in an optimal way, and there are some semi-transparent textures to draw, the GPU often has to process multiple fragments for the same pixel in order to blend correctly. We call this overdraw, and while some of it is unavoidable, it should at least be kept to a minimum. By overdraw factor we mean the average number of times each pixel is drawn. For example, an overdraw factor of 1.5x means that, on average, each pixel is drawn 1.5 times; this would be the case if half of the pixels required two fragments and the other half only one.
We have two ways to calculate the overdraw factor for a particular scene: from the Mali GPU hardware counters in DS-5 Streamline, or with the Fragment Count feature of the Mali Graphics Debugger.
Here we are going to use both, so that we can validate our computations.
Using the same capture that we made in Part 1, for the same selected second of activity in the timeline, we can add a custom counter to get the overdraw factor:
Overdraw factor calculated using the Mali GPU hardware counters in Streamline
We need to know the size of the tiles (Mali GPUs are tile-based), which is 16 by 16 pixels in the case of the Mali-T604. Such a custom counter can be calculated with the following formula:
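overdraw factor = fragment threads started / (tiles rendered × 16 × 16)

(The exact hardware counter names differ between Mali GPU generations and Streamline versions, so treat this as the general form rather than the precise counter expression: the total number of fragments shaded divided by the total number of pixels covered by the rendered tiles.)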
This figure represents an average for the selected second, which includes around 30 frames.
This formula is valid for the Mali-T604 GPU. For other Mali GPUs please check http://community.arm.com/docs/DOC-10182 (see section 3.4.4 for the meaning of the counter on the Mali-T760).
The Mali Graphics Debugger takes a different approach to analyzing the 3D application. It works at the API level, knowing all the OpenGL ES 2.0 function calls that the application makes and reading the framebuffers produced by the GPU. To analyze a particular frame from a fragment workload perspective, a feature called Fragment Count is available. This feature replaces the fragment shaders used in all the draw calls of one frame with special shaders that count how many fragments are affected by each draw call. It is important to remember that it counts all the fragments passed on by the rasterizer, including those that would normally be discarded by the shader and those that would turn out to be fully transparent. We believe this is the right thing to do, since such fragments still have to run through the fragment shader.
Statistics for the fragment shaders: the number of fragments (instances) is the result of the Fragment Count mode
Once we capture a frame with the Fragment Count mode activated, we get statistics about how many fragments each shader has processed and how many cycles it has taken. This is useful to highlight the top heavyweight shaders and focus on optimizing those. We can also sum the number of fragments (called instances in the tool) that each shader processes to get the total number of fragments that have been drawn for this particular frame.
The total number of fragments for a frame takes overdraw into account. At this point, we can simply divide this number by the number of pixels on the screen to get the overdraw factor:
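overdraw factor = total fragments counted for the frame / (2560 × 1600)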
The resulting figure is very close to the one calculated with Streamline, 2.48 fragments per pixel, although that was an average over multiple frames.
An overdraw factor of around 2.5x is quite high. For a scene like the one we are analyzing, which does not have many semi-transparent objects, it is definitely too much. A lot of wasted fragment activity could be avoided simply by sorting the draw calls for the opaque objects by depth, from front to back, and drawing the semi-transparent ones last. Once a fully opaque fragment has been drawn, all subsequent fragments at the same position that lie behind it are discarded by the Z-test before the fragment shader runs. That is why it is so important to sort draw calls in a sensible way.
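As a minimal sketch of this idea (the DrawCall structure and SubmitFrame function below are hypothetical, not taken from the demo), an engine can partition its draw calls into opaque and blended groups, sort the opaque ones front to back, and submit the blended ones last, back to front:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical draw-call record: 'distance' is the object's distance
// from the camera, computed by the engine when the scene is culled.
struct DrawCall {
    float distance;     // camera-space distance of the object
    bool  transparent;  // true for blended (semi-transparent) geometry
    // ... shader, textures, vertex buffers, etc.
};

void SubmitFrame(std::vector<DrawCall>& calls) {
    // Partition: opaque draw calls first, blended ones last.
    auto firstBlended = std::stable_partition(
        calls.begin(), calls.end(),
        [](const DrawCall& dc) { return !dc.transparent; });

    // Opaque geometry: front to back, so the early Z-test can reject
    // fragments hidden behind surfaces that are already drawn.
    std::sort(calls.begin(), firstBlended,
              [](const DrawCall& a, const DrawCall& b) {
                  return a.distance < b.distance;
              });

    // Blended geometry: back to front, so blending composes correctly.
    std::sort(firstBlended, calls.end(),
              [](const DrawCall& a, const DrawCall& b) {
                  return a.distance > b.distance;
              });

    // Issue the actual OpenGL ES draw calls in this order...
}
```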
With the Mali Graphics Debugger we can use the Frame Capture feature to understand how every draw call is affecting the image on screen. With this mode, a full resolution snapshot of the framebuffer is taken after every draw call. It can take up to a few minutes, but the resulting image sequence is very valuable. Take a look at the following video to see how the investigated frame is being drawn:
The captured frame being built up, showing what every draw call draws.
We have built a dedicated feature into the tool to help understand overdraw. It's called Overdraw Map, and it replaces all the fragment shaders used to draw a frame with our overdraw shaders. The resulting image drawn to the screen is a grayscale map of overdraw: the darkest color means that a pixel has been drawn only once, while full white means that a pixel has been overdrawn ten times. At that point the brightness saturates, but 10x overdraw is definitely too much. With this mode on, it is normal not to see all the objects in the image; in fact, if all the objects were perfectly sorted from front to back, we would only see a uniform dark area on screen.
Overdraw map for the selected frame, showing large areas where objects are drawn 3-5 times on top of each other.
In the next part we will analyze the vertex activity for the same frame and we will focus on shader optimization, using the Mali Graphics Debugger and the Mali Offline Shader Compiler. Part 3 is now available.
Lorenzo Dal Col is the Product Manager of Mali GPU Tools. He first used ARM technology when, in 2007, he created a voice-controlled robot at university. He has experience in machine learning, image processing and computer vision. He moved into a new dimension when he joined ARM in 2011 to work on 3D graphics, developing performance analysis and debug tools for software running on ARM Mali GPUs.
I recall the LPDDR2 memory bandwidth power approximation from one of your posts, and it is very helpful (albeit an approximation, as you allude, likely a step or two from reality). Still, I am very thankful for these approximations as they help constrain the types of things that look possible when considering raw specs and thinking about algorithm implementation.
The 70% of maximum bandwidth is a generally good rule of thumb for any memory technology - performance goes awry fairly quickly as you get above that, as memory latency starts to rise steeply due to congestion (which also limits the amount of clever stuff the DDR controller can do to optimally access DDR pages).
In terms of thermally usable memory bandwidth, that is an easier one to answer, as the power cost is generally dominated by driving the external PHY which links the AP SoC to the DDR, and that tends to be more predictable. Our rule of thumb is between 100 pJ (LPDDR4) and 150 pJ (LPDDR2) per byte of memory access - a horrible approximation which tries to account for all access power (AXI, memory controller, PHY, and DDR) - but it is generally a good sanity-check number. You would therefore expect 1 GB/s of access bandwidth to cost between 100 and 150 mW, although as always YMMV.
I figured as much. I was hoping for something incredibly vague (and perhaps overly cautious) like "generally aim for 50% of external bandwidth on smartphones", or "keep GPU cycle utilization around 60% of theoretical maximum on smartphones". The advice in previous blog posts to only count on having access to ~70% of maximum external bandwidth, or ~80% of maximum cycles, was very helpful!
It will probably be better to test on a device-by-device basis, or to collect data from shipping code.
Really an impossible question to answer from the IP standpoint - so much depends on the silicon implementation (process node, top frequency target used in synthesis, transistor choice, number of DVFS operating points), device form factor (tablets are physically bigger than phones, so they can dissipate more heat per second and therefore have higher thermal limits), etc.
Here are some interesting questions:
1) What is a general rule-of-thumb for sustainable rendering before running into thermal limits?
Obviously sustaining the external bandwidth at its theoretical limit (e.g. 25.6 GB/s) with huge, uncompressed, non-mipmapped textures is not realistic, so what should be a more realistic target?
2) Similarly for computation: what percentage of load on the GPU is likely to be sustainable?