This is the third and last part of this series that explains how to analyze the Epic Citadel demo using our ARM® Mali™ GPU tools. In Part 1 we profiled the application using DS-5 Streamline, while in Part 2 we used the ARM Mali Graphics Debugger to capture a frame, understand overdraw and focus on fragment activity. Here we focus on vertex activity, and we use the Mali Offline Shader Compiler to perform static analysis of the vertex and fragment shaders in order to optimize them.
Vertex bound applications are rarer than fragment bound applications, especially if the amount of geometry in the scene is well proportioned to the number of pixels that are drawn. Some simple algebra can be done to calculate the appropriate number of triangles given a surface of a specific resolution, but “as the rule of thumb, coverage of 10 fragments to one triangle should be used as the low water mark”. If coverage falls below that mark, it is a really good idea to reduce the number of vertices per object that is drawn. This can be done by simply reducing the number of triangles for each object, especially the ones that are drawn far away from the camera. If the same object can be dynamically positioned near or far from the camera, techniques like level of detail (LOD) should be used. It is worth remembering that even if the application is not vertex bound, unnecessary geometry leads to needless energy consumption: the more vertices used, the more power is burnt, as they all get read in, processed, written out to memory, and finally read back for the fragment processing stage.
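The "simple algebra" mentioned above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the triangle count is a made-up example value, and the resolution is the Nexus 10 native resolution of 2560x1600. It ignores overdraw and assumes every triangle is visible on screen.

```python
def fragments_per_triangle(width, height, triangle_count):
    """Average number of fragments covered by each triangle in one frame."""
    if triangle_count <= 0:
        raise ValueError("triangle_count must be positive")
    return width * height / triangle_count


def max_triangles(width, height, min_coverage=10.0):
    """Largest triangle count that still meets the coverage low water mark."""
    return int(width * height / min_coverage)


# Nexus 10 native resolution; the triangle count is a made-up example value.
WIDTH, HEIGHT = 2560, 1600
coverage = fragments_per_triangle(WIDTH, HEIGHT, triangle_count=500_000)
budget = max_triangles(WIDTH, HEIGHT)
print(f"{coverage:.2f} fragments per triangle, budget of {budget} triangles")
```

With these example numbers the coverage is below 10 fragments per triangle, so the scene would be a candidate for geometry reduction.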
The Mali Graphics Debugger (MGD) can be used to identify vertex-heavy draw calls, since all the information is captured by the MGD interceptor and displayed in the host application. This works particularly well with captured frames, because the user can see what every draw call adds to the frame. Inspecting the draw calls alongside the content of the framebuffer shows that many of them don't actually add anything to the frame. This is usually because the object is positioned outside the camera view. In this case techniques like culling can be used to avoid even attempting to draw those objects. Culling can increase CPU usage, but if the application is heavy on geometry the trade-off is worth it.
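As a rough illustration of the culling idea, here is a minimal CPU-side sketch in Python that tests bounding spheres against a set of planes. Everything in it is an assumption made for the example: the names are hypothetical, the view volume is a simple box standing in for a real frustum, and the plane normals are assumed to be unit length and to point towards the inside of the volume. It is not how Epic Citadel or any particular engine implements culling.

```python
from dataclasses import dataclass


@dataclass
class Plane:
    """Plane nx*x + ny*y + nz*z + d = 0 with a unit normal pointing inside."""
    nx: float
    ny: float
    nz: float
    d: float

    def signed_distance(self, x, y, z):
        return self.nx * x + self.ny * y + self.nz * z + self.d


def sphere_visible(planes, center, radius):
    """Conservative test: False only if the sphere is fully behind a plane."""
    return all(p.signed_distance(*center) >= -radius for p in planes)


# Box-shaped view volume: x and y in [-10, 10], z in [-100, -1]
# (camera looking down the negative z axis, as in OpenGL ES eye space).
VIEW_VOLUME = [
    Plane(1, 0, 0, 10), Plane(-1, 0, 0, 10),   # left, right
    Plane(0, 1, 0, 10), Plane(0, -1, 0, 10),   # bottom, top
    Plane(0, 0, -1, -1), Plane(0, 0, 1, 100),  # near, far
]

# Made-up objects: (name, bounding sphere center, radius).
objects = [
    ("tower", (0, 0, -20), 2),
    ("behind_camera", (0, 0, 5), 2),
    ("partly_visible", (11, 0, -20), 2),
]

to_draw = [name for name, c, r in objects if sphere_visible(VIEW_VOLUME, c, r)]
print(to_draw)
```

Only the objects left in `to_draw` would be submitted as draw calls; the per-object test is the extra CPU cost mentioned above.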
The Outline View in the Mali Graphics Debugger shows the number of vertices that are processed for each draw call.
If reducing the number of triangles is not an option, a different approach can be taken to reduce the number of cycles the GPU spends executing the vertex shaders. The Mali Graphics Debugger exposes all the shaders that are used to render every single frame. Every shader is automatically compiled and statically analyzed using the Mali Offline Shader Compiler, so the instruction count is displayed in the Vertex Shaders view. For a selected frame, MGD knows the number of vertices that every vertex shader processes, so we can multiply the number of instructions per shader by the number of vertices per shader to get the total number of cycles per shader for a frame. This is a theoretical number because it does not account for cache misses, but it is very useful anyway. After sorting the vertex shaders by the total number of cycles, we can identify the top heavyweight shaders for the selected frame. Usually there are two or three shaders that run for more than 80% of the time, so developers can focus on optimizing those. The source code for each shader is displayed just by double-clicking on the shader name.
The Vertex Shaders view shows the total number of cycles spent for each shader for the selected frame.
Just by optimizing two or three of the heaviest shaders the GPU vertex activity can be reduced a lot.
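The ranking that MGD performs in the Vertex Shaders view can be approximated offline with a short Python sketch. The shader names and numbers below are made up for illustration and are not taken from the Epic Citadel capture; `rank_shaders` is a hypothetical helper, not part of any ARM tool.

```python
def rank_shaders(shaders, coverage=0.8):
    """Rank shaders by estimated cycles per frame.

    shaders: iterable of (name, cycles_per_vertex, vertices_in_frame).
    Returns (totals, heavy): totals is sorted from heaviest to lightest,
    heavy lists the fewest top shaders that reach `coverage` of the frame.
    """
    totals = sorted(
        ((name, cycles * vertices) for name, cycles, vertices in shaders),
        key=lambda item: item[1],
        reverse=True,
    )
    frame_total = sum(cycles for _, cycles in totals)
    heavy, running = [], 0
    for name, cycles in totals:
        if running >= coverage * frame_total:
            break
        heavy.append(name)
        running += cycles
    return totals, heavy


# Made-up example data: (name, cycles per vertex, vertices in the frame).
frame = [
    ("shader_c", 8, 30_000),
    ("shader_a", 14, 220_000),
    ("shader_d", 5, 10_000),
    ("shader_b", 20, 60_000),
]

totals, heavy = rank_shaders(frame)
for name, cycles in totals:
    print(f"{name}: {cycles} cycles")
print("optimize first:", heavy)
```

As in MGD, the result is a theoretical estimate that ignores cache misses, but it is enough to decide which shaders deserve attention first.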
Once we have identified the shaders we want to focus our efforts on, we can use the Mali Offline Shader Compiler to analyze them. This is the same compiler that is used by the Mali GPU driver to compile the shaders to GPU binary code, but it's built as a standalone tool for Linux, Windows and Mac OS X. This tool is also useful to validate shaders ahead of execution, to see whether they would compile on a Mali GPU enabled device and to get errors and warnings.
From the User Guide:
The Mali Offline Shader Compiler is a command line tool that compiles vertex shaders and fragment shaders written in the OpenGL ES Shading Language (ESSL) into binary vertex shaders and binary fragment shaders that you can link and run on the Mali GPU. Shaders can be used in the OpenGL® ES 2.0 and OpenGL ES 3.0 APIs. . .
You can use the Mali Offline Shader Compiler to:
- Pre-compile shaders into binary code that you can distribute with your application.
- Assist software development, by checking that shaders compile properly without having to pass them through an OpenGL ES application.
- Optimize your shaders by collecting feedback about the number of cycles each execution of the shader takes when you run it on the GPU.
Multiple versions of the Mali GPU driver are supported, and multiple versions of the hardware as well; the complete list of all the variants that are supported can be retrieved by running the command malisc --list. We are releasing a new version of the Mali Offline Shader Compiler every quarter, to support new cores and new driver versions.
Source code of Shader 176, that runs for more than 3 million cycles in one frame
At this point we can run the Mali Offline Shader Compiler to get the instruction count and the shader core register utilization. In this case we are using version 4.3, which supports both the Mali-400 and Mali-T600 series up to driver version r4p0. We can get the source code for Shader 176 (linked in Program 175) and save it to disk as 'shader-176.vert'. Here's the output of the compiler for that shader:
    Mali_Offline_Compiler_v4.3.0$ ./malisc --core Mali-T600 --revision r0p0_15dev0 --driver Mali-T600_r4p0-00rel0 --vertex shader-176.vert -V
    ARM Mali Offline Shader Compiler v4.3.0
    (C) Copyright 2007-2014 ARM Limited.
    All rights reserved.

    Compilation successful.

    3 work registers used, 16 uniform registers used, spilling not used.

                    A       L/S     T       Total   Bound
    Cycles:         9       5       0       14      A
    Shortest Path:  4.5     5       0       9.5     L/S
    Longest Path:   4.5     5       0       9.5     L/S

    Note: The cycles counts do not include possible stalls due to cache misses.
The number of instructions for each unit in the 'Midgard' shader core is reported by the offline compiler. The three unit types are Arithmetic (A), Load/Store (L/S) and Texture (T). To learn more about the 'Midgard' shader core and the tripipe design refer to Part 1 or read the excellent AnandTech article ARM’s Mali Midgard Architecture Explored.
In Shader 176 the Texture pipeline is not involved at all, while the Arithmetic and Load/Store pipelines are busy. The offline compiler reports 9 cycles in the Arithmetic unit and 5 cycles in the Load/Store unit. A single thread running this shader will be arithmetic bound, but the truth is that usually there will be multiple threads for the same shader being executed at the same time. Each shader core supports up to 256 concurrently executing threads. Since there are two Arithmetic units, on average, each thread will execute for 4.5 cycles in the Arithmetic unit and 5 cycles in the Load/Store unit. This means that if we consider thread parallelism this shader will be Load/Store bound.
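The reasoning above can be checked with a small Python sketch using the cycle counts reported for Shader 176. The helper name is hypothetical, and the model is deliberately simplistic: it assumes the work is spread evenly over the two Arithmetic units, that there is a single Load/Store unit and a single Texture unit, and that there are no cache-miss stalls.

```python
def pipeline_bound(arith_cycles, ls_cycles, tex_cycles, arith_units=2):
    """Return per-pipeline cycles per thread and the limiting pipeline."""
    per_pipe = {
        "A": arith_cycles / arith_units,  # arithmetic work split over the units
        "L/S": ls_cycles,
        "T": tex_cycles,
    }
    bound = max(per_pipe, key=per_pipe.get)
    return per_pipe, bound


# Cycle counts for Shader 176 as reported by the offline compiler.
single_thread = pipeline_bound(9, 5, 0, arith_units=1)
many_threads = pipeline_bound(9, 5, 0, arith_units=2)

print("single thread:", single_thread)
print("many threads: ", many_threads)
```

The single-thread case is bound by the Arithmetic pipeline, while the parallel case is bound by Load/Store, matching the Cycles and Shortest/Longest Path rows of the compiler output.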
The other thing to look at with the offline compiler is the register utilization. This shader core contains two kinds of registers, both reported in the compiler output above: work registers and uniform registers.
If the shader needed more registers than are available, the GPU would need to perform register spilling, causing big inefficiencies and higher Load/Store utilization. The offline compiler reports this kind of issue, which should be avoided. Luckily, this is not the case for Shader 176.
This article concludes the series dedicated to analyzing the performance of Epic Citadel running on a Google Nexus 10 using some of the ARM GPU tools: Mali Graphics Debugger, DS-5 Streamline and Mali Offline Shader Compiler. This series focused more on using the tools than on diving deep into optimization techniques, which are covered in more detail in Mali Performance 1: Checking the Pipeline. The same tools and similar techniques have been used very successfully to increase the performance of many games that you can find in Google Play, like Gameloft's Iron Man 3, Gangstar Vegas and Asphalt 8, as described in the GDC 2014 talk Optimizing Mobile Games with Gameloft and ARM.
Lorenzo Dal Col is the Product Manager of Mali GPU Tools. He first used ARM technology when, in 2007, he created a voice-controlled robot at university. He has experience in machine learning, image processing and computer vision. He moved into a new dimension when he joined ARM in 2011 to work on 3D graphics, developing performance analysis and debug tools for software running on ARM Mali GPUs.