Attending events and manning Arm booths at trade shows is a great opportunity to meet developers and to understand first-hand the problems they face and what their needs are. One of the recurring conversations I have goes like this:
Dev.: I am struggling to achieve good performance in the game I am developing now.
Me: Are you familiar with Arm tools for performance analysis such as Mali Graphics Debugger and the Mali Offline Shader Compiler?
Dev.: Mmmm, not really.
Me: When planning your game, did you consider your GPU budget for the devices you are targeting? Maybe you are just pushing the GPU past its capabilities?
Dev.: Mmmm, not really.
At this point I usually explain the different Arm Mali performance tools and what you can do with them. In particular, I explain how to calculate an appropriate GPU performance budget for the target device, and how to use the tools to get the number of cycles used by the application's shaders so that the developer can track how much of that budget is being used.
This is, unfortunately, a common pattern. Optimizing graphics applications is not simple and requires some low-level knowledge about how GPUs work. Nevertheless, considering the available GPU processing budget from the beginning of the development process can save a lot of time and painful decisions later. The key to developing well-performing real-time graphics applications is to know the limitations of your system and work within those limits.
While game studios have a well-tuned pipeline for planning games and considering the hardware capabilities of targeted devices, indie developers can struggle with this. Enthusiasm sometimes trumps common sense, and they start working on that great game with the best graphics ever seen. Gradually, reality must be faced when they start to notice performance problems, and painful decisions are needed about removing that great rendering effect and lowering the graphics quality. Doing this before the game's release is the better case; unfortunately, some developers only discover performance problems after releasing their game, when users come back with feedback that the game does not run well on some low- and mid-range devices.
This blog tries to help those indie developers, saving them time and painful decisions when developing that game that could potentially change their lives. Although the ideas and recommendations presented here are valid for any game engine and platform, I will provide hints for Unity developers, as it is the most popular game development platform on mobile and the game engine I am most familiar with.
When you are in the process of buying an expensive item in the real world – such as a car – you intuitively consider your available budget, and if you can't afford it with your available resources you have to ask for credit from a financial institution such as a bank. Good news! With graphics you can calculate your GPU processing budget to see what you can afford in your rendering in terms of GPU processing cycles. The bad news is that you can't get GPU credit from a graphics bank; you can do many things, but you always need to stay within the limits of your GPU processing budget, and going over it will only lead to performance problems and headaches.
Let me show you how to calculate the GPU processing budget for a given device, for example the Samsung Galaxy S7. It comes with an octa-core Arm CPU and an Arm Mali-T880 MP12 GPU. Firstly, you need to know the GPU frequency or clock speed. The Exynos Wikipedia page is a good place to look for data related to Samsung releases. It tells you that the top GPU frequency is 650 MHz per core. A frequency of 650 MHz means that each GPU shader core can perform 650 M single-cycle operations per second, or 325 M two-cycle operations per second, per processing pipeline in the shader core. As this device has 12 shader cores, the total number of single-cycle operations per second per pipeline the GPU can perform is:
numOp = 12 x 650 M cycles/sec = 7800 M cycles/sec
Now it is time to consider some important features of the game, for example the FPS you want to target. If it is 60 FPS, then the number of operations our GPU can perform per frame is:
numOp = (7800 M cycles/sec) / (60 frames/sec) = 130 M cycles/frame
The GPU is a parallel processor; the shader code describes the operations to be performed for every vertex/pixel. The number of vertices is defined by the complexity of the geometry we use in the game. The number of pixels is defined by the resolution we want to target. If we target a Full HD resolution it means our fragment shader will run in parallel on 1920 × 1080 = 2 073 600 pixels.
Now we can calculate the GPU budget per pixel:
numOp = (130 M cycles/frame) / (2.07 M pixels) ~= 63 cycles/frame/pixel.
At this point we can write a formula to get the GPU fragment and vertex cycle budgets:
fragCycleBudget = (No of GPU fragment cores) * (gpuFrequency in Hz) / (FPS * numPixels)
vertCycleBudget = (No of GPU vertex cores) * (gpuFrequency in Hz) / (FPS * numVertices)
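To make these formulas easy to reuse across devices, here is a minimal C# sketch; the class and method names are my own for illustration, not part of any Arm or Unity API:

// Minimal sketch of the budget formulas above; names are illustrative only.
public static class GpuBudget
{
    // fragCycleBudget = (fragment cores * frequency in Hz) / (FPS * pixels)
    public static double FragCycleBudget(int fragmentCores, double gpuFrequencyHz,
                                         double fps, double numPixels)
    {
        return fragmentCores * gpuFrequencyHz / (fps * numPixels);
    }

    // vertCycleBudget = (vertex cores * frequency in Hz) / (FPS * vertices)
    public static double VertCycleBudget(int vertexCores, double gpuFrequencyHz,
                                         double fps, double numVertices)
    {
        return vertexCores * gpuFrequencyHz / (fps * numVertices);
    }
}

// Galaxy S7 example from above: 12 cores at 650 MHz, 60 FPS, Full HD.
// GpuBudget.FragCycleBudget(12, 650e6, 60, 1920 * 1080) ~= 63 cycles/frame/pixel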
GPUs with a unified shader core architecture, such as the Arm Midgard GPU architecture, are built around a single processor type that executes all vertex and fragment shaders. In the older Arm Utgard GPU architecture the shader cores are specialized, with separate vertex and fragment processors. For unified shader cores all shader workloads run on the same physical shader cores, so the total available performance is split across the different types of workload and you cannot strictly treat them in isolation; for these GPUs you must add the total contributions together.
Note that in these calculations we have used the top GPU frequency, but it doesn't need to be like this. We might be interested in setting a lower operating frequency as a target in order to save battery life and allow players to enjoy the game for longer.
Coming back to our previous results, the resolution used in the calculations above is low compared with the native QHD (1440 x 2560) resolution of the Samsung Galaxy S7. If we want to use the native resolution then our GPU fragment budget per pixel reduces to:
fragCycleBudget = 130 M cycles/frame / 3.69 M pixels ~= 35 cycles/frame/pixel.
So we see our GPU budget is almost halved when going for QHD resolution. If we are planning a VR game where the full QHD panel is shared between the two eye views, then that is our GPU budget.
But how will that figure look if we target a mid-range device, for example one powered by an Arm Mali-450 MP4 GPU?
This GPU is still very common, so it is important for developers to have an idea of the budget it can provide. Mali-450 MP4 implements the Utgard architecture with a fixed number of cores for vertex and fragment processing, in this case one vertex core and four fragment cores. The MediaTek MT8127 SoC has been used extensively in Android-based tablets. The Mali-450 GPU in this SoC runs at 600 MHz. At a 1024 x 600 tablet resolution the GPU fragment budget will be:
fragCycleBudget = (4 x 600 M cycles/sec) / (60 frames/sec * 614 K pixels) ~= 65 cycles/frame/pixel
As we have a single vertex core, the expression for the GPU vertex budget is:
vertCycleBudget = (600 M cycles/sec) / (60 frames/sec * numVertices)
For a reasonable count of 100 K vertices this expression gives us 100 cycles/frame/vertex.
Let's now increase the resolution. This page lists a number of phones with the Mali-450 MP4 GPU. If we take, for example, the MediaTek MT6592 SoC used in many phones, it has a Mali-450 MP4 GPU running at 700 MHz. At a 1920 x 1080 phone resolution it gives us a GPU fragment budget of:
fragCycleBudget = (4 x 700 M cycles/sec) / (60 frames/sec * 2.07 M pixels) ~= 22 cycles/frame/pixel
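Plugging these MT6592 numbers into the earlier GpuBudget sketch gives the same result:

// 4 fragment cores at 700 MHz, 60 FPS, 1080p
// GpuBudget.FragCycleBudget(4, 700e6, 60, 1920 * 1080) ~= 22 cycles/frame/pixel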
The budget figure is now much tighter if you wish to target 1080p resolution!
We have seen how the GPU budget can change from high-end devices to mid- or low-range devices. From the hardware perspective, the number of cores and the GPU frequency are the influencing factors when moving from the high-end to the low-end device category. From the software perspective, the targeted FPS and screen resolution are the two dominant factors; both of these parameters are under application control, so you must use them wisely. The higher the resolution, the more pixels your application needs to render, which means a smaller budget of cycles per pixel at your disposal. Increasing the targeted FPS has a similar effect. Reducing resolution and FPS has the opposite effect, but it can negatively impact the quality of the graphics and the user experience.

It's also worth noting that all of our budgets here target the maximum GPU operating frequency; for some games it may be appropriate to target a lower operating frequency to reduce energy consumption and allow your players to play for longer. At the end of the day, graphics rendering is a matter of finding a good balance between performance, quality, and battery life. You should weigh all factors adequately to guarantee the user the best possible experience on the hardware the game is running on.
Let's see now how you can determine the number of GPU cycles used by your shaders, to compare against the available budget.
If you want to know how many GPU cycles your shader programs require you can use the Arm tools Mali Offline Compiler (MOC) or Mali Graphics Debugger (MGD).
The MOC is a command line tool that compiles vertex, fragment, compute, geometry, tessellation control, and tessellation evaluation shaders written in the OpenGL ES Shading Language (ESSL) and prints information about the compiled code.
To make use of MOC, Unity developers must first compile the shader to get a low-level optimized GLSL version of the shader for OpenGL ES. Select your shader and press the "Compile and show code" button shown in the picture below. First click the small tab to the right of the button and set the appropriate OpenGL ES compilation target.
Figure 1. Compile shader in Unity.
From the generated GLSL code you must isolate the vertex and the fragment shader. For the vertex shader, copy into a separate file only the code contained between the directives
#ifdef VERTEX
...
#endif
Name the file with a “vert” extension: shader_name.vert.
For the fragment shader, copy into a separate file only the code contained between the directives
#ifdef FRAGMENT
...
#endif
Name the file with a "frag" extension: shader_name.frag.
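If you do this often, splitting the file by hand gets tedious. Below is a rough C# sketch of a splitter; it is my own helper, not an Arm or Unity tool, and it assumes the stages are guarded by top-level #ifdef VERTEX and #ifdef FRAGMENT blocks as in the compiled code Unity shows:

using System;
using System.IO;
using System.Text;

// Rough helper (not an Arm or Unity tool) that extracts the code between
// a top-level "#ifdef VERTEX" or "#ifdef FRAGMENT" and its matching "#endif".
static class ShaderSplitter
{
    static string Extract(string[] lines, string guard)
    {
        var sb = new StringBuilder();
        int depth = 0;
        bool inside = false;
        foreach (var line in lines)
        {
            string t = line.Trim();
            if (!inside)
            {
                if (t == "#ifdef " + guard) { inside = true; depth = 1; }
                continue;
            }
            // Track nested preprocessor conditionals so we stop at the right #endif.
            if (t.StartsWith("#if")) depth++;
            else if (t.StartsWith("#endif") && --depth == 0) break;
            sb.AppendLine(line);
        }
        return sb.ToString();
    }

    static void Main(string[] args)
    {
        string[] lines = File.ReadAllLines(args[0]);
        string name = Path.GetFileNameWithoutExtension(args[0]);
        File.WriteAllText(name + ".vert", Extract(lines, "VERTEX"));
        File.WriteAllText(name + ".frag", Extract(lines, "FRAGMENT"));
        Console.WriteLine("Wrote " + name + ".vert and " + name + ".frag");
    }
}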
You then pass each file to the MOC command tool. Assuming the malisc executable is on your path, on Linux or Mac OS X run:
./malisc shader.vert
./malisc shader.frag
or on Windows run:
malisc shader.vert
malisc shader.frag
Besides the shader file, you can specify a target driver, hardware core, and hardware release. You can inspect the openglessl folder in the MOC installation path to see the list of supported drivers.
Below is an example of the MOC output for the following simple shader code:
#version 100
precision mediump float;
uniform sampler2D tex0;
varying vec2 vTexCoord1;

void main() {
    vec4 color1 = texture2D(tex0, vTexCoord1);
    gl_FragColor = color1;
}
Figure 2. Example of Mali Offline Compiler output.
As you can see, the command line in Fig. 2 doesn't specify any target core, driver, or revision; in this case MOC defaults to the listed values. The Arm Developer page lists all the different Mali GPUs you can pass to MOC with the core (-c) option.
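For example, to compile the fragment shader above against a specific core you can run something like the following (the exact driver and revision flags vary between MOC versions, so check the tool's help output):

malisc -c Mali-T880 shader.frag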
MOC lists the number of cycles in the different parallel GPU pipelines: the Arithmetic, Load/Store, and Texture pipes. There is an excellent blog from Peter Harris on the Arm Community that will help you understand the concept of the different pipes and how they work.
Another important piece of information in the output is register spilling. If the shader needs more register storage than is physically available, the GPU has to spill registers to the stack, leading to significant inefficiency and higher Load/Store utilization due to the extra traffic saving and restoring values from the program stack.
MOC will also alert you if the shader runs out of register-based uniform storage space. A given number of uniforms could be fine for one core, for example Mali-T880, but the same shader could run out of uniform registers on a Mali-400, requiring uniforms to be loaded from memory.
In Unity, when building built-in shaders the resulting GLSL code can be very large as a result of the many shader variants. Unity prepares different flavours of the same shader by means of keywords, to handle different types of lighting, shadows, rendering paths, etc. When extracting the vertex and fragment code for MOC analysis you should follow the procedure recommended above on the compiled code inside a given shader variant. Each shader variant is delimited by the keywords it uses. A simple Mobile/Diffuse built-in shader has more than 500 variants. Some of the delimiting keyword sets look like this:
Keywords set in this variant: DIRECTIONAL
Keywords set in this variant: DIRECTIONAL SHADOWS_SHADOWMASK
Keywords set in this variant: DIRECTIONAL LIGHTMAP_ON DYNAMICLIGHTMAP_ON
Keywords set in this variant: DIRECTIONAL LIGHTMAP_ON LIGHTMAP_SHADOW_MIXING
Keywords set in this variant: DIRECTIONAL LIGHTMAP_ON SHADOWS_SHADOWMASK
Such a large number of shader variants makes it difficult to use MOC with Unity built-in shaders, as we need to know exactly which variant(s) are used. Nevertheless, for custom shaders where we know exactly what each variant (if any) does, MOC is excellent.
MGD provides the same shader cycle and register usage information as MOC, with the added advantage that it intercepts the API calls from Unity and so automatically shows the information for the Unity shader variant actually in use.
Figures 3 and 4 show the information provided by MGD for the vertex and fragment shaders running on the device. In a previous blog I describe how to build Unity applications with support for MGD. When capturing an MGD trace, you can click the button in the toolbar to make MGD show the number of fragments that have been rendered with each shader in the current frame. This is needed to allow the tool to compute the total frame contribution of a fragment shader.
Figure 3. Example of vertex shader cycle and register use output in Mali Graphics Debugger.
Figure 4. Example of fragment shader cycle and register use output in Mali Graphics Debugger.
Vertex and fragment shaders are listed with different IDs. To find the fragment shader corresponding to a given vertex shader, look for the same linked program ID. For example, the vertex and fragment shaders framed in red have the same associated program ID.
By clicking on a column header we can sort the figures, so we can easily see at a glance which are the most cycle-consuming vertex and fragment shaders.
At the end of the table MGD displays the total vertex and fragment cycle counts for a given frame, if we have selected the frame in the Trace Outline pane. If the frame is composed of several render passes, MGD displays the total cycles accumulated up to the render pass selected in the Trace Outline pane.
At this point we can compare the total cycle count figure with our target GPU budget. Nevertheless, we need a couple of additional math operations. As the GPU budget we have calculated is expressed in cycles/frame/pixel, we need to divide the figures provided by MGD by the total number of pixels, and then add the vertex and fragment figures together to find the total number of cycles per frame per pixel. If there are any compute shaders, we need to include them in the sum as well.
Let's consider the total values in Figures 3 and 4. The framebuffer resolution used in this capture is 1280 x 720, which means we are rendering 922 K pixels.
Total vertex cycle count/frame/pixel = 3.49M / 922K = 3.8 cycles/frame/pixel
Total fragment cycle count/frame/pixel = 30.2M / 922K = 32.7 cycles/frame/pixel
Total cycle count/frame/pixel = 3.8 + 32.7 = 36.5 cycles/frame/pixel
The total value of 36.5 cycles/frame/pixel is the figure we have to compare with our target GPU budget. If this figure is within our GPU budget then we are not asking the GPU for more than it can give. If it is higher, then the way to proceed is to look at the shaders with the highest cycle contributions and try to reduce their complexity until we get a total cycle value in the acceptable range.
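The comparison itself is trivial to automate; a tiny C# helper along these lines (names are illustrative, not part of MGD) does the math:

// Converts per-frame cycle totals into cycles/frame/pixel and checks them
// against the budget. Illustrative sketch, not part of MGD or MOC.
public static bool WithinBudget(double vertexCycles, double fragmentCycles,
                                double numPixels, double budgetPerPixel)
{
    double cyclesPerPixel = (vertexCycles + fragmentCycles) / numPixels;
    return cyclesPerPixel <= budgetPerPixel;
}

// With the figures above and the 63-cycle Full HD budget calculated earlier:
// WithinBudget(3.49e6, 30.2e6, 921600, 63) -> true (36.5 <= 63)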
The advantage of MGD over MOC is that it shows the relevant information simultaneously for all the shaders and provides totals that we can compare directly with our GPU budget. For this, our application must be running on the device. MGD is also excellent for debugging Unity built-in shaders, as we can see the exact shader variant in use among the hundreds of possible variants. For Unity VR applications with the Single-Pass Stereo Rendering option enabled, only MGD allows us to see at runtime the shader code that handles this option; the "static" compilation of the shader in the Unity Editor (as shown in Fig. 1) doesn't expose the code associated with Single-Pass Stereo Rendering.
With MOC we can obtain only the "static" cycle count for a given shader, which gives an approximate idea of the shader complexity in terms of cycles. The advantage of MOC is that we can easily see how this figure changes across different cores and driver versions without actually running the application; in this case, extra work is needed to prepare the shader files adequately.
Once you have a clear idea of the GPU budget, you can make decisions about the average complexity of the meshes and shaders you can deploy. Using shader variants, we can also stay within the GPU budget by running more complex shaders on high-end GPUs and simpler shaders on low-end GPUs.
GPU budget analysis must be performed in the early planning stage of the project, so the artists can be informed about the limits the hardware imposes on the amount of geometry and the complexity of the shader effects. This is very important, as the artists are responsible for designing the characters, levels, and FX. Making the artists part of this process will save valuable time, money, and frustration.
So, we have our GPU budget and we have shared it with our artists and software engineers so everybody is clear about the limits of the hardware. Do we have to stick strictly to the budget?
The general answer is yes, but that doesn't mean we can't plan, for example, a special effect with a shader that exceeds the GPU cycle budget. Let's consider a complex FX we want to show in the game that costs 70 fragment cycles when our GPU budget is 35 cycles. The solution is to limit the number of pixels on which this FX will be rendered, i.e. don't let the camera get very close to the FX, so that it is always rendered on a small portion of the screen.
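As a rough sanity check on this trade-off (the one-quarter coverage figure is just an illustrative assumption): if the 70-cycle effect never covers more than a quarter of the screen, its contribution averaged over all pixels is about 70 x 0.25 ≈ 18 cycles/frame/pixel, leaving roughly half of the 35-cycle budget for everything else in the frame.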
The variety of mobile phones in the hands of potential users is considerable, and it is something we need to consider carefully. Our game won't run with the same performance on a high-end device as it will on a mid- or low-range device. We need to handle this variety of devices adequately to assure the best possible game experience on any given hardware. Setting different quality levels in your application to match the current device's capability will greatly help you achieve this goal.
Practically all major engines have some kind of settings for game scalability. In Unity, for example, you can define a set of quality levels for each targeted platform, and for each level it is possible to adjust a number of settings that directly impact the quality of the graphics, as shown in Fig. 5. This feature in Unity is very flexible and allows adding custom quality levels.
Figure 5. Example of graphics quality levels in Unity.
Game quality levels can be associated with different target resolutions, in such a way that you can use high-quality graphics levels and native resolution on high-end devices and start dropping graphics quality and resolution as you move down to mid- and low-range devices. In Unity, for example, quality levels can be set at runtime, which allows setting the appropriate graphics quality level once the application has checked the hardware capabilities, as sketched below.
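Here is a minimal sketch of what that runtime check could look like in Unity; the memory thresholds and level indices are illustrative assumptions, and real detection logic should be tuned per project:

using UnityEngine;

// Picks a quality level at startup based on a simple device heuristic.
public class QualitySelector : MonoBehaviour
{
    void Start()
    {
        // Assumes the project defines quality levels 0 = low, 1 = medium, 2 = high.
        int level = 0;
        if (SystemInfo.graphicsMemorySize >= 2048) level = 2;       // high-end
        else if (SystemInfo.graphicsMemorySize >= 1024) level = 1;  // mid-range
        QualitySettings.SetQualityLevel(level, true);
    }
}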
Defining quality levels also means adapting your shader programs to the different quality levels. Unity handles this automatically for the built-in shaders by means of the shader variants mentioned previously, changing the quality of lighting, shadows, textures, LOD, etc. in the shaders. If you use your own custom shaders, you can use the same technique: instead of writing different shaders for different graphics quality levels, you can use your own keywords to create shader variants that fit the quality levels. The source code of the built-in shaders is available from the Unity Archive, which can provide a useful reference for shader quality reduction techniques.
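For custom shaders, the runtime side of that technique could look like the sketch below; QUALITY_HIGH and QUALITY_LOW are hypothetical keywords assumed to be declared in the shader with Unity's #pragma multi_compile mechanism:

using UnityEngine;

// Globally enables the shader keyword matching the active quality level.
// QUALITY_LOW / QUALITY_HIGH are hypothetical keywords declared in the
// custom shader via "#pragma multi_compile QUALITY_LOW QUALITY_HIGH".
public class ShaderQuality : MonoBehaviour
{
    void Start()
    {
        bool high = QualitySettings.GetQualityLevel() >= 2;
        Shader.EnableKeyword(high ? "QUALITY_HIGH" : "QUALITY_LOW");
        Shader.DisableKeyword(high ? "QUALITY_LOW" : "QUALITY_HIGH");
    }
}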
You have seen how to calculate the GPU budget in terms of cycles/frame/pixel, based on the main factors that define the amount of graphics work the GPU can perform for a given software configuration on a given hardware platform. We have also looked at the tools available to evaluate the number of GPU cycles your shaders consume, which can be compared with the available GPU budget to check how you are using the GPU's capabilities. I hope I have provided enough practical information that you can now consider the GPU budget as part of the planning stage of your games, and use it later in the development work to ensure that the game does not exceed the GPU capabilities of your target devices.
I don't think I will ever get tired of highlighting the importance of considering the GPU budget in game development to save time, resources, and frustration, and to deliver the best possible game experience to the user. Nevertheless, the question here is: are you ready to start using it? If you are reading this blog and decide to apply its recommendations in your current or future game, I will really appreciate your feedback and your own story about "before and after the GPU budget".
Hi Roberto! Very good and detailed read! I am trying to budget a simple single-variant matcap shader with MOC, targeting the T880. As per Peter's article I know that we have 17 FP32 operations per A-pipe. Can you elaborate a bit more on what the following result means in relation to this article? Does it mean that I have used 6/17 operations per clock?

Hi nomi27951:

Thanks for your feedback!

As you know, the tripipe design has three types of execution pipeline: one handling arithmetic operations, one handling memory load/store and varying access, and one handling texture access. There is only one load/store pipeline and one texture pipeline per shader core, but the number of arithmetic pipelines depends on the GPU model. For example, the Mali-T880 you are evaluating your shader on has three arithmetic pipelines.

The Mali Offline Compiler reports that your shader needs 17 instructions in the arithmetic pipeline, 3 instructions in the load/store pipeline, and 2 instructions in the texture pipeline. Since there are 3 arithmetic pipelines in the Mali-T880, the shader will execute the 17 arithmetic instructions in roughly 5.7 cycles (17/3). This means that if we consider thread parallelism this shader will still be arithmetic bound (5.7 A cycles > 3 LS cycles > 2 T cycles), although note that this ignores all memory effects such as cache misses.

The figure of 17 FP32 operations from Pete's blog is the best-case performance of a single instruction in a single arithmetic pipeline; Mali-T880 could therefore provide 51 FP32 operations per clock cycle. This is somewhat orthogonal to the statistics the compiler gives you, which just focus on the number of instructions emitted (a single arithmetic pipeline instruction might be 1 FLOP or it might be 17, depending on the program and code compilation, so there is no way to get a direct measure of shader program FLOP count from the instruction count given by the offline compiler).

Thank you so much for the detailed answer! One thing more about calculating the target GPU budget: if running at maximum performance, are the following the right operations for calculating the GPU budget for a Mali-T880 MP12 with 8 cores running at 650 MHz?

Cycles in 1 sec = 650 * 10^6 * 8
Cycles in 1 frame @ 60 FPS = (650 * 10^6 * 8) / 60
Cycles/frame/pixel at 1440 x 2560 = (650 * 10^6 * 8) / (60 * 1440 * 2560) = 23.50

That is correct! You just need to apply the formula. From now on you know how to calculate your GPU fragment budget, and you have a powerful tool in your hands to use when planning and optimizing your game.