Over the first few blogs in this series I have introduced the high level rendering model which the ARM Mali "Midgard" GPU family uses. In the remainder of this series I will explain how to use DS-5 Streamline, a system-level profiling tool from ARM, to identify areas where an application is not getting the best performance out of a Mali-based system.
In this blog we will look at debugging issues around macro-scale pipelining, the means by which we keep the GPU busy all of the time, and some of the common reasons for that frame level pipeline to stall. If you are new to this series I would recommend reading at least the first blog, as it introduces the concepts which we will be investigating in more detail this time around.
Note: I'm assuming you already have DS-5 Streamline up and running on your platform. If you are yet to do this, there are some work guides posted on the community for getting set up on a variety of Mali-based consumer devices.
The examples in this blog were captured using DS-5 v5.16.
Before we dive into diagnosing performance problems it is useful to understand what we are aiming for, and what this looks like in Streamline. There are two possible "good" behaviors depending on the performance of the system and the complexity of the content.
The counters needed for this experiment are:
If we successfully create and maintain the frame-level rendering pipeline needed for content where the GPU is the bottleneck (e.g. the rendering is too complex to hit 60 FPS), then we would expect one of the GPU workload types (vertex or fragment processing) to be running at full capacity all of the time.
In nearly all content the fragment processing is the dominant part of the GPU execution; applications usually have one or two orders of magnitude more fragments to shade than vertices. In this scenario we would therefore expect JS0 to be active all of the time, and both the CPU and JS1 to be going idle for at least some of the time every frame.
When using Streamline to capture this set of counters we will see three activity graphs which are automatically produced by the tool, in addition to the raw counter values for GPU. We can see that the "GPU Fragment" processing is fully loaded, and that both the "CPU Activity" and the "GPU Vertex-Tiling-Compute" workloads are going idle for a portion of each frame. Note - you need to zoom in down close to the 1ms or 5ms zoom level to see this - we are talking about quite short time periods here.
In systems which are throttled by vsync then we would expect the CPU and the GPU to go idle every frame, as they cannot render the next frame until the vsync signal occurs and a window buffer swap happens. The graph below shows what this would look like in Streamline:
If you are a platform integrator rather than an application developer, testing cases which are running at 60FPS can be a good way to review the effectiveness of your system's DVFS frequency choices. In the example above there is a large amount of time between each burst of activity. This implies that the DVFS frequency selected is too high and that the GPU is running much faster than it needs to, which reduces energy efficiency of the platform as a whole.
In a double-buffered system it is possible to have content which is not hitting 60 FPS, but which is still limited by vsync. This content will look much like the graph above, except the time between workloads will be a multiple of one frame period, and the visible framerate will be an exact division of the maximum screen refresh rate (e.g. a 60 FPS panel could run at 30 FPS, 20 FPS, 15 FPS, etc).
In a double-buffered system which is running at 60 FPS the GPU successfully manages to produce frames in time for each vsync buffer swap. In the figure below we see the lifetime of the two framebuffers (FB0 and FB1), with periods where they are on-screen in green, and periods where they are being rendered by the GPU in blue.
In a system where the GPU is not running fast enough to do this, we will miss one or more vsync deadlines, so the current front-buffer will remain on screen for another vsync period. At the point of the orange line in the diagram below the front-buffer is still being displayed on the screen, and the back-buffer is queued for display, the GPU has no more buffers to render on to and goes idle. Our performance snaps down to run at 30 FPS, despite having a GPU which is fast enough to run the content at over 45 FPS.
The Android windowing system typically uses triple buffering, so avoids this problem as the GPU has a spare buffer available to render on to, but this is still seen in some X11-based Mali deployments which are double buffered. If you see this issue it is recommended that you disable vsync while doing performing optimization; it is much easier to determine what needs optimizing without additional factors clouding the issue!
The second issue which you may see is a pipeline break. In this scenario at least one of the CPU or GPU processing parts are busy at any point, but not at the same time; some form of serialization point has been introduced.
In the example below the content is fragment dominated, so we would expect the fragment processing to be active all the time, but we see an oscillating activity which is serializing GPU vertex processing and fragment processing.
The most common reason for this is the use of an OpenGL ES API function which enforces the synchronous behavior of the API, forcing the driver to flush all of the pending operations and drain the rendering pipeline in order to honor the API requirements. The most common culprits here are:
It is almost impossible to make these API calls fast due to their pipeline draining semantics, so I would suggest avoiding these specific uses wherever possible. It is worth noting that OpenGL ES 3.0 allows glReadPixels to target a Pixel Buffer Object (PBO) which can do the pixel copy asynchronously. This no longer causes a pipeline flush, but may mean you have to wait a while for your data to arrive, and the memory transfer can still be relatively expensive.
The final issue I will talk about today is one where the GPU is not the bottleneck at all, but which often shows up as poor graphics performance.
We can only maintain the pipeline of frames if the CPU can produce new frames faster than the GPU consuming them. If the CPU takes 20ms to produce a frame which the GPU takes 5ms to render, then the pipeline will run empty each frame. In the example below the GPU is going idle every frame, but the CPU is running all of the time, which implies that the CPU cannot keep up with the GPU.
"Hang on" I hear you say, "that says the CPU is only 25% loaded". Streamline shows the total capacity of the system as 100%, so if you have 4 CPU cores in your system with one thread maxing out a single processor then this will show up as 25% load. If you click on the arrow in the top right of the "CPU Activity" graph's title box it will expand giving you separate load graphics per CPU core in the system:
As predicted we have one core maxed at 100% load, so this thread is the bottleneck in our system which is limiting the overall performance. There can be many reasons for this, but in terms of the graphics behavior rather than application inefficiency, the main two are:
Every draw call has a cost for the driver in terms of building control structures and submitting them to the GPU. The number of draw calls per frame should minimized by batching together drawing of objects with similar render state, although there is a balance to be struck between larger batches and efficient culling of things which are not visible. In terms of a target to aim for: most high-end 3D content on mobile today uses around 100 draw calls per render target, with many 2D games coming in around 20-30.
In terms of dynamic data upload be aware that every data buffer uploaded from client memory to the graphics server requires the driver to copy that data from a client buffer into a server buffer. If this is a new resource rather than sub-buffer update then the driver has to allocate the memory for the buffer too. The most common offender here is the use of client-side vertex attributes. Where possible use static Vertex Buffer Objects (VBOs) which are stored persistently in graphics memory, and use that buffer by reference in all subsequent rendering. This allows you to pay the upload cost once, and amortize that cost over many frames of rendering.
It some cases it may not be Mali graphics stack which is limiting the performance at all. We do sometimes get support cases where the application logic itself is taking more than 16.6ms, so the application could not hit 60 FPS even if the OpenGL ES calls were infinitely fast. DS-5 Streamline contains a very capable software profiler which can help you identify precisely where the bottlenecks are in your code, as well as helping you load balance workloads across multiple CPU cores in your system if you want to parallelize your software using multiple threads, but as this is not directly related to the Mali behavior I'm not going to dwell on it this time around.
Next time I will be reviewing the Mali driver's approach to render target management, and how to structure your application's use of Frame Buffer Objects (FBOs) to play nicely with this model.
Comments and questions welcome,
Read Part 2: How to correctly handle framebuffers
Just one comment, I believe the community edition of DS-5 does not allow you to see individual CPU cores, it just reports "Linux Scheduler" and gives you a percentage of total load.