Chinese version of this blog
Optimization of graphics workloads is often essential to many modern mobile applications, as almost all rendering is now handled directly or indirectly by an OpenGL ES based rendering back-end. One of my colleagues, Michael McGeagh, recently posted a work guide on getting the Arm DS-5 Streamline profiling tools working with the Google Nexus 10 for the purposes of profiling and optimizing graphical applications using the Mali-T604 GPU. Streamline is a powerful tool giving high resolution visibility of the entire system’s behavior, but it requires the engineer driving it to interpret the data, identify the problem area, and subsequently propose a fix.
For developers who are new to graphics optimization it is fair to say that there is a little bit of a learning curve when first starting out, so this new series of blogs is all about giving content developers the essential knowledge they need to successfully optimize for Mali GPUs. Over the course of the series, I will explore the fundamental macro-scale architectural structures and behaviors developers have to worry about, how this translates into possible problems which can be triggered by content, and finally how to spot them in Streamline.
The most essential piece of knowledge which is needed to successfully analyze the graphics performance of an application is a mental model of how the system beneath the OpenGL ES API functions, enabling an engineer to reason about the behavior they observe.
To avoid swamping developers in implementation details of the driver software and hardware subsystem, which they have no control over and which is therefore of limited value, it is useful to define a simplified abstract machine which can be used as the basis for explanations of the behaviors observed. There are three useful parts to this machine, and they are mostly orthogonal so I will cover each in turn over the first few blogs in this series, but just so you know what to look forward to the three parts of the model are:
In this blog we will look at the first of these, the CPU-GPU rendering pipeline.
The most fundamental piece of knowledge which is important to understand is the temporal relationship between the application’s function calls at the OpenGL ES API and the execution of the rendering operations those API calls require. The OpenGL ES API is specified as a synchronous API from the application perspective. The application makes a series of function calls to set up the state needed by its next drawing task, and then calls a glDraw[1] function — commonly called a draw call — to trigger the actual drawing operation. As the API is synchronous all subsequent API behavior after the draw call has been made is specified to behave as if that rendering operation has already happened, but on nearly all hardware-accelerated OpenGL ES implementations this is an elaborate illusion maintained by the driver stack.
In a similar fashion to the draw calls, the second illusion that is maintained by the driver is the end-of-frame buffer flip. Most developers first writing an OpenGL ES application will tell you that calling eglSwapBuffers swaps the front and back-buffer for their application. While this is logically true, the driver again maintains the illusion of synchronicity; on nearly all platforms the physical buffer swap may happen a long time later.
The reason for needing to create this illusion at all is, as you might expect, performance. If we forced the rendering operations to actually happen synchronously you would end up with the GPU idle when the CPU was busy creating the state for the next draw operation, and the CPU idle while the GPU was rendering. For a performance critical accelerator all of this idle time is obviously not an acceptable state of affairs.
To remove this idle time we use the OpenGL ES driver to maintain the illusion of synchronous rendering behavior, while actually processing rendering and frame swaps asynchronously under the hood. By running asynchronously we can build a small backlog of work, allowing a pipeline to be created where the GPU is processing older workloads from one end of the pipeline, while the CPU is busy pushing new work into the other. The advantage of this approach is that, provided we keep the pipeline full, there is always work available to run on the GPU giving the best performance.
The units of work in the Mali GPU pipeline are scheduled on a per render-target basis, where a render target may be a window surface or an off-screen render buffer. A single render target is processed in a two step process. First, the GPU processes the vertex shading[2] for all draw calls in the render target, and second, the fragment shading[3] for the entire render target is processed. The logical rendering pipeline for Mali is therefore a three-stage pipeline of: CPU processing, geometry processing, and fragment processing stages.
An observant reader may have noticed that the fragment work in the figure above is the slowest of the three operations, lagging further and further behind the CPU and geometry processing stages. This situation is not uncommon; most content will have far more fragments to shade than vertices, so fragment shading is usually the dominant processing operation.
In reality it is desirable to minimize the amount of latency from the CPU work completing to the frame being rendered – nothing is more frustrating to an end user than interacting with a touch screen device where their touch event input and the data on-screen are out of sync by a few 100 milliseconds – so we don’t want the backlog of work waiting for the fragment processing stage to grow too large. In short we need some mechanism to slow down the CPU thread periodically, stopping it queuing up work when the pipeline is already full-enough to keep the performance up.
This throttling mechanism is normally provided by the host windowing system, rather than by the graphics driver itself. On Android for example we cannot process any draw operations in a frame until we know the buffer orientation, because the user may have rotated their device, changing the frame size. SurfaceFlinger — the Android window surface manager – can control the pipeline depth simply by refusing to return a buffer to an application’s graphics stack if it already has more than N buffers queued for rendering.
If this situation occurs you would expect to see the CPU going idle once per frame as soon as “N” is reached, blocking inside an EGL or OpenGL ES API function until the display consumes a pending buffer, freeing up one for new rendering operations.
This same scheme also limits the pipeline buffering if the graphics stack is running faster than the display refresh rate; in this scenario content is "vsync limited" waiting for the vertical blank (vsync) signal which tells the display controller it can switch to the next front-buffer. If the GPU is producing frames faster than the display can show them then SurfaceFlinger will accumulate a number of buffers which have completed rendering but which still need showing on the screen; even though these buffers are no longer part of the Mali pipeline, they count towards the N frame limit for the application process.
As you can see in the pipeline diagram above, if content is vsync limited it is common to have periods where both the CPU and GPU are totally idle. Platform dynamic voltage and frequency scaling (DVFS) will typically try to reduce the current operating frequency in these scenarios, allowing reduced voltage and energy consumption, but as DVFS frequency choices are often relatively coarse some amount of idle time is to be expected.
In this blog we have looked at synchronous illusion provided by the OpenGL ES API, and the reasons for actually running an asynchronous rendering pipeline beneath the API. Tune in next time, and I’ll continue to develop the abstract machine further, looking at the Mali GPU’s tile-based rendering approach.
Comments and questions welcomed,
Pete
Hi Peter,
From my perspective as a hardware verification engineer, this is an interesting blog as it hints at performance requirements that the software and drivers place upon the underlying hardware; not just the CPU and GPU, but also the interconnect and the interrupt controller which I guess will have a big impact on the ability of the software to finely co-ordinate the operations that are involved.
It would be interesting if you give give typical timings or clock counts for some of these operations to put them some kind time/resource consumption perspective.
I'm looking forward to reading the follow-on blog
Thanks!
Stewart