In this blog I will talk about energy efficiency in embedded GPUs and what an application programmer can do to improve the energy efficiency of their application. I have split the blog into two parts: in this first part I give an introduction to the topic of energy efficiency, and in the second part I will show some real SoC power measurements, using an in-house micro-benchmark, to demonstrate how strongly a variety of factors affect frame rendering time, external bandwidth and SoC power consumption.
Let's look first at what energy efficiency means from the GPU's perspective. At a high level, energy is consumed by the GPU and its associated driver in three different ways:

- GPU processing: every clock cycle the GPU spends on vertex and fragment work consumes energy
- External memory bandwidth: every byte transferred between the GPU and the external memory consumes energy
- CPU processing: the driver consumes CPU cycles on behalf of the application, and those cycles consume energy too
On most devices Vertical Synchronization (vsync) synchronizes the frame rate of an application with the screen refresh rate. Using vsync not only removes tearing, it also reduces power consumption by preventing the application from producing frames faster than the screen can display them. When vsync is enabled on the device the application cannot draw frames faster than the vsync rate (typically 60fps on modern devices, which we can keep as our working assumption in this discussion). On the other hand, to give the best possible user experience the application/GPU should not draw frames significantly slower than the vsync rate either. Therefore the device/GPU tries hard to keep the frame rate at a constant 60fps, while also using as little power as possible.
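On EGL based platforms the link between eglSwapBuffers() and vsync is controlled by the swap interval. The sketch below is only an illustration, assuming an already initialized EGL display; note that many devices (Android in particular) clamp or ignore the requested interval:

```c
#include <EGL/egl.h>
#include <stdio.h>

/* Request vsync-limited (interval 1) or unthrottled (interval 0) swaps.
 * Assumes eglInitialize() has succeeded and a context is current.
 * The driver may silently clamp the interval to a supported range. */
void set_swap_interval(EGLDisplay display, int vsync_enabled)
{
    if (eglSwapInterval(display, vsync_enabled ? 1 : 0) != EGL_TRUE)
        fprintf(stderr, "eglSwapInterval rejected: 0x%x\n", eglGetError());
}
```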
A device typically has power management functionality for both GPU and CPU in order to adjust their operating frequencies based on the current workload. This functionality is referred to as DVFS (Dynamic Voltage and Frequency Scaling). DVFS allows the device to handle both normal and peak workloads in an energy efficient fashion by adjusting the clock frequency to provide just enough performance for the current workload, which in turn allows the voltage to be dropped, as the transistors no longer need to be driven as hard to meet the more relaxed timing constraints. The energy consumed per clock is proportional to V², so if we drop the frequency enough to allow a voltage reduction of 20%, the energy per clock falls to 0.8² = 0.64 of its original value, a 36% improvement in energy efficiency. Using a higher clock frequency than needed means a higher voltage and consequently higher power consumption, so the power management tries to keep the clock frequency as low as possible while still keeping the frame rate at the vsync rate. When the GPU is under extremely high load some vendors allow the GPU to run at an overdrive frequency - a frequency which requires a voltage higher than the nominal voltage for the silicon process - which can provide a short performance boost but cannot be sustained for long periods. If high workload from an application keeps the GPU frequency overdriven for a long time, the SoC may overheat, and as a consequence the GPU is forced to a lower clock frequency to let the SoC cool down, even if the frame rate drops under 60fps. This behavior is referred to as thermal throttling.
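For reference, the 36% figure follows from the standard CMOS dynamic power model (a textbook identity, not anything Mali specific):

```latex
% Dynamic power: activity factor \alpha, switched capacitance C,
% supply voltage V, clock frequency f.
P_{\mathrm{dyn}} = \alpha C V^{2} f
\qquad\Rightarrow\qquad
E_{\mathrm{per\,clock}} = \frac{P_{\mathrm{dyn}}}{f} = \alpha C V^{2}

% A 20% voltage drop therefore scales the energy per clock by
(0.8)^{2} = 0.64, \quad \text{i.e. a 36\% saving.}
```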
Device vendors often differentiate their devices by making their own customizations to the power management. As a result two devices having the same GPU may have different power management functionality. The ARM® Mali™ GPU driver provides an API to SoC vendors that can be used for implementing power management logic based on the ongoing workload in the GPU.
In addition to DVFS, some systems may also adjust the number of active GPU cores to find the most energy efficient configuration for the given GPU workload. Typically, DVFS provides just a few available operating frequencies and enabling/disabling cores can be used for fine-tuning the processing capacity for the given workload to save power.
In its simplest form the power management is implemented locally for the GPU, i.e. the GPU power management is based only on the ongoing GPU workload and the temperature of the chip. This is not optimal, as there can be several other sub-systems on the chip, all "competing" with each other to get maximum performance for their own processing until the thermal limit is exceeded and every sub-system is forced to operate at a lower capacity. A more intelligent power management scheme maintains a power budget for the entire SoC and allocates power to the different sub-systems in a way that avoids thermal throttling.
From an application point of view, the power management functionality provided by the GPU/device means that the GPU/device always tries to match its processing capacity to the workload coming from the application. This adjustment happens automatically in the background, and as long as the application workload doesn't exceed the maximum capacity of the GPU, the frame rate stays constantly at the vsync rate regardless of the application workload. The only side effects of a high application workload are that the battery runs out faster and you can feel the dissipated energy as a higher device temperature.
Most applications don't create a higher workload than the GPU's maximum processing capacity, i.e. the power management is able to keep the frame rate constantly at the vsync level. The interval between two vsync points is 1/60 of a second (about 16.7 ms), and if the GPU completes a frame faster than that, it sits idle until the next frame starts. If the GPU constantly has lots of idle time before the next vsync point, the power management may decrease the GPU clock frequency to a lower level to save power.
Screenshot from Streamline of a GPU and CPU idling each frame when the DVFS frequency selected is too high
As the maximum processing capacity of modern GPUs keeps growing, an application developer often no longer needs to optimize the application for better performance, but instead for better energy efficiency, and that is the topic of this blog.
In order to be energy efficient the application should:

- use as few GPU cycles as possible to render each frame
- use as little external memory bandwidth as possible
- keep the CPU load caused by the application and the driver as low as possible
But hey, aren't these the same things that you used to focus on when optimizing your application for better performance? Yes, pretty much! Every GPU cycle, every byte of external memory traffic and every CPU cycle costs energy, so the less work per frame your application generates, the lower the clock frequency (and voltage) the power management can select while still holding 60fps.
So the task of improving energy efficiency becomes very similar to the task of optimizing the performance of an application. For that task you can find lots of useful tips in the Mali GPU Application Optimization Guide.
There is one topic that may require some more attention: how can you measure the energy efficiency of your application? Measuring the actual SoC power consumption might not be practical. It might also be problematic to measure the system FPS of your application if vsync is enabled on your device and you cannot turn it off.
ARM provides a tool called DS-5 Streamline for system-wide performance analysis. Using DS-5 Streamline for detecting performance bottlenecks is explained in peterharris's blog Mali Performance 1: Checking the Pipeline, in lorenzodalcol's blogs starting with Mali GPU Tools: A Case Study, Part 1 — Profiling Epic Citadel, and also in the Mali GPU Application Optimization Guide. In short, DS-5 Streamline allows you to measure the main components of energy efficiency with the following charts / HW counters:
- GPU cycles: how many cycles the GPU spends per frame on vertex and fragment processing, and how much of the time it sits idle
- External memory bandwidth: the amount of read and write traffic between the GPU and the external memory
- CPU load: how much CPU time is consumed by the application and the GPU driver
Another very useful tool for measuring GPU cycles is the Mali Offline Shader Compiler, which allows you to see how many GPU cycles are spent in the arithmetic, load/store and texture pipes of the shader core. Each cycle saved in the shader code means thousands or millions of cycles saved in each frame, as the shader is executed for every vertex or fragment.
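To put a rough number on "thousands/millions", here is a back-of-the-envelope estimate of my own, assuming a single full-screen pass at 1080p with no overdraw:

```latex
% Fragments shaded per frame at 1920 x 1080, one layer of shading:
1920 \times 1080 \approx 2.07 \times 10^{6} \text{ fragments}

% Saving one arithmetic cycle per fragment, at 60fps, frees up:
2.07 \times 10^{6} \tfrac{\text{cycles}}{\text{frame}}
\times 60 \tfrac{\text{frames}}{\text{s}}
\approx 1.2 \times 10^{8} \tfrac{\text{cycles}}{\text{s}}
```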
If you want to measure the performance of an application on a vsync limited device, you can do it by rendering the graphics offscreen using FBOs. This is the trick used by some benchmark applications to get rid of the vsync and resolution limitations in their performance measurements: the vsync limitation applies only to the onscreen frame buffer, not to offscreen frame buffers implemented with FBOs. You can measure performance by rendering to an FBO that has the same resolution and configuration (color and depth buffer bit depths) as the onscreen frame buffer. After setting up the FBO and binding it with glBindFramebuffer(), your rendering functions see no difference between the onscreen frame buffer and the FBO as render target. However, in order to make the performance measurement work correctly you need to do a few things:

- after rendering each offscreen frame, consume its result, for example by drawing the FBO contents down-sampled into the onscreen frame buffer; on a tile-based GPU this is what forces the driver to actually process the frame instead of eliminating it as dead rendering
- render several offscreen frames per onscreen frame, so that the single vsync limited eglSwapBuffers() call does not throttle the measurement
- compute the FPS yourself from the number of offscreen frames rendered, not from the number of swaps

The code sketches after this list show both the FBO setup and the measurement loop.
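First, the offscreen render target itself. This is a minimal OpenGL ES 2.0 sketch, not taken from any particular application; it assumes an RGBA8888 onscreen config and uses GL_DEPTH_COMPONENT16, the only depth renderbuffer format core ES 2.0 guarantees, so match both formats to your actual EGL config:

```c
#include <GLES2/gl2.h>

/* Create an offscreen render target mirroring the onscreen frame buffer:
 * an RGBA8888 color texture plus a 16-bit depth renderbuffer.
 * Returns the FBO name, or 0 on failure; the color texture name is
 * returned via out_color so it can later be drawn onscreen. */
GLuint create_offscreen_target(GLsizei width, GLsizei height,
                               GLuint *out_color)
{
    GLuint fbo, color, depth;

    /* Color attachment as a texture, so its contents can be sampled
     * when down-sampling to the onscreen frame buffer. CLAMP_TO_EDGE
     * keeps NPOT sizes legal in ES 2.0. */
    glGenTextures(1, &color);
    glBindTexture(GL_TEXTURE_2D, color);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);

    /* Depth attachment as a renderbuffer. */
    glGenRenderbuffers(1, &depth);
    glBindRenderbuffer(GL_RENDERBUFFER, depth);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT16,
                          width, height);

    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, color, 0);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                              GL_RENDERBUFFER, depth);

    if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        return 0;  /* chosen formats are not renderable on this device */

    *out_color = color;
    return fbo;
}
```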
There is a small overhead in the performance measurement when using this method because of down-sampling the offscreen frames to the onscreen frame, but nevertheless it should give you quite representative FPS results without the vsync limitation.
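Putting it together, the measurement loop could look roughly like the sketch below. The helpers draw_scene(), draw_downsampled_quad() and swap_buffers() are hypothetical stand-ins for your application's own rendering and EGL code:

```c
#include <GLES2/gl2.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical application helpers assumed to exist:
 *   draw_scene()            - renders one frame of your content
 *   draw_downsampled_quad() - draws the FBO color texture as a small quad
 *   swap_buffers()          - wraps eglSwapBuffers() */
extern void draw_scene(void);
extern void draw_downsampled_quad(GLuint color_texture);
extern void swap_buffers(void);

void measure_fps(GLuint fbo, GLuint color_texture,
                 int frames_per_swap, int total_frames,
                 int fbo_w, int fbo_h, int win_w, int win_h)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int frame = 0; frame < total_frames; ++frame) {
        /* Render the frame offscreen, free of the vsync limit. */
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glViewport(0, 0, fbo_w, fbo_h);
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        draw_scene();

        /* Consume the result onscreen so the driver really processes it. */
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
        glViewport(0, 0, win_w, win_h);
        draw_downsampled_quad(color_texture);

        /* Swap only every Nth frame; only the swap is vsync limited. */
        if ((frame + 1) % frames_per_swap == 0)
            swap_buffers();
    }
    glFinish();  /* wait until all queued rendering has actually finished */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("offscreen FPS: %.1f\n", total_frames / secs);
}
```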
You might ask how significant an energy saving you can really get by optimizing your application. We will focus on that in the next part of this blog, Energy Efficiency in GPU Applications, Part 2, where I will present a small micro-benchmark that will show how much you can reduce real SoC power consumption by optimizing your application.