In this blog I will talk about energy efficiency in embedded GPUs and what an application programmer can do to improve the energy efficiency of their application. I have split the blog into two parts: in this first part I give an introduction to the topic of energy efficiency, and in the second part I will show some real SoC power measurements, using an in-house micro-benchmark, to demonstrate how strongly a variety of factors affect frame rendering time, external bandwidth and SoC power consumption.
Let's look first at what energy efficiency means from the GPU's perspective. At a high level, energy is consumed by the GPU and its associated driver in three different ways:

- GPU processing: every clock cycle the GPU spends on vertex and fragment work consumes energy
- External memory bandwidth: every byte transferred between the GPU and the external memory consumes energy
- CPU processing: the driver consumes CPU cycles on behalf of the application, and those cycles consume energy too
On most devices Vertical Synchronization (vsync) synchronizes the frame rate of an application with the screen refresh rate. Using vsync not only removes tearing, it also reduces power consumption by preventing the application from producing frames faster than the screen can display them. When vsync is enabled on the device the application cannot draw frames faster than the vsync rate (typically 60fps on modern devices, which we can keep as our working assumption in this discussion). On the other hand, to give the best possible user experience the application/GPU should not draw frames significantly slower than the vsync rate either. Therefore the device/GPU tries hard to keep the frame rate at a constant 60fps, while also using as little power as possible.
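On EGL based platforms the link between eglSwapBuffers() and vsync is controlled by the swap interval. The sketch below is only an illustration, assuming an already initialized EGL display; note that many devices (Android in particular) clamp or ignore the requested interval:

```c
#include <EGL/egl.h>
#include <stdio.h>

/* Request vsync-limited (interval 1) or unthrottled (interval 0) swaps.
 * Assumes eglInitialize() has succeeded and a context is current.
 * The driver may silently clamp the interval to a supported range. */
void set_swap_interval(EGLDisplay display, int vsync_enabled)
{
    if (eglSwapInterval(display, vsync_enabled ? 1 : 0) != EGL_TRUE)
        fprintf(stderr, "eglSwapInterval rejected: 0x%x\n", eglGetError());
}
```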
A device typically has power management functionality for both GPU and CPU in order to adjust their operating frequencies based on the current workload. This functionality is referred to as DVFS (Dynamic Voltage and Frequency Scaling). DVFS allows the device to handle both normal and peak workloads in an energy efficient fashion by adjusting the clock frequency to provide just enough performance for the current workload, which in turn allows the voltage to be dropped, as the transistors no longer need to be driven as hard to meet the more relaxed timing constraints. The energy consumed per clock is proportional to V², so if we drop the frequency enough to allow a voltage reduction of 20%, the energy per clock falls to 0.8² = 0.64 of its original value, a 36% improvement in energy efficiency. Using a higher clock frequency than needed means a higher voltage and consequently higher power consumption, so the power management tries to keep the clock frequency as low as possible while still keeping the frame rate at the vsync rate. When the GPU is under extremely high load some vendors allow the GPU to run at an overdrive frequency - a frequency which requires a voltage higher than the nominal voltage for the silicon process - which can provide a short performance boost but cannot be sustained for long periods. If high workload from an application keeps the GPU frequency overdriven for a long time, the SoC may overheat, and as a consequence the GPU is forced to a lower clock frequency to let the SoC cool down, even if the frame rate drops under 60fps. This behavior is referred to as thermal throttling.
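For reference, the 36% figure follows from the standard CMOS dynamic power model (a textbook identity, not anything Mali specific):

```latex
% Dynamic power: activity factor \alpha, switched capacitance C,
% supply voltage V, clock frequency f.
P_{\mathrm{dyn}} = \alpha C V^{2} f
\qquad\Rightarrow\qquad
E_{\mathrm{per\,clock}} = \frac{P_{\mathrm{dyn}}}{f} = \alpha C V^{2}

% A 20% voltage drop therefore scales the energy per clock by
(0.8)^{2} = 0.64, \quad \text{i.e. a 36\% saving.}
```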
Device vendors often differentiate their devices by making their own customizations to the power management. As a result two devices having the same GPU may have different power management functionality. The ARM® Mali™ GPU driver provides an API to SoC vendors that can be used for implementing power management logic based on the ongoing workload in the GPU.
In addition to DVFS, some systems may also adjust the number of active GPU cores to find the most energy efficient configuration for the given GPU workload. Typically, DVFS provides just a few available operating frequencies and enabling/disabling cores can be used for fine-tuning the processing capacity for the given workload to save power.
In its simplest form the power management is implemented locally for the GPU, i.e. the GPU power management is based only on the ongoing GPU workload and the temperature of the chip. This is not optimal, as there can be several other sub-systems on the chip, all "competing" with each other to get maximum performance for their own processing until the thermal limit is exceeded and every sub-system is forced to operate at a lower capacity. A more intelligent power management scheme maintains a power budget for the entire SoC and allocates power to the different sub-systems in a way that avoids thermal throttling.
From an application point of view, the power management functionality provided by the GPU/device means that the GPU/device always tries to match its processing capacity to the workload coming from the application. This adjustment happens automatically in the background, and as long as the application workload doesn't exceed the maximum capacity of the GPU, the frame rate stays constantly at the vsync rate regardless of the application workload. The only side effects of a high application workload are that the battery runs out faster and you can feel the dissipated energy as a higher device temperature.
Most applications don't create a higher workload than the GPU's maximum processing capacity, i.e. the power management is able to keep the frame rate constantly at the vsync level. The interval between two vsync points is 1/60 of a second (about 16.7 ms), and if the GPU completes a frame faster than that, it sits idle until the next frame starts. If the GPU constantly has lots of idle time before the next vsync point, the power management may decrease the GPU clock frequency to a lower level to save power.
Screenshot from Streamline of a GPU and CPU idling each frame when the DVFS frequency selected is too high
As the maximum processing capacity of modern GPUs keeps growing, an application developer often no longer needs to optimize the application for better performance, but instead for better energy efficiency, and that is the topic of this blog.
In order to be energy efficient the application should:

- use as few GPU cycles as possible to render each frame
- use as little external memory bandwidth as possible
- keep the CPU load caused by the application and the driver as low as possible
But hey, aren't these the same things that you used to focus on when optimizing your application for better performance? Yes, pretty much! Every GPU cycle, every byte of external memory traffic and every CPU cycle costs energy, so the less work per frame your application generates, the lower the clock frequency (and voltage) the power management can select while still holding 60fps.
So the task of improving energy efficiency becomes very similar to the task of optimizing the performance of an application. For that task you can find lots of useful tips in the Mali GPU Application Optimization Guide.
There is one topic that may require some more attention: how can you measure the energy efficiency of your application? Measuring the actual SoC power consumption might not be practical. It might also be problematic to measure the system FPS of your application if vsync is enabled on your device and you cannot turn it off.
ARM provides a tool called DS-5 Streamline for system-wide performance analysis. Using DS-5 Streamline for detecting performance bottlenecks is explained in peterharris's blog Mali Performance 1: Checking the Pipeline, in lorenzodalcol's blogs starting with Mali GPU Tools: A Case Study, Part 1 — Profiling Epic Citadel, and also in the Mali GPU Application Optimization Guide. In short, DS-5 Streamline allows you to measure the main components of energy efficiency with the following charts / HW counters:
- GPU cycles: how many cycles the GPU spends per frame on vertex and fragment processing, and how much of the time it sits idle
- External memory bandwidth: the amount of read and write traffic between the GPU and the external memory
- CPU load: how much CPU time is consumed by the application and the GPU driver
Another very useful tool for measuring GPU cycles is the Mali Offline Shader Compiler, which allows you to see how many GPU cycles are spent in the arithmetic, load/store and texture pipes of the shader core. Each cycle saved in the shader code means thousands or millions of cycles saved in each frame, as the shader is executed for every vertex or fragment.
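To put a rough number on "thousands/millions", here is a back-of-the-envelope estimate of my own, assuming a single full-screen pass at 1080p with no overdraw:

```latex
% Fragments shaded per frame at 1920 x 1080, one layer of shading:
1920 \times 1080 \approx 2.07 \times 10^{6} \text{ fragments}

% Saving one arithmetic cycle per fragment, at 60fps, frees up:
2.07 \times 10^{6} \tfrac{\text{cycles}}{\text{frame}}
\times 60 \tfrac{\text{frames}}{\text{s}}
\approx 1.2 \times 10^{8} \tfrac{\text{cycles}}{\text{s}}
```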
If you want to measure the performance of an application on a vsync limited device, you can do it by rendering the graphics offscreen using FBOs. This is the trick used by some benchmark applications to get rid of the vsync and resolution limitations in their performance measurements: the vsync limitation applies only to the onscreen frame buffer, not to offscreen frame buffers implemented with FBOs. You can measure performance by rendering to an FBO that has the same resolution and configuration (color and depth buffer bit depths) as the onscreen frame buffer. After setting up the FBO and binding it with glBindFramebuffer(), your rendering functions see no difference between the onscreen frame buffer and the FBO as render target. However, in order to make the performance measurement work correctly you need to do a few things:

- after rendering each offscreen frame, consume its result, for example by drawing the FBO contents down-sampled into the onscreen frame buffer; on a tile-based GPU this is what forces the driver to actually process the frame instead of eliminating it as dead rendering
- render several offscreen frames per onscreen frame, so that the single vsync limited eglSwapBuffers() call does not throttle the measurement
- compute the FPS yourself from the number of offscreen frames rendered, not from the number of swaps

The code sketches after this list show both the FBO setup and the measurement loop.
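First, the offscreen render target itself. This is a minimal OpenGL ES 2.0 sketch, not taken from any particular application; it assumes an RGBA8888 onscreen config and uses GL_DEPTH_COMPONENT16, the only depth renderbuffer format core ES 2.0 guarantees, so match both formats to your actual EGL config:

```c
#include <GLES2/gl2.h>

/* Create an offscreen render target mirroring the onscreen frame buffer:
 * an RGBA8888 color texture plus a 16-bit depth renderbuffer.
 * Returns the FBO name, or 0 on failure; the color texture name is
 * returned via out_color so it can later be drawn onscreen. */
GLuint create_offscreen_target(GLsizei width, GLsizei height,
                               GLuint *out_color)
{
    GLuint fbo, color, depth;

    /* Color attachment as a texture, so its contents can be sampled
     * when down-sampling to the onscreen frame buffer. CLAMP_TO_EDGE
     * keeps NPOT sizes legal in ES 2.0. */
    glGenTextures(1, &color);
    glBindTexture(GL_TEXTURE_2D, color);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);

    /* Depth attachment as a renderbuffer. */
    glGenRenderbuffers(1, &depth);
    glBindRenderbuffer(GL_RENDERBUFFER, depth);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT16,
                          width, height);

    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, color, 0);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                              GL_RENDERBUFFER, depth);

    if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
        return 0;  /* chosen formats are not renderable on this device */

    *out_color = color;
    return fbo;
}
```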
There is a small overhead in the performance measurement when using this method because of down-sampling the offscreen frames to the onscreen frame, but nevertheless it should give you quite representative FPS results without the vsync limitation.
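Putting it together, the measurement loop could look roughly like the sketch below. The helpers draw_scene(), draw_downsampled_quad() and swap_buffers() are hypothetical stand-ins for your application's own rendering and EGL code:

```c
#include <GLES2/gl2.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical application helpers assumed to exist:
 *   draw_scene()            - renders one frame of your content
 *   draw_downsampled_quad() - draws the FBO color texture as a small quad
 *   swap_buffers()          - wraps eglSwapBuffers() */
extern void draw_scene(void);
extern void draw_downsampled_quad(GLuint color_texture);
extern void swap_buffers(void);

void measure_fps(GLuint fbo, GLuint color_texture,
                 int frames_per_swap, int total_frames,
                 int fbo_w, int fbo_h, int win_w, int win_h)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int frame = 0; frame < total_frames; ++frame) {
        /* Render the frame offscreen, free of the vsync limit. */
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glViewport(0, 0, fbo_w, fbo_h);
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        draw_scene();

        /* Consume the result onscreen so the driver really processes it. */
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
        glViewport(0, 0, win_w, win_h);
        draw_downsampled_quad(color_texture);

        /* Swap only every Nth frame; only the swap is vsync limited. */
        if ((frame + 1) % frames_per_swap == 0)
            swap_buffers();
    }
    glFinish();  /* wait until all queued rendering has actually finished */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("offscreen FPS: %.1f\n", total_frames / secs);
}
```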
You might ask how significant an energy saving you can really get by optimizing your application. We will focus on that in the next part of this blog, Energy Efficiency in GPU Applications, Part 2, where I will present a small micro-benchmark that will show how much you can reduce real SoC power consumption by optimizing your application.