Mali Performance 7: Accelerating 2D rendering using OpenGL ES

November 19, 2015

10 minute read time.

It's been a while since my last performance blog, but one of those lunchtime coffee discussions about a blog I'd like to write was turned into a working tech demo by wasimabbas overnight, so thanks to him for giving me the kick needed to pick up the digital quill again. This time around I look at 2D rendering, and what OpenGL ES can do to help ...

A significant amount of mobile content today is still 2D gaming or 2D user-interface applications, in which the applications render layers of sprites or UI elements to build up what is on screen. Nearly all of these applications are actually using OpenGL ES to perform the rendering, but few applications actually leverage the 3D functionality available, preferring the simplicity of a traditional back-to-front algorithm using blending to handle alpha transparencies.

This approach works, but doesn’t make any use of the 3D features of the hardware, and so in many cases makes the GPU work a lot harder than it needs to. The impacts of this will vary from poor performance to reduced battery life, depending on the GPU involved, and these impacts are amplified by the trend towards higher screen resolutions in mobile devices. This blog looks at some simple changes to sprite rendering engines which can make the rendering significantly faster, and more energy efficient, by leveraging some of the tools which a 3D rendering API provides.

Performance inefficiency of 2D content

In 2D games the OpenGL ES fragment shaders are usually trivial – interpolate a texture coordinate, load a texture sample, blend to the framebuffer – so there isn’t much there to optimize. Any performance optimization for this type of content is therefore mostly about finding ways to remove redundant work completely, so that the shader never even runs for some of the fragments.

The figure in the introduction section shows a typical blit of a square sprite onto a background layer; the outer parts of the shield sprite are transparent, the border region is partially transparent so it fades cleanly into the background without any aliasing artifacts, and the body of the sprite is opaque. These sprite blits are rendered on top of what is in the framebuffer in a back-to-front render order, with alpha blending enabled.

There are two main sources of inefficiency here:

Firstly, the substantial outer region around this sprite is totally transparent, and so has no impact on the output rendering at all, but takes time to process.
Secondly, the middle part of the sprite is totally opaque, completely obscuring many background pixels underneath it. The graphics driver cannot know ahead of time that the background will be obscured, so these background fragments have to be rendered by the GPU, wasting processing cycles and nanojoules of energy rendering something which is not usefully contributing to the final scene.

This is a relatively synthetic example with only a single layer of overdraw, but we see real applications where over half of all of the rendered fragments of a 1080p screen are redundant. If applications can use OpenGL ES in a different way to remove this redundancy then the GPU could render the applications faster, or use the performance headroom created to reduce the clock rate and operating voltage, and thus save a substantial amount of energy. Either of these outcomes sounds very appealing, so the question is how can application developers achieve this?

Test scene

For this blog we will be rendering a simple test scene consisting of a cover-flow style arrangement of the shield icon above, but the technique will work for any sprite set with opaque regions. Our test scene render looks like this:

… where each shield icon is actually a square sprite using alpha transparencies to hide the pieces which are not visible.

Tools of the trade

In traditional dedicated 2D rendering hardware there are not usually many options to play with; the application has to render the sprite layers from back to front to make sure blending functions correctly. In our case the applications are using a 3D API to render a 2D scene, so the question becomes what additional tools does the 3D API give the applications which can be used to remove redundant work?

The principal tool used in full 3D scene rendering to remove redundant work is the depth test. Every vertex in a triangle has a “Z” component in its position, which is emitted from the vertex shader. This Z value encodes how close that vertex is to the camera, and the rasterization process will interpolate the vertex values to assign a depth to each fragment which may need fragment shading. This fragment depth value can be tested against the existing value stored in the depth buffer and if it is not closer¹ to the camera than the current data already in the framebuffer then the GPU will discard the fragment without ever submitting it to the shader core for processing, as it now safely knows that it is not needed.

Using depth testing in “2D” rendering

Sprite rendering engines already track the layering of each sprite so that they stack correctly, so we can map this layer number to a Z coordinate value assigned to the vertices of each sprite which is sent to the GPU, and actually render our scene as if it has 3D depth. If we then use a framebuffer with a depth attachment, enable depth writes, and render the sprites and background image in front-to-back order (i.e. the reverse order of normal blitting pass which is back-to-front) then the depth test will remove parts of sprites and the background which are hidden behind other sprites.

If we run this for our simple test scene, we get:

Uh oh! Something has gone wrong.

The issue here is that our square sprite geometry does not exactly match the shape of opaque pixels. The transparent parts of the sprites closer to the camera are not producing any color values due to the alpha test, but are still setting a depth value. When the sprites on a lower layer are rendered the depth testing means that the pieces which should be visible underneath the transparent parts of an earlier sprite are getting incorrectly killed, and so only the OpenGL ES clear color is showing.

Sprite geometry

To fix this issue we need to invest some time into setting up some more useful geometry for our sprites. We can only safely set the depth value when rendering front-to-back for the pixels which are totally opaque in our sprite, so the sprite atlas generation needs to provide two sets of geometry for each sprite. One set, indicated by the green area in the middle image below, covers only the opaque geometry, and the second, indicated by the green area in the right image below, picks up everything else unless it is totally transparent (in which case it can be dropped completely).

Vertices are relatively expensive, so use as little additional geometry as possible when generating these geometry sets. The opaque region must only contain totally opaque pixels, but the transparent region can safely contain opaque pixels and totally transparent pixels without side-effects, so use rough approximations for a "good fit" without trying to get "best fit". Note for some sprites it isn’t worth generating the opaque region at all (there may be no opaque texels, or the area involved may be small), so some sprites may consist of only a single region rendered as a transparent render. As a rule of thumb, if your opaque region is smaller than 256 pixels it probably isn't worth bothering with the additional geometry complexity, but as always it's worth trying and seeing.

Generating this geometry can be relatively fiddly, but sprite texture atlases are normally static so this can be done offline as part of the application content authoring process, and does not need to be done live on the platform at run time.

Draw algorithm

With the two geometry sets for each sprite we can now render the optimized version of our test scene. First render all of the opaque sprites regions and the background from front-to-back, rendering with depth testing and depth writes enabled. This results in the output below:

Any area where one sprite or the background is hidden underneath another sprite is rendering work which has been saved, as that is an area which has been be removed by the early depth test before shading has occurred.

Having rendered all of the opaque geometry we can now render the transparent region for each sprite in a back-to-front order. Leave depth testing turned on, so that sprites on a lower layer don't overwrite an opaque region from a sprite in a logically higher layer which has already been rendered, but disable depth buffer writes to save a little bit of power.

If we clear the color output of the opaque stage, but keep its depth values, and then draw the transparent pass, we can visualize that the additional rendering added by this pass. This is show in the figure below:

Any area where one of the outer rings is partial indicates an area where work has been saved, as the missing part has been removed by the depth test using the depth value of an opaque sprite region closer to the camera which we rendered in the first drawing pass.

If we put it all together and render both passes to the same image then we arrive back at the same visual output as the original back-to-front render:

... but achieve that with around 35% fewer fragment threads started, which should translate approximately to a 35% drop in MHz required to render this scene. Success!

The final bit of operational logic needed is to ensure that the depth buffer we have added to the scene is not written back to memory. If your application is rendering directly to the EGL window surface then there is nothing to do here, as depth is implicitly discarded for window surfaces automatically, but if your engine is rendering to an off-screen FBO ensure that you add a call to glInvalidateFramebuffer() (OpenGL ES 3.0 or newer) or glDiscardFramebufferEXT() (OpenGL ES 2.0) before changing the FBO binding away from the offscreen target. See Mali Performance 2: How to Correctly Handle Framebuffers for more details.

Summary

In this blog we have looked at how the use of depth testing and depth-aware sprite techniques can be used to accelerate rendering using 3D graphics hardware significantly.

Adding additional geometry to provide the partition between opaque and transparent regions of each sprite does add complexity, so care must be taken to minimize the number of vertices for each sprite otherwise the costs of additional vertex processing and small triangle sizes will out-weigh the benefits. For cases where the additional geometry required is too complicated, or the screen region covered is too small, simply omit the opaque geometry and render the whole sprite as transparent.

It’s worth noting that this technique can also be used when rendering the 2D UI elements in 3D games. Render the opaque parts of the UI with a depth very close to near clip plane before rendering the 3D scene, then render the 3D scene as normal (any parts behind the opaque UI elements will be skipped), and finally the remaining transparent parts of the UI can be rendered and blended on top of the 3D output. To ensure that the 3D geometry does not intersect the UI elements glDepthRange() can be used to limit the range of depth values emitted by the 3D pass very slightly, guaranteeing that the UI elements are always closer to the near clip plane than the 3D rendering.

Tune in next time,

Pete

[1] Other depth test functions are possible in OpenGL ES, but this is the common usage which is analogous to the natural real world behaviour.

Pete Harris is the lead engineer for the Mali GPU performance analysis team at ARM. He enjoys spending his time working on a whiteboard to determining how to get the best out of complex graphics sub-systems, and how to make the ARM Mali GPUs even better.

Atomic Orange over 7 years ago

Hi Pete. Any ideas on ways to extract the 2 geometries automatically out of any give png file f.ex? Some script, or tool, or even service that can do this?
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Roberto Lopez Mendez over 9 years ago

Thanks Pete for this excellent blog! It is definitely a recommendation to consider when developing 2D content.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Mobile, Graphics, and Gaming blog

Unlock the power of SVE and SME with SIMD Loops

Vidya Praveen

SIMD Loops is an open-source project designed to help developers learn SVE and SME through hands-on experimentation. It offers a clear, practical pathway to mastering Arm’s most advanced SIMD technologies…
- September 19, 2025
What is Arm Performance Studio?

Jai Schrem

Arm Performance Studio gives developers free tools to analyze performance, debug graphics, and optimize apps on Arm platforms.
- August 27, 2025
How Neural Super Sampling works: Architecture, training, and inference

Liam O'Neil

A deep dive into a practical, ML-powered approach to temporal super sampling.
- August 12, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog