Killing Pixels - A New Optimization for Shading on ARM Mali GPUs

September 11, 2013

Invisible pixels are expensive

Shading pixels is expensive, so you want to make sure that you don't spend time and energy shading pixels that will not actually make it to the screen. To address this, ARM® Mali™ GPUs are pioneering a novel optimization.

But before we jump to the solution, why do we have invisible pixels in the first place? For an exploration of when a pixel is and is not a pixel, you might also enjoy Ed Plowman's "Of Philosophy and When is a Pixel Not a Pixel?".

The colour of every pixel on the screen is determined by a shader program. Each object typically has a different program associated with it, and one thread of execution is spawned for every pixel in the object. Once launched, these threads are committed to complete (unless they execute a "discard" instruction to terminate themselves), and they then pass the calculated colour to the blending unit where it is combined with the existing pixel value in the output image.

The key problem here is overdraw - nearby objects will be drawn over more distant objects, hiding them. There's no point drawing the Emerald City on the horizon in huge detail if there's a hill in the foreground occluding (hiding) it. If you have already spent the time and effort rendering the emerald pixels before discovering that they will be overdrawn, then this is a waste of performance, time, battery life, and possibly karma.

Reducing the load

There are several existing approaches that aim to reduce the cost of overdrawn pixels. The first is for the application to use its knowledge of the scene to avoid sending hidden geometry to the graphics driver at all. This works well in closed, room-based games but requires additional logic in the game engine. For common classes of scene, it is also quite difficult to work out which objects are occluding others.
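As a rough illustration of application-side culling in a room-based game, here is a minimal sketch. The room names, the `VISIBLE_FROM` table and the `objects_to_submit` helper are all hypothetical and are not taken from any particular engine; a real engine would compute visibility rather than hard-code it.

```python
from typing import Dict, List, Set

# Hypothetical precomputed visibility: which rooms can be seen from each room.
VISIBLE_FROM: Dict[str, Set[str]] = {
    "hall": {"hall", "kitchen"},
    "kitchen": {"kitchen", "hall"},
    "cellar": {"cellar"},
}


def objects_to_submit(camera_room: str,
                      objects_by_room: Dict[str, List[str]]) -> List[str]:
    """Return only the objects in rooms visible from the camera's room."""
    # Unknown rooms fall back to "only the room the camera is in".
    visible = VISIBLE_FROM.get(camera_room, {camera_room})
    # Sorting the room names just makes the output order deterministic.
    return [obj
            for room in sorted(visible)
            for obj in objects_by_room.get(room, [])]


scene = {"hall": ["statue"], "kitchen": ["table"], "cellar": ["barrel"]}
print(objects_to_submit("hall", scene))  # ['statue', 'table']
```

The barrel in the cellar never reaches the driver, but the engine has had to build and maintain the visibility table, which is the "additional logic" mentioned above.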

Even if you do eliminate some of the more distant geometry, there will still be cases where the geometry you do draw is partly hidden. Perhaps there's an enemy player in the same room as you - you can see their helmet, but the rest of them is behind a crate. You don't want to shade the pixels for the whole character when only the top of the helmet is visible.

Using a simple depth buffer, together with "early" depth testing, it is possible to determine that the pixels from a more distant object are hidden by the pixels from a nearer one before we start shading them, provided the nearer object has already been drawn.

By sorting the objects in order of increasing distance, and drawing the nearest objects first, it is possible to help the process along and eliminate most of the hidden pixels in an overdrawn image. Of course, it's not possible to do the reverse, as the pipeline is not psychic and cannot know what is going to be drawn afterwards... but hold that thought.
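To make the effect of draw order concrete, here is a small toy model of early depth testing. It is only a sketch: the `(pixel, depth)` fragment representation and the `count_shaded` helper are assumptions made for illustration, not a description of any real pipeline.

```python
from typing import List, Tuple

# Hypothetical fragment model: (pixel_index, depth); smaller depth = nearer.
Fragment = Tuple[int, float]


def count_shaded(fragments: List[Fragment], num_pixels: int) -> int:
    """Count the fragments that pass an early depth test and get shaded."""
    depth_buffer = [float("inf")] * num_pixels
    shaded = 0
    for pixel, depth in fragments:
        if depth < depth_buffer[pixel]:  # early depth test passes
            depth_buffer[pixel] = depth
            shaded += 1                  # only now do we pay for shading
    return shaded


# Three opaque layers covering the same 4 pixels, at depths 1.0 (near) to 3.0 (far).
layers = [[(p, d) for p in range(4)] for d in (1.0, 2.0, 3.0)]
front_to_back = [f for layer in layers for f in layer]
back_to_front = [f for layer in reversed(layers) for f in layer]

print(count_shaded(front_to_back, 4))  # 4
print(count_shaded(back_to_front, 4))  # 12
```

Drawn nearest first, only the 4 visible pixels are shaded. Drawn farthest first, every one of the 12 fragments passes the test at the time it arrives, so all of them are shaded even though only 4 survive.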

But front-to-back sorting has some other problems.

For semi-transparent objects, front-to-back is exactly the wrong order to draw them in, as they need to be blended with the objects behind them. And just sorting the objects in the first place takes time. Even worse, the structure of modern graphics APIs (OpenGL ES® and Direct3D®) doesn't really include the concept of "object in a scene" at all, so you have to keep track of this yourself and draw in an acceptable order.
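The blending-order problem can be seen with a single colour channel and the usual "source over" blend. This is a minimal sketch; the `blend_over` helper and the colour and alpha values are made up for illustration.

```python
def blend_over(src_color: float, src_alpha: float, dst_color: float) -> float:
    """Standard 'source over' blend on a single colour channel."""
    return src_alpha * src_color + (1.0 - src_alpha) * dst_color


background = 0.0
far_glass = (1.0, 0.5)    # (colour, alpha), further from the camera
near_glass = (0.25, 0.5)  # (colour, alpha), nearer to the camera

# Back-to-front: the far pane is blended first, then the near pane on top.
correct = blend_over(*near_glass, blend_over(*far_glass, background))

# Front-to-back with the same blend: the far pane ends up "on top".
wrong = blend_over(*far_glass, blend_over(*near_glass, background))

print(correct)  # 0.375
print(wrong)    # 0.5625
```

The two orders give different colours, which is why a front-to-back sort that suits opaque geometry cannot simply be applied to semi-transparent objects.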

Another way to avoid work is to defer as much shading as possible. A quick first pass calculates only the depths and stores which object is in front at each pixel; the full lighting calculation runs only after all the pixels have been processed.
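The two-pass idea can be sketched as follows. This is a toy model under stated assumptions: the `(pixel, depth, object_id)` fragments and the `deferred_shade` function are hypothetical, and the "lighting" is reduced to a counter.

```python
from typing import List, Optional, Tuple

# Hypothetical fragment model: (pixel_index, depth, object_id); smaller depth = nearer.
Fragment = Tuple[int, float, str]


def deferred_shade(fragments: List[Fragment],
                   num_pixels: int) -> Tuple[List[Optional[str]], int]:
    """Return the front-most object per pixel and how many full shades ran."""
    depth = [float("inf")] * num_pixels
    front: List[Optional[str]] = [None] * num_pixels

    # Pass 1: cheap visibility only, no lighting.
    for pixel, d, obj in fragments:
        if d < depth[pixel]:
            depth[pixel] = d
            front[pixel] = obj

    # Pass 2: the expensive lighting runs once per covered pixel.
    shades = sum(1 for obj in front if obj is not None)
    return front, shades


# Worst-case order for early depth testing: far city first, near hill last.
frags = [(p, 3.0, "city") for p in range(4)] + [(p, 1.0, "hill") for p in range(2)]
front, shades = deferred_shade(frags, 4)
print(front)   # ['hill', 'hill', 'city', 'city']
print(shades)  # 4
```

Six fragments are submitted but only four full shades run, whatever the draw order. Note that this model keeps exactly one opaque object per pixel, so it has no way to represent a semi-transparent layer or a fragment that changes its own depth.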

This works extremely well. At least, it works well until you come across something that breaks the rules. Perhaps it's a pixel which writes its own depth. Perhaps it's a semi-transparent object. As soon as that happens, you have to fall back into a more "brute force" mode of operation in order to keep track of the additional data. The fail-over isn't soft, either: performance decreases markedly as soon as any special cases are detected, and these are becoming common as game engines strive for more and more realism.

And so, with the inevitability of a rhetorical question at the end of the introduction to a technology article, what can we do about it?

Forward Pixel Kill

Our answer is a patented technology known as Forward Pixel Kill (FPK), which is included in ARM Mali GPUs from Mali-T62X and T678 onwards (such as the Mali-T628 MP6 in the recently announced Samsung Exynos 5420).

In an FPK-enabled GPU, the threads that colour the pixels are not irrevocably committed to complete once they are launched. Calculations already in flight can be terminated at any time if we spot that a later thread will write opaque data to the same pixel location. Since each thread takes a finite time to complete, we have a window in time which we can exploit to kill pixels already in the pipeline. In effect, we exploit the depth of the pipeline to emulate the "psychic" seeing-into-the-future effect that I alluded to earlier.

In fact, it's possible to do even better than this. By adding a simple FIFO buffer to the start of the pipeline, we can extend the forward pixel kill zone, making it more likely to spot overdraw, and at the same time giving the pipeline the chance to kill threads before they are even started.
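The behaviour described above can be approximated with a toy simulation. This is a sketch under loud assumptions and not a description of the actual hardware: the `Thread` record, the `simulate_fpk` function and the single `window` parameter (standing in for pipeline depth plus any FIFO) are all invented for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass
class Thread:
    pixel: int
    depth: float
    killed: bool = False


def simulate_fpk(fragments: List[Tuple[int, float, bool]],
                 num_pixels: int, window: int) -> int:
    """Count threads that run to completion in a toy forward-kill pipeline.

    fragments: (pixel_index, depth, is_opaque); smaller depth = nearer.
    window:    how many threads can be queued or in flight at once (assumed).
    """
    depth_buffer = [float("inf")] * num_pixels
    in_flight: Deque[Thread] = deque()
    completed = 0
    for pixel, depth, opaque in fragments:
        if depth >= depth_buffer[pixel]:
            continue  # ordinary early depth test: never launched
        if opaque:
            depth_buffer[pixel] = depth
            # Forward kill: older threads behind this opaque pixel are doomed.
            for t in in_flight:
                if t.pixel == pixel and t.depth > depth:
                    t.killed = True
        in_flight.append(Thread(pixel, depth))
        if len(in_flight) > window:
            oldest = in_flight.popleft()  # oldest thread leaves the window
            if not oldest.killed:
                completed += 1
    # Drain whatever is still in the window at the end of the frame.
    completed += sum(1 for t in in_flight if not t.killed)
    return completed


# Three opaque layers over 4 pixels, drawn in the worst order (farthest first).
back_to_front = [(p, d, True) for d in (3.0, 2.0, 1.0) for p in range(4)]
print(simulate_fpk(back_to_front, 4, window=1))  # 12
print(simulate_fpk(back_to_front, 4, window=4))  # 4
```

With a window of 1, no thread is still around when its replacement arrives, so all 12 complete. With a window of 4, each hidden thread is still in the window when the nearer opaque fragment shows up, and only the 4 visible pixels are shaded, matching the front-to-back result without any sorting. A semi-transparent fragment (`is_opaque=False`) kills nothing in this model, so the content behind it is still shaded and blended as usual.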

This all works particularly well with a tile-based renderer such as the ARM Mali GPUs. With even a modest kill zone, this can produce results that are as good as the front-to-back drawing order, but without the requirement to sort the scene (with consequent overhead in silicon area, power and memory bandwidth). So there is no need to modify your application to add a sorting step. Also, since drawing proceeds in the same natural order, semi-transparent content works properly without expensive workarounds that degrade performance.

And the best thing is that the transition between operating regimes is soft - more like a steady speed adjustment than a gear change. Inconsistent frame rates (sometimes known as "jank") are extremely annoying to users, so any technique which significantly reduces the uncertainty in scene rendering time will be popular with users and developers alike.

Sean Ellis over 11 years ago

Maxim,
I don't want to comment on competing technologies directly. However, I think our approach has two very important key features. The first is its relative simplicity, which means low area overhead, and the second is the fact that it copes gracefully when it encounters primitives for which Forward Pixel Kill is not appropriate.
Sean.