Killing Pixels - A New Optimization for Shading on ARM Mali GPUs

September 11, 2013


Invisible pixels are expensive

Shading pixels is expensive, so you want to make sure that you don't spend time and energy shading pixels that will not actually make it to the screen. To address this, ARM® Mali™ GPUs are pioneering a novel optimization.

But before we jump to the solution, why do we have invisible pixels in the first place? For an exploration of when pixels are and are not, you might also enjoy Ed Plowman's "Of Philosophy and When is a Pixel Not a Pixel?"

The colour of every pixel on the screen is determined by a shader program. Each object typically has a different program associated with it, and one thread of execution is spawned for every pixel in the object. Once launched, these threads are committed to complete (unless they execute a "discard" instruction to terminate themselves), and they then pass the calculated colour to the blending unit where it is combined with the existing pixel value in the output image.

The key problem here is overdraw - nearby objects will be drawn over more distant objects, hiding them. There's no point drawing the Emerald City on the horizon in huge detail if there's a hill in the foreground occluding (hiding) it. If you have already spent the time and effort rendering the emerald pixels before discovering that they will be overdrawn, then this is a waste of performance, time, battery life, and possibly karma.

Reducing the load

There are several existing approaches that aim to reduce the cost of overdrawn pixels. The first is for the application to use its knowledge of the scene to avoid sending hidden geometry to the graphics driver at all. This works well in closed, room-based games but requires additional logic in the game engine. For common classes of scene, it is also quite difficult to work out which objects are occluding others.

Even if you do eliminate some of the more distant geometry, there will still be cases where the geometry you do draw is hidden. Perhaps there's an enemy player in the same room as you - you can see their helmet, but the rest of them is behind a crate. You don't want to shade the pixels for the whole character when only the top of their helmet is visible.

Using a simple depth-buffer, together with "early" depth testing, it is possible to determine that the pixels from a more distant object are hidden by the pixels from the nearer one before we start shading the pixels.
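As a rough illustration of the idea, here is a toy software model of early depth testing. It is only a sketch: the function name `draw_with_early_z`, the fragment layout and the example scene are all made up for this example, and it assumes every fragment is opaque. Real hardware does this in fixed-function logic, not in Python.

```python
# Toy model of early depth testing for opaque fragments (not real GPU behaviour).
# A fragment is (pixel, depth, colour); a smaller depth means nearer the camera.

def draw_with_early_z(fragments, shade):
    """Shade only fragments that pass the depth test; return (image, shaded_count)."""
    depth_buffer = {}  # pixel -> nearest depth seen so far
    image = {}         # pixel -> shaded colour
    shaded = 0
    for pixel, depth, colour in fragments:
        # Early test: reject the fragment before running the (expensive) shader.
        if depth >= depth_buffer.get(pixel, float("inf")):
            continue
        depth_buffer[pixel] = depth
        image[pixel] = shade(colour)
        shaded += 1
    return image, shaded


# Hypothetical scene: a near hill and a distant city covering the same two pixels.
hill = [((0, 0), 1.0, "green"), ((1, 0), 1.0, "green")]
city = [((0, 0), 9.0, "emerald"), ((1, 0), 9.0, "emerald")]

image, shaded = draw_with_early_z(hill + city, shade=lambda c: c)
print(shaded)  # 2: the city fragments are rejected before shading
```

Note that the saving depends on the hill being drawn first. If the city is submitted first, all four fragments are shaded, because nothing nearer is in the depth buffer yet.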

By sorting the objects in order of increasing distance, and drawing the nearest objects first, it is possible to help the process along and eliminate most of the hidden pixels in an overdrawn image. Of course, it's not possible to do the reverse, as the pipeline is not psychic and cannot know what is going to be drawn afterwards... but hold that thought.
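The effect of draw order can be shown with a small counting experiment. This is a sketch under simplifying assumptions: each object is a flat, opaque layer at a single depth, and `count_shaded` is a hypothetical helper that simply counts how many fragments survive an early depth test.

```python
def count_shaded(objects):
    """Count fragments that survive an early depth test, drawing objects in list order.

    objects: list of (depth, pixels); a smaller depth means nearer the camera.
    """
    nearest = {}  # pixel -> nearest depth drawn so far
    shaded = 0
    for depth, pixels in objects:
        for pixel in pixels:
            if depth < nearest.get(pixel, float("inf")):
                nearest[pixel] = depth
                shaded += 1
    return shaded


# Three hypothetical opaque layers, each covering the same four pixels.
pixels = [(x, 0) for x in range(4)]
scene = [(9.0, pixels), (5.0, pixels), (1.0, pixels)]  # submitted back to front

back_to_front = count_shaded(scene)
front_to_back = count_shaded(sorted(scene, key=lambda obj: obj[0]))
print(back_to_front, front_to_back)  # 12 4
```

In this worst case, sorting cuts the shading work to a third, which is why the order matters so much when early depth testing is the only defence against overdraw.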

But front-to-back sorting has some other problems.

For semi-transparent objects, front-to-back is exactly the wrong order to draw them in, as they need to be blended with the objects behind them. And just sorting the objects in the first place takes time. Even worse, the structure of modern graphics APIs (OpenGL ES® and Direct3D®) doesn't really include the concept of "object in a scene" at all, so you have to keep track of this yourself and draw in an acceptable order.

Another way to avoid work is to defer as much shading as possible, by first running a quick pass that just calculates the depths and stores the data about which object is in front at each pixel, and only after all the pixels have been calculated, running the full lighting calculation.
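A minimal sketch of this two-pass idea follows. The function `render_deferred` and the object layout are illustrative assumptions, not a description of any particular GPU or engine, and the model assumes every object is opaque and that no shader writes its own depth.

```python
def render_deferred(objects, shade):
    """Two-pass toy renderer.

    objects: list of (name, depth, pixels); a smaller depth means nearer the camera.
    Returns (image, shader_invocations).
    """
    # Pass 1: cheap visibility pass, recording the nearest object at each pixel.
    visible = {}  # pixel -> (depth, name)
    for name, depth, pixels in objects:
        for pixel in pixels:
            if pixel not in visible or depth < visible[pixel][0]:
                visible[pixel] = (depth, name)

    # Pass 2: run the expensive shader exactly once per covered pixel.
    image = {pixel: shade(name) for pixel, (_, name) in visible.items()}
    return image, len(image)


# Hypothetical scene: the hill hides two of the city's three pixels.
objects = [
    ("city", 9.0, [(0, 0), (1, 0), (2, 0)]),
    ("hill", 1.0, [(0, 0), (1, 0)]),
]
image, invocations = render_deferred(objects, shade=str.upper)
print(invocations)  # 3: one shader run per covered pixel, regardless of draw order
```

The single `visible` entry per pixel is also where the model's limits show: it can only remember one object per pixel, so anything that needs to blend with what lies behind it does not fit.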

This works extremely well. At least, it works well until you come across something that breaks the rules. Perhaps it's a pixel which writes its own depth. Perhaps it's a semi-transparent object. As soon as that happens, you have to fall back into a more "brute force" mode of operation in order to keep track of the additional data. The fail-over isn't soft, either: performance decreases markedly as soon as any special cases are detected, and these are becoming common as game engines strive for more and more realism.

And so, with the inevitability of a rhetorical question at the end of the introduction to a technology article, what can we do about it?

Forward Pixel Kill

Our answer is a patented technology known as Forward Pixel Kill (FPK), which is included in ARM Mali GPUs from Mali-T62X and T678 onwards (such as the Mali-T628 MP6 in the recently announced Samsung Exynos 5420).

In an FPK-enabled GPU, the threads that colour the pixels are not irrevocably committed to complete once they are launched. Calculations already in flight can be terminated at any time if we spot that a later thread will write opaque data to the same pixel location. Since each thread takes a finite time to complete, we have a window in time which we can exploit to kill pixels already in the pipeline. In effect, we exploit the depth of the pipeline to emulate the "psychic" seeing-into-the-future effect that I alluded to earlier.

In fact, it's possible to do even better than this. By adding a simple FIFO buffer to the start of the pipeline, we can extend the forward pixel kill zone, making it more likely to spot overdraw, and at the same time giving the pipeline the chance to kill threads before they are even started.
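To make the kill zone idea concrete, here is a deliberately simplified software simulation. It is not how the hardware is implemented: `simulate_fpk`, the fragment tuples and the single `window` parameter (standing in for FIFO length plus pipeline depth) are all assumptions made for illustration. A fragment counts as completed once it leaves the window without having been killed.

```python
from collections import deque


def simulate_fpk(fragments, window):
    """Return how many fragments run to completion in a toy forward-kill pipeline.

    fragments: iterable of (pixel, depth, opaque) in submission order;
               a smaller depth means nearer the camera.
    window:    how many fragments may be queued or in flight at once.
    """
    nearest = {}         # pixel -> nearest opaque depth seen so far (early depth test)
    in_flight = deque()  # fragments that can still be killed
    completed = 0
    for pixel, depth, opaque in fragments:
        if depth >= nearest.get(pixel, float("inf")):
            continue  # hidden by something already drawn: never launched
        if opaque:
            nearest[pixel] = depth
            # Forward kill: older fragments behind this one at the same pixel
            # can no longer contribute to the final image.
            in_flight = deque(
                f for f in in_flight if f[0] != pixel or f[1] <= depth
            )
        in_flight.append((pixel, depth, opaque))
        if len(in_flight) > window:
            in_flight.popleft()  # the oldest fragment leaves the kill zone
            completed += 1
    return completed + len(in_flight)


# Three hypothetical opaque layers over the same four pixels, drawn back to front.
scene = [((x, 0), d, True) for d in (9.0, 5.0, 1.0) for x in range(4)]
print([simulate_fpk(scene, w) for w in (0, 2, 4)])  # [12, 12, 4]
```

With no window, or one that is too short to still hold the fragment being overdrawn, all 12 fragments complete. Once the window spans a full layer, only the 4 visible fragments complete, matching what front-to-back sorting would achieve. A semi-transparent fragment (`opaque=False`) never triggers a kill in this model, so the content behind it survives to be blended.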

This all works particularly well with a tile-based renderer like the ARM Mali GPUs. With even a modest kill zone, this can produce results that are as good as the front-to-back drawing order, but without the requirement to sort the scene (with consequent overhead in silicon area, power and memory bandwidth). So, no need to modify your application to add the sorting algorithm. Also, since drawing proceeds in the same natural order, semi-transparent content works properly without expensive workarounds that degrade performance.

And the best thing is that the transition between operating regimes is soft - more like a steady speed adjustment than a gear change. Inconsistent frame rates (sometimes known as "jank") are extremely annoying to users, so any technique which significantly reduces the uncertainty in scene rendering time will be popular with users and developers alike.

Sean Ellis over 10 years ago

Maxim,
I don't want to comment on competing technologies directly. However, I think our approach has two very important key features. The first is its relative simplicity, which means low area overhead, and the second is the fact that it copes gracefully when it encounters primitives for which Forward Pixel Kill is not appropriate.
Sean.
Sean Ellis over 10 years ago

Sean,

Thanks for the kind words. We are excited by this technology too (hence the blog).

Sean.
Maxim Mogilnitsky over 10 years ago

Very impressive. Just to be sure: as far as I know, a competing solution, PowerVR from Imagination, has similar technologies. To my knowledge these technologies are patented from "top to bottom" of the GPU processing pipe. They are, of course, very highly guarded, being one of the major assets of Imagination. Furthermore, I heard that Imagination continues to enlarge this asset by adding more and more patents on the matter. On the other hand, as far as I understand deferred GPU technology, this is an absolute must to achieve decent performance. So how?
Sean Lumly over 10 years ago

Wow! To me this is the most exciting thing that has happened to mobile graphics since ASTC! Overdraw is no doubt a very large resource sink, and eliminating it (or drastically reducing it) will give a dramatic boost to performance. But what I really love about this method over the others (deferred shading, object sorting, early Z, queries, low-res passes, etc.) is how remarkably elegant it is: I can only imagine how tiny it is on silicon per-core, and it seems like a glove-fit for tile-based rendering. I would imagine that "short-circuiting" a thread will incur very little cost and can be done in a few cycles, meaning a thread can get to work immediately, and change very quickly to a new workload should it be occluded. This should mean that much more complex shaders covering much larger areas of screen can be more easily incorporated into an application, with *much* less strain on the developer. I'm blown away.

I must say that I am very, very impressed, and am looking forward to this technology both on mobile and on ever-larger silicon. It's quite evident that comparing GPUs MHz to MHz, or GFLOP to GFLOP, is misleading (at best), given that optimizations can mean multiples of performance difference. I would be very interested to see the performance of Mali silicon scaled up to TFLOP sizes and how it would compare to desktop GPUs (all things being equal, e.g. memory bandwidth).

Wonderful job, guys.
