Killing Pixels - A New Optimization for Shading on ARM Mali GPUs

Invisible pixels are expensive                                                           



Shading pixels is expensive, so you want to make sure that you don't spend time and energy shading pixels that will not actually make it to the screen. To address this, ARM® Mali™ GPUs are pioneering a novel optimization.

But before we jump to the solution, why do we have invisible pixels in the first place? For an exploration of when pixels are and are not pixels, you might also enjoy Ed Plowman's Of Philosophy and When is a Pixel Not a Pixel?

The colour of every pixel on the screen is determined by a shader program. Each object typically has a different program associated with it, and one thread of execution is spawned for every pixel in the object. Once launched, these threads are committed to complete (unless they execute a "discard" instruction to terminate themselves), and they then pass the calculated colour to the blending unit where it is combined with the existing pixel value in the output image.
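To make the model concrete, here is a minimal example of such a program: a GLSL ES fragment shader, embedded as a C string in the way it would typically be handed to glShaderSource. The texture, varying name and alpha threshold are illustrative assumptions, not taken from the article.

    /* Minimal GLSL ES fragment shader, embedded as a C string. One
     * invocation ("thread") runs for every covered pixel; "discard" is the
     * only way the invocation can terminate itself early, and the written
     * colour is handed on to the blending unit. */
    static const char *fragment_shader_src =
        "precision mediump float;                                   \n"
        "uniform sampler2D u_texture;    /* hypothetical texture */ \n"
        "varying vec2 v_texcoord;                                   \n"
        "void main()                                                \n"
        "{                                                          \n"
        "    vec4 colour = texture2D(u_texture, v_texcoord);        \n"
        "    if (colour.a < 0.5)                                    \n"
        "        discard;            /* kill this pixel's thread */ \n"
        "    gl_FragColor = colour;  /* otherwise, pass to blending */\n"
        "}                                                          \n";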

The key problem here is overdraw - nearby objects will be drawn over more distant objects, hiding them. There's no point drawing the Emerald City on the horizon in huge detail if there's a hill in the foreground occluding (hiding) it. If you have already spent the time and effort rendering the emerald pixels before discovering that they will be overdrawn, then this is a waste of performance, time, battery life, and possibly karma.

 
Reducing the load


There are several existing approaches that aim to reduce the cost of overdrawn pixels. The first is for the application to use its knowledge of the scene to avoid even sending hidden geometry to the graphics driver at all. This works well in closed, room-based games but requires additional logic in the game engine. For common classes of scene, it's also quite difficult to work out which objects are occluding others.
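As a rough illustration of what that application-side logic looks like, here is a minimal C sketch of an engine skipping draw calls for objects it already knows cannot be seen. The Object type and the three helper functions are hypothetical names invented for this example.

    /* Application-side culling sketch: objects the engine knows are hidden
     * are never submitted, so the driver and GPU never see them. */
    typedef struct {
        float centre[3];   /* bounding-sphere centre in world space */
        float radius;      /* bounding-sphere radius */
        int   room_id;     /* which room the object belongs to */
    } Object;

    int  room_is_visible(int room_id);                                /* hypothetical */
    int  sphere_outside_frustum(const float centre[3], float radius); /* hypothetical */
    void submit_draw_call(const Object *obj);                         /* hypothetical */

    void draw_visible_objects(const Object *objects, int count)
    {
        for (int i = 0; i < count; ++i) {
            /* Cheap CPU-side tests: don't even send hidden geometry to the GPU. */
            if (!room_is_visible(objects[i].room_id))
                continue;
            if (sphere_outside_frustum(objects[i].centre, objects[i].radius))
                continue;
            submit_draw_call(&objects[i]);
        }
    }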

Even if you do eliminate some of the more distant geometry, there will still be cases where the geometry you do draw is partly hidden. Perhaps there's an enemy player in the same room as you - you can see their helmet, but the rest of them is behind a crate. You don't want to shade the pixels for the whole character when only the top of the helmet is visible.

Using a simple depth buffer, together with "early" depth testing, it is possible to determine that the pixels from a more distant object are hidden by the pixels from the nearer one before we start shading the pixels.
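On the API side, letting the hardware do this costs almost nothing. A minimal OpenGL ES sketch of the depth-test state involved, assuming a context with a depth buffer already exists, might look like this:

    #include <GLES2/gl2.h>

    /* With this state, the GPU's early depth unit can reject a fragment
     * before its shader ever runs. Shaders that write their own depth or
     * use "discard" can defeat early rejection on many GPUs. */
    void enable_early_depth_rejection(void)
    {
        glEnable(GL_DEPTH_TEST);  /* compare each incoming fragment against the depth buffer */
        glDepthFunc(GL_LESS);     /* keep a fragment only if it is nearer than what is stored */
        glDepthMask(GL_TRUE);     /* record its depth so later fragments can be rejected */
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);  /* start the frame clean */
    }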

By sorting the objects in order of increasing distance, and drawing the nearest objects first, it is possible to help the process along and eliminate most of the hidden pixels in an overdrawn image. Of course, it's not possible to do the reverse, as the pipeline is not psychic and cannot know what is going to be drawn afterwards... but hold that thought.
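A minimal CPU-side sketch of that front-to-back ordering might look like the following. The DrawCall structure and view_depth field are hypothetical engine-side names, not part of any API.

    #include <stdlib.h>

    /* Sort opaque draw calls front-to-back so that near objects fill the
     * depth buffer first and later, more distant fragments fail the early
     * depth test. */
    typedef struct {
        float view_depth;   /* distance of the object from the camera */
        /* ... geometry, shader and state handles would live here ... */
    } DrawCall;

    static int nearer_first(const void *a, const void *b)
    {
        float da = ((const DrawCall *)a)->view_depth;
        float db = ((const DrawCall *)b)->view_depth;
        return (da > db) - (da < db);   /* ascending distance: nearest drawn first */
    }

    void sort_opaque_draws(DrawCall *draws, size_t count)
    {
        qsort(draws, count, sizeof(DrawCall), nearer_first);
    }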

But front-to-back sorting has some other problems.

For semi-transparent objects, front-to-back is exactly the wrong order to draw them in, as they need to be blended with the objects behind them. And just sorting the objects in the first place takes time. Even worse, the structure of modern graphics APIs (OpenGL ES® and Direct3D®) doesn't really include the concept of "object in a scene" at all, so you have to keep track of this yourself and draw in an acceptable order.
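The usual compromise is to split the frame in two: opaque geometry drawn front-to-back with depth writes on, then blended geometry drawn back-to-front with depth writes off. A minimal OpenGL ES sketch of that pattern follows; the two scene-drawing functions are hypothetical application code.

    #include <GLES2/gl2.h>

    void draw_opaque_front_to_back(void);       /* hypothetical */
    void draw_transparent_back_to_front(void);  /* hypothetical */

    void render_frame(void)
    {
        /* Pass 1: opaque objects, nearest first, filling the depth buffer. */
        glDisable(GL_BLEND);
        glDepthMask(GL_TRUE);
        draw_opaque_front_to_back();

        /* Pass 2: semi-transparent objects, farthest first, blended over the result. */
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
        glDepthMask(GL_FALSE);   /* still test against depth, but don't write it */
        draw_transparent_back_to_front();
    }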

Another way to avoid work is to defer as much shading as possible: first run a quick pass that just calculates the depths and records which object is in front at each pixel, and only run the full lighting calculation once that is known for every pixel.
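A depth pre-pass is one simple way to realise this idea (full deferred shading goes further and also stores material data per pixel). The sketch below is a minimal OpenGL ES version of the pre-pass, with the scene-drawing functions left as hypothetical application code.

    #include <GLES2/gl2.h>

    void draw_scene_depth_only(void);     /* hypothetical: trivial shaders, geometry only */
    void draw_scene_full_shading(void);   /* hypothetical: the expensive lighting shaders */

    void render_with_depth_prepass(void)
    {
        /* Pass 1: write only depth; colour writes are disabled. */
        glEnable(GL_DEPTH_TEST);
        glDepthFunc(GL_LESS);
        glDepthMask(GL_TRUE);
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        draw_scene_depth_only();

        /* Pass 2: re-draw with the full shaders; only the frontmost fragment at
         * each pixel matches the stored depth, so only it gets shaded. */
        glDepthFunc(GL_EQUAL);
        glDepthMask(GL_FALSE);
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
        draw_scene_full_shading();
    }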

This works extremely well. At least, it works well until you come across something that breaks the rules. Perhaps it's a pixel shader which writes its own depth. Perhaps it's a semi-transparent object. As soon as that happens, you have to fall back into a more "brute force" mode of operation in order to keep track of the additional data. The fail-over isn't soft, either: performance decreases markedly as soon as any special cases are detected, and these cases are becoming common as game engines strive for more and more realism.

And so, with the inevitability of a rhetorical question at the end of the introduction to a technology article, what can we do about it?

Forward Pixel Kill


Our answer is a patented technology known as Forward Pixel Kill (FPK), which is included in ARM Mali GPUs from the Mali-T62X and T678 onwards (such as the Mali-T628 MP6 in the recently announced Samsung Exynos 5420).

In an FPK-enabled GPU, the threads that colour the pixels are not irrevocably committed to complete once they are launched. Calculations already in flight can be terminated at any time if we spot that a later thread will write opaque data to the same pixel location. Since each thread takes a finite time to complete, we have a window in time which we can exploit to kill pixels already in the pipeline. In effect, we exploit the depth of the pipeline to emulate the "psychic" seeing-into-the-future effect that I alluded to earlier.

In fact, it's possible to do even better than this. By adding a simple FIFO buffer to the start of the pipeline, we can extend the forward pixel kill zone, making it more likely to spot overdraw, and at the same time giving the pipeline the chance to kill threads before they are even started.
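To illustrate the principle (and only the principle - this is not ARM's hardware design), here is a small C sketch in which a new opaque fragment kills older queued or in-flight work for the same pixel. The names, window size and depth convention (smaller = nearer) are all assumptions made for the example.

    #include <stdbool.h>

    /* Conceptual model of the kill zone: the FIFO plus the in-flight
     * pipeline slots that the kill logic can see. */
    #define KILL_ZONE_SLOTS 64

    typedef struct {
        bool  live;     /* still worth finishing? */
        int   x, y;     /* pixel location within the tile */
        float depth;
        bool  opaque;
    } FragThread;

    static FragThread kill_zone[KILL_ZONE_SLOTS];

    /* Called as each new fragment enters the pipeline. */
    void submit_fragment(FragThread incoming)
    {
        if (incoming.opaque) {
            /* Forward pixel kill: the newer opaque fragment is guaranteed to
             * overwrite older work at the same location, so that work is
             * terminated before it wastes any more cycles. */
            for (int i = 0; i < KILL_ZONE_SLOTS; ++i) {
                if (kill_zone[i].live &&
                    kill_zone[i].x == incoming.x &&
                    kill_zone[i].y == incoming.y &&
                    incoming.depth <= kill_zone[i].depth) {
                    kill_zone[i].live = false;
                }
            }
        }
        /* ...the incoming fragment then takes a free slot and runs as normal... */
    }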

This all works particularly well with a tile-based renderer like the ARM Mali GPUs. With even a modest kill zone, this can produce results that are as good as the front-to-back drawing order, but without the requirement to sort the scene (with consequent overhead in silicon area, power and memory bandwidth). So, no need to modify your application to add the sorting algorithm. Also, since drawing proceeds in the same natural order, semi-transparent content works properly without expensive workarounds that degrade performance.

And the best thing is that the transition between operating regimes is soft - more like a steady speed adjustment than a gear change. Inconsistent frame rates (sometimes known as "jank") are extremely annoying to users, so any technique which significantly reduces the uncertainty in scene rendering time will be popular with users and developers alike.

Anonymous
  • Maxim,

    I don't want to comment on competing technologies directly. However, I think our approach has two very important key features. The first is its relative simplicity, which means low area overhead, and the second is the fact that it copes gracefully when it encounters primitives for which Forward Pixel Kill is not appropriate.

    Sean.

  • Sean,

    Thanks for the kind words. We are excited by this technology too (hence the blog).

    Sean.
  • Very impressive. Just to be sure: as far as I know, a competing solution, PowerVR from Imagination, has similar technologies. To my knowledge these technologies are patented from "top to bottom" of the GPU processing pipe. They are, of course, very highly guarded as being one of the major assets of Imagination. Even further, I heard that Imagination continues to enlarge this asset by adding more and more patents on the matter. On the other hand, as far as I understand deferred GPU technology, this is an absolute must to achieve decent performance. So how?
  • Wow! To me this is the most exciting thing that has happened to mobile graphics since ASTC! Overdraw is no doubt a very large resource sink, and eliminating it (or drastically reducing it) will give dramatic boosts in performance. But what I really love about this method over the others (deferred shading, object sorting, early Z, queries, low-res passes, etc.) is how remarkably elegant it is: I can only imagine how tiny it is on silicon per-core, and it seems like a glove-fit for tile-based rendering. I would imagine that "short-circuiting" a thread will incur very little cost and can be done in a few cycles, meaning a thread can get to work immediately, and change very quickly to a new workload should it be occluded. This should mean that much more complex shaders covering much larger areas of screen can be more easily incorporated into an application, with *much* less strain on the developer. I'm blown away.

    I must say that I am very, very impressed, and am looking forward to this technology both on mobile and on ever-larger silicon. It's quite evident that comparing GPUs MHz to MHz, or GFLOP to GFLOP, is misleading (at best) given that optimizations can mean multiples of performance difference. I would be very interested to see the performance of Mali silicon scaled up to TFLOP sizes and how it would compare to desktop GPUs (all things being equal, e.g. memory bandwidth).

    Wonderful job, guys.