How low can you go? Building low-power, low-bandwidth ARM Mali GPUs

I'm just back from SIGGRAPH 2012, the world's biggest computer graphics conference, and (as every year) still breathing hard from the excitement of seeing the latest in graphics research. This year was triply exciting for me; in addition to soaking up the new work, I had the privilege (in my role as OpenGL ES committee chair) of announcing the release of the new OpenGL ES 3.0 specification and the Khronos-standard version of our ASTC compression technology, and I gave a talk at SIGGRAPH Mobile about power and bandwidth in mobile GPUs. In this blog I'll go over the key points of the SIGGRAPH Mobile presentation.

We love to talk about "raising the bar" in graphics performance. The idea is that GPU performance competition is like a high jump, where you are constantly trying to clear a higher bar than you did last time; but when it comes to power and memory bandwidth, it's more like a limbo contest. There, the goal is to keep lowering the bar, trying to keep the same level of performance while using ever smaller amounts of energy. In fact, it turns out that these goals are equivalent; in both cases, what you're really trying to do is maximize efficiency.  Want higher performance? Then you have to reduce energy consumption. Want lower power? Then you've just enabled higher performance.

It's all about power

In fact, mobile GPU design is all about reducing energy consumption. This has been true for quite a while, but the reasons for it are changing.  We used to worry about it because we were concerned about battery life.  We still are, because nobody likes having to recharge their mobile devices in the middle of the day; and certainly we'd love it if some breakthrough in battery technology gave us ten times the energy storage capacity we have now.  But even if that were to happen – not likely – it wouldn't solve our problem, because nowadays the problem isn't battery life. It's heat.

Modern high-end applications processors, without exception, are thermally limited.  Given enough work to do and permission to run as fast as they can, they can do so much work that they overheat their packages and destroy themselves. To keep that from happening, they contain lots of smart system hardware and software that forces them to slow down when they start to get too hot.  Think about that for a minute; if your performance is limited, not by having too few transistors, or too little compute capability, or insufficient access to memory, but simply by the amount of energy you can use, then the only way to increase your performance is to reduce your energy consumption.  And the most important performance metric to optimize for is not pixels or (God forbid!) triangles per second, but nanojoules per pixel (nJ/p).

Thinking like a GPU designer

To get a feel for the kind of reasoning this leads to, let's look at one aspect of power optimization: reducing memory bandwidth.  Since we're playing engineer, we'll do this using numbers; but don't worry, we won't need higher math and physics – simple arithmetic will do. We'll start with some simple facts:

  1. Power is just a rate of energy consumption. Energy is measured in joules; one watt of power is one joule per second.
  2. Speaking very loosely, the power budget for a mobile GPU is about one watt (sometimes less).
  3. Every time the GPU reads or writes one byte of memory, it consumes about 150 picojoules (pJ), or millionths of a millionth of a joule. (Memory geeks, this is for 2x32 LPDDR2 and includes everything from the memory controller out, under a whole boatload of assumptions. It's only a ballpark figure, so don't take it too seriously.  But it's enough to get us started.)

The first question we have to ask is, does memory bandwidth use enough power to be worth worrying about? Here comes our first numerical argument: the kind of memory system we're talking about can transfer something like 4 to 8 GB (gigabytes) of data per second.  Multiply that by 150 pJ per byte, and we get 0.6 to 1.2 watts. In other words, memory bandwidth can eat up our entire power budget. So the answer to our question is yes, memory bandwidth does matter; in fact it's critical.
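The arithmetic above is easy to check. Here is a minimal sketch in Python, assuming decimal gigabytes (1 GB = 10^9 bytes) and the same ballpark 150 pJ per byte; the function name is just for illustration.

```python
# Ballpark figure from the text: energy per byte moved to or from external memory.
PJ_PER_BYTE = 150
JOULES_PER_PJ = 1e-12


def memory_power_watts(gigabytes_per_second: float) -> float:
    """Power in watts spent moving data at the given rate (1 GB = 1e9 bytes)."""
    bytes_per_second = gigabytes_per_second * 1e9
    return bytes_per_second * PJ_PER_BYTE * JOULES_PER_PJ


for rate in (4, 8):
    print(f"{rate} GB/s -> {memory_power_watts(rate):.1f} W")
```

Against a GPU budget of roughly one watt, both ends of that range are a very large share of the total.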

The Tile Game

In the SIGGRAPH talk, I went on to talk about tile-based rendering. This is a way of organizing the graphics pipeline so that the color, depth, and stencil sample buffers stay on-chip. Tile-based rendering greatly reduces memory bandwidth usage, especially if the application is using multi-sampled antialiasing (MSAA), which requires multiple color, depth, and stencil samples for every pixel.  We use it in all of the ARM® Mali™ GPUs, and it's also used (with variations) in the Qualcomm Adreno™ and Imagination PowerVR™ cores. Our version works like this:

The GPU divides the output image into small rectangles called tiles, and maintains to-do lists of things that need to be drawn into each tile. When the application asks the GPU to draw a triangle, it doesn't actually do it; it just figures out which tiles contain pixels that the triangle might cover, and adds the triangle to those tiles' to-do lists.  When it's time to draw the pixels, the GPU processes the tiles one-at-a-time.  For each tile, it reads the to-do list and draws all of its triangles in order; but since the tile is small, it can do this into a special on-chip memory called the tile buffer.  When all the triangles have been drawn, it does what we call a resolve: it filters the color samples to produce one color per pixel, and writes the pixel colors into the external frame buffer.  The color, depth, and stencil samples aren't needed any more (usually), so the GPU just forgets about them and goes on to the next tile. Figure 1 shows what it looks like in pictures.
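To make the to-do list idea concrete, here is a rough software sketch of the binning step in Python. It is an illustration only, not how the hardware does it: the 16-pixel tile size, the bounding-box coverage test, and the function name are all assumptions made for the example.

```python
import math
from collections import defaultdict

TILE_SIZE = 16  # assumed tile edge in pixels, for illustration only


def bin_triangles(triangles, width, height, tile_size=TILE_SIZE):
    """Map each tile (tx, ty) to the indices of triangles that might touch it.

    Uses a conservative bounding-box test, so a triangle can land on the
    list of a tile it does not actually cover. Indices are appended in
    submission order, so each tile's list preserves draw order.
    """
    tiles_x = math.ceil(width / tile_size)
    tiles_y = math.ceil(height / tile_size)
    todo = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Convert the bounding box to tile coordinates, clamped to the screen.
        min_tx = max(0, math.floor(min(xs) / tile_size))
        max_tx = min(tiles_x - 1, math.floor(max(xs) / tile_size))
        min_ty = max(0, math.floor(min(ys) / tile_size))
        max_ty = min(tiles_y - 1, math.floor(max(ys) / tile_size))
        for ty in range(min_ty, max_ty + 1):
            for tx in range(min_tx, max_tx + 1):
                todo[(tx, ty)].append(index)
    return dict(todo)


# Two triangles on a 64x32 image (4x2 tiles): a small one and a larger one.
triangles = [
    [(2, 2), (10, 2), (2, 10)],
    [(12, 4), (40, 4), (20, 28)],
]
for tile, indices in sorted(bin_triangles(triangles, 64, 32).items()):
    print(tile, indices)
```

Nothing is drawn at this stage; the per-tile lists are all that gets written out, and the pixel work happens later, one tile at a time.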

Figure 1: Tile-based rendering. Triangles submitted for drawing are written into per-tile to-do lists in system memory. When the pixels are needed, the rasterizer reads the to-do list for each tile and renders it into the on-chip multisample (MS) depth (Z) and color (C) buffers. When a tile is finished, it is resolved to obtain pixel colors, which are written into the off-chip framebuffer. In this figure, the GPU has just finished rendering tile 9 to the internal, multisampled tile buffers, and writing the resolved image to the external frame buffer.

Reducing texture bandwidth

Figure 1 shows us that tile-based rendering puts most of the heavy data traffic – specifically, traffic into and out of the multisampled Z and color buffers – inside the GPU, in on-chip memory. The fattest arrow that still crosses the bus into system memory is texture data. The first law of optimization is, "work on the stuff that's hurting you the most" – so reducing texture bandwidth is the next thing we need to worry about.

This, of course, is exactly what motivated our work on Adaptive Scalable Texture Compression (ASTC). We've written several blogs about ASTC, so I won't repeat the whole story here; for a great introduction to how it works, read Sean Ellis's excellent blog based on our HPG paper. The latest development on the ASTC front is that the Khronos Group has adopted a subset of ASTC as a Khronos-ratified OpenGL and OpenGL ES extension. We've announced plans to support the extension in the new Mali-T624 and Mali-T678 GPUs, and several other GPU providers have expressed similar intentions. Since we've agreed to license the patents royalty-free under the terms of the Khronos members' agreement, we expect that ASTC will be available on all OpenGL ES platforms within a few years.

The exciting thing about ASTC from a developer's point of view is that it allows almost any texture you can imagine to be compressed. The formats in common use today (S3TC, PVRTC, ETC1, RGTC) offer only a limited number of bit rates, and a few choices of number of color components. ASTC offers just about any bit rate you could want, with any number of color components you like, in your choice of standard (8-bit) or HDR (float), all at a quality that is matched only by still-exotic high-end formats such as BPTC. This means that, for the first time, you can think about compressing all of the textures used by your application. There are no 'holes' in the coverage; no matter what your pixel format or quality requirements are, ASTC has a format to match. [FOOTNOTE: OK, there's an exception to every rule. ASTC doesn't have a way to compress integer textures, which are a new feature in OpenGL ES 3.0. Give us time.]

We expect it'll take a little time for developers to get a feel for working with ASTC, and in particular to learn what kind of use cases demand what kinds of bit rates. For developers who want to get a head start, we've released an evaluation codec package, including source code.  We hope you'll find it interesting.

Making tiling even better

Looking back at Figure 1, we've used tile-based rendering to eliminate external traffic to the multisample buffers, and introduced ASTC to shrink texture-fetch traffic as much as we can. The biggest arrow remaining is tile writeback, where we write resolved color samples from the on-chip tile buffer to the framebuffer in external memory.  As screens get bigger, this step becomes more and more important – and screens, trust me, are going to get ridiculously big.  Can we do something about tile writeback?

During the design of the Midgard GPU architecture, we spent a lot of time looking at application behavior, looking for opportunities to reduce power or improve performance.  One thing we noticed is that surprisingly often, the resolved pixels we write out to memory are exactly the same as the pixels we wrote during the preceding frame.  That is, the part of the image corresponding to the tile hasn't changed. The architects found this annoying; the GPU was burning energy to write data to memory, when that data was already there.  Clearly, if we could detect situations where a tile hadn't changed, we could skip writing it, and reduce power consumption.

Now, it's not a surprise that a lot of pixels don't change when the GPU is, say, compositing a web page or a window system.  But we found that you get significant numbers of redundant tile writes even in modern FPS games, where you'd think the whole screen would be changing constantly; and you even get them during video playback.  Obviously you don't save a lot on that kind of content, but you pretty much always save enough to make it worth doing. So, we decided to attack the problem in Midgard, by adding a feature we call transaction elimination.

Introducing transaction elimination

OK, it's not the coolest name in the world, but the technology itself is simple and elegant.  Every time the GPU resolves a tile-full of color samples, it computes a signature or checksum – a short bit string that depends sensitively on every pixel in the resolved buffer. It writes each signature into a list associated with the output color buffer.  The next time it renders to that buffer, after resolving each tile, it compares the new signature to the old one. If the signature hasn't changed, it skips writing out the tile, because the probability that the pixels have changed is one in, well, a very, very, very large number.
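Here is a small software sketch of that compare-and-skip loop in Python. The text doesn't say which signature function the hardware uses, so CRC-32 from the standard zlib module stands in here purely as an assumption; at 32 bits it is far too short to give the collision odds described above. The class and attribute names are hypothetical.

```python
import zlib


class TileWriter:
    """Skips external writes for tiles whose signature matches the last one stored."""

    def __init__(self):
        self.signatures = {}   # tile id -> signature of the tile last written
        self.framebuffer = {}  # tile id -> pixel bytes (stands in for external memory)
        self.writes = 0
        self.skipped = 0

    def resolve_tile(self, tile_id, pixels: bytes) -> bool:
        """Write the tile only if its signature changed; return True if written."""
        signature = zlib.crc32(pixels)  # stand-in signature, not the real hardware one
        if self.signatures.get(tile_id) == signature:
            self.skipped += 1
            return False
        self.framebuffer[tile_id] = pixels
        self.signatures[tile_id] = signature
        self.writes += 1
        return True


# Frame 1 writes all four tiles; frame 2 differs in a single byte of tile 2.
frame1 = {0: b"sky" * 256, 1: b"sky" * 256, 2: b"bird" * 192, 3: b"hud" * 256}
frame2 = dict(frame1)
frame2[2] = frame1[2][:-1] + b"!"

writer = TileWriter()
for frame in (frame1, frame2):
    for tile_id, pixels in frame.items():
        writer.resolve_tile(tile_id, pixels)

print(f"written: {writer.writes}, skipped: {writer.skipped}")
```

Note that the signature list is the only extra state the scheme needs; the previous frame's pixels are never read back for comparison.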

Figure 2: Transaction elimination signature comparisons

Figure 2 illustrates the idea. For tiles where we have a (green) signature match, we can skip writing the tile; this happens (in this hypothetical case) for the skybox, parts of the heads-up display, and parts of the car.  Where we have a (red) mismatch, we have to write the tile to memory.

Theory meets practice

Based on our design studies, we expected that transaction elimination would help a lot for browsing and GUI compositing, but only modestly for games. Now that we have access to partner silicon for the Mali-T604, however, we've been able to study its behavior in real applications, running on a real OS. It turns out it works better than we thought, for two reasons.  First, display resolutions have once again grown faster than we predicted; and second, the kinds of games people are playing aren't the kind we were expecting.

Saving the planet, one Angry Bird™ at a time

Currently, the most popular mobile game on the planet, by a wide margin, is Rovio's Angry Birds. It is played a lot, according to its creators: about 200 million minutes per day worldwide. Statistically you've almost certainly played it, so I don't need to tell you that its style is friendly to transaction elimination. But to help you visualize just how friendly it is, here are several images (Figures 3, 4, and 5). I've painted a red overlay on the tiles where we have a signature mismatch (and therefore have to write the tile to memory). As you can see, when we're aiming the slingshot, there's very little motion and only a handful of tiles need to be written. When we launch the bird, the whole screen pans and a lot of tiles change, but we still end up skipping almost 50% of tile writes. Finally, when the bird hits, the scrolling slows down and then stops, and the number of active tiles trails off.

Figure 3: Aiming. Transaction elimination is able to suppress 96% of tile writes
Figure 4: Bird in flight. Here there is a lot of background motion, but we are still able to eliminate about half of tile writes
Figure 5: Settling. As the physics engine converges, more and more of the scene becomes static and stops requiring tiles to be written to memory

So how much does this help?

To put numbers on the value of transaction elimination, we captured a couple of thousand frames of the OpenGL ES commands issued by Angry Birds "Seasons" during a playing session. We then ran the commands on a prototype high-end Android™ tablet with Mali-T604 silicon, first with transaction elimination disabled, and then with it enabled. We used the built-in debug protocols to read back the internal performance counters.  We found that over the sequence, about 75% of tile writebacks were eliminated. Total GPU bandwidth was cut nearly in half, from 6.5 MB/frame to 3.4 MB/frame. 

To put that into perspective: if every Angry Birds player on the planet were using Mali silicon at a resolution of 1368x760 and assuming a bandwidth cost of 150 pJ per byte, the technology would be saving about 3.8 kW continuous power world-wide. That's enough to run several single-family houses, 24x7. It's equivalent to about five horsepower, so it's more than the max output of a Vespa S 50 motor scooter, or my old Sears lawnmower. But it's more fun to think of it in terms of energy. Again, assuming every Angry Birds player were using the technology, transaction elimination would save 34 megawatt-hours of energy per year. If you're interested in saving the planet, that's 20 barrels of oil, which would yield 8.7 metric tons of carbon dioxide; if you're more the Duke Nukem type, it's approximately the energy released from exploding about 16.3 metric tons of dynamite. It's a lot of energy!
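For anyone who wants to check the back-of-envelope numbers, here is one way to reproduce them in Python. The frame rate isn't stated above, so the 60 frames per second used here is an assumption, as is treating 1 MB as 10^6 bytes; with those inputs the result lands close to the figures quoted.

```python
MINUTES_PLAYED_PER_DAY = 200e6   # worldwide play time, from the text
MINUTES_PER_DAY = 24 * 60
MB_SAVED_PER_FRAME = 6.5 - 3.4   # bandwidth without vs. with transaction elimination
JOULES_PER_BYTE = 150e-12        # the 150 pJ/byte ballpark figure
HOURS_PER_YEAR = 24 * 365


def worldwide_savings(frames_per_second=60):
    """Return (continuous kW saved, MWh saved per year).

    frames_per_second is an assumption; the text does not state a frame rate.
    """
    # 200 million minutes spread over a day is this many people playing at once.
    average_players = MINUTES_PLAYED_PER_DAY / MINUTES_PER_DAY
    watts_per_player = MB_SAVED_PER_FRAME * 1e6 * frames_per_second * JOULES_PER_BYTE
    total_kw = average_players * watts_per_player / 1e3
    mwh_per_year = total_kw * HOURS_PER_YEAR / 1e3
    return total_kw, mwh_per_year


kw, mwh = worldwide_savings()
print(f"{kw:.2f} kW continuous, {mwh:.1f} MWh per year")
```

Per player the saving is only a few hundredths of a watt; it's the roughly 140,000 people playing at any given moment that turn it into kilowatts.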

I hope you've enjoyed this little dive into GPU design and energy-think.  Deepest thanks to Rovio for giving me permission to use the Angry Birds images, for writing a game that is so awesomely well suited to transaction elimination, and (of course) for several hundred hours of my life which I will never, ever get back...

Got questions? Just like to argue? Drop me a line...

  • Hi tomolson and seanellis,

    I have a few questions regarding transaction elimination:

    1) Does transaction elimination work with Multiple Render Targets in GLES3?

    2) Does transaction elimination (or possibly AFBC) eliminate updating an intermediary framebuffer for pixels that have not been touched? For example, if I am rendering a teapot as a layer to be composited (not a final display-ready framebuffer) in the center of a full-screen framebuffer with black (or transparent) pixels surrounding it, can writing those surrounding pixels be eliminated? Put a slightly different way: is there a way to have only the teapot pixels written out to memory?

  • Sean, we have looked at FPS games too, and although the rejection rates are much lower, they still work in our favour. The "break even" point comes when only 1.5% of tile writebacks are eliminated, and we are seeing average rates during gameplay from about 5% upwards, even on highly textured interior scenes. So the very worst case we have is still a net "win" on bandwidth. When you also factor in menus, options screens, pause screens, and other non-gameplay UI, which have very high elimination rates, these serve to bring up the average when considered over the game as a whole.
  • I'm quite fascinated by transaction elimination. The Angry Birds case makes sense, but you mention that it is also quite effective for first-person shooters. How effective is it for modern FPS games? Are we talking mainly about external environments with a [mostly] static skybox?