The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering

A Chinese version of this blog is available here - thanks to vincent for the translation!

 

In my previous blog I started defining an abstract machine which can be used to describe the application-visible behaviors of the Mali GPU and driver software. The purpose of this machine is to give developers a mental model of the interesting behaviors beneath the OpenGL ES API, which can in turn be used to explain issues which impact their application’s performance. I will use this model in future blogs in this series to explore some common performance potholes which developers encounter when developing graphics applications.

This blog continues the development of this abstract machine, looking at the tile-based rendering model of the Mali GPU family. I’ll assume you’ve read the first blog on pipelining; if you haven’t, I would suggest reading that first.

The “Traditional” Approach

In a traditional mains-powered desktop GPU architecture — commonly called an immediate mode architecture — the fragment shaders are executed on each primitive, in each draw call, in sequence. Each primitive is rendered to completion before starting the next one, with an algorithm which approximates to:

    foreach( primitive )
        foreach( fragment )
            render fragment

 

As any triangle in the stream may cover any part of the screen, the working set of data maintained by these renderers is large; typically at least a full-screen-sized color buffer and depth buffer, and possibly a stencil buffer too. A typical working set for a modern device will be 32 bits-per-pixel (bpp) color and 32bpp packed depth/stencil. A 1080p display therefore has a working set of around 16MB, and a 4k2k TV has a working set of around 64MB. Due to their size these working buffers must be stored off-chip in DRAM.
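The arithmetic behind those figures can be sketched as follows. This is a minimal illustration which assumes 1920x1080 for 1080p and 3840x2160 for 4k2k; the function name and the `samples` parameter (for multisampled targets) are mine, not part of any Mali API.

```python
def working_set_bytes(width, height, color_bpp=32, depth_stencil_bpp=32, samples=1):
    """Bytes needed for color plus packed depth/stencil at the given resolution."""
    bits_per_pixel = (color_bpp + depth_stencil_bpp) * samples
    return width * height * bits_per_pixel // 8

MIB = 1024 * 1024

full_hd = working_set_bytes(1920, 1080)  # assumed 1080p resolution
uhd_4k = working_set_bytes(3840, 2160)   # assumed "4k2k" resolution

print(f"1080p: {full_hd / MIB:.1f} MiB")
print(f"4k2k:  {uhd_4k / MIB:.1f} MiB")
```

Under these assumptions the results come out at roughly 15.8 MiB and 63.3 MiB, which is where the rounded 16MB and 64MB figures come from.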

model-imr.png

Every blending, depth testing, and stencil testing operation requires the current value of the data for the current fragment’s pixel coordinate to be fetched from this working set. All fragments shaded will typically touch this working set, so at high resolutions the bandwidth load placed on this memory can be exceptionally high, with multiple read-modify-write operations per fragment, although caching can mitigate this slightly. This need for high bandwidth access in turn drives the need for a wide memory interface with lots of pins, as well as specialized high-frequency memory, both of which result in external memory accesses which are particularly energy intensive.
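To get a feel for the scale of this traffic, here is a rough back-of-the-envelope sketch. Every input is an assumption chosen for illustration (resolution, frame rate, overdraw factor, and one read-modify-write pair per fragment), and it deliberately ignores the caching mentioned above, so treat the result as an upper-bound style estimate rather than a measurement of any real GPU.

```python
def framebuffer_traffic_gb_per_s(width, height, fps, overdraw,
                                 bytes_per_access=8, accesses_per_fragment=2):
    """Uncached external traffic if every fragment reads and writes its pixel.

    bytes_per_access=8 matches 32bpp color plus 32bpp packed depth/stencil.
    accesses_per_fragment=2 models a single read-modify-write pair.
    """
    fragments_per_second = width * height * overdraw * fps
    return fragments_per_second * bytes_per_access * accesses_per_fragment / 1e9

# Illustrative inputs only: 1080p at 60 frames per second with 3x overdraw.
traffic = framebuffer_traffic_gb_per_s(1920, 1080, fps=60, overdraw=3)
print(f"{traffic:.2f} GB/s")
```

With these made-up inputs the estimate is close to 6 GB/s of framebuffer traffic alone, before any texture or geometry accesses are counted.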

The Mali Approach

The Mali GPU family takes a very different approach, commonly called tile-based rendering, designed to minimize the number of power-hungry external memory accesses which are needed during rendering. As described in the first blog in this series, Mali uses a distinct two-pass rendering algorithm for each render target. It first executes all of the geometry processing, and then executes all of the fragment processing. During the geometry processing stage, Mali GPUs break up the screen into small 16x16 pixel tiles and construct a list of which rendering primitives are present in each tile. When the GPU fragment shading step runs, each shader core processes one 16x16 pixel tile at a time, rendering it to completion before starting the next one. For tile-based architectures the algorithm equates to:

    foreach( tile )
        foreach( primitive in tile )
            foreach( fragment in primitive in tile )
                render fragment

As a 16x16 tile is only a small fraction of the total screen area, it is possible to keep the entire working set (color, depth, and stencil) for a whole tile in a fast RAM which is tightly coupled with the GPU shader core.
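The two passes can be sketched in a few lines of Python. This is a toy model and not how the hardware is implemented: primitives are represented by hypothetical inclusive pixel bounding boxes rather than triangles, and "rendering" just records the order in which work would be visited.

```python
from collections import defaultdict

TILE = 16  # tile edge in pixels, as described in the text

def bin_primitives(primitives):
    """Geometry pass: build a per-tile list of the primitives touching it.

    Each primitive is (name, x0, y0, x1, y1), an inclusive pixel bounding box.
    """
    tile_lists = defaultdict(list)
    for name, x0, y0, x1, y1 in primitives:
        for ty in range(y0 // TILE, y1 // TILE + 1):
            for tx in range(x0 // TILE, x1 // TILE + 1):
                tile_lists[(tx, ty)].append(name)
    return tile_lists

def render(primitives):
    """Fragment pass: finish one tile before starting the next, in API order."""
    order = []
    for tile, names in sorted(bin_primitives(primitives).items()):
        for name in names:
            order.append((tile, name))
    return order

prims = [("A", 0, 0, 20, 10), ("B", 10, 5, 15, 30)]
tile_lists = bin_primitives(prims)
visit_order = render(prims)

# Working set for one single-sampled tile at 32bpp color + 32bpp depth/stencil.
tile_working_set = TILE * TILE * 8
print(dict(tile_lists), visit_order, tile_working_set)
```

Primitive A spans two tiles horizontally and B spans two vertically, so both appear in tile (0, 0) and each appears alone in one other tile. The per-tile working set in this model is 2048 bytes, compared with the many megabytes needed for a full-screen buffer.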

model-tbr.png

This tile-based approach has a number of advantages. They are mostly transparent to the developer but worth knowing about, in particular when trying to understand bandwidth costs of your content:

  • All accesses to the working set are local accesses, which is both fast and low power. The power consumed reading or writing to an external DRAM will vary with system design, but it can easily be around 120mW for each 1GByte/s of bandwidth provided. Internal memory accesses are approximately an order of magnitude less energy intensive than this, so you can see that this really does matter.
  • Blending is both fast and power-efficient, as the destination color data required for many blend equations is readily available.
  • A tile is sufficiently small that we can actually store enough samples locally in the tile memory to allow 4x, 8x and 16x multisample antialiasing¹. This provides high quality and very low overhead anti-aliasing. Due to the size of the working set involved (4, 8 or 16 times that of a normal single-sampled render target; a massive 1GB of working set data is needed for 16x MSAA for a 4k2k display panel) few immediate mode renderers even offer MSAA as a feature to developers, because the external memory footprint and bandwidth normally make it prohibitively expensive.
  • Mali only has to write the color data for a single tile back to memory at the end of the tile, at which point we know its final state. We can compare the block’s color with the current data in main memory via a CRC check — a process called Transaction Elimination — skipping the write completely if the tile contents are the same, saving SoC power. My colleague tomolson has written a great blog on this technology, complete with a real world example of Transaction Elimination (some game called Angry Birds; you might have heard of it). I’ll let Tom’s blog explain this technology in more detail, but here is a sneak peek of the technology in action (only the “extra pink” tiles were written by the GPU - all of the others were successfully discarded).

     [Figure: Transaction Elimination in action; only the "extra pink" tiles were written by the GPU]

 

  • We can compress the color data for the tiles which survive Transaction Elimination using a fast, lossless, compression scheme — ARM Frame Buffer Compression (AFBC) — allowing us to lower the bandwidth and power consumed even further. This compression can be applied to offscreen FBO render targets, which can be read back as textures in subsequent rendering passes by the GPU, as well as the main window surface, provided there is an AFBC compatible display controller such as Mali-DP500 in the system.
  • Most content has a depth and stencil buffer, but doesn’t need to keep their contents once the frame rendering has finished. If developers tell the Mali drivers that depth and stencil buffers do not need to be preserved², ideally via a call to glDiscardFramebufferEXT (OpenGL ES 2.0) or glInvalidateFramebuffer (OpenGL ES 3.0), although it can be inferred by the drivers in some cases, then the depth and stencil content of a tile is never written back to main memory at all. Another big bandwidth and power saving!
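The idea behind Transaction Elimination from the list above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `TileWriter` class is hypothetical, Python's `zlib.crc32` stands in for whatever signature the hardware actually computes, and a dictionary stands in for main memory.

```python
import zlib

class TileWriter:
    """Skips the write-back of tiles whose signature matches the stored one."""

    def __init__(self):
        self.memory = {}      # tile coordinate -> color bytes "in main memory"
        self.signatures = {}  # tile coordinate -> CRC of those bytes
        self.writes = 0       # number of write-backs actually performed

    def write_back(self, tile, color_bytes):
        signature = zlib.crc32(color_bytes)
        if self.signatures.get(tile) == signature:
            return False      # signature unchanged: write eliminated
        self.memory[tile] = color_bytes
        self.signatures[tile] = signature
        self.writes += 1
        return True

writer = TileWriter()
static_tile = bytes(16 * 16 * 4)               # 16x16 pixels, 32bpp, all zero
first = writer.write_back((0, 0), static_tile)   # first frame: must write
second = writer.write_back((0, 0), static_tile)  # same content: skipped
third = writer.write_back((0, 0), b"\x01" * len(static_tile))  # changed: written
print(first, second, third, writer.writes)
```

In this model a static tile costs one write on the first frame and nothing afterwards, which is the behavior the screenshot above is showing for the tiles that were not highlighted.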

It is clear from the list above that tile-based rendering carries a number of advantages, in particular giving very significant reductions in the bandwidth and power associated with framebuffer data, as well as being able to provide low-cost anti-aliasing. What is the downside?

The principal additional overhead of any tile-based rendering scheme is the point of hand-over from the vertex shader to the fragment shader. The output of the geometry processing stage, the per-vertex varyings and tiler intermediate state, must be written out to main memory and then re-read by the fragment processing stage. There is therefore a balance to be struck between costing extra bandwidth for the varying data and tiler state, and saving bandwidth for the framebuffer data.

In consumer electronics today there is a significant shift towards higher resolution displays; 1080p is now normal for smartphones, tablets such as the Mali-T604 powered Google Nexus 10 are running at WQXGA (2560x1600), and 4k2k is becoming the new “must have” in the television market. Screen resolution, and hence framebuffer bandwidth, is growing fast. In this area Mali really shines, and does so in a manner which is mostly transparent to the application developer - you get all of these goodies for free with no application changes!

On the geometry side of things, Mali copes well with complexity. Many high-end benchmarks are approaching a million triangles a frame, which is an order of magnitude (or two) more complex than popular gaming applications on the Android app stores. However, as the intermediate geometry data does hit main memory there are some useful tips and tricks which can be applied to fine tune the GPU performance, and get the best out of the system. These are worth an entire blog by themselves, so we’ll cover these at a later point in this series.

Summary

In this blog I have compared and contrasted the desktop-style immediate mode renderer, and the tile-based approach used by Mali, looking in particular at the memory bandwidth implications of both.

Tune in next time and I’ll finish off the definition of the abstract machine, looking at a simple block model of the Mali shader core itself. Once we have that out of the way we can get on with the useful part of the series: putting this model to work and earning a living optimizing your applications running on Mali.

Note: The next blog in this series has now been published: The Mali GPU: An Abstract Machine, Part 3 - The Midgard Shader Core

As always, comments and questions are more than welcome,

Pete

Footnotes

  1. Exactly which multisampling options are available depends on the GPU. The recently announced Mali-T760 GPU includes support for up to 16x MSAA.
  2. The depth and stencil discard is automatic for EGL window surfaces, but for offscreen render targets they may be preserved and reused in a future rendering operation.

Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.

  • peterharris, thank you for this writeup; it was most informative and gives me a much better idea of the strengths and weaknesses of the Mali tile-based rendering solution. I now understand the memory cost of having to move the vertices 3x versus only 1x in an immediate mode renderer. Of course, as resolutions increase and framebuffer counts grow, the IMR seems to be at an increasingly large disadvantage -- fragment count seems to be growing much faster than geometric complexity. The other benefits of the TBDR seem quite straightforward as well and are very welcome (e.g. power consumption, AA, Transaction Elimination, etc). With the ability to [now] access the contents of the tile (Shader Pixel Local Storage), programmable blending can also be done 100% on chip for wildly memory/speed efficient operations on many buffers. Forward Pixel Kill is yet another reason to love Mali's tiles!

    But I have a few questions:

    1) I notice that the latest Mali T760 description on the ARM site lists the AA scheme as "4x FSAA, 8x FSAA, and 16x MSAA." I assume that FSAA and MSAA are both using rotated grid super-sampling, but is FSAA super-sampled over the entire triangle, or localized to the edges (as MSAA tends to be)? How is FSAA different than MSAA?

    2) What are the expected ballpark costs of implementing AA on the newest Mali T760 GPU? Are they still negligible at low sample density as was the case with the older Mali Utgard cores?

    3) Tiling seems a great fit for Multiple Render Target fragments that don't have to consider their neighbours. But for modern effects like SSAO (for example) that do consider their neighbours, how will tiling fare? Will such an operation have to be done in a second pass and with a bunch of dependent reads?

    Sean

  • Hi Sean,

    Thanks for the comments - nice to know someone actually reads these.

    1) I notice that the latest Mali T760 description on the ARM site lists the AA scheme as "4x FSAA, 8x FSAA, and 16x MSAA." I assume that FSAA and MSAA are both using rotated grid super-sampling, but is FSAA super-sampled over the entire triangle, or localized to the edges (as MSAA tends to be)? How is FSAA different than MSAA?

    On the technical aspects:

    • MSAA has multiple sample points but only one fragment executed per pixel. As you allude to in your question this means that this technique can only anti-alias edges - four samples in the middle of a triangle will all return the same color value (but unique depth values from the rasterizer).
    • There are many ways to implement FSAA - but effectively you get multiple sample points per pixel and will execute one fragment per sample point rather than one per pixel. It is exceptionally expensive to enable because it is a literal multiplier on fragment count, but it does allow you to enable nice fine anti-aliasing in the middle of triangle surfaces, etc.

    For the OpenGL ES driver we only expose MSAA (multi-sampling) - we exposed a 16x AA mode on Mali-400 which was a hybrid of 4xMSAA + 4xFSAA and it was so expensive no one ever really used it - so for the time being we're only supporting multi-sampling.

    2) What are the expected ballpark costs of implementing AA on the newest Mali T760 GPU? Are they still negligible at low sample density as was the case with the older Mali Utgard cores?

    A lot depends on content. If you have bad content with a screen full of sub-pixel tiny triangles then it is possible for 4xMSAA to generate 4x more fragments than a single sampled scene - but that's obviously a corner case. In most cases it is "almost free" - it gets less free as you crank up the multiplier.

    The main technical cost is that we can emit 4 samples a clock into the tilebuffer; 8xMSAA takes two cycles, 16xMSAA takes four cycles. This is obviously pipelined - so if your shader is taking more than 2 or 4 cycles throughput per fragment then you won't see any of this - but for "really simple" content you may see it slow down more than "almost free".
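A minimal sketch of that pipelining argument, assuming the 4 samples per clock figure above and treating per-fragment throughput as simply the slower of shading and sample write-out. The function names are mine and the model ignores every other part of the pipeline.

```python
import math

SAMPLES_PER_CLOCK = 4  # tilebuffer write rate quoted above

def tile_write_cycles(msaa_samples):
    """Cycles needed to emit all samples of one fragment into the tilebuffer."""
    return math.ceil(msaa_samples / SAMPLES_PER_CLOCK)

def cycles_per_fragment(shader_cycles, msaa_samples):
    """Pipelined throughput: limited by the slower of shading and write-out."""
    return max(shader_cycles, tile_write_cycles(msaa_samples))

for samples in (4, 8, 16):
    print(samples, tile_write_cycles(samples),
          cycles_per_fragment(10, samples),  # heavier shader hides the cost
          cycles_per_fragment(1, samples))   # trivial shader exposes it
```

In this simplified model a shader with a throughput of 10 cycles per fragment sees no change at any sample count, while a 1 cycle shader slows to 2 cycles at 8x and 4 cycles at 16x.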

    3) Tiling seems a great fit for Multiple Render Target fragments that don't have to consider their neighbours. But for modern effects like SSAO (for example) that do consider their neighbours, how will tiling fare? Will such an operation have to be done in a second pass and with a bunch of dependent reads?

    In these kinds of algorithms we generally have to revert to behaving like an IMR - we have to bounce things via an off-screen render target which we re-read as a texture. No magic can create adjacent data in tiles we've not rendered yet.

    Cheers,
    Pete

  • Thanks peterharris!

    Thanks for the comments - nice to know someone actually reads these.

    I actually set aside my morning to read and re-read your 3-part series of articles when I found them late last night. I had to look up a lot of information to get a better handle on the high level concepts, and then formulate questions if I could not find the information. I've also been trying to think of a way to reduce the geometry transfer bandwidth for tile-based renderers. I've considered the obvious case of turning the AFBC logic on the vertex data to reduce the data size. I've also thought about having a die-configurable chunk of memory sitting on chip that geometry could be written to, and when it was full, the remaining data could spill over into DDR. In this way, it could cut down on the amount of memory being written out to RAM, thus reducing power and improving performance. Perhaps the L2 caches could run double-duty in this regard? Forgive me if this sounds silly, I do enjoy thinking of these types of problems. I get a bit too excited about this stuff.

    I am very pleased to hear about the MSAA 16x implementation. Ignoring the edge-cases (tiny triangles, copious overdraw, etc), even for extremely simple render-to-screen scenes you're likely not to feel it, as Mali T6xx/T7xx has no trouble with 4-cycle shaders, and one can always choose a lower multiplier. It seems truly an elegant use of Mali's TBDR.

    Perhaps a post should be done on AA alone (if there hasn't been already). While finding info about MSAA is easy, finding good examples is not -- even finding images of examples of 16x MSAA is very hard. It floors me how rarely I find this Mali feature used in Android games (and I specifically look out for it) given the negligible performance cost and dead-simple implementation. The only game that I can recall to have found that has natively implemented MSAA was a game called Royal Revolt which runs at the full 2.5K on my T604 powered Nexus 10. Though the shaders are very simple, the game looks amazing, thanks in large part to the AA.

    I may have one or two questions in your follow up blog post on the architecture, but I will try to make them good ones!

    Thanks again Pete,

    Sean

  • Perhaps a post should be done on AA alone (if there hasn't been already).

    I was thinking that last night when I was writing the answer above - it feels meaty enough to be worth a blog. Consider it added to the list - although it may take me a while to get around to it (currently writing up a series on using DS-5 Streamline for performance profiling).

    Cheers,

    Pete

  • Great! I will look out for this at some point in the future. My only request is that there are a lot of comparison pictures: both zoomed-in crops and full images.

  • Nice article.

    I had some very basic doubts regarding dependencies between tiler jobs, and am really eager to get the answers.

    Is each tiler job dependent on the previous one?

    If yes, then why is the relative ordering of these jobs important? Is it anything to do with early-z?

  • I can't talk in detail about the internals of the micro-architecture, but in terms of generic OpenGL ES requirements ...

    The specification behaves as an in-order machine. Triangles must be shown on screen as if they were rendered in the order specified at the API (both in terms of by drawcall, and in terms of by primitive within each drawcall). If you don't follow the "in order" rules then it is possible to get rendering artefacts (typically around depth testing, stencil testing, but also transparencies).

    For example:

    Imagine two overlapping coplanar triangles - one red, one blue, drawn with depth write enabled, and depth test set to "less than". If you render them in API order (red then blue) then the red one will get drawn, and the blue one will get culled because it will fail the depth test. If the hardware swaps that order for any reason then the blue one will get drawn, and the red one gets culled - you get "wrong" rendering output.
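That example can be reproduced with a tiny model of a single pixel. This is just an illustrative sketch: the depth value of 0.5 and the clear depth of 1.0 are assumed, and the function is hypothetical rather than anything from the driver.

```python
def render_in_order(fragments, depth_clear=1.0):
    """Apply a 'less than' depth test with depth writes at one pixel.

    fragments is a list of (color, depth) pairs in submission order.
    """
    color, depth = None, depth_clear
    for frag_color, frag_depth in fragments:
        if frag_depth < depth:  # strict less-than: equal depth fails the test
            color, depth = frag_color, frag_depth
    return color

red = ("red", 0.5)
blue = ("blue", 0.5)  # coplanar with red, so identical depth at this pixel

api_order = render_in_order([red, blue])
swapped_order = render_in_order([blue, red])
print(api_order, swapped_order)
```

Whichever fragment arrives first writes its depth, and the second then fails the strict comparison, so the visible color depends entirely on submission order.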

    HTH,
    Pete