Chinese version of this blog - thanks to vincent for the translation!
In my previous blog I started defining an abstract machine which can be used to describe the application-visible behaviors of the Mali GPU and driver software. The purpose of this machine is to give developers a mental model of the interesting behaviors beneath the OpenGL ES API, which can in turn be used to explain issues which impact their application’s performance. I will use this model in future blogs in this series to explore some common performance potholes which developers encounter when developing graphics applications.
This blog continues the development of this abstract machine, looking at the tile-based rendering model of the Mali GPU family. I’ll assume you've read the first blog on pipelining; if you haven’t I would suggest reading that first.
In a traditional mains-powered desktop GPU architecture — commonly called an immediate mode architecture — the fragment shaders are executed on each primitive, in each draw call, in sequence. Each primitive is rendered to completion before starting the next one, with an algorithm which approximates to:
foreach( primitive )
    foreach( fragment )
        render fragment
As any triangle in the stream may cover any part of the screen, the working set of data maintained by these renderers is large: typically at least a full-screen size color buffer, depth buffer, and possibly a stencil buffer too. A typical working set for a modern device will be 32 bits-per-pixel (bpp) color, and 32bpp packed depth/stencil. A 1080p display therefore has a working set of approximately 16MB, and a 4k2k TV has a working set of approximately 64MB. Due to their size these working buffers must be stored off-chip in DRAM.
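The working-set figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming 1080p means 1920x1080 and 4k2k means 3840x2160 (the text does not give exact dimensions); the function name is made up for illustration.

```python
# Working-set estimate for an immediate mode renderer.
# Assumes 32bpp color plus 32bpp packed depth/stencil, as in the text.
BYTES_PER_PIXEL = 4 + 4  # color + packed depth/stencil


def working_set_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Return the size in bytes of full-screen color and depth/stencil buffers."""
    return width * height * bytes_per_pixel


# Display dimensions are assumptions, not taken from the text.
for name, (w, h) in {"1080p": (1920, 1080), "4k2k": (3840, 2160)}.items():
    size = working_set_bytes(w, h)
    print(f"{name}: {size} bytes = {size / 2**20:.1f} MiB")
```

Under these assumptions the results come out at roughly 15.8 MiB and 63.3 MiB, which matches the rounded 16MB and 64MB figures.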
Every blending, depth testing, and stencil testing operation requires the current value of the data for the current fragment’s pixel coordinate to be fetched from this working set. All fragments shaded will typically touch this working set, so at high resolutions the bandwidth load placed on this memory can be exceptionally high, with multiple read-modify-write operations per fragment, although caching can mitigate this slightly. This need for high bandwidth access in turn drives the need for a wide memory interface with lots of pins, as well as specialized high-frequency memory, both of which result in external memory accesses which are particularly energy intensive.
The Mali GPU family takes a very different approach, commonly called tile-based rendering, designed to minimize the number of power-hungry external memory accesses needed during rendering. As described in the first blog in this series, Mali uses a distinct two-pass rendering algorithm for each render target. It first executes all of the geometry processing, and then executes all of the fragment processing. During the geometry processing stage, Mali GPUs break up the screen into small 16x16 pixel tiles and construct a list of which rendering primitives are present in each tile. When the GPU fragment shading step runs, each shader core processes one 16x16 pixel tile at a time, rendering it to completion before starting the next one. For tile-based architectures the algorithm equates to:
foreach( tile )
    foreach( primitive in tile )
        foreach( fragment in primitive in tile )
            render fragment
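To make the two passes concrete, here is a small sketch of building per-tile primitive lists and then walking them tile by tile. It is not a description of how Mali hardware bins primitives: the bounding-box policy, the function names, and the example triangles are all assumptions chosen for illustration, and a bounding box can include tiles the triangle never touches.

```python
from collections import defaultdict

TILE = 16  # tile edge in pixels, as described in the text


def bin_primitives(primitives, tile=TILE):
    """Build per-tile primitive lists using conservative bounding-box binning.

    primitives: list of (name, [(x, y), ...]) in API submission order.
    Returns {(tile_x, tile_y): [name, ...]} with API order preserved per tile.
    """
    tile_lists = defaultdict(list)
    for name, verts in primitives:
        xs = [x for x, _ in verts]
        ys = [y for _, y in verts]
        # Hypothetical policy: every tile overlapped by the bounding box.
        for ty in range(min(ys) // tile, max(ys) // tile + 1):
            for tx in range(min(xs) // tile, max(xs) // tile + 1):
                tile_lists[(tx, ty)].append(name)
    return dict(tile_lists)


def render(tile_lists):
    """Walk tiles, then the primitives in each tile, mirroring the loop nest above."""
    order = []
    for tile_xy in sorted(tile_lists):
        for name in tile_lists[tile_xy]:
            order.append((tile_xy, name))  # stand-in for shading the fragments
    return order


# Illustrative triangles: A and C sit inside tile (0, 0), B spans several tiles.
prims = [
    ("A", [(0, 0), (10, 0), (10, 10)]),
    ("B", [(0, 0), (20, 0), (0, 20)]),
    ("C", [(0, 0), (0, 10), (10, 10)]),
]
lists = bin_primitives(prims)
print(lists[(0, 0)])  # ['A', 'B', 'C']
print(lists[(1, 1)])  # ['B']
```

Each tile list keeps primitives in submission order, so tile (0, 0) sees A, then B, then C. In this sketch tile (1, 1) also lists B even though the triangle does not reach it, which is the price of the simple bounding-box test.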
As a 16x16 tile is only a small fraction of the total screen area it is possible to keep the entire working set (color, depth, and stencil) for a whole tile in a fast RAM which is tightly coupled with the GPU shader core.
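For a sense of scale, the per-tile working set can be worked out the same way as the full-screen one. This sketch assumes the same 32bpp color and 32bpp packed depth/stencil format, one sample per pixel, and a 1920x1080 screen; the function names are illustrative only.

```python
import math

TILE = 16            # tile edge in pixels, from the text
BYTES_PER_PIXEL = 8  # assumed: 32bpp color + 32bpp packed depth/stencil


def tile_working_set_bytes(tile=TILE, bytes_per_pixel=BYTES_PER_PIXEL):
    """Bytes of color, depth, and stencil data needed for a single tile."""
    return tile * tile * bytes_per_pixel


def tile_count(width, height, tile=TILE):
    """Number of tiles needed to cover the screen, rounding partial tiles up."""
    return math.ceil(width / tile) * math.ceil(height / tile)


print(tile_working_set_bytes())  # 2048
print(tile_count(1920, 1080))    # 8160
```

Under these assumptions a tile needs about 2KB of local storage, compared with roughly 16MB for the full 1080p working set, and the screen is covered by 120 x 68 tiles.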
This tile-based approach has a number of advantages. They are mostly transparent to the developer but worth knowing about, in particular when trying to understand bandwidth costs of your content:
It is clear from the list above that tile-based rendering carries a number of advantages, in particular giving very significant reductions in the bandwidth and power associated with framebuffer data, as well as being able to provide low-cost anti-aliasing. What is the downside?
The principal additional overhead of any tile-based rendering scheme is the point of hand-over from the vertex shader to the fragment shader. The output of the geometry processing stage, the per-vertex varyings and tiler intermediate state, must be written out to main memory and then re-read by the fragment processing stage. There is therefore a balance to be struck between costing extra bandwidth for the varying data and tiler state, and saving bandwidth for the framebuffer data.
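The balance can be sketched as a back-of-envelope model. Everything numeric here is an assumption for illustration, not a measurement: the vertex count, the 32 bytes of varyings per vertex, the overdraw factor of 2, the absence of caching, and the idea that a tile-based renderer writes only the final color of each pixel once. Tiler state is ignored entirely.

```python
def geometry_traffic_bytes(vertices, bytes_per_vertex):
    """Varyings written once by geometry processing, read once by fragment processing."""
    return 2 * vertices * bytes_per_vertex


def immediate_framebuffer_bytes(pixels, bytes_per_pixel, overdraw):
    """One read plus one write of the working set per shaded fragment, no caching."""
    return 2 * pixels * bytes_per_pixel * overdraw


def tiled_framebuffer_bytes(pixels, color_bytes_per_pixel):
    """Assumes only the final color of each pixel is written to memory, once."""
    return pixels * color_bytes_per_pixel


# All inputs below are hypothetical example values.
pixels = 1920 * 1080
immediate = immediate_framebuffer_bytes(pixels, 8, overdraw=2)
tiled = tiled_framebuffer_bytes(pixels, 4) + geometry_traffic_bytes(300_000, 32)
print(immediate, tiled)
```

With these made-up inputs the tile-based total is lower, but the model also shows how the answer shifts: raising the vertex count or the varying size grows the geometry term, while raising resolution or overdraw grows the framebuffer term.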
In consumer electronics today there is a significant shift towards higher resolution displays; 1080p is now normal for smartphones, tablets such as the Mali-T604 powered Google Nexus 10 are running at WQXGA (2560x1600), and 4k2k is becoming the new “must have” in the television market. Screen resolution, and hence framebuffer bandwidth, is growing fast. In this area Mali really shines, and does so in a manner which is mostly transparent to the application developer - you get all of these goodies for free with no application changes!
On the geometry side of things, Mali copes well with complexity. Many high-end benchmarks are approaching a million triangles a frame, which is an order of magnitude (or two) more complex than popular gaming applications on the Android app stores. However, as the intermediate geometry data does hit main memory there are some useful tips and tricks which can be applied to fine tune the GPU performance, and get the best out of the system. These are worth an entire blog by themselves, so we’ll cover these at a later point in this series.
In this blog I have compared and contrasted the desktop-style immediate mode renderer, and the tile-based approach used by Mali, looking in particular at the memory bandwidth implications of both.
Tune in next time and I’ll finish off the definition of the abstract machine, looking at a simple block model of the Mali shader core itself. Once we have that out of the way we can get on with the useful part of the series: putting this model to work and earning a living optimizing your applications running on Mali.
Note: The next blog in this series, The Midgard Shader Core, has now been published.
As always, comments and questions are more than welcome.
Thank you for this article, which has cleared up some areas of confusion that I had about Mali's approach to tiling. But I am unclear about how hierarchical tiling affects the construction of tile lists. In essence, my questions are: Does Mali have exactly as many tile lists as there are 16x16 areas of the screen (or do larger hierarchical levels make their own larger tile lists that then schedule separately)? And under what conditions can a triangle get put into the list of a tile that it doesn't actually touch?
For example, if there were 3 triangles rendered in the following order:
T1 vertices: (0,0), (10,0), (10,10) -- i.e., entirely within the bottom-left 16x16 tile
T2 vertices: (0,0), (20,0), and (0,20) -- i.e., it touches 3 16x16 tiles (but lies within a square of 4 tiles)
T3 vertices: (0,0), (0,10), (10,10) -- i.e., entirely within the bottom-left 16x16 tile
Then I think the bottom-left tile's list renders in this order: T1, then its T2 fragment, then T3. Correct?
And the neighboring tiles' lists would each contain their own T2 fragments?
But would there be TWO of these neighboring tiles or THREE (i.e., what happens to tiles that lie within the square but don't actually touch the triangle)?
I can't talk in detail about the internals of the micro-architecture, but in terms of generic OpenGL ES requirements ...
The specification behaves as an in-order machine. Triangles must be shown on screen as if they were rendered in the order specified at the API (both across draw calls, and across primitives within each draw call). If you don't follow the "in order" rules then it is possible to get rendering artefacts (typically around depth testing and stencil testing, but also transparencies).
Imagine two overlapping coplanar triangles - one red, one blue, drawn with depth write enabled, and depth test set to "less than". If you render them in API order (red then blue) then the red one will get drawn, and the blue one will get culled because it will fail the depth test. If the hardware swaps that order for any reason then the blue one will get drawn, and the red one gets culled - you get "wrong" rendering output.
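The red/blue case can be reproduced with a toy single-pixel depth test. This is only a sketch of the "less than" comparison with depth write enabled; the buffers are plain dictionaries, the depth value of 0.5 is arbitrary, and the function names are invented for the example.

```python
def draw(depth_buffer, color_buffer, pixel, depth, color):
    """Apply a "less than" depth test with depth write enabled for one fragment."""
    if depth < depth_buffer[pixel]:
        depth_buffer[pixel] = depth
        color_buffer[pixel] = color


def render(fragments):
    """Render fragments for pixel 0 in the given order and return the final color."""
    depth_buffer = {0: 1.0}      # cleared to the far plane
    color_buffer = {0: "clear"}
    for depth, color in fragments:
        draw(depth_buffer, color_buffer, 0, depth, color)
    return color_buffer[0]


# Coplanar triangles produce identical depth at the shared pixel.
red, blue = (0.5, "red"), (0.5, "blue")
print(render([red, blue]))  # red
print(render([blue, red]))  # blue
```

Whichever fragment arrives first wins, because the second one has equal depth and fails the strict comparison, so swapping the order changes the visible result.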
I have some very basic doubts regarding dependencies between tiler jobs, and am really eager to get the answers.
Is each tiler job dependent on the previous one?
If yes, then why is the relative ordering of these jobs important? Is it anything to do with early-z?
Great! I will look out for this at some point in the future. My only request is that there are a lot of comparison pictures: both zoomed-in crops and full images.
Perhaps a post should be done on AA alone (if there hasn't been already).
I was thinking that last night when I was writing the answer above - it feels meaty enough to be worth a blog. Consider it added to the list - although it may take me a while to get around to it (currently writing up a series on using DS-5 Streamline for performance profiling).