Time and time again when I’m presenting at events I am asked the same questions: What does the future of mobile content look like? How much performance will content developers need? When will mobile devices reach console quality? My answers to the first two always end up being vague and non-committal – surely the first depends on how much computing power is available and the second on the ambitions of the developers in question? Ultimately, those first two questions depend on the answer to the third. That is what we will address in this blog.
Theoretically, the compute potential of mobile devices had only just managed to catch up with that of the current generation of consoles before the Xbox One and PlayStation 4 were released, raising that hanging branch a little bit higher. Both superphones and consoles such as the Xbox 360 were offering about 250 GFLOPS of computing horsepower. In fact, by the end of 2013, superphone computing performance was expected to finally equal that of first-generation shader-based desktop GPUs – and as mobile compute is happily following Moore's Law, the improvements show no sign of slowing down any time soon.
There is no doubt about it: these devices are packing some amazing performance potential, so why aren't we seeing a corresponding increase in content complexity? Why can't I play Crysis on a handset? To understand that, you have to look at the bandwidth available on mobile, console and desktop.
State-of-the-art desktops have an inordinate amount of bandwidth available to them. The Xbox 360 had 32GB/s. State-of-the-art mobile SoCs can currently offer a theoretical maximum of 8-12GB/s. The reason available bandwidth is so low in the mobile space is that it comes at the price of power – a scarce commodity on a battery-powered device!
Whereas desktop GPUs routinely require >300W for the GPU alone and the current crop of console systems are typically between 80W and 100W, mobile devices have somewhere in the region of 3W to 7W (superphone and tablet respectively) available with which they have to power not just the GPU, but also the CPU, modem, WiFi and the display! With this restriction on power, bandwidth capacity has not grown at the same rate as compute and the mobile space remains, at this moment in time, two to three years behind consoles and eight plus years behind desktops. So, whilst semiconductor companies are striving to deliver higher levels of bandwidth within a mobile SoC, developers should also learn the methods of getting “100W” of work from the 3W available.
A typical system configuration for a superphone has a memory system that yields approximately 8GB/second as a theoretical bandwidth limit for the entire system. Right off the bat, as the memory system is not 100% efficient in real life (you have to allow for memory refresh, the opening and closing of pages, etc.), we need to degrade that figure to approximately 80% of the theoretical maximum. Already we are down to 6.4GB/second as a starting point.
From a GPU standpoint we then need to subtract some basic system overheads. Firstly, as we are performing a graphics function, we need to get our pixels to a display at some point. For a superphone with a 1080p screen updating at 60fps, the display subsystem absorbs about 0.5GB/second of that. Similarly, we need to leave the CPU with some bandwidth for the driver etc., which will be somewhere in the region of 0.5GB/second. This is slightly conservative, but it allows for the overhead of the low-latency access the CPU requires. That leaves us with a total of approximately 5.4GB/second for the GPU.
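If you want to replay that arithmetic yourself, here is a minimal back-of-the-envelope sketch in C. The constants (8GB/s raw bandwidth, 80% utilisation, a 32-bit 1080p display at 60fps and a 0.5GB/s CPU reserve) are the illustrative figures from the worked example above, not measurements from any particular SoC.

    /*
     * Back-of-the-envelope bandwidth budget, using the example
     * figures from the text (nothing here is measured from a real SoC).
     */
    #include <stdio.h>

    int main(void)
    {
        const double raw_bw      = 8.0;   /* theoretical system bandwidth, GB/s      */
        const double utilisation = 0.80;  /* realistic DDR efficiency                */
        const double cpu_reserve = 0.5;   /* bandwidth left for the CPU/driver, GB/s */

        /* 1080p, 32 bits per pixel, refreshed 60 times a second */
        const double display_bw = 1920.0 * 1080.0 * 4.0 * 60.0 / 1e9;

        const double usable_bw = raw_bw * utilisation;                 /* ~6.4 GB/s */
        const double gpu_bw    = usable_bw - display_bw - cpu_reserve; /* ~5.4 GB/s */

        printf("Usable system bandwidth: %.1f GB/s\n", usable_bw);
        printf("Display refresh        : %.1f GB/s\n", display_bw);
        printf("Left for the GPU       : %.1f GB/s\n", gpu_bw);
        return 0;
    }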
Although we've used 80% utilisation as an average, it is worth keeping in mind that not all SoCs are created equal and there is a dynamic range here. We have seen platforms which, for various reasons, are only able to achieve 70% of theoretical max. It's also worth mentioning that utilisation often degrades as memory clocks increase, due to increased latency.
So, with a starting point of 5.4GB/second, if we divide that by our target frame rate (60fps) we have a per-frame starting point of 90MB/frame of peak bandwidth. Rendering an image requires the target frame buffer to be written back from the GPU at least once, which removes about 8MB/frame, leaving us at about 82MB/frame.
Dividing that by the resolution gives you approximately 40 bytes/pixel to cover everything: attribute inputs, varying outputs, varying inputs and texture inputs. The texture and frame buffer bandwidth costs are per-pixel costs, while the impact of the others depends on polygon size, but it is still worth thinking in terms of a "per pixel" bandwidth cost.
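Continuing the same sketch, this snippet turns the ~5.4GB/second GPU figure into the per-frame and per-pixel budgets just described; again, the numbers are the worked example's, not a measurement.

    /* Turn the ~5.4 GB/s GPU budget into per-frame and per-pixel figures. */
    #include <stdio.h>

    int main(void)
    {
        const double gpu_bw   = 5.4e9;             /* bytes/second left for the GPU  */
        const double fps      = 60.0;
        const double pixels   = 1920.0 * 1080.0;
        const double fb_write = pixels * 4.0;      /* one 32-bit frame buffer write  */

        const double per_frame = gpu_bw / fps;         /* ~90 MB/frame */
        const double after_fb  = per_frame - fb_write; /* ~82 MB/frame */
        const double per_pixel = after_fb / pixels;    /* ~40 bytes    */

        printf("Per-frame budget         : %.0f MB\n", per_frame / 1e6);
        printf("After frame buffer write : %.0f MB\n", after_fb / 1e6);
        printf("Per-pixel budget         : %.1f bytes\n", per_pixel);
        return 0;
    }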
Three of the five main suppliers of GPU technology to the mobile market (making up approximately 90% of the total) are using tile-based/deferred rendering GPUs of one form or another, and this is no coincidence. Tile-based rendering allows most of the intermediate fragment write and Z read/write cost to be removed, as we do all of that "in tile" on the GPU, saving the vast majority of the bandwidth.
However, deferred rendering does require vertex data to be “Read, Written & Read back” as part of the binning process (the method by which tile based renderers decide which primitives contribute to which screen tile sections). This is where most of the bandwidth is used.
Let's consider a simple, vendor-agnostic example. For the sake of argument, take a primitive from a tri-strip optimised mesh (i.e. we only need one unique vertex per new triangle), where each vertex carries at least a position, a colour and a pair of texture co-ordinates: roughly 32 bytes of attribute data per vertex.
Now, as said above, we need to read, write and read back that information as part of binning, so each primitive basically costs 96 bytes of bandwidth. Caching etc. has an effect, but everyone does it slightly differently, so for the sake of simplicity let's say that is compensated for by the tri-strip optimisation for the moment.
Now, we've specified that we have a texture. Assuming we do nothing else but light it, and that all the per-fragment inputs are register mapped, we need to fetch at least one texel per fragment, so an uncompressed texture costs 1x INT32[RGB] = 4 bytes per fragment. Applying a limit of 10 fragments per primitive, as per the guide figure for primitive-to-fragment ratio we discuss in "Better living through (appropriate) geometry", we have a total of 40 bytes for the fragments. Working this through (96 bytes of vertex bandwidth plus 40 bytes of texture bandwidth, or 136 bytes per primitive), you can see that we hit the bandwidth limit in a very simple use case which yields approximately 603K polys/frame, or about 36M polys/sec. Now that sounds like a lot, but we haven't done anything "interesting" with the polygon yet. Add an extra texture source or a surface normal and that number comes down pretty quickly.
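Here is the same polygon budget written out as a tiny C program, so the 603K figure is easy to reproduce. The 32-byte vertex is the assumed position + colour + UV layout from the example, and 10 fragments per primitive is the guide ratio mentioned above.

    /*
     * Naive per-primitive cost: a 32-byte vertex read, written and read
     * back by the binner, plus one uncompressed 32-bit texel for each of
     * ~10 fragments.
     */
    #include <stdio.h>

    int main(void)
    {
        const double frame_budget   = 82e6;  /* bytes/frame from the steps above  */
        const double vertex_bytes   = 32.0;  /* position + colour + UVs (assumed) */
        const double binning_factor = 3.0;   /* read, write, read back            */
        const double texel_bytes    = 4.0;   /* uncompressed 32-bit texel         */
        const double frags_per_poly = 10.0;  /* guide primitive:fragment ratio    */

        const double per_poly = vertex_bytes * binning_factor
                              + texel_bytes * frags_per_poly;       /* 136 bytes */
        const double per_frame = frame_budget / per_poly;           /* ~603 K    */

        printf("Bandwidth per polygon: %.0f bytes\n", per_poly);
        printf("Polygons per frame   : %.0f K\n", per_frame / 1e3);
        printf("Polygons per second  : %.1f M\n", per_frame * 60.0 / 1e6);
        return 0;
    }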
Let's have a look at what happens when we do some very basic things to our polygon. We can't really do anything about the position without it getting a bit complicated (although it was popular in the early days of 3D games to send objects with compressed object-space co-ordinates and then scale them using the transform), but we can instantly bring that colour value down to a packed RGB:888 value, cutting its overhead to a third. We can also halve the texture co-ordinate size by using FP16 for U and V; this is no hardship if you use normalised co-ordinates, as they can be scaled and calculated at a higher resolution inside the shader code.
Now we've gone from 603K polys/frame to just over 800K polys/frame, or around 49M polys/sec. If we then apply texture compression to the texture source, we can take the 4 bytes per texture fetch down to 4 bits using ETC1 (5 bytes for our 10-pixel poly), or down to 2 bits (or lower) using ASTC (2.5 bytes for our 10-pixel poly). This brings us up to 1.26M to 1.3M polys/frame, or 75M to 78M polys/sec, which I'm sure you'll agree is a hell of a lot more impressive.
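As a concrete illustration of what that slimmed-down vertex might look like in application code, here is a sketch assuming an OpenGL ES 3.0 context. The exact struct layout and attribute indices are my assumptions rather than anything prescribed above, but the sizes match the example: 20 bytes per vertex instead of 32.

    #include <stddef.h>      /* offsetof */
    #include <GLES3/gl3.h>   /* OpenGL ES 3.0 */

    /* 20 bytes per vertex instead of 32: FP32 position, packed 8-bit
     * colour and FP16 texture co-ordinates (stored here as raw 16-bit
     * values the application has already converted). */
    typedef struct {
        GLfloat  position[3];  /* 12 bytes */
        GLubyte  colour[4];    /*  4 bytes, normalised back to 0..1 in the shader */
        GLushort uv[2];        /*  4 bytes, half-float bit patterns               */
    } PackedVertex;

    static void set_vertex_layout(void)
    {
        const GLsizei stride = sizeof(PackedVertex);

        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, stride,
                              (const void *)offsetof(PackedVertex, position));
        glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE, stride,
                              (const void *)offsetof(PackedVertex, colour));
        glVertexAttribPointer(2, 2, GL_HALF_FLOAT, GL_FALSE, stride,
                              (const void *)offsetof(PackedVertex, uv));

        glEnableVertexAttribArray(0);
        glEnableVertexAttribArray(1);
        glEnableVertexAttribArray(2);
    }

Fed to the GPU this way, the attribute data the binner has to read, write and read back shrinks by over a third, which is where the improved polygon budget above comes from.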
So you can see that making the most out of the performance available without killing the bandwidth requires slightly different (but not excessively tricky) thinking compared to desktop or console.
Next time... "It’s Just Criminal! Examples of performance thrown away in real Apps"
Perfect! This is the best news I have heard all day.
It sounds as though the access latency (i.e. locality) of memory is dependent on the size of the uniform data on a per-draw-call basis. This provides a very good target for per-case optimization, and is very encouraging. I'm sure that with a little profile-driven optimization, great performance can be attained with modest amounts of uniforms!
This can be very useful for certain tasks. For example, producing many pseudo-random numbers can be dramatically aided by a very small uniform seed array, which is far more palatable than soaking up bandwidth doing multiple texture reads. In this case, even if the random-seed "texture" is resident in L2, you tie up the texture unit for a few clocks that could otherwise be used. As the number of generated random numbers increases, using uniforms becomes more and more attractive compared to both straight computation and dependent texture reads.
Sean