At the Develop 2012 conference in Brighton I gave a talk about how we achieved some of the effects in our brand new (at the time) demo Timbuktu. As I repeated this presentation at a number of developer events, one particular section of it got longer and longer as I incorporated additional information to pre-answer the most common questions I was receiving. When the opportunity arose to write new presentations, expanding that one section into a presentation by itself seemed an easy win.
While we’re talking about easy wins, have you ever found that as you develop an application with lots of models you reach a point where, regardless of the complexity of the models, each new model you add drops the frame rate? There’s a chance that you’ve hit the draw call limit, which coincidentally is what that presentation I mentioned before was about.
There’s a limiting factor in graphics which has nothing to do with the GPU itself and is entirely about the CPU load associated with sending commands to the driver. This load is generated by calls to glDrawElements and glDrawArrays, often referred to by the collective name ‘draw calls’. Everything up to a draw call is simply setting states in the driver software. At the point when the draw call is issued, all that state gets bundled up and sent to the GPU in a language it can understand, so that the GPU can then work on rendering it all without any further communication with the driver.
The number of draw calls you can afford per frame depends on the CPU you’re using, but as a rule we try to stay under 50 draw calls per frame in our internal demos, fewer if possible. We maintain this limit, despite having a complex virtual world, through the use of batching.
Batching is a technique whereby you draw multiple things in one draw call. The simplest way to imagine it is you take a number of different models and put them all in the same vertex buffer object. Then you render the whole buffer as one. If the objects have different textures, they are combined into one big texture atlas and the texture coordinates are rescaled to look up the correct points in the atlas rather than the individual texture. Finally, in order to make sure the objects can move independently, the vertices have an extra attribute, basically an ID number tagged to each vertex to tell it what model it’s part of.
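To make the idea concrete, here is a minimal CPU-side sketch of building such a batch. It is not the code used in Timbuktu; the `Model` class, the `build_batch` function and the interleaved layout (x, y, z, u, v, model ID) are assumptions chosen for illustration. It merges the models into one vertex list, rescales each model's texture coordinates into its rectangle of the atlas, and tags every vertex with the ID of the model it belongs to.

```python
from dataclasses import dataclass


@dataclass
class Model:
    positions: list    # [(x, y, z), ...]
    uvs: list          # [(u, v), ...] in the model's own 0..1 texture space
    atlas_rect: tuple  # (u0, v0, width, height) of its texture inside the atlas


def build_batch(models):
    """Merge models into one vertex list laid out as (x, y, z, u, v, model_id)."""
    vertices = []
    ranges = []  # (first_vertex, vertex_count) for each model, in buffer order
    for model_id, model in enumerate(models):
        u0, v0, width, height = model.atlas_rect
        first = len(vertices)
        for (x, y, z), (u, v) in zip(model.positions, model.uvs):
            # Rescale the UV so it looks up the atlas instead of the original texture.
            atlas_u = u0 + u * width
            atlas_v = v0 + v * height
            vertices.append((x, y, z, atlas_u, atlas_v, float(model_id)))
        ranges.append((first, len(model.positions)))
    return vertices, ranges


# Two hypothetical triangles whose textures sit in different atlas regions.
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
triangle_uvs = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
models = [
    Model(triangle, triangle_uvs, (0.0, 0.0, 0.5, 0.5)),
    Model(triangle, triangle_uvs, (0.5, 0.0, 0.5, 0.5)),
]
vertices, ranges = build_batch(models)
```

The `ranges` list is kept alongside the vertex data because knowing where each model starts and ends in the buffer is what later lets you draw only part of the batch.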
In the vertex shader you then give an array of uniform mat4 values, rather than the single world space transformation typically used, and the ID number can look into this array to find the right one. Thus you can have different models with different textures in different positions with different scale and rotation factors, all moving independently with a single draw call.
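As a rough sketch of that lookup, the snippet below holds an illustrative GLSL ES 1.00 vertex shader as a string and then emulates the same per-vertex matrix selection on the CPU. The attribute and uniform names, the array size of 16 and the helper functions are all assumptions made for this example, not names from the demo.

```python
# Illustrative GLSL ES 1.00 vertex shader; every name here is hypothetical.
VERTEX_SHADER = """
attribute vec3 a_position;
attribute float a_model_id;
uniform mat4 u_view_proj;
uniform mat4 u_world[16];   // one world matrix per model in the batch
void main() {
    gl_Position = u_view_proj * u_world[int(a_model_id)] * vec4(a_position, 1.0);
}
"""


def translation(tx, ty, tz):
    """Row-major 4x4 translation matrix."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def mat_vec(m, v):
    """Multiply a row-major 4x4 matrix by a 4-component column vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))


def transform_batch(vertices, world_matrices):
    """CPU emulation of the shader: each vertex picks its matrix by model ID."""
    out = []
    for x, y, z, model_id in vertices:
        out.append(mat_vec(world_matrices[int(model_id)], (x, y, z, 1.0)))
    return out


# The same local-space point, tagged as belonging to model 0 and to model 1.
batch = [(1.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 1.0)]
world = [translation(0.0, 0.0, 0.0), translation(5.0, 0.0, 0.0)]
moved = transform_batch(batch, world)
```

Bear in mind that the number of matrices you can pass is bounded by the vertex uniform space the device exposes, so the array size is something to check per target rather than assume.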
If you do this with different models it’s a way of batching together a scene, though take note that the objects will always be drawn in the order in which they are lined up in the VBO, which makes it a little harder to depth sort the scene. If the models are identical you can draw them in the right order because it doesn’t matter which model ID represents which particular instance of that model.
Using a batch like this to represent multiple instances of the same object also offers an additional technique with pretty much no overhead. If you fill a VBO with the same object at different levels of detail, starting with the most detailed and ending with the least, and assign your instances to those slots front to back, the detail level switches automatically: the nearest instance always lands on the most detailed copy and the farthest on the least detailed.
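A minimal sketch of that assignment, assuming the batch slots are ordered from most to least detailed and that distance from the camera is the sorting key (the function name and sample data are hypothetical):

```python
import math


def assign_lod_slots(instance_positions, camera_position):
    """Return instance indices ordered nearest first.

    Entry k of the result is the instance whose matrix goes into slot k of
    the uniform array, so slot 0 (most detailed copy) gets the nearest one.
    """
    return sorted(
        range(len(instance_positions)),
        key=lambda i: math.dist(instance_positions[i], camera_position),
    )


# Three hypothetical instances of the same model at different distances.
positions = [(0.0, 0.0, -50.0), (0.0, 0.0, -5.0), (0.0, 0.0, -20.0)]
matrices = ["matrix_far", "matrix_near", "matrix_mid"]  # stand-ins for mat4 data
order = assign_lod_slots(positions, (0.0, 0.0, 0.0))
slot_matrices = [matrices[i] for i in order]
```

Only the order of the uniform array changes from frame to frame; the vertex buffer itself is never touched, which is why the technique costs so little.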
When batching different objects in a scene, sometimes the issue of occlusion or removing objects from a scene comes up. Models at the start of the batch can be skipped by starting at a later vertex, and models at the end can be skipped by reducing the vertex count so the draw stops before the end of the buffer. If you are drawing a batch of models and want to skip a few in the middle, the quick way to take them out is by passing a matrix of zeroes into that part of the uniform array, which collapses every vertex of the model onto the world space origin and makes all of its triangles degenerate. However, if you have a sparsely rendered batch of objects (basically, if from the first to last model you render, there are more models skipped with a zero matrix than actually rasterized to the screen) it may work out more efficient to render it in more than one draw call. If you do a lot of batching and the application is constantly vertex bound irrespective of how much is currently drawn, this might be a sign that you’re transforming lots of batched vertices with zero matrices.
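The following sketch plans a single draw under those rules. It assumes each model's `(first_vertex, vertex_count)` range in the buffer is known and that the ranges are contiguous; the function name and the `wasted` figure it reports are illustrative choices, not part of any API.

```python
ZERO_MATRIX = [[0.0] * 4 for _ in range(4)]
IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]


def plan_single_draw(ranges, visible, matrices):
    """Trim the draw to the first/last visible model and zero out the gaps.

    Returns (first_vertex, vertex_count, uniforms, wasted), where `wasted`
    is the number of vertices that are still transformed but collapse to
    nothing because their model received a zero matrix.
    """
    shown = [i for i, is_visible in enumerate(visible) if is_visible]
    if not shown:
        return None  # nothing to draw, skip the call entirely
    first_model, last_model = shown[0], shown[-1]
    first_vertex = ranges[first_model][0]
    end_vertex = ranges[last_model][0] + ranges[last_model][1]
    uniforms = [m if v else ZERO_MATRIX for m, v in zip(matrices, visible)]
    wasted = sum(
        ranges[i][1]
        for i in range(first_model, last_model + 1)
        if not visible[i]
    )
    return first_vertex, end_vertex - first_vertex, uniforms, wasted


# Four hypothetical models; the first and third are hidden this frame.
ranges = [(0, 10), (10, 20), (30, 5), (35, 8)]
visible = [False, True, False, True]
plan = plan_single_draw(ranges, visible, [IDENTITY] * 4)
first_vertex, vertex_count, uniforms, wasted = plan
```

Here the first model is skipped for free by starting at vertex 10, while the third still costs five vertex shader invocations. Tracking `wasted` against the number of useful vertices is one simple way to notice when a batch has become sparse.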
If you’ve been proactive in your batching you should be sufficiently under the CPU load limit to draw a VBO in several passes, using different starting vertices and different vertex counts to draw subsets of the buffer. Exactly how you slice it is dependent on your application, but by looking at the CPU and vertex shader load in ARM® Streamline™ Performance Analyzer you should be able to make the right choices.
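One possible way to slice the buffer, sketched below, is to issue one draw per contiguous run of visible models. The function name is hypothetical and the ranges are assumed to be contiguous and in buffer order; each returned pair corresponds to the `first` and `count` arguments of a glDrawArrays call.

```python
def visible_runs(ranges, visible):
    """Return one (first_vertex, vertex_count) pair per run of visible models."""
    runs = []
    start = None
    for i, is_visible in enumerate(visible):
        if is_visible and start is None:
            start = i                      # a new run begins here
        elif not is_visible and start is not None:
            runs.append((start, i - 1))    # the run ended on the previous model
            start = None
    if start is not None:
        runs.append((start, len(visible) - 1))
    draws = []
    for first_model, last_model in runs:
        first_vertex = ranges[first_model][0]
        end_vertex = ranges[last_model][0] + ranges[last_model][1]
        draws.append((first_vertex, end_vertex - first_vertex))
    return draws


ranges = [(0, 10), (10, 20), (30, 5), (35, 8)]
sparse_draws = visible_runs(ranges, [False, True, False, True])
dense_draws = visible_runs(ranges, [True, True, False, True])
```

This trades vertex shader work for draw calls, so in practice you would probably cap the number of runs and fall back to zero matrices for small gaps.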
The final question which usually arises is how to perfectly depth sort different objects within a batch, for example if the objects were alpha blended and needed to be sorted back to front. There’s no perfect solution for this, although depending on your use case there are a number of partial solutions. If you’re working with a small number of objects, you could store an index buffer of the objects swizzled in every possible permutation, and pass the right order through to the draw call. Faced with a larger number of objects I’d suggest reducing the alpha blended geometry down to their own separate topologically identical meshes. Often alpha blended models are mostly opaque with one specific part that is alpha blended, such as a model of a tree with a few textured leafy parts or a car with transparent windows. If the transparent parts are simple enough they can be made topologically congruent and use parameters to convert what each mesh represents on the fly.
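For the small-object-count case, the permutation idea might look like the sketch below. The function and variable names are assumptions for illustration. It stores the index lists of every ordering back to back in one index buffer and records where each ordering starts, so that choosing a draw order is just choosing an offset.

```python
from itertools import permutations


def build_permutation_index_buffer(object_indices):
    """Concatenate the objects' indices in every possible drawing order.

    Returns (indices, offsets, count), where offsets maps an ordering tuple
    to its starting position in `indices` and count is the number of indices
    to draw for any one ordering.
    """
    indices = []
    offsets = {}
    for order in permutations(range(len(object_indices))):
        offsets[order] = len(indices)
        for obj in order:
            indices.extend(object_indices[obj])
    count = sum(len(ix) for ix in object_indices)
    return indices, offsets, count


# Three hypothetical quads, six indices each.
quads = [
    [0, 1, 2, 0, 2, 3],
    [4, 5, 6, 4, 6, 7],
    [8, 9, 10, 8, 10, 11],
]
indices, offsets, count = build_permutation_index_buffer(quads)

# Suppose object 2 is farthest from the camera, then 0, then 1.
back_to_front = (2, 0, 1)
start = offsets[back_to_front]
# With glDrawElements the offset is given in bytes, so multiply `start`
# by the size of one index before passing it to the draw call.
```

The buffer grows factorially with the number of objects (three objects need 6 orderings, five need 120), which is why this only suits very small groups.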
A good example of this is merging different types of foliage into a batch. In Timbuktu we did this first by making the opaque parts, tree trunks and the like, into a separate geometry batch. Then the grass, shrubs, treetops and bushes could all be represented by a mesh which looked like a couple of crossed rectangles, textured, rotated and scaled based upon what the mesh was meant to be. The texture bounds within the texture atlas were passed as an array, just like the matrices, allowing the models to be re-ordered freely and still represent different things in world space.
All these techniques are described in a presentation I gave on the ARM booth at GDC 2013, which later got combined with my other presentation from that event and recorded as a video for the Mali developer website.
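Because every mesh in such a batch is the same pair of crossed rectangles, any slot can stand for any plant, and sorting becomes a matter of reordering the uniform arrays. The sketch below shows one way that might be done; the dictionary keys, names and sample values are hypothetical.

```python
import math


def sort_back_to_front(instances, camera_position):
    """Order per-instance uniforms so slot 0 is the farthest instance.

    Each instance is a dict with 'position', 'matrix' and 'atlas_rect'.
    The returned lists would be uploaded as the matrix array and the
    texture-bounds array respectively.
    """
    ordered = sorted(
        instances,
        key=lambda inst: math.dist(inst["position"], camera_position),
        reverse=True,
    )
    matrices = [inst["matrix"] for inst in ordered]
    atlas_rects = [inst["atlas_rect"] for inst in ordered]
    return matrices, atlas_rects


# Hypothetical foliage; matrices are named stand-ins for real mat4 data.
foliage = [
    {"position": (0.0, 0.0, -2.0), "matrix": "grass_xform",
     "atlas_rect": (0.0, 0.0, 0.25, 0.25)},
    {"position": (0.0, 0.0, -30.0), "matrix": "treetop_xform",
     "atlas_rect": (0.5, 0.0, 0.5, 0.5)},
    {"position": (0.0, 0.0, -9.0), "matrix": "shrub_xform",
     "atlas_rect": (0.25, 0.0, 0.25, 0.25)},
]
matrices, atlas_rects = sort_back_to_front(foliage, (0.0, 0.0, 0.0))
```

Since the matrix and the texture bounds travel together, the treetop stays a treetop no matter which slot of the batch ends up drawing it.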
If you’d like to talk about any of the techniques I’ve described in person, I regularly attend game development events and I’m not hard to find. Keep an eye on the ARMMultimedia twitter feed to see what events we’re attending next. Alternatively, drop a comment in the section below.
Understood! I will certainly attend a tradeshow one of these days, and I will make sure to stop by!
Sorry about the lack of availability of APKs, we're not in a position to release them publicly at this time. If you're at a tradeshow we're attending however I encourage you to visit us and check out the new demos, and we're happy to answer any questions you have/talk about graphics in general etc.
Indeed! While GLES 3.x seems to be a little more forgiving when it comes to reducing draw calls (e.g. instancing), I would guess that there is still a modern incentive to reduce calls by batching. Even if modern CPUs can handle the load, greatly reducing the draw calls should have a significant impact on power consumption and thermals. There are also other demands on the CPU (e.g. physical simulation), so reducing load also leaves you with more cycles to do other things!
It seems that mobile can be very powerful, but you can't attack problems the same way they are dealt with on desktop, even given similar resources.
Off topic: I was an early fan of TrueForce. As an enthusiast, I remember marvelling that the Mali400MP4 could do TrueForce at 1080p 60fps and in stereo! At the time, rendering 4 million pixels per frame was unheard of, especially since there was overdraw (the engine's particle exhaust)! It would be nice to download a simple APK and try a great many ARM demos (I'm really interested in the latest Geomerics demo, and the SeeMore demo) but sadly this seems only an option if one attends a tradeshow!
Ah I see. These techniques are essentially designed to reduce CPU overhead. At the time TrueForce was made, for example (it implements a lot of the techniques discussed), the market was predominantly single-core phones, dual-core phones were just coming out, and it was very easy for a large number of draw calls to saturate the CPU and become a bottleneck. Batching therefore reduced the CPU load and increased performance, but GLES2 requires you to get creative about how you do this, hence the techniques described.
Thanks for the chat! It was very interesting and informative. And who knows? Maybe something will come of this.
Cheers,
Sean