At the Develop 2012 conference in Brighton I gave a talk about how we achieved some of the effects in our brand new (at the time) demo Timbuktu. As I repeated this presentation at a number of developer events, one particular section of it got longer and longer as I incorporated additional information to pre-answer the most common questions I was receiving. When the opportunity arose to write new presentations, expanding that one section into a presentation by itself seemed an easy win.
While we’re talking about easy wins, have you ever found that as you develop an application with lots of models you reach a point where, regardless of the complexity of the models, each new model you add drops the frame rate? There’s a chance that you’ve hit the draw call limit, which coincidentally is what that presentation I mentioned before was about.
There’s a limiting factor in graphics which is nothing to do with the GPU itself and entirely about CPU load associated with sending commands to the driver. This load is generated from calls to glDrawElements and glDrawArrays, often referred to by the collective name ‘draw calls’. Everything up to a draw call is simply setting states in the driver software. The point when the draw call is issued, all that state gets bundled up and sent to the GPU in a language it can understand, so that the GPU can then work on rendering it all without any further communication with the driver.
Depending on the CPU you’re using this figure changes but as a rule we try to stay under 50 draw calls per frame in our internal demos, less if possible, and we maintain this limit despite having a complex virtual world by the use of batching.
Batching is a technique whereby you draw multiple things in one draw call. The simplest way to imagine it is you take a number of different models and put them all in the same vertex buffer object. Then you render the whole buffer as one. If the objects have different textures, they are combined into one big texture atlas and the texture coordinates are rescaled to look up the correct points in the atlas rather than the individual texture. Finally, in order to make sure the objects can move independently, the vertices have an extra attribute, basically an ID number tagged to each vertex to tell it what model it’s part of.
In the vertex shader you then give an array of uniform mat4 values, rather than the single world space transformation typically used, and the ID number can look into this array to find the right one. Thus you can have different models with different textures in different positions with different scale and rotation factors, all moving independently with a single draw call.
If you do this with different models it’s a way of batching together a scene, though take note that the objects will always be drawn in the order in which they are lined up in the VBO, which makes it a little harder to depth sort the scene. If the models are identical you can draw them in the right order because it doesn’t matter which model ID represents which particular instance of that model.
Using a batch like this to represent multiple instances of the same object also offers an additional technique with pretty much no overhead. By filling a VBO with the same object at different levels of detail, starting with the most detailed and ending with the least, the detail level will switch automatically, so long as you draw your instances front to back.
When batching different objects in a scene, sometimes the issue of occlusion or removing objects from a scene comes up. Models at the start of the batch can be skipped by starting at a later vertex, and reducing the vertex count will stop before the end, but if you are drawing a batch of models and want to skip a few in the middle, the quick way to take them out is by passing a matrix of zeroes into that part of the uniform array, essentially scaling it to always be at world space origin and completely degenerate. However, if you have a sparsely rendered batch of objects (basically, if from the first to last model you render, there are more models skipped with a zero matrix than actually rasterized to the screen) it may work out more efficient to render it in more than one draw call. If you do a lot of batching and the application is constantly vertex bound irrespective of how much is currently drawn, this might be a sign that you’re transforming lots of batched vertices to null matrices.
If you’ve been proactive in your batching you should be sufficiently under the CPU load limit to draw a VBO with several passes, using different starting vertices and different vertex counts to draw subsets of the buffer. Exactly how you slice it is dependent on your application, but using the CPU and vertex shader load in ARM® Streamline™ Performance Analyzer you should be able to make the right choices.
The final question which usually arises is how to perfectly depth sort different objects within a batch, for example if the objects were alpha blended and needed to be sorted back to front. There’s no perfect solution for this, although depending on your use case there are a number of partial solutions. If you’re working with a small number of objects, you could store an index buffer of the objects swizzled in every possible permutation, and pass the right order through to the draw call. Faced with a larger number of objects I’d suggest reducing the alpha blended geometry down to their own separate topologically identical meshes. Often alpha blended models are mostly opaque with one specific part that is alpha blended, such as a model of a tree with a few textured leafy parts or a car with transparent windows. If the transparent parts are simple enough they can be made topologically congruent and use parameters to convert what each mesh represents on the fly.
A good example of this is merging different types of foliage into a batch. In Timbuktu we did this first by making the opaque parts, tree trunks and the like, into a separate geometry batch. Then the grass, shrubs, treetops and bushes could all be represented by a mesh which looked like a couple of crossed rectangles, textured rotated and scaled based upon what the mesh was meant to be. The texture bounds within the texture atlas were passed as an array, just like the matrices, allowing the models to be re-ordered freely and still represent different things in world space.
All these techniques are described in a presentation I gave on the ARM booth at GDC 2013, which later got combined with my other presentation from that event and recorded for the Mali developer website. You can watch the video right now:
If you’d like to talk about any of the techniques I’ve described in person, I regularly attend game development events and I’m not hard to find. Keep an eye on the ARMMultimedia twitter feed to see what events we’re attending next. Alternatively, drop a comment in the section below.
Hmm.. chrisvarns I'm not familiar with that command, but I'm not so sure that my suggestion is similar based on what I've read. The Draw Indirect command seems to concern itself with using attribute information taken from a buffer supplied by the GPU without requiring CPU intervention. In the case of the Sparse Object Array List, the goal is draw a set of arrays (eg. VBOs), stored at different memory offsets, in a single call. Each element in the Sparse Object Array List points to one (or more) of the arrays of data to be drawn.
For example, our Sparse Object Array List could be defined as a list of pointers (forgive the shoddy pseudo code ):
{ *VBO_1, *VBO_34, *VBO_5, *VBO_766 }
In this case, we have 4 batched objects. This list would be re-built each frame, with relevant objects. The data in our objects: VBO_1, VBO_34, VBO_5, and VBO_766 would be submitted for vertex shading in a single draw call and in the order supplied.
The benefit here is that draw calls are greatly reduced for batch-able objects (eg. those that share materials). As with separate calls, the order of submission can be carefully selected for depth-sorted objects, or spatially similar objects, and objects can be easily ignored by not including them on the list. The cost in bandwidth is just the cost of the list pointers, so should be very small. The cost of processing would be similarly small. Suddenly a scene with 2000 dynamic, non-instanced, on-screen objects need not overwhelm the CPU with draw calls if they can be batched! Effectively they could be submitted in 1 call, whilst retaining the benefits of individual draw calls (ie. sorting, culling, transforms, etc)!
And the crazy thing is that this seems as though SOAL would be an extension that could be implemented in software rather than hardware.