Every time I stand up at GDC and give a presentation on how to improve the performance of mobile applications there is always one person in the crowd who fills in the feedback form with the following – “Well, this is all obvious. I’ve been doing this for years!” I do wonder if it’s the same guy every time or different people, but, my friend, I’m here to inform you that sadly, you are in a minority.
I have seen the following frequently in my role as Director of the ARM® Mali™ Performance Analysis team and in my mind, ignoring the tips below should be a crime against graphics performance (mostly because I like the image of bursting into a room and shouting “Book ’em Danno, performacide in the 1st!”*). I’ve picked out some of the more recurrent themes from the last few years with the hope of a little crime prevention…the names have been removed to protect the innocent.
*Being English and a child of the 70’s I did want to put a quote from the Sweeney in, but Jack Regan’s utterance wouldn’t make it past marketing in our postmodern, more sensitive world.
Sorting your objects is by far the easiest optimization to implement, but the majority of apps we see still don’t use it (including not one, but two widely used graphics benchmarks). It’s an amazing thing, seeing the surprise on a developer’s face when you show them how easy this is to implement and the effect it can have on performance. Apparently qsort() is a very overlooked function in libc.
Simply put, you order the objects by the Z value of their origin and submit them in front-to-back draw order. This ensures optimal use of the Z buffer: fragments hidden behind already-drawn geometry fail the depth test early instead of being shaded. If you want to get fancier you can sort on the bounding sphere or box for larger objects with potential overlap. “But what if I have objects with alpha?” You simply separate out the objects containing alpha, then order those by Z, same as we did with the opaque objects. Draw the opaque objects first and then draw the alpha objects.
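As a rough illustration, a front-to-back sort using qsort() might look like the sketch below. The DrawObject struct and its fields are hypothetical, purely for illustration; real engines would carry mesh handles, transforms and so on alongside.

```c
#include <stdlib.h>

/* Hypothetical per-object record; only the origin's Z matters here. */
typedef struct {
    float z;         /* distance from the camera (smaller = nearer) */
    int   has_alpha; /* non-zero if the object needs blending       */
} DrawObject;

/* qsort comparator: nearest objects first (front to back). */
static int cmp_front_to_back(const void *a, const void *b)
{
    float za = ((const DrawObject *)a)->z;
    float zb = ((const DrawObject *)b)->z;
    return (za > zb) - (za < zb);
}

/* Partition opaque objects to the front of the array, sort each group
 * by Z, and return the opaque count so the caller can draw the opaque
 * group first, then the alpha group. */
size_t sort_for_draw(DrawObject *objs, size_t count)
{
    size_t opaque = 0;
    for (size_t i = 0; i < count; ++i) {
        if (!objs[i].has_alpha) {
            DrawObject tmp = objs[opaque];
            objs[opaque] = objs[i];
            objs[i] = tmp;
            ++opaque;
        }
    }
    qsort(objs, opaque, sizeof *objs, cmp_front_to_back);
    qsort(objs + opaque, count - opaque, sizeof *objs, cmp_front_to_back);
    return opaque;
}
```

The caller then iterates the first `opaque` entries, followed by the rest, and never has to interleave blended and opaque submissions.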
From my crime files the worst offender I’ve seen was an app with an average overdraw of 12x (see my previous blog to get an idea of what effect this has). My team showed the developer the version with sorted objects and they ended up with a 2-3x performance boost.
There is a common misconception that calls to the driver are free. Unfortunately, this is not true. Perhaps we have only ourselves to blame as we make every effort to make it seem that way, but every OpenGL® ES API function call has CPU overhead. For some functions that overhead is bigger, for others it’s smaller; this is largely dependent on how much state they affect. Naturally, functions such as draw calls (calls to glDraw* functions) tend to carry more overhead because they use the state information.
This sounds like a basic concept, but you would be surprised at what real apps do. Issuing excessive draw calls is generally a bad idea. One notable example was an app which sent a single quad per draw call because it used a different part of a texture (no, I’m not kidding) for each quad, meaning it used 700+ draw calls per frame to draw the scene. On lower-end (single-core ARM11™ class) platforms, this consumed almost as much CPU time as it took to draw the scene in the first place.
Generally, draw calls consume less time on deferred rendering GPUs because the driver only needs to ensure it has a snapshot of the relevant state and buffers. A deferred renderer won’t actually engage the hardware to draw anything (usually the point which would cause overhead) until eglSwapBuffers, glFlush or a similar condition requires the draw calls to be resolved. This means that a lot of the cost can be offset on today’s multi-core CPUs by performing the data/state preparation asynchronously on a separate thread, in parallel with the running app and driver.
However, there is still an overhead, and it varies little with the number of primitives being drawn by the draw call: it is similar whether you draw a single triangle or thousands of triangles. So if you combine multiple triangles into a single draw call, the overhead is paid once rather than multiple times. This reduces the total overhead and increases the performance of your application. For some very neat ideas on how to combine draw calls more effectively, start with the blog post Game Set and Batch.
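As a sketch of the batching idea applied to the quad-per-draw-call offender above: the per-quad data can be merged CPU-side into one vertex buffer and one index buffer, then submitted with a single glDrawElements call. The Sprite and Vertex layouts here are assumptions for illustration, not from any particular engine.

```c
#include <stddef.h>

/* One vertex: screen position plus texture-atlas UV. */
typedef struct { float x, y, u, v; } Vertex;

/* Hypothetical sprite: a screen rectangle and its sub-rectangle
 * in the texture atlas. */
typedef struct { float x, y, w, h, u0, v0, u1, v1; } Sprite;

/* Fill `verts` (4 per sprite) and `indices` (6 per sprite) so the whole
 * set can be drawn with ONE glDrawElements(GL_TRIANGLES, ...) call
 * instead of one draw call per quad. Returns the index count. */
size_t batch_quads(const Sprite *sprites, size_t count,
                   Vertex *verts, unsigned short *indices)
{
    for (size_t i = 0; i < count; ++i) {
        const Sprite *s = &sprites[i];
        Vertex *v = &verts[i * 4];
        v[0] = (Vertex){ s->x,        s->y,        s->u0, s->v0 };
        v[1] = (Vertex){ s->x + s->w, s->y,        s->u1, s->v0 };
        v[2] = (Vertex){ s->x + s->w, s->y + s->h, s->u1, s->v1 };
        v[3] = (Vertex){ s->x,        s->y + s->h, s->u0, s->v1 };

        unsigned short base = (unsigned short)(i * 4);
        unsigned short *ix = &indices[i * 6];
        ix[0] = base;  ix[1] = base + 1;  ix[2] = base + 2;
        ix[3] = base;  ix[4] = base + 2;  ix[5] = base + 3;
    }
    return count * 6;
}
```

With a texture atlas, every quad can share the same texture binding and state, so the 700+ draw calls in the example above collapse to one.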
Having said all of that, don’t go crazy! You need to bear in mind that large object batches with high potential for occlusion can be unnecessarily costly, as the vertices still need processing to determine position before visibility culling. A single very large scenery object with large portions on and off screen, or a number of smaller objects batched together across a large area of the scene, are good examples.
For very large objects with dense geometry, it is always worth implementing a hierarchy of bounding boxes and checking each child box for visibility rather than sending the whole object and letting the GPU work it out. Again, we have seen examples of objects in apps with vertex counts in the 50K region where only 20-30% of the object is visible at any one time.
The bandwidth cost of those vertices and processing time in the GPU versus a simple bounding volume check against the view frustum is likely to be an order of magnitude difference. That’s an order of magnitude for the sake of a bit of judicious app-side culling…
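A minimal sketch of what that app-side culling check can look like: a bounding sphere tested against six inward-facing frustum planes. The structs are hypothetical, and the code assumes the plane normals are unit length.

```c
/* Plane in the form nx*x + ny*y + nz*z + d = 0, normal pointing
 * into the frustum. */
typedef struct { float nx, ny, nz, d; } Plane;

/* Bounding sphere: centre plus radius. */
typedef struct { float x, y, z, radius; } Sphere;

/* Returns non-zero if the sphere is at least partly inside the frustum
 * described by six inward-facing planes with unit-length normals. */
int sphere_in_frustum(const Sphere *s, const Plane planes[6])
{
    for (int i = 0; i < 6; ++i) {
        float dist = planes[i].nx * s->x + planes[i].ny * s->y +
                     planes[i].nz * s->z + planes[i].d;
        if (dist < -s->radius)
            return 0; /* fully outside this plane: cull, skip the draw */
    }
    return 1; /* possibly visible: submit its draw call(s) */
}
```

For a bounding-box hierarchy you run the same kind of test on each child volume: a handful of multiply-adds per node versus transforming tens of thousands of vertices the GPU will then throw away.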
Unfortunately, binding and unbinding render targets between draw calls is seen regularly in a lot of commercial engines, for some reason, and it can cause flushes and reloads in tile/cache memory. The optimal use case is to bind once, issue all the draw calls for that target, and then unbind.
Why? This is because most deferred rendering GPUs work on a small section of the screen at a time, commonly referred to as a tile: an N×N sub-region of the screen. What the driver and the GPU try to do is retain the tile being worked on for as long as there is work to be done on it. Binding and unbinding between draw calls means the driver has to second-guess what you wanted it to do. If it’s not sure, it has to err on the side of caution and write the tile back to main memory. A re-bind of the same target after an unbind can see the tile ping-pong into and out of memory.
Remember, the driver gets very little information about what your intent is (hopefully this will be fixed in future revisions of OpenGL ES, but for now we have to live with it), so making it second-guess is always a bad idea. If you leave the draw target bound then you are explicitly telling the driver “yes – I’m still drawing to that”. Also, take a look at glDiscardFramebufferEXT(), which helps indicate to the driver when a render attachment is complete and its contents are no longer needed.
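In pseudocode, the pattern looks something like the sketch below; `shadow_fbo` and the attachment list are placeholders, and the real frame loop would sit around this.

```
// Preferred pattern: one bind, all draws, one discard/unbind.
glBindFramebuffer(GL_FRAMEBUFFER, shadow_fbo);        // bind ONCE
//   ... issue every draw call that targets shadow_fbo ...
GLenum discard[] = { GL_DEPTH_ATTACHMENT };
glDiscardFramebufferEXT(GL_FRAMEBUFFER, 1, discard);  // depth no longer needed
glBindFramebuffer(GL_FRAMEBUFFER, 0);                 // unbind ONCE

// Anti-pattern: interleaving binds per draw call makes the driver
// write the tile back and reload it each time.
// glBindFramebuffer(GL_FRAMEBUFFER, shadow_fbo); draw; glBindFramebuffer(GL_FRAMEBUFFER, 0);
// glBindFramebuffer(GL_FRAMEBUFFER, shadow_fbo); draw; ...   // tile ping-pongs
```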
Next... I will be exposing more crimes against performance in “It’s Just Criminal! Examples of Performance Thrown Away in Real Apps (Part 2)”, in which PC Ray Caster will put before the jury the case of the year.
Hi Sean,
Thanks for the heads up, I've updated it today, so should be all good now.
Hola, edplowman... Just a heads up:
You may want to correct the link in your article to Sam Martin's Siggraph paper to:
http://www.geomerics.com/wp-content/uploads/2014/03/SIGGRAPH-2013-SamMartinEtAl-Challenges.pdf
Sean
Being a child of the 90's, I'm now having to look up Sweeney quotes.