Continuing on from It’s Just Criminal! Examples of Performance Thrown Away in Real Apps (Part 1), let’s look at more criminal behaviour and, for me, what has to be the crime of the year...
Client-side buffers really shouldn’t be your first choice on a mobile platform. Any buffer administered by the client has an unknown history: the driver has no idea what has been done to it between uses unless it scans it, which can be an extremely costly affair, so it is mostly not an approach driver implementers take, preferring to recommend Vertex Buffer Objects (VBOs) instead. Because the status of a client-side buffer is not deterministic on a deferred rendering GPU (which, as previously discussed, covers about 90% of the mobile market), the driver has to take a copy of the client-side buffer being referenced. This carries both a bandwidth overhead and a memory footprint overhead.
VBOs, on the other hand, have a prescribed interface in the API, so it is possible (to a certain extent) to track the provenance of updates to the buffer. This means the driver only takes a copy when it needs to, and can often “dirty patch” the updates so it only requires the difference between the pre- and post-modified buffer. This can save a lot of bandwidth.
One of the biggest offences we’ve seen in this category is using VBOs but uploading the contents of a client-side buffer into the VBO for each draw call, which rather defeats the object of the exercise. Similarly, overusing dynamic VBO or index buffer object updates via glBufferSubData() etc. causes the same issue. We’ve seen a couple of applications recently which tweak several small regions (on the order of 10-15 vertices), not localized to each other, within a larger VBO on each new draw call. This is not as bad as client-side buffers, but if an object is that dynamic it really should be in its own draw call and VBO.
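To make that concrete, here is a minimal sketch in C contrasting the two patterns. The function names and the assumption of tightly packed vec3 positions are mine, purely for illustration:

```c
#include <GLES2/gl2.h>

/* Anti-pattern: the VBO is just a staging area for a client-side buffer,
 * so the driver must copy or ghost the data on every single draw call. */
void draw_bad(GLuint vbo, const float* verts, GLsizeiptr bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, verts); /* every draw call! */
    glDrawArrays(GL_TRIANGLES, 0, (GLsizei)(bytes / (3 * sizeof(float))));
}

/* Better: static geometry lives in its own GL_STATIC_DRAW VBO, uploaded
 * exactly once; a genuinely dynamic object gets its own GL_DYNAMIC_DRAW
 * VBO and its own draw call, so the static data is never re-sent. */
GLuint create_static_vbo(const float* verts, GLsizeiptr bytes)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, verts, GL_STATIC_DRAW);
    return vbo;
}
```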
See the blog Nothing but Static for more detail on vertex buffer objects in action.
You also need to pay similar attention to the use of glTexSubImage() updates. Remember: on a deferred renderer no draw activity happens until eglSwapBuffers() or similar is called. If you update a texture several times within a frame, every version of that texture referenced by a draw call must still exist at the time of rendering. Overuse of partial texture updates can therefore have a detrimental effect on bandwidth and working footprint.
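One way to sidestep the mid-frame ghosting is to rotate through a small pool of textures so that each one is written at most once per frame. This is a hypothetical sketch, not a prescribed API pattern; the pool size of three matches the pipeline depth discussed later:

```c
#include <GLES2/gl2.h>

#define SCRATCH_COUNT 3  /* roughly one per frame potentially in flight */

/* Instead of calling glTexSubImage2D() on the same texture several times
 * per frame (forcing the driver to keep a ghost copy of each intermediate
 * version until fragment shading runs), hand out a different texture from
 * the pool for each update. Textures are assumed to be created elsewhere. */
static GLuint scratch_tex[SCRATCH_COUNT];
static int    scratch_idx;

GLuint next_scratch_texture(void)
{
    scratch_idx = (scratch_idx + 1) % SCRATCH_COUNT;
    return scratch_tex[scratch_idx];
}
```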
Multiple Render Targets (MRTs), or the ability to target more than one output buffer with a single draw call, are a common way of avoiding sending expensive geometry multiple times to build up secondary buffers, and are often used in app-side deferred rendering flows (not to be confused with the deferred hardware model). Since this technique is new in OpenGL ES 3.0 I’ll apply some leniency, but in the applications so far we’ve still seen some suspicious behaviour!
MRTs can be implemented very efficiently on deferred rendering GPUs if you can keep everything in the tile. Guidance from most of the GPU vendors with deferred (i.e. tile-based) architectures is to make sure that the sum of bits per fragment across all targets fits within the maximum on-tile storage width. Bear in mind that each GPU vendor has different criteria, but the consensus seems to be that 128 bits is a good number to work to. Also keep an eye on the alignment of the fields in each target (it’s unlikely the hardware will allow you to do arbitrary bit-field assignments).
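As a rough sketch of staying inside that budget, here is an OpenGL ES 3.0 framebuffer with four RGBA8 colour attachments, i.e. 4 x 32 = 128 bits per fragment. The function name is illustrative and error checking is omitted:

```c
#include <GLES3/gl3.h>

/* Build an MRT framebuffer that fits a 128-bit per-fragment tile budget:
 * four RGBA8 targets at 32 bits each. */
GLuint create_gbuffer(GLsizei w, GLsizei h)
{
    GLuint fbo, tex[4];
    GLenum bufs[4] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1,
                       GL_COLOR_ATTACHMENT2, GL_COLOR_ATTACHMENT3 };
    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glGenTextures(4, tex);
    for (int i = 0; i < 4; ++i) {
        glBindTexture(GL_TEXTURE_2D, tex[i]);
        glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, w, h);
        glFramebufferTexture2D(GL_FRAMEBUFFER, bufs[i],
                               GL_TEXTURE_2D, tex[i], 0);
    }
    glDrawBuffers(4, bufs); /* route fragment outputs 0..3 to the four targets */
    return fbo;
}
```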
As I said, there are limited numbers of OpenGL ES 3.0 applications available today, but we have seen at least a couple which use four MRTs (the primary colour buffer and three secondary buffers). In OpenGL® and OpenGL ES all the targets in an MRT need to be the same size and format as the primary. For this use case we had 4x RGBA buffers, which is fine, but when we examined the buffers only 1-2 channels of each target were actually being used. “So what?” you may say, “It’s all in the tile, so I use a little more, big deal.” But at some point you will want to write those buffers back to memory and read them back when you perform your resolve/consolidation pass. It’s going to be a lot cheaper to pack the data into two fully-used targets at full resolution than to write back and re-read four, as the sketch below illustrates.
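Here is a hedged GLSL ES 3.0 sketch of that packing. The layout and the varying names are mine for illustration (not taken from the app in question): instead of four part-used targets, albedo, roughness, normal, metalness and ambient occlusion all fit in two fully-used RGBA8 targets:

```c
/* Fragment shader packing a G-buffer into two targets instead of four,
 * halving the tile write-back and the reads in the resolve pass. */
static const char* packed_gbuffer_fs =
    "#version 300 es\n"
    "precision mediump float;\n"
    "layout(location = 0) out vec4 rt0; // rgb: albedo,    a: roughness\n"
    "layout(location = 1) out vec4 rt1; // rg: normal.xy,  b: metalness, a: AO\n"
    "in vec2 vUV;\n"
    "in vec3 vNormal;\n"
    "in vec3 vMaterial; // x: roughness, y: metalness, z: AO (assumed varyings)\n"
    "uniform sampler2D uAlbedo;\n"
    "void main() {\n"
    "    rt0 = vec4(texture(uAlbedo, vUV).rgb, vMaterial.x);\n"
    "    rt1 = vec4(normalize(vNormal).xy * 0.5 + 0.5,\n"
    "               vMaterial.y, vMaterial.z);\n"
    "}\n";
```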
If you want the optimal implementation of the deferred rendering model and you don’t mind using an extension, you might want to take a look at this paper [Removed] from Sam Martin. By using the extension described there, you can in most cases eliminate the need to write back the tile and then re-read it as a texture source for the resolve/consolidation pass, saving even more bandwidth.
Deferred GPUs pipeline the operations required to create a frame: frames move through stages which build a command stream, perform vertex shading and finally perform fragment shading and output. This means there are actually three frames in flight, and the one you are working on app-side is actually frame N+2. Within this pipeline, commands such as glReadPixels(), glCopyTexImage() and occlusion queries can block the pipeline and degrade performance if not used carefully… and unfortunately pretty much every app I’ve seen using these mechanisms has committed this crime.
Firstly, if using glReadPixels(), make sure you use it with pixel buffer objects (PBOs). This schedules the actual pixel read-back from the buffer asynchronously (often hardware accelerated) and glReadPixels() returns to the calling thread immediately without stalling the application. To read the content of the buffer you then bind and map the PBO (see glMapBuffer()). If rendering to the buffer isn’t complete at the point you attempt the map operation, the map will still stall until rendering finishes. The best advice is therefore to pipeline these read-backs where possible, so that you use the results from frame N in frame N+2, or, if that’s not possible, to separate the dependent operations as much as possible and use fence and sync objects to ensure coherence. You might also consider using a shared context and placing the wait for the read-back on an asynchronous thread. I’d apply the same advice to glCopyTexImage().
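A minimal sketch of that pipelined read-back, assuming an OpenGL ES 3.0 context and a ring of PBOs pre-allocated elsewhere with glBufferData(GL_PIXEL_PACK_BUFFER, w*h*4, NULL, GL_DYNAMIC_READ). The helper names and the ring depth are illustrative:

```c
#include <GLES3/gl3.h>
#include <string.h>

#define INFLIGHT 3  /* matches the ~3 frames a deferred GPU keeps in flight */

static GLuint pbo[INFLIGHT]; /* assumed created and sized elsewhere */
static long   frame;

/* Kick off an asynchronous read-back of the current framebuffer; with a
 * PIXEL_PACK buffer bound, glReadPixels() takes a buffer offset and
 * returns immediately instead of stalling. */
void queue_readback(GLsizei w, GLsizei h)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[frame % INFLIGHT]);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

/* Map the oldest PBO in the ring (queued two frames ago), so by frame
 * N+2 the map should no longer block on rendering. */
void collect_readback(void* dst, size_t bytes)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[(frame + 1) % INFLIGHT]);
    void* src = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0,
                                 (GLsizeiptr)bytes, GL_MAP_READ_BIT);
    if (src) {
        memcpy(dst, src, bytes);
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    ++frame;
}
```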
The advice for occlusion queries is very similar. Polling for the result of an occlusion query immediately creates a stall (this is true on all GPUs, not just deferred ones), so the advice is: always pipeline occlusion queries.
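In practice that means issuing the query in one frame and only fetching the result once GL_QUERY_RESULT_AVAILABLE reports it is ready, typically a couple of frames later. A small sketch, with an illustrative struct of my own (q->id is assumed to come from glGenQueries()):

```c
#include <GLES3/gl3.h>

typedef struct { GLuint id; int pending; GLuint visible; } OcclusionQuery;

void issue_query(OcclusionQuery* q)
{
    glBeginQuery(GL_ANY_SAMPLES_PASSED, q->id);
    /* ... draw the object's bounding volume here ... */
    glEndQuery(GL_ANY_SAMPLES_PASSED);
    q->pending = 1;
}

/* Call once per frame; only fetch the result when it will not block. */
void poll_query(OcclusionQuery* q)
{
    GLuint ready = GL_FALSE;
    if (!q->pending) return;
    glGetQueryObjectuiv(q->id, GL_QUERY_RESULT_AVAILABLE, &ready);
    if (ready) {
        glGetQueryObjectuiv(q->id, GL_QUERY_RESULT, &q->visible);
        q->pending = 0;
    }
}
```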
Not compressing your textures is a bit like speeding: we’ve pretty much all done it, it’s easily done, we don’t think about the consequences and everyone has an excuse, but there isn’t one. Unlike speeding, however, I think that failing to compress textures should be a capital offence. Compressing your textures has a massive impact on bandwidth, reducing it by 2x, 4x, 8x or more, and is an essential part of authoring for maximum performance on mobile devices.
So what’s the excuse? Back in the days of ETC1 there was the defence of “but it doesn’t do alpha, m’lud”; that, however, could be worked around. With the introduction of OpenGL ES 3.0 that defence has been eliminated by the inclusion of ETC2, which has alpha support. However, this has given rise to the “Matrix Defence”; let me explain…
Consider the “Matrix” below, which shows the compression formats available in the world developers have been used to: only a very narrow selection of input formats, pixel formats and encoding bit-rates can be compressed. The defence is that, within the “Matrix”, developers can’t get the exact format they want…
Time to take the red pill. With ASTC this is the new reality:
Adaptive Scalable Texture Compression (ASTC), the standard developed by ARM and officially adopted by The Khronos Group as an extension to both the OpenGL and OpenGL ES graphics APIs, is the best method available, offering increased quality and fidelity, very low bit-rates and just about every input format you may want or need. Independent testing of ASTC has shown that quality similar to 2 bits per pixel in existing compression schemes can be achieved at the next bit-rate down in ASTC, saving further bandwidth for the same level of quality. So now there is no excuse!
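Using ASTC at runtime is straightforward once the data has been compressed offline. A hedged sketch, assuming a pre-compressed 6x6-block image (e.g. from an offline encoder, header already stripped) and the KHR_texture_compression_astc_ldr extension; the function name is illustrative:

```c
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>
#include <string.h>

/* Upload a pre-compressed ASTC 6x6 texture, checking for the extension
 * first and leaving the fallback (e.g. ETC2) to the caller. */
GLuint upload_astc_6x6(const void* blocks, GLsizei w, GLsizei h, GLsizei bytes)
{
    GLuint tex = 0;
    const char* ext = (const char*)glGetString(GL_EXTENSIONS);
    if (!ext || !strstr(ext, "GL_KHR_texture_compression_astc_ldr"))
        return 0; /* not supported on this device */
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glCompressedTexImage2D(GL_TEXTURE_2D, 0,
                           GL_COMPRESSED_RGBA_ASTC_6x6_KHR,
                           w, h, 0, bytes, blocks);
    return tex;
}
```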
To close out this blog, I’d like to give you my personal pick of crimes against performance from 2013. We join the court as the prosecution presents its first witness…
PC Ray Caster:
“If I may present the evidence, your honour….
"Whilst proceeding in our usual routine activities we happened upon some suspicious activity. The performance analysis team regularly, as new games or graphics focused applications appear in the app stores, run them and investigate how they are using the GPU. This helps us maintain a view of trends in workload, application composition etc. which helps us shape the future direction of our roadmap.
“Our attention was attracted to this particular app when we noticed that it was consuming an unusually large amount of bandwidth for the visual complexity of the scene. “Hello, hello, hello!” we said to ourselves, “What’s all this then?” Upon further investigation the crime scene revealed itself: a field of long grass.
“We proceeded to assess the crime scene and found the grass to be made up of imposters*, which is what we expected, as this is a fairly standard technique for rendering scrub, foliage etc. In this particular case each imposter was made up of two quads intersecting at the mid-point of the X axis at 90° to each other. Again, this is all fairly standard stuff.
“The forensics team used the Mali Graphics Debugger to look for evidence of foul play, and immediately the full horrors of this case began to unfold. As we stepped through the frame the first issue became obvious: the imposters were being drawn back to front. We let the frame complete and then checked the stats. The overdraw map showed a peak in double digits and the texture bandwidth was criminal! The grass was accounting for more than half of the total run-time of the scene.
“Continuing the investigation, we found that the texture used for the grass/shrubs was neither mipmapped nor compressed. Given the viewing angle of the scene and the distance of each shrub imposter from the viewer, most of the imposters were very small, causing under-sampling of the texture (each screen pixel covered many texels), which was thrashing the cache and causing the excessive bandwidth consumption.
“After some more investigation, we also found that rather than using “punch-through alpha”**, the app had turned on alpha blending, causing all overdrawn pixels to be blended with each other, which in turn forced the engine into the back-to-front ordering (alpha-blended objects must be drawn back to front for visual correctness).
“Once the crime scene was cleaned up, your honour, the application performance improved considerably. Clearly this shows criminal neglect, your honour. That concludes the evidence for the prosecution."
*You basically replace a model with a textured 2D quad which rotates so as always to remain viewport aligned. Imagine a cardboard cut-out of a tree that turns to keep facing you wherever you go, and there you have it!
**Transparent texels in an RGBA texture are marked with alpha = 0 and are discarded in the fragment shader, acting as a mask. All other texels have an alpha > 0 and are written as opaque pixels; the alpha is not used for blending. A cheaper variant uses an RGB-only texture and picks either black (0, 0, 0) or white (1.0, 1.0, 1.0) as the mask value.
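For the record, here is a minimal GLSL ES sketch of punch-through alpha as described above; the shader and sampler names are illustrative. No blending state is enabled, so no back-to-front sorting is required:

```c
/* Fragment shader: discard masked texels, write everything else opaque. */
static const char* grass_fragment_shader =
    "#version 100\n"
    "precision mediump float;\n"
    "uniform sampler2D uGrass;   // mipmapped, compressed grass atlas\n"
    "varying vec2 vUV;\n"
    "void main() {\n"
    "    vec4 c = texture2D(uGrass, vUV);\n"
    "    if (c.a < 0.5) discard;          // punch through the mask\n"
    "    gl_FragColor = vec4(c.rgb, 1.0); // opaque; alpha not blended\n"
    "}\n";
```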
Judge Hugh Harshly:
"I believe I've heard enough...
“I find the defendant guilty on the charge of fraudulent use of Alpha modes liable to cause excessive bandwidth consumption, being several times over the legal limit of overdraw while in charge of a GPU, cache abuse, extortion of bandwidth, applying a texture without due care and attention and finally failure to compress a texture... a most heinous crime.
“Do you have anything to say for yourself before I pass sentence?"
Defendant:
"Its a fit up! Society's to blame! What chance did I have growing up with a desktop GPU, I don't know no different do I?"
Judge Hugh Harshly:
“Very well… clerk, hand me my black cap would you, there’s a good fellow..."