Previous blog in the series: Mali Performance 4: Principles of High Performance Rendering
Welcome to the next instalment of my blog series looking at graphics rendering performance on Mali GPUs using OpenGL ES. This time around I'll be looking at some of the important application-side optimizations which should be considered when developing a 3D application, before making any OpenGL ES calls at all. Future blogs will look in more detail at specific areas of usage for the OpenGL ES API. Note that the techniques outlined in this blog are not Mali specific, and should work well on any GPU.
The OpenGL ES API specifies a serial stream of drawing operations which are turned into hardware commands for the GPU to perform, with explicit state control over how those drawing operations are to be processed. This low level mode of operation gives the application a huge amount of control over how the GPU performs its rendering tasks, but also means that the device driver has very little knowledge about the whole scene that the application is trying to render. This lack of global knowledge means that the device driver cannot significantly restructure the command stream that it sends to the GPU, so there is a burden of responsibility on the application to send sensible rendering commands through the API in order to achieve maximum efficiency and high performance. The first rule of high performance rendering and optimization is "Do Less Work", and that needs to start in the application before any OpenGL ES API calls have happened.
All GPUs can perform culling, discarding primitives which are outside of the viewing frustum or which are facing away from the camera. This is a very low level cull which is applied primitive-by-primitive, and which can only be applied after the vertex shader has computed the clip-space coordinate for each vertex. If an entire object is outside of the frustum, this can be a tremendous waste of GPU processing time, memory bandwidth, and energy.
The most important optimization which a 3D application should therefore perform is early culling of objects which are not visible in the final render, skipping the OpenGL ES API calls completely for these objects. There are a number of methods which can be used here, with varying degrees of complexity, a few examples of which are outlined below.
The simplest scheme is for the application to maintain a bounding box for each object, which has vertices at the min and max coordinate in each axis. The object-space to clip-space computation for 8 vertices is sufficiently light-weight that it can be computed in software on the CPU for each draw operation, and the box can be tested for intersection with the clip-space volume. Objects which fail this test can be dropped from the frame rendering.
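As a rough sketch of the bounding box test described above, the C++ below transforms the eight corners of a box into clip space and rejects the box only if every corner lies outside the same clip plane. The types and names (`Mat4`, `Aabb`, `isBoxCulled`) are illustrative assumptions rather than part of any API, and the matrix is assumed to be a combined model-view-projection matrix stored in column-major order.

```cpp
// Illustrative types; a real engine will have its own maths library.
struct Mat4 { float m[16]; };                 // column-major, as OpenGL ES expects
struct Vec4 { float x, y, z, w; };
struct Aabb { float min[3]; float max[3]; };  // object-space min/max per axis

// Transform an object-space point (implicit w = 1) into clip space.
Vec4 transformPoint(const Mat4& mvp, float x, float y, float z) {
    const float* m = mvp.m;
    return {
        m[0] * x + m[4] * y + m[8]  * z + m[12],
        m[1] * x + m[5] * y + m[9]  * z + m[13],
        m[2] * x + m[6] * y + m[10] * z + m[14],
        m[3] * x + m[7] * y + m[11] * z + m[15],
    };
}

// Returns true if the box is definitely outside the clip volume
// (-w <= x, y, z <= w), meaning the draw call can be skipped.
bool isBoxCulled(const Mat4& mvp, const Aabb& box) {
    // One counter per clip plane: -x, +x, -y, +y, -z, +z.
    int outside[6] = {0, 0, 0, 0, 0, 0};
    for (int i = 0; i < 8; ++i) {
        // Bits 0..2 of i select min or max on each axis, giving all 8 corners.
        const Vec4 c = transformPoint(mvp,
            (i & 1) ? box.max[0] : box.min[0],
            (i & 2) ? box.max[1] : box.min[1],
            (i & 4) ? box.max[2] : box.min[2]);
        outside[0] += (c.x < -c.w);
        outside[1] += (c.x >  c.w);
        outside[2] += (c.y < -c.w);
        outside[3] += (c.y >  c.w);
        outside[4] += (c.z < -c.w);
        outside[5] += (c.z >  c.w);
    }
    // Cull only when all 8 corners are beyond one and the same plane.
    for (int count : outside) {
        if (count == 8) return true;
    }
    return false;
}
```

The test is deliberately conservative: a box near a corner of the frustum can pass even though nothing in it is visible, but a visible object is never rejected, and the GPU's own per-primitive cull tidies up the rest.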
For very geometrically complex objects that cover a large amount of screen space it can be useful to break the object up into smaller pieces, each with its own bounding box, allowing some sections of the object to be rejected if the current camera position would benefit from it.
The images above show one of our old Mali tech demos, an interactive fly-through of an indoor science fiction space station environment. The final 3D render is shown on the left, and a content debug view, in which the bounding boxes of the various objects in the scene are highlighted in blue, is shown on the right.
This type of bounding box scheme can be taken further, and turned into a more complete scene data structure for the world being rendered. Bounding boxes could be constructed for each building in a world, and for each room in each building, for example. If a building is off-screen then it can be rejected quickly, based on a single bounding box check, instead of needing hundreds of such checks for all of the individual objects which that building contains. In this hierarchy the rooms are only tested if their parent building is visible, and renderable objects are only tested if their parent room is visible. This type of hierarchy scheme doesn't really change the workload which gets sent to the GPU for rendering, but can really help make the CPU performance overhead of all of the checks much more manageable.
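The hierarchy walk described above can be sketched as a simple recursive traversal. This is a minimal illustration, assuming C++17 and a hypothetical `SceneNode` type in which each node carries an identifier for its bounding box; the `isVisible` callback stands in for whatever box-versus-frustum test the application uses. The function returns the number of box tests performed so the CPU saving can be observed.

```cpp
#include <functional>
#include <vector>

// Hypothetical scene node: a building, a room, or a renderable object.
struct SceneNode {
    int id;                           // handle for this node's bounding box
    std::vector<SceneNode> children;  // empty for renderable leaf objects
};

// Appends the ids of visible leaf objects to visibleLeaves and returns how
// many bounding box tests were needed. Children are only tested when their
// parent's box is visible.
int collectVisible(const SceneNode& node,
                   const std::function<bool(int)>& isVisible,
                   std::vector<int>& visibleLeaves) {
    int tests = 1;  // the test for this node's own box
    if (!isVisible(node.id)) {
        return tests;  // the whole subtree is rejected with a single check
    }
    if (node.children.empty()) {
        visibleLeaves.push_back(node.id);
        return tests;
    }
    for (const SceneNode& child : node.children) {
        tests += collectVisible(child, isVisible, visibleLeaves);
    }
    return tests;
}
```

With a world containing two buildings, one of which is off-screen, the off-screen building costs exactly one test regardless of how many rooms and objects it contains.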
In many game worlds simple bounding box checks against the viewing frustum will remove a lot of redundant work, but still leave a significant amount present. This is especially common in worlds consisting of interconnected rooms, as from many camera angles the view of the spatially adjacent rooms will be entirely blocked by a wall, floor, or ceiling, but be close enough to be inside the viewing frustum (and so pass a simple bounding box cull).
The bounding box scheme can therefore be supplemented with pre-calculated visibility knowledge, allowing for more aggressive culling of objects in the scene. For example in the scene consisting of three rooms shown below, there is no way that any object inside Room C can be seen by the player standing in Room A, so the application can simply skip issuing OpenGL ES calls for all objects inside Room C until the player moves into Room B.
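One simple way to store this kind of pre-calculated visibility is a lookup table indexed by the room containing the camera. The sketch below assumes a hypothetical floor plan in which Rooms A, B, and C are connected in a chain; only the "Room C cannot be seen from Room A" entry comes from the example above, and the rest of the table, along with the names used, is an assumption for illustration.

```cpp
constexpr int kRoomCount = 3;
enum Room { RoomA = 0, RoomB = 1, RoomC = 2 };

// Row: the room the camera is in. Column: the room being considered.
// In a real game this table would be generated offline by the level tools.
constexpr bool kVisibleFrom[kRoomCount][kRoomCount] = {
    //              A      B      C
    /* from A */ { true,  true,  false },
    /* from B */ { true,  true,  true  },
    /* from C */ { false, true,  true  },
};

// Returns true if objects in objectRoom might be visible from cameraRoom and
// so should go on to the cheaper per-object bounding box test.
bool shouldDrawRoom(Room cameraRoom, Room objectRoom) {
    return kVisibleFrom[cameraRoom][objectRoom];
}
```

The lookup costs a single memory read per room, so it is worth performing before any per-object bounding box checks.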
This type of visibility culling is often factored into game designs by the level designers; games can achieve higher visual quality and frame rates if the level design keeps a consistently small number of rooms visible at any point in time. For this reason many games using indoor settings make heavy use of S and U shaped rooms and corridors as they guarantee no line of sight through that room if the doors are placed appropriately.
This scheme can be taken further, allowing us to cull even Room B in our test floor plan in some cases, by testing the coordinates of the portals - doors, windows, etc. - linking the current room and the adjacent rooms against the frustum. If no portal linking Room A and Room B is visible from the current camera angle, then we can also discard the rendering of Room B.
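A portal check of this kind might look something like the sketch below. It assumes each portal is a quad whose four corners have already been transformed into clip space by the application, and that a room is worth drawing if any portal leading into it survives the frustum test. The `Portal` structure and function names are hypothetical.

```cpp
#include <vector>

struct ClipPoint { float x, y, z, w; };

// Hypothetical portal: a door or window quad leading into targetRoom.
struct Portal {
    int targetRoom;
    ClipPoint corners[4];  // already transformed into clip space
};

// A portal is rejected only when all four corners are outside the same
// clip plane; anything else is treated as potentially visible.
bool isPortalVisible(const Portal& portal) {
    int outside[6] = {};
    for (const ClipPoint& c : portal.corners) {
        outside[0] += (c.x < -c.w);
        outside[1] += (c.x >  c.w);
        outside[2] += (c.y < -c.w);
        outside[3] += (c.y >  c.w);
        outside[4] += (c.z < -c.w);
        outside[5] += (c.z >  c.w);
    }
    for (int count : outside) {
        if (count == 4) return false;
    }
    return true;
}

// portals holds the portals of the room the camera is currently in.
bool isAdjacentRoomVisible(const std::vector<Portal>& portals, int room) {
    for (const Portal& portal : portals) {
        if (portal.targetRoom == room && isPortalVisible(portal)) {
            return true;
        }
    }
    return false;
}
```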
These types of broad-brush culling checks are very effective at reducing GPU workload, and are impossible for the GPU or GPU drivers to perform automatically - we simply don't have this level of knowledge of the scene being rendered - so it is critical that the application performs this type of early culling.
It should go without saying that in addition to not sending off-screen geometry to the GPU, the application should ensure that the render state for the objects which are visible is set efficiently. For culling purposes this means enabling back-face culling for opaque objects, allowing the GPU to discard the triangles facing away from the camera as early as possible.
OpenGL ES provides a depth buffer which allows the application to send in geometry in any order, and the depth-test ensures that the correct objects end up in the final render. While throwing geometry at the GPU in any order is functional, it is measurably more efficient if the application draws objects using a front-to-back order, as this maximizes the effectiveness of the early depth and stencil test unit (see The Mali GPU: An Abstract Machine, Part 3 - The Midgard Shader Core for more information on early depth and stencil testing). If you render objects using a back-to-front order then there is a good chance that the GPU will have spent some cycles rendering some fragments, only to later overdraw them with a fragment which is closer to the camera, which is a waste of precious GPU cycles!
It is not a requirement that triangles are sorted perfectly, which would be very expensive in terms of CPU cycles; we are just aiming to get it mostly right. Performing an object-based sort using the bounding boxes or even just using the object origin coordinate in world space is often sufficient here; anywhere where we get triangles slightly out of order will be tidied up by the full depth test in the GPU.
Remember that blended triangles need to be rendered back-to-front in order to get the correct blend results, so it is recommended that all opaque geometry is rendered first in a front-to-back order, and then blended triangles are drawn last.
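The draw ordering described above can be sketched as a partition followed by two sorts. The `DrawItem` structure is a hypothetical stand-in for an application's own draw list entry, and `viewDepth` is assumed to be the distance from the camera to the object origin or bounding box centre, which is the approximate per-object sort discussed earlier rather than a per-triangle one.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical draw list entry.
struct DrawItem {
    int   id;         // application handle for the object
    float viewDepth;  // approximate distance from the camera
    bool  blended;    // true if the object needs alpha blending
};

// Reorders items so that opaque objects come first, sorted front-to-back,
// followed by blended objects sorted back-to-front.
void sortForSubmission(std::vector<DrawItem>& items) {
    auto firstBlended = std::stable_partition(
        items.begin(), items.end(),
        [](const DrawItem& d) { return !d.blended; });

    // Opaque: nearest first, to maximize early depth test rejection.
    std::sort(items.begin(), firstBlended,
              [](const DrawItem& a, const DrawItem& b) {
                  return a.viewDepth < b.viewDepth;
              });

    // Blended: farthest first, so that blending composes correctly.
    std::sort(firstBlended, items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.viewDepth > b.viewDepth;
              });
}
```

The application would then walk the sorted list in order, issuing its OpenGL ES state changes and draw calls for each entry.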
OpenGL ES uses a client-server memory model; client-side memory represents resources owned by the application and driver, and server-side memory represents resources owned by the GPU hardware. Transfer of resources from application to server-side is generally expensive:
For reasons dating back to the early OpenGL implementations - namely that geometry processing was performed on the CPU, and did not use the GPU hardware at all - OpenGL and OpenGL ES have a number of APIs which allow client-side buffers for geometry to be passed into the API for every draw operation.
Using client-side buffers specified this way is very inefficient. In most cases the models used each frame do not change, so this simply forces the drivers to perform a huge amount of work allocating memory and transferring the data to the graphics server for no benefit. As a much more efficient alternative OpenGL ES allows the application to upload data for both vertex attribute and index information to server-side buffer objects, which can typically be done at level load time. The per-frame data traffic for each draw operation when using buffer objects is just a set of handles telling the GPU which of these buffer objects to use, which for obvious reasons is much more efficient.
The one exception to this rule is the use of Uniform Buffer Objects (UBOs), which are server-side storage for per-draw-call constants used by the shader programs. As uniform values are shared by every vertex and fragment thread in a draw-call it is important that they can be accessed by the shader core as efficiently as possible, so the device drivers will generally aggressively optimize how they are packaged in memory to maximize hardware access efficiency. It is preferable that small volumes of uniform data per draw call should be set directly via the glUniform<x>() family of functions, instead of using server-side UBOs, as this gives the driver far more control over how the uniform data is passed to the GPU. Uniform Buffer Objects should still be used for large uniform arrays, such as long arrays of matrices used for skeletal animation in a vertex shader.
OpenGL ES is a state-based API with a large number of state settings which can be configured for each drawing operation. In most non-trivial scenes there will be multiple render states in use, so the application must typically perform a number of state change operations to set up the configuration for each draw operation before the draw itself can be issued.
There are two useful goals to bear in mind when trying to get the best performance out of the GPU and minimizing CPU overhead of the drivers:
One of the most common forms of application optimization to improve both of these areas is draw call batching, where multiple objects using the same render state are pre-packaged into the same data buffers and as such can be rendered using a single draw operation. This reduces CPU load as we have fewer state changes and draw operations to package for the GPU, and gives the GPU bigger batches of work to process. My colleague stacysmith has an entire blog dedicated to effective batching here: Game Set and Batch. There is no hard-and-fast rule on how many draw calls a single frame should contain, as a lot depends on the system hardware capability and the desired frame rate, but in general we would recommend no more than a few hundred draw calls per frame.
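As a minimal sketch of the batching idea, the code below merges draw requests that share a render state and occupy adjacent ranges of a shared index buffer, so each merged range can be issued as one draw operation. It assumes the application has already packed the geometry for same-state objects into the same vertex and index buffers, and that `stateKey` is a hypothetical application-defined value identifying a unique combination of shader, textures, and render state. Per-object transforms are ignored here, so this fits static geometry best.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical draw request; also used to describe a merged batch.
struct Draw {
    std::uint32_t stateKey;  // identifies shader + textures + render state
    int firstIndex;          // offset into the shared index buffer
    int indexCount;          // number of indices to draw
};

// Sorts draws by state and merges neighbours that use the same state and
// contiguous index ranges. Each returned entry needs one draw call.
std::vector<Draw> buildBatches(std::vector<Draw> draws) {
    std::sort(draws.begin(), draws.end(),
              [](const Draw& a, const Draw& b) {
                  if (a.stateKey != b.stateKey) return a.stateKey < b.stateKey;
                  return a.firstIndex < b.firstIndex;
              });

    std::vector<Draw> batches;
    for (const Draw& d : draws) {
        if (!batches.empty() &&
            batches.back().stateKey == d.stateKey &&
            batches.back().firstIndex + batches.back().indexCount == d.firstIndex) {
            batches.back().indexCount += d.indexCount;  // extend current batch
        } else {
            batches.push_back(d);                       // start a new batch
        }
    }
    return batches;
}
```

Sorting purely by state like this discards any front-to-back ordering, so in practice the sort key is a trade-off between state changes and depth order.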
It should also be noted that there is sometimes a conflict between getting the best batching and removing the most redundant work via depth sorting and culling. Provided that the draw call count remains sensible and your system is not CPU limited, it is generally better to remove GPU workload via improved culling and front-to-back object render order.
As discussed in one of my previous blogs, modern graphics APIs maintain an illusion of synchronous execution but are really deeply pipelined to maximize performance. The application must avoid using API calls which break this rendering pipeline or performance will rapidly drop as the CPU blocks waiting for the GPU to finish, and the GPU goes idle waiting for the CPU to give it more work to do. See Mali Performance 1: Checking the Pipeline for more information on this topic - it was important enough to warrant an entire blog in its own right!
This blog has looked at some of the critical application-side optimizations and behaviours which must be considered in order to achieve a successful high performance 3D render using Mali. In summary the key things to remember are:
Tune in next time and I'll start looking at some of the more technical aspects of using the OpenGL ES API itself, and hopefully even manage to include some code samples!
TTFN,
Pete
Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.
This is precisely what happens when you let Canadians on the Internet..
(referring, of course, to myself)
Ah! I see, problems can arise when dealing with multiple framebuffers (if I'm interpreting this rabid document skimming correctly), which makes sense! I very much look forward to your upcoming blog post!
Like all good questions worth asking, "it depends". Won't answer here as it's not a quick answer, and I'm planning on looking at this in one of my upcoming blogs (you may have just promoted it to being the next one I do), so I will get you an answer soon.
If you want to Google in the meantime the keywords are "Texture Ghosting".
This is a very welcomed read. Thanks!
I'm curious: Is there a driver penalty for modifying an uncompressed texture between draw calls? Does client application have direct access to the texture outside of the driver (pointer to texture colour data), or must there be an upload/validation based on each modification?