Previous blog in the series: Mali Performance 1: Checking the Pipeline
This week I take a slight diversion from the hardware-centric view of the rendering pipeline we have been exploring so far to look at how, and more importantly when, the Mali driver stack turns OpenGL ES API activity into the hardware workloads needed for rendering. As we will see, OpenGL ES is not particularly tightly specified around this area, so there are some common pitfalls which developers must be careful to avoid.
As described in my previous blogs, Mali's hardware engine operates on a two-pass rendering model, rendering all of the geometry for a render target to completion before starting any of the fragment processing. This allows us to keep most of our working state in local memory tightly coupled to the GPU, and minimize the amount of power-hungry external DRAM accesses which are needed for the rendering process.
When OpenGL ES is used well we can create, use, and then discard most of our framebuffer data inside this local memory. This avoids the need to read framebuffers from, or write framebuffers to, external memory at all, except for the buffers we want to keep such as the color buffer. However this isn't guaranteed behavior and some patterns of API usage can trigger inefficient behavior which forces the GPU to make extra reads and writes.
In OpenGL ES there are two types of render target: on-screen window render targets, and off-screen framebuffer object (FBO) render targets.
Conceptually these are very similar in OpenGL ES, although not entirely identical. Only one render target can be active at the API level for rendering at any point in time; the current render target is selected via a call to glBindFramebuffer( fbo_id ), where an ID of 0 can be used to switch back to the window render target (also often called the default FBO).
On-screen render targets are tightly defined by EGL. The boundary between one frame and the next is clearly demarcated: all rendering to FBO 0 between two calls to eglSwapBuffers() defines the rendering for one frame.
In addition the color, depth, and stencil buffers in use are defined when the context is created, and their configuration is immutable. By default the values of the color, depth, and stencil buffers immediately after eglSwapBuffers() are undefined - the old values are not preserved from the previous frame - which lets the GPU driver make safe assumptions about the use of the buffers. In particular we know that depth and stencil are only transient working data, and we never need to write them back to memory.
Off-screen render targets are less tightly defined.
Firstly, there is no equivalent of eglSwapBuffers() which tells the driver that the application has finished rendering to an FBO and it can be submitted for rendering; the flush of the rendering work is inferred from other API activities. We'll look more closely at the inferences the Mali drivers support in the next section.
Secondly, there are no guarantees about what the application will do with the buffers attached to the color, depth, and stencil attachment points. An application may use any of these as textures, or reattach them to a different FBO, for example reloading the depth value from a previous render target as the starting depth value for a different render target. By default the behavior of OpenGL ES is to preserve all attachments, unless explicitly discarded by the application via a call to glInvalidateFramebuffer(). Note: this is a new entry point in OpenGL ES 3.0; in OpenGL ES 2.0 you can access the equivalent functionality via the glDiscardFramebufferEXT() extension entry point which all Mali drivers support.
In normal circumstances Mali flushes rendering work when a render target is "unbound", except for the main window surface which is flushed when the driver sees a call to eglSwapBuffers().
To avoid a performance drop developers need to avoid unneeded flushes which contain only a subset of the final rendering, so it is recommended that you bind each off-screen FBO once per frame and render it to completion in one go.
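One way to catch violations of this rule during development is to count binds per frame. The following is a minimal sketch of a debug-only helper, not part of OpenGL ES or the Mali driver: the names FboBindTracker, tracker_begin_frame, tracker_note_bind, and the MAX_TRACKED_FBOS limit are all hypothetical. It assumes the application calls tracker_note_bind() next to every glBindFramebuffer() call and tracker_begin_frame() after every eglSwapBuffers().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_TRACKED_FBOS 64u  /* hypothetical limit chosen for this sketch */

/* Hypothetical debug helper; zero-initialize it before first use. */
typedef struct {
    unsigned current;                       /* FBO ID currently bound    */
    unsigned bind_count[MAX_TRACKED_FBOS];  /* binds seen in this frame  */
} FboBindTracker;

/* Reset the per-frame counters; call once after eglSwapBuffers(). */
void tracker_begin_frame(FboBindTracker *t)
{
    memset(t->bind_count, 0, sizeof t->bind_count);
}

/*
 * Record a bind of fbo_id. Returns true when an off-screen FBO is bound
 * for a second time in the same frame after having been unbound, which
 * is the pattern that is likely to split its rendering into two passes.
 */
bool tracker_note_bind(FboBindTracker *t, unsigned fbo_id)
{
    assert(fbo_id < MAX_TRACKED_FBOS);

    if (fbo_id == t->current) {
        return false;            /* redundant bind, nothing was unbound */
    }
    t->current = fbo_id;

    if (fbo_id == 0) {
        return false;            /* window surface flushes at swap time */
    }
    t->bind_count[fbo_id]++;
    return t->bind_count[fbo_id] > 1;
}
```

A true return value is only a hint that the frame structure deserves a second look; the helper knows nothing about what the driver actually does.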
A well-structured rendering sequence (almost anyway - read the next section to see why this is incomplete) would look like:
#define ALL_BUFFERS COLOR_BUFFER_BIT | DEPTH_BUFFER_BIT | STENCIL_BUFFER_BIT

glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 0 (window surface)

glBindFramebuffer( 1 )    // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 1
...                       // Draw FBO 1 to completion

glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 1 for rendering
glDraw...( ... )          // Draw something else to FBO 0 (window surface)

glBindFramebuffer( 2 )    // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 2
...                       // Draw FBO 2 to completion

glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 2 for rendering
glDraw...( ... )          // Draw something else to FBO 0 (window surface)

eglSwapBuffers()          // Tell EGL we have finished, flush FBO 0 for rendering
By contrast the "bad" behavior would look like:
#define ALL_BUFFERS COLOR_BUFFER_BIT | DEPTH_BUFFER_BIT | STENCIL_BUFFER_BIT

glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 0 (window surface)

glBindFramebuffer( 1 )    // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 1

glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 1 for rendering
glDraw...( ... )          // Draw something else to FBO 0 (window surface)

glBindFramebuffer( 1 )    // Rebind away from FBO 0, does not trigger rendering of FBO 0
                          // However, rebinding FBO 1 requires us to reload old render
                          // state from memory, and write over the top of it
glDraw...( ... )          // Draw something to FBO 1

glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 1 (again)
glDraw...( ... )          // Draw something else to FBO 0 (window surface)

eglSwapBuffers()          // Tell EGL we have finished, flush FBO 0 for rendering
This type of behavior is known as an incremental render, and it forces the driver to process the render target twice. The first processing pass has to write all of the intermediate render state out to memory (color, depth, and stencil), and the second pass has to read it back in from memory again so it can "append" more rendering on top of the old state.
As shown in the diagram above, incremental rendering has a +400% framebuffer bandwidth penalty [assuming 32-bpp color and D24S8 packed depth-stencil] when compared against a well-structured single pass render, which avoids the need to write and then re-read the intermediate state to and from main memory.
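The +400% figure can be reproduced with some simple arithmetic. The sketch below is one accounting that arrives at the same number, under assumptions of my own choosing: 4 bytes per pixel for color, 4 bytes per pixel for packed depth-stencil, a single-pass render that writes only the final color, and an incremental render that spills color plus depth-stencil, reloads both, and then writes the final color. The function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per pixel, matching the 32-bpp color and D24S8 assumptions. */
enum { COLOR_BPP = 4, DEPTH_STENCIL_BPP = 4 };

/* Well-structured render: only the final color buffer is written out. */
uint64_t single_pass_bytes(uint64_t pixels)
{
    return pixels * COLOR_BPP;
}

/*
 * Incremental render: the first pass spills color and depth-stencil,
 * the second pass reloads both and then writes the final color.
 */
uint64_t incremental_bytes(uint64_t pixels)
{
    uint64_t spill       = pixels * (COLOR_BPP + DEPTH_STENCIL_BPP);
    uint64_t reload      = spill;
    uint64_t final_write = pixels * COLOR_BPP;
    return spill + reload + final_write;
}

/* Extra framebuffer traffic of the incremental case, as a percentage. */
unsigned penalty_percent(uint64_t pixels)
{
    assert(pixels > 0);
    uint64_t base = single_pass_bytes(pixels);
    return (unsigned)((incremental_bytes(pixels) - base) * 100u / base);
}
```

Per pixel this is 4 bytes for the single pass against 8 + 8 + 4 = 20 bytes for the incremental case, so the extra traffic is 16 bytes, or four times the baseline, regardless of resolution.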
The observant reader will have noted that I inserted some calls to glClear() into the rendering sequence for our framebuffers. The application should always call glClear() for every attachment at the start of each render target's rendering sequence, provided that the previous contents of the attachments are not needed, of course. This explicitly tells the driver we do not need the previous state, so we avoid reading it back from memory, and it also puts any undefined buffer contents into a defined "clear color" state.
One common mistake which is seen here is only clearing part of the framebuffer; i.e. calling glClear() while only a portion of the render target is active because of a scissor rectangle with only partial screen coverage. We can only completely drop the render state when it applies to whole surfaces, so a clear of the whole render target should be performed where possible.
The final requirement placed on the application for efficient use of FBOs in the OpenGL ES API is that it should tell the driver which of the color / depth / stencil attachments are simply transient working buffers, the value of which can be discarded at the end of rendering the current render pass. For example, nearly every 3D render will use color and depth, but for most applications the depth buffer is transient and can be safely invalidated. Failure to invalidate the unneeded buffers may result in them being written back to memory, wasting memory bandwidth and increasing energy consumption of the rendering process.
The most common mistake at this point is to treat glInvalidateFramebuffer() as equivalent to glClear() and place the invalidate call for frame N state at the first use of that FBO in frame N+1. This is too late! The purpose of the invalidate call is to tell the driver that the buffers do not need to be kept, so we need to modify the work submission to the GPU for the frame which produces those buffers. Telling us in the next frame is often after the original frame has been processed. The application needs to ensure that the driver knows which buffers are transient before the framebuffer is flushed. Therefore transient buffers in frame N should be indicated by calling glInvalidateFramebuffer() before unbinding the FBO in frame N. For example:
#define ALL_BUFFERS COLOR_BUFFER_BIT | DEPTH_BUFFER_BIT | STENCIL_BUFFER_BIT
static const GLenum invalid_ap[2] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };

glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 0 (window surface)

glBindFramebuffer( 1 )    // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 1
...                       // Draw FBO 1 to completion
glInvalidateFramebuffer( GL_FRAMEBUFFER, 2, &invalid_ap[0] ); // Only keep color

glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 1 for rendering
glDraw...( ... )          // Draw something else to FBO 0 (window surface)

glBindFramebuffer( 2 )    // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 2
...                       // Draw FBO 2 to completion
glInvalidateFramebuffer( GL_FRAMEBUFFER, 2, &invalid_ap[0] ); // Only keep color

glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 2 for rendering
glDraw...( ... )          // Draw something else to FBO 0 (window surface)

eglSwapBuffers()          // Tell EGL we have finished, flush FBO 0 for rendering
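In this blog we've looked at how the Mali drivers handle the identification of render passes, the common points of inefficiency, and how an application developer can drive the OpenGL ES API to avoid them. In summary we recommend: binding each off-screen FBO once per frame and rendering it to completion in one go; calling glClear() for every attachment at the start of each render target's rendering sequence, when the previous contents are not needed; and calling glInvalidateFramebuffer() (or glDiscardFramebufferEXT() in OpenGL ES 2.0) for transient attachments before unbinding the FBO.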
Next time I'll look at a related topic to this one – the efficient use of EGL_BUFFER_PRESERVED for maintaining window surface color from one frame as the default input for the next frame, and the implications that has for performance and bandwidth.
Cheers,
Pete
The next blog in the series asks the question: is EGL_BUFFER_PRESERVED a good thing? Read it here: https://community.arm.com/graphics/b/blog/posts/mali-performance-3-is-egl_5f00_buffer_5f00_preserved-a-good-thing
Hi Peter,
Is there a difference on your architecture and driver between glClear and glInvalidateFramebuffer for the case of flushing tiles to graphics memory?
If yes, what's the nature of that difference?
I am tempted to use glInvalidateFramebuffer on colorbuffer and glClear on depth/stencil buffer only because I only need to write values for depth/stencil so I think glInvalidateFramebuffer might be more effective.
Additionally, glInvalidateSubFramebuffer is a more powerful API because, for some algorithms, it would allow me to read back only a subset of the full framebuffer. Using scissor and glClear doesn't work as it would kill the fragments. It's actually pretty unfortunate because what we would typically want to do is actually glValidateSubFramebuffer, but I guess it's nothing that 4 calls to glInvalidateSubFramebuffer wouldn't work around.
Thanks,
Christophe
Generally no; only things which draw are going to cause problems.
Hi, does glClear have to be the very first command right after glBindFramebuffer? Would the optimization fail if there are some non-drawing GL commands such as glClearColor, glViewport, glScissor between the glBindFramebuffer call and glClear?
The application can't; it still renders to one FBO from the point of view of the API behavior. However, due to the deep asynchronous pipeline that tile-based GPUs use, the hardware may process data from two different render passes at the same time. See the previous blog (link below) for more details.
Mali Performance 1: Checking the Pipeline
HTH, Pete
Hi Peter, you have mentioned above that we can use two FBOs (one for vertex shading, and one for fragment shading) at the same time. How do we do that? My understanding was that we can bind/unbind an FBO using glBindFramebuffer() and only one FBO will be used at a time for the subsequent draw call. Is there a gap in my understanding?
regards,
jitender