Previous blog in the series: Mali Performance 1: Checking the Pipeline
This week I take a slight diversion from the hardware-centric view of the rendering pipeline we have been exploring so far to look at how, and more importantly when, the Mali driver stack turns OpenGL ES API activity into the hardware workloads needed for rendering. As we will see, OpenGL ES is not particularly tightly specified around this area, so there are some common pitfalls which developers must be careful to avoid.
As described in my previous blogs, Mali's hardware engine operates on a two-pass rendering model, rendering all of the geometry for a render target to completion before starting any of the fragment processing. This allows us to keep most of our working state in local memory tightly coupled to the GPU, and minimize the amount of power-hungry external DRAM accesses which are needed for the rendering process.
When OpenGL ES is used well we can create, use, and then discard most of our framebuffer data inside this local memory. This avoids the need to read framebuffers from, or write framebuffers to, external memory at all, except for the buffers we want to keep such as the color buffer. However this isn't guaranteed behavior and some patterns of API usage can trigger inefficient behavior which forces the GPU to make extra reads and writes.
In OpenGL ES there are two types of render target: on-screen window render targets and off-screen framebuffer object (FBO) render targets.
Conceptually these are very similar in OpenGL ES, although not entirely identical. Only one render target can be active at the API level for rendering at any point in time; the current render target is selected via a call to glBindFramebuffer( fbo_id ), where an ID of 0 can be used to switch back to the window render target (also often called the default FBO).
On-screen render targets are tightly defined by EGL. The boundary between one frame and the next is clearly demarcated; all rendering to FBO 0 between two calls to eglSwapBuffers() defines the rendering for one frame.
In addition the color, depth, and stencil buffers in use are defined when the context is created, and their configuration is immutable. By default the value of the color, depth, and stencil immediately after eglSwapBuffers() is undefined - the old value is not preserved from the previous frame - allowing the GPU driver to make guaranteed assumptions about the use of the buffers. In particular we know that depth and stencil are only transient working data, and we never need to write them back to memory.
Off-screen render targets are less tightly defined.
Firstly, there is no equivalent of eglSwapBuffers() which tells the driver that the application has finished rendering to an FBO and it can be submitted for rendering; the flush of the rendering work is inferred from other API activities. We'll look more closely at the inferences the Mali drivers support in the next section.
Secondly, there are no guarantees about what the application will do with the buffers attached to the color, depth, and stencil attachment points. An application may use any of these as textures, or reattach them to a different FBO, for example reloading the depth value from a previous render target as the starting depth value for a different render target. By default the behavior of OpenGL ES is to preserve all attachments, unless explicitly discarded by the application via a call to glInvalidateFramebuffer(). Note: this is a new entry point in OpenGL ES 3.0; in OpenGL ES 2.0 you can access the equivalent functionality via the glDiscardFramebufferEXT() extension entry point, which all Mali drivers support.
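Choosing between the two entry points at run time might look like the sketch below. It is a minimal illustration rather than driver or vendor code: `discard_path`, `has_extension()` and `pick_discard_path()` are hypothetical names, and the major version and extension string are assumed to have been queried by the application already (for example via `glGetString()`). `GL_EXT_discard_framebuffer` is the extension that provides the OpenGL ES 2.0 entry point.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: which discard entry point the app should call. */
typedef enum {
    DISCARD_NONE,        /* no way to mark attachments as transient       */
    DISCARD_EXT,         /* glDiscardFramebufferEXT (OpenGL ES 2.0 + ext) */
    DISCARD_INVALIDATE   /* glInvalidateFramebuffer (OpenGL ES 3.0 core)  */
} discard_path;

/* Return true if 'name' appears as a whole space-separated token in 'list'. */
static bool has_extension(const char *list, const char *name)
{
    size_t len = strlen(name);
    if (list == NULL || len == 0) return false;
    for (const char *p = strstr(list, name); p != NULL; p = strstr(p + 1, name)) {
        bool starts = (p == list) || (p[-1] == ' ');
        bool ends = (p[len] == ' ') || (p[len] == '\0');
        if (starts && ends) return true;
    }
    return false;
}

/* Prefer the core ES 3.0 call, fall back to the extension on ES 2.0. */
static discard_path pick_discard_path(int es_major, const char *extensions)
{
    if (es_major >= 3) return DISCARD_INVALIDATE;
    if (has_extension(extensions, "GL_EXT_discard_framebuffer")) return DISCARD_EXT;
    return DISCARD_NONE;
}
```

Matching whole tokens rather than substrings avoids a false positive when a longer extension name merely starts with the one being searched for.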
In normal circumstances Mali flushes rendering work when a render target is "unbound", except for the main window surface which is flushed when the driver sees a call to eglSwapBuffers().
To avoid a performance drop, developers need to avoid unneeded flushes which contain only a subset of the final rendering, so it is recommended that you bind each off-screen FBO once per frame and render it to completion in one go.
A well-structured rendering sequence (almost anyway - read the next section to see why this is incomplete) would look like:
    #define ALL_BUFFERS COLOR_BUFFER_BIT | DEPTH_BUFFER_BIT | STENCIL_BUFFER_BIT

    glClear( ALL_BUFFERS )    // Clear initial state
    glDraw...( ... )          // Draw something to FBO 0 (window surface)

    glBindFramebuffer( 1 )    // Switch away from FBO 0, does not trigger rendering
    glClear( ALL_BUFFERS )    // Clear initial state
    glDraw...( ... )          // Draw something to FBO 1
    ...                       // Draw FBO 1 to completion

    glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 1 for rendering
    glDraw...( ... )          // Draw something else to FBO 0 (window surface)

    glBindFramebuffer( 2 )    // Switch away from FBO 0, does not trigger rendering
    glClear( ALL_BUFFERS )    // Clear initial state
    glDraw...( ... )          // Draw something to FBO 2
    ...                       // Draw FBO 2 to completion

    glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 2 for rendering
    glDraw...( ... )          // Draw something else to FBO 0 (window surface)

    eglSwapBuffers()          // Tell EGL we have finished, flush FBO 0 for rendering
By contrast the "bad" behaviour would look like:
    #define ALL_BUFFERS COLOR_BUFFER_BIT | DEPTH_BUFFER_BIT | STENCIL_BUFFER_BIT

    glClear( ALL_BUFFERS )    // Clear initial state
    glDraw...( ... )          // Draw something to FBO 0 (window surface)

    glBindFramebuffer( 1 )    // Switch away from FBO 0, does not trigger rendering
    glClear( ALL_BUFFERS )    // Clear initial state
    glDraw...( ... )          // Draw something to FBO 1

    glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 1 for rendering
    glDraw...( ... )          // Draw something else to FBO 0 (window surface)

    glBindFramebuffer( 1 )    // Rebind away from FBO 0, does not trigger rendering of FBO
                              // However, rebinding FBO 1 requires us to reload old render
                              // state from memory, and write over the top of it
    glDraw...( ... )          // Draw something to FBO 1

    glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 1 (again)
    glDraw...( ... )          // Draw something else to FBO 0 (window surface)

    eglSwapBuffers()          // Tell EGL we have finished, flush FBO 0 for rendering
This type of behavior is known as an incremental render, and it forces the driver to process the render target twice. The first processing pass must write all of the intermediate render state out to memory (color, depth, and stencil), and the second pass must read it back in from memory again so it can "append" more rendering on top of the old state.
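One way to catch this pattern during development is to log the FBO ID passed to each bind call in a frame and check whether any off-screen FBO is bound more than once. The helper below is a hypothetical application-side sketch, not part of OpenGL ES or the Mali driver; `count_rebound_fbos()` and the `MAX_FBO_ID` limit are assumptions made for illustration, as is the choice to treat a bind of the already-bound FBO as no switch at all.

```c
#include <stddef.h>

#define MAX_FBO_ID 64  /* assumed upper bound on FBO IDs, for illustration only */

/*
 * Given the sequence of FBO IDs bound during one frame, return how many
 * off-screen FBOs (ID != 0) were bound more than once. Each such FBO is a
 * candidate for an incremental render. IDs >= MAX_FBO_ID are ignored.
 */
static int count_rebound_fbos(const unsigned *binds, size_t n)
{
    unsigned char times_bound[MAX_FBO_ID] = {0};
    unsigned current = 0;   /* assume the frame starts on the window surface */
    int rebound = 0;

    for (size_t i = 0; i < n; ++i) {
        unsigned id = binds[i];
        if (id == current) continue;            /* redundant bind, no switch */
        current = id;
        if (id == 0 || id >= MAX_FBO_ID) continue;
        /* Count each offending FBO once, on its second distinct bind. */
        if (times_bound[id] < 2 && ++times_bound[id] == 2) ++rebound;
    }
    return rebound;
}
```

Fed the bind sequences from the two listings above, the well-structured frame binds 1, 0, 2, 0 and reports no rebound FBOs, while the "bad" frame binds 1, 0, 1, 0 and reports one.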
As the diagram above shows, incremental rendering carries a +400% framebuffer bandwidth penalty (assuming 32-bpp color and D24S8 packed depth-stencil) compared with a well-structured single-pass render, which avoids the need to write and then re-read the intermediate state to and from main memory.
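One accounting that arrives at the same figure is sketched below. The per-pass breakdown is an assumption rather than something stated in the text: the single-pass render is taken to write only the final color, while the incremental render writes color and depth-stencil after the first pass, reads both back at the start of the second, and writes the final color at the end.

```c
/* Bytes per pixel, taken from the formats named in the text. */
enum {
    COLOR_BYTES = 4,          /* 32-bpp color               */
    DEPTH_STENCIL_BYTES = 4   /* D24S8 packed depth-stencil */
};

/* Well-structured render: only the final color is written out. */
static int single_pass_bytes(void)
{
    return COLOR_BYTES;
}

/* Incremental render split into two passes (assumed breakdown). */
static int incremental_bytes(void)
{
    int write_intermediate = COLOR_BYTES + DEPTH_STENCIL_BYTES; /* end of pass 1   */
    int read_intermediate = COLOR_BYTES + DEPTH_STENCIL_BYTES;  /* start of pass 2 */
    int write_final = COLOR_BYTES;                              /* end of pass 2   */
    return write_intermediate + read_intermediate + write_final;
}

/* Extra framebuffer traffic as a percentage of the single-pass cost. */
static int penalty_percent(void)
{
    return (incremental_bytes() - single_pass_bytes()) * 100 / single_pass_bytes();
}
```

Under these assumptions the incremental render moves 20 bytes per pixel against 4 for the single pass, which is five times the traffic, or a +400% penalty.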
The observant reader will have noted that I inserted some calls to glClear() into the rendering sequence for our framebuffers. The application should always call glClear() for every attachment at the start of each render target's rendering sequence, provided that the previous contents of the attachments are not needed, of course. This explicitly tells the driver we do not need the previous state, and thus we avoid reading it back from memory, as well as putting any undefined buffer contents into a defined "clear color" state.
One common mistake which is seen here is only clearing part of the framebuffer; i.e. calling glClear() while only a portion of the render target is active because of a scissor rectangle with only partial screen coverage. We can only completely drop the render state when it applies to whole surfaces, so a clear of the whole render target should be performed where possible.
The final requirement placed on the application for efficient use of FBOs in the OpenGL ES API is that it should tell the driver which of the color / depth / stencil attachments are simply transient working buffers, the value of which can be discarded at the end of rendering the current render pass. For example, nearly every 3D render will use color and depth, but for most applications the depth buffer is transient and can be safely invalidated. Failure to invalidate the unneeded buffers may result in them being written back to memory, wasting memory bandwidth and increasing energy consumption of the rendering process.
The most common mistake at this point is to treat glInvalidateFramebuffer() as equivalent to glClear() and place the invalidate call for frame N state at the first use of that FBO in frame N+1. This is too late! The purpose of the invalidate call is to tell the driver that the buffers do not need to be kept, so we need to modify the work submission to the GPU for the frame which produces those buffers. Telling us in the next frame is often after the original frame has been processed. The application needs to ensure that the driver knows which buffers are transient before the framebuffer is flushed. Therefore transient buffers in frame N should be indicated by calling glInvalidateFramebuffer() before unbinding the FBO in frame N. For example:
    #define ALL_BUFFERS COLOR_BUFFER_BIT | DEPTH_BUFFER_BIT | STENCIL_BUFFER_BIT
    static const GLenum invalid_ap[2] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };

    glClear( ALL_BUFFERS )    // Clear initial state
    glDraw...( ... )          // Draw something to FBO 0 (window surface)

    glBindFramebuffer( 1 )    // Switch away from FBO 0, does not trigger rendering
    glClear( ALL_BUFFERS )    // Clear initial state
    glDraw...( ... )          // Draw something to FBO 1
    ...                       // Draw FBO 1 to completion
    glInvalidateFramebuffer( GL_FRAMEBUFFER, 2, &invalid_ap[0] ); // Only keep color

    glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 1 for rendering
    glDraw...( ... )          // Draw something else to FBO 0 (window surface)

    glBindFramebuffer( 2 )    // Switch away from FBO 0, does not trigger rendering
    glClear( ALL_BUFFERS )    // Clear initial state
    glDraw...( ... )          // Draw something to FBO 2
    ...                       // Draw FBO 2 to completion
    glInvalidateFramebuffer( GL_FRAMEBUFFER, 2, &invalid_ap[0] ); // Only keep color

    glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 2 for rendering
    glDraw...( ... )          // Draw something else to FBO 0 (window surface)

    eglSwapBuffers()          // Tell EGL we have finished, flush FBO 0 for rendering
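The timing rule can be illustrated with a toy model of a single render pass. This is a simplified sketch of the behavior described above, not how the Mali driver is implemented; the `pass_state` struct and the `model_*` functions are hypothetical, and the model assumes that a hint arriving after the unbind always arrives after submission (the text only says this is often the case).

```c
#include <stdbool.h>

typedef struct {
    bool flushed;                 /* pass already submitted to the GPU     */
    bool depth_stencil_transient; /* hint seen by the driver before flush  */
    int  depth_stencil_writes;    /* write-backs to memory (what we count) */
} pass_state;

/* Models glInvalidateFramebuffer() on the depth and stencil attachments. */
static void model_invalidate(pass_state *p)
{
    /* A hint that arrives after submission cannot change that submission. */
    if (!p->flushed) p->depth_stencil_transient = true;
}

/* Models unbinding the FBO, which flushes the pass for rendering. */
static void model_unbind(pass_state *p)
{
    if (p->flushed) return;
    p->flushed = true;
    /* Without the hint, depth and stencil are preserved, so written back. */
    if (!p->depth_stencil_transient) p->depth_stencil_writes += 1;
}
```

Calling `model_invalidate()` before `model_unbind()` leaves the write-back count at zero; calling them in the opposite order, as happens when the invalidate is deferred to the next frame, leaves it at one.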
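In this blog we've looked at how the Mali drivers handle the identification of render passes, the common points of inefficiency, and how an application developer can drive the OpenGL ES API to avoid them. In summary we recommend: binding each off-screen FBO once per frame and rendering it to completion in one go; calling glClear() for every attachment at the start of each render target's rendering sequence, where the previous contents are not needed; and calling glInvalidateFramebuffer() (or glDiscardFramebufferEXT() on OpenGL ES 2.0) on transient attachments before unbinding the FBO.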
Next time I'll look at a related topic to this one – the efficient use of EGL_BUFFER_PRESERVED for maintaining window surface color from one frame as the default input for the next frame, and the implications that has for performance and bandwidth.
Cheers,
Pete
The next blog in the series asks the question: is EGL_BUFFER_PRESERVED a good thing? Read it here: https://community.arm.com/graphics/b/blog/posts/mali-performance-3-is-egl_5f00_buffer_5f00_preserved-a-good-thing
Hi Peter.
Thanks for your prompt reply.
I've re-read your article and now I understand that every time you wrote "flushing FBO" it actually meant "committing the command buffer". Thanks for pointing this out. I'm more used to the resolve/restore terms, but whatever simplifies the conversation will be a perfect fit.
control flush/restore independently
I'm not sure how you reliably get independent control over the decision to writeout and then readback; it's a single bit of state (e.g. is the framebuffer invalid or not?). If the framebuffer is invalid then it will be neither written out nor read back; if it is valid then it will be written out and read back.
I'm not quite sure what exactly you mean. If we are speaking about a potentially possible scenario, this might be resetting the invalid bit once either writeout or readback happens. However, this doesn't match the "invalid content" semantics well enough. I do understand that once the content is marked "invalid" it is considered invalid until it is overwritten with data which is considered "valid".
I'm making sure the engine I'm optimizing works as fast as possible on OpenGL ES. We do have separate control for writeout/readback and it does save us quite a few milliseconds. Just wanted to make sure there's no way to do the same thing in GL ES.
Could you, please, clarify the following: when glInvalidateFramebuffer is called just before framebuffer flush and cancels it, does this also mean that next time framebuffer is bound its content won't be restored?
Correct. If the content is invalid then there is no need to restore it.
I'd avoid using the term "flush" to mean "framebuffer writeout" - it's generally used in most drivers to mean submission of a command queue to the hardware (e.g. see glFlush() in the OpenGL ES API).
Hi Peter.
In other words, is it possible to use glInvalidateFramebuffer to control flush/restore independently? E.g. render the depth buffer and flush it once, then restore it multiple times for different FBOs for depth testing only. For this scenario it would be beneficial to cancel the flush when the depth buffer was not modified, and to restore it again from RAM for the next FBO.
Hi Christophe,
Is there a difference on your architecture and driver between glClear and glInvalidateFramebuffer for the case of flushing tiles to graphics memory?
If either are used as the first rendering operation in a render pass, before any glDraw calls, then they are functionally identical - both will result in the clear color being loaded to the tile.
After the first real drawcall has happened, then glInvalidateFramebuffer effectively becomes a no-op from a rendering point of view (we've already loaded the tile memory by that point so there is nothing invalidate can optimize away), whereas a glClear would have to render because it changes the color.
The only point where glInvalidateFramebuffer has any further effect is when it is the last operation before a render target is flushed; this will save us writing intermediate results (typically depth and stencil) back to main memory.
Additionally, glInvalidateSubFramebuffer is a more powerful API as for some algorithms, it would allow me to readback only a subset of the full framebuffer.
Today we don't really gain anything by sub-frame invalidation; you will still get a readback of the original surface unless the entire framebuffer viewport is either cleared or invalidated.
However, in general invalidate is only legal if you guarantee you are overdrawing the invalidated region with opaque geometry, which means that another optimization will kick in. On the recent Mali GPUs (Mali-T62x onwards) we have a hidden surface removal scheme called Forward Pixel Kill. More info here:
In summary this means that any areas of the original framebuffer readback which are occluded by subsequent opaque geometry (GL_BLEND disabled, no discard in the fragment shader, no alpha-to-coverage) will be optimized away. This covers most use cases where glInvalidateSubFramebuffer would be useful; if you have another use case please let me know - we are always on the lookout for feedback on things we can improve.
Cheers, Pete
Hi Peter,
Is there a difference on your architecture and driver between glClear and glInvalidateFramebuffer in the case where we don't want to restore a framebuffer from graphics memory to on-chip memory?
If yes, what's the nature of that difference?
I am tempted to use glInvalidateFramebuffer on the color buffer and glClear on the depth/stencil buffer because I only need to write values for depth/stencil, so I think glInvalidateFramebuffer might be more effective, as I expect it to avoid both the readback and a fast clear.
Additionally, glInvalidateSubFramebuffer is a more powerful API as, for some algorithms, it would allow me to read back only a subset of the full framebuffer. Using scissor and glClear doesn't work as it would kill the fragments. It's actually pretty unfortunate, because what we would typically want to do is actually glValidateSubFramebuffer, but I guess it's nothing that 4 calls to glInvalidateSubFramebuffer wouldn't work around.
Thanks,
Christophe