Mali Performance 2: How to Correctly Handle Framebuffers

April 28, 2014


Previous blog in the series: Mali Performance 1: Checking the Pipeline

This week I take a slight diversion from the hardware-centric view of the rendering pipeline we have been exploring so far to look at how, and more importantly when, the Mali driver stack turns OpenGL ES API activity into the hardware workloads needed for rendering. As we will see, OpenGL ES is not particularly tightly specified around this area, so there are some common pitfalls which developers must be careful to avoid.

Per-Render Target Rendering: Quick Recap

As described in my previous blogs, Mali's hardware engine operates on a two-pass rendering model, rendering all of the geometry for a render target to completion before starting any of the fragment processing. This allows us to keep most of our working state in local memory tightly coupled to the GPU, and to minimize the number of power-hungry external DRAM accesses needed for the rendering process.

When OpenGL ES is used well we can create, use, and then discard most of our framebuffer data inside this local memory. This avoids the need to read framebuffers from, or write framebuffers to, external memory at all, except for the buffers we want to keep such as the color buffer. However this isn't guaranteed behavior and some patterns of API usage can trigger inefficient behavior which forces the GPU to make extra reads and writes.

OpenGL ES: What is a Render Target?

In OpenGL ES there are two types of render target:

On-screen window render targets
Off-screen framebuffer render targets

Conceptually these are very similar in OpenGL ES, although not entirely identical. Only one render target can be active at the API level for rendering at any point in time; the current render target is selected via a call to glBindFramebuffer( GL_FRAMEBUFFER, fbo_id ) (abbreviated to glBindFramebuffer( fbo_id ) in the listings in this blog), where an ID of 0 can be used to switch back to the window render target (also often called the default FBO).

On-screen Render Targets

On-screen render targets are tightly defined by EGL. The boundary between one frame and the next is clearly demarcated: all rendering to FBO 0 between two calls to eglSwapBuffers() defines the rendering for one frame.

In addition, the color, depth, and stencil buffers in use are defined when the context is created, and their configuration is immutable. By default the contents of the color, depth, and stencil buffers immediately after eglSwapBuffers() are undefined - the old values are not preserved from the previous frame - allowing the GPU driver to make safe assumptions about the use of the buffers. In particular we know that depth and stencil are only transient working data, and we never need to write them back to memory.

Off-screen Render Targets

Off-screen render targets are less tightly defined.

Firstly, there is no equivalent of eglSwapBuffers() which tells the driver that the application has finished rendering to an FBO and it can be submitted for rendering; the flush of the rendering work is inferred from other API activities. We'll look more closely at the inferences the Mali drivers support in the next section.

Secondly, there are no guarantees about what the application will do with the buffers attached to the color, depth, and stencil attachment points. An application may use any of these as textures, or reattach them to a different FBO, for example reloading the depth value from a previous render target as the starting depth value for a different render target. By default OpenGL ES preserves all attachments, unless they are explicitly discarded by the application via a call to glInvalidateFramebuffer(). Note: this is a new entry point in OpenGL ES 3.0; in OpenGL ES 2.0 you can access the equivalent functionality via the glDiscardFramebufferEXT() extension entry point, which all Mali drivers support.

Render Target Flush Inference

In normal circumstances Mali flushes rendering work when a render target is "unbound", except for the main window surface which is flushed when the driver sees a call to eglSwapBuffers().

To avoid a performance drop, developers need to avoid unneeded flushes which contain only a subset of the final rendering. We therefore recommend that you bind each off-screen FBO once per frame and render it to completion in one go.
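One way to check an application against this recommendation is to record the order of its framebuffer binds for a single frame and look for off-screen FBOs that are bound again after having been unbound. The following is a minimal sketch of such a check, not part of any Mali tool; count_rebound_fbos and MAX_FBO_ID are hypothetical names, the ID limit is arbitrary, and the frame is assumed to start with FBO 0 bound.

```c
#include <stddef.h>

/* Arbitrary limit chosen for this sketch; FBO ids at or above it are ignored. */
#define MAX_FBO_ID 64

/* binds: FBO ids in the order the application bound them during one frame
 * (0 is the window surface).
 * Returns the number of distinct off-screen FBOs that were bound again after
 * being unbound, i.e. candidates for an unnecessary extra flush. */
size_t count_rebound_fbos(const unsigned *binds, size_t count)
{
    unsigned char times_bound[MAX_FBO_ID] = { 0 };
    unsigned current = 0;   /* assumption: the frame starts on FBO 0 */
    size_t rebound = 0;

    for (size_t i = 0; i < count; ++i) {
        unsigned id = binds[i];
        if (id >= MAX_FBO_ID || id == current) {
            continue;       /* out of range, or a redundant bind */
        }
        current = id;
        if (id == 0) {
            continue;       /* window surface is flushed by eglSwapBuffers() */
        }
        if (times_bound[id] < 255) {
            ++times_bound[id];
        }
        if (times_bound[id] == 2) {
            ++rebound;      /* report each offending FBO only once */
        }
    }
    return rebound;
}
```

A bind trace of 1, 0, 2, 0 yields zero, while a trace of 1, 0, 1, 0 reports one FBO that was split across two flushes.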

A well-structured rendering sequence (almost anyway - read the next section to see why this is incomplete) would look like:

#define ALL_BUFFERS ( GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT )

glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 0 (window surface)

glBindFramebuffer( 1 )    // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 1
...                       // Draw FBO 1 to completion

glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 1 for rendering
glDraw...( ... )          // Draw something else to FBO 0 (window surface)

glBindFramebuffer( 2 )    // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 2
...                       // Draw FBO 2 to completion

glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 2 for rendering
glDraw...( ... )          // Draw something else to FBO 0 (window surface)
eglSwapBuffers()          // Tell EGL we have finished, flush FBO 0 for rendering

By contrast the "bad" behavior would look like:

#define ALL_BUFFERS ( GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT )

glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 0 (window surface)

glBindFramebuffer( 1 )    // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 1

glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 1 for rendering
glDraw...( ... )          // Draw something else to FBO 0 (window surface)

glBindFramebuffer( 1 )    // Rebind away from FBO 0, does not trigger rendering of FBO
                          // However, rebinding FBO 1 requires us to reload old render
                          // state from memory, and write over the top of it
glDraw...( ... )          // Draw something to FBO 1

glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 1 (again)
glDraw...( ... )          // Draw something else to FBO 0 (window surface)
eglSwapBuffers()          // Tell EGL we have finished, flush FBO 0 for rendering

This type of behavior is known as an incremental render, and it forces the driver to process the render target twice. The first processing pass needs to write all of the intermediate render state (color, depth, and stencil) out to memory, and the second pass reads it back in from memory so that it can "append" more rendering on top of the old state.

As shown in the diagram above, incremental rendering has a +400% penalty in framebuffer bandwidth (assuming 32-bpp color and D24S8 packed depth-stencil) when compared against a well-structured single-pass render, which avoids the need to write the intermediate state to main memory and then read it back.
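The +400% figure can be reproduced with simple arithmetic. The sketch below is a back-of-envelope model rather than a description of what the driver does: it assumes that the first pass starts from a clear and reads nothing, that every pass except the last writes both color and depth-stencil out to memory and the following pass reads both back, and that the final pass writes only color. The function names are hypothetical.

```c
/* Framebuffer traffic in bytes per pixel when the rendering for one render
 * target is split into `passes` render passes, under the assumptions stated
 * in the text above. */
unsigned framebuffer_bytes_per_pixel(unsigned passes,
                                     unsigned color_bytes,
                                     unsigned depth_stencil_bytes)
{
    if (passes == 0) {
        return 0;
    }
    unsigned all = color_bytes + depth_stencil_bytes;
    unsigned spills = passes - 1;   /* each spill is one write plus one read */
    return spills * all * 2 + color_bytes;
}

/* Extra traffic relative to a single-pass render, as a percentage. */
unsigned incremental_penalty_percent(unsigned passes,
                                     unsigned color_bytes,
                                     unsigned depth_stencil_bytes)
{
    unsigned single = framebuffer_bytes_per_pixel(1, color_bytes,
                                                  depth_stencil_bytes);
    unsigned split = framebuffer_bytes_per_pixel(passes, color_bytes,
                                                 depth_stencil_bytes);
    if (single == 0 || split < single) {
        return 0;                   /* nothing to compare against */
    }
    return (split - single) * 100 / single;
}
```

With 4 bytes of color and 4 bytes of packed depth-stencil per pixel, a single pass moves 4 bytes per pixel and a render split into two passes moves 20, so incremental_penalty_percent( 2, 4, 4 ) evaluates to 400.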

When to call glClear?

The observant reader will have noted that I inserted some calls to glClear() into the rendering sequence for our framebuffers. The application should always call glClear() for every attachment at the start of each render target's rendering sequence, provided that the previous contents of the attachments are not needed, of course. This explicitly tells the driver we do not need the previous state, and thus we avoid reading it back from memory, as well as putting any undefined buffer contents into a defined "clear color" state.

One common mistake which is seen here is only clearing part of the framebuffer; i.e. calling glClear() while only a portion of the render target is active because of a scissor rectangle with only partial screen coverage. We can only completely drop the render state when it applies to whole surfaces, so a clear of the whole render target should be performed where possible.
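A debug build can guard against this mistake by checking the scissor state just before the clear is issued. The following is a minimal sketch under the assumption that the application tracks its own scissor state; ScissorBox and clear_covers_whole_target are hypothetical names, and the box uses the same x, y, width, height convention as glScissor().

```c
#include <stdbool.h>

/* Application-side copy of the scissor rectangle, in pixels, with the
 * origin at the lower-left corner as in glScissor(). */
typedef struct {
    int x;
    int y;
    int width;
    int height;
} ScissorBox;

/* Returns true if a glClear() issued with this state would touch every pixel
 * of a target_width x target_height render target. */
bool clear_covers_whole_target(bool scissor_enabled, ScissorBox box,
                               int target_width, int target_height)
{
    if (!scissor_enabled) {
        return true;    /* no scissor test: the clear applies to everything */
    }
    /* Widen to long long so x + width cannot overflow an int. */
    long long right = (long long)box.x + box.width;
    long long top = (long long)box.y + box.height;
    return box.x <= 0 && box.y <= 0
        && right >= target_width && top >= target_height;
}
```

If the function returns false, the simplest fix is usually to disable the scissor test around the initial clear and re-enable it afterwards.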

When to call glInvalidateFramebuffer?

The final requirement placed on the application for efficient use of FBOs in the OpenGL ES API is that it should tell the driver which of the color / depth / stencil attachments are simply transient working buffers, the value of which can be discarded at the end of rendering the current render pass. For example, nearly every 3D render will use color and depth, but for most applications the depth buffer is transient and can be safely invalidated. Failure to invalidate the unneeded buffers may result in them being written back to memory, wasting memory bandwidth and increasing energy consumption of the rendering process.

The most common mistake at this point is to treat glInvalidateFramebuffer() as equivalent to glClear() and place the invalidate call for frame N state at the first use of that FBO in frame N+1. This is too late! The purpose of the invalidate call is to tell the driver that the buffers do not need to be kept, so we need to modify the work submission to the GPU for the frame which produces those buffers. Telling us in the next frame is often after the original frame has been processed. The application needs to ensure that the driver knows which buffers are transient before the framebuffer is flushed. Therefore transient buffers in frame N should be indicated by calling glInvalidateFramebuffer() before unbinding the FBO in frame N. For example:

#define ALL_BUFFERS ( GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT )
static const GLenum invalid_ap[2] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };

glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 0 (window surface)

glBindFramebuffer( 1 )    // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 1
...                       // Draw FBO 1 to completion
glInvalidateFramebuffer( GL_FRAMEBUFFER, 2, &invalid_ap[0] ); // Only keep color

glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 1 for rendering
glDraw...( ... )          // Draw something else to FBO 0 (window surface)

glBindFramebuffer( 2 )    // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )    // Clear initial state
glDraw...( ... )          // Draw something to FBO 2
...                       // Draw FBO 2 to completion
glInvalidateFramebuffer( GL_FRAMEBUFFER, 2, &invalid_ap[0] ); // Only keep color

glBindFramebuffer(0)      // Switch to FBO 0, unbind and flush FBO 2 for rendering
glDraw...( ... )          // Draw something else to FBO 0 (window surface)

eglSwapBuffers()          // Tell EGL we have finished, flush FBO 0 for rendering
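The combined effect of the clear and invalidate calls on a single render pass can be summarized with a small model. This is a sketch of the rules described in this blog, not of the driver's actual implementation: it assumes that any attachment not cleared at the start of the pass is read back from memory, and any attachment not invalidated before the unbind is written out. All names are hypothetical, and the per-attachment byte counts (4 for color, 3 for depth, 1 for stencil) are illustrative values for 32-bpp color and D24S8.

```c
enum {
    ATTACH_COLOR   = 1 << 0,
    ATTACH_DEPTH   = 1 << 1,
    ATTACH_STENCIL = 1 << 2,
    ATTACH_ALL     = ATTACH_COLOR | ATTACH_DEPTH | ATTACH_STENCIL
};

/* Modelled memory traffic for one render pass, in bytes per pixel. */
typedef struct {
    unsigned read_bytes;    /* loaded from memory at the start of the pass */
    unsigned write_bytes;   /* stored to memory when the pass is flushed */
} PassTraffic;

/* Illustrative sizes: 32-bpp color, 24-bit depth, 8-bit stencil. */
static unsigned attachment_bytes(unsigned mask)
{
    unsigned bytes = 0;
    if (mask & ATTACH_COLOR)   { bytes += 4; }
    if (mask & ATTACH_DEPTH)   { bytes += 3; }
    if (mask & ATTACH_STENCIL) { bytes += 1; }
    return bytes;
}

/* cleared_at_start: attachments cleared before the first draw call.
 * invalidated_before_unbind: attachments invalidated before the FBO is
 * unbound (and therefore before it is flushed). */
PassTraffic model_pass_traffic(unsigned cleared_at_start,
                               unsigned invalidated_before_unbind)
{
    PassTraffic traffic;
    traffic.read_bytes = attachment_bytes(ATTACH_ALL & ~cleared_at_start);
    traffic.write_bytes =
        attachment_bytes(ATTACH_ALL & ~invalidated_before_unbind);
    return traffic;
}
```

In this model the sequence above, which clears everything and invalidates depth and stencil, reads nothing and writes 4 bytes per pixel. Omitting the invalidate raises the write to 8 bytes per pixel, and omitting the clear adds an 8-byte read.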

Summary

In this blog we've looked at how the Mali drivers¹ handle the identification of render passes, the common points of inefficiency, and how an application developer can drive the OpenGL ES API to avoid them. In summary we recommend:

Binding each FBO (other than FBO 0) exactly once in each frame, rendering it to completion in a contiguous sequence of API calls.
Calling glClear() at the start of each FBO’s rendering sequence, for all attachments where the old value is not needed.
Calling glInvalidateFramebuffer() or glDiscardFramebufferEXT() at the end of each FBO’s rendering sequence, before switching to a different FBO, for all attachments which are simply transient working buffers for the intermediate state.

Next time I'll look at a related topic to this one – the efficient use of EGL_BUFFER_PRESERVED for maintaining window surface color from one frame as the default input for the next frame, and the implications that has for performance and bandwidth.

Cheers,

Pete

The next blog in the series asks the question: is EGL_BUFFER_PRESERVED a good thing? Read it at https://community.arm.com/graphics/b/blog/posts/mali-performance-3-is-egl_5f00_buffer_5f00_preserved-a-good-thing

Footnotes

¹ It is worth noting that little of this is actually Mali specific - most of the mobile GPU vendors make the same recommendations, so this is general best practice, irrespective of the underlying GPU.

Igor Lobanchikov over 8 years ago

Hi Peter.
Thanks for your prompt reply.
I've re-read your article and now I understand that every time you wrote "flushing FBO" it actually meant "committing the command buffer". Thanks for pointing this out. I'm more used to the resolve/restore terms, but whatever simplifies the conversation will be a perfect fit.
control flush/restore independently
I'm not sure how you reliably get independent control over the decision to writeout and then readback; it's a single bit of state (e.g. is the framebuffer invalid or not?). If the framebuffer is invalid then it will be neither written out nor read back; if it is valid then it will be written out and read back.
I'm not quite sure what exactly do you mean. If we speak about the potentially possible scenario - this might be resetting the invalid bit once either writeout or readback happens. However, this doesn't match well enough "invalid content" semantics. I do understand that once the content is marked "invalid" it is considered invalid until it is overwritten with the data which is considered "valid".
I'm making sure the engine I'm optimizing works as fast as possible on OpenGL ES. We do have separate control for writeout/readback and it does save us quite a few milliseconds. Just wanted to make sure there's no way to do the same thing in GL ES.
Peter Harris over 8 years ago

Could you, please, clarify the following: when glInvalidateFramebuffer is called just before framebuffer flush and cancels it, does this also mean that next time framebuffer is bound its content won't be restored?
Correct. If the content is invalid then there is no need to restore it.
control flush/restore independently
I'd avoid using the term "flush" to mean "framebuffer writeout" - it's generally used in most drivers to mean submission of a command queue to the hardware (e.g. see glFlush() in the OpenGL ES API).
I'm not sure how you reliably get independent control over the decision to writeout and then readback; it's a single bit of state (e.g. is the framebuffer invalid or not?). If the framebuffer is invalid then it will be neither written out nor read back; if it is valid then it will be written out and read back.
Igor Lobanchikov over 8 years ago

hi Peter.
Could you, please, clarify the following: when glInvalidateFramebuffer is called just before framebuffer flush and cancels it, does this also mean that next time framebuffer is bound its content won't be restored?
in other words is it possible to use glInvalidateFramebuffer to control flush/restore independently? E.g render depth buffer and flush it once, then restore it multiple times for different fbos for depth testing only. for this scenario it would be beneficial to cancel flush when depth buffer was not modified, and to restore it again from ram for the next fbo.
Peter Harris over 8 years ago
Hi Christophe,
Is there a different on your architecture and driver between glClear and glInvalidateFramebuffer for the case flushing tiles to graphics memory?
If either are used as the first rendering operation in a render pass, before any glDraw calls, then they are functionally identical - both will result in the clear color being loaded to the tile.
After the first real drawcall has happened, then glInvalidateFramebuffer effectively becomes a no-op from a rendering point of view (we've already loaded the tile memory by that point so there is nothing invalidate can optimize away), whereas a glClear would have to render because it changes the color.
The only point where glInvalidateFramebuffer has any further effect is when it is the last operation before a render target is flushed; this will save us writing intermediate results (typically depth and stencil) back to main memory.
Additionally, glInvalidateSubFramebuffer is a more powerful API as for some algorithms, it would allow me to readback only a subset of the full framebuffer.
Today we don't really gain anything by sub-frame invalidation; you will still get a readback of the original surface unless the entire framebuffer viewport is either cleared or invalidated.
However, in general invalidate is only legal if you guarantee you are overdrawing the invalidated region with opaque geometry, which means that another optimization will kick in. On the recent Mali GPUs (Mali-T62x onwards) we have a hidden surface removal scheme called Forward Pixel Kill. More info here:
Killing Pixels - A New Optimization for Shading on ARM Mali GPUs
In summary this means that any areas of the original framebuffer readback which are occluded by subsequent opaque geometry (GL_BLEND disabled, no discard in the fragment shader, no alpha-to-coverage) will be optimized away. This covers most use cases where glInvalidateSubFramebuffer would be useful; if you have another use case please let me know - we are always on the lookout for feedback on things we can improve.
Cheers,
Pete
Christophe over 8 years ago

Hi Peter,
Is there a different on your architecture and driver between glClear and glInvalidateFramebuffer when case we don't want to restoring a framebuffer from graphics memory to on-chip memory?
If yes, what's the nature of that difference?
I am tempted to use glInvalidateFramebuffer on colorbuffer and glClear on depth/stencil buffer because I only need to write values for depth/stencil so I think glInvalidateFramebuffer might be more effective as it expect don't read and avoid a fast clear.
Additionally, glInvalidateSubFramebuffer is a more powerful API as for some algorithms, it would allow me to readback only a subset of the full framebuffer. Using scissor and glClear doesn't work at it would kill the fragments. It's actually pretty unfortunate because what we would typically want to do it actually glValidateSubFramebuffer but I guess it's nothing that 4 calls to glInvalidateSubFramebuffer would workaround.
Thanks,
Christophe
