Mali Performance 2: How to Correctly Handle Framebuffers

Peter Harris
April 28, 2014
9 minute read time.

Previous blog in the series: Mali Performance 1: Checking the Pipeline

This week I take a slight diversion from the hardware-centric view of the rendering pipeline we have been exploring so far to look at how, and more importantly when, the Mali driver stack turns OpenGL ES API activity into the hardware workloads needed for rendering. As we will see, OpenGL ES is not particularly tightly specified around this area, so there are some common pitfalls which developers must be careful to avoid.

Per-Render Target Rendering: Quick Recap

As described in my previous blogs, Mali's hardware engine operates on a two-pass rendering model, rendering all of the geometry for a render target to completion before starting any of the fragment processing. This allows us to keep most of our working state in local memory tightly coupled to the GPU, and minimize the number of power-hungry external DRAM accesses which are needed for the rendering process.

When OpenGL ES is used well we can create, use, and then discard most of our framebuffer data inside this local memory. This avoids the need to read framebuffers from, or write framebuffers to, external memory at all, except for the buffers we want to keep such as the color buffer. However, this isn't guaranteed behavior, and some patterns of API usage can trigger inefficient behavior which forces the GPU to make extra reads and writes.

OpenGL ES: What is a Render Target?

In OpenGL ES there are two types of render target:

  • On-screen window render targets
  • Off-screen framebuffer render targets

Conceptually these are very similar in OpenGL ES, although not entirely identical. Only one render target can be active at the API level for rendering at any point in time; the current render target is selected via a call to glBindFramebuffer( GL_FRAMEBUFFER, fbo_id ), where an ID of 0 switches back to the window render target (also often called the default FBO).

On-screen Render Targets

On-screen render targets are tightly defined by EGL. The boundary between one frame and the next is clearly demarcated: all rendering to FBO 0 between two calls to eglSwapBuffers() defines the rendering for one frame.

In addition, the color, depth, and stencil buffers in use are defined when the context is created, and their configuration is immutable. By default the contents of the color, depth, and stencil buffers immediately after eglSwapBuffers() are undefined - the old values are not preserved from the previous frame - allowing the GPU driver to make firm assumptions about how the buffers are used. In particular we know that depth and stencil are only transient working data, and we never need to write them back to memory.

Off-screen Render Targets

Off-screen render targets are less tightly defined.

Firstly, there is no equivalent of eglSwapBuffers() which tells the driver that the application has finished rendering to an FBO and it can be submitted for rendering; the flush of the rendering work is inferred from other API activities. We'll look more closely at the inferences the Mali drivers support in the next section.

Secondly, there are no guarantees about what the application will do with the buffers attached to the color, depth, and stencil attachment points. An application may use any of these as textures, or reattach them to a different FBO, for example reloading the depth value from a previous render target as the starting depth value for a different render target. By default the behavior of OpenGL ES is to preserve all attachments, unless explicitly discarded by the application via a call to glInvalidateFramebuffer(). Note: this is a new entry point in OpenGL ES 3.0; in OpenGL ES 2.0 you can access the equivalent functionality via the glDiscardFramebufferEXT() extension entry point, which all Mali drivers support.
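
As a rough sketch of how an application might choose between those two entry points at startup, the helper below scans a space-separated extension string (the kind returned by glGetString( GL_EXTENSIONS )) for GL_EXT_discard_framebuffer. The names has_extension, discard_path, and pick_discard_path are hypothetical, and the block deliberately makes no GL calls so that it compiles on its own; it is an illustration under those assumptions, not the way any particular driver or engine does it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if `name` appears as a whole token in the
 * space-separated extension list `ext_list`. */
static bool has_extension(const char *ext_list, const char *name)
{
    assert(ext_list != NULL && name != NULL);

    size_t len = strlen(name);
    const char *p = ext_list;

    if (len == 0) {
        return false;
    }
    while ((p = strstr(p, name)) != NULL) {
        /* Reject partial matches such as "GL_EXT_discard_framebuffer_foo". */
        bool starts_token = (p == ext_list) || (p[-1] == ' ');
        bool ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token) {
            return true;
        }
        p += len;
    }
    return false;
}

typedef enum {
    DISCARD_NONE,       /* No way to mark attachments as transient */
    DISCARD_EXT,        /* glDiscardFramebufferEXT (OpenGL ES 2.0 extension) */
    DISCARD_INVALIDATE  /* glInvalidateFramebuffer (OpenGL ES 3.0 core) */
} discard_path;

/* Hypothetical policy: prefer the core entry point when it is available. */
static discard_path pick_discard_path(int es_major_version, const char *ext_list)
{
    if (es_major_version >= 3) {
        return DISCARD_INVALIDATE;
    }
    if (has_extension(ext_list, "GL_EXT_discard_framebuffer")) {
        return DISCARD_EXT;
    }
    return DISCARD_NONE;
}
```

An application would typically call pick_discard_path() once after context creation and cache the result, rather than parsing the extension string every frame.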

Render Target Flush Inference

In normal circumstances Mali flushes rendering work when a render target is "unbound", except for the main window surface which is flushed when the driver sees a call to eglSwapBuffers().

To avoid a performance drop, developers need to avoid unneeded flushes which only contain a subset of the final rendering, so it is recommended that you bind each off-screen FBO once per frame and render it to completion in one go.

A well-structured rendering sequence (almost anyway - read the next section to see why this is incomplete) would look like:

#define ALL_BUFFERS ( GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT )

glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 0 (window surface)

glBindFramebuffer( GL_FRAMEBUFFER, 1 )  // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 1
...                                     // Draw FBO 1 to completion

glBindFramebuffer( GL_FRAMEBUFFER, 0 )  // Switch to FBO 0, unbind and flush FBO 1 for rendering
glDraw...( ... )                        // Draw something else to FBO 0 (window surface)

glBindFramebuffer( GL_FRAMEBUFFER, 2 )  // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 2
...                                     // Draw FBO 2 to completion

glBindFramebuffer( GL_FRAMEBUFFER, 0 )  // Switch to FBO 0, unbind and flush FBO 2 for rendering
glDraw...( ... )                        // Draw something else to FBO 0 (window surface)
eglSwapBuffers()                        // Tell EGL we have finished, flush FBO 0 for rendering

By contrast the "bad" behavior would look like:

#define ALL_BUFFERS ( GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT )

glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 0 (window surface)

glBindFramebuffer( GL_FRAMEBUFFER, 1 )  // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 1

glBindFramebuffer( GL_FRAMEBUFFER, 0 )  // Switch to FBO 0, unbind and flush FBO 1 for rendering
glDraw...( ... )                        // Draw something else to FBO 0 (window surface)

glBindFramebuffer( GL_FRAMEBUFFER, 1 )  // Rebind FBO 1 (away from FBO 0), does not trigger rendering of FBO 0
                                        // However, rebinding FBO 1 requires us to reload old render
                                        // state from memory, and write over the top of it
glDraw...( ... )                        // Draw something to FBO 1

glBindFramebuffer( GL_FRAMEBUFFER, 0 )  // Switch to FBO 0, unbind and flush FBO 1 (again)
glDraw...( ... )                        // Draw something else to FBO 0 (window surface)
eglSwapBuffers()                        // Tell EGL we have finished, flush FBO 0 for rendering

This type of behavior is known as an incremental render, and it forces the driver to process the render target twice: the first processing pass needs to write all of the intermediate render state out to memory (color, depth, and stencil), and the second pass reads it back in from memory so it can "append" more rendering on top of the old state.
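
One way to catch this pattern during development is to record which off-screen FBOs have been bound in the current frame and flag any that are bound a second time. The sketch below is a hypothetical debug helper (bind_tracker, tracker_begin_frame, and tracker_on_bind are invented names, not part of OpenGL ES or any Mali tool), and it is only a heuristic: it cannot tell whether any draw calls were actually issued between the two binds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED_FBOS 16

/* Hypothetical per-frame tracker. Zero-initialize before first use. */
typedef struct {
    unsigned int ids[MAX_TRACKED_FBOS];  /* Off-screen FBOs bound this frame */
    size_t count;                        /* Number of valid entries in ids */
    unsigned int current;                /* FBO currently bound (0 = window) */
    unsigned int rebinds;                /* Suspected incremental renders */
} bind_tracker;

/* Call at the start of each frame, e.g. right after eglSwapBuffers(). */
static void tracker_begin_frame(bind_tracker *t)
{
    assert(t != NULL);
    t->count = 0;
    t->rebinds = 0;
    if (t->current != 0) {
        /* A binding carried over from the last frame counts as bound. */
        t->ids[t->count++] = t->current;
    }
}

/* Call alongside every glBindFramebuffer(). Returns true if this bind
 * revisits an off-screen FBO that was bound and then unbound earlier
 * in the same frame. */
static bool tracker_on_bind(bind_tracker *t, unsigned int fbo_id)
{
    assert(t != NULL);
    if (fbo_id == t->current) {
        return false;  /* Redundant bind: nothing was unbound or flushed */
    }
    t->current = fbo_id;
    if (fbo_id == 0) {
        return false;  /* Window surface is flushed by eglSwapBuffers() */
    }
    for (size_t i = 0; i < t->count; i++) {
        if (t->ids[i] == fbo_id) {
            t->rebinds++;
            return true;
        }
    }
    if (t->count < MAX_TRACKED_FBOS) {
        t->ids[t->count++] = fbo_id;
    }
    return false;
}
```

Fed the "bad" sequence above (bind 1, bind 0, bind 1, bind 0), the tracker reports one rebind; the well-structured sequence reports none.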

[Figure: Well-behaved one-pass render]

As the diagram above shows, incremental rendering incurs a +400% framebuffer bandwidth penalty [assuming 32-bpp color and D24S8 packed depth-stencil] compared with a well-structured single-pass render, which avoids the need to write the intermediate state to main memory and then re-read it.
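
The +400% figure can be reproduced with some simple per-pixel arithmetic. The accounting below is one reading of the numbers, not something the article spells out: it assumes the single-pass render writes out only the final color, while the incremental render writes color plus depth-stencil after the first pass, reads both back for the second pass, and then writes only the final color.

```c
#include <assert.h>

/* Bytes per pixel, taken from the article's stated assumption. */
enum {
    COLOR_BPP = 4,          /* 32-bpp color */
    DEPTH_STENCIL_BPP = 4   /* D24S8 packed depth-stencil */
};

/* Single pass: only the final color reaches external memory
 * (assumes depth-stencil is transient and discarded). */
static int single_pass_bytes(void)
{
    return COLOR_BPP;
}

/* Incremental render split into two passes (assumed accounting). */
static int incremental_bytes(void)
{
    int first_pass_write  = COLOR_BPP + DEPTH_STENCIL_BPP;  /* Spill state */
    int second_pass_read  = COLOR_BPP + DEPTH_STENCIL_BPP;  /* Reload state */
    int second_pass_write = COLOR_BPP;                      /* Final color */
    return first_pass_write + second_pass_read + second_pass_write;
}

/* Extra traffic relative to the baseline, as a percentage. */
static int penalty_percent(int baseline, int actual)
{
    assert(baseline > 0);
    return (actual - baseline) * 100 / baseline;
}
```

Under these assumptions the single-pass render moves 4 bytes per pixel and the incremental render moves 20, so penalty_percent( 4, 20 ) gives 400.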

When to call glClear?

The observant reader will have noted that I inserted some calls to glClear() into the rendering sequence for our framebuffers. The application should always call glClear() for every attachment at the start of each render target's rendering sequence, provided that the previous contents of the attachments are not needed, of course. This explicitly tells the driver we do not need the previous state, and thus we avoid reading it back from memory, as well as putting any undefined buffer contents into a defined "clear color" state.

One common mistake which is seen here is only clearing part of the framebuffer; i.e. calling glClear() while only a portion of the render target is active because of a scissor rectangle with only partial screen coverage. We can only completely drop the render state when it applies to whole surfaces, so a clear of the whole render target should be performed where possible.
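
A small check like the following could be used in a debug build to warn when a clear will not cover the whole surface. It is a sketch only: rect and clear_covers_whole_target are hypothetical names, the application is assumed to track its own scissor state, and other state that can restrict a clear (such as write masks) is ignored here.

```c
#include <assert.h>
#include <stdbool.h>

/* Scissor rectangle in window coordinates, as passed to glScissor(). */
typedef struct {
    int x, y, width, height;
} rect;

/* Hypothetical check: would a glClear() issued under this scissor state
 * cover every pixel of a fb_width x fb_height render target? */
static bool clear_covers_whole_target(bool scissor_enabled, rect scissor,
                                      int fb_width, int fb_height)
{
    assert(fb_width > 0 && fb_height > 0);
    if (!scissor_enabled) {
        return true;  /* No scissor test: the clear touches all pixels */
    }
    return scissor.x <= 0 && scissor.y <= 0 &&
           scissor.x + scissor.width >= fb_width &&
           scissor.y + scissor.height >= fb_height;
}
```

In real GL code the simplest route is usually to call glDisable( GL_SCISSOR_TEST ) before the clear and re-enable it afterwards if the scissor is still needed for drawing.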

When to call glInvalidateFramebuffer?

The final requirement placed on the application for efficient use of FBOs in the OpenGL ES API is that it should tell the driver which of the color / depth / stencil attachments are simply transient working buffers, the value of which can be discarded at the end of rendering the current render pass. For example, nearly every 3D render will use color and depth, but for most applications the depth buffer is transient and can be safely invalidated. Failure to invalidate the unneeded buffers may result in them being written back to memory, wasting memory bandwidth and increasing energy consumption of the rendering process.

The most common mistake at this point is to treat glInvalidateFramebuffer() as equivalent to glClear() and place the invalidate call for frame N state at the first use of that FBO in frame N+1. This is too late! The purpose of the invalidate call is to tell the driver that the buffers do not need to be kept, so we need to modify the work submission to the GPU for the frame which produces those buffers. Telling us in the next frame is often after the original frame has been processed. The application needs to ensure that the driver knows which buffers are transient before the framebuffer is flushed. Therefore transient buffers in frame N should be indicated by calling glInvalidateFramebuffer() before unbinding the FBO in frame N. For example:

#define ALL_BUFFERS ( GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT )
static const GLenum invalid_ap[2] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };

glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 0 (window surface)

glBindFramebuffer( GL_FRAMEBUFFER, 1 )  // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 1
...                                     // Draw FBO 1 to completion
glInvalidateFramebuffer( GL_FRAMEBUFFER, 2, &invalid_ap[0] ); // Only keep color

glBindFramebuffer( GL_FRAMEBUFFER, 0 )  // Switch to FBO 0, unbind and flush FBO 1 for rendering
glDraw...( ... )                        // Draw something else to FBO 0 (window surface)

glBindFramebuffer( GL_FRAMEBUFFER, 2 )  // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 2
...                                     // Draw FBO 2 to completion
glInvalidateFramebuffer( GL_FRAMEBUFFER, 2, &invalid_ap[0] ); // Only keep color

glBindFramebuffer( GL_FRAMEBUFFER, 0 )  // Switch to FBO 0, unbind and flush FBO 2 for rendering
glDraw...( ... )                        // Draw something else to FBO 0 (window surface)

eglSwapBuffers()                        // Tell EGL we have finished, flush FBO 0 for rendering
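
To see why both the clear at the start and the invalidate at the end matter, it can help to model each attachment's worst-case external memory traffic for one render pass. The sketch below is a deliberately simplified model, not a description of the actual driver logic: attachment_use and pass_traffic_bytes are hypothetical names, and it assumes an attachment that is not fully cleared at the start is read back, and one that is not invalidated at the end is written out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of how one attachment is used in a pass. */
typedef struct {
    int bytes_per_pixel;
    bool cleared_at_start;    /* glClear() covered the whole attachment */
    bool invalidated_at_end;  /* glInvalidateFramebuffer() before unbind */
} attachment_use;

/* Worst-case external memory traffic per pixel for one render pass,
 * under the simplified model described above. */
static int pass_traffic_bytes(const attachment_use *atts, int count)
{
    assert(atts != NULL && count >= 0);
    int total = 0;
    for (int i = 0; i < count; i++) {
        if (!atts[i].cleared_at_start) {
            total += atts[i].bytes_per_pixel;  /* Readback of old contents */
        }
        if (!atts[i].invalidated_at_end) {
            total += atts[i].bytes_per_pixel;  /* Writeout of new contents */
        }
    }
    return total;
}
```

With 4-byte color and 4-byte depth-stencil attachments, the sequence in the example above (everything cleared, depth-stencil invalidated) costs 4 bytes per pixel in this model; dropping the invalidate raises that to 8, and dropping the clears as well raises it to 16.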

Summary

In this blog we've looked at how the Mali drivers [1] handle the identification of render passes, the common points of inefficiency, and how an application developer can drive the OpenGL ES API to avoid them. In summary we recommend:

  • Binding each FBO (other than FBO 0) exactly once in each frame, rendering it to completion in a contiguous sequence of API calls.
  • Calling glClear() at the start of each FBO’s rendering sequence, for all attachments where the old value is not needed.
  • Calling glInvalidateFramebuffer() or glDiscardFramebufferEXT() at the end of each FBO’s rendering sequence, before switching to a different FBO, for all attachments which are simply transient working buffers for the intermediate state.

Next time I'll look at a related topic to this one – the efficient use of EGL_BUFFER_PRESERVED for maintaining window surface color from one frame as the default input for the next frame, and the implications that has for performance and bandwidth.

Cheers,

Pete

The next blog in the series asks the question: is EGL_BUFFER_PRESERVED a good thing?

Footnotes

  1. It is worth noting that little of this is actually Mali specific - most of the mobile GPU vendors make the same recommendations, so this is general best practice, irrespective of the underlying GPU.

Comments

  • Igor Lobanchikov over 8 years ago

    Hi Peter,

    Thanks for your prompt reply.

    I've re-read your article and now understand that every time you wrote "flushing FBO" you actually meant "committing the command buffer". Thanks for pointing this out. I'm more used to the resolve/restore terms, but whatever simplifies the conversation is fine by me.

    "control flush/restore independently"

    "I'm not sure how you reliably get independent control over the decision to writeout and then readback; it's a single bit of state (e.g. is the framebuffer invalid or not?). If the framebuffer is invalid then it will be neither written out nor read back, if it is valid then it will be written out and read back."

    I'm not quite sure what exactly you mean. One potentially possible scenario would be resetting the invalid bit once either writeout or readback happens. However, this doesn't match the "invalid content" semantics well. I do understand that once the content is marked "invalid" it is considered invalid until it is overwritten with data which is considered "valid".

    I'm making sure the engine I'm optimizing works as fast as possible on OpenGL ES. We do have separate control for writeout/readback and it does save us quite a few milliseconds. Just wanted to make sure there's no way to do the same thing in GL ES.

  • Peter Harris over 8 years ago

    "Could you, please, clarify the following: when glInvalidateFramebuffer is called just before framebuffer flush and cancels it, does this also mean that next time framebuffer is bound its content won't be restored?"

    Correct. If the content is invalid then there is no need to restore it.

    "control flush/restore independently"

    I'd avoid using the term "flush" to mean "framebuffer writeout" - it's generally used in most drivers to mean submission of a command queue to the hardware (e.g. see glFlush() in the OpenGL ES API).

    I'm not sure how you reliably get independent control over the decision to writeout and then readback; it's a single bit of state (e.g. is the framebuffer invalid or not?). If the framebuffer is invalid then it will be neither written out nor read back, if it is valid then it will be written out and read back.

  • Igor Lobanchikov over 8 years ago

    Hi Peter,

    Could you, please, clarify the following: when glInvalidateFramebuffer is called just before framebuffer flush and cancels it, does this also mean that next time framebuffer is bound its content won't be restored?

    In other words, is it possible to use glInvalidateFramebuffer to control flush/restore independently? E.g. render the depth buffer and flush it once, then restore it multiple times for different FBOs for depth testing only. For this scenario it would be beneficial to cancel the flush when the depth buffer was not modified, and to restore it again from RAM for the next FBO.

  • Peter Harris over 8 years ago

    Hi Christophe,

    "Is there a difference on your architecture and driver between glClear and glInvalidateFramebuffer for the case of flushing tiles to graphics memory?"

    If either is used as the first rendering operation in a render pass, before any glDraw calls, then they are functionally identical - both will result in the clear color being loaded to the tile.

    After the first real draw call has happened, glInvalidateFramebuffer effectively becomes a no-op from a rendering point of view (we've already loaded the tile memory by that point so there is nothing invalidate can optimize away), whereas a glClear would have to render because it changes the color.

    The only point where glInvalidateFramebuffer has any further effect is when it is the last operation before a render target is flushed; this will save us writing intermediate results (typically depth and stencil) back to main memory.

    "Additionally, glInvalidateSubFramebuffer is a more powerful API as for some algorithms, it would allow me to readback only a subset of the full framebuffer."

    Today we don't really gain anything by sub-frame invalidation; you will still get a readback of the original surface unless the entire framebuffer viewport is either cleared or invalidated.

    However, in general invalidate is only legal if you guarantee you are overdrawing the invalidated region with opaque geometry, which means that another optimization will kick in. On the recent Mali GPUs (Mali-T62x onwards) we have a hidden surface removal scheme called Forward Pixel Kill. More info here:

    • Killing Pixels - A New Optimization for Shading on ARM Mali GPUs

    In summary this means that any areas of the original framebuffer readback which are occluded by subsequent opaque geometry (GL_BLEND disabled, no discard in the fragment shader, no alpha-to-coverage) will be optimized away. This covers most use cases where glInvalidateSubFramebuffer would be useful; if you have another use case please let me know - we are always on the lookout for feedback on things we can improve.

    Cheers,
    Pete

  • Christophe over 8 years ago

    Hi Peter,

    Is there a difference on your architecture and driver between glClear and glInvalidateFramebuffer in the case where we don't want to restore a framebuffer from graphics memory to on-chip memory?

    If yes, what's the nature of that difference?

    I am tempted to use glInvalidateFramebuffer on the color buffer and glClear on the depth/stencil buffer, because I only need to write values for depth/stencil, so I think glInvalidateFramebuffer might be more efficient as I expect it to skip the read and also avoid a fast clear.

    Additionally, glInvalidateSubFramebuffer is a more powerful API as for some algorithms, it would allow me to readback only a subset of the full framebuffer. Using scissor and glClear doesn't work as it would kill the fragments. It's actually pretty unfortunate, because what we would typically want is actually a glValidateSubFramebuffer, but I guess it's nothing that 4 calls to glInvalidateSubFramebuffer couldn't work around.

    Thanks,

    Christophe
