My last blog looked at some of the critical areas which an application has to implement efficiently to get the best performance out of 3D content, such as broad-brush culling of large sections of a scene which are guaranteed not to be visible so they are not sent to the GPU at all. In one of the follow-on comments to this blog seanlumly01 asked "Is there a performance penalty for an application modifying textures between draw calls?". It is a really good question, but the answer is non-trivial, so I deferred it to this blog post to answer it fully.
The most important thing to remember when it comes to resource management is the fact that OpenGL ES implementations are nearly all heavily pipelined. This is discussed in more detail in this earlier blog, but in summary ...
When you call glDraw...() to draw something the draw does not happen instantly; instead the command which tells the GPU how to perform that draw is added to a queue of operations to be performed at some point in the future. Similarly, eglSwapBuffers() does not actually swap the front and back buffer of the screen, but really just tells the graphics stack that the application has finished composing a frame of rendering, and queues that frame for display. In both cases the logical specification of the behaviour - the API calls - and the actual processing of the work on the GPU are decoupled by a buffering process which can be tens of milliseconds in length.
For the most part, OpenGL ES defines a synchronous programming model. Apart from a few explicit exceptions, when you make a draw call rendering must appear to have happened at the point that the draw call was made, with pixels on screen correctly reflecting the state of any command flags, textures, or buffers at that point in time (based either on API function calls or previously specified GPU commands). This appearance of synchronous rendering is an elaborate illusion maintained by the driver stack underneath the API, which works well but does place some constraints on the application behavior if you want to achieve the best performance and lowest CPU overheads.
Due to the pipelining process outlined earlier, enforcing this illusion of synchronicity means that a pending draw call which reads a texture or buffer effectively places a modification lock on that resource until that draw operation has actually completed rendering on the GPU.
For example, if we had a code sequence:
glBindTexture(GL_TEXTURE_2D, 1);  // Bind texture 1, version 1
glDrawElements(...);              // Draw reading texture 1, version 1
glTexSubImage2D(...);             // Modify texture 1, so it becomes version 2
glDrawElements(...);              // Draw reading texture 1, version 2
... then we cannot allow the glTexSubImage2D() to modify the texture memory until the first draw call has actually been processed by the GPU, otherwise the rendering of the first draw call will not correctly reflect the state of the GL at the point the API call was made (we need it to render the draw using the contents of the physical memory which reflect texture version 1, not version 2). A lot of what OpenGL ES drivers spend their time doing is tracking resource dependencies such as this one to make sure that the synchronous programming "illusion" is maintained, ensuring that operations do not happen too early (before the resources are available) or too late (after a later resource modification has been made).
In scenarios where a resource dependency conflict occurs - for example a buffer write is requested when that buffer still has a pending read lock - the Mali drivers cannot apply the resource modification immediately without some special handling; there are multiple possible routes open to the drivers to resolve the conflict automatically.
We could drain the rendering pipeline to the point where all pending reads and writes of the conflicted resource have been resolved. Once the drain has completed we can process the modification of the resource as normal. If this happens part way through the drawing of a framebuffer you will incur incremental rendering costs, where we are forced to flush the intermediate render state to main memory; see this blog for more details.
Draining the pipeline completely means that the GPU will then go idle waiting for the CPU to build the next workload, which is a poor use of hardware cycles, so this tends to be a poor solution in practice.
We can maintain both the illusion of the synchronous programming model and process the application update immediately, if we are willing to spend a bit more memory. Rather than modifying the physical contents of the current resource memory, we can simply create a new version of the logical texture resource, assembling the new version from both the application update and any of the data from the original buffer (if the modification is only a partial buffer or texture replacement). The latest version of the resource is used for any operations at the API level, older versions are only needed until their pending rendering operations are resolved, at which point their memory can be freed. This approach is known as resource ghosting, or copy-on-write.
This is the most common approach taken by drivers as it leaves the pipeline intact and ensures that the GPU hardware stays busy. The downsides of this approach are additional memory footprint while the ghost resources are alive, and some additional processing load to allocate and assemble the new resource versions in memory.
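To make the ghosting idea concrete, here is a minimal, self-contained sketch of copy-on-write resource versioning in C. This is purely illustrative and is not how the Mali driver is actually implemented; the `Resource` type and `resource_modify()` function are invented for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative copy-on-write versioning; NOT the Mali driver's real code.
 * A "resource" is a blob of memory plus a count of GPU operations that
 * are still queued to read this version of it. */
typedef struct Resource {
    unsigned char *data;
    size_t         size;
    int            pending_reads;  /* queued GPU reads of this version */
} Resource;

/* Apply an update at [offset, offset+len). If the current version still
 * has pending GPU reads, ghost it: allocate a new version, copy the old
 * contents across, and apply the update to the copy, leaving the old data
 * intact for the in-flight draws. Returns the version that the API-level
 * resource now refers to. */
Resource *resource_modify(Resource *res, size_t offset,
                          const unsigned char *src, size_t len)
{
    if (res->pending_reads == 0) {
        /* No conflict: modify the existing memory in place. */
        memcpy(res->data + offset, src, len);
        return res;
    }

    /* Conflict: create a ghost (new version) via copy-on-write. */
    Resource *ghost = malloc(sizeof *ghost);
    ghost->size          = res->size;
    ghost->pending_reads = 0;
    ghost->data          = malloc(res->size);
    memcpy(ghost->data, res->data, res->size);  /* old contents ...     */
    memcpy(ghost->data + offset, src, len);     /* ... plus the update  */
    return ghost;  /* old version is freed once its pending reads retire */
}
```

A real driver additionally tracks when each old version's pending reads retire so its memory can be freed, and can skip the initial copy when the update replaces the whole resource.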
It should also be noted that resource ghosting isn't always possible; in particular when resources are imported from external sources using a memory sharing API such as UMP, Gralloc, dma_buf, etc. In these cases other drivers, such as cameras, video decoders, and image processors may be writing into these buffers and the Mali drivers have no way to know whether this is happening or not. In these cases we generally cannot apply copy-on-write mechanisms, so the driver tends to block and wait for pending dependencies to resolve. For most applications you don't have to worry about this, but if you are working with buffers sourced from other media accelerators this is one to watch out for.
Given that resource dependencies are a problem on all hardware rendering systems due to pipeline depth, it should come as no surprise that more recent versions of OpenGL ES come with some features which allow application developers to override the purely synchronous rendering illusion to get more fine control if it is needed.
The glMapBufferRange() function in OpenGL ES 3.0 allows application developers to map a buffer object into the application's CPU address space. The mapping can be created with the access flag GL_MAP_UNSYNCHRONIZED_BIT, which loosely translates as the "don't worry about resource dependencies, I know what I am doing" bit. When a buffer mapping is unsynchronized the driver does not attempt to enforce the synchronous rendering illusion; the application can modify areas of the buffer which are still referenced by pending rendering operations, and will therefore cause incorrect rendering for those operations if the buffer updates are made erroneously.
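As a sketch, an unsynchronized sub-range update might look like the following. This assumes the application can guarantee, at the engine level, that the mapped region is no longer referenced by any pending draw; the buffer name and sizes are placeholders.

```
glBindBuffer(GL_ARRAY_BUFFER, buffer);
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, 1024,
                             GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
// Write the new data through ptr; the driver performs no dependency
// checks, so this region must not still be in use by the GPU.
glUnmapBuffer(GL_ARRAY_BUFFER);
```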
In addition to the direct use of features such as GL_MAP_UNSYNCHRONIZED_BIT, many applications exploit their knowledge that resource usage is pipelined to achieve flexible rendering without causing excessive ghosting overheads.
Ghosting can be made less expensive by ensuring that volatile resources are separated out from the static resources, making the memory regions which need to be allocated and copied as small as possible. For example, ensuring that animated glyphs which are updated using glTexSubImage2D() are not sharing a texture atlas with static images which are never changed, or ensuring that models which are animated in software on the CPU (either via attribute or index update) are not in the same buffer as static models.
The overheads related to buffer updates can be reduced, and the number of ghosted copies minimized, by performing most of the resource updates in a single block (either one large update or multiple sequential sub-buffer/texture updates), ideally before any rendering to an FBO has occurred. Avoid interleaving resource updates with draw calls like this ...
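For illustration, a sketch of the interleaved pattern with placeholder calls:

```
glBufferSubData(...);  // Update region A
glDrawElements(...);   // Draw using region A
glBufferSubData(...);  // Update region B; pending read forces a ghost
glDrawElements(...);   // Draw using region B
```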
... unless you are able to use GL_MAP_UNSYNCHRONIZED_BIT. It is usually much more efficient to make the same set of updates like this:
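For illustration, the same updates batched ahead of the draws (again with placeholder calls):

```
glBufferSubData(...);  // Update region A
glBufferSubData(...);  // Update region B; still no pending reads, no ghost
glDrawElements(...);   // Draw using region A
glDrawElements(...);   // Draw using region B
```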
If the application wants to make performance more predictable and avoid the overheads of ghosting and memory reallocation in the driver, one technique it can apply is to explicitly create multiple copies of each volatile resource in the application, one for each frame of latency present in the rendering pipeline (typically 3 for a system such as Android). The resources are used in a round-robin sequence, so when the next modification of a resource occurs, the pending rendering using that resource should have completed. This means that the application's modifications can be committed directly to physical memory without needing special handling in the driver.
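The round-robin selection can be as simple as indexing an array of buffer object names by frame number. A minimal sketch, assuming a pipeline depth of three; the function name and the GL calls in the trailing comment are illustrative, not from the original post:

```c
/* Assumed pipeline depth: 3 frames (typical of an Android-style
 * triple-buffered display path). */
#define VOLATILE_COPIES 3u

/* Which copy of a volatile resource may this frame safely overwrite?
 * By the time the same index comes around again, the draws that read it
 * VOLATILE_COPIES frames ago should have retired. */
unsigned volatile_resource_index(unsigned frame_number)
{
    return frame_number % VOLATILE_COPIES;
}

/* In the render loop (sketch; vbo_copies[] holds 3 buffer object names):
 *   glBindBuffer(GL_ARRAY_BUFFER,
 *                vbo_copies[volatile_resource_index(frame)]);
 *   glBufferSubData(GL_ARRAY_BUFFER, 0, size, new_data); // no ghost needed
 */
```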
There is no easy way to determine the pipeline length of an application, but it can be empirically tested on a device by inserting a fence object by calling glFenceSync() after a draw call using a texture, and then polling that fence object by calling glClientWaitSync() with a timeout of zero just before making the modifications N frames later. If this wait returns GL_TIMEOUT_EXPIRED then the rendering is still pending and you need to add an additional resource version to the resource pool you are using.
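A sketch of that probe as a GL call sequence; resource_pool_grow() is a hypothetical application function standing in for whatever grows your resource pool:

```
// Frame F: immediately after the last draw call reading the volatile resource
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// Frame F+N: just before modifying that resource again
GLenum status = glClientWaitSync(fence, 0, 0);  // zero timeout: poll, don't block
if (status == GL_TIMEOUT_EXPIRED) {
    resource_pool_grow();  // draws still pending; add another resource copy
}
glDeleteSync(fence);
```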
Thanks to Sean for the good question, and I hope this answers it!
Next Blog In Series: Mali Performance 7: Accelerating 2D rendering using OpenGL ES
Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali GPUs even better.
Thanks for such great articles.
Regarding the synchronous rendering illusion:
Since eglSwapBuffers() is also asynchronous, how does the driver deal with glVertexAttribPointer(..., ptr) and glUniform*(..., ptr) updates, assuming we are dealing with client-side resource memory (a CPU pointer, no buffer object yet)?
- Frame N:
    glVertexAttribPointer(..., ptr_A)
    glUniform*(..., val_A)
    eglSwapBuffers()   --> asynchronous
- Modify ptr_B, val_B on the CPU
- Frame N+1:
    glVertexAttribPointer(..., ptr_B)
    glUniform*(..., val_B)
    eglSwapBuffers()
Q1) If the GPU is lagging behind, will frame N use ptr_B and val_B (the updated copies of the client resources)? Or are resource dependencies ensured, so that frame N uses the A copies and frame N+1 the B copies, probably by flushing the pipeline before processing the glVertexAttribPointer(..., ptr_B) call?
In Vulkan, which as you said gives app developers more power and functionality, how will the same scenario be handled if we don't use one buffer object per swap-chain image?
- Frame N:
    Acquire
    update_uniform_buffer(ubo)
    vkQueueSubmit(render_cmds)
    Present
- Frame N+1:
    Acquire
    update_uniform_buffer(ubo)   -> update the contents of the buffer
    vkQueueSubmit(render_cmds)
    Present
I guess the GPU could end up reading the updated data in frame N if the GPU is lagging behind, as in OpenGL ES.
Q2) Or does GLES handle this automatically through the resource dependency tracking you mentioned?
In Vulkan, however, this synchronization has to be done by the application, probably by introducing one resource (e.g. a buffer) per swap-chain image; similarly in GLES maybe by creating 3 VBOs for triple buffering. Is that correct?
Or maybe by using vkQueueSubmit() and vkWaitForFences() in update_uniform_buffer() to ensure a single resource can be reused. But I am not sure how GLES handles this scenario (Q1).
Please correct my understanding if wrong.
> Interesting to see the evolution: GLES 2 "Don't touch it" -> GLES 3 "Be careful" -> Vulkan "Have fun being the driver now!".
Yes indeed. Vulkan gives applications a lot of power and flexibility, but also almost full responsibility for how they drive the hardware.
This allows some much higher-level optimizations which are completely impossible in OpenGL ES, such as reusing the same physical memory for totally different resource types at different points in the frame pipeline, allowing applications to reduce their memory footprint quite significantly.
I think the quote "With great power comes great responsibility" sums it up quite nicely =)
> When testing the number of buffers to achieve ghost-less multibuffering, is it needed to ...
The pipeline depth in terms of frames is really a function of the amount of time it takes to render a frame in the application, rather than single textures and shaders. You'd want to take a representative whole frame render.
Interesting to see the evolution: GLES 2 "Don't touch it" -> GLES 3 "Be careful" -> Vulkan "Have fun being the driver now!".
When testing the number of buffers to generate with glFenceSync() and glClientWaitSync(), to achieve 'ghost-less' multibuffering, is it necessary to test with the biggest texture possible (4096x4096), or does any reasonable texture (e.g. 1024x1024) fit the test?
Also, I guess it should be tested with the most expensive shader program used by the application in order to make the test accurate?
> Also, I guess that allocating a big buffer, partition it in three and update only one part of it, instead of allocating three buffers is also a no-no?
If any part of the buffer is still referenced by an in-flight draw call then that will be expensive, because the driver will have to create a ghost, so yes, definitely one to avoid. If you are planning on using glMapBufferRange() with OpenGL ES 3.x then you could patch subregions of a buffer using GL_MAP_UNSYNCHRONIZED_BIT to disable the dependency checking (but you have to provide engine-level guarantees that the part you are patching is not still referenced by pending draws, or you will get rendering corruption).
It's worth noting that this latter behavior is effectively what Vulkan requires applications to manage; it heavily uses mapped memory buffers and application-managed resource dependencies.
Very interesting read. I see that there's mentions of using small buffers to avoid the cost of ghosting.
For a recent project of mine, I was thinking about allocating three big buffers and partitioning them accordingly for all dynamic resources, in order to minimize buffer recreation when more memory is needed.
However, if I understand correctly this could turn to be a big mistake due to potential ghosting issues.
When multiple buffering is used, is it better to allocate very small buffers and expand them when needed, like this :
- Bind buffer 1
- Upload first data
- Bind buffer 2
- Reallocate more space in buffer 2 because of the new data
- Upload the new data
- Bind buffer 3
- Upload new data
- Reallocate ...
Also, I guess that allocating a big buffer, partition it in three and update only one part of it, instead of allocating three buffers is also a no-no ?