This very nice desktop OpenGL feature was recently ratified as an OpenGL ES extension.
I would like to request support for this extension since it will improve performance of my application by a very large margin on Mali hardware.
It's at the Khronos registry here. https://www.khronos.org/registry/gles/extensions/EXT/EXT_draw_elements_base_vertex.txt
We can raise it with our product management team, but based on typical release cycles even if we implemented it immediately it's unlikely to see devices for ~12 months.
In the interests of solving your application issues before then: if it really makes that much difference to performance, you could just upload multiple index buffers, or, if you only render consistent sub-ranges, pre-offset the indices and then calculate the pointer offsets as part of attribute setting (as outlined in the extension doc). The extension is "nice to have", but it doesn't do anything the application can't do already - it just moves the effort of supporting it into the driver, and it isn't necessarily "free" or any cheaper than having the application do it (and by not relying on extensions, the application-side approach is also portable).
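For example, something along these lines - a rough, untested sketch where the vertex layout, attribute index, and names are purely illustrative:

```c
/* Workaround sketch: fold the "base vertex" into the attribute pointer
 * offset instead, so the same zero-based index data can address different
 * sub-ranges of one large VBO.  Layout and names are made up.            */
#include <GLES3/gl3.h>
#include <stddef.h>

void draw_sub_range(GLuint vbo, GLuint ibo,
                    GLsizei stride,        /* bytes per vertex            */
                    GLint   base_vertex,   /* first vertex of this range  */
                    GLsizei index_count,
                    size_t  index_offset)  /* byte offset into the IBO    */
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);

    /* Re-pointing the attribute does what basevertex would have done.   */
    size_t attr_offset = (size_t)base_vertex * (size_t)stride;
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, stride,
                          (const void *)attr_offset);

    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT,
                   (const void *)index_offset);
}
```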
Pete
The problem is that due to the nature of my application I can't do things by "best practices", so when I'm stuck with base OpenGL ES 3.0 I am /very/ limited in performance, especially on any hardware that does deferred rendering.
In a perfect world I would have both this extension and GL_ARB_buffer_storage, at which point my performance problems would be gone.
With this extension we are able to use the elements buffer as a ring buffer, updating portions of it as we go while making sure never to trample data the GPU is still using.
Due to the nature of our application we have many, many updates that we have to push through the elements buffer, and without this extension that path is very inefficient for us.
All this extension really changes is some semantics of setting up the attribute pointers, allowing multiple drawcalls to use the same attribute buffers with different offsets (i.e. you avoid the CPU load of the repeated attribute binding calls). It shouldn't change the actual rendering workload at all - can you explain why you think this helps performance? I'm probably missing something obvious.
EDIT: To add a few more details behind my question - on most deferred architectures there is a minimum static cost per drawcall and per render target in terms of dispatch overhead to the hardware. State setting, such as attribute pointers, is generally really inexpensive - so I'd be really surprised if it was a problem unless you also have a lot of drawcalls (and hence a lot of rebinding calls). If that is the case then I doubt the attribute setting is actually the cause of the slow performance - it's more likely down to the general drawcall cost - so I'm not entirely sure this extension will actually solve your problem.
Cheers, Pete
Do you know what your "inefficiency" actually is? It's still not clear to me which specific bottleneck your application is hitting (CPU load for binding, attribute setting, memory upload of index buffer data, etc.).
If the overhead is index upload for new indices via glBufferSubData, what is stopping you from using an existing "in memory" index buffer and providing appropriately modified data offsets to glVertexAttribPointer per drawcall?
All this extension really provides is a cleaner interface that avoids the need for blatant "hacky" offset mangling in the application, but the ability to set an arbitrary offset into a VBO is already there today, and shouldn't actually be that expensive.
My application is the Dolphin GameCube/Wii emulator. On desktop we are able to use features up to OpenGL 4.4, which we strongly recommend due to the performance increase that buffer_storage gives us.
We have supported OpenGL ES 3.0 for nearly two years now, ever since Intel gained support for the standard in early 2013.
The main issue we run into is that we are emulating a console GPU made by ArtX, and games tend to abuse this GPU heavily. It is a fixed-function pipeline GPU that is exceedingly flexible. Games have full access to the hardware, which allows them to interface directly with the GPU's registers, so they can switch state immediately and with exceedingly low overhead.
The main issue with desktop GPUs is that there isn't an efficient way to upload data to the GPU in OpenGL ES 3.x. Whenever the emulated GPU changes state we have to flush and push the new state to the GPU, since we are emulating the fixed-function GPU using shaders.
In particular, a lot of games tend to draw a handful of vertices, switch state, and draw some more. This causes the number of draw calls to grow very large; in particular we can call glBuffer{Sub,}Data over 5000 times in a single frame.
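To make that concrete, the per-flush pattern looks roughly like this (a simplified sketch, not our actual code; names and buffer handling are made up):

```c
/* Simplified sketch of the current path: every emulated state change
 * forces a flush that re-uploads vertex/index data with glBufferSubData
 * before a small draw, so a frame can hit thousands of these calls.     */
#include <GLES3/gl3.h>

void flush_pending_draw(GLuint vbo, GLuint ibo,
                        const void *verts,   GLsizeiptr vert_bytes,
                        const void *indices, GLsizeiptr index_bytes,
                        GLsizei index_count)
{
    /* The driver may stall or make a shadow copy here if the GPU is
     * still reading the previous contents of these buffers.             */
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, 0, vert_bytes, verts);

    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, 0, index_bytes, indices);

    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, 0);
}
```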
With base_vertex we can avoid a lot of driver synchronization on those memory updates by using the elements buffer as a ring buffer, which allows glMapBufferRange + the unsynchronized flag to be used. As long as we don't stomp on data the GPU is still using we are fine, and we lose the overhead that the driver has when calling glBuffer{Sub,}Data.
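Roughly like this - a minimal sketch assuming the EXT entry point (glDrawElementsBaseVertexEXT, typically loaded via eglGetProcAddress); the ring size, cursor handling, and fencing are simplified:

```c
/* Index ring-buffer sketch: append new indices with an unsynchronized
 * map, then draw with a per-call base vertex so the vertex attributes
 * never need re-pointing.  A real version fences before reusing space. */
#include <GLES3/gl3.h>
#include <GLES2/gl2ext.h>
#include <string.h>

#define IBO_RING_BYTES (1 << 20)   /* 1 MiB ring, size chosen arbitrarily */

static GLuint   ring_ibo;          /* created once with glBufferData      */
static GLintptr ring_cursor;       /* current write offset in bytes       */

void draw_from_ring(const GLushort *indices, GLsizei index_count,
                    GLint base_vertex)
{
    GLsizeiptr bytes = (GLsizeiptr)index_count * sizeof(GLushort);

    if (ring_cursor + bytes > IBO_RING_BYTES)
        ring_cursor = 0;           /* wrap; fence here in real code       */

    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ring_ibo);
    void *dst = glMapBufferRange(GL_ELEMENT_ARRAY_BUFFER, ring_cursor, bytes,
                                 GL_MAP_WRITE_BIT |
                                 GL_MAP_UNSYNCHRONIZED_BIT);
    memcpy(dst, indices, (size_t)bytes);
    glUnmapBuffer(GL_ELEMENT_ARRAY_BUFFER);

    glDrawElementsBaseVertexEXT(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT,
                                (const void *)ring_cursor, base_vertex);
    ring_cursor += bytes;
}
```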
If we have buffer_storage support we can take this even further by mapping all of our buffers (elements, uniform buffer, etc.) at initialization, using them all as ring buffers, and storing multiple frames of data as we go along. This removes the map/unmap API call overhead that glMapBufferRange has, which has a larger performance impact than one would expect due to how often we are required to update our buffers. In particular, on desktop AMD hardware, using glMapBufferRange versus buffer_storage results in 52 FPS compared to 80 FPS respectively.
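For reference, the buffer_storage path looks roughly like this on desktop (a sketch assuming GL_ARB_buffer_storage and a loader such as GLEW; the ring size and fencing are simplified):

```c
/* Persistent-mapped ring sketch: immutable storage that stays mapped for
 * the application's lifetime, so per-update map/unmap calls disappear.  */
#include <GL/glew.h>               /* any loader exposing GL 4.4 works    */

#define RING_BYTES (4u << 20)      /* 4 MiB ring, size chosen arbitrarily */

static GLuint ring_buf;
static void  *ring_ptr;            /* stays mapped; written at a cursor   */

void create_persistent_ring(void)
{
    const GLbitfield flags = GL_MAP_WRITE_BIT |
                             GL_MAP_PERSISTENT_BIT |
                             GL_MAP_COHERENT_BIT;

    glGenBuffers(1, &ring_buf);
    glBindBuffer(GL_ARRAY_BUFFER, ring_buf);

    /* Immutable storage: the GPU can read it while it remains mapped.   */
    glBufferStorage(GL_ARRAY_BUFFER, RING_BYTES, NULL, flags);
    ring_ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, RING_BYTES, flags);

    /* From here on, new data is memcpy'd to ring_ptr at the write cursor;
     * glFenceSync/glClientWaitSync guard regions the GPU still reads.    */
}
```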
If the only thing that improves with base_vertex support is lower CPU utilization, that is still definitely a win, because emulating a GPU on mobile ARM hardware is quite CPU-heavy, and anything that lowers CPU usage to improve battery life and speed is great to have.
in particular we can call glBuffer{Sub,}Data over 5000 times in a single frame
Thanks - that explains it =)
Yes - which we can reduce with base_vertex, since at that point we can use glMapBufferRange + the unsynchronized flag, and remove entirely if buffer_storage is supported.
[Edit]
For a direct comparison between three mobile vendors I have a video showcasing three devices.
The ODROID-XU3 with an Exynos 5420 + Mali-T628MP6.
The Nexus 5 with an Adreno 330.
The Nexus 9 with the 64-bit Denver SoC and a Kepler GPU.
Dolphin Mobile Device Comparison - Wind Waker - YouTube
Some particular information about each device's limitations in the video:
The ODROID-XU3 is completely bottlenecked on the GPU emulation thread due to the high overhead of the Mali driver.
The Nexus 5 is CPU emulation thread bound, with a minor bottleneck in the GPU emulation thread due to also not supporting these extensions.
The Nexus 9 is completely CPU emulation bound at this point. In particular, on the Nexus 9 the application is working around Google crippling the drivers and only enabling GLES 3.1 + AEP.
The function pointers for desktop GL functions are still available in their driver, which allows hackish use of base_vertex and buffer_storage even though they aren't exposed as extensions. If we couldn't hack around that limitation, performance would take a hit there as well. By hacking around it, the Nexus 9 gets about the same speed as the Nvidia Shield Tablet, which has full OpenGL 4.5.
[/Edit]