Hello all,
I'm currently using glMapBufferRange to update a triple-buffered UBO for instanced rendering, but I'm noticing that calling glUnmapBuffer takes ~0.5ms of CPU time, despite calling glMapBufferRange with GL_MAP_UNSYNCHRONIZED_BIT set and using fences. Is it normal for the glUnmapBuffer call to take this long?
In addition, I found that setting the GL_MAP_INVALIDATE_RANGE_BIT spikes the glMapBufferRange call to 10-20ms on the CPU, which is very strange because I would have expected it to improve performance. I also verified in MGD that I wasn't remapping a previously-invalidated range. Is it also normal for this bit to cause such drastic slowdowns?
cedega said: Is it normal for the glUnmapBuffer call to take this long?
Yes, it can be quite slow depending on the platform and whether we actually have to release the virtual address range. If I remember correctly, 32-bit (ARMv7) applications will unmap quite aggressively, whereas 64-bit (ARMv8) applications have enough VA space that we can leave things mapped and therefore avoid the need for CPU-side cache maintenance and MMU updates.
cedega said: I found that setting the GL_MAP_INVALIDATE_RANGE_BIT spikes the glMapBufferRange
Yes, this is a known issue in our drivers. There was some ambiguity in the specification about which bit takes precedence, the invalidate or the unsynchronized, so we currently play it safe. The ambiguity has now been clarified in the standards group (unsynchronized should take precedence), but we've not yet released a driver with the fix implemented. In general on Mali you should be able to safely drop the INVALIDATE; we're a unified memory architecture driver, so there is, for example, no need to copy from the graphics card into CPU-visible memory.
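As a rough sketch of what dropping the INVALIDATE bit looks like for a triple-buffered UBO: the helper names (align_up, region_offset, map_access_flags) and REGION_COUNT are made up for illustration and are not part of any driver or GL API. The GL_MAP_* values are the ones from the GLES3 headers and are redefined here only so the snippet compiles without a GL context; the actual GL call sequence is shown in the trailing comment and assumes the app manages one fence per region itself.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as in <GLES3/gl3.h>; defined locally only so this sketch
 * builds without GL headers. */
#ifndef GL_MAP_WRITE_BIT
#define GL_MAP_WRITE_BIT            0x0002
#define GL_MAP_INVALIDATE_RANGE_BIT 0x0004
#define GL_MAP_UNSYNCHRONIZED_BIT   0x0020
#endif

#define REGION_COUNT 3u /* hypothetical: triple buffering */

/* Round size up to a power-of-two alignment, e.g. the value queried
 * from GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. */
static size_t align_up(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Byte offset of the region that a given frame writes into. */
static size_t region_offset(unsigned frame_index, size_t region_size,
                            size_t alignment)
{
    return (size_t)(frame_index % REGION_COUNT) *
           align_up(region_size, alignment);
}

/* Access flags with INVALIDATE_RANGE deliberately left out. */
static unsigned map_access_flags(void)
{
    return GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT;
}

/* Per-frame use with a live context (not compiled here):
 *
 *   glClientWaitSync(fence[i], GL_SYNC_FLUSH_COMMANDS_BIT, timeout);
 *   void *dst = glMapBufferRange(GL_UNIFORM_BUFFER,
 *                                region_offset(frame, size, align),
 *                                size, map_access_flags());
 *   memcpy(dst, src, size);
 *   glUnmapBuffer(GL_UNIFORM_BUFFER);
 *   ... draw calls ...
 *   fence[i] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
 */
```

Because the unsynchronized map skips the driver's own hazard tracking, the per-region fence is what guarantees the GPU has finished reading a region before the CPU overwrites it.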
Cheers, Pete
Thanks, that clears things up perfectly!
Actually, out of curiosity, why is there ambiguity between the invalidate range bit and the unsynchronized bit? Or rather, how does that ambiguity lead to such a long CPU synchronization? I think I may be misunderstanding the driver implications of the invalidate bit, so some clarification would be incredibly useful.
On an immediate mode renderer with separate graphics memory, applications might have expected "MAP_INVALIDATE_RANGE_BIT | MAP_WRITE_BIT" behavior to create a new buffer chunk that is later patched into the real underlying buffer before rendering. Using UNSYNCHRONIZED_BIT to overwrite the contents of the buffer might then, from the app's point of view, corrupt rendering if it expects a patch to be created and applied later. The current drivers are defensive to ensure correct behavior in this scenario and trigger a resource ghost to be created (see https://community.arm.com/graphics/b/blog/posts/mali-performance-6-efficiently-updating-dynamic-resources), so a partial buffer mapping causes a full copy of the underlying buffer to be taken (minus the invalidated region, of course).
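To put a rough number on that copy: the sketch below just encodes "whole buffer minus the invalidated region" as arithmetic. The function name ghost_copy_bytes and the sizes in the comment are hypothetical, and a real driver may well copy at a different granularity.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes that must be preserved (copied) into a ghost when only part of
 * the buffer is invalidated: everything outside the mapped range.
 * Illustrative model only, not the driver's actual accounting. */
static size_t ghost_copy_bytes(size_t buffer_size, size_t invalidated_size)
{
    assert(invalidated_size <= buffer_size);
    return buffer_size - invalidated_size;
}

/* Example: a triple-buffered UBO with three 64 KiB regions, mapping one
 * region with INVALIDATE_RANGE, would need
 * ghost_copy_bytes(3 * 65536, 65536) bytes copied on each map. */
```

Under this model the cost scales with the size of the whole buffer rather than the mapped range, which would be consistent with the map call getting slower as the UBO grows.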
Khronos has now clarified that this is not expected behavior and applications relying on this would be out of spec, so we should be able to patch the buffer in place without creating a ghost. This change is planned, just not available yet.