glMapBufferRange and glUnmapBuffer performance on the Mali-T880

Hello all,

I'm currently using glMapBufferRange to update a triple-buffered UBO for instanced rendering, but I'm noticing that calling glUnmapBuffer takes ~0.5 ms of CPU time, despite calling glMapBufferRange with GL_MAP_UNSYNCHRONIZED_BIT set and using fences. Is it normal for the glUnmapBuffer call to take this long?

In addition, I found that setting GL_MAP_INVALIDATE_RANGE_BIT makes the glMapBufferRange call spike to 10-20 ms on the CPU, which is very strange because I would have expected it to improve performance. I also verified in MGD (Mali Graphics Debugger) that I wasn't remapping a previously invalidated range. Is it also normal for this bit to cause such drastic slowdowns?
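
For readers following along, the setup described in the question could be sketched roughly as below. This is only an illustration under assumptions, not the poster's actual code: the helper names (`ubo_align_up`, `ubo_region_index`, `ubo_region_offset`) are hypothetical, the UBO is assumed to be one buffer split into three equally sized regions, and the GL calls appear only in a comment because they need a live OpenGL ES 3.0 context.

```c
#include <assert.h>
#include <stddef.h>

#define UBO_REGION_COUNT 3u /* triple buffering: one region per in-flight frame */

/* Round size up to the next multiple of alignment (alignment > 0).
 * In a real app the alignment would come from querying
 * GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT with glGetIntegerv. */
static size_t ubo_align_up(size_t size, size_t alignment)
{
    assert(alignment > 0);
    return ((size + alignment - 1) / alignment) * alignment;
}

/* Index of the region that a given frame writes to. */
static unsigned ubo_region_index(unsigned long frame)
{
    return (unsigned)(frame % UBO_REGION_COUNT);
}

/* Byte offset of that region inside the single backing buffer. */
static size_t ubo_region_offset(unsigned long frame, size_t region_size)
{
    return (size_t)ubo_region_index(frame) * region_size;
}

/*
 * Assumed per-frame use (not compiled here; requires a GL context):
 *
 *   unsigned r   = ubo_region_index(frame);
 *   size_t   off = ubo_region_offset(frame, region_size);
 *
 *   // Wait until the GPU has finished the draw that last read region r.
 *   if (fence[r]) {
 *       glClientWaitSync(fence[r], GL_SYNC_FLUSH_COMMANDS_BIT, timeout_ns);
 *       glDeleteSync(fence[r]);
 *   }
 *
 *   void *dst = glMapBufferRange(GL_UNIFORM_BUFFER, off, region_size,
 *                                GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
 *   memcpy(dst, instance_data, data_size);
 *   glUnmapBuffer(GL_UNIFORM_BUFFER);
 *
 *   glBindBufferRange(GL_UNIFORM_BUFFER, binding, ubo, off, region_size);
 *   // ... instanced draw call ...
 *   fence[r] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
 */
```

Because the fence guarantees the GPU is done with the region being rewritten, the application can take over synchronization itself, which is what passing GL_MAP_UNSYNCHRONIZED_BIT relies on.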

+1 Peter Harris over 7 years ago

cedega said:
Is it normal for the glUnmapBuffer call to take this long?

Yes, it can be quite slow depending on the platform and on whether we actually have to release the virtual address range. If I remember correctly, 32-bit (ARMv7) applications will unmap quite aggressively, whereas 64-bit (ARMv8) applications have enough VA space that we can leave things mapped and therefore avoid the need for CPU-side cache maintenance and MMU updates.

cedega said:
I found that setting the GL_MAP_INVALIDATE_RANGE_BIT spikes the glMapBufferRange

Yes, this is a known issue in our drivers. There was some ambiguity in the specification about which bit takes precedence, the invalidate or the unsynchronized, so we currently play it safe. The ambiguity has now been clarified by the standards group (unsynchronized should take precedence), but we've not yet released a driver with the fix implemented. In general on Mali you should be able to safely drop the INVALIDATE bit; we're a unified memory architecture, so there is, for example, no need to copy from the graphics card into CPU-visible memory.

Cheers,
Pete
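
As a rough illustration of the advice above (drop the invalidate bit when the mapping is already unsynchronized), a small flag filter might look like the following. The function name `mali_workaround_flags` is hypothetical and not part of any driver or API. The bit values mirror the standard OpenGL ES 3.0 header definitions and are only defined here if the real header has not been included.

```c
#include <assert.h>

/* Standard OpenGL ES 3.0 access-flag values, guarded so that the real
 * GL headers take priority when they are included first. */
#ifndef GL_MAP_WRITE_BIT
#define GL_MAP_WRITE_BIT            0x0002
#endif
#ifndef GL_MAP_INVALIDATE_RANGE_BIT
#define GL_MAP_INVALIDATE_RANGE_BIT 0x0004
#endif
#ifndef GL_MAP_UNSYNCHRONIZED_BIT
#define GL_MAP_UNSYNCHRONIZED_BIT   0x0020
#endif

/* Hypothetical helper: if the caller already asks for an unsynchronized
 * mapping, strip GL_MAP_INVALIDATE_RANGE_BIT so the affected drivers do
 * not take their defensive slow path. Other flag combinations are
 * returned unchanged. */
static unsigned int mali_workaround_flags(unsigned int requested)
{
    if (requested & GL_MAP_UNSYNCHRONIZED_BIT) {
        requested &= ~(unsigned int)GL_MAP_INVALIDATE_RANGE_BIT;
    }
    return requested;
}
```

The result would then be passed as the access argument of glMapBufferRange. Whether this is worth wrapping in a helper, rather than simply never setting the invalidate bit, depends on how many code paths share the mapping logic.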

0 cedega over 7 years ago in reply to Peter Harris

Thanks, that clears things up perfectly!

0 cedega over 7 years ago in reply to Peter Harris

Actually, out of curiosity, why is there ambiguity between the invalidate range bit and the unsynchronized bit? Or rather, how does that ambiguity lead to such a long CPU synchronization? I think I may be misunderstanding the driver implications of the invalidate bit, so some clarification would be incredibly useful.

0 Peter Harris over 7 years ago in reply to cedega

On an immediate-mode renderer with separate graphics memory, applications might have expected "MAP_INVALIDATE_RANGE_BIT | MAP_WRITE_BIT" to create a new buffer chunk that is later patched into the real underlying buffer before rendering. Using UNSYNCHRONIZED_BIT to overwrite the contents of the buffer might then, from the application's point of view, corrupt rendering if it expects a patch to be created and applied later. The current drivers are defensive to ensure correct behavior in this scenario and trigger the creation of a resource ghost (see https://community.arm.com/graphics/b/blog/posts/mali-performance-6-efficiently-updating-dynamic-resources), so a partial buffer mapping causes a full copy of the underlying buffer to be taken (minus the invalidated region, of course).

Khronos has now clarified that this is not expected behavior and applications relying on this would be out of spec, so we should be able to patch the buffer in place without creating a ghost. This change is planned, just not available yet.
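
To get a feel for why the ghosting path described above is expensive, here is a back-of-the-envelope model of how many bytes such a copy would touch. It is a simplification based only on the description in this thread (full buffer minus the invalidated region), not on the driver's actual implementation, and `ghost_copy_bytes` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: a ghost copies the whole underlying buffer except
 * the range that the application declared invalid. */
static size_t ghost_copy_bytes(size_t buffer_size, size_t invalidated_length)
{
    assert(invalidated_length <= buffer_size);
    return buffer_size - invalidated_length;
}
```

Under this model, a triple-buffered UBO that maps one of its three regions with the invalidate bit would copy the other two thirds of the buffer on every map, so the cost grows with the total buffer size rather than with the size of the update.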