
Do we need to repack our vertex buffers for Mali-G76 to avoid VK_DEVICE_LOST?

Numerous devices (e.g. Galaxy S10e) using Mali-G76 (Bifrost 2nd gen) are producing a VK_DEVICE_LOST error when rendering 250K triangles or more. I read about the 180MB driver limit on Mali systems, and how the driver simply hands a VK_DEVICE_LOST error back to the developer, who then has to split render passes. We don't have this issue with Adreno and other Android devices. iOS also has a parameter buffer, but flushes it behind the scenes, so we've never hit any issues there either.

community.arm.com/.../memory-limits-with-vulkan-on-mali-gpus

This device lost error happens when I turn on terrain, or turn off culling on the terrain. The jump in triangle count from 200K tris, which render fine, to 250K tris is when Vulkan returns VK_DEVICE_LOST, preceded by a message about "QueueSignalReleaseImageANDROID failed: -4". Looking this up in the Vulkan sources indicates it is tied to the framebuffer loss, so it may just be the first symptom of the device loss.

I don't have a lot to go on, and validation seems to crash the driver with an unknown symbol. I was able to fix a few validation errors using other non-Mali devices, but this code has mostly been working up until the high polycounts are hit. This is how we draw the terrain:

1. Chunk up terrain into index chunks that represent spatially close triangles.  These can be culled.

2. Copy out indices for each of the specific materials in new chunks (these are a subset of the indices in the original chunk).  LODs work the same.

3. Draw each visible chunk with the vkCmdDrawIndexedIndirect calls that correspond to a given material. Disabling this optimization does not prevent the crash.

I read the Mali guide and there's not much to go on there about organizing vb or ib data. In general, iOS doesn't even recommend anything like repacking. Pete Harris had mentioned that Bifrost copies the entire min/maxIndex range of vertices, and Valhall copies only the vertices of visible (not backfacing) triangles. So Valhall gets around 50% more out of the same parameter buffer if half the triangles are backfacing.

With things moving towards mesh shaders and meshlets like in UE5, I was considering repacking/reordering/splitting up our vertex buffers so that each chunk's indices are mostly an incrementing sequence and the range is as tight as possible. If the chunks are small enough, even 8-bit indices might suffice. But in step 3, we may pass say 100 of 200 index chunks to the driver that reference a single vb. I understand that within one index range (indexStart, indexCount) all verts are transformed, but if those 100 index chunks reference half the buffer, will only that half be allocated to the parameter buffer? LODs could be packed smallest to largest by appending the unique vertices to the end from the larger LODs.
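To make the repacking idea concrete, here is a minimal CPU-side sketch of it, not our actual pipeline code. The `Vertex` layout and the `PackedChunk`, `repackChunk` and `minIndexBytes` names are made up for illustration. It copies only the vertices a chunk references into a tight array, in first-use order, so the indices come out as a mostly incrementing sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical vertex layout; the real terrain vertex format is not shown here.
struct Vertex {
    float px, py, pz;
};

// One repacked chunk: its own tight vertex array plus chunk-local indices.
struct PackedChunk {
    std::vector<Vertex> vertices;        // only the vertices this chunk references
    std::vector<std::uint32_t> indices;  // values in [0, vertices.size())
};

// Copies the vertices referenced by chunkIndices into a compact array, in
// first-use order, and rewrites the indices to point into that array.
PackedChunk repackChunk(const std::vector<Vertex>& sharedVertices,
                        const std::vector<std::uint32_t>& chunkIndices)
{
    PackedChunk out;
    std::unordered_map<std::uint32_t, std::uint32_t> remap;  // old index -> new index
    out.indices.reserve(chunkIndices.size());
    for (std::uint32_t oldIndex : chunkIndices) {
        assert(oldIndex < sharedVertices.size());
        auto it = remap.find(oldIndex);
        if (it == remap.end()) {
            const std::uint32_t newIndex =
                static_cast<std::uint32_t>(out.vertices.size());
            it = remap.emplace(oldIndex, newIndex).first;
            out.vertices.push_back(sharedVertices[oldIndex]);
        }
        out.indices.push_back(it->second);
    }
    return out;
}

// Smallest index width in bytes that can address the chunk's vertices.
// The all-ones value is kept free because it is the primitive-restart index
// when restart is enabled. 1-byte indices need VK_EXT_index_type_uint8.
int minIndexBytes(const PackedChunk& chunk)
{
    const std::size_t n = chunk.vertices.size();
    if (n <= 0xFFu) return 1;
    if (n <= 0xFFFFu) return 2;
    return 4;
}
```

The cost is that a vertex shared by several chunks or materials gets duplicated into each of them, which is the trade being weighed against the tighter per-draw index range.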

  • Also in this case the terrain varyings are 30-44 bytes, the terrain mesh is 100K verts max, and there are 8 unique materials that each generate their own index lists. We can reduce the varying size using half. Even if the materials sparsely reference the entire range of vertex indices, that's 100K x 32 x 8 ≈ 26MB (about 35MB at 44 bytes). That's still far under the 180MB limit. I still need to look at the layout of our non-terrain models and LODs, but I haven't delved deep into that system yet.


    If I leave terrain off, then I can push 300K triangles.  If I turn terrain on, then the device is lost when enough terrain chunks become visible.  But there is push for more and more content and detail.

  • Hi Alec, 

    > but if those 100 index chunks reference half the buffer, will only that half be allocated to the parameter buffer?

    Yes, the min/max range used for memory allocation is determined on a per-draw basis. Indexing a sub-range of a buffer, not starting at zero, is perfectly acceptable and doesn't incur any overhead. Only the per-draw min/max index is used for memory allocation purposes.

    > LODs could be packed smallest to largest by appending the unique vertices to the end from the larger LODs.

    Yes, dedicated vertex sub-ranges for each LOD is definitely the most efficient way to do this for Mali.

    > That's still far under the 180MB limit.

    Agreed, you can store a LOT of vertex data in 180MB. If you only have 250K tris after any instancing I really wouldn't expect you to be tripping over the 180MB limit (i.e. you can store 180 bytes per vertex for a million verts in that). I wonder if you have another problem lurking which can trigger a device lost such as an out-of-bounds buffer access, or something of that type. 

    One handy thing to rule that out is running with the robustBufferAccess feature enabled on the device. This isn't free, and has a runtime overhead, but it adds DX-style buffer clamping to limit buffer accesses to the extent of the buffer region. If the DEVICE_LOST vanishes with that enabled, then start looking for an out-of-bounds buffer access.

    > Validation seems to crash the driver with an unknown symbol.

    Any more info on this one? Not heard of this one before.

    Cheers, 
    Pete
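As a rough way to check a scene against the per-draw min/max rule described above, the sketch below sums each draw's index span times a per-vertex byte count. It assumes, as the thread describes for Bifrost, that every vertex inside a draw's [min, max] span is stored. `DrawRange`, `vertexSpan` and `estimateVaryingBytes` are hypothetical names, and the result is an upper-bound estimate, not what the driver actually allocates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A draw's slice of a shared index buffer (hypothetical struct for illustration).
struct DrawRange {
    std::size_t firstIndex;
    std::size_t indexCount;
};

// Number of vertices in the inclusive [min, max] index span of one draw.
std::size_t vertexSpan(const std::vector<std::uint32_t>& indices,
                       const DrawRange& draw)
{
    assert(draw.indexCount > 0);
    assert(draw.firstIndex + draw.indexCount <= indices.size());
    auto first = indices.begin() + static_cast<std::ptrdiff_t>(draw.firstIndex);
    auto last = first + static_cast<std::ptrdiff_t>(draw.indexCount);
    auto mm = std::minmax_element(first, last);
    return static_cast<std::size_t>(*mm.second - *mm.first) + 1;
}

// Assumed upper bound on varying storage: every vertex in each draw's span
// is counted once per draw, at bytesPerVertex bytes.
std::uint64_t estimateVaryingBytes(const std::vector<std::uint32_t>& indices,
                                   const std::vector<DrawRange>& draws,
                                   std::size_t bytesPerVertex)
{
    std::uint64_t total = 0;
    for (const DrawRange& d : draws) {
        total += static_cast<std::uint64_t>(vertexSpan(indices, d)) * bytesPerVertex;
    }
    return total;
}
```

For the worst case discussed above, 8 material draws that each span all 100K terrain verts at 32 bytes come to 8 x 100,000 x 32 = 25,600,000 bytes, roughly 26MB.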

  • As always Pete, you rock.   I didn't know about the robustBufferAccess (and I guess there's a robustBufferAccess2).  I did try setting those in the VkDeviceCreateInfo struct.  But that didn't seem to flag anything, and the loss continued.  I also ran through all the terrain indices, and they all seem fine.  I also wondered if this may be related to our use of uint32 indices on some of these larger terrains, but I have cases with uint16 indices that also cause loss.  I really need to switch to an indexOffset to keep them uint16.  I'll do some more sleuthing next week, and let you know what I find. I'll also see if I can start the vertex repacking in our pipeline.
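For the uint16 idea, a minimal sketch follows. It assumes the "indexOffset" mentioned above corresponds to the `vertexOffset` parameter of vkCmdDrawIndexed, which is added to every index before the vertex fetch. `Rebased16` and `rebaseTo16` are hypothetical names, and this is only the CPU-side index rewrite.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Rebased16 {
    std::vector<std::uint16_t> indices;  // original index minus baseVertex
    std::int32_t baseVertex = 0;         // intended for the draw's vertexOffset
};

// Rebases a chunk's 32-bit indices to 16 bits around the chunk's minimum index.
// Returns false if the chunk is empty or its index span does not fit.
bool rebaseTo16(const std::vector<std::uint32_t>& indices, Rebased16& out)
{
    if (indices.empty()) return false;
    auto mm = std::minmax_element(indices.begin(), indices.end());
    const std::uint32_t lo = *mm.first;
    const std::uint32_t hi = *mm.second;
    // Keep 0xFFFF free: it is the restart index for 16-bit indices when
    // primitive restart is enabled.
    if (hi - lo >= 0xFFFFu) return false;
    // vertexOffset is a signed 32-bit value.
    if (lo > 0x7FFFFFFFu) return false;

    out.baseVertex = static_cast<std::int32_t>(lo);
    out.indices.clear();
    out.indices.reserve(indices.size());
    for (std::uint32_t i : indices) {
        out.indices.push_back(static_cast<std::uint16_t>(i - lo));
    }
    return true;
}
```

A chunk whose vertices are spread across the whole buffer will fail the span check, which is another reason to keep each chunk's vertex range tight.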

  • I could use some info on allocation with DrawIndirect calls vs. regular draws. I have DrawIndirect disabled for now.  

    No real info to go on.  robustBufferAccess didn't catch anything.  And nothing is reported when the device is lost even with validation enabled.  Validation seems to seg-fault when I use debug markers/groups around the pass, so I've disabled markers when validation is on.

    I can strip the two terrain shaders (and two variants of those) down to not using any varyings.  If I only write out white from the fragment shader, then the device isn't lost.  I can render 450K polys no problem on the same Mali device, though typically it's around 200K total polys.  I removed all half usage, and that didn't work either.   I also switched from uint32 to uint16 indices, but again that didn't help.  The varying memory would be tied to the vertices anyways.  I haven't yet repacked the vertices.

    D/mali.instrumentation.graph.work: key already added <- see a ton of these every frame

    This is right before the DEVICE_LOST
    E/vulkan: QueueSignalReleaseImageANDROID failed: -4


    E/CRASH: ASSERT! Foo.cpp (2888): Renderer Crash, Error: ERROR_DEVICE_LOST, exiting app..
    Run 'make callstack' to see the symbolicated crash callstack

    Also Mali seems to require VkPhysicalDeviceFloat16Int8FeaturesKHR setup for half shaders, where other platforms don't have this requirement. I don't know if that means the other platforms are not running the half code for the shaders.

  • So I finished the work on repacking the vertices and reindexing.  This doubled our terrain vert size.  I'm no longer seeing VK_DEVICE_LOST anymore from the Mali-G76 driver.   The perf dropped from 12ms before to 15ms now.  That doesn't make a lot of sense since it's all repacked.  Seems like this Mali driver is bailing at 18MB of temp varyings and not 180MB.

    I did find your excellent Mali guide Pete, and in there you had mentioned repacking.   So now the question is whether Mali with DrawIndirect treats the min/max of all the draws in the buffer, or if it only copies the varying data for the specific ranges in each indirect draw block.

  • > I'm no longer seeing VK_DEVICE_LOST anymore from the Mali-G76 driver.

    *cheer*. Something definitely seems odd here, so if you're able to share a minimal repro of the failing case it would be appreciated, but I'll ask the driver team if they have any ideas why the limit is looking so low.

    > is whether Mali with DrawIndirect treats the min/max of all the draws in the buffer

    Only the latest Mali-G710 and friends support multi-draw indirect, so shipping Mali GPUs will only support single indirect draws.  I'm therefore not entirely sure what you mean by "all the draws in the buffer", as each draw indirect parameters buffer can only ever reference a single draw.  

    Can you throw up some pseudo code for what your draw dispatch looks like?

    > D/mali.instrumentation.graph.work: key already added <- see a ton of these every frame

    This impacts some Samsung driver builds - hopefully a fix should be getting rolled out in the next OTA driver update.

    Cheers,  Pete

  • > Only the latest Mali-G710 and friends support multi-draw indirect, so shipping Mali GPUs will only support single indirect draws. 

    I see. I had been using DrawIndirect on an old Adreno part, and thought Mali had that feature. So yes, this isn't going through DrawIndirect at all on Mali, just the standard vkCmdDrawIndexed. That's good; I guess it should only reference the index range specified now.

    A minimal repro case is just to run our retail game.  On the more complex levels, it hits it right away.  So one would have to play enough to unlock.  I think we'll land the repack in the near future though to bypass the device lost.

    I'm wondering if the perf loss I'm seeing is because before we had the same vertices shared across all materials. With sharing, the position pass on Mali would transform about half as much data, since making all of those verts/indices unique is about a 2x increase.

  • > I see I had been using DrawIndirect on an old Adreno part, and thought Mali had that feature.

    We have support for DrawIndirect, but drawCount must be a maximum of 1 on all pre-Mali-G710 family hardware.

  • So I just did an interesting test here. Draw 1 new chunk of terrain each frame until the Mali driver dies. This is independent of culling, but my viewpoint may or may not include the triangles. I also have the vertex/index repack disabled for this test; enabling it somehow magically avoids this problem, but doubles our vertex count. Is there a bug in vertex dedupe across index ranges? When I hit around 500 of 1500 chunks, that's when the driver fails with a flurry of these same errors and then a VK_DEVICE_LOST. This could obviously be a race condition in the way we recycle our command buffers, but I wanted to share the errors in case anyone has any insight.

    Even if I bump up our max command buffer count to 2x what it is now, this error still occurs at roughly the same chunk count.  This last time it was at 458.   I think what's happening is that we or the driver are somehow returning an uncompleted command buffer back to our pool when enough draw calls/commands are submitted.  This only happens with our terrain, not other draws.   

    I/TerrainBarn: chunkCounter 465
    D/mali.instrumentation.graph.work: key already added
    D/mali.instrumentation.graph.work: key already added

    I/TerrainBarn: chunkCounter 466
    D/mali.instrumentation.graph.work: key already added
    D/mali.instrumentation.graph.work: key already added

    I/VALIDATION: VUID-vkBeginCommandBuffer-commandBuffer-00049(ERROR / SPEC): msgNum: 0 - Calling vkBeginCommandBuffer() on active VkCommandBuffer 0x7db4402d80[] before it has completed. You must check command buffer fence before this call. The Vulkan spec states: commandBuffer must not be in the recording or pending state. (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkBeginCommandBuffer-commandBuffer-00049)      [0] 0x7db4411a20, type: 6, name: NULL

    I/VALIDATION: VUID-vkFreeDescriptorSets-pDescriptorSets-00309(ERROR / SPEC): msgNum: 0 - vkUpdateDescriptorSets() failed write update validation for VkDescriptorSet 0x39a[] with error: Cannot call vkUpdateDescriptorSets() to perform write update on VkDescriptorSet VkDescriptorSet 0x39a[] allocated with VkDescriptorSetLayout VkDescriptorSetLayout 0x38d[] that is in use by a command buffer. The Vulkan spec states: All submitted commands that refer to any element of pDescriptorSets must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkFreeDescriptorSets-pDescriptorSets-00309)      [0] 0x4fc, type: 23, name: NULL

    I/VALIDATION: VUID-vkResetFences-pFences-01123(ERROR / SPEC): msgNum: 0 - VkFence 0xc7[] is in use. The Vulkan spec states: Each element of pFences must not be currently associated with any queue command that has not yet completed execution on that queue (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetFences-pFences-01123)
        Objects: 1
          [0] 0xc7, type: 7, name: NULL

    E/vulkan: QueueSignalReleaseImageANDROID failed: -4
    E/CRASH: ASSERT! VulkanRenderer.cpp (3003): Renderer Crash, Error: ERROR_DEVICE_LOST, exiting app..
      Run 'make callstack' to see the symbolicated crash callstack
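The validation messages above all describe objects being reused while still pending, so one thing worth ruling out is the recycling logic itself. The sketch below is not the renderer from this thread. It is a Vulkan-free model of a command buffer ring in which a slot is only handed out for re-recording once its fence has signaled. `Slot`, `CommandBufferRing` and the `fenceSignaled` callback are hypothetical; in real code the callback would wrap something like vkGetFenceStatus on the slot's fence.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Stand-in for a command buffer plus its fence. inFlight marks a slot that
// has been submitted and may still be executing on the GPU.
struct Slot {
    int id;
    bool inFlight;
};

class CommandBufferRing {
public:
    // fenceSignaled(id) is a hypothetical callback that reports whether the
    // GPU has finished the work last submitted from slot id.
    CommandBufferRing(std::size_t count, std::function<bool(int)> fenceSignaled)
        : fenceSignaled_(std::move(fenceSignaled))
    {
        for (std::size_t i = 0; i < count; ++i) {
            slots_.push_back(Slot{static_cast<int>(i), false});
        }
    }

    // Returns a slot id that is safe to re-record, or -1 if every slot is
    // still pending. On -1 the caller should wait on a fence, not reuse a slot.
    int acquire()
    {
        for (std::size_t n = 0; n < slots_.size(); ++n) {
            Slot& s = slots_[(next_ + n) % slots_.size()];
            if (!s.inFlight || fenceSignaled_(s.id)) {
                s.inFlight = false;
                next_ = (next_ + n + 1) % slots_.size();
                return s.id;
            }
        }
        return -1;
    }

    // Call right after the slot's command buffer is submitted with its fence.
    void markSubmitted(int id)
    {
        slots_[static_cast<std::size_t>(id)].inFlight = true;
    }

private:
    std::vector<Slot> slots_;
    std::function<bool(int)> fenceSignaled_;
    std::size_t next_ = 0;
};
```

If a ring like this ever returns -1 under heavy terrain load, raising the slot count only delays the failure; the submission path has to wait for a fence before recording again.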

  • I suspect we're not going to be able to understand this one from our side without more detail on the specific API sequence (and possibly a reproducer). Assume we can't do that on the forums - please can you contact developer@arm.com and we'll see if we can help offline.

    Kind regards, 
    Pete

  • Already in contact with them.  Unfortunately, the repack is already in the live build.  So it may not be easy to share a capture.  If I can get one of the few devices that supports AGI that isn't Mali, then I might be able to use that to do a capture of the render.  I suspect bad logic in the dedupe/sharing of vertices across various index ranges in the driver.