Do we need to repack our vertex buffers for Mali-G76 to avoid VK_DEVICE_LOST?

Several Mali devices (e.g. the Galaxy S10e) using Mali-G76 (second-generation Bifrost) are producing a VK_DEVICE_LOST error when rendering 250K triangles or more.   I read about the 180 MB driver limit on Mali systems, and how exceeding it simply hands VK_DEVICE_LOST back to the developer, leaving it up to them to split render passes.  We don't have this issue with Adreno or other Android devices.  iOS also has a parameter buffer, but it flushes it behind the scenes, so we've never hit any issues there either.

community.arm.com/.../memory-limits-with-vulkan-on-mali-gpus

This device lost error happens when I turn on terrain, or turn off culling on the terrain.  The resulting spike in triangle count, from 200K tris that render fine to 250K, is when Vulkan returns VK_DEVICE_LOST, preceded by a message about "QueueSignalReleaseImageandroid failed:-4".  Looking this up in the Vulkan sources indicates it is tied to framebuffer loss, so it may just be the first stage of the device loss.

I don't have a lot to go on, since validation seems to crash the driver with an unknown symbol.  I was able to fix a few validation errors using other, non-Mali devices, but this code has mostly been working up until these high polycounts are hit.  Our current approach:

1. Chunk up terrain into index chunks that represent spatially close triangles.  These can be culled.

2. Copy out the indices for each specific material into new chunks (these are a subset of the indices in the original chunk).  LODs work the same way.

3. Draw each visible chunk that corresponds to a given material with vkDrawIndexedIndirect.   Disabling this optimization does not prevent the crash.

I read the Mali guide and there's not much there about organizing vertex or index buffer data.  In general, iOS doesn't even recommend anything like repacking.  Pete Harris had mentioned that Bifrost copies the entire min/max index range of vertices into the parameter buffer, while Valhall copies only the vertices of visible (non-backfacing) triangles.   So Valhall gets around 50% more out of the same parameter buffer if half the triangles are backfacing.

With things moving toward mesh shaders and meshlets, as in UE5, I was considering repacking/reordering/splitting our vertex buffers so that each chunk's indices form a mostly incrementing sequence and the min/max range is as tight as possible.   If the chunks are small enough, 8-bit indices might even suffice.   But in step 3, we may pass, say, 100 of 200 index chunks to the driver, all referencing a single vertex buffer.  I understand that within one index range (indexStart, indexCount) all vertices are transformed, but if those 100 index chunks reference half the buffer, will only that half be allocated in the parameter buffer?    LODs could be packed smallest to largest by appending the unique vertices of the larger LODs to the end.

  • Hi Alec, 

    but if those 100 index chunks reference half the buffer, will only that half be allocated to the parameter buffer?

    Yes, the min/max range used for memory allocation is determined on a per-draw basis. Indexing a sub-range of a buffer, not starting at zero, is perfectly acceptable and doesn't incur any overhead. Only the per-draw min/max index is used for memory allocation purposes.

    LODs could be packed smallest to largest by appending the unique vertices to the end from the larger LODs.

    Yes, dedicated vertex sub-ranges for each LOD are definitely the most efficient way to do this on Mali.

    That's still far under the 180MB limit. 

    Agreed, you can store a LOT of vertex data in 180MB. If you only have 250K tris after any instancing I really wouldn't expect you to be tripping over the 180MB limit (i.e. you can store 180 bytes per vertex for a million verts in that). I wonder if you have another problem lurking which can trigger a device lost such as an out-of-bounds buffer access, or something of that type. 

    One handy thing to rule that out is running with the robustBufferAccess feature enabled on the device. This isn't free, and has a runtime overhead, but it adds DX-style buffer clamping to limit buffer accesses to the extent of the buffer region. If the DEVICE_LOST vanishes with that enabled, then start looking for an out-of-bounds buffer access.
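    For reference, a minimal sketch of enabling it at device creation (robustBufferAccess is a core Vulkan 1.0 feature, so no extension is needed; physicalDevice and queueInfo are assumed to already exist):

```c
/* Query support first, then request the feature at vkCreateDevice time. */
VkPhysicalDeviceFeatures supported;
vkGetPhysicalDeviceFeatures(physicalDevice, &supported);

VkPhysicalDeviceFeatures enabled = {0};
enabled.robustBufferAccess = supported.robustBufferAccess;

VkDeviceCreateInfo info = {
    .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
    .queueCreateInfoCount = 1,
    .pQueueCreateInfos = &queueInfo,
    .pEnabledFeatures = &enabled,  /* out-of-bounds accesses now clamped */
};
VkDevice device;
vkCreateDevice(physicalDevice, &info, NULL, &device);
```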

    Validation seems to crash the driver with an unknown symbol. 

    Any more info on this one? Not heard of this one before.

    Cheers, 
    Pete

