End of buffer corruption for non-coherent memory type

Hello!

We have observed garbage vertex data being fed into vertex shaders, where the garbage is located at the very end of the vertex buffers. This causes a 100% reproducible GPU crash. The vertex buffers are allocated in non-coherent memory.

This happens on a Pixel 6, which has a Mali-G78 MP20 GPU.

For now, the workaround is to round the VkBufferCreateInfo size field up to a multiple of nonCoherentAtomSize, which eliminates the GPU crash.
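
For reference, the round-up we apply is roughly the following sketch (AlignUp, requestedSize and limits are illustrative names rather than our exact code):

    #include <vulkan/vulkan.h>

    // Round the requested size up to a multiple of nonCoherentAtomSize
    // before filling in VkBufferCreateInfo::size.
    VkDeviceSize AlignUp(VkDeviceSize value, VkDeviceSize alignment)
    {
        return ((value + alignment - 1) / alignment) * alignment;
    }

    VkBufferCreateInfo bufferInfo = {};
    bufferInfo.sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
    bufferInfo.size        = AlignUp(requestedSize, limits.nonCoherentAtomSize);
    bufferInfo.usage       = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;
    bufferInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;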

Mapping the buffer and reading the data back on the CPU produces correct data, so it seems that only the GPU fails to see the correct data at the end of the buffer.

We call vkFlushMappedMemoryRanges() after the memcpy() into the aligned, allocated buffer memory, and no Vulkan validation layer errors are reported while the app runs.

I would be curious to know whether this is perhaps a known bug on your side.

Thank you in advance for your help,
Milan

  • Hi Milan,

    Thanks for getting in touch. I'm not aware of any specific bug like this - can you share a reproducer? Feel free to email developer at arm dot com if you can only share privately. 

    A couple of diagnostics:

    * How are you mapping the buffer - are you using WHOLE_SIZE?

    * Does it stop reproducing if you enable robustBufferAccess?

    * Does it stop reproducing if you round up buffer size to be an exact multiple of 4 vertices?

    Also note this bit in the spec, which might apply to your usage: 

    vkMapMemory ... If the device memory was allocated without the VK_MEMORY_PROPERTY_HOST_COHERENT_BIT set, these guarantees must be made for an extended range: the application must round down the start of the range to the nearest multiple of VkPhysicalDeviceLimits::nonCoherentAtomSize, and round the end of the range up to the nearest multiple of VkPhysicalDeviceLimits::nonCoherentAtomSize.
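
    As a sketch (writeOffset, writeSize, atom and allocSize are placeholder names for whatever your allocator tracks), a spec-conformant flush of a sub-range looks something like this:

        // Round the start of the written range down, and the end up, to
        // nonCoherentAtomSize, clamping to the size of the allocation.
        VkDeviceSize begin = (writeOffset / atom) * atom;
        VkDeviceSize end   = ((writeOffset + writeSize + atom - 1) / atom) * atom;
        if (end > allocSize)
            end = allocSize;

        VkMappedMemoryRange range = {};
        range.sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
        range.memory = deviceMemory;
        range.offset = begin;
        range.size   = end - begin;   // or VK_WHOLE_SIZE to flush to the end of the allocation
        vkFlushMappedMemoryRanges(device, 1, &range);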

    FWIW, aligning on nonCoherentAtomSize (cache line alignment) is probably good for performance anyway.

    Kind regards, 
    Pete

  • Hi Pete,

    First of all, thank you for the detailed reply!

    As soon as we work out an NDA with ARM, we will be able to share a repro case - it's in the works as we speak.

    Regarding the diagnostics:
    How are you mapping the buffer - are you using WHOLE_SIZE?
    We allocate memory in 16MB chunks, and we map each chunk to host memory once at creation, using the [0, 16MB] range in the map call.
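
    Concretely, the mapping step is roughly this (illustrative, not our exact code):

        // Each 16MB VkDeviceMemory chunk is mapped once when it is created,
        // and the pointer is kept for the lifetime of the chunk.
        void* mappedPtr = nullptr;
        vkMapMemory(device, chunkMemory, 0, 16 * 1024 * 1024, 0, &mappedPtr);
        // (passing VK_WHOLE_SIZE as the size would be equivalent here)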

    Does it stop reproducing if you enable robustBufferAccess?
    Interestingly, robustBufferAccess does not stop it from crashing. The GPU crashes when we access the following array, which is hard-coded in the vertex shader:

    vec4 visibility_lookup[62] = vec4[62]( ... );

    We are getting garbage vertex shader input data for the last few vertices in the buffer, so the garbage data effectively makes us index this array with a very large index.

    Does it stop reproducing if you round up buffer size to be an exact multiple of 4 vertices?
    Yup, that's the gist of the workaround I found - if I round the VkBufferCreateInfo size field up to a multiple of nonCoherentAtomSize, it works just fine.

    One theory is that there might be a driver bug after our memcpy() to mapped memory, where the last (trailing) cache line is not flushed correctly to the GPU for buffers whose size is not aligned to nonCoherentAtomSize (i.e. a bug in vkFlushMappedMemoryRanges()). Of course, the other theory is that we made a mistake in our buffer management code, but looking at it so far, I can't find it.

    Thanks,
    Milan

  • Thanks for the explanation. We have a partial theory, but no concrete smoking gun yet.

    The short summary is that Mali always shades vertices in groups of 4 vertices, each group being consecutive indices naturally aligned on multiples of 4. This can shade indices which are not referenced in the index buffer if they are in the same "4 group" as indices that are referenced, and can also shade past the last index if the last "4 group" is only partially filled. 

    The partial vertices in that last group should be benign (and forcing robustBufferAccess should ensure that), but it sounds like we're possibly missing a guard check somewhere. For this issue, the generic recommended workaround is to pad the mesh buffers so that any "partial 4 group" overspill that ends up getting shaded contains valid data (e.g. replicate the last vertex). Hopefully we can give a more precise workaround once we actually identify which memory access is causing the problem.
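
    As a sketch of the padding idea (Vertex and the container are placeholders for your own types):

        #include <vector>

        struct Vertex { float data[4]; };   // placeholder layout

        // Pad the CPU-side vertex array up to a multiple of 4 vertices by
        // replicating the last vertex, so any over-shaded vertices in the
        // final "4 group" still read valid, in-range data.
        void PadToGroupOf4(std::vector<Vertex>& vertices)
        {
            if (vertices.empty())
                return;
            size_t padded = (vertices.size() + 3) & ~size_t(3);
            vertices.resize(padded, vertices.back());
        }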

    HTH, 
    Pete

  • Hi Pete,

    Thank you for the detailed analysis, it makes the context of the bug much clearer.

    The interesting part I've noticed is that if we request the buffer size to be "4-vertex aligned", we see no bug. If the buffer size is not 4-vertex aligned, we observe the bug.

    Note that the actual allocation inside our 16MB Vulkan memory chunk remains the same in both cases (because in this specific case, "4-vertex aligned" and nonCoherentAtomSize alignment are exactly the same), and the contents of the memory just outside the buffer are the same.

    So my theory is that there is something in the driver / GPU which prevents accesses outside of the non-4-vertex-aligned VkBuffer but feeds junk data instead, while in the case of a 4-vertex-aligned buffer it reads the actual memory we provide, which, by chance, contains values that don't crash the GPU (maybe all zeros, so the GPU accesses array element 0 and does not crash).

    Thanks,
    Milan

  • OK, I think we have a better idea of what is happening. Shout if any of this conjecture on what you are doing is incorrect.

    In terms of application behavior:

    • You have a VkBuffer that is the whole 16MB chunk, and you are suballocating ranges inside it. The end of any single mesh may either run into the next mesh or hit the end of the buffer, depending on where the sub-allocation sits inside the buffer.
    • The shader is loading or computing an index (directly or indirectly) based on an attribute, and using that to index into the visibility_lookup constant array in the shader.

    In terms of the probable causal chain:

    • The draw in question is a sub-allocation that is immediately adjacent to another draw call. 
    • The draw in question is not a multiple of 4 vertices. 
    • When the vertex shader runs, the index over-spill at the end of the draw fetches data from the "next mesh" and interprets it as if it were a vertex from the initial mesh. Because we don't hit the end of the buffer, the hardware over-fetch protection doesn't kick in - this is still inside the valid buffer extents.
    • The shader uses this bad data to compute the array index into the visibility_lookup array, and ends up with an out-of-bounds index. Only user buffers are bounds-checked in hardware and protected by robustBufferAccess, so in this case the out-of-bounds access into the literal array isn't caught.
    • <boom>.

    Workarounds:

    • The most efficient workaround is to ensure vertex allocations are always in multiple-of-4-vertex chunks (the padding requirement could be more than what coherency atom size alignment gives, if the vertex size changes in future), and to ensure that the padding bytes contain sensible values that are in-range for the array index calculation (see the sketch after this list).
    • Alternatively, you could clamp the index you use for the constant array lookup in the shader. This requires actual computation per vertex, so it is likely to be slower.
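
    As a rough sketch of the first option (names are illustrative, not a drop-in implementation):

        // Size each per-mesh vertex allocation to cover a whole number of
        // 4-vertex groups, then round up to the coherency atom as well so
        // non-coherent flushes stay in-spec. The padding vertices should be
        // written with in-range data (e.g. copies of the last real vertex).
        VkDeviceSize SubAllocationSize(uint32_t vertexCount, VkDeviceSize vertexStride,
                                       VkDeviceSize nonCoherentAtomSize)
        {
            VkDeviceSize paddedVerts = ((vertexCount + 3u) / 4u) * 4u;
            VkDeviceSize size        = paddedVerts * vertexStride;
            return ((size + nonCoherentAtomSize - 1) / nonCoherentAtomSize) * nonCoherentAtomSize;
        }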

    Kind regards, 
    Pete

  • Hi Pete,

    Thanks for the detailed analysis of the issue!

    What you are describing is likely very close to the culprit - to clarify a bit further, we are allocating 16MB "pages" of VkDeviceMemory and then binding smaller VkBuffer objects to them. Each mesh has its own VkBuffer, and since the vertex size is 16 bytes and the coherency atom size is 64 bytes, there is little chance the GPU is reading from the next mesh; rather, if it's reading outside of the specified VkBuffer, it's reading junk data in the 64-byte alignment padding before the next VkBuffer starts.
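
    Roughly, the per-mesh setup looks like this (illustrative names; the real code also respects the buffer's VkMemoryRequirements):

        // One 16MB VkDeviceMemory "page"; each mesh gets its own VkBuffer
        // bound at an offset aligned to nonCoherentAtomSize (64 bytes here).
        VkBufferCreateInfo info = {};
        info.sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
        info.size        = meshVertexDataSize;   // 16-byte vertices, count not always a multiple of 4
        info.usage       = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;
        info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

        VkBuffer meshBuffer = VK_NULL_HANDLE;
        vkCreateBuffer(device, &info, nullptr, &meshBuffer);
        vkBindBufferMemory(device, meshBuffer, pageMemory, alignedOffsetWithinPage);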

    What's puzzling me is that the GPU crashes if we create a VkBuffer whose size is not 4-vertex and coherency-atom-size aligned, yet if we allocate a buffer whose size is coherency-atom-size aligned and 4-vertex aligned, everything works fine. The memory "page" contents are exactly the same in both cases (there is the same junk memory there, as our memcpy to the mapped memory copies vertex data whose count is not divisible by 4, so the last few vertices in that buffer are never initialized / written to), so I would expect a crash in both cases.

    My guess would be that the over-fetch, when activated in the first case, is feeding junk data to the vertex shaders, but when it doesn't get activated in the second case, uninitialized memory gets read and, by luck, the GPU does not crash (maybe the memory is initialized to 0 somewhere, so the array accesses do not crash the GPU).

    On the other hand, maybe I'm completely off with this guess - we are still waiting for the NDAs to be prepared and signed so that we can send you our repro case.

    Thanks,
    Milan