End of buffer corruption for non-coherent memory type

Hello!

We have observed garbage vertex data being fed into vertex shaders, with the garbage located at the very end of the vertex buffers. This causes a 100% reproducible GPU crash. The vertex buffers are allocated in non-coherent memory.

This happens on a Pixel 6, which has a Mali-G78 MP20 chip.

For now, the workaround is to align the VkBufferCreateInfo size field up to a multiple of nonCoherentAtomSize, which fixes the GPU crash.
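
For reference, a minimal sketch of that align-up (AlignUp, requestedSize, and limits are illustrative names, with limits taken from VkPhysicalDeviceLimits, not our actual code):

```c
/* Round a requested size up to a multiple of nonCoherentAtomSize.
 * AlignUp and requestedSize are illustrative names, not real API. */
VkDeviceSize AlignUp(VkDeviceSize size, VkDeviceSize atom)
{
    return (size + atom - 1) / atom * atom;
}

VkBufferCreateInfo bufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
bufferInfo.size  = AlignUp(requestedSize, limits.nonCoherentAtomSize);
bufferInfo.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;
```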

Mapping the buffer and reading the data back on the CPU produces correct data, so it seems that only the GPU is not seeing the correct data at the end of the buffer.
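
A rough sketch of that read-back check (mappedPtr, tailOffset, and the other names are placeholders; the invalidate is only strictly needed to observe device writes on non-coherent memory, and is included here just to rule out stale CPU caches):

```c
/* Pull any device writes into CPU visibility before reading the
 * tail of the mapped, non-coherent allocation. */
VkMappedMemoryRange range = { VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE };
range.memory = deviceMemory;
range.offset = 0;
range.size   = VK_WHOLE_SIZE;
vkInvalidateMappedMemoryRanges(device, 1, &range);

/* Compare the end of the buffer against the source vertex data. */
int ok = memcmp((char*)mappedPtr + tailOffset, expectedTail, tailBytes) == 0;
```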

We call vkFlushMappedMemoryRanges() after the memcpy() into the aligned, allocated buffer memory, and no Vulkan debug layer errors are displayed during app execution.
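
Our upload path looks roughly like this (a sketch; mappedPtr, dstOffset, and the sizes are stand-ins for our real values, with offset and size rounded to nonCoherentAtomSize as the spec requires):

```c
/* Copy vertex data into the persistently mapped allocation, then flush
 * the written range so the GPU sees it (memory lacks HOST_COHERENT). */
memcpy((char*)mappedPtr + dstOffset, vertexData, dataSize);

VkMappedMemoryRange range = { VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE };
range.memory = deviceMemory;
range.offset = dstOffset;   /* multiple of nonCoherentAtomSize */
range.size   = flushSize;   /* multiple of nonCoherentAtomSize, or VK_WHOLE_SIZE */
vkFlushMappedMemoryRanges(device, 1, &range);
```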

I would be curious to know if this is perhaps a known bug on your side?

Thank you in advance for your help,
Milan

  • OK, I think we have a better idea of what is happening. Shout if any of this conjecture on what you are doing is incorrect.

    In terms of application behavior:

    • You have a VkBuffer that is the whole 16MB chunk, and you are suballocating ranges inside it. The end of any single mesh may either run into the next mesh or hit the end of the buffer, depending on where the sub-allocation sits inside the buffer.
    • The shader is loading or computing an index (directly or indirectly) based on an attribute, and using that to index into the visibility_lookup constant array in the shader.

    In terms of the probable causal chain:

    • The draw in question is a sub-allocation that is immediately adjacent to another draw call. 
    • The draw in question is not a multiple of 4 vertices. 
    • When the vertex shader runs, the index over-spill at the end of the draw fetches data from the "next mesh" and interprets it as if it were a vertex from the initial mesh. Because we don't hit the end of the buffer, the hardware over-fetch protection doesn't kick in; the access is still inside the valid buffer extents.
    • The shader uses this bad data to compute the array index into the visibility_lookup array, and ends up with an out-of-bounds index. Only user buffers are bounds-checked in hardware and protected by robustBufferAccess, so in this case the out-of-bounds access into the literal array isn't caught.
    • <boom>.

    Workarounds:

    • The most efficient workaround is to ensure vertex allocations are always in multiple-of-4-vertex chunks (the padding requirement could be larger than what coherency-atom-size alignment gives if the vertex size changes in future), and to ensure the padding bytes contain sensible values that are in range for the array index calculation (see the sketch after this list).
    • Alternatively you could clamp the array offset you use for the index into the constant array in the shader. This requires actual computation per vertex, so is likely to be slower.
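
    As a sketch of the first option (helper and names are illustrative, not your allocator; zeroing the padding is one choice, assuming index 0 is a valid entry of visibility_lookup):

    ```c
    #include <string.h>

    /* Round a vertex count up to a multiple of 4 so the end-of-draw
     * over-fetch stays inside this mesh's own padding. */
    enum { VERTS_PER_FETCH_GROUP = 4 };

    static size_t padded_vertex_count(size_t vertex_count)
    {
        return (vertex_count + VERTS_PER_FETCH_GROUP - 1)
               / VERTS_PER_FETCH_GROUP * VERTS_PER_FETCH_GROUP;
    }

    /* Copy the real vertices, then zero the padding vertices so any
     * attribute fetched from them produces an in-range array index. */
    static void upload_mesh(unsigned char *dst, const void *src,
                            size_t vertex_count, size_t vertex_stride)
    {
        size_t real_bytes   = vertex_count * vertex_stride;
        size_t padded_bytes = padded_vertex_count(vertex_count) * vertex_stride;

        memcpy(dst, src, real_bytes);
        memset(dst + real_bytes, 0, padded_bytes - real_bytes);
    }
    ```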

    Kind regards, 
    Pete

  • Hi Pete,

    Thanks for the detailed analysis of the issue!

    What you are describing is likely very close to the culprit. To clarify a bit further: we allocate 16MB "pages" of VkDeviceMemory and then bind smaller VkBuffer objects into them. Each mesh has its own VkBuffer, and since the vertex size is 16 bytes and the coherency atom size is 64 bytes, there is little chance the GPU is reading from the next mesh; rather, if it is reading outside the specified VkBuffer, it is reading junk data from the 64-byte alignment padding before the next VkBuffer starts.
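
    Roughly, the binding scheme looks like this (a sketch with made-up names; pageOffset is rounded up to both nonCoherentAtomSize and the buffer's VkMemoryRequirements alignment before binding):

    ```c
    /* Each mesh gets its own VkBuffer bound into a shared 16MB
     * VkDeviceMemory "page" at a 64-byte-aligned offset. */
    VkBufferCreateInfo info = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    info.size  = meshVertexBytes;   /* 16-byte vertices, count not padded */
    info.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;

    VkBuffer meshBuffer;
    vkCreateBuffer(device, &info, NULL, &meshBuffer);
    vkBindBufferMemory(device, meshBuffer, pageMemory, pageOffset);
    ```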

    What's puzzling me is that the GPU crashes if we create a VkBuffer whose size is not 4-vertex and coherency-atom-size aligned, yet if we allocate a buffer whose size is both coherency-atom-size aligned and 4-vertex aligned, everything works fine. The memory "page" contents are exactly the same in both cases (the same junk memory is there, as our memcpy to the mapped memory copies vertex data whose count is not divisible by 4, so the last few vertices in that buffer are never initialized / written to), so I would expect a crash in both cases.

    My guess would be that the over-fetch, when activated in the first case, feeds junk data to the vertex shaders, whereas when it is not activated in the second case, uninitialized memory gets read and, by luck, the GPU does not crash (maybe the memory is initialized to 0 somewhere, so the array accesses do not crash the GPU).

    On the other hand, maybe I'm completely off with this guess - we are still waiting for the NDAs to be prepared / signed so that we can send you our repro case.

    Thanks,
    Milan