Mali-G76 Vulkan mesh skinning compute shader bug

Hi,

We are having trouble with mesh skinning in a Vulkan compute shader on a Mali-G76 device (Samsung S10+). Past a certain point while transforming the vertices, the device returns all zeroes or invalid data for the coordinates. We pack all the vertex information, i.e. position (float3), normal (float3), tangent (float4) and bitangent (float3), into one large buffer and use index offsets to access each stream.

In this sample the offsets are:

InstanceSrg_m_targetPositions   0        uint
InstanceSrg_m_targetNormals     1390896  uint
InstanceSrg_m_targetTangents    2086344  uint
InstanceSrg_m_targetBiTangents  3013608  uint

The pseudo shader code looks something like this:

void MainCS(...)
{

    const uint i = (thread_id.x) + InstanceSrg::m_totalNumberOfThreadsX * (thread_id.y);

    ...

    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3] = position.x;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3 + 1] = position.y;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3 + 2] = position.z;

    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetNormals + i * 3] = normal.x;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetNormals + i * 3 + 1] = normal.y;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetNormals + i * 3 + 2] = normal.z;

    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetTangents + i * 4] = tangent.x;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetTangents + i * 4 + 1] = tangent.y;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetTangents + i * 4 + 2] = tangent.z;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetTangents + i * 4 + 3] = tangent.w;

    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetBiTangents + i * 3] = bitangent.x;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetBiTangents + i * 3 + 1] = bitangent.y;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetBiTangents + i * 3 + 2] = bitangent.z;

}
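
To rule out the index arithmetic itself, the addressing in the pseudo-code can be mirrored on the CPU. The sketch below is only an illustration: the function names are hypothetical, the offsets are the ones listed above, and it assumes they are float-element indices into m_skinnedMeshOutputStream (which is how the shader uses them), with 4-byte elements.

```python
# Offsets from the post, assumed to be float-element indices (not bytes).
TARGET_POSITIONS = 0
TARGET_NORMALS = 1390896
TARGET_TANGENTS = 2086344
TARGET_BITANGENTS = 3013608


def flat_thread_index(thread_x, thread_y, total_threads_x):
    """Mirror of: i = thread_id.x + m_totalNumberOfThreadsX * thread_id.y."""
    return thread_x + total_threads_x * thread_y


def output_slots(i):
    """First element index written in each stream for vertex i."""
    return {
        "position": TARGET_POSITIONS + i * 3,
        "normal": TARGET_NORMALS + i * 3,
        "tangent": TARGET_TANGENTS + i * 4,
        "bitangent": TARGET_BITANGENTS + i * 3,
    }


def byte_offset(element_index, bytes_per_element=4):
    """Byte offset of an element, assuming 4-byte floats."""
    return element_index * bytes_per_element
```

Under these assumptions, vertex 13363 writes its position starting at element 40089, i.e. byte 160356, which is far below the reported maxStorageBufferRange of 268435456.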

   

As a simple test, I copied just the x and y coordinates of the positions and used z as a debugging index, with code like the following:

PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3] = m_sourcePositions[InstanceSrg::m_targetPositions + i * 3];

PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3 + 1] = m_sourcePositions[InstanceSrg::m_targetPositions + i * 3 + 1];

PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3 + 2] = i;

Below is a snippet of a RenderDoc capture of m_skinnedMeshOutputStream. The first column is the element index, the next two are the x and y coordinates, and the last is the debugging index. The area of interest is where the device stops fetching the correct data for the mesh.

Element  _child0   _child0  _child0
13360    -0.00282  0.1494   13360.00
13361    -0.00426  0.14954  13361.00
13362    -0.00441  0.14802  13362.00
13363     0.00     0.00     13363.00
13364     0.00     0.00     13364.0
13365     0.00     0.00     13365.0
13366     0.00     0.00     13366.0
13367     0.00     0.00     13367.0
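
Rather than scanning the capture by eye, the first bad vertex can be located by comparing a readback of the output against the source data. This is a minimal sketch with a hypothetical helper name; it assumes both buffers have been read back as flat sequences of floats, with the source laid out as x, y, z per vertex and the output as x, y, debug index per vertex, as in the test above.

```python
def first_mismatch(source, output, vertex_count, base=0):
    """Return the first vertex index whose copied x or y differs from the
    source, or None if every vertex matches.

    Only x and y are compared, because the third component of the output
    holds the debugging index rather than z.
    """
    for i in range(vertex_count):
        slot = base + i * 3
        if output[slot] != source[slot] or output[slot + 1] != source[slot + 1]:
            return i
    return None
```

Comparing against the source avoids flagging a vertex whose x and y are legitimately zero.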

 

Device Info:

Samsung S10+

SM-G975F/DS

Mali-G76

VkPhysicalDeviceProperties = {
apiVersion = 4198531
driverVersion = 109051904
vendorID = 5045
deviceType = VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
deviceName = "Mali-G76"
pipelineCacheUUID = "\xb6\xa8\U00000003\xe0c}\xa0\x94\xec\xde\xd4ξ\xe2\U00000004\x9b"
limits = {
maxImageDimension1D = 16384
maxImageDimension2D = 16383
maxImageDimension3D = 16383
maxImageDimensionCube = 16383
maxImageArrayLayers = 1024
maxTexelBufferElements = 65536
maxUniformBufferRange = 65536
maxStorageBufferRange = 268435456
maxPushConstantsSize = 256
maxMemoryAllocationCount = 4294967295
maxSamplerAllocationCount = 4294967295
bufferImageGranularity = 4096
sparseAddressSpaceSize = 0
maxBoundDescriptorSets = 4
maxPerStageDescriptorSamplers = 128
maxPerStageDescriptorUniformBuffers = 36
maxPerStageDescriptorStorageBuffers = 35
maxPerStageDescriptorSampledImages = 256
maxPerStageDescriptorStorageImages = 21
maxPerStageDescriptorInputAttachments = 9
maxPerStageResources = 365
maxDescriptorSetSamplers = 768
maxDescriptorSetUniformBuffers = 216
maxDescriptorSetUniformBuffersDynamic = 32
maxDescriptorSetStorageBuffers = 210
maxDescriptorSetStorageBuffersDynamic = 32
maxDescriptorSetSampledImages = 1536
maxDescriptorSetStorageImages = 126
maxDescriptorSetInputAttachments = 9
maxVertexInputAttributes = 32
maxVertexInputBindings = 32
maxVertexInputAttributeOffset = 2047
maxVertexInputBindingStride = 2048
maxVertexOutputComponents = 128
maxTessellationGenerationLevel = 64
maxTessellationPatchSize = 32
maxTessellationControlPerVertexInputComponents = 128
maxTessellationControlPerVertexOutputComponents = 128
maxTessellationControlPerPatchOutputComponents = 120
maxTessellationControlTotalOutputComponents = 4096
maxTessellationEvaluationInputComponents = 128
maxTessellationEvaluationOutputComponents = 128
maxGeometryShaderInvocations = 32
maxGeometryInputComponents = 64
maxGeometryOutputComponents = 128
maxGeometryOutputVertices = 256
maxGeometryTotalOutputComponents = 1024
maxFragmentInputComponents = 128
maxFragmentOutputAttachments = 8
maxFragmentDualSrcAttachments = 0
maxFragmentCombinedOutputResources = 64
maxComputeSharedMemorySize = 32768
maxComputeWorkGroupCount = ([0] = 4294967295, [1] = 4294967295, [2] = 4294967295)
maxComputeWorkGroupInvocations = 384
maxComputeWorkGroupSize = ([0] = 384, [1] = 384, [2] = 384)
subPixelPrecisionBits = 8
subTexelPrecisionBits = 8
mipmapPrecisionBits = 8
maxDrawIndexedIndexValue = 4294967295
maxDrawIndirectCount = 1
maxSamplerLodBias = 255
maxSamplerAnisotropy = 16
maxViewports = 1
maxViewportDimensions = ([0] = 16383, [1] = 16383)
viewportBoundsRange = ([0] = -32766, [1] = 32765)
viewportSubPixelBits = 0
minMemoryMapAlignment = 64
minTexelBufferOffsetAlignment = 64
minUniformBufferOffsetAlignment = 16
minStorageBufferOffsetAlignment = 64
minTexelOffset = -8
maxTexelOffset = 7
minTexelGatherOffset = -8
maxTexelGatherOffset = 7
minInterpolationOffset = -0.5
maxInterpolationOffset = 0.5
subPixelInterpolationOffsetBits = 4
maxFramebufferWidth = 16383
maxFramebufferHeight = 16383
maxFramebufferLayers = 256
framebufferColorSampleCounts = 13
framebufferDepthSampleCounts = 13
framebufferStencilSampleCounts = 13
framebufferNoAttachmentsSampleCounts = 29
maxColorAttachments = 8
sampledImageColorSampleCounts = 13
sampledImageIntegerSampleCounts = 13
sampledImageDepthSampleCounts = 13
sampledImageStencilSampleCounts = 13
storageImageSampleCounts = 1
maxSampleMaskWords = 1
timestampComputeAndGraphics = 1
timestampPeriod = 38.4615402
maxClipDistances = 0
maxCullDistances = 0
maxCombinedClipAndCullDistances = 0
discreteQueuePriorities = 2
pointSizeRange = ([0] = 1, [1] = 1024)
lineWidthRange = ([0] = 1,
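
For reference, the packed apiVersion field in the dump above can be decoded with the standard Vulkan version layout (major in bits 22 to 28, minor in bits 12 to 21, patch in bits 0 to 11). The helper name below is hypothetical. Applying the same layout to driverVersion is an assumption, since that field's encoding is vendor-specific.

```python
def decode_vulkan_version(packed):
    """Split a packed Vulkan version into (major, minor, patch)."""
    major = (packed >> 22) & 0x7F
    minor = (packed >> 12) & 0x3FF
    patch = packed & 0xFFF
    return (major, minor, patch)


api_version = decode_vulkan_version(4198531)       # apiVersion from the dump
driver_version = decode_vulkan_version(109051904)  # assumes the same packing
vendor_id_hex = hex(5045)                          # vendorID from the dump
```

With these inputs the API version decodes to 1.1.131, the driver version to 26.0.0 under the stated assumption, and the vendor ID is 0x13b5.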

Is this problem related to the 180MB memory address limit on Mali devices? No matter what I change in the shader, such as the number of threads or the index offsets, it always stops at the same index (13363).

This sample works on other non-Mali devices.

Thanks. 

  • Hello! Thanks for submitting pseudo-code, it's really helpful.

    So after index 13363, m_skinnedMeshOutputStream still has the correct z value? Is it only filled with zeros or incorrect data when you assign position.x/y/z or a value from m_sourcePositions?
    It sounds like reading the data from m_sourcePositions doesn't work (assuming you also calculate position.x/y/z based on it).

  • Hello,

    That is correct: the debug z value is still correct after index 13363 in m_skinnedMeshOutputStream. So my question is whether this means we have to break the meshes up into smaller chunks to get around a memory addressing limit like this. Is there a specific chunk size we should use, or is it just trial and error?

    Thanks.

  • The 180MB data limit only applies to post-transform vertex data. Compute shaders don't have that issue at all, so it's not going to be that. 

    One obvious thing to check - do you have any mediump variables around? Are you possibly overflowing those on address calculation?
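
To illustrate the kind of overflow meant here, the sketch below models what would happen if the address calculation were carried out in 16-bit unsigned arithmetic. This is purely an assumption for illustration: the helper names are hypothetical, and whether any intermediate in the real shader is actually reduced to 16 bits is not known from the thread.

```python
def wrap_u16(value):
    """Truncate a value to 16 bits, as a 16-bit unsigned register would."""
    return value & 0xFFFF


def element_index_16bit(base, i, stride, component):
    """base + i * stride + component with every intermediate wrapped to 16 bits."""
    scaled = wrap_u16(i * stride)
    offset = wrap_u16(scaled + component)
    return wrap_u16(wrap_u16(base) + offset)


def first_wrapping_index(base, stride, component, limit):
    """First vertex index whose 16-bit result differs from the exact one."""
    for i in range(limit):
        exact = base + i * stride + component
        if element_index_16bit(base, i, stride, component) != exact:
            return i
    return None
```

In this model, with base 0 and stride 3, the z component first wraps at vertex 21845, while vertex 13363 still maps to element 40089 without wrapping, so the numbers can be compared directly against the index where the data goes wrong.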

    If you are still stuck, are you able to share the shader (and ideally a reproducer)? If this is not something you can share publicly, please can you email it to developer@arm.com

    Kind regards, 
    Pete

    All our ints are full 32-bit, so there should not be any overflow (int32 performance aside). I am currently finding out more about releasing the apk. Once approval is given, we will share it through the appropriate channel.

    Thank you.