Mali-G76 Vulkan mesh skinning compute shader bug

Hi,

We are having trouble with mesh skinning in a Vulkan compute shader on a Mali-G76 device (Samsung S10+): past a certain point while transforming the vertices, the device returns all zeroes or invalid data for the coordinates. We pack all the vertex data, namely position (float3), normal (float3), tangent (float4), and bitangent (float3), into one large buffer and use index offsets to access each stream.

In this sample the offsets are:

InstanceSrg_m_targetPositions   0        uint
InstanceSrg_m_targetNormals     1390896  uint
InstanceSrg_m_targetTangents    2086344  uint
InstanceSrg_m_targetBiTangents  3013608  uint
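
As a quick sanity check on the layout, the gaps between consecutive offsets give the number of elements reserved for each stream. The sketch below is Python rather than shader code, and it assumes the offsets are counted in 32-bit elements of the output buffer (the way the shader indexes it), which is an assumption and not something the engine confirms here. The names OFFSETS and stream_spans are made up for illustration.

```python
# Offsets copied from the listing above; assumed to be in 32-bit elements.
OFFSETS = [
    ("positions", 0),
    ("normals", 1390896),
    ("tangents", 2086344),
    ("bitangents", 3013608),
]


def stream_spans(offsets):
    """Return (name, element_count) for every stream that has a successor."""
    spans = []
    for (name, start), (_, next_start) in zip(offsets, offsets[1:]):
        spans.append((name, next_start - start))
    return spans


for name, count in stream_spans(OFFSETS):
    print(f"{name}: {count} elements reserved")
```

The size of the last stream cannot be derived this way because the total buffer size is not listed. With 3 floats per normal and 4 per tangent, the normals and tangents spans both correspond to 231816 vertices, while the positions span is 1390896 elements.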

The shader pseudocode looks something like this:

void MainCS(...)
{

    const uint i = (thread_id.x) + InstanceSrg::m_totalNumberOfThreadsX * (thread_id.y);

    ...

    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3] = position.x;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3 + 1] = position.y;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3 + 2] = position.z;

    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetNormals + i * 3] = normal.x;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetNormals + i * 3 + 1] = normal.y;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetNormals + i * 3 + 2] = normal.z;

    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetTangents + i * 4] = tangent.x;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetTangents + i * 4 + 1] = tangent.y;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetTangents + i * 4 + 2] = tangent.z;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetTangents + i * 4 + 3] = tangent.w;

    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetBiTangents + i * 3] = bitangent.x;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetBiTangents + i * 3 + 1] = bitangent.y;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetBiTangents + i * 3 + 2] = bitangent.z;

}
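
To cross-check the indexing on the CPU, the write pattern above can be mirrored in a few lines of Python. This is only a sketch: it reuses the offsets from the listing, assumes they are in 32-bit elements, and the helper names (linear_index, write_indices, all_writes) are hypothetical and not part of the engine.

```python
# (offset, components per vertex) for each stream, taken from the listing above.
STREAMS = {
    "positions": (0, 3),
    "normals": (1390896, 3),
    "tangents": (2086344, 4),
    "bitangents": (3013608, 3),
}


def linear_index(thread_x, thread_y, total_threads_x):
    """Same formula as the shader: i = x + totalNumberOfThreadsX * y."""
    return thread_x + total_threads_x * thread_y


def write_indices(i, base, components):
    """Buffer indices written for vertex i in a stream that starts at base."""
    return [base + i * components + c for c in range(components)]


def all_writes(i):
    """Every output index the shader body would write for vertex i."""
    return {name: write_indices(i, base, n) for name, (base, n) in STREAMS.items()}


# Indices touched by the first vertex that comes back as zeroes in the capture.
for name, indices in all_writes(13363).items():
    print(name, indices)
```

Comparing these indices against the buffer size and the neighbouring stream offsets should show whether any write for a given vertex lands outside its own stream.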

   

As a simple test, I copied only the x and y coordinates of the positions and used z as a debugging index, with code like the following:

PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3] = m_sourcePositions[InstanceSrg::m_targetPositions + i * 3];
PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3 + 1] = m_sourcePositions[InstanceSrg::m_targetPositions + i * 3 + 1];
PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3 + 2] = i;

Below is a snippet of a RenderDoc capture of m_skinnedMeshOutputStream. The first column is the element index, the next two are the x and y coordinates, and the last is the debugging index. The area of interest is where the device stops fetching the correct data for the mesh.

Element  _child0   _child0  _child0
13360    -0.00282  0.1494   13360.00
13361    -0.00426  0.14954  13361.00
13362    -0.00441  0.14802  13362.00
13363    0.00      0.00     13363.00
13364    0.00      0.00     13364.0
13365    0.00      0.00     13365.0
13366    0.00      0.00     13366.0
13367    0.00      0.00     13367.0

 

Device Info:

Samsung S10+

SM-G975F/DS

Mali-G76

VkPhysicalDeviceProperties = {
apiVersion = 4198531
driverVersion = 109051904
vendorID = 5045
deviceType = VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
deviceName = "Mali-G76"
pipelineCacheUUID = "\xb6\xa8\U00000003\xe0c}\xa0\x94\xec\xde\xd4ξ\xe2\U00000004\x9b"
limits = {
maxImageDimension1D = 16384
maxImageDimension2D = 16383
maxImageDimension3D = 16383
maxImageDimensionCube = 16383
maxImageArrayLayers = 1024
maxTexelBufferElements = 65536
maxUniformBufferRange = 65536
maxStorageBufferRange = 268435456
maxPushConstantsSize = 256
maxMemoryAllocationCount = 4294967295
maxSamplerAllocationCount = 4294967295
bufferImageGranularity = 4096
sparseAddressSpaceSize = 0
maxBoundDescriptorSets = 4
maxPerStageDescriptorSamplers = 128
maxPerStageDescriptorUniformBuffers = 36
maxPerStageDescriptorStorageBuffers = 35
maxPerStageDescriptorSampledImages = 256
maxPerStageDescriptorStorageImages = 21
maxPerStageDescriptorInputAttachments = 9
maxPerStageResources = 365
maxDescriptorSetSamplers = 768
maxDescriptorSetUniformBuffers = 216
maxDescriptorSetUniformBuffersDynamic = 32
maxDescriptorSetStorageBuffers = 210
maxDescriptorSetStorageBuffersDynamic = 32
maxDescriptorSetSampledImages = 1536
maxDescriptorSetStorageImages = 126
maxDescriptorSetInputAttachments = 9
maxVertexInputAttributes = 32
maxVertexInputBindings = 32
maxVertexInputAttributeOffset = 2047
maxVertexInputBindingStride = 2048
maxVertexOutputComponents = 128
maxTessellationGenerationLevel = 64
maxTessellationPatchSize = 32
maxTessellationControlPerVertexInputComponents = 128
maxTessellationControlPerVertexOutputComponents = 128
maxTessellationControlPerPatchOutputComponents = 120
maxTessellationControlTotalOutputComponents = 4096
maxTessellationEvaluationInputComponents = 128
maxTessellationEvaluationOutputComponents = 128
maxGeometryShaderInvocations = 32
maxGeometryInputComponents = 64
maxGeometryOutputComponents = 128
maxGeometryOutputVertices = 256
maxGeometryTotalOutputComponents = 1024
maxFragmentInputComponents = 128
maxFragmentOutputAttachments = 8
maxFragmentDualSrcAttachments = 0
maxFragmentCombinedOutputResources = 64
maxComputeSharedMemorySize = 32768
maxComputeWorkGroupCount = ([0] = 4294967295, [1] = 4294967295, [2] = 4294967295)
maxComputeWorkGroupInvocations = 384
maxComputeWorkGroupSize = ([0] = 384, [1] = 384, [2] = 384)
subPixelPrecisionBits = 8
subTexelPrecisionBits = 8
mipmapPrecisionBits = 8
maxDrawIndexedIndexValue = 4294967295
maxDrawIndirectCount = 1
maxSamplerLodBias = 255
maxSamplerAnisotropy = 16
maxViewports = 1
maxViewportDimensions = ([0] = 16383, [1] = 16383)
viewportBoundsRange = ([0] = -32766, [1] = 32765)
viewportSubPixelBits = 0
minMemoryMapAlignment = 64
minTexelBufferOffsetAlignment = 64
minUniformBufferOffsetAlignment = 16
minStorageBufferOffsetAlignment = 64
minTexelOffset = -8
maxTexelOffset = 7
minTexelGatherOffset = -8
maxTexelGatherOffset = 7
minInterpolationOffset = -0.5
maxInterpolationOffset = 0.5
subPixelInterpolationOffsetBits = 4
maxFramebufferWidth = 16383
maxFramebufferHeight = 16383
maxFramebufferLayers = 256
framebufferColorSampleCounts = 13
framebufferDepthSampleCounts = 13
framebufferStencilSampleCounts = 13
framebufferNoAttachmentsSampleCounts = 29
maxColorAttachments = 8
sampledImageColorSampleCounts = 13
sampledImageIntegerSampleCounts = 13
sampledImageDepthSampleCounts = 13
sampledImageStencilSampleCounts = 13
storageImageSampleCounts = 1
maxSampleMaskWords = 1
timestampComputeAndGraphics = 1
timestampPeriod = 38.4615402
maxClipDistances = 0
maxCullDistances = 0
maxCombinedClipAndCullDistances = 0
discreteQueuePriorities = 2
pointSizeRange = ([0] = 1, [1] = 1024)
lineWidthRange = ([0] = 1,

Is this related to the 180MB memory address limit on the Mali device? No matter what I change in the shader, such as the number of threads or the index offsets, it always stops at the same index (13363).
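
For scale, the byte offset at which the output goes bad can be worked out from the failing index. The sketch below assumes 4-byte floats and three floats per position, as in the shader code above, and reads "MB" as 2**20 bytes; the helper name position_byte_offset is hypothetical.

```python
FLOAT_BYTES = 4  # assumed size of one element of the output stream


def position_byte_offset(i, components=3):
    """Byte offset of vertex i within the positions stream (which starts at 0)."""
    return i * components * FLOAT_BYTES


failing = position_byte_offset(13363)
limit = 180 * 2**20  # the 180MB figure from the question, taken as MiB

print(failing)          # 160356
print(failing < limit)  # True
```

By this arithmetic the first bad position sits roughly 160 KB into the positions stream, well below 180MB, so it is not obvious that the limit explains the cutoff on its own.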

The same sample works on non-Mali devices.

Thanks.