Hi,
We are having issues doing mesh skinning with a Vulkan compute shader on a Mali-G76 device (Samsung S10+): from a certain point onward while transforming the vertices, the device returns all zeroes or invalid data for the coordinates. We pack all the vertex information, such as position (float3), normal (float3), tangent (float4) and bitangent (float3), into one large buffer and use index offsets to access each stream.
In this sample the offsets we have are:
The pseudo shader code looks something like this:
void MainCS(...)
{
    const uint i = (thread_id.x) + InstanceSrg::m_totalNumberOfThreadsX * (thread_id.y);
    ...
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3] = position.x;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3 + 1] = position.y;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3 + 2] = position.z;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetNormals + i * 3] = normal.x;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetNormals + i * 3 + 1] = normal.y;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetNormals + i * 3 + 2] = normal.z;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetTangents + i * 4] = tangent.x;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetTangents + i * 4 + 1] = tangent.y;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetTangents + i * 4 + 2] = tangent.z;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetTangents + i * 4 + 3] = tangent.w;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetBiTangents + i * 3] = bitangent.x;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetBiTangents + i * 3 + 1] = bitangent.y;
    PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetBiTangents + i * 3 + 2] = bitangent.z;
}
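For anyone wanting to sanity-check the index arithmetic on the CPU, here is a minimal C++ sketch. It assumes the four streams are tightly packed one after another in a single float buffer, which may not match the real offsets (they are not shown in this post). PackedOffsets, makePackedOffsets and elementIndex are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: four tightly packed float streams, one after another.
// All offsets are counted in floats, not bytes.
struct PackedOffsets {
    std::uint32_t positions;    // float3 per vertex
    std::uint32_t normals;      // float3 per vertex
    std::uint32_t tangents;     // float4 per vertex
    std::uint32_t bitangents;   // float3 per vertex
    std::uint32_t totalFloats;  // size of the whole buffer, in floats
};

PackedOffsets makePackedOffsets(std::uint32_t vertexCount) {
    PackedOffsets o{};
    o.positions = 0;
    o.normals = o.positions + vertexCount * 3;
    o.tangents = o.normals + vertexCount * 3;
    o.bitangents = o.tangents + vertexCount * 4;
    o.totalFloats = o.bitangents + vertexCount * 3;
    return o;
}

// Same arithmetic as the shader: streamOffset + i * componentsPerVertex + component.
std::uint32_t elementIndex(std::uint32_t streamOffset,
                           std::uint32_t componentsPerVertex,
                           std::uint32_t vertex,
                           std::uint32_t component) {
    assert(component < componentsPerVertex);
    return streamOffset + vertex * componentsPerVertex + component;
}
```

With a layout like this, the last element written for a stream should always land below the next stream's offset; an index that crosses into the next stream would point at a wrong offset or vertex count rather than at the device.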
As a simple test, I just copied the x and y coordinates of the positions and used z as a debugging index, with code like the following:
PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3] = m_sourcePositions[InstanceSrg::m_targetPositions + i * 3];
PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3 + 1] = m_sourcePositions[InstanceSrg::m_targetPositions + i * 3 + 1];
PassSrg::m_skinnedMeshOutputStream[InstanceSrg::m_targetPositions + i * 3 + 2] = i;
Below is a snippet of a RenderDoc capture of m_skinnedMeshOutputStream, with the first element as the index, the next two as the x and y coordinates, and the last as the index for debugging purposes. The area of interest is where the device stops fetching the correct data for the mesh.
Device Info:
Samsung S10+
SM-G975F/DS
Mali-G76
VkPhysicalDeviceProperties = { apiVersion = 4198531 driverVersion = 109051904 vendorID = 5045 deviceType = VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU deviceName = "Mali-G76" pipelineCacheUUID = "\xb6\xa8\U00000003\xe0c}\xa0\x94\xec\xde\xd4ξ\xe2\U00000004\x9b" limits = { maxImageDimension1D = 16384 maxImageDimension2D = 16383 maxImageDimension3D = 16383 maxImageDimensionCube = 16383 maxImageArrayLayers = 1024 maxTexelBufferElements = 65536 maxUniformBufferRange = 65536 maxStorageBufferRange = 268435456 maxPushConstantsSize = 256 maxMemoryAllocationCount = 4294967295 maxSamplerAllocationCount = 4294967295 bufferImageGranularity = 4096 sparseAddressSpaceSize = 0 maxBoundDescriptorSets = 4 maxPerStageDescriptorSamplers = 128 maxPerStageDescriptorUniformBuffers = 36 maxPerStageDescriptorStorageBuffers = 35 maxPerStageDescriptorSampledImages = 256 maxPerStageDescriptorStorageImages = 21 maxPerStageDescriptorInputAttachments = 9 maxPerStageResources = 365 maxDescriptorSetSamplers = 768 maxDescriptorSetUniformBuffers = 216 maxDescriptorSetUniformBuffersDynamic = 32 maxDescriptorSetStorageBuffers = 210 maxDescriptorSetStorageBuffersDynamic = 32 maxDescriptorSetSampledImages = 1536 maxDescriptorSetStorageImages = 126 maxDescriptorSetInputAttachments = 9 maxVertexInputAttributes = 32 maxVertexInputBindings = 32 maxVertexInputAttributeOffset = 2047 maxVertexInputBindingStride = 2048 maxVertexOutputComponents = 128 maxTessellationGenerationLevel = 64 maxTessellationPatchSize = 32 maxTessellationControlPerVertexInputComponents = 128 maxTessellationControlPerVertexOutputComponents = 128 maxTessellationControlPerPatchOutputComponents = 120 maxTessellationControlTotalOutputComponents = 4096 maxTessellationEvaluationInputComponents = 128 maxTessellationEvaluationOutputComponents = 128 maxGeometryShaderInvocations = 32 maxGeometryInputComponents = 64 maxGeometryOutputComponents = 128 maxGeometryOutputVertices = 256 maxGeometryTotalOutputComponents = 1024 maxFragmentInputComponents = 128 
maxFragmentOutputAttachments = 8 maxFragmentDualSrcAttachments = 0 maxFragmentCombinedOutputResources = 64 maxComputeSharedMemorySize = 32768 maxComputeWorkGroupCount = ([0] = 4294967295, [1] = 4294967295, [2] = 4294967295) maxComputeWorkGroupInvocations = 384 maxComputeWorkGroupSize = ([0] = 384, [1] = 384, [2] = 384) subPixelPrecisionBits = 8 subTexelPrecisionBits = 8 mipmapPrecisionBits = 8 maxDrawIndexedIndexValue = 4294967295 maxDrawIndirectCount = 1 maxSamplerLodBias = 255 maxSamplerAnisotropy = 16 maxViewports = 1 maxViewportDimensions = ([0] = 16383, [1] = 16383) viewportBoundsRange = ([0] = -32766, [1] = 32765) viewportSubPixelBits = 0 minMemoryMapAlignment = 64 minTexelBufferOffsetAlignment = 64 minUniformBufferOffsetAlignment = 16 minStorageBufferOffsetAlignment = 64 minTexelOffset = -8 maxTexelOffset = 7 minTexelGatherOffset = -8 maxTexelGatherOffset = 7 minInterpolationOffset = -0.5 maxInterpolationOffset = 0.5 subPixelInterpolationOffsetBits = 4 maxFramebufferWidth = 16383 maxFramebufferHeight = 16383 maxFramebufferLayers = 256 framebufferColorSampleCounts = 13 framebufferDepthSampleCounts = 13 framebufferStencilSampleCounts = 13 framebufferNoAttachmentsSampleCounts = 29 maxColorAttachments = 8 sampledImageColorSampleCounts = 13 sampledImageIntegerSampleCounts = 13 sampledImageDepthSampleCounts = 13 sampledImageStencilSampleCounts = 13 storageImageSampleCounts = 1 maxSampleMaskWords = 1 timestampComputeAndGraphics = 1 timestampPeriod = 38.4615402 maxClipDistances = 0 maxCullDistances = 0 maxCombinedClipAndCullDistances = 0 discreteQueuePriorities = 2 pointSizeRange = ([0] = 1, [1] = 1024) lineWidthRange = ([0] = 1,
Is this a problem related to the 180MB memory address limit on Mali devices? No matter what changes I make to the shader, such as the number of threads or the index offsets, it always stops at the same index (13363).
This sample works on other non-Mali devices.
Thanks.
Hello,
That is correct: the debug z value used as the index is correct even after index 13363 in m_skinnedMeshOutputStream. So my question is, does this mean that we have to break up the meshes into smaller chunks to get around a memory addressing size problem like this? Is there a specific chunk size to use, or will it be just trial and error?
The 180MB data limit only applies to post-transform vertex data. Compute shaders don't have that issue at all, so it's not going to be that.
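As a rough cross-check of sizes, the sketch below compares the bytes needed for the packed streams with the maxStorageBufferRange value from the device dump above (268435456). It assumes 13 floats per vertex (float3 + float3 + float4 + float3) and one mesh per buffer, which may not match the real setup. packedStreamBytes and fitsStorageBufferRange are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: position (3) + normal (3) + tangent (4) + bitangent (3) floats per vertex.
constexpr std::uint64_t kFloatsPerVertex = 3 + 3 + 4 + 3;

// Bytes needed to hold all four streams for vertexCount vertices.
std::uint64_t packedStreamBytes(std::uint64_t vertexCount) {
    return vertexCount * kFloatsPerVertex * sizeof(float);
}

// True if the packed streams fit inside one storage buffer binding of the given range.
bool fitsStorageBufferRange(std::uint64_t vertexCount, std::uint64_t maxStorageBufferRange) {
    return packedStreamBytes(vertexCount) <= maxStorageBufferRange;
}
```

Under these assumptions, 13363 vertices need well under 1 MB, so a single mesh of that size would be nowhere near the reported storage buffer range. A buffer shared by many meshes could of course be much larger.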
One obvious thing to check - do you have any mediump variables around? Are you possibly overflowing those on address calculation?
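To illustrate the kind of overflow meant here, this minimal C++ sketch finds the first vertex index at which base + i * stride stops fitting in an unsigned 16-bit value. It is only an illustration of 16-bit wrap-around, assuming the index were evaluated at 16-bit precision; how a reduced-precision integer actually behaves depends on the compiler and driver. firstWrappingIndex is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Returns the first vertex index i for which base + i * stride no longer fits
// in 16 bits, or searchLimit if no such index is found below searchLimit.
std::uint32_t firstWrappingIndex(std::uint32_t base,
                                 std::uint32_t stride,
                                 std::uint32_t searchLimit = 1u << 20) {
    for (std::uint32_t i = 0; i < searchLimit; ++i) {
        const std::uint32_t wide = base + i * stride;                     // full 32-bit result
        const std::uint16_t narrow = static_cast<std::uint16_t>(wide);    // truncated to 16 bits
        if (static_cast<std::uint32_t>(narrow) != wide) {
            return i;
        }
    }
    return searchLimit;
}
```

With a base offset of 0, a stride of 3 first wraps at vertex 21846 and a stride of 4 at vertex 16384. A non-zero base offset moves the wrap point earlier, so comparing against the failing index needs the real offsets.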
If you are still stuck, are you able to share the shader (and ideally a reproducer)? If this is not something you can share publicly, please can you email it to developer@arm.com.
Kind regards, Pete
All our ints are full 32-bit, so there should not be any overflow (int32 performance aside). I am currently getting more info on releasing the APK. Once approval is given, we will have an appropriate way of sharing it.
Thank you.