Load/Store Unit and 16-bit arithmetic values from Mali Offline Compiler (malioc) are not as expected

I profiled my shaders and found that the Load/Store Unit (LS) value was extremely large, so I simplified the shader and ran some tests.

Environment: Mali-G715, glslc from the latest Vulkan SDK.

#version 450
#define LENGTH  512 // 1024

layout(set = 0, binding = 0, std140) mediump uniform ubo0 {
mediump vec4 data[LENGTH];
} _ubo0;

layout(set = 0, binding = 1, std140) mediump uniform ubo1 {
mediump vec4 data[LENGTH];
} _ubo1;

layout(location = 0) out mediump vec4 outColor;

void main() 
{
    outColor = vec4(0);
    for(int i = 0; i < LENGTH; i++)
    {
        outColor+= _ubo0.data[i];
    }
    
    //for(int i = 0; i < LENGTH; i++)
    //{
    //    outColor+= _ubo1.data[i];
    //}
}

void confusedMain() 
{
    outColor = vec4(0);
    for(int i = 0; i < LENGTH; i++)
    {
        outColor+= _ubo0.data[i];
        outColor+= _ubo1.data[i];
    }
}

Profile results:

Uniform Count | Per-Uniform Length | LS   | 16-bit arithmetic | Uniform Register | Function
1             | 512                | 0.00 | N/A               | 2 (3% used)      | main
1             | 1024               | 2.00 | 0.0               | 2 (3% used)      | main
2             | 512                | 0.0  | N/A               | 2 (3% used)      | main
2             | 1024               | 4.00 | 0.0               | 2 (3% used)      | main
2             | 512                | 4.00 | 0.0               | 2 (3% used)      | confusedMain

The results for main seem to indicate that the LS value is related to the size of the UBO. However, when I tried the confusedMain variant shown above, which reads both UBOs in a single loop, the results confused me.

So I have some questions about the results above.

Q1: Does the size of the UBO really affect LS? Could it be that there is a special cache inside the chip, but due to the limited cache size, a large UBO increases LS?

Q2: Why do different UBO sizes give different 16-bit arithmetic results?

Q3: Why did different calculation orders produce different results in the example above?

Q4: Why does the UBO size affect 16-bit arithmetic?

  • Q1: Does the size of the UBO really affect LS? Could it be that there is a special cache inside the chip, but due to the limited cache size, a large UBO increases LS?

    Every input to the calculation is a uniform, so the result is the same for every thread. The compiler optimizes it out of the per-thread code, so it is not recomputed per thread.

    What's left is just compiler noise. It's clearly not recomputing 1024 additions if there is no significant amount of arithmetic reported.
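
    One way to check this, as a rough sketch rather than a definitive test: make the loop depend on a per-fragment value, so the result is no longer the same for every thread. The `inUV` input and the `COUNT` constant below are assumptions added for illustration and are not part of the original shader.

```glsl
#version 450
#define LENGTH 512   // must stay a power of two for the index mask below
#define COUNT  64    // hypothetical: number of elements each fragment sums

layout(set = 0, binding = 0, std140) uniform ubo0 {
    mediump vec4 data[LENGTH];
} _ubo0;

// Hypothetical per-fragment input; any value that varies between threads would do.
layout(location = 0) in mediump vec2 inUV;
layout(location = 0) out mediump vec4 outColor;

void main()
{
    // The start index differs per fragment, so each thread sums a
    // different window of the array and the result is not uniform.
    int start = int(inUV.x * float(LENGTH));

    mediump vec4 sum = vec4(0.0);
    for (int i = 0; i < COUNT; i++)
    {
        // The mask keeps the index inside [0, LENGTH - 1].
        sum += _ubo0.data[(start + i) & (LENGTH - 1)];
    }
    outColor = sum;
}
```

    With a shader like this, the reported LS and arithmetic values should reflect work that is actually done per thread, which makes comparisons between UBO sizes more meaningful.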
