Load/Store unit and 16-bit arithmetic values from Mali Offline Compiler are not as expected

I profiled my shaders and found that the Load/Store (LS) unit value is extremely large, so I simplified the shader and ran some tests.

Environment: Mali-G715, glslc from the latest Vulkan SDK.

#version 450
#define LENGTH  512 // also tested with 1024

layout(set = 0, binding = 0, std140) mediump uniform ubo0 {
    mediump vec4 data[LENGTH];
} _ubo0;

layout(set = 0, binding = 1, std140) mediump uniform ubo1 {
    mediump vec4 data[LENGTH];
} _ubo1;

layout(location = 0) out mediump vec4 outColor;

void main() 
{
    outColor = vec4(0);
    for(int i = 0; i < LENGTH; i++)
    {
        outColor += _ubo0.data[i];
    }
    
    // Second loop over _ubo1, uncommented for the two-uniform runs of main.
    //for(int i = 0; i < LENGTH; i++)
    //{
    //    outColor += _ubo1.data[i];
    //}
}

// Variant that reads both UBOs inside the same loop.
void confusedMain()
{
    outColor = vec4(0);
    for(int i = 0; i < LENGTH; i++)
    {
        outColor += _ubo0.data[i];
        outColor += _ubo1.data[i];
    }
}

The profile result:

Uniform Count	Per-Uniform Length	LS	16-bit arithmetic	Uniform Register	Function
1	512	0.00	N/A	2 (3% used)	main
1	1024	2.00	0.0	2 (3% used)	main
2	512	0.0	N/A	2 (3% used)	main
2	1024	4.00	0.0	2 (3% used)	main
2	512	4.00	0.0	2 (3% used)	confusedMain
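
For reference, here is a small sketch of the UBO byte sizes behind each row. It assumes the std140 rule that each element of a vec4 array has a 16-byte stride. The helper name ubo_size_bytes is just something I made up for this calculation, and it says nothing about how the hardware actually caches uniforms.

```python
# Assumption: std140 layout, where each element of a vec4 array has a 16-byte stride.
VEC4_STRIDE_BYTES = 16


def ubo_size_bytes(length: int) -> int:
    """Size in bytes of a std140 block whose only member is `vec4 data[length]`."""
    return length * VEC4_STRIDE_BYTES


# (uniform count, per-uniform length) pairs taken from the table above.
for ubo_count, length in [(1, 512), (1, 1024), (2, 512), (2, 1024)]:
    per_ubo = ubo_size_bytes(length)
    total = ubo_count * per_ubo
    print(f"{ubo_count} x vec4[{length}]: {per_ubo} bytes each, {total} bytes total")
```

Two 512-element UBOs add up to the same 16384 bytes as one 1024-element UBO, yet main reports LS 0.0 for the former and 2.00 for the latter, so the total size alone does not seem to explain the numbers.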

The results for main seem to indicate that the LS value is related to the size of the UBO. However, the result for confusedMain (shown above) confused me: it reads the same two 512-element UBOs, but in a single loop, and LS goes from 0.0 to 4.00.

So I have some questions about the result above.

Q1: Does the size of the UBO really affect LS? Could it be that there is a special cache inside the chip, but due to the limited cache size, a large UBO increases LS?

Q2: Why do different UBO sizes give different 16-bit arithmetic results?

Q3: Why do different calculation orders (two separate loops in main versus one combined loop in confusedMain) produce different results in the example above?

Q4: Why does the UBO size affect 16-bit arithmetic?