I profiled my shaders and found that the Load/Store Unit (LS) counter value was extremely large. To investigate, I simplified the shader and ran some tests.
#version 450
#define LENGTH 512 // also tested with 1024

layout(set = 0, binding = 0, std140) mediump uniform ubo0
{
    mediump vec4 data[LENGTH];
} _ubo0;

layout(set = 0, binding = 1, std140) mediump uniform ubo1
{
    mediump vec4 data[LENGTH];
} _ubo1;

layout(location = 0) out mediump vec4 outColor;

void main()
{
    outColor = vec4(0);
    for (int i = 0; i < LENGTH; i++)
    {
        outColor += _ubo0.data[i];
    }
    // Second loop disabled for the first test:
    //for (int i = 0; i < LENGTH; i++)
    //{
    //    outColor += _ubo1.data[i];
    //}
}

The results seem to indicate that the LS value is related to the size of the UBO. However, when I tried the following code, the results confused me.

// Second test: the same shader, but with the reads from both UBOs
// interleaved in a single loop (shown here as confusedMain; it is the
// entry point in that run).
void confusedMain()
{
    outColor = vec4(0);
    for (int i = 0; i < LENGTH; i++)
    {
        outColor += _ubo0.data[i];
        outColor += _ubo1.data[i];
    }
}
So I have some questions about the results above.
Q1: Does the size of the UBO really affect LS? Could it be that there is a dedicated cache inside the chip, and because that cache is small, a large UBO overflows it and drives LS up?
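For reference, the raw numbers behind Q1: each std140 vec4 element is 16 bytes, so at LENGTH 512 one UBO array holds 512 × 16 B = 8 KiB, and traversing both UBOs touches 16 KiB per fragment invocation; at LENGTH 1024 those figures double to 16 KiB and 32 KiB. If uniform data sits in a small per-core cache, the larger case could plausibly thrash it. One way to separate "declared UBO size" from "bytes actually read" is to declare the large array but only traverse part of it. A minimal sketch under that assumption (the READ_COUNT knob is my own name, not from the original test):

#version 450
#define LENGTH 1024      // declared size stays large
#define READ_COUNT 512   // hypothetical knob: how much is actually read

layout(set = 0, binding = 0, std140) mediump uniform ubo0
{
    mediump vec4 data[LENGTH];
} _ubo0;

layout(location = 0) out mediump vec4 outColor;

void main()
{
    outColor = vec4(0);
    // If LS tracks the declared size, this should match the LENGTH==1024 run;
    // if it tracks bytes traversed, it should match the LENGTH==512 run.
    for (int i = 0; i < READ_COUNT; i++)
    {
        outColor += _ubo0.data[i];
    }
}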
Q2: Why do different UBO sizes give different 16-bit arithmetic results?
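One thing worth noting for Q2: in Vulkan GLSL, mediump maps to the SPIR-V RelaxedPrecision decoration, which permits 16-bit evaluation but does not require it, so what actually executes can differ between shader variants. Assuming the driver exposes GL_EXT_shader_explicit_arithmetic_types_float16 (backed by the VK_KHR_shader_float16_int8 feature), one way to pin the precision down is to force explicit 16-bit arithmetic; a sketch:

#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

#define LENGTH 512

layout(set = 0, binding = 0, std140) uniform ubo0
{
    vec4 data[LENGTH];
} _ubo0;

layout(location = 0) out vec4 outColor;

void main()
{
    // Accumulate in an explicit 16-bit vector instead of relying on mediump;
    // storage stays 32-bit, only the arithmetic is forced to FP16.
    f16vec4 acc = f16vec4(0.0hf);
    for (int i = 0; i < LENGTH; i++)
    {
        acc += f16vec4(_ubo0.data[i]);
    }
    outColor = vec4(acc);
}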
Q3: Why did different calculation orders produce different results in the example above?
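A worked example of why order can matter at 16-bit precision: FP16 has an 11-bit significand, so above 2048 the spacing between representable values is 2. Summing 2048 + 1 + 1 left to right gives (2048 + 1) + 1 = 2048 + 1 = 2048, because each intermediate 2049 is a tie that rounds back to 2048 (ties-to-even), whereas 2048 + (1 + 1) = 2050 exactly. Running the two UBO loops back to back versus interleaving them changes the running totals at every step, so different rounding is expected whenever the compiler actually evaluates mediump at 16 bits.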
Q4: Why does the UBO size affect 16-bit arithmetic?
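One way to test whether the Q2/Q4 differences are purely rounding artifacts is to keep the UBO reads mediump but accumulate in a highp temporary: if the per-size differences disappear, the UBO size was only changing how many 16-bit rounding steps the sum passes through, not the loaded values themselves. A sketch of that variant:

#version 450
#define LENGTH 512

layout(set = 0, binding = 0, std140) mediump uniform ubo0
{
    mediump vec4 data[LENGTH];
} _ubo0;

layout(location = 0) out mediump vec4 outColor;

void main()
{
    // highp accumulator: each mediump value is widened before the add,
    // so the running sum no longer loses low bits at every iteration.
    highp vec4 acc = vec4(0.0);
    for (int i = 0; i < LENGTH; i++)
    {
        acc += _ubo0.data[i];
    }
    outColor = acc; // converted back to mediump on write
}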
P.S. Happy to review your shader and provide some advice if you are able to share it. You can contact the team at developer@arm.com if you can't share publicly.