When I query the binary, I really do get a binary blob and nothing human-readable. I was expecting to see the generated assembly code, the way Nvidia returns it. It's really difficult to write a max-FLOPS test without seeing this assembly. Moreover, the Midgard architecture is a mishmash of old-school VLIW and scalar, so I never know whether scalar or vector MULs are being generated from my code.
I also measured the throughput of VADD and VMUL separately, and here are the results:
vec4 ADD : 21.3 GFLOPS
vec4 MUL : 42.3 GFLOPS
vec4 MADD : 42.6 GFLOPS
The shaders for these are very similar; only the instruction inside the loop changes. Here is the vec4 MUL one, for example:
#version 300 es
layout( location = 0 ) out highp vec4 color;
uniform highp vec4 u0;
uniform highp vec4 u1;
uniform highp vec4 u2;
uniform highp vec4 u3;
uniform lowp int numLoopIterations;
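void main()
{
    highp vec4 v0 = u0;
    highp vec4 v1 = u1;
    highp vec4 v2 = u2;
    highp vec4 v3 = u3;
    highp vec4 v4 = u0 + u1;
    highp vec4 v5 = u1 + u2;
    // Six dependent vec4 multiplies per iteration; each result feeds a later one
    // so the compiler cannot drop any of them.
    for( lowp int i = 0; i < numLoopIterations; i++ )
    {
        v0 = v1 * v2;
        v1 = v2 * v3;
        v2 = v3 * v4;
        v3 = v4 * v5;
        v4 = v5 * v0;
        v5 = v0 * v1;
    }
    // Sum everything into the output so all six registers stay live.
    color = v0 + v1 + v2 + v3 + v4 + v5;
}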
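For anyone reproducing these numbers, here is a rough sketch of how a GFLOPS figure can be derived from a shader like this one. It is not necessarily how the figures above were computed. The function name, its parameters, and the example inputs are all hypothetical placeholders. It assumes the usual counting convention of 1 flop per lane for ADD or MUL and 2 per lane for MADD, and it ignores the handful of adds outside the loop.

```python
def shader_gflops(instr_per_iter, lanes, flops_per_lane,
                  iterations, fragments, seconds):
    """Estimate GFLOPS for a loop-dominated fragment shader.

    All parameters are hypothetical inputs for illustration:
      instr_per_iter  vector instructions in the loop body (6 in the shader above)
      lanes           vector width (4 for vec4)
      flops_per_lane  1 for ADD or MUL, 2 for MADD (assumed convention)
      iterations      value passed as numLoopIterations
      fragments       number of fragments shaded
      seconds         measured GPU time for the draw
    """
    total_flops = instr_per_iter * lanes * flops_per_lane * iterations * fragments
    return total_flops / seconds / 1e9


# Placeholder inputs, not measurements from this thread.
example = shader_gflops(instr_per_iter=6, lanes=4, flops_per_lane=1,
                        iterations=1000, fragments=1_000_000, seconds=1.0)
print(example)
```

The one thing to watch is the counting convention: if MADD is counted as 1 flop per lane in one test and 2 in another, the results are not comparable.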
Is it possible that VMUL is being executed on a separate unit, but VADD is not?
Yes, under certain circumstances the compiler can use the multiply functionality from the dot product to perform a VMUL, but not a VADD. This means you can do a VADD + VMUL or a VMUL + VMUL for 8 flops per cycle, whereas VADD alone is only 4 flops per cycle.
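As a quick sanity check, the measured numbers posted above can be compared against the 8 versus 4 flops per cycle figures. This is only a sketch: the variable names are made up for illustration, and it assumes the ADD and MUL tests ran at the same clock on the same number of cores, so that only flops per cycle differs.

```python
# Measured throughput from the post above, in GFLOPS.
measured_gflops = {"vec4 ADD": 21.3, "vec4 MUL": 42.3, "vec4 MADD": 42.6}

# Per-cycle rates stated in the answer.
flops_per_cycle_vadd_only = 4
flops_per_cycle_vmul_pair = 8

# If clock and core count are equal, throughput scales with flops per cycle.
expected_ratio = flops_per_cycle_vmul_pair / flops_per_cycle_vadd_only
observed_ratio = measured_gflops["vec4 MUL"] / measured_gflops["vec4 ADD"]
relative_error = abs(observed_ratio - expected_ratio) / expected_ratio

print(f"expected MUL/ADD ratio: {expected_ratio:.2f}")
print(f"observed MUL/ADD ratio: {observed_ratio:.3f}")
print(f"relative error:         {relative_error:.2%}")
```

The observed ratio comes out within about one percent of 2, which is consistent with VMUL being dual-issued while VADD is not.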
Thanks for the info!