When I query the program binary, I get a raw binary and nothing human-readable. I was expecting to see the generated assembly code, the way Nvidia returns it. It's really difficult to write a max-FLOPS test without seeing this assembly. Moreover, the Midgard architecture is a mix of old-school VLIW and scalar, so I never know whether scalar or vector MULs are being generated from my code.
So I was finally able to achieve 42 Gflops/s using just MADDs on the Note 3. I measured the clock varying between 420 and 480 MHz, so assuming a 450 MHz average:
8 flops * 2 pipes * 6 cores * 0.45 GHz = 43.2 Gflops/s
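To make the arithmetic above easy to re-run for other clock speeds, here is a minimal Python sketch. The function name `peak_gflops` and its parameters are my own, for illustration only; the inputs (8 flops per pipe per cycle, 2 pipes, 6 cores, 420-480 MHz) are the figures quoted in the post.

```python
def peak_gflops(flops_per_cycle_per_pipe, pipes_per_core, cores, clock_mhz):
    """Theoretical peak in Gflops: flops per cycle across the GPU times cycles per second."""
    flops_per_cycle = flops_per_cycle_per_pipe * pipes_per_core * cores
    return flops_per_cycle * clock_mhz * 1e6 / 1e9


# Figures from the post: 8 flops per pipe per cycle, 2 pipes, 6 cores.
# 420 and 480 MHz are the measured clock bounds, 450 MHz the assumed average.
for mhz in (420, 450, 480):
    print(f"{mhz} MHz -> {peak_gflops(8, 2, 6, mhz):.2f} Gflops")
```

Because the clock moves within that range, the 43.2 figure is better read as the middle of a band than as a hard ceiling.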
The kernel I posted above has only one problem: register spilling to main memory. So I reduced the vec4 variables from 8 to 6 and ran the shader through the offline compiler:
7 work registers used, 5 uniform registers used, spilling not used.
                A     L/S   T     Total  Bound
Cycles:         15    0     0     15     A
Shortest Path:  4.5   0     0     4.5    A
Longest Path:   1     -1    -1    -1     A
I suspected register spilling might become an issue when using vec4 registers and this tool confirmed it. Btw, what is the difference between shortest and longest path?
Shortest path assumes no conditional block or loop is executed (irrespective of the actual values of those conditions), so it's the shortest instruction sequence from main entry to the end of the program - but can be optimistic.
Longest path assumes every conditional block is executed once, but it doesn't understand loops, so in your case reports -1 as an unknown value.
HTH, Pete
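As a rough illustration of the two rules described above, here is a toy Python model. It is not how the offline compiler is implemented; the block list, the cycle costs and the function names are all made up for the example. Only the rules themselves (skip conditionals and loops for the shortest path, run each conditional once for the longest path, report -1 when a loop is present) come from the explanation.

```python
UNKNOWN = -1  # the value reported when a path cannot be bounded


def shortest_path(blocks):
    # Conditional blocks and loops are assumed never to execute.
    return sum(cycles for cycles, kind in blocks if kind == "always")


def longest_path(blocks):
    # Loops are not understood, so their presence makes the result unknown.
    if any(kind == "loop" for _, kind in blocks):
        return UNKNOWN
    # Otherwise every conditional block is assumed to execute exactly once.
    return sum(cycles for cycles, _ in blocks)


# Each block is (illustrative cycle cost, kind).
branchy = [(3.0, "always"), (2.0, "conditional"), (1.0, "always")]
looping = [(3.0, "always"), (10.0, "loop"), (1.0, "always")]

print(shortest_path(branchy), longest_path(branchy))
print(shortest_path(looping), longest_path(looping))
```

For a benchmark whose work sits almost entirely inside a loop, this is why the shortest-path figure says little about real cost and the longest-path figure is unavailable.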
I also measured the flops of VADD and VMUL separately, and here are the results:
vec4 ADD : 21.3 Gflops/s
vec4 MUL : 42.3 Gflops/s
vec4 MADD : 42.6 Gflops/s
The shaders for these are very similar; only the instruction inside the loop changes. Here is the vec4 MUL version, for example:
#version 300 es
layout( location = 0 ) out highp vec4 color;
uniform highp vec4 u0;
uniform highp vec4 u1;
uniform highp vec4 u2;
uniform highp vec4 u3;
uniform lowp int numLoopIterations;

void main()
{
    // Six live vec4 values, reduced from eight to avoid register spilling.
    highp vec4 v0 = u0;
    highp vec4 v1 = u1;
    highp vec4 v2 = u2;
    highp vec4 v3 = u3;
    highp vec4 v4 = u0 + u1;
    highp vec4 v5 = u1 + u2;

    // Each iteration issues six vec4 MULs.
    for( lowp int i = 0; i < numLoopIterations; i++ )
    {
        v0 = v1 * v2;
        v1 = v2 * v3;
        v2 = v3 * v4;
        v3 = v4 * v5;
        v4 = v5 * v0;
        v5 = v0 * v1;
    }

    // Use every result so the compiler cannot discard the loop as dead code.
    color = v0 + v1 + v2 + v3 + v4 + v5;
}
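For anyone wanting to turn a timed run of a shader like this into a Gflops number, here is a minimal Python sketch of the bookkeeping. The post does not say how the timing was done, so the function `measured_gflops`, its parameters and the example inputs are assumptions for illustration. It counts only the work inside the loop and ignores the few adds outside it.

```python
def measured_gflops(fragments, loop_iterations, vec4_ops_per_iteration,
                    flops_per_lane, seconds):
    """Achieved Gflops for a shader whose work is dominated by its loop.

    flops_per_lane is 1 for ADD or MUL and 2 for MADD; a vec4 has 4 lanes.
    """
    flops_per_fragment = (loop_iterations * vec4_ops_per_iteration
                          * 4 * flops_per_lane)
    return fragments * flops_per_fragment / seconds / 1e9


# Hypothetical inputs chosen only to exercise the function:
# one million fragments, 1000 loop iterations, 1 second of GPU time,
# and the six vec4 MULs per iteration from the shader above.
print(measured_gflops(1_000_000, 1000, 6, 1, 1.0))
```

The same function covers the MADD variant by passing `flops_per_lane=2`, which is where the 8 flops per vec4 instruction in the peak estimate comes from.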
Is it possible the VMUL is being executed on a separate unit but not VADD?
Yes, under certain circumstances the compiler can use the multiply functionality of the dot-product unit to perform a VMUL, but not a VADD. This means you can issue a VADD + VMUL or a VMUL + VMUL for 8 flops a cycle, whereas VADD on its own is only 4 flops a cycle.
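As a sanity check on this explanation, the measured numbers can be converted back into flops per cycle per pipe. This is a small Python sketch: the 450 MHz clock is the assumed average from earlier in the thread, and the constant and function names are hypothetical.

```python
PIPES_PER_CORE = 2
CORES = 6
CLOCK_MHZ = 450  # assumed average of the measured 420-480 MHz range

# Measured throughput from the thread, in Gflops.
measured = {"vec4 ADD": 21.3, "vec4 MUL": 42.3, "vec4 MADD": 42.6}


def flops_per_cycle_per_pipe(gflops, clock_mhz=CLOCK_MHZ):
    # Total pipe-cycles available per second across the whole GPU.
    pipe_cycles_per_second = clock_mhz * 1e6 * PIPES_PER_CORE * CORES
    return gflops * 1e9 / pipe_cycles_per_second


for name, gflops in measured.items():
    print(f"{name}: {flops_per_cycle_per_pipe(gflops):.2f} flops/cycle/pipe")
```

Under that clock assumption, ADD lands just under 4 while MUL and MADD land just under 8, which lines up with the 4 versus 8 flops a cycle described above.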
Thanks for the info!