When I query the binary, I get an actual binary and nothing human-readable. I was expecting to see the generated assembly code, the way Nvidia returns it. It's really difficult to write a maxFLOPS test without seeing this assembly. Moreover, the Midgard architecture is a mix of old-school VLIW and scalar, so I never know whether scalar or vector MULs are being generated from my code.
Thanks for that information. I am hoping the vec4 MADD instructions will generate a vec4 MUL + ADD operation. So a shader like the one below should be able to fill up the VMUL and VADD units, but the SMUL, SADD and VSFU units won't be doing anything. So to achieve peak floating-point performance, should I interleave scalar MULs, ADDs and DOT4 operations?
#version 300 es
layout( location = 0 ) out highp vec4 color;
uniform highp vec4 u0;
uniform highp vec4 u1;
uniform highp vec4 u2;
uniform highp vec4 u3;
void main()
{
    highp vec4 v0 = u0;
    highp vec4 v1 = u1;
    highp vec4 v2 = u2;
    highp vec4 v3 = u3;
    highp vec4 v4 = u0 + u1;
    highp vec4 v5 = u1 + u2;
    highp vec4 v6 = u2 + u3;
    highp vec4 v7 = u3 + u0;
    for( lowp int i = 0; i < 4096; i++ )
    {
        v0 = ( v1 * v2 ) + v3;
        v1 = ( v2 * v3 ) + v4;
        v2 = ( v3 * v4 ) + v5;
        v3 = ( v4 * v5 ) + v6;
        v4 = ( v5 * v6 ) + v7;
        v5 = ( v6 * v7 ) + v0;
        v6 = ( v7 * v0 ) + v1;
        v7 = ( v0 * v1 ) + v2;
    }
    color = v0 + v1 + v2 + v3 + v4 + v5 + v6 + v7;
}
The above kernel should give peak FP perf on both pure VLIW and pure scalar architectures. But I only get around 12 GFlops/s on Note3 (Mali-T628 MP6), which has a peak of around 34 GFlops/s (17 flops/A-pipe * 4 pipes * 0.5 GHz).
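One way to turn a timed run of a shader like this into a GFlops/s figure is to count the floating-point operations each fragment executes and divide by the elapsed time. The sketch below assumes the loop body is the eight vec4 multiply-add statements shown above, counts each as 8 flops (4 multiplies plus 4 adds), and ignores the few adds outside the loop. The helper names are hypothetical, and the fragment count and timing have to come from the benchmark harness, which is not shown in this thread.

```python
# Hypothetical helper names; the constants mirror the shader posted above.
LOOP_ITERATIONS = 4096      # loop trip count in the shader
MADDS_PER_ITERATION = 8     # v0..v7 are each updated once per iteration
FLOPS_PER_VEC4_MADD = 8     # 4 multiplies + 4 adds


def flops_per_fragment(iterations=LOOP_ITERATIONS,
                       madds=MADDS_PER_ITERATION,
                       flops_per_madd=FLOPS_PER_VEC4_MADD):
    """Floating-point operations one fragment executes inside the loop."""
    return iterations * madds * flops_per_madd


def achieved_gflops(fragments, seconds):
    """Throughput of one draw that shades `fragments` pixels in `seconds`."""
    return flops_per_fragment() * fragments / seconds / 1e9
```

For a full-screen quad this would be called as achieved_gflops(width * height, t), with t taken from whatever GPU timer the test uses.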
Hi pdsharma,
Here's a similar case where we looked in detail at a shader that wasn't achieving peak perf: What is the GLops of Mali T628MP6? Can't get 17 flops per pipe using OpenGL.
I only get around 12 GFlops/s on Note3 ( Mali-628 MP6 ) which has peak around 34 GFlops ( 17Flops/A-pipe * 4 pipes * 0.5 GHz)
Not sure how you got to 4 pipes there: there are 2 per core and 6 cores in an MP6 configuration. Also, 17 flops is not the peak in your case, as you are not using the DOT, SMUL, or SADD units. Your peak here is actually 8 flops per cycle per pipe for the VMUL and VADD units. With 2 pipes per core and 6 cores, by my maths that makes the peak:
8 flops * 2 pipes per core * 6 cores * 0.5GHz = 48 GFlops.
It is not expected that you will achieve peak theoretical flops with every shader in the real world, but 1/4 does seem a tad low in this case. One of the Developer Relations engineers will take a look at this soon and get back to you.
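The peak arithmetic above can be written as a small helper, which makes it easy to redo the sum for a different unit mix or clock. This is only a sketch of the formula quoted in this reply; the function name is made up for illustration.

```python
def peak_gflops(flops_per_pipe_per_cycle, pipes_per_core, cores, clock_ghz):
    """Theoretical peak: flops/cycle/pipe x pipes/core x cores x clock in GHz."""
    return flops_per_pipe_per_cycle * pipes_per_core * cores * clock_ghz


# Values quoted in this reply: VMUL + VADD only, Mali-T628 MP6 at 0.5 GHz.
madd_only_peak = peak_gflops(8, 2, 6, 0.5)      # 48.0
measured = 12.0                                 # figure reported in the question
fraction_of_peak = measured / madd_only_peak    # 0.25
```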
Hth,
Chris
Hi Chris,
Thanks a lot for linking that thread. I didn't know you guys provide an offline shader compiler with metrics. This should be enough for what I am trying to do.
Thanks for correcting me on core count of ARM GPUs. I know real-world applications never achieve peak flops but still it's always a fun exercise to do!
Regards,
So I was finally able to achieve 42 Gflops/s using just MADDs on Note3. I measured the clock varying between 420 and 480 MHz, so assuming a 450 MHz average case:
8 flops * 2 pipes * 6 cores * 0.45 GHz = 43.2 Gflops/s
The kernel I posted above has only one problem: register spilling to main memory. So I reduced the vec4 variables from 8 to 6 and ran the shader through the offline compiler:
7 work registers used, 5 uniform registers used, spilling not used.

                A     L/S   T     Total   Bound
Cycles:         15    0     0     15      A
Shortest Path:  4.5   0     0     4.5     A
Longest Path:   -1    -1    -1    -1      A
I suspected register spilling might become an issue when using vec4 registers and this tool confirmed it. Btw, what is the difference between shortest and longest path?
Btw, what is the difference between shortest and longest path?
Shortest path assumes no conditional block or loop is executed (irrespective of the actual values of those conditions), so it's the shortest instruction sequence from main entry to the end of the program - but can be optimistic.
Longest path assumes every conditional block is executed once, but it doesn't understand loops, so in your case reports -1 as an unknown value.
HTH,
Pete
I also measured the flops of VADD and VMUL separately, and here are the results:
vec4 ADD : 21.3 Gflops/s
vec4 MUL : 42.3 Gflops/s
vec4 MADD : 42.6 Gflops/s
The shaders for these are very similar; only the instructions inside the loop change. Here is the vec4 MUL version, for example:
uniform lowp int numLoopIterations;
for( lowp int i = 0; i < numLoopIterations; i++ )
{
    v0 = v1 * v2;
    v1 = v2 * v3;
    v2 = v3 * v4;
    v3 = v4 * v5;
    v4 = v5 * v0;
    v5 = v0 * v1;
}
color = v0 + v1 + v2 + v3 + v4 + v5;
Is it possible the VMUL is being executed on a separate unit but not VADD?
Yes, under certain circumstances the compiler can use the multiply functionality of the dot product unit to perform a VMUL, but not a VADD. This means you can issue a VADD + VMUL or a VMUL + VMUL for 8 flops a cycle, whereas VADD on its own is only 4 flops a cycle.
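As a rough cross-check, the per-cycle figures above can be combined with the 2 pipes x 6 cores and the 450 MHz average clock assumed earlier in the thread, then compared with the measured numbers. This is a back-of-the-envelope sketch, not a model of the real scheduler; the names are illustrative only.

```python
PIPES = 2 * 6          # 2 arithmetic pipes per core x 6 cores (from the thread)
CLOCK_GHZ = 0.45       # average clock assumed in the earlier post

# Flops per pipe per cycle implied by the explanation above.
FLOPS_PER_CYCLE = {
    "vec4 ADD": 4,     # VADD unit only
    "vec4 MUL": 8,     # VMUL unit + multiplier borrowed from the dot product unit
    "vec4 MADD": 8,    # VMUL unit + VADD unit
}

# Throughput figures reported in the thread (Gflops/s).
MEASURED = {"vec4 ADD": 21.3, "vec4 MUL": 42.3, "vec4 MADD": 42.6}


def predicted_gflops(op):
    """Predicted throughput for one operation type under the assumptions above."""
    return FLOPS_PER_CYCLE[op] * PIPES * CLOCK_GHZ


for op, measured in MEASURED.items():
    predicted = predicted_gflops(op)
    print(f"{op}: predicted {predicted:.1f}, measured {measured:.1f} "
          f"({measured / predicted:.0%} of predicted)")
```

Under these assumptions all three measurements land within a few percent of the prediction, which is consistent with VADD running at half the rate of VMUL and MADD.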
Thanks for the info!