When I query the binary, I really do get a binary and nothing human-readable. I was expecting to see the generated assembly code, like how NVIDIA returns it. It's really difficult to write a max-FLOPS test without seeing this assembly. Moreover, the Midgard architecture is a mix of old-school VLIW and scalar, so I never know whether scalar or vector MULs are being generated from my code.
Sonic's reply is bang on: there are no guarantees about the data returned by glGetProgramBinaryOES; contractually it is only intended to be passed back to glProgramBinaryOES. ARM do not currently provide public access to the ISA documentation for Mali GPUs, or any shader disassembly tools.
pdsharma wrote:
the Midgard architecture is a mix of old-school VLIW and scalar, so I never know whether scalar or vector MULs are being generated from my code.
Just to clarify: the vector and scalar functional units present in the A-pipe are all addressable by a single VLIW instruction word; there is no separate scalar instruction type. If you are performing a multiply on two vectors, it will be executed in the VMUL unit; conversely, float * float is executed in the SMUL unit. I don't believe there's anything fancier going on in that regard.
Hth,
Chris
Thanks for that information. I am hoping each vec4 MADD instruction generates a vec4 MUL plus a vec4 ADD, so a shader like the one below should be able to fill the VMUL and VADD units while the SMUL, SADD and VSFU units sit idle. So to achieve peak floating-point performance I should interleave scalar MULs, scalar ADDs and DOT4 operations?
#version 300 es

layout( location = 0 ) out highp vec4 color;

uniform highp vec4 u0;
uniform highp vec4 u1;
uniform highp vec4 u2;
uniform highp vec4 u3;

void main()
{
    highp vec4 v0 = u0;
    highp vec4 v1 = u1;
    highp vec4 v2 = u2;
    highp vec4 v3 = u3;
    highp vec4 v4 = u0 + u1;
    highp vec4 v5 = u1 + u2;
    highp vec4 v6 = u2 + u3;
    highp vec4 v7 = u3 + u0;
    // highp int: a lowp int is not guaranteed to represent 4096
    for( highp int i = 0; i < 4096; i++ )
    {
        v0 = ( v1 * v2 ) + v3;
        v1 = ( v2 * v3 ) + v4;
        v2 = ( v3 * v4 ) + v5;
        v3 = ( v4 * v5 ) + v6;
        v4 = ( v5 * v6 ) + v7;
        v5 = ( v6 * v7 ) + v0;
        v6 = ( v7 * v0 ) + v1;
        v7 = ( v0 * v1 ) + v2;
    }
    color = v0 + v1 + v2 + v3 + v4 + v5 + v6 + v7;
}
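Counting flops for the shader above: each vec4 multiply-add is 8 flops (4 multiplies + 4 adds), and the loop issues 8 of them per iteration. A quick sketch of the per-fragment flop count, assuming the compiler keeps all 8 MADs per iteration and nothing is eliminated:

```python
# Per-fragment flop count for the loop above (assumption: the
# compiler does not dead-code-eliminate or reorder the 8 vec4 MADs).
MADS_PER_ITERATION = 8        # v0..v7 each updated once per iteration
FLOPS_PER_VEC4_MAD = 8        # 4 multiplies + 4 adds
ITERATIONS = 4096

flops_per_fragment = MADS_PER_ITERATION * FLOPS_PER_VEC4_MAD * ITERATIONS
print(flops_per_fragment)     # flops issued per fragment shaded
```

Multiplying this by the number of fragments shaded per second gives the measured flop rate to compare against the theoretical peak.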
The above kernel should give peak FP performance on both pure-VLIW and pure-scalar architectures. But I only get around 12 GFLOPS on a Note 3 ( Mali-T628 MP6 ), which I thought had a peak of around 34 GFLOPS ( 17 flops/A-pipe * 4 pipes * 0.5 GHz ).
Hi pdsharma,
Here's a similar case where we looked in detail at a shader that wasn't achieving peak perf: What is the GLops of Mali T628MP6? Can't get 17 flops per pipe using OpenGL.
I only get around 12 GFLOPS on a Note 3 ( Mali-T628 MP6 ), which I thought had a peak of around 34 GFLOPS ( 17 flops/A-pipe * 4 pipes * 0.5 GHz ).
Not sure how you got to 4 pipes there: there are 2 A-pipes per core and 6 cores in an MP6 configuration. And 17 flops is not the peak in your case, because you are not using the DOT, SMUL, or SADD units; your shader only exercises the VMUL and VADD units, giving 8 flops per pipe. So by my maths the peak is:
8 flops * 2 pipes per core * 6 cores * 0.5 GHz = 48 GFLOPS.
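That arithmetic can be checked directly, and the same sketch shows how far the measured 12 GFLOPS sits below the ceiling:

```python
# Peak for this shader on a Mali-T628 MP6: only the VMUL and VADD
# units are exercised, so 8 flops per A-pipe per cycle.
flops_per_pipe = 8            # 4 from VMUL + 4 from VADD
pipes_per_core = 2            # 2 A-pipes per core
cores = 6                     # MP6 configuration
clock_ghz = 0.5

peak_gflops = flops_per_pipe * pipes_per_core * cores * clock_ghz
print(peak_gflops)            # 48.0 GFLOPS

measured_gflops = 12
print(measured_gflops / peak_gflops)  # 0.25 -> one quarter of peak
```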
It is not expected that you will achieve the theoretical peak flops with every shader in the real world, but 1/4 of peak does seem a tad low in this case. One of the Developer Relations engineers will take a look at this soon and get back to you.
Hi Chris,
Thanks a lot for linking that thread. I didn't know you guys provided an offline shader compiler with cycle metrics. That should be enough for what I am trying to do.
And thanks for correcting me on the core count of ARM GPUs. I know real-world applications never achieve peak flops, but it's always a fun exercise anyway!
Regards,