When I query the program binary, I really get a binary blob and nothing human-readable. I was expecting to see the generated assembly code, the way Nvidia returns it. It's really difficult to write a max-FLOPS test without seeing this assembly. Moreover, the Midgard architecture is a mix of old-school VLIW and scalar, so I never know whether scalar or vector MULs are being generated from my code.
The binary format returned from glGetProgramBinary is entirely driver-dependent, and depending on the GPU vendor it can be GPU-family-dependent as well.
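For reference, pulling the blob out in the first place looks something like this (standard ES 3.0 / desktop GL 4.1 entry points; error checking omitted, and what you actually get back varies by vendor):

/* Dump a linked program's binary blob to disk for offline inspection.
   Assumes a current GL context and an already-linked program `prog`. */
#include <stdio.h>
#include <stdlib.h>
#include <GLES3/gl3.h>   /* or the desktop GL headers */

static void dump_program_binary(GLuint prog, const char *path)
{
    GLint len = 0;
    glGetProgramiv(prog, GL_PROGRAM_BINARY_LENGTH, &len);

    void *blob = malloc(len);
    GLenum format = 0;   /* vendor-specific format token */
    glGetProgramBinary(prog, len, NULL, &format, blob);

    FILE *f = fopen(path, "wb");
    fwrite(blob, 1, len, f);
    fclose(f);
    free(blob);

    printf("format 0x%04X, %d bytes\n", format, len);
}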
You're lucky with the binaries Nvidia returns, because they use ARB assembly for their program binaries, which is kept in ASCII and is therefore human-readable.
Most other vendors return a binary representation that isn't human-readable.
Qualcomm, for example, returns the program compiled to the GPU's native architecture, plus a bunch of metadata, and also the original program sources. They return this "encrypted" with a simple XOR "encryption" to keep it from being human-readable.
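Purely as a sketch, and assuming the obfuscation really is a single repeating XOR byte (that's my assumption, not anything Qualcomm documents), the fact that the blob embeds the original GLSL source means a known-plaintext scan can recover candidate keys:

/* Hypothetical: try every single-byte XOR key and look for a GLSL
   crib such as "#version" in the decoded output. */
#include <stdio.h>
#include <string.h>

static void scan_xor_keys(const unsigned char *blob, size_t n)
{
    const char *crib = "#version";   /* plaintext we expect to find */
    size_t cl = strlen(crib);

    for (int key = 1; key < 256; key++) {
        for (size_t i = 0; i + cl <= n; i++) {
            size_t j = 0;
            while (j < cl && (unsigned char)(blob[i + j] ^ key) == (unsigned char)crib[j])
                j++;
            if (j == cl) {
                printf("candidate key 0x%02X at offset %zu\n", key, i);
                break;
            }
        }
    }
}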
You really need either a shader disassembler that can disassemble the program binaries, or an offline shader compiler that outputs the generated code in a human-readable fashion.
Companies tend to dislike divulging any information about the GPU, which includes the ISA that the GPU runs.
If a company released a fully documented ISA, it would help enthusiasts reverse-engineer the GPU and write open-source drivers.
An offline shader compiler really helps when optimizing for a platform, though. AMD's ShaderAnalyzer is a good example: it shows the compiled shader output and the bottlenecks on that particular platform.
Just my two cents.
Sonic's reply is bang on: there are no guarantees about the data returned by glGetProgramBinaryOES; contractually it is only intended to be passed back to glProgramBinaryOES. ARM do not currently provide public access to the ISA documentation for Mali GPUs, or any shader disassembly tools.
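To illustrate the one supported use, a cache-and-reload path looks roughly like this (a sketch: build_program_from_source is a hypothetical placeholder for your normal compile-and-link path, and the format/blob/length arguments are whatever glGetProgramBinary handed back earlier):

#include <GLES3/gl3.h>

extern GLuint build_program_from_source(void);   /* hypothetical fallback helper */

static GLuint load_program_binary(GLenum format, const void *blob, GLsizei len)
{
    GLuint prog = glCreateProgram();
    glProgramBinary(prog, format, blob, len);

    GLint ok = GL_FALSE;
    glGetProgramiv(prog, GL_LINK_STATUS, &ok);
    if (ok)
        return prog;

    /* Blob rejected (e.g. after a driver update): rebuild from source. */
    glDeleteProgram(prog);
    return build_program_from_source();
}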
pdsharma wrote:
the Midgard architecture is a mix of old-school VLIW and scalar, so I never know whether scalar or vector MULs are being generated from my code.
Just to clarify: the vector and scalar functional units present in the A-pipe are both driven by a single VLIW instruction word; there is no separate scalar instruction type. If you perform a multiply on two vectors, it will be executed in the VMUL unit; conversely, float * float is executed on the SMUL. I don't believe there's anything fancy going on in that regard.
Hth,
Chris
Honestly, I wasn't expecting to see a human-readable binary from any of the vendors, especially Nvidia. But even so, PTX isn't really useful for my purposes.
I agree with your sentiment that a tool like AMD's ShaderAnalyzer is very useful. I was able to get close to peak FP performance in my matrix-multiplication code using that tool. Without it, it's like trying to throw a coin from the top of a building into a bucket down below: there's a lot of guesswork involved.
Thanks for that information. I'm hoping each vec4 MADD generates a vec4 MUL + ADD pair, so a shader like the one below should keep both the VMUL and VADD units busy while the SMUL, SADD and VSFU units do nothing. Does that mean that to achieve peak floating-point performance I should also interleave scalar MULs, scalar ADDs and DOT4 operations?
#version 300 es
layout( location = 0 ) out highp vec4 color;
uniform highp vec4 u0;
uniform highp vec4 u1;
uniform highp vec4 u2;
uniform highp vec4 u3;
void main()
{
    highp vec4 v0 = u0;
    highp vec4 v1 = u1;
    highp vec4 v2 = u2;
    highp vec4 v3 = u3;
    highp vec4 v4 = u0 + u1;
    highp vec4 v5 = u1 + u2;
    highp vec4 v6 = u2 + u3;
    highp vec4 v7 = u3 + u0;

    // mediump: lowp int is only guaranteed a 9-bit range, too small for 4096
    for( mediump int i = 0; i < 4096; i++ )
    {
        // 8 vec4 MADDs per iteration
        v0 = ( v1 * v2 ) + v3;
        v1 = ( v2 * v3 ) + v4;
        v2 = ( v3 * v4 ) + v5;
        v3 = ( v4 * v5 ) + v6;
        v4 = ( v5 * v6 ) + v7;
        v5 = ( v6 * v7 ) + v0;
        v6 = ( v7 * v0 ) + v1;
        v7 = ( v0 * v1 ) + v2;
    }

    color = v0 + v1 + v2 + v3 + v4 + v5 + v6 + v7;
}
The above kernel should reach peak FP performance on both pure-VLIW and pure-scalar architectures. But I only get around 12 GFlops on a Note 3 (Mali-T628 MP6), which I make out to have a peak of around 34 GFlops (17 flops/A-pipe * 4 pipes * 0.5 GHz).
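For reference, the flop accounting behind my numbers is just this (my own arithmetic, counting each vec4 MADD as 4 muls + 4 adds):

/* Achieved throughput for the kernel above: 8 vec4 MADDs per
   iteration = 8 * 8 = 64 flops per iteration per fragment. */
double shader_gflops(double fragments, double iterations, double seconds)
{
    const double flops_per_iteration = 8.0 * 8.0;
    return fragments * iterations * flops_per_iteration / (seconds * 1e9);
}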
Hi pdsharma,
Here's a similar case where we looked in detail at a shader that wasn't achieving peak perf: What is the GLops of Mali T628MP6? Can't get 17 flops per pipe using OpenGL.
I only get around 12 GFlops on a Note 3 (Mali-T628 MP6), which I make out to have a peak of around 34 GFlops (17 flops/A-pipe * 4 pipes * 0.5 GHz).
Not sure how you got to 4 pipes there: there are 2 per core and 6 cores in an MP6 configuration. And 17 flops is not the peak in your case, since you are not using the DOT, SMUL or SADD units; your shader's peak is the 8 flops per cycle available from the VMUL and VADD units. So by my maths the peak is:
8 flops * 2 pipes per core * 6 cores * 0.5 GHz = 48 GFlops.
It is not expected that you will achieve theoretical peak flops with every real-world shader, but 1/4 of peak does seem a tad low in this case. One of the Developer Relations engineers will take a look at this soon and get back to you.
Hi Chris,
Thanks a lot for linking that thread. I didn't know you guys provide an offline shader compiler with metrics. This should be enough for what I am trying to do.
Thanks for correcting me on the core count of ARM GPUs. I know real-world applications never achieve peak flops, but it's always a fun exercise!
Regards,
So I was finally able to achieve 42 GFlops using just MADDs on the Note 3. I measured the clock varying between 420 and 480 MHz, so assuming a 450 MHz average:
8 flops * 2 pipes * 6 cores * 0.45 GHz = 43.2 GFlops
The kernel I posted above had only one problem: register spilling to main memory. So I reduced the number of vec4 variables from 8 to 6 and ran the shader through the offline compiler:
7 work registers used, 5 uniform registers used, spilling not used.
                 A      L/S    T      Total   Bound
Cycles:          15     0      0      15      A
Shortest Path:   4.5    0      0      4.5     A
Longest Path:    -1     -1     -1     -1      A
I suspected register spilling might become an issue with that many vec4 variables, and this tool confirmed it. Btw, what is the difference between the shortest and longest path?
Btw, what is the difference between the shortest and longest path?
Shortest path assumes no conditional block or loop body is executed (irrespective of the actual values of those conditions), so it's the shortest instruction sequence from the start of main to the end of the program - but it can be optimistic.
Longest path assumes every conditional block is executed once, but it doesn't understand loops, so in your case it reports -1, meaning unknown.
HTH,
Pete
I also measured the flops of VADD and VMUL separately; here are the results:
vec4 ADD  : 21.3 GFlops
vec4 MUL  : 42.3 GFlops
vec4 MADD : 42.6 GFlops
The shaders for these are all very similar; only the instruction inside the loop changes. Here is the vec4 MUL variant, for example:
uniform mediump int numLoopIterations;   // mediump: lowp int only guarantees a 9-bit range

for( mediump int i = 0; i < numLoopIterations; i++ )
{
    v0 = v1 * v2;
    v1 = v2 * v3;
    v2 = v3 * v4;
    v3 = v4 * v5;
    v4 = v5 * v0;
    v5 = v0 * v1;
}
color = v0 + v1 + v2 + v3 + v4 + v5;
Is it possible that VMUL is being executed on an extra unit that VADD can't use?
Yes, under certain circumstances the compiler can use the multiply functionality of the dot-product unit to perform a VMUL, but not a VADD. This means you can do a VADD + VMUL or a VMUL + VMUL for 8 flops a cycle, whereas VADD alone is only 4 flops a cycle.
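That lines up with your measurements: VADD alone peaks at 4 flops * 2 pipes * 6 cores * 0.45 GHz = 21.6 GFlops (you measured 21.3), while VMUL and VMADD peak at 8 * 2 * 6 * 0.45 = 43.2 GFlops (you measured 42.3 and 42.6).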
Thanks for the info!
Hi! I know this was 7 years ago, but can you provide some information about the encryption?