What is the GFLOPS of Mali T628MP6? Can't get 17 FLOPS per pipe using OpenGL.

chen20062308 over 10 years ago

For Mali T604 and T628, peak performance is 17 FP32 FLOPS per ALU per cycle. http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf shows this is composed of:

7: dot product (4 Muls, 3 adds)
1: scalar add
4: vec4 add
4: vec4 multiply
1: scalar multiply

Also, in the thread "What is exact double precision performance for Mali T628 MP6 (Arndale Octa Board)?", @chrisvarns says 17 FLOPS.

And in http://malideveloper.arm.com/downloads/IWOCL.pdf, timhar01 also says 17 FLOPS.

But according to my measurements, the GPU does not seem to execute a dot product in parallel with a vec4 MAD. Case 1 and case 2 take the same time to run. Why? How can I get 17 FLOPS?

case 1:

" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"

case 2:

" color_out1 = vec4(dot(color_out1, color1));\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out1 = vec4(dot(color_out1, color1));\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out1 = vec4(dot(color_out1, color1));\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out1 = vec4(dot(color_out1, color1));\n"
" color_out5 = color_out5*color5+color6;\n"

0 Chris Varnsverry over 10 years ago
Hi chen,
Can you provide your shaders for comparison? I've knocked a couple together based on your code above however:
case1.frag
#ifdef HIGHP
precision highp float;
#else
precision mediump float;
#endif

varying vec4 color5;
varying vec4 color6;
vec4 color_out5;

void main(void)
{
        color_out5 = vec4(1);
        color_out5 = color_out5 * color5 + color6;
        color_out5 = color_out5 * color5 + color6;
        color_out5 = color_out5 * color5 + color6;
        color_out5 = color_out5 * color5 + color6;
        color_out5 = color_out5 * color5 + color6;
        color_out5 = color_out5 * color5 + color6;
        color_out5 = color_out5 * color5 + color6;
        color_out5 = color_out5 * color5 + color6;
        gl_FragColor = color_out5;
}
case2.frag
#ifdef HIGHP
precision highp float;
#else
precision mediump float;
#endif

varying vec4 color5;
varying vec4 color6;
varying vec4 color1;
vec4 color_out5;
vec4 color_out1;

void main(void)
{
        color_out5 = vec4(1);
        color_out1 = vec4(1);
        color_out1 = vec4(dot(color_out1, color1));
        color_out5 = color_out5 * color5 + color6;
        color_out1 = vec4(dot(color_out1, color1));
        color_out5 = color_out5 * color5 + color6;
        color_out1 = vec4(dot(color_out1, color1));
        color_out5 = color_out5 * color5 + color6;
        color_out1 = vec4(dot(color_out1, color1));
        color_out5 = color_out5 * color5 + color6;
        gl_FragColor = color_out1 + color_out5;
}
The output from malisc 4.2: (compiling with HIGHP because you're interested in FP32 ops, but obviously mediump gives better perf)

varnz@soma:/raid/scratch/forum-19453$ malisc -f -V -c Mali-T620 -r r1p0 -d Mali-T600_r3p0-00rel0 -D HIGHP=1 case1.frag
ARM Mali Offline Shader Compiler v4.2.0
(C) Copyright 2007-2014 ARM Limited.
All rights reserved.

Compilation successful.

3 work registers used, 0 uniform registers used, spilling not used.

                A       L/S     T       Total   Bound
Cycles:         9       2       0       11      A
Shortest Path:  4       2       0       6       A
Longest Path:   4       2       0       6       A

Note: The cycles counts do not include possible stalls due to cache misses.

varnz@soma:/raid/scratch/forum-19453$ malisc -f -V -c Mali-T620 -r r1p0 -d Mali-T600_r3p0-00rel0 -D HIGHP=1 case2.frag
ARM Mali Offline Shader Compiler v4.2.0
(C) Copyright 2007-2014 ARM Limited.
All rights reserved.

Compilation successful.

5 work registers used, 0 uniform registers used, spilling not used.

                A       L/S     T       Total   Bound
Cycles:         8       3       0       11      A
Shortest Path:  4       3       0       7       A
Longest Path:   4       3       0       7       A

Note: The cycles counts do not include possible stalls due to cache misses.

Case 1 is doing a vec4 multiply and a vec4 add (8 FLOPS) 8 times, for a total of 64 FLOPS. It takes 4 ALU cycles to do this, so it is doing 16 FLOPS/cycle. Great start.
Case 2 also takes 4 cycles, but this time you are doing 4 vec4 multiplies, 4 vec4 adds and 4 dot products: 28 FLOPS for the dot products and 16 each for the multiplies and adds, totalling 60 FLOPS, or 15 FLOPS/cycle. Given that the peak 17 FP32 FLOPS per cycle is composed of:
7: dot product (4 Muls, 3 adds)
1: scalar add
4: vec4 add
4: vec4 multiply
1: scalar multiply
then 15 FLOPS/cycle for case 2 is exactly what I would expect to see, given that you are doing no scalar math. From reading this breakdown you might expect case 1 to total 8 FLOPS/cycle, as it's only using the vadd and vmul units, but evidently we are able to optimize this up to 16 FLOPS/cycle.
Can you let me know how these results differed from your expectation?
Thanks,
Chris
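
The FLOP counts in this reply can be reproduced with a few lines of arithmetic. The sketch below uses Python purely for illustration; the per-operation costs come from the breakdown quoted in the thread, while the names FLOPS_PER_OP and flops_per_cycle are hypothetical and not part of any Mali tool.

```python
# FLOP cost of each GLSL operation on vec4 operands, as counted in this thread.
FLOPS_PER_OP = {
    "vec4_mul": 4,  # 4 multiplies
    "vec4_add": 4,  # 4 adds
    "vec4_dot": 7,  # 4 multiplies + 3 adds
}


def flops_per_cycle(op_counts, arithmetic_cycles):
    """Return (total FLOPs, FLOPs per reported A-pipe cycle) for a shader."""
    total = sum(FLOPS_PER_OP[op] * n for op, n in op_counts.items())
    return total, total / arithmetic_cycles


# Operation counts read off case1.frag and case2.frag.
case1 = {"vec4_mul": 8, "vec4_add": 8}
case2 = {"vec4_mul": 4, "vec4_add": 4, "vec4_dot": 4}

print(flops_per_cycle(case1, 4))  # (64, 16.0)
print(flops_per_cycle(case2, 4))  # (60, 15.0)
```

The cycle count passed in is the shortest-path A-pipe figure from the malisc output, so the result is FLOPS per reported cycle, whatever hardware that cycle covers.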
0 chen20062308 over 10 years ago in reply to Chris Varnsverry

Hi chrisvarns,
Thanks for your quick reply.
But as I understand it, the Shortest Path and Longest Path results from the offline compiler are cycle counts for the 2 ALUs of a Mali T628 core, that is, they assume the shader is running on a single core. So the calculated "16 FLOPS/cycle" and "15 FLOPS/cycle" are figures for 2 ALUs. Am I right?
0 Chris Varnsverry over 10 years ago in reply to chen20062308

Ahh, I see your point. Yes, that's correct: the number of ALUs present in a core is known to the shader compiler and it takes this into account, so the numbers above are per core. Halve them to get the per-ALU FLOPS. I forgot about that.
In that case, case 1 is doing 8 FLOPS per cycle per ALU, which makes sense as you're using the 4 FLOPS from VMUL and the 4 from VADD. This is peak output given that you're not using the dot or scalar units, but I will check this.
For case 2 it's 7.5 FLOPS per cycle per ALU, but I would expect 15 (7 DOT, 4 VMUL, 4 VADD), so I will take a look at the disassembly tomorrow with the compiler team and get back to you.
Apologies for the oversight.
P.S. I think the final gl_FragColor = color_out1 + color_out5; which I'm using to stop the compiler optimizing out the whole body, is probably adding another cycle.
Thanks,
Chris
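
The per-core to per-ALU correction is a single division. A minimal sketch, assuming 2 ALUs per core as stated in this thread for the T628; the helper name per_alu is hypothetical.

```python
ALUS_PER_CORE = 2  # arithmetic pipes per shader core, as stated in this thread


def per_alu(flops, core_cycles, alus=ALUS_PER_CORE):
    """FLOPS per cycle per ALU, given the per-core cycle count from malisc."""
    return flops / core_cycles / alus


print(per_alu(64, 4))  # case 1: 8.0
print(per_alu(60, 4))  # case 2: 7.5
```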
0 chen20062308 over 10 years ago in reply to Chris Varnsverry

Hi chrisvarns,
Thanks for your kind help. It would be great to use the full power of the Mali GPU.
P.S. The reminder is much appreciated.
0 Chris Varnsverry over 10 years ago in reply to chen20062308
Hi Chen,
I've had a good look at the disassembly, which has been a very educational experience. Unfortunately I cannot share with you the specifics of the microarchitecture and layout of the A pipe in the T6xx series of GPUs, or how these shaders map to it, as this is not public information at present.
To be clear, 17 FLOPS is an architectural upper limit, i.e. it is the absolute peak throughput if you were to perfectly exercise all units in the pipe. Whilst we do have a synthetic benchmark that perfectly exercises all hardware units to perform 17 FLOPS per cycle, this is not guaranteed for real use cases, and it is very unlikely that you will be able to achieve it. That said, I have slightly modified case 2 and produced a shader which DOES give 15 FLOPS, combining dot, multiply, and add instructions:
#ifdef HIGHP
precision highp float;
#else
precision mediump float;
#endif

varying vec4 color5;
varying vec4 color6;
varying vec4 color1;
vec4 color_out5;
vec4 color_out1;
vec4 tmp;

void main(void)
{
        color_out5 = color5;
        color_out1 = color1;
        tmp = vec4(dot(color_out5, color1));
        color_out5 = tmp * color_out5;
        color_out1 = tmp + color_out1;
        tmp = vec4(dot(color_out5, color1));
        color_out5 = tmp * color_out5;
        color_out1 = tmp + color_out1;
        tmp = vec4(dot(color_out5, color1));
        color_out5 = tmp * color_out5;
        color_out1 = tmp + color_out1;
        tmp = vec4(dot(color_out5, color1));
        color_out5 = tmp * color_out5;
        color_out1 = tmp + color_out1;
        gl_FragColor = color_out1 + color_out5;
}
You can see that I am still performing the dot product, multiply and add, so there's a total of 60 FLOPS there.
Shader output:

varnz@soma:/raid/scratch/forum-19453$ ../epicshader/Mali_Offline_Compiler_v4.2.0/malisc -f -V -c Mali-T620 -r r1p0 -d Mali-T600_r3p0-00rel0 -D HIGHP=1 case3.frag
ARM Mali Offline Shader Compiler v4.2.0
(C) Copyright 2007-2014 ARM Limited.
All rights reserved.

Compilation successful.

3 work registers used, 0 uniform registers used, spilling not used.

                A       L/S     T       Total   Bound
Cycles:         6       2       0       8       A
Shortest Path:  2       2       0       4       A, L/S
Longest Path:   2       2       0       4       A, L/S

Note: The cycles counts do not include possible stalls due to cache misses.

You can see that it takes 2 cycles on the core's 2 ALUs, which is equivalent to 4 cycles on a single ALU: 60 FLOPS / 4 cycles = 15 FLOPS/cycle per ALU.
Hope this helps,
Chris
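
To put the 15 FLOPS figure in context, it can be compared against the 17 FLOPS architectural peak. A minimal sketch, assuming 2 ALUs per core as discussed earlier in the thread; the function name utilisation is hypothetical.

```python
# Peak FP32 FLOPS per ALU per cycle: 7 (dot) + 4 (vadd) + 4 (vmul) + 1 (sadd) + 1 (smul).
PEAK_FP32_FLOPS_PER_ALU = 17


def utilisation(flops, core_cycles, alus=2):
    """Return (achieved FLOPS/cycle/ALU, fraction of the architectural peak)."""
    achieved = flops / (core_cycles * alus)
    return achieved, achieved / PEAK_FP32_FLOPS_PER_ALU


# case3.frag: 60 FLOPs in 2 per-core A-pipe cycles.
achieved, fraction = utilisation(flops=60, core_cycles=2)
print(achieved)           # 15.0
print(f"{fraction:.1%}")  # 88.2%
```

The remaining 2 FLOPS correspond to the scalar add and scalar multiply, which this shader does not use.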
0 chen20062308 over 10 years ago in reply to Chris Varnsverry
Hi chrisvarns,
That's really amazing.
According to your explanation, to reach peak throughput the program should be:
case 1:
tmp = vec4(dot(color_out5, color1));
color_out5 = tmp * color_out5;
color_out1 = tmp + color_out1;
But in a typical program, it would be:
case 2:
tmp = vec4(dot(color_out5, color1));
color_out5 = tmp * color_out5 + color_out1;
or:
case 3:
tmp = vec4(dot(color_out5, color1));
color_out5 = tmp * color_out5;
color_out1 = color_out5 + color_out1;
But the throughput of cases 2 and 3 is only half that of case 1. Is this a hardware limitation? If I want to achieve high performance, do I need to write shaders like case 1?
0 Chris Varnsverry over 10 years ago in reply to chen20062308

Hi chen,
I can't talk publicly about the hardware internals at this level unfortunately, but suffice it to say that the performance of a particular shader depends entirely on how that shader is written and how it maps to, and is scheduled on, a particular architecture. It is not expected that all code will execute at the peak theoretical throughput, so you should not worry about achieving this target in practice.
Best-practice advice for writing shaders for Mali GPUs can be found in the "Mali GPU Application Optimization Guide v3.0" on the Mali Developer Center, which is the bulk of the public information on the subject of optimization for Mali GPUs.
Hope this helps,
Chris

0 Chris Varnsverry over 10 years ago in reply to Chris Varnsverry

P.S. In case anyone is interested, in the following shader I have modified my previous shader to use mediump. It now contains a total of 132 FP16 operations (120 for the 8x DOT, MUL, and ADD, and 12 for the final 2 MUL and 1 ADD necessary to stop the compiler optimizing everything out):

precision mediump float;
varying vec4 color5;
varying vec4 color6;
varying vec4 color1;


vec4 color_out5a;
vec4 color_out1a;
vec4 tmpa;
vec4 color_out5b;
vec4 color_out1b;
vec4 tmpb;


void main(void)
{
        color_out5a = color5;
        color_out1a = color1;
        color_out5b = color5;
        color_out1b = color1;
        tmpa = vec4(dot(color_out5a, color1));
        color_out5a = tmpa * color_out5a;
        color_out1a = tmpa + color_out1a;
        tmpb = vec4(dot(color_out5b, color1));
        color_out5b = tmpb * color_out5b;
        color_out1b = tmpb + color_out1b;
        tmpa = vec4(dot(color_out5a, color1));
        color_out5a = tmpa * color_out5a;
        color_out1a = tmpa + color_out1a;
        tmpb = vec4(dot(color_out5b, color1));
        color_out5b = tmpb * color_out5b;
        color_out1b = tmpb + color_out1b;
        tmpa = vec4(dot(color_out5a, color1));
        color_out5a = tmpa * color_out5a;
        color_out1a = tmpa + color_out1a;
        tmpb = vec4(dot(color_out5b, color1));
        color_out5b = tmpb * color_out5b;
        color_out1b = tmpb + color_out1b;
        tmpa = vec4(dot(color_out5a, color1));
        color_out5a = tmpa * color_out5a;
        color_out1a = tmpa + color_out1a;
        tmpb = vec4(dot(color_out5b, color1));
        color_out5b = tmpb * color_out5b;
        color_out1b = tmpb + color_out1b;
        gl_FragColor = color_out1a * color_out5a + color_out1b * color_out5b;
}

Shader compiler output:

varnz@soma:/raid/scratch/forum-19453$ Mali_Offline_Compiler_v4.2.0/malisc -f -V -c Mali-T620 -r r1p0 -d Mali-T600_r3p0-00rel0 case4.frag -o case4.r3p0.bin
ARM Mali Offline Shader Compiler v4.2.0
(C) Copyright 2007-2014 ARM Limited.
All rights reserved.

Compilation successful.

2 work registers used, 0 uniform registers used, spilling not used.

                A       L/S     T       Total   Bound
Cycles:         6       1       0       7       A
Shortest Path:  2       1       0       3       A
Longest Path:   2       1       0       3       A

Note: The cycles counts do not include possible stalls due to cache misses.

Output binary written to 'case4.r3p0.bin'.

So that totals 33 FLOPS/cycle/ALU (132 operations / 2 cycles / 2 ALUs)! It's also one cycle fewer in the L/S pipe, which is nice.
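
The 132-operation count and the 33 FLOPS figure can be checked as follows. This is only an arithmetic sketch in Python; the variable names are illustrative, and the 2 ALUs per core is the assumption used earlier in the thread.

```python
DOT, MUL, ADD = 7, 4, 4  # operations per vec4 dot product, multiply, and add

body = 8 * (DOT + MUL + ADD)  # eight dot/mul/add groups -> 120
tail = 2 * MUL + 1 * ADD      # final gl_FragColor expression -> 12
total_ops = body + tail       # 132

core_cycles = 2  # shortest-path A-pipe cycles from the malisc output for case4.frag
alus = 2         # ALUs per core, as assumed earlier in the thread

print(total_ops, total_ops / core_cycles / alus)  # 132 33.0
```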

0 chen20062308 over 10 years ago in reply to Chris Varnsverry

Hi chrisvarns,
Thanks for your reply. It's really helpful.

0 chen20062308 over 10 years ago in reply to Chris Varnsverry

I found something strange. If I modify your code, changing the precision from mediump to highp, the cycle count is still 6...

precision highp float; 
varying vec4 color5; 
varying vec4 color6; 
varying vec4 color1; 


vec4 color_out5a; 
vec4 color_out1a; 
vec4 tmpa; 
vec4 color_out5b; 
vec4 color_out1b; 
vec4 tmpb; 


void main(void) 
{ 
        color_out5a = color5; 
        color_out1a = color1; 
        color_out5b = color5; 
        color_out1b = color1; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        gl_FragColor = color_out1a * color_out5a + color_out1b * color_out5b; 
}

And when I change the inputs, the cycle count doubles...

precision mediump float; 
varying vec4 color5; 
varying vec4 color6; 
varying vec4 color1; 
varying vec4 color2;
varying vec4 color3;


vec4 color_out5a; 
vec4 color_out1a; 
vec4 tmpa; 
vec4 color_out5b; 
vec4 color_out1b; 
vec4 tmpb; 


void main(void) 
{ 
        color_out5a = color5; 
        color_out1a = color1; 
        color_out5b = color2; 
        color_out1b = color3; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        gl_FragColor = color_out1a * color_out5a + color_out1b * color_out5b; 
}

So I think your shader is somehow being optimized.

According to peterharris's answer in the thread "How many gigaflops GPU MALI T624 MP6 reaches?":

"Most graphics content heavily uses fp16 rather than fp32 - for Mali this means we can get (approximately) double the performance in terms of peak FP16 flops throughput". That means we can get double the peak throughput.

How can we get that throughput?

+1 Chris Varnsverry over 10 years ago in reply to chen20062308

Hi chen,
The vector units can perform twice as many floating-point operations for FP16 as for FP32, resulting in double the PEAK FLOPS for those units. But again, this is PEAK, and we are not suggesting that you should expect this level of performance with every shader. The shader above is apparently a bad example of the effect on A-pipe instructions, as in that case mediump only affects the number of load/store instructions (still a good optimization!). The general advice is to use mediump wherever possible, as this gives the compiler the best chance of taking advantage of it.
Thanks,
Chris