What is the GFLOPS of Mali T628MP6? Can't get 17 FLOPS per pipe using OpenGL.

chen20062308 over 10 years ago

For Mali T604 and T628, peak performance is 17 FP32 FLOPS per ALU per cycle. http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf shows that this is composed of:

7: dot product (4 Muls, 3 adds)
1: scalar add
4: vec4 add
4: vec4 multiply
1: scalar multiply
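
The breakdown above can be checked with a little arithmetic. The following is a minimal sketch that sums the per-unit figures from the list and scales them to a whole-GPU theoretical peak. The function names are made up for illustration, and the pipe count and clock used in the usage note are assumptions, not values taken from this thread.

```python
# Per-cycle FP32 operations of one arithmetic pipe, copied from the list above.
FLOPS_PER_UNIT = {
    "dot product (4 muls + 3 adds)": 7,
    "scalar add": 1,
    "vec4 add": 4,
    "vec4 multiply": 4,
    "scalar multiply": 1,
}


def flops_per_pipe_per_cycle():
    """Sum of all units issuing in the same cycle."""
    return sum(FLOPS_PER_UNIT.values())


def peak_gflops(pipes_per_core, cores, clock_mhz):
    """Theoretical peak in GFLOPS; all three arguments are caller-supplied assumptions."""
    return flops_per_pipe_per_cycle() * pipes_per_core * cores * clock_mhz / 1000.0
```

For example, assuming 2 arithmetic pipes per core, 6 cores (the "MP6" in the part name) and a hypothetical 600 MHz clock, `peak_gflops(2, 6, 600)` evaluates to 122.4. This is an upper bound that requires every unit to be busy on every cycle.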

Also, in the thread "What is exact double precision performance for Mali T628 MP6 (Arndale Octa Board) ?", @chrisvarns says 17 FLOPS.

And in http://malideveloper.arm.com/downloads/IWOCL.pdf, timhar01 also says 17 FLOPS.

But according to my measurements, it can't issue a dot product together with a vec4 MAD. The running time of case 1 and case 2 is the same. Why? How can I get 17 FLOPS?

case 1:

" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out5 = color_out5*color5+color6;\n"

case 2:

" color_out1 = vec4(dot(color_out1, color1));\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out1 = vec4(dot(color_out1, color1));\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out1 = vec4(dot(color_out1, color1));\n"
" color_out5 = color_out5*color5+color6;\n"
" color_out1 = vec4(dot(color_out1, color1));\n"
" color_out5 = color_out5*color5+color6;\n"
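
To make the comparison concrete, here is a rough FLOP count for the two cases above. It is only a sketch: it counts arithmetic operations in the source and says nothing about how the compiler actually schedules them. The constant and function names are made up for illustration.

```python
VEC4_MAD_FLOPS = 4 + 4   # out = out * a + b on a vec4: 4 multiplies + 4 adds
DOT4_FLOPS = 4 + 3       # dot(vec4, vec4): 4 multiplies + 3 adds
PEAK_PER_CYCLE = 17      # quoted peak per ALU per cycle


def count_flops(n_mad, n_dot):
    """Total FP operations for n_mad vec4 MADs and n_dot vec4 dot products."""
    return n_mad * VEC4_MAD_FLOPS + n_dot * DOT4_FLOPS


case1 = count_flops(n_mad=8, n_dot=0)   # eight vec4 MAD statements
case2 = count_flops(n_mad=4, n_dot=4)   # four dot + four vec4 MAD statements

# Even a perfect pairing of one dot with one vec4 MAD per cycle falls short
# of the peak, because the scalar add and scalar multiply units stay idle.
best_pairing = VEC4_MAD_FLOPS + DOT4_FLOPS
```

Case 1 contains 64 operations and case 2 contains 60, and a dot product paired with a vec4 MAD accounts for 15 of the 17 operations per cycle. Note also that within each case every statement reads the result of the previous statement on the same variable, so each variable forms a dependent chain.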


chen20062308 over 10 years ago in reply to Chris Varnsverry

I found something strange. If I modify your code and change the precision from mediump to highp, the cycle count is still 6...

precision highp float; 
varying vec4 color5; 
varying vec4 color6; 
varying vec4 color1; 


vec4 color_out5a; 
vec4 color_out1a; 
vec4 tmpa; 
vec4 color_out5b; 
vec4 color_out1b; 
vec4 tmpb; 


void main(void) 
{ 
        color_out5a = color5; 
        color_out1a = color1; 
        color_out5b = color5; 
        color_out1b = color1; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        gl_FragColor = color_out1a * color_out5a + color_out1b * color_out5b; 
}

And when I change the inputs, the cycle count doubles...

precision mediump float; 
varying vec4 color5; 
varying vec4 color6; 
varying vec4 color1; 
varying vec4 color2;
varying vec4 color3;


vec4 color_out5a; 
vec4 color_out1a; 
vec4 tmpa; 
vec4 color_out5b; 
vec4 color_out1b; 
vec4 tmpb; 


void main(void) 
{ 
        color_out5a = color5; 
        color_out1a = color1; 
        color_out5b = color2; 
        color_out1b = color3; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        tmpa = vec4(dot(color_out5a, color1)); 
        color_out5a = tmpa * color_out5a; 
        color_out1a = tmpa + color_out1a; 
        tmpb = vec4(dot(color_out5b, color1)); 
        color_out5b = tmpb * color_out5b; 
        color_out1b = tmpb + color_out1b; 
        gl_FragColor = color_out1a * color_out5a + color_out1b * color_out5b; 
}

So I think your code is somehow optimized.
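
One possible reading of the two results: in the first shader both chains start from the same inputs (color5 and color1), so they compute identical values and a compiler is free to evaluate only one of them. With color2 and color3 the chains diverge and both must be computed. Whether the Mali compiler really performs this elimination is an assumption, not something confirmed in this thread. A minimal Python sketch of the arithmetic, with made-up input values, shows the redundancy:

```python
def dot(a, b):
    """4-component dot product on plain Python lists."""
    return sum(x * y for x, y in zip(a, b))


def run_chain(out5, out1, color1, rounds=4):
    """Emulate one tmp / color_out5 / color_out1 chain from the shaders above."""
    for _ in range(rounds):
        tmp = dot(out5, color1)           # tmp = vec4(dot(color_out5, color1))
        out5 = [tmp * x for x in out5]    # color_out5 = tmp * color_out5
        out1 = [tmp + x for x in out1]    # color_out1 = tmp + color_out1
    return out5, out1


# Made-up sample inputs, chosen only for illustration.
color1 = [1.0, 0.5, 0.25, 0.125]
color5 = [0.5, 0.25, 0.125, 1.0]
color2 = [0.1, 0.2, 0.3, 0.4]
color3 = [0.4, 0.3, 0.2, 0.1]

chain_a = run_chain(color5, color1, color1)
chain_b_same_inputs = run_chain(color5, color1, color1)  # first shader: b starts like a
chain_b_new_inputs = run_chain(color2, color3, color1)   # second shader: b starts differently

print(chain_a == chain_b_same_inputs)
print(chain_a != chain_b_new_inputs)
```

If that reading is right, the first shader effectively measures half the work of the second, which would be consistent with the cycle count doubling.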

According to peterharris's answer in the thread "How many gigaflops GPU MALI T624 MP6 reaches?":

"Most graphics content heavily uses fp16 rather than fp32 - for Mali this means we can get (approximately) double the performance in terms of peak FP16 flops throughput". That means we should be able to get double the peak throughput.

How can we get that throughput?

Chris Varnsverry over 10 years ago in reply to chen20062308

Hi chen,
The vector units can perform twice as many floating-point operations in FP16 as in FP32, which doubles the PEAK FLOPS for those units. But again, this is a PEAK figure, and we are not suggesting that you should expect this level of performance from every shader. The shader above is apparently a poor example of the effect on A-pipe instructions, because in that case mediump only changes the number of load/store instructions (still a good optimization!). The general advice is to use mediump wherever possible, as this gives the compiler the best chance of taking advantage of it.
Thanks,
Chris