
What is the GFLOPS of the Mali T628MP6? Can't get 17 FLOPS per pipe using OpenGL.

For Mali T604 and T628, peak performance is 17 FP32 FLOPS per ALU per cycle. http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf shows this is composed of:

  • 7: dot product (4 Muls, 3 adds)
  • 1: scalar add
  • 4: vec4 add
  • 4: vec4 multiply
  • 1: scalar multiply
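
The breakdown above can be checked with a few lines of arithmetic. This is only a sketch: the per-unit figures come from the list, and the 6 cores come from the "MP6" in the part name, but `alus_per_core=2` and `clock_mhz=600` are placeholder assumptions of mine and should be replaced with the values for your actual board.

```python
# FLOPs issued per ALU per cycle, taken from the breakdown listed above.
FLOPS_PER_UNIT = {
    "dot product (4 muls + 3 adds)": 7,
    "scalar add": 1,
    "vec4 add": 4,
    "vec4 multiply": 4,
    "scalar multiply": 1,
}


def peak_gflops(flops_per_alu_cycle, alus_per_core, cores, clock_mhz):
    """Theoretical peak = FLOPs/cycle/ALU * ALUs per core * cores * clock."""
    return flops_per_alu_cycle * alus_per_core * cores * clock_mhz * 1e6 / 1e9


flops_per_alu_cycle = sum(FLOPS_PER_UNIT.values())
print(flops_per_alu_cycle)  # 17

# ASSUMPTION: alus_per_core and clock_mhz are illustrative placeholders,
# not figures stated in this thread.
print(peak_gflops(flops_per_alu_cycle, alus_per_core=2, cores=6, clock_mhz=600))
```

The result is a theoretical ceiling that requires every unit to be busy on every cycle, which is exactly what the measurement below fails to achieve.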

Also, in the thread "What is exact double precision performance for Mali T628 MP6 (Arndale Octa Board)?", @chrisvarns says 17 FLOPS.

And in http://malideveloper.arm.com/downloads/IWOCL.pdf, timhar01 also says 17 FLOPS.

But according to my measurements, it can't process a dot product together with a vec4 MAD. The running times of case 1 and case 2 are the same. Why? How can I get 17 FLOPS?

case 1:

"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"

case 2:

"    color_out1 = vec4(dot(color_out1, color1));\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out1 = vec4(dot(color_out1, color1));\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out1 = vec4(dot(color_out1, color1));\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out1 = vec4(dot(color_out1, color1));\n"
"    color_out5 = color_out5*color5+color6;\n"

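For reference, here is a rough count of the arithmetic in the two cases. It is a sketch that assumes the compiler emits each statement as written (nothing eliminated or merged) and counts the `vec4(...)` broadcast of the dot result as zero FLOPs; the constant and variable names are mine.

```python
VEC4_MAD = 8  # vec4 multiply-add: 4 multiplies + 4 adds
DOT4 = 7      # vec4 dot product: 4 multiplies + 3 adds

# case 1: eight vec4 MAD statements
case1 = [VEC4_MAD] * 8
# case 2: four (dot, vec4 MAD) pairs
case2 = [DOT4, VEC4_MAD] * 4

case1_flops = sum(case1)
case2_flops = sum(case2)
print(case1_flops, case2_flops)  # 64 60
```

Under these assumptions, equal running times mean case 2 delivers slightly fewer FLOPs per unit time than case 1, not more, so adding the dot products did not raise the measured throughput.
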
  • Hi chrisvarns,

    That's really amazing.

    According to your explanation, to reach peak throughput the program should be:

    case 1:

    1. tmp = vec4(dot(color_out5, color1)); 
    2. color_out5 = tmp * color_out5; 
    3. color_out1 = tmp + color_out1;

    But in a typical program, it would be:

    case 2:

    1. tmp = vec4(dot(color_out5, color1)); 
    2. color_out5 = tmp * color_out5 + color_out1; 

    or:

    case 3:

    1. tmp = vec4(dot(color_out5, color1)); 
    2. color_out5 = tmp * color_out5; 
    3. color_out1 = color_out5 + color_out1;

    But the throughput of cases 2 and 3 is only half that of case 1. So is this a hardware limitation? If I want to achieve high performance, do I need to write shaders like case 1?
