
What is the GFLOPS of Mali T628MP6? Can't get 17 FLOPS per pipe using OpenGL.

For Mali T604 and T628, peak performance is 17 FP32 FLOPS per ALU per cycle. http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf shows this is composed of:

  • 7: dot product (4 Muls, 3 adds)
  • 1: scalar add
  • 4: vec4 add
  • 4: vec4 multiply
  • 1: scalar multiply
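
To make the arithmetic behind that breakdown concrete, here is a minimal Python sketch that sums the listed operations and scales the result to a whole GPU. The per-pipe numbers come from the list above; the pipe count per core, the core count and the clock frequency used in the example call are assumptions for illustration only and should be replaced with the values for your actual device.

```python
# FLOPs issued per arithmetic pipe per cycle, taken from the breakdown above.
PEAK_BREAKDOWN = {
    "dot product (4 muls, 3 adds)": 7,
    "scalar add": 1,
    "vec4 add": 4,
    "vec4 multiply": 4,
    "scalar multiply": 1,
}


def flops_per_pipe_per_cycle(breakdown=PEAK_BREAKDOWN):
    """Sum the per-cycle FLOPs of all units in one arithmetic pipe."""
    return sum(breakdown.values())


def peak_gflops(pipes_per_core, cores, clock_hz, flops_per_pipe=None):
    """Theoretical peak in GFLOPS: FLOPs/pipe/cycle * pipes * cores * clock."""
    if flops_per_pipe is None:
        flops_per_pipe = flops_per_pipe_per_cycle()
    return flops_per_pipe * pipes_per_core * cores * clock_hz / 1e9


if __name__ == "__main__":
    print(flops_per_pipe_per_cycle())  # 17
    # ASSUMPTIONS (not stated in this thread): 2 arithmetic pipes per core,
    # 6 cores ("MP6") and a 600 MHz GPU clock. Check your SoC documentation.
    print(peak_gflops(pipes_per_core=2, cores=6, clock_hz=600e6))
```

This is only a theoretical ceiling: it assumes every unit in every pipe is busy on every cycle, which real shaders rarely achieve.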

Also, in the thread "What is exact double precision performance for Mali T628 MP6 (Arndale Octa Board)?", @chrisvarns says 17 FLOPS.

And in http://malideveloper.arm.com/downloads/IWOCL.pdf, timhar01 also says 17 FLOPS.

But according to my measurements, it can't issue a dot product together with a vec4 MAD (multiply-add): the running time of case 1 and case 2 is the same. Why? How can I get 17 FLOPS?

case 1:

"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out5 = color_out5*color5+color6;\n"


case 2:

"    color_out1 = vec4(dot(color_out1, color1));\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out1 = vec4(dot(color_out1, color1));\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out1 = vec4(dot(color_out1, color1));\n"
"    color_out5 = color_out5*color5+color6;\n"
"    color_out1 = vec4(dot(color_out1, color1));\n"
"    color_out5 = color_out5*color5+color6;\n"

  • Hi chen,

    Can you provide your shaders for comparison? In the meantime, I've knocked a couple together based on your code above:

    case1.frag

    #ifdef HIGHP
    precision highp float;
    #else
    precision mediump float;
    #endif
    
    varying vec4 color5;
    varying vec4 color6;
    
    vec4 color_out5;
    
    void main(void)
    {
            color_out5 = vec4(1);
            color_out5 = color_out5 * color5 + color6;
            color_out5 = color_out5 * color5 + color6;
            color_out5 = color_out5 * color5 + color6;
            color_out5 = color_out5 * color5 + color6;
            color_out5 = color_out5 * color5 + color6;
            color_out5 = color_out5 * color5 + color6;
            color_out5 = color_out5 * color5 + color6;
            color_out5 = color_out5 * color5 + color6;
            gl_FragColor = color_out5;
    }
    
    

    case2.frag

    #ifdef HIGHP
    precision highp float;
    #else
    precision mediump float;
    #endif
    varying vec4 color5;
    varying vec4 color6;
    varying vec4 color1;
    
    
    vec4 color_out5;
    vec4 color_out1;
    
    
    void main(void)
    {
            color_out5 = vec4(1);
            color_out1 = vec4(1);
            color_out1 = vec4(dot(color_out1, color1));
            color_out5 = color_out5 * color5 + color6;
            color_out1 = vec4(dot(color_out1, color1));
            color_out5 = color_out5 * color5 + color6;
            color_out1 = vec4(dot(color_out1, color1));
            color_out5 = color_out5 * color5 + color6;
            color_out1 = vec4(dot(color_out1, color1));
            color_out5 = color_out5 * color5 + color6;
            gl_FragColor = color_out1 + color_out5;
    }
    
    

    The output from malisc 4.2 (compiled with HIGHP because you're interested in FP32 ops, although mediump generally gives better performance):

    varnz@soma:/raid/scratch/forum-19453$ malisc -f -V -c Mali-T620 -r r1p0 -d Mali-T600_r3p0-00rel0 -D HIGHP=1 case1.frag
    ARM Mali Offline Shader Compiler v4.2.0
    (C) Copyright 2007-2014 ARM Limited.
    All rights reserved.

    Compilation successful.

    3 work registers used, 0 uniform registers used, spilling not used.

                    A       L/S     T       Total   Bound
    Cycles:         9       2       0       11      A
    Shortest Path:  4       2       0       6       A
    Longest Path:   4       2       0       6       A

    Note: The cycles counts do not include possible stalls due to cache misses.

    varnz@soma:/raid/scratch/forum-19453$ malisc -f -V -c Mali-T620 -r r1p0 -d Mali-T600_r3p0-00rel0 -D HIGHP=1 case2.frag
    ARM Mali Offline Shader Compiler v4.2.0
    (C) Copyright 2007-2014 ARM Limited.
    All rights reserved.

    Compilation successful.

    5 work registers used, 0 uniform registers used, spilling not used.

                    A       L/S     T       Total   Bound
    Cycles:         8       3       0       11      A
    Shortest Path:  4       3       0       7       A
    Longest Path:   4       3       0       7       A

    Note: The cycles counts do not include possible stalls due to cache misses.
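
If you want to compare many shader variants, it may help to pull the cycle table out of the compiler output automatically. The following is a rough Python sketch that parses the three table rows shown above. It assumes the text layout printed by malisc 4.2 in this post; other compiler versions may format the report differently, and the function name `parse_malisc_cycles` is hypothetical, not part of any ARM tool.

```python
import re

# One table row: a label, four integer cycle counts (A, L/S, T, Total)
# and the name of the bounding pipeline.
ROW = re.compile(
    r"^\s*(Cycles|Shortest Path|Longest Path):"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s*$"
)


def parse_malisc_cycles(text):
    """Return {row label: {"A", "L/S", "T", "Total", "Bound"}} from a report."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            label, a, ls, t, total, bound = match.groups()
            rows[label] = {
                "A": int(a),
                "L/S": int(ls),
                "T": int(t),
                "Total": int(total),
                "Bound": bound,
            }
    return rows


# Table copied from the case2.frag report above.
SAMPLE = """
                A       L/S     T       Total   Bound
Cycles:         8       3       0       11      A
Shortest Path:  4       3       0       7       A
Longest Path:   4       3       0       7       A
Note: The cycles counts do not include possible stalls due to cache misses.
"""

if __name__ == "__main__":
    report = parse_malisc_cycles(SAMPLE)
    print(report["Shortest Path"]["A"])  # 4
```

The "Note:" line and the header row are skipped because they do not match the row pattern.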

    Case 1 does a vec4 multiply and a vec4 add (8 FLOPS) 8 times, for a total of 64 FLOPS. It takes 4 ALU cycles to do this, so it achieves 16 FLOPS/cycle. Great start. Case 2 also takes 4 cycles, but this time you are doing 4 vec4 multiplies, 4 vec4 adds and 4 dot products: 28 FLOPS for the dot products and 16 each for the multiplies and adds, totalling 60 FLOPS, or 15 FLOPS/cycle. Given that the peak 17 FP32 FLOPS per cycle is composed of:

    • 7: dot product (4 Muls, 3 adds)
    • 1: scalar add
    • 4: vec4 add
    • 4: vec4 multiply
    • 1: scalar multiply

    then 15 FLOPS/cycle for case 2 is exactly what I would expect to see, given that you are doing no scalar math. From reading this breakdown you might expect case 1 to total 8 FLOPS/cycle, as it's only using the vadd and vmul, but evidently we are able to optimize this up to 16 FLOPS/cycle.
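
The counting above can be reproduced with a few lines of Python. This sketch follows the same simplification as the reply: it counts only the repeated multiply-add and dot-product statements, ignores the initialisation and the final add in case2.frag, and uses the 4 ALU cycles reported on the shortest path. The operation names in the dictionaries are labels chosen for this example.

```python
# FLOPs per operation, matching the breakdown in this thread.
FLOPS_PER_OP = {
    "vec4_mul": 4,
    "vec4_add": 4,
    "dot4": 7,  # 4 multiplies + 3 adds
}


def flops_per_cycle(op_counts, alu_cycles):
    """Return (total FLOPs, FLOPs per ALU cycle) for a set of operations."""
    total = sum(FLOPS_PER_OP[op] * count for op, count in op_counts.items())
    return total, total / alu_cycles


# Case 1: eight vec4 multiply-adds.
CASE1 = {"vec4_mul": 8, "vec4_add": 8}
# Case 2: four vec4 multiply-adds interleaved with four dot products.
CASE2 = {"vec4_mul": 4, "vec4_add": 4, "dot4": 4}

if __name__ == "__main__":
    print(flops_per_cycle(CASE1, alu_cycles=4))  # (64, 16.0)
    print(flops_per_cycle(CASE2, alu_cycles=4))  # (60, 15.0)
```

Both cases take the same 4 ALU cycles, which is consistent with the identical running times in the original measurement; they differ only in how much work is packed into those cycles.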

    Can you let me know how these results differed from your expectation?

    Thanks,

    Chris
