What is the exact double-precision performance of the Mali T628 MP6 (Arndale Octa board)?

Different sources give different numbers. For the Arndale board, I found a figure of about 72 GFLOPS for the T604.

Wikipedia shows 109 GFLOPS for the T628. Have you heard of any performance measurements for this GPU, or of its theoretical capability?

I'm thinking about using it for low-power HPC; let me know what you think about that. Opinions and related links are welcome.

Regards,

Piotr

  • Hi pietrushnic,

    For Mali T604 and T628, peak performance is 17 FP32 FLOPS per ALU per cycle. http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf shows this is composed of:

    • 7: dot product (4 muls, 3 adds)
    • 1: scalar add
    • 4: vec4 add
    • 4: vec4 multiply
    • 1: scalar multiply

    So the formula is:

    17 FP32 FLOPS/cycle * ALU count * core count * frequency (GHz)

    T604 MP4 : 17 * 2 * 4 * 0.533 = 72.488 FP32 GFLOPS

    T628 MP6 : 17 * 2 * 6 * 0.533 = 108.732 FP32 GFLOPS

    These figures assume FP32. Because the ALU's vector units are quite flexible, you can do more work per cycle in the vector units using FP16, or less using FP64. You can achieve 5 FP64 FLOPS per ALU per cycle, which gives:

    T604 MP4 : 5 * 2 * 4 * 0.533 = 21.32 FP64 GFLOPS

    T628 MP6 : 5 * 2 * 6 * 0.533 = 31.98 FP64 GFLOPS
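
For anyone who wants to redo the arithmetic above for other core counts or clock speeds, here is a minimal Python sketch. The helper name `peak_gflops` and the constant names are purely illustrative; the 2 ALUs per core and the 0.533 GHz clock are simply the values used in the calculations above. The results are theoretical peaks, not measured throughput.

```python
# Per-ALU FP32 breakdown quoted above (FLOPS per cycle).
FP32_BREAKDOWN = {
    "dot product (4 muls, 3 adds)": 7,
    "scalar add": 1,
    "vec4 add": 4,
    "vec4 multiply": 4,
    "scalar multiply": 1,
}

FP32_PER_ALU = sum(FP32_BREAKDOWN.values())  # the five entries add up to 17
FP64_PER_ALU = 5                             # FP64 figure quoted above

ALUS_PER_CORE = 2   # value used in the calculations above
FREQ_GHZ = 0.533    # clock used in the calculations above


def peak_gflops(flops_per_alu_cycle, alus_per_core, cores, freq_ghz):
    """Theoretical peak in GFLOPS: FLOPS/cycle * ALUs * cores * GHz."""
    return flops_per_alu_cycle * alus_per_core * cores * freq_ghz


# Hypothetical table of the two configurations discussed in this thread.
CONFIGS = {"T604 MP4": 4, "T628 MP6": 6}

for name, cores in CONFIGS.items():
    fp32 = peak_gflops(FP32_PER_ALU, ALUS_PER_CORE, cores, FREQ_GHZ)
    fp64 = peak_gflops(FP64_PER_ALU, ALUS_PER_CORE, cores, FREQ_GHZ)
    print(f"{name}: {fp32:.3f} FP32 GFLOPS, {fp64:.2f} FP64 GFLOPS")
```

Changing `FREQ_GHZ` or the core count in `CONFIGS` gives the corresponding peak for a different board, on the assumption that the per-ALU figures stay the same.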

    Hope this helps,

    Chris

    EDIT: Updated with official FP64 numbers, lower than previously quoted.
