Different sources point to different numbers. For the Arndale board I found a figure of about 72 GFLOPS for the T604.
Wikipedia shows 109 GFLOPS for the T628. Have you heard of any performance measurements for this GPU, and what is its theoretical capability?
I'm thinking about using it for low-power HPC; let me know what you think about that. Opinions and links related to this are welcome.
Regards,
Piotr
Hi pietrushnic,
For Mali T604 and T628, peak performance is 17 FP32 FLOPS per ALU per cycle. http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf shows what this is composed of.
So the formula is:
17 FP32 flops/cycle * ALU count * core count * frequency
T604 MP4 : 17 * 2 * 4 * 0.533 = 72.488 FP32 GFLOPS
T628 MP6 : 17 * 2 * 6 * 0.533 = 108.732 FP32 GFLOPS
This is assuming FP32, but as the ALU's vector units are quite flexible, you can actually do more work in the vector units using FP16, or less using FP64. You can achieve 5 FP64 FLOPS per ALU per cycle, so that gives us:
T604 MP4 : 5 * 2 * 4 * 0.533 = 21.32 FP64 GFLOPS
T628 MP6 : 5 * 2 * 6 * 0.533 = 31.98 FP64 GFLOPS
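For anyone who wants to plug in other core counts or clocks, the arithmetic above can be sketched in a few lines of Python. The function and variable names are only illustrative; the inputs (17 FP32 or 5 FP64 FLOPS per ALU per cycle, 2 ALUs per core, 0.533 GHz) are the ones used in the figures above.

```python
# Theoretical peak throughput for the two Mali configurations discussed above.
# Names are illustrative; the constants come from the figures in this thread.

def peak_gflops(flops_per_alu_per_cycle, alus_per_core, cores, freq_ghz):
    """Peak GFLOPS = FLOPS/cycle/ALU * ALUs per core * cores * clock in GHz."""
    return flops_per_alu_per_cycle * alus_per_core * cores * freq_ghz

FP32_FLOPS_PER_ALU = 17
FP64_FLOPS_PER_ALU = 5
ALUS_PER_CORE = 2
FREQ_GHZ = 0.533

for name, cores in {"T604 MP4": 4, "T628 MP6": 6}.items():
    fp32 = peak_gflops(FP32_FLOPS_PER_ALU, ALUS_PER_CORE, cores, FREQ_GHZ)
    fp64 = peak_gflops(FP64_FLOPS_PER_ALU, ALUS_PER_CORE, cores, FREQ_GHZ)
    print(f"{name}: {fp32:.3f} FP32 GFLOPS, {fp64:.2f} FP64 GFLOPS")
```

These are theoretical peaks only; they say nothing about what a real kernel will sustain.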
Hope this helps,
Chris
EDIT: Updated with official FP64 numbers, lower than previously quoted.
Chris,
thank you for being very clear and pointing to valuable materials. I really appreciate the openness around ARM.
If I understand correctly, the double-to-single precision ratio is 1/2. Considering your numbers (especially T628 MP6 at 54.36 GFLOPS), it means that an Arndale with T628 MP6 is the best board on the market for double-precision computation below $200 ($179 / 54.36 GFLOPS ≈ $3.29/GFLOPS).
It is much better than, e.g., the brand new Jetson TK1 from NVIDIA, where the ratio is 1/24, which gives 13 DP GFLOPS for $192 (≈ $14.77/GFLOPS).
Is there anything else to take into consideration in the small-size, low-power, low-end market?
OTOH, I would like to know if anyone has been able to utilize something close to 50 GFLOPS on the Mali T628 MP6. If you know of anything like this, please let me know.
Hi Piotr,
I've updated my original reply with corrected numbers: it's actually 5 FP64 FLOPS, not 8.5; simply halving was a bit too naive. So the ratio is slightly less than 1/3, but not as bad as 1/24, and the cost becomes $5.60/GFLOPS.
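The cost comparison can be rechecked with a short sketch. The prices ($179 and $192) and the 13 DP GFLOPS figure for the Jetson TK1 are the ones quoted earlier in this thread, not independently verified; the helper name is made up for illustration.

```python
# Price per peak FP64 GFLOPS, using the figures quoted in this thread.
# dollars_per_gflops is an illustrative helper, not part of any library.

def dollars_per_gflops(price_usd, gflops):
    """Board price divided by theoretical peak throughput."""
    return price_usd / gflops

mali_t628_fp64 = 5 * 2 * 6 * 0.533   # peak FP64 GFLOPS for T628 MP6
jetson_tk1_fp64 = 13                 # DP GFLOPS as quoted in the thread

print(f"Arndale (T628 MP6): ${dollars_per_gflops(179, mali_t628_fp64):.2f}/GFLOPS")  # $5.60
print(f"Jetson TK1:         ${dollars_per_gflops(192, jetson_tk1_fp64):.2f}/GFLOPS")  # $14.77
```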
As for utilization, we usually suggest 70% as being pretty optimal usage for a real application, but this number will change for different applications. Beyond that point the ALU is rarely the bottleneck, and you have to look a lot more closely at cache utilization, the memory system feeding the GPU, eliminating CPU/GPU sync points to improve the pipelining of work to the GPU, etc.
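As a rough planning aid, the 70% rule of thumb can be applied to the peak figures like this. It is only a sketch: the 0.70 default is the guideline mentioned above, not a measured result, and the function name is illustrative.

```python
# Rough sustained-throughput estimate from a theoretical peak.
# The 0.70 default is the rule of thumb from this thread, not a measurement.

def sustained_gflops(peak_gflops, utilization=0.70):
    """Scale a theoretical peak by an assumed ALU utilization in [0, 1]."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak_gflops * utilization

t628_fp32_peak = 17 * 2 * 6 * 0.533   # 108.732 FP32 GFLOPS
t628_fp64_peak = 5 * 2 * 6 * 0.533    # 31.98 FP64 GFLOPS

print(f"T628 MP6 FP32 at 70%: {sustained_gflops(t628_fp32_peak):.1f} GFLOPS")
print(f"T628 MP6 FP64 at 70%: {sustained_gflops(t628_fp64_peak):.1f} GFLOPS")
```

Memory-bound kernels may land well below this estimate, so treat it as an upper planning figure rather than a prediction.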