Different sources point to different numbers. For the Arndale board I found a figure of about 72 GFLOPS for the T604.
Wikipedia shows 109 GFLOPS for the T628. Have you heard of any performance measurements for this GPU, and what is its theoretical capability?
I'm thinking about using it for low-power HPC, so let me know what you think about that. Opinions and related links are welcome.
Regards,
Piotr
Chris,
thank you for being very clear and pointing to valuable materials. I really appreciate the openness around ARM.
If I understand correctly, the double- to single-precision ratio is 1/2. Considering your numbers (especially T628 MP6 at 54.36 GFLOPS), it means that the Arndale with T628 MP6 is the best board on the market for double-precision computation below $200 ($179 / 54.36 GFLOPS = $3.29/GFLOPS).
That is much better than, e.g., the brand-new Jetson TK1 from NVIDIA, where the ratio is 1/24 and which gives 13 DP GFLOPS for $192 ($14.77/GFLOPS).
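For anyone who wants to redo this comparison with other boards, the price-per-GFLOPS arithmetic can be sketched in a few lines of Python. The prices and DP GFLOPS values are the ones quoted in this thread (taking 54.36 GFLOPS as the T628 MP6 estimate); the function name and board labels are only illustrative.

```python
def dollars_per_gflops(price_usd, gflops):
    """Return the board price divided by its peak GFLOPS."""
    if gflops <= 0:
        raise ValueError("gflops must be positive")
    return price_usd / gflops

# Figures quoted in this thread; treat them as rough estimates, not measurements.
boards = {
    "Arndale (Mali-T628 MP6)": (179.0, 54.36),
    "Jetson TK1": (192.0, 13.0),
}

for name, (price, dp_gflops) in boards.items():
    print(f"{name}: ${dollars_per_gflops(price, dp_gflops):.2f}/GFLOPS")
```

These are theoretical peaks, so the result only ranks boards on paper and says nothing about what a real kernel will sustain.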
Is there anything else to take into consideration in the small-size, low-power, low-end market?
OTOH, I would like to know whether anyone has been able to achieve something close to 50 GFLOPS on the Mali T628 MP6. If you know of any such results, please let me know.
Hi Piotr,
I've revised my original reply with updated numbers, as it's actually 5 FP64 FLOPS, not 8.5; the simple half was a bit too naive. So the ratio is slightly less than 1/3, but not as bad as 1/24. That makes it $5.60/GFLOPS.
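To show where the $5.60 comes from, here is a rough sketch of the rescaling. It assumes the earlier 54.36 GFLOPS estimate was built on the naive 8.5 FP64 FLOPS figure, so the corrected peak is that estimate scaled by 5/8.5. The FP32 figure of 17 is inferred from 8.5 being "the simple half"; it is not a number taken from a datasheet, and the variable names are only illustrative.

```python
NAIVE_FP64_FLOPS = 8.5       # the "simple half" assumption
ACTUAL_FP64_FLOPS = 5.0      # corrected figure
NAIVE_PEAK_GFLOPS = 54.36    # T628 MP6 estimate based on the naive figure
BOARD_PRICE_USD = 179.0      # Arndale price quoted earlier in the thread

# Assumption: FP32 throughput is twice the naive FP64 figure.
fp32_flops = 2 * NAIVE_FP64_FLOPS

# Scale the old estimate by the ratio of corrected to naive FP64 throughput.
corrected_peak = NAIVE_PEAK_GFLOPS * ACTUAL_FP64_FLOPS / NAIVE_FP64_FLOPS
fp64_to_fp32 = ACTUAL_FP64_FLOPS / fp32_flops
cost = BOARD_PRICE_USD / corrected_peak

print(f"corrected FP64 peak: {corrected_peak:.2f} GFLOPS")
print(f"FP64/FP32 ratio:     {fp64_to_fp32:.3f}")
print(f"cost:                ${cost:.2f}/GFLOPS")
```

The ratio comes out at about 0.294, which is the "slightly less than 1/3" mentioned above.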
As for utilization, we usually suggest 70% as pretty much optimal for a real application, but this number will vary between applications. Beyond that point the ALU is rarely the bottleneck, and you have to look much more closely at cache utilization and the memory system feeding the GPU, at eliminating CPU/GPU sync points to improve the pipelining of work to the GPU, and so on.
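As a back-of-the-envelope illustration of the 70% guideline, the sketch below turns a theoretical peak into a rough sustained estimate. The peak used here is the corrected FP64 figure implied by the numbers in this thread (54.36 x 5 / 8.5); the function name is hypothetical, and real results depend heavily on the memory system and the kernel.

```python
def sustained_gflops(peak_gflops, alu_utilization=0.70):
    """Estimate sustained throughput as a fraction of theoretical peak."""
    if not 0.0 <= alu_utilization <= 1.0:
        raise ValueError("alu_utilization must be between 0 and 1")
    return peak_gflops * alu_utilization

# Assumed peak: the 54.36 GFLOPS estimate rescaled from 8.5 to 5 FP64 FLOPS.
peak_fp64 = 54.36 * 5 / 8.5

for utilization in (0.50, 0.70, 0.90):
    estimate = sustained_gflops(peak_fp64, utilization)
    print(f"{utilization:.0%} ALU utilization: about {estimate:.1f} GFLOPS")
```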
Hope this helps,
Chris