For Mali T604 and T628, peak performance is 17 FP32 FLOPS per ALU per cycle. http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf shows this is composed of:
Also, in the thread "What is exact double precision performance for Mali T628 MP6 (Arndale Octa Board)?", @chrisvarns says 17 FLOPS.
In http://malideveloper.arm.com/downloads/IWOCL.pdf, timhar01 also says 17 FLOPS.
But according to my measurements, the GPU does not seem to process a dot product and a vec4 MAD together. The running times of case 1 and case 2 are the same. Why? How can I get 17 FLOPS?
case 1:
" color_out5 = color_out5*color5+color6;\n"
case 2:
" color_out1 = vec4(dot(color_out1, color1));\n"
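To make the comparison concrete, here is a rough sketch in C of how a measured running time could be converted into FLOPS per ALU per cycle. It assumes each case is timed as a long loop of that one statement, counts a vec4 MAD as 8 FLOPs (4 multiplies plus 4 adds) and a vec4 dot product as 7 FLOPs (4 multiplies plus 3 adds). The function name `flops_per_alu_cycle` and all of its parameters are hypothetical; the statement count, clock frequency and ALU count have to come from your own kernel and board.

```c
#include <assert.h>

/* FLOPs in one vec4 multiply-add: 4 multiplies + 4 adds. */
enum { FLOPS_VEC4_MAD = 8 };
/* FLOPs in one vec4 dot product: 4 multiplies + 3 adds. */
enum { FLOPS_VEC4_DOT = 7 };

/*
 * Hypothetical helper: achieved FLOPs per ALU per cycle.
 *   statements    - total number of shader statements executed
 *   flops_per_stmt - FLOPs counted for one statement (see enums above)
 *   seconds       - measured running time
 *   clock_hz      - GPU clock frequency (assumed known for the board)
 *   alus          - total number of ALUs on the GPU (assumed known)
 */
double flops_per_alu_cycle(double statements, int flops_per_stmt,
                           double seconds, double clock_hz, int alus)
{
    assert(seconds > 0.0 && clock_hz > 0.0 && alus > 0);
    double total_flops = statements * (double)flops_per_stmt;
    double total_alu_cycles = seconds * clock_hz * (double)alus;
    return total_flops / total_alu_cycles;
}
```

With made-up inputs such as `flops_per_alu_cycle(1.0e9, FLOPS_VEC4_MAD, 0.5, 500.0e6, 4)` the result is 8, and the same call with `FLOPS_VEC4_DOT` gives 7. If both cases really take the same time for the same number of statements, each one on its own lands well below 17, which is what I would expect if the 17 figure only applies when several kinds of operation are issued in the same cycle.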
Hi chrisvarns,
Thanks for your reply. It's really helpful.