For Mali T604 and T628, peak performance is 17 FP32 FLOPS per ALU per cycle. http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf breaks this figure down into its component operations.
Also, in the thread "What is exact double precision performance for Mali T628 MP6 (Arndale Octa Board)?", @chrisvarns says 17 FLOPS.
And in http://malideveloper.arm.com/downloads/IWOCL.pdf, timhar01 also says 17 FLOPS.
But according to my measurements, the GPU can't process a dot product together with a vec4 MAD. The running times of case 1 and case 2 are the same. Why? How can I get 17 FLOPS?
case 1:
" color_out5 = color_out5*color5+color6;\n"
case 2:
" color_out1 = vec4(dot(color_out1, color1));\n"
Hi chen,
The vector units can perform twice as many floating-point operations per cycle in FP16 as in FP32, which doubles the PEAK FLOPS for those units. But again, this is a PEAK figure, and we are not suggesting that you should expect this level of performance with every shader. The shader above is apparently a bad example of the effect on A pipe instructions, since in that case it only affects the number of load/store instructions (still a good optimization!). The general advice is to use mediump wherever possible, as this gives the compiler the best chance of taking advantage of it.
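As a rough illustration of the mediump advice, the sketch below builds the same fragment shader with a selectable default float precision, so that a mediump build and a highp build can be timed against each other. It is only a sketch under assumptions: `build_frag_src` and the varying `v_color` are hypothetical names, the shader body reuses the dot-product statement from case 2, and highp support in fragment shaders depends on the device.

```c
#include <stdio.h>

/* Writes a small GLSL ES fragment shader into buf, using the given default
   float precision ("mediump" or "highp"). Returns the snprintf result:
   the length written, or a value >= size if buf was too small.
   v_color is a hypothetical varying; color1 mirrors the thread's case 2. */
int build_frag_src(char *buf, size_t size, const char *precision) {
    return snprintf(buf, size,
        "precision %s float;\n"
        "uniform vec4 color1;\n"
        "varying vec4 v_color;\n"
        "void main() {\n"
        "  vec4 color_out1 = v_color;\n"
        "  color_out1 = vec4(dot(color_out1, color1));\n"
        "  gl_FragColor = color_out1;\n"
        "}\n",
        precision);
}
```

Calling it once with "mediump" and once with "highp" gives two sources that differ only in the precision statement, which keeps the comparison between the two runs fair.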
Thanks,
Chris