The product pages for both the M7 and the M85 show figures for integer MAC throughput, but omit such figures for floating-point formats.
I'm talking here about simple long-vector dot products: repeated fused multiply-adds, with operands read from TCM and with appropriate unrolling, of course.
The best fp32 throughput I have seen for the M4 comes from the Arm libraries, at about 5-6 clock cycles per fused operation, which is a little disappointing.
The M85 architecture claims to be faster, and indeed makes concrete claims to that effect for integer types.
But what float throughput (32-bit or 16-bit) does it actually manage? I am unable to find any data on the matter, but surely I'm not the only person interested in that figure?
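For concreteness, here is a minimal sketch of the kind of kernel in question. The function name `dot_f32_unrolled`, the 4x unroll factor and the use of four independent accumulators are illustrative assumptions, not taken from the Arm libraries. Each line in the loop body is a plain multiply-accumulate that a compiler may contract into a single fused multiply-add on an FPU that supports it, depending on its floating-point contraction settings.

```c
#include <stddef.h>  /* size_t */

/* Hypothetical helper (not part of any Arm library): fp32 dot product,
 * unrolled 4x. Four independent accumulators are used so that
 * consecutive multiply-accumulates do not wait on each other's result. */
float dot_f32_unrolled(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: four multiply-accumulates per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    /* Tail: handle the remaining 0 to 3 elements. */
    for (; i < n; ++i) {
        acc0 += a[i] * b[i];
    }

    /* Combine the partial sums. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The cycles-per-operation figure I am after is the total cycle count for a call like this divided by `n`, for large `n` with both input arrays placed in TCM. Replacing each accumulate with `fmaf()` from `<math.h>` would request the fused operation explicitly rather than leaving it to the compiler.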
Just out of curiosity, can you link the relevant docs?