m7 vs m85 fp32 multiply-add throughput

The product pages for both the m7 and the m85 show figures for integer MAC throughput, but omit such figures for floating-point formats.

I'm talking here about simple long-vector dot products, i.e. repeated fused multiply-adds, with the data read from TCM and with appropriate unrolling, of course.

The best fp32 throughput I have seen for the m4 comes from the Arm libraries, at about 5-6 clock cycles per fused operation, which is a little disappointing.

The m85 architecture claims to be faster, and indeed makes concrete claims to that effect for integer types.

But what float throughput (32-bit or 16-bit) does it actually manage? I am unable to find any data on the matter, but surely I'm not the only person interested in that figure?