I'm seeing Cortex-A7 cycle-timing table here :
http://hardwarebug.org/2014/05/15/cortex-a7-instruction-cycle-timings/
For example,
VADD.F32 Dd, Dn, Dm takes 2 cycles
VADD.F32 Qd, Qn, Qm takes 4 cycles
same goes for VMUL..
Is this really the case ? I think to remember that both take 1 cycle on Cortex-A8 ?
I'm wondering what's the benefit to use NEON in this case, except for compatibility reason maybe, where some parts could run 4 times faster on some other NEON implementations ?