I'm using this code because I'm sure that the ARM code at the end takes exactly 5 cycles and leaves 2 bubbles in the pipeline for the branch. Remember this post: http://pulsar.websha...h-instructions/

I have a BeagleBoard-xM (DM3730), but I don't believe the processor is the problem. Try the code I gave and tell me if you get 15 cycles (10 for the NEON part and 5 for the ARM part).
I don't understand point 5, or how you get 9 cycles!
- Try to load (if possible) well before the data is used: there are enough registers to load the next iteration's data during the previous one.
- Try to write as soon as possible, that is, as soon as the registers are available for VSAVE.
- And don't read the same memory block with consecutive VLOADs.
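
To make the first two points concrete, here is a rough sketch in plain C of the loop shape I mean. It is not the NEON code from this thread: the function name `scale_pipelined`, the block size of 4 and the `x * 2 + 1` computation are all made up for illustration, and a C compiler is free to reorder these loads and stores, so the real thing has to be scheduled by hand in assembly. It only shows the ordering: the next block is loaded while the current one is still being processed, and each result is stored as soon as it is ready.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block size, standing in for one NEON register's worth of data. */
enum { BLOCK = 4 };

/* Illustrative kernel: dst[i] = src[i] * 2 + 1, with n a multiple of BLOCK. */
void scale_pipelined(const int32_t *src, int32_t *dst, size_t n)
{
    int32_t cur[BLOCK], next[BLOCK];
    size_t i, k;

    assert(n % BLOCK == 0);
    if (n == 0)
        return;

    /* Prologue: load the first block before entering the loop. */
    for (k = 0; k < BLOCK; k++)
        cur[k] = src[k];

    for (i = 0; i < n; i += BLOCK) {
        int has_next = (i + BLOCK < n);

        /* Load the NEXT block early, long before it is used. */
        if (has_next)
            for (k = 0; k < BLOCK; k++)
                next[k] = src[i + BLOCK + k];

        /* Compute on the CURRENT block and store each result right away. */
        for (k = 0; k < BLOCK; k++)
            dst[i + k] = cur[k] * 2 + 1;

        /* The block loaded above becomes the current one. */
        if (has_next)
            for (k = 0; k < BLOCK; k++)
                cur[k] = next[k];
    }
}
```

In hand-written assembly the copy from `next` to `cur` would disappear: you would just alternate between two sets of registers from one iteration to the next.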