I'm using this code because I'm sure that the final ARM code takes exactly 5 cycles and leaves 2 bubbles in the pipeline for the branch. Remember this post: http://pulsar.websha...h-instructions/
I have a BeagleBoard XM (DM3730), but I don't believe the processor is the problem. Try the code I gave and tell me whether you get 15 cycles (10 for the NEON part and 5 for the ARM part).
I do not understand point 5, or how you get 9 cycles!
- try to load (if possible) a long time before using the data (there are enough registers to load the data for the next iteration during the previous one).
- try to write as soon as possible (that is, as soon as the registers are available for VSAVE).
- and do not read the same memory block with consecutive VLOADs.
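
To make the ordering in the list above concrete, here is a minimal sketch in plain C rather than NEON assembly. It only shows the schedule (load the next block early, store the current block as soon as it is ready); it says nothing about real cycle counts, and a compiler is free to reorder it. The function name `scale_pipelined`, the `BLOCK` size and the "multiply by 2" operation are all hypothetical, chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block size: stands in for the data held by one NEON register. */
#define BLOCK 4

/* Doubles every element of src into dst.
 * Assumption: n is a multiple of BLOCK. */
void scale_pipelined(const int32_t *src, int32_t *dst, size_t n)
{
    int32_t cur[BLOCK], next[BLOCK];
    size_t i, k;

    if (n < BLOCK)
        return;

    /* Prologue: load the first block before entering the loop. */
    for (k = 0; k < BLOCK; k++)
        cur[k] = src[k];

    for (i = 0; i + BLOCK < n; i += BLOCK) {
        /* Load the NEXT iteration's block long before it is used.
         * It comes from a different block than the previous load. */
        for (k = 0; k < BLOCK; k++)
            next[k] = src[i + BLOCK + k];

        /* Compute on the CURRENT block and store it right away. */
        for (k = 0; k < BLOCK; k++)
            dst[i + k] = cur[k] * 2;

        /* Rotate: the preloaded block becomes the current one. */
        for (k = 0; k < BLOCK; k++)
            cur[k] = next[k];
    }

    /* Epilogue: the last block has been loaded but not yet stored. */
    for (k = 0; k < BLOCK; k++)
        dst[i + k] = cur[k] * 2;
}
```

In real NEON code the rotate step would not exist: you would simply alternate between two sets of registers from one unrolled iteration to the next.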