Neon reg to ARM reg data transfer
Vishwa Vishwa
over 12 years ago
Note: This was originally posted on 30th July 2009 at http://forums.arm.com
I'm transferring data from a NEON register to an ARM register, which is very costly.
That is, each vmov.32 r6, d0[2] takes around 13 to 18 cycles.
This is proving very costly for a function that runs many times.
Can anyone please suggest a way around this?
Thanks in advance for any help. :)
Peter Harris
over 12 years ago
Note: This was originally posted on 30th July 2009 at http://forums.arm.com
You cannot reduce the NEON-to-ARM or ARM-to-NEON register move costs; they are caused by the microarchitecture of the NEON unit on the Cortex-A8.
One of the key parts of vectorizing for the NEON unit is to minimize the interaction between the main ARM pipeline and the NEON unit: push relatively large blocks of code through NEON without too much dependence on what is happening in the ARM pipeline. Any interaction is fairly time-consuming, as you have noted, but for many common tasks, such as media codecs, the need for interaction is actually quite low.
It is worth noting that the NEON unit has its own load/store hardware, so NEON code can make its own memory accesses and does *not* need the ARM to load data for it (data loaded by the ARM would then have to be vmov'd into the NEON unit, which is slow).
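To make the advice above concrete, here is a rough sketch in C using NEON intrinsics. It is not code from the original poster; the function names sum_u32 and add_const_u32 and the HAVE_NEON macro are made up for illustration. The first function keeps a running total inside NEON and moves a single lane to an ARM register once, after the loop. The second lets NEON load and store memory itself, so no per-element register transfer is needed. The exact instructions emitted depend on your compiler, so check the generated assembly if the cycle count matters.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, only used in this sketch */
#else
#define HAVE_NEON 0   /* scalar fallback so the sketch builds anywhere */
#endif

/* Sum n 32-bit values (hypothetical helper).
 * The accumulator stays in a NEON register for the whole loop. */
uint32_t sum_u32(const uint32_t *src, size_t n)
{
    uint32_t total = 0;
    size_t i = 0;
#if HAVE_NEON
    uint32x4_t acc = vdupq_n_u32(0);
    for (; i + 4 <= n; i += 4) {
        /* NEON loads its own data; nothing passes through ARM registers. */
        acc = vaddq_u32(acc, vld1q_u32(src + i));
    }
    /* Fold four lanes to two, then two to one, still inside NEON. */
    uint32x2_t pair = vadd_u32(vget_low_u32(acc), vget_high_u32(acc));
    pair = vpadd_u32(pair, pair);
    /* The only NEON-to-ARM transfer in this function. */
    total = vget_lane_u32(pair, 0);
#endif
    /* Leftover elements (or everything, without NEON). */
    for (; i < n; ++i) {
        total += src[i];
    }
    return total;
}

/* dst[i] = src[i] + k (hypothetical helper).
 * Results go straight from NEON to memory with a vector store. */
void add_const_u32(uint32_t *dst, const uint32_t *src, size_t n, uint32_t k)
{
    size_t i = 0;
#if HAVE_NEON
    /* One ARM-to-NEON transfer, hoisted out of the loop. */
    const uint32x4_t vk = vdupq_n_u32(k);
    for (; i + 4 <= n; i += 4) {
        vst1q_u32(dst + i, vaddq_u32(vld1q_u32(src + i), vk));
    }
#endif
    for (; i < n; ++i) {
        dst[i] = src[i] + k;
    }
}
```

If the ARM side really does need a value on every iteration, another option is to have NEON store it to memory and read it back later with an ordinary load, ideally after enough other work that the store has completed.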