Neon reg to ARM reg data transfer
Vishwa Vishwa
over 12 years ago
Note: This was originally posted on 30th July 2009 at
http://forums.arm.com
I'm transferring data from a NEON register to an ARM register, which is very costly: each vmov.32 r6,d0[2] takes around 13 to 18 cycles.
This adds up in a function that runs many times.
Can anyone suggest a way around this?
Thanks in advance for any help. :)
Vishwa Vishwa
over 12 years ago
Note: This was originally posted on 13th August 2009 at
http://forums.arm.com
Hi guys,
I found a way around this. There were some 16 NEON-to-ARM transfer instructions in my function, accounting for some 208 cycles per call (latency for vmov.u32 r6,d26[0] = 14). The function was called some 20 thousand times, so it was eating up a lot of time.
But there was an opportunity: interleaving 14 independent instructions in between.
The code was something like this:
vmov.32 r6,d[0]
- instrns which use r6-
- instrn-
- instrn-
- instrn- ........
vmov.32 r6,d[0]
- instrns which use r6-
- instrn-
- instrn-
- instrn- ..........
................etc
Looking at the code structure, the remaining option was to interleave. Take the instructions that use the freshly transferred r6 value and move them to just before the next NEON-to-ARM transfer instruction, so that the independent instructions run while the transfer completes. This assumes the instructions in between are independent of r6. This gave a gain of around 4%.
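To make the reasoning concrete, here is a rough sketch in C of a toy timing model. It is not a model of any particular ARM core. It assumes a single-issue, in-order pipeline where one instruction issues per cycle and a consumer stalls until its source is ready. The names `Instr`, `total_cycles` and `block_cycles` are made up for illustration, and the latency of 14 and the 14 filler instructions are simply the figures quoted in the post.

```c
#include <assert.h>

#define MAX_INSTRS 64
#define NO_DEP (-1)

/* Hypothetical instruction record for the toy model. */
typedef struct {
    int latency; /* cycles from issue until the result is usable */
    int dep;     /* index of the earlier instruction it reads, or NO_DEP */
} Instr;

/* Assumed model: single issue, in order, one instruction per cycle.
 * An instruction waits until the result it depends on is ready.
 * Returns the cycle at which the last result becomes available. */
int total_cycles(const Instr *prog, int n)
{
    int ready[MAX_INSTRS]; /* cycle at which each result is ready */
    int next_issue = 0;    /* earliest cycle the next instruction can issue */
    int finish = 0;

    assert(n >= 0 && n <= MAX_INSTRS);
    for (int i = 0; i < n; i++) {
        int issue = next_issue;
        int dep = prog[i].dep;

        assert(dep < i); /* may only depend on an earlier instruction */
        if (dep >= 0 && ready[dep] > issue)
            issue = ready[dep]; /* stall until the source is ready */

        ready[i] = issue + prog[i].latency;
        if (ready[i] > finish)
            finish = ready[i];
        next_issue = issue + 1;
    }
    return finish;
}

/* One block: a NEON-to-ARM transfer, the instruction that uses r6, and
 * `filler` independent single-cycle instructions. If consumer_last is
 * nonzero, the user of r6 is moved to the end of the block. */
int block_cycles(int transfer_latency, int filler, int consumer_last)
{
    Instr prog[MAX_INSTRS];
    int n = 0;

    assert(filler >= 0 && filler + 2 <= MAX_INSTRS);
    prog[n++] = (Instr){ transfer_latency, NO_DEP }; /* the vmov.32 */
    if (!consumer_last)
        prog[n++] = (Instr){ 1, 0 };                 /* uses r6 right away */
    for (int k = 0; k < filler; k++)
        prog[n++] = (Instr){ 1, NO_DEP };            /* independent work */
    if (consumer_last)
        prog[n++] = (Instr){ 1, 0 };                 /* uses r6 after the filler */
    return total_cycles(prog, n);
}
```

Under these assumptions, `block_cycles(14, 14, 0)` (the user of r6 directly after the transfer) comes to 29 cycles, while `block_cycles(14, 14, 1)` (the user moved behind the 14 independent instructions) comes to 16, because the transfer latency overlaps with the independent work instead of stalling the pipeline. A real core will behave differently in detail, which fits the much smaller overall gain of around 4% measured above.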
Thanks a lot for your help!