Neon vldx_lane_y compilation efficiency
Pierre ELINE
over 12 years ago
Gilead Kutnick
over 12 years ago
Note: This was originally posted on 16th November 2011 at http://forums.arm.com
What armcc (RVCT 3.0) is doing actually makes at least some sense, because it seems there's an extra stall if you load to different lanes of the same register back to back. But it does it very poorly: it doesn't alternate between two sets of 64-bit registers and save the merge for the end, and it doesn't even manage to pair the register allocation so that 128-bit operations can be used.
I have no idea what the others are doing.
Just goes to show that if you want good NEON performance you're best off writing ASM.
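For anyone who wants to try the scheduling idea from the reply above in intrinsics before dropping to ASM, here is a rough sketch. It alternates the lane loads between two 64-bit registers (`lo` and `hi`) so that consecutive loads never target the same register, and merges them into one 128-bit value only at the end. The helper name `gather8_u16`, the `stride` parameter, and the choice of `uint16_t` are assumptions made for illustration; none of them come from the thread. The compiler is free to reorder the loads and reallocate registers, so this only expresses the intent and does not guarantee the generated schedule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the thread): gather 8 uint16_t values that
 * sit `stride` elements apart and store them contiguously in out[0..7]. */
void gather8_u16(const uint16_t *src, size_t stride, uint16_t out[8])
{
    assert(src != NULL && out != NULL);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Two independent 64-bit accumulators. */
    uint16x4_t lo = vdup_n_u16(0);
    uint16x4_t hi = vdup_n_u16(0);

    /* Alternate between lo and hi so that back-to-back lane loads
     * do not write to the same register. */
    lo = vld1_lane_u16(src + 0 * stride, lo, 0);
    hi = vld1_lane_u16(src + 4 * stride, hi, 0);
    lo = vld1_lane_u16(src + 1 * stride, lo, 1);
    hi = vld1_lane_u16(src + 5 * stride, hi, 1);
    lo = vld1_lane_u16(src + 2 * stride, lo, 2);
    hi = vld1_lane_u16(src + 6 * stride, hi, 2);
    lo = vld1_lane_u16(src + 3 * stride, lo, 3);
    hi = vld1_lane_u16(src + 7 * stride, hi, 3);

    /* Merge once at the end; from here on the data can be handled
     * with 128-bit operations. */
    vst1q_u16(out, vcombine_u16(lo, hi));
#else
    /* Scalar fallback so the sketch also builds on non-NEON targets. */
    for (size_t k = 0; k < 8; ++k) {
        out[k] = src[k * stride];
    }
#endif
}
```

Whether the compiler keeps this interleaving has to be checked in the disassembly for each toolchain; if it does not, hand-written ASM remains the fallback, as the reply suggests.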