Arm Development Studio forum Neon vldx_lane_y compilation efficiency

Neon vldx_lane_y compilation efficiency

Pierre ELINE over 12 years ago

Gilead Kutnick over 12 years ago

Note: This was originally posted on 16th November 2011 at http://forums.arm.com

What armcc (RVCT 3.0) is doing actually makes at least some sense, because it seems there's an extra stall if you load to different lanes of the same register back to back. But it does it very poorly: it doesn't alternate between two sets of 64-bit registers and save the merge for the end, and it doesn't even manage to pair the register allocation so that it can perform 128-bit operations.
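To make the alternating idea concrete, here is a rough sketch in C intrinsics of one way to read it: consecutive lane loads go to two different 64-bit registers, and the two halves are only combined into a 128-bit value at the very end. The helper name `gather8_u16`, the strided layout, and the `HAVE_NEON` macro are made up for illustration and are not from the original question. The scalar fallback is only there so the snippet builds on non-NEON targets. The compiler is still free to reorder or re-allocate these loads, so this shows intent rather than guaranteed codegen.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* illustrative macro, not a standard one */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: gather eight uint16_t values that sit `stride`
   elements apart in `src` and store them contiguously in `dst`. */
void gather8_u16(uint16_t dst[8], const uint16_t *src, size_t stride)
{
#if HAVE_NEON
    /* Two independent 64-bit accumulators. */
    uint16x4_t lo = vdup_n_u16(0);
    uint16x4_t hi = vdup_n_u16(0);

    /* Alternate lo/hi so that no two consecutive lane loads in the
       source target the same 64-bit register. Lane indices must be
       compile-time constants, hence the unrolling. */
    lo = vld1_lane_u16(src + 0 * stride, lo, 0);
    hi = vld1_lane_u16(src + 4 * stride, hi, 0);
    lo = vld1_lane_u16(src + 1 * stride, lo, 1);
    hi = vld1_lane_u16(src + 5 * stride, hi, 1);
    lo = vld1_lane_u16(src + 2 * stride, lo, 2);
    hi = vld1_lane_u16(src + 6 * stride, hi, 2);
    lo = vld1_lane_u16(src + 3 * stride, lo, 3);
    hi = vld1_lane_u16(src + 7 * stride, hi, 3);

    /* Merge only once, at the end, into a single 128-bit value. */
    vst1q_u16(dst, vcombine_u16(lo, hi));
#else
    /* Portable fallback with the same result. */
    for (size_t i = 0; i < 8; ++i)
        dst[i] = src[i * stride];
#endif
}
```

If the register allocator places `lo` and `hi` in an adjacent D-register pair, the final `vcombine_u16` should not need any extra moves. Whether a given compiler actually does that is exactly what is worth checking in the disassembly.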

I have no idea what the others are doing.

Just goes to show that if you want good NEON performance you're best off writing ASM.
