Neon vldx_lane_y compilation efficiency

  • Note: This was originally posted on 16th November 2011 at http://forums.arm.com

    What Armcc RVCT3.0 is doing actually makes at least some sense, because there seems to be an extra stall when you load to different lanes of the same register back to back. But it does it very poorly: it doesn't alternate between two sets of 64-bit registers and save the merge for the end, and it doesn't even manage to pair the register allocation so that 128-bit operations can be used.
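
For illustration, here is a minimal sketch of the "alternate between two 64-bit registers, merge at the end" pattern using intrinsics. The helper name `gather8_u16`, the element type and the stride parameter are assumptions made for this example, not code from the thread. The compiler is free to reorder or re-allocate these loads, so this only expresses the intent and does not guarantee the schedule. A scalar fallback is included so the sketch also builds on non-NEON targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: gather 8 uint16_t values that sit `stride`
 * elements apart in `src` and store them contiguously in out[0..7]. */
static void gather8_u16(const uint16_t *src, size_t stride, uint16_t out[8])
{
    assert(src != NULL && out != NULL);
#if HAVE_NEON
    /* Two independent 64-bit (D) registers: lo holds elements 0..3,
     * hi holds elements 4..7. */
    uint16x4_t lo = vdup_n_u16(0);
    uint16x4_t hi = vdup_n_u16(0);

    /* Alternate lo/hi so that consecutive lane loads do not target
     * the same register back to back. */
    lo = vld1_lane_u16(src + 0 * stride, lo, 0);
    hi = vld1_lane_u16(src + 4 * stride, hi, 0);
    lo = vld1_lane_u16(src + 1 * stride, lo, 1);
    hi = vld1_lane_u16(src + 5 * stride, hi, 1);
    lo = vld1_lane_u16(src + 2 * stride, lo, 2);
    hi = vld1_lane_u16(src + 6 * stride, hi, 2);
    lo = vld1_lane_u16(src + 3 * stride, lo, 3);
    hi = vld1_lane_u16(src + 7 * stride, hi, 3);

    /* Merge the two halves once, at the end, into one 128-bit (Q)
     * register and store it with a single 128-bit operation. */
    vst1q_u16(out, vcombine_u16(lo, hi));
#else
    /* Scalar fallback with the same result, for non-NEON builds. */
    for (size_t i = 0; i < 8; ++i) {
        out[i] = src[i * stride];
    }
#endif
}
```

Whether a given compiler keeps this interleaving in the generated code has to be checked in the disassembly.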

    I have no idea what the others are doing.

    Just goes to show that if you want good NEON performance you're best off writing ASM.