Neon vldx_lane_y compilation efficiency

  • Note: This was originally posted on 16th November 2011 at http://forums.arm.com

    Thanks for your reply.

    I find the cycle table hard to read, but according to this simulator there is no difference in cycle count between (http://pulsar.websha...sample-96ee929b):


    vld2.8  {d0[0],d1[0]}, [r0]
    vld2.8  {d0[1],d1[1]}, [r0]
    vld2.8  {d0[2],d1[2]}, [r0]
    vld2.8  {d0[3],d1[3]}, [r0]


    and (http://pulsar.websha...sample-469dc426):


    vld2.8  {d0[0],d1[0]}, [r0]
    vld2.8  {d2[1],d3[1]}, [r0]
    vld2.8  {d4[2],d5[2]}, [r0]
    vld2.8  {d6[3],d7[3]}, [r0]
  • Note: This was originally posted on 16th November 2011 at http://forums.arm.com

    What armcc RVCT3.0 is doing actually makes at least some sense, because there seems to be an extra stall if you load to different lanes of the same register back to back. But it does it very poorly: it doesn't alternate between two sets of 64-bit registers and save the merge for the end, and it doesn't even manage to pair the register allocation so that 128-bit operations can be used.

    I have no idea what the others are doing.

    Just goes to show that if you want good NEON performance you're best off writing ASM.
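    To illustrate the scheduling idea above, here is a hypothetical sketch (not what any of the compilers emit) of alternating lane loads between two register sets so that no two consecutive loads touch the same register, with the merge saved for the end. It assumes the registers are zeroed first, since VLD2 to a lane preserves the other lanes, and it keeps the same non-incrementing [r0] addressing as the simulator snippets above:


    vmov.i8 d0, #0              @ zero both sets so VORR can merge them
    vmov.i8 d1, #0
    vmov.i8 d2, #0
    vmov.i8 d3, #0
    vld2.8  {d0[0],d1[0]}, [r0] @ even lanes -> set A (d0/d1)
    vld2.8  {d2[1],d3[1]}, [r0] @ odd lanes  -> set B (d2/d3)
    vld2.8  {d0[2],d1[2]}, [r0] @ back to set A: no back-to-back same-register loads
    vld2.8  {d2[3],d3[3]}, [r0]
    vorr    d0, d0, d2          @ merge the two sets once at the end
    vorr    d1, d1, d3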
  • Note: This was originally posted on 18th November 2011 at http://forums.arm.com

    NEON on Cortex-A8 is complex and somewhat mysterious. The TRM is incomplete and while webshaker's simulator is good it's far from handling all of the edge cases. There's a lot we don't even fully understand. If you don't believe me I recommend trying it on real hardware.