Implementation in NEON of non uniform address jumps

  • Note: This was originally posted on 1st July 2012 at http://forums.arm.com

    Okay and the second half of D0[0] can be utilized by doing {snip}. Right?


    No. Remember this is a load-store architecture. The address is where the data comes from; the register specifier sets where the data gets stored. You've changed the address but not the register specifier, so you've just overwritten the bottom 16 bits again.

    The first question is: what do you mean by "the second half of D0[0]"? D0 is a doubleword register (8 bytes/64 bits). It is split into a number of lanes, depending on the vector element size. In sim's example you are using 16-bit elements, so there are 4 lanes per register. In this case you set the bottom 16 bits because you specified lane 0 (D0[0]). But this isn't half of anything: it sets ALL 16 bits of D0[0], and as there are four 16-bit lanes, this sets one quarter of the whole D register.

    If you want to load the second 16 bits of the register (bits 16 to 31), then you want lane 1, i.e. D0[1].
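To make the address/lane distinction concrete, here is a minimal sketch in portable C that models a D register as four 16-bit lanes. It is not NEON code: the type `d_reg_u16` and the helpers `load_lane_u16` and `gather4_u16` are hypothetical names invented for this illustration. On real hardware the corresponding operation would be a single-lane load (for example the `vld1_lane_u16` intrinsic from `arm_neon.h`).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one 64-bit D register viewed as four 16-bit lanes. */
typedef struct { uint16_t lane[4]; } d_reg_u16;

/* Models a single-lane load: the address selects WHICH data is read,
 * the lane index selects WHERE in the register it lands.
 * All other lanes are left untouched. */
static d_reg_u16 load_lane_u16(const uint16_t *addr, d_reg_u16 reg, unsigned lane)
{
    assert(lane < 4);            /* a D register holds only four 16-bit lanes */
    reg.lane[lane] = *addr;
    return reg;
}

/* Fills all four lanes from non-uniform element offsets, one scalar
 * load per lane. This is the pattern that works but wastes the vector unit. */
static d_reg_u16 gather4_u16(const uint16_t *base, const unsigned offsets[4])
{
    d_reg_u16 reg = {{0, 0, 0, 0}};
    for (unsigned lane = 0; lane < 4; ++lane)
        reg = load_lane_u16(base + offsets[lane], reg, lane);
    return reg;
}
```

Calling `load_lane_u16` twice with different addresses but the same lane index simply overwrites that lane, which is the mistake described above; changing the lane index to 1 is what fills the next 16 bits.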

    As I mentioned in one of your other posts, if you end up doing a lot of scalar loads you are really defeating the point of using a vector engine, so restructure your algorithm, if you can, so that you do not need to do this.

    Iso