A8/A9 NEON: 128-bit registers, 64-bit ALUs

Note: This was originally posted on 1st December 2011 at http://forums.arm.com

Hi,

I've been reading a lot about the NEON architecture for a project, but there is still one thing I'm not entirely sure about. If I understand correctly, the 128-bit view is more of an aid for the programmer: since the ALUs in the NEON engine are only 64 bits wide, instructions working on a Qx register will just take double the time. I was going over the timing tables, trying my best to understand them :-), and saw this confirmed in some instruction timings, but in others the operation on 128 bits would also take just one cycle (add, for example).

Did I read the table wrong? If not, how is this achieved? Is the operation split across two 64-bit adders that are coupled (via carry)?

Thanks in advance
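
One point that bears on the carry question: NEON integer adds work on independent lanes of at most 64 bits, so a Q-register add never needs a carry to cross from one 64-bit half into the other. The sketch below models that in portable C. The `dreg_t` and `qreg_t` types and the function names are hypothetical, made up for illustration; it only mirrors the architectural lane semantics and says nothing about how the A8/A9 pipelines are actually built.

```c
#include <stdint.h>

/* Hypothetical model: a D register as two 32-bit lanes,
 * a Q register as two D-sized halves. */
typedef struct { uint32_t lane[2]; } dreg_t;
typedef struct { dreg_t half[2]; } qreg_t;

/* Lane-wise 32-bit add on a 64-bit D register.
 * Each lane wraps on its own; nothing carries into the neighbouring lane. */
static dreg_t add_d_u32(dreg_t a, dreg_t b)
{
    dreg_t r;
    for (int i = 0; i < 2; i++) {
        r.lane[i] = (uint32_t)(a.lane[i] + b.lane[i]);
    }
    return r;
}

/* A 128-bit add expressed as two independent 64-bit adds.
 * No carry is passed between the two halves. */
static qreg_t add_q_u32(qreg_t a, qreg_t b)
{
    qreg_t r;
    r.half[0] = add_d_u32(a.half[0], b.half[0]);
    r.half[1] = add_d_u32(a.half[1], b.half[1]);
    return r;
}
```

Because the two halves never interact, the same result comes out whether they are computed one after the other or side by side, which is why the cycle count is a property of the implementation rather than of the instruction itself.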
  • Note: This was originally posted on 12th December 2011 at http://forums.arm.com


    Yes, you are right: many operations use all 128 bits at once on Cortex-A8/A9, so your detailed analysis is quite good. I was just giving a rough overview: even if the more complex 128-bit operations take 2 (or occasionally 3) cycles instead of 1, it is still a good idea to write NEON code for 128 bits, because future ARMv7 devices will soon have even more 128-bit single-cycle NEON paths.

    -Shervin.


    I partially agree with this. Using 128-bit operations instead of 2x 64-bit, even where the CPU takes 2 cycles, can also save fetch/decode time, although that isn't usually a bottleneck with NEON on A8/A9. But depending on your code, moving to a 128-bit granularity could end up costing cycles if you're not always using all the elements in the vector. In those cases you may be better off sticking with the 64-bit forms.

    Of course, if future proofing really is a big goal then that probably trumps this.
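
    To make the granularity point concrete, here is a minimal sketch that adds two 2-element float vectors, once with the 64-bit (D) form and once padded out to the 128-bit (Q) form. The function names `add2_d` and `add2_q` are hypothetical. The NEON intrinsics are used only when the compiler targets NEON; otherwise a scalar fallback keeps the code runnable. It is an illustration of the wasted lanes and the extra padding work, not a benchmark.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* 64-bit form: two floats fill a D register exactly, no lanes are wasted. */
static void add2_d(const float a[2], const float b[2], float out[2])
{
#if HAVE_NEON
    vst1_f32(out, vadd_f32(vld1_f32(a), vld1_f32(b)));
#else
    for (size_t i = 0; i < 2; i++) {
        out[i] = a[i] + b[i];
    }
#endif
}

/* 128-bit form: the same data has to be padded to four lanes, so half of
 * the Q register does no useful work and the padding itself costs moves. */
static void add2_q(const float a[2], const float b[2], float out[2])
{
    float pa[4] = { a[0], a[1], 0.0f, 0.0f };
    float pb[4] = { b[0], b[1], 0.0f, 0.0f };
    float po[4];
#if HAVE_NEON
    vst1q_f32(po, vaddq_f32(vld1q_f32(pa), vld1q_f32(pb)));
#else
    for (size_t i = 0; i < 4; i++) {
        po[i] = pa[i] + pb[i];
    }
#endif
    out[0] = po[0];
    out[1] = po[1];
}
```

    Both functions produce the same two results; the Q version only pays off when there are enough elements to fill all four lanes.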