A8/9 NEON 128-bit registers, 64-bit ALUs

Note: This was originally posted on 1st December 2011 at http://forums.arm.com

Hi,

I've been reading a lot about the NEON architecture for a project, but there is still one thing I'm not entirely sure about. If I understand correctly, the 128-bit view is more of an aid for the programmer: since the ALUs in the NEON engine are only 64 bits wide, instructions working on a Qx register will just take double the time. I was going over the timing tables, trying my best to understand them :-), and saw this confirmed in some instruction timings, but in others the operation on 128 bits would also take just one cycle (add, for example).

Did I read the table wrong? If not, how is this achieved? Is the work divided over two 64-bit adders which are then coupled (via carry)?

Thanks in advance
  • Note: This was originally posted on 1st December 2011 at http://forums.arm.com

    This is a basic rundown of what Cortex-A8/Cortex-A9's NEON implementation provides (note that this is slightly speculative, but pretty well supported by existing documentation):

    - 2 64-bit simple integer ALUs, capable of add/sub/logic/shifts/compares/min/max/etc. Only one of them can handle certain operations such as bit selects, variable shifts, and horizontal operations. And of course anything widening or narrowing isn't 128-bit to 128-bit. Note that the ALUs can do some full 64-bit operations like add/sub/shift.
    - 1 64/128-bit permute unit. Some 128-bit to 64-bit operations like vmovn are one cycle, as are some 128-bit operations like reverse and swap, but for the most part it's 1 cycle for the 64-bit forms, as with zip/unzip and ext. tbl is at least 2 cycles and 64-bit only.
    - 8 8x16 integer multipliers with accumulate. These can be chained to do 8 8x8 MACs, 4 16x16 MACs, or 1 32x32 MAC in a cycle (note that the last case has to be done as 2 32x32 MACs in 2 cycles because of the register arrangement).
    - 1 128-bit load/store unit
    - 2 single precision floating point multipliers and 2 single precision floating point add/sub/cmp/etc. units

    Aside from what's mentioned in the literature and the TRM's timings, I've confirmed most of this experimentally.
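
    To make the multiplier counts above concrete, here is a small portable C sketch (plain arithmetic, not NEON code and not a description of the real datapath). It only shows that the numbers are consistent: a 16x16 product needs two 8x16 partial products and a 32x32 product needs eight, which lines up with 8 multipliers giving 4 16x16 MACs or 1 32x32 MAC. The names `mul8x16`, `mac16x16` and `mul32x32` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 8x16 multiplier: unsigned 8-bit by 16-bit. */
uint32_t mul8x16(uint8_t a, uint16_t b) {
    return (uint32_t)a * b;
}

/* 16x16 multiply-accumulate built from two 8x16 partial products.
   The accumulate wraps modulo 2^32, like a 32-bit accumulator lane. */
uint32_t mac16x16(uint32_t acc, uint16_t a, uint16_t b) {
    uint32_t lo = mul8x16((uint8_t)(a & 0xFF), b);
    uint32_t hi = mul8x16((uint8_t)(a >> 8), b);
    assert(hi <= 0xFFFFFFu); /* fits in 24 bits, so hi << 8 cannot overflow */
    return acc + lo + (hi << 8);
}

/* 32x32 multiply built from eight 8x16 partial products:
   four 8-bit slices of a times two 16-bit slices of b. */
uint64_t mul32x32(uint32_t a, uint32_t b) {
    uint64_t sum = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t a8 = (uint8_t)(a >> (8 * i));
        for (int j = 0; j < 2; j++) {
            uint16_t b16 = (uint16_t)(b >> (16 * j));
            /* Each partial product is weighted by the position of its slices. */
            sum += (uint64_t)mul8x16(a8, b16) << (8 * i + 16 * j);
        }
    }
    return sum;
}
```

    For example, `mac16x16(5, 300, 7)` gives 2105, using two calls to `mul8x16`, while `mul32x32` uses all eight partial products for a single result.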

    So a majority of simple integer operations (not counting multiplies) can be performed in 1 cycle, as can loads/stores and some permutes. I think that ARM wants to maintain NEON performance at about double the throughput of the ARMv6 equivalent, where you have 2 32-bit ALUs (with some SIMD operations), 4 8x16 multipliers (although you can't do fully independent 16x16 MACs or anything 8x8 or 8x16) and 1 single/double precision FPU. On Cortex-A5, NEON only has one 64-bit ALU, which corresponds with the integer core only having one 32-bit ALU.
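
    On the question of coupled adders: NEON integer adds are lane-wise, and the widest integer lane is 64 bits, so no carry ever has to cross the boundary between the two 64-bit halves of a Q register. The portable C sketch below (not NEON code; `q_reg`, `alu64_add_u8`, `add_u8_split`, `add_u8_whole` and `q_equal` are hypothetical names made up for illustration) models a 16 x 8-bit add as two independent 64-bit halves, so the result can be compared against a single 16-lane add. It says nothing about how the real datapath is wired.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit Q register as sixteen 8-bit lanes. */
typedef struct { uint8_t lane[16]; } q_reg;
static_assert(sizeof(q_reg) == 16, "Q register model must be 128 bits");

/* Model of one 64-bit ALU: adds eight adjacent 8-bit lanes. */
void alu64_add_u8(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(a[i] + b[i]); /* carry out of a lane is dropped */
}

/* 128-bit add as two independent 64-bit halves, no carry between them. */
q_reg add_u8_split(q_reg a, q_reg b) {
    q_reg r;
    alu64_add_u8(a.lane, b.lane, r.lane);             /* low half  */
    alu64_add_u8(a.lane + 8, b.lane + 8, r.lane + 8); /* high half */
    return r;
}

/* Reference: lane-wise add over all sixteen lanes in one loop. */
q_reg add_u8_whole(q_reg a, q_reg b) {
    q_reg r;
    for (int i = 0; i < 16; i++)
        r.lane[i] = (uint8_t)(a.lane[i] + b.lane[i]);
    return r;
}

/* Returns 1 when both register models hold identical lanes. */
int q_equal(q_reg a, q_reg b) {
    return memcmp(a.lane, b.lane, sizeof a.lane) == 0;
}
```

    For any inputs, `q_equal(add_u8_split(a, b), add_u8_whole(a, b))` holds, including when lane 7 overflows: the overflow wraps inside that lane and never reaches lane 8.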
  • Note: This was originally posted on 12th December 2011 at http://forums.arm.com


    It's true that NEON has 128-bit registers but nearly always operates on just 64 bits at a time.


    This is NOT true; please read my post. For integer operations, 128-bit operation is more the rule than the exception, at least on Cortex-A8 and A9. If you're targeting these devices, it's important to know which operations do and don't operate on 128 bits in one cycle.
  • Note: This was originally posted on 12th December 2011 at http://forums.arm.com


    Yes, you are right: many operations use all 128 bits at once on Cortex-A8/A9, so your detailed analysis is quite good. I was just giving the rough overview: even if the more complex 128-bit operations take 2 (or occasionally 3) cycles instead of 1, it is still a good idea to write NEON code for 128 bits, because future ARMv7 devices will soon have even more 128-bit single-cycle NEON paths.

    -Shervin.


    I partially agree with this. Using 128-bit operations instead of 2x 64-bit even where the CPU takes 2 cycles can also save fetch/decode time, although that isn't usually a bottleneck with NEON on A8/A9. But, depending on your code, it could end up costing cycles moving to a 128-bit granularity, if you're not always using all the elements in the vector. In these cases you may be better off sticking with the 64-bit forms.

    Of course, if future proofing really is a big goal then that probably trumps this.
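
    A toy cost model makes the trade-off easier to see. It is only a sketch under stated assumptions: the cycle counts passed in are made up for illustration, not taken from the TRM, and `vector_cycles` is a hypothetical helper. It counts how many vector operations are needed for a run of elements, charging a full operation even when the last vector is only partly used.

```c
#include <assert.h>

/* Cycles to process n_elems elements using vectors of `lanes` elements,
   where each vector operation costs `cycles_per_op` cycles.
   cycles_per_op is an assumed input, not a measured or documented value. */
unsigned vector_cycles(unsigned n_elems, unsigned lanes,
                       unsigned cycles_per_op) {
    assert(lanes > 0);
    /* Round up: a partly filled vector still costs a whole operation. */
    unsigned ops = (n_elems + lanes - 1) / lanes;
    return ops * cycles_per_op;
}
```

    With 24 8-bit elements and an operation assumed to cost 1 cycle in its 64-bit form and 2 cycles in its 128-bit form, `vector_cycles(24, 8, 1)` is 3 while `vector_cycles(24, 16, 2)` is 4, because the second 128-bit operation is half empty. With 32 elements both come to 4, and if the 128-bit form is single-cycle, `vector_cycles(24, 16, 1)` is 2.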
  • Note: This was originally posted on 12th December 2011 at http://forums.arm.com


    This is NOT true; please read my post. For integer operations, 128-bit operation is more the rule than the exception, at least on Cortex-A8 and A9. If you're targeting these devices, it's important to know which operations do and don't operate on 128 bits in one cycle.


    Yes, you are right: many operations use all 128 bits at once on Cortex-A8/A9, so your detailed analysis is quite good. I was just giving the rough overview: even if the more complex 128-bit operations take 2 (or occasionally 3) cycles instead of 1, it is still a good idea to write NEON code for 128 bits, because future ARMv7 devices will soon have even more 128-bit single-cycle NEON paths.

    -Shervin.
  • Note: This was originally posted on 10th December 2011 at http://forums.arm.com

    It's true that NEON has 128-bit registers but nearly always operates on just 64 bits at a time. But the size of the registers is an Architecture (language) specification that can't change, whereas the internal use of 64 bits is an implementation issue that will change over time, so you can expect that, perhaps in 1 more year, ARM devices will operate on 128 bits at a time instead of 64 bits. So if you write your NEON code for 128 bits now, it will be more future proof, because the same code will potentially double in speed in the future!

    Cheers,
    Shervin Emami.
    http://www.shervinem...rmAssembly.html