A8/9 NEON 128bit registers, 64bit alu's
Sander Bogaert
over 12 years ago
Note: This was originally posted on 1st December 2011 at http://forums.arm.com
Hi,
I've been reading a lot about the NEON architecture for a project, but there is still one thing I'm not entirely sure about. If I understand correctly, the 128-bit view is more of an aid for the programmer: since the ALUs in the NEON engine are only 64 bits wide, instructions working on a Qx register will just take double the time. I was going over the timing tables, trying my best to understand them :-), and saw this confirmed in some instruction timings, but in others the operation on 128 bits also takes just one cycle (add, for example).
Did I read the table wrong? If not, how is this achieved? Is the operation split across two 64-bit adders that are coupled (via carry)?
Thanks in advance
Gilead Kutnick
over 12 years ago
Note: This was originally posted on 1st December 2011 at http://forums.arm.com
This is a basic rundown of what Cortex-A8/Cortex-A9's NEON implementation provides (note that this is slightly speculative, but pretty well supported by existing documentation):
- 2 64-bit simple integer ALUs, capable of add/sub/logic/shifts/compares/min/max/etc. Only one of them can do certain operations such as bit selects, variable shifts, and horizontal operations. And of course anything widening or narrowing isn't 128-bit to 128-bit. Note that the ALUs can do some operations on full 64-bit elements, like add/sub/shift.
- 1 64/128-bit permute unit. Some 128-bit to 64-bit operations like vmovn are one cycle, as are some 128-bit operations like reverse and swap, but for the most part it's one cycle per 64 bits, as with zip/unzip and ext. tbl is at least 2 cycles and 64-bit only.
- 8 8x16 integer multipliers w/accumulate. These can be chained to do 8 8x8 MACs, 4 16x16 MACs, or 1 32x32 MAC in a cycle (note that the last case works out to 2 32x32 MACs in 2 cycles because of the register arrangement)
- 1 128-bit load/store unit
- 2 single precision floating point multipliers and 2 single precision floating point add/sub/cmp/etc
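The multiplier counts in the list above are consistent with simple partial-product arithmetic. The following is a minimal Python sketch of that arithmetic only, assuming unsigned operands; the function names are hypothetical and it is not a description of how the actual hardware is wired. Each function returns the product together with the number of modelled 8x16 multipliers it used.

```python
def mul8x16(a8, b16):
    """Model one unsigned 8x16 multiplier. Returns (product, multipliers_used)."""
    assert 0 <= a8 < (1 << 8) and 0 <= b16 < (1 << 16)
    return a8 * b16, 1


def mul16x16(a16, b16):
    """Build a 16x16 product from two 8x16 partial products."""
    lo, n_lo = mul8x16(a16 & 0xFF, b16)   # low byte of a times b
    hi, n_hi = mul8x16(a16 >> 8, b16)     # high byte of a times b
    return (hi << 8) + lo, n_lo + n_hi


def mul32x32(a32, b32):
    """Build a 32x32 product from four 16x16 partial products."""
    a_lo, a_hi = a32 & 0xFFFF, a32 >> 16
    b_lo, b_hi = b32 & 0xFFFF, b32 >> 16
    ll, n0 = mul16x16(a_lo, b_lo)
    lh, n1 = mul16x16(a_lo, b_hi)
    hl, n2 = mul16x16(a_hi, b_lo)
    hh, n3 = mul16x16(a_hi, b_hi)
    product = (hh << 32) + ((lh + hl) << 16) + ll
    return product, n0 + n1 + n2 + n3
```

Under this model a 16x16 product needs 2 of the 8x16 units and a 32x32 product needs 8, which lines up with 8 multipliers giving 4 16x16 or 1 32x32 operation at a time.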
Aside from what's mentioned in literature and the TRM's timings I've confirmed most of this experimentally.
So a majority of simple integer operations (not counting multiplies) can be performed in 1 cycle, as can loads/stores and some permutes. I think that ARM wants to maintain NEON performance as being about double the throughput of the ARMv6 equivalent, where you have 2 32-bit ALUs (with some SIMD operations), 4 8x16 multipliers (although you can't do fully independent 16x16 MACs or anything 8x8 or 8x16) and 1 single/double precision FPU. On Cortex-A5, NEON only has one 64-bit ALU, which corresponds with the integer core only having one 32-bit ALU.
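On the question of whether the two 64-bit adders need to be coupled by a carry: NEON adds are lane-wise, and the element sizes (8, 16, 32 or 64 bits) never straddle the 64-bit boundary, so the two halves of a Q register can be added independently. Here is a small Python model of that point, with registers represented as lists of lanes; the function names are hypothetical and this only illustrates the arithmetic, not the pipeline.

```python
def lane_add(a, b, lane_bits):
    """Lane-wise add with wraparound; no carry passes between lanes."""
    mask = (1 << lane_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


def q_add_split(a, b, lane_bits):
    """Add two 128-bit registers as two independent 64-bit halves."""
    half = 64 // lane_bits  # number of lanes in one 64-bit half
    lo = lane_add(a[:half], b[:half], lane_bits)  # work for one 64-bit ALU
    hi = lane_add(a[half:], b[half:], lane_bits)  # work for the other ALU
    return lo + hi


# 16 lanes of 8 bits each; every lane overflows and wraps on its own.
a = [0xFF] * 16
b = [0x01] * 16
whole = lane_add(a, b, 8)
split = q_add_split(a, b, 8)
```

Both `whole` and `split` come out as sixteen zero lanes, since each lane wraps independently and nothing has to cross from the low half into the high half.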