Fastest s16 summation reduction of a q register

Note: This was originally posted on 26th April 2011 at http://forums.arm.com

Hi,

I've got a NEON q register filled with 8 signed 16-bit ints. I'd like to calculate the sum across all of them as quickly as possible. The result should ultimately be a 16-bit int (overflow will not occur due to external constraints). Here are a few possibilities:

(1) Add pairwise twice, then do a scalar addition:

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
int16_t sum = (int16_t)(vgetq_lane_s64(pairwiseAddedTwice, 0) + vgetq_lane_s64(pairwiseAddedTwice, 1));

(2) Add high and low d registers, then add pairwise twice:

int16x4_t addedDRegisters = vadd_s16(vget_low_s16(vec), vget_high_s16(vec));
int32x2_t pairwiseAddedOnce = vpaddl_s16(addedDRegisters);
int64x1_t pairwiseAddedTwice = vpaddl_s32(pairwiseAddedOnce);
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedTwice, 0);

(3) Add pairwise twice, then do an integer narrow, then another pairwise addition:

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
int32x2_t narrowed = vmovn_s64(pairwiseAddedTwice);
int64x1_t pairwiseAddedThrice = vpaddl_s32(narrowed);
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedThrice, 0);

There are many, many other ways to do this (pairwise once, integer narrow, pairwise twice; or pairwise once, add high and low d registers, pairwise again; or cut out to scalar addition at some earlier point, etc.).
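As an illustration of one of those other ways, here is a minimal sketch that never widens at all. It relies on the stated constraint that the sum fits in 16 bits: it folds the two d registers with vadd_s16, then applies the non-widening pairwise add vpadd_s16 twice. The names sum_s16x8 and sum8_s16 are hypothetical, and the scalar fallback is only there so the snippet also builds on machines without NEON. The sketch does not settle whether this is faster than options (1) to (3).

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: sum the eight s16 lanes without widening.
   Assumes the total fits in 16 bits, as stated in the question. */
static int16_t sum_s16x8(int16x8_t vec) {
    /* Fold the high d register onto the low one: 8 lanes -> 4 lanes. */
    int16x4_t folded = vadd_s16(vget_low_s16(vec), vget_high_s16(vec));
    /* Pairwise add: lanes 0 and 1 now hold f0+f1 and f2+f3. */
    folded = vpadd_s16(folded, folded);
    /* Pairwise add again: lane 0 now holds the full sum. */
    folded = vpadd_s16(folded, folded);
    return vget_lane_s16(folded, 0);
}
#endif

/* Hypothetical wrapper taking the eight values from memory. */
static int16_t sum8_s16(const int16_t v[8]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return sum_s16x8(vld1q_s16(v));
#else
    /* Scalar fallback for non-NEON builds (illustration only). */
    int16_t sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum = (int16_t)(sum + v[i]);
    }
    return sum;
#endif
}
```

This is four NEON operations plus the lane extract, the same count as option (2), but every intermediate stays in a d register of s16 lanes.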

Which one of the many possibilities is most efficient? And what's a good way for me to figure this out for myself next time (either a good way to measure effectively or how I ought to reason it through)?
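For the measuring half of the question, a rough timing loop might look like the sketch below. Everything in it is an assumption made for illustration: the sum_fn signature, the names time_reduction and g_sink, and the use of the standard clock() function, whose resolution is coarse. The scalar sum_scalar is only a placeholder for whichever NEON variant is under test.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical signature shared by every reduction under test. */
typedef int16_t (*sum_fn)(const int16_t v[8]);

/* Placeholder candidate: a plain scalar sum. A NEON variant with the
   same signature can be passed to time_reduction instead. */
static int16_t sum_scalar(const int16_t v[8]) {
    int16_t s = 0;
    for (int i = 0; i < 8; ++i) {
        s = (int16_t)(s + v[i]);
    }
    return s;
}

/* Volatile sink so the compiler cannot discard the computed sums. */
static volatile uint32_t g_sink;

/* Calls fn `iterations` times and returns the elapsed CPU time in seconds. */
static double time_reduction(sum_fn fn, long iterations) {
    int16_t v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    uint32_t acc = 0;  /* unsigned, so wraparound is well defined */
    clock_t start = clock();
    for (long i = 0; i < iterations; ++i) {
        /* Change one input per iteration so the call cannot be hoisted. */
        v[i & 7] = (int16_t)(i & 0xFF);
        acc += (uint32_t)fn(v);
    }
    clock_t end = clock();
    g_sink = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A call such as time_reduction(sum_scalar, 10000000L) for each candidate gives numbers to compare. It is still worth reading the generated assembly, because the compiler may inline or reschedule the candidates differently inside the loop.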

Thanks!

Josh

P.S. In case it matters, I'm writing in Xcode 4, for iOS, targeting armv7, using LLVM+gcc4.2.

Josh Bleecher Snyder over 12 years ago

Note: This was originally posted on 27th April 2011 at http://forums.arm.com

So I'll just give you some hints that may help you.
vpaddl.s16 q1, q0
vpaddl.s32 q0, q1
vadd.s32 d0, d0, d1
This code will take 6 cycles.

Most NEON instructions take only 1 cycle.
But NEON is pipelined, and most of the time you can't use the destination register of one instruction as a source of the next instruction without stalling.

The same example, doing 3 times the computation:
vpaddl.s16 q1, q0
vpaddl.s16 q3, q2
vpaddl.s16 q5, q4
vpaddl.s32 q0, q1
vpaddl.s32 q2, q3
vpaddl.s32 q4, q5
vadd.s32 d0, d0, d1
vadd.s32 d4, d4, d5
vadd.s32 d8, d8, d9

will take only 9 cycles!
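With intrinsics, the same interleaving could be written roughly as below. This is a sketch under assumptions, not a measured result: sum3_s16x8 and HAVE_NEON are hypothetical names, the final step uses a 64-bit add (vadd_s64) where the assembly uses vadd.s32 (the low 16 bits of the result are the same), and the compiler is free to reorder intrinsics, so the source order is only a hint to the scheduler. The scalar fallback exists only so the snippet builds without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: reduces three independent 8-lane s16 vectors,
   writing one 16-bit sum per vector into out[0..2]. */
static void sum3_s16x8(const int16_t a[8], const int16_t b[8],
                       const int16_t c[8], int16_t out[3]) {
#if HAVE_NEON
    int16x8_t va = vld1q_s16(a);
    int16x8_t vb = vld1q_s16(b);
    int16x8_t vc = vld1q_s16(c);

    /* Stage 1: three independent widening pairwise adds (s16 -> s32). */
    int32x4_t a1 = vpaddlq_s16(va);
    int32x4_t b1 = vpaddlq_s16(vb);
    int32x4_t c1 = vpaddlq_s16(vc);

    /* Stage 2: each input was produced three instructions earlier. */
    int64x2_t a2 = vpaddlq_s32(a1);
    int64x2_t b2 = vpaddlq_s32(b1);
    int64x2_t c2 = vpaddlq_s32(c1);

    /* Stage 3: add the low and high d registers of each result. */
    int64x1_t a3 = vadd_s64(vget_low_s64(a2), vget_high_s64(a2));
    int64x1_t b3 = vadd_s64(vget_low_s64(b2), vget_high_s64(b2));
    int64x1_t c3 = vadd_s64(vget_low_s64(c2), vget_high_s64(c2));

    out[0] = (int16_t)vget_lane_s64(a3, 0);
    out[1] = (int16_t)vget_lane_s64(b3, 0);
    out[2] = (int16_t)vget_lane_s64(c3, 0);
#else
    /* Scalar fallback for non-NEON builds (illustration only). */
    const int16_t *in[3] = {a, b, c};
    for (int k = 0; k < 3; ++k) {
        int32_t sum = 0;
        for (int i = 0; i < 8; ++i) {
            sum += in[k][i];
        }
        out[k] = (int16_t)sum;
    }
#endif
}
```

Checking the disassembly is the only way to confirm that the compiler kept the three dependency chains interleaved.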

I do not know the NEON intrinsics, but I'm quite sure that the performance of your code will depend on the quality of the compiler.

So it is not easy to answer your question.
The shortest one (in number of instructions) should be the best if the compiler is good.

Thanks! This is a good start.

I can also inline some assembly -- it looks like there's a pretty clear mapping between the intrinsics and the assembly instructions -- so it's helpful to know about the pipelining. I'll try to take that into account, and interleave some of the additions.