Fastest s16 summation reduction of a q register

Note: This was originally posted on 26th April 2011 at http://forums.arm.com

Hi,

I've got a NEON q register filled with 8 signed 16 bit ints. I'd like to calculate the sum across all of them as quickly as possible. The result should ultimately be a 16 bit int (overflow will not occur due to external constraints). Here are a few possibilities:

(1) Add pairwise twice, then do a scalar addition:

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
int16_t sum = (int16_t)(vgetq_lane_s64(pairwiseAddedTwice, 0) + vgetq_lane_s64(pairwiseAddedTwice, 1));


(2) Add high and low d registers, then add pairwise twice:

int16x4_t addedDRegisters = vadd_s16(vget_low_s16(vec), vget_high_s16(vec));
int32x2_t pairwiseAddedOnce = vpaddl_s16(addedDRegisters);
int64x1_t pairwiseAddedTwice = vpaddl_s32(pairwiseAddedOnce);
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedTwice, 0);


(3) Add pairwise twice, then do an integer narrow, then another pairwise addition:

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
int32x2_t narrowed = vmovn_s64(pairwiseAddedTwice);
int64x1_t pairwiseAddedThrice = vpaddl_s32(narrowed);
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedThrice, 0);


There are many, many other ways to do this (pairwise once, integer narrow, pairwise twice; or pairwise once, add high and low d registers, pairwise again; or cut out to scalar addition at some earlier point, etc.).
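
For example, a sketch of the "pairwise once, add high and low d registers, pairwise again" variant (same assumption as above, i.e. the total cannot overflow):

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);                         // 8 x s16 -> 4 x s32
int32x2_t addedDRegisters = vadd_s32(vget_low_s32(pairwiseAddedOnce),
                                     vget_high_s32(pairwiseAddedOnce)); // add high and low d registers
int64x1_t pairwiseAddedAgain = vpaddl_s32(addedDRegisters);             // 2 x s32 -> 1 x s64
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedAgain, 0);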

Which one of the many possibilities is most efficient? And what's a good way for me to figure this out for myself next time (either a good way to measure effectively or how I ought to reason it through)?

Thanks!

Josh

P.S. In case it matters, I'm writing in Xcode 4, for iOS, targeting armv7, using LLVM+gcc4.2.
  • Note: This was originally posted on 26th April 2011 at http://forums.arm.com

    Is your code inside a loop, or do you need to apply the algorithm only once?

    Can you explain what you are trying to do?
  • Note: This was originally posted on 26th April 2011 at http://forums.arm.com


    It's part of a larger loop; I'm writing a row convolution implementation. After a vector multiply, I need to sum all the resulting components.

    Actually, my summations are a little more complex than this. With different vectors I need to sum:

    • All the lanes.
    • The first lane, and the other seven lanes.
    • The first two lanes, and the other six lanes.
    •   ...
    • The first seven lanes, and the last lane.
    For concreteness, for the second of those cases (the first lane, and the other seven lanes), my current summation method is:



    int16_t left_sum_1 = vgetq_lane_s16(multipliedVector_1, 0);   // sum of the first lane
    int32x4_t added32x4_1 = vpaddlq_s16(multipliedVector_1);      // 8 x s16 -> 4 x s32
    int64x2_t added64x2_1 = vpaddlq_s32(added32x4_1);             // 4 x s32 -> 2 x s64
    int64_t right_sum_1 = vgetq_lane_s64(added64x2_1, 0) + vgetq_lane_s64(added64x2_1, 1) - left_sum_1;   // total minus the first lane = sum of the other seven




    Anyway, I figured that if I knew how to handle the first case efficiently (sum all), then I would have a good handle on how to approach the remaining cases as well, since I'd know the relative advantages of vector adds, pairwise adds, when to switch to lane extraction and scalars, etc. Sorry about my ignorance -- I just picked up intrinsics this weekend, and have no assembly background, so this is all new to me (but very exciting!).

    If it'd be helpful, I'd also be happy to post my entire method, although be warned that it's longish, and not as clear as might be hoped. I'm always game for tips anywhere, I just didn't want to overwhelm by posting long, long gobs of code. :)
  • Note: This was originally posted on 26th April 2011 at http://forums.arm.com

    Well.

    It's a little bit too complex for me.

    So I'll just give you some hints that may help you.

    vpaddl.s16  q1, q0
    vpaddl.s32  q0, q1
    vadd.s32    d0, d0, d1

    This code will take 6 cycles.

    Most NEON instructions take only 1 cycle.
    But NEON is pipelined, and most of the time you can't use a destination register as the source of the next instruction without stalling.

    The same example, doing the computation 3 times:

    vpaddl.s16  q1, q0
    vpaddl.s16  q3, q2
    vpaddl.s16  q5, q4
    vpaddl.s32  q0, q1
    vpaddl.s32  q2, q3
    vpaddl.s32  q4, q5
    vadd.s32    d0, d0, d1
    vadd.s32    d4, d4, d5
    vadd.s32    d8, d8, d9


    will take only 9 cycles!

    I do not know the NEON intrinsics, but I'm quite sure that the performance of your code will depend on the quality of the compiler.

    So it is not easy to answer your question.
    The shortest one (in number of instructions) should be the best if the compiler is good.
  • Note: This was originally posted on 27th April 2011 at http://forums.arm.com


    Thanks! This is a good start. :)

    I can also inline some assembly -- it looks like there's a pretty clear mapping between the intrinsics and the assembly instructions -- so it's helpful to know about the pipelining. I'll try to take that into account, and interleave some of the additions.
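
    For instance, if I've read the mapping correctly, the three-instruction sequence above would correspond roughly to the following intrinsics (just a sketch on my end; the compiler does the actual scheduling):

    int32x4_t sum32 = vpaddlq_s16(vec);                    // vpaddl.s16 q1, q0
    int64x2_t sum64 = vpaddlq_s32(sum32);                  // vpaddl.s32 q0, q1
    int32x4_t as32  = vreinterpretq_s32_s64(sum64);
    int32x2_t total = vadd_s32(vget_low_s32(as32),
                               vget_high_s32(as32));       // vadd.s32 d0, d0, d1
    int16_t sum = (int16_t)vget_lane_s32(total, 0);        // lane 0 holds the full sum, since overflow can't occur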
  • Note: This was originally posted on 27th April 2011 at http://forums.arm.com

    You can read this post:

    http://hilbert-space.de/?p=22

    It shows NEON intrinsic performance (it's quite old; maybe compilers are better now).

    So inline assembler should be better.
    A separate assembler file would be clearer still!

    Etienne
  • Note: This was originally posted on 28th April 2011 at http://forums.arm.com

    Interesting example!

    In your case, interleaving is not your main problem :)

    NEON is a SIMD extension. It is designed to handle large amounts of data.
    And, if possible (and generally it is), it's better if it can do that without involving the ARM core.

    All that to say:
    "Don't use the VMOV Rd, Dn instruction" (Rd is an ARM core register used as the destination and Dn a NEON register used as the source).

    The first optimization to make in your function is to give it a source pointer (kernel in your example) and a destination pointer (??? result),
    and then use VLDx.xx to load data and VSTx.xx to store data.
    All the work must be done without using an ARM register as a destination register.

    You can use an ARM register as a source if you want (but you probably won't need to).
    And you can of course use ARM registers as pointer registers if you need to (and you will need to do that).

    Just by doing that, your code will be:
    - faster
    - easier to optimize.
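
    For example, a very rough sketch of what I mean (r0 = a source pointer and r1 = a destination pointer, both placeholders; the multiply step of your convolution is left out, and the stall cycles discussed earlier still apply):

    vld1.16    {d0, d1}, [r0]!    @ load 8 x s16 through the source pointer
    vpaddl.s16 q1, q0             @ 8 x s16 -> 4 x s32
    vpaddl.s32 q0, q1             @ 4 x s32 -> 2 x s64
    vadd.s32   d0, d0, d1         @ add the two halves
    vst1.16    {d0[0]}, [r1]!     @ store the 16-bit result directly; no VMOV to an ARM register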

    To reply to your question:
    There are not enough instructions to avoid the stall cycles, but you must remove the VMOV instructions first.

    Etienne