Fastest s16 summation reduction of a q register

Note: This was originally posted on 26th April 2011 at http://forums.arm.com

Hi,

I've got a NEON q register filled with eight signed 16-bit ints. I'd like to calculate the sum across all of them as quickly as possible. The result should ultimately be a 16-bit int (overflow will not occur due to external constraints). Here are a few possibilities:

(1) Add pairwise twice, then do a scalar addition:

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
int16_t sum = (int16_t)(vgetq_lane_s64(pairwiseAddedTwice, 0) + vgetq_lane_s64(pairwiseAddedTwice, 1));


(2) Add high and low d registers, then add pairwise twice:

int16x4_t addedDRegisters = vadd_s16(vget_low_s16(vec), vget_high_s16(vec));
int32x2_t pairwiseAddedOnce = vpaddl_s16(addedDRegisters);
int64x1_t pairwiseAddedTwice = vpaddl_s32(pairwiseAddedOnce);
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedTwice, 0);


(3) Add pairwise twice, then do an integer narrow, then another pairwise addition:

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
int32x2_t narrowed = vmovn_s64(pairwiseAddedTwice);
int64x1_t pairwiseAddedThrice = vpaddl_s32(narrowed);
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedThrice, 0);


There are many, many other ways to do this (pairwise once, integer narrow, pairwise twice; or pairwise once, add high and low d registers, pairwise again; or cut out to scalar addition at some earlier point, etc.).
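One more of these variants, as a minimal sketch: because the final sum is assumed to fit in 16 bits, the non-widening pairwise add (`vpadd_s16`) can be used instead of `vpaddl`, so the data never leaves 16-bit lanes. The names `sum_s16x8` and `sum_s16x8_ref` are hypothetical, and the function takes a pointer rather than an `int16x8_t` only so that the same code also builds on a non-NEON host through the scalar fallback. Whether this is actually faster than options (1) to (3) has not been measured here.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Scalar reference. The assert documents the "no overflow" precondition
   stated in the post: the true sum must fit in an int16_t. */
int16_t sum_s16x8_ref(const int16_t v[8])
{
    int32_t acc = 0;
    for (int i = 0; i < 8; ++i) {
        acc += v[i];
    }
    assert(acc >= INT16_MIN && acc <= INT16_MAX);
    return (int16_t)acc;
}

/* Sum all eight lanes while staying 16 bits wide throughout. */
int16_t sum_s16x8(const int16_t v[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t vec = vld1q_s16(v);
    /* Fold the high d register onto the low one: 8 lanes -> 4 lanes. */
    int16x4_t folded = vadd_s16(vget_low_s16(vec), vget_high_s16(vec));
    /* Non-widening pairwise add: 4 partial sums -> 2 distinct partial sums. */
    int16x4_t paired = vpadd_s16(folded, folded);
    /* One more pairwise add: every lane now holds the full sum. */
    int16x4_t total = vpadd_s16(paired, paired);
    /* Intermediate wraparound is harmless as long as the final sum fits. */
    return vget_lane_s16(total, 0);
#else
    /* Portable fallback for hosts without NEON. */
    return sum_s16x8_ref(v);
#endif
}
```

Inside a real loop the vector would already be in a register, so the `vld1q_s16` load would disappear and only the three adds and the lane extraction would remain.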

Which one of the many possibilities is most efficient? And what's a good way for me to figure this out for myself next time (either a good way to measure effectively or how I ought to reason it through)?

Thanks!

Josh

P.S. In case it matters, I'm writing in Xcode 4, for iOS, targeting armv7, using LLVM+gcc4.2.
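On the "how do I measure this" part of the question, a rough timing harness might look like the sketch below. It assumes standard C `clock()` is precise enough once the call is repeated many times. `reduce_fn`, `reduce_scalar`, and `seconds_per_call` are hypothetical names, and `reduce_scalar` is only a stand-in for whichever candidate reduction is being timed.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Signature shared by every candidate reduction under test (hypothetical). */
typedef int16_t (*reduce_fn)(const int16_t v[8]);

/* Stand-in candidate: a plain scalar sum of the eight lanes. */
int16_t reduce_scalar(const int16_t v[8])
{
    int32_t acc = 0;
    for (int i = 0; i < 8; ++i) {
        acc += v[i];
    }
    return (int16_t)acc;
}

/* Calls fn(v) iters times and returns the average CPU seconds per call.
   The last result is written to *last so the caller can check correctness. */
double seconds_per_call(reduce_fn fn, const int16_t v[8], long iters, int16_t *last)
{
    assert(iters > 0);
    /* The volatile store keeps the compiler from discarding the results. */
    volatile int16_t sink = 0;
    clock_t start = clock();
    for (long i = 0; i < iters; ++i) {
        sink = fn(v);
    }
    clock_t stop = clock();
    *last = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC / (double)iters;
}
```

Numbers from a harness like this are only indicative: the function-pointer call adds overhead, and in the real convolution loop the reduction overlaps with surrounding work. Timing the whole loop with each variant swapped in, and comparing the generated assembly, is a useful cross-check.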
  • Note: This was originally posted on 26th April 2011 at http://forums.arm.com


    Is your code inside a loop, or do you need to apply the algorithm only once?

    Can you explain what you are trying to do?



    It's part of a larger loop; I'm writing a row convolution implementation. After a vector multiply, I need to sum all the resulting components.

    Actually, my summations are a little more complex than this. With different vectors I need to sum:

    • All the lanes.
    • The first lane, and the other seven lanes.
    • The first two lanes, and the other six lanes.
    •   ...
    • The first seven lanes, and the last lane.
    For concreteness, for the second case in the list above (the first lane, and the other seven lanes), my current summation method is:



    int16_t left_sum_1 = vgetq_lane_s16(multipliedVector_1, 0);
    int32x4_t added32x4_1 = vpaddlq_s16(multipliedVector_1);
    int64x2_t added64x2_1 = vpaddlq_s32(added32x4_1);
    int64_t right_sum_1 = vgetq_lane_s64(added64x2_1, 0) + vgetq_lane_s64(added64x2_1, 1) - left_sum_1;




    Anyway, I figured that if I knew how to handle the first case efficiently (sum all), then I would have a good handle on how to approach the remaining cases as well, since I'd know the relative advantages of vector adds, pairwise adds, when to switch to lane extraction and scalars, etc. Sorry about my ignorance -- I just picked up intrinsics this weekend, and have no assembly background, so this is all new to me (but very exciting!).

    If it'd be helpful, I'd also be happy to post my entire method, although be warned that it's longish, and not as clear as might be hoped. I'm always game for tips anywhere; I just didn't want to overwhelm by posting long, long gobs of code. :)
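The "first k lanes, and the other 8 - k lanes" cases listed above can all be handled with the same identity the posted snippet uses for k = 1: right sum = total sum - left sum. A minimal sketch follows, assuming the left lanes are selected with a lane mask. `split_sum` and `split_sum_s16x8` are hypothetical names, and the scalar branch exists only so the code also builds without NEON. It has not been benchmarked against the lane-extraction version in the post.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

typedef struct {
    int16_t left;   /* sum of lanes [0, k) */
    int16_t right;  /* sum of lanes [k, 8) */
} split_sum;

/* Sums the first k lanes and the remaining 8 - k lanes of an s16 vector. */
split_sum split_sum_s16x8(const int16_t v[8], int k)
{
    assert(k >= 0 && k <= 8);
    split_sum out;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    static const int16_t lane_index[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    int16x8_t vec = vld1q_s16(v);
    /* All-ones mask in every lane whose index is below k. */
    uint16x8_t keep = vcltq_s16(vld1q_s16(lane_index), vdupq_n_s16((int16_t)k));
    /* Zero out the lanes that belong to the right-hand group. */
    int16x8_t left_lanes = vandq_s16(vec, vreinterpretq_s16_u16(keep));
    /* Same widening pairwise reduction as in the post, applied twice. */
    int64x2_t l = vpaddlq_s32(vpaddlq_s16(left_lanes));
    int64x2_t t = vpaddlq_s32(vpaddlq_s16(vec));
    int64_t left = vgetq_lane_s64(l, 0) + vgetq_lane_s64(l, 1);
    int64_t total = vgetq_lane_s64(t, 0) + vgetq_lane_s64(t, 1);
#else
    /* Portable fallback for hosts without NEON. */
    int32_t left = 0;
    int32_t total = 0;
    for (int i = 0; i < 8; ++i) {
        total += v[i];
        if (i < k) {
            left += v[i];
        }
    }
#endif
    out.left = (int16_t)left;
    out.right = (int16_t)(total - left);  /* right = total - left */
    return out;
}
```

If k is fixed for a given vector, as it seems to be in the convolution described above, the mask is a constant that can be built once outside the loop.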