Fastest s16 summation reduction of a q register

Note: This was originally posted on 26th April 2011 at http://forums.arm.com

Hi,

I've got a NEON q register filled with 8 signed 16-bit ints. I'd like to calculate the sum across all of them as quickly as possible. The result should ultimately be a 16-bit int (overflow will not occur due to external constraints). Here are a few possibilities:

(1) Add pairwise twice, then do a scalar addition:

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
int16_t sum = (int16_t)(vgetq_lane_s64(pairwiseAddedTwice, 0) + vgetq_lane_s64(pairwiseAddedTwice, 1));


(2) Add high and low d registers, then add pairwise twice:

int16x4_t addedDRegisters = vadd_s16(vget_low_s16(vec), vget_high_s16(vec));
int32x2_t pairwiseAddedOnce = vpaddl_s16(addedDRegisters);
int64x1_t pairwiseAddedTwice = vpaddl_s32(pairwiseAddedOnce);
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedTwice, 0);


(3) Add pairwise twice, then do an integer narrow, then another pairwise addition:

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
int32x2_t narrowed = vmovn_s64(pairwiseAddedTwice);
int64x1_t pairwiseAddedThrice = vpaddl_s32(narrowed);
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedThrice, 0);


There are many, many other ways to do this (pairwise once, integer narrow, pairwise twice; or pairwise once, add high and low d registers, pairwise again; or cut out to scalar addition at some earlier point, etc.).
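
To make one of those alternatives concrete, here is a minimal sketch of "pairwise once, add high and low d registers, pairwise again". The function name sum_s16x8_alt, the load from a plain array, and the scalar fallback for builds without NEON are assumptions added for illustration; they are not part of the options above.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sums 8 signed 16-bit values.
   Assumes, as stated in the post, that the true sum fits in 16 bits. */
static int16_t sum_s16x8_alt(const int16_t v[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t vec = vld1q_s16(v);
    /* Pairwise once: 8 x s16 -> 4 x s32 */
    int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
    /* Add high and low d registers: 4 x s32 -> 2 x s32 */
    int32x2_t addedDRegisters = vadd_s32(vget_low_s32(pairwiseAddedOnce),
                                         vget_high_s32(pairwiseAddedOnce));
    /* Pairwise again: 2 x s32 -> 1 x s64 */
    int64x1_t pairwiseAddedTwice = vpaddl_s32(addedDRegisters);
    return (int16_t)vget_lane_s64(pairwiseAddedTwice, 0);
#else
    /* Scalar fallback so the sketch also builds without NEON. */
    int32_t acc = 0;
    for (int i = 0; i < 8; ++i) {
        acc += v[i];
    }
    return (int16_t)acc;
#endif
}
```

Whether this is faster than (1), (2) or (3) is exactly the open question; the sketch only shows that the variant is expressible with the same intrinsics.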

Which one of the many possibilities is most efficient? And what's a good way for me to figure this out for myself next time (either a good way to measure effectively or how I ought to reason it through)?

Thanks!

Josh

P.S. In case it matters, I'm writing in Xcode 4, for iOS, targeting armv7, using LLVM+gcc4.2.
  • Note: This was originally posted on 26th April 2011 at http://forums.arm.com

    Well.

    It's a little bit too complex for me.

    So I'll just give you some hints that might help you.

    vpaddlq.s16  q1, q0
    vpaddlq.s32  q0, q1
    vadd.s32   d0, d0, d1

    This code will take 6 cycles.

    Most NEON instructions take only 1 cycle.
    But NEON is pipelined, and most of the time you can't use the destination register of one instruction as a source of the very next instruction without waiting for the result.

    The same example, doing the computation three times:

    vpaddlq.s16  q1, q0
    vpaddlq.s16  q3, q2
    vpaddlq.s16  q5, q4
    vpaddlq.s32  q0, q1
    vpaddlq.s32  q2, q3
    vpaddlq.s32  q4, q5
    vadd.s32   d0, d0, d1
    vadd.s32   d4, d4, d5
    vadd.s32   d8, d8, d9


    This version will take only 9 cycles!
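
For readers working with intrinsics rather than assembly, the same interleaving idea can be sketched as follows. This is only an illustration: the function name sum3_s16x8, the array-based interface, and the scalar fallback are assumptions, the final step uses a 64-bit add where the assembly above uses vadd.s32, and a compiler is free to reorder the instructions, so the source order shown here is a hint at best.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: reduces three independent 8 x s16 vectors,
   writing one 16-bit sum per vector into out[0..2]. */
static void sum3_s16x8(const int16_t a[8], const int16_t b[8],
                       const int16_t c[8], int16_t out[3])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t va = vld1q_s16(a);
    int16x8_t vb = vld1q_s16(b);
    int16x8_t vc = vld1q_s16(c);

    /* Stage 1 for all three vectors: no result is used by the next line. */
    int32x4_t a1 = vpaddlq_s16(va);
    int32x4_t b1 = vpaddlq_s16(vb);
    int32x4_t c1 = vpaddlq_s16(vc);

    /* Stage 2 for all three vectors. */
    int64x2_t a2 = vpaddlq_s32(a1);
    int64x2_t b2 = vpaddlq_s32(b1);
    int64x2_t c2 = vpaddlq_s32(c1);

    /* Stage 3: add the low and high d registers of each result. */
    int64x1_t a3 = vadd_s64(vget_low_s64(a2), vget_high_s64(a2));
    int64x1_t b3 = vadd_s64(vget_low_s64(b2), vget_high_s64(b2));
    int64x1_t c3 = vadd_s64(vget_low_s64(c2), vget_high_s64(c2));

    out[0] = (int16_t)vget_lane_s64(a3, 0);
    out[1] = (int16_t)vget_lane_s64(b3, 0);
    out[2] = (int16_t)vget_lane_s64(c3, 0);
#else
    /* Scalar fallback so the sketch also builds without NEON. */
    const int16_t *in[3] = { a, b, c };
    for (int k = 0; k < 3; ++k) {
        int32_t acc = 0;
        for (int i = 0; i < 8; ++i) {
            acc += in[k][i];
        }
        out[k] = (int16_t)acc;
    }
#endif
}
```

Checking the generated assembly is the only way to know whether the compiler kept the independent operations apart.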

    I don't know the NEON intrinsics, but I'm quite sure that the performance of your code will depend on the quality of the compiler.

    So it's not easy to answer your question.
    The shortest one (in number of instructions) should be the best if the compiler is good.
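
Since the result depends on the compiler, measuring each candidate is one way to settle it. Below is a rough timing sketch in portable C, not a rigorous benchmark. The names reduce_fn, sum_scalar, g_sink and seconds_per_call are all hypothetical, and the scalar reduction merely stands in for whichever intrinsic variant is under test.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical signature shared by every reduction variant under test. */
typedef int16_t (*reduce_fn)(const int16_t v[8]);

/* Scalar reference, useful both as a baseline and as a correctness check. */
static int16_t sum_scalar(const int16_t v[8])
{
    int32_t acc = 0;
    for (int i = 0; i < 8; ++i) {
        acc += v[i];
    }
    return (int16_t)acc;
}

/* Volatile sink so the stores of the result cannot be removed. */
static volatile int16_t g_sink;

/* Returns the average CPU time per call in seconds, measured with clock(). */
static double seconds_per_call(reduce_fn fn, const int16_t v[8], long iterations)
{
    clock_t start = clock();
    for (long i = 0; i < iterations; ++i) {
        g_sink = fn(v);
    }
    clock_t stop = clock();
    return (double)(stop - start) / (double)CLOCKS_PER_SEC / (double)iterations;
}
```

A call such as seconds_per_call(sum_scalar, data, 10000000L) gives a number to compare between variants. An optimizer may still inline the call and hoist the computation out of the loop, so it is worth inspecting the disassembly before trusting the figures.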