Fastest s16 summation reduction of a q register

Note: This was originally posted on 26th April 2011 at http://forums.arm.com

Hi,

I've got a NEON q register filled with 8 signed 16 bit ints. I'd like to calculate the sum across all of them as quickly as possible. The result should ultimately be a 16 bit int (overflow will not occur due to external constraints). Here are a few possibilities:

(1) Add pairwise twice, then do a scalar addition:

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
int16_t sum = (int16_t)(vgetq_lane_s64(pairwiseAddedTwice, 0) + vgetq_lane_s64(pairwiseAddedTwice, 1));


(2) Add high and low d registers, then add pairwise twice:

int16x4_t addedDRegisters = vadd_s16(vget_low_s16(vec), vget_high_s16(vec));
int32x2_t pairwiseAddedOnce = vpaddl_s16(addedDRegisters);
int64x1_t pairwiseAddedTwice = vpaddl_s32(pairwiseAddedOnce);
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedTwice, 0);


(3) Add pairwise twice, then do an integer narrow, then another pairwise addition:

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
int32x2_t narrowed = vmovn_s64(pairwiseAddedTwice);
int64x1_t pairwiseAddedThrice = vpaddl_s32(narrowed);
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedThrice, 0);


There are many, many other ways to do this (pairwise once, integer narrow, pairwise twice; or pairwise once, add high and low d registers, pairwise again; or cut out to scalar addition at some earlier point, etc.).
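
For example, a sketch of the "pairwise once, add high and low d registers, pairwise again" variant (same assumption as above, i.e. the total cannot overflow):

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);                         // 8 x s16 -> 4 x s32
int32x2_t addedDRegisters = vadd_s32(vget_low_s32(pairwiseAddedOnce),
                                     vget_high_s32(pairwiseAddedOnce)); // add high and low d registers
int64x1_t pairwiseAddedAgain = vpaddl_s32(addedDRegisters);             // 2 x s32 -> 1 x s64
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedAgain, 0);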

Which one of the many possibilities is most efficient? And what's a good way for me to figure this out for myself next time (either a good way to measure effectively or how I ought to reason it through)?

Thanks!

Josh

P.S. In case it matters, I'm writing in Xcode 4, for iOS, targeting armv7, using LLVM+gcc4.2.
  • Note: This was originally posted on 26th April 2011 at http://forums.arm.com

    Is your code inside a loop, or do you need to apply the algorithm only once?

    Can you explain what you are trying to do?
  • Note: This was originally posted on 26th April 2011 at http://forums.arm.com


    It's part of a larger loop; I'm writing a row convolution implementation. After a vector multiply, I need to sum all the resulting components.

    Actually, my summations are a little more complex than this. With different vectors I need to sum:

    • All the lanes.
    • The first lane, and the other seven lanes.
    • The first two lanes, and the other six lanes.
    •   ...
    • The first seven lanes, and the last lane.
    For concreteness, for the second of those cases (the first lane, and the other seven lanes), my current summation method is:



    int16_t left_sum_1 = vgetq_lane_s16(multipliedVector_1, 0);   // sum of the first lane
    int32x4_t added32x4_1 = vpaddlq_s16(multipliedVector_1);      // 8 x s16 -> 4 x s32
    int64x2_t added64x2_1 = vpaddlq_s32(added32x4_1);             // 4 x s32 -> 2 x s64
    int64_t right_sum_1 = vgetq_lane_s64(added64x2_1, 0) + vgetq_lane_s64(added64x2_1, 1) - left_sum_1;   // total minus the first lane = sum of the other seven




    Anyway, I figured that if I knew how to handle the first case efficiently (sum all), then I would have a good handle on how to approach the remaining cases as well, since I'd know the relative advantages of vector adds, pairwise adds, when to switch to lane extraction and scalars, etc. Sorry about my ignorance -- I just picked up intrinsics this weekend, and have no assembly background, so this is all new to me (but very exciting!).

    If it'd be helpful, I'd also be happy to post my entire method, although be warned that it's longish, and not as clear as might be hoped. I'm always game for tips anywhere, I just didn't want to overwhelm by posting long, long gobs of code. :)
  • Note: This was originally posted on 26th April 2011 at http://forums.arm.com

    Well.

    It's a little bit too complex for me.

    So I'll just give you some hints that may help you.

    vpaddl.s16  q1, q0
    vpaddl.s32  q0, q1
    vadd.s32    d0, d0, d1

    This code will take 6 cycles.

    Most NEON instructions take only 1 cycle.
    But NEON is pipelined, and most of the time you can't use a destination register as the source of the next instruction without stalling.

    The same example, doing the computation 3 times:

    vpaddl.s16  q1, q0
    vpaddl.s16  q3, q2
    vpaddl.s16  q5, q4
    vpaddl.s32  q0, q1
    vpaddl.s32  q2, q3
    vpaddl.s32  q4, q5
    vadd.s32    d0, d0, d1
    vadd.s32    d4, d4, d5
    vadd.s32    d8, d8, d9


    will take only 9 cycles!

    I do not know the NEON intrinsics, but I'm quite sure that the performance of your code will depend on the quality of the compiler.

    So it is not easy to answer your question.
    The shortest one (in number of instructions) should be the best if the compiler is good.
  • Note: This was originally posted on 27th April 2011 at http://forums.arm.com


    Thanks! This is a good start. :)

    I can also inline some assembly -- it looks like there's a pretty clear mapping between the intrinsics and the assembly instructions -- so it's helpful to know about the pipelining. I'll try to take that into account, and interleave some of the additions.
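
    For instance, if I've read the mapping correctly, the three-instruction sequence above would correspond roughly to the following intrinsics (just a sketch on my end; the compiler does the actual scheduling):

    int32x4_t sum32 = vpaddlq_s16(vec);                    // vpaddl.s16 q1, q0
    int64x2_t sum64 = vpaddlq_s32(sum32);                  // vpaddl.s32 q0, q1
    int32x4_t as32  = vreinterpretq_s32_s64(sum64);
    int32x2_t total = vadd_s32(vget_low_s32(as32),
                               vget_high_s32(as32));       // vadd.s32 d0, d0, d1
    int16_t sum = (int16_t)vget_lane_s32(total, 0);        // lane 0 holds the full sum, since overflow can't occur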
  • Note: This was originally posted on 27th April 2011 at http://forums.arm.com

    You can read this post:

    http://hilbert-space.de/?p=22

    It shows NEON intrinsic performance (it's quite old; maybe compilers are better now).

    So inline assembler should be better.
    A separate assembler file would be clearer still!

    Etienne
  • Note: This was originally posted on 28th April 2011 at http://forums.arm.com

    Interesting example!

    In your case, interleaving is not your main problem :)

    NEON is a SIMD extension. It is designed to handle large amounts of data.
    And, if possible (and generally it is), it's better if it can do that without involving the ARM core.

    All that to say:
    "Don't use the VMOV Rd, Dn instruction" (Rd is an ARM core register used as the destination and Dn a NEON register used as the source).

    The first optimization to make in your function is to give it a source pointer (kernel in your example) and a destination pointer (??? result),
    and then use VLDx.xx to load data and VSTx.xx to store data.
    All the work must be done without using an ARM register as a destination register.

    You can use an ARM register as a source if you want (but you probably won't need to).
    And you can of course use ARM registers as pointer registers if you need to (and you will need to do that).

    Just by doing that, your code will be:
    - faster
    - easier to optimize.
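
    For example, a very rough sketch of what I mean (r0 = a source pointer and r1 = a destination pointer, both placeholders; the multiply step of your convolution is left out, and the stall cycles discussed earlier still apply):

    vld1.16    {d0, d1}, [r0]!    @ load 8 x s16 through the source pointer
    vpaddl.s16 q1, q0             @ 8 x s16 -> 4 x s32
    vpaddl.s32 q0, q1             @ 4 x s32 -> 2 x s64
    vadd.s32   d0, d0, d1         @ add the two halves
    vst1.16    {d0[0]}, [r1]!     @ store the 16-bit result directly; no VMOV to an ARM register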

    To reply to your question:
    There are not enough instructions to avoid the stall cycles, but you must remove the VMOV instructions first.

    Etienne