Fastest s16 summation reduction of a q register

Note: This was originally posted on 26th April 2011 at http://forums.arm.com

Hi,

I've got a NEON q register filled with 8 signed 16-bit ints. I'd like to calculate the sum across all of them as quickly as possible. The result should ultimately be a 16-bit int (overflow will not occur due to external constraints). Here are a few possibilities:

(1) Add pairwise twice, then do a scalar addition:

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
int16_t sum = (int16_t)(vgetq_lane_s64(pairwiseAddedTwice, 0) + vgetq_lane_s64(pairwiseAddedTwice, 1));


(2) Add high and low d registers, then add pairwise twice:

int16x4_t addedDRegisters = vadd_s16(vget_low_s16(vec), vget_high_s16(vec));
int32x2_t pairwiseAddedOnce = vpaddl_s16(addedDRegisters);
int64x1_t pairwiseAddedTwice = vpaddl_s32(pairwiseAddedOnce);
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedTwice, 0);


(3) Add pairwise twice, then do an integer narrow, then another pairwise addition:

int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
int32x2_t narrowed = vmovn_s64(pairwiseAddedTwice);
int64x1_t pairwiseAddedThrice = vpaddl_s32(narrowed);
int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedThrice, 0);


There are many, many other ways to do this (pairwise once, integer narrow, pairwise twice; or pairwise once, add high and low d registers, pairwise again; or cut out to scalar addition at some earlier point, etc.).
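
One of those other routes, sketched here as an illustration rather than a measured recommendation: because overflow is ruled out, the widening steps can be dropped and the whole reduction kept in 16-bit lanes, using vadd_s16 followed by two vpadd_s16. The function name sum_s16x8 and the scalar fallback (included only so the snippet also builds on hosts without NEON) are assumptions, not part of the original post.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: sum eight int16 values without widening.
   Assumes the true sum fits in 16 bits, as stated in the post. */
int16_t sum_s16x8(const int16_t *src)
{
    int16x8_t vec = vld1q_s16(src);
    /* 8 lanes -> 4 lanes: add the high d register to the low one. */
    int16x4_t s = vadd_s16(vget_low_s16(vec), vget_high_s16(vec));
    /* 4 -> 2: pairwise add; passing s twice duplicates the two partial sums. */
    s = vpadd_s16(s, s);
    /* 2 -> 1: lane 0 now holds the full sum. */
    s = vpadd_s16(s, s);
    return vget_lane_s16(s, 0);
}
#else
/* Scalar fallback with the same result, for non-NEON builds. */
int16_t sum_s16x8(const int16_t *src)
{
    int sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += src[i];
    return (int16_t)sum;
}
#endif
```

Whether this beats options (1) to (3) depends on the core and on what the compiler actually emits, so it still needs to be measured.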

Which one of the many possibilities is most efficient? And what's a good way for me to figure this out for myself next time (either a good way to measure effectively or how I ought to reason it through)?

Thanks!

Josh

P.S. In case it matters, I'm writing in Xcode 4, for iOS, targeting armv7, using LLVM+gcc4.2.
  • Note: This was originally posted on 28th April 2011 at http://forums.arm.com

    Interesting example!

    In your case, interleaving is not your main problem :)

    NEON is a SIMD extension. It is designed to process large amounts of data.
    And where possible (and it generally is), it's better if it can do that without involving the ARM core.

    All of that is to say:
    "Don't use the VMOV Rd, Dn instruction" (Rd is an ARM core register used as the destination and Dn is a NEON register used as the source).

    The first optimization to make in your function is to pass it a source pointer (kernel in your example) and a destination pointer (??? result).
    Then use VLDx.xx to load data and VSTx.xx to store data.
    All the work must be done without using an ARM register as a destination register.

    You can use an ARM register as a source if you want (but you probably won't have to).
    And you can of course use an ARM register as a pointer register if you need to (and you will need to).

    Just by doing that your code will be:
    - faster
    - easier to optimize.
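
    A minimal sketch of that pointer-in, pointer-out shape, applied to the 8-lane sum from the question. The function name sum_rows_s16, its parameters, and the layout (rows of eight int16 values, one 16-bit result per row) are hypothetical, chosen only for illustration. The result lane is written with vst1_lane_s16 rather than read out with vget_lane, with the intent that the value goes from the NEON register to memory directly. Whether the compiler really avoids a VMOV to an ARM register should be checked in the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: for each row of 8 int16 values in src,
   store the 16-bit sum of that row into dst[row].
   Assumes each sum fits in 16 bits. */
void sum_rows_s16(const int16_t *src, int16_t *dst, size_t rows)
{
    for (size_t i = 0; i < rows; ++i) {
        int16x8_t vec = vld1q_s16(src + 8 * i);   /* load from the source pointer */
        int16x4_t s = vadd_s16(vget_low_s16(vec), vget_high_s16(vec));
        s = vpadd_s16(s, s);
        s = vpadd_s16(s, s);                      /* lane 0 holds the row sum */
        vst1_lane_s16(dst + i, s, 0);             /* store lane 0 to the destination pointer */
    }
}
#else
/* Scalar fallback with the same behaviour, for non-NEON builds. */
void sum_rows_s16(const int16_t *src, int16_t *dst, size_t rows)
{
    for (size_t i = 0; i < rows; ++i) {
        int sum = 0;
        for (int j = 0; j < 8; ++j)
            sum += src[8 * i + j];
        dst[i] = (int16_t)sum;
    }
}
#endif
```

    Processing several rows per call also gives the scheduler more independent instructions to interleave, which is relevant to the stall cycles mentioned below.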

    To answer your question:
    There are not enough instructions to avoid stall cycles, but you must remove the VMOV instruction first.

    Etienne