    int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
    int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
    int16_t sum = (int16_t)(vgetq_lane_s64(pairwiseAddedTwice, 0) + vgetq_lane_s64(pairwiseAddedTwice, 1));

    int16x4_t addedDRegisters = vadd_s16(vget_low_s16(vec), vget_high_s16(vec));
    int32x2_t pairwiseAddedOnce = vpaddl_s16(addedDRegisters);
    int64x1_t pairwiseAddedTwice = vpaddl_s32(pairwiseAddedOnce);
    int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedTwice, 0);

    int32x4_t pairwiseAddedOnce = vpaddlq_s16(vec);
    int64x2_t pairwiseAddedTwice = vpaddlq_s32(pairwiseAddedOnce);
    int32x2_t narrowed = vmovn_s64(pairwiseAddedTwice);
    int64x1_t pairwiseAddedThrice = vpaddl_s32(narrowed);
    int16_t sum = (int16_t)vget_lane_s64(pairwiseAddedThrice, 0);
So I'll just give you some hints that may help you.

    vpaddlq.s16 q1, q0
    vpaddlq.s32 q0, q1
    vadd.s32    d0, d0, d1

This code will take 6 cycles. Most NEON instructions take only 1 cycle, but NEON is pipelined, and most of the time you can't use the destination register of one instruction as a source of the next one without waiting for the result.

The same example, doing 3 times the computation:

    vpaddlq.s16 q1, q0
    vpaddlq.s16 q3, q2
    vpaddlq.s16 q5, q4
    vpaddlq.s32 q0, q1
    vpaddlq.s32 q2, q3
    vpaddlq.s32 q4, q5
    vadd.s32    d0, d0, d1
    vadd.s32    d4, d4, d5
    vadd.s32    d8, d8, d9

will take only 9 cycles!

I don't know the NEON intrinsics, but I'm quite sure that the performance of your code will depend on the quality of the compiler, so your question is not easy to answer. The shortest version (in number of instructions) should be the best one if the compiler is good.
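For what it's worth, here is a rough sketch of how the interleaved three-vector version might look with intrinsics. The helper name `sum3_s16x8` and its pointer-based signature are made up for illustration, and it returns 32-bit sums rather than casting to `int16_t` as in the question. The last step uses `vadd_s64` on the two halves instead of the `vadd.s32` from the assembly; the low 32 bits of the result are the same. A scalar fallback is included so the code also builds on non-ARM targets. Whether the compiler actually schedules the instructions in this interleaved order is not guaranteed.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper (not from the original posts): sums the 8 lanes of
   each of three int16 vectors and writes the three sums to out[0..2]. */
static void sum3_s16x8(const int16_t *a, const int16_t *b, const int16_t *c,
                       int32_t out[3])
{
#if HAVE_NEON
    /* Stage 1: three independent widening pairwise adds (vpaddlq.s16). */
    int32x4_t a1 = vpaddlq_s16(vld1q_s16(a));
    int32x4_t b1 = vpaddlq_s16(vld1q_s16(b));
    int32x4_t c1 = vpaddlq_s16(vld1q_s16(c));

    /* Stage 2: each result is consumed only after two unrelated
       instructions have been issued (vpaddlq.s32). */
    int64x2_t a2 = vpaddlq_s32(a1);
    int64x2_t b2 = vpaddlq_s32(b1);
    int64x2_t c2 = vpaddlq_s32(c1);

    /* Stage 3: add the low and high D registers of each Q register. */
    int64x1_t a3 = vadd_s64(vget_low_s64(a2), vget_high_s64(a2));
    int64x1_t b3 = vadd_s64(vget_low_s64(b2), vget_high_s64(b2));
    int64x1_t c3 = vadd_s64(vget_low_s64(c2), vget_high_s64(c2));

    out[0] = (int32_t)vget_lane_s64(a3, 0);
    out[1] = (int32_t)vget_lane_s64(b3, 0);
    out[2] = (int32_t)vget_lane_s64(c3, 0);
#else
    /* Scalar fallback for targets without NEON: same results, no SIMD. */
    const int16_t *in[3] = { a, b, c };
    for (int v = 0; v < 3; ++v) {
        int32_t s = 0;
        for (int i = 0; i < 8; ++i)
            s += in[v][i];
        out[v] = s;
    }
#endif
}
```

Each pointer is assumed to reference at least 8 readable `int16_t` values. Looking at the generated assembly (for example with `-S`) is probably the only reliable way to see whether the compiler kept the three dependency chains interleaved.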