This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Optimising with NEON

Note: This was originally posted on 11th February 2010 at http://forums.arm.com

Hello guys, I am a bit new with ARM and NEON and was wondering if anyone could help me with advice on how to optimise a small piece of code by NEON.
The problem is that whatever I do I can't get the NEON code to execute faster than the standard C code.
I am running the code on Beagle board with the latest Angstrom distribution and building with the Code Sourcery Lite gcc.
My gcc command line options are -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ffast-math
the piece of code in question is the following:

int32_t weightsH[4];
int16_t weightsV[4];
int16_t *lut1, *lut2;

// start of loop

quantLevel1 = ...
quantLevel2 = ...
frac = ...
frac_1 = ...

pxlOut = 0;

lut1 = luts[quantLevel1];
lut2 = luts[quantLevel2];

w1 = lut1[0];
w2 = lut2[0];
weight = ((weightsH[0]>>16)*weightsV[0])>>13;
pxlOut += (weight*((w1*frac + w2*frac_1)>>14))>>14;

w1 = lut1[1];
w2 = lut2[1];
weight = ((weightsH[1]>>16)*weightsV[0])>>13;
pxlOut += (weight*((w1*frac + w2*frac_1)>>14))>>14;

w1 = lut1[2];
w2 = lut2[2];
weight = ((weightsH[0]>>16)*weightsV[1])>>13;
pxlOut += (weight*((w1*frac + w2*frac_1)>>14))>>14;

w1 = lut1[3];
w2 = lut2[3];
weight = ((weightsH[1]>>16)*weightsV[1])>>13;
pxlOut += (weight*((w1*frac + w2*frac_1)>>14))>>14;

weightsH[0] -= slope_x;
weightsH[1] += slope_x;

// end of loop

the NEON code with the best results that I managed to obtain is the following, yet it's still slower than the above, it's just killing me:

int16x4_t wVs, lt1, lt2, fr1, w;
int32x4_t wHs,s;
int32x4_t buf;
int32x2_t buf2;

slopes[0] = -slope_x;
slopes[1] = slope_x;
slopes[2] = -slope_x;
slopes[3] = slope_x;

wHs = vld1q_s32(weightsH);
s = vld1q_s32(slopes);
wVs = vld1_s16(weightsV);

// start of loop

quantLevel1 = ...
quantLevel2 = ...
frac = ...
frac_1 = ...

lut1 = luts[quantLevel1];
lut2 = luts[quantLevel2];

buf = vdupq_n_s32(0);
lt1 = vld1_s16(lut1);
lt2 = vld1_s16(lut2);
buf = vmlal_n_s16(buf, lt1, (int16_t)frac);
buf = vmlal_n_s16(buf, lt2, (int16_t)frac_1);
fr1 = vshrn_n_s32(buf, 14);
w = vshrn_n_s32(wHs, 16);
buf = vmull_s16(w, wVs);
w = vshrn_n_s32(buf, 13);
buf = vmull_s16(w, fr1);
fr1 = vshrn_n_s32(buf, 14);
buf2 = vpaddl_s16(fr1);
pxlOut = vget_lane_s32(buf2, 0) + vget_lane_s32(buf2, 1);

// end of loop

Thank you in advance,
Roumen