Arm Community
Site
Search
User
Site
Search
User
Support forums
Arm Development Studio forum
Optimising with NEON
Locked
Locked
Replies
2 replies
Subscribers
119 subscribers
Views
3563 views
Users
0 members are here
Options
Share
More actions
Related
How was your experience today?
This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion
Optimising with NEON
Roumen Popov
over 12 years ago
Note: This was originally posted on 11th February 2010 at
http://forums.arm.com
Hello guys, I am a bit new with ARM and NEON and was wondering if anyone could help me with advice on how to optimise a small piece of code by NEON.
The problem is that whatever I do I can't get the NEON code to execute faster than the standard C code.
I am running the code on Beagle board with the latest Angstrom distribution and building with the Code Sourcery Lite gcc.
My gcc command line options are -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ffast-math
the piece of code in question is the following:
int32_t weightsH[4];
int16_t weightsV[4];
int16_t *lut1, *lut2;
// start of loop
quantLevel1 = ...
quantLevel2 = ...
frac = ...
frac_1 = ...
pxlOut = 0;
lut1 = luts[quantLevel1];
lut2 = luts[quantLevel2];
w1 = lut1[0];
w2 = lut2[0];
weight = ((weightsH[0]>>16)*weightsV[0])>>13;
pxlOut += (weight*((w1*frac + w2*frac_1)>>14))>>14;
w1 = lut1[1];
w2 = lut2[1];
weight = ((weightsH[1]>>16)*weightsV[0])>>13;
pxlOut += (weight*((w1*frac + w2*frac_1)>>14))>>14;
w1 = lut1[2];
w2 = lut2[2];
weight = ((weightsH[0]>>16)*weightsV[1])>>13;
pxlOut += (weight*((w1*frac + w2*frac_1)>>14))>>14;
w1 = lut1[3];
w2 = lut2[3];
weight = ((weightsH[1]>>16)*weightsV[1])>>13;
pxlOut += (weight*((w1*frac + w2*frac_1)>>14))>>14;
weightsH[0] -= slope_x;
weightsH[1] += slope_x;
// end of loop
the NEON code with the best results that I managed to obtain is the following, yet it's still slower than the above, it's just killing me:
int16x4_t wVs, lt1, lt2, fr1, w;
int32x4_t wHs,s;
int32x4_t buf;
int32x2_t buf2;
slopes[0] = -slope_x;
slopes[1] = slope_x;
slopes[2] = -slope_x;
slopes[3] = slope_x;
wHs = vld1q_s32(weightsH);
s = vld1q_s32(slopes);
wVs = vld1_s16(weightsV);
// start of loop
quantLevel1 = ...
quantLevel2 = ...
frac = ...
frac_1 = ...
lut1 = luts[quantLevel1];
lut2 = luts[quantLevel2];
buf = vdupq_n_s32(0);
lt1 = vld1_s16(lut1);
lt2 = vld1_s16(lut2);
buf = vmlal_n_s16(buf, lt1, (int16_t)frac);
buf = vmlal_n_s16(buf, lt2, (int16_t)frac_1);
fr1 = vshrn_n_s32(buf, 14);
w = vshrn_n_s32(wHs, 16);
buf = vmull_s16(w, wVs);
w = vshrn_n_s32(buf, 13);
buf = vmull_s16(w, fr1);
fr1 = vshrn_n_s32(buf, 14);
buf2 = vpaddl_s16(fr1);
pxlOut = vget_lane_s32(buf2, 0) + vget_lane_s32(buf2, 1);
// end of loop
Thank you in advance,
Roumen
0
Quote