Two development articles mention that GCC can do this:
Introducing NEON
NEON Support in Compilation Tools
But when I tested the code snippets from these documents with the GCC compiler options, the generated assembly code
did not use any NEON instructions.
I used these code snippets from the two documents:
void add_ints(int * __restrict pa, int * __restrict pb, unsigned int n, int x)
{
    unsigned int i;
    for(i = 0; i < (n & ~3); i++)
        pa[i] = pb[i] + x;
}
int accumulate(int * c, int len)
{
    int i, retval;
    for(i=0, retval = 0; i < (len & ~3) ; i++) {
        retval += c[i];
    }
    return retval;
}
For compilers, I have tested the FriendlyARM toolchain (GCC 4.5), GCC 4.6 and 4.8 for ARM from the Android NDK, and GCC 4.8 for another ARM CPU used in an IP camera.
For options, I used '-mfpu=neon -ftree-vectorize' with '-mfloat-abi=soft/softfp/hard'; the results were all the same. I also tested -O3, which implies -ftree-vectorize.
I also compiled the same code with GCC 4.4.6 for x86_64, which uses SSE when compiling with -O3 or -O4, but not with -ftree-vectorize at other optimization
levels; maybe other options need to be turned on together with -ftree-vectorize.
I tested again, and it works now. I don't know why I got a different result today; maybe gvim's reloading of a modified file is not reliable?
Now it works with -O1/-O2/-O3/-O4, but not with -O0/-Os.
A loop-count hint like (len & ~3) is not required: both 4.6 and 4.8, which I tested, generate code that checks whether len is greater than 16
or not.
However, the example code below can be vectorized with GCC 4.8 but not with 4.6:
#include <stdlib.h>   /* abs() */

int neon_line_abs_sum(const unsigned char* __attribute__ ((aligned (16))) src0,
                      const unsigned char* __attribute__ ((aligned (16))) src1,
                      int diff_threshold,
                      int line_width,
                      int line_step,
                      int line_num
                      )
{
    int x, y;
    int sum = 0;
    /* round the width up to a multiple of 16 */
    line_width = (line_width + 15) & ~15;
    for (y = 0; y < line_num; ++y)
    {
        for (x = 0; x < line_width; ++x)
        {
            int diff = abs(src0[y * line_step + x] - src1[y * line_step + x]);
            sum += (diff > diff_threshold) ? 1 : 0;
            // sum += diff;
        }
    }
    return sum;
}
If we use only sum += diff then it will be vectorized in GCC 4.6.
Glad you got it working =)
"Hint of loop number like (len & ~3) is not required,"
In library code where "len" cannot be determined at compile time, this helps the compiler avoid unneeded code. Normally it will still vectorize, but it will need to include additional code to handle any left-over elements. For example, if you pass in a list of length 17, it will run the vectorized part 4 times to handle 4x4 elements, and then needs some scalar code to handle the one left over. The mask is a means by which you can "promise" the compiler that the application guarantees the list length is a multiple of 4, so the vectorizer knows it doesn't need to generate code for the "left over" parts which are not vectorizable (because there can never be any left-overs).
HTH, Pete
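To make the "left over" handling concrete, here is a rough sketch in plain C of the shape of code a vectorizer has to produce. It is not real compiler output: the function names accumulate_with_tail and accumulate_multiple_of_4 are made up for illustration, and the lanes array only stands in for a 4 x 32-bit NEON register.

```c
/* Illustration only (hypothetical names, not GCC output). */

/* Shape of the code when len may be any value:
 * a 4-wide main loop followed by a scalar epilogue. */
int accumulate_with_tail(const int *c, int len)
{
    int lanes[4] = {0, 0, 0, 0};  /* stands in for one 4 x 32-bit vector register */
    int main_len = len & ~3;      /* largest multiple of 4 not above len */
    int i, retval;

    /* "Vector" part: four elements per iteration. */
    for (i = 0; i < main_len; i += 4) {
        lanes[0] += c[i];
        lanes[1] += c[i + 1];
        lanes[2] += c[i + 2];
        lanes[3] += c[i + 3];
    }
    /* Horizontal add of the four partial sums. */
    retval = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar epilogue: the 0 to 3 left-over elements. */
    for (; i < len; i++)
        retval += c[i];
    return retval;
}

/* With the (len & ~3) promise the trip count is always a multiple of 4,
 * so in principle no epilogue is needed. Left-over elements are ignored. */
int accumulate_multiple_of_4(const int *c, int len)
{
    int i, retval;
    for (i = 0, retval = 0; i < (len & ~3); i++)
        retval += c[i];
    return retval;
}
```

With a 17-element array, the first function runs the 4-wide loop four times and the scalar loop once; the second simply never looks at element 17.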
It should work as you said, but the actual compiled result is almost the same.
1) Without the hint of 'len & ~3'
int accumulate(int * __attribute__ ((aligned (16))) c, int len)
for(i=0, retval = 0; i < len; i++) {
Compiler output:
accumulate:
.L3:
.L5:
.L29:
.L13:
.L7:
.L2:
.L28:
.L14:
2) With the hint of 'len & ~3'
Compiler output:
The only difference is that 'len & ~3' is compiled into the instruction 'bic r1, r1, #3'; nothing else changes.
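For reference, 'bic r1, r1, #3' clears the two lowest bits of r1, which is the same as rounding the count down to a multiple of 4. A minimal C equivalent (round_down_to_4 is a name made up for this sketch):

```c
/* Same effect as `bic r1, r1, #3`: clear bits 0 and 1 of n.
 * Hypothetical helper name, for illustration only. */
unsigned int round_down_to_4(unsigned int n)
{
    return n & ~3u;
}
```

So a length of 17 becomes 16, and a length that is already a multiple of 4 is unchanged.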