Does GCC really support automatic vectorization for NEON technology?

Cyberman Wu over 11 years ago

Two development articles mention that GCC can do it:

Introducing NEON

NEON Support in Compilation Tools

But I tested the code snippets from these docs with the GCC compile options, and the generated assembly code doesn't use any NEON instructions.

0 Cyberman Wu over 11 years ago in reply to Peter Harris

I used the code snippets from these two docs:
void add_ints(int * __restrict pa, int * __restrict pb, unsigned int n, int x)
{
    unsigned int i;
    for(i = 0; i < (n & ~3); i++)
        pa[i] = pb[i] + x;
}
int accumulate(int * c, int len)
{
    int i, retval;
    for(i=0, retval = 0; i < (len & ~3) ; i++) {
        retval += c[i];
    }
    return retval;
}
For compilers, I've tested the FriendlyARM toolchain (GCC 4.5), GCC 4.6 and 4.8 for ARM from the Android NDK, and GCC 4.8 for another ARM CPU used in an IP camera.
For options, I've used '-mfpu=neon -ftree-vectorize' with '-mfloat-abi=soft/softfp/hard'; the results are all the same. I've also tested -O3, which implies -ftree-vectorize.
I've also compiled the same code using GCC 4.4.6 for x86_64, which uses SSE when compiling with -O3 or -O4, but not with -ftree-vectorize at other optimization levels; maybe other options need to be turned on together with -ftree-vectorize.
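One quick sanity check, offered here as a sketch and not something taken from the two articles, is to ask the preprocessor whether the chosen flags actually enable NEON code generation. GCC for ARM predefines `__ARM_NEON__` when NEON is enabled, and ACLE-conforming compilers define `__ARM_NEON`; the helper name `neon_enabled` is hypothetical.

```c
/* Hypothetical helper: returns 1 if this translation unit was compiled
 * with NEON code generation enabled, 0 otherwise. The check happens at
 * compile time through the predefined macros. */
int neon_enabled(void)
{
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    return 1;   /* compiler was told it may emit NEON instructions */
#else
    return 0;   /* NEON not enabled (or not an ARM target at all) */
#endif
}
```

If this returns 0 for a given set of flags, the vectorizer has no NEON instructions to choose from, regardless of -ftree-vectorize.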

0 Cyberman Wu over 11 years ago in reply to Cyberman Wu

I tested again, and it works now. I don't know why I got a different result today; maybe gvim's reloading of a modified file is not reliable?
Now it works with -O1/-O2/-O3/-O4, but not with -O0/-Os.
A loop-count hint like (len & ~3) is not required; both 4.6 and 4.8, which I tested, generate code that checks whether len is greater than 16 or not.
However, the example code below can be vectorized with GCC 4.8 but not with 4.6:
int neon_line_abs_sum(const unsigned char* __attribute__ ((aligned (16))) src0,
                      const unsigned char* __attribute__ ((aligned (16))) src1,
                      int diff_threshold,
                      int line_width,
                      int line_step,
                      int line_num
                      )
{
    int x, y;
    int sum = 0;
    line_width = (line_width + 15) & ~15;
    for (y = 0; y < line_num; ++y)
    {
        for (x = 0; x < line_width; ++x)
        {
            int diff = abs(src0[y * line_step + x] - src1[y * line_step + x]);
            sum += (diff > diff_threshold) ? 1 : 0;
            // sum += diff;
        }
    }
    return sum;
}
If we use only sum += diff, then it is vectorized by GCC 4.6 as well.

+1 Peter Harris over 11 years ago in reply to Cyberman Wu

Glad you got it working =)
"Hint of loop number like (len & ~3) is not required"
In library code where "len" cannot be determined at compile time, the hint helps the compiler avoid unneeded code. Normally it will still vectorize, but it will need to include additional code to handle any leftover elements. For example, if you pass in a list of length 17, it will run the vectorized part 4 times to handle 4x4 elements and then need some scalar code to handle the one left over. The mask is a means by which you can "promise" the compiler that the application guarantees the list length is a multiple of 4, so the vectorizer knows that it doesn't need to generate code for the "leftover" parts, which are not vectorizable (because there can never be any leftovers).
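To make the 17-element case concrete, here is a scalar C sketch of the shape described above: a main loop that consumes four elements per iteration (standing in for one 4 x int32 vector add) followed by a scalar tail. The function name `accumulate_blocked` and the four-element `lane` array are illustrative assumptions, not GCC's actual output.

```c
/* Illustrative only: mimics the structure of a 4-wide vectorized sum
 * in plain C, so the "vector part" and the "leftover part" are visible. */
int accumulate_blocked(const int *c, int len)
{
    int lane[4] = {0, 0, 0, 0};               /* stands in for one vector register */
    int main_len = len > 0 ? (len & ~3) : 0;  /* largest multiple of 4 <= len */
    int i, retval;

    /* "Vector" part: for len == 17 this runs 4 times (16 elements). */
    for (i = 0; i < main_len; i += 4) {
        lane[0] += c[i];
        lane[1] += c[i + 1];
        lane[2] += c[i + 2];
        lane[3] += c[i + 3];
    }

    /* Horizontal reduction of the four lanes into one scalar. */
    retval = lane[0] + lane[1] + lane[2] + lane[3];

    /* Scalar tail: handles the 0..3 leftover elements (1 for len == 17).
     * If len is guaranteed to be a multiple of 4, this loop never runs,
     * which is what the (len & ~3) mask lets the compiler assume. */
    for (; i < len; i++)
        retval += c[i];

    return retval;
}
```

With a 17-element array the blocked loop runs four times and the tail loop once; with a 16-element array the tail loop does nothing.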
HTH,
Pete

0 Cyberman Wu over 11 years ago in reply to Peter Harris

It should be as you said, but the actual compiled result is almost the same.

1) Without the hint of 'len & ~3'

int accumulate(int * __attribute__ ((aligned (16))) c, int len)
{
    int i, retval;
    for(i=0, retval = 0; i < len; i++) {
        retval += c[i];
    }
    return retval;
}

Compiler output:

	.align	2
	.global	accumulate
	.type	accumulate, %function

accumulate:

	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	cmp	r1, #0
	stmfd	sp!, {r4, r5, r6, r7}
	ble	.L14
	ands	r2, r0, #4
	mvnne	r2, #0
	and	r2, r2, #3
	cmp	r2, r1
	movcs	r2, r1
	cmp	r1, #6
	movls	r2, r1
	bhi	.L28

.L3:

	cmp	r2, #1
	ldr	r3, [r0]
	movls	ip, #1
	bls	.L5
	ldr	ip, [r0, #4]
	cmp	r2, #2
	add	r3, r3, ip
	movls	ip, #2
	bls	.L5
	ldr	ip, [r0, #8]
	cmp	r2, #3
	add	r3, r3, ip
	movls	ip, #3
	bls	.L5
	ldr	ip, [r0, #12]
	cmp	r2, #4
	add	r3, r3, ip
	movls	ip, #4
	bls	.L5
	cmp	r2, #5
	ldr	ip, [r0, #16]
	ldrhi	r4, [r0, #20]
	add	r3, r3, ip
	addhi	r3, r3, r4
	movls	ip, #5
	movhi	ip, #6

.L5:

	cmp	r1, r2
	beq	.L2
	rsb	r6, r2, r1
	mov	r5, r6, lsr #2
	movs	r7, r5, asl #2
	beq	.L7

.L29:

	add	r2, r0, r2, asl #2
	mov	r4, #0
	vmov.i32	q8, #0 @ v4si

.L13:

	add	r4, r4, #1
	vld1.64	{d18-d19}, [r2:64]!
	cmp	r5, r4
	vadd.i32	q8, q8, q9
	bhi	.L13
	vadd.i32	d16, d16, d17
	vmov.i32	q9, #0 @ v4si
	vpadd.i32	d18, d16, d16
	vmov.32	r2, d18[0]
	cmp	r6, r7
	add	ip, ip, r7
	add	r3, r3, r2
	beq	.L2

.L7:

	ldr	r4, [r0, ip, asl #2]
	add	r2, ip, #1
	cmp	r1, r2
	add	r3, r3, r4
	ble	.L2
	ldr	r2, [r0, r2, asl #2]
	add	ip, ip, #2
	cmp	r1, ip
	add	r3, r3, r2
	ldrgt	r2, [r0, ip, asl #2]
	addgt	r3, r3, r2

.L2:

	mov	r0, r3
	ldmfd	sp!, {r4, r5, r6, r7}
	bx	lr

.L28:

	cmp	r2, #0
	moveq	r3, r2
	moveq	ip, r2
	bne	.L3
	rsb	r6, r2, r1
	mov	r5, r6, lsr #2
	movs	r7, r5, asl #2
	bne	.L29
	b	.L7

.L14:

	mov	r3, #0
	b	.L2
	.size	accumulate, .-accumulate

2) With the hint of 'len & ~3'