
Problem understanding the behaviour of GCC (aarch64-none-elf-gcc) with Neon intrinsics on the Arm Cortex-A53

Hi,

I am using the Xilinx SDK 2019.1 IDE for my application and running it on an Arm Cortex-A53 processor, which has Neon and floating-point support. It is a bare-metal application.

The problem I am facing is that I am unable to understand the disassembly of the Neon intrinsic functions in my code at the highest optimization level, i.e. -O3.

The following code is just an example. My original code uses the same intrinsic functions, but I am not achieving any performance boost compared to my plain C code. In this example, the input is two floating-point arrays of 16 elements each; each 4-element chunk of array A is multiplied with the corresponding chunk of array B, and the result is stored in array C. All of the variables used are local.

The Neon intrinsics version of my code is:

#include <arm_neon.h>

// initialized arrays
float A[16] = {1.0, 2.0, 3.0, 4.0,
               1.0, 2.0, 3.0, 4.0,
               1.0, 2.0, 3.0, 4.0,
               1.0, 2.0, 3.0, 4.0};

float B[16] = {1.0, 2.0, 3.0, 4.0,
               1.0, 2.0, 3.0, 4.0,
               1.0, 2.0, 3.0, 4.0,
               1.0, 2.0, 3.0, 4.0};

float C[16];

// function definition
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

The disassembly of the above code at the -O3 optimization level is the following:

00000000000251e0 <multiply_4x4_neon>:
return a + b * c;
251e0: 4f000400 movi v0.4s, #0x0
return __builtin_aarch64_ld1v4sf ((const __builtin_aarch64_simd_sf *) a);
251e4: 3dc00001 ldr q1, [x0]
251e8: 3dc00022 ldr q2, [x1]
return a + b * c;
251ec: 4ea01c03 mov v3.16b, v0.16b
251f0: 4e21cc43 fmla v3.4s, v2.4s, v1.4s
__extension__ extern __inline void
__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
vst1q_f32 (float32_t *a, float32x4_t b)
{
__builtin_aarch64_st1v4sf ((__builtin_aarch64_simd_sf *) a, b);
251f4: 3d800043 str q3, [x2]
return a + b * c;
251f8: 4ea01c03 mov v3.16b, v0.16b
return __builtin_aarch64_ld1v4sf ((const __builtin_aarch64_simd_sf *) a);
251fc: 3dc00401 ldr q1, [x0, #16]
25200: 3dc00422 ldr q2, [x1, #16]
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Regarding the compiler settings in the IDE:

I am not passing any extra compiler options for optimization. I am unable to specify the -mfpu=neon option (the compiler does not recognize it), but from the disassembly it seems the code is running on the Neon engine, because I can see vector instructions. So could you please confirm whether the code is indeed running on the Neon engine?

I am also not telling the compiler to use hardware floating-point linkage. For example, if I pass -mfloat-abi=hard in the compiler settings, the compiler does not recognize it. So how can I tell the compiler to use hardware linkage?

I also could not understand why the body of the intrinsic function vst1q_f32(float32_t *a, float32x4_t b) appears in the middle of the assembly listing.

I know that at the highest optimization level the compiler reorders instructions to some extent.

Could someone please help me clear up these confusions, so that I can understand the disassembly and optimize the code further?
