Problem understanding the behaviour of the GCC compiler (aarch64-none-elf-gcc) with Neon intrinsics on the ARM Cortex-A53

Hi,

I am using the Xilinx SDK 2019.1 IDE for my application, which runs on an ARM Cortex-A53 processor with Neon and floating-point support available. It is a bare-metal application.

My problem is that I cannot understand the disassembly of the Neon intrinsic functions in my code at the highest optimization level, i.e. -O3.

The following code is just an example. My real code uses the same intrinsic functions, but it does not achieve any performance boost compared to my original plain C code. In this example, the input is two floating-point arrays, A and B, of 16 elements each. Each 4-element chunk of array A is multiplied with array B, and the result is stored in array C. All of the variables used are local.
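The original listing is not reproduced in this post. As a stand-in, here is a minimal sketch of the kind of kernel described, not the exact code. It assumes the multiplication is element-wise (each 4-element chunk of A times the matching chunk of B); the function name `mul_f32_chunks` and the scalar fallback for non-Neon targets are hypothetical additions for illustration.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the original code):
 * computes c[i] = a[i] * b[i], four floats per iteration.
 * Assumption: n is a multiple of 4 (16 in the example above). */
void mul_f32_chunks(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
#if defined(__ARM_NEON)
        /* Load one 4-element chunk from each input array. */
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        /* Multiply lane by lane and store the 4 results. */
        vst1q_f32(c + i, vmulq_f32(va, vb));
#else
        /* Scalar fallback so the sketch also builds without Neon. */
        for (size_t j = 0; j < 4; ++j)
            c[i + j] = a[i + j] * b[i + j];
#endif
    }
}
```

With 16-element local arrays, this would be called as `mul_f32_chunks(A, B, C, 16);`. Because the intrinsics are declared as inline functions in `arm_neon.h`, their names can show up as source annotations inside the caller's disassembly even though no real call is made.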

The disassembly of this code at the -O3 optimization level is the following:

The compiler settings in the IDE are:

I am not passing any additional compiler options for optimization. I cannot specify the -mfpu=neon option because the compiler does not recognize it. However, I can see vector instructions in the disassembly, so it seems to me that the code is running on the Neon engine. Could you please confirm whether the code is actually running on the Neon engine?
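One way to check this at build time, independent of reading the disassembly, is to test the feature macro the compiler defines. This is a minimal sketch, assuming the toolchain follows the Arm C Language Extensions (ACLE) and defines `__ARM_NEON` when Advanced SIMD is enabled for the target; the function name `neon_enabled_at_compile_time` is hypothetical.

```c
/* Hypothetical helper: returns 1 if this translation unit was compiled
 * with Advanced SIMD (Neon) enabled, 0 otherwise.
 * Assumption: the compiler defines the ACLE macro __ARM_NEON. */
int neon_enabled_at_compile_time(void)
{
#if defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}
```

If this returns 1 in the bare-metal build, the compiler is allowed to emit Neon instructions for that file. It says nothing about whether a particular loop was actually vectorized, so the disassembly still has to be checked for that.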

I am also not telling the compiler to use hard-float linkage. For example, if I add -mfloat-abi=hard to the compiler's optimization settings, the compiler does not recognize it. How can I tell the compiler to use hard-float linkage?

I also do not understand why the function body of the intrinsic vst1q_f32 (float32_t *a, float32x4_t b) appears to start in the middle of the assembly code.

I know that at the highest optimization level the compiler reorders instructions, so the disassembly does not follow the source order.

Could someone please help me with these points, so that I can understand the disassembly and optimize the code further?
