I am porting a project from x86 to ARM64 and I have been struggling with poor performance for some time. Recently I tested switching from GCC to LLVM. To my surprise, I got a massive performance boost. In some cases code execution is several times faster. I experimented with all sorts of optimization flags but I can't get GCC to generate fast enough code. I suspect that vectorization doesn't work. When I compile a random source code file with the --verbose flag, LLVM reports +neon while GCC doesn't report SIMD features. I tried on different ARM64 cores and operating systems and the result is the same.
Any suggestions on how to enable vectorization with GCC on ARM64?
I note you are using GCC 12. Can you try the release below? https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/downloads
Depending on your use case, Arm Compiler for Linux may be more appropriate: https://developer.arm.com/Tools%20and%20Software/Arm%20Compiler%20for%20Linux
I also stumbled upon the article below, which you may find useful: https://sofiangotrong.wordpress.com/2017/10/16/simd-vectorization-on-aarch64/
My colleagues recently posted this update on GCC 12: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/gcc-12
as well as this update on Arm Compiler for Linux: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/arm-compiler-for-linux-and-arm-performance-libraries-22-0
On Linux, Arm does not support platforms without SIMD. As such, SIMD is always enabled, which is why GCC does not report +simd: it is implicitly always on.
GCC enables some vectorization at -O2 and all of it at -O3, but your post contains too few details to tell what the problem is:
Could you post your full command-line flags? Does your project use floating-point math? Can you give an example of a loop that vectorizes with LLVM but not with GCC?
When using floating-point math, GCC and LLVM have different defaults. GCC defaults to honoring floating-point traps while LLVM does not; this means LLVM will vectorize more aggressively by default, whereas GCC needs -Ofast or -fno-trapping-math.
So we need some more details before we can give you an answer.
Thanks for the answers! I tried the ARM compiler too. This is what I get performance-wise:
Vanilla GCC/Gfortran 12.1 (built from source)
I am experiencing crashes when the code is built with LLVM/Flang and ARM Fortran. Is it safe to use -fsimdmath? Accuracy (and stability) are important in my case.
As I mentioned above, GCC does not vectorize floating-point code without you explicitly telling it which constraints it is allowed to relax to do so.
LLVM and LLVM-derived compilers by *default* do not honor traps, so they vectorize without needing `-fno-trapping-math`. GCC does honor them and requires you to relax that constraint explicitly, so you need to pass `-fno-trapping-math` to be equivalent to what LLVM does.
I tried with -O3 -ftree-vectorize -fno-trapping-math without much difference.
Then I'll need to see an example of what doesn't work to see what's going on.
Here is an example: https://github.com/LLNL/LULESH
Configure with:
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=/usr/local/gcc-12.1.0/bin/g++ -DWITH_MPI=Off -DWITH_OPENMP=On -DWITH_SILO=Off -DCMAKE_CXX_FLAGS="-ftree-vectorize -fno-trapping-math" ..
Set -DCMAKE_CXX_COMPILER to a C++ compiler of your choice and tweak -DCMAKE_CXX_FLAGS accordingly.
Then run lulesh2.0 (optionally with different sizes):
./lulesh2.0 -s 50 | grep Elapsed
I tested vanilla GCC 12.1.0, Arm GCC 11.2.0, vanilla Clang 14.0.4 and Arm Clang 22.0.2. On my system the LLVM-based compilers produced a binary about twice as fast as either GCC version.
Thanks for the reproducer.
The issue here isn't vectorization itself; the problem is that GCC doesn't specialize the OpenMP outlined function. The vectorized code Clang actually generates is quite small, but only after specialization.
This is likely an instance of one of these two bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102443 or https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103976
After IPA specialization the function should become profitable enough to vectorize.
Somehow this has to do with the combination of ARM64 and Linux. GCC performs better than Clang on Intel and AMD hardware. I tested it on an Apple M1 under macOS; GCC was significantly better.
Edit: ignore this; I tested single-threaded on the M1.