I am porting a project from x86 to ARM64 and I have been struggling with poor performance for some time. Recently I tested switching from GCC to LLVM. To my surprise, I got a massive performance boost. In some cases code execution is several times faster. I experimented with all sorts of optimization flags but I can't get GCC to generate fast enough code. I suspect that vectorization doesn't work. When I compile a random source code file with the --verbose flag, LLVM reports +neon while GCC doesn't report SIMD features. I tried on different ARM64 cores and operating systems and the result is the same.
Any suggestions on how to enable vectorization with GCC on ARM64?
System:
As I mentioned above, GCC does not vectorize floating-point code unless you explicitly tell it which constraints it is allowed to relax.
LLVM and LLVM-derived compilers do not honor floating-point traps by *default*, so they vectorize without needing `-fno-trapping-math`. GCC does honor them and requires you to tell it that it may relax this constraint. So you need to pass `-fno-trapping-math` to get behavior equivalent to LLVM's.
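To make the trapping-math point concrete, here is a minimal sketch of a loop where the setting can matter. The function name `guarded_divide` and the file name `kernel.cpp` are made up for illustration. To vectorize the loop without predication, the compiler has to perform the division for every lane and then select the result, which means speculatively executing a divide that could raise a floating-point exception. Whether this exact loop vectorizes depends on the compiler version and target, so treat it as an assumption to check rather than a guaranteed result.

```cpp
#include <cstddef>

// Hypothetical kernel: divide only where the divisor is positive.
// A vectorized version must compute a[i] / b[i] for all lanes and then
// select, so the division runs even for lanes where b[i] <= 0. Under
// trapping-math semantics that speculative divide is not allowed.
void guarded_divide(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = (b[i] > 0.0f) ? a[i] / b[i] : 0.0f;
    }
}
```

Compiling with `g++ -O3 -fopt-info-vec -c kernel.cpp`, with and without `-fno-trapping-math`, should show whether GCC reports the loop as vectorized. With Clang, `clang++ -O3 -Rpass=loop-vectorize -c kernel.cpp` gives the corresponding remark.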
I tried with -O3 -ftree-vectorize -fno-trapping-math without much difference.
Then I'll need to see an example of what doesn't work to see what's going on.
Here is an example: https://github.com/LLNL/LULESH
Configure with
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=/usr/local/gcc-12.1.0/bin/g++ -DWITH_MPI=Off -DWITH_OPENMP=On -DWITH_SILO=Off -DCMAKE_CXX_FLAGS="-ftree-vectorize -fno-trapping-math" ..
Set -DCMAKE_CXX_COMPILER to a CXX compiler of choice. Tweak -DCMAKE_CXX_FLAGS accordingly.
Then run lulesh2.0 (possibly with different sizes):
./lulesh2.0 -s 50 | grep Elapsed
I tested vanilla GCC 12.1.0, ARM GCC 11.2.0, vanilla Clang 14.0.4 and ARM Clang 22.0.2. On my system the LLVM-based compilers produced binaries about twice as fast as those from both GCC versions.
Thanks for the reproducer.
The issue here has nothing to do with vectorization. The problem is that GCC doesn't specialize the OpenMP outlined function. The actual Clang vectorized code is quite small, but only after specialization.
This is likely an instance of one of these two bugs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102443 or https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103976
After IPA specialization the function should become profitable enough to vectorize.
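As a rough illustration of what "outlined function" means here, the sketch below writes the outlining out by hand. It is not GCC's or Clang's actual generated code. The names `scale_openmp`, `Captured`, `outlined_body` and `scale_outlined` are hypothetical, and the real version would hand the outlined function to the OpenMP runtime instead of calling it directly.

```cpp
#include <cstddef>
#include <vector>

// What the programmer writes: pointer, trip count and factor are all
// visible in the function that contains the loop.
void scale_openmp(std::vector<double>& v, double factor) {
    double* data = v.data();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}

// Hand-written approximation of outlining: the captured variables are
// packed into a struct and the loop body moves into a separate function
// that only receives an opaque pointer.
struct Captured {
    double* data;
    std::ptrdiff_t n;
    double factor;
};

void outlined_body(void* arg) {
    Captured* c = static_cast<Captured*>(arg);
    // Here the compiler knows nothing about n, and a store to
    // c->data[i] could in principle alias c->factor.
    for (std::ptrdiff_t i = 0; i < c->n; ++i) {
        c->data[i] *= c->factor;
    }
}

void scale_outlined(std::vector<double>& v, double factor) {
    Captured c{v.data(), static_cast<std::ptrdiff_t>(v.size()), factor};
    outlined_body(&c);  // a real OpenMP runtime would dispatch this to threads
}
```

Inside `outlined_body` the facts that were obvious at the call site are gone unless interprocedural optimization propagates them back in, which is the specialization step described above.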
Somehow this has to do with the combination of ARM64 + Linux. GCC performs better than Clang on Intel and AMD hardware. I tested it on Apple M1 under macOS - GCC was significantly better.
Ignore this. I tested single-threaded on M1.