Poor performance with GCC on ARM64 compared to LLVM

I am porting a project from x86 to ARM64 and have been struggling with poor performance for some time. I recently tried switching from GCC to LLVM and, to my surprise, got a massive performance boost: in some cases the code runs several times faster. I have experimented with all sorts of optimization flags, but I can't get GCC to generate code that is fast enough. I suspect that vectorization isn't working. When I compile an arbitrary source file with the --verbose flag, LLVM reports +neon, while GCC doesn't report any SIMD features. I have tried different ARM64 cores and operating systems, and the result is the same.

Any suggestions on how to enable vectorization with GCC on ARM64?
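
To make the question easier to reproduce, here is a minimal sketch of a loop that both compilers should be able to vectorize. It is a hypothetical test case, not code from the project: the names `saxpy` and `simd_enabled` are made up for illustration. The `__ARM_NEON` check relies on the ACLE feature macro, which I assume both compilers define when Advanced SIMD is available.

```c
#include <stddef.h>

/* Hypothetical helper: reports whether the compiler is targeting
   Advanced SIMD (NEON). __ARM_NEON is the ACLE feature-test macro;
   on a non-ARM target it is simply undefined and this returns 0. */
int simd_enabled(void)
{
#if defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical test loop: y[i] = a * x[i] + y[i].
   `restrict` promises the compiler that x and y do not overlap,
   which removes one common obstacle to auto-vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Assuming the file is saved as saxpy.c, the vectorizer reports can be compared with `gcc -O3 -c -fopt-info-vec-optimized -fopt-info-vec-missed saxpy.c` and `clang -O3 -c -Rpass=loop-vectorize -Rpass-missed=loop-vectorize saxpy.c`. If GCC reports this loop as vectorized, the problem is probably specific to the project's loops rather than a missing SIMD feature.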

System:

  • GCC 12
  • LLVM 12
  • RHEL 7 and RHEL 8
  • ARMv8-a+neon