Poor performance with GCC

I am porting a project from x86 to ARM64 and have been struggling with poor performance for some time. Recently I tried switching from GCC to LLVM and, to my surprise, got a massive performance boost: in some cases the code runs several times faster. I have experimented with all sorts of optimization flags, but I can't get GCC to generate code that is fast enough. I suspect that vectorization isn't working. When I compile an arbitrary source file with the --verbose flag, LLVM reports +neon, while GCC doesn't report any SIMD features. I have tried different ARM64 cores and operating systems, and the result is the same.
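
One way to test the vectorization suspicion is to compile a small kernel on its own and ask each compiler for its vectorization report. The sketch below is only an illustration: the file name vec_check.cpp and the functions simd_target, saxpy and saxpy_demo are made up for this example and are not part of the project being ported. As far as I know, GCC's AArch64 back end has Advanced SIMD (NEON) enabled by default, so a missing "+neon" in the verbose output does not by itself mean SIMD is switched off. The __ARM_NEON macro check is a more direct test.

```cpp
// Hypothetical file name: vec_check.cpp
// Compile-only check with GCC (prints which loops were and were not vectorized):
//   g++ -O3 -mcpu=native -fopt-info-vec-optimized -fopt-info-vec-missed -c vec_check.cpp
// Rough Clang equivalent:
//   clang++ -O3 -mcpu=native -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c vec_check.cpp
#include <cstddef>
#include <vector>

// Reports which SIMD extension the compiler was told to target.
// __ARM_FEATURE_SVE and __ARM_NEON are ACLE predefined macros; on a
// non-ARM host this simply returns "none".
const char* simd_target() {
#if defined(__ARM_FEATURE_SVE)
    return "sve";
#elif defined(__ARM_NEON)
    return "neon";
#else
    return "none";
#endif
}

// Simple kernel: y[i] = a * x[i] + y[i].
// __restrict promises the compiler that x and y do not overlap, which
// removes one common reason for a loop not being vectorized.
void saxpy(float a, const float* __restrict x, float* __restrict y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Small driver so the kernel can also be exercised from a test:
// x is filled with 1.0 and y with 2.0, so every result is 3 * 1 + 2.
std::vector<float> saxpy_demo(std::size_t n) {
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    saxpy(3.0f, x.data(), y.data(), n);
    return y;
}
```

If GCC reports the saxpy loop as vectorized here but reports missed loops in the real sources, the -fopt-info-vec-missed messages for those loops are usually the best starting point for finding out why.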

Any suggestions on how to enable vectorization with GCC on ARM64?

System:

  • GCC 12
  • LLVM 12
  • RHEL 7 and RHEL 8
  • ARMv8-a+neon

Reply
  • Here is an example: https://github.com/LLNL/LULESH
    Configure with

    cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=/usr/local/gcc-12.1.0/bin/g++ -DWITH_MPI=Off -DWITH_OPENMP=On -DWITH_SILO=Off -DCMAKE_CXX_FLAGS="-ftree-vectorize -fno-trapping-math" ..

    Set -DCMAKE_CXX_COMPILER to the C++ compiler of your choice and adjust -DCMAKE_CXX_FLAGS accordingly.

    Then run lulesh2.0 (optionally with different sizes):

    ./lulesh2.0 -s 50 | grep Elapsed

    I tested vanilla GCC 12.1.0, ARM GCC 11.2.0, vanilla Clang 14.0.4 and ARM Clang 22.0.2. On my system, the LLVM-based compilers produced binaries about twice as fast as those from either GCC version.
