ARM 23.04 compilers generate incorrect code for GROMACS from -O2

I faced several issues when using the latest ARM 23.04.1 compilers with GROMACS on Fugaku (aka A64FX, aka 512-bit SVE).

This is the most problematic one. FWIW, ARM compilers 22.1 work great, even with -Ofast.

The issue can be reproduced with the latest GROMACS 2023.1 and its regression test suite, both of which can be downloaded from https://manual.gromacs.org/2023.1/download.html

This is an extract of the two tests that fail:

Testing awh_multibias . . .

gmx grompp -f /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/grompp.mdp -c /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/conf -r /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/conf -p /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/topol -ref /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/rotref -maxwarn 10 -n /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/index >grompp.out 2>grompp.err

gmx check -s1 /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/reference_s.tpr -s2 topol.tpr -tol 0.0001 -abstol 0.001 >checktpr.out 2>checktpr.err

gmx mdrun -ntmpi 1 -ntomp 1 -notunepme -cpi /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/continue -noappend >mdrun.out 2>&1

gmx check -e /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/reference_s.edr -e2 ener.part0002.edr -tol 0.001 -abstol 0.05 -lastener Potential >checkpot.out 2>checkpot.err

gmx check -f /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/reference_s.trr -f2 traj.part0002.trr -tol 0.001 -abstol 0.05 >checkforce.out 2>checkforce.err

FAILED.

Check checkpot.out (200 errors), checkforce.out (38 errors) file(s) in awh_multibias for awh_multibias

Testing awh_multidim . . .

gmx grompp -f /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/grompp.mdp -c /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/conf -r /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/conf -p /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/topol -ref /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/rotref -maxwarn 10 -n /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/index >grompp.out 2>grompp.err

gmx check -s1 /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/reference_s.tpr -s2 topol.tpr -tol 0.0001 -abstol 0.001 >checktpr.out 2>checktpr.err

gmx mdrun -ntmpi 1 -ntomp 1 -notunepme >mdrun.out 2>&1

gmx check -e /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/reference_s.edr -e2 ener.edr -tol 0.001 -abstol 0.05 -lastener Potential >checkpot.out 2>checkpot.err

gmx check -f /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/reference_s.trr -f2 traj.trr -tol 0.001 -abstol 0.05 >checkforce.out 2>checkforce.err

FAILED. Check checkpot.out (106 errors), checkforce.out (3 errors) file(s) in awh_multidim for awh_multidim

This is how I built GROMACS with ARM compilers and -O2:

/usr/bin/cmake -G 'Unix Makefiles' -DCMAKE_INSTALL_PREFIX:STRING=$HOME/local/gromacs-2023.1/arm-23.04.1/2 -DCMAKE_BUILD_TYPE:STRING=Release -DBUILD_TESTING:BOOL=OFF -DCMAKE_INTERPROCEDURAL_OPTIMIZATION:BOOL=OFF -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DGMX_INSTALL_LEGACY_API=ON -DGMX_HWLOC:BOOL=ON -DGMX_GPU:STRING=OFF -DGMX_SIMD=ARM_SVE -DGMX_SIMD_ARM_SVE_LENGTH=512 -DGMX_USE_RDTSCP:BOOL=OFF -DGMX_OPENMP:BOOL=ON -DGMX_USE_RDTSCP:BOOL=OFF -DGMX_CYCLE_SUBCOUNTERS:BOOL=ON '-DCMAKE_C_FLAGS_RELEASE=-O2 -DNDEBUG' '-DCMAKE_CXX_FLAGS_RELEASE=-O2 -DNDEBUG' -DGMX_FFT_LIBRARY=fftpack -DGMX_MPI:BOOL=OFF -DCMAKE_C_COMPILER=armclang -DCMAKE_CXX_COMPILER=armclang++ -DBUILD_SHARED_LIBS=OFF $HOME/src/gromacs-2023.1

make -j 48 install

and then how I ran the test suite

. $HOME/local/gromacs-2023.1/arm-23.04.1/2/bin/GMXRC.bash

./gmxtest.pl -nt 1 -ntomp 1 -verbose all

This works just fine with ARM compilers 22.1, LLVM 16.0.2 and LLVM 16.0.6 (even with -Ofast), so the issue seems specific to the ARM 23.04 compilers.

ARM compilers 23.04.1 also work just fine if -O1 is used instead of -O2.

I tried to identify the root cause, and found that it comes from the BiasState::updateFreeEnergyAndAddSamplesToHistogram(...) subroutine that is defined in

src/gromacs/applied_forces/awh/biasstate.cpp

A temporary workaround is to prepend the definition with

[[clang::optnone]]

  • Hi Gilles

    We think that this is a known issue when running ACfL on older Linux distributions and can be fixed by setting the LD_BIND_NOW environment variable.

     

    glibc has a known defect, reported as https://sourceware.org/bugzilla/show_bug.cgi?id=26798, affecting lazy binding on AArch64 platforms. As a mitigation, Arm Compiler for Linux versions up to 22.0 used the '-z now' flag at the link stage to disable lazy binding when building dynamically linked programs.

     

    The glibc bug is fixed from version 2.26, but it can still be encountered on older systems that run an unpatched glibc, for example Ubuntu 18.04, Ubuntu 20.04, and SLES 15.

     

    Arm Compiler for Linux removed the mitigation from version 22.0 onwards, but we think you are likely seeing the bug only now because version 23.04 made -fsimdmath the default setting at -O2 and above. We think you would also see it on version 22.1 with -fsimdmath enabled.

     

    Anyway - the best workaround is to set the LD_BIND_NOW environment variable. You could also pass -z now on your link line.

     

    Can you confirm our diagnosis and that the workaround is suitable for you?

    Ta

    Rich
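
For anyone trying the two workarounds from the reply above, here is a minimal sketch of what they could look like in a bash session. The variable name LINK_FLAGS is purely illustrative, and passing the option through CMAKE_EXE_LINKER_FLAGS with the -Wl, spelling is an assumption about how '-z now' would be forwarded in a CMake build like the one in the original post; it was not stated in the thread.

```shell
# Workaround 1: disable lazy binding at run time.
# The dynamic loader resolves all symbols at program start-up when
# LD_BIND_NOW is set to any non-empty value.
export LD_BIND_NOW=1

# Workaround 2 (alternative): disable lazy binding at link time.
# '-z now' is a linker option; when linking through a compiler driver it is
# commonly forwarded as -Wl,-z,now. LINK_FLAGS is a hypothetical helper name.
LINK_FLAGS="-Wl,-z,now"

# Assumed usage: append this to the cmake command line and rebuild.
echo "-DCMAKE_EXE_LINKER_FLAGS=${LINK_FLAGS}"
```

With LD_BIND_NOW exported, rerunning ./gmxtest.pl from the same shell is enough to check whether the awh_multibias and awh_multidim failures go away; no rebuild is needed for that variant.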
