I ran into several issues when using the latest ARM 23.04.1 compilers with GROMACS on Fugaku (aka A64FX, aka 512-bit SVE).
This is the most problematic one. FWIW, ARM compilers 22.1 work great, even with -Ofast.
The issue can be reproduced with the latest GROMACS 2023.1 and its regression test suite, both of which can be downloaded from https://manual.gromacs.org/2023.1/download.html
This is an extract of the two tests that fail:
Testing awh_multibias . . .
gmx grompp -f /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/grompp.mdp -c /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/conf -r /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/conf -p /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/topol -ref /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/rotref -maxwarn 10 -n /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/index >grompp.out 2>grompp.err
gmx check -s1 /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/reference_s.tpr -s2 topol.tpr -tol 0.0001 -abstol 0.001 >checktpr.out 2>checktpr.err
gmx mdrun -ntmpi 1 -ntomp 1 -notunepme -cpi /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/continue -noappend >mdrun.out 2>&1
gmx check -e /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/reference_s.edr -e2 ener.part0002.edr -tol 0.001 -abstol 0.05 -lastener Potential >checkpot.out 2>checkpot.err
gmx check -f /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multibias/reference_s.trr -f2 traj.part0002.trr -tol 0.001 -abstol 0.05 >checkforce.out 2>checkforce.err
FAILED.
Check checkpot.out (200 errors), checkforce.out (38 errors) file(s) in awh_multibias for awh_multibias
Testing awh_multidim . . .
gmx grompp -f /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/grompp.mdp -c /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/conf -r /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/conf -p /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/topol -ref /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/rotref -maxwarn 10 -n /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/index >grompp.out 2>grompp.err
gmx check -s1 /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/reference_s.tpr -s2 topol.tpr -tol 0.0001 -abstol 0.001 >checktpr.out 2>checktpr.err
gmx mdrun -ntmpi 1 -ntomp 1 -notunepme >mdrun.out 2>&1
gmx check -e /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/reference_s.edr -e2 ener.edr -tol 0.001 -abstol 0.05 -lastener Potential >checkpot.out 2>checkpot.err
gmx check -f /home/rist/r00018/src/regressiontests-2023.1/complex/awh_multidim/reference_s.trr -f2 traj.trr -tol 0.001 -abstol 0.05 >checkforce.out 2>checkforce.err
FAILED. Check checkpot.out (106 errors), checkforce.out (3 errors) file(s) in awh_multidim for awh_multidim
This is how I built GROMACS with ARM compilers and -O2:
/usr/bin/cmake -G 'Unix Makefiles' -DCMAKE_INSTALL_PREFIX:STRING=$HOME/local/gromacs-2023.1/arm-23.04.1/2 -DCMAKE_BUILD_TYPE:STRING=Release -DBUILD_TESTING:BOOL=OFF -DCMAKE_INTERPROCEDURAL_OPTIMIZATION:BOOL=OFF -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DGMX_INSTALL_LEGACY_API=ON -DGMX_HWLOC:BOOL=ON -DGMX_GPU:STRING=OFF -DGMX_SIMD=ARM_SVE -DGMX_SIMD_ARM_SVE_LENGTH=512 -DGMX_USE_RDTSCP:BOOL=OFF -DGMX_OPENMP:BOOL=ON -DGMX_USE_RDTSCP:BOOL=OFF -DGMX_CYCLE_SUBCOUNTERS:BOOL=ON '-DCMAKE_C_FLAGS_RELEASE=-O2 -DNDEBUG' '-DCMAKE_CXX_FLAGS_RELEASE=-O2 -DNDEBUG' -DGMX_FFT_LIBRARY=fftpack -DGMX_MPI:BOOL=OFF -DCMAKE_C_COMPILER=armclang -DCMAKE_CXX_COMPILER=armclang++ -DBUILD_SHARED_LIBS=OFF $HOME/src/gromacs-2023.1
make -j 48 install
and this is how I ran the test suite:
. $HOME/local/gromacs-2023.1/arm-23.04.1/2/bin/GMXRC.bash
./gmxtest.pl -nt 1 -ntomp 1 -verbose all
This works just fine with ARM compilers 22.1, LLVM 16.0.2 and LLVM 16.0.6 (even with -Ofast), so the issue seems to be specific to the ARM compilers.
ARM compilers 23.04.1 also work just fine if -O1 is used instead of -O2.
I tried to identify the root cause, and found that it comes from the BiasState::updateFreeEnergyAndAddSamplesToHistogram(...) member function, which is defined in
src/gromacs/applied_forces/awh/biasstate.cpp
A temporary workaround is to prepend the definition with
[[clang::optnone]]
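For anyone who wants to try the workaround, here is a minimal sketch of what "prepending the definition" looks like. It is not the GROMACS source: ExampleHistogram, addSamples and the EXAMPLE_OPTNONE macro are hypothetical names used only for illustration, and the macro guard is my own assumption so that the same file still builds with non-Clang compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper macro (not part of GROMACS): expands to the
// Clang-specific attribute on Clang-based compilers and to nothing elsewhere.
#if defined(__clang__)
#    define EXAMPLE_OPTNONE [[clang::optnone]]
#else
#    define EXAMPLE_OPTNONE
#endif

// Hypothetical stand-in for a class with one miscompiled member function.
class ExampleHistogram
{
public:
    explicit ExampleHistogram(std::size_t numBins) : bins_(numBins, 0.0) {}

    void   addSamples(const std::vector<double>& weights);
    double total() const;

private:
    std::vector<double> bins_;
};

// The attribute goes in front of the out-of-class definition, so only this
// one function is compiled without optimization; the rest of the file still
// uses the optimization level given on the command line.
EXAMPLE_OPTNONE
void ExampleHistogram::addSamples(const std::vector<double>& weights)
{
    assert(weights.size() == bins_.size());
    for (std::size_t i = 0; i < bins_.size(); ++i)
    {
        bins_[i] += weights[i];
    }
}

// Left untouched: still optimized as usual.
double ExampleHistogram::total() const
{
    double sum = 0.0;
    for (double b : bins_)
    {
        sum += b;
    }
    return sum;
}
```

In the real case the attribute would go directly in front of the BiasState::updateFreeEnergyAndAddSamplesToHistogram(...) definition in biasstate.cpp. Disabling optimization for a single function costs far less than building the whole tree with -O1.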
Thanks for the defect report - we are looking into this.