I have been using ArmPL on Linux for quite some time and have experienced numerous sporadic crashes whose cause I haven't been able to identify. Recently, I started using ArmPL on macOS too, and the same type of crash started occurring on that platform as well. At first, I thought the issue was related to the OpenMP library, but after some experimenting I came to the conclusion that the crash is related to ArmPL. Here is my setup:
LINUX
MAC
The crash typically occurs after running the application for some time. Note that I use a wrapper around ArmPL. On macOS, I get the following output:
C [libomp.dylib+0x5750] ___kmp_fast_free+0xf0
C [libomp.dylib+0x36704] __kmp_release_deps(int, kmp_taskdata*)+0xb0
C [libomp.dylib+0x35894] void __kmp_task_finish<false>(int, kmp_task*, kmp_taskdata*)+0x148
C [libomp.dylib+0x306c0] __kmp_invoke_task(int, kmp_task*, kmp_taskdata*)+0x2b0
C [libomp.dylib+0x33960] int __kmp_execute_tasks_64<false, true>(kmp_info*, int, kmp_flag_64<false, true>*, int, int*, void*, int)+0x31c
C [libomp.dylib+0x3d594] kmp_flag_64<false, true>::wait(kmp_info*, int, void*)+0x618
C [libomp.dylib+0x39b30] __kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*)+0x98
C [libomp.dylib+0x38730] __kmp_barrier+0x500
C [libomp.dylib+0xf170] __kmpc_barrier+0x154
C [libomp.dylib+0x6adec] __kmp_invoke_microtask+0x9c
On Linux, I get this:
C [libomp.so+0x1db1c] ___kmp_fast_free+0x120
C [libomp.so+0x58c5c] __kmp_free_task_and_ancestors(int, kmp_taskdata*, kmp_info*)+0x90
C [libomp.so+0x57b34] void __kmp_task_finish<false>(int, kmp_task*, kmp_taskdata*)+0xe8
C [libomp.so+0x55068] __kmp_invoke_task(int, kmp_task*, kmp_taskdata*)+0x3cc
C [libomp.so+0x5a8d4] int __kmp_execute_tasks_64<false, true>(kmp_info*, int, kmp_flag_64<false, true>*, int, int*, void*, int)+0x2dc
C [libomp.so+0x620bc] kmp_flag_64<false, true>.wait(kmp_info*, int, void*)+0x620
C [libomp.so+0x5e320] __kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*)+0x90
C [libomp.so+0x5d110] __kmp_barrier+0x754
C [libomp.so+0x28634] __kmpc_barrier+0x144
C [libomp.so+0x8721c] GOMP_barrier+0x40
C [libomp.so+0x87f20] __kmp_GOMP_microtask_wrapper(int*, int*, void (*)(void*), void*)+0x34
C [libomp.so+0xa16cc] __kmp_invoke_microtask+0x9c
If I run without a wrapper, I get this:
[thread 958273 also had an error]
[thread 958269 also had an error]
[thread 958267 also had an error]
[thread 958284 also had an error]
[thread 958280 also had an error]
[thread 958276 also had an error]
[thread 958281 also had an error]
[thread 958271 also had an error]
[thread 958272 also had an error]
[thread 958275 also had an error]
[thread 958283 also had an error]
[thread 958282 also had an error]
[thread 958270 also had an error]
[thread 958288 also had an error]
[thread 958287 also had an error]
[thread 958286 also had an error]
[thread 958262 also had an error]
[thread 958266 also had an error]
[thread 958274 also had an error]
[thread 958277 also had an error]
[thread 958290 also had an error]
[thread 958279 also had an error]
[thread 958278 also had an error]
[thread 958289 also had an error]
[thread 958285 also had an error]
[thread 958265 also had an error]
[thread 958268 also had an error]
[thread 958261 also had an error]
[thread 958263 also had an error]
[thread 958252 also had an error]
[thread 958264 also had an error]
C [libarmpl_mp.so+0x176e034] zdot_conj_kernel+0xf4
C [libarmpl_mp.so+0x258a444] std::complex<double> armpl::clag::reduce_add_parallel<std::complex<double>, bool armpl::clag::strat::dot::impl<std::complex<double>, std::complex<double>, armpl::clag::spec::neoverse_n1_machine_spec>(armpl::clag::spec::problem_context_2T<std::complex<double>, std::complex<double>, (armpl::clag::spec::problem_type)43, armpl::clag::spec::neoverse_n1_machine_spec> const&) const::{lambda(long)#1}>(int, bool armpl::clag::strat::dot::impl<std::complex<double>, std::complex<double>, armpl::clag::spec::neoverse_n1_machine_spec>(armpl::clag::spec::problem_context_2T<std::complex<double>, std::complex<double>, (armpl::clag::spec::problem_type)43, armpl::clag::spec::neoverse_n1_machine_spec> const&) const::{lambda(long)#1}) [clone ._omp_fn.0]+0xc4
C [libomp.so+0x87f20] __kmp_GOMP_microtask_wrapper(int*, int*, void (*)(void*), void*)+0x34
C [libomp.so+0xa16cc] __kmp_invoke_microtask+0x9c
The crash doesn't occur when using other BLAS/LAPACK implementations (OpenBLAS, vecLib). Any help with solving this problem will be much appreciated.
Hi,
Thank you for reporting this. To help us investigate please could you let us know what the values of N, INCX and INCY are in the call to ZDOTC from your code? Also, how many threads are you using?
Regards,
Chris.
The application runs with different numbers of threads. I can confirm that it crashes with 4, 8 and 16 threads. The crash below happened with OMP_NUM_THREADS=16, but this doesn't necessarily mean that ArmPL is running with 16 threads. Sometimes a single application thread calls ArmPL (and the expectation is that ArmPL will run multi-threaded). Other times, multiple application threads call ArmPL simultaneously. This depends on which algorithm is being used. I added debug printing just before the place where I call ZDOTC, and it prints the following just before the crash:
N = 1, INCX = 1, INCY = 1, omp_get_num_threads = 1
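A note on reading that debug line: `omp_get_num_threads()` reports the size of the current team, so it returns 1 whenever it is called outside an active parallel region. On its own it does not show how many threads a library call made from that point could use. The sketch below is one possible way to extend the debug print with `omp_get_max_threads()` and `omp_in_parallel()`. The helper name `format_zdotc_state` is hypothetical and not taken from the original wrapper, and the serial fallbacks are only there so the snippet also builds without OpenMP enabled.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without -fopenmp. */
static int omp_get_num_threads(void) { return 1; }
static int omp_get_max_threads(void) { return 1; }
static int omp_in_parallel(void) { return 0; }
#endif

/* Hypothetical helper: format the call-site state into buf just before the
 * wrapper calls ZDOTC. Returns what snprintf returns, i.e. the number of
 * characters the full message needs (excluding the terminating NUL). */
static int format_zdotc_state(char *buf, size_t len, int n, int incx, int incy)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "N = %d, INCX = %d, INCY = %d, "
                    "omp_get_num_threads = %d, omp_get_max_threads = %d, "
                    "omp_in_parallel = %d",
                    n, incx, incy,
                    omp_get_num_threads(),  /* size of the current team       */
                    omp_get_max_threads(),  /* upper bound for a new region   */
                    omp_in_parallel());     /* nonzero inside an active region */
}
```

The caller would print `buf` with whatever logging the wrapper already uses.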
It runs OK for other sizes of N but it seems that the problem occurs when N = 1. I should also note that the crash doesn't always occur. It may take a few runs of the same workload until it crashes.
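For reference, ZDOTC returns the sum of conj(x[i]) * y[i], so for N = 1 the result is simply conj(x[0]) * y[0]. If the crash really is limited to N = 1, which is only a hypothesis at this point in the thread, a wrapper could handle that case itself and never enter the library for it. A minimal sketch follows. The names `zdotc_fn`, `zdotc_ref` and `zdotc_guarded` are hypothetical and are not ArmPL API, and `zdotc_ref` is a plain loop standing in for whatever routine the wrapper normally calls.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical signature for the routine the wrapper normally calls. */
typedef double complex (*zdotc_fn)(int n, const double complex *x, int incx,
                                   const double complex *y, int incy);

/* Plain reference loop: sum of conj(x[i]) * y[i]. Positive increments only. */
static double complex zdotc_ref(int n, const double complex *x, int incx,
                                const double complex *y, int incy)
{
    assert(incx > 0 && incy > 0);
    double complex acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += conj(x[(size_t)i * (size_t)incx]) * y[(size_t)i * (size_t)incy];
    return acc;
}

/* Guard: compute the trivial cases inline, delegate everything else to lib. */
static double complex zdotc_guarded(zdotc_fn lib, int n,
                                    const double complex *x, int incx,
                                    const double complex *y, int incy)
{
    if (n <= 0)
        return 0.0;                 /* BLAS convention: empty dot product is 0 */
    if (n == 1)
        return conj(x[0]) * y[0];   /* increments are irrelevant for one element */
    return lib(n, x, incx, y, incy);
}
```

This only sidesteps the symptom. It says nothing about the root cause, and it would not help if the N = 1 call merely exposes damage done earlier.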
Thanks for the information, we'll try to reproduce this and get back in touch with an update.
Thanks! Some additional information. I observed the following pattern with two different workloads:
...
N = 34251, INCX = 1, INCY = 1, omp_get_num_threads = 1
END ZDOTC
N = 34251, INCX = 1, INCY = 1, omp_get_num_threads = 1
END ZDOTC
N = 34251, INCX = 1, INCY = 1, omp_get_num_threads = 1
END ZDOTC
...
N = 1, INCX = 1, INCY = 1, omp_get_num_threads = 1
...
N = 34251, INCX = 1, INCY = 1, omp_get_num_threads = 1
END ZDOTC
N = 34251, INCX = 1, INCY = 1, omp_get_num_threads = 1
END ZDOTC
N = 34251, INCX = 1, INCY = 1, omp_get_num_threads = 1
END ZDOTC
...
N = 1, INCX = 1, INCY = 1, omp_get_num_threads = 1
___crash___
It seems that it crashes the second time ZDOTC is called with N=1.
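The call pattern in this log could be turned into a small standalone driver when trying to reproduce the problem outside the application. The sketch below replays three large calls followed by one N = 1 call, for a given number of rounds, and checks each result against a plain reference loop. Everything in it is an assumption made for illustration: the names `zdotc_fn`, `zdotc_ref` and `replay_pattern` are hypothetical, and the routine under test is passed in as a function pointer (for ArmPL that would be a thin adapter around a call such as `cblas_zdotc_sub`). The real crash may also depend on how application threads call into the library, which this single-threaded driver does not model.

```c
#include <assert.h>
#include <complex.h>
#include <stdlib.h>

/* Hypothetical signature for the dot-product routine under test. */
typedef double complex (*zdotc_fn)(int n, const double complex *x, int incx,
                                   const double complex *y, int incy);

/* Plain reference loop: sum of conj(x[i]) * y[i]. Positive increments only. */
static double complex zdotc_ref(int n, const double complex *x, int incx,
                                const double complex *y, int incy)
{
    assert(incx > 0 && incy > 0);
    double complex acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += conj(x[(size_t)i * (size_t)incx]) * y[(size_t)i * (size_t)incy];
    return acc;
}

/* Replays the observed pattern: three calls with n_large, then one with N = 1,
 * repeated `rounds` times. Returns the number of results that differ from the
 * reference loop, or -1 if allocation fails. */
static int replay_pattern(zdotc_fn dot, int n_large, int rounds)
{
    assert(dot != NULL && n_large > 0 && rounds > 0);
    double complex *x = malloc((size_t)n_large * sizeof *x);
    double complex *y = malloc((size_t)n_large * sizeof *y);
    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return -1;
    }
    /* Small integer-valued data: every partial sum is exactly representable,
     * so a different summation order still gives a bit-identical result. */
    for (int i = 0; i < n_large; ++i) {
        x[i] = 1.0 + (double)(i % 7) * I;
        y[i] = 2.0 - (double)(i % 5) * I;
    }
    int mismatches = 0;
    for (int r = 0; r < rounds; ++r) {
        for (int k = 0; k < 3; ++k)
            if (dot(n_large, x, 1, y, 1) != zdotc_ref(n_large, x, 1, y, 1))
                ++mismatches;
        if (dot(1, x, 1, y, 1) != zdotc_ref(1, x, 1, y, 1))
            ++mismatches;
    }
    free(x);
    free(y);
    return mismatches;
}
```

With the log above in mind, `replay_pattern(adapter, 34251, 2)` would cover the point where the second N = 1 call happens.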
We're struggling to reproduce this crash. For your Linux runs, please can you let us know which version(s) of GCC you're using, and on which Linux distribution? The OpenMP runtime libraries differ subtly in how they handle nested parallelism from GCC 11 onwards (something we've tried to document here: Arm Compiler for Linux OpenMP settings).
In our builds of Arm PL with different versions of GCC, we try to match the OpenMP behaviour of the compiler with regard to nesting. So I wonder whether you see the same crash with, say, both GCC 12 and GCC 8?
Sorry for the late reply. I was on a very long journey. The code is compiled with GCC 12 and the crash occurs on Rocky Linux 8 (generic ARMv8) and on Ubuntu 22.04 (Neoverse-N1). I have seen similar crashes on Amazon Linux 2 with Graviton 2 and 3. I could try building the code with GCC 11. This is going to take some time.