ArmPL crash in FFT (Rader) with large dimensions

Hello.  I am experiencing a crash in the Rader code path for FFT.  This is on an M1 Mac running macOS 26.3, building with clang 21.  I'm using ArmPL 26.01, but the crash happens with 25.04 as well.  I also see it on macOS 12.6, and a customer is seeing it too on his M-series Mac (I don't have further details on his setup).

The crash can be reproduced with a modified version of fftw_dft_2d_c_example.c in examples_lp64_mp.  (I think it happens without OpenMP too.)  The key change is going from a (5 x 2) array to a (256 * 23 x 256 * 23) array.  Using 256 * 19 runs fine.  Primes 23 and over crash.

Here's a snippet of the crash info:

Triggered by Thread: 0, Dispatch Queue: com.apple.main-thread

Exception Type: EXC_BAD_ACCESS (SIGBUS)
Exception Subtype: KERN_PROTECTION_FAILURE at 0x000000082ec54000
Exception Codes: 0x0000000000000002, 0x000000082ec54000

Termination Reason: Namespace SIGNAL, Code 10, Bus error: 10
Terminating Process: exc handler [40031]


VM Region Info: 0x82ec54000 is in 0x82e400000-0xb2c000000; bytes after start: 8732672 bytes before end: 12838420479
REGION TYPE START - END [ VSIZE] PRT/MAX SHRMOD REGION DETAIL
MALLOC_SMALL 82e000000-82e400000 [ 4096K] rw-/rwx SM=PRV
---> commpage (reserved) 82e400000-b2c000000 [ 12.0G] ---/--- SM=NUL reserved VM address space (unallocated)
GAP OF 0x86000000 BYTES
MALLOC_LARGE bb2000000-bba000000 [128.0M] rw-/rwx SM=PRV

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libarmpl_lp64_mp.dylib 0x10b87a8f4 arm::fft1d::level_rader_t<std::__1::complex<double>, std::__1::complex<double>>::execute(long long, void const*, long long, long long, void*, long long, long long) const + 556
1 libarmpl_lp64_mp.dylib 0x10b8092d8 void arm::fft1d::execute<std::__1::complex<double>, std::__1::complex<double>>(arm::fft1d::composition<std::__1::complex<double>, std::__1::complex<double>> const&, long long, std::__1::complex<double> const*, std::__1::complex<double>*, long long, long long, long long, long long) + 472
2 libarmpl_lp64_mp.dylib 0x10b7d0c80 void arm::fft1d::parallel::parallel_loop<arm::fft1d::batched_1d_plan<std::__1::complex<double>, std::__1::complex<double>>::execute(long long, void const*, long long, long long, void*, long long, long long) const::'lambda'(int)>(int, arm::fft1d::batched_1d_plan<std::__1::complex<double>, std::__1::complex<double>>::execute(long long, void const*, long long, long long, void*, long long, long long) const::'lambda'(int)) (.omp_outlined) + 292
3 libomp.dylib 0x104ded1cc __kmp_invoke_microtask + 156
4 ??? 0x0 ???
5 ??? 0x550 ???
6 ??? 0xbb2000000 ???

Thread 1:
0 ??? 0x104cc4594 ???
1 libarmpl_lp64_mp.dylib 0x10b809290 void arm::fft1d::execute<std::__1::complex<double>, std::__1::complex<double>>(arm::fft1d::composition<std::__1::complex<double>, std::__1::complex<double>> const&, long long, std::__1::complex<double> const*, std::__1::complex<double>*, long long, long long, long long, long long) + 400
2 libarmpl_lp64_mp.dylib 0x10b7d0c80 void arm::fft1d::parallel::parallel_loop<arm::fft1d::batched_1d_plan<std::__1::complex<double>, std::__1::complex<double>>::execute(long long, void const*, long long, long long, void*, long long, long long) const::'lambda'(int)>(int, arm::fft1d::batched_1d_plan<std::__1::complex<double>, std::__1::complex<double>>::execute(long long, void const*, long long, long long, void*, long long, long long) const::'lambda'(int)) (.omp_outlined) + 292
3 libomp.dylib 0x104ded1cc __kmp_invoke_microtask + 156
4 ??? 0x1 ???
5 ??? 0x380 ???
6 ??? 0x200 ???
7 ??? 0x726854207265 ???

Thread 2:
0 ??? 0x104cc42a4 ???
1 libarmpl_lp64_mp.dylib 0x10b809290 void arm::fft1d::execute<std::__1::complex<double>, std::__1::complex<double>>(arm::fft1d::composition<std::__1::complex<double>, std::__1::complex<double>> const&, long long, std::__1::complex<double> const*, std::__1::complex<double>*, long long, long long, long long, long long) + 400
2 libarmpl_lp64_mp.dylib 0x10b7d0c80 void arm::fft1d::parallel::parallel_loop<arm::fft1d::batched_1d_plan<std::__1::complex<double>, std::__1::complex<double>>::execute(long long, void const*, long long, long long, void*, long long, long long) const::'lambda'(int)>(int, arm::fft1d::batched_1d_plan<std::__1::complex<double>, std::__1::complex<double>>::execute(long long, void const*, long long, long long, void*, long long, long long) const::'lambda'(int)) (.omp_outlined) + 292
3 libomp.dylib 0x104ded1cc __kmp_invoke_microtask + 156
4 ??? 0x2 ???
5 ??? 0x380 ???
6 ??? 0x200 ???
7 ??? 0x726854207265 ???

Thread 3:
0 libomp.dylib 0x104d8a830 __kmp_launch_thread + 392
1 libomp.dylib 0x104dcffb4 __kmp_launch_worker(void*) + 280
2 libsystem_pthread.dylib 0x198727c08 _pthread_start + 136
3 libsystem_pthread.dylib 0x198722ba8 thread_start + 8

Thread 4:
0 ??? 0x104cc4714 ???
1 libarmpl_lp64_mp.dylib 0x10b809290 void arm::fft1d::execute<std::__1::complex<double>, std::__1::complex<double>>(arm::fft1d::composition<std::__1::complex<double>, std::__1::complex<double>> const&, long long, std::__1::complex<double> const*, std::__1::complex<double>*, long long, long long, long long, long long) + 400
2 libarmpl_lp64_mp.dylib 0x10b7d0c80 void arm::fft1d::parallel::parallel_loop<arm::fft1d::batched_1d_plan<std::__1::complex<double>, std::__1::complex<double>>::execute(long long, void const*, long long, long long, void*, long long, long long) const::'lambda'(int)>(int, arm::fft1d::batched_1d_plan<std::__1::complex<double>, std::__1::complex<double>>::execute(long long, void const*, long long, long long, void*, long long, long long) const::'lambda'(int)) (.omp_outlined) + 292
3 libomp.dylib 0x104ded1cc __kmp_invoke_microtask + 156
4 ??? 0x6 ???
5 ??? 0x380 ???
6 ??? 0x200 ???
7 ??? 0x726854207265 ???

Thread 5:
0 ??? 0x104cc4344 ???
1 libarmpl_lp64_mp.dylib 0x10b809290 void arm::fft1d::execute<std::__1::complex<double>, std::__1::complex<double>>(arm::fft1d::composition<std::__1::complex<double>, std::__1::complex<double>> const&, long long, std::__1::complex<double> const*, std::__1::complex<double>*, long long, long long, long long, long long) + 400
2 libarmpl_lp64_mp.dylib 0x10b7d0c80 void arm::fft1d::parallel::parallel_loop<arm::fft1d::batched_1d_plan<std::__1::complex<double>, std::__1::complex<double>>::execute(long long, void const*, long long, long long, void*, long long, long long) const::'lambda'(int)>(int, arm::fft1d::batched_1d_plan<std::__1::complex<double>, std::__1::complex<double>>::execute(long long, void const*, long long, long long, void*, long long, long long) const::'lambda'(int)) (.omp_outlined) + 292
3 libomp.dylib 0x104ded1cc __kmp_invoke_microtask + 156
4 ??? 0x7 ???
5 ??? 0x380 ???
6 ??? 0x200 ???
7 ??? 0x726854207265 ???


Thread 0 crashed with ARM Thread State (64-bit):
x0: 0x000000082cc25200 x1: 0x000000082cc25200 x2: 0x0000000000000020 x3: 0x0000000000000020
x4: 0x0000000000000020 x5: 0x0000000000000100 x6: 0x0000000000000200 x7: 0x0000000000000080
x8: 0x0000000000000000 x9: 0x000000082d3183c8 x10: 0x000000082d554000 x11: 0x000000082cc25000
x12: 0x0000000000000016 x13: 0x0000000000170000 x14: 0x0000000000000080 x15: 0x00000000000000a0
x16: 0x0000000000000040 x17: 0x0000000000000060 x18: 0x0000000000000000 x19: 0x000000082cc27e00
x20: 0x0000000000170000 x21: 0x000000082cc27c00 x22: 0x0000000000000200 x23: 0x0000000000000008
x24: 0x000000082cc25000 x25: 0x000000082d56b000 x26: 0x000000082d2fc780 x27: 0x0000000000000020
x28: 0x0000000000000016 fp: 0x000000016b176560 lr: 0x000000010b87a8b0
sp: 0x000000016b176420 pc: 0x000000010b87a8f4 cpsr: 0x20001000
far: 0x000000082ec54000 esr: 0x92000047 (Data Abort) byte write Translation fault

Binary Images:
0x104c88000 - 0x104c8bfff fftw_dft_2d_c_doug.exe (*) <bc1e303d-deef-382a-a313-e82127e4f3d0> */fftw_dft_2d_c_doug.exe
0x108f68000 - 0x10c53ffff libarmpl_lp64_mp.dylib (*) <daeaca78-38e9-37be-b7cf-421def5bc699> /Users/USER/Downloads/*/libarmpl_lp64_mp.dylib
0x104d64000 - 0x104df7fff libomp.dylib (*) <86c894d2-ffc2-3ca7-88cd-11f483173719> /Users/USER/Downloads/*/libomp.dylib
0x0 - 0xffffffffffffffff ??? (*) <00000000-0000-0000-0000-000000000000> ???
0x198721000 - 0x19872dacb libsystem_pthread.dylib (*) <0596a7b6-bce2-3f06-a2e8-3eaab5371ed8> /usr/lib/system/libsystem_pthread.dylib

Instead of pasting the modified code, I will describe the changes to fftw_dft_2d_c_example.c.

  • Include <stdlib.h>
  • Change MMAX and NMAX to 6000
  • Change variables x and xx from stack variables to malloc'd arrays of MMAX * NMAX * sizeof(fftw_complex)
  • Change variables m and n to 256 * 23
  • Optionally, free x and xx
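
As a rough sketch of the allocation change only (this is not the exact code from my modified example): the helper names grid_elems and alloc_grid are hypothetical, and cplx_t is a stand-in for fftw_complex, assuming the default FFTW layout of double[2]. It has no ArmPL dependency, so it only shows the sizes involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for fftw_complex; assumes the default FFTW layout of double[2]. */
typedef double cplx_t[2];
static_assert(sizeof(cplx_t) == 2 * sizeof(double), "unexpected complex layout");

/* Hypothetical helper: number of elements in an m x n array, or 0 on overflow. */
static size_t grid_elems(size_t m, size_t n) {
    if (m != 0 && n > SIZE_MAX / m) {
        return 0;
    }
    return m * n;
}

/* Hypothetical helper: heap-allocate an m x n complex array.
 * Returns NULL if the size is zero, overflows, or malloc fails.
 * The caller releases the buffer with free(). */
static cplx_t *alloc_grid(size_t m, size_t n) {
    size_t elems = grid_elems(m, n);
    if (elems == 0 || elems > SIZE_MAX / sizeof(cplx_t)) {
        return NULL;
    }
    return malloc(elems * sizeof(cplx_t));
}
```

With MMAX = NMAX = 6000, alloc_grid(MMAX, NMAX) would replace each of the two stack arrays x and xx, and the transform itself uses m = n = 256 * 23 = 5888 of those rows and columns.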

To get the examples to build, I had to edit the provided Makefile to remove -fopenmp from both CFLAGS and CLINKFLAGS.  The stock examples all ran fine this way.

Though this does appear to be an issue with ArmPL, is there anything I can do in my code to avoid the crash?

Thanks,
Doug