Some time ago I reported a performance issue with ArmPL on CPUs with a large number of cores. See https://community.arm.com/support-forums/f/high-performance-computing-forum/53959/negative-armpl-mt-speed-up-on-many-core-systems It was fixed in version 23.04, and as a result my application gained a lot of performance. Recently, I did some more scaling tests and discovered more issues.

The testing was done on a 128-core Ampere Altra CPU running Ubuntu 22.04. The application uses ArmPL 23.04 and LLVM OpenMP, and is compiled with GCC 12. The application uses ArmPL in two ways depending on the algorithm: (1) a single application thread calls ArmPL; (2) multiple application threads call ArmPL at the same time. I did the performance profiling with perf-libs-tools.

ZGETRF

The problem with ZGETRF occurs when multiple application threads call it at the same time. Typical inputs are M=N with sizes between 10 and 300. Maybe there is some locking issue? Summary from perf-libs-tools:

OMP_NUM_THREADS=32

zgetrf_ cnt= 2434696 totTime= 1504.5387 called_tot= 171712 topTime= 93.8037 (%age of runtime: 6.428 )

Most frequent calls:

$ grep -i zgetrf /tmp/armplsummary_915245.apl | sort -k8 -nr | head | sort -k12 -n
Routine: zgetrf_ nCalls: 216920 Mean_time 8.647643e-05 nUserCalls: 12077 Mean_user_time: 8.447802e-05 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 66792 Mean_time 2.949315e-04 nUserCalls: 7386 Mean_user_time: 2.930019e-04 Inputs: 100 100 100 1 0
Routine: zgetrf_ nCalls: 126456 Mean_time 3.720812e-04 nUserCalls: 13398 Mean_user_time: 3.665895e-04 Inputs: 110 110 110 1 0
Routine: zgetrf_ nCalls: 201256 Mean_time 4.436309e-04 nUserCalls: 16952 Mean_user_time: 4.371607e-04 Inputs: 120 120 120 1 0
Routine: zgetrf_ nCalls: 168344 Mean_time 5.473583e-04 nUserCalls: 14346 Mean_user_time: 5.396482e-04 Inputs: 130 130 130 1 0
Routine: zgetrf_ nCalls: 147136 Mean_time 6.433120e-04 nUserCalls: 11615 Mean_user_time: 6.342174e-04 Inputs: 140 140 140 1 0
Routine: zgetrf_ nCalls: 211288 Mean_time 6.791234e-04 nUserCalls: 9395 Mean_user_time: 6.745791e-04 Inputs: 142 142 142 1 0
Routine: zgetrf_ nCalls: 123904 Mean_time 7.762974e-04 nUserCalls: 8320 Mean_user_time: 7.669183e-04 Inputs: 150 150 150 1 0
Routine: zgetrf_ nCalls: 73040 Mean_time 8.895631e-04 nUserCalls: 5874 Mean_user_time: 8.788582e-04 Inputs: 160 160 160 1 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.083270e-03 nUserCalls: 9152 Mean_user_time: 1.073041e-03 Inputs: 170 170 170 1 0

OMP_NUM_THREADS=128

zgetrf_ cnt= 2434696 totTime= 2046.2830 called_tot= 241094 topTime= 151.9661 (%age of runtime: 9.300 )

Most frequent calls:

$ grep -i zgetrf /tmp/armplsummary_912580.apl | sort -k8 -nr | head | sort -k12 -n
Routine: zgetrf_ nCalls: 216920 Mean_time 2.516856e-04 nUserCalls: 11667 Mean_user_time: 1.828228e-04 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 66792 Mean_time 4.422276e-04 nUserCalls: 9588 Mean_user_time: 3.222948e-04 Inputs: 100 100 100 1 0
Routine: zgetrf_ nCalls: 126456 Mean_time 5.297068e-04 nUserCalls: 18734 Mean_user_time: 3.932569e-04 Inputs: 110 110 110 1 0
Routine: zgetrf_ nCalls: 201256 Mean_time 6.306487e-04 nUserCalls: 23959 Mean_user_time: 4.647142e-04 Inputs: 120 120 120 1 0
Routine: zgetrf_ nCalls: 168344 Mean_time 7.413436e-04 nUserCalls: 27496 Mean_user_time: 5.610537e-04 Inputs: 130 130 130 1 0
Routine: zgetrf_ nCalls: 147136 Mean_time 8.560422e-04 nUserCalls: 23896 Mean_user_time: 6.514217e-04 Inputs: 140 140 140 1 0
Routine: zgetrf_ nCalls: 123904 Mean_time 1.012120e-03 nUserCalls: 20248 Mean_user_time: 7.814398e-04 Inputs: 150 150 150 1 0
Routine: zgetrf_ nCalls: 73040 Mean_time 1.143292e-03 nUserCalls: 14404 Mean_user_time: 8.928624e-04 Inputs: 160 160 160 1 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.406755e-03 nUserCalls: 10160 Mean_user_time: 1.091698e-03 Inputs: 170 170 170 1 0
Routine: zgetrf_ nCalls: 19096 Mean_time 1.444096e-03 nUserCalls: 5755 Mean_user_time: 1.218681e-03 Inputs: 180 180 180 1 0

DGEMM

Even though DGEMM received substantial performance improvements, it still has some issues:

OMP_NUM_THREADS=32

dgemm_ cnt= 30596724 totTime= 272.4157 called_tot= 30596724 topTime= 272.4157 (%age of runtime: 2.876 )

Example calls:

Routine: dgemm_ nCalls: 14728 Mean_time 1.281673e-05 nUserCalls: 14728 Mean_user_time: 1.281673e-05 Inputs: 36 1 252 36 288 1226658 N N
Routine: dgemm_ nCalls: 9884 Mean_time 9.806415e-06 nUserCalls: 9884 Mean_user_time: 9.806415e-06 Inputs: 264 1 42 42 42 264 T N
Routine: dgemm_ nCalls: 14728 Mean_time 9.071300e-06 nUserCalls: 14728 Mean_user_time: 9.071300e-06 Inputs: 252 1 36 36 36 252 T N
Routine: dgemm_ nCalls: 9884 Mean_time 1.381515e-05 nUserCalls: 9884 Mean_user_time: 1.381515e-05 Inputs: 42 1 264 42 306 1226658 N N
Routine: dgemm_ nCalls: 12238 Mean_time 9.176557e-06 nUserCalls: 12238 Mean_user_time: 9.176557e-06 Inputs: 246 1 36 36 36 246 T N
Routine: dgemm_ nCalls: 17002 Mean_time 1.015412e-05 nUserCalls: 17002 Mean_user_time: 1.015412e-05 Inputs: 240 1 48 48 48 240 T N
Routine: dgemm_ nCalls: 13044 Mean_time 1.026566e-05 nUserCalls: 13044 Mean_user_time: 1.026566e-05 Inputs: 210 1 54 54 54 210 T N
Routine: dgemm_ nCalls: 10980 Mean_time 1.011446e-05 nUserCalls: 10980 Mean_user_time: 1.011446e-05 Inputs: 276 1 48 48 48 276 T N
Routine: dgemm_ nCalls: 24626 Mean_time 9.204542e-06 nUserCalls: 24626 Mean_user_time: 9.204542e-06 Inputs: 204 1 42 42 42 204 T N
Routine: dgemm_ nCalls: 19882 Mean_time 9.293070e-06 nUserCalls: 19882 Mean_user_time: 9.293070e-06 Inputs: 216 1 42 42 42 216 T N

OMP_NUM_THREADS=128

dgemm_ cnt= 30597188 totTime= 350.5334 called_tot= 30597188 topTime= 350.5334 (%age of runtime: 3.982 )

Example calls:

Routine: dgemm_ nCalls: 14728 Mean_time 3.143310e-05 nUserCalls: 14728 Mean_user_time: 3.143310e-05 Inputs: 36 1 252 36 288 1226658 N N
Routine: dgemm_ nCalls: 9884 Mean_time 2.066992e-05 nUserCalls: 9884 Mean_user_time: 2.066992e-05 Inputs: 264 1 42 42 42 264 T N
Routine: dgemm_ nCalls: 14728 Mean_time 1.720480e-05 nUserCalls: 14728 Mean_user_time: 1.720480e-05 Inputs: 252 1 36 36 36 252 T N
Routine: dgemm_ nCalls: 9884 Mean_time 2.001355e-05 nUserCalls: 9884 Mean_user_time: 2.001355e-05 Inputs: 42 1 264 42 306 1226658 N N
Routine: dgemm_ nCalls: 12246 Mean_time 3.380307e-05 nUserCalls: 12246 Mean_user_time: 3.380307e-05 Inputs: 246 1 36 36 36 246 T N
Routine: dgemm_ nCalls: 17012 Mean_time 2.844690e-05 nUserCalls: 17012 Mean_user_time: 2.844690e-05 Inputs: 240 1 48 48 48 240 T N
Routine: dgemm_ nCalls: 13042 Mean_time 2.803772e-05 nUserCalls: 13042 Mean_user_time: 2.803772e-05 Inputs: 210 1 54 54 54 210 T N
Routine: dgemm_ nCalls: 10982 Mean_time 2.659640e-05 nUserCalls: 10982 Mean_user_time: 2.659640e-05 Inputs: 276 1 48 48 48 276 T N
Routine: dgemm_ nCalls: 24612 Mean_time 2.413749e-05 nUserCalls: 24612 Mean_user_time: 2.413749e-05 Inputs: 204 1 42 42 42 204 T N
Routine: dgemm_ nCalls: 19862 Mean_time 2.381463e-05 nUserCalls: 19862 Mean_user_time: 2.381463e-05 Inputs: 216 1 42 42 42 216 T N

It would be nice to have this fixed.
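For anyone who wants to compare two such runs without reading the numbers by eye, here is a minimal Python sketch. It assumes the "Routine: ..." line layout shown in the summaries above; the helper names `parse_summary` and `slowdown` are made up for this illustration and are not part of perf-libs-tools. Only three lines per run are copied in to keep it short.

```python
import re

# A few "Routine:" lines copied from the 32-thread and 128-thread summaries above.
RUN_32 = """\
Routine: zgetrf_ nCalls: 216920 Mean_time 8.647643e-05 nUserCalls: 12077 Mean_user_time: 8.447802e-05 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 66792 Mean_time 2.949315e-04 nUserCalls: 7386 Mean_user_time: 2.930019e-04 Inputs: 100 100 100 1 0
Routine: dgemm_ nCalls: 12238 Mean_time 9.176557e-06 nUserCalls: 12238 Mean_user_time: 9.176557e-06 Inputs: 246 1 36 36 36 246 T N
"""
RUN_128 = """\
Routine: zgetrf_ nCalls: 216920 Mean_time 2.516856e-04 nUserCalls: 11667 Mean_user_time: 1.828228e-04 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 66792 Mean_time 4.422276e-04 nUserCalls: 9588 Mean_user_time: 3.222948e-04 Inputs: 100 100 100 1 0
Routine: dgemm_ nCalls: 12246 Mean_time 3.380307e-05 nUserCalls: 12246 Mean_user_time: 3.380307e-05 Inputs: 246 1 36 36 36 246 T N
"""

# Line layout assumed from the output quoted in this thread.
LINE_RE = re.compile(
    r"Routine:\s+(?P<routine>\S+)\s+nCalls:\s+(?P<ncalls>\d+)\s+"
    r"Mean_time\s+(?P<mean>\S+)\s+nUserCalls:\s+(?P<nuser>\d+)\s+"
    r"Mean_user_time:\s+(?P<mean_user>\S+)\s+Inputs:\s+(?P<inputs>.*)"
)


def parse_summary(text):
    """Map (routine, inputs) -> mean time in seconds for every matching line."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            key = (match["routine"], " ".join(match["inputs"].split()))
            result[key] = float(match["mean"])
    return result


def slowdown(baseline, other):
    """Ratio other/baseline of the mean time for keys present in both runs."""
    return {key: other[key] / baseline[key] for key in baseline if key in other}


if __name__ == "__main__":
    ratios = slowdown(parse_summary(RUN_32), parse_summary(RUN_128))
    for (routine, inputs), ratio in sorted(ratios.items()):
        print(f"{routine:10s} {inputs:30s} x{ratio:.2f}")
```

With the lines copied here, the mean time per call rises by a factor of roughly 2.9 for the 52x52 ZGETRF case and roughly 3.7 for the 246x1x36 DGEMM case when going from 32 to 128 threads.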
Hi,
Thanks for the report, and also for using perf-libs-tools!
We are currently in the process of producing a new release of Arm PL (23.10) which will appear in the next few weeks. Unfortunately, any problem is unlikely to be addressed as part of that release. However, if we can pin down any potential issue then maybe we can help with an explanation and a possible fix in future releases.
You mentioned
> The application uses ArmPL 23.04, LLVM OpenMP and is compiled with GCC 12.
This makes me wonder if you are mixing OpenMP runtimes. If the library you're using was built with GCC it will have a dependency on libgomp (the GNU OpenMP library); if your application is using LLVM OpenMP, then it's possible that you're seeing bad performance from unintended nested parallelism.
Please could you execute one of the Arm PL shared libraries in the distribution you're using at the command line? The shared libraries are executable, and should print out some diagnostic info. It would be useful if you could post that info in reply.
Best Regards,
Chris.
I don't think that there is any OpenMP library mixing. I build my own shared library on top of ArmPL, roughly like this:
gcc -Iinclude -O2 -fPIC -fmath-errno -std=gnu99 -fopenmp -o obj/aarch64/myarmpl.o -c myarmpl.c
gcc -shared -Llib/aarch64 -fPIC -pthread -larmpl_mp -lomp -lastring -lamath -lm -o lib/aarch64/libmyarmpl.so obj/aarch64/myarmpl.o
It segfaults when I execute it. Here is what it depends on:
ldd libmyarmpl.so
linux-vdso.so.1 (0x0000ffff99958000)
libomp.so => not found
libastring.so => not found
libamath.so => not found
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff94620000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff945f0000)
libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffff945d0000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff94420000)
/lib/ld-linux-aarch64.so.1 (0x0000ffff9991f000)
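The "not found" entries mean that the dynamic loader could not resolve those libraries with the search path that was active when ldd ran. A small Python sketch that pulls such entries out of ldd output is shown below; the function name `missing_libraries` is made up for illustration, and the sample text is the output quoted above.

```python
# ldd output copied from the post above.
LDD_OUTPUT = """\
linux-vdso.so.1 (0x0000ffff99958000)
libomp.so => not found
libastring.so => not found
libamath.so => not found
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff94620000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff945f0000)
libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffff945d0000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff94420000)
/lib/ld-linux-aarch64.so.1 (0x0000ffff9991f000)
"""


def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found', in order."""
    missing = []
    for line in ldd_output.splitlines():
        # Lines with a dependency look like "name => target"; others have no "=>".
        name, arrow, target = line.partition("=>")
        if arrow and target.strip() == "not found":
            missing.append(name.strip())
    return missing


if __name__ == "__main__":
    for name in missing_libraries(LDD_OUTPUT):
        print("unresolved:", name)
```

Running ldd with the same LD_LIBRARY_PATH that the application uses at runtime would show where these three libraries are actually picked up from.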
I have verified with lsof -p <PID> | grep omp that libgomp.so is not loaded at runtime; only libomp.so is.
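The same check can be scripted on Linux by reading /proc/<PID>/maps instead of calling lsof. The following is a minimal sketch under that assumption; the function names are made up for illustration, the sample mapping line is invented (not taken from the real process), and the list of runtime names covers only the GNU (libgomp), LLVM (libomp) and Intel (libiomp5) runtimes.

```python
import os
import re

# Basenames of common OpenMP runtimes: GNU, LLVM and Intel.
OPENMP_RE = re.compile(r"(libgomp|libomp|libiomp5)\.so(\.\d+)*$")


def openmp_runtimes_in_maps(maps_text):
    """Return the set of OpenMP runtime names found in /proc/<pid>/maps text."""
    found = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # anonymous mapping, no pathname
        basename = os.path.basename(parts[5].strip())
        match = OPENMP_RE.match(basename)
        if match:
            found.add(match.group(1))
    return found


def loaded_openmp_runtimes(pid):
    """Read /proc/<pid>/maps (Linux only) and report the OpenMP runtimes mapped."""
    with open(f"/proc/{pid}/maps") as handle:
        return openmp_runtimes_in_maps(handle.read())


if __name__ == "__main__":
    # Invented example line, only to show the format being parsed.
    sample = "ffff94000000-ffff940c0000 r-xp 00000000 103:02 1234 /usr/lib/llvm-14/lib/libomp.so.5\n"
    print(sorted(openmp_runtimes_in_maps(sample)))
    maps_path = f"/proc/{os.getpid()}/maps"
    if os.path.exists(maps_path):
        print(sorted(loaded_openmp_runtimes(os.getpid())))
```

If the returned set ever contains more than one name, two OpenMP runtimes are mapped into the same process, which is the situation Chris was asking about.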
Thanks for your reply. Please could you execute the original Arm PL library you're linking to, libarmpl_mp.so, just so that we can see the information it produces.
Yes, this is what it reports (this is on the build machine, which has a different CPU and OS):
$ LD_LIBRARY_PATH=/opt/arm/armpl-23.04.0_RHEL-8_gcc/lib /opt/arm/armpl-23.04.0_RHEL-8_gcc/lib/libarmpl_mp.so
Arm Performance Libraries
Version 23.04.0
Built from: 520bc09dc
Target Generic AArch64 (lp64+openmp)
Runtime target Generic AArch64
Available targets:
ThunderX2
Neoverse N1
Generic AArch64
A64FX
Neoverse V1
Generic SVE
Compiled by gcc (GCC) 12.2.0
This build contains both NEON and SVE routine types.
Runtime machine details (parsed from getauxval(AT_HWCAP)):
Implementer: 0x50 (P)
Part number: 0x0
Part variant: 0x3
Part revision: 0x2
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
If you require support or would like to provide feedback, please contact support-hpc-sw@arm.com
I forgot to mention, if I build the application with Arm Compiler for Linux 23.04 instead (and use the libarmpl_mp.a which comes with it), I get similar scaling problems. Here is an example workload:
Thanks for this, we'll try to investigate and see what's going on.
Hello. Just to confirm, we've observed similar scaling issues in going from 32 to 128 cores for these small ZGETRF and DGEMM problems when working from cold caches. We'll be working on addressing these. FYI, we've just released version 23.10, but that version doesn't contain any tunings that will address these issues.
Thanks! I've started evaluating version 23.10. It looks good so far (no new issues observed, some old issues fixed).
I came across another scaling issue. It affects both macOS and Linux. It is less noticeable on macOS because Apple silicon has fewer cores, but on AWS Graviton3E (64 cores) and especially on Ampere Altra Max (128 cores) the performance hit is massive. Every other BLAS/LAPACK implementation that I tested (vecLib, Netlib, OpenBLAS) outperformed ArmPL for the workload in question on both macOS and Linux. The problem seems to be that DGELS scales poorly. The software calls it in both multi-thread and multi-instance (several threads calling it at the same time) mode. Below are some example inputs:

N 117 20 1 117 117 580 0
N 117 20 1 117 117 -1 0
N 117 10 1 189 189 580 0
N 117 10 1 189 189 -1 0
N 153 20 1 153 153 580 0
N 153 20 1 153 153 -1 0
N 189 20 1 189 189 580 0
N 189 20 1 189 189 -1 0
N 99 20 1 117 117 580 0
N 99 20 1 117 117 -1 0
N 63 10 1 189 189 580 0
N 63 10 1 189 189 -1 0

I am not sure if this is a scaling issue only or if the performance of DGELS in single-thread mode can be optimized too. Maybe the ArmPL team can have a look?
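The input lines above come in pairs that differ only in the seventh field. In LAPACK, LWORK = -1 requests a workspace query, so each pair looks like one query followed by one actual solve. The Python sketch below separates the two kinds of call. It assumes the fields follow the scalar DGELS arguments in order (TRANS, M, N, NRHS, LDA, LDB, LWORK, INFO); that mapping is inferred from the DGELS signature rather than stated in the thread, and the function names are made up for illustration.

```python
from collections import Counter

# A few DGELS input lines copied from the post above.
DGELS_INPUTS = """\
N 117 20 1 117 117 580 0
N 117 20 1 117 117 -1 0
N 117 10 1 189 189 580 0
N 117 10 1 189 189 -1 0
"""

# Assumed field order: scalar DGELS arguments as they appear in the signature.
FIELDS = ("trans", "m", "n", "nrhs", "lda", "ldb", "lwork", "info")


def parse_dgels_line(line):
    """Turn one whitespace-separated input line into a dict of named fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {line!r}")
    record = {"trans": parts[0]}
    for name, value in zip(FIELDS[1:], parts[1:]):
        record[name] = int(value)
    return record


def classify_calls(text):
    """Count workspace queries (LWORK == -1) and solves per problem shape."""
    counts = Counter()
    for line in text.splitlines():
        if not line.strip():
            continue
        rec = parse_dgels_line(line)
        kind = "workspace_query" if rec["lwork"] == -1 else "solve"
        shape = (rec["trans"], rec["m"], rec["n"], rec["nrhs"], rec["lda"], rec["ldb"])
        counts[(shape, kind)] += 1
    return counts


if __name__ == "__main__":
    for (shape, kind), count in sorted(classify_calls(DGELS_INPUTS).items()):
        print(shape, kind, count)
```

Separating the two kinds of call makes it easier to see whether the time is spent in the factorization itself or whether the per-call overhead of the queries also matters at these sizes.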
Thanks, we'll take a look at this too. It's probably a similar underlying issue to the one we've been addressing for GETRF, which affects these small problems running with large numbers of cores. This time the problem may actually be in GEQRF as called by GELS.
Yes, could be. Our workloads which make heavy use of DGELS perform better with OpenBLAS. We see performance issues with DGBTRS too and possibly with DGBTRF. Are they all related to GEQRF?
Some example DGBTRS inputs:
Inputs: 10 1 1 1 4 1 10 0 N
Inputs: 106 27 27 1 82 1 106 0 N
Inputs: 11 1 1 1 4 1 11 0 N
Inputs: 11 2 2 1 7 1 11 0 N
Inputs: 112 31 31 1 94 1 112 0 N
Inputs: 11 3 3 1 10 1 11 0 N
Inputs: 120 31 31 1 94 1 120 0 N
Inputs: 12 2 2 1 7 1 12 0 N
Inputs: 12 3 3 1 10 1 12 0 N
Inputs: 13 2 2 1 7 1 13 0 N
Inputs: 13 3 3 1 10 1 13 0 N
Inputs: 14 2 2 1 7 1 14 0 N
Inputs: 14 3 3 1 10 1 14 0 N
Inputs: 15 2 2 1 7 1 15 0 N
Inputs: 15 3 3 1 10 1 15 0 N
Inputs: 16 3 3 1 10 1 16 0 N
Inputs: 17 3 3 1 10 1 17 0 N
Inputs: 18 3 3 1 10 1 18 0 N
Inputs: 19 3 3 1 10 1 19 0 N
Inputs: 20 2 2 1 7 1 20 0 N
Inputs: 20 3 3 1 10 1 20 0 N
Inputs: 20 4 4 1 13 1 20 0 N
Inputs: 20 5 5 1 16 1 20 0 N
Inputs: 20 6 6 1 19 1 20 0 N
Inputs: 5 1 1 1 4 1 5 0 N
Inputs: 6 1 1 1 4 1 6 0 N
Inputs: 7 1 1 1 4 1 7 0 N
Inputs: 8 1 1 1 4 1 8 0 N
Inputs: 9 1 1 1 4 1 9 0 N
Inputs: 9 2 2 1 7 1 9 0 N
Inputs: 98 27 27 1 82 1 98 0 N
Some example DGELS inputs:
Inputs: 16 10 2 16 16 -1 0 N
Inputs: 16 10 2 16 16 570 0 N
Inputs: 16 10 2 19 19 -1 0 N
Inputs: 16 10 2 19 19 570 0 N
Inputs: 16 10 2 22 22 -1 0 N
Inputs: 16 10 2 22 22 570 0 N
Inputs: 16 10 2 26 26 -1 0 N
Inputs: 16 10 2 26 26 570 0 N
Inputs: 16 10 2 29 29 -1 0 N
Inputs: 16 10 2 29 29 570 0 N
Inputs: 16 10 2 32 32 -1 0 N
Inputs: 16 10 2 32 32 570 0 N
Inputs: 16 10 2 36 36 -1 0 N
Inputs: 16 10 2 36 36 570 0 N
Inputs: 19 10 2 19 19 -1 0 N
Inputs: 19 10 2 19 19 570 0 N
Inputs: 19 10 2 22 22 -1 0 N
Inputs: 19 10 2 22 22 570 0 N
Inputs: 19 10 2 26 26 -1 0 N
Inputs: 19 10 2 26 26 570 0 N
Inputs: 19 10 2 29 29 -1 0 N
Inputs: 19 10 2 29 29 570 0 N
Inputs: 19 10 2 32 32 -1 0 N
Inputs: 19 10 2 32 32 570 0 N
Inputs: 19 10 2 36 36 -1 0 N
Inputs: 19 10 2 36 36 570 0 N
Inputs: 22 10 2 22 22 -1 0 N
Inputs: 22 10 2 22 22 570 0 N
Inputs: 22 10 2 26 26 -1 0 N
Inputs: 22 10 2 26 26 570 0 N
Inputs: 22 10 2 29 29 -1 0 N
Inputs: 22 10 2 29 29 570 0 N
Inputs: 22 10 2 32 32 -1 0 N
Inputs: 22 10 2 32 32 570 0 N
Inputs: 22 10 2 36 36 -1 0 N
Inputs: 22 10 2 36 36 570 0 N
Inputs: 26 10 2 26 26 -1 0 N
Inputs: 26 10 2 26 26 570 0 N
Inputs: 26 10 2 29 29 -1 0 N
Inputs: 26 10 2 29 29 570 0 N
Inputs: 26 10 2 32 32 -1 0 N
Inputs: 26 10 2 32 32 570 0 N
Inputs: 26 10 2 36 36 -1 0 N
Inputs: 26 10 2 36 36 570 0 N
Inputs: 29 10 2 29 29 -1 0 N
Inputs: 29 10 2 29 29 570 0 N
Inputs: 29 10 2 32 32 -1 0 N
Inputs: 29 10 2 32 32 570 0 N
Inputs: 29 10 2 36 36 -1 0 N
Inputs: 29 10 2 36 36 570 0 N
Inputs: 32 10 2 32 32 -1 0 N
Inputs: 32 10 2 32 32 570 0 N
Inputs: 32 10 2 36 36 -1 0 N
Inputs: 32 10 2 36 36 570 0 N
Inputs: 36 10 2 36 36 -1 0 N
Inputs: 36 10 2 36 36 570 0 N
DGBTRS and DGBTRF are not related to GEQRF, but the issue here is the same: your problems are small, and our implementation is geared towards large parallel problems. We have a few LAPACK functions which are implemented in a similar way.
In order to understand the best way to deal with these issues, would you mind getting in touch via support-hpc-sw@arm.com? We have a few options, but it would be better to find out which is the most appropriate for your use cases.
Regards, Chris.