Some time ago I reported a performance issue with ArmPL related to CPUs with a large number of cores. See https://community.arm.com/support-forums/f/high-performance-computing-forum/53959/negative-armpl-mt-speed-up-on-many-core-systems It was fixed in version 23.04. As a result, my application gained a lot of performance. Recently, I did some more scaling performance testing and discovered more issues.

The testing was done on a 128-core Ampere Altra CPU running Ubuntu 22.04. The application uses ArmPL 23.04 and LLVM OpenMP, and is compiled with GCC 12. The application uses ArmPL in two ways depending on the algorithm: (1) a single application thread calls ArmPL; (2) multiple application threads call ArmPL at the same time. I did the performance profiling with perf-lib-tools.

ZGETRF

The problem with ZGETRF occurs when multiple application threads call it at the same time. Typical inputs are M=N with sizes varying between 10 and 300. Maybe there is some locking issue? Summary from perf-lib-tools:

OMP_NUM_THREADS=32
zgetrf_ cnt= 2434696 totTime= 1504.5387 called_tot= 171712 topTime= 93.8037 (%age of runtime: 6.428 )

Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_915245.apl | sort -k8 -nr | head | sort -k12 -n
Routine: zgetrf_ nCalls: 216920 Mean_time 8.647643e-05 nUserCalls: 12077 Mean_user_time: 8.447802e-05 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 66792 Mean_time 2.949315e-04 nUserCalls: 7386 Mean_user_time: 2.930019e-04 Inputs: 100 100 100 1 0
Routine: zgetrf_ nCalls: 126456 Mean_time 3.720812e-04 nUserCalls: 13398 Mean_user_time: 3.665895e-04 Inputs: 110 110 110 1 0
Routine: zgetrf_ nCalls: 201256 Mean_time 4.436309e-04 nUserCalls: 16952 Mean_user_time: 4.371607e-04 Inputs: 120 120 120 1 0
Routine: zgetrf_ nCalls: 168344 Mean_time 5.473583e-04 nUserCalls: 14346 Mean_user_time: 5.396482e-04 Inputs: 130 130 130 1 0
Routine: zgetrf_ nCalls: 147136 Mean_time 6.433120e-04 nUserCalls: 11615 Mean_user_time: 6.342174e-04 Inputs: 140 140 140 1 0
Routine: zgetrf_ nCalls: 211288 Mean_time 6.791234e-04 nUserCalls: 9395 Mean_user_time: 6.745791e-04 Inputs: 142 142 142 1 0
Routine: zgetrf_ nCalls: 123904 Mean_time 7.762974e-04 nUserCalls: 8320 Mean_user_time: 7.669183e-04 Inputs: 150 150 150 1 0
Routine: zgetrf_ nCalls: 73040 Mean_time 8.895631e-04 nUserCalls: 5874 Mean_user_time: 8.788582e-04 Inputs: 160 160 160 1 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.083270e-03 nUserCalls: 9152 Mean_user_time: 1.073041e-03 Inputs: 170 170 170 1 0

OMP_NUM_THREADS=128
zgetrf_ cnt= 2434696 totTime= 2046.2830 called_tot= 241094 topTime= 151.9661 (%age of runtime: 9.300 )

Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_912580.apl | sort -k8 -nr | head | sort -k12 -n
Routine: zgetrf_ nCalls: 216920 Mean_time 2.516856e-04 nUserCalls: 11667 Mean_user_time: 1.828228e-04 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 66792 Mean_time 4.422276e-04 nUserCalls: 9588 Mean_user_time: 3.222948e-04 Inputs: 100 100 100 1 0
Routine: zgetrf_ nCalls: 126456 Mean_time 5.297068e-04 nUserCalls: 18734 Mean_user_time: 3.932569e-04 Inputs: 110 110 110 1 0
Routine: zgetrf_ nCalls: 201256 Mean_time 6.306487e-04 nUserCalls: 23959 Mean_user_time: 4.647142e-04 Inputs: 120 120 120 1 0
Routine: zgetrf_ nCalls: 168344 Mean_time 7.413436e-04 nUserCalls: 27496 Mean_user_time: 5.610537e-04 Inputs: 130 130 130 1 0
Routine: zgetrf_ nCalls: 147136 Mean_time 8.560422e-04 nUserCalls: 23896 Mean_user_time: 6.514217e-04 Inputs: 140 140 140 1 0
Routine: zgetrf_ nCalls: 123904 Mean_time 1.012120e-03 nUserCalls: 20248 Mean_user_time: 7.814398e-04 Inputs: 150 150 150 1 0
Routine: zgetrf_ nCalls: 73040 Mean_time 1.143292e-03 nUserCalls: 14404 Mean_user_time: 8.928624e-04 Inputs: 160 160 160 1 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.406755e-03 nUserCalls: 10160 Mean_user_time: 1.091698e-03 Inputs: 170 170 170 1 0
Routine: zgetrf_ nCalls: 19096 Mean_time 1.444096e-03 nUserCalls: 5755 Mean_user_time: 1.218681e-03 Inputs: 180 180 180 1 0

DGEMM

Even though DGEMM received substantial performance improvements, it still has some issues:

OMP_NUM_THREADS=32
dgemm_ cnt= 30596724 totTime= 272.4157 called_tot= 30596724 topTime= 272.4157 (%age of runtime: 2.876 )

Example calls:
Routine: dgemm_ nCalls: 14728 Mean_time 1.281673e-05 nUserCalls: 14728 Mean_user_time: 1.281673e-05 Inputs: 36 1 252 36 288 1226658 N N
Routine: dgemm_ nCalls: 9884 Mean_time 9.806415e-06 nUserCalls: 9884 Mean_user_time: 9.806415e-06 Inputs: 264 1 42 42 42 264 T N
Routine: dgemm_ nCalls: 14728 Mean_time 9.071300e-06 nUserCalls: 14728 Mean_user_time: 9.071300e-06 Inputs: 252 1 36 36 36 252 T N
Routine: dgemm_ nCalls: 9884 Mean_time 1.381515e-05 nUserCalls: 9884 Mean_user_time: 1.381515e-05 Inputs: 42 1 264 42 306 1226658 N N
Routine: dgemm_ nCalls: 12238 Mean_time 9.176557e-06 nUserCalls: 12238 Mean_user_time: 9.176557e-06 Inputs: 246 1 36 36 36 246 T N
Routine: dgemm_ nCalls: 17002 Mean_time 1.015412e-05 nUserCalls: 17002 Mean_user_time: 1.015412e-05 Inputs: 240 1 48 48 48 240 T N
Routine: dgemm_ nCalls: 13044 Mean_time 1.026566e-05 nUserCalls: 13044 Mean_user_time: 1.026566e-05 Inputs: 210 1 54 54 54 210 T N
Routine: dgemm_ nCalls: 10980 Mean_time 1.011446e-05 nUserCalls: 10980 Mean_user_time: 1.011446e-05 Inputs: 276 1 48 48 48 276 T N
Routine: dgemm_ nCalls: 24626 Mean_time 9.204542e-06 nUserCalls: 24626 Mean_user_time: 9.204542e-06 Inputs: 204 1 42 42 42 204 T N
Routine: dgemm_ nCalls: 19882 Mean_time 9.293070e-06 nUserCalls: 19882 Mean_user_time: 9.293070e-06 Inputs: 216 1 42 42 42 216 T N

OMP_NUM_THREADS=128
dgemm_ cnt= 30597188 totTime= 350.5334 called_tot= 30597188 topTime= 350.5334 (%age of runtime: 3.982 )

Example calls:
Routine: dgemm_ nCalls: 14728 Mean_time 3.143310e-05 nUserCalls: 14728 Mean_user_time: 3.143310e-05 Inputs: 36 1 252 36 288 1226658 N N
Routine: dgemm_ nCalls: 9884 Mean_time 2.066992e-05 nUserCalls: 9884 Mean_user_time: 2.066992e-05 Inputs: 264 1 42 42 42 264 T N
Routine: dgemm_ nCalls: 14728 Mean_time 1.720480e-05 nUserCalls: 14728 Mean_user_time: 1.720480e-05 Inputs: 252 1 36 36 36 252 T N
Routine: dgemm_ nCalls: 9884 Mean_time 2.001355e-05 nUserCalls: 9884 Mean_user_time: 2.001355e-05 Inputs: 42 1 264 42 306 1226658 N N
Routine: dgemm_ nCalls: 12246 Mean_time 3.380307e-05 nUserCalls: 12246 Mean_user_time: 3.380307e-05 Inputs: 246 1 36 36 36 246 T N
Routine: dgemm_ nCalls: 17012 Mean_time 2.844690e-05 nUserCalls: 17012 Mean_user_time: 2.844690e-05 Inputs: 240 1 48 48 48 240 T N
Routine: dgemm_ nCalls: 13042 Mean_time 2.803772e-05 nUserCalls: 13042 Mean_user_time: 2.803772e-05 Inputs: 210 1 54 54 54 210 T N
Routine: dgemm_ nCalls: 10982 Mean_time 2.659640e-05 nUserCalls: 10982 Mean_user_time: 2.659640e-05 Inputs: 276 1 48 48 48 276 T N
Routine: dgemm_ nCalls: 24612 Mean_time 2.413749e-05 nUserCalls: 24612 Mean_user_time: 2.413749e-05 Inputs: 204 1 42 42 42 204 T N
Routine: dgemm_ nCalls: 19862 Mean_time 2.381463e-05 nUserCalls: 19862 Mean_user_time: 2.381463e-05 Inputs: 216 1 42 42 42 216 T N

It would be nice to have this fixed.
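For anyone who wants to compare the two runs shape by shape, here is a minimal sketch that parses the "Routine:" lines in the format shown above and computes the ratio of the mean time at 128 threads to the mean time at 32 threads. The helper names `parse_line` and `slowdowns` are made up for this illustration and are not part of perf-lib-tools; the sample lines are copied from the summaries above.

```python
import re

# Matches one "Routine:" line in the format shown in the summaries above.
LINE_RE = re.compile(
    r"Routine:\s*(?P<name>\S+)\s+nCalls:\s*(?P<ncalls>\d+)\s+"
    r"Mean_time\s+(?P<mean>\S+)\s+nUserCalls:\s*(?P<nuser>\d+)\s+"
    r"Mean_user_time:\s*(?P<mean_user>\S+)\s+Inputs:\s*(?P<inputs>.*)"
)


def parse_line(line):
    """Parse one summary line; return None if it is not a 'Routine:' line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "name": m["name"],
        "inputs": tuple(m["inputs"].split()),
        "mean_time": float(m["mean"]),
        "mean_user_time": float(m["mean_user"]),
    }


def slowdowns(lines_base, lines_other):
    """Return {(routine, inputs): mean_time_other / mean_time_base}
    for every (routine, inputs) pair that appears in both runs."""
    base = {}
    for line in lines_base:
        rec = parse_line(line)
        if rec is not None:
            base[(rec["name"], rec["inputs"])] = rec["mean_time"]
    ratios = {}
    for line in lines_other:
        rec = parse_line(line)
        if rec is None:
            continue
        key = (rec["name"], rec["inputs"])
        if key in base:
            ratios[key] = rec["mean_time"] / base[key]
    return ratios


# Sample lines copied from the OMP_NUM_THREADS=32 summaries above.
run_32 = [
    "Routine: zgetrf_ nCalls: 216920 Mean_time 8.647643e-05 nUserCalls: 12077 Mean_user_time: 8.447802e-05 Inputs: 52 52 52 2 0",
    "Routine: dgemm_ nCalls: 12238 Mean_time 9.176557e-06 nUserCalls: 12238 Mean_user_time: 9.176557e-06 Inputs: 246 1 36 36 36 246 T N",
]
# The same shapes from the OMP_NUM_THREADS=128 summaries above.
run_128 = [
    "Routine: zgetrf_ nCalls: 216920 Mean_time 2.516856e-04 nUserCalls: 11667 Mean_user_time: 1.828228e-04 Inputs: 52 52 52 2 0",
    "Routine: dgemm_ nCalls: 12246 Mean_time 3.380307e-05 nUserCalls: 12246 Mean_user_time: 3.380307e-05 Inputs: 246 1 36 36 36 246 T N",
]

ratios = slowdowns(run_32, run_128)
for (name, inputs), ratio in sorted(ratios.items()):
    print(f"{name} {' '.join(inputs)}: {ratio:.2f}x slower at 128 threads")
```

For these two shapes the mean time per call roughly triples when going from 32 to 128 threads, even though the inputs are identical. In practice the lists would be read from the two .apl summary files rather than pasted in.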
Thanks for your reply. Please could you execute the original Arm PL library you're linking to, libarmpl_mp.so, just so that we can see the information it produces.
Yes, this is what it reports (this is on the build machine, which has a different CPU and OS):
$ LD_LIBRARY_PATH=/opt/arm/armpl-23.04.0_RHEL-8_gcc/lib /opt/arm/armpl-23.04.0_RHEL-8_gcc/lib/libarmpl_mp.so
Arm Performance Libraries
Version 23.04.0
Built from: 520bc09dc
Target Generic AArch64 (lp64+openmp)
Runtime target Generic AArch64
Available targets:
ThunderX2
Neoverse N1
Generic AArch64
A64FX
Neoverse V1
Generic SVE
Compiled by gcc (GCC) 12.2.0
This build contains both NEON and SVE routine types.
Runtime machine details (parsed from getauxval(AT_HWCAP)):
Implementer: 0x50 (P)
Part number: 0x0
Part variant: 0x3
Part revision: 0x2
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
If you require support or would like to provide feedback, please contact support-hpc-sw@arm.com
I forgot to mention, if I build the application with Arm Compiler for Linux 23.04 instead (and use the libarmpl_mp.a which comes with it), I get similar scaling problems. Here is an example workload:
Thanks for this, we'll try to investigate and see what's going on.