Some time ago I reported a performance issue with ArmPL on CPUs with a large number of cores. See https://community.arm.com/support-forums/f/high-performance-computing-forum/53959/negative-armpl-mt-speed-up-on-many-core-systems

It was fixed in version 23.04, and as a result my application gained a lot of performance. Recently I did some more scaling tests and discovered more issues.

The testing was done on a 128-core Ampere Altra CPU running Ubuntu 22.04. The application uses ArmPL 23.04 and LLVM OpenMP, and is compiled with GCC 12. It uses ArmPL in two ways, depending on the algorithm: (1) a single application thread calls ArmPL; (2) multiple application threads call ArmPL at the same time. I did the performance profiling with perf-lib-tools.

ZGETRF

The problem with ZGETRF occurs when multiple application threads call it at the same time. Typical inputs are M=N, with sizes varying between 10 and 300. Maybe there is some locking issue? Summary from perf-lib-tools:

OMP_NUM_THREADS=32
zgetrf_ cnt= 2434696 totTime= 1504.5387 called_tot= 171712 topTime= 93.8037 (%age of runtime: 6.428 )
Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_915245.apl | sort -k8 -nr | head | sort -k12 -n
Routine: zgetrf_ nCalls: 216920 Mean_time 8.647643e-05 nUserCalls: 12077 Mean_user_time: 8.447802e-05 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 66792 Mean_time 2.949315e-04 nUserCalls: 7386 Mean_user_time: 2.930019e-04 Inputs: 100 100 100 1 0
Routine: zgetrf_ nCalls: 126456 Mean_time 3.720812e-04 nUserCalls: 13398 Mean_user_time: 3.665895e-04 Inputs: 110 110 110 1 0
Routine: zgetrf_ nCalls: 201256 Mean_time 4.436309e-04 nUserCalls: 16952 Mean_user_time: 4.371607e-04 Inputs: 120 120 120 1 0
Routine: zgetrf_ nCalls: 168344 Mean_time 5.473583e-04 nUserCalls: 14346 Mean_user_time: 5.396482e-04 Inputs: 130 130 130 1 0
Routine: zgetrf_ nCalls: 147136 Mean_time 6.433120e-04 nUserCalls: 11615 Mean_user_time: 6.342174e-04 Inputs: 140 140 140 1 0
Routine: zgetrf_ nCalls: 211288 Mean_time 6.791234e-04 nUserCalls: 9395 Mean_user_time: 6.745791e-04 Inputs: 142 142 142 1 0
Routine: zgetrf_ nCalls: 123904 Mean_time 7.762974e-04 nUserCalls: 8320 Mean_user_time: 7.669183e-04 Inputs: 150 150 150 1 0
Routine: zgetrf_ nCalls: 73040 Mean_time 8.895631e-04 nUserCalls: 5874 Mean_user_time: 8.788582e-04 Inputs: 160 160 160 1 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.083270e-03 nUserCalls: 9152 Mean_user_time: 1.073041e-03 Inputs: 170 170 170 1 0

OMP_NUM_THREADS=128
zgetrf_ cnt= 2434696 totTime= 2046.2830 called_tot= 241094 topTime= 151.9661 (%age of runtime: 9.300 )
Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_912580.apl | sort -k8 -nr | head | sort -k12 -n
Routine: zgetrf_ nCalls: 216920 Mean_time 2.516856e-04 nUserCalls: 11667 Mean_user_time: 1.828228e-04 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 66792 Mean_time 4.422276e-04 nUserCalls: 9588 Mean_user_time: 3.222948e-04 Inputs: 100 100 100 1 0
Routine: zgetrf_ nCalls: 126456 Mean_time 5.297068e-04 nUserCalls: 18734 Mean_user_time: 3.932569e-04 Inputs: 110 110 110 1 0
Routine: zgetrf_ nCalls: 201256 Mean_time 6.306487e-04 nUserCalls: 23959 Mean_user_time: 4.647142e-04 Inputs: 120 120 120 1 0
Routine: zgetrf_ nCalls: 168344 Mean_time 7.413436e-04 nUserCalls: 27496 Mean_user_time: 5.610537e-04 Inputs: 130 130 130 1 0
Routine: zgetrf_ nCalls: 147136 Mean_time 8.560422e-04 nUserCalls: 23896 Mean_user_time: 6.514217e-04 Inputs: 140 140 140 1 0
Routine: zgetrf_ nCalls: 123904 Mean_time 1.012120e-03 nUserCalls: 20248 Mean_user_time: 7.814398e-04 Inputs: 150 150 150 1 0
Routine: zgetrf_ nCalls: 73040 Mean_time 1.143292e-03 nUserCalls: 14404 Mean_user_time: 8.928624e-04 Inputs: 160 160 160 1 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.406755e-03 nUserCalls: 10160 Mean_user_time: 1.091698e-03 Inputs: 170 170 170 1 0
Routine: zgetrf_ nCalls: 19096 Mean_time 1.444096e-03 nUserCalls: 5755 Mean_user_time: 1.218681e-03 Inputs: 180 180 180 1 0

DGEMM

Even though DGEMM received substantial performance improvements, it still has some issues:

OMP_NUM_THREADS=32
dgemm_ cnt= 30596724 totTime= 272.4157 called_tot= 30596724 topTime= 272.4157 (%age of runtime: 2.876 )
Example calls:
Routine: dgemm_ nCalls: 14728 Mean_time 1.281673e-05 nUserCalls: 14728 Mean_user_time: 1.281673e-05 Inputs: 36 1 252 36 288 1226658 N N
Routine: dgemm_ nCalls: 9884 Mean_time 9.806415e-06 nUserCalls: 9884 Mean_user_time: 9.806415e-06 Inputs: 264 1 42 42 42 264 T N
Routine: dgemm_ nCalls: 14728 Mean_time 9.071300e-06 nUserCalls: 14728 Mean_user_time: 9.071300e-06 Inputs: 252 1 36 36 36 252 T N
Routine: dgemm_ nCalls: 9884 Mean_time 1.381515e-05 nUserCalls: 9884 Mean_user_time: 1.381515e-05 Inputs: 42 1 264 42 306 1226658 N N
Routine: dgemm_ nCalls: 12238 Mean_time 9.176557e-06 nUserCalls: 12238 Mean_user_time: 9.176557e-06 Inputs: 246 1 36 36 36 246 T N
Routine: dgemm_ nCalls: 17002 Mean_time 1.015412e-05 nUserCalls: 17002 Mean_user_time: 1.015412e-05 Inputs: 240 1 48 48 48 240 T N
Routine: dgemm_ nCalls: 13044 Mean_time 1.026566e-05 nUserCalls: 13044 Mean_user_time: 1.026566e-05 Inputs: 210 1 54 54 54 210 T N
Routine: dgemm_ nCalls: 10980 Mean_time 1.011446e-05 nUserCalls: 10980 Mean_user_time: 1.011446e-05 Inputs: 276 1 48 48 48 276 T N
Routine: dgemm_ nCalls: 24626 Mean_time 9.204542e-06 nUserCalls: 24626 Mean_user_time: 9.204542e-06 Inputs: 204 1 42 42 42 204 T N
Routine: dgemm_ nCalls: 19882 Mean_time 9.293070e-06 nUserCalls: 19882 Mean_user_time: 9.293070e-06 Inputs: 216 1 42 42 42 216 T N

OMP_NUM_THREADS=128
dgemm_ cnt= 30597188 totTime= 350.5334 called_tot= 30597188 topTime= 350.5334 (%age of runtime: 3.982 )
Example calls:
Routine: dgemm_ nCalls: 14728 Mean_time 3.143310e-05 nUserCalls: 14728 Mean_user_time: 3.143310e-05 Inputs: 36 1 252 36 288 1226658 N N
Routine: dgemm_ nCalls: 9884 Mean_time 2.066992e-05 nUserCalls: 9884 Mean_user_time: 2.066992e-05 Inputs: 264 1 42 42 42 264 T N
Routine: dgemm_ nCalls: 14728 Mean_time 1.720480e-05 nUserCalls: 14728 Mean_user_time: 1.720480e-05 Inputs: 252 1 36 36 36 252 T N
Routine: dgemm_ nCalls: 9884 Mean_time 2.001355e-05 nUserCalls: 9884 Mean_user_time: 2.001355e-05 Inputs: 42 1 264 42 306 1226658 N N
Routine: dgemm_ nCalls: 12246 Mean_time 3.380307e-05 nUserCalls: 12246 Mean_user_time: 3.380307e-05 Inputs: 246 1 36 36 36 246 T N
Routine: dgemm_ nCalls: 17012 Mean_time 2.844690e-05 nUserCalls: 17012 Mean_user_time: 2.844690e-05 Inputs: 240 1 48 48 48 240 T N
Routine: dgemm_ nCalls: 13042 Mean_time 2.803772e-05 nUserCalls: 13042 Mean_user_time: 2.803772e-05 Inputs: 210 1 54 54 54 210 T N
Routine: dgemm_ nCalls: 10982 Mean_time 2.659640e-05 nUserCalls: 10982 Mean_user_time: 2.659640e-05 Inputs: 276 1 48 48 48 276 T N
Routine: dgemm_ nCalls: 24612 Mean_time 2.413749e-05 nUserCalls: 24612 Mean_user_time: 2.413749e-05 Inputs: 204 1 42 42 42 204 T N
Routine: dgemm_ nCalls: 19862 Mean_time 2.381463e-05 nUserCalls: 19862 Mean_user_time: 2.381463e-05 Inputs: 216 1 42 42 42 216 T N

It would be nice to have this fixed.
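To make the 32-thread versus 128-thread comparison easier to read, here is a minimal Python sketch that parses "Routine:" lines in the format shown above and prints the ratio of the Mean_time values for calls with identical inputs. The names LOG_32, LOG_128, parse and slowdowns are hypothetical helpers written for this illustration, not part of perf-lib-tools, and only a few of the lines from the summaries are embedded as sample data.

```python
import re

# A few "Routine:" lines copied from the two summaries in this post.
LOG_32 = """\
Routine: zgetrf_ nCalls: 216920 Mean_time 8.647643e-05 nUserCalls: 12077 Mean_user_time: 8.447802e-05 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 201256 Mean_time 4.436309e-04 nUserCalls: 16952 Mean_user_time: 4.371607e-04 Inputs: 120 120 120 1 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.083270e-03 nUserCalls: 9152 Mean_user_time: 1.073041e-03 Inputs: 170 170 170 1 0
Routine: dgemm_ nCalls: 12238 Mean_time 9.176557e-06 nUserCalls: 12238 Mean_user_time: 9.176557e-06 Inputs: 246 1 36 36 36 246 T N
"""

LOG_128 = """\
Routine: zgetrf_ nCalls: 216920 Mean_time 2.516856e-04 nUserCalls: 11667 Mean_user_time: 1.828228e-04 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 201256 Mean_time 6.306487e-04 nUserCalls: 23959 Mean_user_time: 4.647142e-04 Inputs: 120 120 120 1 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.406755e-03 nUserCalls: 10160 Mean_user_time: 1.091698e-03 Inputs: 170 170 170 1 0
Routine: dgemm_ nCalls: 12246 Mean_time 3.380307e-05 nUserCalls: 12246 Mean_user_time: 3.380307e-05 Inputs: 246 1 36 36 36 246 T N
"""

# Note: "Mean_time" has no colon in the log, while the other fields do.
LINE_RE = re.compile(
    r"Routine:\s+(\S+)\s+nCalls:\s+(\d+)\s+Mean_time\s+(\S+)\s+"
    r"nUserCalls:\s+(\d+)\s+Mean_user_time:\s+(\S+)\s+Inputs:\s+(.*)"
)


def parse(text):
    """Map (routine, inputs) -> Mean_time in seconds for each matching line."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            key = (match.group(1), match.group(6).strip())
            result[key] = float(match.group(3))
    return result


def slowdowns(log_a, log_b):
    """Ratio Mean_time(log_b) / Mean_time(log_a) for calls present in both logs."""
    a, b = parse(log_a), parse(log_b)
    return {key: b[key] / a[key] for key in a if key in b}


if __name__ == "__main__":
    for (routine, inputs), ratio in sorted(slowdowns(LOG_32, LOG_128).items()):
        print(f"{routine:8s} inputs=[{inputs}]  128 vs 32 threads: {ratio:.2f}x")
```

A ratio above 1 means the call took longer on average with OMP_NUM_THREADS=128. For the sample lines above, the 52x52 ZGETRF case comes out at roughly 2.9x and the DGEMM case at roughly 3.7x.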
DGBTRS and DGBTRF are not related to GEQRF, but the issue here is the same: your problems are small, and our implementation is geared towards large parallel problems. We have a few LAPACK functions which are implemented in a similar way.
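As a rough illustration of how small these problems are, the following Python sketch estimates the arithmetic work and achieved rate for two of the DGEMM calls listed in the original post. It assumes the perf-lib-tools "Inputs:" fields for dgemm_ are ordered M, N, K, lda, ldb, ldc, transA, transB; that ordering is inferred from the leading-dimension values and is not confirmed anywhere in this thread. The function gemm_gflops is a hypothetical helper, and 2*M*N*K is the usual flop count for a real matrix multiply.

```python
def gemm_gflops(m, n, k, seconds):
    """Achieved GFLOP/s for one real GEMM call, counting 2*m*n*k flops."""
    return 2.0 * m * n * k / seconds / 1e9


# (M, N, K, Mean_time at 32 threads, Mean_time at 128 threads), from the post.
# The M/N/K interpretation of the Inputs fields is an assumption.
CALLS = [
    (36, 1, 252, 1.281673e-05, 3.143310e-05),
    (264, 1, 42, 9.806415e-06, 2.066992e-05),
]

if __name__ == "__main__":
    for m, n, k, t32, t128 in CALLS:
        flops = 2 * m * n * k
        print(
            f"M={m} N={n} K={k}: {flops} flops per call, "
            f"{gemm_gflops(m, n, k, t32):.2f} GFLOP/s at 32 threads, "
            f"{gemm_gflops(m, n, k, t128):.2f} GFLOP/s at 128 threads"
        )
```

Under that assumption each call does only about twenty thousand floating-point operations, and with N=1 it is effectively a matrix-vector product, so per-call overhead can easily outweigh the arithmetic itself.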
In order to understand the best way to deal with these issues, would you mind getting in touch via support-hpc-sw@arm.com? We have a few options, but it would be better to find out which is the most appropriate for your use cases.
Regards, Chris.