Some time ago I reported a performance issue with ArmPL related to CPUs with a large number of cores. See https://community.arm.com/support-forums/f/high-performance-computing-forum/53959/negative-armpl-mt-speed-up-on-many-core-systems It was fixed in version 23.04. As a result, my application gained a lot of performance. Recently, I did some more scaling performance testing and discovered more issues.

The testing was done on a 128-core Ampere Altra CPU running Ubuntu 22.04. The application uses ArmPL 23.04 and LLVM OpenMP, and is compiled with GCC 12. The application uses ArmPL in two ways, depending on the algorithm: (1) a single application thread calls ArmPL; (2) multiple application threads call ArmPL at the same time. I did the performance profiling with perf-lib-tools.

ZGETRF

The problem with ZGETRF occurs when multiple application threads call it at the same time. Typical inputs are M=N with a size varying between 10 and 300. Maybe there is some locking issue? Summary from perf-lib-tools:

OMP_NUM_THREADS=32
zgetrf_ cnt= 2434696 totTime= 1504.5387 called_tot= 171712 topTime= 93.8037 (%age of runtime: 6.428 )
Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_915245.apl | sort -k8 -nr | head | sort -k12 -n
Routine: zgetrf_ nCalls: 216920 Mean_time 8.647643e-05 nUserCalls: 12077 Mean_user_time: 8.447802e-05 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 66792 Mean_time 2.949315e-04 nUserCalls: 7386 Mean_user_time: 2.930019e-04 Inputs: 100 100 100 1 0
Routine: zgetrf_ nCalls: 126456 Mean_time 3.720812e-04 nUserCalls: 13398 Mean_user_time: 3.665895e-04 Inputs: 110 110 110 1 0
Routine: zgetrf_ nCalls: 201256 Mean_time 4.436309e-04 nUserCalls: 16952 Mean_user_time: 4.371607e-04 Inputs: 120 120 120 1 0
Routine: zgetrf_ nCalls: 168344 Mean_time 5.473583e-04 nUserCalls: 14346 Mean_user_time: 5.396482e-04 Inputs: 130 130 130 1 0
Routine: zgetrf_ nCalls: 147136 Mean_time 6.433120e-04 nUserCalls: 11615 Mean_user_time: 6.342174e-04 Inputs: 140 140 140 1 0
Routine: zgetrf_ nCalls: 211288 Mean_time 6.791234e-04 nUserCalls: 9395 Mean_user_time: 6.745791e-04 Inputs: 142 142 142 1 0
Routine: zgetrf_ nCalls: 123904 Mean_time 7.762974e-04 nUserCalls: 8320 Mean_user_time: 7.669183e-04 Inputs: 150 150 150 1 0
Routine: zgetrf_ nCalls: 73040 Mean_time 8.895631e-04 nUserCalls: 5874 Mean_user_time: 8.788582e-04 Inputs: 160 160 160 1 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.083270e-03 nUserCalls: 9152 Mean_user_time: 1.073041e-03 Inputs: 170 170 170 1 0

OMP_NUM_THREADS=128
zgetrf_ cnt= 2434696 totTime= 2046.2830 called_tot= 241094 topTime= 151.9661 (%age of runtime: 9.300 )
Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_912580.apl | sort -k8 -nr | head | sort -k12 -n
Routine: zgetrf_ nCalls: 216920 Mean_time 2.516856e-04 nUserCalls: 11667 Mean_user_time: 1.828228e-04 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 66792 Mean_time 4.422276e-04 nUserCalls: 9588 Mean_user_time: 3.222948e-04 Inputs: 100 100 100 1 0
Routine: zgetrf_ nCalls: 126456 Mean_time 5.297068e-04 nUserCalls: 18734 Mean_user_time: 3.932569e-04 Inputs: 110 110 110 1 0
Routine: zgetrf_ nCalls: 201256 Mean_time 6.306487e-04 nUserCalls: 23959 Mean_user_time: 4.647142e-04 Inputs: 120 120 120 1 0
Routine: zgetrf_ nCalls: 168344 Mean_time 7.413436e-04 nUserCalls: 27496 Mean_user_time: 5.610537e-04 Inputs: 130 130 130 1 0
Routine: zgetrf_ nCalls: 147136 Mean_time 8.560422e-04 nUserCalls: 23896 Mean_user_time: 6.514217e-04 Inputs: 140 140 140 1 0
Routine: zgetrf_ nCalls: 123904 Mean_time 1.012120e-03 nUserCalls: 20248 Mean_user_time: 7.814398e-04 Inputs: 150 150 150 1 0
Routine: zgetrf_ nCalls: 73040 Mean_time 1.143292e-03 nUserCalls: 14404 Mean_user_time: 8.928624e-04 Inputs: 160 160 160 1 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.406755e-03 nUserCalls: 10160 Mean_user_time: 1.091698e-03 Inputs: 170 170 170 1 0
Routine: zgetrf_ nCalls: 19096 Mean_time 1.444096e-03 nUserCalls: 5755 Mean_user_time: 1.218681e-03 Inputs: 180 180 180 1 0

DGEMM

Even though DGEMM received substantial performance improvements, it still has some issues:

OMP_NUM_THREADS=32
dgemm_ cnt= 30596724 totTime= 272.4157 called_tot= 30596724 topTime= 272.4157 (%age of runtime: 2.876 )
Example calls:
Routine: dgemm_ nCalls: 14728 Mean_time 1.281673e-05 nUserCalls: 14728 Mean_user_time: 1.281673e-05 Inputs: 36 1 252 36 288 1226658 N N
Routine: dgemm_ nCalls: 9884 Mean_time 9.806415e-06 nUserCalls: 9884 Mean_user_time: 9.806415e-06 Inputs: 264 1 42 42 42 264 T N
Routine: dgemm_ nCalls: 14728 Mean_time 9.071300e-06 nUserCalls: 14728 Mean_user_time: 9.071300e-06 Inputs: 252 1 36 36 36 252 T N
Routine: dgemm_ nCalls: 9884 Mean_time 1.381515e-05 nUserCalls: 9884 Mean_user_time: 1.381515e-05 Inputs: 42 1 264 42 306 1226658 N N
Routine: dgemm_ nCalls: 12238 Mean_time 9.176557e-06 nUserCalls: 12238 Mean_user_time: 9.176557e-06 Inputs: 246 1 36 36 36 246 T N
Routine: dgemm_ nCalls: 17002 Mean_time 1.015412e-05 nUserCalls: 17002 Mean_user_time: 1.015412e-05 Inputs: 240 1 48 48 48 240 T N
Routine: dgemm_ nCalls: 13044 Mean_time 1.026566e-05 nUserCalls: 13044 Mean_user_time: 1.026566e-05 Inputs: 210 1 54 54 54 210 T N
Routine: dgemm_ nCalls: 10980 Mean_time 1.011446e-05 nUserCalls: 10980 Mean_user_time: 1.011446e-05 Inputs: 276 1 48 48 48 276 T N
Routine: dgemm_ nCalls: 24626 Mean_time 9.204542e-06 nUserCalls: 24626 Mean_user_time: 9.204542e-06 Inputs: 204 1 42 42 42 204 T N
Routine: dgemm_ nCalls: 19882 Mean_time 9.293070e-06 nUserCalls: 19882 Mean_user_time: 9.293070e-06 Inputs: 216 1 42 42 42 216 T N

OMP_NUM_THREADS=128
dgemm_ cnt= 30597188 totTime= 350.5334 called_tot= 30597188 topTime= 350.5334 (%age of runtime: 3.982 )
Example calls:
Routine: dgemm_ nCalls: 14728 Mean_time 3.143310e-05 nUserCalls: 14728 Mean_user_time: 3.143310e-05 Inputs: 36 1 252 36 288 1226658 N N
Routine: dgemm_ nCalls: 9884 Mean_time 2.066992e-05 nUserCalls: 9884 Mean_user_time: 2.066992e-05 Inputs: 264 1 42 42 42 264 T N
Routine: dgemm_ nCalls: 14728 Mean_time 1.720480e-05 nUserCalls: 14728 Mean_user_time: 1.720480e-05 Inputs: 252 1 36 36 36 252 T N
Routine: dgemm_ nCalls: 9884 Mean_time 2.001355e-05 nUserCalls: 9884 Mean_user_time: 2.001355e-05 Inputs: 42 1 264 42 306 1226658 N N
Routine: dgemm_ nCalls: 12246 Mean_time 3.380307e-05 nUserCalls: 12246 Mean_user_time: 3.380307e-05 Inputs: 246 1 36 36 36 246 T N
Routine: dgemm_ nCalls: 17012 Mean_time 2.844690e-05 nUserCalls: 17012 Mean_user_time: 2.844690e-05 Inputs: 240 1 48 48 48 240 T N
Routine: dgemm_ nCalls: 13042 Mean_time 2.803772e-05 nUserCalls: 13042 Mean_user_time: 2.803772e-05 Inputs: 210 1 54 54 54 210 T N
Routine: dgemm_ nCalls: 10982 Mean_time 2.659640e-05 nUserCalls: 10982 Mean_user_time: 2.659640e-05 Inputs: 276 1 48 48 48 276 T N
Routine: dgemm_ nCalls: 24612 Mean_time 2.413749e-05 nUserCalls: 24612 Mean_user_time: 2.413749e-05 Inputs: 204 1 42 42 42 204 T N
Routine: dgemm_ nCalls: 19862 Mean_time 2.381463e-05 nUserCalls: 19862 Mean_user_time: 2.381463e-05 Inputs: 216 1 42 42 42 216 T N

It would be nice to have this fixed.
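For anyone who wants to compare the two runs shape by shape, the per-call lines from the 32-thread and 128-thread summaries can be matched on routine name and inputs, and the ratio of their Mean_time values taken. Below is a minimal Python sketch of that comparison. The function names and the regular expression are illustrative only and are not part of perf-lib-tools; the sketch assumes the line layout shown in the excerpts in this post, and the sample data is limited to two lines per run copied from those excerpts.

```python
import re

# Assumed layout of one per-shape summary line, as quoted in this post:
# "Routine: <name> nCalls: <n> Mean_time <t> nUserCalls: <n> Mean_user_time: <t> Inputs: <...>"
LINE_RE = re.compile(
    r"Routine:\s+(?P<routine>\S+)\s+nCalls:\s+\d+\s+"
    r"Mean_time\s+(?P<mean>\S+)\s+nUserCalls:\s+\d+\s+"
    r"Mean_user_time:\s+\S+\s+Inputs:\s+(?P<inputs>.+?)\s*$"
)


def parse_summary(text):
    """Map (routine, inputs) -> Mean_time in seconds for every matching line."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            key = (match["routine"], match["inputs"])
            result[key] = float(match["mean"])
    return result


def slowdowns(base_text, scaled_text):
    """Ratio of Mean_time (scaled / base) for shapes present in both runs."""
    base = parse_summary(base_text)
    scaled = parse_summary(scaled_text)
    return {key: scaled[key] / base[key] for key in base if key in scaled}


# Two lines per run, copied from the OMP_NUM_THREADS=32 excerpts above.
RUN_32 = """\
Routine: zgetrf_ nCalls: 216920 Mean_time 8.647643e-05 nUserCalls: 12077 Mean_user_time: 8.447802e-05 Inputs: 52 52 52 2 0
Routine: dgemm_ nCalls: 14728 Mean_time 1.281673e-05 nUserCalls: 14728 Mean_user_time: 1.281673e-05 Inputs: 36 1 252 36 288 1226658 N N
"""

# The same two shapes from the OMP_NUM_THREADS=128 excerpts above.
RUN_128 = """\
Routine: zgetrf_ nCalls: 216920 Mean_time 2.516856e-04 nUserCalls: 11667 Mean_user_time: 1.828228e-04 Inputs: 52 52 52 2 0
Routine: dgemm_ nCalls: 14728 Mean_time 3.143310e-05 nUserCalls: 14728 Mean_user_time: 3.143310e-05 Inputs: 36 1 252 36 288 1226658 N N
"""

if __name__ == "__main__":
    for (routine, inputs), ratio in sorted(slowdowns(RUN_32, RUN_128).items()):
        print(f"{routine:8s} inputs=[{inputs}] 128-thread / 32-thread mean time: {ratio:.2f}x")
```

To run it on full summaries, the two sample strings can be replaced with the contents of the corresponding .apl summary files, provided their lines follow the same layout.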
Hello. Just to confirm, we've observed similar scaling issues in going from 32 to 128 cores for these small ZGETRF and DGEMM problems when working from cold caches. We'll be working on addressing these. FYI, we've just released version 23.10, but that version doesn't contain any tunings that will address these issues.
Thanks! I've started evaluating version 23.10. It looks good so far (no new issues observed, some old issues fixed).