Scaling issues with ArmPL 23.04

Some time ago I reported a performance issue with ArmPL on CPUs with a large number of cores (see https://community.arm.com/support-forums/f/high-performance-computing-forum/53959/negative-armpl-mt-speed-up-on-many-core-systems ). It was fixed in version 23.04, and my application gained a lot of performance as a result. Recently I did more scaling tests and found further issues.

The testing was done on a 128-core Ampere Altra CPU running Ubuntu 22.04. The application uses ArmPL 23.04 and LLVM OpenMP, and is compiled with GCC 12. It calls ArmPL in two ways, depending on the algorithm: (1) a single application thread calls ArmPL; (2) multiple application threads call ArmPL at the same time. I did the performance profiling with perf-lib-tools.

ZGETRF
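The problem with ZGETRF occurs when multiple application threads call it at the same time. Typical inputs are square (M=N) with sizes between 10 and 300. Going from OMP_NUM_THREADS=32 to OMP_NUM_THREADS=128, totTime rises from 1504.5 to 2046.3, and Mean_time is higher for every size that appears in both listings. Maybe there is some locking issue? Summary from perf-lib-tools: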

OMP_NUM_THREADS=32
zgetrf_     cnt=    2434696  totTime=    1504.5387 called_tot=     171712  topTime=      93.8037    (%age of runtime:  6.428 )

Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_915245.apl | sort -k8 -nr | head | sort -k12 -n
Routine:  zgetrf_  nCalls: 216920  Mean_time 8.647643e-05   nUserCalls:  12077  Mean_user_time: 8.447802e-05   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls:  66792  Mean_time 2.949315e-04   nUserCalls:   7386  Mean_user_time: 2.930019e-04   Inputs:           100          100          100            1            0
Routine:  zgetrf_  nCalls: 126456  Mean_time 3.720812e-04   nUserCalls:  13398  Mean_user_time: 3.665895e-04   Inputs:           110          110          110            1            0
Routine:  zgetrf_  nCalls: 201256  Mean_time 4.436309e-04   nUserCalls:  16952  Mean_user_time: 4.371607e-04   Inputs:           120          120          120            1            0
Routine:  zgetrf_  nCalls: 168344  Mean_time 5.473583e-04   nUserCalls:  14346  Mean_user_time: 5.396482e-04   Inputs:           130          130          130            1            0
Routine:  zgetrf_  nCalls: 147136  Mean_time 6.433120e-04   nUserCalls:  11615  Mean_user_time: 6.342174e-04   Inputs:           140          140          140            1            0
Routine:  zgetrf_  nCalls: 211288  Mean_time 6.791234e-04   nUserCalls:   9395  Mean_user_time: 6.745791e-04   Inputs:           142          142          142            1            0
Routine:  zgetrf_  nCalls: 123904  Mean_time 7.762974e-04   nUserCalls:   8320  Mean_user_time: 7.669183e-04   Inputs:           150          150          150            1            0
Routine:  zgetrf_  nCalls:  73040  Mean_time 8.895631e-04   nUserCalls:   5874  Mean_user_time: 8.788582e-04   Inputs:           160          160          160            1            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.083270e-03   nUserCalls:   9152  Mean_user_time: 1.073041e-03   Inputs:           170          170          170            1            0

OMP_NUM_THREADS=128
zgetrf_     cnt=    2434696  totTime=    2046.2830 called_tot=     241094  topTime=     151.9661    (%age of runtime:  9.300 )

Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_912580.apl | sort -k8 -nr | head | sort -k12 -n
Routine:  zgetrf_  nCalls: 216920  Mean_time 2.516856e-04   nUserCalls:  11667  Mean_user_time: 1.828228e-04   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls:  66792  Mean_time 4.422276e-04   nUserCalls:   9588  Mean_user_time: 3.222948e-04   Inputs:           100          100          100            1            0
Routine:  zgetrf_  nCalls: 126456  Mean_time 5.297068e-04   nUserCalls:  18734  Mean_user_time: 3.932569e-04   Inputs:           110          110          110            1            0
Routine:  zgetrf_  nCalls: 201256  Mean_time 6.306487e-04   nUserCalls:  23959  Mean_user_time: 4.647142e-04   Inputs:           120          120          120            1            0
Routine:  zgetrf_  nCalls: 168344  Mean_time 7.413436e-04   nUserCalls:  27496  Mean_user_time: 5.610537e-04   Inputs:           130          130          130            1            0
Routine:  zgetrf_  nCalls: 147136  Mean_time 8.560422e-04   nUserCalls:  23896  Mean_user_time: 6.514217e-04   Inputs:           140          140          140            1            0
Routine:  zgetrf_  nCalls: 123904  Mean_time 1.012120e-03   nUserCalls:  20248  Mean_user_time: 7.814398e-04   Inputs:           150          150          150            1            0
Routine:  zgetrf_  nCalls:  73040  Mean_time 1.143292e-03   nUserCalls:  14404  Mean_user_time: 8.928624e-04   Inputs:           160          160          160            1            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.406755e-03   nUserCalls:  10160  Mean_user_time: 1.091698e-03   Inputs:           170          170          170            1            0
Routine:  zgetrf_  nCalls:  19096  Mean_time 1.444096e-03   nUserCalls:   5755  Mean_user_time: 1.218681e-03   Inputs:           180          180          180            1            0
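
To make the two listings easier to compare, here is a minimal sketch that parses per-shape lines and prints the Mean_time ratio between the two runs. The regular expression is based only on the line layout shown in this post, so it is an assumption that other versions of the summary output look the same. The names parse and slowdown are made up for this example. Only three lines from each listing are included to keep it short.

```python
import re

# Field layout copied from the per-shape lines in this post; other
# versions of the summary output may differ (assumption).
LINE_RE = re.compile(
    r"Routine:\s+(\S+)\s+nCalls:\s+(\d+)\s+Mean_time\s+(\S+)\s+"
    r"nUserCalls:\s+(\d+)\s+Mean_user_time:\s+(\S+)\s+Inputs:\s+(.*)"
)


def parse(text):
    """Map (routine, inputs) -> Mean_time for every matching line."""
    times = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            routine, _ncalls, mean, _nuser, _mean_user, inputs = m.groups()
            times[(routine, tuple(inputs.split()))] = float(mean)
    return times


def slowdown(base, other):
    """Mean_time ratio other/base for shapes present in both runs."""
    return {key: other[key] / base[key] for key in base if key in other}


# Three lines from the OMP_NUM_THREADS=32 listing above.
RUN_32 = """\
Routine:  zgetrf_  nCalls: 216920  Mean_time 8.647643e-05   nUserCalls:  12077  Mean_user_time: 8.447802e-05   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls: 201256  Mean_time 4.436309e-04   nUserCalls:  16952  Mean_user_time: 4.371607e-04   Inputs:           120          120          120            1            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.083270e-03   nUserCalls:   9152  Mean_user_time: 1.073041e-03   Inputs:           170          170          170            1            0
"""

# The same three shapes from the OMP_NUM_THREADS=128 listing above.
RUN_128 = """\
Routine:  zgetrf_  nCalls: 216920  Mean_time 2.516856e-04   nUserCalls:  11667  Mean_user_time: 1.828228e-04   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls: 201256  Mean_time 6.306487e-04   nUserCalls:  23959  Mean_user_time: 4.647142e-04   Inputs:           120          120          120            1            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.406755e-03   nUserCalls:  10160  Mean_user_time: 1.091698e-03   Inputs:           170          170          170            1            0
"""

ratios = slowdown(parse(RUN_32), parse(RUN_128))
for (routine, inputs), ratio in sorted(ratios.items(), key=lambda kv: int(kv[0][1][0])):
    print(f"{routine} {inputs[0]}x{inputs[1]}: {ratio:.2f}x slower at 128 threads")
```

For these three shapes the ratio is largest for the smallest matrix (about 2.9x at 52x52) and shrinks as the size grows (about 1.3x at 170x170).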


DGEMM
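Even though DGEMM received substantial performance improvements, small calls still get slower as the thread count goes up. Each of the example calls below takes between about 1.4 and 3.7 times longer with OMP_NUM_THREADS=128 than with OMP_NUM_THREADS=32, and totTime rises from 272.4 to 350.5: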

OMP_NUM_THREADS=32
dgemm_     cnt=   30596724  totTime=     272.4157   called_tot=   30596724  topTime=     272.4157    (%age of runtime:  2.876 )

Example calls:
Routine:   dgemm_  nCalls:  14728  Mean_time 1.281673e-05   nUserCalls:  14728  Mean_user_time: 1.281673e-05   Inputs:            36            1          252           36          288      1226658 N N
Routine:   dgemm_  nCalls:   9884  Mean_time 9.806415e-06   nUserCalls:   9884  Mean_user_time: 9.806415e-06   Inputs:           264            1           42           42           42          264 T N
Routine:   dgemm_  nCalls:  14728  Mean_time 9.071300e-06   nUserCalls:  14728  Mean_user_time: 9.071300e-06   Inputs:           252            1           36           36           36          252 T N
Routine:   dgemm_  nCalls:   9884  Mean_time 1.381515e-05   nUserCalls:   9884  Mean_user_time: 1.381515e-05   Inputs:            42            1          264           42          306      1226658 N N
Routine:   dgemm_  nCalls:  12238  Mean_time 9.176557e-06   nUserCalls:  12238  Mean_user_time: 9.176557e-06   Inputs:           246            1           36           36           36          246 T N
Routine:   dgemm_  nCalls:  17002  Mean_time 1.015412e-05   nUserCalls:  17002  Mean_user_time: 1.015412e-05   Inputs:           240            1           48           48           48          240 T N
Routine:   dgemm_  nCalls:  13044  Mean_time 1.026566e-05   nUserCalls:  13044  Mean_user_time: 1.026566e-05   Inputs:           210            1           54           54           54          210 T N
Routine:   dgemm_  nCalls:  10980  Mean_time 1.011446e-05   nUserCalls:  10980  Mean_user_time: 1.011446e-05   Inputs:           276            1           48           48           48          276 T N
Routine:   dgemm_  nCalls:  24626  Mean_time 9.204542e-06   nUserCalls:  24626  Mean_user_time: 9.204542e-06   Inputs:           204            1           42           42           42          204 T N
Routine:   dgemm_  nCalls:  19882  Mean_time 9.293070e-06   nUserCalls:  19882  Mean_user_time: 9.293070e-06   Inputs:           216            1           42           42           42          216 T N

OMP_NUM_THREADS=128
dgemm_     cnt=   30597188  totTime=     350.5334   called_tot=   30597188  topTime=     350.5334    (%age of runtime:  3.982 )

Example calls:
Routine:   dgemm_  nCalls:  14728  Mean_time 3.143310e-05   nUserCalls:  14728  Mean_user_time: 3.143310e-05   Inputs:            36            1          252           36          288      1226658 N N
Routine:   dgemm_  nCalls:   9884  Mean_time 2.066992e-05   nUserCalls:   9884  Mean_user_time: 2.066992e-05   Inputs:           264            1           42           42           42          264 T N
Routine:   dgemm_  nCalls:  14728  Mean_time 1.720480e-05   nUserCalls:  14728  Mean_user_time: 1.720480e-05   Inputs:           252            1           36           36           36          252 T N
Routine:   dgemm_  nCalls:   9884  Mean_time 2.001355e-05   nUserCalls:   9884  Mean_user_time: 2.001355e-05   Inputs:            42            1          264           42          306      1226658 N N
Routine:   dgemm_  nCalls:  12246  Mean_time 3.380307e-05   nUserCalls:  12246  Mean_user_time: 3.380307e-05   Inputs:           246            1           36           36           36          246 T N
Routine:   dgemm_  nCalls:  17012  Mean_time 2.844690e-05   nUserCalls:  17012  Mean_user_time: 2.844690e-05   Inputs:           240            1           48           48           48          240 T N
Routine:   dgemm_  nCalls:  13042  Mean_time 2.803772e-05   nUserCalls:  13042  Mean_user_time: 2.803772e-05   Inputs:           210            1           54           54           54          210 T N
Routine:   dgemm_  nCalls:  10982  Mean_time 2.659640e-05   nUserCalls:  10982  Mean_user_time: 2.659640e-05   Inputs:           276            1           48           48           48          276 T N
Routine:   dgemm_  nCalls:  24612  Mean_time 2.413749e-05   nUserCalls:  24612  Mean_user_time: 2.413749e-05   Inputs:           204            1           42           42           42          204 T N
Routine:   dgemm_  nCalls:  19862  Mean_time 2.381463e-05   nUserCalls:  19862  Mean_user_time: 2.381463e-05   Inputs:           216            1           42           42           42          216 T N
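
The per-routine summary lines can be turned into an average cost per call, which gives a single number to compare across thread counts. This is a rough sketch that assumes totTime divided by cnt is a meaningful average. parse_summary and mean_per_call are names made up for this example, and the key=value parsing is based only on the two summary lines shown in this post.

```python
import re


def parse_summary(line):
    """Return (routine, {field: value}) for a 'name key= value ...' line."""
    routine = line.split()[0]
    fields = {k: float(v) for k, v in re.findall(r"(\w+)=\s*([\d.]+)", line)}
    return routine, fields


def mean_per_call(fields):
    """Average time per call, assuming totTime / cnt is a fair average."""
    return fields["totTime"] / fields["cnt"]


# The two dgemm_ summary lines from this post.
LINE_32 = "dgemm_     cnt=   30596724  totTime=     272.4157   called_tot=   30596724  topTime=     272.4157    (%age of runtime:  2.876 )"
LINE_128 = "dgemm_     cnt=   30597188  totTime=     350.5334   called_tot=   30597188  topTime=     350.5334    (%age of runtime:  3.982 )"

name, f32 = parse_summary(LINE_32)
_, f128 = parse_summary(LINE_128)

m32 = mean_per_call(f32)
m128 = mean_per_call(f128)
print(f"{name} mean per call, 32 threads:  {m32:.3e}")
print(f"{name} mean per call, 128 threads: {m128:.3e}")
print(f"ratio 128/32: {m128 / m32:.3f}")
```

Averaged over all roughly 30.6 million calls, the cost per call rises by about 29%, which is less than the 1.4x to 3.7x seen for the individual shapes listed above, so the slowdown appears to be concentrated in particular shapes.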

It would be nice to have both issues fixed.