Scaling issues with ArmPL 23.04

Some time ago I reported an ArmPL performance issue on CPUs with a large number of cores (see https://community.arm.com/support-forums/f/high-performance-computing-forum/53959/negative-armpl-mt-speed-up-on-many-core-systems). It was fixed in version 23.04, and my application gained a lot of performance as a result. Recently I ran more scaling tests and found further issues.

The testing was done on a 128-core Ampere Altra CPU running Ubuntu 22.04. The application uses ArmPL 23.04 and LLVM OpenMP, and is compiled with GCC 12. It uses ArmPL in two ways, depending on the algorithm: (1) a single application thread calls ArmPL; (2) multiple application threads call ArmPL at the same time. I did the performance profiling with perf-lib-tools.

ZGETRF
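The problem with ZGETRF occurs when multiple application threads call it at the same time. Typical inputs are square (M=N), with sizes between 10 and 300. Going from 32 to 128 threads, totTime for zgetrf_ rises from 1504.5 to 2046.3, and the mean time of the 52x52 calls nearly triples. Maybe there is some locking issue? Summary from perf-lib-tools: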

OMP_NUM_THREADS=32
zgetrf_     cnt=    2434696  totTime=    1504.5387 called_tot=     171712  topTime=      93.8037    (%age of runtime:  6.428 )

Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_915245.apl | sort -k8 -nr | head | sort -k12 -n
Routine:  zgetrf_  nCalls: 216920  Mean_time 8.647643e-05   nUserCalls:  12077  Mean_user_time: 8.447802e-05   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls:  66792  Mean_time 2.949315e-04   nUserCalls:   7386  Mean_user_time: 2.930019e-04   Inputs:           100          100          100            1            0
Routine:  zgetrf_  nCalls: 126456  Mean_time 3.720812e-04   nUserCalls:  13398  Mean_user_time: 3.665895e-04   Inputs:           110          110          110            1            0
Routine:  zgetrf_  nCalls: 201256  Mean_time 4.436309e-04   nUserCalls:  16952  Mean_user_time: 4.371607e-04   Inputs:           120          120          120            1            0
Routine:  zgetrf_  nCalls: 168344  Mean_time 5.473583e-04   nUserCalls:  14346  Mean_user_time: 5.396482e-04   Inputs:           130          130          130            1            0
Routine:  zgetrf_  nCalls: 147136  Mean_time 6.433120e-04   nUserCalls:  11615  Mean_user_time: 6.342174e-04   Inputs:           140          140          140            1            0
Routine:  zgetrf_  nCalls: 211288  Mean_time 6.791234e-04   nUserCalls:   9395  Mean_user_time: 6.745791e-04   Inputs:           142          142          142            1            0
Routine:  zgetrf_  nCalls: 123904  Mean_time 7.762974e-04   nUserCalls:   8320  Mean_user_time: 7.669183e-04   Inputs:           150          150          150            1            0
Routine:  zgetrf_  nCalls:  73040  Mean_time 8.895631e-04   nUserCalls:   5874  Mean_user_time: 8.788582e-04   Inputs:           160          160          160            1            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.083270e-03   nUserCalls:   9152  Mean_user_time: 1.073041e-03   Inputs:           170          170          170            1            0

OMP_NUM_THREADS=128
zgetrf_     cnt=    2434696  totTime=    2046.2830 called_tot=     241094  topTime=     151.9661    (%age of runtime:  9.300 )

Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_912580.apl | sort -k8 -nr | head | sort -k12 -n
Routine:  zgetrf_  nCalls: 216920  Mean_time 2.516856e-04   nUserCalls:  11667  Mean_user_time: 1.828228e-04   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls:  66792  Mean_time 4.422276e-04   nUserCalls:   9588  Mean_user_time: 3.222948e-04   Inputs:           100          100          100            1            0
Routine:  zgetrf_  nCalls: 126456  Mean_time 5.297068e-04   nUserCalls:  18734  Mean_user_time: 3.932569e-04   Inputs:           110          110          110            1            0
Routine:  zgetrf_  nCalls: 201256  Mean_time 6.306487e-04   nUserCalls:  23959  Mean_user_time: 4.647142e-04   Inputs:           120          120          120            1            0
Routine:  zgetrf_  nCalls: 168344  Mean_time 7.413436e-04   nUserCalls:  27496  Mean_user_time: 5.610537e-04   Inputs:           130          130          130            1            0
Routine:  zgetrf_  nCalls: 147136  Mean_time 8.560422e-04   nUserCalls:  23896  Mean_user_time: 6.514217e-04   Inputs:           140          140          140            1            0
Routine:  zgetrf_  nCalls: 123904  Mean_time 1.012120e-03   nUserCalls:  20248  Mean_user_time: 7.814398e-04   Inputs:           150          150          150            1            0
Routine:  zgetrf_  nCalls:  73040  Mean_time 1.143292e-03   nUserCalls:  14404  Mean_user_time: 8.928624e-04   Inputs:           160          160          160            1            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.406755e-03   nUserCalls:  10160  Mean_user_time: 1.091698e-03   Inputs:           170          170          170            1            0
Routine:  zgetrf_  nCalls:  19096  Mean_time 1.444096e-03   nUserCalls:   5755  Mean_user_time: 1.218681e-03   Inputs:           180          180          180            1            0
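To compare two runs input by input, the per-input lines can be parsed and the mean times divided. The following is only a minimal sketch: the helper names (`parse_summary`, `slowdown`) are made up for illustration, the line format is assumed to be exactly the one pasted above, and only two of the lines from each run are embedded as sample data.

```python
import re

# One per-input line of the summary, in the format shown in the listings above.
LINE_RE = re.compile(
    r"Routine:\s+(?P<name>\S+)\s+nCalls:\s+(?P<ncalls>\d+)\s+"
    r"Mean_time\s+(?P<mean>\S+)\s+nUserCalls:\s+(?P<nuser>\d+)\s+"
    r"Mean_user_time:\s+(?P<mean_user>\S+)\s+Inputs:\s+(?P<inputs>.*)"
)

def parse_summary(text):
    """Map (routine, inputs) -> Mean_time for every matching line."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            key = (m["name"], tuple(m["inputs"].split()))
            result[key] = float(m["mean"])
    return result

def slowdown(run_a, run_b):
    """Mean_time(run_b) / Mean_time(run_a) for inputs present in both runs."""
    a, b = parse_summary(run_a), parse_summary(run_b)
    return {key: b[key] / a[key] for key in a if key in b}

# Two lines copied from the OMP_NUM_THREADS=32 listing.
RUN_32 = """\
Routine:  zgetrf_  nCalls: 216920  Mean_time 8.647643e-05   nUserCalls:  12077  Mean_user_time: 8.447802e-05   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.083270e-03   nUserCalls:   9152  Mean_user_time: 1.073041e-03   Inputs:           170          170          170            1            0
"""

# The matching lines from the OMP_NUM_THREADS=128 listing.
RUN_128 = """\
Routine:  zgetrf_  nCalls: 216920  Mean_time 2.516856e-04   nUserCalls:  11667  Mean_user_time: 1.828228e-04   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.406755e-03   nUserCalls:  10160  Mean_user_time: 1.091698e-03   Inputs:           170          170          170            1            0
"""

for (name, inputs), ratio in sorted(slowdown(RUN_32, RUN_128).items()):
    print(f"{name} {' '.join(inputs)}: {ratio:.2f}x")
```

In practice the two strings would be replaced by the contents of the two /tmp/armplsummary_*.apl files. Inputs that appear in only one run (such as the 142x142 case, which is in the 32-thread top ten only) are skipped.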


DGEMM
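Even though DGEMM received substantial performance improvements, small calls still get slower as the thread count grows. Going from 32 to 128 threads, the mean time of the example calls below increases by a factor of roughly 1.4 to 3.7: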

OMP_NUM_THREADS=32
dgemm_     cnt=   30596724  totTime=     272.4157   called_tot=   30596724  topTime=     272.4157    (%age of runtime:  2.876 )

Example calls:
Routine:   dgemm_  nCalls:  14728  Mean_time 1.281673e-05   nUserCalls:  14728  Mean_user_time: 1.281673e-05   Inputs:            36            1          252           36          288      1226658 N N
Routine:   dgemm_  nCalls:   9884  Mean_time 9.806415e-06   nUserCalls:   9884  Mean_user_time: 9.806415e-06   Inputs:           264            1           42           42           42          264 T N
Routine:   dgemm_  nCalls:  14728  Mean_time 9.071300e-06   nUserCalls:  14728  Mean_user_time: 9.071300e-06   Inputs:           252            1           36           36           36          252 T N
Routine:   dgemm_  nCalls:   9884  Mean_time 1.381515e-05   nUserCalls:   9884  Mean_user_time: 1.381515e-05   Inputs:            42            1          264           42          306      1226658 N N
Routine:   dgemm_  nCalls:  12238  Mean_time 9.176557e-06   nUserCalls:  12238  Mean_user_time: 9.176557e-06   Inputs:           246            1           36           36           36          246 T N
Routine:   dgemm_  nCalls:  17002  Mean_time 1.015412e-05   nUserCalls:  17002  Mean_user_time: 1.015412e-05   Inputs:           240            1           48           48           48          240 T N
Routine:   dgemm_  nCalls:  13044  Mean_time 1.026566e-05   nUserCalls:  13044  Mean_user_time: 1.026566e-05   Inputs:           210            1           54           54           54          210 T N
Routine:   dgemm_  nCalls:  10980  Mean_time 1.011446e-05   nUserCalls:  10980  Mean_user_time: 1.011446e-05   Inputs:           276            1           48           48           48          276 T N
Routine:   dgemm_  nCalls:  24626  Mean_time 9.204542e-06   nUserCalls:  24626  Mean_user_time: 9.204542e-06   Inputs:           204            1           42           42           42          204 T N
Routine:   dgemm_  nCalls:  19882  Mean_time 9.293070e-06   nUserCalls:  19882  Mean_user_time: 9.293070e-06   Inputs:           216            1           42           42           42          216 T N

OMP_NUM_THREADS=128
dgemm_     cnt=   30597188  totTime=     350.5334   called_tot=   30597188  topTime=     350.5334    (%age of runtime:  3.982 )

Example calls:
Routine:   dgemm_  nCalls:  14728  Mean_time 3.143310e-05   nUserCalls:  14728  Mean_user_time: 3.143310e-05   Inputs:            36            1          252           36          288      1226658 N N
Routine:   dgemm_  nCalls:   9884  Mean_time 2.066992e-05   nUserCalls:   9884  Mean_user_time: 2.066992e-05   Inputs:           264            1           42           42           42          264 T N
Routine:   dgemm_  nCalls:  14728  Mean_time 1.720480e-05   nUserCalls:  14728  Mean_user_time: 1.720480e-05   Inputs:           252            1           36           36           36          252 T N
Routine:   dgemm_  nCalls:   9884  Mean_time 2.001355e-05   nUserCalls:   9884  Mean_user_time: 2.001355e-05   Inputs:            42            1          264           42          306      1226658 N N
Routine:   dgemm_  nCalls:  12246  Mean_time 3.380307e-05   nUserCalls:  12246  Mean_user_time: 3.380307e-05   Inputs:           246            1           36           36           36          246 T N
Routine:   dgemm_  nCalls:  17012  Mean_time 2.844690e-05   nUserCalls:  17012  Mean_user_time: 2.844690e-05   Inputs:           240            1           48           48           48          240 T N
Routine:   dgemm_  nCalls:  13042  Mean_time 2.803772e-05   nUserCalls:  13042  Mean_user_time: 2.803772e-05   Inputs:           210            1           54           54           54          210 T N
Routine:   dgemm_  nCalls:  10982  Mean_time 2.659640e-05   nUserCalls:  10982  Mean_user_time: 2.659640e-05   Inputs:           276            1           48           48           48          276 T N
Routine:   dgemm_  nCalls:  24612  Mean_time 2.413749e-05   nUserCalls:  24612  Mean_user_time: 2.413749e-05   Inputs:           204            1           42           42           42          204 T N
Routine:   dgemm_  nCalls:  19862  Mean_time 2.381463e-05   nUserCalls:  19862  Mean_user_time: 2.381463e-05   Inputs:           216            1           42           42           42          216 T N
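The per-routine total lines give a quick overall figure as well. The sketch below divides totTime by cnt to get an average time per call. It assumes that totTime is accumulated time in seconds and that cnt is the number of calls it was accumulated over; the function names are hypothetical.

```python
import re

# Numeric fields of a per-routine total line, e.g. "cnt=   30596724".
FIELD_RE = re.compile(r"\b(cnt|totTime|called_tot|topTime)=\s*([0-9.]+)")

def parse_total(line):
    """Return the routine name and numeric fields of one total line."""
    fields = {key: float(value) for key, value in FIELD_RE.findall(line)}
    fields["routine"] = line.split()[0]
    return fields

def mean_call_time(line):
    """Average time per call, assuming totTime is summed over cnt calls."""
    fields = parse_total(line)
    return fields["totTime"] / fields["cnt"]

# Total lines copied from the two DGEMM listings above.
TOTAL_32 = "dgemm_     cnt=   30596724  totTime=     272.4157   called_tot=   30596724  topTime=     272.4157    (%age of runtime:  2.876 )"
TOTAL_128 = "dgemm_     cnt=   30597188  totTime=     350.5334   called_tot=   30597188  topTime=     350.5334    (%age of runtime:  3.982 )"

t32 = mean_call_time(TOTAL_32)
t128 = mean_call_time(TOTAL_128)
print(f"32 threads:  {t32:.3e} per call")
print(f"128 threads: {t128:.3e} per call")
print(f"ratio:       {t128 / t32:.2f}")
```

The overall ratio comes out smaller than the per-input ratios of the example calls, because the total covers every DGEMM call in the run and not just the ten shapes listed.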

It would be nice to have this fixed.

  • I came across another scaling issue. It affects both macOS and Linux. It is less noticeable on macOS because Apple silicon has relatively few cores, but on AWS Graviton3E (64 cores) and especially on Ampere Altra Max (128 cores) the performance hit is massive. Every other BLAS/LAPACK implementation I tested (vecLib, Netlib, OpenBLAS) outperformed ArmPL for the workload in question on both macOS and Linux. The problem seems to be that DGELS scales poorly. The software calls it in both multi-threaded mode and multi-instance mode (several threads calling it at the same time). Below are some example inputs:

    N 117 20 1 117 117 580 0
    N 117 20 1 117 117 -1 0
    N 117 10 1 189 189 580 0
    N 117 10 1 189 189 -1 0
    N 153 20 1 153 153 580 0
    N 153 20 1 153 153 -1 0
    N 189 20 1 189 189 580 0
    N 189 20 1 189 189 -1 0
    N 99 20 1 117 117 580 0
    N 99 20 1 117 117 -1 0
    N 63 10 1 189 189 580 0
    N 63 10 1 189 189 -1 0

    I am not sure whether this is only a scaling issue or whether the performance of DGELS in single-thread mode can be improved too. Maybe the ArmPL team can have a look?
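For readers decoding the input lines: the standard LAPACK interface is DGELS(TRANS, M, N, NRHS, A, LDA, B, LDB, WORK, LWORK, INFO), and LWORK = -1 requests a workspace-size query instead of a solve. Assuming the eight logged fields follow that argument order with the array arguments left out (an assumption about the logging format, not something confirmed here), the lines can be decoded as in the sketch below. `DgelsCall` and `parse_dgels` are hypothetical names.

```python
from typing import NamedTuple

class DgelsCall(NamedTuple):
    """Scalar DGELS arguments, in the assumed logging order."""
    trans: str
    m: int
    n: int
    nrhs: int
    lda: int
    ldb: int
    lwork: int
    info: int

    @property
    def is_workspace_query(self):
        # LAPACK convention: LWORK = -1 only returns the optimal workspace size.
        return self.lwork == -1

def parse_dgels(line):
    """Decode one logged input line such as 'N 117 20 1 117 117 580 0'."""
    trans, *numbers = line.split()
    return DgelsCall(trans, *map(int, numbers))

# Two of the example inputs listed above.
for line in ("N 117 20 1 117 117 580 0", "N 117 20 1 117 117 -1 0"):
    call = parse_dgels(line)
    kind = "workspace query" if call.is_workspace_query else "solve"
    print(f"{call.m}x{call.n}, nrhs={call.nrhs}, lwork={call.lwork}: {kind}")
```

Under that reading, every second line in the list is a workspace query, so each solve is paired with a query call of the same shape.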
