Scaling issues with ArmPL 23.04

Some time ago I reported a performance issue with ArmPL on CPUs with a large number of cores (see https://community.arm.com/support-forums/f/high-performance-computing-forum/53959/negative-armpl-mt-speed-up-on-many-core-systems). It was fixed in version 23.04, and as a result my application gained a lot of performance. Recently I did some more scaling tests and discovered further issues.

The testing was done on a 128-core Ampere Altra CPU running Ubuntu 22.04. The application uses ArmPL 23.04, LLVM OpenMP and is compiled with GCC 12. It uses ArmPL in two ways, depending on the algorithm: (1) a single application thread calls ArmPL; (2) multiple application threads call ArmPL at the same time. I did the performance profiling with perf-libs-tools.
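
To make mode (2) concrete, here is a minimal sketch (not the actual application code; the prototype and sizes are assumptions on my side) of several application OpenMP threads each factorising their own small matrix with ZGETRF, which is the pattern behind the problem described below:

    #include <complex.h>

    /* Assumed Fortran LAPACK prototype, as exported by libarmpl_mp. */
    extern void zgetrf_(const int *m, const int *n, double complex *a,
                        const int *lda, int *ipiv, int *info);

    /* Mode (2): each application thread factorises its own small matrix,
     * so ArmPL is called concurrently from many threads at once. */
    void factor_all(double complex **mats, int nmat, int n)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < nmat; i++) {
            int ipiv[300];            /* M = N never exceeds 300 here */
            int info;
            zgetrf_(&n, &n, mats[i], &n, ipiv, &info);
        }
    }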

ZGETRF
The problem with ZGETRF occurs when multiple application threads call it at the same time (as in the sketch above). Typical inputs are M=N, with sizes varying between 10 and 300. Maybe there is some locking issue? Summary from perf-libs-tools:

OMP_NUM_THREADS=32
zgetrf_     cnt=    2434696  totTime=    1504.5387 called_tot=     171712  topTime=      93.8037    (%age of runtime:  6.428 )

Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_915245.apl | sort -k8 -nr | head | sort -k12 -n
Routine:  zgetrf_  nCalls: 216920  Mean_time 8.647643e-05   nUserCalls:  12077  Mean_user_time: 8.447802e-05   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls:  66792  Mean_time 2.949315e-04   nUserCalls:   7386  Mean_user_time: 2.930019e-04   Inputs:           100          100          100            1            0
Routine:  zgetrf_  nCalls: 126456  Mean_time 3.720812e-04   nUserCalls:  13398  Mean_user_time: 3.665895e-04   Inputs:           110          110          110            1            0
Routine:  zgetrf_  nCalls: 201256  Mean_time 4.436309e-04   nUserCalls:  16952  Mean_user_time: 4.371607e-04   Inputs:           120          120          120            1            0
Routine:  zgetrf_  nCalls: 168344  Mean_time 5.473583e-04   nUserCalls:  14346  Mean_user_time: 5.396482e-04   Inputs:           130          130          130            1            0
Routine:  zgetrf_  nCalls: 147136  Mean_time 6.433120e-04   nUserCalls:  11615  Mean_user_time: 6.342174e-04   Inputs:           140          140          140            1            0
Routine:  zgetrf_  nCalls: 211288  Mean_time 6.791234e-04   nUserCalls:   9395  Mean_user_time: 6.745791e-04   Inputs:           142          142          142            1            0
Routine:  zgetrf_  nCalls: 123904  Mean_time 7.762974e-04   nUserCalls:   8320  Mean_user_time: 7.669183e-04   Inputs:           150          150          150            1            0
Routine:  zgetrf_  nCalls:  73040  Mean_time 8.895631e-04   nUserCalls:   5874  Mean_user_time: 8.788582e-04   Inputs:           160          160          160            1            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.083270e-03   nUserCalls:   9152  Mean_user_time: 1.073041e-03   Inputs:           170          170          170            1            0

OMP_NUM_THREADS=128
zgetrf_     cnt=    2434696  totTime=    2046.2830 called_tot=     241094  topTime=     151.9661    (%age of runtime:  9.300 )

Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_912580.apl | sort -k8 -nr | head | sort -k12 -n
Routine:  zgetrf_  nCalls: 216920  Mean_time 2.516856e-04   nUserCalls:  11667  Mean_user_time: 1.828228e-04   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls:  66792  Mean_time 4.422276e-04   nUserCalls:   9588  Mean_user_time: 3.222948e-04   Inputs:           100          100          100            1            0
Routine:  zgetrf_  nCalls: 126456  Mean_time 5.297068e-04   nUserCalls:  18734  Mean_user_time: 3.932569e-04   Inputs:           110          110          110            1            0
Routine:  zgetrf_  nCalls: 201256  Mean_time 6.306487e-04   nUserCalls:  23959  Mean_user_time: 4.647142e-04   Inputs:           120          120          120            1            0
Routine:  zgetrf_  nCalls: 168344  Mean_time 7.413436e-04   nUserCalls:  27496  Mean_user_time: 5.610537e-04   Inputs:           130          130          130            1            0
Routine:  zgetrf_  nCalls: 147136  Mean_time 8.560422e-04   nUserCalls:  23896  Mean_user_time: 6.514217e-04   Inputs:           140          140          140            1            0
Routine:  zgetrf_  nCalls: 123904  Mean_time 1.012120e-03   nUserCalls:  20248  Mean_user_time: 7.814398e-04   Inputs:           150          150          150            1            0
Routine:  zgetrf_  nCalls:  73040  Mean_time 1.143292e-03   nUserCalls:  14404  Mean_user_time: 8.928624e-04   Inputs:           160          160          160            1            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.406755e-03   nUserCalls:  10160  Mean_user_time: 1.091698e-03   Inputs:           170          170          170            1            0
Routine:  zgetrf_  nCalls:  19096  Mean_time 1.444096e-03   nUserCalls:   5755  Mean_user_time: 1.218681e-03   Inputs:           180          180          180            1            0


DGEMM
Even though DGEMM received substantial performance improvements, it still has some issues:

OMP_NUM_THREADS=32
dgemm_     cnt=   30596724  totTime=     272.4157   called_tot=   30596724  topTime=     272.4157    (%age of runtime:  2.876 )

Example calls:
Routine:   dgemm_  nCalls:  14728  Mean_time 1.281673e-05   nUserCalls:  14728  Mean_user_time: 1.281673e-05   Inputs:            36            1          252           36          288      1226658 N N
Routine:   dgemm_  nCalls:   9884  Mean_time 9.806415e-06   nUserCalls:   9884  Mean_user_time: 9.806415e-06   Inputs:           264            1           42           42           42          264 T N
Routine:   dgemm_  nCalls:  14728  Mean_time 9.071300e-06   nUserCalls:  14728  Mean_user_time: 9.071300e-06   Inputs:           252            1           36           36           36          252 T N
Routine:   dgemm_  nCalls:   9884  Mean_time 1.381515e-05   nUserCalls:   9884  Mean_user_time: 1.381515e-05   Inputs:            42            1          264           42          306      1226658 N N
Routine:   dgemm_  nCalls:  12238  Mean_time 9.176557e-06   nUserCalls:  12238  Mean_user_time: 9.176557e-06   Inputs:           246            1           36           36           36          246 T N
Routine:   dgemm_  nCalls:  17002  Mean_time 1.015412e-05   nUserCalls:  17002  Mean_user_time: 1.015412e-05   Inputs:           240            1           48           48           48          240 T N
Routine:   dgemm_  nCalls:  13044  Mean_time 1.026566e-05   nUserCalls:  13044  Mean_user_time: 1.026566e-05   Inputs:           210            1           54           54           54          210 T N
Routine:   dgemm_  nCalls:  10980  Mean_time 1.011446e-05   nUserCalls:  10980  Mean_user_time: 1.011446e-05   Inputs:           276            1           48           48           48          276 T N
Routine:   dgemm_  nCalls:  24626  Mean_time 9.204542e-06   nUserCalls:  24626  Mean_user_time: 9.204542e-06   Inputs:           204            1           42           42           42          204 T N
Routine:   dgemm_  nCalls:  19882  Mean_time 9.293070e-06   nUserCalls:  19882  Mean_user_time: 9.293070e-06   Inputs:           216            1           42           42           42          216 T N

OMP_NUM_THREADS=128
dgemm_     cnt=   30597188  totTime=     350.5334   called_tot=   30597188  topTime=     350.5334    (%age of runtime:  3.982 )

Example calls:
Routine:   dgemm_  nCalls:  14728  Mean_time 3.143310e-05   nUserCalls:  14728  Mean_user_time: 3.143310e-05   Inputs:            36            1          252           36          288      1226658 N N
Routine:   dgemm_  nCalls:   9884  Mean_time 2.066992e-05   nUserCalls:   9884  Mean_user_time: 2.066992e-05   Inputs:           264            1           42           42           42          264 T N
Routine:   dgemm_  nCalls:  14728  Mean_time 1.720480e-05   nUserCalls:  14728  Mean_user_time: 1.720480e-05   Inputs:           252            1           36           36           36          252 T N
Routine:   dgemm_  nCalls:   9884  Mean_time 2.001355e-05   nUserCalls:   9884  Mean_user_time: 2.001355e-05   Inputs:            42            1          264           42          306      1226658 N N
Routine:   dgemm_  nCalls:  12246  Mean_time 3.380307e-05   nUserCalls:  12246  Mean_user_time: 3.380307e-05   Inputs:           246            1           36           36           36          246 T N
Routine:   dgemm_  nCalls:  17012  Mean_time 2.844690e-05   nUserCalls:  17012  Mean_user_time: 2.844690e-05   Inputs:           240            1           48           48           48          240 T N
Routine:   dgemm_  nCalls:  13042  Mean_time 2.803772e-05   nUserCalls:  13042  Mean_user_time: 2.803772e-05   Inputs:           210            1           54           54           54          210 T N
Routine:   dgemm_  nCalls:  10982  Mean_time 2.659640e-05   nUserCalls:  10982  Mean_user_time: 2.659640e-05   Inputs:           276            1           48           48           48          276 T N
Routine:   dgemm_  nCalls:  24612  Mean_time 2.413749e-05   nUserCalls:  24612  Mean_user_time: 2.413749e-05   Inputs:           204            1           42           42           42          204 T N
Routine:   dgemm_  nCalls:  19862  Mean_time 2.381463e-05   nUserCalls:  19862  Mean_user_time: 2.381463e-05   Inputs:           216            1           42           42           42          216 T N
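
For reference, my reading of the Inputs columns above (an assumption about the perf-libs-tools log format) is m, n, k, lda, ldb, ldc followed by transa and transb. If that is right, nearly all of these calls have n = 1, i.e. they are effectively matrix-vector sized and far too small to spread across many cores. A minimal sketch of one such call, using the sizes from the first logged line with simplified leading dimensions:

    /* Assumed Fortran BLAS prototype, as exported by libarmpl_mp. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    /* C(36x1) = A(36x252) * B(252x1), i.e. one of the tiny calls above.
     * In the real application the operands are slices of larger arrays
     * (hence the large logged leading dimensions); here they stand alone. */
    void tiny_gemm(const double *a, const double *b, double *c)
    {
        int m = 36, n = 1, k = 252;
        int lda = 36, ldb = 252, ldc = 36;
        double alpha = 1.0, beta = 0.0;
        dgemm_("N", "N", &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc);
    }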

It would be nice to have this fixed.

  • Hi,

    Thanks for the report, and also for using perf-libs-tools!

    We are currently in the process of producing a new release of Arm PL (23.10), which will appear in the next few weeks. Unfortunately, these problems are unlikely to be addressed as part of that release. However, if we can pin down the underlying issue, we can hopefully provide an explanation and a possible fix in future releases.

    You mentioned 

    The application uses ArmPL 23.04, LLVM OpenMP and is compiled with GCC 12.

    This makes me wonder if you are mixing OpenMP runtimes. If the library you're using was built with GCC it will have a dependency on libgomp (the GNU OpenMP library); if your application is using LLVM OpenMP, then it's possible that you're seeing bad performance from unintended nested parallelism.
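
    Purely as an illustration (a contrived standalone example, not ArmPL code): when the application's parallel region and the library's parallel region belong to different OpenMP runtimes, the inner runtime believes it is at nesting level 0 and opens a full team of its own, so the thread counts multiply.

        #include <stdio.h>
        #include <omp.h>

        int main(void)
        {
            /* The outer region stands in for the application's threads, the
             * inner one for the region a threaded BLAS/LAPACK call would open.
             * Within a single runtime, OMP_MAX_ACTIVE_LEVELS=1 keeps the inner
             * team at one thread; with two different runtimes the inner one
             * starts again from level 0 and fans out, up to N*N threads. */
            #pragma omp parallel num_threads(4)
            {
                #pragma omp parallel num_threads(4)
                {
                    #pragma omp critical
                    printf("level %d, inner team size %d\n",
                           omp_get_level(), omp_get_num_threads());
                }
            }
            return 0;
        }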

    Please could you execute one of the Arm PL shared libraries in the distribution you're using at the command line? The shared libraries are executable, and should print out some diagnostic info. It would be useful if you could post that info in reply.

    Best Regards,

    Chris.

  • I don't think there is any OpenMP runtime mixing. I use my own wrapper library around ArmPL, which I build roughly like this:


    gcc -Iinclude -O2 -fPIC -fmath-errno -std=gnu99 -fopenmp -o obj/aarch64/myarmpl.o -c myarmpl.c
    gcc -shared -Llib/aarch64 -fPIC -pthread -larmpl_mp -lomp -lastring -lamath -lm -o lib/aarch64/libmyarmpl.so obj/aarch64/myarmpl.o

    It segfaults when I execute it. Here is what it depends on:

    ldd libmyarmpl.so
            linux-vdso.so.1 (0x0000ffff99958000)
            libomp.so => not found
            libastring.so => not found
            libamath.so => not found
            libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff94620000)
            libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff945f0000)
            libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffff945d0000)
            libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff94420000)
            /lib/ld-linux-aarch64.so.1 (0x0000ffff9991f000)

    I have verified with lsof -p <PID> | grep omp that it doesn't load libgomp.so at runtime; it loads libomp.so only.

  • Thanks for your reply. Please could you execute the original Arm PL library you're linking to, libarmpl_mp.so, just so that we can see the information it produces.

  • Yes, this is what it reports (this is on the build machine which has a different CPU and OS):

    $ LD_LIBRARY_PATH=/opt/arm/armpl-23.04.0_RHEL-8_gcc/lib /opt/arm/armpl-23.04.0_RHEL-8_gcc/lib/libarmpl_mp.so
    Arm Performance Libraries
    Version 23.04.0
    Built from: 520bc09dc
    Target Generic AArch64 (lp64+openmp)
    Runtime target Generic AArch64
    Available targets:
      ThunderX2
      Neoverse N1
      Generic AArch64
      A64FX
      Neoverse V1
      Generic SVE
    Compiled by gcc (GCC) 12.2.0
    This build contains both NEON and SVE routine types.

    Runtime machine details (parsed from getauxval(AT_HWCAP)):
      Implementer: 0x50 (P)
      Part number: 0x0
      Part variant: 0x3
      Part revision: 0x2
      Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

    If you require support or would like to provide feedback, please contact support-hpc-sw@arm.com

    I forgot to mention: if I build the application with Arm Compiler for Linux 23.04 instead (and use the libarmpl_mp.a that comes with it), I get similar scaling problems. Here is an example workload:

    Run time (s)                   32 threads   128 threads
    GCC 12                                190           249
    Arm Compiler for Linux 23.04          223           329
  • Thanks for this, we'll try to investigate and see what's going on.

  • Hello. Just to confirm, we've observed similar scaling issues going from 32 to 128 cores for these small ZGETRF and DGEMM problems when working from cold caches. We'll be working on addressing these. FYI, we've just released version 23.10, but that version doesn't contain any tunings that address these issues.

  • Thanks! I've started evaluating version 23.10. It looks good so far (no new issues observed, some old issues fixed).

  • I came across another scaling issue. It affects both macOS and Linux. It is less noticeable on macOS because Apple silicon has relatively few cores, but on AWS Graviton3E (64 cores) and especially on Ampere Altra Max (128 cores) the performance hit is massive. Every BLAS/LAPACK implementation I tested (vecLib, Netlib, OpenBLAS) outperformed ArmPL for the workload in question, on both macOS and Linux. The problem seems to be that DGELS scales poorly. The software calls it in both multi-threaded mode and multi-instance mode (several threads calling it at the same time). Below are some example inputs:

    N 117 20 1 117 117 580 0
    N 117 20 1 117 117 -1 0
    N 117 10 1 189 189 580 0
    N 117 10 1 189 189 -1 0
    N 153 20 1 153 153 580 0
    N 153 20 1 153 153 -1 0
    N 189 20 1 189 189 580 0
    N 189 20 1 189 189 -1 0
    N 99 20 1 117 117 580 0
    N 99 20 1 117 117 -1 0
    N 63 10 1 189 189 580 0
    N 63 10 1 189 189 -1 0

    I am not sure whether this is purely a scaling issue or whether the single-threaded performance of DGELS could be improved as well. Maybe the ArmPL team can have a look?
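
    For reference, my reading of those input columns (an assumption about the perf-libs-tools log format) is trans, m, n, nrhs, lda, ldb, lwork, info; the lwork = -1 lines are workspace queries. A minimal sketch of the calling pattern, using the sizes from the first pair of lines above:

        #include <stdlib.h>

        /* Assumed Fortran LAPACK prototype, as exported by libarmpl_mp. */
        extern void dgels_(const char *trans, const int *m, const int *n,
                           const int *nrhs, double *a, const int *lda,
                           double *b, const int *ldb,
                           double *work, const int *lwork, int *info);

        /* Least-squares solve matching the first pair of logged calls:
         * trans = 'N', m = 117, n = 20, nrhs = 1, lda = ldb = 117.
         * The first call is the lwork = -1 workspace query, the second
         * the actual solve. */
        void solve_ls(double *a, double *b)
        {
            int m = 117, n = 20, nrhs = 1, lda = 117, ldb = 117, info;
            int lwork = -1;
            double wkopt;
            dgels_("N", &m, &n, &nrhs, a, &lda, b, &ldb, &wkopt, &lwork, &info);
            lwork = (int)wkopt;                /* 580 in the log above */
            double *work = malloc((size_t)lwork * sizeof *work);
            dgels_("N", &m, &n, &nrhs, a, &lda, b, &ldb, work, &lwork, &info);
            free(work);
        }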

  • Thanks, we'll take a look at this too. It's probably a similar underlying issue to the one we've been addressing for GETRF, which affects these small problems running on large numbers of cores. This time the problem may actually be in GEQRF as called by GELS.

    Chris.

  • Yes, could be. Our workloads that make heavy use of DGELS perform better with OpenBLAS. We also see performance issues with DGBTRS, and possibly with DGBTRF. Are they all related to GEQRF? (A sketch of how we call the banded routines is at the end of this reply.)

    Some example DGBTRS inputs:
    Inputs: 10 1 1 1 4 1 10 0 N
    Inputs: 106 27 27 1 82 1 106 0 N
    Inputs: 11 1 1 1 4 1 11 0 N
    Inputs: 11 2 2 1 7 1 11 0 N
    Inputs: 112 31 31 1 94 1 112 0 N
    Inputs: 11 3 3 1 10 1 11 0 N
    Inputs: 120 31 31 1 94 1 120 0 N
    Inputs: 12 2 2 1 7 1 12 0 N
    Inputs: 12 3 3 1 10 1 12 0 N
    Inputs: 13 2 2 1 7 1 13 0 N
    Inputs: 13 3 3 1 10 1 13 0 N
    Inputs: 14 2 2 1 7 1 14 0 N
    Inputs: 14 3 3 1 10 1 14 0 N
    Inputs: 15 2 2 1 7 1 15 0 N
    Inputs: 15 3 3 1 10 1 15 0 N
    Inputs: 16 3 3 1 10 1 16 0 N
    Inputs: 17 3 3 1 10 1 17 0 N
    Inputs: 18 3 3 1 10 1 18 0 N
    Inputs: 19 3 3 1 10 1 19 0 N
    Inputs: 20 2 2 1 7 1 20 0 N
    Inputs: 20 3 3 1 10 1 20 0 N
    Inputs: 20 4 4 1 13 1 20 0 N
    Inputs: 20 5 5 1 16 1 20 0 N
    Inputs: 20 6 6 1 19 1 20 0 N
    Inputs: 5 1 1 1 4 1 5 0 N
    Inputs: 6 1 1 1 4 1 6 0 N
    Inputs: 7 1 1 1 4 1 7 0 N
    Inputs: 8 1 1 1 4 1 8 0 N
    Inputs: 9 1 1 1 4 1 9 0 N
    Inputs: 9 2 2 1 7 1 9 0 N
    Inputs: 98 27 27 1 82 1 98 0 N

    Some example DGELS inputs:

    Inputs: 16 10 2 16 16 -1 0 N
    Inputs: 16 10 2 16 16 570 0 N
    Inputs: 16 10 2 19 19 -1 0 N
    Inputs: 16 10 2 19 19 570 0 N
    Inputs: 16 10 2 22 22 -1 0 N
    Inputs: 16 10 2 22 22 570 0 N
    Inputs: 16 10 2 26 26 -1 0 N
    Inputs: 16 10 2 26 26 570 0 N
    Inputs: 16 10 2 29 29 -1 0 N
    Inputs: 16 10 2 29 29 570 0 N
    Inputs: 16 10 2 32 32 -1 0 N
    Inputs: 16 10 2 32 32 570 0 N
    Inputs: 16 10 2 36 36 -1 0 N
    Inputs: 16 10 2 36 36 570 0 N
    Inputs: 19 10 2 19 19 -1 0 N
    Inputs: 19 10 2 19 19 570 0 N
    Inputs: 19 10 2 22 22 -1 0 N
    Inputs: 19 10 2 22 22 570 0 N
    Inputs: 19 10 2 26 26 -1 0 N
    Inputs: 19 10 2 26 26 570 0 N
    Inputs: 19 10 2 29 29 -1 0 N
    Inputs: 19 10 2 29 29 570 0 N
    Inputs: 19 10 2 32 32 -1 0 N
    Inputs: 19 10 2 32 32 570 0 N
    Inputs: 19 10 2 36 36 -1 0 N
    Inputs: 19 10 2 36 36 570 0 N
    Inputs: 22 10 2 22 22 -1 0 N
    Inputs: 22 10 2 22 22 570 0 N
    Inputs: 22 10 2 26 26 -1 0 N
    Inputs: 22 10 2 26 26 570 0 N
    Inputs: 22 10 2 29 29 -1 0 N
    Inputs: 22 10 2 29 29 570 0 N
    Inputs: 22 10 2 32 32 -1 0 N
    Inputs: 22 10 2 32 32 570 0 N
    Inputs: 22 10 2 36 36 -1 0 N
    Inputs: 22 10 2 36 36 570 0 N
    Inputs: 26 10 2 26 26 -1 0 N
    Inputs: 26 10 2 26 26 570 0 N
    Inputs: 26 10 2 29 29 -1 0 N
    Inputs: 26 10 2 29 29 570 0 N
    Inputs: 26 10 2 32 32 -1 0 N
    Inputs: 26 10 2 32 32 570 0 N
    Inputs: 26 10 2 36 36 -1 0 N
    Inputs: 26 10 2 36 36 570 0 N
    Inputs: 29 10 2 29 29 -1 0 N
    Inputs: 29 10 2 29 29 570 0 N
    Inputs: 29 10 2 32 32 -1 0 N
    Inputs: 29 10 2 32 32 570 0 N
    Inputs: 29 10 2 36 36 -1 0 N
    Inputs: 29 10 2 36 36 570 0 N
    Inputs: 32 10 2 32 32 -1 0 N
    Inputs: 32 10 2 32 32 570 0 N
    Inputs: 32 10 2 36 36 -1 0 N
    Inputs: 32 10 2 36 36 570 0 N
    Inputs: 36 10 2 36 36 -1 0 N
    Inputs: 36 10 2 36 36 570 0 N
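
    As mentioned above, here is a minimal sketch of how the banded routines are called (my reading of the smallest DGBTRS case in the list: n = 10, kl = ku = 1, nrhs = 1, ldab = 2*kl + ku + 1 = 4; the prototypes are the standard LAPACK Fortran interface and are an assumption on my side):

        /* Assumed Fortran LAPACK prototypes, as exported by libarmpl_mp. */
        extern void dgbtrf_(const int *m, const int *n, const int *kl,
                            const int *ku, double *ab, const int *ldab,
                            int *ipiv, int *info);
        extern void dgbtrs_(const char *trans, const int *n, const int *kl,
                            const int *ku, const int *nrhs, const double *ab,
                            const int *ldab, const int *ipiv, double *b,
                            const int *ldb, int *info);

        /* Factor and solve one of the tiny banded systems listed above:
         * n = 10, kl = ku = 1 (near-tridiagonal), one right-hand side.
         * ab is in LAPACK band storage with ldab = 2*kl + ku + 1 = 4. */
        void band_solve(double *ab, double *b)
        {
            int n = 10, kl = 1, ku = 1, nrhs = 1, ldab = 4, ldb = 10, info;
            int ipiv[10];
            dgbtrf_(&n, &n, &kl, &ku, ab, &ldab, ipiv, &info);
            dgbtrs_("N", &n, &kl, &ku, &nrhs, ab, &ldab, ipiv, b, &ldb, &info);
        }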

  • DGBTRS and DGBTRF are not related to GEQRF, but the issue here is the same: your problems are small, and our implementation is geared towards large parallel problems. We have a few LAPACK functions which are implemented in a similar way.

    In order to understand the best way to deal with these issues, would you mind getting in touch via support-hpc-sw@arm.com? We have a few options, but it would be better to find out which is the most appropriate for your use cases.

    Regards, Chris.