Scaling issues with ArmPL 23.04

Some time ago I reported a performance issue with ArmPL on CPUs with a large number of cores (see https://community.arm.com/support-forums/f/high-performance-computing-forum/53959/negative-armpl-mt-speed-up-on-many-core-systems). It was fixed in version 23.04, and as a result my application gained a lot of performance. Recently I did some more scaling tests and discovered further issues.

The testing was done on a 128-core Ampere Altra CPU running Ubuntu 22.04. The application uses ArmPL 23.04, LLVM OpenMP and is compiled with GCC 12. It uses ArmPL in two ways, depending on the algorithm: (1) a single application thread calls ArmPL; (2) multiple application threads call ArmPL at the same time. I did the performance profiling with perf-libs-tools.
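
To make mode (2) concrete, here is a minimal sketch (not the actual application code; the prototype and sizes are assumptions on my side) of several application OpenMP threads each factorising their own small matrix with ZGETRF, which is the pattern behind the problem described below:

    #include <complex.h>

    /* Assumed Fortran LAPACK prototype, as exported by libarmpl_mp. */
    extern void zgetrf_(const int *m, const int *n, double complex *a,
                        const int *lda, int *ipiv, int *info);

    /* Mode (2): each application thread factorises its own small matrix,
     * so ArmPL is called concurrently from many threads at once. */
    void factor_all(double complex **mats, int nmat, int n)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < nmat; i++) {
            int ipiv[300];            /* M = N never exceeds 300 here */
            int info;
            zgetrf_(&n, &n, mats[i], &n, ipiv, &info);
        }
    }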

ZGETRF
The problem with ZGETRF occurs when multiple application threads call it at the same time (as in the sketch above). Typical inputs are M=N, with sizes varying between 10 and 300. Maybe there is some locking issue? Summary from perf-libs-tools:

OMP_NUM_THREADS=32
zgetrf_     cnt=    2434696  totTime=    1504.5387 called_tot=     171712  topTime=      93.8037    (%age of runtime:  6.428 )

Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_915245.apl | sort -k8 -nr | head | sort -k12 -n
Routine:  zgetrf_  nCalls: 216920  Mean_time 8.647643e-05   nUserCalls:  12077  Mean_user_time: 8.447802e-05   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls:  66792  Mean_time 2.949315e-04   nUserCalls:   7386  Mean_user_time: 2.930019e-04   Inputs:           100          100          100            1            0
Routine:  zgetrf_  nCalls: 126456  Mean_time 3.720812e-04   nUserCalls:  13398  Mean_user_time: 3.665895e-04   Inputs:           110          110          110            1            0
Routine:  zgetrf_  nCalls: 201256  Mean_time 4.436309e-04   nUserCalls:  16952  Mean_user_time: 4.371607e-04   Inputs:           120          120          120            1            0
Routine:  zgetrf_  nCalls: 168344  Mean_time 5.473583e-04   nUserCalls:  14346  Mean_user_time: 5.396482e-04   Inputs:           130          130          130            1            0
Routine:  zgetrf_  nCalls: 147136  Mean_time 6.433120e-04   nUserCalls:  11615  Mean_user_time: 6.342174e-04   Inputs:           140          140          140            1            0
Routine:  zgetrf_  nCalls: 211288  Mean_time 6.791234e-04   nUserCalls:   9395  Mean_user_time: 6.745791e-04   Inputs:           142          142          142            1            0
Routine:  zgetrf_  nCalls: 123904  Mean_time 7.762974e-04   nUserCalls:   8320  Mean_user_time: 7.669183e-04   Inputs:           150          150          150            1            0
Routine:  zgetrf_  nCalls:  73040  Mean_time 8.895631e-04   nUserCalls:   5874  Mean_user_time: 8.788582e-04   Inputs:           160          160          160            1            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.083270e-03   nUserCalls:   9152  Mean_user_time: 1.073041e-03   Inputs:           170          170          170            1            0

OMP_NUM_THREADS=128
zgetrf_     cnt=    2434696  totTime=    2046.2830 called_tot=     241094  topTime=     151.9661    (%age of runtime:  9.300 )

Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_912580.apl | sort -k8 -nr | head | sort -k12 -n
Routine:  zgetrf_  nCalls: 216920  Mean_time 2.516856e-04   nUserCalls:  11667  Mean_user_time: 1.828228e-04   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls:  66792  Mean_time 4.422276e-04   nUserCalls:   9588  Mean_user_time: 3.222948e-04   Inputs:           100          100          100            1            0
Routine:  zgetrf_  nCalls: 126456  Mean_time 5.297068e-04   nUserCalls:  18734  Mean_user_time: 3.932569e-04   Inputs:           110          110          110            1            0
Routine:  zgetrf_  nCalls: 201256  Mean_time 6.306487e-04   nUserCalls:  23959  Mean_user_time: 4.647142e-04   Inputs:           120          120          120            1            0
Routine:  zgetrf_  nCalls: 168344  Mean_time 7.413436e-04   nUserCalls:  27496  Mean_user_time: 5.610537e-04   Inputs:           130          130          130            1            0
Routine:  zgetrf_  nCalls: 147136  Mean_time 8.560422e-04   nUserCalls:  23896  Mean_user_time: 6.514217e-04   Inputs:           140          140          140            1            0
Routine:  zgetrf_  nCalls: 123904  Mean_time 1.012120e-03   nUserCalls:  20248  Mean_user_time: 7.814398e-04   Inputs:           150          150          150            1            0
Routine:  zgetrf_  nCalls:  73040  Mean_time 1.143292e-03   nUserCalls:  14404  Mean_user_time: 8.928624e-04   Inputs:           160          160          160            1            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.406755e-03   nUserCalls:  10160  Mean_user_time: 1.091698e-03   Inputs:           170          170          170            1            0
Routine:  zgetrf_  nCalls:  19096  Mean_time 1.444096e-03   nUserCalls:   5755  Mean_user_time: 1.218681e-03   Inputs:           180          180          180            1            0


DGEMM
Even though DGEMM received substantial performance improvements, it still has some issues:

OMP_NUM_THREADS=32
dgemm_     cnt=   30596724  totTime=     272.4157   called_tot=   30596724  topTime=     272.4157    (%age of runtime:  2.876 )

Example calls:
Routine:   dgemm_  nCalls:  14728  Mean_time 1.281673e-05   nUserCalls:  14728  Mean_user_time: 1.281673e-05   Inputs:            36            1          252           36          288      1226658 N N
Routine:   dgemm_  nCalls:   9884  Mean_time 9.806415e-06   nUserCalls:   9884  Mean_user_time: 9.806415e-06   Inputs:           264            1           42           42           42          264 T N
Routine:   dgemm_  nCalls:  14728  Mean_time 9.071300e-06   nUserCalls:  14728  Mean_user_time: 9.071300e-06   Inputs:           252            1           36           36           36          252 T N
Routine:   dgemm_  nCalls:   9884  Mean_time 1.381515e-05   nUserCalls:   9884  Mean_user_time: 1.381515e-05   Inputs:            42            1          264           42          306      1226658 N N
Routine:   dgemm_  nCalls:  12238  Mean_time 9.176557e-06   nUserCalls:  12238  Mean_user_time: 9.176557e-06   Inputs:           246            1           36           36           36          246 T N
Routine:   dgemm_  nCalls:  17002  Mean_time 1.015412e-05   nUserCalls:  17002  Mean_user_time: 1.015412e-05   Inputs:           240            1           48           48           48          240 T N
Routine:   dgemm_  nCalls:  13044  Mean_time 1.026566e-05   nUserCalls:  13044  Mean_user_time: 1.026566e-05   Inputs:           210            1           54           54           54          210 T N
Routine:   dgemm_  nCalls:  10980  Mean_time 1.011446e-05   nUserCalls:  10980  Mean_user_time: 1.011446e-05   Inputs:           276            1           48           48           48          276 T N
Routine:   dgemm_  nCalls:  24626  Mean_time 9.204542e-06   nUserCalls:  24626  Mean_user_time: 9.204542e-06   Inputs:           204            1           42           42           42          204 T N
Routine:   dgemm_  nCalls:  19882  Mean_time 9.293070e-06   nUserCalls:  19882  Mean_user_time: 9.293070e-06   Inputs:           216            1           42           42           42          216 T N

OMP_NUM_THREADS=128
dgemm_     cnt=   30597188  totTime=     350.5334   called_tot=   30597188  topTime=     350.5334    (%age of runtime:  3.982 )

Example calls:
Routine:   dgemm_  nCalls:  14728  Mean_time 3.143310e-05   nUserCalls:  14728  Mean_user_time: 3.143310e-05   Inputs:            36            1          252           36          288      1226658 N N
Routine:   dgemm_  nCalls:   9884  Mean_time 2.066992e-05   nUserCalls:   9884  Mean_user_time: 2.066992e-05   Inputs:           264            1           42           42           42          264 T N
Routine:   dgemm_  nCalls:  14728  Mean_time 1.720480e-05   nUserCalls:  14728  Mean_user_time: 1.720480e-05   Inputs:           252            1           36           36           36          252 T N
Routine:   dgemm_  nCalls:   9884  Mean_time 2.001355e-05   nUserCalls:   9884  Mean_user_time: 2.001355e-05   Inputs:            42            1          264           42          306      1226658 N N
Routine:   dgemm_  nCalls:  12246  Mean_time 3.380307e-05   nUserCalls:  12246  Mean_user_time: 3.380307e-05   Inputs:           246            1           36           36           36          246 T N
Routine:   dgemm_  nCalls:  17012  Mean_time 2.844690e-05   nUserCalls:  17012  Mean_user_time: 2.844690e-05   Inputs:           240            1           48           48           48          240 T N
Routine:   dgemm_  nCalls:  13042  Mean_time 2.803772e-05   nUserCalls:  13042  Mean_user_time: 2.803772e-05   Inputs:           210            1           54           54           54          210 T N
Routine:   dgemm_  nCalls:  10982  Mean_time 2.659640e-05   nUserCalls:  10982  Mean_user_time: 2.659640e-05   Inputs:           276            1           48           48           48          276 T N
Routine:   dgemm_  nCalls:  24612  Mean_time 2.413749e-05   nUserCalls:  24612  Mean_user_time: 2.413749e-05   Inputs:           204            1           42           42           42          204 T N
Routine:   dgemm_  nCalls:  19862  Mean_time 2.381463e-05   nUserCalls:  19862  Mean_user_time: 2.381463e-05   Inputs:           216            1           42           42           42          216 T N
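
For reference, my reading of the Inputs columns above (an assumption about the perf-libs-tools log format) is m, n, k, lda, ldb, ldc followed by transa and transb. If that is right, nearly all of these calls have n = 1, i.e. they are effectively matrix-vector sized and far too small to spread across many cores. A minimal sketch of one such call, using the sizes from the first logged line with simplified leading dimensions:

    /* Assumed Fortran BLAS prototype, as exported by libarmpl_mp. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    /* C(36x1) = A(36x252) * B(252x1), i.e. one of the tiny calls above.
     * In the real application the operands are slices of larger arrays
     * (hence the large logged leading dimensions); here they stand alone. */
    void tiny_gemm(const double *a, const double *b, double *c)
    {
        int m = 36, n = 1, k = 252;
        int lda = 36, ldb = 252, ldc = 36;
        double alpha = 1.0, beta = 0.0;
        dgemm_("N", "N", &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc);
    }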

It would be nice to have this fixed.

  • Hi,

    Thanks for the report, and also for using perf-libs-tools!

    We are currently in the process of producing a new release of Arm PL (23.10), which will appear in the next few weeks. Unfortunately, these problems are unlikely to be addressed as part of that release. However, if we can pin down the underlying issue, we can hopefully provide an explanation and a possible fix in future releases.

    You mentioned 

    The application uses ArmPL 23.04, LLVM OpenMP and is compiled with GCC 12.

    This makes me wonder if you are mixing OpenMP runtimes. If the library you're using was built with GCC it will have a dependency on libgomp (the GNU OpenMP library); if your application is using LLVM OpenMP, then it's possible that you're seeing bad performance from unintended nested parallelism.
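
    Purely as an illustration (a contrived standalone example, not ArmPL code): when the application's parallel region and the library's parallel region belong to different OpenMP runtimes, the inner runtime believes it is at nesting level 0 and opens a full team of its own, so the thread counts multiply.

        #include <stdio.h>
        #include <omp.h>

        int main(void)
        {
            /* The outer region stands in for the application's threads, the
             * inner one for the region a threaded BLAS/LAPACK call would open.
             * Within a single runtime, OMP_MAX_ACTIVE_LEVELS=1 keeps the inner
             * team at one thread; with two different runtimes the inner one
             * starts again from level 0 and fans out, up to N*N threads. */
            #pragma omp parallel num_threads(4)
            {
                #pragma omp parallel num_threads(4)
                {
                    #pragma omp critical
                    printf("level %d, inner team size %d\n",
                           omp_get_level(), omp_get_num_threads());
                }
            }
            return 0;
        }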

    Please could you execute one of the Arm PL shared libraries in the distribution you're using at the command line? The shared libraries are executable, and should print out some diagnostic info. It would be useful if you could post that info in reply.

    Best Regards,

    Chris.

  • I don't think there is any OpenMP runtime mixing. I use my own wrapper library around ArmPL, which I build roughly like this:


    gcc -Iinclude -O2 -fPIC -fmath-errno -std=gnu99 -fopenmp -o obj/aarch64/myarmpl.o -c myarmpl.c
    gcc -shared -Llib/aarch64 -fPIC -pthread -larmpl_mp -lomp -lastring -lamath -lm -o lib/aarch64/libmyarmpl.so obj/aarch64/myarmpl.o

    It segfaults when I execute it. Here is what it depends on:

    ldd libmyarmpl.so
            linux-vdso.so.1 (0x0000ffff99958000)
            libomp.so => not found
            libastring.so => not found
            libamath.so => not found
            libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff94620000)
            libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff945f0000)
            libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffff945d0000)
            libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff94420000)
            /lib/ld-linux-aarch64.so.1 (0x0000ffff9991f000)

    I have verified with lsof -p <PID> | grep omp that it doesn't load libgomp.so at runtime; it loads libomp.so only.

  • Thanks for your reply. Please could you execute the original Arm PL library you're linking to, libarmpl_mp.so, just so that we can see the information it produces.

  • Yes, this is what it reports (this is on the build machine which has a different CPU and OS):

    $ LD_LIBRARY_PATH=/opt/arm/armpl-23.04.0_RHEL-8_gcc/lib /opt/arm/armpl-23.04.0_RHEL-8_gcc/lib/libarmpl_mp.so
    Arm Performance Libraries
    Version 23.04.0
    Built from: 520bc09dc
    Target Generic AArch64 (lp64+openmp)
    Runtime target Generic AArch64
    Available targets:
      ThunderX2
      Neoverse N1
      Generic AArch64
      A64FX
      Neoverse V1
      Generic SVE
    Compiled by gcc (GCC) 12.2.0
    This build contains both NEON and SVE routine types.

    Runtime machine details (parsed from getauxval(AT_HWCAP)):
      Implementer: 0x50 (P)
      Part number: 0x0
      Part variant: 0x3
      Part revision: 0x2
      Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

    If you require support or would like to provide feedback, please contact support-hpc-sw@arm.com

    I forgot to mention: if I build the application with Arm Compiler for Linux 23.04 instead (and use the libarmpl_mp.a that comes with it), I get similar scaling problems. Here is an example workload:

    Run time (s)                   32 threads   128 threads
    GCC 12                                190           249
    Arm Compiler for Linux 23.04          223           329
  • Thanks for this, we'll try to investigate and see what's going on.

  • Hello. Just to confirm, we've observed similar scaling issues going from 32 to 128 cores for these small ZGETRF and DGEMM problems when working from cold caches. We'll be working on addressing these. FYI, we've just released version 23.10, but that version doesn't contain any tunings that address these issues.

  • Thanks! I've started evaluating version 23.10. It looks good so far (no new issues observed, some old issues fixed).

  • I came across another scaling issue. It affects both macOS and Linux. It is less noticeable on macOS because Apple silicon has relatively few cores, but on AWS Graviton3E (64 cores) and especially on Ampere Altra Max (128 cores) the performance hit is massive. Every BLAS/LAPACK implementation I tested (vecLib, Netlib, OpenBLAS) outperformed ArmPL for the workload in question, on both macOS and Linux. The problem seems to be that DGELS scales poorly. The software calls it in both multi-threaded mode and multi-instance mode (several threads calling it at the same time). Below are some example inputs:

    N 117 20 1 117 117 580 0
    N 117 20 1 117 117 -1 0
    N 117 10 1 189 189 580 0
    N 117 10 1 189 189 -1 0
    N 153 20 1 153 153 580 0
    N 153 20 1 153 153 -1 0
    N 189 20 1 189 189 580 0
    N 189 20 1 189 189 -1 0
    N 99 20 1 117 117 580 0
    N 99 20 1 117 117 -1 0
    N 63 10 1 189 189 580 0
    N 63 10 1 189 189 -1 0

    I am not sure whether this is purely a scaling issue or whether the single-threaded performance of DGELS could be improved as well. Maybe the ArmPL team can have a look?
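
    For reference, my reading of those input columns (an assumption about the perf-libs-tools log format) is trans, m, n, nrhs, lda, ldb, lwork, info; the lwork = -1 lines are workspace queries. A minimal sketch of the calling pattern, using the sizes from the first pair of lines above:

        #include <stdlib.h>

        /* Assumed Fortran LAPACK prototype, as exported by libarmpl_mp. */
        extern void dgels_(const char *trans, const int *m, const int *n,
                           const int *nrhs, double *a, const int *lda,
                           double *b, const int *ldb,
                           double *work, const int *lwork, int *info);

        /* Least-squares solve matching the first pair of logged calls:
         * trans = 'N', m = 117, n = 20, nrhs = 1, lda = ldb = 117.
         * The first call is the lwork = -1 workspace query, the second
         * the actual solve. */
        void solve_ls(double *a, double *b)
        {
            int m = 117, n = 20, nrhs = 1, lda = 117, ldb = 117, info;
            int lwork = -1;
            double wkopt;
            dgels_("N", &m, &n, &nrhs, a, &lda, b, &ldb, &wkopt, &lwork, &info);
            lwork = (int)wkopt;                /* 580 in the log above */
            double *work = malloc((size_t)lwork * sizeof *work);
            dgels_("N", &m, &n, &nrhs, a, &lda, b, &ldb, work, &lwork, &info);
            free(work);
        }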

  • Thanks, we'll take a look at this too. It's probably a similar underlying issue to the one we've been addressing for GETRF, which affects these small problems running on large numbers of cores. This time the problem may actually be in GEQRF as called by GELS.

    Chris.

  • Yes, could be. Our workloads that make heavy use of DGELS perform better with OpenBLAS. We also see performance issues with DGBTRS, and possibly with DGBTRF. Are they all related to GEQRF? (A sketch of how we call the banded routines is at the end of this reply.)

    Some example DGBTRS inputs:
    Inputs: 10 1 1 1 4 1 10 0 N
    Inputs: 106 27 27 1 82 1 106 0 N
    Inputs: 11 1 1 1 4 1 11 0 N
    Inputs: 11 2 2 1 7 1 11 0 N
    Inputs: 112 31 31 1 94 1 112 0 N
    Inputs: 11 3 3 1 10 1 11 0 N
    Inputs: 120 31 31 1 94 1 120 0 N
    Inputs: 12 2 2 1 7 1 12 0 N
    Inputs: 12 3 3 1 10 1 12 0 N
    Inputs: 13 2 2 1 7 1 13 0 N
    Inputs: 13 3 3 1 10 1 13 0 N
    Inputs: 14 2 2 1 7 1 14 0 N
    Inputs: 14 3 3 1 10 1 14 0 N
    Inputs: 15 2 2 1 7 1 15 0 N
    Inputs: 15 3 3 1 10 1 15 0 N
    Inputs: 16 3 3 1 10 1 16 0 N
    Inputs: 17 3 3 1 10 1 17 0 N
    Inputs: 18 3 3 1 10 1 18 0 N
    Inputs: 19 3 3 1 10 1 19 0 N
    Inputs: 20 2 2 1 7 1 20 0 N
    Inputs: 20 3 3 1 10 1 20 0 N
    Inputs: 20 4 4 1 13 1 20 0 N
    Inputs: 20 5 5 1 16 1 20 0 N
    Inputs: 20 6 6 1 19 1 20 0 N
    Inputs: 5 1 1 1 4 1 5 0 N
    Inputs: 6 1 1 1 4 1 6 0 N
    Inputs: 7 1 1 1 4 1 7 0 N
    Inputs: 8 1 1 1 4 1 8 0 N
    Inputs: 9 1 1 1 4 1 9 0 N
    Inputs: 9 2 2 1 7 1 9 0 N
    Inputs: 98 27 27 1 82 1 98 0 N

    Some example DGELS inputs:

    Inputs: 16 10 2 16 16 -1 0 N
    Inputs: 16 10 2 16 16 570 0 N
    Inputs: 16 10 2 19 19 -1 0 N
    Inputs: 16 10 2 19 19 570 0 N
    Inputs: 16 10 2 22 22 -1 0 N
    Inputs: 16 10 2 22 22 570 0 N
    Inputs: 16 10 2 26 26 -1 0 N
    Inputs: 16 10 2 26 26 570 0 N
    Inputs: 16 10 2 29 29 -1 0 N
    Inputs: 16 10 2 29 29 570 0 N
    Inputs: 16 10 2 32 32 -1 0 N
    Inputs: 16 10 2 32 32 570 0 N
    Inputs: 16 10 2 36 36 -1 0 N
    Inputs: 16 10 2 36 36 570 0 N
    Inputs: 19 10 2 19 19 -1 0 N
    Inputs: 19 10 2 19 19 570 0 N
    Inputs: 19 10 2 22 22 -1 0 N
    Inputs: 19 10 2 22 22 570 0 N
    Inputs: 19 10 2 26 26 -1 0 N
    Inputs: 19 10 2 26 26 570 0 N
    Inputs: 19 10 2 29 29 -1 0 N
    Inputs: 19 10 2 29 29 570 0 N
    Inputs: 19 10 2 32 32 -1 0 N
    Inputs: 19 10 2 32 32 570 0 N
    Inputs: 19 10 2 36 36 -1 0 N
    Inputs: 19 10 2 36 36 570 0 N
    Inputs: 22 10 2 22 22 -1 0 N
    Inputs: 22 10 2 22 22 570 0 N
    Inputs: 22 10 2 26 26 -1 0 N
    Inputs: 22 10 2 26 26 570 0 N
    Inputs: 22 10 2 29 29 -1 0 N
    Inputs: 22 10 2 29 29 570 0 N
    Inputs: 22 10 2 32 32 -1 0 N
    Inputs: 22 10 2 32 32 570 0 N
    Inputs: 22 10 2 36 36 -1 0 N
    Inputs: 22 10 2 36 36 570 0 N
    Inputs: 26 10 2 26 26 -1 0 N
    Inputs: 26 10 2 26 26 570 0 N
    Inputs: 26 10 2 29 29 -1 0 N
    Inputs: 26 10 2 29 29 570 0 N
    Inputs: 26 10 2 32 32 -1 0 N
    Inputs: 26 10 2 32 32 570 0 N
    Inputs: 26 10 2 36 36 -1 0 N
    Inputs: 26 10 2 36 36 570 0 N
    Inputs: 29 10 2 29 29 -1 0 N
    Inputs: 29 10 2 29 29 570 0 N
    Inputs: 29 10 2 32 32 -1 0 N
    Inputs: 29 10 2 32 32 570 0 N
    Inputs: 29 10 2 36 36 -1 0 N
    Inputs: 29 10 2 36 36 570 0 N
    Inputs: 32 10 2 32 32 -1 0 N
    Inputs: 32 10 2 32 32 570 0 N
    Inputs: 32 10 2 36 36 -1 0 N
    Inputs: 32 10 2 36 36 570 0 N
    Inputs: 36 10 2 36 36 -1 0 N
    Inputs: 36 10 2 36 36 570 0 N
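
    As mentioned above, here is a minimal sketch of how the banded routines are called (my reading of the smallest DGBTRS case in the list: n = 10, kl = ku = 1, nrhs = 1, ldab = 2*kl + ku + 1 = 4; the prototypes are the standard LAPACK Fortran interface and are an assumption on my side):

        /* Assumed Fortran LAPACK prototypes, as exported by libarmpl_mp. */
        extern void dgbtrf_(const int *m, const int *n, const int *kl,
                            const int *ku, double *ab, const int *ldab,
                            int *ipiv, int *info);
        extern void dgbtrs_(const char *trans, const int *n, const int *kl,
                            const int *ku, const int *nrhs, const double *ab,
                            const int *ldab, const int *ipiv, double *b,
                            const int *ldb, int *info);

        /* Factor and solve one of the tiny banded systems listed above:
         * n = 10, kl = ku = 1 (near-tridiagonal), one right-hand side.
         * ab is in LAPACK band storage with ldab = 2*kl + ku + 1 = 4. */
        void band_solve(double *ab, double *b)
        {
            int n = 10, kl = 1, ku = 1, nrhs = 1, ldab = 4, ldb = 10, info;
            int ipiv[10];
            dgbtrf_(&n, &n, &kl, &ku, ab, &ldab, ipiv, &info);
            dgbtrs_("N", &n, &kl, &ku, &nrhs, ab, &ldab, ipiv, b, &ldb, &info);
        }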

  • DGBTRS and DGBTRF are not related to GEQRF, but the issue here is the same: your problems are small, and our implementation is geared towards large parallel problems. We have a few LAPACK functions which are implemented in a similar way.

    In order to understand the best way to deal with these issues, would you mind getting in touch via support-hpc-sw@arm.com? We have a few options, but it would be better to find out which is the most appropriate for your use cases.

    Regards, Chris.