Scaling issues with ArmPL 23.04

Some time ago I reported a performance issue with ArmPL on CPUs with a large number of cores (see https://community.arm.com/support-forums/f/high-performance-computing-forum/53959/negative-armpl-mt-speed-up-on-many-core-systems). It was fixed in version 23.04, and as a result my application gained a lot of performance. Recently, I did some more scaling tests and discovered further issues.

The testing was done on a 128-core Ampere Altra CPU running Ubuntu 22.04. The application uses ArmPL 23.04, LLVM OpenMP and is compiled with GCC 12. The application uses ArmPL in two ways, depending on the algorithm: (1) a single application thread calls ArmPL; (2) multiple application threads call ArmPL at the same time. I did the performance profiling with perf-libs-tools.

ZGETRF
The problem with ZGETRF occurs when multiple application threads call it at the same time. Typical inputs are square (M=N), with sizes between 10 and 300. Maybe there is some locking issue? Summary from perf-libs-tools:

OMP_NUM_THREADS=32
zgetrf_     cnt=    2434696  totTime=    1504.5387 called_tot=     171712  topTime=      93.8037    (%age of runtime:  6.428 )

Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_915245.apl | sort -k8 -nr | head | sort -k12 -n
Routine:  zgetrf_  nCalls: 216920  Mean_time 8.647643e-05   nUserCalls:  12077  Mean_user_time: 8.447802e-05   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls:  66792  Mean_time 2.949315e-04   nUserCalls:   7386  Mean_user_time: 2.930019e-04   Inputs:           100          100          100            1            0
Routine:  zgetrf_  nCalls: 126456  Mean_time 3.720812e-04   nUserCalls:  13398  Mean_user_time: 3.665895e-04   Inputs:           110          110          110            1            0
Routine:  zgetrf_  nCalls: 201256  Mean_time 4.436309e-04   nUserCalls:  16952  Mean_user_time: 4.371607e-04   Inputs:           120          120          120            1            0
Routine:  zgetrf_  nCalls: 168344  Mean_time 5.473583e-04   nUserCalls:  14346  Mean_user_time: 5.396482e-04   Inputs:           130          130          130            1            0
Routine:  zgetrf_  nCalls: 147136  Mean_time 6.433120e-04   nUserCalls:  11615  Mean_user_time: 6.342174e-04   Inputs:           140          140          140            1            0
Routine:  zgetrf_  nCalls: 211288  Mean_time 6.791234e-04   nUserCalls:   9395  Mean_user_time: 6.745791e-04   Inputs:           142          142          142            1            0
Routine:  zgetrf_  nCalls: 123904  Mean_time 7.762974e-04   nUserCalls:   8320  Mean_user_time: 7.669183e-04   Inputs:           150          150          150            1            0
Routine:  zgetrf_  nCalls:  73040  Mean_time 8.895631e-04   nUserCalls:   5874  Mean_user_time: 8.788582e-04   Inputs:           160          160          160            1            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.083270e-03   nUserCalls:   9152  Mean_user_time: 1.073041e-03   Inputs:           170          170          170            1            0

OMP_NUM_THREADS=128
zgetrf_     cnt=    2434696  totTime=    2046.2830 called_tot=     241094  topTime=     151.9661    (%age of runtime:  9.300 )

Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_912580.apl | sort -k8 -nr | head | sort -k12 -n
Routine:  zgetrf_  nCalls: 216920  Mean_time 2.516856e-04   nUserCalls:  11667  Mean_user_time: 1.828228e-04   Inputs:            52           52           52            2            0
Routine:  zgetrf_  nCalls:  66792  Mean_time 4.422276e-04   nUserCalls:   9588  Mean_user_time: 3.222948e-04   Inputs:           100          100          100            1            0
Routine:  zgetrf_  nCalls: 126456  Mean_time 5.297068e-04   nUserCalls:  18734  Mean_user_time: 3.932569e-04   Inputs:           110          110          110            1            0
Routine:  zgetrf_  nCalls: 201256  Mean_time 6.306487e-04   nUserCalls:  23959  Mean_user_time: 4.647142e-04   Inputs:           120          120          120            1            0
Routine:  zgetrf_  nCalls: 168344  Mean_time 7.413436e-04   nUserCalls:  27496  Mean_user_time: 5.610537e-04   Inputs:           130          130          130            1            0
Routine:  zgetrf_  nCalls: 147136  Mean_time 8.560422e-04   nUserCalls:  23896  Mean_user_time: 6.514217e-04   Inputs:           140          140          140            1            0
Routine:  zgetrf_  nCalls: 123904  Mean_time 1.012120e-03   nUserCalls:  20248  Mean_user_time: 7.814398e-04   Inputs:           150          150          150            1            0
Routine:  zgetrf_  nCalls:  73040  Mean_time 1.143292e-03   nUserCalls:  14404  Mean_user_time: 8.928624e-04   Inputs:           160          160          160            1            0
Routine:  zgetrf_  nCalls: 195888  Mean_time 1.406755e-03   nUserCalls:  10160  Mean_user_time: 1.091698e-03   Inputs:           170          170          170            1            0
Routine:  zgetrf_  nCalls:  19096  Mean_time 1.444096e-03   nUserCalls:   5755  Mean_user_time: 1.218681e-03   Inputs:           180          180          180            1            0
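
To make the two listings above easier to compare, the per-call slowdown can be computed by matching lines on their input sizes and dividing the Mean_time values. The following is only a sketch: `mean_time_ratio` is a hypothetical helper name, and it assumes both files contain lines in the "Routine: ... Mean_time ... Inputs: ..." layout shown above, with Mean_time in field 6 and the inputs starting at field 12.

```shell
# Print "<inputs>: <ratio>" for every call signature present in both listings.
# $1: listing from the baseline run (e.g. OMP_NUM_THREADS=32)
# $2: listing from the comparison run (e.g. OMP_NUM_THREADS=128)
mean_time_ratio() {
    awk '
        $1 != "Routine:" { next }                      # skip anything else
        { k = $12; for (i = 13; i <= NF; i++) k = k " " $i }  # key = all inputs
        FILENAME == ARGV[1] { base[k] = $6; next }     # remember baseline time
        k in base { printf "%s: %.2f\n", k, $6 / base[k] }
    ' "$1" "$2"
}
```

For example, after saving the output of `grep -i zgetrf /tmp/armplsummary_915245.apl` and `grep -i zgetrf /tmp/armplsummary_912580.apl` to two files, `mean_time_ratio` prints one line per matrix size that appears in both. With the numbers listed above, N=52 gives a ratio of about 2.91 and N=100 about 1.50. Sizes that appear in only one listing (such as 142 or 180) are skipped.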


DGEMM
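Even though DGEMM received substantial performance improvements, it still has some issues. For the small matrices below, the mean time per call rises by roughly 1.4x to 3.7x when going from 32 to 128 threads: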

OMP_NUM_THREADS=32
dgemm_     cnt=   30596724  totTime=     272.4157   called_tot=   30596724  topTime=     272.4157    (%age of runtime:  2.876 )

Example calls:
Routine:   dgemm_  nCalls:  14728  Mean_time 1.281673e-05   nUserCalls:  14728  Mean_user_time: 1.281673e-05   Inputs:            36            1          252           36          288      1226658 N N
Routine:   dgemm_  nCalls:   9884  Mean_time 9.806415e-06   nUserCalls:   9884  Mean_user_time: 9.806415e-06   Inputs:           264            1           42           42           42          264 T N
Routine:   dgemm_  nCalls:  14728  Mean_time 9.071300e-06   nUserCalls:  14728  Mean_user_time: 9.071300e-06   Inputs:           252            1           36           36           36          252 T N
Routine:   dgemm_  nCalls:   9884  Mean_time 1.381515e-05   nUserCalls:   9884  Mean_user_time: 1.381515e-05   Inputs:            42            1          264           42          306      1226658 N N
Routine:   dgemm_  nCalls:  12238  Mean_time 9.176557e-06   nUserCalls:  12238  Mean_user_time: 9.176557e-06   Inputs:           246            1           36           36           36          246 T N
Routine:   dgemm_  nCalls:  17002  Mean_time 1.015412e-05   nUserCalls:  17002  Mean_user_time: 1.015412e-05   Inputs:           240            1           48           48           48          240 T N
Routine:   dgemm_  nCalls:  13044  Mean_time 1.026566e-05   nUserCalls:  13044  Mean_user_time: 1.026566e-05   Inputs:           210            1           54           54           54          210 T N
Routine:   dgemm_  nCalls:  10980  Mean_time 1.011446e-05   nUserCalls:  10980  Mean_user_time: 1.011446e-05   Inputs:           276            1           48           48           48          276 T N
Routine:   dgemm_  nCalls:  24626  Mean_time 9.204542e-06   nUserCalls:  24626  Mean_user_time: 9.204542e-06   Inputs:           204            1           42           42           42          204 T N
Routine:   dgemm_  nCalls:  19882  Mean_time 9.293070e-06   nUserCalls:  19882  Mean_user_time: 9.293070e-06   Inputs:           216            1           42           42           42          216 T N

OMP_NUM_THREADS=128
dgemm_     cnt=   30597188  totTime=     350.5334   called_tot=   30597188  topTime=     350.5334    (%age of runtime:  3.982 )

Example calls:
Routine:   dgemm_  nCalls:  14728  Mean_time 3.143310e-05   nUserCalls:  14728  Mean_user_time: 3.143310e-05   Inputs:            36            1          252           36          288      1226658 N N
Routine:   dgemm_  nCalls:   9884  Mean_time 2.066992e-05   nUserCalls:   9884  Mean_user_time: 2.066992e-05   Inputs:           264            1           42           42           42          264 T N
Routine:   dgemm_  nCalls:  14728  Mean_time 1.720480e-05   nUserCalls:  14728  Mean_user_time: 1.720480e-05   Inputs:           252            1           36           36           36          252 T N
Routine:   dgemm_  nCalls:   9884  Mean_time 2.001355e-05   nUserCalls:   9884  Mean_user_time: 2.001355e-05   Inputs:            42            1          264           42          306      1226658 N N
Routine:   dgemm_  nCalls:  12246  Mean_time 3.380307e-05   nUserCalls:  12246  Mean_user_time: 3.380307e-05   Inputs:           246            1           36           36           36          246 T N
Routine:   dgemm_  nCalls:  17012  Mean_time 2.844690e-05   nUserCalls:  17012  Mean_user_time: 2.844690e-05   Inputs:           240            1           48           48           48          240 T N
Routine:   dgemm_  nCalls:  13042  Mean_time 2.803772e-05   nUserCalls:  13042  Mean_user_time: 2.803772e-05   Inputs:           210            1           54           54           54          210 T N
Routine:   dgemm_  nCalls:  10982  Mean_time 2.659640e-05   nUserCalls:  10982  Mean_user_time: 2.659640e-05   Inputs:           276            1           48           48           48          276 T N
Routine:   dgemm_  nCalls:  24612  Mean_time 2.413749e-05   nUserCalls:  24612  Mean_user_time: 2.413749e-05   Inputs:           204            1           42           42           42          204 T N
Routine:   dgemm_  nCalls:  19862  Mean_time 2.381463e-05   nUserCalls:  19862  Mean_user_time: 2.381463e-05   Inputs:           216            1           42           42           42          216 T N

It would be nice to have this fixed.

  • Hi,

    Thanks for the report, and also for using perf-libs-tools!

    We are currently in the process of producing a new release of Arm PL (23.10) which will appear in the next few weeks. Unfortunately, any problem is unlikely to be addressed as part of that release. However, if we can pin down any potential issue then maybe we can help with an explanation and a possible fix in future releases.

    You mentioned:

    "The application uses ArmPL 23.04, LLVM OpenMP and is compiled with GCC 12."

    This makes me wonder if you are mixing OpenMP runtimes. If the library you're using was built with GCC it will have a dependency on libgomp (the GNU OpenMP library); if your application is using LLVM OpenMP, then it's possible that you're seeing bad performance from unintended nested parallelism.

    Please could you execute one of the Arm PL shared libraries in the distribution you're using at the command line? The shared libraries are executable, and should print out some diagnostic info. It would be useful if you could post that info in reply.

    Best Regards,

    Chris.
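
One way to check for the runtime mix suggested above is to look at the dynamic dependencies of the application binary. This is a minimal sketch, assuming the application and ArmPL are dynamically linked and `ldd` is available; `omp_runtimes` is a hypothetical helper name and `./my_app` is a placeholder for the real binary.

```shell
# List the distinct OpenMP runtime libraries named in ldd-style output
# read from stdin. libgomp is the GNU runtime, libomp the LLVM runtime.
omp_runtimes() {
    grep -oE 'lib(gomp|omp)\.so[.0-9]*' | sort -u
}

# Usage (placeholder binary name):
#   ldd ./my_app | omp_runtimes
```

If both libgomp and libomp show up in the output, two OpenMP runtimes are being loaded into the same process, which would be consistent with the nested-parallelism explanation.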
