
Negative ArmPL MT speed-up on many core systems

Recently I have been investigating poor performance of my application on an Ampere Altra Max M128-30 system, which has 128 cores. I am using multiple BLAS and LAPACK functions from the multi-threaded ArmPL 22.1 library. It seems that the OpenMP speed-up for some matrix sizes can turn negative beyond a certain thread count, i.e. adding more threads makes the call slower. This typically happens when using more than 32 cores; in the example below it only happens when using all 128 cores. For simplicity, the example below uses DGEMM with square matrices, transa=N and transb=N. All times are in microseconds.

M=N=K    np=1     np=2     np=4     np=8     np=16    np=32    np=64    np=128
   32        4        4        4        4        4        4        4        4
   64       29       42       29       20       15       16       14       14
  128      217      135      100       61       31       26       23       36
  256     1649      868      453      245      135       85       57       78
  512    12827     6509     3313     1733      899      504      309      399
 1024   101935    51296    25902    13239     6654     3569     2029     1621
 2048   827560   417254   211716   106906    53777    27922    14496    10282
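
For reference, here is a minimal sketch of the kind of timing harness that could produce numbers like the ones above. It assumes the CBLAS interface as exposed through ArmPL's armpl.h header and linking against the OpenMP variant of the library (e.g. -larmpl_mp -fopenmp); the exact header name and link flags may differ between installations.

```c
/* Time one square DGEMM (transa=N, transb=N) and print microseconds.
 * Assumes cblas_dgemm is available via ArmPL's armpl.h header. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <armpl.h>   /* provides cblas_dgemm */

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1024;   /* M = N = K */
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    double *c = malloc((size_t)n * n * sizeof(double));
    for (long i = 0; i < (long)n * n; ++i) {
        a[i] = 1.0; b[i] = 2.0; c[i] = 0.0;
    }

    /* Warm-up call so OpenMP thread creation is not included in the timing. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    double t0 = omp_get_wtime();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    double t1 = omp_get_wtime();

    printf("M=N=K=%d  %.0f us  (max threads = %d)\n",
           n, (t1 - t0) * 1e6, omp_get_max_threads());

    free(a); free(b); free(c);
    return 0;
}
```

Running this with different OMP_NUM_THREADS settings (1, 2, 4, ..., 128) reproduces the sweep shown in the table.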

My guess is that for sizes up to 32, ArmPL runs in a single thread. For large matrices the scaling is fine, but for medium sizes performance degrades at 128 threads. My application happens to operate in exactly that range. With different matrix shapes and different functions, the degradation appears at different thread counts.

Can I prevent this performance degradation at 128 threads on my side? Perhaps ArmPL needs some fine-tuning for very high thread counts with smaller matrices.
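
One possible workaround on the application side, sketched below under the assumption that the ArmPL OpenMP variant honours omp_set_num_threads() at call time, is to cap the thread count for small and mid-sized calls and restore it afterwards. The thresholds in cap_threads_for are hypothetical and would need tuning on the Altra Max; this is not an ArmPL-documented mechanism, just a caller-side heuristic.

```c
/* Caller-side sketch: cap the OpenMP thread count for mid-sized GEMMs
 * before calling into ArmPL, then restore the previous setting. */
#include <omp.h>
#include <armpl.h>

/* Hypothetical heuristic: fewer threads for smaller problems.
 * The cut-offs below are illustrative only and need measurement. */
static int cap_threads_for(int m, int n, int k)
{
    long work = (long)m * n * k;
    if (work <= 32L * 32 * 32)       return 1;
    if (work <= 512L * 512 * 512)    return 32;
    if (work <= 1024L * 1024 * 1024) return 64;
    return omp_get_max_threads();    /* large problems: use all cores */
}

/* Wrapper around cblas_dgemm that applies the cap for this call only. */
void dgemm_capped(int m, int n, int k,
                  double alpha, const double *a, int lda,
                  const double *b, int ldb,
                  double beta, double *c, int ldc)
{
    int prev = omp_get_max_threads();
    omp_set_num_threads(cap_threads_for(m, n, k));
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
    omp_set_num_threads(prev);       /* restore for the rest of the app */
}
```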
