Recently I have been investigating poor performance of my application on an Ampere Altra Max M128-30 system, which has 128 cores. I am using multiple BLAS and LAPACK functions from the multi-threaded ArmPL 22.1 library. For some matrix sizes, the OpenMP speed-up turns negative beyond a certain thread count. This typically happens when using more than 32 cores; in the example below it actually happens only when all 128 cores are used. For simplicity, the example uses DGEMM with square matrices and transa=N, transb=N. Times are in microseconds.
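For reference, a timing loop of this shape can be sketched as follows. This is a minimal illustration in Python/NumPy rather than my real code: NumPy dispatches to whatever BLAS it was built with, not necessarily ArmPL, and the sizes and repetition count are only examples.

```python
import time
import numpy as np

def time_dgemm(n, reps=5):
    """Time an n x n double-precision GEMM (C = A @ B) and return the
    best of `reps` runs in microseconds. NumPy's matmul calls the BLAS
    dgemm of whatever library NumPy was linked against."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        a @ b  # dgemm with transa=N, transb=N
        best = min(best, time.perf_counter() - t0)
    return best * 1e6  # seconds -> microseconds

for n in (16, 32, 64, 128, 256, 512):
    print(f"n={n:4d}: {time_dgemm(n):10.1f} us")
```

Running this with different OMP_NUM_THREADS values is how the per-size scaling behaviour shows up.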
My guess is that for sizes up to 32, ArmPL runs single-threaded, and for large matrices the scaling is fine. But for medium sizes performance degrades at 128 cores, and my application happens to operate in exactly that range. With different matrix shapes and different functions, the degradation appears at different thread counts.
Can I prevent this performance degradation at 128 threads on my side? Perhaps ArmPL needs some fine-tuning for very high thread counts with smaller matrices.
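The only blunt workaround I can think of is capping the OpenMP thread count globally, but that sacrifices the large-matrix scaling, which is why I am asking. A minimal sketch (in Python for brevity; I am assuming the BLAS honours OMP_NUM_THREADS, which most OpenMP runtimes read at startup, so it must be set before the library initializes):

```python
import os

# Assumption: the OpenMP runtime backing the BLAS reads OMP_NUM_THREADS
# at startup, so it has to be set before the library is first loaded.
# 32 is the last thread count where scaling still looked positive for me.
os.environ["OMP_NUM_THREADS"] = "32"

import numpy as np  # the BLAS thread pool is now capped at 32 threads

print(os.environ["OMP_NUM_THREADS"])
```

The downside is obvious: the large matrices that do scale well are then also limited to 32 threads.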