Support forums

High Performance Computing (HPC) forum Negative ArmPL MT speed-up on many core systems

State Suggested Answer
Locked Locked
Replies 3 replies
Answers 1 answer
Subscribers 25 subscribers
Views 3430 views
Users 0 members are here

Options

How was your experience today?

This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Negative ArmPL MT speed-up on many core systems

ndvn over 3 years ago

Recently I have been investigating poor performance of my application on an Ampere Altra Max M128-30 system which has 128 cores. I am using multiple BLAS and LAPACK functions from the ArmPL 22.1 multi-threaded library. It seems that the OpenMP speed-up for some matrix sizes can turn negative after a certain threshold. This is typically happens when using more than 32 cores. In the example below it actually happens when using all 128 cores. For simplicity, in the below example I am using DGEMM with square matrices with transa=N and transb=N. The unit is microseconds.

M=N=K=	np=1	np=2	np=4	np=8	np=16	np=32	np=64	np=128
32	4	4	4	4	4	4	4	4
64	29	42	29	20	15	16	14	14
128	217	135	100	61	31	26	23	36
256	1649	868	453	245	135	85	57	78
512	12827	6509	3313	1733	899	504	309	399
1024	101935	51296	25902	13239	6654	3569	2029	1621
2048	827560	417254	211716	106906	53777	27922	14496	10282

My guess is that for sizes up to 32, ArmPL runs in a single thread. For large matrices the scaling is OK. But for medium sizes performance turns negative at 128 cores. My application turns out to be operating at exactly that range. With different matrix shapes and different functions, negative performance appears at different thread counts.

Can I prevent this performance degradation at 128 threads on my side? Maybe ArmPL needs some fine tuning for very high number of threads with smaller matrices.

Top replies

Chris Goodyer over 3 years ago in reply to ndvn +1 suggested

Hi. Thanks, this is all good info. The slowdowns at 128 cores are a factor of the problem sizes of some calls not being big enough to outweigh the OpenMP thread creation costs. As I mentioned, we've...