Recently I have been investigating poor performance of my application on an Ampere Altra Max M128-30 system, which has 128 cores. I am using multiple BLAS and LAPACK functions from the multi-threaded ArmPL 22.1 library. It seems that for some matrix sizes the OpenMP speed-up turns into a slowdown beyond a certain threshold, typically when using more than 32 cores; in the example below it happens when using all 128 cores. For simplicity, the example uses DGEMM with square matrices and transa=N, transb=N. The unit is microseconds.
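To make the setup concrete, here is a stripped-down sketch of the kind of timing harness I am describing (a reconstruction, not my exact code; the single warm-up call, the matrix initialisation, and linking against the multi-threaded ArmPL library are simplifications, and the exact link flags depend on your compiler and the ArmPL documentation):

```c
/* Sketch of a square DGEMM timing run, transa = transb = 'N', reported in
 * microseconds. Thread count is set externally, e.g.
 *   OMP_NUM_THREADS=128 ./dgemm_bench 2000
 * Link against the multi-threaded ArmPL library (see the ArmPL docs). */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Standard Fortran BLAS interface, as provided by ArmPL */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 512;   /* square matrix dimension */
    double alpha = 1.0, beta = 0.0;
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    double *c = malloc((size_t)n * n * sizeof(double));
    for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* Warm-up call so one-off thread start-up is not counted in the timing */
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);

    double t0 = omp_get_wtime();
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
    double t1 = omp_get_wtime();

    printf("n=%d threads=%d time=%.1f us\n",
           n, omp_get_max_threads(), (t1 - t0) * 1.0e6);

    free(a); free(b); free(c);
    return 0;
}
```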
My guess is that for sizes up to 32, ArmPL runs single-threaded. For large matrices the scaling is fine, but for medium sizes performance degrades at 128 cores, and my application happens to operate in exactly that range. With different matrix shapes and different functions, the degradation appears at different thread counts.
Can I prevent this performance degradation at 128 threads on my side? Perhaps ArmPL needs some fine tuning for very high thread counts with smaller matrices.
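The only workaround I can think of on my side is to clamp the OpenMP thread count per call based on the problem size, roughly as sketched below. This is just an idea, not an ArmPL feature: the wrapper name, the helper, and the thresholds are all invented placeholders that would need measuring, and it assumes ArmPL honours omp_set_num_threads().

```c
/* Hypothetical workaround: cap the OpenMP thread count for small DGEMMs
 * before handing the call to ArmPL, then restore the previous setting.
 * The crossover points below are made-up and would have to be benchmarked
 * per routine and per matrix shape. */
#include <omp.h>

extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

/* Crude size-based thread cap (placeholder thresholds) */
static int capped_threads(int m, int n, int k)
{
    long work = (long)m * n * k;
    if (work <  64L *  64 *  64) return 1;
    if (work < 512L * 512 * 512) return 32;
    return omp_get_max_threads();
}

void dgemm_capped(const char *transa, const char *transb,
                  int m, int n, int k, double alpha,
                  const double *a, int lda, const double *b, int ldb,
                  double beta, double *c, int ldc)
{
    int saved = omp_get_max_threads();           /* remember current setting */
    omp_set_num_threads(capped_threads(m, n, k));
    dgemm_(transa, transb, &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc);
    omp_set_num_threads(saved);                  /* restore for the rest of the app */
}
```

I would rather not maintain hand-tuned thresholds like this, though, which is why I am asking whether the library itself can be tuned.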
Hi.
Thanks, this is all good info. The slowdowns at 128 cores come down to the problem sizes of some calls not being big enough to outweigh the OpenMP thread creation costs. As I mentioned, we now also have access to 128+ core systems, so we will be adding this tuning for as many routines as we can identify and fix up for the 23.0 release. This should include the GEMMT routines you mention.
Awkward and unusual matrix sizes are always of interest. For most users these don't form a significant percentage of overall application runtime, so identifying cases where they do is important.
One thing you can do to help us identify key routines and matrix shapes of interest is to use the perf-libs-tools we provide on GitHub: https://github.com/ARM-software/perf-libs-tools. These can record the execution of a program and gather summary information on the library calls made, including details such as matrix sizes and other supplied options. I recommend not running for huge lengths of time, as most applications repeat similar library call patterns many times over the solution of a real case. Any recorded information you are happy to share with us, whether the traces (which can get big) or just the high-level summary produced by process_summary.py, helps inform our future work as we discover users' bottlenecks. My e-mail address should be available on my profile page.
Thanks.
Chris