Recently I have been investigating poor performance of my application on an Ampere Altra Max M128-30 system, which has 128 cores. I am using multiple BLAS and LAPACK functions from the ArmPL 22.1 multi-threaded library. It seems that the OpenMP speed-up for some matrix sizes can turn negative after a certain threshold. This typically happens when using more than 32 cores. In the example below it actually happens when using all 128 cores. For simplicity, in the example below I am using DGEMM with square matrices with transa=N and transb=N. The unit is microseconds.
My guess is that for sizes up to 32, ArmPL runs in a single thread. For large matrices the scaling is OK, but for medium sizes the speed-up turns negative at 128 cores. My application turns out to be operating in exactly that range. With different matrix shapes and different functions, the slowdown appears at different thread counts.
Can I prevent this performance degradation at 128 threads on my side? Maybe ArmPL needs some fine tuning for very high number of threads with smaller matrices.
Thanks for sharing the numbers. I agree with your assessment of the performance degradation in those three highlighted cases, but it's great to see the impressive scaling on the larger cases.
As you can probably imagine, tuning every routine for every microarchitecture on every problem size and at every thread count is quite a big job! We have endeavoured to optimize the performance for most of those cases through thread throttling, as you rightly surmise. Getting access to 128-core machines was previously harder, and we'd assumed that most real-world use cases would not run such small GEMMs across so many threads, but it looks like you may have an application needing this. As such, we will endeavour to correct this poor scaling in the 23.0 release, which comes out early next year.
In the meantime you can limit this degradation in your own code by enclosing your DGEMM call between some calls similar to:
int saved_thread_count = omp_get_max_threads();
if (N<1024) omp_set_num_threads(64);
dgemm_(...);
if (N<1024) omp_set_num_threads(saved_thread_count);
Note the subtly different function names in the OpenMP runtime for accomplishing this.
Hopefully this won't be too complicated to add to your code for the interim period until the updated library is released.
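If the same throttling is needed around several BLAS calls, the workaround above could be wrapped in a small helper. The following is only a sketch under stated assumptions: the names throttled_threads, run_throttled and blas_call_fn are hypothetical and are not part of ArmPL or OpenMP, and the 1024 / 64 thresholds are simply the ones from the snippet above, not tuned values. The BLAS call itself is passed in as a callback so that the sketch does not depend on any particular library being linked.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stubs (assumption, for illustration only) so the sketch
 * also builds when OpenMP is not enabled at compile time. */
static int stub_threads = 1;
static int omp_get_max_threads(void) { return stub_threads; }
static void omp_set_num_threads(int n) { stub_threads = n; }
#endif

/* Thresholds copied from the suggested workaround; not tuned values. */
enum { SMALL_N_LIMIT = 1024, SMALL_N_THREADS = 64 };

/* Hypothetical helper: thread count to use for an n-by-n problem.
 * Unlike the inline snippet, it never raises the thread count above
 * what the caller already allows. */
int throttled_threads(int n, int max_threads)
{
    if (n < SMALL_N_LIMIT && max_threads > SMALL_N_THREADS)
        return SMALL_N_THREADS;
    return max_threads;
}

/* Hypothetical callback type wrapping one BLAS call, e.g. dgemm_(...). */
typedef void (*blas_call_fn)(void *ctx);

/* Run one BLAS call with a temporarily reduced thread count, then
 * restore the previous setting. Returns the thread count requested. */
int run_throttled(int n, blas_call_fn call, void *ctx)
{
    int saved = omp_get_max_threads();
    int used = throttled_threads(n, saved);
    assert(used >= 1);

    if (used != saved)
        omp_set_num_threads(used);
    call(ctx);                      /* the actual BLAS call goes here */
    if (used != saved)
        omp_set_num_threads(saved);
    return used;
}
```

A caller would put its dgemm_ arguments in a struct, pass a pointer to it as ctx, and have the callback unpack them. The thresholds would need to be adjusted per routine and matrix shape, since the slowdown was reported to appear at different thread counts for different functions.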
Thank you for looking into the issue! I am afraid that the problem is more serious. I've tested some other functions and have seen very poor results. If you, for example, give ?GEMMT a try at 1 thread and then test it at many threads (16, 32, 64, 128) with some smaller matrices (dimensions less than 100), you will see that running multi-threaded results in extremely poor performance. OpenBLAS (via ReLAPACK) seems to work much better, at least for this example. I haven't tested all functions but there must be more problematic ones. In the real world, running a workload with my app at 32 cores takes 5.4 hours on the machine, while running the same workload at 128 cores takes 10.2 hours. I don't see this problem on x86 with proprietary BLAS libraries. There, the performance for the same workload either gets better or stays the same after some threshold. It would be nice to have ArmPL optimized for small matrices and odd-shaped matrices too, especially in a multi-threaded environment. Sometimes in the real world you need to break a big problem into small chunks.
Thanks, this is all good info. The slowdowns at 128 cores are a result of the problem sizes of some calls not being big enough to outweigh the OpenMP thread creation costs. As I mentioned, we've also now got access to 128+ core systems, so we will be adding this tuning for as many routines as we can identify and get fixed up for the 23.0 release. This should include the GEMMT routines you mention.
Awkward and unusual matrix sizes are always of interest. For most users these don't form a significant percentage of overall application runtime, so identifying cases where they do is important.
One thing you can do to help us identify key routines and matrix shapes of interest is to use the perf-libs-tools we provide via GitHub - https://github.com/ARM-software/perf-libs-tools. This can record the execution of a program and gather summary information on the library calls made, including information such as matrix sizes and other supplied options. I recommend not running for huge lengths of time, as most applications typically repeat similar library call patterns many times over the solution of a real case. Any recorded information that you are happy to share with us, whether the traces (which can get big) or just the high-level summary information (obtained from process_summary.py), helps inform our future work as we discover users' bottlenecks. My e-mail address should be available on my profile page.