
Negative ArmPL MT speed-up on many-core systems

Recently I have been investigating the poor performance of my application on an Ampere Altra Max M128-30 system, which has 128 cores. I am using multiple BLAS and LAPACK functions from the ArmPL 22.1 multi-threaded library. It seems that the OpenMP speed-up for some matrix sizes can turn negative beyond a certain thread count. This typically happens when using more than 32 cores; in the example below it happens only when using all 128 cores. For simplicity, the example uses DGEMM on square matrices with transa=N and transb=N. Times are in microseconds; a minimal sketch of the timing loop is shown after the table.

M=N=K     np=1     np=2     np=4     np=8    np=16    np=32    np=64   np=128
   32        4        4        4        4        4        4        4        4
   64       29       42       29       20       15       16       14       14
  128      217      135      100       61       31       26       23       36
  256     1649      868      453      245      135       85       57       78
  512    12827     6509     3313     1733      899      504      309      399
 1024   101935    51296    25902    13239     6654     3569     2029     1621
 2048   827560   417254   211716   106906    53777    27922    14496    10282
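
For reference, here is a minimal sketch of the kind of timing harness that produces numbers like these. It is not the exact code I used: the repeat count, the matrix contents, the use of omp_get_wtime and the hand-written dgemm_ prototype (LP64 integers, hidden Fortran string-length arguments omitted) are all simplifications, and the thread count is taken from OMP_NUM_THREADS.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Fortran-style BLAS interface, assuming the LP64 (32-bit integer) library. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void)
{
    const int n = 1024;                 /* M = N = K, square matrices */
    const double alpha = 1.0, beta = 0.0;
    double *a = malloc((size_t)n * n * sizeof *a);
    double *b = malloc((size_t)n * n * sizeof *b);
    double *c = malloc((size_t)n * n * sizeof *c);
    for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* One warm-up call, then average a few timed repeats. */
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
    const int reps = 10;
    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; r++)
        dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
    double t1 = omp_get_wtime();

    printf("M=N=K=%d  threads=%d  avg time=%.0f us\n",
           n, omp_get_max_threads(), 1.0e6 * (t1 - t0) / reps);
    free(a); free(b); free(c);
    return 0;
}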

My guess is that for sizes up to 32, ArmPL runs in a single thread. For large matrices the scaling is good, but for medium sizes the speed-up turns negative at 128 cores, and my application happens to operate in exactly that range. With different matrix shapes and different functions, the degradation appears at different thread counts.

Can I prevent this performance degradation at 128 threads on my side? Perhaps ArmPL needs some fine tuning for very high thread counts with smaller matrices.

  • Hi,

    Thanks for sharing the numbers. I agree with your assessment of the performance degradation in those three highlighted cases, but it's great to see the impressive scaling on the larger cases.

    As you can probably imagine, tuning every routine for every microarchitecture, at every problem size and every thread count, is quite a big job!  We have endeavoured to optimize performance for most of those cases through thread throttling, as you rightly surmise.  Access to 128-core machines was previously harder to come by, and we had assumed that most real-world use cases would not run such small GEMMs across so many threads, but it looks like your application needs exactly this.  We will therefore endeavour to correct this poor scaling in the 23.0 release, which comes out early next year.

    In the meantime you can limit this degradation in your own code by enclosing your DGEMM call between calls similar to:

    #include <omp.h>   /* omp_get_max_threads(), omp_set_num_threads() */

    /* Temporarily cap the thread count for small problems, then restore it. */
    int saved_thread_count = omp_get_max_threads();
    if (N < 1024) omp_set_num_threads(64);
    dgemm_(...);
    if (N < 1024) omp_set_num_threads(saved_thread_count);

    Note the subtly different function names in the OpenMP runtime for querying the current limit (omp_get_max_threads) and setting it (omp_set_num_threads).
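
    If it helps, a small wrapper can keep the workaround in one place.  The sketch below is only an illustration: the wrapper name dgemm_capped, the 1024 size threshold and the 64-thread cap mirror the snippet above and should be tuned to your own measurements, and it assumes the LP64 interface with the dgemm_ prototype coming from armpl.h or your own declaration.

    #include <omp.h>

    /* Hypothetical helper: run DGEMM with at most 64 threads for small problems,
       then restore the caller's thread-count setting.  Assumes dgemm_ is declared
       elsewhere (e.g. via armpl.h, LP64 interface). */
    static void dgemm_capped(const char *transa, const char *transb,
                             int m, int n, int k,
                             double alpha, const double *a, int lda,
                             const double *b, int ldb,
                             double beta, double *c, int ldc)
    {
        int saved = omp_get_max_threads();
        if (n < 1024 && saved > 64) omp_set_num_threads(64);
        dgemm_(transa, transb, &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc);
        omp_set_num_threads(saved);
    }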

    Hopefully this won't be too complicated to add to your code for the interim period until the updated library is released.

    Thanks.

    Chris
