Hello,
I discovered this issue on macOS, but it applies to Linux too. One of my workloads runs a lot slower with ArmPL than with vecLib, OpenBLAS and even vanilla LAPACK. I did some profiling and the culprit seems to be DGESVD. My application makes a large number of calls to DGESVD. Some are made from a single thread, while others are single-threaded calls issued in parallel from several threads. I investigated one of the most common single-thread calls: it takes 1320 ns with ArmPL, 680 ns with OpenBLAS and 400 ns with LAPACK.
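To put those per-call numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the timed call is the 3x3 `S S` shape that appears 61891934 times in the summary in this post, and that the quoted timing applies to each of those calls. Both are assumptions, and the names `TIMINGS_NS`, `CALLS`, `total_seconds` and `slowdown` are made up for illustration.

```python
# Per-call timings quoted in this post, in nanoseconds.
TIMINGS_NS = {"ArmPL": 1320, "OpenBLAS": 680, "LAPACK": 400}

# Assumption: the timed call is the 3x3 JOBU='S', JOBVT='S' shape, which
# appears 61891934 times in the summary (workspace queries not counted).
CALLS = 61_891_934


def total_seconds(ns_per_call, calls=CALLS):
    """Total time in seconds spent in `calls` invocations."""
    return ns_per_call * calls / 1e9


def slowdown(library, baseline="LAPACK"):
    """How many times slower `library` is than `baseline` per call."""
    return TIMINGS_NS[library] / TIMINGS_NS[baseline]


if __name__ == "__main__":
    for name, ns in TIMINGS_NS.items():
        print(f"{name:9s} {total_seconds(ns):6.1f} s  ({slowdown(name):.2f}x LAPACK)")
```

Under those assumptions this one call shape accounts for roughly 82 s with ArmPL, against about 25 s with reference LAPACK.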
Below is a summary of the calls. The first column is the number of calls; the remaining columns are the input arguments, excluding the arrays (JOBU, JOBVT, M, N, LDA, LDU, LDVT, LWORK, INFO):
Single-thread
----------------------------------
61891934 DGESVD S S 3 3 3 3 3 -1 0
61891934 DGESVD S S 3 3 3 3 3 6 0
420 DGESVD N N 1 1 1 1 1 -1 0
420 DGESVD N N 1 1 1 1 1 5 0
840 DGESVD N N 2 2 2 1 1 -1 0
840 DGESVD N N 2 2 2 1 1 10 0
392 DGESVD N N 3 3 3 1 1 -1 0
392 DGESVD N N 3 3 3 1 1 15 0
84 DGESVD N N 3 6 3 1 1 -1 0
84 DGESVD N N 3 6 3 1 1 15 0
56 DGESVD N N 3 9 3 1 1 -1 0
56 DGESVD N N 3 9 3 1 1 18 0
84 DGESVD N N 4 4 4 1 1 -1 0
84 DGESVD N N 4 4 4 1 1 20 0
56 DGESVD N N 8 8 8 1 1 -1 0
56 DGESVD N N 8 8 8 1 1 40 0
Multi-thread
----------------------------------
1204 DGESVD N N 3 3 3 1 1 -1 0
1204 DGESVD N N 3 3 3 1 1 15 0
252 DGESVD N N 2 2 2 1 1 -1 0
252 DGESVD N N 2 2 2 1 1 10 0
119 DGESVD N S 3 3 3 1 3 -1 0
119 DGESVD N S 3 3 3 1 3 6 0
84 DGESVD N N 3 2 3 1 1 -1 0
84 DGESVD N N 3 2 3 1 1 10 0
84 DGESVD N N 3 6 3 1 1 -1 0
84 DGESVD N N 3 6 3 1 1 15 0
56 DGESVD N N 3 9 3 1 1 -1 0
56 DGESVD N N 3 9 3 1 1 18 0
35 DGESVD N N 1 1 1 1 1 -1 0
35 DGESVD N N 1 1 1 1 1 5 0
49 DGESVD N S 2 2 2 1 2 -1 0
49 DGESVD N S 2 2 2 1 2 4 0
42 DGESVD N N 8 8 8 1 1 -1 0
42 DGESVD N N 8 8 8 1 1 40 0
7 DGESVD S S 9 10 9 9 9 -1 0
7 DGESVD S S 9 10 9 9 9 45 0
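For anyone reading the rows above, here is a minimal Python sketch that decodes them. It assumes the columns follow the scalar arguments of DGESVD in LAPACK order (JOBU, JOBVT, M, N, LDA, LDU, LDVT, LWORK, INFO) and relies on the LAPACK convention that LWORK = -1 is a workspace-size query rather than an actual decomposition. The names `Call`, `parse_row` and `split_queries` are hypothetical helpers, not part of any library.

```python
from collections import namedtuple

# Scalar DGESVD arguments in LAPACK order, with the array arguments
# (A, S, U, VT, WORK) left out, matching the columns of the summary.
Call = namedtuple("Call", "count jobu jobvt m n lda ldu ldvt lwork info")


def parse_row(row):
    """Turn one summary row, e.g. '420 DGESVD N N 1 1 1 1 1 5 0', into a Call."""
    fields = row.split()
    if len(fields) != 11 or fields[1] != "DGESVD":
        raise ValueError(f"unexpected row: {row!r}")
    count, jobu, jobvt = int(fields[0]), fields[2], fields[3]
    return Call(count, jobu, jobvt, *map(int, fields[4:]))


def split_queries(rows):
    """Return (query_calls, compute_calls); LWORK == -1 marks a query."""
    queries = compute = 0
    for call in map(parse_row, rows):
        if call.lwork == -1:
            queries += call.count
        else:
            compute += call.count
    return queries, compute


if __name__ == "__main__":
    sample = [
        "61891934 DGESVD S S 3 3 3 3 3 -1 0",
        "61891934 DGESVD S S 3 3 3 3 3 6 0",
        "420 DGESVD N N 1 1 1 1 1 -1 0",
        "420 DGESVD N N 1 1 1 1 1 5 0",
    ]
    print(split_queries(sample))
```

Read this way, every shape in the summary appears as a pair: one workspace query followed by one actual call with the same count.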
It would be nice to have this fixed. Thanks!
Hi,
Thanks for the feedback. These are *very* small problems, and I think what you are seeing is the overhead of setting up and dispatching to code paths optimized for larger problems, which is why ArmPL and OpenBLAS are actually slower than reference LAPACK for these problems. We'll take a look, thanks!
Chris.