Hello,
I discovered this issue on macOS, but it applies to Linux too. One of my workloads runs a lot slower with ArmPL than with vecLib, OpenBLAS and even vanilla LAPACK. I did some profiling and the culprit seems to be DGESVD. My application makes a large number of calls to DGESVD, some of which are made from a single thread while others are single-threaded calls made in parallel. I investigated one of the most common single-thread calls: it takes 1320 ns with ArmPL, 680 ns with OpenBLAS and 400 ns with LAPACK.
DGESVD
Below is a summary of the calls. The first column is the number of calls; the remaining columns are the input arguments, excluding the input arrays (in DGESVD argument order: JOBU, JOBVT, M, N, LDA, LDU, LDVT, LWORK, INFO):
Single-thread
----------------------------------
61891934 DGESVD S S 3 3 3 3 3 -1 0
61891934 DGESVD S S 3 3 3 3 3 6 0
420 DGESVD N N 1 1 1 1 1 -1 0
420 DGESVD N N 1 1 1 1 1 5 0
840 DGESVD N N 2 2 2 1 1 -1 0
840 DGESVD N N 2 2 2 1 1 10 0
392 DGESVD N N 3 3 3 1 1 -1 0
392 DGESVD N N 3 3 3 1 1 15 0
84 DGESVD N N 3 6 3 1 1 -1 0
84 DGESVD N N 3 6 3 1 1 15 0
56 DGESVD N N 3 9 3 1 1 -1 0
56 DGESVD N N 3 9 3 1 1 18 0
84 DGESVD N N 4 4 4 1 1 -1 0
84 DGESVD N N 4 4 4 1 1 20 0
56 DGESVD N N 8 8 8 1 1 -1 0
56 DGESVD N N 8 8 8 1 1 40 0
Multi-thread
----------------------------------
1204 DGESVD N N 3 3 3 1 1 -1 0
1204 DGESVD N N 3 3 3 1 1 15 0
252 DGESVD N N 2 2 2 1 1 -1 0
252 DGESVD N N 2 2 2 1 1 10 0
119 DGESVD N S 3 3 3 1 3 -1 0
119 DGESVD N S 3 3 3 1 3 6 0
84 DGESVD N N 3 2 3 1 1 -1 0
84 DGESVD N N 3 2 3 1 1 10 0
84 DGESVD N N 3 6 3 1 1 -1 0
84 DGESVD N N 3 6 3 1 1 15 0
56 DGESVD N N 3 9 3 1 1 -1 0
56 DGESVD N N 3 9 3 1 1 18 0
35 DGESVD N N 1 1 1 1 1 -1 0
35 DGESVD N N 1 1 1 1 1 5 0
49 DGESVD N S 2 2 2 1 2 -1 0
49 DGESVD N S 2 2 2 1 2 4 0
42 DGESVD N N 8 8 8 1 1 -1 0
42 DGESVD N N 8 8 8 1 1 40 0
7 DGESVD S S 9 10 9 9 9 -1 0
7 DGESVD S S 9 10 9 9 9 45 0
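To put the per-call numbers in context, here is a rough sketch in Python (standard library only) that parses rows in the format above and estimates the total time spent in the dominant call. It assumes the quoted per-call times apply to the 3x3 'S','S' compute call (LWORK=6) and that the LWORK=-1 workspace queries are not included. The helper names (parse_row, dominant_calls, total_ns) are made up for illustration, and only the first few rows are pasted in.

```python
# Rough estimate of aggregate time spent in the dominant single-thread call.
# Assumption: the per-call timings below apply to the 3x3 'S','S' compute
# call; workspace queries (LWORK = -1) are excluded from the count.

SUMMARY = """\
61891934 DGESVD S S 3 3 3 3 3 -1 0
61891934 DGESVD S S 3 3 3 3 3 6 0
420 DGESVD N N 1 1 1 1 1 -1 0
420 DGESVD N N 1 1 1 1 1 5 0
"""

# Per-call times in nanoseconds, as measured for the most common call.
PER_CALL_NS = {"ArmPL": 1320, "OpenBLAS": 680, "LAPACK": 400}


def parse_row(line):
    """Parse one row: count, routine, JOBU, JOBVT, M, N, LDA, LDU, LDVT, LWORK, INFO."""
    f = line.split()
    return {
        "count": int(f[0]),
        "routine": f[1],
        "jobu": f[2],
        "jobvt": f[3],
        "m": int(f[4]),
        "n": int(f[5]),
        "lwork": int(f[9]),
    }


def dominant_calls(rows):
    """Count 3x3 'S','S' compute calls, skipping LWORK = -1 workspace queries."""
    return sum(
        r["count"]
        for r in rows
        if r["lwork"] != -1
        and (r["jobu"], r["jobvt"], r["m"], r["n"]) == ("S", "S", 3, 3)
    )


def total_ns(rows, per_call_ns):
    """Total time in nanoseconds for the dominant call at a given per-call cost."""
    return dominant_calls(rows) * per_call_ns


rows = [parse_row(line) for line in SUMMARY.splitlines() if line.strip()]

for lib, ns in PER_CALL_NS.items():
    print(f"{lib}: {total_ns(rows, ns) / 1e9:.1f} s")
```

Under those assumptions this works out to roughly 82 s with ArmPL, 42 s with OpenBLAS and 25 s with LAPACK for that one call shape alone.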
It would be nice to have this fixed. Thanks!
Hi,
Thanks for the feedback. These are *very* small problems, and I think what you are seeing is the overhead of setting up and dispatching to the more optimized code paths intended for larger problems, which is why ArmPL and OpenBLAS are actually slower than reference LAPACK for these problems. We'll take a look, thanks!
Chris.