Some time ago I reported a performance issue with ArmPL on CPUs with a large number of cores. See https://community.arm.com/support-forums/f/high-performance-computing-forum/53959/negative-armpl-mt-speed-up-on-many-core-systems It was fixed in version 23.04, and as a result my application gained a lot of performance. Recently I did some more scaling tests and discovered more issues.

The testing was done on a 128-core Ampere Altra CPU running Ubuntu 22.04. The application uses ArmPL 23.04 and LLVM OpenMP, and is compiled with GCC 12. It uses ArmPL in two ways, depending on the algorithm: (1) a single application thread calls ArmPL; (2) multiple application threads call ArmPL at the same time. I did the performance profiling with perf-lib-tools.

ZGETRF

The problem with ZGETRF occurs when multiple application threads call it at the same time. Typical inputs are M=N, with sizes varying between 10 and 300. Maybe there is some locking issue? Summary from perf-lib-tools:

OMP_NUM_THREADS=32
zgetrf_ cnt= 2434696 totTime= 1504.5387 called_tot= 171712 topTime= 93.8037 (%age of runtime: 6.428 )
Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_915245.apl | sort -k8 -nr | head | sort -k12 -n
Routine: zgetrf_ nCalls: 216920 Mean_time 8.647643e-05 nUserCalls: 12077 Mean_user_time: 8.447802e-05 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 66792 Mean_time 2.949315e-04 nUserCalls: 7386 Mean_user_time: 2.930019e-04 Inputs: 100 100 100 1 0
Routine: zgetrf_ nCalls: 126456 Mean_time 3.720812e-04 nUserCalls: 13398 Mean_user_time: 3.665895e-04 Inputs: 110 110 110 1 0
Routine: zgetrf_ nCalls: 201256 Mean_time 4.436309e-04 nUserCalls: 16952 Mean_user_time: 4.371607e-04 Inputs: 120 120 120 1 0
Routine: zgetrf_ nCalls: 168344 Mean_time 5.473583e-04 nUserCalls: 14346 Mean_user_time: 5.396482e-04 Inputs: 130 130 130 1 0
Routine: zgetrf_ nCalls: 147136 Mean_time 6.433120e-04 nUserCalls: 11615 Mean_user_time: 6.342174e-04 Inputs: 140 140 140 1 0
Routine: zgetrf_ nCalls: 211288 Mean_time 6.791234e-04 nUserCalls: 9395 Mean_user_time: 6.745791e-04 Inputs: 142 142 142 1 0
Routine: zgetrf_ nCalls: 123904 Mean_time 7.762974e-04 nUserCalls: 8320 Mean_user_time: 7.669183e-04 Inputs: 150 150 150 1 0
Routine: zgetrf_ nCalls: 73040 Mean_time 8.895631e-04 nUserCalls: 5874 Mean_user_time: 8.788582e-04 Inputs: 160 160 160 1 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.083270e-03 nUserCalls: 9152 Mean_user_time: 1.073041e-03 Inputs: 170 170 170 1 0

OMP_NUM_THREADS=128
zgetrf_ cnt= 2434696 totTime= 2046.2830 called_tot= 241094 topTime= 151.9661 (%age of runtime: 9.300 )
Most frequent calls:
$ grep -i zgetrf /tmp/armplsummary_912580.apl | sort -k8 -nr | head | sort -k12 -n
Routine: zgetrf_ nCalls: 216920 Mean_time 2.516856e-04 nUserCalls: 11667 Mean_user_time: 1.828228e-04 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 66792 Mean_time 4.422276e-04 nUserCalls: 9588 Mean_user_time: 3.222948e-04 Inputs: 100 100 100 1 0
Routine: zgetrf_ nCalls: 126456 Mean_time 5.297068e-04 nUserCalls: 18734 Mean_user_time: 3.932569e-04 Inputs: 110 110 110 1 0
Routine: zgetrf_ nCalls: 201256 Mean_time 6.306487e-04 nUserCalls: 23959 Mean_user_time: 4.647142e-04 Inputs: 120 120 120 1 0
Routine: zgetrf_ nCalls: 168344 Mean_time 7.413436e-04 nUserCalls: 27496 Mean_user_time: 5.610537e-04 Inputs: 130 130 130 1 0
Routine: zgetrf_ nCalls: 147136 Mean_time 8.560422e-04 nUserCalls: 23896 Mean_user_time: 6.514217e-04 Inputs: 140 140 140 1 0
Routine: zgetrf_ nCalls: 123904 Mean_time 1.012120e-03 nUserCalls: 20248 Mean_user_time: 7.814398e-04 Inputs: 150 150 150 1 0
Routine: zgetrf_ nCalls: 73040 Mean_time 1.143292e-03 nUserCalls: 14404 Mean_user_time: 8.928624e-04 Inputs: 160 160 160 1 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.406755e-03 nUserCalls: 10160 Mean_user_time: 1.091698e-03 Inputs: 170 170 170 1 0
Routine: zgetrf_ nCalls: 19096 Mean_time 1.444096e-03 nUserCalls: 5755 Mean_user_time: 1.218681e-03 Inputs: 180 180 180 1 0

DGEMM

Even though DGEMM received substantial performance improvements, it still has some issues:

OMP_NUM_THREADS=32
dgemm_ cnt= 30596724 totTime= 272.4157 called_tot= 30596724 topTime= 272.4157 (%age of runtime: 2.876 )
Example calls:
Routine: dgemm_ nCalls: 14728 Mean_time 1.281673e-05 nUserCalls: 14728 Mean_user_time: 1.281673e-05 Inputs: 36 1 252 36 288 1226658 N N
Routine: dgemm_ nCalls: 9884 Mean_time 9.806415e-06 nUserCalls: 9884 Mean_user_time: 9.806415e-06 Inputs: 264 1 42 42 42 264 T N
Routine: dgemm_ nCalls: 14728 Mean_time 9.071300e-06 nUserCalls: 14728 Mean_user_time: 9.071300e-06 Inputs: 252 1 36 36 36 252 T N
Routine: dgemm_ nCalls: 9884 Mean_time 1.381515e-05 nUserCalls: 9884 Mean_user_time: 1.381515e-05 Inputs: 42 1 264 42 306 1226658 N N
Routine: dgemm_ nCalls: 12238 Mean_time 9.176557e-06 nUserCalls: 12238 Mean_user_time: 9.176557e-06 Inputs: 246 1 36 36 36 246 T N
Routine: dgemm_ nCalls: 17002 Mean_time 1.015412e-05 nUserCalls: 17002 Mean_user_time: 1.015412e-05 Inputs: 240 1 48 48 48 240 T N
Routine: dgemm_ nCalls: 13044 Mean_time 1.026566e-05 nUserCalls: 13044 Mean_user_time: 1.026566e-05 Inputs: 210 1 54 54 54 210 T N
Routine: dgemm_ nCalls: 10980 Mean_time 1.011446e-05 nUserCalls: 10980 Mean_user_time: 1.011446e-05 Inputs: 276 1 48 48 48 276 T N
Routine: dgemm_ nCalls: 24626 Mean_time 9.204542e-06 nUserCalls: 24626 Mean_user_time: 9.204542e-06 Inputs: 204 1 42 42 42 204 T N
Routine: dgemm_ nCalls: 19882 Mean_time 9.293070e-06 nUserCalls: 19882 Mean_user_time: 9.293070e-06 Inputs: 216 1 42 42 42 216 T N

OMP_NUM_THREADS=128
dgemm_ cnt= 30597188 totTime= 350.5334 called_tot= 30597188 topTime= 350.5334 (%age of runtime: 3.982 )
Example calls:
Routine: dgemm_ nCalls: 14728 Mean_time 3.143310e-05 nUserCalls: 14728 Mean_user_time: 3.143310e-05 Inputs: 36 1 252 36 288 1226658 N N
Routine: dgemm_ nCalls: 9884 Mean_time 2.066992e-05 nUserCalls: 9884 Mean_user_time: 2.066992e-05 Inputs: 264 1 42 42 42 264 T N
Routine: dgemm_ nCalls: 14728 Mean_time 1.720480e-05 nUserCalls: 14728 Mean_user_time: 1.720480e-05 Inputs: 252 1 36 36 36 252 T N
Routine: dgemm_ nCalls: 9884 Mean_time 2.001355e-05 nUserCalls: 9884 Mean_user_time: 2.001355e-05 Inputs: 42 1 264 42 306 1226658 N N
Routine: dgemm_ nCalls: 12246 Mean_time 3.380307e-05 nUserCalls: 12246 Mean_user_time: 3.380307e-05 Inputs: 246 1 36 36 36 246 T N
Routine: dgemm_ nCalls: 17012 Mean_time 2.844690e-05 nUserCalls: 17012 Mean_user_time: 2.844690e-05 Inputs: 240 1 48 48 48 240 T N
Routine: dgemm_ nCalls: 13042 Mean_time 2.803772e-05 nUserCalls: 13042 Mean_user_time: 2.803772e-05 Inputs: 210 1 54 54 54 210 T N
Routine: dgemm_ nCalls: 10982 Mean_time 2.659640e-05 nUserCalls: 10982 Mean_user_time: 2.659640e-05 Inputs: 276 1 48 48 48 276 T N
Routine: dgemm_ nCalls: 24612 Mean_time 2.413749e-05 nUserCalls: 24612 Mean_user_time: 2.413749e-05 Inputs: 204 1 42 42 42 204 T N
Routine: dgemm_ nCalls: 19862 Mean_time 2.381463e-05 nUserCalls: 19862 Mean_user_time: 2.381463e-05 Inputs: 216 1 42 42 42 216 T N

It would be nice to have this fixed.
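To make the per-call regression easier to see, the ratio of mean call times between the two thread counts can be computed directly from the summary lines. The following is a minimal sketch that assumes the line layout quoted in this post (`Routine: ... Mean_time ... Inputs: ...`); the helper names `parse_summary` and `slowdown` are made up for illustration and are not part of perf-lib-tools. Only three of the quoted lines per run are included to keep it short.

```python
import re

# Layout of one summary line, as quoted in the post above (an assumption
# about the general format; only these quoted lines were checked).
LINE_RE = re.compile(
    r"Routine:\s+(?P<routine>\S+)\s+nCalls:\s+(?P<ncalls>\d+)\s+"
    r"Mean_time\s+(?P<mean>\S+)\s+nUserCalls:\s+(?P<nuser>\d+)\s+"
    r"Mean_user_time:\s+(?P<mean_user>\S+)\s+Inputs:\s+(?P<inputs>.+)"
)


def parse_summary(text):
    """Map (routine, inputs) -> mean time per call in seconds."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            result[(m["routine"], m["inputs"].strip())] = float(m["mean"])
    return result


def slowdown(run_a, run_b):
    """Ratio run_b / run_a of mean call time for inputs present in both runs."""
    return {k: run_b[k] / run_a[k] for k in run_a.keys() & run_b.keys()}


# A few lines copied from the OMP_NUM_THREADS=32 output above.
RUN_32 = """
Routine: zgetrf_ nCalls: 216920 Mean_time 8.647643e-05 nUserCalls: 12077 Mean_user_time: 8.447802e-05 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.083270e-03 nUserCalls: 9152 Mean_user_time: 1.073041e-03 Inputs: 170 170 170 1 0
Routine: dgemm_ nCalls: 12238 Mean_time 9.176557e-06 nUserCalls: 12238 Mean_user_time: 9.176557e-06 Inputs: 246 1 36 36 36 246 T N
"""

# The matching lines from the OMP_NUM_THREADS=128 output above.
RUN_128 = """
Routine: zgetrf_ nCalls: 216920 Mean_time 2.516856e-04 nUserCalls: 11667 Mean_user_time: 1.828228e-04 Inputs: 52 52 52 2 0
Routine: zgetrf_ nCalls: 195888 Mean_time 1.406755e-03 nUserCalls: 10160 Mean_user_time: 1.091698e-03 Inputs: 170 170 170 1 0
Routine: dgemm_ nCalls: 12246 Mean_time 3.380307e-05 nUserCalls: 12246 Mean_user_time: 3.380307e-05 Inputs: 246 1 36 36 36 246 T N
"""

if __name__ == "__main__":
    ratios = slowdown(parse_summary(RUN_32), parse_summary(RUN_128))
    for (routine, inputs), ratio in sorted(ratios.items()):
        print(f"{routine:8s} {inputs:28s} 128/32 mean-time ratio: {ratio:.2f}")
```

With these three lines, the smallest ZGETRF case and the DGEMM case slow down by roughly a factor of three when going from 32 to 128 threads, while the 170x170 ZGETRF case slows down by about 30%.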
I came across another scaling issue. It affects both macOS and Linux. It is less noticeable on macOS because Apple silicon has relatively few cores, but on AWS Graviton3E (64 cores) and especially on Ampere Altra Max (128 cores) the performance hit is massive. Every other BLAS/LAPACK implementation I tested (vecLib, Netlib, OpenBLAS) outperformed ArmPL for the workload in question on both macOS and Linux.

The problem seems to be that DGELS scales poorly. The software calls it in both multi-thread and multi-instance (several threads calling it at the same time) mode. Below are some example inputs:

N 117 20 1 117 117 580 0
N 117 20 1 117 117 -1 0
N 117 10 1 189 189 580 0
N 117 10 1 189 189 -1 0
N 153 20 1 153 153 580 0
N 153 20 1 153 153 -1 0
N 189 20 1 189 189 580 0
N 189 20 1 189 189 -1 0
N 99 20 1 117 117 580 0
N 99 20 1 117 117 -1 0
N 63 10 1 189 189 580 0
N 63 10 1 189 189 -1 0

I am not sure whether this is only a scaling issue or whether the performance of DGELS in single-thread mode can be optimized too. Maybe the ArmPL team can have a look?
Thanks, we'll take a look at this too. It's probably a similar underlying issue to the one we've been addressing for GETRF, which affects these small problems when running with large numbers of cores. This time the problem may actually be in GEQRF as called by GELS.
Chris.
Yes, could be. Our workloads which make heavy use of DGELS perform better with OpenBLAS. We see performance issues with DGBTRS too and possibly with DGBTRF. Are they all related to GEQRF?
Some example DGBTRS inputs:
Inputs: 10 1 1 1 4 1 10 0 N
Inputs: 106 27 27 1 82 1 106 0 N
Inputs: 11 1 1 1 4 1 11 0 N
Inputs: 11 2 2 1 7 1 11 0 N
Inputs: 112 31 31 1 94 1 112 0 N
Inputs: 11 3 3 1 10 1 11 0 N
Inputs: 120 31 31 1 94 1 120 0 N
Inputs: 12 2 2 1 7 1 12 0 N
Inputs: 12 3 3 1 10 1 12 0 N
Inputs: 13 2 2 1 7 1 13 0 N
Inputs: 13 3 3 1 10 1 13 0 N
Inputs: 14 2 2 1 7 1 14 0 N
Inputs: 14 3 3 1 10 1 14 0 N
Inputs: 15 2 2 1 7 1 15 0 N
Inputs: 15 3 3 1 10 1 15 0 N
Inputs: 16 3 3 1 10 1 16 0 N
Inputs: 17 3 3 1 10 1 17 0 N
Inputs: 18 3 3 1 10 1 18 0 N
Inputs: 19 3 3 1 10 1 19 0 N
Inputs: 20 2 2 1 7 1 20 0 N
Inputs: 20 3 3 1 10 1 20 0 N
Inputs: 20 4 4 1 13 1 20 0 N
Inputs: 20 5 5 1 16 1 20 0 N
Inputs: 20 6 6 1 19 1 20 0 N
Inputs: 5 1 1 1 4 1 5 0 N
Inputs: 6 1 1 1 4 1 6 0 N
Inputs: 7 1 1 1 4 1 7 0 N
Inputs: 8 1 1 1 4 1 8 0 N
Inputs: 9 1 1 1 4 1 9 0 N
Inputs: 9 2 2 1 7 1 9 0 N
Inputs: 98 27 27 1 82 1 98 0 N
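The DGBTRS inputs above can be read as small, narrow-banded systems. The sketch below decodes a few of them under the assumption that the first five logged fields are N, KL, KU, NRHS and LDAB, in the order of the LAPACK DGBTRS argument list; that mapping is a guess based on the values and is not confirmed by the tool's documentation, and the remaining fields are left uninterpreted. It then checks the LAPACK storage requirement LDAB >= 2*KL+KU+1 for factored band matrices. The function name `decode_dgbtrs` is hypothetical.

```python
def decode_dgbtrs(inputs):
    """Decode one logged DGBTRS input line.

    Assumed field order (not confirmed): N, KL, KU, NRHS, LDAB, then
    fields that are left uninterpreted here.
    """
    fields = inputs.split()
    n, kl, ku, nrhs, ldab = (int(x) for x in fields[:5])
    # LAPACK stores the LU factors of a band matrix in 2*KL+KU+1 rows.
    min_ldab = 2 * kl + ku + 1
    return {
        "n": n,
        "kl": kl,
        "ku": ku,
        "nrhs": nrhs,
        "ldab": ldab,
        "min_ldab": min_ldab,
        "ldab_ok": ldab >= min_ldab,
    }


# Three of the input lines quoted above.
SAMPLES = [
    "10 1 1 1 4 1 10 0 N",
    "106 27 27 1 82 1 106 0 N",
    "20 6 6 1 19 1 20 0 N",
]

if __name__ == "__main__":
    for line in SAMPLES:
        d = decode_dgbtrs(line)
        print(f"N={d['n']:4d} KL={d['kl']:3d} KU={d['ku']:3d} "
              f"LDAB={d['ldab']:3d} (minimum {d['min_ldab']})")
```

Under that reading, every sampled line uses exactly the minimum LDAB, and the largest system has only 120 unknowns, so there is very little work per call to spread across threads.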
Some example DGELS inputs:
Inputs: 16 10 2 16 16 -1 0 N
Inputs: 16 10 2 16 16 570 0 N
Inputs: 16 10 2 19 19 -1 0 N
Inputs: 16 10 2 19 19 570 0 N
Inputs: 16 10 2 22 22 -1 0 N
Inputs: 16 10 2 22 22 570 0 N
Inputs: 16 10 2 26 26 -1 0 N
Inputs: 16 10 2 26 26 570 0 N
Inputs: 16 10 2 29 29 -1 0 N
Inputs: 16 10 2 29 29 570 0 N
Inputs: 16 10 2 32 32 -1 0 N
Inputs: 16 10 2 32 32 570 0 N
Inputs: 16 10 2 36 36 -1 0 N
Inputs: 16 10 2 36 36 570 0 N
Inputs: 19 10 2 19 19 -1 0 N
Inputs: 19 10 2 19 19 570 0 N
Inputs: 19 10 2 22 22 -1 0 N
Inputs: 19 10 2 22 22 570 0 N
Inputs: 19 10 2 26 26 -1 0 N
Inputs: 19 10 2 26 26 570 0 N
Inputs: 19 10 2 29 29 -1 0 N
Inputs: 19 10 2 29 29 570 0 N
Inputs: 19 10 2 32 32 -1 0 N
Inputs: 19 10 2 32 32 570 0 N
Inputs: 19 10 2 36 36 -1 0 N
Inputs: 19 10 2 36 36 570 0 N
Inputs: 22 10 2 22 22 -1 0 N
Inputs: 22 10 2 22 22 570 0 N
Inputs: 22 10 2 26 26 -1 0 N
Inputs: 22 10 2 26 26 570 0 N
Inputs: 22 10 2 29 29 -1 0 N
Inputs: 22 10 2 29 29 570 0 N
Inputs: 22 10 2 32 32 -1 0 N
Inputs: 22 10 2 32 32 570 0 N
Inputs: 22 10 2 36 36 -1 0 N
Inputs: 22 10 2 36 36 570 0 N
Inputs: 26 10 2 26 26 -1 0 N
Inputs: 26 10 2 26 26 570 0 N
Inputs: 26 10 2 29 29 -1 0 N
Inputs: 26 10 2 29 29 570 0 N
Inputs: 26 10 2 32 32 -1 0 N
Inputs: 26 10 2 32 32 570 0 N
Inputs: 26 10 2 36 36 -1 0 N
Inputs: 26 10 2 36 36 570 0 N
Inputs: 29 10 2 29 29 -1 0 N
Inputs: 29 10 2 29 29 570 0 N
Inputs: 29 10 2 32 32 -1 0 N
Inputs: 29 10 2 32 32 570 0 N
Inputs: 29 10 2 36 36 -1 0 N
Inputs: 29 10 2 36 36 570 0 N
Inputs: 32 10 2 32 32 -1 0 N
Inputs: 32 10 2 32 32 570 0 N
Inputs: 32 10 2 36 36 -1 0 N
Inputs: 32 10 2 36 36 570 0 N
Inputs: 36 10 2 36 36 -1 0 N
Inputs: 36 10 2 36 36 570 0 N
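Every DGELS input above appears twice, once with -1 and once with a positive value in the same position. In LAPACK, LWORK = -1 requests a workspace-size query rather than a solve, so these look like query/solve pairs. The sketch below decodes the lines under the assumption that the fields are M, N, NRHS, LDA, LDB, LWORK, INFO and TRANS (TRANS is logged first in some of the lines in this thread and last in others); the field mapping and the name `decode_dgels` are assumptions made for illustration.

```python
def decode_dgels(inputs):
    """Decode one logged DGELS input line.

    Assumed field order (not confirmed): M, N, NRHS, LDA, LDB, LWORK,
    INFO, with TRANS either first or last.
    """
    fields = inputs.split()
    # Normalise so that TRANS is always the last field.
    if fields[0].isalpha():
        fields = fields[1:] + fields[:1]
    m, n, nrhs, lda, ldb, lwork, info = (int(x) for x in fields[:7])
    return {
        "trans": fields[7],
        "m": m,
        "n": n,
        "nrhs": nrhs,
        "lda": lda,
        "ldb": ldb,
        "lwork": lwork,
        # LWORK = -1 is the LAPACK convention for a workspace-size query.
        "workspace_query": lwork == -1,
    }


# Lines quoted in this thread, in both logged orderings.
SAMPLES = [
    "N 117 20 1 117 117 580 0",
    "N 117 20 1 117 117 -1 0",
    "16 10 2 16 16 -1 0 N",
    "16 10 2 16 16 570 0 N",
]

if __name__ == "__main__":
    decoded = [decode_dgels(s) for s in SAMPLES]
    queries = sum(d["workspace_query"] for d in decoded)
    print(f"{queries} of {len(decoded)} sampled calls are workspace queries")
```

If that reading is right, half of the logged DGELS calls do no factorization at all, which may be worth keeping in mind when comparing call counts and mean times.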
DGBTRS and DGBTRF are not related to GEQRF, but the issue here is the same: your problems are small, and our implementation is geared towards large parallel problems. We have a few LAPACK functions which are implemented in a similar way.
In order to understand the best way to deal with these issues, would you mind getting in touch via support-hpc-sw@arm.com? We have a few options, but it would be better to find out which is the most appropriate for your use cases.
Regards, Chris.
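To put a number on how small these problems are, here is a rough estimate of the work in a single ZGETRF call, using two of the 32-thread mean times quoted at the top of the thread. It assumes the usual leading-order count of (2/3)n^3 complex multiply-add operations for LU, with the conventional weighting of 6 real flops per complex multiply and 2 per complex add, giving about (8/3)n^3 real flops. Lower-order terms and pivoting are ignored, so treat the result as an order-of-magnitude figure only.

```python
def zgetrf_flops(n):
    """Leading-order real flop estimate for LU of an n x n complex matrix.

    About n^3/3 complex multiplies (6 flops each) plus n^3/3 complex
    adds (2 flops each) gives (8/3) * n^3.
    """
    return 8.0 / 3.0 * n ** 3


# (n, mean time per call in seconds) from the OMP_NUM_THREADS=32 ZGETRF
# summary quoted earlier in this thread.
MEASURED = [
    (52, 8.647643e-05),
    (170, 1.083270e-03),
]

if __name__ == "__main__":
    for n, seconds in MEASURED:
        flops = zgetrf_flops(n)
        print(f"n={n:4d}: ~{flops / 1e6:.2f} Mflop per call, "
              f"~{flops / seconds / 1e9:.1f} Gflop/s per call")
```

For n = 52 this comes to well under one million flops per call, so thread start-up and synchronization costs can easily be comparable to the computation itself.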