I am trying to solve a batch of linear systems using QR factorization. The steps I follow are: 1) assemble the matrices and right-hand sides, 2) interleave with dge_interleave, 3) factorize A P = QR with dgeqrff_interleave_batch, 4) compute B := Q^T B with dormqr_interleave_batch, 5) solve R X = B with dtrsm_interleave_batch. Now I need to apply the row (?) permutation to recover the true X. I have tried the following process:
// The solution X is now sitting in the B_p buffer.
// The interleaved pivot vectors P_i are in jpvt_p.
std::vector<double> col_B(m * nrhs); // regular column-major array
for (int i = 0; i < ninter; i++) {
    // Deinterleave
    ARMPL_CHECK(armpl_dge_deinterleave(ninter, i, m, nrhs,
                                       col_B.data(), 1, m,
                                       B_p, istrd_B, jstrd_B));
    // Permute (note: LAPACKE_dlaswp expects 1-based k1/k2 row indices)
    LAPACKE_dlaswp(LAPACK_COL_MAJOR, nrhs, col_B.data(), m, 1, m,
                   jpvt_p, istrd_jpvt);
    // Print the result vector (first right-hand side only)
    for (int row = 0; row < m; row++) {
        std::cout << col_B[row] << '\n';
    }
}
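For reference, A P = Q R implies X = P Y, where Y solves R Y = Q^T B. So the true solution is obtained by scattering rows: X(jpvt[k]-1, :) = Y(k, :), with jpvt 1-based as in the dgeqp3 convention. One caveat: LAPACKE_dlaswp expects LU-style interchange pivots (ipiv, a sequence of row swaps), not a permutation vector, so a manual scatter may be the safer route. Below is a minimal sketch for a single already-deinterleaved system in column-major storage; apply_pivot is a hypothetical helper of mine, not an ArmPL routine:

```cpp
#include <vector>
#include <cassert>

// Scatter the rows of Y (m x nrhs, column-major) according to a
// 1-based column-pivot vector jpvt from a pivoted QR factorization:
//   X(jpvt[k]-1, :) = Y(k, :)
std::vector<double> apply_pivot(const std::vector<double>& Y,
                                const std::vector<int>& jpvt,
                                int m, int nrhs) {
    std::vector<double> X(m * nrhs);
    for (int k = 0; k < m; ++k)
        for (int j = 0; j < nrhs; ++j)
            X[(jpvt[k] - 1) + j * m] = Y[k + j * m];
    return X;
}
```

With m = 3, nrhs = 1, jpvt = {2, 3, 1} and Y = {10, 20, 30}, row 0 of Y lands in row 1 of X, row 1 in row 2, and row 2 in row 0, giving X = {30, 10, 20}.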
I didn't know about the nesting feature of omp_set_num_threads. That is very practical, thanks! I will also test against the non-batched LAPACK QR; it is just a 4-line change plus some extra workspace allocation (see below). With a different LAPACK library (Apple Accelerate), LU with pivoting was twice as fast as QR, so I call it a success that the batched QR in Arm PL was ~15 % faster than LU with pivoting.
#if 0
    // LU factorization
    dgesv_(&nt, &nc, wrk, &ld, ipiv, coeffs, &ld, ierr);
#else
    // QR factorization
    double *work = wrk + ld*nt;
    int lwork = ld*nt + nt*nt;
    char trans = 'N';
    dgels_(&trans, &nt, &nt, &nc, wrk, &ld, coeffs, &ld, work, &lwork, ierr);
#endif
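On the workspace sizing: the LAPACK documentation for dgels requires LWORK >= max(1, MN + max(MN, NRHS)) with MN = min(M, N), so ld*nt + nt*nt above is comfortably larger than the documented minimum. A small helper to compute that minimum (the function name is mine, not part of LAPACK):

```cpp
#include <algorithm>
#include <cassert>

// Documented minimum LWORK for dgels:
//   LWORK >= max(1, MN + max(MN, NRHS)), where MN = min(M, N).
int dgels_min_lwork(int m, int n, int nrhs) {
    int mn = std::min(m, n);
    return std::max(1, mn + std::max(mn, nrhs));
}
```

For the M=36, NRHS=9 case discussed here, the minimum is 36 + max(36, 9) = 72 doubles; querying dgels with lwork = -1 instead returns the optimal (usually larger, blocked) size.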
Hi Chris, I just repeated my test with the ArmPL 26.01 library release. The setup was the same as before (M=36, NRHS=9), solving ~210k systems that are assembled on the fly. The results I got with 1 thread (OpenMP executable) and sequential ArmPL (-larmpl) are the following:
So the batched QR delivers almost a 2x speed-up with NINTER=8, which I guess is the best we can expect with the 128-bit vector length. I did manage to improve the performance of the LU variant, which is now the fastest (also thanks to ArmPL!). As mentioned earlier, for pivoting reasons I cannot use the batched LU routines. But I do have a different problem where the matrices aren't square, so the batched QR will be the right one there. Here's a plot of the multi-threaded scaling of the outer loop over the systems:
Hi Ivan,
Thanks for letting us know. This is great feedback and it's nice to hear these routines are providing a benefit to you.
Cheers,
Chris.