I am trying to solve a batch of linear systems using QR factorization. The steps I follow are: 1) assemble the matrices and right-hand sides, 2) interleave with dge_interleave, 3) factorize A P = Q R with dgeqrf_interleave_batch, 4) compute B := Q^T B with dormqr_interleave_batch, 5) solve R X = B with dtrsm_interleave_batch. Now I need to apply the row (?) permutation to recover the true X. I have tried the following process:
```cpp
// The solution X is now sitting in the B_p buffer.
// The interleaved pivot vectors P_i are in jpvt_p.
std::vector<double> col_B(m * nrhs); // regular column-major array

for (int i = 0; i < ninter; i++) {
    // Deinterleave the i-th solution into col_B
    ARMPL_CHECK(armpl_dge_deinterleave(ninter, i, m, nrhs,
                                       col_B.data(), 1, m,
                                       B_p, istrd_B, jstrd_B));
    // Permute (note: the K1/K2 arguments of LAPACKE_dlaswp are 1-based)
    LAPACKE_dlaswp(LAPACK_COL_MAJOR, nrhs, col_B.data(), m,
                   1, m, jpvt_p, istrd_jpvt);
    // Print the result vector (first right-hand side only)
    for (int row = 0; row < m; row++) {
        std::cout << col_B[row] << '\n';
    }
}
```
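My understanding of the permutation step: from A P = Q R, solving A X = B gives R Y = Q^T B with Y = P^T X, so row k of Y has to be scattered to row jpvt[k]-1 of X (assuming the usual 1-based dgeqp3-style pivot convention, where column jpvt[k] of A is moved to position k). Below is a minimal sketch of applying that scatter by hand for one system; `jpvt` here stands for a plain, already-deinterleaved pivot vector, which is an assumption on my part, and I am not sure dlaswp is even the right tool given that jpvt is a permutation vector rather than an ipiv-style interchange list.

```cpp
// Sketch only: apply the permutation explicitly rather than via dlaswp.
// Assumes jpvt is a plain (deinterleaved) pivot vector in the 1-based
// dgeqp3 convention: column jpvt[k] of A was moved to position k, hence
// row k of Y = P^T X (held in col_B) belongs at row jpvt[k]-1 of X.
std::vector<double> X(m * nrhs);   // column-major, leading dimension m
for (int j = 0; j < nrhs; j++) {
    for (int k = 0; k < m; k++) {
        X[(jpvt[k] - 1) + j * m] = col_B[k + j * m];
    }
}
```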
Thank you for the answers. My problem requires assembling O(10^6) or more linear systems of the form A W = B. The weights W are packed into a batch of sparse matrices. The individual matrix sizes are generally between 30 and 70. I ran some experiments with different `ninter` values for `M = 36` and `NRHS = 9`.
These are single-thread results from an Apple M2 using Arm PL 25.07. For comparison, using (non-batched) LU factorization with pivoting on the same problem, I observe ~70k systems per second, so the batched QR factorization is only a tiny bit faster here.

I did observe one problem when multi-threading is in use. I use my own outer OpenMP loop (sketched below) which fills the matrices, does the batch factorization (ninter = 8), and packs the results into a sparse matrix. When I link against libarmpl (serial) things work nicely, but when I link against libarmpl_pl I observe a slowdown; I suspect this is due to thread contention with the threading used internally in Arm PL. Is it possible to control the number of threads used by Arm PL locally, similar to the way Intel oneAPI MKL offers `mkl_set_num_threads_local`/`mkl_domain_set_num_threads`?
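For reference, my outer loop looks roughly like this; the three helper functions are placeholders for my own code (not Arm PL calls), and factor_and_solve_block() wraps the interleave / geqrf / ormqr / trsm sequence shown above:

```cpp
#include <algorithm>
#include <omp.h>

// Placeholders for my own code; factor_and_solve_block() wraps the
// Arm PL interleave-batch QR sequence described above.
void assemble_block(int first, int count);
void factor_and_solve_block(int first, int count);
void pack_into_sparse(int first, int count);

// Outer OpenMP loop: nsys systems processed in blocks of ninter = 8.
void solve_all(int nsys) {
    const int ninter = 8;
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nsys; b += ninter) {
        const int nb = std::min(ninter, nsys - b);
        assemble_block(b, nb);          // fill the interleaved A_p, B_p buffers
        factor_and_solve_block(b, nb);  // batched QR factorization and solve
        pack_into_sparse(b, nb);        // scatter the solutions into sparse W
    }
}
```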
Regards,
Ivan