I am trying to solve a batch of linear systems using QR factorization. The steps I follow are: 1) assemble the matrices and right-hand sides, 2) interleave with dge_interleave, 3) factorize A P = QR with dgeqrf_interleave_batch, 4) compute B := Q^T B with dormqr_interleave_batch, 5) solve R X = B with dtrsm_interleave_batch. Now I need to apply the row (?) permutation to get the true X. I have tried the following process:
// The solution X is now sitting in the B_p buffer.
// The interleaved pivot vectors P_i are in jpvt_p.
std::vector<double> col_B(m * nrhs); // regular column-major array
for (int i = 0; i < ninter; i++) {
    // Deinterleave the i-th matrix back into column-major storage
    ARMPL_CHECK(armpl_dge_deinterleave(ninter, i, m, nrhs, col_B.data(), 1, m,
                                       B_p, istrd_B, jstrd_B));
    // Permute (note: LAPACKE_dlaswp takes 1-based k1/k2 indices)
    LAPACKE_dlaswp(LAPACK_COL_MAJOR, nrhs, col_B.data(), m, 1, m,
                   jpvt_p, istrd_jpvt);
    // Print the result vector (first right-hand side only)
    for (int row = 0; row < m; row++) {
        std::cout << col_B[row] << '\n';
    }
}
Hi Ivan,
Thanks for sharing the details. It's good to hear you've seen _some_ speedup with double-precision data on a machine with a 128-bit vector width. We mainly expected these functions to show a benefit at larger vector widths, and since we added them to the library the LAPACK QR factorization has been optimized further, so that is now quite a high bar to beat.
We don't currently expose a specific API for controlling threading within the library. However, we follow the OpenMP specification closely, so you should be able to control threading either by calling `omp_set_num_threads` before calling into the library, or by setting the OMP_NUM_THREADS environment variable (as a comma-separated list to control nested threading).
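For example (the application name here is hypothetical), a comma-separated list assigns a thread count to each nesting level:

```shell
# 2 threads at the outer parallel level, 4 per nested team
export OMP_NUM_THREADS=2,4
./my_batched_solver
```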
Regards,
Chris.
I didn't know about the nesting feature of OMP_NUM_THREADS. That is very practical, thanks! I will also test against the non-batched LAPACK QR; it is just a 4-line change plus some extra workspace allocation (see below). With a different LAPACK library (Apple Accelerate), LU with pivoting was twice as fast as QR, so I call it a success that the batched QR in Arm PL was ~15% faster than LU with pivoting.
#if 0
// LU factorization
dgesv_(&nt, &nc, wrk, &ld, ipiv, coeffs, &ld, ierr);
#else
// QR factorization
double *work = wrk + ld*nt;
int lwork = ld*nt + nt*nt;
char trans = 'N';
dgels_(&trans, &nt, &nt, &nc, wrk, &ld, coeffs, &ld, work, &lwork, ierr);
#endif