I am trying to solve a batch of linear systems using QR factorization. The steps I follow are: 1) assemble the matrices and right-hand sides, 2) interleave with dge_interleave, 3) factorize A P = QR with dgeqrf_interleave_batch, 4) compute B := Q^T B with dormqr_interleave_batch, 5) solve R X = B with dtrsm_interleave_batch. Now I need to apply the row (?) permutation to get the true X. I have tried the following process:
// The solution X is now sitting in the B_p buffer.
// The interleaved pivot vectors P_i are in jpvt_p.
std::vector<double> col_B(m * nrhs); // regular column-major array
for (int i = 0; i < ninter; i++) {
    // Deinterleave the i-th matrix back into column-major storage
    ARMPL_CHECK(armpl_dge_deinterleave(ninter, i, m, nrhs, col_B.data(), 1, m,
                                       B_p, istrd_B, jstrd_B));
    // Permute (note: LAPACKE_dlaswp takes 1-based k1/k2 indices)
    LAPACKE_dlaswp(LAPACK_COL_MAJOR, nrhs, col_B.data(), m, 1, m,
                   jpvt_p, istrd_jpvt);
    // Print the result vector (first right-hand side only)
    for (int row = 0; row < m; row++) {
        std::cout << col_B[row] << '\n';
    }
}
Hi Ivan,
Thanks for sharing the details. It's good to hear you've seen _some_ speedup with double-precision data on a machine with a 128-bit vector width. We mainly expected these functions to show a benefit at larger vector widths, and since we added them to the library the LAPACK QR factorization has been optimized further, so that is now quite a high bar to beat.
We don't currently expose a specific API for controlling threading within the library. However, we follow the OpenMP specification closely, so you should be able to control threading either by calling `omp_set_num_threads` before calling into the library, or by setting the OMP_NUM_THREADS environment variable (as a comma-separated list to control nested threading).
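For example (the application name here is hypothetical), a comma-separated list assigns a thread count to each nesting level:

```shell
# 2 threads at the outer parallel level, 4 per nested team
export OMP_NUM_THREADS=2,4
./my_batched_solver
```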
Regards,
Chris.
I didn't know about the nesting feature of OMP_NUM_THREADS. That is very practical, thanks! I will also test against the non-batched LAPACK QR; it is just a 4-line change plus some extra workspace allocation (see below). With a different LAPACK library (Apple Accelerate), LU with pivoting was twice as fast as QR, so I call it a success that the batched QR in Arm PL was ~15% faster than LU with pivoting.
#if 0
// LU factorization
dgesv_(&nt, &nc, wrk, &ld, ipiv, coeffs, &ld, ierr);
#else
// QR factorization
double *work = wrk + ld*nt;
int lwork = ld*nt + nt*nt;
char trans = 'N';
dgels_(&trans, &nt, &nt, &nc, wrk, &ld, coeffs, &ld, work, &lwork, ierr);
#endif