I am trying to solve a batch of linear systems using QR factorization. The steps I follow are: 1) assemble the matrices and right-hand sides, 2) interleave with dge_interleave, 3) factorize A P = QR with dgeqrff_interleave_batch, 4) compute B := Q^T B with dormqr_interleave_batch, 5) solve R X = B with dtrsm_interleave_batch. Now I need to apply the row (?) permutation to recover the true X. I have tried the following process:
// The solution X is now sitting in the B_p buffer.
// The interleaved pivot vectors P_i are in jpvt_p.
std::vector<double> col_B(m * nrhs); // regular column-major array
for (int i = 0; i < ninter; i++) {
    // Deinterleave
    ARMPL_CHECK(armpl_dge_deinterleave(ninter, i, m, nrhs,
                                       col_B.data(), 1, m,
                                       B_p, istrd_B, jstrd_B));
    // Permute (note: LAPACKE_dlaswp expects 1-based k1/k2 row indices)
    LAPACKE_dlaswp(LAPACK_COL_MAJOR, nrhs, col_B.data(), m, 1, m,
                   jpvt_p, istrd_jpvt);
    // Print the result vector (first right-hand side only)
    for (int row = 0; row < m; row++) {
        std::cout << col_B[row] << '\n';
    }
}
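For reference, A P = Q R implies X = P Y, where Y solves R Y = Q^T B. So the true solution is obtained by scattering rows: X(jpvt[k]-1, :) = Y(k, :), with jpvt 1-based as in the dgeqp3 convention. One caveat: LAPACKE_dlaswp expects LU-style interchange pivots (ipiv, a sequence of row swaps), not a permutation vector, so a manual scatter may be the safer route. Below is a minimal sketch for a single already-deinterleaved system in column-major storage; apply_pivot is a hypothetical helper of mine, not an ArmPL routine:

```cpp
#include <vector>
#include <cassert>

// Scatter the rows of Y (m x nrhs, column-major) according to a
// 1-based column-pivot vector jpvt from a pivoted QR factorization:
//   X(jpvt[k]-1, :) = Y(k, :)
std::vector<double> apply_pivot(const std::vector<double>& Y,
                                const std::vector<int>& jpvt,
                                int m, int nrhs) {
    std::vector<double> X(m * nrhs);
    for (int k = 0; k < m; ++k)
        for (int j = 0; j < nrhs; ++j)
            X[(jpvt[k] - 1) + j * m] = Y[k + j * m];
    return X;
}
```

With m = 3, nrhs = 1, jpvt = {2, 3, 1} and Y = {10, 20, 30}, row 0 of Y lands in row 1 of X, row 1 in row 2, and row 2 in row 0, giving X = {30, 10, 20}.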
I didn't know about the nesting feature of omp_set_num_threads. That is very practical, thanks! I will also test against the non-batched LAPACK QR; it is just a 4-line change plus some extra workspace allocation (see below). With a different LAPACK library (Apple Accelerate), LU with pivoting was twice as fast as QR, so I call it a success that the batched QR in Arm PL was ~15 % faster than LU with pivoting.
#if 0
    // LU factorization
    dgesv_(&nt, &nc, wrk, &ld, ipiv, coeffs, &ld, ierr);
#else
    // QR factorization
    double *work = wrk + ld*nt;
    int lwork = ld*nt + nt*nt;
    char trans = 'N';
    dgels_(&trans, &nt, &nt, &nc, wrk, &ld, coeffs, &ld, work, &lwork, ierr);
#endif
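On the workspace sizing: the LAPACK documentation for dgels requires LWORK >= max(1, MN + max(MN, NRHS)) with MN = min(M, N), so ld*nt + nt*nt above is comfortably larger than the documented minimum. A small helper to compute that minimum (the function name is mine, not part of LAPACK):

```cpp
#include <algorithm>
#include <cassert>

// Documented minimum LWORK for dgels:
//   LWORK >= max(1, MN + max(MN, NRHS)), where MN = min(M, N).
int dgels_min_lwork(int m, int n, int nrhs) {
    int mn = std::min(m, n);
    return std::max(1, mn + std::max(mn, nrhs));
}
```

For the M=36, NRHS=9 case discussed here, the minimum is 36 + max(36, 9) = 72 doubles; querying dgels with lwork = -1 instead returns the optimal (usually larger, blocked) size.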
Hi Chris, I just repeated my test with the ArmPL 26.01 library release. The setup was the same as before (M=36, NRHS=9), solving ~210k systems that are assembled on the fly. The results I got with 1 thread (OpenMP executable) and sequential ArmPL (-larmpl) are the following:
So the batched QR delivers almost a 2x speed-up with NINTER=8, which I guess is the best we can expect with the 128-bit vector length. I did manage to improve the performance of the LU variant, which is now the fastest (also thanks to ArmPL!). As mentioned earlier, for pivoting reasons I cannot use the batched LU routines. But I do have a different problem where the matrices aren't square, so the batched QR will be the right one there. Here's a plot of the multi-threaded scaling of the outer loop over the systems:
Hi Ivan,
Thanks for letting us know. This is great feedback and it's nice to hear these routines are providing a benefit to you.
Cheers,
Chris.