How to apply interleaved batch permutation?


I am trying to solve a batch of linear systems using QR factorization.

The steps I follow are: 1) Assemble matrix and right-hand sides, 2) interleave with dge_interleave, 3) A P = QR with dgeqrff_interleave_batch, 4) B := Q^T B with dormqr_interleave_batch, 5) solve R X = B with dtrsm_interleave_batch. 

Now I need to apply the row (?) permutation to get the true X. 
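
Whether this is a row or a column permutation can be read off the factorization itself. The following is a short derivation, assuming $P$ is the column-permutation matrix in $A P = Q R$ and that $A$ is square and nonsingular (the least-squares case leads to the same final formula):

```latex
\begin{align*}
A X &= B, \qquad A P = Q R \\
(A P)\,(P^{T} X) &= B \\
Q R \, Y &= B, \qquad Y := P^{T} X \\
R Y &= Q^{T} B \\
X &= P Y
\end{align*}
```

So the triangular solve in step 5 produces $Y$, and $X$ is recovered by permuting the rows of $Y$. If the pivot vector $p$ follows the LAPACK `dgeqp3` convention (column $j$ of $A P$ is column $p_j$ of $A$), this means $X_{p_j,:} = Y_{j,:}$. Whether the interleaved routine uses the same convention, and whether its indices are 0- or 1-based, is an assumption to check against the ArmPL documentation.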

I have tried the following:

    // The solution X is now sitting in the B_p buffer.
    // The interleaved pivot vectors P**i are in jpvt_p.

    std::vector<double> col_B(m*nrhs); // regular column-major array

    for (int i = 0; i < ninter; i++) {

        // Deinterleave
        ARMPL_CHECK(
            armpl_dge_deinterleave(ninter, i,
                m, nrhs, col_B.data(), 1, m,
                B_p, istrd_B, jstrd_B));

        // Permute
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, nrhs, col_B.data(), m,
            0, m-1, jpvt_p, istrd_jpvt);

        // Print the result vector (first right-hand side only)
        for (int row = 0; row < m; row++) {
            std::cout << col_B[row] << '\n';
        }
    }


but it doesn't give me the expected result.
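
One possible cause: `dlaswp` interprets its integer array as a sequence of row interchanges (the `ipiv` format produced by `dgetrf`) and takes 1-based `k1`/`k2`, whereas a column-pivoted QR normally returns a permutation vector. A minimal sketch of applying a permutation vector directly is shown below. The helper `apply_column_pivots` is hypothetical (it is not an ArmPL or LAPACKE routine), the pivot type is assumed to be `int`, and the pivot convention (`dgeqp3`-style, with a selectable index base) is an assumption rather than confirmed ArmPL behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of ArmPL or LAPACKE.
// y:          column-major n x nrhs result of the triangular solve
//             (leading dimension ld).
// jpvt:       pointer to the first pivot entry of ONE matrix; consecutive
//             entries of that matrix are `stride` elements apart.
// index_base: 1 if the pivots are 1-based (LAPACK style), 0 if 0-based.
// Returns x with x(p[j], :) = y(j, :), i.e. X = P * Y.
std::vector<double> apply_column_pivots(const std::vector<double>& y,
                                        int n, int nrhs, int ld,
                                        const int* jpvt, std::ptrdiff_t stride,
                                        int index_base)
{
    std::vector<double> x(y.size(), 0.0);
    for (int j = 0; j < n; ++j) {
        // Destination row of y(j, :) in the unpermuted solution.
        const int row = jpvt[j * stride] - index_base;
        assert(row >= 0 && row < n);
        for (int c = 0; c < nrhs; ++c) {
            const std::size_t col_offset = static_cast<std::size_t>(c) * ld;
            x[row + col_offset] = y[j + col_offset];
        }
    }
    return x;
}
```

In the loop above this would be called after the deinterleave step, for example as `apply_column_pivots(col_B, m, nrhs, m, jpvt_p + i, istrd_jpvt, 1)`, assuming the pivots of matrix `i` start at `jpvt_p + i` and are 1-based; both are guesses about the interleaved layout. LAPACK also provides `dlapmr` for applying a permutation vector to rows, but it expects a contiguous index array.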

I would be very grateful if you could add an example of using the batch-interleave QR functions in future releases.


  • Hi Chris,

    I just repeated my test with the ArmPL 26.01 library release. The setup was the same as before (M=36, NRHS=9), and solving ~210k systems, where the systems are assembled on the fly.

    The results I got with 1 thread (OpenMP executable) and sequential ArmPL (-larmpl) are the following:

    Variant                Rate (systems per second)
    Batch QR (NINTER=2)    20072.9
    Batch QR (NINTER=4)    72928.0
    Batch QR (NINTER=8)    82158.7
    QR (DGELS)             45166.1
    LU (DGESV)             117065.6

    So the batch QR with NINTER=8 delivers almost a 2x speed-up over DGELS (about 1.8x), which I guess is the best we can expect with the 128-bit vector length.

    I did manage to improve the performance of the LU variant, which is now the fastest (also thanks to ArmPL!). As mentioned earlier, due to pivoting reasons, I cannot use the batched LU routines. But I do have a different problem where the matrices aren't square, so the batched QR will be the right one.

    Here's a plot of the multi-threaded scaling of the outer loop over systems:
