How to apply interleaved batch permutation?


I am trying to solve a batch of linear systems using QR factorization.

The steps I follow are: 1) Assemble matrix and right-hand sides, 2) interleave with dge_interleave, 3) A P = QR with dgeqrff_interleave_batch, 4) B := Q^T B with dormqr_interleave_batch, 5) solve R X = B with dtrsm_interleave_batch. 

Now I need to apply the row (?) permutation to get the true X. 

I have tried the following process

    // The solution X is now sitting in the B_p buffer.
    // The interleaved pivot vectors P**i are in jvpt_p.

    std::vector<double> col_B(m*nrhs); // regular column-major array

    for (int i = 0; i < ninter; i++) {

        // Deinterleave
        ARMPL_CHECK(
            armpl_dge_deinterleave(ninter, i,
                m, nrhs, col_B.data(), 1, m,
                B_p, istrd_B, jstrd_B));

        // Permute
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, nrhs, col_B.data(), m,
            0, m-1, jpvt_p, istrd_jpvt);

        // Print the result vector (first right-hand side only)
        for (int row = 0; row < m; row++) {
            std::cout << col_B[row] << '\n';
        }
    }


but it doesn't give me the expected result.

I would be very grateful if you could add an example of using the batch-interleave QR functions in future releases.


  • It looks like I have resolved the issue - jpvt is a simple permutation array.

    (It is not series of swap as used by the DGETRF/DGETRS routines, see LAPACK discussion, so DLASWP can not be used here). 

    The following row swapping code worked for me:

            std::vector<double> final_X(m*nrhs);
            for (int j = 0; j < m; j++) {
                // Retrieve k
                int k = jpvt[j * ninter + i];
    
                // Copy all RHS for this row
                for (int rhs = 0; rhs < nrhs; rhs++) {
                    final_X[k + rhs * m] = col_B[j + rhs * m];
                }
            }


    I would assume the column permutation array uses zero-based indexing in C/C++ and one-based indexing in Fortran?
    Maybe that would be worthy of a note in the batch-interleave API documentation. 

    ---

    If that's okay, I have a couple more questions:

    - What would be the optimal interleaving factor (ninter) on Nvidia Grace (4x 128b SVE2 per core)? 
    - What would be the optimal interleaving factor on Apple M2?

    I'm interested in medium size matrices (20 to 100). I guess I'll have to run some tests empirically to find out?