How do I apply the permutation from an interleave-batch QR factorization?


I am trying to solve a batch of linear systems using QR factorization.

The steps I follow are:

1. Assemble the matrices and right-hand sides.
2. Interleave them with `dge_interleave`.
3. Factorize `A P = Q R` with `dgeqrf_interleave_batch`.
4. Form `B := Q^T B` with `dormqr_interleave_batch`.
5. Solve `R X = B` with `dtrsm_interleave_batch`.

Now I need to apply the permutation P to recover the true X. Since P permutes the columns of A, I believe it has to be applied to the rows of the solution.

I have tried the following:

    // The solution X is now sitting in the B_p buffer.
    // The interleaved pivot vectors P_i are in jpvt_p.

    std::vector<double> col_B(m * nrhs); // regular column-major array

    for (int i = 0; i < ninter; i++) {

        // Deinterleave matrix i of the batch into col_B
        ARMPL_CHECK(
            armpl_dge_deinterleave(ninter, i,
                m, nrhs, col_B.data(), 1, m,
                B_p, istrd_B, jstrd_B));

        // Permute the rows; note k1 and k2 are 1-based in LAPACKE_dlaswp
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, nrhs, col_B.data(), m,
            1, m, jpvt_p, istrd_jpvt);

        // Print the result vector (first right-hand side only)
        for (int row = 0; row < m; row++) {
            std::cout << col_B[row] << '\n';
        }
    }


but it doesn't give me the expected result.

I would be very grateful if an example of using the interleave-batch QR functions could be added in a future release.


Reply
  • Thank you for the answers. 

    My problem requires assembling on the order of 10^6 (and more) linear systems of the form AW = B. The weights W are packed into a batch of sparse matrices. The individual matrix sizes are generally between 30 and 70.

    I made some experiments with different `ninter` values for `M = 36`, and `NRHS = 9`.

    | NINTER | Elapsed (s) | Rate (systems/s) |
    |-------:|------------:|-----------------:|
    |      2 |       10.9  |            19269 |
    |      4 |        2.94 |            71453 |
    |      8 |        2.66 |            78998 |
    |     16 |        2.80 |            75198 |


    These are single-thread results from an Apple M2 using Arm PL 25.07. 

    For comparison, using LU factorization (non-batched) with pivoting on the same problem, I observe ~70k systems per second. So the batched QR factorization is just a tiny bit faster here.

    [Figure: flamegraph of the batched assembly]

    I did observe one problem when multi-threading is in use. I use my own outer OpenMP loop, which fills the matrices, does the batch factorization (`ninter = 8`), and packs the results into a sparse matrix. When I link against libarmpl (serial) things work nicely, but when I link against libarmpl_mp I observe a slowdown - I speculate this is due to contention between my threads and the threading used internally in Arm PL.

    Is it possible to control the number of threads used by ArmPL locally, similar to the way Intel oneAPI MKL offers `mkl_set_num_threads_local`/`mkl_domain_set_num_threads`?

    Regards,
    Ivan

