How to apply interleaved batch permutation?


I am trying to solve a batch of linear systems using QR factorization.

The steps I follow are:

1) Assemble the matrices and right-hand sides.
2) Interleave with dge_interleave.
3) Factorize A P = Q R with dgeqrf_interleave_batch.
4) Apply B := Q^T B with dormqr_interleave_batch.
5) Solve R X = B with dtrsm_interleave_batch.

Now I need to apply the pivoting permutation to recover the true X; as far as I can tell it acts on the rows of the triangular-solve result.
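
For reference, the algebra I am relying on, with P the column permutation from the pivoting:

    A P = Q R                          (step 3)
    B := Q^T B                         (step 4)
    R Y = B   =>   Y = R^{-1} Q^T B    (step 5)
    X = P Y                            (the step I am missing)

so the rows of the triangular-solve result Y still have to be rearranged according to P.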

I have tried the following:

    // The solution X is now sitting in the B_p buffer.
    // The interleaved pivot vectors P_i are in jpvt_p.

    std::vector<double> col_B(m*nrhs); // regular column-major array

    for (int i = 0; i < ninter; i++) {

        // Deinterleave
        ARMPL_CHECK(
            armpl_dge_deinterleave(ninter, i,
                m, nrhs, col_B.data(), 1, m,
                B_p, istrd_B, jstrd_B));

        // Permute the rows back (k1/k2 are 1-based in LAPACKE, and the
        // pivots of batch member i start at offset i in the interleaved layout)
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, nrhs, col_B.data(), m,
            1, m, jpvt_p + i, istrd_jpvt);

        // Print the result vector (first right-hand side only)
        for (int row = 0; row < m; row++) {
            std::cout << col_B[row] << '\n';
        }
    }


but it doesn't give me the expected result.
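
My current suspicion is a pivot-semantics mismatch: LAPACKE_dlaswp applies a sequence of row interchanges (getrf-style ipiv), whereas a pivoted-QR jpvt is conventionally a permutation vector (entry k holds, 1-based, the original column of A that was moved to position k, as in dgeqp3). Assuming that convention also holds for the interleaved jpvt_p, a manual scatter would look something like this (untested sketch):

    // Assumption (dgeqp3 convention): jpvt[k] is 1-based and names the
    // original column of A moved to position k by the pivoting, so row k
    // of the triangular-solve result belongs at row jpvt[k]-1 of the true X.
    std::vector<double> X(m*nrhs);
    for (int k = 0; k < m; k++) {
        // pivot k of batch member i in the interleaved layout
        const armpl_int_t piv = jpvt_p[i + k*istrd_jpvt];
        for (int j = 0; j < nrhs; j++)
            X[(piv - 1) + j*m] = col_B[k + j*m];
    }

Is that the intended way to undo the pivoting?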

I would be very grateful if you could add an example of using the batch-interleave QR functions in a future release.


  • Thank you for the answers. 

    My problem requires assembling O(10^6) or more linear systems of the form A W = B, where the weights W are packed into a batch of sparse matrices. The individual matrix sizes are generally between 30 and 70.

    I ran some experiments with different `ninter` values for `M = 36` and `NRHS = 9`:

    NINTER   Elapsed (seconds)   Rate (systems/second)
       2          10.9                  19269
       4           2.94                 71453
       8           2.66                 78998
      16           2.80                 75198


    These are single-thread results from an Apple M2 using Arm PL 25.07. 

    For comparison, using LU factorization (non-batched) with pivoting on the same problem, I observe ~70k systems per second. So the batched QR factorization is just a tiny bit faster here.

    [Image: flamegraph of the batched assembly]

    I did observe one problem when multi-threading is in use. I use my own outer OpenMP loop which fills the matrices, does the batch factorization (ninter = 8), and packs the results into a sparse matrix. When I link against libarmpl (serial) things work nicely, but when I link against libarmpl_pl I observe a slowdown - I speculate this is due to contention with the threading used internally in Arm PL.

    Is it possible to control the number of threads used by Arm PL locally, similar to the way Intel oneAPI MKL offers `mkl_set_num_threads_local`/`mkl_domain_set_num_threads`?

    Regards,
    Ivan


  • Ignore my threading observation. I meant to write libarmpl_mp and not _pl. The outer multi-threading works just fine when I use clang/flang and hence libomp (the same runtime the performance libraries link against), rather than libgomp.

    Btw, I noticed in the flamegraph that larfg makes some slow calls to dlapy2 and dnrm2. I was able to shave off a few cycles by linking my own versions (albeit less numerically stable); a sketch of what I mean is below.
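
    For illustration, the sort of shortcut I linked in (this drops the scaling that makes the reference dlapy2 overflow-safe, which is acceptable for my well-scaled data; the dnrm2 case is analogous):

        #include <cmath>

        // Naive stand-in for LAPACK's dlapy2(x, y) = sqrt(x^2 + y^2).
        // No overflow/underflow guards, unlike the reference routine.
        extern "C" double dlapy2_(const double *x, const double *y) {
            return std::sqrt((*x) * (*x) + (*y) * (*y));
        }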

    Best,
    Ivan

  • Hi Ivan,

    Thanks for sharing the details. It's good to hear you've seen _some_ speedup with double-precision data on a machine with a 128-bit vector width. We really expected these functions to show a benefit at larger vector widths, and since we added them to the library the LAPACK QR factorization has been optimized further, so that is now quite a high bar to beat.

    We don't currently expose a specific API for controlling threading within the library. However, we stick closely to the OpenMP spec, so you should be able to control threading either by calling `omp_set_num_threads` before calling into the library, or by setting your OMP_NUM_THREADS environment variable to a comma-separated list to control nested threading.
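
    For instance, a minimal sketch of that pattern (process_batch is a placeholder for your fill + factorize + solve sequence):

        #include <omp.h>

        void process_batch(int b);        // placeholder for your Arm PL calls

        void run(int nbatches) {
            omp_set_max_active_levels(2); // allow one level of nesting
            #pragma omp parallel
            {
                // Parallel regions started below this thread (e.g. inside
                // Arm PL) will each use a single thread.
                omp_set_num_threads(1);
                #pragma omp for schedule(dynamic)
                for (int b = 0; b < nbatches; b++)
                    process_batch(b);
            }
        }

    Equivalently, setting OMP_NUM_THREADS=8,1 gives the outer level 8 threads and any nested level 1.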

    Regards,

    Chris.

  • I didn't know about the nesting feature of omp_set_num_threads. That is very practical. Thanks!

    I will also test against the non-batched LAPACK QR; it is just a 4-line change plus some extra work-space allocation (see below). With a different LAPACK library (Apple Accelerate), LU with pivoting was twice as fast as QR, so I call it a success that the batched QR in Arm PL was ~15% faster than LU with pivoting.

    #if 0
        // LU factorization with partial pivoting (solves in place)
        dgesv_(&nt, &nc, wrk, &ld, ipiv, coeffs, &ld, ierr);
    #else
        // QR factorization: dgels solves the square system via QR
        double *work = wrk + ld*nt;   // workspace placed after the matrix
        int lwork = ld*nt + nt*nt;
        char trans = 'N';
        dgels_(&trans, &nt, &nt, &nc, wrk, &ld, coeffs, &ld, work, &lwork, ierr);
    #endif
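
    (Aside: the workspace size can also be obtained portably with the standard LAPACK query, lwork = -1:)

        // Workspace query: dgels only writes the optimal lwork into wsize.
        int lquery = -1;
        double wsize;
        dgels_(&trans, &nt, &nt, &nc, wrk, &ld, coeffs, &ld, &wsize, &lquery, ierr);
        const int lwork_opt = static_cast<int>(wsize);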


    Cheers,
    Ivan