How to apply interleaved batch permutation?


I am trying to solve a batch of linear systems using QR factorization.

The steps I follow are: 1) assemble the matrices and right-hand sides, 2) interleave them with dge_interleave, 3) factorize A P = QR with dgeqrf_interleave_batch, 4) form B := Q^T B with dormqr_interleave_batch, 5) solve R X = B with dtrsm_interleave_batch.

Now I need to apply the row (?) permutation to get the true X. 
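
As far as I can tell, the algebra per matrix i is (with P_i the column permutation returned by the factorization):

    A_i P_i = Q_i R_i                         (step 3)
    A_i X_i = B_i   =>   R_i (P_i^T X_i) = Q_i^T B_i
    Y_i := P_i^T X_i                          (steps 4 and 5 produce Y_i)
    X_i  = P_i Y_i                            (still to be applied)

so the remaining step should permute the rows of the intermediate solution Y_i back into place.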

I have tried the following process

    // The solution X is now sitting in the B_p buffer.
    // The interleaved pivot vectors P_i are in jpvt_p.

    std::vector<double> col_B(m*nrhs); // regular column-major array

    for (int i = 0; i < ninter; i++) {

        // Deinterleave
        ARMPL_CHECK(
            armpl_dge_deinterleave(ninter, i,
                m, nrhs, col_B.data(), 1, m,
                B_p, istrd_B, jstrd_B));

        // Permute
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, nrhs, col_B.data(), m,
            0, m-1, jpvt_p, istrd_jpvt);

        // Print the result vector (first right-hand side only)
        for (int row = 0; row < m; row++) {
            std::cout << col_B[row] << '\n';
        }
    }


but it doesn't give me the expected result.

I would be very grateful if you could add an example of using the batch-interleave QR functions in a future release.


  • It looks like I have resolved the issue - jpvt is a simple permutation array.

    (It is not a series of swaps like the ipiv array used by the DGETRF/DGETRS routines, see the LAPACK discussion, so DLASWP cannot be used here; a small sketch of the difference is below.)
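
    To illustrate the difference, a minimal self-contained sketch (illustrative data only, not ArmPL code; indices are 0-based):

        #include <algorithm>
        #include <cstdio>
        #include <vector>

        int main() {
            const int m = 3;
            std::vector<int> p = {2, 0, 1};            // the same integer array ...
            std::vector<double> y = {10.0, 20.0, 30.0};

            // ... read as a DGETRF/DLASWP-style swap sequence: element j is
            // exchanged with element p[j], applied in order j = 0, 1, 2.
            std::vector<double> a = y;
            for (int j = 0; j < m; j++)
                std::swap(a[j], a[p[j]]);

            // ... read as a jpvt-style permutation array: p[j] = k means
            // position j of the permuted object corresponds to position k of
            // the original, so each element is scattered in a single pass.
            std::vector<double> x(m);
            for (int j = 0; j < m; j++)
                x[p[j]] = y[j];

            // The results differ, which is why DLASWP cannot consume jpvt.
            std::printf("as swaps      : %g %g %g\n", a[0], a[1], a[2]);
            std::printf("as permutation: %g %g %g\n", x[0], x[1], x[2]);
            return 0;
        }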

    The following row swapping code worked for me:

            std::vector<double> final_X(m*nrhs);
            for (int j = 0; j < m; j++) {
                // jpvt_p is interleaved: element j of matrix i's pivot vector
                int k = jpvt_p[j * ninter + i];

                // Row j of the intermediate solution belongs at row k = jpvt(j)
                // of the true X, for every right-hand side
                for (int rhs = 0; rhs < nrhs; rhs++) {
                    final_X[k + rhs * m] = col_B[j + rhs * m];
                }
            }


    I would assume the column permutation array uses zero-based indexing in C/C++ and one-based indexing in Fortran?
    Maybe that would be worthy of a note in the batch-interleave API documentation. 

    ---

    If that's okay, I have a couple more questions:

    - What would be the optimal interleaving factor (ninter) on NVIDIA Grace (4x 128-bit SVE2 pipelines per core)?
    - What would be the optimal interleaving factor on Apple M2?

    I'm interested in medium-sized matrices (orders 20 to 100). I guess I'll have to run some empirical tests to find out?
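
    For the empirical tests I have something like this in mind (a rough sketch; solve_batch() is a hypothetical stand-in for my assemble + interleave + factorize + solve pipeline):

        #include <chrono>
        #include <cstdio>

        // Hypothetical placeholder for one measurement: assemble nbatch
        // systems, interleave them with the given factor, factorize and solve.
        void solve_batch(int ninter, int nbatch) { /* ... */ }

        int main() {
            const int nbatch = 100000;              // systems per measurement
            for (int ninter : {2, 4, 8, 16, 32}) {  // candidate factors
                auto t0 = std::chrono::steady_clock::now();
                solve_batch(ninter, nbatch);
                auto t1 = std::chrono::steady_clock::now();
                double s = std::chrono::duration<double>(t1 - t0).count();
                std::printf("ninter=%2d  %.2f s  %.0f systems/s\n",
                            ninter, s, nbatch / s);
            }
            return 0;
        }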

  • Hi,

    Many thanks for your questions about our interleaved-batch functions in ArmPL.

    It seems that you've managed to fix all your immediate problems. We'll add a note to the documentation for `jpvt` to point out that it is not compatible with LAPACK's `LASWP`. Your assumption about 0-based indexing is correct as well. We do not provide Fortran interfaces for the interleaved-batch functions so there is no support for 1-based indexing. We will also document that, if these functions are called from Fortran (via the C interface), then the indexing is still 0-based.

    We'll also add an example in the future. We're about to release a new version of the library very soon, so the new example won't make the upcoming release, but we can get one added for the following release later in the year.

    We do recommend running some benchmarks to determine the best interleaving factor for your application on different target systems. Multiples of the vector width make sense, and for the two machines you mentioned this is 128 bits (i.e. 2 doubles). The blog I wrote when we released these functions shows some of the types of graphs you might want to produce to determine the best factor: Introducing interleave-batched linear algebra functions in Arm PL
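
    If it is useful, on an SVE machine such as Grace you can query the vector length at run time and derive candidate factors from it. A small sketch using the ACLE intrinsic svcntd() (note that Apple M2 has no SVE; its NEON vectors are fixed at 128 bits, i.e. 2 doubles):

        // Compile with SVE enabled, e.g. -march=armv8-a+sve
        #include <arm_sve.h>
        #include <cstdio>

        int main() {
            // svcntd() returns the number of 64-bit (double) lanes per SVE
            // vector: 2 on a 128-bit implementation such as Grace.
            unsigned long long vl = svcntd();
            std::printf("doubles per SVE vector: %llu\n", vl);
            std::printf("candidate ninter values:");
            for (unsigned long long n = vl; n <= 16; n *= 2)
                std::printf(" %llu", n);
            std::printf("\n");
            return 0;
        }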

    Remember that these functions are intended for use when solving many very small problems, so as you approach individual problem sizes of 100 you might find it better to use the LAPACK function `DGEQRF` at some point.

    Hope that helps, please let us know if you have any more feedback or need any help.

    Regards,

    Chris.

  • Thank you for the answers. 

    My problem requires assembling O(10^6) or more linear systems of the form AW = B. The weights W are packed into a batch of sparse matrices. The individual matrix sizes are generally between 30 and 70.

    I ran some experiments with different `ninter` values for `M = 36` and `NRHS = 9`.

    NINTER    Elapsed (seconds)    Rate (systems per second)
    2         10.9                 19269
    4         2.94                 71453
    8         2.66                 78998
    16        2.80                 75198


    These are single-thread results from an Apple M2 using Arm PL 25.07. 

    For comparison, using LU factorization (non-batched) with pivoting on the same problem, I observe ~70k systems per second. So the batched QR factorization is just a tiny bit faster here.

    (Attached image: batched assembly flamegraph)

    I did observe one problem when multi-threading is in use. I use my own outer OpenMP loop which fills the matrices, does the batch factorization (ninter = 8), and packs the results into a sparse matrix. When I link against libarmpl (serial) things work nicely, but when I link against libarmpl_pl I observe a slowdown; I speculate this is due to contention with the threading used internally in Arm PL.

    Is it possible to control the number of threads used by ArmPL locally, similar to the way Intel oneAPI MKL offers `mkl_set_num_threads_local`/`mkl_domain_set_num_threads`?

    Regards,
    Ivan


    Ignore my threading observation. I meant to write libarmpl_mp and not _pl. The outer multi-threading works just fine when I use clang/flang, and hence libomp (the same OpenMP runtime the performance libraries link against), rather than libgomp.

    Btw, I noticed in the flamegraph that larfg makes some slow calls to dlapy2 and dnrm2. I was able to shave off a few cycles by linking in my own versions (albeit less numerically stable).
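
    Something along these lines, i.e. interposing the Fortran entry points at link time; a minimal sketch for dlapy2, which drops the overflow/underflow scaling of the reference routine (that is the stability trade-off), assuming the usual lower-case, trailing-underscore symbol name:

        #include <cmath>

        // Naive replacement for LAPACK's DLAPY2(X, Y) = sqrt(x^2 + y^2).
        // No scaling, so it can overflow/underflow for extreme inputs.
        extern "C" double dlapy2_(const double* x, const double* y) {
            return std::sqrt((*x) * (*x) + (*y) * (*y));
        }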

    Best,
    Ivan

  • Hi Ivan,

    Thanks for sharing the details. It's good to hear you've seen _some_ speedup with double precision data on a machine with a vector width of 128 bits. We really expected these functions to show a benefit at larger vector widths, and since we added them to the library the LAPACK QR factorization has been optimized further, so that is now quite a high bar to beat.

    We don't currently expose a specific API for controlling threading within the library. However, we follow the OpenMP specification closely, so you should be able to control threading either by calling `omp_set_num_threads` before calling into the library, or by setting the OMP_NUM_THREADS environment variable to a comma-separated list to control nested threading.
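
    For example, something like the following keeps the library calls serial inside your own parallel loop when linked against libarmpl_mp (a sketch; solve_one_batch() is just a placeholder for your per-iteration assemble/factorize/solve work):

        #include <omp.h>

        // Placeholder for the per-iteration work: assemble, interleave,
        // factorize and solve one batch.
        void solve_one_batch(int ib) { (void)ib; }

        int main() {
            // With OMP_NUM_THREADS="8,1" the outer loop below gets 8 threads
            // and any parallel regions opened inside the library run with a
            // single thread each. The outer count can also be set in code:
            omp_set_num_threads(8);

            const int nbatches = 1000;
            #pragma omp parallel for schedule(dynamic)
            for (int ib = 0; ib < nbatches; ib++) {
                solve_one_batch(ib);
            }
            return 0;
        }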

    Regards,

    Chris.

    I didn't know about the nested-threading list for OMP_NUM_THREADS. That is very practical. Thanks!

    I will also test against the non-batched LAPACK QR; it is just a 4-line change plus some extra workspace allocation (see below). With a different LAPACK library (Apple Accelerate), LU with pivoting was twice as fast as QR, so I call it a success that the batched QR in Arm PL was ~15% faster than LU with pivoting.

    #if 0
        // LU factorization
        dgesv_(&nt, &nc, wrk, &ld, ipiv, coeffs, &ld, ierr);
    #else
        // QR factorization
        double *work = wrk + ld*nt;
        int lwork = ld*nt + nt*nt;
        char trans = 'N';
        dgels_(&trans, &nt, &nt, &nc, wrk, &ld, coeffs, &ld, work, &lwork, ierr);
    #endif
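
    As an aside, rather than sizing lwork by hand, the standard LAPACK workspace query could be used; a sketch reusing trans, nt, nc, wrk, ld, coeffs and ierr from the snippet above:

        // Workspace query: with lwork = -1, dgels only reports the optimal
        // workspace size in wkopt and does not touch the matrix data.
        double wkopt;
        int lwork = -1;
        dgels_(&trans, &nt, &nt, &nc, wrk, &ld, coeffs, &ld, &wkopt, &lwork, ierr);

        // Allocate and make the real call.
        lwork = (int)wkopt;
        std::vector<double> work(lwork);
        dgels_(&trans, &nt, &nt, &nc, wrk, &ld, coeffs, &ld, work.data(), &lwork, ierr);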


    Cheers,
    Ivan