Performing linear algebra operations on large batches of small matrices is a common use-case in High-Performance Computing. This is not something that was anticipated in the design of the BLAS and LAPACK interfaces, the two standards for basic and advanced linear algebra computations. BLAS and LAPACK solve a single problem per function call, and every call incurs the overhead of parameter checking. Repeatedly paying the latency cost of a function call that does very little work means we fail to use the hardware resources available in a modern, vectorized, out-of-order core: the pipelines of the floating-point SIMD units are never fully "warmed up".
It is possible to improve performance by defining new interfaces which allow multiple problems to be solved in a single function call. This can be done by re-arranging the layout of data in memory in a way that is more friendly to efficient use of SIMD instructions. This is the approach taken by Arm Performance Libraries' new "interleave-batch" functions, which are introduced in release 21.0. In this initial release, we have added 11 key computational functions for double-precision, real matrices. These include general and triangular matrix multiplication (armpl_dgemm_interleave_batch, armpl_dtrmm_interleave_batch) and triangular matrix solve (armpl_dtrsm_interleave_batch), as well as the most common one-sided matrix factorizations: Cholesky (armpl_dpotrf_interleave_batch), LU (armpl_dgetrftp_interleave_batch) and QR (armpl_dgeqrfrr_interleave_batch).
These functions are applicable when all problems in the batch have the same characteristics (that is, the same matrix dimensions and the same transpose options). As the results below show, significant performance improvements are possible for large batches of small problems. We first describe the new data format - how matrices need to be laid out in memory - before providing an example and some performance results.
The idea behind the interleave-batch approach is to use vector instructions as much as possible, loading, computing, and storing in such a way that each vector lane is processing a different matrix. Since contiguous loads and stores perform far better than having to gather and scatter data, we want our matrices to be interleaved. This is best illustrated with an example. Suppose we have two 3-by-4 matrices, A1 and A2, whose elements we denote A1(i,j) and A2(i,j).
Interleaving means we want corresponding elements of these two matrices to be placed contiguously in memory. If we use a row-major interleaved layout then the matrix elements are ordered in memory as follows:

A1(1,1), A2(1,1), A1(1,2), A2(1,2), A1(1,3), A2(1,3), A1(1,4), A2(1,4), A1(2,1), A2(2,1), ..., A1(3,4), A2(3,4)

A1(1,1) is the first element, followed by A2(1,1), then A1(1,2) and A2(1,2); when we get to the end of a row, the next row follows. In this example the interleaving factor, ninter, is 2 - we have interleaved 2 matrices. With 64-bit (double-precision) floating-point data, we can now use Neon SIMD instructions to load, process, and store elements of the two matrices trivially (assuming the algorithm we are using contains no data-dependent branches).
If we were to use a column-major interleaved layout for these two matrices, then the arrangement in memory would be:

A1(1,1), A2(1,1), A1(2,1), A2(2,1), A1(3,1), A2(3,1), A1(1,2), A2(1,2), ..., A1(3,4), A2(3,4)

We again pair up corresponding elements of the two matrices, but this time proceed down the columns of the matrices first, with one column following another. We can again use 128-bit SIMD instructions to load, process, and store the elements of the matrices.
The choice between a column-major or row-major layout is implied in the interleave-batch functions by the ability to provide strides in both dimensions. Parameter istrd_A gives the distance (number of elements in memory) between elements of the same logical matrix in the same column. Parameter jstrd_A gives the distance between elements of the same logical matrix in the same row. In the row-major layout above istrd_A is 8 (the distance between, e.g., A1(1,1) and A1(2,1) is 8 elements in memory); jstrd_A is 2 (the distance between, e.g., A1(1,1) and A1(1,2)). In the column-major layout istrd_A is 2 and jstrd_A is 6.
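To make the addressing concrete, here is a small sketch of how these two strides locate an element within an interleaved group; the helper function is our own illustration and is not part of the library interface:

/* Illustration only: flat memory index of element (i,j), 0-based, of the
 * p-th interleaved matrix, given the two strides. */
static inline long ib_index(long p, long i, long j, long istrd, long jstrd) {
    return i*istrd + j*jstrd + p;
}
/* For the two 3-by-4 matrices above:
 *  row-major layout    (istrd = 8, jstrd = 2): ib_index(0,0,0,8,2) == 0   A1(1,1)
 *                                              ib_index(1,0,0,8,2) == 1   A2(1,1)
 *                                              ib_index(0,0,1,8,2) == 2   A1(1,2)
 *  column-major layout (istrd = 2, jstrd = 6): ib_index(0,1,0,2,6) == 2   A1(2,1)
 *                                              ib_index(1,2,3,2,6) == 23  A2(3,4), the last of 24 elements */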
In addition to allowing arbitrary strides, the interleave-batch format also allows arbitrary interleaving factors, ninter. There is no requirement that this should be 2 for double-precision data with Neon cores, for example. The previous examples show an interleaving factor of 2 since there are 2 matrices. If we had hundreds or thousands of matrices then better performance may be achieved by using slightly larger values of ninter. This would give us the opportunity to use computational kernels within the library which unroll the inner ninter dimension, and thereby use more vector registers. After a certain limit, depending on whether you are running on a Neon or SVE-capable core, using a larger interleaving factor ninter will start to degrade performance. The importance of the parameter ninter is illustrated in our benchmark results.
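For intuition, the following naive sketch of a matrix multiplication over ninter interleaved matrices (our own illustration, not the library's kernel) shows why this layout vectorizes well: the innermost loop runs over the interleaving dimension with unit stride, so each SIMD lane processes a different matrix, and unrolling that loop lets a kernel keep more vector registers busy.

#include <armpl.h>

/* Naive sketch: C(i,j) += A(i,l)*B(l,j) for each of the ninter interleaved
 * matrices. C must be initialized by the caller. The inner loop over ii is
 * unit-stride, so it maps directly onto SIMD loads, FMAs and stores. */
void interleaved_gemm_sketch(armpl_int_t ninter, armpl_int_t m, armpl_int_t n,
                             armpl_int_t k,
                             const double *A, armpl_int_t istrd_A, armpl_int_t jstrd_A,
                             const double *B, armpl_int_t istrd_B, armpl_int_t jstrd_B,
                             double *C, armpl_int_t istrd_C, armpl_int_t jstrd_C) {
    for (armpl_int_t i = 0; i < m; i++) {
        for (armpl_int_t j = 0; j < n; j++) {
            for (armpl_int_t l = 0; l < k; l++) {
                for (armpl_int_t ii = 0; ii < ninter; ii++) {
                    C[i*istrd_C + j*jstrd_C + ii] +=
                        A[i*istrd_A + l*jstrd_A + ii]*B[l*istrd_B + j*jstrd_B + ii];
                }
            }
        }
    }
}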
What happens if we have thousands of matrices in our batch? We simply repeat the interleaved layout nbatch times, so that the total number of matrices becomes nbatch*ninter. Expanding our earlier row-major example to involve 6 matrices, if we maintain a value of ninter = 2 then we would have nbatch = 3 separate interleavings of pairs of matrices: three consecutive blocks in memory, the first interleaving A1 and A2, the second A3 and A4, and the third A5 and A6, each block laid out exactly as in the two-matrix example above.
Repeating the interleaved layout over nbatch batches forms an outer loop in our functions, which we have parallelized with OpenMP. With a good choice of ninter, we can achieve high performance for large numbers of problems. The stride between one batch and the next is another parameter in our interface: bstrd_A. For the previous example, if batches follow one after another with no padding in between, then bstrd_A would be 24.
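Conceptually, the batch dimension then composes with the interleaving as in the sketch below (again our own illustration, not the library's internal code): each of the nbatch interleaved groups is independent, so the loop over groups can be shared out across OpenMP threads, while within a group the data is addressed through istrd_A and jstrd_A as before.

#include <armpl.h>

/* Illustration only: visit every element of every matrix in the batch.
 * bstrd_A is the offset from one interleaved group to the next; with no
 * padding it is m*n*ninter (24 in the 3-by-4, ninter = 2 example above). */
void process_batches_sketch(armpl_int_t nbatch, armpl_int_t ninter,
                            armpl_int_t m, armpl_int_t n, double *A,
                            armpl_int_t bstrd_A, armpl_int_t istrd_A,
                            armpl_int_t jstrd_A) {
    #pragma omp parallel for
    for (armpl_int_t ib = 0; ib < nbatch; ib++) {
        double *A_group = &A[ib*bstrd_A];
        /* For example, scale every matrix in this group by 2 */
        for (armpl_int_t j = 0; j < n; j++)
            for (armpl_int_t i = 0; i < m; i++)
                for (armpl_int_t ii = 0; ii < ninter; ii++)
                    A_group[i*istrd_A + j*jstrd_A + ii] *= 2.0;
    }
}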
See the Arm Performance Libraries Reference Guide for complete documentation of all of the functions we provide. Here we give an overview of what is available. Note that at present, we only provide double precision, real functions.
First, we provide utility functions for packing and unpacking. Use armpl_dge_interleave to take a single, contiguous matrix (for example, as would be used in a BLAS/LAPACK call) and pack it into the interleave-batch format. To do the reverse, use armpl_dge_deinterleave.
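As a small, self-contained illustration (following the same call pattern as the full example later in this post, with error handling kept to a minimum), the sketch below packs two 3-by-4 column-major matrices into a column-major interleaved buffer and then unpacks the first of them again:

#include <armpl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    armpl_int_t ninter = 2, m = 3, n = 4;
    armpl_int_t lda = m;                 /* standard column-major input matrices */
    armpl_int_t istrd_A = ninter;        /* column-major interleaved layout */
    armpl_int_t jstrd_A = istrd_A*m;
    double *A_lpk = (double *)malloc(sizeof(double)*lda*n*ninter); /* 2 inputs    */
    double *A_out = (double *)malloc(sizeof(double)*lda*n);        /* unpacked    */
    double *A_ib = (double *)malloc(sizeof(double)*jstrd_A*n);     /* interleaved */
    for (armpl_int_t i = 0; i < lda*n*ninter; i++) A_lpk[i] = (double)i;

    /* Pack each input matrix into slot ii of the interleaved buffer */
    for (armpl_int_t ii = 0; ii < ninter; ii++) {
        if (armpl_dge_interleave(ninter, ii, m, n, &A_lpk[ii*lda*n], 1, lda,
                                 A_ib, istrd_A, jstrd_A) != ARMPL_STATUS_SUCCESS)
            return EXIT_FAILURE;
    }

    /* ... interleave-batch computations would go here ... */

    /* Unpack matrix 0 back into standard column-major storage */
    if (armpl_dge_deinterleave(ninter, 0, m, n, A_out, 1, lda,
                               A_ib, istrd_A, jstrd_A) != ARMPL_STATUS_SUCCESS)
        return EXIT_FAILURE;
    printf("A_out[0] = %f (expected 0.0)\n", A_out[0]);

    free(A_lpk); free(A_out); free(A_ib);
    return EXIT_SUCCESS;
}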
The BLAS-level functionality at 21.0 includes general and triangular matrix multiplication (armpl_dgemm_interleave_batch, armpl_dtrmm_interleave_batch) and triangular matrix solve (armpl_dtrsm_interleave_batch). The LAPACK-level functionality at 21.0 includes the Cholesky, LU and QR factorizations (armpl_dpotrf_interleave_batch, armpl_dgetrftp_interleave_batch, armpl_dgeqrfrr_interleave_batch), along with multiplication by the Q factor from a QR factorization (armpl_dormqr_interleave_batch), which is used in the example below.
Note that the QR and LU factorization routines differ slightly from the LAPACK equivalents DGEQRF and DGETRF, since the interleave-batch versions offer additional functionality. We use a column-pivoting strategy for QR factorization to reveal the ranks of the matrices. LU factorization with threshold pivoting allows users to specify a value, theta, which is used to determine whether row-pivoting is used at each step of the computation.
The following example workflow shows how to take matrices laid out in the usual BLAS/LAPACK column-major format and pack them into the interleave-batch format. The example then performs a QR factorization. Next, the original matrices are reconstructed by multiplication of the two factors. Finally, the example unpacks and performs a validation check. This is compared with calling the equivalent LAPACK functions.
/*
 * ARMPL version 21.0 Copyright ARM 2021
 */
#include <armpl.h>
#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MIN(a,b) (((a)<(b))?(a):(b))
#define MAX(a,b) (((a)>(b))?(a):(b))

/*
 * This function uses interleave-batch functions. Starting from a batch of
 * matrices laid out in the standard LAPACK column-major format, it packs them
 * into the interleaved format, performs QR factorization, extracts the R
 * factors into a separate array, then multiplies Q by R before unpacking back
 * into LAPACK format and optionally checks the result.
 * The function returns the time taken in seconds.
 */
double run_ib_version(armpl_int_t nbatch, armpl_int_t ninter, armpl_int_t m,
                      armpl_int_t n, armpl_int_t check_result) {
    armpl_int_t min_mn = MIN(m, n);
    armpl_int_t total_matrices = nbatch*ninter;

    /*
     * Interleaved-batch setup
     * Set strides without any padding
     * Use a "column-major" layout for the input matrices (i.e. istrd_A matches ninter)
     * A is m by n
     */
    armpl_int_t istrd_A = ninter;
    armpl_int_t jstrd_A = istrd_A*m;
    armpl_int_t bstrd_A = jstrd_A*n;
    armpl_int_t total_size_A = bstrd_A*nbatch;

    armpl_int_t istrd_R = ninter;
    armpl_int_t jstrd_R = istrd_R*m;
    armpl_int_t bstrd_R = jstrd_R*n;
    armpl_int_t total_size_R = bstrd_R*nbatch;

    /* jpvt is the array of column-pivots */
    armpl_int_t istrd_jpvt = ninter;
    armpl_int_t bstrd_jpvt = istrd_jpvt*n;
    armpl_int_t total_size_jpvt = bstrd_jpvt*nbatch;

    /* tau is the array of scalar factors of elementary reflectors */
    armpl_int_t istrd_tau = ninter;
    armpl_int_t bstrd_tau = istrd_tau*n;
    armpl_int_t total_size_tau = bstrd_tau*nbatch;

    armpl_status_t stat;

    /* Interleave-batch arrays */
    double *A_ib_p = (double *)malloc(sizeof(double)*total_size_A);
    double *R_ib_p = (double *)malloc(sizeof(double)*total_size_R);
    double *tau_p = (double *)malloc(sizeof(double)*total_size_tau);
    armpl_int_t *jpvt_p = (armpl_int_t *)malloc(sizeof(armpl_int_t)*total_size_jpvt);
    armpl_int_t *rank = (armpl_int_t *)malloc(sizeof(armpl_int_t)*nbatch*ninter);

    /* Pack from LAPACK arrays at the start, and unpack back at the end */
    armpl_int_t lda = m;
    double *A_lpk_p = (double *)malloc(sizeof(double)*lda*n*total_matrices);
    armpl_int_t ldqr = m;
    double *QR_lpk_p = (double *)malloc(sizeof(double)*ldqr*n*total_matrices);
    double *QR_pvt_lpk_p = (double *)malloc(sizeof(double)*ldqr*n*total_matrices);

    srand(4733);

    /* Populate the LAPACK format matrices with random values in [0,1) */
    for (armpl_int_t i = 0; i < m*n*total_matrices; i++) {
        A_lpk_p[i] = (double)rand()/RAND_MAX;
    }

    /* Now time packing the matrices, doing QR factorization followed by
       multiplying Q by R, and unpack back out */
    double t1_ib = omp_get_wtime();

    /* Pack matrices into interleaved-batch format, one by one */
    double t1_ib_pack = omp_get_wtime();
    armpl_int_t error = 0;
    #pragma omp parallel for
    for (armpl_int_t ib = 0; ib < nbatch; ib++) {
        double *A_ib = &A_ib_p[ib*bstrd_A];
        for (armpl_int_t ii = 0; ii < ninter; ii++) {
            double *A_lpk = &A_lpk_p[(ib*ninter + ii)*lda*n];
            armpl_int_t row_strd = 1; /* LAPACK matrices are always col-major */
            armpl_int_t col_strd = lda;
            stat = armpl_dge_interleave(ninter, ii, m, n, A_lpk, row_strd,
                                        col_strd, A_ib, istrd_A, jstrd_A);
            if (stat != ARMPL_STATUS_SUCCESS) {
                fprintf(stderr, "Error packing A matrices, exit.\n");
                #pragma omp atomic update
                error++;
            }
        }
    }
    if (error) {
        return EXIT_FAILURE;
    }
    double t2_ib_pack = omp_get_wtime();

    double t1_ib_qr = omp_get_wtime();
    /* Perform QR factorizations */
    stat = armpl_dgeqrfrr_interleave_batch(ninter, nbatch, m, n, A_ib_p, bstrd_A,
                                           istrd_A, jstrd_A, jpvt_p, bstrd_jpvt,
                                           istrd_jpvt, tau_p, bstrd_tau,
                                           istrd_tau, rank);
    if (stat != ARMPL_STATUS_SUCCESS) {
        fprintf(stderr, "Error performing interleave-batch QR factorization, exit.\n");
        return EXIT_FAILURE;
    }
    double t2_ib_qr = omp_get_wtime();

    /* Check whether all of the matrices in the batch are full rank */
    armpl_int_t full_rank = 1;
    for (armpl_int_t i = 0; i < nbatch*ninter; i++) {
        if (rank[i] != min_mn) {
            full_rank = 0;
        }
    }

    /* If full rank, extract Q and then multiply by triangular R with trmm */
    double t1_ib_mq = omp_get_wtime();
    if (full_rank) {
        if (check_result) {
            printf("Info: all matrices are full rank; reconstructing A.\n");
        }
        /* Make a copy of R */
        memcpy((void *)R_ib_p, (void *)A_ib_p, sizeof(double)*total_size_A);
        /* Zero lower-triangular part of R */
        #pragma omp parallel for
        for (armpl_int_t ib = 0; ib < nbatch; ib++) {
            for (armpl_int_t j = 0; j < n; j++) {
                for (armpl_int_t i = j + 1; i < m; i++) {
                    for (armpl_int_t ii = 0; ii < ninter; ii++) {
                        R_ib_p[ib*bstrd_A + j*jstrd_A + i*istrd_A + ii] = 0.0;
                    }
                }
            }
        }
        /* Multiply Q by R */
        char side = 'L';
        char transQ = 'N';
        stat = armpl_dormqr_interleave_batch(ninter, nbatch, side, transQ, m, n,
                                             rank, A_ib_p, bstrd_A, istrd_A,
                                             jstrd_A, tau_p, bstrd_tau, istrd_tau,
                                             R_ib_p, bstrd_R, istrd_R, jstrd_R);
        if (stat != ARMPL_STATUS_SUCCESS) {
            fprintf(stderr, "Error in multiplying by Q matrix, exit.\n");
            return EXIT_FAILURE;
        }
    } else {
        printf("Info: not all matrices full rank, not reconstructing A.\n");
    }
    double t2_ib_mq = omp_get_wtime();

    double t1_ib_unpack = omp_get_wtime();
    const double eps = nextafter(1.0, 2.0) - 1.0;
    armpl_int_t fail = 0;
    /* Unpack matrices from interleaved-batch format, one by one */
    #pragma omp parallel for
    for (armpl_int_t ib = 0; ib < nbatch; ib++) {
        double *QR_ib = &R_ib_p[ib*bstrd_A];
        for (armpl_int_t ii = 0; ii < ninter; ii++) {
            double *QR_lpk = &QR_lpk_p[(ib*ninter + ii)*ldqr*n];
            armpl_int_t row_strd = 1;
            armpl_int_t col_strd = ldqr;
            stat = armpl_dge_deinterleave(ninter, ii, m, n, QR_lpk, row_strd,
                                          col_strd, QR_ib, istrd_A, jstrd_A);
            if (stat != ARMPL_STATUS_SUCCESS) {
                fprintf(stderr, "Error unpacking output, exit.\n");
                #pragma omp atomic update
                error++;
            } else if (check_result) {
                /* Apply column permutes and then check the result */
                double *QR_pvt_lpk = &QR_pvt_lpk_p[(ib*ninter + ii)*ldqr*n];
                /* Apply pivots */
                for (armpl_int_t j = 0; j < n; j++) {
                    armpl_int_t col_from = j*m;
                    armpl_int_t col_to = jpvt_p[bstrd_jpvt*ib + j*istrd_jpvt + ii]*m;
                    memcpy((void *)&QR_pvt_lpk[col_to], (void *)&QR_lpk[col_from],
                           sizeof(double)*m);
                }
                /* Compute 1-norms of original matrix A and computed QR */
                double *A_lpk = &A_lpk_p[(ib*ninter + ii)*lda*n];
                double norm_a_minus_qr = 0.0;
                double norm_a = 0.0;
                for (armpl_int_t j = 0; j < n; j++) {
                    double sum_diff = 0.0;
                    double sum_col_a = 0.0;
                    for (armpl_int_t i = 0; i < m; i++) {
                        sum_diff += fabs(A_lpk[lda*j + i] - QR_pvt_lpk[ldqr*j + i]);
                        sum_col_a += fabs(A_lpk[lda*j + i]);
                    }
                    norm_a_minus_qr = MAX(sum_diff, norm_a_minus_qr);
                    norm_a = MAX(sum_col_a, norm_a);
                }
                /* Check that norm1(A-QR) <= eps*m*n*norm1(A) */
                double tol = 5.0*eps*m*n*norm_a;
                if (norm_a_minus_qr > tol) {
                    #pragma omp atomic update
                    fail++;
                }
            }
        }
    }
    if (error) {
        return EXIT_FAILURE;
    }
    double t2_ib_unpack = omp_get_wtime();
    double t2_ib = omp_get_wtime();

    if (check_result) {
        if (fail == 0) {
            printf("Interleave-batch result check passed: ");
            printf("norm1(A-QR) < eps*n*norm1(A) for all cases.\n");
        } else {
            printf("Interleave-batch result check failed:\n");
            printf("\tnumber of cases where norm1(A-QR) > eps*n*norm1(A) = ");
            printf("%d.\n", fail);
        }
    }
    if (check_result) {
        printf("Interleave-batch breakdown:\n");
        printf("\tpack: %f\n", t2_ib_pack - t1_ib_pack);
        printf("\tfactorize: %f\n", t2_ib_qr - t1_ib_qr);
        printf("\tmultiply: %f\n", t2_ib_mq - t1_ib_mq);
        printf("\tunpack: %f\n", t2_ib_unpack - t1_ib_unpack);
    }

    free(A_ib_p);
    free(R_ib_p);
    free(tau_p);
    free(jpvt_p);
    free(rank);
    free(A_lpk_p);
    free(QR_lpk_p);
    free(QR_pvt_lpk_p);

    return t2_ib - t1_ib;
}

/*
 * This function does the same as above with standard LAPACK calls,
 * and returns the time taken in seconds.
 */
double run_lpk_version(armpl_int_t nbatch, armpl_int_t ninter, armpl_int_t m,
                       armpl_int_t n, armpl_int_t check_result) {
    armpl_int_t total_matrices = nbatch*ninter;

    /* LAPACK setup */
    armpl_int_t lda = m;
    double *A_lpk_p = (double *)malloc(sizeof(double)*lda*n*total_matrices);
    double *R_lpk_p = (double *)malloc(sizeof(double)*lda*n*total_matrices);
    double *A_orig_lpk_p = (double *)malloc(sizeof(double)*lda*n*total_matrices);
    armpl_int_t ldqr = m;
    double *QR_lpk_p = (double *)malloc(sizeof(double)*ldqr*n*total_matrices);
    double *tau_lpk_p = (double *)malloc(sizeof(double)*n*total_matrices);

    /* Populate the LAPACK format matrices with random values in [0,1) */
    for (armpl_int_t i = 0; i < m*n*total_matrices; i++) {
        A_lpk_p[i] = (double)rand()/RAND_MAX;
        A_orig_lpk_p[i] = A_lpk_p[i];
    }

    /* Workspace query */
    armpl_int_t lwork = -1;
    double dlwork;
    armpl_int_t info;
    dgeqrf_(&m, &n, A_lpk_p, &lda, tau_lpk_p, &dlwork, &lwork, &info);
    lwork = (armpl_int_t)dlwork;
    if (lwork < 1) {
        fprintf(stderr, "Error: DGEQRF workspace query failed. Exiting.\n");
        return EXIT_FAILURE;
    }
    double *work_lpk_p = (double *)malloc(sizeof(double)*lwork*total_matrices);

    const double eps = nextafter(1.0, 2.0) - 1.0;
    armpl_int_t fail = 0;
    armpl_int_t error = 0;

    /* Time a parallelized loop over equivalent LAPACK calls */
    double t1_lpk = omp_get_wtime();
    #pragma omp parallel for
    for (armpl_int_t ib = 0; ib < nbatch; ib++) {
        for (armpl_int_t ii = 0; ii < ninter; ii++) {
            double *A_lpk = &A_lpk_p[(ib*ninter + ii)*lda*n];
            double *work_lpk = &work_lpk_p[lwork*omp_get_thread_num()];
            double *tau_lpk = &tau_lpk_p[n*omp_get_thread_num()];
            dgeqrf_(&m, &n, A_lpk, &lda, tau_lpk, work_lpk, &lwork, &info);
            if (info != 0) {
                fprintf(stderr, "Error performing LAPACK QR factorization, exit.\n");
                #pragma omp atomic update
                error++;
            }
            /* Extract R */
            double *R_lpk = &R_lpk_p[(ib*ninter + ii)*lda*n];
            memcpy((void *)R_lpk, (void *)A_lpk, sizeof(double)*lda*n);
            for (armpl_int_t j = 0; j < n; j++) {
                for (armpl_int_t i = j + 1; i < m; i++) {
                    R_lpk[j*lda + i] = 0.0;
                }
            }
            char side = 'L';
            char transQ = 'N';
            dormqr_(&side, &transQ, &m, &n, &n, A_lpk, &lda, tau_lpk, R_lpk,
                    &lda, work_lpk, &lwork, &info);
            if (info != 0) {
                fprintf(stderr, "Error multiplying LAPACK QR, exit.\n");
                #pragma omp atomic update
                error++;
            }
            if (check_result) {
                double *A_orig = &A_orig_lpk_p[(ib*ninter + ii)*lda*n];
                double norm_a_minus_qr = 0.0;
                double norm_a = 0.0;
                for (armpl_int_t j = 0; j < n; j++) {
                    double sum_diff = 0.0;
                    double sum_col_a = 0.0;
                    for (armpl_int_t i = 0; i < m; i++) {
                        sum_diff += fabs(A_orig[lda*j + i] - R_lpk[ldqr*j + i]);
                        sum_col_a += fabs(A_orig[lda*j + i]);
                    }
                    norm_a_minus_qr = MAX(sum_diff, norm_a_minus_qr);
                    norm_a = MAX(sum_col_a, norm_a);
                }
                /* Check that norm1(A-QR) <= eps*m*n*norm1(A) */
                double tol = 5.0*eps*m*n*norm_a;
                if (norm_a_minus_qr > tol) {
                    #pragma omp atomic update
                    fail++;
                }
            }
        }
    }
    if (error) {
        return EXIT_FAILURE;
    }
    double t2_lpk = omp_get_wtime();

    if (check_result) {
        if (fail == 0) {
            printf("LAPACK result check passed: ");
            printf("norm1(A-QR) < eps*n*norm1(A) for all cases.\n");
        } else {
            printf("LAPACK result check failed:\n");
            printf("\tnumber of cases where norm1(A-QR) > eps*n*norm1(A) = ");
            printf("%d.\n", fail);
        }
    }

    free(A_lpk_p);
    free(R_lpk_p);
    free(A_orig_lpk_p);
    free(QR_lpk_p);
    free(tau_lpk_p);
    free(work_lpk_p);

    return t2_lpk - t1_lpk;
}

int main(int argc, char **argv) {
    if (argc != 5) {
        fprintf(stderr, "Error: requires 4 command-line arguments, but %d were provided.\n", argc);
        fprintf(stderr, "Usage: ./ib_blog_qr_example <nbatch> <ninter> <nrows> <ncols>.\n");
        return EXIT_FAILURE;
    }
    armpl_int_t nbatch = atoi(argv[1]);
    armpl_int_t ninter = atoi(argv[2]);
    armpl_int_t m = atoi(argv[3]);
    armpl_int_t n = atoi(argv[4]);

    printf("Running example with nbatch = %d, ninter = %d, m = %d, n = %d\n",
           nbatch, ninter, m, n);
    printf("Total number of matrices: %d\n", nbatch*ninter);

    double t_ib;
    double t_lpk;

    /* Warm-up runs */
    for (armpl_int_t nw = 0; nw < 3; nw++) {
        t_ib = run_ib_version(nbatch, ninter, m, n, 0);
        t_lpk = run_lpk_version(nbatch, ninter, m, n, 0);
    }

    /* Reported runs */
    t_ib = run_ib_version(nbatch, ninter, m, n, 1);
    t_lpk = run_lpk_version(nbatch, ninter, m, n, 1);

    printf("Time for interleave-batch computation: %f\n", t_ib);
    printf("Time for LAPACK computation: %f\n", t_lpk);
    printf("Speedup for interleave-batch over LAPACK: %f\n", t_lpk/t_ib);

    return EXIT_SUCCESS;
}
The performance improvement gained in this example from using the interleave-batch functions on 32,768 matrices was measured using 4 OpenMP threads on an AWS Graviton2 instance. The results indicate that it is profitable to call multiple interleave-batch functions between packing and unpacking, so that the cost of rearranging the data is amortized.
In this and the results below we have run with three different interleave-batch parameter combinations to demonstrate how performance can vary; it is worthwhile experimenting with different settings for your use-case and hardware. The values used are:

nbatch = 1024, ninter = 32
nbatch = 2048, ninter = 16
nbatch = 4096, ninter = 8
The following results compare the performance of the interleave-batch functions with parallel loops over their BLAS/LAPACK counterparts. Here we show the performance in GFLOPS for these focused cases, discounting the cost of packing and unpacking. In each graph the speedup line is shown for the overall best interleave-batch parameter option, which is indicated in the legend. All results were generated using the Arm Compiler for Linux version of the library. So far we have added optimized Neon interleave-batch kernels to the library. We are currently working on SVE optimizations, which we expect to deliver an even greater benefit for large batches of small problems on systems with SVE vector widths greater than 128 bits.