Low performance using the Arm PL inspector-executor pattern for sparse daxpyi

Hi everyone,

I'm currently implementing a sparse daxpyi operation on an aarch64 platform, using the inspector-executor interface of the Arm Performance Libraries (Arm PL).

I compared three implementations:

  1.  A simple scalar for loop version (baseline)
  2.  My Arm PL inspector-executor version on an aarch64 platform
  3.  cblas_daxpyi from Intel MKL on x86_64 platform

Results:

  1. The aarch64 version achieves only ~0.5x the speed of the baseline scalar loop (i.e., it is slower than the baseline);
  2. The MKL version on the x86_64 platform achieves a ~1.2x speed-up over the same scalar baseline.

I'm wondering:

  1. Am I misapplying the inspector-executor model?
  2. Or is there any architectural behavior (e.g., cache, memory) on aarch64 that might explain this?
  3. Do you have any best practices or performance tips for this kind of sparse vector update on aarch64 platforms?

Thanks in advance for any insights or suggestions! If helpful, I can share more details.

  • Below is the core of my current Arm PL implementation for reference:

#include <algorithm> // std::max_element

#include "armpl.h"

void arm_daxpyi2(const int n, const double alpha, const double *x, const int *indx, double *y)
{
    // Early return if alpha is zero (no operation needed)
    if (alpha == 0.0)
    {
        return;
    }

    // The sparse vector dimension must cover the largest index referenced
    const int full_size = *std::max_element(indx, indx + n) + 1;

    // Create sparse vector descriptor for x
    armpl_spvec_t spvec_x;
    armpl_status_t status = armpl_spvec_create_d(&spvec_x,  // Pointer to sparse vector object to create
                                                 0,         // Index base (0 for C-style indexing)
                                                 full_size, // Dimension of the sparse vector
                                                 n,         // Number of non-zero elements
                                                 indx,      // Array of indices
                                                 x,         // Array of non-zero values
                                                 0          // Flags (currently unused)
    );

    if (status != ARMPL_STATUS_SUCCESS)
    {
        // Handle error
        return;
    }

    // Execute the sparse vector operation: y = alpha*x + beta*y
    // Use beta = 1.0 to keep the existing values in y
    const double beta = 1.0;
    status = armpl_spaxpby_exec_d(alpha,   // alpha coefficient
                                  spvec_x, // sparse vector x
                                  beta,    // beta coefficient
                                  y        // dense vector y (input/output)
    );

    if (status != ARMPL_STATUS_SUCCESS)
    {
        // Handle error
        armpl_spvec_destroy(spvec_x);
        return;
    }

    // Clean up
    armpl_spvec_destroy(spvec_x);
}

  • Hi, thanks for your patience :)

    With regards to your first question, the answer is "maybe"... 

    If you are timing the whole function you shared (which includes creating and destroying the sparse vector as well as executing the actual operation), then yes, that will be slower than simply using your own loop. I observe the same performance ratio you reported in that case, when comparing with a naive loop implementation.

    However, if you just time the `armpl_spaxpby_exec_d` call over a number of iterations, then you should see better performance than the naive loop implementation: I see roughly 2x better performance from Arm PL in this case. For example:

        // Start timing
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
    
        for (int i = 0; i < niters; ++i)
        {
            status = armpl_spaxpby_exec_d(alpha, spvec_x, beta, y);
    
            if (status != ARMPL_STATUS_SUCCESS)
            {
                fprintf(stderr, "Error: armpl_spaxpby_exec_d failed at iteration %d with status %d\n", i, status);
                armpl_spvec_destroy(spvec_x);
                return;
            }
    
            // Re-initialize y so repeated updates don't distort the comparison
            // (note this re-initialization is included in the measured time)
            for (int j = 0; j < n; j++) {
                y[j] = (double) rand() / RAND_MAX;
            }
        }
    
        // End timing
        clock_gettime(CLOCK_MONOTONIC, &end);
    
        // Compute elapsed time in seconds
        double elapsed_sec = (end.tv_sec - start.tv_sec) + 1e-9 * (end.tv_nsec - start.tv_nsec);
    
        printf("Executed armpl_spaxpby_exec_d %d times\n", niters);
        printf("Total time: %.6f seconds\n", elapsed_sec);

    My answer was "maybe" to your question about misapplying the interface because it depends on what you're trying to do.

    If you have a scenario where the sparsity pattern of the sparse vector remains the same over repeated applications of spaxpby, then you should only create the `armpl_spvec_t` object once and reuse it in a loop over the `armpl_spaxpby_exec_d` call. The library should give good performance then.

    However, if your sparse vector changes each time, then as you've seen you're probably better off avoiding the overhead of creating & destroying the `armpl_spvec_t` object each time, and using your own loop instead. 

    With regards to question 2: since this is a memory-bound operation, there aren't really any caching optimizations.

    For question 3, if you have the second scenario (a sparse vector that changes each time) and need your own loop implementation, then ensuring the loop is using Neon or SVE SIMD instructions can make a difference. Compilers can generate these for you with the right set of flags and possibly requiring some compiler directives such as `#pragma omp simd`.

    I hope that helps! Please let me know if you need any further help.

    Regards,

    Chris.
