Low performance using the Arm Performance Libraries inspector-executor pattern for sparse daxpyi

Hi everyone,

I'm currently implementing a sparse daxpyi operation (y[indx[i]] += alpha * x[i]) on an aarch64 platform, using the inspector-executor pattern from Arm Performance Libraries (ArmPL).

I compared three implementations:

  1.  A simple scalar for loop version (baseline)
  2.  My inspector-executor version on aarch64 platform
  3.  cblas_daxpyi from Intel MKL on x86_64 platform

Results:

  1. The aarch64 (ArmPL) version achieves only ~0.5x speed-up versus the baseline scalar loop, i.e., it is roughly 2x slower than the baseline.
  2. The MKL version on x86_64 achieves ~1.2x speed-up over the same scalar baseline.

I'm wondering:

  1. Am I misapplying the inspector-executor model (e.g., paying the descriptor-creation cost on every call)?
  2. Is there architectural behavior (e.g., cache, memory) on aarch64 that might explain this?
  3. Are there best practices or performance tips for this kind of sparse vector update on aarch64?

Thanks in advance for any insights or suggestions! If helpful, I can share more details.

Below is the core of my current ARM implementation for reference:

#include <algorithm>  // std::max_element
#include <armpl.h>    // armpl_spvec_* API

void arm_daxpyi2(const int n, const double alpha, const double *x, const int *indx, double *y)
{
    // Early return if alpha is zero (no operation needed)
    if (alpha == 0.0)
    {
        return;
    }

    // The sparse-vector dimension must cover the largest index used
    const int full_size = *std::max_element(indx, indx + n) + 1;

    // Inspector step: create the sparse vector descriptor for x
    armpl_spvec_t spvec_x;
    armpl_status_t status = armpl_spvec_create_d(
        &spvec_x,  // Pointer to sparse vector object to create
        0,         // Index base (0 for C-style indexing)
        full_size, // Dimension of the sparse vector
        n,         // Number of non-zero elements
        indx,      // Array of indices
        x,         // Array of non-zero values
        0          // Flags (currently unused)
    );
    if (status != ARMPL_STATUS_SUCCESS)
    {
        return;
    }
    // ... (rest of the function truncated in the original post)
