Low performance using the Arm Performance Libraries inspector-executor pattern for sparse daxpyi

Hi everyone,

I'm currently implementing a sparse daxpyi operation (y[indx[i]] += alpha * x[i]) on an aarch64 platform, using the inspector-executor pattern from Arm Performance Libraries (ArmPL).

I compared three implementations:

  1.  A simple scalar for loop version (baseline)
  2.  My inspector-executor version on aarch64 platform
  3.  cblas_daxpyi from Intel MKL on x86_64 platform

Results:

  1. The aarch64 (ArmPL) version achieves only ~0.5x speed-up versus the baseline scalar loop, i.e., it is roughly 2x slower than the baseline.
  2. The MKL version on x86_64 achieves ~1.2x speed-up over the same scalar baseline.

I'm wondering:

  1. Am I misapplying the inspector-executor model (e.g., paying the descriptor-creation cost on every call)?
  2. Is there architectural behavior (e.g., cache, memory) on aarch64 that might explain this?
  3. Are there best practices or performance tips for this kind of sparse vector update on aarch64?

Thanks in advance for any insights or suggestions! If helpful, I can share more details.

Below is the core of my current ARM implementation for reference:

#include <algorithm>  // std::max_element
#include <armpl.h>    // armpl_spvec_* API

void arm_daxpyi2(const int n, const double alpha, const double *x, const int *indx, double *y)
{
    // Early return if alpha is zero (no operation needed)
    if (alpha == 0.0)
    {
        return;
    }

    // The sparse-vector dimension must cover the largest index used
    const int full_size = *std::max_element(indx, indx + n) + 1;

    // Inspector step: create the sparse vector descriptor for x
    armpl_spvec_t spvec_x;
    armpl_status_t status = armpl_spvec_create_d(
        &spvec_x,  // Pointer to sparse vector object to create
        0,         // Index base (0 for C-style indexing)
        full_size, // Dimension of the sparse vector
        n,         // Number of non-zero elements
        indx,      // Array of indices
        x,         // Array of non-zero values
        0          // Flags (currently unused)
    );
    if (status != ARMPL_STATUS_SUCCESS)
    {
        return;
    }
    // ... (rest of the function truncated in the original post)
