Low performance using ARM inspector-executor pattern for sparse daxpyi

Hi everyone,

I'm currently implementing a sparse daxpyi operation on an aarch64 platform, using the inspector-executor pattern.

I compared three implementations:

  1.  A simple scalar for-loop version (baseline)
  2.  My inspector-executor version on the aarch64 platform
  3.  cblas_daxpyi from Intel MKL on an x86_64 platform
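
For concreteness, sparse daxpyi computes y[indx[i]] += a * x[i] for each of the nz stored entries of x. Here is a minimal sketch of a scalar baseline of this kind, assuming zero-based indices and the usual compressed (values plus index array) layout. The function name and signature are illustrative only; this is not the code that was benchmarked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar baseline: y[indx[i]] += a * x[i] for i in [0, nz).
 * x holds the nz stored values of the sparse vector, indx holds their
 * zero-based positions in the dense vector y. */
void daxpyi_scalar(size_t nz, double a, const double *x,
                   const int *indx, double *y)
{
    /* Pointers may only be null when there is nothing to process. */
    assert(nz == 0 || (x && indx && y));

    for (size_t i = 0; i < nz; ++i) {
        y[indx[i]] += a * x[i];
    }
}
```

The indirect access through indx (a gather from y followed by a scatter back) is the part that any optimized version has to handle.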

Results:

  1. The aarch64 version achieves only ~0.5x the speed of the baseline scalar loop (i.e., it is about 2x slower).
  2. The MKL version on the x86_64 platform achieves a ~1.2x speed-up over the same scalar baseline.

I'm wondering:

  1. Am I misapplying the inspector-executor model?
  2. Or is there any architectural behavior (e.g., cache, memory) on aarch64 that might explain this?
  3. Are there any best practices or performance tips for this kind of sparse vector update on the aarch64 platform?

Thanks in advance for any insights or suggestions! If helpful, I can share more details.

Below is the core of my current ARM implementation for reference:
