Questions on Performance Testing ARMPL Functions (Sequential vs OpenMP vs Naive Implementations)

Hello everyone,

I have been experimenting with ARM Performance Libraries (ARMPL) functions and comparing them against my own naive implementations. I would really appreciate your insights, as I have questions about both the results and my methodology.

For the setup:

  • I included the directory from the official package:
    /opt/Arm_PL/armpl-24.10.1_Ubuntu-20.04_gcc/include (from arm-performance-libraries_25.07_acfl2410.tar).

  • For linking, I used:

    • Sequential: libarmpl_lp64.so

    • OpenMP: libomp.so and libarmpl_lp64_mp.so

For timing, I used clock_gettime(CLOCK_MONOTONIC, &start) and clock_gettime(CLOCK_MONOTONIC, &end) from <time.h>.



1. Vector–Matrix Multiplication (Complex Double)

  • Problem: multiplying a vector A[1×16] by a matrix B[16×16], with the call repeated N times in a loop.

  • ARMPL function: cblas_zgemv

  • Comparison: against my naive vector–matrix multiplication implementation.
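For reference, a minimal sketch of what such a naive complex vector–matrix product might look like (the function name and the row-major layout are assumptions; my actual baseline may differ in details, but it is the same O(N²) loop with no blocking or vectorization):

```c
#include <complex.h>

/* Naive y = x * B for a 1xN row vector x and an NxN matrix B,
 * both stored row-major. One multiply-add per (k, j) pair. */
static void naive_zgemv_row(int n, const double complex *x,
                            const double complex *B, double complex *y) {
    for (int j = 0; j < n; j++) {
        double complex acc = 0.0;
        for (int k = 0; k < n; k++)
            acc += x[k] * B[k * n + j];
        y[j] = acc;
    }
}
```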

1.a) Sequential results:

  • For small N, the naive implementation was faster.

    • N=1 → ARMPL ≈ 0.144 ms vs naive ≈ 0.009 ms.

  • As N increased, ARMPL became much faster.

    • N=100 → ARMPL = 0.52 ms vs naive = 0.798 ms.

    • N=1000 → ARMPL = 3.57 ms vs naive = 8.01 ms.

1.b) OpenMP results:

  • I expected parallelization to improve performance, but it did not.

    • In fact, results were similar or slightly slower.

    • N=1 → 0.185 ms

    • N=100 → 0.53 ms

    • N=1000 → 3.73 ms

2. Matrix–Matrix Multiplication (Double Real)

  • Problem: multiplying A[24×32] by B[32×20], with the call repeated N times in a loop.

  • ARMPL function: cblas_dgemm

  • Comparison: against my naive matrix–matrix multiplication implementation.
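Again for reference, a sketch of the kind of triple-loop baseline I mean (hypothetical name, row-major storage assumed; no tiling, transposition, or SIMD, which is largely why ARMPL pulls ahead as N grows):

```c
/* Naive C = A * B for A[m x k], B[k x n], C[m x n], all row-major.
 * Classic i-j-p ordering: m*n*k multiply-adds, poor cache reuse on B. */
static void naive_dgemm(int m, int k, int n,
                        const double *A, const double *B, double *C) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}
```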

2.a) Sequential results:

  • Again, naive was slightly faster at N=1.

    • N=1 → ARMPL ≈ 0.24 ms vs naive ≈ 0.176 ms.

  • As N increased, ARMPL was significantly faster.

    • N=100 → ARMPL = 2.15 ms vs naive = 16.15 ms.

    • N=1000 → ARMPL = 18.50 ms vs naive = 161.20 ms.

2.b) OpenMP results:

  • I expected OpenMP to give much better speedup, but results were only slightly improved.

    • N=1 → 0.31 ms

    • N=100 → 1.85 ms

    • N=1000 → 15.5 ms


My Questions

  1. Are these results expected when comparing ARMPL sequential vs OpenMP builds?

    • Should I expect only small improvements (or even overhead) for these matrix sizes?

  2. Why do naive implementations sometimes appear faster at very small N, but ARMPL becomes much faster at larger N?

  3. Is my testing methodology valid?

    • I simply placed the function calls in a loop of size N and measured total runtime of that loop.

    • Is this a realistic way to benchmark, or should I use a different approach (e.g., larger problem sizes, warm-up runs, etc.)?

  4. In general, when is it advantageous to use ARMPL over naive implementations, and when does OpenMP provide noticeable benefits?


Any clarification or guidance on whether I am testing correctly and what to expect from ARMPL would be greatly appreciated. Thank you in advance!
