Hello everyone,
I have been experimenting with ARM Performance Libraries (ARMPL) functions and comparing them against my own naive implementations. I would really appreciate your insights, as I have some questions about the results and my methodology.
For the setup:
I used the include directory from the official package: /opt/Arm_PL/armpl-24.10.1_Ubuntu-20.04_gcc/include (extracted from arm-performance-libraries_25.07_acfl2410.tar).
For linking, I used:
Sequential: libarmpl_lp64.so
OpenMP: libomp.so and libarmpl_lp64_mp.so
For timing, I used clock_gettime(CLOCK_MONOTONIC, &start) and clock_gettime(CLOCK_MONOTONIC, &end) from <time.h>.
1. Vector–Matrix Multiplication (Complex Double)
Problem: multiplying a 1×16 vector A by a 16×16 matrix B, repeated inside a loop that runs N times.
ARMPL function: cblas_zgemv
Comparison: against my naive vector–matrix multiplication implementation.
1.a) Sequential results:
For small N, the naive implementation was faster.
N=1 → ARMPL ≈ 0.144 ms vs naive ≈ 0.009 ms.
As N increased, ARMPL became much faster.
N=100 → ARMPL = 0.52 ms vs naive = 0.798 ms.
N=1000 → ARMPL = 3.57 ms vs naive = 8.01 ms.
1.b) OpenMP results:
I expected parallelization to improve performance, but it did not.
In fact, results were similar or slightly slower.
N=1 → 0.185 ms
N=100 → 0.53 ms
2. Matrix–Matrix Multiplication (Double Real)
Problem: multiplying a 24×32 matrix A by a 32×20 matrix B, repeated inside a loop that runs N times.
ARMPL function: cblas_dgemm
Comparison: against my naive matrix–matrix multiplication implementation.
2.a) Sequential results:
Again, naive was slightly faster at N=1.
N=1 → ARMPL ≈ 0.24 ms vs naive ≈ 0.176 ms.
As N increased, ARMPL was significantly faster.
N=100 → ARMPL = 2.15 ms vs naive = 16.15 ms.
N=1000 → ARMPL = 18.50 ms vs naive = 161.20 ms.
2.b) OpenMP results:
I expected OpenMP to give much better speedup, but results were only slightly improved.
N=1 → 0.31 ms
N=100 → 1.85 ms
N=1000 → 15.5 ms
My Questions
Are these results expected when comparing ARMPL sequential vs OpenMP builds?
Should I expect only small improvements (or even overhead) for these matrix sizes?
Why do naive implementations sometimes appear faster at very small N, but ARMPL becomes much faster at larger N?
Is my testing methodology valid?
I simply placed the function calls in a loop of size N and measured total runtime of that loop.
Is this a realistic way to benchmark, or should I use a different approach (e.g., larger problem sizes, warm-up runs, etc.)?
In general, when is it advantageous to use ARMPL over naive implementations, and when does OpenMP provide noticeable benefits?
Any clarification or guidance on whether I am testing correctly and what to expect from ARMPL would be greatly appreciated. Thank you in advance!