Hello everyone,
I have been experimenting with ARM Performance Libraries (ARMPL) functions and comparing them against my own naive implementations. I would really appreciate your insights, as I have some questions about the results and my methodology.
For the setup:
I added the include directory from the official package: /opt/Arm_PL/armpl-24.10.1_Ubuntu-20.04_gcc/include (from arm-performance-libraries_25.07_acfl2410.tar).
For linking, I used:
Sequential: libarmpl_lp64.so
OpenMP: libomp.so and libarmpl_lp64_mp.so
For timing, I used clock_gettime(CLOCK_MONOTONIC, &start) and clock_gettime(CLOCK_MONOTONIC, &end) from <time.h>.
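In outline, the measurement looks like this (a simplified sketch, not my exact code; kernel is a placeholder for whichever ARMPL or naive call is under test):

```c
#include <time.h>

/* Sketch of the timing harness: measure the whole n_reps loop with
 * CLOCK_MONOTONIC and return the elapsed time in milliseconds.
 * "kernel" stands for the ARMPL or naive call under test. */
static double time_loop_ms(void (*kernel)(void), int n_reps)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < n_reps; i++)
        kernel();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) * 1e3
         + (end.tv_nsec - start.tv_nsec) / 1e6;
}
```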
1. Vector–Matrix Multiplication (Complex Double)
Problem: multiplying a vector A[1×16] by a matrix B[16×16] inside a loop that runs N times.
ARMPL function: cblas_zgemv
Comparison: against my naive vector–matrix multiplication implementation.
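For concreteness, the ARMPL side looks roughly like this (a simplified sketch of my setup, not the exact benchmark code; I treat the 1×16 row vector as x in y = B^T x via CblasTrans):

```c
#include <complex.h>
#include <armpl.h>   /* CBLAS prototypes shipped in the include dir above */

enum { DIM = 16 };

/* Sketch: y = B^T * x, i.e. the 1x16 row vector x times the 16x16
 * matrix B (row-major). Buffers are assumed to be filled elsewhere. */
void vecmat_armpl(const double complex *x, const double complex *B,
                  double complex *y)
{
    double complex alpha = 1.0, beta = 0.0;
    cblas_zgemv(CblasRowMajor, CblasTrans, DIM, DIM,
                &alpha, B, DIM, x, 1, &beta, y, 1);
}
```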
1.a) Sequential results:
For small N, the naive implementation was faster.
N=1 → ARMPL ≈ 0.144 ms vs naive ≈ 0.009 ms.
As N increased, ARMPL became much faster.
N=100 → ARMPL = 0.52 ms vs naive = 0.798 ms.
N=1000 → ARMPL = 3.57 ms vs naive = 8.01 ms.
1.b) OpenMP results:
I expected parallelization to improve performance, but it did not.
In fact, results were similar or slightly slower.
N=1 → 0.185 ms
N=100 → 0.53 ms
2. Matrix–Matrix Multiplication (Double Real)
Problem: multiplying A[24×32] by B[32×20] inside a loop that runs N times.
ARMPL function: cblas_dgemm
Comparison: against my naive matrix–matrix multiplication implementation.
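The corresponding ARMPL call is roughly (again a simplified sketch; buffers filled elsewhere):

```c
#include <armpl.h>

/* Sketch: C[24x20] = A[24x32] * B[32x20], row-major.
 * M = 24, N = 20, K = 32; leading dimensions are the row widths. */
void matmat_armpl(const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                24, 20, 32,
                1.0, A, 32,
                B, 20,
                0.0, C, 20);
}
```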
2.a) Sequential results:
Again, naive was slightly faster at N=1.
N=1 → ARMPL ≈ 0.24 ms vs naive ≈ 0.176 ms.
As N increased, ARMPL was significantly faster.
N=100 → ARMPL = 2.15 ms vs naive = 16.15 ms.
N=1000 → ARMPL = 18.50 ms vs naive = 161.20 ms.
2.b) OpenMP results:
I expected OpenMP to give a much larger speedup, but the results improved only slightly.
N=1 → 0.31 ms
N=100 → 1.85 ms
N=1000 → 15.5 ms
My Questions
Are these results expected when comparing ARMPL sequential vs OpenMP builds?
Should I expect only small improvements (or even overhead) for these matrix sizes?
Why do naive implementations sometimes appear faster at very small N, but ARMPL becomes much faster at larger N?
Is my testing methodology valid?
I simply placed the function calls in a loop of size N and measured the total runtime of that loop.
Is this a realistic way to benchmark, or should I use a different approach (e.g., larger problem sizes, warm-up runs, etc.)?
In general, when is it advantageous to use ARMPL over naive implementations, and when does OpenMP provide noticeable benefits?
Any clarification or guidance on whether I am testing correctly and what to expect from ARMPL would be greatly appreciated. Thank you in advance!
Hi there,
Thank you again for downloading and using ArmPL. In answer to your questions:
1) For the matrix sizes you use, I would not expect any improvement from using the OpenMP version instead of sequential ArmPL. The library limits the number of threads it uses based on the sizes of the inputs, and for sizes like these I would expect only one thread to be used. You can check this by watching the output of a utility like "htop" while your benchmarks run. To see benefits from OpenMP, I would try much larger inputs, for example starting with 256x256 matrices and increasing the sizes from there.
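As a minimal sketch of such a test (the sizes and setup are just illustrative; link against libarmpl_lp64_mp.so and libomp.so as in your post):

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <armpl.h>

int main(void)
{
    const int n = 256;   /* large enough for multiple threads to engage */
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* Threads available to the OpenMP runtime; ArmPL may choose to use
     * fewer for small inputs, which you can confirm with htop. */
    printf("max OpenMP threads: %d\n", omp_get_max_threads());

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    free(A); free(B); free(C);
    return 0;
}
```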
2) When calling a function once, you pay one-off start-up costs for ArmPL that the naive implementations do not have. Over more repetitions this one-off cost is outweighed by ArmPL's better computational performance. From your figures, the runtime of the naive implementation scales linearly with the number of repetitions while ArmPL's does not; that is the effect of the start-up overhead.
3) I think I have answered this already, but for a more representative set of results I would measure a range of input sizes, including larger ones, and perform many iterations with warm-up runs to amortize ArmPL's start-up overheads.
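For example, reusing the time_loop_ms/kernel sketch from your post, a warm-up call before the timed loop would look like this:

```c
/* One untimed call absorbs the one-off ArmPL start-up cost; the timed
 * loop then gives a per-call figure averaged over many repetitions. */
kernel();                                         /* warm-up, not timed */
double ms_per_call = time_loop_ms(kernel, 1000) / 1000.0;
```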
4) If you are doing a one-off operation on a very small input, you might prefer a naive implementation. But as soon as you need to handle a range of input sizes (particularly larger ones), or you do not know in advance what sizes you will see, ArmPL will be preferable. OpenMP provides noticeable benefits when your inputs are large enough to benefit from multiple threads, and ArmPL determines the thresholds and thread counts for you.
I hope this answers your questions. Please let me know if you have any further questions.
Nick
Thank you very much for the detailed explanation.