Hello everyone,
I have been experimenting with ARM Performance Libraries (ARMPL) functions and comparing them against my own naive implementations. I would really appreciate your insights, as I have some questions about the results and my methodology.
For the setup:
I used the include directory from the official package: /opt/Arm_PL/armpl-24.10.1_Ubuntu-20.04_gcc/include (extracted from arm-performance-libraries_25.07_acfl2410.tar).
For linking, I used:
Sequential: libarmpl_lp64.so
OpenMP: libomp.so and libarmpl_lp64_mp.so
For timing, I used clock_gettime(CLOCK_MONOTONIC, &start) and clock_gettime(CLOCK_MONOTONIC, &end) from <time.h>.
1. Vector–Matrix Multiplication (Complex Double)
Problem: multiplying a 1×16 vector A by a 16×16 matrix B, repeated inside a loop that runs N times.
ARMPL function: cblas_zgemv
Comparison: against my naive vector–matrix multiplication implementation.
1.a) Sequential results:
For small N, the naive implementation was faster.
N=1 → ARMPL ≈ 0.144 ms vs naive ≈ 0.009 ms.
As N increased, ARMPL became much faster.
N=100 → ARMPL = 0.52 ms vs naive = 0.798 ms.
N=1000 → ARMPL = 3.57 ms vs naive = 8.01 ms.
1.b) OpenMP results:
I expected parallelization to improve performance, but it did not.
In fact, results were similar or slightly slower.
N=1 → 0.185 ms
N=100 → 0.53 ms
2. Matrix–Matrix Multiplication (Double Real)
Problem: multiplying a 24×32 matrix A by a 32×20 matrix B, repeated inside a loop that runs N times.
ARMPL function: cblas_dgemm
Comparison: against my naive matrix–matrix multiplication implementation.
2.a) Sequential results:
Again, naive was slightly faster at N=1.
N=1 → ARMPL ≈ 0.24 ms vs naive ≈ 0.176 ms.
As N increased, ARMPL was significantly faster.
N=100 → ARMPL = 2.15 ms vs naive = 16.15 ms.
N=1000 → ARMPL = 18.50 ms vs naive = 161.20 ms.
2.b) OpenMP results:
I expected OpenMP to give much better speedup, but results were only slightly improved.
N=1 → 0.31 ms
N=100 → 1.85 ms
N=1000 → 15.5 ms
My Questions
Are these results expected when comparing ARMPL sequential vs OpenMP builds?
Should I expect only small improvements (or even overhead) for these matrix sizes?
Why do naive implementations sometimes appear faster at very small N, but ARMPL becomes much faster at larger N?
Is my testing methodology valid?
I simply placed the function calls in a loop of size N and measured total runtime of that loop.
Is this a realistic way to benchmark, or should I use a different approach (e.g., larger problem sizes, warm-up runs, etc.)?
In general, when is it advantageous to use ARMPL over naive implementations, and when does OpenMP provide noticeable benefits?
Any clarification or guidance on whether I am testing correctly and what to expect from ARMPL would be greatly appreciated. Thank you in advance!