I tried building and running the example programs provided with ARM Performance Libraries (ARMPL), such as the FFT/IFFT computations and matrix multiplication. These examples come in two variants: MP (multi-threaded) and sequential. The code is essentially the same in both cases, but the linked .so libraries differ.
My goal was to compare the performance of the MP and sequential versions by measuring the time taken to perform the computations, hoping to better understand how parallelization works in the MP version.
Since the example code executes the computation only once (i.e., without iteration), timing it directly yields a very short duration (the "unit time"), which is too small to compare meaningfully. I initially expected the MP version to be faster, assuming it could parallelize tasks such as multiplying matrix rows and columns simultaneously, whereas the sequential version would do this step by step. However, this was just a naive assumption; I don't know the exact implementation details inside ARMPL.
Surprisingly, in these single-run (unit time) measurements, the MP version sometimes appears slower than the sequential one. But when I modified the example code to run the same operation repeatedly in a loop (e.g., thousands of times), the MP version showed a significant performance advantage, around 3x faster than the sequential version. Since my system supports up to 4 threads, this result seems reasonable.
My questions are:
Why does the MP version not outperform the sequential one in unit-time runs?
How exactly does parallelization work in ARMPL?
Are there startup or thread management overheads that explain the initial slowness of MP?
Any insights into the internal behavior or optimization strategies used by ARMPL would be greatly appreciated.
The choice between the OpenMP and non-OpenMP versions of the library comes down to what your expected workloads look like. If you are calling multiple ArmPL functions (not necessarily in a tight loop) and you do not expect to run on very small inputs, then I would start with the OpenMP version. If you expect to call ArmPL only intermittently and/or on very small input sizes, then the non-OpenMP version might be a better choice. The same applies if you intend to handle parallelism yourself in your application (for example, by calling ArmPL functions within OpenMP parallel regions).
I assume that PetaLinux does not include tools like perf, which let you see the percentage of time an executable spends in various function calls. If that is the case, then timing the execution of test programs that contain representative workflows is probably the best approach. The ArmPL examples are probably not a good representation of a real application, as each executable only calls one or two ArmPL functions on small input. You can then compare the execution time with and without OpenMP.
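As a rough illustration of such a test program, here is a minimal sketch that times the first call separately from the average of many later calls, so that any one-time cost shows up on its own. The `workload` function is a hypothetical placeholder (a small naive matrix multiply), not an ArmPL routine, and `now_seconds`, `time_first_and_steady` and `report` are names made up for this example. Nothing here describes ArmPL internals; in a real test you would replace `workload` with your own sequence of ArmPL calls and link against each library variant in turn.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define N 32 /* deliberately small, similar in spirit to example-sized input */

static double a[N][N], b[N][N], c[N][N];

/* Hypothetical placeholder workload; substitute the real library calls here. */
static void workload(void)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* Current CLOCK_MONOTONIC time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Times the first call alone, then the average of `reps` further calls. */
static void time_first_and_steady(void (*fn)(void), int reps,
                                  double *first, double *steady)
{
    double t0 = now_seconds();
    fn();
    double t1 = now_seconds();
    for (int r = 0; r < reps; ++r)
        fn();
    double t2 = now_seconds();
    *first = t1 - t0;
    *steady = (t2 - t1) / reps;
}

/* Fills the inputs, runs the measurement and prints both timings. */
static void report(int reps)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            a[i][j] = (double)(i + j);
            b[i][j] = (double)(i - j);
        }
    double first, steady;
    time_first_and_steady(workload, reps, &first, &steady);
    printf("first call: %.9f s, steady-state average: %.9f s (c[0][0]=%g)\n",
           first, steady, c[0][0]);
}
```

Calling `report(1000)` from `main` in one build per library variant gives two numbers to compare for each. If the first call is much slower than the steady-state average in one variant only, that points to a one-time cost rather than to the per-call work itself.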
If you are interested in only timing sub-regions of your test programs you can insert calls to something like clock_gettime(CLOCK_MONOTONIC, ...) into your code. How much time your programs spend in ArmPL calls versus their total run-time will let you see how important the choice of library is: if execution of the program only spends a few percent of its time in ArmPL then the speedup of using OpenMP may not change the overall run-time of your program very much.
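The sub-region timing described above could look something like the following sketch. It brackets one region with `clock_gettime(CLOCK_MONOTONIC, ...)`, reports that region as a fraction of the total run-time, and then applies an Amdahl's-law style estimate of how much the whole program would speed up if only that region got faster. `busy_work` is a hypothetical stand-in for both the library calls and the rest of the program, and `elapsed_seconds`, `library_fraction` and `overall_speedup` are names invented for this illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Seconds between two CLOCK_MONOTONIC timestamps. */
static double elapsed_seconds(const struct timespec *start,
                              const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_nsec - start->tv_nsec) * 1e-9;
}

/* Hypothetical placeholder: busy work standing in for real computation. */
static volatile double sink;
static void busy_work(long iterations)
{
    double acc = 0.0;
    for (long i = 0; i < iterations; ++i)
        acc += (double)i * 0.5;
    sink = acc; /* keep the result observable so the loop is not removed */
}

/* Runs a toy "program" and returns the fraction of its wall time spent
   in the region that stands in for the library calls. */
static double library_fraction(long library_iters, long other_iters)
{
    struct timespec prog_start, lib_start, lib_end, prog_end;

    clock_gettime(CLOCK_MONOTONIC, &prog_start);
    busy_work(other_iters);                      /* rest of the program */
    clock_gettime(CLOCK_MONOTONIC, &lib_start);
    busy_work(library_iters);                    /* region of interest  */
    clock_gettime(CLOCK_MONOTONIC, &lib_end);
    clock_gettime(CLOCK_MONOTONIC, &prog_end);

    double total = elapsed_seconds(&prog_start, &prog_end);
    double lib = elapsed_seconds(&lib_start, &lib_end);
    return total > 0.0 ? lib / total : 0.0;
}

/* Amdahl's-law estimate: whole-program speedup when only `fraction` of
   the run-time is accelerated by a factor of `library_speedup`. */
static double overall_speedup(double fraction, double library_speedup)
{
    return 1.0 / ((1.0 - fraction) + fraction / library_speedup);
}
```

For example, if the measured fraction is 0.05 and the library region runs 3x faster with OpenMP (the figure from the question), `overall_speedup(0.05, 3.0)` is about 1.03, so the program as a whole would be only about 3% faster.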
Thank you very much for your time.