I tried building and running the example programs provided with ARM Performance Libraries (ARMPL), such as the FFT/IFFT computations and matrix multiplication. These examples come in two variants: MP (multi-threaded) and sequential. The code is essentially the same in both cases, but the linked .so libraries differ.
My goal was to compare the performance of the MP and sequential versions by measuring the time taken to perform the computations, hoping to better understand how parallelization works in the MP version.
Since the example code executes the computation only once (i.e., without iteration), timing it directly yields a very short duration, which makes a meaningful comparison difficult. I initially expected the MP version to be faster, assuming it could parallelize work such as multiplying matrix rows and columns simultaneously, whereas the sequential version would do this step by step. However, this was just a naive assumption; I don't know the exact implementation details inside ARMPL.
Surprisingly, in these single-run measurements, the MP version sometimes appears slower than the sequential one. But when I modified the example code to run the same operation repeatedly in a loop (e.g., thousands of times), the MP version showed a significant performance advantage: around 3x faster than the sequential version. Since my system supports up to 4 threads, this result seems reasonable.
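For reference, the loop-based timing described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual ARMPL example code: `time_kernel`, `matmul_ctx`, and `naive_matmul` are hypothetical names introduced here, and the naive triple loop is only a stand-in for whichever library call is being measured. It runs a few untimed warm-up calls first, then averages wall-clock time over many iterations.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds, read from a monotonic clock. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Signature of the operation under test (hypothetical). */
typedef void (*kernel_fn)(void *ctx);

/* Hypothetical helper: run `warmup` untimed calls, then return the
 * average wall-clock seconds per call over `iters` timed calls. */
double time_kernel(kernel_fn fn, void *ctx, int warmup, int iters) {
    assert(iters > 0);
    for (int i = 0; i < warmup; ++i) {
        fn(ctx);  /* untimed: absorbs one-off startup costs */
    }
    double t0 = now_seconds();
    for (int i = 0; i < iters; ++i) {
        fn(ctx);
    }
    double t1 = now_seconds();
    return (t1 - t0) / iters;
}

/* Stand-in workload: naive row-major n x n multiply, c = a * b.
 * In a real test this would be replaced by the library call. */
struct matmul_ctx {
    int n;
    const double *a;
    const double *b;
    double *c;
};

void naive_matmul(void *p) {
    struct matmul_ctx *m = p;
    int n = m->n;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k) {
                sum += m->a[i * n + k] * m->b[k * n + j];
            }
            m->c[i * n + j] = sum;
        }
    }
}
```

A call such as `time_kernel(naive_matmul, &ctx, 5, 1000)` would then give the average per-call time. The sketch reads a wall-clock timer rather than `clock()` because `clock()` reports CPU time for the whole process, which adds up across threads and can make a multi-threaded run look slower than it is. Comparing a timed first call against the warmed-up average is one way to check whether the single-run gap comes from one-off startup costs.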
My questions are:
Why does the MP version not outperform the sequential one in single-run measurements?
How exactly does parallelization work in ARMPL?
Are there startup or thread management overheads that explain the initial slowness of MP?
Any insights into the internal behavior or optimization strategies used by ARMPL would be greatly appreciated.