ARMPL MP vs sequential

I tried building and running the example programs provided with Arm Performance Libraries (ARMPL), such as FFT/IFFT computations and matrix multiplication. These examples come in two variants: MP (multi-threaded) and sequential. The code is essentially the same in both cases, but they are linked against different .so libraries.

My goal was to compare the performance of the MP and sequential versions by measuring the time taken to perform the computations, hoping to better understand how parallelization works in the MP version.

Since the example code executes each computation only once (i.e., without iteration), measuring the execution time directly yields a single very short duration (unit time), which makes meaningful comparison difficult. I initially expected the MP version to be faster, assuming it could parallelize tasks such as multiplying matrix rows and columns simultaneously, whereas the sequential version would do this step by step. However, this was just a naive assumption; I don't know the exact implementation details inside ARMPL.

Surprisingly, in these single-run (unit time) measurements, the MP version sometimes appears slower than the sequential one. But when I modified the example code to run the same operation repeatedly in a loop (e.g., thousands of times), the MP version showed a significant performance advantage of around 3x over the sequential version. Since my system supports up to 4 threads, this result seems reasonable.
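For illustration, the modified timing loop looked roughly like the sketch below. It uses ARMPL's CBLAS interface (armpl.h and cblas_dgemm) purely as a stand-in for whichever example routine is being timed, and the matrix size and repetition count are made up; the same source is built twice, linked once against the sequential library and once against the MP one.

```c
/* Sketch: time many repetitions of one ARMPL call instead of a single run.
 * cblas_dgemm is used here only as a representative routine. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <armpl.h>

int main(void)
{
    const int n = 512, reps = 1000;
    double *a = malloc((size_t)n * n * sizeof *a);
    double *b = malloc((size_t)n * n * sizeof *b);
    double *c = malloc((size_t)n * n * sizeof *c);
    for (int i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; ++r)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%d calls: %.3f s total, %.6f s per call\n", reps, secs, secs / reps);

    free(a); free(b); free(c);
    return 0;
}
```

Averaging over many calls like this is what produced the roughly 3x difference; the single-call numbers were too noisy to compare.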

My questions are:

  1. Why does the MP version not outperform the sequential one in unit-time runs?

  2. How exactly does parallelization work in ARMPL?

  3. Are there startup or thread management overheads that explain the initial slowness of MP?

Any insights into the internal behavior or optimization strategies used by ARMPL would be greatly appreciated.

  • Hi there,

    Thank you very much for downloading and using ArmPL.

    1) In general it is very hard to conclude anything from one-off calls to functions. This is especially the case for our examples, which run on very small input problems. The execution time is likely to be affected by many sources of noise; OpenMP overheads may be one, but there will also be noise from the operating system, interruptions from other processes running on the machine, etc. As you say, to get an accurate view of performance you have to run a function many times, and probably on a range of input sizes.

    2) Unfortunately there is no easy answer to that, because it depends on which function you are calling and what the inputs are. The OpenMP implementations generally divide the work between threads and operate on those parts at the same time. ArmPL decides how many threads to use based on the size of the problem being solved -- it may use fewer than the total number of available threads.

    3) There could be startup and thread management costs when using OpenMP, but how much of an overhead they cause will depend on the size of the problem being solved: above a certain size (where the exact threshold depends on which function you are calling and with what options), any overheads will be offset by the amount of computational work done inside the function. The sketch below shows one way to observe both the available thread count and this crossover on your own machine.
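    A minimal sketch along these lines may help. It assumes the CBLAS interface shipped with ArmPL (armpl.h, cblas_dgemm) and that the MP library picks up the usual OpenMP runtime settings (omp_set_num_threads / OMP_NUM_THREADS); link it against the MP variant only, since the point is to compare one thread with all threads at a small and a larger size.

    ```c
    /* Sketch: compare 1 thread vs. the OpenMP maximum at two problem sizes.
     * Assumes the MP library honours the OpenMP runtime thread settings;
     * ArmPL may still internally choose fewer threads for small problems. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>
    #include <armpl.h>

    static double time_dgemm(int n, int reps)
    {
        double *a = calloc((size_t)n * n, sizeof *a);
        double *b = calloc((size_t)n * n, sizeof *b);
        double *c = calloc((size_t)n * n, sizeof *c);

        /* warm-up call so one-off thread start-up is excluded from the timing */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);

        double t0 = omp_get_wtime();
        for (int r = 0; r < reps; ++r)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0, a, n, b, n, 0.0, c, n);
        double t = (omp_get_wtime() - t0) / reps;

        free(a); free(b); free(c);
        return t;
    }

    int main(void)
    {
        const int sizes[] = {64, 1024};
        const int max_thr = omp_get_max_threads();

        for (int s = 0; s < 2; ++s) {
            omp_set_num_threads(1);
            double t_seq = time_dgemm(sizes[s], 50);
            omp_set_num_threads(max_thr);
            double t_par = time_dgemm(sizes[s], 50);
            printf("n = %4d : 1 thread %.6f s/call, %d threads %.6f s/call\n",
                   sizes[s], t_seq, max_thr, t_par);
        }
        return 0;
    }
    ```

    At the small size the two times will typically be close (or the threaded run slightly slower, because of the overheads mentioned above), while at the larger size the threaded run should pull clearly ahead.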

    I hope this helps -- please let me know if you have any follow-up questions.

  • Hello Nick, thank you very much for the answer.
    Then, would you suggest I use the MP version or not? Since I will not benchmark in an ordinary way (I will call repeated tasks, of course, but I will not call those functions repeatedly in a for loop), how should I decide which one to use? What systematic comparison technique would you suggest? By the way, I run these codes on PetaLinux, so I cannot easily access heavy, complex analysis tools :).

  • The decision between the OpenMP and non-OpenMP libraries will come down to what your expected workloads look like. If you are calling multiple ArmPL functions (not necessarily in a tight loop) and you do not expect to run on very small inputs, then I would start with the OpenMP version. If you expect only to call ArmPL intermittently and/or on very small input sizes, then the non-OpenMP version might be a better choice. Also, if you intend to handle parallelism yourself in your application (for example, calling ArmPL functions within OpenMP parallel regions), then the non-OpenMP version of the library might be a better choice, as in the sketch below.
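    As an illustration of that last point, handling the parallelism yourself with the non-OpenMP library might look like the following sketch (assuming the CBLAS interface and a batch of independent small problems; the function name and sizes are invented for the example).

    ```c
    /* Sketch: application-level parallelism over many independent small
     * problems, linking against the sequential ArmPL library so the library
     * does not also try to parallelise internally. */
    #include <stdlib.h>
    #include <armpl.h>

    /* each iteration is an independent small DGEMM */
    static void batch_gemm(int nbatch, int n, double **a, double **b, double **c)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < nbatch; ++i)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0, a[i], n, b[i], n, 0.0, c[i], n);
    }

    int main(void)
    {
        enum { NB = 8, N = 64 };
        double *a[NB], *b[NB], *c[NB];
        for (int i = 0; i < NB; ++i) {
            a[i] = calloc(N * N, sizeof(double));
            b[i] = calloc(N * N, sizeof(double));
            c[i] = calloc(N * N, sizeof(double));
        }
        batch_gemm(NB, N, a, b, c);
        for (int i = 0; i < NB; ++i) { free(a[i]); free(b[i]); free(c[i]); }
        return 0;
    }
    ```

    Here each thread runs its own sequential ArmPL call, so there is no nested parallelism to worry about.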

    I assume that PetaLinux does not include tools like perf, which let you see the percentage of time an executable spends in various function calls? If not, then timing the execution of test programs that contain representative workflows is probably the best approach -- the ArmPL examples are probably not a good representation of a real application, as each executable only calls one or two ArmPL functions on small inputs. You can then compare the execution time with and without OpenMP.

    If you are interested in timing only sub-regions of your test programs, you can insert calls to something like clock_gettime(CLOCK_MONOTONIC, ...) into your code. How much time your program spends in ArmPL calls versus its total run-time will let you see how important the choice of library is: if the program only spends a few percent of its time in ArmPL, then the speedup from using OpenMP may not change the overall run-time very much.
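    For instance, a sketch of that kind of sub-region timing (again using cblas_dgemm as a stand-in for your real sequence of ArmPL calls) could look like this:

    ```c
    /* Sketch: measure the fraction of total run time spent in ArmPL calls. */
    #include <stdio.h>
    #include <time.h>
    #include <armpl.h>

    static double elapsed(const struct timespec *a, const struct timespec *b)
    {
        return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) * 1e-9;
    }

    int main(void)
    {
        struct timespec prog0, prog1, t0, t1;
        double in_armpl = 0.0;
        enum { N = 256 };
        static double a[N * N], b[N * N], c[N * N];

        clock_gettime(CLOCK_MONOTONIC, &prog0);

        for (int iter = 0; iter < 100; ++iter) {
            /* ... the non-ArmPL work of your application would go here ... */

            clock_gettime(CLOCK_MONOTONIC, &t0);
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        N, N, N, 1.0, a, N, b, N, 0.0, c, N);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            in_armpl += elapsed(&t0, &t1);
        }

        clock_gettime(CLOCK_MONOTONIC, &prog1);
        double total = elapsed(&prog0, &prog1);
        printf("ArmPL: %.3f s of %.3f s total (%.1f%%)\n",
               in_armpl, total, 100.0 * in_armpl / total);
        return 0;
    }
    ```

    The printed percentage gives you a rough idea of how much the choice of library could possibly change your overall run-time.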