Arm DDT and Arm MAP are excellent tools for finding program flaws and performance issues, and they are also very helpful for studying codes and coding techniques. In this article I present a handful of optimization techniques and use Arm MAP to illustrate their benefits. These tips are aimed at scientific programmers who want to adopt best practices, and they are limited to those that:
On most machines the division operator is significantly more expensive (i.e. takes many more clock cycles) than the other basic arithmetic operators such as addition and multiplication. When possible, refactor your code to avoid division. In the example profiled here, two loops perform the same calculation: one uses division operations and the other uses multiplication operations.
Each loop runs the same number of iterations. Let’s compile and run the code in Arm MAP…
In the source code view, in the middle of the GUI, a green plot to the left of each line shows the share of CPU time spent on that line. The first loop, which uses division, consumed over 78% of the entire run, while the second loop, which uses multiplication, used only about 21%.
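As a minimal sketch of this refactoring in C (not the code profiled above; the function names are hypothetical and chosen only for illustration), the division can be replaced by a multiplication with a reciprocal that is computed once, outside the loop:

```c
#include <stddef.h>

/* Hypothetical example: scale every element by dividing.
 * One division is executed per iteration. */
void scale_by_division(double *x, size_t n, double d)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = x[i] / d;
    }
}

/* Same scaling with the reciprocal hoisted out of the loop:
 * one division in total, then one multiplication per iteration. */
void scale_by_reciprocal(double *x, size_t n, double d)
{
    const double inv = 1.0 / d;
    for (size_t i = 0; i < n; i++) {
        x[i] = x[i] * inv;
    }
}
```

Multiplying by a reciprocal is not always bit-for-bit identical to dividing, because 1.0 / d is itself rounded. The two versions agree exactly when the reciprocal is exactly representable (for example, when d is a power of two) and can otherwise differ in the last bits, so check that your algorithm tolerates the difference before making the change.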
Many technical computing algorithms, especially in HPC, require double precision instead of single precision (typically 64 bits vs. 32 bits). Many, however, do not. When selecting data types for a given algorithm, or even a sub-part of the algorithm, try to determine whether single precision is adequate. Single-precision data consumes half the memory and typically executes faster.
In this example, two loops perform the same calculation, the first in single precision and the second in double. Looking at the CPU utilization plots to the left of the source code, we can see that the first loop used about 41% of the total execution time and the second loop used about 58%. Does C really need to be double?
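To make the comparison concrete, here is a small sketch in C (again not the profiled code; the function names are hypothetical) of the same update written once for float and once for double. It assumes the typical case where float is 32 bits and double is 64 bits:

```c
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] in single precision.
 * Each element typically occupies 4 bytes. */
void axpy_single(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* The same update in double precision.
 * Each element typically occupies 8 bytes, so the loop moves
 * twice as much data through memory for the same element count. */
void axpy_double(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

The trade-off is accuracy: single precision carries roughly 7 significant decimal digits against roughly 15 to 16 for double, so the choice depends on how much rounding error the algorithm can tolerate.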
FORTRAN uses column-major ordering of multi-dimensional data. That is, individual columns are laid out in memory one after the other. C/C++ use row-major ordering, such that individual rows are laid out one after another. For best efficiency, it is important to access individual elements in the order in which they are stored in memory. Depending on which language you are using, this means that a different index belongs in the inner-most loop: the first (row) index in FORTRAN and the last (column) index in C/C++. In this example, I illustrate two methods of accessing and modifying a two-dimensional FORTRAN variable. In the first loop, I access sequential row elements of the variable a. For C/C++ this would be optimal, but as we will see in Arm MAP this is not good in FORTRAN.
Let’s compile the program and run it in Arm MAP…
Looking at the CPU utilization plot next to the source code, we can see that the first loop uses about 76% of the entire run and that the second loop uses only about 23%. We can dig a little deeper by looking at the Arm MAP metrics view at the top of the GUI. In the first ¾ of the run (the inefficient loop), memory access is very high and floating-point activity is very low.
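The same effect can be sketched from the C side, where the layout is reversed. The following is an illustration only, with hypothetical function names and an arbitrary array size, and it is not a translation of the FORTRAN program profiled above. In C the last index should vary fastest, so the first function is the cache-friendly one and the second strides through memory:

```c
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Inner loop walks along a row: consecutive iterations touch
 * adjacent memory, which matches C's row-major layout. */
void add_row_order(double a[ROWS][COLS], double v)
{
    for (size_t i = 0; i < ROWS; i++) {
        for (size_t j = 0; j < COLS; j++) {
            a[i][j] += v;
        }
    }
}

/* Inner loop walks down a column: consecutive iterations are
 * COLS elements apart, so each access jumps through memory. */
void add_column_order(double a[ROWS][COLS], double v)
{
    for (size_t j = 0; j < COLS; j++) {
        for (size_t i = 0; i < ROWS; i++) {
            a[i][j] += v;
        }
    }
}
```

Both functions produce the same array contents; only the order of memory accesses differs. In FORTRAN the roles are swapped, and the loop over the first index belongs innermost.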
Loop jamming, often called loop fusion, merges loops that iterate over the same range into a single loop. Every loop carries its own overhead, so when the work of multiple loops can be combined and the work done per iteration increased, efficiency improves.
In this example, two loops compute the same result for the variable C. In the first loop this is done with two sub-loops. On inspection we can see that my trivial equation can be reduced to a single statement, computed in a single loop. That is the technique used in the second outer loop. Looking at the CPU utilization plots to the left of the source code, we see that the first loop took about 65% of the run and the second loop took only about 35%.
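A minimal sketch of loop fusion in C follows. The equation and function names are hypothetical stand-ins rather than the ones used in the profiled code. The first function makes two passes over the data; the second computes the same result in one pass:

```c
#include <stddef.h>

/* Hypothetical example: c is built in two separate loops,
 * so the array c is traversed (and loaded/stored) twice. */
void compute_split(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] * 2.0;
    }
    for (size_t i = 0; i < n; i++) {
        c[i] = c[i] + b[i];
    }
}

/* Fused version: one loop, one statement, one traversal of c. */
void compute_fused(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] * 2.0 + b[i];
    }
}
```

Besides removing the overhead of the second loop, the fused version avoids writing each intermediate c[i] to memory and reading it back, which matters once the arrays no longer fit in cache.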
In this article I presented a handful of code optimization techniques which, if adopted, will give you a good baseline for writing performance-oriented code. I hope that the illustrations encourage you to use Arm MAP to explore and study additional optimization techniques and to see how your own codes perform.