Four simple tips for optimizing your code

July 23, 2014

Arm DDT and Arm MAP are excellent tools for finding program flaws and performance issues, and they are also very helpful for studying codes and coding techniques. In this article I present a handful of optimization techniques and use Arm MAP to illustrate their benefits. The tips are aimed at scientific programmers who want to adopt best practices, and they are limited to those that:

are beneficial across most current architectures and probably on most future architectures,
do not require any deep understanding of a given computer architecture, and
do not require a refactoring of your code that reduces readability.

1. Division is expensive

On most machines, division is significantly more expensive (i.e. it takes many more clock cycles) than the other basic arithmetic operations such as addition and multiplication. When possible, refactor your code to avoid division. In the following code snippet, I have two loops performing the same calculation: one using division operations and one using multiplication operations.

[Figure: code snippet for the division example]

Each loop runs the same number of iterations. Let’s compile and run the code in Arm MAP…

[Figure: Arm MAP source code view for the division example]

In the source code view, in the middle of the GUI, a green plot to the left of each line shows the CPU utilization for that line of code. The first loop, which uses division, consumed over 78% of the entire run, while the second loop, which uses multiplication, used only about 21%.
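The original listing is not reproduced here, so the following is only a minimal sketch of the kind of comparison described above. The function names and the scaling operation are assumptions made for illustration, not the article's actual code. The second function performs one division up front and multiplies by the reciprocal inside the loop.

```c
#include <stddef.h>

/* Hypothetical example: divide every element by d inside the loop.
 * One division is executed per iteration. */
void scale_by_division(double *x, size_t n, double d)
{
    for (size_t i = 0; i < n; ++i) {
        x[i] = x[i] / d;
    }
}

/* Same calculation with the division hoisted out of the loop:
 * a single division computes the reciprocal, and each iteration
 * then performs only a multiplication. */
void scale_by_reciprocal(double *x, size_t n, double d)
{
    const double inv = 1.0 / d;
    for (size_t i = 0; i < n; ++i) {
        x[i] = x[i] * inv;
    }
}
```

Multiplying by a reciprocal can round slightly differently from a true division, which is why compilers generally will not make this substitution on their own under strict floating-point rules. Check that the difference is acceptable for your algorithm before adopting it.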

2. Use the appropriate precision

Many technical computing algorithms, especially in HPC, require double precision rather than single precision (typically 64 bits vs. 32 bits). Many, however, do not. When selecting data types for a given algorithm, or even a sub-part of the algorithm, try to determine whether single precision is adequate. Single-precision data consumes half the memory and typically executes faster.

[Figure: Arm MAP source code view for the precision example]

In the example above I have two loops performing the same calculation, the first in single precision and the second in double. The CPU utilization plots to the left of the source code show that the first loop used about 41% of the total execution time and the second loop about 58%. Does C really need to be double?
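As a rough illustration of the comparison described above, the sketch below performs the same element-wise calculation once with `float` and once with `double`. The function names and the choice of an element-wise product are assumptions for illustration only; the article's own listing is not reproduced here.

```c
#include <stddef.h>

/* Hypothetical single-precision version: each element is typically
 * 4 bytes, so the three arrays move half as much data through memory. */
void multiply_f32(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i];
    }
}

/* Same calculation in double precision: each element is typically
 * 8 bytes, which gives more accuracy at the cost of memory traffic. */
void multiply_f64(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i];
    }
}
```

Whether the single-precision version is acceptable depends entirely on the accuracy your algorithm needs, so it is worth validating the results against a double-precision reference before switching.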

3. Be friendly to your memory

FORTRAN uses column-major ordering for multi-dimensional data; that is, individual columns are laid out in memory one after another. C and C++ use row-major ordering, so individual rows are laid out one after another. For best efficiency, it is important to access elements in the order in which they are stored in memory. Depending on which language you are using, this means that the index varied by the innermost loop will be different. In the following example, I illustrate two methods of accessing and modifying a two-dimensional FORTRAN variable. In the first loop, I access sequential row elements of the variable a. For C/C++ this would be optimal, but as we will see in Arm MAP, it is not good in FORTRAN.

[Figure: code snippet for the memory access example]

Let’s compile the program and run it in Arm MAP…

[Figure: Arm MAP source code and metrics views for the memory access example]

Looking at the CPU utilization plot next to the source code, we can see that the first loop uses about 76% of the entire run and that the second loop uses only about 23%. We can understand this a little better by looking at the Arm MAP metrics view at the top of the GUI. In the first ¾ of the run (the inefficient loop), memory access is very high and floating-point activity is very low.
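The example discussed above is written in FORTRAN. The sketch below restates the same idea in C, where the roles are reversed because C is row-major: walking down a column in the inner loop is the strided, cache-unfriendly order, and walking along a row is the sequential one. The array dimensions and function names are assumptions chosen for illustration.

```c
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Strided access in C: the inner loop varies the row index, so
 * consecutive iterations touch elements COLS doubles apart. */
void add_one_column_order(double a[ROWS][COLS])
{
    for (size_t j = 0; j < COLS; ++j) {
        for (size_t i = 0; i < ROWS; ++i) {
            a[i][j] += 1.0;
        }
    }
}

/* Sequential access in C: the inner loop varies the column index,
 * so consecutive iterations touch adjacent elements in memory. */
void add_one_row_order(double a[ROWS][COLS])
{
    for (size_t i = 0; i < ROWS; ++i) {
        for (size_t j = 0; j < COLS; ++j) {
            a[i][j] += 1.0;
        }
    }
}
```

Both functions produce identical results; only the order in which memory is visited differs. In FORTRAN the efficient version would instead vary the first (row) index in the innermost loop.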

4. Always be jamming

Loop jamming, often called loop fusion, is the combination of statements from multiple loops into a single loop. Executing a loop has overhead of its own, so when the work of several loops can be combined and the work done per iteration increased, efficiency improves.

[Figure: Arm MAP source code view for the loop jamming example]

In the example above I have two loops that compute the same result for the variable C. In the first loop this is done with two sub-loops. On inspection we can see that my trivial equation can be reduced to a single statement, computed in a single loop; that is the technique used in the second outer loop. Looking at the CPU utilization plots to the left of the source code, we see that the first loop took about 65% of the run and the second loop only about 35%.
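A minimal sketch of loop jamming, under the assumption of a simple element-wise equation (the article's actual equation is not reproduced here, and the function names are hypothetical). The first function builds the result in two passes over the data; the second computes it in one.

```c
#include <stddef.h>

/* Hypothetical unfused version: two separate loops, so c is written,
 * then read and written again, and loop overhead is paid twice. */
void compute_two_loops(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i];
    }
    for (size_t i = 0; i < n; ++i) {
        c[i] = c[i] + a[i];
    }
}

/* Jammed (fused) version: the two statements are combined into one,
 * so each element of c is written exactly once in a single pass. */
void compute_fused(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i] + a[i];
    }
}
```

Besides removing the second loop's overhead, the fused version avoids re-reading `c` from memory. An optimizing compiler may perform this fusion itself in simple cases, so it is worth profiling before and after.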

Conclusion

In this article I presented a handful of code optimization techniques which, if adopted, will give you a good baseline for writing performance-oriented code. I hope that the illustrations encourage you to use Arm MAP to explore additional optimization techniques and to see how your own codes perform.
