Four simple tips for optimizing your code

July 23, 2014

Arm DDT and Arm MAP are excellent tools for finding program flaws and performance issues. They are also very helpful for studying codes and coding techniques. In this article I present a handful of optimization techniques and use Arm MAP to illustrate their benefits. The tips are aimed at scientific programmers who want to adopt best practices, and are limited to those that:

are beneficial across most current architectures and probably on most future architectures,
do not require any deep understanding of a given computer architecture, and
do not require a refactoring of your code that reduces readability.

1. Division is expensive

On most machines the division operator is significantly more expensive (i.e. takes many more clock cycles) than the other basic arithmetic operators such as addition, subtraction, and multiplication. When possible, refactor your code to avoid division, for example by multiplying by a reciprocal that is computed once outside the loop. In the following code snippet, I have two loops performing the same calculation: one using division operations and one using multiplication operations.

[Figure: code snippet with a division loop and an equivalent multiplication loop]

Each loop runs for the same number of iterations. Let’s compile and run the code in Arm MAP.

[Figure: Arm MAP source code view of the division example]

In the source code view, in the middle of the GUI, a green plot to the left of each line shows how much CPU time that line consumed. The first loop, which uses division, accounts for over 78% of the entire run, while the second loop, which uses multiplication, accounts for only about 21%.
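The original snippet is only available as a screenshot, so the following is a minimal C sketch of the same idea rather than the code that was profiled. The function names `scale_by_division` and `scale_by_reciprocal` are hypothetical. The second function divides once to form a reciprocal and then multiplies inside the loop.

```c
#include <stddef.h>

/* Baseline: one floating-point division per element. */
void scale_by_division(double *out, const double *in, size_t n, double divisor)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] / divisor;
    }
}

/* Same calculation with the division hoisted out of the loop:
 * divide once to form the reciprocal, then multiply per element. */
void scale_by_reciprocal(double *out, const double *in, size_t n, double divisor)
{
    const double reciprocal = 1.0 / divisor;
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * reciprocal;
    }
}
```

Multiplying by a reciprocal can differ from true division in the last bit of the result, which is why compilers generally do not make this substitution on their own unless relaxed floating-point options are enabled. It is worth checking that your algorithm tolerates that difference before applying the change by hand.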

2. Use the appropriate precision

Many technical computing algorithms, especially in HPC, require double precision instead of single precision (typically 64 bits vs. 32 bits). Many, however, do not. When selecting data types for a given algorithm, or even a sub-part of the algorithm, try to determine whether single precision is adequate. Single precision consumes half the memory and will often execute faster, because twice as many values fit in each cache line and vector register.

[Figure: Arm MAP source code view of the single and double precision loops]

In the example above I have two loops performing the same calculation, the first in single precision and the second in double precision. The CPU utilization plots to the left of the source code show that the first loop used about 41% of the total execution time and the second loop about 58%. So ask yourself: does C really need to be double?
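As a rough illustration of the comparison above, here is a minimal C sketch with the same calculation written once for `float` and once for `double`. The function names and the formula are assumptions made for this example, not the code from the screenshot.

```c
#include <stddef.h>

/* Single precision: each element is typically 4 bytes.
 * Note the 'f' suffix: an unsuffixed 0.5 is a double constant and
 * would promote the whole expression to double precision. */
void average_float(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = 0.5f * (a[i] + b[i]);
    }
}

/* Double precision: each element is typically 8 bytes, so the same
 * loop moves twice as much data through the memory system. */
void average_double(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = 0.5 * (a[i] + b[i]);
    }
}
```

Keeping both variants side by side like this makes it easy to profile them on your own data and to check whether the single precision results are accurate enough for your algorithm.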

3. Be friendly to your memory

FORTRAN uses column-major ordering for multi-dimensional data: the elements of each column are contiguous, and the columns are laid out in memory one after another. C and C++ use row-major ordering, so the elements of each row are contiguous and the rows are laid out one after another. For best efficiency, access elements in the order in which they are stored. This means that the index that should vary in the inner-most loop depends on the language: the first (leftmost) index in FORTRAN and the last (rightmost) index in C/C++. In the following example, I illustrate two methods of accessing and modifying a two-dimensional FORTRAN variable. In the first loop, I access sequential row elements of the variable a. For C/C++ this would be optimal, but as we will see in Arm MAP, it is not good in FORTRAN.

[Figure: FORTRAN code snippet with two loop orderings over the array a]

Let’s compile the program and run it in Arm MAP.

[Figure: Arm MAP run of the memory access example]

Looking at the CPU utilization plot next to the source code, we can see that the first loop uses about 76% of the entire run and the second loop only about 23%. The Arm MAP metrics view at the top of the GUI helps explain why. In the first ¾ of the run (the inefficient loop) the memory access metric is very high and the floating-point metric is very low, which suggests that the processor spends most of that time waiting for data rather than computing.
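The profiled example is FORTRAN, where the column-order loop is the fast one. The sketch below shows the mirror image in C, where rows are contiguous. It is an illustration under assumed names (`add_one_row_order`, `add_one_column_order`) and uses a flat array with explicit index arithmetic so that the memory stride of each loop is visible.

```c
#include <stddef.h>

/* Row-order traversal: the inner loop varies j, so consecutive
 * iterations touch adjacent elements (stride of 1). This matches
 * C's row-major layout. */
void add_one_row_order(double *a, size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            a[i * cols + j] += 1.0;
        }
    }
}

/* Column-order traversal: the inner loop varies i, so consecutive
 * iterations are 'cols' elements apart. The result is identical,
 * but on large arrays each access tends to land in a different
 * cache line. */
void add_one_column_order(double *a, size_t rows, size_t cols)
{
    for (size_t j = 0; j < cols; ++j) {
        for (size_t i = 0; i < rows; ++i) {
            a[i * cols + j] += 1.0;
        }
    }
}
```

Both functions produce the same values. Only the order of memory accesses differs, so any timing difference you measure on a large array comes from the memory system rather than from the arithmetic.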

4. Always be jamming

Loop jamming, often called loop fusion, combines the statements of several loops that cover the same iteration range into a single loop. Executing a loop has its own overhead (counter updates, comparisons, and branches), so when the work of multiple loops is combined and the work done per iteration increases, efficiency improves.

[Figure: Arm MAP source code view of the loop jamming example]

In the example above I have two loops that compute the same result for the variable C. In the first loop this is done with two sub-loops. On inspection, we can see that my trivial equation can be reduced to a single statement computed in a single loop. That is the technique used in the second outer loop. The CPU utilization plots to the left of the source code show that the first loop took about 65% of the run and the second loop only about 35%.
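A minimal C sketch of the same transformation follows. The formula and the function names `compute_separate` and `compute_fused` are assumptions chosen for illustration, since the original code is only visible in the screenshot.

```c
#include <stddef.h>

/* Two passes over the data: c is written by the first loop and
 * then read back and rewritten by the second. */
void compute_separate(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i];
    }
    for (size_t i = 0; i < n; ++i) {
        c[i] = c[i] + a[i];
    }
}

/* Fused version: one pass, one statement. Each element of a, b and
 * c is touched once and the loop overhead is paid once. */
void compute_fused(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] * b[i] + a[i];
    }
}
```

Fusion is only valid when the second loop's iteration i does not depend on results that the first loop produces for other iterations. Here each c[i] depends only on a[i] and b[i], so the two loops can be merged safely.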

Conclusion

In this article I presented a handful of code optimization techniques which, if adopted, will give you a good baseline for writing performance-oriented code. I hope the illustrations encourage you to use Arm MAP to explore additional optimization techniques and to see how your own codes perform.
