Arm Community, Servers and Cloud Computing blog
Four simple tips for optimizing your code

Beau Paisley
July 23, 2014

Arm DDT and Arm MAP are excellent tools for finding program flaws and performance issues, and they are also very helpful for studying codes and coding techniques. In this article I present a handful of optimization techniques and use Arm MAP to illustrate their benefits. The tips are aimed at scientific programmers who want to adopt best practices, and are limited to those that:

  • are beneficial across most current architectures and probably on most future architectures,
  • do not require any deep understanding of a given computer architecture, and
  • do not require a refactoring of your code that reduces readability.

1. Division is expensive

On most machines, division is significantly more expensive (that is, it takes many more clock cycles) than the other basic arithmetic operations such as addition, subtraction, and multiplication. When possible, refactor your code to avoid division, for example by multiplying by a precomputed reciprocal. In the following code snippet, I have two loops performing the same calculation: one using division operations and one using multiplication operations.

[Figure: Optimize code division]

Each loop runs the same number of iterations. Let’s compile and run the code in Arm MAP…

[Figure: Optimize code division source code]

In the source code view, in the middle of the GUI, the green plot to the left of each line shows the CPU utilization for that line of code. The first loop, which uses division, consumed over 78% of the entire run, while the second loop, which uses multiplication, used only about 21%.
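The original snippet is only available as a screenshot, so the following is a minimal C sketch of the same idea rather than the code that was profiled. The function names `scale_by_division` and `scale_by_reciprocal` are hypothetical. The second version performs one division outside the loop and multiplies inside it.

```c
#include <stddef.h>

/* Hypothetical helper: divides every element by `divisor`,
   paying for one division per element. */
void scale_by_division(double *x, size_t n, double divisor)
{
    for (size_t i = 0; i < n; ++i)
        x[i] = x[i] / divisor;
}

/* Hypothetical helper: same calculation, but the single division is
   hoisted out of the loop and each element is multiplied instead. */
void scale_by_reciprocal(double *x, size_t n, double divisor)
{
    const double inv = 1.0 / divisor;
    for (size_t i = 0; i < n; ++i)
        x[i] = x[i] * inv;
}
```

Multiplying by a reciprocal can differ from a true division in the last bit of the result, which is why compilers generally do not make this substitution on their own unless relaxed floating-point options are enabled. Check that the small difference is acceptable for your algorithm.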

2. Use the appropriate precision

Many technical computing algorithms, especially in HPC, require double precision rather than single precision (typically 64 bits vs. 32 bits). Many, however, do not. When selecting data types for a given algorithm, or even a sub-part of the algorithm, try to determine whether single precision is adequate. Single precision consumes half the memory and usually executes faster.

[Figure: Optimize precision code Arm MAP]

In the example above I have two loops performing the same calculation, the first in single precision and the second in double. The CPU utilization plots to the left of the source code show that the first loop used about 41% of the total execution time and the second loop about 58%. Does C really need to be double?
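As a rough illustration of the comparison above, here is a C sketch with the same calculation written once for `float` and once for `double`. The function names and the particular formula are assumptions made for this example, not the code from the screenshot.

```c
#include <stddef.h>

/* Hypothetical single-precision version. The `f` suffix on the constant
   keeps the arithmetic in float instead of promoting it to double. */
void add_scaled_f32(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + 0.5f * b[i];
}

/* Hypothetical double-precision version of the same calculation. */
void add_scaled_f64(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + 0.5 * b[i];
}

/* Storage needed for n elements of each type. */
size_t array_bytes_f32(size_t n) { return n * sizeof(float); }
size_t array_bytes_f64(size_t n) { return n * sizeof(double); }
```

On typical platforms `array_bytes_f64(n)` is twice `array_bytes_f32(n)`, so the single-precision arrays move half as much data through the memory system and fit twice as many elements into each cache line and vector register.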

3. Be friendly to your memory

FORTRAN uses column-major ordering for multi-dimensional data; that is, individual columns are laid out in memory one after the other. C and C++ use row-major ordering, so individual rows are laid out one after another. For best efficiency, access elements in the order in which they are stored in memory. Depending on the language, this determines which index the innermost loop should vary: the first (row) index in FORTRAN, the last (column) index in C/C++. In the following example, I illustrate two methods of accessing and modifying a two-dimensional FORTRAN variable. In the first loop, I step through the elements of each row of the variable a in turn. For C/C++ this would be optimal, but as we will see in Arm MAP, it is not good in FORTRAN.

[Figure: Optimize memory code Arm MAP]

Let’s compile the program and run in Arm MAP…

[Figure: Optimize memory run code Arm MAP]

Looking at the CPU utilization plot next to the source code, we can see that the first loop uses about 76% of the entire run and the second loop only about 23%. The Arm MAP metrics view at the top of the GUI explains why: during the first ¾ of the run (the inefficient loop), memory access is very high and floating-point activity is very low.
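The example in the article is FORTRAN, where the first index should vary fastest. For comparison, here is a minimal C sketch of the same contrast, where the roles are reversed because C is row-major. The function names and the flat `i * cols + j` indexing are assumptions made for this illustration.

```c
#include <stddef.h>

/* Hypothetical helper: the inner loop varies the column index j, so
   consecutive iterations touch adjacent memory. This is the friendly
   order for a row-major C array. */
void add_row_order(double *a, size_t rows, size_t cols, double v)
{
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            a[i * cols + j] += v;
}

/* Hypothetical helper: the inner loop varies the row index i, so
   consecutive iterations are `cols` elements apart in memory. Same
   result, but a strided access pattern in C. */
void add_column_order(double *a, size_t rows, size_t cols, double v)
{
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            a[i * cols + j] += v;
}
```

Both functions produce identical arrays; only the order of memory accesses differs. For arrays much larger than the cache, the strided version is the one you would expect to show high memory access and low floating-point activity in a profiler.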

4. Always be jamming

Loop jamming, often called loop fusion, combines the statements of several loops into a single loop. Executing a loop has its own overhead, so when the work of multiple loops can be combined, increasing the work done per iteration, efficiency improves.

[Figure: Optimize loop jamming code Arm MAP]

In the example above I have two outer loops that compute the same result for the variable C. In the first, this is done with two sub-loops. On inspection we can see that my trivial equation can be reduced to a single statement computed in a single loop, which is the technique used in the second outer loop. The CPU utilization plots to the left of the source code show that the first loop took about 65% of the run and the second only about 35%.
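A minimal C sketch of the idea follows. The equation `c = 2a + b` and the function names are hypothetical stand-ins, since the original statement is only visible in the screenshot.

```c
#include <stddef.h>

/* Hypothetical helper: two separate passes over the data. The array c
   is written in the first loop and then read and rewritten in the
   second. */
void compute_two_loops(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = 2.0 * a[i];
    for (size_t i = 0; i < n; ++i)
        c[i] = c[i] + b[i];
}

/* Hypothetical helper: the two loops jammed into one. Each element of
   c is computed with a single statement and written once. */
void compute_fused(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = 2.0 * a[i] + b[i];
}
```

Besides halving the loop overhead, the fused version touches `c` once instead of three times, which reduces memory traffic for large arrays.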

Conclusion

In this article I presented a handful of code optimization techniques which, if adopted, will give you a good baseline for writing performance-oriented code. I hope the illustrations encourage you to use Arm MAP to explore additional optimization techniques and to see how your own codes perform.
