GNU compilers and LLVM-based compilers like the Arm Compiler for HPC have three compiler flags in common: -march, -mtune, and -mcpu. These flags control binary code generation, so the correct use of these flags can dramatically improve runtime performance. What exactly do these flags do? Do they have the same meaning when compiling for Arm as when compiling for x86? Do they mean the same thing to all compilers? How should you use them to get the best performance for your application?
TL;DR: Whenever possible use only -mcpu=native. Avoid -march and -mtune.
Let’s look at what these flags mean for GNU compilers and LLVM-based compilers like the Arm Compiler for HPC. For those compilers, the -march flag specifies the target architecture. -march tells the compiler that it is allowed to generate special instructions to use the specific hardware features of a given architecture. The -mtune flag specifies the target microarchitecture. -mtune tells the compiler to generate a binary with a bare-minimum, generic instruction set but also tune the resulting binary code for the specified target. The -mtune flag does not enable the compiler to use the special hardware features of the target. It only advises the compiler to perform architecture-independent optimizations like instruction reordering. When these flags are used to build Arm binaries, the -mcpu flag specifies the target architecture much the same way as -march, but it accepts the same parameter values as the -mtune flag. This is a crucial difference between Arm and x86! When GNU or LLVM compilers use the same flags on x86, the -mcpu flag is just a deprecated synonym for -mtune. [1]
Figure 1: Architecture vs. Microarchitecture in the Arm Ecosystem.
If you plot some Arm architecture specifications (e.g. Armv8.1-a) against some architecture implementations (e.g. ThunderX2) then it looks a little like Figure 1. The graph axes are somewhat conflated since architectures and microarchitectures are closely linked, so the blue horizontal lines show the baseline architecture for each microarchitecture on the vertical axis. We’ll soon talk more about the idea of a baseline architecture and why it matters. For now, just focus on the idea that each target has an architecture and a microarchitecture. If you’re more familiar with the x86 ecosystem, you could draw a similar chart by putting things like Intel Broadwell or AMD Bulldozer on the vertical axis and putting things like Intel Xeon and AMD Ryzen on the horizontal axis.
Figure 2: Execution and optimization space resulting from -march=armv8.1-a.
If you compile for Arm using only the -march=armv8.1-a flag, the compiler will generate a binary (let’s call it “a.out”) that isn’t tuned for any particular microarchitecture but is guaranteed to execute on the v8.1 architecture and all supersets of the v8.1 architecture. Figure 2 shows a.out’s execution space, which is the set of targets where we know for certain that a.out will execute.[2] The compiler will also try to take advantage of any optimizations specific to the v8.1 architecture, so a.out may (not will) perform better on a v8.1 architecture than on a v8.2 architecture. The orange area in Figure 2 represents a.out’s optimization space, which is the set of targets where the compiler may have attempted to optimize. We don’t know for certain if a.out will execute optimally on targets in the optimization space, but we do know for certain that a.out has not been optimized for targets outside the optimization space.
You may notice that many of the targets in a.out’s execution space in Figure 2 don’t exist. In reality there’s no such thing as a target that implements the complete Arm v8.3 architecture and the ThunderX2 microarchitecture. However, if one ever did exist then we know for certain that this example a.out binary would be able to execute on it, because Arm v8.3 is a superset of Arm v8.1. Similarly, just because the Qualcomm Falkor and the Marvell ThunderX2 are both in the optimization space shown in Figure 2 doesn’t mean the binary will perform optimally on both targets, or on either, in fact. The optimization space only shows the targets for which the compiler may have performed optimizations.
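Whether a particular machine sits inside the execution space of a binary that uses v8.1 LSE instructions can also be checked at run time. The following is a minimal sketch, assuming a Linux system where the kernel reports CPU features through getauxval(AT_HWCAP) and defines the HWCAP_ATOMICS bit for LSE; the function name cpu_has_lse is hypothetical and not part of any library. On other platforms the sketch simply reports that it cannot tell.

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_ATOMICS (Linux on AArch64 only) */
#endif

/* Hypothetical helper, for illustration only.
 * Returns 1 if the running CPU reports LSE atomics,
 *         0 if it does not,
 *        -1 if this platform offers no such query (e.g. x86 or non-Linux). */
int cpu_has_lse(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_ATOMICS)
    /* AT_HWCAP is a bitmask of CPU features filled in by the kernel. */
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) ? 1 : 0;
#else
    return -1;
#endif
}
```

A result of 0 on an AArch64 machine would suggest that a binary built with -march=armv8.1-a is not guaranteed to run there, since the compiler was free to emit LSE instructions.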
Figure 3: Execution and optimization spaces resulting from -mtune=thunderx2t99.
Now let’s look at how the -mtune flag affects the execution and optimization spaces. This flag advises the compiler to optimize for a target microarchitecture, but only for a generic instruction set. This flag does not allow the compiler to make any assumptions about the available instructions, so unless you pass additional flags the compiler won’t take advantage of any special instructions the target may provide. For example, if you compile with -mtune=thunderx2t99, the compiler will generate a binary that is optimized for the ThunderX2 microarchitecture but uses the v8.0 instruction set instead of v8.1. This binary will not take full advantage of all the ThunderX2’s hardware features! But the binary may be somewhat optimized for the ThunderX2 and it will be more portable than a binary compiled with -march=armv8.1-a. Figure 3 shows the execution and optimization spaces of a binary compiled with -mtune=thunderx2t99.
Figure 4: Execution and optimization spaces resulting from -mcpu=thunderx2t99.
On Arm, if you want to optimize for both a particular architecture and microarchitecture then you use the -mcpu flag. This is different from x86, where -mcpu is a deprecated synonym for -mtune. The -mcpu flag accepts the same parameter values as the -mtune flag. For example, -mcpu=thunderx2t99 is correct but -mcpu=armv8.1-a is not. Figure 4 shows the execution and optimization spaces of a binary compiled with -mcpu=thunderx2t99. In this case, the binary could execute on anything implementing the v8.1 architecture or better, but it has been optimized for execution on the ThunderX2.
GNU and LLVM compilers support passing the special parameter value “native” to these flags. The “native” value tells the compiler to detect the architecture and/or microarchitecture of the machine on which the compiler is executing and use that (micro)architecture as the parameter to -march, -mtune, or -mcpu as appropriate. Assuming architecture detection works for your platform, passing “native” is usually the best choice if you’re not cross-compiling and all you care about is performance.[3] With the GNU compiler, all three flags can accept “native” as a parameter so -march=native, -mtune=native, and -mcpu=native are all valid. LLVM compilers only support “native” for the -mcpu and -mtune flags. You cannot use -march=native with LLVM-based compilers. If you’re not cross-compiling, always use -mcpu=native to maximize optimization and compatibility across compilers.
What happens when -march, -mtune, and -mcpu are used in combination? On Arm, the -march and -mtune flags override any value passed to -mcpu. For example, if you’re compiling on a ThunderX2 with the flags “-march=armv8-a -mcpu=native” then the resulting binary won’t be fully optimized since the armv8-a parameter will override the armv8.1-a parameter implied by “native” on ThunderX2. Fortunately, the GNU compiler will issue a warning in this case.
Figure 5: Execution and optimization spaces resulting from -march=armv8-a -mtune=thunderx2t99.
Another difference between Arm and x86 is that the -march and -mtune flags are entirely orthogonal on Arm. -march does not override -mtune; -mtune does not override -march. Mix and match freely! Using -march=X -mtune=Y tells the compiler to generate binary code for architecture X and to tune it for microarchitecture Y. The resulting binary will execute on architecture X and all supersets of architecture X, but will be optimized for microarchitecture Y. For example, you could use -march=armv8-a -mtune=thunderx2t99 to generate a binary that uses the v8.0 architecture for maximum portability but is tuned for ThunderX2. The binary would have execution and optimization spaces as shown in Figure 5. The execution space is larger than in Figure 4 because we’ve specified the v8.0 architecture, and the optimization space is smaller than in Figure 3 because we’ve indicated that tuning should target ThunderX2.
Figure 6: Execution and optimization spaces resulting from -mcpu=thunderx2t99 when considering architecture extensions.
So why have the -mcpu flag at all if -mcpu is just an alias for -mtune on x86, and -march and -mtune are orthogonal on Arm? Why not just combine -march and -mtune as needed on Arm, or follow the x86 convention and let -march imply -mtune? Up to this point we’ve simplified a bit and considered each target’s baseline architecture to be the only architecture the target supports, i.e. in Figure 5 it appears as if both the Arm Neoverse N1 and the Fujitsu A64FX have the same architecture and share similar execution spaces. In reality, CPU architects frequently add extensions from multiple Arm architectures to the baseline, both above and below the baseline architecture version.
Figure 7: Arm Neoverse N1 CPU.
The Arm Neoverse N1 is a perfect example of how targets typically have a complete implementation of one architecture but support features from other architectures as well. Figure 7 shows the N1 baseline architecture is v8.2, but the N1 includes extensions from the v8.1, v8.3, v8.4, and v8.5 architectures. When you consider these extensions, the execution and optimization spaces for a binary compiled with -mcpu=thunderx2t99 look more like Figure 6 than Figure 4, with the spaces “thinning” as the architecture version increases. On the ThunderX2, any instruction from the v8.1 architecture is guaranteed to work, but many other instructions are available, and we would like to take advantage of them all. When you specify -march, you are confining the compiler to only the baseline architecture, so the compiler is unable to take advantage of any architecture extensions beyond the baseline. In order to take advantage of all the features of a particular target, you should use the -mcpu flag to simultaneously specify the architecture with all its extensions, and the microarchitecture. It is possible to use -march and list out every possible architecture extension, but this is cumbersome and non-portable, and reverse mapping multiple compiler flags to a single target is more trouble than it’s worth. Instead, just use -mcpu=target to tell the compiler exactly what you want, and the compiler will do the rest.
Figure 8: Example code “foo.c” demonstrating __sync_fetch_and_add GNU intrinsic.
As a concrete example of how these flags affect application performance, let’s see what the GNU and LLVM compilers do with a simple C program that invokes the __sync_fetch_and_add() intrinsic to atomically update an integer. The code is shown in Figure 8. Arm v8.0 has no special support for atomics, so compiling this code for Arm v8.0 will generate multiple instructions to perform the atomic operation. Arm v8.1 defines the Large System Extension (LSE) instruction ldaddal, which can atomically update an integer in a single instruction. Real world application speed-ups of 10x and even 100x have been reported when using LSE, so if our target supports LSE then we would very much like to use LSE instructions.
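For readers without access to the figure, here is a minimal sketch of what such a file might contain. It is a hypothetical stand-in rather than the original listing: the function name fetch_and_add and its parameters are assumptions made for illustration, and only the __sync_fetch_and_add builtin itself comes from the text.

```c
/* foo.c: hypothetical stand-in for the example file, not the original listing. */

/* Atomically add `increment` to `*counter` and return the value that
 * `*counter` held before the addition. __sync_fetch_and_add is a GCC
 * builtin that Clang-based compilers also accept. */
int fetch_and_add(int *counter, int increment)
{
    return __sync_fetch_and_add(counter, increment);
}
```

Compiling a file like this to assembly (for example by adding -S to the command line) with each of the three flags is one way to see which instruction sequence the compiler chose for the atomic update.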
gcc -march=armv8.1-a
    .arch armv8.1-a+crc
    ...
    ldaddal w1, w2, [x0]

gcc -mtune=thunderx2t99
    .arch armv8-a
    ...
    .L3:
    ldxr w2, [x0]
    add w2, w2, w1
    stlxr w3, w2, [x0]
    cbnz w3, .L3

gcc -mcpu=thunderx2t99
    .arch armv8.1-a+crypto+crc
    ...
    ldaddal w1, w2, [x0]

armclang -march=armv8.1-a
    .p2align 2
    ...
    ldaddal w0, w0, [x1]

armclang -mtune=thunderx2t99
    .p2align 2
    ...
    .LBB0_1:
    add x8, sp, #12
    ldaxr w9, [x8]
    mov w0, w9
    mov w9, w0
    ldr w10, [sp, #4]
    add w9, w9, w10
    stlxr w11, w9, [x8]
    cbnz w11, .LBB0_1

armclang -mcpu=thunderx2t99
    .p2align 6
    ...
    ldaddal w0, w0, [x1]
Figure 9: Assembly code generated by different compiler command lines.
Figure 9 shows six different compiler command lines and the relevant part of the resulting assembly code. Although the ThunderX2 implements LSE and supports the ldaddal instruction, compiling with only -mtune=thunderx2t99 will target the Arm v8.0 instruction set because this flag does not allow the compiler to make any assumptions about the target architecture. In contrast, compiling with -march=armv8.1-a allows the compiler to use LSE instructions, so the resulting assembly code is much more efficient. It seems counterintuitive, but -mtune=thunderx2t99 generates code that runs poorly on ThunderX2 while the code generated by the more generic -march=armv8.1-a will perform quite well on ThunderX2! Figure 9 also shows that using -mcpu=thunderx2t99 is the best option for all compilers. In this case, the desired LSE instruction is generated and the compiler has also tuned for the ThunderX2: armclang, for example, aligns the code to a 64-byte boundary (.p2align 6) instead of a 4-byte boundary (.p2align 2).
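Reading assembly is not the only way to see which instruction set a command line selected. The sketch below assumes a compiler that implements the Arm C Language Extensions (ACLE) predefined macros __ARM_FEATURE_ATOMICS and __ARM_ARCH; older compiler releases may not define __ARM_FEATURE_ATOMICS even when LSE is enabled, so treat a negative result with care. The function names are hypothetical, and the preprocessor guards let the file compile on non-Arm targets as well.

```c
#include <stdio.h>

/* Hypothetical helper: describes what the compiler was told about LSE.
 * __ARM_FEATURE_ATOMICS is an ACLE macro; it is an assumption that the
 * compiler in use defines it whenever LSE code generation is enabled. */
const char *lse_compile_time_status(void)
{
#if defined(__ARM_FEATURE_ATOMICS)
    return "LSE atomics enabled at compile time";
#elif defined(__aarch64__)
    return "AArch64 target, LSE not enabled (or macro not provided)";
#else
    return "not an AArch64 target";
#endif
}

/* Hypothetical helper: the Arm architecture version the compiler is
 * targeting according to the ACLE macro __ARM_ARCH, or 0 on non-Arm targets. */
int arm_arch_version(void)
{
#if defined(__ARM_ARCH)
    return __ARM_ARCH;
#else
    return 0;
#endif
}
```

Printing the result of lse_compile_time_status() from a small main() built once with -mtune=thunderx2t99 and once with -mcpu=thunderx2t99 would be expected to give different answers, since only the second command line tells the compiler that LSE is available.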
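We’ve shown how these three compiler flags have different meanings when compiling for Arm or x86, so what’s the best way to port from one platform to another? And can you optimize for both Arm and x86 with a single set of compiler flags? Can you use the same compiler flags for GNU and LLVM? Well, there are several key differences in how these flags are interpreted on x86 and Arm. On Arm, -mcpu selects both the architecture, with its extensions, and the microarchitecture, whereas on x86 -mcpu is a deprecated synonym for -mtune. On Arm, -march and -mtune are orthogonal, whereas on x86 -march implies -mtune.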
When porting from x86 to Arm, you should replace any occurrences of -march with -mcpu and remove any instances of -mtune. This will have the desired effect of simultaneously tuning for the target architecture and microarchitecture. You should also retain your compiler flags that use -march for x86 because on x86 -mcpu is a synonym for -mtune. Using the Arm flags on x86 may give suboptimal results analogous to the -mtune results in Figure 9.[6] In short, you cannot support both x86 and Arm with exactly the same compiler flags. You must use -mcpu for Arm and -march for x86. If performance is all you care about then using -mtune alone on any platform is likely not what you intended.
As long as you’re not cross-compiling, the simplest way to get the best performance on Arm with both GNU compilers and LLVM-based compilers is to use only -mcpu=native and actively avoid using -mtune or -march.
[1] https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
[2] It’s entirely possible that the same a.out binary could execute outside the execution space e.g. if the compiler just happened to only generate v8.0 instructions even though we passed -march=armv8.1-a. Members of a.out’s execution space are guaranteed to execute a.out without error, but a.out could theoretically just happen to execute on members in the negation of the execution space by lucky chance. To keep it safe and simple, we can assume that a.out only executes on members of its execution space.
[3] There are cases when “native” doesn’t work as expected, e.g. https://lemire.me/blog/2018/07/25/it-is-more-complicated-than-i-thought-mtune-march-in-gcc/. If you find such a case with the Arm compiler, please contact Arm’s support team at https://developer.arm.com/support.
[4] https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
[5] https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
[6] http://sdf.org/~riley/blog/2014/10/30/march-mtune/