ARM Cortex-A Processors and GCC Command Lines


The GNU Compiler Collection’s (GCC) command line options for ARM processors were originally designed many years ago, when the list of available processors and variants was much shorter than it is today. As the ARM architecture has evolved, the options needed to get the best code out of GCC have also changed, but attempts have been made to ensure that existing sets of options don't change their meaning. The design of the compiler means that the options needed to get the best out of your ARM Cortex™-A processor are now quite complex. This blog covers three areas of GCC’s command line options: CPU, floating point and SIMD (Single Instruction, Multiple Data) acceleration.

What options should I use for my CPU?

Firstly, let's look at the key options for telling the compiler about the CPU you are using; later on we'll discuss some more advanced options that can be used in special cases.

Whenever you compile a file, the compiler needs to know the type of CPU that you intend to use to run the resulting code. The primary option for doing this is the -mcpu=<cpu-name> option. As you might expect, the cpu-name is replaced by the specific name of the CPU type that you have, but in lower case. For example, for the Cortex-A9, the option is -mcpu=cortex-a9. GCC currently supports all Cortex-A processors up to and including the Cortex-A15; that is:

Cortex-A5 -mcpu=cortex-a5

Cortex-A7 -mcpu=cortex-a7

Cortex-A8 -mcpu=cortex-a8

Cortex-A9 -mcpu=cortex-a9

Cortex-A15 -mcpu=cortex-a15

If your version of GCC doesn't recognize one of the above, then it may be too old and you should consider upgrading. If you don't specify the CPU to use, GCC will use its built-in default -- that can vary depending on how the compiler was originally built and it may mean that the code generated will execute quite slowly (or not at all) on the CPU that you have.

Adding floating-point and SIMD

All ARM Cortex-A processors available today come with a floating-point unit and most also have a SIMD unit that implements the ARM Advanced-SIMD processor extensions (commonly known as NEON™). However, the precise set of instructions available depends on the processor that you have and GCC requires a separate option to control this; it doesn't try to work it out from the -mcpu option. The choice of floating-point and SIMD instructions is controlled by the option -mfpu and the recommended choices for each of the CPUs are given in the table below:

[Table (image not available): recommended -mfpu settings for each Cortex-A CPU]

VFPv3 and VFPv4 implementations start with 32 double-precision registers; however, when NEON is not present, the top 16 registers become optional. This is controlled by the d16 component of the option name. The fp16 component of the name specifies the presence of half-precision (16-bit) floating-point load, store and conversion instructions; this is an extension to VFPv3 but is available in all VFPv4 implementations.

For historical reasons GCC will only use floating-point and NEON instructions if it is explicitly told that it is safe to do so. The option to control this is, somewhat confusingly, part of an option that can also change the ABI that the compiler conforms to. The option -mfloat-abi takes three possible options:

-mfloat-abi=soft -- ignore all FPU and NEON instructions, use only the core register set and emulate all floating-point operations using library calls.

-mfloat-abi=softfp -- use the same calling conventions as -mfloat-abi=soft, but use floating-point and NEON instructions as appropriate. This option is binary compatible with -mfloat-abi=soft and can be used to improve the performance of code that has to conform to a soft-float environment but where it is known that the relevant hardware instructions will be available.

-mfloat-abi=hard -- use the floating-point and NEON instructions as appropriate and also change the ABI calling conventions in order to generate more efficient function calls; floating-point and vector types can now be passed between functions in the extension registers, which not only saves a significant amount of copying but also means that fewer calls need to pass arguments on the stack.

Which of the above options you should use will very much depend on your target system and it may be that the correct option is already the default. Ubuntu 12.04 (Precise), for example, now uses -mfloat-abi=hard by default.
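If you are not sure which of these settings your build is actually getting, one way to check is to look at the macros the compiler predefines. The following is a minimal sketch rather than an official recipe: the function names are made up for illustration, and it assumes that GCC's ARM back end defines __ARM_PCS_VFP for -mfloat-abi=hard, __SOFTFP__ for -mfloat-abi=soft, and __ARM_NEON__ (or __ARM_NEON in newer releases) only when NEON instructions are permitted.

```c
/* Hypothetical helper: names the float ABI the compiler was targeting.
 * Assumes GCC's usual predefined macros for 32-bit ARM targets. */
const char *float_abi_name(void)
{
#if defined(__ARM_PCS_VFP)
    return "hard";      /* FP arguments are passed in VFP/NEON registers */
#elif defined(__SOFTFP__)
    return "soft";      /* no FP instructions, library calls only */
#elif defined(__arm__)
    return "softfp";    /* FP instructions used, soft-float calling convention */
#else
    return "not a 32-bit ARM target";
#endif
}

/* Hypothetical helper: 1 if the compiler has been told it may use NEON. */
int compiled_with_neon(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}
```

Building this with -mfpu=neon but -mfloat-abi=soft should report "soft" and no NEON, which matches the point above that the hardware is only used when -mfloat-abi says it is safe.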

Vectorizing floating-point operations

The NEON architecture contains instructions that operate on both integer and floating-point data types and GCC now has powerful auto-vectorizing optimizations to spot when it is appropriate to use the vector engine to improve performance. What surprises many users, however, is that the compiler fails to vectorize their code, even when they might expect this to be done.

The first thing to remember is that the auto-vectorizer is only enabled by default at -O3. There are options to turn it on at other times and you can find these in the GCC manual.
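To make that concrete, here is a small sketch of the kind of integer loop the auto-vectorizer is generally able to handle. The function name is invented for this example, and the use of restrict is an assumption about your code: it is a promise to the compiler that the arrays do not overlap, which removes one common reason for it to give up.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise addition of two arrays of 32-bit integers.
 * 'restrict' tells the compiler that dst, a and b never alias. */
void add_arrays(int32_t *restrict dst, const int32_t *restrict a,
                const int32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

At -O3 with a NEON-capable -mfpu setting this loop is a candidate for vectorization; at lower optimization levels the GCC option -ftree-vectorize turns the vectorizer on explicitly.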

However, even with the vectorizer enabled, floating-point code is often not vectorized. The reason for this is that although the floating-point operations in NEON use the IEEE single-precision format for holding values, in order to minimize the amount of power needed in the NEON unit and maximize the throughput, the vector engine only complies fully with the standard if the inputs and the results are within the normal operating ranges (that is, the values are not denormal or NaN). GCC's default configuration is to generate code that strictly conforms to the rules for IEEE floating-point arithmetic, and the limitations just described mean that it is not appropriate to use the SIMD instructions by default.

Fortunately, GCC does provide a number of command-line options that can be used to control precisely which level of adherence to the IEEE standard is required. While details are beyond the scope of this discussion, in most cases it will be perfectly safe to use the option -ffast-math to relax the rules and enable vectorization.

Alternatively, the option -Ofast can be used on GCC 4.6 or later to achieve much the same effect. It turns on -O3 along with a number of other optimizations, -ffast-math among them, which should normally be safe, in order to get the best performance out of your code.
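As an illustration of the sort of code this affects, consider a simple dot product over float data. This is only a sketch with an invented function name. Under strict IEEE rules the compiler has to perform the additions in exactly the order written and cannot rely on NEON's handling of denormals, so the loop would normally stay scalar; with -ffast-math (or -Ofast) it becomes a candidate for vectorization.

```c
#include <stddef.h>

/* Dot product of two float arrays. Vectorizing the accumulation
 * reorders the additions, which is only allowed under relaxed
 * floating-point rules such as those enabled by -ffast-math. */
float dot_product(const float *restrict a, const float *restrict b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Because the order of the additions can change, results may differ slightly in the last few bits between the strict and relaxed builds, so check that your application can tolerate this.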

Another thing to remember is that NEON's floating-point vector operations only support single-precision data. Unless your code is written to work with that format, you may find that vectorization does not work. You should also be aware of floating-point constants (literals) that end up forcing the compiler to perform a calculation in double precision. In C and C++, write '1.0F' not '1.0' to ensure that the compiler knows what you mean.
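The difference is easy to miss in source code, so here is a short sketch of both forms side by side. The function names are invented for the example, and 0.1 is used because it cannot be represented exactly in single precision, which means the compiler cannot quietly narrow the double-precision multiply.

```c
#include <stddef.h>

/* Single precision throughout: 0.1f is a float constant, so the
 * multiplication is performed in 'float'. */
void scale_single(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] = data[i] * 0.1f;
}

/* 0.1 is a double constant: each element is converted to double,
 * multiplied in double precision, then converted back on the store. */
void scale_via_double(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] = data[i] * 0.1;
}
```

The two functions look almost identical, but only the first keeps the whole calculation in a format that NEON can operate on.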

Finally, if you're still having problems working out why the vectorizer is not behaving as you might wish, and you're prepared to get your hands dirty, GCC can provide a wealth of information about what it is doing. The options -fdump-tree-vect and -ftree-vectorizer-verbose=<level> control the amount of information that is generated, where level is a number in the range of 1 to 9. While most of the information produced will only be of interest to compiler developers you may at times find hints in the output as to why your code is not being vectorized as expected.

Putting it all together

So that's a lot of options. What should I use in day-to-day operation? Fortunately, once the target environment is determined, most of the options won't change on a regular basis. Here are a few examples:

A Cortex-A15 processor with NEON and some floating point code that manipulates arrays of data using 'float' data types. The operating environment can support passing parameters in floating-point registers:

arm-gcc -O3 -mcpu=cortex-a15 -mfpu=neon-vfpv4 -mfloat-abi=hard \
    -ffast-math -o myprog.exe myprog.c

A Cortex-A7 processor without NEON, processing floating-point code. The operating environment only supports passing arguments in integer registers, but can support use of the floating-point hardware:

arm-gcc -O3 -mcpu=cortex-a7 -mfpu=vfpv4-d16 -mfloat-abi=softfp \
    -o myprog2.exe myprog2.c

Finally, a Cortex-A9 processor operating in an environment where the floating-point/NEON register set cannot be used at all (for example, because the code runs in the middle of an interrupt handler and the floating-point context is reserved for user state):

arm-gcc -O3 -mcpu=cortex-a9 -mfloat-abi=soft -c -o myfile.o myfile.c



Richard Earnshaw, Principal Engineer, ARM. Richard is the overall technical lead and software architect in ARM's GNU compiler tools team. Among other tasks, he has worked on and with compilers for nearly 20 years and is one of the GCC Global Reviewers.

  • Great info, Richard. I find the following a little ambiguous, though:  "VFPv3 and VFPv4 implementations start with 32 double-precision registers, however, when NEON is not present, the top 16 registers become optional; this is controlled by the d16 component of the option name."  Does d16 mean that the optional registers are present or that they have been removed?


    Good point.  The options describe what you have.  So,

    -mfpu=vfpv4 implies that there are 32 double precision registers, d0...d31.

    -mfpu=vfpv4-d16 implies that there are 16 double precision registers, d0...d15.

    Of course, in both cases you have 32 single precision registers, which overlap d0...d15.
  • FYI, http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0425/ch04s09s03.html

    • -mfpu=vfpv3 or -mfpu=vfpv3-d16 (for Cortex-A8 and Cortex-A9 processors).
    • -mfpu=vfpv4 or -mfpu=vfpv4-d16 (for Cortex-A5 and Cortex-A15 processors).