ARM DS-5 version 5.26 includes both ARM Compiler 5.06u4 and ARM Compiler 6.6. ARM Compiler 6 includes full support for the ARMv6-M and the ARMv7-A/R/M architectures. This means users have the option to continue with ARM Compiler 5 or migrate to ARM Compiler 6. Many ARM Compiler users are creating bare-metal software and are sensitive to both performance and code size. Recent results with ARM Compiler 6.6 have demonstrated significant improvement, and now is a good time to investigate migrating to ARM Compiler 6.
The ARM Compiler Migration and Compatibility Guide compares the command line options, source code differences, assembly syntax, and other topics of interest to ease the work of porting to ARM Compiler 6. Once initial porting is done, users are interested in the details of the resulting code size and performance. This is the key driver in deciding when it makes sense to move to ARM Compiler 6.
One way to compare the different compilers and the various options of each compiler is to do some benchmarking with ARM Cycle Models. The ARM Compiler team has done a great job at optimizing ARM Compiler 6 for popular real-world workloads. By doing this, as opposed to focusing on a single artificial benchmark, they have delivered consistent improvements across a wide range of applications and commercial benchmarks such as RTX, CoreMark Pro and other benchmarks from EEMBC. But how can users find out how the compiler will perform with their code?
ARM Cycle Models provide cycle accurate models of ARM CPUs and other system IP such as interconnect, interrupt controllers, memory controllers, and other peripherals. Traditionally, Cycle Models are used in the hardware design process to compare and select the best fit CPU or the optimal configuration options for CPU, interconnect, and memory. Cycle Models enable hardware performance analysis and profiling to help guide architects and hardware designers to the best CPU subsystem for the project. Often benchmarks are used in the design process to enable system performance analysis.
Software and firmware engineers are typically given a fixed hardware platform and asked to optimize system performance by modifying software and utilizing the compiler to generate the best fit for code size and performance. The performance vs. code size trade-off is closely monitored throughout the software development process.
It may not be possible to run the entire firmware suite from an embedded product on models due to the close interaction between firmware, custom hardware, and the external environment. However, it is typically possible to take a subset of the firmware which can run on a simplified system without all of the custom hardware and peripherals and use it for compiler analysis.
To illustrate the process of moving to ARM Compiler 6 and using ARM Cycle Models to benchmark performance differences, this article uses an example bare-metal project for the Cortex-R8 processor. The software comes with the Cortex-R8 CPAK from ARM System Exchange. Cycle Model Performance Analysis Kits (CPAKs) are example systems which are easy to download and run. The CPAK provides example systems and models to run bare-metal benchmarks, and additional software can easily be added to the code.
For users interested in migrating from ARM Compiler 5, the first step is to successfully compile with ARM Compiler 6. This generally takes a combination of Makefile changes to invoke the new compiler as well as some source code adaptations. The described process is representative of what any team must do to migrate to ARM Compiler 6.
First, switch the compiler binary from armcc (ARM Compiler 5) to armclang (ARM Compiler 6). Other tools like armasm and armlink can still be used.
A few compiler command line option changes will also be required as shown in the table below.
ARM Compiler 5       | ARM Compiler 6
armcc                | armclang
--cpu=Cortex-R8      | --target=arm-arm-none-eabi -mcpu=Cortex-R8
--fpu=VFPv3          | -mfpu=vfpv3-d16-fp16
-Ospace              | -Os / -Oz
-Onum (default is 2) | -Onum (default is 0)
More details are available in the migration guide related to specific switches, but these are the basics to get started. Some compiler switches may need to be removed because they are specific to armcc. For example, --apcs /interwork and --no_inline are not needed with armclang and can be removed.
One thing to note on the changes for Cortex-R8 is related to the FPU. Because the default number of double precision registers in VFPv3 is 32, ARM Compiler 6 expects explicit command-line configuration for Cortex-R8’s 16 registers. This distinction was not necessary for ARM Compiler 5.
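The option mapping above can be captured in a small lookup table, which is handy when a build script has to translate an existing armcc command line. The following is only a sketch: the struct, the array, and the function name translate_armcc_option are hypothetical, and the rows are limited to the options mentioned in this article.

```c
#include <stddef.h>
#include <string.h>

/* One row of the option table: an armcc option and its armclang counterpart. */
struct option_map {
    const char *armcc;
    const char *armclang;
};

/* Rows taken from this article only; this is not an exhaustive list. */
static const struct option_map cortex_r8_options[] = {
    { "--cpu=Cortex-R8",   "--target=arm-arm-none-eabi -mcpu=Cortex-R8" },
    { "--fpu=VFPv3",       "-mfpu=vfpv3-d16-fp16" },
    { "-Ospace",           "-Os" },  /* -Oz is the alternative size option */
    { "--apcs /interwork", "" },     /* armcc-specific: drop it */
    { "--no_inline",       "" },     /* armcc-specific: drop it */
};

/* Returns the armclang replacement, "" if the option should simply be
 * dropped, or NULL if the option is not in the table and has to be looked
 * up in the migration guide. */
const char *translate_armcc_option(const char *opt)
{
    size_t n = sizeof cortex_r8_options / sizeof cortex_r8_options[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(opt, cortex_r8_options[i].armcc) == 0) {
            return cortex_r8_options[i].armclang;
        }
    }
    return NULL;
}
```

Because the default optimization level differs (2 for armcc, 0 for armclang), a Makefile that relied on the armcc default should pass -O2 explicitly when a like-for-like comparison is wanted.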
Once the basic Makefile is modified to invoke armclang, some source code changes might be required to make use of the keywords offered by ARM Compiler 6.
A summary of the changes needed to migrate the CPAK example software is covered below.
A substantial difference between ARM Compiler 5 and ARM Compiler 6 is the syntax used for assembly language embedded in C. ARM Compiler 6 uses a syntax compatible with the popular GAS (GNU Assembler), which helps when porting code written for other compilers.
For example, the following code with armcc:
__asm { MRC p15, 0, temp, c1, c0, 0 }; /* Read control reg */
is converted to GAS syntax for armclang:
__asm ( "mrc p15, 0, %0, c1, c0, 0" : "=r"(temp) ); /* Read control reg */
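To show how the operand constraints work in both directions, here is a minimal sketch of a read-modify-write on the same control register. The function names are hypothetical, and the host branch is an assumption added only so the file can be built and unit-tested off-target; it is not part of the CPAK software.

```c
#include <stdint.h>

#if defined(__arm__)
/* Target build: real CP15 accesses using GNU-style inline assembly. */
static inline uint32_t read_control_reg(void)
{
    uint32_t value;
    /* "=r"(value): the compiler picks an output register and binds it to %0 */
    __asm volatile ("mrc p15, 0, %0, c1, c0, 0" : "=r"(value));
    return value;
}

static inline void write_control_reg(uint32_t value)
{
    /* "r"(value): input operand bound to %0; the "memory" clobber stops the
     * compiler from moving memory accesses across the write */
    __asm volatile ("mcr p15, 0, %0, c1, c0, 0" : : "r"(value) : "memory");
}
#else
/* Host build (assumption, for off-target tests only): a plain variable. */
static uint32_t fake_control_reg;
static inline uint32_t read_control_reg(void) { return fake_control_reg; }
static inline void write_control_reg(uint32_t value) { fake_control_reg = value; }
#endif

/* Read-modify-write helper: set the given bits in the control register. */
void control_reg_set_bits(uint32_t mask)
{
    write_control_reg(read_control_reg() | mask);
}
```

On real hardware a write to a system control register is normally followed by an instruction barrier; that step is left out here to keep the sketch focused on the syntax.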
Pragmas are also commonly found in code for ARM Compiler 5, such as this one to disable semi-hosting.
#pragma import(__use_no_semihosting)
For ARM Compiler 6 this should be changed to:
asm(" .global __use_no_semihosting\n");
Another common scenario is the use of pragmas to temporarily suppress warnings. These pragmas can be removed for ARM Compiler 6.
#pragma diag_remark 236 /* Suppress "controlling expression is constant" warning */
if (secondary_cpu_initialization)
#pragma diag_default 236 /* Allow warning again */
Some keywords specific to ARM Compiler 5 are supported by specific attributes with ARM Compiler 6. For example, the keyword __weak has an equivalent attribute with ARM Compiler 6.
In ARM Compiler 5:
EXTERN __weak secondary_cpu_initialization(UWORD32 cpu);
Updated for ARM Compiler 6:
EXTERN __attribute__((weak)) secondary_cpu_initialization(UWORD32 cpu);
A similar pattern can be seen with __inline in ARM Compiler 5, which is converted to __attribute__((always_inline)) in ARM Compiler 6.
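When the same source has to build with both compilers during the migration, the keyword differences can be hidden behind macros. This is a sketch, not the CPAK code: the PORT_ macro names and the two functions are hypothetical, and it assumes that __CC_ARM is defined by armcc but not by armclang.

```c
#if defined(__CC_ARM)
/* ARM Compiler 5 (armcc) keywords */
#define PORT_WEAK          __weak
#define PORT_ALWAYS_INLINE static __inline
#else
/* ARM Compiler 6 (armclang) and other GNU-compatible compilers */
#define PORT_WEAK          __attribute__((weak))
#define PORT_ALWAYS_INLINE static inline __attribute__((always_inline))
#endif

/* Weak default: a strong definition in another file replaces it at link time. */
PORT_WEAK int board_init_hook(unsigned int cpu)
{
    (void)cpu;
    return 0; /* nothing to do by default */
}

/* Small helper that should be inlined under either compiler. */
PORT_ALWAYS_INLINE unsigned int cpu_bit(unsigned int cpu)
{
    return 1u << cpu;
}
```

Keeping the compiler test in one header means the rest of the source needs no conditional code, and the armcc branch can be deleted once the migration is complete.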
ARM Compiler 5 has intrinsics such as __memory_changed() and __wfi() to simplify ordering of memory accesses and the wait for interrupt instruction. For these, users can make their own functions with equivalent behavior or directly write assembly instructions.
Over time more and more intrinsics have been added to ARM Compiler 6. Version 6.5 added __memory_changed() in the include file arm_compat.h. The compatibility guide has a table summarizing which intrinsics are supported by ARM Compiler 6.
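For compiler versions where an intrinsic is not available, a hand-written replacement might look like the sketch below. The port_ names are hypothetical, the empty asm statement with a "memory" clobber is assumed to be an adequate stand-in for __memory_changed() as a compiler-level barrier, and the host branch of port_wfi() does nothing and exists only so the file builds off-target.

```c
/* Compiler-level barrier: tells the compiler that memory may have changed,
 * so values cached in registers are reloaded. Emits no instruction. */
static inline void port_memory_changed(void)
{
    __asm volatile ("" : : : "memory");
}

/* Wait for interrupt. Off-target (assumption) this is a no-op. */
static inline void port_wfi(void)
{
#if defined(__arm__)
    __asm volatile ("wfi");
#endif
}

/* Example use: sleep until a flag set by an interrupt handler becomes
 * non-zero, giving up after max_spins wake-ups. Returns the number of
 * wake-ups that were needed. */
int spin_until_set(const int *flag, int max_spins)
{
    int spins = 0;
    while (*flag == 0 && spins < max_spins) {
        port_wfi();
        port_memory_changed(); /* force *flag to be read again */
        spins++;
    }
    return spins;
}
```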
These are just some of the most common examples encountered when porting software from ARM Compiler 5 to ARM Compiler 6. Users will need to work through these to migrate code and compare performance.
The first thing to compare is the generated code size for various optimization values.
The table below shows the executable file size for various optimization levels using ARM Compiler 5 and ARM Compiler 6 included in DS-5 v5.26 for one of the .axf files from the Cortex-R8 CPAK.
This is for illustrative purposes only and is not a recommendation on which optimization is best, but it does show how different file size can be for different compilers and different switches.
Optimization | ARM Compiler 5 axf file size | ARM Compiler 6 axf file size
-O0          | 189280                       | 169520
-O2          | 183224                       | 162940
-O3          | 182352                       | 162772
-Os          |                              | 157924
-Oz          |                              | 154572
-Omax        |                              | 146152
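To put the file sizes in relative terms, a small helper can compute the percentage reduction between two images. This is a sketch with a hypothetical function name, and it assumes that the first value in each row of the table belongs to ARM Compiler 5 and the second (or only) value to ARM Compiler 6.

```c
/* Percentage by which new_size is smaller than baseline_size.
 * A negative result means the new image is larger.
 * baseline_size must be non-zero. */
double size_reduction_percent(unsigned long baseline_size, unsigned long new_size)
{
    return 100.0 * ((double)baseline_size - (double)new_size)
                 / (double)baseline_size;
}
```

Under that assumption, size_reduction_percent(182352, 146152) compares ARM Compiler 5 at -O3 with ARM Compiler 6 at -Omax and comes out just under 20 percent, while the two -O0 images (189280 and 169520) differ by a little over 10 percent.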
Of course, selecting compiler options is normally a balance of code size and performance. Comparing the file size of the .axf file is easy but comparing performance is a bit more difficult.
Performance can be compared using ARM Cycle Models. The general technique is to run to a breakpoint that marks the start of the code of interest and enable profiling there. Then continue execution to a second breakpoint that marks the end of the code of interest. At the second breakpoint, turn off profiling and study the results.
All of this can be scripted and automated if needed to gather results using both ARM Compiler 5 and ARM Compiler 6 as well as with different optimization switches.
To start things off, build the software with ARM Compiler 5 and with ARM Compiler 6. Run the Cortex-R8 CPAK and use a software debugger to set a breakpoint on the interesting starting point. Cycle Models allow connections from any software which supports CADI including modeldebugger and DS-5.
Once the first breakpoint is reached, enable profiling for the CPU and Software. The Profile Manager is shown below.
Software Profiling is at the bottom of the list and will provide information about how much time is spent in each C function and other useful software information.
Now, run the simulation to the second breakpoint, the end of the interesting section to profile.
The result is a profiling database with all PMU events and bus transactions which can be analyzed. The software profiling information is also included. All profiling information is obtained without writing any software and requires no instrumentation of the software being profiled.
The procedure can be repeated for each software image to compare ARM Compiler 5 and ARM Compiler 6 as well as with different compiler switches.
For illustrative purposes, here is a summary of using both ARM Compiler 5 and ARM Compiler 6 to profile two different sections of code. The table below shows a comparison for running the same code with ARM Compiler 5 with -O3 vs. ARM Compiler 6 with -Omax.
ARM Compiler 5 with -O3
ARM Compiler 6 with -Omax
Cycles
Instructions
Code Sample 1
542,113
569,957
306,952
303,930
Code Sample 2
15,521,551
9,438,679
9,608,792
5,867,342
The graphs below show the PMU events from the Instruction group for ARM Compiler 5 and then for ARM Compiler 6. The cycles are on the x-axis, which shows the difference in simulation time for the same section of code, and PMU event 0x8 shows the instruction count for each case.
To check the difference between compiler switches of ARM Compiler 6, Code Sample 1 was compiled with -O1, the recommended setting for debugging, and compared to the previous run with -Omax.
-O1 for debugging
1,395,054
1,259,371
Clearly, compiler choices impact the number of cycles required to run code, but Cycle Models make it easy to quantify the differences and try out many combinations in a short time.
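Once cycle and instruction counts have been collected for each image, two derived figures make the runs easier to compare: cycles per instruction and the speedup relative to a baseline. The helpers below are a sketch with hypothetical names; they take whatever counts the profiling run reports and make no assumption about which compiler produced them.

```c
/* Average cycles spent per retired instruction. instructions must be non-zero. */
double cycles_per_instruction(unsigned long long cycles,
                              unsigned long long instructions)
{
    return (double)cycles / (double)instructions;
}

/* How many times faster the new run is than the baseline.
 * Values above 1.0 mean the new image needs fewer cycles.
 * new_cycles must be non-zero. */
double speedup(unsigned long long baseline_cycles,
               unsigned long long new_cycles)
{
    return (double)baseline_cycles / (double)new_cycles;
}
```

Looking at both figures together helps separate the two effects an optimization level can have: executing fewer instructions, and executing the same instructions in fewer cycles.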
ARM Cycle Models can be used to compare optimization switches and understand compiler choices. Example software from the Cortex-R8 CPAK was used to highlight the conversion from ARM Compiler 5 to ARM Compiler 6. Performance analysis features provided by Cycle Models make it easy to understand the impact of a compiler on software performance. When I started experimenting, I was expecting some small differences between compilers and optimization levels, but found out that performance can be very different just by changing optimization levels. To understand the code being generated, ARM Models might be a good way to get started. This article demonstrated Cycle Models, but ARM Fast Models can also identify instruction count differences and can run longer tests by sacrificing cycle accuracy.
ARM Developer is a great resource for more information on Cycle Models and Fast Models as well as the place to find out more about ARM Compiler 6.