Code size is always a popular topic in embedded software because most embedded systems face memory constraints. Compiler-related meetings often discuss code size as a key concern. Arm has been migrating from Arm Compiler 5 to Arm Compiler 6 and improving the quality of results of Arm Compiler 6 for all CPUs. As a result of these efforts, Arm now recommends Arm Compiler 6 for new projects, but understands that migration from Arm Compiler 5 to Arm Compiler 6 may take some time, so both continue to be available. Some projects also use the GNU Arm embedded compiler (gcc) for Arm CPUs, and there are additional commercial compilers available from the Arm ecosystem.
Many partners are interested in finding out which compiler is the best fit and in understanding how Arm compilers compare to GNU gcc. Other compiler providers likely get similar questions from their users.
One way to investigate compiler differences is using Arm Models. Both Arm Fast Models and Arm Cycle Models play a role in analyzing compiler differences. Last year, I wrote an article with some basics about using Arm Cycle Models to compare Arm Compiler 5 to Arm Compiler 6 for the Cortex-R8. The article had some background about migrating to Arm Compiler 6 and some information about how to use models to compare performance, but did not spend much time on code size.
Code size is one of the biggest factors in evaluating compilers for embedded projects. The good news is that it’s relatively easy to compare code size. In fact, it’s not even necessary to run the code to compare the results from different compilers or from changing compiler switches and optimization levels (as long as optimizations don’t break the code).
Arm Compilers provide a utility called "fromelf" which reports the size of the code in an executable (an ELF or AXF file).
$ fromelf -z build-gcc/sort/sort.axf
========================================================================
** Object/Image Component Sizes

      Code (inc. data)   RO Data    RW Data    ZI Data      Debug   Object Name
     47696       1488       2840       2776          0          0   ROM Totals for build-gcc/sort/sort.axf

$ fromelf -z build-armclang/sort/sort.axf
========================================================================
** Object/Image Component Sizes

      Code (inc. data)   RO Data    RW Data    ZI Data      Debug   Object Name
      7700        640        448         20          0          0   ROM Totals for build-armclang/sort/sort.axf
The last line has the ROM totals for the axf file. Some people recommend using just the Code column, while others recommend summing Code (including data), RO Data, and RW Data as the "code size". Sometimes compiler information will provide ROM and RAM usage for comparison.
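When comparing many builds, the ROM Totals line can be parsed with a short filter rather than read by eye. Below is a minimal sketch, assuming the column layout shown above (Code, inc. data, RO Data, RW Data, ZI Data, Debug), that sums Code + RO Data + RW Data. The function name is my own; in a real flow you would pipe `fromelf -z` output into it.

```shell
# Minimal sketch: sum the Code, RO Data, and RW Data columns of a
# fromelf -z "ROM Totals" line (column positions assumed from the
# output shown above). In a real flow:
#   fromelf -z build-gcc/sort/sort.axf | sum_rom_totals
sum_rom_totals() {
    awk '/ROM Totals/ { print $1 + $3 + $4 }'
}

# Demonstrate with the gcc ROM Totals line captured earlier:
sum_rom_totals <<'EOF'
     47696       1488       2840       2776          0          0   ROM Totals for build-gcc/sort/sort.axf
EOF
```

This prints 53312 for the gcc build, making it easy to diff totals across compilers or optimization levels in a script.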
I did some trials targeting the Arm Cortex-M0+ to compare Arm Compilers to GNU Arm embedded gcc. What I found is that the code sizes are drastically different between Arm Compiler 6 and GNU gcc. In fact, they are so different it almost seems like something could be wrong.
Below is a table with results for the EEMBC automotive test suite and other common programs:
Software program    GCC code size    Arm Compiler 6 code size
EEMBC auto          80,874           37,115
Dhrystone           45,480            7,908
Sort                54,800            8,808
Coremark            53,640           20,188
Options used for gcc with standard libraries
-mcpu=cortex-m0plus -mthumb -DCLOCK_RATE -DCYCLE_SHIFT=6 -DNDEBUG -fomit-frame-pointer -fno-common -O3 -DFLAGS_STR=-fomit-frame-pointer -fno-common -O3 -g
Options used for Arm Compiler 6 with standard libraries
--target=arm-arm-none-eabi -march=armv6-m -mcpu=cortex-m0plus -mthumb -DCLOCK_RATE -DCYCLE_SHIFT=6 -DNDEBUG -fomit-frame-pointer -fno-common -Omax -DFLAGS_STR=-fomit-frame-pointer -fno-common -Omax -g
--lto is also used for link time optimization
For memory constrained systems, compilers also offer options to use libraries that are optimized for size. For gcc, this is called “newlib-nano” and for Arm Compiler it is called “microlib”.
Code sizes for the same programs compiled with the size-optimized libraries are below:

Software program    GCC code size    Arm Compiler 6 code size
EEMBC auto          39,498           22,483
Dhrystone           12,590            4,404
Sort                14,044            3,412
Coremark            19,928           10,822
Options used for gcc with newlib-nano
-mcpu=cortex-m0plus -mthumb -DCLOCK_RATE -DCYCLE_SHIFT=6 -DNDEBUG -fomit-frame-pointer -fno-common -Os -DFLAGS_STR=-fomit-frame-pointer -fno-common -Os -g
Use -specs=nano.specs in the linker flags
Options used for Arm Compiler 6 with microlib
--target=arm-arm-none-eabi -march=armv6-m -mcpu=cortex-m0plus -mthumb -DUSE_MICROLIB -fomit-frame-pointer -fno-common -Oz -DCLOCK_RATE -DCYCLE_SHIFT=6 -DNDEBUG -DFLAGS_STR=-fomit-frame-pointer -fno-common -Oz -g
Use --library_type=microlib in the linker flags and --lto for link time optimization
The values are generally in line with the summary of Arm Compiler 6 on Arm Developer.
It's clear that Arm Compiler 6 generates smaller code, but just checking code size is not enough without knowing WHY the code is smaller.
There could be several possible reasons, and the rest of this article digs into them.
Without getting too much into performance, I can say that the performance improvements from Arm Compiler 6 are good, but not as drastic as the code size difference indicates. In most cases Arm Compiler 6 provides better performance, but not in every case, and the performance differences tend to be within 20%.
Arm Fast Models can be used to collect more information about code size differences. My first thought was to figure out how much code in an image is actually executed. Of course, there will be non-executed code due to branches in the program, error handling which doesn't normally occur, and more, but it would be interesting to find out if there are other reasons why the code sizes are so different. Visually comparing instructions from different compilers is difficult for anyone who is not a compiler expert, so I decided to use some simple scripts to get more details about where the code size differences come from.
If the executed instructions are compared to the set of all instructions in the image it should highlight where the differences are coming from. Here is my method for analyzing code size:
Arm Fast Models provide a Model Trace Interface (MTI) to trace execution of software and other hardware events. For the Cortex-M0+ trials I used a very simple system to run bare-metal applications, just the CPU and memory. Semihosting was also used to avoid any need for additional hardware modeling. Semihosting is supported by both Arm Compiler 6 and gcc.
Fast Models include a number of example plugins which demonstrate how to use the Model Trace Interface. One of the examples, called SimpleTrace, is almost exactly the plugin needed to trace all of the PC values. It can be found in $PVLIB_HOME/examples/MTI/SimpleTrace.
Building the plugin on Linux is as simple as typing make. I made one change to the source code. The plugin normally prints the PC values to stdout using printf() so I modified the code to open a file using fopen() and then used fprintf() to print the PC values and keep the traced values separate from the other output coming from the model. This makes it easy for scripts to access the traced values.
Loading the plugin is done via the --plugin command-line option.
$ ./Linux64-Debug-GCC-5.4/isim_system -a build-armclang/sort/sort.axf --plugin ~/Arm/FastModelsPortfolio_11.0/examples/MTI/SimpleTrace/SimpleTrace.so --stat
The simulation is run with the .axf file, the SimpleTrace plugin which will write the PC values to a file named pc_trace.out, and --stat which will print out the statistics at the end of the simulation.
Attached TRACE.SimpleTrace to component: m0p.armcortexm0plusct
Ignoring component m0p.armcortexm0plusct.acp_mapper as it does not contain an INST source
Ignoring component m0p.armcortexm0plusct.ext_bus as it does not contain an INST source
Ignoring component m0p.armcortexm0plusct.ext_bus.mapper as it does not contain an INST source
Ignoring component m0p.armcortexm0plusct.l2_flusher as it does not contain an INST source
Ignoring component m0p.ramdevice as it does not contain an INST source
Ignoring component m0p.ramdevice.bus_slave as it does not contain an INST source
Ignoring component m0p.ramdevice1 as it does not contain an INST source
Ignoring component m0p.ramdevice1.bus_slave as it does not contain an INST source
Cortex-M0+ bare-metal startup, flags: -fomit-frame-pointer -fno-common -Omax
Insertion sort took 7 clock ticks
Shell sort took 2 clock ticks
Quick sort took 1 clock ticks
All done!
Info: /OSCI/SystemC: Simulation stopped by user.

--- m0p statistics: -----------------------------------------------------------
Simulated time          : 1085525.000000s
User time               : 0.160000s
System time             : 0.004000s
Wall time               : 0.167770s
Performance index       : 6470316.50
m0p.armcortexm0plusct   : 6.62 MIPS ( 1085530 Inst)
--------------------------------------------------------------------------------
The instructions executed are printed at the end of the log. This will be useful to confirm the instruction coverage is correct.
Generating the disassembly file is done with fromelf. The -z will generate the code size information as described above and the -c will generate the disassembly.
$ fromelf -cz build-armclang/sort/sort.axf > build-armclang/sort/sizes.txt
With the disassembly file and the PC trace file, now it's just a matter of annotating the disassembly to identify which instructions were actually executed and how many times they were executed. This can be done with a simple bash script. The script works by reading the PC trace into an associative array. Then it loops through the disassembly file to identify lines which contain actual instructions. I filtered out the DCD and DCB lines since they are data and not executed instructions.
#!/bin/bash

if [ "$#" -ne 2 ]; then
    echo "Please pass the directory and the name of the test as arguments"
    exit 1
fi

AXFFILE=$1/$2.axf
# the DISFILE contains the fromelf output, disassembly and size info
DISFILE=$1/sizes.txt
TRACEFILE=$1/pc_trace.out
OUTFILE=$1/$2.cover
MISFILE=$1/$2.miss
MISSRC=$1/$2.miss-src

rm -f $MISFILE

hits=0
misses=0
notcode=0
total_instructions=0

# Read PC trace file into associative array
declare -A PCTRACE
while read pc; do
    if [ ${PCTRACE[$pc]+_} ]
    then
        tmp=${PCTRACE[$pc]}
        let "tmp = tmp + 1"
        PCTRACE[$pc]=$tmp
    else
        PCTRACE[$pc]=1
    fi
done < $TRACEFILE

# Loop through the disassembly file and check how many times each instruction
# was executed by looking up the PC in the associative array
while read value; do
    vals=( $value )
    S1=${vals[0]}
    INST=${vals[3]}
    # sometimes there are spaces in the DCD data so check the next token also
    ALTINST=${vals[4]}
    if [[ ${S1:0:2} = 0x && ${S1:10} = : && $INST != "DCD" && $ALTINST != "DCD" && $INST != "DCB" && $INST != "DCW" ]]
    then
        pc=${S1:0:10}
        cnt=${PCTRACE[$pc]}
        if [ ! -z $cnt ]
        then
            echo "H $cnt $value"
            let total_instructions+=$cnt
            let hits++
        else
            echo "M $value"
            echo ${S1:0:10} >> $MISFILE
            let misses++
        fi
    else
        echo "$value"
        let notcode++
    fi
done < $DISFILE > $OUTFILE

let total=$hits+$misses
rate=`echo $hits/$total*100 | bc -l`
echo "Results: hits $hits misses $misses total instructions $total_instructions rate $rate"

arm-none-eabi-addr2line -e $AXFFILE @$MISFILE > $MISSRC.tmp
sort -u $MISSRC.tmp > $MISSRC
rm -f $MISSRC.tmp
The script produces a number of useful outputs that make it much easier to figure out where the differences in code size come from.
One result is an annotated version of the disassembly file created by fromelf. It has H (hit) or M (miss) at the start of each line that is an instruction and a count of how many times the instruction was executed. Below is a section from the sort disassembly file for a memory copy function.
    __aeabi_memcpy4
    __aeabi_memcpy8
H 3   0x00000174: b570  p.  PUSH  {r4-r6,lr}
H 3   0x00000176: 4605  .F  MOV   r5,r0
H 3   0x00000178: 460c  .F  MOV   r4,r1
H 3   0x0000017a: 4616  .F  MOV   r6,r2
H 3   0x0000017c: e002  ..  B     0x184 ; __aeabi_memcpy4 + 16
H 150 0x0000017e: cc0f  ..  LDM   r4!,{r0-r3}
H 150 0x00000180: c50f  ..  STM   r5!,{r0-r3}
H 150 0x00000182: 3e10  .>  SUBS  r6,r6,#0x10
H 153 0x00000184: 2e10  ..  CMP   r6,#0x10
H 153 0x00000186: d2fa  ..  BCS   0x17e ; __aeabi_memcpy4 + 10
H 3   0x00000188: 2e08  ..  CMP   r6,#8
H 3   0x0000018a: d302  ..  BCC   0x192 ; __aeabi_memcpy4 + 30
M     0x0000018c: cc03  ..  LDM   r4!,{r0,r1}
M     0x0000018e: c503  ..  STM   r5!,{r0,r1}
M     0x00000190: 3e08  .>  SUBS  r6,r6,#8
H 3   0x00000192: 2e04  ..  CMP   r6,#4
H 3   0x00000194: d307  ..  BCC   0x1a6 ; __aeabi_memcpy4 + 50
M     0x00000196: cc01  ..  LDM   r4!,{r0}
M     0x00000198: c501  ..  STM   r5!,{r0}
M     0x0000019a: 1f36  6.  SUBS  r6,r6,#4
M     0x0000019c: e003  ..  B     0x1a6 ; __aeabi_memcpy4 + 50
M     0x0000019e: 7821  !x  LDRB  r1,[r4,#0]
M     0x000001a0: 7029  )p  STRB  r1,[r5,#0]
M     0x000001a2: 1c64  d.  ADDS  r4,r4,#1
M     0x000001a4: 1c6d  m.  ADDS  r5,r5,#1
H 3   0x000001a6: 1e76  v.  SUBS  r6,r6,#1
H 3   0x000001a8: d2f9  ..  BCS   0x19e ; __aeabi_memcpy4 + 42
H 3   0x000001aa: bd70  p.  POP   {r4-r6,pc}
To see how much code is actually used, I summarized the instructions hit. The instructions run are the total number of instructions executed. The instructions hit are the number of unique instructions in the annotated disassembly file that were executed at least once. The instructions missed are instructions in the disassembly file that were never executed. The percent of code used is simply instructions hit / (instructions hit + instructions missed).
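The percent-of-code-used calculation can be reproduced directly from the annotated file. Here is a sketch assuming the H/M line prefixes produced by the script above; the sample input is a few lines of the annotated memory copy disassembly.

```shell
# Sketch: compute percent of code used from an annotated .cover file,
# where instruction lines start with "H" (hit) or "M" (miss) as
# produced by the script above.
coverage_rate() {
    awk '/^H /{ h++ } /^M /{ m++ } END { printf "%.1f\n", 100 * h / (h + m) }'
}

# Demonstrate with a few lines from the annotated disassembly
# (3 hits, 1 miss gives 75.0 percent):
coverage_rate <<'EOF'
H 3 0x00000174: b570 PUSH {r4-r6,lr}
H 3 0x00000176: 4605 MOV r5,r0
H 150 0x0000017e: cc0f LDM r4!,{r0-r3}
M 0x0000018c: cc03 LDM r4!,{r0,r1}
EOF
```

In practice the input would be the full .cover file, e.g. `coverage_rate < build-armclang/sort/sort.cover`.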
There is clearly more unused code in the images produced by gcc. For applications which have memory constraints, especially for Cortex-M or Cortex-R, Arm Compiler 6 produces smaller code. Another observation is the code generated with Arm Compiler 6 microlib is extremely small. Far fewer unique instructions are used to perform the same functionality. The newlib-nano and microlib options create images with much higher percent of code executed, but the instruction count to complete the functionality is higher and performance is lower for some applications. To get actual performance information the code can be run on Arm Cycle Models.
Using the coverage information, the unexecuted instructions can be examined in more detail. There are a few ways to do this.
The addr2line utility can be run on any address, but only works well on the addresses for which the source code exists. This means the library references don't generate anything useful. The good news is this means it's easy to find unused code which is from the application source. The bad news is it doesn't provide much insight into library code.
After inspecting the differences using these techniques, it's clear the libraries are the largest source of unexecuted code. Arm Compiler 6 seems to include much less code from library functions, while gcc seems to have large chunks of library code which are not used. Another source of unused code is unused functions in the application. Arm Compiler 6 does a better job of removing code that is not used.
As application size increases, the impact of the unused code may shrink. An application with hundreds of KB of code may be more interesting to evaluate, but many microcontrollers have only 256 KB to work with.
Another place to compare compilers is on individual functions. To do this I took 2 small to medium functions from the sort example provided in DS-5, insert_sort() and shell_sort(), and marked them with __attribute__((noinline)) so they would be preserved by the compilers. Using the same coverage information, the instruction count with each compiler is shown below.
function        Arm Compiler 6 Instructions    GNU gcc Instructions    Instructions Used
insert_sort()   35                             37
shell_sort()    75                             78                      77
Arm Fast Model plugins can also be used to trace function start and end points and record the number of instructions in the function. The plugin at $PVLIB_HOME/examples/MTI/SimpleTrace is a good starting point as it maps the software function names to PC values. I modified this plugin to add a print of the instruction count to build a solution to measure instruction counts for a function. Using breakpoints in a software debugger can also be used to start and stop and count instructions.
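An alternative to modifying the plugin is to post-process pc_trace.out: count how many traced PC values fall inside a function's address range, taking the start and end addresses from the disassembly. This is a sketch using my own helper name; the bounds shown are the __aeabi_memcpy4 addresses from the earlier listing, and the sample trace values are made up for illustration.

```shell
# Sketch: count how many executed (traced) instructions fall inside a
# function's address range, using the one-PC-per-line pc_trace.out
# format described above. Start/end addresses come from the disassembly.
count_in_range() {   # usage: count_in_range START END < pc_trace.out
    local lo=$(( $1 )) hi=$(( $2 )) n=0 pc
    while read -r pc; do
        pc=$(( pc ))   # bash arithmetic converts 0x-prefixed hex
        if [ "$pc" -ge "$lo" ] && [ "$pc" -le "$hi" ]; then
            n=$(( n + 1 ))
        fi
    done
    echo "$n"
}

# Demonstrate with three traced PC values, two inside the range:
count_in_range 0x00000174 0x000001aa <<'EOF'
0x00000174
0x00000176
0x000001b0
EOF
```

In a real run the input would be the whole trace, e.g. `count_in_range 0x174 0x1aa < pc_trace.out`.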
function                          Total Instructions    Self Instructions
insert_sort() (GNU gcc)           506,647               97,307
shell_sort() (GNU gcc)            479,661               98,992
insert_sort() (Arm Compiler 6)    124,535               39,705
shell_sort() (Arm Compiler 6)     140,327               46,223
The function insert_sort() uses strcmp for string compare, and shell_sort() also uses strcmp and does some integer math using a library function. This explains why the sum of instructions from the coverage file is less than the total instructions to run the function. This also helps isolate library differences.
The last check is to use Arm Cycle Models from Arm IP Exchange and the software profiling feature of SoC Designer to measure the cycles used to run each sort function. Using cycle counts is far more accurate than just instruction count and will give a better indication of performance. The images below show the profiling information for insert_sort() and shell_sort() for each compiler.
The results show the cycle counts for the functions and the cycles for each of the sub-functions. Arm Compiler 6 performance is better than gcc, but not as dramatic as the code size comparisons.
There are many ways to evaluate embedded compilers for Arm. Code size is a common metric used to compare compilers and optimization levels. Looking at code size is a start, but most engineers will ask WHY the code size is different because just saying the code is smaller is not very satisfying. Arm Models can be used to investigate code size in more detail. Other solutions may also be able to do the same analysis if trace information from the software can be obtained. When optimizing software for code size and performance there is no silver bullet to automatically create the smallest code with the fastest performance. In fact, there is a wide variety of possible results with larger variances than I expected when I started looking into compiler choices. With the information in this article, embedded software engineers can dig into the details for the specific software they are working with instead of relying on general compiler marketing information to pick the best solution.
Arm Compiler 6 is now recommended for new projects and many projects are now in the process of migrating from Arm Compiler 5 to Arm Compiler 6 and are interested in performance and code size. The linker and libraries provided with Arm compilers differentiate the quality of results from open source compilers making Arm Compiler 6 a strong choice for projects looking for the smallest code size.
Learn more about Arm Compiler 6