More code in less space with Arm Compiler 6

June 28, 2017

Code size is always a popular topic in embedded software because most embedded systems face memory constraints, and it comes up regularly in compiler discussions as a key concern. Arm has been migrating from Arm Compiler 5 to Arm Compiler 6 and improving the quality of results of Arm Compiler 6 for all CPUs. As a result of these efforts, Arm now recommends Arm Compiler 6 for new projects, but understands that migration from Arm Compiler 5 to Arm Compiler 6 may take some time, so both continue to be available. Some projects also use the GNU Arm embedded compiler (gcc) for Arm CPUs, and there are additional commercial compilers available from the Arm ecosystem.

Many partners want to find out which compiler is the best fit and to understand how Arm compilers compare to GNU gcc. Other compiler providers likely get similar questions from users.

One way to investigate compiler differences is using Arm Models. Both Arm Fast Models and Arm Cycle Models play a role in analyzing compiler differences. Last year, I wrote an article with some basics about using Arm Cycle Models to compare Arm Compiler 5 to Arm Compiler 6 for the Cortex-R8. The article had some background about migrating to Arm Compiler 6 and some information about how to use models to compare performance, but did not spend much time on code size.

Comparing Code Size

Code size is one of the biggest factors in evaluating compilers for embedded projects. The good news is that it’s relatively easy to compare code size. In fact, it’s not even necessary to run the code to compare the results from different compilers or from changing compiler switches and optimization levels (as long as optimizations don’t break the code).

Arm Compilers provide a utility called “fromelf” which can report the size of the code in an executable (an ELF file, often given the .axf extension).

$ fromelf -z build-gcc/sort/sort.axf 

========================================================================

** Object/Image Component Sizes

      Code (inc. data)   RO Data    RW Data    ZI Data      Debug   Object Name

     47696       1488       2840       2776          0          0   ROM Totals for build-gcc/sort/sort.axf

$ fromelf -z build-armclang/sort/sort.axf 

========================================================================

** Object/Image Component Sizes

      Code (inc. data)   RO Data    RW Data    ZI Data      Debug   Object Name

      7700        640        448         20          0          0   ROM Totals for build-armclang/sort/sort.axf

The last line has the ROM totals for the axf file. Some people recommend using the Code column and some recommend summing the Code (including data) and the RO Data and RW Data as the “code size”. Sometimes compiler information will provide ROM and RAM usage for comparison.
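As a rough sketch, the four-column sum can be pulled out of the fromelf output with awk. The helper name code_size_from_totals is made up for illustration and is not part of fromelf. For the two ROM Totals lines above, the sum gives 54,800 for gcc and 8,808 for armclang, which are the values listed for Sort in the first table of the next section.

```shell
#!/bin/bash
# Sum the Code, (inc. data), RO Data and RW Data columns from the
# "ROM Totals" line printed by "fromelf -z".
# code_size_from_totals is a hypothetical helper name, not a fromelf feature.
code_size_from_totals() {
    awk '/ROM Totals/ { print $1 + $2 + $3 + $4 }'
}

# Example using the armclang totals line shown above.
echo "      7700        640        448         20          0          0   ROM Totals for build-armclang/sort/sort.axf" |
    code_size_from_totals
```

With a real image the same helper can be fed directly, for example: fromelf -z build-armclang/sort/sort.axf | code_size_from_totals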

Trials targeting the Cortex-M0+

I did some trials targeting the Arm Cortex-M0+ to compare Arm Compilers to GNU Arm embedded gcc. What I found is that the code sizes are drastically different between Arm Compiler 6 and GNU gcc. In fact, they are so different it almost seems like something could be wrong.

Below is a table with results for the EEMBC automotive test suite and other common programs:

Software program	GCC code size	Arm Compiler 6 code size
EEMBC auto	80,874	37,115
Dhrystone	45,480	7,908
Sort	54,800	8,808
Coremark	53,640	20,188

Options used for gcc with standard libraries

-mcpu=cortex-m0plus -mthumb -DCLOCK_RATE -DCYCLE_SHIFT=6 -DNDEBUG -fomit-frame-pointer -fno-common -O3 -DFLAGS_STR=-fomit-frame-pointer -fno-common -O3 -g

Options used for Arm Compiler 6 with standard libraries

--target=arm-arm-none-eabi -march=armv6-m -mcpu=cortex-m0plus -mthumb -DCLOCK_RATE -DCYCLE_SHIFT=6 -DNDEBUG -fomit-frame-pointer -fno-common -Omax -DFLAGS_STR=-fomit-frame-pointer -fno-common -Omax -g

--lto is also used for link time optimization

For memory constrained systems, compilers also offer options to use libraries that are optimized for size. For gcc, this is called “newlib-nano” and for Arm Compiler it is called “microlib”.

Code sizes for the same programs compiled with the size-optimized libraries are below:

Software program	GCC code size	Arm Compiler 6 code size
EEMBC auto	39,498	22,483
Dhrystone	12,590	4,404
Sort	14,044	3,412
Coremark	19,928	10,822

Options used for gcc with newlib-nano

-mcpu=cortex-m0plus -mthumb -DCLOCK_RATE -DCYCLE_SHIFT=6 -DNDEBUG -fomit-frame-pointer -fno-common -Os -DFLAGS_STR=-fomit-frame-pointer -fno-common -Os -g

Use -specs=nano.specs in the linker flags

Options used for Arm Compiler 6 with microlib

--target=arm-arm-none-eabi -march=armv6-m -mcpu=cortex-m0plus -mthumb -DUSE_MICROLIB -fomit-frame-pointer -fno-common -Oz -DCLOCK_RATE -DCYCLE_SHIFT=6 -DNDEBUG -DFLAGS_STR=-fomit-frame-pointer -fno-common -Oz -g

Use --library_type=microlib in the linker flags and --lto for link time optimization

The values are generally in line with the summary of Arm Compiler 6 on Arm Developer.

Code Size Summary

It's clear that Arm Compiler 6 generates smaller code, but just checking code size is not enough without knowing WHY the code is smaller.

There could be several possible reasons:

Extra code from libraries that is not needed
Extra code from the program itself that is not needed
The libraries have different functions or the library sizes are optimized differently
The generated code is functionally equivalent, but the compiler does the same work in fewer instructions

Without getting too much into performance, I can say that the performance improvements from Arm Compiler 6 are good, but not as drastic as the code size indicates. In most cases Arm Compiler 6 provides better performance, but not in every case, and the performance differences tend to be within 20%.

Finding the code size differences

Arm Fast Models can be used to collect more information about code size differences. My first thought was to figure out how much code in an image is actually executed. Of course, there will be non-executed code due to branches in the program, error handling which doesn’t normally occur, and more, but it would be interesting to find out if there are other reasons why the code sizes are so different. Visually comparing instructions from different compilers is difficult for anyone who is not a compiler expert, so I decided to use some simple scripts to get more details about where the code size differences come from.

Measuring instruction coverage

If the executed instructions are compared to the set of all instructions in the image it should highlight where the differences are coming from. Here is my method for analyzing code size:

Run the program on the Arm Fast Model and use a Fast Model plugin to record the executed PC values
- Verify that the number of recorded PC values matches the total number of instructions executed (as reported by the Fast Model at the end of simulation)
Generate a disassembly file (using fromelf) for the executable which contains all of the code in the image
Match the recorded PC values to the disassembly file and mark each instruction as Hit (was executed) or Miss (was not executed)
- Also annotate how many times the instruction was hit
- Total instructions = the sum, over all instructions that were hit, of the number of times each one was hit
Analyze the misses and see where they are coming from
Map PC values back to source code using GNU addr2line and check what kind of code is not executed
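The counting step in this method can be sketched with standard text tools. This is only an illustration and assumes the trace holds one PC value per line. The helper names count_pc_hits and total_from_counts are hypothetical, and the sample trace is made up.

```shell
#!/bin/bash
# Count how many times each PC value appears in a trace.
# Assumption: the trace has one PC value per line.
# Output: "<count> <pc>" for every unique PC.
count_pc_hits() {
    sort | uniq -c | awk '{ print $1, $2 }'
}

# Total executed instructions = sum of the per-PC hit counts.
total_from_counts() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Made-up trace: one entry instruction, then a two-instruction loop run 3 times.
trace=$(printf '%s\n' 0x00000174 \
    0x0000017e 0x00000180 0x0000017e 0x00000180 0x0000017e 0x00000180)

echo "$trace" | count_pc_hits
echo "$trace" | count_pc_hits | total_from_counts
```

The total printed on the last line should equal the number of lines in the trace, which is the same consistency check described in the list above.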

Obtaining the instruction trace

Arm Fast Models provide a Model Trace Interface (MTI) to trace execution of software and other hardware events. For the Cortex-M0+ trials I used a very simple system to run bare-metal applications, just the CPU and memory. Semihosting was also used to avoid any need for additional hardware modeling. Semihosting is supported by both Arm Compiler 6 and gcc.

Fast Models include a number of example plugins which demonstrate how to use the Model Trace Interface. One of the examples, SimpleTrace, is almost exactly the plugin needed to trace all of the PC values. It can be found in $PVLIB_HOME/examples/MTI/SimpleTrace

Building the plugin on Linux is as simple as typing make. I made one change to the source code. The plugin normally prints the PC values to stdout using printf() so I modified the code to open a file using fopen() and then used fprintf() to print the PC values and keep the traced values separate from the other output coming from the model. This makes it easy for scripts to access the traced values.

Loading the plugin is done via the --plugin command line option.

$ ./Linux64-Debug-GCC-5.4/isim_system -a build-armclang/sort/sort.axf --plugin ~/Arm/FastModelsPortfolio_11.0/examples/MTI/SimpleTrace/SimpleTrace.so --stat

The simulation is run with the .axf file, the SimpleTrace plugin which will write the PC values to a file named pc_trace.out, and --stat which will print out the statistics at the end of the simulation.

Attached TRACE.SimpleTrace to component: m0p.armcortexm0plusct
Ignoring component m0p.armcortexm0plusct.acp_mapper as it does not contain an INST source
Ignoring component m0p.armcortexm0plusct.ext_bus as it does not contain an INST source
Ignoring component m0p.armcortexm0plusct.ext_bus.mapper as it does not contain an INST source
Ignoring component m0p.armcortexm0plusct.l2_flusher as it does not contain an INST source
Ignoring component m0p.ramdevice as it does not contain an INST source
Ignoring component m0p.ramdevice.bus_slave as it does not contain an INST source
Ignoring component m0p.ramdevice1 as it does not contain an INST source
Ignoring component m0p.ramdevice1.bus_slave as it does not contain an INST source
Cortex-M0+ bare-metal startup, flags: -fomit-frame-pointer -fno-common -Omax
Insertion sort took 7 clock ticks
Shell sort took 2 clock ticks
Quick sort took 1 clock ticks
All done!

Info: /OSCI/SystemC: Simulation stopped by user.

--- m0p statistics: -----------------------------------------------------------
Simulated time                          : 1085525.000000s
User time                               : 0.160000s
System time                             : 0.004000s
Wall time                               : 0.167770s
Performance index                       : 6470316.50
m0p.armcortexm0plusct                   :   6.62 MIPS (     1085530 Inst)
-------------------------------------------------------------------------------

The instructions executed are printed at the end of the log. This will be useful to confirm the instruction coverage is correct.
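A small sketch of that confirmation step follows. It assumes the simulator output was saved to a log file and that the trace file has one PC value per line. The helper names inst_from_stat and check_trace, and the idea of saving the log to a file, are assumptions for illustration.

```shell
#!/bin/bash
# Extract the instruction count from the "--stat" summary line, e.g.
#   m0p.armcortexm0plusct   :   6.62 MIPS (     1085530 Inst)
# inst_from_stat is a hypothetical helper; it reads the simulator log on stdin.
inst_from_stat() {
    sed -n 's/.*MIPS ( *\([0-9][0-9]*\) Inst).*/\1/p' | head -n 1
}

# Compare the model's count with the number of PC values in the trace.
# Usage: check_trace <simulator log file> <trace file>
check_trace() {
    local expected traced
    expected=$(inst_from_stat < "$1")
    traced=$(grep -c . "$2")    # number of non-empty lines in the trace
    if [ "$expected" = "$traced" ]; then
        echo "OK: $traced instructions traced"
    else
        echo "MISMATCH: model reports $expected, trace has $traced"
        return 1
    fi
}

# Example on the statistics line from the log above.
echo "m0p.armcortexm0plusct                   :   6.62 MIPS (     1085530 Inst)" |
    inst_from_stat
```

If the two numbers differ, the trace is incomplete or the plugin attached to the wrong component, and the coverage figures should not be trusted.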

Matching the PC trace to the disassembly

Generating the disassembly file is done with fromelf. The -z will generate the code size information as described above and the -c will generate the disassembly.

$ fromelf -cz build-armclang/sort/sort.axf > build-armclang/sort/sizes.txt

With the disassembly file and the PC trace file, now it's just a matter of annotating the disassembly to identify which instructions were actually executed and how many times they were executed. This can be done with a simple bash script. The script works by reading the PC trace into an associative array. Then it loops through the disassembly file to identify lines which contain actual instructions. I filtered out the DCD and DCB lines since they are data and not executed instructions.

#!/bin/bash

if [ "$#" -ne 2 ]; then
    echo "Please pass the directory and the name of the test as arguments"
    exit 1
fi


AXFFILE=$1/$2.axf
# the DISFILE contains the fromelf output, disassembly and size info
DISFILE=$1/sizes.txt
TRACEFILE=$1/pc_trace.out
OUTFILE=$1/$2.cover
MISFILE=$1/$2.miss
MISSRC=$1/$2.miss-src

rm -f $MISFILE

hits=0
misses=0
notcode=0
total_instructions=0

# Read PC trace file into associative array
declare -A PCTRACE
while read pc; do

    if [ ${PCTRACE[$pc]+_} ]
    then
        tmp=${PCTRACE[$pc]}
        let "tmp = tmp + 1"
        PCTRACE[$pc]=$tmp
    else
        PCTRACE[$pc]=1
    fi
done < $TRACEFILE

# Loop through disassembly file and check how many times each instruction was executed
# by looking up the PC in the associative array
while read value; do
   vals=( $value )

   S1=${vals[0]}
   INST=${vals[3]}
   # sometimes there are spaces in the DCD data so check next token also
   ALTINST=${vals[4]}

   if [[ ${S1:0:2} = 0x && ${S1:10} = : && $INST != "DCD" && $ALTINST != "DCD" &&
         $INST != "DCB" && $INST != "DCW" ]]
   then
       pc=${S1:0:10}
       cnt=${PCTRACE[$pc]}
       if [ -n "$cnt" ]
       then
           echo "H $cnt $value"
           let total_instructions+=$cnt
           let hits++
       else
           echo "M   $value"
           echo ${S1:0:10} >> $MISFILE
           let misses++
       fi
   else
       echo "$value"
       let notcode++
   fi

done < $DISFILE > $OUTFILE

let total=$hits+$misses
rate=`echo $hits/$total*100|bc -l`
echo "Results: hits $hits misses $misses total instructions $total_instructions rate $rate"
arm-none-eabi-addr2line -e $AXFFILE @$MISFILE > $MISSRC.tmp
sort -u $MISSRC.tmp > $MISSRC
rm -f $MISSRC.tmp

The script produces a number of useful outputs:

Summary of instructions executed and not executed
Annotated disassembly file marking the executed instructions
A list of PC values that were never executed
Source references to the unexecuted instructions

With this information it is much easier to figure out where the differences in code size are coming from.

Looking at the annotated disassembly file

One result is an annotated version of the disassembly file created by fromelf. It has H (hit) or M (miss) at the start of each line that is an instruction and a count of how many times the instruction was executed. Below is a section from the sort disassembly file for a memory copy function.

__aeabi_memcpy4
__aeabi_memcpy8
H 3 0x00000174:    b570        p.      PUSH     {r4-r6,lr}
H 3 0x00000176:    4605        .F      MOV      r5,r0
H 3 0x00000178:    460c        .F      MOV      r4,r1
H 3 0x0000017a:    4616        .F      MOV      r6,r2
H 3 0x0000017c:    e002        ..      B        0x184 ; __aeabi_memcpy4 + 16
H 150 0x0000017e:    cc0f        ..      LDM      r4!,{r0-r3}
H 150 0x00000180:    c50f        ..      STM      r5!,{r0-r3}
H 150 0x00000182:    3e10        .>      SUBS     r6,r6,#0x10
H 153 0x00000184:    2e10        ..      CMP      r6,#0x10
H 153 0x00000186:    d2fa        ..      BCS      0x17e ; __aeabi_memcpy4 + 10
H 3 0x00000188:    2e08        ..      CMP      r6,#8
H 3 0x0000018a:    d302        ..      BCC      0x192 ; __aeabi_memcpy4 + 30
M   0x0000018c:    cc03        ..      LDM      r4!,{r0,r1}
M   0x0000018e:    c503        ..      STM      r5!,{r0,r1}
M   0x00000190:    3e08        .>      SUBS     r6,r6,#8
H 3 0x00000192:    2e04        ..      CMP      r6,#4
H 3 0x00000194:    d307        ..      BCC      0x1a6 ; __aeabi_memcpy4 + 50
M   0x00000196:    cc01        ..      LDM      r4!,{r0}
M   0x00000198:    c501        ..      STM      r5!,{r0}
M   0x0000019a:    1f36        6.      SUBS     r6,r6,#4
M   0x0000019c:    e003        ..      B        0x1a6 ; __aeabi_memcpy4 + 50
M   0x0000019e:    7821        !x      LDRB     r1,[r4,#0]
M   0x000001a0:    7029        )p      STRB     r1,[r5,#0]
M   0x000001a2:    1c64        d.      ADDS     r4,r4,#1
M   0x000001a4:    1c6d        m.      ADDS     r5,r5,#1
H 3 0x000001a6:    1e76        v.      SUBS     r6,r6,#1
H 3 0x000001a8:    d2f9        ..      BCS      0x19e ; __aeabi_memcpy4 + 42
H 3 0x000001aa:    bd70        p.      POP      {r4-r6,pc}
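To see which functions contribute the most unexecuted instructions, an annotated listing like the one above can be summarised per label. The sketch below is a heuristic: it assumes instruction lines start with "H <count> " or "M ", and that any other single-token line not starting with 0x, $ or . is a function label. The helper name misses_by_function is hypothetical. When several labels share an address, as with __aeabi_memcpy4 and __aeabi_memcpy8, the last one is used.

```shell
#!/bin/bash
# Summarise an annotated listing: print "<misses> <hits> <label>" per label,
# largest miss count first.
# Assumptions: lines starting "H <count> 0x..." are executed instructions,
# lines starting "M 0x..." are unexecuted ones, and a single-token line that
# does not start with 0x, $ or . is treated as a function label.
misses_by_function() {
    awk '
        BEGIN { label = "(no label)" }
        $1 == "H" && $3 ~ /^0x/ { hit[label]++;  seen[label] = 1; next }
        $1 == "M" && $2 ~ /^0x/ { miss[label]++; seen[label] = 1; next }
        NF == 1 && $1 !~ /^(0x|\$|\.)/ { label = $1 }
        END { for (l in seen) print miss[l] + 0, hit[l] + 0, l }
    ' | sort -k1,1nr
}

# Example on a few lines taken from the listing above.
misses_by_function <<'EOF'
__aeabi_memcpy4
__aeabi_memcpy8
H 3 0x00000188:    2e08        ..      CMP      r6,#8
H 3 0x0000018a:    d302        ..      BCC      0x192 ; __aeabi_memcpy4 + 30
M   0x0000018c:    cc03        ..      LDM      r4!,{r0,r1}
M   0x0000018e:    c503        ..      STM      r5!,{r0,r1}
M   0x00000190:    3e08        .>      SUBS     r6,r6,#8
EOF
```

Run on a full .cover file, the top of the output points at the functions worth inspecting first.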

Coverage results

To see how much code is actually used, a summary of the instructions hit is below. Instructions run is the total number of instructions executed. Instructions hit is the number of unique instructions in the annotated disassembly file that were executed at least once. Instructions miss is the number of instructions in the disassembly file that were never executed. The percent of code used is simply instructions hit / (instructions hit + instructions miss).

Program	Build	Instructions run	Instructions hit	Instructions miss	% Code Used
dhrystone	gcc	5886094	3023	14513	17
dhrystone	armclang	3046779	1414	1023	58
dhrystone	gcc newlib-nano	12759963	2016	2293	47
dhrystone	armclang microlib	16723675	820	192	81
sort	gcc	1384920	3887	18312	18
sort	armclang	1085530	2034	1335	60
sort	gcc newlib-nano	1316909	2542	2682	49
sort	armclang microlib	1192840	834	267	76
coremark	gcc	18792022	7896	13561	37
coremark	armclang	20535946	5021	2777	64
coremark	gcc newlib-nano	28716240	2442	1232	66
coremark	armclang microlib	28134087	2564	1259	67
EEMBC auto	gcc	1940120	7685	19213	28
EEMBC auto	armclang	1513362	5252	4792	51
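The percentage column follows directly from the hit and miss counts. A minimal sketch of the calculation, rounded to a whole number, is shown here. The helper name pct_code_used is hypothetical, and the example uses the dhrystone armclang figures above (1414 hit, 1023 miss).

```shell
#!/bin/bash
# Percent of code used = hit / (hit + miss) * 100, rounded to a whole number.
# pct_code_used is a hypothetical helper name.
# Usage: pct_code_used <instructions hit> <instructions miss>
pct_code_used() {
    awk -v hit="$1" -v miss="$2" 'BEGIN {
        total = hit + miss
        if (total == 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.0f\n", 100 * hit / total
    }'
}

pct_code_used 1414 1023   # prints 58
```

The coverage script prints the same ratio through bc with many decimal places, so a rounding step like this mainly makes the summary easier to read.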

Coverage conclusions

There is clearly more unused code in the images produced by gcc. For applications which have memory constraints, especially for Cortex-M or Cortex-R, Arm Compiler 6 produces smaller code. Another observation is that the code generated with Arm Compiler 6 microlib is extremely small. Far fewer unique instructions are used to perform the same functionality. The newlib-nano and microlib options create images with a much higher percentage of code executed, but the instruction count to complete the same work is higher and performance is lower for some applications. To get actual performance information the code can be run on Arm Cycle Models.

Finding the unused code

Using the coverage information, the unexecuted instructions can be examined in more detail. There are a few ways to do this.

Look at the disassembly file
View the code in a debugger
Map the addresses back to the source code using addr2line

The addr2line utility can be run on any address, but only works well on the addresses for which the source code exists. This means the library references don't generate anything useful. The good news is this means it's easy to find unused code which is from the application source. The bad news is it doesn't provide much insight into library code.
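One way to split the misses into application code and code without source information is to group the addr2line output by file. The sketch below assumes it is fed the raw addr2line output, one "file:line" entry per missed address, where unresolved addresses appear as "??:0" or "??:?". The helper name misses_by_source_file and the file paths in the example are made up.

```shell
#!/bin/bash
# Count unexecuted addresses per source file from addr2line output.
# Assumption: each input line is "path/file.c:123", or "??:0" / "??:?" when
# addr2line cannot resolve the address (typically library code).
misses_by_source_file() {
    awk -F: '
        $1 == "??" { unknown++; next }
        NF >= 2    { files[$1]++ }
        END {
            for (f in files) print files[f], f
            if (unknown) print unknown, "(no source information)"
        }
    ' | sort -k1,1nr
}

# Example with made-up paths.
misses_by_source_file <<'EOF'
??:?
/work/sort/sort.c:41
/work/sort/sort.c:57
/work/sort/retarget.c:12
EOF
```

With real data the input would come from the same command the coverage script runs, for example: arm-none-eabi-addr2line -e sort.axf @sort.miss | misses_by_source_file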

Inspecting the differences with these techniques shows that the libraries are the largest source of unexecuted code. Arm Compiler 6 seems to include much less code from library functions, while gcc seems to pull in large chunks of library code which are not used. Another source of unused code is unused functions in the application. Arm Compiler 6 does a better job of removing code that is not used.

As application size increases, the impact of the unused code may shrink. An application with hundreds of KB of code may be more interesting to evaluate, but many microcontrollers have only 256 KB to work with.

Comparing at the function level

Another place to compare compilers is on individual functions. To do this I took 2 small to medium functions from the sort example provided in DS-5, insert_sort() and shell_sort(), and marked them with __attribute__((noinline)) so they would be preserved by the compilers. Using the same coverage information, the instruction count with each compiler is shown below.

	Arm Compiler 6		GNU gcc
function	Instructions	Instructions Used	Instructions	Instructions Used
insert_sort()	35	35	37	37
shell_sort()	75	75	78	77

Arm Fast Model plugins can also be used to trace function start and end points and record the number of instructions in the function. The plugin at $PVLIB_HOME/examples/MTI/SimpleTrace is a good starting point as it maps the software function names to PC values. I modified this plugin to print the instruction count, which gives a way to measure instruction counts for a function. Breakpoints in a software debugger can also be used to start and stop counting instructions.

	Arm Compiler 6		GNU gcc
function	Instructions	Self Instructions	Instructions	Self Instructions
insert_sort()	506,647	97,307	479,661	98,992
shell_sort()	124,535	39,705	140,327	46,223

The function insert_sort() uses strcmp for string compare, and shell_sort() also uses strcmp and does some integer math using a library function. This explains why the sum of instructions from the coverage file is less than the total instructions to run the function. This also helps isolate library differences.

The last check is to use Arm Cycle Models from Arm IP Exchange and the software profiling feature of SoC Designer to measure the cycles used to run each sort function. Using cycle counts is far more accurate than just instruction count and will give a better indication of performance. The images below show the profiling information for insert_sort() and shell_sort() for each compiler.

The results show the cycle counts for the functions and the cycles for each of the sub-functions. Arm Compiler 6 performance is better than gcc, but not as dramatic as the code size comparisons.

Conclusion

There are many ways to evaluate embedded compilers for Arm. Code size is a common metric used to compare compilers and optimization levels. Looking at code size is a start, but most engineers will ask WHY the code size is different because just saying the code is smaller is not very satisfying. Arm Models can be used to investigate code size in more detail. Other solutions may also be able to do the same analysis if trace information from the software can be obtained. When optimizing software for code size and performance there is no silver bullet to automatically create the smallest code with the fastest performance. In fact, there are a wide variety of possible results with larger variances than I expected when I started looking into compiler choices. With the information in this article embedded software engineers can dig into the details for the specific software they are working with instead of relying on general compiler marketing information to pick the best solution.

Arm Compiler 6 is now recommended for new projects and many projects are now in the process of migrating from Arm Compiler 5 to Arm Compiler 6 and are interested in performance and code size. The linker and libraries provided with Arm compilers differentiate the quality of results from open source compilers making Arm Compiler 6 a strong choice for projects looking for the smallest code size.

Learn more about Arm Compiler 6

findcatlesstar over 2 years ago

Why did you use link time optimization with armv6 but not with gcc?
Isn't that beating all further assumptions and comparisons?