CoreMark and Compiler Performance


CoreMark is quickly gaining traction as the de facto benchmark for CPU performance. It is freely available, easy to compile and run, and returns a single-value result, simplifying performance analysis. As with Dhrystone in the 1990s, we are seeing developers attempt to determine compiler efficiency based on CPU performance. This can be misleading: CoreMark, just like Dhrystone before it, is a small special-purpose benchmark targeting CPU performance rather than a broad-based embedded software workload. The value of CoreMark for determining compiler efficiency depends on how closely your application resembles the benchmark. Recently, customers have been asking for CoreMark performance data based on compilation by the ARM Compiler, so we decided to investigate CoreMark and its potential value as a compiler effectiveness indicator. This blog describes our experience, which resulted in significant CoreMark performance improvements in the latest version of the ARM Compiler.

Historically, benchmarks allowed comparison of compiler effectiveness and CPU performance. The first popular benchmark, SIEVE, calculated just prime numbers and was published in January 1983 in BYTE Magazine. Later, Dhrystone and Whetstone became popular. Dhrystone focuses on integer and string operations whereas Whetstone primarily uses floating point arithmetic. Today's compiler technology allows calculation of many of the internal benchmark operations at compile time and the CPU performance indication of these benchmarks may be misleading. CoreMark uses randomly generated data as input, making it impossible for compilers to pre-compute parts of the benchmark at compile time. While CoreMark is more difficult to "defeat" [1] than Dhrystone, clever compiler writers can still improve the benchmark result by crafting optimizations that are aimed at coding constructs of the benchmark. From a compiler-writer's perspective, this is like shooting fish in a barrel!

The Experiment: Improve ARM Compiler Performance using CoreMark
We began our experiment by analyzing the CoreMark coding structure and the functions the benchmark performs. CoreMark consists of a linked list data structure which is scanned at runtime. Control is then offloaded to a loop-controlled state machine or matrix manipulation routine, depending on the linked list data value. After analysis, we identified techniques to deal with the state machine and opportunities for more aggressive loop unrolling.

State Machine Analysis: the CoreMark state machine is implemented as a series of switch statements encapsulated in a loop. Below is an example of this structure.
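As a minimal sketch of that shape (hypothetical names and logic for illustration; this is not the actual CoreMark source), consider a scanner that classifies input one character at a time, with a switch statement dispatching on the current state inside a loop:

```c
/* Hypothetical switch-in-a-loop state machine, illustrating the shape of
 * the CoreMark code (not the actual benchmark source). It scans a string
 * and counts runs of consecutive digits. */
enum state { START, IN_NUMBER, IN_WORD, DONE };

int count_digit_runs(const char *p)
{
    enum state s = START;
    int runs = 0;

    while (s != DONE) {
        char c = *p++;
        switch (s) {
        case START:
            if (c == '\0')                   s = DONE;
            else if (c >= '0' && c <= '9') { runs++; s = IN_NUMBER; }
            else                             s = IN_WORD;
            break;
        case IN_NUMBER:                      /* inside a run of digits */
            if (c == '\0')                   s = DONE;
            else if (c < '0' || c > '9')     s = IN_WORD;
            break;
        case IN_WORD:                        /* inside non-digit text */
            if (c == '\0')                   s = DONE;
            else if (c >= '0' && c <= '9') { runs++; s = IN_NUMBER; }
            break;
        default:
            s = DONE;
            break;
        }
    }
    return runs;
}
```

Every iteration pays for the switch dispatch: the loop branches back to the top, re-reads the state variable, and re-selects a case.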


Based on the structure of the switch statements, we know the new destination after each case statement, making it possible to eliminate the switch by directing each case to branch directly to the successive case. For example:
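The rewrite can be sketched on a small hypothetical state machine (a digit-run counter; not the CoreMark source) in which the dispatch switch is eliminated and each state's code branches straight to the code for its successor state:

```c
/* A hypothetical digit-run counter written without a dispatch switch:
 * since each case knows its successor, every state branches directly to
 * the next state's code instead of looping back through a switch. */
int count_digit_runs_direct(const char *p)
{
    int runs = 0;
    char c;

start:                          /* initial state */
    c = *p++;
    if (c == '\0') goto done;
    if (c >= '0' && c <= '9') { runs++; goto in_number; }
    goto in_word;

in_number:                      /* inside a run of digits */
    c = *p++;
    if (c == '\0') goto done;
    if (c < '0' || c > '9') goto in_word;
    goto in_number;

in_word:                        /* inside non-digit text */
    c = *p++;
    if (c == '\0') goto done;
    if (c >= '0' && c <= '9') { runs++; goto in_number; }
    goto in_word;

done:
    return runs;
}
```

The per-iteration switch dispatch is gone; the cost is that any code shared before or after the original switch must be replicated at each state.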
Depending on aspects of the code, such as the enclosing loop structure and the amount of code before and after the switch, this transformation can improve performance of a general state machine. As you might expect, replication of the pre- and post-switch code after each switch case will result in increased code size. More on that later.
Loop Unrolling: the ARM Compiler uses heuristics to determine which loops to unroll. Too aggressive unrolling can significantly bloat the resultant code; too conservative unrolling leaves performance on the table. The ARM Compiler typically takes a conservative approach to loop unrolling, as our customers generally develop embedded applications which are often memory constrained. The default heuristics take into account the overall impact on code size, trying to improve performance without substantially bloating the generated code. As part of this experiment, we took a more aggressive approach to loop unrolling, adjusting the compiler's heuristics to aggressively unroll loops irrespective of code size increase.
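To illustrate what the transformation does, here is a summation loop unrolled by hand by a factor of four (the compiler performs the equivalent rewrite automatically; this sketch is not ARM Compiler output):

```c
/* Summation loop unrolled by a factor of 4, illustrating the
 * transformation a compiler's loop unroller applies automatically.
 * n need not be a multiple of 4; a remainder loop handles the tail. */
int sum_unrolled(const int *a, int n)
{
    int s = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4)      /* unrolled body: four adds per trip */
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)              /* scalar tail for the remainder */
        s += a[i];
    return s;
}
```

Fewer loop-control branches execute per element, at the cost of a larger loop body: exactly the trade-off the unrolling heuristics have to weigh.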

The Results: ARM Compiler shows significant CoreMark Improvements
Applying the techniques mentioned above resulted in a dramatic improvement in CoreMark benchmark performance. While the result was impressive, there was the unwelcome side effect of a 13% code size increase when enabling the switch statement and aggressive loop optimizations. Given that the ARM Compiler targets embedded developers, this is generally not an acceptable outcome for our customers.

Why Code Size Matters
As mentioned above, the performance improvement techniques come with an unwelcome impact on code size. Generally speaking, ARM's powerful 32-bit microcontrollers have plenty of horsepower for embedded workloads, but memory capacity is often limited in embedded applications for cost reasons. Although MCU prices have fallen dramatically, moving to a microcontroller with more on-chip Flash memory can increase your BOM cost drastically. For example, when purchasing 1,500 pieces of a popular ARM Cortex-M3 microcontroller, the unit price of the 128KB Flash variant is $4.88, whereas the 64KB Flash variant is just $2.80. That $2.08 per-unit difference adds over $3,000 to the cost of the order, so a compiler that is primarily tuned for performance can significantly increase the overall project cost.

System cost is not the only reason why code size is important in today's modern embedded processors. Compact code increases the number of instructions which can fit into cache, potentially improving performance based on more efficient cache usage and, perhaps more importantly, potentially reducing overall power consumption.

The ARM Compiler team has always focused on both performance and code density, resulting in a well-tuned compiler that balances execution speed and code size. To complement compact code generation, we created MicroLib, a size-optimized library for ARM-based embedded applications. When compared to a standard C library, MicroLib provides significant code size advantages. For example, the 13% code size increase mentioned above turns into a net 23% code size reduction when using MicroLib.

For users who care more about performance than code size, the CoreMark improvements mentioned above will prove welcome. For example, code consisting of finite state machines implemented using switch statements, or well-formed [2] for loops and while loops, could see significant improvement for those constructs. Testing the new optimizations with our standard benchmark suite, which consists of over 60 applications targeting a wide range of embedded use cases, showed a substantial improvement on only some of them. This was not surprising, as CoreMark is a small piece of software, and targeted compiler optimizations aimed at a limited set of code constructs may not scale across broader code bases.

CoreMark performance can be significantly increased by applying compiler optimizations specifically targeting the constructs of the benchmark code. Comparing CoreMark scores between compilers can give an indication of which compiler fares better on CoreMark-like code, but this may or may not improve real-world embedded application performance and even worse, could introduce unwanted code bloat. The code size penalty can be mitigated by using a compact library, such as MicroLib. As always, the best approach is to evaluate compilers of interest on your code, taking into account impact on code size when aggressive performance optimizations are used.

For those who are interested in evaluating the ARM Compiler improvements referenced in this blog, download Keil MDK-ARM v4.70 or ARM DS-5 v5.14 (available March 2013). The release notes provide details on how to use the new optimizations. I encourage you to try them and give me your feedback.

[1] Dhrystone can be compromised by the compiler by pre-computing values at compile time or optimizing away timed portions of the code.

[2] Idiomatic loops with known constant upper and lower bounds, loops with unknown upper bound, and loops containing a small number of C statements.

  • I think that would be great. We would first need to standardize on a way to measure code size and what gets included in the measurement (probably a topic for a different blog!).
  • What exactly is the base for the speedup comparison? Optimization for code size? Other? The optimizations described are applicable to these constructs (loops or switch statements) in general, and not particular to the benchmark. Did the compiler have to be enhanced to be able to use tail chaining, loop unrolling, or any other control flow transforms? In other words, is CoreMark pushing the compiler to incorporate new technologies, thus helping applications that require more performance and are willing to pay for the increased code size, or did the compiler have these capabilities before, and CoreMark merely helps customers understand the potential benefit and trade-off with code size?
  • The speedup comparison is at the highest performance level, -O3 -Otime. We didn't add any fundamentally new technology. The ARM Compiler is used in a very wide range of embedded applications and we have customers that are quite sensitive to code size, even at -O3 -Otime, so we generally take a code-size-conservative approach. For the CoreMark state machine, we found opportunities to inline code within the switch statements, resulting in better performance at the expense of more code. Furthermore, the compiler loop optimizer can apply many levels of unrolling, but we apply conservative heuristics by default to ensure that code size doesn't bloat. We added an option to let the user trade off code size for performance with regard to loop unrolling.