CoreMark and Compiler Performance

September 11, 2013

CoreMark is quickly gaining traction as the de facto benchmark for CPU performance. It is freely available, easy to compile and run, and returns a single-value result, which simplifies performance analysis. As with Dhrystone in the 1990s, we are seeing developers attempt to judge compiler efficiency from CPU benchmark scores. That judgment can be misleading because CoreMark, just like Dhrystone before it, is a small special-purpose benchmark targeting CPU performance rather than a broad-based embedded software workload. The value of CoreMark for determining compiler efficiency depends on how closely your application resembles the benchmark. Recently, customers have been asking for CoreMark performance data based on compilation by the ARM Compiler, so we decided to investigate CoreMark and its potential value as a compiler effectiveness indicator. This blog describes our experience, which resulted in significant CoreMark performance improvements in the latest version of the ARM Compiler.

Background
Historically, benchmarks allowed comparison of compiler effectiveness and CPU performance. The first popular benchmark, SIEVE, simply calculated prime numbers and was published in BYTE Magazine in January 1983. Later, Dhrystone and Whetstone became popular. Dhrystone focuses on integer and string operations whereas Whetstone primarily uses floating point arithmetic. Today's compiler technology allows many of the internal operations of these benchmarks to be calculated at compile time, so the CPU performance they indicate may be misleading. CoreMark uses randomly generated data as input, making it impossible for compilers to pre-compute parts of the benchmark at compile time. While CoreMark is more difficult to "defeat" [1] than Dhrystone, clever compiler writers can still improve the benchmark result by crafting optimizations that are aimed at coding constructs of the benchmark. From a compiler-writer's perspective, this is like shooting fish in a barrel!
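To make the pre-computation point concrete, here is a minimal C sketch. It is not taken from Dhrystone or CoreMark, and the names `checksum`, `fixed_input_result`, `runtime_input_result` and `seed` are hypothetical. When the input is a compile-time constant, an optimizing compiler is free to evaluate the whole computation at build time. When the input is read at run time through a `volatile` object, the work has to be executed on the target.

```c
#include <stdint.h>

/* Hypothetical kernel: sums values from a simple generated sequence. */
static uint32_t checksum(uint32_t start, uint32_t count)
{
    uint32_t sum = 0;
    uint32_t value = start;
    for (uint32_t i = 0; i < count; i++) {
        value = value * 1664525u + 1013904223u; /* linear congruential step */
        sum += value >> 16;
    }
    return sum;
}

/* Input known at compile time: the compiler may fold this call
 * into a single constant, so nothing is measured at run time. */
uint32_t fixed_input_result(void)
{
    return checksum(42u, 8u);
}

/* Input read at run time through a volatile object: the value is
 * unknown to the compiler, so the loop cannot be pre-computed. */
volatile uint32_t seed = 42u;

uint32_t runtime_input_result(void)
{
    return checksum(seed, 8u);
}
```

Both functions return the same value while `seed` holds 42, but only the second one forces the loop to run on the CPU being measured.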

The Experiment: Improve ARM Compiler Performance using CoreMark
We began our experiment by analyzing the CoreMark coding structure and the functions the benchmark performs. CoreMark consists of a linked list data structure which is scanned at runtime. Control is then offloaded to a loop-controlled state machine or matrix manipulation routine, depending on the linked list data value. After analysis, we identified techniques to deal with the state machine and opportunities for more aggressive loop unrolling.

State Machine Analysis: the CoreMark state machine is implemented as a series of switch statements encapsulated in a loop. Below is an example of this structure.
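A minimal sketch of such a structure, using a hypothetical number-scanning state machine rather than the actual CoreMark source (the `state` enumeration and `scan_switch` are illustrative names only):

```c
enum state { START, SIGN, INT, FRACTION, INVALID };

/* Scans a NUL-terminated string and returns the state reached.
 * Every iteration re-enters the switch and dispatches on 'st'. */
enum state scan_switch(const char *s)
{
    enum state st = START;

    for (; *s != '\0' && st != INVALID; s++) {
        char c = *s;
        int digit = (c >= '0' && c <= '9');

        switch (st) {
        case START:
            if (digit)
                st = INT;
            else if (c == '+' || c == '-')
                st = SIGN;
            else
                st = INVALID;
            break;
        case SIGN:
            st = digit ? INT : INVALID;
            break;
        case INT:
            if (c == '.')
                st = FRACTION;
            else if (!digit)
                st = INVALID;
            break;
        case FRACTION:
            if (!digit)
                st = INVALID;
            break;
        case INVALID:
            break;
        }
    }
    return st;
}
```

Each pass through the loop pays for the loop test and the switch dispatch, even though every case already knows which state comes next.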

Based on the structure of the switch statements, we know the new destination after each case statement, making it possible to eliminate the switch by directing each case to branch directly to the successive case. For example:
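As a hand-written approximation of what such a transformation might produce (the compiler works on its internal representation, not on source), consider a hypothetical number-scanning state machine written with direct branches. The `state` enumeration, `scan_direct` and the label names are illustrative assumptions:

```c
enum state { START, SIGN, INT, FRACTION, INVALID };

/* Scans a NUL-terminated string and returns the state reached.
 * There is no switch: each state fetches the next character itself
 * and jumps straight to the code for its successor state. */
enum state scan_direct(const char *s)
{
    char c;

    /* START state */
    c = *s++;
    if (c == '\0') return START;
    if (c >= '0' && c <= '9') goto state_int;
    if (c == '+' || c == '-') goto state_sign;
    return INVALID;

state_sign:
    c = *s++;
    if (c == '\0') return SIGN;
    if (c >= '0' && c <= '9') goto state_int;
    return INVALID;

state_int:
    c = *s++;
    if (c == '\0') return INT;
    if (c >= '0' && c <= '9') goto state_int;
    if (c == '.') goto state_fraction;
    return INVALID;

state_fraction:
    c = *s++;
    if (c == '\0') return FRACTION;
    if (c >= '0' && c <= '9') goto state_fraction;
    return INVALID;
}
```

The character fetch and end-of-input test, which a loop-and-switch version would contain once, now appear in every state. That replication is where the extra code size comes from.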

Depending on aspects of the code, such as the enclosing loop structure and the amount of code before and after the switch, this transformation can improve performance of a general state machine. As you might expect, replication of the pre- and post-switch code after each switch case will result in increased code size. More on that later.

Loop Unrolling: the ARM Compiler uses heuristics to determine which loops to unroll. Overly aggressive unrolling can significantly bloat the resultant code; overly conservative unrolling leaves performance on the table. The ARM Compiler typically takes a conservative approach to loop unrolling, as our customers generally develop embedded applications which are often memory constrained. The default heuristics take into account the overall impact on code size, trying to improve performance without substantially bloating the generated code. As part of this experiment, we took a more aggressive approach, adjusting the compiler's heuristics to unroll loops irrespective of the code size increase.
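The trade-off can be sketched by unrolling a loop by hand. This is only an illustration of the general technique, not the ARM Compiler's actual output or heuristics; `sum_rolled` and `sum_unrolled4` are hypothetical names and the unroll factor of four is an arbitrary choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Rolled form: one addition and one loop test per element. */
int32_t sum_rolled(const int32_t *data, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Unrolled by four: one loop test per four elements, at the cost
 * of a larger loop body plus a remainder loop. */
int32_t sum_unrolled4(const int32_t *data, size_t n)
{
    int32_t sum = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        sum += data[i];
        sum += data[i + 1];
        sum += data[i + 2];
        sum += data[i + 3];
    }
    for (; i < n; i++)          /* handles the last n % 4 elements */
        sum += data[i];
    return sum;
}
```

Both functions compute the same result. The unrolled version executes fewer branches but occupies more code space, which is the cost the default heuristics try to limit.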

The Results: ARM Compiler shows significant CoreMark Improvements
Applying the techniques mentioned above resulted in a dramatic improvement in CoreMark benchmark performance. While the result was impressive, there was the unwelcome side effect of a 13% code size increase when enabling the switch statement and aggressive loop optimizations. Given that the ARM Compiler targets embedded developers, this is generally not an acceptable outcome for our customers.

Why Code Size Matters
As mentioned above, the performance improvement techniques come with an unwelcome impact on code size. Generally speaking, ARM's powerful 32-bit microcontrollers have plenty of horsepower for embedded workloads, but memory capacity is often limited in embedded applications for cost reasons. Although MCU prices have fallen dramatically, moving to a microcontroller with more on-chip Flash memory can increase your BOM cost drastically. For example, when purchasing 1500 pieces of a popular ARM Cortex-M3 microcontroller, the unit price of the 128KB Flash variant is $4.88 whereas the 64KB Flash variant is just $2.80, a difference of $2.08 per unit or $3,120 across the order. A compiler that is primarily tuned for performance can therefore significantly increase the overall project cost.

System cost is not the only reason why code size is important in today's modern embedded processors. Compact code increases the number of instructions which can fit into cache, potentially improving performance based on more efficient cache usage and, perhaps more importantly, potentially reducing overall power consumption.

The ARM Compiler team has always focused on both performance and code density, resulting in a well-tuned compiler that balances execution speed and code size. To complement compact code generation, we created MicroLib, a size-optimized library for ARM-based embedded applications. When compared to a standard C library, MicroLib provides significant code size advantages. For example, the 13% code size increase mentioned above turns into a net 23% code size reduction when using MicroLib.

For users who care more about performance than code size, the CoreMark improvements mentioned above will prove welcome. For example, code consisting of finite state machines implemented using switch statements or well-formed [2] for loops and while loops could see significant improvement for those constructs. Testing the new optimizations with our standard benchmark suite, which consists of over 60 applications targeting a wide range of embedded use cases, showed a substantial improvement on only some of them. This was not surprising, as CoreMark is a small piece of software and targeted compiler optimizations aimed at a limited set of code constructs may not scale across broader code bases.

Summary
CoreMark performance can be significantly increased by applying compiler optimizations specifically targeting the constructs of the benchmark code. Comparing CoreMark scores between compilers can give an indication of which compiler fares better on CoreMark-like code, but this may or may not improve real-world embedded application performance and, even worse, could introduce unwanted code bloat. The code size penalty can be mitigated by using a compact library, such as MicroLib. As always, the best approach is to evaluate compilers of interest on your own code, taking into account the impact on code size when aggressive performance optimizations are used.

For those who are interested in evaluating the ARM Compiler improvements referenced in this blog, download Keil MDK-ARM v4.70 or ARM DS-5 v5.14 (available March 2013). The release notes provide details on how to use the new optimizations. I encourage you to try it and give me your feedback.

----
[1] Dhrystone can be compromised by the compiler by pre-computing values at compile time or optimizing away timed portions of the code.

[2] Idiomatic loops with known constant upper and lower bounds, loops with unknown upper bound, loops containing a small number of C-statements.
