Five things you may not know about ARM Cortex-M0+

October 25, 2013

1 - Fast adoption

The Cortex-M0+ processor was announced in March 2012 with two lead partners, Freescale and NXP, who went public with their plans to release MCUs based on this new processor. These plans became reality shortly afterwards with the launch of Freescale's Kinetis L series and, later that year, NXP's LPC800 (see The un'NXP'ected LPC800), and the flow of announcements and releases has been steady since then. We have seen Fujitsu Semiconductor adopt the processor (the MCU product line subsequently being taken over by Spansion), Atmel offer an alternative to its AVR line in the low end with the SAM D20 and, more recently, Silicon Labs unveil their Zero Gecko. And there are more to come, but we cannot talk about them yet!

Even though all these product lines share the same processor, each company has added its own ingredients and know-how, making each series unique. I love the Freescale KL02 in its tiny 1.9 x 2.0 mm WLCSP package, I'm fond of the LPC800's pattern matching engine, Atmel got me with their peripheral event system and, last but not least, thanks to Silicon Labs for their Low-Energy Sensor Interface. This is a good demonstration that a low-end MCU does not have to mean low-end functionality, and that our partners focus on bringing new differentiating features instead of developing and maintaining a proprietary processor architecture.

The off-the-shelf MCU devices are just the tip of the iceberg. There are even more designs in deeply embedded applications where the processor is not exposed, for example coupled with sensors or forming an intelligent subsystem in a bigger SoC. In that space, it is quite common for companies not to communicate openly about the processor they selected.

2 - ARM's shortest pipeline ever

It all started with the bold idea of implementing the ARMv6-M architecture using just a 2-stage pipeline. Since ARM started to architect processors, even the most minimalist pipeline design had never gone below the three well-known stages: Fetch, Decode and Execute. The Cortex-M0+ changes that: its implementation splits the functionality of the Decode stage and merges it into the Fetch and Execute stages. The advantages are twofold. First, it saves one stage of registers, which reduces power consumption. Second, program execution becomes faster because non-sequential program accesses, namely branches and exceptions, need one cycle less to complete.
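To get a feel for what "one cycle less" per branch means, here is a minimal cycle-count sketch in C for a simple countdown loop. It assumes a taken conditional branch costs 3 cycles on the Cortex-M0 and 2 cycles on the Cortex-M0+, that a not-taken branch and a `subs` each cost 1 cycle, and that memory has zero wait states. The function and constant names are made up for illustration.

```c
#include <stdint.h>

/* Assumed cost of a taken conditional branch, in cycles (zero wait states). */
enum {
    TAKEN_BRANCH_CYCLES_M0     = 3,
    TAKEN_BRANCH_CYCLES_M0PLUS = 2
};

/*
 * Estimated cycles for a countdown loop of the form
 *
 *     loop: subs r0, #1    ; 1 cycle
 *           bne  loop      ; taken_branch_cycles if taken, 1 cycle if not
 *
 * `iterations` is the initial value of r0 and is assumed to be at least 1.
 * The branch is taken on every pass except the last one.
 */
uint32_t delay_loop_cycles(uint32_t iterations, uint32_t taken_branch_cycles)
{
    uint32_t subs_cycles      = iterations;                          /* one subs per pass */
    uint32_t taken_cycles     = (iterations - 1u) * taken_branch_cycles;
    uint32_t not_taken_cycles = 1u;                                  /* final fall-through */
    return subs_cycles + taken_cycles + not_taken_cycles;
}
```

Under these assumptions, a 1000-iteration loop comes out at 3998 cycles with the Cortex-M0 figure and 2999 cycles with the Cortex-M0+ figure, so branch-heavy code such as tight loops benefits the most.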

The cherry on the cake: the processor remains 100% compatible with the Cortex-M0, as they share the very same programmer's model: same instruction set and same interrupt management. It is also de facto upwards compatible with the Cortex-M3 and Cortex-M4 processors.

3 - Forget low power

Low power, ultra low power, pico power, what's next? Marketing imagination has been very creative in recent years, but what counts in the end for most designs is energy efficiency. One should consider the intrinsic energy efficiency of the processor while keeping an equal focus on efficiency at the system level: everything must match.

The 2-stage pipeline, combined with the ingenuity of our engineers and their latest best practices, led to an overall 30% reduction in dynamic power compared to the Cortex-M0, while delivering a nice performance uplift in the range of 7 to 9% depending on the selected benchmark. Combining both improvements, the Cortex-M0+ is more than 40% more energy efficient when running the same task.
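The way the two figures combine can be checked with a few lines of C. This is only a back-of-the-envelope sketch: it assumes both processors run at the same clock frequency and voltage, that the 30% figure applies to the power drawn while running, and it reads "energy efficient" as work done per unit of energy. The function names are made up for illustration.

```c
/*
 * Energy needed to complete one task, relative to the baseline processor.
 * Energy = power x time, and the task time shrinks by the speedup factor.
 *   power_ratio: new power / baseline power (0.70 for a 30% reduction)
 *   speedup:     new performance / baseline performance (1.07 to 1.09)
 */
double relative_energy_per_task(double power_ratio, double speedup)
{
    return power_ratio / speedup;
}

/* Work done per unit of energy, relative to the baseline (inverse of the above). */
double relative_efficiency(double power_ratio, double speedup)
{
    return speedup / power_ratio;
}
```

With a power ratio of 0.70, the energy per task comes out at roughly 0.65 of the baseline for a 7% speedup and 0.64 for a 9% speedup, which corresponds to about 1.53 to 1.56 times the work per unit of energy. Under these assumptions the "more than 40%" claim holds with some margin.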

The Cortex-M0+ also helps to keep power consumption low at the system level with a variety of features. These features are mostly transparent to the user, but one key contribution is to limit accesses to memories, which, together with the processor, are among the largest contributors to the energy budget.

4 - Giving bugs a hard time

When the Cortex-M0 was released in 2009, one of the main objectives was to offer an ARM Cortex-M compatible processor within an area and power envelope equivalent to a typical 8-bit processor (e.g. 8051), so that developers could use a common architecture and tools infrastructure to address different levels of requirements. Applications in this segment tended to be an order of magnitude less complex than programs running on a Cortex-M3, so it was decided that breakpoints and watchpoints would be sufficient for debug purposes and that program trace was too large for that footprint. It is worth noting that the trace module of the Cortex-M3 is roughly two thirds the size of a minimal configuration Cortex-M0!

In reality, developers who embraced the new Cortex-M0 products did not limit themselves to porting their code from 8-bit MCUs, but took advantage of the roughly 10 times higher performance to implement more complex algorithms, communication stacks and higher exception loads. It became apparent that these developers would benefit from a more area-optimized trace mechanism, and this was addressed in the Cortex-M0+ with the creation of the Micro Trace Buffer (MTB). The MTB offers a lightweight trace capability, roughly 5 times smaller than that of the Cortex-M3, at the cost of some reduction in capabilities such as the lack of real-time streaming.

The MTB stores information on non-sequential program flow (2 words per branch) in an internal RAM, which can either be shared with the application or be separate. Once tracing stops, the debugger retrieves the data via the standard Serial Wire Debug interface (no additional pins needed) and rebuilds the program execution flow.
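To make the "2 words per branch" idea concrete, here is a minimal sketch in C of how a host-side tool might split a raw trace dump into branch records. The layout is an assumption based on my reading of the MTB documentation: the first word of each pair holds the branch source address and the second holds the destination, with bit 0 of each word used as a flag (the A-bit and S-bit respectively) because Thumb instructions are at least halfword aligned. The type and function names are hypothetical, and handling of the circular write pointer is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded branch record. Names are illustrative, not from an ARM header. */
typedef struct {
    uint32_t source;       /* address the branch was taken from */
    uint32_t destination;  /* address execution continued at */
    bool     a_bit;        /* assumed flag in bit 0 of the first word */
    bool     s_bit;        /* assumed flag in bit 0 of the second word */
} mtb_packet_t;

/*
 * Decode consecutive two-word packets from a linear dump of the trace RAM.
 * Returns the number of packets written to `out`. A trailing odd word is
 * ignored, and decoding stops once `out_capacity` packets are produced.
 */
size_t mtb_decode(const uint32_t *words, size_t word_count,
                  mtb_packet_t *out, size_t out_capacity)
{
    size_t n = 0;
    for (size_t i = 0; i + 1 < word_count && n < out_capacity; i += 2) {
        out[n].source      = words[i] & ~UINT32_C(1);       /* drop flag bit */
        out[n].a_bit       = (words[i] & 1u) != 0;
        out[n].destination = words[i + 1] & ~UINT32_C(1);   /* drop flag bit */
        out[n].s_bit       = (words[i + 1] & 1u) != 0;
        n++;
    }
    return n;
}
```

A debugger would then walk the decoded records in order and fill in the sequential instructions between one destination and the next source from the program image.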

5 - Last but not least

If I had to pick the one feature I like most, and that would not be easy, the fast I/O port would be on my shortlist.

The Cortex-M0 and Cortex-M0+ are based on a von Neumann architecture and rely on a single AHB-Lite bus. It is a good trade-off in the low end for area, power and performance, but it may introduce some latency and undesired jitter as the bus is shared for multiple uses. Some applications require fast and timing-accurate accesses. This is where the fast I/O port comes into action, adding a dedicated single-cycle access port to the Cortex-M0+. It is reserved for data accesses and we recommend its use for time-critical GPIOs, registers, hardware accelerators or an SRAM reserved for compute-intensive algorithms. For example, a communication protocol fully implemented in software and directly driving GPIOs will take great advantage of this port.
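As an illustration of the kind of code that benefits, here is a minimal bit-banging sketch in C that shifts a byte out on a clock and a data pin. Everything device-specific is an assumption: the pin positions are hypothetical, the routine owns the whole output register, and `port_out` stands for whatever GPIO output register a given vendor maps onto the fast I/O port.

```c
#include <stdint.h>

/* Hypothetical pin assignment within one 32-bit GPIO output register. */
#define PIN_CLK  (UINT32_C(1) << 0)
#define PIN_DATA (UINT32_C(1) << 1)

/*
 * Shift one byte out, most significant bit first. Data changes while the
 * clock is low, and the receiver is expected to sample on the rising edge.
 * Each store overwrites the whole register, so this sketch assumes no other
 * pins of the port are in use.
 */
void bitbang_send_byte(volatile uint32_t *port_out, uint8_t value)
{
    for (int bit = 7; bit >= 0; bit--) {
        uint32_t level = ((value >> bit) & 1u) ? PIN_DATA : 0u;
        *port_out = level;             /* clock low, data valid */
        *port_out = level | PIN_CLK;   /* rising clock edge */
    }
    *port_out &= ~PIN_CLK;             /* leave the clock idle low */
}
```

If the register sits on the fast I/O port, each of these stores would complete in a single cycle regardless of other bus traffic, which is what keeps the edge timing free of jitter. On a host machine the function can be exercised by passing the address of an ordinary `volatile uint32_t` variable.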

Additional takeaway: when an instruction performs an access via the fast I/O port, the AHB-Lite bus is free and the processor takes advantage of this to load forthcoming instructions into its internal pre-fetch buffer (usually two instructions per access, as most instructions are 16-bit encoded). This "Harvard-like" behaviour delivers extra performance.

Take a look at this blog by jyiu for more details on the Cortex-M0+ processor.
