Arm Cortex-M55 and Ethos-U55 Processors: Extending the Performance of Arm’s ML Portfolio for Endpoint Devices

February 10, 2020

The advent of artificial intelligence (AI) is creating a wealth of opportunities ranging from better user experiences with consumer products to automated quality control on factory floors – and this list of AI-driven use-cases is growing exponentially. The performance capabilities driving these devices are underpinned by innovative signal processing and machine learning (ML) techniques. But the challenge is: how should a system architect go about supporting such techniques in System on Chip (SoC) designs?

Furthermore, IoT and embedded applications that are limited by power, cost and size need computational efficiency more than raw computational performance. This means that the chosen SoC architecture needs to map closely to the overall requirements of the target application, minimizing silicon area and device cost. This poses a challenge for silicon designers aiming to differentiate their microcontrollers with the greatest level of intelligence possible.

Expanding Arm's compute technologies for endpoint AI

This was the context that drove the development of the Arm Cortex-M55 and Ethos-U55 processors - industry-changing technology that delivers efficient on-device processing for endpoint AI. Arm is enhancing the performance capability of its AI platform for microcontroller-based devices, offering IP that will enable longer battery life, greater privacy, richer applications, and faster response times.

This new technology provides an unprecedented ML performance uplift of up to 480x compared to existing Cortex-M processors, and a simplified development toolchain for software developers. Developers are already deploying ML to Cortex-M-based devices today, and they can now seamlessly port that work to this new IP. All of this is supported by existing open-source software libraries, such as CMSIS-NN and CMSIS-DSP, and existing neural network (NN) frameworks, such as TensorFlow Lite for Microcontrollers. ML is not ‘one size fits all’, so we have designed these technologies to empower developers to scale seamlessly across applications that require signal processing, classical ML, or neural network techniques.

This new offering includes:

Arm Cortex-M55 processor

Arm’s most AI-capable Cortex-M processor, offering up to 15x ML performance improvement and up to 5x signal processing performance uplift compared to existing Cortex-M processors.
The first Arm Cortex-M processor that includes Arm Helium technology, an extension of the Armv8.1-M architecture, bringing over 150 new scalar and vector instructions.
Scalable as a standalone processor for ML applications or with the Ethos-U55, when more complex workloads and increased efficiency are needed.
Arm Custom Instructions to extend the processor’s capabilities for workload-specific optimization (available in 2021).

Arm Ethos-U55 microNPU

The industry’s first microNPU designed to work with the Cortex-M, including the Cortex-M55, Cortex-M33, Cortex-M7, and Cortex-M4 processors.
Best paired with the Cortex-M55 processor, delivering up to a combined 480x uplift in ML performance over existing Cortex-M processors.
Provides an extra 32x uplift in performance over the Cortex-M55.
Highly configurable, supporting implementations from 32-256 MACs for differentiation across various markets.
A single toolchain for Ethos-U55 and Cortex-M eases developer use and creation of AI applications.

Corstone-300 reference design

The fastest way to incorporate the Cortex-M55 processor, with or without the Ethos-U55 processor, into an SoC design.
Makes chip-level security easier, faster, and more robust with the system-wide implementation of TrustZone for Armv8-M.
Simplifies software development with out-of-the-box support in the open-source Trusted Firmware-M (TF-M), offering an accelerated route to PSA Certified.
Design confidently with FPGA and Fixed Virtual Prototyping (FVP) platforms based on Corstone-300.

Figure 1: Corstone-300 reference design, including the Cortex-M55 and Ethos-U55 processors.

Example use case: enabling an efficient, voice-assisted world

These new compute technologies have the capabilities to transform endpoint IoT and embedded use cases of the future. Let’s explore one example:

While today’s microcontrollers can be used in voice-enabled devices for keyword detection, localized speech recognition requires offloading of some compute tasks to the cloud.

Figure 2: Typical ML workload for a voice assistant. The Cortex-M55 and Ethos-U55 processors offer faster inference speeds and higher energy efficiency.

However, with the availability of Cortex-M55 and Ethos-U55 hardware, more compute can be done on-device, enabling local voice command processing and (to some extent) automatic speech recognition (ASR). On-device processing, or endpoint AI, offers faster response times, reduced energy consumption, and greater privacy by limiting the need to send bulky voice data to the cloud for inferencing. Figure 2, for example, demonstrates the significant decrease in latency and energy consumption when using the Cortex-M55 and Ethos-U55 processors.

Now, let’s explore this new IP in more detail.

Cortex-M55 processor: adding energy-efficient signal processing and ML capabilities to a general-purpose processor

The new Arm Cortex-M55 processor offers greater on-device performance and ease of use, bringing endpoint AI to billions more devices and developers. The Cortex-M55 is Arm’s most AI-capable Cortex-M processor and the first to feature Arm Helium vector processing technology, bringing enhanced, energy-efficient signal processing and ML performance. But what impact does Helium technology have on the Cortex-M55 processor?

Helium introduces the concept of a beat, which corresponds to 32 bits’ worth of arithmetic operations. In the Cortex-M55 processor, we chose to build the architecture around a ‘dual-beat per tick’ implementation of Helium, with 2x32=64 bits’ worth of compute per ‘tick’ (processor cycle). By keeping the overall memory data bus width to 64 bits and scaling the datapath execution units accordingly, we keep a lid on system cost and energy usage, while still delivering a significant uplift in signal processing and ML compute performance.
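The beat arithmetic above can be sketched in a few lines of Python. This is only an illustration: it assumes Helium's 128-bit vector registers (a detail not stated above), so one full-width vector instruction spans four 32-bit beats. The function and constant names are hypothetical.

```python
BEAT_BITS = 32      # one beat = 32 bits' worth of arithmetic (from the text)
VECTOR_BITS = 128   # assumption: width of a Helium vector register


def cycles_per_vector_instruction(beats_per_tick: int) -> int:
    """Ticks needed to complete one full-width vector instruction."""
    beats_per_vector = VECTOR_BITS // BEAT_BITS
    # Ceiling division, in case the beats do not divide evenly.
    return -(-beats_per_vector // beats_per_tick)


# Dual-beat implementation, as described for the Cortex-M55.
bits_per_tick = 2 * BEAT_BITS
dual_beat_cycles = cycles_per_vector_instruction(2)
```

Under these assumptions, a dual-beat design moves 64 bits per tick and takes two ticks per vector instruction, while a hypothetical single-beat design would take four.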

Building efficient embedded systems also means supporting multiple data types. This allows the developer to optimize memory usage, while achieving the required algorithmic performance. To this end, the Cortex-M55 processor supports both vector and scalar processing with 8-bit, 16-bit and 32-bit integer datatypes. It provides native vector and scalar operations with half (fp16) and full (fp32) precision floating-point datatypes, and it also supports native scalar double-precision (fp64) operations. Native support for half-precision floating-point is new for the Cortex-M family. These data formats are of value in certain audio, sensor, and ML applications.
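To make the memory trade-off concrete, here is a minimal sketch that compares the storage needed for the same buffer in each of the vector datatypes listed above. The buffer size of 100,000 elements is a hypothetical figure chosen only for illustration, and the helper name is an assumption.

```python
# Storage size in bytes for each vector datatype mentioned in the text.
BYTES_PER_ELEMENT = {
    "int8": 1,
    "int16": 2,
    "fp16": 2,
    "int32": 4,
    "fp32": 4,
}


def buffer_bytes(num_elements: int, dtype: str) -> int:
    """Bytes needed to store num_elements values of the given datatype."""
    return num_elements * BYTES_PER_ELEMENT[dtype]


NUM_ELEMENTS = 100_000  # hypothetical buffer size (e.g. model weights)
footprint = {dtype: buffer_bytes(NUM_ELEMENTS, dtype) for dtype in BYTES_PER_ELEMENT}
```

In this sketch, moving from fp32 to fp16 halves the footprint and moving to int8 quarters it. Whether the reduced precision is acceptable depends on the algorithm.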

To optimize silicon area and energy usage, the register bank in the floating-point unit (FPU) is reused for vector processing. Our internal studies over a broad range of critical DSP and ML routines proved that sharing FPU and vector registers does not compromise performance.

To see how a typical arithmetic operation works on the Cortex-M55, consider the instruction VRMLALVHA.S32 (vector MAC with 32-bit integers and 64-bit accumulation), one of over 150 new scalar and vector instructions that are supported by the new Armv8.1-M instruction set. With two 32-bit multipliers, which are fed by two 32-bit buses, VRMLALVHA.S32 allows the Cortex-M55 processor to carry out two 32x32 MACs per cycle, with dual-issuing of the associated data moves.

The flexible nature of the Cortex-M55 arithmetic units means that this throughput increases as datatypes become more compact. With 16-bit integer or fp16 datatypes, for example, four 16-bit data values can be transferred per cycle, resulting in 4 MACs per cycle. Similarly, with the 8-bit integer datatypes that are commonly used in ML, throughput increases to 8 MACs per cycle.
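The scaling described above follows directly from dividing the 64 bits processed per cycle by the element width. The sketch below reproduces that arithmetic and derives an idealized cycle count for a dot product. It is a back-of-the-envelope model only: it ignores load latency, loop overhead and memory stalls, and the function names are hypothetical.

```python
DATAPATH_BITS = 64  # bits of compute per cycle (dual-beat, from the text)


def macs_per_cycle(element_bits: int) -> int:
    """Peak multiply-accumulates per cycle for a given element width."""
    return DATAPATH_BITS // element_bits


def ideal_dot_product_cycles(length: int, element_bits: int) -> int:
    """Lower bound on cycles for a dot product of `length` elements."""
    per_cycle = macs_per_cycle(element_bits)
    return -(-length // per_cycle)  # ceiling division


peak = {bits: macs_per_cycle(bits) for bits in (32, 16, 8)}
```

For a 256-element dot product, this model gives a lower bound of 128 cycles with 32-bit data and 32 cycles with 8-bit data. Measured figures on real hardware will be higher.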

An efficient architecture for constrained embedded systems also means mapping the hardware as closely as possible to the target application, so we built considerable configurability into the Cortex-M55 processor.

Performance and energy-efficiency

Let’s look at how the Cortex-M55 performs over a range of typical DSP kernels and datatypes:

Figure 3: Average CMSIS-DSP kernel performance per datatype relative to the Cortex-M55 processor

The design goal of a significant increase in performance has been met. But what about the power efficiency goal? Over a selection of key DSP kernels, the Cortex-M55 achieves eight times greater power efficiency than the Cortex-M7 processor, based on the latest power simulation results for CMSIS-DSP.

The Cortex-M55 processor takes AI on Cortex-M to the next level. We are excited for this technology to bring enhanced, energy-efficient signal processing and ML performance to the next generation of IoT devices.

Ethos-U55 microNPU: increasing ML workload performance by up to 480x

The Ethos-U55 is Arm’s first microNPU (Neural Processing Unit) designed for microcontroller-class devices. It integrates fully with a single Cortex-M toolchain, providing an exceptional performance uplift without added software complexity. The Ethos-U55 offers an extra 32x ML performance boost over the Cortex-M55 for more demanding ML systems, so together, they increase ML workload performance by up to 480x compared to previous Cortex-M generations. Ethos-U55 configurations can be as small as 0.1 mm² in 16nm, ideal for AI applications in cost-sensitive and energy-constrained devices.
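The 480x figure is the product of the two headline uplifts quoted in this post: up to 15x for the Cortex-M55 over existing Cortex-M processors, and up to a further 32x for the Ethos-U55 over the Cortex-M55. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
CORTEX_M55_UPLIFT = 15       # "up to 15x" ML uplift over existing Cortex-M
ETHOS_U55_EXTRA_UPLIFT = 32  # "up to 32x" further uplift over Cortex-M55


def combined_uplift(*factors: int) -> int:
    """Successive speed-ups compound multiplicatively."""
    total = 1
    for factor in factors:
        total *= factor
    return total


combined = combined_uplift(CORTEX_M55_UPLIFT, ETHOS_U55_EXTRA_UPLIFT)
```

Both inputs are "up to" figures, so the combined number is likewise a best case rather than a typical result. Actual uplift depends on the workload.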

How do you get started with software?

This new technology integrates signal processing and machine learning development in a unified toolchain for greater productivity and ease of use. The processors work with existing libraries, CMSIS-NN and CMSIS-DSP, for neural networks, signal processing, and classical ML, as well as with common ML frameworks, such as TensorFlow Lite Micro. This makes it vastly easier and quicker to design, develop, and maintain AI-based IoT applications with the lowest risk and cost possible.

As an example, developers can take their existing TensorFlow Lite models and run them with Arm’s modified TensorFlow Lite Micro runtime. The modifications include an offline optimizer that does automatic graph partitioning, scheduling, and optimizations. These simple additions make it easy to run ML on a heterogeneous system, as developers do not have to make any modifications to their networks – it just works.
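To give a feel for what graph partitioning means here, the toy sketch below splits a linear list of operators into consecutive segments that run either on the microNPU or on the CPU. It is not Arm's optimizer: the supported-operator set, the operator names, and the `partition` function are all hypothetical, and a real tool works on full graphs and also handles scheduling and memory planning.

```python
from itertools import groupby

# Hypothetical set of operators the accelerator can run. Real operator
# support depends on the NPU and the tooling version.
NPU_SUPPORTED_OPS = {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED", "ADD"}


def target_for(op: str) -> str:
    """Pick the execution target for a single operator."""
    return "npu" if op in NPU_SUPPORTED_OPS else "cpu"


def partition(ops: list[str]) -> list[tuple[str, list[str]]]:
    """Group consecutive operators that share a target into segments."""
    return [(target, list(group)) for target, group in groupby(ops, key=target_for)]


# Hypothetical model: a linear chain of operators.
model_ops = ["CONV_2D", "DEPTHWISE_CONV_2D", "UNSUPPORTED_OP", "FULLY_CONNECTED"]
segments = partition(model_ops)
```

In this toy model the two convolutions form one NPU segment, the unsupported operator falls back to the CPU, and the final layer returns to the NPU. The developer's network definition is unchanged, which is the point the text makes.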

Corstone-300 reference design: accelerating time to market

Designing a secure SoC is challenging and time-consuming, so to help designers get to market quickly, Arm provides the Corstone-300 reference design: the fastest way to build a system with the Cortex-M55 processor, with the option to easily integrate the Ethos-U55 processor on an expansion interface. It contains various system IP components and a reference design for architecting a system, with power management features integrated to help balance trade-offs between performance and power.

Security capability is built in at the heart, as the Corstone-300 system architecture is designed with Arm TrustZone security for hardware-enforced isolation. Corstone-300 also simplifies software development with easier porting of open-source TF-M, accelerating the route to PSA Certified.

For more details about Corstone-300, please refer to:

Corstone-300 web page

Securely scale AI to the most power-constrained devices

The greatest potential for the next computing revolution lies in scaling AI to the billions of smaller, power-constrained endpoint devices. Innovative signal processing and ML techniques open new opportunities for SoC architects to deliver these new levels of efficient AI performance for microcontrollers.

The Cortex-M55 processor, Ethos-U55 microNPU, and Corstone-300 reference design, together with Arm’s industry-leading embedded ecosystem of software libraries and tools, will bring efficient endpoint AI to billions of devices, removing barriers to ML adoption and deployment. Arm’s new compute technologies extend the performance of Arm’s AI platform for endpoint devices, offering silicon providers a more diverse range of hardware choices and empowering developers to deliver this next revolution in computing.

For more information about this technology, watch our latest webinar by clicking on the link below.

Watch Webinar
