The advent of artificial intelligence (AI) is creating a wealth of opportunities ranging from better user experiences with consumer products to automated quality control on factory floors – and this list of AI-driven use-cases is growing exponentially. The performance capabilities driving these devices are underpinned by innovative signal processing and machine learning (ML) techniques. But the challenge is: how should a system architect go about supporting such techniques in System on Chip (SoC) designs?
Furthermore, IoT and embedded applications that are limited by power, cost and size need computational efficiency more than raw computational performance. This means that the chosen SoC architecture needs to map closely to the overall requirements of the target application, minimizing silicon area and device cost. This poses a challenge for silicon designers aiming to differentiate their microcontrollers with the greatest level of intelligence possible.
This was the context that drove the development of the Arm Cortex-M55 and Ethos-U55 processors - industry-changing technology that delivers efficient on-device processing for endpoint AI. Arm is enhancing the performance capability of its AI platform for microcontroller-based devices, offering IP that will enable longer battery life, greater privacy, richer applications, and faster response times.
This new technology provides an unprecedented ML performance uplift of up to 480x compared to previous Cortex-M processors, and a simplified development toolchain for software developers. Developers are already deploying ML to Cortex-M-based devices today, and they can now seamlessly port that work to this new IP. All of this is supported by existing open-source software libraries, such as CMSIS-NN and CMSIS-DSP, and existing neural network (NN) frameworks, such as TensorFlow Lite for Microcontrollers. There is no 'one size fits all' approach in ML, so we have designed these technologies to let developers scale seamlessly across applications that require signal processing, classical ML, or neural network techniques.
This new offering includes:
Figure 1: Corstone-300 reference design, including the Cortex-M55 and Ethos-U55 processors.
These new compute technologies have the capabilities to transform endpoint IoT and embedded use cases of the future. Let’s explore one example:
While today’s microcontrollers can be used in voice-enabled devices for keyword detection, localized speech recognition requires offloading of some compute tasks to the cloud.
Figure 2: The Cortex-M55 and Ethos-U55 processors offer faster inference speeds and higher energy efficiency.
However, with the availability of Cortex-M55 and Ethos-U55 hardware, more compute can be done on-device, enabling local voice command processing and (to some extent) automatic speech recognition (ASR). On-device processing, or endpoint AI, offers faster response times, reduced energy consumption, and greater privacy by limiting the need to send bulky voice data to the cloud for inferencing. Figure 2, for example, shows the significant reduction in latency and energy consumption when using the Cortex-M55 and Ethos-U55 processors.
Now, let’s explore this new IP in more detail.
The new Arm Cortex-M55 processor offers greater on-device performance and ease of use, bringing endpoint AI to billions more devices and developers. The Cortex-M55 is Arm’s most AI-capable Cortex-M processor and the first to feature Arm Helium vector processing technology, bringing enhanced, energy-efficient signal processing and ML performance. But what impact does Helium technology have on the Cortex-M55 processor?
Helium introduces the concept of a beat, which corresponds to 32 bits’ worth of arithmetic operation. In the Cortex-M55 processor, we chose to build the architecture around a ‘dual-beat per tick’ implementation of Helium, with 2x32=64 bits’ worth of compute per ‘tick’ (processor cycle). By keeping the overall memory data bus width at 64 bits and scaling the datapath execution units accordingly, we keep the lid on system cost and energy usage, while still delivering a significant uplift in signal processing and ML compute performance.
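The beat arithmetic above can be sketched in a few lines of Python. This is only a toy model, not a cycle-accurate description of the processor: the 128-bit vector register width is background knowledge about Helium rather than something stated in this article, and the function names are made up for illustration.

```python
# Toy model of Helium "beats" (illustrative only, not cycle-accurate).
VECTOR_BITS = 128    # assumed Helium vector register width (not stated in the article)
BEAT_BITS = 32       # one beat = 32 bits' worth of arithmetic
BEATS_PER_TICK = 2   # Cortex-M55 'dual-beat per tick' implementation


def beats_per_vector_op(vector_bits=VECTOR_BITS, beat_bits=BEAT_BITS):
    """Number of beats needed to process one full vector."""
    return vector_bits // beat_bits


def ticks_per_vector_op(beats_per_tick=BEATS_PER_TICK):
    """Ticks needed for one vector operation, ignoring overlap and stalls."""
    beats = beats_per_vector_op()
    return -(-beats // beats_per_tick)  # ceiling division


def bits_per_tick(beats_per_tick=BEATS_PER_TICK, beat_bits=BEAT_BITS):
    """Bits of arithmetic completed in one processor cycle."""
    return beats_per_tick * beat_bits


if __name__ == "__main__":
    print("beats per vector op:", beats_per_vector_op())
    print("ticks per vector op:", ticks_per_vector_op())
    print("bits per tick:", bits_per_tick())
```

Under these assumptions, a single-beat implementation would need twice as many ticks per vector operation, which is the kind of cost and performance trade-off the beat concept is meant to expose.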
Building efficient embedded systems also means supporting multiple data types. This allows the developer to optimize memory usage, while achieving the required algorithmic performance. To this end, the Cortex-M55 processor supports both vector and scalar processing with 8-bit, 16-bit and 32-bit integer datatypes. It provides native vector and scalar operations with half (fp16) and full (fp32) precision floating-point datatypes. It also supports native scalar double-precision (fp64) operations. Native support for half-precision floating-point is new for the Cortex-M family. These data formats are of value in certain audio, sensor, and ML applications.
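To make the memory versus precision trade-off concrete, here is a minimal host-side sketch using Python's standard `struct` module. It runs on a development machine, not on the Cortex-M55, and the helper names and the 1024-sample buffer size are arbitrary choices for illustration.

```python
import struct

# struct format codes for the datatypes discussed above.
FORMATS = {
    "int8": "b", "int16": "h", "int32": "i",
    "fp16": "e", "fp32": "f", "fp64": "d",
}


def buffer_bytes(dtype, n_samples):
    """Bytes needed to store n_samples values of the given datatype."""
    return struct.calcsize("<" + FORMATS[dtype]) * n_samples


def round_trip(dtype, value):
    """Store a value in the given format and read it back."""
    code = "<" + FORMATS[dtype]
    return struct.unpack(code, struct.pack(code, value))[0]


if __name__ == "__main__":
    n = 1024  # hypothetical buffer length
    for dtype in ("int8", "int16", "int32", "fp16", "fp32", "fp64"):
        print(f"{dtype:>5}: {buffer_bytes(dtype, n)} bytes")
    for dtype in ("fp16", "fp32", "fp64"):
        print(f"0.1 stored as {dtype}: {round_trip(dtype, 0.1)!r}")
```

Halving the element width halves the buffer size, at the cost of a larger rounding error for the floating-point formats.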
To optimize silicon area and energy usage, the register bank in the floating-point unit (FPU) is reused for vector processing. Our internal studies over a broad range of critical DSP and ML routines proved that sharing FPU and vector registers does not compromise performance.
To see how a typical arithmetic operation works on the Cortex-M55, consider the instruction VRMLALVHA.S32 (vector MAC with 32-bit integers and 64-bit accumulation), one of over 150 new scalar and vector instructions supported by the new Armv8.1-M instruction set. With two 32-bit multipliers, which are fed by two 32-bit buses, VRMLALVHA.S32 allows the Cortex-M55 processor to carry out two 32x32 MACs per cycle, with dual-issuing of the associated data moves.
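As a rough illustration of what "MAC with 32-bit integers and 64-bit accumulation" means, the sketch below multiplies pairs of signed 32-bit values and sums the products into a signed 64-bit accumulator. It is a plain reference model with hypothetical function names. It does not reproduce the exact architectural behaviour of VRMLALVHA.S32, whose rounding and accumulator details are not modelled here.

```python
def to_signed(value, bits):
    """Wrap a Python int to a two's-complement signed integer of the given width."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value


def mac_across_vector(a, b, acc=0):
    """Multiply-accumulate across two sequences of 32-bit integers.

    Each product of two signed 32-bit inputs is added to a signed
    64-bit accumulator, which wraps on overflow.
    """
    for x, y in zip(a, b):
        product = to_signed(x, 32) * to_signed(y, 32)
        acc = to_signed(acc + product, 64)
    return acc


if __name__ == "__main__":
    print(mac_across_vector([1, 2, 3, 4], [5, 6, 7, 8]))
```

The wide accumulator is what lets many 32x32 products be summed without intermediate overflow, which matters for long filters and dot products.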
The flexible nature of the Cortex-M55 arithmetic units means that this throughput increases as datatypes become more compact. With 16-bit integer or fp16 datatypes, for example, four 16-bit data values can be transferred at once, resulting in 4 MACs per cycle. Similarly, with the 8-bit integer datatypes that are commonly used in ML, throughput increases to 8 MACs per cycle.
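The scaling described above follows directly from the 64 bits of compute available per cycle. The short sketch below reproduces the figures in the text. The function names and the 256-element example are illustrative, and the cycle estimate is an idealized lower bound that ignores loads, loop overhead, and stalls.

```python
DATAPATH_BITS_PER_CYCLE = 64  # dual-beat Helium: 2 x 32 bits per cycle


def macs_per_cycle(element_bits):
    """Peak MACs per cycle for a given element width."""
    return DATAPATH_BITS_PER_CYCLE // element_bits


def ideal_dot_product_cycles(n_elements, element_bits):
    """Idealized cycle count for an n-element dot product (lower bound)."""
    per_cycle = macs_per_cycle(element_bits)
    return -(-n_elements // per_cycle)  # ceiling division


if __name__ == "__main__":
    for name, bits in (("int32", 32), ("int16 / fp16", 16), ("int8", 8)):
        print(f"{name:>12}: {macs_per_cycle(bits)} MACs per cycle, "
              f"{ideal_dot_product_cycles(256, bits)} cycles for 256 elements")
```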
An efficient architecture for constrained embedded systems also means mapping the hardware as closely as possible to the target application, so we built considerable configurability into the Cortex-M55 processor.
Let’s look at how the Cortex-M55 performs over a range of typical DSP kernels and datatypes:
Figure 3: Average DSP kernel performance per datatype relative to the Cortex-M55 processor
The design goal of a significant increase in performance has been met. But what about the power efficiency goal? Over a selection of key DSP kernels, the Cortex-M55 achieves eight times greater power efficiency than the Cortex-M7 processor, according to the latest power simulation results for CMSIS-DSP.
The Cortex-M55 processor takes AI on Cortex-M to the next level. We are excited for this technology to bring enhanced, energy-efficient signal processing and ML performance to the next generation of IoT devices.
The Ethos-U55 is Arm’s first microNPU (Neural Processing Unit) designed for microcontroller-class devices. It integrates fully with a single Cortex-M toolchain, providing an exceptional performance uplift without added software complexity. The Ethos-U55 offers an extra 32x ML performance boost over the Cortex-M55 for more demanding ML systems, so together they increase ML workload performance by up to 480x compared to previous Cortex-M generations. Ethos-U55 configurations can be as small as 0.1 mm² in 16 nm, ideal for AI applications in cost-sensitive and energy-constrained devices.
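The two headline figures can be related with simple arithmetic. The sketch below assumes that the "up to" uplifts compose multiplicatively on the same workload, which the text implies but does not state outright. The derived Cortex-M55-only figure is therefore an inference, not a number quoted above.

```python
COMBINED_UPLIFT = 480  # Cortex-M55 + Ethos-U55 vs. previous Cortex-M (from the text)
NPU_OVER_M55 = 32      # Ethos-U55 vs. Cortex-M55 alone (from the text)


def implied_cpu_uplift(combined=COMBINED_UPLIFT, npu_over_cpu=NPU_OVER_M55):
    """Uplift attributable to the Cortex-M55 alone, if the factors multiply."""
    return combined / npu_over_cpu


if __name__ == "__main__":
    print(f"Implied Cortex-M55 uplift: up to {implied_cpu_uplift():g}x")
```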
This new technology integrates signal processing and machine learning development in a unified toolchain for greater productivity and ease of use. The processors work with existing libraries, such as CMSIS-NN and CMSIS-DSP, for neural networks, signal processing, and classical ML, as well as with common ML frameworks, such as TensorFlow Lite Micro. This makes it much easier and quicker to design, develop, and maintain AI-based IoT applications with the lowest risk and cost possible.
For example, developers can take their existing TensorFlow Lite models and run them with Arm’s modified TensorFlow Lite Micro runtime. The modifications include an offline optimizer that performs automatic graph partitioning, scheduling, and optimization. These additions make it easy to run ML on a heterogeneous system, because developers do not have to make any modifications to their networks.
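To give a feel for what automatic graph partitioning involves, here is a deliberately simplified sketch. It is not Arm's optimizer: the set of "supported" operators is made up for illustration and does not reflect actual Ethos-U55 operator coverage, and a real tool works on a full graph rather than a flat list of operator names.

```python
from itertools import groupby

# Hypothetical set of operators assumed to run on the NPU (illustration only).
NPU_SUPPORTED = {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED", "ADD"}


def partition(ops, supported=NPU_SUPPORTED):
    """Split a linear operator sequence into contiguous (target, ops) segments.

    Consecutive operators in the supported set are grouped into one NPU
    segment; everything else falls back to the CPU.
    """
    return [
        ("NPU" if on_npu else "CPU", list(group))
        for on_npu, group in groupby(ops, key=lambda op: op in supported)
    ]


if __name__ == "__main__":
    # Hypothetical model: operator names only, in execution order.
    model = ["CONV_2D", "DEPTHWISE_CONV_2D", "CUSTOM_OP", "FULLY_CONNECTED", "SOFTMAX"]
    for target, segment in partition(model):
        print(f"{target}: {segment}")
```

Because the split happens offline and per operator, the network definition itself stays untouched, which is the property the paragraph above highlights.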
Designing a secure SoC is challenging and time-consuming, so to help designers get to market quickly, Arm provides the Corstone-300 reference design: the fastest way to build a system with the Cortex-M55 processor, with the option to easily integrate the Ethos-U55 processor on an expansion interface. It contains various system IP components and a reference design for architecting a system, with power management features integrated to help balance trade-offs between performance and power.
Security is built in from the ground up: the Corstone-300 system architecture is designed with Arm TrustZone security for hardware-enforced isolation. Corstone-300 also simplifies software development with easier porting of the open-source Trusted Firmware-M (TF-M), accelerating the route to PSA Certified.
The greatest potential for the next computing revolution lies in scaling AI to the billions of smaller, power-constrained endpoint devices. Innovative signal processing and ML techniques open new opportunities for SoC architects to deliver these new levels of efficient AI performance for microcontrollers.
The Cortex-M55 processor, Ethos-U55 microNPU, Corstone-300 reference design, and Arm’s industry-leading embedded ecosystem of software libraries and tools support will bring efficient endpoint AI to the billions, removing barriers to ML adoption and deployment. Arm’s new compute technologies extend the performance of Arm’s AI platform for endpoint devices, offering silicon providers a more diverse range of hardware choices and empowering developers to deliver this next revolution in computing.
For more information about this technology, watch our latest webinar: https://www.brighttalk.com/webcast/17792/391867