Real-time embedded systems, such as Solid-State Drives (SSDs), have relied heavily on the proven 32-bit Arm Cortex-R5 and Cortex-R8 processors across generations of products. These systems have historically required less than 4GB of DRAM and addressable space and have not needed to run Linux. As storage capacities continue to grow and performance requirements rise to saturate ever-faster storage host interfaces, the 4GB limit and the inability to run Linux are adding complexity and, in some cases, becoming barriers.
There is a need for higher performance, real-time compute with more addressable space and the ability to run Linux to enable the next generation of computational storage devices.
The Cortex-R82 processor is optimized for systems where high-performance real-time is required. Designed to easily handle demanding workloads and provide more addressable space, it enables new capabilities of future storage devices, bringing:
Faster response time and reduced latency
The Cortex-R82 processor maintains its heritage as a classic Cortex-R real-time processor with:
The Cortex-R82 processor is a high performance, 64-bit real-time processor capable of addressing up to 1TB of address space to fulfill the requirements of growing capacities and emerging memory technologies.
The Cortex-R82 processor not only enables higher-performance storage devices; with Linux support, it also paves the way for simplified computational storage architectures and flexible SoC designs that can reallocate compute resources dynamically as workloads change or across different products. Cortex-R82 leverages the Linux ecosystem that has been ported, optimized, and validated on Arm, a development effort accelerated through the Linaro partnership that started in 2010. Linux, or any other high-level operating system that works on Arm Cortex-A series processors today, will work seamlessly on Cortex-R82.
The Cortex-R82 processor is able to run high-level operating systems, such as Linux, and other application code by including an optional Memory Management Unit (MMU).
The Cortex-R82 processor optionally supports Arm Neon technology for accelerating ML workloads that will be at the heart of computational storage applications.
Cortex-R82 is the first Armv8-R 64-bit processor that retains classic Cortex-R real-time compute but provides the higher compute performance needed to run new workloads such as machine learning (ML). It is also Arm’s first Cortex-R processor to support a trusted and robust ecosystem of rich operating systems and software components that already exist in the Linux and cloud development ecosystem.
The Cortex-R82 processor represents a significant uplift compared to Cortex-R8 and Cortex-R5, implementing a whole range of new features and enhancements. Let’s review a few key features in detail.
First Arm processor that can combine MPU and optional MMU for real-time and Linux
Cortex-R82 is the first Arm processor that combines both real-time contexts and MMU-based contexts in a single core.
In keeping with traditional Cortex-R real-time behavior, a Cortex-R82 core can still be configured with a Memory Protection Unit (MPU) to run bare-metal code and an RTOS. That same core can also be configured with an optional MMU to allow a high-level operating system, such as Linux, to execute. Both the real-time and MMU contexts can be handled by the same core simultaneously, or selected cores in a cluster can be dedicated to real-time or Linux, which increases the flexibility of an SoC design to accommodate multiple products and markets. This choice is made in software and can be changed at run time, so the balance can be adjusted according to demand.
Cortex-R82 has three Exception levels (ELs). EL2, the highest level, enables a Secure enclave and the separation and isolation of virtual machines for OEM code and customer code. More specifically, a Memory Protection Unit (MPU, real-time) context running at EL2 handles switches between MPU and MMU contexts at EL1, which run OEM and/or OS code, while user code runs at EL0. For example, Linux can be running when a real-time event occurs; the processor switches to handle the real-time event, then switches back to Linux. This isolation protects the main firmware and lets end customers of Cortex-R82 based devices add custom software, either real-time or Linux based.
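The switch between Linux and a real-time event described above can be pictured with a small host-side model. This is only a conceptual sketch: the names `ctx_kind_t`, `el2_state_t`, `el2_switch_to`, and `el2_handle_realtime_event` are hypothetical and are not part of any Arm API. Real EL2 software would save and restore system registers and reprogram the MPU or MMU rather than update a struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two kinds of EL1 context supervised by EL2. */
typedef enum { CTX_LINUX_MMU, CTX_REALTIME_MPU } ctx_kind_t;

typedef struct {
    ctx_kind_t current;   /* EL1 context currently running on the core */
    unsigned   switches;  /* number of context switches performed so far */
} el2_state_t;

/* Switch to `next` unless it is already running.
 * Returns true if a switch actually took place. */
bool el2_switch_to(el2_state_t *s, ctx_kind_t next)
{
    assert(s != NULL);
    if (s->current == next) {
        return false;
    }
    s->current = next;
    s->switches++;
    return true;
}

/* A real-time event preempts whatever is running, the (optional) handler
 * runs in the real-time context, and the previous context is resumed. */
void el2_handle_realtime_event(el2_state_t *s, void (*handler)(void))
{
    assert(s != NULL);
    ctx_kind_t resumed = s->current;
    el2_switch_to(s, CTX_REALTIME_MPU);
    if (handler != NULL) {
        handler();
    }
    el2_switch_to(s, resumed);
}
```

Starting from a state whose `current` is `CTX_LINUX_MMU`, one call to `el2_handle_realtime_event` performs two switches and leaves Linux running again, mirroring the sequence in the text.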
64-bit processor with 40-bit addressing to access up to 1TB of address space
Cortex-R82 is the first 64-bit real-time capable Arm processor with 40 address bits, which allow the processor to directly address up to 1TB. Direct addressability enables real-time systems with very large memories or device spaces and improves performance compared with windowing solutions. This large address space can be accessed either over AXI or CHI to enable additional capabilities including atomics and cache stashing.
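The arithmetic behind the 4GB and 1TB figures, and the extra bookkeeping a windowing scheme needs, can be sketched as follows. The helper names and the `windowed_addr_t` layout are illustrative assumptions, not a description of any particular controller; 1TB is taken here in the binary sense of 2^40 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes addressable with `bits` address bits (bits must be below 64). */
uint64_t addressable_bytes(unsigned bits)
{
    assert(bits < 64);
    return UINT64_C(1) << bits;
}

/* Hypothetical windowing scheme: a 40-bit address is reached through a
 * window-select value plus an offset inside the selected window. */
typedef struct {
    uint32_t window;  /* which window must be mapped in */
    uint32_t offset;  /* offset inside that window */
} windowed_addr_t;

windowed_addr_t to_windowed(uint64_t addr, unsigned window_bits)
{
    /* 8..32 bits keeps both fields within 32 bits for a 40-bit address. */
    assert(window_bits >= 8 && window_bits <= 32);
    assert(addr < addressable_bytes(40));
    windowed_addr_t w;
    w.window = (uint32_t)(addr >> window_bits);
    w.offset = (uint32_t)(addr & ((UINT64_C(1) << window_bits) - 1));
    return w;
}
```

`addressable_bytes(32)` is 4,294,967,296 (4GB) and `addressable_bytes(40)` is 1,099,511,627,776 (1TB). With direct 40-bit addressing the address is used as-is, whereas the windowed form requires selecting a window before every access that crosses a window boundary.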
Major performance uplift over Cortex-R8 on standard benchmarks and 2x on real partner code
The Cortex-R82 processor provides a performance uplift over Cortex-R8 on standard benchmarks and even higher uplift on actual partner code. Partner code execution is showing 74-125% performance uplift compared with Cortex-R8. The Cortex-R82 processor also provides a 21% performance uplift over Cortex-A55 when running SPECINT2006 benchmarks. The performance uplift satisfies the most demanding real-time embedded workloads and easily runs full Linux distributions.
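To relate the percentage uplift figures to the "2x" in the heading, an uplift can be converted into a speedup factor. The following is a minimal sketch of that arithmetic only; the function names are made up for illustration and nothing here reproduces how the benchmarks were measured.

```c
#include <assert.h>

/* Speedup factor for an uplift given in percent: 100% uplift is 2.0x. */
double speedup_from_uplift(double uplift_percent)
{
    assert(uplift_percent > -100.0);
    return 1.0 + uplift_percent / 100.0;
}

/* Run time for a fixed amount of work, relative to the baseline (1.0). */
double relative_runtime(double uplift_percent)
{
    return 1.0 / speedup_from_uplift(uplift_percent);
}
```

Under this definition, the 74% to 125% range corresponds to speedups of 1.74x to 2.25x, which brackets 2x.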
Neon for ML
The Cortex-R82 processor optionally includes the latest Neon instructions to greatly accelerate machine learning (ML) workloads with capabilities such as Dot Product support. This is especially useful for computational storage where the Arm Compute Library and Arm NN library can be accelerated by Neon, for example to search for a specific image in a drive full of images.
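As an illustration of the kind of kernel that Dot Product support accelerates, here is a minimal sketch of an 8-bit integer dot product, the core operation in many quantized ML layers. It uses the standard `arm_neon.h` intrinsics when the compiler reports AArch64 Dot Product support and falls back to plain C otherwise. The function name `dot_s8` is an assumption for this example, and the code is not taken from the Arm Compute Library or Arm NN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
#define HAVE_NEON_DOTPROD 1
#else
#define HAVE_NEON_DOTPROD 0
#endif

/* Dot product of two int8 vectors with a 32-bit accumulator.
 * The caller must keep n small enough that the sum fits in int32_t. */
int32_t dot_s8(const int8_t *a, const int8_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int32_t sum = 0;
    size_t i = 0;
#if HAVE_NEON_DOTPROD
    int32x4_t acc = vdupq_n_s32(0);
    for (; i + 16 <= n; i += 16) {
        /* Each 32-bit lane accumulates four int8 x int8 products. */
        acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
    }
    sum = vaddvq_s32(acc);  /* add the four lanes together */
#endif
    /* Scalar loop: handles the tail, or everything without Neon. */
    for (; i < n; i++) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}
```

On a Neon build, the vector loop processes 16 element pairs per iteration; the scalar loop produces the same result on any other target, which makes the function easy to check on a host machine.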
Read also our Guide to Computational Storage for more insight.
The ability to run both real-time and Linux on the same core or cluster of cores is key in emerging technologies such as computational storage. The real-time capability is required for the data transfers through the SSD, just as in traditional SSDs. Running Linux and its associated software tools directly on the drive facilitates computational workload management and filesystem recognition, so that computation can be performed and insight generated on the drive itself, greatly reducing data movement, latency, and energy consumption.
This same capability could be achieved with, for example, a cluster of Cortex-R8 cores for real-time and a cluster of Cortex-A cores for Linux, but the overall system architecture is simplified with Cortex-R82 since it can handle both. This reduces die size and cost and, most importantly, enables flexibility. The same SoC can be used for an ordinary enterprise SSD and reconfigured for a computational storage device (CSD) product, saving the large mask-set costs that creating multiple SoCs incurs on smaller process nodes. The same product can even be dynamically configured through software to run SSD functions during the day and switch to computational storage at night.
One storage controller tapeout for both pure storage and computational storage applications with Cortex-R82 cores
Adjusting the types of workload running on the storage controller based on external demands with Cortex-R82 cores
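The software-driven rebalancing described above could look something like the following sketch. Everything in it is a hypothetical illustration: the four-core cluster size, the `core_role_t` type, and the load-based policy are assumptions chosen for the example, not an Arm-defined mechanism.

```c
#include <assert.h>

#define CLUSTER_CORES 4u  /* assumed cluster size for this example */

typedef enum { ROLE_REALTIME, ROLE_LINUX } core_role_t;

/* Hypothetical policy: scale the number of real-time cores with host
 * I/O load (0..100 percent), always keeping at least one such core. */
unsigned rt_cores_for_load(unsigned io_load_percent)
{
    assert(io_load_percent <= 100u);
    unsigned n = (io_load_percent * CLUSTER_CORES + 99u) / 100u;  /* round up */
    return (n < 1u) ? 1u : n;
}

/* Give the first rt_cores cores to real-time work and the rest to Linux. */
void assign_roles(core_role_t roles[CLUSTER_CORES], unsigned rt_cores)
{
    assert(rt_cores >= 1u && rt_cores <= CLUSTER_CORES);
    for (unsigned i = 0; i < CLUSTER_CORES; i++) {
        roles[i] = (i < rt_cores) ? ROLE_REALTIME : ROLE_LINUX;
    }
}
```

With this policy, a busy daytime load of 100% keeps all four cores on real-time SSD work, while a quiet night-time load of 10% leaves one real-time core and frees three for Linux-based computational storage tasks.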
The Cortex-R82 processor provides a significant performance uplift over the Cortex-R8 processor.
Using Arm Compiler 6.14 at the O3 optimization level, the EEMBC Consumer benchmark is significantly improved thanks to the Neon SIMD instructions. Note that the generic benchmarks, which typically exercise only the core pipeline capabilities, do not all demonstrate the major system enhancements that greatly improve real-world applications. What really matters are the actual customer code benchmarks, which show a 74% to 125% improvement over Cortex-R8.
Performance measurements when running Linux with the MMU also show a 21% SPECINT2006 improvement over Cortex-A55 and a 23% improvement on SPECFP2006. These results from our performance model show a clear, significant uplift compared with the current high-efficiency Cortex-A cores.
Processor power, performance, and area are highly dependent on process, libraries, and optimizations. The following table estimates a typical four-core cluster implementation of the Cortex-R82 processor on mainstream low-power process technology (5 nm) with standard-performance cell libraries. Each core is configured with:
The processor cluster is configured with an integrated 1MB L2 shared cache.
Cortex-R82 four-core cluster
Maximum clock frequency: Above 1.8 GHz
Performance: 3.41 / 4.32 / 8.67 DMIPS/MHz*, 5.82 CoreMark/MHz**
Total area (including Cluster+Cores+RAM+Routing): From 2.0 mm²***
Efficiency: From 30 DMIPS/mW***
* Benchmark built with GCC 9.2. The first result abides by all of the 'ground rules' laid out in the Dhrystone documentation, the second permits inlining of functions (not just the permitted C string libraries), and the third additionally permits link-time optimizations. All results use version 2.1 of Dhrystone and ANSI-C-style function declarations.
** Benchmark built with Green Hills Software compiler 2020.1.4 using “-Ospeed -Omax -OI -OB -OV”, among others.
*** Preliminary estimates, subject to refinement once the product is released.
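For readers less familiar with the units above, the sketch below shows how a DMIPS/MHz figure relates to raw Dhrystone throughput. It relies on the conventional definition of 1 DMIPS as 1757 Dhrystones per second (the VAX 11/780 reference score). The function names are illustrative, and the arithmetic is only a unit conversion, not a statement about measured silicon.

```c
#include <assert.h>

/* Dhrystones per second of the VAX 11/780, defined as 1 DMIPS. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Convert a raw Dhrystone score (iterations per second) to DMIPS. */
double dmips_from_dhrystones(double dhrystones_per_sec)
{
    assert(dhrystones_per_sec >= 0.0);
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Scale a DMIPS/MHz figure to an absolute DMIPS value at a given clock. */
double dmips_at_clock(double dmips_per_mhz, double clock_mhz)
{
    assert(dmips_per_mhz >= 0.0 && clock_mhz >= 0.0);
    return dmips_per_mhz * clock_mhz;
}
```

For instance, `dmips_at_clock(3.41, 1800.0)` gives about 6138 DMIPS for the ground-rules figure at a 1.8 GHz clock.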
Hear how Cortex-R82 can meet the most demanding real-time workloads in the storage market at Arm DevSummit.
Arm has a suite of technologies and tools to support and speed up the development of Cortex-R82 based storage controllers while reducing risk. Arm Development Studio and Fast Models enable early hardware and software co-development, and Cycle Models allow custom benchmarking and performance optimization ahead of silicon availability. Arm training and design review services, together with Cortex-R82 Artisan® Physical IP and POP IP, accelerate time to market and reduce risk. Arm is developing a TSMC 7FF POP to deliver the best PPA required for Cortex-R82 use cases.
Cortex-R82, Arm’s first 64-bit Cortex-R processor, is accelerating the computational storage revolution.
Visit our website for more information on the Arm Computational Storage solution.