Is Cortex-M4 the Strongest?

March 19, 2016


[note] This is an English translation of my article that appeared in Interface magazine, published by CQ Publishing. I apologize for my poor English.

[1] Background of the strongest Cortex-M4.

[1-1] Silicon Vendors know -- Cortex-M4 is low power.

When we think of low-power Cortex-M CPUs, Cortex-M0 or M0+ usually comes to mind first. Since about 2015, however, vendors have increasingly adopted Cortex-M4 as the CPU for low-power microcontrollers (or sometimes Cortex-M3, as in the Blue Gecko).
For example:
- Apollo (Ambiq Micro)
- STM32L4/STM32L411 (STMicroelectronics)
- Gecko Series (e.g. Blue Gecko) (Silicon Laboratories)
- MSP432/CC2640 (Texas Instruments)
- Bio-Processor (Samsung)

The recent trend is to offer both low power and high performance by using Cortex-M4.

[1-2] Floating Point Unit is desirable.

One reason Ambiq Micro adopted Cortex-M4 is that, at the same power range, a Cortex-M4 delivers higher performance in IoT applications than the Cortex-M0 parts of its competitors [1]. The FPU is also a hidden advantage when porting MATLAB [5]. Ambiq Micro seems to think MATLAB will become a killer application of the IoT era.

The MSP432 from Texas Instruments (TI) is the successor to the MSP430, whose selling point is being an ultra-low-power 16-bit MCU. TI's reason for adopting Cortex-M4 was that microcontrollers will need much higher processing performance in both conventional industrial areas and future IoT-related areas [2]. There is also a comment that the performance of Cortex-M4 is about 10 times that of Cortex-M0+ [3].

Coincidentally (or naturally), the comments from Ambiq Micro and TI are almost the same. In other words, Cortex-M0/M0+ offers too little performance for IoT applications, and FPU or DSP features are welcome there.

o How about Cortex-M7?
If performance is mandatory, we can choose Cortex-M7. However, there is a suspicion that Cortex-M7 consumes more power while offering more performance than is needed. Higher performance also means a larger die, which goes against the vendors' wish to pack a lot of functionality into a small chip. Therefore Cortex-M4 (or M3 in some areas) would be the most appropriate.

[1-3] Experiment to measure Cortex-M4F's FPU performance.

Using real development boards, I measured the FPU performance of Cortex-M4. The boards are low-cost MCU boards from Freescale (now NXP): the FRDM-KL25Z (Cortex-M0+ based) and the FRDM-K64F (Cortex-M4F based). Since Cortex-M0+ has no FPU, its performance was measured with software floating-point emulation. Cortex-M4F was measured both with the hardware FPU and with software emulation. The software-emulation result on Cortex-M4F can be regarded as roughly equal to the floating-point performance of Cortex-M3.

The measurements used the internal SysTick timer to count CPU clock cycles, so the results show relative performance at the same clock frequency. The test suites were Whetstone and Linpack, which are well-known benchmarks for floating-point performance. The compiler was the EWARM compiler. Although the benchmark results vary with the number of matrix elements, this time the number of elements was (only) 50 for several reasons. The results are shown in Figure 1.
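As a rough sketch of the cycle counting described above: SysTick is a 24-bit down-counter, so the elapsed cycles are the start reading minus the end reading, plus one full period for every reload that happened in between. The function names and the `wraps` bookkeeping are assumptions made for illustration, not the code used for the measurement, and Python is used here only to show the arithmetic.

```python
SYSTICK_MAX_RELOAD = 0x00FFFFFF  # SysTick is a 24-bit down-counter


def elapsed_cycles(start, end, wraps=0, reload=SYSTICK_MAX_RELOAD):
    """Cycles elapsed between two SysTick current-value readings.

    start, end: counter values read before and after the benchmark.
    wraps:      how many times the counter reloaded in between
                (hypothetical bookkeeping, e.g. counted in an interrupt handler).
    """
    period = reload + 1  # cycles in one full count-down, including the reload step
    return (start - end) + wraps * period


def relative_performance(cycles_reference, cycles_candidate):
    """How many times faster the candidate is, at the same clock frequency."""
    return cycles_reference / cycles_candidate


print(elapsed_cycles(1000, 400))  # 600
```

Because both boards are compared by cycle counts, the ratio returned by `relative_performance` does not depend on the clock frequency each board actually runs at.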

On Cortex-M4, performance with the hardware FPU is about 60 to 80% higher than with software emulation. Compared with Cortex-M0+, Cortex-M4 is about 6 times faster. In addition, because Cortex-M0+ uses a 2-stage pipeline, it cannot reach clock speeds as high as Cortex-M4. If we compare Cortex-M4 and Cortex-M0+ including clock frequency, the rumor that Cortex-M4 is about 10 times faster than Cortex-M0+ looks plausible.
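The reasoning above multiplies a per-clock advantage by a clock-frequency advantage. A minimal sketch of that arithmetic follows; the 100 MHz and 60 MHz clocks are hypothetical values chosen only to illustrate the calculation, not measured or datasheet figures.

```python
def overall_speedup(per_clock_ratio, fast_clock_mhz, slow_clock_mhz):
    """Total speedup = per-clock speedup x clock-frequency ratio."""
    return per_clock_ratio * (fast_clock_mhz / slow_clock_mhz)


def clock_ratio_needed(target_speedup, per_clock_ratio):
    """Clock-frequency ratio required to reach a target total speedup."""
    return target_speedup / per_clock_ratio


# About 6x per clock (from the measurement above); clocks are hypothetical.
print(round(overall_speedup(6.0, 100.0, 60.0), 2))
# Clock ratio needed to turn a 6x per-clock advantage into 10x overall.
print(round(clock_ratio_needed(10.0, 6.0), 2))
```

Under these assumptions, a Cortex-M4 clocked roughly 1.7 times faster than the Cortex-M0+ would be enough to reach the quoted factor of about 10.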

[2] Is Cortex-M4's power consumption lower than Cortex-M0/M0+?

[2-1] Cortex-M4F and Cortex-M0 consume about the same power (when executing the same operation).

To provide a metric for low-power CPUs, EEMBC has released ULPBench [4]. On ULPBench, the MSP432 scores 167.4, which is much more efficient than its predecessor, the 16-bit MSP430, whose score is about 110 to 120. In addition, the MSP432 was the most efficient Cortex-M4 based MCU as of April 2015.

However, the ULPBench score of the Cortex-M0+ based SAM L21 J18A-UES (Atmel) is 185.8, so Cortex-M0+ may still be more efficient than Cortex-M4. A rough summary of the ULPBench results is shown in Figure 2.

Although Cortex-M0+ cannot reach high clock frequencies, Cortex-M0 can run at relatively high clock frequencies because its pipeline is longer (3 stages) than that of Cortex-M0+ (2 stages). For example, the DA14680 from Dialog Semiconductor's Wearable-on-Chip series uses a Cortex-M0 and runs as fast as 96 MHz, yet its power consumption is only 30 µA/MHz. The SAM L21 above consumes 100 µA/MHz. This means that Cortex-M0/M0+ remain valuable in applications that need simple functionality and ultra-low power, such as standalone sensors.
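To put the µA/MHz figures in perspective, the active current is simply the per-MHz figure multiplied by the clock frequency. The sketch below uses the DA14680 numbers quoted above; the 48 MHz clock for the SAM L21 is an assumption made for illustration, since the text does not state the frequency behind its 100 µA/MHz figure.

```python
def active_current_ma(ua_per_mhz, clock_mhz):
    """Active current in mA from a uA/MHz figure and a clock frequency in MHz."""
    return ua_per_mhz * clock_mhz / 1000.0


# DA14680: 30 uA/MHz at 96 MHz (figures quoted in the text).
print(active_current_ma(30, 96))  # 2.88
# SAM L21: 100 uA/MHz (from the text) at an assumed 48 MHz clock.
print(active_current_ma(100, 48))  # 4.8
```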

[2-2] Even though Cortex-M0/M0+ exist, Cortex-M4F could be still the strongest.

We should look at the power consumption (that is, the energy) per operation. For a given operation, the faster it is processed, the less energy it consumes. This means the higher performance of Cortex-M4 (or sometimes Cortex-M3) can make it superior to Cortex-M0/M0+ from the power consumption viewpoint.

This follows from ARM's official announcement that the Dhrystone or CoreMark performance per MHz of Cortex-M4 is about 45% higher than that of Cortex-M0. Both Cortex-M0 and Cortex-M4 have a 3-stage pipeline, so the performance difference derives from the instruction set architecture. As is well known, Cortex-M0 is Thumb compatible, while Cortex-M4 supports Thumb-2. Cortex-M4 may therefore achieve lower power consumption because it needs fewer instructions to perform a given operation.
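The argument can be made concrete with a simplified model, assuming active current scales linearly with clock frequency and ignoring sleep and leakage current. The current is then I = k * f and the run time is t = cycles / f, so the energy I * V * t reduces to k * V * cycles and the clock frequency cancels out. The 100 µA/MHz, 3.0 V, and 1000-cycle values below are hypothetical; only the 45% figure comes from the text.

```python
def energy_per_operation_pj(ua_per_mhz, supply_v, cycles):
    """Energy in picojoules for one operation.

    I = ua_per_mhz * f [uA] and t = cycles / f [us], so f cancels out:
    E = ua_per_mhz * supply_v * cycles  (uA * V * us = pJ).
    """
    return ua_per_mhz * supply_v * cycles


def break_even_current_ratio(perf_per_mhz_gain):
    """How much more current per MHz a faster core may draw for equal energy."""
    return 1.0 + perf_per_mhz_gain


baseline_cycles = 1000.0                  # hypothetical operation on the slower core
faster_cycles = baseline_cycles / 1.45    # 45% higher performance per MHz

print(energy_per_operation_pj(100.0, 3.0, baseline_cycles))        # 300000.0
print(round(energy_per_operation_pj(100.0, 3.0, faster_cycles)))   # fewer cycles, less energy
print(break_even_current_ratio(0.45))                              # 1.45
```

In this model, a core with 45% higher performance per MHz uses less energy per operation as long as its current per MHz is less than 1.45 times that of the slower core.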

As if to prove this, Ambiq Micro's Cortex-M4F based Apollo MCU scored 377.5 on ULPBench, and this is still the best score. Until then, the best score was 187.7, set by STMicro's Cortex-M4 based STM32L476. Second place is now held by Analog Devices' Cortex-M3 based ADuCM302x, with a score of 245.5. In practice, Cortex-M4 is more power-efficient than Cortex-M0/M0+.

[2-3] A what-if story: would Cortex-M0F be the strongest, if it existed?

The Cortex-M series is promoted on the strength of its scalable lineup from Cortex-M0 to Cortex-M7. That is true and an important selling point, but Cortex-M4 has been widely adopted in recent IoT devices and wearables. This suggests that the conventional wisdom that Cortex-M0 has the lowest power consumption has been forgotten. The main reason would be its lack of an FPU and DSP. If Cortex-M0 had an FPU and DSP, an MCU with the lowest power consumption and the highest performance might be born.

[2-4] So it is Cortex-M4F after all, isn't it?

The recent fashion is the low-power Cortex-M4, which implies that high performance (or clock frequency) is mandatory. In earlier days, applications that required a small die and low power adopted Cortex-M0 or Cortex-M0+. Since Cortex-M0+ is the successor to Cortex-M0, it was thought that the arrival of Cortex-M0+ would kill Cortex-M0.

However, Cortex-M0 has survived and is widely adopted in application areas that need no FPU or DSP, and it is Cortex-M0 rather than Cortex-M0+. The reason is that Cortex-M0 can run faster than 200 MHz but Cortex-M0+ cannot. This comes from its 3-stage pipeline, the same as Cortex-M4. Cortex-M0+, with its 2-stage pipeline, does not seem able to reach a 200 MHz clock frequency. Here we set aside the difference in ISA.

The main reason Cortex-M0 remains valuable would be its low power (apart from the ULPBench results), which derives from its small die size. I suspect that if Cortex-M0 had an FPU and DSP, the die size of such a Cortex-M0(F) would be about the same as that of Cortex-M4F, which might result in the same power consumption. In that sense, Cortex-M0F could be meaningless.

[3] The significance of Cortex-M7.

[3-1] To get more performance, are caches and TCMs needed?

To get more performance, built-in caches and TCMs might be needed. That could trigger a revival of ARM9, which would go against ARM's expectations. Therefore Cortex-M7 was born. The significance of Cortex-M7, however, is still unknown.

Today, the Cortex-M series is going its own way, which might not match ARM's expectations or roadmap. It may be time to reconsider the significance of Cortex-M.

<References>
[1] Subthreshold design at MCU-scale yields 10x energy efficiency.
http://www.electronics-eetimes.com/en/subthreshold-design-at-mcu-scale-yields-10x-energy-efficiency.html?cmp_id=7&news_id=222923565&vID=44#
[2] TI's 32-bit 'Successor' to the 16-bit MCU.
http://www.eetimes.com/document.asp?elq=cc9c541e84b142a8a92294c69eaea9c3&elqCampaignId=22285&elqaid=25047&elqat=1&elqTrackId=19651301ed71477bb7e9895fde1f0024&doc_id=1326109&page_number=1
[3] Announcing the MSP432, a low-power ARM Cortex-M4 microcontroller that carries on the MSP430 lineage (in Japanese).
http://eetimes.jp/ee/articles/1504/02/news143.html
[4] EEMBC ULPBench web site.
http://www.eembc.org/ulpbench/
[5] Why Choose the ARM Cortex-M4 over the M0 for Wearables and IoT?
http://ambiqmicro.com/news/why-choose-arm-cortex-m4-over-m0-wearables-and-iot
