I’ve been interning at ARM for the last two months as a summer student and have spent a fair amount of time looking into the Digital Signal Processing (DSP) market and how it relates to ARM. DSP is used in speech recognition, radar signal analysis, weather and economic forecasting, control engineering and nearly any situation involving discrete data.
In this blog post, I will look at the performance advantages of using the CMSIS-DSP library rather than a typical textbook implementation of DSP algorithms. The Cortex Microcontroller Software Interface Standard (CMSIS) provides a consistent software interface to Cortex-M processors, and it comes with a DSP library supporting common DSP functions such as IIR and FIR filters and FFTs. For more information see jyiu’s in-depth guide to Cortex-M3 and Cortex-M4 processors.
The Fast Fourier Transform (FFT) is a DSP algorithm which converts data in the time domain to data in the frequency domain, and it is one of the most useful and commonly used DSP algorithms. It is what I used to benchmark the performance of the CMSIS-DSP code.
For 32-bit integer data, an adaptation of the fixed-point algorithm written by Tom Roberts, Malcolm Slaney and Dimitrios P. Bouras was used. For 32-bit float data, the radix-2 implementation from Numerical Recipes was used. Both of these implementations are a common starting point for writing fixed and floating-point FFTs and are typically what someone would write after spending some time learning about FFTs. The CMSIS-DSP code for both the fixed and floating-point cases was based on the examples provided by ARM. Additionally, the comparison is fair as both reference implementations have the same complexity as CMSIS-DSP’s FFT [O(n log(n)), which is the lowest known complexity bound for FFT algorithms].
The FFT code was run on a Cortex-M4 MCU-based device (an STM32F407), built with Keil Microcontroller Development Kit (MDK) 4.70a. The system was configured to run at 30MHz with zero wait states for the flash memory. In the first case, a 32-bit integer complex FFT (Q31) was carried out for different FFT sample sizes, giving the following results:
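If Q31 is unfamiliar: it treats a signed 32-bit integer as a fraction in the range [-1, 1), with 31 bits after the binary point. The sketch below shows the idea in portable C. CMSIS-DSP ships its own Q31 type and conversion helpers, so the function names here are hypothetical and only meant to show the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a float to Q31 (value = integer / 2^31), saturating at the range limits.
 * Hypothetical helper for illustration, not a CMSIS-DSP function. */
int32_t float_to_q31(float x)
{
    assert(x == x); /* reject NaN, which cannot be converted to an integer */
    double scaled = (double)x * 2147483648.0; /* x * 2^31 */
    if (scaled >= 2147483647.0)
        return INT32_MAX;
    if (scaled <= -2147483648.0)
        return INT32_MIN;
    return (int32_t)scaled;
}

/* Convert Q31 back to a float in [-1, 1). */
float q31_to_float(int32_t q)
{
    return (float)((double)q / 2147483648.0);
}

/* Multiply two Q31 values using a 64-bit intermediate product.
 * Assumes >> on a negative int64_t is an arithmetic shift, which is
 * implementation-defined in C but true on common compilers. */
int32_t q31_mul(int32_t a, int32_t b)
{
    int64_t p = ((int64_t)a * (int64_t)b) >> 31;
    if (p > INT32_MAX) /* only happens for (-1) * (-1) */
        return INT32_MAX;
    return (int32_t)p;
}
```

For example, 0.5 is stored as 0x40000000, and multiplying 0.5 by 0.5 gives 0x20000000, which is 0.25. A fixed-point FFT is built from multiplies like this, which is why it can run without a floating-point unit.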
Figure 1: CMSIS speed comparison on a Cortex-M4 microcontroller for various sizes of fixed point FFTs. The data was obtained with the following conditions: zero-wait state memory, 32-bit complex integer. The Roberts et al. fixed point implementation was adapted for 32-bit as it was originally published in 16-bit.
In this case, CMSIS consistently leads by a factor of 1.9, demonstrating the advantages of using a software library optimized for the hardware. The second case looks at the performance of various sizes of floating-point FFTs (floating point format used: IEEE 754 single precision).
Figure 2: CMSIS speed comparison on a Cortex-M4 microcontroller for various sizes of FFTs. The data was obtained with the following conditions: zero-wait state memory, 32-bit complex floating-point data and with Floating Point Unit (FPU) enabled and disabled (done in software instead).
Figure 2 shows that the CMSIS-DSP library is about 35 times faster than the Numerical Recipes in C implementation with the FPU enabled for both. It is interesting to notice that with the FPU disabled, CMSIS-DSP’s FFT is only 4 times faster on average, highlighting the extent to which CMSIS-DSP is optimized for the Cortex-M4 MCU’s floating-point hardware.
Hopefully I’ve given you a taste of how CMSIS-DSP provides superior optimization for the Cortex-M family, by using ARM DSP SIMD, FPU and, in the case of the Cortex-M4 core, instruction set extensions to their full capability. Another reason to go the CMSIS way is that code is portable across the whole Cortex-M family due to the abstraction layer it provides.
I’m Sebastian and I’m currently an electrical and electronic engineering student at Imperial in London, about to enter my final year. As part of the UKESF scheme, I interned at ARM last year in its Processor Division, working as an engineer in the TrustZone team. This year I was back in the Cambridge offices with the Cortex-M product marketing team. It’s been a great way to learn about the daily life of ARM product managers, ARM’s business and generally what it is like to work here!
For anyone considering interning (or working) at ARM, I would definitely recommend it. Cambridge is a great place over summer (it’s very different from term time, when all the students are here) and there’s always something happening. If you’re looking for a bigger city or something different to do at the weekend, London is only 45 minutes away on the train, a trip I often end up making. ARM also provides a summer student social committee to get all the interns together after work. Naturally, being in Cambridge, this involves going punting! Also, if you’re into cycling, I’d suggest cycling the Cambridgeshire guided busway down to St Ives: it’s the longest guided busway in the world and offers a cycle path alongside as well as great views. If you find the busway too easy, you can always consider cycling down to London!
Hi Sebastian,
In Figure 2, CMSIS cycles should be the denominator in the label for the vertical axis.
Regards,
Goodwin
sylvie.boube-politano, in case you didn't see this blog. Nice results on an ST device.
STMicroelectronics
pbeckmann richardyorkianjohnson. These folks will be very pleased to see your good analysis. They're in charge of the Cortex-M4 and know its capabilities: it is fast enough to replace many of the traditional DSP processors in many kinds of applications. In addition to the processing power, the Cortex-M4 offers connectivity, and certain devices have power consumption that is competitive with a traditional low-power DSP.