Ne10 FFT Feature: Radix-3 and Radix-5 FFT Are Supported, with Significant Performance Improvement from NEON Optimization

January 16, 2015

Ne10 v1.2.0 has been released. Radix-3 and radix-5 are now supported in the floating-point complex FFT. The benchmark data below show that NEON optimization significantly improves FFT performance.

1. Project Ne10

The Ne10 project was set up to provide a set of common, useful functions that are heavily optimized for the ARM architecture and offer consistent, well-tested behavior that can be easily incorporated into applications. C interfaces to the functions are provided for both assembler and NEON™ implementations. The library supports static and dynamic linking and is modular, so functionality that is not required can be discarded. For details of Ne10, please check this blog. For more details of the FFT feature in Ne10, please refer to this blog.

2. Benchmark

2.1. Time cost

Figure 1 shows benchmark data (time cost) for four FFT implementations: Ne10 (v1.2.0), pffft (2013), kissFFT (1.3.0), and the one inside Opus (v1.1.1-beta). Ne10 and pffft are well NEON-optimized, while kissFFT and the Opus FFT are not. All implementations were compiled with LLVM 3.5 using the -O2 flag and tested on ARMv7-A (Cortex-A9, 1.0 GHz) and AArch64 (Cortex-A53, 850 MHz).

Figure 1

In Figure 1, the x axis is the FFT size and the y axis is the time cost in ms; smaller is better. Each FFT was run 2.048×10⁶ / (FFT size) times; for example, a 1024-point FFT was run 2000 times. pffft supports only sizes that are multiples of 16, so its curve starts at 240. The performance boost from NEON optimization is clearly visible.
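
The run-count rule is plain arithmetic. As a minimal sketch (the names `TOTAL_POINTS` and `runs_for_size` are illustrative and do not come from the benchmark source), it could be written as:

```python
# Total number of points processed per measurement, taken from the text:
# each FFT is run 2.048e6 / (FFT size) times.
TOTAL_POINTS = 2_048_000


def runs_for_size(fft_size: int) -> int:
    """Return how many times an FFT of `fft_size` points is repeated."""
    if fft_size <= 0:
        raise ValueError("fft_size must be positive")
    # Assumption: sizes that do not divide TOTAL_POINTS evenly are rounded
    # down. The post does not say how such sizes are handled.
    return TOTAL_POINTS // fft_size


print(runs_for_size(1024))  # 2000, the example given in the text
```

Keeping the total number of processed points roughly constant means that small and large FFT sizes are each measured over a comparable amount of work.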

2.2. Mega Floating-point operations per second (MFLOPS)

Figure 2

Figure 2 shows benchmark data in MFLOPS for the same four implementations. The data are calculated according to this link. MFLOPS is used here to compare the performance of different algorithms solving the same problem; bigger is better. When data are packed and processed by NEON instructions (in Ne10 and pffft), MFLOPS is much higher.
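
The link to the exact formula did not survive in this copy of the post. A widely used convention for complex FFT benchmarks (the one used by FFTW's benchFFT) is a nominal MFLOPS = 5 N log2(N) / (time for one FFT in microseconds). The sketch below assumes that convention; the function name `fft_mflops` is illustrative, and whether the figures in this post use exactly this formula is an assumption.

```python
import math


def fft_mflops(fft_size: int, seconds_per_fft: float) -> float:
    """Nominal MFLOPS for one complex FFT of `fft_size` points.

    Assumption: the benchFFT-style convention 5 * N * log2(N) / t_us.
    This is a scaled speed figure, not an exact operation count.
    """
    if fft_size < 2 or seconds_per_fft <= 0:
        raise ValueError("need fft_size >= 2 and a positive time")
    nominal_flops = 5.0 * fft_size * math.log2(fft_size)
    microseconds = seconds_per_fft * 1e6
    return nominal_flops / microseconds


# Hypothetical timing chosen only to make the arithmetic round; it is not
# a measurement from this post. 5 * 1024 * 10 = 51200 nominal flops.
print(fft_mflops(1024, 51.2e-6))
```

Because the numerator depends only on the FFT size, two implementations measured at the same size are ranked purely by their time per transform, which is why the metric suits a comparison like Figure 2.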

3. Usage

The FFT API is not modified. Ne10 detects whether the FFT size is a multiple of 3 or 5 and then selects the best algorithm to execute. For more detail, please refer to this blog.
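
How Ne10 makes this choice internally is not described in this post. As a rough illustration of the idea only, the following sketch (the function `factor_radices` is hypothetical and is not Ne10 code) splits an FFT size into factors of 5, 3, and 2 and reports whatever is left over:

```python
def factor_radices(fft_size: int, radices=(5, 3, 2)):
    """Split `fft_size` into the given radices.

    Returns (factors, remainder). A remainder of 1 means the size is
    fully covered by the listed radices.
    """
    if fft_size <= 0:
        raise ValueError("fft_size must be positive")
    factors = []
    remainder = fft_size
    for radix in radices:
        # Pull out this radix as many times as it divides the size.
        while remainder % radix == 0:
            factors.append(radix)
            remainder //= radix
    return factors, remainder


print(factor_radices(240))   # ([5, 3, 2, 2, 2, 2], 1)
print(factor_radices(1024))  # ([2, 2, 2, 2, 2, 2, 2, 2, 2, 2], 1)
print(factor_radices(14))    # ([2], 7): 7 is not covered by radix 2, 3, or 5
```

A size such as 240 contains factors of 3 and 5, so it would need the radix-3 and radix-5 stages, while 1024 needs only radix-2 factors.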

tatiana over 8 years ago

Good day! Sorry, my English is not good.
I am using an ARM Cortex-A9 with NEON. The system installed on the ARM board is armv7-hf (Linux).
I compiled NE10, FFTW3, FFTS-master, and others on armv7l.
The stated figure is: with NEON, 256 points, FFT = 3.7 us.
However, I get the following results (at 512 samples):
///////////////////////////////////////////////////////// FFTW3 lib
FFT_FFTW3 mean_sec: (0.000019s)
first_sec:   (0.000022s)
max_sec: (0.000016s)
min_sec:   (0.000016s)
IFFT_FFTW3 mean_sec: (0.000018s)
first_sec:   (0.000021s)
max_sec: (0.000016s)
min_sec:   (0.000016s)
///////////////////////////////////////////////////////// NE10 lib
(use ne10_fft_c2c_1d_float32_neon)
FFT_NE10 mean_sec:          (0.000019s)
first_sec:   (0.000020s)
max_sec: (0.000016s)
min_sec:   (0.000016s)
IFFT_NE10 mean_sec:          (0.000013s)
first_sec:   (0.000013s)
max_sec: (0.000013s)
min_sec:   (0.000013s)
(use ne10_fft_c2c_1d_float32_c)
FFT_NE10_ARN mean_sec: (0.000024s)
first_sec:   (0.000027s)
max_sec: (0.000021s)
min_sec:   (0.000021s)
IFFT_NE10_ARM mean_sec: (0.000025s)
first_sec:   (0.000027s)
max_sec: (0.000024s)
min_sec:   (0.000024s)
Why is the calculation so slow?
PS: Thanks in advance.

