Tags
  • fft
  • Cortex-A53
  • ne10
  • NEON
Ne10 FFT Feature: Radix-3 and Radix-5 FFT Are Supported, with Significant Performance Improvement by NEON Optimization

Phil Wang
January 16, 2015

Ne10 v1.2.0 has been released. Radix-3 and radix-5 are now supported in the floating-point complex FFT. The benchmark data below shows that NEON optimization significantly improves FFT performance.

1. Project Ne10

The Ne10 project has been set up to provide a set of common, useful functions which have been heavily optimized for the ARM Architecture and provide consistent, well-tested behavior that can be easily incorporated into applications. C interfaces to the functions are provided for both assembler and NEON™ implementations. The library supports static and dynamic linking and is modular, so that functionality that is not required can be discarded. For details of Ne10, please see this blog. For more details of the FFT feature in Ne10, please refer to this blog.

2. Benchmark

2.1. Time cost

Figure 1 shows benchmark data (time cost) for four FFT implementations: Ne10 (v1.2.0), pffft (2013), kissFFT (1.3.0), and the one inside Opus (v1.1.1-beta). Ne10 and pffft are heavily NEON-optimized, while kissFFT and the Opus FFT are not. All implementations were compiled with LLVM 3.5 using the -O2 flag, and all were tested on ARMv7-A (Cortex-A9, 1.0 GHz) and AArch64 (Cortex-A53, 850 MHz).

Figure 1

In Figure 1, the x axis is the FFT size and the y axis is the time cost (ms); smaller is better. Each FFT was run 2.048 × 10^6 / (FFT size) times. For example, a 1024-point FFT was run 2000 times. pffft supports only sizes that are multiples of 16, so its curve starts at 240. The performance boost from NEON optimization is clearly visible.
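The repetition rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the benchmark harness itself; the names `TOTAL_POINTS` and `runs_for_size` are hypothetical, and rounding down for sizes that do not divide the total evenly is an assumption.

```python
# Total number of points processed per FFT size, as stated in the post.
TOTAL_POINTS = 2_048_000  # 2.048 x 10^6


def runs_for_size(nfft: int) -> int:
    """Return how many times an FFT of the given size is repeated.

    Assumption: the count is rounded down when nfft does not divide
    TOTAL_POINTS evenly (the post does not say how this is handled).
    """
    if nfft <= 0:
        raise ValueError("FFT size must be positive")
    return TOTAL_POINTS // nfft


if __name__ == "__main__":
    for nfft in (240, 256, 1024):
        print(nfft, runs_for_size(nfft))
```

Scaling the repetition count inversely with the size keeps the total amount of data processed roughly constant, so the time costs of different sizes can be compared on one plot.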

2.2. Mega Floating-point operations per second (MFLOPS)

Figure 2

Figure 2 shows benchmark data in MFLOPS for the same four implementations. The data are calculated according to this link. MFLOPS measures the performance of different algorithms solving the same problem; bigger is better. When data are packed and processed by NEON instructions (in Ne10 and pffft), MFLOPS is much higher.
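The formula behind the linked calculation is not reproduced in this post. As a rough sketch, the snippet below assumes the widely used FFT benchmarking convention of counting 5 N log2(N) floating-point operations for a complex transform of size N; whether the figures in this post use exactly that convention is an assumption. The function name `fft_mflops` is hypothetical.

```python
import math


def fft_mflops(nfft: int, seconds_per_fft: float) -> float:
    """Estimate MFLOPS for one complex FFT of size nfft.

    Assumption: the nominal operation count is 5 * N * log2(N), a common
    benchmarking convention that is not confirmed by the post.
    """
    if nfft < 2 or seconds_per_fft <= 0:
        raise ValueError("need nfft >= 2 and a positive time")
    nominal_flops = 5.0 * nfft * math.log2(nfft)
    microseconds = seconds_per_fft * 1e6
    # FLOPs per microsecond equals millions of FLOPs per second.
    return nominal_flops / microseconds


if __name__ == "__main__":
    # Hypothetical timing: a 1024-point FFT that takes 10 microseconds.
    print(round(fft_mflops(1024, 10e-6), 1))
```

Because the operation count is nominal rather than measured, the result is a normalized speed score: two implementations of the same size can be compared even if they execute different numbers of real instructions.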

3. Usage

The FFT API is unchanged. Ne10 detects whether the FFT size is a multiple of 3 or 5 and then selects the best algorithm to execute. For more detail, please refer to this blog.
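To illustrate the idea of choosing radices from the FFT size, here is a minimal Python sketch that splits a size into factors of 5, 3 and 2. It is not Ne10's actual selection logic, which is not described in this post; the function name `factor_radices` and the order in which radices are tried are assumptions made for the example.

```python
def factor_radices(nfft: int, radices=(5, 3, 2)):
    """Split nfft into factors drawn from the given radices.

    Returns (factors, remainder). A remainder of 1 means the size can be
    handled entirely with the listed radices. This is an illustrative
    sketch, not the algorithm used inside Ne10.
    """
    if nfft <= 0:
        raise ValueError("FFT size must be positive")
    factors = []
    remaining = nfft
    for radix in radices:
        # Pull out this radix as many times as it divides the size.
        while remaining % radix == 0:
            factors.append(radix)
            remaining //= radix
    return factors, remaining


if __name__ == "__main__":
    for nfft in (240, 1024, 14):
        print(nfft, factor_radices(nfft))
```

With this sketch, a size such as 240 decomposes completely into radix-5, radix-3 and radix-2 stages, while a size such as 14 leaves a remainder of 7 that none of the listed radices can cover.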

  • Rutul, over 6 years ago

    @Phil Wang  Can you tell me what clock rate was provided for the results shown in figure 1?

  • tatiana, over 6 years ago

    also used 256 points!

  • tatiana, over 6 years ago

    Good day! Sorry, my English is not good.

    I'm using an ARM Cortex-A9 with NEON, running armv7-hf Linux.

    I compiled Ne10, FFTW3, FFTS-master and others on armv7l.

    Stated: with NEON, 256 points, FFT = 3.7 us.

    However, I get the following results (at 512 samples):

    ///////////////////////////////////////////////////////// FFTW3 lib

    FFT_FFTW3  mean_sec: (0.000019s)

    first_sec:   (0.000022s)

    max_sec:  (0.000016s)

    min_sec:   (0.000016s)

    IFFT_FFTW3  mean_sec: (0.000018s)

    first_sec:   (0.000021s)

    max_sec:  (0.000016s)

    min_sec:   (0.000016s)

    ///////////////////////////////////////////////////////// NE10 lib

    (use ne10_fft_c2c_1d_float32_neon)

    FFT_NE10  mean_sec:          (0.000019s)

    first_sec:   (0.000020s)

    max_sec:  (0.000016s)

    min_sec:   (0.000016s)

    IFFT_NE10  mean_sec:          (0.000013s)

    first_sec:   (0.000013s)

    max_sec:  (0.000013s)

    min_sec:   (0.000013s)

    (use ne10_fft_c2c_1d_float32_c)

    FFT_NE10_ARN  mean_sec: (0.000024s)

    first_sec:   (0.000027s)

    max_sec:  (0.000021s)

    min_sec:   (0.000021s)

    IFFT_NE10_ARM  mean_sec: (0.000025s)

    first_sec:   (0.000027s)

    max_sec:  (0.000024s)

    min_sec:   (0.000024s)

    Why is the calculation so slow?

    PS: Thanks in advance.
