Arm Community
Arm Community
  • Site
  • User
  • Site
  • Search
  • User
  • Groups
    • Research Collaboration and Enablement
    • DesignStart
    • Education Hub
    • Innovation
    • Open Source Software and Platforms
  • Forums
    • AI and ML forum
    • Architectures and Processors forum
    • Arm Development Platforms forum
    • Arm Development Studio forum
    • Arm Virtual Hardware forum
    • Automotive forum
    • Compilers and Libraries forum
    • Graphics, Gaming, and VR forum
    • High Performance Computing (HPC) forum
    • Infrastructure Solutions forum
    • Internet of Things (IoT) forum
    • Keil forum
    • Morello Forum
    • Operating Systems forum
    • SoC Design and Simulation forum
    • 中文社区论区
  • Blogs
    • AI and ML blog
    • Announcements
    • Architectures and Processors blog
    • Automotive blog
    • Graphics, Gaming, and VR blog
    • High Performance Computing (HPC) blog
    • Infrastructure Solutions blog
    • Innovation blog
    • Internet of Things (IoT) blog
    • Operating Systems blog
    • Research Articles
    • SoC Design and Simulation blog
    • Tools, Software and IDEs blog
    • 中文社区博客
  • Support
    • Arm Support Services
    • Documentation
    • Downloads
    • Training
    • Arm Approved program
    • Arm Design Reviews
  • Community Help
  • More
  • Cancel
Arm Community blogs
Arm Community blogs
Operating Systems blog Ne10 FFT feature
  • Blogs
  • Mentions
  • Sub-Groups
  • Tags
  • Jump...
  • Cancel
More blogs in Arm Community blogs
  • AI and ML blog

  • Announcements

  • Architectures and Processors blog

  • Automotive blog

  • Embedded blog

  • Graphics, Gaming, and VR blog

  • High Performance Computing (HPC) blog

  • Infrastructure Solutions blog

  • Internet of Things (IoT) blog

  • Operating Systems blog

  • SoC Design and Simulation blog

  • Tools, Software and IDEs blog

Tags
  • fft
  • ne10
  • NEON
Actions
  • RSS
  • More
  • Cancel
Related blog posts
Related forum threads

Ne10 FFT feature

Yang Zhang 张洋
Yang Zhang 张洋
December 17, 2013
6 minute read time.

FFT feature in ProjectNe10

1 Introduction

Project Ne10 recently received an updated version of FFT, which is heavily NEON optimized for both ARM v7-A/v8-A AArch32 and v8-A AArch64 and is faster than almost all of the other existing open source FFT implementations such as FFTW and the FFT routine in OpenMax DL. This article will introduce this a bit.

2 Performance comparison with some other FFT’s on ARM v7-A

The following chart illustrates the benchmarking results of the complex FFT (32-bit float data type) of Ne10, FFTW and OpenMax. The test platform is ARM Cortex A9. The X-axis of the chart represents the length of FFT. The Y-axis represents the execution time of FFT. Smaller is better.

From this chart, we can find that Ne10 is better than FFTW, OpenMax DL in most of cases.

3 FFT on ARM v7-A/v8-A AArch32 and ARM v8-A AArch64

3.1 NEON usage

To utilize NEON accelerator, usually we have two choices:

  • NEON assembly
  • NEON intrinsic

The following table describes the pros and cons of using assembly/intrinsic.

NEON assembly

NEON intrinsic

Performance

Always shows the best performance for the specified platform

Depends heavily on the toolchain that is used

Portability

The different ISA (i.e. ARM v7-A/v8-A AArch32 and ARM v8-A AArch64) has different assembly implementation. Even for the same ISA, the assembly might need to be fine-tuned to achieve ideal performance between different micro architectures.

Program once and run on different ISA’s. The compiler may also grant performance fine-tuning for different micro-architectures.

Maintainability

Hard to read/write compared with C.

Similar to C code, it’s easy to read/write.

3.2 ARM v7-A/v8-A AArch32 and v8-A AArch64 FFT implementations

According to the aforementioned pros/cons comparison, the intrinsic is preferred for the implementation of the Ne10 library

But for FFT, we still have different versions of implementations for ARM v7-A/v8-A AArch32 and v8-A AArch64 due to the reason described as follows:

// radix 4 butterfly with twiddles

scratch[0].r = scratch_in[0].r;

scratch[0].i = scratch_in[0].i;

scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;

scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;

scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;

scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;

scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;

scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;

The above code snippet lists the basic element of FFT---- radix4 butterfly. From the code, we can conclude that:

  • 20 64-bit NEON registers are needed if 2 radix4 butterflies are executed in one loop.
  • 20 128-bit NEON registers are needed if 4 radix4 butterflies are executed in one loop.

And, for ARM v7-A/v8-A AArch32 and v8-A AArch64,

  • There are 32 64-bit or 16 128-bit NEON registers for ARM v7-A/v8-A AArch32.
  • There are 32 128-bit NEON registers for ARM v8-A AArch64.

Considering the above factors, in practice the implementation of Ne10 eventually has an assembly version, in which 2 radix4 butterflies are executed in one loop, for ARM v7-A/v8-A AAch32, and an intrinsic version, in which 4 radix4 butterflies are executed in one loop, for ARM v8-A AArch64.

3.3 C/NEON performance boost

The following charts show the C/NEON performance boosts in ARM v8-A AArch32 and AArch64 on the same Cortex-A53 CPU of Juno. Larger is better.

All the blue bars show the data in the AArch32 mode. The NEON code is v7-A/v8-A AArch32 assembly. The toolchain used is gcc 4.9.

All the red bars show the data in the AArch64 mode. The NEON code is intrinsic. The performance of intrinsic depends on toolchains greatly. The toolchain used here is llvm3.5.

From these charts, we can conclude that float complex FFT shows the similar or better performance boost between the AArch64 mode and the AArch32 mode. But for int32/16 complex FFT, the performance boost in the AArch32 mode is usually better than in the AArch64 mode (but this doesn’t mean the int32/16 complex FFT performs faster in the AArch32 mode than in the AArch64 mode!)

The data from this exercise is useful to analyze the performance boost for ARM v8-A AArch64 mode but we still need more data to verify and reinforce our concept.

3.4 AArch32/AArch64 performance boost

The following charts are based on performance of the AArch32 C version and show the performance ratios of the AArch32 NEON version and the AArch64 C version, and the AArch64 NEON version on the same Cortex-A53 CPU on Juno. Larger is better.

From these charts, we can conclude that FFT in the AArch64 mode performs faster than in the AArch32 mode, no matter C or NEON.

4 Usage

4.1 APIs

The FFT still supports the following features:

Feature

Data type

Length

c2c FFT/IFFT

float/int32/int16

2^N (N is 2, 3….)

r2c FFT

float/int32/int16

2^N (N is 3, 4….)

c2r IFFT

float/int32/int16

2^N (N is 3, 4….)

But the APIs have changed. The old users need to update to latest version v1.1.2 or master.

More API details, please check http://projectne10.github.io/Ne10/doc/group__C2C__FFT__IFFT.html.

4.2 Example

Take the float c2c FFT/IFFT as an example, current APIs are used as follows.

#include "NE10.h"

……

{

    fftSize = 2^N; //N is 2, 3, 4, 5, 6....

    in = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));

    out = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));

    ne10_fft_cfg_float32_t cfg;

    cfg = ne10_fft_alloc_c2c_float32 (fftSize);

    ……

    //FFT

    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 0);

    ……

    //IFFT

    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 1);

    ……

    NE10_FREE (in);

    NE10_FREE (out);

    NE10_FREE (cfg);

}

5 Conclusion

The FFT shows that you can get a significant performance boost in the ARM v8-A AArch64 mode. You may find more use cases of course. We welcome feedback and are looking to publish use cases to cross promote ProjectNe10 and the projects that use it.

For more details, please access http://projectne10.github.com/Ne10/

Anonymous
  • niu niu
    Offline niu niu over 3 years ago

    你好 我在aarch64下编译出来的libNE10.a 在用提供的FFT接口函数,并调用libNE10.a 的时候不好使 ,编译总是出错,找不到定义的函数  

    • Cancel
    • Up 0 Down
    • Reply
    • More
    • Cancel
  • ymarko
    Offline ymarko over 4 years ago

    hello

    I'm new with the Neon project, and I need you help with the following:

    I'm using NXP LS104x CPU , 64bit, using llvm compiler, I want to use the NE project , but with llvm compiler.

    (I understand the project s compiled under GNU),

    can you help ,me how to use it in llvm? is there an option to convert the asm code to llvm compiler?

    • Cancel
    • Up 0 Down
    • Reply
    • More
    • Cancel
  • chloe
    Offline chloe over 6 years ago

    您好,

           ne10库编译成功后,要如何加入交叉编译工具链中呢?我想更改GNUlinuxconfig.cmake文件中的工具链地址,但找不到gcc、g++等,直接编译后,利用生成sample中的静态库例子可以成功在板子上运行,但是动态库生成例子显示找不到ne10_test.so.10库.我运行了命令$export LD_LIBRARY_PATH=/usr/home/project/build/modules,在板子中和虚拟机编译库前后都尝试了这句命令,仍然找不到该库。

    同时我利用eclipse工具编译sample源代码,显示找不到库和头文件。

           我使用的板子是IMX6QP,交叉编译工具链是arm-poky-linux-gnueabi.

    谢谢

    • Cancel
    • Up 0 Down
    • Reply
    • More
    • Cancel
  • UFO
    Offline UFO over 6 years ago

    yangzhang,你好

    我在ubuntun16下编译了NE10库,是可以编译通过,但是我的硬件平台的Linux内核是3.0.5,当我重新编译程序用这个库运行程序的时候,会直接报错segment fault。

    • Cancel
    • Up 0 Down
    • Reply
    • More
    • Cancel
  • UFO
    Offline UFO over 6 years ago

    这个问题我把Ubuntu16弄好以后,在编译库进行测试一下,谢谢您的回复。

    • Cancel
    • Up 0 Down
    • Reply
    • More
    • Cancel
>
Operating Systems blog
  • Enhancing Chromium's Control Flow Integrity with Armv9

    Richard Townsend
    Richard Townsend
    This blog explains how Control Flow Integrity, an Armv9 security feature, works on the newly launched Chromium M105.
    • October 11, 2022
  • Playing with Virtual Prototypes: Debugging and testing Android games should be as much fun as playing them!

    Sam Tennent
    Sam Tennent
    One of the most amazing developments in the last few years has been the explosion in mobile gaming. Not so long ago, if you wanted to while away the time playing a game on your phone there were not many…
    • September 21, 2020
  • SUSE Rocks in Nashville!

    P. Robin
    P. Robin
    I attended the 2019 edition of SUSE conference. I take you through the the latest developments in enterprise Linux, OpenStack, CEPH and more, from the conference.
    • May 1, 2019