Ne10 FFT feature

December 17, 2013


FFT feature in ProjectNe10

1 Introduction

Project Ne10 recently received an updated FFT implementation that is heavily NEON-optimized for both ARM v7-A/v8-A AArch32 and v8-A AArch64. It is faster than almost all other open source FFT implementations, including FFTW and the FFT routine in OpenMax DL. This article gives a brief introduction to it.

2 Performance comparison with some other FFTs on ARM v7-A

The following chart shows benchmark results for the complex FFT (32-bit float data type) of Ne10, FFTW and OpenMax DL. The test platform is an ARM Cortex-A9. The X-axis represents the FFT length and the Y-axis represents the execution time. Smaller is better.

From this chart, we can see that Ne10 outperforms FFTW and OpenMax DL in most cases.

3 FFT on ARM v7-A/v8-A AArch32 and ARM v8-A AArch64

3.1 NEON usage

To use the NEON accelerator, we usually have two choices:

NEON assembly
NEON intrinsics

The following table describes the pros and cons of assembly and intrinsics.

	NEON assembly	NEON intrinsics
Performance	Always shows the best performance for the specified platform	Depends heavily on the toolchain that is used
Portability	Each ISA (i.e. ARM v7-A/v8-A AArch32 and ARM v8-A AArch64) needs its own assembly implementation. Even within one ISA, the assembly may need fine-tuning to reach ideal performance on different micro-architectures.	Write once and run on different ISAs. The compiler may also provide performance tuning for different micro-architectures.
Maintainability	Harder to read and write than C.	Similar to C code, so easy to read and write.
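To make the portability row concrete, here is a minimal sketch of a complex multiply (the operation at the heart of the twiddle stage) written once with NEON intrinsics. The function name `cmul4` and the interleaved re/im layout are assumptions made for illustration and are not taken from Ne10. The scalar branch is there only so the sketch also builds on machines without NEON.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not part of Ne10): out[k] = a[k] * b[k] for 4 complex
 * float values stored interleaved as re0, im0, re1, im1, ... */
void cmul4(float *out, const float *a, const float *b)
{
#if defined(__ARM_NEON)
    /* vld2q_f32 de-interleaves: val[0] holds 4 real parts, val[1] holds 4 imaginary parts. */
    float32x4x2_t va = vld2q_f32(a);
    float32x4x2_t vb = vld2q_f32(b);
    float32x4x2_t vr;
    vr.val[0] = vmulq_f32(va.val[0], vb.val[0]);            /* ar*br           */
    vr.val[0] = vmlsq_f32(vr.val[0], va.val[1], vb.val[1]); /* ar*br - ai*bi   */
    vr.val[1] = vmulq_f32(va.val[1], vb.val[0]);            /* ai*br           */
    vr.val[1] = vmlaq_f32(vr.val[1], va.val[0], vb.val[1]); /* ai*br + ar*bi   */
    vst2q_f32(out, vr);                                     /* re-interleave   */
#else
    /* Scalar fallback with the same results, used when NEON is unavailable. */
    for (size_t k = 0; k < 4; ++k) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        out[2 * k]     = ar * br - ai * bi;
        out[2 * k + 1] = ai * br + ar * bi;
    }
#endif
}
```

The same intrinsic source compiles for AArch32 and AArch64, whereas an assembly version would have to be written separately for each.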

3.2 ARM v7-A/v8-A AArch32 and v8-A AArch64 FFT implementations

Given the pros and cons above, intrinsics are preferred for implementing the Ne10 library.

For FFT, however, we still have different implementations for ARM v7-A/v8-A AArch32 and v8-A AArch64, for the following reason:

// radix 4 butterfly with twiddles
scratch[0].r = scratch_in[0].r;
scratch[0].i = scratch_in[0].i;
scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;
scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;
scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;
scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;
scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;
scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;

The above code snippet shows the basic element of the FFT: the radix-4 butterfly. From the code, we can conclude that:

20 64-bit NEON registers are needed if 2 radix4 butterflies are executed in one loop.
20 128-bit NEON registers are needed if 4 radix4 butterflies are executed in one loop.
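The snippet above covers the twiddle multiplication. As a rough scalar sketch of a whole butterfly, assuming a forward decimation-in-time radix-4 step, the multiplication is followed by two rounds of additions and subtractions. The type `cpx_t` and the functions `cmul` and `radix4_butterfly` are hypothetical names used only for this illustration; this is not Ne10's implementation.

```c
/* Hypothetical stand-in for a complex float sample (real part r, imaginary part i). */
typedef struct { float r, i; } cpx_t;

/* Complex multiply, same arithmetic as the twiddle lines in the snippet above. */
static cpx_t cmul(cpx_t a, cpx_t w)
{
    cpx_t p = { a.r * w.r - a.i * w.i, a.i * w.r + a.r * w.i };
    return p;
}

/* Forward radix-4 butterfly: tw[0..2] are the twiddles applied to in[1..3]. */
void radix4_butterfly(cpx_t out[4], const cpx_t in[4], const cpx_t tw[3])
{
    /* Stage 1: twiddle multiplication (in[0] passes through unchanged). */
    cpx_t s0 = in[0];
    cpx_t s1 = cmul(in[1], tw[0]);
    cpx_t s2 = cmul(in[2], tw[1]);
    cpx_t s3 = cmul(in[3], tw[2]);

    /* Stage 2: sums and differences of the even and odd pairs. */
    cpx_t t0 = { s0.r + s2.r, s0.i + s2.i };
    cpx_t t1 = { s0.r - s2.r, s0.i - s2.i };
    cpx_t t2 = { s1.r + s3.r, s1.i + s3.i };
    cpx_t t3 = { s1.r - s3.r, s1.i - s3.i };

    /* Stage 3: combine; multiplying t3 by -j or +j is just a swap with a sign change. */
    out[0].r = t0.r + t2.r;  out[0].i = t0.i + t2.i;
    out[2].r = t0.r - t2.r;  out[2].i = t0.i - t2.i;
    out[1].r = t1.r + t3.i;  out[1].i = t1.i - t3.r;   /* t1 - j*t3 */
    out[3].r = t1.r - t3.i;  out[3].i = t1.i + t3.r;   /* t1 + j*t3 */
}
```

With all twiddles set to 1, this computes a plain 4-point DFT, which makes it easy to check by hand: the input 0, 1, 0, 0 gives 1, -j, -1, j.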

The number of available NEON registers differs between ARM v7-A/v8-A AArch32 and v8-A AArch64:

There are 32 64-bit or 16 128-bit NEON registers for ARM v7-A/v8-A AArch32.
There are 32 128-bit NEON registers for ARM v8-A AArch64.

Considering these factors, Ne10 in practice has two implementations: an assembly version for ARM v7-A/v8-A AArch32, in which 2 radix-4 butterflies are executed in one loop, and an intrinsic version for ARM v8-A AArch64, in which 4 radix-4 butterflies are executed in one loop.
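The register argument can be written down as a small check. The sketch below only encodes the counts quoted above (20 registers needed, 32 64-bit or 16 128-bit registers on AArch32, 32 128-bit registers on AArch64). The names `neon_regfile_t` and `butterflies_per_loop` are hypothetical, and the 64-bit count for AArch64 is my own addition (the low halves of the 32 vector registers).

```c
/* Register count quoted in the article for both loop shapes. */
enum { REGS_NEEDED = 20 };

/* Hypothetical description of a NEON register file. */
typedef struct {
    int regs_64bit;   /* number of 64-bit registers  */
    int regs_128bit;  /* number of 128-bit registers */
} neon_regfile_t;

const neon_regfile_t AARCH32 = { 32, 16 };
const neon_regfile_t AARCH64 = { 32, 32 };  /* 64-bit count is an assumption, see lead-in */

/* Largest number of radix-4 butterflies per loop that fits without spilling:
 * 4 butterflies need 20 128-bit registers, 2 butterflies need 20 64-bit registers. */
int butterflies_per_loop(neon_regfile_t rf)
{
    if (rf.regs_128bit >= REGS_NEEDED)
        return 4;
    if (rf.regs_64bit >= REGS_NEEDED)
        return 2;
    return 0;
}
```

For AArch32 this yields 2, because 20 exceeds the 16 available 128-bit registers, and for AArch64 it yields 4, which matches the two implementations described above.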

3.3 C/NEON performance boost

The following charts show the performance boost of NEON over C in ARM v8-A AArch32 and AArch64, measured on the same Cortex-A53 CPU of a Juno board. Larger is better.

All the blue bars show data for the AArch32 mode. The NEON code is v7-A/v8-A AArch32 assembly. The toolchain used is GCC 4.9.

All the red bars show data for the AArch64 mode. The NEON code is written with intrinsics, whose performance depends greatly on the toolchain. The toolchain used here is LLVM 3.5.

From these charts, we can conclude that for the float complex FFT, the performance boost in the AArch64 mode is similar to or better than in the AArch32 mode. For the int32/int16 complex FFT, however, the performance boost in the AArch32 mode is usually larger than in the AArch64 mode. This does not mean that the int32/int16 complex FFT runs faster in the AArch32 mode than in the AArch64 mode: the boost is relative to the C version in the same mode.
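The difference between a larger boost and a faster run can be seen with a little arithmetic. In the sketch below, the boost is taken to be the C execution time divided by the NEON execution time in the same mode, which is my reading of the charts rather than a stated definition. The function name `boost` is hypothetical, and the timings in the note that follows are invented for illustration only; they are not measurements.

```c
/* Performance boost of NEON over C within one mode: C time divided by NEON time.
 * Both times must be in the same unit, and t_neon must be non-zero. */
double boost(double t_c, double t_neon)
{
    return t_c / t_neon;
}
```

With invented timings of 100 µs (C) and 25 µs (NEON) in AArch32, the boost is 4.0. With invented timings of 60 µs (C) and 20 µs (NEON) in AArch64, the boost is only 3.0, yet the AArch64 NEON version is still the faster one in absolute terms (20 µs against 25 µs).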

The data from this exercise is useful for analyzing the performance boost in the ARM v8-A AArch64 mode, but we still need more data to verify and reinforce our conclusions.

3.4 AArch32/AArch64 performance boost

The following charts use the AArch32 C version as the baseline and show the performance ratios of the AArch32 NEON version, the AArch64 C version, and the AArch64 NEON version, all on the same Cortex-A53 CPU on Juno. Larger is better.

From these charts, we can conclude that the FFT performs faster in the AArch64 mode than in the AArch32 mode, for both the C and the NEON versions.

4 Usage

4.1 APIs

The FFT still supports the following features:

Feature	Data type	Length
c2c FFT/IFFT	float/int32/int16	2^N (N = 2, 3, ...)
r2c FFT	float/int32/int16	2^N (N = 3, 4, ...)
c2r IFFT	float/int32/int16	2^N (N = 3, 4, ...)

The APIs, however, have changed. Existing users need to update to the latest release, v1.1.2, or to master.

For more API details, see http://projectne10.github.io/Ne10/doc/group__C2C__FFT__IFFT.html.
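The length constraints in the feature table can be checked before any buffers are allocated. The helper below is a hypothetical sketch, not a Ne10 function: `fft_kind_t` and `fft_length_supported` are names invented here, and the limits encode only what the table states (powers of two, at least 4 for c2c and at least 8 for r2c/c2r).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tag for the three transform kinds listed in the table. */
typedef enum { FFT_C2C, FFT_R2C, FFT_C2R } fft_kind_t;

/* True if n is a power of two (exactly one bit set). */
static bool is_pow2(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* True if nfft is a length the table lists for the given transform kind. */
bool fft_length_supported(fft_kind_t kind, uint32_t nfft)
{
    uint32_t min_len = (kind == FFT_C2C) ? 4u : 8u;  /* 2^2 for c2c, 2^3 for r2c/c2r */
    return is_pow2(nfft) && nfft >= min_len;
}
```

For example, a length of 128 passes for all three kinds, while 100 is rejected because it is not a power of two.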

4.2 Example

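Taking the float c2c FFT/IFFT as an example, the current APIs are used as follows.

#include "NE10.h"

……

{
    // FFT length: a power of two, 2^N with N = 2, 3, 4, 5, 6, ...
    // (in C, ^ is XOR, so the length is computed with a shift)
    ne10_int32_t fftSize = 1 << N;

    // Input and output buffers, fftSize complex float samples each
    ne10_fft_cpx_float32_t *in  = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
    ne10_fft_cpx_float32_t *out = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));

    // Allocate the FFT configuration for this length
    ne10_fft_cfg_float32_t cfg = ne10_fft_alloc_c2c_float32 (fftSize);

    ……

    // FFT (last argument 0 selects the forward transform)
    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 0);

    ……

    // IFFT (last argument 1 selects the inverse transform)
    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 1);

    ……

    NE10_FREE (in);
    NE10_FREE (out);
    NE10_FREE (cfg);
}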

5 Conclusion

The FFT shows that you can get a significant performance boost in the ARM v8-A AArch64 mode, and you may well find more use cases. We welcome feedback and are looking to publish use cases to cross-promote ProjectNe10 and the projects that use it.

For more details, see http://projectne10.github.com/Ne10/


Zhigang over 10 years ago

Yang, thank you for helping. If you think it is appropriate, we can communicate with emails. zhigang.gao@gmail.com.
The length of FFT is 128
init() passed, but there may be a bug in it, since the cpuinfo file does not have "neon" inside.
I am doing cross platform compiling, compile on a Linux VM (Redhat Linux 5.50), and run it on ARM Cortex A9 MPCore (Rev 1) LE platform.
Please refer to the code and results below for details. I modified ne10_init.c to add a few printf calls.
/******* added a few printf in ne10_init.c ************/
ne10_result_t ne10_init()
{
    ne10_result_t status = NE10_ERR;
#ifndef __MACH__
    printf("__MACH__ not defined \n"); //ZZZZ
    FILE*   infofile = NULL;               // To open the file /proc/cpuinfo
    ne10_int8_t    cpuinfo[CPUINFO_BUFFER_SIZE]; // The buffer to read in the string
    ne10_uint32_t bytes = 0;                     // Numbers of bytes read from the file
    ne10_int32_t     i = 0;                         // Temporary loop counter
    memset (cpuinfo, 0, CPUINFO_BUFFER_SIZE);
    infofile = fopen ("/proc/cpuinfo", "r");
    if (!infofile)
    {
        printf("ERROR: couldn't read file \"/proc/cpuinfo\".\n"); //ZZZZ
        fprintf(stderr, "ERROR: couldn't read file \"/proc/cpuinfo\".\n");
        return NE10_ERR;
    }
    bytes    = fread (cpuinfo, 1, sizeof (cpuinfo), infofile);
    fclose (infofile);
    if (0 == bytes || CPUINFO_BUFFER_SIZE == bytes)
    {
        printf("ERROR: cpuinfo size is 0 \"/proc/cpuinfo\".\n"); //ZZZZ
        fprintf (stderr, "ERROR: Couldn't read the file \"/proc/cpuinfo\". NE10_init() failed.\n");
        return NE10_ERR;
    }
    while ('\0' != cpuinfo[i])
    {
        cpuinfo[i] = (ne10_int8_t) tolower (cpuinfo[i]);
        ++i;
    }
    if (0 != strstr ( (const char *)cpuinfo, "neon"))
    {
    printf("cpuinfo bytes = %d, found neon at %d \n", bytes, strstr ( (const char *)cpuinfo, "neon")); //ZZZZ
        is_NEON_available = NE10_OK;
    } else {
    printf("neon is not in cpuinfo \n"); //ZZZZ
return NE10_ERR;
    }
#else //__MACH__
    printf("__MACH__ defined \n"); //ZZZZ
    is_NEON_available = NE10_OK;
#endif //__MACH__
    printf("is_NEON_available = %d \n", is_NEON_available); //ZZZZ
#if defined (NE10_ENABLE_MATH)
    printf("init math \n"); //ZZZZ
    status = ne10_init_math (is_NEON_available);
    if (status != NE10_OK)
    {
    printf("init math failed\n"); //ZZZZ
        fprintf(stderr, "ERROR: init math failed\n");
        return NE10_ERR;
    }
#endif
#if defined (NE10_ENABLE_DSP)
    printf("init DSP \n"); //ZZZZ
    status = ne10_init_dsp (is_NEON_available);
    if (status != NE10_OK)
    {
    printf("init DSP failed\n"); //ZZZZ
        fprintf(stderr, "ERROR: init dsp failed\n");
        return NE10_ERR;
    }
#endif
/********************************************************************************************/
/*******core code in test.c*******/
if (USE_C) {
ne10_fft_c2c_1d_float32_c(ifftOut, ifftIn, cfg, 1);
ne10_vmul_vec2f_c(tmp2f, (ne10_vec2f_t *)ifftOut, (ne10_vec2f_t *)ifftOut, fftLen);
} else {
if (USE_INIT) {
if (ne10_init() == NE10_OK) {
printf(" ne10_init() pass\n");
ne10_fft_c2c_1d_float32(ifftOut, ifftIn, cfg, 1);
printf("ne10_fft_c2c_1d_float32 done \n");
ne10_vmul_vec2f(tmp2f, (ne10_vec2f_t *)ifftOut, (ne10_vec2f_t *)ifftOut, fftLen); // asm("ne10_vmul_vec2f_neon");
printf("ne10_vmul_vec2f_float32 done \n");
} else {
printf(" ne10_init() fail\n");
}
} else {
ne10_fft_c2c_1d_float32_neon(ifftOut, ifftIn, cfg, 1);
printf("ne10_fft_c2c_1d_float32_neon done \n");
ne10_vmul_vec2f_neon(tmp2f, (ne10_vec2f_t *)ifftOut, (ne10_vec2f_t *)ifftOut, fftLen);
printf("ne10_vmul_vec2f_1d_float32_neon done \n");
}
}
...
   /*convert to time in us and return */
*t = ( (ne10_float32_t)index ) / BW;
...
/******************************************************************/
/**************Results******************************************/
/**********************************************************************/
1. when USE_C=1, the results are good
m00180A00F300:/etc# ./Ne10/build/samples/test 1 0
*t = index/BW = 39/40 = 0.975000
*t = index/BW = 39/40 = 0.975000
*t = index/BW = 40/40 = 1.000000
*t = index/BW = 40/40 = 1.000000
C: Used 448.000000 usec
Matlab D=23.950001, NE10 d=23.940002
2. when USE_C = 0 and USE_INIT=1, I got this.
m00180A00F300:/etc# ./Ne10/build/samples/test 0 1
__MACH__ not defined
cpuinfo bytes = 343, found neon at -1096417952
is_NEON_available = 0
init math
init DSP
ne10_init() pass
Segmentation fault (core dumped)
3. when USE_C = 0 and USE_INIT=0, I got this. the constant BW is no longer 40, it was corrupted.
m00180A00F300:/etc# ./Ne10/build/samples/test 0 0
*t = index/BW = 39/997874430 = 0.000000
*t = index/BW = 39/1041829706 = 0.000000
*t = index/BW = 40/1043333275 = 0.000000
*t = index/BW = 40/1065999687 = 0.000000
NEON: Used 448.000000 usec
Matlab D=23.950001, NE10 d=-272.309998
