FFT feature in ProjectNe10
Project Ne10 recently received an updated FFT implementation that is heavily NEON-optimized for both ARM v7-A/v8-A AArch32 and v8-A AArch64, and is faster than almost all other existing open source FFT implementations, such as FFTW and the FFT routine in OpenMax DL. This article gives a brief introduction to it.
The following chart illustrates the benchmarking results of the complex FFT (32-bit float data type) of Ne10, FFTW and OpenMax. The test platform is ARM Cortex A9. The X-axis of the chart represents the length of FFT. The Y-axis represents the execution time of FFT. Smaller is better.
From this chart, we can see that Ne10 is faster than FFTW and OpenMax DL in most cases.
To utilize the NEON accelerator, we usually have two choices: NEON assembly or NEON intrinsics.
The following table describes the pros and cons of using assembly/intrinsic.
| | NEON assembly | NEON intrinsic |
|---|---|---|
| Performance | Always shows the best performance for the specified platform. | Depends heavily on the toolchain that is used. |
| Portability | Each ISA (i.e. ARM v7-A/v8-A AArch32 and ARM v8-A AArch64) needs its own assembly implementation. Even for the same ISA, the assembly might need to be fine-tuned to achieve ideal performance on different micro-architectures. | Program once and run on different ISAs. The compiler may also provide performance fine-tuning for different micro-architectures. |
| Maintainability | Hard to read/write compared with C. | Similar to C code, it's easy to read/write. |
Based on this comparison, intrinsics are preferred for the implementation of the Ne10 library.
For FFT, however, we still have different implementations for ARM v7-A/v8-A AArch32 and v8-A AArch64, for the reasons described below:
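To make the portability point concrete, here is a minimal sketch of a complex multiply written once with NEON intrinsics. The helper name `cmul4` and the interleaved data layout are assumptions made for illustration; this is not code taken from Ne10. The same source builds for AArch32 and AArch64, and a scalar fallback is used when NEON is not available at compile time.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper (not part of Ne10): multiplies four complex floats in
   a[] by four complex floats in b[] and writes the products to dst[].
   Layout is interleaved: re0, im0, re1, im1, ... (8 floats per array). */
void cmul4(float *dst, const float *a, const float *b)
{
#if HAVE_NEON
    /* vld2q_f32 de-interleaves: val[0] = 4 real parts, val[1] = 4 imaginary parts */
    float32x4x2_t va = vld2q_f32(a);
    float32x4x2_t vb = vld2q_f32(b);
    float32x4x2_t vd;
    /* real = ar*br - ai*bi */
    vd.val[0] = vmlsq_f32(vmulq_f32(va.val[0], vb.val[0]), va.val[1], vb.val[1]);
    /* imag = ai*br + ar*bi */
    vd.val[1] = vmlaq_f32(vmulq_f32(va.val[1], vb.val[0]), va.val[0], vb.val[1]);
    vst2q_f32(dst, vd);   /* re-interleave and store */
#else
    /* Scalar fallback with identical semantics */
    for (int k = 0; k < 4; k++) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        dst[2 * k]     = ar * br - ai * bi;
        dst[2 * k + 1] = ai * br + ar * bi;
    }
#endif
}
```

How well the intrinsic path performs still depends on the compiler's register allocation and instruction scheduling, which is the trade-off listed in the table.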
// radix 4 butterfly with twiddles
scratch[0].r = scratch_in[0].r;
scratch[0].i = scratch_in[0].i;
scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;
scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;
scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;
scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;
scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;
scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;
The above code snippet shows the basic element of the FFT: the radix-4 butterfly. From the code, we can conclude that:
And, for ARM v7-A/v8-A AArch32 and v8-A AArch64,
Considering the above factors, in practice Ne10 ended up with two implementations: an assembly version for ARM v7-A/v8-A AArch32, in which 2 radix-4 butterflies are executed per loop iteration, and an intrinsic version for ARM v8-A AArch64, in which 4 radix-4 butterflies are executed per loop iteration.
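The snippet earlier in this section covers only the twiddle multiplication. For readers who want to see one whole butterfly, the following is a minimal scalar sketch of a forward radix-4 butterfly. The type `cpx_t` and the function `radix4_butterfly` are hypothetical names chosen for this illustration, and the code is not Ne10's implementation; a NEON version would process 2 or 4 such butterflies per loop iteration.

```c
/* Hypothetical complex type, mirroring the .r/.i fields used above */
typedef struct { float r, i; } cpx_t;

/* Forward radix-4 butterfly (sketch, not Ne10 code).
   in[0..3]: inputs, tw[0..2]: twiddles applied to in[1..3], out[0..3]: results. */
void radix4_butterfly(cpx_t out[4], const cpx_t in[4], const cpx_t tw[3])
{
    cpx_t s[4], t[4];

    /* Stage 1: twiddle multiplication (in[0] passes through unchanged) */
    s[0] = in[0];
    for (int k = 1; k < 4; k++) {
        s[k].r = in[k].r * tw[k - 1].r - in[k].i * tw[k - 1].i;
        s[k].i = in[k].i * tw[k - 1].r + in[k].r * tw[k - 1].i;
    }

    /* Stage 2: sums and differences */
    t[0].r = s[0].r + s[2].r;  t[0].i = s[0].i + s[2].i;
    t[1].r = s[0].r - s[2].r;  t[1].i = s[0].i - s[2].i;
    t[2].r = s[1].r + s[3].r;  t[2].i = s[1].i + s[3].i;
    t[3].r = s[1].r - s[3].r;  t[3].i = s[1].i - s[3].i;

    /* Stage 3: combine; multiplying by -j or +j only swaps and negates parts */
    out[0].r = t[0].r + t[2].r;  out[0].i = t[0].i + t[2].i;
    out[2].r = t[0].r - t[2].r;  out[2].i = t[0].i - t[2].i;
    out[1].r = t[1].r + t[3].i;  out[1].i = t[1].i - t[3].r;   /* t1 - j*t3 */
    out[3].r = t[1].r - t[3].i;  out[3].i = t[1].i + t[3].r;   /* t1 + j*t3 */
}
```

With all twiddles set to 1, the function reduces to a plain 4-point DFT, which gives a quick way to check it.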
The following charts show the C/NEON performance boosts in ARM v8-A AArch32 and AArch64 on the same Cortex-A53 CPU of Juno. Larger is better.
All the blue bars show the data in the AArch32 mode. The NEON code is v7-A/v8-A AArch32 assembly. The toolchain used is gcc 4.9.
All the red bars show the data in the AArch64 mode. The NEON code is written with intrinsics, whose performance depends greatly on the toolchain. The toolchain used here is LLVM 3.5.
From these charts, we can conclude that the float complex FFT shows a similar or better performance boost in the AArch64 mode compared with the AArch32 mode. For the int32/int16 complex FFT, however, the performance boost in the AArch32 mode is usually larger than in the AArch64 mode (this does not mean that the int32/int16 complex FFT runs faster in the AArch32 mode than in the AArch64 mode!).
The data from this exercise is useful to analyze the performance boost for ARM v8-A AArch64 mode but we still need more data to verify and reinforce our concept.
The following charts use the AArch32 C version as the baseline and show the performance ratios of the AArch32 NEON version, the AArch64 C version, and the AArch64 NEON version on the same Cortex-A53 CPU on Juno. Larger is better.
From these charts, we can conclude that the FFT runs faster in the AArch64 mode than in the AArch32 mode, for both the C and the NEON versions.
The FFT still supports the following features:
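| Feature | Data type | Length |
|---|---|---|
| c2c FFT/IFFT | float/int32/int16 | 2^N (N is 2, 3, ...) |
| r2c FFT | float/int32/int16 | 2^N (N is 3, 4, ...) |
| c2r IFFT | float/int32/int16 | 2^N (N is 3, 4, ...) |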
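Since every transform requires a power-of-two length with a minimum exponent, it can be handy to validate a length before allocating a configuration. The helper below is a small sketch under that assumption; `fft_length_supported` is a hypothetical name and not a Ne10 function.

```c
/* Hypothetical helper (not part of Ne10): returns 1 if len == 2^N with
   N >= min_log2, otherwise 0. Use min_log2 = 2 for c2c and 3 for r2c/c2r,
   following the table above. min_log2 is assumed to be smaller than 32. */
int fft_length_supported(unsigned int len, unsigned int min_log2)
{
    if (len == 0 || (len & (len - 1)) != 0)
        return 0;                      /* zero or not a power of two */
    return len >= (1u << min_log2);    /* large enough for this transform */
}
```

For example, `fft_length_supported(1024, 2)` accepts a 1024-point c2c transform, while `fft_length_supported(4, 3)` rejects a 4-point r2c transform.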
However, the APIs have changed. Existing users need to update to the latest version, v1.1.2, or to master.
For more API details, see http://projectne10.github.io/Ne10/doc/group__C2C__FFT__IFFT.html.
Taking the float c2c FFT/IFFT as an example, the current APIs are used as follows.
#include "NE10.h"
……
{
fftSize = 1 << N; // fftSize = 2^N, where N is 2, 3, 4, 5, 6....
in = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
out = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
ne10_fft_cfg_float32_t cfg;
cfg = ne10_fft_alloc_c2c_float32 (fftSize);
//FFT
ne10_fft_c2c_1d_float32_neon (out, in, cfg, 0);
//IFFT
ne10_fft_c2c_1d_float32_neon (out, in, cfg, 1);
NE10_FREE (in);
NE10_FREE (out);
NE10_FREE (cfg);
}
The FFT shows that you can get a significant performance boost in the ARM v8-A AArch64 mode. You may find more use cases of course. We welcome feedback and are looking to publish use cases to cross promote ProjectNe10 and the projects that use it.
For more details, please access http://projectne10.github.com/Ne10/
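Hello, thank you very much for taking the time to reply.
When I run the official examples, both the C and the NEON versions pass and their results agree. I also added a C vs. NEON comparison to my own program, which is shown below.
The bold, underlined part of the main function is a NEON test of the math routines that I added; its results are correct ^oo^
The test signal is generated at a frequency of 10 Hz with 1024 points.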
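#define TEST_LENGTH_SAMPLES (1024)
/* TEST_FREQ and PI are used below but their definitions are not shown in the
   post; the values here are assumed (the text above gives 10 Hz). */
#define TEST_FREQ (10)
#define PI (3.14159265358979f)

/* Fills an interleaved complex buffer: real part 10*sin, imaginary part 10*cos */
void genarate_signal(float *complex_float_list, int freq, int total_num)
{
    int ii;
    for (ii = 0; ii < total_num; ii++) {
        complex_float_list[2*ii] = 10*(float)sin(2*ii*PI*freq/total_num);
        complex_float_list[2*ii+1] = 10*(float)cos(2*ii*PI*freq/total_num);
    }
}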
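/** Simple C2C test **/
/* search_MaxIdx is the poster's own helper; its definition is not shown in the post. */
void et_fft_float32_test()
{
    uint32_t fftSize = 0;
    uint32_t i = 0;
    uint32_t out_num;
    /* The buffer and config declarations are missing from the post; the types
       below are assumed from how the variables are used. */
    ne10_fft_cpx_float32_t *in, *out_c, *out_neon;
    ne10_fft_cfg_float32_t cfg_c, cfg_neon;
    static float out_amp_c[2 * TEST_LENGTH_SAMPLES], out_amp_neon[2 * TEST_LENGTH_SAMPLES];
    static float y_out_c[TEST_LENGTH_SAMPLES], y_out_neon[TEST_LENGTH_SAMPLES];

    fftSize = TEST_LENGTH_SAMPLES; // must be 2^N, N is 2, 3, 4, 5, 6....
    /* Allocation of the input buffer is also not shown in the post (assumed) */
    in = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
    out_c = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
    out_neon = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
    genarate_signal((float *)in, TEST_FREQ, fftSize);

    cfg_c = ne10_fft_alloc_c2c_float32_c (fftSize);
    if (cfg_c == NULL)
        fprintf (stdout, "======ERROR, FFT alloc fails\n");
    cfg_neon = ne10_fft_alloc_c2c_float32_neon (fftSize);
    if (cfg_neon == NULL)
        fprintf (stdout, "======ERROR, FFT alloc fails\n");

    /* Forward FFT with the C and the NEON implementations */
    ne10_fft_c2c_1d_float32_c (out_c, in, cfg_c, 0);
    ne10_fft_c2c_1d_float32_neon (out_neon, in, cfg_neon, 0);

    /* Element-wise squares of the real and imaginary parts */
    ne10_vmul_vec2f_c ((ne10_vec2f_t *)out_amp_c, (ne10_vec2f_t *)out_c, (ne10_vec2f_t *)out_c, fftSize);
    ne10_vmul_vec2f_neon ((ne10_vec2f_t *)out_amp_neon, (ne10_vec2f_t *)out_neon, (ne10_vec2f_t *)out_neon, fftSize);

    /* Scaled magnitude of each bin */
    for (i = 0; i < TEST_LENGTH_SAMPLES; i++) {
        y_out_c[i] = sqrtf (out_amp_c[2*i] + out_amp_c[2*i+1]) * 2/TEST_LENGTH_SAMPLES;
        y_out_neon[i] = sqrtf (out_amp_neon[2*i] + out_amp_neon[2*i+1]) * 2/TEST_LENGTH_SAMPLES;
    }

    search_MaxIdx (y_out_c, &out_num, TEST_LENGTH_SAMPLES);
    fprintf (stdout, "c max point num is %d = %f\n", out_num, y_out_c[out_num]);
    search_MaxIdx (y_out_neon, &out_num, TEST_LENGTH_SAMPLES);
    fprintf (stdout, "neon max point num is %d = %f\n", out_num, y_out_neon[out_num]);

    NE10_FREE (in);
    NE10_FREE (out_c);
    NE10_FREE (out_neon);
    NE10_FREE (cfg_c);
    NE10_FREE (cfg_neon);
}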
int main (ne10_int32_t argc, char** argv)
{
    ne10_result_t stat;
    ne10_result_t math_stat;
    ne10_result_t dsp_stat;
    ne10_result_t adc;
    uint32_t i;
    float abuf[32] = {10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,} ;
    float buf[32] = {100.1,1024.5,123.0,2345.1,8923.5,4567.2,99.8,11.11};
    float dstbuf[32] = {100.1,1024.5,123.0,2345.1,8923.5,4567.2,99.8,11.11};

    stat = ne10_init();
    if (stat == NE10_OK)
        printf("ne10_init OK!\n");
    /* Only the first status check survives in the post; the checks on
       math_stat, dsp_stat and ne10_HasNEON() below are assumed by analogy. */
    math_stat = ne10_init_math (stat);
    if (math_stat == NE10_OK)
        printf("ne10_init_math OK!\n");
    dsp_stat = ne10_init_dsp (stat);
    if (dsp_stat == NE10_OK)
        printf("ne10_init_dsp OK!\n");
    stat = ne10_HasNEON();
    if (stat == NE10_OK)
        printf("cpu with neon!\n");

    /* NEON test of the math routines: dstbuf[i] = abuf[i] + buf[i] */
    adc = ne10_add_float_neon (dstbuf, abuf, buf, 8);
    fprintf (stdout, "add neon result :\n");
    for (i = 0; i < 8; i++)
        fprintf (stdout, "%f\n", dstbuf[i]);

    //all_test();
    et_fft_float32_test();
    return 0;
}
Here is the printed output of the program:
root@myzr /mnt/nfs$ ./A9_test
ne10_init OK!
ne10_init_math OK!
ne10_init_dsp OK!
cpu with neon!
add neon result :
10100.099609
11024.500000
10123.000000
12345.099609
18923.500000
14567.200195
10099.799805
10011.110352
c max point num is 246 = 111.185280
neon max point num is 246 = 111.185280
root@myzr /mnt/nfs$
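I don't know where the problem is. Could you share the shared library that you tested with? I would like to try it, to check whether the library I compiled is at fault.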