FFT feature in ProjectNe10
Project Ne10 recently received an updated FFT implementation. It is heavily NEON-optimized for both ARM v7-A/v8-A AArch32 and v8-A AArch64, and it is faster than almost all other existing open source FFT implementations, such as FFTW and the FFT routine in OpenMAX DL. This article gives a brief introduction to it.
The following chart illustrates the benchmarking results of the complex FFT (32-bit float data type) of Ne10, FFTW and OpenMAX DL. The test platform is an ARM Cortex-A9. The X-axis of the chart represents the FFT length. The Y-axis represents the FFT execution time. Smaller is better.
From this chart, we can see that Ne10 is faster than FFTW and OpenMAX DL in most cases.
To use the NEON accelerator, we usually have two choices: NEON assembly or NEON intrinsics.
The following table describes the pros and cons of using assembly and intrinsics.

| | NEON assembly | NEON intrinsic |
| --- | --- | --- |
| Performance | Always shows the best performance for the specified platform | Depends heavily on the toolchain that is used |
| Portability | Different ISAs (i.e. ARM v7-A/v8-A AArch32 and ARM v8-A AArch64) have different assembly implementations. Even for the same ISA, the assembly might need to be fine-tuned to achieve ideal performance on different micro-architectures. | Program once and run on different ISAs. The compiler may also provide performance fine-tuning for different micro-architectures. |
| Maintainability | Hard to read/write compared with C. | Similar to C code, it's easy to read/write. |
Based on this comparison of pros and cons, intrinsics are preferred for the implementation of the Ne10 library.
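To give a feel for the "similar to C code" point, here is a minimal sketch of a complex multiplication of four values written with NEON intrinsics. It is an illustration only, not code from Ne10: the function name `cmul4_f32` and the interleaved re/im layout are assumptions, and a plain C fallback is included so that the same source also builds on targets without NEON.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not part of Ne10).
 * Multiplies 4 interleaved complex floats: out[k] = a[k] * b[k], k = 0..3.
 * Layout of each argument: re0, im0, re1, im1, re2, im2, re3, im3. */
void cmul4_f32(float *out, const float *a, const float *b)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* vld2q_f32 de-interleaves: val[0] holds real parts, val[1] imaginary parts */
    float32x4x2_t va = vld2q_f32(a);
    float32x4x2_t vb = vld2q_f32(b);
    float32x4x2_t vr;

    /* re = ar*br - ai*bi */
    vr.val[0] = vmulq_f32(va.val[0], vb.val[0]);
    vr.val[0] = vmlsq_f32(vr.val[0], va.val[1], vb.val[1]);
    /* im = ar*bi + ai*br */
    vr.val[1] = vmulq_f32(va.val[0], vb.val[1]);
    vr.val[1] = vmlaq_f32(vr.val[1], va.val[1], vb.val[0]);

    /* vst2q_f32 re-interleaves the result on store */
    vst2q_f32(out, vr);
#else
    /* Scalar fallback with the same semantics */
    for (size_t k = 0; k < 4; k++) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        out[2 * k]     = ar * br - ai * bi;
        out[2 * k + 1] = ar * bi + ai * br;
    }
#endif
}
```

The NEON branch reads much like the scalar branch, and the same source compiles for AArch32 and AArch64, which is the portability and maintainability argument made in the table.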
But for FFT, we still have different implementations for ARM v7-A/v8-A AArch32 and v8-A AArch64, for the reasons described below:
// radix 4 butterfly with twiddles
scratch[0].r = scratch_in[0].r;
scratch[0].i = scratch_in[0].i;
scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;
scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;
scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;
scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;
scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;
scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;
The above code snippet shows the basic element of the FFT: the radix-4 butterfly. From the code, we can conclude that:
And, for ARM v7-A/v8-A AArch32 and v8-A AArch64,
Considering the above factors, in practice Ne10 ends up with an assembly version for ARM v7-A/v8-A AArch32, in which 2 radix-4 butterflies are executed per loop iteration, and an intrinsic version for ARM v8-A AArch64, in which 4 radix-4 butterflies are executed per loop iteration.
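The snippet above shows only the twiddle-factor multiplications. For readers who want to see where they fit, here is a minimal scalar sketch of one complete forward radix-4 butterfly. It is an assumption-laden illustration rather than Ne10's code: the `cpx_t` type, `cpx_mul` and `radix4_butterfly` are hypothetical names, and the add/subtract stage is the textbook form for a forward transform.

```c
/* Hypothetical complex type mirroring the .r/.i fields used above */
typedef struct { float r, i; } cpx_t;

/* Complex multiplication: (a.r + j*a.i) * (b.r + j*b.i) */
static cpx_t cpx_mul(cpx_t a, cpx_t b)
{
    cpx_t c = { a.r * b.r - a.i * b.i, a.i * b.r + a.r * b.i };
    return c;
}

/* One forward radix-4 butterfly.
 * in[0..3]: inputs, tw[0..2]: twiddles applied to in[1..3]. */
void radix4_butterfly(cpx_t out[4], const cpx_t in[4], const cpx_t tw[3])
{
    /* Stage 1: twiddle multiplications (same as the snippet above) */
    cpx_t s0 = in[0];
    cpx_t s1 = cpx_mul(in[1], tw[0]);
    cpx_t s2 = cpx_mul(in[2], tw[1]);
    cpx_t s3 = cpx_mul(in[3], tw[2]);

    /* Stage 2: sums and differences */
    cpx_t t0 = { s0.r + s2.r, s0.i + s2.i };
    cpx_t t1 = { s0.r - s2.r, s0.i - s2.i };
    cpx_t t2 = { s1.r + s3.r, s1.i + s3.i };
    cpx_t t3 = { s1.r - s3.r, s1.i - s3.i };

    /* Stage 3: combine; multiplying by -j maps (r, i) to (i, -r) */
    out[0].r = t0.r + t2.r;  out[0].i = t0.i + t2.i;
    out[2].r = t0.r - t2.r;  out[2].i = t0.i - t2.i;
    out[1].r = t1.r + t3.i;  out[1].i = t1.i - t3.r;
    out[3].r = t1.r - t3.i;  out[3].i = t1.i + t3.r;
}
```

With all twiddles set to 1, this reduces to a 4-point DFT, which makes it easy to check by hand. Processing 2 or 4 such butterflies per loop iteration, as described above, amounts to running this arithmetic on several independent inputs held in the lanes of NEON registers.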
The following charts show the performance boost of NEON over C in ARM v8-A AArch32 and AArch64 on the same Cortex-A53 CPU of Juno. Larger is better.
All the blue bars show the data in the AArch32 mode. The NEON code is v7-A/v8-A AArch32 assembly. The toolchain used is GCC 4.9.
All the red bars show the data in the AArch64 mode. The NEON code is intrinsic. The performance of intrinsics depends greatly on the toolchain. The toolchain used here is LLVM 3.5.
From these charts, we can conclude that the float complex FFT shows a similar or better performance boost in the AArch64 mode compared with the AArch32 mode. For the int32/int16 complex FFT, however, the performance boost in the AArch32 mode is usually larger than in the AArch64 mode (this doesn't mean that the int32/int16 complex FFT runs faster in the AArch32 mode than in the AArch64 mode!).
The data from this exercise is useful to analyze the performance boost for ARM v8-A AArch64 mode but we still need more data to verify and reinforce our concept.
The following charts use the performance of the AArch32 C version as the baseline and show the performance ratios of the AArch32 NEON version, the AArch64 C version, and the AArch64 NEON version on the same Cortex-A53 CPU on Juno. Larger is better.
From these charts, we can conclude that the FFT runs faster in the AArch64 mode than in the AArch32 mode, for both the C and the NEON versions.
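The FFT still supports the following features:

| Feature | Data type | Length |
| --- | --- | --- |
| c2c FFT/IFFT | float/int32/int16 | 2^N (N is 2, 3, …) |
| r2c FFT | float/int32/int16 | 2^N (N is 3, 4, …) |
| c2r IFFT | float/int32/int16 | 2^N (N is 3, 4, …) |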
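The length constraints in the table can be checked before allocating a configuration. The following is a small sketch of such a check. The enum and function names are hypothetical and not part of the Ne10 API, and the minimum lengths are simply read off the table above (N starts at 2 for c2c and at 3 for r2c/c2r).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, not part of Ne10 */
typedef enum { FFT_KIND_C2C, FFT_KIND_R2C, FFT_KIND_C2R } fft_kind_t;

/* True if n has exactly one bit set */
static bool is_power_of_two(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Checks a length against the table: 2^N with N >= 2 for c2c,
 * and 2^N with N >= 3 for r2c and c2r. */
bool fft_length_is_supported(fft_kind_t kind, uint32_t n)
{
    uint32_t min_len = (kind == FFT_KIND_C2C) ? 4u : 8u;
    return is_power_of_two(n) && n >= min_len;
}
```

For example, a length of 4 passes for c2c but not for r2c, and a length of 6 is rejected for every kind.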
But the APIs have changed. Existing users need to update to the latest version, v1.1.2, or to master.
For more API details, please check http://projectne10.github.io/Ne10/doc/group__C2C__FFT__IFFT.html.
Taking the float c2c FFT/IFFT as an example, the current APIs are used as follows.
#include "NE10.h"
……
{
    // fftSize must be a power of two: 2^N, where N is 2, 3, 4, 5, 6....
    // Note that ^ is XOR in C, so the size is computed with a shift.
    ne10_int32_t fftSize = 1 << N;
    ne10_fft_cpx_float32_t *in = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
    ne10_fft_cpx_float32_t *out = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
    ne10_fft_cfg_float32_t cfg = ne10_fft_alloc_c2c_float32 (fftSize);
    // FFT (last argument is 0)
    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 0);
    // IFFT (last argument is 1)
    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 1);
    NE10_FREE (in);
    NE10_FREE (out);
    NE10_FREE (cfg);
}
The FFT shows that you can get a significant performance boost in the ARM v8-A AArch64 mode. You may find more use cases of course. We welcome feedback and are looking to publish use cases to cross promote ProjectNe10 and the projects that use it.
For more details, please access http://projectne10.github.com/Ne10/
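Hello, when testing the FFT feature of the NE10 library, I cannot get correct results. The hardware platform is an i.MX6Q Cortex-A9 with NEON, and the Ne10 library was built as a shared library with the gcc hf (hard-float) toolchain. The program is as follows: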
void genarate_signal(float *complex_float_list,int freq,int total_num)
{
    int ii;
    for(ii = 0;ii<total_num;ii++)
    {
        complex_float_list[2*ii] = 100+
            10*sin(2*ii*PI*freq/total_num)+
            30*sin(2*ii*PI*freq*2/total_num)+
            50*cos(2*ii*PI*freq/2/total_num);
        fprintf (stdout, "%f \n",complex_float_list[2*ii] );
        complex_float_list[2*ii+1] =0;
    }
}
void test_fft_c2c_1d_float32_test1024()
{
    ne10_int32_t i = 0;
    ne10_int32_t fftSize = TEST_LENGTH_SAMPLES;
    ne10_int32_t flag_result = NE10_OK;
    genarate_signal(testInput_f32,50,TEST_LENGTH_SAMPLES);
    /* FFT test */
    memcpy (in_c, testInput_f32, 2 * fftSize * sizeof (ne10_float32_t));
    memcpy (in_neon, testInput_f32, 2 * fftSize * sizeof (ne10_float32_t));
    flag_result = test_c2c_alloc (fftSize);
    if (flag_result == NE10_ERR)
    {
        return;
    }
    ne10_fft_c2c_1d_float32_neon ( (ne10_fft_cpx_float32_t*) out_neon, (ne10_fft_cpx_float32_t*) in_neon, cfg_neon, 0);
    ne10_vmul_vec2f_neon(out_amp_f32, (ne10_vec2f_t *)out_neon, (ne10_vec2f_t *)out_neon, fftSize);
    NE10_FREE (cfg_c);
    NE10_FREE (cfg_neon);
}
void Test_float_1024()
{
    uint32_t index = 0;
    uint32_t i = 0;
    float *p =out_amp_f32;
    my_test_setup();
    test_fft_c2c_1d_float32_test1024();
    for(i=0;i<TEST_LENGTH_SAMPLES;i++){
        y_out[i] = sqrt(out_amp_f32[2*i]+out_amp_f32[2*i+1]) * 2/TEST_LENGTH_SAMPLES;
    }
    p =y_out;
    for(i=0;i<TEST_LENGTH_SAMPLES/8;i++){
        fprintf (stdout, "%f %f %f %f\n",\
            *(p+0),
            *(p+1),
            *(p+2),
            *(p+3)
            );
        p+=4;
    }
    index = search_MaxIdx(y_out,TEST_LENGTH_SAMPLES);
    fprintf (stdout, "max point num is %d = %f\n",index,y_out[index]);
}
The terminal prints the following test results:
70.334671 28.591980 9.328466 267.897675
119.986031 42.944847 30.605927 178.480011
2.802216 0.700661 0.826645 0.592073
8.013395 0.335176 0.786936 2.207294
24.438063 9.649106 1.739657 6.406336
49.667835 14.928893 9.956310 4.767300
8.998783 3.898322 3.776710 2.063089
22.830889 3.079061 5.898489 7.429659
41.256512 81.707916 1.372445 16.695229
64.763451 110.923729 12.887453 11.843207
17.583437 3.969235 7.544624 16.488457
30.619579 4.536862 11.899246 71.173515
47.053455 134.195160 6.081620 141.013489
67.156342 185.168762 20.658308 89.664536
28.569382 7.696736 17.699919 19.752989
62.671543 1.706358 55.254002 94.741020
121.885223 50.442410 13.321230 191.812775
194.830917 73.343239 35.800415 121.644196
6.072055 1.357412 3.059041 1.598315
16.019142 0.397725 6.136387 7.071984
45.714428 6.032665 1.914879 22.993378
88.425896 5.176524 7.825125 17.754688
11.952877 0.998287 3.868603 7.507260
38.311497 0.812865 6.276744 25.597332
72.796387 24.283131 3.027011 53.883930
115.895729 36.538441 23.552629 35.805290
20.097837 1.411887 8.378696 46.889088
64.220795 1.707423 8.499466 191.427780
124.767319 52.643639 3.248033 360.684113
199.053009 74.936172 15.439220 219.206482
30.036509 3.189077 10.664173 46.363052
54.214390 0.719910 35.880524 214.325302
119.226959 21.574148 10.367178 419.649170
199.155594 31.703428 48.919361 258.139801
9.362836 2.081679 3.328502 3.414439
19.946901 0.639163 8.752939 18.909359
53.746208 10.769773 6.246037 69.574326
104.665573 10.923879 50.765209 57.735943
16.781229 2.264422 12.837126 25.497133
54.318176 1.661909 13.626813 89.291451
105.531845 44.081799 26.905071 191.079147
171.331207 61.084209 160.569611 128.230637
32.309647 2.234591 22.595129 168.871979
50.999645 2.601134 20.814266 691.298950
82.439835 77.982269 30.584463 1303.456665
123.491844 108.612312 144.464279 791.645569
51.197216 4.541424 28.213940 167.160614
89.401024 1.010197 46.828716 770.945618
158.618576 29.891911 23.131077 1505.259277
244.922058 43.440079 128.933334 923.010559
0.781249 0.261833 0.972973 0.173306
1.519032 0.162986 1.576106 0.862871
6.569441 4.932852 1.029856 4.475933
18.061516 7.669132 4.722344 4.704646
3.131462 2.002090 1.140756 2.442270
10.600313 1.583177 2.895870 9.639306
22.244659 42.139153 3.595555 22.658943
38.662354 57.456371 19.420525 16.423494
8.779945 2.066681 1.879913 23.084402
20.494619 2.375584 3.326605 99.979324
36.922276 70.676743 3.210587 198.109390
58.160000 98.095161 18.002308 125.774437
16.583111 4.101026 11.155333 27.642084
35.591835 0.914379 20.515997 132.214783
71.135513 27.176350 9.725371 266.915863
116.408607 39.721714 34.145473 168.792496
2.395200 0.632174 1.010430 0.534596
7.250558 0.295623 0.553165 1.912158
21.566204 8.335085 1.192122 5.346692
42.562283 12.651375 6.921299 3.846786
7.329902 3.245823 3.128737 1.614435
18.660418 2.522219 4.925622 5.653284
33.294441 65.927452 1.238718 12.381126
51.399342 88.253151 6.899793 8.577411
13.582767 3.116997 5.153646 11.683537
22.320833 3.519556 9.097980 49.421280
33.011452 102.923698 4.488597 96.089935
45.807373 140.509064 15.091604 60.036774
21.147333 5.782093 14.575815 13.011012
46.495182 1.269837 42.136086 61.454144
88.984634 37.206020 11.166471 122.641159
140.679443 53.645432 30.870945 76.730774
4.009279 0.985009 2.400937 0.995405
10.617405 0.286453 4.619491 4.351654
30.200201 4.314039 1.442225 13.988586
57.881393 3.676853 5.458117 10.685783
7.449679 0.704537 2.959206 4.472389
21.998724 0.570169 4.699492 15.102328
40.785259 16.934618 1.511200 31.499527
63.923351 25.340105 11.726213 20.748184
12.262385 0.973994 5.265160 26.944889
42.646652 1.171906 6.144588 109.129822
81.530411 35.958084 1.687330 204.059067
128.873108 50.947800 5.419734 123.117332
17.909845 2.158560 8.456322 25.858757
41.245060 0.485195 26.023010 118.742477
86.326035 14.481146 7.408440 231.013153
140.599380 21.196472 30.085545 141.232834
5.623376 1.386527 2.652725 1.857092
12.436990 0.424176 6.625458 10.226460
33.575073 7.122200 2.764629 37.421822
65.302597 7.199735 23.643044 30.891472
9.328558 1.487573 7.611353 13.573223
30.355850 1.088321 8.573069 47.301987
58.874477 28.780415 11.672723 100.747841
95.321388 39.764107 69.452614 67.303818
17.177988 1.450534 10.831605 88.246025
35.235332 1.683838 12.519513 359.714935
63.287758 50.347927 14.295434 675.468872
98.716370 69.943459 64.209389 408.614044
26.627571 2.917269 12.159424 85.949753
46.819054 0.647349 33.782028 394.925873
87.866707 19.110685 6.656260 768.306152
138.886826 27.709297 42.433552 469.470947
max point num is 251 = 1505.259277
The result shows that the maximum is at point 251 with an amplitude of 1505, which does not match the theoretical value.