Ne10 FFT feature

December 17, 2013

FFT feature in Project Ne10

1 Introduction

Project Ne10 recently received an updated FFT implementation. It is heavily NEON-optimized for both ARM v7-A/v8-A AArch32 and v8-A AArch64, and it is faster than almost all other existing open source FFT implementations, such as FFTW and the FFT routine in OpenMax DL. This article gives a brief introduction to it.

2 Performance comparison with other FFTs on ARM v7-A

The following chart illustrates the benchmarking results of the complex FFT (32-bit float data type) of Ne10, FFTW and OpenMax DL. The test platform is an ARM Cortex-A9. The X-axis of the chart represents the FFT length. The Y-axis represents the execution time of the FFT. Smaller is better.

From this chart, we can see that Ne10 outperforms FFTW and OpenMax DL in most cases.

3 FFT on ARM v7-A/v8-A AArch32 and ARM v8-A AArch64

3.1 NEON usage

To utilize the NEON accelerator, we usually have two choices:

NEON assembly
NEON intrinsic

The following table describes the pros and cons of using assembly/intrinsic.

Criterion	NEON assembly	NEON intrinsic
Performance	Always shows the best performance for the specified platform	Depends heavily on the toolchain that is used
Portability	Different ISAs (i.e. ARM v7-A/v8-A AArch32 and ARM v8-A AArch64) need different assembly implementations. Even for the same ISA, the assembly might need to be fine-tuned for different micro-architectures to achieve ideal performance.	Program once and run on different ISAs. The compiler may also provide performance fine-tuning for different micro-architectures.
Maintainability	Hard to read and write compared with C.	Similar to C code, so it is easy to read and write.
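
To make the portability row more concrete, here is a minimal sketch of intrinsic-based code: an element-wise complex multiply, the operation at the core of twiddle multiplication. The function name `cpx_mul_f32` and the interleaved (re, im) float layout are assumptions made for illustration and are not taken from Ne10; the intrinsics themselves come from `arm_neon.h`. The same source is intended to build for both AArch32 and AArch64, and a scalar path is used where NEON is unavailable.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not part of Ne10): multiplies n interleaved complex
 * floats (re, im, re, im, ...) in a by the matching elements of b. */
void cpx_mul_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four complex values per iteration, using 128-bit registers. */
    for (; i + 4 <= n; i += 4) {
        /* vld2q_f32 de-interleaves: val[0] holds re, val[1] holds im. */
        float32x4x2_t va = vld2q_f32(a + 2 * i);
        float32x4x2_t vb = vld2q_f32(b + 2 * i);
        float32x4x2_t vr;
        /* re = ar*br - ai*bi */
        vr.val[0] = vmlsq_f32(vmulq_f32(va.val[0], vb.val[0]), va.val[1], vb.val[1]);
        /* im = ai*br + ar*bi */
        vr.val[1] = vmlaq_f32(vmulq_f32(va.val[1], vb.val[0]), va.val[0], vb.val[1]);
        vst2q_f32(dst + 2 * i, vr);
    }
#endif
    /* Scalar tail, and the whole loop on targets without NEON. */
    for (; i < n; ++i) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        dst[2 * i]     = ar * br - ai * bi;
        dst[2 * i + 1] = ai * br + ar * bi;
    }
}
```

Calling `cpx_mul_f32(dst, a, b, n)` with `n` complex pairs in each buffer fills `dst` with the products. How well the NEON path performs depends on the compiler, which is the trade-off the table describes.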

3.2 ARM v7-A/v8-A AArch32 and v8-A AArch64 FFT implementations

Based on the comparison above, intrinsics are preferred for implementing the Ne10 library.

For FFT, however, we still have separate implementations for ARM v7-A/v8-A AArch32 and v8-A AArch64, for the following reason:

// radix 4 butterfly with twiddles
scratch[0].r = scratch_in[0].r;
scratch[0].i = scratch_in[0].i;
scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;
scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;
scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;
scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;
scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;
scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;

The code snippet above shows the basic building block of the FFT: the radix-4 butterfly. From the code, we can conclude that:

20 64-bit NEON registers are needed if 2 radix-4 butterflies are executed in one loop.
20 128-bit NEON registers are needed if 4 radix-4 butterflies are executed in one loop.

The number of available NEON registers differs between ARM v7-A/v8-A AArch32 and v8-A AArch64:

There are 32 64-bit or 16 128-bit NEON registers for ARM v7-A/v8-A AArch32.
There are 32 128-bit NEON registers for ARM v8-A AArch64.

Executing 4 radix-4 butterflies per loop needs 20 128-bit registers, which exceeds the 16 available in AArch32 but fits within the 32 available in AArch64. Considering these factors, Ne10 in practice ends up with an assembly version for ARM v7-A/v8-A AArch32, in which 2 radix-4 butterflies are executed in one loop, and an intrinsic version for ARM v8-A AArch64, in which 4 radix-4 butterflies are executed in one loop.
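
The radix-4 snippet in this section covers only the twiddle multiplications. As a rough illustration of what a complete forward radix-4 butterfly computes, here is a scalar C sketch. The type `cpx_t` and the function names are hypothetical stand-ins, and the output ordering is the plain 4-point DFT order, which may differ from how Ne10 arranges its stages internally.

```c
/* Hypothetical stand-in for a complex float with .r and .i fields. */
typedef struct { float r, i; } cpx_t;

static cpx_t cpx_mul(cpx_t a, cpx_t b)
{
    cpx_t p = { a.r * b.r - a.i * b.i, a.i * b.r + a.r * b.i };
    return p;
}

/* Forward radix-4 butterfly: in[1..3] are multiplied by tw[0..2],
 * then a 4-point DFT is applied to the four values. */
void radix4_butterfly(cpx_t out[4], const cpx_t in[4], const cpx_t tw[3])
{
    /* Twiddle stage, same arithmetic as the snippet above. */
    cpx_t s0 = in[0];
    cpx_t s1 = cpx_mul(in[1], tw[0]);
    cpx_t s2 = cpx_mul(in[2], tw[1]);
    cpx_t s3 = cpx_mul(in[3], tw[2]);

    /* Sums and differences of the even and odd pairs. */
    cpx_t t0 = { s0.r + s2.r, s0.i + s2.i };
    cpx_t t1 = { s0.r - s2.r, s0.i - s2.i };
    cpx_t t2 = { s1.r + s3.r, s1.i + s3.i };
    cpx_t t3 = { s1.r - s3.r, s1.i - s3.i };

    out[0].r = t0.r + t2.r;  out[0].i = t0.i + t2.i;
    out[2].r = t0.r - t2.r;  out[2].i = t0.i - t2.i;
    /* Multiplying t3 by -j gives (t3.i, -t3.r). */
    out[1].r = t1.r + t3.i;  out[1].i = t1.i - t3.r;
    out[3].r = t1.r - t3.i;  out[3].i = t1.i + t3.r;
}
```

With all twiddles set to 1 + 0j, the function reduces to a plain 4-point DFT, which is a convenient way to check it by hand.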

3.3 C/NEON performance boost

The following charts show the performance boost of the NEON version over the C version in ARM v8-A AArch32 and AArch64 on the same Cortex-A53 CPU of Juno. Larger is better.

All the blue bars show the data in the AArch32 mode. The NEON code is v7-A/v8-A AArch32 assembly. The toolchain used is GCC 4.9.

All the red bars show the data in the AArch64 mode. The NEON code is intrinsic. The performance of intrinsics depends greatly on the toolchain. The toolchain used here is LLVM 3.5.

From these charts, we can conclude that for float complex FFT the performance boost in the AArch64 mode is similar to or better than in the AArch32 mode. For int32/int16 complex FFT, however, the performance boost in the AArch32 mode is usually larger than in the AArch64 mode (this does not mean that the int32/int16 complex FFT runs faster in the AArch32 mode than in the AArch64 mode).

The data from this exercise is useful for analyzing the performance boost in the ARM v8-A AArch64 mode, but we still need more data to verify and reinforce this conclusion.

3.4 AArch32/AArch64 performance boost

The following charts use the performance of the AArch32 C version as the baseline and show the performance ratios of the AArch32 NEON version, the AArch64 C version, and the AArch64 NEON version on the same Cortex-A53 CPU on Juno. Larger is better.

From these charts, we can conclude that FFT runs faster in the AArch64 mode than in the AArch32 mode, for both the C and the NEON versions.

4 Usage

4.1 APIs

The FFT still supports the following features:

Feature	Data type	Length
c2c FFT/IFFT	float/int32/int16	2^N (N = 2, 3, ...)
r2c FFT	float/int32/int16	2^N (N = 3, 4, ...)
c2r IFFT	float/int32/int16	2^N (N = 3, 4, ...)

However, the APIs have changed. Existing users need to update to the latest release, v1.1.2, or to master.

For more API details, see http://projectne10.github.io/Ne10/doc/group__C2C__FFT__IFFT.html.
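
Since the supported lengths in the table are powers of two with a minimum exponent, it can be useful to validate a length before allocating a configuration. The helpers below are a minimal sketch with hypothetical names; they are not part of the Ne10 API and only encode the constraints from the table (N of at least 2 for c2c, N of at least 3 for r2c and c2r).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if n is a power of two and at least min_len.
 * min_len is expected to be a power of two itself. */
bool is_pow2_at_least(uint32_t n, uint32_t min_len)
{
    /* A power of two has exactly one bit set, so n & (n - 1) is zero. */
    return n >= min_len && (n & (n - 1)) == 0;
}

/* c2c FFT/IFFT: 2^N with N = 2, 3, ... so the smallest length is 4. */
bool valid_c2c_length(uint32_t n)
{
    return is_pow2_at_least(n, 4);
}

/* r2c FFT and c2r IFFT: 2^N with N = 3, 4, ... so the smallest length is 8. */
bool valid_r2c_length(uint32_t n)
{
    return is_pow2_at_least(n, 8);
}
```

For instance, `valid_c2c_length(1024)` holds, while `valid_c2c_length(12)` and `valid_r2c_length(4)` do not.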

4.2 Example

Taking the float c2c FFT/IFFT as an example, the current APIs are used as follows.

#include "NE10.h"

……

{
    // FFT length 2^N, where N is 2, 3, 4, 5, 6....
    // In C, ^ is XOR, so the length is computed with a shift.
    ne10_int32_t fftSize = 1 << N;

    ne10_fft_cpx_float32_t *in = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));
    ne10_fft_cpx_float32_t *out = (ne10_fft_cpx_float32_t*) NE10_MALLOC (fftSize * sizeof (ne10_fft_cpx_float32_t));

    ne10_fft_cfg_float32_t cfg;
    cfg = ne10_fft_alloc_c2c_float32 (fftSize);

    ……

    // FFT (last argument is 0)
    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 0);

    ……

    // IFFT (last argument is 1)
    ne10_fft_c2c_1d_float32_neon (out, in, cfg, 1);

    ……

    NE10_FREE (in);
    NE10_FREE (out);
    NE10_FREE (cfg);
}
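
After a forward transform like the one in the example, a common next step is to locate the strongest frequency bin. The sketch below assumes the output is an array of complex values with `.r` and `.i` fields, as in the butterfly snippet earlier; `spec_cpx_t`, `peak_bin` and `bin_to_hz` are hypothetical names, not Ne10 functions. It compares squared magnitudes, so no square root is needed to find the peak.

```c
#include <stddef.h>

/* Hypothetical stand-in for a complex float with .r and .i fields. */
typedef struct { float r, i; } spec_cpx_t;

/* Returns the index of the bin with the largest squared magnitude
 * among the first len bins (the first such bin if there is a tie). */
size_t peak_bin(const spec_cpx_t *spec, size_t len)
{
    size_t best = 0;
    float best_power = -1.0f;
    for (size_t k = 0; k < len; ++k) {
        float power = spec[k].r * spec[k].r + spec[k].i * spec[k].i;
        if (power > best_power) {
            best_power = power;
            best = k;
        }
    }
    return best;
}

/* Converts a bin index to a frequency in Hz for a given sample rate
 * and FFT length. */
float bin_to_hz(size_t bin, float sample_rate_hz, size_t fft_size)
{
    return (float)bin * sample_rate_hz / (float)fft_size;
}
```

For a real-valued input signal the spectrum is conjugate-symmetric, so it is usually enough to search the first half of the bins, for example `peak_bin(out, fftSize / 2)`. Bin 0 holds the DC component and may dominate if the signal has a large offset.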

5 Conclusion

The FFT results show that you can get a significant performance boost in the ARM v8-A AArch64 mode. You may of course find more use cases. We welcome feedback and are looking to publish use cases to cross-promote Project Ne10 and the projects that use it.

For more details, see http://projectne10.github.com/Ne10/

Comments

UFO over 7 years ago

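Hello, when testing the FFT feature with the Ne10 library, I found that I could not get correct results. The hardware platform is an i.MX6Q (Cortex-A9 with NEON), and the Ne10 library was built as a shared library with the gcc hf (hard-float) toolchain. The program is as follows: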
void genarate_signal(float *complex_float_list,int freq,int total_num)
{
int ii;
for(ii = 0;ii<total_num;ii++)
{
complex_float_list[2*ii] = 100+
       10*sin(2*ii*PI*freq/total_num)+
       30*sin(2*ii*PI*freq*2/total_num)+
       50*cos(2*ii*PI*freq/2/total_num);
       fprintf (stdout, "%f \n",complex_float_list[2*ii] );
       complex_float_list[2*ii+1] =0;
}
}
void test_fft_c2c_1d_float32_test1024()
{
    ne10_int32_t i = 0;
    ne10_int32_t fftSize = TEST_LENGTH_SAMPLES;
    ne10_int32_t flag_result = NE10_OK;
    genarate_signal(testInput_f32,50,TEST_LENGTH_SAMPLES);
/* FFT test */
memcpy (in_c, testInput_f32, 2 * fftSize * sizeof (ne10_float32_t));
memcpy (in_neon, testInput_f32, 2 * fftSize * sizeof (ne10_float32_t));
flag_result = test_c2c_alloc (fftSize);
if (flag_result == NE10_ERR)
      return;
ne10_fft_c2c_1d_float32_neon ( (ne10_fft_cpx_float32_t*) out_neon, (ne10_fft_cpx_float32_t*) in_neon, cfg_neon, 0);
ne10_vmul_vec2f_neon(out_amp_f32, (ne10_vec2f_t *)out_neon, (ne10_vec2f_t *)out_neon, fftSize);
NE10_FREE (cfg_c);
NE10_FREE (cfg_neon);
}
void Test_float_1024()
{
uint32_t index = 0;
uint32_t i = 0;
float *p =out_amp_f32;
my_test_setup();
test_fft_c2c_1d_float32_test1024();
for(i=0;i<TEST_LENGTH_SAMPLES;i++){
       y_out[i] = sqrt(out_amp_f32[2*i]+out_amp_f32[2*i+1]) * 2/TEST_LENGTH_SAMPLES;
}
p =y_out;
for(i=0;i<TEST_LENGTH_SAMPLES/8;i++){
       fprintf (stdout, "%f   %f   %f   %f\n",\
            *(p+0),
            *(p+1),
            *(p+2),
            *(p+3)
            );
            p+=4;
}
index = search_MaxIdx(y_out,TEST_LENGTH_SAMPLES);
fprintf (stdout, "max point num is %d = %f\n",index,y_out[index]);
}
The test results printed on the terminal are as follows:
70.334671   28.591980   9.328466   267.897675
119.986031   42.944847   30.605927   178.480011
2.802216   0.700661   0.826645   0.592073
8.013395   0.335176   0.786936   2.207294
24.438063   9.649106   1.739657   6.406336
49.667835   14.928893   9.956310   4.767300
8.998783   3.898322   3.776710   2.063089
22.830889   3.079061   5.898489   7.429659
41.256512   81.707916   1.372445   16.695229
64.763451   110.923729   12.887453   11.843207
17.583437   3.969235   7.544624   16.488457
30.619579   4.536862   11.899246   71.173515
47.053455   134.195160   6.081620   141.013489
67.156342   185.168762   20.658308   89.664536
28.569382   7.696736   17.699919   19.752989
62.671543   1.706358   55.254002   94.741020
121.885223   50.442410   13.321230   191.812775
194.830917   73.343239   35.800415   121.644196
6.072055   1.357412   3.059041   1.598315
16.019142   0.397725   6.136387   7.071984
45.714428   6.032665   1.914879   22.993378
88.425896   5.176524   7.825125   17.754688
11.952877   0.998287   3.868603   7.507260
38.311497   0.812865   6.276744   25.597332
72.796387   24.283131   3.027011   53.883930
115.895729   36.538441   23.552629   35.805290
20.097837   1.411887   8.378696   46.889088
64.220795   1.707423   8.499466   191.427780
124.767319   52.643639   3.248033   360.684113
199.053009   74.936172   15.439220   219.206482
30.036509   3.189077   10.664173   46.363052
54.214390   0.719910   35.880524   214.325302
119.226959   21.574148   10.367178   419.649170
199.155594   31.703428   48.919361   258.139801
9.362836   2.081679   3.328502   3.414439
19.946901   0.639163   8.752939   18.909359
53.746208   10.769773   6.246037   69.574326
104.665573   10.923879   50.765209   57.735943
16.781229   2.264422   12.837126   25.497133
54.318176   1.661909   13.626813   89.291451
105.531845   44.081799   26.905071   191.079147
171.331207   61.084209   160.569611   128.230637
32.309647   2.234591   22.595129   168.871979
50.999645   2.601134   20.814266   691.298950
82.439835   77.982269   30.584463   1303.456665
123.491844   108.612312   144.464279   791.645569
51.197216   4.541424   28.213940   167.160614
89.401024   1.010197   46.828716   770.945618
158.618576   29.891911   23.131077   1505.259277
244.922058   43.440079   128.933334   923.010559
0.781249   0.261833   0.972973   0.173306
1.519032   0.162986   1.576106   0.862871
6.569441   4.932852   1.029856   4.475933
18.061516   7.669132   4.722344   4.704646
3.131462   2.002090   1.140756   2.442270
10.600313   1.583177   2.895870   9.639306
22.244659   42.139153   3.595555   22.658943
38.662354   57.456371   19.420525   16.423494
8.779945   2.066681   1.879913   23.084402
20.494619   2.375584   3.326605   99.979324
36.922276   70.676743   3.210587   198.109390
58.160000   98.095161   18.002308   125.774437
16.583111   4.101026   11.155333   27.642084
35.591835   0.914379   20.515997   132.214783
71.135513   27.176350   9.725371   266.915863
116.408607   39.721714   34.145473   168.792496
2.395200   0.632174   1.010430   0.534596
7.250558   0.295623   0.553165   1.912158
21.566204   8.335085   1.192122   5.346692
42.562283   12.651375   6.921299   3.846786
7.329902   3.245823   3.128737   1.614435
18.660418   2.522219   4.925622   5.653284
33.294441   65.927452   1.238718   12.381126
51.399342   88.253151   6.899793   8.577411
13.582767   3.116997   5.153646   11.683537
22.320833   3.519556   9.097980   49.421280
33.011452   102.923698   4.488597   96.089935
45.807373   140.509064   15.091604   60.036774
21.147333   5.782093   14.575815   13.011012
46.495182   1.269837   42.136086   61.454144
88.984634   37.206020   11.166471   122.641159
140.679443   53.645432   30.870945   76.730774
4.009279   0.985009   2.400937   0.995405
10.617405   0.286453   4.619491   4.351654
30.200201   4.314039   1.442225   13.988586
57.881393   3.676853   5.458117   10.685783
7.449679   0.704537   2.959206   4.472389
21.998724   0.570169   4.699492   15.102328
40.785259   16.934618   1.511200   31.499527
63.923351   25.340105   11.726213   20.748184
12.262385   0.973994   5.265160   26.944889
42.646652   1.171906   6.144588   109.129822
81.530411   35.958084   1.687330   204.059067
128.873108   50.947800   5.419734   123.117332
17.909845   2.158560   8.456322   25.858757
41.245060   0.485195   26.023010   118.742477
86.326035   14.481146   7.408440   231.013153
140.599380   21.196472   30.085545   141.232834
5.623376   1.386527   2.652725   1.857092
12.436990   0.424176   6.625458   10.226460
33.575073   7.122200   2.764629   37.421822
65.302597   7.199735   23.643044   30.891472
9.328558   1.487573   7.611353   13.573223
30.355850   1.088321   8.573069   47.301987
58.874477   28.780415   11.672723   100.747841
95.321388   39.764107   69.452614   67.303818
17.177988   1.450534   10.831605   88.246025
35.235332   1.683838   12.519513   359.714935
63.287758   50.347927   14.295434   675.468872
98.716370   69.943459   64.209389   408.614044
26.627571   2.917269   12.159424   85.949753
46.819054   0.647349   33.782028   394.925873
87.866707   19.110685   6.656260   768.306152
138.886826   27.709297   42.433552   469.470947
max point num is 251 = 1505.259277
The results show that the maximum is at point 251 with an amplitude of 1505, which does not match the theoretical value.

