code:
/* In-place forward 2-D FFT, single precision (FP32). */
static void fftwf_armpl_fp32(fftwf_complex* signal, int row, int col)
{
    fftwf_plan plan_f = fftwf_plan_dft_2d(col, row, signal, signal, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_execute(plan_f);
    fftwf_destroy_plan(plan_f);
}

/* In-place forward 2-D FFT, half precision (FP16). */
static void fftwf_armpl_fp16(fftwh_complex* signal, int row, int col)
{
    fftwh_plan plan_h = fftwh_plan_dft_2d(col, row, signal, signal, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwh_execute(plan_h);
    fftwh_destroy_plan(plan_h);
}
size         FP32 (ms)   FP16 (ms)
256*256      4.45        3.09
512*512      16.4        12.7
1024*1024    35.7        36.0
2048*2048    180.1       169.1
4096*4096    761.5       861.4
Hi.
Thanks for getting in contact.
Planning time for an FFT call is typically far greater than the execution time. If doing a benchmark it is therefore sensible to time the two parts separately. I don't have any comparison figures to hand on a Kunpeng920, but I'd imagine that planning costs are comparable between the precisions, which may well be what your results show. I'd recommend calling out the two costs independently in your table. The usage model of the FFTW interface is that you plan once and use the resulting plan many times.
For a 1-d case I'd recommend averaging over (many) calls; in 2-d that's less important, but it may be worth a go.
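As a rough illustration of timing the two parts separately, here is a minimal sketch of a library-independent timing harness. The names `bench_fn`, `now_ms`, `time_once_ms`, `time_avg_ms` and `count_call` are hypothetical and not part of FFTW or Arm Performance Libraries; `count_call` is only a stand-in workload, and the harness assumes a POSIX system that provides `clock_gettime`.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical callback type: wraps whatever operation is being timed. */
typedef void (*bench_fn)(void *ctx);

/* Monotonic wall-clock time in milliseconds (assumes POSIX clock_gettime). */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Time a single call, e.g. plan creation, which happens only once. */
static double time_once_ms(bench_fn fn, void *ctx)
{
    double t0 = now_ms();
    fn(ctx);
    return now_ms() - t0;
}

/* Average time per call over `reps` calls, e.g. repeated executions
   of one existing plan. Returns 0.0 if reps is not positive. */
static double time_avg_ms(bench_fn fn, void *ctx, int reps)
{
    if (reps <= 0)
        return 0.0;
    double t0 = now_ms();
    for (int i = 0; i < reps; ++i)
        fn(ctx);
    return (now_ms() - t0) / reps;
}

/* Stand-in workload that only counts its calls; replace it with
   wrappers around the real planning and execution functions. */
static void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

To use it with the code from the original post, one callback would wrap `fftwf_plan_dft_2d` (or `fftwh_plan_dft_2d`) and be passed to `time_once_ms`, and a second callback would wrap `fftwf_execute` (or `fftwh_execute`) on the stored plan and be passed to `time_avg_ms`. The two results can then be reported in separate columns of the table.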
Hope this helps.
Chris
Thanks. The times above only cover fftwh_execute and fftwf_execute, and the result is that FP16 is slower than FP32. The Kunpeng 920 (Armv8.2) supports FP16 instructions, so why does FP16 give no speedup?
Thanks for confirming. We can observe a similar lack of extra performance on other 128-bit Neon platforms. Using these functions on, for example, an A64FX would show the 2x performance difference we would expect.
We've added looking at this to our future work list. Thanks for raising the issue.