Using ArmPL 22.0 for FFTs: fftwh (FP16) is slower than fftwf (FP32) on a Kunpeng 920 Arm server, but I expected fftwh to be about 2x faster

yan.wei over 2 years ago

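Code:

/* In-place 2D forward FFT in single precision (FP32). */
static void fftwf_armpl_fp32(fftwf_complex* signal, int row, int col) {
    /* FFTW's planner takes (n0, n1) with n1 varying fastest in memory; the
       benchmark sizes are square, so passing (col, row) does not change these timings. */
    fftwf_plan plan_f = fftwf_plan_dft_2d(col, row, signal, signal, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_execute(plan_f);
    fftwf_destroy_plan(plan_f);
}

/* In-place 2D forward FFT in half precision (FP16). */
static void fftwh_armpl_fp16(fftwh_complex* signal, int row, int col) {
    fftwh_plan plan_h = fftwh_plan_dft_2d(col, row, signal, signal, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwh_execute(plan_h);
    fftwh_destroy_plan(plan_h);
}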

size	FP32(ms)	FP16(ms)
256*256	4.45	3.09
512*512	16.4	12.7
1024*1024	35.7	36.0
2048*2048	180.1	169.1
4096*4096	761.5	861.4
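
To make the gap from the expected 2x concrete, the timings in the table can be turned into FP32/FP16 ratios. The following is a small sketch using only the numbers posted above; the helper name `fp16_speedup` and the array names are made up for illustration.

```c
#include <assert.h>

/* Speedup of FP16 over FP32: values above 1.0 mean FP16 was faster. */
static double fp16_speedup(double t32_ms, double t16_ms) {
    assert(t16_ms > 0.0);
    return t32_ms / t16_ms;
}

/* Timings copied from the table above, in milliseconds,
   for sizes 256, 512, 1024, 2048 and 4096 (square). */
static const double fp32_ms[] = {4.45, 16.4, 35.7, 180.1, 761.5};
static const double fp16_ms[] = {3.09, 12.7, 36.0, 169.1, 861.4};
```

With these numbers the ratio is roughly 1.44 at 256*256 and drops below 1 (roughly 0.88) at 4096*4096, so FP16 is somewhat faster for the small sizes and slower for the largest one, and never close to 2x.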

Top replies

yan.wei over 2 years ago in reply to Chris Goodyer

Thanks. I now time only the fftwh_execute and fftwf_execute calls, and FP16 is still slower than FP32. The Kunpeng 920 (Armv8.2) supports FP16 instructions, so why does FP16 show no speedup?
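
For reference, timing only the execute step could look roughly like the sketch below. It is not the code used for the numbers in this thread: `bench_fn`, `now_ms`, `best_of_ms` and `count_call` are hypothetical names, and the helper assumes a POSIX system with `clock_gettime`. It runs the work once untimed as a warm-up and then reports the fastest of several timed runs, so plan creation and first-touch effects stay out of the measurement.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime under strict -std=c99/c11 */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one call performs the work being timed. */
typedef void (*bench_fn)(void *ctx);

/* Current monotonic time in milliseconds. */
static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Run fn once untimed as a warm-up, then return the fastest of reps timed runs. */
static double best_of_ms(bench_fn fn, void *ctx, int reps) {
    assert(fn != NULL && reps > 0);
    fn(ctx);                          /* warm-up, not timed */
    double best = -1.0;
    for (int i = 0; i < reps; ++i) {
        double t0 = now_ms();
        fn(ctx);
        double dt = now_ms() - t0;
        if (best < 0.0 || dt < best) best = dt;
    }
    return best;
}

/* Stand-in workload for trying the helper without an FFT library: counts calls. */
static void count_call(void *ctx) {
    ++*(int *)ctx;
}
```

To use it for the FFT comparison, the callback would be a small wrapper that calls fftwf_execute (or fftwh_execute) on a plan created beforehand, with a pointer to the plan passed through `ctx`. Note that an in-place transform overwrites its input on every run, so the values in the buffer change between repetitions.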