I am using the ARMv8 GCC compiler and I would like to optimize NEON intrinsics code for better execution-time performance. I have already tried loop unrolling, and I am using a lookup table for the log10 computation. Any ideas?
Here is the code:
```c
/* inputData_imag, a_imagvalue, b_imagvalue, a_realvalue, b_realvalue and
 * log10f_c are defined elsewhere in the poster's code (not shown). */
static inline void func(float32x4x4_t inputData_real, float *outputs)
{
    float32x4x4_t outputData;
    float32x4x4_t outputData1;

    for (unsigned short i = 0; i < 4; i++) {
        outputData.val[i]  = vmulq_f32(inputData_real.val[i], inputData_real.val[i]);
        outputData1.val[i] = vmlaq_f32(outputData.val[i], inputData_imag.val[i], inputData_imag.val[i]);
        outputs[i] = 10.0F * log10f_c(vaddvq_f32(outputData1.val[i])
                                      + (a_imagvalue[i] * b_imagvalue[i])
                                      + (a_realvalue[i] * b_realvalue[i]))
                     - 7.89865767467723;
    }
}
```
Did you check this? developer.arm.com/.../compiling-for-neon-with-auto-vectorization
Hi! Thank you so much for your input. I used the -ftree-vectorize compiler option and I read that document. Could you please comment on whether I am doing this right?
I cannot tell, sorry. Just check the resulting output.
I checked the timing of my code yesterday and it improved from 2.18 ms to 2.11 ms, so I guess I am doing it right.
Good. A memory hint might improve it further. (By the way, the "F" suffix is missing on the 7.89... constant, so it is a double literal and forces the subtraction into double precision.)
One more note: using short for loop variables is not a good idea. In general, if no specific size is needed, it is better to use int.
Thanks for your suggestion and correction. Could you please elaborate more on memory hint?
There are "memory hint" instructions, which tell the memory subsystem to preload memory you will access soon. So while the first chunk is being processed, the next one is already being loaded into the cache.
I do not know the instruction by heart.