I am using the ARMv8 GCC compiler and I would like to optimize my NEON intrinsics code for better execution time. I have already tried loop unrolling, and I am using a lookup table for the log10 computation. Any ideas?
Here is the code:
    static inline void func(float32x4x4_t inputData_real, float *outputs)
    {
        float32x4x4_t outputData;
        float32x4x4_t outputData1;

        for (unsigned short i = 0; i < 4; i++) {
            outputData.val[i]  = vmulq_f32(inputData_real.val[i], inputData_real.val[i]);
            outputData1.val[i] = vmlaq_f32(outputData.val[i], inputData_imag.val[i], inputData_imag.val[i]);

            outputs[i] = 10.0F * log10f_c(vaddvq_f32(outputData1.val[i])
                                          + (a_imagvalue[i] * b_imagvalue[i])
                                          + (a_realvalue[i] * b_realvalue[i]))
                         - 7.89865767467723;
        }
    }
Try cache pre-loading.
I have never used cache pre-loading before. I am reading about it, but if it does not take too much of your time, could you please guide me on how I can achieve this in my code?
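On ARMv8 with GCC, cache pre-loading is usually done with software prefetch, e.g. the __builtin_prefetch() builtin, which lowers to a PRFM instruction on AArch64. Below is a minimal sketch of the idea; since your question only shows func() itself, the driver loop, buffer names, and block layout (16 contiguous floats per block) are assumptions made for illustration. The point is to issue the prefetch for the *next* block while the current one is still being computed, so the loads have already been pulled into cache by the time they are needed.

    #include <arm_neon.h>

    /* Hypothetical caller: the real loop is not shown in the question,
     * so names and layout here are assumptions for illustration. */
    static void process_blocks(const float *in, float *out, unsigned nblocks)
    {
        for (unsigned b = 0; b < nblocks; b++) {
            /* Prefetch the next block (16 floats = 64 bytes, one cache line
             * on most ARMv8 cores) for reading, with high temporal locality,
             * so it is already resident when the next iteration loads it. */
            if (b + 1 < nblocks)
                __builtin_prefetch(in + (b + 1) * 16, 0, 3);

            float32x4x4_t block;
            block.val[0] = vld1q_f32(in + b * 16);
            block.val[1] = vld1q_f32(in + b * 16 + 4);
            block.val[2] = vld1q_f32(in + b * 16 + 8);
            block.val[3] = vld1q_f32(in + b * 16 + 12);

            func(block, out + b * 4);
        }
    }

Note that this only helps if the loads are actually missing in cache: the hardware prefetchers on most Cortex-A cores already handle simple sequential streams well, so it is worth profiling with and without the prefetch before keeping it.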