I am using the ARMv8 GCC compiler and I would like to optimize my NEON intrinsics code for better execution time. I have already tried loop unrolling, and I am using a lookup table for the log10 computation. Any ideas?
Here is the code:
    static inline void func(float32x4x4_t inputData_real, float *outputs)
    {
        float32x4x4_t outputData;
        float32x4x4_t outputData1;

        for (unsigned short i = 0; i < 4; i++) {
            outputData.val[i]  = vmulq_f32(inputData_real.val[i], inputData_real.val[i]);
            outputData1.val[i] = vmlaq_f32(outputData.val[i], inputData_imag.val[i], inputData_imag.val[i]);

            outputs[i] = 10.0F * log10f_c(vaddvq_f32(outputData1.val[i])
                                          + (a_imagvalue[i] * b_imagvalue[i])
                                          + (a_realvalue[i] * b_realvalue[i]))
                         - 7.89865767467723;
        }
    }
Try cache pre-loading.
I have never used cache pre-loading before. I am reading about it, but if it does not take too much of your time, could you please guide me on how I can achieve this in my code?
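On ARMv8 with GCC, cache pre-loading is usually done with software prefetch, e.g. the __builtin_prefetch() builtin, which lowers to a PRFM instruction on AArch64. Below is a minimal sketch of the idea; since your question only shows func() itself, the driver loop, buffer names, and block layout (16 contiguous floats per block) are assumptions made for illustration. The point is to issue the prefetch for the *next* block while the current one is still being computed, so the loads have already been pulled into cache by the time they are needed.

    #include <arm_neon.h>

    /* Hypothetical caller: the real loop is not shown in the question,
     * so names and layout here are assumptions for illustration. */
    static void process_blocks(const float *in, float *out, unsigned nblocks)
    {
        for (unsigned b = 0; b < nblocks; b++) {
            /* Prefetch the next block (16 floats = 64 bytes, one cache line
             * on most ARMv8 cores) for reading, with high temporal locality,
             * so it is already resident when the next iteration loads it. */
            if (b + 1 < nblocks)
                __builtin_prefetch(in + (b + 1) * 16, 0, 3);

            float32x4x4_t block;
            block.val[0] = vld1q_f32(in + b * 16);
            block.val[1] = vld1q_f32(in + b * 16 + 4);
            block.val[2] = vld1q_f32(in + b * 16 + 8);
            block.val[3] = vld1q_f32(in + b * 16 + 12);

            func(block, out + b * 4);
        }
    }

Note that this only helps if the loads are actually missing in cache: the hardware prefetchers on most Cortex-A cores already handle simple sequential streams well, so it is worth profiling with and without the prefetch before keeping it.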