Optimization of Neon Intrinsics on ARM Cortex-A53

khan777 over 6 years ago

I am using the GCC compiler for ARMv8 and would like to optimize my Neon intrinsics code for better execution time. I have already tried loop unrolling, and I use a lookup table to compute log10. Any ideas?

Here is the code:

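#include <arm_neon.h>

/* inputData_imag, a_realvalue, a_imagvalue, b_realvalue, b_imagvalue and
   log10f_c (the lookup-table log10) are not defined in this post;
   presumably they are declared elsewhere. */
static inline void func(float32x4x4_t inputData_real, float *outputs)
{
	float32x4x4_t outputData;
	float32x4x4_t outputData1;

	for (unsigned short i = 0; i < 4; i++)
	{
		/* re * re in each of the four lanes */
		outputData.val[i] = vmulq_f32(inputData_real.val[i], inputData_real.val[i]);
		/* re * re + im * im (multiply-accumulate) */
		outputData1.val[i] = vmlaq_f32(outputData.val[i], inputData_imag.val[i], inputData_imag.val[i]);
		/* sum across the four lanes, add the a*b terms, then apply 10 * log10 and the offset */
		outputs[i] = 10.0F * log10f_c(vaddvq_f32(outputData1.val[i]) + (a_imagvalue[i] * b_imagvalue[i])
				+ (a_realvalue[i] * b_realvalue[i])) - 7.89865767467723;
	}
}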

42Bastian Schick over 6 years ago in reply to khan777

There are "memory hint" instructions, which tell the memory subsystem to preload memory you will access soon. So while you are calculating the first chunk, the next one is already being loaded into the cache.

I do not know the instruction by heart.
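For reference, the AArch64 prefetch hint instruction is PRFM, and GCC exposes prefetching through `__builtin_prefetch(addr, rw, locality)`. Below is a minimal sketch of the "prefetch the next chunk while computing the current one" idea. It uses plain C rather than Neon intrinsics to stay short, and the function name `power_with_prefetch`, the `CHUNK` size, and the 64-byte cache-line figure are assumptions for illustration, not taken from the original code. Whether this helps at all depends on the access pattern; for a simple sequential stream the hardware prefetcher may already do the job, so it is worth measuring.

```c
#include <stddef.h>

/* Assumed chunk size: 16 floats = 64 bytes, i.e. one cache line
   under the assumption of 64-byte lines. */
#define CHUNK 16

/* Hypothetical helper (not from the original post): computes
   re[i]^2 + im[i]^2 for n elements, chunk by chunk, and hints
   that the following chunk will be read soon. */
void power_with_prefetch(const float *re, const float *im, float *out, size_t n)
{
    for (size_t base = 0; base < n; base += CHUNK) {
        /* Only a hint: rw = 0 means "will be read",
           locality = 3 means "keep in all cache levels". */
        if (base + CHUNK < n) {
            __builtin_prefetch(re + base + CHUNK, 0, 3);
            __builtin_prefetch(im + base + CHUNK, 0, 3);
        }

        /* Work on the current chunk while the next one is being fetched. */
        size_t end = (base + CHUNK < n) ? base + CHUNK : n;
        for (size_t i = base; i < end; i++)
            out[i] = re[i] * re[i] + im[i] * im[i];
    }
}
```

In the intrinsics version, the same two `__builtin_prefetch` calls would go at the top of the loop that walks over the input buffers, pointing at the data for the next iteration.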