I am using the ARMv8 GCC compiler and I would like to optimize this NEON intrinsics code for better execution-time performance. I have already tried loop unrolling, and I am using a look-up table for the computation of log10. Any ideas?
Here is the code:
static inline void func(float32x4x4_t inputData_real, float32x4x4_t inputData_imag, float *outputs)
{
    float32x4x4_t outputData;
    float32x4x4_t outputData1;

    for (unsigned short i = 0; i < 4; i++) {
        outputData.val[i]  = vmulq_f32(inputData_real.val[i], inputData_real.val[i]);
        outputData1.val[i] = vmlaq_f32(outputData.val[i], inputData_imag.val[i], inputData_imag.val[i]);
        outputs[i] = 10.0F * log10f_c(vaddvq_f32(outputData1.val[i])
                                      + (a_imagvalue[i] * b_imagvalue[i])
                                      + (a_realvalue[i] * b_realvalue[i]))
                     - 7.89865767467723;
    }
}
Try cache pre-loading.
Did you check this? developer.arm.com/.../compiling-for-neon-with-auto-vectorization
Hi! Thank you so much for your input. I used the -ftree-vectorize compiler option and I read this document. Could you please comment on whether I am doing this right?
I cannot tell, sorry. Just check the resulting output.
I have never used cache pre-loading. I am reading about it, but if it does not take much of your time, could you please guide me on how I can achieve this in my code?
I checked the timing of my code yesterday and it improved from 2.18 ms to 2.11 ms, so I guess I am doing it right.
Good. A memory hint might further improve it. (BTW the "F" is missing on the 7.8....).
One more note: using short for loop variables is not a good idea. In general, if no specific size is needed, it is better to use int.
Hi @khan77,
Measuring sub-millisecond durations can be tricky, depending on the timer used.
You might want to measure several times in a row to be sure, and/or use a cycle counter.
Thanks for your suggestion and correction. Could you please elaborate more on memory hint?
Yeah, I am measuring time of my code like this :
#define TIMEDIFF(t1,t2)  (t2 - t1)
#define MILLISECONDS(t)  (1000.0 * t / COUNTS_PER_SECOND)

// Start a test
void startTest() {
    XTime_GetTime(&start);
}

void endTest() {
    XTime_GetTime(&end);
    double time_curr = TIMEDIFF(start, end);
    double msec = MILLISECONDS(time_curr);
    printf("Run-time = %.2f msec...\n", msec);
    // Achieved Bandwidth = (total bytes transferred) / (msec)
    // Average Latency    = (msec) / (total memory accesses)
}

......

startTest();
for (i = 0; i < 4500; i++) {
    // Func call
}
endTest();
There are "memory hint" instruction, which tell the memory subsystem to preload some memory you will access soon. So while calculating the first chunk, the next will be loaded into cache.
I do not know the instruction by heart.
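For reference, a common way to issue such a hint from C without writing assembly is GCC's __builtin_prefetch, which on ARMv8 maps to the PRFM hint instruction. A minimal sketch (the function and the chunk size of 16 floats are assumptions for illustration):

```c
#include <stddef.h>

/* While processing the current 16-float chunk, hint the memory
   subsystem to fetch the next chunk into cache.
   __builtin_prefetch(addr, rw, locality): rw=0 means "for reading",
   locality=3 means "keep in all cache levels". */
float sum_with_prefetch(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i += 16) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 3);
        for (size_t j = i; j < i + 16 && j < n; j++)
            sum += data[j];
    }
    return sum;
}
```

Note that prefetch hints only help when the data is not already in cache and there is enough work per chunk to hide the memory latency; for small, cache-resident data they can even cost a little.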
If you run the test with a warm cache, you get misleading results. Depending on the size of your test data, it might end up in the caches, and from then on you measure only the cache bandwidth plus calculation time. If that's what you want, fine. If not, you need to invalidate the cache prior to each call to the function under test.
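Cache-maintenance instructions are typically privileged, but a simple user-space approximation of "invalidating" the cache is to stream through a buffer larger than the last-level cache, evicting the test data. A hedged sketch; the 8 MiB size and 64-byte line stride are assumptions, so pick values larger than your platform's actual last-level cache and matching its line size:

```c
#include <stdlib.h>
#include <string.h>

#define EVICT_BYTES (8u * 1024u * 1024u)  /* assumed > last-level cache size */

/* Touch one byte per (assumed 64-byte) cache line of a large buffer so
   that previously cached test data is evicted. Returns the number of
   lines touched, mainly so the loop cannot be optimized away entirely. */
size_t flush_caches_by_eviction(void)
{
    static unsigned char *evict = NULL;
    if (!evict) {
        evict = malloc(EVICT_BYTES);
        memset(evict, 1, EVICT_BYTES);
    }
    volatile unsigned char sink = 0;
    size_t lines = 0;
    for (size_t i = 0; i < EVICT_BYTES; i += 64) {
        sink += evict[i];
        lines++;
    }
    (void)sink;
    return lines;
}
```

Call this between startTest() iterations when you want each run to start from a cold cache.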