I am using the ARMv8 GCC compiler and I would like to optimize this Neon intrinsics code for better execution-time performance. I have already tried loop unrolling, and I am using a look-up table for the log10 computation. Any ideas?
Here is the code:
static inline void func(float32x4x4_t inputData_real, float *outputs)
{
    float32x4x4_t outputData;
    float32x4x4_t outputData1;

    for (unsigned short i = 0; i < 4; i++) {
        outputData.val[i]  = vmulq_f32(inputData_real.val[i], inputData_real.val[i]);
        outputData1.val[i] = vmlaq_f32(outputData.val[i],
                                       inputData_imag.val[i], inputData_imag.val[i]);
        outputs[i] = 10.0F * log10f_c(vaddvq_f32(outputData1.val[i])
                                      + (a_imagvalue[i] * b_imagvalue[i])
                                      + (a_realvalue[i] * b_realvalue[i]))
                     - 7.89865767467723;
    }
}
Yes, I am measuring the run time of my code like this:
#define TIMEDIFF(t1, t2)  ((t2) - (t1))
#define MILLISECONDS(t)   (1000.0 * (t) / COUNTS_PER_SECOND)

// Start a test
void startTest() {
    XTime_GetTime(&start);
}

void endTest() {
    XTime_GetTime(&end);
    double time_curr = TIMEDIFF(start, end);
    double msec = MILLISECONDS(time_curr);
    printf("Run-time = %.2f msec...\n", msec);
    // Achieved Bandwidth = (total bytes transferred) / (msec)
    // Average Latency = (msec) / (total memory accesses)
}

......

startTest();
for (i = 0; i < 4500; i++) {
    // Func call
}
endTest();
If you run the test with a warm cache, you get misleading results. Depending on the size of your test data, it may end up in the caches, and from then on you are measuring only cache bandwidth plus computation time. If that's what you want, fine. If not, you need to invalidate the caches prior to each call to the function under test.
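When no usable cache-maintenance instructions are available from your test code (on ARMv8, user-space `DC CIVAC` may be disabled unless the kernel sets `SCTLR_EL1.UCI`), a portable way to approximate a cold cache is to walk a buffer larger than the last-level cache between timed calls, evicting the test data. A sketch under stated assumptions (the 8 MiB size and 64-byte line size are assumptions; substitute your SoC's actual cache geometry):

```c
#include <stddef.h>
#include <stdint.h>

#define TRASH_BYTES (8u * 1024u * 1024u)   /* assumed larger than the LLC */
#define LINE_BYTES  64u                    /* assumed cache-line size     */

static uint8_t trash[TRASH_BYTES];

/* Touch one byte per cache line of a large buffer so previously cached
 * test data is very likely evicted before the next timed measurement. */
static unsigned trash_caches(void)
{
    unsigned acc = 0;
    for (size_t i = 0; i < TRASH_BYTES; i += LINE_BYTES) {
        trash[i] += 1;     /* write forces each line into the cache */
        acc += trash[i];
    }
    return acc;            /* returned so the loop can't be optimized away */
}
```

Call `trash_caches()` between `endTest()` and the next `startTest()`; it is only a statistical eviction (set-associative caches give no hard guarantee), but it is usually enough to expose the cold-cache cost of the function under test.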