Hi,
I'm currently trying to measure cycle counts for FIR filtering with the NE10 library. My target is a Raspberry Pi 2 (ARM Cortex-A7) running Raspbian.
I enabled the Cortex-A7 performance counter register so that I can read the cycle count before and after the filter execution.
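A minimal sketch of this kind of before/after measurement follows. It is not the exact code used here. It assumes the ARMv7 cycle counter (PMCCNTR) has already been enabled for user-mode reads from kernel side, and the names read_cycle_counter, average_cycles, kernel_fn and count_calls are hypothetical helpers, not part of NE10. On non-ARM builds the sketch falls back to clock(), which returns clock ticks rather than CPU cycles.

```c
#include <stdint.h>
#include <time.h>

/* Read the 32-bit cycle counter. On 32-bit ARM this reads PMCCNTR via CP15;
 * that only works if user-mode access was enabled beforehand (assumption).
 * Elsewhere it falls back to clock(), which is NOT a cycle count. */
static uint32_t read_cycle_counter(void)
{
#if defined(__arm__)
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
#else
    return (uint32_t)clock();
#endif
}

/* Hypothetical callback type wrapping the code under test,
 * e.g. one call to the FIR routine on one block. */
typedef void (*kernel_fn)(void *ctx);

/* Stand-in kernel that only counts how often it was called. */
static void count_calls(void *ctx)
{
    ++*(unsigned *)ctx;
}

/* Average counter delta over `runs` timed calls. One untimed warm-up call
 * comes first so caches are filled before the timed runs start. */
static uint32_t average_cycles(kernel_fn fn, void *ctx, unsigned runs)
{
    uint64_t total = 0;

    if (runs == 0)
        return 0;

    fn(ctx); /* warm-up, not timed */

    for (unsigned i = 0; i < runs; ++i) {
        uint32_t start = read_cycle_counter();
        fn(ctx);
        uint32_t end = read_cycle_counter();
        /* Unsigned subtraction stays correct across one counter wrap. */
        total += (uint32_t)(end - start);
    }
    return (uint32_t)(total / runs);
}
```

In a real measurement the callback would wrap the filter call instead of count_calls. The warm-up call is a design choice: without it, the first timed run includes cache misses that can skew a short average.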
I tested both functions, "ne10_fir_float_neon()" and "ne10_fir_float_c()", and expected the NEON assembly version to be faster than the C version.
To my surprise, I seem to get better results with the plain C version. I tried different blocksizes and filter lengths, but in all my tests the C-only version has the smaller cycle count.
For example, with a blocksize of 128 and 21 filter taps I get these results:
using ne10_fir_float_neon(): average of 10212 cycles, which is ~3.8 cycles per sample per tap
using ne10_fir_float_c(): average of 8436 cycles, which is ~3.1 cycles per sample per tap
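The per-sample, per-tap figures are simply the average cycle count divided by blocksize times number of taps (128 × 21 = 2688 multiply-accumulates per call). A minimal sketch of that normalization, with a hypothetical helper name that is not part of NE10:

```c
#include <stdint.h>

/* Normalize a raw cycle count by the work done in one FIR call:
 * block_size output samples, each needing num_taps multiply-accumulates. */
static double cycles_per_sample_per_tap(uint64_t cycles,
                                        uint32_t block_size,
                                        uint32_t num_taps)
{
    return (double)cycles / ((double)block_size * (double)num_taps);
}
```

With the numbers above, 10212 / 2688 is roughly 3.80 and 8436 / 2688 is roughly 3.14.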
Is there a reason why the NEON version is slower than the C version on the Cortex-A7, and could that be different on another target, say a Cortex-A9?
Or could there be something wrong with my measurements, and the NEON version should always be faster? Or is it only faster for specific blocksizes and filter lengths?
Or maybe I did something wrong and NEON has to be activated explicitly?
I called "ne10_init()", and "ne10_HasNEON()" returns "NE10_OK", so this should be fine...
Thank you