NE10-Library -> FIR-Filter cycle counts: C-version faster than NEON-version?

Hi,

I'm currently trying to measure cycle counts for FIR filtering with the NE10 library. My target is a Raspberry Pi 2 (ARM Cortex-A7) running Raspbian.

I enabled the Cortex-A7 performance counter registers so that I can read the cycle count before and after the filter executes.
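To make the measurement method concrete, here is a minimal sketch of reading the cycle counter around a call and averaging over several runs. It is not my exact code. It assumes the ARMv7 cycle count register (PMCCNTR) is already enabled and that user-mode access has been granted from kernel space; without that, the `mrc` instruction traps. The names `read_cycle_counter`, `average_cycles` and `count_call` are made up for illustration, and the non-ARM branch is only a nanosecond-timer fallback so that the sketch builds on other machines.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Read the current cycle count (hypothetical helper name). */
static inline uint32_t read_cycle_counter(void)
{
#if defined(__arm__)
    uint32_t cycles;
    /* PMCCNTR on ARMv7-A: CP15, c9, c13, 0.
       Assumes the counter is enabled and user access is allowed. */
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
#else
    /* Fallback for non-ARM builds: nanoseconds, NOT cycles. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint32_t)((uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec);
#endif
}

typedef void (*workload_fn)(void *ctx);

/* Run fn(ctx) `runs` times and return the mean counter delta per run. */
static uint32_t average_cycles(workload_fn fn, void *ctx, unsigned runs)
{
    assert(fn != NULL && runs > 0);
    uint64_t total = 0;
    for (unsigned i = 0; i < runs; ++i) {
        uint32_t start = read_cycle_counter();
        fn(ctx);
        uint32_t end = read_cycle_counter();
        /* Unsigned subtraction stays correct across one 32-bit wrap. */
        total += (uint32_t)(end - start);
    }
    return (uint32_t)(total / runs);
}

/* Placeholder workload standing in for the filter call; it only counts calls. */
static void count_call(void *ctx)
{
    ++*(unsigned *)ctx;
}
```

In a real test the placeholder would be replaced by a small wrapper that calls the filter function on a prepared instance and buffers. The delta also includes the overhead of the two counter reads and of the indirect call, which is small compared with a few thousand cycles.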

I tested both functions, "ne10_fir_float_neon()" and "ne10_fir_float_c()", and expected the NEON assembly version to be faster than the C version.

To my surprise, I get better results with the plain C version. I tried different block sizes and filter lengths, but in all my tests the C-only version has the smaller cycle count.

For example, with a block size of 128 and 21 filter taps I get these results:

using ne10_fir_float_neon(): average of 10212 cycles, which is ~3.8 cycles per sample per tap

using ne10_fir_float_c():    average of 8436 cycles, which is ~3.1 cycles per sample per tap
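For reference, the per-sample-per-tap figures above are just the average cycle count divided by (block size × number of taps). A small sketch of that arithmetic follows; the function name `cycles_per_sample_per_tap` is only illustrative.

```c
#include <assert.h>

/* Normalize a total cycle count by the number of multiply-accumulate
   steps, i.e. block_size * num_taps. */
static double cycles_per_sample_per_tap(unsigned long cycles,
                                        unsigned block_size,
                                        unsigned num_taps)
{
    assert(block_size > 0 && num_taps > 0);
    return (double)cycles / ((double)block_size * (double)num_taps);
}
```

With 128 samples and 21 taps the divisor is 2688, so 10212 cycles gives about 3.80 and 8436 cycles gives about 3.14. In other words, the NEON version needs roughly 21% more cycles than the C version in this configuration.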

Is there a reason why the NEON version is slower than the C version on the Cortex-A7, and could that be different on another target, say a Cortex-A9?

Or could there be something wrong with my measurements, and the NEON version should always be faster? Or is it only faster for specific block sizes and filter lengths?

Or maybe I missed a step that is needed to activate NEON correctly?

I call "ne10_init()", and "ne10_HasNEON()" returns "NE10_OK", so this should be fine...

Thank you