Hi team,
In an educational setting, I tested a 128-tap FIR filter on a random input signal of 1000 samples. Both the coefficients and the samples are 32-bit integers.
int32_t Sb[128];   // FIR coefficients
int32_t x[1000];   // input signal

// ---- 1000 discrete convolution OP's
int32_t test_fir_int() {
  int64_t y = 0;
  for (int n = 0; n < 1000; n++) {
    for (int i = 0; i < 128; i++) {
      if (n >= i) y += (int64_t)x[n-i] * Sb[i];  // use of SMLAL: Compiler's choice
    }
  }
  return y >> 20;
}
...
This code was built and run on Arduino 1.8.19 IDE on 2 different boards:
My observations:
I would have expected a factor of around 8 from the M7/M3 clock-speed ratio and another factor of 2 from the M7's dual-issue capability, for a total factor of about 16.
My questions:
Thanks a lot.
Best regards,
Wolfgang
Using the DWT_CYCCNT cycle counter, I measured 3698352 cycles on the M3 and 756000 cycles on the M7, each for one run of test_fir_int(). That is a cycle-count ratio of about 4.9 in favor of the M7. The reason for this per-cycle advantage of the M7 is not quite clear to me.
It is difficult to give a single reason. The internal microarchitectures of these processors are very different (a 3-stage pipeline on the Cortex-M3 vs. a 6-stage pipeline on the Cortex-M7).
Does the code have many load/stores? These memory accesses would stall the Cortex-M3 pipeline more frequently.
There are 2 LDR operations within the inner loop, right before the SMLAL MAC operation. I cannot say whether that counts as many. From your answers I understand that my original question about a main reason for the M7 performance increase has no simple answer. I think a simulator tool would be helpful for getting more detailed information about the overall performance to expect.
Thank you very much.