M7 vs. M3 FIR filter Performance

Hi team,

In an educational setting, I tested a 128-tap FIR filter on a random input signal of 1000 samples. Coefficients and samples are both 32-bit integers.

int32_t Sb[128];    // FIR coefficients
int32_t x[1000];    // input signal

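// ---- 1000 output samples of a 128-tap discrete convolution
// y accumulates over all n, so the return value is the scaled sum of all outputs
int32_t test_fir_int() {
  int64_t y = 0;
  for (int n = 0; n < 1000; n++) {
    for (int i = 0; i < 128; i++) {
      if (n >= i)                      // skip taps that would read before x[0]
        y += (int64_t)x[n-i] * Sb[i];  // use of SMLAL: compiler's choice
    }
  }
  return (int32_t)(y >> 20);
}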
...

This code was built with the Arduino IDE 1.8.19 and run on two different boards:

  • Arduino Due (Atmel SAM3X, Cortex-M3, 84 MHz)
  • Teensy 4.1 (NXP i.MX RT1060, Cortex-M7, 600 MHz)

My observations:

  • The assembly code of test_fir_int() is almost identical for the M3 and M7 builds (tool: arm-none-eabi-objdump.exe).
  • The SMLAL instruction (32x32-bit multiply with 64-bit accumulate) is emitted in both the M3 and M7 assembly (default gcc settings).
  • On the M7 board, the runtime of test_fir_int() is 1260 us.
  • On the M3 board, the runtime of test_fir_int() is 44028 us.
  • Thus the M7 runs test_fir_int() faster than the M3 by a factor of about 35.

I would have expected a factor of about 7 from the clock-speed ratio (600 MHz / 84 MHz ≈ 7.1) and another factor of 2 from the M7's dual-issue capability, for a total of about 14.
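To make the arithmetic explicit, here is a small C sketch that recomputes the ratios from the figures quoted above and also converts the runtimes into core cycles per inner-loop iteration. The function and macro names are made up for illustration, and the dual-issue gain of 2 is only the assumption from the estimate above, not a measured value.

```c
// Figures quoted in the post.
#define M7_CLOCK_MHZ     600.0
#define M3_CLOCK_MHZ      84.0
#define M7_RUNTIME_US   1260.0
#define M3_RUNTIME_US  44028.0
#define SAMPLES 1000   // outer loop count of test_fir_int()
#define TAPS     128   // inner loop count of test_fir_int()

// Speed-up expected from the clock ratio times an assumed issue-width gain.
double expected_speedup(double issue_gain) {
  return (M7_CLOCK_MHZ / M3_CLOCK_MHZ) * issue_gain;
}

// Speed-up actually measured: M3 runtime divided by M7 runtime.
double observed_speedup(void) {
  return M3_RUNTIME_US / M7_RUNTIME_US;
}

// Core cycles per inner-loop iteration.
// runtime [us] * clock [MHz] gives the total number of cycles.
double cycles_per_iteration(double runtime_us, double clock_mhz) {
  return runtime_us * clock_mhz / ((double)SAMPLES * TAPS);
}
```

With an issue gain of 2, expected_speedup() gives about 14.3 against an observed speed-up of about 34.9. Per inner-loop iteration this corresponds to roughly 5.9 cycles on the M7 and roughly 28.9 cycles on the M3, so the remaining gap shows up as a difference in cycles per iteration and not only in clock speed.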

My questions:

  • Which M7 features mainly explain why the observed M7/M3 performance ratio is better than expected by a factor of about 2?
    (caches, pipeline, branch prediction, ...?)
  • Is it possible to study the individual effects of the M7 enhancements by simple means, such as compiler options?
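One software-only experiment that might help separate the effects, offered as a sketch and not as something measured here, is to hoist the n >= i guard out of the inner loop so that the loop body contains only the multiply-accumulate. Timing both variants on each board would indicate how much the guard costs on each core. The names fir_guarded and test_fir_int_noguard and the SAMPLES/TAPS macros are hypothetical. The original loop is repeated only so the two versions can be compared side by side.

```c
#include <stdint.h>

#define TAPS    128
#define SAMPLES 1000

int32_t Sb[TAPS];     // FIR coefficients (as in the post)
int32_t x[SAMPLES];   // input signal (as in the post)

// Original loop: the guard is evaluated in every inner iteration.
int32_t fir_guarded(void) {
  int64_t y = 0;
  for (int n = 0; n < SAMPLES; n++) {
    for (int i = 0; i < TAPS; i++) {
      if (n >= i)
        y += (int64_t)x[n - i] * Sb[i];
    }
  }
  return (int32_t)(y >> 20);
}

// Variant: the guard is folded into the inner loop bound,
// so the inner loop contains no conditional besides the loop test.
int32_t test_fir_int_noguard(void) {
  int64_t y = 0;
  for (int n = 0; n < SAMPLES; n++) {
    // Highest tap index for which n - i is still >= 0.
    int last = (n < TAPS - 1) ? n : TAPS - 1;
    for (int i = 0; i <= last; i++)
      y += (int64_t)x[n - i] * Sb[i];
  }
  return (int32_t)(y >> 20);
}
```

Both functions accumulate the same products in the same order, so they should return the same value for any input that does not overflow the 64-bit accumulator. Only the runtime is expected to differ.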

Thanks a lot.

Best regards,

Wolfgang