I am trying to understand how cascaded biquad filtering is optimized for Arm processors in CMSIS using the Neon extensions. The code is guarded by `#if defined(ARM_MATH_NEON)` here: https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/DSP/Source/FilteringFunctions/arm_biquad_cascade_df2T_f32.c.
Documentation: arm-software.github.io/.../group__BiquadCascadeDF2T.html
The NEON intrinsics are used when there are more than 4 biquads cascaded. I am puzzled: how can any kind of parallel instruction execution be possible if the output of one biquad is fed as input to the next one? Could anyone explain what is done in parallel in that piece of code?
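
To make the dependency I am asking about concrete, here is a minimal scalar sketch of a transposed direct form II cascade as I understand it. The function name `cascade_df2t_scalar` and the flat array layout are my own assumptions for illustration, not the CMSIS API; the per-stage equations follow the difference equations given in the CMSIS documentation, where `a1` and `a2` are stored with the sign that lets them be added.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only, not part of CMSIS.
 * coeffs: 5 floats per stage, ordered {b0, b1, b2, a1, a2}
 *         (a1/a2 stored so that they are added, as in the CMSIS docs).
 * state:  2 floats per stage, ordered {d1, d2}, zeroed before first use. */
void cascade_df2t_scalar(const float *coeffs, float *state, size_t num_stages,
                         const float *src, float *dst, size_t block_size)
{
    for (size_t n = 0; n < block_size; n++) {
        float x = src[n];

        for (size_t s = 0; s < num_stages; s++) {
            const float *c = &coeffs[5 * s];
            float *d = &state[2 * s];

            /* Transposed direct form II update for one stage. */
            float y = c[0] * x + d[0];
            d[0] = c[1] * x + c[3] * y + d[1];
            d[1] = c[2] * x + c[4] * y;

            /* Serial dependency: this stage's output is the next stage's input. */
            x = y;
        }

        dst[n] = x;
    }
}
```

Within one sample, the inner loop over stages looks strictly sequential because of the `x = y` step, which is exactly why I do not see where the vector lanes could come from.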