Hello,
I developed a few signal-processing routines (e.g. FIR) using the interleaving NEON intrinsics:
vld2q_f32: load float32x4x2_t
vst2q_f32: store float32x4x2_t
Using those intrinsics is simple, and the code is clean.
Then I did the same with:
vld1q_f32: load float32x4_t
vst1q_f32: store float32x4_t
In this case the input was two consecutive float32x4_t vectors, one real and one imaginary.
Of course, I had to run the calculation twice: once for the real part and once for the imaginary part.
The float32x4_t load/store version seemed to run faster, even though I had to run the calculation twice.
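For illustration, the two access patterns can be sketched roughly as below. This is a simplified stand-in, not the actual FIR code: it only scales complex samples by a real gain. The function names (scale_interleaved, scale_planar, scale_plane) are hypothetical, the planar variant assumes the real and imaginary parts are stored in two separate arrays, and the sample count is assumed to be a multiple of 4. A scalar fallback is included so the sketch also builds on non-Arm targets.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Interleaved layout: re0, im0, re1, im1, ... (n complex samples). */
void scale_interleaved(float *iq, size_t n, float gain)
{
    assert(n % 4 == 0); /* assumption: no tail handling in this sketch */
#if defined(__ARM_NEON)
    for (size_t i = 0; i < n; i += 4) {
        /* vld2q_f32 de-interleaves: val[0] = 4 real, val[1] = 4 imaginary */
        float32x4x2_t v = vld2q_f32(iq + 2 * i);
        v.val[0] = vmulq_n_f32(v.val[0], gain);
        v.val[1] = vmulq_n_f32(v.val[1], gain);
        /* vst2q_f32 re-interleaves on the way back to memory */
        vst2q_f32(iq + 2 * i, v);
    }
#else
    for (size_t i = 0; i < 2 * n; ++i)
        iq[i] *= gain;
#endif
}

/* One pass over a single plane (all real or all imaginary values). */
static void scale_plane(float *x, size_t n, float gain)
{
    assert(n % 4 == 0);
#if defined(__ARM_NEON)
    for (size_t i = 0; i < n; i += 4)
        vst1q_f32(x + i, vmulq_n_f32(vld1q_f32(x + i), gain));
#else
    for (size_t i = 0; i < n; ++i)
        x[i] *= gain;
#endif
}

/* Planar layout: the same calculation is run twice, once per plane. */
void scale_planar(float *re, float *im, size_t n, float gain)
{
    scale_plane(re, n, gain);
    scale_plane(im, n, gain);
}
```

Both variants do the same arithmetic; the only difference is whether the loads and stores de-interleave and re-interleave the data (vld2q_f32/vst2q_f32) or move it as-is (vld1q_f32/vst1q_f32).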
Does this make sense?
If relevant, I can share full source code.
Thank you,
Zvika