Hello,
I developed a few signal processing routines (e.g. FIR) using the interleaving NEON load/store intrinsics:
vld2q_f32: load float32x4x2_t
vst2q_f32: store float32x4x2_t
Using those intrinsics is simple and the code is clean.
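
To illustrate the pattern I mean (this is not my actual routine), here is a minimal sketch. The function name scale_interleaved and the per-sample gain standing in for the real calculation are made up for the example, and I assume the number of complex samples is a multiple of 4. The scalar branch is only there so the sketch builds on a target without NEON.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical kernel: scale n complex samples stored interleaved as
   re0, im0, re1, im1, ...  Assumes n is a multiple of 4. */
void scale_interleaved(const float *in, float *out, size_t n, float gain)
{
    assert(n % 4 == 0);
#if defined(__ARM_NEON)
    for (size_t i = 0; i < n; i += 4) {
        /* De-interleaving load: val[0] gets 4 reals, val[1] gets 4 imaginaries. */
        float32x4x2_t v = vld2q_f32(in + 2 * i);
        v.val[0] = vmulq_n_f32(v.val[0], gain);
        v.val[1] = vmulq_n_f32(v.val[1], gain);
        /* Re-interleaving store back to re, im, re, im, ... */
        vst2q_f32(out + 2 * i, v);
    }
#else
    /* Scalar fallback (assumption: only so the sketch compiles without NEON). */
    for (size_t i = 0; i < 2 * n; ++i)
        out[i] = in[i] * gain;
#endif
}
```
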
Then I did the same with:
vld1q_f32: load float32x4_t
vst1q_f32: store float32x4_t
In this case the input was two consecutive float32x4_t vectors: one holding the real parts and one holding the imaginary parts.
Of course, I had to run the calculation twice: once for the real part and once for the imaginary part.
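
Again only as an illustration (not my actual routine), a minimal sketch of this second variant. I assume here that the data is laid out as blocks of 4 reals followed by 4 imaginaries, that the number of complex samples is a multiple of 4, and I use a made-up per-sample gain in place of the real calculation. The name scale_split is hypothetical, and the scalar branch exists only so the sketch builds without NEON.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical kernel: scale n complex samples stored as blocks of
   re0..re3, im0..im3, re4..re7, im4..im7, ...  Assumes n is a multiple of 4. */
void scale_split(const float *in, float *out, size_t n, float gain)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < 2 * n; i += 8) {
#if defined(__ARM_NEON)
        /* Two plain contiguous loads, no de-interleaving. */
        float32x4_t re = vld1q_f32(in + i);
        float32x4_t im = vld1q_f32(in + i + 4);
        re = vmulq_n_f32(re, gain);  /* first run: real parts */
        im = vmulq_n_f32(im, gain);  /* second run: imaginary parts */
        vst1q_f32(out + i, re);
        vst1q_f32(out + i + 4, im);
#else
        /* Scalar fallback (assumption: only so the sketch compiles without NEON). */
        for (size_t k = 0; k < 8; ++k)
            out[i + k] = in[i + k] * gain;
#endif
    }
}
```
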
It seems that the version using plain loads/stores of float32x4_t ran faster, even though I had to run the calculation twice.
Does that make sense?
If relevant, I can share full source code.
Thank you,
Zvika