Hello,
I developed a few signal processing routines (e.g. FIR filters) using interleaved intrinsics like:
vld2q_f32: load float32x4x2_t
vst2q_f32: store float32x4x2_t
Using those intrinsics is simple and the code is clean.
Then I did the same with:
vld1q_f32: load float32x4_t
vst1q_f32: store float32x4_t
In this case the input was two consecutive vectors of float32x4_t, one of real parts and one of imaginary parts.
Of course I had to run the calculation twice: once for the real part and once for the imaginary part.
It seems that the version using float32x4_t loads/stores ran faster, even though I had to run it twice.
Does that make sense?
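To make the comparison concrete, here is a minimal sketch of the two approaches I mean (my own illustration, not my actual FIR code; the function names and the complex-scaling operation are just placeholders):

```c
#include <arm_neon.h>

/* Interleaved layout: data[] = {re0, im0, re1, im1, ...}.
   vld2q_f32 loads 8 floats and de-interleaves them, so
   v.val[0] holds the reals and v.val[1] the imaginaries. */
void scale_interleaved(float *data, float k, int n)
{
    for (int i = 0; i < 2 * n; i += 8) {
        float32x4x2_t v = vld2q_f32(data + i);  /* de-interleaving load */
        v.val[0] = vmulq_n_f32(v.val[0], k);    /* scale reals */
        v.val[1] = vmulq_n_f32(v.val[1], k);    /* scale imaginaries */
        vst2q_f32(data + i, v);                 /* re-interleaving store */
    }
}

/* Planar layout: reals and imaginaries in separate arrays.
   vld1q_f32/vst1q_f32 are plain contiguous loads/stores, so the
   same work becomes two simple passes per iteration. */
void scale_planar(float *re, float *im, float k, int n)
{
    for (int i = 0; i < n; i += 4) {
        vst1q_f32(re + i, vmulq_n_f32(vld1q_f32(re + i), k));
        vst1q_f32(im + i, vmulq_n_f32(vld1q_f32(im + i), k));
    }
}
```

My understanding (please correct me if this is wrong) is that LD2/ST2 pay for the de-interleave/re-interleave permutation, so on many cores they have lower throughput than plain LD1/ST1, which could explain why the "run it twice" planar version measured faster.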
If relevant, I can share full source code.
Thank you,
Zvika