
Cortex-A53: R/W data using interleaved load/store intrinsics

Hello,

I developed a few signal processing routines (e.g. FIR) using interleaved load/store intrinsics such as:

vld2q_f32: load a float32x4x2_t

vst2q_f32: store a float32x4x2_t

Using these intrinsics is simple and the code is clean.

Then I did the same with:

vld1q_f32: load a float32x4_t

vst1q_f32: store a float32x4_t

In this case the input was two consecutive float32x4 vectors: one holding the real parts and one holding the imaginary parts.

Of course, I had to run the calculation twice: once for the real parts and once for the imaginary parts.

It seems that using load/store of float32x4_t ran faster, even though I had to run the loop twice.

Does that make sense?

If relevant, I can share full source code. 

Thank you,

Zvika