Cortex-A53: R/W data using interleaved load/store intrinsics

Hello,

I developed a few signal processing routines (e.g. FIR) using the interleaved load/store intrinsics, such as:

vld2q_f32: load float32x4x2_t

vst2q_f32: store float32x4x2_t
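
To make the access pattern concrete, here is a minimal sketch of how this pair of intrinsics is typically used. It is not my FIR code: the function name `scale_interleaved`, the real-valued gain, and the assumption that the sample count is a multiple of 4 are all made up for illustration. The scalar branch is only there so the sketch also builds on a non-NEON host.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the original routines): scale n complex
 * samples stored interleaved as re0, im0, re1, im1, ... by a real gain.
 * Assumption: n is a multiple of 4. */
void scale_interleaved(float *buf, size_t n, float gain)
{
    assert(n % 4 == 0);
#if defined(__ARM_NEON)
    for (size_t i = 0; i < n; i += 4) {
        /* vld2q_f32 reads 8 floats and de-interleaves them:
         * val[0] holds 4 real parts, val[1] holds 4 imaginary parts. */
        float32x4x2_t v = vld2q_f32(buf + 2 * i);
        v.val[0] = vmulq_n_f32(v.val[0], gain);
        v.val[1] = vmulq_n_f32(v.val[1], gain);
        /* vst2q_f32 re-interleaves the two registers when storing. */
        vst2q_f32(buf + 2 * i, v);
    }
#else
    /* Scalar fallback for hosts without NEON. */
    for (size_t i = 0; i < 2 * n; ++i)
        buf[i] *= gain;
#endif
}
```

The point of the sketch is that each load/store also performs the de-interleave/re-interleave step, so the arithmetic in between sees separate real and imaginary registers.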

Using those intrinsics is simple, and the code is clean.

Then I did the same with:

vld1q_f32: load float32x4_t

vst1q_f32: store float32x4_t

In this case the input was two consecutive float32x4_t vectors: one holding the real parts and one holding the imaginary parts.

Of course, I had to run the calculation twice: once for the real part and once for the imaginary part.
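
As a rough sketch of this second variant (again not my FIR code; `scale_planar`, `scale_complex_planar`, the real-valued gain, and the multiple-of-4 length are assumptions for illustration), the same kernel is simply called once per component. The scalar branch only exists so the sketch builds on a non-NEON host.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: scale one contiguous array of n floats by a gain.
 * Assumption: n is a multiple of 4. */
void scale_planar(float *x, size_t n, float gain)
{
    assert(n % 4 == 0);
#if defined(__ARM_NEON)
    for (size_t i = 0; i < n; i += 4) {
        /* vld1q_f32 loads 4 consecutive floats with no de-interleaving. */
        float32x4_t v = vld1q_f32(x + i);
        v = vmulq_n_f32(v, gain);
        /* vst1q_f32 stores the 4 floats back in the same order. */
        vst1q_f32(x + i, v);
    }
#else
    /* Scalar fallback for hosts without NEON. */
    for (size_t i = 0; i < n; ++i)
        x[i] *= gain;
#endif
}

/* The calculation runs twice: once over the real parts and once over the
 * imaginary parts, which are assumed to live in separate arrays. */
void scale_complex_planar(float *re, float *im, size_t n, float gain)
{
    scale_planar(re, n, gain);
    scale_planar(im, n, gain);
}
```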

It seems that the version using float32x4_t loads and stores ran faster, even though I had to run the calculation twice.

Does this make sense?

If relevant, I can share the full source code.

Thank you,

Zvika