Hi,
I have a dataset that is also used by other algorithms, so its layout cannot be modified.
That is my problem.
So what I am left with is data scattered in memory, but contiguous within each group, and all groups are the same length.
gr1: offset 0 : AA BB CC DD
gr2: offset 256 : EE FF GG HH
gr3: offset 512 : II JJ KK LL
gr4: offset 768 : MM NN OO PP
Keep in mind that, in terms of addresses, EE == (AA + 256), II == (EE + 256), and so on.
And I need:
AA EE II MM
BB FF JJ NN
CC GG KK OO
DD HH LL PP
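
To pin down the result I am after, here is a plain scalar reference in C. It is only a sketch: it assumes one-byte elements, byte offsets, and four groups of four elements, and the names GROUP_STRIDE and gather_transposed_ref are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 4 groups, 256 bytes apart, 4 one-byte elements each. */
enum { GROUP_STRIDE = 256, GROUPS = 4, GROUP_LEN = 4 };

/* Scalar reference: out[e][g] receives element e of group g,
 * so row 0 is AA EE II MM, row 1 is BB FF JJ NN, and so on. */
static void gather_transposed_ref(const uint8_t *base,
                                  uint8_t out[GROUP_LEN][GROUPS])
{
    for (size_t g = 0; g < GROUPS; ++g) {
        const uint8_t *group = base + g * GROUP_STRIDE; /* start of group g */
        for (size_t e = 0; e < GROUP_LEN; ++e)
            out[e][g] = group[e];
    }
}
```

Any vectorised version should produce the same out array as this loop.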
So it is basically a transposition, and the vtrn instruction can do this just fine, BUT it needs to be repeated three times (see the vector arrangement).
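
For reference, this is the kind of multi-step permutation I mean, modelled in portable C rather than with real NEON instructions. It is a sketch under assumptions: 32-bit elements with four per register, a 2x2 element transpose standing in for VTRN.32, and a half-register swap standing in for VSWP. The function names are hypothetical, and the exact instruction count depends on the element size.

```c
#include <stdint.h>

/* Model of VTRN.32 on two 4-lane rows: swaps a[1] with b[0] and
 * a[3] with b[2], i.e. transposes each 2x2 block of elements. */
static void trn32_model(uint32_t a[4], uint32_t b[4])
{
    uint32_t t;
    t = a[1]; a[1] = b[0]; b[0] = t;
    t = a[3]; a[3] = b[2]; b[2] = t;
}

/* Model of a half-register swap: exchanges the upper two lanes of
 * `lo` with the lower two lanes of `hi`. */
static void swap_halves_model(uint32_t lo[4], uint32_t hi[4])
{
    for (int i = 0; i < 2; ++i) {
        uint32_t t = lo[2 + i];
        lo[2 + i] = hi[i];
        hi[i] = t;
    }
}

/* In-place 4x4 transpose built from the two models above. */
static void transpose4x4_model(uint32_t r[4][4])
{
    trn32_model(r[0], r[1]);       /* rows 0 and 1 */
    trn32_model(r[2], r[3]);       /* rows 2 and 3 */
    swap_halves_model(r[0], r[2]); /* exchange off-diagonal 2x2 blocks */
    swap_halves_model(r[1], r[3]);
}
```

Each row here stands for one register already loaded from one group, so the permutation cost comes on top of the four contiguous loads.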
Is there a way to avoid those three instructions when arranging my vectors, or is there any other (faster) way to do it?
Currently I have to load the contiguous data and then permute the vectors. The VLDn instructions don't seem to help me.
Do you see a possibility I missed somewhere?
Thanks