Vectors optimization

Hi,

 

I have a dataset that is also used by other algorithms, so its layout cannot be modified.

That is my problem.

So this is what I have to work with: the data is scattered in memory, but each group is contiguous and all groups have the same length.

 

gr1: offset 0   : AA BB CC DD

gr2: offset 256 : EE FF GG HH

gr3: offset 512 : II JJ KK LL

gr4: offset 768 : MM NN OO PP

Keep in mind that this is about addresses: the address of EE == (address of AA + 256), the address of II == (address of EE + 256), and so on.

 

And I need:

AA EE II MM

BB FF JJ NN

CC GG KK OO

DD HH LL PP
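To pin down the required result, here is a minimal scalar sketch of the gather described above. It is only a reference for the layout, not an optimized routine. The element type (uint16_t), the stride expressed in elements, and the name gather_transposed are assumptions made for illustration; the post does not state the element size or the unit of the offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { GROUPS = 4, GROUP_LEN = 4 };

/*
 * Reference (scalar) version of the wanted arrangement.
 * base         : address of the first element of group 1 (AA)
 * stride_elems : distance between the starts of two groups, in elements
 *                (assumption: 128 uint16_t elements if the 256 offset is in bytes)
 * out[r][g]    : element r of group g, so row 0 is AA EE II MM
 */
static void gather_transposed(const uint16_t *base, size_t stride_elems,
                              uint16_t out[GROUP_LEN][GROUPS])
{
    assert(stride_elems >= GROUP_LEN); /* groups must not overlap */

    for (size_t g = 0; g < GROUPS; ++g) {
        for (size_t r = 0; r < GROUP_LEN; ++r) {
            out[r][g] = base[g * stride_elems + r];
        }
    }
}
```

Any vectorized version can be checked against this function on the same buffer.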

 

So it is basically a transposition. The vtrn instruction can do this just fine, but it has to be issued three times (see the vector arrangement).
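The load-then-permute approach can be sketched with NEON intrinsics roughly as follows. This assumes 16-bit elements held in 64-bit D registers, a little-endian target, and a stride given in elements; none of these is confirmed by the post, and the function name transpose4x4_u16 is made up for the example. At the intrinsics level there are four vtrn calls on D registers; in assembly the two 32-bit ones can be a single vtrn.32 on Q registers, which gives the three instructions mentioned. A scalar fallback is included so the sketch also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/*
 * Loads four groups of four uint16_t elements (assumed type) and writes the
 * transposed 4x4 block to out, row by row: out[0..3] = AA EE II MM, etc.
 * stride_elems is the distance between group starts, in elements.
 */
static void transpose4x4_u16(const uint16_t *base, size_t stride_elems,
                             uint16_t out[16])
{
    assert(stride_elems >= 4); /* groups must not overlap */

#if HAVE_NEON
    /* One contiguous load per group. */
    uint16x4_t g0 = vld1_u16(base + 0 * stride_elems); /* AA BB CC DD */
    uint16x4_t g1 = vld1_u16(base + 1 * stride_elems); /* EE FF GG HH */
    uint16x4_t g2 = vld1_u16(base + 2 * stride_elems); /* II JJ KK LL */
    uint16x4_t g3 = vld1_u16(base + 3 * stride_elems); /* MM NN OO PP */

    /* Step 1: swap 16-bit lanes between neighbouring groups. */
    uint16x4x2_t t01 = vtrn_u16(g0, g1); /* AA EE CC GG / BB FF DD HH */
    uint16x4x2_t t23 = vtrn_u16(g2, g3); /* II MM KK OO / JJ NN LL PP */

    /* Step 2: swap 32-bit halves between the two pairs. */
    uint32x2x2_t c02 = vtrn_u32(vreinterpret_u32_u16(t01.val[0]),
                                vreinterpret_u32_u16(t23.val[0]));
    uint32x2x2_t c13 = vtrn_u32(vreinterpret_u32_u16(t01.val[1]),
                                vreinterpret_u32_u16(t23.val[1]));

    vst1_u16(out + 0,  vreinterpret_u16_u32(c02.val[0])); /* AA EE II MM */
    vst1_u16(out + 4,  vreinterpret_u16_u32(c13.val[0])); /* BB FF JJ NN */
    vst1_u16(out + 8,  vreinterpret_u16_u32(c02.val[1])); /* CC GG KK OO */
    vst1_u16(out + 12, vreinterpret_u16_u32(c13.val[1])); /* DD HH LL PP */
#else
    /* Scalar fallback with the same result, for non-NEON builds. */
    for (size_t g = 0; g < 4; ++g) {
        for (size_t r = 0; r < 4; ++r) {
            out[r * 4 + g] = base[g * stride_elems + r];
        }
    }
#endif
}
```

For the layout in the post, and if the 256 offset is in bytes, the call would be transpose4x4_u16(data, 128, out).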

Is there a way to avoid those three instructions when arranging my vectors, or is there another (faster) way to do it?

Currently I have to load the contiguous data and then permute the vectors. The VLDn instructions don't seem to help me.

 

Do you see a possibility I missed somewhere?

Thanks