Hi,
I have a dataset that is used by other algorithms, so its layout cannot be modified. That is my problem.
What is left: the data is scattered in memory, but contiguous within each group, and all groups have the same length.
gr1: offset 0 : AA BB CC DD
gr2: offset 256 : EE FF GG HH
gr3: offset 512 : II JJ KK LL
gr4: offset 768 : MM NN OO PP
Keep in mind that the address of EE is the address of AA plus 256, the address of II is the address of EE plus 256, and so on.
And I need:
AA EE II MM
BB FF JJ NN
CC GG KK OO
DD HH LL PP
So it is basically a transposition. And we have the vtrn instruction that can do this just fine. BUT it needs to be applied three times (see the vector arrangement).
Is there a way to avoid those three instructions when arranging my vectors, or is there any other (faster) way to do it?
Currently I need to load the contiguous data and then permute the vectors. The VLDn instructions don't seem to help me, since they de-interleave elements within a contiguous block, not across 256-byte strides.
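Written out in scalar C, the load-and-permute step I need is roughly the following (a sketch only; the 32-bit element size and the 4x4 block shape are my reading of the layout above, and the function name is made up):

```c
#include <stdint.h>
#include <string.h>

#define GROUP_STRIDE 256   /* bytes between group starts (gr1 at 0, gr2 at 256, ...) */
#define GROUPS 4           /* gr1..gr4 */
#define WORDS 4            /* AA BB CC DD per group, assumed 32-bit each */

/* Gather one block: output row w holds word w of every group,
 * i.e. row 0 = AA EE II MM, row 1 = BB FF JJ NN, and so on. */
static void transpose_block(const uint8_t *src, uint32_t dst[WORDS][GROUPS])
{
    for (int g = 0; g < GROUPS; g++)
        for (int w = 0; w < WORDS; w++)
            /* memcpy avoids alignment/aliasing assumptions in the sketch */
            memcpy(&dst[w][g], src + g * GROUP_STRIDE + w * 4, 4);
}
```

The NEON version does the same thing with four vector loads followed by the vtrn sequence; this scalar form just pins down which word goes where.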
Do you see a possibility I missed somewhere?
Thanks
Okay, I understand now. I would have thought the main overhead was reading the words at a 256-byte displacement from each other, and that overhead is what needs to be reduced as far as possible. I don't think there is much scope for improvement, but using preload instructions within the loop might help. So I'd go for something like:

    preload adr
    preload adr+256
    preload adr+2*256
    preload adr+3*256
loop:
    load adr
    preload adr+4*256
    load adr+256
    preload adr+5*256
    load adr+2*256
    preload adr+6*256
    load adr+3*256
    preload adr+7*256
    do the transformation
    store dest
    add 1024 to adr
    add 16 to dest
    go to loop if not finished

I don't know how many preloads can be outstanding. I have four above, but I'd have thought a processor would cope with more. The code can easily be changed to do more, by having more preloads before the loop and preloading from further forward inside the loop. But what's there ought to cover the overheads of the transformation code.
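In C with GCC or Clang, that preload-ahead pattern might be sketched as below; __builtin_prefetch compiles to PLD on ARM targets that support it. The one-block-ahead prefetch distance mirrors the loop above and is a guess, not a tuned value, and the function name and scalar transformation body are stand-ins:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GROUP_STRIDE 256
#define GROUPS 4
#define WORDS 4
#define BLOCK_BYTES (GROUPS * GROUP_STRIDE)   /* 1024 source bytes per iteration */

/* Transpose n_blocks blocks, preloading the next block's four groups
 * while the current one is being transformed. */
static void transpose_stream(const uint8_t *src, uint32_t *dst, size_t n_blocks)
{
    /* Prime the caches for the first block (the preloads before the loop). */
    for (int g = 0; g < GROUPS; g++)
        __builtin_prefetch(src + g * GROUP_STRIDE);

    for (size_t b = 0; b < n_blocks; b++) {
        const uint8_t *adr = src + b * BLOCK_BYTES;

        /* Preload the four groups of the NEXT block (adr+4*256 .. adr+7*256);
         * guarded so the sketch never forms pointers past the buffer. */
        if (b + 1 < n_blocks)
            for (int g = 0; g < GROUPS; g++)
                __builtin_prefetch(adr + BLOCK_BYTES + g * GROUP_STRIDE);

        /* "Do the transformation": scalar stand-in for the vtrn sequence. */
        for (int g = 0; g < GROUPS; g++)
            for (int w = 0; w < WORDS; w++)
                memcpy(&dst[b * GROUPS * WORDS + w * GROUPS + g],
                       adr + g * GROUP_STRIDE + w * 4, 4);
    }
}
```

In the assembly loop the preloads are interleaved between the loads; a compiler will schedule the builtin versions itself, so the grouping above is only for readability.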