
Vectors optimization

Hi,

 

I have a dataset that is used by other algorithms, so its layout cannot be modified.

That is my problem.

So what is left: data scattered in memory, but contiguous within each group, and all groups are the same length.

 

gr1: offset 0    :  AA BB CC DD

gr2: offset 256 : EE FF GG HH

gr3: offset 512 : II JJ KK LL

gr4: offset 768 : MM NN OO PP

Keep in mind that the address of EE == (address of AA + 256), the address of II == (address of EE + 256), and so on.

 

And I need:

AA EE II MM

BB FF JJ NN

CC GG KK OO

DD HH LL PP
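In scalar terms, the rearrangement above is a strided gather: output row j takes element j from each of the four groups. A minimal C sketch of it, assuming one-byte elements and the 256-byte group stride described above (the function and macro names are illustrative):

```c
#include <stddef.h>

#define GROUP_STRIDE 256  /* bytes between the starts of consecutive groups */
#define GROUPS 4          /* gr1..gr4 */
#define ELEMS 4           /* AA..DD within one group */

/* Gather element j of every group into row j of dst: row 0 becomes
 * AA EE II MM, row 1 becomes BB FF JJ NN, and so on. */
static void transpose_groups(unsigned char dst[ELEMS][GROUPS],
                             const unsigned char *src)
{
    for (size_t j = 0; j < ELEMS; ++j)
        for (size_t g = 0; g < GROUPS; ++g)
            dst[j][g] = src[g * GROUP_STRIDE + j];
}
```

This is only the reference semantics; the question is how to get the same effect with as few NEON instructions as possible.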

 

So it is basically a transposition. And we have the vtrn instruction, which can do this just fine. BUT it needs to be repeated three times (see the vector arrangement).

Is there a way to avoid those three instructions to arrange my vectors properly, or is there any other (faster) way to do it?

Currently I need to load the contiguous data and permute the vectors. VLDn doesn't seem to help me.

 

Do you see a possibility I missed somewhere?

Thanks

  • Ah, so preloading is my only option, right? I will deal with it then.

    Thanks daith!
  • Okay, I understand now. I would have thought the main overhead was reading the words at a 256-byte displacement from each other, and that overhead is what needs to be reduced as far as possible. I don't think there is large scope for improvement, but using preload instructions within the loop might help. So I'd go for something like
    preload adr
    preload adr+256
    preload adr+2*256
    preload adr+3*256

    loop:
    load adr
    preload adr+4*256
    load adr+256
    preload adr+5*256
    load adr+2*256
    preload adr+6*256
    load adr+3*256
    preload adr+7*256
    do the transformation
    store dest
    add 1024 to adr
    add 16 to dest
    go to loop if not finished

    I don't know how many preloads can be outstanding. I have four above, but I'd have thought a processor would cope with more. The code can easily be changed to do more, by issuing more preloads before the loop and preloading from further ahead inside the loop. But what's there ought to cover the overheads of the transformation code.
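The same software-pipelined pattern can be written in C with GCC/Clang's `__builtin_prefetch`, which maps to PLD on ARM (and degrades to a harmless hint elsewhere). A sketch under the assumptions of the thread (byte elements, 256-byte row stride, one 4x4 block per iteration); the names are illustrative:

```c
#include <stddef.h>

#define STRIDE 256  /* distance between the four group rows */

/* Mirror of the pseudocode loop above: for each block of four rows, issue a
 * preload one block (4*256 bytes) ahead of every load, then gather the 4x4
 * transpose into dst (16 output bytes per block, as in "add 16 to dest"). */
static void transpose_stream(unsigned char *dst, const unsigned char *src,
                             size_t blocks)
{
    /* Warm-up: preload the four rows of the first block. */
    for (size_t g = 0; g < 4; ++g)
        __builtin_prefetch(src + g * STRIDE);

    for (size_t b = 0; b < blocks; ++b) {
        const unsigned char *adr = src + b * 4 * STRIDE;   /* "adr" */
        for (size_t g = 0; g < 4; ++g) {
            __builtin_prefetch(adr + (g + 4) * STRIDE);    /* next block's row */
            for (size_t j = 0; j < 4; ++j)                 /* "load adr+g*256" */
                dst[b * 16 + j * 4 + g] = adr[g * STRIDE + j];
        }
        /* "add 1024 to adr" and "add 16 to dest" are implicit in the indexing. */
    }
}
```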

  • Thanks daith,

    We are talking about bytes here. I have to load the data from memory, as the dataset is too big for the cache (800-1000 MB), but the data follows a pattern (hence the 256-byte offset) that I can take advantage of.

    I already have a solution, which is to fill the vectors d1, d2, d3, d4 with contiguous data at the different offsets, permute them (3 vtrn operations) and do my job from there.

    But if there were an instruction that could avoid the 3 vtrn instructions per load, I would have been happier (less power consumption on a battery-constrained unit and a faster return to the sleep state).
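    For reference, one common three-VTRN arrangement for a 4x4 transpose is vtrn.16 on each low pair of d registers followed by one vtrn.32 across the two q registers. Below is a portable C model of those lane swaps so the permutation can be checked on any machine; the 16-bit element width is an assumption, and the helper names are illustrative, not real intrinsics:

```c
#include <stdint.h>
#include <string.h>

/* Model of vtrn.16 da, db: exchange each odd lane of da with the
 * corresponding even lane of db (2x2 transposes within lane pairs). */
static void vtrn16(uint16_t a[4], uint16_t b[4])
{
    for (int i = 0; i < 4; i += 2) {
        uint16_t t = a[i + 1];
        a[i + 1] = b[i];
        b[i] = t;
    }
}

/* One half of vtrn.32 q0, q1 (q0 = d0:d1, q1 = d2:d3), expressed on the
 * underlying d registers: swap a 32-bit lane, i.e. two 16-bit lanes. */
static void swap2(uint16_t *x, uint16_t *y)
{
    uint16_t t[2];
    memcpy(t, x, sizeof t);
    memcpy(x, y, sizeof t);
    memcpy(y, t, sizeof t);
}
```

    Applying `vtrn16(d0, d1); vtrn16(d2, d3);` and then the two halves of the vtrn.32 (`swap2(&d0[2], &d2[0]); swap2(&d1[2], &d3[0]);`) turns four row vectors into four column vectors, matching the "3 vtrn operations per load" described above.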
  • Is that 256 bits or bytes? And what type of machine are you talking about? Is the main problem reading the data from store rather than from a cache? How much data is there altogether? It doesn't really sound like a SIMD problem to me, but a data-movement problem.