
Vectors optimization

Hi,

 

I have a dataset that is used by other algorithms, so its layout cannot be modified.

That is my problem.

So what is left: data scattered in memory, but contiguous within each group, and all groups are the same length.

 

gr1: offset 0    :  AA BB CC DD

gr2: offset 256 : EE FF GG HH

gr3: offset 512 : II JJ KK LL

gr4: offset 768 : MM NN OO PP

Keep in mind that the address of EE == (address of AA + 256), the address of II == (address of EE + 256), and so on.

 

And I need:

AA EE II MM

BB FF JJ NN

CC GG KK OO

DD HH LL PP
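In scalar terms, the rearrangement above is a strided gather: output row j takes element j from each of the four groups. A minimal C sketch of it, assuming one-byte elements and the 256-byte group stride described above (the function and macro names are illustrative):

```c
#include <stddef.h>

#define GROUP_STRIDE 256  /* bytes between the starts of consecutive groups */
#define GROUPS 4          /* gr1..gr4 */
#define ELEMS 4           /* AA..DD within one group */

/* Gather element j of every group into row j of dst: row 0 becomes
 * AA EE II MM, row 1 becomes BB FF JJ NN, and so on. */
static void transpose_groups(unsigned char dst[ELEMS][GROUPS],
                             const unsigned char *src)
{
    for (size_t j = 0; j < ELEMS; ++j)
        for (size_t g = 0; g < GROUPS; ++g)
            dst[j][g] = src[g * GROUP_STRIDE + j];
}
```

This is only the reference semantics; the question is how to get the same effect with as few NEON instructions as possible.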

 

So it is basically a transposition. And we have the vtrn instruction, which can do this just fine. BUT it needs to be repeated three times (see the vector arrangement).

Is there a way to avoid those three instructions to arrange my vectors properly, or is there any other (faster) way to do it?

Currently I need to load the contiguous data and permute the vectors. VLDn doesn't seem to help me.

 

Do you see a possibility I missed somewhere?

Thanks

  • Ah, so preloading is my only option, right? I will deal with it then.

    Thanks daith!
  • Okay, I understand now. I would have thought the main overhead was reading the words at a 256-byte displacement from each other, and that overhead is what needs to be reduced as far as possible. I don't think there is large scope for improvement, but using preload instructions within the loop might help. So I'd go for something like
    preload adr
    preload adr+256
    preload adr+2*256
    preload adr+3*256

    loop:
    load adr
    preload adr+4*256
    load adr+256
    preload adr+5*256
    load adr+2*256
    preload adr+6*256
    load adr+3*256
    preload adr+7*256
    do the transformation
    store dest
    add 1024 to adr
    add 16 to dest
    go to loop if not finished

    I don't know how many preloads can be outstanding. I have four above, but I'd have thought a processor would cope with more. The code can easily be changed to do more, by issuing more preloads before the loop and preloading from further ahead inside the loop. But what's there ought to cover the overheads of the transformation code.
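The same software-pipelined pattern can be written in C with GCC/Clang's `__builtin_prefetch`, which maps to PLD on ARM (and degrades to a harmless hint elsewhere). A sketch under the assumptions of the thread (byte elements, 256-byte row stride, one 4x4 block per iteration); the names are illustrative:

```c
#include <stddef.h>

#define STRIDE 256  /* distance between the four group rows */

/* Mirror of the pseudocode loop above: for each block of four rows, issue a
 * preload one block (4*256 bytes) ahead of every load, then gather the 4x4
 * transpose into dst (16 output bytes per block, as in "add 16 to dest"). */
static void transpose_stream(unsigned char *dst, const unsigned char *src,
                             size_t blocks)
{
    /* Warm-up: preload the four rows of the first block. */
    for (size_t g = 0; g < 4; ++g)
        __builtin_prefetch(src + g * STRIDE);

    for (size_t b = 0; b < blocks; ++b) {
        const unsigned char *adr = src + b * 4 * STRIDE;   /* "adr" */
        for (size_t g = 0; g < 4; ++g) {
            __builtin_prefetch(adr + (g + 4) * STRIDE);    /* next block's row */
            for (size_t j = 0; j < 4; ++j)                 /* "load adr+g*256" */
                dst[b * 16 + j * 4 + g] = adr[g * STRIDE + j];
        }
        /* "add 1024 to adr" and "add 16 to dest" are implicit in the indexing. */
    }
}
```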

  • Thanks daith,

    We are talking about bytes here. I have to load the data from memory, as the dataset is too big for the cache (800-1000 MB), but the data follows a pattern (hence the 256-byte offset) that I can take advantage of.

    I already have a solution, which is to fill the vectors d1, d2, d3, d4 with contiguous data at the different offsets, permute them (3 vtrn operations) and do my job from there.

    But if there were an instruction that could avoid the 3 vtrn instructions per load, I would have been happier (less power consumption on a battery-constrained unit and a faster return to the sleep state).
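    For reference, one common three-VTRN arrangement for a 4x4 transpose is vtrn.16 on each low pair of d registers followed by one vtrn.32 across the two q registers. Below is a portable C model of those lane swaps so the permutation can be checked on any machine; the 16-bit element width is an assumption, and the helper names are illustrative, not real intrinsics:

```c
#include <stdint.h>
#include <string.h>

/* Model of vtrn.16 da, db: exchange each odd lane of da with the
 * corresponding even lane of db (2x2 transposes within lane pairs). */
static void vtrn16(uint16_t a[4], uint16_t b[4])
{
    for (int i = 0; i < 4; i += 2) {
        uint16_t t = a[i + 1];
        a[i + 1] = b[i];
        b[i] = t;
    }
}

/* One half of vtrn.32 q0, q1 (q0 = d0:d1, q1 = d2:d3), expressed on the
 * underlying d registers: swap a 32-bit lane, i.e. two 16-bit lanes. */
static void swap2(uint16_t *x, uint16_t *y)
{
    uint16_t t[2];
    memcpy(t, x, sizeof t);
    memcpy(x, y, sizeof t);
    memcpy(y, t, sizeof t);
}
```

    Applying `vtrn16(d0, d1); vtrn16(d2, d3);` and then the two halves of the vtrn.32 (`swap2(&d0[2], &d2[0]); swap2(&d1[2], &d3[0]);`) turns four row vectors into four column vectors, matching the "3 vtrn operations per load" described above.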
  • Is that 256 bits or bytes? And what type of machine are you talking about? Is the main problem reading the data from store rather than from a cache? How much data is there altogether? It doesn't really sound like a SIMD problem to me, but a data-movement problem.