
Vectors optimization

Hi,

 

I have a dataset that is also used by other algorithms, so its layout cannot be modified.

That is my problem.

So here is what I am left with: the data is scattered in memory, but contiguous within each group, and all groups have the same length.

 

gr1: offset 0    :  AA BB CC DD

gr2: offset 256 : EE FF GG HH

gr3: offset 512 : II JJ KK LL

gr4: offset 768 : MM NN OO PP

Keep in mind that EE sits at (AA + 256), II at (EE + 256), and so on.

 

And I need:

AA EE II MM

BB FF JJ NN

CC GG KK OO

DD HH LL PP

 

So it is basically a transposition, and the VTRN instruction can do this just fine, BUT it needs to be repeated three times (see the vector arrangement).

Is there a way to avoid those three instructions to arrange my vectors properly, or is there any other (faster) way to do it?

Currently I need to load the contiguous data and permute the vectors; VLDn doesn't seem to help me here.
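
To make this concrete, here is a minimal sketch of what I do today, written with NEON intrinsics and assuming 32-bit float elements (the pointer names and the exact permute arrangement, two VTRNs on Q registers plus recombining the halves, are just illustrative and may not match my three-instruction sequence exactly):

    #include <arm_neon.h>

    /* Sketch only: four strided loads (one row per group), then a
       VTRN-based 4x4 transpose. 256 bytes = 64 floats between groups. */
    void transpose_one_block(const float *src, float *dst)
    {
        float32x4_t r0 = vld1q_f32(src +   0);   /* AA BB CC DD */
        float32x4_t r1 = vld1q_f32(src +  64);   /* EE FF GG HH */
        float32x4_t r2 = vld1q_f32(src + 128);   /* II JJ KK LL */
        float32x4_t r3 = vld1q_f32(src + 192);   /* MM NN OO PP */

        float32x4x2_t t01 = vtrnq_f32(r0, r1);   /* {AA EE CC GG} {BB FF DD HH} */
        float32x4x2_t t23 = vtrnq_f32(r2, r3);   /* {II MM KK OO} {JJ NN LL PP} */

        vst1q_f32(dst +  0, vcombine_f32(vget_low_f32(t01.val[0]),  vget_low_f32(t23.val[0])));  /* AA EE II MM */
        vst1q_f32(dst +  4, vcombine_f32(vget_low_f32(t01.val[1]),  vget_low_f32(t23.val[1])));  /* BB FF JJ NN */
        vst1q_f32(dst +  8, vcombine_f32(vget_high_f32(t01.val[0]), vget_high_f32(t23.val[0]))); /* CC GG KK OO */
        vst1q_f32(dst + 12, vcombine_f32(vget_high_f32(t01.val[1]), vget_high_f32(t23.val[1]))); /* DD HH LL PP */
    }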

 

Do you see a possibility I missed somewhere?

Thanks

  • Okay, I understand now. I would have thought the main overhead is reading the words at a 256-byte displacement from each other, and that overhead is what needs to be reduced as far as possible. I don't think there is large scope for improvement, but using preload instructions within the loop might help. So I'd go for something like
    preload adr
    preload adr+256
    preload adr+2*256
    preload adr+3*256

    loop:
    load adr
    preload adr+4*256
    load adr+256
    preload adr+5*256
    load adr+2*256
    preload adr+6*256
    load adr+3*256
    preload adr+7*256
    do the transformation
    store dest
    add 1024 to adr
    add 16 to dest
    go to loop if not finished

    I don't know how many preloads can be outstanding. I have four above, but I'd have thought a processor could cope with more. The code can easily be changed to do more by having more before the loop and preloading from further forward in the loop. But what's there ought to cover the overheads of the transformation code.
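
    In C with GCC/Clang, the same idea might look roughly like the sketch below (__builtin_prefetch normally maps to PLD on ARM). The element type, the function name, and the destination stride are my assumptions; the transformation is the VTRN-based transpose from the question:

        #include <arm_neon.h>
        #include <stddef.h>

        /* Sketch of the preload scheme above; assumes 32-bit floats and that
           one iteration consumes a 1024-byte block of four 256-byte groups. */
        void transpose_blocks(const float *adr, float *dest, size_t nblocks)
        {
            const size_t stride = 256 / sizeof(float);   /* one group = 256 bytes */

            /* Preload the first four groups before entering the loop. */
            __builtin_prefetch(adr + 0 * stride);
            __builtin_prefetch(adr + 1 * stride);
            __builtin_prefetch(adr + 2 * stride);
            __builtin_prefetch(adr + 3 * stride);

            for (size_t i = 0; i < nblocks; ++i) {
                /* Load the current block, preloading the next one in between.
                   Preloads that run past the end of the buffer are harmless. */
                float32x4_t r0 = vld1q_f32(adr + 0 * stride);
                __builtin_prefetch(adr + 4 * stride);
                float32x4_t r1 = vld1q_f32(adr + 1 * stride);
                __builtin_prefetch(adr + 5 * stride);
                float32x4_t r2 = vld1q_f32(adr + 2 * stride);
                __builtin_prefetch(adr + 6 * stride);
                float32x4_t r3 = vld1q_f32(adr + 3 * stride);
                __builtin_prefetch(adr + 7 * stride);

                /* Do the transformation (VTRN-based 4x4 transpose). */
                float32x4x2_t t01 = vtrnq_f32(r0, r1);
                float32x4x2_t t23 = vtrnq_f32(r2, r3);

                /* Store the transposed block and advance. */
                vst1q_f32(dest +  0, vcombine_f32(vget_low_f32(t01.val[0]),  vget_low_f32(t23.val[0])));
                vst1q_f32(dest +  4, vcombine_f32(vget_low_f32(t01.val[1]),  vget_low_f32(t23.val[1])));
                vst1q_f32(dest +  8, vcombine_f32(vget_high_f32(t01.val[0]), vget_high_f32(t23.val[0])));
                vst1q_f32(dest + 12, vcombine_f32(vget_high_f32(t01.val[1]), vget_high_f32(t23.val[1])));

                adr  += 4 * stride;   /* add 1024 bytes to adr */
                dest += 16;           /* advance past the 16 elements just written */
            }
        }

    Note that I advance dest by the 16 elements written per block; the "add 16 to dest" in the pseudocode above may mean bytes instead, depending on your element size.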
