Arm Development Studio forum: How to get 4 values from a table at once?

zzliu over 12 years ago

Hi guys,

Let me give an example:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

int table[256]={......};

/* Return the table entry at the given index. */
int lookup_tbl(int index)
{
    return table[index];
}

int main(void)
{
    int idx0, idx1, idx2, idx3;
    int tbl0, tbl1, tbl2, tbl3;

    idx0 = 2;
    idx1 = 36;
    idx2 = 111;
    idx3 = 204;

    /* Four independent scalar lookups. */
    tbl0 = lookup_tbl(idx0);
    tbl1 = lookup_tbl(idx1);
    tbl2 = lookup_tbl(idx2);
    tbl3 = lookup_tbl(idx3);

    return 0;
}

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

My question is:

Could I use a NEON intrinsic (maybe VTBL) to get all 4 values at once, if I have put the indices into a 32x4_t variable?

Thank you very much.

daith over 12 years ago

That operation is part of the gather/scatter set of operations that only the very latest high-end Intel processors implement. For processors lower down the range, if it is important, the best option is to use the GPU, provided the rest of the job fits well on one. The basic problem is that memory access is the most time-consuming part, and a straightforward hardware implementation wouldn't be any faster than four separate accesses, because it has to reach four quite separate places in memory; it is certainly not a RISC-type operation. There are optimisations to be made in hardware, especially if the indexes are close together or ordered, or if a bit mask is used, but I think we'll need someone talking about ARM getting into the high-end compute market before they start implementing anything like that!
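
To make the "four separate accesses" point concrete, here is a minimal sketch in C of what I would expect the lookup to reduce to on NEON. It is only an illustration under some assumptions: `gather4` is a made-up helper name (not an Arm API), the table is taken to hold 32-bit ints, and the indices are passed as a plain array rather than a vector, since each one has to be used as a scalar address anyway. As far as I know, VTBL itself only does byte-wise lookups in a table of at most 32 bytes on ARMv7, so a 256-entry int table does not fit it. When NEON is not available the sketch falls back to a plain loop.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Stand-in for the poster's table (assumption: 32-bit entries).
   The caller fills it in. */
static int32_t table[256];

/* Hypothetical helper: fetch tbl[idx[0..3]] into out[0..3].
   There is no single gather instruction here; each lane is
   loaded from memory separately. */
static void gather4(const int32_t *tbl, const uint32_t idx[4], int32_t out[4])
{
#if HAVE_NEON
    int32x4_t v = vdupq_n_s32(0);
    /* One scalar load per lane: still four memory accesses. */
    v = vld1q_lane_s32(&tbl[idx[0]], v, 0);
    v = vld1q_lane_s32(&tbl[idx[1]], v, 1);
    v = vld1q_lane_s32(&tbl[idx[2]], v, 2);
    v = vld1q_lane_s32(&tbl[idx[3]], v, 3);
    vst1q_s32(out, v);
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 4; ++i) {
        out[i] = tbl[idx[i]];
    }
#endif
}
```

With the indices from the question (2, 36, 111, 204) this returns the same four values as the four `lookup_tbl` calls. The only gain is that they end up in one vector register, ready for further NEON arithmetic; the memory traffic is unchanged.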