Hi guys,
Let me give an example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
int table[256]={......};
int lookup_tbl(int index)
{
    return table[index];
}

int main()
{
    int idx0, idx1, idx2, idx3;
    int tbl0, tbl1, tbl2, tbl3;

    idx0 = 2;
    idx1 = 36;
    idx2 = 111;
    idx3 = 204;

    /* four separate scalar lookups */
    tbl0 = lookup_tbl(idx0);
    tbl1 = lookup_tbl(idx1);
    tbl2 = lookup_tbl(idx2);
    tbl3 = lookup_tbl(idx3);

    return 0;
}
My question is:
Could I use some NEON intrinsic (maybe VTBL) to get the 4 values at once, if I have put the indexes into a 32x4_t variable?
Thank you very much.
That operation is a gather, part of the gather/scatter set of operations that the very latest high-end Intel processors implement. On processors lower down the range, if it really matters, the best option is to use the GPU, provided the rest of the job fits well on one.

The basic problem is that memory access is the most time-consuming part. A straightforward hardware implementation wouldn't be any faster than four separate accesses, because it has to reach four quite separate places in memory, and it is certainly not a RISC-type operation. There are optimisations to be made in hardware, especially if the indexes are close together or ordered, or if a bit mask is used, but I think we'll need someone talking about ARM getting into the high-end compute market before they start implementing anything like that!
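
For what it's worth, here is a rough sketch of the closest thing I know of in NEON today: four single-lane loads that assemble the results in one vector. As far as I know, VTBL itself only looks up bytes in a table of at most 32 bytes (64 with the AArch64 TBL form), so it can't index a 256-entry int table directly. The helper name `lookup_tbl4`, the `TABLE_SIZE` macro and the array-based interface are my own assumptions, not anything from the original code. If the indexes already sit in a vector they would first have to be moved out lane by lane (for example with `vgetq_lane_u32`). The scalar branch is only there so the sketch also builds on non-ARM machines.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

#define TABLE_SIZE 256

/* Hypothetical helper (not from the original post):
   gathers tbl[idx[0..3]] into out[0..3]. */
static void lookup_tbl4(const int32_t tbl[TABLE_SIZE],
                        const uint32_t idx[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        assert(idx[i] < TABLE_SIZE);   /* every index must be in range */

#if defined(__ARM_NEON)
    /* NEON path: still four separate memory accesses, one per lane.
       The lane number must be a compile-time constant, hence no loop. */
    int32x4_t v = vdupq_n_s32(0);
    v = vld1q_lane_s32(&tbl[idx[0]], v, 0);
    v = vld1q_lane_s32(&tbl[idx[1]], v, 1);
    v = vld1q_lane_s32(&tbl[idx[2]], v, 2);
    v = vld1q_lane_s32(&tbl[idx[3]], v, 3);
    vst1q_s32(out, v);
#else
    /* Portable fallback: plain scalar lookups. */
    for (int i = 0; i < 4; i++)
        out[i] = tbl[idx[i]];
#endif
}
```

In real code you would keep `v` in a register and carry on with vector arithmetic rather than storing it straight back. Any gain comes from the SIMD work done after the loads, not from the loads themselves, which matches the point above about memory access dominating.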