Hi guys,
Let me give an example:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
int table[256]={......};
int lookup_tbl(int index)
{
    return table[index];
}
int main()
{
    int idx0, idx1, idx2, idx3;
    int tbl0, tbl1, tbl2, tbl3;
    idx0 = 2;
    idx1 = 36;
    idx2 = 111;
    idx3 = 204;
    /* four independent lookups at unrelated positions in the table */
    tbl0 = lookup_tbl(idx0);
    tbl1 = lookup_tbl(idx1);
    tbl2 = lookup_tbl(idx2);
    tbl3 = lookup_tbl(idx3);
    return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My question is:
Could I use some NEON intrinsic (maybe VTBL) to get the 4 values at once, if I have put the indices into a 32x4_t variable?
Thank you very much.
Hi,
NEON instructions are useful for loading multiple values at once, but they can't be used to achieve what you are trying to do. From what I can see, you are trying to load four essentially arbitrary elements from an array (in your example, elements 2, 36, 111 and 204). NEON load instructions can load four 32-bit values in a single operation, but they have to be contiguous elements. (There are some variations that use the structured load instructions to de-interleave structured data, but they still operate on contiguous elements in memory.)
For instance:
int32x4_t temp;
temp = vld1q_s32(table);
compiles to:
LDR r0, =table
VLD1.32 {d0,d1}, [r0]
This loads 4 consecutive 32-bit words from the table. But I'm afraid they do have to be consecutive.
Chris
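To make the point above concrete, here is a minimal sketch of the usual fallback: do the four lookups as ordinary scalar loads into a small temporary array, then bring that array into a NEON register with one contiguous load. The helper name gather4 and the TABLE_SIZE constant are made up for illustration, and the non-NEON branch is only there so the sketch also builds on other targets. It assumes the table is declared as int32_t.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

#define TABLE_SIZE 256  /* assumed size, matching the table in the question */

/* Hypothetical helper: fetch four arbitrary table entries into out[0..3]. */
void gather4(const int32_t *table, const int32_t idx[4], int32_t out[4])
{
    int32_t tmp[4];
    for (int i = 0; i < 4; i++) {
        assert(idx[i] >= 0 && idx[i] < TABLE_SIZE);
        tmp[i] = table[idx[i]];      /* four separate scalar loads */
    }
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* tmp[] is contiguous, so a single VLD1 can pick up all four values. */
    int32x4_t v = vld1q_s32(tmp);
    vst1q_s32(out, v);               /* stored back only to show the result */
#else
    for (int i = 0; i < 4; i++)
        out[i] = tmp[i];
#endif
}
```

The memory accesses themselves are still four scalar loads; NEON only helps with whatever arithmetic is done on the vector afterwards.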
VTBL is a byte-wise lookup instruction that we can't use directly for a table of 32-bit values.
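As a rough illustration of what "byte-wise" means here, the sketch below looks up 8 bytes from an 8-byte table with vtbl1_u8; any index outside the table yields 0. The helper name byte_lookup8 is invented for this example, and the scalar branch just mirrors the same behaviour on non-NEON targets. On ARMv7 the table for VTBL is at most four D registers (32 bytes), so a 256-entry table of 32-bit values is far out of its reach.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[k] = tbl[idx[k]], or 0 if idx[k] is out of range. */
void byte_lookup8(const uint8_t tbl[8], const uint8_t idx[8], uint8_t out[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t t = vld1_u8(tbl);      /* the table: 8 bytes in one D register */
    uint8x8_t i = vld1_u8(idx);      /* 8 byte-sized indices */
    vst1_u8(out, vtbl1_u8(t, i));    /* one VTBL does all 8 byte lookups */
#else
    for (int k = 0; k < 8; k++)
        out[k] = (idx[k] < 8) ? tbl[idx[k]] : 0;
#endif
}
```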
The other option (you might already know this) is to rearrange the table if you know the indexes ahead of time. In this case, arrange the table elements with indexes 2, 36, 111 and 204 in consecutive locations so you can use VLD1.32 {d0,d1},[table]. This is not a good technique when you don't know the indexes ahead of time.
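A small sketch of that rearrangement idea, assuming the four indexes really are fixed in advance: copy the wanted entries into consecutive slots once at setup time, then fetch them with a single contiguous load whenever they are needed. The names hot_idx, pack_hot_entries and load_hot_entries are hypothetical, and the scalar branch is only a stand-in for non-NEON builds.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Assumption: the indexes are known ahead of time (values from the question). */
static const int hot_idx[4] = {2, 36, 111, 204};

/* Setup, done once: place the four wanted entries in consecutive slots. */
void pack_hot_entries(const int32_t *table, int32_t packed[4])
{
    for (int i = 0; i < 4; i++)
        packed[i] = table[hot_idx[i]];
}

/* Hot path: one contiguous load now fetches all four values. */
void load_hot_entries(const int32_t packed[4], int32_t out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_s32(out, vld1q_s32(packed));   /* a single VLD1.32 {d0,d1} */
#else
    for (int i = 0; i < 4; i++)
        out[i] = packed[i];
#endif
}
```

Note that the packed copy goes stale if the original table is modified afterwards, so this only suits tables that are constant or repacked on every change.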
That operation is part of the gather/scatter set of operations that the very latest high-end Intel processors implement. On processors lower down the range, if this really matters, the best option is to use the GPU, provided the rest of the job fits well on one. The basic problem is that memory access is the most time-consuming part, and a straightforward hardware implementation wouldn't be any faster than four separate accesses: it needs to reach four quite separate places in memory, and it is certainly not a RISC-type operation. There are optimisations to be made in hardware, especially if the indexes are close together or ordered, or if a bit mask is used, but I think we'll need someone talking about ARM getting into the high-end compute market before they start implementing anything like that!