I have a `uint32x4_t` ARM NEON vector register. I want to "shuffle" these 4 uint32_t values with [vtbx2](infocenter.arm.com/.../index.jsp "Extended table look up intrinsics") and [vext](infocenter.arm.com/.../index.jsp "Vector reinterpret cast operations").

The interface of the table lookup intrinsics expects `uint8x8_t`. This seems to be possible with a [cast](infocenter.arm.com/.../index.jsp "vector reinterpret cast"), especially because the documentation states that the
> "[...] conversions do not change the bit pattern represented by the vector."
I tried it with the following code:
```cpp
#include <iostream>
#include <arm_neon.h>
#include <bitset>

int main()
{
  uint32_t* data = new uint32_t[4];
  uint32_t* result = new uint32_t[4];

  //00 00 0A 0A
  data[0] = 2570;
  //00 0A 00 0A
  data[1] = 655370;
  //0A 0A 0A 0A
  data[2] = 168430090;
  //00 00 00 0A
  data[3] = 10;

  //load data
  uint32x4_t dataVec = vld1q_u32(data);
  //cast to uint8
  uint8x16_t dataVecByteTmp = vreinterpretq_u8_u32(dataVec);

  uint32_t* tmpData = new uint32_t[4];
  //store original data
  vst1q_u32(tmpData, dataVec);
  std::cout << "Orig Data:" << std::endl;
  for(int i = 0; i < 4; ++i)
  {
    std::bitset<32> f(tmpData[i]);
    std::cout << f << std::endl;
  }

  uint8_t* high = new uint8_t[16];
  //store uint8 data
  vst1q_u8(high, dataVecByteTmp);
  std::cout << "unsigned output" << std::endl;
  for(int i = 0; i < 16; ++i)
  {
    std::cout << (unsigned)high[i] << std::endl;
  }
  std::cout << "bitwise output" << std::endl;
  for(int i = 0; i < 16; ++i)
  {
    std::bitset<8> b(high[i]);
    std::cout << b << std::endl;
  }

  delete[] tmpData;
  delete[] high;
  delete[] data;
  delete[] result;
  return 0;
}
```
One can compile it with:
> g++ -march=native -mfpu=neon -std=c++14 main.cpp
The output looks like the following:
```
Orig Data:
00000000000000000000101000001010
00000000000010100000000000001010
00001010000010100000101000001010
00000000000000000000000000001010
unsigned output
10 10 0 0 10 0 10 0 10 10 10 10 10 0 0 0
bitwise output
00001010 00001010 00000000 00000000
00001010 00000000 00001010 00000000
00001010 00001010 00001010 00001010
00001010 00000000 00000000 00000000
```

For a better overview, I changed the formatting a bit:

```
Orig (uint32_t):
00000000 00000000 00001010 00001010
00000000 00001010 00000000 00001010
00001010 00001010 00001010 00001010
00000000 00000000 00000000 00001010

New (uint8_t):
10 10 0 0 10 0 10 0 10 10 10 10 10 0 0 0

New (uint8_t bitwise):
00001010 00001010 00000000 00000000
00001010 00000000 00001010 00000000
00001010 00001010 00001010 00001010
00001010 00000000 00000000 00000000
```
As one can see, the result is not what I expected. Does anyone know whether I did something wrong, or is this a bug?
Sincerely
The result is exactly what you would expect on a little-endian machine; as the documentation says, the vreinterpret is not modifying the data.
Formatting your "Orig Data" output so that you can see the individual bytes in your 32-bit values gives:
```
   +3       +2       +1       +0
-------- -------- -------- --------
00000000 00000000 00001010 00001010 | +0
00000000 00001010 00000000 00001010 | +4
00001010 00001010 00001010 00001010 | +8
00000000 00000000 00000000 00001010 | +12
```
Reading this out in byte order means starting at byte +0 of the top row (the rightmost column) and moving right to left within each row, row by row. That yields 0x0A, 0x0A, 0x00, 0x00, 0x0A, 0x00, 0x0A, 0x00, 0x0A, 0x0A, 0x0A... which matches the sequence you show in "bitwise output".
Simon.