I have a `uint32x4_t` ARM NEON vector register and I want to "shuffle" these four `uint32_t` lanes with [vtbx2](infocenter.arm.com/.../index.jsp "Extended table look up intrinsics") and [vext](infocenter.arm.com/.../index.jsp "Vector reinterpret cast operations").
The interface of the table lookup intrinsics expects `uint8x8_t` vectors. This should be possible with a [cast](infocenter.arm.com/.../index.jsp "vector reinterpret cast"), especially since the documentation states that the
> "[...] conversions do not change the bit pattern represented by the vector."
I tried it with the following code:
```cpp
#include <iostream>
#include <arm_neon.h>
#include <bitset>

int main()
{
    uint32_t* data = new uint32_t[4];
    uint32_t* result = new uint32_t[4];

    // 00 00 0A 0A
    data[0] = 2570;
    // 00 0A 00 0A
    data[1] = 655370;
    // 0A 0A 0A 0A
    data[2] = 168430090;
    // 00 00 00 0A
    data[3] = 10;

    // load data
    uint32x4_t dataVec = vld1q_u32(data);
    // cast to uint8
    uint8x16_t dataVecByteTmp = vreinterpretq_u8_u32(dataVec);

    uint32_t* tmpData = new uint32_t[4];
    // store original data
    vst1q_u32(tmpData, dataVec);
    std::cout << "Orig Data:" << std::endl;
    for (int i = 0; i < 4; ++i)
    {
        std::bitset<32> f(tmpData[i]);
        std::cout << f << std::endl;
    }

    uint8_t* high = new uint8_t[16];
    // store uint8 data
    vst1q_u8(high, dataVecByteTmp);
    std::cout << "unsigned output" << std::endl;
    for (int i = 0; i < 16; ++i)
    {
        std::cout << (unsigned)high[i] << std::endl;
    }
    std::cout << "bitwise output" << std::endl;
    for (int i = 0; i < 16; ++i)
    {
        std::bitset<8> b(high[i]);
        std::cout << b << std::endl;
    }

    delete[] tmpData;
    delete[] high;
    delete[] data;
    delete[] result;
    return 0;
}
```
One can compile it with:
> g++ -march=native -mfpu=neon -std=c++14 main.cpp
The output looks like this:
```
Orig Data:
00000000000000000000101000001010
00000000000010100000000000001010
00001010000010100000101000001010
00000000000000000000000000001010
unsigned output
10
10
0
0
10
0
10
0
10
10
10
10
10
0
0
0
bitwise output
00001010
00001010
00000000
00000000
00001010
00000000
00001010
00000000
00001010
00001010
00001010
00001010
00001010
00000000
00000000
00000000
```

For a better overview, here is the same data with the formatting changed a bit:

```
Orig (uint32_t):
00000000 00000000 00001010 00001010
00000000 00001010 00000000 00001010
00001010 00001010 00001010 00001010
00000000 00000000 00000000 00001010

New (uint8_t):
10 10 0 0 10 0 10 0 10 10 10 10 10 0 0 0

New (uint8_t bitwise):
00001010 00001010 00000000 00000000
00001010 00000000 00001010 00000000
00001010 00001010 00001010 00001010
00001010 00000000 00000000 00000000
```
As one can see, the result is not what I expected. Did I do something wrong, or is this a bug?