I have a `uint32x4_t` ARM NEON vector register and I want to "shuffle" these four `uint32_t` lanes with [vtbx2](infocenter.arm.com/.../index.jsp "Extended table look up intrinsics") and [vext](infocenter.arm.com/.../index.jsp "Vector reinterpret cast operations").
The interface of the table lookup intrinsics expects `uint8x8_t` vectors. This should be possible with a [cast](infocenter.arm.com/.../index.jsp "vector reinterpret cast"), especially since the documentation states that the
> "[...] conversions do not change the bit pattern represented by the vector."
I tried it with the following code:
```cpp
#include <iostream>
#include <arm_neon.h>
#include <bitset>

int main()
{
    uint32_t* data = new uint32_t[4];
    uint32_t* result = new uint32_t[4];

    // 00 00 0A 0A
    data[0] = 2570;
    // 00 0A 00 0A
    data[1] = 655370;
    // 0A 0A 0A 0A
    data[2] = 168430090;
    // 00 00 00 0A
    data[3] = 10;

    // load data
    uint32x4_t dataVec = vld1q_u32(data);
    // cast to uint8
    uint8x16_t dataVecByteTmp = vreinterpretq_u8_u32(dataVec);

    uint32_t* tmpData = new uint32_t[4];
    // store original data
    vst1q_u32(tmpData, dataVec);
    std::cout << "Orig Data:" << std::endl;
    for (int i = 0; i < 4; ++i)
    {
        std::bitset<32> f(tmpData[i]);
        std::cout << f << std::endl;
    }

    uint8_t* high = new uint8_t[16];
    // store uint8 data
    vst1q_u8(high, dataVecByteTmp);
    std::cout << "unsigned output" << std::endl;
    for (int i = 0; i < 16; ++i)
    {
        std::cout << (unsigned)high[i] << std::endl;
    }
    std::cout << "bitwise output" << std::endl;
    for (int i = 0; i < 16; ++i)
    {
        std::bitset<8> b(high[i]);
        std::cout << b << std::endl;
    }

    delete[] tmpData;
    delete[] high;
    delete[] data;
    delete[] result;
    return 0;
}
```
One can compile it with:
> g++ -march=native -mfpu=neon -std=c++14 main.cpp
The output looks like this:
```
Orig Data:
00000000000000000000101000001010
00000000000010100000000000001010
00001010000010100000101000001010
00000000000000000000000000001010
unsigned output
10
10
0
0
10
0
10
0
10
10
10
10
10
0
0
0
bitwise output
00001010
00001010
00000000
00000000
00001010
00000000
00001010
00000000
00001010
00001010
00001010
00001010
00001010
00000000
00000000
00000000
```

For a better overview, here is the same data with the formatting changed a bit:

```
Orig (uint32_t):
00000000 00000000 00001010 00001010
00000000 00001010 00000000 00001010
00001010 00001010 00001010 00001010
00000000 00000000 00000000 00001010

New (uint8_t):
10 10 0 0 10 0 10 0 10 10 10 10 10 0 0 0

New (uint8_t bitwise):
00001010 00001010 00000000 00000000
00001010 00000000 00001010 00000000
00001010 00001010 00001010 00001010
00001010 00000000 00000000 00000000
```
As one can see, the result is not what I expected. Did I do something wrong, or is this a bug?