Vector reinterpret cast (ARM NEON)

I have a `uint32x4_t` ARM NEON vector register. I want to "shuffle" these four `uint32_t` values with
[vtbx2](infocenter.arm.com/.../index.jsp "Extended table look up intrinsics") and [vext](infocenter.arm.com/.../index.jsp "Vector reinterpret cast operations")

The interface for the table lookup intrinsics expects `uint8x8_t`. This seems to be possible with a [cast](infocenter.arm.com/.../index.jsp "vector reinterpret cast"), especially because the documentation states that the

> "[...] conversions do not change the bit pattern represented by the vector."

I tried it with the following code:

    #include <iostream>
    #include <arm_neon.h>
    #include <bitset>

    int main() {
        uint32_t* data = new uint32_t[4];
        uint32_t* result = new uint32_t[4];
        // 00 00 0A 0A
        data[0] = 2570;
        // 00 0A 00 0A
        data[1] = 655370;
        // 0A 0A 0A 0A
        data[2] = 168430090;
        // 00 00 00 0A
        data[3] = 10;

        // load data
        uint32x4_t dataVec = vld1q_u32(data);
        // reinterpret as uint8 (same bits, different lane view)
        uint8x16_t dataVecByteTmp = vreinterpretq_u8_u32(dataVec);

        uint32_t* tmpData = new uint32_t[4];
        // store original data
        vst1q_u32(tmpData, dataVec);

        std::cout << "Orig Data:" << std::endl;
        for (int i = 0; i < 4; ++i) {
            std::bitset<32> f(tmpData[i]);
            std::cout << f << std::endl;
        }

        uint8_t* high = new uint8_t[16];
        // store uint8 data
        vst1q_u8(high, dataVecByteTmp);

        std::cout << "unsigned output" << std::endl;
        for (int i = 0; i < 16; ++i) {
            std::cout << (unsigned)high[i] << std::endl;
        }
        std::cout << "bitwise output" << std::endl;
        for (int i = 0; i < 16; ++i) {
            std::bitset<8> b(high[i]);
            std::cout << b << std::endl;
        }
        delete[] tmpData;
        delete[] high;
        delete[] data;
        delete[] result;
        return 0;
    }

One can compile it with:

> g++ -march=native -mfpu=neon -std=c++14 main.cpp

The output looks like the following:

Orig Data:
00000000000000000000101000001010
00000000000010100000000000001010
00001010000010100000101000001010
00000000000000000000000000001010

unsigned output
10 
10 
0 
0 
10 
0 
10 
0 
10 
10 
10 
10 
10 
0 
0 
0 

bitwise output
00001010
00001010
00000000
00000000
00001010
00000000
00001010
00000000
00001010
00001010
00001010
00001010
00001010
00000000
00000000
00000000

For a better overview, I changed the formatting a bit:

Orig (uint32_t):
00000000 00000000 00001010 00001010
00000000 00001010 00000000 00001010
00001010 00001010 00001010 00001010
00000000 00000000 00000000 00001010

New (uint8_t):
10 10 0 0 
10 0 10 0 
10 10 10 10 
10 0 0 0

New (uint8_t bitwise):
00001010 00001010 00000000 00000000
00001010 00000000 00001010 00000000
00001010 00001010 00001010 00001010
00001010 00000000 00000000 00000000

As one can see, the result is not what I expected. Does anyone know whether I did something wrong, or is this a bug?

Sincerely

  • The result appears to be what would be expected for a little-endian machine; as you expect, the vreinterpret is not modifying the data.

    Formatting your "Orig Data" output so that you can see the individual bytes in your 32-bit values gives:

        +3       +2       +1       +0
     -------- -------- -------- --------
     00000000 00000000 00001010 00001010 | +0
     00000000 00001010 00000000 00001010 | +4
     00001010 00001010 00001010 00001010 | +8
     00000000 00000000 00000000 00001010 | +12

    Reading this out in byte order goes from top left to bottom right and yields 0x0A, 0x0A, 0x00, 0x00, 0x0A, 0x00, 0x0A, 0x00, 0x0A, 0x0A, 0x0A, 0x0A, ... which matches the sequence you show in "bitwise output".
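    This little-endian layout can be demonstrated without NEON at all. A minimal portable sketch (assuming a little-endian host, as on typical ARM configurations; `bytes_of` is just an illustrative helper, not part of any API):

    ```cpp
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Copy the object representation of a 32-bit value into a byte array.
    // std::memcpy is the well-defined way to inspect an object's bytes
    // (it avoids strict-aliasing issues and compiles to plain loads/stores).
    static void bytes_of(uint32_t v, uint8_t out[4]) {
        std::memcpy(out, &v, sizeof v);
    }

    int main() {
        uint32_t value = 2570; // 0x00000A0A, same as data[0] in the question
        uint8_t b[4];
        bytes_of(value, b);
        // On a little-endian host the least significant byte comes first,
        // so this prints "0A 0A 00 00" - exactly the order vreinterpret/vst1q_u8 show.
        for (int i = 0; i < 4; ++i)
            std::printf("%02X ", b[i]);
        std::printf("\n");
        return 0;
    }
    ```

    This is the same reordering you see in your "New (uint8_t)" table: within each 32-bit lane the bytes appear reversed relative to the written-out binary form.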

    Simon.
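    To connect this back to the original goal: once the byte order is clear, a vtbl-style shuffle is just per-byte indexing into the 16-byte view. A portable model of that behavior (a plain-C++ sketch of the table-lookup semantics, not the intrinsic itself; `tbl16` is a hypothetical helper) that reverses the four 32-bit lanes:

    ```cpp
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Portable model of a 16-byte table lookup: out[i] = table[idx[i]],
    // with out-of-range indices producing 0 (vtbl semantics; vtbx would
    // instead keep the destination byte).
    static void tbl16(const uint8_t table[16], const uint8_t idx[16], uint8_t out[16]) {
        for (int i = 0; i < 16; ++i)
            out[i] = (idx[i] < 16) ? table[idx[i]] : 0;
    }

    int main() {
        uint32_t data[4] = {2570, 655370, 168430090, 10};
        uint8_t bytes[16], shuffled[16];
        std::memcpy(bytes, data, 16); // the byte view that vreinterpret exposes

        // Indices that reverse the four 32-bit lanes: lane 3's bytes first,
        // lane 0's bytes last. Each group of four stays in memory order.
        const uint8_t idx[16] = {12, 13, 14, 15,  8, 9, 10, 11,
                                  4,  5,  6,  7,  0, 1,  2,  3};
        tbl16(bytes, idx, shuffled);

        uint32_t result[4];
        std::memcpy(result, shuffled, 16);
        for (int i = 0; i < 4; ++i)
            std::printf("%u\n", result[i]); // prints 10, 168430090, 655370, 2570
        return 0;
    }
    ```

    The key point is that the index vector addresses *bytes in memory order*, so moving a whole `uint32_t` lane means moving its four bytes as a group, in ascending order.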

