How to use NEON lame registers and related functions

hterrolle 2 months ago

hi,

As I understood it, ARM64 has a 2048-bit register divided into 16 lames of 128 bits (16 octets (bytes)).

int data1[4] = {300,400,600,400};
int* data1_ptr = (data1);

When I use "int32x4_t lame1 = vld1q_s32(data1_ptr)", I load 4 integers into "lame1", and I can check the value of lame1 using printf(" %d %d %d %d \n", lame1[0], lame1[1], lame1[2], lame1[3]).

And I should be able to load 16 times 4 integers at most into the whole register.

My first question is: how does NEON know which lame is being fed by vld1q if I load data1 to data16?

Here is what I do not understand.

If I want to use "int32x2x2_t lame1and_2_low = vld2_s32(x_base);", I will load data1[0] and data1[1] into the low register of lame1, and data1[2] and data1[3] into the low register of lame2.

As I understood it, int32x2x2_t is:

struct int32x2x2_t {
    int32x2_t val[2];
};

So I tried to use printf to see the data in the register using lame1and_2_low[0][1], but it does not work. The only things I can do are lame1and_2_low[0] and lame1and_2_low.val[0], but the data printed is 3 for lame1and_2_low[0] and -617811840 for lame1and_2_low[0]. So it is not the data value I was expecting. :))

I am posting the question because I want to debug the examples I found, "RGB deinterleaving" and "matrix multiplication", where they use "vld3q_u8" and "vfmaq_laneq_f32", to understand how they work. Both use lames and I do not understand how to print the data from the registers. It is easy with vld1q but not with vld2_s32.

So, the main question is how to printf data from a register. There must be a syntax that I do not know.

In OpenCL we have S0 to S16. But in NEON I did not find any information on how to printf data from a register by name, for data types like 32x2x2_t and other lame data types.

========================================================================

By the way, in "matrix multiplication" I have seen something strange.

uint32_t n = 2*BLOCK_SIZE;
uint32_t k = 2*BLOCK_SIZE;
float32_t A[n*k] => float32_t A[4]
matrix_init_rand(A, n*k);

Then it uses

float32x4_t A0;
float32x4_t A1;
float32x4_t A2;
float32x4_t A3;

and

A0 = vld1q_f32(A);
A1 = vld1q_f32(A+4);
A2 = vld1q_f32(A+8);
A3 = vld1q_f32(A+12);

How is it possible to load 4 times 4 values with float32_t A[4]? It should be float32_t A[16].

Another strange thing in "matrix multiplication" is the use of "C0 = vfmaq_laneq_f32(C0, A0, B0, 0)".

C0 is supposed to be the output of the computation, but in the documentation at developer.arm.com/.../vfmaq_laneq_f32

they say that "This instruction multiplies the vector elements in the first source SIMD&FP register by the specified value in the second source SIMD&FP register".

So it should be "C0 = vfmaq_laneq_f32(A0, B0, C0, 0)".

========================================================================

Something else that I do not understand in the documentation: there are the intrinsics, which are the functions we use in C, and the AArch64 Instruction, which I suppose is the assembler code?

And the "Argument Preparation".

On https://arm-software.github.io we have this:

a -> Vd.8H
b -> Vn.8B
c -> Vm.8B

and on developer.arm.com/.../vfmaq_laneq_f32 we have

a register: Vd.4S
b register: Vn.4S
v register: Vm.4S
lane minimum: 0; maximum: 3

In fact, in all the documentation I read, the register names are different. In some documents they call a register VH[0], etc.

The naming is very confusing.

========================================================================

It is quite a long post and I am asking for quite a lot of information. But I am sure that the answers will help me a lot, and not only me.

PS: Computing 4 values per instruction is really a very good advance. Let's wait for 256-bit lames ;)) and computation inside the same lame ;))

PS: You are free to change the title of the question and split it into several parts if you think so. Or let me know and I will do it.

Have a good day. ;))

Regards and thanks in advance. And forgive my horrible English writing. ;))

0 Peter Harris 2 months ago

(Edited to add more to the answer - the forum bugged so I had to reply early and then reedit).

hterrolle said:
ARM64 has a 2048-bit register divided into 16 lames of 128 bits

When you use intrinsics you don't need to worry about registers - the compiler handles that for you, just like it would for a variable using a standard C data type.

Also note that AArch64 NEON increases to 32 x 128-bit registers, so you get twice as much capacity vs AArch32 which only has 16.

hterrolle said:
If I want to use "int32x2x2_t lame1and_2_low = vld2_s32(x_base);", I will load data1[0] and data1[1] into the low register of lame1, and data1[2] and data1[3] into the low register of lame2.

vld2_s32 is a 64-bit de-interleaving load, so it will load indices 0,2 into the first 64-bit vector (.val[0]) and indices 1,3 into the second 64-bit vector (.val[1]), for 128 bits of data in total.

If you want the 128-bit version, which loads int32x4x2_t, you need vld2q_s32, which will load 0,2,4,6 into the first 128-bit vector (.val[0]) and 1,3,5,7 into the second 128-bit vector (.val[1]).
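
A minimal sketch of that element order, for anyone who wants to check it on their own toolchain. The helper name deinterleave8_s32 is hypothetical, and the scalar #else branch is an assumption added only so the sketch also builds on a non-Arm host; it models the same ordering rather than replacing the intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the thread): split eight interleaved int32
 * values into two groups of four, in the element order vld2q_s32 produces. */
void deinterleave8_s32(const int32_t src[8], int32_t first[4], int32_t second[4])
{
    assert(src != NULL && first != NULL && second != NULL);
#if defined(__ARM_NEON)
    int32x4x2_t v = vld2q_s32(src);  /* val[0] = src[0,2,4,6], val[1] = src[1,3,5,7] */
    vst1q_s32(first,  v.val[0]);     /* linear store of the first 128-bit vector */
    vst1q_s32(second, v.val[1]);     /* linear store of the second 128-bit vector */
#else
    /* Assumption: scalar model of the same element order for non-Arm builds. */
    for (int i = 0; i < 4; ++i) {
        first[i]  = src[2 * i];
        second[i] = src[2 * i + 1];
    }
#endif
}
```

Calling it on the values 0 to 7 should leave the even-indexed inputs in first[] and the odd-indexed inputs in second[].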

hterrolle said:
How is it possible to load 4 times 4 values with float32_t A[4]? It should be float32_t A[16].

To load four 4 x fp32 vectors without de-interleaving you want the vld1q_f32_x4() intrinsic, which takes a pointer to a float array as input.

hterrolle said:
So, the main question is how to printf data from a register.

I tend to use vst1 to store back to a temporary array on the stack and then printf from a standard C data type.

e.g. github.com/.../astcenc_vecmathlib_common_4.h
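
A minimal sketch of that store-then-print approach, assuming a four-lane int32 vector. The helper names load_store_roundtrip_s32 and print_lanes_s32 are hypothetical, and the scalar #else branch is an assumption added only so the sketch compiles on a non-Arm host.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: pass four int32 values through a 128-bit vector and
 * store them back into an ordinary C array that printf can read. */
void load_store_roundtrip_s32(const int32_t src[4], int32_t dst[4])
{
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON)
    int32x4_t v = vld1q_s32(src);    /* linear load of four lanes */
    vst1q_s32(dst, v);               /* linear store: lane i goes to dst[i] */
#else
    /* Assumption: plain copy standing in for the vector round trip. */
    for (int i = 0; i < 4; ++i) {
        dst[i] = src[i];
    }
#endif
}

/* Hypothetical debug helper: print four lanes that were stored to memory. */
void print_lanes_s32(const char *name, const int32_t lanes[4])
{
    assert(name != NULL && lanes != NULL);
    printf("%s = { %d, %d, %d, %d }\n", name,
           (int)lanes[0], (int)lanes[1], (int)lanes[2], (int)lanes[3]);
}
```

The point of the temporary array is that it is real, allocated memory, so the values can be inspected with any normal C tooling without relying on compiler-specific vector subscripting.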

hterrolle said:
In fact, in all the documentation I read, the register names are different. In some documents they call a register VH[0], etc.

The register names depend on the instruction and data type.

"Vd" is the destination "V" register, and "Vn" and "Vm" are the input registers containing parameter "n" and "m".

The register postfix is a data type notation telling you what the register contains. It has the format ".<count><size>", where B = 8-bit, H = 16-bit, S = 32-bit, and D = 64-bit. For example, ".4S" indicates that the register contains 4x 32-bit values (of any 32-bit data type). The postfix will change depending on what the instruction is doing and how it's going to interpret the content of the register.
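
As an illustration of how intrinsic arguments map onto those operands, here is a sketch for vfmaq_laneq_f32, using the listing quoted in the question (a -> Vd.4S, b -> Vn.4S, v -> Vm.4S). The intrinsic computes a + b * v[lane], so the accumulator is the first argument and is also the destination, which is why the matrix example writes C0 = vfmaq_laneq_f32(C0, A0, B0, 0). The helper name fma_lane0_f32 is hypothetical, the lane is fixed at 0 because it must be a compile-time constant, and the scalar #else branch is an assumption for non-AArch64 builds.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: acc[i] = acc[i] + b[i] * v[0] for four floats. */
void fma_lane0_f32(float acc[4], const float b[4], const float v[4])
{
    assert(acc != NULL && b != NULL && v != NULL);
#if defined(__aarch64__) && defined(__ARM_NEON)
    float32x4_t va = vld1q_f32(acc);      /* a -> Vd.4S: accumulator and destination */
    float32x4_t vb = vld1q_f32(b);        /* b -> Vn.4S: multiplied element by element */
    float32x4_t vv = vld1q_f32(v);        /* v -> Vm.4S: only lane 0 is read below */
    va = vfmaq_laneq_f32(va, vb, vv, 0);  /* va[i] = va[i] + vb[i] * vv[0] */
    vst1q_f32(acc, va);                   /* write the result back to memory */
#else
    /* Assumption: scalar model of the same arithmetic for other targets. */
    for (int i = 0; i < 4; ++i) {
        acc[i] += b[i] * v[0];
    }
#endif
}
```

With acc = {1, 2, 3, 4}, b = {1, 2, 3, 4} and v = {10, 20, 30, 40}, only v[0] takes part, so each acc[i] grows by 10 * b[i].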

hterrolle said:
Let's wait for 256-bit lames ;))

It's already possible with the SVE instruction set, but it depends on the hardware implementation. Neoverse V1 is shipping with 256-bit SVE, but the Cortex mobile parts are all 128-bit at the moment.

hterrolle said:
And forgive my horrible English writing. ;))

It's a lot better than my French ;)

One correction for the future is that "lame" should be "lane".

Cheers,
Pete

0 hterrolle 2 months ago in reply to Peter Harris

hi,

Thanks for the answer. I tried to use the vst1 functions but I always get "no viable conversion" from the compiler.

I think it is normal because there is no int32x2x2_t in any vst1 function.

But using vst2_s32 I get a result, but it is the same as before. Here is the code:

    int A1[4] = {300,400,600,400};
    int* x_base = (A1);
    int32x2x2_t xvx2 = vld2_s32(x_base);
    //LOGE(" neon_multi xvx2 %3d %3d\n",xvx2.val[0],xvx2.val[1]);
    int32_t* xvx2_out;
    vst2_s32(xvx2_out,xvx2); // compile
    LOGE(" neon_multi xvx2_out %3d %3d\n",*(xvx2_out),*(xvx2_out+1));

output

neon_multi xvx2   3 -617811856
neon_multi xvx2_out   3 -617811840

And the problem is that if I print the first log, then it crashes when printing the second log. So I had to compile twice, once with the first log and once with the second. ;))

I do not understand why vld1q_s32 works well and can be printed, but not vld2_s32.

It could be the use of the struct for vld2_s32, because I tried the vst1_s32_x2(xvx2_out, xvx2) function and got the same output.

In fact I have tested all possible vst functions with int32x2x2_t and none of them returns the correct value.

So either I do not understand how it works, which is very possible, or there is something else, like my <arm_neon.h>.

NEON is not easy, so many functions. I thought it would be easier ;))

0 hterrolle 2 months ago in reply to hterrolle

Hi,

OK, I found the solution to display the values. ;)) Just after waking up is the best time to solve problems ;))

    int A1[4] = {300,400,600,500};
    int32x2x2_t xvx2 = vld2_s32(A1);
    LOGE(" neon_multi xvx2 %d %d %d %d\n",xvx2.val[0][0],xvx2.val[0][1],xvx2.val[1][0],xvx2.val[1][1]);

    neon_multi xvx2 300 600 400 500

So you are right, Peter. No need to know which register name is used. We just need to know how to reach them in C.

I will try some other functions later, after coffee. ;))
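
A note on portability: indexing a NEON vector with [] as above is a GCC/Clang extension. A sketch of the same read-out using the vget_lane_s32 intrinsic follows; the helper name unpack_vld2_lanes_s32 is hypothetical, and the scalar #else branch is an assumption added only so the sketch compiles on a non-Arm host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: load four int32 values with vld2_s32 and write the
 * lanes out in the order val[0][0], val[0][1], val[1][0], val[1][1]. */
void unpack_vld2_lanes_s32(const int32_t src[4], int32_t out[4])
{
    assert(src != NULL && out != NULL);
#if defined(__ARM_NEON)
    int32x2x2_t v = vld2_s32(src);        /* val[0] = {src[0], src[2]}, val[1] = {src[1], src[3]} */
    out[0] = vget_lane_s32(v.val[0], 0);  /* lane index must be a compile-time constant */
    out[1] = vget_lane_s32(v.val[0], 1);
    out[2] = vget_lane_s32(v.val[1], 0);
    out[3] = vget_lane_s32(v.val[1], 1);
#else
    /* Assumption: scalar model of the de-interleaved lane order. */
    out[0] = src[0];
    out[1] = src[2];
    out[2] = src[1];
    out[3] = src[3];
#endif
}
```

For the input {300, 400, 600, 500} this gives 300, 600, 400, 500, matching the log line above.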

0 hterrolle 2 months ago in reply to hterrolle

This time I tried to use

    int32_t* xvx2_out;
    vst2_s32(xvx2_out,xvx2);

https://developer.arm.com/architectures/instruction-sets/intrinsics/vst2_s32 said:

Store multiple 2-element structures from two registers. This instruction stores multiple 2-element structures from two SIMD&FP registers to memory, with interleaving. Every element of each register is stored.

When I use the log, the output is

(*xvx2_out)      = xvx2.val[0] (int32x2x2_t.val[0]) = 3
(*xvx2_out+1) = xvx2.val[1] (int32x2x2_t.val[1]) = -617811856

So I got two addresses of a 64-bit register. I tried a lot of combinations to extract the low and high 32 bits from this address, but with no result. Always the same response.

((int32x2_t *)(xvx2_out))[0][0] returns 3 and ((int32x2_t *)(xvx2_out))[0][1] = -617811856
((int32x2_t *)(xvx2_out))[0] returns 3 and ((int32x2_t *)(xvx2_out))    = -617811856
((int32x2_t *)(xvx2_out)) returns 3 and ((int32x2_t *)(xvx2_out))    = -617811856

LOGE(" neon_multi xvx2_out %d %d \n",(xvx2_out),(xvx2_out)); returns 3 and -617811856

LOGE(" neon_multi xvx2_out %d %d \n",(*xvx2_out) & 0xffffffff,(*xvx2_out >> 32) & 0xffffffff); returns 3 and -617811856

Every attempt returns 3 and -617811856.

The output looks like the addresses of 2 int32x2_t and not the first address of 4 int32_t.

I have no clue how to retrieve the values from the pointer, and I would be pleased to understand how to do it. I will learn something today. ;))

0 Peter Harris 2 months ago in reply to hterrolle

For this ...

int32_t* xvx2_out;
vst2_s32(xvx2_out, xvx2);

What memory is xvx2_out pointing at? It's an uninitialized pointer.

This is what I would do:

#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

int main(void)
{
    const int32_t input[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
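    int32_t output[8] = { 0 };

    // Linear load: data.val[0] = {0,1,2,3}, data.val[1] = {4,5,6,7}
    int32x4x2_t data = vld1q_s32_x2(input);

    // Linear store (vst2q_s32 would interleave the two vectors instead)
    vst1q_s32_x2(output, data);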

    printf("Linear store:\n");
    printf("  r0[0] = %d\n", output[0]);
    printf("  r0[1] = %d\n", output[1]);
    printf("  r0[2] = %d\n", output[2]);
    printf("  r0[3] = %d\n", output[3]);
    printf("\n");
    printf("  r1[0] = %d\n", output[4]);
    printf("  r1[1] = %d\n", output[5]);
    printf("  r1[2] = %d\n", output[6]);
    printf("  r1[3] = %d\n", output[7]);
}
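
For contrast with a linear store, here is a sketch of what an interleaving store writes. vst2q_s32 alternates elements from the two vectors, so it is the inverse of a vld2q_s32 load. The helper name interleave_store_s32 is hypothetical, and the scalar #else branch is an assumption added only so the sketch compiles on a non-Arm host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: load eight int32 values linearly into two vectors,
 * then write them back with the interleaving store vst2q_s32. */
void interleave_store_s32(const int32_t src[8], int32_t dst[8])
{
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON)
    int32x4x2_t v;
    v.val[0] = vld1q_s32(src);       /* first vector  = src[0..3] */
    v.val[1] = vld1q_s32(src + 4);   /* second vector = src[4..7] */
    vst2q_s32(dst, v);               /* dst = val[0][0], val[1][0], val[0][1], val[1][1], ... */
#else
    /* Assumption: scalar model of the interleaved write order. */
    for (int i = 0; i < 4; ++i) {
        dst[2 * i]     = src[i];
        dst[2 * i + 1] = src[i + 4];
    }
#endif
}
```

With the input 0 to 7, the linear store leaves the array as 0,1,2,3,4,5,6,7, while this interleaving store produces 0,4,1,5,2,6,3,7.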

0 hterrolle 2 months ago in reply to Peter Harris

hi,

By the way! Thanks for the example. I think I understand interleaving.