How to use NEON lane registers and related functions.

hterrolle 24 days ago

Hi,

As I understand it, ARM64 has 2048 bits of registers divided into 16 lanes of 128 bits (16 bytes each).

int data1[4] = {300,400,600,400};
int* data1_ptr = (data1);

When I use "int32x4_t lame1 = vld1q_s32(data1_ptr)", I load 4 integers into "lame1", and I can check the value of lame1 using printf(" %d %d %d %d \n", lame1[0], lame1[1], lame1[2], lame1[3]).

And I should be able to load at most 16 times 4 integers into the whole register file.

My first question is: how does NEON know which lane is fed by vld1q if I load data1 through data16?

Here is what I do not understand.

If I want to use "int32x2x2_t lame1and_2_low = vld2_s32(x_base);", I will load data1[0] and data1[1] into the low half of register lame1, and data1[2] and data1[3] into the low half of register lame2.

As I understand it, int32x2x2_t is:

struct int32x2x2_t {
int32x2_t val[2];
};

So I tried to use printf to see the data in the register using lame1and_2_low [0][1], but it does not work. The only things I can do are lame1and_2_low [0] and lame1and_2_low .val[0], but the data printed is 3 for lame1and_2_low [0] and -617811840 for lame1and_2_low .val[0]. So it is not the data I was expecting. :))

I am posting the question because I want to debug the examples I found, "RGB deinterleaving" and "matrix multiplication", which use "vld3q_u8" and "vfmaq_laneq_f32", to understand how they work. Both use lanes, and I do not understand how to print the data from a register. It is easy with vld1q but not with vld2_s32.

So the main question is how to printf data from a register. There must be a syntax that I do not know.

In OpenCL we have S0 to S16. But in NEON I did not find any information on how to printf data from a register by name, for data types like 32x2x2_t and the other lane data types.

========================================================================

By the way, in "matrix multiplication" I have seen something strange.

uint32_t n = 2*BLOCK_SIZE;
uint32_t k = 2*BLOCK_SIZE;
float32_t A[n*k] => float32_t A[4]
matrix_init_rand(A, n*k);

Then it uses

float32x4_t A0;
float32x4_t A1;
float32x4_t A2;
float32x4_t A3;

and

A0 = vld1q_f32(A);
A1 = vld1q_f32(A+4);
A2 = vld1q_f32(A+8);
A3 = vld1q_f32(A+12);

How is it possible to load 4 times 4 floats from float32_t A[4]? It should be float32_t A[16].

Another strange thing in "matrix multiplication" is the use of "C0 = vfmaq_laneq_f32(C0, A0, B0, 0)".

C0 is supposed to be the output of the computation, but the documentation at developer.arm.com/.../vfmaq_laneq_f32

says that "This instruction multiplies the vector elements in the first source SIMD&FP register by the specified value in the second source SIMD&FP register"

So it should be "C0 = vfmaq_laneq_f32(A0, B0, C0, 0)".

========================================================================

Something else that I do not understand in the documentation: there are the "intrinsics", which are the functions we use in C, and the "AArch64 Instruction", which I suppose is the assembler code?

And then there is the "Argument Preparation".

On https://arm-software.github.io we get these:

a -> Vd.8H
b -> Vn.8B
c -> Vm.8B

and on developer.arm.com/.../vfmaq_laneq_f32 we get:

a register: Vd.4S
b register: Vn.4S
v register: Vm.4S
lane minimum: 0; maximum: 3

In fact, in all the documentation I read, the register names are different. Some documents refer to registers as VH[0], etc.

The naming is very confusing.

========================================================================

This is quite a long post and I am asking for a lot of information, but I am sure the answers will help me a lot, and not only me.

PS: Computing 4 values per instruction is a really good advance. Let's wait for 256-bit lanes ;)) and computation inside the same lane ;))

PS: You are free to change the title of the question and split it into several parts if you think that is better. Or let me know and I will do it.

Have a good day. ;))

Regards and thanks in advance. And forgive my horrible English writing. ;))

0 hterrolle 23 days ago in reply to hterrolle

This time I tried to use

    int32_t* xvx2_out;
    vst2_s32(xvx2_out,xvx2);

https://developer.arm.com/architectures/instruction-sets/intrinsics/vst2_s32 says:

Store multiple 2-element structures from two registers. This instruction stores multiple 2-element structures from two SIMD&FP registers to memory, with interleaving. Every element of each register is stored.

When I log the output, I get:

(*xvx2_out)      = xvx2.val[0] (int32x2x2_t.val[0]) = 3
(*xvx2_out+1) = xvx2.val[1] (int32x2x2_t.val[1]) = -617811856

So I got two addresses of a 64-bit register. I tried a lot of combinations to extract the low and high 32 bits from this address, but with no result. It is always the same response.

((int32x2_t *)(xvx2_out))[0][0] return = 3 and ((int32x2_t *)(xvx2_out))[0][1] = -617811856
((int32x2_t *)(xvx2_out))[0] return = 3 and ((int32x2_t *)(xvx2_out))    = -617811856
((int32x2_t *)(xvx2_out)) return = 3 and ((int32x2_t *)(xvx2_out))    = -617811856

LOGE(" neon_multi xvx2_out %d %d \n",(xvx2_out),(xvx2_out)); returns 3 and -617811856

LOGE(" neon_multi xvx2_out %d %d \n",(*xvx2_out) & 0xffffffff,(*xvx2_out >> 32) & 0xffffffff); returns 3 and -617811856

Every attempt returns 3 and -617811856.

The output looks like the addresses of 2 int32x2_t and not the first address of 4 int32_t.

I have no clue how to retrieve the values from the pointer, and I would be pleased to understand how to do it. I will learn something today. ;))

0 Peter Harris 23 days ago in reply to hterrolle

For this ...

int32_t* xvx2_out;
vst2_s32(xvx2_out, xvx2);

What memory is xvx2_out pointing at? It's an uninitialized pointer.

This is what I would do:

#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

int main(void)
{
    const int32_t input[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    int32_t output[8] = { 0 };

    // Linear load: r0 = {0, 1, 2, 3}, r1 = {4, 5, 6, 7}
    int32x4x2_t data = vld1q_s32_x2(input);

    // Interleaving store: memory becomes r0[0], r1[0], r0[1], r1[1], ...
    vst2q_s32(output, data);

    printf("Interleaved store:\n");
    printf("  r0[0] = %d\n", output[0]);
    printf("  r0[1] = %d\n", output[2]);
    printf("  r0[2] = %d\n", output[4]);
    printf("  r0[3] = %d\n", output[6]);
    printf("\n");
    printf("  r1[0] = %d\n", output[1]);
    printf("  r1[1] = %d\n", output[3]);
    printf("  r1[2] = %d\n", output[5]);
    printf("  r1[3] = %d\n", output[7]);

    return 0;
}

0 hterrolle 17 days ago in reply to Peter Harris

Hi,

By the way! Thanks for the example. I think I understand interleaving.