hi,
As i anderstoud ARM64 got 2048 bits register divided in 16 lame of 128 bits (16 octets(bytes)).
int data1[4] = {300,400,600,400}; int* data1_ptr = (data1);
When i use "int32x4_t lame1 = vld1q_s32(data1_ptr )" i load 4 integer to "lame1" and i can check the value of lame1 using printf( " %d %d %d %d \n",lame1[0],lame1[1],lame1[2],lame1[3])
And i should be able to load 16 time 4 integer at max in the all register.
first question is how does NEON to know wich lame it is feed by vld1q if i laoad data1 to data16.
here is what i do not anderstand.
is i want to use "int32x2x2_t lame1and_2_low = vld2_s32(x_base);". i wiil load data1[0] and data1[1] to low register lame1 and data1[2] and data1[3] to low register lame2.
as i anderstoud int32x2x2_t is:
struct int32x2x2_t { int32x2_t [2];};
so i tried to use printf to see the data in register using lame1and_2_low [0][1] but it does not work. The only things i can do is lame1and_2_low [0] and lame1and_2_low .val[0]. but the data printed return 3 for lame1and_2_low [0] and -617811840 for lame1and_2_low [0]. so it is not the data value i was expected. :))
I post the question because i want to debug the exemple i found "RGB deinterleaving" and "matrix multiplication" where they use "vld3q_u8" and "vfmaq_laneq_f32" to anderstand how it is working. Both use lame and i do not anderstand how to print the data from register. It is easy with vld1q but not with vld2_s32.
So, the main question is how to printf data from register. They must be i syntaxe that i do not know.
In OpenCL we got S0 to S16. But in NEON i did not find any information on how printf data from register by name, for data type like 32x2x2_t and other lame data type.
========================================================================
By the way in "matrix multiplication" i have seen something strange.
uint32_t n = 2*BLOCK_SIZE; uint32_t k = 2*BLOCK_SIZE; float32_t A[n*k] => float32_t A[4] matrix_init_rand(A, n*k);
then it use
float32x4_t A0;float32x4_t A1;float32x4_t A2;float32x4_t A3;
and
A0 = vld1q_f32(A);A1 = vld1q_f32(A+4);A2 = vld1q_f32(A+8);A3 = vld1q_f32(A+12);
how it is possible to load 4 time 4 integer with float32_t A[4] it should be float32_t A[16]
another strange thing in "matrix multiplication" is the use of "C0 = vfmaq_laneq_f32(C0, A0, B0, 0)"
C0 is suposed to be the output computation but in the documentation "">developer.arm.com/.../vfmaq_laneq_f32"
they said that "This instruction multiplies the vector elements in the first source SIMD&FP register by the specified value in the second source SIMD&FP register"
So it should be "C0 = vfmaq_laneq_f32(A0, B0,C0, 0)"
Something else that i do not anderstand in the documentation. there is intrinsics that is the function we use in C and AArch64 Instruction wich is supose is the assembler code ?
and the "Argument Preparation"
on https://arm-software.github.io we got these
a -> Vd.8Hb -> Vn.8Bc -> Vm.8B
and on the developer.arm.com/.../vfmaq_laneq_f32 we got
a register: Vd.4S b register: Vn.4S v register: Vm.4S lane minimum: 0; maximum: 3
in fact in all the documentation i read, register name are different.On some document they call register by VH[0] extc..
It is very confusing naming.
It is quite long post and i am asking quite a lot of information. But i am sure that these will help me a lot but not me only.
PS: computing 4 data by instruction is really a very good advance. Let's wait for 256bit lame ;)) and computation inside the same lame ;))
PS: You are free to change the title of the question and split it in many part if you think so. Or let me know i will do it.
Have a good day. ;))
Regards and thanks in advance. And forgive my horrible english writting. ;))
thanks for the answer but i tried to use vst1 function but i always got "no viable conversion" from the compiler.
I think it is normal because there are no int32x2x2_t on any vst1 function.
But using vst2_s32 i got a result but it is the same like before. here is the code:
int A1[4] = {300,400,600,400}; int* x_base = (A1); int32x2x2_t xvx2 = vld2_s32(x_base); //LOGE(" neon_multi xvx2 %3d %3d\n",xvx2.val[0],xvx2.val[1]); int32_t* xvx2_out; vst2_s32(xvx2_out,xvx2); // compile LOGE(" neon_multi xvx2_out %3d %3d\n",*(xvx2_out),*(xvx2_out+1));
output
neon_multi xvx2 3 -617811856 neon_multi xvx2_out 3 -617811840
And the problem is that if i print the first log then it crash printing the second log. So i had to compile twice once with the first log and once with the second. ;))
I do not anderstand why vld1q_s32 work well and can be printed and not vld2_s32.
it could be the use of the struct for vld2_s32 because i tried vst1_s32_x2(xvx2_out,xvx2) function and the same output.
in fact i have tested all possible svt function with int32x2x2_t and none of them return the correct value.
So i do not anderstand how it work, and this is very possible. Or there is something else like my <arm_neon.h>.
not easy neon, so many function. I though it would be easier ;))