hi,
As i anderstoud ARM64 got 2048 bits register divided in 16 lame of 128 bits (16 octets(bytes)).
int data1[4] = {300,400,600,400}; int* data1_ptr = (data1);
When i use "int32x4_t lame1 = vld1q_s32(data1_ptr )" i load 4 integer to "lame1" and i can check the value of lame1 using printf( " %d %d %d %d \n",lame1[0],lame1[1],lame1[2],lame1[3])
And i should be able to load 16 time 4 integer at max in the all register.
first question is how does NEON to know wich lame it is feed by vld1q if i laoad data1 to data16.
here is what i do not anderstand.
is i want to use "int32x2x2_t lame1and_2_low = vld2_s32(x_base);". i wiil load data1[0] and data1[1] to low register lame1 and data1[2] and data1[3] to low register lame2.
as i anderstoud int32x2x2_t is:
struct int32x2x2_t { int32x2_t [2];};
so i tried to use printf to see the data in register using lame1and_2_low [0][1] but it does not work. The only things i can do is lame1and_2_low [0] and lame1and_2_low .val[0]. but the data printed return 3 for lame1and_2_low [0] and -617811840 for lame1and_2_low [0]. so it is not the data value i was expected. :))
I post the question because i want to debug the exemple i found "RGB deinterleaving" and "matrix multiplication" where they use "vld3q_u8" and "vfmaq_laneq_f32" to anderstand how it is working. Both use lame and i do not anderstand how to print the data from register. It is easy with vld1q but not with vld2_s32.
So, the main question is how to printf data from register. They must be i syntaxe that i do not know.
In OpenCL we got S0 to S16. But in NEON i did not find any information on how printf data from register by name, for data type like 32x2x2_t and other lame data type.
========================================================================
By the way in "matrix multiplication" i have seen something strange.
uint32_t n = 2*BLOCK_SIZE; uint32_t k = 2*BLOCK_SIZE; float32_t A[n*k] => float32_t A[4] matrix_init_rand(A, n*k);
then it use
float32x4_t A0;float32x4_t A1;float32x4_t A2;float32x4_t A3;
and
A0 = vld1q_f32(A);A1 = vld1q_f32(A+4);A2 = vld1q_f32(A+8);A3 = vld1q_f32(A+12);
how it is possible to load 4 time 4 integer with float32_t A[4] it should be float32_t A[16]
another strange thing in "matrix multiplication" is the use of "C0 = vfmaq_laneq_f32(C0, A0, B0, 0)"
C0 is suposed to be the output computation but in the documentation "">developer.arm.com/.../vfmaq_laneq_f32"
they said that "This instruction multiplies the vector elements in the first source SIMD&FP register by the specified value in the second source SIMD&FP register"
So it should be "C0 = vfmaq_laneq_f32(A0, B0,C0, 0)"
Something else that i do not anderstand in the documentation. there is intrinsics that is the function we use in C and AArch64 Instruction wich is supose is the assembler code ?
and the "Argument Preparation"
on https://arm-software.github.io we got these
a -> Vd.8Hb -> Vn.8Bc -> Vm.8B
and on the developer.arm.com/.../vfmaq_laneq_f32 we got
a register: Vd.4S b register: Vn.4S v register: Vm.4S lane minimum: 0; maximum: 3
in fact in all the documentation i read, register name are different.On some document they call register by VH[0] extc..
It is very confusing naming.
It is quite long post and i am asking quite a lot of information. But i am sure that these will help me a lot but not me only.
PS: computing 4 data by instruction is really a very good advance. Let's wait for 256bit lame ;)) and computation inside the same lame ;))
PS: You are free to change the title of the question and split it in many part if you think so. Or let me know i will do it.
Have a good day. ;))
Regards and thanks in advance. And forgive my horrible english writting. ;))
(Edited to add more to the answer - the forum bugged so I had to reply early and then reedit).
hterrolle said: ARM64 got 2048 bits register divided in 16 lame of 128 bits (
When you use intrinsics you don't need to worry about registers - the compiler handles that for you, just like it would for a variable using a standard C data type.
Also note that AArch64 NEON increases to 32 x 128-bit registers, so you get twice as much capacity vs AArch32 which only has 16.
hterrolle said:is i want to use "int32x2x2_t lame1and_2_low = vld2_s32(x_base);". i wiil load data1[0] and data1[1] to low register lame1 and data1[2] and data1[3] to low register lame2.
The vld2 instruction is a 64-bit de-interleaving load, so it will load indices 0,2 into the low 64-bits and indices 1,3 into the high 64-bits (note that this loads into a single 128 bit register).
If you want the 128-bit version which loads int32x4x2_t you need vld2q_s32, which will load 0,2,4,6 into the low 128 bits and 1,3,5,7 into the high 128 bits.
hterrolle said:how it is possible to load 4 time 4 integer with float32_t A[4] it should be float32_t A[16]
To load 4 4xfp32 vectors without deinterleaving you want the vld1_f32_x4() intrinsic which takes a pointer to a float array as input.
hterrolle said:PS: computing 4 data by instruction is really a very good advance. Let's wait for 256bit lame ;)) and computation inside the same lame ;))
I tend to use vst1 to store back to a temporary array on the stack and then printf from a standard C data type.
e.g. github.com/.../astcenc_vecmathlib_common_4.h
hterrolle said:in fact in all the documentation i read, register name are different.On some document they call register by VH[0] extc..
The register names depend on the instruction and data type.
"Vd" is the destination "V" register, and "Vn" and "Vm" are the input registers containing parameter "n" and "m".
The register postfix is a data type notification telling you what the register contains. It's for the format ".<count><size>", where B = 8-bit, H = 16-bit, and S = 32-bit. For example, ".4S" indicates that the register contains 4x 32-bit values (of any 32-bit data type). The postfix will change depending on what the instruction is doing and how its going to interpret the content of the register.
hterrolle said:Let's wait for 256bit lame ;))
It's already possible with the SVE instruction set, but it depends on the hardware implementation. Neoverse V1 is shipping with 256-bit SVE, but the Cortex mobile parts are all 128-bit at the moment.
hterrolle said:And forgive my horrible english writting. ;))
It's a lot better than my French ;)
One correction for the future is that "lame" should be "lane".
Cheers, Pete