Help converting neon 32-bit syntax to 64-bit

Hi,

I am trying to translate a function from Neon 32-bit syntax to 64-bit. Unfortunately, I had trouble understanding the documentation.

For instance, this line:

vld1.32         {q0}, [%[src1]]!

translates to this:

ld1             {v0.4s}, [%[src1]], #16

What I don't understand is what the "v0.4s" bit means. I suppose it somehow has to translate to the q0 register, but the logic eludes me. Can anyone help me out?

  • lefty.

    The naming difference stems from the fact that the register packing model is different between AArch32 and AArch64.

    In AArch32:

    The 128-bit register Q0 appears to be constructed from the concatenation of the two 64-bit registers D1 and D0, which in turn appear to be constructed from the concatenation of the four 32-bit registers S3, S2, S1 and S0.

    For example, writing to S3 updates the topmost 32 bits of Q0. Advanced-SIMD vector operations can operate on either 64 (Dn) or 128 (Qn) bits at a time.

    The instruction "VLD1.32 {Q0},[...]" indicates loading 32-bit quantities (from the ".32" part) into a 128-bit register (from the "Q0" part) (i.e. 4 x 32-bit values into a 128-bit register).
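    To make the AArch32 overlap concrete, here is a rough sketch in portable C. It is only a model under my own assumptions, not how any toolchain represents registers: the type a32_regs and the two helper functions are hypothetical names, and the model covers only the part of the register bank that has S-register names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the AArch32 SIMD/FP register bank. The 32-bit
   words stand in for S0..S31; the D and Q views are derived from them. */
typedef struct {
    uint32_t s[32];
} a32_regs;

/* Dn is the concatenation of S(2n+1) (upper half) and S(2n) (lower half). */
static uint64_t a32_read_d(const a32_regs *r, unsigned n)
{
    assert(n < 16);
    return ((uint64_t)r->s[2 * n + 1] << 32) | r->s[2 * n];
}

/* 32-bit lane i (0..3) of Qn is S(4n+i), so lane 3 of Q0 is S3. */
static uint32_t a32_read_q_lane32(const a32_regs *r, unsigned n, unsigned lane)
{
    assert(n < 8 && lane < 4);
    return r->s[4 * n + lane];
}
```

    In this model, storing a value into s[3] (S3) makes it visible both as the upper half of D1 and as the top lane of Q0, which is the packing described above.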

    In AArch64:

    The vector registers are referred to as Vn, but no longer have multiple directly named sub-registers packed into them: V0 contains only Q0, D0 and S0 (along with the narrower H0 and B0), V1 contains Q1, D1 and S1, and so on, with the smaller registers residing at the least-significant bits of their Vn container.

    The Advanced-SIMD vector operations can again operate on either 64 or 128 bits at a time; however, they typically write an entire vector register (clearing the topmost bits to zero if a 64-bit vector is produced).
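    The "write the entire register" behaviour can be pictured with a small C model. This is a sketch under my own assumptions: a64_vreg and a64_write_64 are hypothetical names, and the model only covers the typical case mentioned above, not instructions that deliberately update part of a register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit AArch64 Vn register as two halves. */
typedef struct {
    uint64_t lo;  /* bits 63:0, where the Dn view lives */
    uint64_t hi;  /* bits 127:64 */
} a64_vreg;

/* Model of an operation that produces a 64-bit vector (e.g. a .2S result):
   the low half takes the result and the high half is cleared to zero. */
static void a64_write_64(a64_vreg *v, uint64_t result)
{
    assert(v != NULL);
    v->lo = result;
    v->hi = 0;
}
```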

    The label appended to the Vn register indicates how it is being interpreted: the number gives how many lanes there are considered to be, and the B/H/S/D letter gives the width of each element. Consistent with AArch32, H, S and D indicate 16, 32 and 64 bits respectively, with B indicating an 8-bit element size.

    The instruction "LD1 {V0.4S},[...]" indicates loading four (from the .4 part) lots of 32-bits (from the "S" part) into V0 (i.e. 4 x 32-bit values into a 128-bit register, as per the AArch32 example).
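    The way the label is read (lane count first, then element width) can be sketched as a small decoder in C. The function name decode_arrangement is hypothetical, and I am assuming only the eight 64-bit and 128-bit arrangements discussed in this thread are of interest; anything else is rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode an arrangement such as "4S" into a lane
   count and an element width in bits. Returns the total vector width
   (64 or 128), or 0 if the string is not one of those arrangements. */
static unsigned decode_arrangement(const char *spec,
                                   unsigned *lanes, unsigned *elem_bits)
{
    assert(spec != NULL && lanes != NULL && elem_bits != NULL);

    /* Leading decimal digits give the number of lanes. */
    unsigned n = 0;
    const char *p = spec;
    while (*p >= '0' && *p <= '9') {
        n = n * 10 + (unsigned)(*p - '0');
        if (n > 16) {
            return 0;  /* more lanes than any 128-bit arrangement has */
        }
        p++;
    }

    /* The letter gives the element width. */
    unsigned bits;
    switch (*p) {
    case 'B': case 'b': bits = 8;  break;
    case 'H': case 'h': bits = 16; break;
    case 'S': case 's': bits = 32; break;
    case 'D': case 'd': bits = 64; break;
    default: return 0;
    }
    if (p[1] != '\0') {
        return 0;  /* trailing characters after the letter */
    }

    unsigned total = n * bits;
    if (total != 64 && total != 128) {
        return 0;
    }
    *lanes = n;
    *elem_bits = bits;
    return total;
}
```

    Decoding "4S" this way gives 4 lanes of 32 bits, 128 bits in total, which matches the Q0 register in the original VLD1.32 line.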

    Summary

    In general, the size and number of elements operated on can be mapped between AArch32 and AArch64 as follows:

    Shape (bits x lanes)       8b x 8     8b x 16    16b x 4    16b x 8    32b x 2    32b x 4    64b x 1    64b x 2
    AArch32                    .8 Dn      .8 Qn      .16 Dn     .16 Qn     .32 Dn     .32 Qn     .64 Dn     .64 Qn
    AArch64                    Vn.8B      Vn.16B     Vn.4H      Vn.8H      Vn.2S      Vn.4S      Vn.1D      Vn.2D
    arm_neon.h (for type int)  int8x8_t   int8x16_t  int16x4_t  int16x8_t  int32x2_t  int32x4_t  int64x1_t  int64x2_t

    As shown in the "arm_neon.h" row, if you are using the standard 'C' intrinsics for Advanced-SIMD, then the difference in naming and mapping should be largely transparent, enabling code to be compiled for either AArch64 or AArch32.
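    As an illustration of the intrinsics route, here is a minimal sketch that adds two groups of four 32-bit integers. The function name add4_s32 is made up for this example. The intrinsics vld1q_s32, vaddq_s32 and vst1q_s32 come from arm_neon.h; the scalar branch is my own addition so the snippet also builds on machines without Neon.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical example: dst[i] = a[i] + b[i] for four int32_t values. */
static void add4_s32(int32_t *dst, const int32_t *a, const int32_t *b)
{
    assert(dst != NULL && a != NULL && b != NULL);
#if defined(__ARM_NEON)
    /* int32x4_t is the "32b x 4" shape: .32 Qn on AArch32, Vn.4S on AArch64. */
    int32x4_t va = vld1q_s32(a);
    int32x4_t vb = vld1q_s32(b);
    vst1q_s32(dst, vaddq_s32(va, vb));
#else
    /* Scalar fallback with the same wrap-around behaviour per element. */
    for (size_t i = 0; i < 4; i++) {
        dst[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
    }
#endif
}
```

    The same source should build for either AArch32 or AArch64, with the compiler choosing the Qn or Vn.4S register form, so the naming difference never has to appear in the code.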

    hth

    Simon.
