Part 3: Matrix-matrix multiplication. Neon, SVE, and SME compared

Khalid Saadi
August 6, 2024
17 minute read time.
This blog post is the third part of a three-part blog series; part 1 and part 2 of the series are linked in the Useful resources section at the end of this post.

This blog post describes how to implement the same matrix-matrix multiplication algorithm using three different Arm technologies: Neon, SVE, and SME.

Presenting these three examples together highlights some key differences between the technologies and is intended to help developers who want to port code from Neon or SVE/SVE2 to SME/SME2.

Architectural evolution: Neon, SVE/SVE2, and SME/SME2

The Neon, SVE/SVE2, and SME/SME2 technologies were introduced into the Arm architecture as follows:

  • Armv7-A introduced the Advanced SIMD extension, providing Single Instruction Multiple Data (SIMD) operations for a range of integer and floating-point types. Neon is an implementation of the Advanced SIMD instructions, provided as an extension for some Cortex-A series processors. Neon provides fixed-width 128-bit registers, so each Neon instruction operates on a fixed number of data values, for example, four 32-bit data values.
  • SVE, introduced in Armv8-A in 2016, and SVE2, introduced in Armv9-A in 2021, provide variable-length registers. The register size is implementation-defined, anywhere from 128 bits to 2048 bits. Because the programmer does not know at compile time how large the registers are, code must be written to be vector-length agnostic: the number of data values that each instruction processes depends on the implementation.
  • SME and SME2, both introduced in Armv9-A in 2021, also provide variable length registers. SME introduces two key new architectural features: streaming SVE mode and ZA storage. Streaming SVE mode is a high-throughput matrix data processing mode, and ZA storage is a dedicated 2D array that facilitates common matrix operations. These features allow SME and SME2 to efficiently process matrix and vector-based workloads.

These SIMD architecture extensions provide instructions that accelerate a wide range of applications including:

  • Media and signal processing applications
  • High Performance Computing (HPC) applications
  • Machine Learning (ML) applications

The examples in this blog use intrinsics, which are functions provided by the compiler that correspond to specific Arm instructions. This enables the programmer to write their entire program in C code rather than assembly language.
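
For example, the following minimal sketch (the function name add_four is illustrative) adds two arrays of four floats using intrinsics. Each intrinsic corresponds to a single Neon instruction, but the compiler takes care of register allocation and instruction scheduling:

#include <arm_neon.h>

// A minimal sketch: add two 4-element float arrays with Neon intrinsics.
void add_four(const float32_t *a, const float32_t *b, float32_t *out) {
    float32x4_t va = vld1q_f32(a);       // load four 32-bit floats
    float32x4_t vb = vld1q_f32(b);       // load four 32-bit floats
    float32x4_t vc = vaddq_f32(va, vb);  // four additions in one instruction
    vst1q_f32(out, vc);                  // store four results
}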

Matrix-matrix multiplication

All 3 of the examples in this blog implement matrix-matrix multiplication. Matrix multiplication takes two input matrices and produces one result matrix by multiplying each element of a row in the first matrix by the corresponding element of a column from the second matrix and summing these products. The result matrix’s dimensions are determined by the number of rows in the first matrix and the number of columns in the second matrix. For example, a 3 x 2 matrix multiplied by a 2 x 3 matrix results in a 3 x 3 matrix.

[Figure: Matrix-matrix multiplication grid]

To multiply two matrices A and B, the number of columns in matrix A must be equal to the number of rows in matrix B. Multiplying matrices A and B results in matrix C.
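
As a scalar reference point for the three vectorized examples that follow, a plain C implementation of this algorithm, using the same column-major layout and the same A (n x k), B (k x m), C (n x m) dimensions as the examples below, might look like this (the function name is illustrative):

void matrix_multiply_scalar(const float *A, const float *B, float *C,
                            uint32_t n, uint32_t m, uint32_t k) {
    // Column-major layout: element (row, col) of an n-row matrix
    // is stored at index row + col*n.
    for (uint32_t col = 0; col < m; col++) {
        for (uint32_t row = 0; row < n; row++) {
            float sum = 0.0f;
            for (uint32_t i = 0; i < k; i++) {
                sum += A[row + i*n] * B[i + col*k];
            }
            C[row + col*n] = sum;
        }
    }
}

Each of the three implementations below computes the same result, but uses the SIMD registers to perform many of these multiply-accumulate operations at once.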

Neon

This example uses Neon intrinsics to perform matrix-matrix multiplication.

The code does the following:

  1. Two input matrices contain 32-bit floating-point data stored in column-major format.
  2. The code iterates over all the data in these matrices in blocks of 4x4.
  3. The vld1q_f32 intrinsics load four values from the input matrices into Neon registers.
  4. Each vfmaq_laneq_f32 intrinsic performs four multiply-accumulate operations, building up the result for the 4x4 block being processed.
  5. The vst1q_f32 intrinsics store the result matrix to memory.

Here is the example code using Neon intrinsics:

void matrix_multiply_neon(float32_t  *A, float32_t  *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
        /*
         * Multiply matrices A and B, store the result in C.
         * It is the user's responsibility to make sure the matrices are compatible.
         */

        int A_idx;
        int B_idx;
        int C_idx;

        // these are the columns of a 4x4 sub matrix of A
        float32x4_t A0;
        float32x4_t A1;
        float32x4_t A2;
        float32x4_t A3;

        // these are the columns of a 4x4 sub matrix of B
        float32x4_t B0;
        float32x4_t B1;
        float32x4_t B2;
        float32x4_t B3;

        // these are the columns of a 4x4 sub matrix of C
        float32x4_t C0;
        float32x4_t C1;
        float32x4_t C2;
        float32x4_t C3;

        for (int i_idx=0; i_idx<n; i_idx+=4) {
                for (int j_idx=0; j_idx<m; j_idx+=4) {
                        // Zero accumulators before matrix op
                        C0 = vmovq_n_f32(0);
                        C1 = vmovq_n_f32(0);
                        C2 = vmovq_n_f32(0);
                        C3 = vmovq_n_f32(0);
                        for (int k_idx=0; k_idx<k; k_idx+=4) {
                                // Compute base index to 4x4 block
                                A_idx = i_idx + n*k_idx;
                                B_idx = k*j_idx + k_idx;

                                // Load most current A values in row
                                A0 = vld1q_f32(A+A_idx);
                                A1 = vld1q_f32(A+A_idx+n);
                                A2 = vld1q_f32(A+A_idx+2*n);
                                A3 = vld1q_f32(A+A_idx+3*n);

                                // Multiply accumulate in 4x1 blocks, i.e. each column in C
                                B0 = vld1q_f32(B+B_idx);
                                C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
                                C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
                                C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
                                C0 = vfmaq_laneq_f32(C0, A3, B0, 3);

                                B1 = vld1q_f32(B+B_idx+k);
                                C1 = vfmaq_laneq_f32(C1, A0, B1, 0);
                                C1 = vfmaq_laneq_f32(C1, A1, B1, 1);
                                C1 = vfmaq_laneq_f32(C1, A2, B1, 2);
                                C1 = vfmaq_laneq_f32(C1, A3, B1, 3);

                                B2 = vld1q_f32(B+B_idx+2*k);
                                C2 = vfmaq_laneq_f32(C2, A0, B2, 0);
                                C2 = vfmaq_laneq_f32(C2, A1, B2, 1);
                                C2 = vfmaq_laneq_f32(C2, A2, B2, 2);
                                C2 = vfmaq_laneq_f32(C2, A3, B2, 3);

                                B3 = vld1q_f32(B+B_idx+3*k);
                                C3 = vfmaq_laneq_f32(C3, A0, B3, 0);
                                C3 = vfmaq_laneq_f32(C3, A1, B3, 1);
                                C3 = vfmaq_laneq_f32(C3, A2, B3, 2);
                                C3 = vfmaq_laneq_f32(C3, A3, B3, 3);
                        }
                        // Compute base index for stores
                        C_idx = n*j_idx + i_idx;
                        vst1q_f32(C+C_idx, C0);
                        vst1q_f32(C+C_idx+n, C1);
                        vst1q_f32(C+C_idx+2*n, C2);
                        vst1q_f32(C+C_idx+3*n, C3);
                }
        }
}

This example uses the following Neon code features:

  • float32x4_t: A vector data type containing four 32-bit floating-point values.
  • vld1q_f32: A Neon intrinsic that loads four 32-bit floats from consecutive memory addresses into a float32x4_t.
  • vfmaq_laneq_f32: A Neon intrinsic that performs a fused multiply-accumulate operation. It multiplies each element of a float32x4_t by a single selected element (lane) of another float32x4_t, then adds the result to a third float32x4_t before returning the result.
  • vst1q_f32: A Neon intrinsic that stores the four values in a float32x4_t to consecutive memory addresses.
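
To make the lane semantics concrete, this scalar model (a hypothetical helper, not part of the Neon API) shows what a single call such as C0 = vfmaq_laneq_f32(C0, A0, B0, 1) computes:

// Scalar model of c = vfmaq_laneq_f32(c, a, b, lane):
// one selected lane of b scales every lane of a.
void fma_laneq_model(float c[4], const float a[4], const float b[4], int lane) {
    for (int i = 0; i < 4; i++) {
        c[i] += a[i] * b[lane];
    }
}

Four such calls, one per lane of a column of B, accumulate a full column of the 4x4 result block.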

This example uses a fixed block size of 4x4, which means that both dimensions of the input matrices must be multiples of four. You can handle other matrix sizes by padding the matrices with zeroes, as the sketch below shows.
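
A minimal zero-padding sketch, assuming the caller has already allocated the padded buffer and rounded the dimensions up to multiples of four (the helper name and signature are illustrative):

#include <stdint.h>
#include <string.h>

// Copy an n x m column-major matrix into a zero-filled padded_n x padded_m
// buffer, where padded_n and padded_m are n and m rounded up to multiples
// of four. The extra rows and columns remain zero and do not affect the result.
void pad_matrix_f32(const float *src, float *dst,
                    uint32_t n, uint32_t m,
                    uint32_t padded_n, uint32_t padded_m) {
    memset(dst, 0, sizeof(float) * padded_n * padded_m);
    for (uint32_t col = 0; col < m; col++) {
        memcpy(dst + col * padded_n, src + col * n, sizeof(float) * n);
    }
}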

For a more detailed explanation of this example, see the Learn the architecture - Optimizing C code with Neon intrinsics guide listed in the Useful resources section below.

SVE/SVE2

This example uses SVE2 intrinsics to perform matrix-matrix multiplication.

The main difference between the Neon example and this SVE2 example is that SVE2 uses variable-length vectors. The Neon example uses a fixed block size of 4x4 to match the four 32-bit values that fit in a Neon register, but the size of the SVE2 registers is not known until runtime, so the code must be vector-length agnostic. The example uses predication to control the number of data values that the SVE2 intrinsics operate on, so the data fits the SVE2 registers whatever size has been implemented. Where the Neon example uses the float32x4_t data type, in which the 4 shows that each Neon register holds four 32-bit values, the SVE2 example uses the svfloat32_t data type, which leaves the number of 32-bit values per register to be determined at runtime.
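
The predication pattern is easier to see in isolation. Here is a minimal vector-length-agnostic sketch, assuming a compiler that provides arm_sve.h (the function name is illustrative), which scales n floats in place:

#include <arm_sve.h>

// Each iteration processes svcntw() elements. In the final iteration,
// the svwhilelt predicate disables the excess lanes, so the loop never
// reads or writes past the end of the array, whatever the vector length.
void scale_f32(float32_t *data, uint32_t n, float32_t factor) {
    for (uint32_t i = 0; i < n; i += svcntw()) {
        svbool_t pred = svwhilelt_b32_u32(i, n);
        svfloat32_t v = svld1_f32(pred, data + i);
        v = svmul_n_f32_m(pred, v, factor);
        svst1_f32(pred, data + i, v);
    }
}

The matrix-multiplication example below applies the same pattern: svcntw controls the outer loop stride, and a svwhilelt predicate guards every load and store of A and C.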

The code does the following:

  1. Two input matrices contain 32-bit floating-point data stored in column-major format.
  2. The code iterates over the data in these matrices in blocks of one vector length by four. It uses the svcntw intrinsic, which returns the number of 32-bit elements in a vector, so that the number of rows processed in each iteration of the outer loop matches the size of the SVE2 registers instead of being hardcoded. The svwhilelt intrinsic creates a predicate to ensure that the bounds of the matrices are not exceeded.
  3. The four svld1 intrinsics load matrix data into the SVE2 registers using the predicate created previously.
  4. The svmla intrinsics perform multiply-accumulate operations, calculating the result for the current iteration.
  5. The svst1 intrinsics store the result matrix to memory.

Here is the example code using SVE2 intrinsics:

void matrix_multiply_sve(const float32_t *A, const float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    /*
     * Multiply matrices A and B, store the result in C.
     * It is the user's responsibility to make sure the matrices are compatible.
     */

    int a_idx;
    int b_idx;
    int c_idx;

    // these are the columns of a nx4 sub matrix of A
    svfloat32_t A0;
    svfloat32_t A1;
    svfloat32_t A2;
    svfloat32_t A3;

    // these are the columns of a 4x4 sub matrix of B
    svfloat32_t B0;
    svfloat32_t B1;
    svfloat32_t B2;
    svfloat32_t B3;

    // these are the columns of a nx4 sub matrix of C
    svfloat32_t C0;
    svfloat32_t C1;
    svfloat32_t C2;
    svfloat32_t C3;

    for (int i_idx=0; i_idx<n; i_idx+=svcntw()) {
        // calculate predicate for this i_idx
        svbool_t pred = svwhilelt_b32_u32(i_idx, n);

        for (int j_idx=0; j_idx<m; j_idx+=4) {
            // zero accumulators before matrix op
            C0 = svdup_n_f32(0);
            C1 = svdup_n_f32(0);
            C2 = svdup_n_f32(0);
            C3 = svdup_n_f32(0);
            for (int k_idx=0; k_idx<k; k_idx+=4){
                // compute base index to 4x4 block
                a_idx = i_idx + n*k_idx;
                b_idx = k*j_idx + k_idx;

                // load most current a values in row
                A0 = svld1_f32(pred, A+a_idx);
                A1 = svld1_f32(pred, A+a_idx+n);
                A2 = svld1_f32(pred, A+a_idx+2*n);
                A3 = svld1_f32(pred, A+a_idx+3*n);

                // multiply accumulate 4x1 blocks, that is each column C
                B0 = svld1rq_f32(svptrue_b32(), B+b_idx);
                C0 = svmla_lane_f32(C0,A0,B0,0);
                C0 = svmla_lane_f32(C0,A1,B0,1);
                C0 = svmla_lane_f32(C0,A2,B0,2);
                C0 = svmla_lane_f32(C0,A3,B0,3);

                B1 = svld1rq_f32(svptrue_b32(), B+b_idx+k);
                C1 = svmla_lane_f32(C1,A0,B1,0);
                C1 = svmla_lane_f32(C1,A1,B1,1);
                C1 = svmla_lane_f32(C1,A2,B1,2);
                C1 = svmla_lane_f32(C1,A3,B1,3);

                B2 = svld1rq_f32(svptrue_b32(), B+b_idx+2*k);
                C2 = svmla_lane_f32(C2,A0,B2,0);
                C2 = svmla_lane_f32(C2,A1,B2,1);
                C2 = svmla_lane_f32(C2,A2,B2,2);
                C2 = svmla_lane_f32(C2,A3,B2,3);

                B3 = svld1rq_f32(svptrue_b32(), B+b_idx+3*k);
                C3 = svmla_lane_f32(C3,A0,B3,0);
                C3 = svmla_lane_f32(C3,A1,B3,1);
                C3 = svmla_lane_f32(C3,A2,B3,2);
                C3 = svmla_lane_f32(C3,A3,B3,3);
            }
            // compute base index for stores
            c_idx = n*j_idx + i_idx;
            svst1_f32(pred, C+c_idx, C0);
            svst1_f32(pred, C+c_idx+n,C1);
            svst1_f32(pred, C+c_idx+2*n,C2);
            svst1_f32(pred, C+c_idx+3*n,C3);
        }
    }
}

This example uses the following SVE2 code features:

  • svfloat32_t: A vector data type containing 32-bit floating-point values, where the exact number of values is defined at runtime by the SVE vector length.
  • svwhilelt_b32_u32: An SVE2 intrinsic that computes a predicate from two uint32_t integers: a starting value and a maximum value.
  • svld1_f32: An SVE2 intrinsic that loads 32-bit floating-point values into an SVE2 register.
  • svptrue_b32: An SVE2 intrinsic that sets a predicate for 32-bit values to all-true.
  • svld1rq_f32: An SVE2 intrinsic that loads an SVE2 register with copies of the same 128 bits (four 32-bit values).
  • svmla_lane_f32: An SVE2 intrinsic that performs a fused multiply-accumulate. It multiplies each 128-bit segment of an svfloat32_t by a selected element of the corresponding 128-bit segment of another svfloat32_t, then adds the result to a third svfloat32_t before returning the result.
  • svst1_f32: An SVE2 intrinsic that stores the values in an svfloat32_t to consecutive memory addresses.
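
The 128-bit segment behavior of svld1rq_f32 and svmla_lane_f32 is the subtle part. This scalar model (a hypothetical helper, with vl the number of 32-bit elements per vector) shows why the combination behaves like vfmaq_laneq_f32 in the Neon example: svld1rq_f32 replicates the same four values into every 128-bit segment, and svmla_lane_f32 selects the same lane within each segment, so every element of the accumulator is scaled by the same B value:

// Scalar model of c = svmla_lane_f32(c, a, b, lane), where b was loaded
// with svld1rq_f32 so that every 128-bit segment holds the same four values.
void mla_lane_model(float *c, const float *a, const float *b_seg4,
                    int lane, int vl) {
    for (int i = 0; i < vl; i++) {
        // lane selects within each 4-element (128-bit) segment; because
        // the segments are copies, this is b_seg4[lane] for every i.
        c[i] += a[i] * b_seg4[lane];
    }
}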

For a more detailed explanation of this example, see the SVE guides listed in the Useful resources section below.

SME/SME2

This example uses SME2 assembly instructions to perform matrix-matrix multiplication.

The differences between this SME2 example and the other examples are as follows:

  • This SME2 example uses assembly code rather than the intrinsics used by the other examples.
  • SME2 provides ZA storage, a two-dimensional data array specifically designed for matrix operations. Sub-arrays within this ZA storage can be accessed as tiles, and elements within the tiles can be accessed either vertically or horizontally. This provides a very flexible mechanism for manipulating matrix data.
  • SME2 provides new instructions to perform matrix arithmetic. For example, the fmopa instruction calculates outer products.
  • SME2 provides a multi-vector 2D predication mechanism to ensure that matrix bounds are not exceeded.

Streaming SVE mode, entered using the smstart instruction, enables the SME2 instructions as well as the ZA storage.

This example takes input matrices matLeft and matRight. The example uses the fact that multiplying two matrices together is the same as summing the outer products for each column of matLeft and each row of matRight in turn.  

The initial input matrices are stored in memory as row-major arrays. Matrix multiplication is performed as the sum of the outer products of one column from matLeft and one row from matRight. Because the outer product requires column elements from matLeft, the code rearranges the matLeft data so that the column elements are stored contiguously in memory. This data rearrangement is not shown here for brevity, but can be seen in the SME programmers guide.
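
The outer-product formulation is worth seeing in scalar form. In this sketch (illustrative, with row-major M x K matLeft and K x N matRight), each pass of the outer loop adds the outer product of one column of matLeft and one row of matRight into the result; after K passes, matResult holds the complete product:

void matmul_outer_product(const float *matLeft, const float *matRight,
                          float *matResult, uint32_t M, uint32_t K, uint32_t N) {
    for (uint32_t idx = 0; idx < M * N; idx++) matResult[idx] = 0.0f;
    for (uint32_t i = 0; i < K; i++) {
        // Rank-1 update: column i of matLeft times row i of matRight.
        // The fmopa instruction performs one such update on a whole ZA tile.
        for (uint32_t r = 0; r < M; r++) {
            for (uint32_t c = 0; c < N; c++) {
                matResult[r*N + c] += matLeft[r*K + i] * matRight[i*N + c];
            }
        }
    }
}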

This example contains three nested loops:

  • The outermost loop iterates over the rows of the result matrix.
  • The middle loop iterates over the columns of the result matrix.
  • The innermost loop iterates over the K dimension, producing result matrix elements as a sum of products.

Matrix data is loaded to ZA storage from memory using ld1w instructions.

The outer product calculation uses the fmopa instruction. Each fmopa instruction reads two SVE Z input vectors and updates an entire SME ZA tile with the result.

2D predication ensures that the bounds of the matrices are not exceeded.

Finally, the st1w instructions write the result from ZA storage to memory.

This is only a very brief overview of the example. For more details about how the code operates, line-by-line, see the SME programmers guide.

matmul_opt:
    // matmul_opt(M, K, N, matLeft, matRight, matResult_opt);
    // x0 : M
    // x1 : K, lda
    // x2 : N, ldc, ldb
    // x3 : &matLeft
    // x4 : &matRight
    // x5 : &matResult_opt
    // x6 : SVLs-2
    // x7 : a_ptr pointer
    // x8 : a_ptr end address
    // x9 : c_base pointer
    // x10: c_ptr0 pointer
    // x11: Exit condition for N loop
    // x12: M loop counter
    // x13: Store loop counter
    // x14: Predicate select index
    // x15: Exit condition for K loop
    // x16: b_base pointer
    // x17: b_ptr pointer
    // x18: (SVLs+1)*ldc
    // x19: ldb + SVLs
    // x20: SVLs*lda + SVLs
    // x21: c_ptr1 pointer
    // x22: SVLs*lda
    // x23: SVLs*ldc

// Assumptions:
// nbr in matLeft (M): any
// nbc in matLeft, nbr in matRight (K): any K > 2
// nbc in matRight (N): any
//
// Left matrix is pre-arranged.
//
// 32-bit accumulator mapping with 2x2 tiles processing

    stp     x19, x20, [sp, #-48]!
    stp     x21, x22, [sp, #16]
    stp     x23, x24, [sp, #32]

    smstart

// constants
    cntw    x6                      // SVLs
    mul     x22, x6, x1             // SVLs*lda
    mul     x23, x6, x2             // SVLs*ldc
    add     x18, x23, x2            // SVLs*ldc + ldc
    add     x11, x4, x2, lsl #2     // Exit condition for N loop
    mov     x12, #0
    cntb    x6                      // SVLb
    mov     x14, #0
    ptrue   pn10.b                  // Predicate as counter for SME2 VLx2 (a_ptr loads)
    whilelt pn8.s, x12, x0, vlx2    // tiles predicate (M dimension)
    sub     w6, w6, #8              // SVLb-8

.Loop_M:
    // Extracting tile 0/1 and tile 2/3 predicates (M dimension) from vlx2 predicate.
    pext    { p2.s, p3.s }, pn8[0]
    mov     x16, x4                 // b_base
    mov     x9, x5                  // c_base

    whilelt pn9.b, x16, x11, vlx2   // tiles predicate (N dimension)

.Loop_N:
    mov     x7, x3                  // a_ptr = a_base
    mov     x17, x16                // b_ptr = b_base
    mov     x10, x9                 // c_ptr0 = c_base

    // Extracting tile 0/2 and tile 1/3 predicates (N dimension) from vlx2 predicate.
    pext    { p0.b, p1.b }, pn9[0]
    add     x8, x3, x22, lsl #2     // a_base + SVLs*lda FP32 elms [Bytes]
    addvl   x15, x8, #-1            // Exit condition for K loop
    ld1w    {z1.s},  p2/z,   [x7]   // Load 1st vector from a_ptr

    zero    {za}
    ld1w    {z2.s-z3.s},  pn9/z,   [x17]  // Load 2 vectors from b_ptr
    fmopa   za0.s,  p2/m,   p0/m,   z1.s,   z2.s  // ZA0 += 1st a_ptr vector OP 1st b_ptr vector
    ld1w    {z5.s},  p3/z,   [x7, x22, lsl #2]    // Load 2nd vector from a_ptr
    addvl   x7, x7, #1                            // a_ptr += SVLb [Bytes]

.Loop_K:
    fmopa   za2.s,  p3/m,   p0/m,   z5.s,   z2.s       // ZA2 += 2nd a_ptr vector OP 1st b_ptr vector
    ld1w    {z3.s},  p1/z,   [x17, #1, MUL VL]         // Load 2nd vector from b_ptr
    fmopa   za1.s,  p2/m,   p1/m,   z1.s,   z3.s       // ZA1 += 1st a_ptr vector OP 2nd b_ptr vector
    ld1w    {z0.s-z1.s},  pn10/z,   [x7]               // Load next 2 vectors from a_ptr
    fmopa   za3.s,  p3/m,   p1/m,   z5.s,   z3.s       // ZA3 += 2nd a_ptr vector OP 2nd b_ptr vector
    ld1w    {z6.s-z7.s},  pn9/z,   [x17, x2, lsl #2]   // Load next 2 vectors from b_ptr
    fmopa   za0.s,  p2/m,   p0/m,   z0.s,   z6.s       // ZA0 += 1st a_ptr vector OP 1st b_ptr vector
    psel    pn11, pn10, p3.s[w14, 0]                   // Select predicate-as-counter
    ld1w    {z4.s-z5.s},  pn11/z,   [x7, x22, lsl #2]  // Load next 2 vectors from a_ptr
    fmopa   za2.s,  p3/m,   p0/m,   z4.s,   z6.s       // ZA2 += 2nd a_ptr vector OP 1st b_ptr vector
    add     x17, x17, x2, lsl #3                       // b_ptr += 2*ldb FP32 elms [Bytes]

    fmopa   za1.s,  p2/m,   p1/m,   z0.s,   z7.s       // ZA1 += 1st a_ptr vector OP 2nd b_ptr vector

    fmopa   za3.s,  p3/m,   p1/m,   z4.s,   z7.s       // ZA3 += 2nd a_ptr vector OP 2nd b_ptr vector
    ld1w    {z2.s-z3.s},  pn9/z,   [x17]               // Load next 2 vectors from b_ptr

    fmopa   za0.s,  p2/m,   p0/m,   z1.s,   z2.s       // ZA0 += 1st a_ptr vector OP 1st b_ptr vector
    addvl   x7, x7, #2                                 // a_ptr += 2*SVLb [Bytes]

    cmp     x7, x15
    b.mi    .Loop_K

    fmopa   za2.s,  p3/m,   p0/m,   z5.s,   z2.s       // ZA2 += 2nd a_ptr vector OP 1st b_ptr vector

    fmopa   za1.s,  p2/m,   p1/m,   z1.s,   z3.s       // ZA1 += 1st a_ptr vector OP 2nd b_ptr vector

    fmopa   za3.s,  p3/m,   p1/m,   z5.s,   z3.s       // ZA3 += 2nd a_ptr vector OP 2nd b_ptr vector
    add     x17, x17, x2, lsl #2                       // b_ptr += 2*ldb FP32 elms [Bytes]

    cmp     x7, x8
    b.pl    .Ktail_end

.Ktail_start:
    ld1w    {z1.s},  p2/z,   [x7]
    ld1w    {z2.s-z3.s},  pn9/z,   [x17]

    fmopa   za0.s,  p2/m,   p0/m,   z1.s,   z2.s
    ld1w    {z5.s},  p3/z,   [x7, x22, lsl #2]

    fmopa   za2.s,  p3/m,   p0/m,   z5.s,   z2.s

    fmopa   za1.s,  p2/m,   p1/m,   z1.s,   z3.s

    fmopa   za3.s,  p3/m,   p1/m,   z5.s,   z3.s

.Ktail_end:
    mov     w13, #0
    psel    pn11, pn9, p2.b[w13, 0]
    psel    pn12, pn9, p3.b[w13, 0]
    // Move from ZA tiles to vectors: z0 = za0h[1], z1 = za1h[1], z2 = za2h[1], z3 = za3h[1]
    mova    { z0.b-z3.b }, za0h.b[w13, 0:3]
    st1w    { z0.s-z1.s }, pn11, [x10]                  // Store to c_ptr0
    st1w    { z2.s-z3.s }, pn12, [x10, x23, lsl #2]     // Store to c_ptr0 + SVLs*ldc
.Loop_store_ZA:
    psel    pn11, pn9, p2.b[w13, 4]
    psel    pn12, pn9, p3.b[w13, 4]
    mova    { z0.b-z3.b }, za0h.b[w13, 4:7]
    st1w    { z0.s-z1.s }, pn11, [x10, x2,  lsl #2]      // Store to c_ptr0 + ldc
    st1w    { z2.s-z3.s }, pn12, [x10, x18,  lsl #2]     // Store to c_ptr0 + (SVLs+1)*ldc

    add     x10, x10, x2, lsl #3    // c_ptr0 += 2*ldc FP32 elms [Bytes]
    add     w13, w13, #8

    psel    pn11, pn9, p2.b[w13, 0]
    psel    pn12, pn9, p3.b[w13, 0]
    mova    { z0.b-z3.b }, za0h.b[w13, 0:3]
    st1w    { z0.s-z1.s }, pn11, [x10]                  // Store to c_ptr0
    st1w    { z2.s-z3.s }, pn12, [x10, x23, lsl #2]     // Store to c_ptr0 + SVLs*ldc
    cmp     w13, w6
    b.mi    .Loop_store_ZA

    psel    pn11, pn9, p2.b[w13, 4]
    psel    pn12, pn9, p3.b[w13, 4]
    mova    { z0.b-z3.b }, za0h.b[w13, 4:7]
    st1w    { z0.s-z1.s }, pn11, [x10, x2,  lsl #2]      // Store to c_ptr0 + ldc
    st1w    { z2.s-z3.s }, pn12, [x10, x18,  lsl #2]     // Store to c_ptr0 + (SVLs+1)*ldc

    addvl   x9, x9, #2
    addvl   x16, x16, #2            // b_base += 2*SVLb [Bytes]
    whilelt pn9.b, x16, x11, vlx2   // tile predicate (N dimension)
    b.first .Loop_N

    add     x3, x3, x22, lsl #3     // a_base += 2*SVLs*lda FP32 elms [Bytes]
    add     x5, x5, x23, lsl #3     // c_base += 2*SVLs*ldc FP32 elms [Bytes]
    incw    x12, all, mul #2        // M loop counter += 2* SVLs
    whilelt pn8.s, x12, x0, vlx2    // tiles predicate (M dimension)
    b.first    .Loop_M

    smstop

    ldp     x23, x24, [sp, #32]
    ldp     x21, x22, [sp, #16]
    ldp     x19, x20, [sp], #48

    ret

    .size   matmul_opt, .-matmul_opt
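
For reference, a C prototype matching the call shown in the comment at the top of the routine might look like the following. This is a sketch: the exact parameter types are an assumption based on the register comments above.

#include <stdint.h>

// Hypothetical prototype for calling the assembly routine from C, matching
// matmul_opt(M, K, N, matLeft, matRight, matResult_opt) in the comments.
extern void matmul_opt(uint64_t M, uint64_t K, uint64_t N,
                       const float *matLeft, const float *matRight,
                       float *matResult);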

Useful resources

Neon

  • Neon product page
  • Learn the architecture - Neon programmers guide
  • Learn the architecture - Optimizing C code with Neon intrinsics

SVE/SVE2

  • SVE product page
  • Introduction to SVE
  • SVE optimization guide
  • Learn the architecture - Migrate Neon to SVE
  • Learn the architecture - Introducing SVE2 guide

SME/SME2

  • SME programmers guide
  • Arm A-profile A64 Instruction Set, SME instructions
  • Introducing the Scalable Matrix Extension for the Armv9-A Architecture
  • Part 1: Arm Scalable Matrix Extension (SME) Introduction
  • Part 2: Arm Scalable Matrix Extension (SME) Instructions