Part 2: Arm Scalable Matrix Extension (SME) Instructions

Zenon Xiu (修志龙)
June 24, 2024
10 minute read time.
This blog post is the second half of a two-part blog series. Read part 1 of the blog post here.


Part 2 of this two-part blog introduces some of the instructions that SME provides.

SME instructions that interact with the SME ZA storage include the following:

  • Instructions that accumulate or subtract the outer product of two vectors into a ZA tile
  • Load, store, and move instructions that transfer a vector to or from a ZA tile row or column
  • Instructions that add a vector horizontally or vertically to a ZA tile
  • Instructions that add a multiple of the vector size in Streaming SVE mode to a scalar register

Outer product and accumulate or subtract instructions

To help understand the outer product and accumulate instructions, let us consider how a matrix multiplication can be performed using the outer product operation.

 

Calculating the outer product of two vectors a and b gives a result matrix C containing the outer product:

Now consider multiplying two matrices, a and b:

This multiplication can be calculated using two outer product operations and accumulating both result matrices: 
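As a rough illustration of the idea above, the following Python sketch multiplies two small matrices by accumulating outer products, one per column of the first matrix and matching row of the second. The function names and the 2 x 2 values are made up for this example and do not come from the original figures.

```python
def outer(a, b):
    """Outer product of column vector a and row vector b, as a list of rows."""
    return [[x * y for y in b] for x in a]

def matmul_by_outer_products(A, B):
    """Multiply A (m x k) by B (k x n) by accumulating k outer products."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for p in range(k):
        col = [A[i][p] for i in range(m)]   # p-th column of A
        row = B[p]                          # p-th row of B
        op = outer(col, row)
        for i in range(m):
            for j in range(n):
                C[i][j] += op[i][j]         # accumulate into the result
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_by_outer_products(A, B)
print(C)
```

For 2 x 2 inputs the loop runs twice, which corresponds to the two outer product operations mentioned above.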

SME introduces efficient outer product and accumulate or subtract instructions for the following data types:

  • 8-bit and 16-bit integer
  • FP16, BF16, FP32, and FP64 floating point

These instructions calculate the outer product of two vectors in two Z vector registers (Zn and Zm), accumulate or subtract the result array with existing data in a ZA tile (ZAda), and save the result to the same ZA tile. Each source vector is independently predicated by a corresponding governing predicate register (Pn and Pm).

| Output array | Input vectors | Description | Example |
|---|---|---|---|
| INT32 | INT8, INT8 | Sum of four INT8 outer products into each INT32 element | SMOPA, SMOPS, UMOPA, or UMOPS: signed or unsigned integer sum of outer products and accumulate or subtract. For example: UMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.B, <Zm>.B |
| INT32 | INT16, INT16 | Sum of two INT16 outer products into each INT32 element | SMOPA, SMOPS, UMOPA, or UMOPS: signed or unsigned integer sum of outer products and accumulate or subtract. For example: UMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.H, <Zm>.H |
| INT64 | INT16, INT16 | Sum of four INT16 outer products into each INT64 element, if FEAT_SME_I16I64 is implemented | SMOPA, SMOPS, UMOPA, or UMOPS: signed or unsigned integer sum of outer products and accumulate or subtract. For example: UMOPS <ZAda>.D, <Pn>/M, <Pm>/M, <Zn>.H, <Zm>.H |
| FP32 | BF16, BF16 | Sum of two BF16 outer products into each FP32 element | BFMOPA or BFMOPS: BFloat16 sum of outer products and accumulate or subtract. For example: BFMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.H, <Zm>.H |
| FP32 | FP16, FP16 | Sum of two FP16 outer products into each FP32 element | FMOPA or FMOPS: half-precision floating-point sum of outer products and accumulate or subtract. For example: FMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.H, <Zm>.H |
| FP32 | FP32, FP32 | Simple FP32 outer product | FMOPA or FMOPS: floating-point outer product and accumulate or subtract. For example: FMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.S, <Zm>.S |
| FP64 | FP64, FP64 | Simple FP64 outer product, if FEAT_SME_F64F64 is implemented | FMOPA or FMOPS: floating-point outer product and accumulate or subtract. For example: FMOPS <ZAda>.D, <Pn>/M, <Pm>/M, <Zn>.D, <Zm>.D |

The instructions where the input vectors have the same data type as the output array (FP32 and FP64) are fairly straightforward.

The following example shows FP32 outer product and accumulate or subtract:

  FMOPA <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.S, <Zm>.S
  FMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.S, <Zm>.S

FP16, BF16, INT16, INT8, I16I64 outer product and accumulate or subtract instructions

Because these instructions widen the result data type, the operations are not as straightforward as the FP32 and FP64 variants.

  • BF16 variants calculate the sum of two BF16 outer products, widening results into FP32, then the result is destructively added to or subtracted from the destination tile.
  • INT8 variants calculate the sum of four INT8 outer products, widening results into INT32, then the result is destructively added to or subtracted from the destination tile.
  • INT16 variants calculate the sum of two INT16 outer products, widening results into INT32, then the result is destructively added to or subtracted from the destination tile.
  • FP16 variants calculate the sum of two FP16 outer products, widening results into FP32, then the result is destructively added to or subtracted from the destination tile.
  • If FEAT_SME_I16I64 is implemented, I16I64 variants calculate the sum of four INT16 outer products, widening results into INT64, then the result is destructively added to or subtracted from the destination tile.

The following example uses the INT8 UMOPA variant with SVL=128-bit:

UMOPA <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.B, <Zm>.B

Each input register (Zn.B, Zm.B) is treated as a matrix of 4x4 elements, as if each block of four contiguous elements (shown with red outline) were transposed.

In this example, because SVL is 128-bit:

  • The first source, Zn.B, contains a 4x4 sub-matrix of unsigned 8-bit integer values.
  • The second source, Zm.B, contains a 4x4 sub-matrix of unsigned 8-bit integer values.
  • The UMOPA instruction calculates a 4 x 4 widened 32-bit integer sum of outer products, which is then destructively added to the 32-bit integer destination tile, ZAda.

More generally, the unsigned integer sum of outer products and accumulate instruction multiplies the sub-matrix in the first source vector by the sub-matrix in the second source vector. For the 8-bit variant, the first source vector holds an (SVL/32) x 4 sub-matrix of unsigned 8-bit integer values, and the second source vector holds a 4 x (SVL/32) sub-matrix of unsigned 8-bit integer values. The resulting (SVL/32) x (SVL/32) widened 32-bit integer sum of outer products is then destructively added to the 32-bit integer destination tile.
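The element indexing can be easier to follow in code. The sketch below is a simplified, unpredicated Python model of the 8-bit UMOPA behaviour described above, assuming SVL = 128 bits so that each Z register holds 16 bytes. The function name and input values are hypothetical, and this is only an illustration of the arithmetic, not the architectural pseudocode.

```python
def umopa_u8(za, zn, zm):
    """Accumulate the 4-way sum of outer products of zn and zm into za.

    zn and zm are lists of unsigned 8-bit values (one Z register each).
    za is a square list of rows holding unsigned 32-bit accumulators.
    """
    dim = len(zn) // 4                      # SVL/32 rows and columns
    for row in range(dim):
        for col in range(dim):
            acc = 0
            for k in range(4):
                # Four contiguous bytes of zn form one row of the left
                # sub-matrix; four contiguous bytes of zm form one column
                # of the right sub-matrix.
                acc += zn[4 * row + k] * zm[4 * col + k]
            za[row][col] = (za[row][col] + acc) & 0xFFFFFFFF  # wrap to 32 bits
    return za

zn = list(range(16))                 # 16 bytes at SVL = 128 bits
zm = [1] * 16
za = [[0] * 4 for _ in range(4)]
umopa_u8(za, zn, zm)
print(za)
```

Each ZA element receives the dot product of one group of four bytes from Zn and one group of four bytes from Zm, which is why a single instruction performs four multiply-accumulates per 32-bit result.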

The following example uses the BF16 BFMOPA variant with SVL=128-bit:

BFMOPA <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.H, <Zm>.H

 

In this example, because SVL is 128-bit:

  • The first source, Zn.H, contains a 4 x 2 sub-matrix of BFloat16 values, which are widened to single-precision floating-point values.
  • The second source, Zm.H, contains a 2 x 4 sub-matrix of BFloat16 values, which are widened to single-precision floating-point values.
  • The BFMOPA instruction calculates a 4 x 4 single-precision floating-point sum of outer products, which is then destructively added to the single-precision floating-point destination tile, ZAda.

More generally, this instruction widens the (SVL/32) x 2 sub-matrix of BFloat16 values held in the first source vector to single-precision floating-point values, and multiplies it by the widened 2 x (SVL/32) sub-matrix of BFloat16 values held in the second source vector. The resulting (SVL/32) x (SVL/32) single-precision floating-point sum of outer products is then destructively added to the single-precision floating-point destination tile.
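The following Python sketch models the widening step and the 2-way sum of outer products for SVL = 128 bits. It relies on the fact that a BF16 value is the upper 16 bits of an IEEE 754 single-precision value, so widening is a 16-bit left shift of the bit pattern. The helper names and inputs are hypothetical, and the model uses Python floats, so it does not reproduce the rounding behaviour of the real instruction.

```python
import struct

def bf16_bits_to_float(bits):
    """Widen a 16-bit BF16 bit pattern to a float by padding 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def bfmopa(za, zn_bits, zm_bits):
    """Accumulate the 2-way sum of outer products of two BF16 vectors into za."""
    dim = len(zn_bits) // 2                 # SVL/32 rows and columns
    for row in range(dim):
        for col in range(dim):
            acc = 0.0
            for k in range(2):
                a = bf16_bits_to_float(zn_bits[2 * row + k])
                b = bf16_bits_to_float(zm_bits[2 * col + k])
                acc += a * b
            za[row][col] += acc
    return za

# 8 halfwords per Z register at SVL = 128 bits.
zn_bits = [0x3F80, 0x4000] * 4    # each row of the 4 x 2 sub-matrix is (1.0, 2.0)
zm_bits = [0x4040, 0x3F00] * 4    # each column of the 2 x 4 sub-matrix is (3.0, 0.5)
za = [[0.0] * 4 for _ in range(4)]
bfmopa(za, zn_bits, zm_bits)
print(za[0])
```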

SME instructions with predication

Each source vector is independently predicated by a corresponding governing predicate:

  • Outer product and accumulate or subtract instructions use Pn/M and Pm/M: Inactive source elements are treated as having a value of zero.
  • Tile slice move instructions use Pg/M: Inactive elements in the destination slice remain unmodified.
  • Tile slice load instructions use Pg/Z: Inactive elements are set to zero in the destination vector.
  • Tile slice store instructions use Pg: Inactive elements are not written to memory.

Predication makes it easier to handle cases where the matrix dimension is not an exact multiple of SVL.

For example, consider the following instruction:

The vector input Z0 is predicated by P0, and Z1 is predicated by P1.

In this example:

  • SVL is 512-bit
  • The Z registers contain 16 x FP32 vectors
  • The last two elements are inactive in P0
  • The last element is inactive in P1.

This instruction updates (16 - 2) x (16 - 1) = 210 FP32 elements in the ZA0.S tile. The remaining elements of the ZA0.S tile are left unchanged because merging predication (/M) is used.
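The element count can be checked with a small Python model. This sketch assumes the instruction under discussion is an FP32 FMOPA accumulating into ZA0.S, with Z0 (governed by P0) supplying the rows and Z1 (governed by P1) supplying the columns. The function name and data values are hypothetical.

```python
def fmopa(za, pn, pm, zn, zm):
    """Predicated FP32 outer product and accumulate (simplified model).

    An element za[row][col] is updated only when pn[row] and pm[col] are
    both active; all other elements keep their previous value.
    """
    for row in range(len(zn)):
        for col in range(len(zm)):
            if pn[row] and pm[col]:
                za[row][col] += zn[row] * zm[col]
    return za

n = 16                                   # 512-bit SVL / 32-bit elements
za = [[100.0] * n for _ in range(n)]
p0 = [True] * 14 + [False] * 2           # last two elements inactive
p1 = [True] * 15 + [False]               # last element inactive
z0 = [1.0] * n
z1 = [2.0] * n
fmopa(za, p0, p1, z0, z1)

updated = sum(value != 100.0 for row in za for value in row)
print(updated)
```

In this model the last two rows and the last column of the tile keep their original contents.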

The following figures show more examples of predicated outer product instructions. Strikethrough text shows the calculation components that are affected by the inactive predicate elements:

Addition of a vector to ZA rows and columns

SME includes instructions that add horizontal or vertical vector elements to a ZA tile with predication support.

| Instruction | Description |
|---|---|
| ADDHA | Add the source vector to each horizontal slice of a ZA tile |
| ADDVA | Add the source vector to each vertical slice of a ZA tile |

For example:

ADDHA ZA0.S, P0/M, P1/M, Z1.S 

This performs the following operation:

This ADDHA instruction adds each element of the source vector Z1 to the corresponding active element of each horizontal slice of the ZA0.S tile.

The tile elements are predicated by a pair of governing predicates. An element of a horizontal slice is considered active if its corresponding element in the second governing predicate is TRUE and the element corresponding to its horizontal slice number in the first governing predicate is TRUE. Inactive elements in the destination tile remain unmodified.
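A minimal Python sketch of the ADDHA rule described above follows, using a 4 x 4 tile (SVL = 128 bits with 32-bit elements). The function name and the predicate and vector values are made up for illustration.

```python
def addha(za, pn, pm, zn):
    """Add zn to every horizontal slice (row) of za, under two predicates.

    pn selects which rows take part; pm selects which elements within
    each row take part. Inactive elements are left unmodified.
    """
    for row in range(len(za)):
        for col in range(len(zn)):
            if pn[row] and pm[col]:
                za[row][col] += zn[col]
    return za

za = [[0] * 4 for _ in range(4)]
pn = [True, True, False, True]     # first governing predicate: row 2 inactive
pm = [True, False, True, True]     # second governing predicate: column 1 inactive
zn = [1, 2, 3, 4]
addha(za, pn, pm, zn)
for row in za:
    print(row)
```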

Tile load, store, move instructions

SME load, store, and move instructions do the following:

  • Load ZA rows and columns from memory
  • Store ZA rows and columns to memory
  • Move ZA rows and columns to SVE Z registers
  • Move SVE Z registers to ZA rows and columns

Tile slice load and store instructions

The LD1B, LD1H, LD1W, LD1D, and LD1Q instructions load contiguous memory values to a ZA tile slice with 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit elements respectively.

The ST1B, ST1H, ST1W, ST1D, and ST1Q instructions store a ZA tile slice with 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit elements respectively to contiguous memory.

These instructions also have predication support, for example:

LD1B {ZA0H.B[W12, #imm]}, P0/Z, [X1, X2]

This LD1B instruction performs a predicated contiguous load of bytes from memory address (X1+X2) to the ZA0 horizontal tile slice with slice index (W12+imm). The slice index register must be one of W12-W15. Inactive elements are set to zero in the destination slice.

ST1H {ZA1V.H[W12, #imm]}, P2, [X1, X2, LSL #1]

This ST1H instruction performs a predicated contiguous store of halfwords from the ZA1 vertical tile slice with slice index (W12+imm) to memory starting at (X1+X2*2). Inactive elements are not written to memory.
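The following Python sketch models the two predication behaviours: zeroing on load and skipping on store. It is a simplification that treats memory as a list of elements instead of bytes and assumes that the slice index wraps around the tile dimension. The function names and values are hypothetical.

```python
def ld1_slice(za, vertical, slice_index, pg, memory, addr):
    """Load one tile slice from memory; inactive elements become zero."""
    dim = len(za)
    s = slice_index % dim                  # assumed wrap-around of the index
    for e in range(dim):
        value = memory[addr + e] if pg[e] else 0
        if vertical:
            za[e][s] = value               # vertical slice: one column
        else:
            za[s][e] = value               # horizontal slice: one row

def st1_slice(za, vertical, slice_index, pg, memory, addr):
    """Store one tile slice to memory; inactive elements are not written."""
    dim = len(za)
    s = slice_index % dim
    for e in range(dim):
        if pg[e]:
            memory[addr + e] = za[e][s] if vertical else za[s][e]

za = [[9] * 4 for _ in range(4)]
memory = [10, 11, 12, 13, 14, 15]
ld1_slice(za, False, 1, [True, True, False, True], memory, 2)

out = [0] * 4
st1_slice(za, True, 0, [True, True, False, False], out, 0)
print(za[1], out)
```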

Tile slice move instructions

The MOV instruction (an alias of MOVA) moves a Z vector register to a ZA tile slice, or moves a ZA tile slice to a Z vector register. The instruction operates on individual horizontal or vertical slices within a named ZA tile of the specified element size. The slice number within the tile is selected by the sum of the slice index register and immediate offset. Inactive elements in the destination slice remain unmodified.

For example:

MOV     ZA0H.B[W12, #imm],  P0/M, Z0.B

or:

MOVA  ZA0H.B[W12, #imm],  P0/M, Z0.B

This instruction moves the vector register Z0.B to the horizontal ZA tile slice, ZA0H.B[W12, #imm], using P0 as the predication register. The slice index register must be one of W12-W15. Inactive elements in the destination tile slice remain unmodified.
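The merging behaviour of the vector-to-tile move can be sketched in Python as follows. The function name is hypothetical, and the wrap-around of the slice index is an assumption of this model.

```python
def mova_vector_to_tile(za, vertical, slice_index, pg, zn):
    """Move zn into one tile slice; inactive destination elements are kept."""
    dim = len(za)
    s = slice_index % dim                  # assumed wrap-around of the index
    for e in range(dim):
        if pg[e]:
            if vertical:
                za[e][s] = zn[e]
            else:
                za[s][e] = zn[e]
    return za

za = [[0] * 4 for _ in range(4)]
za[2] = [7, 7, 7, 7]
mova_vector_to_tile(za, False, 2, [True, False, True, False], [1, 2, 3, 4])
print(za[2])
```

Unlike the zeroing load, elements 1 and 3 of the destination slice keep their previous value of 7.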

ZA array vector load/store instructions

The SME LDR instruction loads memory to a ZA array vector, and the SME STR instruction stores a ZA array vector to memory.

These instructions are unpredicated. For the purpose of context switching, they can be used in Non-streaming SVE mode when PSTATE.ZA is enabled.

For example, in the following STR instruction, the ZA array vector is selected by the sum of the vector select register and an optional immediate offset. The memory address is the scalar base plus the same optional immediate offset multiplied by the current vector length in bytes:

STR ZA[<Wv>, <imm>], [<Xn|SP>{, #<imm>, MUL VL}]
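The following Python sketch works through the vector selection and address calculation just described. The function and parameter names are hypothetical, and the wrap-around of the vector number at the number of ZA array vectors (SVL in bytes) is an assumption of this model.

```python
def za_vector_and_address(wv_value, base, imm, svl_bits):
    """Return the ZA array vector number and memory address used by STR."""
    svl_bytes = svl_bits // 8              # vector length in bytes
    vector = (wv_value + imm) % svl_bytes  # ZA has svl_bytes array vectors
    address = base + imm * svl_bytes       # the same imm scales the address
    return vector, address

vector, address = za_vector_and_address(wv_value=0, base=0x1000, imm=3, svl_bits=512)
print(vector, hex(address))
```

Because the same immediate advances both the vector number and the address, consecutive immediates store consecutive ZA array vectors to consecutive vector-sized blocks of memory.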


Zero ZA tile instruction

The ZERO instruction zeroes a list of 64-bit element ZA tiles:

ZERO { <mask> }

The ZERO instruction zeroes up to eight tiles named ZA0.D to ZA7.D, as specified by mask, leaving the other tiles unmodified.

The instruction can be used in Non-streaming SVE mode when PSTATE.ZA is enabled.

To zero the entire ZA array, use the instruction alias ZERO {ZA}.
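As a rough model of the mask, the sketch below treats bit k of an 8-bit mask as selecting tile ZAk.D. It assumes that tile ZAk.D is made up of every eighth ZA array vector, starting at vector k. The function name is hypothetical, and the example uses SVL = 128 bits, which gives 16 array vectors of 16 bytes each.

```python
def zero_tiles(za_array, mask):
    """Zero the ZA array vectors that belong to the 64-bit tiles in mask."""
    for v, vector in enumerate(za_array):
        if (mask >> (v % 8)) & 1:          # vector v is assumed to be in ZA(v % 8).D
            za_array[v] = [0] * len(vector)
    return za_array

svl_bytes = 16
za_array = [[0xFF] * svl_bytes for _ in range(svl_bytes)]
zero_tiles(za_array, 0b00010001)           # select ZA0.D and ZA4.D
cleared = [v for v, vector in enumerate(za_array) if not any(vector)]
print(cleared)
```

A mask with all eight bits set clears every vector in this model, which matches the effect of ZERO {ZA}.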

New SVE2 instructions

The SME architecture adds several new SVE2 instructions. These instructions are also usable when the PE is in Non-streaming SVE mode, if SVE2 is implemented. These instructions include:

  • Predicate select between predicate register or all-false
  • Reverse 64-bit double words in elements
  • Signed and unsigned clamp to minimum/maximum vector

PSEL instruction

The PSEL instruction performs a predicate select between a predicate register or all-false, as follows:

PSEL <Pd>, <Pn>, <Pm>.<T>[<Wv>, <imm>]

If the indexed element of the second source predicate is true, the instruction places the contents of the first source predicate register into the destination predicate register, otherwise it sets the destination predicate to all-false.

For example, consider the following instruction assuming W12 is 0:

PSEL P0, P1, P2.B[W12, #0]

The element [W12+0] of the second source predicate P2.B is false. Therefore P0 is set to all zeros, as shown in the following figure:

Now consider the following instruction, still assuming W12 is 0 but this time the immediate offset is 1:


PSEL P0, P1, P2.B[W12, #1]

The element [W12+1] of the second source predicate P2.B is true. Therefore P0 is set to the contents of the first source predicate register P1, as shown in the following figure:
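The two cases can be reproduced with a small Python model of PSEL. The predicate contents are made up for this sketch, chosen so that element 0 of P2 is false and element 1 is true. The wrap-around of the element index is an assumption of the model.

```python
def psel(pn, pm, wv_value, imm):
    """Return pn if element (wv_value + imm) of pm is true, else all-false."""
    element = (wv_value + imm) % len(pm)   # assumed wrap-around of the index
    return list(pn) if pm[element] else [False] * len(pn)

p1 = [True, False, True, True]
p2 = [False, True, False, True]
w12 = 0                                    # value held in the index register

first = psel(p1, p2, w12, 0)               # P2 element 0 is false
second = psel(p1, p2, w12, 1)              # P2 element 1 is true
print(first, second)
```

Here W12 is simply a general-purpose register that holds the element index. Incrementing it in a loop lets one predicate (P2) decide, slice by slice, whether the per-element predicate (P1) is applied or the whole slice is masked off.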

   

Further reading

For more detailed information about SME, see the Arm Architecture Reference Manual for A-profile architecture.

To discover how you can use SME in your applications to efficiently work with matrices and other forms of data, see the SME Programmer's Guide.

  • Aia H, 7 months ago

    I read this but still do not understand how PSEL is used. What is W12?

  • Zenon Xiu (修志龙), 11 months ago, in reply to vladimir.murzin@arm.com

    @Vladimir, well spotted! I should fix it in the next revision. Thanks.

  • vladimir.murzin@arm.com, 11 months ago

    The examples in "Tile slice load and store instructions" and "Tile slice move instructions" use W0 as the slice index register, which is not permitted. The permitted slice index registers are W12-W15.