Part 2: Arm Scalable Matrix Extension (SME) Instructions

Zenon Xiu (修志龙)

June 24, 2024

This blog post is the second half of a two-part blog series. Read part 1 of the blog post here.

Part 2 of this two-part blog introduces some of the instructions that SME provides.

SME instructions that interact with the SME ZA storage include the following:

Instructions that accumulate or subtract the outer product of two vectors into a ZA tile
Load, store, and move instructions that transfer a vector to or from a ZA tile row or column
Instructions that add a vector horizontally or vertically to a ZA tile
Instructions that add a multiple of the vector size in Streaming SVE mode to a scalar register

Outer product and accumulate or subtract instructions

To help understand the outer product and accumulate instructions, let us consider how a matrix multiplication can be performed using the outer product operation.

Calculating the outer product of two vectors a and b gives a result matrix C containing the outer product:

Now consider multiplying two matrices, a and b:

This multiplication can be calculated using two outer product operations and accumulating both result matrices:
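The relationship between matrix multiplication and accumulated outer products can be checked numerically. The following Python sketch is illustrative only and is not part of the SME architecture or the original post; the helper names `outer` and `matmul_by_outer_products` and the 2 x 2 input values are assumptions chosen for the demonstration. Step k accumulates the outer product of column k of a and row k of b.

```python
def outer(u, v):
    """Outer product of vectors u and v, returned as a list of rows."""
    return [[x * y for y in v] for x in u]


def matmul_by_outer_products(a, b):
    """Compute the matrix product a x b by accumulating outer products."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    for k in range(inner):
        col_k = [a[i][k] for i in range(rows)]  # column k of a
        p = outer(col_k, b[k])                  # times row k of b
        for i in range(rows):
            for j in range(cols):
                c[i][j] += p[i][j]              # accumulate
    return c


# Hypothetical 2 x 2 inputs: two outer products are accumulated.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul_by_outer_products(a, b)
```

Each pass of the loop over k corresponds to one outer product and accumulate step, which is the operation that the SME outer product instructions perform on a ZA tile.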

SME introduces efficient outer product and accumulate or subtract instructions for the following data types:

8-bit and 16-bit integer
FP16, BF16, FP32, and FP64 floating point

These instructions calculate the outer product of two vectors in two Z vector registers (Zn and Zm), accumulate or subtract the result array with existing data in a ZA tile (ZAda), and save the result to the same ZA tile. Each source vector is independently predicated by a corresponding governing predicate register (Pn and Pm).

Output array	Input vectors	Description	Example
INT32	INT8, INT8	Sum of four INT8 outer products into each INT32 element	SMOPA or SMOPS or UMOPA or UMOPS: signed or unsigned integer sum of outer products and accumulate or subtract. For example: UMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.B, <Zm>.B
INT32	INT16, INT16	Sum of two INT16 outer products into each INT32 element	SMOPA or SMOPS or UMOPA or UMOPS: signed or unsigned integer sum of outer products and accumulate or subtract. For example: UMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.H, <Zm>.H
INT64	INT16, INT16	Sum of four INT16 outer products into each INT64 element if FEAT_SME_I16I64 is implemented	SMOPA or SMOPS or UMOPA or UMOPS: signed or unsigned integer sum of outer products and accumulate or subtract. For example: UMOPS <ZAda>.D, <Pn>/M, <Pm>/M, <Zn>.H, <Zm>.H
FP32	BF16, BF16	Sum of two BF16 outer products into each FP32 element	BFMOPA or BFMOPS: BFloat16 sum of outer products and accumulate or subtract. For example: BFMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.H, <Zm>.H
FP32	FP16, FP16	Sum of two FP16 outer products into each FP32 element	FMOPA or FMOPS: Half-precision floating-point sum of outer products and accumulate or subtract. For example: FMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.H, <Zm>.H
FP32	FP32, FP32	Simple FP32 outer product	FMOPA or FMOPS: Floating-point outer product and accumulate or subtract. For example: FMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.S, <Zm>.S
FP64	FP64, FP64	Simple FP64 outer product if FEAT_SME_F64F64 is implemented	FMOPA or FMOPS: Floating-point outer product and accumulate or subtract. For example: FMOPS <ZAda>.D, <Pn>/M, <Pm>/M, <Zn>.D, <Zm>.D

The instructions where the input vectors have the same data type as the output array (FP32 and FP64) are fairly straightforward.

The following example shows FP32 outer product and accumulate or subtract:

  FMOPA <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.S, <Zm>.S
  FMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.S, <Zm>.S
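As a rough mental model of these two instructions, the following Python sketch treats a ZA tile as a list of rows and each Z register as a list of floats. It is a simplified model under stated assumptions, not the architectural pseudocode: the function name `fmopa` and its `subtract` flag are hypothetical, Python floats are double precision rather than FP32, and the fused multiply-add rounding of the real instructions is not modeled.

```python
def fmopa(za, pn, pm, zn, zm, subtract=False):
    """Model FMOPA, or FMOPS when subtract=True, on one square tile.

    zn supplies one value per tile row and zm one value per tile column.
    A tile element is updated only when both predicate elements are active.
    """
    sign = -1.0 if subtract else 1.0
    for i, a in enumerate(zn):
        for j, b in enumerate(zm):
            if pn[i] and pm[j]:
                za[i][j] += sign * a * b
    return za


# SVL = 128 bits gives four FP32 elements per vector and a 4 x 4 tile.
n = 4
za = [[0.0] * n for _ in range(n)]
zn = [1.0, 2.0, 3.0, 4.0]
zm = [10.0, 20.0, 30.0, 40.0]
all_true = [True] * n

fmopa(za, all_true, all_true, zn, zm)                  # accumulate
after_fmopa = [row[:] for row in za]
fmopa(za, all_true, all_true, zn, zm, subtract=True)   # subtract again
```

In this model, applying the subtract form with the same operands returns the tile to its starting contents.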

FP16, BF16, INT16, INT8, I16I64 outer product and accumulate or subtract instructions

Because these instructions widen the result data type, the operations are not as straightforward as the FP32 and FP64 variants.

BF16 variants calculate the sum of two BF16 outer products, widening results into FP32, then the result is destructively added to or subtracted from the destination tile.
INT8 variants calculate the sum of four INT8 outer products, widening results into INT32, then the result is destructively added to or subtracted from the destination tile.
INT16 variants calculate the sum of two INT16 outer products, widening results into INT32, then the result is destructively added to or subtracted from the destination tile.
FP16 variants calculate the sum of two FP16 outer products, widening results into FP32, then the result is destructively added to or subtracted from the destination tile.
If FEAT_SME_I16I64 is implemented, I16I64 variants calculate the sum of four INT16 outer products, widening results into INT64, then the result is destructively added to or subtracted from the destination tile.

The following example uses the INT8 UMOPA variant with SVL=128-bit:

UMOPA <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.B, <Zm>.B

Each input register (Zn.B, Zm.B) is treated as a matrix of 4x4 elements, as if each block of four contiguous elements (shown with red outline) were transposed.

In this example, because SVL is 128-bit:

The first source, Zn.B, contains a 4 x 4 sub-matrix of unsigned 8-bit integer values.
The second source, Zm.B, contains a 4 x 4 sub-matrix of unsigned 8-bit integer values.
The UMOPA instruction calculates a 4 x 4 widened 32-bit integer sum of outer products, which is then destructively added to the 32-bit integer destination tile, ZAda.

More generally, the unsigned integer sum of outer products and accumulate instruction multiplies the sub-matrix in the first source vector by the sub-matrix in the second source vector. The first source vector holds a (SVL/32) x 4 sub-matrix of unsigned 8-bit integer values, and the second source vector holds a 4 x (SVL/32) sub-matrix of unsigned 8-bit integer values. The resulting (SVL/32) x (SVL/32) widened 32-bit integer sum of outer products is then destructively added to the 32-bit integer destination tile.
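The four-way sum of outer products can be sketched in Python as follows. This is a simplified model that assumes all predicate elements are active; the function name `umopa_u8` and the input values are hypothetical. Each group of four contiguous bytes in Zn is treated as one row of the first sub-matrix, and each group of four contiguous bytes in Zm as one column of the second sub-matrix.

```python
def umopa_u8(za, zn, zm):
    """Model an unpredicated UMOPA with 8-bit sources and a 32-bit tile.

    zn and zm are lists of SVL/8 unsigned bytes; za is an n x n list of
    rows, where n = SVL/32. Each tile element receives a sum of four
    byte products, and the accumulation wraps at 32 bits.
    """
    n = len(zn) // 4
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(4):
                acc += zn[4 * i + k] * zm[4 * j + k]
            za[i][j] = (za[i][j] + acc) & 0xFFFFFFFF
    return za


# SVL = 128 bits: 16 bytes per vector and a 4 x 4 tile of 32-bit elements.
za = [[0] * 4 for _ in range(4)]
zn = list(range(16))   # bytes 0..15
zm = [1] * 16          # every byte is 1
umopa_u8(za, zn, zm)
```

With every byte of Zm set to 1, each tile row simply holds the sum of the corresponding group of four bytes from Zn, which makes the grouping easy to see.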

The following example uses the BF16 BFMOPA variant with SVL=128-bit:

BFMOPA <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.H, <Zm>.H

In this example, because SVL is 128-bit:

The first source, Zn.H, contains a 4 x 2 sub-matrix of BFloat16 values, which are widened to single-precision floating-point values.
The second source, Zm.H, contains a 2 x 4 sub-matrix of BFloat16 values, which are widened to single-precision floating-point values.
The BFMOPA instruction calculates a 4 x 4 single-precision floating-point sum of outer products, which is then destructively added to the single-precision floating-point destination tile, ZAda.

More generally, this instruction widens the (SVL/32) x 2 sub-matrix of BFloat16 values held in the first source vector to single-precision floating-point values, and multiplies it by the 2 x (SVL/32) sub-matrix of BFloat16 values held in the second source vector, which is widened in the same way. The resulting (SVL/32) x (SVL/32) single-precision floating-point sum of outer products is then destructively added to the single-precision floating-point destination tile.
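The two-way widening form used by BFMOPA can be sketched in the same style. The sketch below is an approximation under stated assumptions: the helper names `bf16_to_float` and `bfmopa` are hypothetical, all predicate elements are assumed active, and the accumulation uses Python double-precision floats, so the exact rounding behavior of the real instruction is not reproduced. It relies on the fact that a BFloat16 value has the same bit layout as the upper 16 bits of an FP32 value.

```python
import struct


def bf16_to_float(bits):
    """Widen a 16-bit BFloat16 bit pattern to a float (exact conversion)."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def bfmopa(za, zn_bits, zm_bits):
    """Model an unpredicated BFMOPA: a sum of two widened products per element."""
    n = len(zn_bits) // 2
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(2):
                acc += (bf16_to_float(zn_bits[2 * i + k])
                        * bf16_to_float(zm_bits[2 * j + k]))
            za[i][j] += acc
    return za


# SVL = 128 bits: eight BFloat16 values per vector and a 4 x 4 FP32 tile.
ONE, TWO, THREE, HALF = 0x3F80, 0x4000, 0x4040, 0x3F00  # BF16 bit patterns
za = [[0.0] * 4 for _ in range(4)]
zn_bits = [ONE, TWO] * 4     # each row of the first sub-matrix is (1.0, 2.0)
zm_bits = [THREE, HALF] * 4  # each column of the second sub-matrix is (3.0, 0.5)
bfmopa(za, zn_bits, zm_bits)
```

Every tile element in this example receives 1.0 x 3.0 + 2.0 x 0.5.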

SME instructions with predication

Each source vector is independently predicated by a corresponding governing predicate:

Outer product and accumulate or subtract instructions use Pn/M and Pm/M: Inactive source elements are treated as having a value of zero.
Tile slice move instructions use Pg/M: Inactive elements in the destination slice remain unmodified.
Tile slice load instructions use Pg/Z: Inactive elements are set to zero in the destination tile slice.
Tile slice store instructions use Pg: Inactive elements are not written to memory.

Predication makes it easier to handle matrices whose dimensions are not an exact multiple of the number of elements that fit in an SVL-sized vector.

For example, consider the following instruction:

The vector input Z0 is predicated by P0, and Z1 is predicated by P1.

In this example:

SVL is 512-bit
Each Z register contains 16 FP32 elements
The last two elements are inactive in P0
The last element is inactive in P1.

This instruction updates (16 - 2) x (16 - 1) = 210 FP32 elements in the ZA0.S tile, leaving the remaining elements of the ZA0.S tile unchanged because merging predication (Pn/M and Pm/M) is used.
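The element count in this example can be reproduced with a short Python sketch. The variable names are hypothetical; the predicate patterns follow the example above, with the last two elements of P0 and the last element of P1 inactive.

```python
SVL_BITS = 512
n = SVL_BITS // 32                    # FP32 elements per Z register

p0 = [True] * (n - 2) + [False] * 2   # governs Z0, which selects tile rows
p1 = [True] * (n - 1) + [False]       # governs Z1, which selects tile columns

# A tile element is updated only when both predicate elements are active.
updated = [[p0[i] and p1[j] for j in range(n)] for i in range(n)]
count = sum(sum(row) for row in updated)
```

The last two rows and the last column of the tile are left untouched, so `count` is 14 x 15.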

The following figures show more examples of predicated outer product instructions. Strikethrough text shows the calculation components that are affected by the inactive predicate elements:

Addition of a vector to ZA rows and columns

SME includes instructions that add horizontal or vertical vector elements to a ZA tile with predication support.

Instruction	Description
ADDHA	Add the source vector to each horizontal slice of a ZA tile
ADDVA	Add the source vector to each vertical slice of a ZA tile

For example:

ADDHA ZA0.S, P0/M, P1/M, Z1.S

This performs the following operation:

This ADDHA instruction adds each element of the source vector Z1 to the corresponding active element of each horizontal slice of the ZA0.S tile.

The tile elements are predicated by a pair of governing predicates. An element of a horizontal slice is considered active if its corresponding element in the second governing predicate is TRUE and the element corresponding to its horizontal slice number in the first governing predicate is TRUE. Inactive elements in the destination tile remain unmodified.
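The two-predicate rule can be written out as a small Python model. This is a sketch under stated assumptions rather than the architectural definition: the function names `addha` and `addva` and the example values are hypothetical, and the tile is modeled as a list of rows. The first governing predicate is indexed by row and the second by column in both functions.

```python
def addha(za, pn, pm, zn):
    """Model ADDHA: add zn to every horizontal slice (row) of the tile."""
    for row in range(len(za)):
        for col in range(len(zn)):
            if pn[row] and pm[col]:
                za[row][col] += zn[col]
    return za


def addva(za, pn, pm, zn):
    """Model ADDVA: add zn to every vertical slice (column) of the tile."""
    for row in range(len(zn)):
        for col in range(len(za[row])):
            if pn[row] and pm[col]:
                za[row][col] += zn[row]
    return za


# Hypothetical 4 x 4 example: row 2 and column 1 are inactive.
pn = [True, True, False, True]
pm = [True, False, True, True]
zn = [1, 2, 3, 4]
h = addha([[0] * 4 for _ in range(4)], pn, pm, zn)
v = addva([[0] * 4 for _ in range(4)], pn, pm, zn)
```

In the ADDHA result every active row receives a copy of the source vector with the inactive column left at its previous value, and the inactive row is not modified at all.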

Tile load, store, move instructions

SME load, store, and move instructions do the following:

Load ZA rows and columns from memory
Store ZA rows and columns to memory
Move ZA rows and columns to SVE Z registers
Move SVE Z registers to ZA rows and columns

Tile slice load and store instructions

The LD1B, LD1H, LD1W, LD1D, and LD1Q instructions load contiguous memory values to a ZA tile slice with 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit elements respectively.

The ST1B, ST1H, ST1W, ST1D, and ST1Q instructions store a ZA tile slice with 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit elements respectively to contiguous memory.

These instructions also have predication support, for example:

LD1B ZA0H.B[W12, #imm], P0/Z, [X1, X2]

This LD1B instruction performs a predicated contiguous load of bytes from memory address (X1+X2) to the ZA0 horizontal tile slice with slice index (W12+imm). The slice index register must be one of W12 to W15. Inactive elements are set to zero in the destination tile slice.

ST1H ZA1V.H[W12, #imm], P2, [X1, X2, LSL #1]

This ST1H instruction performs a predicated contiguous store of halfwords from the ZA1 vertical tile slice with slice index (W12+imm) to memory starting at (X1+X2*2). The slice index register must be one of W12 to W15. Inactive elements are not written to memory.
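The difference between the zeroing load and the predicated store can be sketched in Python. The helper names `ld1_slice` and `st1_slice`, the memory contents, and the register values are hypothetical, and little-endian data is assumed. Memory is modeled as a `bytearray` and a tile slice as a list of integers.

```python
def ld1_slice(mem, addr, pg, esize_bytes):
    """Zeroing predicated load: inactive elements of the slice become 0."""
    out = []
    for e, active in enumerate(pg):
        if active:
            start = addr + e * esize_bytes
            out.append(int.from_bytes(mem[start:start + esize_bytes], "little"))
        else:
            out.append(0)
    return out


def st1_slice(mem, addr, pg, slice_elems, esize_bytes):
    """Predicated store: memory for inactive elements is left untouched."""
    for e, active in enumerate(pg):
        if active:
            start = addr + e * esize_bytes
            mem[start:start + esize_bytes] = slice_elems[e].to_bytes(esize_bytes, "little")


# Byte load from address X1 + X2, with the last two elements inactive.
mem = bytearray(range(32))
x1, x2 = 4, 3
loaded = ld1_slice(mem, x1 + x2, [True] * 14 + [False] * 2, 1)

# Halfword store to address X1 + (X2 << 1), with every second element inactive.
mem2 = bytearray(b"\xff" * 32)
halfwords = [0x0100 + i for i in range(8)]
st1_slice(mem2, 0 + (2 << 1), [True, False] * 4, halfwords, 2)
```

The store leaves the original 0xFF bytes in place wherever the governing predicate element is inactive, whereas the load writes zeros into the inactive slice elements.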

Tile slice move instructions

The MOV instruction (an alias of MOVA) moves a Z vector register to a ZA tile slice, or moves a ZA tile slice to a Z vector register. The instruction operates on individual horizontal or vertical slices within a named ZA tile of the specified element size. The slice number within the tile is selected by the sum of the slice index register and immediate offset. Inactive elements in the destination slice remain unmodified.

For example:

MOV ZA0H.B[W12, #imm], P0/M, Z0.B

or:

MOVA ZA0H.B[W12, #imm], P0/M, Z0.B

This instruction moves the vector register Z0.B to the horizontal ZA tile slice, ZA0H.B[W12, #imm], using P0 as the predication register. The slice index register must be one of W12 to W15. Inactive elements in the destination tile slice remain unmodified.
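A minimal Python model of the vector-to-tile move with merging predication might look like the following. The function name `mova_to_tile` and the example values are hypothetical, the tile is modeled as a list of rows, and the slice number is taken as the sum of the index register and the immediate, modulo the number of slices in the tile.

```python
def mova_to_tile(tile, w, imm, pg, z, horizontal=True):
    """Move vector z into one tile slice; inactive elements are preserved."""
    n = len(tile)
    s = (w + imm) % n                 # selected slice number
    for e in range(n):
        if pg[e]:
            if horizontal:
                tile[s][e] = z[e]     # row s, element e
            else:
                tile[e][s] = z[e]     # column s, element e
    return tile


# Hypothetical 4 x 4 tile whose elements all start at -1.
tile = [[-1] * 4 for _ in range(4)]
mova_to_tile(tile, w=1, imm=2, pg=[True, True, False, True], z=[10, 20, 30, 40])
```

Only row 3 changes in this example, and its third element keeps the old value because the corresponding predicate element is inactive.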

ZA array vector load/store instructions

The SME LDR instruction loads memory to a ZA array vector, and the SME STR instruction stores a ZA array vector to memory.

These instructions are unpredicated. For the purpose of context switching, they can be used in Non-streaming SVE mode when PSTATE.ZA is enabled.

For example, in the following STR instruction, the ZA array vector is selected by the sum of the vector select register and an optional immediate value. The memory address is generated from a scalar base register, plus the same optional immediate offset multiplied by the current vector length in bytes:

STR ZA[<Wv>, <imm>], [<Xn|SP>{, #<imm>, MUL VL}]
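The way the single immediate is used twice, once to select the ZA array vector and once to form the address, can be sketched in Python. The function name `str_za_vector` and the example values are hypothetical. The ZA array is modeled as SVL/8 vectors of SVL/8 bytes each, and the vector number is taken modulo the number of array vectors.

```python
def str_za_vector(za, mem, wv, imm, base, svl_bits):
    """Model STR of one ZA array vector; returns (vector number, address)."""
    svl_bytes = svl_bits // 8
    vec = (wv + imm) % svl_bytes      # ZA holds SVL/8 array vectors
    addr = base + imm * svl_bytes     # immediate is scaled by the vector length
    mem[addr:addr + svl_bytes] = za[vec]
    return vec, addr


# SVL = 128 bits: 16 array vectors of 16 bytes; vector i is filled with byte i.
za = [bytes([i] * 16) for i in range(16)]
mem = bytearray(256)
vec, addr = str_za_vector(za, mem, wv=14, imm=3, base=0, svl_bits=128)
```

Because the same immediate advances both the vector number and the address, a sequence of STR instructions with increasing immediates saves consecutive array vectors to consecutive memory, which suits the context-switching use case mentioned above.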

Zero ZA tile instruction

The ZERO instruction zeroes a list of 64-bit element ZA tiles:

ZERO { <mask> }

The ZERO instruction zeroes up to eight tiles named ZA0.D to ZA7.D, as specified by mask, leaving the other tiles unmodified.

The instruction can be used in Non-streaming SVE mode when PSTATE.ZA is enabled.

To zero the entire ZA array, use the instruction alias ZERO {ZA}.
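The effect of the mask can be modeled in Python as follows. This sketch assumes the architectural layout in which the 64-bit element tile ZAk.D occupies every ZA array vector whose number is congruent to k modulo 8; the function name `zero_tiles` and the fill pattern are hypothetical.

```python
def zero_tiles(za, mask):
    """Zero the ZAk.D tiles selected by an 8-bit mask (bit k selects ZAk.D).

    za is a list of SVL/8 bytearrays, one per ZA array vector.
    """
    for v in range(len(za)):
        if (mask >> (v % 8)) & 1:
            za[v][:] = bytes(len(za[v]))
    return za


# SVL = 128 bits: 16 array vectors of 16 bytes, filled with a marker value.
za = [bytearray(b"\xaa" * 16) for _ in range(16)]
zero_tiles(za, 0b00010001)            # select ZA0.D and ZA4.D

whole = [bytearray(b"\xaa" * 16) for _ in range(16)]
zero_tiles(whole, 0xFF)               # all eight tiles, equivalent to ZERO {ZA}
```

Selecting ZA0.D and ZA4.D together clears array vectors 0, 4, 8, and 12 in this model, which are the vectors that make up the 32-bit element tile ZA0.S.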

New SVE2 instructions

The SME architecture adds several new SVE2 instructions. These instructions are also usable when the PE is in Non-streaming SVE mode, if SVE2 is implemented. These instructions include:

Predicate select between predicate register or all-false
Reverse 64-bit double words in elements
Signed and unsigned clamp to minimum/maximum vector

PSEL instruction

The PSEL instruction performs a predicate select between a predicate register or all-false, as follows:

PSEL <Pd>, <Pn>, <Pm>.<T>[<Wv>, <imm>]

If the indexed element of the second source predicate is true, the instruction places the contents of the first source predicate register into the destination predicate register, otherwise it sets the destination predicate to all-false.

For example, consider the following instruction assuming W12 is 0:

PSEL P0, P1, P2.B[W12, #0]

The element [W12+0] of the second source predicate P2.B is false. Therefore P0 is set to all zeros, as shown in the following figure:

Now consider the following instruction, still assuming W12 is 0 but this time the immediate offset is 1:

PSEL P0, P1, P2.B[W12, #1]

The element [W12+1] of the second source predicate P2.B is true. Therefore P0 is set to the contents of the first source predicate register P1, as shown in the following figure:
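The selection rule can be captured in a few lines of Python. The function name `psel` and the predicate contents are hypothetical; predicates are modeled as lists of booleans with one entry per byte element, and the element number is taken modulo the number of elements.

```python
def psel(pn, pm, wv, imm):
    """Return a copy of pn if the indexed element of pm is true, else all-false."""
    element = (wv + imm) % len(pm)
    return list(pn) if pm[element] else [False] * len(pn)


# SVL = 128 bits gives 16 byte elements per predicate register.
p1 = [True, False, True, True] + [False] * 12
p2 = [False, True] + [False] * 14     # element 0 is false, element 1 is true
w12 = 0

first = psel(p1, p2, w12, 0)          # element 0 of P2 is false
second = psel(p1, p2, w12, 1)         # element 1 of P2 is true
```

The first call yields an all-false predicate and the second yields a copy of P1, matching the two cases described above.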
