Part 2 of this two-part blog introduces some of the instructions that SME provides.
SME instructions that interact with the SME ZA storage include the following:
To help understand the outer product and accumulate instructions, let us consider how a matrix multiplication can be performed using the outer product operation.
Calculating the outer product of two vectors a and b gives a result matrix C containing the outer product:
Now consider multiplying two matrices, a and b:
This multiplication can be calculated using two outer product operations and accumulating both result matrices:
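The idea above can be sketched in plain Python. This is only an illustrative model, not SME code: the helper names `outer` and `matmul_by_outer_products` and the sample 2x2 matrices are assumptions made for this example. Each step of the loop takes one column of the first matrix and one row of the second, forms their outer product, and accumulates it into the result.

```python
def outer(a, b):
    """Outer product of column vector a and row vector b."""
    return [[x * y for y in b] for x in a]


def matmul_by_outer_products(A, B):
    """Compute A x B by accumulating one outer product per column of A and row of B."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for k in range(inner):
        col_k = [A[i][k] for i in range(rows)]  # k-th column of A
        row_k = B[k]                            # k-th row of B
        P = outer(col_k, row_k)
        for i in range(rows):
            for j in range(cols):
                C[i][j] += P[i][j]              # accumulate into the result
    return C


# Hypothetical sample data: two 2x2 matrices need two outer products.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
result = matmul_by_outer_products(A, B)
```

With these sample inputs, `result` equals the ordinary matrix product of `A` and `B`.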
SME introduces efficient outer product and accumulate or subtract instructions for the following data types:
These instructions calculate the outer product of two vectors held in two Z vector registers (Zn and Zm), add the result to or subtract it from the existing data in a ZA tile (ZAda), and write the result back to the same ZA tile. Each source vector is independently predicated by a corresponding governing predicate register (Pn and Pm).
The instructions where the input vectors have the same data type as the output array (FP32 and FP64) are fairly straightforward.
The following example shows FP32 outer product and accumulate or subtract:
FMOPA <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.S, <Zm>.S
FMOPS <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.S, <Zm>.S
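As a rough mental model, the behaviour of the non-widening FP32 forms can be sketched in Python. The function name `fmopa`, its `subtract` flag, and the list-based representation of registers are assumptions made for this sketch. Python floats are double precision, so it does not reproduce the single-precision fused rounding of the real instructions.

```python
def fmopa(za, pn, pm, zn, zm, subtract=False):
    """Model FMOPA (or FMOPS when subtract=True) on one ZA tile, updated in place.

    za: square list of lists (the tile); pn, pm: lists of bools; zn, zm: lists of floats.
    """
    sign = -1.0 if subtract else 1.0
    for row in range(len(zn)):
        for col in range(len(zm)):
            if pn[row] and pm[col]:
                # Both governing predicate elements are active: accumulate.
                za[row][col] += sign * zn[row] * zm[col]
            # Otherwise the tile element keeps its old value (merging predication).
    return za


# Assumed SVL of 128 bits: four FP32 elements per vector, 4x4 tile.
tile = [[0.0] * 4 for _ in range(4)]
fmopa(tile, [True] * 4, [True] * 4, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

Calling the same function with `subtract=True` and the same operands would return the tile to all zeros.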
Because these instructions widen the result data type, the operations are not as straightforward as the FP32 and FP64 variants.
The following example uses the INT8 UMOPA variant with SVL=128-bit:
UMOPA <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.B, <Zm>.B
Each input register (Zn.B, Zm.B) is treated as a matrix of 4x4 elements, as if each block of four contiguous elements (shown with red outline) were transposed.
In this example, because SVL is 128-bit:
More generally, the unsigned integer sum of outer products and accumulate instruction multiplies the sub-matrix in the first source vector by the sub-matrix in the second source vector. Each source vector contains a (SVL/32) x 4 sub-matrix of unsigned 8-bit integer values. The resulting (SVL/32) x (SVL/32) widened 32-bit integer sum of outer products is then destructively added to the 32-bit integer destination tile.
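The widening behaviour described above can be sketched in Python as follows. The function name `umopa` and the list-based register model are assumptions for illustration. Each group of four contiguous bytes in a source vector contributes to one row (first source) or one column (second source) of the result, and the 32-bit accumulator wraps on overflow.

```python
def umopa(za, pn, pm, zn, zm):
    """Model UMOPA on a 32-bit element tile, updated in place.

    zn, zm: lists of unsigned 8-bit values (one entry per byte of the vector).
    pn, pm: per-byte predicate lists. za: (SVL/32) x (SVL/32) list of lists.
    """
    dim = len(zn) // 4  # SVL/32 rows and columns
    for row in range(dim):
        for col in range(dim):
            acc = za[row][col]
            for k in range(4):
                # A product is included only if both byte elements are active.
                if pn[4 * row + k] and pm[4 * col + k]:
                    acc += zn[4 * row + k] * zm[4 * col + k]
            za[row][col] = acc & 0xFFFFFFFF  # wrap to 32 bits
    return za


# SVL = 128 bits: 16 bytes per vector, 4x4 tile of 32-bit elements.
zn = list(range(16))
zm = [1] * 16
tile = [[0] * 4 for _ in range(4)]
umopa(tile, [True] * 16, [True] * 16, zn, zm)
```

Because every element of `zm` is 1 in this sample, each tile row simply holds the sum of the corresponding four bytes of `zn`.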
The following example uses the BF16 variant, BFMOPA, with SVL=128-bit:
BFMOPA <ZAda>.S, <Pn>/M, <Pm>/M, <Zn>.H, <Zm>.H
More generally, this instruction widens the (SVL/32) x 2 sub-matrix of BFloat16 values held in the first source vector to single-precision floating-point values and multiplies it by the widened 2 x (SVL/32) sub-matrix of BFloat16 values held in the second source vector. The resulting (SVL/32) x (SVL/32) single-precision floating-point sum of outer products is then destructively added to the single-precision floating-point destination tile.
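A simplified Python sketch of this widening sum of outer products is shown below. The names `bf16_to_f32` and `bfmopa` are hypothetical helpers for this example. The sketch relies on the fact that a BFloat16 value is the upper 16 bits of an IEEE 754 single-precision value. It accumulates in Python double precision and treats predication per 16-bit element, so it is an approximation of the instruction's behaviour rather than a bit-exact model.

```python
import struct


def bf16_to_f32(bits):
    """Widen a 16-bit BFloat16 bit pattern to a float by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def bfmopa(za, pn, pm, zn_bits, zm_bits):
    """Approximate model of BFMOPA on a 32-bit element tile, updated in place.

    zn_bits, zm_bits: lists of BFloat16 bit patterns (one per 16-bit element).
    pn, pm: per-element predicate lists (assumed granularity for this sketch).
    """
    dim = len(zn_bits) // 2  # SVL/32 rows and columns
    for row in range(dim):
        for col in range(dim):
            for k in range(2):
                if pn[2 * row + k] and pm[2 * col + k]:
                    a = bf16_to_f32(zn_bits[2 * row + k])
                    b = bf16_to_f32(zm_bits[2 * col + k])
                    za[row][col] += a * b
    return za


# SVL = 128 bits: eight BF16 elements per vector, 4x4 FP32 tile.
zn_bits = [0x3F80, 0x4000] * 4  # (1.0, 2.0) for every row
zm_bits = [0x4040, 0x3F00] * 4  # (3.0, 0.5) for every column
tile = [[0.0] * 4 for _ in range(4)]
bfmopa(tile, [True] * 8, [True] * 8, zn_bits, zm_bits)
```

Each tile element in this sample is the two-term dot product 1.0 x 3.0 + 2.0 x 0.5.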
Each source vector is independently predicated by a corresponding governing predicate:
Predication makes it easier to handle cases where the matrix dimension is not an exact multiple of SVL.
For example, consider the following instruction:
The vector input Z0 is predicated by P0, and Z1 is predicated by P1.
In this example:
This instruction updates (16 - 2) x (16 - 1) FP32 elements in the ZA0.S tile, leaving the remaining elements of the ZA0.S tile unchanged because merging predication (/M) is used.
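The element count can be checked with a small Python sketch. It assumes an SVL of 512 bits (16 FP32 elements per vector), two inactive elements in P0, and one inactive element in P1. Which positions are inactive is an arbitrary choice made here for illustration, and `updated_positions` is a hypothetical helper.

```python
def updated_positions(pn, pm):
    """Return the (row, column) tile positions an outer product instruction updates."""
    return [(r, c)
            for r, row_active in enumerate(pn) if row_active
            for c, col_active in enumerate(pm) if col_active]


elements = 512 // 32                          # 16 FP32 elements, assuming SVL = 512 bits
p0 = [True] * (elements - 2) + [False] * 2    # assumed: last two elements inactive
p1 = [True] * (elements - 1) + [False]        # assumed: last element inactive
count = len(updated_positions(p0, p1))        # (16 - 2) * (16 - 1)
```

Rows whose P0 element is inactive and columns whose P1 element is inactive never appear in the returned list, which corresponds to the tile elements that are left unchanged.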
The following figures show more examples of predicated outer product instructions. Strikethrough text shows the calculation components that are affected by the inactive predicate elements:
SME includes instructions that add horizontal or vertical vector elements to a ZA tile with predication support.
For example:
ADDHA ZA0.S, P0/M, P1/M, Z1.S
This performs the following operation:
This ADDHA instruction adds each element of the source vector Z1 to the corresponding active element of each horizontal slice of the ZA0.S tile.
The tile elements are predicated by a pair of governing predicates. An element of a horizontal slice is considered active if its corresponding element in the second governing predicate is TRUE and the element corresponding to its horizontal slice number in the first governing predicate is TRUE. Inactive elements in the destination tile remain unmodified.
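The predication rule above can be modelled in a few lines of Python. The function name `addha` and the list-based tile are assumptions for this sketch. The first predicate selects horizontal slices (rows), and the second selects elements within each slice (columns).

```python
def addha(za, pn, pm, zn):
    """Model ADDHA on a 32-bit element tile, updated in place.

    pn: first governing predicate, indexed by horizontal slice number.
    pm: second governing predicate, indexed by element within the slice.
    """
    for row in range(len(za)):
        for col in range(len(zn)):
            if pn[row] and pm[col]:
                za[row][col] = (za[row][col] + zn[col]) & 0xFFFFFFFF
            # Inactive tile elements remain unmodified.
    return za


# Assumed SVL of 128 bits: 4x4 tile of 32-bit elements.
tile = [[0] * 4 for _ in range(4)]
addha(tile, [True, False, True, True], [True, True, False, True], [1, 2, 3, 4])
```

In this sample, row 1 is untouched because its element in the first predicate is false, and column 2 is untouched in every row because its element in the second predicate is false.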
SME load, store, and move instructions do the following:
The LD1B, LD1H, LD1S, LD1D, and LD1Q instructions load contiguous memory values to a ZA tile slice with 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit elements respectively.
The ST1B, ST1H, ST1S, ST1D, and ST1Q instructions store a ZA tile slice with 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit elements respectively to contiguous memory.
These instructions also have predication support, for example:
LD1B ZA0H.B[W12, #imm], P0/Z, [X1, X2]
This LD1B instruction performs a predicated contiguous load of bytes from memory address (X1+X2) to the ZA0 horizontal tile slice with slice index (W12+imm). The slice index register must be one of W12 to W15. Inactive elements are set to zero in the destination tile slice.
ST1H ZA1V.H[W12, #imm], P2, [X1, X2, LSL #1]
This ST1H instruction performs a predicated contiguous store of halfwords from the ZA1 vertical tile slice with slice index (W12+imm) to memory starting at (X1+X2*2). The slice index register must be one of W12 to W15. Inactive elements are not written to memory.
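The different treatment of inactive elements by loads and stores can be sketched in Python. The names `ld1_horizontal` and `st1_vertical` are hypothetical, and memory is modelled as a list with one entry per element rather than per byte. The slice number is taken as the index register value plus the immediate, modulo the tile dimension.

```python
def ld1_horizontal(tile, slice_index, pred, memory, addr):
    """Load one horizontal tile slice from memory; inactive elements become zero."""
    dim = len(tile)
    row = slice_index % dim
    for i in range(dim):
        tile[row][i] = memory[addr + i] if pred[i] else 0


def st1_vertical(tile, slice_index, pred, memory, addr):
    """Store one vertical tile slice to memory; inactive elements are not written."""
    dim = len(tile)
    col = slice_index % dim
    for i in range(dim):
        if pred[i]:
            memory[addr + i] = tile[i][col]


tile = [[0] * 4 for _ in range(4)]
memory = list(range(100, 116))                 # hypothetical memory contents
ld1_horizontal(tile, 1, [True, False, True, True], memory, 4)

out = [-1] * 4                                 # -1 marks locations never written
st1_vertical(tile, 0, [True, True, False, True], out, 0)
```

After the load, the second element of the slice is zero because its predicate element is false. After the store, the third location in `out` still holds its original value.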
The MOV instruction (an alias of MOVA) moves a Z vector register to a ZA tile slice, or moves a ZA tile slice to a Z vector register. The instruction operates on individual horizontal or vertical slices within a named ZA tile of the specified element size. The slice number within the tile is selected by the sum of the slice index register and immediate offset. Inactive elements in the destination slice remain unmodified.
MOV ZA0H.B[W12, #imm], P0/M, Z0.B
or:
MOVA ZA0H.B[W12, #imm], P0/M, Z0.B
This instruction moves the vector register Z0.B to the horizontal ZA tile slice, ZA0H.B[W12, #imm], using P0 as the predication register. The slice index register must be one of W12 to W15. Inactive elements in the destination tile slice remain unmodified.
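A minimal Python model of both move directions is given below. The function names `mova_to_tile` and `mova_to_vector` and the list-based registers are assumptions for this sketch. In both directions, inactive destination elements keep their previous values.

```python
def mova_to_tile(tile, slice_index, pred, zn, vertical=False):
    """Move vector zn into a horizontal (or vertical) tile slice with merging predication."""
    dim = len(tile)
    s = slice_index % dim
    for i in range(dim):
        if pred[i]:
            if vertical:
                tile[i][s] = zn[i]
            else:
                tile[s][i] = zn[i]


def mova_to_vector(zd, tile, slice_index, pred, vertical=False):
    """Move a tile slice into vector zd with merging predication."""
    dim = len(tile)
    s = slice_index % dim
    for i in range(dim):
        if pred[i]:
            zd[i] = tile[i][s] if vertical else tile[s][i]


tile = [[0] * 4 for _ in range(4)]
mova_to_tile(tile, 5, [True, False, True, True], [9, 8, 7, 6])  # slice 5 mod 4 = 1

zd = [1, 1, 1, 1]
mova_to_vector(zd, tile, 1, [True, True, False, True])
```

The slice index of 5 in the sample wraps to slice 1 because the modelled tile has only four slices.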
The SME LDR instruction loads memory to a ZA array vector, and the SME STR instruction stores a ZA array vector to memory.
These instructions are unpredicated. For the purpose of context switching, they can be used in Non-streaming SVE mode when PSTATE.ZA is enabled.
For example, in the following STR instruction, the ZA array vector is selected by the sum of the vector select register and an optional immediate value. The memory address is generated from a scalar base register plus the same optional immediate offset multiplied by the current vector length in bytes:
STR ZA[<Wv>, <imm>], [<Xn|SP>{, #<imm>, MUL VL}]
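The operand calculation described above can be sketched in Python. The function name `str_za_operands` and the sample register values are hypothetical. The sketch assumes that the selected vector number wraps modulo the number of ZA array vectors, which is SVL/8.

```python
def str_za_operands(wv, xn, imm, svl_bits):
    """Return (ZA array vector number, memory address) for a modelled STR ZA."""
    svl_bytes = svl_bits // 8            # vector length in bytes, also the vector count
    vector = (wv + imm) % svl_bytes      # vector select register plus immediate
    address = xn + imm * svl_bytes       # base plus immediate scaled by vector length
    return vector, address


# Hypothetical values: vector select register holds 3, base address 0x1000, SVL = 256 bits.
vector, address = str_za_operands(wv=3, xn=0x1000, imm=2, svl_bits=256)
```

The same immediate is used twice: once to step through ZA array vectors and once, scaled by the vector length, to step through memory. This lets a short sequence of STR instructions save consecutive vectors to consecutive memory blocks.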
The ZERO instruction zeroes a list of 64-bit element ZA tiles:
ZERO { <mask> }
The ZERO instruction zeroes up to eight tiles named ZA0.D to ZA7.D, as specified by mask, leaving the other tiles unmodified.
The instruction can be used in Non-streaming SVE mode when PSTATE.ZA is enabled.
To zero the entire ZA array, use the instruction alias ZERO {ZA}.
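The effect of the mask can be modelled in Python as follows. The function name `zero` and the list-of-lists ZA array are assumptions for this sketch. It relies on the layout in which the 64-bit element tile ZAn.D consists of the ZA array vectors whose number is congruent to n modulo 8.

```python
def zero(za_array, mask):
    """Zero the 64-bit element tiles selected by an 8-bit mask, in place.

    za_array: list of SVL/8 vectors, each a list of SVL/8 byte values.
    Bit n of mask selects tile ZAn.D.
    """
    for v in range(len(za_array)):
        if (mask >> (v % 8)) & 1:
            za_array[v] = [0] * len(za_array[v])


# SVL = 128 bits: ZA is 16 vectors of 16 bytes.
za = [[1] * 16 for _ in range(16)]
zero(za, 0b00000101)                     # select ZA0.D and ZA2.D
zeroed = [v for v in range(16) if not any(za[v])]
```

In this model, a mask with all eight bits set clears every vector, which corresponds to zeroing the entire ZA array.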
The SME architecture adds several new SVE2 instructions. These instructions are also usable when the PE is in Non-streaming SVE mode, if SVE2 is implemented. These instructions include:
The PSEL instruction selects between a predicate register and all-false, as follows:
PSEL <Pd>, <Pn>, <Pm>.<T>[<Wv>, <imm>]
If the indexed element of the second source predicate is true, the instruction places the contents of the first source predicate register into the destination predicate register, otherwise it sets the destination predicate to all-false.
For example, consider the following instruction assuming W12 is 0:
PSEL P0, P1, P2.B[W12, #0]
The element [W12+0] of the second source predicate P2.B is false. Therefore P0 is set to all zeros, as shown in the following figure:
Now consider the following instruction, still assuming W12 is 0 but this time the immediate offset is 1:
PSEL P0, P1, P2.B[W12, #1]
The element [W12+1] of the second source predicate P2.B is true. Therefore P0 is set to the contents of the first source predicate register P1, as shown in the following figure:
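The two cases above can be reproduced with a short Python model. The function name `psel` and the sample predicate contents are assumptions, chosen so that element 0 of the second source predicate is false and element 1 is true, as in the text. The element index is taken modulo the number of predicate elements.

```python
def psel(pn, pm, wv, imm):
    """Return pn if the indexed element of pm is true, otherwise all-false."""
    index = (wv + imm) % len(pm)
    return list(pn) if pm[index] else [False] * len(pn)


# Hypothetical predicate contents for a 128-bit SVL with byte elements.
p1 = [True, False] * 8
p2 = [False, True] + [False] * 14
w12 = 0

first = psel(p1, p2, w12, 0)   # element 0 of p2 is false
second = psel(p1, p2, w12, 1)  # element 1 of p2 is true
```

Here `first` is all-false and `second` is a copy of `p1`.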
For more detailed information about SME, see the Arm Architecture Reference Manual for A-profile architecture.
To discover how you can use SME in your applications to efficiently work with matrices and other forms of data, see the SME Programmer's Guide.