Part 1: Arm Scalable Matrix Extension (SME) Introduction

May 23, 2024

Arm Scalable Matrix Extension (SME) is an architecture extension that provides enhanced support for matrix operations. SME builds on the Scalable Vector Extensions (SVE and SVE2), adding new capabilities to efficiently process matrices. Key features include:

Outer product between two vectors
Matrix tile storage
Load, store, insert, and extract of tile vectors, including on-the-fly transposition
Streaming SVE mode

The following table summarizes the main features of SME, SVE, and SVE2:

SME	SVE2	SVE
Streaming SVE mode	NEON DSP ++	Scalable vectors
On-the-fly matrix transposition	Multi-precision arithmetic	Per-lane predication
Matrix outer product of vectors	Match detect and histogram	Gather-load and Scatter-store
Load, store, insert, and extract of matrix vectors	Non-temporal scatter/gather	Speculative vectorization
	Bitwise permute	ML extensions (FP16 + DOT product)
	AES, SHA3, SM4, Crypto	v8.6 BF16, FP, and Int8 matmul

SME defines the following new features:

A new architectural state capable of holding two-dimensional matrix tiles.
Streaming SVE mode which supports execution of SVE2 instructions with a vector length that matches the tile width.
New instructions that accumulate the outer product of two vectors into a tile.
New load, store, and move instructions that transfer a vector to or from a tile row or column.

Like SVE2, SME is a scalable vector length extension that enables Vector Length Agnostic (VLA) programming, per-lane predication, predicate-driven loop control, and management features.

Streaming SVE mode

SME introduces Streaming SVE mode, which implements a subset of the SVE2 instruction set and adds new SME-specific instructions.

Streaming SVE mode supports high-throughput “streaming” data processing of large datasets, where the data being streamed typically has simple loop control flow and limited conditionality.

Non-streaming SVE mode supports the full SVE2 instruction set, where general-purpose code typically handles complex data structures and complex predication.

Most SME instructions are only available in Streaming SVE mode. The streaming vector length (SVL) in Streaming SVE mode might be different from the Non-streaming vector length (NSVL).

It is anticipated that the Streaming Vector Length (SVL) is greater than or equal to the Non-streaming vector length (NSVL), that is SVL ≥ NSVL. For example, NSVL might be 128-bit in Non-streaming SVE mode, and SVL might be 512-bit in Streaming SVE mode.

The SME SVL can be 128-bit, 256-bit, 512-bit, 1024-bit or 2048-bit.

The effective streaming vector length (SVL) is controlled by SMCR_ELx.LEN at EL1, EL2, and EL3.

For more information about Streaming SVE mode, see section B1.4.6 in the Arm Architecture Reference Manual for A-profile architecture.

Switching between Non-streaming and Streaming SVE mode

If both SME and SVE2 are implemented, an application can switch between operating modes depending on what is needed.

Having a separate mode for SME operations allows an implementation to support different vector lengths for streaming and non-streaming processing within the same application. For example, an implementation might choose to support a larger vector length in Streaming SVE mode, with the hardware optimized for streaming, throughput-oriented operation.

Applications can easily switch dynamically between Streaming SVE mode and Non-streaming SVE mode. New PSTATE.{SM, ZA} bits enable and disable Streaming SVE mode and SME ZA storage:

SM: Enable and disable Streaming SVE mode
ZA: Enable and disable ZA storage access

Use the MSR instruction to modify the PSTATE.{SM, ZA} bits in the Streaming Vector Control Register (SVCR) as follows:

MSR SVCRSM, #<imm>
MSR SVCRZA, #<imm>
MSR SVCRSMZA, #<imm>

The SMSTART instruction is an alias of the MSR instruction that can set PSTATE.SM and PSTATE.ZA:

SMSTART: Enable both Streaming SVE mode and ZA storage access
SMSTART SM: Enable Streaming SVE mode
SMSTART ZA: Enable ZA storage access

The SMSTOP instruction is an alias of the MSR instruction that can clear PSTATE.SM and PSTATE.ZA:

SMSTOP: Disable both Streaming SVE mode and ZA storage access
SMSTOP SM: Disable Streaming SVE mode
SMSTOP ZA: Disable ZA storage access

The following figure shows how an application can switch between Streaming SVE mode and Non-streaming SVE mode.

For more information about using SMSTART and SMSTOP to switch between Streaming SVE mode and Non-streaming SVE mode, see sections C6.2.327 and C6.2.328 in the Arm Architecture Reference Manual for A-profile architecture.

SME architecture state

As with SVE2, Streaming SVE mode provides vector registers Z0-Z31 of size SVL and predicate registers P0-P15.

The lowest numbered bits of the SVE vector registers, Zn, also hold the fixed-length Vn, Qn, Dn, Sn, Hn, and Bn registers.

When entering Streaming SVE mode (PSTATE.SM is changed from 0 to 1) or exiting Streaming SVE mode (PSTATE.SM is changed from 1 to 0), all of these registers are set to zero.
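The mode switch and its effect on register contents can be illustrated with a small software model. The following Python sketch is not how hardware or any Arm tool implements this. The SmeState class, its fields, and the use of plain integers as register contents are hypothetical names chosen for illustration. It models only what is described above: SMSTART and SMSTOP set and clear PSTATE.SM and PSTATE.ZA, and any change of PSTATE.SM zeroes the Z and P registers.

```python
from dataclasses import dataclass, field


@dataclass
class SmeState:
    """Toy model of PSTATE.SM, PSTATE.ZA, and the Z/P register files."""
    sm: bool = False   # PSTATE.SM: Streaming SVE mode enabled
    za: bool = False   # PSTATE.ZA: ZA storage access enabled
    z_regs: list = field(default_factory=lambda: [0] * 32)  # Z0-Z31
    p_regs: list = field(default_factory=lambda: [0] * 16)  # P0-P15

    def _set_sm(self, value: bool) -> None:
        # Any change of PSTATE.SM (0 -> 1 or 1 -> 0) zeroes Z0-Z31 and P0-P15.
        if value != self.sm:
            self.z_regs = [0] * 32
            self.p_regs = [0] * 16
        self.sm = value

    def smstart(self, sm: bool = True, za: bool = True) -> None:
        """SMSTART (default), SMSTART SM (za=False), SMSTART ZA (sm=False)."""
        if sm:
            self._set_sm(True)
        if za:
            self.za = True

    def smstop(self, sm: bool = True, za: bool = True) -> None:
        """SMSTOP (default), SMSTOP SM (za=False), SMSTOP ZA (sm=False)."""
        if sm:
            self._set_sm(False)
        if za:
            self.za = False


state = SmeState()
state.z_regs[3] = 0x1234      # live value in Z3 while in Non-streaming SVE mode
state.smstart(za=False)       # SMSTART SM
entered = (state.sm, state.za, state.z_regs[3])   # Z3 was zeroed on entry

state.smstart(sm=False)       # SMSTART ZA: SM is unchanged, so no zeroing
state.z_regs[0] = 7
state.smstop(za=False)        # SMSTOP SM: Z0 is zeroed again, ZA stays enabled
```

In this model, any value that must survive a mode switch has to be kept somewhere other than the Z and P registers, for example in memory.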

Most Non-streaming SVE2 instructions can be used in Streaming SVE mode, but might use a different vector length. In Streaming SVE mode the effective vector length is SVL, which can be read using the RDSVL instruction.

// Read multiple of Streaming SVE vector register size to Xd
RDSVL <Xd>, #<imm>

Note that software rarely needs to explicitly read SVL in Streaming SVE mode. The RDSVL instruction is usually used to determine the value of SVL when in Non-streaming mode.
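As a rough illustration of what RDSVL returns, the following Python sketch models the result as the immediate multiplied by the streaming vector length in bytes. The function name rdsvl is hypothetical, and the immediate range of -32 to 31 is an assumption (a signed 6-bit value) that should be checked against the Arm Architecture Reference Manual.

```python
VALID_SVL_BITS = (128, 256, 512, 1024, 2048)


def rdsvl(svl_bits: int, imm: int) -> int:
    """Model of RDSVL <Xd>, #<imm>: returns imm * (SVL in bytes)."""
    if svl_bits not in VALID_SVL_BITS:
        raise ValueError("SVL must be 128, 256, 512, 1024, or 2048 bits")
    # Assumption: the immediate is a signed 6-bit value.
    if not -32 <= imm <= 31:
        raise ValueError("imm is assumed to be in the range -32 to 31")
    return imm * (svl_bits // 8)


svl_bytes = rdsvl(256, 1)    # 32: one vector of 256 bits is 32 bytes
four_vecs = rdsvl(512, 4)    # 256: four vectors of 64 bytes each
```

With an immediate of 1 the result is simply SVL in bytes, which is the usual way to size buffers or loop strides for streaming code.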

ZA array

The new SME ZA (Z Array) storage is a 2D square byte array of dimension (SVL in bytes) x (SVL in bytes).

For example, if the vector length in Streaming SVE mode is 256-bit (32-byte), then the size of the ZA storage is 32x32=1024 bytes.
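The same arithmetic can be applied to every permitted SVL. The following Python sketch is only a worked calculation; the function name za_size_bytes is hypothetical.

```python
VALID_SVL_BITS = (128, 256, 512, 1024, 2048)


def za_size_bytes(svl_bits: int) -> int:
    """Size of ZA storage: (SVL in bytes) x (SVL in bytes)."""
    if svl_bits not in VALID_SVL_BITS:
        raise ValueError("SVL must be 128, 256, 512, 1024, or 2048 bits")
    svl_bytes = svl_bits // 8
    return svl_bytes * svl_bytes


for svl in VALID_SVL_BITS:
    side = svl // 8
    print(f"SVL {svl}-bit: ZA is {side} x {side} = {za_size_bytes(svl)} bytes")
```

Because ZA is square, doubling SVL quadruples the ZA storage, from 256 bytes at 128-bit SVL up to 64KB at 2048-bit SVL.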

The ZA array can be accessed as:

ZA array vectors
ZA tiles
ZA tile slices

ZA array vector access

The ZA array can be accessed as SVL-bit vectors that contain 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit elements:

 ZA.B[N], ZA.H[N], ZA.S[N], ZA.D[N], ZA.Q[N]

The number of ZA array vectors is the number of bytes in SVL. For example, if SVL is 256-bit, then the number of ZA array vectors is 32.

For the purpose of context switching, new SME LDR and STR instructions were introduced to load and store ZA array vectors from and to memory:

LDR ZA[<Wv>, <imm>], [<Xn|SP>{, #<imm>, MUL VL}]
STR ZA[<Wv>, <imm>], [<Xn|SP>{, #<imm>, MUL VL}]

ZA tiles

A ZA tile is a square, 2D sub-array of elements within the ZA array. The width of a ZA tile is always SVL bits, the same width as the ZA array.

The number of tiles available is determined by the element data type size.

Element data type size	Number of tiles	Tile names
8-bit	1	ZA0.B
16-bit	2	ZA0.H-ZA1.H
32-bit	4	ZA0.S-ZA3.S
64-bit	8	ZA0.D-ZA7.D
128-bit	16	ZA0.Q-ZA15.Q

When the element data type is 8-bit, ZA can be accessed as only one tile, ZA0.B.
When the element data type is 16-bit, ZA can be accessed as two tiles, ZA0.H-ZA1.H.
When the element data type is 32-bit, ZA can be accessed as four tiles, ZA0.S-ZA3.S.
When the element data type is 64-bit, ZA can be accessed as eight tiles, ZA0.D-ZA7.D.
When the element data type is 128-bit, ZA can be accessed as sixteen tiles, ZA0.Q-ZA15.Q.

For example, if the SVL is 256-bit (32-byte) and the element data type size is 8-bit, ZA can be viewed as ZA0.B, and as 32 x (32 x 1-byte) vectors.

If the SVL is 256-bit (32-byte) and the element data type size is 16-bit, ZA can be viewed as two tiles, ZA0.H and ZA1.H, and each tile can be viewed as 16 x (16 x 2-byte) vectors.
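The relationship between SVL, element size, tile count, and tile dimensions can be written as a short calculation. The following Python sketch reproduces the table and the two examples above; the names tile_layout and tile_names are hypothetical.

```python
ELEMENT_BYTES = {"B": 1, "H": 2, "S": 4, "D": 8, "Q": 16}


def tile_layout(svl_bits: int, suffix: str) -> tuple:
    """Return (number of tiles, elements per tile side) for an element type."""
    elem_bytes = ELEMENT_BYTES[suffix]
    svl_bytes = svl_bits // 8
    num_tiles = elem_bytes            # 1, 2, 4, 8, 16 tiles for B, H, S, D, Q
    dim = svl_bytes // elem_bytes     # each tile is dim x dim elements
    return num_tiles, dim


def tile_names(suffix: str) -> list:
    """Tile names for an element type, for example ZA0.H and ZA1.H."""
    return [f"ZA{i}.{suffix}" for i in range(ELEMENT_BYTES[suffix])]


print(tile_layout(256, "B"))   # (1, 32)
print(tile_layout(256, "H"))   # (2, 16)
print(tile_names("S"))         # ['ZA0.S', 'ZA1.S', 'ZA2.S', 'ZA3.S']
```

For every element size, the tiles together cover exactly the whole ZA storage: num_tiles x dim x dim x element bytes equals (SVL in bytes) squared.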

ZA tile access

A ZA tile can be accessed either as a whole, or as tile slices.

When accessing a ZA tile as a whole, an instruction can use the tile name.

ZA0.B, ZA0.H-ZA1.H, ZA0.S-ZA3.S, ZA0.D-ZA7.D or ZA0.Q-ZA15.Q

A ZA tile slice is a 1D set of horizontally or vertically contiguous elements within a ZA tile.

A vector access to a tile reads or writes a ZA tile slice:

Horizontal or vertical tile slice is indicated by an H or V suffix on the ZA tile name
The tile slice is indicated by a slice index [N] to the ZA tile name
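Horizontal and vertical slice access is what makes on-the-fly transposition possible: data written as horizontal slices can be read back as vertical slices. The following Python sketch models a single tile as a 2D list. The functions write_slice and read_slice are hypothetical helpers, not SME instructions, and the 4 x 4 tile corresponds to a 32-bit element tile at a 128-bit SVL.

```python
def write_slice(tile, direction, index, vector):
    """Write a vector into horizontal ("H") or vertical ("V") slice `index`."""
    if direction not in ("H", "V"):
        raise ValueError("direction must be 'H' or 'V'")
    if len(vector) != len(tile):
        raise ValueError("vector length must match the tile dimension")
    for i, value in enumerate(vector):
        if direction == "H":
            tile[index][i] = value    # row `index`, every column
        else:
            tile[i][index] = value    # column `index`, every row


def read_slice(tile, direction, index):
    """Read horizontal ("H") or vertical ("V") slice `index` as a vector."""
    if direction not in ("H", "V"):
        raise ValueError("direction must be 'H' or 'V'")
    if direction == "H":
        return list(tile[index])
    return [row[index] for row in tile]


dim = 4
matrix = [[r * dim + c for c in range(dim)] for r in range(dim)]
tile = [[0] * dim for _ in range(dim)]

# Write each matrix row as a horizontal slice...
for r in range(dim):
    write_slice(tile, "H", r, matrix[r])

# ...then read vertical slices to obtain the rows of the transposed matrix.
transposed = [read_slice(tile, "V", c) for c in range(dim)]
```

No separate transpose pass over the data is needed in this model; the choice of slice direction on the read side does the work.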

For example, the following diagram shows the ZA tile ZA0.B if the SVL is 128 bits. ZA0V.B[0] and ZA0V.B[13] access vertical tile slices, and ZA0H.B[0] and ZA0H.B[15] access horizontal tile slices.

The following diagram shows another example of tile slice access where the SVL is 128 bits and the element type size is 16-bit:

For efficiency of hardware access to ZA tiles and tile slices, the tile slices of ZA tiles are interleaved.

The following diagram shows an example of this interleaving. In this example, SVL is 256-bit and the element data type size is 16-bit. This means that ZA can be viewed as two ZA tiles, ZA0.H and ZA1.H, with interleaved horizontal tile slices:

The following diagram shows an example of the combined view of different element data type sizes, and horizontal and vertical tile slices:

The columns on the left show the different ways in which each row of the ZA storage can be addressed.

Let 'SIZE' be the size of a vector element in bytes, where SIZE is 1, 2, 4, 8, or 16 for data type B, H, S, D, or Q respectively.

Let 'NUM_OF_ELEMENTS' be the number of elements in a vector, that is bytes_of(SVL)/SIZE.

ZAnH.<B|H|S|D|Q>[m] accesses a vector which contains the full row (m*SIZE+n) in the ZA storage. This vector contains elements with data type B, H, S, D or Q.

ZAnV.<B|H|S|D|Q>[m] accesses a vector which contains elements from the column (m*SIZE) and rows (i*SIZE+n), where i ranges from 0 to NUM_OF_ELEMENTS-1. This vector contains elements with data type B, H, S, D or Q.

Software that mixes element data type sizes, or horizontal and vertical tile slices, must take care, because tiles and tile slices of different element sizes overlap in the ZA storage.
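The slice addressing formulas and the resulting overlap between tiles can be checked with a short calculation. The following Python sketch applies the formulas above directly. The function names are hypothetical, and treating the column (m*SIZE) as a byte offset within a ZA row is an interpretation of the text rather than something it states.

```python
ELEMENT_SIZE = {"B": 1, "H": 2, "S": 4, "D": 8, "Q": 16}   # SIZE in bytes


def _check(n, m, suffix, svl_bits):
    """Validate tile number n and slice index m; return (SIZE, NUM_OF_ELEMENTS)."""
    size = ELEMENT_SIZE[suffix]
    num_elements = (svl_bits // 8) // size
    if not 0 <= n < size:
        raise ValueError("tile number out of range for this element size")
    if not 0 <= m < num_elements:
        raise ValueError("slice index out of range for this SVL")
    return size, num_elements


def horizontal_slice_row(n, m, suffix, svl_bits):
    """ZA storage row accessed by ZAnH.<suffix>[m]: m*SIZE + n."""
    size, _ = _check(n, m, suffix, svl_bits)
    return m * size + n


def vertical_slice_cells(n, m, suffix, svl_bits):
    """(row, byte column) of each element accessed by ZAnV.<suffix>[m]."""
    size, num_elements = _check(n, m, suffix, svl_bits)
    return [(i * size + n, m * size) for i in range(num_elements)]


def tile_rows(n, suffix, svl_bits):
    """Set of ZA storage rows that belong to tile ZAn.<suffix>."""
    _, num_elements = _check(n, 0, suffix, svl_bits)
    return {horizontal_slice_row(n, m, suffix, svl_bits)
            for m in range(num_elements)}


print(horizontal_slice_row(1, 3, "H", 256))    # 7: ZA1H.H[3] is ZA row 7

# With a 256-bit SVL, ZA0.S occupies rows 0, 4, 8, ... which all lie in ZA0.H.
overlap = tile_rows(0, "S", 256) & tile_rows(0, "H", 256)
```

In this model, ZA0.H covers the same rows as ZA0.S and ZA2.S together, so writing to either 32-bit tile also changes the contents seen through the 16-bit tile ZA0.H.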

For more information about the ZA array, array vectors, tiles, and tile slices, see sections B1.4.8 to B1.4.12 in the Arm Architecture Reference Manual for A-profile architecture.

Instructions supported in Streaming SVE mode

Some instructions have restrictions in Streaming SVE mode:

Some SVE2 instructions become illegal to execute in Streaming SVE mode
- Gather/scatter load/store SVE2 instructions
- SVE2 instructions that use First Fault Register
Most NEON instructions become UNDEFINED

For more information about instructions affected by Streaming SVE mode, see the Arm Architecture Reference Manual for A-profile architecture.

SME adds several new instructions, including the following:

Matrix outer product and accumulate or subtract instructions, including FMOPA, UMOPA, and BFMOPA.
- SVE2 vector registers (Z0-Z31) are the column and row inputs to the outer product instructions
- ZA storage holds the output of 2D matrix tiles
Addition of SVE2 Z vectors to ZA rows/columns
Zeroing ZA tiles
A few new instructions that can be used in both Streaming and Non-streaming SVE mode
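The outer product and accumulate operation at the heart of SME can be sketched in a few lines. The following Python model is a simplification: it ignores predication, element types, and the widening variants. The function names mopa and matmul_by_outer_products are hypothetical. Building a matrix multiplication from a sum of outer products is a standard identity, used here as an assumed example of how such instructions are typically applied.

```python
def mopa(tile, zn, zm, subtract=False):
    """Accumulate (or subtract) the outer product: tile[i][j] += zn[i] * zm[j]."""
    sign = -1 if subtract else 1
    for i, a in enumerate(zn):
        for j, b in enumerate(zm):
            tile[i][j] += sign * a * b
    return tile


def matmul_by_outer_products(a, b):
    """Compute A x B as a sum of outer products, one per step k.

    Step k multiplies column k of A by row k of B and accumulates the
    result into the tile, which plays the role of a ZA tile.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    tile = [[0] * cols for _ in range(rows)]
    for k in range(inner):
        col_k = [a[i][k] for i in range(rows)]   # column input vector
        row_k = b[k]                             # row input vector
        mopa(tile, col_k, row_k)
    return tile


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matmul_by_outer_products(a, b)   # [[19, 22], [43, 50]]
```

Each call to mopa updates every element of the tile from just two input vectors, which is why the accumulator lives in the 2D ZA storage rather than in the vector registers.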
