Arm Scalable Matrix Extension (SME) is an architecture extension that provides enhanced support for matrix operations. SME builds on the Scalable Vector Extensions (SVE and SVE2), adding new capabilities to efficiently process matrices. Key features include:
The following table summarizes the main features of SME, SVE, and SVE2:
SME defines the following new features:
Like SVE2, SME is a scalable vector length extension that enables Vector Length Agnostic (VLA) programming, per-lane predication, predicate-driven loop control, and management features.
SME introduces Streaming SVE mode, which implements a subset of the SVE2 instruction set and adds new SME-specific instructions.
Streaming SVE mode supports high-throughput “streaming” data processing of large datasets, where the data being streamed typically has simple loop control flow and limited conditionality.
Non-streaming SVE mode supports the full SVE2 instruction set, where general-purpose code typically handles complex data structures and complex predication.
Most SME instructions are only available in Streaming SVE mode. The streaming vector length (SVL) in Streaming SVE mode might be different from the Non-streaming vector length (NSVL).
It is anticipated that SVL is greater than or equal to NSVL, that is, SVL ≥ NSVL. For example, NSVL might be 128-bit in Non-streaming SVE mode, and SVL might be 512-bit in Streaming SVE mode.
The SME SVL can be 128-bit, 256-bit, 512-bit, 1024-bit or 2048-bit.
The effective streaming vector length (SVL) is controlled by SMCR_ELx.LEN at EL1, EL2, and EL3.
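As a rough illustration of how a LEN value could translate into an effective SVL, the following Python sketch models the selection. The `(LEN + 1) * 128` encoding and the round-down rule are assumptions recalled from the architecture manual rather than statements made in this article, and `supported_svls` is a hypothetical implementation choice. The sketch also ignores the constraint that SMCR_ELx at higher Exception levels places on lower ones.

```python
# Sketch: how a requested LEN value could map to an effective SVL.
# ASSUMPTIONS (not from the article): the (LEN + 1) * 128 encoding and the
# round-down rule; `supported_svls` is a hypothetical implementation choice.

ARCHITECTURAL_SVLS = (128, 256, 512, 1024, 2048)  # bits, as listed in the article


def requested_svl_bits(len_field: int) -> int:
    """Requested SVL in bits for a 4-bit LEN value (assumed encoding)."""
    if not 0 <= len_field <= 15:
        raise ValueError("LEN is modeled as a 4-bit field")
    return (len_field + 1) * 128


def effective_svl_bits(len_field: int, supported_svls=ARCHITECTURAL_SVLS) -> int:
    """Pick the largest supported SVL that does not exceed the request.

    Falls back to the smallest supported SVL if the request is below it.
    """
    requested = requested_svl_bits(len_field)
    candidates = [s for s in supported_svls if s <= requested]
    return max(candidates) if candidates else min(supported_svls)


if __name__ == "__main__":
    for len_field in (0, 1, 2, 3, 7, 15):
        print(len_field, requested_svl_bits(len_field), effective_svl_bits(len_field))
```

In this model, a request that is not one of the permitted lengths (for example LEN=2, which asks for 384 bits) resolves to the next lower supported length.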
For more information about Streaming SVE mode, see section B1.4.6 in the Arm Architecture Reference Manual for A-profile architecture.
If both SME and SVE2 are implemented, an application can switch between operating modes depending on what is needed.
Having a separate mode for SME operations allows an implementation to support different vector lengths for streaming and non-streaming processing within the same application. For example, an implementation might choose to support a larger vector length in Streaming SVE mode, with the hardware optimized for streaming, throughput-oriented operation.
Applications can easily switch dynamically between Streaming SVE mode and Non-streaming SVE mode. New PSTATE.{SM, ZA} bits enable and disable Streaming SVE mode and SME ZA storage:
Use the MSR instruction to modify the PSTATE.{SM, ZA} bits in the Streaming Vector Control Register (SVCR) as follows:
The SMSTART instruction is an alias of the MSR instruction that can set PSTATE.SM and PSTATE.ZA:
The SMSTOP instruction is an alias of the MSR instruction that can clear PSTATE.SM and PSTATE.ZA:
The following figure shows how an application can switch between Streaming SVE mode and Non-streaming SVE mode.
For more information about using SMSTART and SMSTOP to switch between Streaming SVE mode and Non-streaming SVE mode, see sections C6.2.327 and C6.2.328 in the Arm Architecture Reference Manual for A-profile architecture.
As with SVE2, Streaming SVE mode provides vector registers Z0-Z31 of size SVL and predicate registers P0-P15.
The lowest numbered bits of the SVE vector registers, Zn, also hold the fixed-length Vn, Qn, Dn, Sn, Hn, and Bn registers.
When entering Streaming SVE mode (PSTATE.SM is changed from 0 to 1) or exiting Streaming SVE mode (PSTATE.SM is changed from 1 to 0), all of these registers are set to zero.
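The mode switching described above can be pictured with a small state model. This is only a sketch: the class name `SmeStateModel`, the `which` argument, and the default vector lengths are hypothetical, and the SVCR bit positions (SM in bit 0, ZA in bit 1) are an assumption not stated in this article. The model shows the two control bits and the zeroing of the Z and P registers whenever PSTATE.SM changes.

```python
class SmeStateModel:
    """Toy model of PSTATE.{SM, ZA} and the Z/P register reset on mode change."""

    def __init__(self, nsvl_bytes: int = 16, svl_bytes: int = 64):
        self.nsvl_bytes = nsvl_bytes  # Non-streaming vector length in bytes
        self.svl_bytes = svl_bytes    # Streaming vector length in bytes
        self.sm = 0                   # PSTATE.SM
        self.za = 0                   # PSTATE.ZA
        self._reset_vector_state()

    @property
    def vl_bytes(self) -> int:
        """Current effective vector length: SVL in streaming mode, else NSVL."""
        return self.svl_bytes if self.sm else self.nsvl_bytes

    def _reset_vector_state(self) -> None:
        # Z0-Z31 are VL bits wide; P0-P15 hold one bit per vector byte.
        self.z = [bytearray(self.vl_bytes) for _ in range(32)]
        self.p = [bytearray(self.vl_bytes // 8) for _ in range(16)]

    def _write(self, which: str, value: int) -> None:
        if which not in ("sm", "za", "both"):
            raise ValueError("which must be 'sm', 'za', or 'both'")
        if which in ("sm", "both") and self.sm != value:
            self.sm = value
            self._reset_vector_state()  # entering or exiting zeroes Z and P
        if which in ("za", "both"):
            self.za = value

    def smstart(self, which: str = "both") -> None:
        self._write(which, 1)

    def smstop(self, which: str = "both") -> None:
        self._write(which, 0)

    def svcr(self) -> int:
        """SVCR value, assuming SM is bit 0 and ZA is bit 1."""
        return self.sm | (self.za << 1)
```

For example, writing a value into `z[0]`, calling `smstart("sm")`, and reading `z[0]` again returns zeros, and the register is now `svl_bytes` wide instead of `nsvl_bytes`.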
Most Non-streaming SVE2 instructions can be used in Streaming SVE mode, but might use a different vector length. The current effective vector length, VL, can be read using the RDSVL instruction.
// Read multiple of Streaming SVE vector register size to Xd
RDSVL <Xd>, #<imm>
Note that software rarely needs to explicitly read SVL in Streaming SVE mode. The RDSVL instruction is usually used to determine the value of SVL when in Non-streaming SVE mode.
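To make the "multiple of the vector register size" wording concrete, here is a minimal Python model of the value RDSVL would place in Xd. The function name `rdsvl` is hypothetical, and the signed immediate range of -32 to 31 is an assumption taken from the instruction description in the architecture manual, not from this article.

```python
PERMITTED_SVL_BITS = (128, 256, 512, 1024, 2048)


def rdsvl(svl_bits: int, imm: int) -> int:
    """Model of RDSVL <Xd>, #<imm>: returns imm * (SVL in bytes).

    The result is returned as a signed Python int; in hardware it would be
    held in a 64-bit general-purpose register.
    """
    if svl_bits not in PERMITTED_SVL_BITS:
        raise ValueError("SVL must be 128, 256, 512, 1024, or 2048 bits")
    if not -32 <= imm <= 31:
        raise ValueError("imm is assumed to be a signed value in [-32, 31]")
    return (svl_bits // 8) * imm


if __name__ == "__main__":
    # With imm = 1 the result is simply SVL in bytes.
    print(rdsvl(512, 1))
```

With an immediate of 1 the result is SVL in bytes, which is the usual way to query the streaming vector length.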
The new SME ZA (Z Array) storage is a 2D square byte array of dimension (SVL in bytes) x (SVL in bytes).
For example, if the vector length in Streaming SVE mode is 256-bit (32-byte), then the size of the ZA storage is 32x32=1024 bytes.
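The same arithmetic can be applied to every permitted SVL. The following short Python sketch (the function name `za_size_bytes` is only illustrative) computes the ZA storage size from the vector length:

```python
def za_size_bytes(svl_bits: int) -> int:
    """ZA is a square byte array: (SVL in bytes) x (SVL in bytes)."""
    svl_bytes = svl_bits // 8
    return svl_bytes * svl_bytes


if __name__ == "__main__":
    for svl_bits in (128, 256, 512, 1024, 2048):
        print(f"SVL {svl_bits:4d}-bit -> ZA is {za_size_bytes(svl_bits)} bytes")
```

Because the array is square, doubling SVL quadruples the ZA storage: from 256 bytes at SVL=128 bits up to 64 KB at SVL=2048 bits.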
The ZA array can be accessed as:
The ZA array can be accessed as SVL-bit vectors that contain 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit elements:
ZA.B[N], ZA.H[N], ZA.S[N], ZA.D[N], ZA.Q[N]
The number of ZA array vectors is the number of bytes in SVL. For example, if SVL is 256-bit, then the number of ZA array vectors is 32.
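One way to picture ZA array vectors is as the rows of a byte matrix, where the element suffix only changes how a row is interpreted. The sketch below is a software model under stated assumptions: `make_za` and `view_vector` are hypothetical helpers, and elements are assumed to be laid out little-endian within a row.

```python
ELEMENT_BYTES = {"B": 1, "H": 2, "S": 4, "D": 8, "Q": 16}


def make_za(svl_bits: int) -> list:
    """Model ZA as SVL-bytes rows (array vectors), each SVL bytes long."""
    svl_bytes = svl_bits // 8
    return [bytearray(svl_bytes) for _ in range(svl_bytes)]


def view_vector(za: list, n: int, suffix: str) -> list:
    """Interpret ZA array vector n as elements of the given type.

    Models ZA.<suffix>[n]; little-endian element layout is an assumption.
    """
    size = ELEMENT_BYTES[suffix]
    row = za[n]
    return [
        int.from_bytes(row[i:i + size], "little")
        for i in range(0, len(row), size)
    ]


if __name__ == "__main__":
    za = make_za(256)
    za[3][0:4] = bytes([1, 2, 3, 4])
    print(len(za), view_vector(za, 3, "H")[:2])
```

The number of rows is fixed by SVL alone. Choosing B, H, S, D, or Q only changes how many elements each row is split into.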
For the purpose of context switching, new SME LDR and STR instructions were introduced to load and store ZA array vectors from and to memory:
A ZA tile is a square, 2D sub-array of elements within the ZA array. The width of a ZA tile is always SVL bits, the same width as the ZA array.
The number of tiles available is determined by the element data type size.
For example, if the SVL is 256-bit (32-byte) and the element data type size is 8-bit, ZA can be viewed as ZA0.B, and as 32 x (32 x 1-byte) vectors.
If the SVL is 256-bit (32-byte) and the element data type size is 16-bit, ZA can be viewed as two tiles, ZA0.H and ZA1.H, and each tile can be viewed as 16 x (16 x 2-byte) vectors.
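The two examples above follow one rule: the number of tiles equals the element size in bytes, and each tile is a square of (SVL in bytes / element size) elements per side. A minimal Python sketch, using a hypothetical helper named `tile_layout`:

```python
ELEMENT_BITS = {"B": 8, "H": 16, "S": 32, "D": 64, "Q": 128}


def tile_layout(svl_bits: int, suffix: str):
    """Return (tile names, elements per tile side) for one element type."""
    elem_bytes = ELEMENT_BITS[suffix] // 8
    svl_bytes = svl_bits // 8
    num_tiles = elem_bytes            # one tile per byte of element size
    dim = svl_bytes // elem_bytes     # tile is dim x dim elements
    names = [f"ZA{n}.{suffix}" for n in range(num_tiles)]
    return names, dim


if __name__ == "__main__":
    for suffix in ELEMENT_BITS:
        names, dim = tile_layout(256, suffix)
        print(f"{suffix}: {names[0]}..{names[-1]} ({len(names)} tiles of {dim}x{dim})")
```

Whatever the element type, the tiles together cover the whole ZA array: `num_tiles * dim * dim * elem_bytes` always equals (SVL in bytes) squared.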
A ZA tile can be accessed either as a whole, or as tile slices.
When accessing a ZA tile as a whole, an instruction can use the tile name.
ZA0.B, ZA0.H-ZA1.H, ZA0.S-ZA3.S, ZA0.D-ZA7.D or ZA0.Q-ZA15.Q
A ZA tile slice is a 1D set of horizontally or vertically contiguous elements within a ZA tile.
A vector access to a tile reads or writes a ZA tile slice:
For example, the following diagram shows the ZA tile ZA0.B if the SVL is 128 bits. ZA0V.B[0] and ZA0V.B[13] access vertical tile slices, and ZA0H.B[0] and ZA0H.B[15] access horizontal tile slices.
The following diagram shows another example of tile slice access where the SVL is 128 bits and the element type size is 16-bit:
For efficiency of hardware access to ZA tiles and tile slices, the tile slices of ZA tiles are interleaved.
The following diagram shows an example of this interleaving. In this example, SVL is 256-bit and the element data type size is 16-bit. This means that ZA can be viewed as two ZA tiles, ZA0.H and ZA1.H, with interleaved horizontal tile slices:
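The interleaving can also be listed programmatically. In this sketch, which follows the row formula (m*SIZE+n) given later in this article, each ZA row is mapped back to the tile and horizontal slice that own it. The helper names `row_owner` and `describe_rows` are hypothetical.

```python
ELEMENT_BYTES = {"B": 1, "H": 2, "S": 4, "D": 8, "Q": 16}


def row_owner(row: int, suffix: str):
    """Return (tile n, horizontal slice m) for a ZA row, from row = m*SIZE + n."""
    size = ELEMENT_BYTES[suffix]
    return row % size, row // size


def describe_rows(svl_bits: int, suffix: str, count: int) -> list:
    """Name the horizontal tile slice held by each of the first `count` rows."""
    svl_bytes = svl_bits // 8
    out = []
    for row in range(min(count, svl_bytes)):
        n, m = row_owner(row, suffix)
        out.append(f"ZA{n}H.{suffix}[{m}]")
    return out


if __name__ == "__main__":
    for row, name in enumerate(describe_rows(256, "H", 6)):
        print(f"ZA row {row}: {name}")
```

With 16-bit elements, even rows belong to tile 0 and odd rows to tile 1, so consecutive horizontal slices of one tile sit two rows apart.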
The following diagram shows an example of the combined view of different element data type sizes, and horizontal and vertical tile slices:
The columns on the left show the different ways in which each row of the ZA storage can be addressed.
Let SIZE be the size of a vector element in bytes, where SIZE is 1, 2, 4, 8, or 16 for data type B, H, S, D, or Q respectively.
Let NUM_OF_ELEMENTS be the number of elements in a vector, that is, bytes_of(SVL)/SIZE.
ZAnH.<B|H|S|D|Q>[m] accesses a vector that contains the full row (m*SIZE+n) of the ZA storage. This vector contains elements with data type B, H, S, D, or Q.
ZAnV.<B|H|S|D|Q>[m] accesses a vector that contains the elements at column (m*SIZE) in rows (i*SIZE+n), where i ranges from 0 to NUM_OF_ELEMENTS-1. This vector contains elements with data type B, H, S, D, or Q.
Software that mixes element data type sizes, or horizontal and vertical tile slices, must take care because these views overlap: they all address the same underlying ZA storage.
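The row and column formulas above can be turned into a small overlap check. The sketch below computes the set of (row, byte column) positions that a tile slice covers and intersects two such sets. The function names are hypothetical, and the sketch assumes that the column value (m*SIZE) is a byte offset and that each element spans SIZE consecutive bytes in its row.

```python
ELEMENT_BYTES = {"B": 1, "H": 2, "S": 4, "D": 8, "Q": 16}


def _check(svl_bytes: int, suffix: str, n: int, m: int) -> int:
    size = ELEMENT_BYTES[suffix]
    if not 0 <= n < size:
        raise ValueError("tile number out of range for this element type")
    if not 0 <= m < svl_bytes // size:
        raise ValueError("slice index out of range")
    return size


def horizontal_slice_bytes(svl_bytes: int, suffix: str, n: int, m: int) -> set:
    """(row, byte column) positions covered by ZAnH.<suffix>[m]."""
    size = _check(svl_bytes, suffix, n, m)
    row = m * size + n
    return {(row, col) for col in range(svl_bytes)}


def vertical_slice_bytes(svl_bytes: int, suffix: str, n: int, m: int) -> set:
    """(row, byte column) positions covered by ZAnV.<suffix>[m]."""
    size = _check(svl_bytes, suffix, n, m)
    num_of_elements = svl_bytes // size
    return {
        (i * size + n, m * size + b)
        for i in range(num_of_elements)
        for b in range(size)
    }


if __name__ == "__main__":
    svl_bytes = 16  # SVL = 128 bits
    h = horizontal_slice_bytes(svl_bytes, "B", 0, 0)   # ZA0H.B[0]
    v = vertical_slice_bytes(svl_bytes, "B", 0, 13)    # ZA0V.B[13]
    print(sorted(h & v))
```

A non-empty intersection means that a write through one view changes what the other view reads. For instance, in this model ZA0H.H[2] and ZA0V.S[0] share four bytes of row 4 when SVL is 128 bits.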
For more information about the ZA array, array vectors, tiles, and tile slices, see sections B1.4.8 to B1.4.12 in the Arm Architecture Reference Manual for A-profile architecture.
Some instructions have restrictions in Streaming SVE mode:
For more information about instructions affected by Streaming SVE mode, see the Arm Architecture Reference Manual for A-profile architecture.
SME adds several new instructions, including the following:
For more detailed information about SME, see the Arm Architecture Reference Manual for A-profile architecture.
To discover how you can use SME in your applications to efficiently work with matrices and other forms of data, see the SME Programmer's Guide.
Is the ZA register a newly added register, or is it mapped from the existing 32 registers, Z0-Z31?
ZA is newly introduced for SME. It is dedicated storage (you can think of it as a 2D register); it is not a mapping of the Z registers.