Introducing the Scalable Matrix Extension for the Armv9-A Architecture

Martin Weidmann
July 14, 2021
3 minute read time.

The Arm architecture brought scalable vector processing from the supercomputer to the widest range of devices, and today most of the world's computational workloads run on the Arm architecture. Following the Vision Day announcement of Armv9-A, Arm is making available early technical details of a new extension to the Armv9-A architecture: the Scalable Matrix Extension (SME). SME is the latest in a planned series of architecture improvements that provide increasing support for matrix operations.

The purpose of this early disclosure is to inform and enable the OS and tools developer ecosystems. SME introduces a new programmer's model and register state to support matrix operations, which further additions through 2022 will build upon.

SME builds on the Scalable Vector Extensions (SVE and SVE2), adding new capabilities to efficiently process matrices. Key features include:

  • Matrix tile storage
  • Load, store, insert, and extract tile vectors, including on-the-fly transposition
  • Outer product of SVE vectors
  • Streaming SVE mode
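
The sections below expand on these features. Purely as a conceptual model (our own illustration, not SME code or a real register layout), a tile can be pictured as a square block of storage whose rows and columns can each be read or written as vectors, which is what makes on-the-fly transposition possible:

    #define VL 16   // elements per vector; in SME this is implementation-defined

    // Conceptual model of one 32-bit tile: a VL x VL block of element storage.
    typedef struct { float e[VL][VL]; } tile_t;

    // Inserting a vector as a horizontal (row) slice or a vertical (column)
    // slice of the tile; choosing columns instead of rows transposes the data
    // as it is written.
    static void insert_row(tile_t *t, int r, const float v[VL]) {
        for (int i = 0; i < VL; i++) t->e[r][i] = v[i];
    }
    static void insert_col(tile_t *t, int c, const float v[VL]) {
        for (int i = 0; i < VL; i++) t->e[i][c] = v[i];
    }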

Matrix multiplication on Arm

Matrix multiplications are an important part of many key workloads, such as scientific simulations, computer vision, some aspects of Machine Learning (ML), and Augmented Reality (AR). The Arm architecture has evolved over time, gaining features to improve the performance and efficiency of these operations:
Graphic: An example of matrix multiplication on Arm

  • Armv8.4-A: support for 8-bit integer dot product instructions
  • Armv8.6-A: support for in-vector integer and floating-point matrix-multiply instructions and the BFloat16 data type
  • Armv9-A: support for wider vectors in SVE2

SME is the next step in this journey, enabling a significant increase in CPU matrix processing throughput and efficiency.

SME and matrix multiplication

A simple implementation of matrix multiplication is a triply nested loop, as shown in the following graphic:
Graphic: SME and matrix multiplication example

// C[MxN] = A[MxK] * B[KxN]: naive triply nested loop
for (int m = 0; m < M; m++)
    for (int n = 0; n < N; n++) {
        C[m][n] = 0;
        for (int k = 0; k < K; k++)
            C[m][n] += A[m][k] * B[k][n];
    }

This approach gives a multiply-to-load ratio of 1:2, that is, one multiply for every two element loads. To improve efficiency and throughput, we need to increase this ratio. One way to do this is to calculate more than one result at a time, for example:
Graphic: Calculating multiply ratios

In the previous example, calculating a block of four results improves the multiply-to-load ratio to 1:1: four loads are required to compute four multiplies.
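
As a concrete illustration of this blocking idea, here is a minimal C sketch; the function name is ours, and for brevity it assumes M and N are multiples of two:

    // Compute C[MxN] = A[MxK] * B[KxN] in 2x2 blocks: each inner iteration
    // performs 4 element loads (2 from A, 2 from B) and 4 multiply-accumulates.
    void matmul_blocked_2x2(int M, int N, int K,
                            const float A[M][K], const float B[K][N],
                            float C[M][N])
    {
        for (int m = 0; m < M; m += 2)
            for (int n = 0; n < N; n += 2) {
                float c00 = 0, c01 = 0, c10 = 0, c11 = 0;
                for (int k = 0; k < K; k++) {
                    float a0 = A[m][k],     a1 = A[m + 1][k];     // 2 loads from A
                    float b0 = B[k][n],     b1 = B[k][n + 1];     // 2 loads from B
                    c00 += a0 * b0;         c01 += a0 * b1;       // 4 multiplies
                    c10 += a1 * b0;         c11 += a1 * b1;
                }
                C[m][n] = c00;              C[m][n + 1] = c01;
                C[m + 1][n] = c10;          C[m + 1][n + 1] = c11;
            }
    }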

SME is based on an outer-product engine, which takes the idea of generating multiple results per load further still:
Graphic: SME outer product engine

An outer product of vectors A[H] ⊗ B[W] is calculated, generating an H×W matrix that is accumulated into a matrix tile C[H×W]. A full matrix multiplication of A[H×K] and B[K×W] is calculated by iterating over the columns of A and the rows of B, accumulating into C.
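
As a conceptual C sketch of this scheme (not SME code; the tile here is just a 2D accumulator, and the function name is ours), the matrix multiplication becomes a sequence of rank-1 updates:

    // Accumulate K rank-1 (outer-product) updates into the tile C[HxW].
    // Each step reads one column of A and one row of B (H + W elements)
    // and performs H*W multiply-accumulates. C is assumed zero-initialized.
    void matmul_outer_product(int H, int W, int K,
                              const float A[H][K], const float B[K][W],
                              float C[H][W])
    {
        for (int k = 0; k < K; k++)            // columns of A / rows of B
            for (int i = 0; i < H; i++)
                for (int j = 0; j < W; j++)
                    C[i][j] += A[i][k] * B[k][j];
    }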

SME is a scalable architecture: the width of the vectors is an implementation choice, and the multiply-to-load ratio depends on the implemented width. For example, in a 512-bit vector implementation with 32-bit data, each vector holds 16 elements, so one outer product performs 16 × 16 = 256 multiplies from two vector loads, a multiply-to-vector-load ratio of 256:2. This increases to 256:1 when four output tiles can be computed from four input vectors.

Outer products can also be used to construct other high-level matrix operations such as matrix inversion, filters, and linear equation solvers.

SME and SVE2


Graphic: SVE2 on Armv9-A diagram

SME adds a new operating mode, Streaming SVE mode. In Streaming SVE mode, the new SME storage and instructions are available, as well as a significant subset of the existing SVE2 instructions. Outside Streaming SVE mode, behavior is unchanged from SVE2. Applications can switch between the two modes depending on what is needed.

Having a separate mode for SME operations allows an implementation to support different vector lengths for streaming and non-streaming processing within the same application. For example, an implementation might choose to support a larger vector length in Streaming SVE mode, with the hardware optimized for streaming, throughput-oriented operation.
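
As a rough sketch of what switching modes can look like from application code (our own illustration: it assumes a toolchain assembled with SME support, and real software would normally rely on compiler or library support rather than hand-written inline assembly), the SMSTART and SMSTOP instructions enter and leave Streaming SVE mode:

    // Hypothetical wrapper: run a throughput-oriented kernel in Streaming SVE mode.
    static void run_streaming_kernel(void)
    {
        __asm__ volatile("smstart sm");   // enter Streaming SVE mode (PSTATE.SM = 1)

        /* ... SME and streaming-compatible SVE2 code would go here ... */

        __asm__ volatile("smstop sm");    // return to non-streaming mode
    }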

Find out more

Full Instruction Set and System register information for SME is available on our technical webpages. A supplement to the Arm Architecture Reference Manual (Arm ARM) documenting SME is due for release at the end of this year. We also plan to release supporting materials and examples as part of the Learn the Architecture program of guides in 2022.
