How to use the Arm Cortex-M55 Processor with the open-source CMSIS library

Jason Andrews
March 11, 2020
15 minute read time.

Arm recently announced the Cortex-M55 processor, the first to feature Arm Helium technology, also known as the M-Profile Vector Extension (MVE), introduced with the Armv8.1-M architecture. The vector extension enables increased DSP and machine learning (ML) performance for endpoint devices.

The Cortex-M55 processor offers many new features, including: 

  • Helium vector processing 
  • Instruction set enhancements for loops and branches, including low-overhead loops
  • Half precision floating-point support 
  • Enhancements for TrustZone management of the floating-point unit (FPU) 
  • Privileged execute-never (PXN) memory attribute in the Memory Protection Unit (MPU) 
  • Enhancements in debug including the Performance Monitoring Unit (PMU) and the Unprivileged Debug Extension (UDE)
  • Reliability, Availability, and Serviceability (RAS) extension from AXI5

This article helps you get started with IP selection and software development on the Cortex-M55 processor. It includes an example of how to compare performance against previous designs, such as the Cortex-M7 processor, using the latest CMSIS-DSP library.

Start early software development with Arm tools

Arm tools can be used to update software libraries to realize increased performance for DSP and ML applications. Read more about Arm tools: Get Started with Early Development on the Arm Cortex-M55 processor.

Arm tools highlighted in this article are:

  • Arm Compiler for software compilation
  • Fast Models for code execution on a virtual platform
  • Cycle Models for performance analysis and comparison with previous Cortex-M designs

Arm Compiler version 6.14 with support for the Cortex-M55 processor is now available and is required for the following steps.

Simplify DSP development with the CMSIS library

CMSIS-DSP is a suite of common signal processing functions for Cortex-M. The source code and documentation are available on GitHub.

The library is divided into several sections each covering a specific category:

  • Basic math functions
  • Fast math functions
  • Complex math functions
  • Filters
  • Matrix functions
  • Transform functions
  • Motor control functions
  • Statistical functions
  • Support functions
  • Interpolation functions

The library has separate functions for 8-bit integer, 16-bit integer, 32-bit integer, and 32-bit floating-point data types. It has been updated with vector instructions for the Cortex-M55 processor.

To get started with CMSIS on the Cortex-M55 processor let’s look at vector multiplication using arm_mult_q31() from the CMSIS-DSP software library.

For a CPU such as the Cortex-M7, Arm Compiler 6 uses 32-bit multiply instructions (SMMUL). Compiling for the Cortex-M55 processor uses vector multiply instructions (VQDMULH) that operate on the Q registers.

Helium reuses the registers in the FPU as vector registers, and each vector register is 128 bits wide. With Helium, the load, add, multiply, and store operations can be done on 128-bit values held in the Q registers:

Figure 1: Cortex-M55 registers (block diagram showing the FPU registers)
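
As a rough illustration of what a 128-bit vector register means for the integer data types mentioned earlier, the sketch below computes how many elements fit in one Q register and how a block of samples splits into full vectors plus a tail. The helper names are hypothetical and are not part of CMSIS; only the 128-bit register width comes from the text above.

```c
#include <stddef.h>
#include <stdint.h>

/* Width of one Helium vector (Q) register in bits, as described above. */
#define VECTOR_BITS 128u

/* Hypothetical helper: number of elements of the given size in one vector. */
static inline uint32_t lanes_per_vector(size_t element_bytes)
{
    return VECTOR_BITS / (uint32_t)(element_bytes * 8u);
}

/* Hypothetical helper: number of full vectors in a block (lanes must be non-zero). */
static inline uint32_t full_vectors(uint32_t block_size, uint32_t lanes)
{
    return block_size / lanes;
}

/* Hypothetical helper: elements left over after the full vectors. */
static inline uint32_t tail_elements(uint32_t block_size, uint32_t lanes)
{
    return block_size % lanes;
}
```

For 32-bit data this gives four lanes per vector, so a block of 18 samples would be handled as four full vectors followed by a two-element tail.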

The latest CMSIS-DSP library takes advantage of the vector instructions for increased performance.

For example, the arm_mult_q31() function now provides an implementation for Helium.

/* ----------------------------------------------------------------------
 * Project:      CMSIS DSP Library
 * Title:        arm_mult_q31.c
 * Description:  Q31 vector multiplication
 *
 * $Date:        18. March 2019
 * $Revision:    V1.6.0
 *
 * Target Processor: Cortex-M cores
 * -------------------------------------------------------------------- */
/*
 * Copyright (C) 2010-2019 ARM Limited or its affiliates. All rights reserved.
 *
 * SPDX-License-Identifier: Apache-2.0
 *
 * Licensed under the Apache License, Version 2.0 (the License); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an AS IS BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include "arm_math.h"

/**
  @ingroup groupMath
 */

/**
  @addtogroup BasicMult
  @{
 */

/**
  @brief         Q31 vector multiplication.
  @param[in]     pSrcA      points to the first input vector.
  @param[in]     pSrcB      points to the second input vector.
  @param[out]    pDst       points to the output vector.
  @param[in]     blockSize  number of samples in each vector.
  @return        none

  @par           Scaling and Overflow Behavior
                   The function uses saturating arithmetic.
                   Results outside of the allowable Q31 range[0x80000000 0x7FFFFFFF] are saturated.
 */
#if defined(ARM_MATH_MVEI)

#include "arm_helium_utils.h"

void arm_mult_q31(
    const q31_t * pSrcA,
    const q31_t * pSrcB,
    q31_t * pDst,
    uint32_t blockSize)
{
    uint32_t  blkCnt;           /* loop counters */
    q31x4_t vecA, vecB;

    /* Compute 4 outputs at a time */
    blkCnt = blockSize >> 2;
    while (blkCnt > 0U)
    {
        /*
         * C = A * B
         * Multiply the inputs and then store the results in the destination buffer.
         */
        vecA = vld1q(pSrcA);
        vecB = vld1q(pSrcB);
        vst1q(pDst, vqdmulhq(vecA, vecB));
        /*
         * Decrement the blockSize loop counter
         */
        blkCnt--;
        /*
         * advance vector source and destination pointers
         */
        pSrcA  += 4;
        pSrcB  += 4;
        pDst   += 4;
    }
    /*
     * tail
     */
    blkCnt = blockSize & 3;
    if (blkCnt > 0U)
    {
        mve_pred16_t p0 = vctp32q(blkCnt);
        vecA = vld1q(pSrcA);
        vecB = vld1q(pSrcB);
        vstrwq_p(pDst, vqdmulhq(vecA, vecB), p0);
    }
}

#else
void arm_mult_q31(
  const q31_t * pSrcA,
  const q31_t * pSrcB,
        q31_t * pDst,
        uint32_t blockSize)
{
        uint32_t blkCnt;                               /* Loop counter */
        q31_t out;                                     /* Temporary output variable */

#if defined (ARM_MATH_LOOPUNROLL)

  /* Loop unrolling: Compute 4 outputs at a time */
  blkCnt = blockSize >> 2U;

  while (blkCnt > 0U)
  {
    /* C = A * B */

    /* Multiply inputs and store result in destination buffer. */
    out = ((q63_t) *pSrcA++ * *pSrcB++) >> 32;
    out = __SSAT(out, 31);
    *pDst++ = out << 1U;

    out = ((q63_t) *pSrcA++ * *pSrcB++) >> 32;
    out = __SSAT(out, 31);
    *pDst++ = out << 1U;

    out = ((q63_t) *pSrcA++ * *pSrcB++) >> 32;
    out = __SSAT(out, 31);
    *pDst++ = out << 1U;

    out = ((q63_t) *pSrcA++ * *pSrcB++) >> 32;
    out = __SSAT(out, 31);
    *pDst++ = out << 1U;

    /* Decrement loop counter */
    blkCnt--;
  }

  /* Loop unrolling: Compute remaining outputs */
  blkCnt = blockSize % 0x4U;

#else

  /* Initialize blkCnt with number of samples */
  blkCnt = blockSize;

#endif /* #if defined (ARM_MATH_LOOPUNROLL) */

  while (blkCnt > 0U)
  {
    /* C = A * B */

    /* Multiply inputs and store result in destination buffer. */
    out = ((q63_t) *pSrcA++ * *pSrcB++) >> 32;
    out = __SSAT(out, 31);
    *pDst++ = out << 1U;

    /* Decrement loop counter */
    blkCnt--;
  }

}
#endif /* defined(ARM_MATH_MVEI) */

/**
  @} end of BasicMult group
 */
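
To make the Q31 arithmetic in the scalar path easier to follow, here is a minimal host-side sketch that mirrors the three scalar steps (64-bit multiply and shift, saturate, shift left by one) without needing arm_math.h. The names ssat31 and mult_q31_ref are hypothetical stand-ins, not CMSIS functions, and the sketch assumes the compiler implements right shift of negative signed values as an arithmetic shift, which is implementation-defined in C.

```c
#include <stdint.h>

typedef int32_t q31_t; /* local stand-in for the CMSIS-DSP Q31 type */

/* Portable stand-in for __SSAT(x, 31): clamp to the signed 31-bit
   range [-2^30, 2^30 - 1]. */
static int32_t ssat31(int32_t x)
{
    const int32_t max = (int32_t)0x3FFFFFFF;
    const int32_t min = -max - 1;
    if (x > max) return max;
    if (x < min) return min;
    return x;
}

/* Hypothetical reference: follows the scalar fallback step by step. */
static void mult_q31_ref(const q31_t *a, const q31_t *b, q31_t *dst, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++) {
        /* Keep the top 32 bits of the 64-bit product
           (assumes arithmetic right shift for negative values). */
        q31_t out = (q31_t)(((int64_t)a[i] * b[i]) >> 32);
        /* Saturate to 31 bits so that doubling cannot overflow. */
        out = ssat31(out);
        /* Same effect as out << 1 in the library code, written as a
           multiply to avoid shifting a negative value. */
        dst[i] = out * 2;
    }
}
```

With this sketch, 0x40000000 (0.5 in Q31) multiplied by itself gives 0x20000000 (0.25), and multiplying 0x80000000 (-1.0) by itself saturates to 0x7FFFFFFE rather than overflowing.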

Let us look at how to compile and run the vector multiply and vector add functions with 32-bit integers targeting both Cortex-M7 and Cortex-M55 and compare the results.

We are going to use the CMSIS Test Framework to get going quickly and enable performance comparisons across the wide variety of functions provided by the CMSIS DSP library. This is a great way to get started doing performance analysis without the need to write any software.

The README file for the test framework provides many more details about how it works. Our goal is to use the test framework to run some tests and compare the results for the Cortex-M55 processor. Along the way, we learn how to add custom hardware systems to CMSIS to use for test execution.

Docker provides consistent results and is a great way to learn

To make the steps easy to follow and repeat, the process to set up and use the CMSIS test framework is captured in a Docker project on GitHub. There are numerous tutorials on how to use Docker and many articles available on the benefits it provides.

Docker can be installed using Docker Desktop for Windows or Mac or on Linux using the instructions for the Docker Community Edition.

To get the project from GitHub:

git clone https://github.com/ARM-software/Tool-Solutions.git
cd Tool-Solutions/docker/cmsis-models

Before building the Docker image, download Arm Compiler 6.14 for Linux. This can be done using the get-ac6.sh script or using a wget command:

wget https://developer.arm.com/-/media/Files/downloads/compiler/DS500-BN-00026-r5p0-16rel0.tgz

The README file of the GitHub project has more information about how to obtain the necessary license to run the Arm Compiler.

The Docker build command is in build.sh. It can also be entered manually.

docker build -t cmsis-models  -f Dockerfile .

The following is a build log. In this log everything is cached by Docker, but it shows all the commands which are in the Dockerfile. The log is useful for comparison if anything goes wrong.

$ ./build.sh 
Sending build context to Docker daemon  291.1MB
Step 1/23 : FROM ubuntu:18.04
 ---> ccc6e87d482b
Step 2/23 : RUN echo "root:docker" | chpasswd
 ---> Using cache
 ---> 9246a019eef8
Step 3/23 : RUN apt-get update &&       apt-get -y install sudo git vim wget make python3 python3-pyparsing python3-numpy python3-colorama
 ---> Using cache
 ---> 86f8f4f90c9a
Step 4/23 : RUN useradd --create-home -s /bin/bash -m user1 && echo "user1:docker" | chpasswd && adduser user1 sudo
 ---> Using cache
 ---> 55950aabc125
Step 5/23 : RUN wget https://github.com/Kitware/CMake/releases/download/v3.14.3/cmake-3.14.3-Linux-x86_64.sh
 ---> Using cache
 ---> 1fdcd8e4536d
Step 6/23 : RUN bash ./cmake-3.14.3-Linux-x86_64.sh --skip-license --exclude-subdir --prefix=/usr/local
 ---> Using cache
 ---> 9aea8ff4c694
Step 7/23 : WORKDIR /home/user1
 ---> Using cache
 ---> 364617ba6e60
Step 8/23 : USER user1
 ---> Using cache
 ---> b5e7f33cc176
Step 9/23 : RUN mkdir /home/user1/tmp
 ---> Using cache
 ---> 77d6c2920f79
Step 10/23 : ADD --chown=user1:user1  DS500-BN-00026-r5p0-16rel0.tgz /home/user1/tmp
 ---> Using cache
 ---> de4f96f90128
Step 11/23 : RUN /home/user1/tmp/install_x86_64.sh --i-agree-to-the-contained-eula --no-interactive -d /home/user1/AC6
 ---> Using cache
 ---> e086489bb8e1
Step 12/23 : RUN rm -rf /home/user1/tmp
 ---> Using cache
 ---> 000539ab153e
Step 13/23 : RUN mkdir /home/user1/Platforms
 ---> Using cache
 ---> 6f2d6c7817d9
Step 14/23 : COPY --chown=user1:user1 setup-cmsis.sh /home/user1
 ---> Using cache
 ---> 5132eb713128
Step 15/23 : COPY --chown=user1:user1 configPlatform.cmake /home/user1
 ---> Using cache
 ---> dbb1857d69c9
Step 16/23 : COPY --chown=user1:user1 Platforms /home/user1/Platforms
 ---> Using cache
 ---> 345499509d17
Step 17/23 : COPY --chown=user1:user1  cmake_M7F.sh /home/user1
 ---> Using cache
 ---> 48e7baa40e97
Step 18/23 : COPY --chown=user1:user1  cmake_M55.sh /home/user1
 ---> Using cache
 ---> 6a9ed49d5d7c
Step 19/23 : RUN /home/user1/setup-cmsis.sh
 ---> Using cache
 ---> 03e05f172413
Step 20/23 : RUN echo "export ARMLMD_LICENSE_FILE=7010@localhost" >> /home/user1/.bashrc
 ---> Using cache
 ---> 6ee70bd89feb
Step 21/23 : RUN echo "export PATH=$PATH:/home/user1/AC6/bin" >> /home/user1/.bashrc
 ---> Using cache
 ---> dbf15c797265
Step 22/23 : RUN echo "export ARM_TOOL_VARIANT=ult" >> /home/user1/.bashrc
 ---> Using cache
 ---> f08afeaf00bf
Step 23/23 : COPY --chown=user1:user1 build_basic_math.sh /home/user1
 ---> Using cache
 ---> 596627eea0c1
Successfully built 596627eea0c1
Successfully tagged cmsis-models:latest

To run the Docker image use the docker run command (also in run.sh). It should start the container.

docker run --network host -it cmsis-models /bin/bash

To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

user1@therefuge:~$ ls
AC6      Platforms            cmake_M55.sh  configPlatform.cmake
CMSIS_5  build_basic_math.sh  cmake_M7F.sh  setup-cmsis.sh
user1@therefuge:~$ 

With the container running we are now ready to build the application using the CMSIS test framework.

If you used Docker and the container is running, jump down to the "Build test applications" section to build a test using the build_basic_math.sh script.

The setup can also be done without Docker. The next section gives more details about how to create the development environment for using the CMSIS test framework on a Linux machine.

CMSIS Test Framework prerequisites

The process described here is for Ubuntu 18.04. Similar commands are possible on other Linux distributions.

The CMSIS test framework is built on Python and cmake. Before getting started, install cmake 3.14.3 and Python 3. The test framework is sensitive to the cmake version, and 3.14.3 is recommended.

sudo apt-get install -y make wget python3 python3-pyparsing python3-numpy python3-colorama
wget https://github.com/Kitware/CMake/releases/download/v3.14.3/cmake-3.14.3-Linux-x86_64.sh
sudo bash ./cmake-3.14.3-Linux-x86_64.sh --skip-license --exclude-subdir --prefix=/usr/local

Install Arm Compiler 6

Arm Compiler version 6.14 and later includes support for Cortex-M55. Install it by downloading and extracting the archive, then running the installer.

tar xvfz DS500-BN-00026-r5p0-16rel0.tgz
./install_x86_64.sh

Make sure to set the ARMLMD_LICENSE_FILE environment variable to point to a license server with a valid Arm Compiler 6 license. Next, add the compiler's bin/ directory to the PATH variable. For more installation details, refer to the Installing Arm Compiler documentation.

With the required tools installed, let us move on to get CMSIS and compile it.

Obtain the CMSIS library

To get the CMSIS library:

git clone https://github.com/ARM-software/CMSIS_5.git

The CMSIS library contains support for both Fixed Virtual Platforms and the MPS3 board. Running the tests on another platform requires some information about the new system. In this case, we want to run on a custom Fast Model system and an equivalent Cycle Model system to compare the performance of different Arm processor types.

The Platforms/ directory of the GitHub project contains the files required to add a new platform. These files are copied into the CMSIS DSP directory to add the new platform support. The new platform is named IPSS, short for IP Selection System. It is a very simple system with a CPU and memory, and there are files to support both the Cortex-M7 and the Cortex-M55 on it. The following two commands add the new platform support to CMSIS:

cp configPlatform.cmake CMSIS_5/CMSIS/DSP
cp -r Platforms/IPSS CMSIS_5/CMSIS/DSP/Platforms/

Once the platform support is added, there are customized cmake scripts for each processor. These files are copied to the CMSIS test framework directory:

cp cmake_M7F.sh CMSIS_5/CMSIS/DSP/Testing
cp cmake_M55.sh CMSIS_5/CMSIS/DSP/Testing

Build test applications

Use the build_basic_math.sh script in the top-level directory (the home directory in the Docker container) to build the basic math functions in the DSP library.

The build_basic_math.sh script takes the CPU type, a data type, and a test number. The test numbers map to the functions using the table below:

ID   Function
1    vec_mult_q31
2    vec_add_q31
3    vec_sub_q31
4    vec_abs_q31
5    vec_negate_q31
6    vec_offset_q31
7    vec_scale_q31
8    vec_dot_q31

To build the vector multiply with 32-bit integer data types for each CPU use:

./build_basic_math.sh -t M55 -d Q31 -i 1
./build_basic_math.sh -t M7F -d Q31 -i 1

The resulting executables are CMSIS_5/CMSIS/DSP/Testing/build_M7F_Q31/Testing for the Cortex-M7 and CMSIS_5/CMSIS/DSP/Testing/build_M55_Q31/Testing for the Cortex-M55.

Run CMSIS tests using Fast Models

Arm Fast Models provide fast, flexible programmer's view models of Arm IP, allowing you to develop software prior to silicon availability. They allow full control over the simulation, including profiling, debug, and trace. Fast Models are also a great way to test software applications before trying them on Arm Cycle Models. Fast Models run quickly and have a broad set of debug and trace features to help eliminate functional problems.

Fast Models are available for the Cortex-M55 and the Cortex-M7. These models can be used to make sure the CMSIS applications behave as expected before moving to cycle accurate benchmarking.

A simple system which can run the Cortex-M7 software is shown in the following code. Refer to the Fast Models Quick Start for more details on how to get started.

/*
 * Copyright 2020 ARM Limited. All rights reserved.
 */

component m7
{
    composition
    {
        armcortexm7ct : ARMCortexM7CT();
        pvbus2ambapv : PVBus2AMBAPV();
        Memory : RAMDevice();
        Clock100MHz : ClockDivider(mul=100000000);
        Clock1Hz : MasterClock();
        BusDecoder : PVBusDecoder();
    }

    connection
    {
        Clock1Hz.clk_out => Clock100MHz.clk_in;
        BusDecoder.pvbus_m_range[0x0..0x9fffffff] => Memory.pvbus;
        pvbus2ambapv.amba_pv_m => self.amba_pv_m;
        armcortexm7ct.pvbus_m => BusDecoder.pvbus_s;
        Clock100MHz.clk_out => armcortexm7ct.clk_in;
        BusDecoder.pvbus_m_range[0xa8000000..0xa8001000] => pvbus2ambapv.pvbus_s;
    }

    properties
    {
        component_type = "System";
    }
    master port<AMBAPV> amba_pv_m;

}

A similar system can be used for the Cortex-M55 to run CMSIS executables on the Fast Model.

The output from the test is a set of letters and numbers. Here is the output from the vector multiply function:

Fast Models [11.9.41 (Nov 26 2019)]
Copyright 2000-2019 ARM Limited.
All Rights Reserved.

S: g 1
S: g 1
S: g 1
S: s 2
SS: 1 0 0 143 Y
E: 
b 16
S: 1 0 0 247 Y
E: 
b 32
S: 1 0 0 455 Y
E: 
b 64
S: 1 0 0 871 Y
E: 
b 128
S: 1 0 0 1703 Y
E: 
b 256
SSSS
_[TEST COMPLETE]_________________________________________________


simulation is complete

Info: /OSCI/SystemC: Simulation stopped by user.

The output includes information about the test suite, in this case DSP BasicMaths, and the function to be run. The test runs the selected function with multiple block sizes, in this example multiplying 16, 32, 64, 128, and 256 32-bit integer values. The cycle count is also embedded in the results. There are many more details about how to select tests and process results in the CMSIS GitHub project.
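
As a small sketch of how the embedded cycle count might be pulled out of such a log, the function below parses one result line. It assumes, based only on the log shown above, that a passing result line has the form "S: <id> <error> <line> <cycles> Y" with the cycle count as the fourth number. That field layout and the function name parse_cycles are assumptions, so check them against the test framework documentation before relying on them.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper. Assumed layout of a passing result line:
   "S: <id> <error> <line> <cycles> Y".
   Returns 1 and stores the cycle count on success, 0 otherwise. */
static int parse_cycles(const char *line, long *cycles)
{
    long id, error, line_no, value;
    char flag;

    /* Skip the "S:" (or "SS:") prefix by locating the first colon. */
    const char *p = strchr(line, ':');
    if (p == NULL) {
        return 0;
    }

    /* Group and suite lines such as "S: g 1" fail to match four numbers. */
    if (sscanf(p + 1, "%ld %ld %ld %ld %c",
               &id, &error, &line_no, &value, &flag) != 5) {
        return 0;
    }

    /* Only accept lines marked as passed. */
    if (flag != 'Y') {
        return 0;
    }

    *cycles = value;
    return 1;
}
```

Applied to the line "S: 1 0 0 247 Y" from the log, this returns 247, while lines such as "S: g 1", "E:", and "b 32" are rejected.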

Compare performance with Arm Cycle Models

Once the software is tested and confirmed to work as expected using Fast Models, Arm Cycle Models can be used to check the performance. The Cortex-M55 Cycle Model is now available on Arm IP Exchange.

Cycle Models run a cycle-accurate simulation of each application.

The following table compares the Cortex-M7 and the Cortex-M55 on the vector multiply and vector add functions for a range of block sizes. The functions were selected to highlight the new vector instructions of the Cortex-M55 and do not represent a comprehensive performance analysis across a wide variety of software. Please refer to the Arm Developer website for Cortex-M comparison information.

Function (block size)   Cortex-M55 cycles   Cortex-M7 cycles
vec_mult_q31(16)        172                 364
vec_mult_q31(32)        288                 684
vec_mult_q31(64)        520                 1308
vec_mult_q31(128)       984                 2556
vec_mult_q31(256)       1912                5052
vec_add_q31(16)         168                 266
vec_add_q31(32)         280                 491
vec_add_q31(64)         504                 929
vec_add_q31(128)        952                 1805
vec_add_q31(256)        1848                3557
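
One simple way to read these numbers is as a ratio of Cortex-M7 cycles to Cortex-M55 cycles for each row. The sketch below encodes the table values and computes that ratio. The struct and function names are hypothetical, and the ratio only summarizes these specific measurements.

```c
#include <stddef.h>

/* One row of the results table above. */
typedef struct {
    const char *function;
    unsigned block_size;
    unsigned m55_cycles;
    unsigned m7_cycles;
} result_row;

/* Cycle counts copied from the table. */
static const result_row results[] = {
    { "vec_mult_q31",  16,  172,  364 },
    { "vec_mult_q31",  32,  288,  684 },
    { "vec_mult_q31",  64,  520, 1308 },
    { "vec_mult_q31", 128,  984, 2556 },
    { "vec_mult_q31", 256, 1912, 5052 },
    { "vec_add_q31",   16,  168,  266 },
    { "vec_add_q31",   32,  280,  491 },
    { "vec_add_q31",   64,  504,  929 },
    { "vec_add_q31",  128,  952, 1805 },
    { "vec_add_q31",  256, 1848, 3557 },
};

static const size_t result_count = sizeof results / sizeof results[0];

/* Hypothetical helper: Cortex-M7 cycles divided by Cortex-M55 cycles. */
static double speedup(const result_row *row)
{
    return (double)row->m7_cycles / (double)row->m55_cycles;
}
```

For example, at a block size of 256 the ratio is about 2.6 for vec_mult_q31 and about 1.9 for vec_add_q31.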

Further resources

Follow the links below for further tutorials on using CMSIS-DSP:

  • How to use new CMSIS-DSP library functions for classical machine learning 
  • How to use the CMSIS-DSP Python wrapper and how a CMSIS-DSP API is represented in Python 

Summary

The Cortex-M55 processor provides a significant uplift in ML and DSP performance for IoT devices. Arm offers several development tools and models to help partners along their path to bringing a Cortex-M55 based device to market. Arm tools and models are especially useful for understanding architecture differences and performance improvements compared to previous Cortex-M designs. This article explained how to use the CMSIS Test Framework to build software and run it on models to compare performance during IP selection. Following this methodology helps make sure performance is well understood right from the start.
