Arm recently announced the Cortex-M55 processor, the first to feature Arm Helium technology, also known as the M-Profile Vector Extension (MVE), introduced with the Armv8.1-M architecture. The vector extension enables increased DSP and machine learning (ML) performance for endpoint devices.
The Cortex-M55 processor offers many new features, including:
This article helps you get started with IP selection and software development on the Cortex-M55 processor. It includes an example of how to compare performance against previous designs, such as the Cortex-M7 processor, using the latest CMSIS-DSP library.
Arm tools can be used to update software libraries to realize increased performance for DSP and ML applications. Read more about Arm tools: Get Started with Early Development on the Arm Cortex-M55 processor.
Arm tools highlighted in this article are:
Arm Compiler version 6.14 with support for the Cortex-M55 processor is now available and is required for the following steps.
CMSIS-DSP is a suite of common signal processing functions for Cortex-M processors. The source code and documentation are available on GitHub.
The library is divided into several sections each covering a specific category:
The library has separate functions for 8-bit integers, 16-bit integers, 32-bit integers and 32-bit floating-point data types and has been updated with vector instructions for the Cortex-M55 processor.
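The 32-bit integer functions operate on fixed-point values, which CMSIS-DSP names q31_t (as in arm_mult_q31()). The following is a minimal sketch of that interpretation, assuming the usual Q31 convention in which the stored integer represents value × 2^31 and so covers the range [-1.0, 1.0). The helper names q31_from_double and q31_to_double are hypothetical and are not part of CMSIS-DSP.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not CMSIS-DSP functions. */

/* Convert a real value to Q31, saturating at the ends of the range. */
int32_t q31_from_double(double x)
{
    double scaled = x * 2147483648.0;            /* multiply by 2^31 */
    if (scaled >= 2147483647.0)  return INT32_MAX;
    if (scaled <= -2147483648.0) return INT32_MIN;
    return (int32_t)scaled;                      /* truncates toward zero */
}

/* Convert a Q31 value back to a real value in [-1.0, 1.0). */
double q31_to_double(int32_t q)
{
    return (double)q / 2147483648.0;             /* divide by 2^31 */
}
```

Under this convention 0.5 is stored as 0x40000000, and 1.0 cannot be represented exactly, so it saturates to 0x7FFFFFFF.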
To get started with CMSIS on the Cortex-M55 processor let’s look at vector multiplication using arm_mult_q31() from the CMSIS-DSP software library.
For a CPU such as the Cortex-M7, Arm Compiler 6 uses the 32-bit multiply instruction SMMUL. When compiling for the Cortex-M55 processor, it uses the vector multiply instruction VQDMULH, which operates on the Q registers.
Helium reuses the FPU registers as vector registers, and each vector register is 128 bits wide. With Helium, the load, add, multiply, and store operations can be done on 128-bit values held in the Q registers:
Figure 1: Cortex-M55 registers
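To make the 128-bit register width concrete, here is a small sketch of how many elements fit in one vector register and how a block of samples splits into full vectors plus a remainder. The function names are hypothetical and chosen only for illustration. For 32-bit data the result is four lanes, which matches the `blockSize >> 2` and `blockSize & 3` arithmetic in the CMSIS-DSP source.

```c
#include <stdint.h>

#define HELIUM_VECTOR_BITS 128U   /* width of one Q register, as stated above */

/* Number of elements of the given width that fit in one vector register. */
uint32_t lanes_per_vector(uint32_t element_bits)
{
    return HELIUM_VECTOR_BITS / element_bits;
}

/* Number of loop iterations that process a completely filled vector. */
uint32_t full_vector_iterations(uint32_t block_size, uint32_t element_bits)
{
    return block_size / lanes_per_vector(element_bits);
}

/* Elements left over for a final, partially filled vector (the "tail"). */
uint32_t tail_elements(uint32_t block_size, uint32_t element_bits)
{
    return block_size % lanes_per_vector(element_bits);
}
```

For example, a block of 18 Q31 samples needs four full-vector iterations and leaves a tail of two elements.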
The latest CMSIS-DSP library takes advantage of the vector instructions for increased performance.
For example, the arm_mult_q31() function now provides an implementation for Helium.
/* ----------------------------------------------------------------------
 * Project: CMSIS DSP Library
 * Title: arm_mult_q31.c
 * Description: Q31 vector multiplication
 *
 * $Date: 18. March 2019
 * $Revision: V1.6.0
 *
 * Target Processor: Cortex-M cores
 * -------------------------------------------------------------------- */
/*
 * Copyright (C) 2010-2019 ARM Limited or its affiliates. All rights reserved.
 *
 * SPDX-License-Identifier: Apache-2.0
 *
 * Licensed under the Apache License, Version 2.0 (the License); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an AS IS BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include "arm_math.h"

/**
  @ingroup groupMath
 */

/**
  @addtogroup BasicMult
  @{
 */

/**
  @brief Q31 vector multiplication.
  @param[in] pSrcA points to the first input vector.
  @param[in] pSrcB points to the second input vector.
  @param[out] pDst points to the output vector.
  @param[in] blockSize number of samples in each vector.
  @return none

  @par Scaling and Overflow Behavior
  The function uses saturating arithmetic.
  Results outside of the allowable Q31 range[0x80000000 0x7FFFFFFF] are saturated.
 */

#if defined(ARM_MATH_MVEI)

#include "arm_helium_utils.h"

void arm_mult_q31(
    const q31_t * pSrcA,
    const q31_t * pSrcB,
    q31_t * pDst,
    uint32_t blockSize)
{
    uint32_t blkCnt; /* loop counters */
    q31x4_t vecA, vecB;

    /* Compute 4 outputs at a time */
    blkCnt = blockSize >> 2;
    while (blkCnt > 0U)
    {
        /*
         * C = A * B
         * Multiply the inputs and then store the results in the destination buffer.
         */
        vecA = vld1q(pSrcA);
        vecB = vld1q(pSrcB);
        vst1q(pDst, vqdmulhq(vecA, vecB));
        /*
         * Decrement the blockSize loop counter
         */
        blkCnt--;
        /*
         * advance vector source and destination pointers
         */
        pSrcA += 4;
        pSrcB += 4;
        pDst += 4;
    }
    /*
     * tail
     */
    blkCnt = blockSize & 3;
    if (blkCnt > 0U)
    {
        mve_pred16_t p0 = vctp32q(blkCnt);
        vecA = vld1q(pSrcA);
        vecB = vld1q(pSrcB);
        vstrwq_p(pDst, vqdmulhq(vecA, vecB), p0);
    }
}

#else

void arm_mult_q31(
    const q31_t * pSrcA,
    const q31_t * pSrcB,
    q31_t * pDst,
    uint32_t blockSize)
{
    uint32_t blkCnt; /* Loop counter */
    q31_t out; /* Temporary output variable */

#if defined (ARM_MATH_LOOPUNROLL)

    /* Loop unrolling: Compute 4 outputs at a time */
    blkCnt = blockSize >> 2U;

    while (blkCnt > 0U)
    {
        /* C = A * B */

        /* Multiply inputs and store result in destination buffer. */
        out = ((q63_t) *pSrcA++ * *pSrcB++) >> 32;
        out = __SSAT(out, 31);
        *pDst++ = out << 1U;

        out = ((q63_t) *pSrcA++ * *pSrcB++) >> 32;
        out = __SSAT(out, 31);
        *pDst++ = out << 1U;

        out = ((q63_t) *pSrcA++ * *pSrcB++) >> 32;
        out = __SSAT(out, 31);
        *pDst++ = out << 1U;

        out = ((q63_t) *pSrcA++ * *pSrcB++) >> 32;
        out = __SSAT(out, 31);
        *pDst++ = out << 1U;

        /* Decrement loop counter */
        blkCnt--;
    }

    /* Loop unrolling: Compute remaining outputs */
    blkCnt = blockSize % 0x4U;

#else

    /* Initialize blkCnt with number of samples */
    blkCnt = blockSize;

#endif /* #if defined (ARM_MATH_LOOPUNROLL) */

    while (blkCnt > 0U)
    {
        /* C = A * B */

        /* Multiply inputs and store result in destination buffer. */
        out = ((q63_t) *pSrcA++ * *pSrcB++) >> 32;
        out = __SSAT(out, 31);
        *pDst++ = out << 1U;

        /* Decrement loop counter */
        blkCnt--;
    }
}

#endif /* defined(ARM_MATH_MVEI) */

/**
  @} end of BasicMult group
 */
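The arithmetic performed by the two code paths can be modeled in portable C. The sketch below is an illustration only: the function names are hypothetical, the vector model covers a single 32-bit lane of a saturating doubling multiply returning the high half, and both models assume that the compiler performs an arithmetic right shift on negative values.

```c
#include <stdint.h>

/* Hypothetical reference models for illustration; not CMSIS-DSP functions.
   Both assume >> on a negative value is an arithmetic shift, which holds
   for GCC, Clang and Arm Compiler 6. */

/* One lane of the Helium path: saturating doubling multiply, high half. */
int32_t q31_mul_vector_lane_model(int32_t a, int32_t b)
{
    if (a == INT32_MIN && b == INT32_MIN) {
        return INT32_MAX;                 /* 2*a*b would overflow: saturate */
    }
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)(product >> 31);      /* same as (2 * product) >> 32 */
}

/* The scalar path: high word of the product, saturate to 31 bits, double. */
int32_t q31_mul_scalar_model(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    int32_t out = (int32_t)(product >> 32);
    if (out > 0x3FFFFFFF)  out = 0x3FFFFFFF;    /* upper bound of __SSAT(out, 31) */
    if (out < -0x40000000) out = -0x40000000;   /* lower bound of __SSAT(out, 31) */
    return out * 2;                             /* equivalent to out << 1 */
}
```

Both models return 0x20000000 (0.25) for 0.5 × 0.5. In this model the scalar path always clears the least-significant bit of the result while the vector lane keeps it, so the two can differ by one LSB, and they saturate to 0x7FFFFFFE and 0x7FFFFFFF respectively when both inputs are 0x80000000.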
Let us look at how to compile and run the vector multiply and vector add functions with 32-bit integers targeting both Cortex-M7 and Cortex-M55 and compare the results.
We are going to use the CMSIS Test Framework to get going quickly and enable performance comparisons across the wide variety of functions provided by the CMSIS DSP library. This is a great way to get started doing performance analysis without the need to write any software.
The README file for the test framework provides many more details about how it works. Our goal is to use the test framework to run some tests and compare the results for the Cortex-M55 processor. Along the way, we learn how to add custom hardware systems to CMSIS to use for test execution.
To make the steps easy to follow and repeat, the process of setting up and using the CMSIS test framework is captured in a Docker project on GitHub. Numerous tutorials on how to use Docker, and many articles on the benefits it provides, are available.
Docker can be installed using Docker Desktop for Windows or Mac or on Linux using the instructions for the Docker Community Edition.
To get the project from GitHub:
git clone https://github.com/ARM-software/Tool-Solutions.git
cd Tool-Solutions/docker/cmsis-models
Before building the Docker image, download Arm Compiler 6.14 for Linux. This can be done using the get-ac6.sh script or with a wget command:
wget https://developer.arm.com/-/media/Files/downloads/compiler/DS500-BN-00026-r5p0-16rel0.tgz
The README file of the GitHub project has more information about how to obtain the license needed to run Arm Compiler.
The Docker build command is in build.sh. It can also be entered manually.
docker build -t cmsis-models -f Dockerfile .
The following is a build log. In this log everything is cached by Docker, but it shows all the commands which are in the Dockerfile. The log is useful for comparison if anything goes wrong.
$ ./build.sh
Sending build context to Docker daemon 291.1MB
Step 1/23 : FROM ubuntu:18.04
 ---> ccc6e87d482b
Step 2/23 : RUN echo "root:docker" | chpasswd
 ---> Using cache
 ---> 9246a019eef8
Step 3/23 : RUN apt-get update && apt-get -y install sudo git vim wget make python3 python3-pyparsing python3-numpy python3-colorama
 ---> Using cache
 ---> 86f8f4f90c9a
Step 4/23 : RUN useradd --create-home -s /bin/bash -m user1 && echo "user1:docker" | chpasswd && adduser user1 sudo
 ---> Using cache
 ---> 55950aabc125
Step 5/23 : RUN wget https://github.com/Kitware/CMake/releases/download/v3.14.3/cmake-3.14.3-Linux-x86_64.sh
 ---> Using cache
 ---> 1fdcd8e4536d
Step 6/23 : RUN bash ./cmake-3.14.3-Linux-x86_64.sh --skip-license --exclude-subdir --prefix=/usr/local
 ---> Using cache
 ---> 9aea8ff4c694
Step 7/23 : WORKDIR /home/user1
 ---> Using cache
 ---> 364617ba6e60
Step 8/23 : USER user1
 ---> Using cache
 ---> b5e7f33cc176
Step 9/23 : RUN mkdir /home/user1/tmp
 ---> Using cache
 ---> 77d6c2920f79
Step 10/23 : ADD --chown=user1:user1 DS500-BN-00026-r5p0-16rel0.tgz /home/user1/tmp
 ---> Using cache
 ---> de4f96f90128
Step 11/23 : RUN /home/user1/tmp/install_x86_64.sh --i-agree-to-the-contained-eula --no-interactive -d /home/user1/AC6
 ---> Using cache
 ---> e086489bb8e1
Step 12/23 : RUN rm -rf /home/user1/tmp
 ---> Using cache
 ---> 000539ab153e
Step 13/23 : RUN mkdir /home/user1/Platforms
 ---> Using cache
 ---> 6f2d6c7817d9
Step 14/23 : COPY --chown=user1:user1 setup-cmsis.sh /home/user1
 ---> Using cache
 ---> 5132eb713128
Step 15/23 : COPY --chown=user1:user1 configPlatform.cmake /home/user1
 ---> Using cache
 ---> dbb1857d69c9
Step 16/23 : COPY --chown=user1:user1 Platforms /home/user1/Platforms
 ---> Using cache
 ---> 345499509d17
Step 17/23 : COPY --chown=user1:user1 cmake_M7F.sh /home/user1
 ---> Using cache
 ---> 48e7baa40e97
Step 18/23 : COPY --chown=user1:user1 cmake_M55.sh /home/user1
 ---> Using cache
 ---> 6a9ed49d5d7c
Step 19/23 : RUN /home/user1/setup-cmsis.sh
 ---> Using cache
 ---> 03e05f172413
Step 20/23 : RUN echo "export ARMLMD_LICENSE_FILE=7010@localhost" >> /home/user1/.bashrc
 ---> Using cache
 ---> 6ee70bd89feb
Step 21/23 : RUN echo "export PATH=$PATH:/home/user1/AC6/bin" >> /home/user1/.bashrc
 ---> Using cache
 ---> dbf15c797265
Step 22/23 : RUN echo "export ARM_TOOL_VARIANT=ult" >> /home/user1/.bashrc
 ---> Using cache
 ---> f08afeaf00bf
Step 23/23 : COPY --chown=user1:user1 build_basic_math.sh /home/user1
 ---> Using cache
 ---> 596627eea0c1
Successfully built 596627eea0c1
Successfully tagged cmsis-models:latest
To run the Docker image, use the docker run command (also in run.sh), which starts the container.
docker run --network host -it cmsis-models /bin/bash
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

user1@therefuge:~$ ls
AC6 Platforms cmake_M55.sh configPlatform.cmake
CMSIS_5 build_basic_math.sh cmake_M7F.sh setup-cmsis.sh
user1@therefuge:~$
With the container running we are now ready to build the application using the CMSIS test framework.
If you used Docker and the container is running, jump down to the "Building test applications" section to build a test using the build_basic_math.sh script.
The setup can also be done without Docker. The next section gives more details about how to create the development environment for using the CMSIS test framework on a Linux machine.
The process described here is for Ubuntu 18.04. Similar commands are possible on other Linux distributions.
The CMSIS test framework is built on Python and CMake. Before getting started, install CMake 3.14.3 and Python 3. The test framework is sensitive to the CMake version, and 3.14.3 is recommended.
sudo apt-get install -y make wget python3 python3-pyparsing python3-numpy python3-colorama
wget https://github.com/Kitware/CMake/releases/download/v3.14.3/cmake-3.14.3-Linux-x86_64.sh
sudo bash ./cmake-3.14.3-Linux-x86_64.sh --skip-license --exclude-subdir --prefix=/usr/local
Arm Compiler version 6.14 and later includes support for the Cortex-M55. Install it by downloading the package, extracting it, and running the installer.
tar xvfz DS500-BN-00026-r5p0-16rel0.tgz
./install_x86_64.sh
Make sure to set the ARMLMD_LICENSE_FILE environment variable to point to a license server with a valid Arm Compiler 6 license. Next, add the compiler's bin/ directory to the PATH variable. For more installation details, refer to the Installing Arm Compiler documentation.
With the required tools installed, let us move on to get CMSIS and compile it.
To get the CMSIS library:
git clone https://github.com/ARM-software/CMSIS_5.git
The CMSIS library contains support for both Fixed Virtual Platforms and the MPS3 board. Running the tests on another platform requires some information about the new system to be supplied. In this case, we want to run on a custom Fast Model system and an equivalent Cycle Model system to compare the performance of different Arm processor types. The Platforms/ directory of the GitHub project contains the files required to add a new platform. These files are copied into the CMSIS DSP directory to add the new platform support. The new platform is named IPSS, short for IP Selection System. The IPSS is a very simple system with a CPU and memory, and files are provided to support both the Cortex-M7 and the Cortex-M55 on it. The following two commands add the new platform support to CMSIS:
cp configPlatform.cmake CMSIS_5/CMSIS/DSP
cp -r Platforms/IPSS CMSIS_5/CMSIS/DSP/Platforms/
Once the platform support is added, there are customized cmake scripts for each processor. These files are copied to the CMSIS test framework directory:
cp cmake_M7F.sh CMSIS_5/CMSIS/DSP/Testing
cp cmake_M55.sh CMSIS_5/CMSIS/DSP/Testing
Use the build_basic_math.sh script in the top-level directory to build the basic math functions in the DSP library.
The build_basic_math.sh script takes the CPU type, a data type, and a test number. The test numbers map to the functions using the table below:
| ID | Function       |
|----|----------------|
| 1  | vec_mult_q31   |
| 2  | vec_add_q31    |
| 3  | vec_sub_q31    |
| 4  | vec_abs_q31    |
| 5  | vec_negate_q31 |
| 6  | vec_offset_q31 |
| 7  | vec_scale_q31  |
| 8  | vec_dot_q31    |
To build the vector multiply with 32-bit integer data types for each CPU use:
./build_basic_math.sh -t M55 -d Q31 -i 1
./build_basic_math.sh -t M7F -d Q31 -i 1
The resulting executables are CMSIS_5/CMSIS/DSP/Testing/build_M7F_Q31/Testing for the Cortex-M7 and CMSIS_5/CMSIS/DSP/Testing/build_M55_Q31/Testing for the Cortex-M55.
Arm Fast Models provide fast, flexible programmer's view models of Arm IP, allowing you to develop software prior to silicon availability. They allow full control over the simulation, including profiling, debug, and trace. Fast Models are also a great way to test software applications before trying them on Arm Cycle Models. Fast Models run quickly and have a broad set of debug and trace features to help eliminate functional problems.
Fast Models are available for the Cortex-M55 and the Cortex-M7. These models can be used to make sure the CMSIS applications behave as expected before moving to cycle-accurate benchmarking.
A simple system which can run the Cortex-M7 software is shown in the following code. Refer to the Fast Models Quick Start for more details on how to get started.
/*
 * Copyright 2020 ARM Limited. All rights reserved.
 */

component m7
{
    composition
    {
        armcortexm7ct : ARMCortexM7CT();
        pvbus2ambapv : PVBus2AMBAPV();
        Memory : RAMDevice();
        Clock100MHz : ClockDivider(mul=100000000);
        Clock1Hz : MasterClock();
        BusDecoder : PVBusDecoder()
    }

    connection
    {
        Clock1Hz.clk_out => Clock100MHz.clk_in;
        BusDecoder.pvbus_m_range[0x0..0x9fffffff] => Memory.pvbus;
        pvbus2ambapv.amba_pv_m => self.amba_pv_m;
        armcortexm7ct.pvbus_m => BusDecoder.pvbus_s;
        Clock100MHz.clk_out => armcortexm7ct.clk_in;
        BusDecoder.pvbus_m_range[0xa8000000..0xa8001000] => pvbus2ambapv.pvbus_s;
    }

    properties
    {
        component_type = "System";
    }

    master port<AMBAPV> amba_pv_m;
}
A similar system can be used for the Cortex-M55 to run CMSIS executables on the Fast Model.
The output from the test is a set of letters and numbers. Here is the output from the vector multiply function:
Fast Models [11.9.41 (Nov 26 2019)]
Copyright 2000-2019 ARM Limited.
All Rights Reserved.

S: g 1
S: g 1
S: g 1
S: s 2
SS: 1 0 0 143 Y
E: b 16
S: 1 0 0 247 Y
E: b 32
S: 1 0 0 455 Y
E: b 64
S: 1 0 0 871 Y
E: b 128
S: 1 0 0 1703 Y
E: b 256
SSSS
_[TEST COMPLETE]_________________________________________________

simulation is complete

Info: /OSCI/SystemC: Simulation stopped by user.
The output includes information about the test suite, in this case DSP BasicMaths, and the function to be run. The selected function is run with multiple block sizes; in this example it multiplies 16, 32, 64, 128, and 256 32-bit integer values. The cycle count is also embedded in the results. The CMSIS GitHub project has many more details about how to select tests and process results.
Once the software is tested and confirmed to work as expected using Fast Models, Arm Cycle Models can be used to check the performance. The Cortex-M55 Cycle Model is now available on Arm IP Exchange.
Cycle Models run a cycle-accurate simulation of each application.
The following table compares the cycle counts of the Cortex-M7 and the Cortex-M55 for the vector multiply and vector add functions over a range of block sizes. The functions were selected to highlight the new vector instructions of the Cortex-M55 and do not represent a comprehensive performance analysis across a wide variety of software. Please refer to the Arm Developer website for Cortex-M comparison information.
| Function (block size) | Cortex-M55 cycles | Cortex-M7 cycles |
|-----------------------|-------------------|------------------|
| vec_mult_q31(16)      | 172               | 364              |
| vec_mult_q31(32)      | 288               | 684              |
| vec_mult_q31(64)      | 520               | 1308             |
| vec_mult_q31(128)     | 984               | 2556             |
| vec_mult_q31(256)     | 1912              | 5052             |
| vec_add_q31(16)       | 168               | 266              |
| vec_add_q31(32)       | 280               | 491              |
| vec_add_q31(64)       | 504               | 929              |
| vec_add_q31(128)      | 952               | 1805             |
| vec_add_q31(256)      | 1848              | 3557             |
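To turn the cycle counts above into a relative figure, the Cortex-M7 count can be divided by the Cortex-M55 count. The following is a small sketch of that arithmetic using integer math; the function name speedup_x100 is hypothetical, and the result is expressed in hundredths and rounded down.

```c
#include <stdint.h>

/* Hypothetical helper for illustration: speedup of the Cortex-M55 over the
   Cortex-M7 in hundredths (264 means roughly 2.64x), rounded down. */
uint32_t speedup_x100(uint32_t m7_cycles, uint32_t m55_cycles)
{
    return (m7_cycles * 100U) / m55_cycles;
}
```

Applied to the table, vec_mult_q31 at a block size of 256 gives 5052 / 1912, or about 2.64x, and vec_add_q31 at the same block size gives 3557 / 1848, or about 1.92x. The ratio is smaller at a block size of 16 (about 2.11x and 1.58x respectively).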
Follow the links below for further tutorials on using CMSIS-DSP:
The Cortex-M55 processor provides a significant uplift in ML and DSP performance for IoT devices. Arm offers several development tools and models to help partners along their path to bringing a Cortex-M55 based device to market. Arm tools and models are especially useful for understanding architecture differences and performance improvements compared to previous Cortex-M designs. This article explained how to use the CMSIS Test Framework to build software and run it on models to compare performance during IP selection. Following the methodology helps make sure performance is well understood right from the start.