Getting started with Armv8.1-M based processor: software development hints and tips

October 29, 2021


Silicon chips based on the Arm Cortex-M55 and Cortex-M85 processors are reaching the market. To help software developers make the most of Cortex-M55/M85 based devices, we have prepared this page to highlight key points and useful information related to Cortex-M55 and Cortex-M85 software development. Please note that this page is a work in progress and will be updated as new materials and information become available.

Product documents and resources

For an overview of the Armv8.1-M architecture and the Cortex-M55 processor, the following papers can be useful:

	Link to document
Introduction to the Armv8.1-M Architecture	https://www.arm.com/resources/white-paper/intro-armv8-1-m-architecture
Introduction to the Arm Cortex-M55 processor	https://www.arm.com/resources/white-paper/cortex-m55-introduction

The official product pages and product documents can be found here:

	Link to document
Cortex-M55 product page	https://developer.arm.com/Processors/Cortex-M55
Cortex-M55 Technical Reference Manual (TRM)	https://developer.arm.com/documentation/101051/0101/
Cortex-M55 Device Generic User Guide	https://developer.arm.com/documentation/101273/0101/
Cortex-M55 Software Developer Errata Notice (SDEN)	https://developer.arm.com/documentation/SDEN1679655/latest/
Cortex-M55 Software Optimization Guide	https://developer.arm.com/documentation/102692/latest/
Cortex-M85 product page	https://developer.arm.com/Processors/Cortex-M85
Cortex-M85 Technical Reference Manual (TRM)	https://developer.arm.com/documentation/101924/latest
Cortex-M85 Device Generic User Guide	https://developer.arm.com/documentation/101928/latest

Key resources for Helium programming:

	Link to document
Helium technology pages	https://www.arm.com/technologies/helium https://developer.arm.com/Architectures/Helium
Helium programmer’s guide: Introduction to Helium	https://developer.arm.com/documentation/102102/latest/
Helium programmer’s guide: Coding for Helium	https://developer.arm.com/documentation/102095/latest/
Helium programmer’s guide: Migration to Helium from Neon	https://developer.arm.com/documentation/102107a/latest
Arm Helium technology M-Profile Vector Extension (MVE) for Arm Cortex-M processor (reference book)	https://www.arm.com/resources/education/books/mve-reference-book
Intrinsic function lookup	https://developer.arm.com/architectures/instruction-sets/intrinsics/

Cortex-M85 announcement blog: "Cortex-M85: Highest Performing Cortex-M Processor ever" (Arm Community, Internet of Things (IoT) blog)

A presentation video of "Arm DevSummit 2022 - Harnessing the capabilities from the Arm Cortex-M85 processor" is available here (requires registration).

There are also many links to other Cortex-M-related resources listed in this page: https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/cortex-m-resources.

Differences between Cortex-M55 r1 (released in 2021) and r0 (released in 2020)

The Cortex-M55 r1 release includes a range of enhancements compared with r0:

Support for Arm Custom Instructions (also known as Custom Datapath Extension in the architecture)
Processor pipeline optimizations
Data prefetcher update, including new bit fields in the Prefetch Control Register (PFCR)
Limited static branch prediction in a few cases (disabled by default in r1p0 and enabled by default in r1p1).
Maximum number of hardware comparators in the Data Watchpoint and Trace (DWT) unit increased from 4 to 8.
Support for a new Trace Port Interface Unit (TPIU) with 16-bit trace data width. The details of the TPIU-M can be found in https://developer.arm.com/documentation/PJDOC-1779577084-533713/latest
Support for a Programmable MBIST Controller PMC-100 which supports online MBIST operations. Details of PMC-100 can be found in https://developer.arm.com/documentation/101528/latest/
A range of functional safety features.

Due to the pipeline optimizations, there can be instruction cycle differences between silicon chips based on r0 and r1 of the Cortex-M55.

Differences between Cortex-M85 r1 (released in 2022) and r0 (released in 2020)

The Cortex-M85 r1 release includes a range of enhancements compared with r0:

Support for Arm Custom Instructions (also known as Custom Datapath Extension in the architecture)
Support for a Programmable MBIST Controller PMC-100 which supports online MBIST operations. Details of PMC-100 can be found in https://developer.arm.com/documentation/101528/latest/
A range of functional safety features.

Using Arm Compiler 6 and Arm toolchains

Access to Arm Compiler 6

Arm Compiler 6 is available here:

Arm Development Studio: https://developer.arm.com/tools-and-software/embedded/arm-development-studio
Keil Microcontroller Development Kit (MDK): keil.com
Keil Studio (beta): https://www.keil.arm.com/
Standalone version: https://developer.arm.com/tools-and-software/embedded/arm-compiler/downloads/version-6

For best performance, please use Arm Compiler 6.15 or later. Arm Compiler 6.16 is now included in Keil MDK 5.34. If you are using an older version of Keil MDK, please upgrade to the latest version. Note: Version 6.14 does support Armv8.1-M but is not as optimized as newer versions.

Command-line options

When using Cortex-M55 with Arm Compiler 6, the following command-line options can be used to select a specific Cortex-M55 configuration:

Cortex-M55 processor configuration	armclang options for using “-mcpu”	armlink/fromelf options "--cpu"
No Helium, no FPU	-mcpu=cortex-m55+nomve+nofp	--cpu Cortex-M55.no_mve.no_fp
No Helium, with FPU	-mcpu=cortex-m55+nomve	--cpu Cortex-M55.no_mve
With Integer Helium, no FPU	-mcpu=cortex-m55+nofp	--cpu Cortex-M55.no_fp
With Integer Helium, scalar FPU	-mcpu=cortex-m55+nomve.fp	--cpu Cortex-M55.no_mvefp
With full feature	-mcpu=cortex-m55	--cpu Cortex-M55

When using Cortex-M85 with Arm Compiler 6, the following command-line options can be used to select a specific Cortex-M85 configuration:

Cortex-M85 processor configuration	armclang options for using “-mcpu”	armlink/fromelf options "--cpu"
No Helium, no FPU	-mcpu=cortex-m85+nomve+nofp	--cpu Cortex-M85.no_mve.no_fp
No Helium, with FPU	-mcpu=cortex-m85+nomve	--cpu Cortex-M85.no_mve
With Integer Helium, no FPU	-mcpu=cortex-m85+nofp	--cpu Cortex-M85.no_fp
With full feature	-mcpu=cortex-m85	--cpu Cortex-M85

By default, when Cortex-M55/M85 is selected in Arm Compiler 6, the compiler assumes that the target supports Helium and the FPU. To disable generation of Helium instructions, you need to add “+nomve”.

You can also specify architecture instead of specifying the processor. For example:

Cortex-M55/M85 processor configuration	armclang options for using “-march”	armlink/fromelf options "--cpu"
No Helium, no FPU	-march=armv8.1-m.main+dsp	--cpu 8.1-M.Main.no_mve.no_fp
No Helium, with FPU	-march=armv8.1-m.main+dsp+fp.dp	--cpu 8.1-M.Main.no_mve
With Integer Helium, no FPU	-march=armv8.1-m.main+mve	--cpu 8.1-M.Main.no_fp
With Integer Helium, scalar FPU	-march=armv8.1-m.main+mve+fp.dp	--cpu 8.1-M.Main.no_mvefp
With full feature	-march=armv8.1-m.main+mve.fp+fp.dp	--cpu 8.1-M.Main

Please note that:

The Helium option (“+mve”) implies that the legacy DSP feature (“+dsp”) is also enabled.
The armclang "+nofp" option implies "+nomve.fp".
Architecture options do not enable processor-specific optimizations.

Other information:

Combinations of architecture support options are documented in Arm Compiler 6 Reference Guide: https://developer.arm.com/documentation/101754/0620/armclang-Reference/Other-Compiler-specific-Features/Supported-architecture-feature-combinations-for-specific-processors
Running disassembly: To disassemble Helium instructions, the fromelf command line needs to specify the Helium feature. For example:
- $> fromelf -c --cpu=8.1-M.Main.mve.fp test.elf --output list.txt
- $> fromelf -c --cpu=cortex-m55 test.elf --output list.txt
Running disassembly: To disassemble Arm Custom Instructions (Custom Datapath Extension), the fromelf command line needs to specify the CDE feature (note: CDE support is not available in Cortex-M55/M85 revision 0). For example, where N is the coprocessor number:
- $> fromelf -c --cpu=8.1-M.Main.mve.fp test.elf --coprocN=cde --output list.txt
- $> fromelf -c --cpu=cortex-m55 test.elf --coprocN=cde --output list.txt

Pre-processing macro __ARM_FEATURE_MVE: When using C/C++ compilers that conform to the Arm C Language Extensions (ACLE), a C macro called __ARM_FEATURE_MVE is defined when the compilation has the Helium (MVE) feature enabled. This macro can be used in C/C++ code to select Helium versions of DSP/machine learning code. For example, some CMSIS-DSP library functions are specific to MVE, and program code can use this macro for conditional compilation of those library functions. Please note that __ARM_FEATURE_MVE is a 2-bit value: bit 0 is set when integer MVE is available and bit 1 is set when floating-point MVE is available, so the upper bit can be used to include or exclude the MVE floating-point parts. For example:

#if (__ARM_FEATURE_MVE & 2)
  /* MVE Float */
  …
#endif
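
As a minimal sketch of how both bits of the macro might be examined, assuming an ACLE-conforming compiler, the fragment below maps the macro value to one of three levels. The names MVE_BUILD_VALUE and mve_level() are hypothetical and exist only for this illustration; when the macro is not defined (for example, when building on a host PC), the value falls back to 0.

```c
/* Hypothetical illustration: classify the Helium (MVE) support of a build. */
#if defined(__ARM_FEATURE_MVE)
  #define MVE_BUILD_VALUE ((unsigned)(__ARM_FEATURE_MVE))
#else
  #define MVE_BUILD_VALUE 0U   /* macro not defined: built without Helium */
#endif

/* Returns 0 for no MVE, 1 for integer MVE only,
 * 2 for integer and floating-point MVE. */
static inline int mve_level(unsigned feature_value)
{
    if ((feature_value & 2U) != 0U) {
        return 2;   /* bit 1: MVE floating-point */
    }
    if ((feature_value & 1U) != 0U) {
        return 1;   /* bit 0: MVE integer */
    }
    return 0;
}
```

Calling mve_level(MVE_BUILD_VALUE) at start-up is one way to log which variant of a library was built.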

Auto-vectorization support

One of the key Arm Compiler 6 features that is useful when using an Armv8.1-M processor is the auto-vectorization support. This enables a range of processing workloads to take advantage of the Helium and low-overhead-branch extension features without completely rewriting them for low-level optimization.

In Arm Compiler 6, auto-vectorization is enabled from optimization level “-O2” upward. For best performance, please set the compiler optimization level to “-Ofast” or higher (“-O2” and lower optimization levels do not give all the performance benefits).
The auto-vectorization optimization can be enabled and disabled using “-fvectorize” and “-fno-vectorize”. More details can be found in https://developer.arm.com/documentation/101754/0616/armclang-Reference/armclang-Command-line-Options/-fvectorize---fno-vectorize?lang=en.
Arm Compiler 6 (and LLVM) provides Vectorization diagnostic options:

-Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize

Sometimes, software developers might need to change the source code slightly to help the compiler vectorize certain loop operations. Examples of code that cannot be vectorized, or is difficult to vectorize, include:

Loops with interdependencies between different loop iterations
Loops with break clauses
Loops with complex conditions
Loops where the number of iterations is unknown at start
Loops with double-precision floating point processing (low-overhead branch might still be used).
Loops that involve operations that cannot be vectorized (for example, system operations like memory barriers)

Some existing program code might contain manually unrolled loops because software developers unrolled them to get better performance. When such code is ported to the Cortex-M55 processor, the manual unrolling can make it more difficult for Arm Compiler 6 to identify auto-vectorization opportunities. Therefore, it might be necessary to modify the code to remove the manual loop unrolling.

Software developers should also remove pointer aliasing in loops by using the restrict qualifier where applicable.
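
The following is a minimal sketch of a loop written with these points in mind: the iteration count is known before the loop starts, there is no break clause, no iteration depends on another, and restrict tells the compiler that the three arrays do not overlap. The function name vec_mac_f32() is hypothetical. Whether the compiler actually vectorizes it depends on the options used; the -Rpass=loop-vectorize diagnostics listed above can be used to check.

```c
#include <stddef.h>

/* Multiply-accumulate over three non-overlapping arrays.
 * restrict is a promise from the caller that dst, a and b
 * never alias each other. */
void vec_mac_f32(float *restrict dst,
                 const float *restrict a,
                 const float *restrict b,
                 size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] += a[i] * b[i];
    }
}
```

Note that restrict is only safe when the promise holds: calling this function with overlapping buffers is undefined behavior.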

Processor initialization

Enabling Armv8.1-M feature: Low-overhead branch

By default, the Loop and branch info cache is disabled after processor reset. To get the best out of the Low Overhead Branch (LOB) extension in Armv8.1-M, set the LOB bit in the Configuration and Control Register (CCR) to 1 to enable this cache. (Note: This hardware cache is not related to the I-cache and D-cache.)

If you are using CMSIS-CORE based startup code, this is normally handled in SystemInit().
This control bit is banked between Security states. Therefore, both Secure privileged software and Non-secure privileged software need to set this bit. Alternatively, Secure privileged software can set the Non-secure version of this control bit, but this arrangement is rare.
After setting this bit, an ISB instruction must be executed to ensure that the change takes effect.

For example, if you are using CMSIS-CORE in your project:

  // Enable Loop and branch info cache
  SCB->CCR |= SCB_CCR_LOB_Msk;
  __DSB();
  __ISB();

If you are not using CMSIS-CORE in your project:

#define CCR_ADDR (0xE000ED14UL)
#define CCR      (*(volatile unsigned int *) CCR_ADDR)
#define __ISB()  __builtin_arm_isb(0xF)
#define __DSB()  __builtin_arm_dsb(0xF)
  // Set CCR.LOB (bit 19) to enable the Loop and branch info cache
  CCR |= 0x00080000UL;
  __DSB();
  __ISB();

Enabling Armv8.1-M feature: Helium

Similar to the floating-point unit (FPU), the Helium hardware needs to be enabled before it can be used. The operation is the same as enabling the FPU: to use Helium features, coprocessors 10 and 11 must be enabled. For example, if you are using CMSIS-CORE in your project:

  // Enable CP10 and CP11
  SCB->CPACR |= ((3U << 10U*2U) | /* CP10 Full Access */
                 (3U << 11U*2U) );/* CP11 Full Access */
  __DSB();
  __ISB();

If you are not using CMSIS-CORE in your project:

#define CPACR_ADDR (0xE000ED88UL)
#define CPACR      (*(volatile unsigned int *) CPACR_ADDR)
#define __ISB()  __builtin_arm_isb(0xF)
#define __DSB()  __builtin_arm_dsb(0xF)
  // Enable CP10 and CP11 (full access)
  CPACR |= ((3U << 10U*2U) | (3U << 11U*2U));
  __DSB();
  __ISB();

If TrustZone is used, Secure privileged software should also set up the NSACR and CPPWR registers to define whether the Non-secure world is allowed to access the Helium and FPU features.
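
As a rough sketch of the NSACR part only, assuming the code runs in Secure privileged state: bits 10 and 11 of NSACR (CP10 and CP11) control whether the Non-secure world may access the FPU and Helium. The helper name and mask macros below are hypothetical. The helper takes a pointer so that the bit manipulation can be checked off-target; on a device using CMSIS-CORE the argument would be the address of the NSACR register (for example &SCB->NSACR). CPPWR setup is not shown.

```c
#include <stdint.h>

#define NSACR_CP10_MASK ((uint32_t)1U << 10)   /* Non-secure access to CP10 */
#define NSACR_CP11_MASK ((uint32_t)1U << 11)   /* Non-secure access to CP11 */

/* Hypothetical helper: allow Non-secure software to use the FPU and Helium.
 * On the device, barriers (__DSB(); __ISB();) would normally follow,
 * as in the other initialization sequences on this page. */
static inline void nsacr_allow_fpu_helium(volatile uint32_t *nsacr)
{
    *nsacr |= (NSACR_CP10_MASK | NSACR_CP11_MASK);
}
```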

Enabling the caches

For Cortex-M55/M85 based devices that have instruction and data caches implemented, you might need to enable these caches based on the application requirements. Generally, if code runs from, or data is accessed in, memories connected to the main AXI bus, it is best to enable the caches. For example:

If the program is executing from flash memories (normally much slower than the processor), or other memories with high access latency, you should enable both instruction and data caches.
If the application utilizes memories (for example, SRAM) on the main bus for data storage, you should enable the data cache.

By default, the caches are disabled at startup. In CMSIS-CORE based software projects you can use:

SCB_EnableICache() – to enable the instruction cache
SCB_EnableDCache() – to enable the data cache

These functions include manual cache invalidation. In Armv8-M architecture, caches can also be invalidated automatically when being enabled. On the Cortex-M55/M85 processor, you can enable the caches using the following code:

(If you are using CMSIS-CORE in your project):

// Enable Instruction and Data caches
SCB->CCR |= (SCB_CCR_IC_Msk|SCB_CCR_DC_Msk);
__DSB();
__ISB();

(If you are not using CMSIS-CORE in your project):

#define CCR_ADDR (0xE000ED14UL)
#define CCR      (*(volatile unsigned int *) CCR_ADDR)
#define __ISB()  __builtin_arm_isb(0xF)
#define __DSB()  __builtin_arm_dsb(0xF)
  // Set CCR.IC (bit 17) and CCR.DC (bit 16)
  CCR |= 0x00030000UL;
  __DSB();
  __ISB();

Enable branch prediction in Cortex-M85

By default, branch prediction is disabled in Cortex-M85. This feature is enabled using the BP (Branch Prediction) bit in the Configuration and Control Register (CCR).

(If you are using CMSIS-CORE in your project):

// Enable Branch Prediction
SCB->CCR |= SCB_CCR_BP_Msk;
__DSB();
__ISB();

(If you are not using CMSIS-CORE in your project):

#define CCR_ADDR (0xE000ED14UL)
#define CCR      (*(volatile unsigned int *) CCR_ADDR)
#define __ISB()  __builtin_arm_isb(0xF)
#define __DSB()  __builtin_arm_dsb(0xF)
  // Set CCR.BP (bit 18) to enable branch prediction
  CCR |= 0x00040000UL;
  __DSB();
  __ISB();

Power control setup for best performance

For Cortex-M55 r0px and r1p0: Depending on the system design, the processor might attempt to put the Extension Processing Unit (EPU) into a retention state to save power if the EPU has been enabled but is not being used. After the EPU has entered the retention state, if the software executes an FPU or Helium instruction, the processor wakes up the EPU automatically. While this is beneficial to energy efficiency, and is completely transparent to software, the automatic power switching sequences can delay the program’s operation and can therefore reduce performance.

To avoid this performance penalty, change the ELPSTATE bits in the Core Power Domain Low Power State Register (CPDLPSTATE) to 0b00 (ON) or 0b01 (clock gated). Software should switch ELPSTATE bits back to 0b11 if the application does not require EPU, for example, when the device is going to enter a sleep mode. (After a reset the value of CPDLPSTATE is 0x00000333, meaning that the processor would attempt to switch the EPU into retention state because ELPSTATE is set to OFF (0b11)).

(Setting ELPSTATE to 0b01 when using CMSIS-CORE in your project):

/* Note: This code fragment is included in the example SystemInit code for the Cortex-M55 processor */
PWRMODCTL->CPDLPSTATE = (PWRMODCTL->CPDLPSTATE & 0xFFFFFFCFUL) |
                        (0x1 << PWRMODCTL_CPDLPSTATE_ELPSTATE_Pos);

(Setting ELPSTATE to 0b01 without using CMSIS-CORE in your project):

#define CPDLPSTATE_ADDR (0xE001E300UL)
#define CPDLPSTATE      (*(volatile unsigned int *) CPDLPSTATE_ADDR)
  // Clear ELPSTATE (bits [5:4]) and set it to 0b01 (clock gated)
  CPDLPSTATE = (CPDLPSTATE & 0xFFFFFFCFUL) | (0x01UL << 4);
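
To make the bit arithmetic explicit, here is a small sketch that computes the new register value for a chosen ELPSTATE setting. It assumes, consistent with the mask and shift used above, that ELPSTATE occupies bits [5:4] of CPDLPSTATE. The enum and function names are hypothetical.

```c
#include <stdint.h>

#define ELPSTATE_POS   4U
#define ELPSTATE_MASK  ((uint32_t)0x3U << ELPSTATE_POS)

/* ELPSTATE encodings mentioned in the text above. */
enum elpstate {
    ELPSTATE_ON          = 0x0,
    ELPSTATE_CLOCK_GATED = 0x1,
    ELPSTATE_OFF         = 0x3
};

/* Return 'current' with only the ELPSTATE field replaced. */
static inline uint32_t cpdlpstate_with_elpstate(uint32_t current,
                                                enum elpstate state)
{
    uint32_t field = ((uint32_t)state << ELPSTATE_POS) & ELPSTATE_MASK;
    return (current & ~ELPSTATE_MASK) | field;
}
```

Starting from the reset value 0x00000333, selecting clock gated gives 0x00000313 and selecting ON gives 0x00000303; the other fields are left unchanged.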

Note: The CMSIS-CORE v5.7 header file for Cortex-M55 is missing the register definition for the CPDLPSTATE register. This was added in v5.8.

Enabling limited static branch prediction

Cortex-M55 r1 supports limited static branch prediction by reusing the Low-Overhead Branch (LOB) hardware. In a few cases, this can help performance. In the r1p0 release this feature is disabled by default, and can be enabled by clearing the DISLOBR bit (bit 5) of the Auxiliary Control Register (ACTLR). In the r1p1 release this bit is cleared by default.
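
A minimal sketch of clearing that bit is shown below. The helper and macro names are hypothetical, and the helper takes a pointer so that the logic can be exercised off-target. On the device the argument would point at ACTLR, which is architecturally located at 0xE000E008; adding barriers after the write, as in the other sequences on this page, is a reasonable precaution.

```c
#include <stdint.h>

#define ACTLR_DISLOBR_MASK ((uint32_t)1U << 5)   /* DISLOBR is bit 5 */

/* Hypothetical helper: clear DISLOBR to enable limited static
 * branch prediction, leaving all other ACTLR bits unchanged. */
static inline void actlr_clear_dislobr(volatile uint32_t *actlr)
{
    *actlr &= ~ACTLR_DISLOBR_MASK;
}
```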

Using the CMSIS-DSP library with the Cortex-M55/M85 processor

The CMSIS-DSP library has been optimized for the Cortex-M55/M85 processor.

	Link
Github (releases)	https://github.com/ARM-software/CMSIS_5/releases
Documentation	https://arm-software.github.io/CMSIS_5/DSP/html/index.html

In these releases, CMSIS-DSP is provided as source code only. This is different from the past, when binary builds (libraries) were also available. The change was made because the Cortex-M processors are highly configurable and building the libraries for all possible configuration variants is becoming impractical.

To compile the CMSIS-DSP libraries with Arm Compiler 6, please select the “-Ofast” optimization level for best performance.

Usually, application code using CMSIS-DSP can be directly reused in Cortex-M55 projects and can take advantage of the Helium technology immediately. However, in a few cases code modifications are required:

Please note that one of the CMSIS-DSP library biquad initialization functions is different between the Helium (MVE) and non-Helium versions:

Non-Helium version	Helium version
`arm_biquad_cascade_df1_init_f32`	`arm_biquad_cascade_df1_mve_init_f32` Note: It takes a new argument: pCoeffsMod. Its size is 32*numStages float32_t elements.

For FIR filters, when Helium is enabled, padding might be needed to adjust the size of the filter coefficient array. Please refer to the latest documentation for the detailed requirements.

For best performance, the buffers for filter processing should be at least 64-bit aligned (128-bit aligned is even better).
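
One possible way to request 128-bit (16-byte) alignment in C11 is the alignas specifier from <stdalign.h>, as sketched below. The buffer names, BLOCK_SIZE and the is_aligned() helper are hypothetical and only illustrate the idea; toolchain-specific attributes are an alternative.

```c
#include <stdint.h>
#include <stdalign.h>

#define BLOCK_SIZE 64   /* hypothetical number of samples per block */

/* 16-byte (128-bit) aligned buffers for filter input and output. */
static alignas(16) float filter_input[BLOCK_SIZE];
static alignas(16) float filter_output[BLOCK_SIZE];

/* Returns non-zero when p is aligned to 'bytes'. */
static inline int is_aligned(const void *p, uintptr_t bytes)
{
    return ((uintptr_t)p % bytes) == 0U;
}
```

A check such as is_aligned(filter_input, 16) can be placed in debug builds to catch buffers that were moved by a later change to the memory layout.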

When migrating old projects, please review the DSP functions used in the project to see if any of them are in the deprecated function list (https://arm-software.github.io/CMSIS-DSP/main/deprecated.html). If so, you should consider updating the code.

Low-level code optimizations topics

Data placement

Performance of some of the CMSIS-DSP functions can be reduced if some of the data (for example, filter coefficients) is placed in the I-TCM. This is because the I-TCM data bandwidth in Cortex-M55 is limited to 32 bits per cycle and is shared with instruction fetches. The I-TCM in Cortex-M85 has a higher bandwidth of 64 bits per cycle, but the performance of reading constant data from the I-TCM might still be affected by instruction fetches. To avoid this issue, if the function is to be executed from I-TCM, either
- Use a linker script feature to place the coefficients in other memories, or
- Avoid declaring the coefficient table as “const” and declare it as “static”.

Program code running from AXI-connected memories does not have the same issue.

The I-TCM bandwidth issue could also affect general program code (for example, control code) if the code contains a lot of literal data accesses. In such cases the performance could be improved by either
- Changing the literal data definitions so that they can be fetched from the D-TCM, or
- Running the code from main memory via AXI with caches enabled.

DSP code that contains scatter and gather memory loads and stores could perform better when the data is in the D-TCM. This is because the processor allows two separate data accesses that are 32-bit or smaller to be carried out simultaneously, provided that these accesses do not target the same D-TCM bank (Note: In Cortex-M55 and M85 there are 4 D-TCM banks interleaved by bit 2 and bit 3 of the address). For data accesses over AXI, the D-cache lookup can handle one address per cycle.
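
The bank interleaving can be illustrated with a simplified model: since the banks are selected by address bits 2 and 3, consecutive 32-bit words fall into banks 0, 1, 2, 3 and then wrap around. The helper names below are hypothetical and the model ignores every other condition the hardware may apply; it is only meant to help reason about data layout.

```c
#include <stdint.h>
#include <stdbool.h>

/* D-TCM bank selected by address bits [3:2] (simplified model). */
static inline unsigned dtcm_bank(uint32_t addr)
{
    return (unsigned)((addr >> 2) & 0x3U);
}

/* True when two accesses would target the same D-TCM bank. */
static inline bool dtcm_same_bank(uint32_t addr_a, uint32_t addr_b)
{
    return dtcm_bank(addr_a) == dtcm_bank(addr_b);
}
```

For example, two 32-bit accesses that are 16 bytes apart map to the same bank in this model, while accesses 4 bytes apart map to different banks.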

Assembly level optimization

For best optimization, interleave Helium instructions of different types in the code sequence so that the processor can overlap the execution of these instructions.
For best performance, Helium instructions in a low overhead loop should be 32-bit aligned. If not, there can be 1 cycle penalty for each iteration.

Note: sometimes C compilers insert a NOP at the beginning of the loop after DLS/WLS{TP}. This NOP is not a part of the loop, but a padding instruction to keep the instructions in the loop aligned. To determine the correct loop start address, please use the negative offset in the LE instruction, and do not rely on the position of the WLS/DLS{TP} instruction.

When creating hand-optimized code with inline assembly, you can use the directive “.p2align 2” to align instructions to 32 bits.

Memory architecture considerations

Please note that the cache management in the Armv8-M architecture has some differences when compared to the Armv7-M architecture.

Enabling cache using automatic invalidation

The Cortex-M55/M85 processor provides a hardware mechanism to invalidate the caches at reset. This feature simplifies the cache enabling sequence by allowing the caches to be enabled by just setting the DC (for D-cache) and IC (for I-cache) bits in the Configuration and Control Register (CCR). The CCR is in the System Control Block (SCB). This method also speeds up the cache initialization sequence when compared to using software invalidation routines because:
- It is hardware controlled.
- I-cache and D-cache can be invalidated in parallel.
This automatic invalidation mechanism can be disabled under hardware control (using an input signal on the processor) to maintain cache state across resets.
During the automatic cache invalidation, cache maintenance operation is handled as NOP, and a DSB instruction waits for all automatic cache invalidate sequences to complete.

Disabling caches

In Armv7-M, a cache can be disabled by clearing the CCR.IC and CCR.DC bits. In the Armv8-M architecture, clearing the CCR.IC and CCR.DC bits disables cache line allocation, but it does not fully disable the cache if the cache contains valid data: cache lookups can still occur.
This simplifies the cache disabling sequence because cached data is still accessible (not lost) right after a cache is disabled. Software can then clean the cache to remove cached data before fully disabling the caches.
To disable a cache completely, the following sequence can be used:
- Clear the CCR.IC / CCR.DC bit(s).
- Trigger a clean and invalidate operation for the D-cache, or an invalidate operation for the I-cache.
- After the cache maintenance operation has completed, clear the ICACTIVE/DCACTIVE bit in the Memory System Control Register (MSCR) to disable the cache.

Code example to disable I and D caches:

  // Disable instruction and data cache
  // On Cortex-M55 this disables line-fills, preventing any
  // new data from being written to the cache
  SCB->CCR &= ~SCB_CCR_IC_Msk;
  SCB->CCR &= ~SCB_CCR_DC_Msk;

  // Clean and invalidate caches to writeback any dirty 
  // data to memory
  SCB_CleanInvalidateDCache();
  SCB_InvalidateICache();
  __DSB();
  __ISB();

  // Clear MSCR.xACTIVE to disable cache lookups
  // (in the CMSIS-CORE Cortex-M55 header, MSCR is part of MEMSYSCTL)
  MEMSYSCTL->MSCR &= ~MEMSYSCTL_MSCR_ICACTIVE_Msk;
  MEMSYSCTL->MSCR &= ~MEMSYSCTL_MSCR_DCACTIVE_Msk;

  __DSB();
  __ISB();

  // Sequence complete: all instruction fetches and
  // data reads/writes now go to main memory

Please note:

In some cases, program code written for the Cortex-M7 processor needs to be modified when used on the Cortex-M55 processor. This is because cached data can still be accessed after clearing the CCR.IC and CCR.DC bits.
Because cache look up can still occur after clearing CCR.IC / CCR.DC, the Performance Monitor Unit (PMU) in Cortex-M55 r0 can still report cache lookup events. For example, after the previously mentioned code for disabling cache is executed, the Cortex-M55 processor r0 will still generate the data cache PMU events associated with the “accesses” to the data cache:

Event 0x0003 L1D_CACHE
Event 0x0036 LL_CACHE_RD
Event 0x0037 LL_CACHE_MISS_RD
Event 0x0039 L1D_CACHE_MISS_RD
Event 0x0040 L1D_CACHE_RD

These are all essentially the same event, indicating that a load/store operation has accessed the cache. Technically this is correct because the cache logic is used as part of the access (instead of the TCM). However, it is confusing because the D-cache is disabled. Therefore, it was decided that in Cortex-M55 r1 these events are masked when the caches are deactivated in MSCR.

Using Half Precision Floating Point support

The Armv8.1-M architecture introduced Half Precision Floating-Point arithmetic support. Half precision floating-point data is 16 bits wide, and its format is defined by the IEEE 754-2008 standard. To support half precision arithmetic operations, the _Float16 data type is defined in the C11 extension ISO/IEC TS 18661-3:2015.

In Armv8.1-M processors like the Arm Cortex-M55 and Cortex-M85 processors, half precision floating-point support is included when the FPU (Floating-Point Unit) is implemented. The Helium technology (M-Profile Vector Extension) introduced in Armv8.1-M also supports half precision vector operations.

The format of half precision floating-point is shown in the diagram below:

[Figure: Half precision floating point format]
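
For readers without the diagram, the IEEE 754 half precision layout is 1 sign bit, 5 exponent bits (bias 15) and 10 fraction bits. The sketch below decodes a raw 16-bit pattern into a float in software, purely to illustrate the format; it assumes that float on the build machine is IEEE 754 binary32, and the function name is hypothetical. On Cortex-M55/M85 the hardware and the _Float16 type handle this conversion for you.

```c
#include <stdint.h>
#include <string.h>

/* Decode an IEEE 754 half precision bit pattern into a float.
 * Assumes 'float' is IEEE 754 binary32. */
float half_bits_to_float(uint16_t h)
{
    uint32_t sign = ((uint32_t)h >> 15) & 0x1U;    /* 1 sign bit      */
    uint32_t exp  = ((uint32_t)h >> 10) & 0x1FU;   /* 5 exponent bits */
    uint32_t frac = (uint32_t)h & 0x3FFU;          /* 10 fraction bits */
    uint32_t bits;
    float result;

    if (exp == 0x1FU) {
        bits = 0x7F800000U | (frac << 13);          /* infinity or NaN */
    } else if (exp != 0U) {
        /* Normal number: rebias exponent from 15 to 127. */
        bits = ((exp + 112U) << 23) | (frac << 13);
    } else if (frac == 0U) {
        bits = 0U;                                  /* zero */
    } else {
        /* Subnormal: value is frac * 2^-24, so normalize it. */
        uint32_t e = 113U;
        while ((frac & 0x400U) == 0U) {
            frac <<= 1;
            e--;
        }
        bits = (e << 23) | ((frac & 0x3FFU) << 13);
    }
    bits |= sign << 31;
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

For instance, the pattern 0x3C00 (exponent field 15, fraction 0) decodes to 1.0, and 0x7BFF decodes to 65504, the largest finite half precision value.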

Please note that the IEEE 754 half precision floating-point format is different from bfloat16 (__bf16), which is a different 16-bit floating-point format typically used for machine learning applications. For reference, the bfloat16 format is shown below. Bfloat16 is not supported by the Armv8.1-M architecture.

[Figure: bfloat16 format]

Although half-precision floating-point does not offer the same level of accuracy and data range as single precision floating-point, there are two main advantages for using this format in embedded applications:

Reduction of memory footprint - When an application has to store a significant amount of floating-point data (e.g. coefficients for DSP applications), using the half precision floating-point format can reduce the memory size.
Increased performance when used with the Helium technology - Because the Cortex-M55 and Cortex-M85 processors can handle twice the number of operations per vector when using half precision compared with single precision, the performance of half precision processing can be much higher.
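
Both points come down to simple arithmetic, sketched below. Helium vectors are 128 bits wide, so a vector holds 4 single precision or 8 half precision elements, and a table of N elements needs half the bytes in half precision. The helper names are hypothetical.

```c
#include <stddef.h>

#define HELIUM_VECTOR_BITS 128U   /* width of a Helium (MVE) vector register */

/* Number of elements of the given width that fit in one Helium vector. */
static inline unsigned helium_lanes(unsigned element_bits)
{
    return HELIUM_VECTOR_BITS / element_bits;
}

/* Storage in bytes for 'count' elements of the given width. */
static inline size_t table_bytes(size_t count, unsigned element_bits)
{
    return count * (size_t)(element_bits / 8U);
}
```

For a hypothetical table of 1024 coefficients, this gives 4096 bytes in single precision and 2048 bytes in half precision.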

Half precision floating-point is supported by modern compilers including Arm Compiler 6, LLVM and GCC. Since _Float16 is part of a C11 extension, ideally a C/C++ project should specify the C11 standard when using _Float16. However, current versions of Arm Compiler 6, LLVM and GCC accept _Float16 regardless of the C standard being used, so omitting the C standard option is not a major issue.

In armclang (including Arm Compiler) and GCC, you can specify the C11 standard using the “-std=c11” option.
In Keil Microcontroller Development Kit, you can specify the C11 standard inside the project options as shown below.

[Figure: C11 option in Keil MDK]

An example of using _Float16 in C code is shown below. In addition to the _Float16 data type, the C standard extension also allows you to define half precision constants using the f16 suffix (e.g. 3.14f16). You can also use _Float16 for type casting.

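#include <stdio.h>
#include "ARMCM55.h"
static volatile _Float16 A1, B1;
int main(void)
{
  _Float16 C1;
  A1 = 0.5f16;            /* half precision constant (f16 suffix) */
  B1 = (_Float16) 0.5;    /* type cast from double */
  C1 = A1*B1;             /* half precision multiply */
  printf("%f\n", (double) C1);
  while(1);
}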

Please note that the _Float16 data type is different from the __fp16 data type, which is supported by the Arm C Language Extensions (ACLE).

_Float16 is an arithmetic data type.
__fp16 is for storage & conversion only.

For more information about the differences, please visit the following web pages: