
Software Development Tools


SOMNIUM are pleased to announce that we will be exhibiting at ARM TechCon again this year!


We love coming to TechCon! It's one of our busiest shows and one of the best opportunities for us to meet our US-based customers.


Just like last year, we will be launching the newest release of SOMNIUM DRT on the first day of the expo. This time it is v3.5, a major update to many of DRT's features that also adds support for macOS, making DRT the only professional toolset available on all three major operating systems! You can find out more about our presence and book an appointment to chat with us on the SOMNIUM website.


We'll post full details of DRT 3.5 next week, so keep an eye on this blog, our website, and our Twitter feed.


Visit us at stand 807 for a demo of DRT, to see DRT running on macOS and to speak with our software experts.


We hope to see you there!

ARM's latest embedded newsletter is now available. In this month’s edition we talk about the new ARM Cortex-R52, ARM's most advanced safety processor, in addition to efficient pre-silicon software development and the demise of the headphone jack. There are also a series of free training webinars and events. View the newsletter »


To receive this straight to your inbox, sign up for ARM’s developer newsletters »

The ARM® Cortex®-R52 processor is the most advanced processor for functional safety and the first implementation of the ARMv8-R architecture. Along with the announcement of the Cortex-R52, ARM offers a number of development tools to help partners speed up their path to market. This is especially helpful for a new architecture which highlights software separation for safety and security. This article summarizes the available tools and explains what’s new for the Cortex-R52.


About the Cortex-R52         


The Cortex-R52 is the first ARMv8-R processor. ARMv8-R brings real-time virtualization to Cortex-R in the form of a new privilege level, EL2, which provides exclusive access to a second-stage Memory Protection Unit (MPU). This enables bare metal hypervisors to maintain software separation for multiple operating systems and tasks.


Many partners will be interested in the differences between the Cortex-R52 and previous designs, such as the Cortex-R5 and Cortex-R7. The Cortex-R52 evolves the Cortex-R5 microarchitecture by providing fast, deterministic interrupt response and low latency at a performance level which is better than Cortex-R7 on some real-time workloads. Cortex-R52 also offers a dedicated, read-only, low-latency flash interface which conforms to the AXI4 specification.


ARM Fast Models and Cycle Models enable virtual prototyping for software partners to develop solutions for the new Cortex-R52 before silicon is available.


DS-5 Development Studio


DS-5 Development Studio is the ARM tool suite for embedded C/C++ software development on any ARM-based SoC. DS-5 features the ARM Compiler, DS-5 Debugger, and Streamline system profiler. Also included is a comprehensive and intuitive IDE, based on the popular and widely-used Eclipse platform.


The DS-5 Debugger is developed in close co-operation with the teams developing ARM processor and subsystem IP. DS-5 is used inside ARM as part of the development and verification cycle and is extensively tested against models, early FPGA implementations and (as soon as it is available) silicon. The DS-5 Debugger provides early-access debug and trace support to ARM lead partners working with leading-edge IP. This enables mature, stable, validated debug and trace support for Cortex-R52 to be included in the upcoming DS-5 release, version 5.26.


ARM Compiler 6


ARM Compiler 6 is the latest compilation toolchain for the ARM architecture, and is included in the DS-5 Development Studio. ARM Compiler 6 brings together the modern LLVM compiler infrastructure and the highly optimized ARM C libraries to produce performance and power optimized embedded software for the ARM architecture.


ARM Compiler 6 is developed closely with ARM IP and provides early-access support to lead partners. As with core support in all compilers, code generation, performance, and code size improve over time, with improvements driven by experience and feedback from real-world use-cases. The upcoming release of ARM Compiler 6, version 6.6, will feature full support of link time optimization and enhanced instruction scheduling support, giving an improvement of nearly 10 percent for Cortex-R52 in key benchmark scores. Combined with significant improvements in code size, ARM Compiler 6 is a comprehensive choice for the Cortex-R52.


Cortex-R52 provides a compelling opportunity for users to migrate from ARM Compiler 5 to ARM Compiler 6. The ARM Compiler migration and compatibility guide aids the evaluation process by comparing the command line options, source code differences, assembly syntax, and other topics of interest.


If existing code needs to be updated from ARM Compiler 5 to ARM Compiler 6, the first step is to get the code to successfully compile. This generally takes a combination of Makefile changes to invoke the new compiler as well as source code adaptations.


First, compiler invocation needs to be switched from armcc to armclang. Other tools like armasm and armlink are included in ARM Compiler 6 and can continue to be used.

For example, when changing from Cortex-R7 to Cortex-R52, a few compiler command-line option changes are required:


ARM Compiler 5 (armcc)            ARM Compiler 6 (armclang)

--cpu=<name>                      --target=armv8r-arm-none-eabi -mcpu=cortex-r52

-Ospace                           -Os / -Oz

-Onum (default is 2)              -Onum (default is 0)


The migration guide provides further details on specific switches, but these are the basics to get going. Some compiler switches may need to be removed because they are specific to armcc; for example, --apcs=/interwork and --no_inline are not needed with armclang and can be removed.
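As a sketch, the Makefile side of such a migration might look like the following. The project name, source list, and -O2 optimization level here are hypothetical, chosen only to illustrate the option changes described above:

```make
# Before: ARM Compiler 5 (armcc) targeting Cortex-R7
# CC     = armcc
# CFLAGS = --cpu=cortex-r7 -O2 --apcs=/interwork

# After: ARM Compiler 6 (armclang) targeting Cortex-R52
CC     = armclang
CFLAGS = --target=armv8r-arm-none-eabi -mcpu=cortex-r52 -O2

# armlink and armasm ship with ARM Compiler 6 and can be kept as-is
LD     = armlink

app.axf: main.o
	$(LD) -o $@ $^

main.o: main.c
	$(CC) $(CFLAGS) -c -o $@ $<
```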


Fast Models


Fast Models are accurate, flexible programmer's view models of ARM IP, allowing you to develop software such as drivers, firmware, operating systems and applications prior to silicon availability. They allow full control over the simulation, including profiling, debug and trace. Fast Models can be exported to SystemC, allowing integration into the wider SoC design process.


Fast Models typical use cases:

  • Functional software debugging
  • Software profiling and optimization
  • Software validation and continuous integration


The Fast Model for the Cortex-R52 is being released in late September as part of Fast Models 10.1.


Cycle Models


Cycle models are compiled directly from ARM RTL and retain complete functional accuracy. This enables users to confidently make architecture decisions, optimize performance or develop bare metal software. Cycle Models run in SoC Designer or any SystemC simulator, including the Accellera reference simulator and simulators from EDA partners.


Cycle Models typical use cases:

  • IP selection and configuration
  • Analysis of HW/SW interaction
  • Benchmarking and system optimization


The Cortex-R52 SystemC Cycle Model supports a number of features which help with performance analysis:

  • SystemC signal interface
  • SystemC ARM TLM interface
  • Instruction trace (tarmac)
  • PMU event trace
  • Waveform generation


The Cycle Model for the Cortex-R52 is being released in late September and will be available on ARM IP Exchange.


Frequently asked questions about getting started with Cortex-R52


Does Cortex-R52 require DS-5 Ultimate Edition?

Yes, DS-5 Ultimate Edition is required for debugging with Cortex-R52.


What are the switches for ARM Compiler 6 to select Cortex-R52?

For ARM Compiler 6.6 use the armclang switches: --target=armv8r-arm-none-eabi -mcpu=cortex-r52

For ARM Compiler 6.5 use the armclang switches: --target=armv8r-arm-none-eabi -mcpu=kite


Can DS-5 be used for software debugging with a simulation model?

Yes, before silicon is available DS-5 can be used to develop and debug software using the Cortex-R52 Fast Model. The Fast Model can be used for functional software development, checking compliance with the ARMv8-R architecture, and software optimization. DS-5 with the Fast Model makes an ideal development environment for hypervisors, schedulers, real-time operating systems, and communication stacks.


Is there a model available for verification which works in EDA simulators?

Yes, the Cortex-R52 Cycle Model can be used in any EDA simulator. It has a SystemC wrapper which can be instantiated in a Verilog or VHDL design. It provides 100% cycle accuracy.


Is there a way to run benchmarks to compare Cortex-R52 to another core such as Cortex-R5 or Cortex-R7?

Yes, the Cortex-R52 Cycle Model instruments the Performance Monitor Unit (PMU) and provides a tarmac trace to run benchmarks and evaluate performance.


The Cortex-R52 CPU model doesn’t seem to start running after reset, is the model broken?

No, the most common cause is that the CPUHALTx input is asserted, stopping the core from running.


Do Cortex-R52 models allow simulation of booting from TCM?

Yes, both the Fast Model and the Cycle Model can boot from TCM. The CFGTCMBOOTx input enables the ATCM from reset on the Cycle Model, and the Fast Model provides the tcm.a.enable parameter to do the same thing.




A full suite of development tools is available for the Cortex-R52, enabling developers to do more, earlier, with the most advanced ARM processor for functional safety, and to learn about the ARMv8-R architecture. Please refer to the ARM Development Tools pages for more information.

DS-MDK, the software development solution for heterogeneous computing, now supports additional devices from NXP and a new development board.


NXP i.MX 6SoloX processors offer an ARM Cortex-A9 core together with an ARM Cortex-M4. The corresponding SABRE development board is now fully supported by DS-MDK, using the i.MX 6 software pack. Learn how to use the SABRE board together with DS-MDK on the reference page.


Furthermore, DS-MDK now supports PHYTEC phyBOARD-i.MX7-Zeta. This single-board computer (SBC) is a two-PCB counterpart to the phyCORE-i.MX7 SOM. The SOM itself serves as the CPU core of the SBC which interfaces to a carrier board via high density connectors. This carrier board breaks out major interface signals to plug-and-play or pin-level connectors and offers a JTAG connector for debugging purposes. Learn how to connect DS-MDK to the phyBOARD-i.MX7-Zeta on the reference page.

Public Webinar

If you want to learn more about DS-MDK, register for our public webinar on September 28th.



Check out our new YouTube playlist dedicated to our debugging features. Learn more about our Live Expression viewing, MTB Trace and Fault Diagnosis tools.

Additional core support

DS-5 v5.25 Professional and Ultimate Editions support cache visibility for Cortex-A5 and Cortex-A7 cores. Ultimate Edition also supports cache and MMU visibility for Cortex-A73, and debug support for ARMv8.1-A and ARMv8.2-A cores.


Additional Fixed Virtual Platforms

DS-5 v5.25 Professional Edition includes a license for single-core Cortex-M3 and Cortex-R4 Fixed Virtual Platforms (FVP). Ultimate Edition now includes a license for a wide range of single-core, multi-core, and big.LITTLE FVPs. The virtual platforms are delivered as part of the DS-5 installation package.


Fixed Virtual Platforms by core family (availability varies across the Community, Professional, and Ultimate editions):

  • Cortex-M: FVP_MPS2_Cortex-M0, FVP_MPS2_Cortex-M0plus, FVP_MPS2_Cortex-M4, FVP_MPS2_Cortex-M7
  • Cortex-R: FVP_VE_Cortex-R5x1, FVP_VE_Cortex-R7x1, FVP_VE_Cortex-R8x1
  • Cortex-A (ARMv7-A): FVP_VE_Cortex-A5x1, FVP_VE_Cortex-A7x1, FVP_VE_Cortex-A15x1, FVP_VE_Cortex-A15x4-A7x4, FVP_VE_Cortex-A17x1
  • Cortex-A (ARMv8-A): FVP_Base_Cortex-A53x1, FVP_Base_Cortex-A57x1, FVP_Base_Cortex-A72x1, FVP_Base_Cortex-A73x1, FVP_Base_Cortex-A32x1, FVP_Base_Cortex-A35x1, FVP_Base_Cortex-A57x2-A53x4, FVP_Base_Cortex-A72x2-A53x4, FVP_Base_Cortex-A73x2-A53x4, FVP_Base_AEMv8A
  • Foundation Platform (v8): not license managed



Revised Host Support

DS-5 v5.25 adds support for Windows 10 64-bit and Red Hat Enterprise Linux 7 Workstation 64-bit. Support for Linux 32-bit hosts has been dropped in this release.



The following host platforms are supported across DS-5 Professional and Ultimate, DS-5 Community, ARM Compiler 5.06u3, ARM Compiler 6.5, and Fast Models 10.0 (exact support varies by component):

  • Windows 7 SP1 Professional Edition 32-bit*
  • Windows 7 SP1 Professional Edition 64-bit
  • Windows 7 SP1 Enterprise Edition 32-bit*
  • Windows 7 SP1 Enterprise Edition 64-bit
  • Windows 8.1 64-bit
  • Windows Server 2012 64-bit
  • Windows 10 64-bit
  • Red Hat Enterprise Linux 6 Workstation 32-bit
  • Red Hat Enterprise Linux 6 Workstation 64-bit**
  • Red Hat Enterprise Linux 7 Workstation 64-bit
  • Ubuntu Desktop Edition 12.04 LTS 32-bit
  • Ubuntu Desktop Edition 12.04 LTS 64-bit**
  • Ubuntu Desktop Edition 14.04 LTS 64-bit


* Not delivered in DS-5, but exists as a standalone product

** Requires additional GCC runtime libraries


Mali Graphics Debugger

DS-5 v5.25 includes the Mali Graphics Debugger. This enables DS-5 users to trace Vulkan, OpenGL ES, EGL, and OpenCL API calls.


Enhanced debugger functionality

DS-5 debugger functionality has been enhanced in a number of areas, each of which is described in a separate blog:


The ARM Embedded Logic Analyzer (ELA) brings particular challenges to a debugger. The flexibility of the ELA, its broad range of implementation choices, and its many potential uses all place demands on a debugger. The debugger must present a high level of functionality with high potential for flexibility and customisation. However, because most of the customisation must be carried out by the user, the debugger must also present a high level of usability.


A comprehensive scripting interface is the obvious way to address the challenges presented by the ARM ELA, and enables the debugger user to customise and extend the functionality of the debugger. However scripts bring their own challenges, which escalate rapidly as script library size and script complexity grow.


ARM DS-5 debugger now includes a comprehensive script management system aimed at helping users leverage the power of scripts and handle the challenges that scripts bring. Here we look at some of the challenges brought by the ARM ELA, and discuss some of the generic challenges brought by script complexity. We’ll then investigate how the DS-5 script management system enables users to address these challenges with an ease of use not seen in any other ARM debugger.


The ARM Embedded Logic Analyser

The ARM ELA enables developers to drive the highest levels of performance and efficiency from their ARM-based design. The key functionality of the ARM ELA is to monitor (and give the developer visibility of) signals deep within an ARM-based SoC. Signal information can be processed in one of two ways:

  1. Information about signals can be captured to an on-chip buffer for later analysis
  2. The ELA contains a comprehensive state machine. Transition between states is controlled by signal changes and comparisons, and the final state produces events that can be propagated over the CoreSight cross-trigger network

The ARM ELA is able to monitor, and provide visibility of, complex interactions and event chains taking place deep within the SoC. However, SoC designers have a wide range of implementation options for the ARM ELA. The ELA could, for example, monitor signals inside an ARM core, or it could monitor signals in the bus interconnects: the ARM ELA is particularly useful for analysing throughput and identifying bottlenecks. Such is the flexibility of the ELA, the range of implementation options, and the range of challenges that it might be used to address, that it's impossible to hardcode ELA support into a debugger. The device is controlled by a large number of inter-dependent registers, which need to be used in harmony with each other. The only practical way for a debugger to provide support for the ARM ELA is through a comprehensive and highly functional scripting interface. This enables SoC designers and software developers to leverage the power of the ELA for their particular needs.
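The trigger behaviour described above can be illustrated with a toy state machine: each state waits for a signal comparison to match, and completing the final state fires an event. This is a conceptual sketch only, not the ELA's actual register-level programming model; all names are invented for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Conceptual sketch of an ELA-style trigger state machine. Each state holds
// a comparison on the monitored signal; a match advances the machine, and
// completing the final state fires an event that a real ELA would propagate
// over the CoreSight cross-trigger network or record in its capture buffer.
struct TriggerState {
    std::function<bool(uint32_t)> match;
};

class TriggerFSM {
    std::vector<TriggerState> states;
    std::size_t current = 0;
    bool event_fired = false;

public:
    explicit TriggerFSM(std::vector<TriggerState> s) : states(std::move(s)) {}

    // Feed one observed signal value per monitored cycle.
    void observe(uint32_t signal) {
        if (event_fired || states.empty()) return;
        if (states[current].match(signal) && ++current == states.size())
            event_fired = true;  // final state reached: raise the event
    }

    bool fired() const { return event_fired; }
};
```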

Scripting support in an ARM debugger

Good scripting support is a critical part of any modern ARM debugger, and scripting is sometimes the only way to reach the level of functionality and flexibility needed by a complex ARM-based design. A highly functional scripting API is the only practical way for a debugger to address a number of challenges in modern ARM-based designs:

  • There is a trend of growing complexity and individuality in ARM-based designs. In particular there’s growth in the size and complexity of the CoreSight cross-trigger and trace systems, with new devices and a variety of trace storage options distributed across the design
  • There may be a need for custom debugger functionality which is tailored to a specific debug target, or even a specific debugging challenge
  • As designers strive to keep power consumption to a minimum and gain maximum advantage from the flexibility of the ARM architecture, power management strategies are becoming more aggressive and can present significant challenges to a debugger
  • A key strength of ARM IP is its high suitability for a mixed implementation that also contains non-ARM IP. It can be very useful to get a measure of control and visibility of ARM and non-ARM IP in the same debugger


A comprehensive scripting API enables the user to handle both complexity and individuality in an ARM-based design. A scripting API enables the creation of custom debugger functionality to address the needs of an individual design, or the needs of a particular debug session. Because the needs of a debugger can be tightly bound to an individual SoC design or to the characteristics and causes of an individual software defect, enabling the user to create custom debugger functionality can be highly valuable.


However as script number and complexity rise, usability challenges start to appear. Particular problems are found in script configuration: with non-trivial scripts it's common for their functionality to depend upon command-line arguments. This approach scales poorly: the user needs to remember which command-line options are valid for which scripts, each option has a range of valid values, and options may be inter-related. With a significant library of complex, flexible scripts, the demands on the user can quickly grow to the point where the value of the scripts starts to degrade. These problems are compounded when scripts are shared between team members (and other teams), meaning users have to derive value from scripts with which they are unfamiliar.


DS-5 Use-Case Scripts

The ARM DS-5 debugger recently added a new script management system, aimed at addressing some of the problems found with large libraries of complex scripts. A key innovation is the ability to embed custom visual controls in the script itself: this is an extension of the existing functionality that has been successfully used by DS-5 DTSL (Debug and Trace Services Layer) scripts for a number of years.


Because controls can be represented graphically on custom control tabs, it’s easy to see at a glance which options are available for a particular script. Command line options which can take a range of values can be implemented as drop-down selection boxes, allowing value (and spelling) discovery at a glance. Options which take numerical or string values can be represented as text edit boxes, with bounds checking also embedded in the script. Controls can appear as hierarchies, with child controls becoming enabled only when parent controls are activated.

This screen capture shows a control tab from one of the use-case scripts shipped with DS-5 v5.25 as part of the support for the ARM ELA. Command-line options for the script are represented as visual controls, removing the need for the user to maintain deep familiarity with the script or to remember all details of every possible option. The controls are arranged on a number of control tabs, grouping areas of related functionality (in this case, giving a fine degree of control over movements between stages of the ELA internal state machine). Users can become familiar with the possibilities and functionality of the script very easily: the visualisation of command-line arguments as custom controls significantly reduces the learning curve faced by script users.


On the left side of the screen capture can be seen a number of configuration “profiles”. Sets of control values can be saved as named configuration profiles to be used later. Directories of DS-5 use-case scripts, and sets of named configuration profiles containing pre-built collections of control values to address various needs, can be shared between DS-5 users.



Modern ARM-based designs can present a number of challenges to a debugger user, and devices such as the ARM ELA present particular challenges because of their high levels of flexibility, functionality, and implementation options. The only practical way to address these challenges is by using a comprehensive debugger scripting API, but users are likely to encounter scalability problems as the complexity and number of scripts rises.


The ARM DS-5 “Use-Case” script management system aims to resolve these problems and enable users to leverage the full power of their scripts. By visualising script command line options as custom controls complete with value, relationship, and bounds checking, DS-5 significantly reduces the learning curve and information required when using scripts. Named configuration profiles, and the ability to share script libraries and profiles between users and teams, increase this ease of use and flexibility.


For details of other changes in DS-5 v5.25, take a look at Key Changes in DS-5 Debugger v5.25

In order to keep costs, power consumption, and size to a minimum, many embedded products based on ARM Cortex-R cores have limited on-chip memory. In particular, the size of the Tightly Coupled Memory (TCM) can be restricted. Because TCM has very low latency, significant performance gains can be realised when running code in TCM. Therefore limiting TCM size can impose performance challenges: a trade-off to be considered by the SoC design team.


One way to reduce the impact of restricted TCM is to use an overlay manager. Code is organised into a number of overlays which share the same memory area. When executing code in the same overlay, no changes are necessary and the overlay stays resident in low-latency TCM. However when a call is made to a non-resident overlay, the overlay manager needs to load the correct overlay into the TCM. This load needs to be performed as efficiently as possible, and the debugger needs to be overlay-aware and present the correct information to the user. For example, the debugger needs to step over overlay veneers: effectively making the overlay manager invisible to the debugger user. In DS-5 v5.25 we’ve added overlay support to both the ARM Compiler and the DS-5 debugger. When overlays are enabled, the compiler leaves overlay information in the symbol file. The debugger reads this information when the symbol file is loaded, and can enable overlay support automatically. As well as handling the debug implications of overlays and automatically stepping over the overlay veneers, DS-5 debugger presents overlay information through additional debugger commands or through the new Overlays view, for example:



Here we can see the address and size of a number of overlays, and we can see instantly which overlays are currently loaded. Information has been expanded for one of the overlays, so that we can see at a glance the functions contained in that overlay. The matching overlay support in the ARM Compiler and DS-5 debugger makes it easy to manage overlays and drive significant performance enhancements from efficient use of fast on-chip memory.
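The overlay-manager mechanics described above can be sketched in a few lines. This is an illustrative model only, assuming a single shared TCM region and a simple "load on miss" policy; it is not the ARM Compiler's actual overlay scheme, and all names are invented:

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy overlay manager: code for each overlay lives in larger, slower memory;
// before calling into an overlay, the manager copies it into the single
// low-latency TCM region that all overlays share. A call into the resident
// overlay costs nothing; a call into a non-resident overlay triggers a load.
class OverlayManager {
    std::vector<char> tcm;                           // simulated TCM region
    std::map<std::string, std::vector<char>> store;  // overlays in slow memory
    std::string resident;                            // currently loaded overlay

public:
    std::size_t loads = 0;                           // count of overlay reloads

    explicit OverlayManager(std::size_t tcm_size) : tcm(tcm_size) {}

    void add(const std::string& name, std::vector<char> code) {
        store[name] = std::move(code);
    }

    // Ensure `name` is resident before execution; load only on a miss.
    void call(const std::string& name) {
        if (resident != name) {
            const std::vector<char>& code = store.at(name);
            std::memcpy(tcm.data(), code.data(), code.size());
            resident = name;
            ++loads;
        }
        // ...on real hardware, execution would now branch into the TCM...
    }
};
```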


For details of other changes in DS-5 v5.25, take a look at Key Changes in DS-5 Debugger v5.25

In DS-5 v5.24, we added the Stack View to the debugger. This view displays stack information that used to be displayed in the Debug Control View, giving two advantages. Firstly, the Debug Control View becomes less cluttered and more focussed: giving better clarity of information and an increase in debugger usability. Secondly, stack information can take a non-trivial amount of time to retrieve from the target: because the Stack View can be closed when not needed (and it’s possible to limit the stack depth displayed), stepping speed in the DS-5 debugger can be increased. This performance increase is particularly noticeable for debug targets with slow JTAG clocks or an extensive stack back-trace.


In DS-5 v5.25 we have enhanced the Stack View to display function parameters and local variables, for example:



Variables marked with a ‘P’ are function parameters, and can also be seen in the extended function prototype information. All the other variables shown are the function’s local variables. Arrays and structures can be expanded to display member variables, and by right-clicking, any variable can be displayed in the Variables or Memory views.


Retrieving variable information from the target can cause a degradation in debugger stepping performance, particularly for targets with very slow JTAG clocks or a large number of function parameters and local variables. To increase debugger stepping performance when parameter and variable information is not needed, display of parameters and variables can be disabled via the Stack View menu.


For details of other changes in DS-5 v5.25, take a look at Key Changes in DS-5 Debugger v5.25

The CoreSight cross-trigger network in a SoC is created from two components: Cross Trigger Matrix (CTM) devices, which form the backbone of the network and transport events around the SoC; and Cross Trigger Interface (CTI) devices, which capture events from, or deliver events to, other components distributed around the SoC. Although the CoreSight cross-trigger network has a variety of potential uses, by far the most common use encountered by DS-5 users is to synchronise cores.


This cross-trigger use-case enables related cores to enter and exit debug state together. For example, when one core hits a breakpoint and enters debug state, this change in state is picked up by the CTI coupled to that core. The CTI passes the halt event into the cross-trigger matrix, where the CTMs route the event to the CTIs coupled to other cores. These CTIs issue halt requests to their cores, so that all related cores halt with minimal latency. The debugger doesn’t need to get involved with halting the cores: it just sets up the cross-trigger network so that events are routed correctly. This enables the very low latency which is critical on many SoCs to avoid undesirable effects such as kernel panics.
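The halt-propagation flow can be modelled in a few lines. This toy version collapses the CTIs and CTMs into one object and is purely illustrative; it shows only the routing idea, not the real CoreSight programming interface:

```cpp
#include <vector>

// Toy model of cross-trigger halt synchronisation: when one core enters
// debug state, its CTI passes a halt event to the matrix, which routes it
// to every other CTI; each of those CTIs then asserts a halt request to
// its own core. The debugger only configures the routing beforehand.
struct Core {
    bool halted = false;
};

class CrossTriggerNetwork {
    std::vector<Core*> cores;  // one CTI per attached core, modelled implicitly

public:
    void attach(Core* c) { cores.push_back(c); }

    // Called on behalf of the CTI of the core that just entered debug state.
    void halt_event(Core* source) {
        source->halted = true;
        for (Core* c : cores)
            if (c != source) c->halted = true;  // CTMs route halt to other CTIs
    }
};
```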


The ARM DS-5 debugger already supports cross-trigger network configuration and management, and the DS-5 Platform Configuration Editor (PCE) creates the necessary scripting when bringing up a new target platform in the debugger. However PCE currently only supports the most common core topologies: SMP and big.LITTLE. PCE cannot currently create the necessary DTSL scripts for other topologies: for example cross-triggering Cortex-A and Cortex-R/M cores in the same SoC, or halting cores when the on-chip trace buffer fills. These use-cases need custom scripting. DS-5 users can create custom DTSL scripts to drive the DS-5 debugger functionality they need, but there’s a learning curve to be considered. Complex cross-trigger requirements could mean more complexity in DTSL scripts than the average user might be prepared to take on. So in DS-5 v5.25 we’ve revised and enhanced the DTSL functionality around cross-triggering, and added a new DTSL class to make scripting easier and less complex.


These DTSL enhancements make it easier and quicker to create custom cross-triggering support in DS-5, backed by groups of custom DTSL controls. This example shows a typical use-case, synchronising the watchdog timer with the Cortex-A57 cores in the ARM Juno reference platform:

In future DS-5 releases we’ll extend the DS-5 PCE to take advantage of this additional DTSL functionality. We’ll also review all the platform configurations that ship with DS-5, to see where DS-5 can leverage these changes to deliver additional value.

DS-MDK is out now! It combines the Eclipse-based DS-5 IDE and Debugger with CMSIS-Pack technology, using Software Packs to provide support for devices based on 32-bit ARM Cortex-A processors, or heterogeneous systems combining 32-bit ARM Cortex-A and ARM Cortex-M processors. It is part of the MDK-Professional edition and initially provides support for the NXP i.MX6 and i.MX7 series devices.


Heterogeneous systems combine computing power with fast, efficient I/O performance

A heterogeneous computing system based on Cortex-A and Cortex-M processors combines best-in-class technology for application software and deterministic real-time I/O. Cortex-A application processors run a feature-rich operating system, such as Linux, and provide enough computing power for demanding applications. The energy-efficient Cortex-M processor typically executes a real-time operating system that is easy to use and tailored to meet real-time requirements for deterministic I/O operations. Such Cortex-M systems enable fast start-up times and can be permanently 'on' in battery-powered systems. The two processor systems typically exchange information via fast, interrupt-driven inter-process communication and shared memory.
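The shared-memory side of such inter-processor communication can be illustrated with a minimal single-producer/single-consumer mailbox. This is a sketch only: on real i.MX parts a buffer like this is paired with the on-chip messaging and interrupt hardware rather than polling, and frameworks such as RPMsg are typically used.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Minimal lock-free single-producer/single-consumer ring buffer, the kind of
// structure one processor (say, the Cortex-M side) could fill while the other
// (the Cortex-A side) drains it over shared memory. Capacity is N-1 slots:
// one slot is sacrificed to distinguish "full" from "empty".
template <std::size_t N>
class SpscRing {
    std::array<uint32_t, N> buf{};
    std::atomic<std::size_t> head{0}, tail{0};

public:
    // Producer side: returns false if the ring is full.
    bool push(uint32_t v) {
        std::size_t h = head.load(std::memory_order_relaxed);
        std::size_t next = (h + 1) % N;
        if (next == tail.load(std::memory_order_acquire)) return false;  // full
        buf[h] = v;
        head.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side: returns false if the ring is empty.
    bool pop(uint32_t& v) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return false;  // empty
        v = buf[t];
        tail.store((t + 1) % N, std::memory_order_release);
        return true;
    }
};
```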




DS-MDK offers a complete software development solution for such systems:

  • It allows managing Cortex-A Linux and Cortex-M RTOS projects in the same development environment.
  • It fully supports the Cortex Microcontroller Software Interface Standard (CMSIS) development flow for efficient Cortex-M programming. Software Packs may be added to DS-MDK at any time, making new device support and middleware updates independent of the toolchain. They contain device support, CMSIS libraries, middleware, board support, code templates, and example projects. The IDE manages the provided software components, which are available to the application as building blocks.
  • The DS-5 Debugger offers full visibility for multicore software development.


Learn more





CMSIS++, or rather POSIX++, is a POSIX-like, portable, vendor-independent hardware abstraction layer intended for C++/C embedded applications, designed with special consideration for the industry-standard ARM Cortex-M processor series. Originally intended as a proposal for the next generation of CMSIS, CMSIS++ can probably be more accurately defined as "C++ CMSIS", and POSIX++ as "C++ POSIX".


CMSIS++ RTOS: APIs vs reference implementations


The CMSIS++ cornerstone is the RTOS, and in this respect the CMSIS++ RTOS can be analysed from two perspectives: the CMSIS++ RTOS APIs, with their modern design, and the CMSIS++ RTOS reference implementation, with its clean and efficient code.

In the first phase of the project, the CMSIS++ RTOS APIs were designed, with POSIX threads in mind, but from a C++ point of view.

The native CMSIS++ RTOS interface is the C++ API, with a C API implemented as a wrapper, and an ISO C++ Threads API implemented also on top of the native C++ API.
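This layering can be sketched as follows. The names here (rtos::Thread, os_thread_*) are invented for illustration and are not the real CMSIS++ identifiers; the point is only the structure: a native C++ class, with a C API as a thin wrapper over it.

```cpp
#include <string>

// Sketch of a native C++ RTOS API with a C wrapper layered on top,
// mirroring the CMSIS++ approach described above. Illustrative names only.
namespace rtos {

class Thread {
    std::string name_;

public:
    explicit Thread(const char* name) : name_(name) {}
    const char* name() const { return name_.c_str(); }
    // ...a real API would also expose suspend(), resume(), join(), etc...
};

}  // namespace rtos

// C-style wrapper API implemented on top of the native C++ class.
extern "C" {

typedef void* os_thread_t;

os_thread_t os_thread_create(const char* name) {
    return new rtos::Thread(name);
}

const char* os_thread_name(os_thread_t t) {
    return static_cast<rtos::Thread*>(t)->name();
}

void os_thread_destroy(os_thread_t t) {
    delete static_cast<rtos::Thread*>(t);
}

}  // extern "C"
```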


The CMSIS++ RTOS C++ API as a wrapper on top of an existing RTOS


Initially, the C++ API was validated by implementing it as a wrapper on top of the popular open source project FreeRTOS. Full functionality was achieved, and the entire system passed the ARM CMSIS RTOS validation suite.


The CMSIS++ RTOS reference synchronisation objects (semaphores, queues, etc)


With the native C++ API validated, while still using the safety net provided by an existing scheduler, the next step toward a grand design was to implement, in a portable way, the synchronisation objects defined by the CMSIS++ RTOS.

The result was a highly portable implementation that requires only a very simple interaction with the scheduler: basically a thread suspend() and resume().

Using this model, all RTOS objects were implemented (semaphores, mutexes, condition variables, message queues, memory pools, event flags, clocks and timers); full functionality was achieved, and again the entire system passed the ARM CMSIS RTOS validation suite.

Note that in this configuration, when running on top of an existing RTOS, it is possible to select which implementation to use at the individual object level; in other words, some objects may be implemented by the host RTOS while others use the reference portable implementation. This is generally useful when some of the objects defined by CMSIS++ are not available in the host RTOS; for example, the current version of FreeRTOS has no memory pools or condition variables, and these objects were supplied by the reference implementation.
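The portable synchronisation-object model described above can be sketched with a toy example (plain data structures, no real context switching; all names are illustrative): the object keeps its own list of waiting threads and interacts with them only through suspend() and resume().

```cpp
#include <cassert>
#include <deque>

// Toy thread: only a suspended flag, toggled by the "scheduler" primitives.
struct thread
{
  bool suspended = false;
  void suspend (void) { suspended = true; }
  void resume (void) { suspended = false; }
};

// Portable semaphore sketch: keeps its own waiting list and needs
// nothing from the scheduler beyond suspend()/resume().
class semaphore
{
public:
  explicit semaphore (int initial = 0) : count_(initial) {}

  // Called from the waiting thread's context.
  void wait (thread& t)
  {
    if (count_ > 0) { --count_; return; }
    waiting_.push_back (&t);
    t.suspend (); // a real scheduler would switch away here
  }

  void post (void)
  {
    if (!waiting_.empty ())
      {
        thread* t = waiting_.front ();
        waiting_.pop_front ();
        t->resume (); // hand the token directly to the waiter
        return;
      }
    ++count_;
  }

private:
  int count_;
  std::deque<thread*> waiting_;
};
```

The point of the sketch is the narrow interface: any host scheduler that can suspend and resume a thread can carry this object unchanged.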


The CMSIS++ RTOS reference scheduler


The last piece to complete the puzzle was the scheduler. The CMSIS++ RTOS specifications do not mandate a specific scheduling policy, and, when running on top of an existing RTOS, any scheduling policy can be used.

However, the CMSIS++ RTOS reference scheduler takes the beaten path and implements a priority based, round robin, cooperative and optionally preemptive scheduler.

In other words, threads are assigned priorities, higher priority threads are scheduled first, equal priority threads are scheduled in a round robin way, and scheduling points are entered either explicitly at any wait() or yield(), or are optionally triggered by periodic interrupts, like the system clock ticks, or by user interrupts.
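As a toy illustration (not the actual CMSIS++ scheduler code), such a ready list can be kept sorted by priority, with each new arrival inserted after existing threads of equal priority; repeatedly unlinking the head and re-inserting the thread after its time slice then yields round-robin behaviour among equals.

```cpp
#include <cassert>
#include <list>
#include <string>

// Minimal thread control block for the sketch.
struct tcb
{
  std::string name;
  int priority; // higher number = higher priority
};

class ready_list
{
public:
  // Insert after all threads of equal or higher priority, so a
  // re-inserted thread goes to the back of its priority group.
  void link (tcb* t)
  {
    auto it = list_.begin ();
    while (it != list_.end () && (*it)->priority >= t->priority)
      ++it;
    list_.insert (it, t);
  }

  // Remove and return the highest priority (front) thread.
  tcb* unlink_head (void)
  {
    tcb* t = list_.front ();
    list_.pop_front ();
    return t;
  }

private:
  std::list<tcb*> list_;
};
```

At each scheduling point the real scheduler would unlink the head, run it, and re-link it when it becomes ready again, which is exactly what the test below imitates.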


The scheduler portable code


The scheduler was designed to be as portable as possible, and to run on any reasonable architecture, with any word size.

As such, the scheduler's main responsibility is to manage the list of threads ready for execution and to switch their execution contexts in an orderly manner.

Although not mandatory for its functionality, the scheduler also keeps track of all registered threads, and provides iterators to walk these lists.

For better modularity, the scheduler itself does not keep track of threads waiting for various events; this is delegated to the various synchronisation objects, which are expected to implement their own policy of suspending and resuming execution of threads waiting for common resources.

However, the reference synchronisation objects use similar lists to keep track of the waiting threads, and, to simplify the implementation, the scheduler provides base classes for these lists.


The scheduler port-specific code


Regardless of how carefully a portable scheduler is designed and implemented, there will always be a last mile where the platform differences become important.

To accommodate these differences, the scheduler needs to be ported to each specific platform. The port includes the platform-specific definitions: mainly the way of creating and switching thread contexts, but also handling interrupts, accessing timers and clocks, etc.

There are currently two such CMSIS++ RTOS scheduler ports available and fully functional:

  • a 32-bit ARM Thumb port, running on Cortex-M devices;
  • a 64-bit synthetic POSIX port, running as a user process on macOS and GNU/Linux.


These ports are actually not part of the CMSIS++ package itself, which is highly portable, but are part of separate µOS++ packages.


The Cortex-M scheduler port


This 32-bit ARM Thumb port is specifically designed to run on Cortex-M devices. It currently supports the ARMv6-M and ARMv7-M architectures, with or without FPU. Support for ARMv8-M will be added when needed.

The implementation uses ARM-specific features, like PendSV, which greatly simplify things.

For example, the context switching is performed by a rather simple function:


__attribute__ ((section(".after_vectors"), naked, used, optimize("s")))
void
PendSV_Handler (void)
{
  // The naked attribute and the push/pop are used to fully control
  // the function entry/exit code; be sure other registers are not
  // used in the assembly parts.
  asm volatile ("push {lr}");

  // The whole mystery of context switching, in one sentence. :-)
  port::scheduler::restore_from_stack (
      port::scheduler::switch_stacks (
          port::scheduler::save_on_stack ()));

  asm volatile ("pop {pc}");
}


Apart from saving/returning, this function does exactly what it is expected to do:

  • save_on_stack() - saves the context of the current thread on the thread stack and returns the stack address;
  • switch_stacks() - saves the above stack address in the current thread control block, selects the next thread waiting to run and returns the address of its stack context;
  • restore_from_stack() - restores the context of the new thread from the stack.


The two save/restore functions are among the very few in the Cortex-M port that require assembly code:


inline stack::element_t*
save_on_stack (void)
{
  register stack::element_t* sp_;

  asm volatile (
      // Get the thread stack
      " mrs %[r], PSP                       \n"
      " isb                                 \n"

#if defined (__VFP_FP__) && !defined (__SOFTFP__)

      // Is the thread using the FPU context?
      " tst lr, #0x10                       \n"
      " it eq                               \n"
      // If so, push high vfp registers.
      " vstmdbeq %[r]!, {s16-s31}           \n"
      // Save the core registers r4-r11,r14.
      // Also save EXC_RETURN to be able to test
      // again this condition in the restore sequence.
      " stmdb %[r]!, {r4-r9,sl,fp,lr}       \n"

#else

      // Save the core registers r4-r11.
      " stmdb %[r]!, {r4-r9,sl,fp}          \n"

#endif

      : [r] "=r" (sp_) /* out */
      : /* in */
      : /* clobber. DO NOT add anything here! */
  );

  return sp_;
}

inline void
restore_from_stack (stack::element_t* sp)
{
  // Without enforcing optimisations, an intermediate variable
  // would be needed to avoid using R4, which collides with
  // the R4 in the list of ldmia.

  // register stack::element_t* sp_ asm ("r0") = sp;

  asm volatile (

#if defined (__VFP_FP__) && !defined (__SOFTFP__)

      // Pop the core registers r4-r11,r14.
      // R14 contains the EXC_RETURN value
      // and is restored for the next test.
      " ldmia %[r]!, {r4-r9,sl,fp,lr}       \n"
      // Is the thread using the FPU context?
      " tst lr, #0x10                       \n"
      " it eq                               \n"
      // If so, pop the high vfp registers too.
      " vldmiaeq %[r]!, {s16-s31}           \n"

#else

      // Pop the core registers r4-r11.
      " ldmia %[r]!, {r4-r9,sl,fp}          \n"

#endif

      // Restore the thread stack register.
      " msr PSP, %[r]                       \n"
      " isb                                 \n"

      : /* out */
      : [r] "r" (sp) /* in */
      : /* clobber. DO NOT add anything here! */
  );
}

The generated code (for Cortex-M3) is remarkably neat and tidy:


08000198 <PendSV_Handler>:
8000198: b500       push {lr}
800019a: f3ef 8009 mrs r0, PSP
800019e: f3bf 8f6f isb sy
80001a2: e920 0ff0 stmdb r0!, {r4, r5, r6, r7, r8, r9, sl, fp}
80001a6: f000 fe07 bl 8000db8 <os::rtos::port::scheduler::switch_stacks(unsigned long*)>
80001aa: e8b0 0ff0 ldmia.w r0!, {r4, r5, r6, r7, r8, r9, sl, fp}
80001ae: f380 8809 msr PSP, r0
80001b2: f3bf 8f6f isb sy
80001b6: bd00       pop {pc}


Static vs dynamic memory allocation


One of the initial CMSIS++ RTOS design requirements was to give the user full control over the memory allocation.

The implementation fulfilled this requirement, allowing any possible memory allocation scheme, from the simplicity of using fully static allocation to the extreme of using separate custom allocators for each object requiring dynamic memory.


The objects requiring dynamic memory are:

  • threads, for the stacks
  • message queues, for the queues (arrays of messages)
  • memory pools, for the pools (arrays of blocks)


All these objects take a final allocator parameter in their constructors, which defaults to the system allocator memory::allocator<T>.

For example one of the thread constructors is:


using Allocator = memory::allocator<stack::allocation_element_t>;

thread (const char* name, func_t function, func_args_t args,
        const attributes& attr = initializer, const Allocator& allocator =
              Allocator ());


By default the memory::allocator<T> is defined as:


template<typename T>
  using allocator = new_delete_allocator<T>;


but the user can define it as any standard C++ allocator, and so the behaviour of all objects requiring dynamic memory can be customised at once.

Moreover, each such object has a separate template version that takes a final allocator parameter, so in the limit each object can be allocated using a separate allocator.

Given the magic of C++, using such allocators is straightforward:


template<typename T>
  class my_allocator;

thread_allocated<my_allocator> thread { "th", func, nullptr };

message_queue_allocated<my_allocator> queue1 { "q1", 7, sizeof(msg_t) };
message_queue_typed<msg_t, my_allocator> queue2 { "q2", 7 };

memory_pool_allocated<my_allocator> pool1 { "p1", 7, sizeof(blk_t) };
memory_pool_typed<blk_t, my_allocator> pool2 { "p2", 7 };
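The underlying pattern can be sketched as follows (a hypothetical class, not the actual CMSIS++ implementation): the object takes a standard C++ allocator, defaulting to a new/delete based one, and uses it to obtain its storage, e.g. a thread stack.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Sketch of an object that obtains its storage (e.g. a thread stack)
// from a user-supplied standard allocator.
template <typename A = std::allocator<unsigned int>>
class stack_owner
{
public:
  explicit stack_owner (std::size_t words, const A& allocator = A ())
      : words_(words), allocator_(allocator)
  {
    // Acquire the buffer via the allocator, not via a hard-coded new[].
    stack_ = allocator_.allocate (words_);
  }

  ~stack_owner ()
  {
    allocator_.deallocate (stack_, words_);
  }

  std::size_t size (void) const { return words_; }
  unsigned int* stack (void) const { return stack_; }

private:
  std::size_t words_;
  A allocator_;
  unsigned int* stack_;
};
```

Substituting a pool-backed or statically-backed allocator for the default changes where the buffer comes from without touching the object's logic, which is exactly the flexibility the CMSIS++ constructors aim for.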


Static allocation is handled using exactly the same method, but different templates:


thread_static<2500> thread { "th", func, nullptr };

message_queue_static<7, msg_t> queue { "q" };

memory_pool_static<7, blk_t> pool { "p" };




Writing RTOS unit tests was always tricky and the results debatable. This does not mean it should not be attempted; actually, if done properly, these tests can be very useful.

To improve testability, the synthetic POSIX platform was implemented. It allows most RTOS tests to be run within a very convenient environment, such as macOS or GNU/Linux.

Another very helpful tool used to run the RTOS tests is the GNU ARM Eclipse QEMU, which emulates the STM32F4DISCOVERY board well enough for most tests to be relevant.

In fact, most of the time the tests were performed either on macOS or on QEMU; only rarely, usually at the end as a final validation, were they also run on physical hardware.


The ARM CMSIS RTOS validation suite


The main test was the ARM CMSIS RTOS validation suite, which quite thoroughly exercises the interface published in the cmsis_os.h file.

This test is automatically performed by the test scripts on the STM32F4DISCOVERY board running under GNU ARM Eclipse QEMU and on the synthetic POSIX platform.

The result of a run is:


CMSIS-RTOS Test Suite   Jun 23 2016   16:03:42

TEST 01: TC_ThreadCreate                  PASSED
TEST 02: TC_ThreadMultiInstance           PASSED
TEST 03: TC_ThreadTerminate               PASSED
TEST 04: TC_ThreadRestart                 PASSED
TEST 05: TC_ThreadGetId                   PASSED
TEST 06: TC_ThreadPriority                PASSED
TEST 07: TC_ThreadPriorityExec            PASSED
TEST 08: TC_ThreadChainedCreate           PASSED
TEST 09: TC_ThreadYield                   PASSED
TEST 10: TC_ThreadParam                   PASSED
TEST 11: TC_ThreadInterrupts              PASSED
TEST 12: TC_GenWaitBasic                  PASSED
TEST 13: TC_GenWaitInterrupts             PASSED
TEST 14: TC_TimerOneShot                  PASSED
TEST 15: TC_TimerPeriodic                 PASSED
TEST 16: TC_TimerParam                    PASSED
TEST 17: TC_TimerInterrupts               PASSED
TEST 18: TC_SignalMainThread              PASSED
TEST 19: TC_SignalChildThread             PASSED
TEST 20: TC_SignalChildToParent           PASSED
TEST 21: TC_SignalChildToChild            PASSED
TEST 22: TC_SignalWaitTimeout             PASSED
TEST 23: TC_SignalParam                   PASSED
TEST 24: TC_SignalInterrupts              PASSED
TEST 25: TC_SemaphoreCreateAndDelete      PASSED
TEST 26: TC_SemaphoreObtainCounting       PASSED
TEST 27: TC_SemaphoreObtainBinary         PASSED
TEST 28: TC_SemaphoreWaitForBinary        PASSED
TEST 29: TC_SemaphoreWaitForCounting      PASSED
TEST 30: TC_SemaphoreZeroCount            PASSED
TEST 31: TC_SemaphoreWaitTimeout          PASSED
TEST 32: TC_SemParam                      PASSED
TEST 33: TC_SemInterrupts                 PASSED
TEST 34: TC_MutexBasic                    PASSED
TEST 35: TC_MutexTimeout                  PASSED
TEST 36: TC_MutexNestedAcquire            PASSED
TEST 37: TC_MutexPriorityInversion        PASSED
TEST 38: TC_MutexOwnership                PASSED
TEST 39: TC_MutexParam                    PASSED
TEST 40: TC_MutexInterrupts               PASSED
TEST 41: TC_MemPoolAllocAndFree           PASSED
TEST 42: TC_MemPoolAllocAndFreeComb       PASSED
TEST 43: TC_MemPoolZeroInit               PASSED
TEST 44: TC_MemPoolParam                  PASSED
TEST 45: TC_MemPoolInterrupts             PASSED
TEST 46: TC_MsgQBasic                     PASSED
TEST 47: TC_MsgQWait                      PASSED
TEST 48: TC_MsgQParam                     PASSED
TEST 49: TC_MsgQInterrupts                PASSED
TEST 50: TC_MsgFromThreadToISR            PASSED
TEST 51: TC_MsgFromISRToThread            PASSED
TEST 52: TC_MailAlloc                     PASSED
TEST 53: TC_MailCAlloc                    PASSED
TEST 54: TC_MailToThread                  PASSED
TEST 55: TC_MailFromThread                PASSED
TEST 56: TC_MailTimeout                   PASSED
TEST 57: TC_MailParam                     PASSED
TEST 58: TC_MailInterrupts                PASSED
TEST 59: TC_MailFromThreadToISR           PASSED
TEST 60: TC_MailFromISRToThread           PASSED

Test Summary: 60 Tests, 60 Executed, 60 Passed, 0 Failed, 0 Warnings.
Test Result: PASSED


The mutex stress test


This test exercises the scheduler and the thread synchronisation primitives. It creates 10 threads that compete for a mutex, simulate random activity, and compute statistics on how many times each thread acquired the mutex, in order to validate the fairness of the scheduler.

The test is automatically performed by the scripts on the STM32F4DISCOVERY board running under GNU ARM Eclipse QEMU and on the synthetic POSIX platform.

A typical result of the run is:


Mutex stress & uniformity test.
Built with GCC 5.3.1 20160307 (release) [ARM/embedded-5-branch revision 234589].
Seed 3761791254
[  5s] t0:39   t1:42   t2:37   t3:41   t4:38   t5:37   t6:36   t7:41   t8:40   t9:34   sum=385, avg=39, delta in [-5,3] [-12%,8%]
[ 10s] t0:74   t1:82   t2:79   t3:84   t4:79   t5:84   t6:77   t7:76   t8:80   t9:75   sum=790, avg=79, delta in [-5,5] [-5%,6%]
[ 15s] t0:114  t1:120  t2:116  t3:128  t4:117  t5:122  t6:114  t7:116  t8:116  t9:115  sum=1178, avg=118, delta in [-4,10] [-2%,8%]
[ 20s] t0:155  t1:161  t2:152  t3:163  t4:153  t5:160  t6:154  t7:159  t8:154  t9:154  sum=1565, avg=157, delta in [-5,6] [-2%,4%]
[ 25s] t0:196  t1:199  t2:194  t3:206  t4:193  t5:198  t6:194  t7:200  t8:197  t9:194  sum=1971, avg=197, delta in [-4,9] [-1%,5%]
[ 30s] t0:233  t1:236  t2:241  t3:245  t4:231  t5:236  t6:233  t7:237  t8:234  t9:237  sum=2363, avg=236, delta in [-5,9] [-1%,4%]
[ 35s] t0:270  t1:281  t2:277  t3:284  t4:266  t5:273  t6:279  t7:278  t8:273  t9:277  sum=2758, avg=276, delta in [-10,8] [-3%,3%]


The semaphore stress test


This test exercises the synchronisation primitives available from interrupt service routines and the effectiveness of the critical sections. It creates a high-frequency hardware timer which posts to a semaphore, and a thread counts whether the posts arrived in time or late; in other words, whether the scheduler was able to wake up the thread fast enough.

The test runs on the physical STM32F4DISCOVERY board.

A typical result of the run shows that on this platform the scheduler can sustain about 250,000 context switches per second:


Semaphore stress test.
Built with GCC 5.3.1 20160307 (release) [ARM/embedded-5-branch revision 234589].

Iteration 0
Seed 832262406
  42000 cy    1 kHz
  21000 cy    2 kHz
  10500 cy    4 kHz
   5250 cy    8 kHz
   2625 cy   16 kHz
   1312 cy   32 kHz
    656 cy   64 kHz
    328 cy  128 kHz
    164 cy  256 kHz    1 late
     82 cy  512 kHz  777 late
     41 cy 1024 kHz  998 late
     20 cy 2100 kHz  999 late
     10 cy 4200 kHz  999 late




CMSIS++ is still a young project, and many things need to be addressed, but the core component, the RTOS, is pretty well defined and functional.

It may not yet be perfect (though it tries to be), but it definitely provides a more standard set of primitives, closer to POSIX, and a wider set of APIs than many other existing RTOSes, covering both C++ and C applications; at the same time it does its best to preserve compatibility with the original ARM CMSIS APIs.

Any contributions to improve CMSIS++ will be highly appreciated.


More info


CMSIS++ is an open source project, maintained by Liviu Ionescu. The project is released under the terms of the MIT license.

The main source of information for CMSIS++ is the project website.

The Git repositories and all public releases are available from GitHub; specifically the stress tests are available from the tests folder.

The code for ARM CMSIS RTOS validator is available from GitHub.

The code for the Cortex-M scheduler port is available from GitHub.

The code for the synthetic POSIX scheduler port is available from GitHub.

For questions and discussions, please use the CMSIS++ section of the GNU ARM Eclipse forum.

For bugs and feature requests, please use the GitHub issues.

In previous blogs we covered an introduction to System Trace Macrocell (STM) concepts and terminology, and the STM Programmers' model with an example of how to generate efficient trace data. Once the STM is generating a trace stream, we may wish to view it within our Debugger.


DS-5 implements an "Events View" which serves this purpose.



Configuring Your Target


First, it is necessary to make sure that the platform configuration for your target is set up (via DTSL options) to collect trace from the STM, otherwise the view will not be configurable. From the Debug Configurations user interface, we can find the DTSL Options "Edit..." button underneath the target selection list.


Each platform may look slightly different. First, select a valid trace sink via the "Trace Buffer" tab - most platforms default to "None" and may have many options such as "DSTREAM" or "ETB."


There is usually a dialog tab marked "STM" or a checkbox which enables trace from a particular STM, per the following screenshot:




Configure the Events view


Once connected we can configure our Events view. By default, it looks fairly empty. This view must be configured for each Master and Channel combination we want to see in the view. We see an informational item on what the view will decode (which Masters and Channels) and the source (in this case, DSTREAM: STM).


The view is organized in pages, and the VCR-like controls will walk us back and forth within the decoded trace:


To configure the view, find the Settings menu (next to the view minimize/maximize buttons) and select the "Events Settings..." item.


We will then be presented with a dialog. First, select the trace source to be shown in the view. In the example we show collecting trace on the DSTREAM unit (via TPIU) and that we want to see the trace output from device "STM." This makes up the "DSTREAM: STM" trace configuration.


For each Master, a Channel can be defined, and the expected decode of that channel changed from "Text" to "Binary." We see that we are enabling Master 64 with Channel 0 as Text and channels 1-65535 as Binary. The example code provided only uses Channel 0 and Channel 1, but here we see that we can have a different setting for each master and each channel.

The mapping of Master number to a source device is implementation-specific. For the Juno ARM Development Platform, it is listed in the SoC Technical Reference Manual (specifically for r0, r1, and r2).



Note the Import and Export buttons, which can be used to load in a pre-configured set of configurations, or save them out for later re-use, as different system environments and applications will have different settings.


Viewing Trace Output


Once we've collected trace, we will see the STM output in the Events view. Notice that the Master and Channel are reported, and that the Timestamp increments.


We see, from our example code, our "Cambridge" string (the first character 'C' is Marked) and our Prime number and count following:



In this blog, the second in a series, we explore the programmers' model for the ARM System Trace Macrocell. A previous blog covered basic concepts of the STM architecture and implementation. Example code is provided, which is minimally targeted at the Juno ARM Development Platform.


STM Programmers’ Model


Memory Map


The STM Architecture defines a memory map that is split into two regions: a configuration interface (4KiB in size), which contains all the registers used to configure the behavior of the STM, as well as to access Basic Stimulus Ports, if implemented.


A second region of memory contains the Extended Stimulus Ports and can be up to 16MiB in size. How this is represented in the system memory map is down to the design of the SoC -- all Masters (CPUs and devices) may access the same address, or all Masters may access a dedicated and independent address.


All registers in the STM Architecture are defined as being located at an offset relative to the base address of their respective region. On the Juno SoC, the base address of the configuration (or "APB") interface is 0x2010_0000 and the base address of the Extended Stimulus (or "AXI") region is 0x2800_0000, with this address being common to all Masters.




There are two key steps to configuring the STM via the APB interface. The first is that the STM needs to be configured with a valid Trace ID, since it outputs the instrumentation data over the CoreSight trace subsystem.


This value is exported over the ATB bus interface and is required not only for the transactions to be valid, but to discern between STM trace data and, for example, trace data from another CoreSight component such as an Embedded Trace Macrocell (ETM).


When using an external debugger (such as ARM DS-5) to collect the trace, it is possible to have the debugger set up the Trace ID as part of the connection sequence. The responsibility for this truly depends on your use case; if an external debugger is involved then it may be configuring other CoreSight components and giving them Trace IDs. You do not want the STM Trace ID and the Trace ID for another component to be the same, but you also do not want the debugger to conflict with your application STM configuration.


If you have an external debugger connected you can modify your instrumentation software to compensate; there is no harm whatsoever in having the debugger set the same trace ID as your instrumentation software.


We show an example function stmTRACEID() which performs this operation:


/*
 * stmTRACEID(stm, traceid)
 * Set STM's TRACEID (which goes out over ATB bus ATBID)
 * Note it is illegal per CoreSight to set the trace ID
 * to 0x00 or one of the reserved values (0x70 onwards)
 * (see IHI0029D D4.2.4 Special trace source IDs).
 */
unsigned int stmTRACEID(struct STM *stm, unsigned int traceid)
{
  if ((traceid > 0x00) && (traceid < 0x70)) {
    unsigned int tcsr;

    traceid = traceid & TRACEID_MASK;

    tcsr = (stm->APB->STMTCSR & ~(TRACEID_MASK << TRACEID_SHIFT));
    stm->APB->STMTCSR = (tcsr | (traceid << TRACEID_SHIFT));

    return traceid;
  }

  return 0;
}


The second requirement is to enable the stimulus ports in question. This is actually an optional part of the STM Architecture, which offers configuration registers to enable and disable the generation of trace packets when a particular stimulus port is accessed. It is possible to enable and disable stimulus ports with a certain granularity, but this will be completely dependent on the design of the instrumented code and the system it runs on. This example code enables all Extended Stimulus Ports such that any stimulus write to any stimulus port will generate a packet.


/*
 * Set STMSPSCR.PORTCTL to 0x0 to ensure port selection is not
 * used. STMSPSCR.PORTSEL is ignored and STMSPER and STMSPTER
 * bits apply equally to all groups of ports.
 * Whether the STM has 32 or 65536 ports, they'll all be
 * enabled.
 */
stm->APB->STMSPSCR = 0x00000000;
stm->APB->STMSPER = 0xffffffff;
stm->APB->STMSPTER = 0xffffffff;


Once configured, we can then enable the STM with appropriate register access:
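As a sketch of that final step, the global enable is bit 0 (EN) of STMTCSR; the register struct below is a mock for illustration, not the full APB interface of the example code.

```cpp
#include <cassert>
#include <cstdint>

// Mocked configuration interface; on real hardware this would be the
// memory-mapped APB register block at the STM base address.
struct stm_apb
{
  volatile std::uint32_t STMTCSR; // Trace Control and Status Register
};

constexpr std::uint32_t STMTCSR_EN = (1u << 0); // global STM enable bit

inline void stm_enable (stm_apb* apb)
{
  apb->STMTCSR |= STMTCSR_EN; // read-modify-write, preserving TRACEID etc.
}

inline void stm_disable (stm_apb* apb)
{
  apb->STMTCSR &= ~STMTCSR_EN;
}
```

The read-modify-write matters: the same register holds the Trace ID configured earlier, which must not be disturbed when toggling the enable.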




This is the bare minimum setup for an STM. There are obviously other configuration options, such as Compression, Timestamping, and Synchronization, that may or may not be configured depending on the application.


Which Stimulus Port?


Each of the 65536 possible Extended Stimulus Ports maps to an STPv2 Channel. A trace decoder can then look for trace belonging to this channel to retrieve the instrumentation and differentiate it from other instrumentation sources.


The layout in memory of the stimulus ports means that for each packet, a data item is written to a particular address and offset within the STM stimulus port address space. Recall that each Extended Stimulus Port is a 256-byte region of memory. The address of the start of the stimulus port, and therefore all the registers which will generate trace for that "channel" within the AXI interface, can be calculated.


channel_address  = STM_AXI_BASE + (0x100 * channel_number)
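In code, using the Juno AXI base address given earlier (the helper name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Juno: Extended Stimulus (AXI) region base, common to all Masters.
constexpr std::uint64_t STM_AXI_BASE = 0x28000000u;

// Each Extended Stimulus Port is a 256-byte region, so the port
// address is a simple scaled offset from the region base.
constexpr std::uint64_t
stm_channel_address (unsigned channel)
{
  return STM_AXI_BASE + (0x100u * channel);
}
```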


We present code which provides two examples of access methods, the first using logical operations to exploit the defined address decode logic within the STM Architecture, returning a pointer which can be used to perform the memory write.

The finer points of the address decode used by the STM are documented in the STM Architecture, section 3.3. The code for stm.c:stmPortAddress() in the example code shows a method of calculating the address and offset using a flag-based API.

The second uses a C struct defining the layout of each stimulus port offset as an array. In this manner, assigning a value to a particular structure member would generate the appropriate store. Additionally, using C macros can simplify and increase readability of the actual stimulus port access.


/* Typedef names reconstructed; member offsets follow the STM
 * Architecture extended stimulus port layout. */
typedef volatile unsigned long long STM_TYPE;
typedef volatile unsigned int STM_NA;

struct stmPort {
  STM_TYPE G_DMTS;        /* 0x00: Data, Marked, with Timestamp */
  STM_TYPE G_DM;          /* 0x08: Data, Marked */
  STM_TYPE G_DTS;         /* 0x10: Data, with Timestamp */
  STM_TYPE G_D;           /* 0x18: Data */
  STM_NA G_reserved[16];  /* 0x20..0x5f */
  STM_TYPE G_FLAGTS;      /* 0x60: Flag, with Timestamp */
  STM_TYPE G_FLAG;        /* 0x68 */
  STM_TYPE G_TRIGTS;      /* 0x70 */
  STM_TYPE G_TRIG;        /* 0x78 */
  STM_TYPE I_DMTS;        /* 0x80: Invariant-timing equivalents */
  STM_TYPE I_DM;          /* 0x88 */
  STM_TYPE I_DTS;         /* 0x90 */
  STM_TYPE I_D;           /* 0x98 */
  STM_NA I_reserved[16];  /* 0xa0..0xdf */
  STM_TYPE I_FLAGTS;      /* 0xe0 */
  STM_TYPE I_FLAG;        /* 0xe8 */
  STM_TYPE I_TRIGTS;      /* 0xf0 */
  STM_TYPE I_TRIG;        /* 0xf8 */
};

/*
 * STM AXI Stimulus Interface
 * The STM Architecture defines up to 65536 stimulus ports, all of which are
 * implemented on the STM and STM-500 from ARM, Ltd.
 */
struct stmAXI {
    /*
     * We access the port array based on the limit in
     * (stmAPB->STMDEVID & 0x1fff) so there is nothing we
     * can define at compile time..
     */
    struct stmPort port[0];
};

/*
 * STMn(port, class)
 * Write an n-byte value to a stimulus port of a particular type (e.g. G_DMTS)
 */
#define STM8(a, p, type)  *((volatile unsigned char *) &((a)->port[p].type))
#define STM16(a, p, type) *((volatile unsigned short *) &((a)->port[p].type))
#define STM32(a, p, type) *((volatile unsigned int *) &((a)->port[p].type))
#define STM64(a, p, type) *((volatile unsigned long *) &((a)->port[p].type))


We can re-create "printf debug" functionality by passing formatted strings to a function which outputs them as data over the requested STM channel:

The example function stm.c:stmSendString() outputs a string as instrumentation using macros STMn() (where n is {8,16,32,64}) which resolve to a C struct access as defined above.

/*
 * void stmSendString(stm, channel, string)
 * We specifically write a byte to ensure that we get a D8 packet,
 * although that limits the function to 8-bit encodings.
 * It doesn't matter what we use for the last write (if we see
 * a null character) -- G_FLAGTS has no data except the flag and
 * the timestamp, so a 32-bit access will be just fine..
 */
void stmSendString(struct STM *stm, unsigned int channel, const char *string)
{
    /*
     * Send a string to the STM extended stimulus registers
     * The first character goes out as D8M (Marker) packet
     * The last character is followed by a Timestamp packet
     * This is the Annex C example from the STPv2 spec
     */
    struct stmAXI *axi = stm->AXI;

    int first = 1;

    while (*string != '\0')
    {
        /*
         * If the character is a linefeed, then don't output
         * it -- just reset our 'first' state to 1 so that
         * the next character (the start of the next line)
         * is marked
         */
        if (*string == '\n') {
            STM32(axi, channel, G_FLAGTS) = *string++;
            first = 1;
        } else {
            /*
             * Continue to output characters -- if it's the
             * first character in a string, or just after a
             * linefeed (handled above), mark it.
             */
            if (first) {
                STM8(axi, channel, G_DM) = (*string++);
                first = 0;
            } else {
                STM8(axi, channel, G_D) = (*string++);
            }
        }
    }

    /*
     * Flag the end of the string
     * Access size doesn't matter as we have no data for flag
     * packets
     */
    STM32(axi, channel, G_FLAGTS) = 0x0;
}


Effective use of the STM

Annex C of the STPv2 specification gives an example of encoding an ASCII string as a data item, using the metadata functionality of the Extended Stimulus Ports. Strings are delimited with a Marked packet at the start of the string, and the end of each string is appended with a FLAG_TS packet, in place of a linefeed or NUL character. The offset for one type of Marked Data packet is 0x08 (G_DM); for a (plain) Data packet, 0x18 (G_D); and for a Flag packet with Timestamp, 0x60 (G_FLAGTS). We can therefore break down sending the string into individual writes to those addresses. When we look at the trace output for a NUL-terminated string “Cambridge”, we might expect to see the following in the trace stream as a result of those writes.
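That expected packet sequence can be sketched with a small encoder that models the scheme (D8M for the first character of a line, D8 for the rest, FLAG_TS at the end; a model of the trace stream only, not of the STM hardware):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Model of the Annex C string encoding: the first character of each
// line goes out Marked (D8M), subsequent characters as plain D8, and
// each line end / string end becomes a FLAG_TS packet.
std::vector<std::string>
encode_string (const std::string& s)
{
  std::vector<std::string> packets;
  bool first = true;
  for (char c : s)
    {
      if (c == '\n')
        {
          packets.push_back ("FLAG_TS"); // linefeed carries no data
          first = true;
        }
      else if (first)
        {
          packets.push_back (std::string ("D8M '") + c + "'");
          first = false;
        }
      else
        {
          packets.push_back (std::string ("D8 '") + c + "'");
        }
    }
  packets.push_back ("FLAG_TS"); // terminate the final line
  return packets;
}
```

For "Cambridge" this yields a D8M for 'C', eight plain D8 packets, and a closing FLAG_TS, matching the writes made by stmSendString().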



This allows a trace decoder to adequately identify individual lines within text output, and additionally gives the trace decoder a method of determining when the string was output in time by way of the Timestamp. For binary data, a similar construct may be used with Marked data or Flag metadata surrounding the elements of an instrumentation message.


It might become obvious that outputting ASCII strings over a trace bus with a single packet per character is possibly not the most efficient way to use the STM. Since each data item is encapsulated in the STPv2 protocol, there is some overhead. The example string "Cambridge" sent as D8 packets and surrounded by D8M and FLAG_TS could be, rather than 9 bytes long (1 byte per character), somewhat more than 20 bytes. Packet headers are easily accounted for, but a timestamp may be quite large (up to 7 bytes, not inclusive of the FLAG_TS packet header) and may vary in size. This also does not take into account reporting of Channel and Master information. There are many ways of encoding a string within larger packet types using marker and flag 'framing' to differentiate between strings, but in the end "printf", whether over a USART or an STM interface, is simply not an efficient method of instrumentation.


In fact, in industrial applications, instrumentation is usually binary data formatted to be compact and useful and not a console output. This is especially true of use cases such as the network packet processing instrumentation where the relevant data needn't be prefixed or human readable, and indeed may be far too vast for a human to spend time reading -- the point of said instrumentation would be statistical analysis.


The onus, therefore, is upon the trace decoder to make sense of that packetized binary data. With any instrumentation data, an appropriate format for that data can be designed – ASCII strings or binary structures – and this will very much inform how the Stimulus Ports are used. Simply, you will need to at least define the usage of channels and the metadata packets before you start writing instrumentation code. By modulating the access size and the use of the extended stimulus ports' abilities to add metadata, extremely efficient output of binary instrumentation data can be effected.


Annex C also gives an example of formatting binary data in such a manner that can be constructed using the stimulus port accessor methods (as previously described). Let us imagine an application which calculates prime numbers. When it finds a prime number, it outputs the prime number itself, and the position or index of the prime, as 32-bit stimulus accesses to the STM. For example, 41 is the 13th prime number, so it outputs "41" and "13."


Stimulus Port Register   Data
G_D                      0x00000029  (41, the prime)
G_DMTS                   0x0000000D  (13, its position)
G_D                      0x0000002B  (43)
G_DMTS                   0x0000000E  (14)
G_D                      0x0000002F  (47)
G_DMTS                   0x0000000F  (15)


A trace decoder can then look for pairs of 32-bit data items, with the second followed by a Marker packet augmented with a Timestamp. From the difference in timestamps between packets, we could work out how long it took to generate that prime number.
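The decoder side of this scheme can be sketched in a few lines of C. The packet and record structures below are a simplified, hypothetical view of what a real STPv2 decoder would produce; the sketch only shows the pairing logic, assuming a clean alternating stream of unmarked and marked+timestamped packets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified view of a decoded STPv2 data packet: the payload, and whether
 * the packet carried Marker and Timestamp attributes (i.e. came from G_DMTS). */
typedef struct {
    uint32_t data;
    int      marked_ts;
    uint64_t timestamp;  /* valid only when marked_ts is set */
} stp_pkt_t;

typedef struct {
    uint32_t prime;
    uint32_t index;
    uint64_t timestamp;
} prime_rec_t;

/* Pair up (prime, index) packets: a plain data packet followed by a
 * marked+timestamped packet closes one record. Returns records found.
 * A real decoder would also resynchronize on malformed input. */
static size_t decode_primes(const stp_pkt_t *pkts, size_t n,
                            prime_rec_t *out, size_t max)
{
    size_t found = 0;
    for (size_t i = 0; i + 1 < n && found < max; i += 2) {
        if (!pkts[i].marked_ts && pkts[i + 1].marked_ts) {
            out[found++] = (prime_rec_t){
                .prime     = pkts[i].data,
                .index     = pkts[i + 1].data,
                .timestamp = pkts[i + 1].timestamp,
            };
        }
    }
    return found;
}
```

The difference between successive record timestamps gives the time taken to find each prime.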




This takes up six 32-bit words (24 data bytes), not including overhead, for the 3 sets of data shown. Unless our first prime number is very, very large, we would not need to encode the number or the count in a 32-bit data packet. Since each value is packetized independently (the STM will never merge two packets), the accessor could be conditional on the size (counting leading zeros) of the output data, or a smaller packet could be emitted automatically using optional STM Compression features.
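A size-conditional accessor of the kind just described might look like the following sketch: it picks the smallest STPv2 data packet size (D8, D16, D32 or D64) that can carry a value, after which the caller would perform a store of that width to the stimulus port.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest STPv2 data packet size, in bits, that can carry v.
 * The accessor would then perform a store of this width to the
 * stimulus port so the STM emits the matching Dn packet. */
static int stm_packet_bits(uint64_t v)
{
    if (v <= UINT8_MAX)  return 8;
    if (v <= UINT16_MAX) return 16;
    if (v <= UINT32_MAX) return 32;
    return 64;
}
```

On cores with a count-leading-zeros instruction, the same decision can be made branchlessly from `__builtin_clzll`, but the range comparison above is the portable form.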


The trace decoder would then still be able to look for pairs of data packets (with a Marker+Timestamp), but we would make more efficient use of bits in the resultant trace. Below we show how an efficient trace output could be achieved when counting primes, where we increase the packet payload size as we reach the limit of the previous type (again, the first field is a "prime" and the second, marked field is a "count" of which prime it is). The data below reports 5 sets of data using 15 data bytes (again, not including overhead).




We can see that since the first prime can be encoded in 8 bits, we can use a D8 packet. Since its position can also be encoded in 8 bits, we can use a D8 packet for that too. The next prime is 257, which requires more than 8 bits to encode, but the position does not, so we see D16+D8MTS. And so on. Eventually we will see D32 and possibly D64 packets if we calculate enough primes, but only if we need that number of bits to encode the value.




We now know fundamentally how to program the STM and generate stimulus which implements our instrumentation. Next we'll discuss how to configure DS-5 to collect the instrumentation as Trace, in Configuring DS-5 for the System Trace Macrocell.

This article aims to introduce the ARM System Trace Macrocell (STM), outlining what it is, its basic operation, and why one might want to use it. Example code will be provided, minimally targeted at the Juno ARM Development Platform, in a later blog in the series.



Introduction to instrumentation

When writing code it is often useful to add informational statements that give an insight into control flow and data management as well as aiding in observation of the actual code at runtime. As such, instrumentation is an important component of code running on a live system. The proliferation of "printf" debug statements, whereby data is output to a console, is testament to this.


Sending text data to a USART or similar peripheral via printf is perhaps the most common method of instrumentation. It does have its drawbacks: the data rate of most USARTs is relatively low, while the overhead of maintaining such communication is relatively high, involving the use of FIFOs and interrupt servicing. It can also be complicated to access a serial port connection on a production system, which may be located remotely. With this in mind, the use of a USART for instrumentation can be considered a non-ideal choice for use cases involving high-performance code or the collection of remote instrumentation data.


An alternative is to use network devices, such as Ethernet. These typically afford much higher bandwidth than USARTs, and are ideal for the collection of remote data. However, this involves encapsulating the data in protocols such as TCP/IP, which can dramatically increase the overhead of servicing the peripheral, and therefore the overhead of instrumentation.


Using USARTs, Ethernet or other generic data peripherals can have detrimental effects on instrumented code. As an example, imagine a system which performs network packet data processing. If we use a USART, we may find that the data processing is limited because sending instrumentation data is throttled by the USART bandwidth. If we instead use Ethernet as the transport for instrumentation, we might find that the instrumentation of packet data processing ends up including the processing of the instrumentation traffic itself.


It is desirable for instrumented code to run close to the performance and run-time profile of non-instrumented code. This implies that instrumentation should have as little management overhead as possible, and should not markedly interfere with the operation of the non-instrumentation code. One way to achieve this is with a device designed specifically for instrumentation.



The System Trace Macrocell

A System Trace Macrocell (STM) gives software developers the ability to instrument code using the CoreSight Trace subsystem as a transport. CoreSight is a central part of most ARM SoCs, and is intended to operate at similar clock rates to the rest of the components of the system. The STM itself operates in a non-invasive fashion, requires very little overhead beyond memory-mapped peripheral writes, and does not (directly) generate interrupts.


ARM defines a System Trace Macrocell Programmers' Model Architecture Specification (currently version 1.1, referred to here as "STM Architecture") and licenses the current CoreSight STM-500 product as an implementation of that architecture.


Further information on CoreSight Trace can be found in Eoin McCann's 3-part blog on CoreSight.


The STM instruments using the MIPI System Trace Protocol version 2.0 (STPv2), which is available to MIPI Members. The protocol itself defines a method for both instrumentation data and metadata to be encapsulated in a trace stream, composed of varying sized data elements (from 4- to 64-bit). The instrumentation is otherwise free-form and neither the protocol nor the STM place any limitations on the data content of the stream. These aspects of the STM free the software developer from having to be concerned with instrumentation overheads and available bandwidth.






Instrumentation via STM can be identified as being output via a particular "Master," in order to differentiate the various sources within a system. A simple implementation might attribute all instrumentation to a single Master identifier. A more complex design might give each individual core a unique Master identifier, making it clear which core was running the software responsible for generating a particular datum of instrumentation.


Any device that can generate a memory system write can generate instrumentation, for example DMA peripherals and GPUs.


The number of masters within a system and their identifiers are part of the implementation of the system, and may or may not map directly to, for example, AXI IDs. Check the design documentation for your chosen SoC for details on which components are able to generate stimulus via memory writes, and what their STPv2 Master ID is.




Each STM implementation has access to up to 65536 instrumentation channels. Each of these channels is clearly defined in the trace stream, allowing for multiple types of instrumentation to be intermixed within a single system or single application. For instance, channel 0 could be used to encode ASCII text, while channel 10 could output packet headers in a binary format.  Alternatively, one channel could be allocated to each Process within a system.



Metadata: Marks, Flags, Timestamps and Triggers


STM metadata is highly flexible, allowing one to arbitrarily Mark any trace data packet. A marked datum is typically used to identify the start of data or something of interest in the trace stream. A Flag can be used in a similar way; however, no data is associated with a Flag.

Each packet can be supplemented with a Timestamp, which takes an external clock signal and converts it into an incrementing count in the trace stream. In this manner a trace stream can be synchronized with other trace in the system, such as Instruction Trace from an ETM, or simply provide timing information to the trace decoder.


STPv2 defines the format of the timestamp to be flexible. The STM-500 outputs timestamps in a natural binary format, with the ability to encode a delta to conserve bandwidth.

A Trigger is special in that it is both output to the trace stream and can affect the rest of the trace subsystem. The result of a Trigger can be routed to other components in the system. In this manner code can be instrumented so that it also generates additional trace from other Trace Macrocells within the system at pertinent points. This is particularly useful for post-mortem analysis use cases.




Stimulus Ports

Channels are formed on the STM by way of “stimulus ports.” These are groups of registers within the SoC memory map that, when accessed, generate the desired trace output. The STM Architecture defines both “Basic” and “Extended” Stimulus Ports. A Basic Stimulus Port is simple; data is written to the port, and that data is then output.


Extended Stimulus Ports allow the data to be augmented with useful metadata, and allow the importance of that data to be indicated (Guaranteed or Invariant, discussed later). The Extended Stimulus Ports consist of a grouping of 16 registers in a 256-byte memory-mapped region, separate from the STM configuration registers.


Depending on the address offset of the register within a group, a different STPv2 packet is output. The offsets are defined in the STM Architecture, Section 3.1 (Table 3-1), a summary of which is shown:



Address Offset   Short name   Description
0x00             G_DMTS       Data, marked, with timestamp, guaranteed
0x08             G_DM         Data, marked, guaranteed
0x10             G_DTS        Data, with timestamp, guaranteed
0x18             G_D          Data, guaranteed
0x60             G_FLAGTS     Flag, with timestamp, guaranteed
0x68             G_FLAG       Flag, guaranteed
0x70             G_TRIGTS     Trigger, with timestamp, guaranteed
0x78             G_TRIG       Trigger, guaranteed


The size of the data payload of each packet is determined by the size of the access made to the stimulus port offset. For example, an 8-bit store to offset 0x18 would nominally generate a 'D8' packet, while a 32-bit store to offset 0x18 would nominally generate a 'D32' packet, and so on.
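In C, the register layout from the table above can be captured as a struct, with the store width selecting the packet type. This is a sketch rather than a vendor header: the struct and helper names are invented here, and a real driver would map the struct over the channel's stimulus port address in the SoC memory map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout of one extended stimulus port, following the guaranteed-register
 * offsets in the table above. Each register slot is 8 bytes wide so that
 * 8-, 16-, 32- and 64-bit stores are all possible. */
typedef struct {
    uint64_t g_dmts;    /* 0x00: data, marked, with timestamp */
    uint64_t g_dm;      /* 0x08: data, marked */
    uint64_t g_dts;     /* 0x10: data, with timestamp */
    uint64_t g_d;       /* 0x18: data */
    uint64_t _res0[8];  /* 0x20-0x5F: reserved */
    uint64_t g_flagts;  /* 0x60: flag, with timestamp */
    uint64_t g_flag;    /* 0x68: flag */
    uint64_t g_trigts;  /* 0x70: trigger, with timestamp */
    uint64_t g_trig;    /* 0x78: trigger */
} stm_ext_port_t;

/* The store width selects the packet: an 8-bit store to g_d nominally
 * yields a D8 packet, a 32-bit store a D32 packet, and so on. */
static void stm_d8(volatile stm_ext_port_t *p, uint8_t v)
{
    *(volatile uint8_t *)&p->g_d = v;
}

static void stm_d32(volatile stm_ext_port_t *p, uint32_t v)
{
    *(volatile uint32_t *)&p->g_d = v;
}
```

Equivalent helpers for the marked and timestamped variants simply target the other offsets.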


To reiterate, we can "Mark" and "Timestamp" our data, and also output metadata only via "Flag" and "Trigger" mechanisms (these types of instrumentation have no data payload.)


Since ARM's STM and STM-500 IP do not implement the Basic Stimulus registers, we will not cover them here. ARM partners implementing an STM may choose to implement them per the STM Architecture. If, when designing an SoC, there is a requirement for simpler instrumentation, then an Instrumentation Trace Macrocell (ITM) could be implemented instead, which provides similar functionality, although with a different programmers' model and trace output format. Please check your SoC documentation.

Fundamental Data Size


The STM implementation defines a “Fundamental Data Size.” This is essentially the maximum size of an access to the stimulus port registers, as determined by the implementation of the connection between the STM and the rest of the system.


For the STM-500, as implemented in revision r1 of the Juno SoC, the fundamental data size is 64 bits, so a 64-bit stimulus write should generate a D64 packet. Care should be taken to determine this value, as it can change the way a trace decoder is written for application instrumentation that may run on multiple platforms.


Some SoCs implement an earlier version of the STM, the r0 revision of Juno being one example. The Fundamental Data Size is defined as 32-bit for that implementation.


An STM with a Fundamental Data Size of 64 bits may also be connected in such a way that it does not have a 64-bit wide data path, for example there may be a 'downsizer' between the instrumentation source and STM.


If a 64-bit memory system write is performed and either of the above is true, the actual trace output behavior is undefined by the STM architecture. These aspects should be taken into account, as they can change the way a trace decoder extracts instrumentation.

Guaranteed and Invariant Stimulus


The STM Architecture specifies two types of transaction, accessible through the stimulus port interface at separate offsets within the port – Guaranteed and Invariant. A write to the stimulus port "guaranteed" registers must be emitted by the STM as a trace packet; additionally, if a timestamp is requested (DnTS, FLAG_TS) and timestamping is enabled in the STM configuration registers (STMTCSR), then the timestamp will be generated.


Writes to the Invariant registers allow the STM to determine whether the full scope of the instrumentation will be output. This is useful for instrumentation that may be implemented as "lossy": for instance, the output of the state of a loop counter where intermediate counts can be inferred, or where timestamping is not fundamental to the instrumentation. Invariant stimulus may, when emitted, "drop" timestamps for the sake of trace bandwidth. Important instrumentation, such as an error or other pertinent data, should still use Guaranteed stimulus.
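The choice between the two can be made at the call site by selecting the register offset. In the STM Architecture the invariant registers of an extended stimulus port mirror the guaranteed ones at offset +0x80 (so, for example, I_D sits at 0x98 alongside G_D at 0x18); this sketch assumes that layout, so verify it against Table 3-1 of the specification for your part.

```c
#include <assert.h>
#include <stdint.h>

#define STM_G_D           0x18u  /* data, guaranteed (from the table above) */
#define STM_INVARIANT_OFF 0x80u  /* assumed guaranteed-to-invariant stride */

/* Pick the register offset for a data write: guaranteed when the datum
 * must reach the trace stream (e.g. an error report), invariant when the
 * instrumentation may be lossy (e.g. a loop counter whose intermediate
 * values can be inferred). */
static uint32_t stm_data_offset(int must_not_drop)
{
    return must_not_drop ? STM_G_D : (STM_G_D + STM_INVARIANT_OFF);
}
```

The returned offset would be added to the channel's stimulus port base address before performing the store.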




Now that we have a good idea of what the STM is and how the architecture is defined, we can use the STM to generate instrumentation by Programming ARM's System Trace Macrocell, the second part of this blog series.
