# Blog


# DS-MDK support extended to NXP i.MX 6SoloX processors and Phytec i.MX 7 board

Posted by Christopher Seidl Aug 25, 2016

DS-MDK, the software development solution for heterogeneous computing, now supports additional devices from NXP and a new development board.

NXP i.MX 6SoloX processors offer an ARM Cortex-A9 core together with an ARM Cortex-M4. The corresponding SABRE development board is now fully supported by DS-MDK, using the i.MX 6 software pack. Learn how to use the SABRE board together with DS-MDK on the reference page.

Furthermore, DS-MDK now supports PHYTEC phyBOARD-i.MX7-Zeta. This single-board computer (SBC) is a two-PCB counterpart to the phyCORE-i.MX7 SOM. The SOM itself serves as the CPU core of the SBC which interfaces to a carrier board via high density connectors. This carrier board breaks out major interface signals to plug-and-play or pin-level connectors and offers a JTAG connector for debugging purposes. Learn how to connect DS-MDK to the phyBOARD-i.MX7-Zeta on the reference page.

# New YouTube playlist dedicated to debugging in SOMNIUM DRT

Posted by danielohara Aug 16, 2016

Check out our new YouTube playlist dedicated to our debugging features. Learn more about our Live Expression viewing, MTB Trace and Fault Diagnosis tools: https://lnkd.in/dhX24is

# Key Changes in DS-5 Debugger v5.25

Posted by Paul Black Jul 28, 2016

DS-5 v5.25 Professional and Ultimate Editions support cache visibility for Cortex-A5 and Cortex-A7 cores. Ultimate Edition also supports cache and MMU visibility for Cortex-A73, and debug support for ARMv8.1-A and ARMv8.2-A cores.

DS-5 v5.25 Professional Edition includes a license for single-core Cortex-M3 and Cortex-R4 Fixed Virtual Platforms (FVP). Ultimate Edition now includes a license for a wide range of single-core, multi-core, and big.LITTLE FVPs. The virtual platforms are delivered as part of the DS-5 installation package.

Fixed Virtual Platforms by core family:

• Cortex-M: FVP_MPS2_Cortex-M0, FVP_MPS2_Cortex-M0plus, FVP_MPS2_Cortex-M4, FVP_MPS2_Cortex-M7, FVP_MPS2_Cortex-M3
• Cortex-R: FVP_VE_Cortex-R4, FVP_VE_Cortex-R5x1, FVP_VE_Cortex-R7x1, FVP_VE_Cortex-R8x1
• Cortex-A (FVP_VE): FVP_VE_Cortex-A9x1, FVP_VE_Cortex-A9x4, FVP_VE_Cortex-A5x1, FVP_VE_Cortex-A7x1, FVP_VE_Cortex-A15x1, FVP_VE_Cortex-A15x4-A7x4, FVP_VE_Cortex-A17x1
• Cortex-A (FVP_Base): FVP_Base_Cortex-A53x1, FVP_Base_Cortex-A57x1, FVP_Base_Cortex-A72x1, FVP_Base_Cortex-A73x1, FVP_Base_Cortex-A32x1, FVP_Base_Cortex-A35x1, FVP_Base_Cortex-A57x2-A53x4, FVP_Base_Cortex-A72x2-A53x4, Cortex-A73x2-A53x4, FVP_Base_AEMv8A
• Foundation Platform (v8): not license managed

## Revised Host Support

DS-5 v5.25 adds support for Windows 10 64-bit and Red Hat Enterprise Linux 7 Workstation 64-bit. Support for Linux 32-bit hosts has been dropped in this release.

Platforms listed in the host support table (columns: DS-5 Professional and Ultimate, DS-5 Community, ARM Compiler 5.06u3, ARM Compiler 6.5, Fast Models 10.0):

Windows 7 SP1 Professional Edition 32-bit*
Windows 7 SP1 Professional Edition 64-bit
Windows 7 SP1 Enterprise Edition 32-bit*
Windows 7 SP1 Enterprise Edition 64-bit
Windows 8.1 64-bit
Windows Server 2012 64-bit
Windows 10 64-bit
Red Hat Enterprise Linux 6 Workstation 32-bit
Red Hat Enterprise Linux 6 Workstation 64-bit**
Red Hat Enterprise Linux 7 Workstation 64-bit
Ubuntu Desktop Edition 12.04 LTS 32-bit
Ubuntu Desktop Edition 12.04 LTS 64-bit**
Ubuntu Desktop Edition 14.04 LTS 64-bit

* Not delivered in DS-5, but exists as a standalone product

** Requires additional GCC runtime libraries

## Mali Graphics Debugger

DS-5 v5.25 includes the Mali Graphics Debugger. This enables DS-5 users to trace Vulkan, OpenGL ES, EGL, and OpenCL API calls.

## Enhanced debugger functionality

DS-5 debugger functionality has been enhanced in a number of areas, each of which is described in a separate blog:

# DS-5 v5.25 “Use-Case Script” Support for the ARM Embedded Logic Analyzer

Posted by Paul Black Jul 28, 2016

## Introduction

The ARM Embedded Logic Analyzer (ELA) brings particular challenges to a debugger. The flexibility of the ELA, together with the broad range of implementation choices and potential uses, places heavy demands on a debugger. The debugger must present a high level of functionality with high potential for flexibility and customisation. However, because most of the customisation must be carried out by the user, the debugger must also present a high level of usability.

A comprehensive scripting interface is the obvious way to address the challenges presented by the ARM ELA, and enables the debugger user to customise and extend the functionality of the debugger. However, scripts bring their own challenges, which escalate rapidly as script library size and script complexity grow.

ARM DS-5 debugger now includes a comprehensive script management system aimed at helping users leverage the power of scripts and handle the challenges that scripts bring. Here we look at some of the challenges brought by the ARM ELA, and discuss some of the generic challenges brought by script complexity. We’ll then investigate how the DS-5 script management system enables users to address these challenges with an ease of use not seen in any other ARM debugger.

## The ARM Embedded Logic Analyzer

The ARM ELA enables developers to drive the highest levels of performance and efficiency from their ARM-based design. The key functionality of the ARM ELA is to monitor (and give the developer visibility of) signals deep within an ARM-based SoC. Signal information can be processed in one of two ways:

1. Information about signals can be captured to an on-chip buffer for later analysis
2. The ELA contains a comprehensive state machine. Transition between states is controlled by signal changes and comparisons, and the final state produces events that can be propagated over the CoreSight cross-trigger network
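
The second mode above can be pictured with a toy model. The sketch below is only an illustration of a state machine whose transitions are driven by signal comparisons; the names `trigger_state` and `trigger_machine` are hypothetical, and nothing here reflects the real ELA register interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One trigger state (hypothetical): advance when the masked signal
// bits equal the expected value.
struct trigger_state
{
  std::uint32_t mask;
  std::uint32_t value;
};

class trigger_machine
{
public:
  explicit trigger_machine(std::vector<trigger_state> states)
      : states_(std::move(states))
  {
    assert(!states_.empty());
  }

  // Feed one sample of the monitored signals. Returns true once the
  // final state has been passed, i.e. when an event would be raised.
  bool sample(std::uint32_t signals)
  {
    if (!fired()
        && (signals & states_[current_].mask) == states_[current_].value)
      {
        ++current_;
      }
    return fired();
  }

  bool fired() const { return current_ == states_.size(); }

private:
  std::vector<trigger_state> states_;
  std::size_t current_ = 0;
};
```

In this model, a machine built from two states only fires after a sample matching the first comparison is later followed by a sample matching the second, which mirrors the idea of tracking an event chain rather than a single condition.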

The ARM ELA is able to monitor, and provide visibility of, complex interactions and event chains taking place deep within the SoC. However, SoC designers have a wide range of implementation options for the ARM ELA. The ELA could, for example, monitor signals inside an ARM core. Or it could monitor signals in the bus interconnects: the ARM ELA is particularly useful for analysing throughput and identifying bottlenecks. Such is the flexibility of the ELA, the range of implementation options, and the range of challenges that it might be used to address, that it’s impossible to hardcode ELA support into a debugger. The device is controlled by a large number of inter-dependent registers, which need to be used in harmony with each other. The only practical way for a debugger to provide support for the ARM ELA is through a comprehensive and highly functional scripting interface. This enables SoC designers and software developers to leverage the power of the ELA for their particular needs.

## Scripting support in an ARM debugger

Good scripting support is a critical part of any modern ARM debugger, and scripting is sometimes the only way to reach the level of functionality and flexibility needed by a complex ARM-based design. A highly functional scripting API is the only practical way for a debugger to address a number of challenges in modern ARM-based designs:

• There is a trend of growing complexity and individuality in ARM-based designs. In particular there’s growth in the size and complexity of the CoreSight cross-trigger and trace systems, with new devices and a variety of trace storage options distributed across the design
• There may be a need for custom debugger functionality which is tailored to a specific debug target, or even a specific debugging challenge
• As designers strive to keep power consumption to a minimum and gain maximum advantage from the flexibility of the ARM architecture, power management strategies are becoming more aggressive and can present significant challenges to a debugger
• A key strength of ARM IP is its high suitability for a mixed implementation that also contains non-ARM IP. It can be very useful to get a measure of control and visibility of ARM and non-ARM IP in the same debugger

A comprehensive scripting API enables the user to handle both complexity and individuality in an ARM-based design. A scripting API enables the creation of custom debugger functionality to address the needs of an individual design, or the needs of a particular debug session. Because the needs of a debugger can be tightly bound to an individual SoC design or to the characteristics and causes of an individual software defect, enabling the user to create custom debugger functionality can be highly valuable.

However, as the number and complexity of scripts rise, usability challenges start to appear. Particular problems arise in script configuration: with non-trivial scripts it’s common for their functionality to depend upon command-line arguments. This approach scales poorly, because the user needs to remember which command-line options are valid for which scripts. Each option has a range of valid values and options may be inter-related. With a significant library of complex and flexible scripts, the requirements on the user can quickly grow to a point where the value of the scripts starts to degrade. These problems can be compounded when scripts are shared between team members (and other teams), meaning users have to drive value from scripts with which they are unfamiliar.

## DS-5 Use-Case Scripts

The ARM DS-5 debugger recently added a new script management system, aimed at addressing some of the problems found with large libraries of complex scripts. A key innovation is the ability to embed custom visual controls in the script itself: this is an extension of the existing functionality that has been successfully used by DS-5 DTSL (Debug and Trace Services Layer) scripts for a number of years.

Because controls can be represented graphically on custom control tabs, it’s easy to see at a glance which options are available for a particular script. Command line options which can take a range of values can be implemented as drop-down selection boxes, allowing value (and spelling) discovery at a glance. Options which take numerical or string values can be represented as text edit boxes, with bounds checking also embedded in the script. Controls can appear as hierarchies, with child controls becoming enabled only when parent controls are activated.

This screen capture shows a control tab from one of the use-case scripts shipped with DS-5 v5.25 as part of the support for the ARM ELA. Command line options for the script are represented as visual controls, removing the need for the user to carry deep familiarity with the script and to remember all details of all possible options. The controls are arranged on a number of control tabs, grouping areas of related functionality (in this case, giving a fine degree of control over movements between stages of the ELA internal state machine). Users can gain familiarity with the possibilities and functionality of the script very easily – the visualisation of command line arguments as custom controls significantly reduces the learning curve faced by script users.

On the left side of the screen capture, a number of configuration “profiles” can be seen. Sets of control values can be saved as named configuration profiles to be used later. Directories of DS-5 use-case scripts, and sets of named configuration profiles containing pre-built collections of control values to address various needs, can be shared between DS-5 users.

## Summary

Modern ARM-based designs can present a number of challenges to a debugger user, and devices such as the ARM ELA present particular challenges because of their high levels of flexibility, functionality, and implementation options. The only practical way to address these challenges is by using a comprehensive debugger scripting API, but users are likely to encounter scalability problems as the complexity and number of scripts rises.

The ARM DS-5 “Use-Case” script management system aims to resolve these problems and enable users to leverage the full power of their scripts. By visualising script command line options as custom controls complete with value, relationship, and bounds checking, DS-5 significantly reduces the learning curve and information required when using scripts. Named configuration profiles, and the ability to share script libraries and profiles between users and teams, increase this ease of use and flexibility.

For details of other changes in DS-5 v5.25, take a look at Key Changes in DS-5 Debugger v5.25

# DS-5 v5.25 Overlay Support for Cortex-R

Posted by Paul Black Jul 28, 2016

In order to keep costs, power consumption, and size to a minimum, many embedded products based on ARM Cortex-R cores have limited on-chip memory. In particular, the size of the Tightly Coupled Memory (TCM) can be restricted. Because TCM has very low latency, significant performance gains can be realised when running code in TCM. Therefore limiting TCM size can impose performance challenges: a trade-off to be considered by the SoC design team.

One way to reduce the impact of restricted TCM is to use an overlay manager. Code is organised into a number of overlays which share the same memory area. While execution stays within the resident overlay, no changes are necessary and the overlay remains in low-latency TCM. However, when a call is made to a non-resident overlay, the overlay manager needs to load the correct overlay into the TCM. This load needs to be performed as efficiently as possible, and the debugger needs to be overlay-aware and present the correct information to the user. For example, the debugger needs to step over overlay veneers, effectively making the overlay manager invisible to the debugger user.

In DS-5 v5.25 we’ve added overlay support to both the ARM Compiler and the DS-5 debugger. When overlays are enabled, the compiler leaves overlay information in the symbol file. The debugger reads this information when the symbol file is loaded, and can enable overlay support automatically. As well as handling the debug implications of overlays and automatically stepping over the overlay veneers, DS-5 debugger presents overlay information through additional debugger commands or through the new Overlays view, for example:

Here we can see the address and size of a number of overlays, and we can see instantly which overlays are currently loaded. Information has been expanded for one of the overlays, so that we can see at a glance the functions contained in that overlay. The matching overlay support in the ARM Compiler and DS-5 debugger makes it easy to manage overlays and drive significant performance enhancements from efficient use of fast on-chip memory.
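
To make the overlay mechanism described above more concrete, here is a minimal host-side sketch of what an overlay manager does when a veneer is entered. It is not the ARM Compiler overlay manager: the `overlay` and `overlay_manager` types, the 64-byte region, and the `ensure_resident()` entry point are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Size of the shared execution region (stands in for a TCM window).
constexpr std::size_t region_size = 64;

// Hypothetical descriptor: the load image of one overlay.
struct overlay
{
  std::vector<unsigned char> image;
};

class overlay_manager
{
public:
  explicit overlay_manager(std::vector<overlay> overlays)
      : overlays_(std::move(overlays))
  {
  }

  // What a veneer would call before jumping into overlay `id`.
  // Returns true if the overlay had to be loaded.
  bool ensure_resident(int id)
  {
    if (id == resident_)
      return false; // already in the shared region, nothing to do

    const overlay& ov = overlays_.at(static_cast<std::size_t>(id));
    assert(ov.image.size() <= region_size);
    std::copy(ov.image.begin(), ov.image.end(), region_.begin());
    resident_ = id;
    ++loads_;
    return true;
  }

  int resident() const { return resident_; } // -1 when nothing is loaded
  std::size_t load_count() const { return loads_; }
  const std::array<unsigned char, region_size>& region() const
  {
    return region_;
  }

private:
  std::vector<overlay> overlays_;
  std::array<unsigned char, region_size> region_{};
  int resident_ = -1;
  std::size_t loads_ = 0;
};
```

The load count is the cost that the prose asks to keep low: calls that stay inside the resident overlay never touch the shared region, while each switch to another overlay pays for a copy.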

For details of other changes in DS-5 v5.25, take a look at Key Changes in DS-5 Debugger v5.25

# DS-5 v5.25 Stack View Enhancements

Posted by Paul Black Jul 28, 2016

In DS-5 v5.24, we added the Stack View to the debugger. This view displays stack information that used to be displayed in the Debug Control View, giving two advantages. Firstly, the Debug Control View becomes less cluttered and more focussed: giving better clarity of information and an increase in debugger usability. Secondly, stack information can take a non-trivial amount of time to retrieve from the target: because the Stack View can be closed when not needed (and it’s possible to limit the stack depth displayed), stepping speed in the DS-5 debugger can be increased. This performance increase is particularly noticeable for debug targets with slow JTAG clocks or an extensive stack back-trace.

In DS-5 v5.25 we have enhanced the Stack View to display function parameters and local variables, for example:

Variables marked with a ‘P’ are function parameters, and can also be seen in the extended function prototype information. All the other variables shown are the function’s local variables. Arrays and structures can be expanded to display member variables. By right-clicking, any variable can be displayed in the Variables or Memory views.

Retrieving variable information from the target can cause a degradation in debugger stepping performance, particularly for targets with very slow JTAG clocks or a large number of function parameters and local variables. To increase debugger stepping performance when parameter and variable information is not needed, display of parameters and variables can be disabled via the Stack View menu.

For details of other changes in DS-5 v5.25, take a look at Key Changes in DS-5 Debugger v5.25

# DS-5 v5.25 Extended Support for CoreSight Cross-Trigger Network

Posted by Paul Black Jul 28, 2016

The CoreSight cross-trigger network in a SoC is created from two components: Cross Trigger Matrix (CTM) devices, which form the backbone of the network and transport events around the SoC; and Cross Trigger Interface (CTI) devices, which capture events from, or deliver events to, other components distributed around the SoC. Although the CoreSight cross-trigger network has a variety of potential uses, by far the most common use encountered by DS-5 users is to synchronise cores.

This cross-trigger use-case enables related cores to enter and exit debug state together. For example, when one core hits a breakpoint and enters debug state, this change in state is picked up by the CTI coupled to that core. The CTI passes the halt event into the cross-trigger matrix, where the CTMs route the event to the CTIs coupled to the other cores. These CTIs issue halt requests to their cores, so that all related cores halt with minimal latency. The debugger doesn’t need to get involved with halting the cores: it just sets up the cross-trigger network so that events are routed correctly. This enables the very low latency which is critical on many SoCs to avoid undesirable effects such as kernel panics.
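
The routing described above can be modelled in a few lines. This is a simplified sketch, not DTSL or CoreSight register programming: a single halt channel is assumed, and the `core`, `cti` and `cross_trigger_matrix` names and the `sends_halt` / `receives_halt` flags are hypothetical stand-ins for the configuration a debugger would set up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct core
{
  bool halted = false;
};

// Hypothetical CTI model: one core plus its routing configuration.
struct cti
{
  core* attached;
  bool sends_halt;    // forwards its core's halt onto the channel
  bool receives_halt; // halts its core when the channel fires
};

class cross_trigger_matrix
{
public:
  // "Debugger set-up" step: attach a core and configure its CTI.
  // Returns the index of the new CTI.
  std::size_t attach(core& c, bool sends_halt, bool receives_halt)
  {
    ctis_.push_back(cti{&c, sends_halt, receives_halt});
    return ctis_.size() - 1;
  }

  // Called when the core behind CTI `source` enters debug state.
  void halt_event(std::size_t source)
  {
    assert(source < ctis_.size());
    ctis_[source].attached->halted = true;
    if (!ctis_[source].sends_halt)
      return; // event is not passed into the matrix

    for (std::size_t i = 0; i < ctis_.size(); ++i)
      {
        if (i != source && ctis_[i].receives_halt)
          ctis_[i].attached->halted = true; // halt request to the core
      }
  }

private:
  std::vector<cti> ctis_;
};
```

Once the routing flags are set, `halt_event()` needs no further decisions from the caller, which is the point made above: the debugger configures the network and the hardware does the halting.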

The ARM DS-5 debugger already supports cross-trigger network configuration and management, and the DS-5 Platform Configuration Editor (PCE) creates the necessary scripting when bringing up a new target platform in the debugger. However PCE currently only supports the most common core topologies: SMP and big.LITTLE. PCE cannot currently create the necessary DTSL scripts for other topologies: for example cross-triggering Cortex-A and Cortex-R/M cores in the same SoC, or halting cores when the on-chip trace buffer fills. These use-cases need custom scripting. DS-5 users can create custom DTSL scripts to drive the DS-5 debugger functionality they need, but there’s a learning curve to be considered. Complex cross-trigger requirements could mean more complexity in DTSL scripts than the average user might be prepared to take on. So in DS-5 v5.25 we’ve revised and enhanced the DTSL functionality around cross-triggering, and added a new DTSL class to make scripting easier and less complex.

These DTSL enhancements make it easier and quicker to create custom cross-triggering support in DS-5, backed by groups of custom DTSL controls. This example shows a typical use-case, synchronising the watchdog timer with the Cortex-A57 cores in the ARM Juno reference platform:

In future DS-5 releases we’ll extend the DS-5 PCE to take advantage of this additional DTSL functionality. We’ll also review all the platform configurations that ship with DS-5, to see where DS-5 can leverage these changes to deliver additional value.

# DS-MDK Supports Heterogeneous Systems

Posted by Christopher Seidl Jul 28, 2016

DS-MDK is out now! It combines the Eclipse-based DS-5 IDE and Debugger with CMSIS-Pack technology and uses Software Packs to extend device support for devices based on 32-bit ARM Cortex-A processors or heterogeneous systems based on 32-bit ARM Cortex-A and ARM Cortex-M processors. It is part of the MDK-Professional edition and initially provides support for the NXP i.MX6 and i.MX7 series devices.

### Heterogeneous systems combine computing power with fast, efficient I/O performance

A heterogeneous computing system based on Cortex-A and Cortex-M processors combines best-in-class technology for application software and deterministic real-time I/O. Cortex-A application processors run a feature-rich operating system, such as Linux, and provide enough computing power for demanding applications. The energy-efficient Cortex-M processor typically executes a real-time operating system that is easy to use and tailored to meet real-time requirements for deterministic I/O operations. Such Cortex-M systems enable fast start-up times and can be permanently 'on' in battery-powered systems. The two processor systems typically exchange information via fast, interrupt-driven inter-process communication and shared memory.

DS-MDK offers a complete software development solution for such systems:

• It allows managing Cortex-A Linux and Cortex-M RTOS projects in the same development environment.
• It fully supports the Cortex Microcontroller Software Interface Standard (CMSIS) development flow for efficient Cortex-M programming. Software Packs may be added any time to DS-MDK making new device support and middleware updates independent from the toolchain. They contain device support, CMSIS libraries, middleware, board support, code templates, and example projects. The IDE manages the provided software components that are available for the application as building blocks.
• The DS-5 Debugger offers full visibility for multicore software development.

# CMSIS++ RTOS: fully functional reference implementation

Posted by Liviu Ionescu Jun 24, 2016

## Overview

CMSIS++, or rather POSIX++, is a POSIX-like, portable, vendor-independent, hardware abstraction layer intended for C++/C embedded applications, designed with special consideration for the industry standard ARM Cortex-M processor series. Originally intended as a proposal for the next generation CMSIS, CMSIS++ can probably be more accurately defined as "C++ CMSIS", and POSIX++ as "C++ POSIX".

## CMSIS++ RTOS: APIs vs reference implementations

The CMSIS++ cornerstone is the RTOS, and in this respect CMSIS++ RTOS can be analysed from two perspectives: the CMSIS++ RTOS APIs, with a modern design, and the CMSIS++ RTOS reference implementation, with clean and efficient code.

In the first phase of the project, the CMSIS++ RTOS APIs were designed, with POSIX threads in mind, but from a C++ point of view.

The native CMSIS++ RTOS interface is the C++ API, with a C API implemented as a wrapper, and an ISO C++ Threads API also implemented on top of the native C++ API.

## The CMSIS++ RTOS C++ API as a wrapper on top of an existing RTOS

Initially, the C++ API was validated by implementing it as a wrapper on top of the popular open source project FreeRTOS. Full functionality was achieved, and the entire system passed the ARM CMSIS RTOS validation suite.

## The CMSIS++ RTOS reference synchronisation objects (semaphores, queues, etc)

With the native C++ API validated, while still using the safety net provided by an existing scheduler, the next step toward a grand design was to implement, in a portable way, the synchronisation objects defined by the CMSIS++ RTOS.

The result was a highly portable implementation that requires only a very simple interaction with the scheduler, basically a thread suspend() and resume().

Using this model, all RTOS objects were implemented (semaphores, mutexes, condition variables, message queues, memory pools, event flags, clocks and timers); full functionality was achieved, and again the entire system passed the ARM CMSIS RTOS validation suite.

Note that in this configuration, when running on top of an existing RTOS, the implementation can be selected at individual object level; in other words, it is perfectly possible to run with some objects implemented by the host RTOS and some objects using the reference portable implementation. This is generally useful when some of the objects defined by CMSIS++ are not available in the host RTOS; for example, the current version of FreeRTOS has no memory pools or condition variables, so these objects are supplied by the reference implementation.

## The CMSIS++ RTOS reference scheduler

The last piece to complete the puzzle was the scheduler. The CMSIS++ RTOS specifications do not mandate a specific scheduling policy, and, when running on top of an existing RTOS, any scheduling policy can be used.

However, the CMSIS++ RTOS reference scheduler takes the beaten path and implements a priority based, round robin, cooperative and optionally preemptive scheduler.

In other words, threads are assigned priorities, higher priority threads are scheduled first, equal priority threads are scheduled in a round robin way, and scheduling points are entered either explicitly at any wait() or yield(), or are optionally triggered by periodic interrupts, like the system clock ticks, or by user interrupts.
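
The selection policy described above (highest priority first, round robin among equals) can be sketched as follows. This is not the CMSIS++ scheduler code: the `ready_list` class, the integer thread ids and the "larger number means higher priority" convention are assumptions made for the example.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>

using thread_id = int;

class ready_list
{
public:
  // Link a thread at the tail of the queue for its priority.
  void make_ready(thread_id id, int priority)
  {
    queues_[priority].push_back(id);
  }

  // Remove and return the next thread to run, or -1 if none is ready.
  thread_id pick_next()
  {
    if (queues_.empty())
      return -1;

    auto it = queues_.begin(); // std::greater: highest priority first
    assert(!it->second.empty());
    thread_id id = it->second.front(); // oldest entry: round robin
    it->second.pop_front();
    if (it->second.empty())
      queues_.erase(it);
    return id;
  }

private:
  std::map<int, std::deque<thread_id>, std::greater<int>> queues_;
};
```

A yield or a preemption tick then amounts to calling `make_ready()` for the running thread followed by `pick_next()`; because the thread re-enters at the tail of its priority queue, equal-priority threads take turns.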

### The scheduler portable code

The scheduler was designed to be as portable as possible, and to run on any reasonable architecture, with any word size.

As such, the scheduler's main responsibility is to manage the list of threads ready for execution and to switch their execution contexts in an orderly manner.

Although not mandatory for its functionality, the scheduler also keeps track of all registered threads, and provides iterators to walk these lists.

For better modularity, the scheduler itself does not keep track of threads waiting for various events; this is delegated to the various synchronisation objects, which are expected to implement their own policy of suspending and resuming execution of threads waiting for common resources.

However, the reference synchronisation objects use similar lists to keep track of the waiting threads, and, to simplify the implementation, the scheduler provides base classes for these lists.
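
As an illustration of this division of labour, the sketch below shows a semaphore that owns its list of waiting threads and talks to the scheduler only through suspend and resume calls. The `scheduler_hooks` structure, the `sketch_semaphore` class and the FIFO wake-up policy are assumptions for the example; the actual CMSIS++ classes and list types are not shown in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

using thread_id = int;

// Hypothetical minimal scheduler interface: suspend and resume only.
struct scheduler_hooks
{
  std::function<void(thread_id)> suspend;
  std::function<void(thread_id)> resume;
};

class sketch_semaphore
{
public:
  sketch_semaphore(scheduler_hooks hooks, int initial)
      : hooks_(std::move(hooks)), count_(initial)
  {
    assert(initial >= 0);
  }

  // Returns true if a token was taken immediately; otherwise the
  // caller is queued on the semaphore's own list and suspended.
  bool wait(thread_id caller)
  {
    if (count_ > 0)
      {
        --count_;
        return true;
      }
    waiting_.push_back(caller);
    hooks_.suspend(caller);
    return false;
  }

  // Hand the token to the oldest waiter, or bank it if nobody waits.
  void post()
  {
    if (!waiting_.empty())
      {
        thread_id next = waiting_.front();
        waiting_.pop_front();
        hooks_.resume(next);
      }
    else
      {
        ++count_;
      }
  }

  std::size_t waiting() const { return waiting_.size(); }

private:
  scheduler_hooks hooks_;
  int count_;
  std::deque<thread_id> waiting_;
};
```

The scheduler never learns why a thread is suspended; a different object could keep its waiters ordered by priority instead of FIFO without any change on the scheduler side.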

### The scheduler port-specific code

Regardless of how carefully a portable scheduler is designed and implemented, there will always be a last mile where the platform differences become important.

To accommodate these differences, the scheduler needs to be ported to a specific platform. The port includes the platform-specific definitions, mainly the way of creating and switching thread contexts, but also handling interrupts, accessing timers and clocks, etc.

There are currently two such CMSIS++ RTOS scheduler ports available and fully functional:

• a 32-bit ARM Thumb port, running on Cortex-M devices;
• a 64-bit synthetic POSIX port, running as a user process on macOS and GNU/Linux.

These ports are actually not part of the CMSIS++ package itself, which is highly portable, but are part of separate µOS++ packages.

## The Cortex-M scheduler port

This 32-bit ARM Thumb port is specifically designed to run on Cortex-M devices. It currently supports ARMv6-M and ARMv7-M architectures, with or without FPU. Support for ARMv8-M will be added when needed.

The implementation uses ARM-specific features, like PendSV, which greatly simplify things.

For example, the context switching is performed by a rather simple function:

    void
    __attribute__ ((section(".after_vectors"), naked, used, optimize("s")))
    PendSV_Handler (void)
    {
      // The naked attribute and the push/pop are used to fully control
      // the function entry/exit code; be sure other registers are not
      // used in the assembly parts.
      asm volatile ("push {lr}");

      // The whole mystery of context switching, in one sentence. :-)
      port::scheduler::restore_from_stack (
          port::scheduler::switch_stacks (
              port::scheduler::save_on_stack ()));

      asm volatile ("pop {pc}");
    }


Apart from saving/returning, this function does exactly what it is expected to do:

• save_on_stack() - saves the context of the current thread on the thread stack and returns the stack address;
• switch_stacks() - saves the above stack address in the current thread control block, selects the next thread waiting to run and returns the address of its stack context;
• restore_from_stack() - restores the context of the new thread from the stack.

The two save/restore functions are among the very few in the Cortex-M port that require assembly code:

    inline stack::element_t*
    __attribute__((always_inline))
    save_on_stack (void)
    {
      register stack::element_t* sp_;

      asm volatile
      (
          " mrs %[r], PSP                       \n"
          " isb                                 \n"

    #if defined (__VFP_FP__) && !defined (__SOFTFP__)

          // Is the thread using the FPU context?
          " tst lr, #0x10                       \n"
          " it eq                               \n"
          // If so, push high vfp registers.
          " vstmdbeq %[r]!, {s16-s31}           \n"
          // Save the core registers r4-r11,r14.
          // R14 holds the EXC_RETURN value; it is saved so that the
          // FPU condition can be tested again in the restore sequence.
          " stmdb %[r]!, {r4-r9,sl,fp,lr}       \n"

    #else

          // Save the core registers r4-r11.
          " stmdb %[r]!, {r4-r9,sl,fp}          \n"

    #endif
          : [r] "=r" (sp_) /* out */
          : /* in */
          : /* clobber. DO NOT add anything here! */
      );

      return sp_;
    }

    inline void
    __attribute__((always_inline))
    restore_from_stack (stack::element_t* sp)
    {
      // Without enforcing optimisations, an intermediate variable
      // would be needed to avoid using R4, which collides with
      // the R4 in the list of ldmia.

      // register stack::element_t* sp_ asm ("r0") = sp;

      asm volatile
      (

    #if defined (__VFP_FP__) && !defined (__SOFTFP__)

          // Pop the core registers r4-r11,r14.
          // R14 contains the EXC_RETURN value
          // and is restored for the next test.
          " ldmia %[r]!, {r4-r9,sl,fp,lr}       \n"
          // Is the thread using the FPU context?
          " tst lr, #0x10                       \n"
          " it eq                               \n"
          // If so, pop the high vfp registers too.
          " vldmiaeq %[r]!, {s16-s31}           \n"

    #else

          // Pop the core registers r4-r11.
          " ldmia %[r]!, {r4-r9,sl,fp}          \n"

    #endif

          // Restore the thread stack register.
          " msr PSP, %[r]                       \n"
          " isb                                 \n"

          : /* out */
          : [r] "r" (sp) /* in */
          : /* clobber. DO NOT add anything here! */
      );
    }


The generated code (for Cortex-M3) is remarkably neat and tidy:

08000198 <PendSV_Handler>:
8000198: b500       push {lr}
800019a: f3ef 8009 mrs r0, PSP
800019e: f3bf 8f6f isb sy
80001a2: e920 0ff0 stmdb r0!, {r4, r5, r6, r7, r8, r9, sl, fp}
80001a6: f000 fe07 bl 8000db8 <os::rtos::port::scheduler::switch_stacks(unsigned long*)>
80001aa: e8b0 0ff0 ldmia.w r0!, {r4, r5, r6, r7, r8, r9, sl, fp}
80001ae: f380 8809 msr PSP, r0
80001b2: f3bf 8f6f isb sy
80001b6: bd00       pop {pc}


## Static vs dynamic memory allocation

One of the initial CMSIS++ RTOS design requirements was to give the user full control over the memory allocation.

The implementation fulfilled this requirement, allowing any possible memory allocation scheme, from the simplicity of using fully static allocation to the extreme of using separate custom allocators for each object requiring dynamic memory.

The objects requiring dynamic memory are:

• threads, for their stacks
• message queues, for the queues (arrays of messages)
• memory pools, for the pools (arrays of blocks)

All these objects have a last allocator parameter in their constructors that defaults to the system allocator memory::allocator<T>.

For example one of the thread constructors is:

    using Allocator = memory::allocator<stack::allocation_element_t>;

    thread (const char* name, func_t function, func_args_t args,
            const attributes& attr = initializer,
            const Allocator& allocator = Allocator ());


By default the memory::allocator<T> is defined as:

template<typename T>
using allocator = new_delete_allocator<T>;


but the user can define it as any standard C++ allocator, and so the behaviour of all objects requiring dynamic memory can be customised at once.
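
For readers who have not written one, the sketch below shows roughly what "any standard C++ allocator" means in practice: a minimal allocator that forwards to `operator new` and counts outstanding bytes. The names `counting_allocator` and `allocated_bytes()` are hypothetical, and how such a type is substituted for `memory::allocator<T>` in a CMSIS++ build is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Bytes currently allocated through counting_allocator
// (illustrative global counter, not thread-safe).
inline std::size_t& allocated_bytes()
{
  static std::size_t bytes = 0;
  return bytes;
}

template<typename T>
struct counting_allocator
{
  using value_type = T;

  counting_allocator() = default;

  // Converting constructor, needed so containers can rebind.
  template<typename U>
  counting_allocator(const counting_allocator<U>&) noexcept
  {
  }

  T* allocate(std::size_t n)
  {
    allocated_bytes() += n * sizeof(T);
    return static_cast<T*>(::operator new(n * sizeof(T)));
  }

  void deallocate(T* p, std::size_t n) noexcept
  {
    assert(allocated_bytes() >= n * sizeof(T));
    allocated_bytes() -= n * sizeof(T);
    ::operator delete(p);
  }
};

// Stateless allocators always compare equal.
template<typename T, typename U>
bool operator==(const counting_allocator<T>&,
                const counting_allocator<U>&) noexcept
{
  return true;
}

template<typename T, typename U>
bool operator!=(const counting_allocator<T>&,
                const counting_allocator<U>&) noexcept
{
  return false;
}
```

The same type works unchanged with standard containers, for example `std::vector<int, counting_allocator<int>>`, which is a convenient way to check an allocator on the host before using it on a target.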

Moreover, each such object has a separate template version that takes a final allocator parameter, so in the limit each object can be allocated using a separate allocator.

Given the magic of C++, using such allocators is straightforward:

template<typename T>
class my_allocator;

message_queue_allocated<my_allocator> queue1 { "q1", 7, sizeof(msg_t) };
message_queue_typed<msg_t, my_allocator> queue2 { "q2", 7 };

memory_pool_allocated<my_allocator> pool1 { "p1", 7, sizeof(blk_t) };
memory_pool_typed<blk_t, my_allocator> pool2 { "p2", 7 };
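
The `my_allocator` template above is only declared. As a minimal sketch of what a user-supplied standard C++ allocator might look like, the following bump allocator hands out storage from a static arena. The arena, its size and the `arena_bytes_for` helper are hypothetical and not part of CMSIS++, and `std::vector` merely stands in for an RTOS object so that the sketch compiles on its own:

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

namespace
{
  // Hypothetical arena; the size is an arbitrary choice for this sketch.
  alignas (std::max_align_t) unsigned char arena[1024];
  std::size_t arena_used = 0;
}

// Minimal C++11 allocator: bump-allocates from the static arena and
// never reuses freed storage (deallocate is a no-op).
template<typename T>
struct my_allocator
{
  using value_type = T;

  my_allocator () = default;

  template<typename U>
  my_allocator (const my_allocator<U>&) noexcept
  {
  }

  T*
  allocate (std::size_t n)
  {
    // Round the current offset up to the alignment of T.
    std::size_t offset = (arena_used + alignof (T) - 1) / alignof (T)
        * alignof (T);
    assert (offset % alignof (T) == 0);
    if (n > (sizeof (arena) - offset) / sizeof (T))
      throw std::bad_alloc ();
    arena_used = offset + n * sizeof (T);
    return reinterpret_cast<T*> (arena + offset);
  }

  void
  deallocate (T*, std::size_t) noexcept
  {
  }
};

// All instances share the same arena, so they always compare equal.
template<typename T, typename U>
bool
operator== (const my_allocator<T>&, const my_allocator<U>&) noexcept
{
  return true;
}

template<typename T, typename U>
bool
operator!= (const my_allocator<T>&, const my_allocator<U>&) noexcept
{
  return false;
}

// Stand-in for an RTOS object: returns how many arena bytes a vector of
// `count` ints consumed.
std::size_t
arena_bytes_for (std::size_t count)
{
  std::size_t before = arena_used;
  std::vector<int, my_allocator<int>> v (count);
  return arena_used - before;
}
```

A bump allocator is only suitable when objects live for the whole lifetime of the application, which is a common pattern in embedded systems; anything that frees and reallocates would need a real free list.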


Static allocation is handled using exactly the same method, but different templates:

thread_static<2500> thread { "th", func, nullptr };

message_queue_static<7, msg_t> queue { "q" };

memory_pool_static<7, blk_t> pool { "p" };


## Tests

Writing RTOS unit tests was always tricky and the results debatable. This does not mean it should not be attempted; actually, if done properly, these tests can be very useful.

To improve testability, the synthetic POSIX platform was implemented. It allows most RTOS tests to run within a very convenient environment like macOS or GNU/Linux.

Another greatly helpful tool used to run the RTOS tests is the GNU ARM Eclipse QEMU, which emulates the STM32F4DISCOVERY board well enough for most tests to be relevant.

In practice, most of the time the tests were run either on macOS or on QEMU; only rarely, usually as a final validation, were they also run on physical hardware.

### The ARM CMSIS RTOS validation suite

The main test was the ARM CMSIS RTOS validation suite, which quite thoroughly exercises the interface published in the cmsis_os.h file.

This test is automatically performed by the test scripts on the STM32F4DISCOVERY board running under GNU ARM Eclipse QEMU and on the synthetic POSIX platform.

The result of a run is:

CMSIS-RTOS Test Suite   Jun 23 2016   16:03:42

TEST 12: TC_GenWaitBasic                  PASSED
TEST 13: TC_GenWaitInterrupts             PASSED
TEST 14: TC_TimerOneShot                  PASSED
TEST 15: TC_TimerPeriodic                 PASSED
TEST 16: TC_TimerParam                    PASSED
TEST 17: TC_TimerInterrupts               PASSED
TEST 20: TC_SignalChildToParent           PASSED
TEST 21: TC_SignalChildToChild            PASSED
TEST 22: TC_SignalWaitTimeout             PASSED
TEST 23: TC_SignalParam                   PASSED
TEST 24: TC_SignalInterrupts              PASSED
TEST 25: TC_SemaphoreCreateAndDelete      PASSED
TEST 26: TC_SemaphoreObtainCounting       PASSED
TEST 27: TC_SemaphoreObtainBinary         PASSED
TEST 28: TC_SemaphoreWaitForBinary        PASSED
TEST 29: TC_SemaphoreWaitForCounting      PASSED
TEST 30: TC_SemaphoreZeroCount            PASSED
TEST 31: TC_SemaphoreWaitTimeout          PASSED
TEST 32: TC_SemParam                      PASSED
TEST 33: TC_SemInterrupts                 PASSED
TEST 34: TC_MutexBasic                    PASSED
TEST 35: TC_MutexTimeout                  PASSED
TEST 36: TC_MutexNestedAcquire            PASSED
TEST 37: TC_MutexPriorityInversion        PASSED
TEST 38: TC_MutexOwnership                PASSED
TEST 39: TC_MutexParam                    PASSED
TEST 40: TC_MutexInterrupts               PASSED
TEST 41: TC_MemPoolAllocAndFree           PASSED
TEST 42: TC_MemPoolAllocAndFreeComb       PASSED
TEST 43: TC_MemPoolZeroInit               PASSED
TEST 44: TC_MemPoolParam                  PASSED
TEST 45: TC_MemPoolInterrupts             PASSED
TEST 46: TC_MsgQBasic                     PASSED
TEST 47: TC_MsgQWait                      PASSED
TEST 48: TC_MsgQParam                     PASSED
TEST 49: TC_MsgQInterrupts                PASSED
TEST 52: TC_MailAlloc                     PASSED
TEST 53: TC_MailCAlloc                    PASSED
TEST 56: TC_MailTimeout                   PASSED
TEST 57: TC_MailParam                     PASSED
TEST 58: TC_MailInterrupts                PASSED

Test Summary: 60 Tests, 60 Executed, 60 Passed, 0 Failed, 0 Warnings.
Test Result: PASSED


### The mutex stress test

This test exercises the scheduler and the thread synchronisation primitives. It creates 10 threads that compete for a mutex, simulate random activities and compute statistics on how many times each thread acquired the mutex, to validate the fairness of the scheduler.

The test is automatically performed by the scripts on the STM32F4DISCOVERY board running under GNU ARM Eclipse QEMU and on the synthetic POSIX platform.

A typical result of the run is:

Mutex stress & uniformity test.
Built with GCC 5.3.1 20160307 (release) [ARM/embedded-5-branch revision 234589].
Seed 3761791254
[  5s] t0:39   t1:42   t2:37   t3:41   t4:38   t5:37   t6:36   t7:41   t8:40   t9:34   sum=385, avg=39, delta in [-5,3] [-12%,8%]
[ 10s] t0:74   t1:82   t2:79   t3:84   t4:79   t5:84   t6:77   t7:76   t8:80   t9:75   sum=790, avg=79, delta in [-5,5] [-5%,6%]
[ 15s] t0:114  t1:120  t2:116  t3:128  t4:117  t5:122  t6:114  t7:116  t8:116  t9:115  sum=1178, avg=118, delta in [-4,10] [-2%,8%]
[ 20s] t0:155  t1:161  t2:152  t3:163  t4:153  t5:160  t6:154  t7:159  t8:154  t9:154  sum=1565, avg=157, delta in [-5,6] [-2%,4%]
[ 25s] t0:196  t1:199  t2:194  t3:206  t4:193  t5:198  t6:194  t7:200  t8:197  t9:194  sum=1971, avg=197, delta in [-4,9] [-1%,5%]
[ 30s] t0:233  t1:236  t2:241  t3:245  t4:231  t5:236  t6:233  t7:237  t8:234  t9:237  sum=2363, avg=236, delta in [-5,9] [-1%,4%]
[ 35s] t0:270  t1:281  t2:277  t3:284  t4:266  t5:273  t6:279  t7:278  t8:273  t9:277  sum=2758, avg=276, delta in [-10,8] [-3%,3%]
Done.
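
The summary columns of each line can be reproduced from the per-thread counts. The sketch below assumes that the average is rounded to the nearest integer and that the deltas are taken relative to that rounded average; this matches the `sum`, `avg` and absolute `delta` figures in the listing. The percentage columns are not reproduced, since their exact formula is not shown. The names `fairness_stats` and `compute_fairness` are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Summary of one report line of the mutex stress test.
struct fairness_stats
{
  unsigned sum;    // total number of acquisitions
  unsigned avg;    // average per thread, rounded to nearest
  int delta_min;   // smallest (count - avg)
  int delta_max;   // largest (count - avg)
};

fairness_stats
compute_fairness (const unsigned* counts, std::size_t n)
{
  assert (counts != nullptr && n > 0);

  fairness_stats s = { 0, 0, 0, 0 };
  for (std::size_t i = 0; i < n; ++i)
    s.sum += counts[i];

  // Round to nearest integer.
  s.avg = static_cast<unsigned> ((s.sum + n / 2) / n);

  s.delta_min = s.delta_max = static_cast<int> (counts[0])
      - static_cast<int> (s.avg);
  for (std::size_t i = 1; i < n; ++i)
    {
      int delta = static_cast<int> (counts[i]) - static_cast<int> (s.avg);
      if (delta < s.delta_min)
        s.delta_min = delta;
      if (delta > s.delta_max)
        s.delta_max = delta;
    }
  return s;
}
```

A narrow delta range that does not widen as the counts grow is what indicates a fair scheduler here.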


### The semaphore stress test

This test exercises the synchronisation primitives available from interrupt service routines and the effectiveness of the critical sections. It creates a high-frequency hardware timer which posts to a semaphore, and a thread counts whether the posts arrived in time or were late, in other words whether the scheduler was able to wake up the thread fast enough.

The test runs on the physical STM32F4DISCOVERY board.

A typical result of the run shows that on this platform the scheduler can sustain about 250,000 context switches per second:

Semaphore stress test.
Built with GCC 5.3.1 20160307 (release) [ARM/embedded-5-branch revision 234589].

Iteration 0
Seed 832262406
42000 cy    1 kHz
21000 cy    2 kHz
10500 cy    4 kHz
5250 cy    8 kHz
2625 cy   16 kHz
1312 cy   32 kHz
656 cy   64 kHz
328 cy  128 kHz
164 cy  256 kHz    1 late
82 cy  512 kHz  777 late
41 cy 1024 kHz  998 late
20 cy 2100 kHz  999 late
10 cy 4200 kHz  999 late
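
The two numeric columns are related by the timer clock. The listing is consistent with a 42 MHz timer clock (42000 cycles per period at 1 kHz); that value is inferred from the numbers above and is not stated by the test itself. Under that assumption, a hypothetical helper converting a period in cycles to the frequency shown could look like this:

```cpp
#include <cassert>
#include <cstdint>

// Assumption: inferred from "42000 cy    1 kHz" in the listing above.
constexpr std::uint32_t timer_clock_hz = 42000000u;

// Frequency in kHz (truncated) of a timer that fires every `cycles`
// clock cycles.
std::uint32_t
period_to_khz (std::uint32_t cycles)
{
  assert (cycles > 0);
  return timer_clock_hz / cycles / 1000u;
}
```

With these numbers, the first late post appears at a period of 164 cycles, which is where the figure of roughly 250,000 context switches per second comes from.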


## Conclusions

CMSIS++ is still a young project, and many things need to be addressed, but the core component, the RTOS, is pretty well defined and functional.

For now it may not be perfect (as it tries to be), but it definitely provides a more standard set of primitives, closer to POSIX, and a wider set of APIs than many other existing RTOSes, covering both C++ and C applications; at the same time it does its best to preserve compatibility with the original ARM CMSIS APIs.

Any contributions to improve CMSIS++ will be highly appreciated.

CMSIS++ is an open source project, maintained by Liviu Ionescu. The project is released under the terms of the MIT license.

The main source of information for CMSIS++ is the project web site.

The Git repositories and all public releases are available from GitHub; specifically the stress tests are available from the tests folder.

The code for ARM CMSIS RTOS validator is available from GitHub.

The code for the Cortex-M scheduler port is available from GitHub.

The code for the synthetic POSIX scheduler port is available from GitHub.

For questions and discussions, please use the CMSIS++ section of the GNU ARM Eclipse forum.

For bugs and feature requests, please use the GitHub issues.

# Configuring DS-5 for the System Trace Macrocell

Posted by mwsealey Jun 7, 2016

In previous blogs we covered an introduction to System Trace Macrocell (STM) concepts and terminology, and the STM Programmers' model with an example of how to generate efficient trace data. Once the STM is generating a trace stream, we may wish to view it within our Debugger.

DS-5 implements an "Events View" which serves this purpose.

First, make sure that the platform configuration for your target is set up (via DTSL options) to collect trace from the STM; otherwise the view will not be configurable. In the Debug Configurations user interface, the DTSL Options "Edit..." button is underneath the target selection list.

Each platform may look slightly different. First, select a valid trace sink via the "Trace Buffer" tab - most platforms default to "None" and may have many options such as "DSTREAM" or "ETB."

There is usually a dialog tab marked "STM" or a checkbox which enables trace from a particular STM, per the following screenshot:

## Configure the Events view

Once connected we can configure our Events view. By default, it looks fairly empty. This view must be configured for each Master and Channel combination we want to see in the view. We see an informational item on what the view will decode (which Masters and Channels) and the source (in this case, DSTREAM: STM).

The view is organized in pages, and the VCR-like controls will walk us back and forth within the decoded trace:

To configure the view, find the Settings menu (next to the view minimize/maximize buttons) and select the "Events Settings..." item.

We will then be presented with a dialog. First, select the trace source to be shown in the view. In this example, we collect trace on the DSTREAM unit (via the TPIU) and want to see the trace output from the device "STM." This makes up the "DSTREAM: STM" trace configuration.

For each Master, a Channel can be defined, and the expected decode of that channel changed from "Text" to "Binary." Here we enable Master 64 with Channel 0 as Text and Channels 1-65535 as Binary. The example code provided only uses Channel 0 and Channel 1, but we can have a different setting for each master and each channel.

The mapping of Master number to a source device is implementation-specific. For the Juno ARM Development Platform, it is listed in the SoC Technical Reference Manual (specifically for r0, r1, and r2).

Note the Import and Export buttons, which can be used to load in a pre-configured set of configurations, or save them out for later re-use, as different system environments and applications will have different settings.

## Viewing Trace Output

Once we've collected trace, we will see the STM output in the Events view. Notice that the Master and Channel are reported and that the Timestamp increments.

We see, from our example code, our "Cambridge" string (the first character 'C' is Marked) and our Prime number and count following:

# Programming ARM's System Trace Macrocell

Posted by mwsealey Jun 7, 2016

In this blog, the second in a series, we explore the programmers' model for the ARM System Trace Macrocell. A previous blog covered basic concepts of the STM architecture and implementation. Example code is provided, which is minimally targeted at the Juno ARM Development Platform.

## STM Programmers’ Model

### Memory Map

The STM Architecture defines a memory map that is split into two regions. The first is a configuration interface (4KiB in size), which contains all the registers used to configure the behavior of the STM and to access the Basic Stimulus Ports, if implemented.

A second region of memory contains the Extended Stimulus Ports and can be up to 16MiB in size. How this is represented in the system memory map is down to the design of the SoC -- all Masters (CPUs and devices) may access the same address, or all Masters may access a dedicated and independent address.

All registers in the STM Architecture are defined as being located at an offset relative to the base address of their constituent region. On the Juno SoC, the base address of the configuration (or "APB") interface is 0x2010_0000 and the base address of the Extended Stimulus (or "AXI") region is 0x2800_0000, with this address being common to all Masters.

### Configuration

There are two key steps to configuring the STM via the APB interface. The first is that the STM needs to be configured with a valid Trace ID, since it outputs the instrumentation data over the CoreSight trace subsystem.

This value is exported over the ATB bus interface and is required not only for the transactions to be valid, but to discern between STM trace data and, for example, trace data from another CoreSight component such as an Embedded Trace Macrocell (ETM).

When using an external debugger (such as ARM DS-5) to collect the trace, it is possible to have the debugger set up the Trace ID as part of the connection sequence. The responsibility for this truly depends on your use case; if an external debugger is involved then it may be configuring other CoreSight components and giving them Trace IDs. You do not want the STM Trace ID and the Trace ID for another component to be the same, but you also do not want the debugger to conflict with your application STM configuration.

If you have an external debugger connected you can modify your instrumentation software to compensate; there is no harm whatsoever in having the debugger set the same trace ID as your instrumentation software.

We show an example function stmTRACEID() which performs this operation:

/*
 * stmTRACEID(stm, traceid)
 *
 * Set STM's TRACEID (which goes out over ATB bus ATBID)
 *
 * Note it is illegal per CoreSight to set the trace ID
 * to 0x00 or one of the reserved values (0x70 onwards)
 * (see IHI0029D D4.2.4 Special trace source IDs).
 *
 */
unsigned int stmTRACEID(struct STM *stm, unsigned int traceid)
{
    if ((traceid > 0x00) && (traceid < 0x70)) {
        unsigned int tcsr;

        tcsr = (stm->APB->STMTCSR & ~(TRACEID_MASK << TRACEID_SHIFT));
        stm->APB->STMTCSR = (tcsr | (traceid << TRACEID_SHIFT));

        return traceid;
    }

    return 0;
}


The second requirement is to enable the stimulus ports in question. This is actually an optional part of STM Architecture that offers configuration registers to enable and disable the generation of trace packets when a particular stimulus port is accessed. It is possible to enable and disable stimulus ports with a certain granularity, but this will be completely dependent on the design of the instrumented code and the system it runs on. This example code enables all Extended stimulus ports such that any stimulus write to any stimulus port will generate a packet.

/*
 * Set STMSPSCR.PORTCTL to 0x0 to ensure port selection is not
 * used. STMSPSCR.PORTSEL is ignored and STMSPER and STMSPTER
 * bits apply equally to all groups of ports.
 *
 * Whether the STM has 32 or 65536 ports, they'll all be
 * enabled.
 */
stm->APB->STMSPSCR = 0x00000000;
stm->APB->STMSPER = 0xffffffff;
stm->APB->STMSPTER = 0xffffffff;


Once configured, we can then enable the STM with appropriate register access:

stm->APB->STMTCSR = (stm->APB->STMTCSR | STMTCSR_EN);


This is the bare minimum setup for an STM. There are obviously other configuration options such as Compression, Timestamping, and Synchronization that may or may not be configured dependent on the application.

## Which Stimulus Port?

Each of the 65536 possible Extended Stimulus Ports maps to an STPv2 Channel. A trace decoder can then look for trace belonging to this channel to retrieve the instrumentation and differentiate it from other instrumentation sources.

The layout in memory of the stimulus ports means that for each packet, a data item is written to a particular address and offset within the STM stimulus port address space. Recall that each Extended Stimulus Port is a 256-byte region of memory. The address of the start of the stimulus port, and therefore all the registers which will generate trace for that "channel" within the AXI interface, can be calculated.

channel_address  = STM_AXI_BASE + (0x100 * channel_number)
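
As a host-side illustration of the formula above, the following sketch computes stimulus port and register addresses. It assumes the Juno AXI base address quoted earlier (0x2800_0000); the function names are hypothetical and are not taken from the example code:

```cpp
#include <cassert>
#include <cstdint>

// Juno value quoted in the Memory Map section; other SoCs will differ.
constexpr std::uint64_t STM_AXI_BASE = 0x28000000u;
// Each Extended Stimulus Port occupies a 256-byte region.
constexpr std::uint64_t STM_PORT_SIZE = 0x100u;
// The STM Architecture allows at most 65536 Extended Stimulus Ports.
constexpr std::uint32_t STM_MAX_PORTS = 65536u;

// Address of the start of the stimulus port for `channel`.
std::uint64_t
stm_channel_address (std::uint32_t channel)
{
  assert (channel < STM_MAX_PORTS);
  return STM_AXI_BASE + STM_PORT_SIZE * channel;
}

// Address of one register inside the port; `offset` is the byte offset
// of the register within the 256-byte port.
std::uint64_t
stm_register_address (std::uint32_t channel, std::uint32_t offset)
{
  assert (offset < STM_PORT_SIZE);
  return stm_channel_address (channel) + offset;
}
```

With 65536 ports of 256 bytes each, the last port starts at 0x28FF_FF00, which is why the region can be up to 16MiB in size.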


We present code which provides two examples of access methods. The first uses logical operations to exploit the address decode logic defined within the STM Architecture, and returns a pointer which can be used to perform the memory write.

The finer points of the address decode used by the STM are documented in the STM Architecture, section 3.3. The code for stm.c:stmPortAddress() in the example code shows a method of calculating the address and offset using a flag-based API.

The second uses a C struct defining the layout of each stimulus port offset as an array. In this manner, assigning a value to a particular structure member would generate the appropriate store. Additionally, using C macros can simplify and increase readability of the actual stimulus port access.

struct stmPort {
    STM_STIM G_DMTS;
    STM_STIM G_DM;
    STM_STIM G_DTS;
    STM_STIM G_D;
    STM_NA G_reserved[16];

    STM_STIM G_FLAGTS;
    STM_STIM G_FLAG;
    STM_STIM G_TRIGTS;
    STM_STIM G_TRIG;

    STM_STIM I_DMTS;
    STM_STIM I_DM;
    STM_STIM I_DTS;
    STM_STIM I_D;
    STM_NA I_reserved[16];

    STM_STIM I_FLAGTS;
    STM_STIM I_FLAG;
    STM_STIM I_TRIGTS;
    STM_STIM I_TRIG;
};

/*
 * STM AXI Stimulus Interface
 *
 * The STM Architecture defines up to 65536 stimulus ports, all of which are
 * implemented on the STM and STM-500 from ARM, Ltd.
 */
struct stmAXI {
    /*
     * access the port array based on the limit in
     * (stmAPB->STMDEVID & 0x1fff) so nothing we
     * can define at compile time..
     */
    struct stmPort port[0];
};

/*
 * STMn(a, p, type)
 *
 * Write an n-bit value to stimulus port p of AXI interface a, using a
 * register of a particular type (e.g. G_DMTS)
 *
 * STM64 uses unsigned long long so that the access is 64 bits wide
 * even where unsigned long is only 32 bits.
 */
#define STM8(a, p, type)  (*((volatile unsigned char *) &((a)->port[(p)].type)))
#define STM16(a, p, type) (*((volatile unsigned short *) &((a)->port[(p)].type)))
#define STM32(a, p, type) (*((volatile unsigned int *) &((a)->port[(p)].type)))
#define STM64(a, p, type) (*((volatile unsigned long long *) &((a)->port[(p)].type)))


We can re-create "printf debug" functionality by passing formatted strings to a function which outputs them as data over the requested STM channel:

The example function stm.c:stmSendString() outputs a string as instrumentation using macros STMn() (where n is {8,16,32,64}) which resolve to a C struct access as defined above.

/*
* void stmSendString(stm, channel, string)
*
* We specifically write a byte to ensure that we get a D8 packet,
* although that limits the function to 8-bit encodings.
*
* It doesn't matter what we use for the last write (if we see
* a null character) -- G_FLAGTS has no data except the flag and
* the timestamp, so a 32-bit access will be just fine..
*/

void stmSendString(struct STM *stm, unsigned int channel, const char *string)
{
    /*
     * Send a string to the STM extended stimulus registers
     * The first character goes out as D8M (Marker) packet
     * The last character is followed by a Timestamp packet
     *
     * This is the Annex C example from the STPv2 spec
     */
    struct stmAXI *axi = stm->AXI;

    int first = 1;

    while (*string != '\0') {
        /*
         * If the character is a linefeed, then don't output
         * it -- just reset our 'first' state to 1 so that
         * the next character (the start of the next line)
         * is marked
         */
        if (*string == '\n') {
            STM32(axi, channel, G_FLAGTS) = *string++;
            first = 1;
        } else {
            /*
             * Continue to output characters -- if it's the
             * first character in a string, or just after a
             * linefeed (handled above), mark it.
             */
            if (first) {
                STM8(axi, channel, G_DM) = (*string++);
                first = 0;
            } else {
                STM8(axi, channel, G_D) = (*string++);
            }
        }
    }

    /*
     * Flag the end of the string
     *
     * Access size doesn't matter as we have no data for flag
     * packets
     */
    STM32(axi, channel, G_FLAGTS) = 0x0;
}


## Effective use of the STM

Annex C of the STPv2 specification gives an example of encoding an ASCII string as a data item, using the metadata functionality of the extended stimulus ports. Strings are delimited with a Marked packet at the start of the string, and a FLAG_TS packet is appended at the end of each string in lieu of a linefeed or NUL character. The port offset for one type of Marked Data packet is 0x08 (G_DM), for a (plain) Data packet it is 0x18 (G_D), and for a Flag packet with Timestamp it is 0x60 (G_FLAGTS), so we can break down sending the string into individual writes to those addresses. For a NUL-terminated string “Cambridge”, we would expect to see the following in the trace stream as a result of those writes.

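| D8M | D8 | D8 | D8 | D8 | D8 | D8 | D8 | D8 | FLAG_TS |
|-----|----|----|----|----|----|----|----|----|---------|
| C   | a  | m  | b  | r  | i  | d  | g  | e  | ...     |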

This allows a trace decoder to adequately identify individual lines within text output, and additionally gives the trace decoder a method of determining when the string was output in time by way of the Timestamp. For binary data, a similar construct may be used with Marked data or Flag metadata surrounding the elements of an instrumentation message.
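
The framing can also be expressed as a list of stimulus writes that is easy to check on a host machine. This is only a sketch: `stim_write` and `plan_string_writes` are hypothetical names, and unlike stmSendString() it does not treat linefeeds specially. The register offsets are those quoted above:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Offsets within a stimulus port, as quoted in the text above.
constexpr std::uint32_t OFFSET_G_DM = 0x08;      // marked data
constexpr std::uint32_t OFFSET_G_D = 0x18;       // plain data
constexpr std::uint32_t OFFSET_G_FLAGTS = 0x60;  // flag with timestamp

// One 8-bit write to a register of a stimulus port.
struct stim_write
{
  std::uint32_t offset;
  std::uint8_t value;
};

// Build the sequence of writes for one string: the first character is
// marked, the rest are plain data, and a flag with timestamp closes it.
std::vector<stim_write>
plan_string_writes (const char* s)
{
  assert (s != nullptr);

  std::vector<stim_write> writes;
  bool first = true;
  for (; *s != '\0'; ++s)
    {
      writes.push_back (
        { first ? OFFSET_G_DM : OFFSET_G_D, static_cast<std::uint8_t> (*s) });
      first = false;
    }
  // The value written to a flag register carries no data.
  writes.push_back ({ OFFSET_G_FLAGTS, 0 });
  return writes;
}
```

For "Cambridge" this yields ten writes: one to G_DM, eight to G_D and a final one to G_FLAGTS.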

It might become obvious that outputting ASCII strings over a trace bus with a single packet per character is possibly not the most efficient way to use the STM. Since each data item is encapsulated in the STPv2 protocol, there is some overhead. The example string "Cambridge" sent as D8 packets and surrounded by D8M and FLAG_TS could be, rather than 9 bytes long (1 byte per character), somewhat more than 20 bytes. Packet headers are easily accounted for, but a timestamp may be quite large (up to 7 bytes, not inclusive of the FLAG_TS packet header) and may vary in size. This also does not take into account reporting of Channel and Master information. There are many ways of encoding a string within larger packet types using marker and flag 'framing' to differentiate between strings, but in the end "printf", whether over a USART or an STM interface, is simply not an efficient method of instrumentation.

In fact, in industrial applications, instrumentation is usually binary data formatted to be compact and useful and not a console output. This is especially true of use cases such as the network packet processing instrumentation where the relevant data needn't be prefixed or human readable, and indeed may be far too vast for a human to spend time reading -- the point of said instrumentation would be statistical analysis.

The onus, therefore, is upon the trace decoder to make sense of that packetized binary data. With any instrumentation data, an appropriate format for that data can be designed – ASCII strings or binary structures – and this will very much inform how the Stimulus Ports are used. Simply, you will need to at least define the usage of channels and the metadata packets before you start writing instrumentation code. By modulating the access size and the use of the extended stimulus ports' abilities to add metadata, extremely efficient output of binary instrumentation data can be effected.

Annex C also gives an example of formatting binary data in such a manner that can be constructed using the stimulus port accessor methods (as previously described). Let us imagine an application which calculates prime numbers. When it finds a prime number, it outputs the prime number itself, and the position or index of the prime, as 32-bit stimulus accesses to the STM. For example, 41 is the 13th prime number, so it outputs "41" and "13."

| Stimulus Port Register | Data  |
|------------------------|-------|
| G_D                    | prime |
| G_DMTS                 | count |

A trace decoder can then look for pairs of 32-bit data items, with the second followed by a Marker packet augmented with a Timestamp. From the difference in timestamps between packets, we could work out how long it took to generate that prime number.

| D32 | D32MTS | D32 | D32MTS | D32 | D32MTS | ... |
|-----|--------|-----|--------|-----|--------|-----|
| 41  | 13     | 43  | 14     | 47  | 15     | ... |

For the 3 sets of data shown, this takes up six 32-bit words (24 data bytes), not including overhead. Unless our first prime number is very, very large, we would not need to encode the number or the count in a 32-bit data packet. Since each value is packetized independently (the STM will never merge two packets), the access size could be chosen according to the size of the output data (by counting leading zeros), or the data could be automatically emitted as a smaller packet using optional STM Compression features.

The trace decoder would still be able to look for pairs of data packets (the second with a Marker+Timestamp), but we would make more efficient use of bits in the resultant trace. Below we show how an efficient trace output could be achieved when counting primes, where we increase the packet payload size as we reach the limit of the previous type (again, the first field is a "prime," the second, marked field is a "count" of which prime). The data below reports 5 sets of data using 15 data bytes (again, not including overhead).

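| D8  | D8MTS | D16 | D8MTS | D16 | D8MTS | ... | D16  | D8MTS | D16  | D16MTS |
|-----|-------|-----|-------|-----|-------|-----|------|-------|------|--------|
| 251 | 54    | 257 | 55    | 263 | 56    | ... | 1613 | 255   | 1619 | 256    |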

We can see that since the first prime can be encoded in 8 bits, we can use a D8 packet. Since its position can be encoded as 8 bits, we can also use a D8 packet. The next prime is 257, which requires >8 bits to encode, but the position does not, so we see D16+D8MTS. And so on. Eventually we will see D32 and possibly D64 packets if we calculate enough primes, but only if we need that number of bits to encode the value.
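
One way to make the accessor conditional on the size of the value is to pick the smallest access width that holds it. The following is a minimal sketch of that selection only, with hypothetical function names; the actual write would then go through the matching STM8/STM16/STM32/STM64 accessor:

```cpp
#include <cassert>
#include <cstdint>

// Smallest stimulus access size, in bytes, that holds `value` without
// truncation.
unsigned
stm_access_bytes (std::uint64_t value)
{
  if (value <= 0xFFu)
    return 1;
  if (value <= 0xFFFFu)
    return 2;
  if (value <= 0xFFFFFFFFu)
    return 4;
  return 8;
}

// Data bytes (excluding protocol overhead) needed for one record made
// of a prime and its position in the sequence of primes.
unsigned
stm_record_bytes (std::uint64_t prime, std::uint64_t count)
{
  // Sanity check: the n-th prime is always larger than n.
  assert (count < prime);
  return stm_access_bytes (prime) + stm_access_bytes (count);
}
```

Summing `stm_record_bytes()` over the five records in the table gives 2 + 3 + 3 + 3 + 4 = 15 data bytes, compared with 40 bytes if every value were written as a 32-bit access.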

## Next

We now know fundamentally how to program the STM and generate stimulus which implements our instrumentation. Next we'll discuss how to configure DS-5 to collect the instrumentation as Trace, in Configuring DS-5 for the System Trace Macrocell.

# Introduction to ARM's System Trace Macrocell

Posted by mwsealey Jun 6, 2016

This article aims to introduce the ARM System Trace Macrocell (STM), outlining what it is, its basic operation, and why one might want to use it. Example code will be provided, minimally targeted at the Juno ARM Development Platform, in a later blog in the series.

## Introduction to instrumentation

When writing code it is often useful to add informational statements that give an insight into control flow and data management as well as aiding in observation of the actual code at runtime. As such, instrumentation is an important component of code running on a live system. The proliferation of "printf" debug statements, whereby data is output to a console, is testament to this.

Sending text data to a USART or similar peripheral via printf is perhaps the most common method of instrumentation. It does have its drawbacks: the data rate of most USARTs is relatively low, and at the same time the overhead of maintaining such communication is relatively high, involving the use of FIFOs and interrupt servicing. It is also sometimes complicated to access a serial port connection on a production system, which may be located remotely. With this in mind, a USART can be considered a non-ideal choice for use cases involving high-performance code or the collection of remote instrumentation data.

An alternative method may be to use network devices, such as Ethernet. These devices typically afford much higher bandwidth rates than USARTs, and are ideal for the collection of remote data. However, this does involve manually encapsulating the data in protocols such as TCP/IP, which can dramatically increase the overhead of servicing the peripheral. Therefore the overhead of instrumentation can be higher.

Using USARTs, Ethernet or other generic data peripherals can have detrimental effects on instrumented code. As an example, we can imagine a system which performs network packet data processing. If we use a USART, we may find that the data processing is throttled because sending instrumentation data is limited by the USART bandwidth. If we instead use Ethernet as a transport for instrumentation, we might find that the instrumentation on packet data processing contains data on the process of instrumentation itself.

It is considered desirable for instrumented code to run at close to the performance and run-time profile of non-instrumented code. That has the implication that instrumentation has as little management overhead as possible, and does not markedly interfere with operation of the non-instrumentation code. One way to solve these problems is with a device which is designed for the purpose of instrumentation.

## The System Trace Macrocell

A System Trace Macrocell (STM) grants software developers the ability to instrument code utilizing the CoreSight Trace subsystem as a transport. CoreSight is a central part of most ARM SoCs, and is intended to operate at similar clock rates to the rest of the components of the system. The STM itself operates in a non-invasive fashion, requires very little overhead besides memory-mapped peripheral writes, and does not (directly) generate interrupts.

ARM defines a System Trace Macrocell Programmers' Model Architecture Specification (currently version 1.1, referred to here as "STM Architecture") and licenses the current CoreSight STM-500 product as an implementation of that architecture.

Further information on CoreSight Trace can be found in Eoin McCann's 3-part blog on CoreSight.

The STM instruments using the MIPI System Trace Protocol version 2.0 (STPv2), which is available to MIPI Members. The protocol itself defines a method for both instrumentation data and metadata to be encapsulated in a trace stream, composed of varying sized data elements (from 4- to 64-bit). The instrumentation is otherwise free-form and neither the protocol nor the STM place any limitations on the data content of the stream. These aspects of the STM free the software developer from having to be concerned with instrumentation overheads and available bandwidth.

## Terminology

### Masters

Instrumentation via STM can be identified as being output via a particular "Master," in order to differentiate the various sources within a system. A simple implementation might attribute all instrumentation to a single Master identifier. A more complex design might give each individual core a unique Master identifier, making it clear which core was running the software responsible for generating a particular datum of instrumentation.

Any device that can generate a memory system write can generate instrumentation, for example DMA peripherals and GPUs.

The number of masters within a system and their identifiers are part of the implementation of the system, and may or may not map directly to, for example, AXI IDs. Check the design documentation for your chosen SoC for details on which components are able to generate stimulus via memory writes, and what their STPv2 Master ID is.

### Channels

Each STM implementation has access to up to 65536 instrumentation channels. Each of these channels is clearly defined in the trace stream, allowing for multiple types of instrumentation to be intermixed within a single system or single application. For instance, channel 0 could be used to encode ASCII text, while channel 10 could output packet headers in a binary format.  Alternatively, one channel could be allocated to each Process within a system.

### Metadata: Marks, Flags, Timestamps and Triggers

STM metadata is highly flexible, allowing one to arbitrarily Mark any trace data packet. A marked datum is typically used to identify the start of data or something of interest in the trace stream. A Flag can be used in a similar way; however, no data is associated with a Flag.

Each packet can be supplemented with a Timestamp, which takes an external clock signal and converts it into an incrementing count in the trace stream. In this manner a trace stream can be synchronized with other trace in the system, such as Instruction Trace from an ETM, or simply provide timing information to the trace decoder.

STPv2 defines the format of the timestamp to be flexible. The STM-500 outputs timestamps in a natural binary format, with the ability to encode a delta to conserve bandwidth.

Triggers are special, as they are both output to the trace stream and can have an effect on the rest of the trace subsystem. The result of a Trigger can be routed to other components in the system. In this manner code can be instrumented and also generate additional trace from other Trace Macrocells within the system at pertinent points. This is particularly useful for post-mortem analysis use cases.

## Architecture

### Stimulus Ports

Channels are formed on the STM by way of “stimulus ports.” These are groups of registers within the SoC memory map that, when accessed, generate the desired trace output. The STM Architecture defines both “Basic” and “Extended” Stimulus Ports. A Basic Stimulus Port is simple; data is written to the port, and that data is then output.

Extended Stimulus Ports allow the data to be augmented with useful metadata, and also convey the importance of that data (Guaranteed or Invariant, discussed later). Each Extended Stimulus Port consists of a group of 16 registers in a 256-byte memory-mapped region, separate from the STM configuration registers.

Depending on the address offset of the register within a group, a different STPv2 packet is output. The offsets are defined in the STM Architecture, Section 3.1 (Table 3-1), a summary of which is shown:

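| Offset | Register | Packet generated |
|--------|----------|------------------|
| 0x00 | G_DMTS | Data, marked with timestamp, guaranteed |
| 0x08 | G_DM | Data, marked, guaranteed |
| 0x10 | G_DTS | Data, with timestamp, guaranteed |
| 0x18 | G_D | Data, guaranteed |
| 0x60 | G_FLAGTS | Flag with timestamp, guaranteed |
| 0x68 | G_FLAG | Flag, guaranteed |
| 0x70 | G_TRIGTS | Trigger with timestamp, guaranteed |
| 0x78 | G_TRIG | Trigger, guaranteed |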

The size of the data payload of each packet is determined by the size of the access made to the stimulus port offset. For example, an 8-bit store to offset 0x18 would nominally generate a 'D8' packet, while a 32-bit store to offset 0x18 would nominally generate a 'D32' packet, and so on.
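To make the offset and access-size rules concrete, here is a minimal C sketch. It assumes that the extended stimulus ports of successive channels sit back to back, one 256-byte group per channel, starting at a base address taken from your SoC documentation. The helper names (`stm_reg_offset`, `stm_write8`, `stm_write32`) are hypothetical and not part of any ARM library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets of the guaranteed registers, copied from the summary table. */
enum stm_offset {
    STM_G_DMTS   = 0x00,
    STM_G_DM     = 0x08,
    STM_G_DTS    = 0x10,
    STM_G_D      = 0x18,
    STM_G_FLAGTS = 0x60,
    STM_G_FLAG   = 0x68,
    STM_G_TRIGTS = 0x70,
    STM_G_TRIG   = 0x78
};

#define STM_PORT_SIZE 0x100u   /* one extended stimulus port: 256 bytes */

/* Byte offset of a register from the start of the stimulus region.
 * ASSUMPTION: channel N occupies bytes [N * 256, N * 256 + 255]. */
static inline size_t stm_reg_offset(uint32_t channel, enum stm_offset reg)
{
    assert(channel < 65536u);  /* up to 65536 channels per STM */
    return (size_t)channel * STM_PORT_SIZE + (size_t)reg;
}

/* An 8-bit store: nominally generates a D8 packet. */
static inline void stm_write8(volatile void *base, uint32_t channel,
                              enum stm_offset reg, uint8_t value)
{
    volatile uint8_t *p = (volatile uint8_t *)base + stm_reg_offset(channel, reg);
    *p = value;
}

/* A 32-bit store: nominally generates a D32 packet. */
static inline void stm_write32(volatile void *base, uint32_t channel,
                               enum stm_offset reg, uint32_t value)
{
    volatile uint8_t *p = (volatile uint8_t *)base + stm_reg_offset(channel, reg);
    *(volatile uint32_t *)p = value;
}
```

With `base` pointing at the stimulus region, `stm_write32(base, 10, STM_G_DTS, header)` would nominally produce a timestamped D32 packet on channel 10, while `stm_write8(base, 0, STM_G_D, 'A')` would produce a D8 packet on channel 0.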

To reiterate, we can "Mark" and "Timestamp" our data, and also output metadata only via the "Flag" and "Trigger" mechanisms (these types of instrumentation have no data payload).

Since ARM's STM and STM-500 IP do not implement the Basic Stimulus registers, we will not cover them here. ARM partners implementing an STM may choose to implement them per the STM Architecture. If, when designing an SoC, there is a requirement for more simple instrumentation, then it is possible that an Instrumentation Trace Macrocell (ITM) could be implemented which can provide similar functionality, although with a different programmers' model and trace output format. Please check your SoC documentation.

### Fundamental Data Size

The STM implementation defines a “Fundamental Data Size.” This is essentially the maximum size of an access to the stimulus port registers, as determined by the implementation of the connection between the STM and the rest of the system.

For STM-500, as implemented in revision r1 of the Juno SoC, the fundamental data size is 64-bit, so a 64-bit stimulus should generate a D64 packet. Be aware of this value on your platform, as it can change the way a trace decoder must be written for application instrumentation that may run on multiple platforms.

Some SoCs implement an earlier version of the STM, the r0 revision of Juno being one example. The Fundamental Data Size is defined as 32-bit for that implementation.

An STM with a Fundamental Data Size of 64 bits may also be connected in such a way that it does not have a 64-bit wide data path, for example there may be a 'downsizer' between the instrumentation source and STM.

If a 64-bit memory system write is performed and either of the above is true, the actual trace output behavior is undefined by the STM architecture. Take these aspects into account, as they can change the way a trace decoder extracts the instrumentation.
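One defensive pattern, sketched here in C, is to issue a 64-bit store only when the platform is known to support it and otherwise split the value into two 32-bit stores to the same stimulus register. This is an illustration under stated assumptions, not behavior mandated by the STM Architecture: the function name `stm_emit_u64` is hypothetical, the caller is assumed to know the effective data width from the SoC documentation, and the low-word-first order is simply a convention that the instrumentation and the decoder would have to share.

```c
#include <assert.h>
#include <stdint.h>

/* Emit a 64-bit value through one stimulus register.
 * effective_bits: 32 or 64, the widest access that reaches the STM intact
 *                 (ASSUMPTION: known to the caller from SoC documentation).
 * Returns the number of stores issued, i.e. the number of data packets
 * a decoder should expect for this value. */
static inline int stm_emit_u64(volatile void *reg, uint64_t value,
                               unsigned effective_bits)
{
    assert(effective_bits == 32u || effective_bits == 64u);

    if (effective_bits == 64u) {
        *(volatile uint64_t *)reg = value;   /* nominally one D64 packet */
        return 1;
    }

    volatile uint32_t *reg32 = (volatile uint32_t *)reg;
    *reg32 = (uint32_t)value;                /* low word first (sketch convention) */
    *reg32 = (uint32_t)(value >> 32);        /* high word, same register */
    return 2;                                /* nominally two D32 packets */
}
```

A decoder written for both platforms would then need to accept either one D64 packet or a pair of D32 packets for the same logical value.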

### Guaranteed and Invariant Stimulus

The STM Architecture specifies two types of transaction, accessible through the stimulus port interface at separate offsets within the port – Guaranteed and Invariant. A write to the stimulus port "guaranteed" registers must be emitted by the STM as a trace packet; additionally, if a timestamp is requested (DnTS, FLAG_TS) and timestamping is enabled in the STM configuration registers (STMTCSR), then the timestamp will be generated.

Writes to the Invariant registers allow the STM to make a determination as to whether the full scope of instrumentation will be output. This is useful for instrumentation types that may be implemented as “lossy”: for instance, the output of the state of a loop counter where intermediate loop counts can be inferred, or where timestamping is not fundamental to the instrumentation. Invariant stimulus may, when emitted, "drop" timestamps for the sake of trace bandwidth. Important instrumentation, such as an error report, may still use Guaranteed stimulus.
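The choice between the two can be reduced to picking a register offset. The C sketch below assumes that each Invariant register sits 0x80 above its Guaranteed counterpart within the 256-byte port; that layout is not shown in the summary table in this post and should be checked against Table 3-1 of the STM Architecture. The names `stm_data_offset` and `INVARIANT_BANK` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define G_DTS_OFFSET   0x10u  /* data with timestamp, guaranteed (from the table) */
#define G_D_OFFSET     0x18u  /* data, guaranteed (from the table) */
#define INVARIANT_BANK 0x80u  /* ASSUMPTION: invariant = guaranteed + 0x80 */

/* Pick the stimulus register offset for a data write.
 * must_not_drop:  true for instrumentation that has to reach the trace
 *                 stream (e.g. an error report), false for lossy data
 *                 such as a loop counter.
 * want_timestamp: request a timestamp with the data packet. */
static inline uint32_t stm_data_offset(bool must_not_drop, bool want_timestamp)
{
    uint32_t offset = want_timestamp ? G_DTS_OFFSET : G_D_OFFSET;

    if (!must_not_drop) {
        offset += INVARIANT_BANK;   /* let the STM drop it under pressure */
    }
    assert(offset < 0x100u);        /* must stay inside the 256-byte port */
    return offset;
}
```

Under these assumptions an error report would be written at offset 0x18 (or 0x10 with a timestamp), while a loop counter would go to 0x98 (or 0x90).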

## Next

Now that we have a good idea of what the STM is and how the architecture is defined, we can use the STM to generate instrumentation by Programming ARM's System Trace Macrocell, the second part of this blog series.

# Sensors Expo:  June 21-23, 2016. McEnery Convention Center, San Jose, California

Posted by Bob Boys May 31, 2016

## Sensors Expo: Sensors and ARM Cortex Processors Working Together

Sensors are present in many electronic devices.  Sensors capture a wide variety of signals and then this information needs to be collected by a microprocessor for processing and then further passed to the ultimate use or application.

Data acquired by a sensor is normally transferred as an analog or digital signal to a microprocessor.   The transfer speed can be critical to prevent overruns or lost data thus requiring fast processors capable of processing data quickly within tight time frames or windows.  Such transfers can be polled or interrupt driven as desired.

Digital signal transfers can include protocols such as UART, CAN, I2C, I2S, SPI and parallel.  Analog signals usually use an A/D converter.  These peripherals usually reside inside the microprocessor or in external ICs.  The possibilities are nearly endless, offering great flexibility in your design.

Once the microprocessor has the data it is often desirable to process the data in some way.  This can include filtering, scaling or for more sophisticated applications: Digital Signal Processing (DSP).

Processor Features:  ARM Cortex processors have become the de facto standard for sensor data acquisition and processing.  ARM's free DSP libraries run on all Cortex-M processors from Cortex-M0 through Cortex-M7.  ARM Cortex-M4 and Cortex-M7 processors are especially useful thanks to DSP extensions such as MAC (multiply-accumulate), SIMD (Single Instruction Multiple Data) and various other DSP instructions.  Third-party suppliers offer DSP libraries for the Cortex-A series, which also offers the NEON DSP extension.  An FPU (Floating Point Unit) is available on many Cortex processors.

Interrupt Controller:  The Cortex-M NVIC (Nested Vectored Interrupt Controller) provides a flexible and versatile interrupt and exception handling mechanism.  Individual peripherals and GPIO pins can have their own interrupt vector, which provides fast response times.  The NVIC is easy to configure using the CMSIS-Core standard APIs.

ARM Processor Types:  Different sensor applications can have very different processing requirements.  Ranging from the tiny 12,000 gate Cortex-M0, through the M3, M4 and M7 series and the real-time Cortex-R family, to the powerful Cortex-A family, there is an ARM processor scalable for every sensor application.  Migrating up and down the ARM roadmap to choose the most applicable processor for your application is easy.  Each can run various operating systems or none at all (bare metal). Using an RTOS has definite advantages that make your project easier to design, understand and debug.

Various ARM licensees such as Atmel, NXP, STMicroelectronics, Cypress and many others offer Cortex processors with many peripherals to transfer data to and from sensors. Some sensor vendors have integrated an ARM processor on the same silicon as their sensors for efficient, low-cost solutions.

Debugging:  Cortex-M processors include many debug features to facilitate faster software development.  Serial Wire Viewer (SWV) is a component of ARM CoreSight debugging technology that can be used to display sensor data values graphically while the processor runs. SWV is non-intrusive and is easy to use. SWV also displays exceptions and interrupts in real-time and updates while the program runs.  Many quality debuggers feature SWV operation.  This makes software development easier and faster.

ETM trace (Embedded Trace Macrocell) instruction trace displays the instructions executed and also provides Performance Analysis and Code Coverage.  Many Cortex processors have ETM.

Low power: ARM processors are legendary for their low power consumption.  In situations where power is critical and data events are not occurring, you can utilize various sleep modes.  An interrupt generated by an external event or peripheral can be used to wake up the processor, have it perform the desired work, and then put it back to sleep with the WFI (wait for interrupt) or WFE (wait for event) instructions.

Sensors Expo:  At the ARM/Keil booth at Sensors Expo we will display a wide variety of low cost evaluation boards from many manufacturers. Most of these contain interesting sensors.  We have demonstrations of working systems using the Keil MDK toolchain.  These are mostly turn-key "out-of-the-box" systems that you will get running in a short time.  You can use the free evaluation version (to 30 K) of Keil MDK with these boards.

We can explain the various ARM Cortex processors and their relation to each other.  We can also explain upcoming technologies such as the recently announced ARMv8-M architecture, which provides needed security for IoT by using ARM TrustZone technology. We will have copies of the ARM Roadmap and a limited number of the famous Keil mouse pads to give away.

Sensors Expo:  June 21-23, 2016. McEnery Convention Center, San Jose, California   www.sensorsexpo.com

# Booting Linux on the ARMv8-A model provided with DS-5 Ultimate Edition 5.24 and later

Posted by Ronan Synnott May 10, 2016

This blog is an update of one I wrote a couple of years ago, referencing the latest FVP models provided with DS-5 (v5.24 at time of writing) and the latest pre-built Linaro distributions. It is intended for users new to DS-5 and/or users on Windows platforms, as the Linaro distributions assume a Linux host. Note that the pre-built images do not contain kernel debug information. If you wish to enable kernel awareness, you will need to rebuild appropriately. Application debug and other Linux aware features do not require this.

You should first download the appropriate pre-built software stack and file system to match your needs. For the below I downloaded fvp-latest-oe-uboot.zip and the OpenEmbedded LAMP filesystem. Unzip these files to your host machine.

Open the DS-5 Eclipse GUI, and select Run → Debug Configurations to set up the debug session. Select DS-5 Debugger from the list on the left-hand side, and click on New launch configuration. You can give this configuration any suitable name. Locate the Base_AEMv8Ax1 (or Base_AEMv8x4) FVP (use the Filter platforms text box to help), and drill down to the Debug ARMAEMv8-A level. If you have kernel debug symbols available, I recommend selecting from the Linux Kernel and/or Device Driver Debug branch (more on this later).

We now need to use the Model parameters to instantiate the model appropriately for the Linaro images. Within the packages you downloaded above, you will find a run_model.sh script, which is a Linux host script for launching this model standalone with these files. We will use this as the basis for the parameters that DS-5 will pass. You can simply copy and paste the lines below into a text editor, fix the paths to match the location of the files on your host, and then paste the result into the Model parameters field. For more information on these options, see the FVP documentation.

--parameter bp.secure_memory=0

--parameter cluster0.NUM_CORES=1

--parameter cache_state_modelled=0

--parameter bp.pl011_uart0.untimed_fifos=1

--data cluster0.cpu0="\\path\to\\Image"@0x80080000

--data cluster0.cpu0="\\path\to\\fvp-base-gicv2-psci.dtb"@0x83000000

--data cluster0.cpu0="\\path\to\\ramdisk.img"@0x84000000

--parameter bp.ve_sysregs.mmbSiteDefault=0

--parameter bp.virtioblockdevice.image_path="\\path\to\\<filesystem>.img"

--parameter bp.smsc_91c111.enabled=true

--parameter bp.hostbridge.userNetworking=true

--parameter bp.hostbridge.userNetPorts="5555=5555,8080=8080,22=22"

Go to the Debugger tab, and select Connect Only. If you have debug symbols available for the image, I recommend loading the symbols via the Execute Debugger Commands panel. Note that the kernel runs at Exception Level EL1, and so the symbols need to be loaded to this level. To do this, use the command:

You should now be able to launch the model (by clicking on Debug). Click the run button, and the model should boot directly into Linux.

Features such as Kernel Awareness (if kernel symbols are loaded), Remote System Explorer (RSE) View, and Application Debug will be available, just as per my previous blog. I would also highlight some general improvements we have made to the GUI since that blog was written. Note that to use RSE, you first need to set a password for root ("passwd root" on the Linux command line), then create an RSE Linux connection to "localhost", configured for ssh files.

# CMSIS RTOS API: Criticism, comments and CMSIS++ suggestions

Posted by Liviu Ionescu Apr 17, 2016

## For the impatient

If you ever had to deal with the CMSIS RTOS API and did not enjoy it, or if it felt like a straitjacket compared to your native RTOS, well, rest assured, you're not alone. The good news is that your experience matters and you can help improve the CMSIS RTOS API. Go to GitHub Issues and comment on any of the existing issues, or open new ones.

## The story

### ARM, thumbs up for the CMSIS RTOS idea!

First of all I have to confess that I was a big supporter of the general idea of a common CMSIS RTOS API, from the moment I first read about it. However, as big as my expectations were, so was my disappointment when the specs came out.

### Some CMSIS RTOS API considerations

From my point of view, the main problems with the CMSIS RTOS API are:

• no POSIX compliance
• not C++ friendly

Please note that I did not ask for C++ APIs; plain C APIs would be perfectly fine. I would just have preferred the APIs to be designed by someone who thinks in C++, not in C (and as such knows how to avoid the usual mess that unstructured C programs bring, especially in the embedded world); unfortunately ARM seems to have no C++ specialists in their design teams.

### The CMSIS++ proposal

Given this situation, and seeing that ARM had no plans for a C++ redesign, by the end of 2015 I started to think of CMSIS++ as a C++, POSIX-compliant proposal for a future generation of CMSIS. In March 2016 the project was publicly announced on the ARM Connected site.

### Some CMSIS RTOS API issues

The initial CMSIS++ attempt was to simply rewrite the original CMSIS RTOS API in C++. However, once I started down this path, I encountered many problems and noticed many differences from the POSIX and ISO C/C++ specs. At a certain point I realised that the current design is broken beyond repair and that a reset is required; otherwise the approach will not work.

Restarting from scratch, the focus moved from CMSIS to POSIX and ISO.

During the design and development phases, I kept a log of issues that I identified and addressed in the CMSIS++ proposal.

Some are difficulties in understanding the CMSIS RTOS API, due to documentation issues, some are functional issues that make using the original API not very convenient, and some are suggestions for missing features.

The POSIX compliance issues are:

• Use POSIX error codes (#65)
• Use explicit separate calls for different waiting functions, like lock(), try_lock(), timed_lock() (#45)
• Add normal (non-recursive) mutex (#53)
• Add a mechanism to wait for a thread to terminate (#50)
• For message queues, make the message size user configurable (#70)
• For message queues, add message priorities (#72)
• Make osSemaphoreWait() return errors, not counts (#56)
• Deprecate or remove the unused thread_id parameter in osMessageQCreate()/osMailQCreate() prototypes (#61)

Other functional issues are:

• Avoid the heavy use of macros (to define objects and to refer to them) (#36)
• Do not mandate the use of a dynamic allocator (for stack, queues, etc) (#37)
• Add support for critical regions (interrupts & scheduler) (#38)
• Avoid mixing time durations (in milliseconds) with timer counts in ticks (#39)
• Add a separate RTC system clock (#40)
• Add os_main() to make the use of a main thread explicit (#41)
• Add support for a synchronised public memory allocator (#42)
• Avoid returning aggregates (like osEvent) (#43)
• Extend the range for osKernelSysTick() (#44)
• Make the scheme to assign names to objects more consistent (#46)
• Add missing destructor functions to all objects (#47)
• Extend the range of priority levels (#49)
• Allow the semaphore max count to be defined explicitly (#55)
• Add a method to wait for a memory pool block to become available (#57)
• Fix non-portable message type in osMessagePut() (#60)
• Fix osMessagePut()/osMailPut() inconsistent error when called from ISR (#62)
• Mail queues, as separate objects, are redundant (#63)
• Add typedefs for all different types used in prototypes (#66)
• For all objects, add reset functions to return the object to initial status (#67)
• For mutex, add a method to get the owner thread (#68)
• For memory pools, add more accessors to get pool status (#69)
• For message queues, add more accessors to get queue status (#71)

The documentation issues are:

• Explain that thread functions can return (#48)
• Explain the mutex behaviour (recursive vs normal) (#52)
• Clarify the specs for binary vs counting semaphores (#54)
• Fix the data type used in osMessageQDef() example (#58)
• Fix misplaced thread id parameter for message queue (#59)

### CMSIS RTOS API v2

Somehow acknowledging the initial design problems, ARM announced working on CMSIS RTOS API v2. To my pleasant surprise, ARM seems to have deprecated the initial macro based object creation mechanism (probably one of the most annoying features of the RTOS API v1).

In the new proposal ARM also gave up returning aggregate objects, extended the priorities range, added explicit normal/recursive mutex objects, renamed some objects and generally kept very few features from the initial specification, so a design reset seems possible.

However, based on the CMSIS++ experience, there are still more design decisions required to bring the new RTOS v2 closer to POSIX and ISO, for example using the POSIX error codes, using the POSIX explicit separate calls for different waiting functions (like lock(), try_lock(), timed_lock()), etc.
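For readers unfamiliar with the POSIX style referred to here, the following C sketch uses the standard pthread mutex calls, where each kind of wait has its own function and failures are reported as plain error codes. It illustrates POSIX itself, not the CMSIS++ or CMSIS RTOS v2 API; the helpers `make_normal_mutex` and `lock_with_timeout` are hypothetical names introduced for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Initialise a normal (non-recursive) mutex.
 * Returns 0 on success or a POSIX error code. */
static int make_normal_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_NORMAL);
    if (rc == 0) {
        rc = pthread_mutex_init(m, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Wait up to timeout_ms for the mutex.
 * Returns 0 when locked, ETIMEDOUT on timeout, or another error code. */
static int lock_with_timeout(pthread_mutex_t *m, long timeout_ms)
{
    struct timespec deadline;

    assert(timeout_ms >= 0);
    if (clock_gettime(CLOCK_REALTIME, &deadline) != 0) {
        return errno;
    }
    /* pthread_mutex_timedlock() takes an absolute CLOCK_REALTIME deadline. */
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    return pthread_mutex_timedlock(m, &deadline);
}
```

With a mutex that is already held, `pthread_mutex_lock()` blocks, `pthread_mutex_trylock()` returns `EBUSY` immediately, and `lock_with_timeout()` returns `ETIMEDOUT` once the deadline passes, so the caller can tell the three situations apart without decoding a combined status value.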

### Feedback welcomed

So, if you would like to express your support for POSIX compatibility, or generally to have a better CMSIS RTOS API, please go to GitHub Issues and comment on any of the existing issues (especially those marked with Help Wanted), or open new tickets with your own suggestions.

CMSIS is an ARM technology, now also available as a GitHub project.

CMSIS 5 announcement.

CMSIS++ is an open source project, maintained by Liviu Ionescu.

The main source of information for CMSIS++ is the project web site.
