CMSIS++ RTOS: fully functional reference implementation

June 23, 2016

12 minute read time.

Overview

CMSIS++, or rather POSIX++, is a POSIX-like, portable, vendor-independent, hardware abstraction layer intended for C++/C embedded applications, designed with special consideration for the industry standard ARM Cortex-M processor series. Originally intended as a proposal for the next generation CMSIS, CMSIS++ can probably be more accurately defined as "C++ CMSIS", and POSIX++ as "C++ POSIX".

CMSIS++ RTOS: APIs vs reference implementations

The CMSIS++ cornerstone is the RTOS, and in this respect CMSIS++ RTOS can be analysed from two perspectives: the CMSIS++ RTOS APIs, with a modern design and the CMSIS++ RTOS reference implementation with a clean and efficient code.

In the first phase of the project, the CMSIS++ RTOS APIs were designed, with POSIX threads in mind, but from a C++ point of view.

The native CMSIS++ RTOS interface is the C++ API, with a C API implemented as a wrapper, and an ISO C++ Threads API implemented also on top of the native C++ API.

The CMSIS++ RTOS C++ API as a wrapper on top of an existing RTOS

Initially, the C++ API was validated by implementing it as a wrapper on top of the popular open source project FreeRTOS. Full functionality was achieved, and the entire system passed the ARM CMSIS RTOS validation suite.

The CMSIS++ RTOS reference synchronisation objects (semaphores, queues, etc)

With the native C++ API validated, while still using the safety net provided by an existing scheduler, the next step toward a grand design was to implement, in a portable way, the synchronisation objects defined by the CMSIS++ RTOS.

The result was a highly portable implementation, that requires a very simple interaction with the scheduler, basically a thread suspend() and resume().

Using this model, all RTOS objects were implemented (semaphores, mutexes, condition variables, message queues, memory pools, event flags, clocks and timers); full functionality was achieved, and again the entire system passed the ARM CMSIS RTOS validation suite.

To be noted that in this configuration, when running on top of an existing RTOS, it is perfectly possible to select which implementation to use, at individual object level; in other words it is perfectly possible to run with some objects implemented by the host RTOS and some objects using the reference portable implementation. This is generally useful when some of the objects defined by CMSIS++ are not available in the host RTOS; for example in the current version of FreeRTOS there were no memory pools or condition variables, and these objects were supplied by the reference implementation.

The CMSIS++ RTOS reference scheduler

The last piece to complete the puzzle was the scheduler. The CMSIS++ RTOS specifications do not mandate for a specific scheduling policy, and, when running on top of an existing RTOS, any scheduling policy can be used.

However, the CMSIS++ RTOS reference scheduler takes the beaten path and implements a priority based, round robin, cooperative and optionally preemptive scheduler.

In other words, threads are assigned priorities, higher priority threads are scheduled first, equal priority threads are scheduled in a round robin way, and scheduling points are entered either explicitly at any wait() or yield(), or are optionally triggered by periodic interrupts, like the system clock ticks, or by user interrupts.

The scheduler portable code

The scheduler was designed to be as portable as possible, and to run on any reasonable architecture, with any word size.

As such, the scheduler's main responsibility is to manage the list of threads ready for execution and to switch their execution contexts in an orderly manner.

Although not mandatory for its functionality, the scheduler also keeps track of all registered threads, and provides iterators to walk these lists.

For a better modularity, the scheduler itself does not keep track of threads waiting for various events; this is delegated to the various synchronisation objects, that are expected to implement their own policy of suspending and resuming execution of threads waiting for common resources.

However, the reference synchronisation objects use similar lists to keep track of the waiting threads, and, to simplify the implementation, the scheduler provides base classes for these lists.

The scheduler port-specific code

Regardless how carefully a portable scheduler is designed and implemented, there will always be a last mile where the platform differences become important.

To accommodate for these differences, the scheduler needs to be ported on a specific platform. The port includes the specific definitions, mainly the way of creating and switching thread contexts, but also handling interrupts, accessing timers and clocks, etc.

There are currently two such CMSIS++ RTOS scheduler ports available and fully functional:

a 32-bits ARM thumb port, running on Cortex-M devices;
a 64-bits synthetic POSIX port, running as a user process on macOS and GNU/Linux.

These ports are actually not part of the CMSIS++ package itself, which is highly portable, but are part of separate µOS++ packages.

The Cortex-M scheduler port

This 32-bits ARM thumb port is specifically designed to run on Cortex-M devices. It currently supports ARMv6-M and ARMv7-M architectures, with or without FPU. Support for ARMv8-M will be added when needed.

The implementation uses the ARM specific features, like PendSV, which greatly simplify things.

For example, the context switching is performed by a rather simple function:

void
__attribute__ ((section(".after_vectors"), naked, used, optimize("s")))
PendSV_Handler (void)
{
  // The naked attribute and the push/pop are used to fully control
  // the function entry/exit code; be sure other registers are not
  // used in the assembly parts.
  asm volatile ("push {lr}");

  // The whole mystery of context switching, in one sentence. :-)
  port::scheduler::restore_from_stack (
      port::scheduler::switch_stacks (
          port::scheduler::save_on_stack ()));

  asm volatile ("pop {pc}");
}

Apart from saving/returning, this function does exactly what it is expected to do:

save_on_stack() - saves the context of the current thread on the thread stack and returns the stack address;
switch_stacks() - saves the above stack address in the current thread control block, selects the next thread waiting to run and returns the address of its stack context;
restore_from_stack() - restores the context of the new thread from the stack.

The two save/restore functions are among the very few in the Cortex-M port that require assembly code:

inline stack::element_t*
__attribute__((always_inline))
save_on_stack (void)
{
  register stack::element_t* sp_;

  asm volatile
  (
      // Get the thread stack
      " mrs %[r], PSP                       \n"
      " isb                                 \n"

#if defined (__VFP_FP__) && !defined (__SOFTFP__)

      // Is the thread using the FPU context?
      " tst lr, #0x10                       \n"
      " it eq                               \n"
      // If so, push high vfp registers.
      " vstmdbeq %[r]!, {s16-s31}           \n"
      // Save the core registers r4-r11,r14.
      // Also save EXC_RETURN to be able to test
      // again this condition in the restore sequence.
      " stmdb %[r]!, {r4-r9,sl,fp,lr}       \n"

#else

      // Save the core registers r4-r11.
      " stmdb %[r]!, {r4-r9,sl,fp}          \n"

#endif
      : [r] "=r" (sp_) /* out */
      : /* in */
      : /* clobber. DO NOT add anything here! */
  );

  return sp_;
}


inline void
__attribute__((always_inline))
restore_from_stack (stack::element_t* sp)
{
  // Without enforcing optimisations, an intermediate variable
  // would be needed to avoid using R4, which collides with
  // the R4 in the list of ldmia.

  // register stack::element_t* sp_ asm ("r0") = sp;

  asm volatile
  (

#if defined (__VFP_FP__) && !defined (__SOFTFP__)

      // Pop the core registers r4-r11,r14.
      // R14 contains the EXC_RETURN value
      // and is restored for the next test.
      " ldmia %[r]!, {r4-r9,sl,fp,lr}       \n"
      // Is the thread using the FPU context?
      " tst lr, #0x10                       \n"
      " it eq                               \n"
      // If so, pop the high vfp registers too.
      " vldmiaeq %[r]!, {s16-s31}           \n"

#else

      // Pop the core registers r4-r11.
      " ldmia %[r]!, {r4-r9,sl,fp}          \n"

#endif

      // Restore the thread stack register.
      " msr PSP, %[r]                       \n"
      " isb                                 \n"

      : /* out */
      : [r] "r" (sp) /* in */
      : /* clobber. DO NOT add anything here! */
  );
}

The generated code (for Cortex-M3) is remarkably neat and tidy:

08000198 <PendSV_Handler>:
8000198: b500       push {lr}
800019a: f3ef 8009 mrs r0, PSP
800019e: f3bf 8f6f isb sy
80001a2: e920 0ff0 stmdb r0!, {r4, r5, r6, r7, r8, r9, sl, fp}
80001a6: f000 fe07 bl 8000db8 <os::rtos::port::scheduler::switch_stacks(unsigned long*)>
80001aa: e8b0 0ff0 ldmia.w r0!, {r4, r5, r6, r7, r8, r9, sl, fp}
80001ae: f380 8809 msr PSP, r0
80001b2: f3bf 8f6f isb sy
80001b6: bd00       pop {pc}

Static vs dynamic memory allocation

One of the initial CMSIS++ RTOS design requirements was to give the user full control over the memory allocation.

The implementation fulfilled this requirement, allowing any possible memory allocation scheme, from the simplicity of using fully static allocation to the extreme of using separate custom allocators for each object requiring dynamic memory.

The objects requiring dynamic memory are:

threads, for the stacks
message queues, for the queues (arrays of messages)
memory pools, for the pools (arrays of blocks)

All these objects have a last allocator parameter in their constructors that defaults to the system allocator memory::allocator<T>.

For example one of the thread constructors is:

using Allocator = memory::allocator<stack::allocation_element_t>;

thread (const char* name, func_t function, func_args_t args,
        const attributes& attr = initializer, const Allocator& allocator =
              Allocator ());

By default the memory::allocator<T> is defined as:

template<typename T>
  using allocator = new_delete_allocator<T>;

but the user can define it as any standard C++ allocator, and so the behaviour of all objects requiring dynamic memory can be customised at once.

Even more, each such object has a separate template version, that takes a last allocator parameter, so at the limit each such object can be allocated using a separate allocator.

Given the magic of C++, using such allocators is straightforward:

template<typename T>
  class my_allocator;

thread_allocated<my_allocator> thread { "th", func, nullptr };

message_queue_allocated<my_allocator> queue1 { "q1", 7, sizeof(msg_t) };
message_queue_typed<msg_t, my_allocator> queue2 { "q2", 7 };

memory_pool_allocated<my_allocator> pool1 { "p1", 7, sizeof(blk_t) };
memory_pool_typed<blk_t, my_allocator> pool2 { "p2", 7 };

Static allocation is handled using exactly the same method, but different templates:

thread_static<2500> thread { "th", func, nullptr };

message_queue_static<7, msg_t> queue { "q" };

memory_pool_static<7, blk_t> pool { "p" };

Tests

Writing RTOS unit tests was always tricky and the results debatable. This does not mean it should not be attempted; actually, if done properly, these tests can be very useful.

To improve testability, the synthetic POSIX platform was implemented. It allows to run most RTOS tests within a very convenient environment like macOS or GNU/Linux.

Another greatly helpful tool used to run the RTOS tests is the GNU ARM Eclipse QEMU, which emulates the STM32F4DISCOVERY board well enough for most tests to be relevant.

Actually most of the times the tests were performed either on macOS or on QEMU, and only rarely, usually at the end, as a final validation, the tests were also performed on physical hardware.

The ARM CMSIS RTOS validation suite

The main test was the ARM CMSIS RTOS validation suite, that exercises quite thoroughly the interface published in the cmsis_os.h file.

This test is automatically performed by the test scripts on the STM32F4DISCOVERY board running under GNU ARM Eclipse QEMU and on the synthetic POSIX platform.

The result of a run is:

CMSIS-RTOS Test Suite   Jun 23 2016   16:03:42

TEST 01: TC_ThreadCreate                  PASSED
TEST 02: TC_ThreadMultiInstance           PASSED
TEST 03: TC_ThreadTerminate               PASSED
TEST 04: TC_ThreadRestart                 PASSED
TEST 05: TC_ThreadGetId                   PASSED
TEST 06: TC_ThreadPriority                PASSED
TEST 07: TC_ThreadPriorityExec            PASSED
TEST 08: TC_ThreadChainedCreate           PASSED
TEST 09: TC_ThreadYield                   PASSED
TEST 10: TC_ThreadParam                   PASSED
TEST 11: TC_ThreadInterrupts              PASSED
TEST 12: TC_GenWaitBasic                  PASSED
TEST 13: TC_GenWaitInterrupts             PASSED
TEST 14: TC_TimerOneShot                  PASSED
TEST 15: TC_TimerPeriodic                 PASSED
TEST 16: TC_TimerParam                    PASSED
TEST 17: TC_TimerInterrupts               PASSED
TEST 18: TC_SignalMainThread              PASSED
TEST 19: TC_SignalChildThread             PASSED
TEST 20: TC_SignalChildToParent           PASSED
TEST 21: TC_SignalChildToChild            PASSED
TEST 22: TC_SignalWaitTimeout             PASSED
TEST 23: TC_SignalParam                   PASSED
TEST 24: TC_SignalInterrupts              PASSED
TEST 25: TC_SemaphoreCreateAndDelete      PASSED
TEST 26: TC_SemaphoreObtainCounting       PASSED
TEST 27: TC_SemaphoreObtainBinary         PASSED
TEST 28: TC_SemaphoreWaitForBinary        PASSED
TEST 29: TC_SemaphoreWaitForCounting      PASSED
TEST 30: TC_SemaphoreZeroCount            PASSED
TEST 31: TC_SemaphoreWaitTimeout          PASSED
TEST 32: TC_SemParam                      PASSED
TEST 33: TC_SemInterrupts                 PASSED
TEST 34: TC_MutexBasic                    PASSED
TEST 35: TC_MutexTimeout                  PASSED
TEST 36: TC_MutexNestedAcquire            PASSED
TEST 37: TC_MutexPriorityInversion        PASSED
TEST 38: TC_MutexOwnership                PASSED
TEST 39: TC_MutexParam                    PASSED
TEST 40: TC_MutexInterrupts               PASSED
TEST 41: TC_MemPoolAllocAndFree           PASSED
TEST 42: TC_MemPoolAllocAndFreeComb       PASSED
TEST 43: TC_MemPoolZeroInit               PASSED
TEST 44: TC_MemPoolParam                  PASSED
TEST 45: TC_MemPoolInterrupts             PASSED
TEST 46: TC_MsgQBasic                     PASSED
TEST 47: TC_MsgQWait                      PASSED
TEST 48: TC_MsgQParam                     PASSED
TEST 49: TC_MsgQInterrupts                PASSED
TEST 50: TC_MsgFromThreadToISR            PASSED
TEST 51: TC_MsgFromISRToThread            PASSED
TEST 52: TC_MailAlloc                     PASSED
TEST 53: TC_MailCAlloc                    PASSED
TEST 54: TC_MailToThread                  PASSED
TEST 55: TC_MailFromThread                PASSED
TEST 56: TC_MailTimeout                   PASSED
TEST 57: TC_MailParam                     PASSED
TEST 58: TC_MailInterrupts                PASSED
TEST 59: TC_MailFromThreadToISR           PASSED
TEST 60: TC_MailFromISRToThread           PASSED

Test Summary: 60 Tests, 60 Executed, 60 Passed, 0 Failed, 0 Warnings.
Test Result: PASSED

The mutex stress test

This test exercises the scheduler and the thread synchronisation primitives. It creates 10 threads that compete for a mutex, simulate random activities and compute statistics on how many times each thread acquired the mutex, to validate the fairness of the scheduler.

The test is automatically performed by the scripts on the STM32F4DISCOVERY board running under GNU ARM Eclipse QEMU and on the synthetic POSIX platform.

A typical result of the run is:

Mutex stress & uniformity test.
Built with GCC 5.3.1 20160307 (release) [ARM/embedded-5-branch revision 234589].
Seed 3761791254
[  5s] t0:39   t1:42   t2:37   t3:41   t4:38   t5:37   t6:36   t7:41   t8:40   t9:34   sum=385, avg=39, delta in [-5,3] [-12%,8%]
[ 10s] t0:74   t1:82   t2:79   t3:84   t4:79   t5:84   t6:77   t7:76   t8:80   t9:75   sum=790, avg=79, delta in [-5,5] [-5%,6%]
[ 15s] t0:114  t1:120  t2:116  t3:128  t4:117  t5:122  t6:114  t7:116  t8:116  t9:115  sum=1178, avg=118, delta in [-4,10] [-2%,8%]
[ 20s] t0:155  t1:161  t2:152  t3:163  t4:153  t5:160  t6:154  t7:159  t8:154  t9:154  sum=1565, avg=157, delta in [-5,6] [-2%,4%]
[ 25s] t0:196  t1:199  t2:194  t3:206  t4:193  t5:198  t6:194  t7:200  t8:197  t9:194  sum=1971, avg=197, delta in [-4,9] [-1%,5%]
[ 30s] t0:233  t1:236  t2:241  t3:245  t4:231  t5:236  t6:233  t7:237  t8:234  t9:237  sum=2363, avg=236, delta in [-5,9] [-1%,4%]
[ 35s] t0:270  t1:281  t2:277  t3:284  t4:266  t5:273  t6:279  t7:278  t8:273  t9:277  sum=2758, avg=276, delta in [-10,8] [-3%,3%]
Done.

The semaphore stress test

This test exercises the synchronisation primitives available from interrupt service routines and the effectiveness of the critical sections. It creates a high frequency hardware timer which posts to a semaphore, and a thread counts if the posts arrived in time or were late, in other words if the scheduler was or not able to wakeup the thread fast enough.

The test runs on the physical STM32F4DISCOVERY board.

A typical result of the run shows that on this platform the scheduler can stand about 250.000 context switches per second:

Semaphore stress test.
Built with GCC 5.3.1 20160307 (release) [ARM/embedded-5-branch revision 234589].

Iteration 0
Seed 832262406
  42000 cy    1 kHz
  21000 cy    2 kHz
  10500 cy    4 kHz
   5250 cy    8 kHz
   2625 cy   16 kHz
   1312 cy   32 kHz
    656 cy   64 kHz
    328 cy  128 kHz
    164 cy  256 kHz    1 late
     82 cy  512 kHz  777 late
     41 cy 1024 kHz  998 late
     20 cy 2100 kHz  999 late
     10 cy 4200 kHz  999 late

Conclusions

CMSIS++ is still a young project, and many things need to be addressed, but the core component, the RTOS, is pretty well defined and functional.

For now it may not be perfect (as it tries to be), but it definitely provides a more standard set of primitives, closer to POSIX, and a wider set of APIs than many other existing RTOSes, covering both C++ and C applications; at the same time it does its best to preserve compatibility with the original ARM CMSIS APIs.

Any contributions to improve CMSIS++ will be highly appreciated.

More info

CMSIS++ is an open source project, maintained by Liviu Ionescu. The project is released under the terms of the MIT license.

The main source of information for CMSIS++ is the project web.

The Git repositories and all public releases are available from GitHub; specifically the stress tests are available from the tests folder.

The code for ARM CMSIS RTOS validator is available from GitHub.

The code for the Cortex-M scheduler port is available from GitHub.

The code for the synthetic POSIX scheduler port is available from GitHub.

For questions and discussions, please use the CMSIS++ section of the GNU ARM Eclipse forum.

For bugs and feature requests, please use the GitHub issues.

Tools, Software and IDEs blog

Python on Arm: 2025 Update

Diego Russo

Python powers applications across Machine Learning (ML), automation, data science, DevOps, web development, and developer tooling.
- August 21, 2025
Product update: Arm Development Studio 2025.0 now available

Stephen Theobald

Arm Development Studio 2025.0 now available with Arm Toolchain for Embedded Professional.
- July 18, 2025
GCC 15: Continuously Improving

Tamar Christina

GCC 15 brings major Arm optimizations: enhanced vectorization, FP8 support, Neoverse tuning, and 3–5% performance gains on SPEC CPU 2017.
- June 26, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog