Analyzing the performance of RTOS-based systems using Streamline

Streamline is a Performance Analysis tool that comes with Arm's DS-5 Development Studio (DS-5). Streamline enables you to analyze the performance of bare-metal systems, systems with a Real Time Operating System (RTOS), as well as systems based on Linux, Android and Tizen.

Streamline makes it easy to optimize for Arm by giving you better insight into how your software executes, whether you are making your games more immersive and pushing the envelope of what Mali GPUs can deliver, or simply finding hot-spots in source code to make a SoC as efficient as possible.

This tutorial shows how to analyze the performance of an RTOS-based system using Streamline. Streamline can reveal which functions in your system are costing the most time, as well as showing you thread activity over time:

Streamline timeline tab

The example we'll use here is based on the Keil RTX version 5 RTOS. The target system used in this example is based on Arm Cortex-M33, but any Cortex-A-, Cortex-R- or Cortex-M-class device can be used. An example for Cortex-A9 is also provided in DS-5. There is a video that accompanies this tutorial.

There are four main steps to analyzing an RTOS-based application:

  1. Generate the data collection agent code, which we call "barman"
  2. Integrate the barman code into your RTOS-based application
  3. Run your application to collect profiling data
  4. Import the data into Streamline

To illustrate these steps, let's import a ready-made RTOS-based example that is provided in DS-5. This multi-threaded example is written in C, compiled with Arm Compiler 6, and linked with the Keil RTX5 RTOS libraries from the CMSIS Pack. To import the example, use File > Import > DS-5 > Examples & Programming Libraries. In the Import dialog, expand Examples > Streamline Bare-Metal Agent Examples, select the "RTX5_Cortex-M33_Blinky_Streamline" example, then Finish.  The referenced CMSIS Pack is imported automatically too.  The version of this example for Cortex-A9 can also be found here.

Streamline: importing DS-5 examples and programming libraries

The first two steps below have already been done for the supplied example, but are shown in detail so that you can follow the same steps for your own project.

Step 1 - Generate the data collection agent code

The first step is to use Streamline to generate the "barman" agent code for you, using its "Generate Barman sources" wizard. Launch Streamline, then select Streamline > Generate Barman sources:

Streamline: Generate Barman sources

The wizard's dialogs lead you through a number of implementation decisions, including how you want to store or transport the collected data (e.g. whether to store it in a Linear RAM Buffer on the target or, as in this example, to collect data over ITM); the number of cores you want to profile (just a single core in this example, but multi-core is also possible); and the maximum number of tasks in the application:

Barman Generator Wizard

...the type of core(s) in your system - Cortex-M33 for this example:

Barman Generator Wizard: Select processors to target

...and the counters you want to collect. To collect all of them, press Ctrl-A in the left-hand panel, then use the mouse to drag and drop them from the left-hand panel to the right-hand-panel:

Barman Generator Wizard: Select events to trace

The next dialogs allow you to specify how often the Program Counter is sampled, to define custom counters, and finally, to save the generated "barman" source files and configuration file alongside your project. Two source files are generated - named 'barman.h' and 'barman.c' - which must be compiled and linked into your program. You will also find a file named 'barman.xml' in the same location. This file contains the configuration settings specified in this wizard, allowing you to reload, edit or recreate the source files at a later date, or with the command line tool:

Barman source files

Step 2 - Integrate the barman code into your RTOS-based application

The second step is to integrate the barman code into your RTOS-based application, so that barman can collect the data samples for you. Streamline has generated barman.c and barman.h for us, so we now need to add some "plumbing" code into the application to:

  • initialize and enable barman
  • allow barman to read the current thread id from the RTOS
  • allow the RTOS to inform barman when a task switch occurs. 

A complete list of barman's functions and parameters is given in the Streamline barman public API documentation.

Add into your main code a call to:
  enable_barman();                      // Enable barman

and barman support code like this:

#include "barman.h"
#include "rtx_lib.h" // provides osRtxThreadxxx functions and os_thread_t type

/**
 * Perform the necessary initialization of the bare-metal agent
 */
static void enable_barman(void)
{
    /* For M-class, the cycle counter provides the timestamps, so convert it to ns by multiplying by 10**9 and dividing by the clock frequency in Hz */
    const struct bm_protocol_clock_info clock_info = { .timestamp_base = 0,
                                                       .timestamp_multiplier = 1000000000,
                                                       .timestamp_divisor = 25000000, /* 25MHz system freq of AN521 FPGA */
                                                       .unix_base_ns = 0 };

    const struct bm_protocol_task_info task_entries[] =
    {
        { (bm_task_id_t)tid_phaseA,   "phaseA" },
        { (bm_task_id_t)tid_phaseB,   "phaseB" },
        { (bm_task_id_t)tid_phaseC,   "phaseC" },
        { (bm_task_id_t)tid_phaseD,   "phaseD" },
        { (bm_task_id_t)tid_clock,    "clock" },
        { (bm_task_id_t)tid_app_main, "app_main" },
        { (bm_task_id_t)(osRtxConfig.idle_thread_attr->cb_mem), "osRtxIdleThread" },
        { (bm_task_id_t)(osRtxConfig.timer_thread_attr->cb_mem), "osRtxTimerThread" },
    };

    /* Initialize barman but if there is a problem we will loop here */
    while (!barman_initialize_with_itm_interface("RTX5 Cortex-M33 Streamline bare-metal example", &clock_info,
                              /* All the tasks */
                              8, task_entries,
                              /* We only have one image for all tasks so we don't need to provide these */
                              0, BM_NULL,
                              /* The remaining argument(s) are missing from this listing; check the prototype in your generated barman.h */
                              ))
        ;

    /* Now we are ready to enable sampling */
    barman_enable_sampling();
}

/* Allow barman to read the current thread id from RTX */
bm_task_id_t barman_ext_get_current_task_id(void)
{
    return (bm_task_id_t)osRtxThreadGetRunning();
}

/* Allow RTX to inform barman when a task switch occurs */
extern void $Super$$osRtxThreadSwitch(os_thread_t *thread);

void $Sub$$osRtxThreadSwitch(os_thread_t *thread)
{
    $Super$$osRtxThreadSwitch(thread);                               // Call the original osRtxThreadSwitch
    barman_record_task_switch(BM_TASK_SWITCH_REASON_PREEMPTED);      // Record the task switch
}

Note the use of the Arm Linker (armlink) $Sub$$/$Super$$ mechanism here, which allows us to patch a function within the RTOS itself without having to modify its source code directly. The GNU linker (ld) has a similar feature, "--wrap".
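
To make the clock_info settings in enable_barman() concrete: with a multiplier of 10**9 and a divisor of 25,000,000, each cycle of the 25MHz clock corresponds to 40 ns. The following is a minimal sketch of that arithmetic only. cycles_to_ns is a hypothetical helper written for this illustration, not a barman function, and it assumes the conversion is a plain multiply-then-divide, as the comment in the code above describes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of barman): converts a raw cycle count
 * to nanoseconds as (cycles * multiplier) / divisor.
 * The intermediate product is held in 64 bits, so it can still overflow
 * for very large cycle counts.
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t multiplier, uint64_t divisor)
{
    assert(divisor != 0); /* a zero clock frequency would be a configuration error */
    return (cycles * multiplier) / divisor;
}
```

With the values used in this example, cycles_to_ns(1, 1000000000, 25000000) gives 40, and one second's worth of cycles (25000000) gives 1000000000 ns. If your target runs at a different frequency, only the divisor needs to change.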

Then rebuild the project with, for example, Arm Compiler 6. If you are using a makefile, you may need to update it to compile barman.c, then link barman.o into your executable. If you are using Eclipse's managed builder, it will do this for you automatically.
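
One detail to check when adapting enable_barman() to your own project: the task count passed to barman_initialize_with_itm_interface (8 in this example) has to match the number of entries in task_entries. One way to keep the two in step is to derive the count from the array itself. The sketch below shows that idiom with stand-in types. demo_task_info, demo_task_entries, demo_task_count and ARRAY_COUNT are hypothetical names used only for illustration; in a real project the array would be the bm_protocol_task_info table, and the ids would come from the RTOS.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the task record type; the real definition comes from the generated barman.h */
struct demo_task_info {
    uintptr_t   task_id;   /* in the real code: the RTOS thread id cast to bm_task_id_t */
    const char *task_name;
};

/* Number of elements in a true array (must not be used on a pointer) */
#define ARRAY_COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Placeholder ids 1..8 stand in for the thread ids used in the example */
static const struct demo_task_info demo_task_entries[] = {
    { 1u, "phaseA" },
    { 2u, "phaseB" },
    { 3u, "phaseC" },
    { 4u, "phaseD" },
    { 5u, "clock" },
    { 6u, "app_main" },
    { 7u, "osRtxIdleThread" },
    { 8u, "osRtxTimerThread" },
};

/* The count follows the table automatically when entries are added or removed */
static size_t demo_task_count(void)
{
    return ARRAY_COUNT(demo_task_entries);
}
```

Passing a computed count in place of the literal 8 means that adding a thread to the table cannot leave the count stale.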

Step 3 - Run your application to collect profiling data

After you have rebuilt the application, you can then run it on a target to collect some profiling data. This example runs on an Arm MPS2+ board, with its FPGA programmed as a Cortex-M33. Here, DSTREAM debug hardware is used to connect to the board. DSTREAM is used for start/stop control, and also to collect profiling information over ITM, so before connecting to the target, we must configure DS-5 Debugger to collect trace information via DSTREAM and to collect ITM trace data too. After connecting, we can run the application, stop, then save the collected profiling data via ITM using a "trace dump" command. If your application stores the collected profiling data in a RAM buffer, use a "dump memory" command instead.

  1. Launch DS-5 Debugger
  2. Select Run > Debug Configurations....
  3. In the Debug Configurations dialog, create a debug connection to the target. For this example, expand the list of DS-5 Debugger configurations on the left-hand side, and select RTX5_Cortex-M33_Blinky_Streamline_MPS2.
  4. In the Connections panel, enter the USB: or TCP: IP address or name of your DSTREAM unit in the Debug Hardware Address field, or click on Browse to select one from a list
  5. To configure ITM, click on the DTSL Options Edit... button. In the Trace Capture tab select DSTREAM 4GB Trace Buffer, and in the ITM tab tick Enable CSITM Trace. Click on OK to save the DTSL options.
  6. Click on Debug to start debugging. The example executable will be downloaded to the target, and the program counter PC will be set to the entry point of the image.
  7. Debugging requires the DS-5 Debug perspective. If the Confirm Perspective Switch dialog box opens, click on Yes to switch perspective.
  8. Run the executable by clicking on the green Continue button in the Debug Control view, or by pressing F8 on the keyboard.
  9. As it runs, the DSTREAM is capturing ITM trace information. You should see the Buffer Used size increasing in the Trace view.
  10. After a few seconds, stop execution by clicking on the yellow Interrupt button in the Debug Control view, or by pressing F9.
  11. In the Commands view, enter: trace dump [path\to\a\local\folder] CSITM.

Step 4 - Import the data into Streamline

The final step is to import the profiling data into Streamline using its Import Capture wizard. The profiling data can be analyzed against source-level symbolic information from the executable image. Streamline can then generate a visualization of the captured data.

  1. Launch Streamline
  2. In its Streamline Data view, click on the Import Capture File(s)... button, and select CSITM_0.bin.
  3. In Select what to Import select Barman Agent Capture (via ITM), Next.
  4. In Provide Required Files enter the location of the executable image (RTX5_Cortex-M33_Blinky_Streamline.axf in this example) and barman.xml, then click Finish.
  5. In the Streamline Data view, right click and select Analyze.... Tick the executable image (RTX5_Cortex-M33_Blinky_Streamline.axf in this example), then click Analyze.
  6. After a few moments processing the data, Streamline presents a Timeline view containing charts of counters selected in barman.xml.

Let's take a look at the visualization in more detail.

The Timeline shows counter activity correlated with thread activity. In this example, we've added color-coded annotations into the application code to show the duration of each thread - see the red phaseA bars, green phaseB bars, blue phaseC bars, and yellow phaseD bars in the thread view below:

Streamline: Timeline showing counter activity

The Call Paths tab shows the proportion of time being spent in each thread, and the functions called from each thread:

Streamline: Call Paths tab

The Functions tab shows the time being spent in each function:

Streamline Functions tab

Double-click on a function to see the source code and the number of samples on each line of code:

Streamline Code tab
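
The sample counts shown against each function and source line can be turned into rough figures by hand. The following is a minimal sketch, assuming the Program Counter is sampled at a fixed rate (the rate chosen in the Generate Barman sources wizard). sample_share_percent and estimated_time_ms are hypothetical helpers, not part of Streamline or barman, and Streamline's own figures may be computed differently.

```c
#include <assert.h>
#include <stdint.h>

/* Share of all samples that landed in one function or line, as a percentage */
static double sample_share_percent(uint64_t fn_samples, uint64_t total_samples)
{
    assert(total_samples != 0);
    return 100.0 * (double)fn_samples / (double)total_samples;
}

/* Rough time attributed to those samples, assuming one sample every 1/sample_rate_hz seconds */
static double estimated_time_ms(uint64_t fn_samples, double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return 1000.0 * (double)fn_samples / sample_rate_hz;
}
```

For instance, 250 samples out of 1000 is a 25% share, and at an assumed sampling rate of 1000 Hz those 250 samples correspond to roughly 250 ms. Because this is statistical sampling, lines with only a handful of samples should be read as indicative rather than exact.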

And finally, the Log tab gives a list of all the annotations that have been logged:

Streamline Log tab

If you wish to change the selection of counter charts displayed in the Timeline, you can do so via Streamline > Generate Barman sources, and follow the prompts. This will generate new barman.c, barman.h and barman.xml files. You will need to rebuild the image and run it again to collect the profiling data for your new counter selection.

To summarize, we've seen how Streamline can be used to analyze the performance of an RTOS-based system, to reveal hot-spots in the RTOS or application.  You can then focus your optimization efforts on those hot-spots, to improve overall performance of your system.