Streamline is a performance analysis tool that comes with Arm's DS-5 Development Studio (DS-5). It enables you to analyze the performance of bare-metal systems, systems with a Real Time Operating System (RTOS), and systems based on Linux, Android and Tizen.
Streamline makes it easy to optimize for Arm by giving you better insight into how your software executes, whether you are making games more immersive by pushing the envelope of what Mali GPUs can deliver, or simply finding hot-spots in source code to make an SoC as efficient as possible.
This tutorial shows how to analyze the performance of an RTOS-based system using Streamline. Streamline can reveal which functions in your system are costing the most time, and show you thread activity over time.
The example we'll use here is based on the Keil RTX version 5 RTOS. The target system used in this example is based on Arm Cortex-M33, but any Cortex-A-, Cortex-R- or Cortex-M-class device can be used. An example for Cortex-A9 is also provided in DS-5. There is a video that accompanies this tutorial.
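There are four main steps to analyzing an RTOS-based application:
1. Generate the "barman" agent code with Streamline.
2. Integrate the barman code into your application and rebuild it.
3. Run the application on a target to collect profiling data.
4. Import the profiling data into Streamline for analysis.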
To illustrate these steps, let's import a ready-made RTOS-based example that is provided in DS-5. This multi-threaded example is written in C, compiled with Arm Compiler 6, and linked with the Keil RTX5 RTOS libraries from the CMSIS Pack. To import the example, use File > Import > DS-5 > Examples & Programming Libraries. In the Import dialog, expand Examples > Streamline Bare-Metal Agent Examples, select the "RTX5_Cortex-M33_Blinky_Streamline" example, then click Finish. The referenced CMSIS Pack is imported automatically too. The Cortex-A9 version of this example can also be found here.
The first two steps below have already been done for the supplied example, but are shown in detail so that you can follow the same steps for your own project.
The first step is to use Streamline to generate the "barman" agent code for you, using its "Generate Barman sources" wizard. Launch Streamline, then select Streamline > Generate Barman sources.
The wizard's dialogs lead you through a number of implementation decisions. First, you choose how to store or transport the collected data (for example, in a Linear RAM Buffer on the target or, as in this example, over ITM), the number of cores you want to profile (a single core in this example, but multi-core is also possible), and the maximum number of tasks in the application. Next, you select the type of core(s) in your system, which is Cortex-M33 for this example. Then you select the counters you want to collect. To collect all of them, press Ctrl-A in the left-hand panel, then drag and drop them from the left-hand panel to the right-hand panel.
The next dialogs allow you to specify how often the Program Counter is sampled, to define custom counters, and finally, to save the generated "barman" source files and configuration file alongside your project. Two source files are generated, named 'barman.h' and 'barman.c', which must be compiled and linked into your program. You will also find a file named 'barman.xml' in the same location. This file contains the configuration settings specified in the wizard, allowing you to reload, edit or recreate the source files at a later date, either in the wizard or with the command line tool.
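To make the "Linear RAM Buffer" storage option a little more concrete, here is a minimal sketch of a linear capture buffer in C. It assumes that "linear" means records are appended until the reserved RAM is full, after which further records are rejected rather than overwriting older ones. The type and function names (linear_buffer_t, linear_buffer_init, linear_buffer_write) are hypothetical and are not part of the generated barman code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical linear capture buffer; NOT part of the barman API. */
typedef struct {
    uint8_t *base;      /* start of the reserved RAM region */
    size_t   capacity;  /* size of the region in bytes */
    size_t   used;      /* number of bytes written so far */
} linear_buffer_t;

/* Attach the buffer to a caller-provided block of RAM. */
static void linear_buffer_init(linear_buffer_t *buf, uint8_t *storage, size_t capacity)
{
    buf->base = storage;
    buf->capacity = capacity;
    buf->used = 0;
}

/* Append one record. Returns 1 on success, or 0 if the record does not fit.
   On failure nothing is written, so earlier records stay intact. */
static int linear_buffer_write(linear_buffer_t *buf, const void *record, size_t length)
{
    if (length > buf->capacity - buf->used) {
        return 0;
    }
    memcpy(buf->base + buf->used, record, length);
    buf->used += length;
    return 1;
}
```

With a buffer like this, the size of the reserved region limits how much profiling data one run can capture, which is one reason to stream the data off the target over ITM instead.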
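The second step is to integrate the barman code into your RTOS-based application, so that barman can collect the data samples for you. Streamline has generated barman.c and barman.h for us, so we now need to add some "plumbing" code into the application to:
- initialize barman and enable sampling
- allow barman to read the current thread ID from RTX
- allow RTX to inform barman when a task switch occurs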
A complete list of barman's functions and parameters is given in the Streamline barman public API documentation.
Add into your main code a call to:
    enable_barman(); // Enable barman
and barman support code like this:
    #include "barman.h"
    #include "rtx_lib.h" // provides osRtxThreadxxx functions and os_thread_t type

    /*
     * Perform the necessary initialization of the bare-metal agent
     */
    static void enable_barman(void)
    {
        /* For M-class, the cycle counter provides the timestamps, so convert it to nS
           by multiplying by 10**9 and dividing by the clock frequency in Hz */
        const struct bm_protocol_clock_info clock_info = {
            .timestamp_base = 0,
            .timestamp_multiplier = 1000000000,
            .timestamp_divisor = 25000000, /* 25MHz system freq of AN521 FPGA */
            .unix_base_ns = 0
        };

    #if BM_CONFIG_MAX_TASK_INFOS > 0
        const struct bm_protocol_task_info task_entries[] = {
            { (bm_task_id_t)tid_phaseA, "phaseA" },
            { (bm_task_id_t)tid_phaseB, "phaseB" },
            { (bm_task_id_t)tid_phaseC, "phaseC" },
            { (bm_task_id_t)tid_phaseD, "phaseD" },
            { (bm_task_id_t)tid_clock, "clock" },
            { (bm_task_id_t)tid_app_main, "app_main" },
            { (bm_task_id_t)(osRtxConfig.idle_thread_attr->cb_mem), "osRtxIdleThread" },
            { (bm_task_id_t)(osRtxConfig.timer_thread_attr->cb_mem), "osRtxTimerThread" },
        };
    #endif

        /* Initialize barman but if there is a problem we will loop here */
        while (!barman_initialize_with_itm_interface("RTX5 Cortex-M33 Streamline bare-metal example", &clock_info,
    #if BM_CONFIG_MAX_TASK_INFOS > 0
            /* All the tasks */
            8, task_entries,
    #endif
    #if BM_CONFIG_MAX_MMAP_LAYOUTS > 0
            /* We only have one image for all tasks so we don't need to provide these */
            0, BM_NULL,
    #endif
            1));

        /* Now we are ready to enable sampling */
        barman_enable_sampling();
    }

    /* Allow barman to read the current thread id from RTX */
    bm_task_id_t barman_ext_get_current_task_id(void)
    {
        return (bm_task_id_t)osRtxThreadGetRunning();
    }

    /* Allow RTX to inform barman when a task switch occurs */
    extern void $Super$$osRtxThreadSwitch(os_thread_t *thread);

    void $Sub$$osRtxThreadSwitch(os_thread_t *thread)
    {
        $Super$$osRtxThreadSwitch(thread); // Call the original osRtxThreadSwitch
        barman_record_task_switch(BM_TASK_SWITCH_REASON_PREEMPTED); // Record the task switch
    }
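The clock_info comment in the support code describes the timestamp conversion: multiply the cycle count by 10**9 and divide by the clock frequency in Hz. The arithmetic can be sketched as follows. cycles_to_ns is a hypothetical helper written for illustration only; it is not a barman function, and barman performs its own conversion from the multiplier and divisor you supply.

```c
#include <stdint.h>

/* Hypothetical helper, NOT part of barman: convert a raw cycle count to
   nanoseconds as cycles * multiplier / divisor, rounding down. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t multiplier, uint64_t divisor)
{
    /* Split the count into a quotient and remainder so that the
       multiplication does not overflow 64 bits for large cycle counts. */
    uint64_t whole = cycles / divisor;
    uint64_t rest  = cycles % divisor;
    return whole * multiplier + (rest * multiplier) / divisor;
}
```

With the values used for the 25MHz AN521 image (multiplier 1000000000, divisor 25000000), one cycle corresponds to 40ns, so 25000000 cycles correspond to one second. If your board runs at a different frequency, only the divisor needs to change.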
Note that the barman support code uses the Arm linker (armlink) $Sub$$/$Super$$ mechanism to patch osRtxThreadSwitch, a function within the RTOS itself, without having to modify the RTOS source code directly. The GNU linker (ld) has a similar feature, --wrap.
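For readers building with GCC, here is a rough sketch of the equivalent pattern using the GNU linker's --wrap option. When you link with -Wl,--wrap=symbol, calls to symbol are redirected to __wrap_symbol, and references to __real_symbol resolve to the original function. To keep the sketch buildable on its own, thread_t, thread_switch and record_task_switch are hypothetical stand-ins for os_thread_t, osRtxThreadSwitch and barman_record_task_switch, and a stand-in for the original function is compiled in unless USE_LINKER_WRAP (also a name made up for this sketch) is defined.

```c
#include <stddef.h>

/* Hypothetical stand-ins so that this sketch builds without RTX or barman */
typedef struct thread { int id; } thread_t;   /* stands in for os_thread_t */

static int switches_recorded = 0;

/* Stands in for barman_record_task_switch() */
static void record_task_switch(void)
{
    switches_recorded++;
}

#ifndef USE_LINKER_WRAP
/* Stand-in for the original RTOS function. In a real build, define
   USE_LINKER_WRAP and link with -Wl,--wrap=thread_switch; the linker then
   resolves __real_thread_switch to the RTOS's own thread_switch. */
static thread_t *current_thread = NULL;

void __real_thread_switch(thread_t *thread)
{
    current_thread = thread;
}
#else
void __real_thread_switch(thread_t *thread);
#endif

/* With --wrap=thread_switch, every call to thread_switch lands here */
void __wrap_thread_switch(thread_t *thread)
{
    __real_thread_switch(thread);   /* Call the original function */
    record_task_switch();           /* Then record the task switch */
}
```

As with $Sub$$/$Super$$, the RTOS source stays untouched; the redirection happens entirely at link time.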
Then rebuild the project with, for example, Arm Compiler 6. If you are using a makefile, you may need to update it to compile barman.c, then link barman.o into your executable. If you are using Eclipse's managed builder, it will do this for you automatically.
After you have rebuilt the application, you can then run it on a target to collect some profiling data. This example runs on an Arm MPS2+ board, with its FPGA programmed as a Cortex-M33. Here, DSTREAM debug hardware is used to connect to the board. DSTREAM is used for start/stop control, and also to collect profiling information over ITM, so before connecting to the target, we must configure DS-5 Debugger to collect trace information via DSTREAM and to collect ITM trace data too. After connecting, we can run the application, stop, then save the collected profiling data via ITM using a "trace dump" command. If your application stores the collected profiling data in a RAM buffer, use a "dump memory" command instead.
trace dump [path\to\a\local\folder] CSITM.
The final step is to import the profiling data into Streamline using its Import Capture wizard. The profiling data can be analyzed together with source-level symbolic information. Streamline can then generate a visualization of the captured data.
Let's take a look at the visualization in more detail.
The Timeline shows counter activity correlated with thread activity. In this example, we've added color-coded annotations into the application code to show the duration of each thread: red bars for phaseA, green for phaseB, blue for phaseC, and yellow for phaseD in the thread view.
The Call Paths tab shows the proportion of time being spent in each thread, and the functions called from each thread.
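To give a feel for how a per-thread proportion of time can be derived from task-switch records, here is a minimal sketch. It is not Streamline's implementation, and the switch_event_t record layout and function names are assumptions made for this illustration. Each record says which thread started running at a given time, and each thread is charged for the interval until the next switch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: at time_ns, the thread with index 'thread' started running */
typedef struct {
    uint64_t time_ns;
    unsigned thread;
} switch_event_t;

/* Accumulate running time per thread from time-ordered switch events.
   The last thread is charged up to end_ns. Events that name a thread
   index outside totals[] are ignored. */
static void thread_run_time(const switch_event_t *events, size_t count,
                            uint64_t end_ns, uint64_t *totals, size_t num_threads)
{
    for (size_t t = 0; t < num_threads; t++) {
        totals[t] = 0;
    }
    for (size_t i = 0; i < count; i++) {
        uint64_t stop = (i + 1 < count) ? events[i + 1].time_ns : end_ns;
        if (events[i].thread < num_threads) {
            totals[events[i].thread] += stop - events[i].time_ns;
        }
    }
}

/* Share of the capture, in whole percent, spent in one thread */
static unsigned percent_of(uint64_t part, uint64_t whole)
{
    return whole ? (unsigned)((part * 100u) / whole) : 0u;
}
```

This is why the task switch hook matters: without a record of each switch, sampled activity cannot be attributed to the thread that was running at the time.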
The Functions tab shows the time being spent in each function.
Double-click on a function to see the source code and the number of samples on each line of code.
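Per-function figures like these rest on attributing each sampled Program Counter value to the function whose address range contains it. The following sketch shows the idea under simplified assumptions: a small, hypothetical symbol table (func_sym_t) with a start address and size per function, and a flat array of sampled addresses. Streamline itself works from the symbolic information in your image, so treat this only as an illustration of the principle.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical symbol table entry: a function occupying [start, start + size) */
typedef struct {
    const char *name;
    uint32_t start;
    uint32_t size;
    unsigned samples;   /* number of PC samples that landed in this function */
} func_sym_t;

/* Find the function whose address range contains pc, or NULL if there is none */
static func_sym_t *find_function(func_sym_t *syms, size_t count, uint32_t pc)
{
    for (size_t i = 0; i < count; i++) {
        if (pc >= syms[i].start && pc - syms[i].start < syms[i].size) {
            return &syms[i];
        }
    }
    return NULL;
}

/* Attribute each sampled PC to a function. Returns the number of samples
   that matched a known function. */
static size_t attribute_samples(func_sym_t *syms, size_t nsyms,
                                const uint32_t *pcs, size_t npcs)
{
    size_t matched = 0;
    for (size_t i = 0; i < npcs; i++) {
        func_sym_t *f = find_function(syms, nsyms, pcs[i]);
        if (f != NULL) {
            f->samples++;
            matched++;
        }
    }
    return matched;
}
```

Functions that collect the most samples are the hot-spots. A higher sampling rate gives a finer picture, at the cost of more profiling data to transport.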
And finally, the Log tab gives a list of all the annotations that have been logged.
If you wish to change the selection of counter charts displayed in the Timeline, select Streamline > Generate Barman sources again and follow the prompts. This generates new barman.c, barman.h and barman.xml files. You will need to rebuild the image and run it again to collect profiling data for your new counter selection.
To summarize, we've seen how Streamline can be used to analyze the performance of an RTOS-based system, to reveal hot-spots in the RTOS or application. You can then focus your optimization efforts on those hot-spots, to improve overall performance of your system.