Introduction to Statistical Profiling Support in Streamline

January 7, 2020

Starting with version 7.0, Arm Streamline Performance Analyzer (Streamline) supports profiling using the Statistical Profiling Extension (SPE). SPE is an optional extension to the Armv8.2-A architecture that allows low probe effect hardware sampling of the pipeline of the processor. Streamline is available as part of Arm Development Studio and Arm Mobile Studio.

Why SPE?

Previous versions of Streamline could only collect CPU performance information via hardware counters, and could only sample the Program Counter via a software interrupt. Because hardware counters only provide aggregate counts, it is impossible to determine which specific instructions caused the event being counted. Counters can only be attributed to some region of the application, and that region is relatively large. The extent to which developers can isolate problems is therefore limited. Similarly, the rate at which the Program Counter or Call Stack can be sampled is limited, because sampling and unwinding are done in software using a timer interrupt. The Statistical Profiling Extension overcomes these issues by sampling the Program Counter periodically in hardware as part of the CPU’s pipeline. As a result, there is almost no overhead, so the sampling rate can be set much higher. Because SPE is built into the processor’s pipeline, it can also collect additional information directly about each sampled instruction. This information allows much more detailed analysis of the executed code.

Tool Support

Streamline supports collecting SPE data alongside other performance counters in both system-wide and application tracing modes.

It supports visualizing the following SPE data:

- Latency counter packets, which provide issue and total instruction execution latency counts that can be used to identify execution stalls. These packets also provide load/store latency counts for memory accesses, which can be used to identify high-latency memory accesses and poor cache use.
- Event packets, which provide important information about each sampled instruction, including:
  - Whether the instruction accessed, hit, or missed some level of cache
  - Whether it was a mispredicted or not-taken branch
  - Whether an exclusive load/store failed
  Event packets can be used to identify issues such as branch prediction problems, poor cache use, and lock contention.
- Data source packets, which provide information about the level of the memory hierarchy that a load or store accessed.

This data is visualized both in the timeline view, showing the trace for different events over time, and as an extension to the Call Paths, Functions and Code views. This allows the user to drill down to the per-thread, per-function, per-source-line and per-instruction level.

Prerequisites

Using SPE requires hardware with the appropriate extension, and a sufficiently recent Linux kernel with the arm_spe_pmu module enabled and support for SPE in the device tree or UEFI. In addition, SPE currently requires KPTI to be disabled (boot with the kpti=off kernel command-line argument). On future Arm CPUs this requirement is likely to be removed.

To check whether the kernel supports SPE, look for the file /sys/bus/event_source/devices/arm_spe_0. To check whether KPTI is enabled, look in the output of dmesg for the line Kernel/User page tables isolation: enabled, or check /sys/devices/system/cpu/vulnerabilities/meltdown, which will contain Mitigation: PTI if it is enabled.
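
The two file-based checks above can be scripted. The following is a minimal sketch, not part of Streamline or gatord. It assumes the SPE device is listed under /sys/bus/event_source/devices, which is where the Linux kernel exposes perf PMU devices, and it assumes the device name starts with arm_spe_. The function names are hypothetical.

```python
from pathlib import Path

# Assumed locations: perf PMU devices are listed under this sysfs directory,
# and the meltdown file is the one named in the text above.
SPE_PMU_DIR = Path("/sys/bus/event_source/devices")
MELTDOWN_FILE = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")


def spe_pmu_present(pmu_dir: Path = SPE_PMU_DIR) -> bool:
    """Return True if an arm_spe_* PMU device is registered with the kernel."""
    return pmu_dir.is_dir() and any(pmu_dir.glob("arm_spe_*"))


def kpti_enabled(meltdown_file: Path = MELTDOWN_FILE) -> bool:
    """Return True if the meltdown status file reports the PTI mitigation."""
    try:
        return "Mitigation: PTI" in meltdown_file.read_text()
    except OSError:
        # File missing or unreadable: KPTI is not reported as enabled.
        return False


if __name__ == "__main__":
    print("SPE PMU present:", spe_pmu_present())
    print("KPTI enabled:   ", kpti_enabled())
```

Run on the target, both lines should report what SPE needs: the PMU present and KPTI not enabled. The dmesg check is left out because reading the kernel log usually needs extra privileges.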

To use SPE, you will need access to a device that has the SPE feature, or to an Arm model such as a Fixed Virtual Platform. For the purposes of this document, we are using the Arm Neoverse N1 SDP. If you wish to test on a model, the FVP_Base_RevC-2xAEMv8A Fixed Virtual Platform from Arm can be used.

Arm Mobile Studio fully supports SPE, but because there are currently no consumer Android devices that include the SPE capability, this document will focus on using Streamline with Arm Development Studio.

Neoverse N1 SDP Configuration

In this document, I am using the Arm Neoverse N1 Software Development Platform, running a basic Linux environment with Linux kernel 5.4.1. Beyond the normal kernel configuration options that are required to enable Streamline data collection, and in particular to enable SPE support as detailed previously, no special configuration is required.

The Neoverse N1 SDP is an unreleased infrastructure segment development platform currently only available to Early Access customers. The platform ships with a Neoverse N1 processor from Arm, the first to support the Statistical Profiling Extension.

FVP Configuration

If you would like to try this on one of Arm's Fixed Virtual Platform models, it can be configured to enable SPE support by passing the following additional parameters:

-C cluster0.has_armv8-2=1 -C cluster0.has_statistical_profiling=1 -C cluster1.has_armv8-2=1 -C cluster1.has_statistical_profiling=1
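
If the model is launched from a script, the parameters can be generated per cluster. This is a small sketch with a hypothetical helper name, using only the two parameter names shown above. Whether cluster indices other than 0 and 1 exist depends on the model and is an assumption.

```python
def spe_fvp_params(clusters=(0, 1)):
    """Build the -C options that enable Armv8.2 and SPE for each cluster index."""
    args = []
    for n in clusters:
        args += ["-C", f"cluster{n}.has_armv8-2=1"]
        args += ["-C", f"cluster{n}.has_statistical_profiling=1"]
    return args


# Prints the same parameter string as shown above for clusters 0 and 1.
print(" ".join(spe_fvp_params()))
```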

Because the FVP does not model execution time, all instructions complete in a single cycle and so all latency counters report as zero on this target. Likewise, the FVP does not model the branch predictor and so all branches are reported as being predicted correctly.

Cache modelling can be enabled. If it is, the SPE cache-related events that are generated reflect the modelled cache behavior.

Getting Started

On the target, gatord is launched as usual as the root user. This will perform a system-wide capture. Arm Development Studio users can launch gatord with no additional command-line arguments:

/path/to/gatord

It is also possible to launch gatord in application tracing mode and collect SPE data:

/path/to/gatord --system-wide no --app <some-app-to-launch>

Streamline is then launched and connected to the target. In the ‘Counter Configuration’ window, the Arm Statistical Profiling Extension can be seen and configured.

Figure 1. Counter Configuration Dialog showing the SPE configuration settings.

The dialog allows the user to configure filtering of samples based on the operation type, filterable events, and minimum total latency. By default, all operations are sampled, but sampling can be limited to any combination of branch, load, or store operations. Likewise, certain events can be selected to further reduce which operations are sampled.

The filter settings can be used to limit the data collected based on the kind of problem being addressed. For example, the Minimum Total Latency filter can be used to identify long running operations resulting from accessing memory instead of cache. The Mispredicted event can be used to find only branches that trigger branch mispredictions.

Although the hardware allows it, Streamline does not currently allow the user to configure the sample rate. The rate is hard-coded in gatord to be every 100,000 operations. Likewise, the optional random perturbation feature is always enabled. Support for configuring the sample rate will appear in the next version of Streamline (7.2).

Note: Bear in mind that SPE counts down operations for selection before filtering happens. In other words, SPE selects every ‘n’th operation and records it only if it matches the filter, rather than first filtering operations and then sampling every ‘n’th operation that matches.
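
The difference between the two orderings can be seen in a toy model. This is an illustrative sketch only: the operation stream and function names are made up, the interval is tiny, and the random perturbation that the hardware applies to the interval is ignored. Each function returns the 1-based stream positions of the operations that would be recorded.

```python
def sample_then_filter(ops, interval, wanted):
    """SPE-style ordering: select every interval-th operation, then keep it
    only if it matches the filter."""
    return [pos for pos, op in enumerate(ops, start=1)
            if pos % interval == 0 and op in wanted]


def filter_then_sample(ops, interval, wanted):
    """The ordering SPE does not use: keep only matching operations, then
    select every interval-th one of those."""
    matching = [pos for pos, op in enumerate(ops, start=1) if op in wanted]
    return matching[interval - 1::interval]


# Hypothetical stream in which every third operation is not a load.
ops = ["load", "load", "alu"] * 3

print(sample_then_filter(ops, 3, {"load"}))  # []
print(filter_then_sample(ops, 3, {"load"}))  # [4, 8]
```

In this deliberately awkward stream, every selected operation is an "alu" operation, so the SPE-style ordering records nothing even though six loads were executed. Real workloads are rarely this regular, but the example shows that a selection slot is used up whether or not the selected operation passes the filter.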

When making a ‘local capture’ to a file on the device (rather than through the Streamline UI), the --spe argument can be given to control the SPE capture configuration.

Examining the Data

The following image shows some of the additional SPE counters added to the timeline view. The various properties that are found in the SPE 'events packet' are extracted into a set of stacked charts. These charts show the ratio of counted versus not counted samples for a particular property. The cumulative total for each chart gives the total number of samples for which the property was relevant. Some properties are only relevant to particular types of instruction, such as branches, loads, or stores.

The Architecturally retired chart, for example, shows the number of operations sampled that retired versus those that did not. The total gives the number of operations sampled that were speculatively executed. Likewise, the Level 1 Data Cache Access chart shows the number of load/store operations that were sampled and that accessed the L1 Data Cache, giving the ratio of those that hit the cache versus those that missed it.

Note: These charts are not available during live mode but will be added after the analysis completes.

The timeline view gives an overview of the behavior of the process(es) being sampled. It can be used to find regions of the capture that can then be investigated in more detail using the profiling views (the Call Stacks, Functions and Code tabs).

By default, the data shown in the profiling views shows the cumulative totals of each counter or event across the whole capture. When an interesting region is found, the calipers can be used to filter the data shown in the profiling views.

Figure 2. Some interesting range is selected using the cross-section marker.
This image shows SPE counters for a simple benchmark application that has poor L2 cache utilisation.
This capture is using an experimental N1 SPE chart template.

Drilling Down

Streamline has a number of views that can be used to gain a deeper understanding of the profiled application's behaviour. These include the Call Paths, Functions, and Code tabs.

Call Paths Tab

The Call Paths view shows call stack hierarchies organized by process and thread. Because SPE only samples the program counter and not call stacks, the functions that are shown form a flat list within each process. By default, the view will show the traditional periodic samples data. To display the SPE data instead, select SPE from the drop-down list at the top of the tab.

Figure 3. Select the data to show in the Call Paths tab using the drop-down list.

Event Packets are displayed as a ratio showing the number of counted-vs-not-counted samples collected for each function, and cumulative totals per thread and per process. By menu-clicking on the header of the table, it is possible to extract out individual parts of the ratio into separate columns to allow sorting. The individual columns show the count and percentage of the overall total. By sorting like this, it is possible, for example, to identify which functions contribute the most cache misses, or which have the worst individual branch misprediction rates.
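
The per-function count and percentage columns can be approximated with a simple aggregation. The sketch below is not how Streamline computes its tables. It assumes each sample has already been reduced to a hypothetical (function, counted) pair, where counted is True when the event of interest (for example, an L1 Data Cache miss) was flagged for that sample.

```python
from collections import Counter


def event_table(samples):
    """Return rows of (function, counted, not_counted, percent_of_all_counted),
    sorted with the largest counted value first."""
    counted, not_counted = Counter(), Counter()
    for func, flagged in samples:
        (counted if flagged else not_counted)[func] += 1
    total = sum(counted.values())
    rows = []
    for func in set(counted) | set(not_counted):
        c, n = counted[func], not_counted[func]
        share = 100.0 * c / total if total else 0.0
        rows.append((func, c, n, share))
    # Sort by counted samples (descending), then by name for a stable order.
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows


# Hypothetical samples: function name and whether the event was counted.
samples = [("memcpy", True), ("memcpy", False), ("parse", True),
           ("parse", True), ("parse", False), ("memcpy", False),
           ("hash", False)]

for func, c, n, share in event_table(samples):
    print(f"{func:8s} counted={c} not_counted={n} share={share:.1f}%")
```

Sorting on the counted column corresponds to finding the functions that contribute the most events. Sorting on c / (c + n) instead would rank functions by their individual rate.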

Figure 4. Displayed columns can be configured from the header menu.

Figure 5. Call Path data sorted by functions having the most L1 Data Cache misses.

Latency counters are displayed as a histogram of log₂(latency). Data is binned for latency values of 0, of 1, of 2-3, of 4-7, of 8-15 and so on up to the maximum collected.
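
The bin boundaries listed above follow the bit length of the latency value. The following sketch reproduces that binning under this assumption. The function names are hypothetical, and Streamline's own implementation may differ.

```python
from collections import Counter


def latency_bin(latency: int) -> int:
    """Map a latency to its bin index: 0 -> 0, 1 -> 1, 2-3 -> 2, 4-7 -> 3, ..."""
    if latency < 0:
        raise ValueError("latency must be non-negative")
    # bit_length() is 0 for 0, and floor(log2(v)) + 1 for v >= 1.
    return latency.bit_length()


def bin_label(index: int) -> str:
    """Human-readable range covered by a bin index."""
    if index <= 1:
        return str(index)
    return f"{1 << (index - 1)}-{(1 << index) - 1}"


def latency_histogram(latencies):
    """Return (label, count) pairs from bin 0 up to the highest bin seen."""
    counts = Counter(latency_bin(v) for v in latencies)
    if not counts:
        return []
    return [(bin_label(i), counts[i]) for i in range(max(counts) + 1)]


# Hypothetical latency values, in cycles.
print(latency_histogram([0, 1, 2, 3, 5, 9, 15, 16]))
```

For the example values, this gives one sample each in bins 0, 1, 4-7, and 16-31, and two samples each in bins 2-3 and 8-15.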

Figure 6. Example of what the Total latency column might look like.
Darker colours indicate higher values with respect to the process total, taller bars indicate higher values with respect to the row.

The bottom half of the Call Paths tab shows the children view. For periodic call stack sampling, it would show all children of a selected item in the hierarchy, giving percentages out of the total for the selected item. Given that the SPE data only shows a flat list of functions within the process-thread hierarchy, this section is only useful for showing the percentage totals relative to a parent thread or parent process.

Figure 7. The Call Path children view, showing all functions for a particular thread.

Functions Tab

The Functions tab shows a flat list of all sampled functions across all processes. Like the Call Paths tab, the Functions tab has a drop-down box for selecting the data to show. Likewise, individual columns can be added or removed by menu-clicking the header.

Figure 8. The Functions tab, showing SPE data sorted by L1 Data Cache Misses.

Code Tab

The Code tab shows source code and disassembly. To view a function in this view, double-click the function in the Call Paths or Functions tabs, or use its context menu.

Unlike the previous two tabs, this view only shows one counter column at a time. To configure the item to view, select from the two drop-down boxes at the top of the view.

Figure 9. Selecting the data to display in the Code view using the pair of drop-downs.

The disassembly view shows counters per instruction, so it is possible, for example, to see per-instruction latency counts.

Figure 10. Disassembly view, showing the Total Latency for an artificial L1 data cache benchmark.
With this particular counter we can see clearly the data dependency between the CBNZ and LDR instructions.

If the analyzed program images have debug information available that provides line number information, and if the sources are available on the machine running Streamline, then the top half of the Code tab will show source code of the selected function. Again, individual counters are displayed, line by line. This makes it possible to identify individual blocks within the function that exhibit a particular trait.

Figure 11. Source Code view, showing the same Total Latency counter as given in Figure 10.

Next Steps

To get started with Statistical Profiling support in Streamline, upgrade to the latest version of Arm Development Studio (2019.1 / 2019.b) which is available here.

To learn about the Neoverse N1 and find information about reference platforms, see our announcement here.
