The Statistical Profiling Extension is an optional feature in ARMv8.2. This article will provide an overview of the Extension, describe how it works, and the advantages it provides over other profiling mechanisms.
Recently, Will Deacon posted a request for comment (RFC) on support for the Statistical Profiling Extension to ARMv8-A, in the form of a Linux perf PMU driver. One of the first, and most obvious, questions posted in response was:
"Can you give a little high level overview of what exactly [the Statistical Profiling Extension] is?"
I am grateful to Will for writing a high-level overview of the Statistical Profiling Extension in response to this question; one which I shall now unashamedly plagiarize – and expand on – for this article.
Before diving into the details of the Statistical Profiling Extension, let's first ask: what is profiling? Wikipedia provides a high-level definition of profiling in computer programming:
… a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization.
Software engineers optimize their software to reduce execution time and improve efficiency, allowing better utilization of the available resources and extracting the true performance of the system.
It’s relatively easy to measure the execution time of a program using a stopwatch, or power consumption using a meter. So why profile? The answer is that, to improve execution time, the engineer has to understand why the program takes as long as it does.
Traditional techniques use performance counters to achieve this. These performance counters count particular events in the processor, such as cache misses or branch prediction failures, and from the number of these events the engineer might be able to characterize their program. The problem is that programs are very large, and processors can execute billions of instructions every second. To gain useful insights into program execution, and to direct optimization efforts to the right places, we must:

- reduce the volume of data that is collected, and
- measure the operations we do examine in more detail.
The Statistical Profiling Extension employs a method for profiling called sampling. Again, Wikipedia helps us out with a definition:
… sampling is concerned with the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population.
Sampling addresses both these issues: the dataset is reduced, and sampling enables more detailed measurement of each sample than would be possible when examining the whole population.
This is something that is possible with today’s hardware. By using a timer to sample the program counter and hardware performance monitor (HPM) event counters at regular intervals (that is, time-based sampling), a profiling tool can find hot-spots in programs and examine how events vary over time, during the different phases of the program.
Figure 1: Time-based event sampling with ARM DS-5 Streamline Analyzer
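For readers who want to see what time-based sampling looks like in code, here is a minimal sketch using the Linux perf_event_open(2) system call. The event choice, sampling frequency, and (omitted) error handling are illustrative; tools like Streamline and perf set all of this up for you.

```c
/* A minimal sketch of time-based sampling on Linux via perf_event_open(2).
 * Samples the program counter ~1000 times per second for this thread. */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_time_sampler(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size        = sizeof(attr);
    attr.type        = PERF_TYPE_SOFTWARE;
    attr.config      = PERF_COUNT_SW_CPU_CLOCK;  /* a timer, not an HPM event */
    attr.freq        = 1;
    attr.sample_freq = 1000;                     /* target: 1000 samples/sec  */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
    attr.disabled    = 1;                        /* enable explicitly later   */

    /* pid = 0, cpu = -1: profile the calling thread on any CPU. */
    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}
```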
It is also possible to use event-based sampling: the tool programs an event counter to generate an interrupt after a set number of events, for example every 10,000 cache misses, and the interrupt handler records the program counter.
However, this suffers from the problems of skid and blind spots. It takes time for the counter to increment for the event, generate the interrupt, and for the processor to take the interrupt, during which the processor has advanced past the instruction that generated the event. For high-performance processors, this skid can be hundreds of instructions, meaning the data is not as precise as the engineer wanted. Furthermore, if interrupts are masked at this point, the interrupt cannot be taken until they are unmasked, creating blind spots. Event-based sampling can determine roughly where an event occurred, but not precisely where.
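Event-based sampling uses the same system call, configured with a period instead of a frequency. The perf_event_attr.precise_ip field lets a tool ask the kernel for a lower-skid sampling mechanism where the hardware provides one; the event and period below are illustrative.

```c
/* Sketch: event-based sampling of L1 data-cache read misses. The sampled
 * PC can "skid" past the faulting instruction; precise_ip requests a
 * lower-skid mechanism if the hardware and kernel support one. */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_event_sampler(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size          = sizeof(attr);
    attr.type          = PERF_TYPE_HW_CACHE;
    attr.config        = PERF_COUNT_HW_CACHE_L1D |
                         (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                         (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.sample_period = 10000;          /* one sample per 10,000 misses */
    attr.sample_type   = PERF_SAMPLE_IP;
    attr.precise_ip    = 2;              /* 0 = arbitrary skid ... 3 = none */
    attr.disabled      = 1;

    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}
```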
So how does the Statistical Profiling Extension help with this? To answer that, let’s start by explaining what Statistical Profiling does. (And here is where the plagiarism of Will’s article begins.)
The first problem for sampling is the definition of the population. For the Statistical Profiling Extension, the processor selects from the population of operations that are executed by the processor pipeline, typically after decoding instructions. Depending on the implementation, these are either:

- individual architected instructions, or
- the micro-operations that those instructions are decoded into.
This is key: if you don’t understand what the population is you are sampling from, then when you extrapolate from the sample you will get misleading results.
Sampling itself is controlled using a sampling interval, similar to a regular event counter, but also with an optional random perturbation to avoid falling into patterns where you continuously profile the same instruction in a hot loop.
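To make that concrete, here is an illustrative software model (not the hardware mechanism itself, whose perturbation is implementation defined) of how a jittered interval counter prevents the sampler from locking on to one instruction in a loop:

```c
/* Illustrative model of a jittered sampling interval: on each reload, a
 * small pseudo-random value is subtracted from the programmed interval,
 * so sampling cannot align with a loop whose length divides the interval. */
#include <stdint.h>
#include <stdlib.h>

static uint64_t reload_countdown(uint64_t interval, int random_enabled)
{
    /* Low-order jitter only; the real perturbation is up to the hardware. */
    uint64_t jitter = random_enabled ? (uint64_t)(rand() & 0xff) : 0;

    return interval - jitter;
}
```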
When sampling is enabled:

- An interval counter counts down as operations pass through the pipeline, and when it reaches zero the next operation is selected as the sample.
- Data about the sampled operation is collected as it executes, and written out as a record to a memory buffer.
- The processor raises an interrupt when the memory buffer fills up. (The interrupt will also be raised if there is an MMU or RAS-related fault when writing to the buffer.)
Because sampling is systematic (that is, operations are sampled at regular intervals) rather than purely random, the implementation can assume there is only a single sampled operation in flight at one time.
Therefore there only needs to be a single set of resources in the processor to collect the sample record.
The processor only has to keep track of which instruction is sampled. Each component part within the processor pipeline can autonomously record its sample data, and then, when the instruction is complete, the processor collates and writes out the completed instruction record.
Conversely, collecting this data for all instructions in flight would require massive amounts of storage with expensive indexing mechanisms to keep track of it.
The profiling tool has to allocate the memory buffer and set the profiling interval. The memory buffer is linear and virtually addressed. The interval must be large enough to satisfy the “one instruction in flight” rule. (The processor does check and flag to software if this rule is violated, but if there are too many violations this might lead to systematic sampling bias.)
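To give a flavor of what this looks like at the register level, the sketch below programs the buffer pointer, buffer limit, and interval reload registers named in the SPE specification. It is simplified (no fault handling, EL1 only), the field layouts should be checked against the specification for your implementation, and it assumes a toolchain invoked with -march=armv8.2-a+profile so the assembler accepts the register names.

```c
/* Sketch: programming the SPE profiling buffer and sampling interval
 * from EL1, using register names from the SPE specification. Assumed
 * layouts: PMSIRR_EL1.INTERVAL at bits [31:8], RND at bit [0]. */
#include <stdint.h>

#define PMBLIMITR_EL1_E  (1UL << 0)   /* profiling buffer enable */
#define PMSIRR_EL1_RND   (1UL << 0)   /* randomize the interval  */

static void spe_start(uint64_t base, uint64_t limit, uint64_t interval)
{
    asm volatile("msr pmbptr_el1, %0"    :: "r"(base));   /* write pointer */
    asm volatile("msr pmblimitr_el1, %0" :: "r"(limit | PMBLIMITR_EL1_E));
    asm volatile("msr pmsirr_el1, %0"    :: "r"((interval << 8) | PMSIRR_EL1_RND));
    asm volatile("isb");                                  /* synchronize   */
}
```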
Each time a record is generated, it is written to the buffer. When the buffer fills, the records in the buffer must be processed by software. Buffering means the processor does not generate an interrupt to software to process the records until many records have been collected. Not only does this reduce the overhead (fewer interrupts), it also means that samples are taken when the caches, TLBs, branch predictors, etc. are “warmed” for the process being profiled. Without this warming effect, the data is almost worthless.
The architecture does not fix the size of a record, but it might typically be around 64 bytes. If sampling is set to every 1,000 operations (say) then a 1MB buffer, which can contain ~16,000 sample records, would fill after 16 million operations. On a processor operating at 2GHz, executing one operation per cycle on average, a buffer fill event will be generated every 8 milliseconds.
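The arithmetic is easy to check. A small worked example, under the same (hypothetical) assumptions of 64-byte records, a 1-in-1,000 sampling interval, and one operation per cycle at 2GHz:

```c
/* Worked example: how often a 1MB profiling buffer fills. */
#include <stdio.h>

int main(void)
{
    const double buffer_bytes = 1024.0 * 1024.0; /* 1MB buffer            */
    const double record_bytes = 64.0;            /* typical record size   */
    const double interval     = 1000.0;          /* operations per sample */
    const double ops_per_sec  = 2e9;             /* 2GHz, 1 op per cycle  */

    double records = buffer_bytes / record_bytes;   /* ~16,000 records     */
    double ops     = records * interval;            /* ~16 million ops     */
    double fill_ms = ops / ops_per_sec * 1000.0;    /* ~8 milliseconds     */

    printf("%.0f records, %.0f operations, buffer fills every %.1f ms\n",
           records, ops, fill_ms);
    return 0;
}
```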
Using a buffer means the processor does not take an interrupt for every sample, and increasing the buffer size will further reduce the interrupt rate. However, writing the samples to memory will still have some “probe effect” on the program being profiled.
There are a number of ways a profiling tool might limit probe effects.
The first is to increase the sampling interval. Statistical analysis can be used to provide confidence in the values given in sets of records. That is, the probability that the estimated characteristic is indeed representative of the sampled population can be determined. The uncertainty shrinks roughly in line with the inverse square root of the number of samples, meaning it is possible to get quite high levels of confidence in the results even with intervals of hundreds of thousands or millions of operations.
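As a rough illustration of that square-root relationship (assuming simple random sampling and the standard 1/sqrt(n) scaling of relative error):

```c
/* Relative sampling error shrinks roughly as 1/sqrt(n): a long run at a
 * large interval still accumulates enough samples for tight estimates. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    static const double n[] = { 1e3, 1e4, 1e5, 1e6 };

    for (unsigned i = 0; i < sizeof(n) / sizeof(n[0]); i++)
        printf("n = %8.0f samples  ->  ~%.2f%% relative error\n",
               n[i], 100.0 / sqrt(n[i]));
    return 0;
}
/* Even a million-operation interval yields a million samples after
 * ~10 minutes at 2GHz: roughly 0.1% relative error. */
```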
A second way is using the filtering capability of the Statistical Profiling Extension. By setting filters, the profiling tool can instruct the processor to discard samples that are not of interest. Note that this does not alter the sample population, only the sample records that are written to the memory buffer. This also reduces the data that software has to process later.
The filters that can be set are:

- by operation type, for example recording only loads, only stores, or only branches
- by event, for example recording only operations that miss in the caches, or only mispredicted branches
- by latency, for example recording only operations that take longer than a programmed number of cycles to complete
These filters can be combined. For example, the profiling tool can request that only sampled load operations which miss in the Level 1 cache and take more than 100 cycles to be completed are written to the memory buffer.
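Continuing the register-level sketch from above, the filters for that example might be combined as follows. The control-register field positions follow the SPE specification, but treat the event bit in particular as illustrative and check the specification for your implementation.

```c
/* Sketch: record only load operations that miss in the L1 data cache
 * and take at least 100 cycles to complete. */
#include <stdint.h>

#define PMSFCR_EL1_FE   (1UL << 0)     /* filter by event              */
#define PMSFCR_EL1_FT   (1UL << 1)     /* filter by operation type     */
#define PMSFCR_EL1_FL   (1UL << 2)     /* filter by latency            */
#define PMSFCR_EL1_LD   (1UL << 17)    /* type filter: load operations */

#define SPE_EVT_L1D_REFILL (1UL << 3)  /* events-packet bit (assumed)  */

static void spe_filter_slow_l1_misses(void)
{
    /* Only keep samples where the L1D-refill event occurred... */
    asm volatile("msr pmsevfr_el1, %0" :: "r"(SPE_EVT_L1D_REFILL));

    /* ...with a total latency of at least 100 cycles... */
    asm volatile("msr pmslatfr_el1, %0" :: "r"(100UL));

    /* ...and only for load operations. */
    asm volatile("msr pmsfcr_el1, %0"
                 :: "r"(PMSFCR_EL1_FE | PMSFCR_EL1_FT |
                        PMSFCR_EL1_FL | PMSFCR_EL1_LD));
}
```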
As a simple example, a profiling record generated for a branch instruction may consist of the following entries, called packets:
Figure 2: Example list of packets within a branch instruction record
The order of the packets might vary between implementations (apart from the End marker, which always comes last); the records are self-describing, and the Statistical Profiling Extension specification defines the grammar used to encode the packets.
Figure 3: Structure of a record
This allows for future extensions to add more information. It also means that, for other types of instruction, different data can be collected. For example, for a load instruction, the processor might collect:
Figure 4: Example list of packets within a load instruction record
This record captures which load instruction was executed, the data structure the instruction accessed, how many cycles the instruction took to complete, and why it took that long. Because this information is simultaneously and directly attributable to a single instruction, it can provide deep insights into program behavior.
At the time of writing (January 2017), ARM has published support for Statistical Profiling in the form of a perf PMU driver for the Linux kernel. As Will explains, there aren't any user-space patches for the perf tool yet, but these are also being developed. The kernel-side driver has been posted as an RFC to bring this exciting new technology to wider attention.
Over the coming months, additional patches for the Extension will be made available, including user-space support, and processor implementations will be announced.
When, in January 2016, David Brash announced the ARMv8.2 enhancements to the ARMv8 architecture, he summarized the Statistical Profiling Extension as follows:
The Statistical Profiling Extension is optional in ARMv8.2 and only supported in the AArch64 Execution state. A sample criterion is set on an instruction or micro-op basis, and then sampled at a regular interval. Each sample then gathers context associated with that sample into a profiling record, with only one record ever being compiled at any given time. Analyzing large working sets of samples can provide considerable insight into software execution and its associated performance when sampling continuously on systems running large workloads over extended periods of time.
Statistical Profiling joins an arsenal of hardware features available in the ARM architecture, including processor trace and CoreSight, which, through open-source efforts such as OpenCSD and Linux perf, provide powerful tools for software developers, enabling anyone to reliably extract the true performance of their ARM-based system.
My thanks to Will Deacon and David Gibbs for their help in preparing this article.
Update: I'm pleased to announce that the specification for the Statistical Profiling Extension is now available. For more information, see community.arm.com/.../arm-architecture-armv8-2-a-evolution-and-delivery

Update: This supplement has since been integrated into the main Architecture Reference Manual, which you can find here: developer.arm.com/.../docs