Using Perf to enable PMU functionality on Armv8-A CPUs: Stage 3 and Stage 4

Jiaming Guo
August 22, 2023
6 minute read time.
Part 3 of 3 blog series

In Part 2, after collecting performance statistics from the PMU, we found that the relatively high L2 data cache miss rate deserved further analysis. This part describes how to perform hot-spot analysis with Perf, apply an optimization, and validate it.

Stage 3: Sampling PMU events and conducting hot-spot analysis

To do hot-spot analysis, you must collect more detailed information than the aggregate counts from perf stat. You can use Perf to sample PMU events and attribute them to the code that triggers them.

Perf commands to use

  • perf record: Use this command to sample detailed performance data of a specified application over time, which helps you identify performance bottlenecks, or hot spots. The samples are stored in a binary file named perf.data in the directory where the command is executed.
  • perf report: Use this command to parse perf.data and generate a profile.

How to use the perf record command

The basic usage of perf record is as follows:

# Specify PMU event and sampling period
$ perf record -e <event> -c <count> -- <command to run the application>
# Specify PMU event and average sampling rate
$ perf record -e <event> -F <freq> -- <command to run the application>
# By default, cycles is selected with average sampling rate set to be 4000 samples/sec
$ perf record -- <command to run the application>

When using the command perf record, you can select PMU events to sample by using the -e option. Also, you can specify the sampling period by using the -c option or average sampling rate by using the -F option.

By default, perf record uses the cycles event with the sampling rate set to 4000 samples per second. For Armv8-A CPUs, the cycles event is mapped to the cycle counter.
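As an aside (not part of the original walkthrough), you can list the PMU events that the kernel exposes on your platform with perf list. The filter below assumes the PMU is registered as armv8_pmuv3_0, as in the commands used throughout this series:

$ perf list | grep armv8_pmuv3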

The sampling period refers to the number of occurrences of the PMU event. For example, if you set -c 1000, a sample is recorded every 1000 occurrences of the selected PMU event.

The sampling rate refers to the average number of samples per second. For example, if you set -F 1000, Perf records around 1000 samples per second. The kernel achieves this by dynamically adjusting the sampling period.

When using perf record to sample PMU events for hot-spot analysis, consider the following factors:

What sampling rate and period to set

For Armv8-A CPUs, sampling is triggered by a PMU overflow interrupt. This means that setting -c to a small value or -F to a large value generates more interrupts while the application runs, and frequent interrupts can cause significant overhead.

Also, the interrupt-based recording method used by Perf introduces skid: some of the sampled information, for example the instruction pointer, may not correspond to the instruction on which the counter actually overflowed. So, you need to carefully choose a sampling rate or period that is appropriate for your application.
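As a rough, back-of-the-envelope illustration (using the pre-optimization figures from the Stage 4 table below, so the per-second rates are only approximate): l2d_cache_refill fires about 65.8 million times over about 8.3 seconds, or roughly 8 million events per second. Setting -c 1000 would therefore trigger on the order of 8,000 overflow interrupts per second, whereas -F 999 keeps the rate near 1,000 samples per second by letting the kernel grow the sampling period to roughly 8,000 events.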

What events to select for sampling

The performance characteristics reported by perf stat show that you do not need to sample every event with perf record. You can choose a few of them for recording and hot-spot analysis. In this example, you can sample only l2d_cache_refill.

Example

An example to sample PMU events and do hot-spot analysis on the Juno r2 platform is as follows:

$ taskset -c 0-3 perf record -e armv8_pmuv3_0/l2d_cache_refill/ -F 999 -- 2D_array
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.316 MB perf.data (8265 samples) ] 

First, use the command perf record to sample PMU event l2d_cache_refill. After the file perf.data is generated, use the command perf report for further hot-spot analysis. This command provides an interactive interface. A basic procedure to do hot-spot analysis is as follows:

1. Locate the function with the most overhead

Use the command perf report as follows. By default, it reads the file named perf.data in the current directory.

$ perf report

In this example, we selected only one PMU event when running perf record, so perf report shows the recorded information directly, as follows. If perf record captures more than one PMU event, perf report shows a pop-up menu for you to choose which event to analyze.

Figure 1: Overhead of PMU event samples per function

The above figure shows that the functions compute_squares and array_assign take 49.95% and 49.83%, respectively, of the overall samples of the PMU event l2d_cache_refill. The [.] marker means that the samples were taken in user space.

2. Source analysis with Perf annotate

We can take a step further and analyze the source code. Keep in mind that interrupt-based recording introduces skid, so Perf cannot pinpoint the exact instructions on which the PMU events were counted; treat the per-instruction attribution as approximate.

Figure 2: Menu for choosing operations for PMU event samples

We can locate the source code of function array_assign by selecting option Annotate array_assign, as shown in the following figure.

Figure 3: Hot-spot analysis through selected PMU event samples

In the figure above, the percentage shown to the left of the vertical separator is the share of PMU event samples attributed to the instruction shown on the right. Based on these percentages and an understanding of how caches work, we can infer that the high L2 data cache miss rate is mainly caused by traversing the 2D array by column.
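For reference, a minimal sketch of the column-order traversal is shown below. It is reconstructed from the optimized code listed in Stage 4; the original source appears in the earlier parts of this series, so treat the exact details as an assumption.

/* Assumed shape of the original, cache-unfriendly array_assign (for
 * illustration only): the outer loop walks columns, so consecutive inner
 * iterations touch array[0][i], array[1][i], array[2][i], ..., which are
 * COL_LINE * sizeof(long) bytes apart and therefore fall in different
 * cache lines. */
void array_assign()
{
    int i, j;

    for (i=0; i<COL_LINE; i++) {        /* columns in the outer loop */
        for (j=0; j<ROW_LINE; j++) {    /* rows in the inner loop */
            array[j][i] = i+j;
        }
    }
}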

Stage 4: Optimizing code and performing validation

Perf commands to use

  • perf stat: The usage information of this command is already described in the section Stage 2: Collecting statistics from PMU of Part 2.
  • perf diff: Use this command to show the performance difference between two or more perf.data files captured by perf record. It is useful for comparing performance before and after an optimization.

How to use the perf diff command

The basic usage of perf diff is as follows:

# Specify perf.data path to compare
$ perf diff <path of old perf.data file> <path of new perf.data file> 
# By default, compare the `perf.data` and `perf.data.old` in the current directory
$ perf diff 
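If you prefer not to rely on Perf's automatic perf.data / perf.data.old handling (described later in this stage), you can also give each recording an explicit name with the -o option of perf record and pass both files to perf diff. The file names below are only illustrative:

$ perf record -o before.data -e armv8_pmuv3_0/l2d_cache_refill/ -F 999 -- 2D_array
$ perf record -o after.data -e armv8_pmuv3_0/l2d_cache_refill/ -F 999 -- 2D_array
$ perf diff before.data after.data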

Example

Based on the collected information, one possible optimization of the example application is as follows:

#define COL_LINE 64
#define ROW_LINE 512000
 
long array[ROW_LINE][COL_LINE]; 
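
/* Note: compute_squares below is not marked as modified ([M]) and its outer
 * loop still walks columns, so it still traverses the array in a
 * cache-unfriendly order. */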
 
void compute_squares()
{    
    int i, j;
    
    for (i=0; i<COL_LINE; i++) {
        for (j=0; j<ROW_LINE; j++) {
            array[j][i] = array[j][i] * array[j][i];
        }
    }
}
 
void array_assign()
{
    int i, j;
    /* [M] Traverse by row */ 
    for (i=0; i<ROW_LINE; i++) { 
        for (j=0; j<COL_LINE; j++) {
            array[i][j] = i+j;
        }
    }
}
 
int main()
{
    array_assign();
    compute_squares();
    return 0;
}

We change array_assign to traverse the 2D array by row instead of by column. This cache-friendly traversal is expected to reduce L2 data cache misses and, in turn, improve the performance of the application.

Compile the modified example application and run perf stat again to get key performance metrics as follows:

$ gcc -O0 -g 2D_array.c -o 2D_array
$ taskset -c 0-3 perf stat -e armv8_pmuv3_0/cpu_cycles/,armv8_pmuv3_0/inst_retired/,armv8_pmuv3_0/l1d_cache_refill/,armv8_pmuv3_0/l1d_cache/,armv8_pmuv3_0/l2d_cache_refill/,armv8_pmuv3_0/l2d_cache/,armv8_pmuv3_0/mem_access/ -- 2D_array
 
 Performance counter stats for '2D_array':

     5,387,230,517      armv8_pmuv3_0/cpu_cycles/
     1,599,412,481      armv8_pmuv3_0/inst_retired/
        34,576,836      armv8_pmuv3_0/l1d_cache_refill/
       697,922,349      armv8_pmuv3_0/l1d_cache/
        37,191,464      armv8_pmuv3_0/l2d_cache_refill/
        85,987,007      armv8_pmuv3_0/l2d_cache/
       665,282,669      armv8_pmuv3_0/mem_access/
 
       7.699647601 seconds time elapsed

       7.575061000 seconds user
       0.124050000 seconds sys

After running perf stat, you can manually compare the PMU event counts and related metrics. The result is as follows:

Item                                             Before optimization    After optimization
l2d_cache_refill                                 65,775,517             37,191,464
l2d_cache                                        143,130,513            85,987,007
Attributable Level 2 unified cache refill rate   0.460                  0.433
l1d_cache_refill                                 67,051,466             34,576,836
l1d_cache                                        695,606,934            697,922,349
Attributable Level 1 data cache refill rate      0.095                  0.050
Execution time (seconds)                         8.269886818            7.699647601
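As a quick cross-check, the two refill-rate rows are simply the refill count divided by the access count at the corresponding cache level. For example, after the optimization the L2 rate is 37,191,464 / 85,987,007 ≈ 0.433 and the L1 data rate is 34,576,836 / 697,922,349 ≈ 0.050.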

The results in the above table show that the optimization causes only a small decrease in L2 cache miss rate. However, you can observe a significant decrease in L1 data cache miss rate. This leads to reduced L2 cache accesses and faster execution time.

We still need to confirm that the code modification is what produced the improvement. Run perf record again with the same parameters, and then use perf diff to compare the two recordings. A reduction in L2 data cache misses in the function array_assign is expected.

$ taskset -c 0-3 perf record -e armv8_pmuv3_0/l2d_cache_refill/ -F 999 -- 2D_array
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.301 MB perf.data (7690 samples) ]

$ perf diff
# Event 'armv8_pmuv3_0/l2d_cache_refill/'
#
# Baseline  Delta Abs  Shared Object      Symbol
# ........  .........  .................  ......................
#
    49.83%    -38.83%  2D_array           [.] array_assign
    49.95%    +38.29%  2D_array           [.] compute_squares
     0.22%     +0.54%  [unknown]          [k] 0xffff800008326210
     0.01%     +0.01%  ld-2.31.so         [.] 0x000000000000bab0
     0.00%     +0.00%  [kernel.kallsyms]  [k] 0x00008000083292e0

Because perf record for the optimized application runs in the same directory as the previous recording, Perf renames the existing perf.data to perf.data.old and stores the latest samples in perf.data.

The event sampling results in perf.data.old are treated as the Baseline, and the Delta Abs column shows the change caused by the optimization. Consider the PMU event l2d_cache_refill as an example: before the optimization, 49.83% of the sampled L2 data cache refill events came from the function array_assign; after the optimization, that share drops to roughly 11% (49.83% - 38.83%). The share attributed to compute_squares rises correspondingly, which is consistent with that function being left unchanged and still traversing the array by column. We can now be confident that the optimization direction is effective.
