Using Perf to enable PMU functionality on Armv8-A CPUs: Stage 3 and Stage 4

Jiaming Guo
August 22, 2023
6 minute read time.
Part 3 of 3 blog series

In Part 2, after collecting performance statistics from the PMU, we found that the relatively high L2 data cache miss rate deserved further analysis. This part describes how to perform hot-spot analysis with Perf, apply an optimization, and validate it.

Stage 3: Sampling PMU events and conducting hot-spot analysis

To do hot-spot analysis, you must collect more detailed information than the aggregate counts from perf stat. You can use Perf to sample PMU events and attribute them to the code that triggers them.

Perf commands to use

  • perf record: Use this command to sample detailed performance data of a specified application over time, which helps you identify performance bottlenecks, or hot spots. The samples are stored in a binary file named perf.data in the directory where the command is executed.
  • perf report: Use this command to parse perf.data and generate a profile.

How to use the perf record command

The basic usage of perf record is as follows:

# Specify PMU event and sampling period
$ perf record -e <event> -c <count> -- <command to run the application>
# Specify PMU event and average sampling rate
$ perf record -e <event> -F <freq> -- <command to run the application>
# By default, cycles is selected with average sampling rate set to be 4000 samples/sec
$ perf record -- <command to run the application>

When using the command perf record, you can select PMU events to sample by using the -e option. Also, you can specify the sampling period by using the -c option or average sampling rate by using the -F option.

By default, perf record uses the cycles event with the sampling rate set to 4000 samples per second. For Armv8-A CPUs, the cycles event is mapped to the cycle counter.
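As an aside (not part of the original walkthrough), you can list the PMU events that the kernel exposes on your platform with perf list. The filter below assumes the PMU is registered as armv8_pmuv3_0, as in the commands used throughout this series:

$ perf list | grep armv8_pmuv3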

The sampling period refers to the number of occurrences of the PMU event. For example, if you set -c 1000, a sample is recorded every 1000 occurrences of the selected PMU event.

The sampling rate refers to the average number of samples per second. For example, if you set -F 1000, Perf records around 1000 samples per second. The kernel achieves this by dynamically adjusting the sampling period.

When using perf record to sample PMU events for hot-spot analysis, consider the following factors:

What sampling rate and period to set

For Armv8-A CPUs, sampling is triggered by a PMU overflow interrupt. This means that setting -c to a small value or -F to a large value generates more interrupts while the application runs, and frequent interrupts can cause significant overhead.

Also, the interrupt-based recording method used by Perf introduces skid: some of the sampled information, for example the instruction pointer, may not correspond to the instruction on which the counter actually overflowed. So, you need to carefully choose a sampling rate or period that is appropriate for your application.
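As a rough, back-of-the-envelope illustration (using the pre-optimization figures from the Stage 4 table below, so the per-second rates are only approximate): l2d_cache_refill fires about 65.8 million times over about 8.3 seconds, or roughly 8 million events per second. Setting -c 1000 would therefore trigger on the order of 8,000 overflow interrupts per second, whereas -F 999 keeps the rate near 1,000 samples per second by letting the kernel grow the sampling period to roughly 8,000 events.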

What events to select for sampling

The performance characteristics reported by perf stat show that you do not need to sample every event with perf record. You can choose a few of them for recording and hot-spot analysis. In this example, you can sample only l2d_cache_refill.

Example

An example to sample PMU events and do hot-spot analysis on the Juno r2 platform is as follows:

$ taskset -c 0-3 perf record -e armv8_pmuv3_0/l2d_cache_refill/ -F 999 -- 2D_array
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.316 MB perf.data (8265 samples) ] 

First, use the command perf record to sample PMU event l2d_cache_refill. After the file perf.data is generated, use the command perf report for further hot-spot analysis. This command provides an interactive interface. A basic procedure to do hot-spot analysis is as follows:

1. Locate the function with the most overhead

Use the command perf report as follows. By default, it reads the file named perf.data in the current directory.

$ perf report

In this example, we selected only one PMU event when running perf record, so perf report shows the recorded information directly, as follows. If perf record captures more than one PMU event, perf report shows a pop-up menu for you to choose which event to analyze.

Figure 1: Overhead of PMU event samples per function

The above figure shows that the functions compute_squares and array_assign take 49.95% and 49.83%, respectively, of the overall samples of the PMU event l2d_cache_refill. The [.] marker means that the samples were taken in user space.

2. Source analysis with Perf annotate

We can take a step further and analyze the source code. Keep in mind that interrupt-based recording introduces skid, so Perf cannot pinpoint the exact instructions on which the PMU events were counted; treat the per-instruction attribution as approximate.

Figure 2: Menu for choosing operations for PMU event samples

We can locate the source code of function array_assign by selecting option Annotate array_assign, as shown in the following figure.

Figure 3: Hot-spot analysis through selected PMU event samples

In the figure above, the percentage shown to the left of the vertical separator is the share of PMU event samples attributed to the instruction shown on the right. Based on these percentages and an understanding of how caches work, we can infer that the high L2 data cache miss rate is mainly caused by traversing the 2D array by column.
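For reference, a minimal sketch of the column-order traversal is shown below. It is reconstructed from the optimized code listed in Stage 4; the original source appears in the earlier parts of this series, so treat the exact details as an assumption.

/* Assumed shape of the original, cache-unfriendly array_assign (for
 * illustration only): the outer loop walks columns, so consecutive inner
 * iterations touch array[0][i], array[1][i], array[2][i], ..., which are
 * COL_LINE * sizeof(long) bytes apart and therefore fall in different
 * cache lines. */
void array_assign()
{
    int i, j;

    for (i=0; i<COL_LINE; i++) {        /* columns in the outer loop */
        for (j=0; j<ROW_LINE; j++) {    /* rows in the inner loop */
            array[j][i] = i+j;
        }
    }
}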

Stage 4: Optimizing code and performing validation

Perf commands to use

  • perf stat: The usage information of this command is already described in the section Stage 2: Collecting statistics from PMU of Part 2.
  • perf diff: Use this command to show the performance difference between two or more perf.data files captured by perf record. It is useful for comparing performance before and after an optimization.

How to use the perf diff command

The basic usage of perf diff is as follows:

# Specify perf.data path to compare
$ perf diff <path of old perf.data file> <path of new perf.data file> 
# By default, compare the `perf.data` and `perf.data.old` in the current directory
$ perf diff 
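If you prefer not to rely on Perf's automatic perf.data / perf.data.old handling (described later in this stage), you can also give each recording an explicit name with the -o option of perf record and pass both files to perf diff. The file names below are only illustrative:

$ perf record -o before.data -e armv8_pmuv3_0/l2d_cache_refill/ -F 999 -- 2D_array
$ perf record -o after.data -e armv8_pmuv3_0/l2d_cache_refill/ -F 999 -- 2D_array
$ perf diff before.data after.data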

Example

Based on the collected information, one possible optimization of the example application is as follows:

#define COL_LINE 64
#define ROW_LINE 512000
 
long array[ROW_LINE][COL_LINE]; 
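
/* Note: compute_squares below is not marked as modified ([M]) and its outer
 * loop still walks columns, so it still traverses the array in a
 * cache-unfriendly order. */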
 
void compute_squares()
{    
    int i, j;
    
    for (i=0; i<COL_LINE; i++) {
        for (j=0; j<ROW_LINE; j++) {
            array[j][i] = array[j][i] * array[j][i];
        }
    }
}
 
void array_assign()
{
    int i, j;
    /* [M] Traverse by row */ 
    for (i=0; i<ROW_LINE; i++) { 
        for (j=0; j<COL_LINE; j++) {
            array[i][j] = i+j;
        }
    }
}
 
int main()
{
    array_assign();
    compute_squares();
    return 0;
}

We change array_assign to traverse the 2D array by row instead of by column. This cache-friendly traversal is expected to reduce L2 data cache misses and, in turn, improve the performance of the application.

Compile the modified example application and run perf stat again to get key performance metrics as follows:

$ gcc -O0 -g 2D_array.c -o 2D_array
$ taskset -c 0-3 perf stat -e armv8_pmuv3_0/cpu_cycles/,armv8_pmuv3_0/inst_retired/,armv8_pmuv3_0/l1d_cache_refill/,armv8_pmuv3_0/l1d_cache/,armv8_pmuv3_0/l2d_cache_refill/,armv8_pmuv3_0/l2d_cache/,armv8_pmuv3_0/mem_access/ -- 2D_array
 
 Performance counter stats for '2D_array':

     5,387,230,517      armv8_pmuv3_0/cpu_cycles/
     1,599,412,481      armv8_pmuv3_0/inst_retired/
        34,576,836      armv8_pmuv3_0/l1d_cache_refill/
       697,922,349      armv8_pmuv3_0/l1d_cache/
        37,191,464      armv8_pmuv3_0/l2d_cache_refill/
        85,987,007      armv8_pmuv3_0/l2d_cache/
       665,282,669      armv8_pmuv3_0/mem_access/
 
       7.699647601 seconds time elapsed

       7.575061000 seconds user
       0.124050000 seconds sys

After running perf stat, you can manually compare the PMU event counts and related metrics. The result is as follows:

Item                                             Before optimization    After optimization
l2d_cache_refill                                 65,775,517             37,191,464
l2d_cache                                        143,130,513            85,987,007
Attributable Level 2 unified cache refill rate   0.460                  0.433
l1d_cache_refill                                 67,051,466             34,576,836
l1d_cache                                        695,606,934            697,922,349
Attributable Level 1 data cache refill rate      0.095                  0.050
Execution time (seconds)                         8.269886818            7.699647601
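As a quick cross-check, the two refill-rate rows are simply the refill count divided by the access count at the corresponding cache level. For example, after the optimization the L2 rate is 37,191,464 / 85,987,007 ≈ 0.433 and the L1 data rate is 34,576,836 / 697,922,349 ≈ 0.050.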

The results in the above table show that the optimization causes only a small decrease in L2 cache miss rate. However, you can observe a significant decrease in L1 data cache miss rate. This leads to reduced L2 cache accesses and faster execution time.

We still need to confirm that the code modification is what produced the improvement. Run perf record again with the same parameters, and then use perf diff to compare the two recordings. A reduction in L2 data cache misses in the function array_assign is expected.

$ taskset -c 0-3 perf record -e armv8_pmuv3_0/l2d_cache_refill/ -F 999 -- 2D_array
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.301 MB perf.data (7690 samples) ]

$ perf diff
# Event 'armv8_pmuv3_0/l2d_cache_refill/'
#
# Baseline  Delta Abs  Shared Object      Symbol
# ........  .........  .................  ......................
#
    49.83%    -38.83%  2D_array           [.] array_assign
    49.95%    +38.29%  2D_array           [.] compute_squares
     0.22%     +0.54%  [unknown]          [k] 0xffff800008326210
     0.01%     +0.01%  ld-2.31.so         [.] 0x000000000000bab0
     0.00%     +0.00%  [kernel.kallsyms]  [k] 0x00008000083292e0

Because perf record for the optimized application runs in the same directory as the previous recording, Perf renames the existing perf.data to perf.data.old and stores the latest samples in perf.data.

The event sampling results in perf.data.old are treated as the Baseline, and the Delta Abs column shows the change caused by the optimization. Consider the PMU event l2d_cache_refill as an example: before the optimization, 49.83% of the sampled L2 data cache refill events came from the function array_assign; after the optimization, that share drops to roughly 11% (49.83% - 38.83%). The share attributed to compute_squares rises correspondingly, which is consistent with that function being left unchanged and still traversing the array by column. We can now be confident that the optimization direction is effective.
