Using Perf to enable PMU functionality on Armv8-A CPUs: Stage 1 and Stage 2

August 15, 2023

Part 2 of 3 blog series

Part 1 sets the goal of the performance analysis and outlines the four stages of the basic performance analysis workflow. This part describes stage 1 and stage 2 in detail.

Stage 1: Selecting PMU events for analysis

Perf supports different types of events, including software events and hardware events. You can select events using various Perf commands with the option -e.

Use the command perf list to list symbolic events. An example on the Juno r2 platform is as follows:

$ perf list
List of pre-defined events (to be used in -e or -M):
 
branch-instructions OR branches                    [Hardware event]
...
 
alignment-faults                                   [Software event]
...
 
L1-dcache-load-misses                              [Hardware cache event]
L1-dcache-loads                                    [Hardware cache event]
...

br_immed_retired OR armv8_pmuv3_0/br_immed_retired/[Kernel PMU event]
br_mis_pred OR armv8_pmuv3_0/br_mis_pred/          [Kernel PMU event]
br_mis_pred OR armv8_pmuv3_1/br_mis_pred/          [Kernel PMU event]
...

As the example shows, the PMU of each CPU provides the following types of events defined by Perf:

Hardware event
Hardware cache event
Kernel PMU event

However, not all PMU events of a CPU are listed in symbolic form. Perf also supports the raw hardware event format. For Armv8-A CPUs, you use it by specifying the PMU driver name of the CPU together with the event number of the PMU event. You can select any PMU event described in the CPU Technical Reference Manual (TRM) with the following steps:

1. See the CPU TRM for the event number

Consider Cortex-A53 as an example. Suppose that you want to select Exception taken (IRQ) and Exception taken (FIQ). These two events are listed in the TRM, but not in the perf list output. See the Cortex-A53 TRM to get their event numbers, as the following table shows:

Event Number	Event Mnemonic
0x86	`EXC_IRQ`
0x87	`EXC_FIQ`

2. Look up the PMU driver name of the corresponding CPU

The Juno r2 platform used in this blog is based on a big.LITTLE processor, which introduces two PMU drivers, named armv8_pmuv3_0 and armv8_pmuv3_1. To select events of the Cortex-A53 PMU, you must look up the correct PMU driver name for it.

First, get the topology of the CPUs as follows:

$ cat /proc/cpuinfo 
processor : 0
BogoMIPS : 100.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 0
...
processor : 4
BogoMIPS : 100.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd07
CPU revision : 0 
...

In the log above, CPU0 is Cortex-A53 with part number 0xd03, and CPU4 is Cortex-A72 with part number 0xd07. The part number is defined in the MIDR_EL1. See the CPU TRM for more information.
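
The lookup described above can be sketched in Python. This is a minimal illustration, not part of Perf: the helper name cpu_parts and the PART_NAMES table are assumptions made for this example, and the table lists only the two part numbers mentioned in this post.

```python
# Hypothetical lookup table: only the two part numbers named in this post.
PART_NAMES = {0xD03: "Cortex-A53", 0xD07: "Cortex-A72"}


def cpu_parts(cpuinfo_text):
    """Return {processor index: part number} from /proc/cpuinfo-style text."""
    parts = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a "key : value" shape
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "CPU part" and current is not None:
            parts[current] = int(value, 16)
    return parts


# Shortened copy of the log shown above.
SAMPLE = """\
processor : 0
CPU implementer : 0x41
CPU part : 0xd03
processor : 4
CPU implementer : 0x41
CPU part : 0xd07
"""

if __name__ == "__main__":
    for cpu, part in sorted(cpu_parts(SAMPLE).items()):
        print(f"CPU{cpu}: {PART_NAMES.get(part, 'unknown')} (part {part:#x})")
```

On a real system you would pass the contents of /proc/cpuinfo instead of SAMPLE.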

Next, find out which CPUs each PMU driver supports, as follows:

$ ls /sys/bus/event_source/devices
armv8_pmuv3_0  armv8_pmuv3_1  breakpoint  kprobe  software  tracepoint  uprobe
$ cat /sys/bus/event_source/devices/armv8_pmuv3_0/cpus
0-3
$ cat /sys/bus/event_source/devices/armv8_pmuv3_1/cpus
4-5

In the log above, the PMU driver named armv8_pmuv3_0 is the driver of the CPU0-3 PMUs, and armv8_pmuv3_1 is the driver of the CPU4-5 PMUs.

Finally, you can establish the relationship of the PMU driver and its corresponding CPU, that is:

armv8_pmuv3_0 for Cortex-A53
armv8_pmuv3_1 for Cortex-A72
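
If you want to automate this mapping, the following Python sketch shows one possible approach. The function names expand_cpulist and pmu_for_cpu are hypothetical, and the JUNO_PMUS dictionary simply copies the contents of the two cpus files shown above rather than reading sysfs.

```python
def expand_cpulist(text):
    """Expand a sysfs CPU list such as '0-3' or '0,2-3' into a list of ints."""
    cpus = []
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, sep, last = chunk.partition("-")
        # A chunk without '-' names a single CPU.
        cpus.extend(range(int(first), int(last if sep else first) + 1))
    return cpus


def pmu_for_cpu(pmu_cpus, cpu):
    """Return the name of the PMU whose CPU list contains `cpu`, or None."""
    for name, cpulist in pmu_cpus.items():
        if cpu in expand_cpulist(cpulist):
            return name
    return None


# Contents of the two sysfs `cpus` files as shown in this post.
JUNO_PMUS = {"armv8_pmuv3_0": "0-3", "armv8_pmuv3_1": "4-5"}

if __name__ == "__main__":
    print(pmu_for_cpu(JUNO_PMUS, 0))
    print(pmu_for_cpu(JUNO_PMUS, 4))
```

Combined with the part number of each CPU from /proc/cpuinfo, this gives the driver name to use for a given core type.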

3. Check the raw hardware event encoding format for the specific PMU

You now have the event numbers of the events to select and the PMU driver of the corresponding CPU. Next, you need to check the encoding format for the PMU events. Use the command as follows:

$ cat /sys/bus/event_source/devices/armv8_pmuv3_0/format/event
config:0-15

In the log above, config:0-15 means that bits 0-15 (two bytes) of the config field are used to pass the event number to Perf.
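
To make the bit-field idea concrete, here is a small Python sketch that places an event number into the range named by such a format string. The helpers parse_format and encode_event are hypothetical names, and the sketch assumes a single contiguous bit range like the one shown above.

```python
def parse_format(spec):
    """Parse a format string such as 'config:0-15' into (field, low, high)."""
    field, _, bits = spec.strip().partition(":")
    low, sep, high = bits.partition("-")
    return field, int(low), int(high if sep else low)


def encode_event(event_number, spec="config:0-15"):
    """Shift event_number into the bit range of spec; reject values that do not fit."""
    _field, low, high = parse_format(spec)
    width = high - low + 1
    if not 0 <= event_number < (1 << width):
        raise ValueError(f"event {event_number:#x} does not fit in {width} bits")
    return event_number << low


if __name__ == "__main__":
    # EXC_IRQ from the Cortex-A53 table above.
    print(hex(encode_event(0x86)))
```

Because the range starts at bit 0 here, the encoded value equals the event number itself, which is why event=0x86 can be written directly in the next step.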

4. Select the raw hardware event in Perf

Select the PMU event in Perf in the raw hardware event format. The following commands list three possible usages.

# Single event
$ perf <command> -e armv8_pmuv3_0/event=0x86/ 
# Multiple events, comma-separated with no spaces
$ perf <command> -e armv8_pmuv3_0/event=0x86/,armv8_pmuv3_0/event=0x87/ 
# Multiple events, mixing symbolic and raw hardware event formats
$ perf <command> -e armv8_pmuv3_0/event=0x86/,armv8_pmuv3_0/event=0x87/,armv8_pmuv3_0/br_immed_retired/
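
When the event list gets long, you may prefer to generate the -e argument. The following Python sketch is one way to do that. The function names event_spec and event_option are made up for this example; only the output format comes from the commands above.

```python
def event_spec(pmu, event):
    """Format one event: ints become raw 'event=0x..', strings are symbolic names."""
    if isinstance(event, int):
        return f"{pmu}/event={event:#x}/"
    return f"{pmu}/{event}/"


def event_option(pmu, events):
    """Join the events into one comma-separated string with no spaces."""
    return ",".join(event_spec(pmu, event) for event in events)


if __name__ == "__main__":
    print(event_option("armv8_pmuv3_0", [0x86, 0x87, "br_immed_retired"]))
```

The resulting string is what you pass after -e.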

Stage 2: Collecting statistics from PMU

When you know the PMU events to select, you can use Perf to collect statistics of the selected PMU events.

Perf command to use

perf stat: Use this command to run the specified application and collect PMU counter statistics for the whole process. This command can help you understand the performance characteristics of the application and identify potential areas for further investigation.

How to use the `perf stat` command

The basic usage of the perf stat is as follows:

$ perf stat -e <event> -- <command to run the application>

When using perf stat to collect statistics from the PMU, consider the following factors:

Number of the PMU events to select each time
The PMU events to be selected each time
Specify user-space and kernel-space events to count

Number of the PMU events to select each time

In the initial performance analysis of the application, normally a wide range of PMU events are selected to get a comprehensive understanding of the application behavior and performance.

For Armv8-A CPUs, however, the number of PMU counters available on each CPU is limited. If the events selected in the Perf command exceed the available counters, the kernel uses time multiplexing to give each event a chance to count. At the end of the run, Perf scales the counting values based on the total time enabled and the time running. In other words, when multiplexing and scaling happen, the counting values for the selected PMU events are estimated values.

A typical example on the Juno r2 platform is as follows:

$ taskset -c 0-3 perf stat -e armv8_pmuv3_0/inst_retired/,armv8_pmuv3_0/l1d_cache_refill/,armv8_pmuv3_0/l1d_cache/,armv8_pmuv3_0/l2d_cache_refill/,armv8_pmuv3_0/l2d_cache/,armv8_pmuv3_0/br_immed_retired/,armv8_pmuv3_0/br_mis_pred/,armv8_pmuv3_0/br_pred/ -- ls

Performance counter stats for 'ls':

         1,690,222      armv8_pmuv3_0/inst_retired/     (45.02%)
            23,170      armv8_pmuv3_0/l1d_cache_refill/
           726,296      armv8_pmuv3_0/l1d_cache/
            13,997      armv8_pmuv3_0/l2d_cache_refill/
           125,787      armv8_pmuv3_0/l2d_cache/
           351,946      armv8_pmuv3_0/br_immed_retired/
            28,695      armv8_pmuv3_0/br_mis_pred/      (54.98%)
     <not counted>      armv8_pmuv3_0/br_pred/          (0.00%)

       0.007883225 seconds time elapsed

       0.004223000 seconds user
       0.004223000 seconds sys

In this example, eight PMU events of the Cortex-A53 PMU are selected. However, for Cortex-A53, only six event counters are available. This results in multiplexing among events inst_retired, br_mis_pred, and br_pred.

The percentage listed after each of these three events indicates the share of the whole running time during which the event was enabled. The counting values listed in front of the events inst_retired and br_mis_pred are estimated values, which are scaled based on the percentage. Because ls, the command chosen as the application, runs for only a short time, the event br_pred has no chance to be enabled to count values. Therefore, its counting value is marked as <not counted>. For the other events, the values are real counting values.

Thus, when determining the number of PMU events to select each time, take the characteristics of the application into consideration. Multiplexing and scaling introduce inaccuracy if the application execution time is too short or its behavior is not uniform.
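
The scaling step can be illustrated with a few lines of Python. This is a sketch of the arithmetic only, assuming the usual relation in which the raw count is multiplied by time enabled over time running. The function names and the numbers in the usage lines are invented for illustration and are not taken from the run above.

```python
def scale_count(raw_count, time_enabled, time_running):
    """Estimate the whole-run count of a multiplexed event.

    Returns None when the event never got a counter, which corresponds
    to the <not counted> case.
    """
    if time_running == 0:
        return None
    return round(raw_count * time_enabled / time_running)


def running_percent(time_enabled, time_running):
    """Share of the enabled time during which the event was really counting."""
    if time_enabled == 0:
        return 0.0
    return 100.0 * time_running / time_enabled


if __name__ == "__main__":
    # Hypothetical values: the event counted for half of the enabled time.
    print(scale_count(1000, 200, 100), running_percent(200, 100))
```

The estimate is only as good as the assumption that the event rate is uniform over the run, which is why short or bursty applications are affected most.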

The PMU events to be selected each time

For better accuracy, and so that the metrics derived from the PMU event values are meaningful, select related or comparable events each time.

When the number of PMU events selected each time exceeds the available counters, you can place related PMU events into a group. Multiplexing then does not occur among the events within a group, only among groups. To do this, enclose the events that you want to select as a group between \{ and \}. The basic usage is as follows:

$ perf stat -e \{<event 1>,<event 2>,…,<event m>\},\{…\} -- <command to run the application>

Note: Each group must contain no more events than the PMU of each CPU can count at once. For Cortex-A53 and Cortex-A72, that is six event counters and one cycle counter. For other CPUs, you can consult PMCFGR.N in the corresponding TRM.

An example on the Juno r2 platform is as follows:

$ taskset -c 0-3 perf stat -e \{armv8_pmuv3_0/inst_retired/,armv8_pmuv3_0/l1d_cache_refill/,armv8_pmuv3_0/l1d_cache/,armv8_pmuv3_0/l2d_cache_refill/,armv8_pmuv3_0/l2d_cache/\},\{armv8_pmuv3_0/br_immed_retired/,armv8_pmuv3_0/br_mis_pred/,armv8_pmuv3_0/br_pred/\} -- ls

Performance counter stats for 'ls':

         1,568,362      armv8_pmuv3_0/inst_retired/     (25.03%)
            18,195      armv8_pmuv3_0/l1d_cache_refill/ (25.03%)
           623,266      armv8_pmuv3_0/l1d_cache/        (25.03%)
            15,510      armv8_pmuv3_0/l2d_cache_refill/ (25.03%)
           146,055      armv8_pmuv3_0/l2d_cache/        (25.03%)
           372,014      armv8_pmuv3_0/br_immed_retired/ (74.97%)
            31,153      armv8_pmuv3_0/br_mis_pred/      (74.97%)
           424,800      armv8_pmuv3_0/br_pred/          (74.97%)

       0.007646814 seconds time elapsed
 
       0.000000000 seconds user
       0.008331000 seconds sys

The previous example places eight PMU events into two groups:

One group is related to the cache
The other group is related to the branch

The percentages listed after the PMU events indicate that multiplexing happens only between the groups. This ensures that the counting values of PMU events within a group are measured over the same time window and are therefore comparable.
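
A grouped -e argument can also be generated and checked against the counter limit before you run Perf. The Python sketch below is an illustration under simplifying assumptions: group_option is a hypothetical helper, the default limit of six is the Cortex-A53 event counter figure from the note above, and the separate cycle counter is ignored.

```python
def group_option(groups, max_events=6):
    """Build a grouped -e argument, rejecting groups larger than max_events."""
    parts = []
    for group in groups:
        if not group:
            raise ValueError("empty group")
        if len(group) > max_events:
            raise ValueError(
                f"group of {len(group)} events exceeds {max_events} counters"
            )
        parts.append("{" + ",".join(group) + "}")
    return ",".join(parts)


if __name__ == "__main__":
    pmu = "armv8_pmuv3_0"
    cache = [f"{pmu}/{name}/" for name in ("l1d_cache_refill", "l1d_cache")]
    branch = [f"{pmu}/{name}/" for name in ("br_mis_pred", "br_pred")]
    print(group_option([cache, branch]))
```

The function returns plain braces. The backslashes in the commands above are shell escaping, so they are only needed when you type the string on a shell command line.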

Specify user-space and kernel-space events to count

Perf supports user-space and kernel-space counting separately. You can achieve this by adding the modifier u for user-space counting and k for kernel-space counting as follows:

$ perf stat -e armv8_pmuv3_0/event=0x86/u,armv8_pmuv3_0/br_immed_retired/u -- <command to run the application> 
$ perf stat -e cpu-cycles:u -- <command to run the application>
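
The two commands above attach the modifier in slightly different ways: directly after the closing slash for the PMU format, and after a colon for a plain symbolic name. A small Python sketch of that rule, using a hypothetical helper named with_modifier, looks like this:

```python
def with_modifier(event, modifier):
    """Append 'u' (user-space) or 'k' (kernel-space) to an event specifier."""
    if modifier not in ("u", "k"):
        raise ValueError("this sketch handles only the u and k modifiers")
    if event.endswith("/"):
        # PMU format, for example armv8_pmuv3_0/event=0x86/
        return event + modifier
    # Plain symbolic name, for example cpu-cycles
    return f"{event}:{modifier}"


if __name__ == "__main__":
    print(with_modifier("armv8_pmuv3_0/event=0x86/", "u"))
    print(with_modifier("cpu-cycles", "u"))
```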

Note: To count PMU events in kernel-space for a non-root user in Linux, you must also use the following setting:

$ su
$ echo -1 > /proc/sys/kernel/perf_event_paranoid

We can summarize the possible usages of perf stat as follows:

Divide all PMU events to select into groups.
Based on the application you want to profile, you can choose to:

Run multiple times, selecting one group at a time for measurement, and repeating the process until all events are measured. In this way, multiplexing and scaling do not happen.
Run one time with the group modifier added. In this way, multiplexing and scaling do not occur within a group, only between groups.

Example

An example of collecting statistics from PMU on the Juno r2 platform is as follows:

$ gcc -O0 -g 2D_array.c -o 2D_array 
$ taskset -c 0-3 perf stat -e armv8_pmuv3_0/cpu_cycles/,armv8_pmuv3_0/inst_retired/,armv8_pmuv3_0/l1d_cache_refill/,armv8_pmuv3_0/l1d_cache/,armv8_pmuv3_0/l2d_cache_refill/,armv8_pmuv3_0/l2d_cache/,armv8_pmuv3_0/mem_access/ -- 2D_array

Performance counter stats for '2D_array':

     5,786,306,035      armv8_pmuv3_0/cpu_cycles/
     1,658,874,195      armv8_pmuv3_0/inst_retired/
        67,051,466      armv8_pmuv3_0/l1d_cache_refill/
       695,606,934      armv8_pmuv3_0/l1d_cache/
        65,775,517      armv8_pmuv3_0/l2d_cache_refill/
       143,130,513      armv8_pmuv3_0/l2d_cache/
       662,970,040      armv8_pmuv3_0/mem_access/

       8.269886818 seconds time elapsed

       8.129006000 seconds user
       0.140017000 seconds sys

The application on the Juno r2 platform is compiled by gcc with no optimization. Perf stat displays the statistics when the application runs to the end.

Note: The Juno r2 platform is based on a big.LITTLE processor. Use taskset -c 0-3 to ensure that the application runs on the Cortex-A53 CPUs only.

This example selects cpu_cycles and six other PMU events. For Cortex-A53, the cycle counter counts cpu_cycles, and the six event counters count inst_retired, mem_access, and the cache-related PMU events.

The Arm Architecture Reference Manual for A-profile architecture (ARM-ARM) highlights meaningful combinations of common microarchitectural events. In this example, you can derive the following metrics from the PMU events:

Metric	Formula	Value
Attributable Level 1 data cache refill rate	L1D_CACHE_REFILL/L1D_CACHE	0.096
Attributable Level 2 unified cache refill rate	L2D_CACHE_REFILL/L2D_CACHE	0.460
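
The two formulas can be evaluated directly from the perf stat output of this example. The following Python sketch does that. The counter values are copied from the run above, while the COUNTS dictionary and the refill_rate helper are names chosen for this illustration.

```python
# Counter values copied from the perf stat output of the 2D_array run.
COUNTS = {
    "l1d_cache_refill": 67_051_466,
    "l1d_cache": 695_606_934,
    "l2d_cache_refill": 65_775_517,
    "l2d_cache": 143_130_513,
}


def refill_rate(refills, accesses):
    """Refill rate = refills / accesses, as in the Formula column."""
    if accesses == 0:
        raise ValueError("no cache accesses were counted")
    return refills / accesses


if __name__ == "__main__":
    l1d = refill_rate(COUNTS["l1d_cache_refill"], COUNTS["l1d_cache"])
    l2d = refill_rate(COUNTS["l2d_cache_refill"], COUNTS["l2d_cache"])
    print(f"L1D cache refill rate: {l1d:.3f}")
    print(f"L2 cache refill rate:  {l2d:.3f}")
```

Because all four events fit in the six event counters in this run, no multiplexing occurred and the ratios are computed from real counts rather than estimates.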

Now, we have a high-level understanding of the application performance data. However, without comparison, we can only observe that the L2 data cache miss rate appears to be relatively high. To further analyze and improve the application performance, we need to do hot-spot analysis.

Part 3 describes:

Sampling PMU events and conducting hot-spot analysis
Optimizing code and performing validation

Thz89 9 months ago

Where is this Perf stuff documented, not on bottom-level, but reg. COTS Sw-API level?

(And: Is it possible to do PMU event reading (i.e. cache events) on user-level processes, e.g. QNX, even when running not as root, / where is this documented?)

Is it possible to setup an own Perf/PMU tracing mini-system for some applications, - already outlined by some application-note or blog?
Jiaming Guo 9 months ago in reply to Thz89

I think the following link can help you. :)

Use PAPI for counting | Arm Learning Paths