Part 1 sets the goal of the performance analysis and outlines the four stages of the basic performance analysis workflow. This part describes stage 1 and stage 2 in detail.
Perf supports different types of events, including software events and hardware events. You can select events using various Perf commands with the option -e.
Use the command perf list to list symbolic events. An example on the Juno r2 platform is as follows:
$ perf list
List of pre-defined events (to be used in -e or -M):
  branch-instructions OR branches                      [Hardware event]
  ...
  alignment-faults                                     [Software event]
  ...
  L1-dcache-load-misses                                [Hardware cache event]
  L1-dcache-loads                                      [Hardware cache event]
  ...
  br_immed_retired OR armv8_pmuv3_0/br_immed_retired/  [Kernel PMU event]
  br_mis_pred OR armv8_pmuv3_0/br_mis_pred/            [Kernel PMU event]
  br_mis_pred OR armv8_pmuv3_1/br_mis_pred/            [Kernel PMU event]
  ...
As the example shows, the PMU of each CPU provides the event types that Perf labels as Hardware event, Hardware cache event, and Kernel PMU event.
However, not all the PMU events of CPUs are listed in the symbolic format. Perf also supports using the raw hardware event format. For Armv8-A CPUs, it can be used by specifying the PMU driver name of the CPU with event numbers of the PMU events. You can select any PMU event described in the CPU Technical Reference Manual (TRM) by the following steps:
1. See the CPU TRM for the event number
Consider Cortex-A53 as an example. Suppose that you want to select Exception taken (IRQ) and Exception taken (FIQ). These two events are listed in the TRM, but not in the perf list output. See the Cortex-A53 TRM to get their event numbers, as the following table shows:
Event number   Event mnemonic   Event name
0x86           EXC_IRQ          Exception taken (IRQ)
0x87           EXC_FIQ          Exception taken (FIQ)
2. Look up the PMU driver name of the corresponding CPU
The Juno r2 platform used in this blog is based on a big.LITTLE processor, which introduces two PMU drivers, named armv8_pmuv3_0 and armv8_pmuv3_1. To select events of the Cortex-A53 PMU, you must look up the correct PMU driver name for it.
First, get the topology of the CPUs as follows:
$ cat /proc/cpuinfo
processor       : 0
BogoMIPS        : 100.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 0
...
processor       : 4
BogoMIPS        : 100.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd07
CPU revision    : 0
...
In the log above, CPU0 is Cortex-A53 with part number 0xd03, and CPU4 is Cortex-A72 with part number 0xd07. The part number is defined in the MIDR_EL1. See the CPU TRM for more information.
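The same lookup can be scripted. The following is a minimal sketch, assuming the /proc/cpuinfo layout shown in the log above; the function name cpu_parts and the shortened sample text are illustrative and not part of Perf or the kernel.

```python
def cpu_parts(cpuinfo_text):
    """Map each logical CPU number to its 'CPU part' value.

    Expects text in the /proc/cpuinfo layout shown above, where every
    'processor' line is followed by the fields for that CPU.
    """
    parts = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a key/value pair
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "CPU part" and current is not None:
            parts[current] = int(value, 16)
    return parts


# Shortened copy of the log above (hypothetical sample input).
sample = """\
processor       : 0
CPU implementer : 0x41
CPU part        : 0xd03

processor       : 4
CPU implementer : 0x41
CPU part        : 0xd07
"""

for cpu, part in sorted(cpu_parts(sample).items()):
    print(f"CPU{cpu}: part {part:#x}")
```

On a real system, the input would be the contents of /proc/cpuinfo rather than the sample string.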
Next, find out which CPUs each PMU driver supports:
$ ls /sys/bus/event_source/devices
armv8_pmuv3_0 armv8_pmuv3_1 breakpoint kprobe software tracepoint uprobe
$ cat /sys/bus/event_source/devices/armv8_pmuv3_0/cpus
0-3
$ cat /sys/bus/event_source/devices/armv8_pmuv3_1/cpus
4-5
In the log above, the PMU driver named armv8_pmuv3_0 is the driver of the CPU0-3 PMUs, and armv8_pmuv3_1 is the driver of the CPU4-5 PMUs.
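The cpus files use the kernel CPU list notation (for example 0-3). As a small sketch, the following expands such a list and finds the PMU driver for a given CPU. The helper names parse_cpu_list and pmu_for_cpu are assumptions made for illustration, and the dictionary holds the values from the log above instead of reading sysfs.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3' or '0,2-3' into a list of ints."""
    cpus = []
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, sep, last = chunk.partition("-")
        if sep:
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(first))
    return cpus


def pmu_for_cpu(cpu, pmu_cpus):
    """Return the PMU driver name whose CPU list contains `cpu`, or None."""
    for name, cpu_text in pmu_cpus.items():
        if cpu in parse_cpu_list(cpu_text):
            return name
    return None


# Values copied from the sysfs log above.
pmu_cpus = {"armv8_pmuv3_0": "0-3", "armv8_pmuv3_1": "4-5"}

print(pmu_for_cpu(0, pmu_cpus))
print(pmu_for_cpu(4, pmu_cpus))
```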
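Finally, you can establish the relationship between each PMU driver and its corresponding CPUs: armv8_pmuv3_0 is the PMU driver for the Cortex-A53 CPUs (CPU0-3), and armv8_pmuv3_1 is the PMU driver for the Cortex-A72 CPUs (CPU4-5).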
3. Check the raw hardware event encoding format for the specific PMU
You now have the event numbers of the events to select and the PMU driver of the corresponding CPU. Next, you need to check the encoding format for the PMU events. Use the command as follows:
$ cat /sys/bus/event_source/devices/armv8_pmuv3_0/format/event
config:0-15
In the log above, config:0-15 means that bits 0-15 (two bytes) of the config field are used to pass the event number to Perf.
4. Select the raw hardware event in Perf
Select the PMU event in Perf in the raw hardware event format. The following commands list three possible usages.
# Single event
$ perf <command> -e armv8_pmuv3_0/event=0x86/
# Multiple events, comma-separated with no spaces
$ perf <command> -e armv8_pmuv3_0/event=0x86/,armv8_pmuv3_0/event=0x87/
# Multiple events, mixing the symbolic format and the raw hardware event format
$ perf <command> -e armv8_pmuv3_0/event=0x86/,armv8_pmuv3_0/event=0x87/,armv8_pmuv3_0/br_immed_retired/
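Steps 3 and 4 can be combined in a small helper that checks an event number against the format entry and builds the string passed to -e. This is only a sketch: parse_format_field, raw_event, and event_list are hypothetical names, and the default format text is the config:0-15 value read from sysfs in step 3.

```python
def parse_format_field(text):
    """Parse a sysfs format entry such as 'config:0-15' into (field, low, high)."""
    field, _, bits = text.strip().partition(":")
    low, sep, high = bits.partition("-")
    return field, int(low), int(high if sep else low)


def raw_event(pmu, event_number, format_text="config:0-15"):
    """Build a raw event spec like 'armv8_pmuv3_0/event=0x86/'.

    Raises ValueError if the event number does not fit in the bit range
    described by the format entry.
    """
    _, low, high = parse_format_field(format_text)
    width = high - low + 1
    if not 0 <= event_number < (1 << width):
        raise ValueError(f"event number {event_number:#x} does not fit in {width} bits")
    return f"{pmu}/event={event_number:#x}/"


def event_list(*events):
    """Join event specs with commas and no spaces, as the -e option expects."""
    return ",".join(events)


spec = event_list(
    raw_event("armv8_pmuv3_0", 0x86),      # raw hardware event
    raw_event("armv8_pmuv3_0", 0x87),      # raw hardware event
    "armv8_pmuv3_0/br_immed_retired/",     # symbolic event
)
print(spec)
```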
When you know the PMU events to select, you can use Perf to collect statistics of the selected PMU events.
The basic usage of perf stat is as follows:
$ perf stat -e <event> -- <command to run the application>
When using perf stat to collect statistics from the PMU, consider the following factors:
In the initial performance analysis of the application, normally a wide range of PMU events are selected to get a comprehensive understanding of the application behavior and performance.
For Armv8-A CPUs, however, the number of PMU counters available on each CPU is limited. If the events selected in the Perf command exceed the available counters, the kernel uses time multiplexing to give each event a chance to count. At the end of the run, Perf scales the counts based on the total time enabled and the time running. In other words, when multiplexing and scaling happen, the counts for the selected PMU events are estimates.
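The scaling step can be illustrated with a few lines of arithmetic. This sketch assumes the commonly documented rule that the estimate is the raw count multiplied by time enabled and divided by time running; the function names and the numbers in the usage lines are made up for illustration.

```python
def scale_count(raw_count, time_enabled, time_running):
    """Estimate the full-run count of a multiplexed event.

    raw_count:    value counted while the event was actually on a counter
    time_enabled: total time the event was enabled
    time_running: time the event was actually counting
    Returns None when the event never ran, because no estimate is possible.
    """
    if time_running == 0:
        return None
    return round(raw_count * time_enabled / time_running)


def running_percentage(time_enabled, time_running):
    """Share of the enabled time during which the event was counting."""
    return 100.0 * time_running / time_enabled


# Hypothetical event that counted 500 occurrences while running
# for half of the time it was enabled.
print(scale_count(500, time_enabled=100, time_running=50))
print(running_percentage(time_enabled=100, time_running=50))
```

The estimate is only as good as the assumption that the event rate is the same while the event is not being counted.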
A typical example on the Juno r2 platform is as follows:
$ taskset -c 0-3 perf stat -e armv8_pmuv3_0/inst_retired/,armv8_pmuv3_0/l1d_cache_refill/,armv8_pmuv3_0/l1d_cache/,armv8_pmuv3_0/l2d_cache_refill/,armv8_pmuv3_0/l2d_cache/,armv8_pmuv3_0/br_immed_retired/,armv8_pmuv3_0/br_mis_pred/,armv8_pmuv3_0/br_pred/ -- ls
Performance counter stats for 'ls':

         1,690,222      armv8_pmuv3_0/inst_retired/        (45.02%)
            23,170      armv8_pmuv3_0/l1d_cache_refill/
           726,296      armv8_pmuv3_0/l1d_cache/
            13,997      armv8_pmuv3_0/l2d_cache_refill/
           125,787      armv8_pmuv3_0/l2d_cache/
           351,946      armv8_pmuv3_0/br_immed_retired/
            28,695      armv8_pmuv3_0/br_mis_pred/         (54.98%)
     <not counted>      armv8_pmuv3_0/br_pred/             (0.00%)

       0.007883225 seconds time elapsed

       0.004223000 seconds user
       0.004223000 seconds sys
In this example, eight PMU events of the Cortex-A53 PMU are selected. However, for Cortex-A53, only six event counters are available. This results in multiplexing among events inst_retired, br_mis_pred, and br_pred.
The percentage listed after each of these three events indicates the share of the total running time during which the event was enabled. The counts shown for inst_retired and br_mis_pred are estimates, scaled based on that percentage. Because the application in this example, ls, runs for only a short time, the event br_pred never gets a chance to count, so its value is marked as <not counted>. For the other events, the values are real counts.
Thus, when deciding how many PMU events to select at a time, take the characteristics of the application into consideration. Multiplexing and scaling introduce inaccuracy if the application execution time is too short or its behavior is not uniform over time.
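To spot multiplexed or uncounted events without reading the output by eye, the count lines can be parsed. The following is a rough sketch that only handles the simple line layout shown in this blog (a count or <not counted>, the event name, and an optional percentage). Real perf stat output can carry extra columns, so the regular expression and the parse_stat name are assumptions.

```python
import re

# count (or "<not counted>"), event name, optional "(NN.NN%)"
LINE = re.compile(r"^\s*(<not counted>|[\d,]+)\s+(\S+)(?:\s+\((\d+(?:\.\d+)?)%\))?\s*$")


def parse_stat(text):
    """Return (event, count, percent) tuples from simple perf stat output.

    count is None for '<not counted>'. A missing percentage is assumed to
    mean the event was counting for the whole run (100.0).
    """
    rows = []
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # header, timing, or blank line
        value, event, pct = match.groups()
        count = None if value == "<not counted>" else int(value.replace(",", ""))
        rows.append((event, count, float(pct) if pct else 100.0))
    return rows


# Excerpt of the output above.
sample = """\
 Performance counter stats for 'ls':
         1,690,222      armv8_pmuv3_0/inst_retired/        (45.02%)
            23,170      armv8_pmuv3_0/l1d_cache_refill/
     <not counted>      armv8_pmuv3_0/br_pred/             (0.00%)
       0.007883225 seconds time elapsed
"""

rows = parse_stat(sample)
multiplexed = [event for event, _, pct in rows if pct < 100.0]
print(multiplexed)
```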
For better accuracy, and so that the metrics derived from the PMU event values are meaningful, select related or comparable events together.
When the number of selected PMU events exceeds the available counters, you can place related PMU events into a group. Multiplexing then does not occur among the events within a group, only between groups. To define a group, enclose the events in \{ and \}. The basic usage is as follows:
$ perf stat -e \{<event 1>,<event 2>,…,<event m>\},\{…\} -- <command to run the application>
Note: A group must not contain more events than the PMU of each CPU can count at the same time. For Cortex-A53 and Cortex-A72, that is six event counters and one cycle counter. For other CPUs, see PMCFGR.N in the corresponding TRM.
An example on the Juno r2 platform is as follows:
$ taskset -c 0-3 perf stat -e \{armv8_pmuv3_0/inst_retired/,armv8_pmuv3_0/l1d_cache_refill/,armv8_pmuv3_0/l1d_cache/,armv8_pmuv3_0/l2d_cache_refill/,armv8_pmuv3_0/l2d_cache/\},\{armv8_pmuv3_0/br_immed_retired/,armv8_pmuv3_0/br_mis_pred/,armv8_pmuv3_0/br_pred/\} -- ls
Performance counter stats for 'ls':

         1,568,362      armv8_pmuv3_0/inst_retired/        (25.03%)
            18,195      armv8_pmuv3_0/l1d_cache_refill/    (25.03%)
           623,266      armv8_pmuv3_0/l1d_cache/           (25.03%)
            15,510      armv8_pmuv3_0/l2d_cache_refill/    (25.03%)
           146,055      armv8_pmuv3_0/l2d_cache/           (25.03%)
           372,014      armv8_pmuv3_0/br_immed_retired/    (74.97%)
            31,153      armv8_pmuv3_0/br_mis_pred/         (74.97%)
           424,800      armv8_pmuv3_0/br_pred/             (74.97%)

       0.007646814 seconds time elapsed

       0.000000000 seconds user
       0.008331000 seconds sys
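The previous example places eight PMU events into two groups:
Group 1: inst_retired, l1d_cache_refill, l1d_cache, l2d_cache_refill, l2d_cache
Group 2: br_immed_retired, br_mis_pred, br_pred
The percentages listed after the PMU events indicate that multiplexing happens only between the groups. This ensures that the counts of the PMU events within a group are collected over the same time slices and are therefore comparable.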
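When the event list is generated by a script, the group size limit can be checked before Perf is invoked. The sketch below is illustrative only: format_groups is a hypothetical helper, the default of six matches the Cortex-A53 and Cortex-A72 event counters mentioned above, and the dedicated cycle counter is ignored for simplicity.

```python
def format_groups(groups, max_counters=6):
    """Build a grouped -e argument such as '{a,b},{c}'.

    groups:       list of lists of event specs
    max_counters: assumed number of event counters per CPU PMU
    Raises ValueError if a group cannot fit on the counters at once.
    """
    specs = []
    for group in groups:
        if len(group) > max_counters:
            raise ValueError(
                f"group of {len(group)} events exceeds {max_counters} counters"
            )
        specs.append("{" + ",".join(group) + "}")
    return ",".join(specs)


pmu = "armv8_pmuv3_0"
cache_group = [f"{pmu}/{name}/" for name in
               ("inst_retired", "l1d_cache_refill", "l1d_cache",
                "l2d_cache_refill", "l2d_cache")]
branch_group = [f"{pmu}/{name}/" for name in
                ("br_immed_retired", "br_mis_pred", "br_pred")]

print(format_groups([cache_group, branch_group]))
```

The braces are not backslash-escaped here because the string is meant to be passed to Perf as a single argument, for example through an argument list, without going through a shell. When pasting into a shell, escape them as in the command above.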
Perf supports user-space and kernel-space counting separately. You can achieve this by adding the modifier u for user-space counting and k for kernel-space counting as follows:
$ perf stat -e armv8_pmuv3_0/event=0x86/u,armv8_pmuv3_0/br_immed_retired/u -- <command to run the application>
$ perf stat -e cpu-cycles:u -- <command to run the application>
Note: To count PMU events in kernel-space for a non-root user in Linux, you must also use the following setting:
$ su
$ echo -1 > /proc/sys/kernel/perf_event_paranoid
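Before changing the setting, it can be useful to check the current value. The following sketch assumes the upstream kernel behavior, in which values of 2 and above disallow kernel profiling for users without the required capability. Some distributions patch this behavior, so treat the threshold as an assumption. The function names are hypothetical.

```python
from pathlib import Path

PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid(path=PARANOID_PATH):
    """Read the current perf_event_paranoid level as an integer."""
    return int(path.read_text().strip())


def kernel_counting_allowed(level):
    """True if an unprivileged user may count kernel-space events.

    Assumes upstream semantics: levels >= 2 disallow kernel profiling
    for users without the required capability.
    """
    return level < 2


# Levels checked here are examples; call read_paranoid() on a Linux
# system to get the real value.
for level in (-1, 1, 2):
    print(level, kernel_counting_allowed(level))
```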
We can summarize the possible usages of perf stat as follows:
An example of collecting statistics from PMU on the Juno r2 platform is as follows:
$ gcc -O0 -g 2D_array.c -o 2D_array
$ taskset -c 0-3 perf stat -e armv8_pmuv3_0/cpu_cycles/,armv8_pmuv3_0/inst_retired/,armv8_pmuv3_0/l1d_cache_refill/,armv8_pmuv3_0/l1d_cache/,armv8_pmuv3_0/l2d_cache_refill/,armv8_pmuv3_0/l2d_cache/,armv8_pmuv3_0/mem_access/ -- 2D_array
Performance counter stats for '2D_array':

     5,786,306,035      armv8_pmuv3_0/cpu_cycles/
     1,658,874,195      armv8_pmuv3_0/inst_retired/
        67,051,466      armv8_pmuv3_0/l1d_cache_refill/
       695,606,934      armv8_pmuv3_0/l1d_cache/
        65,775,517      armv8_pmuv3_0/l2d_cache_refill/
       143,130,513      armv8_pmuv3_0/l2d_cache/
       662,970,040      armv8_pmuv3_0/mem_access/

       8.269886818 seconds time elapsed

       8.129006000 seconds user
       0.140017000 seconds sys
The application is compiled with gcc with no optimization (-O0). perf stat displays the statistics when the application exits.
Note: The Juno r2 platform is based on a big.LITTLE processor. Use taskset -c 0-3 to ensure that the application runs on the Cortex-A53 CPUs only.
This example selects cpu_cycles and six other PMU events. For Cortex-A53, the cycle counter counts cpu_cycles, and the six event counters count inst_retired, mem_access, and the cache-related PMU events.
The Arm Architecture Reference Manual for A-profile architecture (ARM-ARM) highlights meaningful combinations of common microarchitectural events. In this example, you can derive the following metrics from the PMU events:
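As a rough sketch, a few ratios commonly derived from these counters (instructions per cycle and the refill-to-access ratio of each cache level) can be computed from the counts in the output above. The choice of metrics and the names used here are assumptions made for illustration, not a list taken from the Arm ARM.

```python
# Counts copied from the perf stat output of 2D_array above.
counts = {
    "cpu_cycles": 5_786_306_035,
    "inst_retired": 1_658_874_195,
    "l1d_cache_refill": 67_051_466,
    "l1d_cache": 695_606_934,
    "l2d_cache_refill": 65_775_517,
    "l2d_cache": 143_130_513,
    "mem_access": 662_970_040,
}


def ratio(numerator, denominator):
    """Divide two counts, returning None if the denominator is zero."""
    return numerator / denominator if denominator else None


# Instructions retired per CPU cycle.
ipc = ratio(counts["inst_retired"], counts["cpu_cycles"])
# Share of L1 data cache accesses that caused a refill.
l1d_miss_rate = ratio(counts["l1d_cache_refill"], counts["l1d_cache"])
# Share of L2 data cache accesses that caused a refill.
l2d_miss_rate = ratio(counts["l2d_cache_refill"], counts["l2d_cache"])

print(f"IPC:           {ipc:.2f}")
print(f"L1D miss rate: {l1d_miss_rate:.1%}")
print(f"L2D miss rate: {l2d_miss_rate:.1%}")
```

With these counts, the sketch gives roughly 0.29 instructions per cycle, an L1 data cache miss rate of about 9.6%, and an L2 data cache miss rate of about 46%.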
Now, we have a high-level understanding of the application performance data. However, without a comparison, we can only observe that the L2 data cache miss rate appears to be relatively high. To further analyze and improve the application performance, we need to do hot-spot analysis. Part 3 describes:
Where is this Perf stuff documented, not on bottom-level, but reg. COTS Sw-API level?
(And: Is it possible to do PMU event reading (i.e. cache events) on user-level processes, e.g. QNX, even when running not as root, / where is this documented?)
Is it possible to setup an own Perf/PMU tracing mini-system for some applications, - already outlined by some application-note or blog?
I think the following link can help you. :)
Use PAPI for counting | Arm Learning Paths