The Performance Monitor Unit (PMU) in an Armv8-A CPU provides hardware-level performance monitoring and profiling capabilities. The PMU collects hardware event counts through counters, which include a cycle counter and event counters. You can configure:
Tools like perf use the PMU to profile Linux applications running on Armv8-A CPUs. However, there is no common way to use the PMU to profile firmware, because firmware runs in a bare-metal environment.
To profile firmware with the PMU, a straightforward approach is to add PMU support in the form of a library. In this blog post, we provide a PMU library as a reference implementation. This library has been verified on a platform based on an Armv8.0-A CPU. We also describe how to add PMU support to the firmware and how to profile the firmware with the PMU. Trusted Firmware-A (TF-A) and U-Boot are used as example firmware.
Download the PMU Library
The table below lists components of the PMU library.
arch.h
arch_helpers.h
armv8_pmuv3_events.h
armv8_pmuv3_fn.c
armv8_pmuv3_fn.h
jevents.py
armv8_pmuv3_events.c
Use the following command to unzip all the files of the PMU library to the path lib/pmu in the firmware.
unzip armv8_pmuv3_library.zip -d ${FIRMWARE_PATH}/lib/pmu
Depending on the firmware you are working with, make the following changes to the Makefile.
For TF-A, modify the existing Makefile located in the root path of TF-A as follows.
BL_COMMON_SOURCES += lib/pmu/armv8_pmuv3_fn.c lib/pmu/armv8_pmuv3_events.c
INCLUDES += -Ilib/pmu
For U-Boot, modify the existing Makefile located in the root path of U-Boot as follows.
UBOOTINCLUDE += -Ilib/pmu
Then create a Makefile in the same directory as the PMU library, with the following content.
obj-y += armv8_pmuv3_fn.o armv8_pmuv3_events.o
Now, you can rebuild the firmware. If you build it successfully, go to stage 2.
First, get the PMU events supported by the Armv8-A CPU on which the firmware runs. See the CPU Technical Reference Manual (TRM) for the PMU events.
Alternatively, you can use the Python script named jevents.py in the PMU library with the following steps.
The GitHub repository Machine-readable data maintains the PMU events supported by each Arm CPU. It stores information such as the event mnemonic and event number of each PMU event in JSON format. This information is consistent with that in the CPU TRM.
The jevents.py script clones the repository, converts the JSON format to C format, and generates the file named armv8_pmuv3_events.c. You can use it as follows.
First, use the following command to list the names of all the CPUs supported for conversion.
python3 jevents.py --list
Then, use the following command to specify the name of the CPU to convert. This command generates the file named armv8_pmuv3_events.c. The file contains two types of arrays: one for the PMU events supported by the specified CPU, and one for the PMU events selected for profiling.
python3 jevents.py --cpu <cpu name>
The following is an example of the armv8_pmuv3_events.c file generated by jevents.py, with the CPU name specified as cortex-a53. The array named pmu_events_map contains all the PMU events supported by Cortex-A53, matching the Cortex-A53 TRM. You can select PMU events from this array and put the selected ones into the array named evt_select.
#include "armv8_pmuv3_events.h"

/* Selected PMU Events Table */
struct pmu_event_selected evt_select[] = {
    { .event.name = "INST_RETIRED", .event.number = 0x08 },
    { .event.name = "CYCLES", .event.number = CYCLE_COUNTER_EVENT },
    /* Always place the following one at the end of the table */
    { .event.name = "NULL", .event.number = PMU_SELECTED_END }
};

/* PMU Events Mapping Table */
const struct pmu_event pmu_events_map[] = {
    /* Supporting PMU events for event counter */
    { .name = "SW_INCR", .number = 0x0 },
    { .name = "L1I_CACHE_REFILL", .number = 0x1 },
    { .name = "L1I_TLB_REFILL", .number = 0x2 },
    { .name = "L1D_CACHE_REFILL", .number = 0x3 },
    { .name = "L1D_CACHE", .number = 0x4 },
    { .name = "L1D_TLB_REFILL", .number = 0x5 },
    { .name = "LD_RETIRED", .number = 0x6 },
    { .name = "ST_RETIRED", .number = 0x7 },
    { .name = "INST_RETIRED", .number = 0x8 },
    { .name = "EXC_TAKEN", .number = 0x9 },
    { .name = "EXC_RETURN", .number = 0xa },
    { .name = "CID_WRITE_RETIRED", .number = 0xb },
    { .name = "PC_WRITE_RETIRED", .number = 0xc },
    { .name = "BR_IMMED_RETIRED", .number = 0xd },
    { .name = "BR_RETURN_RETIRED", .number = 0xe },
    { .name = "UNALIGNED_LDST_RETIRED", .number = 0xf },
    { .name = "BR_MIS_PRED", .number = 0x10 },
    { .name = "CPU_CYCLES", .number = 0x11 },
    { .name = "BR_PRED", .number = 0x12 },
    { .name = "MEM_ACCESS", .number = 0x13 },
    { .name = "L1I_CACHE", .number = 0x14 },
    { .name = "L1D_CACHE_WB", .number = 0x15 },
    { .name = "L2D_CACHE", .number = 0x16 },
    { .name = "L2D_CACHE_REFILL", .number = 0x17 },
    { .name = "L2D_CACHE_WB", .number = 0x18 },
    { .name = "BUS_ACCESS", .number = 0x19 },
    { .name = "MEMORY_ERROR", .number = 0x1a },
    { .name = "BUS_CYCLES", .number = 0x1d },
    { .name = "CHAIN", .number = 0x1e },
    { .name = "BUS_ACCESS_RD", .number = 0x60 },
    { .name = "BUS_ACCESS_WR", .number = 0x61 },
    { .name = "BR_INDIRECT_SPEC", .number = 0x7a },
    { .name = "EXC_IRQ", .number = 0x86 },
    { .name = "EXC_FIQ", .number = 0x87 },
    { .name = "0xc0", .number = 0xc0 },
    { .name = "0xc1", .number = 0xc1 },
    { .name = "0xc2", .number = 0xc2 },
    { .name = "0xc3", .number = 0xc3 },
    { .name = "0xc4", .number = 0xc4 },
    { .name = "0xc5", .number = 0xc5 },
    { .name = "0xc6", .number = 0xc6 },
    { .name = "0xc7", .number = 0xc7 },
    { .name = "0xc8", .number = 0xc8 },
    { .name = "0xc9", .number = 0xc9 },
    { .name = "0xca", .number = 0xca },
    { .name = "0xcb", .number = 0xcb },
    { .name = "0xcc", .number = 0xcc },
    { .name = "0xd0", .number = 0xd0 },
    { .name = "0xd1", .number = 0xd1 },
    { .name = "0xd2", .number = 0xd2 },
    { .name = "0xe0", .number = 0xe0 },
    { .name = "0xe1", .number = 0xe1 },
    { .name = "0xe2", .number = 0xe2 },
    { .name = "0xe3", .number = 0xe3 },
    { .name = "0xe4", .number = 0xe4 },
    { .name = "0xe5", .number = 0xe5 },
    { .name = "0xe6", .number = 0xe6 },
    { .name = "0xe7", .number = 0xe7 },
    { .name = "0xe8", .number = 0xe8 },
    /* Supporting event for cycle counter */
    { .name = "CYCLES", .number = CYCLE_COUNTER_EVENT }
};
For the initial profiling, you normally select a wide range of PMU events. This gives you a comprehensive view of the code being profiled.
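Copying event numbers by hand from pmu_events_map into a selection table is easy to get wrong, so it can help to look a number up by its mnemonic instead. The following is a minimal host-side sketch of such a lookup, not part of the PMU library. The names demo_event, demo_map, DEMO_MAP_LEN, and demo_lookup_event are hypothetical, and the struct only mirrors the name and number fields used in the generated file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the library's event type (name + number). */
struct demo_event {
    const char *name;
    unsigned int number;
};

/* A few entries copied from the Cortex-A53 pmu_events_map example. */
static const struct demo_event demo_map[] = {
    { "L1D_CACHE_REFILL", 0x3 },
    { "L1D_CACHE",        0x4 },
    { "INST_RETIRED",     0x8 },
    { "MEM_ACCESS",       0x13 },
};

#define DEMO_MAP_LEN (sizeof(demo_map) / sizeof(demo_map[0]))

/*
 * Look up an event number by mnemonic.
 * Returns 0 and stores the number in *number on success,
 * or -1 if the mnemonic is not in the map.
 */
int demo_lookup_event(const struct demo_event *map, size_t len,
                      const char *name, unsigned int *number)
{
    for (size_t i = 0; i < len; ++i) {
        if (strcmp(map[i].name, name) == 0) {
            *number = map[i].number;
            return 0;
        }
    }
    return -1;
}
```

For example, looking up "MEM_ACCESS" in demo_map stores 0x13, and an unknown mnemonic returns -1 so that a typo is caught before the firmware is built.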
For Armv8-A CPUs, the number of event counters available per CPU is limited. Refer to the CPU TRM for the number. If the number of PMU events you need exceeds the available counters, you can split the events across several pmu_event_selected structure arrays and profile the same code once per array.
For example, we create two pmu_event_selected structure arrays named evt_select_cache and evt_select_wlc in the armv8_pmuv3_events.c file. For Cortex-A53, one cycle counter and six event counters are available, and the number of PMU events in each array does not exceed this. Also, the PMU events within each array are related, so that the profiling data is comparable.
We use these two arrays for the following profiling examples.
/* PMU events for data access */
struct pmu_event_selected evt_select_cache[] = {
    { .event.name = "L1D_TLB_REFILL", .event.number = 0x5 },
    { .event.name = "L1D_CACHE_REFILL", .event.number = 0x3 },
    { .event.name = "L1D_CACHE", .event.number = 0x4 },
    { .event.name = "L2D_CACHE_REFILL", .event.number = 0x17 },
    { .event.name = "L2D_CACHE", .event.number = 0x16 },
    { .event.name = "CYCLES", .event.number = CYCLE_COUNTER_EVENT },
    /* Always place the following one at the end of the table */
    { .event.name = "NULL", .event.number = PMU_SELECTED_END }
};

/* PMU events for workload characterization */
struct pmu_event_selected evt_select_wlc[] = {
    /* INST_RETIRED appears in the profiling output and is needed for IPC */
    { .event.name = "INST_RETIRED", .event.number = 0x8 },
    { .event.name = "LD_RETIRED", .event.number = 0x6 },
    { .event.name = "ST_RETIRED", .event.number = 0x7 },
    /* MEM_ACCESS is 0x13 in pmu_events_map (0x4 is L1D_CACHE) */
    { .event.name = "MEM_ACCESS", .event.number = 0x13 },
    { .event.name = "BR_IMMED_RETIRED", .event.number = 0xd },
    { .event.name = "BR_RETURN_RETIRED", .event.number = 0xe },
    { .event.name = "CYCLES", .event.number = CYCLE_COUNTER_EVENT },
    /* Always place the following one at the end of the table */
    { .event.name = "NULL", .event.number = PMU_SELECTED_END }
};
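To check that a selection table fits into the available event counters, you can count its entries up to the end marker and leave out the cycle counter entry, which uses its own dedicated counter. The following is a minimal host-side sketch under stated assumptions. The demo_ types, the DEMO_ constants and their values are hypothetical stand-ins, because the real definitions and values come from armv8_pmuv3_events.h and may differ.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real values are defined by the PMU library. */
#define DEMO_CYCLE_COUNTER_EVENT 0xFFFFu
#define DEMO_SELECTED_END        0xFFFEu

struct demo_event {
    const char *name;
    unsigned int number;
};

struct demo_event_selected {
    struct demo_event event;
};

/* Mirror of the evt_select_cache example: five events plus the cycle counter. */
static const struct demo_event_selected demo_cache[] = {
    { { "L1D_TLB_REFILL",   0x5 } },
    { { "L1D_CACHE_REFILL", 0x3 } },
    { { "L1D_CACHE",        0x4 } },
    { { "L2D_CACHE_REFILL", 0x17 } },
    { { "L2D_CACHE",        0x16 } },
    { { "CYCLES", DEMO_CYCLE_COUNTER_EVENT } },
    { { "NULL",   DEMO_SELECTED_END } }
};

/* Count the entries that need an event counter (cycle counter excluded). */
size_t demo_count_event_counters(const struct demo_event_selected *tbl)
{
    size_t n = 0;

    for (; tbl->event.number != DEMO_SELECTED_END; ++tbl) {
        if (tbl->event.number != DEMO_CYCLE_COUNTER_EVENT)
            ++n;
    }
    return n;
}

/* Return 1 if the table fits into 'available' event counters, else 0. */
int demo_fits(const struct demo_event_selected *tbl, size_t available)
{
    return demo_count_event_counters(tbl) <= available;
}
```

With the six event counters of Cortex-A53, demo_fits(demo_cache, 6) returns 1 because the table needs only five event counters.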
Now, you can profile the firmware with the PMU. Use the following procedure for the specific code in the firmware that you want to profile.
In the PMU library, armv8_pmuv3_fn.c provides a reference implementation.
Simply call the two functions named pmuv3_startProfiling and pmuv3_stopProfiling before and after the code that you want to profile. For each profiling run, pass one pmu_event_selected structure array as the parameter.
Depending on the number of PMU events you want to profile with, you need to profile once or multiple times. Here are two profiling examples.
To profile the function enable_mmu_el3 in TF-A with a focus on cache behavior, you only need to select the array named evt_select_cache, so a single profiling run is enough.
#include <armv8_pmuv3_fn.h>

extern struct pmu_event_selected evt_select_cache[];

void __init arm_bl31_plat_arch_setup(void)
{
    //...
    pmuv3_startProfiling(evt_select_cache);
    enable_mmu_el3(0);
    pmuv3_stopProfiling(evt_select_cache);
    //...
}
To characterize the crc32 workload in U-Boot, you can select both arrays, evt_select_cache and evt_select_wlc. In this case, you need to profile the same code multiple times to collect all the profiling data.
Because the do_mem_crc function is called by a command, you can program the firmware once and repeat the profiling by entering the same command multiple times.
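#include <armv8_pmuv3_fn.h>

extern struct pmu_event_selected evt_select_cache[];
extern struct pmu_event_selected evt_select_wlc[];

/* NULL-terminated list of event groups; one group is profiled per command run */
struct pmu_event_selected *pevt[] = { evt_select_cache, evt_select_wlc, NULL };
struct pmu_event_selected **p = pevt;

static int do_mem_crc(struct cmd_tbl *cmdtp, int flag, int argc, char *const argv[])
{
    int ret = 0;
    /* Latch the current group so that start and stop always use the same one */
    struct pmu_event_selected *evt = *p;

    // ...

    if (evt != NULL)
        pmuv3_startProfiling(evt);

    ret = hash_command("crc32", flags, cmdtp, flag, ac, av);

    if (evt != NULL) {
        pmuv3_stopProfiling(evt);
        ++p;
    } else {
        /* All groups are done: this run is not profiled, restart from the first group */
        p = pevt;
    }

    return ret;
}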
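The rotation through the event groups can be tried out on a host machine before touching the firmware. The following minimal sketch assumes that the intent is to profile one group per command run, followed by one unprofiled run that rewinds to the first group. The names demo_groups, demo_cursor, and demo_next_group are hypothetical, and strings stand in for the pmu_event_selected arrays.

```c
#include <stddef.h>

/* Strings stand in for the event group arrays; NULL marks the end. */
static const char *const demo_groups[] = {
    "evt_select_cache",
    "evt_select_wlc",
    NULL
};

/* Index of the group to use on the next call. */
static size_t demo_cursor = 0;

/*
 * Return the group to profile on this call, or NULL when every group has
 * been used. A NULL result models an unprofiled run and rewinds the cursor,
 * so the following call starts again with the first group.
 */
const char *demo_next_group(void)
{
    const char *group = demo_groups[demo_cursor];

    if (group != NULL)
        ++demo_cursor;
    else
        demo_cursor = 0;

    return group;
}
```

Four consecutive calls return the cache group, the workload characterization group, NULL, and then the cache group again, which corresponds to entering the same command four times.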
The function startProfiling does the following things.
MDCR_EL2
MDCR_EL3
The function stopProfiling does the following things.
Now you can rebuild the firmware and run it again to collect the statistical profiling result.
For the examples mentioned in the previous part, we perform the profiling on the Juno r2 platform. The output of the profiling is as follows.
************************************************************
[armv8_pmuv3] Profiling Result
************************************************************
PMU EVENT, PREVAL, POSTVAL, DELTA
L1D_TLB_REFILL,0,2,2
L1D_CACHE_REFILL,0,8,8
L1D_CACHE,0,6,6
L2D_CACHE_REFILL,0,10,10
L2D_CACHE,0,10,10
CYCLES,1386,4293,2907
***********************************************************
As you can see from the output, it records the pre-profiling and post-profiling values of each selected PMU event and of the cycle counter. It also calculates the differences and records them in the column named DELTA.
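The DELTA column is the post-profiling value minus the pre-profiling value. The following minimal sketch shows one way to compute it and is not the library's actual code. It assumes 32-bit event counters, as in Armv8.0-A PMUv3, and a 64-bit cycle counter. Unsigned subtraction then still gives the right difference if a counter wraps around once between the two reads. The function names are hypothetical.

```c
#include <stdint.h>

/* Delta of a 32-bit event counter; correct across a single wraparound. */
uint32_t demo_event_delta(uint32_t preval, uint32_t postval)
{
    return postval - preval;
}

/* Delta of the 64-bit cycle counter. */
uint64_t demo_cycle_delta(uint64_t preval, uint64_t postval)
{
    return postval - preval;
}
```

For the CYCLES row above, demo_cycle_delta(1386, 4293) gives 2907.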
For the Cortex-A53 CPU, the PMU event L1D_TLB_REFILL is included in the count for L1D_CACHE_REFILL. Thus, you might observe that the count for L1D_CACHE_REFILL is greater than that for L1D_CACHE, as in the output above.
CRC32 for 80000100 ... 80001123 ==> c3ac0b65
************************************************************
[armv8_pmuv3] Profiling Result
************************************************************
PMU EVENT, PREVAL, POSTVAL, DELTA
L1D_TLB_REFILL,0,0,0
L1D_CACHE_REFILL,0,19,19
L1D_CACHE,11,80424,80413
L2D_CACHE_REFILL,0,80,80
L2D_CACHE,0,212,212
CYCLES,147,884986,884839
************************************************************

CRC32 for 80000100 ... 80001123 ==> c3ac0b65
************************************************************
[armv8_pmuv3] Profiling Result
************************************************************
PMU EVENT, PREVAL, POSTVAL, DELTA
INST_RETIRED,13,208664,208651
LD_RETIRED,4,51346,51342
ST_RETIRED,5,23691,23686
MEM_ACCESS,15,80961,80946
BR_IMMED_RETIRED,10,43922,43912
BR_RETURN_RETIRED,0,0,0
CYCLES,170,890037,889867
************************************************************
As you can see above, the cycle counts of the two profiling sessions are almost the same (884839 and 889867).
From the first profiling result, you can refer to the Arm Architecture Reference Manual to derive some meaningful metrics as follows.
L1D_CACHE_REFILL / L1D_CACHE
L2D_CACHE_REFILL / L2D_CACHE
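Both ratios divide a refill count by the corresponding access count. The following minimal sketch computes such a ratio from two DELTA values and returns 0 when no accesses were counted. The function name demo_ratio is hypothetical.

```c
#include <stdint.h>

/*
 * Ratio of two event counts, for example refills per cache access.
 * Returns 0.0 when the denominator is zero, so that a code region
 * without any counted accesses does not cause a division by zero.
 */
double demo_ratio(uint64_t numerator, uint64_t denominator)
{
    if (denominator == 0)
        return 0.0;
    return (double)numerator / (double)denominator;
}
```

With the DELTA values of the first crc32 session, demo_ratio(19, 80413) is roughly 0.0002 for the L1 data cache and demo_ratio(80, 212) is roughly 0.38 for the L2 cache.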
From the second profiling result, you can calculate the Instructions Per Cycle (IPC) as follows.
IPC = INST_RETIRED / CYCLES
The IPC of this workload is about 0.23 (208651 / 889867). This low value is caused by the many MEM_ACCESS operations (80946 for 208651 retired instructions).
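The IPC calculation can be sketched in a few lines. The sketch also computes the number of memory accesses per retired instruction. This second metric is our own illustration and not part of the original analysis, but it puts a number on how memory-heavy the workload is. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* DELTA values taken from one workload characterization session. */
struct demo_wlc_counts {
    uint64_t inst_retired;
    uint64_t mem_access;
    uint64_t cycles;
};

/* Instructions per cycle; 0.0 if no cycles were counted. */
double demo_ipc(const struct demo_wlc_counts *c)
{
    if (c->cycles == 0)
        return 0.0;
    return (double)c->inst_retired / (double)c->cycles;
}

/* Memory accesses per retired instruction; 0.0 if none retired. */
double demo_mem_per_inst(const struct demo_wlc_counts *c)
{
    if (c->inst_retired == 0)
        return 0.0;
    return (double)c->mem_access / (double)c->inst_retired;
}
```

With INST_RETIRED = 208651, MEM_ACCESS = 80946, and CYCLES = 889867 from the second crc32 session, demo_ipc gives about 0.23 and demo_mem_per_inst gives about 0.39.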