Profile firmware with Performance Monitor Unit (PMU) in Armv8-A CPU

Jiaming Guo
November 8, 2023

The Performance Monitor Unit (PMU) in an Armv8-A CPU provides hardware-level performance monitoring and profiling capabilities. The PMU counts hardware events using a cycle counter and a set of event counters. You can configure:

  • Each event counter to count specified hardware events.
  • Each counter to collect hardware events from workloads at various CPU Exception levels and states.

Tools like perf use the PMU to profile Linux applications running on an Armv8-A CPU. However, there is no common way to use the PMU to profile firmware, because firmware runs in a bare-metal environment.

To profile firmware with the PMU, a straightforward approach is to add PMU support in the form of a library. In this blog post, we provide a PMU library as a reference implementation. This library has been verified on a platform based on an Armv8.0-A CPU. We also describe how to add PMU support to the firmware and how to profile the firmware with the PMU. Trusted Firmware-A (TF-A) and U-Boot are used as example firmware.

Download the PMU Library 

The table below lists components of the PMU library.

File name                 Description
arch.h                    System register bit field definitions
arch_helpers.h            Inline functions for system register access (MRS/MSR)
armv8_pmuv3_events.h      Structures for PMU event definitions
armv8_pmuv3_fn.c          Implementation of the Armv8-A CPU PMU profiling functions
armv8_pmuv3_fn.h          Profiling interface definitions
jevents.py                Python script that gets the PMU events supported by a specified CPU and generates armv8_pmuv3_events.c

Stage 1: Add PMU support to the firmware

Use the following command to unzip all the files of the PMU library to the path lib/pmu in the firmware. 

unzip armv8_pmuv3_library.zip -d ${FIRMWARE_PATH}/lib/pmu

Depending on the firmware you are working with, make the following changes to the Makefile.

TF-A

Modify the existing Makefile located in the root path of TF-A as follows.

BL_COMMON_SOURCES += lib/pmu/armv8_pmuv3_fn.c lib/pmu/armv8_pmuv3_events.c
INCLUDES += -Ilib/pmu

U-Boot

Modify the existing Makefile located in the root path of U-Boot as follows.

UBOOTINCLUDE += -Ilib/pmu

Create a Makefile in the same directory as the PMU library, with the following content.

obj-y += armv8_pmuv3_fn.o armv8_pmuv3_events.o

Now, you can rebuild the firmware. If you build it successfully, go to stage 2.

Stage 2: Select PMU events for profiling

First, get the PMU events supported by the Armv8-A CPU on which the firmware runs. The CPU Technical Reference Manual (TRM) lists these events.

Alternatively, use the Python script named jevents.py in the PMU library, as described in the following steps.

Get the supported PMU events for the specified CPU

The GitHub repository Machine-readable data maintains PMU events supported by each Arm CPU. It uses JSON format to store information such as event mnemonic and event number of the PMU events. This information is consistent with that in the CPU TRM.

The Python script named jevents.py in the PMU library clones the repository, converts the JSON format to C format, and generates the file named armv8_pmuv3_events.c. You can use it as follows.

First, use the following command to list the names of all the CPUs that the script can convert.

python3 jevents.py --list

Then, use the following command to specify the name of the CPU to convert. This command generates the file named armv8_pmuv3_events.c. The file contains two types of arrays: one for the PMU events supported by the specified CPU, and one for the PMU events selected for profiling.

python3 jevents.py --cpu <cpu name>

An example of the armv8_pmuv3_events.c file generated by jevents.py is as follows. Here, the CPU name is cortex-a53. The array named pmu_events_map contains all the PMU events that Cortex-A53 supports, matching the Cortex-A53 TRM. Select PMU events from this array and put them into the array named evt_select.

#include "armv8_pmuv3_events.h"

/* Selected PMU Events Table */
struct pmu_event_selected evt_select[] = {
    { .event.name = "INST_RETIRED", .event.number = 0x08 },
    { .event.name = "CYCLES", .event.number = CYCLE_COUNTER_EVENT }, 
    /* Always place the following one at the end of the table */
    { .event.name = "NULL", .event.number = PMU_SELECTED_END } 
};

/* PMU Events Mapping Table */
const struct pmu_event pmu_events_map[] = {
    /* Supporting PMU events for event counter */ 
    { .name = "SW_INCR", .number = 0x0 },
    { .name = "L1I_CACHE_REFILL", .number = 0x1 },
    { .name = "L1I_TLB_REFILL", .number = 0x2 },
    { .name = "L1D_CACHE_REFILL", .number = 0x3 },
    { .name = "L1D_CACHE", .number = 0x4 },
    { .name = "L1D_TLB_REFILL", .number = 0x5 },
    { .name = "LD_RETIRED", .number = 0x6 },
    { .name = "ST_RETIRED", .number = 0x7 },
    { .name = "INST_RETIRED", .number = 0x8 },
    { .name = "EXC_TAKEN", .number = 0x9 },
    { .name = "EXC_RETURN", .number = 0xa },
    { .name = "CID_WRITE_RETIRED", .number = 0xb },
    { .name = "PC_WRITE_RETIRED", .number = 0xc },
    { .name = "BR_IMMED_RETIRED", .number = 0xd },
    { .name = "BR_RETURN_RETIRED", .number = 0xe },
    { .name = "UNALIGNED_LDST_RETIRED", .number = 0xf },
    { .name = "BR_MIS_PRED", .number = 0x10 },
    { .name = "CPU_CYCLES", .number = 0x11 },
    { .name = "BR_PRED", .number = 0x12 },
    { .name = "MEM_ACCESS", .number = 0x13 },
    { .name = "L1I_CACHE", .number = 0x14 },
    { .name = "L1D_CACHE_WB", .number = 0x15 },
    { .name = "L2D_CACHE", .number = 0x16 },
    { .name = "L2D_CACHE_REFILL", .number = 0x17 },
    { .name = "L2D_CACHE_WB", .number = 0x18 },
    { .name = "BUS_ACCESS", .number = 0x19 },
    { .name = "MEMORY_ERROR", .number = 0x1a },
    { .name = "BUS_CYCLES", .number = 0x1d },
    { .name = "CHAIN", .number = 0x1e },
    { .name = "BUS_ACCESS_RD", .number = 0x60 },
    { .name = "BUS_ACCESS_WR", .number = 0x61 },
    { .name = "BR_INDIRECT_SPEC", .number = 0x7a },
    { .name = "EXC_IRQ", .number = 0x86 },
    { .name = "EXC_FIQ", .number = 0x87 },
    { .name = "0xc0", .number = 0xc0 },
    { .name = "0xc1", .number = 0xc1 },
    { .name = "0xc2", .number = 0xc2 },
    { .name = "0xc3", .number = 0xc3 },
    { .name = "0xc4", .number = 0xc4 },
    { .name = "0xc5", .number = 0xc5 },
    { .name = "0xc6", .number = 0xc6 },
    { .name = "0xc7", .number = 0xc7 },
    { .name = "0xc8", .number = 0xc8 },
    { .name = "0xc9", .number = 0xc9 },
    { .name = "0xca", .number = 0xca },
    { .name = "0xcb", .number = 0xcb },
    { .name = "0xcc", .number = 0xcc },
    { .name = "0xd0", .number = 0xd0 },
    { .name = "0xd1", .number = 0xd1 },
    { .name = "0xd2", .number = 0xd2 },
    { .name = "0xe0", .number = 0xe0 },
    { .name = "0xe1", .number = 0xe1 },
    { .name = "0xe2", .number = 0xe2 },
    { .name = "0xe3", .number = 0xe3 },
    { .name = "0xe4", .number = 0xe4 },
    { .name = "0xe5", .number = 0xe5 },
    { .name = "0xe6", .number = 0xe6 },
    { .name = "0xe7", .number = 0xe7 },
    { .name = "0xe8", .number = 0xe8 },
    /* Supporting event for cycle counter */ 
    { .name = "CYCLES", .number = CYCLE_COUNTER_EVENT } 
};

Select PMU events for profiling

For the initial profiling, you normally select a wide range of PMU events. This gives you a comprehensive understanding of the code that you are profiling.

On Armv8-A CPUs, the number of event counters per CPU is limited. See the CPU TRM for the exact number. If you need more PMU events than there are available counters, you can use the following method.

  1. Divide the PMU events that you want to measure into groups. The number of events in each group must not exceed the number of available PMU counters.
  2. Create multiple pmu_event_selected structure arrays and put each group into one array.
  3. Profile the same code multiple times, select one array at a time, and repeat the process until all events are measured.
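The grouping in steps 1 and 2 can be sketched as follows. This is a minimal illustration and not part of the PMU library: struct event, group_count, and group_at are hypothetical names, and the sketch splits a flat event list purely by position. In practice you would also keep related events in the same group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an event entry. The library's own types are
 * defined in armv8_pmuv3_events.h and may differ. */
struct event {
    const char *name;
    unsigned int number;
};

/* Number of profiling runs needed to cover n_events when each run can
 * use at most n_counters event counters (ceiling division). */
size_t group_count(size_t n_events, size_t n_counters)
{
    assert(n_counters > 0);
    return (n_events + n_counters - 1) / n_counters;
}

/* Return a pointer to the first event of group g and store the length
 * of that group in *len. The last group may be shorter than the rest. */
const struct event *group_at(const struct event *events, size_t n_events,
                             size_t n_counters, size_t g, size_t *len)
{
    size_t start = g * n_counters;

    assert(n_counters > 0 && start < n_events);
    *len = (n_events - start < n_counters) ? n_events - start : n_counters;
    return &events[start];
}
```

For example, 11 events on a PMU with six event counters need two profiling runs: one with six events and one with five.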

For example, we create two pmu_event_selected structure arrays named evt_select_cache and evt_select_wlc in the armv8_pmuv3_events.c file. Cortex-A53 provides one cycle counter and six event counters, and the number of PMU events in each array does not exceed this limit. Also, the PMU events within each array are related, so that the data collected in one profiling run is comparable.

We use these two arrays for the following profiling examples.

/* PMU events for data access */
struct pmu_event_selected evt_select_cache[] = {
    { .event.name = "L1D_TLB_REFILL", .event.number = 0x5 },
    { .event.name = "L1D_CACHE_REFILL", .event.number = 0x3 },
    { .event.name = "L1D_CACHE", .event.number = 0x4 },
    { .event.name = "L2D_CACHE_REFILL", .event.number = 0x17 },
    { .event.name = "L2D_CACHE", .event.number = 0x16 },
    { .event.name = "CYCLES", .event.number = CYCLE_COUNTER_EVENT }, 
    /* Always place the following one at the end of the table */
    { .event.name = "NULL", .event.number = PMU_SELECTED_END } 
};

/* PMU events for workload characterization */
struct pmu_event_selected evt_select_wlc[] = {
    { .event.name = "INST_RETIRED", .event.number = 0x8 },
    { .event.name = "LD_RETIRED", .event.number = 0x6 },
    { .event.name = "ST_RETIRED", .event.number = 0x7 },
    { .event.name = "MEM_ACCESS", .event.number = 0x13 },
    { .event.name = "BR_IMMED_RETIRED", .event.number = 0xd },
    { .event.name = "BR_RETURN_RETIRED", .event.number = 0xe },
    { .event.name = "CYCLES", .event.number = CYCLE_COUNTER_EVENT }, 
    /* Always place the following one at the end of the table */
    { .event.name = "NULL", .event.number = PMU_SELECTED_END } 
};
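Because the selected arrays are edited by hand, an event name can easily end up paired with the wrong event number. The following sketch shows one way to cross-check a selection against the mapping table. It is an illustration under stated assumptions: struct event and first_mismatch are hypothetical names, not part of the PMU library, and the arrays are passed with explicit lengths instead of the library's end-of-table sentinel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a name/number pair from the generated tables. */
struct event {
    const char *name;
    unsigned int number;
};

/* Return the index of the first selected entry whose name/number pair
 * does not appear in the mapping table, or -1 if every entry matches. */
int first_mismatch(const struct event *selected, size_t n_selected,
                   const struct event *map, size_t n_map)
{
    assert(selected != NULL && map != NULL);

    for (size_t i = 0; i < n_selected; i++) {
        int found = 0;

        for (size_t j = 0; j < n_map; j++) {
            if (strcmp(selected[i].name, map[j].name) == 0 &&
                selected[i].number == map[j].number) {
                found = 1;
                break;
            }
        }
        if (!found)
            return (int)i;
    }
    return -1;
}
```

A check like this could run once on a development host before the firmware is built, so that it adds no code to the firmware image.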

Stage 3: Profile the firmware with the PMU

Now, you can profile the firmware with the PMU. Use the following procedure for the specific code in the firmware that you want to profile.

  1. Place a start profiling point before the code. This point configures and enables each PMU counter, then reads its current value as the pre-profiling value.
  2. Place a stop profiling point after the code. This point disables each PMU counter and reads its value as the post-profiling value.
  3. Rebuild the firmware and run the code again. Calculate the difference between the pre-profiling and post-profiling values of each PMU counter to obtain the profiling result.

In the PMU library, armv8_pmuv3_fn.c provides a reference implementation.

Place profiling points in the firmware

Add calls to the functions pmuv3_startProfiling and pmuv3_stopProfiling immediately before and after the code that you want to profile. For each profiling run, pass one pmu_event_selected structure array as the parameter.

Depending on the number of PMU events you want to profile with, you need to profile once or multiple times. Here are two profiling examples.

Example one

To profile the function enable_mmu_el3 in TF-A with a focus on cache behavior, you only need the array named evt_select_cache, so a single profiling run is enough.

#include <armv8_pmuv3_fn.h>

extern struct pmu_event_selected evt_select_cache[];

void __init arm_bl31_plat_arch_setup(void)
{
	//...

	pmuv3_startProfiling(evt_select_cache);

	enable_mmu_el3(0);

	pmuv3_stopProfiling(evt_select_cache);

	//...
}

Example two

To characterize the crc32 workload in U-Boot, you can select both arrays, evt_select_cache and evt_select_wlc. Because each profiling run uses only one array, you need to profile the same code multiple times to collect all the profiling data.

As the do_mem_crc function is called by a command, you can program the firmware once and repeat the profiling by entering the same command multiple times.

#include <armv8_pmuv3_fn.h>

extern struct pmu_event_selected evt_select_cache[];
extern struct pmu_event_selected evt_select_wlc[];

/* NULL-terminated list of event groups, and the group to use next */
struct pmu_event_selected *pevt[] = { evt_select_cache, evt_select_wlc, NULL };
struct pmu_event_selected **p = pevt;

static int do_mem_crc(struct cmd_tbl *cmdtp, int flag, int argc,
		      char *const argv[])
{
	int ret = 0;

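	// ...

	/* Wrap around to the first event group once every group has been used */
	if (*p == NULL)
		p = pevt;

	pmuv3_startProfiling(*p);

	ret = hash_command("crc32", flags, cmdtp, flag, ac, av);

	pmuv3_stopProfiling(*p);
	++p;	/* Use the next event group on the next invocation */

	return ret;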
}
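The group rotation can be looked at in isolation. The sketch below is a host-side illustration that assumes every invocation should be profiled and that the list wraps to the first group after the last one. The names groups, next_group, and select_group are hypothetical, and each group is reduced to a label so that the logic runs without the PMU library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: each event group is represented by a label.
 * The list is NULL-terminated, like the pevt array in the example. */
static const char *groups[] = { "cache", "wlc", NULL };
static const char **next_group = groups;

/* Return the group to use for this run and advance to the next one,
 * wrapping to the first group after the last group has been used. */
const char *select_group(void)
{
    const char *g;

    if (*next_group == NULL)
        next_group = groups;    /* all groups used: start over */

    g = *next_group;
    ++next_group;

    assert(g != NULL);          /* holds as long as the list is not empty */
    return g;
}
```

With two groups, three consecutive calls return the cache group, the wlc group, and then the cache group again, which corresponds to entering the same command three times.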

Understand the implementation of profiling with the PMU

The function pmuv3_startProfiling does the following things.

  1. Performs the necessary initialization for the PMU. It checks that the CPU's PMU has enough event counters for the selected events. Then, it sets MDCR_EL2 or MDCR_EL3 to enable PMU profiling at the current Exception level. This step is needed because firmware typically runs at EL2 or EL3, or in Secure state, where counting is prohibited by default to prevent information leakage.
  2. Configures each PMU counter to profile at the current Exception level and for the selected PMU events.
  3. Enables each PMU counter.
  4. Reads the current value of each PMU counter as the pre-profiling value.
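The counter check in step 1 can be sketched as follows. In the Armv8-A architecture, the N field in bits [15:11] of PMCR_EL0 reports the number of implemented event counters. The sketch is not the library's code: the function names are hypothetical, and the register value is passed in as a parameter so that the logic can run on a host. In firmware, the value would come from an MRS read of PMCR_EL0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PMCR_N_SHIFT 11u    /* PMCR_EL0.N occupies bits [15:11] */
#define PMCR_N_MASK  0x1fu

/* Extract the number of implemented event counters from a PMCR_EL0 value. */
unsigned int pmcr_event_counters(uint64_t pmcr)
{
    return (unsigned int)((pmcr >> PMCR_N_SHIFT) & PMCR_N_MASK);
}

/* Return non-zero if the selection fits the PMU. n_events must exclude
 * the cycle count entry, because that entry uses the dedicated cycle
 * counter and does not occupy an event counter. */
int selection_fits(size_t n_events, uint64_t pmcr)
{
    assert(n_events > 0);
    return n_events <= pmcr_event_counters(pmcr);
}
```

For an illustrative register value of 0x3000, the N field is 6, so a selection of six events fits and a selection of seven does not.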

The function pmuv3_stopProfiling does the following things.

  1. Disables each PMU counter.
  2. Reads the current value of each PMU counter as the post-profiling value.
  3. Dumps the result of this profiling.
  4. Performs the necessary deinitialization for the PMU. This disables PMU counting at EL2 or EL3, or in Secure state, again.

Collect the statistical profiling result

Now you can rebuild the firmware and run it again to collect the statistical profiling result.

For the examples mentioned in the previous part, we perform the profiling on the Juno r2 platform. The output of the profiling is as follows.

Example one

************************************************************
               [armv8_pmuv3] Profiling Result
************************************************************
PMU EVENT, PREVAL, POSTVAL, DELTA
L1D_TLB_REFILL,0,2,2
L1D_CACHE_REFILL,0,8,8
L1D_CACHE,0,6,6
L2D_CACHE_REFILL,0,10,10
L2D_CACHE,0,10,10
CYCLES,1386,4293,2907
***********************************************************

As the output shows, each row records the pre-profiling value (PREVAL) and the post-profiling value (POSTVAL) of a selected PMU event or of the cycle counter. The DELTA column holds the difference between the two values.
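The DELTA calculation can be sketched as follows, assuming the counter widths of the Armv8.0-A PMUv3: 32-bit event counters and a 64-bit cycle counter. The function names are hypothetical, and whether the library handles counter wrap-around in this way is an assumption. With unsigned arithmetic, the subtraction stays correct even if an event counter wraps once between the two reads.

```c
#include <assert.h>
#include <stdint.h>

/* Delta of a 32-bit event counter. Unsigned subtraction is performed
 * modulo 2^32, so a single wrap between the two reads is handled. */
uint32_t event_delta(uint32_t pre, uint32_t post)
{
    return (uint32_t)(post - pre);
}

/* Delta of the 64-bit cycle counter. A 64-bit counter is not expected
 * to wrap within one measurement, so a smaller post value is a bug. */
uint64_t cycle_delta(uint64_t pre, uint64_t post)
{
    assert(post >= pre);
    return post - pre;
}
```

Applied to the CYCLES row above, cycle_delta(1386, 4293) gives 2907, which matches the DELTA column.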

For the Cortex-A53 CPU, L1D_TLB_REFILL events are included in the count for L1D_CACHE_REFILL. As a result, the count for L1D_CACHE_REFILL can be greater than the count for L1D_CACHE.

Example two

CRC32 for 80000100 ... 80001123 ==> c3ac0b65
************************************************************
               [armv8_pmuv3] Profiling Result
************************************************************
PMU EVENT, PREVAL, POSTVAL, DELTA
L1D_TLB_REFILL,0,0,0
L1D_CACHE_REFILL,0,19,19
L1D_CACHE,11,80424,80413
L2D_CACHE_REFILL,0,80,80
L2D_CACHE,0,212,212
CYCLES,147,884986,884839
************************************************************

CRC32 for 80000100 ... 80001123 ==> c3ac0b65
************************************************************
               [armv8_pmuv3] Profiling Result
************************************************************
PMU EVENT, PREVAL, POSTVAL, DELTA
INST_RETIRED,13,208664,208651
LD_RETIRED,4,51346,51342
ST_RETIRED,5,23691,23686
MEM_ACCESS,15,80961,80946
BR_IMMED_RETIRED,10,43922,43912
BR_RETURN_RETIRED,0,0,0
CYCLES,170,890037,889867
************************************************************

As you can see above, the cycle counts of the two profiling runs are almost the same: 884839 and 889867, a difference of less than 1%.

From the first profiling result, you can derive some meaningful metrics, using the definitions in the Arm Architecture Reference Manual, as follows.

Metric                                           Formula                        Value
Attributable Level 1 data cache refill rate      L1D_CACHE_REFILL / L1D_CACHE   <1%
Attributable Level 2 unified cache refill rate   L2D_CACHE_REFILL / L2D_CACHE   37.7%

From the second profiling result, you can calculate the Instructions Per Cycle (IPC) as follows.

IPC = INST_RETIRED / CYCLES

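The IPC of this workload is 0.23 (208651 / 889867). A likely contributor to this low value is the high proportion of memory accesses: MEM_ACCESS (80946) is about 39% of INST_RETIRED (208651).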
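The metrics in this section are simple ratios of DELTA values, so they can be computed with a few lines of code. The helpers below are a sketch with hypothetical names, written in floating point for host-side post-processing of the dumped results. Firmware code would more likely print the raw counts and leave the division to the host.

```c
#include <assert.h>
#include <stdint.h>

/* Instructions per cycle from the INST_RETIRED and CYCLES deltas. */
double ipc(uint64_t inst_retired, uint64_t cycles)
{
    assert(cycles != 0);
    return (double)inst_retired / (double)cycles;
}

/* Refill rate in percent from a refill delta and an access delta,
 * for example L2D_CACHE_REFILL and L2D_CACHE. */
double refill_rate_percent(uint64_t refills, uint64_t accesses)
{
    assert(accesses != 0);
    return 100.0 * (double)refills / (double)accesses;
}
```

With the DELTA values from example two, ipc(208651, 889867) is about 0.23 and refill_rate_percent(80, 212) is about 37.7.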
