System Performance Analysis and the Arm Performance Monitor Unit (PMU)

Jason Andrews
February 19, 2015
5 minute read time.

Carbon cycle-accurate models of Arm CPUs enable system performance analysis by providing access to the Performance Monitor Unit (PMU). Carbon models instrument the PMU registers and record PMU events into the Carbon System Analyzer database without any software programming. Contrast this non-intrusive PMU event collection with other common methods of software execution:

  • Arm Fast Models focus on speed and have limited ability to access PMU events
  • Simulating or emulating CPU RTL does not provide automatic instrumentation and event collection
  • Silicon requires software programming to enable and collect events from the PMU

The Arm Cortex-A53 is a good example to demonstrate the features of SoC Designer. The A53 PMU implements the PMUv3 architecture and gathers statistics on the processor and memory system. It provides six counters which can count any of the available events.

The Carbon A53 model instruments the PMU events to gather statistics without any software programming. This means all of the PMU events (not just six) can be captured from a single simulation.

The A53 PMU Events can be found in the Technical Reference Manual (TRM) in Chapter 12. Below is a partial list of PMU events just to provide some flavor of the types of events that are collected. The TRM details all of the events the PMU contains.

Partial list of Cortex-A53 PMU events

Profiling can be enabled by right-clicking on a CPU model and selecting the Profiling menu. Any or all of the PMU events can be enabled. Any simulation done with profiling enabled will write the selected PMU events into the Carbon System Analyzer database.

Arm Performance Monitor Unit

Bare Metal Software

The automatic instrumentation of PMU events is ideal for bare metal software since it requires no programming and automatically covers the entire timeline of the software test or benchmark. Collection can also be controlled at any point by stopping the simulator and enabling or disabling profiling.

All of the profiling data from the PMU events, as well as the bus transactions and the software profiling information, ends up in the Carbon Analyzer database. The picture below shows a section of the Carbon Analyzer GUI loaded with PMU events, bus activity, and software activity.

System Analyzer GUI loaded with PMU events

The Carbon Analyzer provides many out-of-the-box calculations of interesting metrics as well as a complete API which allows plugins to be written to compute additional system or application specific metrics.

Linux Performance Analysis

Things get more interesting in a Linux environment. A common use case is to run Linux benchmarks to profile how the software executes on a given hardware design. Linux can be booted quickly and then a benchmark can be run using a cycle accurate virtual prototype by making use of Swap & Play.

Profiling enables events to be collected in the analyzer database, but the user doesn’t have the ability to understand which events apply to each Linux process or to differentiate events from the Linux kernel vs. those from user space programs. It’s also more difficult to determine when to start and stop event collection for a Linux application. Control can be improved by using techniques from Three Tips for Using Linux Swap & Play with Arm Cortex-A Systems.

Using PMU Counters from User Space

Since the PMU can be used for Linux benchmarks, the first thing that comes to mind is to write some initialization code to set up the PMU, enable counters, run the test, and collect the PMU events at the end. This strategy works pretty well for those willing to get their hands dirty writing system control coprocessor instructions.

Enable User Space Access

The first step to being able to write a Linux application which accesses the PMU is to enable user mode access. This needs to be done from the Linux kernel. It's very easy to do, but requires a kernel module to be loaded or compiled into the kernel. All that is needed is to set bit 0 in the PMUSERENR register to 1. It takes only one instruction, but it must be executed from within the kernel. The main section of code is shown below.

Kernel module code that sets bit 0 of PMUSERENR
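For readers without access to the attachment, here is a minimal sketch of what such a module might look like. It is not the code attached to this post. It assumes a 32-bit (AArch32) kernel, and the names set_user_access, pmuserenr_enable, and pmuserenr_disable are made up for illustration. Calling on_each_cpu() so that every core gets the bit set is also an assumption of this sketch; the register is per-core, so a process scheduled onto a core where the bit is still 0 would take an undefined instruction exception.

```c
/* Sketch only: NOT the module attached to the post. */
#ifdef __KERNEL__
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/smp.h>
#include <linux/types.h>
#else
#include <stdint.h>
#endif

#define PMUSERENR_EN 0x1u /* bit 0 of PMUSERENR: user mode access enable */

/* Pure helpers (hypothetical names) so the bit handling can be checked on any host. */
static inline uint32_t pmuserenr_enable(uint32_t v)  { return v | PMUSERENR_EN; }
static inline uint32_t pmuserenr_disable(uint32_t v) { return v & ~PMUSERENR_EN; }

#if defined(__KERNEL__) && defined(__arm__)
/* Runs on one core: read-modify-write PMUSERENR (CP15 c9, c14, 0). */
static void set_user_access(void *enable)
{
	uint32_t v;

	asm volatile("mrc p15, 0, %0, c9, c14, 0" : "=r"(v));
	v = enable ? pmuserenr_enable(v) : pmuserenr_disable(v);
	asm volatile("mcr p15, 0, %0, c9, c14, 0" : : "r"(v));
}

static int __init enable_pmu_init(void)
{
	/* Assumption of this sketch: apply the setting on every online core. */
	on_each_cpu(set_user_access, (void *)1UL, 1);
	pr_info("enable_pmu: enabled PMU access from user space\n");
	return 0;
}

static void __exit enable_pmu_exit(void)
{
	/* Return the user mode enable bit to 0 on unload. */
	on_each_cpu(set_user_access, NULL, 1);
}

module_init(enable_pmu_init);
module_exit(enable_pmu_exit);
MODULE_LICENSE("GPL");
#endif
```

The kernel-only part is compiled out everywhere except an Arm kernel build, so the two helper functions can be exercised in an ordinary host program.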

Building a kernel module requires a source tree for the running kernel. If you are using a Carbon Performance Analysis Kit (CPAK), this source tree is available in the CPAK or can easily be downloaded by using the CPAK scripts.

A source code example as well as a Makefile to build it is attached to this blog.

The module can either be loaded dynamically into a running kernel or added to the static kernel build. When working with CPAKs it’s easier for me to just add it to the kernel. When I’m working with a board where I can natively compile it on the machine it’s easier to dynamically load it using:

$ sudo insmod enable_pmu.ko

Remember to use the lsmod command to see which modules are loaded and the rmmod command to unload it when finished.

The exit function of the module returns the user mode enable bit back to 0 to restore the original value.

PMU Application

Once user mode access to the PMU has been granted, benchmark programs can take advantage of the PMU to count events such as cycles and instructions. One possible flow from a user space program is:

  • Reset count values
  • Select which of the six PMU counter registers to use
  • Set the event to be counted, such as instructions executed
  • Enable the counters to start counting
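The four steps above can be sketched with AArch32 CP15 accesses as follows. This is an illustration under stated assumptions, not the attached test application: the function name init_pmu_sketch and the macro names are made up here, the code assumes user mode access has already been enabled, and on non-Arm hosts the register writes compile to nothing so the file still builds.

```c
#include <stdint.h>

#define PMCR_E          (1u << 0)   /* PMCR: enable all counters */
#define PMCR_P          (1u << 1)   /* PMCR: reset event counters */
#define PMCR_C          (1u << 2)   /* PMCR: reset cycle counter */
#define PMCNTEN_CYCLE   (1u << 31)  /* PMCNTENSET: cycle counter bit */
#define EVT_INST_RETIRED 0x08u      /* instruction architecturally executed */

/* Enable mask for the cycle counter plus one event counter (hypothetical helper). */
static inline uint32_t pmu_enable_mask(uint32_t idx)
{
    return PMCNTEN_CYCLE | (1u << idx);
}

/* Program event counter idx (0..5 on a six-counter PMU) to count `event`. */
static inline void init_pmu_sketch(uint32_t idx, uint32_t event)
{
#if defined(__arm__)
    /* 1. Reset count values: PMCR is c9, c12, 0. */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0"
                     : : "r"(PMCR_E | PMCR_P | PMCR_C));
    /* 2. Select the counter: PMSELR is c9, c12, 5. */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" : : "r"(idx));
    __asm__ volatile("isb");
    /* 3. Set the event for the selected counter: PMXEVTYPER is c9, c13, 1. */
    __asm__ volatile("mcr p15, 0, %0, c9, c13, 1" : : "r"(event));
    /* 4. Enable the counters: PMCNTENSET is c9, c12, 1. */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1"
                     : : "r"(pmu_enable_mask(idx)));
#else
    (void)idx;   /* no PMU access on this host */
    (void)event;
#endif
}
```

A call such as init_pmu_sketch(0, EVT_INST_RETIRED) would set up event count register 0 to count instructions and start both it and the cycle counter.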

Once this is done, the benchmark application can read the current values, run the code of interest, and then read the values again to determine how many events occurred during the code of interest.

Arm Performance Monitor Unit

The cycle counter is distinct from the other six event count registers. It is read from a separate CP15 system control register. This example monitors event 0x8 (instruction architecturally executed) using event count register 0. Please take a look at the source code for the simple test application used to count cycles and instructions of a simple printf() call.
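A rough sketch of the read-before, read-after pattern is shown below. It is not the attached test application: the names read_cycles, read_event_counter, counter_delta, and measure_printf are illustrative, the code assumes an AArch32 build with the counters already programmed and user mode access enabled, and on other hosts the reads simply return 0 so the arithmetic can still be tried out.

```c
#include <stdint.h>
#include <stdio.h>

/* Read the cycle counter: PMCCNTR is CP15 c9, c13, 0. */
static inline uint32_t read_cycles(void)
{
    uint32_t v = 0;
#if defined(__arm__)
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));
#endif
    return v;
}

/* Read event counter idx: select it in PMSELR (c9, c12, 5),
 * then read PMXEVCNTR (c9, c13, 2). */
static inline uint32_t read_event_counter(uint32_t idx)
{
    uint32_t v = 0;
#if defined(__arm__)
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" : : "r"(idx));
    __asm__ volatile("isb");
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 2" : "=r"(v));
#else
    (void)idx; /* no PMU access on this host */
#endif
    return v;
}

/* Events between two reads; unsigned subtraction tolerates one 32-bit wrap. */
static inline uint32_t counter_delta(uint32_t before, uint32_t after)
{
    return after - before;
}

/* Count cycles and instructions (event counter 0) around a printf() call. */
static inline void measure_printf(void)
{
    uint32_t c0 = read_cycles();
    uint32_t i0 = read_event_counter(0);

    printf("hello\n"); /* code of interest */

    uint32_t c1 = read_cycles();
    uint32_t i1 = read_event_counter(0);

    printf("cycles: %u instructions: %u\n",
           (unsigned)counter_delta(c0, c1),
           (unsigned)counter_delta(i0, i1));
}
```

The reads themselves execute a few instructions, so the reported counts include a small, roughly constant overhead on top of the code of interest.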

Summary

This article provided an introduction to using the Carbon Analyzer to automatically gather information on Arm PMU events for bare metal and Linux software workloads. Carbon models provide full access to all PMU events during a single simulation with no software changes and no limitations on the number of events captured.

It also explained how additional control can be achieved by writing software to access the PMU directly from a Linux test program or benchmark application. This requires no changes to the kernel source, but user mode access to the PMU must first be enabled (for example with a loadable kernel module), and the approach is limited to the number of counters available in the PMU: six for CPUs such as the Cortex-A15 and A57.

Next time I will look at an alternative approach that uses the Arm Linux PMU driver and a system call to collect PMU events.

pmu.tar.gz
  • Jiss
    Jiss over 3 years ago

    Hi Jason,

    The source code link is broken now.  

    Could you please update the application source link? 

    Jiss

  • br-dev
    br-dev over 6 years ago

    Hi Jason,

    I tried this on an RPI3 B model + (the latest one), Raspbian 9, Linux Headers 4.14.79-v7+. The PMU access is set from user space.

    [ 1398.184005] enable_pmu: loading out-of-tree module taints kernel.
    [ 1398.184918] enable_pmu: Enabled PMU access from user space

    1) It ran fine, except that randomly (about 50% of the time) I get a core dump on the first asm volatile MCR p15 of init_pmu (the c9,c12,)x00007, or when reading the counter if I comment out the call to init_pmu and run only read_ccles. When I retry, it can display the result; it's random. I compiled with different compilation options, including mcpu=cortex-a53 -march=native -mtune=cortex-a53. I used gdb and it reports signal SIGILL, illegal instruction. If I retry, it has about a 50% chance of running successfully.

    Do you have any idea? I added a delay before calling the first asm cat function and saw the same issue.

    The Raspbian is the last release and is up to date.

    Thanks

    gcc (Raspbian 6.3.0-18+rpi1+deb9u1) 6.3.0 20170516
    Copyright (C) 2016 Free Software Foundation, Inc.

    lscpu returns:
    Architecture:          armv7l
    Byte Order:            Little Endian
    CPU(s):                4
    On-line CPU(s) list:   0-3
    Thread(s) per core:    1
    Core(s) per socket:    4
    Socket(s):             1
    Model:                 4
    Model name:            ARMv7 Processor rev 4 (v7l)
    CPU max MHz:           1400.0000
    CPU min MHz:           600.0000
    BogoMIPS:              38.40
    Flags:                 half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32

  • hylz
    hylz over 6 years ago in reply to Georgia James

    Hi Jason and Georgia, thank you a lot for your notice and quick response, I am sure this will help me out quite a bit!

    Best regards,

    Jakob

  • Georgia James
    Georgia James over 6 years ago in reply to hylz

    Hi hylz, Jason Andrews has now updated the blog with the file for you.

    Many thanks,
    Georgia

  • Jason Andrews
    Jason Andrews over 6 years ago in reply to hylz

    Hi Jakob,

    I have the missing file. I'll find a way to get it to you.

    Thanks,

    Jason
