Carbon cycle accurate models of Arm CPUs enable system performance analysis by providing access to the Performance Monitor Unit (PMU). Carbon models instrument the PMU registers and record PMU events into the Carbon System Analyzer database without any software programming. Contrast this non-intrusive PMU event collection with other common methods of software execution:
The Arm Cortex-A53 is a good example to demonstrate the features of SoC Designer. The A53 PMU implements the PMUv3 architecture and gathers statistics on the processor and memory system. It provides six counters which can count any of the available events.
The Carbon A53 model instruments the PMU events to gather statistics without any software programming. This means all of the PMU events (not just six) can be captured from a single simulation.
The A53 PMU Events can be found in the Technical Reference Manual (TRM) in Chapter 12. Below is a partial list of PMU events just to provide some flavor of the types of events that are collected. The TRM details all of the events the PMU contains.
Profiling can be enabled by right-clicking on a CPU model and selecting the Profiling menu. Any or all of the PMU events can be enabled. Any simulation done with profiling enabled will write the selected PMU events into the Carbon System Analyzer database.
The automatic instrumentation of PMU events is ideal for bare metal software since it requires no programming and will automatically cover the entire timeline of the software test or benchmark. Full control is available to enable the PMU events at any time by stopping the simulator and enabling or disabling profiling.
All of the profiling data from the PMU events, as well as the bus transactions, and the software profiling information end up in the Carbon Analyzer database. The picture below shows a section of the Carbon Analyzer GUI loaded with PMU events, bus activity, and software activity.
The Carbon Analyzer provides many out-of-the-box calculations of interesting metrics, as well as a complete API which allows plugins to be written to compute additional system or application specific metrics.
Things get more interesting in a Linux environment. A common use case is to run Linux benchmarks to profile how the software executes on a given hardware design. Linux can be booted quickly and then a benchmark can be run using a cycle accurate virtual prototype by making use of Swap & Play.
Profiling enables events to be collected in the analyzer database, but the user doesn’t have the ability to understand which events apply to each Linux process or to differentiate events from the Linux kernel vs. those from user space programs. It’s also more difficult to determine when to start and stop event collection for a Linux application. Control can be improved by using techniques from Three Tips for Using Linux Swap & Play with Arm Cortex-A Systems.
Since the PMU can be used for Linux benchmarks, the first thing that comes to mind is to write some initialization code to set up the PMU, enable counters, run the test, and collect the PMU events at the end. This strategy works pretty well for those willing to get their hands dirty writing system control coprocessor instructions.
The first step to being able to write a Linux application which accesses the PMU is to enable user mode access. This needs to be done from the Linux kernel. It's very easy to do, but requires a kernel module to be loaded or compiled into the kernel. All that is needed is to set bit 0 in the PMUSERENR register to 1. It takes only one instruction, but it must be executed from within the kernel. The main section of code is shown below.
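A minimal sketch of this step follows; it is not the attached source. It assumes an ARMv7-A (AArch32) kernel, where PMUSERENR is reached through CP15 as c9, c14, 0, and all helper names here are hypothetical. The PMU registers exist per core, so on a multi-core system a real module would likely need to perform the write on every core.

```c
#include <stdint.h>

/* Bit 0 (EN) of PMUSERENR grants user mode access to the PMU. */
#define PMUSERENR_EN (1u << 0)

/* Pure helpers so the bit manipulation can be checked on any host. */
static inline uint32_t pmuserenr_enable(uint32_t value)
{
    return value | PMUSERENR_EN;
}

static inline uint32_t pmuserenr_disable(uint32_t value)
{
    return value & ~PMUSERENR_EN;
}

#if defined(__arm__)
/* The accesses below only succeed in a privileged mode, for example
 * when called from a kernel module's init and exit functions. */
static inline uint32_t read_pmuserenr(void)
{
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c9, c14, 0" : "=r"(value));
    return value;
}

static inline void write_pmuserenr(uint32_t value)
{
    __asm__ volatile("mcr p15, 0, %0, c9, c14, 0" : : "r"(value));
}

/* Hypothetical body for the module init function. */
static inline void enable_user_pmu(void)
{
    write_pmuserenr(pmuserenr_enable(read_pmuserenr()));
}

/* Hypothetical body for the module exit function: clear the bit again. */
static inline void disable_user_pmu(void)
{
    write_pmuserenr(pmuserenr_disable(read_pmuserenr()));
}
#endif
```

The read-modify-write keeps any other bits in the register unchanged, and the same pair of helpers covers both loading and unloading the module.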
Building a kernel module requires a source tree for the running kernel. If you are using a Carbon Performance Analysis Kit (CPAK), this source tree is available in the CPAK or can easily be downloaded by using the CPAK scripts.
A source code example, as well as a Makefile to build it, are attached to this blog.
The module can either be loaded dynamically into a running kernel or added to the static kernel build. When working with CPAKs it’s easier for me to just add it to the kernel. When I’m working with a board where I can natively compile it on the machine it’s easier to dynamically load it using:
$ sudo insmod enable_pmu.ko
Remember to use the lsmod command to see which modules are loaded and the rmmod command to unload it when finished.
The exit function of the module returns the user mode enable bit back to 0 to restore the original value.
Once user mode access to the PMU has been granted, benchmark programs can take advantage of the PMU to count events such as cycles and instructions. One possible flow from a user space program is:
Once this is done, the benchmark application can read the current values, run the code of interest, and then read the values again to determine how many events occurred during the code of interest.
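The read, run, read again pattern can be sketched as below. This is an illustration under assumptions rather than the attached test program: the function names are hypothetical, the counter is passed in as a function pointer so the same flow works for the cycle counter or an event counter, and the counters are assumed to be 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback types: one reads a PMU counter, one runs the
 * code of interest. */
typedef uint32_t (*counter_read_fn)(void);
typedef void (*workload_fn)(void);

/* Unsigned subtraction gives the correct difference even if the 32-bit
 * counter wrapped once between the two reads. */
static inline uint32_t counter_delta(uint32_t before, uint32_t after)
{
    return (uint32_t)(after - before);
}

/* Read the counter, run the code of interest, read the counter again,
 * and return how many events occurred in between. */
static inline uint32_t measure_events(counter_read_fn read_counter,
                                      workload_fn workload)
{
    assert(read_counter && workload);

    uint32_t before = read_counter();
    workload();
    uint32_t after = read_counter();

    return counter_delta(before, after);
}
```

A measurement taken this way also includes the cost of the two counter reads themselves, so very short code sections will show some fixed overhead.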
The cycle counter is distinct from the other six event count registers. It is read from a separate CP15 system control register. For this example, event 0x8 (instruction architecturally executed) is monitored using event count register 0. Please take a look at the source code for the simple test application used to count cycles and instructions of a simple printf() call.
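For readers without the attached source, here is a hedged sketch of how such a setup could look on ARMv7-A (AArch32). The function and macro names are hypothetical and the code is not the author's; the CP15 encodings are the PMUv3 ones (PMCR c9,c12,0; PMCNTENSET c9,c12,1; PMSELR c9,c12,5; PMCCNTR c9,c13,0; PMXEVTYPER c9,c13,1; PMXEVCNTR c9,c13,2). It assumes user mode access has already been enabled in PMUSERENR.

```c
#include <assert.h>
#include <stdint.h>

#define PMCR_E              (1u << 0)   /* enable all counters */
#define PMCR_P              (1u << 1)   /* reset the event counters */
#define PMCR_C              (1u << 2)   /* reset the cycle counter */
#define PMCNTENSET_CYCLE    (1u << 31)  /* enable bit for the cycle counter */
#define EVENT_INST_RETIRED  0x08u       /* instruction architecturally executed */
#define NUM_EVENT_COUNTERS  6u          /* six event counters, as in the article */

/* PMCR value that resets and enables the counters (0x7). */
static inline uint32_t pmcr_start_value(void)
{
    return PMCR_E | PMCR_P | PMCR_C;
}

/* Enable mask for the cycle counter plus one event counter. */
static inline uint32_t pmcntenset_mask(uint32_t counter)
{
    assert(counter < NUM_EVENT_COUNTERS);
    return PMCNTENSET_CYCLE | (1u << counter);
}

#if defined(__arm__)
/* Select an event counter, program its event type, and enable it
 * together with the cycle counter. */
static inline void init_pmu_counter(uint32_t counter, uint32_t event)
{
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(pmcr_start_value()));
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" : : "r"(counter)); /* PMSELR */
    __asm__ volatile("isb");
    __asm__ volatile("mcr p15, 0, %0, c9, c13, 1" : : "r"(event));   /* PMXEVTYPER */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(pmcntenset_mask(counter)));
}

/* The cycle counter has its own register (PMCCNTR). */
static inline uint32_t read_cycle_counter(void)
{
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
}

/* Event counters are read through PMXEVCNTR after selecting one in PMSELR. */
static inline uint32_t read_event_counter(uint32_t counter)
{
    uint32_t value;
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 5" : : "r"(counter));
    __asm__ volatile("isb");
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 2" : "=r"(value));
    return value;
}
#endif
```

With this sketch, calling init_pmu_counter(0, EVENT_INST_RETIRED) would set up event count register 0 for event 0x8, after which read_cycle_counter() and read_event_counter(0) can be sampled before and after the printf() call.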
This article provided an introduction to using the Carbon Analyzer to automatically gather information on Arm PMU events for bare metal and Linux software workloads. Carbon models provide full access to all PMU events during a single simulation with no software changes and no limitations on the number of events captured.
It also explained how additional control can be achieved by writing software to access the PMU directly from a Linux test program or benchmark application. This can be done with no kernel changes, but does require user mode access to the PMU to be enabled, and it is limited to the number of counters available in the PMU: six for CPUs such as the Cortex-A15 and A57.
Next time I will look at an alternative approach to use the Arm Linux PMU driver and a system call to collect PMU events.
Hi Jason,
The source code link is broken now.
Could you please update the application source link?
Jiss
I tried this on a Raspberry Pi 3 Model B+ (the latest one), Raspbian 9, Linux headers 4.14.79-v7+. PMU access is enabled from user space:
[ 1398.184005] enable_pmu: loading out-of-tree module taints kernel.
[ 1398.184918] enable_pmu: Enabled PMU access from user space
1) It ran fine, except that randomly (about 50% of the time) I get a core dump on the first asm volatile MCR p15 of init_pmu (the c9,c12,)x00007, or when reading the counter if I comment out the call to init_pmu and run only read_ccles. When I retry, it can display the result; it's random. I compiled with different compilation options, including mcpu=cortex-a53 -march=native -mtune=cortex-a53. I used gdb and it reports signal SIGILL, illegal instruction. If I retry, it has about a 50% chance of running successfully.
Do you have any idea? I added a delay before calling the first asm function. Same issue.
The Raspbian is the last release and is up to date.
Thanks
gcc (Raspbian 6.3.0-18+rpi1+deb9u1) 6.3.0 20170516
Copyright (C) 2016 Free Software Foundation, Inc.
lscpu returns:
Architecture: armv7l
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Model: 4
Model name: ARMv7 Processor rev 4 (v7l)
CPU max MHz: 1400.0000
CPU min MHz: 600.0000
BogoMIPS: 38.40
Flags: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
Hi Jason and Georgia, thank you a lot for your notice and quick response, I am sure this will help me out quite a bit!
Best regards,
Jakob
Hi hylz, Jason Andrews has now updated the blog with the file for you.
Many thanks,
Georgia
Hi Jakob,
I have the missing file. I'll find a way to get it to you.
Thanks,
Jason