Software developers rely on performance profilers to collect detailed performance data during program execution. Performance profilers measure important parameters, like instructions, cycles, cache hits, and branch misses, so developers can characterize CPU workloads and analyze code execution. The results of profiling make it easier to fine-tune the efficiency of prototypes, production systems, and application code, and even help plan for the future, by identifying design requirements for next-generation systems.
The Arm Neoverse N1 CPU is truly revolutionary, delivering industry-leading socket performance at half the power, with server-class thread performance. Neoverse N1 powers leading cloud provider infrastructure such as AWS Graviton2 processors and Oracle OCI Ampere A1 instances. Cloud providers choose Neoverse N1 because it delivers 40% better price-performance than comparable current-generation x86-based instances for a wide variety of workloads.
To deliver on these performance gains and power savings, we need to understand the methodology behind the performance analysis. For software workload analysis, raw hardware events and the metrics derived from them can be correlated to yield actionable insights. To collect them, we use the Performance Monitoring Unit (PMU), a hardware feature that gathers execution data while the application is running. The PMU adds little to no overhead, because the counting is done in hardware, outside the application’s process. Nothing is inserted in the code, and the order of execution remains unchanged. The Neoverse N1 PMU is designed for use with Linux perf, a performance analysis tool that collects metrics from the hardware counters. Linux perf can also annotate code with event samples, making it easy to correlate micro-architectural behavior with software execution. A profiling setup that uses Linux perf for counting and sampling during workload execution takes developers from high-level, big-picture analysis to detailed, event-specific examination that identifies the root causes of performance issues.
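As a rough illustration of the counting step, the sketch below builds a `perf stat` command line and parses its comma-separated output. The helper names `build_perf_stat_command` and `parse_perf_stat_csv` are hypothetical, the generic event names `instructions` and `cycles` are used only as examples, and the sample output contains made-up numbers. It assumes the `-e` (event list) and `-x,` (field separator) options of `perf stat`, which writes its counts to stderr.

```python
import csv
import io


def build_perf_stat_command(events, workload):
    """Return an argv list that counts `events` while `workload` runs."""
    # -e selects the events; -x , requests comma-separated output.
    return ["perf", "stat", "-e", ",".join(events), "-x", ",", "--", *workload]


def parse_perf_stat_csv(text):
    """Map event name -> integer count from `perf stat -x,` output."""
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines and comment lines.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        value, event = row[0], row[2]
        # Skip events perf could not count, e.g. "<not supported>".
        if not value.isdigit():
            continue
        counts[event] = int(value)
    return counts


# Illustrative, made-up output; real values depend on workload and machine.
SAMPLE = (
    "2000000,,instructions,1000000,100.00,,\n"
    "1000000,,cycles,1000000,100.00,,\n"
    "<not supported>,,some_unsupported_event,0,100.00,,\n"
)
```

On a machine with perf installed, the command list could be passed to `subprocess.run(cmd, capture_output=True, text=True)` and the captured `stderr` handed to the parser.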
Choosing the right PMU events and following a methodology can make time spent on profiling and optimization more effective. It helps to have a high-level understanding of the Arm Neoverse N1 micro-architecture, since it includes complex pipelines and uses a multi-level memory hierarchy. It also helps to know which events to focus on, since the Neoverse cores support more than 100 hardware events.
To save you time and effort, so you can quickly refine your analysis and go deeper into the details of software optimization, we’ve put together two key documents that tell you what you need to know.
1) Arm Neoverse N1 PMU Guide: This document describes all the hardware PMU events in detail, with the micro-architecture and architecture background needed to use the events in performance analysis.
2) Arm Neoverse N1 Performance Analysis Methodology White paper: This white paper presents the performance analysis methodology and demonstrates how to conduct workload characterization on Arm Neoverse N1.
Download the white paper
In the white paper, we introduce the three subsystems of the CPU, suggest the raw events and derived metrics to use for initial workload characterization, identify the four key perf functions used for counting and event-based sampling, and include a case study that demonstrates the methodology.
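To make the idea of derived metrics concrete, here is a minimal sketch that turns raw event counts into a few commonly used ratios: instructions per cycle (IPC), misses per kilo-instruction (MPKI), and an L1 data cache miss rate. The function name `derived_metrics` is hypothetical, and the choice of metrics is an assumption for illustration; the white paper's own list may differ. The event names follow Arm's PMU event naming (`INST_RETIRED`, `CPU_CYCLES`, `BR_MIS_PRED_RETIRED`, `L1D_CACHE_REFILL`, `L1D_CACHE`).

```python
def derived_metrics(counts):
    """Compute derived metrics from a dict of raw PMU event counts."""
    instructions = counts["INST_RETIRED"]
    cycles = counts["CPU_CYCLES"]

    # Instructions per cycle: overall pipeline throughput.
    metrics = {"ipc": instructions / cycles}

    # Branch mispredictions per 1000 retired instructions.
    if "BR_MIS_PRED_RETIRED" in counts:
        metrics["branch_mpki"] = counts["BR_MIS_PRED_RETIRED"] * 1000 / instructions

    # L1 data cache refills (misses) per 1000 retired instructions.
    if "L1D_CACHE_REFILL" in counts:
        metrics["l1d_mpki"] = counts["L1D_CACHE_REFILL"] * 1000 / instructions

    # Fraction of L1 data cache accesses that missed.
    if "L1D_CACHE_REFILL" in counts and counts.get("L1D_CACHE"):
        metrics["l1d_miss_rate"] = counts["L1D_CACHE_REFILL"] / counts["L1D_CACHE"]

    return metrics


# Made-up counts, for illustration only.
EXAMPLE_COUNTS = {
    "INST_RETIRED": 2_000_000,
    "CPU_CYCLES": 1_000_000,
    "BR_MIS_PRED_RETIRED": 4_000,
    "L1D_CACHE_REFILL": 10_000,
    "L1D_CACHE": 500_000,
}
```

Normalizing to instructions (MPKI) rather than reporting raw counts makes runs of different lengths comparable, which is why ratios like these are a natural starting point for workload characterization.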