Arm Neoverse V1 – Top-down Methodology for Performance Analysis & Telemetry Specification

Jumana Mundichipparakkal
February 6, 2023
4 minute read time.

The Arm Neoverse V1 Performance Analysis Methodology whitepaper is now available to help you optimize your application code for V1-based production systems.

The whitepaper is an update to the previous Arm Neoverse N1: Performance Analysis Methodology, and it covers the new features and updates from N1 to V1 cores. This resource can be used to understand and optimize the performance of your application on the V1 platform.

To make the most of your time spent profiling and optimizing, it is important to select the right PMU events and follow a structured methodology with user-friendly software metrics. In the whitepaper, we present the Arm top-down analysis methodology for Neoverse V1.

In this blog, we outline the updates from N1 to V1 cores and provide an overview of the contents of this whitepaper. We also include references to other useful resources to help you take full advantage of the Neoverse V1 platform.

Arm Neoverse V1 supports top-down level 1 metrics

Arm Neoverse V1 is the first Arm core to support the full set of events needed to compute the level 1 metrics of the top-down methodology. These metrics add significant value to performance analysis and optimization.

They provide a detailed breakdown of processor pipeline utilization at the slot level, which lets you evaluate processor efficiency and identify bottlenecks. This is a major enhancement to the performance analysis capabilities of the Neoverse V1 platform, and it complements the other micro-architecture exploration metrics that can be used for further analysis.
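As a rough illustration of slot-level accounting, the sketch below derives four level 1 categories from the Topdown Level 1 events listed in the cheat sheet later in this post. It is a simplified approximation, not the whitepaper's definition: the formulas omit any correction terms (for example ones based on BR_MIS_PRED), and the `slots_per_cycle=8` default, the `SlotCounts` type, and the `topdown_level1` function are assumptions made for this example. Refer to the whitepaper appendices for the exact metric formulas.

```python
from dataclasses import dataclass


@dataclass
class SlotCounts:
    """Raw PMU counts for one measurement interval (hypothetical container)."""
    cpu_cycles: int
    stall_slot: int
    stall_slot_frontend: int
    stall_slot_backend: int
    op_retired: int
    op_spec: int


def topdown_level1(c: SlotCounts, slots_per_cycle: int = 8) -> dict:
    """Approximate level 1 breakdown, in percent of all slots.

    slots_per_cycle is an assumed pipeline width; check the whitepaper
    for the value and formulas that apply to your core.
    """
    total_slots = c.cpu_cycles * slots_per_cycle
    # Fraction of slots that were not stalled, i.e. carried an operation.
    used = 1.0 - c.stall_slot / total_slots
    # Fraction of speculatively executed operations that actually retired.
    retire_ratio = c.op_retired / c.op_spec
    return {
        "frontend_bound": 100.0 * c.stall_slot_frontend / total_slots,
        "backend_bound": 100.0 * c.stall_slot_backend / total_slots,
        "retiring": 100.0 * retire_ratio * used,
        "bad_speculation": 100.0 * (1.0 - retire_ratio) * used,
    }


if __name__ == "__main__":
    sample = SlotCounts(
        cpu_cycles=1000,
        stall_slot=4000,
        stall_slot_frontend=1000,
        stall_slot_backend=3000,
        op_retired=900,
        op_spec=1000,
    )
    for name, value in topdown_level1(sample).items():
        print(f"{name}: {value:.1f}%")
```

When STALL_SLOT equals the sum of its frontend and backend parts, the four categories in this simplified model add up to 100%, which is a handy sanity check on collected counts.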

Arm Neoverse V1 telemetry specification
Events & metrics for performance analysis

The Arm Neoverse V1 telemetry specification, including software product-specific event descriptions and derived metrics for analysis, can be found in Appendices B and C of the Arm Neoverse V1 Performance Analysis Methodology whitepaper.

Download the whitepaper

Arm Telemetry solution repository

The telemetry data, provided as machine-readable JSON files, and the stress workload suite referenced in the whitepaper are now available in the GitLab telemetry-solution repository.

Neoverse V1 PMU events and metrics cheat-sheets

Familiarizing yourself with the Arm Neoverse microarchitecture, including its complex pipelines and multi-level memory hierarchy, helps when deciding what to measure. Because the Neoverse cores provide over 100 hardware PMU events to select from, it is important to prioritize which events to focus on. To assist with this task, we have created cheat sheets that list events and their corresponding derived metrics.
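One way to put a prioritized event list to work is to turn it into Linux perf command lines. The sketch below is illustrative only: it takes the Topdown Level 1 event codes from the cheat sheet in this post, uses perf's raw-event syntax (`rNN`), and splits the events into batches. The `perf_stat_commands` helper and the `max_counters=6` default are assumptions for this example; check the PMU Guide for the number of counters available on your system.

```python
# Topdown Level 1 events and raw codes, copied from the cheat sheet.
TOPDOWN_L1_EVENTS = {
    "BR_MIS_PRED": "r10",
    "CPU_CYCLES": "r11",
    "OP_RETIRED": "r3a",
    "OP_SPEC": "r3b",
    "STALL_SLOT": "r3f",
    "STALL_SLOT_BACKEND": "r3d",
    "STALL_SLOT_FRONTEND": "r3e",
}


def perf_stat_commands(events, command, max_counters=6):
    """Build one `perf stat` argument list per batch of events.

    max_counters is an assumed limit on how many events are counted
    at once; it is a parameter, not a documented hardware value.
    """
    # Drop duplicate codes while keeping the original order.
    codes = list(dict.fromkeys(events.values()))
    batches = [
        codes[i:i + max_counters] for i in range(0, len(codes), max_counters)
    ]
    return [
        ["perf", "stat", "-e", ",".join(batch), "--", *command]
        for batch in batches
    ]


if __name__ == "__main__":
    # "./my_workload" is a placeholder for the application under test.
    for argv in perf_stat_commands(TOPDOWN_L1_EVENTS, ["./my_workload"]):
        print(" ".join(argv))
```

The function only builds the argument lists; running them (for example with `subprocess.run`) is left to the caller so the sketch stays safe to execute on any machine.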

Events cheat sheet

Topdown Level 1
  • BR_MIS_PRED (r10)
  • CPU_CYCLES (r11)
  • OP_RETIRED (r3a)
  • OP_SPEC (r3b)
  • STALL_SLOT (r3f)
  • STALL_SLOT_BACKEND (r3d)
  • STALL_SLOT_FRONTEND (r3e)

Cycle accounting
  • CPU_CYCLES (r11)
  • STALL_BACKEND (r24)
  • STALL_FRONTEND (r23)

Misses per kilo instructions
  • BR_MIS_PRED_RETIRED (r22)
  • DTLB_WALK (r34)
  • INST_RETIRED (r08)
  • ITLB_WALK (r35)
  • L1D_CACHE_REFILL (r03)
  • L1D_TLB_REFILL (r05)
  • L1I_CACHE_REFILL (r01)
  • L1I_TLB_REFILL (r02)
  • L2D_CACHE_REFILL (r17)
  • L2D_TLB_REFILL (r2d)
  • LL_CACHE_MISS_RD (r37)

Branch effectiveness
  • BR_MIS_PRED_RETIRED (r22)
  • BR_RETIRED (r21)
  • INST_RETIRED (r08)

Instruction TLB effectiveness
  • INST_RETIRED (r08)
  • ITLB_WALK (r35)
  • L1I_TLB (r26)
  • L1I_TLB_REFILL (r02)
  • L2D_TLB (r2f)
  • L2D_TLB_REFILL (r2d)

Miss ratio
  • BR_MIS_PRED_RETIRED (r22)
  • BR_RETIRED (r21)
  • DTLB_WALK (r34)
  • ITLB_WALK (r35)
  • L1D_CACHE (r04)
  • L1D_CACHE_REFILL (r03)
  • L1D_TLB (r25)
  • L1D_TLB_REFILL (r05)
  • L1I_CACHE (r14)
  • L1I_CACHE_REFILL (r01)
  • L1I_TLB (r26)
  • L1I_TLB_REFILL (r02)
  • L2D_CACHE (r16)
  • L2D_CACHE_REFILL (r17)
  • L2D_TLB (r2f)
  • L2D_TLB_REFILL (r2d)
  • LL_CACHE_MISS_RD (r37)
  • LL_CACHE_RD (r36)

Data TLB effectiveness
  • DTLB_WALK (r34)
  • INST_RETIRED (r08)
  • L1D_TLB (r25)
  • L1D_TLB_REFILL (r05)
  • L2D_TLB (r2f)
  • L2D_TLB_REFILL (r2d)

L1 instruction cache effectiveness
  • INST_RETIRED (r08)
  • L1I_CACHE (r14)
  • L1I_CACHE_REFILL (r01)

L1 data cache effectiveness
  • INST_RETIRED (r08)
  • L1D_CACHE (r04)
  • L1D_CACHE_REFILL (r03)

L2 unified cache effectiveness
  • INST_RETIRED (r08)
  • L2D_CACHE (r16)
  • L2D_CACHE_REFILL (r17)

Last level cache effectiveness
  • INST_RETIRED (r08)
  • LL_CACHE_MISS_RD (r37)
  • LL_CACHE_RD (r36)

Speculation operation mix
  • ASE_SPEC (r74)
  • BR_IMMED_SPEC (r78)
  • BR_INDIRECT_SPEC (r7a)
  • CRYPTO_SPEC (r77)
  • DP_SPEC (r73)
  • INST_SPEC (r1b)
  • LD_SPEC (r70)
  • ST_SPEC (r71)
  • VFP_SPEC (r75)

Table 1. Neoverse V1 core events cheat sheet
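To show how the event groups in Table 1 map to derived metrics, here is a minimal sketch of two common calculations: misses per kilo instructions (MPKI) and miss ratio. It uses the conventional definitions (refills per 1,000 retired instructions, and refills divided by accesses); the function names and the sample counts are made up for illustration, and the whitepaper appendices remain the reference for the exact metric formulas.

```python
def mpki(miss_events: int, inst_retired: int) -> float:
    """Misses per kilo instructions: miss events per 1,000 retired instructions."""
    return 1000.0 * miss_events / inst_retired


def miss_ratio(refills: int, accesses: int) -> float:
    """Fraction of accesses that missed and caused a refill."""
    return refills / accesses


if __name__ == "__main__":
    # Hypothetical counts, keyed by the event names used in Table 1.
    counts = {
        "INST_RETIRED": 1_000_000,
        "L1D_CACHE": 200_000,
        "L1D_CACHE_REFILL": 5_000,
        "BR_RETIRED": 150_000,
        "BR_MIS_PRED_RETIRED": 3_000,
    }
    l1d_mpki = mpki(counts["L1D_CACHE_REFILL"], counts["INST_RETIRED"])
    l1d_ratio = miss_ratio(counts["L1D_CACHE_REFILL"], counts["L1D_CACHE"])
    br_mpki = mpki(counts["BR_MIS_PRED_RETIRED"], counts["INST_RETIRED"])
    br_ratio = miss_ratio(counts["BR_MIS_PRED_RETIRED"], counts["BR_RETIRED"])
    print(f"L1D MPKI: {l1d_mpki:.2f}, L1D miss ratio: {l1d_ratio:.3f}")
    print(f"Branch MPKI: {br_mpki:.2f}, branch mispredict ratio: {br_ratio:.3f}")
```

MPKI normalizes by work done, so it is comparable across runs of different length, while the miss ratio describes how well a given cache, TLB, or predictor serves the requests it actually sees.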

Key references

The following two documents provide the information needed to conduct performance analysis on Neoverse V1 and are our recommended go-to references:

1) Arm Neoverse V1 Performance Analysis Methodology whitepaper: This whitepaper presents a performance analysis methodology and shows how to conduct workload characterization on the Arm Neoverse V1 platform. It updates the previous whitepaper, which covers the same ground for the Arm Neoverse N1 platform. If you are new to Arm platforms and performance analysis tools such as Linux perf, we recommend reading this whitepaper first.

Download the whitepaper

2) Arm Neoverse V1 PMU Guide (direct download): This document provides a comprehensive overview of all hardware PMU events, including micro-architecture and architecture details that are necessary for using the events effectively in performance analysis.

Arm Neoverse V1 core

Arm Neoverse V1 is a core designed to deliver maximum single-threaded performance for demanding cloud, HPC, and AI/ML-assisted workloads. Neoverse V1 is the first Neoverse processor to include Scalable Vector Extensions (SVE) for maximum vector performance, HPC code reuse, and longevity. Neoverse V1 supports Bfloat16 and Int8 MatMul instructions. These instructions can offer up to 3x the performance for machine learning frameworks like TensorFlow, PyTorch, OneDNN, and others compared to Neoverse N1. The Neoverse V1 CPU is available today on AWS EC2 instances powered by the AWS Graviton3 and AWS Graviton3E processors.

Conclusion

Our top-down analysis methodology and telemetry specification are now available for the Neoverse V1 platform. We will begin upstreaming this information to the Linux perf tool soon. V-series cores, like V1, are designed to deliver maximum single-threaded performance within the Neoverse family of CPU IP. The Neoverse V1 Performance Analysis Methodology whitepaper and the V1 PMU Guide can help developers extract maximum performance from the V1 architecture. We encourage all developers using V1-based platforms (including AWS Graviton3 and Graviton3E) to check them out.

Download the whitepaper
