Software Optimization: Four real-life Streamline use cases (Part 4)
Guilherme Marshall
September 11, 2013
5 minute read time.
System-level Analysis with Performance Counters
This is my last post of this series (see Timeline analysis, Smart software profiling and Benchmarking if you missed previous posts) and, as you would expect, I have saved the best for last. The Timeline View in ARM DS-5™ Streamline not only displays software counters generated by the OS or graphics drivers, but can also display performance counters from the processor's performance monitoring unit (PMU) or from any memory-mapped device (e.g. Mali™ graphics processors, CoreLink™ interconnect, L2 cache controller or memory controllers).
In addition, you can create custom charts with arithmetic functions of two or more performance counters, which are normally used for ratio-based analysis. For instance, it is easier to spot problems by identifying a sudden drop in the cache hit/miss ratio or an increase in the cycles per instruction (CPI) ratio than by measuring changes in the value of raw performance counters. Streamline includes pre-configured snippets for typically useful performance ratios.
Most people use the performance counter information in the Timeline view to spot problems that they didn't know about. They do this by looking for unexpectedly high or low values in counters or ratios, or by spotting spikes or drops in the value of counters. The fact that Streamline correlates performance counters with process information means that not only do you know there is a problem, but you can also identify what area of the software was likely to be responsible for it.
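As a rough illustration of ratio-based analysis, the sketch below builds a CPI series from two raw counter series and flags samples where the ratio jumps or drops sharply. It is not how Streamline implements its charts; the function names, the 50% threshold and the counter values are all assumptions made up for this example.

```python
def ratio_series(numerators, denominators):
    """Element-wise ratio of two counter series; None where the denominator is 0."""
    return [n / d if d else None for n, d in zip(numerators, denominators)]


def sudden_changes(series, threshold=0.5):
    """Return (index, relative_change) for every sample that moves by more than
    `threshold` (a fraction) relative to the previous valid sample."""
    flagged = []
    previous = None
    for index, value in enumerate(series):
        if value is None:
            continue  # skip intervals where the ratio is undefined
        if previous is not None and previous != 0:
            change = (value - previous) / previous
            if abs(change) > threshold:
                flagged.append((index, change))
        previous = value
    return flagged


# Hypothetical per-interval counter deltas (not real measurements).
cycles = [1000, 1100, 1050, 3000, 1000]
instructions = [800, 880, 840, 750, 800]

cpi = ratio_series(cycles, instructions)
for index, change in sudden_changes(cpi):
    print(f"interval {index}: CPI changed by {change:+.0%}")
```

With this made-up data the raw cycle count alone is ambiguous, whereas the CPI series makes the anomalous interval stand out immediately.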
Here I will cover typical usage of the most widely used processor counters, although there is a lot more you can do, as ARM-based hardware products include many performance counters:
Cycles per instruction ratio: this tells you how efficiently the CPU is running code. An increase in this value suggests that the CPU is stalling more often (for example on cache misses or branch mispredictions) and should prompt further investigation.
Cache hit to miss ratio: this tells you whether the CPU is hitting the cache or not. Hitting the cache may make your software between 10 and 100 times faster, so it is an extremely important metric. This ratio may go down, for example, if you are using inappropriate data structures (e.g. lists or global variables) or if the critical loop in your algorithm is just too large to fit in the L1 cache. Once you have spotted a cache problem you can rewrite parts of the code to use different types of variables, or perhaps rebuild that loop to optimize for code size instead of performance, and then get a massive performance improvement.
Branch predictor success rate: this tells you if the CPU is predicting correctly whether branches in the code are taken or not taken. For complex Cortex®-A CPUs the penalty involved in branch mispredictions is pretty high, so it is worth keeping an eye on this ratio, and potentially rewriting your algorithms to ensure that branches in critical loops are correctly predicted.
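The three ratios above can be sketched as simple arithmetic on raw counter values. This is a minimal illustration, not Streamline's own chart configuration: the `CounterSample` type and its field names are hypothetical, and the actual PMU event names and availability vary between cores.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw counter deltas for one sampling interval (hypothetical field names)."""
    cycles: int
    instructions: int
    cache_accesses: int
    cache_misses: int
    branches: int
    branch_mispredicts: int


def cycles_per_instruction(s):
    """CPI: lower generally means the CPU is doing more useful work per cycle."""
    return s.cycles / s.instructions if s.instructions else float("nan")


def cache_hit_to_miss_ratio(s):
    """Hits divided by misses; infinite when there were no misses at all."""
    hits = s.cache_accesses - s.cache_misses
    return hits / s.cache_misses if s.cache_misses else float("inf")


def branch_prediction_success_rate(s):
    """Fraction of branches that were predicted correctly."""
    if not s.branches:
        return float("nan")
    return (s.branches - s.branch_mispredicts) / s.branches


# Made-up values for one interval, purely for illustration.
sample = CounterSample(
    cycles=12_000, instructions=8_000,
    cache_accesses=5_000, cache_misses=250,
    branches=2_000, branch_mispredicts=100,
)
print(f"CPI:                 {cycles_per_instruction(sample):.2f}")
print(f"Cache hit/miss:      {cache_hit_to_miss_ratio(sample):.1f}")
print(f"Branch success rate: {branch_prediction_success_rate(sample):.1%}")
```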
The Timeline View enables you to select a process or a thread and visualize the contribution it makes to CPU utilization, power consumption and performance counters, so if there is a spike in a counter you get immediate information on what thread caused it.
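To make the idea of attributing a counter spike to a thread concrete, here is a small sketch. It assumes counter deltas have already been tagged with a time bucket and a thread name; that tuple layout, the thread names and the numbers are hypothetical and do not reflect Streamline's capture format.

```python
from collections import defaultdict


def busiest_bucket(samples):
    """Return the time bucket with the largest counter total, or None if empty."""
    totals = defaultdict(int)
    for bucket, _thread, delta in samples:
        totals[bucket] += delta
    return max(totals, key=totals.get) if totals else None


def thread_shares(samples, bucket):
    """Return {thread: fraction of the counter total} within one time bucket."""
    totals = defaultdict(int)
    for sample_bucket, thread, delta in samples:
        if sample_bucket == bucket:
            totals[thread] += delta
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {thread: count / grand_total for thread, count in totals.items()}


# (time bucket, thread, cache-miss delta): invented data for illustration.
samples = [
    (0, "ui", 120), (0, "decoder", 100),
    (1, "ui", 110), (1, "decoder", 900),
    (2, "ui", 130), (2, "decoder", 90),
]

spike = busiest_bucket(samples)
shares = thread_shares(samples, spike)
culprit = max(shares, key=shares.get)
print(f"spike in bucket {spike}: {culprit} caused {shares[culprit]:.0%} of the events")
```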
If you want greater detail on what is driving an increase in certain counters, Streamline includes a feature called event-based sampling, which enables you to take samples of the program counter on an interrupt from the performance monitoring unit. When this feature is activated, the profiling reports in the tool do not show information on processor time per thread or function, but instead show the percentage of times that an event is caused by each thread or function. For example, if you set event-based sampling on the branch misprediction counter, you can get a report on what code is incorrectly predicted by the CPU. This only requires you to select a counter in a graphical environment, so anyone can use it easily.
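The principle behind event-based sampling can be simulated in a few lines. On real hardware the PMU raises an interrupt when a counter overflows and the profiler records the program counter; the sketch below only mimics that by sampling every `period`-th event from an invented event stream. The function names and the stream are assumptions for illustration, not Streamline's implementation.

```python
from collections import Counter


def event_based_profile(event_sources, period):
    """Simulate event-based sampling.

    `event_sources` lists, in order, the function that was executing when each
    event (e.g. a branch misprediction) occurred. Every `period`-th event takes
    a sample. Returns {function: percentage of samples}.
    """
    if period < 1:
        raise ValueError("period must be at least 1")
    hits = Counter(
        source
        for index, source in enumerate(event_sources, start=1)
        if index % period == 0
    )
    total = sum(hits.values())
    if total == 0:
        return {}
    return {function: 100.0 * count / total for function, count in hits.items()}


# Invented stream: 30 mispredictions in parse(), then 10 in render().
events = ["parse"] * 30 + ["render"] * 10
profile = event_based_profile(events, period=10)
for function, percent in sorted(profile.items(), key=lambda item: -item[1]):
    print(f"{function}: {percent:.0f}% of sampled events")
```

Note that the report is expressed as a share of events rather than a share of processor time, which matches the change in the profiling reports described above.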
Graphics processors and system (or fabric) IP blocks such as ARM CoreLink interconnect, DMA controllers and memory controllers typically provide memory-mapped counters containing information about their internal efficiency, level of utilization, bandwidth and latency. These counters are included in the hardware to enable developers to analyze and optimize the target system as a whole. This is important because the speed of software execution depends less and less on the performance of the CPU, and more and more on the impact of system-level components. Again, Streamline makes it very easy to spot and address these types of problems:
Simultaneous visualization of performance counters showing the utilization of CPUs, GPUs and other compute engines over time enables you to balance your code between them and achieve overall faster performance. Streamline shows clearly when compute engines saturate, which suggests that you could be better off trying a different compute engine (for example, the NEON™ SIMD unit instead of the GPU).
System IP counters enable you to spot whether two or more masters are trying to access a single slave simultaneously, which results in slow performance and wasted energy. Rescheduling tasks, changing the cache policies of CPUs and GPUs or using alternative memories (e.g. on-chip memory) can have a massive impact on performance.
Finally, relating system IP counters to power consumption increases your understanding of the relationship between software and power on your particular target. For example, it is well known that external memory accesses consume more power than internal ones. By relating the number of L2 cache misses to the power consumption of the I/O power supply of the main SoC and external memory you can create a mathematical model of the relationship between code size and power consumption, which may prompt you to re-assess the amount of software you try to run in parallel.
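One simple form such a mathematical model could take is a straight-line fit of measured power against L2 cache misses. The sketch below uses ordinary least squares in plain Python; the linear form, the function name and all of the data points are assumptions for illustration, and a real target may well need a richer model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented measurements: L2 misses per interval and I/O rail power in mW.
l2_misses = [1000, 2000, 3000, 4000]
io_power_mw = [150.0, 200.0, 250.0, 300.0]

slope, intercept = fit_line(l2_misses, io_power_mw)
print(f"approx. {slope * 1000:.1f} mW per 1000 L2 misses, {intercept:.1f} mW baseline")

# Use the fitted model to estimate power at a miss rate that was not measured.
estimated = slope * 2500 + intercept
print(f"estimated power at 2500 misses: {estimated:.1f} mW")
```

The intercept approximates the baseline power drawn with no external accesses, and the slope gives an estimate of the incremental cost of each additional miss.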
Blogs in this series
Four real-life Streamline use cases (Part 1): Timeline analysis
Four real-life Streamline use cases (Part 2): Smart software profiling
Four real-life Streamline use cases (Part 3): Benchmarking
Four real-life Streamline use cases (Part 4): System-level analysis