Software Optimization: Four real-life Streamline use cases (Part 4)

Guilherme Marshall
September 11, 2013
System-level Analysis with Performance Counters - This is the last post in this series (see Timeline analysis, Smart software profiling and Benchmarking if you missed the previous posts) and, as you would expect, I have saved the best for last. The Timeline View in ARM DS-5™ Streamline not only displays software counters generated by the OS or graphics drivers, but can also display performance counters from the processor's performance monitoring unit (PMU) or from any memory-mapped device (e.g. Mali™ graphics processors, CoreLink™ interconnect, L2 cache controller or memory controllers).

  In addition, you can create custom charts with arithmetic functions of two or more performance counters, which are normally used for ratio-based analysis. For instance, it is easier to spot problems by identifying a sudden drop in the cache hit/miss ratio or an increase in the cycles per instruction (CPI) ratio than by measuring changes in the value of raw performance counters. Streamline includes pre-configured snippets for typically useful performance ratios.
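To make the ratio idea concrete, here is a minimal Python sketch of the arithmetic behind two such ratios. It is not Streamline's chart syntax or API: the `CounterSample` layout and the function names are hypothetical, and the counters are assumed to be per-interval deltas. The cache metric is expressed as a hit rate (hits divided by accesses) rather than hits divided by misses, which is an assumption made here to avoid dividing by zero when there are no misses.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw counter deltas for one sampling interval (hypothetical layout)."""
    cycles: int
    instructions: int
    cache_accesses: int
    cache_misses: int


def cycles_per_instruction(sample: CounterSample) -> float:
    """CPI for the interval; NaN when no instructions were retired."""
    if sample.instructions == 0:
        return float("nan")
    return sample.cycles / sample.instructions


def cache_hit_rate(sample: CounterSample) -> float:
    """Fraction of cache accesses that hit; NaN when there were no accesses."""
    if sample.cache_accesses == 0:
        return float("nan")
    hits = sample.cache_accesses - sample.cache_misses
    return hits / sample.cache_accesses


# Illustrative values only, not measurements from real hardware.
sample = CounterSample(cycles=2000, instructions=1000,
                       cache_accesses=400, cache_misses=100)
cpi = cycles_per_instruction(sample)
hit_rate = cache_hit_rate(sample)
```

Plotting `cpi` and `hit_rate` per interval, instead of the four raw counters, is what makes a sudden change stand out.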



  Most people use the performance counter information in the Timeline view to spot problems that they didn't know about. They do this by looking for unexpectedly high or low values in counters or ratios, or by spotting spikes or drops in the value of counters. The fact that Streamline correlates performance counters with process information means that not only do you know there is a problem, but you can also identify what area of the software was likely to be responsible for it.
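Spotting a spike is something you normally do by eye in the Timeline view, but the same idea can be sketched in a few lines of Python for counter data exported or logged elsewhere. This is only an illustration under stated assumptions: `find_spikes`, its `window` and `factor` parameters, and the median-based baseline are all hypothetical choices, not something Streamline provides.

```python
from statistics import median


def find_spikes(values, window=5, factor=2.0):
    """Return indices whose value exceeds `factor` times the median
    of the preceding `window` samples (a simple, assumed heuristic)."""
    spikes = []
    for i in range(window, len(values)):
        baseline = median(values[i - window:i])
        if baseline > 0 and values[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Illustrative CPI values per interval, not real measurements.
cpi_series = [1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 3.5, 1.1, 1.0]
suspect_intervals = find_spikes(cpi_series)
```

A median baseline is used so that one earlier spike does not raise the threshold for the samples that follow it.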

  Here I will cover typical usage of the most widely used processor counters, although clearly there is much more you can do, as ARM-based hardware products include many performance counters:
  • Cycles per instruction ratio: this tells you how efficiently the CPU is running code. An increase in this value usually means the CPU is stalling more often, and prompts further investigation.
  • Cache hit to miss ratio: this tells you whether the CPU is hitting the cache or not. Hitting the cache may make your software between 10 and 100 times faster, so it is an extremely important metric. This ratio may go down, for example, if you are using inappropriate data structures (e.g. lists or global variables) or if the critical loop in your algorithm is just too large to fit in the L1 cache. Once you have spotted a cache problem you may rewrite parts of the code to use different types of variables, or perhaps rebuild that loop to optimize for code size instead of performance, and then get a massive performance improvement.
  • Branch predictor success rate: this tells you whether the CPU is correctly predicting if branches in the code are taken or not taken. For complex Cortex®-A CPUs the penalty for a branch misprediction is pretty high, so it is worth keeping an eye on this ratio, and potentially rewriting your algorithms to ensure that branches in critical loops are correctly predicted.
  The Timeline View enables you to select a process or a thread and visualize the contribution it makes to CPU utilization, power consumption and performance counters, so if there is a spike in a counter you get immediate information on what thread caused it.



  If you want greater detail on what is driving an increase in certain counters, Streamline includes a feature called event-based sampling, which enables you to take samples of the program counter on an interrupt from the performance monitoring unit. When this feature is activated, the profiling reports in the tool do not show information on processor time per thread or function, but instead show the percentage of times that an event is caused by each thread or function. For example, if you set event-based sampling on the branch misprediction counter, you can get a report on what code is incorrectly predicted by the CPU. This only requires you to select a counter in a graphical environment, so anyone can use it easily.
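The report produced by event-based sampling can be thought of as a histogram of sampled program counter values grouped by function. The Python sketch below illustrates that attribution step only; it is not how Streamline is implemented. The symbol table, the addresses and the `attribute_samples` helper are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def attribute_samples(pcs, symbols):
    """Map sampled program counter values to functions and return the
    percentage of samples per function.

    `symbols` is a list of (start_address, name) sorted by start address
    (a hypothetical, simplified symbol table).
    """
    starts = [start for start, _ in symbols]
    counts = Counter()
    for pc in pcs:
        idx = bisect_right(starts, pc) - 1  # last symbol starting at or before pc
        name = symbols[idx][1] if idx >= 0 else "<unknown>"
        counts[name] += 1
    total = len(pcs)
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}


# Illustrative data: each PC is assumed to have been captured on a
# branch-misprediction counter overflow interrupt.
symbols = [(0x1000, "parse"), (0x2000, "sort"), (0x3000, "render")]
pcs = [0x2010, 0x2020, 0x1004, 0x2ABC]
report = attribute_samples(pcs, symbols)
```

With these made-up samples, most of the events land in `sort`, which is the function you would look at first.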





  Graphics processors and system (or fabric) IP blocks such as ARM CoreLink interconnect, DMA controllers and memory controllers typically provide memory mapped counters containing information about their internal efficiency, level of utilization, bandwidth and latency. These counters are included in the hardware in order to enable developers to analyze and optimize the target system as a whole. This is important because the speed of software execution depends less and less on the performance of the CPU, and more and more on the impact of system-level components. Again, Streamline makes it very easy to spot and address these types of problems:
  • Simultaneous visualization of performance counters showing the utilization of CPUs, GPUs and other compute engines over time enables you to balance your code between them and achieve better overall performance. Streamline shows clearly when a compute engine saturates, which suggests that you could be better off trying a different compute engine (for example, the NEON™ SIMD unit instead of the GPU)
  • System IP counters enable you to spot whether two or more masters are trying to access a single slave simultaneously, which results in slow performance and wasted energy. Here, rescheduling tasks, changing cache policies of CPUs and GPUs or using alternative memories (e.g. on-chip memory) can have a massive impact on performance
  • Finally, relating system IP counters with power consumption increases your understanding of the relationship between software and power on your particular target. For example, it is well known that external memory accesses take more power than internal ones. By relating the number of L2 cache misses with the power consumption of the I/O power supply of the main SoC and external memory you can create a mathematical model of the relationship between code size and power consumption, which may prompt you to re-assess the amount of software you try to run in parallel.
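As a rough illustration of the last point, the simplest mathematical model relating L2 cache misses to power is a straight line fitted by least squares. The Python sketch below assumes you have paired readings of L2 misses and I/O supply power per interval. The `fit_line` helper is hypothetical and the numbers are invented for the example, not measured on any target; a real system may well need a richer model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired readings")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented readings: L2 misses per interval and I/O supply power in mW.
l2_misses = [1000, 2000, 3000, 4000]
io_power_mw = [120.0, 140.0, 160.0, 180.0]
mw_per_miss, idle_mw = fit_line(l2_misses, io_power_mw)
```

In a model like this the slope estimates the extra power per L2 miss in each interval and the intercept estimates the baseline power when there are no misses.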
Blogs in this series
  • Four real-life Streamline use cases (Part 1): Timeline analysis
  • Four real-life Streamline use cases (Part 2): Smart software profiling
  • Four real-life Streamline use cases (Part 3): Benchmarking
  • Four real-life Streamline use cases (Part 4): System-level analysis