Improving data performance with Arm Streamline 7.1

November 25, 2019


When I first started writing and optimizing games, I read an article explaining the importance of designing software with caches in mind, because main memory was so slow. At the time I was programming on a 486DX, with 8KB of unified L1 cache on-chip and 128KB of L2 cache plugged into the motherboard. I don't remember the exact numbers for this machine, but the main memory read latency (the latency of fetching data if all levels of cache missed) would likely have been under 15 CPU cycles.

On a modern processor the same 15 CPU cycles would be just about enough to fetch data from the L2 cache, but the main memory latency is likely to be well over 150 CPU cycles. For high performance applications the need for writing code that plays nicely with the CPU caches has never been greater.

Profiling with Streamline

The Arm Streamline profiler is a sample-based profiler, which takes samples from a variety of data sources to build up its profile of how your application is running. These include the program counter (PC) of running threads, and the CPU performance monitoring unit (PMU) counters. These rich data sources can show where the hot spots are in the executing code, and how efficiently the processors are executing it.

Time-based sampling

The default behavior of Streamline is to sample at a fixed period, typically one sample per processor every millisecond. This tells you where your application is spending time, but the only way to investigate program cache efficiency is by visually inspecting the PMU charts plotted on the timeline.

Streamline chart showing L1 and L2 cache accesses and refills.

This chart shows trends, and can be useful when monitoring improvements across multiple runs as patches are applied, but it is very difficult to tell which parts of your application are actually generating the memory traffic.

Event-based sampling

Streamline supports a different method of sampling, known as event-based sampling (EBS), which triggers samples for every N increments of a CPU PMU or Perf event counter. If we set this trigger to use the "L2 Data Cache (Refill)" counter, then we get a direct measure of the functions in our code which are getting L2 cache misses.
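The idea of triggering a sample every N counter increments can be sketched with a small model. This is only a conceptual illustration, not how Streamline or the PMU hardware is implemented: the function name `event_based_samples`, the trace format, and the per-function event counts are all made up for the example.

```python
from collections import Counter

def event_based_samples(trace, period):
    """Attribute one sample to the running function every `period` events.

    `trace` is an iterable of (function_name, event_count) pairs, a
    hypothetical stand-in for the counter increments seen while each
    function runs.
    """
    samples = Counter()
    pending = 0  # events counted since the last sample was taken
    for function_name, event_count in trace:
        pending += event_count
        # Each time the running count crosses the period, one sample is
        # attributed to the function that was executing at that moment.
        fired, pending = divmod(pending, period)
        samples[function_name] += fired
    return samples

# Made-up workload: three functions with very different L2 refill counts.
trace = [("small_block", 10), ("medium_block", 2_000), ("large_block", 88_000)] * 5
samples = event_based_samples(trace, period=15_000)
print(samples)
```

In this model the function that generates most of the events collects most of the samples, regardless of how long each function runs, which is the property that makes EBS useful for finding cache-miss hot spots.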

In Streamline 7.1, which is part of Arm Mobile Studio 2019.2 and Arm Development Studio 2019.1, EBS has been significantly improved to be more usable in modern multi-core chipsets. You can now set EBS triggers for all CPU clusters in a multi-cluster DynamIQ design, rather than just a single cluster. This allows any multi-threaded software to be profiled, no matter which physical processors it is running on.

A practical example

For this blog we'll use a test application, similar in structure to the popular LMBench memory latency test, designed to show the impact of the different levels of CPU cache. Each iteration of this test has two phases, setup and test.

Setup:

Allocate an N byte block of test memory.
Construct a singly linked list inside this block, where each list node points to the next node in the sequence.
One list node is created inside each cache line in the block.
The order of nodes in the list is randomized to minimize the impact of processor prefetching.

Test:

Iterate around this list repeatedly for 5 seconds.

We run three iterations of this test scenario with three block sizes: 16KB, 256KB, and 4MB. The 16KB test is designed to fit cleanly inside the L1 data cache, the 256KB test is designed to fit cleanly inside the L2 data cache, and the 4MB test is designed to exceed the capacity of the L2 and force loads from main memory.

Note: These block sizes are chosen for our test device; they may differ for other devices.
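The setup and test phases can be sketched as follows. This is a minimal illustration of the list construction, not the source of the test application: it uses Python list indices in place of pointers, assumes a 64-byte cache line, and walks a fixed number of steps rather than running for 5 seconds. A real latency test would be written in a compiled language such as C, because interpreter overhead would swamp the memory latency being measured.

```python
import random

CACHE_LINE_BYTES = 64  # assumed line size; check your target CPU

def build_chain(block_bytes, line_bytes=CACHE_LINE_BYTES, seed=0):
    """Return next_index[], one node per cache line, linked in random order.

    Node i stands for the cache line at byte offset i * line_bytes.
    The nodes form a single cycle, so a walk visits every line before
    it repeats.
    """
    node_count = block_bytes // line_bytes
    order = list(range(node_count))
    random.Random(seed).shuffle(order)
    next_index = [0] * node_count
    # Link each node to its successor in the shuffled order, wrapping at the end.
    for position, node in enumerate(order):
        next_index[node] = order[(position + 1) % node_count]
    return next_index

def walk(next_index, steps, start=0):
    """Follow the chain for `steps` hops; each hop depends on the previous load."""
    node = start
    for _ in range(steps):
        node = next_index[node]
    return node

chain = build_chain(16 * 1024)  # 16KB block gives 256 nodes at 64 bytes per line

# Check that one lap around the list touches every node exactly once.
visited = set()
node = 0
for _ in range(len(chain)):
    visited.add(node)
    node = chain[node]
print(len(chain), len(visited), node)
```

Because each hop needs the result of the previous one, the loads cannot overlap, so the time per hop approximates the latency of whichever level of the memory hierarchy holds the block.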

Timeline view

The initial data view in Streamline is the Timeline view, which gives an overview of the CPU behavior as a set of charts plotted against a common time axis. In the figure below I have composited three highlighted regions, each one second long, into a single image for easy comparison.

A comparison of the performance of different memory block sizes.

The left-hand column, showing the 16KB block size, achieves a sustained throughput of 3.5 CPU cycles per instruction. We can see plenty of L1 cache accesses, but almost no L1 refills or L2 activity, so we can clearly see that the data for the code running here fits inside the L1 data cache.

The middle column, showing the 256KB block size, achieves a sustained throughput of 9.9 CPU cycles per instruction, almost 3 times slower than the case where data is inside the L1. We start to see an increase in the number of L2 cache accesses, to almost 1 per instruction executed, but almost no L2 refills, so we can clearly see that the data for the code running is spilling out of the L1 cache but fits inside the L2 data cache.

The right-hand column, showing the 4MB block size, achieves a sustained throughput of just 72 cycles per instruction. This is a slowdown of ~20x compared to the 16KB test case, highlighting the importance of cache efficiency and data layout to data-plane processing algorithms.
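The slowdown factors quoted above follow directly from the three cycles-per-instruction figures. A quick check of the arithmetic, using only the numbers reported in the text:

```python
# Cycles per instruction reported in the text for each block size.
cpi = {"16KB": 3.5, "256KB": 9.9, "4MB": 72.0}

baseline = cpi["16KB"]
slowdown = {size: value / baseline for size, value in cpi.items()}

for size, factor in slowdown.items():
    print(f"{size}: {factor:.1f}x the cycles per instruction of the 16KB case")
```

This gives roughly 2.8x for the 256KB case and 20.6x for the 4MB case, matching the "almost 3 times" and "~20x" figures.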

The problem with this view is that, although we can see the performance has declined, we can’t see which functions in our application source code are to blame for the memory traffic.

Note: This image highlights a good use of custom user-defined charts. Data series plotted on the timeline can be created by defining mathematical expressions based on the raw hardware counter inputs. For example, we can define the "Cycles/Instruction" series for the Cortex-A73 CPUs in our test device as:

$CyclesCPUCycles.Cluster1 / $InstructionsExecutedAll.Cluster1

A custom data visualization can be saved as a Template and shared with other team members, which is a great way of sharing good practice and analysis methodology.
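The chart expression divides one counter series by another for each sample window. A rough Python equivalent is sketched below, as an illustration of the calculation only. The counter values are made up, not measurements from the test device, and returning `None` for an idle window is a choice made for this sketch rather than a description of how Streamline handles division by zero.

```python
def cycles_per_instruction(cycles, instructions):
    """Per-window ratio of two counter series; None where no instructions ran."""
    return [c / i if i else None for c, i in zip(cycles, instructions)]

# Illustrative per-window counter deltas, not measurements from the article.
cycles = [7_000, 9_900, 0]
instructions = [2_000, 1_000, 0]

series = cycles_per_instruction(cycles, instructions)
print(series)
```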

Time-based hot spots

If we look at the normal capture, using time-based samples, we can see that all it really tells us is that we spend a similar amount of time in each of the test functions.

The time breakdown for our test application, showing an equal division of time, and nothing interesting about cache behavior.

This is expected as we run each test function for 5 seconds, with some quantization error because we only test for completion at the end of each test cycle.

However, this does not indicate how efficiently the CPU cache is being used by each function. If we were looking to optimize an algorithm for cache suitability this isn't going to tell us where to start looking.

Event-based hot spots

Let’s repeat the same test scenario but switch to sampling every 15,000 L2 Cache Refills, instead of every millisecond.

The cost breakdown by cache miss, which shows us a direct view into the interesting cache behavior we care about.

Suddenly we get a much more informative view, which clearly points at the 4MB block size test as the cause of 80% of our L2 cache loads from main memory. For cache-aware programming we now have a clear data hot spot to aim optimizations at.

Setting up event-based sampling

Setting up an EBS profile is easy. Connect to your target as normal and select the CPU counters you want to capture for all CPU clusters in your target device, including the one for use as an EBS trigger.

The Counter Configuration dialog, showing how to set up event-based sampling.

Once you have selected the counters you want to capture, click on the target icon, found to the left of the counter name in the "Events to Collect" column, to enable that counter as an EBS trigger. The target icon will switch to red (see 1 in the figure above) to indicate that it is in use as an EBS trigger. If you have multiple CPU clusters in your device, the same EBS trigger will automatically be set up for the other clusters when you configure the first instance of it.

The second step is to configure the EBS threshold. Click on one of the trigger counters in the list to bring up the trigger panel (see 2 in the figure above) and enter the desired trigger threshold value in the dialog box. We recommend aiming to get approximately one EBS sample every millisecond, as this provides a good balance of measurement invasiveness and visibility. In our case, manual inspection in the Timeline view shows that we get approximately 15 million L2 accesses per second, so a trigger value of 15000 was used.
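The threshold calculation is simply the observed counter rate divided by the desired sample rate. A small helper makes the arithmetic explicit. The function name `ebs_threshold` is hypothetical, and the default of 1,000 samples per second reflects the one-sample-per-millisecond guideline above.

```python
def ebs_threshold(events_per_second, samples_per_second=1_000):
    """Events between samples needed to hit the target sample rate."""
    return max(1, round(events_per_second / samples_per_second))

# Rate read from the Timeline view in the example above.
threshold = ebs_threshold(15_000_000)
print(threshold)
```

For a counter that increments far less often, the same helper would suggest a proportionally smaller threshold, with a floor of one event per sample.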

Summary

Optimizing data processing software for cache efficiency is more important today than it has ever been, but with traditional time-based profiling it can be very difficult to see which parts of your application are causing the cache misses to occur. Using Arm Streamline and event-based sampling to trigger profiling samples on L2 cache misses provides you with a direct measure of application cache efficiency, allowing targeted optimizations to improve data efficiency.

Download Arm Mobile Studio for Android profiling

Download Arm Development Studio for Linux profiling
