Software Optimization: Four real-life Streamline use cases (Part 1)

Here at ARM's System Design Division, almost every time we introduce someone to the ARM® DS-5 Streamline performance analyzer we end up being asked, 'So how much faster can my system run?' Normally we respond with a smile and explain that there is no one-size-fits-all answer to this question. However, after three very successful years on the market, customer interaction and ARM's internal usage (yes, we eat our own dog food!) have shown that users who incorporate the tool into their design flow can rapidly achieve a significant performance uplift. As a rule of thumb, a performance improvement of between twenty and thirty percent seems to be the norm after the first analysis-optimization cycle, as this stage tends to unveil major performance bugs.

So to help you understand how our system analysis tool is being used in real life, my colleague jorensan and I decided to compile this series of blogs discussing a few of its most common use cases. Hopefully one or more of them will help you create more responsive and robust ARM-based products in shorter timeframes. These are the cases to come:

A bit of history first: why did ARM develop Streamline in the first place?

ARM's business depends on getting its architecture deployed in successful products. To that end we do all we can to help engineers create robust, accurate software while keeping project costs down and time-to-market at its shortest. Streamline, as a system analysis tool, helps by identifying sub-optimal resource utilization and unintended software behaviour (we call these performance bugs) that put the project schedule at risk. To illustrate our point, the chart below shows three possible scenarios for the usage of analysis tools and their impact on resource overshoot (e.g. additional CPU cycles or memory usage) and project schedule.

But why not stick to existing processor trace-based analysis tools?

Processor trace, as in ETM, PTM and ITM trace, was and still is very useful. In the MCU and real-time markets it is still the best source of data for analysis, thanks to its high level of detail and non-intrusive nature. But in other market applications things have moved on. Before the rise of smartphones, ARM processors were mainly found embedded in application-specific devices. These processors had simple memory systems and ran relatively small software stacks based on real-time operating systems. Debugging was much simpler, and software instrumentation and instruction trace were the standard methods of performance analysis. Rare were the SoCs streaming ETM traces at 100 Mbps or more.

The complex SoCs you find in smartphones, tablets, infotainment systems and some embedded systems today require different software analysis techniques though. Often their software stacks are as sophisticated as those of desktops, and their combined processor and system trace streams can take up over 10 Gbps of bandwidth. ARM's response was DS-5 Streamline, which brings together system trace and user instrumentation with performance counters and statistical profiling in a cost-efficient, probe-less manner.

Finally, first use case: Timeline analysis

While benchmarking and profiling are quite common analysis techniques, the most revolutionary benefit of Streamline is its ability to give a view of the whole system with its Timeline View. This brings a wealth of information to savvy software developers, who can use it in a variety of ways, from correcting silly mistakes in the code to gaining a deeper understanding of how the software relates to the hardware that it runs on.

Let's start with basic information you can extract from the Timeline View. By default, the process tab's heat map shows CPU load information, that is, which processes and threads are making use of processor time. This is interleaved with events from your application and power consumption information.

This is incredibly powerful because of its simplicity. It enables you to see obvious problems in the code, which can have a massive effect on the overall performance of the system. Some real-life examples include:

  • I thought I had closed that application, but it is still running
  • I thought I had turned off that peripheral, but it is still drawing power
  • I thought that these two applications could run in parallel on my multi-core target, but they don't
  • My small application is clogging the system because its library calls end up making a massive number of system calls
  • My frame buffer is sometimes refreshing at 60fps and sometimes at 30fps
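
To make the idea of a CPU load heat map more concrete, here is a minimal sketch of how one could be derived from periodic samples of "which process was running". It is only an illustration and not how Streamline itself is implemented: the function name `cpu_load_heatmap`, the sample format and the single-core assumption are all hypothetical.

```python
from collections import Counter, defaultdict


def cpu_load_heatmap(samples, bin_ms, sample_period_ms):
    """Turn periodic samples into per-process load per time bin.

    samples: iterable of (timestamp_ms, process_name) pairs, one per sampler
             tick, naming the process that was running on a single core
             (hypothetical format, chosen for illustration).
    Returns {process_name: {bin_index: load fraction between 0 and 1}}.
    """
    samples_per_bin = bin_ms / sample_period_ms
    counts = defaultdict(Counter)
    for timestamp_ms, process in samples:
        # Count how many sampler ticks caught this process in each bin.
        counts[process][int(timestamp_ms // bin_ms)] += 1
    return {
        process: {b: min(n / samples_per_bin, 1.0) for b, n in bins.items()}
        for process, bins in counts.items()
    }


# Made-up samples: one tick per millisecond, grouped into 4 ms bins.
samples = [(0, "xaos"), (1, "xaos"), (2, "Xorg"), (3, "xaos"),
           (4, "xaos"), (5, "xaos"), (6, "xaos"), (7, "xaos")]
heatmap = cpu_load_heatmap(samples, bin_ms=4, sample_period_ms=1)
for process, bins in heatmap.items():
    print(process, bins)
```

Each bin would map to one coloured cell of the heat map: the more ticks a process collects in a bin, the "hotter" the cell. A process you thought you had closed would show up as a row that never goes cold.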

The process tab can be configured to show other information, such as how much GPU time each thread is taking or how long it is waiting for CPU time, by clicking on the rectangle next to the GPU Activity and CPU Wait Charts. For example, in the Xaos example below you can see that the Xorg process spends a lot of time waiting for Xaos to finish executing and freeing up the processor, and also that the Xaos process has some of its threads waiting for others to free up processor time.

CPU Wait information prompts the developer to adjust the relative priorities of threads so that they all run in an acceptable amount of time. Meanwhile, GPU Activity information can be used to understand why you are not getting the throughput you expected from the GPU, as multiple processes and threads may be competing for GPU resources if they have not been appropriately scheduled.

One of the power features of Streamline is its X-Ray mode, which enables you to see thread to CPU or cluster allocation over time, as well as performance counters per core and per cluster. This is extremely useful in order to understand how well your application is running on multi-core systems, and explore ideas for parallelizing your code.

This functionality has been used heavily inside ARM for big.LITTLE task switching development in order to validate the activity of the big and LITTLE clusters and whether tasks are allocated in the most efficient manner.

At application level, this functionality informs high level decisions such as whether to handle a piece of processing in the current thread or a new one, how to configure the dynamic voltage and frequency (DVFS) settings for maximum battery life, and so on. For example, depending on the number of available cores and their relative level of utilization it may be more effective to reduce the number of threads and power one core down, or to increase the number of threads, distribute them evenly across the cores and lower the voltage supply and clock frequency of all the cores. The visualization and power benchmarking capabilities of Streamline work together incredibly well in order to inform these types of decisions.
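
A back-of-the-envelope model shows why neither option wins in every case. The sketch below uses the standard approximation that dynamic power scales with C·V²·f, plus a fixed leakage cost for every core that stays powered. All the numbers, the function name `total_power_w` and the assumption that the work splits perfectly across two cores are illustrative only. They are not measurements from Streamline or from any real device.

```python
def total_power_w(active_cores, voltage_v, freq_hz, capacitance_f, leakage_w_per_core):
    """Rough power estimate: per-core dynamic power (C * V^2 * f) plus a
    constant leakage term for each powered core. A deliberately simplified
    model: real leakage also varies with voltage and temperature."""
    dynamic_w = capacitance_f * voltage_v ** 2 * freq_hz
    return active_cores * (dynamic_w + leakage_w_per_core)


def compare(leakage_w_per_core):
    # Option A: one core at full voltage and frequency, the other powered down.
    one_core = total_power_w(1, 1.0, 1.0e9, 1.0e-9, leakage_w_per_core)
    # Option B: two cores at half the frequency and a lower voltage, assuming
    # the workload parallelizes perfectly so throughput stays the same.
    two_cores = total_power_w(2, 0.8, 0.5e9, 1.0e-9, leakage_w_per_core)
    return one_core, two_cores


for leakage in (0.1, 0.5):
    one_core, two_cores = compare(leakage)
    winner = "two slower cores" if two_cores < one_core else "one fast core"
    print(f"leakage {leakage} W/core: one core {one_core:.2f} W, "
          f"two cores {two_cores:.2f} W -> {winner}")
```

With low leakage the two slower cores come out ahead, while with high leakage it becomes cheaper to power one core down. On real hardware the crossover point has to be measured, which is where combining the Timeline with power data helps.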

The Timeline in Streamline is unique with regard to the amount of information it integrates and the fact that it makes problems obvious to the developer. Its ease of use and the fact that "it just works" make it ideal for pre-empting bad code, because you can immediately measure and visualize the effect of changes in your code, instead of waiting for expert performance analysis teams to run their tests and feed their reports back to you over a long feedback loop.

That's it. Pick up a 30-day evaluation of DS-5 and start dieting your system...

Next use case up: Smart software profiling. Stay tuned!