Want to get more out of your hardware? Think system.

This content was initially posted 3 June 2013 on blogs.arm.com


You may have seen the announcement of our new set of IP comprising the ARM® Cortex®-A12 processor, Mali™-T622 GPU and Mali-V500 video solution. Combined, these IP blocks are the future of mainstream mobile platforms and can deliver the performance and energy efficiency that most consumers will expect in the near future. But notice that I said can deliver. The dependency, as a good software engineer will have spotted, is on the software architecture and how well its implementation is optimized. That’s where the ARM Development Studio 5 (DS-5) toolchain can help.

The ARM Streamline performance analyzer in DS-5 combines software profiling and kernel tracing with visibility into system-wide hardware and software events. In today’s multicore designs, this enables you to quickly identify performance bottlenecks as they move dynamically across system components (e.g. from the CPU to the GPU through caches and interconnect) in response to the software load. This optimization approach has the potential to achieve greater performance and power efficiency than CPU-centric analysis such as instruction trace-based profiling, because software can be optimized to maximize the utilization of other system components that would otherwise have capped overall performance.


GPUs, or graphics processing units, used to be something you would only come across in gaming or other specialist graphics platforms. Things have changed dramatically. GPUs can now also be found in industrial, medical and automotive applications, to mention just a few beyond the obvious mobile and home entertainment devices. So even if you do not see an immediate need for them, be prepared to see GPUs designed into products closer to your heart over the next few years.

The Streamline screenshot below was captured on an SoC combining a dual-core Cortex-A9 processor with a quad-core Mali-400 GPU, as Cortex-A12 and Mali-T622 are not yet available in silicon. From top to bottom, the three charts at the top show the utilization levels of the CPU (aggregate), the GPU vertex unit and the GPU fragment processing units (aggregate). The nice-looking filmstrip between the charts and the processes heatmap comprises sampled captures of the frame buffer, which help the user visually correlate the screen content with the performance metrics. In this example capture, I was running a demo game on Android that allowed me to gradually increase the complexity of the game visuals. This happened from the start of the capture until I reached the most complex configuration (at around 23 seconds in, if you pay attention to the ruler at the top), at which point the game switched back to wireframes.

Figure: CPU and GPU combined analysis

A lot of useful information can be taken from this analysis, but I would like to concentrate on our moving target: the performance bottleneck. First, notice the small blue dots alongside the filmstrip. Those are a proxy for my game’s frame rate. So, as expected, we get fewer frames per second as we increase the complexity of textures, shadows and so on. But looking at the screenshot once more, you can see the exact points at which the fragment processing units become saturated (at ~10s) and the vertex processing unit gets maxed out (at ~28s). At no time is the CPU the performance bottleneck in the system. This type of information should help you optimize your system to achieve your target (e.g. maximum frame rate, best balance between scene complexity and frame rate, etc.). And even more so when we talk about GPU compute in future blogs…
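
To make the idea of a "moving bottleneck" concrete, here is a minimal Python sketch of the reasoning applied to the charts above: given utilization samples per processing unit, find the first moment each unit stays saturated. This is not Streamline's API or export format. The function names, the 95% threshold, the three-sample hold and the sample values are all assumptions made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# One sample is (time in seconds, utilization in percent).
Sample = Tuple[float, float]


def first_saturation(
    samples: List[Sample], threshold: float = 95.0, hold: int = 3
) -> Optional[float]:
    """Return the start time of the first run of `hold` consecutive
    samples at or above `threshold`, or None if there is no such run."""
    run_start: Optional[float] = None
    run_len = 0
    for t, util in samples:
        if util >= threshold:
            if run_len == 0:
                run_start = t  # remember where this run began
            run_len += 1
            if run_len >= hold:
                return run_start
        else:
            run_len = 0  # a dip below the threshold resets the run
    return None


def saturation_report(
    series: Dict[str, List[Sample]], threshold: float = 95.0, hold: int = 3
) -> Dict[str, Optional[float]]:
    """Apply first_saturation to every named unit."""
    return {
        name: first_saturation(samples, threshold, hold)
        for name, samples in series.items()
    }


# Hypothetical samples, invented for this example (not taken from the capture).
example = {
    "cpu": [(0.0, 35.0), (1.0, 42.0), (2.0, 48.0), (3.0, 51.0), (4.0, 55.0)],
    "gpu_fragment": [(0.0, 60.0), (1.0, 97.0), (2.0, 98.0), (3.0, 99.0), (4.0, 99.0)],
    "gpu_vertex": [(0.0, 20.0), (1.0, 30.0), (2.0, 96.0), (3.0, 70.0), (4.0, 97.0)],
}
report = saturation_report(example)
print(report)
```

Requiring several consecutive samples above the threshold is a deliberate choice: a single spike, like the one in the made-up vertex data, should not be reported as a sustained bottleneck.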

Multiple Cores + System Interconnect

The bandwidth of the interconnect linking system components is rarely something software developers care about. This was OK in the simple single-core systems you found in phones years ago, but with ARM processors now being used in configurations that contain 8, 16, and sometimes even more cores, efficient use of the cache coherent interconnect or network becomes critical. I borrowed the diagram below from arm.com to give you a bird’s eye view of where the CCI-400 IP lives in a system (for those of you who are not familiar with the subject).

Figure: System diagram showing the CoreLink CCI-400 interconnect

Conveniently, the ARM CoreLink™ CCI-400 cache coherent interconnect and CCN-504 cache coherent network have performance monitoring counters that can be used to expose detailed information on the traffic passing through them and help developers spot bottlenecks in the fabric (for example, those that occur when the CPUs, GPUs and display controller access the same memory device simultaneously). Since DS-5 version 5.14, Streamline can take advantage of this data (initially for CCI-400 only) to add another level of visibility into system performance. Below is an example of this, with data captured from an ARM Versatile Express development board running a dual-core Cortex-A15 (cluster 0, red) and tri-core Cortex-A7 (cluster 1, yellow) big.LITTLE processing system.

Figure: Streamline capture showing big.LITTLE and CCI-400
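
For readers wondering how raw interconnect counters turn into a bandwidth figure, the arithmetic can be sketched in a few lines of Python. This is an illustration only and does not show how Streamline or the CCI-400 driver does it. It assumes a hypothetical counter that increments once per data beat on an interface, a 32-bit counter width and 16 bytes per beat; check all three against the documentation for your interconnect before relying on them.

```python
def counter_delta(prev: int, curr: int, width_bits: int = 32) -> int:
    """Difference between two readings of a free-running counter,
    tolerating a single wraparound (width_bits is an assumption)."""
    return (curr - prev) % (1 << width_bits)


def bandwidth_mb_per_s(
    prev: int,
    curr: int,
    interval_s: float,
    bytes_per_beat: int = 16,  # assumed data width, in bytes per beat
    width_bits: int = 32,
) -> float:
    """Estimate traffic in MB/s (1 MB = 10**6 bytes) between two readings
    of a hypothetical 'data beats' counter taken interval_s seconds apart."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    beats = counter_delta(prev, curr, width_bits)
    return beats * bytes_per_beat / interval_s / 1e6


# Made-up readings: 1,000,000 beats in one second on a 16-byte interface.
estimate = bandwidth_mb_per_s(0, 1_000_000, 1.0)
print(f"{estimate:.1f} MB/s")
```

The modulo in counter_delta only recovers the true delta if the counter wraps at most once between readings, so the sampling interval needs to be short relative to the traffic rate.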

Next Step: Optimizing for Energy Efficiency

It’s time to wrap up, but in my next blog I intend to bring yet another dimension to your optimization workflow: energy efficiency. However, if you just can’t wait for that, start by having a look at Streamline’s webpages for the ARM Energy Probe and NI-DAQ interface. See you soon!
