Performance and power optimization are critical considerations for new Linux and Android™ products. This blog explores the most widely used performance and power profiling methodologies and their application to different stages of product design.
In the highly competitive market for smartphones, tablets and mobile Internet devices, the success of new products depends strongly on high performance, responsive software and long battery life.
In the PC era it was acceptable to achieve high performance by clocking the hardware at faster frequencies. However, this does not work in a world in which users expect to always stay connected. The only way to deliver high performance while keeping a long battery life is to make the product more efficient.
On the hardware side, the need for efficiency has pushed the use of smaller silicon geometries and SoC integration. On the software side, performance analysis needs to become an integral part of the design flow.
Most Linux-capable ARM® processor-based chipsets include either a CoreSight™ Embedded Trace Macrocell (ETM) or a Program Trace Macrocell (PTM).
The ETM and PTM generate a compressed trace of every instruction executed by the processor, which is stored in an on-chip Embedded Trace Buffer (ETB) or captured by an external trace port analyzer. Software debuggers can import this trace to reconstruct the list of executed instructions and create a profiling report. For example, the DS-5 Development Studio Debugger can collect 4GB of instruction trace via the ARM DSTREAM target connection unit and display a time-based function heat map.
Figure 1: Instruction trace generation, collection and display
Instruction trace is potentially very useful for performance analysis, as it is 100% non-intrusive and provides information at the finest possible granularity. For instance, with instruction trace you can measure accurately the time lag between two instructions. Unfortunately, trace has some practical limitations.
The first limitation is commercial. The number of processors on a single SoC is growing and they are clocked at increasingly high frequencies, which results in higher bandwidth requirements on the CoreSight trace system and wider, more expensive, off-chip trace ports. The only sustainable solution for systems running at full speed is to trace to an internal buffer, which limits the capture to less than 1ms. This is not enough to generate profiling data for a full software task such as a phone call.
The second limitation is practical. Linux and Android are complex multi-layered systems, and it is difficult to find events of interest in an instruction trace stream. Trace search utilities help in this area, but navigating 4GB of compressed data is still very time-consuming.
The third limitation is technical. The debugger needs to know which application is running on the target and at which address it is loaded in order to decompress the trace stream. Today’s devices do not have the infrastructure to synchronize the trace stream with kernel context-switch information, which means that it is not possible to capture and decompress non-intrusively a full trace stream through context switches.
For performance analysis over long periods of time sample-based analysis offers a very good compromise of low intrusiveness, low price and accuracy. A popular Linux sample-based profiling tool is perf.
Sample-based tools make use of a timer interrupt to stop the processor at regular intervals and capture the current value of the program counter in order to generate profiling reports. For example, perf can use this information to display the processor time spent on each process, thread, function or line of source code. This enables developers to easily spot hot areas of code.
At a slightly higher level of intrusiveness, sample-based profilers can also unwind the call stack at every sample to generate a call-path report. This report shows how much time the processor has spent on each call path, enabling different optimizations such as manual function inlining.
Sample-based profilers do not require a JTAG debug probe or a trace port analyzer, and are therefore much lower cost than instruction trace-based profilers. On the downside they cause a target slow-down of between 5 and 10% depending on how much information is captured on every sample.
It is important to note that sample-based profilers do not deliver “perfect data” but “statistically relevant data”, as the profiler works on samples instead of on every single instruction. Because of this, profiling data for hot functions is very accurate, while profiling data for the rest of the code is much less so. This is not normally an issue, as developers are mostly interested in the hot code.
A final limitation of sample-based profilers is related to the analysis of short, critical sequences of code. The profiler will tell you how much processor time is spent on that code. However, only instruction trace can provide the detail on the sequence in which instructions are executed and how much time each instruction requires.
Logging or annotation is a traditional way to analyze the performance of a system. In its simplest form, logging relies on the developer adding print statements in different places in the code, each with a timestamp. The resulting log file shows how long each piece of code took to execute.
This methodology is simple and cheap. Its major drawback is that in order to measure a different part of the code you need to instrument and rebuild it. Depending on the size of the application, this can be very time-consuming. For example, many companies only rebuild their software stacks overnight.
The Linux kernel provides the infrastructure for a more advanced form of logging called “tracing”. Tracing is used to automatically record a large number of system-level events such as IRQs, system calls, scheduling and even application-specific events. Lately, the kernel has been extended to also provide access to the processor’s performance counters, which contain hardware-related information such as cache usage or the number of instructions executed by the processor.
Kernel trace enables you to analyze performance in two ways. First, you can use it to check whether some events are happening more often than expected; for example, it can detect that an application is making the same system call several times when only one is required. Second, it can be used to measure the latency between two events and compare it with your expectations or previous runs.
Since kernel trace is implemented in a fairly non-intrusive way, it is very widely used by the Linux community through tools such as perf, ftrace and LTTng. A new Linux development will enable events to be “printed” to a CoreSight Instrumentation Trace Macrocell (ITM) or System Trace Macrocell (STM) in order to reduce intrusiveness further and provide better synchronization of events with instruction trace.
Open source tools such as perf and commercial tools such as the ARM DS-5 Streamline performance analyzer combine the functionality of a sample-based profiler with kernel trace data and processor performance counters, providing high-level visibility of how applications make use of the kernel and system-level resources.
For example, Streamline can display processor and kernel counters over time, synchronized to threads, processes and the samples collected, all in a single timeline view. This information can be used to quickly spot which application is thrashing the cache memories or creating a burst in network usage.
Figure 2: Streamline Timeline View
Instrumentation completes the picture of performance analysis methodologies. Instrumented software can log the entry and exit of every function (or potentially every instruction) to generate profiling or code coverage reports. This is achieved by instrumenting, or automatically modifying, the software itself.
The advantage of instrumentation over sample-based profiling is that it gives information about every function call instead of only a sample of them. Its disadvantage is that it is very intrusive, and may cause substantial slow-down.
All of the techniques described so far can be applied at every stage of a typical software design cycle. However, some are more appropriate than others at each stage.
Table 1: Comparison of methodologies, rating Logging, Kernel trace, Instruction trace, Sample-based profiling and Instrumentation against five criteria: Low Cost, Low Intrusiveness, Accuracy, Granularity and System Visibility
Instruction trace is mostly useful for kernel and driver development, but has limited use for Linux application and Android native development, and virtually no use for Android Java application development.
Performance improvements in kernel space are often in time-critical code handling the interaction between the kernel, threads and peripherals. Improving this code requires the high accuracy, fine granularity and low intrusiveness of instruction trace.
In addition, kernel developers have enough control over the whole system to act on the findings. For example, they can slow down the processors to transmit trace over a narrow trace port, or hand-craft the complete software stack for a fast peripheral. As you move into application space, however, developers do not need the accuracy and granularity of instruction trace, as the performance gained by software tweaks can easily be lost to random kernel and driver behaviour totally outside of their control.
In the application space, engineering efficiency and system visibility are much more useful than perfect profiling information. Developers need to quickly find which parts of the code to optimize and to measure accurately the time between events, but can accept a 5% slow-down in the code.
System visibility is extremely important in both kernel and application space, as it enables developers to quickly find and kill the elephant in the room. Example system-related performance issues include misuse of cache memories, processors and peripherals not being turned off, inefficient access to the file system or deadlocks between threads or applications. Solving a system-related issue has the potential to increase the total performance of the system ten times more than spending days or weeks writing optimal code for an application in isolation. Because of this, analysis tools combining sample-based profiling and kernel trace will continue to dominate Linux performance analysis, especially at application level.
Instrumentation-based profiling is the weakest performance analysis technique because of its high level of intrusiveness. Optimizing Android Java applications often stands a better chance of success with manual logging than with open-source instrumentation-based tools.
Most Android applications are developed at Java level in order to achieve platform portability. Unfortunately, the performance of the Java code has a random component, as it is affected by the JIT compiler. This makes both performance analysis and optimization difficult.
In any case, the only way to guarantee that an Android application will be fast and power-efficient is to write it - or at least parts of it - in native C/C++ code. Research shows that native applications run between 5 and 20 times faster than equivalent Java applications. In fact, most popular Android apps for gaming, video or audio are written in C/C++.
For Android native development on ARM processor-based systems Android provides the Native Development Kit (NDK). ARM offers DS-5 as its professional software tool-chain for both Linux and Android native development.
By jorensan, Director of Product Management - Tools at ARM