Arm Community
Tools, Software and IDEs blog
Streamline and JITed code

Robert
January 5, 2017
5 minute read time.

The Streamline Performance Analyser can already make use of the Linux perf sub-system, but so far this has only been under the hood, via the gator daemon (which can collect other data too).

As of DS-5.26 it is now possible to load a perf.data capture directly, and the DSG APD team have been looking at one of the immediate benefits: it allows Streamline's whole-system view to profile code generated by a VM such as Java or ECMAScript (JavaScript).

A more comprehensive view of your system is always a good thing and can even lead to an improved understanding of how things actually work. But hang on: what is different about these languages, and why do they need additional support?

Overview of statistical profiling tools

Statistical profiling tools like Streamline (or perf) work by requesting to be called periodically from a timer interrupt (in our case typically at 1kHz) and then inspecting, or 'sampling', the current state of the machine. The questions answered range from the fairly basic, such as 'Which CPUs are active?', 'What application or service are they running at that moment?' and 'How much memory is free?', to the more exotic, such as 'How many CPU branch mispredicts have occurred since the last sample?' or 'How much energy has the GPU consumed since the last sample?'. The most fundamental of all, though, is 'What is the current value of the Program Counter (PC)?'.

PC is the CPU register that determines what instruction is currently being executed (more or less, anyway), so if the CPU spends a large proportion of its time executing a particular piece of code, such as a key loop or function, then statistically the profiling tool will find itself collecting lots of samples with a value of PC in that loop or function. Streamline presents this information alongside your code so you can see which part is consuming the most time.
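To make this concrete, here is a minimal sketch (in Python, with made-up addresses and function names) of what a profiler does with its PC samples: each sample is attributed to the function whose address range contains it, and the resulting histogram shows where the time went.

```python
import bisect
from collections import Counter

# Illustrative symbol table: (start_address, size, name), sorted by start.
symbols = [
    (0x1000, 0x200, "main"),
    (0x1200, 0x400, "hot_loop"),
    (0x1600, 0x100, "helper"),
]
starts = [s[0] for s in symbols]

def attribute(pc):
    """Map a sampled PC to the function whose address range contains it."""
    i = bisect.bisect_right(starts, pc) - 1
    if i >= 0:
        start, size, name = symbols[i]
        if start <= pc < start + size:
            return name
    return "[unknown]"

# Simulated PC samples: most land inside hot_loop, as a profiler would see
# for a program that spends most of its time there.
pcs = [0x1250, 0x1300, 0x1310, 0x1010, 0x1650, 0x1404, 0x2000]
histogram = Counter(attribute(pc) for pc in pcs)
print(histogram.most_common())  # hot_loop dominates
```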

 

This is straightforward for languages like C or C++ which are almost always compiled directly to the machine code of the target device, beforehand, on the developer's computer. In this case we know the name of the program executing and so the set of PC samples can be cross-referenced with the original binary and debug information to determine what source code is 'hot'.

Languages like Java and JavaScript are not typically converted directly to machine code; instead a Just-In-Time compiler (JIT) is often used. The JIT compiler generates machine code on demand, on the target device, as the program runs. Streamline can still capture PC samples, but they will fall within some anonymous buffer of generated code and so do not correspond to anything useful that we can report.

To make sense of these samples we must ask the JIT compiler to produce some meta-data to fill in the gap, something like "At time t=1287.5 I generated the following size=972 bytes of code <....> at address 0x12345678 for method MyExample::get_example(int i)". With this information a profiler can discover if a PC sample is actually part of some generated code.
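As an illustration, here is a toy resolver (in Python, using the values from the example record above). The real format described in the LKML posting is a binary jitdump stream, so this is only a sketch of the idea, not the actual wire format.

```python
# A JIT event like the one described above, reduced to the fields a
# profiler needs in order to attribute a PC sample to a method.
jit_events = [
    # (timestamp, code_address, code_size, method_name)
    (1287.5, 0x12345678, 972, "MyExample::get_example(int i)"),
]

def resolve(pc, timestamp):
    """Find the JIT-compiled method, if any, that a PC sample falls in.

    Only events emitted before the sample was taken are considered,
    since the JIT may later reuse the same buffer for other code.
    """
    for t, addr, size, name in jit_events:
        if t <= timestamp and addr <= pc < addr + size:
            return name
    return None

# A sample taken at t=1300.0 inside the generated code resolves to the method.
print(resolve(0x12345700, timestamp=1300.0))
```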

With the new perf import feature in Streamline, and some work in the perf tool itself, this is now feasible; this LKML posting is particularly relevant.

What does this look like with perf and Streamline?

To try this out you will need a sufficiently recent Linux kernel and a VM or JIT that can generate suitable meta-data in the format mentioned in the LKML posting.

We are using Linux 4.7 on a Juno development board and the V8 JavaScript engine, as part of node.js.

# First take a performance profile, 'perf record',
# using a clock common to kernel and user space, '-k mono',
# of the node binary,
# which has a '--perf-prof' switch that enables the generation of suitable meta-data
perf record -k mono ./node --perf-prof ./benchmark/dgram/array-vs-concat.js

# perf record writes 'perf.data' containing the profiling data
# node writes jit-<pid>.dump

# Then merge, 'perf inject', the meta-data into the original profiling data '-i' and
# write out a new version '-o' with support for jitted code enabled '--jit'
perf inject --jit -i perf.data -o perf.data.withjit

# perf merges the data and writes perf.data.withjit
# perf inject also generates ELF files of the form jitted-<pid>-<n>.so to
# contain the code for each method compiled by the JIT

# Gather up all the required files...
tar czf streamline-data.tgz \
    ./perf.data.withjit \
    ./jitted-<pid>-*.so \
    ./node

# ...and copy to your host PC ready to import into Streamline, e.g.
scp streamline-data.tgz user@host:~/

In Streamline, click on 'Import capture files...' and import 'perf.data.withjit'. See sections 2.1 and 2.2 of the User Guide:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0482w/ric1359497467549.html

Wait for the import process to finish, then use the Analyse button to run the analysis phase. The generated .so files can be added later, after inspecting the Functions tab to determine which ones are contributing most. See section 2.8 of the User Guide.

Before and after...

Without this extra support you get a view something like this...

 

...in which a large proportion of the samples that Streamline has captured cannot be assigned to anything more specific than the 'node' process (as highlighted in the screen capture above).

After running perf inject and adding some of the jitted-<pid>-<n>.so files to the analysis we can get a more detailed breakdown...

...and even code corresponding to the JIT'ed methods...

This involves a few more steps than a regular Streamline capture but the additional information is well worth it.

Required tools

DS-5 >= v5.26

Linux kernel (+ perf user space tools) >= v4.6

(alternatively Linux kernel >= v4.1 with perf user space tools >= v4.6 might work too)

Tested VMs / JITs

V8 JavaScript engine built from commit 8a086d8cc3ae15335e49c68f4f9d2ca2ac365b92 (anything from August 2016 onwards is probably okay)

node.js v7.4.0

OpenJDK v1.8.0_102 (this requires an additional plug-in to generate the meta-data and will be the subject of our next blog)

Next steps

The Streamline team would love to hear how this support is used. Did you manage to find a new and interesting performance issue? Did you discover something unexpected about your system? Are there other languages/VMs/JITs that provide the required meta-data? Are you using a VM or JIT that emits similar meta-data but in a different format, and can it be converted? Please drop us a note below or connect directly.

DSG APD are also looking at Python, which is typically an interpreted language. This means it never generates any machine code; the interpreter just looks at each line of Python code in turn and decides what needs doing. When the profiling timer interrupt goes off, we will generally find ourselves somewhere within the same Python interpreter loop, so the value of PC does not give us any useful information about what we are executing. Instead we are investigating whether we can either trawl around inside the Python interpreter to find out what line of code it is currently working on, or have it produce some meta-data like a JIT would. Watch this space...
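That first approach can be sketched in a few lines of Python itself: a POSIX interval timer delivers a signal at roughly 1kHz (the same rate mentioned above), and the handler records which Python function the interpreter was working on when the timer fired. This is only an illustrative toy (Unix-only, main-thread-only), not how Streamline or perf would actually do it.

```python
import signal
from collections import Counter

samples = Counter()

def on_sample(signum, frame):
    # 'frame' is the Python frame that was executing when the timer fired:
    # record which function the interpreter was working on at that moment.
    samples[frame.f_code.co_name] += 1

def busy(n):
    # A deliberately CPU-bound loop for the sampler to catch in the act.
    total = 0
    for i in range(n):
        total += i * i
    return total

signal.signal(signal.SIGPROF, on_sample)
signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)  # sample at ~1kHz of CPU time
busy(5_000_000)
signal.setitimer(signal.ITIMER_PROF, 0.0, 0.0)      # stop sampling

# The hottest function should be 'busy', where almost all CPU time was spent.
print(samples.most_common(3))
```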
