This content was initially posted 5 August 2013 on blogs.arm.com
In this blog I would like to share how the ARM® DS-5™ Streamline performance analyzer helped me analyze where CPU cycles were being spent in some parallel benchmark code.
I have been working on a research project called Mont-Blanc that aims to build an HPC (High Performance Computing) prototype system using off-the-shelf ARM (mobile) technology. I'm in charge of performance analysis and optimization of the MPI communication library. MPI is the de facto standard for distributed-memory message-passing applications, i.e. for large parallel applications that run on a cluster or a supercomputer and communicate over an external network interconnect.
A short while ago I encountered the problem of figuring out where the CPU cycles were being spent in a micro-benchmark that measured the end-to-end latency of the MPI communication library.
The particular benchmark was a simple ping-pong test. It consists of two processes, which we call ping and pong. The ping process sends a message to the pong process by calling MPI_Send and then waits for a reply by calling MPI_Recv. The pong process calls the same MPI functions but in reverse order, i.e. it waits for a message from ping and, when it receives one, immediately sends back an acknowledgement. The ping process measures the time it takes to get the acknowledgement from pong. These steps are repeated in a tight loop, typically a few hundred times, to compute the average round-trip latency.
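For illustration, the core of such a ping-pong test could look roughly like the sketch below. This is not the actual benchmark source; the iteration count, message size and timing details (NUM_ITERS, MSG_SIZE) are made-up placeholders.

    /* Minimal ping-pong sketch (illustrative only; the real benchmark
     * may differ). Rank 0 acts as "ping", rank 1 as "pong". */
    #include <mpi.h>
    #include <stdio.h>

    #define NUM_ITERS 1000   /* hypothetical iteration count */
    #define MSG_SIZE  1      /* hypothetical message size in bytes */

    int main(int argc, char **argv)
    {
        char buf[MSG_SIZE] = {0};
        int rank, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (i = 0; i < NUM_ITERS; i++) {
            if (rank == 0) {            /* ping: send, then wait for reply */
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {     /* pong: wait, then acknowledge */
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("average round-trip latency: %f us\n",
                   (t1 - t0) * 1e6 / NUM_ITERS);

        MPI_Finalize();
        return 0;
    }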
I had two SECO boards (see Figure 1) to run the benchmark. Each SECO board sported an NVIDIA Tegra 2 chip with dual ARM Cortex™-A9 MP cores. They had integrated Gigabit Ethernet ports, which I connected directly with a single network cable. The boards also had a second, 100Mb Ethernet port, which was connected to the local network.
Figure 1 The two SECO boards in the test rack
Although the benchmark itself is simple, a lot goes on under the hood within the MPI library and in the Linux kernel at runtime to send and receive the messages across the Ethernet connection. MPI relies on the common TCP/IP protocol on these boards. Each MPI send/receive call actually ends up in the Linux kernel via system calls, and the kernel networking subsystem then does all the necessary pre- and post-processing to convert the application data into Ethernet frames and vice versa.
The latency numbers I saw when I ran the test on the SECO boards were rather high compared to a reference system connected over Gigabit Ethernet. My task was to devise optimizations in the messaging software stack to reduce this high latency. To accomplish that, I had to figure out where most of the cycles were being spent, i.e. which component of the stack was responsible for the majority of the overhead.
I suspected the culprit was the TCP/IP processing within the kernel, but I needed evidence to prove it. I had to measure the kernel-imposed CPU utilization during the tight send/receive loop in the ping process. DS-5 Streamline made that measurement fairly easy and straightforward.
I launched the gator daemon on the first SECO board and the DS-5 GUI on my desktop. I hit the 'capture options' button in the Streamline data window to open the 'Capture & Analysis Options' dialog (see Figure 2). In that dialog, I typed the connection address (seco001:2001, where 2001 is the port number my gator daemon was listening on). I also selected the 'High Resolution Timeline' option to get detailed results for my short benchmark run.
Figure 2 Capture dialog
Next, I annotated the inner loop in the benchmark source code to mark it as a special region of interest in the Streamline views. Streamline provides an instrumentation API for marking specific regions in the code. Annotations manifest as visible markers in the timeline view, which helps to locate and analyze various runtime phases of the application.
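As a rough sketch, bracketing the inner loop with marker annotations could look like the snippet below. This assumes the streamline_annotate.h header and the ANNOTATE_SETUP / ANNOTATE_MARKER_STR macros shipped with DS-5; the exact header and macro names vary between DS-5 versions, so treat this as an illustration rather than the code I actually used.

    /* Sketch of Streamline marker annotations around the benchmark's
     * inner loop (assumed DS-5 annotation macros; names may differ by
     * version, and some versions also require an ANNOTATE_DEFINE in
     * one source file). */
    #include "streamline_annotate.h"

    void run_pingpong_loop(void)   /* hypothetical wrapper function */
    {
        ANNOTATE_SETUP;                           /* connect to the gator daemon */
        ANNOTATE_MARKER_STR("inner loop start");  /* shows up as a timeline marker */

        /* ... the tight MPI_Send/MPI_Recv loop goes here ... */

        ANNOTATE_MARKER_STR("inner loop end");
    }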
That was all the setup my experiment needed. I hit the 'start capture' button and kicked off my benchmark on the SECO boards (using a separate ssh session). After the benchmark finished, I hit 'stop capture'. DS-5 transferred and processed the trace file in a few seconds, and the default timeline view appeared on my screen as shown in Figure 3.
Figure 3 Streamline timeline view
Next, I restricted the timeline view to the inner loop of my benchmark as follows. I selected 'pingpong' in the list of processes at the bottom of the window to filter out the activity of any other processes. Then I changed the timeline resolution to 5ms to get a more detailed view of the execution. I positioned the calipers at the yellow markers on the timeline to filter out any activity that happened outside my region of interest; the markers appeared there as a result of my annotations in the source code, i.e. they showed when the inner loop of the benchmark started and completed. Note that filtering by calipers affects all Streamline views. I also 'opened up' the CPU Activity bar to show a separate view for each CPU core. As Figure 4 depicts, it was obvious that pingpong used 100% of one of the cores while the inner loop was running, whilst the other core was practically idle.
Figure 4 Timeline view focused on our region of interest
To figure out what was going on within the pingpong process, I switched to the 'Call Paths' view (see Figure 5). Here I could see immediately that pingpong was executing in the kernel more than 65% of the time – but where in the kernel, exactly?
Figure 5 Call Paths view
The next step was to load the kernel symbols from the vmlinux file into Streamline. I rebuilt the kernel for debug (CONFIG_DEBUG_INFO=y) to add debug symbols to the vmlinux file. Then, using the 'Capture & Analysis Options' dialog, I clicked 'Add ELF image', located the vmlinux file for the kernel, and re-analyzed the data (see Figure 6).
This showed that many kernel functions were involved, but the most heavily used one was 'getnstimeofday' (~21% of the samples), called from the TCP/IP code (sliding-window flow control). This confirmed my original hypothesis that the time was mainly being spent processing the TCP/IP network stack.
Based on these findings, I replaced the "heavyweight" default TCP/IP-based transport-layer implementation in the Open MPI library with a "lightweight" direct Ethernet transport-layer implementation (Open-MX). This resulted in a nearly 40% decrease in the end-to-end latency!
To summarize, DS-5 Streamline makes it easy to inspect the performance of your application, even if you are only interested in specific phases at runtime (you can annotate your source code to mark various regions), and even if you want to analyze both the user-space and OS activity triggered by your code. In this particular project, I was able to locate a bottleneck in a benchmark and then significantly improve its performance by simply replacing one block of library code – a very pleasing result!
Gabor Dozsa, Staff Engineer, ARM. He is a member of ARM’s R&D Division that is looking at future ARM technologies and applications. He’s recently been working on performance analysis and optimization of library code in High Performance Computing systems.