If you like movies, you may spend a lot of time deciding which one to watch next… What if we ask the world for advice?
Let us use one of the many movie review databases available online. In this article, we are going to write a simple C application that parses the data and picks out the best movies of all time.
Instead of developing this application locally, we are using Arm-based AWS instances. We show what tools are available to develop applications for Arm-based servers and demonstrate how easily they can be used in the cloud.
Amazon AWS has recently launched Graviton2 instances: a new generation of custom AWS silicon with 64-bit Arm Neoverse cores designed and optimized for cloud-native workloads. They can be launched from the AWS console:
In comparison to the first generation of Arm AWS Graviton processors, the Graviton2 offers:
In addition, it can achieve a 40% price-performance advantage over fifth-generation x86 instances.
Let us select an m6g.medium instance with Ubuntu 18.04 LTS. M6g instances are designed for general-purpose workloads such as application servers, mid-size data stores, and microservices. When configuring the instance, make sure to open SSH port 22 in the security settings. When the instance is ready, we can connect from our local machine using SSH, our private *.pem key, and the public IP address of the launched instance. These settings can be saved in the .ssh/config file under the session name "aws", which we reuse in the rest of the article, for example:
Host aws
    HostName 52.214.109.204
    User ubuntu
    IdentityFile ~/.ssh/mypkey_aws.pem
To develop our code, we are going to use Arm Allinea Studio, a complete suite of high-performance tools for developing Arm-based server and HPC applications. It includes the Arm Compiler for Linux and the Arm Forge development toolkit. They are designed to get your application running at optimal performance on Armv8-A.
The tools for our Ubuntu 18.04 AWS instance are available from Arm Developer. The Arm Compiler for Linux and the Arm Forge installation packages can be downloaded locally and then copied to the instance with:
user@local $ scp Arm-Compiler-for-Linux_20.1_Ubuntu_16.04_aarch64.tar aws:.
user@local $ scp arm-forge-20.0.3-Ubuntu-18.04-aarch64.tar aws:.
Builds are also available for most Linux server distributions.
Now, let us connect to the remote instance:
user@local $ ssh aws
A few packages are needed before installing the tools:
sudo apt install build-essential libxrandr2 libsm6 libfontconfig1 python
Installation instructions can be found in this guide on Arm Developer.
The source code of the example application can be downloaded with:
git clone https://github.com/ARM-software/Tool-Solutions
The source files are in the following directory:
cd Tool-Solutions/allinea-studio-examples/sortmovies
Our application needs two input files:
Instructions to download the files can be found in README.md. These two files (database.tsv and ratings.tsv) are in TSV format. Each entry is identified with a unique key that our algorithm needs to match between the two files.
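Matching entries between the two files starts with extracting the key from each line. Here is a minimal sketch of that step, assuming the key is the first tab-separated field of every line. The helper name tsv_first_field is hypothetical and is not taken from sortmovies.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the first tab-separated field of `line`
 * (assumed to be the entry's unique key) into `out`.
 * Returns the key length, or -1 if the key does not fit in `out_size`
 * bytes (including the terminating '\0'). */
int tsv_first_field(const char *line, char *out, size_t out_size)
{
    assert(line != NULL && out != NULL);

    /* The key ends at the first tab, line break, or end of string. */
    size_t len = strcspn(line, "\t\r\n");
    if (len + 1 > out_size)
        return -1;

    memcpy(out, line, len);
    out[len] = '\0';
    return (int)len;
}
```

Calling the helper on a line from each file gives two keys that can then be compared to decide whether the entries describe the same movie.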
Time to open our favorite code editor to write some code. The structure of the algorithm is as follows:
The algorithm consists of a single C file that can be compiled with the Arm Compiler for Linux:
armclang sortmovies.c -o sortmovies.exe
The Arm Compiler for Linux is LLVM-based and includes the Arm C/C++ Compiler, the Arm Fortran Compiler, and Arm Performance Libraries. It supports the latest Fortran and C++14 standards to improve the speed of server and HPC workloads on a wide range of Arm-based platforms.
Arm Allinea Studio also comes with the GCC compiler. As an alternative, the code can be compiled with:
gcc sortmovies.c -o sortmovies.exe
We can now run the application:
./sortmovies.exe
The initial version of the application runs on small test cases but crashes with a segmentation fault (segfault) on a larger dataset. Let us investigate the issue with Arm Forge, the integrated tool suite for debugging and profiling server applications.
Arm Forge’s debugger, Arm DDT, is the debugger of choice for developing parallel and threaded C++, C, or Fortran applications. Its powerful, intuitive graphical interface helps you easily detect memory bugs and divergent behavior at all scales, making Arm DDT the number one debugger in research, industry, and academia.
Arm DDT is GUI-based but also comes with a remote client designed to debug remote applications. The remote client for a local machine running Linux, Windows, or macOS can be downloaded from the Arm Developer downloads area.
After the client is installed on our local machine, we launch it with the following command:
user@local $ ddt &
Then, we need to configure the connection to our AWS instance in the “Remote launch” drop-down menu.
We can reuse the SSH session "aws" we have created in the “Host Name” field. Arm Forge reuses settings (login name, private key) automatically. We also need to specify where Forge is installed on the instance in the “Remote Installation Directory” field: by default the tools are installed in /opt/arm/forge.
When the settings are saved, the connection is available from the drop-down window in the remote client’s main menu.
When connected, let us leave the remote client aside, waiting for a debugging job to connect.
Back to the terminal running on our cloud instance, we need to recompile the application for debugging with the Arm Compiler for Linux:
armclang -O0 -g -fsanitize=address sortmovies.c -o sortmovies.exe
or with GCC:
gcc -O0 -g -fsanitize=address sortmovies.c -o sortmovies.exe
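We have added a few options for debugging: -O0 disables compiler optimizations, -g adds debugging information, and -fsanitize=address enables AddressSanitizer (ASAN) to detect memory errors.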
Let us restart the application in the debugger:
ddt --connect ./sortmovies.exe
The “--connect” option is important here: it connects the debugging session to the remote client we launched locally. In the remote client, a window notifies us of the incoming connection.
When the connection is accepted, the “Run” window appears. Arm DDT integrates features of the Arm Compiler for Linux: this window allows us to enable the ASAN plug-in, which helps us get to the root of the segfault:
Clicking on “Run” starts the debugging session. Arm DDT pauses the application at the beginning of the program and displays the main window:
The GUI allows inspection of source files, variables, and the stack. Let’s run the application using the control bar at the top of the window until an error is triggered. The following message appears:
The stack viewer lets us get back to the line of code where the problem occurs.
On line 60, the “type” variable is set to NULL, as shown in the variable view:
We see that the “t_buffer” variable has not been initialized as expected on line 56. Each line of the dataset stored in “t_buffer” should start with a movie ID, and this line looks truncated. To compare, we can check the previous ID variable stored in our records using the “Evaluate” window:
The information enables us to find the problematic input from the dataset. A very long line causes our buffer of 256 characters to overflow. To fix this, let us retrieve the size of the line using getline() instead of using a fixed-size buffer. The code can be edited and recompiled directly from the debugger when the build command is specified.
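The getline()-based approach can be sketched as follows. This is a minimal illustration rather than the actual fix applied to sortmovies.c: the helper name longest_line is hypothetical, and the sketch only measures line lengths to show that getline() grows its buffer as needed instead of truncating long lines.

```c
#define _POSIX_C_SOURCE 200809L  /* request the POSIX.1-2008 API, which includes getline() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: return the length in bytes (newline included)
 * of the longest line in `fp`. Each line is read with getline(), which
 * allocates and resizes the buffer, so no line is ever cut short. */
size_t longest_line(FILE *fp)
{
    assert(fp != NULL);

    char *line = NULL;   /* buffer owned and grown by getline() */
    size_t cap = 0;      /* current capacity of that buffer */
    size_t longest = 0;
    ssize_t len;

    while ((len = getline(&line, &cap, fp)) != -1) {
        if ((size_t)len > longest)
            longest = (size_t)len;
    }

    free(line);          /* a single free() releases the buffer */
    return longest;
}
```

The same loop structure can hold the parsing logic: inside the loop, `line` always contains one complete line of the dataset, whatever its length.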
Now that our code is working, it is time to understand its efficiency. Arm Forge’s profiler, Arm MAP, profiles our code without distorting application behavior. Arm MAP is a scalable, low-overhead profiler for C++, C, Fortran, and Python that requires no instrumentation or code changes. It helps developers accelerate their code by revealing the causes of slow performance. From multiprocessor Linux workstations to the largest supercomputers, you can profile realistic test cases with typically less than 5% runtime overhead.
To profile, let us compile the code with optimizations (-O3) and debugging information for the tool to display source code information:
armclang -O3 -g sortmovies.c -o sortmovies.exe
or:
gcc -O3 -g sortmovies.c -o sortmovies.exe
Profiling a remote application with Arm MAP is straightforward:
map --profile ./sortmovies.exe
The “--profile” option requests the profiler to sample the application in the background and output the results when the application terminates. The result (a *.map file) can be opened afterward from the remote client by selecting “Load Profile Data File” from Arm Forge’s main menu. Here are the profiling results for a small test case.
The application runs for 32 seconds, as shown in the summary on top of the GUI. Arm MAP describes the application behavior with a few graphs. At the top, the application activity is a timeline that reports when the application is computing in green and when it is performing I/O system calls in orange. Users can zoom in on a specific time frame to inspect performance aspects. Different metric graphs can be displayed underneath the activity. Here, we have selected POSIX I/O read and write rates that illustrate when the application is reading input data, sorting and writing results.
In the center of the GUI, the source code viewer is displayed with time and activity annotations. The stack view at the bottom of the GUI categorizes functions and lines of code depending on how they are executed.
The profiling results show that the loop reading ratings.tsv and matching the movie IDs in our records (“table”) is costly. The flowchart is as follows:
The movie IDs are stored in the input files in descending order. This loop can be optimized by iterating from the last matching position in table (i=last_found) instead of iterating from the beginning (i=0). As a result, we won't check movies that have already been matched with ratings data.
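Here is a minimal sketch of this optimization, assuming both files list their IDs in the same order so that a match can never sit before the previous one. The struct layout and the function name find_from_last are hypothetical; only the names table, r_id, and last_found come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout; the real structure in sortmovies.c may differ. */
struct movie {
    char  id[16];
    float rating;
};

/* Look for `r_id` in `table`, starting at *last_found instead of 0.
 * Entries before *last_found were matched earlier and are skipped.
 * Returns the index of the match, or -1 if the ID is not found. */
long find_from_last(const struct movie *table, size_t n,
                    const char *r_id, size_t *last_found)
{
    assert(table != NULL && r_id != NULL && last_found != NULL);

    for (size_t i = *last_found; i < n; i++) {
        if (strcmp(table[i].id, r_id) == 0) {
            *last_found = i;   /* the next search resumes from here */
            return (long)i;
        }
    }
    return -1;                 /* no match: *last_found is left unchanged */
}
```

With the search restarting at index 0 for every rating, the total work grows with the product of the two file sizes. Resuming from the last match keeps the scan moving forward through the table.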
We can profile the optimized code on a larger test case: this brings a 5.5x speedup. However, MAP shows that the same loop is still costly:
Additional optimization is possible, since our ID strings are always the same size. Let’s replace:
strcmp(table[i].id,r_id)
with:
memcmp(table[i].id, r_id, 9*sizeof(char));
The new version gives an additional 1.3x speedup.
In two iterations, we managed to reduce the execution time from 215 seconds to 30 seconds on a large dataset. And the best-rated movie of all time is… The Shawshank Redemption. Now you know what to watch next.
We have seen how easy software development on Arm-based instances in the cloud can be with Allinea Studio. If you would like to try it on your own applications on AWS Graviton2 instances, you can request a free trial license of the tool.
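The fixed-width comparison can be wrapped in a small helper, sketched below. The name same_id and the ID_LEN constant are hypothetical; the width of 9 comes from the memcmp() call above. Note that memcmp() does not stop at a terminating '\0', so both buffers must hold at least ID_LEN bytes.

```c
#include <assert.h>
#include <string.h>

#define ID_LEN 9  /* fixed ID width, as in the memcmp() call above */

/* Hypothetical helper: return 1 if the first ID_LEN bytes of the two
 * IDs are identical, 0 otherwise. Both buffers must be at least
 * ID_LEN bytes long, because memcmp() ignores '\0' terminators. */
int same_id(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, ID_LEN) == 0;
}
```

Because the length is known in advance, the comparison does not have to check every byte for a terminator, which is one plausible reason for the speedup measured here.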
Get your free trial: https://pages.arm.com/Hpc-trial-request.html