Debugging and Optimizing Performance of Applications on AWS Graviton2

Florent Lebeau
May 27, 2020
9 minute read time.

If you like movies, you may spend a lot of time browsing which one you are going to watch next… What if we ask the world for advice?

Let us use one of the many movie review databases available online. In this article, we are going to write a simple C application that parses the data and sorts out the best movies of all time.

Instead of developing this application locally, we use Arm-based AWS instances. We show which tools are available for developing applications for Arm-based servers and demonstrate how easily they can be used in the cloud.

Arm AWS instances

Amazon AWS has recently launched Graviton2 instances: a new generation of custom AWS silicon with 64-bit Arm Neoverse cores designed and optimized for cloud-native workloads. They can be launched from the AWS console:

AWS Arm instances types

Compared to the first generation of Arm-based AWS Graviton processors, Graviton2 offers:

  • 4X more compute cores (up to 64 vCPUs),
  • 5X faster memory,
  • 7X the performance.

In addition, it can achieve a 40% price-performance advantage over x86 generation 5 instances.

Let us select an m6g.medium instance with Ubuntu 18.04 LTS. M6g instances are designed for general-purpose workloads such as application servers, mid-size data stores, and microservices. When configuring the instance, make sure to open SSH port 22 in the security settings. When the instance is ready, we can connect from our local machine using SSH, our private *.pem key, and the public IP address of the instance. These settings can be saved in the ~/.ssh/config file under the host alias "aws", which we reuse in the rest of the article. For example:

Host aws
  HostName 52.214.109.204
  User ubuntu
  IdentityFile ~/.ssh/mypkey_aws.pem

Arm Allinea Studio development toolkit for Arm servers

To develop our code, we are going to use Arm Allinea Studio, a complete suite of high-performance tools for developing Arm-based server and HPC applications. It includes the Arm Compiler for Linux and the Arm Forge development toolkit, which are designed to get your application running at optimal performance on Armv8-A.

The tools can be downloaded from Arm Developer, either directly on our Ubuntu 18.04 AWS instance or on the local machine. In the latter case, the Arm Compiler for Linux and Arm Forge installation packages can be copied to the instance with:

user@local $ scp Arm-Compiler-for-Linux_20.1_Ubuntu_16.04_aarch64.tar aws:.
user@local $ scp arm-forge-20.0.3-Ubuntu-18.04-aarch64.tar aws:.
 

Builds are also available for most Linux server distributions. 

Now, let us connect to the remote instance:

user@local $ ssh aws

A few packages are needed before installing the tools:

sudo apt install build-essential libxrandr2 libsm6 libfontconfig1 python

Installation instructions can be found in this guide on Arm Developer.

The source code of the example application can be downloaded with:

git clone https://github.com/ARM-software/Tool-Solutions

The source files are in the following directory:

cd Tool-Solutions/allinea-studio-examples/sortmovies

Our application needs two input files:

  • One corresponds to the movie and TV show database,
  • The other contains vote information with average ratings and total number of votes.

Instructions to download the files can be found in README.md. These two files (database.tsv and ratings.tsv) are in TSV format. Each entry is identified with a unique key that our algorithm needs to match between the two files.

Time to open our favorite code editor to write some code. The structure of the algorithm is as follows:

  • Parse database.tsv and record only movie titles.
  • Parse ratings.tsv and, for each movie, compute a score (average rating × number of votes) and record it.
  • Sort the records by score.
  • Write the sorted records to a file.
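
The scoring and sorting steps can be sketched in a few lines of C. This is a minimal sketch, not the code from the repository: the movie_t layout, the field sizes, and the function names compute_score(), by_score_desc() and sort_movies() are assumptions made for illustration. It relies only on the standard qsort() function.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record layout; the real sortmovies.c may differ. */
typedef struct {
    char   id[16];      /* unique key shared by both TSV files */
    char   title[256];  /* movie title read from database.tsv */
    double score;       /* average rating x number of votes */
} movie_t;

/* Score as described in the list above. */
double compute_score(double average_rating, long num_votes)
{
    return average_rating * (double)num_votes;
}

/* qsort() comparator: highest score first. */
int by_score_desc(const void *a, const void *b)
{
    const movie_t *ma = a;
    const movie_t *mb = b;

    if (ma->score < mb->score) return 1;
    if (ma->score > mb->score) return -1;
    return 0;
}

/* Sort n records in place, best-scored movie first. */
void sort_movies(movie_t *table, size_t n)
{
    assert(table != NULL || n == 0);
    qsort(table, n, sizeof table[0], by_score_desc);
}
```

Comparing the scores explicitly, rather than returning their difference, avoids truncating a fractional difference to zero when it is converted to int.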

The algorithm is implemented in a single C file, which can be compiled with the Arm Compiler for Linux:

armclang sortmovies.c -o sortmovies.exe

The Arm Compiler for Linux is LLVM-based and includes the Arm C/C++ Compiler, the Arm Fortran Compiler, and Arm Performance Libraries. It supports the latest Fortran and C++14 standards and is designed to improve the speed of server and HPC workloads on a wide range of Arm-based platforms.

Arm Allinea Studio also comes with GCC. As an alternative, the code can be compiled with:

gcc sortmovies.c -o sortmovies.exe

We can now run the application:

./sortmovies.exe

Debugging in the cloud

The initial version of the application runs on small test cases but crashes with a segmentation fault (segfault) on a larger dataset. Let us investigate the issue with Arm Forge, the integrated tool suite for debugging and profiling server applications.

Arm DDT, the debugger in Arm Forge, is the debugger of choice for developing parallel and threaded C++, C, or Fortran applications. Its powerful, intuitive graphical interface helps you easily detect memory bugs and divergent behavior at all scales, making Arm DDT the number one debugger in research, industry, and academia.

Arm DDT is GUI-based but also comes with a remote client designed to debug remote applications. The remote client for a local machine running Linux, Windows, or macOS can be downloaded from the Arm Developer downloads area.

After the client is installed on our local machine, we launch it with the following command:

user@local $ ddt &

Then, we need to configure the connection to our AWS instance in the “Remote launch” drop-down menu.

Arm Forge remote connection settings

In the “Host Name” field, we can reuse the SSH host alias "aws" that we created earlier. Arm Forge picks up its settings (login name, private key) automatically. We also need to specify where Forge is installed on the instance in the “Remote Installation Directory” field: by default, the tools are installed in /opt/arm/forge.

When the settings are saved, the connection is available from the drop-down window in the remote client’s main menu.

Arm Forge’s main menu

Once connected, we leave the remote client running while it waits for a debugging job to connect.

Back in the terminal running on our cloud instance, we need to recompile the application for debugging with the Arm Compiler for Linux:

armclang -O0 -g -fsanitize=address sortmovies.c -o sortmovies.exe

or with GCC:

gcc -O0 -g -fsanitize=address sortmovies.c -o sortmovies.exe

We have added a few options for debugging:

  • “-g” for debugging information – this is required to view the source code.
  • “-O0” to disable compiler optimizations. This is recommended because it lets us inspect all the variables of the program.
  • “-fsanitize=address” to enable the compiler's address sanitizer (ASAN). This is optional, but it helps us detect memory errors.

Let us restart the application in the debugger:

ddt --connect ./sortmovies.exe

The “--connect” option is important here: it connects the debugging session to the remote client we launched locally. In the remote client, a window notifies us of the incoming connection.

Once the connection is accepted, the “Run” window appears. Arm DDT integrates with features of the Arm Compiler for Linux: this window lets us enable the ASAN plug-in, which helps us get to the root of the segfault:

Arm DDT’s run window

Clicking “Run” starts the debugging session. Arm DDT pauses the application at the beginning of the program and displays the main window:

Arm DDT’s main window

The GUI allows inspection of source files, variables, and the stack. Let’s run the application using the control bar at the top of the window until an error is triggered. The following message appears:

ASAN error message

The stack viewer lets us go back to the line of code where the problem occurs.

Arm DDT’s stack view

On line 60, the “type” variable is set to NULL, as shown in the variable view:

Arm DDT's variable view

We see that the “t_buffer” variable has not been initialized as expected on line 56. Each line of the dataset stored in “t_buffer” should start with a movie ID, and this line looks truncated. To compare, we can check the previous ID stored in our records using the “Evaluate” window:

Arm DDT’s evaluate window

This information enables us to find the problematic input in the dataset: a very long line causes our 256-character buffer to overflow. To fix this, let us read each line with getline(), which sizes the buffer to fit the line, instead of using a fixed-size buffer. The code can be edited and recompiled directly from the debugger when the build command is specified.

Save source file and rebuild application
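
The getline() fix described above could look like the following. This is a minimal sketch rather than the actual change made to sortmovies.c: the function name longest_line() is hypothetical, and the parsing step is left as a comment. It shows the key point, which is that getline() allocates and grows the buffer itself, so a line longer than 256 characters no longer overflows anything.

```c
#define _POSIX_C_SOURCE 200809L  /* expose getline() in <stdio.h> */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Read every line of fp with getline() and return the length of the
 * longest one (newline included), or -1 if a read error occurred.
 * longest_line() is a hypothetical helper, not part of sortmovies.c. */
long longest_line(FILE *fp)
{
    char   *line = NULL;  /* getline() allocates and resizes this buffer */
    size_t  cap  = 0;     /* current capacity of the buffer */
    ssize_t len;          /* characters read, or -1 at end of file/error */
    long    longest = 0;
    int     failed;

    assert(fp != NULL);

    while ((len = getline(&line, &cap, fp)) != -1) {
        if (len > longest)
            longest = (long)len;
        /* Parse `line` here: it always holds the complete line. */
    }

    failed = ferror(fp);
    free(line);           /* the caller of getline() owns the buffer */
    return failed ? -1 : longest;
}
```

The same buffer is reused across iterations and freed once after the loop, so the cost of allocation is paid only when a longer line is encountered.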

Analyzing performance

Now that our code is working, it is time to understand its efficiency. Arm Forge’s profiler, Arm MAP, profiles our code without distorting application behavior. Arm MAP is a scalable, low-overhead profiler for C++, C, Fortran, and Python that requires no instrumentation or code changes. It helps developers accelerate their code by revealing the causes of slow performance. From multiprocessor Linux workstations to the largest supercomputers, you can profile realistic test cases with typically less than 5% runtime overhead.

To profile, let us compile the code with optimizations (-O3) and debugging information for the tool to display source code information:

armclang -O3 -g sortmovies.c -o sortmovies.exe

or:

gcc -O3 -g sortmovies.c -o sortmovies.exe

Profiling a remote application with Arm MAP is straightforward:

map --profile ./sortmovies.exe

The “--profile” option requests the profiler to sample the application in the background and output the results when the application terminates. The result (a *.map file) can be opened afterward from the remote client by selecting “Load Profile Data File” from Arm Forge’s main menu. Here are the profiling results for a small test case.

Arm MAP main window

The application runs for 32 seconds, as shown in the summary at the top of the GUI. Arm MAP describes the application behavior with a few graphs. At the top, the application activity timeline reports when the application is computing (in green) and when it is performing I/O system calls (in orange). Users can zoom in on a specific time frame to inspect performance aspects. Different metric graphs can be displayed underneath the activity timeline. Here, we have selected POSIX I/O read and write rates, which illustrate when the application is reading input data, sorting, and writing results.

In the center of the GUI, the source code viewer is displayed with time and activity annotations. The stack view at the bottom of the GUI categorizes functions and lines of code depending on how they are executed.

The profiling results show that the loop reading ratings.tsv and matching the movie IDs in our records (“table”) is costly. The flowchart is as follows:

ID search flowchart

The movie IDs are stored in the input files in descending order. This loop can be optimized by iterating from the last matching position in table (i=last_found) instead of iterating from the beginning (i=0). As a result, we won't check movies that have already been matched with ratings data.
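
The optimized search can be sketched as follows. This is an illustration under stated assumptions, not the code from the repository: the entry_t type and the find_from_last() function are hypothetical names, and the sketch assumes that both input files list the IDs in the same order, so a match can never sit before the previous one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type; only the ID matters for the search. */
typedef struct {
    char   id[16];
    double score;
} entry_t;

/* Look for r_id in table[*last_found .. n-1].
 * On success, store the matching index in *last_found and return it.
 * Return -1 (and leave *last_found unchanged) when there is no match. */
long find_from_last(const entry_t *table, size_t n,
                    const char *r_id, size_t *last_found)
{
    assert(table != NULL && r_id != NULL && last_found != NULL);

    /* Start at the previous match (i = last_found) instead of i = 0. */
    for (size_t i = *last_found; i < n; i++) {
        if (strcmp(table[i].id, r_id) == 0) {
            *last_found = i;
            return (long)i;
        }
    }
    return -1;
}
```

An ID that is absent from the table still costs a scan from the last match to the end of the table, which is one reason the loop can remain visible in a profile.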

We can profile the optimized code on a larger test case: this brings a 5.5x speedup. However, MAP shows that the same loop is still costly:

Source code profiling information

Additional optimization is possible, since our ID strings are always the same size. Let’s replace:

strcmp(table[i].id,r_id)

with:

memcmp(table[i].id, r_id, 9*sizeof(char));

The new version gives an additional 1.3x speedup.

In two iterations, we managed to reduce the execution time from 215 seconds to 30 seconds on a large dataset. And the best-rated movie of all time is… The Shawshank Redemption. Now you know what to watch next.
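
The fixed-width comparison can be wrapped in a small helper to make its precondition explicit. This is a minimal sketch: the ID_LEN macro and the same_id() function are hypothetical names, and the width of 9 is taken from the memcmp() call above. Unlike strcmp(), memcmp() does not stop at a terminating null character, so both buffers must hold at least ID_LEN readable bytes.

```c
#include <assert.h>
#include <string.h>

#define ID_LEN 9  /* assumed fixed ID width, as in the memcmp() call above */

/* Return 1 when the first ID_LEN bytes of a and b are identical, 0 otherwise.
 * Both buffers must be at least ID_LEN bytes long. */
int same_id(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, ID_LEN) == 0;
}
```

Because only the first ID_LEN bytes are compared, two strings that differ after the ninth character are reported as equal. The replacement is therefore only safe while every ID really has the same size.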

Develop your own code in the cloud

We have seen how easy software development on Arm-based instances in the cloud can be with Allinea Studio. If you would like to try it with your own applications on AWS Graviton2 instances, you can request a free trial license of the tool.

Get your free trial
