Genomics: Optimizing the BWA aligner for Arm Servers

David Lecomber
April 30, 2021
7 minute read time.

Genetic sequencing and the field of genomics are phenomenally important today. From cancer research and patient diagnostics through to agricultural research and, more recently, the analysis and detection of COVID-19 variants, their impact is felt everywhere.

In this blog, we talk about alignment – one of the largest and most prevalent parts of genomics workloads – and how we accelerate its performance on Arm servers based on the Neoverse N1 core, such as the Ampere Altra and the AWS Graviton 2 instances.

Background

An individual’s genome is a sequence of around 3 billion characters, made entirely of the letters A, T, C, and G. To sequence a human clinical sample, a sequencing machine is used. This creates a large file containing randomly located subsequences of contiguous characters from that genome. Those sequences – each around 150 characters long – are parts of the genome and are known as short reads.

Short reads can come from any part of my genome – and so the output of the sequencing machine is usually larger than the genome itself, to give more (but not complete) coverage. A typical sample in the NIST data sets is 1.8GB compressed.

The difficult bit is putting these short reads together – in the right order.

Applications such as BWA or Bowtie2 take the sample reads and align them against a reference genome – rather than starting from a blank canvas – which simplifies the task. Differences between the genes in the reference genome and those in mine may be variants that cause disease or susceptibility to a disease.

Alignment is a frequent and compute-intensive task, so it is natural that if we can save time and money, we should.

The team at AWS has written a comparative blog on BWA performance across their different instance types. It shows that, versus the available x86_64 fleet, an Arm-architecture Graviton 2 instance was 8% slower but cost 50% less.

My question – can we improve the performance so that the Arm servers win on performance as well as cost?

BWA on Arm

For this work, we build on the original AWS example. We will use AWS Graviton 2 – although the 80 and 128 core Ampere Altra systems are an option for those using a local server.

To reduce the search space and the factors under analysis – and concentrate on understanding the performance – we made three changes to AWS’s own example:

  • Using instance storage for the data rather than Lustre FSX,
  • Using C6g metal aarch64 AWS Graviton 2 (64 cores) and C5 metal x86_64 instances (72 cores),
  • Using the same instances regardless of BWA thread count.

We use the same sample data – a reference genome from the 1000 Genomes Project and a reference sample from NIST:

wget ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
gunzip human_g1k_v37.fasta.gz

The install and build script is:

#!/bin/sh
# Install build tools and fetch the BWA source
sudo yum groupinstall "Development Tools" -y
git clone https://github.com/lh3/bwa.git
cd bwa/
# On aarch64, substitute the x86 SSE2 intrinsics header with the
# sse2neon translation layer so the vectorized code builds with Neon
ARCH=`uname -m`
if [ "$ARCH" = "aarch64" ] ; then
  wget https://raw.githubusercontent.com/DLTcollab/sse2neon/master/sse2neon.h
  sed -i -e 's/<emmintrin.h>/"sse2neon.h"/' ksw.c
fi
make clean
make all CFLAGS="-g -Wall -Wno-unused-function -O3 -mtune=native"

First, the reference genome must be indexed – an exercise we do once and reuse for every future sample alignment.

./bwa/bwa index human_g1k_v37.fasta

Throughout this blog we use the BWA "mem" command to align our sample to the reference, where THREADS is replaced by the number of threads we wish to use.

 ./bwa/bwa mem -t THREADS -o sample.sam human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz 

First impressions 

Our experiment starts from a different configuration to the AWS example, so we initially benchmark BWA with 8, 16, 32, and 64 threads.
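As a rough sketch, the sweep can be scripted along these lines (the use of GNU time and the log file names are our own choices here, not part of the original benchmark):

# Hypothetical sweep: record the elapsed seconds for each thread count
for THREADS in 8 16 32 64 ; do
  /usr/bin/time -f "%e" -o time_${THREADS}.log \
    ./bwa/bwa mem -t ${THREADS} -o sample.sam \
    human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
done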

Let us compare C5 (x86_64) and C6g (aarch64 - Graviton 2) runtimes with the default compiler, GCC 7.2:

Initial Execution Time (lower is better) 

Our example has a faster run time on C6g than C5 at 8 threads (318 vs 337 seconds) – a 6% time saving – and at the higher end, 106 vs 112 seconds – a 5% saving. We have used stronger compiler optimization flags than the original blog for both platforms: -O3, plus -mcpu or -mtune to tune for the native CPU.

We could stop here – but let us not; let us find out what is happening.

Change the compiler 

How much can we improve performance by something as easy as changing the compiler? There are more recent GCC versions – as well as the commercially supported Arm Compiler for Linux (ACfL), part of the Arm Allinea Studio tool suite.
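As a sketch, switching compiler is only a change to the make line, assuming the alternative compilers are installed and on the PATH (the gcc-10 name here is illustrative; ACfL provides the armclang driver):

# Hypothetical rebuilds with alternative compilers
make clean
make all CC=gcc-10 CFLAGS="-g -Wall -Wno-unused-function -O3 -mcpu=native"
make clean
make all CC=armclang CFLAGS="-g -Wall -Wno-unused-function -O3 -mcpu=native"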

Plotting these three compiler options – on C6g – we see:

 Execution times on aarch64 with different compilers

A significant improvement comes from using ACfL – the run time at 8 threads drops from 318 to 295 seconds, and at 64 threads from 106 to 93. This is now 12-17% faster than the x86_64 C5.

Are we done? Of course not. 

Looking inside

Arm Allinea Studio provides a tool, Performance Reports, that gives a high-level characterization of software performance. 
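Generating a report is a one-line wrapper around the normal command line – a sketch, assuming the perf-report launcher is installed and on the PATH:

# Produce the high-level summary report for a 64-thread run
perf-report ./bwa/bwa mem -t 64 -o sample.sam \
  human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz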

We take a look at this overview now for the two platforms:

x86 Performance Report (left) and aarch64 Performance Report (right)

On the left, the C5 x86_64 instance, and on the right the C6g aarch64 instance. 

There is a broadly similar I/O and compute balance.  

I/O is a really important part of this workload – those data files are huge.

In the CPU section, the tool cannot directly compare x86_64 and aarch64 due to differences in the available counter data. However, the memory-intensive nature of the workload is seen in the time spent in memory accesses on x86, and in the observed L2 misses and stalled backend cycles on aarch64. The Graviton 2 has a larger L2 cache than the x86_64 instances, and very high memory bandwidth – both of which are an advantage here.

In the Threads section, physical core utilization is lower for Arm. We need to go deeper still.

Performance profile

Arm Forge MAP, also part of Arm Allinea Studio, is a source-level profiler for developers. We can use it to find out exactly what is happening.
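A profile can be collected non-interactively – a sketch, assuming the map launcher is on the PATH; the resulting .map file is then opened in the MAP GUI:

# Collect a source-level profile of the 64-thread run without the GUI
map --profile ./bwa/bwa mem -t 64 -o sample.sam \
  human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz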

Here is the timeline view of activity, with the 64-worker-thread sessions from x86_64 (top) and aarch64 (bottom) shown.

Screenshot of Arm MAP performance timeline

In the timeline, we can see similar patterns of activity, such as:

  • Substantial I/O (orange), representing about 20% of the execution time, at the tail end of the run, and
  • The main thread sitting in synchronization (blue) whilst other parts of the workload execute.

We also see the change in memory usage across the server (node) over time.

The height of the “Application Activity” chart corresponds to the number of active threads, and utilization is poor in both cases. From the point where the main thread turns blue – after it has read the data in and is waiting for the other threads to terminate – there are periods where only one other thread is active (visible as the light grey gaps).

If we zoom in, we see that a single thread decompresses the input data before handing work to the worker threads.

That gap recurs later in the run: the aligner threads are collectively too fast for the decompression thread – in other words, we have more aligner threads than the decompression thread can feed.

This starvation of work, and the highly sequential I/O regions at the start and end of the run, are why an 8x increase in threads does not yield an 8x performance increase.

There is another interesting area: in the aarch64 timeline, the first large green segment has substantial grey (for “waiting”) above it, which is not present for x86_64.

Zooming in to the suspicious idle area

Digging a bit deeper, we see that this is a result of mprotect, incurred during memory allocations – the heavy thread usage is causing this waiting at the kernel level.

There is a quick trick to try: jemalloc (https://github.com/jemalloc/jemalloc) is an alternative memory allocator that has been found to improve multithreaded performance in some cases. It is worth trying here as it is easy to use: relinking with jemalloc (or simply setting the environment variable LD_PRELOAD=/path/to/libjemalloc.so) is all we need to do.
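A sketch of the preload approach – the library path is illustrative and varies by distribution and install prefix:

# Substitute jemalloc for the default allocator at load time,
# with no rebuild of BWA required
LD_PRELOAD=/usr/lib64/libjemalloc.so.2 \
  ./bwa/bwa mem -t 64 -o sample.sam \
  human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz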

Let us try:

 Final performance with Jemalloc 

We now see an improvement: GCC 7.2 has caught up with ACfL, and the binaries created by both compilers are faster with jemalloc. We also tried this on x86_64, but it did not change the run time for that platform.

After trying jemalloc, we also experimented with the standard system glibc, setting:

export GLIBC_TUNABLES=glibc.malloc.top_pad=183500800

we were able to match the jemalloc performance. (This tunable increases the padding glibc requests each time it grows a heap, reducing how often the kernel is called – through mprotect and friends – during allocation.)

The final runtimes for Graviton 2 were 291 seconds for 8 threads and 82 seconds for 64 threads – a time saving of 14-27% over the fastest x86_64 instance timings.

In one simple session of profiling, we have increased our advantage over x86 from 5% to 14-27%.

We have also learned more about the workload’s performance by deploying performance tools to analyze deeper:

  • Highly sequential I/O phases limit the efficiency of larger thread counts
  • The need to balance the decompression thread with the number of worker-aligners

We can also see that peak memory use is under 14GB – meaning the c6g and c5 instances (at 128GB and 192GB of RAM respectively) are over-provisioned too.

There are more tricks we can use to optimize cost further – such as trying different file system types in AWS – but this, as they say, is an exercise for the reader.
