Genetic sequencing and the field of genomics are phenomenally important today. From cancer research and patient diagnostics through to agricultural research and, more recently, the analysis and detection of COVID-19 variants, their impact is truly felt everywhere.
In this blog we talk about alignment - one of the largest and most prevalent parts of genomics workloads - and how we accelerate its performance on Arm servers built on the Neoverse N1, such as the Ampere Altra and the AWS Graviton 2 instances.
An individual’s genome is a sequence of around 3 billion characters, made entirely of the letters A, T, C, and G. To sequence a human clinical sample, a sequencing machine is used. This creates a large file containing randomly located subsequences of contiguous characters from that genome. Those sequences – each around 150 characters long – are fragments of the genome and are known as short reads.
Short reads can come from any part of my genome, so the output of the sequencing machine is usually larger than the genome itself, giving greater (but not complete) coverage – a typical sample in the NIST data sets is 1.8GB compressed.
The difficult bit is putting these short reads together – in the right order.
Applications such as BWA or Bowtie2 take the sample reads and align them against a reference genome – rather than starting from a blank canvas – which simplifies the task. Differences between the genes in the reference genome and those in mine may be variants that cause disease or susceptibility to a disease.
Alignment is a frequent and compute intensive task. It is natural that if we can save time and money then we should.
The team at AWS have written a comparative blog on BWA performance across their different instance types. It shows that, versus the available x86_64 fleet, the Arm-architecture Graviton 2 instance was 8% slower but saved 50% of the cost.
My question – can we improve the performance so that the Arm servers win on performance as well as cost?
For this work, we build on the original AWS example. We will use AWS Graviton 2 – although the 80 and 128 core Ampere Altra systems are an option for those using a local server.
To reduce the search space and the factors under analysis – and to concentrate on understanding the performance – we made two changes to AWS’s own example:
We use the same sample data - a reference genome from the 1000 Genomes Project and a reference sample from NIST:
wget ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
gunzip human_g1k_v37.fasta.gz
The install and build script is:
#!/bin/sh
sudo yum groupinstall "Development Tools" -y
git clone https://github.com/lh3/bwa.git
cd bwa/
ARCH=`uname -m`
if [ "$ARCH" = "aarch64" ] ; then
    wget https://raw.githubusercontent.com/DLTcollab/sse2neon/master/sse2neon.h
    sed -i -e 's/<emmintrin.h>/"sse2neon.h"/' ksw.c
    make clean
    make all CFLAGS="-g -Wall -Wno-unused-function -O3 -mtune=native"
else
    make clean
    make all CFLAGS="-g -Wall -Wno-unused-function -O3 -mtune=native"
fi
First the reference genome must be indexed - an exercise we do once but reuse for every future sample alignment.
./bwa/bwa index human_g1k_v37.fasta
Throughout this blog we use the BWA "mem" option to align our sample to the reference - and the "THREADS" is replaced by the number of threads we wish to use.
./bwa/bwa mem -t THREADS -o sample.sam human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
Our experiment starts from a different configuration to the AWS example, so we initially benchmark BWA with 8, 16, 32 and 64 thread options.
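One simple way to collect those timings is a small loop over the thread counts. The sketch below assumes GNU time is installed (its -v option also reports peak memory use); the output and log file names are purely illustrative:
#!/bin/sh
# Time each thread count; peak memory and wall-clock time land in the per-run log
for THREADS in 8 16 32 64 ; do
    /usr/bin/time -v ./bwa/bwa mem -t $THREADS -o sample_${THREADS}.sam human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz 2> time_${THREADS}.log
done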
Let us compare C5 (x86_64) and C6g (aarch64 - Graviton 2) runtimes with the default compiler, GCC 7.2:
Our example has a faster run time on C6g than on C5 at 8 threads (318 vs 337 seconds) – a 6% time saving – and at the higher end, 106 vs 112 seconds – a 5% saving. We have used stronger compiler optimization flags than the original blog for both platforms: -O3, plus -mcpu or -mtune to tune for the native CPU.
We could stop here – but let us not, let us find out what is happening.
How much can we improve performance by something as easy as changing the compiler? There are more recent GCC versions – as well as the commercially supported compiler Arm Compiler for Linux (ACfL), part of the Allinea Studio tool suite.
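As a sketch of how those rebuilds look – assuming a newer GCC such as gcc-10 and ACfL's armclang are installed and on the PATH (version and paths will vary) – BWA's Makefile lets us override the compiler on the command line:
# Rebuild with a newer GCC
make clean
make all CC=gcc-10 CFLAGS="-g -Wall -Wno-unused-function -O3 -mcpu=native"
# Rebuild with Arm Compiler for Linux (armclang is its C compiler driver)
make clean
make all CC=armclang CFLAGS="-g -Wall -Wno-unused-function -O3 -mcpu=native"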
Plotting these three compiler options – on C6g – we see:
A significant improvement comes from using ACfL – we have reduced the run time at 8 threads to 295 seconds from 318, and at 64 threads to 93 from 106. This is now 12-17% faster than x86 C5.
Are we done? Of course not.
Arm Allinea Studio provides a tool, Performance Reports, that gives a high-level characterization of software performance.
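Generating the report is a one-line change to how we launch BWA – a sketch, assuming Performance Reports is installed and on the PATH:
# Wrap the normal command line; a text and HTML summary are written after the run
perf-report ./bwa/bwa mem -t 64 -o sample.sam human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz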
We take a look at this overview now for the two platforms:
On the left, the C5 x86_64 instance, and on the right the C6g aarch64 instance.
There is a broadly similar I/O and compute balance.
I/O is a really important part of this workload – those data files are huge.
In the CPU section, the tool cannot directly compare x86_64 and aarch64 due to differences in the available data measurements. However, the memory intensive nature of the workload is seen in the time spent in memory accesses on x86_64, and in the observed L2 misses and stalled backend cycles on aarch64. The Graviton 2 has a larger L2 cache than the x86_64 instances, and very high memory bandwidth - both of which are an advantage here.
In the Threads section, physical core utilization is lower for Arm. We need to go deeper still.
Arm Forge MAP, also part of Arm Allinea Studio, includes a source level profiler for developers. We can use this to find out exactly what is happening.
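A sketch of collecting a profile from the command line, assuming Arm Forge is installed and on the PATH (the .map file name shown is illustrative – MAP generates its own):
# Record a profile non-interactively; MAP writes a .map results file
map --profile ./bwa/bwa mem -t 64 -o sample.sam human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
# Open the generated results file in the MAP GUI to explore the timeline
map bwa_mem_profile.map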
Here is the timeline view of activity - with 64 worker thread sessions from x86_64 (top) and aarch64 (bottom) shown.
In the timeline we can see similar patterns of activity on both platforms.
We also see the change in memory usage across the server (node) over time.
The height of the “Application Activity” chart corresponds to the number of active threads, and there is poor utilization in both cases. From the point when the main thread turns blue, after having read the data in, to when it is waiting on the termination of the other threads, there are areas where only one other thread is active (visible as the light grey gaps).
If we zoom in, we see that a single thread decompressing the input data runs before the work is sent to the worker threads.
That gap re-occurs later: it appears that the aligner threads are collectively too fast for the decompression thread – in other words, we have too many aligner threads for the pace at which they can be fed.
This starvation of work, and the highly sequential I/O regions at the start and end of the run are why an 8x increase in threads does not yield an 8x performance increase.
There is another interesting area – in the aarch64 timeline, the first large green segment has a substantial grey (for “waiting”) above it – which is not present for x86_64.
Digging a bit deeper – we see that this is a result of mprotect, which is being incurred during memory allocations – the heavy thread usage is resulting in this waiting at the kernel level.
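One way to confirm this outside the profiler is to count system calls with strace – a sketch only, since strace adds considerable overhead to a heavily threaded run:
# Summarize syscall counts across all threads; expect mprotect to feature heavily
strace -c -f ./bwa/bwa mem -t 64 -o sample.sam human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz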
There is a quick trick to use: jemalloc (https://github.com/jemalloc/jemalloc) is an alternative memory allocator that has been found to produce better multithreaded performance in some cases. It is worth trying here because it is easy to use: relinking with jemalloc (or just setting the environment variable LD_PRELOAD=/path/to/libjemalloc.so) is all we need to do.
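A sketch of the jemalloc route, assuming a build from source with /usr/local as the install prefix and the autoconf tools available:
# Build and install jemalloc
git clone https://github.com/jemalloc/jemalloc.git
cd jemalloc
./autogen.sh
make
sudo make install
cd ..
# Run the unmodified BWA binary with jemalloc preloaded in place of the glibc allocator
LD_PRELOAD=/usr/local/lib/libjemalloc.so ./bwa/bwa mem -t 64 -o sample.sam human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz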
Let us try:
We now see an improvement – GCC 7.2 has caught up with ACfL, and the binaries created by both compilers are faster with jemalloc. We also tried this on x86_64, but it did not change the run time for that platform.
Subsequent to trying jemalloc, we also experimented with the standard system glibc - and by setting:
export GLIBC_TUNABLES=glibc.malloc.top_pad=183500800
We were able to match the jemalloc performance.
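Applied to a run, that looks like the following – the tunable only needs to be set in the environment, with no rebuild or preload required:
export GLIBC_TUNABLES=glibc.malloc.top_pad=183500800
./bwa/bwa mem -t 64 -o sample.sam human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz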
The final runtimes for Graviton 2 were: 291 seconds for 8 threads, and 82 for 64 threads – a time saving of 14%-27% over the fastest x86_64 instance timings.
In one simple session of profiling – we’ve increased our advantage from 5% over x86 to 14-27%.
We have also learned more about the workload’s performance by deploying performance tools to analyze it more deeply.
For example, we can see that peak memory use is under 14GB – meaning the c6g and c5 instances (at 128GB and 192GB respectively) are over-provisioned on memory too.
There are more tricks we can use to optimize cost further – such as trying different file system types in AWS – but this, as they say, is an exercise for the reader.