Genomics has been transformational for public health and continues to deliver benefits for us all. Achieving these results involves a significant and growing amount of computing, in the cloud and in on-premises data centers, at research centers, hospitals, and across the wider life sciences industry.
Reference-guided assembly is an essential stage in many workflows in this field. For a typical patient, a swab yields a sample that is processed by a sequencing machine, and the output of this machine is gigabytes of fragments, or "reads" (substrings over the A, C, G, and T DNA bases). These reads are "aligned" against a complete human genome from a standard reference individual to establish where each read "fits", assembling large sections of the patient's genome.
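As a toy illustration of the idea only (real aligners build an index of the reference, such as BWA's FM-index or minimap2's minimizer sketches, and tolerate sequencing errors and mismatches), exact placement of a short read in a miniature reference string can be sketched with awk's `index` function:

```shell
# Toy sketch only: place a short "read" in a reference by exact match.
# Real aligners index the reference and tolerate sequencing errors.
ref="ACGTTGCAACGTGGA"    # miniature stand-in for a reference genome
read_seq="GCAACG"        # one short sequencing read
# index() returns the 1-based offset of the first exact match (0 if none)
awk -v r="$ref" -v s="$read_seq" 'BEGIN { print "read fits at position " index(r, s) }'
# prints: read fits at position 6
```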
The three best-known applications for reference-guided assembly are BWA, bwa-mem2, and minimap2. With such widespread use, the price and performance of these applications are critical to the industry.
In a previous blog (Optimizing the BWA aligner for Arm servers) we showed how to build and run BWA on AWS Graviton2, and compared its performance against the prevailing x86_64 servers of early 2021.
In this blog, we show the performance of all three major aligners on AWS Graviton3, the most recent Arm-based server in the AWS fleet and the successor to AWS Graviton2.
We demonstrate that AWS Graviton3 increases performance by between 12% and 31% over AWS Graviton2, and by between 10% and 23% over the best available x86_64 systems today. This result delivers a cost saving of 20-30% over the comparable x86_64 systems.
We use the human_g1k_v37 reference from the 1000 Genomes Project, and the NA12878 sample from the NIST archive. Both test cases are mirrored on AWS S3 and fetched using:
aws s3 cp --no-sign-request s3://1000genomes/technical/reference/human_g1k_v37.fasta.gz .
aws s3 cp --no-sign-request s3://giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz .
In each case, we used the gcc-10 compiler for the platform comparison. Each of these applications builds easily on Arm through its ordinary build scripts, either in the main repository or in public forks awaiting merge:
https://github.com/lh3/bwa
https://github.com/dslarm/bwa-mem2
https://github.com/dslarm/minimap2
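As a sketch of the build process (assuming git, make, and gcc-10 are installed; the compiler name and any extra flags may differ on your system), BWA can be built on an Arm server like this:

```shell
# Hedged sketch: clone and build BWA with gcc-10 on an Arm server.
# The repository's ordinary Makefile is used, with the compiler overridden.
git clone https://github.com/lh3/bwa
cd bwa
make CC=gcc-10
./bwa 2>&1 | head -n 3   # prints the usage banner if the build succeeded
```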
The applications are all multithreaded with a configurable number of worker threads. We illustrate the benchmark on 8xlarge instances - which have 32 vCPUs - using 32 worker threads.
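An illustrative invocation (file names as downloaded above; the thread count can be taken from `nproc` rather than hard-coded, and `bwa index` is a one-off step before alignment):

```shell
# Illustrative sketch: run BWA with one worker thread per vCPU.
threads=$(nproc)               # 32 on an 8xlarge instance
bwa index human_g1k_v37.fasta  # one-off: build the reference index
bwa mem -t "$threads" human_g1k_v37.fasta \
    NIST7035_TAAGGCGA_L001_R1_001.fastq.gz > aligned.sam
```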
At run time, we use the Cloudflare zlib library in place of the system zlib, which lets the aligners decompress the input files faster. As a further optimization for bwa, we preload the jemalloc library, which provides more efficient implementations of the standard memory allocation functions for multithreaded code.
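A hedged sketch of the run-time preloading (the library paths below are hypothetical install locations and will vary by system):

```shell
# Sketch: preload Cloudflare zlib (and, for bwa, jemalloc) in place of
# the system libraries. Paths are hypothetical install locations.
export LD_PRELOAD=/usr/local/lib/libz.so.1:/usr/local/lib/libjemalloc.so.2
bwa mem -t 32 human_g1k_v37.fasta \
    NIST7035_TAAGGCGA_L001_R1_001.fastq.gz > aligned.sam
```

Because `LD_PRELOAD` is resolved by the dynamic linker before the program's own library dependencies, no rebuild of the aligner is needed.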
The build scripts for each application, and scripts to fetch the data sets, are available on our GitHub at https://github.com/arm-hpc/genomics-blog.
AWS Graviton3 uses the Arm Neoverse V1 core, whereas AWS Graviton2 uses the Arm Neoverse N1 core.
The Neoverse V1 brings major changes: in particular, it is a significantly wider core, able to execute more instructions per cycle and so extract more instruction-level parallelism than its predecessor.
Using perf stat we can extract the achieved instructions per cycle (IPC) rate for both platforms.
As can be seen, the improvement in IPC varies across the applications, with minimap2 seeing the most benefit at 26% more instructions per cycle. This result is for a full workload and includes time spent in I/O.
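For reference, IPC is simply the ratio of the two hardware counters reported by perf. The counter values below are made up for illustration; on a real run they come from the `perf stat` output for the full alignment:

```shell
# Measure instructions and cycles for a full run, then compute IPC.
# perf stat -e instructions,cycles \
#     bwa mem -t 32 human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz > out.sam
# Illustrative (made-up) counter values standing in for perf's output:
instructions=2500000000000
cycles=1250000000000
awk -v i="$instructions" -v c="$cycles" 'BEGIN { printf "IPC = %.2f\n", i / c }'
# prints: IPC = 2.00
```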
AWS Graviton3 is also the first DDR5 system in the AWS fleet, with 50% more DDR bandwidth than its predecessor.
Also, AWS Graviton3 executes at 2.6GHz, compared to 2.5GHz for AWS Graviton2.
The combined impact of higher IPC and frequency translates directly into runtime, which we look at next.
Compared to the previous-generation AWS Graviton2 (c6g.8xlarge), performance is between 12% and 31% higher. AWS Graviton3 also delivers between 10% and 23% more performance than Intel Ice Lake (c6i.8xlarge), and between 11% and 21% more than AMD Milan (c6a.8xlarge).
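These percentages come directly from measured runtimes, where lower is better. As an illustration with hypothetical runtimes (not the measured figures), 950 s on Graviton3 against 1150 s on a comparison system works out as:

```shell
# Percentage performance improvement from two runtimes (lower is better).
# Runtime values are hypothetical, for illustration only.
t_other=1150   # seconds on the comparison system
t_g3=950       # seconds on AWS Graviton3
awk -v a="$t_other" -v b="$t_g3" 'BEGIN { printf "+%.0f%% performance\n", (a / b - 1) * 100 }'
# prints: +21% performance
```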
If we turn to cost per alignment:
AWS Graviton3 offers the best price-performance of any platform tested, for all three genomics applications. Running the same alignment on AMD Milan costs up to 27% more per sample set, and up to 45% more on Intel Ice Lake. This result means that AWS Graviton3 saves up to 20% over AMD Milan and up to 30% compared to Intel Ice Lake.
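Cost per alignment is simply the instance's hourly price multiplied by the runtime. As a sketch (the price and runtime below are hypothetical placeholders, not the measured figures or current AWS pricing):

```shell
# Cost per alignment = on-demand hourly price x runtime in hours.
# All numbers below are hypothetical placeholders for illustration.
price_per_hour=1.16   # USD/hour for the instance (hypothetical)
runtime_s=1800        # seconds for one alignment run (hypothetical)
awk -v p="$price_per_hour" -v t="$runtime_s" \
    'BEGIN { printf "cost per alignment = $%.2f\n", p * t / 3600 }'
# prints: cost per alignment = $0.58
```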
[CTAToken URL = "https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog" target="_blank" text="More HPC Blogs" class ="green"]