
Using Arm servers to reduce the time and cost of Genomics

David Lecomber
October 5, 2022
3 minute read time.

Genomics has been transformational for public health and continues to deliver benefits for us all. Achieving these results involves a significant and growing amount of computing in cloud and on-premises data centers, at research centers, hospitals, and across the wider life sciences industry.

Reference-guided assembly is an essential stage in many workflows in this field. For a typical patient, a swab yields a sample that is run through a sequencing machine, and the machine's output is gigabytes of reads (substrings of the A, C, G, and T DNA bases). These reads are “aligned” against a complete human genome from a standard reference individual to establish where each read fits, assembling large sections of the patient's genome.

The three best-known applications that accomplish reference-guided assembly are BWA, bwa-mem2, and minimap2. With such widespread use, the price and performance of these applications are critical to the industry.

In a previous blog (Optimizing the BWA aligner for Arm servers) we showed how to run BWA and compared its performance on AWS Graviton2 against the prevailing x86_64 servers of early 2021.

In this blog, we show the performance of all three major aligners on AWS Graviton3. AWS Graviton3 is the most recent Arm-based server in the AWS fleet and the successor to AWS Graviton2.

We demonstrate that AWS Graviton3 increases performance by between 12% and 31% over AWS Graviton2, and by between 10% and 23% over the best available x86_64 systems today. This translates into a cost saving of 20-30% over the comparable x86_64 systems.

Applications and test case

We use the human_g1k_v37 reference from the 1000 Genomes project and the NA12878 sample from the NIST archive. Both test cases are mirrored on AWS S3 and can be fetched using:

aws s3 cp --no-sign-request s3://1000genomes/technical/reference/human_g1k_v37.fasta.gz .

aws s3 cp --no-sign-request s3://giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz .

In each case, we have used the gcc-10 compiler for the platform comparison. Each of these applications builds easily on Arm through ordinary build scripts, either in the main repository or in public forks awaiting merge.

https://github.com/lh3/bwa

https://github.com/dslarm/bwa-mem2

https://github.com/dslarm/minimap2 
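As a sketch, building all three aligners on an Arm host might look like the following (the make variables are typical rather than the repositories' exact build scripts, and bwa-mem2 needs a recursive clone for its bundled submodules):

```shell
# Clone and build each aligner with gcc-10 (flags are illustrative;
# the repositories' own build scripts are the authoritative source).
git clone https://github.com/lh3/bwa && make -C bwa CC=gcc-10
git clone --recursive https://github.com/dslarm/bwa-mem2 && make -C bwa-mem2 CC=gcc-10 CXX=g++-10
git clone https://github.com/dslarm/minimap2 && make -C minimap2 CC=gcc-10
```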

The applications are all multithreaded with a configurable number of worker threads. We run the benchmark on 8xlarge instances - which have 32 vCPUs - using 32 worker threads.

At run time, we use the Cloudflare zlib package to replace the system zlib, which helps the aligners decompress the input data files faster. As a further optimization, for bwa we preload the jemalloc library, which can be more efficient than the standard memory allocation functions in multithreaded code.
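A sketch of the resulting bwa invocation, assuming the Cloudflare zlib and jemalloc shared libraries have been installed at the paths shown (the library paths and output file name are illustrative):

```shell
# Preload Cloudflare zlib (faster decompression of gzipped inputs) and
# jemalloc (a more scalable allocator for multithreaded code) ahead of
# the system libraries. Adjust the paths to your installation.
export LD_PRELOAD=/usr/local/lib/libz.so:/usr/lib/aarch64-linux-gnu/libjemalloc.so.2

# Index the reference once, then align the reads with 32 worker threads.
bwa index human_g1k_v37.fasta
bwa mem -t 32 human_g1k_v37.fasta NIST7035_TAAGGCGA_L001_R1_001.fastq.gz > aln.sam
```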

The build scripts for each application, and scripts to fetch the data sets, are available on our GitHub at https://github.com/arm-hpc/genomics-blog.

AWS Graviton2 to AWS Graviton3 – A leap in capability 

AWS Graviton3 uses the Arm Neoverse V1 core, whereas AWS Graviton2 uses the Arm Neoverse N1 core.

Neoverse N1 Pipeline

Neoverse V1 Pipeline

The Neoverse V1 brings significant changes: in particular, it is a much wider core, able to execute more instructions per cycle and extract more instruction-level parallelism than its predecessor.

Using perf stat we can extract the achieved instructions per cycle (IPC) rate for both platforms.
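For example (the perf events below are the generic Linux hardware counters, and the counts are placeholders rather than our measurements; IPC is simply instructions divided by cycles):

```shell
# Count retired instructions and CPU cycles over a full alignment run:
# perf stat -e instructions,cycles bwa mem -t 32 ref.fasta reads.fastq.gz > aln.sam

# IPC = instructions / cycles. With placeholder counts:
instructions=41000000000
cycles=20500000000
awk -v i="$instructions" -v c="$cycles" 'BEGIN { printf "IPC = %.2f\n", i / c }'
```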

Comparative Instructions per Cycle

As can be seen, the improvement in IPC varies across the applications, with minimap2 seeing the most benefit at 26% more instructions per cycle. This result is for a full workload and includes time spent in I/O.

AWS Graviton3 is also the first DDR5 system in the AWS fleet, with 50% more DDR bandwidth than its predecessor.

AWS Graviton3 also runs at 2.6 GHz, compared to 2.5 GHz for AWS Graviton2.

The combined impact of higher IPC and frequency translates directly into runtime, which we look at next.

Which architecture provides the most performance and the least cost?

Runtime per alignment with 32 worker threads

Compared to the previous generation AWS Graviton2 (c6g.8xlarge) the performance is between 12% and 31% higher.

AWS Graviton3 also demonstrates between 10% and 23% more performance than Intel Ice Lake (c6i.8xlarge) and between 11% and 21% more than AMD Milan (c6a.8xlarge).

If we turn to cost per alignment:

Relative cost of alignment
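Cost per alignment is simply runtime multiplied by the instance's hourly on-demand price. A minimal sketch of the calculation (the prices and runtimes below are illustrative placeholders, not the measured figures behind the chart):

```shell
# cost = runtime_hours * on_demand_price_usd_per_hour, per instance type.
awk 'BEGIN {
  # type           price   runtime (placeholder values)
  print_cost("c7g.8xlarge", 1.16, 0.50)
  print_cost("c6i.8xlarge", 1.36, 0.55)
  print_cost("c6a.8xlarge", 1.22, 0.57)
}
function print_cost(name, price, hours) {
  printf "%-12s $%.3f per alignment\n", name, price * hours
}'
```

Because the Graviton3 instances here combine a lower hourly price with shorter runtimes, both factors compound in the cost comparison.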

Summary

AWS Graviton3 offers the best price-performance of any platform tested for all three genomics applications. Running the same alignment costs up to 27% more per sample set on AMD Milan and up to 45% more on Intel Ice Lake. Put another way, AWS Graviton3 saves up to 20% over AMD Milan and up to 30% compared to Intel Ice Lake.
