Arm Community blogs: High Performance Computing (HPC) blog

Using Arm servers to reduce the time and cost of Genomics

David Lecomber
October 5, 2022

Genomics has been absolutely transformational for public health and continues to deliver benefits for us all. Achieving these results involves a significant and growing amount of computing. That computing happens in the cloud and in on-premises data centers, at research centers and hospitals, and across the wider life sciences industry.

Reference-guided assembly is an essential stage in many workflows in this field. For a typical patient, a swab sample is sequenced on a sequencing machine. The machine outputs gigabytes of short fragments called reads, which are substrings over the DNA bases A, C, G, and T. These reads are “aligned” against a complete human genome from a standard reference individual. This establishes where each read “fits”, so that large sections of the patient's genome can be assembled.
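The sketch below is a deliberately simplified illustration of what “finding where a read fits” means. It prints every 1-based position at which a read matches a reference string exactly. The helper `place_read` is hypothetical and does not reflect how BWA, bwa-mem2, or minimap2 work. Those tools index the reference and use seed-and-extend search, which tolerates sequencing errors, insertions, deletions, and reads from the reverse strand.

```bash
#!/usr/bin/env bash
# Toy illustration only (hypothetical helper, not part of any aligner):
# print every 1-based offset where the read occurs exactly in the
# reference, including overlapping occurrences.
place_read() {
  local ref=$1 query=$2
  [ -n "$query" ] || return 1   # an empty read has no meaningful position
  awk -v ref="$ref" -v query="$query" 'BEGIN {
    off = 0
    # index() returns the 1-based start of the first match, or 0 if none
    while ((p = index(substr(ref, off + 1), query)) > 0) {
      print off + p
      off += p   # resume one base after the previous match start
    }
  }'
}
```

For example, `place_read ACGTACGTTT ACGT` prints the positions 1 and 5.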

Three of the best-known applications that perform reference-guided assembly are BWA, bwa-mem2, and minimap2. Because they are so widely used, their price and performance are critical to the industry.

In a previous blog (Optimizing the BWA aligner for Arm servers), we showed how to run BWA on AWS Graviton2. We also compared its performance against the prevailing x86_64 servers of early 2021.

In this blog, we show the performance of the three major aligners on AWS Graviton3. AWS Graviton3 is the most recent Arm-based server in the AWS fleet and the successor to AWS Graviton2.

We demonstrate that AWS Graviton3 increases performance by between 12% and 31% over AWS Graviton2. It also increases performance by between 10% and 23% over the best x86_64 systems available today. This delivers cost savings of up to 20-30% over the comparable x86_64 systems.

Applications and test case

We use the human_g1k_v37 reference from the 1000 Genomes Project and the NA12878 sample from the NIST Genome in a Bottle archive. Both data sets are mirrored on AWS S3 and can be fetched with:

aws s3 cp --no-sign-request s3://1000genomes/technical/reference/human_g1k_v37.fasta.gz .

aws s3 cp --no-sign-request s3://giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz .

For the platform comparison, we built each application with the gcc-10 compiler. Each application builds easily on Arm using its ordinary build scripts, from either the main repository or a public fork awaiting merge:

https://github.com/lh3/bwa

https://github.com/dslarm/bwa-mem2

https://github.com/dslarm/minimap2 
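The following is a minimal sketch of cloning and building the three aligners with gcc-10. It assumes that `gcc-10` and `g++-10` are installed and that a plain `make` works in each repository. The helper names are hypothetical. The commands actually used for these results are the build scripts in the arm-hpc/genomics-blog repository. If a fork needs platform flags, they can be passed as extra arguments. For example, upstream minimap2 documents `make arm_neon=1 aarch64=1` for 64-bit Arm.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: clone and build the three aligners with gcc-10.
# Assumption: each repository builds with `make`; see
# https://github.com/arm-hpc/genomics-blog for the scripts actually used.
set -euo pipefail

# Usage: build_aligner URL DIR [extra make arguments...]
build_aligner() {
  local url=$1 dir=$2
  shift 2   # any remaining arguments are passed through to make
  # --recursive also fetches git submodules (upstream bwa-mem2 uses one);
  # it changes nothing for repositories without submodules.
  git clone --recursive "$url" "$dir"
  # Override the Makefile's compiler variables to use the gcc-10 toolchain.
  make -C "$dir" CC=gcc-10 CXX=g++-10 "$@"
}

build_all() {
  build_aligner https://github.com/lh3/bwa bwa
  build_aligner https://github.com/dslarm/bwa-mem2 bwa-mem2
  build_aligner https://github.com/dslarm/minimap2 minimap2
}
```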

The applications are all multithreaded, with a configurable number of worker threads. We run the benchmark on 8xlarge instances, which have 32 vCPUs, using 32 worker threads.
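The sketch below shows one way to run the three aligners with 32 worker threads. It rests on several assumptions:

- the binaries are on `PATH`;
- the file names match the downloads above;
- single-end alignment of the R1 file is what gets timed;
- the function names, index prefixes, and output names are hypothetical.

The `-t` option sets the worker-thread count in all three tools. For minimap2, `-x sr` selects the short-read preset and `-a` emits SAM output. The bwa and bwa-mem2 indexes get separate prefixes because bwa-mem2 uses its own index format.

```bash
#!/usr/bin/env bash
# Sketch: build indexes once, then align the reads with 32 worker threads.
# Assumptions: bwa, bwa-mem2 and minimap2 are on PATH; file names follow
# the aws s3 cp downloads; index prefixes and SAM names are hypothetical.
set -euo pipefail

REF=human_g1k_v37.fasta
READS=NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
THREADS=32   # one worker thread per vCPU on an 8xlarge instance

prepare_reference() {
  # Decompress, keeping the download; tolerate gzip's warning-only status (2).
  gzip -dc "$REF.gz" > "$REF" || [ $? -eq 2 ]
  bwa index -p "$REF.bwa" "$REF"
  bwa-mem2 index -p "$REF.bwa-mem2" "$REF"
  minimap2 -x sr -d "$REF.mmi" "$REF"   # prebuilt minimap2 index
}

align_bwa()      { bwa mem -t "$THREADS" "$REF.bwa" "$READS" > bwa.sam; }
align_bwa_mem2() { bwa-mem2 mem -t "$THREADS" "$REF.bwa-mem2" "$READS" > bwa-mem2.sam; }
align_minimap2() { minimap2 -a -x sr -t "$THREADS" "$REF.mmi" "$READS" > minimap2.sam; }
```

Run `prepare_reference` once, then time each `align_*` function separately.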

At run time, we substitute the Cloudflare zlib library for the system zlib, which helps the aligners decompress the gzipped input files faster. As a further optimization for bwa, we preload the jemalloc library. Its implementations of the standard memory allocation functions can be more efficient in multithreaded code.
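One way to make these run-time substitutions without relinking is through the dynamic loader's environment variables, as sketched below. This is not necessarily the mechanism used in the arm-hpc/genomics-blog scripts. The sketch assumes the aligners link zlib dynamically.

- `LD_LIBRARY_PATH` places a directory containing the Cloudflare zlib `libz.so` ahead of the system copy.
- `LD_PRELOAD` loads jemalloc first, so that `malloc` and `free` resolve to it.

Both library paths and both helper names are hypothetical placeholders.

```bash
#!/usr/bin/env bash
# Sketch: swap in Cloudflare zlib and jemalloc at run time.
# Hypothetical paths: override CF_ZLIB_DIR / JEMALLOC_SO for your system.
set -euo pipefail

CF_ZLIB_DIR=${CF_ZLIB_DIR:-/opt/cloudflare-zlib/lib}
JEMALLOC_SO=${JEMALLOC_SO:-/usr/lib/aarch64-linux-gnu/libjemalloc.so.2}

# Run a command with the Cloudflare zlib directory searched before the
# system library directories.
with_cf_zlib() {
  env LD_LIBRARY_PATH="$CF_ZLIB_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
}

# As above, and also preload jemalloc so its allocator replaces glibc's.
with_cf_zlib_and_jemalloc() {
  with_cf_zlib env LD_PRELOAD="$JEMALLOC_SO${LD_PRELOAD:+:$LD_PRELOAD}" "$@"
}
```

Following the text, you would prefix the bwa command with `with_cf_zlib_and_jemalloc`, and the bwa-mem2 and minimap2 commands with `with_cf_zlib`.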

The build scripts for each application, and the scripts to fetch the data sets, are available on our GitHub at https://github.com/arm-hpc/genomics-blog.

AWS Graviton2 to AWS Graviton3 – A leap in capability 

AWS Graviton3 uses the Arm Neoverse V1 core, whereas AWS Graviton2 uses the Arm Neoverse N1 core.

Neoverse N1 Pipeline

Neoverse V1 Pipeline

The Neoverse V1 brings significant changes. In particular, it is a significantly wider core: it can execute more instructions per cycle, extracting more instruction-level parallelism than its predecessor.

Using perf stat, we can measure the achieved instructions per cycle (IPC) on both platforms.
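As a sketch of that measurement, the helpers below use `perf stat` to count cycles and instructions for one aligner run, then compute IPC as instructions divided by cycles. The helper names and file names are hypothetical. The parser relies on perf's CSV field order: counter value first, event name third. By default, `perf stat` also follows the command's worker threads.

```bash
#!/usr/bin/env bash
# Sketch: measure instructions per cycle (IPC) for a single aligner run.
# Helper names and file names are hypothetical.
set -euo pipefail

# Count cycles and instructions for the given command; -x, selects CSV
# output and -o writes the counts to a file instead of stderr.
measure_ipc() {
  local csv=$1; shift
  perf stat -x, -o "$csv" -e cycles,instructions -- "$@"
}

# IPC = instructions / cycles. Field 1 is the counter value and field 3 the
# event name, which may carry a modifier suffix such as ":u".
ipc_from_csv() {
  awk -F, '$3 ~ /^instructions/ { i = $1 }
           $3 ~ /^cycles/       { c = $1 }
           END { if (c > 0) printf "%.2f\n", i / c }' "$1"
}
```

For example, run `measure_ipc minimap2.csv minimap2 -a -x sr -t 32 ref.mmi reads.fastq.gz > minimap2.sam`, then `ipc_from_csv minimap2.csv`. Without `-x`, perf stat's default human-readable output already reports the same ratio as "insn per cycle".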

Comparative Instructions per Cycle

As the chart shows, the improvement in IPC varies by application, with minimap2 benefiting most at +26% instructions per cycle. These figures cover the full workload, including time spent in I/O.

AWS Graviton3 is also the first DDR5 system in the AWS fleet, with 50% more memory bandwidth than its predecessor.

In addition, AWS Graviton3 runs at 2.6 GHz, compared with 2.5 GHz for AWS Graviton2.

The combined impact of higher IPC and frequency translates directly into runtime, which we look at next.
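A simple model makes the combination explicit. It assumes a CPU-bound run that retires the same number of instructions N on both generations; the symbols are ours. Under that assumption, runtime is T = N / (IPC × f), and the speedup is the product of the IPC ratio and the frequency ratio. Plugging in minimap2's +26% IPC and the 2.6 GHz versus 2.5 GHz clocks gives an estimate of roughly 31%. This is a model estimate, not a measured result.

```latex
T = \frac{N}{\mathrm{IPC}\cdot f}
\quad\Longrightarrow\quad
\frac{T_{\mathrm{G2}}}{T_{\mathrm{G3}}}
  = \frac{\mathrm{IPC}_{\mathrm{G3}}}{\mathrm{IPC}_{\mathrm{G2}}}\cdot\frac{f_{\mathrm{G3}}}{f_{\mathrm{G2}}}
  = 1.26 \times \frac{2.6\,\mathrm{GHz}}{2.5\,\mathrm{GHz}}
  = 1.26 \times 1.04 \approx 1.31
```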

Which architecture provides the most performance and the least cost?

Runtime per alignment with 32 worker threads

Compared with the previous-generation AWS Graviton2 (c6g.8xlarge), AWS Graviton3 delivers between 12% and 31% more performance.

AWS Graviton3 also delivers between 10% and 23% more performance than Intel Ice Lake (c6i.8xlarge), and between 11% and 21% more than AMD Milan (c6a.8xlarge).

If we turn to cost per alignment:

Relative cost of alignment
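The following sketch shows the underlying calculation, using symbols of our own choosing. Suppose a platform x takes T seconds per alignment on an instance billed at p dollars per hour. Its cost per alignment C, and its cost relative to AWS Graviton3, then follow as shown. No instance prices are assumed here. The point is that relative cost is the runtime ratio multiplied by the hourly price ratio, so the cost gap can differ from the performance gap.

```latex
C = T\cdot\frac{p}{3600}
\quad\Longrightarrow\quad
\frac{C_{x}}{C_{\mathrm{G3}}} = \frac{T_{x}}{T_{\mathrm{G3}}}\cdot\frac{p_{x}}{p_{\mathrm{G3}}}
```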

Summary

AWS Graviton3 offers the best price performance of any platform for all three genomics applications. Running the same alignment costs up to 27% more per sample set on AMD Milan, or up to 45% more on Intel Ice Lake. In other words, AWS Graviton3 saves up to 20% compared with AMD Milan and up to 30% compared with Intel Ice Lake.
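The two ways of stating the result are linked by a reciprocal. If another platform costs a fraction x more than AWS Graviton3, then AWS Graviton3 saves s = 1 - 1/(1 + x). Applying this to the figures above gives about 21% and 31%. These are consistent with the conservatively rounded savings of up to 20% and 30%.

```latex
s = 1 - \frac{1}{1+x},\qquad
s_{\text{Milan}} = 1 - \frac{1}{1.27} \approx 0.21,\qquad
s_{\text{Ice Lake}} = 1 - \frac{1}{1.45} \approx 0.31
```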
