Tuning bowtie2 for better performance

Mark O'Connor
January 10, 2015
7 minute read time.

Faster sequence alignment with Arm Performance Reports

Recently we've been running bowtie2 on a 16-CPU server with 32 GB of RAM. I've tried using the "-p" flag to use more cores, but it doesn't seem to make much difference beyond 8 or so.

Today we're going to do a short performance investigation using the zebrafish genome and example pairs from HPC Lab Benchmarking Short Sequence Mapping Tools to check the health of our setup. Should bowtie2 be faster on this relatively powerful server or is this just the way things are?
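(For reference, the "zebrafish" index passed to "-x" below is built once up front with bowtie2-build; the FASTA file name here is only illustrative:)

$ bowtie2-build danio_rerio.fa zebrafish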

$ time ../bowtie2 -p16 -x zebrafish -1 ./zebrafish.1M.1.fq -2 ./zebrafish.1M.2.fq -S ./z1.sam
999999 reads; of these:
  999999 (100.00%) were paired; of these:
    464803 (46.48%) aligned concordantly 0 times
    389786 (38.98%) aligned concordantly exactly 1 time
    145410 (14.54%) aligned concordantly >1 times
    ----
    464803 pairs aligned concordantly 0 times; of these:
      200113 (43.05%) aligned discordantly 1 time
    ----
    264690 pairs aligned 0 times concordantly or discordantly; of these:
      529380 mates make up the pairs; of these:
        44199 (8.35%) aligned 0 times
        149239 (28.19%) aligned exactly 1 time
        335942 (63.46%) aligned >1 times
97.79% overall alignment rate

real    2m56.503s
user    16m18.817s
sys     0m7.910s

Is 3 minutes good or bad for the zebrafish example on such hardware? If we run with -p32 instead then it finishes in 2m55 instead of 2m56. Is this the limit?

We don't know. We need a performance report.

Step 1: Generating a bowtie2 performance report

Arm Performance Reports helps you tune software and systems to run well together. It can be downloaded from the Arm Developer website; it highlights bottlenecks and misconfigurations and gives advice on how to investigate further.

Normally you run Arm Performance Reports simply by putting "perf-report" in front of the command you wish to measure. However, "bowtie2" is actually a Perl script that calls several different programs before running the alignment, so instead we edit the "bowtie2" script and add "perf-report" to the command that it runs:

    my $cmd = "$align_prog$debug_str --wrapper basic-0 ".join(" ", @bt2_args);

like this:

  my $cmd = "perf-report $align_prog$debug_str …

Now bowtie2 runs just as before but also generates a performance report in the working directory. For the 16-core zebrafish example that takes 2m56 on our system we get this one:

[Performance report for the 16-core run writing to the network filesystem]

Immediately we can see that our powerful 16-core system is mostly going to waste – an I/O bottleneck in transferring the data is limiting the speed at which we can run sequence alignment. In fact, 65.1% of the time is spent just reading and writing files!

The report advises us to read the I/O section, which further breaks this time down into the time spent reading and writing to files. Here it's clear where the problem lies – 98.9% of the I/O time is spent writing the output files at just 5.09 MB/s.

This is very slow; we expected our network filesystem to achieve much higher speeds. The network admin should help us troubleshoot this but in the meantime we can write the files to the local disk to see what sort of speedup we could achieve by improving this.
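(A quick way to cross-check raw write throughput on the suspect filesystem, independently of bowtie2, is a simple dd test; the path and transfer size below are only illustrative, and oflag=direct bypasses the page cache so the figure reflects the storage rather than memory:)

$ dd if=/dev/zero of=/path/on/network/fs/ddtest bs=1M count=1024 oflag=direct
$ rm /path/on/network/fs/ddtest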

Step 2: Avoiding the network filesystem for writing output files

Here we decide to write to a dedicated internal disk; the output files can then be copied off it either in the background while other jobs run or via a USB disk. This shows us the sort of speedup we could achieve once the network filesystem issues are resolved:

$ time ../bowtie2 -p16 -x zebrafish -1 ./zebrafish.1M.1.fq -2 ./zebrafish.1M.2.fq -S /scratch/mark/z1.sam
999999 reads; of these:
  999999 (100.00%) were paired; of these:
    464803 (46.48%) aligned concordantly 0 times
    389786 (38.98%) aligned concordantly exactly 1 time
    145410 (14.54%) aligned concordantly >1 times
    ----
    464803 pairs aligned concordantly 0 times; of these:
      200113 (43.05%) aligned discordantly 1 time
    ----
    264690 pairs aligned 0 times concordantly or discordantly; of these:
      529380 mates make up the pairs; of these:
        44199 (8.35%) aligned 0 times
        149239 (28.19%) aligned exactly 1 time
        335942 (63.46%) aligned >1 times
97.79% overall alignment rate

real    1m0.758s
user    15m29.914s
sys     0m4.338s

Wow, instead of almost 3 minutes it now takes just one, a 2.9x speedup! It's definitely worth getting the networking issues resolved as quickly as possible.
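(Until the network issues are fixed, output written to local scratch can be staged back to shared storage in the background once a run finishes; the destination path below is only illustrative:)

$ rsync -av /scratch/mark/z1.sam /shared/results/ &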

So is bowtie2 now well-tuned or are there other limits? We left the indexes and sequence files being read from the network, for example. Shouldn't we change those too? Let's look at the performance report for the new run:

[Performance report for the 16-core run writing to local scratch]

It's definitely looking healthier: 96.4% of the time is now spent computing – just what we wanted. The multiple threads created with -p16 are shown in the threads breakdown, and we can see that bowtie2 is scaling wonderfully – 91.1% of the time is still spent efficiently computing our alignment in parallel.

This means that we should try running with -p32 again – it only made a 1 second difference previously but we now know that was because the CPU was starved of work due to the slow filesystem.
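(A quick way to confirm how many logical CPUs a machine actually exposes before raising -p is to ask the OS directly – on the system described here these should report 32 logical CPUs with 2 threads per core:)

$ nproc
$ lscpu | grep -i 'thread(s) per core'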

There are a number of forum posts around stating that bowtie2 doesn't scale past 4 or 8 threads. The measurements seen here cast serious doubt on such claims! It seems likely that the authors of such posts were also running into filesystem bottlenecks and didn't even realize it.

Step 3: Enabling hyperthreading and running with 32 cores

Telling bowtie2 to use all 32 hyperthreaded cores only made a 1 second difference before. How much benefit can we get now that we've resolved the I/O issue?

$ time ../bowtie2 -p32 -x zebrafish -1 ./zebrafish.1M.1.fq -2 ./zebrafish.1M.2.fq -S /scratch/mark/z1.sam
999999 reads; of these:
  999999 (100.00%) were paired; of these:
    464803 (46.48%) aligned concordantly 0 times
    389786 (38.98%) aligned concordantly exactly 1 time
    145410 (14.54%) aligned concordantly >1 times
    ----
    464803 pairs aligned concordantly 0 times; of these:
      200113 (43.05%) aligned discordantly 1 time
    ----
    264690 pairs aligned 0 times concordantly or discordantly; of these:
      529380 mates make up the pairs; of these:
        44199 (8.35%) aligned 0 times
        149239 (28.19%) aligned exactly 1 time
        335942 (63.46%) aligned >1 times
97.79% overall alignment rate

real    0m47.261s
user    23m10.694s
sys     0m16.169s

Down from 61s to just 47s, a further speedup of almost 30%, bringing us to a total 3.7x speedup with just two simple configuration changes. Another reason to always measure performance yourself and not rely on Internet hearsay! So what does the evidence in the new performance report tell us – can we get any more performance out of our bowtie2 runs?

[Performance report for the 32-thread run writing to local scratch]

The report advises looking at the CPU breakdown, which shows us that just 9.2% of the time is spent in vectorized instructions and 68.5% of the time is spent on memory accesses. The advice to profile the code goes too deep for us – we aren't going to rewrite bowtie2.

However, the vectorization advice is interesting. Vector instructions allow the CPU to crunch more data at once. Our i7 supports Intel's AVX2 instruction set, but perhaps bowtie2 doesn't make use of it, since AVX2 isn't available on older CPUs.
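(You can check which vector instruction sets a Linux machine supports by looking at the CPU flags; this prints "avx2" once if the CPU supports it and nothing otherwise:)

$ grep -o -m1 avx2 /proc/cpuinfo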

Step 4: Making a build of bowtie2 optimized for our hardware

It's refreshingly simple to recompile bowtie2 from the source code with settings designed to make better use of our hardware – just replace "-march=sse2" with "-march=native" in the Makefile. Of course, if you really want to get the most out of an Intel CPU it's hard to beat Intel's own compilers, so we also swapped gcc for Intel's icc; a sketch of the rebuild is shown below. How much difference did these changes make?
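(A minimal sketch of that rebuild – the sed edit matches the Makefile change described above, while pointing the build at icc assumes your Makefile honours the usual CC/CXX overrides, so check how your copy of the bowtie2 source selects its compiler:)

$ sed -i 's/-march=sse2/-march=native/' Makefile
$ make clean
$ make -j16 CC=icc CXX=icpc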

$ time ../bowtie2 -p32 -x zebrafish -1 ./zebrafish.1M.1.fq -2 ./zebrafish.1M.2.fq -S /scratch/mark/z1.sam
999999 reads; of these:
  999999 (100.00%) were paired; of these:
    464870 (46.49%) aligned concordantly 0 times
    389788 (38.98%) aligned concordantly exactly 1 time
    145341 (14.53%) aligned concordantly >1 times
    ----
    464870 pairs aligned concordantly 0 times; of these:
      200208 (43.07%) aligned discordantly 1 time
    ----
    264662 pairs aligned 0 times concordantly or discordantly; of these:
      529324 mates make up the pairs; of these:
        44452 (8.40%) aligned 0 times
        149397 (28.22%) aligned exactly 1 time
        335475 (63.38%) aligned >1 times
97.78% overall alignment rate

real    0m41.825s
user    20m29.531s
sys     0m14.335s

Not bad - spending five minutes recompiling the package to make use of the current CPU features has reduced the running time from 47s to just 41.8s, a further 12% speedup that brings us to a total 4.2x speedup!

Conclusion: 4.2x speed improvement in one afternoon's work

Following the guidance in Arm Performance Reports we have:

  • Identified an I/O bottleneck writing output files – 2.9x speedup
  • Switched to “-p32” to make full use of hyperthreading – 3.7x speedup
  • Built bowtie2 with compiler flags optimized for our system – 4.2x speedup

Sequence alignment is only one part of most bioinformatics workflows. Which tools are you using that could be improved? Is an unidentified I/O bottleneck slowing you down?

The quickest way to find out? Run Arm Performance Reports regularly on each step of your workflow to keep your research running as efficiently and smoothly as possible.

How much more could you get done with a 4x faster system?
