Recently we've been running bowtie2 on a 16 CPU server with 32 GB RAM. I've tried using the “-p” flag to use more cores but it doesn't seem to make a lot of difference after 8 or so.
Today we're going to do a short performance investigation using the zebrafish genome and the example read pairs from the HPC Lab's “Benchmarking Short Sequence Mapping Tools” dataset to check the health of our setup. Should bowtie2 be faster on this relatively powerful server, or is this just the way things are?
$ time ../bowtie2 -p16 -x zebrafish -1 ./zebrafish.1M.1.fq -2 ./zebrafish.1M.2.fq -S ./z1.sam
999999 reads; of these:
  999999 (100.00%) were paired; of these:
    464803 (46.48%) aligned concordantly 0 times
    389786 (38.98%) aligned concordantly exactly 1 time
    145410 (14.54%) aligned concordantly >1 times
    ----
    464803 pairs aligned concordantly 0 times; of these:
      200113 (43.05%) aligned discordantly 1 time
    ----
    264690 pairs aligned 0 times concordantly or discordantly; of these:
      529380 mates make up the pairs; of these:
        44199 (8.35%) aligned 0 times
        149239 (28.19%) aligned exactly 1 time
        335942 (63.46%) aligned >1 times
97.79% overall alignment rate

real    2m56.503s
user    16m18.817s
sys     0m7.910s
Is 3 minutes good or bad for the zebrafish example on such hardware? If we run with -p32 instead then it finishes in 2m55 instead of 2m56. Is this the limit?
We don't know. We need a performance report.
Arm Performance Reports helps you tune software and systems to run well together. It can be downloaded from Arm's website; it highlights bottlenecks and misconfigurations and gives advice about how to investigate further.
Normally you run Arm Performance Reports simply by putting “perf-report” in front of the command you wish to measure, but as “bowtie2” is actually a perl script that calls several different programs before running the alignment we just edit the “bowtie2” script and add “perf-report” to the command that it runs:
my $cmd = "$align_prog$debug_str --wrapper basic-0 ".join(" ", @bt2_args);
like this:
my $cmd = "perf-report $align_prog$debug_str …
Now bowtie2 runs just as before but also generates a performance report in the working directory. For the 16-core zebrafish example that takes 2m56 on our system we get this one:
Immediately we can see that our powerful 16-core system is mostly going to waste – an I/O bottleneck in transferring the data is limiting the speed at which we can run sequence alignment. In fact, 65.1% of the time is spent just reading and writing files!
The report advises us to read the I/O section, which further breaks this time down into the time spent reading and writing to files. Here it's clear where the problem lies – 98.9% of the I/O time is spent writing the output files at just 5.09 MB/s.
This is very slow; we expected our network filesystem to achieve much higher speeds. The network admin should help us troubleshoot this but in the meantime we can write the files to the local disk to see what sort of speedup we could achieve by improving this.
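A crude dd write test is one way to confirm a slow filesystem independently of bowtie2. The mount points below are placeholders – substitute your own network and local paths:

```shell
# write 256 MB to each filesystem and let dd report the throughput
dd if=/dev/zero of=/mnt/network/ddtest bs=1M count=256 conv=fsync
dd if=/dev/zero of=/scratch/ddtest bs=1M count=256 conv=fsync
rm -f /mnt/network/ddtest /scratch/ddtest
```

The conv=fsync flag makes dd flush the data to disk before reporting a rate, so the figure reflects the storage rather than the page cache.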
Here we decide to write to a dedicated internal disk that we can then copy files from either in the background while other jobs run or via a USB disk. This shows us the sort of speedup we could achieve when the network filesystem issues are resolved:
$ time ../bowtie2 -p16 -x zebrafish -1 ./zebrafish.1M.1.fq -2 ./zebrafish.1M.2.fq -S /scratch/mark/z1.sam
999999 reads; of these:
  999999 (100.00%) were paired; of these:
    464803 (46.48%) aligned concordantly 0 times
    389786 (38.98%) aligned concordantly exactly 1 time
    145410 (14.54%) aligned concordantly >1 times
    ----
    464803 pairs aligned concordantly 0 times; of these:
      200113 (43.05%) aligned discordantly 1 time
    ----
    264690 pairs aligned 0 times concordantly or discordantly; of these:
      529380 mates make up the pairs; of these:
        44199 (8.35%) aligned 0 times
        149239 (28.19%) aligned exactly 1 time
        335942 (63.46%) aligned >1 times
97.79% overall alignment rate

real    1m0.758s
user    15m29.914s
sys     0m4.338s
Wow, instead of almost 3 minutes it now takes just one, a 2.9x speedup! It's definitely worth getting the networking issues resolved as quickly as possible.
So is bowtie2 now well-tuned or are there other limits? We left the indexes and sequence files being read from the network, for example. Shouldn't we change those too? Let's look at the performance report for the new run:
It's definitely looking healthier: now 96.4% of the time is spent computing – just what we wanted. The multiple threads created with -p16 are shown in the threads breakdown, and we can see that bowtie2 is scaling wonderfully – 91.1% of the time is spent efficiently computing our alignment in parallel.
This means that we should try running with -p32 again – it only made a 1 second difference previously but we now know that was because the CPU was starved of work due to the slow filesystem.
There are a number of forum posts around stating that bowtie2 doesn't scale past 4 or 8 threads. The measurements seen here cast serious doubt on such claims! It seems likely that the authors of such posts were also running into filesystem bottlenecks and didn't even realize it.
Telling bowtie2 to use all 32 hyperthreaded cores only made a 1 second difference before. How much benefit can we get now that we've resolved the I/O issue?
$ time ../bowtie2 -p32 -x zebrafish -1 ./zebrafish.1M.1.fq -2 ./zebrafish.1M.2.fq -S /scratch/mark/z1.sam
999999 reads; of these:
  999999 (100.00%) were paired; of these:
    464803 (46.48%) aligned concordantly 0 times
    389786 (38.98%) aligned concordantly exactly 1 time
    145410 (14.54%) aligned concordantly >1 times
    ----
    464803 pairs aligned concordantly 0 times; of these:
      200113 (43.05%) aligned discordantly 1 time
    ----
    264690 pairs aligned 0 times concordantly or discordantly; of these:
      529380 mates make up the pairs; of these:
        44199 (8.35%) aligned 0 times
        149239 (28.19%) aligned exactly 1 time
        335942 (63.46%) aligned >1 times
97.79% overall alignment rate

real    0m47.261s
user    23m10.694s
sys     0m16.169s
Down from 61s to just 47s, a further speedup of almost 30% bringing us to a total 3.7x speedup with just two simple configuration changes. Another reason to always measure performance yourself and not rely on Internet hearsay! So what does the evidence in the new performance report tell us – can we get any more performance out of our bowtie2 runs?
The report advises looking at the CPU breakdown, which shows us that just 9.2% of the time is spent in vectorized instructions and 68.5% of the time is spent on memory accesses. The advice to profile the code goes too deep for us – we aren't going to rewrite bowtie2.
However, the vectorization advice is interesting. Vector instructions allow the CPU to crunch more data at once. Our i7 supports Intel's AVX2 instruction set, but perhaps bowtie2 doesn't make use of it as it isn't available on older CPUs.
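Before recompiling, it's worth checking which vector instruction sets the CPU actually advertises. On Linux the kernel exposes this in /proc/cpuinfo (other systems expose the same information via tools such as sysctl):

```shell
# list the AVX-related flags reported for the first CPU, if any
grep -om1 'avx[^ ]*' /proc/cpuinfo || echo "no AVX support detected"
```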
It's refreshingly simple to recompile bowtie2 from the source code with settings designed to make better use of our hardware – just replace “-march=sse2” with “-march=native” in the Makefile. Of course, if you really want to get the most out of an Intel CPU it's hard to beat Intel's own compilers, so we also swapped gcc for Intel's icc. How much difference did these changes make?
$ time ../bowtie2 -p32 -x zebrafish -1 ./zebrafish.1M.1.fq -2 ./zebrafish.1M.2.fq -S /scratch/mark/z1.sam
999999 reads; of these:
  999999 (100.00%) were paired; of these:
    464870 (46.49%) aligned concordantly 0 times
    389788 (38.98%) aligned concordantly exactly 1 time
    145341 (14.53%) aligned concordantly >1 times
    ----
    464870 pairs aligned concordantly 0 times; of these:
      200208 (43.07%) aligned discordantly 1 time
    ----
    264662 pairs aligned 0 times concordantly or discordantly; of these:
      529324 mates make up the pairs; of these:
        44452 (8.40%) aligned 0 times
        149397 (28.22%) aligned exactly 1 time
        335475 (63.38%) aligned >1 times
97.78% overall alignment rate

real    0m41.825s
user    20m29.531s
sys     0m14.335s
Not bad - spending five minutes recompiling the package to make use of the current CPU features has reduced the running time from 47s to just 41.8s, a further 12% speedup that brings us to a total 4.2x speedup!
Following the guidance in Arm Performance Reports we have:

- identified an I/O bottleneck and worked around the slow network filesystem by writing output to local disk, cutting the runtime from 2m56 to 1m01
- scaled up from 16 to 32 threads once the CPUs were no longer starved of work, reaching 47s
- recompiled bowtie2 to use the CPU's newer vector instructions, reaching 41.8s
The quickest way to find out? Run Arm Performance Reports regularly on each step of your workflow to keep your research running as efficiently and smoothly as possible.
How much more could you get done with a 4x faster system?