Titanic
Mohit Kothari
Computer Science and Engineering
University of California, San Diego
Roger Tanuatmadja
Computer Science and Engineering
University of California, San Diego
Gautam Akiwate
Computer Science and Engineering
University of California, San Diego
Abstract—Next-generation sequencing (NGS), also known as high-throughput sequencing, is currently being utilized in the Bioinformatics Department of the University of California, San Diego as a research tool to examine and detect variants (mutations) in the genomes of individuals and related family members. Unfortunately, given the current tools and compute environment, a typical NGS work-flow presently takes between 12 and 14 days to complete. There is therefore a desire within the department to investigate the possibility of reducing the turnaround time to accommodate current and future research needs that would tread a similar path.
This paper describes the investigation, carried out over a period of a few weeks, of the first phase of the sequencing, which takes around 5 to 7 days with the current software tool chain and hardware, utilizing a set of smaller but still representative data sets; it concludes with some recommendations and lessons learned. Additionally, we were also able to build a surprisingly accurate model which predicts the behavior of the tool chain.
Keywords—NGS, Variant Calling, BWA, Picard, GATK, BAM, SAM
I. INTRODUCTION
A full end-to-end NGS work-flow involves a number of smaller work-flows/phases: utilizing hardware sequencers such as the ones made by Illumina, pre-processing the output BAM files by running them through a set of software tools to perform tasks such as sequence mapping (hereafter called the pre-processing phase), and concluding with variant calling, annotation, and filtering. A BAM file is an industry-standard binary version of a SAM file, the latter being a tab-delimited text file containing sequence alignment data. For the purposes of this paper, we will focus only on the pre-processing phase, since this is the work-flow that currently takes the longest to complete (see http://www.slideshare.net/AustralianBioinformatics/introduction-to-nextgeneration).
An overview of the current hardware and software tool chain (including how much time each tool currently takes in a typical execution) is provided in Table II.
The work-flow is currently executed via a Perl script that runs each stage in Table II in sequential order. When possible, communication between the stages is done through Unix pipes to reduce I/O, with temporary files being used when piping is not feasible. Additionally, although each stage proceeds sequentially, wherever applicable the tool utilized in each stage is executed with parameters that take advantage of extra processors/cores. A minimal sketch of the piping pattern follows.
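To illustrate the piping pattern (a sketch only, not the department's actual Perl script; the commands, arguments, and file names here are simplified placeholders), two stages can be chained like so:

import subprocess

# Two illustrative stages: bwa writes SAM to stdout and samtools converts it
# to BAM. Chaining them through a Unix pipe avoids a large temporary SAM file.
aligner = subprocess.Popen(
    ["bwa", "mem", "-t", "4", "reference.fa", "reads.fq"],
    stdout=subprocess.PIPE)
converter = subprocess.Popen(
    ["samtools", "view", "-b", "-o", "aligned.bam", "-"],
    stdin=aligner.stdout)
aligner.stdout.close()  # let the aligner receive SIGPIPE if the converter exits
converter.wait()
aligner.wait()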
TABLE I: Machine Description
System Description
Processor Model Intel Xeon
Clock Speed 1.8GHz
No. of Processors 1
No. of Cores per Processor 4
RAM Size 100 GB
Disk 228 GB
II. OVERVIEW
We start by describing the different tools that make up the pipeline in Section III. We then move on to describing the workings of the pipeline itself in Section IV. Section V describes our framework for measuring different system resource usage while running the pipeline and also explains the challenges faced while replicating the environment on a new system. Section VI describes our initial analysis and includes some insights into CPU, I/O, and memory usage along with Java GC and hardware performance counters. Section VII attempts to model the entire pipeline in mathematical terms, and Section VIII evaluates the model and lists some surprising results. Section IX uses the model to analyze recurring issues in the pipeline and makes some future predictions. Finally, Section X concludes with lessons learned and some recommendations for the pipeline.
III. TOOLS AND CONFIGURATIONS
To understand the pipeline it is important that we understand the tools that make it up. Hence, we begin by describing the tools and the roles they play. The pipeline consists of the following tool chain for processing (see http://gatkforums.broadinstitute.org/discussion/2899/howto-install-all-software-packages-required-to-follow-the-gatk-best-practices):
1) BWA
2) SAMtools
3) HTSlib
4) Picard
5) Genome Analysis Toolkit (GATK)
A. BWA
BWA (http://bio-bwa.sourceforge.net/) is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three variations of the Burrows-Wheeler Aligner algorithm: BWA-backtrack, BWA-SW, and BWA-MEM. The pipeline uses BWA-MEM, as it is capable of processing reads of up to 1 million base pairs (bp) [1]. BWA-MEM
TABLE II: Software Tool Chain
Stage Name | Tool Family | Software Tool | Current Processing Time
Shuffling and Aligning Input File | N/A | htscmd, bwa, samtools (C language) | 33 hours
SAM File Sorting | Picard | SortSam (Java) | 8 hours
Mark Duplicates | Picard | MarkDuplicates (Java) | 8 hours
BAM File Index Construction | Picard | BuildBamIndex (Java) | 1 hour
Building Insert/Delete (Indel) Realignment Targets | GATK | RealignerTargetCreator (Java) | 8 hours
Realignment around Indels | GATK | IndelRealigner (Java) | 8 hours
Base Q Covariance 1st Stage | GATK | BaseRecalibrator (Java) | 30 hours
Base Q Covariance 2nd Stage | GATK | BaseRecalibrator (Java) | 80 hours
Plot Base Q Results | GATK | AnalyzeCovariates (Java) | 0 hours
Base Q Recalibration | GATK | PrintReads (Java) | 33 hours
is highly parallel as it works on independent chunks of base
pair reads.
B. SAMtools
The SAM (Sequence Alignment/Map) file format is a generic format for storing large nucleotide sequence alignments. SAMtools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing, and generating alignments in a per-position format. It is inherently a single-threaded application [2].
C. HTSlib
HTSlib (https://github.com/samtools/htslib) is an implementation of a unified C library for accessing common file formats used for high-throughput sequencing data, such as SAM, CRAM, and VCF, and is the core library used by samtools and bcftools. The binary is called htscmd, and it is used to shuffle the input data and convert the latter into a single FASTQ file. This file is then provided as input to the BWA tool. This tool is also single-threaded and doesn't have any parallelism in it.
D. Picard tools
Picard (http://picard.sourceforge.net/) is comprised of Java-based command-line utilities that manipulate SAM files, and a Java API (HTSJDK) for creating new programs that read and write SAM files. Both the SAM text format and the SAM binary (BAM) format are supported by the tool. The pipeline uses 3 utilities from the Picard tool set for three purposes, namely, sorting the SAM file (SortSam), marking duplicates (MarkDuplicates), and building the BAM index (BuildBamIndex). All these utilities are, again, unfortunately single-threaded, and the algorithms don't employ any parallelism.
E. Genome Analysis Toolkit (GATK)
Similar to Picard, GATK (http://www.broadinstitute.org/gatk/about/) too is a set of tools, and it serves as the core of the pipeline, performing the major analysis tasks on the genome. The GATK is the industry standard for analyses such as identifying rare mutations among exomes as well as specific mutations within a group of patients. The specific tools being used are RealignerTargetCreator, IndelRealigner, BaseRecalibrator, and PrintReads. The GATK was built from the ground up with performance in mind. It employs the concept of MapReduce, which is basically a strategy to speed up
Tool | Full Name | Supported Parallelism
RTC | RealignerTargetCreator | NT
IR | IndelRealigner | SG
BR | BaseRecalibrator | NCT, SG
PR | PrintReads | NCT
TABLE III: Parallelism in GATK
performance by breaking down large iterative tasks into shorter segments whose outputs are then merged into an overall result. Additionally, it also employs multi-threading heavily. Multi-threading is enabled simply by using the nt and nct command-line arguments [3].
Here nt represents the number of data threads sent to the processor, and nct represents the number of CPU threads allocated to each data thread. Apart from multi-threading, GATK also has a notion of "Scatter-Gather" (SG), which can be applied to a cluster of machines. Scatter-Gather is a very different process from multi-threading, because the parallelization happens outside of the program itself: it basically creates separate GATK commands, each covering a portion of the input data, and sends these commands to different nodes of the cluster.
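As a rough illustration of the scatter-gather idea (a sketch only, not how GATK actually orchestrates cluster jobs; the intervals, file names, and local process spawning stand in for per-node job submission):

import subprocess

# Scatter: one GATK command per genomic interval (-L); on a real cluster each
# command would be submitted to a different node rather than forked locally.
intervals = ["chr1", "chr2", "chr3"]
workers = [
    subprocess.Popen([
        "java", "-jar", "GenomeAnalysisTK.jar",
        "-T", "RealignerTargetCreator",
        "-R", "reference.fa", "-I", "input.bam",
        "-L", iv, "-o", "targets.%s.intervals" % iv])
    for iv in intervals]
for w in workers:
    w.wait()

# Gather: merge the per-interval outputs into a single target list.
with open("targets.intervals", "w") as merged:
    for iv in intervals:
        with open("targets.%s.intervals" % iv) as part:
            merged.write(part.read())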
Not all of the tools support all the different kinds of parallelism; Table III lists the support for each. As seen in Table III, SG is not supported by all the tools, and the ones that do support it are not the bottleneck of the pipeline, as explained in Section VII. The next sections describe the pipeline itself and our framework for measuring system performance.
IV. THE PIPELINE
In this section, we attempt to describe our understanding of each step in the workflow in computer-science layman's terms, as opposed to the point of view of a trained Bioinformatics scientist. For the purposes of analysis, we also divide the pipeline into logical phases.
The input to the workflow is a set of BAM files, essentially a really large set of files (totaling 1 billion short reads per genome per person) containing a collection of short reads. Short reads are fragments of a much longer DNA sequence, produced by hardware sequencers such as the ones made by Illumina. There are technologies in the marketplace that can produce long reads as well, but we will not discuss them here given our limited understanding of the topic.
Once this is done, these short reads are fed into the pipeline, which is made up of 14 stages.
A. Phase 1: Shuffle and Align
In this stage, the short reads in the BAM files are first shuffled to minimize bias during the alignment process (see http://www.broadinstitute.org/gatk/guide/tagged?tag=bam). Once shuffling is completed, each of the short reads then needs to be aligned (mapped to a specific position) against a large reference genome (in our case, a sequence 3 billion bases long). On the surface, read alignment seems to be a simple problem, i.e., finding a particular substring within a bigger string; however, read errors can and do happen, turning this into an approximate matching problem, which is solved here with the Burrows-Wheeler Aligner (BWA). It is our understanding that each of the short reads can be sequenced in this stage without knowledge of any other short-read data, as long as the input BAM file is valid. While running initial experiments we found an interesting problem related to missing base pairs, which is discussed further in Section V.
B. Phase 2: SAM Sorting
Many of the downstream analysis programs that utilize BAM files require the files to be sorted, since this allows reading from these files to be done more efficiently.
C. Phase 3: Remove Files
Phase 3 removes the temporary files created in the earlier
phases.
D. Phase 4: Mark Duplicates
In this phase, duplicates of any particular unique short read are marked to prevent a skew during the variant-calling process. Duplicates are usually produced by a particular DNA preparation process and may be unavoidable. Marking duplicates sounds like something that could be built on top of the canonical MapReduce example, i.e., counting the number of words in a given document.
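To make the analogy concrete, here is a toy sketch of duplicate marking in that map/reduce style (deliberately simplified; Picard's MarkDuplicates also considers read orientation, mate coordinates, and base qualities when choosing the representative read):

from collections import defaultdict

# Toy reads: (name, chromosome, alignment start). Reads mapping to the same
# coordinates are duplicate candidates.
reads = [("r1", "chr1", 100), ("r2", "chr1", 100), ("r3", "chr1", 250)]

# "Map": bucket reads by their alignment key, like words in word count.
buckets = defaultdict(list)
for name, chrom, pos in reads:
    buckets[(chrom, pos)].append(name)

# "Reduce": keep one representative per bucket; mark the rest as duplicates.
duplicates = {name for names in buckets.values() for name in names[1:]}
print(duplicates)  # {'r2'}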
E. Phase 5: Remove Files
In this phase, too, we remove some files.
F. Phase 6: Index Dedup (Build BAM Index)
In this phase, the output BAM files from preceding stages
are indexed for fast access. This essentially allows a particular
short read to be accessed by jumping immediately to a specific
offset within a particular BAM file thus negating the need to
read preceding data into memory. The output of this process
is a set of accompanying index files to the original BAM files.
G. Phases 7 and 8: InDel Targets and Realign InDels
The next two phases pertain to Insertion/Deletion (InDel) variants and thus would benefit from a short overview. The term InDel refers to a class of variations present in a human genome. The need to realign short reads around InDels arises for two major reasons. The first is that InDels can cause mappers (such as the BWA algorithm employed in Phase 1) to misalign short reads. The second is that those misalignments would then harm the accuracy of downstream processes such as base quality recalibration and variant detection (see http://hmg.oxfordjournals.org/content/19/R2/R131.full).
In Phase 7, the regions in the BAM files that need to be realigned are identified. In general there are three types of realignment targets: known sites, such as the ones coming from the 1000 Genomes project; InDels seen in the original alignments (as part of the application of the BWA algorithm); and finally, sites where evidence suggests a hidden InDel. Once the InDel regions have been identified, Phase 8 performs the actual realignment.
H. Phase 9: Remove Files
Phase 9 removes the temporary files created by the earlier
phase.
I. Phase 10: Baseq (Base Quality) Covariance Stage 1 (Base Recalibration)
Hardware sequencers associate a quality score with their reads. There is, however, a tendency for sequencers to be overly optimistic in their confidence scores. In this stage, a recalibration table is built using a machine-learning algorithm based on the covariation among several features of a base, such as read group, the original quality score from the sequencer, whether it is the 1st or 2nd read in a pair, etc. (see http://weallseqtoseq.blogspot.com/2013/10/gatk-best-practices-workshop-data-pre.html).
J. Phase 11: Baseq (Base Quality) Covariance Stage 2
In this phase, the recalibration table built in the previous stage is utilized to recompute the base quality scores.
K. Phase 12: Plot Base Quality Covariance
In this stage, plots of the recalibration tables are generated so that an evaluation can be made of whether the recalibration has worked properly.
L. Phase 13: Base Quality Recalibration
In this phase, the recalibrated data is subjected to some final processing before being written out to disk.
M. Phase 14: Remove Files
Phase 14 simply removes the Index realignment files.
V. FRAMEWORK
In this section, we describe the framework that was set up to run the pipeline and the changes and modifications made to it for the purposes of analysis. In addition, we discuss the challenges faced while trying to get the pipeline working with a subset of the reads to ease analysis. Further, we also talk about the challenges we faced while duplicating the environment on another machine.
A. Framework Changes
The framework, essentially a script provided to us, needed an overhaul and further additions so that we could start measuring system performance for each phase of the pipeline. To measure basic resources such as CPU utilization, memory utilization, and disk I/O, we decided to wrap each command with the dstat tool. For collecting hardware performance counters, e.g., L1 cache misses, last-level cache misses, etc., we used the perf-stat utility. Also, since most of the tools are Java based, we thought it would be beneficial to collect JVM Garbage Collector logs as well.
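In outline, the wrapper looked roughly like the following sketch (the dstat/perf option sets and output file names are illustrative, not our exact invocations):

import shlex
import subprocess

def run_instrumented(cmd, phase):
    # Run one pipeline phase under perf stat while dstat samples the system.
    # dstat samples CPU, memory, and disk usage once per second into a CSV.
    dstat = subprocess.Popen(
        ["dstat", "--cpu", "--mem", "--disk",
         "--output", phase + ".dstat.csv", "1"],
        stdout=subprocess.DEVNULL)
    try:
        # perf stat wraps the command and writes counter totals when it exits.
        subprocess.run(["perf", "stat", "-o", phase + ".perf.txt"]
                       + shlex.split(cmd), check=True)
    finally:
        dstat.terminate()

run_instrumented("sleep 5", "demo_phase")  # placeholder for a real stage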
B. Data Sources
Our initial experiments were run on the master node of the Bioinformatics lab during a short period of time when it was not being utilized by the department. This gave us an advantage, since we did not have to simultaneously understand both the tools and the environment at the outset. However, once normal usage of the node resumed, we had to utilize a different machine. The critical issue was that we had been using private patient data up to that point, and due to privacy concerns it was impossible for us to move that data set to the new machine. We were thus faced with the challenge of finding an equivalent data set. We tried a number of sources and finally found the 1000 Genomes project (http://www.1000genomes.org/) to be fruitful.
C. Pair Issue and R
Our woes did not stop once we had found a data set from the 1000 Genomes project. We expected the subset of the reads to run successfully through the pipeline; instead, we faced cryptic errors in its very first phase. After multiple long debugging sessions and help from the bioinformatics people, we found out that the reads come in pairs, and the bwa tool requires that every read have its mate in the data set.
Since we were running our experiments on a subset of the reads, there was a probability that certain reads did not have their mate present, and because of this the pipeline was failing at the first phase itself. To remove this error, we had to write a wrapper script around the given pipeline to pre-process the input BAM file and remove all the reads whose mates didn't exist.
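The core of that wrapper logic, sketched here over SAM text for clarity (the real input is BAM, which one would first convert with samtools view; names and paths are illustrative):

from collections import Counter

def drop_unpaired(sam_in, sam_out):
    # Keep header lines and only those reads whose mate is also present.
    with open(sam_in) as f:
        lines = f.readlines()
    # QNAME (the first tab-separated field) is shared by both reads of a pair,
    # so a name appearing only once marks a read whose mate is missing.
    counts = Counter(l.split("\t", 1)[0]
                     for l in lines if not l.startswith("@"))
    with open(sam_out, "w") as out:
        for l in lines:
            if l.startswith("@") or counts[l.split("\t", 1)[0]] == 2:
                out.write(l)

drop_unpaired("subset.sam", "subset.paired.sam")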
The challenges didn't end there: one phase in the pipeline uses R to plot graphs, and it turns out the tools rely on deprecated R plotting libraries which we couldn't install while replicating the environment, so we had to skip that phase during our experiments. Our initial analyses on the Bioinformatics machine showed that this phase did not take a significant share of the pipeline's time, so skipping it should not impact our experiments.
Finally, after all these changes, we were able to replicate the complete environment on a new server, where we could experiment with different data-set sizes and analyze the results.
VI. BASELINE RUN
In order to gain a better understanding of the pipeline, we started our investigation by running the pipeline in the simplest possible configuration that still performs meaningful work. This was accomplished by setting the number of threads to 1 in each phase of the pipeline and utilizing only 1% of the full data set used in a typical run in the Bioinformatics department, which comes to around 10 million reads. The choice of 1% of the full set was informed by conversations with a student from the department.
The next level of experiments consisted of running the same pipeline and data set but with a higher number of threads; since we were working on a quad-core machine, we chose 4 threads for immediate comparison. Later in the paper, we look at how SMT performs.
Table IV shows that there are mainly 5 phases which together contribute approximately 84% of the total time. Looking at the graphs for those phases, namely Figures 1 through 10, it is evident that neither I/O nor memory is a bottleneck. The remaining graphs can be found in Appendix A.
This became even more evident when we captured the resource usage pattern for an actual run that lasted approximately 7 days. Figures 11 and 12 show that although I/O increased, it is not the bottleneck of the system, and the tools never ran out of memory.
We believe that memory bandwidth contention might be one of the possible reasons why these phases are slow, but due to time constraints we were not able to explore that avenue.
Phase Name | Single-Thread Time (s) | 4-Thread Time (s)
Shuff Algn | 4932 | 1505
Sort Sam | 244 | 244
Remove BAM | 1 | 1
DeDup Srtd | 299 | 299
Remove Srtd | 1 | 1
Index DeDup | 40 | 40
Indel Target | 2961 | 830
Realn Target | 393 | 394
Remove Dedup | 1 | 1
Base Covar | 1276 | 719
Base Covar 2 | 1907 | 1330
Plot (not measured) | 0 | 0
Baseq Realn | 1085 | 721
Remove Realn | 1 | 1
TABLE IV: Time taken by different phases of the pipeline for different numbers of threads, for a data set of 10 million reads
A. Investigating Java Runtime Environment
With the exception of the first phase of the pipeline, all the utilized tools are programs written in Java. This provided us with another avenue of investigation, pursued by enabling Java Garbage Collection (GC) logging.
[Figure] Fig. 1: Single thread resource usage for Phase 1: Shuffle and Align. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
[Figure] Fig. 2: 4 threaded resource usage for Phase 1: Shuffle and Align. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
Given the batch nature of each phase in the pipeline, we are primarily interested in knowing the phase throughput, i.e., the percentage of time spent by each pipeline phase doing useful work instead of GC. In general, any throughput number at 95% and above is considered good (see http://www.slideshare.net/jclarity/hotspot-garbage-collection-tuning-guide). Additionally, should the throughput number fall below 95%, we are also interested in seeing any instances of a GC pause taking an excessively long amount of time.
We modified the script used to run the tool chain to augment every "java" command with the following flags:
-Xloggc:logs
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
This set of flags outputs GC data in sufficient detail, including the amount of memory freed in both the Young and Old Generations in each iteration, as well as the times at which GC happens, into the specified log file. We then used a tool called jClarity Censum (http://www.jclarity.com/censum/) to visualize the data and collect the throughput metric.
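Censum computes this for us, but the metric itself is simple: one minus the fraction of wall-clock time spent paused in GC. A rough sketch of extracting it from the log (assuming the classic pre-Java-9 format produced by the flags above, where pause lines look like "123.456: [GC ... 0.0123456 secs]"):

import re

def gc_throughput(log_path):
    # Estimate the percentage of run time spent outside GC pauses.
    first = last = None
    paused = 0.0
    with open(log_path) as f:
        for line in f:
            m = re.match(r"(\d+\.\d+): \[.*?(\d+\.\d+) secs\]", line)
            if m:
                ts = float(m.group(1))
                first = ts if first is None else first
                last = ts
                paused += float(m.group(2))
    if first is None or last <= first:
        return None
    return 100.0 * (1.0 - paused / (last - first))

print(gc_throughput("logs"))  # "logs" is the file named by -Xloggc:logs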
1) Throughput Results: When it comes to the pipeline, the phases utilizing Java that contribute the most to the running time are Indel Targets, Base Covar 1, Base Covar 2, and Baseq Recal, so we are going to constrain our discussion to those 4 phases. For every combination of input size and number of threads, we found that the throughput number never dropped below 95%, with the exception of the Base Covar 1 phase, where the number dropped steadily when the number of threads specified was 8 (at 4 threads and below, the throughput remained above 95%).
[Figure] Fig. 3: Single thread resource usage for Phase 7: Indel Target Index. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
[Figure] Fig. 4: 4 threaded resource usage for Phase 7: Indel Target Index. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
Phase | 10M | 20M | 40M | 100M
Indel Targets | 98.6% | 98.9% | 98.7% | 98.9%
Base Covar 1 | 95.8% | 94.7% | 93.6% | 92%
Base Covar 2 | 97.1% | 96.8% | 96.6% | 96.6%
Baseq Recal | 97.4% | 97.5% | 97.5% | 97.6%
TABLE V: GC throughput for number of threads = 8
We then made some modifications to the "java" command to see if we could improve the throughput number for that phase. In particular, we found that for a data size of 40 million reads, we were able to increase the throughput from 93.6% to 96.1% by specifying a number of additional flags:
-Xms15000m
-Xmx15000m
-Xmn10000m
The choice of those numbers was informed by the raw GC data collected, in particular the sizes of the Young and Old Generations at the end of the GC log for that phase, with some buffer built in to tolerate possible memory spikes.
Improvement aside, we don't think it is particularly meaningful: for example, referring to Table V, improving Base Covar 1's throughput from 92% to 95% would only result in an improvement of around 1 minute (3% of 35 minutes) in running time. It is true that, given that this particular phase currently takes around 30 hours using the full data set, we may see some real savings in time; however, we also know that the phase is currently run with the number of threads set to 5 on the Bioinformatics head node (and we mentioned earlier that throughput is not an issue when the number of threads is 4).
[Figure] Fig. 5: Single thread resource usage for Phase 10: Base Covariance 1. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
[Figure] Fig. 6: 4 threaded resource usage for Phase 10: Base Covariance 1. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
In other words, without further work, we are not able to ascertain how much of an improvement tuning this particular phase would yield.
Another thing worth mentioning is that early in our experiments we saw Java programs run in a single-threaded configuration through the use of nct (see Table III for the tools that accept a number of data/compute threads) consuming more than a thread's worth of CPU utilization ("top" would occasionally show utilization above 30%). We used an application called JConsole, which has shipped with every JDK since version 5.0, to investigate the issue, since the latter has the ability to show all running threads in a particular Java Virtual Machine (JVM).
What we found was that although the tool did start the number of data/compute threads specified through the configuration option, there were also a number of utility threads started by the tool and the JVM, including a progress-tracking thread (started by GATK), a number of GC threads, and a number of TCP/IP threads. We were not able to ascertain why the JVM would start TCP/IP threads, and there do not seem to be any flags for specifically turning those threads off; however, the presence of these extra threads does explain the CPU utilization phenomenon we were seeing.
B. Performance Counter Measurements
To be sure that the delays in the phases are not caused by excessive L1 cache misses, branch-predictor misses, or off-chip accesses, we measured hardware performance counters using the perf-stat utility and found them to be consistent across multiple runs for different input sizes and thread counts. The L1 data cache miss rate was about 3.5%, and the branch-predictor miss rate about 1.2%.
[Figure] Fig. 7: Single thread resource usage for Phase 11: Base Covariance 2. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
[Figure] Fig. 8: 4 threaded resource usage for Phase 11: Base Covariance 2. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
An interesting result, however, was an LLC miss rate of 31.3%; because of time constraints, we could not explore this avenue either. Since these measurements seemed essentially constant across runs with different sizes and thread counts, we concluded that while the cache misses could be potentially interesting, they would not affect our ability to model the pipeline.
VII. THE PIPELINE MODEL
One of the ideas behind profiling the pipeline was to build a model based on which we could make predictions. The idea being that if we could accurately predict and model the behavior of the software pipeline, we had, in a manner of speaking, truly understood the workings of the system as a black box.
As mentioned previously, to better understand the pipeline we logically split it into phases. In retrospect, this turned out to be a crucial step in building the model, as it enabled us to better predict the behavior of the entire pipeline as a combination of phases rather than as a single entity. Essentially, we have built a model for each of the phases, which is then used to make a prediction for the entire pipeline. More specifically, the model utilizes the size of the input data and the number of threads to predict the time the pipeline will take to complete.
A. Building the Model
In building the model, we realized that there are four important factors that affect the running time. Furthermore, each of the phases has a different behavior, which seems to stem from changes in these four factors.
The factors that affect running time and form an integral part of the model are:
[Figure] Fig. 9: Single thread resource usage for Phase 13: Base Recalibration. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
[Figure] Fig. 10: 4 threaded resource usage for Phase 13: Base Recalibration. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
Fig. 11: CPU and I/O utilization for full run on 1 billion reads.
Fig. 12: Memory utilization for full run on 1 billion reads.
1) f: the fraction of the phase that is parallelizable
2) p: the number of threads used
3) m: the fraction of the phase that actually depends on the size of the input
4) s: the size of the input
Every phase has its own set of f and m, which causes the phases to show different behavior. One assumption we make here is that the fraction f of a phase is perfectly scalable. In reality, however, it is possible that the performance of a phase stops improving as the number of threads is increased, or in fact degrades after a certain threshold has been reached; indeed, we see this exact behavior for one of the phases. Considering that we are treating all the programs as black boxes, it would be hard to pinpoint the exact reasons for the degradation; however, we briefly attempt to address the issue later in this section.
1) Model Terminology: In this section we briefly elaborate on the terminology we intend to use for building the model. Each phase, in addition to the two internal parameters f and m, has two external parameters that it depends on: one is the number of threads p, and the second is the size of the input, i.e., the number of reads s. These parameters are external and could possibly change.
Every phase is denoted by $P^i_{p,s}$, where p is the number of threads, s is the input size, and i is the phase number in the pipeline. Thus, the total time for the entire pipeline E can be represented as

$$\sum_{i=1}^{n} t(P^i_{p,s}) = t(E)$$
For the purposes of the analysis we consider 10M reads as the base size. As mentioned in Section VI, the model uses the data for 10M as ground truth, because using an input size smaller than that does not produce meaningful results, not only from the correctness point of view but also from the behavior point of view. Also, as discussed before, the time taken by every phase depends on f, p, s, and m. Of these, two factors are inherent to the phases and are not stated explicitly; hence, the success of the entire model revolves around the two factors f and m and how well we estimate and compute them.
For this, we decided to stay simple and come up with a simple way to compute f and m. For f, Amdahl's Law seemed like a good approximation. In fact, we see that the Amdahl's Law approximation works really well and provides a good starting point. Considering our success for f with Amdahl's Law, we decided to use a slightly modified version of the law to get a value of m for each phase.
To get f, for each phase $P^i_{p,10M}$ with $p = 2, 4, 6$ we derive a corresponding $f^i_{p,10M}$ and compute f as

$$f^i = \min(f^i_{2,10M},\ f^i_{4,10M},\ f^i_{6,10M})$$

Note that $f^i_{p,s}$ denotes the fraction of phase i that is parallelizable for size s. However, since the size of the input s should not affect the parallelizability of the phase, we can estimate f with s = 10M.
On the other hand, to get m, for each phase $P^i_{p,20M}$ with $p = 2, 4, 8$ we derive a corresponding $m^i_{p,20M}$ and compute m as

$$m^i = \max(m^i_{2,20M},\ m^i_{4,20M},\ m^i_{8,20M})$$

Note that $m^i_{p,s}$ denotes the fraction of phase i that depends on the size of the input, when compared to s = 10M as the base. However, since the size of the input s does not change the fraction of the phase that depends on the size, we can estimate m with s = 20M.
We choose the min for f and the max for m as we would
like to make conservative estimates.
B. Enumerating the Model: Phase 1
In this section, we make the above calculations concrete by walking through the construction of the model for Phase 1, showing how we build the model and then make predictions.
1) Computing f: To compute $f^1$ we use the data for s = 10M and p = 2, 4, 6 (Table VI). Using Amdahl's Law we get
TABLE VI: Phase 1, s = 10M
Threads | Time (seconds)
p = 1 | 4932
p = 2 | 2663
p = 4 | 1529
p = 6 | 1151
$$f = \frac{p}{p-1}\left(1 - \frac{t(P^i_{p,10M})}{t(P^i_{1,10M})}\right)$$

$$f^1_{2,10M} = 0.9201, \quad f^1_{4,10M} = 0.9200, \quad f^1_{6,10M} = 0.9200$$

Thus we get $f^1 = 0.9200$.
TABLE VII: Validate f, Phase 1, p = 8 threads
Configuration | Predicted (seconds) | Actual (seconds)
$P^1_{8,10M}$ | 962 | 963

As a quick validation of our estimation, we use it to predict the value of $t(P^1_{8,10M})$. As we can see in Table VII, our estimation of $f^1$ is pretty much spot on.
TABLE VIII: Phase 1, s = 20M
Threads | Time (seconds)
p = 2 | 4974
p = 4 | 2909
p = 8 | 1867
2) Computing m: To compute $m^1$ we use the data for s = 20M and p = 2, 4, 8 (Table VIII). To evaluate m we define a new term r, the ratio of the size s to 10M. Further, we modify the Amdahl's Law equation slightly so that it computes m:

$$m = \frac{1}{r-1}\left(\frac{t(P^i_{p,20M})}{t(P^i_{p,10M})} - 1\right)$$

$$m^1_{2,20M} = 0.9145, \quad m^1_{4,20M} = 0.9329, \quad m^1_{8,20M} = 0.9387$$

Thus we get $m^1 = 0.9387$.
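The same arithmetic in code form (a sketch; the timings are the Phase 1 measurements from Tables VI, VII, and VIII, so the p = 2 and p = 4 intermediate m values differ slightly from the ones above, which came from separate runs, while the final max agrees):

# Estimate f (parallelizable fraction) and m (size-dependent fraction)
# for Phase 1 from the measured timings.
t_10M = {1: 4932, 2: 2663, 4: 1529, 6: 1151, 8: 963}  # Tables VI and VII
t_20M = {2: 4974, 4: 2909, 8: 1867}                   # Table VIII

def f_est(p):
    # Amdahl's Law solved for the parallel fraction.
    return (p / (p - 1)) * (1 - t_10M[p] / t_10M[1])

def m_est(p, r=2):
    # Size-scaling fraction, with r the input-size ratio (20M / 10M = 2).
    return (1 / (r - 1)) * (t_20M[p] / t_10M[p] - 1)

f1 = min(f_est(p) for p in (2, 4, 6))  # conservative: smallest estimate
m1 = max(m_est(p) for p in (2, 4, 8))  # conservative: largest estimate
print(round(f1, 4), round(m1, 4))      # ~0.9199 and ~0.9387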
C. Evaluating the Pipeline
Finally, we have all the data we need to evaluate the pipeline and make predictions. It is important to understand that we evaluate the pipeline in phases rather than as a whole. Based on our measures for f and m we can now write the time taken by a phase as

$$t(P^i_{p,s}) = t(P^i_{1,10M}) \cdot \left(\frac{f}{p} + (1-f)\right) \cdot \left(m \cdot \frac{s}{10M} + (1-m)\right)$$
TABLE IX: Phases, f, m
Phase | f | m
P1 | 0.9200 | 0.9387
P2 | 0.0000 | 0.9877
P3 | 0.0000 | 0.0000
P4 | 0.0000 | 0.9369
P5 | 0.0000 | 0.0000
P6 | 0.0000 | 0.9230
P7 | 0.9595 | 0.0733
P8 | 0.0000 | 0.8560
P9 | 0.0000 | 0.0000
P10 | 0.5982 | 0.3533
P11 | 0.4096 | 0.6789
P12 (Plot, not measured) | - | -
P13 | 0.4473, 0.1327 | 0.9996
P14 | 0.0000 | 0.0000
Thus we can rewrite our previous equation as follows:

$$\sum_{i=1}^{n} t(P^i_{1,10M}) \cdot \left(\frac{f}{p} + (1-f)\right) \cdot \left(m \cdot \frac{s}{10M} + (1-m)\right) = t(E)$$

It is interesting to note that most of the phases do not exhibit a high degree of parallelism. This indicates that we should not expect improvements if we keep increasing the number of threads. In Section VIII we evaluate our model for accuracy and discuss possible issues and shortcomings. A sketch of the per-phase prediction follows.
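A sketch of the resulting per-phase prediction, using Phase 1's parameters as an example (the 4932 s baseline is the single-thread, 10M-read measurement from Table VI):

# Predict a phase's running time from its single-thread 10M-read baseline,
# its f and m parameters, the thread count p, and the input size s.
def predict(t_base, f, m, p, s_millions):
    amdahl = f / p + (1 - f)                 # thread-scaling term
    size = m * (s_millions / 10) + (1 - m)   # input-size-scaling term
    return t_base * amdahl * size

# Phase 1 with 4 threads on 100M reads: ~14445 s, matching Table X.
print(round(predict(4932, 0.9200, 0.9387, 4, 100)))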
D. Porting the Model
All of the above runs were done on Machine I as described in Table I. One obvious worry is that this model will not be portable and will need to be redone for every machine. While this is a valid concern, we believe the model is structured such that the only machine-dependent, variable parameter is $t(P^i_{1,10M})$. It is evident that $P^i_{1,10M}$ will change based on the machine it runs on, primarily because the new machine might run at a different frequency. We hypothesize that we can essentially port the entire model by adding a multiplier k, which can be computed for each phase as

$$k^i = \frac{M_{new}(P^i_{1,10M})}{M_{ref}(P^i_{1,10M})}$$

where $M(P^i_{1,10M})$ denotes the phase's measured baseline time on machine M.
Thus, we can rewrite our generic equation to be

$$\sum_{i=1}^{n} t(P^i_{1,10M}) \cdot \left(\frac{f}{p} + (1-f)\right) \cdot \left(m \cdot \frac{s}{10M} + (1-m)\right) \cdot k^i = t(E)$$
VIII. EVALUATION
In this section we evaluate our model by comparing our predictions to actual runtimes for s = 10M, 20M, 40M, 100M and for p = 4, 8. Essentially, we'd like to conclusively show that our simple model works surprisingly well. Table X and Table XI show the comparisons between our predictions and the actual runtimes.
To reiterate, we use the model and the equation as developed in Section VII to make the predictions.
TABLE X: Predicted vs. measured values for each phase of the pipeline at different input sizes, running with 4 threads wherever possible. Each size column gives predicted and actual times in seconds.
Phase | 10M Pred | 10M Act | 20M Pred | 20M Act | 40M Pred | 40M Act | 100M Pred | 100M Act
Shuff Algn | 1528 | 1505 | 2964 | 2909 | 5834 | 5779 | 14445 | 14524
Sort Sam | 244 | 244 | 484 | 484 | 966 | 965 | 2412 | 2433
Remove BAM | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4
Dedup Sorted | 299 | 299 | 579 | 579 | 1139 | 1148 | 2820 | 2960
Remove Sorted | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3
Index DeDup | 40 | 40 | 76 | 79 | 150 | 152 | 372 | 375
Indel Targets | 830 | 830 | 891 | 865 | 1012 | 961 | 1377 | 1245
Realn Targets | 393 | 394 | 729 | 730 | 1402 | 1364 | 3420 | 3389
Remove DeDup | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2
Base Covar 1 | 703 | 719 | 952 | 973 | 1449 | 1495 | 2940 | 3023
Base Covar 2 | 1321 | 1330 | 2218 | 2233 | 4011 | 3932 | 9393 | 9402
Plot | - | - | - | - | - | - | - | -
Baseq Recal | 721 | 721 | 1441 | 1379 | 2883 | 2795 | 7207 | 6837
Remove Realn | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
Total | 6085 | 6086 | 10342 | 10235 | 18855 | 18595 | 44396 | 44198
TABLE XI: Predicted vs. measured values for each phase of the pipeline at different input sizes, running with 8 threads wherever possible. Each size column gives predicted and actual times in seconds.
Phase | 10M Pred | 10M Act | 20M Pred | 20M Act | 40M Pred | 40M Act | 100M Pred | 100M Act
Shuff Algn | 961 | 963 | 1864 | 1867 | 3670 | 3728 | 9086 | 9384
Sort Sam | 244 | 244 | 484 | 485 | 966 | 964 | 2412 | 2475
Remove BAM | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 4
Dedup Sorted | 299 | 301 | 579 | 593 | 1139 | 1139 | 2820 | 2935
Remove Sorted | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2
Index DeDup | 40 | 39 | 76 | 78 | 150 | 155 | 372 | 376
Indel Targets | 475 | 454 | 509 | 477 | 579 | 526 | 788 | 685
Realn Targets | 393 | 395 | 729 | 721 | 1402 | 1370 | 3420 | 3402
Remove DeDup | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2
Base Covar 1 | 608 | 613 | 822 | 798 | 1252 | 1134 | 2541 | 2142
Base Covar 2 | 1223 | 1223 | 2054 | 2032 | 3715 | 3594 | 8699 | 8410
Plot | - | - | - | - | - | - | - | -
Baseq Recal | 959 | 959 | 1918 | 1906 | 3835 | 3744 | 9587 | 9178
Remove Realn | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
Total | 5207 | 5195 | 9040 | 8961 | 16713 | 16359 | 39730 | 38996
A. Observations
TABLE XII: Prediction Error
Size | Error (p = 4) | Error (p = 8)
20M | 107 s | 79 s
40M | 260 s | 354 s
100M | 198 s | 734 s

1) To be honest, we are pleasantly surprised at how well our model performs. As we can see in Table XII, the error for p = 4 at s = 100M, which is 10% of the actual data size, is less than 4 minutes.
2) One of the major sources of error seems to be P13 for p = 8. As we noticed while building the model, P13 showed erratic behavior for 8 threads: in fact, when we increased the number of threads to 8, the time actually increased rather than decreased.
3) If one removes the approximately 400-second error contributed by P13 for p = 8, we get much better results.
4) However, it is important to note that even the 734-second error on a runtime of 38996 seconds amounts to an error of under 2%, which we believe is extremely good.
5) More importantly, it is evident that increasing the number of threads does not significantly improve the results. This is supported by the f values that we computed in Section VII.
B. The Kink in the Model
Clearly, there seems to be an issue when we run P13 with 8 threads. The important question, however, is why P13 shows this odd behavior with p = 8: was it a machine-specific anomaly, or is it a tool-specific issue? We suspect that P13 cannot fully utilize the SMT threads and in fact causes contention, leading to worsening behavior. However, due to time constraints we have been unable to verify this; doing so is part of our future work.
IX. ANALYSIS AND PREDICTIONS
Given that we have now successfully created a model, one that seems to be able to accurately mirror the behavior of the pipeline, we'd now like to use it to answer questions: questions about runtimes on other machines, and questions about trade-offs. Another question we would like to be able to answer is how much time it would take to run the full pipeline on the current machine (Machine I) as well as on the newly bought machines, described in Table XIII, that are intended to be part of a cluster.
TABLE XIII: Cluster Node: Machine Description
System | Description
Processor Model | Intel(R) Xeon(R) CPU E5620
Clock Speed | 3.50GHz
No. of Processors | 1
No. of Cores per Processor | 4
RAM Size | 64 GB
Disk | 18 TB
A. Trade-Off: More Threads or More Instances
One of our secondary goals while profiling was to determine what is better in terms of time: running a single instance with more threads, or running more instances with fewer threads. Concretely speaking, is it better to run a single instance of the pipeline with, say, 4 threads twice, as opposed to running two instances with 2 threads each simultaneously?
1) Building an Intuition: Based on our observations and the model, it is apparent that not a lot of the phases in the pipeline have a high degree of parallelizability. Furthermore, for some of the phases it seems that things in fact get worse if we go beyond a particular number of threads.
Hence, considering that there is not much parallelizability to be had beyond a particular point, there is a strong indication that spawning more instances with fewer threads might be better.
2) Potential Issues: One potential issue with spawning multiple instances of the pipeline is resource contention or scheduler conflicts. However, we did a quick run and saw no significant changes in runtime compared to a single instance with the same number of threads. Concretely, we saw that two instances of the pipeline running with 2 threads each took the same time as a single instance of the pipeline running with 2 threads. This indicates that resource contention or scheduler conflicts, if present, do not significantly affect the run.
3) Actual Analysis: As can be seen in Table XIV, it is definitely more beneficial to run multiple instances of the pipeline, as opposed to running a single instance with more threads multiple times. Clearly, there is much advantage in running multiple instances; in fact, it seems that the more instances you can support and run, the better.
TABLE XIV: Analysis at 10M: More Threads vs. More Instances
Threads | Instances | Total Time (s)
p = 2 | 2 | 2663
p = 4 | 1 | 3058 (1529 x 2)

Threads | Instances | Total Time (s)
p = 2 | 3 | 2663
p = 6 | 1 | 3453 (1151 x 3)
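The model reproduces this trade-off directly; a quick sketch using Phase 1's parameters (the predict function simply repeats the per-phase formula from Section VII):

# Two 2-thread instances run concurrently: wall time equals one 2-thread run.
# One 4-thread instance processing two data sets runs them back to back.
def predict(t_base, f, m, p, s_millions):
    return t_base * (f / p + (1 - f)) * (m * (s_millions / 10) + (1 - m))

two_instances = predict(4932, 0.9200, 0.9387, 2, 10)           # ~2663 s wall
one_instance_twice = 2 * predict(4932, 0.9200, 0.9387, 4, 10)  # ~3058 s
print(round(two_instances), round(one_instance_twice))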
B. Prediction for a Complete Run
One of the biggest issues with doing a complete run is that it takes approximately 4-5 days. Hence, it would have been impractical to run our analysis on full runs given the high turnaround time; also, given the limited time at hand, we could not do a complete run ourselves. However, given that we have a model, it only makes sense to make a prediction for the complete run on Machine I. In Table XV we make our predictions based on the model developed in Section VII. We predict that a complete run will take 119 hours, i.e., approximately 5 days.
TABLE XV: Prediction for a Full Run, 4 Threads
Phase | Time (s)
P1 | 143613
P2 | 24102
P3 | 1
P4 | 28032
P5 | 1
P6 | 3695
P7 | 6854
P8 | 33697
P9 | 1
P10 | 25310
P11 | 90118
P12 | 1
P13 | 72072
P14 | 1
Total | 427501
C. Prediction on a Cluster Node
Another interesting problem, as mentioned previously, is porting the model to other machines. For this we proposed the idea of a multiplier k, along with an empirical way of getting $k^i$ for each of the phases. In this case, however, we do not have any relevant data from which to make an empirical estimate.
1) Estimating k: Given that we do not have data to estimate k for each phase, we need another way to estimate it. Considering the similar nature of the two machines, we ask what the potential reasons for the running time to change could be:
1) Higher clock speed
2) Different memory system
3) I/O
However, as we have seen in Section VI, I/O is definitely not a bottleneck and hence not a consideration. Further, in Section VI-B we commented on the fact that the cache behavior remains the same. Hence, we are potentially left with only the clock speed as a possible source of change in running time.
Note that while this might not be entirely accurate, we believe it is a good approximation of what we should expect. Thus we can compute k as a single factor, applied to all phases, from the ratio of the reference and new clock frequencies:

$$k = \frac{f_{M_{ref}}}{f_{M_{new}}} = \frac{1.8}{3.4} = 0.5294$$

Based on this, we estimate that a complete run on the cluster node would take approximately $0.53 \times 119 \approx 63$ hours, i.e., roughly 3 days.
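Continuing the sketch from Section VII, the ported estimate is then just the scaled total (427501 s is the predicted full run from Table XV):

# Scale the Machine I full-run prediction by the single frequency-ratio k.
k = 1.8 / 3.4
total_machine1_s = 427501           # predicted 4-thread full run (Table XV)
print(k * total_machine1_s / 3600)  # ~62.9 hours, i.e. roughly 3 days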
X. LESSONS
The overarching goal of the project was to try and suggest spheres for improvement and to make suggestions on potential trade-offs and hardware suitability. As explained in the previous sections, we think that the tools have been optimally set up, that is, the tools are being used in a way that extracts the most possible out of them, and we have been able to rule out most of the obvious bottlenecks. Further, we believe our ability to model the pipeline fairly accurately reflects that we understand the pipeline, at least as a black box. It also reflects that whatever we find true for smaller artificial data sets will also be applicable to actual data sets.
This said, in this section we would like to briefly talk about the lessons we learned and some things to keep in mind for future work with the pipeline.
A. Lessons on the Pipeline
1) Run More Instances as Opposed to More Threads: We conclusively show that the pipeline as a whole isn't particularly parallelizable. Hence, it makes more sense to run multiple instances of the pipeline with fewer threads simultaneously, as opposed to running a single instance with more threads multiple times.
2) Tools Need Change: One thing that was evident from the entire exercise was that there definitely needs to be an impetus to incorporate more parallelism into the tools. However, there seem to be major issues in the way. First, the algorithms themselves don't lend themselves well to parallelism. Secondly, the bioinformatics community as a whole is incredibly cautious about moving to new tools, primarily because some of the problems these tools try to solve are essentially open problems, which means different tools solve the same problems differently, leading to possibly different results. Thus, unless the tools are proven, the community seems reluctant to adopt them.
3) Hardware Investments: The pipeline as a whole doesn't afford a lot of parallelizability. Hence, there are two ways the system could be built up to accommodate it. The first approach, the one currently being followed, is to buy huge machines with lots of memory and run multiple instances of the pipeline simultaneously. The other approach is to have multiple less powerful machines: instead of a 100 GB quad-core machine, one could possibly make do with less powerful dual-core machines with less memory. It would be worthwhile to investigate whether the latter approach could lead to any savings.
B. Lessons in General
1) (More Cores or More Threads) != Better Performance: One of the biggest takeaways was that more cores or more threads needn't necessarily mean better performance, and hence care should be taken while deciding your configuration parameters.
2) Simple Models Work: Probably the biggest and most important part of the project was the model. The simplicity of the model has possibly been the key to its performance.
ACKNOWLEDGMENT
We would like to thank Prof. Voelker, the guy who knew the guy who had a project for us, without whose guidance and support much of this wouldn't have been possible. We would also like to thank Roy Ronnen and Viraj Deshpande for patiently guiding us through the bioinformatics aspects of the project. Finally, we would like to thank Prof. Snoeren for helping us gain key insights into the project, and furthermore for the course itself.
REFERENCES
[1] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754-1760, 2009.
[2] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078-2079, 2009.
[3] Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9):1297-1303, 2010.
APPENDIX
Graphs of resource usage for the remaining phases of the pipeline, for single-thread and 4-thread runs on the 10M data set, follow.
[Figure] Fig. 13: Single threaded resource usage for Phase 2: Sort SAM. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
[Figure] Fig. 14: 4 threaded resource usage for Phase 2: Sort SAM. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
[Figure] Fig. 15: Single threaded resource usage for Phase 4: DeDup Sorting. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
[Figure] Fig. 16: 4 threaded resource usage for Phase 4: DeDup Sorting. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
[Figure] Fig. 17: Single threaded resource usage for Phase 6: Index DeDup. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
[Figure] Fig. 18: 4 threaded resource usage for Phase 6: Index DeDup. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
[Figure] Fig. 19: Single threaded resource usage for Phase 8: Realign InDels. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
[Figure] Fig. 20: 4 threaded resource usage for Phase 8: Realign InDels. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption: memory (GB) over time (s).
