Published online before print
December 30, 2002, 10.1101/gr.731003
METHODS
The Phusion Assembler
Informatics Department, The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
The Phusion assembler has assembled the mouse genome from the whole-genome shotgun (WGS) dataset collected by the Mouse Genome Sequencing Consortium, at 7.5x sequence coverage, producing a
high-quality draft assembly 2.6 gigabases in size, with 90% of the
bases in 479 scaffolds. For the mouse genome, which is a
large and repeat-rich genome, the input dataset was designed to include
a high proportion of paired-end sequences from size-selected
inserts of various lengths, from 2 to 200 kbp, in various host vector templates.
Phusion uses sequence data, called reads, and information about reads
that share common templates, called read pairs, to drive the assembly
of this large genome to highly accurate results. The preassembly stage,
which clusters the reads into sensible groups, is a key element of the
entire assembler, because it permits a simple approach to
parallelization of the assembly stage, as each cluster can be treated
independently of the others. In addition to the application of Phusion to
the mouse genome, we will also present results from the WGS assembly of
Caenorhabditis briggsae sequenced to about 11x coverage. The
C. briggsae assembly was accessioned through EMBL,
http://www.ebi.ac.uk/services/index.html, using the series
CAAC01000001-CAAC01000578; however, the Phusion mouse assembly
described here was not accessioned. The mouse data were generated by the
Mouse Genome Sequencing Consortium. The C. briggsae sequence
was generated at The Wellcome Trust Sanger Institute and the Genome
Sequencing Center, Washington University School of Medicine.
Whole-genome shotgun (WGS) sequencing is an approach used since the early 1980s (Sanger et al. 1982) ... at 49 kb (Sanger et al. 1982) ... 5
orders of magnitude, so too have the size of genomes considered
tractable using a WGS approach.
Over the last few years, many groups have become involved in developing
WGS assemblers specifically for genomes, or selected portions of
genomes (for example, single chromosomes or groups of chromosomes),
ranging from a few megabases up to multiple gigabases. All of these
assemblers use paired-end sequencing of various sized insert templates
to detect and avoid misassemblies, join contigs together, and guide the
scaffolding of contigs. For a hybrid theory/simulation analysis of the
power of paired-end sequencing, see Siegel et al. (2000).
The Phusion assembler was used to assemble the mouse genome at
Mouse Assembly
For the mouse assembly, the initial set of reads from the Mouse Genome Sequencing Consortium consisted of those listed in the file ftp://ftp.ncbi.nih.gov/pub/TraceDB/mus_musculus/Feb_1_Freeze_Ti_List.gz and BACend reads from TIGR, ftp://ftp.tigr.org/pub/data/m_musculus/bac_end_sequences/mbends. The Feb_1_Freeze_Ti_List file lists the trace identifiers (Ti's) as known to the NCBI Trace Archive Data Base (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?). Of the 40,793,320 reads from the freeze list, 32,042,831 (78.6%) passed the screens for contamination (for example, sequencing vectors, Escherichia coli, and phage) and for minimum acceptable quality, that is, >99 clipped bases with <5% base call errors within the clipped region. The BACend reads from TIGR were used without quality values, and neither clipping nor contamination screening was applied to them. Bad plate pairing detection (see Methods) was applied, which decoupled pairing information for 500 k templates, or about 3% of the total passed
reads. Table 2 shows the distribution of
reads over the different insert size ranges. Unpaired reads from all
libraries are grouped together, as a read without a mate only
contributes its sequence to the assembly.
The 32,496,031 clipped reads comprise 19.3 Gbp, giving a 595-bp average read length, and cover the mouse genome to an average depth of approximately seven. The Phusion read grouping algorithm used a k-mer of 17 bases, ignored words that occurred more than D = 13 times, and initially required a minimum of M = 11 matching k-mers to group reads together. The Phusion algorithm automatically increased M to 20 to satisfy the maximum cluster size of 120,000 reads. Phusion clustered 28.7 M reads into 424 k groups, with 50% of the reads in groups of 287 or more reads. The largest cluster contained 70,059 reads. This grouping stage took 36 h of CPU time on one processor and used 97 Gbytes of memory on a Compaq Alpha GS320 equipped with 128 Gbytes of memory. This quick turnaround from input reads to clusters allowed tuning of the grouping parameters to find optimal settings. Less optimal settings were D = 12 and D = 14, both of which produced more clusters and incorporated fewer reads. A compute farm of 400 CPUs, Compaq Alpha DS10s with 1 Gbyte of memory each, assembled the clusters using RPphrap in 9 h elapsed time, using a total of 132 CPU
days. RPjoin took 24 h to complete and used 70 Gbytes of memory; RPono
took 28 h and 60 Gbytes of memory. The released assembly, which can be
found at
ftp://ftp.sanger.ac.uk/pub/image/tmp/ssahaAssemble/mouse/2002.02.01/,
consists of 2.51 Gbp in 311,577 contigs with an N50 size of 20,121 bp
and a total scaffold size of 2.62 Gbp in 70,427 scaffolds with an N50
size of 6.5 Mbp (N50 is the contig size at
which 50% of the assembled bases are in contigs of this size or
larger). Of the starting set of 32.5 M reads, 29.3 M (90.1%)
are located in this assembly. Of the reads not in the assembly, 2.5 M were not clustered by
the Phusion read grouping stage, 87 k reads were excluded by the RPphrap
stage, and 569 k were removed because they formed scaffolds that were
smaller than 1 kb or contained fewer than three reads. There are
241,150 captured gaps in the scaffolds, totaling 117 Mbp, with an
average size of 486 bp (rounding of the contig and scaffold sizes leads to
the apparent mismatch with the total gap size).
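The N50 measure quoted above can be computed directly from a list of contig or scaffold lengths. The following is a minimal sketch of that definition, not code from Phusion; the function name `n50` and the toy lengths are invented for illustration.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L hold at least
    half of all assembled bases (the N50 definition used in the text)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50() needs at least one length")


# Invented toy lengths: the two longest contigs already hold 180 of 300 bases.
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))
```

Sorting from longest to shortest and accumulating until half of the total is reached is the usual way to evaluate this statistic; the same function applies to scaffold lengths.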
Comparing the assembly to 40 Mbp of finished clones from the same
mouse strain, C57BL/6J, shows that the assembly covers 94% of the bases,
whereas the scaffolds cover 99.7%. Three global scaffolding
errors are indicated by these 40 Mbp. These 40 Mbp also illustrate the
sequence accuracy of the assembly. Because PHRAP is at the heart of the
assembly process, quality values are assigned via this commonly used
assembler, and are expected to be accurately determined (Ewing and
Green 1998).
Because a fingerprint map was built from the same BAC clones as the BAC end sequences from TIGR (see http://genome.wustl.edu/projects/mouse/index.php?fpc=1), integration of this map and the assembly allows placement of the scaffolds onto chromosomal positions. There are 393,470 BACend reads contained in the assembly, and 9775 scaffolds contain one or more BACends. This allowed 95.7% of the bases in the scaffolds to be linked to the FPC map. In some cases, the map and the BACend reads within the scaffolds show conflicting information. This would be expected at locations of global scaffolding error in the assembly. Therefore, when positioning scaffolds onto map coordinates, conflicts were resolved by breaking the scaffolds at the nearest contig boundary.
Caenorhabditis briggsae Assembly
Overall sequence coverage is estimated at 11-fold, thus presenting a different challenge to the Phusion assembler compared with the mouse assembly. The main problem is that at this depth, and with the various repeat structures of the genome, the clustering algorithm tends to group all of the reads together. This is not desirable, as the assembly problem is then not reduced to manageably sized groups. PHRAP can assemble groups of reads as large as a few hundred thousand at a time, but not two million reads within a sensible amount of time and memory. As shown earlier, M was increased to 20 for the 7.5x assembly of mouse to achieve cluster sizes below 120,000 reads. For C. briggsae, the Phusion parameters were set to use a k-mer of 16 bases, to ignore words that occur more than D = 15 times, and to start with a minimum of M = 8 matching k-mers to group reads together. The cluster size dropped below 120,000 reads once M reached 34, and M was incremented further, up to 44, to satisfy the maximum cluster size of 20,000 reads. Therefore, for C. briggsae, D was set lower relative to the depth of sequencing and M was incremented to a higher level than for the mouse assembly.
Results of the assembly are as follows. Of the 2,085,214 decontaminated but unclipped reads, Phusion clustered 1,932,906 reads into 16,206 groups, with 50% of the reads in groups of 394 or more reads. The largest cluster contained 19,292 reads. This grouping stage took 75 min of CPU time and 10 Gbytes of memory running on a single processor of a Compaq Alpha ES40 equipped with 32 Gbytes of memory. These groups were assembled in 2 h elapsed time with RPphrap using 145 nodes of the 400-node CPU farm. Total CPU time used was about 9 d. The RPjoin and RPono stages also proceeded very quickly, taking about 2 h as well. As for the mouse assembly, an FPC fingerprint map of C. briggsae was generated (see http://genome.wustl.edu/projects/cbriggsae/index.php), and the BACends added to the assembly allowed integration of the two data sets.
The cb25.agp8 assembly (see ftp://ftp.sanger.ac.uk/pub/wormbase/cbriggsae/cb25.agp8/) contains 1,945,314 reads (93.3% of the starting set of reads). The N50 contig size is 41 kb and the N50 scaffold size is 1450 kb. On the basis of comparison to the 12 Mb of previously finished sequence, we estimate that the whole-genome shotgun assembly achieved 98% coverage of the C. briggsae genome at the contig level, and no global scaffolding errors were found. The size of the assembled and FPC mapped genome is 102 Mbp in 142 ultracontig pieces, with an additional 6 Mbp, in 436 pieces (many highly repetitive), not placed on the FPC map. In the final sequence, 270 kb of finished fosmid data from 155 accessions were incorporated to bridge scaffold gaps. Because of the absence of dense chromosomal maps for C. briggsae, we cannot assign the ultracontigs to chromosomal locations, and, therefore, cannot give draft chromosome sequences. This assembly was accessioned through EMBL, http://www.ebi.ac.uk/services/index.html, using the series CAAC01000001-CAAC01000578.
This assembly used unclipped reads, whereas the mouse assembly used clipped reads. An earlier assembly of C. briggsae, starting with the same set of reads but with quality trimming applied, resulted in contigs that had a 31-kb N50 measure. Thus, working with unclipped reads improved the assembly in terms of contig length by 32%. This was also tried with the mouse data, and using untrimmed reads increased the contig N50 size from 20 to 25 kb.
The development of the Phusion assembler utilized the sequence from the mouse WGS data set and the C. briggsae data set to test and challenge all parts of the code from its inception in August 2001. At that time, the C. briggsae data set was at 4.5x
coverage and provided a good test set for developing Phusion because it
was very quick to run, with about a 2-h turnaround time. By October 2001,
the mouse data set had reached about that level of coverage, 4x, but
the mouse data set was a large amount of data, thus posing new
challenges for memory use. The arrival of the GS320, with 128 Gbytes of
memory, let those concerns fade away for a while, but the turnaround
time was much longer, typically a few days. The Whitehead Institute
Center for Genomic Research (WICGR) was also actively applying their
ARACHNE assembler to the mouse data, and early on we agreed to use
common assembly output formats to make comparison of the results
easier. We also agreed to use common starting sets so that the assembly
results would not be influenced by different numbers of input reads.
All assemblies can be found at
ftp://ftp.sanger.ac.uk/pub/image/tmp/ssahaAssemble/mouse for the Sanger
Institute, and at ftp://wolfram.wi.mit.edu/pub/mouse_contigs/ for
WICGR. The friendly competition that this dual effort instilled drove
both assemblers to achieve the best possible results. Along the way,
assemblies were selected for further annotation work. The November,
2001 Phusion assembly was selected, and can be seen at
http://genome.cse.ucsc.edu/cgi-bin/hgGateway?db=mm1, whereas for the
February 2002 data set assembly, the ARACHNE assembly was selected. The
differences between the Phusion and ARACHNE assemblies based on the
February, 2002 data set were small when looking at coverage, but at
that time, the ARACHNE assembly had a longer N50 contig size and
scaffold size, and no detectable global scaffold errors. Thus, the
ARACHNE mouse assembly was selected as the basis of the MGSC version 3
assembly for analysis and comparison to the human genome in the main
mouse paper (Mouse Genome Sequencing Consortium 2002).
All of the assemblers that have been applied to large genomes (>100 Mbp) and use read-pair information, that is, Celera, JAZZ, ARACHNE, RePS, and Phusion, use similar methods for the scaffolding stages. More differences arise in how each assembler initially clusters reads, forms alignments, and detects repeat-induced misassemblies. However, these assemblers can be grouped into two classes for this stage. The Celera, JAZZ, and ARACHNE assemblers all compute local alignments between selected reads, whereas RePS and Phusion prepare and select reads such that the alignment problem can be solved by PHRAP. As mentioned in the introduction section, RePS and Phusion are quite similar in approach, although there are differences. One of the defining differences between RePS and Phusion is the way MDRs are applied. In RePS, the MDRs are hard-masked, such that PHRAP initially does not see the sequence in these regions. For Phusion, these regions are not masked. Another difference is that Phusion directly clusters the reads as an integral part of the histogram word analysis, whereas RePS uses a subsequent BLAST stage to cluster reads.
As mentioned in the Results section, one of the ways we improved on the contig N50 size was to use unclipped reads. This works because the unclipped, low-quality ends of the reads do contain many valid k-mer words, and leaving these in allows associations to be made between reads that would not have occurred if these ends were removed. The trade-off is an increase in erroneous k-mer words, which adds predominantly to the number of words seen once (see Figure 2). Because the Phusion clustering stage requires a number of shared words between reads to make an association, these erroneous words would need to occur in a quite improbable way for the untrimmed ends to make new read-read associations.
Even if that were to occur, PHRAP would not assemble these incorrectly associated reads, because the bulk of the good portion of the read would have come from different portions of the genome. Thus, not trimming the low-quality ends of the reads has an overall desirable effect.
The computer requirements for Phusion are substantial. As presented in the Results section, using a k-mer length that represents >10 times as many words as there are bases in the genome is desirable, and storage is needed for all bases, quality values, the sort arrays, and the read relationship matrix. For the mouse genome, the peak memory use was 97 Gbytes, which, for today, is quite a large amount of memory and places the use of this algorithm out of reach for most labs. The next stage, RPphrap, then used 132 CPU days of compute time, which again is quite extreme. However, one should keep in mind that computer specifications keep improving, and what may seem extreme today will be within reach of more labs in the near future. Also, the cost of the computers used for this assembly effort is still a small fraction of the cost to produce the sequence for the mouse genome. The benefit of this approach was a modular system with fast turnaround time for any of the stages; thus, improvements to the algorithms and different initial settings could be tested in a reasonable amount of time.
So far, only the ARACHNE and Phusion assemblers can be compared in a direct way, having used the same initial dataset. As the Trace Archives collect more complete datasets for additional organisms, for example, C. briggsae, Ciona savignyi, and Anopheles gambiae, this will allow more comparisons to be made between assemblers. However, it is a major undertaking to commit to assembling large genomes, and often a lot needs to be known about the processes used in collecting the data. Fortunately, there are numerous auxiliary information fields to describe this information; thus, much of this knowledge can be stored with the data.
The number of genomes that will be sequenced using the WGS approach will surely increase, possibly quite dramatically if sequencing costs continue to drop, thus, the continued improvements in assembly algorithms and comparisons among them will remain an active field of research for quite some time.
Data Preparation: Clip and Screen Reads, Remove Contaminants
The reads from the sequencing process are first screened to remove bad data prior to the start of the assembly process. Reads that are primarily of poor quality are removed completely from the data set. Also, the end portions of reads that are of poor quality are removed. This removal of reads and portions of reads is typically referred to as clipping. Poor quality is determined directly from the sequence quality information as generated by the PHRED (Ewing and Green 1998
Phusion Read Grouping Stage
To generate the histogram, the complete data set is scanned for all k-mer words at all base locations in every read. The reverse complement of each k-mer is also computed at each location. Only k-mers that consist exclusively of the bases A, C, G, and T are considered. For convenience during processing, these bases are converted to the binary values 00, 01, 10, and 11 for A, C, G, and T, respectively. Thus, a k-mer, once converted to this binary representation, can take on any value from 0 to $4^{k}-1$. For
each k-mer and its reverse complement, only the minimum value
of these two words is used. Otherwise, a 2.7 Gbp genome would appear to
be 5.4 Gbp if both strands are taken into consideration.
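The encoding just described can be sketched in a few lines. This is an illustrative reimplementation under the stated 2-bit mapping (A = 00, C = 01, G = 10, T = 11), not the Phusion source; the function name `canonical_kmers` and the rolling-update strategy are assumptions made for the example.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2-bit mapping given in the text


def canonical_kmers(read, k):
    """Yield (start position, value) for every k-mer made only of A, C, G, T.

    The value is the smaller of the 2-bit encodings of the k-mer and of its
    reverse complement, so both strands map to the same word.
    """
    mask = (1 << (2 * k)) - 1          # keeps the lowest 2k bits, i.e. 4**k - 1
    fwd = rev = 0
    valid = 0                          # unambiguous bases seen since the last reset
    for i, base in enumerate(read.upper()):
        code = CODE.get(base)
        if code is None:               # N or other symbol: restart the window
            fwd = rev = valid = 0
            continue
        fwd = ((fwd << 2) | code) & mask
        # The complement of code c is 3 - c; it enters at the high end of rev.
        rev = (rev >> 2) | ((3 - code) << (2 * (k - 1)))
        valid += 1
        if valid >= k:
            yield i - k + 1, min(fwd, rev)


# ACG and its reverse complement CGT map to the same canonical word.
print(list(canonical_kmers("ACG", 3)), list(canonical_kmers("CGT", 3)))
```

Updating the forward and reverse-complement words incrementally avoids re-encoding each k-mer from scratch, which matters when every base position of tens of millions of reads is scanned.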
The k-mer histogram is stored as an array that is
4 After compiling the k-mer histogram to record the number of times each k-mer occurs in the data set, the results are condensed into a second histogram showing the k-mer word-use distribution, indicating the statistical distribution of the number of times k-mer words occur in the data set.
Figure 2 shows in its upper curve an
example of such a word-use distribution histogram for a data set
obtained from
For comparison, the lower curve in the figure is a Poisson distribution
with a mean value of 7. This is the form that the word-use distribution
would have if the genome did not contain repeats and if the sequencing
were error free. At 7x coverage with no sequencing errors, words that
occur once should be 1/23 the number of words that occur 7 times. The
measured value of words seen once is 0.5 G words, which is off scale in
Figure 2, and indicates that the number of erroneous words is
The compilation of the word-use distribution described above is highly valuable, as it identifies all of those reads that relate to repetitive sequence portions. The highly repetitive sequence portions should not be used for determining which reads overlap, as they are not specific to a unique portion of the genome and will thus be a source of spurious alignment attempts in the assembler. The next stage of the process thus excludes from consideration all of the words in the data set that occur more than a certain number of times, D, based on the word-use distribution analysis, in which D will typically be higher than n, the sequencing redundancy factor. The value of D is set so as to capture most of the underlying Poisson distribution of the unique regions of the genome, thereby excluding the repetitive sequence portions that do not uniquely identify any particular portion of the genome or genome section.
The value of D may be set automatically or manually from the word-use distribution. Automatic setting may be performed on the basis of exclusion of all words with an occurrence more than a given factor of the distribution peak, or on the basis of a certain fraction of the distribution that captures a given proportion of the words in the Poisson distribution. With the word-use distribution shown in Figure 2, a value for D of 13 captures 97% of the unique and error-free words in the input set.
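The idealized Poisson comparison used above can be checked numerically. The sketch below assumes a pure Poisson model with mean 7 (no repeats, no sequencing errors); the function names are invented, and the cutoff rule shown is only one possible reading of the "given proportion" criterion. Because the measured distribution in Figure 2 is not an ideal Poisson curve, the cutoffs this model returns need not match the value of D chosen for the real data.

```python
import math


def poisson_pmf(n, mean):
    """Probability of seeing a word exactly n times under a Poisson model."""
    return math.exp(-mean) * mean ** n / math.factorial(n)


def smallest_cutoff(mean, fraction):
    """Smallest D whose cumulative Poisson probability reaches `fraction`."""
    total = 0.0
    d = 0
    while True:
        total += poisson_pmf(d, mean)
        if total >= fraction:
            return d
        d += 1


# Ratio of words seen 7 times to words seen once at 7x coverage:
# 7**6 / 7! which is about 23.3, in line with the 1/23 quoted in the text.
ratio = poisson_pmf(7, 7) / poisson_pmf(1, 7)
print(round(ratio, 1))
print(smallest_cutoff(7, 0.98))
```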
Second Pass: Create Sorted List of Read Associations
Third Pass: Read Clustering by Creating Read-Relation Matrix
The sorted list is now processed to fill a read-relation matrix. In this step, all of the other reads that share any of the same k-mers are determined for each read. Moreover, for each pair of reads associated by at least one common k-mer word, the number of times common k-mers occur is determined. The read-relation data is created by filling a matrix that contains one row for each read. The row location is the index value of the read. The columns are filled with all other reads that share common k-mer words with the row's read, together with the number of times, m, the association was made. Table 5 below shows some example rows from this matrix. The first row in this example relates to the read 25a08.p1c, which shares 232 k-mers with the read 25c12.q1c, 163 with 19c12.q1c, and 135 with 1c05.q1c. Note that the 232 common k-mers between 25a08.p1c and 25c12.q1c may relate to <232 different words. This will be the case if words occur more than once in the two reads.
Note also that sequencing errors will cause a background level of random associations between reads. Fortunately, these random associations will be limited to a relatively small number of words and can, therefore, be filtered out by ignoring associations between reads that do not occur above a certain number of times. This filtering is achieved by setting a threshold value, M. The association is cancelled from the matrix if the number of shared k-mers m between the row's read and the other read is less than M. A value of M = 11 is used in this example. As the value of M is set higher, it becomes increasingly likely that some true associations are removed. As explained further below, removal of weak true associations is in some cases beneficial to the overall assembly process.
The example in Table 5 also shows all other rows that are listed on the first row and an additional row, 16b09.q1c, which is linked from 1c05.q1c. One read, 15d02.p1c, is not followed because it only has five shared k-mers with 25a12.q1c. Table 5 is a very simple example of traversing all links branching from the read 25a08.p1c. The contents of Table 5 are thus a closed set of five reads that are deduced to collectively define a potentially contiguous section of the genome being sequenced. An important point to note here is that multiple sequence alignment is not performed at this stage to create this closed set of reads. The k-mer association approach described above has allowed determination that these five reads may concatenate somehow to form a contiguous section, without having to go through the computational complexity of an alignment process. Therefore, any cluster of reads is defined as the simply connected components of the undirected graph of reads that have an edge between them, in which the edges are defined by pairs of reads that share M or more selected k-mers. A key advantage of grouping the reads into contiguous groups in this way by use of read associations is that it makes subsequent alignment computationally easy. This is because the reads of each cluster can be aligned independently of the reads of any other cluster. The isolated groups of reads can thus be passed onto any assembly algorithm as independent sets. This allows multiple sequence alignment processing of the different clusters to proceed in parallel. Moreover, it means that the assembler is given an alignment problem that is known to be easily soluble because the reads are associated with each other through mostly unique words. Reads that are primarily made up of repetitive words, which cause assembly algorithms the most difficulty, will not become associated with any cluster. 
The cluster size is also controllable, as described below, which allows cluster size to be optimized to the cluster size most efficiently processed by the assembler. The clustering of data prior to alignment means that the process of alignment is confined to groups of reads that are already known to fit together, that is, contiguous read groups. Taking a jigsaw analogy, the clustering may be considered to be the step of creating piles of jigsaw pieces with common patterns and/or colors before trying to fit any individual pieces together. The step of fitting the pieces together may be considered to be analogous to alignment and is only attempted within each pile.
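The clustering step described above amounts to finding connected components in a graph whose edges join reads sharing at least M selected k-mers. The sketch below illustrates this with a union-find structure, which is a choice made for the example and is not stated to be the data structure Phusion uses. The three counts for 25a08.p1c are taken from the Table 5 discussion; the count for the 1c05.q1c to 16b09.q1c link and the attachment of 15d02.p1c to 25c12.q1c are invented for illustration.

```python
from collections import defaultdict


def cluster_reads(reads, shared_counts, min_shared):
    """Group reads into connected components, keeping only edges between
    read pairs that share at least `min_shared` k-mers (the threshold M)."""
    parent = {r: r for r in reads}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]   # path halving
            r = parent[r]
        return r

    for (a, b), m in shared_counts.items():
        if m >= min_shared:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for r in reads:
        groups[find(r)].add(r)
    return sorted(groups.values(), key=len, reverse=True)


reads = ["25a08.p1c", "25c12.q1c", "19c12.q1c", "1c05.q1c",
         "16b09.q1c", "15d02.p1c"]
shared_counts = {
    ("25a08.p1c", "25c12.q1c"): 232,   # from the text
    ("25a08.p1c", "19c12.q1c"): 163,   # from the text
    ("25a08.p1c", "1c05.q1c"): 135,    # from the text
    ("1c05.q1c", "16b09.q1c"): 40,     # invented count, above M
    ("25c12.q1c", "15d02.p1c"): 5,     # weak link, below M = 11 (placement assumed)
}
for cluster in cluster_reads(reads, shared_counts, min_shared=11):
    print(sorted(cluster))
```

With M = 11 the weak five-word association is ignored, so the first five reads form one closed set and 15d02.p1c is left on its own, mirroring the outcome described for Table 5. No alignment is computed at any point.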
Deliver Clusters to Assembler
Iterative Recomputation to Adjust Cluster Size
At the other extreme, it may also occur that the initial analysis results in a large number of very small clusters. In this case, the initial analysis is not optimized for alignment and recomputation is desirable. This can be countered using a lower value of M and/or a higher value for D. Considering the case of only one large cluster, or an undesirably low number of large clusters, an increase in the value of M can be implemented within the algorithm whenever a given maximum cluster size, C, is exceeded. For these clusters, M is incremented iteratively until the cluster sizes drop below C. In a typical example, M starts at a value of 11 and is then incremented in steps of 2, up to a limit of 50, until the desired maximum cluster size C is no longer exceeded. Using M as the adjustable parameter for varying cluster size keeps recomputation to a minimum, as recompilation of the read indices of each common k-mer is not necessary, and as the recomputation can be confined to breaking down only the large clusters. Clusters smaller than C reads do not need to be recomputed. If D is used as the adjustable parameter for varying cluster size, more recomputation is required than when adjusting M, as it necessitates returning to the second pass stage to create the lists of all read indices of each k-mer that occurs fewer than the new value of D times, and then recomputing the read-relation matrix.
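The recomputation loop on M can be sketched as follows. This is a simplified illustration under assumptions: the shared k-mer counts are held in a plain dictionary, clusters are found with union-find, and the names `components` and `split_large_clusters` are invented. Only clusters that exceed the maximum size C are re-clustered at the next value of M, as the text describes.

```python
from collections import defaultdict


def components(reads, shared_counts, min_shared):
    """Connected components over edges with at least `min_shared` shared k-mers."""
    parent = {r: r for r in reads}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for (a, b), m in shared_counts.items():
        if m >= min_shared:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for r in reads:
        groups[find(r)].add(r)
    return list(groups.values())


def split_large_clusters(reads, shared_counts, start_m, max_m, step, max_size):
    """Raise M in `step` increments, but only inside clusters larger than max_size."""
    finished = []
    pending = [(set(reads), start_m)]
    while pending:
        group, m = pending.pop()
        # Restrict the counts to pairs lying entirely inside this group.
        local = {pair: c for pair, c in shared_counts.items()
                 if pair[0] in group and pair[1] in group}
        for part in components(group, local, m):
            if len(part) > max_size and m + step <= max_m:
                pending.append((part, m + step))   # too big: retry with larger M
            else:
                finished.append(part)
    return finished


# Invented toy data: the weak b-c link (12 shared words) is cut once M reaches 13.
toy_counts = {("a", "b"): 30, ("b", "c"): 12, ("c", "d"): 30}
result = split_large_clusters("abcd", toy_counts, start_m=11, max_m=50,
                              step=2, max_size=2)
print(sorted(sorted(part) for part in result))
```

Because the shared counts are reused unchanged at every value of M, nothing from the earlier passes has to be rebuilt, which is the advantage the text gives for adjusting M rather than D.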
RPphrap
Once the read clusters are formed, each cluster, along with its read-pair
information, read-sequence data (both ends if available), and quality
values, is assembled independently using PHRAP at the heart of a
master program that applies read-pair information in an iterative way.
The version of PHRAP that we use, version 0.990319, is not capable of
using read-pair information. This master program, RPphrap, uses
read-pair information to split PHRAP-generated contigs at locations
that show read-pair insert size consistency violations. For
PHRAP-generated contigs that contain one but not both reads of a read
pair, the missing read is projected out from its mate to its expected
position, and if that position is within 1 standard deviation of
overlapping the contig, then that read is added to the set of reads in
the contig. This process of splitting and extending is shown in Figure
3. This process is applied to all reads in
each contig and all new groups of reads are reassembled at the next
RPphrap iteration. We add mates into contigs in this controlled way, so
that the contig extension is controlled. If all mates were added
without regard to estimated placement, then disconnected groups of
reads may form new contigs that are isolated from the current contig.
For example, if a contig is 2-kb long to start with, and let's say all
reads were from read pairs that span 10 kb, then an uncontrolled
inclusion of these mates could form two new contigs
Bad Plate-Pairing Detection
As RPphrap is running, if a read is assembled into a contig and its mate should also be within the limits of the contig, but is not there, a warning message is generated. After all clusters have passed through RPphrap, statistics are measured on pairing failure rates, and when these strongly correlate with potential laboratory tracking errors, then these sets of read pairs are decoupled, and the entire RPphrap assembly process is rerun. For example, some laboratories produce separate forward and reverse direction sequencing reaction plates for a given set of templates. If these two plates get incorrectly labeled so that they are no longer tracked as originating from the same template, then RPphrap will generate many warning messages for these reads, as the read pairs will most likely not be grouped into common contigs.
RPjoin
Therefore, this stage, RPjoin, first looks for shared reads among all contigs. For all pairs of overlapping contigs, a merging process intermeshes the reads and splices the sequence together. The locations of all of the reads are readjusted to their new contig location. Because the assembly process is not perfect, some reads are assembled into the wrong locations. When RPjoin finds two contigs that share a common read, but the sequence data do not agree over the extent of the contigs if placed according to the contig locations of this read, the read is removed from the smaller of the two contigs. Once the shared reads are completely depleted, that is, no read appears more than once in the assembly, a second type of contig merging is applied, which looks for inferred overlap based on read pairs that span contigs. For a given contig, all of the reads in that contig that have mates in another contig are measured for inferred placement of the other contigs. Because the location and orientation of each read is known, a contig-contig gap size is computed. As multiple read pairs may indicate an association between two contigs, an average gap size is computed. If this gap size is negative, this indicates a potential overlap and triggers RPjoin to look for sequence similarity at the overlapping ends. If this is found and consistent, then these contigs are merged in the same way as in the shared read stage above.
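The contig-contig gap estimate from spanning read pairs can be illustrated with simple arithmetic. The sketch below assumes one particular geometry that the text does not spell out: contig A lies to the left of contig B, the read in A points toward B and starts at offset `start_in_a` from the left end of A, and its mate ends at offset `end_in_b` from the left end of B. The function names and the toy numbers are invented.

```python
from statistics import mean


def implied_gap(insert_size, len_a, start_in_a, end_in_b):
    """Gap between contigs A and B implied by one read pair.

    The insert covers the tail of A after the read start, the gap,
    and the head of B up to the end of the mate.
    """
    return insert_size - (len_a - start_in_a) - end_in_b


def mean_gap(links, len_a):
    """Average implied gap over all read pairs linking the two contigs."""
    return mean(implied_gap(size, len_a, start, end) for size, start, end in links)


# Invented toy links: (expected insert size, start offset in A, end offset in B).
spanning = [(4000, 3000, 1500), (4000, 3500, 2100)]
overlapping = [(2000, 4000, 1200)]
print(mean_gap(spanning, len_a=5000))      # positive: a real gap
print(mean_gap(overlapping, len_a=5000))   # negative: contig ends may overlap
```

A negative average is the signal that would prompt a check for sequence similarity at the contig ends before merging.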
RPono
The iterative approach using increasing allowed gap sizes eliminates the need to fill in large gaps, which would arise had this process been attempted in a purely greedy fashion. For example, two large contigs that span one small 2-kb contig may have many more links joining them together than either has to the smaller contig, and a greedy method without a maximum gap size limit would join the large contigs together first. Using the maximum gap size, one of the two larger contigs will link in the smaller contig first, as its spacing would not exceed the maximum gap size. This method is applied with a typical maximum gap size increasing through 1, 2, 4, 10, 20, and 40 kb. The end point depends on the initial N50 contig size and the insert sizes used. For example, if only 2-kb inserts were used, advancing beyond a 2-kb maximum gap size would not change the outcome of this process.
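The effect of the increasing gap-size schedule can be shown with a toy model. This is a heavily simplified sketch, not RPono itself: each link is an invented tuple (left contig, right contig, estimated gap in bp, number of supporting read pairs), each contig end may be used once, and orientation and cycle checks are omitted.

```python
def scaffold_joins(links, gap_schedule):
    """Accept joins best-supported first, but only among links whose gap
    fits the current maximum; the maximum grows through `gap_schedule`."""
    right_end_used, left_end_used, joins = set(), set(), []
    for max_gap in gap_schedule:
        candidates = [link for link in links if link[2] <= max_gap]
        for left, right, gap, _support in sorted(candidates, key=lambda l: -l[3]):
            if left in right_end_used or right in left_end_used:
                continue                     # that contig end is already joined
            right_end_used.add(left)
            left_end_used.add(right)
            joins.append((left, right, gap))
    return joins


# Invented example: two large contigs (L1, L2) flank a small contig S.
links = [("L1", "S", 500, 3), ("S", "L2", 700, 3), ("L1", "L2", 3200, 20)]
schedule = [1000, 2000, 4000, 10000, 20000, 40000]   # 1, 2, 4, 10, 20, 40 kb
print(scaffold_joins(links, schedule))   # S is linked in before L1-L2 is considered
print(scaffold_joins(links, [40000]))    # one unrestricted pass: S is left out
```

In the unrestricted pass the heavily supported L1 to L2 link is taken first and the small contig can no longer be placed, which is the situation the schedule is meant to avoid.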
Contamination Detection
For example, if three centers produce equal amounts of WGS data, and
one center improperly pooled DNA from a human clone together with
whole-genomic mouse DNA, then the resulting sequence reads from that
mixture would produce assemblies of the human clone that would lack
sequence reads from the other centers. Thus, to detect suspected
contamination-derived contigs, one only needs to count the origin of
the reads in each contig, and if that is purely from one center, then
the probability that the contig is contaminant is
$1 - 0.3333^{n}$, where $n$ is the number of reads in the contig.
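The read-origin count described above is easy to sketch. The example assumes that the garbled expression in the text stands for one minus (1/3) raised to the number of reads in the contig, for three centers contributing equal amounts of data; that reading, the function name, and the center labels are assumptions, not details confirmed by the source.

```python
from collections import Counter


def contaminant_probability(read_centers, n_centers):
    """Return 1 - (1/n_centers)**n when all n reads of a contig come from a
    single center, and 0.0 when reads from several centers are present.

    Assumes every center contributed an equal share of the reads.
    """
    origins = Counter(read_centers)
    if len(origins) != 1:
        return 0.0
    return 1 - (1 / n_centers) ** len(read_centers)


# Hypothetical contigs, each listed as the center of origin of its reads.
single_center = ["center_A"] * 5
mixed = ["center_A", "center_B", "center_A", "center_C"]
print(contaminant_probability(single_center, n_centers=3))
print(contaminant_probability(mixed, n_centers=3))
```

Under this reading, even a handful of reads drawn exclusively from one center makes a contig strongly suspect, whereas a single read carries little evidence.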
Availability
ftp://ftp.ncbi.nih.gov/pub/TraceDB/mus_musculus/Feb_1_Freeze_Ti_List.gz; Initial dataset for mouse.
ftp://ftp.sanger.ac.uk/pub/image/tmp/ssahaAssemble/mouse/; Phusion mouse assembly.
ftp://ftp.sanger.ac.uk/pub/wormbase/cbriggsae/cb25.agp8/; Phusion C. briggsae assembly.
ftp://ftp.tigr.org/pub/data/m_musculus/bac_end_sequences/mbends; Mouse BACend sequence.
http://genome.cse.ucsc.edu/cgi-bin/hgGateway?db=mm1; Phusion assembly.
http://genome.wustl.edu/projects/cbriggsae/index.php; Caenorhabditis briggsae.
http://genome.wustl.edu/projects/mouse/index.php?fpc=1; Mouse Genome.
http://trace.ensembl.org/; Ensembl Genome Server.
ftp://wolfram.wi.mit.edu/pub/mouse_contigs/; ARACHNE mouse assembly.
http://www.ebi.ac.uk/services/index.html; EBI Services.
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?; Trace DB.
http://www.phrap.org/; Genome Software Development Page.
We thank TIGR for providing the mouse BACend sequences, MGSC for funding, the participating sequencing centers for generating the WGS data, and the groups involved in producing the physical map of the mouse genome (Gregory et al. 2002).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
1 Corresponding author. E-MAIL jcm{at}sanger.ac.uk; FAX 44-1223-494-919 Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.731003. Article published online before print in December 2002.
DNA. J. Mol. Biol. 162: 729-773.
Received August 28, 2002; accepted in revised form November 5, 2002. 13:81-90 © 2003 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/03 $5.00