Genome Res. 14:721-732, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Resources
The Atlas Genome Assembly System
Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
Atlas is a suite of programs developed for assembly of genomes by a "combined approach" that uses DNA sequence reads from both BACs and whole-genome shotgun (WGS) libraries. The BAC clones afford advantages of localized assembly with reduced computational load, and provide a robust method for dealing with repeated sequences. Inclusion of WGS sequences facilitates use of different clone insert sizes and reduces data production costs. A core function of Atlas software is recruitment of WGS sequences into appropriate BACs based on sequence overlaps. Because construction of consensus sequences is from local assembly of these reads, only small (<0.1%) units of the genome are assembled at a time. Once assembled, each BAC is used to derive a genomic layout. This "sequence-based" growth of the genome map has greater precision than with non-sequence-based methods. Use of BACs allows correction of artifacts due to repeats at each stage of the process. This is aided by ancillary data such as BAC fingerprint, other genomic maps, and syntenic relations with other genomes. Atlas was used to assemble a draft DNA sequence of the rat genome; its major components including overlapper and split-scaffold are also being used in pure WGS projects.
The most effective and economical way to provide comprehensive information about an organism is through a whole-genome sequencing project. This perspective and the Human Genome Project (HGP) have advanced DNA sequencing technology so that the raw data can be generated for virtually any genome in a timely and affordable manner. Several different genome assembly strategies have also been developed, to efficiently merge the primary data to construct an ordered and accurate final sequence. Specialized assembly software underlies each of these strategies and defines the precise qualities of the data to be generated.
The current focus of large genome projects is on producing draft sequences. Initial definitions of a "draft" required only relatively low genome sequence coverage (Bouck et al. 1998
The whole-genome shotgun (WGS) method, involving sequencing of clones randomly selected from libraries of genomic DNA with different insert sizes, has found widespread acceptance. WGS "reads" are assembled using information from sequence overlaps, pairing of reads from each end of subclone inserts, and distances between paired reads from libraries of different insert sizes. This method produced the first complete sequence of a free-living organism, the bacterium Haemophilus influenzae (Fleischmann et al. 1995
An alternative approach for genome assembly, illustrated by the "map first, sequence later" philosophy of the HGP, is the "clone-by-clone" (CBC) method. A clone map of the genome is first constructed using restriction enzyme digestion fingerprinting and specialized software (Soderlund et al. 1997
To exploit the merits of both the WGS and CBC methods, a "combined approach" for genome assembly was developed for the Rat Genome Sequencing Project (RGSP) (Rat Genome Sequencing Project Consortium 2004
In the current communication, we present the design, implementation, and operation of the Atlas assembly system that enabled the combined approach to genome assembly. We include selected results obtained in the RGSP demonstrating the effectiveness of our approach (Rat Genome Sequencing Project Consortium 2004
The overall scheme of Atlas, summarized in Figure 1 and Table 1, incorporates an upstream phase of data preparation and localized assembly and a downstream phase of large-scale assembly and mapping. The upstream phase produces combined WGS and BAC read assemblies in four steps: (1) data preparation, including quality checks and read trimming; (2) tabulation of k-mers to identify repetitive sequences; (3) computing overlaps between reads; and (4) assembling WGS and BAC reads into "enriched BACs" (eBACs) and scaffolds, usually one per BAC. The down-stream phase rebuilds the enriched BACs into a genome assembly in four additional steps: (5) identifying bactigs as groups of overlapping BACs; (6) reassembling eBACs into new contigs and scaffolds for each bactig; (7) building superbactigs by linking bactigs; and (8) building ultrabactigs and chromosomes.
Step 1: Trimming Reads Reads are trimmed to a standard determined by the RGSP. The standard requires identifying windows of 50 bases, scanning in from each end of the read, with no ambiguous or contaminant (e.g., vector) bases and <1.25 expected errors. The passed region is from the beginning of the first such window to the end of the last such window, and bases not located within or between the windows are removed (Fig. 2). After trimming, WGS reads with <100 bases and BAC reads with <50 bases are omitted from the overlap analysis. WGS reads require longer passing sequences so that their assignment to BACs in later stages can be confirmed with a higher confidence. Trimming ensures that only high-quality bases are used in the next phases of the process. Ultimately, when local assemblies of groups of reads are performed using Phrap, the entire sequence called by Phred is used (Ewing and Green 1998
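The windowed trimming rule described above can be sketched as follows. This is a minimal illustration and not the Atlas code: it assumes Phred-style quality values (error probability 10^(-Q/10)), and it assumes that ambiguous bases appear as "N" and masked contaminant bases as "X". The function names are hypothetical.

```python
from typing import List, Optional, Tuple

WINDOW = 50                 # window length in bases (from the text)
MAX_EXPECTED_ERRORS = 1.25  # a window must stay strictly below this (from the text)
BAD_BASES = set("NX")       # assumption: 'N' = ambiguous, 'X' = masked contaminant


def expected_errors(quals: List[int]) -> float:
    """Sum of per-base error probabilities implied by Phred-style scores."""
    return sum(10 ** (-q / 10) for q in quals)


def window_passes(bases: str, quals: List[int], start: int) -> bool:
    """True if the 50-base window at `start` has no bad bases and few expected errors."""
    end = start + WINDOW
    if any(b in BAD_BASES for b in bases[start:end].upper()):
        return False
    return expected_errors(quals[start:end]) < MAX_EXPECTED_ERRORS


def trim_read(bases: str, quals: List[int]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the passed region, or None if no window passes.

    The passed region runs from the start of the first passing window
    to the end of the last passing window.
    """
    starts = [s for s in range(len(bases) - WINDOW + 1)
              if window_passes(bases, quals, s)]
    if not starts:
        return None
    return starts[0], starts[-1] + WINDOW


def keep_for_overlap(trimmed_length: int, is_wgs: bool) -> bool:
    """WGS reads need at least 100 trimmed bases, BAC reads at least 50."""
    return trimmed_length >= (100 if is_wgs else 50)
```

Checking every window start is quadratic in the worst case; a production implementation would more likely keep a running sum, but the simple form makes the rule easier to compare with the description.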
Step 2: Counting k-mers Analysis of genome-wide oligonucleotide frequencies provides estimates of genome size and the true depth of sequence coverage, and guides repeat suppression (Kim and Segre 1999
The table of k-mers and their frequencies is next used to guide overlapper in its selection of pairs of reads to align. To suppress overlaps based on repeated sequence and to avoid testing all O(n^2) pairs from n reads, overlapper will only compare reads sharing a "rare k-mer," defined as having a WGS frequency less than some fixed R. The memory required to store a frequency table, roughly 10 bytes per distinct k-mer (plus 20% hash-table overhead), mandates careful planning. By recording only those k-mers with at least R copies in the WGS, we keep the table small. For the RGSP, using R = 10 gives a table with 59 million distinct k-mers, or 2.1% of the total, and allows us to track an estimated 92% of repetitive copies in the genome. (Setting R = 9 would almost double the size of the table.) When sufficient memory is available for overlapper jobs, we can set R low enough to include borderline k-mers as well as those that are clearly repetitive.
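The idea of storing only k-mers with at least R copies, and treating everything absent from the table as rare, can be illustrated with a small sketch. It is not the Atlas implementation: it counts k-mers on one strand only and holds all counts in memory before filtering, both simplifying assumptions, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurrence across all reads (single strand only)."""
    counts: Counter = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def repeat_table(counts: Counter, r: int) -> Dict[str, int]:
    """Keep only k-mers seen at least r times; this is the table that is saved."""
    return {kmer: n for kmer, n in counts.items() if n >= r}


def is_rare(kmer: str, table: Dict[str, int]) -> bool:
    """A k-mer absent from the saved table has frequency below r, so it is rare."""
    return kmer not in table
```

For example, with the reads "ACGTACGT" and "ACGTTTTT", k = 4 and R = 3, only "ACGT" (three copies) is stored, and "TTTT" (two copies) is treated as rare.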
Step 3: Finding Read Overlaps: The overlapper
Key issues for overlapper are run times in limited memory, detecting true genomic overlaps (sensitivity), rejecting alignments because of repeated sequences (specificity), and recording details that will help later resolution of borderline cases. For the rat genome, we aimed for computation on 40 million reads in The rarity heuristic based on the saved k-mer table is key to overlapper success. The WGS-based k-mer counts provide a genome-wide view of repetitive sequences that can be loaded into each overlapper job. overlapper parameters for the rarity heuristic include R', the minimum count kept in its internal k-mer frequency table (no smaller than the R used when creating the saved frequency table) and Y, the maximum count for a k-mer that will seed an overlap. Each pair of reads sharing a k-mer with frequency less than or equal to Y will be aligned, using the rarest k-mer as a seed (k-mers with frequency <R' are treated as having the same frequency, and ties are broken arbitrarily). Instead of comparing each of n reads with all others, an O(n^2) task overall, or with hundreds of reads that share any (possibly repetitive) k-mer, we compare each read with an average of <Y other reads. For the RGSP, overlapper was able to meet our runtime efficiency goals even with Y set to 100. Although very efficient, the rarity heuristic cannot by itself be made sufficiently sensitive and specific for following stages. In the RGSP, imposing a seed frequency cutoff Y of 8 would exclude almost 5% of the genomically unique k-mers, limiting sensitivity. But that is still too high for specificity; setting Y even at 6 would let through almost 2% of repeated k-mer occurrences, causing an even larger percentage of false overlaps. The additional banded alignment process allows us to use a larger value for Y for identifying candidate overlaps, yet still reject false overlaps based on the quality of alignment.
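A rough sketch of the rarity heuristic follows: reads are indexed by k-mers whose frequency does not exceed Y, and each candidate pair is seeded by its rarest shared k-mer. The data layout and names are assumptions made for illustration, not the overlapper's internals. K-mers missing from the frequency table are given the nominal frequency R' - 1, so that all of them tie, as the text describes.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Tuple


def candidate_pairs(reads: Dict[str, str], k: int,
                    repeat_counts: Dict[str, int],
                    r_prime: int, y: int) -> Dict[Tuple[str, str], Tuple[str, int]]:
    """Map each candidate read pair to (seed k-mer, seed frequency)."""
    def freq(kmer: str) -> int:
        # Untracked k-mers all share one nominal frequency below R'.
        return repeat_counts.get(kmer, r_prime - 1)

    # Index reads by every k-mer that is rare enough to seed an overlap.
    index = defaultdict(set)
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if freq(kmer) <= y:
                index[kmer].add(read_id)

    # For each pair sharing a seedable k-mer, remember the rarest one.
    seeds: Dict[Tuple[str, str], Tuple[str, int]] = {}
    for kmer, read_ids in index.items():
        f = freq(kmer)
        for a, b in combinations(sorted(read_ids), 2):
            best = seeds.get((a, b))
            if best is None or f < best[1]:
                seeds[(a, b)] = (kmer, f)
    return seeds
```

Each returned pair would then go to the banded alignment step, which accepts or rejects the overlap on alignment quality.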
overlapper's quality heuristic computes an end-to-end banded alignment of the reads in each candidate overlap (Chao et al. 1992
Step 4: Production of eBACs: The binner and split-scaffold
A quantitative analysis of yield of WGS reads at each of these steps is presented in Table 2. This summarizes the construction of eBACs for the RGSP. The expected WGS "catch," considering the number of usable WGS reads, and the average BAC size of 208.8 kb (from FPC data) as a fraction of the total genome size, is 1638 reads. Additional WGS reads, from WGS-WGS overlaps, extending as much as 10 kb off the ends of each BAC, are also recruited. This provides a mechanism to fill gaps in the BAC tiling path with the corresponding genomic sequence. The average catch of 1675 reads is thus very near the expected value for a BAC plus WGS margin, with a standard deviation of 20%.
The reads in each BAC bin are next assembled using Phrap, generating an initial set of contigs. A comparison with finished BACs showed that the default options for Phrap gave acceptable results (see Tuning under Methods). However, contigs generated by Phrap are improved by using read-pair information to detect misjoins. Phrap does not use read-pair information, thus we developed the split-scaffold tool to use read pairs to identify misjoins within contigs and split them. After splitting, reads are also removed from contig ends where they conflict with scaffolding and lack a read pair mate inside the contig. Finally, split-scaffold creates scaffolds: contigs that are linked because they each contain one read from a read pair. Two such links are required to consider contigs as reliably joined in a scaffold.
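The two-link rule for scaffolding can be sketched as below. This is a simplified illustration rather than split-scaffold itself: it only groups contigs and ignores orientation, ordering, and insert-size constraints. The input layout (a read-to-contig map and a list of read pairs) is assumed for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_scaffolds(contigs: List[str],
                    read_to_contig: Dict[str, str],
                    read_pairs: Iterable[Tuple[str, str]],
                    min_links: int = 2) -> List[List[str]]:
    """Group contigs that are joined by at least `min_links` read pairs."""
    # Count read pairs whose two ends fall in different contigs.
    links: Counter = Counter()
    for read_a, read_b in read_pairs:
        ca, cb = read_to_contig.get(read_a), read_to_contig.get(read_b)
        if ca is None or cb is None or ca == cb:
            continue
        links[frozenset((ca, cb))] += 1

    # Union-find over contigs, merging only well-supported links.
    parent = {c: c for c in contigs}

    def find(c: str) -> str:
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for pair, n in links.items():
        if n >= min_links:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())
```

With two read pairs linking c1 to c2 and a single pair linking c2 to c3, the result is one scaffold holding c1 and c2, with c3 left on its own.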
During scaffolding, the contigs/scaffolds at the ends of the BAC insert are identified using several methods, such as searching for BAC end sequence reads, a read pair with one genomic and one vector read, or a chimeric read that is part vector and part genomic (Chen et al. 2004
The BAC-Fisher
Step 5: Generation of Bactigs Bactigging and its associated quality checks are designed to reduce misassemblies, such as those caused by genome duplications, which cannot be solved based on reads alone. Because most genome duplications are smaller than a BAC clone, BAC clone pairs with copies of the duplicated region are distinguished owing to divergence in the remainder of their eBAC assemblies. Moreover, unlike the reads alone, each BAC clone has additional information, such as FPC fingerprint pattern and STS marker content. The accuracy of bactigging is assessed by consistency with the FPC and radiation hybrid STS marker maps. Potential errors can be identified and corrected before the bactig assembly. Such comprehensive quality control ensures assembly quality but is more difficult with other WGS assembly methods. The process of constructing bactigs begins with identifying candidate pairs of overlapping BACs. Rather than conduct all-against-all sequence comparison among enriched BACs, we first identify candidate BAC overlaps based on WGS and FPC information. The eBAC assemblies of overlapping clones will generally share WGS reads; this criterion contributed 33,146 potential BAC overlaps for the RGSP. However, shared WGS reads will be largely suppressed where the shared region is highly repetitive. To avoid missed overlaps, an additional 724 candidate pairs of BACs were identified in the RGSP by linking with two or more read pairs (from WGS, including BAC-end sequences) or by having similar FPC patterns.
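Finding candidate BAC overlaps from shared WGS reads, without an all-against-all sequence comparison, can be done by inverting the BAC-to-read assignment. The sketch below assumes each eBAC is represented by the set of WGS read identifiers recruited into it. The `min_shared` threshold is a hypothetical parameter; the text does not state how many shared reads were required.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def candidate_bac_pairs(bac_reads: Dict[str, Set[str]],
                        min_shared: int = 1) -> Dict[Tuple[str, str], int]:
    """Return BAC pairs sharing at least `min_shared` WGS reads, with the count."""
    # Invert the mapping: for each read, which eBACs recruited it?
    read_to_bacs = defaultdict(set)
    for bac, reads in bac_reads.items():
        for read_id in reads:
            read_to_bacs[read_id].add(bac)

    # Every read placed in two or more eBACs supports each pair among them.
    shared: Counter = Counter()
    for bacs in read_to_bacs.values():
        for a, b in combinations(sorted(bacs), 2):
            shared[(a, b)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}
```

The candidates produced this way would still need to be confirmed by sequence alignment, and pairs found through read-pair links or FPC similarity would be added separately.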
Candidate overlapping pairs are evaluated by alignment using BLASTZ (Schwartz et al. 2003
Two further steps are implemented to eliminate false overlaps caused by large-scale, low-copy genome duplications, one of the most difficult types of errors for assembly programs to deal with. First, overlapping BAC pairs are checked for their consistency with other BAC overlaps. Conflicting sets of overlaps are further analyzed for FPC patterns and links between BAC-end reads in the assembly, because these two types of data could potentially extend beyond a duplicated region. An overlap is accepted if it is supported by FPC or BAC end pairs uniquely among potential overlaps. To estimate the specificity of this method, a computer simulation was conducted using 10,000 pairs of BACs that do not overlap to calculate the false-positive rate not excluded by FPC pattern or BAC end read links. Simulations showed that the false-positive error rate is reduced to <0.1% using either FPC or BAC end sequences. Conflicting sets of overlaps that cannot be resolved by these approaches are excluded from subsequent analysis, resulting in gaps in the tiling path. This creates artificial duplications when the overlap is real; these are dealt with later. A bactig is constructed for each remaining set of overlapping BACs. A bactig represents a single contiguous genome region covered by BAC clones, and can often be assembled into a single scaffold. One remaining problem for bactig construction is the artificial duplications caused by unresolved clone overlaps. Most of these errors are fixed by implementing a feedback loop, to identify true overlapping clones at a later assembly stage, with the aid of long-range information such as RH markers. Previously excluded BAC clone overlaps are considered real if the two clones are located adjacent to each other in a higher order structure, the superbactigs, which are formed by linking bactigs together through read pairs and confirmed by RH markers (see Step 7 for detail).
With this downstream evidence, the bactig is recomputed and subsequent steps are repeated.
A final comprehensive quality control check is implemented to catch rare exceptions that the above filtering misses. The order of BACs in each bactig is compared with independently generated maps. In the case of the RGSP, this included the FPC assembly, the rat radiation hybrid STS map, and the human and mouse genomes, to uncover potential errors such as misjoins. About 60% of the RGSP BACs could be mapped with one or more STS markers, and In the current rat assembly, 21,689 BACs are grouped into 1607 bactigs, whereas 224 BACs remain as singletons. Among the singletons, 40 are synthetic projects (see Data Quality Assessment in Methods). A quality assessment of RGSP bactigs was made on the bactigs covering two QTL regions, Rf1 and MCS1, for which physical maps of the clone layout had been generated independently. We found only 32 Atlas bactigs in disagreement with the physical map; in consultation with the RGSP Consortium, we determined that the bactigs were more likely to be correct. In sum, very-high-quality bactigs were constructed that were then used as units for the subsequent assembly.
Step 6: Bactig Assembly: Wave Grouping and rolling-phrap
The same splitting and scaffolding process is applied as in the eBAC assemblies, with split-scaffold tuned to take better advantage of long insert pairs (50-kb inserts, BAC ends), which are more useful at the megabase scale of bactigs (see Methods). For the RGSP, a total of 6247 scaffolds are obtained for all the bactigs including 2745 single contig scaffolds. On average, 1.9 multicontig scaffolds were obtained for each bactig. The improvement in overall scaffold size by split-scaffold is significant. When 39 BACs that overlapped finished regions in the genome were analyzed by comparison to the finished sequences, the N50 for scaffolds increased from 974 kb to 1666 kb with the splitting and trimming procedure.
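For reference, the N50 statistic quoted above is conventionally the length L such that scaffolds of length L or greater contain at least half of the total assembled bases. A minimal sketch of that standard definition follows; the exact convention used for the RGSP figures is assumed rather than confirmed by the text.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length at which the running total of sorted lengths reaches half the sum."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 requires at least one length")
```

For scaffold lengths 2, 3, 4, 5 and 6 (total 20), the two longest scaffolds together hold 11 bases, which passes the halfway mark of 10, so the N50 is 5.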
Step 7: Generation of Superbactigs
Step 8: Ultrabactigs and Placement on Chromosomes The rat RH map (v3.4) was used to place the ultrabactigs obtained above onto individual chromosomes. To anchor sequence to a chromosome, e-PCR and NCBI-BLASTN are used to map marker primers or sequences onto individual ultrabactigs (see Methods). To place the ultrabactigs onto chromosomes, the marker locations on the ultrabactigs were converted from the superbactig locations based on their order, orientation, and gap sizes. Dominant windows and slopes were calculated to determine the chromosome location and orientation of each ultrabactig. For those without solid marker information, rat/mouse and rat/human synteny information was used to insert them between or at the ends of marker-ordered ultrabactigs. The final order and orientation were manually adjusted after considering marker quality of the RH map, quality of marker mapping, FPC assembly, and mouse/human comparative information to maximize the synteny. Adjacent superbactigs on each chromosome were further linked if there were any appropriate read pairs or FPC suggested links. This reduced the RGSP ultrabactigs to 419 pieces with 71 singletons; 291 pieces were placed on chromosomes. Most of the 128 unplaced pieces are either singletons or short superbactigs consisting of only a few clones.
Availability
The strategy used to sequence the rat genome was a combined approach using elements of CBC BAC sequencing with WGS sequencing (Rat Genome Sequencing Project Consortium 2004 1 x-2x) skims from a 1.6 x clone coverage set of BACs were produced in addition to a 4 x-5 x coverage of WGS reads. The goal of the BAC skims was to provide localization that could be used to boost the confidence of the assembly, especially in repeat regions, as compared with a WGS-only assembly. Despite the mixed nature of the source data, the assembly of the rat data has more in common with a WGS-only assembly than it does with the CBC approach. The rat assembly required the same all-against-all read comparison that is needed for WGS-only assemblies. This all-against-all read comparison is both a major part of the computational load of the rat assembly as well as a key software component determining its success. The success of this approach emphasizes the utility of BACs in large-scale sequencing projects. The Atlas assembly system is designed to take both BAC and WGS data as inputs for the genome assembly. This is an important distinction between Atlas and the other popular genome assemblers in use, which are not designed to exploit the unique information associated with BAC clone sequences. The critical issue in assembling genomes is the method of dealing with repeated sequences. Atlas initially identifies repeats through oligonucleotide frequency analysis and excludes them from the overlap analysis of sequences. Repeated sequences appear in the assembly only as read pairs after the main layout has been derived. However, because of low-frequency repeats, which are not readily distinguished from the redundant sequences from high-coverage sequencing, assemblies can be erroneous. To address this, extensive and repeated checking of assemblies occurs throughout the assembly process, using a variety of methods to detect repeat-induced errors. 
These include checks of intrinsic properties of assemblies, such as template and read-pair distributions, as well as comparison to external data sets such as FPC and STS maps and syntenic relations with other genomes. These checks are performed at virtually every stage of the process, which is highly iterative to allow feedback from errors that only appear in downstream analysis. All of this produces a robust draft consensus sequence with a high degree of consistency with existing information. The methods described here clearly highlight the complexity and pitfalls in assembly of large genomes. Current large-scale projects (save the mouse) plan to produce draft (unfinished) genome sequences, further emphasizing the need for high-accuracy assembly procedures. This is not only a software challenge: ancillary data such as FPC maps, STS maps, EST and cDNA, use of BAC based assembly, and sufficient WGS coverage are all essential to get an accurate product. The uses of many of these types of data are clearly illustrated in this report.
Overlap Detection The overlapper performs banded, end-to-end alignments on pairs of reads (Chao et al. 1992
Although all WGS-WGS and BAC-WGS overlaps were computed, the Atlas assembly relied primarily on BAC-WGS overlaps. Each overlap is saved as a directed edge that specifies the span, score, left extension, right extension, strand, and global frequency of the seeding k-mer. The span is simply the number of bases the two reads overlap; the score is the banded alignment score in the span region. The left and right extensions are the magnitudes of the nonoverlap regions between the two reads, where a positive extension indicates that the origin read of the edge is longer in that direction, and a negative extension indicates that the sink read of the edge is longer. The strand is indicated by the letter "f" for same-strand and "r" for opposite-strand overlaps.
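The overlap record described above might be represented as follows. The field names and the coordinate convention are assumptions for illustration: the helper handles only same-strand overlaps, takes half-open alignment coordinates on each read, and ignores indels when computing the span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverlapEdge:
    origin: str      # read the directed edge starts from
    sink: str        # read the directed edge points to
    span: int        # number of overlapping bases
    score: int       # banded alignment score over the span
    left_ext: int    # > 0: origin extends further left; < 0: sink does
    right_ext: int   # > 0: origin extends further right; < 0: sink does
    strand: str      # "f" same strand, "r" opposite strand
    seed_freq: int   # global frequency of the seeding k-mer


def forward_edge(origin: str, origin_len: int, sink: str, sink_len: int,
                 o_start: int, o_end: int, s_start: int, s_end: int,
                 score: int, seed_freq: int) -> OverlapEdge:
    """Build a same-strand edge from aligned intervals on the two reads."""
    return OverlapEdge(
        origin=origin,
        sink=sink,
        span=o_end - o_start,
        score=score,
        # Unaligned bases to the left of the overlap, origin minus sink.
        left_ext=o_start - s_start,
        # Unaligned bases to the right of the overlap, origin minus sink.
        right_ext=(origin_len - o_end) - (sink_len - s_end),
        strand="f",
        seed_freq=seed_freq,
    )
```

If the last 40 bases of a 100-base origin read align to the first 40 bases of an 80-base sink read, the edge has span 40, left extension 60, and right extension -40.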
Sorting WGS Reads Into BAC Bins
Tuning Enriched BAC Assemblies Variations that were tested included:
Assembled scaffolds were scored both on total finished sequence covered with chains of nonoverlapping exact matches in the correct order and orientation. The TIGR tool Mummer (v2.12; Delcher et al. 2002 The percentage coverage ranged from 80%-96% of the actual overlap of the sequences. The best options based on these sequences had:
However, manual examination of unfinished assemblies for regions of likely genomic duplication led us to believe that the "limit on total overlaps" heuristic would not do as well as the "best N overlaps" heuristic for those regions. The best settings with that heuristic were used for the enriched BAC preliminary assemblies for the Rnor3.1 release. These were the same as above, using 32-mers, but with the "best N overlaps" heuristic for N = 6. The results were almost identical except that one of the six BACs ended up in two scaffolds instead of one.
rolling-phrap
Atlas-scaffold and split-scaffold Misjoins were corrected in a multistep process. The essence of this process is to use the same read-pairing engine used in scaffolding to discover regions of contigs with inconsistent read pairs. Some inconsistencies were due to small numbers of WGS reads that did not belong in the contig. In these cases, the contig as a whole does not need to be fixed, only the extraneous reads removed. Thus, we could not assume that simply splitting contigs was the proper response to an inconsistency. We used the inconsistent read pairs to identify suspicious regions, which were examined for misjoins. Suspicious read pairs were identified as being one end interior to a contig with the paired end in another contig. Each interior contig point was identified as a checkpoint if the left, the right, or both sides had read-pair evidence linking it to another contig. Based on empirical testing, we identified three rules that determined when to split contigs:
After splitting problem contigs, split-scaffold checks the ends of contigs solely for template crashes; where these are found and the contig presents a scaffolding conflict, the contig ends are trimmed by removing reads. After this round of refining the assembly, scaffolds are generated that order and orient as many contigs as possible based on at least two read-pair links. Two scaffold versions are produced for each bactig: one from the contigs with splitting, and another from the contigs with splitting and trimming. The version with the longer scaffold is retained as the final version. The scaffolds >15 kb from these bactig assemblies (and eBAC assemblies for single-BAC bactigs) contain the full set of contigs used in generating the final sequence.
Mapping Ultrabactigs to Chromosomes
Data Quality Assessment Contigs and scaffolds from each BAC are analyzed for uniform participation of microtiter plates (trays) containing BAC skim sequencing reactions. Reads from trays with unbalanced participation are removed as potentially mixtures of BACs or other contamination, and the eBAC is reassembled. Similarly, if an eBAC assembly size is significantly larger than the FPC size of the BAC or if there are too many scaffolds, the BAC is rejected as a potential mixture. BACs failing any test are excluded from the genome assembly and flagged for purification. In most cases of mixtures of reads, it is possible to select one read set (reads in a set of contigs) to remain with a BAC and remove all others. For example, in cases in which a clear majority is contradicted by a small minority, we assume the majority to be the correct sequence for the BAC. In other situations, it is possible to assign one read set to a BAC based on BAC end sequences or clones that overlap based on FPC data. In a minority of BACs, neither of these techniques is informative. We resort to splitting such projects in two and noting that the true match to the clone is unknown. We also use the overlapper to detect relatedness between read sets based on k-mer content. This technique has proven particularly useful in the detection of problems with whole-genome shotgun trays, which are among the most difficult to detect with the contig-based purifier.
Reads selected for removal are not discarded. Instead, when a read set is composed of at least four 96-well trays or comprises at least half the reads in a BAC project, we create a "synthetic" project to hold it. Although a synthetic project is not associated with any known clone, it is otherwise assembled and treated exactly as an ordinary BAC project and can contribute to the genome assembly. The final rat genome assembly included 869 synthetic projects containing 359,967 reads that would otherwise have been wasted. In the final analysis, this quality control approach passed 98.7% of eBAC assemblies, and 99.3% of these were included in the final assembly. This included 97.7% of passed BACs that had been purified after previously failing and 93.8% of passed synthetic projects, a slightly lower rate because more synthetic projects created by splitting projects in half were too small to be useful. This high success rate validates the power and correctness of this purification system. Overall, 8.4% of our BAC sequencing projects were subject to purification at one time or another. But purification reduced the number of failed BAC projects from In addition to this scrutiny of sequence reads, bactigging produces a list of eBACs that cannot be included in a consistent layout, owing to excessively large numbers of overlaps. These are then re-examined for cross-contamination problems. Similarly, where superbactigging reveals problems with the underlying bactigs, the bactig layouts are re-examined. These and other quality assessment mechanisms are outlined in Table 3.
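The rule for when a removed read set becomes a synthetic project (at least four 96-well trays, or at least half the reads in the BAC project) can be written as a small predicate. The function and parameter names are hypothetical.

```python
def qualifies_for_synthetic_project(n_trays: int,
                                    reads_in_set: int,
                                    reads_in_project: int) -> bool:
    """True if a removed read set is large enough to become its own project."""
    at_least_four_trays = n_trays >= 4
    # Integer comparison avoids rounding issues with odd project sizes.
    at_least_half_of_reads = 2 * reads_in_set >= reads_in_project
    return at_least_four_trays or at_least_half_of_reads
```

A read set spanning two trays but holding 600 of a project's 1000 reads would qualify on the second criterion alone.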
This project was supported by grant U54 HG02345 from the NHGRI and NHLBI to R.A.G. We thank members of the Rat Genome Sequencing Project Consortium who provided the data for this project and provided feedback as to the quality of the assembly. At the BCM-HGSC we particularly thank Bingshan Li, Yue Liu, Qin Xiang, and Erica Sodergren, who provided invaluable assistance and input to this project; David Wheeler and Zhengdong Zhang also contributed to the quality assessment. We are grateful to John Bouck and Harley Gorrell for making early contributions to the work. We thank Gerard Bouffard and Eric Green (NHGRI) for access to shotgun reads and finished sequences on rat BACs that were used for quality checks of this assembly and Ann Kwitek and Howard Jacob (Medical College of Wisconsin) for assistance with the rat radiation hybrid map. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2264004.
1 These authors contributed equally to this work.
2 Corresponding author. [The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: the Rat Genome Sequencing Project Consortium, Gerard Bouffard, and Eric Green.]
Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D., Amanatides, P.G., Scherer, S.E., Li, P.W., Hoskins, R.A., Galle, R.F., et al. 2000. The genome sequence of Drosophila melanogaster. Science 287: 2185-2195.
Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J.M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A., et al. 2002. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297: 1301-1310.
Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815.
Bailey, J.A., Gu, Z., Clark, R.A., Reinert, K., Samonte, R.V., Schwartz, S., Adams, M.D., Myers, E.W., Li, P.W., and Eichler, E.E. 2002. Recent segmental duplications in the human genome. Science 297: 1003-1007.
Batzoglou, S., Jaffe, D.B., Stanley, K., Butler, J., Gnerre, S., Mauceli, E., Berger, B., Mesirov, J.P., and Lander, E.S. 2002. ARACHNE: A whole-genome shotgun assembler. Genome Res. 12: 177-189.
Blattner, F.R., Plunkett III, G., Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J.D., Rode, C.K., Mayhew, G.F., et al. 1997. The complete genome sequence of Escherichia coli K-12. Science 277: 1453-1474.
Bouck, J., Miller, W., Gorrell, J.H., Muzny, D., and Gibbs, R.A. 1998. Analysis of the quality and utility of random shotgun sequencing at low redundancies. Genome Res. 8: 1074-1084.
The C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282: 2012-2018.
Celniker, S.E., Wheeler, D.A., Kronmiller, B., Carlson, J.W., Halpern, A., Patel, S., Adams, M., Champe, M., Dugan, S.P., Frise, E., et al. 2002. Finishing a whole-genome shotgun: Release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol. 3: RESEARCH0079, 1-14.
Chao, K.M., Pearson, W.R., and Miller, W. 1992. Aligning two sequences within a specified diagonal band. Comput. Appl. Biosci. 8: 481-487.
Chen, R., Sodergren, E., Weinstock, G.M., and Gibbs, R.A. 2004. Dynamic building of a BAC clone tiling path for the rat genome sequencing project. Genome Res. (this issue).
Choi, V. and Farach-Colton, M. 2003. Barnacle: An assembly algorithm for clone-based sequences of whole genomes. Gene 320: 165-176.
Cole, S.T., Brosch, R., Parkhill, J., Garnier, T., Churcher, C., Harris, D., Gordon, S.V., Eiglmeier, K., Gas, S., Barry III, C.E., et al. 1998. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393: 537-544.
Dehal, P., Satou, Y., Campbell, R.K., Chapman, J., Degnan, B., De Tomaso, A., Davidson, B., Di Gregorio, A., Gelpke, M., Goodstein, D.M., et al. 2002. The draft genome of Ciona intestinalis: Insights into chordate and vertebrate origins. Science 298: 2157-2167.
Delcher, A.L., Phillippy, A., Carlton, J., and Salzberg, S.L. 2002. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30: 2478-2483.
Ewing, B. and Green, P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8: 186-194.
Ewing, B., Hillier, L., Wendl, M.C., and Green, P. 1998. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res. 8: 175-185.
Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M., et al. 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496-512.
Goff, S.A., Ricke, D., Lan, T.H., Presting, G., Wang, R., Dunn, M., Glazebrook, J., Sessions, A., Oeller, P., Varma, H., et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92-100.
Goffeau, A., Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J.D., Jacq, C., Johnston, M., et al. 1996. Life with 6000 genes. Science 274: 546-567.
Holt, R.A., Subramanian, G.M., Halpern, A., Sutton, G.G., Charlab, R., Nusskern, D.R., Wincker, P., Clark, A.G., Ribeiro, J.M., Wides, R., et al. 2002. The genome sequence of the malaria mosquito Anopheles gambiae. Science 298: 129-149.
Huang, X., Wang, J., Aluru, S., Yang, S.P., and Hillier, L. 2003. PCAP: A whole-genome assembly program. Genome Res. 13: 2164-2170.
Huson, D.H., Reinert, K., Kravitz, S.A., Remington, K.A., Delcher, A.L., Dew, I.M., Flanigan, M., Halpern, A.L., Lai, Z., Mobarry, C.M., et al. 2001. Design of a compartmentalized shotgun assembler for the human genome. Bioinformatics 17 Suppl 1: S132-S139.
Jaffe, D.B., Butler, J., Gnerre, S., Mauceli, E., Lindblad-Toh, K., Mesirov, J.P., Zody, M.C., and Lander, E.S. 2003. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13: 91-96.
Karp, R.M. and Rabin, M.O. 1987. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31: 249-260.
Kent, W.J. and Haussler, D. 2001. Assembly of the working draft of the human genome with GigAssembler. Genome Res. 11: 1541-1548.
Kim, S. and Segre, A.M. 1999. AMASS: A structured pattern matching approach to shotgun sequence assembly. J. Comp. Biol. 6: 163-186.
Kirkness, E.F., Bafna, V., Halpern, A.L., Levy, S., Remington, K., Rusch, D.B., Delcher, A.L., Pop, M., Wang, W., Fraser, C.M., et al. 2003. The dog genome: Survey sequencing and comparative analysis. Science 301: 1898-1903.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Mewes, H.W., Albermann, K., Bahr, M., Frishman, D., Gleissner, A., Hani, J., Heumann, K., Kleine, K., Maierl, A., Oliver, S.G., et al. 1997. Overview of the yeast genome. Nature 387: 7-65.
Mullikin, J.C. and Ning, Z. 2003. The phusion assembler. Genome Res. 13: 81-90.
Myers, E.W., Sutton, G.G., Delcher, A.L., Dew, I.M., Fasulo, D.P., Flanigan, M.J., Kravitz, S.A., Mobarry, C.M., Reinert, K.H., Remington, K.A., et al. 2000. A whole-genome assembly of Drosophila. Science 287: 2196-2204.
Owolabi, O. and McGregor, D.R. 1988. Fast approximate string matching. Software Practice and Experience 18: 387-393.
Rat Genome Sequencing Project Consortium. 2004. Genome sequence of the Brown Norway Rat yields insights into mammalian evolution. Nature (in press).
Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., and Miller, W. 2003. Human-mouse alignments with BLASTZ. Genome Res. 13: 103-107.
Soderlund, C., Longden, I., and Mott, R. 1997. FPC: A system for building contigs from restriction fingerprinted clones. Comput. Appl. Biosci. 13: 523-535.