Genome Research scroll

Home Help [Feedback] [For Subscribers] [Archive] [Search] [Contents]
 QUICK SEARCH:   [advanced]


     


Genome Res. 14:1786-1796, 2004
©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
This Article
Right arrow Full Text
Right arrow Full Text (PDF)
Right arrow Correction
Right arrow Erratum (v14,p2510)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Right arrow Citation Map
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Pevzner, P. A.
Right arrow Articles by Tesler, G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Pevzner, P. A.
Right arrow Articles by Tesler, G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us   Add to Digg   Add to Reddit   Add to Technorati  
What's this?

Methods

De Novo Repeat Classification and Fragment Assembly

Paul A. Pevzner1, Haixu Tang1 and Glenn Tesler2,3

1 Department of Computer Science and Engineering, University of California, San Diego,La Jolla, California 92093, USA 2 Department of Mathematics, University of California, San Diego,La Jolla, California 92093, USA

Repetitive sequences make up a significant fraction of almost any genome, and an important and still open question in bioinformatics is how to represent all repeats in DNA sequences. We propose a new approach to repeat classification that represents all repeats in a genome as a mosaic of sub-repeats. Our key algorithmic idea also leads to new approaches to multiple alignment and fragment assembly. In particular, we show that our FragmentGluer assembler improves on Phrap and ARACHNE in assembly of BACs and bacterial genomes.


3 Corresponding author.
E-MAIL gptesler{at}math.ucsd.edu; FAX (858) 534-7029.

[The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: D. Kisman, M. Li, B. Ma, and to S.-P. Yang.]

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2395204.

4 The concept of an "elementary repeat," although important, was never rigorously defined in the recent papers on repeat analysis (Volfovsky et al. 2001; Bailey et al. 2002; Bao and Eddy 2002). Although it is well defined for perfectly conserved repeats (e.g., as maximal simple paths of multiplicity >1 in the de Bruijn graph), the imperfectly conserved repeats defy simple definitions. "Elementary repeats" are defined in this paper as maximal simple paths of multiplicity >1 in the A-Bruijn graph. This definition is not perfect either, but we believe that it fits the spirit of Bailey et al. (2002). However, in many biological applications, this concept needs to be adjusted (e.g., by introducing appropriate thresholds that remove low-multiplicity edges from A-Bruijn graphs) to account for high-multiplicity mobile elements, fractured repeats, and other biological complications.

5 We are not criticizing other repeat classification programs here: They were designed with the primary goal of characterization of high-multiplicity mobile elements rather than the explicit representation of mosaic structure of sub-repeats. The goal of this example is to illustrate that the problem of mosaic repeat representation raised by Bailey et al. (2002) is not adequately addressed by the existing software tools. We also remark that our approach reveals mobile elements as well (as high-multiplicity edges/paths in A-Bruijn graphs).

6 Pevzner et al. (2001) introduced an error-correction procedure that mimics de Bruijn graphs for nearly identical repeats (repeats that are 98%-99% similar), a small step toward a generalization of the original de Bruijn approach. However, extending this approach beyond nearly identical repeats remains an open problem. Our construction of repeat graphs (below) is very different and more general.

7 Although the A-Bruijn graphs may be very complicated for highly repetitive genomes, they are often broken into simple paths if one removes all low-multiplicity edges. These paths (formed by high-multiplicity edges) typically correspond to mobile elements studied in Bao and Eddy (2002).

8 For example, inconsistent pairwise alignments in Figure 6B can be made consistent by moving the gap in the alignment of at and acat from positions 1-2 to positions 2-3.

9 Short tandem repeats deleted during whirl elimination can be easily added to the repeat graphs.

10 RepeatGluer classifies repeat families in a single DNA strand. To account for both strands and for inverted repeats, one has to concatenate both DNA strands into a single sequence S.

11 We emphasize that highly repetitive sequences correspond to a similarity matrix with a large number of 1s, which may be quadratic in the length of the sequence. Although it does not present a problem for genomes under study, it may present a problem for longer genomes. A possible approach to reducing the running time in this case is to limit the number of 1s for each column/row. Such filtering of 1s in the similarity matrix has to be done with caution to ensure that all copies of a mobile element are "glued together" with high probability (e.g., all copies of a perfect m-copy repeat can be glued with m-1 1s instead of m2 1s). Also, blocks of adjacent 1s are represented as position and length. Another approach (that is equivalent to the "repeat masking" procedure used in many WGS assemblers) is to simply glue together all predefined high-multiplicity repeats at the preprocessing stage.

12 If one removes low-multiplicity edges from the graph in Figure 4C, the resulting path on four bold edges will correspond to the consensus Alu repeat. Other transposable elements can be extracted from the A-Bruijn graphs by applying different thresholds depending on the multiplicity of the transposable element (applications of RepeatGluer for finding transposable elements will be described elsewhere).

13 Multiple alignment is not a focus of this paper, and the goal of this section is simply to introduce the construction that is used in the follow-up fragment assembly section (we view fragment assembly as multiple alignment of reads). The applications of RepeatGluer for multiple alignment of proteins with shuffled domains will be described elsewhere.

14 The informed reader may notice parallels between our A-Bruijn graph approach to fragment assembly and two earlier approaches pioneered by Idury and Waterman (1995) and Myers (1995). Although the earlier algorithms look very different, both implicitly tried to develop the idea of the repeat graph. Myers tried doing this by collapsing the overlap graph at the level of read (~500 bp) resolution, whereas Idury and Waterman tried simplifying the de Bruijn graph at the level of l-mer resolution. Our key contribution is the A-Bruijn graph construction that deals with fragment assembly at the level of single-nucleotide resolution. This increased granularity alleviates challenging algorithmic problems that we faced trying to design an efficient assembler for low-coverage regions and low-quality reads/read ends.

15 Low-quality reads can be defined as reads with fewer than 100 positions with Phred quality values above 15.

16 It may sound like a simple integration problem, but it turned out to be a very difficult task that required development of a new idea for fragment assembly (our FragmentGluer algorithm, implemented in EULER+). The difficulty is that it is unclear how the algorithmic ideas used in the recent WGS assemblers can be adjusted for working with low-coverage and low-quality sequencing data (like read ends).


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us   Add to Digg Digg   Add to Reddit Reddit   Add to Technorati Technorati    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
S. Saha, S. Bridges, Z. V. Magbanua, and D. G. Peterson
Empirical comparison of ab initio repeat finding programs
Nucleic Acids Res., April 1, 2008; 36(7): 2284 - 2294.
[Abstract] [Full Text] [PDF]


Home page
Genome Res.Home page
M. J. Chaisson and P. A. Pevzner
Short read fragment assembly of bacterial genomes
Genome Res., February 1, 2008; 18(2): 324 - 330.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
C. M. Bergman and H. Quesneville
Discovering and detecting transposable elements in genome sequences
Brief Bioinform, November 1, 2007; 8(6): 382 - 392.
[Abstract] [Full Text] [PDF]


Home page
Mol. Cell. ProteomicsHome page
N. Bandeira, K. R. Clauser, and P. A. Pevzner
Shotgun Protein Sequencing: Assembly of Peptide Tandem Mass Spectra from Mixtures of Modified Proteins
Mol. Cell. Proteomics, July 1, 2007; 6(7): 1123 - 1134.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
M. Hou, P. Berman, C.-H. Hsu, and R. S. Harris
HomologMiner: looking for homologous genomic groups in whole genomes
Bioinformatics, April 15, 2007; 23(8): 917 - 925.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
G. Achaz, F. Boyer, E. P. C. Rocha, A. Viari, and E. Coissac
Repseek, a tool to retrieve approximate repeats from large DNA sequences
Bioinformatics, January 1, 2007; 23(1): 119 - 121.
[Abstract] [Full Text] [PDF]


Home page
Genome Res.Home page
A. Caspi and L. Pachter
Identification of transposable elements using multiple alignments of related genomes
Genome Res., February 1, 2006; 16(2): 260 - 270.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
A. Morgulis, E. M. Gertz, A. A. Schaffer, and R. Agarwala
WindowMasker: window-based masker for sequenced genomes
Bioinformatics, January 15, 2006; 22(2): 134 - 141.
[Abstract] [Full Text] [PDF]




Home Help [Feedback] [For Subscribers] [Archive] [Search] [Contents]
Genes Dev. Learn. Mem.
Protein Science RNA Genome Res.
Copyright © 2004 by Cold Spring Harbor Laboratory Press.