Vol. 12, Issue 8, 1152-1155, August 2002
INSIGHT/OUTLOOK
Transposable elements (TEs), or transposons, form a
major fraction of the eukaryotic genome (Kidwell and Lisch 2001).
Dismissed for some time as junk DNA, these repetitive sequences are now recognized for their diverse evolutionary roles. In this issue of
Genome Research, Bao and Eddy (2002)
describe a software tool (RECON) for de novo annotation of transposons in genomic sequence, offering new possibilities for discovery to biologists interested in TE evolution as well as a practical tool for masking repetitive DNA from genomic annotation pipelines.
In this commentary, I begin by reviewing why transposons are relevant to studies of genome evolution. I then outline the advances of Bao and Eddy's method over previous work, highlighting certain exemplary features of the RECON method. Finally, I describe some of the open questions of transposable element evolution that may be more easily addressed by large-scale bioinformatics and functional genomics approaches as RECON, and more tools like it, become available.
Why Care about Transposable Elements?
Transposable elements (Box 1) are of interest to geneticists (as an experimental tool), genome annotators (typically as junk DNA to be screened out), and structural and evolutionary biologists (for many reasons). My own bias lies toward the latter two (structural and evolutionary) aspects, and I will only briefly outline the former two (experimental and annotative) aspects before moving on to molecular evolution.
Experimental geneticists use transposons regularly as vectors for
germ-line transformation, particularly gene knockouts. The use of
P elements (class II TEs) to transform fruitflies, for example, is well-documented, and a systematic program of
P-element-induced gene disruption is a key part of the
Drosophila melanogaster genome project (Spradling et al. 1995).
Transposons are very powerful tools in this context, and the
discovery of a new transposon in a given host organism can greatly
assist studies of that organism's genetics.
To many genome annotators, transposons are less exciting and more of a
nuisance, because (owing to amplification of TEs following colonization
of a new host) they comprise much of the repetitive content of genomes.
(Other sources of repetition include signals that are locally amplified
by replication error or recombinational mispairing, such as short
oligonucleotide repeats.) Repetitive sequence can profoundly confuse
well-intentioned statistical analysis, such as the reporting of
Expectation values (E values) by programs like
BLAST (Altschul et al. 1990) or MEME (Bailey and Elkan 1995).
These E values are computed based on the
assumption that sequences under neutral selection contain no long-range
correlations, an assumption that is broken when the transposon copy
number is amplified. Masking out previously characterized transposons
prior to analysis (e.g., using the RepeatMasker program; A.F.A. Smit
and P. Green, unpubl.) is one way around this. As for
previously uncharacterized transposons, the very repetition of these
sequences can be used to identify them. The approach described by Bao
and Eddy is of this latter kind.
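To make the masking step concrete, here is a minimal sketch, assuming the repeat coordinates are already known. It shows only the final substitution step, not how a program such as RepeatMasker locates repeats. The function name `hard_mask` and the half-open `(start, end)` interval convention are illustrative assumptions, not part of any existing tool.

```python
def hard_mask(seq, intervals, mask_char="N"):
    """Replace annotated repeat intervals with a mask character.

    `intervals` holds half-open (start, end) coordinates, 0-based.
    This convention is an assumption made for this sketch.
    """
    chars = list(seq)
    for start, end in intervals:
        # Clip each interval to the sequence so stray coordinates are harmless.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


if __name__ == "__main__":
    # Toy sequence with two hypothetical repeat annotations.
    print(hard_mask("ACGTACGTAC", [(2, 5), (8, 20)]))
```

Downstream similarity searches then see the masked positions as uninformative, so amplified TE copies no longer inflate the apparent significance of hits.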
To a molecular biologist interested in protein structure and function, transposons have many interesting homologies, most notably to the replication machinery of many viruses, and also to transcription factors and other specific DNA- and RNA-binding proteins. The information-processing nature of many tasks connected to transposition (specific and nonspecific sequence recognition, DNA and RNA processing, host defenses, self-regulation) and the possibility of assaying for transposition in vitro link them to an interesting variety of cellular processes.
It is, perhaps, among evolutionary theorists that transposons arouse
the most interest. Since their discovery, TEs more than any other genes
have highlighted the neo-Darwinian question of whether to model the
gene or the organism as the fundamental unit of selection. Because TEs
are a burden to the host, owing to the replicative load of the extra
DNA, and (worse) because repetitive sequence content and
transpositional activity are both mutagenic, TEs have in the past been
regarded as purely selfish parasites (Orgel and Crick 1980). However,
the mutations induced by transposons may be more structured than, say,
mutations caused by irradiation or chemical toxicity, often rearranging
rather than merely corrupting the host genome, and one can imagine them
more readily generating a more neutral or advantageous phenotype
(Kidwell and Lisch 2001); furthermore, TE activity often increases when
the host is stressed (Capy et al. 2000), leading to a symbiotic or
mutualistic (rather than parasitic) view of the host-TE relationship
wherein TEs are agents of change or "natural genetic engineers"
called in to stimulate evolution at times of stress (Shapiro 1999). If
only they could talk.
Despite the seeming anthropomorphism of the mutualism/parasitism
debate, the underlying question (how TE and host evolution are
linked) is a rational and important one to ask. One way to address this
question is to assemble a picture of the various processes, molecular
and evolutionary, that are linked to transposition in one way or
another, using sequence analysis and direct experiment to collect data
on the observed interactions between TEs and their hosts (Kidwell and
Lisch 2001). Such interactions include potential TE-related mutations
in the host (Table 1), host defenses
against TEs (Table 2), and evolution and
self-regulation of TEs (Table 3). We can,
in principle, amass many of these data by detailed sequence analysis
(Kidwell and Lisch 2001). This requires that we know, first of all,
where in the genome the transposons actually are.
Automated Annotation: Hunting for Repetition
The task of de novo, automated annotation of all TEs in a genome is
a difficult one, as explained by Bao and Eddy (2002). The principle of
identifying repeated sequences, by clustering hits from a
BLAST self-comparison or similar search, is
straightforward enough. The problem is that certain TEs are often found
to be associated, for example, if one TE jumps next to or into another
and the resulting chimera TE is then amplified, so that automated
programs can easily conflate adjacent or nested TEs (Holmes 1998).
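The conflation problem can be illustrated with a small sketch of single-linkage clustering over pairwise hits. This is a generic union-find implementation, not the code of any published annotation tool; the fragment labels (`A1`, `B1`, `AB`) and the function name are hypothetical.

```python
from collections import defaultdict


def single_linkage_clusters(pairs):
    """Group fragment IDs so that any chain of pairwise hits joins a cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for x in list(parent):
        clusters[find(x)].add(x)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Two distinct families, each supported by one pairwise hit.
    pure = [("A1", "A2"), ("B1", "B2")]
    print(single_linkage_clusters(pure))

    # A hypothetical chimeric copy "AB" matches a member of each family,
    # which is enough for single linkage to merge the two families.
    chimeric = pure + [("AB", "A1"), ("AB", "B1")]
    print(single_linkage_clusters(chimeric))
```

A single chimeric fragment is enough to bridge two otherwise unrelated families, which is why clustering on hit membership alone tends to lump adjacent or nested elements together.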
Bao and Eddy have developed an elegant solution to this problem. As with previously described methods, their RECON algorithm starts by doing a BLAST-versus-self of the input genome. In place of the single-linkage clustering of prior methods, however, their algorithm examines the BLAST coordinates in detail. Wherever the density of BLAST hits to a region changes sharply, RECON places a boundary between elements. The sensitivity of the algorithm to changes in hit density, that is, the willingness of RECON to split up elements, is a parameter that can be fine-tuned. RECON also uses a simple length-based heuristic to distinguish major insertion and deletion variants from close familial relatives.
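The idea of placing boundaries where hit density changes sharply can be sketched in a few lines. This is a simplified illustration under stated assumptions, not RECON's actual procedure: it uses per-base hit depth and an absolute jump threshold (`min_jump`, a hypothetical parameter standing in for the tunable sensitivity described above), with half-open `(start, end)` hit coordinates.

```python
def coverage(length, hits):
    """Per-base hit depth over a region, from half-open (start, end) hits."""
    diff = [0] * (length + 1)
    for start, end in hits:
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def density_boundaries(depth, min_jump):
    """Positions i where depth differs from position i-1 by at least min_jump."""
    return [
        i for i in range(1, len(depth))
        if abs(depth[i] - depth[i - 1]) >= min_jump
    ]


if __name__ == "__main__":
    # Toy region of 20 bases: five hits cover only the left half, one hit
    # covers only the right half, and one chimeric hit spans both halves.
    hits = [(0, 10)] * 5 + [(10, 20)] + [(0, 20)]
    depth = coverage(20, hits)
    print(density_boundaries(depth, min_jump=3))  # sensitive setting
    print(density_boundaries(depth, min_jump=5))  # conservative setting
```

In this toy case the depth drops from 6 to 2 at position 10, so the sensitive setting splits the region there while the conservative setting leaves it whole. Lowering the threshold corresponds to a greater willingness to split elements.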
The approach of Bao and Eddy appears to work well. In a test on 3 Mb of human sequence data, RECON identified 6 out of 10 known repeat families and one new family (f179); in most of the known families, the reported consensus matched the canonical sequence closely. Several of the larger families are broken up, but this is probably inevitable to a certain extent and is certainly better than lumping distinct families together.
The problem addressed by RECON, of separating sequence
motifs that may frequently be found adjacent to or nested within each
other, is one that crops up throughout bioinformatics. RECON's solution is a clustering approach that is mindful of the nature of the data and the statistical issues involved. As
pointed out by the authors, a similar approach may be useful for
identifying the boundaries of protein domains; indeed, a clustering approach was used to build the protein hidden Markov model database Pfam (Sonnhammer et al. 1997). In protein sequences, domain-adjacency is more common than domain-nesting, but the domain boundaries should
still be apparent from changes in the density of hits.
An open and ongoing algorithmic challenge is to develop clustering algorithms akin to RECON, reflecting the underlying biology as much as possible. Examples of patterns that could conceivably be modeled are proximity effects, substitutions and indels in the TE (and the rates of such mutations), patterns of TE aggregation and nesting, subfamilies, deletion variants, and prediction of TE class (Box 1). Probabilistic models seem a natural choice for this task, as many of the rules will be stochastic rather than deterministic.
Challenges for Bioinformatics: Smoking Icicles
A primary task for postgenomic bioinformatics studies of transposon evolution is the characterization of the full transposon complement of sequenced genomes. This includes the annotation of all transposons, using both de novo and homology-based methods, including classification in the vocabulary of Box 1 and more stringent familial groupings. Clearly, the work of Bao and Eddy is a significant step in this direction.
With the transposon complement annotated, it will be possible to
investigate more systematically the evolutionary relevance of the
various processes outlined in Tables 1, 2, and 3. Some of these stories
can be expected to feature more smoking guns than others. For example,
it is possible that relatively few class II TE-induced mutations
directly implicate transposons simply because the evidence (the TE
itself) disappears quickly. Rather than a smoking gun, this is
reminiscent of Agatha Christie's perfect murder weapon: an icicle. We
can, however, assemble a statistical picture of the rates and patterns
of these mutations, for example, by looking at duplications of
noncoding DNA within a genome (Holmes 1998), and see how this fits with
other measurable rates of TE evolution.
The host responses of Table 2 should be more amenable to direct
investigation in the lab. The TE evolutionary and self-regulatory processes of Table 3 may similarly yield to a combination of sequence
analysis and experimental methods. An example is the coexistence of
multiple related TE subfamilies within a single host. Previous
informatics-led work revealed six previously undescribed families in
the Caenorhabditis elegans genome related to the class II
element Tc1/mariner. Sequence analysis suggests that these
families have rapidly evolved distinct transposase-DNA specificity,
thus avoiding crossmobilization (Holmes 1998). Hypotheses like this can
be developed and tested by in vitro footprinting (Colloms et al. 1994)
and transposition assays (Lampe et al. 1996), making this fertile
ground for collaborations between computational and experimental biologists.
Of central interest are questions of evolutionary timing. How fast, and in what ways, do transposable elements evolve at the sequence level? How does this compare to the rates of transposition and recolonization? What can we learn about colonization from comparing the transposon complements of closely related species? Can we link transposon invasions to bursts of evolutionary activity? How, if at all, do transposons and their hosts coevolve?
The fundamental evolutionary issue at hand is the role and raison d'être for what was once called junk DNA. This fascinating junk now forms an important part of our evolutionary picture of genomes as information repositories in flux. Computational whole-genome transposon screens like RECON offer the possibility of moving beyond anthropomorphic debates of selfishness-versus-altruism to pragmatic questions about the organization of genomes and the mutational/selective forces that shape their history.
FOOTNOTES
1 E-MAIL holmes{at}stats.ox.ac.uk; FAX 44 1865 272595.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.453102.