Vol. 10, Issue 7, 939-949, July 2000
LETTER
ABSTRACT
Tetraodon nigroviridis is a freshwater pufferfish 20-30 million years distant from Fugu rubripes. The genomes of both tetraodontiforms are compact, mostly because intergenic and intronic sequences are reduced in size compared to other vertebrate genomes. The previously uncharacterized Tetraodon genome is described here together with a detailed analysis of its repeat content and organization. We report the sequencing of 46 megabases of bacterial artificial chromosome (BAC) end sequences, which represent a random DNA sample equivalent to 13% of the genome. The sequence and location of rRNA gene clusters and of centromeric and subtelocentric satellite sequences have been determined. Minisatellites and microsatellites have been cataloged, and notable differences were observed in comparison with microsatellites from Fugu. The genome contains homologies to all known families of transposable elements, including Ty3-gypsy, Ty1-copia, Line retrotransposons, DNA transposons, and retroviruses, although their overall abundance is <1%. This structural analysis is an important prerequisite to sequencing the Tetraodon genome.
[The sequence data described in this paper have been submitted to the EMBL data library under accession nos. AJ245809, AJ270048, AJ245808, AJ270029-AJ270047, DS42722 and AL305790-AL352938.]
INTRODUCTION
The human genome is in the process of being
completely sequenced, and an attempt is being made in parallel to
systematically identify functionally relevant sequences. Current gene
identification methods are based on software predictions or comparisons
with expressed sequence tags and still lack accuracy and completeness. Comparisons between the human genomic sequence and the complete sequence of another vertebrate should be a useful complement to rapidly
and accurately reveal regions of functional interest. Indeed, two
vertebrate genomes that are evolutionarily distant should only show
strong conservation of sequences of functional importance (protein
coding regions; tRNA; rRNA) while other segments submitted to random
mutations will show much less similarity. It has been amply
demonstrated that the genomic sequence of a tetraodontiform such as
Fugu rubripes is a powerful yet efficient tool to reveal such
coding regions (Elgar 1996; Elgar et al. 1999).
We have chosen Tetraodon nigroviridis to develop such
comparative analyses on a genome scale (Roest Crollius et al. 2000) because of its widespread availability and its simple and inexpensive maintenance in the laboratory. It was also reasoned that studying a
species related to Fugu but distant by 20-30 million years
(Crnogorac-Jurcevic et al. 1997) would enable the identification of
functionally important sequences that appeared after the
human/teleostean divergence. We have initiated a random sequencing
approach of the Tetraodon genome based on bacterial artificial
chromosome (BAC) end templates and have generated 46 Mb of DNA, or 13%
of the genome. The average read length is 1 kb, which contributes to
making this approach a very fast and cost-effective method of genome
scanning. BAC end sequencing provides an added advantage by physically
linking two sequences over a relatively short distance (75-200 kb),
allowing direct comparisons between linked sequences in
Tetraodon and other genomes. It also represents an ideal
genomic resource for long-range physical mapping, as well as a
sequence-tagged connector (STC) resource (Mahairas et al. 1999) to assist shotgun sequencing in
specific regions.
This Tetraodon genome sample was exploited in combination with
fluorescence in situ hybridization experiments, to decipher the
organization of repeat sequences. This study serves several purposes.
First, repeat sequences occur naturally in multiple copies in the
genome either in tandem or in dispersed distribution, and therefore can
seriously hamper clustering studies or sequence assemblies. In any case
such sequences must be identified and eliminated, generally by masking,
during sequence comparison procedures to avoid the formation of
unwanted repeat alignments. Second, major satellite and rRNA gene
clusters form heterochromatic blocks in the genome that are easily
recognizable cytogenetically. These blocks can serve as useful markers
when the chromosome formula is difficult to establish, as is the case
in pufferfishes (Barat and Khuda-Bukhsh 1984; Miyaki et al. 1995;
Grützner et al. 2000; Fischer et al. 2000). Finally, repeat sequences
are important elements of the genome from an evolutionary point of view
(Charlesworth et al. 1994). They can contribute an important fraction
of the DNA in a genome, from <10% for tetraodontiforms (Brenner
et al. 1993 and this work) to >50% in some mammalian species. In addition, repeat sequences, and in particular transposable elements, can
influence chromosome evolution by promoting chromosome breakage, deletions, inversions, and amplifications (Lim and Simmons 1994; Dimitri
et al. 1997; O'Neill et al. 1998). Transposable elements and tandem
repeats are closely associated in heterochromatic regions of the
genomes of many distant eukaryotes such as Drosophila (Pimpinelli et
al. 1995) and plants (Presting et al. 1998), a situation that further
supports the structural role of such repeats in genome evolution
(Dimitri and Junakovic 1999). It is therefore of particular interest to
investigate repeat distribution in Tetraodon considering its
unusual evolution, which positions it today as the smallest known
vertebrate genome.
We have identified the major satellite sequences, which are localized in the centromeres and acrocentric arms. The complete sequence of rRNA genes has been determined and their cluster localized on a small heteromorphic chromosome. The detection of minisatellite sequences essentially reveals their paucity in the genome. A comprehensive cataloguing of microsatellites, compared with Fugu, shows that this genome is particularly rich in polyA stretches. We have found homologies to transposable elements (TEs) belonging to all major families, although their overall abundance is low compared to other eukaryotes. Overall, repeated sequences make up 6.17% of the genome. Taken together, these results represent a structural basis on which new studies focused on genome organization, evolution, and coding potential can be initiated.
RESULTS
Genomic Clone Library Construction, Characterization, and Sequencing
In order to limit possible cloning biases and redundancy in sequencing templates, two BAC libraries were constructed from the same fish specimen, using different vectors (pBACe3.6 and pBeloBAC11) and two restriction enzymes to fractionate genomic DNA (EcoRI and HindIII). The resulting library A (pBACe3.6/EcoRI) and library B (pBeloBAC11/HindIII) comprise 20,352 and 22,658 clones, respectively. Based on field inversion gel electrophoresis separation of 1792 control clones, the average insert size is 126 kb and 153 kb for libraries A and B, respectively. Taking into account that 7% of the clones in each library have no visible insert, both libraries together represent 14.5 genomic equivalents of the Tetraodon genome. A total of 52,619 BAC end sequences have been generated (60% library A, 40% library B). Control clones were also resequenced and therefore represent duplicate sequences spread evenly in the library, which serve as indicators of errors that may have occurred at any point along the production line. The average raw sequence length is 1075 bases, reduced to 969.2 bases after clipping off vector and low-quality sequence at both ends of each read. The resulting sequences contain 3.2% of uncalled bases (N).
A database of 47599 reads was created after removal of redundant (same
BAC end sequenced more than once) and contaminating (E. coli,
vector) sequences. This set is available for similarity searches at
http://www.genoscope.cns.fr/tetraodon and is the basis of the studies
described here. The fraction of unique DNA in the database has been
estimated by performing a BLAST search (Altschul et al. 1990) of the
database against itself. This estimate is essential to evaluate the
efficiency of the sequencing strategy as well as the probability of
obtaining a match when querying the database. In the present case,
redundancy can be contributed by cloning biases, supernumerary
reads of the same BAC end, or repeated sequences. The major families of
repeated sequences are described in this report and include rRNA genes,
tandem and interspersed repeats. It is, however, impossible to exclude at
this stage that other types of repeated elements remain undetected,
rendering attempts at formally distinguishing between the different
types of redundancy unreliable. On the other hand, it is possible to clearly separate the unique fraction, i.e., sequences that match no
sequence in the database other than themselves, from the redundant
fraction. Unique sequences represent 87% of the reads, equivalent to
approximately 41 Mb of DNA.
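The separation of unique from redundant reads described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the self-comparison is assumed to have been reduced to (query, subject) identifier pairs, and the function name `split_unique_redundant` and the read identifiers are hypothetical, not part of the original pipeline.

```python
from collections import defaultdict


def split_unique_redundant(read_ids, hits):
    """Partition reads into unique and redundant sets.

    read_ids: iterable of all read identifiers in the database.
    hits: iterable of (query_id, subject_id) pairs from a self-comparison
          of the database (hypothetical input format).
    A read is "unique" when it matches no read other than itself.
    """
    all_reads = set(read_ids)
    partners = defaultdict(set)
    for query, subject in hits:
        if query != subject:  # self-hits carry no information
            partners[query].add(subject)
            partners[subject].add(query)
    unique = {r for r in all_reads if not partners.get(r)}
    redundant = all_reads - unique
    return unique, redundant


# Toy example: r2 and r3 match each other, r1 and r4 only match themselves.
reads = ["r1", "r2", "r3", "r4"]
hits = [("r1", "r1"), ("r2", "r3"), ("r3", "r2"), ("r4", "r4")]
unique, redundant = split_unique_redundant(reads, hits)
```

As in the text, the sketch does not try to say why a read is redundant (cloning bias, duplicate read, or repeat); it only isolates the unique fraction.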
Genome Size and Compositional Patterns
Measurement of haploid DNA content by a variety of methods initially
suggested that Tetraodon has a haploid genome size around 380 Mb
(Hinegardner 1968; Pizon et al. 1984). However, more recent estimates
based on flow cytometry indicate a genome size of 350 Mb (Lamatsch et
al. 2000). Tetraodon possesses 21 chromosome pairs (Grützner et
al. 1999; Fischer et al. 2000), which range in size between
approximately 11 and 28 Mb, based on measurements of metaphase chromosomes and correlation with the haploid genome size of 350 Mb. Thus the largest chromosome is still approximately half the size of the smallest human chromosome. The genome is 45.5% G + C,
with BAC end sequences ranging from 15% to 70% G + C. The relative
abundance of dinucleotides ($\rho_{XY} = f_{XY}/(f_X f_Y)$, where $f_X$ denotes the
frequency of the nucleotide X and $f_{XY}$ the frequency of the dinucleotide XY) deviates significantly from expected values for CpG
(0.60), TpA (0.62), TpT/ApA (1.20), and TpG/CpA (1.21).
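The dinucleotide relative abundance defined above (observed dinucleotide frequency divided by the product of the two mononucleotide frequencies) can be computed as in the following minimal sketch. The function name is hypothetical, and the sketch counts one strand only; the paired values reported in the text (TpT/ApA, TpG/CpA) suggest that both strands were taken into account, which this illustration does not attempt to reproduce.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Return {XY: f_XY / (f_X * f_Y)} for every dinucleotide observed.

    Only A, C, G and T are counted; positions holding any other
    character (e.g. uncalled bases, N) are skipped.
    """
    seq = seq.upper()
    bases = "ACGT"
    mono = Counter(b for b in seq if b in bases)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in bases and seq[i + 1] in bases
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    ratios = {}
    for pair, count in di.items():
        f_xy = count / n_di
        f_x = mono[pair[0]] / n_mono
        f_y = mono[pair[1]] / n_mono
        ratios[pair] = f_xy / (f_x * f_y)
    return ratios


# Toy input: in "ACACAC", AC and CA are the only dinucleotides present,
# so both are strongly overrepresented relative to base composition.
ratios = dinucleotide_odds_ratios("ACACAC")
```

A value near 1 means the dinucleotide occurs about as often as expected from base composition alone; values well below 1, such as the 0.60 reported for CpG, indicate suppression.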
Ribosomal RNA Genes
The typical eukaryotic rRNA gene array consists of a tandem repetition of a basic unit, separated from the next by an intergenic spacer (IGS). Each unit starts with a 5' external transcribed spacer (ETS), followed by the 18S, 5.8S and 28S genes separated by two internal transcribed spacers (ITS1 and ITS2), and ending with a 3'ETS (Fig. 1). Gene sequences are extremely well conserved from mammals to bacteria, although the number and distribution of the genes and of the repeating units may vary between and within species.
The high degree of sequence conservation of rRNA genes among
vertebrates led us to select the complete and well annotated human
repeated unit (U13369) to identify the Tetraodon homologous genes. The complete human transcribed unit was searched against the
Tetraodon database and retrieved 606 reads (0.73% of the
nucleotides in the database; Table 1). Assembly by
Phred and Phrap of these sequences delineated one contig that covers
the complete transcribed region. We have thus established the first
consensus sequence of the transcribed rRNA repeated unit of a fish
containing the 18S, 5.8S and 28S genes (Fig. 1). The sequence is 8303 bases long and includes a partial 5'ETS and 3'ETS. Compared to
the homologous human sequence which measures 10502 bp, the
Tetraodon sequence has smaller intergenic spacers and shows
significant deletions in the 28S gene. Fluorescence in situ
hybridization experiments with a 28S probe identify a small pair of
chromosomes containing a characteristic heterochromatic region (Fig.
2B). This Nucleolar Organizer Region (NOR) is partly
4',6-diamidino-2-phenylindole (DAPI)- and strongly propidium iodide
(PI)-positive and entirely covered by the hybridization signal.
The sequence of the complete 5S gene (120 bp) and its spacer (289 bp)
has also been determined. In all vertebrates the 5S rRNA gene is
organized in tandem repetitions and generally in separate cluster(s)
from those formed by the 18S, 5.8S and 28S genes. A Tetraodon
5S rDNA PCR product was used as an in situ probe and gives a single
signal on the short arms of one of the smallest chromosome pairs, but
different from the pair bearing the other rRNA gene cluster. No real
size polymorphism could be observed between the two arms. Localization
of Tetraodon rRNA gene clusters (5S and 18S-5.8S-28S) on two
different chromosome pairs will facilitate the unequivocal
identification of the latter in a karyotype where the majority of
chromosomes are of similar size (Grützner et al. 1999; Fischer et al. 2000).
Centromeric Satellite Repeat
Centromeres of higher eukaryotes are often associated with tandem
repetitions of a basic repeat unit that do not appear evolutionarily conserved between species, and no definite sequence-specific function has yet been determined for such repeats. However, it is clear that in
most species, several, and sometimes all, chromosomes contain the same
satellite sequence, indicating that a mechanism of concerted evolution
is operating within populations (Elder and Turner 1995). The sequence
of satellite repeats has been determined in several fish species, and
some have been assigned to centromeres. For instance, tandemly repeated
monomers of 355 bp and 168 bp are found in all centromeres of
Hoplias malabaricus (Haaf et al. 1993) and of Sparus
aurata (Garrido-Ramos et al. 1994), respectively.
In Tetraodon, we have found a 118-bp repeated monomer in a large number of sequences (0.34% of nucleotides). Its organization in clusters is indicated by the observation that when a 118-bp tandem repeat is found at one end of a BAC, it is frequently found at the other end as well (27% of cases). A cloned monomer was hybridized to Tetraodon chromosomes and uniformly labels all centromeres (Fig. 2A), demonstrating its centromeric origin and pointing towards a concerted evolution of this satellite sequence. However, a more detailed comparison of the sequences of randomly chosen monomers reveals that this repeat is highly variable in a ~60 bp region, while the remaining half is remarkably constant (Fig. 3A). This sequence variation is present within at least some centromeres, since examination of both end sequences belonging to the same BAC clones (the last eight sequences above the consensus in Fig. 3A) shows that each end contains different variants. The monomer has a sequence composition of 57.6% A/T, close to the genome average (56.1% A/T).
A Fugu tandem repeat sequence of identical monomer size has
also been described (Brenner et al. 1993) with a probable centromeric origin (Elgar et al. 1999). A gapped alignment between the two monomer
sequences shows 56.6% identity (Fig. 3B).
Subtelocentric Satellite Repeats
A second abundant tandem repeat of monomer size 10 bp was found in Tetraodon BAC end sequences. A prominent feature of this repeat is its high sequence variability, while the monomer size is strictly conserved. For instance, the alignment of 25 consecutive monomers found in a BAC end sequence (accession number AL315101; Fig. 3C) shows that this stretch is composed of 21 variant monomers. Interestingly, a thymidine is always found in the 5th position in the monomer in all sequences examined. Other bases show 4% to 48% variation on the sample described in Figure 3C.
The organization of this repeat in potentially very large arrays was suggested by the observation that, of all BAC clones that contain the repeat at one or both ends, 30% contain it at both ends. We have investigated the genome distribution of this repeat. A 40-mer oligonucleotide probe, containing twice the consensus sequence interspersed by the two most abundant variants, was hybridized to Tetraodon metaphase chromosomes. The probe specifically hybridizes to the complete length of the short arms of 10 out of 11 pairs of subtelocentric chromosomes (Fig. 2B). The subtelocentric pair that does not hybridize is the pair bearing the 18S-5.8S-28S rRNA genes.
Similarity searches with the BAC end AL315101 in Fugu sequences identify sequences that contain a 20-mer tandem repeat. The Tetraodon 10-mer consensus sequence (GGCGTCTGAG) is 80% identical to half of the Fugu 20-mer consensus sequence (GGCATCTGATCCTGGTAGCT), which may point toward a common origin for this satellite sequence in Tetraodontidae.
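The 80% figure can be checked directly from the two consensus sequences quoted above. The following sketch compares the Tetraodon 10-mer with each half of the Fugu 20-mer, position by position and without gaps; the helper name `percent_identity` is ours, and splitting the 20-mer into its first and second halves is an assumption about which alignment is meant.

```python
def percent_identity(a, b):
    """Ungapped percent identity between two sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


TETRAODON_10MER = "GGCGTCTGAG"
FUGU_20MER = "GGCATCTGATCCTGGTAGCT"

# Compare the 10-mer with the first and the second half of the 20-mer.
halves = [FUGU_20MER[:10], FUGU_20MER[10:]]
identities = [percent_identity(TETRAODON_10MER, half) for half in halves]
```

Under this splitting, only the first half of the Fugu consensus resembles the Tetraodon monomer (8 matches out of 10), which agrees with the statement that the similarity concerns half of the 20-mer.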
Minisatellite Repeats
The definition of a minisatellite repeat is not well standardized in
the literature and can vary in terms of repeat unit size (or period)
and total array size (Franck et al. 1991; Charlesworth 1994). We chose
to use this category loosely and include all tandem repeats that are
neither microsatellite nor satellite sequences. Thus, our definition
includes all sequences of repeat unit larger than 6 bases, tandemly
repeated at least 3 times, and that are not satellite sequences. We
used the software Tandem Repeats Finder (Benson 1999) with default
parameters, except for the maximum period size, which was set to 300 bases. Indeed, no motif of more than 300 bases repeated at least 3 times can be detected in sequences of average size 1 kb. Figure
4 shows the percentage of bases in the genome
contributed by repeats of period sizes between 7 and 300 bases. The two major peaks correspond to the subtelocentric (10-mer)
and centromeric (118-mer) satellite sequences. Clearly no other tandem
repeat contributes any substantial amount of DNA. The total fraction of
nucleotides represented by minisatellites, excluding the 10-mer and
118-mer repeats, is 0.41%.
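The working definitions used in this section can be summarized as a small classification rule. The sketch below is an illustration only: the function name and the "unclassified" label are hypothetical, and recognizing the two satellites purely by their period (10 and 118 bases) is a simplifying assumption, since in practice they would be identified by sequence.

```python
# Periods of the two satellite repeats described in the text.
SATELLITE_PERIODS = frozenset({10, 118})


def classify_tandem_repeat(period, copies, satellite_periods=SATELLITE_PERIODS):
    """Classify a tandem repeat from its period (bases) and copy number.

    - period 1 to 6: microsatellite
    - period matching a known satellite: satellite
    - period > 6 repeated at least 3 times: minisatellite
    """
    if period < 1 or copies < 1:
        raise ValueError("period and copies must be positive")
    if period <= 6:
        return "microsatellite"
    if period in satellite_periods:
        return "satellite"
    if copies >= 3:
        return "minisatellite"
    return "unclassified"


examples = [
    classify_tandem_repeat(2, 10),   # dinucleotide repeat
    classify_tandem_repeat(10, 25),  # subtelocentric 10-mer
    classify_tandem_repeat(35, 4),   # 35-base unit, 4 copies
    classify_tandem_repeat(35, 2),   # too few copies to qualify
]
```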
Microsatellite Repeats
Microsatellite repeats are defined as short tandem repetitions of
monomer units of 1 to 6 bases that are present in most if not all
eukaryotic genomes. Their widespread distribution and high
heterozygosity have promoted their use as polymorphic markers in
genetic mapping (Dib et al. 1996) and population genetics (Jarne and
Lagoda 1996). Their identification and characterization are essential in
whole genome studies based on sequence analysis because their high
frequency and repetitive nature tend to hinder clustering analysis and
homology studies. Early characterization of the Fugu genome
(Brenner et al. 1993) has shown that microsatellites are the second
most abundant class of repeats in this species, and a more exhaustive
classification has since been performed (Edwards et al. 1998; Elgar et
al. 1999). A direct comparison of microsatellite distribution in
Fugu and Tetraodon genomes is possible because both
species benefit from large, publicly available sequence samples that
have been randomly generated from genomic clones (Elgar et al. 1999 and
this work).
Our method, based on the Smith-Waterman algorithm, underestimates the
total content of microsatellite sequences in the sample, because only
one alignment is produced per motif per sequence. Thus, for instance,
if two (CA)n are present in a sequence, only one will be reported.
Despite this bias, we observe that 3.21% of the Tetraodon
genome consists of microsatellites, versus 1.29% measured by Edwards
et al. (1998) in Fugu. This disparity between two figures
measured in closely related genomes is not negligible and is most
probably due to the different strategies used in the two studies. To
resolve this, we repeated our study on Fugu genomic sequence
(13.7 Mb, Fugu Landmark Mapping Project), a sample size similar to that
used by Edwards et al. (1998), and found a total microsatellite content
of 2.12%.
The motif frequency distribution is relatively similar between the Tetraodon and Fugu genomes when analyzed with our approach (Fig. 5), except for one noticeable difference: the polyA repeat is twice as frequent in Tetraodon (15%) as in Fugu (7%). Table 1 summarizes other features of microsatellite distribution in both genomes. There are twice as many reads containing at least one microsatellite in Tetraodon as in Fugu, which correlates with the Tetraodon sequences being twice as long (969 bp and 473 bp in Tetraodon and Fugu, respectively). Provided microsatellites are similarly distributed in both genomes, this constitutes good evidence that their identification is not dependent upon differences in sequence quality or sequencing chemistry between the two samples. A microsatellite occurs on average once every 588 bases in Tetraodon and once every 850 bases in Fugu. The longest microsatellite in Tetraodon is a 502-bp AGAT repeat, and the most abundant in nucleotides are AC (18%) and A (13%), which together constitute 31% of all microsatellites. In Fugu, the same repeats represent only 20% of all microsatellites.
Transposable Elements (TEs)
Considering the relatively small size of the Tetraodon genome
and the impact TEs may have on genome size, it is of interest to
investigate their presence in pufferfishes, which have the smallest
known vertebrate genomes. We have performed a detailed cataloguing of
TEs in Tetraodon and show that elements belonging to all known
families have been integrated in the genome (Table 2). This observation is based on comparisons between
translated Tetraodon genomic sequences and all known
eukaryotic TEs annotated in nonredundant protein and nucleic acid databases.
The 732 BAC end sequences displaying such homologies were then
subdivided into the following families based on database annotation:
Ty3/gypsy, Ty1/copia, Line, Retrovirus, TC1/mariner, and Hobo. The
Tetraodon sequences belonging to each group show little or no
sequence similarity to each other and thus form distinct families
in the genome, as suggested by the database matches. The total DNA
content of TE-like regions in Tetraodon is only 0.9%, a large
fraction of which is contributed by Line elements (0.4%). Out of the
27 TEs that are present in Tetraodon DNA, 10 are more similar
to anonymous Fugu sequences than to any cognate TE in public
databases (Table 2). From this, we deduce that these TEs are also
present in the Fugu genome. TEs belonging to all families are
present in both species, except for Hobo and Ty1/copia, which are
present in the Tetraodon sequence sample only. However, these
families are underrepresented in Tetraodon and their absence
in Fugu may simply be a reflection of the smaller amount of
DNA currently available for screening in this species (Table 1).
Among the 732 BAC end sequences that contain a TE, both ends of the same BAC clone carry a TE 10 times more often than expected from the average frequency of TE sequences in the database. This suggests that TEs tend to be organized in clusters in the Tetraodon genome.
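The reasoning behind this clustering argument (and behind the similar paired-end observations for the satellite repeats) can be made explicit with a short sketch. It assumes, as a simplification of our own, that both ends of every clone were sequenced and that under a random distribution the two ends of a clone carry the feature independently; the function and the toy data are hypothetical.

```python
def paired_end_enrichment(clone_ends):
    """Compare observed and expected rates of a feature at both BAC ends.

    clone_ends: dict mapping clone id -> (bool, bool), True when the
                feature (e.g. a TE homology) is present at that end.
    Returns (observed, expected, ratio), where expected is the squared
    per-end frequency, i.e. the rate under independence of the two ends.
    """
    n_clones = len(clone_ends)
    if n_clones == 0:
        raise ValueError("no clones supplied")
    n_positive_ends = sum(a + b for a, b in clone_ends.values())
    n_both = sum(1 for a, b in clone_ends.values() if a and b)
    p_end = n_positive_ends / (2 * n_clones)
    expected = p_end ** 2
    observed = n_both / n_clones
    ratio = observed / expected if expected > 0 else float("nan")
    return observed, expected, ratio


# Toy data: one clone out of four carries the feature, at both ends.
clones = {
    "c1": (True, True),
    "c2": (False, False),
    "c3": (False, False),
    "c4": (False, False),
}
observed, expected, ratio = paired_end_enrichment(clones)
```

A ratio well above 1, as in the roughly 10-fold excess reported for TEs, indicates that positive ends co-occur on the same clone more often than a random distribution would predict.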
DISCUSSION
A large sample of the Tetraodon nigroviridis genomic
sequence has been analyzed to characterize repeat organization in this genome, in comparison with the Fugu genome. The sequence of
the Tetraodon genome is 45.5% G + C, which is within the
vertebrate range, between 40% for Bos taurus and 48% for
Sus scrofa (Karlin and Mrazek 1997). However, we observe a
suppression of the CpG dinucleotide ($\rho_{CG} = 0.6$), as has
previously been observed in Fugu (Elgar et al. 1999), although
not as strong as in mammals, where the odds ratio $\rho_{CG}$ ranges
between 0.22 (Mus musculus) and 0.33 (S. scrofa) (Karlin and Mrazek 1997). We also observe a suppression of
the TpA dinucleotide and a clear overrepresentation of the TpT/ApA and
TpG/CpA dinucleotides. The mechanisms that drive these deviations from
the expected values are not yet understood. It is, however, clear that
tetraodontiforms and perhaps teleosts in general do not present
extremes of suppression or overrepresentation for the same
dinucleotides as mammals.
The two major satellite sequences reported here (centromeric and
subtelocentric) are located in the main heterochromatic blocks of the
chromosome complement. The subtelocentric repeat displays a highly
variable monomer sequence within the genome, but its 10-bp length
appears strictly conserved. The centromeric satellite, on the other
hand, is less variable, but here the conservation of the monomer length
has probably extended well beyond the Tetraodon species.
Indeed, a similar satellite repeat of exactly the same monomer length (118 bp) but different sequence (56.6% identical) has been found in
Fugu and is presumably also of centromeric origin (Brenner et
al. 1993; Elgar 1996; Elgar et al. 1999). This would suggest that for
both types of satellites evolutionary constraints have been much
stronger on monomer length than on monomer sequence composition. The
processes that affect satellite sequence evolution are not yet
understood, although a number of models have been proposed (for review
see Charlesworth et al. 1994) to explain variations in the number of
consecutive monomers rather than the sequence of the monomer itself.
We can envisage two possible explanations for the conservation of
monomer length despite their sequence variation. It is possible that
a still-unknown structural role for such satellite sequences requires a
fixed monomer length but places few requirements on sequence
composition per se. The alternative is that maintenance of the monomer
length may only be the consequence of an amplification mechanism that
would generate motifs of identical size, but without any strict
requirement on sequence composition, except perhaps for a few critical
bases. The poor sequence homogeneity of the 10-bp subtelocentric
satellite is at odds with the generally accepted notion of concerted
evolution, which tends to maintain the sequence similarity of repeating
units within a population or a species (Elder and Turner 1995).
Microsatellite sequence distributions have been investigated in a
number of vertebrate species, although different software, sample sizes,
and even microsatellite definitions were often used (Beckmann and Weber
1992; Edwards et al. 1998; Jurka and Pethiyagoda 1995; Moran 1993; Van
Lith and Van Zutphen 1996). Precise comparisons are therefore limited
to studies performed in identical conditions. The most striking
differences between Tetraodon and Fugu concern the
overall microsatellite content (3.21% and 2.12% of the genome, respectively) and the overrepresentation of the mononucleotide A in
Tetraodon (15% versus 7%). Poly(A) tails are also the most abundant microsatellite family in the human genome, where they are
often introduced by retrotransposons, and in particular by Line and Alu
sequences (Boeke 1997). In the Tetraodon genome such retrotransposons are rare (Line) or absent (Alu), and cannot be considered as a source of overrepresentation for poly(A) repeats.
TEs are DNA sequences that can move or copy themselves within a host
genome, to which they can contribute a large fraction. For instance,
approximately 50% of the maize (SanMiguel et al. 1996), 35% of
the human (Smit 1996), and 10% of the Drosophila melanogaster
(Finnegan 1989) genomes are made of such elements. They can be
classified according to their transposition mechanisms. Class I
elements replicate via an RNA intermediate and may be flanked by long
terminal repeats (LTR-retrotransposons, such as Ty3-gypsy and Ty1-copia
families) or end with an A-rich tail at the 3' end (non-LTR
retrotransposons, such as the LINE and SINE families). Class II
elements are essentially DNA-based transposons that code for a
transposase and include Tc1-mariner and Hobo families. Early studies in
Fugu on a small sequence sample concluded that this genome was
devoid of interspersed repeats (Brenner et al. 1993). However, a
Ty3/gypsy LTR-retrotransposon and a Line element have since been
described in this genome (Poulter and Butler 1998; Poulter et al. 1999)
and additional homologies to reverse transcriptase identified (Elgar et
al. 1999). TEs have been documented in many teleosts (Britten et al.
1995; Duvernell and Turner 1998; Flavell and Smith 1992; Ivics et al.
1996; Izsvak et al. 1995; Koga et al. 1996; Tristem et al. 1995). In
Tetraodon, the representation of these sequences is below 1%,
similar to the 1.89% found in Fugu. It appears, therefore,
that although a wide variety of TEs have repeatedly integrated into the
genomes of pufferfishes, their amplification and spreading have been
drastically limited compared to other eukaryotes. It is possible that
this situation is related to the fact that these genomes are the
smallest among vertebrates. The mechanisms that have limited TE
amplification in the pufferfish genomes are not known, but
investigating their distribution and local organization in the
chromosome complement may shed light on this unusual phenomenon.
The characterization of the Tetraodon genome presented here
lays the foundation for comparative genomic studies that may take several orientations. From an evolutionary point of view, results on
rRNA genes and satellite sequences, when compared to those of other
teleosts, particularly Fugu, may help us understand the complex processes involved in repeat dynamics over relatively short
evolutionary distances in vertebrates. Comparative genomics with
Tetraodon will, however, take its full dimension in the
context of gene identification and analysis (Roest Crollius et al.
2000). Gene identification in human and other vertebrate sequences is one of the primary goals in sequencing Tetraodon. However, a
large sample of teleost genomic sequence will also be invaluable to help us understand phenomena such as genome duplication (Amores et
al. 1998; Wittbrodt et al. 1998), or the importance and extent of
conserved synteny over long evolutionary distances.
METHODS
Fluorescence In Situ Hybridization
All specimens were provided by the same supplier. Their geographic
origin is unknown, but they were positively identified as
Tetraodon on the basis of morphological characters and
genotyping using mitochondrial sequences. Fishes were injected with
2 µl/g body weight of 0.05% colchicine solution 1 hr 15 min before
killing. Cephalic kidney and spleen were separated on a 350-µm mesh
stainless steel sieve directly in a 0.075 M KCl hypotonic
solution. After a 30-min hypotonic treatment at 29°C, the suspension was
centrifuged, and the pellet was fixed for 20 min in a 3:1
methanol-acetic acid solution that was changed only once. The
fixed cell suspension was immediately dropped on cleaned slides and
stored deep-frozen at −20°C after 30 min drying. All probes were
labeled with digoxigenin (Boehringer Mannheim) and hybridized according
to standard protocols. The centromeric probe was a 180-bp PCR product
cloned in the pAmp1 system (Gibco BRL) using primers
(5'-ATGCAGCACACAGATTTCCA-3') and (5'-TCCATCATTCTGCACCAAAC-3').
The subtelocentric probe was a 40-base oligomer
(GGCGTCTGAGGGCGTCTGATGGTGTCTGATGGCGTCTGAT) consisting of two consensus
monomers interspersed with the two most frequent variants. The probe was
synthesized with a 5' digoxigenin label (Genosys Biotechnologies Ltd.).
BAC Library Construction and Sequencing
Two BAC libraries were constructed from erythrocyte DNA from a
single Tetraodon specimen identified as such by morphological characters and genotyping using mitochondrial sequences. DNA was partially digested with EcoRI (library A) and HindIII (library B) and
separated on a 1% agarose gel by pulse field gel electrophoresis. For
each digest, three size-selected samples (~50 ng) ranging from
approximately 100 kb to 175 kb were ligated to 10-ng vector DNA
(pBACe3.6 for library A; pBeloBAC11 for library B). The BAC vectors
pBeloBAC11 (Kim et al. 1996
) and pBACe3.6 (Genbank accession number
U80929) were gifts from H. Shizuya, Department of Biology, California
Institute of Technology, Pasadena, CA and P. de Jong, Roswell Park
Cancer Institute, Human Genetics Department, Buffalo, NY, respectively.
Ligation reactions were electroporated into DH10B electrocompetent
cells (Gibco-BRL) and plated on 2YT agar containing 12.5 µg/ml
chloramphenicol and 5% saccharose. Recombinant clones were picked in
microtiter plates, grown in 2YT media containing 12.5 µg/ml
chloramphenicol and 5% glycerol, and subsequently frozen at
−80°C. In total, 20,352 clones were picked from library A
(EcoRI/pBACe3.6) and 22,658 from library B (HindIII/pBeloBAC11). A
sub-library, termed the control library, was assembled by selecting 16 clones from the central part of each microtiter plate of libraries A and B (1792 clones in total). DNA from all control clones was isolated, digested with NotI to release the insert, and separated by field inversion gel
electrophoresis in order to characterize a representative number of
clones covering both libraries. All clones in the control library
were also resequenced. Templates for sequencing were prepared by
alkaline lysis and purified on Qiagen columns. Sequences were obtained
by sequencing the same template with two different dye primers in the
same reaction. Four reactions were required in total, one for each
base. Each reaction contained 25 ng/µl DNA, 0.1 µM
each primer, and 4.5 µl ThermoSequenase mix (Amersham) in a final
volume of 11 µl. Primers were TET3 (TGACACTATAGAAGGATCCG) and T7
(TAATACGACTCACTATAGGG) for BACs from library A and BELO1 (CTATTTAGGTGACACTATAG) and T7 for BACs from library B. Reactions were
loaded on 4.8% acrylamide gels on LiCor4200 machines, and images were
collected and analyzed by BaseImagir V4.00. Graph files were then
transferred to a UNIX environment, and sequences that showed at least a
300-base window containing <6 ambiguous bases were further processed
by routine quality checks and vector clipping prior to analysis.
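
The read-acceptance criterion described above (at least one 300-base window containing fewer than six ambiguous bases) can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the function name `has_clean_window` is hypothetical, and treating every character other than A, C, G, or T as ambiguous is an assumption.

```python
def has_clean_window(seq: str, window: int = 300, max_ambiguous: int = 5) -> bool:
    """Return True if some `window`-base stretch of seq holds at most
    `max_ambiguous` ambiguous calls (fewer than six by default)."""
    seq = seq.upper()
    if len(seq) < window:
        return False

    # 1 for an ambiguous call, 0 for an unambiguous base (assumption: non-ACGT = ambiguous).
    ambiguous = [0 if base in "ACGT" else 1 for base in seq]

    # Count ambiguities in the first window, then slide it one base at a time.
    count = sum(ambiguous[:window])
    if count <= max_ambiguous:
        return True
    for end in range(window, len(seq)):
        count += ambiguous[end] - ambiguous[end - window]
        if count <= max_ambiguous:
            return True
    return False


if __name__ == "__main__":
    read = "N" * 6 + "ACGT" * 75  # 6 ambiguous calls followed by 300 clean bases
    print(has_clean_window(read))
```

The sliding count keeps the check linear in the read length, which matters little for single reads but keeps the sketch usable on a full set of end sequences.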
Sequence Comparison and Assembly
All sequence comparisons between large sets of sequences were
performed using standard algorithms such as BLAST (Altschul et al. 1990)
or Smith-Waterman (Smith and Waterman 1981) implemented in LASSAP
version 1.1.3 (Large Scale Sequence Comparison Package; Glemet and
Codani 1997). Most calculations were performed on one Digital
quadriprocessor (AXP 21164; each processor at 440 MHz), although when
required, we used up to four quadriprocessors simultaneously. Sequence
assembly was performed with Phrap and Phred (Ewing and Green 1998).
Tandem Repeat Analysis
The Tetraodon sequences consist of 47,599 single reads of
average size 969.57 bases (45.742 Mb of DNA). For minisatellite detection, the software Tandem Repeats Finder (version 2.02; Benson 1999) was used with the following parameters: match: 2, mismatch: 7, delta: 7, PM: 80, PI: 10, minscore: 50, maxperiod: 300. The output was
filtered to retain motifs of period size of at least 7 bases, repeated
3 times or more. When adding the percentage of bases contributed by
each motif size, redundant motifs were eliminated by taking into
account only the motifs with the smallest period size. For
microsatellite analysis, our approach is very similar to that used for
the identification of microsatellites in Fugu (Edwards et al. 1998),
although some modifications were made. The repeat definition is
the same, i.e., a motif of size 1 to 6 bases repeated at least three
times, and of a total size of at least 12 bases. We also allowed up to
15% variation over the complete length of the sequence, between the
microsatellite and the perfectly repeated motif of same length.
However, here this definition is strictly observed regardless of the
size of the repeat and implies that a 12-base microsatellite may also include up to one mismatch. This double constraint on size and identity
is used when selecting microsatellites that respect the definition and
eliminates the need for an arbitrary minimal score. The Fugu
sequences were retrieved from the Human Genome Mapping Project web site
(http://fugu.hgmp.mrc.ac.uk/fugu/fugu) and consist of 29,078 sequences (release 07/20/98) with an average size of 473 bases (13.753 Mb of
DNA). The reference microsatellite library consists of all 501 possible
motifs from monomer to hexamer, repeated over 500 bases, in forward and
in reverse complement (1002 sequences; Jin et al. 1994). Comparisons
between this library and pufferfish genomic DNA were performed
exclusively with the Smith-Waterman algorithm (Smith and Waterman 1981)
implemented in LASSAP version 1.1.3. The scoring matrix and gap
costs were as follows: match +10, mismatch −30, ambiguity (N) −5,
gap opening −40, gap extension −30. The results consist of the
best local alignment per sequence and per motif (47.3 million
alignments), to which two filters were applied. The first retained
alignments that respect the definition of a microsatellite: at least
three repeats of the motif spanning at least 12 bases, with at least
85% identity over the complete length of the alignment. In cases where
several similar motifs overlapped the same region of a query
sequence, a second filter retained only the motif with the
highest percentage of identity.
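
The reference motif library and the first filter can be illustrated with a short sketch. It is not the LASSAP-based procedure used here. It assumes that motifs are counted once per class of cyclic rotations and reverse complements, with motifs that are themselves repeats of a shorter unit excluded, which reproduces the 501 motifs quoted above. It also simplifies the filter to an ungapped, position-by-position comparison against a perfect repeat. All function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def is_primitive(motif: str) -> bool:
    """True if motif is not a repetition of a shorter unit (e.g. ACAC = 2 x AC)."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def canonical(motif: str) -> str:
    """Smallest string among all rotations of the motif and of its reverse complement."""
    forms = []
    for s in (motif, reverse_complement(motif)):
        forms.extend(s[i:] + s[:i] for i in range(len(s)))
    return min(forms)


def motif_library(max_size: int = 6) -> list[str]:
    """One representative per motif class, from monomers up to `max_size`-mers."""
    motifs = set()
    for size in range(1, max_size + 1):
        for letters in product("ACGT", repeat=size):
            motif = "".join(letters)
            if is_primitive(motif):
                motifs.add(canonical(motif))
    return sorted(motifs, key=lambda m: (len(m), m))


def passes_definition(region: str, motif: str, min_identity: float = 0.85) -> bool:
    """Ungapped check: motif of 1-6 bases, >= 3 repeats, >= 12 bases, >= 85% identity."""
    n, k = len(region), len(motif)
    if not 1 <= k <= 6 or n < 12 or n < 3 * k:
        return False
    best = 0
    for phase in range(k):  # try every starting phase of the repeat
        rotated = motif[phase:] + motif[:phase]
        perfect = (rotated * (n // k + 1))[:n]
        best = max(best, sum(a == b for a, b in zip(region, perfect)))
    return best / n >= min_identity


if __name__ == "__main__":
    library = motif_library()
    print(len(library))                              # 501 motif classes
    print(passes_definition("ACACACATACAC", "AC"))   # 12 bases, one mismatch
    print(passes_definition("ACACATATACAC", "AC"))   # 12 bases, two mismatches
```

In this sketch a 12-base candidate with one mismatch passes (11/12 identity) while one with two mismatches fails (10/12), consistent with the constraint described above. Alignments containing gaps would require the full Smith-Waterman comparison rather than this simplified check.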
| |
ACKNOWLEDGMENTS |
|---|
We thank Patrick Lafaite and the Museum National d'Histoire Naturelle for assistance with photographic work and the sequencing teams of Genoscope, in particular Patrick Wincker and Philippe Brottier.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
NOTE ADDED IN PROOF |
|---|
After the submission of this article, an additional 100 Mb of Tetraodon DNA has been submitted to the EMBL data library under accession nos. AL163976-AL305789.
| |
FOOTNOTES |
|---|
4 Corresponding author.
E-MAIL hrc{at}genoscope.cns.fr; FAX 33 1 608 72589.
| |
REFERENCES |
|---|
|
|
|---|
Received October 28, 1999; accepted in revised form May 17, 2000.