Vol. 11, Issue 5, 710-730, May 2001
LETTER
ABSTRACT
We present the sequence of a contiguous 2.63 Mb of DNA extending from the tip of the X chromosome of Drosophila melanogaster. Within this sequence, we predict 277 protein coding genes, of which 94 had been sequenced already in the course of studying the biology of their gene products, and examples of 12 different transposable elements. We show that an interval between bands 3A2 and 3C2, believed in the 1970s to show a correlation between the number of bands on the polytene chromosomes and the 20 genes identified by conventional genetics, is predicted to contain 45 genes from its DNA sequence. We have determined the insertion sites of P-elements from 111 mutant lines, about half of which are in a position likely to affect the expression of novel predicted genes, thus representing a resource for subsequent functional genomic analysis. We compare the European Drosophila Genome Project sequence with the corresponding part of the independently assembled and annotated Joint Sequence determined through "shotgun" sequencing. Discounting differences in the distribution of known transposable elements between the strains sequenced in the two projects, we detected three major sequence differences, two of which are probably explained by errors in assembly; the origin of the third major difference is unclear. In addition there are eight sequence gaps within the Joint Sequence. At least six of these eight gaps are likely to be sites of transposable elements; the other two are complex. Of the 275 genes in common to both projects, 60% are identical within 1% of their predicted amino-acid sequence and 31% show minor differences such as in choice of translation initiation or termination codons; the remaining 9% show major differences in interpretation.
[All of the sequences analyzed in this paper have been deposited in the EMBL-Bank database under the following accession nos.: AL009146, AL009147, AL009171, AL009188-AL009196, AL021067, AL021086, AL021106-AL021108, AL021726, AL021728, AL022017, AL022018, AL022139, AL023873, AL023874, AL023893, AL024453, AL024455-AL024457, AL024485, AL030993, AL030994, AL031024-AL031028, AL031128, AL031173, AL031366, AL031367, AL031581-AL031583, AL031640, AL031765, AL031883, AL031884, AL034388, AL034544, AL035104, AL035105, AL035207, AL035245, AL035331, AL035632, AL049535, AL050231, AL050232, AL109630, AL121804, AL121806, AL132651, AL132792, AL132797, AL133503-AL133506, AL138678, AL138971, AL138972, and Z98269. A single file (FASTA format) of the 2.6-Mb contig is available from ftp://ftp.ebi.ac.uk/pub/databases/edgp/contigs/contig_1.fa.]
INTRODUCTION
Less than 90 years have elapsed since Alfred H. Sturtevant presented the world with the first-ever genetic map of six visible markers on the X chromosome of Drosophila melanogaster (Sturtevant 1913). The extraordinary achievement of determining the entire euchromatic DNA sequence of D. melanogaster (Adams et al. 2000) now gives us the potential to identify every single coding region within this gene-rich part of the genome.
The first tentative steps towards sequencing the complete genome of Drosophila were taken 10 years ago with the construction of a physical map of the X chromosome (Sidén-Kiamos et al. 1990; Madueño et al. 1995) and the explicit declaration of the objective of whole-genome sequencing. Since then, both the European and Berkeley Drosophila Genome Projects (EDGP and BDGP) (Saunders et al. 1989; Kafatos et al. 1990; Rubin 1996, 1998; Louis et al. 1997) and, more recently, Celera Genomics have worked towards the common goal of completing the sequence of the entire genome of this fly. An essentially complete sequence of the euchromatic genome of D. melanogaster has now been published by the Celera Genomics/BDGP/Baylor College of Medicine collaboration with some input from EDGP; in this paper we call this the Joint Sequence (see Methods) (Adams et al. 2000; Myers et al. 2000; Rubin et al. 2000a).
We present an ~2.7 Mb region accurately sequenced and analyzed independently of the Joint Sequence. This is only the second detailed molecular analysis of a genomic sequence of several megabases from Drosophila, and it offers some interesting contrasts with the 3 Mb region of an autosome, whose analysis has been published recently (Ashburner et al. 1999). It also gives an opportunity to compare the results and analysis of a sequence obtained by the widely adopted clone-by-clone approach to those obtained from the whole-genome shotgun approach adopted by Celera and their collaborators (Venter et al. 1998). We also report the collection of ~6 Mb discontinuous sequence from divisions 4-10, which was obtained by sequencing at 1.5-fold coverage a collection of 29 BAC clones representing a minimal tiling path.
The tip of the X chromosome of D. melanogaster is a region of some sentimental, as well as much scientific, interest to geneticists. It includes the locus of the gene white, whose mutation was the first clear visible mutation found in Drosophila (Morgan 1910) and whose study led to the discovery of sex-linked inheritance and, hence, to the proof of the chromosome theory of heredity (Bridges 1916). It also includes a region, between the genes zeste and white, which was intensively studied by Burke Judd and colleagues (Judd et al. 1972) in an attempt to analyze the relationship between polytene chromosome bands and genes. There are two classic genetic complexes at the tip of the chromosome: the achaete-scute complex, whose phenotypic effects have long fascinated geneticists and generated much theoretical speculation (Agol 1929; García-Bellido 1979), and the broad complex (Zhimulev et al. 1995). The physical bases for the complexities in genetic analysis are quite different in these two cases (see below). Cytologically, the region includes, of course, the XL telomere, perhaps the best-characterized telomere in Drosophila (Biessmann and Mason 1997), as well as a region of polytene banding complexity that had indicated to Bridges (1935) the presence of a long reverse-repeat (Benos et al. 2000).
The main part of the sequence is contiguous, consisting of a single contig of 2,626,764 bp. The rest consists of a cosmid clone (23E12) that contains a number of Drosophila subtelomeric repeats (EMBL accession no. L03284) and thus represents the most distal part of the X chromosome. The two parts are separated by an unspecified number of repeats, and together amount to 2,664,670 bp.
RESULTS AND DISCUSSION
Linking the Genetic Map of the X Chromosome to a Molecular Framework
A decade ago, the founding members of the EDGP argued the case for constructing an accurate physical map of the genome of D. melanogaster linked to the genetic map (Sidén-Kiamos et al. 1990). To this end, cosmid clones were selected by hybridization with PCR-amplified DNA microdissected from each of the 100 individual divisions of the major polytene chromosome arms. A physical map was generated by determining overlaps between the cosmids based on the shared fragments generated by restriction endonuclease digestion (Sulston et al. 1988). The localization of cosmids was verified by in situ hybridization to the polytene chromosomes and by determining STSs of cosmid end sequences (Louis et al. 1997). This physical map, and the cosmid library on which it was based, are available as a public resource (http://www.hgmp.mrc.ac.uk/Biology/descriptions/drosophila.html).
A physical map was also constructed by the BDGP (Kimmerly et al. 1996) based on segments of DNA cloned in a P1 phage vector that were aligned using PCR-based STS content mapping. However, it was clear that both the cosmid and P1 maps would be an incomplete resource for sequencing the genome. Moreover, although the YAC map of Ajioka et al. (1991) does give good coverage, in our hands YAC clones were impractical for DNA sequencing purposes. We therefore undertook to build another map based on BAC clones because these vectors can, in principle, accommodate larger inserts of DNA. The generation of these BAC clones, which give an approximately 10-fold coverage of the genome, will be described in detail elsewhere. The library is available as a public resource (http://www.hgmp.mrc.ac.uk/Biology/descriptions/dros_bac.html). Clones from both this and a BAC library of partial EcoRI digestion products of DNA constructed for the BDGP (Hoskins et al. 2000) were physically ordered and linked by hybridization with a total of 647 hybridization probes, each 40 nucleotides in length, corresponding to sequences distributed along the length of the X chromosome. The resulting maps, whose full description will also be provided elsewhere, allowed us to determine a minimal tiling path of clones for sequencing purposes. We selected such a minimal tiling path extending through polytene divisions 4-10, and determined the sequence of these clones at ~1.5-fold coverage (http://edgp.ebi.ac.uk/cgi-bin/progress.pl). This provided a skeletal sequence scan of ~6 Mb of the chromosome that was made available to the Celera/BDGP/Baylor shotgun sequencing project for use as an assembly scaffold.
The accurate sequencing of polytene divisions 1-3 was initiated on a minimal tiling path of cosmid clones, subsequently extended using the BAC clones to fill gaps in the cosmid map. The clones selected for sequencing are presented in Figure 1A, and the assembled nonredundant sequence can be directly accessed at http://edgp.ebi.ac.uk/cgi-bin/progress.pl, which links to the EMBL-Bank deposits.
General Features of Gene Content
As explained in Methods, we have used two general classes of computational method to predict genes in this chromosome region: similarity-based methods and ab initio methods. Together these two approaches have enabled us to predict 277 protein-coding genes overall, of which 94 (33.9%) had been sequenced previously by the community (Table 1; Figure 1B). A total of 25 genes (9%) were predicted solely by ab initio methods, a lower fraction than in the Adh region (19%). A possible reason for this difference is that we used a stricter criterion for accepting a gene predicted only by an ab initio method than did Ashburner et al. (1999). Of the predicted genes, 205 have matches with ESTs from the BDGP (Rubin et al. 2000b) and NIH (Andrews et al. 2000) projects. The fraction of previously known Drosophila genes that had EST matches (77.1%) is the same as that of the genes predicted by sequence similarity (77.2%), and is very similar to the proportion of matches from the Adh region (71%). Assuming that the criteria used to predict genes are adequate, these figures provide a good indication of the proportion of Drosophila genes currently represented in EST collections. Presumably the shortfall mainly reflects the fact that the cDNAs used to generate the ESTs have been derived from a restricted number of developmental stages. The value of ESTs in confirming gene identity and splicing patterns provides a strong argument to extend the generation of EST data to other developmental stages and tissues (Andrews et al. 2000; Rubin et al. 2000b). Based on the analysis of EST hits, we identified nine genes that are alternatively spliced in their coding regions, and thus able to direct the synthesis of two or more different proteins (Table 1, asterisks). It is striking that of the 183 newly predicted genes, 55% have significant similarities with sequences in other organisms, thus indicating the extent of conserved function.
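The proportions quoted above follow directly from the gene counts given in the text. As a minimal sketch (the constant and function names are illustrative and not part of the EDGP analysis pipeline), the arithmetic can be checked as follows:

```python
# Gene counts quoted in the text for the tip of the X chromosome.
TOTAL_GENES = 277
PREVIOUSLY_SEQUENCED = 94
AB_INITIO_ONLY = 25


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Genes not sequenced before this project.
newly_predicted = TOTAL_GENES - PREVIOUSLY_SEQUENCED

print("newly predicted genes:", newly_predicted)
print("previously sequenced (%):", percent(PREVIOUSLY_SEQUENCED, TOTAL_GENES))
print("ab initio only (%):", percent(AB_INITIO_ONLY, TOTAL_GENES))
```

The 183 newly predicted genes referred to in the text are simply the 277 predicted genes minus the 94 that had been sequenced previously.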
The average size of the coding regions of the genes predicted in the tip of the X chromosome is 1.8 Kb, with 2.7 introns per gene. The gene with the highest number of introns is EG:BACR25B3.1 (26 introns in the coding region). The average size of the introns is 475 bp, with the shortest being 26 bp (EG:63B12.3) and the longest being 34,401 bp (sidekick [sdk], EG:BACR19J1.1). The calculated average number of introns per gene in this chromosomal region is consistent with previous studies that have indicated the majority of Drosophila genes contain one or two small introns located near their 5' ends (although exon and intron numbers will have been underestimated, as ab initio gene prediction methods will not predict untranslated exons). There are, however, some exceptionally large genes. These include sdk, which encodes an immunoglobulin-C2 domain protein and is required to prevent the "mystery cell" of the developing eye disc differentiating as a photoreceptor (Nguyen et al. 1997). This gene, sequenced previously as a cDNA, covers 60 Kb and includes at least 14 exons. Another very large gene is futsch (EG:49E4.1), covering 18 Kb and encoding a protein of 5327 amino acids that is predicted to be a microtubule-associated protein on the basis of its similarity with human MAP1B (SWISS-PROT:P46821), which is only half the size. Recently Hummel et al. (2000) have shown that futsch encodes the well-known Drosophila neural antigen 22C10. Four other genes have large transcription units: Appl, 35.1 Kb; br, 27.7 Kb; EG25B3.1, 20.0 Kb; and csw, 17.4 Kb. The overall GC content of this collection of genes from the tip of the X chromosome is significantly lower (45.5%) than the overall GC content of the genes in the Joint Sequence (56.1%).
One of the surprising results of the analysis of the Adh region sequence (Ashburner et al. 1999) was the number of genes predicted to be included within the introns of other genes (8%). These were most frequently, but not exclusively, arranged as anti-parallel transcription units. The present analysis of the tip of the X permits a comparison with another segment of genomic DNA. We predict four nested genes, corresponding to 1.4% of all of the genes we identify. This is probably an underestimate, because ab initio gene prediction programs do not predict genes within genes.
One group of duplicated genes worthy of specific mention in this region is the cytochrome P450s, small monooxygenases often involved in the metabolism of xenobiotic compounds. Eighty-seven genes encoding these microsomal or mitochondrial enzymes had been identified in the essentially complete Joint Sequence of D. melanogaster (Nelson 2000). Only two (l(2)35Fb in the Adh region [Ashburner et al. 1999] and disembodied [Chávez et al. 2000]) have been associated with a mutant phenotype, although polymorphisms at others have implicated them in differential resistance to DDT and other compounds (Berge et al. 1998). One characteristic of the genes encoding these proteins is that they often occur in small clusters, indicating an expansion of the gene family by duplication. In region 1-3 we have identified five cytochrome P450-encoding genes (Cyp4g1, Cyp4d1, Cyp4d2, Cyp4ae1, and Cyp4d14); of these, the latter three are in tandem within about 7.5 Kb at 2E1 and Cyp4d1 is some 12 Kb distal at 2D6. The Cyp4g1 gene (at 1B4) appears to be more abundantly transcribed than any other P450 gene in D. melanogaster, at least judging from the large number of its EST sequences (59; Nelson 2000).
We have analyzed all of the known or predicted proteins by several methods, most extensively by BLASTP against data sets derived from SWISS-PROT and TrEMBL sorted by taxonomic origin (see Ashburner et al. 1999). We have also analyzed all of the protein sequences by various methods to detect protein motifs and domains. Overall, 71% of the known or predicted proteins have a BLASTP match with an expectation of 10^-7 or less when compared with nondrosophilid protein sequences. Similarly, 137 contain at least one known motif or domain (other than the PROSITE Nuclear Localization Signal profile) as determined by matches against InterPro (http://www.ebi.ac.uk/interpro/). These numbers are, of course, both preliminary and transitory. All of these data have been communicated to FlyBase and can be found in the supplementary data (see Methods). We have chosen only to present the PFAM hits in Table 1, as an indication of the data obtained.
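The proportion of proteins with a significant match can be thought of as a simple filter over similarity-search hits. The following is a minimal sketch only: it assumes the expectation threshold is 10^-7, and the tuple layout and identifiers (`p1`, `s1`, and so on) are hypothetical placeholders rather than the project's actual data format.

```python
E_VALUE_CUTOFF = 1e-7  # assumed reading of the expectation threshold in the text


def proteins_with_match(hits, cutoff=E_VALUE_CUTOFF):
    """Return the set of query IDs having at least one hit at or below the cutoff.

    `hits` is an iterable of (query_id, subject_id, e_value) tuples; this
    layout is an assumption made for illustration.
    """
    return {query for query, _subject, e_value in hits if e_value <= cutoff}


def fraction_matched(hits, all_queries, cutoff=E_VALUE_CUTOFF):
    """Fraction of all query proteins that have a hit passing the cutoff."""
    queries = set(all_queries)
    return len(proteins_with_match(hits, cutoff) & queries) / len(queries)


# Made-up example: two of four proteins pass the threshold.
example_hits = [("p1", "s1", 1e-30), ("p2", "s2", 0.01), ("p3", "s3", 1e-7)]
example_queries = ["p1", "p2", "p3", "p4"]
print(fraction_matched(example_hits, example_queries))
```

A protein is counted once however many hits it has, which is why the hits are reduced to a set of query identifiers before the fraction is taken.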
As we have discussed previously (Benos et al. 2000), examples of 12 different transposable elements were identified within the region analyzed: 412, roo, Doc, FB, jockey, mgd1, Tirant, S-element, 1360, Burdock, blastopia, and yoyo. It is possible that more transposable elements may be present in the region; however, we have not identified them molecularly.
Chromosomal Regions of Particular Interest
The achaete-scute Complex
The achaete-scute complex (AS-C) comprises a region of ~95 Kb (between y and Cyp4g1; chromosomal bands 1B1-4) defined by the physical mapping of >110 achaete (ac) and scute (sc) mutations associated with chromosomal breakpoints or insertions of transposable elements (Campuzano et al. 1985).
The broad Complex
In region 2B1-10 of the polytene X chromosome, an ecdysterone-induced puff forms in the late third instar larva (Becker 1962).
The zeste-white Region
The discovery of polytene chromosomes in the larvae of Drosophila in the early 1930s was a major event in the history of genetics. These chromosomes are characterized by a nonperiodic pattern of darkly staining bands and lightly staining interbands, reflecting differences in the degree of DNA packing. These patterns are both colinear with the genetic map, as proven by Bridges (1937)
P-element Insertions
The majority of P-element screens carried out to date have been performed on the autosomes. Spradling and colleagues (1999) have described their attempts to consolidate a number of such P-element collections, including a large collection of lethal P-element insertions on the second chromosome (Török et al. 1993). Similarly, the EDGP have described a collection of lethal insertions on chromosome 3 (Deak et al. 1997). We have begun to generate a comparable collection of P-element insertion mutants on the X chromosome in anticipation of their value for functional genomics. The initial group of mutants corresponds to ~500 lethal insertions that have been mapped by hybridization of P-element probes to polytene chromosomes in situ. The characterization of this collection will be presented elsewhere. We have localized the insertion sites for 64 P-element-induced lethal mutations that map to divisions 1-3, and determined the gene(s) whose function is likely to be affected by each insertion (Table 2).
We have carried out a similar computational analysis on a collection of random EP-element insertions sequenced by the BDGP (Rørth et al. 1998). Forty-seven of these had been mapped to divisions 1-3 by in situ hybridization; this is a density of one element per 55 Kb, about twice that found for EP-elements in the Adh region (1/108 Kb). This difference in density is not due to the existence of major hotspots for insertion of EP-elements on the X chromosome tip, nor to a higher proportion of the insertions on the X tip being outside genes (in both regions ~47% of EP-element insertions are within genes).
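The density comparison is a short calculation. In the sketch below the region length is taken as the rounded figure of 2.6 Mb, which is an assumption; using the exact contig length would change the result slightly.

```python
REGION_KB = 2600             # assumed region length (~2.6 Mb), in Kb
EP_INSERTIONS = 47           # EP-elements mapped to divisions 1-3
ADH_KB_PER_INSERTION = 108   # density reported for the Adh region

# Average spacing between EP-element insertions at the tip of the X.
kb_per_insertion = REGION_KB / EP_INSERTIONS

# How many times denser the insertions are here than in the Adh region.
density_ratio = ADH_KB_PER_INSERTION / kb_per_insertion

print(f"one insertion per {kb_per_insertion:.1f} Kb")
print(f"density relative to the Adh region: {density_ratio:.2f}x")
```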
From a total of 111 P-element insertions that we have located within the region analyzed, 41% fall in regions in which they are expected to affect the expression of genes already known, whereas 50% are expected to affect the expression of predicted genes. These expectations are based on the positions of the P-element insertion either within transcribed regions or within 5 Kb 5' to these. Some insertions might affect two different genes, one on either side of the insertion (Table 2). Only 13 elements or clusters of elements map more distantly, 7-33 Kb 5' to the nearest known or predicted gene (footnotes in Table 2; of these, five elements or groups were selected as lethal, but may or may not cause the lethality).
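The positional rule described above (an insertion within a transcribed region, or within 5 Kb 5' to it) can be sketched as a simple interval test. The gene names and coordinates below are hypothetical, and the tuple layout is an assumption for illustration rather than the format actually used in the analysis.

```python
UPSTREAM_WINDOW = 5_000  # 5 Kb 5' to the transcribed region, as in the text


def affected_genes(insertion_pos, genes, window=UPSTREAM_WINDOW):
    """List genes whose expression an insertion at insertion_pos might affect.

    `genes` is a list of (name, start, end, strand) tuples with start <= end
    and strand '+' or '-'. For '-' strand genes the 5' flank lies beyond `end`.
    """
    hits = []
    for name, start, end, strand in genes:
        if start <= insertion_pos <= end:
            hits.append((name, "within transcribed region"))
        elif strand == "+" and start - window <= insertion_pos < start:
            hits.append((name, "5' flank"))
        elif strand == "-" and end < insertion_pos <= end + window:
            hits.append((name, "5' flank"))
    return hits


# Two hypothetical, divergently transcribed genes.
example_genes = [("geneA", 10_000, 12_000, "+"), ("geneB", 2_000, 6_000, "-")]
print(affected_genes(8_000, example_genes))
```

An insertion at position 8,000 lies in the 5' flank of both hypothetical genes, which illustrates how a single insertion might affect two genes, one on either side.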
Comparison with the Joint Sequence
The determination of the sequence and gene annotation of chromosomal divisions 1-3 was completed and submitted to the EMBL-Bank by February 7, 2000, six weeks before the publication and release of the annotated Joint Sequence of the D. melanogaster genome in March 2000 (Adams et al. 2000). Although preexisting gene features were taken into account during the analysis of the Joint Sequence, these are essentially independent annotation experiments that can be compared. Moreover, direct comparison of the nucleotide sequence determined by the EDGP with the Joint Sequence allows one to assess some of the strengths and weaknesses of the two different sequencing strategies. We have compared both individual gene predictions and the overall sequence between these two studies.
Comparison of Gene Predictions
We have identified 277 protein coding genes in the region 1A-3C, including 94 genes that had been known previously. There are 275 genes common to both studies; two, namely EG:80H7.1 and EG:196F3.1, have no corresponding prediction in the Joint Sequence. Neither of these two predictions is very strong (in terms of their GeneFinder and/or Genscan scores; see Methods), but both contain trypsin protein motifs (EG:196F3.1 has only a PROSITE match whereas EG:80H7.1 has both PROSITE and PFAM matches). There are 33 genes predicted on the Joint Sequence that are absent from the EDGP annotation. Some (13) of these predictions were also seen in the EDGP analysis but were excluded due to their low scores and lack of other supporting evidence (see Methods). We have examined the data for the remaining 20 and consider these to be overpredictions in the Joint Sequence, for a variety of reasons (see supplementary data). We have carefully compared the known or predicted amino acid sequence of all genes between the annotated Joint Sequence and our analysis (Table 1). At the level of their predicted proteins, 60% of the 275 genes in common are identical or differ by no more than 1% of their amino-acid residues (class 0); 31.3% have one or more minor differences, for example in the choice of ATG or stop codon or in an internal exon (classes A-C); 8.7% (24 genes) have major differences in their structure between the two studies (class D). We have analyzed these 24 in detail; for 10 of them we cannot make a decision, based on the available data, as to which interpretation is the better. However, for the remaining 14 (i.e., 5.1% of the total number of genes) the EDGP model is the more accurate, based on the EST data. (Note that the Joint Sequence analysis did not use all available ESTs, as noted in Methods.) Some of the class C differences (Table 1) in gene models may reflect different splice variants of the same gene.
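The class proportions can be cross-checked against the gene counts. In this sketch only the totals of 275, 24, and 14 come from the text; the class 0 count is inferred from the quoted 60%, the class A-C count is taken as the remainder, and the `is_class_0` helper is a hypothetical illustration of the 1% criterion rather than the procedure actually used.

```python
COMMON_GENES = 275   # genes common to both annotations
CLASS_D = 24         # genes with major structural differences
EDGP_BETTER = 14     # class D genes where EST data favour the EDGP model


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def is_class_0(n_differing_residues, protein_length):
    """True if the proteins differ by no more than 1% of their residues."""
    return 100 * n_differing_residues <= protein_length


class_0 = round(0.60 * COMMON_GENES)              # inferred from the quoted 60%
class_a_to_c = COMMON_GENES - class_0 - CLASS_D   # remainder (minor differences)

print(class_0, class_a_to_c, CLASS_D)
print(percent(class_a_to_c, COMMON_GENES), percent(CLASS_D, COMMON_GENES))
print(percent(EDGP_BETTER, COMMON_GENES))
```

Under these assumptions the three classes account for all 275 genes, and the remainder reproduces the quoted 31.3%.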
Since the submission of version 1.0 of the Joint Sequence, some 263 "new" genes from across the genome have been sequenced by the community as a whole (and submitted to EMBL-Bank, GenBank, or to DDBJ). Of these, some 53% are essentially identical in their protein coding regions to the Joint Sequence predictions (M. Ashburner, unpubl.). It is of some interest that both these community data and the EDGP data indicate that ~55% of the proteins predicted by the Joint Sequence are essentially correct. This is a minimum figure, because it takes no account of alternative splice forms or the fact that some of the new community data represent only partial sequences.
Overall Sequence Comparison
The Joint Sequence for region 1A-3C is found on nine GenBank entries (Fig. 3). We have compared it to the contiguous EDGP sequence using the MUMmer program of Delcher et al. (1999)
The differences attributable to transposable elements (see Fig. 3) include two roo elements of different length found at the same position (nucleotide 572,960) in both sequences; five roo elements of variable location; and 10 single occurrences of other transposable element families at unique locations (BEL, 412, FB4, I, 412-like and mgd1 in the Joint Sequence, and Doc, Tirant, Burdock, and FB in the EDGP Sequence). It should be noted that two of the long runs of n in the Joint Sequence correspond to transposable elements in the EDGP Sequence (see below). The 17 differences in transposable elements are not surprising, as the majority of the two sequences were derived from two quite different fruitfly strains. In the EDGP sequence we have identified 18 transposable elements or fragments of elements and at least 7 of these differ in position in the Joint Sequence.
Ten of the 30 blocks are long gaps in the Joint Sequence (see Fig. 3), represented in the GenBank accessions by long runs of n, with a total estimated length of 39,938 nucleotides. For four of the 10 gaps, the length of the gap in the Joint Sequence is considerably larger than the corresponding region in the EDGP sequence; for example, the run of 4722 n's at position 1,245,921 corresponds to 102 bp in the EDGP sequence. We presume the reason for this is that the gap in the Joint Sequence represents a transposable element. Indeed, two gaps are caused by transposable elements: the 6353-bp gap at 2,294,896 corresponds to a 6062-bp Burdock element in the EDGP sequence, and the 8060-bp gap at 2,511,915 corresponds to a roo element in the EDGP sequence. Of the four remaining gaps, two are complex (at 237,007 bp and 556,147 bp) and cannot be explained simply; one corresponds to the ph-d/ph-p gene duplication (see below), and the final gap, at 2,011,597 bp, will be discussed below.
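Because the gaps appear in the sequence records as long runs of n, they can be located with a simple scan. The following is a minimal sketch, not the method used in the comparison; the function name and the default minimum run length of 100 are assumptions chosen for illustration.

```python
import re


def find_n_runs(sequence, min_length=100):
    """Return (start, length) for each run of n/N at least min_length long.

    `start` is 1-based, matching the usual convention for sequence coordinates.
    """
    pattern = re.compile(r"[nN]{%d,}" % min_length)
    return [(m.start() + 1, m.end() - m.start()) for m in pattern.finditer(sequence)]


# Toy sequence with one run of five n's and one run of three N's.
toy = "ACGT" + "n" * 5 + "GG" + "N" * 3 + "T"
runs = find_n_runs(toy, min_length=3)
print(runs)
print("total gap length:", sum(length for _start, length in runs))
```

Summing the run lengths gives a total estimated gap length of the kind quoted in the text, although the run length in a record need not equal the true size of the missing sequence.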
The remaining three long blocks of the 30 that differ between the two sequences (see Fig. 3) are informative, and will be discussed more fully. Two are only found in the EDGP sequence and are clearly the result of misassemblies in the Joint Sequence. The first of these is just 3' to the Actn gene and is 4.7-Kb long; the probable explanation for it is that the Joint Sequence has failed to properly assemble a duplicated sequence that includes a partial duplication of the predicted gene EG:133E12.4. This duplication was first indicated by the matches of EST sequences (e.g., EMBL accession no. AA202518, EMBL accession no. AA696909) to both an exon of EG:133E12.4 and to a region between this gene and Actn. The duplication is 4777 bp in length and the two copies are only mismatched over a 77-bp internal gap (1.5% mismatch). The second is in the region of the duplicate gene pair ph-d and ph-p; the Joint Sequence has an incorrect model for ph-p. That this region includes a long tandem repeat is known from the work of Deatrick et al. (1991)
METHODS
Clone Libraries and Map Construction
DNA from two strains has been sequenced. About 44% of the sequence is from BAC clones derived from the same strain as that sequenced by the BDGP and by Celera; in contrast, the cosmid clones sequenced were from a different strain (Fig. 1). The relationship between these strains cannot be determined. Both strains were free of P-elements.
The cosmid library used for the construction of the X chromosome physical map was derived from a wild-type (Canton-S) strain and described in detail by Sidén-Kiamos et al. (1990). It has an estimated average insert size of 35 Kb and contains ~18,000 clones providing a fourfold coverage of the genome. The library is available on high density double spotted filters from the MRC HGMP Resource Centre (http://www.hgmp.mrc.ac.uk/Biology/Bio.html).
Three BAC clone libraries were used; each was constructed from DNA from the y2; cn bw sp isogenic strain. Two BAC libraries were made at CEPH (Centre d'Etude du Polymorphisme Humain). One (BACN clones) was prepared with NdeII inserts and the other (BACH clones) with HindIII inserts, both in the vector pBeloBACII. These two libraries were made with pools of size-fractionated DNA that gave mean insert sizes of up to 90 Kb. The 23,400 clones gave ~10-fold coverage of the genome. The third library was of EcoRI digested DNA (BACR clones) and was constructed in the vector pBACe.3.6 by Aaron Mammoser and Kazutoyo Osoegawa at the Roswell Park Cancer Institute (Buffalo, NY) in collaboration with the BDGP (Hoskins et al. 2000). This library gave an ~17-fold coverage of the genome with an average insert size of 165 Kb.
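Fold coverage of a clone library is the total cloned DNA divided by the genome size. The sketch below applies this to the cosmid library figures quoted above; the genome size of 160 Mb is an assumption introduced purely for illustration and is not stated in the text.

```python
ASSUMED_GENOME_MB = 160.0  # assumption for illustration; not given in the text


def total_cloned_mb(n_clones, mean_insert_kb):
    """Total DNA held in the library, in Mb."""
    return n_clones * mean_insert_kb / 1000.0


def fold_coverage(n_clones, mean_insert_kb, genome_mb=ASSUMED_GENOME_MB):
    """Library coverage expressed as multiples of the genome size."""
    return total_cloned_mb(n_clones, mean_insert_kb) / genome_mb


# Cosmid library: ~18,000 clones with an average insert of 35 Kb.
cosmid_mb = total_cloned_mb(18_000, 35)
print(cosmid_mb, "Mb cloned")
print(round(fold_coverage(18_000, 35), 1), "fold coverage under the assumed genome size")
```

Under the assumed genome size, the cosmid library figures give roughly the fourfold coverage quoted; a different genome size would scale the result proportionally.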
Sequencing
Cosmids and BACs were sequenced by a two-stage approach involving random sequencing of sub-clones followed by directed sequencing to resolve problems. DNA from cosmids and BACs was sonicated and fragments of 1.4-2 Kb were cloned into either M13 or pUC18 vectors. Clones were sequenced using dye-terminator chemistry and loaded on ABI373 or ABI377 automated sequencing machines. Sequence base calling and contig assembly were accomplished using Phred/Phrap software (Ewing and Green 1998; Ewing et al. 1998) and editing took place in either Consed (Gordon et al. 1998) or Gap4 (Bonfield et al. 1995). Gaps were filled using a combination of custom primer walking and PCR.
Cosmid and BAC DNAs were nebulized and end repaired. Following agarose gel purification, fragments of ~1500 nucleotides were ligated to linearized vector (pTZ19R or pCR-BluntII) and cloned in the KK2186 strain of Escherichia coli. Bacterial clones were picked at random and cultured overnight. Plasmid DNAs were prepared by an alkaline lysis method and purified using the QIAprep 96 Turbo Miniprep kit (QIAGEN). Insert DNAs were sequenced from both ends using universal primers. Cycle sequencing was performed with labeled terminators using AmpliTaq and the Big Dye Terminator Cycle Sequencing Ready Reaction kit (Applied Biosystems).
The Heidelberg group employed the RANDI strategy that combines the advantages of RANdom and DIrected approaches. It involves systematic simultaneous sequencing on both strands from clones of combined libraries without cloning gaps. The random library fragments were generated by separate partial digestion with two four-cutter restriction enzymes (Tsp, Sau3A), gel-purified and ligated into plasmid vector. In parallel, BAC or cosmid DNA was completely digested with EcoRI (or HindIII) and fragments were isolated from agarose gel and inserted into the pUC vector. Their sequences served as a "scaffold" in the assembly of the complete sequence of the BAC genomic insert and also as templates for primer walking in the finishing stage. Cycle sequencing of plasmid DNA was performed with the AmpliTaqFS core kit (Applied Biosystems), using forward and reverse primers labeled with FITC or CY5. An MJ Research PT-200 cycler was used for 25 cycles (97°C, 15 sec; 55°C, 30 sec; 68°C, 30 sec). Reactions were loaded off-gel on the 72-clone porous-membrane combs, applied to 60-cm long polyacrylamide gels (4.5% Hydrolink Long Ranger gel solution, FMC) and analyzed on the ARAKIS sequencing system with array detectors, developed at EMBL (Erfle et al. 1997). This system allows simultaneous on-line sequencing of both strands (doublex sequencing), with the two sequencing products obtained in a single sequencing reaction, each labeled with a different fluorescent dye (Wiemann et al. 1995). Up to 2000 bases are thus obtained simultaneously in one sequencing reaction, which represents an efficient system for identifying large numbers of long sequences in one run. Raw sequencing data were evaluated, analyzed, and the consensus sequence assembled, using the software packages (LaneTracker and GeneSkipper) developed at EMBL. Remaining sequencing gaps were covered by primer walking (Voss et al. 1993). Direct cosmid or BAC DNA sequencing was carried out essentially as described elsewhere (Benes et al. 1997).
P-element Stocks and Mapping
A large-scale screen for insertions of the enhancer trap vector P{lacW} (Bier et al. 1989) in essential X chromosome genes has been performed in H. Jäckle's laboratory (Peter et al., in prep.). Females homozygous for a male-sterile insertion of the P{lacW} element in chromosome 2 were crossed en masse to w/Y; wg Sp/CyO; P{ry+=delta2-3}(99B) males. In the next generation five homozygous FM6 females were mated to two w/Y; P{lacW}/CyO; P{ry+=delta2-3}(99B)/+ males. F2 daughters in which the CyO and P{lacW} chromosomes had cosegregated were individually mated to FM7c/Y males. Lines that produced only FM6 sons in the F3 generation were kept as candidates for a lethal insertion. If these re-tested, then the lethal insertion was kept in stock balanced with FM7c.
P{lacW} insertion sites were mapped by either plasmid rescue or inverse PCR. DNA from adult flies was isolated using a QIAGEN column, digested overnight with an appropriate restriction enzyme, and then ligated under conditions favoring intramolecular joining. For plasmid rescue, E. coli cells were electroporated with the ligated DNA and plated to select ampicillin-resistant colonies. These were used to inoculate small-scale overnight cultures, from which plasmid DNA was then isolated. Cycle sequencing with a primer complementary to the 31-bp inverted repeat of the P-element was performed on an ABI373 DNA sequencer using dye terminator technology. For inverse PCR, we essentially followed the BDGP protocol. We used their primers Plac1 and Plac4 to amplify 5' sequences and primers Pry4 and Plw3-1 to amplify 3' sequences. Sequencing was done as before, with primer SP1 for 5' and primer SP6 for 3' analysis.
Sequence Analysis
Sequences were analyzed by the EDGP on a clone-by-clone basis;
i.e., only fully sequenced clones (cosmids or BACs) were included. The
overall analysis scheme is similar to that adopted by other genome
projects (e.g., C. elegans Sequencing Consortium 1998).
tRNA genes were identified by the tRNAscan-SE program, v. 1.0 (Lowe and
Eddy 1997). Candidate protein coding genes were predicted independently
by GENEFINDER version 0.84 (P. Green, unpubl.) and the
publicly available GENSCAN version 1.0 (Burge and Karlin 1997).
These two programs employ fundamentally different algorithms and
complemented each other in gene discovery. GENSCAN and
GENEFINDER had been trained on a vertebrate gene set and a
Drosophila-specific set (compiled by G. Helt, pers. comm.),
respectively. We measured the accuracy of prediction of the two
programs with already known Drosophila genes and we found them
to be comparable. However, each of them performed better on a
different set of genes. As expected, Drosophila-trained
GENEFINDER showed a preference for genes with fewer exons
and smaller introns when compared to the vertebrate-trained
GENSCAN.
Additional supporting evidence for the predicted genes, as well as
indications of their function, was obtained by similarity searches
against SWISS-PROT and TrEMBL protein databases (Bairoch and Apweiler 2000),
Drosophila nucleic acid sequences (derived from
EMBL-Bank), and Drosophila EST sets, generated by the BDGP (Rubin et al. 2000b)
and by Andrews et al. (2000). (Note that the
annotation of version 1 of the Joint Sequence did not use the entire
BDGP EST data set; in particular 4,654 3' ESTs, out of a total of
86,121, were not used [S. Lewis, pers. comm.]). EST alignments were
also used to fine-tune the intron/exon boundaries of the predicted
genes. Simple repetitive sequences were filtered out by the
TANDEM, INVERTED, and
QUICKTANDEM programs (R. Durbin, pers. comm.), whereas
repeats of higher complexity were screened out using similarity
searches against Drosophila repetitive and transposable
element databases (see below). For protein and nucleotide database
searches we used BLASTX and BLASTN, v. 1.4.9 (Altschul et al. 1990),
respectively.
Finally, protein domains/motifs of the predicted genes were identified
by the PPSEARCH and HMMER (v. 2.1.1) programs,
scanning the PROSITE and PFAM databases, respectively. PROSITE output
was further filtered using the EMOTIF program
(Nevill-Manning et al. 1998).
All data generated by the automatic computational analysis described
above were parsed into an ACeDB-based database (http://www.acedb.org/), XDrosDB, tailored to the needs of the EDGP. The combined data were
manually examined/analyzed using ACeDB software. During this analysis
we disregarded genes with a GENEFINDER score <50, if
there was no other supporting evidence for them (i.e., protein
similarity and/or EST matches). This cutoff is stricter than the one
used by the BDGP (cutoff = 20) for the analysis of the Adh
region (Ashburner et al. 1999) and presumably increases the number
of real genes rejected (false negatives). However, we chose to set it this
high to avoid overpredicting genes (false positives).
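The cutoff rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the procedure actually used, which was a manual examination in ACeDB; the `Prediction` record, its field names, and the example gene names are hypothetical.

```python
from dataclasses import dataclass

# Threshold stated in the text; the BDGP Adh analysis used 20.
GENEFINDER_CUTOFF = 50


@dataclass
class Prediction:
    """Hypothetical record for one predicted gene (field names are assumptions)."""
    name: str
    genefinder_score: float
    has_protein_similarity: bool = False
    has_est_match: bool = False


def keep_prediction(p, cutoff=GENEFINDER_CUTOFF):
    """Disregard a gene only if it scores below the cutoff AND lacks other evidence."""
    supported = p.has_protein_similarity or p.has_est_match
    return p.genefinder_score >= cutoff or supported


def filter_predictions(predictions, cutoff=GENEFINDER_CUTOFF):
    """Return the predictions that survive the cutoff rule, in input order."""
    return [p for p in predictions if keep_prediction(p, cutoff)]


if __name__ == "__main__":
    candidates = [
        Prediction("g1", 80),                      # high score, kept
        Prediction("g2", 30),                      # low score, no support, dropped
        Prediction("g3", 30, has_est_match=True),  # low score but EST support, kept
    ]
    print([p.name for p in filter_predictions(candidates)])
```

Raising `cutoff` trades false positives for false negatives, which is the trade-off discussed in the text.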
During the initial phase of our work, we created and subsequently curated three datasets in collaboration with the BDGP. One consisted of 1332 D. melanogaster coding sequences from genes that had previously been studied genetically and/or biochemically. This is a nonredundant set, i.e., it includes only one copy of each gene. Where a gene appears in multiple entries in the public databases (e.g., alternatively transcribed, submitted from more than one laboratory, etc.), we manually selected one copy (usually the best documented or the longest open reading frame). We used this dataset to test the accuracy of the two chosen gene prediction programs (GENEFINDER, GENSCAN), and as a source for hexanucleotide score calculation (GENEFINDER). This dataset has subsequently been expanded and updated to include genes identified by Drosophila genome projects (EDGP, BDGP, and Celera), with the help of Leyla Bayraktaroglou (FlyBase at Harvard). Both the original and expanded versions, together with information about their history, are available from ftp://ftp.ebi.ac.uk/pub/databases/edgp/sequence_sets/ or from http://fruitfly.berkeley.edu/.
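As a rough illustration of how redundancy can be removed from such a set, the Python sketch below keeps one entry per gene by choosing the longest coding sequence. It automates only the "longest open reading frame" heuristic; the selection described above was done by hand and also weighed how well an entry was documented. The tuple layout, gene names, and accession strings are hypothetical.

```python
def longest_per_gene(entries):
    """Keep one entry per gene: the one with the longest coding sequence.

    entries: iterable of (gene_name, accession, cds_sequence) tuples.
    Returns a dict mapping gene_name -> (accession, cds_sequence).
    Ties are resolved in favor of the entry seen first.
    """
    best = {}
    for gene, accession, seq in entries:
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (accession, seq)
    return best


if __name__ == "__main__":
    # Hypothetical entries: two database records for geneA, one for geneB.
    records = [
        ("geneA", "acc1", "ATGAAATAA"),
        ("geneA", "acc2", "ATGAAACCCTAA"),
        ("geneB", "acc3", "ATGTAA"),
    ]
    for gene, (accession, seq) in longest_per_gene(records).items():
        print(gene, accession, len(seq))
```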
Similarly, a nonredundant collection of 47 D. melanogaster transposable elements and another consisting of 96 miscellaneous repetitive sequences were assembled during the initial phase of our project. These datasets were used to identify complex repetitive regions, as described above. They are also available from the same ftp site or from the BDGP site.
For clarity, we use the term "Joint Sequence" to refer to v1.0 of
the complete sequence of the genome of D. melanogaster (Adams et al. 2000)
released on March 24, 2000 by Celera. Comparisons of
predicted, or known, protein sequences from the EDGP project with those
from the Joint Sequence were done by CLUSTALW using the
protein sequences of release 1.0 of the Joint Sequence (http://www.fruitfly.org/sequence/sequence_db/aa_gadfly.dros of March
21, 2000). These comparisons were then analyzed by hand. The comparison
of the entire sequence of the X chromosome tip with the
sequence of the same region from the Joint Sequence was done using the
MUMmer program (Delcher et al. 1999), which aligns long
genomic regions by finding corresponding maximal unique matches. Nine
separate alignments were done using the following GenBank accession
nos.: AE003417, AE003418, AE003419, AE003420, AE003421, AE003422,
AE003423, AE003424, and AE003425, each being matched against the entire
EDGP sequence. The resulting alignments were analyzed by hand to find
regions where the discrepancies between the sequences were large.
Figure 3 was drawn by hand and is a graphic depiction of the alignment produced by MUMmer. Large segments absent from one of the
sequences have been highlighted.
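The notion of a maximal unique match can be illustrated with a deliberately naive Python sketch. It is not MUMmer's algorithm, which uses a suffix tree to handle megabase-scale sequences; this version is quadratic and suitable only for short toy strings. The function names and the `min_len` parameter are assumptions introduced for illustration.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def maximal_unique_matches(a, b, min_len=3):
    """Return (pos_in_a, pos_in_b, length) for each maximal unique match.

    A match is reported if it cannot be extended to the left or right
    and its sequence occurs exactly once in a and exactly once in b.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two sequences agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            segment = a[i:i + k]
            if count_overlapping(a, segment) == 1 and count_overlapping(b, segment) == 1:
                mums.append((i, j, k))
    return mums


if __name__ == "__main__":
    print(maximal_unique_matches("AAACCCGGG", "TTCCCGTT"))
```

Large stretches of one sequence that fall between consecutive matches, with no counterpart in the other sequence, correspond to the kind of discrepancy that was then examined by hand.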
The results presented in this study were obtained on or before February 7, 2000. If we repeated the same analysis today, we would assign a function (by protein similarity) to 23 more of the predicted genes, raising the proportion of genes with significant protein similarities to 66% of the 206 newly identified genes.
Supplementary data are available from ftp://ebi.ac.uk/pub/databases/edgp/EDGP-GenomeResearch_suppdata_2001.
ACKNOWLEDGMENTS
This work was supported by a Contract from the European Commission under Framework Programme 4 (coordinator D.M. Glover), by a grant from the Medical Research Council, London to M.A. and D.M.G., by a grant from the Dirección General de Investigación Científica y Técnica to J.M., by a grant from the Hellenic Secretariat General for Science and Technology to K.L., and by a grant from the Deutsche Humangenomprojekt to H.J. R.D.C.S. was supported by a Wellcome Trust Senior Fellowship. We thank many colleagues for their help. We are grateful to Gerry Rubin and his colleagues at the BDGP, particularly Suzanna Lewis, Sima Misra, and Susan Celniker (and, of course, Gerry himself) for the exchange of materials, information, and ideas over the years. Greg Helt of the BDGP was very helpful in providing us with the initial Drosophila gene training set. We also thank Rolf Apweiler and his SWISS-PROT/TrEMBL team at the EBI, particularly Alexander Kanapin and Wolfgang Fleischmann, for their help with the protein motif analysis. We also thank Rolf Apweiler, head of that team, for his blessings. Richard Durbin's group at the Sanger Center have been extraordinarily helpful; in particular, Daniel Lawson gave tremendous help with ACeDB despite having to bend double at times. Kim Rutherford of the Pathogen Sequencing Unit at the Sanger Center provided the software to draw Figure 1; without this we might have been lost. We thank Brian Oliver of the NIH, Bethesda for a pre-print copy of his paper on testis ESTs, Leyla Bayraktaroglou (FlyBase group, Harvard) for her help in the curation of reference sequence data sets, and David Judge of the Cambridge School of Biological Sciences Biocomputing Unit for help.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
15 Present address: Department of Genetics, School of Medicine, Washington University, 4566 Scott Avenue, St. Louis, MO 63110 USA.
16 Corresponding author.
E-MAIL m.ashburner{at}gen.cam.ac.uk; FAX 44-1223-333992.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.173801.
REFERENCES
one-hit coverage in yeast artificial chromosomes. Chromosoma 100: 495-509.