Vol. 10, Issue 7, 959-966, July 2000
LETTER
ABSTRACT
Despite the accumulation of sequence information sampling from a broad spectrum of phyla, newly sequenced genomes continue to reveal a high proportion (30%-50%) of "uncharacterized" genes, including a significant number of strictly "orphan" genes, i.e., putative open reading frames (ORFs) without any resemblance to previously determined protein-coding sequences. Most genes found in databases have only been predicted by computer methods and have never been experimentally validated. Although theoretical evolutionary arguments support the reality of genes when homologs are found in a variety of distant species, this is not the case for orphan genes. Here, we report the direct reverse transcriptase-polymerase chain reaction assay of 25 strictly orphan ORFs of Escherichia coli. Two growth conditions, exponential and stationary phases, were tested. Transcripts were identified for a total of 19 orphan genes, with 2 genes found to be expressed in only one of the two growth conditions. Our results suggest that a vast majority of E. coli ORFs presently annotated as "hypothetical" correspond to bona fide genes. By extension, this implies that randomly occurring "junk" ORFs have been actively counterselected during the evolution of the dense E. coli genome.
INTRODUCTION
Following the pioneering whole genome shotgun sequencing of Haemophilus influenzae (Fleischmann et al. 1995), bacterial genomes have accumulated steadily in public databases (see www.tigr.org/tdb/mdb/). The sequence universe of gram-negative proteobacteria is well represented, with two complete genomes for the gamma subdivision (H. influenzae and Escherichia coli [Blattner et al. 1997]), one for the alpha subdivision (Rickettsia prowazekii [Andersson et al. 1998]), one for the beta subdivision (Neisseria meningitidis [Parkhill et al. 2000a; Tettelin et al. 2000]), and two for the epsilon subdivision (Helicobacter pylori [Tomb et al. 1997; Alm et al. 1999] and Campylobacter jejuni [Parkhill et al. 2000b]). Gram-positive bacteria are also well sampled by four complete firmicute genomes (Bacillus subtilis [Kunst et al. 1997], Mycobacterium tuberculosis [Cole et al. 1998], Mycoplasma genitalium [Fraser et al. 1995], and Mycoplasma pneumoniae [Himmelreich et al. 1996]), two spirochetes (Borrelia burgdorferi [Fraser et al. 1997] and Treponema pallidum [Fraser et al. 1998]), and several Chlamydia species and strains (Stephens et al. 1998; Kalman et al. 1999; Read et al. 2000). The whole genomic sequences of Deinococcus radiodurans (White et al. 1999), of the cyanobacterium Synechocystis (Kaneko et al. 1996), and of the two hyperthermophilic bacteria Aquifex aeolicus (Deckert et al. 1998) and Thermotoga maritima (Nelson et al. 1999) complete an already broad survey of the eubacterial sequence universe. The two other kingdoms of life are represented, on one hand, by five completed genomes of hyperthermophilic archaebacteria (Methanococcus, Methanobacterium, Archaeoglobus, Pyrococcus, and Aeropyrum) (Bult et al. 1996; Klenk et al. 1997; Smith et al. 1997; Kawarabayasi et al. 1998; Kawarabayasi et al. 1999) and, on the other hand, by three eukaryote genomes from Saccharomyces cerevisiae (Mewes et al. 1997), Caenorhabditis elegans (The C. elegans Sequencing Consortium 1998), and Drosophila melanogaster (Adams et al. 2000).
Given this large body of sequence data sampling from the three main phyla and a wide variety of lifestyles (aerobic, anaerobic, intracellular, mesophilic, hyperthermophilic, etc.), it seems paradoxical that each newly sequenced genome continues to reveal a significant fraction of unknown genes. At the time of publication, the fractions of completely unassigned open reading frames (ORFs) (Blattner et al. 1997) were, for instance, 37% for E. coli, 43% for H. influenzae, 45% for Synechocystis, and 32% for M. genitalium. The corresponding figure for yeast is about 40% (Dujon et al. 1994). This trend persists in the most recently deciphered genome, that of T. maritima, where 46% of the ORFs are of unknown function (Nelson et al. 1999). Those numbers are close to the 50% proportion of phylum-specific genes that was predicted a while ago, when the concept of ancient conserved regions was introduced on the basis of statistical arguments (Claverie 1993; Green et al. 1993).
The notion of "uncharacterized" genes is not simple, and depends on the details of the different protocols used to annotate the genomic sequence. In a first step, a computer analysis of the genomic sequence is used to delineate ORFs. There is no accepted standard protocol for the processing of genomic sequence into ORFs ("ORFing"). Different programs (Audic and Claverie 1998; Lukashin and Borodovsky 1998; Salzberg et al. 1999) can be used; different significance, size, or overlap thresholds can be applied; and variable levels of human supervision can be given. Once selected, ORFs are translated into putative protein sequences that are used to query available public databases for homology. Uncharacterized ORFs are those (1) bearing a significant similarity only with proteins of unknown function, or (2) exhibiting no significant similarity to any other real or hypothetical protein. Throughout this article, the latter category will be referred to as "orphan" genes. Like ORFing, homology searches and functional assignments also involve different programs, target databases, and empirical significance thresholds. The classification of genes into the uncharacterized and orphan categories is thus subject to change (Casari et al. 1995; Ouzounis et al. 1995; Fisher and Eisenberg 1999; Mackiewicz et al. 1999).
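As an illustration of the ORF delineation step described above, the following is a minimal sketch in Python of a naive six-frame ORF scan. It is not one of the programs cited in the text: the function names, the ATG-only start rule, and the default 300-nucleotide cutoff are assumptions made for this example, and real annotation tools add statistical scoring and overlap handling.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_len=300):
    """Naive six-frame scan (hypothetical helper, for illustration only).

    Returns tuples (strand, start, end, length) for every stretch that runs
    from the first ATG after a stop to the next in-frame stop codon and is
    at least min_len nucleotides long (stop codon included). Coordinates are
    0-based, half-open, and relative to the scanned strand.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a candidate ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length >= min_len:
                        orfs.append((strand, start, i + 3, length))
                    start = None  # look for the next ATG in this frame
    return orfs
```

Changing `min_len` or the start-codon rule changes the set of reported ORFs, which is one concrete way in which different "ORFing" choices lead to different gene counts.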
Although a large fraction of putative ORFs is not associated with any demonstrated protein or function, the fact that some of them could simply arise by chance is rarely, if ever, discussed. The average protein length is above 350 amino acids (a 1050-nucleotide-long ORF), and proteins shorter than 100 amino acids are rare. A minimal ORF size cutoff of 300 nucleotides is thus often used during genomic annotation. However, even if the probability for a 300-nucleotide-long random sequence to contain an ORF is low, this is nevertheless expected to happen frequently (Fickett 1995; Claverie et al. 1997) within the two strands of a 4.6 million-bp genome such as E. coli. According to a simple Bernoulli model (with equal frequencies of A, T, C, and G), the numbers of expected random ORFs (starting with ATG) are about 200 with sizes ≥300, about 35 with sizes ≥400, and about 4 with sizes ≥500. Those numbers might become even higher for random models with more realistic (e.g., second- or third-order Markov models) nucleotide distributions (Fickett 1995). Potentially, nonphysiological random ORFs could thus represent 5% or more of the 4290 annotated ORFs in E. coli.
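The reasoning behind such an estimate can be sketched as follows, assuming an equal-frequency Bernoulli model in which 3 of the 64 codons are stops. The function names and the counting convention (every ATG on both strands is a candidate start, including nested starts that share a stop) are assumptions of this sketch. Absolute counts depend on the counting convention, which the text does not specify, so the sketch is meant to show the geometric decay of random ORFs with length rather than to reproduce the figures quoted above.

```python
def open_frame_probability(n_codons, p_stop=3 / 64):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1.0 - p_stop) ** n_codons


def expected_random_orfs(genome_bp, min_nt, p_start=1 / 64, p_stop=3 / 64):
    """Expected number of ATG codons, over both strands, that are followed
    by enough stop-free codons to reach min_nt nucleotides.

    Hypothetical convention: every ATG counts as a start, so nested starts
    sharing one stop codon are counted separately.
    """
    positions = 2 * genome_bp          # both strands, every offset
    n_codons = min_nt // 3 - 1         # codons that must follow the ATG
    return positions * p_start * open_frame_probability(n_codons, p_stop)


# Under this model, each additional 100 nucleotides of required length
# multiplies the expected count by (61/64) ** 33.
ratio = expected_random_orfs(4_600_000, 400) / expected_random_orfs(4_600_000, 300)
```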
In the absence of a functional assignment, the identification of a homologous ORF (using its putative translation) in another organism is still good support for the reality of a gene, because the chance is small for a nonphysiological ORF to be conserved throughout evolution. The evidence is of course better if homologous sequences are found across evolutionarily distant organisms or in several of them. Finding homologs only within the same bacterial genome (putative paralogs) is also positive evidence, albeit much weaker, because even random ORFs may get duplicated during evolution. However, the best candidates for being the result of chance (i.e., "junk" ORFs) are the truly orphan ORFs, the putative products of which do not exhibit any significant similarity to any other known sequences.
By using all sequence data currently available, we have reanalyzed the current annotation of the E. coli genome (Blattner et al. 1997) and identified 25 orphan ORFs in a very conservative manner (i.e., eliminating ORFs exhibiting even poorly significant similarity within the databases). The presence of a cognate transcript for each of these highly hypothetical ORFs was then tested by using a sensitive reverse transcriptase-polymerase chain reaction (RT-PCR) assay on mRNA extracted during the exponential and stationary phases of E. coli K-12 MG1655 growth on a rich medium. Reproducible evidence of transcription was found for 19 of these 25 orphan ORFs, 2 of them exhibiting differential expression. This experimental validation of strictly orphan ORFs strongly suggests that most of them are indeed biologically relevant and, by extension, that randomly occurring junk ORFs are virtually absent from the E. coli genome.
RESULTS AND DISCUSSION
Nineteen Orphan Genes Exhibit Evidence of Transcription
Figure 1 shows that amplicons were detected for 19 of the 25 orphan ORFs assayed by RT-PCR, by using the primer pairs listed in Table 1. Figure 2 shows the results of each control PCR experiment, as well as typical generic RT-PCR and PCR controls (Fig. 2c). The results are summarized in Table 1. Two ORFs, B0645 and B1668, showed qualitative evidence of differential expression. A B0645 transcript was detected only within total RNA from the exponential phase of growth, whereas B1668 mRNA was only detected in stationary phase. The 6 negative ORFs did not display any systematic difference in terms of their size, nucleotide composition, or amino-acid and repeat content of their putative translations.
Among the negative ORFs, B2625, B2630, and B4215 belong to single-gene operons (as predicted in Thieffry et al. 1998). B0279 is the second ORF behind the putative DNA-binding protein (of unknown function) YagL, in a two-gene operon. B2760 is the first ORF of a seven-gene operon, all other ORFs of which are of unknown function. B3875 is the last ORF of a four-gene operon involving YihQ (a putative glycosidase), YihP, and YihO (two putative permeases). Among the 19 orphan ORFs for which transcription was detected, 11 are part of single-gene operons, and 8 are in different multi-gene operons. ORF B1085 is an interesting case. A putative Salmonella typhi (the sequencing of which is in progress; see NCBI unfinished microbial database) homolog of this ORF (85% identical at the amino-acid level) was revealed after the completion of this work. However, this apparent homolog has no valid start codon and exhibits an in-frame stop codon. We can think of three hypotheses that might account for this situation: (1) we are looking at sequencing errors in the unfinished S. typhi sequence, (2) this B1085 homolog (perhaps not orthologous) is being lost in S. typhi, or (3) this ORF in fact corresponds to a functional RNA.
In a recent article, Richmond et al. (1999) used high-density arrays composed of full-length ORF-specific PCR products to examine the whole E. coli transcriptome response to isopropyl β-thiogalactoside (IPTG) induction (as a control) or heat-shock treatment. None of the orphan genes tested here were mentioned as significantly involved in any of those responses. The data reported by Richmond et al. provided a reliable measure of expression for only 25% of E. coli ORFs in batch culture at 37°C and consist of expression ratios that cannot easily be compared to our results. More recently, while our report was being prepared, Tao et al. (1999) published another study that used commercially available gene arrays. These results have been made available on a web site and consist of expression values in arbitrary units reflecting the mRNA level for each gene expressed in E. coli growing on minimal and/or rich media. The results of Tao et al. agree with our finding that a vast majority of the orphan genes selected are indeed expressed. Of the 25 ORFs listed in Table 1, 21 (84%) are detected in at least one of their growth conditions. Between the two studies, 23 of the 25 (92%) orphan genes that we selected (Table 1) are seen to be expressed in at least one of the tested conditions (rich/minimal media, exponential/stationary phase). Finally, two of the ORFs (B0279 and B2760) that we failed to detect here also correspond to undetectable mRNA levels according to the data of Tao et al.
Promoter Sequence of the Stationary Phase-Specific ORF B1668
A central regulator of gene expression in stationary phase is the RNA polymerase σ38 factor encoded by the rpoS gene (Tanaka et al. 1993). This alternate sigma factor is thought to recognize a different subset of promoter sequences, although no clear consensus has been found associated with σ38-dependent genes. Site-directed mutagenesis has suggested that the DNA sequence in the -35 region is involved in the discrimination between σ70- and σ38-dependent transcription. To analyze the upstream region of ORF B1668, we first collected the promoter sequences (encompassing the -10 and -35 regions) of 11 rpoS-dependent genes: osmY, osmB, fic, proP, aldB, bolA, xthA, glgS, poxB, cfa, and pexB (Wise et al. 1996). We generated optimal multiple alignments of these sequences by using ClustalW (Higgins et al. 1996). From this multiple alignment, a position-weight matrix (PWM) (defined over 38 positions) was generated by using the NMksite (Claverie and Audic 1996) program. This σ38 PWM was then used to scan the upstream region of ORF B1668. A statistically significant match (Claverie and Audic 1996) (P score < 0.01) was found, encompassing the -35 and -10 regions of the putative promoter. In a control computer experiment, no significant match of the σ38 PWM was found in the upstream region of a selection of experimentally proven σ70-dependent promoters (alaS, dnaQ, leuX, rnaII, rnh, rplJ, rpsA, rrnE, tufB) (Tanaka et al. 1993). The σ38 recognition motif that we designed from genes previously known to be specific to the stationary phase is thus in agreement with the promoter sequence and expression behavior of ORF B1668.
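The general idea of building a position-weight matrix from aligned promoters and scanning an upstream region with it can be sketched as below. This is a generic log-odds PWM with pseudocounts, not the NMksite program: the function names, the pseudocount, the uniform 0.25 background, and the requirement of gap-free aligned sequences are all assumptions of this example, and no P score is computed.

```python
import math

BASES = "ACGT"


def build_pwm(aligned, pseudocount=1.0, background=0.25):
    """Build a log-odds matrix from equal-length, gap-free aligned sites.

    Returns a list (one entry per position) of dicts mapping each base to
    log2(observed frequency / background frequency).
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must all have the same length")
    pwm = []
    for pos in range(length):
        column = [s[pos].upper() for s in aligned]
        total = len(column) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def scan(pwm, seq):
    """Slide the matrix along seq; return (best_score, offset), or None if
    seq is shorter than the matrix. Unknown characters contribute 0."""
    seq = seq.upper()
    width = len(pwm)
    best = None
    for off in range(len(seq) - width + 1):
        score = sum(pwm[i].get(seq[off + i], 0.0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, off)
    return best
```

In practice the best score would be compared with scores obtained on control sequences (here, the σ70-dependent promoters play that role) to judge whether a match is significant.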
Nonphysiological (Random) ORFs Are Rare
The present survey of all E. coli strictly orphan ORFs indicates that 19 of 25 (76%) (our study) belong to bona fide transcripts when tested in exponential growth and stationary phase. Merging our results with those of Tao et al. (1999) increases that estimate to 92%. This high rate of mRNA detection suggests that a large majority of ORFs of unknown function are of biological relevance. Of course, this statement will remain speculative until evidence of protein products is given for all of these orphan ORFs, a work now being initiated in a structural genomic context. This also might come as a surprise if we think that the "normal" habitat of E. coli is anaerobic, whereas all of the tests described earlier were performed in aerobic conditions. This would indicate that only a small fraction of genes are specific for anaerobic growth.
However, our results are confirmed by a statistical survey of the expression data of Tao et al. as available on their web site (http://bomi.ou.edu/faculty/tconway/global.html). According to their database, 1352 ORFs are classified as hypothetical (including the 25 considered orphan by using our very relaxed similarity criteria; see Methods). Of these, we computed that 80% exhibited detectable mRNA levels in at least one of the two conditions tested. This figure becomes 86% when computed on all 4290 E. coli ORFs. This already indicates that hypothetical ORFs behave not much differently from genes for which functional attributes have been recognized. Our experimental results on orphan ORFs now indicate that hypothetical ORFs with no recognized similarity are not less likely to be transcribed than those with orthologs in other microbial genomes.
The fact that almost all ORFs annotated in the E. coli genome sequence appear to be real is, first, a tribute to the high-quality sequencing and annotation work of Blattner's laboratory (Blattner et al. 1997) as well as to that of Collado-Vides (Thieffry et al. 1998). In the current state of annotation, very little room is left for potentially unrecognized ORFs, and our analysis of orphans can in fact be considered comprehensive. We can thus conclude from our work that random ORFs (of which about 200 are expected of sizes ≥300 nucleotides) are virtually absent, and must have been actively selected against throughout the evolution of the E. coli genome. A strong selection pressure would then exist against the maintenance of nonphysiological ORFs in the genome of proteobacteria (with the exception of intracellular parasites such as R. prowazekii [Andersson et al. 1998]). The situation appears to be different in a unicellular eukaryote such as yeast, where up to 76% of annotated ORFs might not be expressed (Mackiewicz et al. 1999). The intolerance for fake ORFs in prokaryote genomes might be related to the direct coupling between transcription and translation that is characteristic of these organisms. It might also be related to a mode of evolution where horizontal gene transfer (allowing the acquisition at once of already functional genes) is important. In this context, orphan ORFs would simply have been acquired from yet unsequenced organisms or would have diverged beyond recognition. Eukaryotes, in contrast, seem to evolve new functions by gene duplication, followed by rapid pseudogene evolution and reactivation. Such an evolutionary pathway clearly makes junk ORFs a necessity.
METHODS
Sequence Analysis: Selection of Orphan ORFs
A total of 1393 E. coli ORF sequences annotated as unknown (as of January 1997) were selected from the genome site maintained by Blattner's laboratory (ftp.genetic.wisc.edu). Our purpose was not to validate this annotation but to estimate the percentage of likely junk ORFs among them. To select out the ultimate orphan genes, these hypothetical ORF sequences were further submitted to a comprehensive similarity search survey according to a very low stringency protocol. In the first step, all available complete bacterial and archaebacterial genomes (downloaded locally) were scanned by using WU-BLAST 2.0 tblastx (Warren Gish, unpublished; Gish and States 1993). The default scoring matrix, filtering, and significance level (E=10) were used. The use of the similarity search program tblastx (putative translation of the query vs. putative translation of the target sequences in all reading frames) eliminated the risk of not recognizing a match due to ORF annotation errors in the query or target genomes. All ORFs with similarity matches were eliminated, including partial matches with interrupted ORFs in other bacteria. The remaining ORFs were further compared to the complete yeast genome by using the same protocol, and the matching ORFs were eliminated. Finally, the remaining ORFs were compared against the NR-protein database (www.ncbi.nlm.nih.gov) by using BLAST 2.0 (Altschul et al. 1997). This succession of database searches resulted in an ultimate set of 31 orphan ORFs. None of the 181 ORFs shorter than 300 nucleotides that were present in the original set of 1393 unknown ORFs made it into the ultimate orphan ORF category. While this work was in progress, 6 of the 31 orphan candidates were further eliminated because of their similarity to newly available genomic sequences from S. typhi, S. typhimurium, Klebsiella pneumoniae, and Clostridium perfringens. The list of the 25 orphan ORFs used in the experimental validation is given in Table 1, according to their original nomenclature (Blattner et al. 1997). These ORFs exhibit the same statistical bias (fifth-order Markov model) as do other protein-coding genes in E. coli and are indeed detected by the SelfID genome annotation program (Audic and Claverie 1998).
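The successive elimination described above amounts to a cascade of filters, which can be sketched as follows. The function name, the stage labels, and the ORF identifiers in the usage example are hypothetical; the sets of ORFs with hits would come from parsing the similarity search output, which is not shown here.

```python
def select_orphans(candidates, search_stages):
    """Remove, stage by stage, every ORF that has a similarity hit.

    candidates    -- iterable of ORF identifiers
    search_stages -- list of (stage_name, set_of_ids_with_hits)
    Returns (remaining_ids, log), where log lists (stage_name, n_removed).
    """
    remaining = list(candidates)
    log = []
    for name, hits in search_stages:
        kept = [orf for orf in remaining if orf not in hits]
        log.append((name, len(remaining) - len(kept)))
        remaining = kept
    return remaining, log


# Hypothetical usage with made-up identifiers and hit sets.
orphans, log = select_orphans(
    ["orfA", "orfB", "orfC", "orfD"],
    [
        ("prokaryote genomes (tblastx)", {"orfA"}),
        ("yeast genome (tblastx)", {"orfA", "orfC"}),
        ("NR protein database", set()),
    ],
)
```

Because any hit at any stage removes an ORF, relaxing the significance threshold can only shrink the final orphan set, which is why a permissive threshold makes the selection conservative.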
Bacterial Growth, Isolation of Total RNA and DNAse I Treatment
E. coli K-12 (MG1655, obtained from Blattner's group) was grown in sterile Luria-Bertani (LB10) medium in 250-ml Erlenmeyer flasks, on a shaker (at 81 rpm and 150 rpm for the exponential and stationary phases, respectively) at 37°C. Cells were harvested after 5 hr (exponential phase) or 27 hr (stationary phase). For the exponential phase culture only, 25 mM of sodium azide and 192 µg/ml of chloramphenicol were added (Mahbubani et al. 1991), followed by a 10-min incubation at 37°C and 81-rpm shaking. Cultures were stopped by dropping the temperature to 0°C. Cells were pelleted (20 min, 4000 rpm) once, then resuspended twice in YM90 (1X) medium. Pellets were finally resuspended in YM90 1X, aliquoted, and pelleted (10 min, 6500 rpm). After discarding the supernatant, the tubes were rapidly frozen at -80°C. For each experiment, total RNA from 5 × 10^8 bacteria (quantified on LB agar petri dishes) was isolated by using Qiagen RNeasy columns, strictly following the manufacturer's protocol. Nucleic acids were quantified by 260 nm/280 nm spectrophotometry. Contaminant bacterial DNA was eliminated by using the DNAse I kit from Gibco BRL. The total elution volume was digested by DNAse I at the concentration of 2UK/µg of total RNA. After incubation at 37°C for 30 min, the digestion was stopped by adding 2 mM EDTA followed by incubation at 65°C for 10 min. A final purification with a Qiagen RNeasy column was then performed.
Reverse Transcriptase and PCR Primer Design
PCR primer pairs were designed with the OLIGO 5.0 software (Medprobe) to amplify the transcript corresponding to each of the selected 25 orphan ORFs. The primers were chosen to be entirely contained within the putative protein-coding region. Primer pair sequences (from 19 to 23 nucleotides long) are given in Table 1, as well as their positions relative to the beginning of each ORF sequence. For instance, the sense primer for B0220 starts at position 45, is 21 nucleotides long, and is denoted 45U21. The reverse primer, denoted 405L21, is also 21 nucleotides long and starts at position 405. For each ORF, the antisense primer was used for the initial RT reaction, as well as for the following PCR cDNA amplification. In the case of ORF B2847, a different sense primer was required to remove the presence of a nonspecific band when using exponential phase total RNA. All primer pairs produced amplicons of the expected sizes when tested on E. coli K-12 (MG1655) genomic DNA.
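The primer naming convention just described (position, then U for the sense primer or L for the reverse primer, then length) can be decoded with a few lines of Python. The `Primer` type and `parse_primer` function are hypothetical helpers written for this illustration; they only restate the convention given in the text and do not attempt to compute amplicon sizes.

```python
import re
from typing import NamedTuple


class Primer(NamedTuple):
    start: int        # position relative to the beginning of the ORF
    orientation: str  # "sense" (U) or "antisense" (L)
    length: int       # primer length in nucleotides


_PRIMER_NAME = re.compile(r"^(\d+)([UL])(\d+)$")


def parse_primer(name):
    """Decode a primer name such as '45U21' or '405L21'."""
    match = _PRIMER_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized primer name: {name!r}")
    start, letter, length = match.groups()
    orientation = "sense" if letter == "U" else "antisense"
    return Primer(int(start), orientation, int(length))
```

For the B0220 example in the text, `parse_primer("45U21")` gives a sense primer starting at position 45 and 21 nucleotides long.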
RT-PCR Assay and Control PCR
RT-PCR assays were performed by using the one-step protocol (Aatsinki et al. 1994) as implemented in the Access kit (Promega), following the manufacturer's instructions and optimizing the number of cycles to 35. Higher numbers of cycles (40, 45, and 60) led to nonreproducible results, most likely due to residual genomic DNA contamination. We used an MJ Research PTC 200 thermocycler. Each ORF was tested at least twice on the same total RNA batch. Contaminant DNA was removed by treatment with DNAse I as described earlier. RNA samples (0.1 µg of total RNA each) were simultaneously amplified with 35-cycle PCR, in the presence versus absence of RT. The latter protocol tested for the possible amplification of contaminant genomic DNA (see Fig. 2a,b). All of the results summarized in Table 1 correspond to experiments where amplicons were not observed in the absence of RT. In addition, a series of PCR control experiments that used independent primers was performed as shown in Figure 2c.
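The logic of the plus-RT/minus-RT control can be summarized in a short sketch. The decision rule below is an assumption made for illustration (the text states only that each ORF was tested at least twice and that reported results come from experiments without amplicons in the absence of RT); the function name and the result labels are hypothetical.

```python
def score_rt_pcr(replicates):
    """Classify one ORF in one growth condition.

    replicates -- list of (band_with_rt, band_without_rt) booleans,
                  one pair per replicate experiment.
    """
    if len(replicates) < 2:
        raise ValueError("at least two replicates are required")
    if any(minus_rt for _, minus_rt in replicates):
        # A band without RT points to residual genomic DNA.
        return "uninterpretable"
    bands = [plus_rt for plus_rt, _ in replicates]
    if all(bands):
        return "detected"
    if not any(bands):
        return "not detected"
    return "not reproducible"
```

Under this rule, an ORF is scored as transcribed in a condition only when every replicate shows an amplicon with RT and none shows one without RT.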
Amplicon Detection
Amplicons were detected by electrophoresis in 2% agarose gels (2 hr, 100 V, in TAE [1X] buffer), followed by ethidium bromide (0.5 µg/ml) staining for 15 min at room temperature. Gels were then washed for 5 min in TAE buffer. Results were then visualized and recorded by using the Seikosha VP1500 Imager (Appligene). All amplicon sequences were verified by direct sequencing (Qiagen) by using the cognate RT-PCR primers.
ACKNOWLEDGMENTS
We thank Dr. V. Roux for valuable technical advice and Prof. D. Raoult for kindly giving us access to his "contamination-free" PCR laboratory. We thank Dr. P. Moreau for helpful discussions at the beginning of the project and Dr. C. Bartoli for her help with gel reading. Thanks are also due to Dr. C. Abergel, Prof. A. Lazdunski, Prof. D. Gautheret, and Dr. R.J. Roberts for helping to improve the manuscript. This work was supported by the CNRS genome program.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
1 Corresponding author.
E-MAIL secr{at}igs.cnrs-mrs.fr ; FAX 33 4 91 16 45 49.
Received January 5, 2000; accepted in revised form May 4, 2000.