Vol. 10, Issue 7, 991-1000, July 2000
LETTER
ABSTRACT
Accumulation of complete genome sequences of diverse organisms creates new possibilities for evolutionary inferences from whole-genome comparisons. In the present study, we analyze the distributions of substitution rates among proteins encoded in 19 complete genomes (the interprotein rate distribution). To estimate these rates, it is necessary to employ another fundamental distribution, that of the substitution rates among sites in proteins (the intraprotein distribution). Using two independent approaches, we show that intraprotein substitution rate variability appears to be significantly greater than generally accepted. This yields more realistic estimates of evolutionary distances from amino-acid sequences, which is critical for evolutionary-tree construction. We demonstrate that the interprotein rate distributions inferred from the genome-to-genome comparisons are similar to each other and can be approximated by a single distribution with a long exponential shoulder. This suggests that a generalized version of the molecular clock hypothesis may be valid on genome scale. We also use the scaling parameter of the obtained interprotein rate distribution to construct a rooted whole-genome phylogeny. The topology of the resulting tree is largely compatible with those of global rRNA-based trees and trees produced by other approaches to genome-wide comparison.
INTRODUCTION
Multiple, complete genome sequences from
taxonomically diverse species create unprecedented opportunities for
new phylogenetic approaches (Huynen and Bork 1998). Comparative genome
analysis shows a striking complexity of evolutionary scenarios that
involve, in addition to vertical descent, a number of horizontal gene
transfer and lineage-specific gene loss events (Koonin et al. 1997;
Doolittle 1999). With these "illicit" events being so prominent in
the history of life (at least as far as prokaryotes are concerned), the
question arises as to whether whole-genome comparisons are still
capable of detecting a sufficiently strong signal to produce a
coherent, large-scale phylogeny. One way to approach this problem is
based on the presence or absence of representatives of different
genomes in orthologous protein families (Fitz-Gibbon and House 1999;
Snel et al. 1999; Tekaia et al. 1999). Another strategy involves the analysis of multiple protein families, with a subsequent attempt to
derive a consensus that could reflect the "organismal" phylogeny (Teichmann and Mitchison 1999).
In the present study, we apply an alternative approach that, to our
knowledge, has not been systematically explored before. The methodology
is based on the analysis of the distributions of evolutionary rates
among orthologous proteins (Fitch 1970), or the interprotein rate
distribution. We hypothesized that the distribution of relative
evolutionary rates does not change significantly in the course of
evolution because all organisms possess similar repertoires of core
protein functions that are primarily represented among orthologs
(Tatusov et al. 1997). Below we describe a statistical test for this
hypothesis. Under this assumption, the evolutionary distances, defined
as the average number of substitutions per site between likely
orthologs, linearly depend on substitution rates. Thus, the
distribution of the rates can be determined from the distribution of
the distances using a scaling factor proportional to the divergence
time. To estimate these rates, it was necessary to use another
fundamental distribution, that of the substitution rates among sites in
individual proteins, or the intraprotein rate distribution. We show
that intraprotein substitution rate variability appears to be
significantly greater than generally accepted. We further demonstrate
that the interprotein rate distributions inferred from the
genome-to-genome comparisons are similar to each other and can be
approximated by a single distribution with a long exponential shoulder.
The scaling parameter of this distribution was used to construct a
rooted whole-genome phylogenetic tree. The resulting topology is
largely compatible with that of global rRNA-based trees and with those
of the trees produced by other methods of genome-wide analysis.
RESULTS AND DISCUSSION
Estimating Evolutionary Distances Using the Intraprotein Rate Distribution
The evolutionary distances between likely orthologs, which were
identified as statistically significant, symmetrical best hits, were
obtained from the results of all-against-all comparisons of protein
sequences from all pairs of genomes, as described in the Methods
section. All genome-to-genome comparisons used at least 100 and,
typically, >200 pairs of likely orthologs (Table 1). The distribution functions of the fraction of
unchanged sites show that, even for closely related species, such as
the two Mycoplasmas or Escherichia coli and
Haemophilus influenzae, there exists a considerable fraction
of poorly conserved potential orthologs (Fig. 1a), in
agreement with previous observations (Tatusov et al. 1996). We
examined, case by case, the pairs with 30-40% identical residues,
produced by the comparison of the two species of mycoplasmas, for
structural and biological relevance and did not identify any apparent
false-positive results. Such poorly conserved but apparently orthologous pairs of proteins include, for example, lipoproteins, adhesins, and other surface proteins. Distances were estimated from the
identity fractions (Fig. 1a) by using the intraprotein rate
distribution. This distribution traditionally had been estimated from
multiple sequence alignments (Uzzell and Corbin 1971; Dayhoff et al.
1978; Holmquist et al. 1983; Gogarten et al. 1996; Zhang and Gu 1998).
The existing methods rely on elaborate models of sequence change. They
require knowledge of the tree topology for sequences in the multiple
alignment and good estimates for the number of substitutions in each
site. Usually, the tree topology is not easily recoverable, and some
multiple substitutions at highly variable sites are always missed,
leading to underestimates of the rate variability. The latter effect
becomes particularly noticeable at larger evolutionary distances when
highly variable sites approach saturation with substitutions.
Furthermore, previously used methods assume that site rates are
independent of the type of amino-acid replacement,
which may result in a significant underestimate of the rate variability
(Feng and Doolittle 1997). It is well known that amino acids are not
equally interchangeable (Dayhoff et al. 1978), and it is erroneous to neglect this fact (Grishin 1995; Feng and Doolittle 1997). In addition,
application of maximum likelihood for simultaneous estimation of rate
variability and branch lengths of the tree employed by previous workers
(Zhang and Gu 1998) results in a strong correlation between these
parameters. The likely reason for such correlation is a highly
curvilinear relationship between the average number of substitutions
per site and identity fractions.
To avoid these limitations, we developed a method that involves only a
few simple assumptions and does not require highly accurate estimates
of the number of substitutions. We express the intraprotein rate
variation in terms of relative substitution rates, which are normalized
to keep the mean instantaneous rate over all sites in a sequence equal
to 1:
$$x_i(t) = \frac{r_i(t)}{\frac{1}{n}\sum_{j=1}^{n} r_j(t)},$$
where $r_i(t)$ is the absolute rate for site
i at time t, and n is the total number of
sites in the given sequence. The assumptions are (1) sites evolve
independently, (2) $x_i(t)$ does not change if
no substitutions occur in site i, (3) the probability that a
site with a relative rate x remains unchanged after d
amino-acid substitutions per site occurred in the protein sequence is
$e^{-xd}$, and (4) the distribution of relative
substitution rates among sites does not change with time (Zuckerkandl
and Pauling 1965). Under these assumptions, the fraction of unchanged sites u is
$$u(d) = \int_0^\infty \rho(x)\, e^{-xd}\, dx, \qquad (1)$$
where $\rho(x)$ is the probability density
function of substitution rates. Thus, $\rho(x)$ can be estimated
using equation (1). We obtained the lower bound of d for a
range of u values from multiple alignments of 15 protein
families from the Pfam database (Bateman et al. 1999).
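The lower bound of d described in the Methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the input format, and the ungapped toy alignment are hypothetical and are not taken from the original analysis.

```python
def invariant_fraction_and_lower_bound(sequences):
    """Return (u, d_lower) for an ungapped alignment of equal-length sequences.

    u       : fraction of columns occupied by one amino acid in all sequences
    d_lower : mean over columns of (number of distinct amino acids - 1),
              which cannot exceed the true number of substitutions per site
    """
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    columns = list(zip(*sequences))          # one tuple of residues per site
    n_sites = len(columns)
    invariant = sum(1 for col in columns if len(set(col)) == 1)
    min_substitutions = sum(len(set(col)) - 1 for col in columns)
    return invariant / n_sites, min_substitutions / n_sites


# Hypothetical four-sequence subalignment, used only to exercise the function.
toy = ["ACDE", "ACDF", "ACEF", "GCEF"]
u, d_lower = invariant_fraction_and_lower_bound(toy)
print(u, d_lower)
```

Repeating this over many random subalignments would give the cloud of (u, lower bound of d) points from which the curve is built; how the points are binned is not specified here.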
From our data, it is also possible to obtain the best-fit value of the
parameter for any single-parameter density function (Fig. 2b).
Traditionally, the rate variation among sites has been described by
gamma density (Nei et al. 1976; Ota and Nei 1994; Yang 1994; Li and Gu
1996; Grishin 1999):
$$\rho(x) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\alpha x}, \qquad (2)$$
where $\Gamma$ is the gamma function and $\alpha$ is the distribution parameter.
We estimated $\alpha$ using our method. Consistent with the variation coefficient analysis, the obtained value
$\alpha$ = 0.31 is significantly lower than those generally used (on average, 0.7-0.8; Gogarten et
al. 1996). [...] in the range of 0.27 ± 0.05 (Table
2). This suggests that site-rate variation for
different proteins is described by similar if not by identical
distributions.
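To make the difference between these parameter values concrete, it may help to translate them into coefficients of variation. The short calculation below assumes the unit-mean gamma density with shape parameter $\alpha$ (the standard form taken here for equation 2), for which the variance is $1/\alpha$. It is plain arithmetic on the values quoted above, not an additional result.

$$\sigma^2 = \frac{1}{\alpha}, \qquad \frac{\sigma}{\bar{x}} = \frac{1}{\sqrt{\alpha}} \quad (\bar{x} = 1)$$

$$\alpha = 0.31:\ \frac{1}{\sqrt{0.31}} \approx 1.80; \qquad \alpha = 0.7:\ \frac{1}{\sqrt{0.7}} \approx 1.20; \qquad \alpha = 0.8:\ \frac{1}{\sqrt{0.8}} \approx 1.12$$

Under this assumption, the lower value of $\alpha$ corresponds to a standard deviation of site rates roughly 1.5 times larger than the values implied by $\alpha$ = 0.7-0.8.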
The gamma density with $\alpha$ < 1 is a decreasing, L-shaped function. In
contrast, other distributions, for example, the bell-shaped log-normal
distribution that also has been proposed to describe the site-rate
variation (Olsen 1987), have a non-zero mode. To discriminate between
an L-shaped and a bell-shaped distribution, we fitted our data to a
two-parameter density function that is a combination of a linear
segment near zero with an exponential tail (Fig. 2c). The negative
slope of the best-fit linear segment obtained from the data strongly
suggests an L-shaped density of substitution rates, with the single
mode at zero. Thus the majority of the sites in a protein exhibit very
low relative rates, whereas a small fraction of variable, selectively
(almost) neutral sites absorb most of the substitutions through
multiple mutations at the same site.
Based on the assumption of the evolutionary constancy of the
interprotein rate distribution (see introduction), we proposed an
independent way to estimate the parameter of the gamma density for the
intraprotein rate distribution. The value of $\alpha$ was determined that
minimized the differences between distributions of normalized evolutionary distances for all pairs of complete protein sets. The
resulting value of $\alpha$ obtained by this approach (0.31 ± 0.03) is
remarkably close to that derived from the multiple alignment comparison
(0.31). Given the good agreement between these independent estimates,
we believe that the gamma density with the parameter $\alpha$ = 0.3 provides
an adequate description of the intraprotein rate variation.
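For a gamma density with unit mean, the integral in equation (1) has a well-known closed form, $u(d) = (\alpha/(\alpha + d))^{\alpha}$, which inverts to $d = \alpha(u^{-1/\alpha} - 1)$. The Python sketch below illustrates this conversion from the fraction of unchanged sites to the distance. The function names are hypothetical, the default of 0.3 is the parameter value adopted above, and the code is an illustration rather than the program used in the study.

```python
def unchanged_fraction_gamma(d, alpha=0.3):
    """Fraction of unchanged sites after d substitutions per site,
    assuming unit-mean gamma-distributed site rates with shape alpha."""
    if d < 0:
        raise ValueError("distance must be non-negative")
    return (alpha / (alpha + d)) ** alpha


def distance_from_unchanged_fraction(u, alpha=0.3):
    """Invert the closed form: d = alpha * (u**(-1/alpha) - 1)."""
    if not 0.0 < u <= 1.0:
        raise ValueError("u must lie in (0, 1]")
    return alpha * (u ** (-1.0 / alpha) - 1.0)


# The same identity fraction gives a much larger distance when alpha is small.
for alpha in (0.3, 0.7):
    print(alpha, distance_from_unchanged_fraction(0.5, alpha))
```

The loop shows why a lower gamma parameter yields larger distance estimates for the same observed fraction of unchanged sites.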
Interprotein Distributions of Evolutionary Rates
The gamma distribution with the parameter $\alpha$ = 0.3 was used to
calculate the distances for each pair of protein sets using equation
(1) and to generate the respective distributions (Fig. 1b). Notably,
even for genomes with high average identity between proteins, for
example, the spirochaetes Borrelia burgdorferi and Treponema pallidum, the distances calculated as the average
number of substitutions per site were large and significantly greater than those reported previously (Huynen and Bork 1998) (compare Fig. 1a
and b). As expected, different genome pairs differed greatly in the
median distance, which could be as low as one substitution per site
between the two Mycoplasmas and as high as 50 substitutions per site between divergent organisms, such as Bacteria and
Archaea (Fig. 1b). After normalization, these distributions
were compared with each other to test our central hypothesis that the
interprotein distribution is constant in evolution (Fig. 1c). It was
found that 64% of genome pairs passed the chi-square test at the 0.01 level when compared with the rest of the data combined. Only a few
genome pairs, in particular Escherichia coli and
Haemophilus influenzae ($\chi^2 \approx 3700$) and
Methanococcus jannaschii and Methanobacterium
thermoautotrophicum ($\chi^2 \approx 170$), showed strong
deviations from the distribution of the combined data (Fig. 1a-c). We
suspect that these anomalies may be due to extensive loss of conserved
genes, in particular those coding for metabolic enzymes, in H. influenzae (Tatusov et al. 1996) and massive horizontal gene
transfer between the two archaeal methanogens, respectively. Figure 1c
shows that most of the normalized distribution functions are much
closer to each other than the non-normalized functions shown in Figure
1b. The normalization brings all the distributions to the same
variance, equalized to 1, but, in general, should not affect other
parameters. However, the medians of the distributions show strong
correlation with the variance and typically fall within the range
between 0.5 and 1.0. In contrast, the medians of the normalized
distributions (Fig. 1c) show little correlation with the medians of the
non-normalized ones (Fig. 1b). For example, the median of the
mge_mpn distribution is (as expected) much smaller than that of
the mja_mth distribution shown in Figure 1b, but the reverse is
true for these pairs after normalization (Fig. 1c). The shape of the
hin_eco distribution after normalization significantly differs
from that of the rest of the distributions (Fig. 1c; also see below).
A systematic bias in the obtained distributions of evolutionary distances might arise from the underestimate of the number of highly divergent orthologs (see Methods). Such fast-evolving pairs of orthologs could be missed because of the requirement of statistical significance of the observed sequence similarity (see Methods). The distributions of the fraction of unchanged sites level off at about 0.15 (Fig. 1a), which should be expected because alignments with <15% identity will typically fail the cutoff, even if the proteins involved are orthologs. This could lower the variance of the distributions, particularly for evolutionarily distant genomes. The fraction of weakly similar orthologs increases with the evolutionary distance between genomes. Therefore, if a large number of divergent orthologs is missed, the variance of the distribution of normalized distances will show an inverse correlation with the mean distance between proteins in genome pairs. Empirically, however, we found only a weak dependence between the variance and the mean (Fig. 1d).
Neither our assumption of time independence of the intraprotein rate
distribution nor the time invariance of the interprotein rate distribution, which was tested as described above, is equivalent to or dependent on the molecular-clock assumption. Molecular
clock means that the substitution rate of a site or the average
substitution rate of a protein does not change with time. We do not
make such an assumption. Indeed, we allow rates of sites and proteins
to depend on time as long as the distribution of relative rates remains time independent. There seem to be two principal scenarios whereby this
distribution remains constant. Under the first scenario, the absolute
substitution rates of sites may change with time, but in a correlated
manner, so that the ratio of any two rates remains constant. In this
case, the distribution of relative rates (each rate divided by the mean
rate) will not change. However, this "relative molecular clock" is
not necessary for the time invariance of the distribution, and we favor
a different model. Under this second scenario, the rates of sites (or
proteins) may change freely, and the ratio of any two rates does
not need to be fixed. However, we expect that if some sites (proteins) increase their relative rates, others take their place and decrease their relative rates so that there is no significant change in the
overall rate distribution. This scenario can be viewed as a statistical
interpretation of the covarion model of evolution (Miyamoto and Fitch 1995).
The two most deviant genome pairs were excluded from the combined data set, and the resulting distribution of normalized distances was used as an estimate of interprotein rate variation (Fig. 1e). The distribution of interprotein rates can be simulated by randomly sampling sites from the intraprotein rate distribution, given the distribution of protein lengths in each genome (Fig. 1e). The standard variance of the resulting distribution was about 10 times lower than that of the empirical distribution (Fig. 1e). This major difference between the simulated and observed interprotein rate distributions indicates that, for the purpose of describing the evolutionary process, a protein cannot be approximated by a random collection of sites from the intraprotein rate distribution. In other words, the interprotein rate distribution is not determined by the intraprotein rate variability but rather is dictated by the diversity of the functional constraints for different proteins. It is well known that some proteins (e.g., histones) evolve extremely slowly, whereas others (e.g., fibrinopeptides) are highly variable.
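The simulation described above, in which a protein is treated as a random collection of sites, can be sketched as follows. This is a minimal illustration, not the original procedure: it assumes unit-mean gamma site rates with shape 0.3, and it uses a hypothetical fixed protein length of 200 sites instead of the observed length distribution of each genome. Function names and the seed are likewise illustrative.

```python
import random
import statistics


def simulate_protein_rates(lengths, alpha=0.3, seed=0):
    """Mean site rate of each simulated protein.

    Site rates are drawn from a gamma density with shape alpha and
    scale 1/alpha, so that the expected site rate is 1.
    """
    rng = random.Random(seed)
    rates = []
    for n_sites in lengths:
        site_rates = [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]
        rates.append(sum(site_rates) / n_sites)
    return rates


def expected_variance(n_sites, alpha=0.3):
    """Variance of the mean of n_sites independent unit-mean gamma rates."""
    return 1.0 / (alpha * n_sites)


lengths = [200] * 500  # hypothetical: 500 proteins of 200 sites each
rates = simulate_protein_rates(lengths)
print(statistics.mean(rates), statistics.pvariance(rates), expected_variance(200))
```

Because the variance of an average of independent site rates shrinks in proportion to protein length, a random-collection model of this kind produces a narrow distribution of protein rates.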
We attempted to fit the empirical interprotein rate distribution with
different standard density functions, but none of the fits was
sufficiently close (not shown). We noticed, however, that the mean of
the observed distribution is close to its variance, which was set to 1. This is the case for the exponential distribution, and the following
density gives the best fit to the data
(3)
Constructing a Global Phylogenetic Tree on the Basis of the Interprotein Rate Distribution
Distribution (3) can be used to find the maximum-likelihood scaling
parameter for each genome pair; this parameter defines an additive
evolutionary distance between the respective species. The matrix
composed of these distances was used to construct a phylogenetic tree
with the Fitch-Margoliash method (Fig. 3) and the
neighbor-joining method (not shown). Both methods produced essentially
the same tree topology. The tree clearly shows the separation of the
three domains of life (bacteria, archaea, and eukaryotes), whereas the
major bacterial lineages show a star phylogeny, which is generally
compatible with the currently accepted gross evolutionary scenarios
(Woese et al. 1990; Snel et al. 1999). Under the genome-scale
molecular-clock assumption, the root is between archaea-eukaryotes and
bacteria (Fig. 3), which agrees with the results of rooting by paralogy
for several protein families (Gogarten et al. 1989; Iwabe et al. 1989;
Brown and Doolittle 1997).
Some of the reported rRNA-based trees and trees based on gene content
in complete genomes show a better resolution of the bacterial branches
(Olsen et al. 1994; Fitz-Gibbon and House 1999; Snel et al. 1999).
Heavy corrections for multiple substitutions that are inherent in our
analysis inevitably lead to larger distances and thus to larger errors
in the distance estimates, which may preclude the resolution of deep
branching among bacteria. It has been shown under the molecular-clock
hypothesis that tree reconstruction is consistent with simple
proportions of different residues between sequences without correction
(p distances; Rzhetsky and Sitnikova 1996). Thus, under the
molecular-clock assumption, correct tree topology may be produced with
underestimated distances when the adequate correction for multiple
substitutions has not been made. Underestimated distances will display
smaller standard errors and will allow for a better-resolved tree. The
present analysis suggests that the molecular clock could be valid on
the genome scale, and accordingly, undercorrection might not have led
to incorrect tree topology in previous phylogenetic analyses. It appears likely that some of the unresolved bacterial branches in the
tree shown in Figure 3 are indeed explained by a large error in
distance estimates. For example, the grouping of R. prowazekii with Proteobacteria is convincingly supported by both rRNA-based phylogenetic analysis (Olsen et al. 1994) and detailed analysis of the
genome (Andersson et al. 1998) but is missed in the tree produced by
our approach (Fig. 3). Some other relationships reported in the
previous analyses could be artifactual, however, such as, for example,
the grouping of Synechocystis with Aquifex suggested by one of the genomewide analyses (Snel et al. 1999), but not others or
the rRNA-based phylogeny (Olsen et al. 1994), or the grouping between
gram-positive bacteria and Proteobacteria seen in another genomewide
study (Fitz-Gibbon and House 1999). In these cases, the lack of
resolution due to the extensive correction for multiple substitutions
could provide the most realistic, if not the most informative, picture
of the star phylogeny of the bacteria. Such a conclusion appears
compatible with the results of phylogenetic analyses of multiple
protein families (Teichmann and Mitchison 1999). To a large extent, the
difficulties in resolving bacterial phylogeny could be due to massive
horizontal transfer across bacteria.
Important as the effects of horizontal transfer seem to be, they are not sufficient to entirely wash out the phylogenetic signal in genomewide comparisons; at least the three primary domains of life are recovered reliably. Clearly, at this time, the final word on the best method(s) for genomewide comparison and its utility in phylogenetic reconstruction is not yet out. Experimentation with genome data on a much larger scale should help in finding the optimal solution and assessing the attainable level of resolution.
The results of this study indicate that using complementary information produced by whole-genome comparisons and by analysis of large protein families may significantly enhance our understanding of molecular evolution. Eventually, application of statistical methods to the rapidly growing amounts of such complementary data may result in the derivation of a compact set of "laws of evolutionary genomics."
METHODS
Comparison of Protein Sequence Sets from Complete Genomes
The protein sets from complete genomes were retrieved from the
genome division of the Entrez system
(http://www.ncbi.nlm.nih.gov/Entrez/Genome/org.html) except for the
nematode Caenorhabditis elegans proteins, which were retrieved
from the Sanger Center FTP site
(http://www.sanger.ac.uk/Projects/C_elegans/). The
abbreviations for species names are as follows. Bacteria: Aquifex
aeolicus (aae), Bacillus subtilis (bsu), B. burgdorferi (bbu), Chlamydia trachomatis (ctr),
Escherichia coli (eco), H. influenzae (hin), Helicobacter pylori (hpy), Mycobacterium tuberculosis
(mtu), Mycoplasma genitalium (mge), Mycoplasma
pneumoniae (mpn), Rickettsia prowazekii (rpr),
Synechocystis sp. PCC6803 (ssp), T. pallidum (tpa);
Archaea: Archaeoglobus fulgidus (afu), M. thermoautotrophicum (mth), M. jannaschii (mja), Pyrococcus horikoshii (pho); Eukaryota: Caenorhabditis
elegans (cel), Saccharomyces cerevisiae (sce). The BLASTP
program (Altschul et al. 1997) was used to perform an all-against-all
similarity search for the set of proteins from 19 completely sequenced
genomes. Only symmetrical best hits between genomes (Tatusov et al.
1997), supported by an e value <0.01 and fraction of low
complexity regions (Wootton 1994) <50% in the aligned segment, were
included in the analysis. These criteria maximize the likelihood that
the sequence similarity reflects homology and that most of the protein
pairs identified using this approach are true orthologs. The fraction of unchanged sites u was estimated from the fraction of identical residues
q as $u = (q - q_\infty)/(1 - q_\infty)$, where
$q_\infty$ is the expected identity for random sequences
(Feng et al. 1997; Grishin 1997).
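The correction of observed identity for chance matches is simple enough to state as code. In the sketch below, the function name is hypothetical and the value 0.05 used for the expected identity of random sequences is an illustrative assumption only; the value actually used is not given in this section.

```python
def unchanged_fraction_from_identity(q, q_random):
    """u = (q - q_random) / (1 - q_random).

    q        : observed fraction of identical residues in the alignment
    q_random : expected identity for unrelated (random) sequences
    """
    if not 0.0 <= q_random < 1.0:
        raise ValueError("q_random must lie in [0, 1)")
    if not q_random <= q <= 1.0:
        raise ValueError("q must lie in [q_random, 1]")
    return (q - q_random) / (1.0 - q_random)


# Illustrative value only: 5% expected identity between random sequences.
print(unchanged_fraction_from_identity(0.30, 0.05))
```

The correction rescales identity so that identical sequences give u = 1 and sequences no more similar than random ones give u = 0.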
Estimation of Intraprotein Rate Variation from Multiple Alignments
We used 14 large protein families from the Pfam database (Pfam IDs
PF00016, PF00077, PF00078, PF00032, PF00509, PF00361, PF00129, PF00559,
PF00522, PF00393, PF01010, PF00042, PF00091, PF00075; see also Table
2). The families were selected so that each contained more than 50 nonidentical sequences with >90% identity to a master sequence. The
sequence that had the maximal number of nonidentical family members
with >90% identity was chosen as the master. These criteria were
adopted to minimize the effect of multiple and back substitutions. For
each subalignment (a random subset of more than four sequences from the
alignment), the fraction of sites occupied by the same amino acid in
all sequences (invariant sites) was used as an estimate of the fraction
of unchanged sites u (given that we analyzed only highly
similar sequences, we assumed that such invariant sites have not been
affected by back substitutions). For each site in the subalignment, the
number of different amino acids in this site minus one cannot exceed
(and for u → 1 approaches) the number of substitutions at this
site. Averaged over the sites in the subalignment, this gives the lower
bound of d(u), which is the number of substitutions
per site in the tree that relates all the sequences in the
subalignment. Because the lower bound approaches the true function
d(u) for u → 1 (d → 0), the
standard deviation of $\rho(x)$ is calculated from the second
derivative of u(d) at d = 0, which is
estimated numerically from the lower bound curve by extrapolation to
d = 0. Conveniently, u(d), given by equation
(1), is a moment-generating function for $\rho(x)$ and the moments of the distribution can be found by recursive differentiation of u(d). Because the mean x is equal to 1, the standard deviation of x is
$$\sigma = \sqrt{u''(0) - 1}.$$
[...] the $\alpha$ parameter of the gamma density
function $\rho(x; \alpha)$ that generates the curve
u(d) (equation 1) passing through this point was
calculated. The extrapolation of the obtained $\alpha$ values for each bin
on the u axis to u = 1 for the lower bound of d
gives an upper estimate of $\alpha$ under the conditions described
previously (Grishin 1999).
Estimation of Intra- and Interprotein Rate Variation from Genome Comparisons
For each pair of complete genomes, the u values obtained
as described above were converted to the distances d by using
equation (1) and the gamma distribution of intraprotein rates. The
distances d were normalized by dividing each distance by the
standard deviation of distances for a genome pair. The value of the
parameter $\alpha$ (equation 2) was found for which the differences between
171 normalized distributions of distances for all genome pairs (19 ×
18/2) measured by the chi-square test were minimal. Specifically,
normalized distances were binned into 20 intervals. We minimized
[...] $\alpha$ = 0.3 was used to generate the final distribution of
normalized distances for each pair of genomes. These distributions were
combined to obtain an estimate of interprotein rate density
$\varphi(x)$ (equation 3). The normalized distance distribution for
each genome pair was compared with the combined data using the
chi-square test.
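The normalization, binning, and comparison steps described in this section can be sketched as below. Several details are assumptions made for illustration: the population standard deviation, 20 equal-width bins on a hypothetical range of 0 to 5 with overflow collected in the last bin, the standard two-sample chi-square statistic for binned counts with unequal totals, and the toy distance values. The exact expression minimized in the study is not reproduced here.

```python
import math
import statistics


def normalize(distances):
    """Divide each distance by the standard deviation for the genome pair."""
    sd = statistics.pstdev(distances)  # population SD: an assumption
    return [d / sd for d in distances]


def bin_counts(values, n_bins=20, upper=5.0):
    """Equal-width bins on [0, upper); larger values go into the last bin."""
    width = upper / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v / width), n_bins - 1)] += 1
    return counts


def chi_square(counts_a, counts_b):
    """Two-sample chi-square statistic for binned data with unequal totals."""
    n_a, n_b = sum(counts_a), sum(counts_b)
    k_a, k_b = math.sqrt(n_b / n_a), math.sqrt(n_a / n_b)
    total = 0.0
    for a, b in zip(counts_a, counts_b):
        if a + b > 0:  # bins empty in both samples carry no information
            total += (k_a * a - k_b * b) ** 2 / (a + b)
    return total


# Toy distances for one genome pair and for the rest of the data combined.
pair = normalize([0.4, 0.9, 1.3, 2.2, 3.5, 6.0])
rest = normalize([0.5, 1.0, 1.1, 2.0, 2.4, 3.9, 5.5, 7.0])
print(chi_square(bin_counts(pair), bin_counts(rest)))
```

In a parameter search of the kind described above, a statistic like this would be recomputed for each trial value of the gamma parameter and summed over genome pairs.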
Phylogenetic Tree Analysis
The distance between genomes A and B was estimated as the scaling
parameter $D_{AB}$ in the formula $f(d) = \varphi(d/D_{AB})/D_{AB}$, where
d is the distance between orthologs from A and B,
f(d) is the probability density function of
d, and $\varphi$ is given by equation (3). Bootstrap analysis was
performed by resampling pairs of orthologs for each pair of genomes.
The tree was constructed by using the method of Fitch and Margoliash
(1967) implemented in the FITCH program of the PHYLIP package
(Felsenstein 1996); the tree constructed by using the neighbor-joining
method (Saitou and Nei 1987) implemented in the NEIGHBOR program of the
PHYLIP package had the same topology. The root position was inferred by
using a least-squares version of the midpoint rooting procedure (Wolf
et al. 1999).
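The maximum-likelihood estimate of the scaling parameter can be illustrated with a short numerical sketch. Because the density of equation (3) is not available here, the code uses a pure exponential density as a stand-in; this is an assumption made only to keep the example runnable, and for this stand-in the estimate reduces to the mean distance. Function names, search bounds, and the toy distances are hypothetical.

```python
import math


def log_exponential_density(x):
    """Stand-in for the log of the interprotein density: log(exp(-x)) = -x."""
    return -x


def log_likelihood(scale, distances, log_density=log_exponential_density):
    """Sum over ortholog pairs of log[ density(d / scale) / scale ]."""
    return sum(log_density(d / scale) - math.log(scale) for d in distances)


def ml_scale(distances, log_density=log_exponential_density,
             lo=1e-3, hi=1e3, tol=1e-9):
    """Golden-section search for the scale that maximizes the likelihood.

    The search runs on log(scale) and assumes a single maximum in [lo, hi].
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    while b - a > tol:
        c = b - inv_phi * (b - a)
        e = a + inv_phi * (b - a)
        if (log_likelihood(math.exp(c), distances, log_density)
                > log_likelihood(math.exp(e), distances, log_density)):
            b = e  # maximum lies in [a, e]
        else:
            a = c  # maximum lies in [c, b]
    return math.exp((a + b) / 2.0)


# Toy ortholog distances for one genome pair.
toy_distances = [0.5, 1.5, 2.0, 4.0]
print(ml_scale(toy_distances))
```

A different density can be supplied through the `log_density` argument, provided the resulting likelihood has a single maximum within the search bounds.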
ACKNOWLEDGMENTS
We thank Mikhail Gelfand, Alex Kondrashov, David Lipman, Andrei Mironov, John Spouge and John Wilbur for critical reading of the manuscript and helpful comments.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
1 Permanent address: Institute of Cytology and Genetics, Russian Academy of Sciences, Novosibirsk 630090, Russia.
2 Present address: University of Texas Southwestern Medical Center, Dallas, Texas 75235 USA.
3 Corresponding author.
E-MAIL grishin{at}chop.swmed.edu; FAX (214) 648-8856.
REFERENCES
Analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events. Genome Res. 9: 689-710.
Received October 14, 1999; accepted in revised form May 2, 2000.