Published online before print
October 19, 2006, 10.1101/gr.5346206 Genome Res. 16:1529-1536, 2006 ©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00
Letter
Origins and impact of constraints in evolution of gene families
1 Bioinformatics Program, Boston University, Boston, Massachusetts 02215, USA; 2 National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland 20894, USA
Recent investigations of high-throughput genomic and phenomic data have uncovered a variety of significant but relatively weak correlations between a gene's functional and evolutionary characteristics. In particular, essential genes and genes with paralogs have a slight propensity to evolve more slowly than nonessential genes and singletons, respectively. However, given the weakness and multiplicity of these associations, their biological relevance remains uncertain. Here, we show that the existence of an essential paralog can be used as a specific and strong gauge of selection. We partition gene families in several genomes into two classes: those that include at least one essential gene (E-families) and those without essential genes (N-families). We find that weaker purifying selection causes N-families to evolve in a more dynamic regime with higher rates both of duplicate fixation and pseudogenization. Because genes in E-families are subject to significantly stronger purifying selection than those in N-families, they survive longer and exhibit greater sequence divergence. Longer average survival time also allows for divergence of upstream regulatory regions, resulting in change of transcriptional context among paralogs in E-families. These findings are compatible with differential division of ancestral functions (subfunctionalization) or emergence of novel functions (neofunctionalization) being the prevalent modes of evolution of paralogs in E-families, as opposed to pseudogenization (nonfunctionalization), which is the typical fate of paralogs in N-families. Unlike other characteristics of genes, such as essentiality, existence of paralogs, or expression level, membership in an E-family or an N-family strongly correlates with the level of selection and appears to be a major determinant of a gene's evolutionary fate.
The nature of connections between organismal and molecular evolution remains a fundamental and, generally, unanswered question. The relationship between evolution of a gene and the organism can be characterized by the change in fitness precipitated by deletion or mutation of that gene (Drake et al. 1998
Driven by the recent availability of many complete genome sequences, along with results from genome-wide functional assays, many researchers observed significant correlations between functional characteristics of genes, such as essentiality (Hurst and Smith 1999
Gene duplication followed by divergence is one of the primary driving forces behind functional innovation during evolution (Ohno 1970
The goal of the present study is to elucidate some general relationships between functional constraints, differential strengths of purifying selection, and gene duplication. First, we observe that membership in a gene family is a good predictor of selection. For example, genes that have an essential paralog are under stronger selection than genes without essential paralogs. We show that families of paralogs that include at least one essential gene (E-families) and those that consist entirely of nonessential genes (N-families) evolve in dramatically different regimens. Although genes in E-families, on average, evolve substantially more slowly than genes in N-families, the E-families show a much greater divergence between paralogs. This can be attributed to the significantly longer average survival time of paralogs in E-families as compared to N-families. The E-families appear to comprise a reservoir of genes for evolution of new functions via the subfunctionalization and neofunctionalization routes. Finally, we show that there is a relationship between evolution of the open reading frames and upstream regions. Specifically, genes in E-families that are under stronger selection evolve novel transcriptional regulatory contexts.
Homogeneity in strength of selection within paralogous families While essential genes tend to evolve slowly and, by implication, appear to be under stronger purifying selection than nonessential ones (Hirsh and Fraser 2001
Paralogy relationships between genes within genomes can be conveniently represented in the most general form as a Divergence and Diffusion Graph (DDG). The vertices represent genes, and edges represent homology relationships weighed according to their sequence similarity scores (Fig. 1; see Supplemental material). All genes can be partitioned into paralogous families (Harrison and Gerstein 2002
For example, partitioning paralogous gene sets into E-families and N-families can be used to test whether essentiality is a major determinant of selection. If the strength of selection is a characteristic of gene family membership as opposed to essentiality, we predict that all genes in E-families, including nonessential ones, would be subject to significantly stronger purifying selection than members of N-families. We assessed the strength of purifying selection by several standard measures: single feature polymorphism (SFP) densities in Saccharomyces cerevisiae genes (Winzeler et al. 2003
We found that both essential and nonessential members of E-families are under substantially stronger purifying selection than members of N-families, independent of the species analyzed and the method used to estimate selection (Table 2). Furthermore, by all employed criteria, the difference in the strength of selection between nonessential genes in E-families and N-families was considerably more significant than the difference between all essential and nonessential genes in the same species (Table 2). Thus, it seems that strength of selection is a more salient characteristic of gene family membership than essentiality. Moreover, the data in Table 2 suggest that essentiality per se is neither necessary nor sufficient to impose purifying selection. Instead, given that E-families do not exhibit significant biases in characteristics that might be responsible for transitive correlations with the strength of selection, such as Codon Adaptation Index or protein abundance (Hahn and Kern 2005
Selection and dynamics of duplications and divergence The implication of the observation above is that there are sets of genes related by evolution that may share characteristics that impose selection. We recently showed (Shakhnovich 2006
The larger total number of paralogs (Table 1) in N-families, coupled with the observation of a higher pseudogenization rate, suggests that recent duplicates from N-families should also enjoy a higher duplication rate. To test this prediction, we identified pairs of orthologs and lineage-specific paralogs using the InParanoid algorithm (Remm et al. 2001
Taken together, duplication and pseudogenization data indicate that N-families evolve in a significantly more "dynamic" regime than E-families. Perhaps because of weaker purifying selection (Table 2), members of N-families have a higher rate of both pseudogenization (Table 3) and duplicate fixation (Table 4). Of course, the two observations might not be entirely independent, as a greater duplicate fixation rate might also result in a higher rate of pseudogenization. The relevant issue is the effect of this more prodigious rate of evolution on the typical fate of paralogs. Does the higher pseudogenization rate in N-families (Table 3) offset the higher duplicate fixation rate (Table 4), resulting in a shorter overall survival of both pairs of paralogs? To test this, we assume that synonymous sites are approximately neutral and evolve in a clock-like fashion. Under this assumption, the distribution of synonymous substitutions in paralogous gene families should mirror the age distribution of the paralogs. For the N-families, the distribution of Ks shows the best fit to an exponential decay curve. This is consistent with an approximately constant probability of pseudogenization per unit time. In sharp contrast, the number of paralogs in E-families correlates linearly with the increase in synonymous site divergence (Fig. 2). The shape of the distribution in Figure 2 for E-families can be explained by a model in which the pseudogenization rate in these families drops as paralogs diverge (data not shown). Furthermore, the observed difference in the distributions of synonymous site divergence in E-families and N-families is compatible with the notion that the characteristic half-life of paralogs in E-families is much longer than that in N-families. Taking the Ks value as a measure of evolutionary time, we estimated that pairs of paralogs in E-families survive, on average, almost three times longer (mean Ks = 3.25) than paralogs in N-families (mean Ks = 0.15; P < 1e-40).
Thus, it seems that the increased evolutionary dynamism results in a shift toward shorter life spans for genes in N-families.
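The half-life reasoning above can be illustrated with a small sketch. If Ks is treated as a proxy for time and paralogs are lost with a constant probability per unit time, the Ks values of surviving pairs follow an exponential distribution, whose rate can be estimated as the reciprocal of the mean Ks. This is not the curve-fitting procedure used in the study; it substitutes a simple maximum-likelihood estimate, and the function names and Ks values below are hypothetical.

```python
import math


def exponential_rate(ks_values):
    """Maximum-likelihood decay rate of an exponential distribution (1 / mean)."""
    if not ks_values:
        raise ValueError("need at least one Ks value")
    return len(ks_values) / sum(ks_values)


def half_life(rate):
    """Ks interval over which half of the paralog pairs are expected to be lost."""
    return math.log(2) / rate


# Hypothetical Ks values for paralog pairs in N-families (illustration only).
ks_n_family = [0.05, 0.10, 0.15, 0.20, 0.25]

rate = exponential_rate(ks_n_family)
t_half = half_life(rate)
print(f"decay rate per unit Ks: {rate:.3f}, half-life in Ks units: {t_half:.3f}")
```

Under this model a smaller decay rate means a longer half-life, which is the sense in which a larger mean Ks is read as longer average survival.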
Longer life span of duplicates in E-families allows greater divergence in sequence and transcriptional regulation The observation of a longer average time of survival of duplicates in E-families carries significant implications for divergence of protein sequences in these families. In fact, visual examination of the graph representation of the largest paralogous families from S. cerevisiae immediately reveals a striking difference between the E-families and N-families (Fig. 1). Although the sizes of both families shown in Figure 1 are similar, the N-family has a much greater number of connections per node (node degree) than the E-family (10 compared to 1). We also observed a substantial difference between the mean clustering coefficients that characterize the density of connections between genes for the two types of families (0.55 for N-families compared to 0.21 for E-families; P < 1e-3). Both the number of connections per node and the clustering coefficient measure transitivity in sequence space, suggesting that, on average, E-family paralogs diverged farther away from each other than N-family paralogs.
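The two graph statistics used above, node degree and clustering coefficient, can be computed as in the following minimal sketch. It assumes an undirected, unweighted homology graph and assigns a clustering coefficient of 0 to nodes with fewer than two neighbors; both choices, along with the function names and toy graphs, are assumptions made for illustration and are not taken from the study's methods.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {gene: set of neighbors} from edge pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def mean_degree(adj):
    """Average number of connections per node."""
    return sum(len(neighbors) for neighbors in adj.values()) / len(adj)


def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here for nodes with under two neighbors
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def mean_clustering(adj):
    """Average of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, node) for node in adj) / len(adj)


# Toy graphs: a fully connected triangle and a star with no neighbor-neighbor links.
dense = build_adjacency([("a", "b"), ("b", "c"), ("a", "c")])
sparse = build_adjacency([("hub", "x"), ("hub", "y"), ("hub", "z")])

print(mean_degree(dense), mean_clustering(dense))
print(mean_degree(sparse), mean_clustering(sparse))
```

A family whose members are all mutually similar behaves like the triangle (high clustering), whereas a family whose members have diverged so that most pairs no longer share a detectable edge behaves more like the star.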
We calculated the distribution of sequence divergence between all pairs of paralogs in E-families and N-families using two standard measures of the nonsynonymous substitution rate, the number of nonsynonymous substitutions per nonsynonymous site (Ka), and amino acid sequence identity (see the Supplemental material). Indeed, by both criteria, paralogs in E-families were characterized by much greater divergence (Fig. 3A,B). For example, in S. cerevisiae, the average amino acid sequence identity between paralogs in E-families is
In fact, the difference in sequence divergence distributions was large enough that we wanted to assess its predictive value for differentiating between E-families and N-families in the absence of essentiality data. We performed a reciprocal analysis, that is, examined how well characteristics of sequence divergence would differentiate between E-families and N-families. To this end, we used Receiver Operating Characteristic (ROC) statistics based on the average separation of paralogs in families to classify genes into E-families and N-families without explicitly invoking essentiality as a marker. The ROC curve in Figure 4 shows that 80% of E-families have paralogs that are as far diverged as 20% of N-families. An even better separation of E-families and N-families can be obtained by using the clustering coefficient as the classification criterion: up to 73% of E-families were captured without a single false positive (N-family), although this analysis covered only larger families (those with three or more members), resulting in classification of 11 E-families and 18 N-families in yeast. Thus, the separation of paralogs in sequence space for E-families and N-families is so dissimilar that classification based solely on sequence divergence largely reproduces the partitioning based on using essential paralogs as markers.
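The ROC analysis described above can be sketched as follows. Each family receives a score (for example, the mean divergence between its paralogs), families with higher scores are predicted to be E-families, and the threshold is swept from high to low. The scores, labels, and function names are hypothetical, and this is only an illustration of the general technique, not the exact procedure behind Figure 4.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs for a threshold sweep.

    labels are 1 for E-families (positives) and 0 for N-families (negatives).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Families with tied scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical per-family divergence scores; 1 marks an E-family, 0 an N-family.
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]

points = roc_points(scores, labels)
print(points)
print(auc(points))
```

The point on the curve with the highest true positive rate at a false positive rate of zero corresponds to the kind of statement made above about capturing E-families without a single false positive.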
Our results show that E-families explore the protein sequence space, through duplication and divergence of paralogs, to a much greater depth than N-families. This has important bearings on the impact of family membership on functional divergence. So far, we have presented evidence that paralogs in E-families survive longer, enjoy a lower rate of pseudogenization, and diverge in sequence farther than paralogs in N-families. Subfunctionalization (division of ancestral pleiotropy) is characterized by strong purifying selection on both paralogs after duplication (Force et al. 1999
If the diverging paralogs in E-families follow the subfunctionalization (or neofunctionalization) routes, one would expect to observe divergence not only in sequence but also in expression regulation as the paralogs adapt to new biological niches. We compared the extent of transcription factor (TF) sharing between paralogs in the two classes of families. In accord with the notion of greater functional diversification of genes in E-families, most of the paralogs in these families had no common transcription factors binding to their upstream regions (Harbison et al. 2004
We present evidence that gene family membership is a general and reliable indicator of the strength of purifying selection acting on a gene. Purifying selection, linked to functional constraints, may affect the course of molecular evolution not only through influencing the speed of divergence, but also by affecting the fate of paralogs after duplication. Specifically, we show that paralogs in E-families that include essential genes are subject to much stronger purifying selection than genes in N-families (Table 2) without essential paralogs. Thus, the first salient observation is that genes within families of paralogs experience similar levels of selection. Although there was no clear-cut difference in the distribution of protein functions over E-families and N-families, it appears that the presence of essential genes in the former correlates with greater biological importance of E-family members (Table 2). In spite of stronger selective constraints, paralogs in E-families exhibit greater sequence divergence than N-families (Fig. 3). Furthermore, the difference in the exploration of sequence space between E-families and N-families was so large that we could use average sequence divergence of paralogs or clustering coefficient in a predictive manner to classify a large proportion of families without using the existence of an essential paralog as a criterion (Fig. 4).
The difference in average sequence divergence between members of the E-families and N-families can be attributed to the differential dynamics of molecular evolution in these families. Specifically, N-families are evolutionarily more dynamic, that is, duplicates in these families tend to become pseudogenes shortly after duplication (Table 3) but also enjoy a higher fixation rate (Table 4). These results are compatible with recent evidence indicating that, in yeast, functionally less important genes tend to duplicate more often (He and Zhang 2006
The magnitude of the observed differences between the two types of families contrasts sharply with previous observations of weak or moderate correlations between various functional and evolutionary characteristics of genes determined on the genome scale (Hurst and Smith 1999
Another possible conclusion from this study is that subfunctionalization is a transient phase in gene evolution, and genes that divide ancestral functions soon undergo neofunctionalization. This hypothesis is consistent with the observations of the apparently decreasing rate of pseudogenization with passage of time (Fig. 2) and change in transcriptional regulation of paralogs in E-families (Fig. 5). A similar sub-, neofunctionalization model has been recently proposed by He and Zhang in a study of the evolution of protein-protein interaction networks (He and Zhang 2005
The results presented here uncover a surprisingly strong link between a gene's membership in a paralogous family, the constraints imposed on its evolution by purifying selection, and the characteristics of gene family evolution by duplication. The classification of genes into E-families and N-families could be a useful starting point for a variety of future studies into the relationships among the evolution of genes, genomes, and phenotypes.
To construct the DDG for each of three species (S. cerevisiae, C. elegans, and E. coli), the complete sets of protein sequences from the respective genomes were extracted from the GenBank database, and an all-against-all sequence comparison was performed using the BLAST program (Altschul et al. 1997
Paralogous families were identified by finding all strongly connected components in the DDG as described in Cormen (2001). We used the InParanoid program (http://inparanoid.cgb.ki.se/) to identify orthologs and species-specific paralogs from the five yeast species (S. cerevisiae, C. glabrata, K. lactis, A. gossypii, and D. hansenii). The genomes were obtained from the NCBI (http://www.ncbi.nlm.nih.gov).
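The family identification and E/N partitioning steps can be sketched as below. The sketch assumes that homology edges are symmetric, in which case strongly connected components coincide with ordinary connected components, and it finds them with a union-find structure rather than the algorithm from Cormen (2001). The gene names, edge list, and essentiality set are hypothetical, and no similarity threshold or edge weighting from the study is reproduced here.

```python
from collections import defaultdict


def paralogous_families(genes, edges):
    """Group genes into families: connected components of an undirected graph."""
    parent = {gene: gene for gene in genes}

    def find(gene):
        # Follow parent links to the component root, shortening the path on the way.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in edges:
        parent[find(a)] = find(b)

    families = defaultdict(set)
    for gene in genes:
        families[find(gene)].add(gene)
    # Singletons have no paralogs, so keep only groups of two or more genes.
    return [family for family in families.values() if len(family) > 1]


def classify_families(families, essential):
    """Label a family 'E' if it contains at least one essential gene, else 'N'."""
    return [("E" if family & essential else "N", family) for family in families]


# Hypothetical toy data: g6 has no homologs and is therefore a singleton.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
edges = [("g1", "g2"), ("g2", "g3"), ("g4", "g5")]
essential = {"g2"}

labeled = classify_families(paralogous_families(genes, edges), essential)
for label, family in labeled:
    print(label, sorted(family))
```

Note that in this scheme the nonessential genes g1 and g3 are members of an E-family because they share a component with the essential gene g2, which mirrors the distinction between family membership and essentiality drawn in the text.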
The authors thank Eugene Shakhnovich for his invaluable support and review of the manuscript. We also thank Matthew Hahn for proposing the duplication rate study and for many helpful comments regarding the manuscript. Additionally, we extend our appreciation to Charles DeLisi, Tim Reddy, Joe Mellor, Julian Mintseris, and others at the Bioinformatics program at Boston University for discussion and insights.
3 Corresponding author.
E-mail Borya{at}acs.bu.edu; fax (617) 353-4814. [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5346206
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Conery, J.S. and Lynch, M. 2001. Nucleotide substitutions and the evolution of duplicate genes. Pac. Symp. Biocomput. 2001: 167-178.
Cormen, T.H. 2001. Introduction to algorithms, 2nd ed. MIT Press, Cambridge, MA.
Davis, J.C. and Petrov, D.A. 2004. Preferential duplication of conserved proteins in eukaryotic genomes. PLoS Biol. 2: e55.
Drake, J.W., Charlesworth, B., Charlesworth, D., and Crow, J.F. 1998. Rates of spontaneous mutation. Genetics 148: 1667-1686.
Drummond, D.A., Bloom, J.D., Adami, C., Wilke, C.O., and Arnold, F.H. 2005. Why highly expressed proteins evolve slowly. Proc. Natl. Acad. Sci. 102: 14338-14343.
Drummond, D.A., Raval, A., and Wilke, C.O. 2006. A single determinant dominates the rate of yeast protein evolution. Mol. Biol. Evol. 23: 327-337.
Enright, A.J., Kunin, V., and Ouzounis, C.A. 2003. Protein families and TRIBES in genome sequence space. Nucleic Acids Res. 31: 4632-4638.
Force, A., Lynch, M., Pickett, F.B., Amores, A., Yan, Y.L., and Postlethwait, J. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151: 1531-1545.
Fraser, A.G., Kamath, R.S., Zipperlen, P., Martinez-Campos, M., Sohrmann, M., and Ahringer, J. 2000. Functional genomic analysis of C. elegans chromosome I by systematic RNA interference. Nature 408: 325-330.
Fraser, H.B., Hirsh, A.E., Steinmetz, L.M., Scharfe, C., and Feldman, M.W. 2002. Evolutionary rate in the protein interaction network. Science 296: 750-752.
Gerdes, S.Y., Scholle, M.D., Campbell, J.W., Balazsi, G., Ravasz, E., Daugherty, M.D., Somera, A.L., Kyrpides, N.C., Anderson, I., and Gelfand, M.S., et al. 2003. Experimental determination and system level analysis of essential genes in Escherichia coli MG1655. J. Bacteriol. 185: 5673-5684.
Giaever, G., Chu, A.M., Ni, L., Connelly, C., Riles, L., Veronneau, S., Dow, S., Lucau-Danila, A., Anderson, K., and Andre, B., et al. 2002. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418: 387-391.
Gu, X. 1999. Statistical methods for testing functional divergence after gene duplication. Mol. Biol. Evol. 16: 1664-1674.
Gu, X. 2001a. Mathematical modeling for functional divergence after gene duplication. J. Comput. Biol. 8: 221-234.
Gu, X. 2001b. A site-specific measure for rate difference after gene duplication or speciation. Mol. Biol. Evol. 18: 2327-2330.
Hahn, M.W. and Kern, A.D. 2005. Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Mol. Biol. Evol. 22: 803-806.
Harbison, C.T., Gordon, D.B., Lee, T.I., Rinaldi, N.J., Macisaac, K.D., Danford, T.W., Hannett, N.M., Tagne, J.B., Reynolds, D.B., and Yoo, J., et al. 2004. Transcriptional regulatory code of a eukaryotic genome. Nature 431: 99-104.
Harrison, P.M. and Gerstein, M. 2002. Studying genomes through the aeons: Protein families, pseudogenes and proteome evolution. J. Mol. Biol. 318: 1155-1174.
Harrison, P.M., Echols, N., and Gerstein, M.B. 2001. Digging for dead genes: An analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome. Nucleic Acids Res. 29: 818-830.
Harrison, P., Kumar, A., Lan, N., Echols, N., Snyder, M., and Gerstein, M. 2002. A small reservoir of disabled ORFs in the yeast genome and its implications for the dynamics of proteome evolution. J. Mol. Biol. 316: 409-419.
He, X. and Zhang, J. 2005. Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics 169: 1157-1164.
He, X. and Zhang, J. 2006. Higher duplicability of less important genes in yeast genomes. Mol. Biol. Evol. 23: 144-151.
Hirsh, A.E. and Fraser, H.B. 2001. Protein dispensability and rate of evolution. Nature 411: 1046-1049.
Hurst, L.D. and Smith, N.G. 1999. Do essential genes evolve slowly? Curr. Biol. 9: 747-750.
Jordan, I.K., Rogozin, I.B., Wolf, Y.I., and Koonin, E.V. 2002. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 12: 962-968.
Jordan, I.K., Wolf, Y.I., and Koonin, E.V. 2003. No simple dependence between protein evolution rate and the number of protein-protein interactions: Only the most prolific interactors tend to evolve slowly. BMC Evol. Biol. 3: 1.
Jordan, I.K., Wolf, Y.I., and Koonin, E.V. 2004. Duplicated genes evolve slower than singletons despite the initial rate increase. BMC Evol. Biol. 4: 22.
Kamath, R.S., Fraser, A.G., Dong, Y., Poulin, G., Durbin, R., Gotta, M., Kanapin, A., Le Bot, N., Moreno, S., and Sohrmann, M., et al. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421: 231-237.
Keightley, P.D. and Eyre-Walker, A. 1999. Terumi Mukai and the riddle of deleterious mutation rates. Genetics 153: 515-523.
Kimura, M. 1981. Possibility of extensive neutral evolution under stabilizing selection with special reference to nonrandom usage of synonymous codons. Proc. Natl. Acad. Sci. 78: 5773-5777.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., and Simon, I., et al. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298: 799-804.
Levins, R. 1968. Evolution in changing environments; some theoretical explorations. Princeton University Press, Princeton, NJ.
Lynch, M. and Conery, J.S. 2000. The evolutionary fate and consequences of duplicate genes. Science 290: 1151-1155.
Lynch, M. and Force, A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154: 459-473.
Lynch, M. and Katju, V. 2004. The altered evolutionary trajectories of gene duplicates. Trends Genet. 20: 544-549.
Macarthur, R. and Levins, R. 1964. Competition, habitat selection, and character displacement in a patchy environment. Proc. Natl. Acad. Sci. 51: 1207-1210.
Nei, M. 1987. Molecular evolutionary genetics. Columbia University Press, New York.
Nei, M. and Roychoudhury, A.K. 1973. Probability of fixation of nonfunctional genes at duplicate loci. Am. Nat. 107: 590-605.
Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, Berlin, New York.
Pal, C., Papp, B., and Hurst, L.D. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158: 927-931.
Petrov, D.A. and Hartl, D.L. 2000. Pseudogene evolution and natural selection for a compact genome. J. Hered. 91: 221-227.
Remm, M., Storm, C.E., and Sonnhammer, E.L. 2001. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314: 1041-1052.
Shakhnovich, B.E. 2006. Relative contributions of structural designability and functional diversity in molecular evolution of duplicates. Bioinformatics 22: e440-e445.
Shakhnovich, B.E. and Max Harvey, J. 2004. Quantifying structure-function uncertainty: A graph theoretical exploration into the origins and limitations of protein annotation. J. Mol. Biol. 337: 933-949.
Simmer, F., Moorman, C., van der Linden, A.M., Kuijk, E., van den Berghe, P.V., Kamath, R.S., Fraser, A.G., Ahringer, J., and Plasterk, R.H. 2003. Genome-wide RNAi of C. elegans using the hypersensitive rrf-3 strain reveals novel gene functions. PLoS Biol. 1: e12.
Todd, A.E., Orengo, C.A., and Thornton, J.M. 2001. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307: 1113-1143.
Wall, D.P., Hirsh, A.E., Fraser, H.B., Kumm, J., Giaever, G., Eisen, M.B., and Feldman, M.W. 2005. Functional genomic analysis of the rates of protein evolution. Proc. Natl. Acad. Sci. 102: 5483-5488.
Winzeler, E.A., Castillo-Davis, C.I., Oshiro, G., Liang, D., Richards, D.R., Zhou, Y., and Hartl, D.L. 2003. Genetic diversity in yeast assessed with whole-genome oligonucleotide arrays. Genetics 163: 79-89.
Yang, Z. 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13: 555-556.
Yang, Z. and Nielsen, R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17: 32-43.
Yang, J., Gu, Z., and Li, W.H. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20: 772-774.
Received March 28, 2006; accepted in revised form August 16, 2006.