Genome Res. 15:385-392, 2005 ©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00
Letter
Protein structure and evolutionary history determine sequence space topology
1 Bioinformatics Program, Boston University, Boston, Massachusetts 02215, USA
2 Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138, USA
3 Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, USA
Understanding the observed variability in the number of homologs of a gene is an important unsolved problem with broad implications for research into the coevolution of structure and function, gene duplication, pseudogene formation, and possibly emerging diseases. Here, we attempt to define and elucidate some possible causes of the observed irregularity in sequence space. We present evidence that the sequence variability and functional diversity of a gene or fold family are influenced by quantifiable characteristics of the protein structure. These characteristics reflect the structural potential for sequence plasticity, i.e., the ability to accept mutations without losing thermodynamic stability. We identify a structural feature of a protein domain, contact density, that serves as a determinant of entropy in sequence space, i.e., the ability of a protein to accept mutations without destroying the fold (also known as fold designability). We show that the logarithm of average gene family size correlates statistically (R² > 0.9) with the contact density of its three-dimensional structure. We present evidence that the sizes of individual gene families are influenced not only by the designability of the structure, but also by evolutionary history, e.g., the amount of time the gene family has been in existence. We further show that the observed statistical correlation between gene family size and contact density holds at many levels of evolutionary divergence, i.e., not only for closely related sequences, but also at the less closely related fold and superfamily levels of homology.
Gene family and domain-fold family sizes are known to vary widely (Finkelstein and Ptitsyn 1987
In particular, in a recent study, Taverna and Goldstein (2000
Recent successes in structural genomics and bioinformatics provide a wealth of data for statistical analysis of the distributions of gene family sizes of real proteins with known structures. On the other hand, recent research in our lab and others has increased our understanding of the structural determinants of protein designability (Wolynes 1996
From a biological perspective, we may hypothesize that gene family size is at least in part influenced by functional constraints related to the number of different, but perhaps related functions needed by the cell (Lespinet et al. 2002
Building PDUG
In order to consider sequence, structure, and function information in a unified, systematic way, we define both gene families and fold families quantitatively using the Protein Domain Universe Graph (PDUG) (Dokholyan et al. 2002
Using this PDUG formalism, we can define a gene family based on micro-evolutionary considerations; the PDUG represents the variability accessible to a given gene upon mutation, whether that variability occurs in sequence, function, or structure space. Unlike other definitions of gene families (Sonnhammer et al. 1998
The role of designability
The physical explanation for the correlation between traces of powers of the contact matrix (CM, a structural feature) and sequence entropy (i.e., designability) follows from the fact that these traces reflect topological characteristics of the network of contacts within the structure. For example, the trace of CM² is proportional to the total number of contacts (it counts the two-step, self-returning walks), the trace of CM⁴ reflects the number of length-4 closed loops in the system, and so on. One may also note that certain closed sets of contacts allow for optimal placement of amino acids that interact very favorably. For example, if four amino acids that strongly attract each other are folded into an architecture where they all interact favorably (e.g., on the four corners of a square, see Fig. 2), this formation contributes more to the stability of the overall structure than configurations in which the same four amino acids are arranged linearly, or in which the last of the contacts is out of the contact range (Fig. 2). Such optimal placement of several strongly interacting amino acids allows more sequences to be folded into the structure by relaxing energy constraints for the rest of the sequence. Thus, structures that provide certain features, such as long closed loops of interactions and a higher density of contacts per residue, are expected to accommodate a wider variety of different sequences. This qualitative argument is similar in spirit to the derivation of the Boltzmann distribution in statistical mechanics (Landau et al. 1978
For this study, we use the trace of the second power of the contact matrix, normalized by chain length, as the simplest approximation for designability. This quantity, known as the contact density (CD), is proportional to the number of contacts per amino acid residue (see Methods); it corresponds to the lowest, second-order term in the expansion of equation 1. A designability criterion at this level of approximation has been considered earlier by several authors (Wolynes 1996
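As a small illustration of how traces of powers of a contact matrix count closed walks, the sketch below compares two hypothetical four-residue contact networks of the kind described for Figure 2: a square in which all four neighboring contacts are formed, and a linear arrangement in which the last contact is missing. The two matrices and the helper name `trace_of_power` are assumptions made for this example and are not data from the study.

```python
import numpy as np

def trace_of_power(contact_matrix, n):
    """Trace of the n-th power of a contact matrix (number of closed n-step walks)."""
    return int(np.trace(np.linalg.matrix_power(contact_matrix, n)))

# Hypothetical 4-residue contact networks (1 = contact, 0 = no contact).
# "square": residues on the corners of a square, each touching two neighbors.
square = np.array([[0, 1, 0, 1],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 1, 0]])
# "linear": the same residues with the closing contact (0-3) out of range.
linear = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]])

for name, cm in (("square", square), ("linear", linear)):
    n_contacts = int(cm.sum()) // 2  # each contact appears twice in a symmetric matrix
    print(name, n_contacts, trace_of_power(cm, 2), trace_of_power(cm, 4))
# square 4 8 32
# linear 3 6 14
```

For a symmetric 0/1 matrix with a zero diagonal, the trace of the second power is exactly twice the number of contacts. Closing the loop adds only one contact but more than doubles the trace of the fourth power (14 to 32), which is the sense in which higher-order traces are sensitive to closed loops of interactions.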
Correlation between CD and sequence family size on average
Next, we want to assess the robustness of the average correlation, as well as estimate the area of sequence space affected by designability. Structural determinants may influence small areas of sequence space, such as those evaluated in Figure 3A, or larger ones, defined by fold-level structure comparison. In part, the area influenced will depend on how CD changes with respect to divergence in structure. Thus, we perform the analysis on distantly related gene families as defined through structural comparison between nodes on PDUG. To this end, we take the structural neighborhood of a given domain to be all nodes that are connected to it by an edge on the PDUG (Dokholyan et al. 2002
Figure 3B shows that average CD, which serves as a proxy for the average designability of a structural neighborhood (Fold), itself correlates with the (log) of the gene-sequence family size of that neighborhood. Together, Figure 3, A and B, show that gene family size and designability (as approximated by CD) correlate across various scales of evolutionary distance. This could indicate that designability affects large sequence and structure spaces, spanning not only close sequence homology, but extending into sets of sequences whose homology is identifiable only through structure comparison. From an evolutionary standpoint, this may indicate that domains with higher CD diverge to produce other high-designability domain structures.
Since these observations of correlations between designability and gene family size are statistical in nature, we want to comment on the robustness of the reported results. There are two issues to consider: the variability of contact density (CD) for structures within gene families, and the robustness of the calculation of the mean number of sequences for all gene families in each bin. To address these concerns, we first calculate the intrafamily deviation in CD for each gene family on PDUG (see Methods).
While the points in Figure 3, A and B, show mean values of the CD for the representative domains (nodes on PDUG), we also include estimates of the deviation in CD, taking into account sequences inside gene families with solved structures (i.e., domains that have sequence homology to the representative domain). In order to calculate this deviation, we take all solved structures for domains with sequence homology to the representative domain and calculate the standard deviation of CD inside each gene family. We then calculate the average standard deviation in every bin of Figure 3, A and B. The deviation is shown as CD-axis (X) error bars in Figure 3, A and B. It is apparent from the size of the error bars that the deviation in CD within each gene family is relatively small, on the order of 0.05 or less. Indeed, as expected, the intrafamily deviation of CD gets smaller as average contact density increases. The CD deviation ranges from 0.01 at CD = 4.8 to 0.06 at CD = 3.8 in Figure 3A. The deviation is much smaller when considering domains inside fold-level structural neighborhoods, where it falls to the order of 0.001. This calculation is primarily meant to show that the choice of the representative structure for each gene family is not expected to significantly affect the results. Next, we calculate the possible error in the mean gene family size for each bin. According to the Central Limit Theorem, this quantity is inversely proportional to the square root of the number of observations in the bin. We include it as the gene family size (Y) axis error bars in Figure 3. It is worth noting that this measures the deviation of the mean over all gene families belonging to a given bin only, and does not reflect the scatter of the distribution inside the bin. That quantity is considered separately, in detail, later.
Both of these errors are small enough that they do not affect the conclusions drawn from Figure 3, A and B. It is also worth noting that, as the size of the error bars suggests, changing the binning does not appreciably affect these results. Even considering all of the caveats mentioned above, the correlation between CD and average sequence variability on both the domain and fold levels is striking, and the error bars show the robustness of these results.
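The error on the mean family size in each CD bin can be sketched as follows. This is a minimal illustration that assumes the Y-axis error bar is the standard error of the mean, i.e., the sample standard deviation divided by the square root of the number of gene families in the bin. The function name `bin_family_sizes`, the bin edges, and the toy numbers are hypothetical and do not come from the study.

```python
import numpy as np

def bin_family_sizes(cd, family_size, bin_edges):
    """Group gene families by contact density and summarize each bin.

    Returns a list of (bin_center, n_families, mean_size, standard_error) tuples.
    Bins holding fewer than two families are skipped, because a sample
    standard deviation cannot be estimated from a single observation.
    """
    cd = np.asarray(cd, dtype=float)
    family_size = np.asarray(family_size, dtype=float)
    # np.digitize returns i such that bin_edges[i-1] <= x < bin_edges[i]
    bin_index = np.digitize(cd, bin_edges) - 1
    summary = []
    for b in range(len(bin_edges) - 1):
        sizes = family_size[bin_index == b]
        if sizes.size < 2:
            continue
        mean_size = sizes.mean()
        # Standard error of the mean: shrinks as 1/sqrt(number of families)
        standard_error = sizes.std(ddof=1) / np.sqrt(sizes.size)
        center = 0.5 * (bin_edges[b] + bin_edges[b + 1])
        summary.append((center, int(sizes.size), mean_size, standard_error))
    return summary

# Toy data: four gene families with made-up CD values and family sizes.
toy = bin_family_sizes([3.8, 3.9, 4.6, 4.7], [2, 4, 10, 30], [3.5, 4.0, 5.0])
for center, n, mean_size, sem in toy:
    print(center, n, mean_size, round(sem, 3), round(np.log(mean_size), 3))
```

The last printed column is the logarithm of the bin mean, the quantity that the text correlates with CD. Note that the standard error describes the uncertainty of the bin mean only, not the spread of individual family sizes inside the bin.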
For a more biological perspective, we determine how gene family size is related to the diversity of functions that the family performs. We define the functional determinant of a gene family as entropy in function space. When we calculate this measure in the context of PDUG, we utilize Gene Ontology (GO) (Ashburner et al. 2000
The role of evolution
While the average statistical correlations of gene family size and FFS with CD are highly significant, how predictive are they when it comes to calculating gene family size for a particular domain? To answer this question, we present a scatter plot of gene family size versus CD that shows all domains in the PDUG (Fig. 5A). Though the correlation in the scatter is statistically significant, it is clear that CD is not a reliable predictor of gene family size for every domain. This is perhaps not surprising, given that other factors may have influenced gene family sizes. A natural possibility that has also been observed in lattice simulations (Taverna and Goldstein 2000
Understanding the evolutionary history of all of the protein domains on the PDUG requires construction of the most parsimonious scenario for protein structure evolution, a complex proposition (Mirkin et al. 2003
We thus define the structural content of the LUCA to be all domains that have homologs in at least one prokaryotic and at least one eukaryotic species. This yields approximately a third of the structural content of PDUG. We present the LUCA domains on a separate scatter plot in Figure 5B. Two observations are immediately apparent. First, LUCA domains clearly feature greater CD, suggesting that "first" domains were more designable (difference of means 0.27, t-test P-value < 1e-8). Second, even at equal CD (designability) with their younger counterparts, LUCA domains feature greater family sizes, on average 37 more members (Fig. 5B; the scatter plot is markedly shifted toward higher gene family sizes, P-value < 1e-10). This observation provides evidence that, as simulations on simple lattice models suggest (Taverna and Goldstein 2000
To avoid circularity in the calculation of the gene family size difference, we calculate the average number of genomes in which LUCA domains are present and compare that to the background distribution of all domains. We find that LUCA domains are not present in a significantly larger number of genomes (data not shown); however, they do exhibit a statistically significant increase in gene family size, as outlined above. Furthermore, we see the importance of designability even within LUCA domains by noting that higher CD domains exhibit higher gene family size within LUCA. To underline this observation further, we calculate the linear fit for all domains (R = 0.30) and compare it with that for the LUCA domains. We observe that the goodness of fit (R) increases to 0.40. We tested the statistical significance of this increase by modeling a random assignment of LUCA domains. We randomly picked the same number of domains from PDUG and calculated the linear fit (R value). We then repeated the sampling 1000 times. Predictably, we find that the mean R value from the random simulation is 0.30, the same as for the background distribution, with a standard deviation of 0.025.
From this simple experiment, we can conclude that LUCA domains represent a biased sample, where the linear fit of the correlation between CD and sequence family size is four standard deviations away from random. The increase in the goodness of the linear fit for the LUCA domains is consistent with our theory that given the same amount of time for divergence, higher CD domains will have larger sequence families. However, the result mainly outlines the importance of evolutionary history in fulfilling the potential for sequence family size defined by the structural designability of that family. The increase in the linear fit (R) also underlines the independence of this result from bias stemming from uneven genome distribution of LUCA domains.
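The random-assignment test described above can be sketched as follows. This is a minimal illustration under stated assumptions: R is taken to be the Pearson correlation coefficient between CD and a family-size measure, the random subsets are drawn without replacement, and the function names, the seed, and the input arrays are hypothetical rather than the authors' implementation.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def subset_r_zscore(cd, size_measure, subset_mask, n_draws=1000, seed=0):
    """Compare R inside a chosen subset with R in random subsets of equal size.

    Returns (observed_r, null_mean, null_sd, z), where z is the distance of
    the observed R from the random-subset mean in units of standard deviation.
    """
    cd = np.asarray(cd, dtype=float)
    size_measure = np.asarray(size_measure, dtype=float)
    subset_mask = np.asarray(subset_mask, dtype=bool)
    rng = np.random.default_rng(seed)
    k = int(subset_mask.sum())
    observed_r = pearson_r(cd[subset_mask], size_measure[subset_mask])
    null_r = np.empty(n_draws)
    for t in range(n_draws):
        # Random "assignment": pick k domains regardless of their true label
        pick = rng.choice(cd.size, size=k, replace=False)
        null_r[t] = pearson_r(cd[pick], size_measure[pick])
    null_mean = float(null_r.mean())
    null_sd = float(null_r.std(ddof=1))
    return observed_r, null_mean, null_sd, (observed_r - null_mean) / null_sd

def z_from_reported(r_subset, r_null_mean, r_null_sd):
    """Z-score computed directly from summary values."""
    return (r_subset - r_null_mean) / r_null_sd

# Values quoted in the text: R = 0.40 for LUCA domains, random mean 0.30, SD 0.025.
print(round(z_from_reported(0.40, 0.30, 0.025), 6))  # 4.0
```

With the reported values, (0.40 - 0.30) / 0.025 gives the four standard deviations quoted in the text. The resampling function would be called with one CD value, one family-size value, and one membership flag per domain.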
In this study, we presented evidence that across widely varying evolutionary distances, there are significant statistical correlations between structural designability, functional flexibility, and gene family size. The statistical nature of these observations is obvious from the scatter plot presented in Figure 5. We have found that this scatter may be explained, at least in part, by variations in the evolutionary history (Ponting and Russell 2002
While we believe that these results are illuminating, we must mention several caveats. Using CD as a proxy for entropy in sequence space is an approximation that assumes, among other things, that protein energetics may be correctly represented in contact form and that the second-order approximation of equation 2 is sufficient to capture the designability of a structure. An additional and perhaps more interesting caveat is that the "designability principle" in its canonical form assumes equilibrium in sequence space, in which all structures take full advantage of their designability potential, and that this fact is reflected in the data. Consideration of phylogeny clearly shows that this is not an entirely valid assumption. At the other extreme, several dynamic divergent evolution models predict uneven fold populations without assuming any structural preferences due to designability (Dokholyan et al. 2002
In this work, we clearly see that domains with low CD are most likely to represent smaller families, while more designable, higher CD domains may exhibit both large and small family sizes. This is exactly what one would expect from the interplay of historical and physical factors; while physical constraints impose upper bounds on the sizes of families of low-CD domains, more designable domains may exhibit greater family sizes if they are "old," and smaller sizes if they are "young." Higher designability thus reflects the potential for higher family size, but does not necessarily imply it.
Another interesting observation is that older domains seem more designable. One may speculate that early protein evolution could have imposed more stringent constraints on domain designability, either due to more challenging conditions (e.g., higher temperature) (England et al. 2003
The findings presented here may have broad implications for our understanding of structural genomics as well as structure-function relationships and coevolution. However, more quantitative evolutionary models are required to fully rationalize our findings. Further research along these lines may provide new insights into the genetic mechanisms underlying both neofunctionalization and the potential development of resistance to emerging diseases. These results provide an example of how fundamental physical principles can be statistically predictive in the biological Universe of protein folds and gene sequences.
PDUG
In order to build the PDUG, we use sequences from NRDB90 (Holm and Sander 1998
An important issue in this study is one of sequence weighting. The use of NRDB to exclude close sequence homologs ensures that we calculate sequence entropy by including far diverged sequences. The calculations of FFS provide another corroboration, giving the same result with a different weighting of sequences. Inclusion of all sequences from SWISS-PROT would introduce noise due to oversequencing of some genes versus others, and would not yield a sufficient approximation of entropy in sequence space.
Designability
where the weights a_i are all positive functions that depend on the interaction energies B. The contact matrix C is defined as C_ij = 1 if amino acids i and j are in contact, and 0 otherwise. Definitions of contact may vary, but in this study, we use the standard cutoff of 7.5 Å between C
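A minimal sketch of the contact density calculation follows, assuming one representative coordinate per residue. Which atom is used is cut off in the text above, so the function simply takes an array of coordinates. Two further choices are assumptions of this sketch rather than statements from the study: a pair exactly at the cutoff is treated as a contact, and only self-contacts are excluded (pairs adjacent along the chain are kept).

```python
import numpy as np

CUTOFF_ANGSTROM = 7.5  # distance cutoff quoted in the text

def contact_matrix(coords, cutoff=CUTOFF_ANGSTROM):
    """Binary contact matrix C: C[i, j] = 1 if residues i and j are within the cutoff."""
    coords = np.asarray(coords, dtype=float)           # shape (n_residues, 3)
    diff = coords[:, None, :] - coords[None, :, :]     # all pairwise difference vectors
    dist = np.sqrt((diff ** 2).sum(axis=-1))           # pairwise distances
    cm = (dist <= cutoff).astype(int)                  # assumption: "<=" rather than "<"
    np.fill_diagonal(cm, 0)                            # a residue is not in contact with itself
    return cm

def contact_density(cm):
    """Trace of the second power of the contact matrix, normalized by chain length."""
    n_residues = cm.shape[0]
    return float(np.trace(cm @ cm)) / n_residues

# Hypothetical coordinates: four residues on the corners of a 6 Å square.
# Sides (6.0 Å) are within the cutoff; diagonals (about 8.5 Å) are not.
coords = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (6.0, 6.0, 0.0), (0.0, 6.0, 0.0)]
cm = contact_matrix(coords)
print(int(cm.sum()) // 2, contact_density(cm))  # 4 2.0
```

Because C is symmetric with a zero diagonal, the trace of its second power equals twice the number of contacts, so this quantity is twice the number of contacts per residue, consistent with CD being proportional to contacts per residue.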
Calculation of variability in CD of intrafamily members
FFS
Here, Max(L) is the maximal number of levels of annotation, the summation is taken over all levels l and over all nodes i filled by the gene family on the GO tree, and p_i is the percentage of the family that is annotated with function i (see Fig. 1).
We are grateful to Jeremy England and Hooman Hennessey for their help, as well as to Nikolay Dokholyan, Andrew Murray, and Nick Grishin for fruitful discussions and critical readings of the manuscript, and to NIH for support.
4 Corresponding author. E-mail eugene{at}belok.harvard.edu; fax (617) 384-9228. Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3133605.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D., et al. 2000. InterPro-An integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16: 1145-1150.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25-29.
Baker, D. and Sali, A. 2001. Protein structure prediction and structural genomics. Science 294: 93-96.
Bashford, D., Chothia, C., and Lesk, A.M. 1987. Determinants of a protein fold. Unique features of the globin amino acid sequences. J. Mol. Biol. 196: 199-216.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., et al. 2003. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31: 365-370.
Deeds, E.J., Dokholyan, N.V., and Shakhnovich, E.I. 2003. Protein evolution within a structural space. Biophys. J. 85: 2962-2972.
Dietmann, S., Park, J., Notredame, C., Heger, A., Lappe, M., and Holm, L. 2001. A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3. Nucleic Acids Res. 29: 55-57.
Dokholyan, N.V., Shakhnovich, B., and Shakhnovich, E.I. 2002. Expanding protein universe and its origin from the biological Big Bang. Proc. Natl. Acad. Sci. 99: 14132-14136.
England, J.L. and Shakhnovich, E.I. 2003. Structural determinant of protein designability. Phys. Rev. Lett. 90: 218101.
England, J.L., Shakhnovich, B.E., and Shakhnovich, E.I. 2003. Natural selection of more designable folds: A mechanism for thermophilic adaptation. Proc. Natl. Acad. Sci. 100: 8727-8731.
Finkelstein, A.V. and Ptitsyn, O.B. 1987. Why do globular proteins fit the limited set of folding patterns? Prog. Biophys. Mol. Biol. 50: 171-190.
Finkelstein, A.V., Gutin, A.M., and Badretdinov, A. 1995. Boltzmann-like statistics of protein architectures. Origins and consequences. Subcell. Biochem. 24: 1-26.
Govindarajan, S. and Goldstein, R.A. 1996. Why are some protein structures so common? Proc. Natl. Acad. Sci. 93: 3341-3345.
Grzybowski, B.A., Ishchenko, A.V., Shimada, J., and Shakhnovich, E.I. 2002. From knowledge-based potentials to combinatorial lead design in silico. Acc. Chem. Res. 35: 261-269.
Hegyi, H. and Gerstein, M. 2001. Annotation transfer for genomics: Measuring functional divergence in multi-domain proteins. Genome Res. 11: 1632-1640.
Holm, L. and Sander, C. 1993. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233: 123-138.
____. 1998. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 14: 423-429.
Huynen, M.A. and van Nimwegen, E. 1998. The frequency distribution of gene family sizes in complete genomes. Mol. Biol. Evol. 15: 583-589.
Koehl, P. and Levitt, M. 2002. Protein topology and stability define the space of allowed sequences. Proc. Natl. Acad. Sci. 99: 1280-1285.
Koonin, E.V., Wolf, Y.I., and Karev, G.P. 2002. The structure of the protein universe and genome evolution. Nature 420: 218-223.
Landau, L.D., Lifshitz, E.M., and Pitaevskii, L.P. 1978. Statistical physics. Pergamon Press, Oxford, New York.
Lespinet, O., Wolf, Y.I., Koonin, E.V., and Aravind, L. 2002. The role of lineage-specific gene family expansion in the evolution of eukaryotes. Genome Res. 12: 1048-1059.
Li, H., Helling, R., Tang, C., and Wingreen, N. 1996. Emergence of preferred structures in a simple model of protein folding. Science 273: 666-669.
Manning, G., Plowman, G.D., Hunter, T., and Sudarsanam, S. 2002. Evolution of protein kinase signaling from yeast to man. Trends Biochem. Sci. 27: 514-520.
Miller, J., Zeng, C., Wingreen, N.S., and Tang, C. 2002. Emergence of highly designable protein-backbone conformations in an off-lattice model. Proteins 47: 506-512.
Mirkin, B.G., Fenner, T.I., Galperin, M.Y., and Koonin, E.V. 2003. Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol. Biol. 3: 2.
Orengo, C.A., Todd, A.E., and Thornton, J.M. 1999. From protein structure to function. Curr. Opin. Struct. Biol. 9: 374-382.
Orengo, C.A., Pearl, F.M., and Thornton, J.M. 2003. The CATH domain structure database. Methods Biochem. Anal. 44: 249-271.
Ponting, C.P. and Russell, R.R. 2002. The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct. 31: 45-71.
Qian, J., Luscombe, N.M., and Gerstein, M. 2001. Protein family and fold occurrence in genomes: Power-law behaviour and evolutionary model. J. Mol. Biol. 313: 673-681.
Shakhnovich, E.I. 1998. Protein design: A perspective from simple tractable models. Fold Des. 3: R45-R58.
Sonnhammer, E.L., Eddy, S.R., Birney, E., Bateman, A., and Durbin, R. 1998. Pfam: Multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res. 26: 320-322.
Taverna, D.M. and Goldstein, R.A. 2000. The distribution of structures in evolving protein populations. Biopolymers 53: 1-8.
Teichmann, S.A., Chothia, C., and Gerstein, M. 1999. Advances in structural genomics. Curr. Opin. Struct. Biol. 9: 390-399.
Tiana, G., Shakhnovich, B.E., Dokholyan, N.V., and Shakhnovich, E.I. 2004. Imprint of evolution on protein structures. Proc. Natl. Acad. Sci. 101: 2846-2851.
Vitkup, D., Melamud, E., Moult, J., and Sander, C. 2001. Completeness in structural genomics. Nat. Struct. Biol. 8: 559-566.
Westbrook, J., Feng, Z., Chen, L., Yang, H., and Berman, H.M. 2003. The Protein Data Bank and structural genomics. Nucleic Acids Res. 31: 489-491.
Wolynes, P.G. 1996. Symmetry and the energy landscapes of biomolecules. Proc. Natl. Acad. Sci. 93: 14249-14255.
Yanai, I., Camacho, C.J., and DeLisi, C. 2000. Predictions of gene family distributions in microbial genomes: Evolution by gene duplication and modification. Phys. Rev. Lett. 85: 2641-2644.
Received August 10, 2004; accepted in revised format November 23, 2004.