|
|
|
|
Published online before print
May 12, 2004, 10.1101/gr.2079204 Genome Res. 14:1160-1169, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Methods Visualizing Sequence Similarity of Protein Families owski1Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
Classification of proteins into families is one of the main goals of functional analysis. Proteins are usually assigned to a family on the basis of the presence of family-specific patterns, domains, or structural elements. Whereas proteins belonging to the same family are generally similar to each other, the extent of similarity varies widely across families. Some families are characterized by short, well-defined motifs, whereas others contain longer, less-specific motifs. We present a simple method for visualizing such differences. We applied our method to the Arabidopsis thaliana families listed at The Arabidopsis Information Resource (TAIR) Web site and for 76% of the nontrivial families (families with more than one member), our method identifies simple similarity measures that are necessary and sufficient to cluster members of the family together. Our visualization method can be used as part of an annotation pipeline to identify potentially incorrectly defined families. We also describe how our method can be extended to identify novel families and to assign unclassified proteins into known families.
Genome projects (Bernal et al. 2001
The simplest methods for clustering proteins into families rely on sequence-similarity measures, such as those obtained by BLAST (Altschul et al. 1990
Similarity-based clustering is a two-step processone first needs to determine pairwise similarities between all pairs of proteins and then apply a clustering method that uses the similarity matrix to group proteins into clusters. However, methods that quantify similarity by using some attribute of the best BLAST hit and use single-linkage clustering are not always successful. One problem such methods face is the detection of the multidomain structure of many protein families. Ideally, proteins should be classified into a single family only if they exhibit highly similar domain architecture. Best hit-based approaches may group together different multidomain proteins that share a common domain (Smith and Zhang 1997
In this study, we test our methods on the protein families of Arabidopsis thaliana. The Arabidopsis thaliana genome was fully sequenced in 2000 (Arabidopsis Genome Initiative 2000 In this work, we study whether Arabidopsis thaliana families constructed by such diverse methods can be characterized by a small set of biologically meaningful parameters. In other words, we do not attempt to discover families ab initio; rather, we show that most discovered families can be described by one or two parameters. We consider two different parameter schemes. In the first scheme, similarity between two proteins is measured in terms of the fraction of the proteins participating in a gapped alignment (cover) and the percentage identity of such an alignment. We also analyze a second scheme in which similarity is measured in terms of relative score, that is, the ratio of the score of the alignment to the self-similarity score (score of a protein with itself). In either scheme, we say that a family is clusterable if carrying out single-linkage clustering with some particular threshold value for the parameter(s) groups members of that family into a single cluster. Carrying out the clustering operation with a lower threshold usually results in the cluster becoming corrupted by members of other families, whereas raising the threshold may split the family across multiple clusters. We describe a novel method for visualizing the variation in clusterability with choice of parameters. Our method identifies the parameter values that best characterize a family, and thereby provides ready answers to questions of the form "How similar are members of family X?" One result of our work is the discovery that, despite the wide variety of methods used in the construction of protein families, 76% of all analyzed Arabidopsis thaliana families are fully clusterable by the proposed simple parameter schemes. Our results, available online at http://warta.bio.psu.edu/htt_doc/ArabCluster, also show relationships between families that share members, and help identify potentially incorrect family assignments. We also show how our results could be used to identify novel families and assign unclassified proteins to known families.
Constructing Matches From Hits Let A be the set of all protein sequences. We compare the proteins of A against each other by running BLASTp with e-value 0.0001. The result is a set of hits, in which each hit is a local alignment that aligns a region of one protein sequence with a region from another protein sequence with a particular score. By parsing the BLAST output, we can define, for each hit, location attributes that specify which regions of the proteins are participating in the local alignment and quality attributes that indicate how good the hit is. More formally, a hit h that aligns region [x1, x2] of protein x with region [y1, y2] of protein y has the following location attributes:
and the following quality attributes:
We term the hit that aligns the entire length of a protein sequence p against itself as a self-hit and use the notation self-score(p) to refer to the score of such a hit. On the basis of these self scores, we can define two relative score (quality) attributes for any hit h involving distinct proteins x, y:
A set of hits H between a pair of proteins x, y is compatible if all pairs of hits in H are compatible by the above definition. Such a compatible set of hits can be grouped into a match, m. A match has the same quality attributes as a hit. Percentage identity is computed by taking the weighted percentage identity across the hits in H, that is,
Clustering
We represent the similarity relationships in our protein data set by an undirected weighted graph, G. The nodes of G correspond to the set of all proteins A, and edges connect proteins x, y if, and only if, there is some hit with x as the query and y as the subject (or vice-versa). The weight of an edge represents the extent of similarity between the proteins connected by the edge. In the case of (i, c) clustering, the weight of the edge is given by a pairthe first element of the pair is the percentage identity of the best match between the proteins and the second element is the percentage of the proteins participating in the match (cover). More formally,
To observe the effect of using different percentage identity and cover thresholds on the formation of clusters, we carried out (i, c)-clustering 100 times by varying percentage identity i and percentage cover c independently in increments of 10, from 0 to 90. For a particular choice of (i, c), we first construct a restricted graph Gi,c from G by retaining only those edges with weight at least (i, c). We then identify clusters by computing connected components of Gi,c (see Fig. 2). It is easy to see that G0,0, which is identical to G, will be a dense graph that yields a few large clusters, and that G90,90 will be a relatively sparse graph that yields several small clusters.
Relative score-based clustering is carried out in a similar manner by varying the threshold r from 0 to 90, in increments of 10.
Measuring Cluster Quality Ideally, each family of F will correspond to a single cluster of Ci,c. However, the more likely scenario is that some families will be spread across several clusters, or that several families will be grouped into a single cluster. Intuitively, we would consider the clustering parameters (i, c) to be "good" with respect to a family F if
Note that these two measures are orthogonalif all of the classified proteins P are placed in a single cluster, then concentration is high, but purity is low. On the other hand, if each protein of P is placed in an individual cluster of size 1, then purity is high, but concentration is low. Concentration and purity reflect the sensitivity and specificity, respectively, of the clustering with respect to the family under consideration. Another method for measuring clustering quality that attempts to combine concentration, purity is matching rate (Kawaji et al. 2001
We measure the concentration, purity, matching rate of family F in a particular cluster C
In other words, concentration measures the fraction of the family present in the cluster, whereas purity corresponds to the fraction of the cluster that belongs to the family. It is easy to see that the matching rate measure, which combines these two measures, satisfies the condition match_rate(F,C) We now extend these definitions to a set of clusters as:
When measuring quality in terms of concentration and purity, we say that a family F is (x, y) clusterable by parameters (i, c) if concentration(F,Ci,c)
In the example shown in Figure 2, the proteins belong to two familiesthe B family with five members is shown in black, and the W family with three members is shown in white. The computation of concentration, purity, and matching rate for the two families is summarized in the table below:
Although (30, 20) may not be the right clustering parameters for families B, W, this does not mean that the families are not clusterable. In fact, family B is (100, 100) clusterable by parameters (0, 50) and family W is (100, 100) clusterable by parameters (60, 0).
Displaying Clustering Quality
The clustering quality of family B, which consists of the black nodes from Figure 2, is shown on the left hand side of Figure 3. In the top left corner, where i = 0, c = 0, all members of the B family are in the same cluster (high concentration or green), but the cluster also contains all members of the W family (low purity or red). This leads to a strong green color. At the opposite end of the picture, each member of the B family is in its own trivial cluster of size 1 (high purity, low concentration), leading to the red color. As indicated by the calculations shown in the table, the grid element corresponding to i = 30, c = 20 is filled with a color that is 40% green and 70% red, resulting in a slightly reddish color. Also note that because the B family is fully clusterable by parameters (0, 50), the grid element at that location is 100% red, 100% green, that is, yellow. A small blue dot is used to indicate such perfect concentration, purity.
The results for the MDR family of proteins (Sanchez-Fernandez et al. 2001
Notes on Clusterability We classify nontrivial families into several categories on the basis of the extent of shared family members. The categories can be described without ambiguity in set theoretic terms; however, we choose to illustrate them with the help of Figure 4 due to space constraints.
In reality, the picture can be more complicated, as a family can fall into more than one category, for example, a superset family can itself be a subset or intersected family, etc. However, even with this simple picture, one can see that our expectations regarding the clusterability of a family vary with the category in which the family falls. For instance, we would expect family A to be more clusterable than the other families.
The complete set of 28,581 Arabidopsis thaliana protein sequences from TIGR formed the set A. Gene family information downloaded from http://www.arabidopsis.org on July 28, 2003 helped us classify 4241 of these proteins into 571 families. A total of 119 families are trivial and 345 are atomic. The classification of the remaining 107 families is shown in Figure 5.
The entire set of proteins A was compared against itself using BLASTp with a e-value threshold of 0.0001. The distribution of the resulting 2,254,453 hits is shown in Figure 6. A total of 8.6% of proteins participate in no hits at all, whereas 1.3% participate in more than 1000 hits. A total of 19 nontrivial families defined by experts contain proteins that have no hits to any other proteinsclearly these families will not be (100, 100) clusterable for any choice of clustering parameters.
In 76% of the cases, there is exactly one hit between a pair of proteins, so the best match is identical to this hit. In the other cases, where there are multiple hitsdue to repeated motifs or conserved domains separated by a distancewe compute the compatible set of hits with the maximum score and create the best match. Clusters were determined using single linkage clustering. Graph G0,0, in which no edges are discarded, contains 238 connected components (clusters), whereas G90, 90, in which all edges with percentage identity and cover less than 90 are removed, yields 3961 clusters. Finally, unclassified proteins were removed from the computed clusters, and the clustering quality for each family was computed for all choices of clustering parameters. Overall, 86% of atomic families are at least (90, 90) clusterable for some choice of clustering parameters, whereas only 64% of nonatomic families are similarly clusterable. The variation of clusterability, with family size and classification is shown in Figure 7. The results for r clustering are almost as good (within 2%).
VCQ pictures similar to Figure 3 for each family and superfamily are available at http://warta.bio.psu.edu/htt_doc/ArabCluster. All of the pictures and the Web pages are constructed on-demand by perl scripts querying a MySQL database that stores the necessary information.
Match as Unit of Similarity In this study, we use single-linkage clustering as the mechanism for grouping similar proteins. The potential drawbacks of using single-linkage clustering have been documented in several papers that propose more sophisticated clustering methods. However, our goal in this study was not to discover families, but rather to characterize existing families by meaningful attributes such as identity, cover, and relative score. We avoided the use of biologically unmeaningful parameters such as inflation value (Enright et al. 2002
In an effort to overcome some of the problems associated with using single-linkage clustering for grouping members of multidomain families, we use the notion of a match that can be thought of as a form of gapped alignment composed of possibly multiple BLAST hits. Note that the concept of a match is not novelit has been used implicitly by programs such as Sim4 (Florea et al. 1998
Figure 8 shows an instance where our usage of match as the basic unit of similarity helps distinguish members of two different families in the ABC superfamily (Sanchez-Fernandez et al. 2001
Overall, only 2% of the matches computed are composed of multiple hits. One reason for this unexpectedly small number could be that our criteria for hits to be compatible is too stringentwe require hits not to overlap at all. It is possible that allowing for small overlaps between hitsas is done in XDOM (Gouzy et al. 1997
Usefulness of VCQ Pictures VCQ pictures (like Fig. 3), can give a rough idea of the nature and extent of conserved domains in a family. Families with small, unique domains are clusterable by a high identity, low-cover threshold that is visible as a yellow region in the top right-hand side of the (i, c)-clustering VCQ picture, whereas multidomain families are likely to be clusterable by low-identity, high-cover thresholds. One can also use the pictures to identify families that have been defined too broadly (concentration is unusually low, even at low thresholds), or too narrowly (purity is unusually low, even at high thresholds). Note that the VCQ picture of a family may change as more proteins are classified and novel families are created. However, updating the pictures is fairly simple, as the time-consuming steps of measuring similarities and carrying out the clustering with different thresholds are independent of the classification of proteins into families. When family definitions are added or modified, we simply have to filter the precomputed clusters to discard unclassified proteins and remeasure the quality.
Comparison of Clustering Schemes
The second clustering scheme uses relative score as a measure of similarity. Relative score-based clustering is computationally simpler, as it needs to be carried out only 10 times as opposed to 100 times for (i, c)-clustering. The results shown in Table 1 indicate that it is almost as effective as (i, c)-clustering. However, as high-identity, low-cover matches and low-identity, high-cover matches can have the same relative score, it is harder to gain an understanding regarding the nature of similarity within a family by viewing the relative score-based clustering quality picture. Analogous to Figure 9, we show in Figure 10 the number of families that are clusterable at different relative score levels.
Factors Affecting Clusterability As can be inferred from the results presented in the previous section, small families have a higher chance of being clusterable. However, equally important is the type of the familyatomic families are much more likely to be clusterable than subset, superset, or intersected families. One should also keep in mind that the same family is sometimes independently listed by several groups. For instance, the PDR family appears three timesas a member of the ABC superfamily (Sanchez-Fernandez et al. 2001 We now list some of the reasons why an atomic family may not be clusterable in our analysis:
The one nonbiological parameter that affects our results slightly is the e-value that was chosen for the initial BLAST run. All of the results described in this study were the result of running BLAST with an e-value threshold 0.0001. This somewhat stringent e-value is responsible for some low-similarity families not being clusterable. When we repeated our analysis with e-value set to 1, the number of no-trival families that had proteins with 0 hits reduced from 19 to 7. This resulted in a small increase in the overall number of families that were clusterable.
Identifying New Families The only fact we can be sure of is that clusters that form at higher thresholds are purer than those that form at lower thresholds. For instance, consider Figure 12, which shows the distributions of the number of clusters (of size at least 5) with respect to relative score threshold. For the purpose of this figure, each cluster was classified into one of four categories:
The negligible number of clusters of type T3, when relative score threshold 50 (or greater) is used, indicates that, at this level, almost all clusters are likely to be pure. Thus, one can choose a cluster of type T4, align its member sequences, detect conserved blocks in the multiple alignment, and construct a new family by identifying all unclassified proteins that contain the blocks. Whereas T4 clusters formed with relative score threshold 90 are also going to be pure, they are not appropriate seeds for the discovery of new families, as the sequences in those clusters are likely to be almost identical, making it impossible to extract functionally relevant blocks from the alignment. In many cases, one can also predict the family of unclassified members of clusters of type T2 on the basis of the classified members. However, any such predictions or new family definitions need to be followed with more comprehensive work to identify the functional role of the conserved regions. One should also note that the relative score threshold of 50 may not be appropriate in the case of other genomesonly after a significant number of protein families are defined, can we calibrate a suitable threshold that can aid in the detection of the remaining families.
Applicability to Other Species Data
Conclusion
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2079204. Article published online before print in May 2004.
1 Corresponding author.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403410.[CrossRef][Medline] Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796815.[CrossRef][Medline] Azevedo, C., Santos-Rosa, M.J., and Shirasu, K. 2001. The U-box protein family in plants. Trends Plant Sci. 6: 354358.[CrossRef][Medline]
Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M., and Sonnhammer, E.L. 2002. The Pfam protein families database. Nucleic Acids Res. 30: 276280.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235242.
Bernal, A., Ear, U., and Kyrpides, N. 2001. Genomes OnLine Database (GOLD): A monitor of genome projects world-wide. Nucleic Acids Res. 29: 126127. Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., and Yuan, Y. 1998. Predicting function: From genes to genomes and back. J. Mol. Biol. 283: 707725.[CrossRef][Medline] Devoto, A., Hartmann, H.A., Piffanelli, P., Elliott, C., Simmons, C., Taramino, G., Goh, C.S., Cohen, F.E., Emerson, B.C., Schulze-Lefert, P., et al. 2003. Molecular phylogeny and evolution of the plant-specific seven-transmembrane MLO family. J. Mol. Evol. 56: 7788.[CrossRef][Medline] Doolittle, R.F. 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64: 287314.[CrossRef][Medline]
Duret, L., Mouchiroud, D., and Gouy, M. 1994. HOVERGEN, database of homologous vertebrate genes. Nucleic Acids. Res. 22: 23602365. Eisenberg, D., Marcotte, E.M., Xenarios, I., and Yeates, T.O. 2000. Protein function in the post-genomic era. Nature 405: 823826.[CrossRef][Medline]
Enright, A.J. and Ouzounis, C.A. 2000. Generage: A robust algorithm for sequence clustering and domain detection. Bioinformatics 16: 451457.
Enright, A.J., Van Dongen, S., and Ouzounis, C.A. 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30: 15751584.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. 1998. A computer program for aligning a CDNA sequence with a genomic DNA sequence. Genome Res. 8: 967974.
Geer, L.Y., Domrachev, M., Lipman, D.J., and Bryant, S.H. 2002. CDART: Protein homology by domain architecture. Genome Res. 12: 16191623.
Gouzy, J., Eugene, P., Greene, E.A., Kahn, D., and Corpet, F. 1997. XDOM, a graphical tool to analyze domain arrangements in any set of protein sequences. Comput. Appl. Biosci. 13: 601608. Heger, A. and Holm, L. 2000. Towards a covering set of protein family profiles. Prog. Biophys. Mol. Biol. 73: 321337.[CrossRef][Medline] Hegyi, H. and Gerstein, M. 1999. The relationship between protein structure and function: A comprehensive survey with application to the yeast genome. J. Mol. Biol. 288: 147164.[CrossRef][Medline]
Holm, L. and Sander, S. 1996. Mapping the protein universe. Science 273: 595602.
Hwang, I., Chen, H.C., and Sheen, J. 2002. Two-component signal transduction pathways in Arabidopsis. Plant Physiol. 129: 500515. Kawaji, H., Yamaguchi, Y., Matsuda, H., and Hashimoto, A. 2001. A graph-based clustering method for a large set of sequences using a graph partitioning algorithm. Genome Inform. Ser. Workshop Genome Inform. 12: 93102.[Medline]
Li, L., Tutone, A.F., Drummond, R.S., Gardner, R.C., and Luan, S. 2001. A novel family of magnesium transport genes in Arabidopsis. Plant Cell 13: 27612775.
Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O., and Eisenberg, D. 1999. Detecting protein function and proteinprotein interactions from genome sequences. Science 285: 751753. Matsuda, H., Ishihara, T., and Hashimoto, A. 1999. Classifying molecular sequences using a linkage graph with their pairwise similarities. Theor. Comput. Sci. 210: 305325.[CrossRef] Metz, A.M., Timmer, R.T., and Browning, K.S. 1992. Sequences for two cDNAs encoding Arabidopsis thaliana eukaryotic protein synthesis initiation factor 4A. Gene 120: 313314.[CrossRef][Medline] Michalski, R.S., Bratko, I., and Kubat, M. 1998. Machine learning and data mining. Wiley, New York.
Mott, R. 1997. Estgenome: A program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci. 13: 477478.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P., et al. 2003. The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res. 31: 315318.
Perriere, G., Duret, L., and Gouy, M. 2000. HOBACGEN: Database system for comparative genomic in bacteria. Genome Res. 10: 379385.
Rhee, S.Y., Beavis, W., Berardini, T.Z., Chen, G., Dixon, D., Doyle, A., Garcia-Hernandez, M., Huala, E., Lander, G., Montoya, M., et al. 2003. The Arabidopsis Information Resource (TAIR): A model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 31: 224228.
Sanchez-Fernandez, R., Davies, T.G., Coleman, J.O., and Rea, P.A. 2001. The Arabidopsis thaliana ABC protein superfamily, a complete inventory. J. Biol. Chem. 276: 3023130244.
Servant, F., Bru, C., Carrere, S., Courcelle, E., Gouzy, J., Peyruc, D., and Kahn, D. 2002. Prodom: Automated clustering of homologous domains. Brief Bioinform. 3: 246251. Smith, T.F. and Zhang, X. 1997. The challenges of genome sequence annotation or "the devil is in the details". Nat. Biotechnol. 15: 12221223.[CrossRef][Medline] Tsoka, S. and Ouzounis, C.A. 2000. Recent developments and future directions in computational genomics. FEBS Lett. 480: 4248.[CrossRef][Medline] van den Brule, S. and Smart, C.C. 2002. The plant PDR family of ABC transporters. Planta 216: 95106.[CrossRef][Medline]
Vandepoele, K., Raes, J., De Veylder, L., Rouze, P., Rombauts, S., and Inze, D. 2002. Genome-wide analysis of core cell cycle genes in Arabidopsis. Plant Cell 14: 903916. Veeramachaneni, V. 2002. "Aligning fragmented sequences." Ph.D. thesis, The Pennsylvania State University, University Park, PA.
Wheelan, S.J., Church, D.M., and Ostell, J.M. 2001. Spidey: A tool for mRNAto-genomic alignments. Genome Res. 11: 19521957.
Wolf, Y.I., Brenner, S.E., Bash, P.A., and Koonin, E.V. 1999. Distribution of protein folds in the three superkingdoms of life. Genome Res. 9: 1726.
Zhang, H. 2003. Alignment of BLAST high-scoring segment pairs based on the longest increasing subsequence algorithm. Bioinformatics 19: 13911396.
http://www.arabidopsis.org/; The Arabidopsis Information Resource (TAIR). http://warta.bio.psu.edu/htt_doc/ArabCluster; Arabidopsis families similarity pictures.
Received October 16, 2003; accepted in revised format February 10, 2004.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||