Vol. 12, Issue 11, 1625-1641, November 2002
Structural Characterization of the Human Proteome
1 Biomolecular Modelling Laboratory, Cancer Research UK, London, United Kingdom; 2 Department of Biological Sciences, Structural Bioinformatics Group, Imperial College of Science, Technology and Medicine, South Kensington, London, United Kingdom
This paper reports an analysis of the encoded proteins (the proteome) of the genomes of human, fly, worm, yeast, and representatives of bacteria and archaea in terms of the three-dimensional structures of their globular domains together with a general sequence-based study. We show that 39% of the human proteome can be assigned to known structures. We estimate that for 77% of the proteome, there is some functional annotation, but only 26% of the proteome can be assigned to standard sequence motifs that characterize function. Of the human protein sequences, 13% are transmembrane proteins, but only 3% of the residues in the proteome form membrane-spanning regions. There are substantial differences in the composition of globular domains of transmembrane proteins between the proteomes we have analyzed. Commonly occurring structural superfamilies are identified within the proteome. The frequencies of these superfamilies enable us to estimate that 98% of the human proteome evolved by domain duplication, with four of the 10 most duplicated superfamilies specific for multicellular organisms. The zinc-finger superfamily is massively duplicated in human compared to fly and worm, and occurrence of domains in repeats is more common in metazoa than in single cellular organisms. Structural superfamilies over- and underrepresented in human disease genes have been identified. Data and results can be downloaded and analyzed via web-based applications at http://www.sbg.bio.ic.ac.uk. [Supplemental material is available online at http://www.genome.org.]
The interpretation and exploitation of the wealth of biological
knowledge that can be derived from the human genome
(Lander et al. 2001
A widely used first step in a bioinformatics-based functional
annotation is to identify known sequence motifs and domains from
manually curated databases such as PFAM/INTERPRO (Bateman et al. 2000
A powerful source of additional information is available when the
three-dimensional coordinates of the protein are known. The structure
often provides information about the residues forming ligand-binding
regions that can assist in evaluating the function and specificity of a
protein. For example, recently we have shown that spatial clustering of
invariant residues can assist in assessing the validity of function
transfer in this twilight zone (Aloy et al. 2001
A valuable tool in exploiting three-dimensional information is the
databases of protein structure in which domains with similar three-dimensional architecture are grouped together. Here, we use the
structural classification of proteins (SCOP) (Conte et al. 2000
The above considerations have led us to focus our analysis on the following three objectives: (i) to estimate the extent to which the known proteomes can be annotated in terms of structure and function and how reliable we consider these annotations to be; (ii) to place the occurrence of particular SCOP structural superfamilies in terms of their biological and species-specific contexts; and (iii) to derive evolutionary insights from frequency-based analysis of homologous SCOP domains.
Strategy for Structural and Functional Annotations
Protein sequences from the human genome and from 13 other species
were analyzed (for details, see Methods). The main strategy was to use
the sensitive protein sequence similarity search program PSI-BLAST
(Altschul et al. 1997
Functional annotation was considered as sequence matches to the PFAM domain library (excluding families of unknown function), which is now part of the more general INTERPRO database. In the absence of a match to these characterized motifs/domains, we need to evaluate functional annotation via transfer from homology. To represent this approach computationally, we simply counted a region as functionally annotated if a homolog contains some textual description of function (see legend to Fig. 1A). Thus, the functionally annotated part of the proteome comprises the sections that are assigned to a PFAM domain or, where there is no PFAM assignment, the sections that are homologous to a protein with a textual functional description.
Status of Structural and Functional Annotations
Figure 1A shows the structural annotation status of the proteomes expressed as the fraction of the total residues in each proteome. We use the residue fraction to include situations when only part of a protein sequence is annotated, as one cannot quantify this as a fraction of domains because one does not know the number of domains in unannotated regions. Thirty-nine percent of the human proteome can be structurally annotated from either having a known protein structure or via a PSI-BLAST detectable homology to a known structure. This percentage is higher than that for yeast, fly, and worm and is comparable to the coverage of many bacteria and archaea. A further 38% of the human proteome falls into the category of functional annotation without structure. Because nearly every protein structure has some functional annotation, the total functional annotation of the human proteome is 77%. The remainder is (i) homologous to another protein of unknown function, (ii) an orphan region without any detectable homology, or (iii) an unannotated, nonglobular region (a region of low complexity, coiled-coil, or a transmembrane segment).
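The residue-based bookkeeping described above can be illustrated with a minimal sketch. It assumes each protein is given as a length plus lists of annotated residue ranges; the data layout, the function names, and the toy numbers are hypothetical and are not taken from the paper's pipeline. Structural annotation is treated as implying functional annotation, as in the text.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open residue range [start, end)


def covered(intervals: List[Interval]) -> int:
    """Number of residues covered by a set of possibly overlapping ranges."""
    total, last_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, last_end)  # skip residues already counted
        if end > start:
            total += end - start
            last_end = end
    return total


def annotation_fractions(proteins: List[Dict]) -> Tuple[float, float, float]:
    """Return (structural, function-without-structure, total) residue fractions.

    Each protein is a dict with hypothetical keys 'length', 'structure',
    and 'function' (the last two are lists of residue ranges).
    """
    all_residues = sum(p["length"] for p in proteins)
    structural = sum(covered(p["structure"]) for p in proteins)
    # Residues with either kind of annotation, counted once.
    any_annotation = sum(covered(p["structure"] + p["function"]) for p in proteins)
    return (
        structural / all_residues,
        (any_annotation - structural) / all_residues,
        any_annotation / all_residues,
    )


# Toy proteome: one partly annotated protein and one orphan.
toy = [
    {"length": 100, "structure": [(0, 40)], "function": [(30, 80)]},
    {"length": 100, "structure": [], "function": []},
]
print(annotation_fractions(toy))
```

Counting residues rather than domains sidesteps the unknown number of domains in unannotated regions, which is the reason the text gives for using the residue fraction.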
We also consider how many protein sequences can be fully annotated. To allow for gaps, we require that >95% of a particular sequence be covered without gaps of >30 residues (Fig. 1B). The fraction of the human protein sequences that are fully annotated in terms of structure is only 15%. A further 14% of the human protein sequences are fully annotated in terms of function but not structure. The fraction of fully covered annotated sequences for human is much higher than for worm, fly, and yeast. Another 8% of the human sequences are fully covered by hypothetical sequences or sequences of unknown function.
The accuracy of the above analysis is dependent on the quality of the
gene prediction. For the eukaryotic genomes analyzed, particularly for
the human proteome, this is problematic and it is anticipated that
several new genes will be identified and some present assignments
modified. The human proteome we analyzed is based on gene predictions
that are confirmed by matches to expressed sequence tags (ESTs) or
homologs in other species (see http://www.ensembl.org). This use of
homology would contribute to the high level of structural and
functional annotation and, if additional genes were identified, the
values for coverage probably would be somewhat lower. We can obtain an
upper estimate of the magnitude of this problem by noting that the
human genome has 6% by residue of orphans. In worm, this figure is
17%, and it is considered that most genes have been identified in this
genome (Reboul et al. 2001
However, even for prokaryotes, errors in gene prediction can affect our
survey. For example, the proteome of the archaeon Aeropyrum pernix contains the largest fraction of orphan regions. This result may arise because the gene prediction in A. pernix produced many very short, questionable open reading frames (ORFs) (Skovgaard et al. 2001
Reliability of Annotation
The reliability of the structural annotation from homology
model-building depends on the level of sequence identity between the
protein of known structure and the protein of unknown structure for which one
wants to build a model (Sanchez and Sali 1998
Figure 2B provides an assessment of the reliability of functional
annotation. We consider that a match to a PFAM domain (excluding domains of unknown function) constitutes a reliable functional annotation. For the human proteome, 26% of the residues can be assigned to PFAM domains; this includes 19% for which we have a
structural assignment, which often will assist in functional annotation. Next, we identify those proteins where the closest homolog
that has a text functional description (see legend to Fig. 1A) shares
at least 30% sequence identity. This cutoff was chosen because studies
have shown that below this value, homologs often have diverged to
radically different functions (Devos and Valencia 2000
SCOP Superfamilies
Table 1 reports the commonly occurring
SCOP superfamilies in human, fly, worm, yeast, and average values for
archaea and bacteria. Complete tables can be downloaded from our web
site, http://www.sbg.bio.ic.ac.uk.
First, we consider the commonly occurring superfamilies in the human proteome. The most common domain in human is the C2H2 classic zinc finger, which occurs four times more often than the next most common domain, the immunoglobulin. The P-loop SCOP superfamily involved in nucleotide triphosphate hydrolysis is the fourth most common in human and second in fly, but the most common in the other analyzed proteomes. In general, the commonly occurring superfamilies in the human proteome reflect the eukaryotic and multicellular organization. Commonly observed superfamilies that are involved in or form part of cell-surface receptors, protein-protein or cell-cell interaction, signaling, or cytoskeleton structure include immunoglobulin, EGF/laminin, fibronectin, cadherin, protein kinase, homeo-domain, tetratricopeptide repeat, spectrin repeat, PH-domain, and SH3-domain.
In general, the fly and worm have similar ranking of the common
superfamilies to those in human, reflecting the multicellular organization. There are, however, some differences. The C-type lectins
are at rank 26 with 149 domains in human but at rank five with 310 domains in worm. C-type lectins have a wide spectrum of functions
associated with carbohydrate binding and occur in both membrane-bound and
soluble forms. The high occurrence of C-type lectins has previously been
noted but not explained by Koonin and coworkers (Koonin et al. 2000
There are, however, major differences in rank order for the single-celled organisms. Several of the superfamilies in Table 1 have similar ranks in human, fly, and worm, whereas the rank in yeast often differs markedly (e.g., the immunoglobulin). Domains of superfamilies found in cell-cell interaction proteins and cell-surface proteins such as the fibronectin and cadherin are not found or only occur infrequently in the proteomes of the single-celled organisms. In bacteria, and especially in archaea, the top ranks are mainly occupied by superfamilies associated with enzymes. The most common DNA-binding domain in bacteria and archaea is the winged helix-turn-helix motif.
The abundance of several superfamilies in metazoans that are absent or
have relatively low domain frequencies in yeast leads us to conclusions
different than those recently published for the
Schizosaccharomyces pombe genome (Wood et al. 2002
Proteins forming a particular SCOP superfamily are identified on the
basis of both their similar structure and function. In contrast, PFAM,
INTERPRO, and PANTHER are primarily sequence- and function-based
families. Because homologies can be recognized from structural
conservation that are undetectable by sequence-based methods, one SCOP
superfamily can include several PFAM, INTERPRO, or PANTHER families. In
addition, SCOP is a structural domain database whereas PFAM identifies
a single sequence motif that can be repeated to form a structural
domain. For example, PFAM describes each of the
Our results are in broad agreement with similar analyses by others
(Frishman et al. 2001
Our superfamily rankings are more different than those in PartsList
(Qian et al. 2001b
Wolf et al. (1999)
SCOP Superfamilies Specific for Phylogenetic Branches
Table 2 presents SCOP superfamilies that
occur in only one species or set of related species but not in any of
the other organisms analyzed. In addition, each member of the
superfamily was run against the nonredundant sequence database using
PSI-BLAST (with the parameters described in the method) to identify
other species not included in those from the genomes analyzed here. In
Table 2, we exclude any superfamily that occurs less than four times in
a particular branch (human, fly, worm, yeast, bacteria, archaea) to
prevent erroneous inferences because of the inherent difficulties of
automated annotation. This information identifies biological function
specific for one branch of life.
Human Branch
The three most frequent domains are implicated in immunity, in particular the major histocompatibility complex (MHC) antigen-recognition domain, interleukin 8-like chemokines, and the 4-helical cytokines. Analysis of our results that include the complete sequence database showed that in addition to mammals the interleukin 8-like superfamily is also found in sequences from birds and fish, and the MHC antigen-recognition domain is also found in amphibia. Several of the other domains specific to the mammalian branch are also involved in immunity: the MHC class II-associated invariant chain ectoplasmic
trimerization domain and p8-MTCP1 (mature T-cell proliferation). The
mammalian defensin is involved in defense against a wide range of
microorganisms, whereas the defensin-like superfamily is also found as
neurotoxin in some Cnidaria such as anemones. At third in frequency in
the human branch is serum albumin that is a major protein component of
blood. Many of these superfamilies potentially specific for human
rather than the other species for which we have annotated the genomes
were also found in viruses, amphibia, reptiles, fish, and birds when
considering the complete sequence database. These include the following
frequently occurring domain families: colipase-like for enzyme
regulation (particularly required by pancreatic lipases) and involved
in development; RNaseA-like with different ribonucleases involved in
endonuclease function in pancreas, blood (eosinophil granules), and in
angiogenesis; the PKD domain, which is possibly involved in
extracellular protein-protein interaction. The RNaseA-like superfamily was also
found in Aspergillus.
Fly
Insect pheromone/odorant-binding proteins are the most common SCOP superfamily (which occurs 26 times). The next most common are the scorpion toxin-like domains that occur as parts of the fungicide drosomycin, and the antibacterial defensin. Thus, the insect form of immunity/defense leads to a commonly occurring branch-specific SCOP superfamily. However, in addition to arthropods, the scorpion-like toxin and the antibacterial defensin are also found in plants.
Worm
Two superfamilies occur with a frequency of four (the osmotin, thaumatin-like proteins and the plant lectins/antimicrobial peptides). Both are involved in pathogen response. However, further examination of the complete sequence database showed that both SCOP superfamilies occur in plant genomes with close homologs.
Yeast (S. cerevisiae)
This branch is dominated by the Zn-Cys DNA-binding domain of transcription factors. This family is also found in the recently sequenced genome of the yeast S. pombe (Wood et al. 2002
Bacteria
Given the smaller size of bacterial genomes, we have pooled the superfamilies and their frequencies from the seven organisms we have annotated (i.e., the reported frequencies are the sums of domains in superfamilies from all seven bacterial proteomes, and not averages). Here, we discuss the higher ranking superfamilies. The most frequent domain is a transcription factor, the tetR/NARL DNA-binding domain (also found in some archaea and algae). This is followed by the dimerization domain of the AraC protein that is involved in the transcription regulation of that operon. Third is the superfamily of the DNA-bending protein. Other potentially specific superfamilies are involved in transport (especially the phosphate transferase system, possibly also present in fungi). There is one superfamily involved in the phosphate transferase system, the duplicated hybrid motif, which is also found in mouse (but not human) as has been previously noted (Nakamura et al. 1994
Archaea
There are only three species of archaea in our set of organisms, and we did not find any frequently occurring archaea-specific SCOP superfamilies. The general conclusion from this analysis is that three general classes of biological activity lead to commonly occurring branch-specific superfamilies. These functions are defense (e.g., immunity), transcriptional regulation, and hormone-related signaling.
Gene Duplication
The presence of multiple copies of any particular SCOP domain
within the proteome is the result of domain duplication and divergence
during evolution, both within and between proteins. The extent of this
duplication can be quantified:
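The expression itself did not survive in this copy of the text. As an illustration only, the sketch below uses a measure commonly applied in analyses of this kind, which is an assumption here and not necessarily the authors' exact formula: every domain beyond the first member of each superfamily is counted as the product of a duplication. The function name and the toy input are hypothetical.

```python
from collections import Counter
from typing import Iterable


def duplication_fraction(domain_superfamilies: Iterable[str]) -> float:
    """Fraction of domains that are duplicates under the assumed measure.

    Assumption (not confirmed by the text): one domain per superfamily is
    treated as the founder and all further copies as duplications, giving
    (domains - superfamilies) / domains.
    """
    counts = Counter(domain_superfamilies)
    n_domains = sum(counts.values())
    if n_domains == 0:
        raise ValueError("no domains supplied")
    n_superfamilies = len(counts)
    return (n_domains - n_superfamilies) / n_domains


# Toy example: six domains drawn from three superfamilies.
toy_domains = ["zinc finger", "zinc finger", "zinc finger",
               "immunoglobulin", "immunoglobulin", "protein kinase"]
print(duplication_fraction(toy_domains))  # 0.5
```

Under this measure the value approaches one whenever the number of domains is much larger than the number of superfamilies, which is why large proteomes tend to give very high estimates.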
This estimate of domain duplication relies on two assumptions. First is
that the duplication frequency of structurally characterized domains
(i.e., SCOP) is a representative sample of all proteins in the genomes.
This has been analyzed for proteins in the M. genitalium
genome by Teichmann et al. (1998)
The values for domain duplication are without a time scale and
substantial further work is required to estimate the extent of
duplications since divergence of the different phylogenetic branches.
Recently, Qian et al. (2001a)
These results can be contrasted to analysis of the top superfamilies in bacteria. Of the top 10, seven are expanded in bacteria between 150 and 350% relative to human (data not shown). The two superfamilies that are reduced in bacteria compared to human are the periplasmic binding protein-like II (extra-cellular receptor domains in human and mainly extra-cellular solute binding domains in bacteria) with 70% and the thiolase-like domain (84%). In human, we do not find any CheY-like transcription factors at all. Figure 4B shows the relative domain frequencies (number of observed domains in a superfamily normalized by the total number of domains in the proteome) of the top 10 human superfamilies for the processed genomes. The 5092 zinc-finger domains that were identified for human comprise more than 20% of the identified domains. Zinc-finger domains have an average length of just 27 residues, and together this corresponds to only 1.5% of the residues in the human proteome. Compared to the majority of the top 10 human superfamilies, the P-loop decreases its relative abundance from prokaryotes to human. Although the domain fraction comprised by P-loops is much lower than for the zinc-finger, because of its average length of 217 residues in human, the P-loop accounts for 2% of all residues. In yeast and worm, the protein kinase-like superfamily seems to have more importance than in fly and human. In addition the RNA-binding domain involved in a range of functions is more abundant in yeast than in the metazoan proteomes where this superfamily accounts for roughly the same fraction of domains. The worm proteome contains relatively more EGF/laminins compared to fly. In general, the relative abundance of the top 10 superfamilies in human, except for the zinc finger, is similar between the metazoan proteomes. 
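The normalization used for Figure 4B, and the contrast between domain share and residue share, can be checked with a short sketch. The zinc-finger numbers (5092 domains, mean length 27 residues) come from the text; the function names and the toy counts are hypothetical.

```python
from typing import Dict


def relative_domain_frequency(counts: Dict[str, int]) -> Dict[str, float]:
    """Domains per superfamily divided by all domains in the proteome."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no domains supplied")
    return {superfamily: n / total for superfamily, n in counts.items()}


def residues_in_superfamily(n_domains: int, mean_length: int) -> int:
    """Approximate residues occupied by a superfamily: count times mean length."""
    return n_domains * mean_length


# Toy domain counts (hypothetical, for illustration only).
toy_counts = {"zinc finger": 2, "immunoglobulin": 1, "P-loop": 1}
print(relative_domain_frequency(toy_counts))

# Figures quoted in the text for the human zinc finger.
print(residues_in_superfamily(5092, 27))  # 137484
```

If about 137,000 residues make up 1.5% of the proteome, the total is on the order of nine million residues; that total is inferred here from the quoted percentages and is not a number given in the text.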
Plotting the top 10 superfamilies for yeast shows a similar trend (data not shown); there are only slight changes in the relative domain abundance for most superfamilies between the eukaryotic proteomes. These results imply that in general, the most popular superfamilies in a particular proteome do not comprise a substantially different fraction of the domain repertoire in other proteomes. Given an increasing number of domains for larger proteomes, it may not be a change in relative domain abundance that leads to specialization.
In general, domains of superfamilies found at a high rank are often
found in repeats. Here, we define a repeat as at least two domains of
the same superfamily that are found within the same peptide sequence
irrespective of the sequence distance between these domains. Indeed,
the zinc finger is the most repeated domain in human. The average
number of repeats for the zinc finger is seven (maximum 36), four
(maximum 17), two (maximum five), and two (maximum five) per zinc
finger containing sequence for human, fly, worm, and yeast,
respectively. In fly and worm, the most repeated domain is the cadherin
with on average 12 repeats in fly and eight in worm. The most repeated
superfamily in yeast is the KH domain (probably involved in RNA
binding) with four repeats on average, and in prokaryotes this is the
thiolase-like superfamily (found in proteins of degradative pathways
such as fatty acid
Considering only the existence (and not the frequency) of a superfamily in a sequence, to exclude the effect of repeats, changes the order of the top-ranked superfamilies only slightly. The domain-based top 10 ranks in human are still present in the top 22 list that excludes repeats (except for the spectrin repeat at rank 43). The immunoglobulin, the EGF/laminin, and the fibronectin are still within the top 10 (data not shown). Figure 4C plots the average number of repeats within a protein for each of these 10 SCOP superfamilies in human. The most notable feature is that the fly has far more duplicated copies per protein for cadherins (cell surface) and spectrin repeats (cytoskeleton) compared to human. Both worm and fly have more repeated copies per protein of fibronectin and immunoglobulin than human. Overall, seven of the 10 superfamilies are repeated on average at least twice per sequence. The most abundant superfamilies in yeast and especially in bacteria are not as frequently found in repeats as the most popular superfamilies in metazoa (data not shown). In general, this implies that repetitiveness on the domain level may play an important role in the divergence of the metazoan branch from single-celled eukaryotes. As mentioned above, several of the popular superfamilies in human are associated with cell-surface functions such as cell adhesion, for which long proteins with regular structure may be required.
We also considered the number of different domain-domain associations
for the commonly occurring SCOP superfamilies. An association is taken
when two different SCOP superfamilies occur within the same sequence
(including self association). For a detailed analysis of pairs of
adjacent domains and their phylogenetic distributions, see Apic et al.
(2001)
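The repeat and association counts defined above can be sketched as follows. The input layout (a mapping from sequence identifier to the list of superfamilies assigned to its domains) and the function names are hypothetical. Treating a self association as present only when a superfamily occurs at least twice in one sequence is an interpretation of the phrase "including self association", not something the text states.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def copies_per_sequence(proteins: Dict[str, List[str]]) -> Dict[str, Tuple[float, int]]:
    """Mean and maximum copy number of each superfamily per containing sequence."""
    per_superfamily = defaultdict(list)
    for domains in proteins.values():
        for superfamily, n in Counter(domains).items():
            per_superfamily[superfamily].append(n)
    return {sf: (sum(c) / len(c), max(c)) for sf, c in per_superfamily.items()}


def partner_counts(proteins: Dict[str, List[str]]) -> Dict[str, int]:
    """Number of distinct superfamilies co-occurring with each superfamily."""
    partners = defaultdict(set)
    for domains in proteins.values():
        counts = Counter(domains)
        for superfamily, n in counts.items():
            partners[superfamily].update(o for o in counts if o != superfamily)
            if n >= 2:  # assumed meaning of "self association"
                partners[superfamily].add(superfamily)
    return {sf: len(p) for sf, p in partners.items()}


# Toy proteome (hypothetical assignments).
toy = {
    "p1": ["zinc finger", "zinc finger", "zinc finger", "protein kinase"],
    "p2": ["zinc finger", "SH3"],
    "p3": ["protein kinase"],
}
print(copies_per_sequence(toy)["zinc finger"])  # (2.0, 3)
print(partner_counts(toy)["zinc finger"])       # 3
```

Both functions ignore the sequence distance between domains, matching the definition of a repeat given in the text.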
Figure 5B shows the top 10 superfamilies in yeast. Only the tetratricopeptide repeat, a domain probably involved in a wide range of protein-protein interactions, expands its domain partner repertoire in a step from yeast and worm to fly and to human. The other superfamilies have similar frequencies in the three metazoans. Figure 5C shows that all the popular superfamilies in bacteria have markedly fewer cooccurrence partners in archaea, although seven of these superfamilies are also found in the top 10 superfamilies in archaea (data not shown). In worm, five of the popular bacterial superfamilies have an increased number of partners compared to yeast, fly, and human, possibly reflecting a closer phylogenetic relationship between worm and bacteria.
The plots in Figure 5 only show the number of different superfamily
partners. However, even if the number of partners is similar, the
actual frequencies and composition of these partnerships often show
great variation. Hegyi and Gerstein (2001)
In summary, our analysis suggests that for most superfamilies, as the organism increases in complexity, specialization and diversity do not arise from an increasing number of domain combinations, but rather from refinement and diversification of the superfamily repertoire itself and probably from changes in the repertoire of domain partners. The web site mentioned in Methods provides a link to an application that allows generic ranking of selected proteomes according to selected properties such as domain frequencies, superfamily partners, or domain repetitiveness of superfamilies. The results can be displayed as a table and as a plot similar to those shown in this paper.
SCOP Superfamilies in Disease Genes
The Online Mendelian Inheritance in Man (OMIM) database (Antonarakis
and McKusick 2000
This analysis directly associates SCOP superfamilies with disease and
nondisease genes. However, the cause of disease state could be the
result of one (or a combination) of effects not directly involving the
protein, for example alteration of regulation or deletion of the entire
gene. In addition, any point mutation or deletion within a protein may
not be within a particular SCOP domain. However, for many genes in
OMIM, the location of the alteration (e.g., point mutation) is not
known. Thus, to analyze the entire OMIM database, one can only perform
a high level view of the distribution of SCOP superfamilies between
disease and nondisease genes. A more focused analysis would consider
only those genes where the location of the alteration has been
identified (for a review of computational analysis of disease genes,
see Sreekumar et al. 2001
The overall frequencies of SCOP superfamilies in the two sets of genes
are significantly different at >99.9% confidence. Table 3 reports the SCOP superfamilies that are
significantly over- and underrepresented in the disease genes at >95%
confidence as confirmed by a
Superfamilies overrepresented in proteins of disease genes are mainly associated with regulation having biological functions in development, differentiation, and proliferation, and not being directly involved in metabolism. Overall, the overrepresented superfamilies can be put into the categories immune response, immune regulation, growth factors, and transcription factors. The main biological relevance of the underrepresented superfamilies may be summarized as transcription factors, protein-protein interaction domains involved in signaling and transcription (other than transcription factors), and translation. However, many of the superfamilies are involved in a wide range of biological functions and may be placed in more than one category, e.g., the interleukin 8-like chemokines are not only involved in immune response but also play a regulatory role during development.
The most overrepresented superfamilies (with a ratio >2) are biased
toward small, mainly extracellular single or two-domain messenger
proteins (interleukin, cystine-knot cytokines, and 4-helical cytokines), whereas three of the seven strongly underrepresented superfamilies (with a ratio
Taking the above observations together, the most overrepresented superfamilies in disease genes are those likely to have evolved within the metazoan branch of evolution and that are moderately expanded in human (average sequence rank of 65 of 463). The homeodomain-like and protein kinase-like superfamilies are just slightly but significantly underrepresented, and are found with high overall frequencies in both categories. These two superfamilies are associated with biological key functions in many regulatory pathways (see Table 3 for details). Our results suggest that it is, in general, unlikely to find abundant superfamilies with a massive bias toward disease proteins, possibly because the disruption of key functions may often be lethal. However, despite this general suggestion, we do not have any explanation why certain superfamilies are over- or underrepresented in disease genes. Our observations may encourage future work to formulate hypotheses that may lead to deeper insights into the relationship between disease and structural folds.
Transmembrane Proteins
Transmembrane regions in the proteomes were identified using the
hidden Markov approach implemented in TMHMM-2 (Sonnhammer et al. 1998
Figure 6B shows the ratio of residues in globular domains to residues in transmembrane regions for different membrane proteins, classified by the number of predicted membrane-spanning helices. The ratios are substantially different between species for proteins with one to three transmembrane regions and become more similar as the number of transmembrane regions increases. This shows that the full sequences of transmembrane proteins with only one to three membrane-spanning regions differ in length between the proteomes of the analyzed organisms, reflecting a higher number of potential globular domains, with the fly having longer protein sequences for transmembrane proteins than the other organisms. In bacteria and archaea, the ratio drops below one (i.e., the majority of the protein is membrane integral) at about six to seven membrane segments. In contrast, eukaryotes have the majority of the residues of the protein in potential globular domains, suggesting additional functionality such as protein-protein interaction or receptor capabilities of these membrane proteins.
Table 4 reports the frequencies of SCOP
superfamilies that occur in protein chains that span the membrane. We
focus on the globular domains associated with transmembrane proteins
and accordingly exclude completely membrane-integral proteins of the
analyzed proteomes and do not consider the SCOP class of membrane
proteins. The four superfamilies of highest rank are domains that can
be found in cell-surface proteins involved in cell-cell interaction and receptor molecules. In human, the most common SCOP domain associated with membrane-spanning chains is the immunoglobulin superfamily, whereas in fly and worm this superfamily is at rank four
and five, respectively. The cadherin is the most common SCOP superfamily in fly, and in worm the EGF/laminin is the most popular membrane-associated superfamily. The relative importance of
superfamilies involved in cell-cell interaction and cell-surface
proteins is also pointed out by the absence of these superfamilies in
yeast (also see Table 1). All eight immunoglobulin domains found in yeast are located in soluble, probably intracellular proteins (no
signal peptides could be found via prediction). In conclusion, the
results of the transmembrane analysis reflect the multicellular environment of human, fly, and worm, where specialized systems for
cell-cell communication and recognition are required in, for example,
tissue formation.
Table 4 also presents the fraction of the total domain frequency for each superfamily that is associated with membrane-spanning chains. Of the superfamilies with at least five domains in transmembrane proteins, only the MHC antigen-recognition domain and the periplasmic binding protein-like I have more than 80% of their representative domains in transmembrane proteins. Further down the list (bottom part of Table 4), several other superfamilies are found with more than 50% of their domains in transmembrane proteins. However, in worm we find all six representatives of the scavenger receptor cysteine-rich (SRCR) domain (found in membrane glycoproteins) and all spoIIa domains with five representatives (sulphate transporters) in membrane proteins. SCOP superfamilies that are frequently associated with transmembrane regions are also common in chains that do not span the membrane. This supports the view that domains are mobile elements that are not restricted to coevolve either always in association with a transmembrane section or always in a chain that does not span the membrane.
The top-ranking superfamilies in bacteria are different from those found in eukaryotes (data not shown in Table 4). These superfamilies are mainly associated with bacterial signaling (ATPase domain and homodimeric domain of signal transduction histidine kinase, PYP-like sensor domain, CheY-like) or with small molecule binding, probably as membrane-bound receptors or enzymes (P-loop containing nucleotide hydrolases, phosphatases/sulphatases, Rossmann-fold, nucleotide-diphospho-sugar transferases, FAD/NAD(P)-binding domain, metal-binding domain). In bacteria, we do not find any globular superfamily with more than two representatives (an average over the seven processed bacterial proteomes) that is exclusively found in membrane proteins.
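The membrane-associated fraction reported in Table 4 can be illustrated with a minimal sketch. It assumes each domain is available as a pair of superfamily name and a flag saying whether its chain has a predicted transmembrane segment; this layout, the function names, and the toy data are hypothetical. The thresholds (at least five membrane-associated domains, more than 80%) are the ones quoted in the text.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def membrane_fractions(domains: Iterable[Tuple[str, bool]]) -> Dict[str, Tuple[int, float]]:
    """Per superfamily: (domains in transmembrane chains, fraction of all its domains)."""
    total, in_membrane = Counter(), Counter()
    for superfamily, in_tm_chain in domains:
        total[superfamily] += 1
        if in_tm_chain:
            in_membrane[superfamily] += 1
    return {sf: (in_membrane[sf], in_membrane[sf] / total[sf]) for sf in total}


def mostly_membrane(domains: Iterable[Tuple[str, bool]],
                    min_tm_domains: int = 5,
                    threshold: float = 0.8) -> List[str]:
    """Superfamilies with enough membrane-associated domains and a high fraction."""
    fractions = membrane_fractions(domains)
    return sorted(sf for sf, (n_tm, frac) in fractions.items()
                  if n_tm >= min_tm_domains and frac > threshold)


# Toy data (hypothetical counts, not values from Table 4).
toy = ([("MHC antigen-recognition", True)] * 5 + [("MHC antigen-recognition", False)]
       + [("immunoglobulin", True)] * 6 + [("immunoglobulin", False)] * 6
       + [("SRCR", True)] * 2)
print(mostly_membrane(toy))  # ['MHC antigen-recognition']
```

The minimum-count filter keeps superfamilies with very few representatives, such as the toy SRCR entry, from appearing exclusive to membrane proteins on the strength of two or three domains.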
The list of the most common superfamilies found in transmembrane proteins in archaea is similar to that for bacteria, but the domain frequencies are much lower; e.g., the top-ranking superfamily is the P-loop, with only eight domains in the three archaeal proteomes. Figure 7 shows the frequencies of the overall top 10 human superfamilies (the same superfamilies as in Figure 4) with their number of domains in membrane proteins compared to the other processed proteomes (7A), and the same for the top-ranking bacterial superfamilies (7B, the P-loop is not shown). As expected, the immunoglobulin, cadherin, fibronectin, and EGF/laminin superfamilies are most expanded in human compared to fly and worm. Interestingly, the P-loop is found with very similar numbers in membrane proteins in all metazoan proteomes, in contrast to the overall expansion shown in Figure 4A. This suggests that, although there are more P-loops in human than in fly and worm, the additional duplications are associated with soluble proteins only.
The top-ranking superfamilies in bacteria (7B) are rarely associated with membrane proteins in prokaryotes and yeast, and this trend also remains across the metazoans for seven of the 10 superfamilies (we did not find any CheY-like domains in human). Little expansion is observed in total numbers for three superfamilies compared to the figure in human (7A). We find only one periplasmic binding protein-like II domain on average in membrane proteins in bacteria, and although the total number of domains in this superfamily is higher than for the other proteomes (data not shown), membrane association has only been expanded in metazoa. However, the periplasmic binding protein-like II is a diverse superfamily that contains at least 10 different PFAM families, and in bacteria there seem to be many soluble extracellular members of this superfamily (suggested by signal peptide prediction). Most of the metazoan domains of this superfamily are ligand-gated, ion-channel domains and receptor family ligand-binding domains, both found in membrane proteins. In yeast, four of the five domains of this superfamily are part of presumably intracellular soluble proteins involved in pyrimidine biosynthesis. The divergence of the periplasmic binding protein-like II superfamily to produce different functional families in bacteria and metazoa seems to be coupled to some extent with different sub-cellular location (soluble and membrane bound).
Conclusion
We have performed an integrated analysis of the human proteome and compared the results to those of other proteomes.
The key aspect of this study is the integration in the context of the different species of the following features: the extent and reliability of structural and functional annotations of the proteomes; the extent of domain duplication; change and expansion of the structural superfamily repertoire between different proteomes; the relationship between human disease genes and structural superfamilies; and the relationship between transmembrane proteins and their globular regions. The study included a structure-based analysis from which we were able to get evolutionary insights that could not be obtained from sequence-based methods alone. The structural analysis complements consideration of the extent of functional annotation. We assessed the role of structural knowledge in assisting functional annotation. These general bioinformatics analyses require simplifications and are also subject to errors in the predictive methods. In particular, we have had to employ a simplified strategy to estimate the extent to which there is some functional information derivable by homology. However, this reflects the standard practice in obtaining an initial suggestion of protein function in the absence of characterized motifs such as PFAM. Automated proteome annotation, particularly in eukaryotes, is complex and the exact numbers reported in our analysis will need to be refined as the bioinformatics tools improve and more experimental data becomes available. This study and related work by others (Koonin et al. 2000
Protein Sequences From Complete Genomes
Eukaryota: Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens. Bacteria: Mycobacterium tuberculosis, Escherichia coli, Bacillus subtilis, Mycoplasma genitalium, Helicobacter pylori, Aquifex aeolicus, Vibrio cholerae. Archaea: Aeropyrum pernix, Pyrococcus horikoshii, Methanococcus jannaschii. The H. sapiens proteome is the ENSEMBL-0.8.0 confirmed peptide data set (http://www.ensembl.org). Other sequences were taken from the NCBI (ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/).
Sequence Analysis
Sequences, annotations, and results are stored in a relational database (MySQL, http://www.mysql.com), which serves as the back end for an automated processing pipeline running on a Linux computer farm. Our software and database system allows for updates of the data and results as well as comparisons across proteomes. The sequences were first scanned for signal peptides (SignalP-1.1
[Nielsen et al. 1997 Protein sequence database searches were performed using PSI-BLAST
version 2.0.14 (Altschul et al. 1997 Examination of our initial results showed that there was a problem in
PSI-BLAST detecting very short SCOP domains (<50 residues), because
BLAST/PSI-BLAST e-values may not be significant for short alignments
even when manual investigation of the region strongly suggests that it
should be assigned to a particular SCOP domain. We developed a
heuristic method to address this problem, whereby an assignment to a
SCOP domain is accepted with an e-value <10 for an IMPALA or BLAST
hit and <5 for a PSI-BLAST hit if the domain is shorter than 50 residues and the sequence identity of the alignment satisfies the
identity cut-off described by Rost (1999) GAP-BLAST (Altschul et al. 1997 For the analysis of transmembrane proteins, sequences were truncated if the SignalP program could identify a potential signal peptide. This avoids false-positive predictions of transmembrane regions at the N terminus of a sequence.
Availability of Annotation
The results of our analysis are available as 3D-GENOMICS via our web page http://www.sbg.bio.ic.ac.uk. This includes query forms for database searches and the display of tables and alignments. We provide a special section with results from comparative analyses, including an application to generically list different domain properties such as repetitiveness, association with transmembrane proteins, or domain partners ranked by frequency in a selected "master" proteome.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
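The acceptance rule for very short SCOP domains described under Sequence Analysis can be sketched as follows. This is an illustrative sketch only, not the code used in this study: the `Hit` record, the function name, and the `identity_cutoff` callable are hypothetical. The e-value limits (<10 for IMPALA and BLAST, <5 for PSI-BLAST) and the 50-residue length limit are taken from the text. The Rost (1999) identity cut-off is passed in as a function of alignment length rather than hard-coded, because its exact parameterization is not given here, and treating "satisfies" as "greater than or equal to" is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable

# Relaxed e-value limits for short domains, as stated in the text.
RELAXED_EVALUE = {"IMPALA": 10.0, "BLAST": 10.0, "PSI-BLAST": 5.0}
SHORT_DOMAIN_LENGTH = 50  # residues

@dataclass
class Hit:
    method: str              # "IMPALA", "BLAST" or "PSI-BLAST"
    evalue: float
    domain_length: int       # length of the SCOP domain in residues
    alignment_length: int    # number of aligned residues
    percent_identity: float  # sequence identity of the alignment, 0-100

def accept_short_domain(hit: Hit, identity_cutoff: Callable[[int], float]) -> bool:
    """Apply the relaxed rule; return False for anything outside its scope."""
    limit = RELAXED_EVALUE.get(hit.method)
    # The relaxed rule only applies to known methods and short domains.
    if limit is None or hit.domain_length >= SHORT_DOMAIN_LENGTH:
        return False
    if hit.evalue >= limit:
        return False
    # Assumption: "satisfies the cut-off" means identity >= cut-off.
    return hit.percent_identity >= identity_cutoff(hit.alignment_length)

# Placeholder cut-off for demonstration only; it is NOT the Rost (1999) curve.
def flat_cutoff(alignment_length: int) -> float:
    return 40.0

example = Hit("PSI-BLAST", evalue=4.0, domain_length=30,
              alignment_length=28, percent_identity=55.0)
accepted = accept_short_domain(example, flat_cutoff)
```

Hits to domains of 50 residues or more fall outside this rule and would be judged by the standard e-value thresholds instead.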
http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/; results from the HMM superfamily analysis by Gough et al. (2001).
http://www.cbs.dtu.dk/services/SignalP/; SignalP-1.1 Web site.
http://www.cbs.dtu.dk/services/TMHMM-2.0/; TMHMM-2.0 Web site.
http://www.ensembl.org; ENSEMBL Web site.
http://www.mysql.com; MySQL relational database.
http://www.sbg.bio.ic.ac.uk; data and results for this article.