Published online before print September 15, 2003, 10.1101/gr.1293003
Genome Res. 13:2265-2270, 2003 ©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Letter
The Secreted Protein Discovery Initiative (SPDI), a Large-Scale Effort to Identify Novel Human Secreted and Transmembrane Proteins: A Bioinformatics Assessment
Departments of Bioinformatics, Molecular Biology and Protein Chemistry, Genentech, Inc., South San Francisco, California 94080, USA
A large-scale effort, termed the Secreted Protein Discovery Initiative (SPDI), was undertaken to identify novel secreted and transmembrane proteins. In the first of several approaches, a biological signal sequence trap in yeast cells was utilized to identify cDNA clones encoding putative secreted proteins. A second strategy utilized various algorithms that recognize features such as the hydrophobic properties of signal sequences to identify putative proteins encoded by expressed sequence tags (ESTs) from human cDNA libraries. A third approach surveyed ESTs for protein sequence similarity to a set of known receptors and their ligands with the BLAST algorithm. Finally, both signal-sequence prediction algorithms and BLAST were used to identify single exons of potential genes from within human genomic sequence. The isolation of full-length cDNA clones for each of these candidate genes resulted in the identification of >1000 novel proteins. A total of 256 of these cDNAs are still novel, including variants and novel genes, per the most recent GenBank release version. The success of this large-scale effort was assessed by a bioinformatics analysis of the proteins through predictions of protein domains, subcellular localizations, and possible functional roles. The SPDI collection should facilitate efforts to better understand intercellular communication, may lead to new understandings of human diseases, and provides potential opportunities for the development of therapeutics.
Discovery of novel human proteins provides new opportunities for development of drug therapies for treatment of the wide range of diseases for which there is still no cure. In other cases, these proteins play an integral role in a disease state or the biological pathway leading to disease, and their identification and characterization may lead to an understanding of disease paradigms. Secreted and transmembrane proteins, in particular, have properties that lend themselves to be utilized as therapeutic agents or targets. They are accessible to various drug delivery mechanisms, because they are presented on the cell surface or within the extracellular space. A purified secreted protein or a receptor extracellular domain can be utilized directly as a therapeutic (e.g., growth hormone), or may be targeted by specific antibodies or small molecules. Important therapeutics have been created that target proteins present on the cell surface in a specific cell type or disease state. Rituxan is an antibody therapeutic targeting the B lymphocyte-specific CD20 protein and is an effective therapeutic in the treatment of non-Hodgkin's lymphoma. Herceptin is an antibody therapeutic targeting the breast carcinoma-specific HER2 protein and is an effective therapeutic in the treatment of breast cancer.
A number of gene families of secreted and transmembrane proteins related by homology have emerged that include members known to have key roles in important biological processes such as morphogenesis, cellular differentiation, angiogenesis, apoptosis, and modulation of the immune response, as well as disease processes such as cancer progression. These gene families include tumor necrosis factors (Flavell 2002
In some cases, a protein may have therapeutic potential if it is present in a disease state, even if it does not play a role in the progression or maintenance of the disease. However, there are many factors that influence a protein's potential as an effective and safe therapeutic or therapeutic target; the presence and abundance in normal and diseased tissues, the subcellular localization, the activity, and the biological role of the protein are just some of these factors. Therefore, it is imperative to screen a large number of proteins for a wide variety of such characteristics in order to identify the most promising potential candidates for drug development. Computational methods can be useful in predicting the likelihood of some of a protein's characteristics in order to focus further laboratory investigation on the proteins with the most potential for playing a role in a disease state and leading to a therapeutic. To facilitate the discovery of new therapeutic opportunities, we undertook a large-scale program of biological and computational strategies to identify and classify new secreted and transmembrane proteins.
This effort to identify novel secreted and transmembrane human proteins resulted in 1047 transcripts successfully cloned, representing 1021 genes (Table 1). A complete list of the GenBank accession numbers of the cDNA clone sequences with details of the analysis summarized in this publication is available as a Supplementary Table. The success rate of the SPDI project can be measured by the proportion of these genes that appear to encode secreted or transmembrane proteins, which is 86% (879 genes). A total of 13% (136 genes) appear to encode cytoplasmic and nuclear proteins, and the subcellular localization could not be predicted for 1% (6 genes). Since our identification of these transcripts as representing novel genes, 75% (791) of them have been submitted to GenBank from sources other than this SPDI effort (Table 2). However, 25% of these cDNAs are still unique transcripts. This includes 20% (209) that are variants of genes currently represented in GenBank and 5% (47) that may represent completely novel genes.
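The proportions quoted above can be recomputed directly from the counts given in the text. The following is a minimal sketch of that arithmetic; the dictionary keys and function names are illustrative labels chosen here, not terms from the paper or its tables.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def summarize(counts, total):
    """Check that the categories add up to the total, then convert to percentages."""
    if sum(counts.values()) != total:
        raise ValueError("category counts do not add up to the stated total")
    return {name: percent(n, total) for name, n in counts.items()}


# Counts as stated in the text; the key names are illustrative labels.
GENES_TOTAL = 1021
GENE_COUNTS = {
    "secreted_or_transmembrane": 879,
    "cytoplasmic_or_nuclear": 136,
    "localization_not_predicted": 6,
}

TRANSCRIPTS_TOTAL = 1047
TRANSCRIPT_COUNTS = {
    "since_submitted_by_others": 791,
    "novel_variants_of_known_genes": 209,
    "possibly_novel_genes": 47,
}

if __name__ == "__main__":
    for label, counts, total in (
        ("genes", GENE_COUNTS, GENES_TOTAL),
        ("transcripts", TRANSCRIPT_COUNTS, TRANSCRIPTS_TOTAL),
    ):
        print(label)
        for name, pct in summarize(counts, total).items():
            print(f"  {name}: {pct}%")
```

Note that the gene-level figures use 1021 genes as the denominator, whereas the novelty figures use 1047 transcripts.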
A number of SPDI transcripts still represent novel genes at the time of the submission of this work. Evidence of these being bona fide genes includes ESTs, homology with known protein domains, and orthology with a mouse gene. Such evidence is present for most of the novel SPDI transcripts (Table 3). Transcripts with none of this evidence may also represent bona fide genes, but those with small predicted proteins may be more likely to be partial transcripts or other artifacts. Nonetheless, almost all of the SPDI transcripts that initially lacked supporting evidence have been confirmed by cDNAs identified by others over the years.
The first approach to the identification of novel secreted proteins was to exploit biological screens for the ability of cDNA library-encoded fusion proteins to direct the secretion of a reporter protein. Yeast cells provide an easily manipulated system for such screens for secreted proteins (Klein et al. 1996
One difficulty inherent in biological screens for secreted proteins is that they encounter diminishing yields as the more abundant novel proteins are discovered and the remaining novel proteins become rarer. Computational methods, by comparison, are, in principle, well suited to the identification of rare genes, provided there is sequence information to analyze. The overall SPDI strategy was to use both biological and computational methods to identify novel secreted and transmembrane proteins from multiple sources of DNA sequence (Fig. 1). The availability of very large collections of ESTs has greatly facilitated the use of such computational strategies. Two algorithms that detect the properties of signal peptides were developed and utilized: Signal Sensor (C. Watanabe, unpubl.) and Sighmm (Zhang and Wood 2003
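To illustrate the general idea of detecting the hydrophobic core of a signal peptide, the sketch below scans the amino-terminal region of a protein with a sliding window and the Kyte-Doolittle hydropathy scale. It is not Signal Sensor or Sighmm, whose internals are not described here; the window size, the length of the region scanned, the threshold, and the function names are all assumptions chosen for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def max_window_hydropathy(protein, window=8, n_term=35):
    """Return the highest mean hydropathy of any window in the N-terminal region.

    Returns None when the region is shorter than the window.
    Unknown residues (e.g. X) are scored as 0.0.
    """
    region = protein[:n_term].upper()
    if len(region) < window:
        return None
    best = None
    for start in range(len(region) - window + 1):
        segment = region[start:start + window]
        mean = sum(KYTE_DOOLITTLE.get(aa, 0.0) for aa in segment) / window
        if best is None or mean > best:
            best = mean
    return best


def has_hydrophobic_n_terminus(protein, threshold=2.0, window=8, n_term=35):
    """Crude flag: starts with Met and has a strongly hydrophobic N-terminal window.

    The threshold is a hypothetical value, not a calibrated cutoff.
    """
    score = max_window_hydropathy(protein, window=window, n_term=n_term)
    return protein.upper().startswith("M") and score is not None and score >= threshold


if __name__ == "__main__":
    # Synthetic sequences made up for this example, not real proteins.
    examples = {
        "hydrophobic_start": "MKTLLLLAVLLLAVAGASAQDEKRSTNQ",
        "hydrophilic_start": "MDEKRSTNQDEKRSTNQDEKRSTNQ",
    }
    for name, seq in examples.items():
        print(name, has_hydrophobic_n_terminus(seq))
```

A window scan like this would also flag amino-terminal transmembrane segments, so at best it serves as a first-pass filter; profile-based models such as the hidden Markov model approach cited above capture more of the signal peptide's structure.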
Some proteins known to be secreted or membrane bound cannot currently be identified as such computationally and/or do not possess a signal peptide. Additionally, limitations to the EST collections result in some genes not being represented with EST coverage containing amino-terminal sequence information. However, these proteins may have amino acid sequence homology to known secreted and transmembrane proteins. This homology may suggest a similar role and subcellular localization. Thus, homology-based screening strategies can be a powerful tool to identify putative secreted and transmembrane proteins. We utilized a collection of known ligands and receptors of interest as a homology-based method of identifying new members of these protein families. The protein families used in this search represent key players in cell-cell signaling, such as growth factors, cytokines, chemokines, and their receptors.
The recent availability of large-scale genomic sequence has provided new opportunities to identify rare genes not abundantly present within cDNA libraries and EST collections. The presence of introns in genomic sequence requires that a gene-prediction algorithm such as Genscan be used for gene identification. We have utilized both signal-sequence detection strategies and homology-based approaches to mine both predicted genes and genomic sequence directly for the identification of additional genes.
The SPDI effort utilized multiple gene-identification strategies that were used at different times during the course of the project, and genes already identified by one strategy were bypassed with later strategies. For this reason, it is not possible to evaluate which strategy was most effective at identifying secreted and transmembrane proteins. However, the largest number of genes were identified in this effort by computational signal sequence or homology detection from ESTs (Table 4). The smallest number of genes were detected only from genomic sequence. EST evidence was not sufficient for identification of these genes because of their rarity of expression, because EST coverage did not include a signal sequence, or because they are not highly homologous to the known ligands and receptors used to identify family members. For some genes, multiple methods were required in an iterative strategy in order to attain a full-length cDNA clone. Often, this occurred for particularly long transcripts when a 5'-truncated transcript was identified by EST mining, and then genomic sequence mining revealed the first exon of the gene. The SPDI effort exemplifies the value of utilizing various complementary approaches of gene identification.
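As an illustration of the homology-based screen described above, the following sketch filters BLAST hits and keeps the best-scoring match per query. It assumes the hits have already been exported in the common 12-column tab-delimited BLAST format (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E-value, bit score). The E-value and length cutoffs, the identifiers in the sample data, and the function names are hypothetical and are not the settings used in the SPDI effort.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column tab-delimited BLAST output, skipping blank and comment lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 columns, got {len(fields)}: {line!r}")
        hits.append(Hit(fields[0], fields[1], float(fields[2]), int(fields[3]),
                        float(fields[10]), float(fields[11])))
    return hits


def best_hits(hits: Iterable[Hit], max_evalue: float = 1e-5,
              min_length: int = 50) -> Dict[str, Hit]:
    """Keep, for each query, the highest bit-score hit that passes both cutoffs."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > max_evalue or hit.alignment_length < min_length:
            continue
        current = best.get(hit.query)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query] = hit
    return best


if __name__ == "__main__":
    # Made-up example rows; the EST and protein identifiers are placeholders.
    sample = [
        "est_001\tknown_ligand_A\t42.5\t120\t60\t3\t1\t360\t15\t134\t2e-30\t118.0",
        "est_001\tknown_receptor_B\t31.0\t80\t50\t2\t10\t250\t40\t119\t1e-8\t55.2",
        "est_002\tknown_ligand_A\t28.0\t30\t20\t1\t5\t95\t60\t89\t0.5\t24.1",
    ]
    for query, hit in best_hits(parse_tabular(sample)).items():
        print(query, hit.subject, hit.evalue)
```

Loosening the E-value cutoff trades specificity for sensitivity to distant family members, which is the regime in which the text notes that some genes could not be found by homology.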
Many of the genes identified belong to gene families related by homology, which are known to include important regulators of key physiological processes. These include secreted proteins such as cytokines, chemokines, and growth factors and their receptors. Other genes, such as those that apparently encode cytoplasmic or nuclear proteins, were also identified. In some cases, this was due to the presence of domains such as protease domains that can occur in proteins localized to either intracellular or extracellular spaces. The families of proteins that were found to have the greatest number of new members through this effort were the immunoglobulin (Ig) domain and leucine-rich repeat proteins. Combined, these two structural domains were present in >10% of the proteins identified. Another 10% of the proteins are clearly related to known classes of enzymes. A number of these proteins appear to be localized within subcompartments of secretory pathways and may have roles in regulating protein post-translational modification (e.g., glycosylation). Perhaps surprisingly, new members were identified for most of the major known families of secreted proteins. In some cases, such as in the interferon family, new members were identified despite the considerable previous efforts to identify members of the family.
The success of this effort was due to the combined use of multiple strategies for the identification of genes that encode secreted and transmembrane molecules. Each strategy has different strengths and limitations. The strategies differed both in the source of gene evidence (ESTs, predicted genes, and exon homology from genomic sequence) and in the method of detecting putative proteins with the properties of secreted and transmembrane proteins (biological screens for secretion, algorithms for detecting signal sequences, and homology searches based on a collection of known secreted and transmembrane proteins of interest). The various methods described have differed in their success at identifying particular types of genes. For instance, novel secreted genes without a recognizable relationship to other known genes can perhaps only be identified with the biological or computational signal-sequence detection methods. Conversely, many secreted and transmembrane proteins of known gene families do not have a detectable signal sequence (e.g., basic FGF), but could be recognized by homology. The success rate of these methods was also influenced by the timing of their introduction. For example, the yeast signal trap screening was gradually discontinued as EST collections became larger and proved to be a more efficient means of gene identification. Similarly, genomic sequence mining was introduced only after EST mining had been fairly exhaustive.
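The effect of timing on per-strategy yields can be made concrete with a small sketch. When each strategy is credited only with genes that no earlier strategy has already found, the credited counts depend on the order in which the strategies are run. The strategy names and gene identifiers below are invented for illustration and do not correspond to actual SPDI data.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def credit_sequentially(
    strategies: Sequence[Tuple[str, Iterable[str]]],
) -> Dict[str, List[str]]:
    """Credit each gene to the first strategy (in run order) that finds it.

    `strategies` is an ordered sequence of (strategy_name, candidate_gene_ids).
    Genes already credited to an earlier strategy are bypassed by later ones.
    """
    seen = set()
    credited: Dict[str, List[str]] = {}
    for name, candidates in strategies:
        # dict.fromkeys removes duplicates while preserving candidate order.
        new_genes = [g for g in dict.fromkeys(candidates) if g not in seen]
        credited.setdefault(name, []).extend(new_genes)
        seen.update(new_genes)
    return credited


if __name__ == "__main__":
    # Hypothetical candidate sets with partial overlap.
    est_mining = ["geneA", "geneB", "geneC", "geneD"]
    genomic_mining = ["geneC", "geneD", "geneE"]

    first = credit_sequentially([("est", est_mining), ("genomic", genomic_mining)])
    second = credit_sequentially([("genomic", genomic_mining), ("est", est_mining)])

    print({k: len(v) for k, v in first.items()})
    print({k: len(v) for k, v in second.items()})
```

The total number of distinct genes is the same in both orderings; only the attribution changes. This is why yields credited under a staggered schedule are not a direct measure of each strategy's effectiveness.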
The novelty of these proteins was a key factor in the criteria for cloning them. Candidates that had identity to cDNA clone sequences in GenBank were not pursued. Therefore, the genes identified do not represent a complete collection of secreted and transmembrane proteins. Large-scale efforts by others have also identified comprehensive collections of cDNA clones for human (Strausberg et al. 1999
Many proteins in the SPDI collection have already been shown to have functions in important biological processes through investigations with the cDNA clones identified here. Of particular interest have been newly identified growth factors, cytokines, tumor necrosis factors, and Toll family receptors. Angiogenic mitogens stimulate growth of vascular endothelial cells, which is critical to the development of vascular supply. EG-VEGF induces proliferation, migration, and fenestration in capillary endothelial cells derived from endocrine glands (LeCouter et al. 2001
Cytokines and their receptors transmit signals that modulate the immune response. The IL-17B and IL-17C cytokines induce the release of the TNF-
Tumor necrosis factors and their receptors are involved in a number of physiological and pathological responses. DR5 induces apoptosis in tumor cells after binding Apo2L/TRAIL, and DcR1 and DcR2 act as decoy receptors that inhibit this signaling (Marsters et al. 1997
The innate immune system uses Toll family receptors to signal for the presence of microbes and initiate host defense. Bacterial lipoproteins are potent activators of Toll-like receptor-2, mediating both apoptosis and NF-κB signaling through myeloid differentiation factor 88 (Yang et al. 1998
A number of the SPDI proteins have been implicated in a wide range of other biological processes, and further investigation of others is underway (Pennica et al. 1998
Biological Screens in Yeast Cells for Detection of Secretion Sequences
Recombinant gene libraries were constructed by replacing the signal peptide encoded by the reporter gene with a library of cDNA fragments. If a given cDNA fragment encodes a signal peptide, the fusion protein may be secreted by a clonal colony of yeast cells, enabling identification of functional signal sequences. Several reporter genes were utilized in these studies, including invertase, amylase (Klein et al. 1996
Sequence Data Sources for Computational Screens
Computational Screen for Signal Peptides
Computational Screen for Homology to Proteins of Interest
Novelty Assessment of Identified Transcripts
Prediction of Protein Domains and Subcellular Localization
Assignment of Transcripts to Gene Categories
We thank Thomas Wu for the Tmdetect algorithm and David Carpenter for fruitful analysis and discussions. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
1 Corresponding author. E-MAIL hclark@gene.com; FAX (650) 225-5389. [Supplemental material is available online at www.genome.org and at http://share.gene.com. The cDNA clone sequences from this study have been submitted to GenBank under accession nos. AY358081-AY359127. The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: T. Wu.] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1293003. Article published online before print in September 2003.
Aliprantis, A.O., Yang, R.B., Weiss, D.S., Godowski, P., and Zychlinsky, A. 2000. The apoptotic signaling pathway activated by Toll-like receptor-2. EMBO J. 19: 3325-3336.
Armant, M.A. and Fenton, M.J. 2002. Toll-like receptors: A family of pattern-recognition receptors in mammals. Genome Biol. 3: 3011-3016.
Baker, K. and Gurney, A.L. 2000. Method of selection for genes encoding secreted and transmembrane proteins. In US Patent 6,060,249.
Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M., and Sonnhammer, E.L. 2002. The Pfam protein families database. Nucleic Acids Res. 30: 276-280.
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
Chen, Q., Ghilardi, N., Wang, H., Baker, T., Xie, M.H., Gurney, A., Grewal, I.S., and de Sauvage, F.J. 2000. Development of Th1-type immune responses requires the type I cytokine receptor TCCR. Nature 407: 916-920.
Cross, M.J. and Claesson-Welsh, L. 2001. FGF and VEGF function in angiogenesis: Signalling pathways, biological responses and therapeutic inhibition. Trends Pharmacol. Sci. 22: 201-207.
Danielsen, A.J. and Maihle, N.J. 2002. The EGF/ErbB receptor family and apoptosis. Growth Fact. 20: 1-15.
Dedhar, S. 1999. Integrins and signal transduction. Curr. Opin. Hematol. 6: 37-43.
Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14: 755-763.
Flavell, R.A. 2002. The relationship of inflammation and initiation of autoimmune disease: Role of TNF super family members. Curr. Top. Microbiol. Immunol. 266: 1-10.
Gewirtz, A.T., Navas, T.A., Lyons, S., Godowski, P.J., and Madara, J.L. 2001. Cutting edge: Bacterial flagellin activates basolaterally expressed TLR5 to induce epithelial proinflammatory gene expression. J. Immunol. 167: 1882-1885.
Ghilardi, N., Li, J., Hongo, J.A., Yi, S., Gurney, A., and de Sauvage, F.J. 2002. A novel type I cytokine receptor is expressed on monocytes, signals proliferation, and activates STAT-3 and STAT-5. J. Biol. Chem. 277: 16831-16836.
Grandvaux, N., tenOever, B.R., Servant, M.J., and Hiscott, J. 2002. The interferon antiviral response: From viral invasion to evasion. Curr. Opin. Infect. Diseases 15: 259-267.
Gurney, A.L., Marsters, S.A., Huang, A., Pitti, R.M., Mark, M., Baldwin, D.T., Gray, A.M., Dowd, P., Brush, J., Heldens, S., et al. 1999. Identification of a new member of the tumor necrosis factor family and its receptor, a human ortholog of mouse GITR. Curr. Biol. 9: 215-218.
Hackel, P.O., Zwick, E., Prenzel, N., and Ullrich, A. 1999. Epidermal growth factor receptors: Critical mediators of multiple receptor pathways. Curr. Opin. Cell Biol. 11: 184-189.
Holcomb, I.N., Kabakoff, R.C., Chan, B., Baker, T.W., Gurney, A., Henzel, W., Nelson, C., Lowman, H.B., Wright, B.D., Skelton, N.J., et al. 2000. FIZZ1, a novel cysteine-rich secreted protein associated with pulmonary inflammation, defines a new gene family. EMBO J. 19: 4046-4055.
Kawai, J., Shinagawa, A., Shibata, K., Yoshino, M., Itoh, M., Ishii, Y., Arakawa, T., Hara, A., Fukunishi, Y., Konno, H., et al. 2001. Functional annotation of a full-length mouse cDNA collection. Nature 409: 685-690.
Klein, R.D., Gu, Q., Goddard, A., and Rosenthal, A. 1996. Selection for genes encoding secreted proteins and receptors. Proc. Natl. Acad. Sci. 93: 7108-7113.
LeCouter, J., Kowalski, J., Foster, J., Hass, P., Zhang, Z., Dillard-Telm, L., Frantz, G., Rangell, L., DeGuzman, L., Keller, G.A., et al. 2001. Identification of an angiogenic mitogen selective for endocrine gland endothelium. Nature 412: 877-884.
Lee, J., Ho, W.-H., Maruoka, M., Corpuz, R.T., Baldwin, D.T., Foster, J.S., Goddard, A.D., Yansura, D.G., Vandlen, R.L., Wood, W.I., et al. 2001. IL-17E, a novel proinflammatory ligand for the IL-17 receptor homolog IL-17Rh1. J. Biol. Chem. 276: 1660-1664.
Lennon, G., Auffray, C., Polymeropoulos, M., and Bento Soares, M. 1996. The I.M.A.G.E. consortium: An integrated molecular analysis of genomes and their expression. Genomics 33: 151-152.
Li, H., Chen, J., Huang, A., Stinson, J., Heldens, S., Foster, J., Dowd, P., Gurney, A.L., and Wood, W.I. 2000. Cloning and characterization of IL-17B and IL-17C, two new members of the IL-17 cytokine family. Proc. Natl. Acad. Sci. 97: 773-778.
Marsters, S.A., Sheridan, J.P., Pitti, R.M., Huang, A., Skubatch, M., Baldwin, D., Yuan, J., Gurney, A., Goddard, A.D., Godowski, P., et al. 1997. A novel receptor for Apo2L/TRAIL contains a truncated death domain. Curr. Biol. 7: 1003-1006.
Marsters, S.A., Sheridan, J.P., Pitti, R.M., Brush, J., Goddard, A., and Ashkenazi, A. 1998. Identification of a ligand for the death-domain-containing receptor apo3. Curr. Biol. 8: 525-528.
Onuffer, J. and Horuk, R. 2002. Chemokines, chemokine receptors and small-molecule antagonists: Recent developments. Trends Pharmacol. Sci. 23: 459-467.
Ornitz, D. and Itoh, N. 2001. Fibroblast growth factors. Genome Biol. 2: 3001-3009.
Pennica, D., Swanson, T.A., Welsh, J.W., Roy, M.A., Lawrence, D.A., Lee, J., Brush, J., Taneyhill, L.A., Deuel, B., Lew, M., et al. 1998. WISP genes are members of the connective tissue growth factor family that are up-regulated in wnt-1-transformed cells and aberrantly expressed in human colon tumors. Proc. Natl. Acad. Sci. 95: 14717-14722.
Pitti, R.M., Marsters, S.A., Lawrence, D.A., Roy, M., Kischkel, F.C., Dowd, P., Huang, A., Donahue, C.J., Sherwood, S.W., Baldwin, D.T., et al. 1998. Genomic amplification of a decoy receptor for Fas ligand in lung and colon cancer. Nature 396: 699-703.
Schooltink, H. and Rose-John, S. 2002. Cytokines as therapeutic drugs. J. Interfer. Cyto. Res. 22: 505-516.
Sheridan, J.P., Marsters, S.A., Pitti, R.M., Gurney, A., Skubatch, M., Baldwin, D., Ramakrishnan, L., Gray, C.L., Baker, K., Wood, W.I., et al. 1997. Control of TRAIL-induced apoptosis by a family of signaling and decoy receptors. Science 277: 818-821.
Sonnhammer, E.L., Eddy, S.R., Birney, E., Bateman, A., and Durbin, R. 1998. Pfam: Multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res. 26: 320-322.
Strausberg, R.L., Feingold, E.A., and Klausner, R.D. 1999. The mammalian gene collection. Science 286: 455-457.
Tang, B.L. 2001. ADAMTS: A novel family of extracellular matrix proteases. Intl. J. Biochem. Cell Biol. 33: 33-44.
Williamson, A.R. 1999. The Merck gene index project. Drug Discovery Today 4: 115-122.
Xie, M.H., Holcomb, I., Deuel, B., Dowd, P., Huang, A., Vagts, A., Foster, J., Liang, J., Brush, J., Gu, Q., et al. 1999. FGF-19, a novel fibroblast growth factor with unique specificity for FGFR4. Cytokine 11: 729-735.
Xie, M.H., Aggarwal, S., Ho, W.H., Foster, J., Zhang, Z., Stinson, J., Wood, W.I., Goddard, A.D., and Gurney, A.L. 2000. Interleukin (IL)-22, a novel human cytokine that signals through the interferon receptor-related proteins CRF2-4 and IL-22R. J. Biol. Chem. 275: 31335-31339.
Yamamoto, S., Higuchi, Y., Yoshiyama, K., Shimizu, E., Kataoka, M., Hijiya, N., and Matsuura, K. 1999. ADAM family proteins in the immune system. Immunol. Today 20: 278-284.
Yan, M., Wang, L.C., Hymowitz, S.G., Schilbach, S., Lee, J., Goddard, A., de Vos, A.M., Gao, W.Q., and Dixit, V.M. 2000. Two-amino acid molecular switch in an epithelial morphogen that regulates binding to two distinct receptors. Science 290: 523-527.
Yancopoulos, G.D., Klagsbrun, M., and Folkman, J. 1998. Vasculogenesis, angiogenesis, and growth factors: Ephrins enter the fray at the border. Cell 93: 661-664.
Yang, R.B., Mark, M.R., Gray, A., Huang, A., Xie, M.H., Zhang, M., Goddard, A., Wood, W.I., Gurney, A.L., and Godowski, P.J. 1998. Toll-like receptor-2 mediates lipopolysaccharide-induced cellular signalling. Nature 395: 284-288.
Zhang, Z. and Wood, W.I. 2003. A profile hidden Markov model for signal peptides generated by HMMER. Bioinformatics 19: 307-308.
Received February 23, 2003; accepted in revised form July 28, 2003.