Genome Res. 13:1231-1243, 2003
©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Resources
Kinase Pathway Database: An Integrated Protein-Kinase and NLP-Based Protein-Interaction Resource
1 Human Genome Center, Institute of Medical Science, University of Tokyo, Shirokane-dai, Minato-Ku, Tokyo 108-8639, Japan
2 Central Research Laboratory, Hitachi, Ltd., Higashi-Koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan
3 Life Science Group, Hitachi, Ltd., Minamidai, Kawagoe-shi, Saitama, 350-1165, Japan
Protein kinases play a crucial role in the regulation of cellular functions. Various kinds of information about these molecules are important for understanding signaling pathways and organism characteristics. We have developed the Kinase Pathway Database, an integrated database involving major completely sequenced eukaryotes. It contains the classification of protein kinases and their functional conservation, ortholog tables among species, protein-protein, protein-gene, and protein-compound interaction data, domain information, and structural information. It also provides an automatic pathway graphic image interface. The protein, gene, and compound interactions are automatically extracted from abstracts for all genes and proteins by natural-language processing (NLP). The method of automatic extraction uses phrase patterns and the GENA protein, gene, and compound name dictionary, which was developed by our group. With this database, pathways are easily compared among species using data with more than 47,000 protein interactions and protein kinase ortholog tables. The database is available for querying and browsing at http://kinasedb.ontology.ims.u-tokyo.ac.jp/.
Cellular signaling in eukaryotes plays a key role in the processes of tissue growth, cell differentiation, and rapid response to environmental changes. Defects in signaling components have been found to cause diseases such as cancer (Hunter 2000
The Protein Kinase Resource (Smith 1997
On the other hand, automatic extraction of information from articles has been studied, and some systems for extracting protein-protein interactions have been developed (Sekimizu et al. 1998
Toward a resolution of these problems, we developed the Kinase Pathway Database using sequence analysis and natural-language processing (NLP) techniques. The database provides ortholog tables for the protein kinases of the major eukaryotes Drosophila melanogaster, Mus musculus, Rattus norvegicus, and Homo sapiens. It also provides manually commented functional conservation information on protein kinases in these species, domain information, structural information, gene/protein/compound interactions, and a Java-based graphic viewer. The interaction data were automatically extracted from MEDLINE abstracts using natural-language processing for all genes, not only protein kinases. This system offers the following main features. (1) A protein, gene, and compound name dictionary called GENA (http://gena.ontology.ims.u-tokyo.ac.jp/search/servlet/gena) is used to identify as many synonyms as possible. GENA specifies the relationship between gene and protein names extracted from text and the locus (sequence) or external database. (2) By combining automatically extracted information and ortholog information, users can easily get interaction information on the orthologs of the target genes or proteins and can easily compare the pathways of different species in the graphic viewer. Here we present an overview of the Kinase Pathway Database and discuss the recall and precision of the extracted protein, gene, and compound interaction information. In the main text, we briefly summarize the performance of automatic information extraction. Examining its validity, however, requires an extended discussion. Accordingly, our analysis of extraction errors and a comparison with other natural-language processing methods are given in the Appendix.
Database Contents
Protein Kinases
The predicted numbers of protein kinases are summarized in Table 1. Their family classification and ortholog tables are contained in the current version (March 2003) of the Protein Kinase Database. The numbers of predicted protein kinases are slightly different from those in previous reports (Hunter and Plowman 1997
Automatically Extracted Protein-Interaction
For the calculation of recall, 500 abstracts (200 for S. cerevisiae and 300 for H. sapiens) were checked manually, and the results were compared with the automatically extracted results. The recall rate (= true positive/[true positive + false negative]) is 12 (automatic extraction)/46 (manual extraction) = 26% (12/46 = 26%) for S. cerevisiae and 38/153 = 25% (27/120 = 23%) for H. sapiens. "False negative" indicates the number of interactions that should have been extracted but were missed. The numbers in parentheses are the results without compounds. Further discussion of extraction errors and the comparison with other methods are given in the Appendix. The program leaves room for further analysis of coordinate clauses and anaphora of pronouns, and improvement in these areas will increase the recall and precision. We are working on those improvements now. About half of the molecular names are not written as gene names; rather, they are written as family names (including family-like names) or as unspecified gene names (see Appendix). To extract interaction information between families and between families and proteins, we are preparing an ontology of family names that contains enough synonyms.
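The recall figures quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the authors' evaluation code; the function name `recall_percent` and the round-half-up convention are assumptions made for illustration, chosen because they reproduce the percentages reported in the text.

```python
# Sketch: reproduce the reported recall percentages from the raw counts.
# The counts come from the text; the function name and the rounding rule
# (round half up) are assumptions made for this illustration.

def recall_percent(true_positive: int, evaluable: int) -> int:
    """Recall = TP / (TP + FN), as a whole percentage rounded half up.

    `evaluable` is TP + FN, i.e. the manually extracted interactions
    whose names could be identified by GENA.
    """
    # Integer arithmetic avoids floating-point surprises at exactly .5
    return (200 * true_positive + evaluable) // (2 * evaluable)

# (automatically extracted, manually extracted) per species
counts = {
    "S. cerevisiae": {"with_compounds": (12, 46), "without_compounds": (12, 46)},
    "H. sapiens": {"with_compounds": (38, 153), "without_compounds": (27, 120)},
}

recall = {
    species: {kind: recall_percent(tp, total) for kind, (tp, total) in kinds.items()}
    for species, kinds in counts.items()
}

for species, values in recall.items():
    print(species, values)
```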
Comparison With Other Databases
In the DIP database, there are a total of 16,581 nonredundant interactions for our target organisms. They involve 5720 proteins (H. sapiens, 687; M. musculus, 177; R. norvegicus, 82; D. melanogaster, 45; C. elegans, 5; S. cerevisiae, 4724). Of these, 5327 proteins (H. sapiens, 585; M. musculus, 116; R. norvegicus, 56; D. melanogaster, 37; C. elegans, 4; S. cerevisiae, 4529) are registered in GENA. Therefore, GENA-IDs are assigned for 14,888 interactions (H. sapiens, 567; M. musculus, 47; R. norvegicus, 17; D. melanogaster, 38; C. elegans, 2; S. cerevisiae, 14,217). Of the S. cerevisiae interactions (14,217), 6256 are yeast two-hybrid results. Only 706 of the 14,888 interactions (H. sapiens, 144; M. musculus, 16; D. melanogaster, 1; C. elegans, 2; S. cerevisiae, 543) are found in our extracted data. In DIP, some interactions are extracted from tables and figures in review articles, and many interactions are not written in abstracts. In TRANSPATH, 3021 molecules are connected by 3225 reactions from 977 extracted papers (almost all reviews). Unfortunately, those data could not be downloaded, and we could not compare them with our results. The overlap between manually gathered data and our data is not large. Uncollected data will probably be complemented by the analysis of review articles.
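To make the size of the overlap with DIP concrete, the per-species counts given above can be tabulated and divided. This is only an illustrative calculation on the numbers in the text; the variable names are hypothetical, and treating a species that is not listed among the found interactions (R. norvegicus) as zero is an assumption, although it is consistent with the stated total of 706.

```python
# Sketch: overlap between DIP interactions (with GENA-IDs) and the
# automatically extracted interactions, using the counts from the text.

dip_with_gena_id = {
    "H. sapiens": 567,
    "M. musculus": 47,
    "R. norvegicus": 17,
    "D. melanogaster": 38,
    "C. elegans": 2,
    "S. cerevisiae": 14217,
}

# R. norvegicus is not listed in the text; it is treated as 0 below (assumption).
found_in_extracted = {
    "H. sapiens": 144,
    "M. musculus": 16,
    "D. melanogaster": 1,
    "C. elegans": 2,
    "S. cerevisiae": 543,
}

total_dip = sum(dip_with_gena_id.values())
total_found = sum(found_in_extracted.values())
overall_overlap = total_found / total_dip

overlap_by_species = {
    species: found_in_extracted.get(species, 0) / n
    for species, n in dip_with_gena_id.items()
}

print(f"overall: {total_found}/{total_dip} = {overall_overlap:.1%}")
for species, fraction in overlap_by_species.items():
    print(f"{species}: {fraction:.1%}")
```

The overall fraction is below 5%, which is the quantitative content of the statement that the overlap is not large.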
The Web Interface of the Kinase Pathway Database
Figures 1A-1F show the results of searching with the Web interface. The manually developed comment file for each family (search method 1) is shown in Figure 1A, and Figure 1B is an ortholog table (search method 5). To express gene duplication, "level" has been used. "III" indicates genes that have diverged from "II", and "II" indicates those that have diverged from "I". The phylogenetic tree for kinases is shown in Figure 1F (search method 4), and the interaction data automatically extracted from abstracts are shown in Figure 1D (search method 3). As evidence for the extracted interaction information, the sentence from which the interaction data were extracted and the author information are provided, as shown in Figure 1E. Using the information in Figures 1B and 1D, the pathways for one or two organisms can be drawn side by side, as shown in Figure 1C (search method 2). In this figure, the orthologous proteins have been specified as the starting proteins of the pathways, and the interacting proteins within two steps from the starting proteins are drawn. In the pathway viewer, orthologous proteins are selected according to the ortholog table level. For example, when M. musculus PAK1 is selected, S. cerevisiae STE20, CLA4, D. melanogaster PAK, R. norvegicus PAK1, and H. sapiens PAK1 are shown as orthologous proteins with interaction information in the intermediate Web interface, and one protein can be selected (STE20 in this example; Fig. 1C). However, other PAK subfamily members such as H. sapiens PAK2-7 are not shown, because they are set as paralogs in the ortholog table. By using the pathway viewer, the known interlog (both proteins are orthologs) pathways such as STE11-
Further information related to the use of this database, statistical information regarding its contents, and the structure of the relational database can be accessed from the database Web page.
Application
Summary
Architectural Design
An overview of the Kinase Pathway Database is shown in Figure 2. The Kinase Pathway Database and GENA are implemented using PostgreSQL. In the following sections, the construction of each component is discussed in detail. The table layout is available on the Web (http://kinasedb.ontology.ims.u-tokyo.ac.jp/comment/table.files/slide0001.htm).
The Kinase Classification and Ortholog Table
Pfam (Bateman et al. 2000 The ortholog tables, including paralogs, were produced from these phylogenetic trees. "Level," which expresses gene duplication in the ortholog tables, was also determined from these phylogenetic trees. These tables can be used to search for orthologous pathways with the graphic pathway viewer, as discussed in the Results and Discussion section. Further, family names and functional conservation for eukaryotic protein kinases have hierarchical definitions and can be summarized manually.
Domains and Structural Information
On the other hand, protein structures that are expected to be biased in function and the distribution of structural information on pathways are attracting attention (Hegyi and Gerstein 1999
Graphic Pathway Viewer
Protein Interaction Information: Automatic Extraction
The general procedure is shown in Figure 3. The details are described in the following subsections.
GENA: The Gene, Protein, and Compound Name Dictionary
GENA automatically and periodically gathers full official gene names, official gene symbols, gene synonyms, gene products, and family names from PomBase (http://www.sanger.ac.uk/Projects/S_pombe/), SGD (http://genome-www.stanford.edu/Saccharomyces/), MIPS (http://mips.gsf.de/projects/fungi/yeast.html), WormBase (http://www.wormbase.org/), FlyBase (http://flybase.bio.indiana.edu/), MGI (http://www.informatics.jax.org/), RGD (http://rgd.mcw.edu/), HUGO (http://www.gene.ucl.ac.uk/hugo/), GDB (http://gdbwww.gdb.org/), GenAtlas (http://www.dsi.univ-paris5.fr/genatlas/), OMIM (http://www.ncbi.nlm.nih.gov/omim/), LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html), SWISS-PROT (Bairoch and Apweiler 2000
Shallow Parsing
The shallow parser FDG-Lite by Conexor (http://www.conexoroy.com/products.htm) assigns word form, base form, part-of-speech, and light syntactic representation to all targeted abstracts. FDG-Lite was originally developed by A. Voutilainen, P. Tapanainen, and T. Järvinen at the University of Helsinki. A parsed example can be seen on the Web (http://www.conexoroy.com/lite.htm). Although most protein, gene, and compound names are not stored in the FDG-Lite dictionary, correct syntactic tags and part-of-speech tags are assigned to them in almost all cases. Stemming and abbreviation errors caused by protein, gene, or compound names are nevertheless observed. For example, "c-fos" is wrongly recognized as the plural of "c-fo". A simple program corrected these errors.
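The text does not describe how the correction program works. One plausible approach, sketched below purely as an assumption, is to restore the surface form as the base form whenever a token matches a known gene, protein, or compound name. The `Token` record, the `known_names` set, and the function name are hypothetical and do not reflect the actual FDG-Lite output format.

```python
# Sketch (assumed approach): undo wrong stemming of gene/protein names
# by restoring the surface form when it is a known dictionary name.
from typing import Iterable, List, NamedTuple, Set


class Token(NamedTuple):
    word: str  # word form as written in the abstract
    base: str  # base form assigned by the parser
    pos: str   # part-of-speech tag (hypothetical tag names)


def fix_base_forms(tokens: Iterable[Token], known_names: Set[str]) -> List[Token]:
    """Replace the parser's base form with the surface form for known names."""
    fixed = []
    for tok in tokens:
        is_name = tok.word.lower() in known_names
        if is_name and tok.base.lower() != tok.word.lower():
            tok = tok._replace(base=tok.word)
        fixed.append(tok)
    return fixed


# Hypothetical dictionary excerpt and parser output
known_names = {"c-fos"}
parsed = [Token("c-fos", "c-fo", "N"), Token("genes", "gene", "N")]
corrected = fix_base_forms(parsed, known_names)
print(corrected)
```

Ordinary plurals such as "genes" keep the parser's base form, because they are not in the name dictionary.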
Recognition of Noun Phrases
Regular Expression of Noun Phrases
Here, special characters are used as in Perl. In this expression, each term corresponds to a pair of syntactic tag and part-of-speech tag. For instance, when the syntactic tag is "premodifier" and the part-of-speech tag is "noun", "noun_modifier" is assigned to the corresponding word. The following noun phrases enclosed in angle brackets (< >) are recognized, and their base forms and positions in the sentence are indexed. "Using <leptomycin B> <we> demonstrate that <transport> of the <FXR proteins> out of the <nucleus> is mediated by the <export receptor exportin1>". This step is not necessary if abstracts are used only once for information extraction. The indexes are useful when gene/protein/compound names are added to GENA, because searching for sentences that include gene/protein/compound names consumes more CPU time as the number of names increases. The indexes are also used for the extraction of molecular function, which is under development. When prepositions are included in the target protein or compound name, the names are not extracted correctly. However, fewer than 0.5% of the protein and compound names used in this study include a preposition. In addition, if the other interacting protein name does not include a preposition, the protein with the preposition will be extracted as an interacting partner protein. Therefore, we ignored such names in this version.
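The idea of matching a regular expression against a sequence of tag pairs, rather than against the words themselves, can be sketched as follows. The tag names (other than "noun_modifier"), the pattern, and the hand-assigned tags are assumptions for illustration only; they are not the expression used in this study and not the FDG-Lite tag set.

```python
# Sketch: recognize noun phrases by running a regular expression over
# a space-separated string of per-word tags, then mapping each match
# back to word positions. Tag names and pattern are hypothetical.
import re
from typing import List, Tuple

# Zero or more premodifiers followed by a head noun or pronoun.
NP_PATTERN = re.compile(
    r"(?<!\S)(?:(?:adj_modifier|noun_modifier) )*(?:noun_head|pronoun) "
)


def find_noun_phrases(tagged: List[Tuple[str, str]]) -> List[Tuple[int, int, str]]:
    """Return (first_index, last_index, text) for each noun phrase."""
    encoded = ""
    start_at = {}  # character offset where a tag starts -> word index
    end_at = {}    # character offset just after a tag's trailing space -> word index
    for i, (_, tag) in enumerate(tagged):
        start_at[len(encoded)] = i
        encoded += tag + " "
        end_at[len(encoded)] = i
    phrases = []
    for m in NP_PATTERN.finditer(encoded):
        first, last = start_at[m.start()], end_at[m.end()]
        text = " ".join(word for word, _ in tagged[first:last + 1])
        phrases.append((first, last, text))
    return phrases


# The example sentence from the text, tagged by hand for this sketch.
sentence = [
    ("Using", "verb"), ("leptomycin", "noun_modifier"), ("B", "noun_head"),
    ("we", "pronoun"), ("demonstrate", "verb"), ("that", "conj"),
    ("transport", "noun_head"), ("of", "prep"), ("the", "det"),
    ("FXR", "noun_modifier"), ("proteins", "noun_head"), ("out", "prep"),
    ("of", "prep"), ("the", "det"), ("nucleus", "noun_head"), ("is", "verb"),
    ("mediated", "verb"), ("by", "prep"), ("the", "det"),
    ("export", "noun_modifier"), ("receptor", "noun_modifier"),
    ("exportin1", "noun_head"),
]

noun_phrases = find_noun_phrases(sentence)
for first, last, text in noun_phrases:
    print(first, last, text)
```

Storing the word positions along with the text corresponds to the indexing step described above, so that sentences do not have to be re-parsed when new names are added to the dictionary.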
Identification of Protein Interactions
The template phrase patterns were mainly divided into two types: the noun-phrase type (e.g., the interaction between proteins A and B; protein A induced activation of protein B) and the predicate verb type (e.g., protein A interacts with protein B; protein A induces the expression of protein B). Actually, instead of only protein, gene, or compound names, extended noun phrases (noun phrases or noun phrase + prepositional phrase, or noun phrase + prepositional phrase + coordinate clause) are used in these template phrase patterns. For example:
The articles and demonstrative pronouns are ignored in the pattern matching procedure. The auxiliary verbs and verb modifier phrases are recognized as the verb phrase and are ignored in the pattern matching procedure. The verb phrases and preposition patterns and the noun phrase in bold typeface are incorporated in the template phrase patterns. The angle brackets (< >) indicate extended noun phrases, and the underlined angle brackets (< >) show noun phrases extracted by phrase pattern matching.
The detailed pattern types and sentence examples are summarized in Table 5. Actually, some restrictions are imposed on noun phrases including protein names in every pattern. For example, most noun-phrase 2 (NP2s) including a protein name in the extended noun phrase (surrounded by < >) in Table 5 must be a noun phrase without prepositions or a protein name + special expression (such as "activation of protein name"). More than 600 complete template phrase patterns, including differences in prepositions, are provided. Patterns are made for every similar verb/noun. To increase pattern variety, we use ambiguous pattern expressions. For example, instead of "play an essential role in" we use "play <noun phrase ending with 'role'> in", because there are various expressions such as "play an important role" and "play a crucial role". Template phrase patterns were extracted by manually checking reviews and abstracts (more than 1000 in all). Whether proteins interact directly or indirectly with other proteins, the direction of interaction (the specification of the signal donor and acceptor proteins in the signaling cascade), and the type of interaction, such as activation, inhibition, or unknown (the signal direction is known, which indicates regulation), are also recognized. When the direction of the signal cascade is not known, "undirected" is assigned. These types of interactions are determined by matching template phrase patterns. For example, "complex," "interact," "associate," "bind," and "verb of chemical modifier" are recognized as "direct interaction." "Activate" and "inhibit" are recognized as "indirect interaction," because they are not always used to indicate direct interaction. Only in special cases is the order of pattern matching fixed for correct extraction. For example, protein A
In the present study, anaphora of pronouns (anaphora is the coreference of an expression with its antecedent) was not resolved. The antecedent of a relative pronoun is always considered to be the previous noun phrase. For example, "Recently we cloned the <cDNA encoding human Fe65L2, which> interacts with <Alzheimer's beta-amyloid precursor protein (APP)>." Because the phrases in angle brackets (< >) are recognized as the subject and object of "interact", "which" is considered to point to "cDNA encoding human Fe65L2". When negative words such as "neither," "not," "never," or "fail" are included in a verb phrase, or other species names are included in the sentence, the sentence is discarded. When "investigate," "examine," "design," and so on are included in the sentence, the sentence does not denote a fact or experimental result, so such sentences are also discarded. NCBI taxonomy (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/) is used as the list of species names. When other species names are included in the abstract, a "problematic" mark is assigned to avoid wrong recognition of interactions from other species. Data marked "problematic" are searchable in the protein interaction data but are not shown in the pathway viewer.
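The sentence-level filters and the direct/indirect distinction described in this section can be sketched as simple set lookups on base forms. This is a simplified illustration, not the actual program: it checks the whole sentence for negative words rather than only the verb phrase, the word lists contain only the examples named in the text plus "phosphorylate" as an assumed instance of a chemical-modification verb, and all function names are hypothetical.

```python
# Sketch: discard sentences that should not yield interactions, and map
# trigger words to interaction categories. Word lists are illustrative.
from typing import Iterable, Optional, Set

NEGATIVE_WORDS = {"neither", "not", "never", "fail"}
NON_FACT_WORDS = {"investigate", "examine", "design"}

# "phosphorylate" stands in for "verb of chemical modifier" (assumption).
DIRECT_TRIGGERS = {"complex", "interact", "associate", "bind", "phosphorylate"}
EFFECT_BY_TRIGGER = {"activate": "activation", "inhibit": "inhibition"}


def keep_sentence(base_forms: Iterable[str],
                  other_species: Optional[Set[str]] = None) -> bool:
    """Return False if the sentence should be discarded.

    Simplification: negative words are checked anywhere in the sentence,
    whereas the text restricts the check to the verb phrase.
    """
    words = {w.lower() for w in base_forms}
    if words & NEGATIVE_WORDS or words & NON_FACT_WORDS:
        return False
    if other_species and words & {s.lower() for s in other_species}:
        return False
    return True


def directness(trigger: str) -> Optional[str]:
    """Classify a trigger base form as a direct or indirect interaction."""
    if trigger in DIRECT_TRIGGERS:
        return "direct"
    if trigger in EFFECT_BY_TRIGGER:
        return "indirect"
    return None


def effect(trigger: str) -> Optional[str]:
    """Return 'activation' or 'inhibition' when the trigger implies one."""
    return EFFECT_BY_TRIGGER.get(trigger)


print(keep_sentence(["Fe65L2", "interact", "with", "APP"]))  # kept
print(directness("bind"), directness("activate"), effect("inhibit"))
```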
Assignment of GENA-ID For example:
FXR proteins
The names in the Kinase Pathway Database are unified by this official symbol and GENA-ID. In principle, uppercase and lowercase characters are not distinguished. However, for some gene names that have the same spelling as common nouns or prepositions, such as "yellow" and "on", only all-uppercase names, names that include uppercase letters, or exact case matches are used, to avoid extracting unrelated information and reducing precision. (This is not applied when the gene name has all lowercase characters in GENA.) For some words such as "cAMP", exact matching, including the case pattern, is used to distinguish gene names from words that have other meanings (in this case "cAMP" must be distinguished from "CAMP [cathelicidin antimicrobial peptide]" in human). Because the synonyms in GENA are insufficient to allow the use of complete matching alone, special treatment of some special marks and letters, for example, "-" (hyphen), "." (period), Greek letters, and spaces, was devised. Because the characteristics of gene name spelling vary considerably among species, different treatments are required for different species. When extracted and assigned gene names were abbreviated and their full names were written in the abstracts (full names and synonyms were written preceding parentheses and abbreviations were written within parentheses), we checked the consistency of GENA-IDs to identify them. When the full names or synonyms and the abbreviations in parentheses were not registered with the same GENA-ID, those interaction data were not saved in the database. When the extracted gene name was assigned to multiple genes and the actual gene was not distinguishable, a "problematic" mark was added and the interactions of all candidate gene patterns were stored. "Problematic" data are searchable but are not shown in the pathway viewer.
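The matching rules in this section (case-insensitive matching by default, relaxed treatment of hyphens, periods, spaces, and Greek letters, exact-case matching for words such as "cAMP", and a "problematic" mark for ambiguous names) can be illustrated with a small lookup routine. This is a sketch under stated assumptions: the synonym table, the IDs, the word lists, and the function names are invented for the example, the common-word rule is simplified to "require at least one uppercase letter", and species-specific treatments are omitted.

```python
# Sketch: assign dictionary IDs to an extracted name using relaxed matching.
# All data and names below are hypothetical, not taken from GENA.
import re
from typing import Dict, List, Set, Tuple

GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}
EXACT_CASE_KEYS = {"camp"}       # names that need an exact case match
COMMON_WORDS = {"yellow", "on"}  # names spelled like ordinary words


def normalize(name: str) -> str:
    """Spell out Greek letters, drop hyphens/periods/spaces, lowercase."""
    for letter, spelled in GREEK.items():
        name = name.replace(letter, spelled)
    return re.sub(r"[-.\s]", "", name).lower()


def build_index(synonyms: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each normalized synonym to the set of IDs that use it."""
    index: Dict[str, Set[str]] = {}
    for synonym, gene_id in synonyms:
        index.setdefault(normalize(synonym), set()).add(gene_id)
    return index


def assign_ids(name: str, synonyms: List[Tuple[str, str]],
               index: Dict[str, Set[str]]) -> Tuple[List[str], bool]:
    """Return (sorted candidate IDs, problematic flag) for an extracted name."""
    key = normalize(name)
    if key in EXACT_CASE_KEYS:
        ids = {gene_id for synonym, gene_id in synonyms if synonym == name}
    elif key in COMMON_WORDS and name == name.lower():
        ids = set()  # all-lowercase common word: do not treat as a gene name
    else:
        ids = index.get(key, set())
    return sorted(ids), len(ids) > 1


# Hypothetical synonym table: (synonym, ID)
synonyms = [
    ("cAMP", "C001"), ("CAMP", "G100"), ("PAK1", "G200"), ("PAK-1", "G200"),
    ("TGF-beta", "G300"), ("ON", "G400"), ("ABC1", "G500"), ("ABC-1", "G501"),
]
index = build_index(synonyms)

for query in ["cAMP", "CAMP", "Pak 1", "TGFβ", "on", "ON", "abc1"]:
    print(query, assign_ids(query, synonyms, index))
```

In this sketch, a name that maps to more than one ID is returned with all candidates and flagged, mirroring the "problematic" mark described above.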
Biological experts checked gene-like and protein-like noun phrases that do not completely match the GENA entries but match them through heuristics such as the treatment of hyphens, spaces, and Greek letters, and those phrases were then registered in GENA as synonyms. Because mammalian gene names are similar to one another and there are not as many synonyms for mice and rats as there are for humans, we used dictionary information for other species. That is, gene-like or protein-like noun phrases (e.g., "-ase") that are not registered as GENA entries for the targeted species but are registered for other species were also checked by biological experts and registered in GENA as synonyms if confirmed.
In this Appendix, we discuss the cause of errors in the automatic protein interaction extraction and compare our method with other natural-language extraction methods.
The Precision of Automatically Searched Abstracts
The following errors (false positives) were detected in extracted gene, protein, or compound interactions.
The occurrence of extraction errors caused by #1.1 is quite low, probably less than 1%. Except for the errors caused by GENA, more than half of the errors are parsing errors [#1.3, mainly failures in coordinate clause analysis and noun-phrase (object-phrase) recognition]. Most of the noun-phrase (object-phrase) recognition errors described in #1.3 are avoidable if analysis of the constituent words of noun phrases (especially prepositional phrases) is added. For example, consider the sentence, "<p38> was also activated by <cisplatin with similar kinetics as JNK>." With our program, the wrong interaction "JNK
Blaschke and Valencia (2002
The Recall of Automatically Searched Abstracts
For S. cerevisiae, 93 (90) interactions were manually extracted, and among these, 47 (44) included names that were unregistered or unrecognizable in GENA (such as family names, or names that differ only in the presence or absence of a number, such as "PYK" and "PYK1"). The numbers in parentheses are the results without compounds. Some interactions with names that are unregistered or unrecognizable in GENA are also extracted, but we did not evaluate them in this study. To separate the problem of name detection caused by GENA from parsing and phrase-pattern errors, we evaluated the 46 (46) interactions, out of 93 (90), whose names are identified by GENA. Of these, 12 (12) were extracted by our program. The recall rate (= true positive/[true positive + false negative]) is 26% (26%). About 40% of the false negatives are due to the complexity of the sentences (e.g., the problems of anaphora by relative pronouns, coordinate clauses, and parenthetical expressions). About 10% are due to insufficient sentence analysis, which will be resolved by improvement of our program. The rest of the false negatives are mainly due to the lack of template phrase patterns. Anaphora covering multiple sentences seems to be a minor problem, because only two of the interactions were described across two sentences. For H. sapiens, 349 (244) interactions were extracted manually, and among them 196 (124) included names that are unregistered or unrecognizable in GENA. Accordingly, the names of 153 (120) interactions are registered in GENA. Although some of the interactions with unregistered or unrecognizable names were extracted, we did not evaluate them here. Of the 153 (120) interactions, 38 (27) were extracted by our programs. The recall rate is 25% (23%). About 30% of the unregistered names are compound names, about 10% are gene names, and about 60% are family-level or unspecified gene names.
About 6% of the gene names that were manually recognized as genes registered in GENA were not found automatically by simple heuristics. For example, if only "c-kit" is registered in GENA, it is difficult to recognize "c-kit receptor" as a synonym of "c-kit", because the same rule would wrongly equate similar name pairs such as "insulin receptor" and "insulin". About 40% of the false negatives are due to sentence complexity. About 10% of the false negatives are caused by insufficient analysis of coordinate clauses and parenthetical expressions; the rest are mainly due to the lack of template phrase patterns. Interestingly, although the signaling cascade of H. sapiens is more complicated than that of S. cerevisiae, the complexity of the sentences seems to be on the same level.
Comparison With Other NLP Methods
There are mainly two kinds of protein, gene, or compound relation extraction. One is the co-occurrence of two proteins, genes, or compounds in text; the other is the use of phrase patterns, as was done in this study. The former seems to be useful for well known interactions that co-occur in the text (Stapley and Benoit 2000
We thank Drs. T. Takai and K. Nakai for their helpful discussions, Mrs. K. Kodama and S. Asahi at Hitachi ULSI Systems for helping us by programming the database, and Ms. A. Nakata, Ms. Y. Shidahara, and Mr. K. Yamada for their careful reading of a large number of abstracts. This work was supported in part by a Grant-in-Aid for Scientific Research on Priority Areas (C) Genome Information Science from the Ministry of Education, Culture, Sports, Science and Technology of Japan. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.835903.
4 These authors contributed equally to this work.
5 Corresponding author.
Adachi, J. and Hasegawa, M. 2000. MOLPHY Version 2.3: Programs for phylogenetics, ver. 2.3. Institute of Statistical Mathematics, Tokyo.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D., et al. 2000. InterPro-An integrated documentation resource for protein families, domains, and functional sites. Bioinformatics 16: 1145-1150.
Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 45-48.
Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Howe, K.L., and Sonnhammer, E.L.L. 2000. The Pfam protein families databases. Nucleic Acids Res. 28: 263-266.
Blaschke, C. and Valencia, A. 2001. The potential use of SUISEKI as a protein interaction discovery tool. Genome Inform. Ser. Workshop Genome Inform. 12: 123-134.
Blaschke, C. and Valencia, A. 2002. The frame-based module of the SUISEKI information extraction system. IEEE Intelligent Systems II 28.
Friedman, C., Kra, P., Yu, H., Krauthammer, M., and Rzhetsky, A. 2001. GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17: 74-82.
Gotoh, O. 1996. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. 264: 823-838.
Hanks, S.K. and Quinn, A.M. 1991. Protein kinase catalytic domain sequence database, identification of conserved features of primary structure and classification of family members. Methods Enzymol. 200: 38-62.
Hegyi, H. and Gerstein, M. 1999. The relationship between protein structure and function: A comprehensive survey with application to the yeast genome. J. Mol. Biol. 288: 147-164.
Hunter, T. 2000. Signaling-2000 and beyond. Cell 100: 113-127.
Hunter, T. and Plowman, G.D. 1997. The protein kinases of budding yeast: Six score and more. Trends Biochem. Sci. 22: 18-22.
Jenssen, T.K., Laegreid, A., Komorowski, J., and Hovig, E. 2001. A literature network of human genes for high-throughput analysis of gene expression. Nat. Genet. 28: 21-28.
Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30: 42-46.
Krauthammer, M., Kra, P., Iossifov, I., Gomez, S.M., Hripcsak, G., Hatzivassiloglou, V., Friedman, C., and Rzhetsky, A. 2002. Of truth and pathways: Chasing bits of information through myriads of articles. Bioinformatics (Suppl.) 17: 249-257.
Lo Conte, L., Brenner, S.E., Hubbard, T.J., Chothia, C., and Murzin, A.G. 2002. SCOP database in 2002: Refinements accommodate structural genomics. Nucleic Acids Res. 30: 264-267.
Manning, G., Plowman, G.D., Hunter, T., and Sudarsanam, S. 2002. Evolution of protein kinase signaling from yeast to man. Trends Biochem. Sci. 27: 514-520.
McGarvey, P.B., Huang, H., Barker, W.C., Orcutt, B.C., Garavelli, J.S., Srinivasarao, G.Y., Yeh, L.S., Xiao, C., and Wu, C.H. 2000. PIR: A new resource for bioinformatics. Bioinformatics 16: 290-291.
Morrison, D.K., Murakami, M.S., and Cleghon, V. 2000. Protein kinases and phosphatases in the Drosophila genome. J. Cell Biol. 150: 57-62.
Ono, T., Hishigaki, H., Tanigami, A., and Takagi, T. 2001. Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics 17: 155-161.
Page, R.D. 1996. TreeView: An application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 12: 357-358.
Plowman, G.D., Sudarsanam, S., Bingham, J., Whyte, D., and Hunter, T. 1999. The protein kinases of Caenorhabditis elegans: A model for signal transduction in multicellular organisms. Proc. Natl. Acad. Sci. USA 96: 13603-13610.
Rindflesch, T.C., Tanabe, L., Weinstein, J.N., and Hunter, L. 2000. EDGAR: Extraction of drugs, genes, and relations from the biomedical literature. Pac. Symp. Biocomput. 517-528.
Schacherer, F., Choi, C., Gotze, U., Krull, M., Pistor, S., and Wingender, E. 2001. The TRANSPATH signal transduction database: A knowledge base on signal transduction networks. Bioinformatics 17: 1053-1057.
Sekimizu, T., Park, H.S., and Tsujii, J. 1998. Identifying the interaction between genes and gene products based on frequently seen verbs in Medline abstracts. Genome Inform. Ser. Workshop Genome Inform. 9: 62-71.
Smith, C.M. 1997. The protein kinase resource and other bioinformation resources. Prog. Biophys. Mol. Biol. 71: 525-533.
Stapley, B.J. and Benoit, G. 2000. Biobibliometrics: Information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pac. Symp. Biocomput. 529-540.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. ClustalW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4680.
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., and Higgins, D.G. 1997. The CLUSTAL_X windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25: 4876-4882.
Tusnady, G.E. and Simon, I. 2001. The HMMTOP transmembrane topology prediction server. Bioinformatics 17: 849-850.
Xenarios, I., Salwinski, L., Duan, X.J., Higney, P., Kim, S.M., and Eisenberg, D. 2002. DIP, the Database of Interacting Proteins: A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30: 303-305.
Yasuzawa, Y. and Nomura, K. 1994. Operation Research: Its Technique and Application. pp. 137-54. Corona Publishing Co., Tokyo, Japan.
http://www.wjh.harvard.edu/soc_help/cnxfdgen.pdf; Conexor tag information Web site.
http://www.conexoroy.com/products.htm; Conexor Web site.
http://www.ensembl.org/; ENSEMBL Web site.
http://flybase.bio.indiana.edu/; FlyBase Web site.
http://gdbwww.gdb.org/; GDB Web site.
http://gena.ontology.ims.u-tokyo.ac.jp/search/servlet/gena; GENA Web site. A. Koike, T. Takai, and T. Takagi; a partial database is open to the public.
http://www.dsi.univ-paris5.fr/genatlas/; GENATLAS Web site.
http://www.gene.ucl.ac.uk/hugo/; HUGO Web site.
http://mips.gsf.de/projects/fungi/yeast.html; MIPS Web site.
http://www.ncbi.nlm.nih.gov/Entrez/; NCBI Entrez Web site.
http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html; NCBI locus link Web site.
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/; NCBI taxonomy Web site.
http://www.ncbi.nlm.nih.gov/; NCBI Web site.
http://www.ncbi.nlm.nih.gov/omim/; OMIM Web site.
http://www.sanger.ac.uk/Projects/S_pombe/; PomBase Web site.
http://genome-www.stanford.edu/Saccharomyces/; SGD Web site.
http://sosui.proteome.bio.tuat.ac.jp/sosuiframe0.html; SOSUI Web site.
http://www.expasy.ch/sprot/; SWISS-PROT and TrEMBL Web site.
http://www.wormbase.org/; WormBase Web site.
www.celera.com; Celera Web site.
http://kinasedb.ontology.ims.u-tokyo.ac.jp; Kinase Pathway Database Web site.
http://www.informatics.jax.org; MGI Web site.
http://rgd.mcw.edu; RGD Web site.
http://www.conexoroy.com/lite.htm; Conexor FDG Lite parser information Web site.
Received September 24, 2002; accepted in revised form March 26, 2003.