|
|
|
|
Vol. 12, Issue 1, 203-214, January 2002
METHODS
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |
ABSTRACT |
|---|
|
|
|---|
Functional characterizations of thousands of gene products from many species are described in the published literature. These discussions are extremely valuable for characterizing the functions not only of these gene products, but also of their homologs in other organisms. The Gene Ontology (GO) is an effort to create a controlled terminology for labeling gene functions in a more precise, reliable, computer-readable manner. Currently, the best annotations of gene function with the GO are performed by highly trained biologists who read the literature and select appropriate codes. In this study, we explored the possibility that statistical natural language processing techniques can be used to assign GO codes. We compared three document classification methods (maximum entropy modeling, naïve Bayes classification, and nearest-neighbor classification) to the problem of associating a set of GO codes (for biological process) to literature abstracts and thus to the genes associated with the abstracts. We showed that maximum entropy modeling outperforms the other methods and achieves an accuracy of 72% when ascertaining the function discussed within an abstract. The maximum entropy method provides confidence measures that correlate well with performance. We conclude that statistical methods may be used to assign GO codes and may be useful for the difficult task of reassignment as terminology standards evolve over time.
| |
INTRODUCTION |
|---|
|
|
|---|
The remarkably rapid emergence of high-throughput methods for
acquiring information about sequences and function of
genes provides a wealth of valuable new data. These methods include high-throughput gene expression measurement (Schena et al. 1995
; Chee
et al. 1996
), yeast two-hybrid screens (Uetz et al. 2000
), genome
sequencing (Cole et al. 1998
; Adams et al. 2000
), randomized gene
disruption (Ross-Macdonald et al. 1999
; Tissier et al. 1999
; Winzeler
et al. 1999
), single nucleotide polymorphism detection (Cargill et al.
1999
; Halushka et al. 1999
), and bulk biochemical functional assays
(Martzen et al. 1999
). Efficient interpretation of these data is
challenging because the number and diversity of genes exceed the
ability of any single investigator to track the complex relationships
established by the data sets.
To provide some standards for describing gene function, investigators
have developed controlled vocabularies for annotation. The vocabularies
include a pioneering classification for Escherichia coli gene
function (Riley 1993
), the Munich Information Center for Protein
Sequences (MIPS) classification (Mewes et al. 2000
), and Gene Ontology
(GO) Consortium's recent widespread effort across multiple organisms
(Ashburner et al. 2000
). These vocabularies contain a set of codes
associated with specific genetic attributes and functions. The GO is a
hierarchically arranged set of codes that permits multiple inheritance;
it is organized into three broad components: molecular function,
cellular location, and biological process (see Fig. 1). The vocabulary
is fluid and consistently undergoes revision.
Unfortunately, annotating genes with these controlled vocabulary codes is a labor-intensive task. An expert inspects the literature (and, in principle, other available data) associated with each gene to determine the appropriate function code. It is likely that one-time annotation will not be sufficient; as our knowledge of biology increases and expands into new areas, the vocabularies will undergo refinement and coding may need to be repeated.
The emergence of powerful methods for analyzing text raises the
possibility that gene annotation can be facilitated using natural
language processing (NLP) techniques. Investigators have annotated
genes with informative keywords (Andrade and Valencia 1997
; Shatkay et
al. 2000
) and sought out specific relationships within text, such as
macromolecular interactions (Hishiki et al. 1998
; Craven and Kumlien
1999
; Ng and Wong 1999
; Proux et al. 2000
; Thomas et al. 2000
; Stephens
et al. 2001
). Some work has focused on the assignment of predefined
codes to genes. Eisenhaber and Bork (1999)
developed a rule-based
system to identify protein localization from SWISS-PROT
records. Tamames et al. (1998)
developed a system to assign three broad
terms
Energy, Communication, and Information
to genes from database annotations.
In this study we developed a method to assign GO codes to genes using statistical NLP techniques. We built a document classifier based on the maximum entropy principle to associate abstracts with GO codes. Then we annotated each gene by combining the GO code classifications from all of their abstracts using a weighted voting scheme. Such a method should reduce the time and labor necessary for gene annotation.
The NLP techniques we employed fall under the heading of supervised
machine learning (Manning and Schutze 1999
); the text in the
unclassified abstract is compared to training examples that have been
previously identified as relevant to specific GO codes. The
unclassified text is assigned to categories based on similarities with
the training examples. Ideally, experts should compile the requisite
training examples. However, this is a labor-intensive task. To expedite
the process for the purposes of this study we used human-assigned MeSH
major headings (Hutchinson 1998
). Some GO codes correspond well to
specific headings; for other codes we devised ad hoc queries. In
general, reliable human annotations will not be available in analyzing
text, thus highlighting the need for a general purpose method that
relies solely on the unannotated text.
Maximum entropy modeling is effective in generic text classification
tasks (Ratnaparkhi 1997
; Manning and Schutze 1999
; Nigam et al. 1999
).
Models that classify documents can be characterized by their entropy.
Low entropy models depend on making many distinctions when classifying
documents and can suffer from overinterpretation of the training data.
High entropy models make fewer distinctions but do not take full
advantage of the signal within the training data. Maximum entropy
methods are based on the assumption that the best models are those with
the highest entropy that are still consistent with the training data.
One advantage of maximum entropy classification is that in addition to assigning a classification, it provides a probability of each assignment being correct. These probabilities can be helpful in combining multiple gene annotation predictions. If they are reliable measures of prediction confidence, they can be leveraged in a voting scheme; the influence of low confidence predictions can be mitigated when combined with other high confidence predictions.
We conducted experiments to annotate Saccharomyces cerevisiae
genes with codes from a subset of GO. We chose this organism because
many of its genes have manually curated GO annotations that can be used
as a gold standard. We used two gene-document sets for testing. One set
was a high quality list of PubMed citations hand-annotated by the
curators of the Saccharomyces Genome Database (SGD; Cherry et al.
1998
). The other set consisted of literature associated with sequence
homologs of yeast genes.
We employed two types of performance measures in this study. To evaluate document classification, we used accuracy; to evaluate gene annotation, we used precision and recall. Accuracy is the percentage of predictions on a document test set for which the classifier prediction was correct. True positives are the genes that truly have the function and were correctly assigned the annotation. Precision is the percentage of annotated genes that are true positives. Recall is the percentage of genes that truly have the function that are true positives; recall is equivalent to sensitivity.
In this study we (1) evaluated the performance of document classifiers to obtain genetic functions and (2) evaluated gene annotation from literature. Because the crux of our annotation strategy is a document classifier, we compared its accuracy against two other types of classifiers. After establishing the effectiveness of a maximum entropy classifier, we evaluated the classifier's probabilistic estimates as robust confidence estimates of prediction. Finally we used a voting scheme to combine document classifications into gene annotations. We estimated precision and recall to evaluate gene annotations in yeast. Because our classifications of documents have reliable confidence estimates, our annotations of genes should also. As we choose higher confidence cutoff values, we can often achieve better precision because we are more certain of the predictions we make but at the cost of lower recall because we likely miss low confidence correct annotations.
| |
RESULTS |
|---|
|
|
|---|
Construction of the Training and Test Corpora
We constructed training and test corpora of documents for the 21 GO
codes by searching PubMed for each code's corresponding MeSH heading
and title words (see Fig. 1). The queries and the number of abstracts
per GO code are listed in Table
1A. We split the results
into three sets based on publication date; documents published before
2000 constituted the training set, documents published in 2000 constituted the test2000 set, and documents published in 2001 constituted the test2001 set. A few of the documents are relevant to
more than one GO code (see Table 1B). Table
2 lists the properties of the two data sets
of abstracts associated with S. cerevisiae genes.
|
|
|
Comparing Document Classification Algorithms
We compared the classification accuracy of two different classifier
families,
naïve Bayes and nearest-neighbor
to maximum entropy
classification. We also examined different parameter settings for each
of the classifiers, such as vocabulary size. We trained each classifier
on the training set described above and fit their parameters by
maximizing performance on the test2000 data set. The results of the
classification trials on the test2000 data set are summarized in Table
3A. An ideal classifier
would obtain 100% accuracy.
|
|
For maximum entropy classification trials, we reported the highest accuracy over the 200 generalized iterative scaling (GIS) iterations for different vocabulary sizes. Based on these results, we chose to stop at iteration 186 with 100 words/code for maximum entropy classification. Although 500 words/code perform slightly better, it is less robust than 100 words. Either doubling to 200 words or splitting to 50 words does not significantly affect performance; however, going from 500 to 750 words degrades the performance on the test2000 set by more than a full percent. The best performance for naïve Bayes is with a vocabulary of 500 words; the best performance for nearest-neighbor is with 50 neighbors and 5000 words.
Table 3B lists the performance of each of the classifiers on the
smaller held out test2001 data set after parameter optimization on the
test2000 data set. Maximum entropy has the best performance (72.12%
accuracy) compared to nearest-neighbor (61.54%) and naïve Bayes
(59.62%). Results of maximum entropy classification for individual
categories are reported in Table 4.
|
Assigned Maximum Entropy Probabilities Can Be Used to Rank Predictions
In this study we established that although the document classifier may misclassify a document, the correct class is almost always assigned a high probability and is contained in the top four predictions. Maximum entropy classification assigns a probability to each of the 21 codes for each abstract. A good classifier would assign the correct classification a high probability; a perfect classifier would assign the correct classification the highest probability. For abstracts in test2000 we sorted the predicted GO codes by probabilities and calculated how often the nth prediction was correct (Fig. 2). The top prediction was correct 72.8% of the time, as shown in Table 3. Predictions that were ranked greater than four rarely contained accurate predictions. The accuracy of the prediction drops off gradually with its rank.
|
Because some abstracts in the test sets (~9%) have multiple relevant GO codes, we were concerned that the accuracy results were inflated by those abstracts. We eliminated the abstracts in test2000 with more than a single correct prediction and again calculated how often the nth prediction was correct (Fig. 2). The accuracy of the top prediction in this case is reduced slightly to 72.0%. The second and third predictions are somewhat less accurate. The accuracy of the predictions are not dramatically altered.
|
Accuracy of Prediction Tracks with Confidence Scores
To define a reliable voting scheme, it was critical to establish robust confidence estimates for correct document classification. We tested the probability of the predicted GO code as a measure of confidence for the prediction. With reliable confidence scores, we expect that document classifications with high confidence scores are likely to have been classified correctly. To assess whether the reliability of the classifier prediction tracked with the confidence score, we separated the test2000 predictions into 10 groups by confidence score. For those predictions with the highest confidence scores (ranging from .9 to 1) the classifiers prediction was correct 92.89% of the time (see Fig. 3). For predictions with lower confidence scores, the accuracy was proportionately less; the algorithm appears to estimate low confidence predictions conservatively.
|
Predicting Gene Function from Curated Abstracts
We evaluated our efforts to annotate yeast genes based on abstracts from SGD. With the voting scheme described in Methods, the maximum entropy classifier predicted gene function for the 835 genes with three or more associated abstracts and relevant true assignments from the GO Consortium. Even though many of the 21 categories we studied do not apply to yeast, we still included them in our calculations. Different thresholds of confidence can be used as a cutoff to assign an annotation to a gene. Typically, higher confidence values obtain higher precision at the cost of a lower recall. We computed the precision and recall for different confidence thresholds for each of the categories and plotted them in Figure 4. Ideally, precision remains 100% at all levels of recall. Table 5A lists the sensitivity (or recall) for each of the different GO codes.
|
|
|
|
Predicting Gene Function from SWISS-PROT Abstracts Obtained by BLAST
Similarly, we assigned functions to yeast genes based on the abstracts obtained from SWISS-PROT records of homologs found with BLAST. Table 5B describes the prediction results for all 695 genes with three or more associated abstracts and relevant GO annotations. Table 5C describes the prediction results for all 353 genes with 3 to 24 abstracts and GO annotations. We noted improved performance when considering those genes with fewer abstracts. We calculated and plotted precision and recall for the genes described in Table 5C for selected codes in Figure 5.
|
| |
DISCUSSION |
|---|
|
|
|---|
Document Classification Is Successful
We experimented with three different classifiers and found that
maximum entropy is the most effective method, with an accuracy of
72.8%. Our findings are consistent with recent reports within the
statistical natural language processing literature (Nigam et al. 1999
;
Rosenfeld 2000
). Frequently in statistical natural language modeling
tasks, there is insufficient data to estimate adequately the large
number of parameters involved. Naïve Bayes compensates for this
limitation by making a strong independence assumption that the words
are associated with codes independent of each other (Manning and
Schutze 1999
). This is untrue in text classification tasks, in which
many dependencies exist between words. Maximum entropy relaxes this
assumption by allowing differential weights for different word-code
associations, as in equations 2 and 3.
Maximum entropy assigned 72.8% of codes correctly over 21 categories. Moreover, the correct classification is present consistently in its top four choices (Fig. 2) and performance is reasonably consistent across all categories (Table 4). The performance is reproducible on the held-out test2001 data set; this indicates we did not overoptimize parameters (see Table 3B).
It should be recognized that the classification of documents is not
exact; there are often ambiguities. Funk and Reid (1983)
examined 760 biomedical articles that had been assigned MeSH headings by two
experts. They found that the major MeSH headings, controlled vocabulary
terms that represent the central concepts of the document, were
assigned with only a 61.1% consistency. This study illustrates the
subjective nature of document classification; the same sort of
inconsistency may fundamentally limit performance on documents analyzed
in our study.
Maximum Entropy Classifies Documents with Reasonable Estimates of Classification Accuracy
When classifications are imperfect, it is important that they be associated with confidence scores. The maximum entropy classifier assigns probabilities to each possible prediction. After sorting predictions by probability score, we observe that a code's prediction accuracy matches its probability rank (Fig. 2). Thus, the probability of the predicted code can be used as a measure of confidence on the prediction (see Fig. 3).
Gene Annotation Success Depends Critically on Training Set Quality
Annotation efforts of genes from the curated set of abstracts yielded uneven results. The precision-recall performance for some of the GO codes is reliable (Fig. 4A), whereas others are passable (Fig. 4B) and some are poor (Fig. 4C). At one extreme, for the code meiosis, we obtained the ideal precision-recall plot; a 100% precision was achieved at all levels of recall. In other words, all of the correct genes were annotated. Invasive growth (16.7% precision at 100% recall), sporulation (100% precision at 11.1% recall), and stress response (9.1% precision at 76.9% recall) were the three codes that were difficult to annotate genes with.
Because the classifier performs consistently across all categories when
the testing and training sets were similar (Table 4), the discrepant
performance is explained by how well a training set represents the
biological processes. In general, the availability of a major MeSH
heading corresponding to the code ensures the quality of our PubMed
search-based training set. Three of the four reliably predicted codes,
plotted in Figure 4A, had a single corresponding MeSH term; two of the
five codes plotted in Figure 4B had a single corresponding MeSH term.
Of the three codes in Figure 4C, one code
invasive growth
had only a
single gene and thus a biased sample of text. For the other two codes,
sporulation and stress response there were no corresponding MeSH terms
for either, and ad hoc strategies were fabricated to create the article sets. These strategies may not have been effective.
An ideal training set should be constructed by experts. The National
Library of Medicine relies on experts to read the articles to assign
MeSH headings (Bachrach and Charen 1978
). These headings are likely to
have a low false-positive rate (high specificity) but may suffer from
false negatives (low sensitivity) since experts assign some correct
headings but may miss others. Our reliance on MeSH terms therefore
assures that we get good training data when a MeSH heading corresponds
directly to a GO code. However, the query strategy is limited when
there are no appropriate MeSH terms. Better training sets consisting of
more specific paragraphs from whole-text articles and abstracts
selected under expert supervision would address many of these difficulties.
Granularity of Code May Be an Issue
In this study we chose an arbitrary level in the GO hierarchy to test. We found that although some of the categories are quite specific at this level, others, such as metabolism are quite general. For such categories, our approach would be more successful had we further broken down such large categories into smaller more specific ones.
Using BLAST to Rapidly Annotate Sequences
The strategy shown in this study has promise for whole-genome annotation. We filtered out all yeast sequences so the annotations are based only on sequences from other species. This strategy may be applied for rapid first-pass annotation of recently sequenced genes.
The limitation of this approach is that the references associated with protein sequences are often uninformative. We observed that the predictions on those genes with fewer references are actually more reliable (Table 5C) and this may be related to a diversity of language and topics associated with large sets of articles that obscure the more basic functions of a gene. As more resources such as the articles manually associated and maintained by SGD are made available, it will become possible to exploit a better set of references for sequences.
Annotation Is a Challenging Recurring Problem
Although a careful one-pass annotation is useful, it is certain that
as our knowledge of biology develops, controlled vocabularies such as
GO will grow and change. In fact, during this study, GO underwent considerable structural alterations. In addition, entirely new controlled vocabularies may emerge to serve the specific needs of
specialized scientific communities as genomic-scale methodologies become more prevalent. For these reasons, annotation should not be
considered a one-time task. We anticipate that genes will be annotated
repeatedly. The key is to maintain an organized, machine-readable repository of articles for each gene so that they may be mined and
annotated as necessary. This underscores the wisdom of the approach
taken by SGD to maintain a set of high quality curated references
(Cherry et al. 1998
). We anticipate that larger corpuses of full text
in the future will become available for analysis (through efforts such
as PubMed Central, for example [Roberts et al. 2001
]). We anticipate
that the data sets will provide further useful information for
extraction using techniques based on statistical and linguistic analysis.
| |
METHODS |
|---|
|
|
|---|
Functional Classifications from Gene Ontology
Our work focused on the biological process codes from the GO.
Prediction of gene process from literature seemed to be most critical,
as sequence signals in the gene product might be directly predictive of
molecular function (Krogh et al. 1994
; Huang and Brutlag 2001
) and
cellular localization (Horton and Nakai 1997
; Emanuelsson et al. 2000
).
We chose gene functions that were the direct descendants of cellular
maintenance and growth and cell communication, the biological process
categories most relevant to single-cell organisms (see Fig. 1).
In total we used 21 gene function categories (see Table 1A). We omitted some categories due to difficulties in precisely defining the associated literature. These included cell recognition, budding, cell-size and -shape control, cell-volume regulation, pseudohyphal growth, and pH regulation. We included the response-to-external-stimulus category with the related stress-response category.
Creating Data Sets for Functional Annotation
Construction of a Training Corpus for Document Classification
We sought to create a corpus containing
1000 PubMed abstracts
relevant to each functional category. These comprised training sets for
the classification algorithms. The best approach to this would be
careful examination of many articles by qualified experts. However,
obtaining a large volume of abstracts in this manner is very difficult.
Instead, we used MeSH term headings and title words to query PubMed for
relevant abstracts (Bachrach and Charen 1978Test Sets to Validate Document Classification
To examine the ability of the document classification strategy, we constructed two independent test sets using the same queries as above but with later publication dates. Abstracts from the year 2000 made the test2000 test set; abstracts from 2001 made the test2001 test set. Because the titles were sometimes used to select the articles, we omitted the title from the document when testing. The test data sets are described in Table 1.Test Sets to Simulate Genome-Scale Annotation
We created two data sets of abstracts, summarized in Table 2, associated with S. cerevisiae genes. Each gene was linked to a small set of abstracts using two strategies. The first strategy took advantage of the curated abstracts for S. cerevisiae genes maintained by the Saccharomyces Genome Database at Stanford University (courtesy of Kara Dolinski, SGD; Cherry et al. 1998Document Preprocessing
We used the abstracts and title fields from the PubMed records for all data sets except for test2000 and test2001. From these documents, we found 63,992 unique tokens by tokenizing on white space, punctuation, and common nonalphanumeric characters, such as hyphens and parentheses. From these, we excluded stopwords, which we defined as tokens that appeared in
4 or
10,000 documents. This left a total
of 15,741 unique words. Then we represented all documents as 15,741 dimensional vectors of word counts.
Machine Learning Approaches
We compared performance of maximum entropy classification to two standard document classification approaches: naïve Bayes and nearest-neighbor algorithms. All algorithms were trained on the same training set. We optimized the parameters from the classification algorithms based on the performance on the test2000 corpus and reserved the test2001 corpus for an unbiased evaluation.
Maximum Entropy Classification
The motivation behind this classification strategy is to model training data with a probability distribution that satisfies certain constraints while remaining as close to a uniform distribution as possible. A full description of the method is outside the scope of this paper but has been provided elsewhere (Ratnaparkhi 1997
|
(1) |
|
(2) |
i's are feature weights, and
Z(d) is a normalization constant:
|
(3) |
i weight is selected so that the
following constraint on the probability density is satisfied: the
expectation of fi must equal its observed
frequency in the training data. This expectation should be calculated
over the true distribution of documents. However, this distribution is
unknown, so we estimate the expectation for the feature empirically
with the training documents D.
|
(4) |
is that satisfy
equation 4 (Ratnaparkhi 1997
2 measure
described below.
Naïve Bayes Classification
We perform naïve Bayes classification as a comparison benchmark. In naïve Bayes, the probability of each word appearing in a document of a certain class is estimated directly from the training data (Manning and Schutze 1999
|
(5) |
C is the number of documents within
the class c, and s is a pseudocount to compensate for
unseen events. I(w,d) is an indicator
function that is 1 if the word w is in the document
d, 0 otherwise. Here we set s = 0.4. Once all of these
probabilities are obtained, we can estimate the probability that an
unknown document belongs to a class
|
(6) |
2 criteria described below.
Nearest-Neighbor Classification
We also performed nearest-neighbor classification as a benchmark. In nearest-neighbor classification, a distance metric is employed to calculate the distance between the word vector of an unclassified abstract in the test set and each of the abstracts in the training set (Manning and Schutze 1999
|
(7) |
Feature Selection: Choosing Words for Vocabularies
In all the above descriptions, we experimented with different vocabulary sizes. For the naïve Bayes and nearest-neighbor classifiers, we used a
2 distribution test to identify the
words whose distribution is most skewed across all 21 GO codes in the
training set (Manning and Schutze 1999
2 values.
Because of its formulation, we used a slightly different strategy for
the maximum entropy classifier. As discussed above, features were
defined relative to a code and a word. We use the
2 test
to find the words that are most unevenly distributed when comparing
abstracts relevant to a code to other abstracts. A word that scores
high against a code is used with that code to define a feature. We took
only the words with the highest scoring
2 values.
Voting Scheme
The voting scheme takes classifications of individual abstracts associated with a gene and combines them into a single gene classification. Maximum entropy classification provides the probabilities of a document's relevance to each of the 21 codes. The ad hoc parameter fr is the expected fraction of associated abstracts that should discuss a function if it is relevant to the gene. Here we selected a value of one-third for fr in all experiments; ideally a specific fr should be selected for each function separately depending on its prevalence in the literature. If N is the number of abstracts associated with the gene, analysis of each abstract with maximum entropy classification obtains N probability values for each GO code. We averaged the top ceil(fr*N) probabilities for each code to score the code's relevance to the gene. This score ranges between 0 and 1; higher code scores indicate greater relevance to the gene. Genes with scores above a predetermined cutoff are assigned the code; the cutoffs were varied to create precision-recall plots.Validation
Validation consisted of comparing performances on three tasks. The first, and most straightforward, is the classification of the test2000 and test2001 data set documents. The second is gene annotation based on the corpus of curated abstracts hand chosen to be descriptive of yeast genes. The final, and most indirect test, consists of gene annotation based on abstracts referenced with sequences homologous to yeast genes.
Document Classification
We applied the different document classifiers described above trained on the same training set to predict the subject matter of the documents in test2000 and test2001. Accuracy of classification was calculated for the different classifiers using different parameters. The test2000 set was used to optimize parameters; performance was measured on the test2001 set to insure that overfitting did not occur.Annotating Yeast Genes with Curated Articles
Using the maximum entropy classifier we assigned GO codes to each gene, based on the gene's curated abstracts from SGD. We made predictions only on those genes with three or more associated abstracts. To validate our predictions, we used the annotations assigned by the GO Consortium (Ashburner et al. 2000Annotating Yeast Genes with Articles Obtained by BLAST Search
We annotated genes based on the maximum entropy classifications of the abstracts obtained by BLAST. We made predictions only on those genes with three or more abstracts. To validate our predictions we also used the annotations assigned by the Gene Ontology Consortium here. In total, 695 genes were annotated by the GO Consortium and also had the requisite number of abstracts. This procedure was repeated eliminating those genes with >24 abstracts. There were 353 such genes annotated by the GO Consortium. We calculated the precision and recall at various thresholds for each of the annotations using the GO Consortium assignments as a gold standard.Computation
We conducted all computations on either a SUN Ultra 4 server with four 296 MHz UltraSPARC-II processors or a SUN Enterprise E3500 server with eight 400 MHz UltraSPARC-II processors. PubMed database queries and data preprocessing were implemented using perl (Schwartz and Christianson 1997| |
ACKNOWLEDGMENTS |
|---|
This work was supported by NIH LM06244 and GM61374, NSF DBI-9600637, and a grant from the Burroughs-Wellcome Foundation. S.R. was supported by NIH GM-07365, P.D.S. by NIH CA-88480, and J.T.C. by a Stanford Graduate Fellowship. We thank Mike Liang for assistance in code optimization and Kara Dolinski of SGD for a curated data set of gene-associated articles.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
3 Present address: Stanford Medical Informatics, 251 Campus Drive, MSOB X-215, Stanford, CA 94305, USA.
4 Corresponding author.
E-MAIL russ.altman{at}stanford.edu; FAX (650) 725-7944.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.199701.
| |
REFERENCES |
|---|
|
|
|---|
Received June 6, 2001; accepted in revised form October 16, 2001.
This article has been cited by other articles:
![]() |
I. N. Sarkar Biodiversity informatics: organizing and linking information across the spectrum of life Brief Bioinform, September 1, 2007; 8(5): 347 - 357. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. A. Lussier and Y. Liu Computational Approaches to Phenotyping: High-Throughput Phenomics Proceedings of the ATS, January 1, 2007; 4(1): 18 - 25. [Abstract] [Full Text] [PDF] |
||||
![]() |
J.-B. Lee, J.-j. Kim, and J. C. Park Automatic extension of Gene Ontology with flexible identification of candidate terms Bioinformatics, March 15, 2006; 22(6): 665 - 670. < |