Genome Res. 13:1222-1230, 2003 ©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Resources
eVOC: A Controlled Vocabulary for Unifying Gene Expression Data
1 South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa
2 Electric Genetics PTY Ltd., Bellville, South Africa
3 Office of Information Technology, Ludwig Institute for Cancer Research and Swiss Institute of Bioinformatics, Lausanne, Switzerland
4 Genetics and Genomics Research Institute, Imperial College Faculty of Medicine, Hammersmith Hospital, London, W12 0NN, UK
5 Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK
Expression data contribute significantly to the biological value of the sequenced human genome, providing extensive information about gene structure and the pattern of gene expression. ESTs, together with SAGE libraries and microarray experiment information, provide a broad and rich view of the transcriptome. However, it is difficult to perform large-scale expression mining of the data generated by these diverse experimental approaches. Not only are the data stored in disparate locations, but there is frequent ambiguity in the meaning of terms used to describe the source of the material used in the experiment. Untangling semantic differences between the data provided by different resources is therefore largely reliant on the domain knowledge of a human expert. We present here eVOC, a system which associates labelled target cDNAs for microarray experiments, or cDNA libraries and their associated transcripts, with controlled terms in a set of hierarchical vocabularies. eVOC consists of four orthogonal controlled vocabularies suitable for describing the domains of human gene expression data: Anatomical System, Cell Type, Pathology, and Developmental Stage. We have curated and annotated 7016 cDNA libraries represented in dbEST, as well as 104 SAGE libraries, with expression information, and provide this as an integrated, public resource that allows the linking of transcripts and libraries with expression terms. Both the vocabularies and the vocabulary-annotated libraries can be retrieved from http://www.sanbi.ac.za/evoc/. Several groups are involved in developing this resource with the aim of unifying transcript expression information.
Mining of large volumes of transcriptome data is currently frustrated by an inability to relate sequence and descriptive information. In part, this is due to the absence of a common structured vocabulary to describe the source of the biological sample materials.
Recent years have seen a growing trend toward the adoption of ontologies for the management of biological knowledge. In Computer Science, an ontology is defined as an "explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them" (The Free Online Dictionary of Computing http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?query=ontology).
Biological ontologies aim to overcome the semantic heterogeneity commonly encountered in molecular biology databases, and to provide a common terminology for the description of a focused aspect of biology. One such resource, TAMBIS (Stevens et al. 2000
Although several ontologies for the formal description of sample materials exist or are under development (Table 1), these are not suitable for querying gene expression data. For example, clinical ontologies including anatomical, pathological, and developmental stage-specific concepts have been available for some time (ICD-9-CM, SNOMED, GALEN, MeSH), but these have not been widely adopted for describing human gene expression profiles. A major reason why clinical ontologies are not widely used for describing gene expression is that they are extremely detailed and often tangled (Rector et al. 2001
Implementing multiple ontologies with simple concepts in orthogonal domains provides a preferable solution, as it enables users to produce logical ontology cross-products. Cross-products are hybrid ontologies that can be constructed through the combination of simple ontologies. For example, the ICD-9-CM term mentioned above could have been constructed through the combination of terms from an anatomical and a pathological ontology by producing the cross-product of the terms "stomach" and "neoplasm| benign" from the respective ontologies. Ideally, ontologies for gene expression should reflect a level of detail appropriate to the data being classified and the level at which queries are likely to be performed, while simultaneously providing sufficient flexibility to enable regular updating without needing to significantly restructure the hierarchies. For the extensive description of gene expression and to provide maximum flexibility in querying, we have developed eVOC: four orthogonal ontologies that aim to provide an appropriately detailed set of terms for describing the sample source of cDNA and SAGE libraries and labeled target cDNAs for microarray experiments. We have taken a data-driven approach to determining the level of granularity required. We have annotated all publicly available human cDNA and SAGE libraries as extensively as possible. This is achieved by the assignment of terms from each of the four ontologies to the libraries. Initial assignment of terms to libraries was performed computationally, with curators who are domain experts performing assessment of annotation quality and further manual assignment. Where information was lacking in the library record, the original submitters were contacted where possible to provide more extensive information.
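The cross-product idea described above can be illustrated with a minimal Python sketch. The term lists are small, invented fragments (they are not taken from the eVOC vocabularies), and the function name `cross_product` is hypothetical; the point is only that combining two simple, orthogonal term lists yields every compound concept without any of them having to be stored in a single tangled hierarchy.

```python
from itertools import product
from typing import List, Tuple

# Toy fragments of two orthogonal ontologies (illustrative terms only).
anatomy_terms: List[str] = ["stomach", "liver"]
pathology_terms: List[str] = ["normal", "benign neoplasm"]


def cross_product(first: List[str], second: List[str]) -> List[Tuple[str, str]]:
    """Return every (first-term, second-term) pair as a compound concept."""
    return list(product(first, second))


pairs = cross_product(anatomy_terms, pathology_terms)
for anatomy, pathology in pairs:
    print(f"{anatomy} x {pathology}")
```

Two lists of two terms each give four compound concepts; the curated vocabularies themselves only need to hold the four simple terms.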
The most widely used ontology for keywording human SAGE and EST libraries is the CGAP/UniLib vocabulary (ftp://ftp.ncbi.nih.gov/pub/bioannot/info/keys) currently used by the National Cancer Institute to categorize libraries for CGAP (http://www.ncbi.nlm.nih.gov/CGAP/). CGAP provides a single integrated hierarchy of keywords that includes terms from multiple classification domains (including tissues, developmental stage, library preparation, and chemical agents among others). There are many different relationships between parent and child terms in different sections of the hierarchy. eVOC, in contrast, provides completely orthogonal ontologies covering four distinct domains. There is a single implied type of relationship between the terms within each of the eVOC ontologies. The structure of the CGAP ontology enables rapid keyword searching, whereas the eVOC data structure, by incorporating the rigorous separation of classification terms into orthogonal domains and the formalization of relationships between terms, allows for a degree of computer reasoning to be applied. This facilitates a wide range of query types. For example, a comparison of eVOC and UniLib querying shows clearly that both eVOC and UniLib allow querying for multiple terms combined with "AND" (the intersection set), and yield comparable results in terms of the libraries returned. However, UniLib is unable to support more complex queries incorporating "OR" and "NOT", which are possible with eVOC. eVOC therefore provides users with greater flexibility, as more complex biological queries can be formulated. Whereas this may be a simple implementation issue, it is one that directly affects the user interaction with the data. A major distinction between CGAP and eVOC is that the CGAP hierarchy is cancer specific by design. The terms included are therefore those of interest in cancer, whereas eVOC is designed for more general application. 
Specifically, CGAP lacks the comprehensive pathology terminology that is necessary for a broadly applicable human expression ontology.
The design and creation of the expression ontologies is distinct from the annotation of cDNA and SAGE libraries by use of each of the ontologies. These processes will be discussed separately.
Development of a Data Structure for Expression Ontologies
The expression ontologies are independent pure hierarchies (or trees). In a pure hierarchy, each node has only one parent but may have multiple children. Each node is associated with a specific concept in the knowledge domain represented by the hierarchy through the association of each node with one or more synonymous terms. For example, the terms "nasal" and "nose" are synonyms attached to a single node in the ANATOMY ontology. In these pure hierarchies, there is only a single type of relationship between the nodes in each hierarchy, although the nature of the relationship is not defined explicitly. For each ontology, the nature of the expression domain imposes an implicit type on the relationship between the nodes. For instance, in the Anatomical System ontology, the relationships are of the "part-of" type. In the Cell Type and Pathology ontologies, they are of the "subclass" type, and in the Developmental Stage ontology, the relationships are of the "is-a" variety.
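A pure hierarchy with synonym-bearing nodes, as described above, can be sketched in a few lines of Python. This is an illustrative data structure only, not the eVOC implementation: the class names, the numeric identifiers, and the intermediate term "respiratory system" are assumptions made for the example. Only the "nose"/"nasal" synonym pair comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    """One concept: a numeric id, its synonyms, one parent, many children."""
    node_id: int
    synonyms: List[str]
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)


class PureHierarchy:
    """A tree in which every node except the root has exactly one parent."""

    def __init__(self, root_term: str) -> None:
        self._next_id = 1
        self._by_term: Dict[str, Node] = {}
        self.root = self._make_node([root_term], None)

    def _make_node(self, synonyms: List[str], parent: Optional[Node]) -> Node:
        # A term may be attached to one node only, so lookups stay unambiguous.
        for term in synonyms:
            if term.lower() in self._by_term:
                raise ValueError(f"term {term!r} is already attached to a node")
        node = Node(self._next_id, list(synonyms), parent)
        self._next_id += 1
        for term in synonyms:
            self._by_term[term.lower()] = node
        if parent is not None:
            parent.children.append(node)
        return node

    def add(self, synonyms: List[str], parent_term: str) -> Node:
        """Attach a new concept under the single parent named by parent_term."""
        return self._make_node(synonyms, self.find(parent_term))

    def find(self, term: str) -> Node:
        """Resolve any synonym (case-insensitively) to its node."""
        return self._by_term[term.lower()]

    def path_to_root(self, term: str) -> List[str]:
        """Walk the single-parent chain from a term up to the root."""
        node: Optional[Node] = self.find(term)
        path: List[str] = []
        while node is not None:
            path.append(node.synonyms[0])
            node = node.parent
        return path


anatomy = PureHierarchy("anatomical system")
anatomy.add(["respiratory system"], "anatomical system")  # hypothetical term
anatomy.add(["nose", "nasal"], "respiratory system")
print(anatomy.path_to_root("nasal"))
```

Because each node has a single parent, the path from any term to the root is unique, which is what keeps queries over such a tree simple.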
Pure hierarchies have a number of advantages over the more complex data structures often used to represent ontologies (Rector et al. 2001
In cases in which terms appear to have more than one parent, two options are available: migration to a directed acyclic graph (DAG), or untangling of the hierarchy to yield a pure hierarchy (Fig. 1). To handle multi-parent terms and different parent-child relationships, the GO project (Gene Ontology Consortium 2001
The disadvantage of maintaining untangled orthogonal ontologies is that the volume of work involved in curation increases linearly with the number of hierarchies. It is therefore necessary to strike a balance between keeping the number of ontologies manageable, and representing relationships in as fine grained a fashion as possible. The sort of queries the ontologies are required to accommodate dictates where this balance is found. In other words, the ontology design should be data driven. Each of the terms in the ontologies has a numeric identifier that uniquely identifies the term and that can be used as an unambiguous database cross-reference. Definitions of each of the terms are to be provided as part of the ongoing development. The source of each definition will be made available, along with the definition.
Development of the Four Expression Ontologies
Anatomical System Ontology
Cell Type Ontology
Because various cell types are represented across many anatomical systems, cell types could have been included in the Anatomical System ontology, with cell type terms having multiple parents. Instead, we have separated the Anatomical System and Cell Type ontologies in order to maintain pure trees. This separation provides users with greater flexibility, as they can query on specific cell types, regardless of the anatomical location, and can also perform combined queries across Cell Type and Anatomical System terms to yield results for a cell type in a specified location.
Developmental Stage Ontology
Pathology Ontology
Species-Specific Considerations
There is significant value in being able to identify and relate equivalent tissues in different species, and to compare gene expression patterns in these tissues. Although it is not clear that it will always be possible to identify these equivalent tissues in the model organisms, the production of species-specific ontologies to form the basis of these comparisons is the first step. To facilitate interoperability between species-specific ontologies, these need to be in a compatible, accessible format (Bard and Winter 2001
Curation of the eVOC Ontologies
Groups that choose to modify the ontologies for their own purposes are encouraged to contribute their modifications and corrections to the curators for inclusion. A mailing list, evoc{at}sanbi.ac.za, has been established for this purpose.
Annotation of cDNA and SAGE Libraries Using eVOC
cDNA and SAGE libraries are collections of the transcribed sequences expressed in the biological sample material from which the library is prepared. Information about the source of the sample is stored with the library information. The amount and quality of the source information provided varies depending on the source of the library. Libraries submitted to public databases are described by use of highly inconsistent terminology. Here, curators have manually translated the unstructured terms used in the library records into standardized terms selected from the four ontology domains, and have applied these to each of the libraries. Ideally, an ontology-based form would guide submitters in selecting appropriate terms for the description of their libraries. This would reduce the curation required and facilitate querying of the public databases in a manner not currently possible. Each of the cDNA and SAGE libraries was assigned computationally to the most specific possible terms in each of the four ontologies. Manual curation and annotation of the computational assignments was then performed. Libraries are annotated with terms in each of the four hierarchies if sufficient information is available in each of the ontology domains. Annotation of a library in one ontology is completely independent of annotation in another ontology. Each annotation is transferred from the library information provided by the original submitter. Whereas the curators exercise domain expertise in assigning libraries to specific terms within each hierarchy, they derive no new information. This process is therefore largely objective. Evidence for annotations is primarily based on the original submission record for both cDNA and SAGE libraries. In most instances, annotation of data from existing databases is performed following the development of ontologies. Appropriate terms are assigned to data points on the basis of information already present in the database. 
This post-facto approach results in an often-imperfect mapping between data and terms, as much of the sample information is not provided in the original submission and is therefore lost. The Ontologies Working Group of the MGED Consortium is building ontologies for use in data submission forms for the microarray databases. This will allow subsequent database queries to take advantage of the standardized terms provided by the ontologies. The implementation of a similar ontology-based data entry system for the public nucleotide databases would be of immense value for the submission of cDNA and SAGE library information. The clone libraries annotated here are generated from biological sample materials representing specific expression states (e.g., infant lung). These libraries represent a snapshot collection of the transcripts expressed in the original sample. The transcripts expressed in the original biological sample can therefore be sequenced as ESTs from the clone library. By mapping the clone libraries to a set of controlled terms (the ontologies), all of the ESTs from each clone library can be transitively linked to these same standardized terms in the relevant ontology via their association with their parent clone library. In the case of ESTs, we maintain a database for the bidirectional accession to clone library lookup, which in turn allows us to link vocabulary terms directly to ESTs (Fig. 4).
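The transitive link from ESTs to ontology terms via their parent clone library can be sketched as two dictionary lookups. The accessions, library names, and annotations below are invented for illustration (only the "infant lung" expression state echoes the text), and the function names are hypothetical; this is not the schema of the eVOC database.

```python
from typing import Dict, List

# Library -> one term per ontology domain (hypothetical annotations).
LIBRARY_TERMS: Dict[str, Dict[str, str]] = {
    "lib_A": {"Anatomical System": "lung", "Developmental Stage": "infant"},
    "lib_B": {"Anatomical System": "liver", "Pathology": "neoplasia"},
}

# EST accession -> parent clone library (hypothetical accessions).
EST_TO_LIBRARY: Dict[str, str] = {
    "EST0001": "lib_A",
    "EST0002": "lib_A",
    "EST0003": "lib_B",
}


def invert(est_to_library: Dict[str, str]) -> Dict[str, List[str]]:
    """Build the reverse lookup (library -> ESTs) for bidirectional access."""
    library_to_ests: Dict[str, List[str]] = {}
    for est, library in est_to_library.items():
        library_to_ests.setdefault(library, []).append(est)
    return library_to_ests


def terms_for_est(est: str) -> Dict[str, str]:
    """An EST inherits every term assigned to its parent library."""
    return dict(LIBRARY_TERMS[EST_TO_LIBRARY[est]])


def ests_for_term(ontology: str, term: str) -> List[str]:
    """All ESTs whose parent library carries the given term."""
    library_to_ests = invert(EST_TO_LIBRARY)
    hits: List[str] = []
    for library, terms in LIBRARY_TERMS.items():
        if terms.get(ontology) == term:
            hits.extend(library_to_ests.get(library, []))
    return sorted(hits)


print(terms_for_est("EST0002"))
print(ests_for_term("Anatomical System", "lung"))
```

Annotation is done once per library, so the cost of curation does not grow with the number of ESTs sequenced from it.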
We have annotated 7016 human cDNA and 104 human SAGE libraries with the eVOC expression ontologies. These represent all of the human cDNA and SAGE libraries that were available publicly in April 2002. The amount of information provided for each library varies widely. In some cases, extensive information about the anatomical system, developmental stage, and pathological state of the sample source is provided, whereas in other cases, only a subset of this information is provided. The majority of the cDNA libraries (94.8%) have the information required for classification in the Anatomical System ontology, and most have information required for annotation with Pathology and Developmental Stage terms (Table 2). Where libraries could not be annotated, this was because the library information provided by submitters did not capture the relevant information. Because cDNA and SAGE libraries are largely derived from whole organs and tissues rather than from individual cell types, the majority of the libraries (94.2%) could not be annotated using the Cell Type ontology.
Using the Ontologies
This simplistic query methodology can be the basis of an enormously powerful query infrastructure if the ability to perform basic set algebra (union and intersection) operations on the returned sets of cDNA libraries is used. Consider, for instance, the query "liver AND neoplasia" (Fig. 5). A query on liver resolves to a node in the Anatomical System ontology, which in turn results in a set of cDNA libraries (all of the libraries associated with the liver node and all its subnodes). Similarly, a query on neoplasia returns the set of cDNA libraries associated with a subtree of the Pathology ontology. The combined query "liver AND neoplasia" returns the intersection of these two sets of cDNA libraries. In other words, it will return only libraries that were constructed from neoplastic liver samples.
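The "liver AND neoplasia" query can be written directly with Python sets. The sketch below is a toy model under stated assumptions: the child terms, library names, and helper functions are all invented for the example and do not reflect the actual eVOC trees or interface. Each term resolves to the libraries attached to its node and all of its subnodes, and the Boolean operators then map onto set operations.

```python
from typing import Dict, List, Set

# Toy hierarchy fragments: term -> child terms (illustrative only).
ANATOMY_CHILDREN: Dict[str, List[str]] = {"liver": ["left lobe of liver"]}
PATHOLOGY_CHILDREN: Dict[str, List[str]] = {"neoplasia": ["carcinoma"]}

# Term -> libraries annotated directly with that term (hypothetical).
ANATOMY_LIBS: Dict[str, Set[str]] = {
    "liver": {"lib1", "lib3"},
    "left lobe of liver": {"lib2"},
}
PATHOLOGY_LIBS: Dict[str, Set[str]] = {
    "neoplasia": {"lib1"},
    "carcinoma": {"lib2", "lib4"},
}


def subtree(term: str, children: Dict[str, List[str]]) -> Set[str]:
    """Return the term together with all of its descendants."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen


def libraries_for(term: str, children: Dict[str, List[str]],
                  annotations: Dict[str, Set[str]]) -> Set[str]:
    """Libraries attached to the node for `term` or to any of its subnodes."""
    libraries: Set[str] = set()
    for member in subtree(term, children):
        libraries |= annotations.get(member, set())
    return libraries


liver = libraries_for("liver", ANATOMY_CHILDREN, ANATOMY_LIBS)
neoplasia = libraries_for("neoplasia", PATHOLOGY_CHILDREN, PATHOLOGY_LIBS)

liver_and_neoplasia = liver & neoplasia   # intersection: AND
liver_or_neoplasia = liver | neoplasia    # union: OR
liver_not_neoplasia = liver - neoplasia   # difference: NOT

print(sorted(liver_and_neoplasia))
```

In this toy data, the library annotated only with the child terms is still returned by the parent-level query, which is the behaviour the subtree expansion is meant to provide.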
Example Applications
By simply curating dbEST using the eVOC ontologies, users are provided with the ability to perform queries on the basis of location, state, and timing of expression on human ESTs or cDNA libraries. By querying with terms from any combination of the ontologies, users can select both libraries and transcripts from the database on the basis of their expression patterns. Moreover, the differential expression of genes or gene isoforms on the basis of EST data can be determined swiftly and accurately by providing a list of EST accessions and analyzing the distribution of terms attached to each EST. Laboratory-based applications of eVOC include the selection of clone libraries relevant to laboratory research projects; for example, a simple query that returns the total number of publicly available retinal cDNA libraries yields 22 results (Fig. 6). To select suitable libraries for the comparison of gene expression in adult and fetal retina, further refined queries can be used to show that 7 libraries are derived from adult retina, 3 are derived from fetal retina, and 12 libraries do not have information about the developmental stage from which the retinal tissue was isolated.
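Analyzing the distribution of terms attached to a list of EST accessions amounts to a simple tally. The following sketch assumes a lookup from EST accession to its inherited terms; the accessions and annotations are invented, the function name `term_distribution` is hypothetical, and the "unannotated" label is an assumption used here to count ESTs that lack a term in the chosen ontology.

```python
from collections import Counter
from typing import Dict, Iterable

# EST accession -> terms inherited from its library (hypothetical data).
EST_TERMS: Dict[str, Dict[str, str]] = {
    "EST1": {"Anatomical System": "retina", "Developmental Stage": "adult"},
    "EST2": {"Anatomical System": "retina", "Developmental Stage": "adult"},
    "EST3": {"Anatomical System": "retina", "Developmental Stage": "fetal"},
    "EST4": {"Anatomical System": "retina"},  # no developmental stage recorded
}


def term_distribution(est_accessions: Iterable[str],
                      est_terms: Dict[str, Dict[str, str]],
                      ontology: str) -> Counter:
    """Count how often each term of one ontology occurs in an EST list."""
    counts: Counter = Counter()
    for est in est_accessions:
        term = est_terms.get(est, {}).get(ontology)
        counts[term if term is not None else "unannotated"] += 1
    return counts


isoform_ests = ["EST1", "EST2", "EST3", "EST4"]
stage_counts = term_distribution(isoform_ests, EST_TERMS, "Developmental Stage")
print(stage_counts.most_common())
```

Comparing such tallies for the EST lists of two isoforms gives a first indication of whether one of them is skewed toward a particular tissue, stage, or pathological state.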
Similarly, a query for cDNA libraries derived from pancreatic tissue yields 31 results. To determine how many of these are pancreatic islet libraries, a second query is performed and yields a total of 10 pancreatic islet libraries that have source descriptions as diverse as "Human insulinoma" and "HR85 islet". Additionally, the ability to identify cDNA and SAGE libraries from similar expression states provides access to an increased resource for data mining, and allows users to identify and analyze genes that are differentially expressed both in their expression location and their expression level. We have used the system to identify neoplastic and normal cDNA libraries, and have identified differential gene expression and alternative splicing in these expression states (H. Brentani, O.L. Caballero, A.A. Camargo, A.M. da Silva, W.A. da Silva, E. Dias Neto, M. Grivet, A. Gruber, P.E.M. Guimaraes, W. Hide, et al., in prep.).
To illustrate the power of expression ontologies in determining the tissue specificity of alternatively spliced transcripts, we have analyzed the data produced by Xu et al. (2002
We submitted the isoform-specific EST lists provided for a subset of the genes identified by Xu et al. (2002
By implementing a set of orthogonal, hierarchical controlled vocabularies, eVOC provides a detailed and flexible system for the detection of expression-state-specific splice-forms. eVOC can be used to identify not only tissue-specific spliceforms, but also splicing that is specific to certain developmental stages, cell types, and pathological states, or any combination of these states.
Future Applications
Availability and Interfaces (Editing and Graphical Browsing)
Although genomic information is not integrated directly into eVOC, users have the ability to integrate the expression information within eVOC with human genome information through the transitive mapping of ESTs (generated from the clone libraries that are mapped to eVOC) to the genome. This functionality is being provided through the integration of eVOC with the EnsemblMart data mining resource that is part of the Ensembl Project at EBI. The eVOC ontologies will be available in the January 2003 release of the EnsemblMart database (http://www.ensembl.org/Homo_sapiens/martview). EnsemblMart is a data retrieval tool that provides users with the ability to build queries of the biological data (including genome sequence and annotation data) present in the Ensembl genome database. Because ESTs have been mapped to the genome by Ensembl, eVOC terms can be linked transitively (via their parent clone library, which is mapped to the eVOC ontologies) to the genomic sequence. As a result, users will be able to perform expression-based queries in the context of genomic data and will be able to extract transcripts and genes on the basis of the location, state, and timing of their expression. A graphical interface for querying eVOC has been developed by Electric Genetics (Fig. 3) and is available from info{at}egenetics.com. This interface provides users with the ability to view the ontologies, browse the hierarchical trees, and perform set operations on the annotated cDNA library data. Using this interface, it is possible to obtain the list of cDNA libraries or ESTs returned by a query, or to provide a list of libraries or EST accessions and obtain the associated expression profile. The interface will be extended to include curation facilities, simplifying the user's ability to modify the existing eVOC ontologies or create de novo ontologies of their own.
In addition, Electric Genetics has developed an API that provides the ability to develop custom software to interface eVOC with external data repositories and to perform complex ontological queries on that data.
Summary
The simple orthogonal ontologies are flexible and extensible, making them applicable to real data and allowing them to be both machine and human readable. The ontologies are under continual development; existing ontologies are extended and altered, appropriate new ontologies are added, and the annotation of expression libraries is regularly updated. Both the ontologies and the annotated expression libraries are publicly available and able to be adopted freely, modified, and integrated for both novel and existing applications. The wide number of potential applications makes eVOC a valuable resource for the biologist.
We thank Cathal Seoighe for comments and suggestions. This work had financial support from the South African Government through the Department of Arts, Culture, Science, and Technology initiated Innovation Fund Program, grant 32146 (J.V., D.O., G.G., T.H., and W.H.), and the South African National Research Foundation (J.K.). The work was supported in part by funds from the European Commission ([GIFT consortium QLG2-CT-1999-00546] M.M. and D.S.) and the Wellcome Trust. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.985203.
6 Present address: Molecular Genetics/Fugu informatics, Institute of Molecular and Cell Biology, Singapore.
7 Corresponding author. [Supplemental material is available online at www.genome.org.]
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25-29.
Bard, J. and Winter, R. 2001. Ontologies of developmental anatomy: Their current and future roles. Brief. Bioinform. 2: 289-299.
Gene Ontology Consortium. 2001. Creating the gene ontology resource: Design and implementation. Genome Res. 11: 1425-1433.
Gray, H.L., Bannister, L.H., Williams, P.L., Collins, P., and Berry, M.M. 1995. Gray's Anatomy. 38.
Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30: 42-46.
Karp, P.D., Riley, M., Paley, S.M., and Pellegrini-Toole, A. 2002a. The MetaCyc database. Nucleic Acids Res. 30: 59-61.
Karp, P.D., Riley, M., Saier, M., Paulsen, I.T., Collado-Vides, J., Paley, S.M., Pellegrini-Toole, A., Bonavides, C., and Gama-Castro, S. 2002b. The EcoCyc database. Nucleic Acids Res. 30: 56-58.
Kemp, G. and Gray, P. 2002. Modelling biological data in hierarchies. Tutorial: Intelligent systems for molecular biology. Edmonton, Alberta, Canada.
Rector, A.L., Wroe, C., Rogers, J., and Roberts, A. 2001. Untangling taxonomies and relationships: Personal and practical problems in loosely coupled development of large ontologies. K-CAP'01. 139-146.
Stevens, R., Baker, P., Bechhofer, S., Ng, G., Jacoby, A., Paton, N.W., Goble, C.A., and Brass, A. 2000. TAMBIS: Transparent access to multiple bioinformatics information sources. Bioinformatics. 16: 184-185.
Xu, Q., Modrek, B., and Lee, C. 2002. Genome-wide detection of tissue-specific alternative splicing in the human transcriptome. Nucleic Acids Res. 30: 3754-3766.
ftp://ftp.ncbi.nih.gov/pub/bioannot/info/keys; Ontology for keywording human SAGE and EST libraries.
http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?query=ontology; The Free Online Dictionary of Computing.
http://www.ensembl.org/Homo_sapiens/martview; EnsemblMart is a data retrieval tool which provides users with the ability to build queries of the biological data (including genome sequence and annotation data) present in the Ensembl genome database.
http://www.informatics.jax.org/searches/anatdict_form.shtml; The Mouse Anatomical Dictionary, an extensive mouse-specific expression ontology.
http://www.ncbi.nlm.nih.gov/CGAP/; The Cancer Genome Anatomy Project (CGAP).
http://www.sanbi.ac.za/evoc/; Ontologies and associated expression data describing human anatomical systems, cell types, pathologies and developmental stages.
www.ana.ed.ac.uk/anatomy/database/humat; A human developmental anatomy ontology.
www.cbil.upenn.edu/anatomy.php3; A human anatomical ontology.
www.mcis.duke.edu/standards/termcode/icd9/1tabular.html; The World Health Organization's ICD-9-CM system for the classification of morbidity and mortality information.
http://www.biobase.de/pages/products/cytomer.html; Cytomer, a human developmental anatomy ontology.
http://www.cbil.upenn.edu/EpoDB/release/version_2.2/controlled.vocab.html; EPOdb, a human anatomy, developmental stage and cell type ontology.
http://www.ncgr.org/genex/; GeneX, a human gene expression ontology.
http://www.nlm.nih.gov/mesh/meshhome.html; MeSH, a clinical ontology.
http://www.nlm.nuh.gov/research/umls/umlsmain.html; UMLS, a clinical ontology.
http://www.opengalen.org; GALEN, a clinical ontology.
http://www.snomed.org/main.html; SNOMED, a clinical ontology.
http://www.cdc.gov/nchs/about/otheract/icd9/abticd9.htm; The World Health Organization's ICD-9-CM system for the classification of morbidity and mortality information.
Received November 12, 2002; accepted in revised form February 25, 2003.