Genome Res. 17:954-959, 2007
©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
OPEN ACCESS ARTICLE
Resource
The ENCODEdb portal: Simplified access to ENCODE Consortium data
Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
The Encyclopedia of DNA Elements (ENCODE) project aims to identify and characterize all functional elements in a representative chromosomal sample comprising 1% of the human genome. Data generated by members of The ENCODE Project Consortium are housed in a number of public databases, such as the UCSC Genome Browser, NCBI's Gene Expression Omnibus (GEO), and EBI's ArrayExpress. As such, it is often difficult for biologists to gather all of the ENCODE data from a particular genomic region of interest and integrate them with relevant information found in other public databases. The ENCODEdb portal was developed to address this problem. ENCODEdb provides a unified, single point-of-access to data generated by the ENCODE Consortium, as well as to data from other source databases that lie within ENCODE regions; this provides the user a complete view of all known data in a particular region of interest. ENCODEdb Genomic Context searches allow for the retrieval of information on functional elements annotated within ENCODE regions, including mRNA, EST, and STS sequences; single nucleotide polymorphisms; and UniGene clusters. Information is also retrieved from GEO, OMIM, and major genome sequence browsers. ENCODEdb Consortium Data searches allow users to perform compound queries on array-based ENCODE data available both from GEO and from the UCSC Genome Browser. Results are retrieved from a specific genomic area of interest and can be further manipulated in a variety of contexts, including the UCSC Genome Browser and the Galaxy large-scale genome analysis platform. The ENCODEdb portal is freely accessible at http://research.nhgri.nih.gov/ENCODEdb.
With the completion of human genome sequencing, one of the major challenges of genomic biology is to comprehensively identify the structural and functional components encoded in the human genome. To this end, the Encyclopedia of DNA Elements (ENCODE) project aims to identify and characterize all functional elements in a representative chromosomal sample comprising 1% of the human genome (The ENCODE Project Consortium 2004).
The data management challenges of any such consortium-based effort are substantial. Early on in the planning of the ENCODE project, it became apparent that much of the data generated by members of the ENCODE Consortium should be stored in already-existing public repositories in order to maximize the visibility and availability of these data within the biological community. Data that can be directly linked to specific genomic coordinates are available through the UCSC Genome Browser (Hinrichs et al. 2006).
Concurrent with the generation and deposition of experimental data into these allied databases, substantial progress has been made in the development of new tools intended to facilitate the analysis of genomic data sets. The assembly of genome-wide data sets has been facilitated by the introduction of tools such as the UCSC Table Browser, which is intended to retrieve specific subsets of coordinate-based data, and the Genome Alignment and Annotation Databases (GALA), which merge genome annotations with multi-species alignment data (Giardine et al. 2003).
While the ENCODE Consortium is generating large amounts of data related to the targeted 1% of the genome, other significant sources of biological information, accumulated over many years, also contain data that lie within ENCODE regions. Data from these rich sources should be considered as well in order to have a complete view of all known information regarding a particular region of interest. These data include mRNA, EST, and STS sequences from GenBank (Benson et al. 2006).
The ENCODEdb portal was developed to provide a unified, single point-of-access to data generated by the ENCODE Consortium, as well as to data from other source databases that lie within ENCODE regions, regardless of which public database the primary data are housed in. This provides the user a complete view of all known data in a particular region of interest. ENCODEdb users can both browse data in genomic regions of interest and easily assemble custom data sets that can be visualized with the UCSC Genome Browser or used for downstream analysis with tools such as Galaxy. This report focuses on the functional aspects of ENCODEdb, illustrating the kinds of biological insights that can be made using the data generated by the ENCODE Consortium and other sources. The ENCODEdb portal is freely accessible at http://research.nhgri.nih.gov/ENCODEdb.
Genomic Context searches
Genomic Context searches allow the user to retrieve information on functional elements that have been annotated within ENCODE regions, and the data returned by these searches are not limited to data that have been generated by the ENCODE Consortium. As such, the Genomic Context searches are intended to provide the user a compendium of all relevant genomic information that is known about each individual ENCODE region, without requiring the user to query each individual source database separately. Using data provided by UCSC, NCBI, or NCBI GEO, the ENCODEdb database stores the genomic position of the relevant functional elements. Query terms, whether they be gene based (e.g., gene symbol, GenBank accession number) or region based (e.g., ENCODE region, cytological band), are all translated into their genomic coordinates before the search is issued, and all elements that overlap these coordinates are reported. To illustrate some of the types of data that are returned by a Genomic Context query, the gene CFTR will be used as an example. CFTR is the cystic fibrosis transmembrane conductance regulator gene (OMIM:602421) and lies within ENCODE region ENm001. A user would initiate the search by entering "CFTR" as the query term and selecting Gene Symbol from the pull-down menu, as shown in Figure 1A. Note that searches can be done on a variety of terms, including RefSeq mRNA accession number, chromosomal coordinates, cytological band, and UniGene cluster ID; the user does not need to know in advance whether the genomic region of interest actually lies in an ENCODE region, or which ENCODE region it lies in. The user may also select which genome assembly to search (Human July 2003 [hg16] or May 2004 [hg17]). Once the query is submitted, the user is provided a summary of detailed annotations on the CFTR locus, organized as a series of tabs in the results window (Fig. 1B).
Each one of the tabs corresponds to one of the queried source databases, providing links back to the source database so that the user can examine the primary database entries directly. A tab is shown only when data from that source database are available in the region of interest.
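The coordinate-based lookup described above for Genomic Context searches can be illustrated with a minimal Python sketch. It is not the ENCODEdb implementation (which the article describes only as Perl with an Oracle back end); the `Feature` class, the `GENE_COORDS` lookup table, and all names and coordinates below are hypothetical placeholders. The sketch assumes 0-based, half-open intervals and groups overlapping features by source, so that a source appears only when it has data in the region, much like the tabs in the results window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    source: str  # hypothetical source label, e.g. "RefSeq", "EST", "SNP"
    name: str
    chrom: str
    start: int   # 0-based, inclusive (an assumption of this sketch)
    end: int     # 0-based, exclusive


def overlaps(feature, chrom, start, end):
    """True if the feature shares at least one base with [start, end)."""
    return feature.chrom == chrom and feature.start < end and start < feature.end


def genomic_context(features, chrom, start, end):
    """Group overlapping features by source; sources with no hits are omitted."""
    tabs = {}
    for f in features:
        if overlaps(f, chrom, start, end):
            tabs.setdefault(f.source, []).append(f.name)
    return tabs


# Hypothetical stand-in for translating a query term into coordinates.
GENE_COORDS = {"GENE_A": ("chr7", 1000, 5000)}

features = [
    Feature("RefSeq", "NM_EXAMPLE1", "chr7", 900, 1200),
    Feature("SNP", "rs_example", "chr7", 4999, 5000),
    Feature("EST", "EST_example", "chr7", 5000, 5100),  # abuts the region, no overlap
    Feature("STS", "STS_example", "chr8", 1000, 2000),  # different chromosome
]

chrom, start, end = GENE_COORDS["GENE_A"]
result = genomic_context(features, chrom, start, end)
print(result)
```

Translating every query term to coordinates first means that gene-based and region-based queries share a single overlap test.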
Many users are probably familiar with the UCSC Genome Browser, the NCBI MapViewer, and Ensembl for browsing and retrieving genomic annotations. At the time of this writing, the NCBI MapViewer provides no access to ENCODE-specific data, and Ensembl provides only limited views of these data. The GEO and OMIM links that are provided as part of the results of ENCODEdb Genomic Context searches are not currently available through the UCSC Genome Browser. The results provided in the RefSeq, mRNA, EST, SNP, and STS tabs are also available from the UCSC Genome Browser, albeit as a graphic rather than as text. Although UCSC does also provide data in text format through the Table Browser, we believe that the user interface available through ENCODEdb will be more straightforward for many bench biologists. Another important feature of the ENCODEdb Genomic Context searches is that array-based data housed in GEO for a region of interest can be retrieved by simply clicking on the GEO tab, producing the view shown in Figure 1B; this "one-click" method is easier for the user since a region-based search cannot easily be done at the GEO Web site. The tabular view used here provides the user a quick overview of the data in a more readable, compact, and informative format than can be obtained using GEO directly, allowing users to quickly focus on particular array-based data of interest. The results are organized by GEO series (GSE numbers), which are defined as related samples that make up an experiment. There are also links to the appropriate GEO platform (GPL numbers), which describes the list of elements (e.g., oligonucleotide probe sets, cDNAs, or SAGE tags) being assayed or that may be detected and quantified in that experiment. In addition to the summary data stored under each individual tab, users can select the Browser View tab to obtain a graphical view of all annotations in the region of interest.
The user can select to view either a UCSC Genome Browser "default view", with track selection preset, or a view that they can configure based on their own needs. Two Ensembl-based views are also available under the Browser View tab: Ensembl CytoView, which provides a genomic overview of sequence-based features in this region, and Ensembl MultiSpecies view, which provides an alignment of the human region of interest with corresponding regions in other selected organisms. We anticipate providing similar access to the NCBI Map Viewer in the future.
Consortium Data searches
There are two ways in which users can perform a Consortium Data search. If a user has already performed a Genomic Context search and then clicks on the Consortium Data tab, the query term will be "passed through" to the Consortium Data search. Alternatively, a user can perform a Consortium Data search directly by clicking on the Consortium Data link on the ENCODEdb home page (Fig. 1A); this would take the user to the view shown in Figure 2A. Unlike the Genomic Context search, Consortium Data searches can be performed on multiple regions at the same time. Multiple search terms are separated by commas, which are an implicit Boolean OR, so the search term "ENm001, TP53BP1" (as shown in Fig. 2A) would return ENCODE data that fall into either of these two regions. As before, users can issue their query against either the Human July 2003 (hg16) or May 2004 (hg17) assemblies. Queries are conducted against five different data sources.
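The comma-separated, implicit-OR query syntax for Consortium Data searches can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than ENCODEdb's own code: the `REGIONS` table, its coordinates, and the record layout are hypothetical, and intervals are again treated as 0-based and half-open. A record is returned if it overlaps any one of the regions named in the query.

```python
def parse_terms(query):
    """Split a comma-separated query into trimmed, non-empty terms."""
    return [term.strip() for term in query.split(",") if term.strip()]


# Hypothetical term -> (chrom, start, end) table; coordinates are made up.
REGIONS = {
    "REGION_A": ("chr7", 100, 200),
    "GENE_B": ("chr15", 50, 80),
}


def search(query, records, regions=REGIONS):
    """Return ids of records that overlap ANY queried region (Boolean OR)."""
    intervals = [regions[t] for t in parse_terms(query) if t in regions]
    hits = []
    for rec in records:
        if any(
            rec["chrom"] == chrom and rec["start"] < end and start < rec["end"]
            for chrom, start, end in intervals
        ):
            hits.append(rec["id"])
    return hits


records = [
    {"id": "r1", "chrom": "chr7", "start": 150, "end": 160},
    {"id": "r2", "chrom": "chr15", "start": 70, "end": 90},
    {"id": "r3", "chrom": "chr1", "start": 10, "end": 20},
]

print(parse_terms("ENm001, TP53BP1"))
print(search("REGION_A, GENE_B", records))
```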
UCSC Genome Browser
If the user selects UCSC Genome Browser as the target for their query, the user will be taken to a query form similar to that shown in Figure 2A. As with the Genomic Context search, results are organized under a series of tabs, so that the user can easily switch between results from the five different data sources. The chromosomal regions specified by the query are shown above the table with the pull-down menus. The user can now filter what data they wish to view by making selections in the Data Category and Data Submitter pull-downs; the choices under each pull-down dynamically update, showing only valid choices based on any previous selections. The ability to prefilter data is important, since the ENCODE Consortium has generated huge amounts of data; to date, there are 81,563,202 individual data points across 10 cell lines, 17 data categories, and 31 data providers. Simply displaying all of these data at the same time may prove overwhelming to the user, so users can narrow down what data are displayed to those data that are relevant to their own research interests.
As an example, requesting data on DNase I-hypersensitive (DNase HS) sites generated by the Collins laboratory (Crawford et al. 2004
UCSC Table Browser
GEO DataSets and GEO Profiles
GEO Components
Regardless of which of these output options is selected, users are taken to a new page, in which they are asked to select which specific data fields they are interested in. The ability to choose specific data fields depends on the output option selected in the previous step. For example, if Download BED File or either of the UCSC display options was selected, the user can only select a single field, due to the limitations of the file formats themselves. If Download Selected Columns or Send Query to Galaxy was selected, any or all of the data fields can be selected. Choosing Download Selected Columns allows users to identify any differences between the GEO data, which are often provided as replicate assays, and the array-based data provided through the UCSC Genome or Table Browser, which are displayed as an average of the replicate values. Furthermore, the UCSC Table Browser presents transformed data, usually in the form of averages of replicate P-values or binding sites. Data obtained directly from GEO are raw data (i.e., user-normalized for missing or aberrant spots) and are available as mean and median intensities, intensities normalized against background, the log ratio of intensities, flagged data (to identify bad spots), and P-value data. The ability to select specific data fields of interest was built into ENCODEdb since these data are not easily obtained on a per-column basis from the GEO Web site, even though these data are all provided within the source GEO records. The ability to access these types of data is useful for anyone who wishes to evaluate ENCODE (or genome-wide data, for that matter) as it becomes available. The ability to easily create data sets for downstream analysis, as described above, is one of the key features of ENCODEdb, making these data more accessible to the average user.
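The contrast drawn above between per-replicate GEO values and the replicate averages shown at UCSC, and the one-field-per-file limit of BED-style output, can be made concrete with a small Python sketch. Everything here is hypothetical: the probe records, the field name `p_value`, and the choice to place the averaged value in the fifth (score) column. Strict BED expects an integer score from 0 to 1000, so a real export would need to rescale or use a different track format.

```python
from statistics import mean


def to_bed_lines(probes, field):
    """Emit one tab-delimited, BED-style line per probe for a single field.

    Replicate values of the chosen field are collapsed to their mean, since
    each line has room for only one numeric value.
    """
    lines = []
    for p in probes:
        avg = mean(p[field])
        lines.append(
            f"{p['chrom']}\t{p['start']}\t{p['end']}\t{p['name']}\t{avg:.2f}"
        )
    return lines


def replicate_spread(probe, field):
    """Range of the replicate values, which is hidden once they are averaged."""
    values = probe[field]
    return max(values) - min(values)


# Hypothetical probes with replicate measurements for one data field.
probes = [
    {"chrom": "chr15", "start": 100, "end": 150, "name": "probe_1",
     "p_value": [1.0, 2.0, 3.0]},
    {"chrom": "chr15", "start": 200, "end": 250, "name": "probe_2",
     "p_value": [0.5, 1.5]},
]

lines = to_bed_lines(probes, "p_value")
for line in lines:
    print(line)
print(replicate_spread(probes[0], "p_value"))
```

Keeping the replicate columns available, as the Download Selected Columns option does, preserves the spread that the averaged track discards.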
The utility of the GEO Components feature of ENCODEdb can be best illustrated by returning to the example considering DNase I-hypersensitive sites discussed above. Using ENCODEdb, it is simple to ask whether the previously identified DNase HS sites overlap with other experimentally determined promoter elements. Figure 2C shows the results of retrieving, from GEO, the PolII binding sites reported in ChIP-chip analyses, then displaying the resulting data as a UCSC Genome Browser custom track. Four time points are shown (0, 2, 8, and 32 h), with the tracks for 0 and 32 h expanded; the heights of the peaks represent the P-values of the ChIP-chip results. From this view, it is apparent that RNA PolII binds to both positions exhibiting DNase I hypersensitivity and, therefore, that the promoters of both TP53BP1 transcripts, characterized and uncharacterized, have been experimentally verified. The most important aspect of this example is that a user is able to export data out of GEO and visualize them alongside other, disparate types of data found within the UCSC Genome Browser. Without a resource such as ENCODEdb, this would be virtually impossible for anyone but the most expert of users to do.
Concluding remarks
ENCODEdb currently provides unified access to data from the following sources through a publicly available Web front end: the NCBI mRNA Reference Sequences (RefSeq) database, GenBank (for mRNA, EST, and STS sequences), NCBI's dbSNP, the NCBI UniGene database, OMIM, the NCBI GEO, and the UCSC Genome Browser. Queries of Consortium data through ENCODEdb allow display of the data at the UCSC Browser. Results can also be transferred to the Galaxy server for further manipulation and comparison. ENCODEdb is updated at least monthly with new data from NCBI and UCSC. The portal is implemented in Perl with an Oracle database back end, and requires cookies and JavaScript to be enabled in the user's browser. Metadata from various GEO experiments are stored in database tables, with the actual raw data stored as text flatfiles in GEO SOFT format, allowing the data structure flexibility of this format to be maintained. The ENCODEdb portal is freely accessible at http://research.nhgri.nih.gov/ENCODEdb, using any up-to-date Web browser such as Safari or Firefox. Additional information regarding the implementation of ENCODEdb can be found on the ENCODEdb Web site.
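To give a sense of what reading raw data from a GEO SOFT flatfile involves, the following Python sketch parses a tiny, made-up SOFT sample record and then extracts selected columns from its data table. It relies only on the general line conventions of SOFT (`^` for entity lines, `!` for attributes, `#` for column descriptions, and a tab-delimited table between `_table_begin` and `_table_end` markers). The accession, title, probe names, and the `select_columns` helper are hypothetical, and this is not how ENCODEdb itself reads these files.

```python
def parse_soft(text):
    """Parse a minimal SOFT record into (metadata, column notes, table rows)."""
    meta, columns, rows = {}, {}, []
    header, in_table = None, False
    for line in text.splitlines():
        if not line:
            continue
        lower = line.lower()
        if lower.endswith("_table_begin"):
            in_table, header = True, None
        elif lower.endswith("_table_end"):
            in_table = False
        elif in_table:
            cells = line.split("\t")
            if header is None:
                header = cells                      # first table line names the columns
            else:
                rows.append(dict(zip(header, cells)))
        elif line[0] in "^!#" and "=" in line:
            key, value = (part.strip() for part in line[1:].split("=", 1))
            if line[0] == "#":
                columns[key] = value                # column description
            else:
                meta.setdefault(key, []).append(value)  # attributes may repeat
    return meta, columns, rows


def select_columns(rows, wanted):
    """Keep only the requested columns from each table row."""
    return [{name: row[name] for name in wanted} for row in rows]


# A made-up SOFT fragment; all identifiers and values are placeholders.
soft_text = "\n".join([
    "^SAMPLE = GSM_EXAMPLE",
    "!Sample_title = hypothetical ChIP-chip replicate 1",
    "#ID_REF = probe identifier",
    "#VALUE = log ratio",
    "!sample_table_begin",
    "ID_REF\tVALUE",
    "probe_1\t0.75",
    "probe_2\t-0.10",
    "!sample_table_end",
])

meta, columns, rows = parse_soft(soft_text)
print(meta["SAMPLE"], columns)
print(select_columns(rows, ["VALUE"]))
```

Table values are left as strings here; a real pipeline would convert them to numbers and handle missing entries.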
We thank Webb Miller, Ross Hardison, Gretchen Gibney, Elise Feingold, Peter Good, and Laura Liefer for their thoughtful insights and feedback during the development of ENCODEdb. This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.
1 Corresponding author.
E-mail andy{at}nhgri.nih.gov; fax (301) 480-2634. Article is online at http://www.genome.org/cgi/doi/10.1101/gr.5582207
Barrett, T., Suzek, T.O., Troup, D.B., Wilhite, S.E., Ngau, W.C., Ledoux, P., Rudnev, D., Lash, A.E., Fujibuchi, W., and Edgar, R. 2005. NCBI GEO: Mining millions of expression profiles - database and tools. Nucleic Acids Res. 33: D562-D566.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L. 2006. GenBank. Nucleic Acids Res. 34: D16-D20.
Crawford, G.E., Holt, I.E., Mullikin, J.C., Tai, D., Blakesley, R., Bouffard, G., Young, A., Masiello, C., Green, E.D., Wolfsberg, T.G., et al. 2004. Identifying gene regulatory elements by genome-wide recovery of DNase hypersensitive sites. Proc. Natl. Acad. Sci. 101: 992-997.
The ENCODE Project Consortium. 2004. The ENCODE (ENCyclopedia of DNA Elements) Project. Science 306: 636-640.
Giardine, B., Elnitski, L., Riemer, C., Makalowska, I., Schwartz, S., Miller, W., and Hardison, R.C. 2003. GALA, a database for genomic sequence alignments and annotations. Genome Res. 13: 732-741.
Giardine, B., Riemer, C., Hardison, R.C., Burhans, R., Elnitski, L., Shah, P., Zhang, Y., Blankenberg, D., Albert, I., Taylor, J., et al. 2005. Galaxy: A platform for interactive large-scale genome analysis. Genome Res. 15: 1451-1455.
Hamosh, A., Scott, A.F., Amberger, J.S., Bocchini, C.A., and McKusick, V.A. 2005. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33: D514-D517.
Hinrichs, A.S., Karolchik, D., Baertsch, R., Barber, G.P., Bejerano, G., Clawson, H., Diekhans, M., Furey, T.S., Harte, R.A., Hsu, F., et al. 2006. The UCSC Genome Browser Database: Update 2006. Nucleic Acids Res. 34: D590-D598.
Parkinson, H., Sarkans, U., Shojatalab, M., Abeygunawardena, N., Contrino, S., Coulson, R., Farne, A., Lara, G.G., Holloway, E., Kapushesky, M., et al. 2005. ArrayExpress - a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 33: D553-D555.
Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio, R., Edgar, S., Federhen, L.Y., et al. 2006. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 34: D173-D180.
Received June 1, 2006; accepted in revised format August 24, 2006.