Genome Res. 17:960-964, 2007
©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
OPEN ACCESS ARTICLE
Resource
A framework for collaborative analysis of ENCODE data: Making large-scale analyses biologist-friendly
Center for Comparative Genomics and Bioinformatics, Huck Institutes of the Life Sciences, Penn State University, University Park, Pennsylvania 16802, USA
The standardization and sharing of data and tools are the biggest challenges of large collaborative projects such as the Encyclopedia of DNA Elements (ENCODE). Here we describe a compact Web application, Galaxy2ENCODE, that effectively addresses these issues. It provides an intuitive interface for the deposition and access of data, and features a vast number of analysis tools including operations on genomic intervals, utilities for manipulation of multiple sequence alignments, and molecular evolution algorithms. By providing a direct link between data and analysis tools, Galaxy2ENCODE makes it possible to address biological questions that are beyond the reach of existing software. We use Galaxy2ENCODE to show that the ENCODE regions contain >2000 unannotated transcripts that are under strong purifying selection and are therefore likely functional. We also show that the ENCODE regions are representative of the entire genome by estimating the rate of nucleotide substitution and comparing it to published data. Although each of these analyses is complex, none takes more than 15 min from beginning to end. Finally, we demonstrate how new tools can be added to Galaxy2ENCODE with almost no effort. Every section of the manuscript is supplemented with QuickTime screencasts. Galaxy2ENCODE and the screencasts can be accessed at http://g2.bx.psu.edu.
Analysis of data generated by The ENCODE Project Consortium (2004) In this study, we demonstrate the utility of our system with examples using ENCODE data (the utility of our system is not limited to ENCODE). We show two complex analyses that can be conducted by using our system in <15 min. In the first example, we define and analyze all unannotated expressed sequence tags (ESTs) in ENCODE regions. We show that over 2000 ESTs do not correspond to any annotated genes, yet show a strong signature of purifying selection, indicating possible function. In the second example, we estimate the rate of nucleotide substitutions in ENCODE regions and demonstrate that it is consistent with genome-wide estimates. The two analyses are designed as "cookbook" examples for two distinct audiences. The first analysis is geared toward researchers studying the structure and function of the human genome. The second example is for researchers working in the area of evolutionary genomics. Finally, we show how easy it is to add new functionality to the Galaxy2ENCODE toolbox and to use Galaxy2ENCODE as a resource for sharing different analysis tools. This paper is supplemented with screencasts, which are short QuickTime movie clips. Each section of Results and Discussion features a screencast. The screencasts can be viewed directly from the main Galaxy2ENCODE Web site (http://g2.bx.psu.edu) under the heading "Screencasts."
Galaxy2ENCODE interface and ENCODE data portal (Screencasts 1 and 2)
Galaxy2ENCODE allows experimental biologists to retrieve and analyze data within a single unified interface. For this purpose, Galaxy2ENCODE features a history system that stores data uploaded by the user as well as the results of all analyses. The concept of history was previously successfully deployed by our group (Giardine et al. 2005).
To facilitate data exchange among different ENCODE groups during the analysis process, we implemented a local data repository at http://encode-upload.g2.bx.psu.edu. The repository is a Web application designed to (1) provide a user-friendly interface for data upload, (2) standardize naming of data files according to ENCODE guidelines, (3) automatically fragment the data into ENCODE analysis partitions, and (4) store the data for direct access through Galaxy2ENCODE (http://encode.g2.bx.psu.edu) and ftp (ftp://encode:encode@g2.bx.psu.edu). See Methods for a description of the naming conventions and partition process.
Galaxy2ENCODE tools (Screencasts 4–14)
Analysis of intronic, intergenic, and intertwined ESTs (Screencasts 15–17)
Here we define and characterize the 9191 transcripts that lie outside annotated genes within ENCODE regions. These are of considerable interest, as some may represent genes missed during the annotation process. We used GENCODE annotation as the source of gene data (http://genome.imim.es/gencode/). Genes are first predicted computationally and then experimentally verified using techniques such as RT-PCR, RACE, and direct sequencing of the products. As such, the gene predictions of GENCODE are the most reliable. In the following analysis, we define "genes" as the union of GENCODE Known Genes, GENCODE Putative Genes, and GENCODE pseudogenes annotations frozen during the Second ENCODE Workshop (University of California Santa Cruz, November 2005). Using genomic coordinates, we identified all ESTs that map outside GENCODE genes. We call such ESTs Non-GENCODE ESTs. Non-GENCODE ESTs belong to three categories (Fig. 2): intronic, intergenic, and intertwined (or interleaved, as suggested by Chen and Stein 2006). The analysis takes <15 min to complete. See Screencast 15 and the Methods section for a step-by-step explanation of the procedure. Briefly, we first defined a set that includes all Non-GENCODE ESTs (Fig. 3A–D). Then, we classified Non-GENCODE ESTs into intronic, intergenic, and intertwined (Fig. 3E,F). Finally, we computed descriptive statistics as shown in Table 1.
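The classification step can be illustrated with a few interval comparisons. The following Python sketch is not the Galaxy2ENCODE implementation; it assumes half-open coordinates on a single chromosome and assumes working definitions for the three categories (Fig. 2 is not reproduced here): an EST is intergenic if its span touches no gene span, intronic if it lies within a gene span without enclosing any exon, and intertwined if it avoids all exons block by block yet its span encloses at least one exon. The function names `overlaps`, `span`, and `classify_est` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def span(blocks: List[Interval]) -> Interval:
    """Smallest interval covering all blocks (exons) of a transcript."""
    return (min(s for s, _ in blocks), max(e for _, e in blocks))


def classify_est(est_blocks: List[Interval], genes: List[List[Interval]]) -> str:
    """Classify one EST against a list of genes, each given as its exons."""
    # An EST block that touches any annotated exon means the EST is annotated.
    for exons in genes:
        if any(overlaps(b, x) for b in est_blocks for x in exons):
            return "annotated"
    est_span = span(est_blocks)
    touching = [exons for exons in genes if overlaps(est_span, span(exons))]
    if not touching:
        return "intergenic"
    # No exon falls inside the EST span, so the EST sits within one intron.
    if not any(overlaps(est_span, x) for exons in touching for x in exons):
        return "intronic"
    # The EST span encloses an exon although no EST block overlaps one.
    return "intertwined"
```

Counting the labels returned for every EST gives the kind of per-category totals that a table of descriptive statistics would summarize.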
Having defined Non-GENCODE ESTs in ENCODE regions, we can now use Galaxy2ENCODE to look into the biology of these transcripts. How many Non-GENCODE ESTs correspond to missing protein-coding genes? What fraction of the Non-GENCODE ESTs are under purifying selection? Is there a significant overlap between Non-GENCODE ESTs and transcriptional evidence produced by alternative methods? These are just some of the questions that can be easily answered with versatile Galaxy2ENCODE tools.
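One way to approach the overlap question is a simple randomization test. The Python sketch below is an illustration under stated assumptions, not the procedure used in this paper: it counts how many query intervals (for example, Non-GENCODE ESTs) overlap any target interval (for example, transcribed regions found by another method), then compares that count with counts obtained after moving each query to a random position inside the same region. The function names, the uniform shuffling scheme, and the default of 1000 iterations are all assumptions made for the example.

```python
import random
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def count_overlapping(queries: List[Interval], targets: List[Interval]) -> int:
    """Number of query intervals sharing at least one base with any target."""
    return sum(
        any(q[0] < t[1] and t[0] < q[1] for t in targets) for q in queries
    )


def shuffled(queries: List[Interval], region: Interval,
             rng: random.Random) -> List[Interval]:
    """Move each query to a random position inside region, keeping its length."""
    out = []
    for start, end in queries:
        length = end - start
        # randint is inclusive, so the moved interval never leaves the region.
        new_start = rng.randint(region[0], region[1] - length)
        out.append((new_start, new_start + length))
    return out


def overlap_p_value(queries: List[Interval], targets: List[Interval],
                    region: Interval, n_iter: int = 1000, seed: int = 0) -> float:
    """Empirical P-value: share of shuffles with at least the observed overlap."""
    rng = random.Random(seed)
    observed = count_overlapping(queries, targets)
    hits = sum(
        count_overlapping(shuffled(queries, region, rng), targets) >= observed
        for _ in range(n_iter)
    )
    # Add-one correction keeps the estimate away from an impossible zero.
    return (hits + 1) / (n_iter + 1)
```

The pairwise scan is quadratic in the number of intervals, which is acceptable for a sketch; a sorted sweep or an interval tree would be the usual choice for region-scale data.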
Screencast 15
Screencast 16
If Non-GENCODE ESTs represent biologically relevant transcripts, there should be a significant overlap between them and transcribed regions of the genome confirmed with other methods, such as transcribed fragments (transfrags) produced by the Affymetrix group (Kampa et al. 2004
Estimating mammalian substitution rates
Since ENCODE regions have the highest depth of annotation, it is tempting to extrapolate their properties to the entire genome. However, is this legitimate? In other words, do ENCODE regions represent an unbiased sample of the genome? One way to answer this question is to compare evolutionary parameters of the ENCODE region with genome-wide estimates published elsewhere. We used ancestral repeats (ARs) (Hardison et al. 2003
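As a rough illustration of how a substitution rate can be estimated from aligned sequence such as ancestral repeats, the Python sketch below applies the Jukes-Cantor (JC69) correction, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. JC69 is used here only because it is the simplest correction; it is an assumption of this example and should not be read as the substitution model used in this study. The function name `jukes_cantor_distance` is hypothetical.

```python
import math


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Substitutions per site between two aligned sequences under JC69."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    valid = set("ACGT")
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip gaps and ambiguous bases; only A/C/G/T columns are compared.
        if a in valid and b in valid:
            compared += 1
            differences += a != b
    if compared == 0:
        raise ValueError("no comparable columns")
    p = differences / compared
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    # The correction accounts for multiple substitutions at the same site.
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Applying the same function to alignments drawn from ENCODE regions and to genome-wide alignments would give two estimates that can be compared directly, which is the spirit of the comparison described above.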
Galaxy2ENCODE as a community resource for distributing tools (Screencasts 18 and 19)
ENCODE analysis groups have designed several innovative software tools that can be of great use to the rest of the genomic community. Galaxy2ENCODE can be used to provide unified, simple, and user-friendly interfaces for these tools. Adding tools does not require any knowledge about the internal operation of Galaxy2ENCODE. The entire tool deployment process consists of downloading a software distribution from http://g2.bx.psu.edu, installing it (see the 3-min Screencast 18 that explains all steps of the installation process), and performing the two steps described in Supplemental Materials (also see Screencast 19).
Conclusions
Galaxy2ENCODE is a completely new compact implementation that combines the latest open-source technologies with ideas previously developed by our group (Giardine et al. 2005
We thank David Haussler and Jim Kent for their continuing support of the project and the members of the Center for Comparative Genomics and Bioinformatics at Penn State for their input. Roderic Guigo, France Denoeud, Julien Lagarde, and Robert Castelo provided critical comments during software testing. Special thanks to Michael O'Connor for editing the wiki page content. This work is supported by funds provided by the Eberly College of Science, Huck Institutes of the Life Sciences, at Penn State University; NSF DBI grant 0543285 to A.N.; NIH R01 HG002238 to W.M.; and NIH R01 GM072264 to K.M.
1 Corresponding author.
E-mail anton{at}bx.psu.edu; fax (814) 863-6699. [Supplemental material is available online at www.genome.org and http://g2.bx.psu.edu.] Article is online at http://www.genome.org/cgi/doi/10.1101/gr.5578007
Axelsson, E., Smith, N.G., Sundstrom, H., Berlin, S., and Ellegren, H. 2004. Male-biased mutation rate and divergence in autosomal, Z-linked and W-linked introns of chicken and turkey. Mol. Biol. Evol. 21: 1538–1547.
Chen, N. and Stein, L.D. 2006. Conservation and functional significance of gene topology in the genome of Caenorhabditis elegans. Genome Res. 16: 606–617.
Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H., Helt, G., et al. 2005. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308: 1149–1154.
The Chimpanzee Sequencing and Analysis Consortium. 2005. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437: 69–87.
The ENCODE Project Consortium. 2004. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306: 636–640.
Giardine, B., Riemer, C., Hardison, R.C., Burhans, R., Elnitski, L., Shah, P., Zhang, Y., Blankenberg, D., Albert, I., Miller, W., et al. 2005. Galaxy: A platform for interactive large-scale genome analysis. Genome Res. 15: 1451–1455.
Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren, E.J., Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E., et al. 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428: 493–521.
Hardison, R.C., Roskin, K.M., Yang, S., Diekhans, M., Kent, W.J., Weber, R., Elnitski, L., Li, J., O'Connor, M., Kolbe, D., et al. 2003. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 13: 13–26.
Kampa, D., Cheng, J., Kapranov, P., Yamanaka, M., Brubaker, S., Cawley, S., Drenkow, J., Piccolboni, A., Bekiranov, S., Helt, G., et al. 2004. Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. 14: 331–342.
King, D.C., Taylor, J., Elnitski, L., Chiaromonte, F., Miller, W., and Hardison, R.C. 2005. Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences. Genome Res. 15: 1051–1060.
Lindblad-Toh, K., Wade, C.M., Mikkelsen, T.S., Karlsson, E.K., Jaffe, D.B., Kamal, M., Clamp, M., Chang, J.L., Kulbokas III, E.J., Zody, M.C., et al. 2005. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438: 803–819.
Pond, S.L., Frost, S.D., and Muse, S.V. 2005. HyPhy: Hypothesis testing using phylogenies. Bioinformatics 21: 676–679.
Rodriguez, F., Oliver, J.L., Marin, A., and Medina, J.R. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol. 142: 485–501.
Siepel, A. and Haussler, D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J. Comput. Biol. 11: 413–428.
Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A.S., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., et al. 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15: 1034–1050.
Yang, Z., Goldman, N., and Friday, A. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11: 316–324.
Received June 1, 2006; accepted in revised form August 15, 2006.