Genome Res. 13:2195-2202, 2003
©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Resources
GANESH: Software for Customized Annotation of Genome Regions
1 Department of Computing, Imperial College, London SW7 2AZ, UK
2 Medical Research Council Prion Unit/Department of Neurodegenerative Diseases, Institute of Neurology, London WC1N 3BG, UK
3 School of Medicine, Imperial College, London W6 8RP, UK
4 School of Biotechnology and Biomolecular Science, University of New South Wales, Sydney 2052, Australia
GANESH is a software package designed to support the genetic analysis of regions of human and other genomes. It provides a set of components that may be assembled to construct a self-updating database of DNA sequence, mapping data, and annotations of possible genome features. Once one or more remote sources of data for the target region have been identified, all sequences for that region are downloaded, assimilated, and subjected to a configurable set of standard database-searching and genome-analysis packages. The results are stored in compressed form in a relational database and are updated automatically on a regular schedule, so that the most up-to-date versions are always immediately available. A Java front-end, executed as a stand-alone application or as a web applet, provides a graphical interface for navigating the database and for viewing the annotations. There are facilities for importing and exporting data in the format of the Distributed Annotation System (DAS), enabling a GANESH database to be used as a component of a DAS configuration. The system has been used to construct databases for about a dozen regions of human chromosomes and for three regions of mouse chromosomes.
One of the underpinning aims of the Human Genome Project is to provide the resources to support genetic analysis of human conditions and disorders. This aim is incomplete in part because neither the finished DNA sequence of the whole of the human genome has yet been established, nor has the task of identifying all of the genes within even the available DNA sequence been completed (Lander et al. 2001
Presently, there are several partially independent sources of annotated human genomic sequences that include Ensembl (Hubbard et al. 2002 In contrast to the requirements of databases of record, genetic research places rather different constraints upon the annotation and analysis of genomic data. Modern genetic research generally takes a positional cloning approach, frequently using whole-genome scans for linkage in appropriately constructed pedigree collections to identify regions likely to contain a variant gene(s) that predisposes to the condition or disorder under study. Such approaches generally identify regions ranging from a few megabases (in the case of monogenic disorders) to tens of megabases (in the case of multifactorial traits). The challenge for the genetic researcher is then to identify the disease-susceptibility variant or variants within this region.
Genetic analysis software and gene identification software as represented by databases of record might, in time, become congruent. At present, this is not the case. There are three main distinguishing features. First, although gene identification is a key element of both types of software, the ultimate goal of genetic analysis software is to derive an exhaustive list of genes and gene-like objects within a specified region, so that these can be subjected to experimental analysis to identify sequence variants that might be correlated with disorder or condition state. Accuracy of gene prediction, especially the elimination of false positives, is of lesser importance because experimental analyses can be deployed readily to validate the in silico predictions (Shoemaker et al. 2001
These three design differences make the databases of record cumbersome and restricted in their utility for specifically genetic analysis. To overcome these limitations, we have developed a specialist software package (named GANESH), designed in explicit recognition of the differing goals and requirements of geneticist and genomicist. GANESH is a set of software components that may be assembled to construct a new self-updating database providing annotation for a specified region of human (or other) genomic sequence. Sequence and other relevant data for the target region are gathered from various distributed data sources, assimilated, and subjected to a range of database-searching and genome-analysis programs. The results are stored in the database in compressed form and updated on a regular schedule, so that they are always available immediately in their most up-to-date form. A front-end in Java, executed as a stand-alone application or as a web applet, provides a graphical interface for navigating the database and for visualizing the genome features detected. There are utilities for importing and exporting data in a variety of formats, including those of DAS, the Distributed Annotation System (Dowell et al. 2001 At the time of writing, GANESH has been used to construct databases for about a dozen regions of human chromosomes and for three regions of mouse chromosomes. Example databases and further details can be found at http://zebrafish.doc.ic.ac.uk. In this work, we describe the structure of GANESH and its components, and demonstrate its role in the genetic analysis of several regions of the human genome.
GANESH: Overview
A GANESH application has the following main components:
Figure 1 provides an overview of the system structure.
For construction of a new application, GANESH is focused upon a region of the genome (human or other) by first identifying DNA markers or genomic positions that flank the region of genetic interest. These markers, in turn, are used to define a set of DNA clones that span the interval and that have been, or are being, sequenced. Several databases can be used to select these clones, including the UCSC golden path (Kent and Haussler 2001
All sequence data from the target region is thereafter downloaded by the assimilation module and processed using a range of (standard) genome analysis tools. The standard configuration presently includes RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html), Genscan (Burge and Karlin 1997
BLAST searching (Altschul et al. 1990 The results of all of these computations are compressed and stored in a standard relational database (specifically, the MySQL system (http://www.mysql.com), which we chose because it is used widely, and is freely available in the public domain). Scripts are provided to reconstruct the original form from the compressed form whenever required. For illustration, Figure 2 shows the compressed form of results of BLAST searches. Links to less frequently used data sources, or data sources requiring little computation (such as publication databases, OMIM, some gene expression collections) are provided as standard WWW links.
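The store-compressed, reconstruct-on-demand pattern described above can be illustrated with a minimal sketch. It makes several assumptions that are not from the GANESH implementation: SQLite stands in for MySQL so that the example is self-contained, generic zlib compression stands in for the compact GANESH encoding shown in Figure 2, and the table, column, and function names are hypothetical.

```python
import sqlite3
import zlib


def open_store(path=":memory:"):
    """Open a results store. SQLite is an assumption standing in for MySQL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        " clone_id TEXT NOT NULL,"
        " program TEXT NOT NULL,"
        " report BLOB NOT NULL,"
        " PRIMARY KEY (clone_id, program))"
    )
    return conn


def store_result(conn, clone_id, program, report_text):
    """Compress a raw analysis report and store it keyed by clone and program.

    zlib is a generic stand-in, not the compressed form used by GANESH.
    A repeated call for the same key replaces the earlier report.
    """
    blob = zlib.compress(report_text.encode("utf-8"))
    conn.execute(
        "INSERT OR REPLACE INTO results (clone_id, program, report)"
        " VALUES (?, ?, ?)",
        (clone_id, program, blob),
    )


def load_result(conn, clone_id, program):
    """Reconstruct the original report text, or return None if absent."""
    row = conn.execute(
        "SELECT report FROM results WHERE clone_id = ? AND program = ?",
        (clone_id, program),
    ).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")
```

Keying each report on the clone and the program that produced it means that a rerun of one analysis overwrites only its own earlier output, which fits the regular update cycle.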
Once a GANESH database is set up, a set of procedures in the updating module scans the remote data sources periodically and downloads any new sequences from the target region as they are deposited. New sequences are passed to the assimilation module for processing in the usual way. The results for all stored sequences already processed are also updated. BLAST searches are repeated for every stored sequence as new versions of the archival databases are released: daily for the incremental releases of EMBL-nr (nonredundant EMBL that does not contain ESTs or STSs), EMBLnew (everything that has been added to EMBL since the last release), dbEST, TrEMBLnew (everything added to TrEMBL since the last release), dbSNP, and dbSTS, and weekly for other databases such as SWISSNEW (everything added to SWISS-PROT since the last release). Computationally intensive tasks, such as bulk searches against the EMBL, EST, and protein sequence databases, and the execution of the various DNA sequence analysis programs (e.g., repeat sequence detection and gene/exon prediction) are performed automatically, usually overnight. For our GANESH installation at Imperial College, we make use of a general purpose system, Disperse (Clifford and Mackey 2000 It is also possible for users to add their own annotations, as discussed further below. The system reports any new results automatically as they are discovered for regions that have been registered by the user as being of particular interest.
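As a rough illustration of the daily and weekly search schedule described above, the following sketch decides which databases are due for a repeat BLAST search on a given day. The database names and intervals follow the text; the function name, the data structure, and the rule that a never-searched database is always due are assumptions, not part of GANESH.

```python
from datetime import date, timedelta

# Refresh intervals in days, following the schedule described in the text.
REFRESH_DAYS = {
    "EMBL-nr": 1,
    "EMBLnew": 1,
    "dbEST": 1,
    "TrEMBLnew": 1,
    "dbSNP": 1,
    "dbSTS": 1,
    "SWISSNEW": 7,
}


def databases_due(last_searched, today):
    """Return the sorted names of databases whose searches should be repeated.

    last_searched maps a database name to the date of its most recent search.
    A database with no recorded search is treated as due (an assumption).
    """
    due = []
    for name, interval in REFRESH_DAYS.items():
        last = last_searched.get(name)
        if last is None or today - last >= timedelta(days=interval):
            due.append(name)
    return sorted(due)
```

A nightly job could call `databases_due` once and queue one batch of searches per returned database.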
To make EST data more meaningful, and at the same time to reduce the amount of storage required, matches between ESTs and the region of interest are stored in the database according to various stringency criteria. The standard configuration uses the criteria identified in Bailey et al. (1998 We also provide an optional component that attempts to predict the presence of genes/exons by comparing the output from several of the annotation tools. This has been of particular interest to the user groups working on identification of candidate disease genes. It is described below separately. The first application of GANESH, and the application used to drive its development, was a reconstruction of the 11Db database, implemented previously in ACeDB (R. Durbin and J. Thierry-Mieg, unpubl.; http://www.acedb.org), maintained at the Department of Biochemistry, Imperial College. The 11Db contains the sequence and annotation for the WAGR region of human chromosome 11. At the time of writing, the GANESH version of the 11Db database has been operational (and publicly accessible) for about 2 yr. Similar GANESH databases have now been set up for regions of human chromosomes 1, 2, 3, 5, 6, 11, 12, 14, 16, and 20, and regions of mouse chromosomes 2, 4, and 12. We also constructed a GANESH database of the complete human chromosome 21 to test whether the system can cope with this volume of data, and to provide comparisons with annotations produced elsewhere. Further details of current applications may be found at http://zebrafish.doc.ic.ac.uk.
The initial analysis of a new region
Sequence and Annotation Display
We have chosen to use a nested series of displays organized at the first level around individual DNA clones (but see below for the display of larger sequenced fragments). The display software is written in Java, and can be used both as a Web-based applet and as an application. It uses a library of Java display utilities that were developed independently following trials with the Neomorphic Genome Software Development Kit (http://www.neomorphic.com), with which it shares some look and feel. Interactive access to GANESH, allowing individual user annotations, is controlled by a user login. Figure 3 shows the opening screen of the display, in this case a human chromosome region, 11p13, that is deleted in WAGR syndrome patients. The display can be scrolled, and zoomed in and out to focus on areas of interest. Selection of a clone in this screen (by clicking) accesses information about it and (by double-clicking) brings up the annotation display shown in Figure 4. The content of this display is configurable. The user can also add or remove displayed features as required during viewing. As with all displays, it is also possible to zoom in and out; increased detail (names, markers, etc.) appears automatically as display resolution allows.
Clicking on any of the individual features brings up details of the annotation, depending on the nature of the probe. There is a separate graphical display for viewing BLAST hits. Clicking on a BLAST feature brings up the BLAST window shown in Figure 5. This can be scrolled and zoomed as usual. The BLAST significance level for display is controlled with a slider (shown to the left of the screen in Fig. 5), which is used to specify the stringency threshold for displayed BLAST hits. Clicking on a BLAST hit brings up a detailed view of the BLAST alignment, as shown in Figure 6. Links are created automatically to access source data, via the WWW, from GenBank, Entrez, dbSNP, UniGene, and SRS. Because of security restrictions, it is not possible to copy and paste from an applet. Some output is therefore displayed in a separate browser window via a CGI Perl script to enable the user to cut and paste information into other software. Additional windows have also been created to display the complete FPC clone map of the region (Marra et al. 1997
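The effect of the BLAST significance slider described above can be sketched as a simple filter over stored hits. This is an illustration only: it assumes that significance is expressed as an E-value (lower is more significant), and the `BlastHit` fields and the `visible_hits` function are hypothetical names rather than parts of the GANESH front end, which is written in Java.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastHit:
    subject_id: str   # identifier of the matched database sequence
    start: int        # start of the match on the query sequence
    end: int          # end of the match on the query sequence
    e_value: float    # assumed significance measure; lower is better


def visible_hits(hits, max_e_value):
    """Return the hits that pass the threshold, most significant first."""
    passing = [hit for hit in hits if hit.e_value <= max_e_value]
    return sorted(passing, key=lambda hit: hit.e_value)
```

Moving the slider then corresponds to calling `visible_hits` again with a different `max_e_value` and redrawing only the returned hits.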
A key feature superimposed on these displays is the automatic notification of updated information from the daily/weekly BLAST analyses or resulting from updates to the sequence itself. The last viewed version can be retrieved for comparison to the latest information. There is a further option to view just the information added since the user last accessed the system. This feature is particularly useful when combined with the tools provided for users to add their own annotations. These are keyed to individual operators and are intended to be the primary tool for recording acceptance or rejection of features within the displays: genes, gene-like objects, SNPs, regulatory regions, etc. It is important to stress that these annotations do not necessarily have to represent definitive genome features as in the databases of record, and can take the form of a simple statement of the form 'gene X is located at position Y' in the sequence. The purpose of GANESH is primarily to facilitate the identification of as complete a list as possible of genes and other genome features of interest within the target region, and only secondarily to provide definitive descriptions of their structures.
Unfinished Sequences
All annotations, and user annotations in particular, have to be updated as each new release of the sequence is assimilated. We deal with this by first performing a Smith-Waterman comparison of the new sequence with the previously stored version to identify the parts that have changed. We use crossmatch (http://www.phrap.org/phrap.docs/general.html) for this purpose. Typically, large fragments of the sequence will have remained the same. Only those parts that are new are reprocessed (subject to the sequence length requirements discussed below), thus reducing the computational effort and, more importantly, preserving user annotations as far as possible. As already mentioned in the description of the display components, users are able to compare updated sequences with the previously stored and analyzed versions. Clearly, the problems of dealing with unfinished sequence for the human genome will diminish as more and more genome sequence is finished, but the need to deal with unfinished sequence will not disappear. Although GANESH was developed originally to support the genetic analysis of human genomic regions, it is being used for other model organisms as well (at present, for mouse), and it is expected that it will be used increasingly for other organisms, including organisms whose genomes will remain in largely unfinished draft form.
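The idea of reprocessing only the changed parts of a sequence while carrying annotations across unchanged parts can be sketched as follows. This is a minimal illustration under stated assumptions: Python's `difflib.SequenceMatcher` stands in for the Smith-Waterman comparison performed by crossmatch and is only practical for short sequences, coordinates are 0-based and half-open, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def compare_versions(old_seq, new_seq):
    """Compare two versions of a sequence.

    Returns (unchanged, changed), where unchanged is a list of
    (old_start, new_start, size) blocks common to both versions, and
    changed is a list of (start, end) intervals on the new sequence
    that have no counterpart in the old one and need reprocessing.
    """
    matcher = SequenceMatcher(None, old_seq, new_seq, autojunk=False)
    unchanged = [
        (m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size > 0
    ]
    changed = [
        (j1, j2)
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag in ("replace", "insert")
    ]
    return unchanged, changed


def remap_annotation(start, end, unchanged):
    """Translate an annotation from old to new coordinates.

    The annotation is carried over only if it lies entirely inside one
    unchanged block; otherwise None is returned so that it can be
    flagged for review instead of being silently moved.
    """
    for old_start, new_start, size in unchanged:
        if old_start <= start and end <= old_start + size:
            offset = new_start - old_start
            return start + offset, end + offset
    return None
```

Returning `None` for annotations that straddle a changed region is a deliberately conservative choice in this sketch: a user annotation is either preserved exactly or handed back to the user.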
Long Sequences
These large genomic sequences also cause some problems of focus, because not all regions are of similar interest to all users. Provision is therefore made within GANESH for the creation of custom sets of sequences. Users can pick out and register subregions and sets of subsequences as being of special interest. For the purposes of the annotation display and update notification, these custom sets are treated in the same way as normal (BAC-indexed) sequences. The user is notified of any new annotation data within their range as the parent sequences are updated, as usual. The difference is that displays of custom sets of sequence are only available to their registered users. Different users may create their own custom sets, and these regions may overlap, but they are only displayed when the registered user logs in, and then only their own custom sets are displayed.
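The behavior of custom sets described above, per-user registration, notification on overlap, and visibility restricted to the owner, can be sketched with a small registry. The class and method names are hypothetical, coordinates are assumed to be half-open intervals on a named parent sequence, and nothing here reflects the actual GANESH schema.

```python
from collections import defaultdict


class CustomSetRegistry:
    """Tracks which subregions each user has registered as being of interest."""

    def __init__(self):
        # user -> list of (sequence_id, start, end)
        self._regions = defaultdict(list)

    def register(self, user, sequence_id, start, end):
        """Record a subregion of a parent sequence for one user."""
        self._regions[user].append((sequence_id, start, end))

    def regions_for(self, user):
        """Return only the regions owned by this user, as shown after login."""
        return list(self._regions.get(user, []))

    def users_to_notify(self, sequence_id, feat_start, feat_end):
        """Return the users with a registered region overlapping a new feature."""
        notified = set()
        for user, regions in self._regions.items():
            for seq, start, end in regions:
                overlaps = feat_start < end and start < feat_end
                if seq == sequence_id and overlaps:
                    notified.add(user)
                    break
        return sorted(notified)
```

Because regions are stored per user, two users can register overlapping regions without seeing each other's sets, while both still receive notifications for features that fall in the shared interval.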
Gene Identification Tools
The gene predictions are subdivided into:
We make our gene predictions available to our collaborators by utilizing the Distributed Annotation System (DAS; Dowell et al. 2001
At the time of writing, we have not completed experimental work on the predicted genes in this table beyond some preliminary microarray analysis to confirm/discard some of the gene predictions. It seems likely that a proportion of the low-category predictions represent genuine transcripts. According to a recent microarray-based validation of predicted genes on chromosome 22q (Shoemaker et al. 2001
Portability and Requirements
The applet requires Netscape 6(+) or, for other browsers, the installation of the Java Runtime Environment 1.3.1_02(+), and can be installed on any web server that can access the GANESH installation database. GANESH is designed to be as straightforward to install and maintain as possible, but some knowledge of the Unix/Linux operating system is required. Furthermore, some knowledge of Perl would be beneficial in order to modify the scripts if required, particularly if new analysis programs are to be added. We have also developed a portable version of the system that allows databases to be stored and queried off-line. A laptop or notebook is sufficient to run this version. These stand-alone databases can be synchronized with the central database over the internet. The GANESH software is available under an Open Source license. Details may be found at http://zebrafish.doc.ic.ac.uk.
Our main objective in developing GANESH was to provide a local tool for genetic analysis, but clearly, such studies can contribute to whole-genome annotation efforts. GANESH shares many features with other systems for the automated annotation of genomic sequence. The main distinguishing features of GANESH as we see them are as follows:
We have recently added facilities for importing and exporting data in the format of the Distributed Annotation System (DAS), enabling a GANESH database to be used as a component of a DAS configuration. A GANESH database of annotated human, and now also mouse, sequence can thus be accessed and viewed in two different ways as follows: (1) via the Ensembl site, using the Ensembl Web-based browser with the GANESH database selected as an additional DAS source; or (2) using the GANESH front end, either as a standalone application or as a web applet, with relevant data extracted from Ensembl and stored in GANESH as desired. The advantage of the first method is that the Ensembl browser is used widely and is increasingly familiar to researchers, and is likely to be the subject of sustained further development. It also provides an easy mechanism for disseminating results to our collaborators (as there is less for us to maintain). The advantages of the second method are that the GANESH front end is (comparatively) easy to adapt and customize, and can also be used as a stand-alone application, which some users seem to prefer. Only the second method is applicable, of course, to the analysis of genomes not presently stored in Ensembl.
The development of GANESH was supported by BBSRC/EPSRC Bioinformatics Initiative grant BIF28/10483. D.S. was supported by the EU Framework V project QLRT-CT-1999-00546. The library of Java graphics utilities used in the GANESH front-end was designed and implemented by Manuel Cardoso and Chris Iannou as part of their MSc projects in the Department of Computing, Imperial College. We thank Win Hide for many useful discussions and for his continuing support of this project. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.698103.
5 Corresponding author.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.
Bailey Jr., L.C., Searls, D.B., and Overton, G.C. 1998. Analysis of EST-driven gene annotation in human genomic sequence. Genome Res. 8: 362-376.
Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M., and Sonnhammer, E.L.L. 2002. The Pfam protein families database. Nucleic Acids Res. 30: 276-280.
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
Clifford, R.J. and Mackey, A.J. 2000. Disperse: A simple and efficient approach to parallel database searching. Bioinformatics 16: 564-565.
Deloukas, P., Schuler, G.D., Gyapay, G., Beasley, E.M., Soderlund, C., Rodriguez-Tome, P., Hui, L., Matise, T.C., McKusick, K.B., Beckmann, J.S., et al. 1998. A physical map of 30,000 human genes. Science 282: 744-746.
Dowell, R.D., Jokerst, R.M., Day, A., Eddy, S.R., and Stein, L. 2001. The distributed annotation system. Bioinformatics 2: 7.
Hogenesch, J.B., Ching, K.A., Batalov, S., Su, A.I., Walker, J.R., Zhou, Y., Kay, S.A., Schultz, P.G., and Cooke, M.P. 2001. A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell 106: 413-415.
Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., et al. 2002. The Ensembl genome database project. Nucleic Acids Res. 30: 38-41.
James, M.R., Richard III, C.W., Schott, J.J., Yousry, C., Clark, K., Bell, J., Terwilliger, J.D., Hazan, J., Dubay, C., Vignal, A., et al. 1994. A radiation hybrid map of 506 STS markers spanning human chromosome 11. Nat. Genet. 8: 70-76.
Kent, W.J. and Haussler, D. 2001. Assembly of the working draft of the human genome with GigAssembler. Genome Res. 11: 1541-1548.
Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, A.D. 2002. The human genome browser at UCSC. Genome Res. 12: 996-1006.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., and FitzHugh, W. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Marra, M.A., Kucaba, T.A., Dietrich, N.L., Green, E.D., Brownstein, B., Wilson, R.K., McDonald, K.M., Hillier, L.W., McPherson, J.D., and Waterston, R.H. 1997. High throughput fingerprint analysis of large-insert clones. Genome Res. 7: 1072-1084.
McPherson, J.D., Marra, M., Hillier, L., Waterston, R.H., Chinwalla, A., Wallis, J., Sekhon, M., Wylie, K., Mardis, E.R., Wilson, R.K., et al. 2001. A physical map of the human genome. Nature 409: 934-941.
Shoemaker, D.D., Schadt, E.E., Armour, C.D., He, Y.D., Garrett-Engele, P., McDonagh, P.D., Loerch, P.M., Leonardson, A., Lum, P.Y., Cavet, G., et al. 2001. Experimental annotation of the human genome using microarray technology. Nature 409: 922-927.
Stein, L. 2001. Genome annotation: From sequence to biology. Nat. Rev. Genet. 2: 493-503.
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., et al. 2001. The sequence of the human genome. Science 291: 1304-1351.
http://www.sanger.ac.uk/Software/Wise2; Birney, E., Wise2.
http://www.doc.ic.ac.uk/~rc5/Disperse; Clifford, R.J., Mackey, A.J., Disperse.
http://www.biodas.org; Distributed Annotation System (DAS) Home page.
http://www.biojava.org/dazzle; Down, T. The Dazzle server.
http://www.ensembl.org; Ensembl Home page.
ftp.ebi.ac.uk/pub/databases/embl/new; European Bioinformatics Institute. EMBL daily update files.
http://zebrafish.doc.ic.ac.uk; GANESH Home page.
http://www.phrap.org/phrap.docs/general.html; Green, P. Crossmatch documentation.
http://www.mysql.com; MySQL Home page.
http://www.neomorphic.com; Neomorphic Home page.
ftp.sanger.ac.uk/pub/human/sequences and ftp.sanger.ac.uk/pub/mouse/sequences; Sanger Institute ftp archives.
http://ftp.genome.washington.edu/RM/RepeatMasker.html; Smit, A.F.A., Green, P. RepeatMasker documentation.
http://www.acedb.org; AceDB home page.
Received August 7, 2002; accepted in revised form June 30, 2003.