Genome Res. 14:472-477, 2004
©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Resources
zPicture: Dynamic Alignment and Visualization Tool for Analyzing Conservation Profiles
1 Energy, Environment, Biology and Institutional Computing, Lawrence Livermore National Laboratory, Livermore, California 94550, USA
2 Genome Biology Division, Lawrence Livermore National Laboratory, Livermore, California 94550, USA
3 Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
4 Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
5 Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
Comparative sequence analysis has evolved as an essential technique for identifying functional coding and noncoding elements conserved throughout evolution. Here, we introduce zPicture, an interactive Web-based sequence alignment and visualization tool for dynamically generating conservation profiles and identifying evolutionarily conserved regions (ECRs). zPicture is highly flexible, because critical parameters can be modified interactively, allowing users to differentially predict ECRs in comparisons of sequences of different phylogenetic distances and evolutionary rates. We demonstrate the application of this module to identify a known regulatory element in the HOXD locus, in which functional ECRs are difficult to discern against the highly conserved genomic background. zPicture also facilitates transcription factor binding-site analysis via the rVista tool portal. We present an example from the HBB complex in which the zPicture/rVista combination specifically pinpoints two ECRs containing GATA-1, NF-E2, and TAL1/E47 binding sites that were identified previously as transcriptional enhancers. In addition, zPicture is linked to the UCSC Genome Browser, allowing users to automatically extract sequences and gene annotations for any recorded locus. Finally, we describe how this tool can be efficiently applied to the analysis of nonvertebrate genomes, including those of microbial organisms.
The availability of DNA sequence information from several complete genomes has created new opportunities for formulating and testing hypotheses on the basis of phylogeny. Systematic comparisons of related genomes now permit the deduction of clear evolutionary histories and the characterization of sequence conservation profiles. Several studies have shown that sequence elements with critical biological roles are resistant to accumulating mutations and can be distinguished from the neutrally evolving background in genomic alignments (Elnitski et al. 2003
As the genome community moves toward sampling DNA from organisms on far-reaching branches of the evolutionary tree, and as a large number of whole-genome shotgun sequencing projects reach completion, comparative sequence analysis will be vital for identifying functional sequences. In particular, the vast diversity of genomes being sequenced and the increasing sophistication of the scientific questions addressed require flexible analytical tools. We have developed novel ways to visualize genomic alignments that allow the user to actively modify conservation parameters, data retrieval, and output formats. These features have been incorporated into an automated alignment and visualization tool, zPicture (http://zpicture.dcode.org), to generate reliable, highly sensitive single or multiple pairwise sequence alignments and provide the results in a visually compact, user-friendly, and interactive manner. The tool can be applied to the analysis of large genomic regions of any length from microbes to human. Threshold levels of conservation can be adjusted dynamically to optimize the detection of conserved regions in alignments, independent of the evolutionary distances separating the underlying sequences. zPicture is also capable of analyzing alignments for the presence of conserved transcription factor-binding sites via the rVista tool portal (Loots et al. 2002
Generating and Visualizing Alignments
zPicture uses BLASTZ (Schwartz et al. 2003
At the zPicture main page, the user can submit the sequence data by choosing from several available options as follows: (1) paste in or upload sequence files from the user's computer, (2) automatically download sequence files from the UCSC Genome Browser (Kent et al. 2002
zPicture allows for customized real-time processing of sequence alignment data by promptly returning a set of output alignment files in the same browser window where the user submitted the input sequences. These files include (1) a dot plot, (2) a dynamically interactive visualization module, (3) modifiable annotation files, (4) a transcription factor binding-site analysis interface, and (5) a set of static sequence, alignment, and annotation files.
By following the visualization link, the user is directed to a conservation profile plot that can be actively modified over the Web to optimize the computational analysis. Alignments are visualized either as standard percent identity plots (PIP; Schwartz et al. 2000
Dynamic Analysis of Conservation Profiles
In general, there are no optimal fixed parameters for identifying functional conserved elements in any pair of sequences, as the degree of conservation varies, not only due to the evolutionary distance between species, but also due to highly variable regional mutation rates within a genome (Hardison et al. 2003
100 bp and 70 percent identity (%ID) have been suggested as reasonable parameters for identifying functional human/mouse noncoding elements (Loots et al. 2000
500 bp/ 85%ID) and amplify the signal-to-noise ratio in the mouse/human alignment to yield a manageable number of conserved noncoding elements with good correspondence to conserved noncoding elements that have been identified as regulatory elements (Fig. 2B; Gerard et al. 1997
This example illustrates the most critical feature of the zPicture program: the ability to actively modify the evolutionary criteria to reflect the appropriate phylogenetic relationship for the analyzed sequences. The dynamic visualization module allows for selection of (1) the minimum length and the minimum percent identity in a sliding window as a threshold for detecting ECRs by scanning BLASTZ alignments; (2) the sequence that will be displayed at a given time as the reference sequence (using the `Base-top' switch button); (3) the bottom cut-off value for percent identity (y-axis); (4) the picture resolution, to either compact or zoom into the alignments; and (5) the length of base sequence to be displayed per alignment layer. The zPicture visualization tool is capable of dynamically replotting, rescaling, and modifying the ECR-detection criteria and base sequence instantly without resubmitting the data. In contrast, other available visualization programs provide only static displays for alignments.
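The sliding-window threshold in item (1) can be illustrated with a short sketch. This is not zPicture's implementation, which is not shown in the text; it only assumes that a pairwise alignment is available as two gapped strings of equal length. The function name `find_ecrs` and its parameters are hypothetical.

```python
def find_ecrs(aligned_a, aligned_b, min_length=100, min_identity=70.0):
    """Return (start, end, percent_identity) tuples for conserved regions.

    Hypothetical helper: the inputs are two rows of one pairwise alignment
    ('-' marks a gap). Coordinates are 0-based, end-exclusive alignment columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aligned_a)
    if n < min_length:
        return []

    # 1 where both rows carry the same base (a gap never counts as a match).
    match = [
        1 if a == b and a != "-" else 0
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]

    # Prefix sums give the number of matches in any window in O(1).
    prefix = [0]
    for m in match:
        prefix.append(prefix[-1] + m)

    regions = []
    for start in range(n - min_length + 1):
        end = start + min_length
        identity = 100.0 * (prefix[end] - prefix[start]) / min_length
        if identity < min_identity:
            continue
        if regions and start <= regions[-1][1]:
            regions[-1][1] = end  # window overlaps or abuts the previous region
        else:
            regions.append([start, end])

    # Report each merged region with its overall percent identity.
    return [
        (s, e, 100.0 * (prefix[e] - prefix[s]) / (e - s))
        for s, e in regions
    ]
```

Calling `find_ecrs(a, b, min_length=100, min_identity=70.0)` and then `find_ecrs(a, b, min_length=500, min_identity=85.0)` on the same alignment mimics the adjustment described above: the alignment is computed once, and only the detection criteria change.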
Annotation
To edit annotation files, users have to go to the `Update annotation' section on the results page, and select the sequence for which amendments are being made. To annotate a contiguous region, the starting position, the ending position, and the type of sequence feature have to be indicated [coding exons (blue) are indicated by `CDS', untranslated regions (yellow) by `UTR', and all other types of elements (purple) by `OTH']. To annotate a transcript, on the first line, users must indicate the direction of the gene by < or >, followed by the start position, the end position, and the desired gene name. On succeeding lines, the same format should be followed as described above for contiguous regions. Detailed gene annotations also play an essential role in distinguishing coding from noncoding conserved elements. On the basis of the resulting conservation plot, users can interactively modify sequence annotations to reflect new discoveries based on the detected shared homology of the underlying sequence data. Using this feature, gene coordinates can be edited to include, for example, alternatively spliced exons, new genes, regulatory elements, or other available experimental data. The dynamic visualization interface immediately incorporates these changes without having to resubmit sequence data or recompute the alignments.
Repeat content can be annotated either by distinguishing repeats (lowercase letters) from nonredundant sequences (uppercase letters), or by running the locally installed RepeatMasker program (http://repeatmasker.genome.washington.edu/). If sequences are provided by loading data from the UCSC browser, these sequences have been preprocessed for repeats, and the first option, `repeats are identified by lowercase', should be selected. In this case, annotation files are automatically extracted and pasted into the annotation window.
If sequences are supplied by other means, the user can choose to mask repetitive elements by selecting the `mask repetitive elements' option and indicating the organism of choice. In this case, annotation files are not automatically provided, and the user has the option to supply their own annotation files, either by uploading a file from the user's computer, or pasting in the gene coordinates in the suggested format.
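The annotation format described in this section can be read with a few lines of code. The sketch below is only an illustration: it assumes that fields are separated by whitespace and that blank lines are ignored, neither of which is stated in the text, and the names `parse_annotation`, `Transcript`, and `Feature` are hypothetical.

```python
from dataclasses import dataclass, field

# Feature types and display colors as listed in the text.
COLORS = {"CDS": "blue", "UTR": "yellow", "OTH": "purple"}


@dataclass
class Feature:
    start: int
    end: int
    kind: str  # one of the keys of COLORS


@dataclass
class Transcript:
    strand: str  # "<" or ">", the direction of the gene
    start: int
    end: int
    name: str
    features: list = field(default_factory=list)


def parse_annotation(text):
    """Parse annotation text into (transcripts, standalone_regions).

    Assumption: whitespace-separated fields. A line starting with < or >
    opens a transcript; every other line is 'start end type'.
    """
    transcripts, standalone = [], []
    current = None
    for lineno, raw in enumerate(text.splitlines(), 1):
        parts = raw.split()
        if not parts:
            continue  # skip blank lines (assumed to be allowed)
        if parts[0] in ("<", ">"):
            if len(parts) < 4:
                raise ValueError(f"line {lineno}: expected '<|> start end name'")
            current = Transcript(parts[0], int(parts[1]), int(parts[2]),
                                 " ".join(parts[3:]))
            transcripts.append(current)
            continue
        if len(parts) != 3 or parts[2].upper() not in COLORS:
            raise ValueError(f"line {lineno}: expected 'start end CDS|UTR|OTH'")
        feature = Feature(int(parts[0]), int(parts[1]), parts[2].upper())
        # Regions listed before any transcript header are kept separately.
        (current.features if current else standalone).append(feature)
    return transcripts, standalone
```

For example, the text `"> 1000 5000 geneA\n1000 1200 UTR\n1201 1500 CDS"` (made-up coordinates and gene name) yields one forward-strand transcript carrying a UTR and a CDS feature.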
Aligning Microbial Genomes
Transcription Factor Binding-Site Analysis
Modulation of gene expression is achieved through the complex interaction of transcription factors (TFs) and DNA-binding motifs. Characterizing patterns of TF binding is a critical step for sequence-based discovery of noncoding regulatory elements. zPicture allows regulatory element analysis and transcription factor binding-site (TFBS) visualization through the rVista tool portal (Loots et al. 2002
As an example of this application, the aligned human and mouse sequences containing the HBB gene complex were searched for conserved matches to binding sites for GATA-1, NF-E2, and TAL1/E47 transcription factors (Fig. 5). Only one short region (
Our main objective in developing zPicture has been to create an alignment analysis tool that is dynamically Web interactive, fast, easy to use, and capable of generating multiple pairwise alignments that can be concurrently manipulated. Upon submitting the alignment request, the data are returned rapidly on the same Web page. Similar to PipMaker, zPicture can handle sequences of any length; alignments of 1 Mb sequences are generated in less than 1 min, 2-3 Mb requests are processed in under 5 min, and 5 Mb jobs require 30 min. We do not limit the size of input sequences; therefore, this tool can be used for comparing large genomic intervals or even complete bacterial genomes. If sequences are acquired from the UCSC database, zPicture analysis eliminates the need to mask repetitive elements prior to generating alignments, which accelerates the alignment process >100-fold, generating 1 Mb alignments in 30 sec. Also, the ability to extract sequence and annotation data from the UCSC Genome Browser is a unique feature that eliminates the need to manually create annotation files and expedites the process of comparative analysis. Conserved sequences can be retrieved interactively by clicking on the zPicture conservation profiles, an option unavailable in other comparative sequence analysis tools. Both zPicture and PipMaker provide users with dot plots that present an overview of the evolutionary rearrangements in the sequences being compared. Conserved features within the dot plots can also be accessed and viewed as sequence alignments with a single mouse-click on the image. Because BLASTZ is a local aligner, zPicture identifies homologous regions independent of their location and orientation in the second sequence; therefore, this tool can efficiently be applied to the analysis of unfinished draft sequences to find overlaps between contigs and assist during assembly.
The local alignment algorithm also provides the maximum efficiency in aligning distantly related genomes, such as those of mammals and fishes, in which gene order and orientation have not been faithfully preserved. Most importantly, zPicture has been designed to allow for interactive fine-tuning of the conservation data to optimize the evolutionary thresholds required to extract the most significant biological data. Comparative genomic tools have been implemented successfully for prioritizing candidate regions to be tested in functional assays, and as these tools evolve, they have the potential to be applied for the de novo identification of functional coding and noncoding sequences. The zPicture tool can be used as a reverse-engineering approach for understanding the modular structure of DNA through cross-species comparisons, and provides a theoretical solution for decrypting the sequence of genomes.
The work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory Contract No. W-7405-Eng-48. Additional support was from NHGRI grant HG02238 (W.M. and R.H.). The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2129504.
6 Corresponding author.
Dubchak, I., Brudno, M., Loots, G.G., Pachter, L., Mayor, C., Rubin, E.M., and Frazer, K.A. 2000. Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res. 10: 1304-1306.
Elnitski, L., Hardison, R.C., Li, J., Yang, S., Kolbe, D., Eswara, P., O'Connor, M.J., Schwartz, S., Miller, W., and Chiaromonte, F. 2003. Distinguishing regulatory DNA from neutral sites. Genome Res. 13: 64-72.
Gerard, M., Zakany, J., and Duboule, D. 1997. Interspecies exchange of a Hoxd enhancer in vivo induces premature transcription and anterior shift of the sacrum. Dev. Biol. 190: 32-40.
Hardison, R.C., Roskin, K.M., Yang, S., Diekhans, M., Kent, W.J., Weber, R., Elnitski, L., Li, J., O'Connor, M., Kolbe, D., et al. 2003. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 13: 13-26.
Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y.T., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., et al. 2003. The UCSC Genome Browser Database. Nucleic Acids Res. 31: 51-54.
Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. 2002. The human genome browser at UCSC. Genome Res. 12: 996-1006.
Loots, G.G., Locksley, R.M., Blankespoor, C.M., Wang, Z.E., Miller, W., Rubin, E.M., and Frazer, K.A. 2000. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288: 136-140.
Loots, G.G., Ovcharenko, I., Pachter, L., Dubchak, I., and Rubin, E.M. 2002. rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res. 12: 832-839.
Mayor, C., Brudno, M., Schwartz, J.R., Poliakov, A., Rubin, E.M., Frazer, K.A., Pachter, L.S., and Dubchak, I. 2000. VISTA: Visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 16: 1046-1047.
Oeltjen, J.C., Malley, T.M., Muzny, D.M., Miller, W., Gibbs, R.A., and Belmont, J.W. 1997. Large-scale comparative sequence analysis of the human and murine Bruton's tyrosine kinase loci reveals conserved regulatory domains. Genome Res. 7: 315-329.
Pennacchio, L.A., Olivier, M., Hubacek, J.A., Cohen, J.C., Cox, D.R., Fruchart, J.C., Krauss, R.M., and Rubin, E.M. 2001. An apolipoprotein influencing triglycerides in humans and mice revealed by comparative sequencing. Science 294: 169-173.
Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R., and Miller, W. 2000. PipMaker--A Web server for aligning two genomic DNA sequences. Genome Res. 10: 577-586.
Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., and Miller, W. 2003. Human-mouse alignments with BLASTZ. Genome Res. 13: 103-105.
Talbot, D. and Grosveld, F. 1991. The 5'HS2 of the globin locus control region enhances transcription through the interaction of a multimeric complex binding at two functionally distinct NF-E2 binding sites. EMBO J. 10: 1391-1398.
Wingender, E., Chen, X., Fricke, E., Geffers, R., Hehl, R., Liebich, I., Krull, M., Matys, V., Michael, H., Ohnhauser, R., et al. 2001. The TRANSFAC system on gene expression regulation. Nucleic Acids Res. 29: 281-283.
http://genome.ucsc.edu/; Human Genome Browser at UCSC.
http://www.ncbi.nlm.nih.gov/; NCBI Database.
http://bio.cse.psu.edu/pipmaker/; PipMaker.
http://rvista.dcode.org/; rVista.
http://www.biobase.de/; Transfac Database.
http://www-gsd.lbl.gov/VISTA/VistaInput.html; Vista.
http://repeatmasker.genome.washington.edu/; RepeatMasker.
http://zpicture.dcode.org/; zPicture.
Received October 31, 2003; accepted in revised form December 28, 2003.