|
|
|
|
Genome Res. 14:1191-1198, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00 Resources eShadow: A Tool for Comparing Closely Related Sequences1 Energy, Environment, Biology and Institutional Computing (EEBI), Lawrence Livermore National Laboratory, Livermore, California 94550, USA 2 Genome Biology Division, Lawrence Livermore National Laboratory, Livermore, California 94550, USA 3 Department of Genome Sciences, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
Primate sequence comparisons are difficult to interpret due to the high degree of sequence similarity shared between such closely related species. Recently, a novel method, phylogenetic shadowing, has been pioneered for predicting functional elements in the human genome through the analysis of multiple primate sequence alignments. We have expanded this theoretical approach to create a computational tool, eShadow, for the identification of elements under selective pressure in multiple sequence alignments of closely related genomes, such as in comparisons of human-to-primate or mouse-to-rat DNA. This tool integrates two different statistical methods and allows for the dynamic visualization of the resulting conservation profile. eShadow also includes a versatile optimization module capable of training the underlying Hidden Markov Model to differentially predict functional sequences. This module grants the tool high flexibility in the analysis of multiple sequence alignments and in comparing sequences with different divergence rates. Here, we describe the eShadow comparative tool and its potential uses for analyzing both multiple nucleotide and protein alignments to predict putative functional elements.
Cross-species sequence comparisons between distantly related genomes, such as those of humans and rodents, have been instrumental in defining evolutionarily conserved elements with critical biological roles, whether they function as coding exons (Pennacchio et al. 2001
However, a deeper understanding of the biology and the evolution of Homo sapiens will require comparisons not only to distantly related genomes, such as rodents and fishes, but also to our closest relatives, the great apes. In such comparisons, it is very challenging to extract statistically significant differences, as the genomes of humans and their primate relatives are very similar at the nucleotide level (>90%) (Britten 2002
Recently, a novel approach, phylogenetic shadowing, was developed to compute and statistically evaluate conservation profiles of multiple sequence alignments from closely related species. This statistical method permitted the accurate prediction of exons and transcriptional regulatory elements in humanprimate comparisons, and validated the use of this approach for deciphering primate-specific functional DNA sequences (Boffelli et al. 2003
eShadow Computational Design and Visualization Scheme eShadow is an interactive computational tool for aligning, visualizing, and evaluating evolutionarily conservation profiles in multiple and pairwise nucleotide or protein alignments of closely related sequences. The tool works best in alignments between sequences characterized by divergence rates less than the average human versus mouse neutral substitution rate of 0.46 substitutions per site (Waterston et al. 2002 80% identical) up to 200 kb in length. While performing protein alignments, ClustalW weighs amino acid substitutions according to the divergence rates of the sequences being aligned, and assigns residue-specific gap penalties that are adjusted locally, resulting in increasing or decreasing the score, depending on the potential secondary structure of the protein (Thompson et al. 1994
eShadow visualizes MSAs as percent variation (or mismatch) plots. The percent of mismatched nucleotides or amino acids is calculated in a sliding window of user-defined length; where the percent identity y, in a window size w, centered at a given position x, is plotted at the (x,y) coordinate. The peaks and valleys of the conservation plot correspond to regions of low and high variation, respectively, and 0% variation signifies 100% sequence identity in the MSA. To increase plasticity for the visual representation of the data, eShadow uniquely allows users to interactively choose the base organism, modify parameters, and annotation filesfeatures absent from all other available multiple or pairwise sequence alignment tools.
The eShadow analytical module implements two different statistical methods of scanning MSAs to detect slow-mutating regions as follows: (1) HMMI, and (2) DT. Whereas Hidden Markov Model methods are used extensively to detect functional elements in raw genomic and protein sequences as well as in sequence alignments, DT methods are typically used for the analysis of conservation plots. HMMI implementations include gene prediction (Burge and Karlin 1997
The analytical component of the eShadow tool also contains several optional features as follows: (1) an ORF detection block and (2) an optimization module to assist during the characterization of conservation patterns across alignments. By superimposing ORF predictions, eShadow's HMMI detection module identifies nucleotide regions with high potential to code for proteins, possibly differentiating coding from noncoding conserved elements. The optimization module allows the user to train the program to identify parameters that would distinguish features similar in nature to a set of known elements (exons or regulatory elements) from the background noise. We have implemented three complementary optimization methods into the eShadow tool as follows: (1) Baum-Welch (Durbin et al. 1998
Applications for the eShadow Tool
Detecting Coding Exons
Despite the tremendous advances achieved in DNA-sequencing technologies, obtaining the sequence of dozens of closely related vertebrate-sized genomes is still not a practical goal. We therefore asked whether single human/primate pairwise alignments might contain enough information to distinguish conserved (slow-mutating) from neutral (fast-mutating) regions. This task is highly intricate, as the substitution rate decreases significantly when switching from an MSA to a pairwise alignment (Supplemental Fig. S2). Figure 3 illustrates eShadow's ability to predict the ApoB exon from a single human/primate (Allouatta seniculus) pairwise alignment (Fig. 3C) as accurately as it can be predicted from a human/mouse (Fig. 3A) or a primate MSA (Fig. 3B). Similar results were obtained in 54 other exons analyzed in human/baboon alignments (Table 1). These results suggest that if properly analyzed, single human/primate pairwise alignments have the potential to be as informative for exon identification as human/rodent alignments are.
Detecting Conserved Noncoding Elements Because humans and rodents share the majority of their protein-coding genes (Waterston et al. 2002 80 Myrs separating the rodent and human lineages (Dermitzakis and Clark 2002
We analyzed 53 kb from the Wingless-type MMTV Integration Site Family, Member 2 gene locus (WNT2), which has been deeply sequenced in many species, including chimps and baboons in addition to humans and mice (Thomas et al. 2003
Whereas the H/B/C alignment demonstrated 12% nucleotide variation, the conservation pattern clearly distinguished regions with different evolutionary rates. HMMI predictions (0.98/0.90/0.005) identified 26 ECRs, including all the WNT2 coding exons. Although this number (26) is approximately three times less then the corresponding number for H/M ECRs (62), in general, strings of smaller H/M ECRs were incorporated within larger H/B/C ECRs, such that 68% (42) H/M ECRs were recapitulated in H/B/C ECRs. On a per base-pair basis, the sum of all human nucleotides classified as highly conserved either in rodent (H/M) or primate (H/B/C) comparisons were highly similar, spanning 15 kb in H/B/C and 17 kb H/M alignments. Four primate-specific ECRs lacked a highly homologous counterpart in the mouse ortholog. These elements could either represent regions that did not accumulate enough mutations throughout primate evolution due to chance alone, or could be primate-specific elements. Computationally distinguishing between these two possibilities is not yet feasible; rather, the true biological relevance of these lineage-specific elements must be determined experimentally.
To evaluate the specificity and sensitivity by which the eShadow tool is able to recapitulate H/M conservation profiles from a single human/primate alignment, we analyzed a test set of four completely finished baboon BACs spanning
Identifying Conserved Protein Domains
One application for the eShadow tool includes the analysis of MPAs to detect protein domains under selective pressure using HMMI predictions. This strategy is particularly promising, as HMM profiling is one of the most successful strategies for detecting statistically significant regions of protein homology (Madera and Gough 2002
We have developed a computational Web-based tool, eShadow, that is highly proficient in performing phylogenetic shadowing analysis for closely related nucleotide and protein sequences. eShadow amplifies the information content from pairwise or multiple alignments by combining independent mutations present in each different lineage, and detects regions with the lowest cumulative density of mutations through the use of two different statistical methods, DT and HMMI. This tool also includes a parameters-optimization module for the HMMI model that can be amended to any particular evolutionary history underlying the input sequences, and trains the program to predict conserved elements in a wide variety of alignments. Unlike other available tools that analyze conservation across alignments using static parameters, eShadow allows for dynamic modifications of all parameters and picture settings, creating conservation plots in realtime. eShadow can be used to detect coding exons, protein domains, and conserved noncoding elements. Whereas eShadow identifies exons, exon-intron boundaries are not exactly delineated; therefore, this tool provides a good starting point for transcript analysis that could benefit from external gene-prediction information. When protein alignments are analyzed, eShadow can be used to highlight protein domains that are conserved across multiple species and may be involved in vital biochemical processes such as proteinprotein contacts or DNA binding. We have also indicated how amino acid mutational analysis can be superimposed on HHMI predictions in MPAs, and this analysis can be used to evaluate missense mutations.
We have shown that eShadow can recreate most of the information obtained from human/mouse alignments when human/baboon/chimp or human/baboon alignments are analyzed. A 53-kb three-way primate alignment analysis for the WNT2 locus recovered 68% of the human/mouse conserved elements, as well as identified several primate-specific conserved elements. Whereas the functional significance of these lineage-specific sequence elements is presently unknown, we speculate that they may potentially represent sequences that underlie noncoding functions shared by primates but not by other mammals. Regulatory modifications of conserved genes have been proposed to define the major molecular differences that set different organisms phenotypically apart (Boffelli et al. 2003
Hidden Markov Model Islands We used a two-state HMM method to predict slow diverging regions in MSA. We modeled the distribution of matches and mismatches, assuming that they correspond to two mutation states (slow and fast), with different probabilities of emitting a match (eS and eF in slow- and fast-mutation states, respectively), and under the assumption that the probability T of transferring from state to state is equal in both directions (Supplemental Fig. S1). The emission probability of a state relates to the average sequence conservation of that particular state, whereas the transfer probability is inversely proportional to the average length of regions in the alignment. Viterbi algorithm was used to predict the underlying distribution of slow-mutation states.
Parameters Optimization
HMMare the lengths of guiding regions and HMMI predictions, and the length of their overlap, respectively. An exact fit for the HMMI predictions and guiding regions will zero the scoring function S, whereas any discrepancies will increase it. This method iterates and converges to the point of the local minimum, due to the discrete nature of the scoring function. Usually there are several local extremum points in the three-dimensional surface of HMMI probability parameters in Baum-Welch and Golden Section Surface optimizations. The implemented procedure converges to one of the extremum points, depending on the input parameters. Therefore, when HMMI generates no predictions, or there is a large discrepancy between the predicted and the expected elements, parameters need to be modified accordingly.
ORF Detection
Output Option
This section also contains a report of all of the slow-mutating regions identified by the selected methods, providing the coordinates of the predicted elements in the base sequence, and indicating their parameters. HMMI predictions also contain scores that reflect confidence (statistically evaluating predicted regions vs. background noise for a given set of parameters). Every predicted slow-mutating interval I of the HMMI collects a score S(I), which reflects a log likelihood probability of this interval not to be a fast-mutating region:
We thank Lisa Stubbs and Marcelo Nobrega for critical reading of the manuscript, valuable suggestions, and comments. We thank Eddy Rubin for his support in creating this tool. This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory Contract No. W-7405-Eng-48 and Lawrence Berkeley National Laboratory Contract No. AC0376SF00098. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1773104.
4 Corresponding author. [Supplemental material is available online at www.genome.org.]
Anzai, T., Shiina, T., Kimura, N., Yanagiya, K., Kohara, S., Shigenari, A., Yamagata, T., Kulski, J.K., Naruse, T. K., Fujimori, Y., et al. 2003. Comparative sequencing of human and chimpanzee MHC class I regions unveils insertions/deletions as the major path to genomic divergence. Proc. Natl. Acad. Sci. 100: 77087713.
Bailey, J.A., Gu, Z., Clark, R.A., Reinert, K., Samonte, R.V., Schwartz, S., Adams, M.D., Myers, E.W., Li, P.W., and Eichler, E.E. 2002. Recent segmental duplications in the human genome. Science 297: 10031007. Balavoine, G., de Rosa, R., and Adoutte, A. 2002. Hox clusters and bilaterian phylogeny. Mol. Phylogenet. Evol. 24: 366373.[CrossRef][Medline] Bienkowska, J.R., Yu, L., Zarakhovich, S., Rogers Jr., R.G., and Smith, T.F. 2000. Protein fold recognition by total alignment probability. Proteins 40: 451462.[CrossRef][Medline]
Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K.D., Ovcharenko, I., Pachter, L., and Rubin, E.M. 2003. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299: 13911394.
Britten, R.J. 2002. Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. Proc. Natl. Acad. Sci. 99: 1363313635. Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 7894.[CrossRef][Medline]
Burset, M., Seledtsov, I.A., and Solovyev, V.V. 2000. Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Res. 28: 43644375.
Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T.J., Higgins, D.G., and Thompson, J.D. 2003. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 31: 34973500.
Cline, M., Hughey, R., and Karplus, K. 2002. Predicting reliable regions in protein sequence alignments. Bioinformatics 18: 306314.
Cooper, G.M., Brudno, M., Green, E.D., Batzoglou, S., Sidow, A., and NISC Comparative Sequencing Program. 2003. Quantitative estimates of sequence divergence for comparative analysis of mammalian genomes. Genome Res. 13: 813820.
Couronne, O., Poliakov, A., Bray, N., Ishkhanov, T., Ryaboy, D., Rubin, E., Pachter, L., and Dubchak, I. 2003. Strategies and tools for whole-genome alignments. Genome Res. 13: 7380.
Dermitzakis, E.T. and Clark, A.G. 2002. Evolution of transcription factor binding sites in Mammalian gene regulatory regions: Conservation and turnover. Mol. Biol. Evol. 19: 11141121. Durbin, R.E., Eddy, S.R., Krogh, A., and Mitchison, G. 1998. Biological sequence analysis. Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK.
Ghanem, N., Jarinova, O., Amores, A., Long, Q., Hatch, G., Park, B.K., Rubenstein, J.L., and Ekker, M. 2003. Regulatory roles of conserved intergenic domains in vertebrate dlx bigene clusters. Genome Res. 13: 533543. Gilligan, P., Brenner, S., and Venkatesh, B. 2002. Fugu and human sequence comparison identifies novel human genes and conserved non-coding sequences. Gene 294: 3544.[CrossRef][Medline]
Hellmann, I., Zollner, S., Enard, W., Ebersberger, I., Nickel, B., and Paabo, S. 2003. Selection on human genes as revealed by comparisons to chimpanzee cDNA. Genome Res. 13: 831837.
Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., et al. 2002. The Ensembl genome database project. Nucleic Acids Res. 30: 3841.
Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y.T., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., et al. 2003. The UCSC Genome Browser Database. Nucleic Acids Res. 31: 5154.
Knudsen, B. and Miyamoto, M.M. 2001. A likelihood ratio test for evolutionary rate shifts and functional divergence among proteins. Proc. Natl. Acad. Sci. 98: 1451214517. Krogh, A. 1997. Two methods for improving performance of an HMM and their application for gene finding. Proc. Int. Conf. Intell. Syst. Mol. Biol. 5: 179186.[Medline]
Lim, L.P., Glasner, M.E., Yekta, S., Burge, C.B., and Bartel, D.P. 2003. Vertebrate microRNA genes. Science 299: 1540.
Loots, G.G., Locksley, R.M., Blankespoor, C.M., Wang, Z.E., Miller, W., Rubin, E.M., and Frazer, K.A. 2000. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288: 136140.
Madera, M. and Gough, J. 2002. A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res. 30: 43214328.
Mayor, C., Brudno, M., Schwartz, J.R., Poliakov, A., Rubin, E.M., Frazer, K.A., Pachter, L.S., and Dubchak, I. 2000. VISTA: Visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 16: 10461047.
Nobrega, M.A., Ovcharenko, I., Afzal, V., and Rubin, E.M. 2003. Scanning human gene deserts for long-range enhancers. Science 302: 413.
Ovcharenko, I., Loots, G.G., Hardison, R.C., Miller, W., and Stubbs, L. 2004. zPicture: Dynamic alignment and visualization tool for analyzing conservation profiles. Genome Res. 14: 472477.
Pennacchio, L.A., Olivier, M., Hubacek, J.A., Cohen, J.C., Cox, D.R., Fruchart, J.C., Krauss, R.M., and Rubin, E.M. 2001. An apolipoprotein influencing triglycerides in humans and mice revealed by comparative sequencing. Science 294: 169173.
Pollock, D.D. and Taylor, W.R. 1997. Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng. 10: 647657. Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T. 1988. Numerical Recipes in C. Cambridge University Press, UK. Rivas, E. and Eddy, S.R. 2001. Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2: 8.[CrossRef][Medline] Scemama, J.L., Hunter, M., McCallum, J., Prince, V., and Stellwag, E. 2002. Evolutionary divergence of vertebrate Hoxb2 expression patterns and transcriptional regulatory loci. J. Exp. Zool. 294: 285299.[CrossRef][Medline]
Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R., and Miller, W. 2000. PipMakerA web server for aligning two genomic DNA sequences. Genome Res. 10: 577586. Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., and Miller, W. 2003. Humanmouse alignments with BLASTZ. Genome Res. 13: 103107.
Shannon, M., Hamilton, A.T., Gordon, L., Branscomb, E., and Stubbs, L. 2003. Differential expansion of zinc-finger transcription factor Loci in homologous human and mouse gene clusters. Genome Res. 13: 10971110. Silva, J.C. and Kondrashov, A.S. 2002. Patterns in spontaneous mutation revealed by humanbaboon sequence comparison. Trends Genet. 18: 544547.[CrossRef][Medline]
Takai, D. and Jones, P.A. 2002. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc. Natl. Acad. Sci. 99: 37403745. Tatusova, T.A. and Madden, T.L. 1999. BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett. 174: 247250.[CrossRef][Medline] Thomas, J.W., Touchman, J.W., Blakesley, R.W., Bouffard, G.G., Beckstrom-Sternberg, S.M., Margulies, E.H., Blanchette, M., Siepel, A.C., Thomas, P.J., McDowell, J.C., et al. 2003. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424: 788793.[CrossRef][Medline]
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 46734680. Truong, K. and Ikura, M. 2002. Identification and characterization of subfamily-specific signatures in a large protein superfamily by a hidden Markov model approach. BMC Bioinformatics 3: 1.[CrossRef][Medline] Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520562.[CrossRef][Medline]
http://www.ebi.ac.uk/clustalw/; ClustalW. http://www.genet.sickkids.on.ca/cftr/; Cystic Fibrosis Mutation Database. http://ecrbrowser.dcode.org/; ECR Browser. http://www.ensembl.org/; Ensembl. http://eshadow.dcode.org/; eShadow. http://genome.ucsc.edu/; Human Genome Browser at UCSC. http://bio.cse.psu.edu/genome/hummus/; Pip Dispenser. http://bio.cse.psu.edu/pipmaker/; PipMaker. http://www.sanger.ac.uk/Software/Pfam/; Pfam database. http://www-gsd.lbl.gov/VISTA/; Vista. http://pipeline.lbl.gov; Vista Browser. http://zpicture.dcode.org/; zPicture.
Received July 18, 2003;
accepted in revised format February 17, 2004.
This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||