Genome Res. 14:716-720, 2004. ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Resources
Visualization of Multiple Genome Annotations and Alignments With the K-BROWSER
1 Department of Computer Science, University of California, Berkeley, Berkeley, California 94720, USA
2 Department of Mathematics, University of California, Berkeley, Berkeley, California 94720, USA
We introduce a novel genome browser application, the K-BROWSER, that allows intuitive visualization of biological information across an arbitrary number of multiply aligned genomes. In particular, the K-BROWSER simultaneously displays an arbitrary number of genomes both through overlaid annotations and predictions that describe their respective characteristics, and through the multiple alignment that describes their global relationship to one another. The browsing environment has been designed to allow users seamless access to information available in every genome and, furthermore, to allow easy navigation within and between genomes. As of the date of publication, the K-BROWSER has been set up on the human, mouse, and rat genomes.
Genome browsers at present (Kent et al. 2002 These tracks, in conjunction with present genome browser technology, have greatly contributed to recent breakthroughs by allowing rapid cross-referencing of diverse types of information. For instance, a scientist arguing for the recognition of a putative gene or exon would probably find it important to be able to use gene predictions, known mRNAs, and exons simultaneously; even though information from one type of resource may not be sufficient, combined evidence might be persuasive. With the quality and diversity of these tracks quickly increasing, it is expected that they will greatly expand the power and scope of genome browsers.
In addition to the ability to cross-reference tracks within a particular genome, the ability to cross-reference tracks across genomes has also shown itself to be invaluable. Known as the comparative genomics method, this approach exploits the observation that functionally important regions in related organisms will, as a result of selective pressure, exhibit well-defined and interesting patterns of sequence conservation. This approach has been particularly useful and has led to significant improvements in cross-species gene finding (Korf et al. 2001
Present genome browsers, however, lack the ability to clearly represent information across genomes. To the extent that these browsers were originally designed to display either information on a single genome or relationships between multiple genomes, but not both, they all suffer from the same shortcoming; that is, they lack the ability to simultaneously display multiple alignment information and single genome tracks. The VISTA browser (http://pipeline.lbl.gov/vistabrowser There is, however, a more fundamental problem that precludes most present approaches from implementing a convenient multiple genome browser system. Because the genomic alignments that must underlie such browsers are not necessarily one-to-one, that is, are not necessarily between orthologous pairs, it is difficult for present genome browsers to represent many-to-many alignments as an alignment between genomes. In other words, it would be difficult for them to develop a representation that natively compared different genomes and not simply different regions.
The K-BROWSER has been specifically built to allow intuitive visualization of biological information both within and across an arbitrary number of multiply aligned genomes. It has been designed around two principles.
Genome Symmetry
Genome Homology
The major component of the K-BROWSER, the image generation engine, takes as input a specific region in a specific genome, and produces a set of images that succinctly represents the requested region and all orthologous regions. In particular, the K-BROWSER generates a single image for each supported genome, displaying the corresponding orthologous region through (1) the tracks supported on that particular genome, (2) the multiple alignment that underlies the entire set of orthologous regions, and (3) the degree of conservation thereof. In addition, the K-BROWSER also provides the option to immediately and easily download the underlying multiple alignment.
Implementation
Of these components, the two responsible for the production of realigned track databases and image generation are most important, and are described in more detail below.
Track Realignment
Intuitively, because the underlying sequence has been aligned, and one wants to visualize the tracks with respect to aligned sequence, the track must also be "aligned." As such, we extend the well-understood paradigm of sequence alignment, which requires that any two homologous bases occupy the same column of an (ideal) multiple alignment, and define the analogous idealization for visualized tracks. In particular, for an ideal visualization, we require that any two homologous entries (e.g., orthologous exons in mouse and rat) be visualized at the same position (e.g., in the same range of pixels on the horizontal axis). To this extent, realignment refers to the process of adjusting the original tracks with respect to a multiple alignment, so that this constraint is met.
Another interpretation of this realignment component is that it converts positions in genomic coordinates to alignment coordinates. Such an interpretation is both meaningful and useful, as alignment coordinates allow one to succinctly refer to positions in an arbitrary number of genomes with a single position. In fact, this interpretation is critical for a number of purposes, such as in determining which regions are orthologous to a requested region. It is interesting to note that, given this interpretation, the previously described readjustment reduces to the problem of converting each genomic-coordinate position in each track to the corresponding alignment-coordinate position.
This conversion process requires three inputs: (1) an orthology map that defines sets of orthologous regions between genomes, (2) a multiple alignment of every set of regions in the orthology map, and (3) databases of tracks for each of the genomes. These inputs are, respectively, derived from ongoing work (C. Dewey, pers. comm.) in homology maps, MAVID multiple alignments (Bray et al. 2003 Segment tables are useful in that they simply, efficiently, and uniquely represent an alignment. A segment table is then simply a table that contains certain information about every segment in the aligned sequence, with a segment defined to be a maximal ungapped sequence in the aligned sequence. To characterize an aligned sequence uniquely, a segment table need only include the length, the unaligned start position, and the aligned start position of each segment. Because an aligned sequence is simply the original sequence interspersed with gaps, and the segment table indicates all gaps in the alignment, that is, all positions that are not in any segment, the aligned sequence is uniquely characterized. To the extent that segment tables efficiently represent the underlying alignment, they are used throughout the K-BROWSER. For instance, they allow one to efficiently translate genomic-coordinate positions to and from alignment-coordinate positions. In addition, they can be used to rapidly determine the exact positions and lengths of gaps in a particular region of the alignment. As it turns out, the latter is important for image generation, whereas the former is critical to both image generation and track realignment. Although the tracks can, in principle, be realigned in the obvious way using segment tables, this approach requires substantial computational overhead because of repetitive database accesses. 
As a result, it does not scale with respect to the track database, and we instead use another approach that requires only a constant number of database accesses, whose cost is amortized over the different tracks. In particular, we iterate through each orthology set in the homology map, retrieve the appropriate segment table subset, and build an array that maps each genomic-coordinate index position to the corresponding alignment-coordinate position. Track realignment then simply reduces to the problem of extracting the positions of each entry in the track, looking them up in the array in constant time, and saving the results. This conversion has the intuitive effect of "stretching" out tracks over gaps in the alignment. However, because it is easy to determine the gaps in the region, it is also easy to determine the positions that the realigned track actually covers.
Despite being conceptually straightforward, track realignment is complicated by at least two factors: the diverse and dynamic nature of the track databases, and the existence of large-scale evolutionary events, for example, inversions. The K-BROWSER presently implements simple but sufficient solutions to both problems.
Because the diversity of tracks is constantly increasing as a result of new evidence, alignments, predictions, and annotations, it is impossible to know a priori exactly which fields of track databases need to be adjusted. Even though nearly all of the different tracks follow a small number of standard schemas, it is quite possible that existing schemas will change or that new schemas will be introduced. As a result, one must rely on either manual intervention or relatively complicated (and error-prone) automatic inference to ensure that the proper fields are adjusted. The present track realignment implementation relies on a precomputed, human-verified set of appropriate fields.
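The segment tables and the array-based realignment described above can be illustrated with a minimal Python sketch. It is not the K-BROWSER implementation: the names `Segment`, `segment_table`, `genomic_to_aligned_array`, and `realign_track` are hypothetical, coordinates are assumed to be 0-based and relative to the start of one orthologous region, and track entries are assumed to be half-open intervals.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One maximal ungapped run of an aligned sequence (hypothetical layout)."""
    length: int          # number of bases in the segment
    genomic_start: int   # start position in the unaligned sequence
    aligned_start: int   # start position (column) in the aligned sequence


def segment_table(aligned, gap="-"):
    """Build the segment table for one row of a multiple alignment."""
    table, genomic, start = [], 0, None
    # A sentinel gap is appended so that the final segment is always closed.
    for col, ch in enumerate(aligned + gap):
        if ch != gap and start is None:
            start = col                      # a new segment opens here
        elif ch == gap and start is not None:
            length = col - start             # the open segment ends here
            table.append(Segment(length, genomic, start))
            genomic += length
            start = None
    return table


def genomic_to_aligned_array(table):
    """Array whose i-th entry is the alignment coordinate of genomic position i."""
    lookup = []
    for seg in table:
        lookup.extend(range(seg.aligned_start, seg.aligned_start + seg.length))
    return lookup


def realign_track(entries, lookup):
    """Convert half-open genomic intervals [start, end) to alignment coordinates.

    Each endpoint costs one constant-time array lookup; an entry that spans a
    gap is "stretched" over it.
    """
    return [(lookup[start], lookup[end - 1] + 1) for start, end in entries]
```

For the aligned row `AC--GT-A`, the table holds three segments, and a track entry covering genomic positions 1 to 3 (the interval `(1, 4)`) is realigned to `(1, 6)`, stretching over the two-column gap. Building the array once per orthology set is what lets the cost be shared by every track realigned against it.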
Large-scale duplication and deletion events are easily handled by the K-BROWSER, but inversions impose certain additional requirements on track realignment. In particular, because the track realignment phase requires an orthology map, at most one duplication is categorized as an ortholog and hence placed in the map. A deletion, similarly, does not violate the invariant that every region map to its ortholog; in the worst case, some regions will not have any orthologs, which is perfectly sensible. It is worth noting, however, that regions without orthologs are assumed to have the trivial alignment, that is, the original, ungapped sequence, and are realigned accordingly. Given all of this, inversions also do not break any orthology map invariant. The peculiar characteristic of inversions, however, is that they require that at least one genome be represented on the negative strand. We found this particular method of display to be somewhat unintuitive, especially when it was the original region requested by the user for visualization, and therefore require that the user-requested region always be displayed on the positive strand. It turns out that it is impractical to dynamically flip strands during image generation, and hence, the K-BROWSER requires precomputed track database realignments on both strands.
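The source does not describe how the negative-strand realignments are computed. As a rough sketch of the coordinate arithmetic that precomputing both strands would involve, the following Python assumes 0-based, half-open intervals within a region of known length; `flip_interval` and `flip_track` are hypothetical names.

```python
def flip_interval(start, end, region_length):
    """Map a half-open interval [start, end) onto the opposite strand."""
    if not 0 <= start <= end <= region_length:
        raise ValueError("interval must lie within the region")
    # The last base of the interval becomes its first base on the other strand.
    return region_length - end, region_length - start


def flip_track(entries, region_length):
    """Flip (start, end, strand) entries and return them in left-to-right order."""
    flipped = []
    for start, end, strand in entries:
        new_start, new_end = flip_interval(start, end, region_length)
        flipped.append((new_start, new_end, "-" if strand == "+" else "+"))
    return sorted(flipped)
```

Flipping twice returns the original interval, so a table computed for one strand can be checked against the other.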
Image Generation
The K-BROWSER image generation component builds an entirely new framework based on subroutines borrowed from the UCSC Genome Browser. In regard to the latter, the K-BROWSER borrows low-level database- and track-processing subroutines provided with UCSC Genome Browser (Kent et al. 2002 Our original contributions to the K-BROWSER image generation component consist of (1) a new high-level framework that extends UCSC Genome Browser functionality, and (2) special methods to efficiently represent and visualize multiple alignments. The former extends the UCSC Genome Browser code to handle image production of regions that cross orthology set boundaries. Ultimately, this code appears to let UCSC build an independent image for each orthologous set and proportionately "stitch" them together afterward. In practice, however, the K-BROWSER does not generate independent images because doing so would be grossly inefficient and inelegant. Instead, it integrates seamlessly into the lower-level UCSC data-processing code and is responsible for allocating the proportionate amount of space within an image and appropriately organizing UCSC data-processing function calls.
The necessary complementary functionality, the ability to produce images within an orthologous set, is supported by the original UCSC Genome Browser code and the aforementioned special methods. This functionality is implemented in two phases: we first use UCSC Genome Browser code to paint in the realigned tracks, and then walk through the region again to paint in gaps. In particular, because the produced images represent alignment-coordinate regions and the realigned tracks are already in alignment coordinates, the unmodified UCSC Genome Browser code can be used directly to produce track visualizations. Recalling, however, that the realignment phase "stretched" tracks over multiple alignment gaps, it is clear that the UCSC Genome Browser code will entirely ignore gaps and paint arbitrary features in their place.
To this end, we implement special methods that iterate through the appropriate subset of the segment table and overwrite the regions of the image that correspond to multiple alignment gaps.
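The gap-painting pass can be sketched as follows. This is an illustration under stated assumptions rather than the K-BROWSER code: the segment table is assumed to be a list of `(length, genomic_start, aligned_start)` tuples sorted by aligned start, intervals are half-open, and `gaps_in_region` and `to_pixel_span` are hypothetical helper names.

```python
def gaps_in_region(table, region_start, region_end):
    """Return the alignment-coordinate intervals of [region_start, region_end)
    that are not covered by any segment, i.e. the gaps to paint over."""
    gaps, cursor = [], region_start
    for length, _genomic_start, aligned_start in table:
        segment_end = aligned_start + length
        if segment_end <= region_start:
            continue                      # segment lies left of the region
        if aligned_start >= region_end:
            break                         # segment lies right of the region
        if aligned_start > cursor:
            gaps.append((cursor, aligned_start))
        cursor = max(cursor, segment_end)
    if cursor < region_end:
        gaps.append((cursor, region_end))  # trailing gap up to the region end
    return gaps


def to_pixel_span(interval, region_start, region_end, width):
    """Scale an alignment-coordinate interval to horizontal pixel positions."""
    scale = width / (region_end - region_start)
    low, high = interval
    return int((low - region_start) * scale), int((high - region_start) * scale)
```

With the table `[(2, 0, 0), (2, 2, 4), (1, 4, 7)]` and the region `[0, 8)`, the gaps are `(2, 4)` and `(6, 7)`; in an 800-pixel-wide image the first of these covers pixels 200 to 400, which is the span that would be overwritten after the tracks are painted.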
Conservation
With regard to the identity plot, the K-BROWSER scores each position in the multiple alignment as the fraction of completely conserved columns in a window centered about that position. In addition, it allows the user to select a track according to which the conservation plot is to be colored, that is, blue for exonic regions, red for conserved noncoding regions, and so on. To this extent, it extends the useful identity plots on the Vista Genome Browser (Couronne et al. 2003 Furthermore, given a phylogenetic tree and an evolutionary model, the K-BROWSER can also compute the average probability that the root sequence is different from the leaf sequence in a window centered about a specified position. This metric is meaningful as it allows one to not just determine if a genomic region is conserved with other genomes, but, in fact, to infer the rate at which it is evolving from the root. This score can be computed by inferring, for the aforementioned window, the distances from the root to each of the observed nucleotides. Roughly, these distances can be interpreted as the average number of mutations between the root and the leaf in a continuous-time Markov chain. As such, they can be exponentiated to determine the average probability of mutation between the root and leaf nucleotides.
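Both scores can be illustrated with a short Python sketch. Several choices here are assumptions and not taken from the source: windows are truncated at the ends of the alignment, a column counts as completely conserved only if every row carries the same non-gap character, and the Jukes-Cantor model stands in for the unspecified evolutionary model. The function names are hypothetical.

```python
import math


def identity_plot(rows, window):
    """Score each column as the fraction of completely conserved columns
    in a window centered on it (windows are truncated at the edges)."""
    ncols = len(rows[0])
    conserved = [
        rows[0][c] != "-" and len({row[c] for row in rows}) == 1
        for c in range(ncols)
    ]
    # Prefix sums make every window count a constant-time difference.
    prefix = [0]
    for flag in conserved:
        prefix.append(prefix[-1] + flag)
    half = window // 2
    scores = []
    for c in range(ncols):
        low, high = max(0, c - half), min(ncols, c + half + 1)
        scores.append((prefix[high] - prefix[low]) / (high - low))
    return scores


def jc69_difference_probability(distance):
    """Probability that a leaf nucleotide differs from the root after
    `distance` expected substitutions per site, under Jukes-Cantor
    (an assumed model; the source does not name one)."""
    return 0.75 * (1.0 - math.exp(-4.0 * distance / 3.0))


def mean_difference_probability(distances):
    """Average the root-to-leaf difference probability over inferred distances."""
    return sum(jc69_difference_probability(d) for d in distances) / len(distances)
```

Under this model the difference probability is 0 at distance 0 and approaches 3/4 as the distance grows, which is the saturation one expects when the leaf nucleotide becomes independent of the root.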
Applications
The K-BROWSER picture immediately reveals the complex insertion and deletion patterns in the region. It is obvious, for example, just by visual inspection, that there has been a large rodent insertion in the middle of the region. A closer inspection reveals that this large insertion is accompanied by several significant, simultaneous deletions and a very large number of small deletions in the rodents. Indeed, there are >3000 gaps in the rodent alignments, of which approximately two-thirds are <10 bp in length; in contrast, there are 2000 gaps in the human sequence, half of which are <10 bp. The incomplete rat mRNA is also immediately obvious, thanks to the aligned ab initio predictions between the genomes. It is interesting to note that GENSCAN annotated the initial exon correctly only in mouse, and not a single ab initio method correctly annotated the gene in human (although SGP annotated the gene correctly in mouse). The conservation plot above the sequences instantly reveals conserved noncoding sequences (in red), and for viewing the mouse it is useful to switch the track base for the coloring. Zooming out (3x) reveals the larger-scale synteny in this region of Chromosome 7 in the human, and zooming out another 10x reveals that the synteny among Chromosomes 7 (human), 6 (mouse), and 4 (rat) is preserved throughout the entire region. The alignments for all of these regions can be retrieved, and are conveniently compressed for large regions.
Availability
We thank Nicolas Bray and Colin Dewey for suggestions and help with the alignments. Yin Lau helped with the Web site design. L.P. was partially supported by the NIH (R02-HG02362-01), and K.C. was partially supported by a COR grant from UC Berkeley. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1957004.
3 Corresponding author.
Alexandersson, M., Cawley, S., and Pachter, L. 2003. SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res. 13: 496-502.
Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K.D., Ovcharenko, I., Pachter, L., and Rubin, E.M. 2003. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299: 1391-1394.
Bray, N. and Pachter, L. 2004. MAVID: Constrained ancestral alignment of multiple sequences. Genome Res. (this issue).
Bray, N., Dubchak, I., and Pachter, L. 2003. AVID: A global alignment program. Genome Res. 13: 97-102.
Clamp, M., Andrews, D., Barker, D., Bevan, P., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., et al. 2003. Ensembl 2002: Accommodating comparative genomics. Nucleic Acids Res. 31: 38-42.
Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., Waterston, R., Cohen, B., and Johnston, M. 2003. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301: 71-76.
Couronne, O., Poliakov, A., Bray, N., Ishkhanov, T., Ryaboy, D., Rubin, E., Pachter, L., and Dubchak, I. 2003. Strategies and tools for whole-genome alignments. Genome Res. 13: 73-80.
Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., and Down, T. 2002. The Ensembl genome database project. Nucleic Acids Res. 30: 38-41.
Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y.T., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., et al. 2003. The UCSC Genome Browser Database. Nucleic Acids Res. 31: 51-54.
Kellis, M., Patterson, N., Endrizzi, M., Birren, B., and Lander, E.S. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423: 241-253.
Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. 2002. The Human Genome Browser at UCSC. Genome Res. 12: 996-1006.
Korf, I., Flicek, P., Duan, D., and Brent, M.R. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics 17: s140-s148.
McCutcheon, J.P. and Eddy, S.R. 2003. Computational identification of non-coding RNAs in Saccharomyces cerevisiae by comparative genomics. Nucleic Acids Res. 31: 4119-4128.
Parra, G., Agarwal, P., Abril, J.F., Wiehe, T., Fickett, J.W., and Guigó, R. 2003. Comparative gene prediction in human and mouse. Genome Res. 13: 108-117.
gene-centered resources. Nucleic Acids Res. 29: 137-140.
Thomas, J.W., Touchman, J.W., Blakesley, R.W., Bouffard, G.G., Beckstrom-Sternberg, S.M., Margulies, E.H., Blanchette, M., Siepel, A.C., Thomas, P.J., and McDowell, J.C. 2003. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424: 788-793.
http://baboon.math.berkeley.edu/mavid/; MAVID multiple alignment server.
http://genome.ucsc.edu/; UCSC genome browser.
http://hanuman.math.berkeley.edu/kbrowser/; K-BROWSER home page.
http://pipeline.lbl.gov/vistabrowser/; VISTA genome browser.
http://www.ensembl.org/; ENSEMBL genome browser home page.
Received September 10, 2003; accepted in revised form November 17, 2003.