Vol. 8, Issue 9, 967-974, September 1998
GENOME METHODS
ABSTRACT
We address the problem of efficiently aligning a transcribed and spliced DNA sequence with a genomic sequence containing that gene, allowing for introns in the genomic sequence and a relatively small number of sequencing errors. A freely available computer program, described herein, solves the problem for a 100-kb genomic sequence in a few seconds on a workstation.
INTRODUCTION
With large amounts of both expressed and genomic DNA sequence data being made available, it is becoming more common to align the two. We have written a computer program, called sim4, to perform such alignments very efficiently and accurately, under the assumption that the differences between the two sequences are limited to (1) introns in the genomic sequence, and (2) sequencing errors (in either sequence).
The next section describes use of sim4 in a production setting. Then, the tool's accuracy is assessed using simulated data, into which "sequencing errors" are introduced using a random number generator. Next, we report some experimental data obtained by aligning human mRNAs with the homologous genomic sequence from the mouse. This application is somewhat outside sim4's intended scope, as evolutionary differences such as long insertions need to be handled, but useful results are frequently produced. We then illustrate how the capabilities of sim4 can be incorporated into larger tools and software packages and finish with a brief description of sim4's algorithmic approach.
The program can be obtained by anonymous ftp from globin.cse.psu.edu or over the World Wide Web from http://globin.cse.psu.edu/.
The BDGP: cDNA vs. Genomic Alignments
The Berkeley Drosophila Genome Project (BDGP) is a consortium
whose goal is to determine the complete DNA sequence of the euchromatic
genome of the fruit fly Drosophila melanogaster and to
develop experimental and computational tools to probe its biological significance (Rubin 1996). It includes a large-scale sequencing project, together with both biological and computational annotation projects, the results of which are curated by experienced
Drosophila biologists. This work is available on the World
Wide Web at http://fruitfly.berkeley.edu/.
Among genomic annotations, the location of genes in the genomic
sequence is of great interest to both biologists and computer scientists. An accurate and well-curated transcript map helps biologists understand mutational effects and the regulation of gene
expression, and it gives computational biologists a powerful data set
for training and evaluating algorithms. Like other large-scale genome
projects (Eddy 1994; Cherry et al. 1998), the BDGP provides both
computational predictions and experimental results. Computational results come from a collection of gene finders, including Genie (Reese et al. 1997) and dGrail (Xu and Uberbacher 1997), each of which has
different strengths and weaknesses. Experimental annotations are based
on sequence data from a variety of EST and full-length cDNA sequencing
projects. These cDNA sequences have been positioned on the genomic
sequence using a variety of tools [primarily Blast (Altschul et al. 1990)] with substantial manual intervention. Increasing
quantities of data have made this technique unworkable, necessitating a
specialized tool for aligning cDNA and genomic sequences. sim4
fills this need by quickly aligning a cDNA sequence to its parent
genomic sequence with sufficient accuracy to require minimal manual
editing.
Validating sim4's Alignments
To evaluate sim4's alignments on a set of genes with known structures, we started with a curated set of genomic GenBank sequences for multiexon Drosophila genes that was developed to train the Drosophila version of Genie (Reese et al. 1997).
Comparisons with Similar Tools
We know of three other tools that are designed to align spliced sequences (mRNA, cDNA, ESTs) to the corresponding genomic sequences. The goal of this section is to compare sim4 with these other tools. Gelfand et al. (1996)
Assigning cDNAs to a Genomic Clone
The Adh region of the Drosophila genome has been the object of intense genetic and biochemical scrutiny for many years. Because of the wealth of available information, the BDGP has been using it in a pilot study for its annotation project. It is one of the foci of the large-scale sequencing project, and much of our cDNA sequencing has been concentrated on transcripts from this region. As part of the annotation project we have identified 27 cDNA sequences in GenBank that are from the Adh region and have been assigned to particular P1 clones. We used these sequences to determine whether sim4 would be able to detect the correct location of a cDNA sequence in our pool of genomic sequence. Each of the 27 cDNA sequences was compared to the current collection of 3120 contigs from our P1 clones, covering the Adh region as well as other regions of the genome, for a total of 84,240 alignments. Selecting the alignments that included >90% of the cDNA's sequence and were >90% similar over all of the exons gave a single alignment for each of 21 of the sequences. All six sequences missed by this simple screening rule were easily accounted for.
1. Four of the cDNAs spanned multiple P1 clones. Their alignment to an individual clone accounted for <90% of their length, though they were very similar.
2. One of the cDNA sequence annotations referred to a related, but incorrect, GenBank entry. This incorrect entry is not in a region for which we have genomic sequence, so sim4 was correct when it was unable to assign it to a P1 clone. Using the correct GenBank sequence for the gene results in a three-exon match with 100% identity using 100% of the clone.
3. The final cDNA had two difficulties. First, it is only partially contained in our collection of genomic clones, and the clone ends in the middle of a large intron. Second, there are some substantial differences between the genomic and cDNA sequences that are probably due to differences in the parent Drosophila strains or to sequencing errors. sim4's alignment to the correct P1 clone found three exons, which were 100%, 94%, and 88% similar, respectively. The mismatches were all clustered in multibase deletions.
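The screening rule used above (more than 90% of the cDNA covered, more than 90% similarity over all exons) can be sketched as a small filter. This is only an illustration under stated assumptions: the `Exon` record, its field names, and the definition of similarity as identical positions divided by aligned cDNA positions are hypothetical and are not taken from sim4's output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exon:
    """Hypothetical summary of one aligned exon (not sim4's own format)."""
    cdna_start: int  # 1-based, inclusive position in the cDNA
    cdna_end: int    # 1-based, inclusive position in the cDNA
    matches: int     # number of identical aligned positions in this exon


def passes_screen(exons: List[Exon], cdna_length: int,
                  min_coverage: float = 0.90,
                  min_identity: float = 0.90) -> bool:
    """Return True if the alignment covers >90% of the cDNA and is
    >90% similar over all exons combined (both thresholds adjustable)."""
    aligned = sum(e.cdna_end - e.cdna_start + 1 for e in exons)
    if aligned == 0:
        return False
    matches = sum(e.matches for e in exons)
    coverage = aligned / cdna_length
    identity = matches / aligned  # assumed definition of "similar"
    return coverage > min_coverage and identity > min_identity
```

Under this rule a cDNA that spans several P1 clones fails on coverage for each individual clone even when its exons are nearly identical, which matches the first category of misses described above.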
Tests on Simulated Data
To further assess the accuracy of sim4, we extracted mRNAs
from 16 genes in a 222,930-bp genomic sequence from human Chromosome
12p13 (Ansari-Lari et al. 1997; GenBank accession no. HSU47924) based
on the annotated exon boundaries. Using a random number generator,
we introduced nucleotide substitutions an average of twice as
frequently as either (single-nucleotide) insertions or deletions. We
modeled two kinds of data: ESTs and full-length mRNAs.
ESTs were simulated by randomly selecting 500 bp from the mRNA and introducing errors at rates of 1%, 3%, and 5%. The results cited in Table 2 indicate that even with ESTs, sim4 should usually give the correct alignment. For full-length mRNAs, we measured performance with error rates of 0.1% and 1%. sim4 failed to correctly identify the boundaries of a short (6 nucleotides) internal exon in the hBAP gene. The 6 nucleotides were instead distributed at the ends of the adjacent exons. Even so, the experiment's results, summarized in Table 3, suggest that with highly accurate full-length cDNA sequences, sim4's alignment should be completely correct the vast majority of the time.
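A minimal sketch of this kind of simulation is given below. It assumes that "twice as frequently as either insertions or deletions" means a 2:1:1 split of the overall error rate among substitutions, insertions, and deletions, and that errors are drawn independently at each position; the function names and the choice to place an inserted base before the current one are illustrative assumptions, not a description of the procedure actually used for Tables 2 and 3.

```python
import random


def introduce_errors(seq: str, rate: float, rng: random.Random) -> str:
    """Introduce simulated sequencing errors at the given per-base rate.

    Assumed split of the rate: 1/2 substitutions, 1/4 single-base
    insertions, 1/4 single-base deletions.
    """
    out = []
    for base in seq:
        r = rng.random()
        if r < rate * 0.5:
            # Substitution: replace with a different nucleotide.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < rate * 0.75:
            # Insertion: add a random nucleotide before this base.
            out.append(rng.choice("ACGT"))
            out.append(base)
        elif r < rate:
            # Deletion: drop this base.
            continue
        else:
            out.append(base)
    return "".join(out)


def simulate_est(mrna: str, rate: float, rng: random.Random,
                 length: int = 500) -> str:
    """Pick a random window of `length` bp from the mRNA and add errors."""
    if len(mrna) <= length:
        fragment = mrna
    else:
        start = rng.randrange(len(mrna) - length + 1)
        fragment = mrna[start:start + length]
    return introduce_errors(fragment, rate, rng)
```

For example, `simulate_est(mrna, 0.03, random.Random(0))` would produce one simulated EST at the 3% error rate; passing a seeded `random.Random` keeps the experiment repeatable.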
Cross-Species Alignments
sim4 is intended to produce a correct alignment that
accounts for introns and for sequencing errors. It is not designed to
deal properly with evolutionary mutations, such as multinucleotide insertions and deletions. To get a better feel for the rate at which
sim4's accuracy degrades with evolutionary divergence, we
measured its effectiveness at aligning the 16 human mRNAs discussed in
the previous section with the orthologous genomic sequence from the
mouse, which is available as GenBank accession numbers AC002393 and
AC002397 (Ansari-Lari et al. 1998).
Of the 16 genes, 13 are more highly conserved than the average of
84.6% nucleotide identity reported in a survey of 1196 human/mouse orthologs by Makalowski et al. (1996). The only gene that is
substantially less conserved than this average, CD4, is
associated with the immune system, which is frequently the case with
highly divergent genes.
Table 4, column 4, reports how much of each mRNA was aligned by sim4, and column 5 shows how much of each protein-coding region was aligned. We also compared the positions of exon boundaries with the positions determined by sim4's putative exons. Column 6 gives the number of nucleotides that were aligned to non-mRNA regions of the mouse, as a percentage of the mRNA's length. Each time an exon boundary was misplaced by, for example, k nucleotides, k was added to this amount, and in one case (gene A-2) an erroneous exon of length 8 was predicted. Thus, we are assuming that the mouse mRNA preserves the human splice junctions.
Two trends are evident from the data presented in Table 4. First,
sim4 is frequently much more effective at aligning protein-coding regions than at aligning the UTRs at the ends of the mRNA. For
instance, for 9 of the 16 genes, sim4 was 100% accurate in the
coding regions, whereas 100% accuracy for the entire gene was attained
in only three cases. This reflects the fact that a gene's 5' and
3' UTRs are usually much less well conserved than the coding region
(Makalowski et al. 1996). Second, typically <1% of the nucleotides
in sim4's putative exons were not in the true mRNA, even in
cases where sim4 was unable to find the gene accurately.
Other Uses of sim4
The approach implemented in sim4 may be fruitfully
integrated into a variety of sequence analysis packages, as illustrated here. One natural use of these methods is for comparing a genomic sequence with an EST database. That problem was addressed earlier by
Huang et al. (1997), using other computational methods.
To explore the use of sim4's algorithm for this potential
application, we built a prototype program, called blEST, that
can quickly identify near-identity matches between a genomic sequence
and an EST database. After masking interspersed repeats (e.g.,
Alus) and low-complexity regions in the genomic sequence, blEST extracts from the database all ESTs that share a 32-bp
exact match with the genomic sequence. The resulting ESTs are then
compared with the unmasked genomic sequence using a variant of
sim4 that reports only those ESTs that meet certain (adjustable) conditions, such as (1) the putative identified exons must
cover at least 70% of the database sequence, and (2) the overall
identity within those exons must be at least 95%. Although the running
time depends on the number of matching ESTs, we found it to average
~1 min/100 kb of genomic sequence on a 200-MHz workstation, when
comparing a human genomic sequence with all human ESTs in the dbEST
database (Boguski et al. 1993). However, the loss of effectiveness
caused by restricting attention to only very strong matches (e.g., at
least 95% identity) remains to be evaluated before this approach can
be recommended for general use.
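The first stage described above, extracting every EST that shares a 32-bp exact match with the masked genomic sequence, can be sketched with a simple k-mer lookup. This is not blEST's implementation, which is not described here; the function names, the use of `N` as the masking character, and the check of both EST orientations are assumptions made for illustration.

```python
from typing import Dict, List, Set

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_set(masked_genomic: str, k: int = 32) -> Set[str]:
    """All k-mers of the genomic sequence that contain no masked base."""
    kmers = set()
    for i in range(len(masked_genomic) - k + 1):
        word = masked_genomic[i:i + k]
        if "N" not in word:  # assumed masking character
            kmers.add(word)
    return kmers


def shares_exact_match(kmers: Set[str], est: str, k: int = 32) -> bool:
    """True if the EST, in either orientation, contains a genomic k-mer."""
    for strand in (est, revcomp(est)):
        for i in range(len(strand) - k + 1):
            if strand[i:i + k] in kmers:
                return True
    return False


def candidate_ests(masked_genomic: str, ests: Dict[str, str],
                   k: int = 32) -> List[str]:
    """Names of ESTs that would go on to the alignment stage."""
    kmers = kmer_set(masked_genomic, k)
    return [name for name, seq in ests.items()
            if shares_exact_match(kmers, seq, k)]
```

Only the ESTs returned by `candidate_ests` would then be aligned against the unmasked genomic sequence and checked against the coverage and identity conditions (for example, 70% and 95%).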
Typically, results from database searches are combined with other
sources of information to reach certain conclusions. In particular, a
major use of ESTs is for identifying genes in a sequenced genomic
region (e.g., Smith et al. 1996; Ansari-Lari et al. 1997; Flint et al. 1997; Ruddy et al. 1997). Several groups have found that the
information provided by ESTs substantially enhances the results of
gene-prediction programs, such as GRAIL (Uberbacher et al. 1996).
We recently began to explore the predictive power of combining
human/mouse sequence comparison with other tools to identify genes
(Ansari-Lari et al. 1998). A goal is to produce a system that can
automatically analyze orthologous human and mouse genomic sequence data
at, for example, a rate of 100 kb in a few minutes (i.e., in a small
multiple of the time taken to identify repeats) and that presents the
results in a readily understood graphic format.
A number of approaches and software tools have been developed by
various groups to provide a graphic summary of sequence positions that
match ESTs. At one extreme are programs (e.g., Harris 1997; Ansari-Lari et al. 1998) that do not distinguish regions that match only one EST
from regions with multiple matches. At the other extreme, the program
PowerBLAST (Zhang and Madden 1997) shows each match, complete
with the identification of positions where sequences disagree. An
innovative approach of Smith et al. (1996) uses colors and a kind of
"projected three-dimensional" display to indicate how many ESTs
match in a given region, as well as the strengths of those matches.
There is a strong rationale for at least giving some indication of how
many ESTs match the genomic sequence in a given region. A number of
investigators have observed that genomic regions aligning with several
ESTs are more likely to contain a gene than if only one EST aligns. For
instance, in tests using genomic sequence data with well-characterized
gene content, and at stringencies comparable to those used by
blEST, essentially every EST cluster detected a gene, whereas
only 70% of singleton aligning ESTs did so (Fig. 2C in Bailey et al. 1998). Moreover, an indication of the number of hits may provide at
least a weak indication of expression levels for each gene.
Figure 1 shows part of a pip (percent
identity plot; Hardison et al. 1997) of a
human/mouse alignment in the BTK region (Oeltjen et al. 1997)
that has been automatically annotated using the output of
blEST. Note the singleton human ESTs containing portions of
introns 13, 15, 16, and 17 of BTK and the mouse EST extending
slightly upstream of exon 19. Also note that the introns estimated by
blEST and the indication of EST redundancy accurately identify
the true exons.
METHODS
In the approach described here, an expressed sequence is aligned with a genomic sequence in the following steps.
1. Determine high-scoring segment pairs (HSPs). An HSP is just a high-scoring gap-free alignment of regions of the two sequences, such as computed by the blast program (Altschul et al. 1990) [...] 5 for a mismatch, stopping when extensions no longer increase the score. Code to locate HSPs in a pair of long DNA sequences was borrowed from a program described by Schwartz et al. (1991).
2. Select a set of HSPs that could represent a gene. A dynamic programming algorithm selects a best chain of the HSPs subject to the constraints that (a) their starting positions in the expressed sequence are in increasing order, and (b) the diagonals of consecutive HSPs are either nearly the same or differ by enough to be a plausible intron. HSP scores are multiplied by 100 and reduced by the differences between diagonals of consecutive HSPs to determine a score for a chain.
3. Find exon boundaries. When consecutive "exon cores" (each given by a collection of HSPs on nearly the same diagonal in the gene model) overlap, the ends are trimmed in an attempt to find an intron matching either GT...AG or CT...AC. (It might be worthwhile to consider more sophisticated rules for splice junctions, e.g., those used by Burge and Karlin (1997).)
4. Determine the alignment. The alignment for each exon (whose boundaries in each of the sequences are determined by the previous step) is computed by the method of Chao et al. (1997).
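The chaining step (step 2) can be illustrated with a short quadratic-time dynamic program. This is a sketch under stated assumptions rather than sim4's code: the `HSP` record, the `max_shift` tolerance for "nearly the same" diagonal, and the `min_intron` threshold for a "plausible intron" are hypothetical values chosen for the example. Only the scoring rule (100 times the HSP score, minus the difference between consecutive diagonals) and the two constraints come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class HSP:
    cdna_start: int     # start position in the expressed sequence
    genomic_start: int  # start position in the genomic sequence
    length: int
    score: int

    @property
    def diagonal(self) -> int:
        return self.genomic_start - self.cdna_start


def best_chain(hsps: List[HSP], max_shift: int = 3,
               min_intron: int = 30) -> Tuple[List[HSP], int]:
    """Return the highest-scoring chain of HSPs and its score."""
    if not hsps:
        return [], 0
    ordered = sorted(hsps, key=lambda h: h.cdna_start)
    n = len(ordered)
    best = [100 * h.score for h in ordered]  # best chain ending at each HSP
    prev: List[Optional[int]] = [None] * n
    for j in range(n):
        for i in range(j):
            # (a) starting positions in the expressed sequence must increase
            if ordered[i].cdna_start >= ordered[j].cdna_start:
                continue
            # (b) same diagonal (within tolerance) or a plausible intron
            shift = ordered[j].diagonal - ordered[i].diagonal
            if not (abs(shift) <= max_shift or shift >= min_intron):
                continue
            candidate = best[i] + 100 * ordered[j].score - abs(shift)
            if candidate > best[j]:
                best[j] = candidate
                prev[j] = i
    end = max(range(n), key=lambda idx: best[idx])
    chain = []
    k: Optional[int] = end
    while k is not None:
        chain.append(ordered[k])
        k = prev[k]
    chain.reverse()
    return chain, best[end]
```

Because the diagonal difference is subtracted directly, a long intron is only accepted when the HSPs it connects score well enough to pay for it, which discourages chaining through short spurious matches far away on the genomic sequence.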
ACKNOWLEDGMENTS
We thank Jinghui Zhang for suggesting that we write sim4, Sima Misra for her help with Drosophila genome information, Martin Reese for sharing his Drosophila training set, and Michael Ashburner for sharing his set of genes in the Adh region. This work was supported in part by a grant from the National Human Genome Research Institute to G.M.R. and by grant R01 LM05110 from the National Library of Medicine to W.M.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
3 Corresponding author.
E-MAIL webb{at}cse.psu.edu; FAX (814) 865-3176.
REFERENCES
Database for "expressed sequence tags." Nature Genet. 4: 332-333.
Practice Experience 15: 1025-1040.
Received May 18, 1998; accepted in revised form July 21, 1998.