|
Vol. 10, Issue 8, 1115-1125, August 2000
Conservation, Regulation, Synteny, and Introns in a Large-scale C. briggsae-C. elegans Genomic Alignment
W. James
Kent,1 and
Alan M.
Zahler
Department of Biology and Center for Molecular Biology of RNA,
University of California at Santa Cruz, Santa Cruz, California 95064 USA
 |
ABSTRACT |
A new algorithm, WABA, was developed for doing large-scale
alignments between genomic DNA of different species. WABA was used to
align 8 million bases of Caenorhabditis briggsae genomic DNA against the entire 97-million-base Caenorhabditis elegans
genome. The alignment, including C. briggsae homologs of 154 genetically characterized C. elegans genes and many times this
number of largely uncharacterized ORFs, can be browsed and searched on
the Web (http://www.cse.ucsc.edu/~kent/intronerator). The alignment
confirms that patterns of conservation can be useful in identifying
regulatory regions and rarely expressed coding regions. Conserved
regulatory elements can be identified inside coding exons by examining
the level of divergence at the wobble position of codons. The alignment
reveals a bimodal size distribution of syntenic regions. Over 250 introns are present in one species but not the other. The 3' and
5' intron splice sites have more similarity to each other in
introns unique to one species than in C. elegans introns as a
whole, suggesting a possible mechanism for intron removal.
 |
INTRODUCTION |
Biologists often compare DNA sequences to determine the
relationships between species and to learn about the
function of genes. As a result of the efforts of various sequencing
projects, it is now becoming possible to do sequence comparisons
between complete genomes. These comparisons can answer questions about
how genomes change over time in more detail than is possible by
chromosome painting (O'Brien et al. 1999 ). Comparing coding regions
between species can locate functionally important parts of proteins,
which are more highly conserved than other parts (Makalowski et al. 1996 ; Makalowski and Boguski 1998 ). Comparing noncoding regions of
homologous genes between species separated by an appropriate evolutionary distance can locate promoters and other conserved regulatory elements (Hardison et al. 1997 ; Endrizzi et al. 1999 ; Jareborg et al. 1999 ). Comparisons between mouse and human, between maize and rice (Wilson et al. 1999 ), and between Fugu
rubripes, Tetraodon fluviatilis, and Danio rerio
(Boeddrich et al. 1999 ) have all revealed important functional regions
of the genome.
This paper presents a genomic DNA comparison between the nematodes
Caenorhabditis briggsae and Caenorhabditis elegans.
C. elegans and C. briggsae are closely related
nematodes in the same genus (Baldwin et al. 1997 ; Blaxter 1998 ; Blaxter
et al. 1998 ; Voronov et al. 1998 ). They are estimated to have diverged
25-50 million years ago, although without a fossil record these
estimates are highly dependent on assumptions about mutation rates that vary considerably between organisms and between genes (Ayala et al.
1996 ). In practical terms, C. elegans and C. briggsae
are separated by a nearly ideal distance for comparative genomics. In
regions experiencing selective pressure, close to 80% base identity is
preserved between species. In other regions base identity is close to
30% (Shabalina and Kondrashov 1999 ). The large-scale comparative
genomic study we present here aligns 8 million bases of C. briggsae sequence from 229 cosmids with 97 million bases of C. elegans sequence covering essentially the entire genome (C.
elegans Sequencing Consortium 1998 ). The scale of this comparison presented unique challenges and resulted in the development of new
algorithms that can cope with long insertions. Our algorithms are also
able to recognize homologous regions at the DNA level despite the rapid
divergence in the wobble position of most codons.
 |
RESULTS AND DISCUSSION |
An Algorithm for Interspecies Genomic Sequence Alignments
The alignment of 8 million bases of C. briggsae against 97 million bases of C. elegans involves, at least conceptually,
almost 800 trillion nucleotide comparisons. This requires a relatively fast algorithm. Genomic alignment also requires handling both small and
large insertions and deletions and a high degree of divergence in the
third, wobble, position of codons. We developed a three-pass
algorithm the Wobble Aware Bulk Aligner (WABA) that meets these
requirements and is described in some detail in Methods. Homologous
regions were identified with the first pass of WABA, which took 20 hr
of run time on a Pentium III 450 mHz. The second pass, which did a
detailed alignment of overlapping 2000 × 5000 base regions, took 11 days to run on the same machine. The third pass
joined the overlapping alignments in 15 min on a comparable machine.
Although the first and especially the second passes are slow on a
single workstation, it is easy to distribute them on many machines.
Fortunately, it is in the first pass that the run time varies with the
product of the genome sizes that are being compared, where the run
times of the other passes are close to linear with respect to the
shorter genome size. It should thus be possible in the future to apply
WABA to aligning large vertebrate genomes.
To help assess the utility of WABA, we performed the same alignment
with BLAST (Altschul et al. 1997 ). We ran the BLASTN program with two
different settings: the default and the slower, but more sensitive,
settings. The results are summarized in Table 1. WABA aligned over twice as many bases as BLAST. The average length of a
single WABA alignment was >10 times as long as the average BLAST
alignment. WABA alignments with inserts of >5 bases or with sections
of poor conservation >20 bases long tended to appear as several
separate BLAST alignments. Blast found 80% of the regions WABA
classified as coding or highly conserved (see below). Table 2 shows an alignment that only WABA found. Running
BLAST with a window size of 8 and an extension threshold of 8 rather
than the default 11 and 11 resulted in BLAST aligning 6% more bases, but this also increased the run time by a factor of 41. WABA was also
slower than BLAST at default settings but only by a factor of 24.
View this table:
[in this window]
[in a new window]
|
Table 2.
An Alignment WABA Finds but BLAST Misses Between the C. briggsae Cosmid G47H01 and the C. elegans Cosmid
T23C6
|
|
Web-based Browser and Database
The results of the genomic sequence alignment of C. briggsae with C. elegans are available through a Web-based
display called the Intronerator. The alignments can be viewed both as a
graphical display indicating the degree of sequence conservation and as a base-by-base alignment in the context of gene predictions and mRNA
alignments (Kent and Zahler 2000 ) at
http://www.cse.ucsc.edu/~kent/intronerator. A sample Intronerator
high-level view is shown in Figure 1. This high-level
view displays the region of the C. elegans genome containing the unc-47 gene. Regions containing homology to C. briggsae are represented by a bar immediately under the gene
predictions. Areas of stronger homology are represented in a darker
shading of the bar than regions of lower homology. Clicking on the bar
brings up a base-by-base view of the alignment. The browser is entered via the Tracks Display link on the Intronerator home page. It is
possible to search as well as browse the alignments by following the
Extract Sequences link. C. elegans genes can be searched for ORFs, for sequenced genes, for the presence of aligning mRNA, and for
the presence of aligning C. briggsae genomic sequence. The
various search criteria can be combined in a flexible manner. The names
of genetically characterized and sequenced C. elegans genes
with C. briggsae sequence homology are shown in Table
3. The program can also restrict the returned
sequences to introns only, exons only, or windows relative to the
translational start or finish. This last feature can be used to extract
likely promoter regions. A Web interface to the alignment algorithm
itself is available at http://www.cse.ucsc.edu/~kent/xenoAli/. From
this page the user can align sequences against the C. elegans
genome or align two sequences to each other. The page also links to a list of all homologous C. briggsae and C. elegans
regions sorted by either alignment score, location in C. briggsae, or location in C. elegans.

View larger version (11K):
[in this window]
[in a new window]
|
Figure 1
An Intronerator tracks display of the region around
unc-47. The top track shows the gene prediction of the C.
elegans Sequencing Consortium (1998) that is used in AceDB. The
second track shows a prediction by the Genie gene finding program (Kulp
et al. 1996 , 1997 ; Reese et al. 1997 ). The third track gives a high
level view of the C. briggsae homolog. Regions of high
homology (classified by the seven state aligner as highly conserved or
coding) appear darker. Regions of lower homology appear in a lighter
shade. Nonaligning regions are white. The fourth track shows a cDNA
alignment. Note that in this case the Genie prediction agrees with the
mRNA, whereas the AceDB prediction misplaces the first exon. Note too
the highly conserved region upstream of the first exon containing known
promoter elements for this gene (Eastman et al. 1999 ).
|
|
View this table:
[in this window]
[in a new window]
|
Table 3.
A List of Genetically Characterized and Sequence C. elegans Genes for Which There Is Homology in the Washington
Univeristy C. briggsae
Sequence Data
|
|
Alignment Statistics
Overall, 59% of the C. briggsae sequence is homologous to
C. elegans. The alignment algorithm developed here has at its
core a pair-hidden Markov model (HMM) (Durbin et al. 1998 ) that, along with the alignment itself, classifies regions into weakly conserved regions, strongly conserved regions, and conserved coding regions (see
Methods). Within weakly conserved regions, an average of 49% of
nucleotides are an identical match. Within strongly conserved regions,
89% of nucleotides match. In the first two nucleotide positions inside
codons in conserved coding regions, 92% of nucleotides match. In the
third, wobble, position 53% of the nucleotides match. Because, as
discussed below, the discrimination between coding and highly conserved
regions is not robust, for purposes of further analysis we lump the
coding regions in with the highly conserved regions. Of the 8,074,467 bases of C. briggsae sequence, 30.5% are in strongly
conserved regions, 28.5% are in weakly conserved regions, and 41.0%
are in nonaligning regions. Fifty-five percent of the highly conserved
regions are coding, 20% are in introns, and 25% are in intergenic
regions according to the gene predictions of the C. elegans
Sequencing Consortium (1998) . It is likely that many of the conserved
non-coding regions are regulatory in nature.
Compared with mammalian genomes in which ~50% of the bases are
part of repeating elements (Smit 1999 ), there are relatively few
repeating elements in C. elegans and C. briggsae.
Nineteen percent of the strongly conserved regions and 8% of the
weakly conserved C. briggsae regions align to more than one
place in the C. elegans genome. Table 4
presents a breakdown of the highly conserved multiply aligning regions.
The highly conserved intergenic and intronic sequences are less likely
to align to multiple places than the coding bases, indicating that the
repeating regions are dominated by duplicated genes. The region that
aligned in the most places, 32 in all, was a coding region of the major
sperm protein. The transposable element Tc1 aligned in 25 places. The majority of repeating regions, however, only align to two
or three places.
View this table:
[in this window]
[in a new window]
|
Table 4.
Distribution of Highly Conserved Regions Comparing the Number of Places
a Region Aligns vs. Whether the Region is Coding, Intronic,
or Intergenic
|
|
In the current implementation, WABA discriminates weakly between
strongly conserved and coding regions. In the regions of alignment, the
gene predictions of the C. elegans Sequencing Consortium (1998) classify 1,364,947 bases as coding, WABA classifies 1,845,169 bases as coding, and 1,192,602 bases are classified as coding by both
WABA and the gene predictions. If the gene predictions are correct,
then WABA has a false-negative rate of 13% and a false-positive rate
of 55%, which is not as good as the current state of the art in gene
prediction software. The goal of this project was merely to create a
sensitive alignment tool, not a gene-finding program. However, in the
future it would be interesting to extend the pair-HMM in such a way
that it can model codon bias in general and stop codons in particular
as well as account for wobble position variations. It is possible such
extensions would yield a program that would produce both sensitive
genomic alignments and reliable predictions of coding regions.
Conservation of Promoters
Figure 1 shows an Intronerator display of the unc-47 gene.
This is a fairly typical example of the patterns of conservation in
protein-coding genes between C. elegans and C. briggsae. Most of the coding regions are heavily conserved as is
the promoter region. Beyond the splice consensus sequences, most
introns are lightly conserved if at all. In this case the gene has been
cloned and sequenced (McIntire et al. 1997 ) and the promoter region
studied (Eastman et al. 1999 ).The conserved region upstream of the
first exon covers almost exactly the minimum promoter needed to drive reporter constructs in the same expression pattern as unc-47
(Eastman et al. 1999 ). Identification of conserved regions in
promoters will help biologists identify the minimal promoter sequence
necessary to drive the proper expression of a gene in functional studies.
Cross-species alignments will be important for dissection of regulatory
networks of transcription factors and their binding sites. Two other
very powerful techniques for dissecting promoter logic are clustering
expression patterns from DNA microarrays (Eisen et al. 1998 ) and
searching for motifs with tools like MEME (Bailey and Elkan 1995 ) at
http://www.sdsc.edu/MEME/meme/website and Improbizer at
http://www.cse.ucsc.edu/~kent/improbizer/index.html. Together, these
techniques can find cis-acting regions in the promoters of
coexpressed genes. Areas in the promoter region conserved in
cross-species alignments can serve as an independent confirmation that
these regions are biologically relevant or, alternatively, can serve to
reduce the size of the region that the motif searching algorithms must
consider. As more microarray data is publicly released, it will become
possible to apply these powerful bioinformatic approaches to
characterizing transcription factor binding sites in C. elegans.
Conservation of Splicing Regulatory Elements
In genes that are alternatively spliced, such as let-2
(Sibley et al. 1993 ), it is not unusual to find highly conserved areas in introns (Fig. 2). Presumably, these conserved
areas may be involved in the regulation of alternative splicing. There
are several blocks of conserved intron sequence in the alternatively spliced region of let-2. These are seen in the intron just
upstream of exon 9, in the intron sequence between exons 9 and 10, and in several blocks between exons 10 and 11. A poly-GT motif conserved in
the intron between exons 10 and 11 is rare in C. elegans
introns as a whole. Searching the Intronerator's Alternative Splicing Catalog (Kent and Zahler 2000 ) shows that this motif does occur frequently in introns near alternative splice sites (data not shown).
This poly-GT motif is also found near the region of alternative splicing of bli-4 and is conserved between C. briggsae
and C. elegans in that gene as well (Thacker et al. 1999 ).

View larger version (13K):
[in this window]
[in a new window]
|
Figure 2
Tracks display of let-2, a gene that is alternatively spliced
in a developmentally regulated manner. The C. briggsae
sequence is split between two overlapping cosmids. Note the relatively
high level of conservation in the introns compared with
unc-47. (The exons numbered 8 on the top and bottom splicing
diagrams are numbered 9 and 10 in the text and in Table 5. This
difference is due to Tracks display not including exons in the UTR
regions and numbering the exons in each isoform separately.)
|
|
Splicing regulatory elements are often found in exons as well. Because
exons tend to be highly conserved, these regulatory elements do not
stand out well in the high-level view. However, the base-by-base
alignment view can be used to look for sequence conservation in the
wobble positions inside coding regions, which only match 53% of the
time in the alignment as a whole. Table 5 shows a
base-by-base alignment of the let-2 mutually exclusive exons 9 and 10. Sixty-eight bases at the end of exon 10 are perfectly conserved. This includes 22 wobble-position nucleotides. The last 19 bases of this conserved region are all purines. Some members of the SR
protein splicing factors have been shown to bind directly to
purine-rich elements in exons to promote regulated splicing events
(Nagel et al. 1998 )
Intron/Exon Structure of Rarely Expressed Genes
Even though there are > 100,000 C. elegans EST sequences
in GenBank, only 9313 of 19,722 (47%) of the predicted C. elegans ORFs align with any ESTs. However, rarely expressed genes
as well as commonly expressed genes can be identified by their patterns of conservation, as illustrated in Figure 3. Here, a
single homologous region of C. briggsae covers two predicted
C. elegans genes. One prediction is supported by seven EST
sequences derived from four different cDNA clones. The other gene
prediction has no supporting EST sequences at all. However, both genes
are strongly supported by the pattern of conservation of predicted
exons in the homologous C. briggsae region.

View larger version (15K):
[in this window]
[in a new window]
|
Figure 3
C. elegans predicted genes K04F10.1 and K04F10.7. The K04F10.7
gene prediction is supported by a number of ESTs. K04F10.1 appears by
lack of EST coverage to be rarely expressed. The pattern of
conservation of K04F10.1 nonetheless provides strong support for this
gene prediction.
|
|
Introns Unique to One Species
Long insertions and deletions (indels) are common in the alignments.
For indels from 3 to 30 nucleotides long, indels of an integral
multiple of three were more common than other indels (data not shown).
As indicated in Table 6, 95% of the longer indels
occurred in regions of weak homology. Of the indels in regions of
strong homology, 72% were in coding regions. Seventy-three percent of
the indels in coding regions of strong homology, 263 in all, aligned
with GT ... AG ends on the region unique to one species. These
indels appear to correspond to introns present in one species but not
the other (Table 7). 263 unique introns is a
conservative number. There are many additional cases where it appears
that a codon or two were inserted or deleted along with the intron. In
addition, there are cases where the insert resembles an intron that has
since undergone a few point mutations in the flanking region. There
also may be introns inserted or deleted in weakly conserved regions and
noncoding regions as well. The figure of 263 unique introns does not
include any of these cases.
From an alignment between just two sequences it is not possible to tell
whether the introns are being gained, lost, or a mixture of both. There
is only a small amount of genomic sequence data available for other
members of the genus Caenorhabditis in GenBank and only a
single instance (Hansen and Pilgrim 1998 ) where this data covers a
coding indel in C. elegans and C. briggsae. In this case C. briggsae is missing a stretch of DNA that is present
in C. elegans and Caenorhabditis remanei. C. remanei is considered to be a closer relative of C. briggsae than of C. elegans (Blaxter et al. 1998 ). Thus,
this indel appears to represent an intron that was lost along with a
single additional codon sometime since C. briggsae and C. remanei diverged. In a previous study comparing sequences of two
large families of chemoreceptor genes in C. elegans and
C. briggsae (Robertson 1998 ), Robertson found a pervasive pattern of intron loss over recent evolutionary time. Therefore, in
all, it seems most likely that the majority of the introns unique to
one species represent intron loss in the other species. Robertson noted
a bias toward introns being lost in C. briggsae rather than
C. elegans. Assuming the unique introns we observe in this
larger study represent intron loss, 65% of the introns were lost from
C. briggsae (Table 6).
We tested the model that flanking the unique introns may be some
sequences associated with intron loss. The consensus sequence near the
splice site of unique introns is different from the consensus sequence
near the splice site of C. elegans introns as a whole (Table
8). In particular, the consensus of the AG before the 5' splice site is much stronger in unique introns, particularly in
C. briggsae. This and other changes to the consensus have the effect of strengthening the homology between the 3' and 5'
splice sites. However, the homology is only strengthened significantly in the two bases before, and the one base after, the splice site (Table
9). This homology is too short for homologous
recombination to be the mechanism for intron loss. However, the Ku
mechanism for repairing double-stranded breaks in DNA (Jeggo 1998 )
joins broken ends at regions of microhomology ( 3 bp). The Ku
mechanism deletes bases between the broken ends and the microhomology
region as well. Thus, it seems possible that introns may be lost during repair of double-stranded breaks in DNA.
Synteny
The large scale of this alignment also permitted us to observe the
level of short-range synteny between the two genomes. On average, a
C. briggsae sequence would only align for 8553 bases to one
part of the C. elegans genome before the best alignment shifted to another part of the C. elegans genome. Because the average size of a C. briggsae clone was only 35,722 bases,
approximately one-fourth of the alignments ended because the C. briggsae sequence ended. Thus, the average size of an aligning
segment is actually larger. Figure 4 depicts the
number of fragments clones are broken into by the alignment. Curiously,
this quantity exhibits a bimodal distribution, with peaks at one
fragment per clone and three fragments per clone. The first peak
indicates that ~40% of the genome is resistant to rearrangment,
going longer between rearrangement events than the 35,722-bp size of
the C. elegans clones. The other 60% of the genome appears to
be susceptible to rearrangements, with rearrangement events occurring
approximately every 4000 bases. Because the average length between
genes in C. elegans is ~5000 nucleotides, this is
consistent with arrangement occurring freely in regions between genes
in the 60% of the genome that is susceptible to rearrangement.

View larger version (15K):
[in this window]
[in a new window]
|
Figure 4
Synteny summary graph. The horizontal axis is how many pieces a C. briggsae clone needed to be broken into for maximal alignment with
the C. elegans genome. The vertical axis is the number of
clones broken into that many pieces. There were 229 clones, and the
average clone size was 34,722 bases. Clones were broken up if the best
aligning region of one 2000-base region of the C. briggsae
clone aligned to a spot 50,000 bases or more away or on a different
chromosome from where the previous 2000-base region of the C. briggsae clone aligned best. The 2000 base regions overlapped by
1000 bases. The average length of a broken-up region was 8553 bases.
The bimodal distribution suggests that ~40% of the genome is
resistant to rearrangements and the rest is not.
|
|
The 40% of the genome resistant to rearrangement may reflect regions
that are transcriptionally coregulated. The largest single alignment is
part of the homeobox complex. Genes that are part of operons (Spieth et
al. 1993 ) or that share an enhancer would also seem likely to be
resistant to rearrangement. In part, the length of an alignment
reflects the position on the chromosome. Eight of the 10 longest
alignments, 15 of the 20 longest alignments, 36 of the 50 longest
alignments, and 63 of the 100 longest alignments are found in the
gene-rich clusters near the middle of chromosomes. The chromosome arms
appear to be more susceptible to rearrangement. One cosmid, G46J06,
aligns in nine widely separated pieces on chromosomes I, II, III, and
IV. None of the G46J06 alignments are in the gene-rich cluster regions.
Conclusions
We developed a new algorithm for aligning genomic DNA. We used the
algorithm to perform a large-scale alignment between C. briggsae and C. elegans genomic sequences. The alignment
confirms that patterns of conservation can be useful in identifying
regulatory regions and rarely expressed coding regions. The alignment,
including C. briggsae homologs of 154 cloned C. elegans genes and many times this number of largely uncharacterized
ORFs, can be viewed and searched in the Intronerator. The detailed
alignment display can help identify conserved regulatory elements even
inside exons. There is a bimodal distribution on the size of syntenic
regions. Over 250 introns are present in one species but not the other, within the ~10% of the C. briggsae genome sequenced.
 |
METHODS |
Sources of Sequence Data
The genomic sequence data for C. elegans came from the
C. elegans Sequencing Consortium (1998) downloaded from
http://www.sanger.co.uk/Projects/C_elegans/chromosomes.shtml on
August 4, 1999. The C. briggsae genomic sequence data came from the Washington University site
ftp://genome.wustl.edu/pub/gsc1/sequence/st.louis/briggsae/finish/ on
October 6, 1999. Genomic sequences from other organisms and all the
mRNA sequences were obtained from GenBank (Benson et al. 1999 ) at
http://www.ncbi.nlm.nih_.gov/Entrez/ on December 4, 1999.
Cross-species DNA Alignment Algorithms
The large scale of the C. briggsae and C. elegans
genomic alignment required a relatively fast algorithm. Yet the
algorithm also needed to handle both small and large insertions and
deletions and deal with the statistical challenges presented by the
AT-richness of the Caenorhabditis genome. To cope with these
somewhat conflicting concerns, we broke the alignment problem into
three passes: a fast program that took overlapping 2000 base regions of
C. briggsae and scanned the C. elegans genome for
homologous regions, a program that aligned the homologous regions in a
very detailed manner, and a program that stitched together the
overlapping alignments.
The fastest DNA alignment algorithms, such as BLAST (Altschul et al.
1990 ), demand areas that match perfectly to serve as seeds for the
alignment. Although such approaches work well for searching EST
databases, they break down when comparing DNA sequences between
species. Although coding regions of DNA are fairly well conserved in
the first and second codon positions, the third, wobble, position is
largely redundant in the genetic code and tends to diverge freely. To
cope with this we developed an algorithm that worked on the general
principles of gapped BLAST (Altschul et al. 1997 ) but that, rather than
requiring six consecutive nucleotides to match perfectly to seed an
alignment, required six nucleotides spread out in the pattern XXoXXoXX
(where the X's must match but the o's need not) to serve as the seed.
We call this algorithm WABA. There are three passes to the WABA algorithm.
The first pass of WABA works in the following fashion: The C. elegans DNA (the target sequence) is packed so that each 16-bit word contains 8 nucleotides. This serves two purposes to reduce the
amount of RAM required to store the genome and to allow multiple nucleotides to be compared in a single computer operation. For each
section of the C. briggsae sequence (the query sequence) we do
the following: An array that contains an entry for all possible 8-nucleotide sequences (8-mers) is allocated. The query sequence is
then scanned, and each 8-mer is checked and possibly added to the
array. Note that these 8-mers in the query sequence overlap each other
by 7 bases. Eight-mers are rejected if they contain ambiguous
nucleotides such as N or if they could match degenerate repeating DNA
sequences of period 4 or less. (This eliminates 8-mers such as
AAAAAAAA, ACACACAC, ACGACGAC, and ACGTACGT.) The 8-mers are packed into
a 16-bit word, which is then binary logical anded with the wobble mask
(hexadecimal 0xF3CF) so that the third and sixth nucleotides are
ignored. The packed, masked 8-mer is then added to the corresponding
entry in the array that stores the position of the 8-mer in the query
sequence as well. Then, the packed genome is scanned. Each packed word
of the genome is "anded" with the wobble mask, and the result is
used to index the array. If the corresponding array entry is nonempty,
then the current genome position and the position of the corresponding query 8-mer are noted in a hit list. Because the hit list grows quite
quickly, it is periodically scanned and nonpromising hits are
eliminated. If a hit occurs within 1000 bases on the target sequence of
another hit and the positions of the hits in both query and target
indicate that they could be part of a homologous region lacking inserts
(which is simply checked by subtracting from query offsets of both hits
to get a "diagonal" offset and then checking that the diagonal
offsets are identical), it is considered promising. In this way, WABA
scans the entire genome and assembles a list of hits. The hits are then
clumped together. A clump starts with a single hit. Each hit within 48 nucleotides in both target and query and with the same diagonal offset
is added to a clump. As hits are added to a clump, the boundaries of
the clump are extended to include the new hits. The clumps are then
scored by the square of the number of hits in the clump, and the clumps
scoring <25% of the highest score are eliminated. Because the
distribution of scores is highly peaked, this saves processing in later
steps in the common scenario where there are one or a few areas of
strong homology and many areas of weak homology. The remaining clumps
are scored for the best local alignment without inserts between the
full query sequence and the corresponding part of the target sequence.
Because the worm genome is AT rich, AT matches are scored as 2, whereas
GC matches are scored as 3. Mismatches are scored as 4. The bottom
scoring 25% of the alignments without inserts are dropped, and the
remaining alignments are stored for the next pass.
Though the first pass is quite fast, it has no provision for insertions
and deletions. The first pass merely identifies regions of the target
homologous to the 2000-base query sequence. A 5000-base window of the
target sequence centered around the first pass aligment is then aligned
with a slower algorithm that can cope with insertions and deletions.
The Smith-Waterman algorithm (Smith and Waterman 1981 ) and extensions
of it to allow affine gap scores (Pearson and Miller 1992 ) would serve
this purpose reasonably well. However, even affine gap penalties tend
to penalize long gaps excessively in genomic and cDNA alignments.
Furthermore, we wanted our slow, detailed aligner to be at least as
wobble-aware as the fast front end. HMMs have proven to be very useful
in predicting gene structure (Kulp et al. 1996 ). Pair HMMs of three
(hidden) states can provide a conceptually simple and effective
implementation of alignment with affine gap costs (Durbin et al. 1998 ).
This motivated us to design a seven state pair HMM to do the detailed
alignment (Fig. 5). One state captures long inserts
in the target sequence, one state captures long inserts in the query
sequence, one state captures highly conserved regions (~90% base
similarity), one state captures lightly conserved regions (~50%
base similarity), and three states capture coding regions. As the HMM
matches a nucleotide pair in one coding state, it shifts to the next
coding state. Match and mismatch scores for the first two coding states are very similar to match and mismatch scores for the highly conserved regions. Match and mismatch scores for the third coding state are lower
and favor mismatches between C and T or G and A over other mismatches.
The heavily and lightly conserved and coding states all allow inserts
within the state with a gap penalty that is approximately equivalent to
eight mismatches. To transition into a long gap state invokes a penalty
equivalent to roughly 12 mismatches. To continue a gap in a long gap
state costs a penalty that is only about one-fifth of a mismatch cost.
The result is an algorithm that penalizes 1-base gaps by 8, 2-base gaps
by 12, and longer gaps by 12 plus one-fifth of their length. A 60-bp gap is only penalized twice as much as a 2-bp gap. This is a fair approximation of the distribution of gap sizes in genomic alignments (data not shown) that is not as computationally expensive as more elaborate gap penalty schemes (Gotoh 1990 ). The matches are also somewhat more sensitive than the usual practice of treating all nucleotide matches equivalently. Because the worm genome is 61% A/T,
we weighed G/C matches more heavily than A/T matches. The various
match, mismatch, and gap scores were chosen first by an educated guess
of the equivalent probabilities. The resulting HMM was used to align a
substantial fraction of the available data. The distributions of
matches, mismatches, and gaps in this first alignment were used to
refine the scores for a second alignment over the entire data set. This
provided one step of parameter optimization by expectation maximization
(Durbin et al. 1998 ). The result is a robust and sensitive alignment
algorithm. It also produces predictions as to whether a region is
coding or not and a score that can be used to compare the significance
of different alignments. We empirically set the scoring threshold for
an alignment to pass onto the final pass of WABA to a number that just
excluded a large number of 40-bp alignments that match primarily in
poly-A and poly-T regions. This is necessary because the worm genome is
very rich in these regions. The oligonucleotide AAAA occurs 7.5 times
as often as would be expected from a random distribution of nucleotides.

View larger version (25K):
[in this window]
[in a new window]
|
Figure 5
Finite state machine depiction of seven state aligner. This is a
pairwise HMM. Transitions between states are associated either with
aligning a pair of nucleotides (solid arrows) or aligning a single
nucleotide with a gap (broken line arrows). The most likely transition
out of each state is shown by a thick arrow. Thin arrows represent
possible, but unlikely, transitions. In coding regions, the most likely
path will typically cycle between coding states 1, 2, and 3. In
strongly conserved regions, the most likely path will typically stay in
the high conservation state. In weakly conserved regions, the most
likely path will typically stay in the low conservation state. Short
inserts follow the broken line arrows in the coding and conserved
states. Long inserts transition into the query or target insert states.
|
|
The pairwise HMM that is the second pass of WABA is implemented in
optimized C code. Nonetheless, because the run time increases with the
product of the length of the two sequences that it is aligning, it is
not efficient to use it for long alignments. This is the primary
motivation for breaking the query sequence into 2000-bp blocks that
overlap each other by 1000 bp. This necessitates a third pass to scan
the resulting short alignments and, when possible, merge them into
longer alignments. This last pass works by looking for alignments that
overlap in both query and target. Overlapping areas are scanned for
regions of at least 15 nucleotides where the alignments are identical
and are joined if such a region is found. After this joining, the
alignment is complete.
 |
ACKNOWLEDGMENTS |
We thank David Haussler for his sage advice on HMMs, David Kulp for
making Genie predictions for the Intronerator, and members of the
Zahler, Jin, and Chisholm labs and Manny Ares for helpful discussions.
Thanks to David Haussler, David Hoffman, and Jane Silverthorne for
critical reading of this manuscript. We are indebted to all the various
sequencing projects for providing sequence data. This work was
supported by grant 1R01GM52848 from the National Institutes of Health
to A.M.Z. and a training grant from the University of California
Biotechnology Program.
The publication costs of this article were defrayed in part by payment
of page charges. This article must therefore be hereby marked
"advertisement" in accordance with 18 USC section 1734 solely to
indicate this fact.
 |
FOOTNOTES |
1
Corresponding author.
E-MAIL kent{at}biology.ucsc.edu; FAX (831) 459-3737.
 |
REFERENCES |
-
Altschul, S.F.,
Gish, W.,
Miller, W.,
Myers, E.W., and
Lipman, D.J.
1990.
Basic local alignment search tool.
J. Mol. Biol.
215:
403-410[CrossRef][Medline].
-
Altschul, S.F.,
Madden, T.L.,
Schaffer, A.A.,
Zhang, J.,
Zhang, Z.,
Miller, W., and
Lipman, D.J.
1997.
Gapped BLAST and PSI-BLAST: A new generation of protein database search programs.
Nucleic Acids Res.
25:
3389-3402[Abstract/Free Full Text].
-
Ayala, F.J.,
Barrio, E., and
Kwiatowski, J.
1996.
Molecular clock or erratic evolution? A tale of two genes.
Proc. Natl. Acad. Sci.
93:
11729-11734[Abstract/Free Full Text].
-
Bailey, T.L. and
Elkan, C.
1995.
Unsupervised learning of multiple motifs in biopolymers using EM.
Mach. Learn.
21:
51-80.
-
Baldwin, J.G.,
Frisse, L.M.,
Vida, J.T.,
Eddleman, C.D., and
Thomas, W.K.
1997.
An evolutionary framework for the study of developmental evolution in a set of nematodes related to Caenorhabditis elegans.
Mol. Phylogenet. Evol.
8:
249-259[CrossRef][Medline].
-
Benson, D.A.,
Boguski, M.S.,
Lipman, D.J.,
Ostell, J.,
Ouellette, B.F.,
Rapp, B.A., and
Wheeler, D.L.
1999.
GenBank.
Nucleic Acids Res.
27:
12-17[Abstract/Free Full Text].
-
Blaxter, M.
1998.
Caenorhabditis elegans is a nematode.
Science
282:
2041-2046[Abstract/Free Full Text].
-
Blaxter, M.L.,
De Ley, P.,
Garey, J.R.,
Liu, L.X.,
Scheldeman, P.,
Vierstraete, A.,
Vanfleteren, J.R.,
Mackey, L.Y.,
Dorris, M.,
Frisse, L.M.
1998.
A molecular evolutionary framework for the phylum Nematoda.
Nature
392:
71-75[CrossRef][Medline].
-
Boeddrich, A.,
Burgtorf, C.,
Roest Crollius, H.,
Hennig, S.,
Bernot, A.,
Clark, M.,
Reinhardt, R.,
Lehrach, H., and
Francis, F.
1999.
Analysis of the spermine synthase gene region in Fugu rubripes, Tetraodon fluviatilis, and Danio rerio.
Genomics
57:
164-168[CrossRef][Medline].
-
C. elegans Sequencing Consortium.
1998.
Genome sequence of the nematode C. elegans: A platform for investigating biology.
Science
282:
2012-2018[Abstract/Free Full Text].
-
Durbin, R.E.,
Eddy, S.,
Krogh, A., and
Mitchison, G.
1998.
Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK.
-
Eastman, C.,
Horvitz, H.R., and
Jin, Y.
1999.
Coordinated transcriptional regulation of the unc-25 glutamic acid decarboxylase and the unc-47 GABA vesicular transporter by the Caenorhabditis elegans UNC-30 homeodomain protein.
J. Neurosci.
19:
6225-6234[Abstract/Free Full Text].
-
Eisen, M.B.,
Spellman, P.T.,
Brown, P.O., and
Botstein, D.
1998.
Cluster analysis and display of genome-wide expression patterns.
Proc. Natl. Acad. Sci.
95:
14863-14868[Abstract/Free Full Text].
-
Endrizzi, M.,
Huang, S.,
Scharf, J.M.,
Kelter, A.R.,
Wirth, B.,
Kunkel, L.M.,
Miller, W., and
Dietrich, W.F.
1999.
Comparative sequence analysis of the mouse and human Lgn1/SMA interval.
Genomics
60:
137-151[CrossRef][Medline].
-
Gotoh, O.
1990.
Optimal sequence alignment allowing for long gaps.
Bull. Math. Biol.
52:
359-373[Medline].
-
Hansen, D. and
Pilgrim, D.
1998.
Molecular evolution of a sex determination protein. FEM-2 (pp2c) in Caenorhabditis.
Genetics
149:
1353-1362[Abstract/Free Full Text].
-
Hardison, R.C.,
Oeltjen, J., and
Miller, W.
1997.
Long human-mouse sequence alignments reveal novel regulatory elements: A reason to sequence the mouse genome.
Genome Res.
7:
959-966[Free Full Text].
-
Jareborg, N.,
Birney, E., and
Durbin, R.
1999.
Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs.
Genome Res.
9:
815-824[Abstract/Free Full Text].
-
Jeggo, P.A.
1998.
DNA breakage and repair.
Adv. Genet.
38:
185-218[Medline].
-
Kent, W.J. and
Zahler, A.M.
2000.
The Intronerator: Exploring introns and alternative splicing in C. elegans.
Nucleic Acids Res.
28:
91-93[Abstract/Free Full Text].
-
Kulp, D.,
Haussler, D.,
Reese, M.G., and
Eeckman, F.H.
1996.
A generalized hidden Markov model for the recognition of human genes in DNA.
ISMB
4:
134-142.
-
  . 1997. Integrating database homology in a probabilistic gene
structure model. Pac. Symp. Biocomput. 232-244. -
Makalowski, W. and
Boguski, M.S.
1998.
Evolutionary parameters of the transcribed mammalian genome: An analysis of 2,820 orthologous rodent and human sequences.
Proc. Natl. Acad. Sci.
95:
9407-9412[Abstract/Free Full Text].
-
Makalowski, W.,
Zhang, J., and
Boguski, M.S.
1996.
Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences.
Genome Res.
6:
846-857[Abstract/Free Full Text].
-
McIntire, S.L.,
Reimer, R.J.,
Schuske, K.,
Edwards, R.H., and
Jorgensen, E.M.
1997.
Identification and characterization of the vesicular GABA transporter.
Nature
389:
870-876[CrossRef][Medline].
-
Nagel, R.J.,
Lancaster, A.M., and
Zahler, A.M.
1998.
Specific binding of an exonic splicing enhancer by the pre-mRNA splicing factor SRp55.
RNA
4:
11-23[Abstract].
-
O'Brien, S.J.,
Menotti-Raymond, M.,
Murphy, W.J.,
Nash, W.G.,
Wienberg, J.,
Stanyon, R.,
Copeland, N.G.,
Jenkins, N.A.,
Womack, J.E., and
Marshall Graves, J.A.
1999.
The promise of comparative genomics in mammals.
Science
286:
458-462[Abstract/Free Full Text]; 479-481.
-
Pearson, W.R. and
Miller, W.
1992.
Dynamic programming algorithms for biological sequence comparison.
Methods Enzymol.
210:
575-601[Medline].
-
Reese, M.G.,
Eeckman, F.H.,
Kulp, D., and
Haussler, D.
1997.
Improved splice site detection in Genie.
J. Comput. Biol.
4:
311-323[Medline].
-
Robertson, H.M.
1998.
Two large families of chemoreceptor genes in the nematodes Caenorhabditis elegans and Caenorhabditis briggsae reveal extensive gene duplication, diversification, movement, and intron loss.
Genome Res.
8:
449-463[Abstract/Free Full Text].
-
Shabalina, S.A. and
Kondrashov, A.S.
1999.
Pattern of selective constraint in C. elegans and C. briggsae genomes.
Genet. Res.
74:
23-30[CrossRef][Medline].
-
Sibley, M.H.,
Johnson, J.J.,
Mello, C.C., and
Kramer, J.M.
1993.
Genetic identification, sequence, and alternative splicing of the Caenorhabditis elegans alpha 2(IV) collagen gene.
J. Cell Biol.
123:
255-264[Abstract/Free Full Text].
-
Smit, A.F.
1999.
Interspersed repeats and other mementos of transposable elements in mammalian genomes.
Curr. Opin. Genet. Dev.
9:
657-663[CrossRef][Medline].
-
Smith, T.F. and
Waterman, M.S.
1981.
Identification of common molecular subsequences.
J. Mol. Biol.
147:
195-197[CrossRef][Medline].
-
Spieth, J.,
Brooke, G.,
Kuersten, S.,
Lea, K., and
Blumenthal, T.
1993.
Operons in C. elegans: Polycistronic mRNA precursors are processed by trans-splicing of SL2 to downstream coding regions.
Cell
73:
521-532[CrossRef][Medline].
-
Thacker, C.,
Marra, M.A.,
Jones, A.,
Baillie, D.L., and
Rose, A.M.
1999.
Functional genomics in Caenorhabditis elegans: An approach involving comparisons of sequences from related nematodes.
Genome Res.
9:
348-359[Abstract/Free Full Text].
-
Voronov, D.A.,
Panchin, Y.V., and
Spiridonov, S.E.
1998.
Nematode phylogeny and embryology.
Nature
395:
28[CrossRef][Medline].
-
Wilson, W.A.,
Harrington, S.E.,
Woodman, W.L.,
Lee, M.,
Sorrells, M.E., and
McCouch, S.R.
1999.
Inferences on the genome structure of progenitor maize through comparative analysis of rice, maize and the domesticated panicoids.
Genetics
153:
453-473[Abstract/Free Full Text].
Received January 27, 2000; accepted in revised form June 2, 2000.
10:1115-1125 ©2000 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/00 $5.00

CiteULike Connotea Del.icio.us Digg Reddit Technorati What's this?
This article has been cited by other articles:

|
 |

|
 |
 
M. Irimia, J. L. Rukov, D. Penny, J. Garcia-Fernandez, J. Vinther, and S. W. Roy
Widespread Evolutionary Conservation of Alternatively Spliced Exons in Caenorhabditis
Mol. Biol. Evol.,
February 1, 2008;
25(2):
375 - 382.
[Abstract]
[Full Text]
[PDF]
|
 |
|

|
 |

|
 |
 
H. He, J. Wang, T. Liu, X. S. Liu, T. Li, Y. Wang, Z. Qian, H. Zheng, X. Zhu, T. Wu, et al.
Mapping the C. elegans noncoding transcriptome with a whole-genome tiling microarray
Genome Res.,
October 1, 2007;
17(10):
1471 - 1477.
[Abstract]
[Full Text]
[PDF]
|
 |
|

|
 |

|
 |
 
J. H. Graber, J. Salisbury, L. N. Hutchins, and T. Blumenthal
C. elegans sequences that control trans-splicing and operon pre-mRNA processing
RNA,
September 1, 2007;
13(9):
1409 - 1426.
[Abstract]
[Full Text]
[PDF]
|
 |
|

|
 |

|
 |
 
A. Vishnoi, R. Roy, and A. Bhattacharya
Comparative analysis of bacterial genomes: identification of divergent regions in mycobacterial strains using an anchor-based approach
Nucleic Acids Res.,
June 28, 2007;
35(11):
3654 - 3667.
[Abstract]
[Full Text]
[PDF]
|
 |
|

|
 |

|
 |
 
J. L. Rukov, M. Irimia, S. Mork, V. K. Lund, J. Vinther, and P. Arctander
High Qualitative and Quantitative Conservation of Alternative Splicing in Caenorhabditis elegans and Caenorhabditis briggsae
Mol. Biol. Evol.,
April 1, 2007;
24(4):
909 - 917.
[Abstract]
[Full Text]
[PDF]
|
 |
|

|
 |

|
 |
 
A.-F. Ruaud and J.-L. Bessereau
The P-type ATPase CATP-1 is a novel regulator of C. elegans developmental timing that acts independently of its predicted pump function
Development,
March 1, 2007;
134(5):
867 - 879.
[Abstract]
[Full Text]
[PDF]
|
 |
|

|
 |

|
 |
 
S. W. Roy and D. Penny
Patterns of Intron Loss and Gain in Plants: Intron Loss-Dominated Evolution and Genome-Wide Comparison of O. sativa and A. thaliana
Mol. Biol. Evol.,
January 1, 2007;
24(1):
171 - 181.
[Abstract]
[Full Text]
[PDF]
|
 |
|

|
 |

|
 |
 
S. W. Roy and D. Penny
Smoke Without Fire: Most Reported Cases of Intron Gain in Nematodes Instead Reflect Intron Losses
Mol. Biol. Evol.,
December 1, 2006;
23(12):
2259 - 2262.
[Abstract]
[Full Text]
[PDF]
|
 |
|

|
 |

|
 |
 
S. W. Roy and D. Penny
Large-scale intron conservation and order-of-magnitude variation in intron loss/gain rates in apicomplexan evolution
Genome Res.,
October 1, 2006;
16(10):
1270 - 1275.
[Abstract]
[Full Text]
[PDF]
|
 |
|
|