Vol. 11, Issue 7, 1175-1186, July 2001
Surveying Saccharomyces Genomes to Identify Functional Elements by Comparative DNA Sequence Analysis
1 Department of Genetics and 2 Genome Sequencing Center, Washington University School of Medicine, St. Louis, Missouri 63110, USA
Comparative sequence analysis has facilitated the discovery of protein coding genes and important functional sequences within proteins, but has been less useful for identifying functional sequence elements in nonprotein-coding DNA because the relatively rapid rate of change of nonprotein-coding sequences and the relative simplicity of non-coding regulatory sequence elements necessitate the comparison of sequences of relatively closely related species. We tested the use of comparative DNA sequence analysis to aid identification of promoter regulatory elements, nonprotein-coding RNA genes, and small protein-coding genes by surveying random DNA sequences of several Saccharomyces yeast species, with the goal of learning which species are best suited for comparisons with S. cerevisiae. We also determined the DNA sequence of a few specific promoters and RNA genes of several Saccharomyces species to determine the degree of conservation of known functional elements within the genome. Our results lead us to conclude that comparative DNA sequence analysis will enable identification of functionally conserved elements within the yeast genome, and suggest a path for obtaining this information.
Identifying functional elements in DNA sequence is a significant challenge. It is difficult enough to predict correctly protein-coding genes; an even greater challenge is to identify functional sequences that do not code for protein, such as sequences regulating gene expression, sequences governing chromosome replication, structure and stability, and sequences of nonprotein-coding RNAs. This task is complicated by the diverse nature of these sequence elements. For example, sequences that regulate gene expression are usually short, often independent of orientation, and can reside at varying distances from their target gene. Genes encoding RNAs can be difficult to identify because they contain few hallmarks and because their folded structure, rather than their primary sequence, dictates their function. Because functional sequences are maintained in evolution, they often
can be recognized by their conservation among different organisms
(Tagle et al. 1988 The Saccharomyces genus is composed of three subgroups
(Barnett 1992
Sequence Summary of Saccharomyces Species
We obtained sequence from ~1000 genomic DNA clones from each of
four species of the sensu stricto group (Saccharomyces
bayanus, Saccharomyces paradoxus, Saccharomyces
cariocanus, and Saccharomyces mikatae) (Naumov et al.
2000
Comparisons of the nucleotide sequences (using BLASTN, Table 1) and inferred protein sequences (using BLASTX, Table 2) to S. cerevisiae sequence are instructive for determining the relative genetic distance of the species from S. cerevisiae and for identifying which species are likely to yield the most data in sequence comparisons. Most (94.5%-98.2%) of the sensu stricto sequences align to S. cerevisiae coding or noncoding sequence using BLASTN. As expected, the sequences are most similar to S. cerevisiae DNA sequence in protein-coding genes, but the sequences of S. paradoxus and S. cariocanus are also highly conserved within nonprotein-coding regions. S. mikatae and S. bayanus have lower similarity and yield significantly fewer alignments to nonprotein-coding regions of the genome using the same BLASTN parameters (Table 1). The sensu lato and petite-negative species seem to lie a similar
evolutionary distance from S. cerevisiae, because the average identities of their protein-coding DNA (Table 1) and predicted proteins
(Table 2) to S. cerevisiae
sequences are similar for each sensu lato species. These species are
more diverged from S. cerevisiae than we expected from
ribosomal RNA comparisons. Many presumably orthologous DNA sequences
(identified by aligning the protein-coding sequences using
BLASTX) do not align to S. cerevisiae sequence
with BLASTN, and many of those that do align are near the
lower limits of detection by BLASTN.
Although most of the proteins of the sensu stricto group are >80% identical to their S. cerevisiae homologs, 29 alignments show <50% identity (Fig. 1A). These low similarity alignments may indicate rapidly evolving proteins or genes that duplicated in other Saccharomyces species (but not in S. cerevisiae), allowing one copy to diverge from its S. cerevisiae ortholog. A large number of proteins of the sensu lato species (>30%) are <50% identical to their S. cerevisiae homologs (Fig. 1B), emphasizing the substantial divergence of these species from S. cerevisiae.
We noted many discrepancies in open reading frame (ORF) boundaries in the different sensu stricto species (data not shown). Some of these are likely caused by sequencing errors (most likely in our sequence, but potentially in the S. cerevisiae sequence); some may be attributable to real DNA sequence differences between these closely related species. ORF length polymorphisms are even more common in the sensu lato alignments, which is not surprising considering the relatively greater divergence of these species from S. cerevisiae.
Species-Specific Sequences
A small percentage of the sequences of sensu stricto species have no
significant similarity to any S. cerevisiae sequence at the
nucleotide or amino acid levels, regardless of the BLAST parameters that were used. Several of these sequences are predicted to
encode proteins similar to those found in other species. For example,
we identified a Ty5 transposon protein that is unique to S. paradoxus, and a gene in S. bayanus and S. cariocanus predicted to encode a protein related to an amidase of
S. pombe. Many sensu lato and petite-negative sequences are
predicted to encode proteins similar to those in species other than
S. cerevisiae (Table 3). Similar
findings have been made in Kluyveromyces lactis
(Ozier-Kalogeropoulos et al. 1998
To verify that these DNA sequences are specific to certain Saccharomyces species, we designed oligonucleotide primers to the amidase-encoding gene identified in S. cariocanus, and to two other unique sequences confined to S. mikatae and S. paradoxus that do not appear to encode protein. The sequences were amplified by PCR from genomic DNA of the corresponding species and produced products of the expected size. (Three other species-specific sequences produced PCR products larger than expected.) The amidase PCR product from S. cariocanus hybridized to a specific DNA fragment of S. cariocanus and S. paradoxus DNA in Southern blots (data not shown). The S. mikatae-specific sequence hybridized to genomic DNA from S. mikatae, but not to genomic DNA from the other sensu stricto species. Because we determined the DNA sequence of only about 1/32nd of the four sensu stricto species, it is likely that several more Saccharomyces sequences not present in S. cerevisiae remain to be identified.
Identification of Small Protein-Coding Genes
Some of the intergenic regions in the S. cerevisiae genome
likely encode small proteins (<100 amino acids) that have not been annotated. Comparisons of S. cerevisiae intergenic sequences
to those of the other Saccharomyces species using
TBLASTX revealed many potential protein-coding sequences.
Such comparisons of the sensu stricto sequences are not very
informative because their relatively high degree of similarity to
S. cerevisiae sequence produces many spurious alignments, but
the sequences of the sensu lato and petite-negative species are well
suited for identification of small ORFs (smORFs, <100 codons) because
their significant divergence from S. cerevisiae sequence
yields relatively few TBLASTX alignments outside of
protein-coding sequences. Most of the high-scoring TBLASTX
alignments to S. cerevisiae sequence are extensions or fusions
of known ORFs, but we identified 11 alignments of smORFs that
potentially encode a protein (TBLASTX P value
<1.0e-05; Table 4). Two of the smORFs were
predicted previously by searching the S. cerevisiae genome for
transcripts that originate from large intergenic regions of the genome
(Olivas et al. 1997
Identifying Functional Non-Protein-Coding Sequences
Our primary goal is to use sequence conservation to identify functional nonprotein-coding sequences. However, much of the similarity of the nonprotein-coding sequences of the sensu stricto species is likely caused simply by an inadequate amount of evolutionary time for the accumulation of sequence changes. Which species are sufficiently diverged so that functional nonprotein-coding sequences are apparent? How many different sequences need to be compared to reveal regions of sequence similarity that are functionally significant? We began to address these questions by searching for nonprotein-coding RNA genes and promoter regulatory sequences in nonprotein-coding portions of the S. cerevisiae genome.
Identification of Non-Protein-Coding RNAs
Many highly conserved sequences in intergenic regions were
identified that are unlikely to encode proteins because no ORF was
apparent in the S. cerevisiae genome sequence. In the few cases where we obtained sequences from more than one species that are
similar to the same S. cerevisiae sequence, the conserved sequence elements readily stand out from surrounding sequences. Because there are only a few regions of the genome for which we have
sequences from multiple species, we determined the DNA sequence of
a few genes encoding nonprotein-coding RNAs from many different Saccharomyces species to compare how quickly they diverge
across the genus, and to help us decide which Saccharomyces
species are best suited for identifying these genes by comparative
sequence analyses. We amplified the sequences by PCR using
oligonucleotide primers chosen from S. cerevisiae flanking
sequences that seemed likely to be conserved. We determined the
sequence of SNR39, encoding a C/D box snoRNA required for
methylation of the 25s ribosomal subunit (Kiss-Laszlo et al. 1996 All eight SNR39 sequences align well using CLUSTALW, with 58% of the 93 nucleotides of SNR39 being conserved among eight Saccharomyces species. The C and D boxes as well as the guide sequence are conserved perfectly throughout the genus (Fig. 2). The gene is even readily distinguishable from surrounding intron sequence in most two-way alignments with S. cerevisiae sequence, although it is indistinguishable in an alignment of S. cerevisiae sequence to S. paradoxus (too similar), and is difficult to discern in an alignment to S. exiguus sequence (too diverged).
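As a rough illustration of how a figure such as "58% of the 93 nucleotides conserved" can be derived from a multiple alignment, the following Python sketch marks alignment columns in which every sequence carries the same nucleotide and then reports runs of such columns. The toy alignment and the function names are hypothetical, not data or software from this study. Taking the percentage over the ungapped length of the first (reference) sequence is an assumption about how such a number would be computed.

```python
def conserved_columns(aligned):
    """Return one boolean per alignment column: True when every sequence
    has the same non-gap nucleotide in that column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    flags = []
    for column in zip(*aligned):
        first = column[0].upper()
        flags.append(first != "-" and all(c.upper() == first for c in column))
    return flags


def conserved_blocks(flags, min_len=5):
    """Return (start, end) half-open column ranges of conserved runs
    that are at least min_len columns long."""
    blocks, start = [], None
    for i, flag in enumerate(flags + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                blocks.append((start, i))
            start = None
    return blocks


def percent_conserved(aligned):
    """Percent of reference (first sequence) nucleotides in conserved columns."""
    flags = conserved_columns(aligned)
    ref_len = sum(1 for c in aligned[0] if c != "-")
    return 100.0 * sum(flags) / ref_len


# Hypothetical toy alignment (not SNR39 data).
toy = [
    "ATGATGACGT-ACCTGA",
    "ATGATGTCGTTACCTGA",
    "ATGATGACGT-ACCTGA",
]
flags = conserved_columns(toy)
print(conserved_blocks(flags, min_len=5))
print(round(percent_conserved(toy), 1))
```

In a two-way alignment of very similar sequences nearly every column is flagged, so the blocks merge into the background. This is the "too similar" case described above for S. paradoxus.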
The SNR44 gene (encoding an H/ACA snoRNA) sequences also align using CLUSTALW, but are less well conserved than SNR39 (Fig. 3). The alignment reveals four highly conserved blocks, ranging from 5 to 15 nucleotides in length, which, surprisingly, do not include the ACA sequence (AAA in the sensu stricto species) or the H-box (Fig. 3), known functional elements in this snoRNA
Identification of Gene Regulatory Sequences
We expect potential gene regulatory sequences to be manifested as
short blocks of sequence similarity in intergenic regions of the
genome. These are difficult to recognize in the random DNA sequences of
the Saccharomyces species for two reasons. On one hand, the
sequences of the sensu stricto species are usually so similar to
S. cerevisiae sequence that few isolated runs of identical
nucleotides are found. In rare cases where we can align DNA sequence of
two or more species, the background sequence similarity is reduced and
potential regulatory elements begin to stand out (e.g., Fig.
4), but these alignments never extend over
the entire promoter. On the other hand, the DNA sequences of the sensu
lato species are generally too different from S. cerevisiae
sequence to align with local alignment algorithms (such as
BLASTN or BESTFIT). Some sensu lato sequences
can be anchored to their presumed S. cerevisiae orthologs
using BLASTX, then extended into the promoter region.
Of the 4296 sensu lato sequences that align to S. cerevisiae
proteins with BLASTX, 866 extend >30 nucleotides into the
promoter region. We were able to align only 398 of them to their
S. cerevisiae ortholog (using BLASTN, with a word
length of five), and only 111 of the alignments extend >100 bp into
the promoter region. Many potential regulatory elements are apparent in
these alignments. For instance, we can clearly see conservation of
DNA-binding sites for the Alpha1 and Mcm1 proteins in the MFA2
and STE3 promoters of S. castellii (Fig.
5A,B). A consensus Hap2 binding site
(ACCAATNA; Svetlov and Cooper 1995
To assess Saccharomyces species more broadly for their suitability for identifying regulatory sequences in gene promoters by comparative DNA sequence analysis, we amplified by PCR and determined the sequences of the promoters of three well-characterized genes from many Saccharomyces species (see Methods for details). Although the sequences of the promoter region between GAL1 and GAL10 of the sensu stricto species align well using CLUSTALW, the sensu lato species' sequences do not align. We searched for Gal4 and Mig1 protein-binding sites by a simple pattern search [FINDPATTERNS] and, as expected, they are apparent in the GAL1-10 promoter of all 10 species (Fig. 6). The spacing of these binding sites is well conserved in the sensu stricto species, with only a few sites missing from some of the species. The spacing and number of Gal4 and Mig1 binding sites are not conserved in the sensu lato species. Of these, only the S. castellii sequence aligned with the S. cerevisiae sequence using local alignment algorithms (BLASTN or BESTFIT).
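The kind of simple pattern search described above can be sketched in Python by translating an IUPAC consensus into a regular expression and scanning both strands. This is only an illustration of the idea, not the FINDPATTERNS program itself, whose options are not reproduced here. The Hap2 consensus ACCAATNA is quoted in this article; the Gal4 consensus CGG-N11-CCG is the form commonly cited in the literature and is an assumption here, because this article does not state it. The test sequence is made up.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq, consensus):
    """Return sorted (position, strand) pairs for every match of an IUPAC
    consensus on either strand. Positions are 0-based starts on the
    forward strand; overlapping matches are reported."""
    seq = seq.upper()
    body = "".join(IUPAC[c] for c in consensus.upper())
    regex = re.compile("(?=(" + body + "))")  # lookahead keeps overlaps
    width = len(consensus)
    hits = {(m.start(), "+") for m in regex.finditer(seq)}
    for m in regex.finditer(reverse_complement(seq)):
        hits.add((len(seq) - m.start() - width, "-"))
    return sorted(hits)


# Made-up promoter fragment containing one site of each kind.
toy_promoter = "TTACCAATGATT" + "CGG" + "A" * 11 + "CCG" + "TT"
hap2_sites = find_sites(toy_promoter, "ACCAATNA")
gal4_sites = find_sites(toy_promoter, "CGG" + "N" * 11 + "CCG")  # assumed consensus
print(hap2_sites)
print(gal4_sites)
```

Because the fixed ends of the assumed Gal4 pattern are reverse complements of each other, each such site is reported on both strands. Comparing the spacing of sites between species then reduces to comparing the differences between successive positions.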
We obtained from each species the DNA sequence of at least one of the
two nearly identical copies of the divergently transcribed HHT
and HHF genes (HHT1-HHF1 and
HHT2-HHF2), encoding histones H3 and H4, respectively. In
some cases we obtained the sequence of both copies (though they are so
diverged from the S. cerevisiae sequence in the sensu lato
species that it was usually difficult to distinguish between the two
copies using BLASTN alignments). Although conserved blocks
of sequence begin to emerge from 3-way CLUSTALW alignments
of sequences of the sensu stricto species (data not shown), addition of
a sensu lato sequence makes these elements clearly stand out (Fig.
7A). Both (TATA) boxes are conserved, as
are sequences within and near CCA boxes 1 and 2 that were defined
previously as regulatory elements of this promoter (Freeman et al.
1992
We succeeded in amplifying the GAL4 promoter from only three
Saccharomyces species (presumably because the flanking
protein-coding sequences are not well conserved). One of these
sequences (from S. paradoxus) is too similar to the S. cerevisiae sequence to be useful, but alignments of the other two
sequences (one from the sensu stricto species S. bayanus, the
other from the petite-negative species S. kluyveri) seem very
informative (Fig. 8). The promoter elements
that were previously defined genetically (Griggs and Johnston 1993
We have investigated the feasibility of using comparative DNA sequence analysis to identify functional sequences in the genome of S. cerevisiae, with the goal of identifying regulatory sequences and sequences specifying nonprotein-coding RNAs. We are confident that promoter regulatory sequences, RNA genes, and smORFs could be identified by comparisons of orthologous Saccharomyces sequences.
Analysis of Protein-Coding Genes
Before discussing our analysis of non-protein-coding sequence, a few
observations regarding protein-coding genes are worth noting. First, we
identified several gene sequences not found in the S. cerevisiae genome (Table 3). These are genes that likely were lost
from (or evolved rapidly in) S. cerevisiae, though it is
possible they were acquired by the other species after they diverged
from S. cerevisiae. Second, we noticed several ORF boundaries that seem to differ among the species. Some of these could be attributable to errors in the S. cerevisiae genome sequence,
and some could be real, indicating variations in gene length in the different Saccharomyces species. More interestingly, we
identified several small ORFs that likely encode protein, two of which
show similarity to proteins found in distantly related organisms (Table 4). Based on the number of smORFs identified in the small amount of
sequence we analyzed, we estimate that
Identification of Novel RNA Genes
Non-protein-coding RNA genes can be difficult to recognize in a DNA
sequence or to identify experimentally (Eddy 1999
Identification of Regulatory Sequences
Finding gene regulatory sequences is a difficult task because they are often short, independent of orientation, and their position in a promoter can vary greatly. For these reasons, DNA sequences of presumably orthologous promoters often fail to align using tools such as BLASTN or CLUSTALW. Because most comparisons of nonprotein-coding DNA sequence have used relatively diverged species (such as human-mouse or human-puffer fish), we tested the use of closely related yeast species to facilitate identification of promoter elements. As expected, most of the DNA sequences of the closely related sensu stricto Saccharomyces species align to S. cerevisiae sequences, and known promoter regions are conserved in the alignments. Two species (S. cariocanus and S. paradoxus) are too closely related to S. cerevisiae (>80% identity in noncoding regions) to yield much information from the alignments (Table 1), but many S. mikatae and S. bayanus alignments show short runs of conserved sequence and are nearly as informative as the three-way alignment shown in Figure 8. We believe that many S. cerevisiae regulatory elements could be predicted using sequence alignments of S. cerevisiae sequence to S. mikatae and S. bayanus. Alignment of multiple sequences will be required to add statistical confidence to the predictions. Fewer than 2% of the sensu lato and petite-negative species' sequences
align to intergenic regions of S. cerevisiae, and many of them
are repetitive. By anchoring the alignment to adjacent protein-coding
sequence we were able to extend the alignments of 111 sequences (out of
579)
Choosing Saccharomyces Species for Comparative Analysis
Clearly, multiple sequences from several species of various degrees of divergence will be needed to identify conserved sequences. How many sequences will be required for informative sequence comparisons? Which species are optimal for this kind of analysis? The answers depend somewhat on the gene being analyzed, but we believe a few general principles guide the choice of species.
Number of Species for Genome Sequencing
We estimate that at least three sequences of varying degrees of similarity need to be aligned to S. cerevisiae sequence for short blocks of conserved sequence in non-protein-coding DNA to be significant. Consider an alignment of S. cerevisiae noncoding sequence to the two most diverged sensu stricto species (S. mikatae and S. bayanus). The percent identity of their noncoding sequence to S. cerevisiae sequence averages ~75% for the sequence reads that align with BLASTN (Table 1). Because some reads do not align readily, the overall identity is somewhat less; 70% identity seems like a conservative estimate. The chance of both sequences having the same nucleotide as in the S. cerevisiae sequence at any given position is then ~0.49 (0.7 × 0.7, or ~0.5), so the chance of a hexamer aligning perfectly in all three sequences is ~0.015 ($0.5^6$), assuming the sequences are equally diverged from each other, which is approximately true. Thus, a conserved hexamer in this three-way alignment is not very significant, as we expect to find one every 67 base pairs on average. Adding a fourth sequence of similar divergence increases the significance of the alignments only modestly: The chance of a conserved hexamer in a four-way alignment is 0.0016 = $(0.7^3)^6$, or one every 625 base pairs on average. However, if the third sequence added is only 40% identical to the S.
cerevisiae sequence (probably a reasonable estimate for the sensu lato non-protein-coding sequences, because few of them align by BLASTN [Table 1]), the expected frequency of an exact hexamer alignment decreases significantly, to 0.000057, or one approximately every 17,600 nucleotides (1 in ~5000 nucleotides if the sensu lato sequence is 50% identical to S. cerevisiae). Indeed, this is close to what we observed in alignments of randomly generated sequence: in many four-way CLUSTALW alignments of sequences (three of them 70% identical to each other; one 40% identical to these), hexamers appeared on average once every 5555 nucleotides; heptamer or longer runs appeared once every 25,000 nucleotides. Because the average intergenic region in S. cerevisiae is ~600 nucleotides in length, a conserved hexamer is expected to occur by chance about once in every eight promoters on average. We realize that these "back of the envelope" calculations must be interpreted cautiously because it is unlikely that even a probable optimal multiple sequence alignment will correspond precisely to the biological events that generated the sequences being aligned (Thorne and Churchill 1995
Candidate Species for Comparative Analysis
Which species' sequences will yield the most information? Two considerations need to be balanced in making this decision. The sequences need to be similar enough so that most of them can be aligned with their S. cerevisiae ortholog, a requirement that favors the sensu stricto species, as more of their sequence reads can be aligned to S. cerevisiae sequence (Tables 1, 2). Conversely, if the sequences are too similar the functional elements do not stand out, a consideration that favors the sensu lato species. It seems clear that at least one species from each group needs to be compared. Two of the sensu stricto species for which we obtained DNA sequence, S. mikatae and S. bayanus, seem
like good candidates for large-scale sequence comparisons. S. cariocanus and S. paradoxus seem too similar to S. cerevisiae for this purpose (Tables 1, 2). Another good candidate
for comparative sequence analysis is S. kudriavzevii, a
recently described species (Naumov et al. 2000
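The back-of-the-envelope hexamer estimates in the section on the number of species can be reproduced with a few lines of Python. The sketch below follows the simplifying assumption made in the text (alignment positions and sequences are treated as independent) and uses unrounded products, so its results differ slightly from the rounded figures quoted above. The function names are hypothetical.

```python
def chance_kmer_frequency(identities, k=6):
    """Probability that k consecutive alignment columns are identical in the
    reference and in every other sequence, given each sequence's fractional
    identity to the reference. Positions and sequences are treated as
    independent, as in the rough estimate in the text."""
    per_site = 1.0
    for identity in identities:
        per_site *= identity
    return per_site ** k


def expected_spacing(identities, k=6):
    """Average number of nucleotides between chance-conserved k-mers."""
    return 1.0 / chance_kmer_frequency(identities, k)


# Scenarios discussed in the text (identities to S. cerevisiae).
scenarios = {
    "two sequences at 70%": [0.7, 0.7],
    "three sequences at 70%": [0.7, 0.7, 0.7],
    "two at 70% plus one at 40%": [0.7, 0.7, 0.4],
    "two at 70% plus one at 50%": [0.7, 0.7, 0.5],
}
for label, identities in scenarios.items():
    spacing = expected_spacing(identities)
    print(f"{label}: one chance hexamer per ~{spacing:.0f} nt")
```

Dividing a typical intergenic length by the expected spacing gives the expected number of chance hexamers per promoter, which is how an estimate of this kind can be compared with the frequencies observed in alignments of simulated sequence.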
Sequencing Methods
The DNA sequence of random genomic clones from different
Saccharomyces species was determined by the Washington
University Genome Sequencing Center (WUGSC). A brief description of
each step follows. Detailed protocols are provided in Mardis (1997)
Preparation of Genomic DNA
Derivatives of each yeast species lacking mitochondrial DNA were obtained by plating colonies on YPD agar media containing ethidium bromide (Slonimski et al. 1968
Preparation of Clone Libraries
After quantification, the DNA was sheared by sonication, the ends repaired with mung bean nuclease, and fragments of 1-2 kb were selected by electrophoresis through a 0.8% agarose gel. The DNA fragments were excised and extracted from the gel, ligated to the plasmid (pBC) vector, and introduced by electroporation into competent Escherichia coli DH10B cells. DNA sequence of a representative sample of the resulting plasmid subclones was determined to assess library quality.
DNA Sequencing
Plating and sequencing of plasmid library subclones, as well as sample loading, data collection and processing, were done as described by Mardis and Wilson (1997)
BLAST Sequence Comparisons
Sequence alignments were initially generated using WU-BLAST 2.0 (http://blast.wustl.edu). S. cerevisiae sequence databases were obtained from SGD (http://genome-www.stanford.edu/Saccharomyces). The databases included S. cerevisiae genomic DNA (all_sacchdb.dna), S. cerevisiae protein-coding sequences and genes (orf.trans.fasta and orf_coding.fasta) and a library of intergenic DNA (NotFeature.fasta). This latter database consists of S. cerevisiae DNA from which all genes (encoding both proteins and RNA), LTRs, and transposons were removed. Another library was created by fusing the sequences of a 5' UTR library (consisting of the 500 bp upstream of the ATG start codon) to the gene sequences in orf_coding.fasta. BLAST output was parsed and processed with ad hoc PERL scripts. Percent identities of BLAST alignments were taken from the highest scoring alignment, to avoid biasing the calculations with low scoring and/or multiple alternative alignments of the same sequence segment. TBLASTX comparisons used default parameters.
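The rule of taking percent identity only from the highest-scoring alignment of each read can be sketched as follows. The original analysis used ad hoc PERL scripts on WU-BLAST output; this Python version is only an illustration that assumes the alignments have already been parsed into simple records. The record fields, read names, and subject names are hypothetical.

```python
from collections import namedtuple

# One parsed alignment; field names are assumptions for this sketch.
Hit = namedtuple("Hit", "query subject score identities align_len")


def best_hit_identity(hits):
    """Map each query to the percent identity of its highest-scoring
    alignment, ignoring all lower-scoring alternatives."""
    best = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.score > current.score:
            best[hit.query] = hit
    return {q: 100.0 * h.identities / h.align_len for q, h in best.items()}


def mean_identity(hits):
    """Average of the per-query best-hit percent identities."""
    per_query = best_hit_identity(hits)
    return sum(per_query.values()) / len(per_query)


# Made-up records: read1 has a strong hit and a weak secondary hit.
toy_hits = [
    Hit("read1", "subjectA", 250, 180, 200),
    Hit("read1", "subjectB", 60, 40, 80),
    Hit("read2", "subjectC", 120, 75, 100),
]
print(best_hit_identity(toy_hits))
print(mean_identity(toy_hits))
```

Averaging over all alignments instead would pull the mean toward the weak secondary hits, which is the bias the best-hit rule is meant to avoid.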
Sensu lato
and petite-negative sequences were compared to S. cerevisiae "not-feature" DNA sequences (those not known or predicted to encode proteins or functional RNAs) to identify potential protein-coding sequences. High scoring alignments were manually selected from all
TBLASTX alignments. S. cerevisiae sequences
with interesting alignments were visualized using AceDB
(Eeckman and Durbin 1995 Multiple sequence alignments were created with CLUSTALW
(Thompson et al. 1994
Identification of Random Sequences Encoding Proteins Absent from S. cerevisiae
Random genomic sequences were compared with BLASTX to a nonredundant protein database similar to that maintained at NCBI. Because the database contains S. cerevisiae sequences, most of the identified homologs were from S. cerevisiae. In the sensu stricto species only five DNA sequences had significant hits to proteins not found in S. cerevisiae. The situation was more complex for the sensu lato and petite-negative sequences. In addition to finding genes not encoded in S. cerevisiae, we also found many sequences that had more similarity to sequences of species other than S. cerevisiae. Often the proteins with higher similarity are from closely related fungi, but occasionally they are from a distantly related organism, with the S. cerevisiae homolog showing much weaker similarity.
PCR Amplification and Southern Blotting of Species-Specific Sequences
Sequences were amplified by a standard PCR using the
oligonucleotide primers listed below with genomic templates prepared as
described by Hoffman and Winston (1987)
S. cariocanus amidase
OM2186 AAACAACACAGCACCAGC
OM2187 TATCCTCAGACGCAGTCG
S. cariocanus unique sequences
OM2196 ATGGTGTGCGTTGTTATC
OM2197 TCAACATGTCTGCATTCG
OM2198 GCTAGTAGTTCCGTGGTG
OM2199 CATCCCGACTCTGTCTTG
S. paradoxus unique sequence
OM2191 GGTACTTGGAGTTCTGTG
OM2192 TGGTGCTTTGGCAACAAG
S. mikatae unique sequence
OM2194 CACGTCCACATTAACCTG
OM2195 GATGTGGACATCATGCTTG
S. bayanus unique sequence
OM2206 CAATAACACGGATGCTCAAC
OM2207 TGTCGGAGGTCACAGGAG
Cloning of Specific Genomic Regions
Yeast genomic DNA was prepared as described by Hoffman and Winston
(1987)
PCR Primers (asterisks indicate nested primers):
26s rDNA
OM1833 GCATATCAATAAGCGGAGGAAAAG
OM1834 GGTCCGTGTTTCAAGACGG
GAL1-10
OM1975 AGCCGTCAGTTCAAAACATCACC
OM1866* CCATGTATCCAGCACCACC
OM1867 AGTCACAATAATCAATATGTTCACC
GAL4
OM1941 TTCCCACCAATAACATCATTTGACTGGAA
OM1920 TTCCACTTCTGTCAGATGTGCCCTAGTCAGCGG
OM1921* GACTCGAACAAAATCATTATTCTAGATATGAG
OM2208* GATAAACAACATTGCATGGAGGC
S. paradoxus: OM1920-OM1941
S. bayanus: OM1920-OM1941/ OM1920-OM1921
S. kluyveri: OM1920-OM2208
HHT-HHF
OM1837 CACCAGTGGACTTTCTTGCTGTTTG
OM1838 CCACCTTTACCTAGACCTTTACCACCTTTACC
SNR39
OM1959 GAAAAAATCTTGACCCCAGAATCTCAGTTGAAGA
OM1961 CTTGTTAATACCCTTGATTCTGACAACGAA
SNR44
OM1835 TCATCAAGTTTTTACAAGTTATGCAAAAGC
OM1836 CGACAATCTTACCAGATCTGTGGTC
Estimation of Hexamer Frequency in Multiple Sequence Alignments
Random DNA sequences were generated in the computer with Seq-Gen
(Rambaut and Grassly 1997
We are indebted to Ed Louis for yeast strains, advice, helpful discussion, and enthusiastic support. We also thank members of the Gish Lab for programming assistance, and Sean Eddy, David States, and Gary Stormo for ongoing advice and for reviewing the manuscript. Linda Riles and Matt Curtis provided technical assistance that was instrumental in the initial phases of the project. This work was supported by funds provided by the James S. McDonnell Foundation. P.F.C. was supported in part by an NHGRI NRSA Fellowship (NIH #IF32HG00218). The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Corresponding author.
E-MAIL mj{at}genetics.wustl.edu; FAX (314) 362-2985.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.182901.
Received February 1, 2001; accepted in revised form April 11, 2001. 11:1175-1186 ©2001 by Cold Spring Harbor Laborato