Vol. 10, Issue 6, 758-775, June 2000
LETTER
ABSTRACT
The progress of human and mouse genome sequencing programs presages the possibility of systematic cross-species comparison of the two genomes as a powerful tool for gene and regulatory element identification. As the opportunities to perform comparative sequence analysis emerge, it is important to develop parameters for such analyses and to examine the outcomes of cross-species comparison. Our analysis used gene prediction and a database search of 430 kb of genomic sequence covering the Bpa/Str region of the mouse X chromosome, and 745 kb of genomic sequence from the homologous human X chromosome region. We identified 11 genes in mouse and 13 genes and two pseudogenes in human. In addition, we compared the mouse and human sequences using pairwise alignment and searches for evolutionary conserved regions (ECRs) exceeding a defined threshold of sequence identity. This approach aided the identification of at least four further putative conserved genes in the region. Comparative sequencing revealed that this region is a mosaic in evolutionary terms, with considerably more rearrangement between the two species than realized previously from comparative mapping studies. Surprisingly, this region showed an extremely high LINE and low SINE content, low G+C content, and yet a relatively high gene density, in contrast to the low gene density usually associated with such regions.
[The sequence data described in this paper have been submitted to EMBL under the following accession nos.: Mouse Genomic Sequence: Mouse contig A (AL021127), Mouse contig B (AL049866), BAC41M10 (AL136328), PAC303O11 (AL136329). Human Genomic Sequence: Human contig 1 (U82671, U82670), Human contig 2 (U82695).]
INTRODUCTION
As significant amounts of the human genome and more recently the mouse genome are sequenced, the opportunity to use cross-species sequence comparison as an analytical tool becomes increasingly attractive. The premise for this analysis is that functionally important sequences will be strongly conserved, whereas other regions will differ as a result of mutations that have accumulated since the time when the species shared a common ancestor. The detailed analysis and comparison of sequence in conserved segments may aid our understanding of the genomic organization of complex genes and suggest candidate regulatory regions. It is also anticipated that it will provide new insights into chromosome and genome evolution, e.g., by defining the sequence content of chromosomal evolutionary breakpoints.
A number of comparative sequence studies have begun to demonstrate the
value of this approach in gene annotation and regulatory element
identification (Hardison et al. 1997). Comparative sequencing of a
number of regions in mouse and human
has underlined the value of comparative sequencing for gene annotation.
With the completion of the sequence of human chromosome 22 (Dunham et
al. 1999) and the rapid progress towards a working draft of the human
genome, the opportunities for sequence comparison of human with mouse
genome sequence will increase, emphasizing the need to develop
parameters for cross-species sequence comparison and to document the
outcomes over extensive regions of the genome. Ab initio gene
prediction methods applied to the finished sequence of human chromosome
22 suggest that there are at least 100 genes in this chromosome for
which there is no supporting evidence in the sequence databases (Dunham
et al. 1999). Moreover, sequence analysis has highlighted a large
surfeit of CpG islands that are not associated with defined
transcription units and may represent uncharacterized genes.
Comparative sequencing might be expected to make a major contribution
to the detection and annotation of these undefined mammalian gene loci.
The comparative sequence approach represents a potentially universal
method for gene prediction that can be applied to any genome
region. An evolutionary conserved region (ECR) that exceeds a defined
threshold of sequence identity is likely to represent a functional
element. We have applied such an approach, together with conventional
gene prediction and homology searching methods, to identify potential
genes in a region of the mouse and human X chromosomes. In so doing we
have made use of extensive published studies of orthologous genes in
man and mouse (Makalowski et al. 1996) as well as employing available
data from annotated mouse and human genomic sequence regions to help us
set meaningful parameters for the detection of ECRs.
The mouse region sequenced encompasses the loci for a pair of murine
X-linked dominant disorders, bare-patches (Bpa) and striated
(Str). The comparative analysis of this region has facilitated
the identification of the Bpa and Str gene,
Nsdhl, as described previously (Liu et al. 1999). We now
report a full analysis of the sequence comparison of this mouse
sequence region and its human counterpart. The analysis has revealed a
number of conserved elements that may represent novel genes not
revealed by database searching. Moreover, the region is considerably
more rearranged between the two species than previously realized from
comparative mapping studies, highlighting that some regions of
mammalian genomes may be highly mosaic in evolutionary terms.
RESULTS AND DISCUSSION
We have undertaken the sequencing of a 600-kb region that
encompasses the Bpa and Str mutations on the mouse X
chromosome. The critical region containing these two mutations is
flanked by the loci DXHXS1104 and DXHXS52 (Levin et
al. 1996). In addition, we have completed the sequencing of the
homologous human region. In the mouse, cosmid contigs partially
spanning the Bpa/Str critical region were already available
(Chatterjee et al. 1994; Levin et al. 1996) and STS markers and
previously identified genes were used to construct a BAC and PAC contig
(see Methods). Thirteen markers were used to isolate clones and
facilitate contig construction, resulting in 28 clones selected for
further characterization. Fluorescent fingerprinting was used to
assemble the minimal tiling path by comparison of overlaps between the
clones. The mouse genomic sequence is contained in two contigs, A and
B. Mouse contig A (194 kb) is assembled from three clones (BAC45N8,
CMX137, and PAC525I4), and mouse contig B (166 kb) comprises BAC437P9
(Fig. 1). A central region separating mouse contigs A
and B has also been sequenced. However, although a number of clones
encompassing this interval were mapped, both STS content and
fingerprint data suggested that this region demonstrated a high degree
of instability in different clones. For example, whereas clone BAC41M10
from this region contained the marker F8a, this locus was
absent from PAC303O11. This was confirmed by sequencing both these
clones completely. PAC303O11 appears to encompass the whole region, as its termini overlap mouse contig A and mouse contig B. Nevertheless, as
expected, the F8a gene was not contained within its finished sequence. The sequence from clone 41M10 contains the F8a locus, as expected, but this clone is substantially rearranged with
respect to PAC303O11.
Originally, the human critical region was covered by a complete YAC map and
cosmid map containing several gaps (Heiss et al. 1996). Two of these gaps were
bridged by BAC/PAC clones. Despite all efforts, one gap remains between
ZNF275 and ZFP92. Comparison of the available human sequence
data to the mouse Zfp275-Zfp92 interval may provide a rough
estimate of the gap size. If we assume no major human rearrangements, the gap
may be about 20 kb composed of highly repetitive sequences. In total,
the 32 cosmids, three BACs, and one PAC span two genomic sequence
contigs of 577 kb (Human contig 1) and 168 kb (Human contig 2) (Fig. 1).
Extensive analysis using similarity searching and gene/exon prediction has enabled us to identify 11 genes in mouse and 13 genes and two pseudogenes in human, and to characterize their genomic structure (see Methods). The order, orientation, and conservation of these genes are displayed in the percent identity plot (PIP) (Fig. 2). In addition, we undertook extensive analysis of the sequence by pairwise comparison to identify conserved gap-free alignments, which we have named ECRs, that might represent additional unrecognized coding elements within the region. At least four potential novel genes were identified by this approach. We first describe those genes identified primarily on the basis of similarity searches and gene/exon predictions.
Genes in the Bpa/Str Critical Region of Mouse and Human Identified through Similarity Searching and Gene/Exon Prediction
Melanoma Antigen Gene (MAGE) Family Cluster
MAGE genes encode tumor-specific proteins of unknown function, which are recognized by cytolytic T lymphocytes. A group of 12 genes, named MAGEA (1-12) have been previously located in the human Xq28 region and five other MAGE genes have been located elsewhere on the X chromosome (De Plaen et al. 1994
Caltractin and NAD(P)H Steroid Dehydrogenase-like Gene
Calt belongs to a family of calcium-binding proteins and is a structural component of the centrosome (Chatterjee et al. 1995
3β-hydroxysteroid dehydrogenase (3β-HSD) and was identified as the gene mutated in Bpa and Str mice (Liu et al. 1999
Zinc Finger Protein 185
The mouse Zfp185 and human ZNF185 are group 3 LIM domain proteins (Heiss et al. 1997
High Mobility Group Protein 17
The high-mobility group (HMG) proteins are the most abundant nonhistone chromosomal proteins in the nuclei of higher eukaryotes. HMG17, and the closely related protein HMG14, bind preferentially to the nucleosomal core particle and may modulate the chromatin configuration of transcriptionally active genes (Bustin et al. 1990
X-linked Lymphocyte-regulated (Xlr) Related Genes
XLR3A AND XLR3B
The murine Xlr multi-gene family was originally identified by subtractive cDNA hybridization and cloning (Cohen et al. 1985
XLR4
Sequence generated in this central interval in mouse also allowed the identification of an additional Xlr family member, subsequently named Xlr4. Two copies of Xlr4 were identified in 41M10, whereas only one copy of the gene was found in clone 303O11, again reinforcing the apparent clone instability in this region. Full-length cDNA sequence for Xlr4 was obtained through sequencing an EST (GenBank accession no. AA472809) initially identified by similarity searching. The cDNA is 1290 bp in length and alignment of this cDNA to genomic sequence identified that Xlr4 has nine exons, covering 8.5 kb of mouse genomic sequence. Xlr4 has a similar exon-intron structure to the Xlr3 genes in the region. Xlr4 has an ORF of 636 bp, encoding a putative protein of 212 aa. ProfileScan identified a potential bipartite nuclear localization signal (Prosite accession no. PS50079) in Xlr4. Xlr4 has a predicted pI of 9.51 and is therefore a basic protein like the Xlr3 subfamily. Over the full length of the protein, Xlr4 has 31% identity to Xlr3a and 25% identity to Xlr1 (Table 2). No significant similarities at the DNA or protein sequence level were identified in the human sequence databases. Furthermore, no related human sequences were identified in the corresponding human sequence. Xlr4 appears to define a new Xlr subfamily.
XLR5
Analysis of mouse contig B identified a further Xlr family member, based on EST sequence similarity and exon prediction. GENSCAN and HMMGene predicted a gene comprising six exons. Several of these predicted exons overlapped exons predicted by Grail. An additional two exons of this gene were also detected in PAC303O11. The assembled gene structure has an ORF of 708 bp, coding for a putative protein of 236 aa. A scan of the predicted protein sequence against PROSITE did not identify any functional domains. Database searches using BLASTP against Swiss-Prot identified homology to murine XLR3A, murine XLR3B, hamster SCP3 (Synaptonemal Complex Protein 3), rat SCP3, murine SYCP3, and murine XLR1. FASTA comparisons of this group of proteins showed that XLR5 has relatively low identity with other members of the XLR family; XLR5 showed highest identity with SCP3 (Table 2). Expression studies were performed to determine the expression profile of this putative gene. RT-PCR assays between exons 1 and 2 on testis RNA detected the expected cDNA product (data not shown). Northern blot analysis with a probe amplified between exons 1 and 5 from testis cDNA detected a major transcript of 1.5 kb in testis (data not shown). Finally, no related human sequences were identified in the corresponding human sequence contigs. Again, Xlr5 appears to define a new Xlr subfamily.
Factor VIII-associated Gene
The human F8A gene was originally identified within intron 22 of the F8 gene (Levinson et al. 1992
Zinc Finger Protein 275
A new zinc finger protein gene, Zfp275, was identified distal to F8a in mouse contig B. A ZNF275 ortholog was also identified in human contig 1. Very few ESTs cover this coding region. However, the gene structure of exons 1-5 was determined using a partial mouse cDNA clone. The 3' end of the 6th exon was defined by two mouse ESTs that included the polyA tail (GenBank accession nos. AA189691 and AA833132). The alignment of the available consensus mouse cDNA sequence with genomic DNA defined a gene of six exons, with the 6th exon being 6 kb in length. The gene structure in mouse and human is conserved, with an ORF of 1392 bp encoding a protein of 464 aa. Analysis of this protein sequence showed that it contains 11 zinc finger motifs. The coding exons of Zfp275 are highly conserved between mouse and human. A different pattern of conservation can be observed in the 3' UTR, with most of the conservation consisting of short gap-free alignments. It is apparent from the PIP plot that repeat density in the Zfp275 and ZNF275 gene regions is very low and overall sequence conservation is high. For example, conservation around exons 3 and 4 extends beyond the exons into intronic sequence. Expression studies using Northern blots demonstrated a double band at about 7 kb in polyA+ mRNA from ES cells and embryos from E10.5-E18.5 (data not shown). RT-PCR within exon 6 detected signals in adult brain, kidney, heart, thymus, and spleen, all showing the same expected size in cDNA and genomic DNA (data not shown).
Zinc Finger Protein 92
The Zfp92 gene was originally identified in the distal part of the DXHXS1104-DXHXS52 region in both human and mouse (Levin et al. 1996
hsxq28orf/mmxq28orf
Significant sequence similarities were identified at the distal end of mouse contig B with a human cDNA (hsxq28orf or STS1769, GenBank accession no. X99270) in GenBank. It became apparent that this cDNA is chimeric, because its 5' end (from 4-504 bp) has 100% identity to GDP-D-mannose-4,6-dehydratase mRNA (GenBank accession no. AF040260) and that gene maps to 6p25. The 3' end of this gene in human is confirmed by many ESTs; however, the 5' end is presently defined by one EST (GenBank accession no. T66063). The human cDNA compiled from these data is 1.365 kb in length and identifies 10 exons. The homologous mouse cDNA, as defined by ESTs, appears to be 1.363 kb in length. Exons 5-10 are in mouse contig B. The predicted peptide is 353 aa in mouse and 358 aa in human; the two show 65% identity to each other and no significant similarity to anything in the protein database. Expression studies have shown that the expected 650-bp RT-PCR product from exons 4-9 was present in all of the adult tissues tested. In addition, however, a product 300 bp smaller was identified and appears to be due to alternative splicing between exons 4 and 8 (data not shown). Northern blot analysis of this gene detects two bands in testis, with the larger being the expected 1.3-kb size (data not shown).
Novel Genes in the Bpa/Str Region Identified by Their Conservation with Human Sequence
With the aim of identifying additional putative genes, the mouse and
human genomic sequences were analyzed extensively by pairwise
comparison. At the outset of this analysis it was important to identify
meaningful thresholds for the detection of ECRs that may represent
undiscovered coding sequence. A previous study of 1196 orthologous
mouse and human full-length mRNA sequences has described statistical
distributions of sequence conservation in translated and untranslated
regions (Makalowski et al. 1996). Our aim, therefore, was to
characterize these statistical distributions at a genomic level and to
define the thresholds to be used to identify ECRs in novel genomic
sequence. We examined the sequence conservation in coding (CDS) and
noncoding exons (UTRs) of previously annotated genomic regions of the
mouse and human (see Methods). Six mouse and human regions were chosen
for this study:
These regions comprised 581 kb of mouse genomic sequence and 595 kb
of human genomic sequence. Overall, the regions chosen contained 32 annotated genes present in both the mouse and human. Comparison of
mouse and human sequence from these six regions using BLASTZ with the
standard parameter settings (Schwartz et al. 1999) and manipulation of
its output resulted in the identification of 283 gap-free alignments
overlapping known coding exons, 24 gap-free alignments overlapping
known 5' UTRs and 40 gap-free alignments overlapping known 3'
UTRs (http://www.mgc.har.mrc.ac.uk/comp_seq/reference.html). This
reference set of gap-free alignments was determined using thresholds of
50-bp length and 50% identity. The percentage identity distributions
for gap-free alignments in each category are illustrated in Figure 4.
The analysis indicates that the distribution of
percentage identity is broad for 5' UTRs but more narrowly
distributed for 3' UTRs and coding exons. The average identity for
5' UTRs is 79.08% (SD=11.65), 74.05% (SD=9.62) for 3' UTRs,
and 84.31% (SD=8.40) for CDS. In the study by Makalowski et al. (1996)
the average percentage identity for 5' UTRs is 67.47% (SD=13.2),
69.13% (SD=12.4) for 3' UTRs, and 84.62% (SD=6.78) for CDS.
Because each of the above studies was performed independently using
different alignment algorithms, the results cannot be compared
directly. Nevertheless, both studies indicated clear differences in the
range of percentage identity observed between CDS and UTR regions and
have aided us in setting parameters for the identification of ECRs.
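As an illustration of how such a reference set could be assembled, the following minimal Python sketch splits a gapped pairwise alignment into gap-free blocks, computes the percentage identity of each block, and keeps those meeting the 50-bp and 50% thresholds. It is not the BLASTZ post-processing used in this study; the function names, the use of "-" as the gap character, and the inclusive treatment of both thresholds are assumptions made for the example.

```python
def gap_free_blocks(aln_a, aln_b):
    """Split two aligned strings (gaps as '-') into gap-free segment pairs."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    blocks = []
    start = None
    for i, (a, b) in enumerate(zip(aln_a, aln_b)):
        if a == "-" or b == "-":
            # A gap in either sequence closes the current block, if any.
            if start is not None:
                blocks.append((aln_a[start:i], aln_b[start:i]))
                start = None
        elif start is None:
            start = i
    if start is not None:
        blocks.append((aln_a[start:], aln_b[start:]))
    return blocks


def percent_identity(seg_a, seg_b):
    """Percentage of identical positions in two equal-length segments."""
    matches = sum(1 for a, b in zip(seg_a, seg_b) if a.upper() == b.upper())
    return 100.0 * matches / len(seg_a)


def reference_set(aln_a, aln_b, min_len=50, min_identity=50.0):
    """Return (length, identity) for blocks meeting both thresholds.

    Assumption: thresholds are inclusive (>=); the text does not say.
    """
    kept = []
    for seg_a, seg_b in gap_free_blocks(aln_a, aln_b):
        pid = percent_identity(seg_a, seg_b)
        if len(seg_a) >= min_len and pid >= min_identity:
            kept.append((len(seg_a), pid))
    return kept
```

Per-category means and standard deviations of the kind quoted above could then be obtained by passing the retained identities to `statistics.mean` and `statistics.stdev`.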
We have also assessed studies of exon size to assist us with setting
parameters for identifying ECRs. A large study of human exon size has
demonstrated that there is little constraint on exon length, the
smallest identified being 15 bp (Zhang 1998). However, in this study,
which categorized exons into a number of different groups, the smallest
average exon size (100 bp), was found in the iuexon
(internal-untranslated) category. Nevertheless, to ensure that we
identified the majority of exons, but also reduced noise from short
conserved noncoding sequences, we set an arbitrary lower limit of 50 bp
for the identification of ECRs. Taking together both percentage
identity and exon size studies, we have defined two categories of ECR:
category 1 [those with a percentage identity >80% and length
>50 bp (http://www.mgc.har.mrc.ac.uk/comp_seq/category1.html)], and category 2 [those with a percentage identity >70% and a length >50 bp (http://www.mgc.har.mrc.ac.uk/comp_seq/category2.html)]. Clearly category 1 ECRs have a higher likelihood of representing true
coding regions, as the 80% cutoff differentiates well between CDS and
UTRs on the basis of both our analysis and that of Makalowski et al.
(1996). Utilizing these filters and aided by visual interpretation from
the PIP plot, we identified at least four further putative transcription units in the Bpa/Str region. It should be noted at this point that the visual interpretation of the plot identified additional ECRs not present in either category 1 or 2.
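The two ECR categories can be expressed as a simple filter. The Python sketch below is illustrative only: the function names, the half-open coordinate convention, and the treatment of the categories as mutually exclusive (an alignment above 80% identity is reported as category 1 only) are assumptions, not a description of the scripts used in this study.

```python
def ecr_category(length_bp, identity_pct):
    """Return 1, 2, or None for a gap-free alignment.

    Category 1: identity > 80% and length > 50 bp.
    Category 2: identity > 70% and length > 50 bp (and not category 1).
    """
    if length_bp <= 50:
        return None
    if identity_pct > 80.0:
        return 1
    if identity_pct > 70.0:
        return 2
    return None


def overlaps(start, end, intervals):
    """True if the half-open interval [start, end) overlaps any interval."""
    return any(start < e and s < end for s, e in intervals)


def novel_ecrs(alignments, annotated_exons):
    """Keep alignments that qualify as ECRs and avoid annotated exons.

    alignments: iterable of (start, end, identity_pct), half-open coordinates.
    annotated_exons: list of (start, end) for previously annotated exons.
    """
    result = []
    for start, end, pid in alignments:
        category = ecr_category(end - start, pid)
        if category is not None and not overlaps(start, end, annotated_exons):
            result.append((start, end, pid, category))
    return result
```

Under these assumptions, an alignment of 74% identity over 998 bp falls into category 2, whereas one of 60% identity over 337 bp falls into neither category and would be found only by inspection of the PIP.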
Identification of ECRs in Mouse Contig A
Analysis of mouse contig A identified 24 ECRs in category 1 and an additional 60 ECRs in category 2 not overlapping with previously annotated exons. Of the 24 category 1 gap-free alignments, 20 (ECRA4-ECRA23) were identified in a 37-kb region between Nsdhl and Zfp185 (Fig. 2). The other four category 1 ECRs (ECRA1, A2, A3, and A24) are annotated on Figure 2 and may represent further transcription elements, but analysis did not uncover any other evidence to suggest they represent genes.
ECRA4-ECRA23
Following identification of these 20 ECRs, database searching identified a human EST (GenBank accession no. AI653754) with homology to ECRA20. None of the other ECRs showed matches to the nonredundant DNA database. The human ECRA20 sequence shows 100% identity to the EST, whereas mouse ECRA20 shows 88% identity. An ORF was identified in seven of the ECRs (ECRA7, A8, A9, A17, A18, A20, and A23). Eight of the ECRs (ECRA9, A10, A17, A18, A19, A21, A22, and A23) also have overlapping exon predictions, although no more than one package predicted each exon. ECRA18 and A19 also overlap a predicted CpG island (see Fig. 2). Preliminary expression studies were carried out on six of the ECRs (ECRA4, A10, A11, A17, A20, and A23) using RT-PCR (see Methods). Of the seven regions identified with an ORF, three (ECRA17, A20, and A23) were tested by RT-PCR and two of these (ECRA20 and ECRA23) gave positive results. ECRA20 was positive by RT-PCR in embryonic days 9.5, 11.5, 13.5, 15.5, and 16.5, neonate skin, and in adult brain, heart, kidney, thymus, and testis (data not shown). ECRA23 was positive by RT-PCR in embryonic days 9.5, 11.5, 13.5, 15.5, and 16.5, neonate skin, and in adult kidney, spleen, and testis (data not shown). Three other ECRs were tested (ECRA4, A10, and A11), one of which had an overlapping predicted exon (ECRA10), and one of these (ECRA4) was shown by RT-PCR to be expressed. ECRA4 was positive by RT-PCR in 15.5d embryo, adult kidney, spleen, and thymus. We have not clarified whether these putative exons can be connected into single or multiple transcription units. Nevertheless, several strands of data suggest the presence of at least one gene in this region: (1) the presence of multiple ECRs; (2) RT-PCR data from ECRs; and (3) the discovery of a single EST with matches to ECRA20.
ECRA25-ECRA30
PIP plots also identified a region of high overall conservation between Zfp185 and the end of mouse contig A. This island of conservation, lying between 175 kb and 179 kb of mouse contig A, is composed of six category 2 gap-free alignments (ECRA25-ECRA30) with percentage identities ranging from 70% to 78% and lengths from 68 to 455 bp. GENSCAN, HMMGene, Grail, and Genemark all predicted a combination of exons over this island of conservation in both mouse and human. Both GENSCAN and HMMGene predicted a gene overlapping ECRA26, A27, and A28. ORFinder predicts a single ORF encoding a putative protein of 499 aa in mouse and 448 aa in human. BLAST analysis of the murine putative protein identified sequence similarity to Homo sapiens KIAA0883 (GenBank accession no. AB020690) at 38% identity over 328 aa and also to Homo sapiens paraneoplastic antigen MA1 (PNMA1) (GenBank accession no. NM_006029) at 34% identity over 309 aa. The first 102 aa of this putative protein also have 35% identity to part (311 bp) of HUMXQ28B (GenBank accession no. M89986), an anonymous X-linked STS. ECRA26 and ECRA27 also have similarity to a rat EST (GenBank accession no. AI549430) at 91% identity, and a predicted CpG island is identifiable over ECRA26. In summary, the comparative analysis indicates the presence of a putative gene with one coding exon encoding a protein of 499 aa in mouse and 448 aa in human.
Identification of ECRs in Mouse Contig B
Analysis of contig B identified 11 ECRs in category 1 and 56 ECRs in category 2 not overlapping with previously annotated exons. Eight of the category 1 ECRs were analyzed in this study; the others were excluded because they were identified within intronic sequence. ECRB3 and B4 are category 1 ECRs annotated on the PIP plot and may represent additional transcription units but, at present, we have not uncovered any other supporting evidence to indicate they represent genes.
ECRB1-ECRB2
Two ECRs, named ECRB1 and ECRB2, are localized between the Xlr5 and Zfp275 loci. ECRB1 was identified only by inspection of the PIP, as it is just 60% identical over 337 bp and so falls below both category thresholds. However, an adjacent sequence, ECRB2, was identified as category 2, as it is 74% identical over 998 bp of mouse and human sequence. ECRB2 overlaps an exon predicted by GeneFinder, GENSCAN, and HMMGene. The predicted murine gene codes for a putative 528-aa protein that has 49% identity over 250 aa to mouse UBE-1c2 (GenBank accession no. AB030505). A search using ProfileScan also identified a bipartite nuclear localization signal (Prosite accession no. PDOC00015) between amino acids 385 and 402. ECRB1 is similar to melanoma ubiquitous mutated protein (MUM-1; GenBank accession no. U20896), at 44% identity over 127 aa. MUM-1 is a mutated intron sequence that codes for an antigenic peptide recognized by cytolytic T lymphocytes on a human melanoma. The gene is expressed in many normal tissues.
ECRB5-ECRB8
A second region of conservation was identified in mouse contig B between Zfp92 and hsxq28orf (Fig. 2). Four ECRs from categories 1 and 2 were identified in this region (ECRB5, B6, B7, and B8), with percentage identities ranging from 70% to 81% and lengths from 61 to 412 bp. Analysis of the nonredundant DNA database subsequent to the identification of ECRB5 and B4 identified matches to one mouse EST (GenBank accession no. AA060540). The EST is 1085 bp in length and has a polyA tail and polyadenylation signal at the 3' end. The available EST sequence suggests a gene structure comprising at least two exons of 114 bp and 969 bp. Subsequent database searches have identified other mouse ESTs that support this organization (GenBank accession nos. AI595465, AI427513, AV021721) (see Fig. 2). The identified ECRs agree well with the proposed gene structure. The predicted ORF is 369 bp in length, encoding a protein of 122 aa. This putative protein shows no matches to Swiss-Prot. RT-PCR of cDNA from adult tissue using primers from exons 1 and 2 detects the expected spliced product exclusively in skin. Northern blots of total RNA from adult tissues detected a 1.1-kb transcript only in skin (data not shown).
Repeat and Gene Distribution
In the mouse genomic sequence generated to date, 34.85% is composed of repetitive elements as identified by RepeatMasker, compared to 36.55% of the human sequence. The number of repetitive elements identified in each class is summarized in Table 3 along with the percentage of sequence occupied by each class. Moreover, we aligned mouse and human genomic sequences without repeat masking to identify orthologous repetitive elements. There appeared to be fewer mammalian interspersed repeat (MIR) relics in the mouse sequence and eight out of nine of these were identified in aligned regions, suggesting their presence predated mouse-human divergence some 80 million years ago.
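For readers wishing to derive a repeat-content percentage of this kind from a list of annotated repeat positions, a minimal Python sketch follows. It assumes half-open, zero-based (start, end) intervals and a hypothetical function name; it merges overlapping annotations so that no base is counted twice, and it does not parse RepeatMasker output directly.

```python
def repeat_fraction(intervals, seq_len):
    """Percentage of a sequence covered by repeat intervals.

    intervals: iterable of (start, end), half-open and zero-based (assumed).
    seq_len: total length of the sequence in bp.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval before starting a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent annotation: extend the merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return 100.0 * covered / seq_len
```

Restricting the input to the intervals of a single repeat class would give the per-class percentages of the kind summarized in Table 3.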
Conclusions
We have sequenced 430 kb from the mouse Bpa/Str
critical region and 745 kb from the homologous region of the human X
chromosome. Sequence from each species was subjected to gene prediction
and homology searches to identify potential genes. These analyses had
allowed us previously to undertake a comprehensive search for candidate
genes for the bare-patches and striated mutants and
ultimately led to the identification of causative mutations (Liu et
al. 1999). We also identified eight genes in mouse and human sequence
not found previously by exon trapping or cDNA selection. These include
a member of the melanoma antigen gene family (Magea9), two
novel members of the X-linked lymphocyte-regulated family (Xlr4 and Xlr5), and a zinc-finger gene
(Zfp275). However, additional analyses employing comparisons
of mouse and human sequence allowed us to identify at least four
potential additional genes based on their evolutionary conservation.
Using available genomic sequence from a variety of mouse and human regions, we developed an approach for the identification of ECRs that were likely to represent coding sequence. We searched for gap-free alignments exceeding either 70% or 80% identity and with a minimum length of 50 bp. Our analyses of previously determined mouse and human genome sequence indicated that such gap-free alignments had a high probability of representing coding sequences. Using annotated PIP plots and these thresholds of sequence similarity, we identified four further potential transcribed regions in the 430 kb of mouse sequence analyzed: ECRA4-23, ECRA25-30, ECRB1-2, and ECRB5-8.
It appears that this approach provides a potentially significant enhancement to the process of identifying putative genes. For example, for the putative gene ECRA4-23, only one ECR homologous to an EST was identified and, though eight of the ECRs had overlapping exon predictions, no more than one package predicted each exon. The identification of ECRs thus has the potential to provide a much richer view of the putative gene sequences in this region, and this was confirmed by the demonstration that a number of the ECRs are transcribed. For example, ECRA4 is transcribed, yet neither similarity to ESTs nor exon prediction would have highlighted this sequence as a potential transcription unit. Equally, ECRB5-B8 is another example of a candidate gene revealed by sequence comparison where evidence from transcription studies has subsequently emerged to substantiate the presence of a real transcription unit. It remains to be seen whether the untested ECRs in ECRA4-A23 are transcribed and whether the two other ECR regions that have been identified do indeed represent true genes.
Although we employed relatively low thresholds (70% and 80%) to
identify ECRs, a surprisingly low level of noise was present with only
35 category 1 and 116 category 2 ECRs being identified in the 430-kb
region studied. Moreover, the cumulative data suggested the presence of
only four additional transcription units that would not have been
detected by homology searching and exon prediction methods. For two of
these putative transcription units, we have provided evidence that they
are transcribed. The results demonstrate that the approach we have
employed to date is a productive means of identifying putative exons
that may remain undetected by gene prediction and similarity searching.
It would seem likely that with further enhancements to this method, the
process of comparative sequence analysis could be made even more
discriminating. This work and the comparative studies of others (Koop
and Hood 1994; Lamerdin et al. 1996; Oeltjen et al. 1997; Ansari-Lari
et al. 1998; Brickner et al. 1999; Jang et al. 1999) underline the
immense potential value of mouse genomic sequence as a means of
annotating the human genome.
One cautionary note should be made, however, in that the Bpa/Str region appears to be a mosaic in evolutionary terms. It contains genes apparently specific to one species, members of the Xlr family, other genes that are conserved at the sequence level, but disrupt conservation of gene order, such as F8a, and others that show extremely high conservation of sequence and structure, such as Nsdhl. This may well be true of other parts of the genome and must be kept in mind when comparing the "working draft" mouse genomic sequence (to be available by 2003) with finished, high-quality human sequence. For these reasons, it would seem best to adopt a strategy of both clone-based and genome-wide approaches for the mouse genome sequencing project. Gaps in the draft sequence may well obscure gene relationships, unless workers make careful use of other positional information, such as genetic or physical maps.
METHODS
Contig Construction
Murine Critical Region
BACs (129Sv library: Research Genetics) and PACs (RPCI21; 129/SvevTACfBr mouse spleen genomic DNA library: K. Osoegawa, P. de Jong, Roswell Park Cancer Institute, Buffalo; MRC HGMP-RC) were identified by radioactive hybridization and PCR screening. Resulting clones were subsequently tested for other markers from the region by PCR and Southern analysis. Pulsed field gel electrophoresis (PFGE) was used to estimate the size of the clones. DNA was digested with NotI to remove the vector, and run on a 1% low melting point (LMP) agarose gel. A 20-sec switch time was used at 170 V over 20 hr. The size of the BAC clones varied between 85 and 165 kb, with the size of the PAC inserts being larger, between 140 and 200 kb. Fluorescent fingerprinting (Gregory et al. 1997
Human Critical Region
Cosmids indicated by `Qc' (see Fig. 1) were isolated from an Xq28-specific cosmid library constructed from the hamster/human cell hybrid QIZ (Warren et al. 1990
Sequencing of the Mouse and Human Intervals
Mouse
The sequencing strategy adopted was the random shotgun approach with an 8- to 10-fold redundancy. Typically, around 3500 subclones were sequenced for each BAC or PAC and 1000 for cosmids. Half of the sequencing reactions were carried out by the Dye Terminator cycle sequencing method (Rhodamine or dRhodamine) and half by Energy Transfer dye primer cycle sequencing. Automated editing of the reads was carried out using Pregap (Bonfield and Staden 1996
Human
Cosmid, BAC, and PAC DNA preparation and sequencing were performed as described previously (Kioschis et al. 1998
Expression Studies
RT-PCR Assays and Northern Analysis
RT-PCR assays on Xlr5 were carried out on mouse Origene Multiple Choice™ cDNAs. PCR was carried out in a 20-µl reaction volume, using 0.33 mM (final concentration) primers and HotstarTaq (Qiagen). The cDNA templates were denatured for 15 min at 95°C. This was followed by 35 cycles at 95°C, 5 sec; 58°C, 10 sec; 72°C, 1 min and a final extension of 5 min at 72°C. A probe for Northern blot hybridization was amplified from testis cDNA using primers spanning exon 1 to exon 5 of the Xlr5 gene (conditions above), resulting in a product of 373 bp. The PCR product was purified using Spin Columns (Quantum Prep® PCR Kleen, BIO-RAD) prior to labeling by random priming with 32P-dCTP using the Megaprime DNA labeling system (Amersham Pharmacia Biotech). The probe was subjected to 2 hr of competition with mouse Cot-1 DNA (GIBCO BRL). Hybridization was carried out at 68°C for 1 hr in Expresshyb solution (Clontech) to a Multiple Tissue Northern (MTN™) from Clontech following the given protocol and washed in 0.1×SSC, 0.1% SDS for 45 min at 65°C. RT-PCR analysis of ECRA4-A23, Zfp275, ECRB5-B8 and Northern analysis of Zfp275 and ECRB5-B8 were carried out as described previously (Levin et al. 1996
Sequence Analysis
The sequence analysis of the genomic sequence was performed using Nix (http://www.hgmp.mrc.ac.uk/) and Rummage (Glockner et al. 1998). Nix is a WWW (World Wide Web) tool used to view the results of running multiple DNA analysis programs on DNA sequence. The analysis programs include GRAIL (Uberbacher and Mural 1991), Fex, Hexon (Solovyev et al. 1994), MZEF (Zhang 1997), Genemark (Borodovsky 1993), Genefinder (http://menu.hgmp.mrc.ac.uk/Nix/Help/genefind_washhelp.html), Fgene (Solovyev et al. 1994), GENSCAN (Burge and Karlin 1997), HMMGene (Krogh 1997), BLAST (Altschul et al. 1994) (against many databases), Polyah (Salamov and Solovyev 1997), RepeatMasker (http://ftp.genome.washington.edu/cgi-bin/RepeatMasker), and tRNAscan (Lowe and Eddy 1997).
ORFinder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) was used to identify ORFs in query sequences, ClustalX (Thompson et al. 1994) to perform multiple sequence alignments, and ScanProsite (http://www.expasy.ch/tools/scnpsit1.html) and ProfileScan (http://www.isrec.isb-sib.ch/software/PFSCAN_form.html) to identify protein domains in PFAM (Bateman et al. 1999) and PROSITE (Hofmann et al. 1999). To determine the genomic structure of genes where cDNA sequence was available, sim4 was used to align the cDNA to genomic sequence (Florea et al. 1998).
The mouse and human genomic sequences were compared using PipMaker (Schwartz et al. 2000; http://bio.cse.psu.edu/). PipMaker aligns two sequences using a program called BLASTZ, which is a new implementation of the gapped BLAST program (Altschul et al. 1997) that was designed specifically for determining local alignments of two long DNA sequences. Gap-free segments of these alignments are displayed in a percent identity plot (PIP). The plot graphs gap-free segments according to their position in the query sequence on a percent identity scale from 50% to 100% along the length of the chosen sequence. The light horizontal line through the middle of the plot indicates 75% nucleotide identity.
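To illustrate the quantity a PIP displays, the following minimal Python sketch splits a gapped pairwise alignment into gap-free segments and computes the percent identity of each. It is not BLASTZ or PipMaker code; the function name `gap_free_segments`, the toy alignment strings, and the coordinate convention (0-based, end-exclusive positions in the ungapped first sequence) are assumptions made for this example only.

```python
from typing import List, Tuple

# (start, end, percent_identity); start/end are 0-based, end-exclusive
# positions in the ungapped first (query) sequence. Hypothetical convention.
Segment = Tuple[int, int, float]


def gap_free_segments(aligned_a: str, aligned_b: str) -> List[Segment]:
    """Split a gapped pairwise alignment into gap-free runs and score each."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    segments: List[Segment] = []
    pos_a = 0      # current position in the ungapped first sequence
    seg_start = 0  # start of the current gap-free run
    matches = 0    # identical columns in the current run
    length = 0     # total columns in the current run

    for ca, cb in zip(aligned_a, aligned_b):
        if ca == "-" or cb == "-":
            # A gap column closes the current run, if there is one.
            if length:
                segments.append((seg_start, seg_start + length,
                                 100.0 * matches / length))
            matches, length = 0, 0
        else:
            if length == 0:
                seg_start = pos_a
            length += 1
            if ca.upper() == cb.upper():
                matches += 1
        if ca != "-":
            pos_a += 1

    if length:
        segments.append((seg_start, seg_start + length,
                         100.0 * matches / length))
    return segments


if __name__ == "__main__":
    # Toy alignment, invented for illustration.
    for start, end, pid in gap_free_segments("ACGT-ACGTAA", "ACGAAAC--AA"):
        print(f"{start}-{end}: {pid:.1f}% identity")
```

Plotting each segment as a horizontal bar from `start` to `end` at height `pid`, restricted to the 50% to 100% range, would give a rough approximation of the PIP layout described above.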
Analysis of Genomic Sequence Conservation in Annotated Regions of Mouse and Human Genome
Annotated regions of the mouse and human genome were aligned using BLASTZ (Schwartz et al. 1999). From the BLASTZ output, gap-free local pairwise alignments with a percentage identity ≥50% were extracted and categorized into those that overlapped EMBL-annotated CDS (coding sequences), 5' UTRs, and 3' UTRs. The statistical analysis of the results was performed using Minitab (http://www.minitab.com).
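The categorization step can be sketched as a simple interval-overlap test. The Python below is an illustrative assumption, not the procedure actually used: the function names, the feature coordinates, the 0-based end-exclusive interval convention, and the extra "unannotated" bucket are all hypothetical, and a segment overlapping several feature types is simply counted under each.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]        # 0-based, end-exclusive (assumed convention)
Segment = Tuple[int, int, float]  # (start, end, percent_identity)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def categorize(segments: List[Segment],
               features: Dict[str, List[Interval]],
               min_identity: float = 50.0) -> Dict[str, List[Segment]]:
    """Group gap-free segments by the annotated feature types they overlap.

    Segments below min_identity are discarded. Segments that overlap no
    feature go into an "unannotated" bucket (a hypothetical addition).
    """
    result: Dict[str, List[Segment]] = {name: [] for name in features}
    result["unannotated"] = []
    for seg in segments:
        start, end, pid = seg
        if pid < min_identity:
            continue
        hit = False
        for name, intervals in features.items():
            if any(overlaps((start, end), iv) for iv in intervals):
                result[name].append(seg)
                hit = True
        if not hit:
            result["unannotated"].append(seg)
    return result


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    segs = [(0, 4, 75.0), (10, 20, 40.0), (30, 40, 90.0), (50, 60, 60.0)]
    feats = {"CDS": [(2, 8)], "5' UTR": [(0, 2)], "3' UTR": [(35, 45)]}
    for name, hits in categorize(segs, feats).items():
        print(name, hits)
```

The per-category lists of percent identities produced this way are the kind of input that could then be passed to a statistics package for comparison between CDS and UTR regions.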
| |
ACKNOWLEDGMENTS |
|---|
This work was supported by the Medical Research Council (UK), German BMBF (BEO 0311108/0), and the European Commission (BMH4-CT96-0338). G.E.H. was supported in part by grant NIH R01 NS34953 and by funds from the Children's Research Institute. We thank Simon Gregory and Gareth Howell from the Sanger Center for their help with fingerprinting and Duncan Campbell for helpful discussions.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
7 These authors contributed equally to this work.
8 Joint senior authors.
9 Corresponding author.
E-MAIL s.brown{at}har.mrc.ac.uk; FAX 44 1 235 824542.
| |
REFERENCES |
|---|
|
|
|---|