Published online before print
June 18, 2002, 10.1101/gr.220102
Vol. 12, Issue 7, 1142-1149, July 2002
RESOURCES
ABSTRACT
We have developed GFScan (Gene Family Scan), a tool that identifies members of a gene family by searching genomic DNA sequences with genomic DNA motifs (or matrices) that are representative of the family. We tested GFScan on four human gene families: the neurotransmitter-gated ion-channels (NGIC) family, the carbonic anhydrases (CA) family, the Dbl homology (DH) domain family, and the ETS-domain family. All known members of these families whose motifs map to sequenced genomic DNA regions were found. Some novel genomic locations also matched the motifs, which may indicate new members of these families. Compared with other methods, GFScan recognized all true positives with far fewer false positives. We also showed that motifs constructed from human genes can be used to search the mouse genome to identify orthologous family members in mouse. This program is available at http://www.cshl.org/mzhanglab/.
[The following individuals and institutions kindly provided reagents, samples or unpublished information as indicated in the paper: J. Maddock and Celera Genomics.]
INTRODUCTION
With the advance of several whole-genome sequencing projects, including human, mouse, and Drosophila, more and more genomic DNA sequences have become available. These projects make it possible to analyze gene families in one species systematically. One well-known strategy for gene family analysis is to first detect all the gene models in a genome with gene prediction methods such as Genscan (Burge and Karlin 1997), Genie (Kulp et al. 1996), or FGENES (Solovyev and Salamov 1997); translate these genes into proteins; and then find gene families at the protein level using similarity searches or protein motif databases such as BLOCKS+ (Henikoff et al. 1999), Pfam (Bateman et al. 1999), ProDom (Corpet et al. 1999), PRINTS (Attwood et al. 1999), PROSITE (Hofmann et al. 1999), and InterPro (http://www.ebi.ac.uk/interpro/). Additionally, mRNAs can be used to find gene family members by BLAST or FASTA searches (Pearson and Lipman 1988; Altschul et al. 1990, 1997). Recently, Henikoff and Henikoff (2000) used protein fragments in the BLOCKS+ database to search the Drosophila genomic sequence using BLAST.
Our method seeks to find all members of a gene family by searching the whole genome with a representative genomic DNA motif of the family. Motif search at the protein level is a reliable way to find protein family members among known proteins. However, protein motifs can only be used to search known proteins, and some proteins remain undiscovered by existing experimental or theoretical methods. TBLASTN, a program in the BLAST package, can align protein sequences with genomic DNA sequence directly to find matching regions that may encode new members of the gene family. However, as shown in the Results and Discussion sections, programs in the BLAST family are general sequence-alignment programs and find many false positives. To circumvent this problem, we developed GFScan (Gene Family Scan), which uses a representative DNA motif of a gene family to search genomic DNA sequence directly and identify new members of the family. The representative genomic DNA motif is constructed from protein motifs in PROSITE (release 16.0, updates up to September 2000) and the genomic structure of known members of the family. As more mRNA and protein sequences are submitted to the public databases, and as each genome becomes more complete, GFScan will become increasingly effective at finding new members of a gene family.
RESULTS
GFScan was developed in C++. To show the usefulness of this program, we applied it to four gene families, searching for new members of each family in the whole human genome (Genome Sequencing Consortium 2001) in GoldenPath (April 2001 freeze and August 2001 freeze; http://genome.ucsc.edu/) and in the mouse genome in the Celera Genomics database.
Neurotransmitter-Gated Ion-Channels (NGIC) Family
The human neurotransmitter-gated ion-channels family is a large family whose members include GABA (gamma-aminobutyric acid) A receptors, glycine receptors, acetylcholine receptors, and the 5-hydroxytryptamine-3 receptor. All members of the family share a common protein motif, called NEUROTR_ION_CHANNEL in the PROSITE database (ID: PS00236). Using the 37 known human genes of this family in the public database and the protein motif in PROSITE, a 45-bp intronless genomic DNA motif was constructed. We also found that one family member, CHRNB1, has an intron in the motif-matching genomic region, and the intron separates the 45-bp motif into two parts. An intron-containing genomic DNA motif was therefore also constructed (see Methods). Both genomic DNA motifs were used to search the whole human genome. Of the 37 known motif regions, 29 were found by GFScan. For the eight missed genes, all the genomic regions corresponding to the motifs fell into gaps in the genome sequence. Moreover, nine additional genomic regions were found. Three of them were duplications of known genes. Among the remaining six novel genomic regions, one is located in a repeat region, and the other five are likely to be previously unidentified members of this gene family. Based on the human genome annotation in GoldenPath (http://genome.ucsc.edu/), these five regions were reported to be similar to mouse glycine receptor subunit 1, rat GABA A receptor subunit 1, rat 3, mouse GABA A receptor subunit 3, and Gallus nicotinic acetylcholine subunit 8, respectively. With the exception of GABA A receptor 3, no mRNA or protein sequence is known for the other four genes (see Table 1).
Carbonic Anhydrases (CA) Family
Human carbonic anhydrases (CA) are zinc metalloenzymes that catalyze the reversible hydration of carbon dioxide. There are 14 known members in the family. From the mRNAs of the known members, we first constructed a 57-bp cDNA motif based on the PROSITE protein motif (ID: PS00162). All of the genomic sequence regions corresponding to this cDNA motif contain one intron. The splice locations of the introns are identical among all members, but the intron lengths differ. We next constructed a genomic DNA motif from the cDNA motif, incorporating information on the intron. By searching the whole human genome with the genomic DNA motif, 12 of the 14 known genes were found; for the two genes that were missed, the motif-matching genomic region fell into gaps in the genome sequence. Moreover, we found two additional genomic regions that match the motif: one was related to a non-CA family gene, PTPRG (protein tyrosine phosphatase, receptor type G), on Chromosome 3; the other was found on Chromosome 8, and its closest homologous gene was the mouse Car13 gene. It is worth noting that the human CA13 gene had not been reported previously, and our finding may shed light on this new member of the family (see Table 2).
Dbl Homology (DH) Domain Family
The Dbl homology (DH) domain is responsible for guanine nucleotide exchange factor (GEF) catalytic activity (Zhu et al. 2001). Eight human genes belong to this family, and some of these genes are oncogenes, including DBL, the Breakpoint Cluster Region (BCR) oncogene, VAV, VAV2, and VAV3. The protein sequences of all eight members share the DH domain (PROSITE ID: PS00741). From their mRNA sequences, a 78-bp cDNA motif was constructed. In the genomic regions corresponding to the motif, no intron was found for one of the family members, TIAM; two introns were found for ABR and BCR; and one intron was found for the remaining five members of the family. Based on this gene structure information, we next constructed three genomic DNA motifs of this domain from the cDNA motif. Searching the whole human genome with the genomic DNA motifs revealed nine genomic regions that significantly match the motifs. Of the nine regions, seven contain known genes. Of the two new locations, one was a duplication of the VAV gene's genomic DNA sequence, and the other overlapped with the motif region of the known VAV2 gene (see Table 3). VAV3 was the only known member of the family that was missed by the search, because the genomic region matching the motif was not available in the April 2001 GoldenPath freeze (it was found in the August 2001 freeze).
ETS-Domain Family
The ETS-domain gene family includes a group of proteins that function as transcription factors under physiologic conditions and, if aberrantly expressed, can cause cellular transformation (Karim et al. 1990). These proteins share a conserved domain, the ETS domain, which is involved in DNA binding. From the mRNAs of the 19 known members and a protein motif in the PROSITE database (ID: PS00346), a 48-bp cDNA motif was constructed. Four of these 19 genes have one intron in their genomic regions matching the motif, and the splice location of the intron is the same in all four. Therefore, we constructed an intron-containing genomic DNA motif and used it together with the cDNA motif to search the human genome. Twenty-six genomic regions were found to match the motifs, including 18 of the 19 known genes. ETV5's genomic DNA motif region was missed because the genomic DNA sequence around the motif-matching region was incomplete. Of the eight additional motif-matching regions, three were duplications of three known genes (i.e., GABP, ETV6, and ERF). The other five were related to unknown genes in human: one was in the FEV gene region, two were similar to mouse Ets-protein Spi-C (GenBank accession no. AF098863), and the last two were located in two genes predicted by Genscan and Ensembl. Both FEV and Spi-C are ETS-domain family members (Bemark et al. 1999). FEV was not listed in the PROSITE database because of a database-updating problem, and human Spi-C has not been found. These new motif-matching regions are likely to provide experimental scientists with useful guidance for identifying new members of the ETS-domain family in the human genome (see Table 4).
Comparison with the BLAST Results
Another common method to search for new members of a gene family is to run BLAST against the whole genome using the sequences of known members as queries. We compared BLAST and GFScan on all four families. We searched the human genome with the protein sequence of each known member of a given family using TBLASTN. We also searched the human genome with the motif region of the mRNA sequence of each known member using BLASTN. The results are listed in Table 5.
Table 5 indicates that GFScan had fewer false positives than TBLASTN (except for the CA family under a low E-value threshold, although the number of TBLASTN false positives increased when the E-value threshold was raised). In the BLASTN search, even with a very high E-value threshold (e.g., E = 10), some known genes were still not found, especially those whose motifs contain introns. For those genes, the match of the motif region to the genomic sequence is rather poor. Very few new genomic regions were found in this case. In short, compared with BLAST, GFScan offers both higher sensitivity and higher specificity, especially in intron-containing cases.
Mouse Genome Searching with Two Human DNA Motifs
We searched Celera's mouse genome using the motifs constructed from human genes. For the neurotransmitter-gated ion-channels family, 23 of the 24 known mouse members in the NCBI LocusLink Database (http://www.ncbi.nlm.nih.gov/LocusLink/) were found by GFScan. The one that was missed (NM_017369: 1824-1868) had an incomplete genomic DNA sequence in the database. In addition, 13 new motif-matching genomic locations were found, which may correspond to 13 novel mouse members of this family.
The result was different for the CA family. Of the 13 known mouse CA members in the LocusLink Database, 11 had matching genomic DNA sequences. Using GFScan and the motif constructed from human genes, we could find only five loci. The other six were missed because the motif segments in these mouse genes differ from the motif in human genes (Fig. 1). Three of these six genes (NM_030558, mouse Car15; NM_009802, mouse Car6; NM_007608, mouse Car5a) do not match the human motif even at the protein level. However, two new genomic locations matching the human motif were still found, which may correspond to novel members in mouse.
In summary, GFScan is capable of identifying all the true members of a family with very few false positives and requires no gene prediction. It performs especially well with intron-containing motifs, where most BLAST-based tools may fail. Caution is warranted when using GFScan for cross-species searches, however, as the results may depend on the divergence among members of the family as well as the evolutionary distance between the two species. Performance can be improved further by adding more mRNAs from different species or by modifying a genomic motif to allow species-specific codon usage. GFScan is implemented so that such customizations can be made easily (see Methods for more detail).
DISCUSSION
Same Species versus Cross-Species
As DNA sequences are usually less conserved than protein sequences in evolution, we recommend constructing motifs from known mRNAs in one species and then using the motif to search the genome of the same species. This reduces false positives. For cross-species searches, the method sometimes worked well, as for the neurotransmitter-gated ion-channels family; at other times it missed many true positives, as for the CA family described above. Because the program allows users to reconstruct motifs by adding more mRNAs from other species, it is easy to extend the search to cross-species cases. One could also redefine the motif by relaxing its codon usage when searching related species, or by adding other conserved information to the motif.
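One possible reading of relaxing codon usage is to let the third base of every codon vary freely. The Python sketch below illustrates only that reading; it is not GFScan's implementation. The function name `relax_third_positions` and the example fragments are hypothetical, and the aligned fragments are assumed to start in frame.

```python
def relax_third_positions(fragments):
    """Build a regex from aligned, in-frame DNA fragments of equal length.

    First and second codon positions keep only the bases observed in the
    fragments; every third position accepts any base (an assumed relaxation).
    """
    if not fragments or len({len(f) for f in fragments}) != 1:
        raise ValueError("fragments must be non-empty and of equal length")
    parts = []
    for i, column in enumerate(zip(*fragments)):
        if i % 3 == 2:
            # Third codon position: accept any nucleotide.
            parts.append("[ACGT]")
        else:
            bases = sorted({b.upper() for b in column})
            parts.append(bases[0] if len(bases) == 1 else "[" + "".join(bases) + "]")
    return "".join(parts)


# Made-up fragments, used only to show the shape of the output.
print(relax_third_positions(["ATGGCT", "ATGGCC"]))  # AT[ACGT]GC[ACGT]
```

A pattern relaxed in this way also accepts codons for other amino acids (for example, ATA encodes isoleucine rather than methionine), so translating each hit and testing it against the protein motif, as described in Methods, matters even more.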
Regular Expression Pattern Search and Weight Matrix Search
From the mRNA sequences and protein motifs of the known members of a given gene family, we can construct both a regular expression pattern and a weight matrix for later searching. GFScan can use either of them to search genomic DNA. Based on the constructed matrix, the scores of all known motif regions were calculated. When we chose the minimum score of the known motif regions as the threshold of the matrix search to minimize false positives, we found that the genomic locations scoring above the threshold could all be found by a regular expression pattern search (Table 6). The pattern search also saved considerable CPU time, because searching with regular expressions is roughly 15-20 times faster than searching with matrices. However, because matrix search has higher sensitivity (at the expense of specificity and CPU time), genomic locations missed by a regular expression pattern search may be recovered by a matrix search, especially in cross-species cases.
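The thresholding idea can be illustrated with a small Python sketch. It assumes that a window's score is the sum of the per-site nucleotide frequencies, which the text does not state explicitly; the function names and the toy sequences are hypothetical and are not taken from GFScan.

```python
from collections import Counter


def build_matrix(fragments):
    """Per-site nucleotide frequencies from aligned, equal-length fragments."""
    n = len(fragments)
    return [{base: count / n for base, count in Counter(column).items()}
            for column in zip(*fragments)]


def score(matrix, window):
    """Sum of the frequencies of the observed bases (assumed scoring rule)."""
    return sum(site.get(base, 0.0) for site, base in zip(matrix, window))


def scan(matrix, genome, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(genome) - width + 1):
        s = score(matrix, genome[i:i + width])
        if s >= threshold:
            hits.append((i, s))
    return hits


# Made-up motif fragments standing in for the known motif regions.
known = ["ATGGCT", "ATGGCC", "ATCGCA"]
matrix = build_matrix(known)
# Threshold: the minimum score among the known motif regions.
threshold = min(score(matrix, fragment) for fragment in known)
hits = scan(matrix, "GGATCGCTAA", threshold)
print([position for position, _ in hits])  # [2]
```

With this threshold every known region is recovered by construction, and lowering it trades specificity for sensitivity, which is the trade-off described above.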
Motifs
In the present program, the motif length is taken as a constant; in other words, all the motif regions in the family should have the same length. For those families whose protein motifs have variable lengths, it is difficult to construct the DNA motif, and allowing gaps in the motif can be very CPU-expensive. We will address these issues in future work.
Although GFScan constructs the genomic motif automatically, it also accepts user-defined motifs as its input. This makes GFScan a very flexible tool for gene family analysis at the genomic level. In conjunction with gene prediction tools, it can be used for gene finding and gene structure prediction as well.
METHODS
For a protein or gene family, we collected the protein, mRNA, and genomic DNA sequences of all known members, as well as the PROSITE entry. Using the protein motif in PROSITE, we extracted the protein motif fragments and their corresponding mRNA fragments. Based on the protein motif, these mRNA fragments were aligned, and a consensus pattern was created. Each site in the consensus pattern was determined from all the corresponding sites in the known mRNA sequences. In other words, each site in the protein motif was converted into three sites in the cDNA motif based on all existing codons in known mRNAs. Using SIM4 (Florea et al. 1998) to align mRNAs with genomic DNAs, we found the potential intron position and its length range within the genomic regions that match the motif regions. This intron information was incorporated into the cDNA motif to construct the genomic DNA motif of the family (see Fig. 2). For each genomic DNA motif that contained introns, the motif was divided into several submotifs; the longest submotif was used first to find a potential match location, and the other submotifs were then used to search the sequences around this location (see Fig. 3). Each genomic DNA region matching the motif was translated into a protein sequence, and this protein fragment was tested against the protein motif to identify false-positive results.
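The construction of the consensus pattern can be sketched as follows, assuming the mRNA motif fragments are already aligned and of equal length and are written with the DNA alphabet. The function name `consensus_pattern` and the toy fragments are hypothetical; GFScan's own pattern syntax may differ.

```python
import re


def consensus_pattern(fragments):
    """Regex whose i-th site allows every base observed at site i."""
    if not fragments or len({len(f) for f in fragments}) != 1:
        raise ValueError("fragments must be non-empty and of equal length")
    parts = []
    for column in zip(*fragments):
        bases = sorted({b.upper() for b in column})
        # A single observed base stays literal; several become a character class.
        parts.append(bases[0] if len(bases) == 1 else "[" + "".join(bases) + "]")
    return "".join(parts)


# Made-up fragments; real ones would come from the known family members.
pattern = consensus_pattern(["ATGGCT", "ATGGCC", "ATCGCA"])
print(pattern)  # AT[CG]GC[ACT]

match = re.search(pattern, "GGATCGCTAA")
print(match.start(), match.group())  # 2 ATCGCT
```

Because every site accepts any base seen at that site, the pattern can also match combinations that occur in no single known member, which is one reason the translated hit is checked against the protein motif afterwards.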
The weight matrix can be created while constructing the consensus regular expression pattern. In this algorithm, we simply used the nucleotide occupancy frequencies at each site of the motif as the weights. For an intron-containing motif, we used the same strategy as in the pattern search: the longest submatrix was used first to find a candidate genomic location, and the local region around this location was then searched with the other submatrices.
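The longest-first strategy is shown below for the pattern case; the same control flow would apply to submatrices. This is a minimal sketch under stated assumptions: submotifs are fixed-width regular expressions given in 5' to 3' order, one intron with a length between `min_intron` and `max_intron` lies between consecutive submotifs, and the nearest acceptable neighbor is chosen greedily. All names are hypothetical, and splice-site signals are not checked.

```python
import re


def motif_width(pattern):
    """Number of sites in a pattern made of literals and [..] classes."""
    return len(re.sub(r"\[[^\]]*\]", "N", pattern))


def find_all(pattern, text, lo, hi):
    """Yield start positions p with lo <= p <= hi where pattern matches at p."""
    rx = re.compile(pattern)
    for p in range(max(lo, 0), min(hi, len(text)) + 1):
        if rx.match(text, p):
            yield p


def search_split_motif(genome, submotifs, min_intron, max_intron):
    """Locate the longest submotif first, then its neighbors on both sides."""
    widths = [motif_width(s) for s in submotifs]
    anchor = max(range(len(submotifs)), key=lambda i: widths[i])
    hits = []
    for a in find_all(submotifs[anchor], genome, 0, len(genome) - widths[anchor]):
        starts = {anchor: a}
        # Extend to the right: nearest match after an allowed intron length.
        for i in range(anchor + 1, len(submotifs)):
            prev_end = starts[i - 1] + widths[i - 1]
            nxt = next(find_all(submotifs[i], genome,
                                prev_end + min_intron, prev_end + max_intron), None)
            if nxt is None:
                break
            starts[i] = nxt
        else:
            # Extend to the left: nearest match before an allowed intron length.
            for i in range(anchor - 1, -1, -1):
                right = starts[i + 1]
                found = list(find_all(submotifs[i], genome,
                                      right - max_intron - widths[i],
                                      right - min_intron - widths[i]))
                if not found:
                    break
                starts[i] = found[-1]
            else:
                hits.append([(starts[i], starts[i] + widths[i])
                             for i in range(len(submotifs))])
    return hits


# Toy sequence: exon part, a 13-bp made-up intron, then the second exon part.
genome = "CCATGGCTGTAAGTTTTTCAGGATCC"
print(search_split_motif(genome, ["ATGGC[CT]", "GAT"], 5, 50))
# [[(2, 8), (21, 24)]]
```

Anchoring on the longest submotif keeps the number of candidate locations small, because a short submotif on its own would match by chance far more often.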
We used the protein sequences of all known members to search the human genome with TBLASTN, and we used the motif region of the known members' mRNA sequences to search the human genome with BLASTN. As the exact number of real members in a given gene family is unknown, we regarded a location found by GFScan or BLAST as a false positive if the DNA fragment at that location could not be translated into a protein sequence without a stop codon, or if the translated protein sequence did not match the motif pattern of the gene family. If the location overlapped a gene that is known not to be a member of the gene family, the location was also regarded as a false positive. Locations that do not encode the known proteins listed in the PROSITE entry and are not false positives were regarded as potential candidates. In the TBLASTN search, only genomic DNA regions that matched the protein motif region partially or completely were considered locations of gene family members. Other genomic regions, where the match between genomic DNA sequence and protein sequence lay outside the motif, were not considered. In the BLASTN search, because the query sequences were so short that the significance of matches was low, only genomic DNA match regions that could be aligned completely with the query sequence were regarded as gene member locations, to avoid many short, partial, and random matches. The Expect value (E-value) was used as the threshold to filter the most significant matches in BLAST. In our comparison, we chose different E-values as thresholds in the TBLASTN searches and used the default setting (E-value < 10) in the BLASTN searches. To compare specificity with GFScan meaningfully, we chose the smallest E-value that could find all known gene members as the threshold for TBLASTN, then compared the number of new motif-matching locations with that obtained from GFScan.
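The first two false-positive criteria (a stop codon in the translation, or a translation that fails the protein motif) can be sketched in Python as follows. The sketch assumes the standard genetic code, an in-frame exonic fragment with any intron already removed, and a protein motif written as a Python regular expression rather than in PROSITE syntax. The function names and the toy pattern `M[AG]C` are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate an in-frame DNA fragment; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def is_false_positive(exonic_dna, protein_pattern):
    """True if the fragment has a stop codon or fails the protein motif."""
    protein = translate(exonic_dna)
    if "*" in protein:
        return True
    return re.fullmatch(protein_pattern, protein) is None


print(translate("ATGGCTTGT"))                     # MAC
print(is_false_positive("ATGGCTTGT", "M[AG]C"))   # False
print(is_false_positive("ATGTAATGT", "M.C"))      # True (TAA is a stop codon)
print(is_false_positive("ATGCCTTGT", "M[AG]C"))   # True (P is not allowed)
```

The third criterion, overlap with a gene known not to belong to the family, depends on genome annotation and is not covered by this sketch.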
Availability
The program GFScan is available at http://www.cshl.org/mzhanglab/.
WEB SITE REFERENCES
http://genome.ucsc.edu/; GoldenPath.
http://www.cshl.org/mzhanglab/; GFScan program.
http://www.ebi.ac.uk/interpro/; InterPro.
http://www.ncbi.nlm.nih.gov/LocusLink/; NCBI LocusLink Database.
ACKNOWLEDGMENTS
We thank Theresa Zhang for revising the English text. We also thank Celera Genomics for the mouse genome database. The C++ Boost library was kindly supplied by Dr. John Maddock. This work is supported by grant CA81152 from NIH.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
1 Corresponding author.
E-MAIL mzhang{at}cshl.org; FAX (516) 367-8461.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.220102. Article published online before print in June 2002.
Received October 26, 2001; accepted in revised form April 11, 2002.