Genome Res. 13:1496-1500, 2003 ©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Methods

Human Disease Genes and Their Cloned Mouse Orthologs: Exploration of the FANTOM2 cDNA Sequence Data Set

1 National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland 20894, USA
2 The Jackson Laboratory, Bar Harbor, Maine 04609, USA
3 Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute, Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
4 Departments of Pediatrics and Medicine, University of California, San Diego School of Medicine, San Diego, California 92093, USA
5 National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892-4472, USA
6 Applied Genomics, Inc., Sunnyvale, California 94085, USA
7 Department of Genetics, Boys Town National Research Hospital, Omaha, Nebraska 68131, USA
8 Graduate School of Medicine, University of Tokyo, Bunkyo-ku, Tokyo 113-0033, Japan
9 Genome Science Laboratory, RIKEN, Hirosawa, Wako, Saitama 351-0198, Japan
The FANTOM2 cDNA sequence data set is an excellent model to demonstrate the power of large-scale cDNA sequencing, with the goal of providing a full-length transcript sequence for each mouse gene. This data set enhances the use of the mouse as a model for human disease. Here we identify mouse cDNA sequences in the FANTOM2 data set for a set of 67 human disease genes that as of May 2002 had no corresponding mouse cDNA annotated in the Mouse Genome Informatics (MGI) database. These 67 human disease genes include genes related to neurological and eye disorders and cancer. We also present a list of the human disease genes and their cloned mouse orthologs found in two public databases, LocusLink and MGI. Allelic variant and gene functional information available in MGI provides additional information relative to these mouse models, whereas computed sequence-based connections at NCBI support facile navigation through multiple genomes.
The mouse has been used as a genetic system for more than 100 years and provides a rich resource of genetic mutations and inbred strains for biomedical research (Beck et al. 2000
A view of the complete set of mouse transcripts, the transcriptome, is emerging as new mouse sequence data from whole genome annotation projects (ENSEMBL, http://www.ensembl.org/
Here we present an analysis of the FANTOM2 cDNA data set (as of May 2002) wherein we looked for novel mouse cDNAs representing mouse orthologs of human disease genes. Using the publicly available human and mouse gene, sequence, and orthology data from the NCBI LocusLink project (http://www.ncbi.nlm.nih.gov/LocusLink/
The goal of this study was to find novel full-length mouse cDNAs in the FANTOM2 cDNA data set that represent orthologs of human disease genes. In addition, we generated a comprehensive list (as of May 2002) of the cloned human disease genes and their annotated mouse orthologs in MGI. Here, we present the set of human disease genes, their orthologs, and the mouse cDNA clones we have identified and discuss the information that is presently available for their study.
Analysis of the FANTOM2 Set: Identifying Orthologs
Examining the annotated gene records at MGI and LocusLink, we found that of the 1022 human disease genes, 921 had a cloned ortholog in MGI and 101 did not. Then, examining only those human disease genes included in the BLAST analysis (N = 993; Supplemental Table 1, available online at www.genome.org), we found that of the set of human disease genes that had no annotated mouse orthologous cDNA as of May 2002 (N = 80), 84% had a probable ortholog in the FANTOM2 cDNA data set (Supplemental Table 3). These span a variety of disease genes, including cancer-related genes such as NUMA1, which is associated with acute promyelocytic leukemia, and SDHD, mutations of which have been linked to hereditary paraganglioma. Additionally, this set includes examples of neurological disease genes such as USH3A, which is associated with the phenotype of Usher syndrome, type 3, and IL1RAPL1, which is associated with type 1 X-linked nonspecific mental retardation. Examples of human diseases affecting the eye in this set include PRPF8, which is a candidate for the autosomal dominant form of retinitis pigmentosa, and NYX, mutations of which have been shown to cause X-linked complete congenital stationary night blindness (CSNB1). The lists of cloned human disease genes and the results of the BLAST analysis for the subset for which we found curated orthologs are presented in Supplemental Table 2. For completeness, the human disease genes for which the FANTOM2 cDNA set did not yield significant BLAST results are listed in Supplemental Table 4 (set with curated orthologs) and Supplemental Table 5 (set with no curated orthologs).
A Second Look: Re-evaluation of the No Ortholog Human Disease Set
We included in our results those proteins that shared sequence identity >65% over the entire length of the alignable region. Because we used only percent identity to identify candidate orthologs, some of the protein accessions listed in Table 1, such as that for KRT1, may be from paralogs. We also looked for mouse models via text queries using the gene name for the human disease gene (e.g., XP_134985 for BBS4 and XP_130099 for NUP214). Protein sequence similarity between BBS4 and NUP214 and the mouse proteins XP_134985 and XP_130099, respectively, was determined by BLAST2. Using both methods, we identified highly related mouse proteins for 37 of the 101 human genes in the No Ortholog Set. Of these, three (AMT, D10S170, CLN6) had no hits in the FANTOM2 data set (Table 1). We found that 23 genes in this No Ortholog Set now have an associated orthologous mouse gene with cDNA sequence data available in MGI. This number will continue to grow as additional data are entered into MGI.
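The identity cutoff described above can be illustrated with a minimal sketch. The `Hit` record, the gene and accession labels, and the identity values below are hypothetical and chosen only for illustration; they are not data from this study, and the sketch does not reproduce how BLink or BLAST2 results were actually processed. As the text notes, a filter based on percent identity alone cannot distinguish orthologs from paralogs.

```python
from dataclasses import dataclass

# Cutoff taken from the text: identity must be strictly greater than 65%.
IDENTITY_THRESHOLD = 65.0


@dataclass(frozen=True)
class Hit:
    """One human-to-mouse protein comparison (hypothetical record layout)."""
    human_gene: str
    mouse_accession: str
    percent_identity: float  # identity over the alignable region, in percent


def candidate_orthologs(hits, threshold=IDENTITY_THRESHOLD):
    """Group mouse accessions by human gene, keeping hits above the threshold."""
    candidates = {}
    for hit in hits:
        if hit.percent_identity > threshold:
            candidates.setdefault(hit.human_gene, []).append(hit.mouse_accession)
    return candidates


# Made-up values for illustration only.
example_hits = [
    Hit("GENE_A", "MOUSE_1", 91.2),
    Hit("GENE_A", "MOUSE_2", 66.0),
    Hit("GENE_B", "MOUSE_3", 65.0),  # exactly 65% does not pass a ">65%" cutoff
    Hit("GENE_C", "MOUSE_4", 40.5),
]

result = candidate_orthologs(example_hits)
print(result)
```

A gene that keeps more than one accession, as GENE_A does here, is the kind of case in which a paralog may have been retained alongside the true ortholog.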
Allelic Variants in Mouse
It is important to note that in this study no emphasis was placed on the phenotypic characterization of alleles to determine their validity in modeling human disease. One could think of alleles as falling into four broad categories: (1) natural variants that were used in initial mapping and characterization of genes, (2) natural or induced mutations that were isolated on the basis of a noticeable phenotype, (3) engineered transgenic animals that result in a neomorphic or hypermorphic mutation, or (4) engineered knockout animals that represent a null or hypomorphic mutation. Mutants from the first category will likely not be very useful as disease models, because they were usually originally identified as isoenzyme variants or simple restriction-fragment-length polymorphisms. One example of these types of alleles is the electrophoretic variants of Gpi1 (DeLorenzo and Ruddle 1969
Functional Analysis of Human Disease Orthologs: Gene Ontology
In this study we examined data for more than 1000 human disease genes and their mouse orthologs. From the FANTOM2 cDNA clone data set, we identified 67 cDNAs representing mouse orthologs to human disease genes, for which no full-length cDNA previously existed at MGI. This information will be useful to mouse geneticists and other researchers investigating the genetic basis of human disease. In this study, we found that 90% (921/1022) of the human disease genes identified in the initial data set were represented at that time in MGI or LocusLink. This underscores the power of cocuration of mouse and human genes between LocusLink and MGI. Of the remaining 10% (101), 80 were represented in the protein BLAST database and analysis at RIKEN. Of these 80 human proteins, 84% (N = 67) shared significant sequence similarity to one or more proteins encoded by cDNAs in the FANTOM2 clone set, thus demonstrating the power of a large-scale sequencing project like the Mouse Gene Encyclopedia project to increase the representation of novel mouse cDNAs in the public databases.
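The counts and percentages reported here can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (1022, 921, 29, 80, 67); the variable names and the rounding to whole percentages are assumptions made for illustration.

```python
# Counts as stated in the text.
total_disease_genes = 1022   # cloned human disease genes in LocusLink (May 2002)
with_mgi_ortholog = 921      # genes with a cloned ortholog annotated in MGI
excluded_from_blast = 29     # gene records excluded from the BLAST analysis
no_ortholog_in_blast = 80    # no-ortholog genes included in the BLAST analysis
fantom2_matches = 67         # of those 80, genes with a probable FANTOM2 ortholog

# Derived quantities.
without_mgi_ortholog = total_disease_genes - with_mgi_ortholog
blast_set_size = total_disease_genes - excluded_from_blast


def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


pct_with_ortholog = percent(with_mgi_ortholog, total_disease_genes)
pct_fantom2 = percent(fantom2_matches, no_ortholog_in_blast)

print(without_mgi_ortholog, blast_set_size, pct_with_ortholog, pct_fantom2)
```

The derived values agree with the figures in the text: 101 genes without an annotated ortholog, a BLAST set of 993 proteins, about 90% of genes with an ortholog, and about 84% of the 80 no-ortholog genes matched in FANTOM2.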
In our re-examination of the set of 101 human disease genes (the No Ortholog Set), we identified related protein sequences (some partial) via BLink. We identified 37 candidate orthologs. This result includes candidate orthologs for three human disease genes that had no hits in the FANTOM2 data set analysis. Additionally, querying MGI, we found that several more highly similar mouse sequences have been characterized since May 2002 as genes with sequences (23 genes) in MGI, thus demonstrating the power of using multiple lines of analysis to mine the wealth of data in the public domain. To complete this analysis of identifying mouse orthologs to these human disease genes and storing these data in the public databases, we are evaluating further evidence of orthology to these human disease genes beyond sequence similarity. We are examining shared synteny data, accessible publicly via MGI's Mammalian Homology and Comparative Maps page (http://www.informatics.jax.org/menus/homology_menu.shtml
The increasing number of orthologous relationships (mRNA and gene models) between mouse sequences and human disease genes, as seen in our re-evaluation of the May data set and requerying of LocusLink in September 2002, is the result of the continued daily curation efforts of the mouse and human research communities, at NCBI, at MGI, and externally. It illustrates that our view of the transcriptome is still highly dynamic and becomes more and more complete as data are integrated, and that re-evaluation continues to result in additional orthology relationships being identified.
The methods used in this study include a few of the possible approaches for increasing the power of the mouse as a model for human disease. Additional resources to explore include, for example, using LocusLink as a gateway to other NCBI resources. One can take advantage of the large body of computational analyses that are available at the nucleotide level, in HomoloGene (http://www.ncbi.nlm.nih.gov/HomoloGene/
Identification of Candidate Genes
We first identified a list of 1022 cloned human disease genes in NCBI's May 03, 2002 release of LocusLink using the query "disease_known AND has_seq". Of the 1022 cloned human disease genes, we excluded 29 genes from the BLAST analysis because they did not encode proteins. These 29 gene records included only either partial mRNA or ESTs at the time of the analysis and are indicated in Supplemental Tables 3 and 4. Therefore, the BLAST set included 993 protein sequences. We downloaded this list from NCBI and created a BLAST-able database at RIKEN of the protein sequences for each human gene. For each human disease protein sequence, we identified those FANTOM2 cDNAs that shared a high sequence similarity from a TBLASTN (Altschul et al. 1997
Looking at each human disease gene manually in LocusLink, we then determined whether a cloned mouse orthologous cDNA had been identified in the MGI curated human-mouse orthology data set (http://www.informatics.jax.org
We thank Takeya Kasukawa for technical assistance on the analysis and Monica McAndrews-Hill, Janan Eppig, and Donna Maglott for stimulating discussions and critical help in editing. The content of this publication does not necessarily reflect the views or policies of the U.S. Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. This project has been supported in part with Federal funds from NIH/NICHD grant HD33745 for the Gene Expression Database, NIH/NHGRI grant HG002273 to the Gene Ontology Project, and NIH/NHGRI grant P40 HG00330 for the Mouse Genome Database.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.979503.
11 Corresponding author.
10 Yoshihide Hayashizaki, Takahiro Arakawa, Piero Carninci, and Jun Kawai. [Supplemental material is available online at www.genome.org.]
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Beck, J.A., Lloyd, S., Hafezparast, M., Lennon-Pierce, M., Eppig, J.T., Festing, M.F.W., and Fisher, E.M.C. 2000. Genealogies of mouse inbred strains. Nat. Genet. 24: 23-25.
Bu, L., Yan, S., Jin, M., Jin, Y., Yu, C., Xiao, S., Xie, Q., Hu, L., Xie, Y., Solitang, Y., et al. 2002. The
Charles, D.J. and Lee, C.Y. 1980. Biochemical and immunological characterization of genetic variants of phosphoglucose isomerase from mouse. Biochem. Genet. 18: 153-169.
DeLorenzo, R.J. and Ruddle, F.H. 1969. Genetic control of two electrophoretic variants of glucosephosphate isomerase in the mouse (Mus musculus). Biochem. Genet. 3: 151-162.
Denny, P. and Justice, M.J. 2000. Mouse as the measure of man? Trends Genet. 16: 283-287.
The FANTOM Consortium and The RIKEN Genome Exploration Research Group Phase I and II Team. 2002. Analysis of the mouse transcriptome based upon functional annotation of 60,770 full length cDNAs. Nature 420: 563-573.
The Gene Ontology Consortium. 2001. Creating the gene ontology resource: Design and implementation. Genome Res. 11: 1425-1433.
Hill, D.P., Davis, A.P., Richardson, J.E., Corradi, J.P., Ringwald, M., Eppig, J.T., and Blake, J.A. 2001. Biological annotation of mammalian systems: Implementing gene ontologies in mouse genome informatics. Genomics 74: 121-128.
Kawai, J., Shinagawa, A., Shibata, K., Yoshino, M., Itoh, M., Ishii, Y., Arakawa, T., Hara, A., Fukunishi, Y., Konno, H., et al. 2001. Functional annotation of a full-length mouse cDNA collection. Nature 409: 685-690.
Makalowski, W., Zhang, J., and Boguski, M.S. 1996. Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res. 6: 846-857.
Padua, R.A., Bulfield, G., and Peters, J. 1978. Biochemical genetics of a new glucosephosphate isomerase allele (Gpi-1c) from wild mice. Biochem. Genet. 16: 127-143.
Pruitt, K.D. and Maglott, D.R. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29: 137-140.
Rubin, G.M., Yandell, M.D., Wortman, J.R., Gabor Miklos, G.L., Nelson, C.R., Hariharan, I.K., Fortini, M.E., Li, P.W., Apweiler, R., Fleischmann, W., et al. 2000. Comparative genomics of the eukaryotes. Science 287: 2204-2215.
http://ftp.informatics.jax.org/pub/informatics/reports/GDB_Accession.rpt; Human/Mouse Homology data set at MGI.
http://mgc.nci.nih.gov/; Mammalian Gene Collection.
http://www.ensembl.org/; ENSEMBL.
http://www.gsc.riken.go.jp/e/FANTOM/; RIKEN's Mouse Encyclopedia.
http://www.informatics.jax.org/; The Mouse Genome Informatics (MGI) Database.
http://www.informatics.jax.org/menus/homology_menu.shtml; MGI's Mammalian Homology and Comparative Maps.
http://www.ncbi.nih.gov/cgi-bin/Entrez/blink?pid=4557225&all=1; BLink links on LocusLink page to precomputed protein neighbors.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM/; Online Mendelian Inheritance in Man (OMIM) database.
http://www.ncbi.nlm.nih.gov/HomoloGene/; NCBI's HomoloGene Home page.
http://www.ncbi.nlm.nih.gov/Homology; NCBI's Human-Mouse Homology Map.
http://www.ncbi.nlm.nih.gov/htbin-post/Omim/getmap?chromosome=CYP1&start=-2; OMIM's Gene Map.
http://www.ncbi.nlm.nih.gov/htbin-post/Omim/getmorbid; OMIM's Morbid Map.
http://www.ncbi.nlm.nih.gov/LocusLink/; NCBI's LocusLink Home page.
Received November 12, 2002; accepted in revised form January 28, 2003.