Genome Res. 16:678-685, 2006
©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00
OPEN ACCESS ARTICLE
Resource
Iterative gene prediction and pseudogene removal improves genome annotation
Laboratory for Computational Genomics, Department of Computer Science, Washington University, Saint Louis, Missouri 63130, USA
Correct gene prediction is impaired by the presence of processed pseudogenes: nonfunctional, intronless copies of real genes found elsewhere in the genome. Gene prediction programs frequently mistake processed pseudogenes for real genes or exons, leading to biologically irrelevant gene predictions. While methods exist to identify processed pseudogenes in genomes, no attempt has been made to integrate pseudogene removal with gene prediction, or even to provide a freestanding tool that identifies such erroneous gene predictions. We have created PPFINDER (for Processed Pseudogene finder), a program that integrates several methods of processed pseudogene finding in mammalian gene annotations. We used PPFINDER to remove pseudogenes from N-SCAN gene predictions, and show that gene prediction improves substantially when gene prediction and pseudogene masking are interleaved. In addition, we used PPFINDER with gene predictions as a parent database, eliminating the need for libraries of known genes. This allows us to run the gene prediction/PPFINDER procedure on newly sequenced genomes for which few genes are known.
With the sequencing of more and more genomes, the need for accurate gene prediction is greater than ever. One of the key hurdles in mammalian genome annotation is the presence of large numbers of pseudogenes: copies of real genes that have lost their ability to encode a functional protein product (Zhang and Gerstein 2004).
There are two classes of pseudogenes: nonprocessed and processed. Nonprocessed pseudogenes arise through segmental duplication, and hence, they typically retain at least part of the exon-intron structure of the parent gene. Processed pseudogenes arise through retrotransposition of a spliced mRNA and therefore do not contain introns (Vanin 1985).
Estimates of the total number of processed pseudogenes in the human genome vary (see, e.g., Zhang et al. 2003).
Currently, none of these pseudogene detection methods is available as a standalone tool that can be used to screen genomes or gene sets. Furthermore, they have been optimized for finding as many pseudogenes as possible, rather than the younger pseudogenes that typically get incorporated into models of functional genes. We have created PPFINDER, a standalone tool that can be used to identify processed pseudogenes that have been incorporated into gene models in any mammalian genome annotation. PPFINDER is optimized for this purpose rather than for finding all processed pseudogenes in a genome. In this article, we show that it can be used to improve gene models by iteratively masking pseudogenes incorporated in models and rerunning a gene predictor until no more pseudogenes are found.
PPFINDER identifies processed (but not nonprocessed) pseudogenes by combining two homology-based approaches that are similar to previously described methods (Ohshima et al. 2003).
Description of PPFINDER
PPFINDER uses two different methods of finding pseudogenes: the intron location method and the conserved synteny method. Both methods start with a gene model and try to find a parent gene from which it was derived by retroposition. If a parent gene is found, the homologous segment in the gene model is marked as a potential pseudogene fragment and put into a filtering procedure. This procedure aligns the parent to the potential pseudogene and discards those cases where the alignment contains an intron.
We developed and tested PPFINDER by using the human genome, NCBI build 35. Unless otherwise indicated, all mentions of genome and annotation sets refer to this build. Parameters were optimized by using the pseudogenes annotated on chromosome 7 (Hillier et al. 2003).
Intron location method
Segments of the predicted gene that are not pseudogene-derived will have hits only with themselves and their family members. For pseudogene-derived gene segments, the best hit will be with their parent. To distinguish between family members and parent genes, the predicted gene is aligned to the genomic region of its best hits. Such alignments will introduce large gaps in the predicted gene, corresponding to introns. If the locations of these gaps do not correspond to the intron locations in the gene model, the part of the gene model that aligns to the parent is considered a potential pseudogene (Fig. 1B). When intron locations are not conserved among functional family members, this procedure yields false-positive pseudogenes, most of which are filtered out in the filtering step (see below). A limitation of this method is that if the pseudogenic segment of a gene model aligns to a single exon of its parent gene, it will not be identified. In addition, this method will not identify pseudogenes with a single-exon parent gene, such as the olfactory receptors. (For a detailed discussion of how PPFINDER deals with the olfactory receptors, see the Supplemental data.)
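The comparison of alignment gaps with model introns can be illustrated with a small sketch. This is not PPFINDER's code: the function names, the use of transcript coordinates, and the 3-base tolerance are assumptions made only for illustration.

```python
def intron_positions(exon_lengths):
    """Return intron positions in transcript (spliced) coordinates.

    An intron sits at each internal exon boundary, i.e. at the cumulative
    length of the exons that precede it.
    """
    positions, total = [], 0
    for length in exon_lengths[:-1]:
        total += length
        positions.append(total)
    return positions


def gaps_match_introns(gap_positions, model_introns, tolerance=3):
    """True if every large alignment gap lies near an intron of the gene model.

    The tolerance (in bases) is a hypothetical parameter, not a PPFINDER value.
    """
    return all(
        any(abs(gap - intron) <= tolerance for intron in model_introns)
        for gap in gap_positions
    )


def is_potential_pseudogene(gap_positions, model_exon_lengths, tolerance=3):
    """Flag a segment whose best hit has introns that the gene model lacks."""
    if not gap_positions:
        # No large gaps: the segment aligns within a single exon of the hit,
        # the case the text says this method cannot identify.
        return False
    model_introns = intron_positions(model_exon_lengths)
    return not gaps_match_introns(gap_positions, model_introns, tolerance)
```

For example, a three-exon model whose introns coincide with the gaps is treated as a family member, whereas a single-exon model aligned to a hit with two introns is flagged as a potential pseudogene.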
Conserved synteny method
In general, processed pseudogenes are evolving neutrally (Ophir et al. 1999) ... 5% of processed pseudogenes on human chromosome 22 are preserved on the orthologous region in mouse, although they note that this may not be typical for the complete genome. Most functional genes, on the other hand, are ancestral. Therefore, a mouse/human conserved synteny map can be used to identify pseudogenes in human gene models: If a putative pseudogene in the human genome interrupts a region of conserved synteny, it is likely to have arisen after the mouse-human split and hence to be a real pseudogene (Fig. 3). On the other hand, if it is found in the mouse at a position that corresponds to its position in the human genome, it is likely to be a functional gene.
To determine regions of conserved synteny in human, we made use of the Mouse Net Synteny map from UCSC (Karolchik et al. 2003).
PPFINDER will only look for conservation of a gene if it has a BLASTp hit in the procedure described at the beginning of this section. Although it would be possible to look for conservation of all gene models, doing so would remove all species-specific genes and exons from the annotation set as well as genes for which the conserved syntenic region in the informant is missing from the assembly. A limitation of this method is that it does not identify ancestral pseudogenes, so its sensitivity depends on finding a sufficiently diverged informant genome (see Discussion).
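As a rough illustration of the exon-versus-synteny test, the sketch below reports which exons of a gene model fall outside conserved synteny. The half-open (start, end) coordinates, the function name, and the data layout are assumptions; only the 10-kb minimum block length comes from the text, and PPFINDER's handling of the UCSC net files is not reproduced here.

```python
MIN_BLOCK_LENGTH = 10_000  # 10 kb, the minimum block size given in the text


def exons_outside_synteny(exons, blocks, min_block_length=MIN_BLOCK_LENGTH):
    """Return the exons that are not contained in any usable synteny block.

    exons, blocks: lists of half-open (start, end) intervals on one chromosome
    (an assumed, simplified representation).
    """
    # Ignore blocks shorter than the minimum length.
    usable = [(s, e) for s, e in blocks if e - s >= min_block_length]
    return [
        (start, end)
        for start, end in exons
        if not any(s <= start and end <= e for s, e in usable)
    ]
```

Exons returned by this function would be candidates for pseudogene-derived sequence, subject to the filtering step.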
Filtering
Each of the methods described above finds false positives. In the intron location method, this occurs if gene family members differ in one or more intron locations. In the conserved synteny method, this happens if a predicted gene (1) is a member of a gene family, and (2) has one or more exons that do not fall in regions of conserved synteny, defined as blocks of at least 10 kb that map to a contiguous region in the informant genome. In most cases, alignment of the potential parent gene (which is in fact a family member) to the genomic region of the gene model will contain gaps, corresponding to introns. (The exceptions are single-exon gene models that are mislabeled as pseudogenes by the conserved synteny method.) Alignments of parent genes to pseudogene regions derived from them do not contain intron gaps. We used this to identify false positives.
To make this filtering step effective, it is necessary to distinguish true intron gaps from all other gaps. To allow for smaller gaps in the alignment, potential pseudogenes were considered real if the average length of interruptions (potential introns) was less than twice the average length of aligned segments (potential exons). We found that this cutoff works well for mammalian genomes. However, sometimes large gaps occur in parent-to-pseudogene alignments because repeats were inserted in pseudogenes after their formation. PPFINDER checks whether interruptions in the alignment contain mostly repeat sequence. If >75% of the interruption sequence is interspersed repeat, the pseudogene is considered verified.
The filtering step is very effective at removing false pseudogene candidates. However, it does allow a few false positives whose introns consist primarily of identifiable interspersed repeats. It also allows a few false positives whose putative parent has no introns. The intron location method cannot produce such false positives, but the conserved synteny method can.
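The two filter rules can be written down compactly. The sketch below is one reading of the text, not PPFINDER's implementation: it assumes the repeat fraction is computed over all interruption sequence pooled together, and the function name and input layout are hypothetical.

```python
def passes_filter(aligned_lengths, interruptions, repeat_threshold=0.75):
    """Decide whether a parent-to-candidate alignment looks intronless.

    aligned_lengths: lengths of aligned segments (potential exons).
    interruptions:   (length, repeat_bases) pairs for the gaps between them,
                     where repeat_bases counts interspersed-repeat bases.
    """
    if not interruptions:
        return True  # uninterrupted alignment: no intron-like gaps at all

    mean_aligned = sum(aligned_lengths) / len(aligned_lengths)
    total_gap = sum(length for length, _ in interruptions)
    mean_gap = total_gap / len(interruptions)

    # Rule 1: gaps are short relative to aligned segments (less than twice).
    if mean_gap < 2 * mean_aligned:
        return True

    # Rule 2: long gaps are accepted if they are mostly interspersed repeat,
    # pooled over all interruptions (an assumption about how the 75% applies).
    repeat_bases = sum(repeats for _, repeats in interruptions)
    return repeat_bases / total_gap > repeat_threshold
```

A candidate with one 5000-base interruption between two 100-base aligned segments is rejected unless more than 3750 of those bases are interspersed repeat.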
Finally, by using this filter we forgo the possibility of identifying nonprocessed pseudogenes. Because they often have a genelike structure with apparently normal introns, nonprocessed pseudogenes are difficult to distinguish reliably from functional genes. During the filtering step, PPFINDER keeps track of which genomic nucleotides are covered by a parent-to-pseudogene alignment and outputs their coordinates. This output can be used to remove pseudogene-containing gene models or exons from the input annotation set. It can also be used to mask the pseudogene-derived nucleotides. Although this list can be used to annotate some of the pseudogene-derived nucleotides in a genome, PPFINDER is optimized for finding only those that affect the gene models in the input annotation.
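Masking from a coordinate list is straightforward. The following minimal sketch assumes half-open, zero-based (start, end) intervals and masking with the character N; the text does not specify the actual output format of PPFINDER or the masking convention expected by N-SCAN.

```python
def mask_intervals(sequence, intervals, mask_char="N"):
    """Replace the bases in each half-open interval [start, end) by mask_char."""
    masked = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence so out-of-range coordinates are harmless.
        start, end = max(0, start), min(len(masked), end)
        if start < end:
            masked[start:end] = mask_char * (end - start)
    return "".join(masked)


print(mask_intervals("ACGTACGT", [(2, 5)]))  # ACNNNCGT
```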
Testing PPFINDER
For the intron location method, we used the human RefSeq mRNAs as a parent database (Pruitt et al. 2005).
Of 13,133 CCDS gene annotations, 37 were marked by PPFINDER as processed pseudogenes (0.3%). We manually inspected those hits and found that 23 were single-exon genes that are most likely to be functional retrogenes, because expressed sequence tags (ESTs) are found for each of them. The remaining 14 are genes from small gene families that have differences in their exon-intron boundaries in addition to large ratios of exon length to intron length. These genes are marked by the intron location method as putative pseudogenes and are not removed by the alignment filter because of their relatively small introns.
There are 2006 processed pseudogenes annotated in the Vega pseudogene track at UCSC. These pseudogenes were identified by the HAVANA group (http://www.sanger.ac.uk/HGP/havana/) because they are similar to known genes but contain frameshifts and/or stop codons and lack the exon-intron structures of their parent genes (Dunham et al. 1999).
The intron location method identified 1283 pseudogenes in the Vega set, while the conserved synteny method found 1400 pseudogenes; 1116 pseudogenes were identified by both methods. This shows that the sensitivity of these two methods is similar at this evolutionary distance, but each finds pseudogenes that are missed by the other.
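The overlap between the two methods follows from the counts above by inclusion-exclusion. The short calculation below uses only the numbers reported in the text; the function and key names are illustrative.

```python
def sensitivity_summary(total, found_a, found_b, found_both):
    """Combine two overlapping hit counts by inclusion-exclusion."""
    union = found_a + found_b - found_both
    return {
        "only_a": found_a - found_both,
        "only_b": found_b - found_both,
        "union": union,
        "sens_a": found_a / total,
        "sens_b": found_b / total,
        "sens_union": union / total,
    }


# Vega set: 2006 pseudogenes; intron location 1283; conserved synteny 1400; both 1116.
summary = sensitivity_summary(2006, 1283, 1400, 1116)
print(summary["only_a"], summary["only_b"], summary["union"])  # 167 284 1567
print(f"{summary['sens_a']:.1%} {summary['sens_b']:.1%} {summary['sens_union']:.1%}")
# 64.0% 69.8% 78.1%
```

Under these counts, 167 pseudogenes are found only by the intron location method and 284 only by the conserved synteny method, so the two methods together identify 1567 of the 2006 Vega pseudogenes.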
Effects of iterative pseudogene masking on gene prediction
To test the effects of pseudogene masking on gene prediction, we used N-SCAN (Gross and Brent 2006).
We evaluated the gene predictions by using a gold standard set of annotated genes as described (Flicek et al. 2003).
As expected, a sizeable number of single-exon gene predictions were masked out: 702 of the original gene predictions. Of these, 16 overlapped a single-exon gene in the CCDS set. This suggests that most of the masked single-exon gene models are incorrect. In addition, masking caused 85 nonmasked single-exon gene models to be incorporated into multi-exon genes. N-SCAN also predicted 122 new single-exon genes after iterative masking, of which 10 overlapped a single-exon CCDS gene. In total, the number of single-exon gene predictions decreased by 687.
We also compared the predictions before and after iterative pseudogene masking with the Vega pseudogene set. Before masking, the coding sequences of 749 N-SCAN predictions overlapped 783 annotated pseudogenes. Iterative masking and reprediction reduced these numbers to 101 and 106, respectively. Thus, the fraction of Vega pseudogenes incorporated into N-SCAN predictions was reduced from 39.0% to 5.3%.
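The percentages for the Vega comparison can be checked directly from the reported counts, taking the 2006 processed pseudogenes of the Vega track as the denominator. The names in this short calculation are illustrative.

```python
VEGA_TOTAL = 2006  # processed pseudogenes in the Vega pseudogene track


def incorporated_fraction(overlapped, total=VEGA_TOTAL):
    """Fraction of annotated pseudogenes overlapped by predicted coding sequence."""
    return overlapped / total


before = incorporated_fraction(783)  # before masking
after = incorporated_fraction(106)   # after iterative masking
fold_reduction = 783 / 106

print(f"{before:.1%} -> {after:.1%}")          # 39.0% -> 5.3%
print(f"fold reduction: {fold_reduction:.1f}")  # fold reduction: 7.4
```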
As expected (Zhang et al. 2002
Bootstrapping pseudogene detection from N-SCAN predictions
To illustrate the application of this method to a genome with relatively few known genes, we ran it on the dog genome (Lindblad-Toh et al. 2005).
All N-SCAN predictions and pseudogene-masked regions generated for this article are available in the Supplemental data at http://genes.cse.wustl.edu/vanbaren-06-pseudogene-data/. The UCSC Genome Browser is updated regularly with the most current predictions.
Applying PPFINDER to other sets of gene models
These numbers comprise a substantial part of the gene annotations, and those methods may improve markedly if pseudogenes are removed. A list of these putative pseudogenes can be found at http://genes.cse.wustl.edu/vanbaren-06-pseudogene-data/.
We also ran PPFINDER on the RefSeq human mRNAs (Pruitt et al. 2005).
PPFINDER is an accurate, standalone system for removing processed pseudogenes from any set of gene models. When applied to N-SCAN predictions, it reduces the fraction of annotated pseudogenes incorporated into gene models more than sevenfold, from 39.0% to 5.3%. Its false-positive rate is only 0.3%, as estimated by comparison to the highly accurate CCDS collection of protein-coding gene annotations. This low false-positive rate may be due, in part, to the fact that PPFINDER is optimized for finding only those processed pseudogenes that overlap models of protein-coding genes. If we had designed PPFINDER to find all the processed pseudogenes in the input genome, we would have had to lower our threshold of evidence, thereby admitting more false positives.
Additional pseudogenes can be found by alternately masking pseudogenes in gene models and rerunning a gene prediction program. Using PPFINDER to remove pseudogenes from human genome predictions by N-SCAN, a state-of-the-art de novo gene prediction program, led to significant improvements in accuracy as evaluated by comparison to the CCDS gene models. Furthermore, PPFINDER made the statistical characteristics of the prediction set, including the fraction of genes that consist of a single exon and the average number of exons per gene, more like those of the CCDS gene models. Alternating gene prediction with pseudogene masking led N-SCAN to correctly predict exons it did not find before.
Masking pseudogenes and rerunning gene prediction improves gene prediction in two ways. First, it may result in a correct gene model that is similar to the original except for the absence of a pseudogene-derived exon (Fig. 5A). Second, it may have a long-distance effect on other parts of the gene model, such as causing it to be split into two correct models (Fig. 5B). Pseudogenic exons in gene models may also change the reading frame, causing real exons on either side of the pseudogenic exon to be omitted (not shown in Fig. 5).
Finally, removing single-exon gene models that are based on pseudogenes in the introns of real genes allows N-SCAN to incorporate exons on both sides of the pseudogene into correct gene models (Fig. 5C). If a single-exon gene is predicted in an intron of a real gene, the real gene must be split in two because the current generation of de novo gene predictors does not predict overlapping transcripts.
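The alternation of prediction and masking amounts to a simple loop that stops when no further pseudogenes are found. The sketch below shows only the control flow: the three callables stand in for a gene predictor, a pseudogene finder, and a masking routine, and both their signatures and the max_rounds safeguard are assumptions rather than the interfaces of N-SCAN or PPFINDER.

```python
def iterate_masking(sequence, predict, find_pseudogenes, mask, max_rounds=10):
    """Alternate gene prediction and pseudogene masking until convergence.

    predict(sequence)                  -> gene models
    find_pseudogenes(models, sequence) -> intervals to mask (empty when done)
    mask(sequence, intervals)          -> masked sequence
    Returns the final models, the masked sequence, and the number of
    masking rounds that were performed.
    """
    models = predict(sequence)
    rounds = 0
    while rounds < max_rounds:
        intervals = find_pseudogenes(models, sequence)
        if not intervals:
            break  # no pseudogenes left in the current gene models
        sequence = mask(sequence, intervals)
        models = predict(sequence)  # repredict on the masked sequence
        rounds += 1
    return models, sequence, rounds
```

Passing the components in as functions keeps the loop independent of any particular predictor, which matches the use of PPFINDER with several different gene sets.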
The number of pseudogenes PPFINDER found in other sets of de novo human gene predictions was similar to what it found in N-SCAN predictions. Interestingly, the gene set produced by the Ensembl annotation pipeline, which uses known transcripts to annotate the genome, also contained a substantial number of putative pseudogenes. Finally, we identified 305 previously unannotated, putative pseudogenes in the RefSeq gene set and found by manual curation that at least some of them are indeed pseudogenes.
One of our key findings is that PPFINDER can be effective even when there are no known genes to serve as potential parents. In that case, it can be run using gene predictions as the potential parents, including the same prediction set targeted for pseudogene removal. We found that using N-SCAN's human genome predictions as the parent database was almost as effective for removing pseudogenes from those predictions as using known human genes. This bootstrapping capability is essential for removing pseudogenes from predictions in species with few known genes. An example is the dog genome, for which only a small number of mRNAs is present in GenBank. We show that the effects of pseudogene masking using the bootstrap method are similar to those seen in the human genome: It results in fewer predicted exons and gene models, and these gene models increase in length. When the dog mRNAs from GenBank were used instead, pseudogene finding appeared to be much less effective.
The effectiveness of the conserved synteny method depends on the divergence between the target and informant genomes. If the informant is too close, a large number of processed pseudogenes will be ancestral and hence undetectable by the conserved synteny method. The intron alignment method will therefore identify many more pseudogenes than the conserved synteny method, indicating that a more distant informant genome would yield greater sensitivity.
On the other hand, if the informant is too distant, some functional genes may not fall in regions of identifiable conserved synteny. This will lead to a high number of false positives, most of which will be caught by the filter. The intron alignment method will not target most of these functional genes, since intron location is conserved over much longer evolutionary distances than is gene order. Therefore, if the intron alignment method identifies far fewer pseudogenes than the conserved synteny method, it may be better to use a more closely related informant. If no closer informant is available, PPFINDER can run the intron alignment method alone. Although the inability to use the conserved synteny method will reduce sensitivity, the data reported here suggest that the reduction will be modest.
In the future, we plan to enhance the filtering step to make PPFINDER even more broadly applicable. Currently, it relies on most introns being substantially longer than most exons. For species with relatively short introns and long exons, such as Caenorhabditis elegans, the filter cannot be used. In addition, some genes are masked out with paralogs despite a large intron-to-exon ratio because their introns consist largely of repeats. The next version of PPFINDER will rely more on splice site models and less on length for distinguishing true introns from interruptions caused by elements inserted into the pseudogene.
Another important task for the future is the development of an "NPPFINDER" for removing nonprocessed pseudogenes from gene models. Currently, no pseudogene finding method can reliably separate gene family members from nonprocessed pseudogenes, in part because the latter often do not have in-frame stop codons (Torrents et al. 2003).
Although there is always more work to be done, PPFINDER can now be used to significantly improve the accuracy of mammalian genome annotations, from well-studied genomes such as those of human and mouse to newly sequenced genomes such as those of dog and cow.
Sequences
All sequences were downloaded from UCSC (ftp://hgdownload.cse.ucsc.edu/goldenPath/). For details, see Supplemental data.
Synteny map and downloads
The Known Genes track and sequences were also downloaded from UCSC in July 2004. A position table was created from the track, and the sequences were formatted for BLASTp. The RefSeq tracks (23,045 clones) were downloaded on March 13, 2005, and used for extracting the RefSeq sequences from the genome. Note that the annotated pseudogenes available in RefSeq (NG_ id numbers) are not part of this set. RefSeqs with obvious errors were removed (see Supplemental data). Gene annotation sets were downloaded from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/database on June 15, 2005. The dog mRNA track was also downloaded from ftp://hgdownload.cse.ucsc.edu/goldenPath/canFam1/database/ in August 2005 and converted to GTF. No sequences were removed.
Validation sets
Selecting putative pseudogenes
The intron location method uses whole-gene models instead of single-exon translations. This means that when a pseudogene exon is incorporated in a gene model, this gene model can have BLAST hits with both the pseudogene parent and the actual gene. Therefore, for every hit, the range of overlap with the gene model was determined. Hits were kept if their score was at least 75% of the highest scoring hit, or if they overlapped a different segment of the gene model than all higher scoring hits and had a percentage identity of at least 75%. Every putative parent gene found in this way was used in the filtering step of PPFINDER.
N-SCAN evaluation
Only coding exons on whole chromosomes were used for evaluation of N-SCAN performance.
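The hit selection rule can be expressed as a short function. This sketch assumes that "a different segment" means no overlap at all with any higher-scoring hit, and that hit ranges are half-open intervals on the gene model; the Hit class and its field names are hypothetical and do not correspond to a BLAST output format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    parent: str      # identifier of the putative parent gene
    score: float     # alignment score
    identity: float  # percentage identity, 0-100
    start: int       # half-open range of the hit on the gene model
    end: int


def overlaps(a, b):
    """True if two hits cover a common part of the gene model."""
    return a.start < b.end and b.start < a.end


def select_hits(hits, score_fraction=0.75, min_identity=75.0):
    """Keep hits that score near the top hit or cover a new model segment."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    if not ranked:
        return []
    top_score = ranked[0].score
    kept = []
    for i, hit in enumerate(ranked):
        higher = ranked[:i]  # all hits ranked above this one
        if hit.score >= score_fraction * top_score:
            kept.append(hit)
        elif hit.identity >= min_identity and not any(overlaps(hit, h) for h in higher):
            kept.append(hit)
    return kept
```

The second branch is what allows a pseudogene parent to be retained even when the hit to the actual gene scores far higher, provided the two hits cover different parts of the model.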
Segmental duplications
We thank LaDeana Hillier for providing the human chromosome 7 pseudogene set that started this work, and Mark Diekhans and Robert Baertsch for helpful discussions. This work was supported by grant HG02278 from the National Human Genome Research Institute to M.R.B.
1 Corresponding author.
E-mail brent{at}cse.wustl.edu. [Supplemental material is available online at www.genome.org. N-SCAN and PPFINDER are open source software and may be obtained from http://genes.cse.wustl.edu/.] Article is online at http://www.genome.org/cgi/doi/10.1101/gr.4766206
Alexandersson M., Cawley S., Pachter L. 2003. SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res. 13: 496-502.
Ashurst J.L., Chen C.K., Gilbert J.G., Jekosch K., Keenan S., Meidl P., Searle S.M., Stalker J., Storey R., Trevanion S., et al. 2005. The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res. 33: D459-D465.
Blanco E., Parra G., Guigo R. 2003. Using geneid to identify genes. In Current protocols in bioinformatics (ed. D.B. Davison), Unit 4.3. John Wiley & Sons Inc., New York.
Burge C. and Karlin S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
Buzdin A.A. 2004. Retroelements and formation of chimeric retrogenes. Cell. Mol. Life Sci. 61: 2046-2059.
Curwen V., Eyras E., Andrews T.D., Clarke L., Mongin E., Searle S.M.J., Clamp M. 2004. The Ensembl automatic gene annotation system. Genome Res. 14: 942-950.
Dunham I., Shimizu N., Roe B.A., Chissoe S., Hunt A.R., Collins J.E., Bruskiewich R., Beare D.M., Clamp M., Smink L.J., et al. 1999. The DNA sequence of human chromosome 22. Nature 402: 489-495.
Flicek P., Keibler E., Hu P., Korf I., Brent M.R. 2003. Leveraging the mouse genome for gene prediction in human: From whole-genome shotgun reads to a global synteny map. Genome Res. 13: 46-54.
Gross S.S. and Brent M.R. 2006. Using multiple alignments to improve gene prediction. J. Comput. Biol. 13: 379-393.
Hillier L.W., Fulton R.S., Fulton L.A., Graves T.A., Pepin K.H., Wagner-McPherson C., Layman D., Maas J., Jaeger S., Walker R., et al. 2003. The DNA sequence of human chromosome 7. Nature 424: 157-164.
Hubbard T., Andrews D., Caccamo M., Cameron G., Chen Y., Clamp M., Clarke L., Coates G., Cox T., Cunningham F., et al. 2005. Ensembl 2005. Nucleic Acids Res. 33: D447-D453.
Karolchik D., Baertsch R., Diekhans M., Furey T.S., Hinrichs A., Lu Y.T., Roskin K.M., Schwartz M., Sugnet C.W., Thomas D.J., et al. 2003. The UCSC Genome Browser Database. Nucleic Acids Res. 31: 51-54.
Kent W.J., Sugnet C.W., Furey T.S., Roskin K.M., Pringle T.H., Zahler A.M., Haussler D. 2002. The human genome browser at UCSC. Genome Res. 12: 996-1006.
Kim N., Shin S., Lee S. 2004. ASmodeler: Gene modeling of alternative splicing from genomic alignment of mRNA, EST and protein sequences. Nucleic Acids Res. 32: W181-W186.
Kimura M. 1968. Evolutionary rate at the molecular level. Nature 217: 624-626.
Korf I., Flicek P., Duan D., Brent M.R. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics 17: S140-S148.
Lindblad-Toh K., Wade C.M., Mikkelsen T.S., Karlsson E.K., Jaffe D.B., Kamal M., Clamp M., Chang J.L., Kulbokas III E.J., Zody M.C., et al. 2005. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438: 803-819.
Ohshima K., Hattori M., Yada T., Gojobori T., Sakaki Y., Okada N. 2003. Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates. Genome Biol. 4: R74.
Ophir R., Itoh T., Graur D., Gojobori T. 1999. A simple method for estimating the intensity of purifying selection in protein-coding genes. Mol. Biol. Evol. 16: 49-53.
Parra G., Agarwal P., Abril J.F., Wiehe T., Fickett J.W., Guigo R. 2003. Comparative gene prediction in human and mouse. Genome Res. 13: 108-117.
Pruitt K.D., Tatusova T., Maglott D.R. 2005. NCBI Reference Sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33: D501-D504.
Schwartz S., Kent W.J., Smit A., Zhang Z., Baertsch R., Hardison R.C., Haussler D., Miller W. 2003. Human-mouse alignments with BLASTZ. Genome Res. 13: 103-107.
Torrents D., Suyama M., Zdobnov E., Bork P. 2003. A genome-wide survey of human pseudogenes. Genome Res. 13: 2559-2567.
Vanin E.F. 1985. Processed pseudogenes: Characteristics and evolution. Annu. Rev. Genet. 19: 253-272.
Zhang Z. and Gerstein M. 2004. Large-scale analysis of pseudogenes in the human genome. Curr. Opin. Genet. Dev. 14: 328-335.
Zhang Z., Harrison P., Gerstein M. 2002. Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome. Genome Res. 12: 1466-1482.
Zhang Z., Harrison P.M., Liu Y., Gerstein M. 2003. Millions of years of evolution preserved: A comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 13: 2541-2558.
Zheng D., Zhang Z., Harrison P.M., Karro J., Carriero N., Gerstein M. 2005. Integrated pseudogene annotation for human chromosome 22: Evidence for transcription. J. Mol. Biol. 349: 27-45.
Received October 28, 2004; accepted in revised format March 13, 2006.