|
|
|
|
Genome Res. 15:1584-1591, 2005 ©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00 Methods Survey of allelic expression using EST miningMcGill University and Genome Quebec Innovation Centre, Montreal, Quebec H3A 1A, Canada
Cis-acting allelic variation in gene regulation is a source of phenotypic variation. Consequently, recent studies have experimentally screened human genes in an attempt to initiate a catalog of genes possessing cis-acting variants. In this study, we use human EST data in dbEST as the source of allelic expression data, and the HapMap database to provide expected allele frequencies in human populations. We demonstrate a greater concordance of allele frequencies estimated from human ESTs in dbEST with those derived from the CEPH HapMap sample representing Caucasians from northern and western Europe, than population samples obtained in Asia and Africa. Deviations between allele frequencies observed in EST databases and the ones obtained from the CEPH HapMap samples may result from common heritable cis-acting variants altering the relative allele distribution in RNA. We provide in silico as well as experimental evidence that this strategy does allow significant enrichment of genes harboring common heritable cis-acting polymorphisms in linkage disequilibrium with expressed alleles.
Progress in human disease gene discovery and characterization of global expression patterns in unrelated individuals suggest that genetic variation affecting gene expression has significant effects on phenotypic variability, including risk to disease. Cis-acting variation that regulates expression leads to preferential expression of an allelic transcript. This can be detected directly by in vivo comparisons of the relative abundancies of allelic transcripts using intragenic polymorphisms (Pastinen and Hudson 2004
Increased attention to regulatory polymorphisms underlying human traits (for review, see Knight 2005
The largest public source of sequence variation in human transcripts that exists is in dbEST and contains 4,636,789 sequence traces (build #173) in UniGene clusters. Along with EST-based gene identification (Adams et al. 1992
If a common cis-acting regulatory variant (or regulatory SNP, rSNP) is in linkage disequilibrium (LD) with a marker SNP found in redundant ESTs, the marker allele corresponding to the underexpressed allele should be underrepresented in EST allele counts. In order to detect such occurrences, the expected EST allele distribution needs to be estimated. Figure 1 illustrates an in silico approach to detecting deviations between the observed versus expected allele counts in EST data sets.
The major caveats in this approach relate to the quality of EST sequencing, estimating allele frequencies in EST data sets when the ethnic origin of the donor is unknown, and biased sampling. To handle these issues, we made several assumptions. The impact of errors in EST sequencing is alleviated by focusing on SNPs that have been validated independently: We consider the demonstration of Mendelian transmission as an ultimate proof that an SNP is valid, and only SNPs showing evidence of transmission in Caucasian HapMap trios were further studied. Because each cDNA library can maximally contribute two alleles in the EST allele counts, we randomly selected a maximum of two sequences from each library. This reduces the sampling bias attributed to unequal sequencing density of individual libraries. The most difficult issue is the lack of information about the origin of the donor. Based on the location of the largest EST sequencing projects, we assume that Caucasian alleles are overrepresented in the EST libraries contained in dbEST. This hypothesis was tested by comparing the observed EST allele counts versus expected allele frequencies (population allele frequencies) in four populations across >2500 loci (Fig. 2). The Caucasian allele frequencies fit relatively well with the allele frequencies observed in ESTs, whereas the allele frequencies in other populations generally deviate considerably. We therefore used only Caucasian allele frequencies to estimate the expected allele frequencies in further steps. We cannot correct for the possibility that some genes may be preferentially detected in EST libraries derived from few non-Caucasian donors, nor can we adjust for multiple libraries having been derived from the same individual. The normalization (Soares et al. 1994
Table 1 lists the top 117 SNPs (nominal p < 0.001) showing the greatest deviation between CEPH allele frequencies derived from HapMap (http://www.HapMap.org release#13; The International HapMap Consortium 2003
In order to detect cis-acting differences using the EST data, the marker SNP has to be in LD with the unknown regulatory SNP. Most of the known or suggested variants affecting human gene regulation map to the 5'-end of transcripts (Rockman and Wray 2002 For experimental validation of the predicted differences between expressed alleles, we chose to use direct sequencing, which allows normalization of polymorphic bases relative to the surrounding invariant bases and other quality measures that are not possible with most genotyping methods. To facilitate the analysis of allelic expression from these sequence traces, we developed the software PeakPicker, as illustrated in Figure 4. Genomic DNA and RNA (cDNA) samples are sequenced in parallel, with heterozygous genomic DNA samples serving to establish the expected range of 50:50 allelic ratios. We evaluated allelic imbalance (AI) using a sample-specific test (called AI95) and a locus-specific test (called AILS). For the AI95 test, we estimated a 95% confidence interval (CI) for equal expression, based on the relative peak heights observed in heterozygous genomic DNA samples. The 95% CI threshold corresponds, on average, to a <1.2-fold difference between transcript abundance (or a 45:55 allelic ratio), as determined by dilution experiments (see Supplemental Fig. 1). Unequal expression of allelic transcripts in a heterozygous sample (i.e., AI) is called when we observe consistent deviations beyond the 95% CI in independent RNA preparations derived from the same LCL. If only one sample falls outside the 95% CI or if the replicate samples give conflicting deviations (suggesting unequal expression of opposite alleles), the allelic expression status of the sample is categorized as undefined. The determination of allelic expression status in individual samples using the AI95 test is useful if further mapping studies are pursued. Locus-specific allelic imbalance is defined using the AILS test, which is a two-tailed t-test comparing the average peak ratios observed in genomic DNA and RNA (cDNA). The advantages of the AILS test is that all data points (including discordant results that were discarded by the AI95 test) are used; the test appears to offer a small improvement in sensitivity for allelic expression differences, but it does not specify the AI status of individual samples.
From the set of 976 genes having in silico allelic imbalance (P < 0.05) (http://genomequebec.mcgill.ca/EST-HapMap), we chose 40 genes that are expressed in lymphoblastoid cell lines (LCLs), based on earlier DNA-microarray studies (Pastinen and Hudson 2004 2 = 9.0, 1df, empirical P < 0.005) enrichment of genes with LCLs having allelic imbalance, as predicted by the EST-genotype comparison: 14 of 39 genes (36%) predominantly overexpress the predicted allele (with >80% of heterozygous samples called showing the predicted allele being overexpressed), and only two (5%) showed predominant deviation toward the opposite allele. Similar results were observed by the locus-specific AILS method to determine allelic imbalance (see Table 2, two rightmost columns and footnotes h and i), with 15 genes showing statistically significant overexpression of the predicted allele, whereas two genes showed significant overexpression of the opposite allele; again, the distribution of the results deviates significantly from chance ( 2 = 9.9, 1df, empirical P < 0.005). The same genes were generally identified with both the AI95 and AILS methods of analysis. Overall, using the AI95 test, the predicted allele was overexpressed in 242 informative cases, the alleles were equally expressed or labeled undefined in 348 cases, and the opposite allele showed evidence of overexpression in 105 cases. This distribution deviates highly significantly ( 2 = 54.1, 1df, P < 0.0001) from chance. We note that the thresholds to call allelic expression using the AI95 test can influence the total number of allelic imbalances called. For genes that are expressed at widely different expression levels (and show more variability in allele ratios in cDNA samples), this may inflate the total number of RNA samples demonstrating putative allelic expression. The potential for false-positive assignments of allelic imbalance due to too non-stringent threshold definition would be expected to work as a random confounder and should not bias the assessment of the direction of relative expression differences between allelic transcripts. Similarly, allelic expression is quantitatively deviated toward the predicted allele across the tested 39 loci: The average predicted high allele versus predicted low allele ratio (HDNA/LDNA ratio) was 0.96 (95% CI: 0.911.02) in control heterozygous genomic DNA samples, while in RNA the average HRNA/LRNA ratio was 1.21 (95% CI: 1.121.30). The difference between distribution of allele ratios in genomic DNA and RNA (cDNA) is highly significant (P = 0.0000011, two-tailed t-test).
By combining the dbEST and International HapMap data, we generated a list enriched for candidate genes exhibiting allelic expression differences. These candidates differ from those derived from experimental surveys (Yan et al. 2002
The candidate gene list provided here offers a relatively straightforward approach for discovering regulatory variants in other tissues as well. We also note that the data used in this study are based on the HapMap release of November 2004, which may represent only 15%20% of the final data to be included in the International HapMap project (The International HapMap Consortium 2003
Samples and RNA preparation for allelic expression assays A description of the lymphoblastoid cell lines used in this study is listed in Supplemental Table 1. The RNA from cultured cells used in allelic expression assays consisted of 60 unrelated lymphoblastoid cell lines (LCLs) of Caucasian origin, corresponding to the parents of the CEPH trios used in the International HapMap project (The International HapMap Consortium 2003
cDNA synthesis
Allelic expression analysis by high-sensitivity sequencing PCR and sequencing-primer designs avoided known SNPs underlying primer sequences, and applied same designs for RNA and DNA samples unless an exonic SNP was located close to the exonintron boundary. All primer designs are available at http://genomequebec.mcgill.ca/EST-HapMap. All RNA samples were amplified in duplicate from independent cDNA preparations, and a subset (410/SNP) of informative gDNA heterozygotes were amplified in identical conditions to establish the expected 50:50 heterozygote profiles. Sequencing was carried out using 0.5 µL of BigDye Terminator (BDT) v3.1 Cycle Sequencing Kit (Applied Biosystems) with 2 µL of a PCR product, 1.75 µL of 5x sequencing buffer (0.4 M Tris-HCl at pH 9.0, 10 mM MgCl2), and 10 pmol of the sequencing primer in a 10-µL reaction in conditions suggested by the manufacturer. Reaction products were separated on an Applied Biosystems 3730XL DNA analyzer. The sequence traces were analyzed using in-house software (see below). The normalized heterozygote ratios of genomic DNA samples were used to establish a 95% confidence interval (95% CI) for each SNP. If both heterozygote ratios in independent RNA samples showed convergent deviation beyond 95% CI derived from genomic DNA data, the sample was called to have allelic imbalance. If one of the two RNA replicates was within the 95% CI or if the replicates deviated to opposite directions, the sample was defined as "unknown."
Databases and resources Genomic SNP allele frequencies in four populations were calculated from HapMap data. Unrelated individuals were used for frequency calculation. We used 60 independent individual data from the CEPH Europe population (CEU), 45 from Han Chinese in Beijing (HCB), 44 from Japanese in Tokyo (JPT), and 60 from Yoruba in Ibadan (YRI). The criteria for SNP selection for the comparison were: A SNP should be transmitted at least once in 30 trios of CEPH families and should be located inside the RefSeq gene exon region, based on the UCSC RefSeq gene annotation database. For four different population frequency comparisons, we used a subset of SNPs that were common to all four populations.
We used EST sequence data to estimate SNP allele frequencies in RNA samples. EST sequences from UniGene were first mapped to the reference genome sequence by using BLAT. To avoid mapping errors, the EST sequences were required to be at least 70% identical to genomic sequence in the matched region and the EST were expected to map to the same UniGene cluster region. Approximately 3,856,000 sequences satisfied these two requirements and then were used in this study. Based on EST sequence alignment, which involved base-to-base matching to reference genome sequence, and SNP location on the genome, we counted both SNP alleles on the EST sequence. To limit the bias introduced by unequal representation of EST libraries in EST sequence data, a maximum of two sequences were counted from one library. Two-sided Fisher's exact tests (implemented in R) were performed on 11822 SNPs to compare the allele frequencies between CEPH-derived genomic and EST-derived RNA samples. The same SNP was included under two UniGene entries for
Software for normalized peak height analysis The input files for PeakPicker are the raw sequence files from ABI sequencers (*.AB1). Multiple alignments of the selected sequences are carried out with a user-defined sequence range using Needleman-Wunsch for global alignment. After identifying the marker SNP in a sequence and identifying bases to which its height should be compared (reference peaks), PeakPicker identifies and analyzes the SNP and reference peaks in all sequences. Because peak heights vary depending on sample, base type, and their position within the chromatogram, a normalization step is performed. A text file with SNP allele ratios normalized on the reference peaks is generated as an output. Suspicious values are flagged to allow manual review of the peak selection. PeakPicker is written in Java language and can be used in many platforms such as Windows and Linux systems. "PeakPicker" is available at http://genomequebec.mcgill.ca/EST-HapMap. Sequence traces for the assays described in Table 2 are available upon request.
Statistical analysis of validation data
T.J.H. is the recipient of a Clinician-Scientist Award in Translational Research by the Burroughs Wellcome Fund and an Investigator Award from CIHR. This work is supported by Genome Canada and Genome Quebec.
[Supplemental material is available online at www.genome.org.] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.4023805. Freely available online through the Genome Research Immediate Open Access option.
1 Corresponding author.
Adams, M.D., Dubnick, M., Kerlavage, A.R., Moreno, R., Kelley, J.M., Utterback, T.R., Nagle, J.W., Fields, C., and Venter, J.C. 1992. Sequence identification of 2,375 human brain genes. Nature 355: 632634.[CrossRef][Medline]
Bonaldo, M.F., Lennon, G., and Soares, M.B. 1996. Normalization and subtraction: Two approaches to facilitate gene discovery. Genome Res. 6: 791806. Bray, N.J., Buckland, P.R., Owen, M.J., and O'Donovan, M.C. 2003. Cis-acting variation in the expression of a high proportion of genes in human brain. Hum. Genet. 113: 149153.[Medline] Buetow, K.H., Edmonson, M.N., and Cassidy, A.B. 1999. Reliable identification of large numbers of candidate SNPs from public EST data. Nat. Genet. 21: 323325.[CrossRef][Medline] Cowles, C.R., Hirschhorn, J.N., Altshuler, D., and Lander, E.S. 2002. Detection of regulatory variation in mouse genes. Nat. Genet. 32: 432437.[CrossRef][Medline] Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J., and Lander, E.S. 2001. High-resolution haplotype structure in the human genome. Nat. Genet. 29: 229232.[CrossRef][Medline]
Field, L.L., Bonnevie-Nielsen, V., Pociot, F., Lu, S., Nielsen, T.B., and Beck-Nielsen, H. 2005. OAS1 splice site polymorphism controlling antiviral enzyme activity influences susceptibility to type 1 diabetes. Diabetes 54: 15881591. The International HapMap Consortium. 2003. The International HapMap Project. Nature 426: 789796.[CrossRef][Medline] Irizarry, K., Kustanovich, V., Li, C., Brown, N., Nelson, S., Wong, W., and Lee, C.J. 2000. Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nat. Genet. 26: 233236.[CrossRef][Medline]
Kent, W.J. 2002. BLATThe BLAST-like alignment tool. Genome Res. 12: 656664. Knight, J.C. 2005. Regulatory polymorphisms underlying complex disease traits. J. Mol. Med. 83: 97109.[CrossRef][Medline]
Lo, H.S., Wang, Z., Hu, Y., Yang, H.H., Gere, S., Buetow, K.H., and Lee, M.P. 2003. Allelic variation in gene expression is common in the human genome. Genome Res. 13: 18551862. Marth, G.T., Korf, I., Yandell, M.D., Yeh, R.T., Gu, Z., Zakeri, H., Stitziel, N.O., Hillier, L., Kwok, P.Y., and Gish, W.R. 1999. A general approach to single-nucleotide polymorphism discovery. Nat. Genet. 23: 452456.[CrossRef][Medline]
Mathe, C., Sagot, M.F., Schiex, T., and Rouze, P. 2002. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 30: 41034117.
Pastinen, T. and Hudson, T.J. 2004. Cis-acting regulatory variation in the human genome. Science 306: 647650.
Pastinen, T., Sladek, R., Gurd, S., Sammak, A., Ge, B., Lepage, P., Lavergne, K., Villeneuve, A., Gaudin, T., Brandstrom, H., et al. 2004. A survey of genetic and epigenetic variation affecting human gene expression. Physiol. Genomics 16: 184193. Reich, D.E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P.C., Richter, D.J., Lavery, T., Kouyoumjian, R., Farhadian, S.F., Ward, R., et al. 2001. Linkage disequilibrium in the human genome. Nature 411: 199204.[CrossRef][Medline]
Rockman, M.V. and Wray, G.A. 2002. Abundant raw material for cis-regulatory evolution in humans. Mol. Biol. Evol. 19: 19912004. Sato, K., Emi, M., Ezura, Y., Fujita, Y., Takada, D., Ishigami, T., Umemura, S., Xin, Y., Wu, L.L., Larrinaga-Shum, S., et al. 2004. Soluble epoxide hydrolase variant (Glu287Arg) modifies plasma total cholesterol and triglyceride phenotype in familial hypercholesterolemia: Intrafamilial association study in an eight-generation hyperlipidemic kindred. J. Hum. Genet. 49: 2934.[CrossRef][Medline]
Soares, M.B., Bonaldo, M.F., Jelene, P., Su, L., Lawton, L., and Efstratiadis, A. 1994. Construction and characterization of a normalized cDNA library. Proc. Natl. Acad. Sci. 91: 92289232. Xie, X., Lu, J., Kulbokas, E.J., Golub, T.R., Mootha, V., Lindblad-Toh, K., Lander, E.S., and Kellis, M. 2005. Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 434: 338345.[CrossRef][Medline] Yamamoto, N., Nakayama, J., Yamakawa-Kobayashi, K., Hamaguchi, H., Miyazaki, R., and Arinami, T. 2002. Identification of 33 polymorphisms in the adipocyte-derived leucine aminopeptidase (ALAP) gene and possible association with hypertension. Hum. Mutat. 19: 251257.[CrossRef][Medline]
Yan, H., Yuan, W., Velculescu, V.E., Vogelstein, B., and Kinzler, K.W. 2002. Allelic variation in human gene expression. Science 297: 1143.
Yang, H.H., Hu, Y., Edmonson, M., Buetow, K., and Lee, M.P. 2003. Computation method to identify differential allelic gene expression and novel imprinted genes. Bioinformatics 19: 952955.
http://davidmlane.com/hyperstat/chi_square.html; Web-based statistical tools for determination of http://genomequebec.mcgill.ca/EST-HapMap; Additional information relating to current manuscript. http://genome.ucsc.edu; Human Genome Browser. http://www.HapMap.org; Web site for description of and data from the International Haplotype Map Consortium.
Received April 9, 2005; accepted in revised format July 12, 2005. This article has been cited by other articles:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||