Genome Research

Home Help [Feedback] [For Subscribers] [Archive] [Search] [Contents]
 QUICK SEARCH:   [advanced]


     


This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Right arrow Citation Map
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Audic, S.
Right arrow Articles by Claverie, J.-M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Audic, S.
Right arrow Articles by Claverie, J.-M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us   Add to Digg   Add to Reddit   Add to Technorati  
What's this?

Genome Research
Vol. 7, No. 10, pp. 986-995, October 1997

RESEARCH
The Significance of Digital Gene Expression Profiles

Stéphane Audic, and Jean-Michel Claverie1

Laboratory of Structural and Genetic Information, Centre National de la Recherche Scientifique-E.P.91, Marseille 13402, France
    ABSTRACT
Abstract
Introduction
Results
Discussion
Methods
References

Genes differentially expressed in different tissues, during development, or during specific pathologies are of foremost interest to both basic and pharmaceutical research. "Transcript profiles" or "digital Northerns" are generated routinely by partially sequencing thousands of randomly selected clones from relevant cDNA libraries. Differentially expressed genes can then be detected from variations in the counts of their cognate sequence tags. Here we present the first systematic study on the influence of random fluctuations and sampling size on the reliability of this kind of data. We establish a rigorous significance test and demonstrate its use on publicly available transcript profiles. The theory links the threshold of selection of putatively regulated genes (e.g., the number of pharmaceutical leads) to the fraction of false positive clones one is willing to risk. Our results delineate more precisely and extend the limits within which digital Northern data can be used.

    INTRODUCTION
Abstract
Introduction
Results
Discussion
Methods
References

Very large-scale, single-pass partial sequencing of cDNA clones from a large number of libraries has led to the identification of ~50,000 human genes (Adams et al. 1995; Aaronson et al. 1996; Hillier et al. 1996). However, a precise function or a complete transcript sequence are known for <5000 of these (Adams et al. 1995; Boguski and Schuler 1995). In the absence of functional clues for most of the newly identified genes, evidence of differential expression is the most important criteria to prioritize the exploitation of anonymous sequence data in both basic and pharmaceutical (Nowak 1995; Adams 1996; Bains 1996; Editorial 1996) research. For example, the study of expression profiles in various tumors is central to the new Cancer Genome Anatomy project (Kuska 1996; O'Brien 1997). In contrast to functional assays, the quantitative analysis of gene expression level lends itself to large-scale implementation. Two main approaches have been proposed (1) "analog" methods based on hybridization to arrayed cDNA libraries (Lennon and Lehrach 1991; Gress et al. 1992; Nguyen et al. 1995; Schena et al. 1995; Zhao et al. 1995) or oligonucleotide "chips" (Fodor et al. 1991; Southern et al. 1992; Guo et al. 1994; Matson et al. 1995); and (2) "digital" methods, based on the generation of sequence tags. This paper focuses on the latter. The sequence tag-based method (Okubo et al. 1992; Matsubara and Okubo 1994) consists of generating a large number (thousands) of expressed sequence tags (ESTs) (Adams et al. 1991; Wilcox et al. 1991; Adams et al. 1992; Khan et al. 1992) from 3'-directed regional non-normalized cDNA libraries. Recently, Velculescu et al. (1995) have introduced the serial analysis of gene expression (SAGE). Although tags are 100-300 nucleotides in length in the original EST approach, the SAGE method only requires nine nucleotides, therefore allowing a larger throughput. In both protocols, the number of tags is reported to be proportional to the abundance of cognate transcripts in the tissue or cell type used to make the cDNA library. The variation in the relative frequency of those tags, stored in computer databases, is then used to point out the differential expression of the corresponding genes: This is the concept of a "digital Northern" comparison. In the absence of a sound theoretical framework, the validity of the method has only been verified for a handful of genes in the context of two cellular differentiation systems (Lee et al. 1995; Okubo et al. 1995) inducible in vitro. Yet, with a total number of human genes of ~80,000 or more, it is not intuitive that sequencing a mere few thousand tags (a typical experiment) from highly redundant non-normalized cDNA libraries can produce a useful picture, or realistic "transcript profile," of a given tissue, development stage, or cell type. What variations in tag numbers allow for a reliable inference about differential expression? How many tags should be generated? Here we present the statistical framework required to answer those questions and analyze transcript profiles in a quantitative manner.

    RESULTS
Abstract
Introduction
Results
Discussion
Methods
References

In Methods we establish the probability distribution governing the occurrence of the same rare event in duplicate experiments. This probability distribution is a general result applicable to a wide variety of experimental situations, although this paper focuses on its use to analyze digital gene expression patterns. The main and only mathematical assumption behind the derivation is that the observed events are rare and part of a large population of possible outcomes (the distribution of which is not specified). In the context of a digital Northern, one event is the observation of a given cDNA sequence tag, and the experiment consists of the random picking and partial sequencing of a number N of cDNA clones. Given the usual complexity (i.e., the number of different genes expressed) of cDNA libraries, observing a given cDNA qualifies as a rare event, as the abundance of most individual messages is of the order of a few percents or less.

Random Fluctuation vs. Significant Change in Tag Number: When to Infer Differential Expression

Let us randomly pick N = 1000 clones from a cDNA library and generate the corresponding sequence tags; a given message (e.g., interleukin-2) will be picked x (e.g., two) times, with x in a typical (0-10) range. If we now redo this experiment, that is, again pick 1000 clones and generate the tags, the same message will now be picked y (e.g., 3) times. If the experiments have been duplicated correctly and the clones selected at random, we expect x and y to be close, albeit often different because of random fluctuations. In the Methods section, we show that the expected probability of observing y occurrences of a clone already observed x times is given by the simple formula:
p(y|x) = <FR><NU>(x + y)!</NU><DE>x!y!2<SUP>(x+y+1)</SUP></DE></FR> (1)
Equation 1 can be used to compute a confidence interval [ymin, ymax]epsilon within which we expect to find y with a given probability, noted 1-2epsilon , where 2epsilon is the significance level. For epsilon  small (e.g., 2.5% or less), y values falling outside the [ymin, ymax]epsilon interval correspond to p(y | x) << 1, therefore pointing out very unlikely random fluctuations between the two experiments. The confidence intervals for the usual 1% and 5% significance levels are given in Table 1.

                              
View this table:
[in this window]
[in a new window]
 
Table 1.   Confidence Intervals in Function of the Value of x

The same confidence intervals listed in Table 1 can in fact be used to analyze the results of sampling N clones from two different libraries. Provided all experimental factors are well replicated, significant discrepancies between x (from one library) and y (from the other) will now characterize differentially expressed genes, for example, the relative abundance of which is unlikely to be the same in the two libraries. Simply reading Table 1, we see that variations in counts such as 7 right-arrow 0, or 2 right-arrow 12 are significant (P < 0.01) evidence of regulated gene expression, whereas variations such as 3 right-arrow 0 or 8 right-arrow 16 are not (P > 0.05). However, we do not advocate the use of rigid significance thresholds to analyze digital transcript profiles, as discussed below.

Influence of the Sampling Size

Surprisingly at first, p(y | x) in Equation 1 does not involve the sampling size N, that is, the total number of picked clones. The fluctuation probabilities, and confidence intervals, depend only on the values of the observed counts. To understand why, we must remember that Equation 1 governs the results of strictly duplicated experiments. Given N clones are sampled, the most likely tags to be picked up are, intuitively, those corresponding to cDNA, the abundance of which is of the order of 1/N, or larger (according to Equation 3, the probability of finding a given cDNA with 1/N abundance while picking up N clones is 0.63, see also Equation 13). Choosing a sampling size therefore corresponds to targeting a given subset of genes, the level of expression of which allows their tags to occur at reasonable frequencies.

As expected, more reliable inferences can be made on clones corresponding to larger absolute frequencies (i.e., the ones more often picked up). For example (see Table 1), a variation in counts from 1-3 (threefold increase) is not indicative of a significant (P < 0.05) increase, whereas a variation from 4-12 is significant at P < 0.05, and a variation from 7-21 is significant at P < 0.01. For a gene expressed at a given rate, increasing the sampling size N leads to higher tag counts, and allows more stringent statistical inference to be made, for the same proportional variation.

Most often in practice one wishes to compare digital Northerns or gene profiles that have been computed from the random picking of different numbers of clones, N1 and N2. The mathematical problem is now to establish the probability for a given cDNA (e.g., interleukin-2) to be picked up x times when the sampling size was N1 and y times when the sampling size was N2. Equation 1 then becomes (see Methods):
p(y|x) = <FENCE><FR><NU>N<SUB>2</SUB></NU><DE>N<SUB>1</SUB></DE></FR></FENCE><SUP>y</SUP> <FR><NU>(x + y)!</NU><DE>x!y! <FENCE>1 + <FR><NU>N<SUB>2</SUB></NU><DE>N<SUB>1</SUB></DE></FR></FENCE><SUP>(x+y+1)</SUP></DE></FR> (2)
Whereas Equation 1 applied to the analysis of fluctuation in counts in strictly identical experiments, Equation 2 now applies to the analysis of counts in experiments only differing by the total number of clones randomly picked up. In practice, Equation 2 will be used to analyze experiments performed on two different libraries, using different sampling sizes. As for Equation 1, small p(y | x) are expected to characterize the genes exhibiting regulated expression, the relative abundance of which is unlikely to be the same in the two libraries.

Comparison with Fisher's (2 × 2) Exact Test

The (2 × 2) contingency tables arising from treatment versus control experiments are traditionally analyzed with Fisher's exact test (Siegel 1956; Agresti 1996). Differential EST count data can be presented in a tabulated form so as to suggest the use of this test, as follows:

Brain cDNA library Liver cDNA library
Number of actin ESTs 2 11
Number of other ESTs 998 1189
  Total clones sampled
1000 1200
 

The statistical significance according to Fisher's exact test for such a result is 4.6% (two-tail P-value, i.e., the probability for such a table to occur in the hypothesis that actin EST frequencies are independent of the cDNA libraries). In comparison, the P-value computed from the cumulative form (Equation 9, see Methods) of Equation 2 (i.e., for the relative frequency of actin ESTs to be the same in both libraries, given that at least 11 cognate ESTs are observed in the liver library after two were observed in the brain library) is 1.6%. Fisher's (2 × 2) exact test is always more conservative than our test (e.g., Fisher's P-value of 1.6% requires a 2 right-arrow 13 EST count transition in the above setting). Besides being too conservative, there is a more fundamental difficulty in using this test to analyze EST count data. The sampling scheme assumed by Fisher's exact test in principle requires the total number of data values in the contingency table to be fixed, as well as both the row marginal total and the column marginal totals. In our prospective experimental situation, only the column marginals (i.e., the numbers of clones sampled from each library) are fixed. The extension of Fisher's exact test to cases where only one set of marginal totals is fixed (Tocher 1950) is still controversial. In the context of the above EST counting results, there is an additional problem with the lack of homogeneity in the definition of the "other EST" category. This category represents different subsets of transcripts for different libraries.

The use of Fisher's (2 × 2) exact test is more natural for a different type of EST data analysis: the study of library-dependent alternative transcripts of the same gene (i.e., splice or polyadenylation variants) (D. Gautheret, O. Poirot, F. Lopez, S. Audic, and J.-M. Claverie, in prep.). Here, the results for an hypothetical gene G1 may look as follows:

G1-related transcripts in brain library G1-related transcripts in liver library
Long-form mRNA 2 10
Short-form mRNA 8 3
  Total G1-related     clones
10 13
 

  where the alternative categories are unambiguously defined and refer to the same objects. For example, the above results constitute good evidence that G1 is expressed in different forms in those tissues (Fisher's exact test two-tail P-value = 1.2%).

False Leads in the Selection of Candidate Genes

A crucial measure of the power of statistical significance tests is their rate of false alarm, that is, how often random fluctuations are expected to be mistaken for significant differences in the results. When analyzing the transcript profiles from two different libraries, a false alarm would cause a gene to be deemed differentially transcribed, whereas in fact it is not. The rate of false alarm is therefore a direct estimate of the fraction of false leads, when searching for differentially expressed genes on the basis of differences in tag counts. The rates of false alarm associated with the P < 0.01 and P < 0.05 confidence intervals listed in Table 1 have been computed by Monte-Carlo simulation on the basis of two experimental sequence tag distributions (Table 2; Fig. 1). The rate of false alarms associated with the use of Equation 1 (in fact, its cumulative form Equation 9, see Methods) is very small for genes represented by small tag counts and slowly increases for higher tag counts, without ever exceeding the selected significance level. Such good behavior validates the use of the confidence intervals (Table 1) computed from Equation 1 and Equation 9 to assess the statistical significance of variations in digital Northern data. The curves labeled "window" characterize the very similar behavior of a slightly less conservative derivation of the same test (see Methods, Equation 15). For comparison, Figure 1 also presents the behavior of another test, based on an inappropriate application of Ricker's confidence intervals (Ricker 1937) (see Methods).

                              
View this table:
[in this window]
[in a new window]
 
Table 2.   Publicly Available Distributions of Sequence Tags


View larger version (31K):
[in this window]
[in a new window]
 
Figure 1   Rate of false alarm computed according to the confidence intervals listed in Table 1. (Top) Monte-Carlo simulation of the random sampling of 840 tags distributed according to the data from Velculescu et al. (1995; see Table 2). (Bottom) Monte-Carlo simulation of the random sampling of 982 ESTs distributed according to the data from Okubo et al. (1992; see Table 2). The frequency of false alarm was computed for two significance levels (2epsilon  = 5%, left and 2epsilon  = 1%, right) and plotted in function of the tag class size (from 1-64 for Velculescu et al., from 1-22 for Okubo et al.). In all cases, the rate of false alarm increases up to a plateau for larger class sizes. The test (cumulative form of Equation 1) derived from the flat p(lambda ) prior shows perfect behavior with a maximal rate of false alarm always less than the significance levels (broken lines). The test (cumulative form of Equation 15) derived from the window p(lambda ) prior exhibits a slightly higher rate of false alarms. Both versions of the test exhibit conservative behaviors for class size <5, with a false alarm rate even less than expected. In contrast, Ricker's confidence intervals (Equation 12) are grossly inadequate and lead to false alarm rates up to four times the significance level. Graphs are computed from the analysis of 1000 repetitions of each experiment.

    DISCUSSION
Abstract
Introduction
Results
Discussion
Methods
References

An appropriate statistical test is now at our disposal to begin analyzing digital gene expression profiles in a more quantitative way. For example, the test can be used to determine how many genes appear regulated at various confidence levels using the data from a typical experiment (e.g., sampling a thousand clones). We analyzed the data gathered by Okubo et al. (1995) on the human promyelocytic leukemia cell line HL60 induced by dimethylsulfoxide (DMSO) or tetradecanoylphorbolacetate (TPA). Table 3 shows the 21 EST classes the occurrences of which exhibit significant variations at the 1% level. Most of the corresponding genes make biological sense in term of differentiation along the granulocyte or monocyte pathways.

                              
View this table:
[in this window]
[in a new window]
 
Table 3.   List of ESTs Exhibiting Significant (P < 0.01) Differences in Abundance in the HL60 Cell Line Induced by DMSO or TPA

This example serves to discuss a subtle point in the interpretation of the P values computed from Equation 1, 2, and 9. Rigorously, these equations apply to the case where a given gene (e.g., lipocortin) would have been selected for scrutiny before looking at the differences in cognate tag counts between libraries. When comparing two libraries without specifying in advance the transcripts we want to follow, and then focusing a posteriori on any of those exhibiting significant variations, the average number of expected false positive Nfalse is Nfalse = PNspecies, where Nspecies is the number of different transcript species encountered and p is a given significance level. For instance, in the experiment analyzed in Table 3, Nspecies is of the order of 600 (Okubo et al. 1995). It is therefore possible that up to four (600 × 7 × 10-3) out of the 21 transcript species listed in Table 3 are not truly differentially expressed.

Therefore, when two libraries are compared without prior gene selection, the use of a predetermined significance threshold is not advisable. The P values computed from Equation 1, 2, and 9 should simply be used to rank all observed variations by order of decreasing statistical significance (analogous to how "similarity hits" are listed after database searches). The end-users can then make their own choice about the number of candidate target genes to be retained from the top of the list, bearing in mind the corresponding number of expected false positives.

Although the present interpretation of a digital Northern focuses on the genes exhibiting the most spectacular differential expressions, there is already ample evidence that small changes can cause drastic effects. Disease states caused by haploinsufficiency and trisomy suggest that 2 right-arrow 1 or 2 right-arrow 3 proportional changes in expression level may be of biological significance. Table 1 shows that there is no theoretical limit to the detection of such small variations from the comparison of digital expression patterns. Simply, the sampling size has to be increased enough for the required numbers of cDNA tags to reach a significance threshold (for instance 40 right-arrow 60, for a confidence level of 95%).

Analog hybridization-based methods (Fodor et al. 1991; Lennon and Lehrach 1991; Gress et al. 1992; Southern et al. 1992; Guo et al. 1994; Matson et al. 1995; Nguyen et al. 1995; Schena et al. 1995; Zhao et al. 1995) are traditionally opposed to digital tag-counting methods (Okubo et al. 1992; Matsubara and Okubo 1994; Lee et al. 1995; Okubo et al. 1995; Velculescu et al. 1995) for the analysis of differential gene expression. Both types of methods are sensitive to the quality of the original messenger RNA preparation and/or cDNA libraries. Analog methods promise higher throughput, lower cost, and have the capacity of studying transcripts on a much wider scale of abundance. They are therefore expected to supersede digital methods. On the down side, however, hybridization signals are not easily reproducible, and can be affected by many unknown properties such as the cDNA library complexity, as well as clone and sequence specific features (e.g., insert size, nucleotide composition, presence of repeats, secondary structure, triple helix interaction, etc.). Therefore, the hybridization-based methods require an estimation of the dispersion of the signal associated with each clone (i.e., enough repetitions of each experiment), and multiple standardization and calibration procedures to allow the meaningful comparison of hybridization patterns obtained from various sources (tissues, cell types, etc.) or from different membranes or chips. This is far from routine and has yet to be worked out. In contrast, and thanks to the unique properties of the Poisson distribution, digital methods have the capacity of providing a quantitative assessment of differential expression without the repetition or the standardization of individual tag-counting experiments. The statistical analysis presented here provides an objective method to analyze digital transcript profile data, and adapts it to fit (1) the number of leads one wants to be followed; (2) the fraction of false clues to be tolerated; and (3) the level of modulation in gene expression considered of biological interest.

A program is available on our web site (http://igs-server.cnrs-mrs.fr) to compute the confidence intervals corresponding to arbitrary significance levels and sampling size N1 and N2.

    METHODS
Abstract
Introduction
Results
Discussion
Methods
References

Let us denote p(x) the probability to observe x sequence tags of the same gene (i.e., from the 3' end of the same transcript) when N cDNA clones are picked randomly. For each transcript representing a small (i.e., less than 5%) fraction of the library and N >=  1000, p(x) will closely follow the Poisson distribution:
p(x) = <FR><NU>e<SUP>−&lgr;</SUP>&lgr;<SUP>x</SUP></NU><DE>x!</DE></FR> (3)
where lambda  is the actual (albeit unknown) number of transcript of this type per N clones in the library. If we duplicate this experiment (i.e., once again randomly pick N clones of the same library and generate sequence tags), we will now observe y occurrences of the same transcript. What is the probability of the various y values? An approximate solution consists in using x as the maximum likelihood estimate for lambda  and compute the probability for y occurrences given a Poisson distribution of mean lambda  = x:
p(y|x) = <FR><NU>e<SUP>−x</SUP>x<SUP>y</SUP></NU><DE>y!</DE></FR> (4)
Equation 4 is not symmetrical in x and y. This is an obvious flaw as the probability should not depend on which of the x or y values were observed first. p(y | x) = p(x | y) should hold provided that an equal number N of clones is sampled in both experiments. Equation 4 is not the correct formula, because we have not yet taken into account the fluctuation of x around the unknown mean lambda . To account for the fact that the actual value of lambda  is unknown, we have to integrate Equation 4 over all possible lambda  values:
p(y|x) = <LIM><OP>∫</OP><LL>0</LL><UL>∞</UL></LIM> d&lgr;p(d = &lgr;|x)p(y|d = &lgr;) (5)
p(d = lambda  | x) in Equation 5 is the probability that the actual abundance of a given transcript is lambda  given that x occurrences of a cognate tag have been observed in one experiment. The second term in the integral is the probability of drawing y occurrences given a Poisson distribution of mean lambda :
p(y|d = &lgr;) = <FR><NU>e<SUP>−&lgr;</SUP>&lgr;<SUP>y</SUP></NU><DE>y!</DE></FR> (6)
Using Bayes' theorem p(d = lambda  | x) can be written as
p(d = &lgr;|x) = <FR><NU>p(x|d = &lgr;)p(d = &lgr;)</NU><DE><LIM><OP>∫</OP><LL>0</LL><UL>∞</UL></LIM> d&lgr;‘ p(x|d = &lgr;‘) p(d = &lgr;‘)</DE></FR> (7)
To evaluate Equation 7, we need to define the prior distribution p(d = lambda ). The least constrained hypothesis (i.e., with the least information content), is to attribute an equal a priori probability to all lambda  values in the [0, infinity ] range. Incorporating such a flat prior in Equation 5 leads to
p(y|x) = <FR><NU>1</NU><DE>x!y!</DE></FR> <LIM><OP>∫</OP><LL>0</LL><UL>∞</UL></LIM> d&lgr;e<SUP>−2&lgr;</SUP>&lgr;<SUP>(x+y)</SUP> (8)
From the definition of the Gamma  function for integer arguments we observe that
<LIM><OP>∫</OP><LL>0</LL><UL>∞</UL></LIM> d&lgr;e<SUP>−2&lgr;</SUP>&lgr;<SUP>(x+y)</SUP> = <FR><NU>(x + y)!</NU><DE>2<SUP>(x+y+1)</SUP></DE></FR>
and finally obtain the expression given in Results:
p(y|x) = <FR><NU>(x + y)!</NU><DE>x!y!2<SUP>(x+y+1)</SUP></DE></FR> (1)
This equation can be used in a wide variety of experimental situations. Equation 1 defines the probability of observing x and y occurrences of the same rare event in duplicated experiments, regardless of the detailed probability distribution of those events among the set of possible outcomes. In particular, in the context of transcription profiles, p(y | x) can be evaluated regardless of the distribution of each transcript (provided it is rare) within a cDNA library.

To compute the confidence intervals listed in Table 1, we made use of the cumulative distributions:
C(y ⩽ y<SUB>min</SUB>|x) = <LIM><OP>∑</OP><LL>y=0</LL><UL>y⩽ y<SUB>min</SUB></UL></LIM> p(y|x) (9a)
D(y ⩾ y<SUB>max</SUB>|x) = <LIM><OP>∑</OP><LL>y=y<SUB>max</SUB></LL><UL>∞</UL></LIM> p(y|x) (9b)
These equations allow the computation of an interval [ymin, ymax]epsilon such as C(y  ymin | x)  epsilon and D(y >=  ymax | x)  epsilon . Given that an event is observed x times in one experiment, the number y of occurrences of this event in a duplicate experiment is expected to fall within the interval [ymin, ymax]epsilon with a probability of 1-2epsilon . Equation 9, a and b, can therefore serve as a significance test when comparing, for instance, the results of sampling N clones from two different libraries. For 2epsilon small (e.g., 5% or less), y values falling outside the [ymin, ymax]epsilon interval correspond to p(y | x) << 1, and point out significant differences between the two experiments. They should include differentially expressed genes, for example, for which lambda  is different in the two libraries.

Generalization to Different Sampling Sizes

When different numbers of clones N1 and N2 are sequenced from the same library, Equation 5 becomes
p(y|x) = <LIM><OP>∫</OP><LL>0</LL><UL>∞</UL></LIM> d&lgr;<SUB>2</SUB> <LIM><OP>∫</OP><LL>0</LL><UL>∞</UL></LIM> d&lgr;<SUB>1</SUB>p(d<SUB>1</SUB> = &lgr;<SUB>1</SUB>|x)p(y|d<SUB>2</SUB> = &lgr;<SUB>2</SUB>)&dgr;<FENCE>&lgr;<SUB>2</SUB> − <FR><NU>N<SUB>2</SUB></NU><DE>N<SUB>1</SUB></DE></FR> &lgr;<SUB>1</SUB></FENCE> (10)
where the two abundance values lambda 1 and lambda 2 are forced in the same ratio as N1 and N2. Using the same bayesian argument as before (Equation 7) leads to
p(y|x) = <FR><NU>1</NU><DE>x!y!</DE></FR> <FENCE><FR><NU>N<SUB>2</SUB></NU><DE>N<SUB>1</SUB></DE></FR></FENCE><SUP>y</SUP> <LIM><OP>∫</OP><LL>0</LL><UL>∞</UL></LIM> d&lgr;<SUB>1</SUB>e<SUP>−&lgr;<SUB>1</SUB><FENCE>1+<FR><NU>N<SUB>2</SUB></NU><DE>N<SUB>1</SUB></DE></FR></FENCE></SUP> &lgr;<SUB>1</SUB><SUP>(x+y)</SUP> (11)
the last integral is simply
<FR><NU>(x + y)!</NU><DE><FENCE>1 + <FR><NU>N<SUB>2</SUB></NU><DE>N<SUB>1</SUB></DE></FR></FENCE> <SUP>(x + y + 1)</SUP></DE></FR>
leading to the formula presented in the Results section:
p(y|x) = <FENCE><FR><NU>N<SUB>2</SUB></NU><DE>N<SUB>1</SUB></DE></FR></FENCE><SUP>y</SUP> <FR><NU>(x + y)!</NU><DE>x!y! <FENCE>1 + <FR><NU>N<SUB>2</SUB></NU><DE>N<SUB>1</SUB></DE></FR></FENCE><SUP>(x+y+1)</SUP></DE></FR> (2)

Ricker's Confidence Interval

The confidence interval computed from Equation 1 (and its cumulative form, Equation 9, a and b) is different from one introduced previously by Ricker (1937) although, at first, the two may appear to be related.

Given x occurrences of a sequence tag, Ricker's formula defines a confidence interval [lambda min, lambda max]x for lambda  (again the actual number of transcripts of this type per N clones in the library) such as
p(k ⩽ x) = <LIM><OP>∑</OP><LL>k=0</LL><UL>x</UL></LIM> <FR><NU>e<SUP>−&lgr;<SUB>max</SUB></SUP> &lgr;<SUB>max<SUP>k</SUP></SUB></NU><DE>k!</DE></FR> ⩽ <FR><NU>&agr;</NU><DE>2</DE></FR> (12a)
and
p(k ⩾ x) = <LIM><OP>∑</OP><LL>k=x</LL><UL>∞</UL></LIM> <FR><NU>e<SUP>−&lgr;<SUB>min</SUB></SUP> &lgr;<SUB>min<SUP>k</SUP></SUB></NU><DE>k!</DE></FR> ⩽ <FR><NU>&agr;</NU><DE>2</DE></FR> (12b)
where alpha  is typically 5% or 1%. Ricker's confidence intervals for various values of x are given in Table 1. Those intervals are close to those computed from Equation 1, but delineate the range of likely lambda  values, not y (the number of occurrences of the same event in a duplicated experiment). It is possible for x and y to fall outside each other's Ricker's confidence interval [lambda min, lambda max], while still being nonsignificant fluctuations around the same lambda  value. The confidence intervals computed from Equation 12, a and b, are therefore too narrow to properly define significant discrepancies between x and y. The false alarm rate associated with the use of Ricker's confidence intervals is too high (Fig. 1).

However, an interesting use of Equation 12, a and b, is the estimation of the range of possible frequencies [lambda min, lambda max]x = 0 for cDNAs not yet encountered after picking N clones. For example, the 95% confidence interval is given by:
0 < <IT>N</IT>&lgr; < 3.7 (13)
That is, the abundance of a cDNA not picked up among one thousand clones is unlikely (5% chance) to be larger than 3.7/1000.

Influence of the Prior Distribution

In the bayesian context, it is prudent to assess the influence of the prior hypothesis used to derive Equation 1 and Equation 2. The flat p(lambda ) prior allowing equiprobable lambda  values in the [0 infinity ] range might appear too broad and unrealistic. Nevertheless, it is the most intuitively neutral distribution one can use. The quick convergence of the Poisson distribution rends the contribution of extreme lambda  values negligible as soon as | lambda  - x | or | lambda  - y | increase. To verify this point, more reasonable distributions for p(lambda ) can be constructed by confining the accessible lambda  values within a window [lambda min, lambda max]x centered around the already observed value x. Such a window can for instance be Ricker's confidence intervals as defined in the previous section (Equation 12, a and b). We then confine the only permitted values of lambda  to be in this interval, with an equal probability; therefore,
p(d = &lgr;) = <FR><NU>H(&lgr; − &lgr;<SUB>min</SUB>) × [1 − H(&lgr; − &lgr;<SUB>max</SUB>)]</NU><DE>&lgr;<SUB>max</SUB> − &lgr;<SUB>min</SUB></DE></FR> (14)
where H denotes the Heaviside function, the value of which is 1 for positive argument and 0 otherwise. Equation 1 then becomes
p(y|x) = <FR><NU>F<SUB>2&lgr;<SUB>min</SUB>,2&lgr;<SUB>max</SUB></SUB>(x + y)</NU><DE>F<SUB>&lgr;<SUB>min</SUB>, &lgr;<SUB>max</SUB>(x)y!2<SUP>(x+y+1)</SUP></SUB></DE></FR> (15)
with
F<SUB>&lgr;<SUB>min</SUB>, &lgr;<SUB>max</SUB></SUB>(x) = <LIM><OP>∫</OP><LL>&lgr;<SUB>min</SUB></LL><UL>&lgr;<SUB>max</SUB></UL></LIM> d&lgr;e<SUP>−&lgr;</SUP>&lgr;<SUP>x</SUP> (16)
When lambda min right-arrow 0 and lambda max right-arrow infinity , we recover the initial Equation 1. We note in passing that the other limit, lambda min = lambda max = x, corresponds to the most stringent "Dirac prior":
<IT>p</IT>(<IT>d</IT> = &lgr;|<IT>x</IT>) = &dgr;(<IT>x</IT>− &lgr;) (17)
forcing lambda  = x and turning p(y | x) into the simple Poisson distribution (Equation 4).

The confidence intervals for the usual 1% and 5% significance levels are given in Table 1 for both the flat and the window prior p(d = lambda ). There is little difference, with the test derived from using a flat prior being a bit more conservative, as expected. On the down side, Figure 1 shows that the test derived from the window p(lambda ) prior gives rise to a higher rate of false alarm.

    ACKNOWLEDGMENTS

We thank Drs. J. Weissenbach, D. Gautheret, R. Ewing, and C. Abergel for critically reading the manuscript. This work was sponsored by a collaborative research grant from Incyte Pharmaceuticals, Inc.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.

    FOOTNOTES

1   Corresponding author.

   E-MAIL jmc{at}igs.cnrs-mrs.fr; FAX 334 91 16 45 49.

    REFERENCES
Abstract
Introduction
Results
Discussion
Methods
References
  • Aaronson, J.S., B. Eckman, R.A. Blevins, J.A. Borkowski, J. Myerson, S. Imran, and K.O. Elliston. 1996. Toward the development of a gene index to the human genome: An assessment of the nature of high-throughput EST sequence data. Genome Res. 6: 829-845[Abstract/Free Full Text].
  • Adams, M.D. 1996. Progress towards a complete set of human genes. In Genomes, molecular biology and drug discovery (ed. M.J. Browne and P.L. Thurby). Academic Press, London, UK.
  • Adams, M.D., J.M. Kelley, J.D. Gocayne, M. Dubnick, M.H. Polymeropoulos, H. Xiao, C.R. Merril, A. Wu, B. Olde, R.F. Moreno 1991. Complementary DNA sequencing: Expressed sequence tags and human genome project. Science 252: 1651-1656[Abstract/Free Full Text].
  • Adams, M.D., M. Dubnick, A.R. Kerlavage, R. Moreno, J.M. Kelley, T.R. Utterback, J.W. Nagle, C. Fields, and J.C. Venter. 1992. Sequence identification of 2,375 human brain genes. Nature 355: 632-634[CrossRef][Medline].
  • Adams, M.D., A.R. Kerlavage, R.D. Fleischmann, R.A. Fuldner, C.J. Bult, N.H. Lee, E.F. Kirkness, K.G. Weinstock, J.D. Gocayne, O. White 1995. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. (The Genome Directory, Suppl.) Nature 377: 3-174[Medline].
  • Agresti, A. 1996. An introduction to categorical data analysis. John Wiley, New York, NY.
  • Bains, W. 1996. Virtually sequenced: The next genomic generation. Nature Biotechnol. 14: 711-713.[CrossRef][Medline]
  • Boguski, M.S. and G.D. Schuler. 1995. ESTablishing a human transcript map [News]. Nature Genet. 10: 369-371[CrossRef][Medline].
  • Editorial. 1996. Capitalizing on the genome. Nature Genet. 13: 1-5.
  • Fodor, S.P.A., J.L. Read, M.C. Pirrung, L. Stryer, A.T. Lu, and D. Solas. 1991. Light-directed, spatially addressable parallel chemical synthesis. Science 251: 467-470.
  • Gress, T.M., J.D. Hoheisel, G.G. Lennon, G. Zehetner, and H. Lehrach. 1992. Hybridization fingerprinting of high-density cDNA-library arrays with cDNA pools derived from whole tissues. Mamm. Genome 3: 609-619[CrossRef][Medline].
  • Guo, Z., R.A. Guilfoyle, A.J. Thiel, R. Wang, and L.M. Smith. 1994. Direct fluorescence analysis of genetic polymorphisms by hybridization with oligonucleotides arrays on glass supports. Nucleic Acids. Res. 22: 5456-5465[Abstract/Free Full Text].
  • Khan, A.S., A.S. Wilcox, M.H. Polymeropoulos, J.A. Hopkins, T.J. Stevens, M. Robinson, A.K. Orpana, and J.M. Sikela. 1992. Single pass sequencing and physical and genetic mapping of human brain cDNAs [see Comments]. Nature Genet. 2: 180-185[CrossRef][Medline].
  • Kuska, B. 1996. Cancer genome anatomy project set for take-off. J. Natl. Cancer Inst. 88: 1801-1803[Free Full Text].
  • Hillier, L., G. Lennon, M. Becker, M.F. Bonaldo, B. Chiapelli, S. Chissoe, N. Dietrich, T. DuBuque, A. Favello, W. Gish 1996. Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6: 807-828[Abstract/Free Full Text].
  • Lee, N.H., K.G. Weinstock, E.F. Kirkness, J.A. Earle-Hugues, R.A. Fuldner, S. Marmaros, A. Glodek, J.D. Gocayne, M.D. Adams, A.R. Kerlavage 1995. Comparative expressed-tag analysis of differential gene expression profiles in PC-12 cells before and after nerve growth factor treatment. Proc. Natl. Acad. Sci. 92: 8303-8307[Abstract/Free Full Text].
  • Lennon, G.G. and H. Lehrach. 1991. Hybridization analyses of arrayed cDNA libraries. Trends Genet. 7: 314-317[Medline].
  • Matson, R.S., J. Rampal, S.L. Pentoney Jr, P.D. Anderson, and P. Coassin. 1995. Biopolymer synthesis on polypropylene supports: Oligonucleotide arrays. Anal. Biochem. 224: 110-116[CrossRef][Medline].
  • Matsubara, K. and K. Okubo. 1994. Identification of new genes by systematic analysis of cDNAs and database construction. Curr. Opin. Biotechnol. 4: 672-677.
  • Nguyen, C., D. Rocha, S. Granjeaud, M. Baldit, K. Bernard, P. Naquet, and B.R. Jordan. 1995. Differential gene expression in the murine thymus assayed by quantitative hybridization of arrayed cDNA clones. Genomics 29: 207-216[CrossRef][Medline].
  • Nowak, R. 1995. Entering the postgenome era. Science 270: 368-369[Abstract/Free Full Text].
  • O'Brien, C. 1997. Cancer genome anatomy project launched. Mol. Med. Today 3: 94.
  • Okubo, K., N. Hori, R. Matoba, T. Niiyama, A. Fukushima, Y. Kojima, and K. Matsubara. 1992. Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression. Nature Genet. 2: 173-179[CrossRef][Medline].
  • Okubo, K., K. Itoh, A. Fukushima, J. Yoshii, and K. Matsubara. 1995. Monitoring cell physiology by expression profiles and discovering cell type-specific genes by compiled expression profiles. Genomics 30: 178-186[CrossRef][Medline].
  • Ricker, W.E. 1937. The concept of confidence or fiducial limits applied to the Poisson frequency distribution. J. Am. Statist. Assoc. 32: 349-357.[CrossRef]
  • Siegel, S. 1956. Nonparametric methods for the behavioral sciences. McGraw-Hill, New York, NY.
  • Schena, M., D. Shalon, R.W. Davis, and P.O. Brown. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467-470[Abstract/Free Full Text].
  • Southern, E.M., U. Maskos, and J.K. Elder. 1992. Analyzing and comparing nucleic acid sequences by hybridization to arrays of oligonucleotides: Evaluation using experimental models. Genomics 13: 1008-1017[CrossRef][Medline].
  • Tocher, K.D. 1950. Extension of the Neyman-Pearson theory of tests to discontinuous variates. Biometrika 37: 130-144[Free Full Text].
  • Velculescu, V.E., L. Zhang, B. Vogelstein, and K.W. Kinzler. 1995. Serial analysis of gene expression. Science 270: 484-487[Abstract/Free Full Text].
  • Wilcox, A.S, A.S. Khan, J.A. Hopkins, and J.M. Sikela. 1991. Use of 3' untranslated sequences of human cDNAs for rapid chromosome assignment and conversion to STSs: Implications for an expression map of the genome. Nucleic Acids Res. 19: 1837-1843[Abstract/Free Full Text].
  • Zhao, N., H. Hashida, N. Takahashi, Y. Misumi, and Y. Sakaki. 1995. High-density cDNA filter analysis: A novel approach for large-scale, quantitative analysis of gene expression. Gene 156: 207-213[CrossRef][Medline].

Received March 12, 1997; accepted in revised form August 22, 1997.


7:986-995 ©1997 by Cold Spring Harbor Laboratory Press ISSN 1054-9803/97 $5.00

Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us   Add to Digg Digg   Add to Reddit Reddit   Add to Technorati Technorati    What's this?


This article has been cited by other articles:


Home page
Proc. Natl. Acad. Sci. USAHome page
M. Semon and K. H. Wolfe
Preferential subfunctionalization of slow-evolving genes after allopolyploidization in Xenopus laevis
PNAS, June 17, 2008; 105(24): 8333 - 8338.
[Abstract] [Full Text] [PDF]


Home page
Genome Res.Home page
R. D. Morin, M. D. O'Connor, M. Griffith, F. Kuchenbauer, A. Delaney, A.-L. Prabhu, Y. Zhao, H. McDonald, T. Zeng, M. Hirst, et al.
Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells
Genome Res., April 1, 2008; 18(4): 610 - 621.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
J. W. H. Wong, M. J. Sullivan, and G. Cagney
Computational methods for the comparative quantification of proteins in label-free LCn-MS experiments
Brief Bioinform, March 1, 2008; 9(2): 156 - 165.
[Abstract] [Full Text] [PDF]


Home page
Plant CellHome page
J. Nagel, L. K. Culley, Y. Lu, E. Liu, P. D. Matthews, J. F. Stevens, and J. E. Page
EST Analysis of Hop Glandular Trichomes Identifies an O-Methyltransferase That Catalyzes the Biosynthesis of Xanthohumol
PLANT CELL, January 1, 2008; 20(1): 186 - 200.
[Abstract] [Full Text] [PDF]


Home page
Cancer Epidemiol. Biomarkers Prev.Home page
J. H. Cho-Vega, S. Tsavachidis, K.-A. Do, J. Nakagawa, L. J. Medeiros, and T. J. McDonnell
Dicarbonyl/L-Xylulose Reductase: A Potential Biomarker Identified by Laser-Capture Microdissection-Micro Serial Analysis of Gene Expression of Human Prostate Adenocarcinoma
Cancer Epidemiol. Biomarkers Prev., December 1, 2007; 16(12): 2615 - 2622.
[Abstract] [Full Text] [PDF]


Home page
MicrobiologyHome page
M. Costa, C. L. Borges, A. M. Bailao, G. V. Meirelles, Y. A. Mendonca, S. F. I. M. Dantas, F. P. de Faria, M. S. S. Felipe, E. E. W. I. Molinari-Madlum, M. J. S. Mendes-Giannini, et al.
Transcriptome profiling of Paracoccidioides brasiliensis yeast-phase cells recovered from infected mice brings new insights into fungal response upon host interaction
Microbiology, December 1, 2007; 153(12): 4194 - 4207.
[Abstract] [Full Text] [PDF]


Home page
Plant CellHome page
J. K. Hane, R. G.T. Lowe, P. S. Solomon, K.-C. Tan, C. L. Schoch, J. W. Spatafora, P. W. Crous, C. Kodira, B. W. Birren, J. E. Galagan, et al.
Dothideomycete Plant Interactions Illuminated by Genome Sequencing and EST Analysis of the Wheat Pathogen Stagonospora nodorum
PLANT CELL, November 1, 2007; 19(11): 3347 - 3368.
[Abstract] [Full Text] [PDF]


Home page
Mol Hum ReprodHome page
J.-A.L. Stanton, A.B. Macgregor, C. Mason, M. Dameh, and D.P.L. Green
Building comparative gene expression databases for the mouse preimplantation embryo using a pipeline approach to UniGene
Mol. Hum. Reprod., October 1, 2007; 13(10): 713 - 720.
[Abstract] [Full Text] [PDF]


Home page
Genome Res.Home page
P. Huang, E. D. Ple