Briefings in Bioinformatics Advance Access originally published online on February 3, 2006
Briefings in Bioinformatics 2006 7(1):25-36; doi:10.1093/bib/bbk002

© The Author 2006. Published by Oxford University Press. For Permissions, please email: journals.permissions@oxfordjournals.org

Estimation and control of multiple testing error rates for microarray studies

Stanley B. Pounds

Stanley B. Pounds, Department of Biostatistics, MS 768, St Jude Children's Research Hospital, Memphis, TN 38105, USA. Tel: 901-495-5052; Fax: 901-544-8843; E-mail: stanley.pounds{at}stjude.org


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 BASIC PRINCIPLES OF STATISTICAL...
 THE PROBLEM OF MULTIPLE...
 ERROR RATES FOR MULTIPLE...
 OTHER PRINCIPLES THAT...
 BRIEF DESCRIPTIONS OF METHODS
 CHOOSING A METHOD FOR...
 ASSESSING THE RELIABILITY OF...
 DISCUSSION
 FOOTNOTES
 Acknowledgements
 REFERENCES
 
The analysis of microarray data often involves performing a large number of statistical tests, usually at least one test per queried gene. Each test has a certain probability of reaching an incorrect inference; therefore, it is crucial to estimate or control error rates that measure the occurrence of erroneous conclusions in reporting and interpreting the results of a microarray study. In recent years, many innovative statistical methods have been developed to estimate or control various error rates for microarray studies. Researchers need guidance in choosing the appropriate statistical methods for analysing these types of data sets. This review describes a family of methods that use a set of P-values to estimate or control the false discovery rate and similar error rates. Finally, these methods are classified in a manner that suggests the appropriate method for specific applications, and diagnostic procedures that can identify problems in the analysis are described.

Keywords: microarray, gene expression, multiple testing, false discovery rate, error rate, statistical analysis


    INTRODUCTION
 
Numerous complex statistical problems arise during the analysis of microarray data [1]. These include the multiple testing problem, which occurs when numerous statistical tests are performed [2]. For each statistical test performed, there is some probability that an erroneous inference will be made. Consequently, several incorrect inferences can easily occur by chance alone in an analysis that includes many statistical tests. Typically, the analysis of microarray data involves applying one or more statistical tests for each queried feature; therefore, thousands of tests are performed in any given analysis, and an extensive multiple testing problem is introduced in this setting. Researchers can easily be misled about the quality of their inferences if they do not properly account for or describe the occurrence of errors that arise in applications that involve multiple testing.

A typical objective of a microarray experiment is to identify features (genes or probe sets) whose expression is associated with a trait of interest. In a comparison of two or more experimental groups, any feature that has a differing mean or median expression across those groups, i.e. is ‘differentially expressed,’ is considered to be associated with the trait of interest. In microarray studies, a common approach to identify features of potential interest is to apply a statistical hypothesis testing procedure (i.e. a statistical test) to each feature's data. Therefore, when analysing microarray data in this manner, it is important to: understand the basic principles of statistical hypothesis testing; be aware of the difficulties that arise when many statistical hypothesis tests are performed; be familiar with some methods presently available for addressing the multiple testing problem; know how to choose the appropriate method in a specific application; and assess the reliability of the results produced by a selected method.


    BASIC PRINCIPLES OF STATISTICAL HYPOTHESIS TESTING
 
A statistical hypothesis testing procedure is a mathematically formal way to use data to examine the plausibility of a specific statement regarding the association of two or more variables. The statement about the association of the variables is called the ‘null hypothesis’ in statistical jargon. Conceptually, the null hypothesis could make many different types of statements about the association of the variables; however, in practice, the null hypothesis usually implies that there is no association between the variables. For example, when searching for differentially expressed genes, the null hypothesis usually states that all experimental groups have an equal mean (or median) expression, thereby implying that expression is not associated with the experimental groups. The null hypothesis is used to derive a reference distribution of a test statistic (such as a t-statistic); this distribution is called the ‘null distribution,’ and it describes the variability of that statistic due to chance. The procedure compares the test statistic to its null distribution and computes a P-value to summarize the comparison. A small P-value indicates that the test statistic lies in the extremities of the null distribution; this finding suggests that the null hypothesis does not accurately describe the association of the considered variables. In statistical jargon, the procedure ‘rejects the null hypothesis’ when the P-value is less than some threshold such as 0.05. Commonly, a result is described as ‘statistically significant’ or as ‘significant’ when the procedure rejects the null hypothesis. In this article, the term ‘significant result’ will be used. Conversely, the procedure fails to reject the null hypothesis when the P-value is greater than the preselected threshold. Commonly, a result is described as ‘insignificant’ or ‘not statistically significant’ when the procedure fails to reject the null hypothesis.

The various statistical hypothesis testing procedures, such as the t-test, analysis of variance (ANOVA), etc., use different assumptions to derive the null distribution of a test statistic. The reliability of a statistical procedure depends in part on the validity of the assumptions it uses to derive the null distribution of the test statistic. One class of procedures, commonly referred to as parametric tests, assumes that the chance variation of the data can be accurately modelled by a particular probability distribution, such as the normal distribution. Another class of procedures, commonly referred to as exact tests, uses the observed data to empirically derive a null distribution. For simple comparisons of two or more groups, exact tests assume that the data values are identically distributed within each group. Under this assumption, the group names can be thought of as arbitrary labels assigned to data values completely at random. Exact tests compute the test statistic for every possible assignment of group labels to data values. The resulting set of test statistics is an empirical null distribution. The P-value is computed by comparing the test statistic computed from the label assignments in the original data set to the empirical null distribution. In practice, there are cases in which it is not feasible to enumerate all the possible label assignments. In these cases, permutation methods can be used to approximate the empirical null distribution. Permutation methods compute the test statistic for each of many randomly selected assignments of group labels to the data values to obtain an approximate empirical null distribution that is used to compute the P-value. Parametric tests typically offer greater statistical power (i.e. probability that the procedure correctly rejects a false null hypothesis) than the other methods when the assumed model accurately describes the distribution of the data. However, permutation methods are based on less restrictive assumptions and therefore provide reliable inferences in a broader array of settings than do analogous parametric methods.
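
The distinction between an exact test and a permutation approximation can be illustrated with a minimal sketch for a two-group comparison. The function names, the choice of the absolute difference in group means as the test statistic, the group labels "A" and "B", and the convention of counting the observed labelling among the random ones are all assumptions made for this illustration; they are not taken from any specific method reviewed here.

```python
import itertools
import random
from statistics import mean


def mean_difference(values, labels):
    """Test statistic (assumed here): absolute difference of the group means."""
    group_a = [v for v, g in zip(values, labels) if g == "A"]
    group_b = [v for v, g in zip(values, labels) if g == "B"]
    return abs(mean(group_a) - mean(group_b))


def exact_p_value(values, labels):
    """Enumerate every label assignment that keeps the group sizes fixed."""
    observed = mean_difference(values, labels)
    n = len(values)
    n_a = labels.count("A")
    null_stats = []
    for a_idx in itertools.combinations(range(n), n_a):
        a_set = set(a_idx)
        relabelled = ["A" if i in a_set else "B" for i in range(n)]
        null_stats.append(mean_difference(values, relabelled))
    # Proportion of assignments at least as extreme as the observed one
    # (small tolerance guards against floating-point ties).
    extreme = sum(s >= observed - 1e-12 for s in null_stats)
    return extreme / len(null_stats)


def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Approximate the empirical null distribution with random relabellings."""
    rng = random.Random(seed)
    observed = mean_difference(values, labels)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_difference(values, shuffled) >= observed - 1e-12:
            extreme += 1
    # Count the observed labelling itself so the P-value is never zero.
    return (extreme + 1) / (n_perm + 1)


# Hypothetical toy data: four values per group, fully separated.
toy_values = [1, 2, 3, 4, 10, 11, 12, 13]
toy_labels = ["A"] * 4 + ["B"] * 4
```

With the toy data, only 2 of the 70 possible label assignments are as extreme as the observed one, so the exact P-value is 2/70; the permutation P-value approximates this value without enumerating all assignments.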

Each time a statistical test is performed, one of four outcomes occurs, depending on whether the null hypothesis is true and whether the statistical procedure rejects the null hypothesis (Table 1): the procedure rejects a true null hypothesis (i.e. a false positive); the procedure fails to reject a true null hypothesis (i.e. a true negative); the procedure rejects a false null hypothesis (i.e. a true positive); or the procedure fails to reject a false null hypothesis (i.e. a false negative) [3]. Therefore, each time a statistical test is performed, there is some probability that the procedure will suggest an incorrect inference. When only one hypothesis is to be tested, the probability of each type of erroneous inference can be limited to tolerable levels by carefully planning the experiment and the statistical analysis. In this simple setting, the probability of a false positive can be limited by preselecting the P-value threshold used to determine whether to reject the null hypothesis. The probability of a false negative can be limited by performing an experiment with adequate replication. Statistical power calculations can determine how much replication is required to achieve a desired level of control of the probability of a false negative result. When multiple tests are performed, as in the analysis of microarray data, it is even more critical to carefully plan the experiment and statistical analysis to reduce the occurrence of erroneous inferences.


Table 1: Four possible hypothesis testing outcomes

 

    THE PROBLEM OF MULTIPLE TESTING
 
The analysis of microarray data usually requires that many statistical hypothesis tests be performed. Typically, one or more tests are applied for each feature queried in the experiment. For example, to identify differentially expressed genes, one may apply a statistical procedure to the data of each feature to test whether the feature has the same mean expression across all experimental groups. Each statistical test has a certain probability of suggesting an erroneous inference. For instance, 5% of all features that are not associated with the trait of interest are expected to be declared statistically significant if all P-values < 0.05 are considered statistically significant. Numerous false positives could occur simply because there are many features not associated with the trait of interest. Also, the choice of the threshold will affect the number of false negatives in an experiment. Reducing the threshold of significance increases the stringency of the statistical tests and may substantially increase the number of false negatives. Consequently, choosing the P-value threshold used to determine statistical significance is a delicate problem that requires very careful attention. Additionally, the results must be appropriately interpreted after the P-value threshold is chosen.
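
The scale of the problem can be made concrete with a small calculation. The sketch below assumes that P-values from true null hypotheses are independent and uniformly distributed over (0,1); the function names and the figure of 9,500 unassociated features are hypothetical and chosen only for illustration.

```python
import random


def expected_false_positives(n_true_nulls, threshold):
    """Expected count of true-null features with P-value below the threshold."""
    return n_true_nulls * threshold


def simulate_false_positives(n_true_nulls, threshold, seed=0):
    """Draw uniform null P-values and count how many fall below the threshold."""
    rng = random.Random(seed)
    return sum(rng.random() < threshold for _ in range(n_true_nulls))


# Hypothetical array with 9,500 features not associated with the trait.
expected = expected_false_positives(9500, 0.05)   # about 475
simulated = simulate_false_positives(9500, 0.05)  # varies around 475
```

Under these assumptions, a threshold of 0.05 is expected to yield roughly 475 false positives, regardless of how many truly associated features the array contains.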

Several methods account for multiple testing when determining which results should be considered statistically significant. These methods are called ‘multiple testing procedures’ or ‘multiple comparison procedures’. However, it can be difficult for an investigator to choose a procedure that is appropriate for a specific application. Understanding some key differences between the various methods can help in selecting the procedure that should be used in a particular setting.


    ERROR RATES FOR MULTIPLE TESTING
 
Every multiple testing procedure uses some error rate to measure the occurrence of incorrect inferences, and the procedures differ in which error rate they use. Most error rates focus on the occurrence of false positives, but a few also consider the occurrence of false negatives. Some error rates that have been used in the multiple testing setting are described next.

Classical multiple testing procedures use the family-wise error rate (FWER). The FWER is defined as the probability that the analysis yields any false positive findings. The FWER was quickly recognized as being too conservative for the analysis of microarray data, because in many applications, the only way to limit the probability that any of thousands of statistical tests yields a false positive inference is to not allow any result to be deemed significant. A similar, but less stringent, error rate is the generalized family-wise error rate (gFWER or k-FWER). The k-FWER is the probability that k or more of the significant findings are actually false positives. Recently, some procedures have been proposed that use the gFWER to measure the occurrence of false positives [4].
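
A short calculation shows why FWER control becomes so stringent with thousands of tests. The sketch below assumes that all tests are independent and that every null hypothesis is true, which is an idealization; the Bonferroni threshold is used here only as a familiar example of a classical FWER procedure and is not discussed in this review. The function names are hypothetical.

```python
def fwer_independent(alpha, n_tests):
    """Probability of at least one false positive among independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(fwer_level, n_tests):
    """Per-test P-value threshold that keeps the FWER at or below fwer_level."""
    return fwer_level / n_tests


# With 100 tests at the 0.05 level, a false positive is almost certain.
fwer_100 = fwer_independent(0.05, 100)

# With 10,000 tests, the per-test threshold must shrink to 0.000005.
per_test = bonferroni_threshold(0.05, 10000)
```

The per-test threshold needed for 10,000 tests is so small that few, if any, features are likely to reach it in a study with modest replication.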

The false discovery rate [5] (FDR) is now recognized as a useful measure of the occurrence of false positives in microarray studies [2]. The FDR can be interpreted as the expected proportion of significant findings that are indeed false positives. The positive false discovery rate [6] (pFDR) and conditional false discovery rate [7] (cFDR) have similar interpretations and have also been proposed as useful error rates for addressing multiple testing in the analysis of microarray data. The FDR, pFDR and cFDR are reasonable error rates for microarray studies because they can naturally be translated into terms of the costs of attempting to validate false positive results.
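
For reference, these three error rates are commonly written as follows. The notation is an assumption introduced here: V denotes the number of false positives, R the number of significant results, and r > 0 the observed value of R.

```latex
\mathrm{FDR} = E\!\left[\frac{V}{\max(R,1)}\right], \qquad
\mathrm{pFDR} = E\!\left[\frac{V}{R}\,\middle|\,R>0\right], \qquad
\mathrm{cFDR} = E\!\left[\frac{V}{R}\,\middle|\,R=r\right] = \frac{E[V \mid R=r]}{r}
```

The three quantities differ only in how they treat the event that no result is significant, which is why they have similar interpretations when many results are declared significant.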

Other criteria have recently been proposed to measure the occurrence of incorrect inferences in the multiple testing settings that arise in the analysis of microarray data. The total error criterion [8, 9] (TEC) is the expected sum of the number of false positives and the number of false negatives. The profile information criterion [9] (PIC) also measures the balance between false positives and false negatives. The probability that the proportion of significant findings that are false positives is greater than a user-specified limit, which is also known as the tail probability of the proportion of false positives, has also been suggested as another useful error rate that could be used by a multiple testing procedure [4].


    OTHER PRINCIPLES THAT DISTINGUISH MULTIPLE TESTING PROCEDURES
 
Several principles distinguish the various multiple testing procedures from one another. The procedures differ according to: whether their objective is error estimation or error control; how they account for correlation; their computational demands and complexity; how rigorously they have been validated; and what conditions are most likely to lead them to produce reliable or unreliable results.

Most multiple testing procedures applied to the analysis of microarray data have one of two general objectives: estimation of a particular error rate or control of a particular error rate. Control procedures seek to determine a threshold for significance in such a manner that the error rate is limited to being less than or equal to a prespecified level of tolerance. On the other hand, estimation procedures seek to accurately estimate the value of an error rate for a user-selected threshold of significance. A set of results that are significant at a given level of error control are generally considered to be more definitive than a set of results with an estimated equal error rate. However, it can be difficult to choose an appropriate level of tolerance for an error rate before performing the analysis, which is a drawback of using a control method.
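
The estimation objective can be sketched with a simple plug-in calculation: for a user-selected threshold, divide the expected number of false positives by the observed number of significant results. This is an illustrative estimator only, not the estimator of any specific method reviewed here. It assumes that null P-values are uniformly distributed, and the default null proportion of 1 (every feature treated as a potential true null) is a deliberately conservative assumption. The function and parameter names are hypothetical.

```python
def estimate_fdr(p_values, threshold, null_proportion=1.0):
    """Plug-in FDR estimate for the rule 'declare significant if P <= threshold'."""
    m = len(p_values)
    n_significant = sum(p <= threshold for p in p_values)
    if n_significant == 0:
        # No significant results, so no false discoveries are reported.
        return 0.0
    expected_false = null_proportion * m * threshold
    return min(1.0, expected_false / n_significant)


# Hypothetical P-values for ten features.
toy_p = [0.001, 0.002, 0.003, 0.2, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99]
fdr_at_001 = estimate_fdr(toy_p, 0.01)  # 10 * 0.01 / 3
```

A control procedure works in the opposite direction: the tolerated error rate is fixed first, and the threshold is then derived from the data.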

Multiple testing methods differ in how they account for correlations among the collection of test statistics (or P-values) computed in an analysis. In statistical language, these differences can be summarized by stating whether the method performs inferences by considering the marginal distributions of the test statistics (or P-values) or the joint distribution of the collection of the test statistics (or P-values) as a group. The marginal distribution of an individual test statistic describes how that statistic varies due to chance, without considering how it may be correlated with the other test statistics. The joint distribution of a collection of test statistics describes the chance variation of all statistics as a group. Some methods accept a collection of marginal P-values, i.e. P-values computed using the marginal null distributions of the test statistics, as input. Marginal P-values can be computed by a parametric procedure, a rank-based procedure or a permutation procedure. Methods that use marginal P-values assume that the effects of correlation between test statistics are negligible. Other methods perform inferences that account for multiple testing by using an empirical joint distribution of the test statistics derived by permutation.

Multiple testing procedures vary in their computational demands and complexity. As previously mentioned, some methods use permutation, which can require substantial computing time in some applications. Also, some methods use the bootstrap [10], a computationally demanding resampling procedure. Many procedures perform relatively simple calculations on the set of marginal P-values. The computationally intensive procedures can offer some robustness properties, such as more fully accounting for the effects of correlation. However, it is not always clear if the gains in robustness warrant the computational efforts that are required. A later section discusses how to balance the trade-off between robustness and computational effort in choosing a procedure for a particular application.

The multiple testing methods have been validated with varying degrees of rigour. The various techniques used to validate statistical methods each have advantages and disadvantages [11]. A rigorous mathematical proof is considered the gold standard to establish the properties of a statistical method. However, a mathematical proof establishes the properties of the method only under an assumed scenario, which may be unrealistic in application. Therefore, it has been especially difficult to use proofs to establish that a statistical method has desirable operating characteristics for the complex settings most likely to arise in the analysis of microarray data. Nevertheless, some methods have proven mathematical properties under a fairly diverse range of hypothetical settings. Another validation method is simulation. In simulation, many data sets are generated under an assumed setting, methods are applied to analyse those data sets, and then the performance of the methods is summarized across the data sets. Again, properties demonstrated in a simulation hold only under the assumed setting. Still, simulations are a valuable tool for validating and studying the performance properties of statistical procedures. Also, simulations are typically a more feasible way to validate a method than mathematical proofs.

The performance and reliability of multiple testing procedures depend primarily on how accurately the assumptions of the method reflect reality. Some methods fit mixture models with uniform and beta components to the observed set of P-values [3, 12]. These methods can give accurate estimates of the error rates when the fitted model accurately represents the observed P-value distribution [3]. In some cases, however, these models do not fit the observed P-values very well; therefore, the resulting error rate estimates are likely to be quite inaccurate [13]. Furthermore, methods that operate on marginal P-values typically assume that P-values arising from statistical tests of true null hypotheses (i.e. corresponding to non-differentially expressed microarray features) are statistically independent and uniformly distributed over the interval (0,1). Strictly speaking, the assumption regarding statistical independence is probably not true, because genes operate in pathways, and the results of analyses of genes in a common pathway are correlated. Nevertheless, some of the methods operating on P-values are robust against certain mild violations of this assumption [14]. In particular, many of the methods that operate on P-values are robust in settings where the number of correlated features is small relative to the number of queried features [14]. Genome-wide experiments typically fall into this type of setting because the number of genes in any pathway is small relative to the number of genes represented on the array. Nevertheless, in some experiments the number of correlated features may be large relative to the number of features represented on the array. Resampling methods offer an alternative approach in the context of experiments having correlation structures that influence a substantial proportion of the queried features [15].


    BRIEF DESCRIPTIONS OF METHODS
 
This review briefly describes some methods that use marginal P-values to estimate or control the FDR, pFDR or cFDR. Table 2 gives an overview of these methods. Two extensive reviews of many other multiple testing procedures have been published recently [16, 17].


Table 2: Multiple testing procedures that operate on P-values to estimate or control the FDR or related metrics

 
Benjamini and Hochberg [5] introduced the FDR as a useful error rate for the multiple testing setting. The FDR is the expected proportion of false discoveries among a set of reported significant results. They also proposed a method (which we call BH95 to reflect the authors and publication year) that operates on P-values to control the FDR at a prespecified level. BH95 has been mathematically proven to conservatively control the FDR if the P-values testing the true null hypotheses (i.e. features that are truly not differentially expressed) are statistically independent and uniformly distributed over the interval (0,1). The control is conservative in the sense that the actual level of control is the prespecified level times the proportion of tests with a true null hypothesis (i.e. the null proportion); hence, it is possible to develop a method that finds more significant results and still controls the FDR at the prespecified level. The BH95 procedure has also been described in terms of ‘FDR-adjusted P-values’ [18]. The marginal P-values are used to compute FDR-adjusted P-values, which are then compared to a prespecified threshold for FDR control to determine statistical significance. FDR-adjusted P-values less than the threshold are considered statistically significant.
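
The FDR-adjusted P-values of BH95 can be computed with a few lines of code. The following is a minimal sketch of the usual step-up formulation, in which the P-value of rank i among m tests is multiplied by m/i and monotonicity is then enforced from the largest P-value downwards. The function names are hypothetical, and the sketch compares adjusted P-values with the threshold using "less than or equal to", which is a common convention and an assumption here.

```python
def bh_adjusted_p_values(p_values):
    """FDR-adjusted P-values in the step-up style of BH95 (illustrative sketch)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that adjusted
    # values never decrease as the raw P-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bh_significant(p_values, fdr_level=0.05):
    """Flag features whose adjusted P-value is at or below the FDR level."""
    return [q <= fdr_level for q in bh_adjusted_p_values(p_values)]


# Hypothetical P-values for ten features.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
toy_adjusted = bh_adjusted_p_values(toy_p)
toy_flags = bh_significant(toy_p, fdr_level=0.05)
```

For these toy P-values, five results have raw P-values below 0.05, but only the two smallest remain significant when the FDR is controlled at 0.05.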

Yekutieli and Benjamini [15] introduced a resampling-based method (YB99) to provide approximate control of the FDR under dependency. This procedure computes P-values for each of many data sets generated by using resampling methods that empirically simulate the P-values under the complete null hypothesis. The simulated distribution of the P-values is used to compute ‘FDR local estimates’, which are then used to approximately control the FDR under dependency.

Benjamini and Hochberg introduced another method (BH00) for FDR control. BH00 first applies BH95 to the set of observed P-values [19]. If BH95 declares any results significant, then BH00 computes an estimate of the null proportion (i.e. the proportion of queried features that are truly not differentially expressed) to adjust the results in a manner that may lead to additional significant findings. BH00 offers limited improvements in statistical power over BH95, because the estimate of the null proportion is very conservative [20].

Benjamini and Yekutieli [21] proved that the BH95 procedure maintains its control properties under certain forms of dependency. They also introduced a method (BY01) that is a simple modification of BH95. They proved that BY01 controls the FDR under any form of dependency and for any P-value distribution, continuous or discrete. They also mentioned that BY01 has considerably less statistical power than do BH95, YB99 or BH00. Therefore, BH95, YB99 and BH00 should be preferred over BY01, except when a large proportion of all features are very strongly correlated with one another.
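
The loss of power of BY01 is easy to see in code. As commonly described, the modification inflates each step-up adjusted P-value by the harmonic sum 1 + 1/2 + ... + 1/m, where m is the number of tests; this description and the function name below are assumptions of this sketch rather than details given in this review.

```python
def by_adjusted_p_values(p_values):
    """Step-up adjusted P-values inflated by the harmonic sum (BY01-style sketch)."""
    m = len(p_values)
    harmonic = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m * harmonic / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P-values for four features; the harmonic sum is 25/12 here.
toy_by = by_adjusted_p_values([0.01, 0.02, 0.03, 0.04])
```

The harmonic sum grows roughly like the natural logarithm of m, so with 10,000 tests every adjusted P-value is inflated by a factor of nearly 10 relative to the unmodified step-up adjustment.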

Storey [6] suggested that the pFDR is a more reasonable error rate to use in the analysis of microarray data than is the FDR. He proposed a procedure (St02) based on a quantity he calls the q-value to control the pFDR [6]. Subsequently, Storey and colleagues have published additional descriptions and theoretical properties of the q-value and the St02 procedure [2, 14, 22]. The St02 procedure can be used to reliably control the pFDR at a prespecified level, when the P-values testing a true null hypothesis are statistically independent and continuously distributed over the interval (0,1). Additionally, St02 has greater statistical power than BH95 in any setting [2]. The power improvements are accomplished by incorporating an estimate of the null proportion into the procedure. Additionally, St02 reliably controls the pFDR under a set of correlation structures that are similar to those under which BH95 reliably controls the FDR [14]. Furthermore, the pFDR can be interpreted as the Bayesian posterior probability that a significant result is a false discovery [22].
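
The role of the null proportion estimate can be sketched as follows. The estimate below counts the P-values above a tuning value and rescales that count, on the assumption that P-values above the tuning value come almost entirely from true null hypotheses. The fixed tuning value of 0.5, the function names and the cap at 1 are assumptions of this sketch; the published procedure offers more refined ways to choose the tuning value.

```python
def estimate_null_proportion(p_values, lam=0.5):
    """Estimate the null proportion from the P-values larger than lam."""
    m = len(p_values)
    n_above = sum(p > lam for p in p_values)
    return min(1.0, n_above / (m * (1.0 - lam)))


def q_values(p_values, lam=0.5):
    """q-value sketch: step-up adjustment scaled by the estimated null proportion."""
    m = len(p_values)
    pi0 = estimate_null_proportion(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    # Note: if no P-value exceeds lam, pi0 is 0 and every q-value is 0,
    # a degenerate case that a production implementation should handle.
    return q


# Hypothetical P-values for ten features; four exceed 0.5, so pi0 = 0.8.
toy_p = [0.001, 0.01, 0.02, 0.03, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_pi0 = estimate_null_proportion(toy_p)
toy_q = q_values(toy_p)
```

Because the estimated null proportion never exceeds 1 in this sketch, each q-value is no larger than the corresponding adjusted P-value without the scaling, which is the source of the gain in power.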

Allison et al. [12] have introduced a method (Al02) that fits a mixture model to the observed set of P-values. The mixture model consists of a continuous uniform component and one or more beta components. The method uses a bootstrap test to first determine if the set of observed P-values differs significantly from a uniform distribution. If significant departure from the uniform distribution is detected, then a mixture model with a uniform component and a single two-parameter beta component is fit to the observed P-values. A bootstrap test is used to determine whether incorporation of another beta component into the mixture model would significantly improve the model's fit. This process is repeated until it is determined that adding another beta component will not significantly improve model fit. The final fitted model is then used to compute estimates of the FDR. Al02 implicitly assumes that all P-values are statistically independent, but its FDR estimates should be reasonably accurate when the P-values are mildly correlated, as long as the model fits well.

Pounds and Morris [3] also introduced a method (PM03) that fits a mixture model to the observed set of P-values. The mixture model has a continuous uniform component and a one-parameter beta component. Maximum likelihood estimation is used to fit this model to the observed set of P-values. The fitted model is geometrically partitioned into four regions that correspond to the four distinct hypothesis testing outcomes defined earlier (i.e. false positives, false negatives, true positives and true negatives). The geometric partition of the fitted model is used to compute estimates of the FDR and other multiple testing error rates. PM03 implicitly assumes that all P-values are statistically independent. The error rate estimates should be accurate when the correlation between P-values is mild and limited in scope relative to the number of tests and the model fits well. Methods to assess model fit are described in the next section.
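
A mixture of a uniform component and a one-parameter beta component can be sketched in code as follows. The density form lam + (1 - lam) * a * p**(a - 1) with 0 < lam < 1 and 0 < a < 1, the upper bound lam + (1 - lam) * a on the null proportion, and the resulting FDR expression reflect one common way of writing such a model and should be read as assumptions of this sketch. The coarse grid search is a stand-in for a proper numerical maximum likelihood routine, and all function names are hypothetical.

```python
import math


def bum_log_likelihood(p_values, lam, a):
    """Log-likelihood of the uniform-plus-beta mixture; P-values must be > 0."""
    return sum(math.log(lam + (1.0 - lam) * a * p ** (a - 1.0)) for p in p_values)


def fit_bum(p_values, grid_size=20):
    """Coarse grid search for the (lam, a) pair with the largest log-likelihood."""
    best_ll, best_lam, best_a = -math.inf, None, None
    for i in range(1, grid_size):
        lam = i / grid_size
        for j in range(1, grid_size):
            a = j / grid_size
            ll = bum_log_likelihood(p_values, lam, a)
            if ll > best_ll:
                best_ll, best_lam, best_a = ll, lam, a
    return best_lam, best_a


def bum_fdr(threshold, lam, a):
    """FDR estimate at a P-value threshold from the fitted mixture."""
    pi_upper = lam + (1.0 - lam) * a  # upper bound on the null proportion
    cdf = lam * threshold + (1.0 - lam) * threshold ** a
    return pi_upper * threshold / cdf
```

In use, one would call fit_bum on the observed P-values and pass the fitted pair to bum_fdr for each threshold of interest; the estimates are only as trustworthy as the fit of the model to the observed P-value distribution.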

Tsai, Hsueh and Chen [7] proposed two models for the distribution of the number of each of the four hypothesis testing outcomes (Table 1) and used those models to develop a method (THC03) to estimate the FDR, pFDR or cFDR. They showed that the estimates of these metrics produced by THC03 are very similar in a simulation study and in a case study analysis.

Pounds and Cheng [13] introduced a method (PC04) that uses a special non-parametric density estimator called the spacings loess histogram (SPLOSH) to describe the observed distribution of P-values. The area under the SPLOSH density estimate is partitioned into four regions corresponding to the four distinct hypothesis testing outcomes. The partitioned SPLOSH density estimate is then used to compute estimates of the cFDR and other multiple testing error rates. In a simulation study and a case study analysis, PC04 produced more accurate estimates of the cFDR than did St02 or PM03.

Liao et al. [23] introduced a method (L04) that fits a flexible piecewise proportional hazards model to the observed distribution of P-values. The model imposes the constraint that P-values testing features that are truly differentially expressed are stochastically less than (i.e. tend to be smaller than) P-values testing features that are not differentially expressed. The fitted model is used to estimate a local FDR. In simulation studies and real data examples, L04 was more conservative than St02 [23].

Cheng et al. [9] have argued that in certain applications it is difficult to prespecify an FDR level to control, and it may be more desirable to balance the trade-off between false positives and false negatives. They introduced three data-driven criteria for statistical significance (Ch04): the PIC, the TEC and the guide-gene driven threshold, which incorporates prior biological knowledge of the study into considerations of statistical significance. The P-value threshold for statistical significance is determined by minimizing the PIC or TEC or by examining the P-values of the guide genes. Additionally, they proposed a spline-based FDR estimator for the significance level determined from any of the three criteria.


    CHOOSING A METHOD FOR A SPECIFIC APPLICATION
 
Three key observations can guide the selection of a reasonable multiple testing procedure for many applications. These observations include: whether statistical sample size calculations were used in planning the experiment, expectations regarding the magnitude and extent of correlation among features measured on the array and whether the P-values can be accurately represented by a continuous probability distribution.

The first observation to make in choosing a multiple testing procedure is whether statistical power calculations were used to plan the sample size for the experiment. This observation should help determine whether one should choose to control or estimate the FDR, pFDR or cFDR. Table 2 identifies which procedures are control methods and which are estimation methods. If the sample size was chosen to assure adequate statistical power to detect differential expression of a practically relevant magnitude, then it is most desirable to control the FDR, pFDR or cFDR at the prespecified level. The sample size calculations performed in planning the experiment provide some assurance that features differentially expressed at a magnitude of practical importance would be detected when the control procedures are appropriately applied. Therefore, in the context of such careful planning, results that are not statistically significant can be more legitimately interpreted as also being of little practical relevance. If the experimental sample size was not determined by statistical power calculations, then it is more advisable to accurately estimate the FDR, pFDR or cFDR as a function of the P-value significance threshold. Presently, it is somewhat uncommon in practice for the sample size to be determined by statistical power calculations [24]. Nevertheless, it is difficult to overstate the importance of adequate replication in a microarray experiment [25]. Elaborate analytical approaches cannot serve as a substitute for the information provided by adequate replication (biological or technical). Presently, there are many methods to perform statistical power calculations to aid in determining the level of replication that is needed for a particular application when the final analysis uses the FWER [26, 27], kFWER [28] or FDR [29–35] to address the multiple testing problem. However, currently available power calculation methods do not carefully account for the effect of correlations between genes; therefore, one should keep in mind that the actual number of replicates needed may be somewhat greater than what these methods suggest [32].
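To make the idea of estimating the FDR as a function of the P-value significance threshold concrete, the following Python sketch uses a simple plug-in estimator in the spirit of St02 [6]: the estimated null proportion multiplied by the number of tests and the threshold, divided by the number of P-values at or below the threshold. The function names, the tuning value lam = 0.5 and the simulated P-values are illustrative assumptions and are not taken from any of the reviewed software.

```python
import numpy as np

def estimate_null_proportion(pvals, lam=0.5):
    """Fraction of P-values above lam, rescaled by 1 / (1 - lam), capped at 1."""
    pvals = np.asarray(pvals, dtype=float)
    return min(1.0, float(np.mean(pvals > lam)) / (1.0 - lam))

def estimate_fdr(pvals, thresholds, lam=0.5):
    """Plug-in FDR estimate at each P-value threshold t: pi0 * m * t / #{p <= t}."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    pi0 = estimate_null_proportion(pvals, lam)
    estimates = []
    for t in thresholds:
        # Guard against division by zero when nothing is significant.
        n_significant = max(1, int(np.sum(pvals <= t)))
        estimates.append(min(1.0, pi0 * m * t / n_significant))
    return np.array(estimates)

# Simulated P-values (an assumption for illustration only):
# 900 null features (uniform) and 100 features with P-values concentrated near zero.
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=900), rng.beta(0.2, 5.0, size=100)])

thresholds = np.array([0.001, 0.01, 0.05])
fdr_hat = estimate_fdr(pvals, thresholds)
for t, f in zip(thresholds, fdr_hat):
    print(f"threshold {t:.3f}: estimated FDR {f:.3f}")
```

Reporting the estimate over a grid of thresholds, rather than at a single prespecified level, matches the estimation strategy recommended above for experiments that were not planned with power calculations.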

Secondly, one should consider the magnitude and extent of correlation between the P-values. The St02 procedure is reliable when correlation is limited in scope relative to the number of features queried [14]. Thus, if it is reasonable to anticipate that the number of features involved in any biological pathway is small relative to the number of features queried in the experiment, then St02 should give reliable control of the pFDR. Additionally, all the methods, except YB99, apply a very similar set of operations to the P-values, hence it is reasonable to believe that they will reliably estimate or control their respective error rates under these types of correlation structures. Therefore, using a method that operates on P-values is reasonable in the analysis of genome-wide experiments. However, experiments that focus on very specific biological pathways or functions may have correlations that are of high magnitude and wide extent. For example, relatively few genes on a genome-wide array are likely to be strongly correlated with one another, but it is likely that most genes on a cancer array or boutique array are strongly correlated with one another. In focused experiments, a resampling method such as YB99 should be used to account for the dependencies present in the data. However, the statistical power of YB99 will be severely limited when the sample size is small.

Finally, one should consider whether it is reasonable to use continuous probability distributions to model the behaviour of the observed P-values. With the exceptions of BH95 and BY01, all methods in Table 2 that operate on P-values utilize an estimate of the null proportion, i.e. the proportion of features that are truly not differentially expressed. The methods used to obtain the estimate of the null proportion assume that the P-values have a continuous distribution. If the P-values produced in the analysis are discretely distributed with wide gaps between realized values, then the estimates of the null proportion and methods that use those estimates are unreliable. Such discrete P-values may be produced by applying statistical procedures based on ranks or contingency tables to data sets with small sample sizes. When permutation is applied to a data set with small sample sizes, the resulting P-values will also be very discrete [36].
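The discreteness of permutation P-values in small samples can be illustrated with a short calculation. The sketch below assumes a two-group comparison in which every distinct assignment of samples to groups is enumerated; under that assumption each P-value is a multiple of one over the number of assignments. The function name is hypothetical.

```python
from math import comb

def permutation_pvalue_grid(n1, n2):
    """Return the number of distinct two-group assignments and the
    smallest attainable P-value under complete enumeration."""
    n_assignments = comb(n1 + n2, n1)
    return n_assignments, 1.0 / n_assignments

for n1, n2 in [(3, 3), (5, 5), (10, 10)]:
    n_assignments, p_min = permutation_pvalue_grid(n1, n2)
    print(f"n1={n1}, n2={n2}: {n_assignments} assignments, "
          f"smallest P-value {p_min:.2g}")
```

With three samples per group there are only 20 assignments, so every P-value is a multiple of 0.05 and the gaps between attainable values are wide.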

The three considerations described earlier can be used to roughly classify each application into one of eight categories (Table 3). The classification of the application often suggests an appropriate method to use. However, there are still some application classes for which there is no clear choice. For example, there is no clear choice of an FDR procedure for an application with mild or limited correlation between features, discrete P-values and experimental planning not based on statistical power considerations. In this setting, the first two considerations suggest that an estimation method that operates on P-values should be applied. However, the third consideration makes clear that the estimation methods are not capable of producing an accurate estimate of the null proportion. Until a more appropriate method is developed, a practical approach is to use BH95 and simply report the FDR-adjusted P-values. However, in such a setting, it is important that the FDR-adjusted P-values not be interpreted as an estimate of the proportion of significant findings that are false discoveries. Interpreting the FDR-adjusted P-values in this manner is likely to understate the proportion of the reported results that are actually false discoveries [13]. As indicated in Table 3, special care should be given when interpreting the results from BH95. Also, the BY01 procedure is tentatively recommended when the experimental plan was not based on statistical power considerations and the dependency is strong and extensive.
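For readers who wish to see what reporting FDR-adjusted P-values involves, the following Python sketch computes them in the usual step-up form associated with BH95 [5]: each ordered P-value is multiplied by the number of tests divided by its rank, and the results are then made monotone. The function name and the example P-values are illustrative assumptions.

```python
import numpy as np

def bh_adjusted_pvalues(pvals):
    """FDR-adjusted P-values in the usual Benjamini-Hochberg step-up form."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    # Scale the i-th smallest P-value by m / i.
    scaled = pvals[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity by taking running minima from the largest rank down.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted_sorted = np.minimum(adjusted_sorted, 1.0)
    # Return the adjusted values in the original order of the input.
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

example = [0.01, 0.04, 0.03, 0.20]
print(bh_adjusted_pvalues(example))
```

As cautioned above, these adjusted values are a basis for deciding which findings to report under FDR control; they should not be read as estimates of the proportion of reported findings that are false.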


 
Table 3: Schema for selecting a multiple testing procedure

 

    ASSESSING THE RELIABILITY OF THE RESULTS
 
A histogram of P-values, which is a bar graph of the number or proportion of P-values that fall within certain non-overlapping intervals, is a useful tool for determining when problems are present in the analysis. This simple graphic assessment can indicate when crucial assumptions of the methods operating on P-values have been radically violated. The most desirable shape for the P-value histogram is one in which the P-values are most dense near zero and become less dense as the P-values increase (Figure 1A). This shape does not indicate any violation of the assumptions of methods operating on P-values and suggests that several features are differentially expressed, though they may not be statistically significant after adjusting for multiple testing. A U-shaped histogram is also desirable in applications in which all tests are one-sided, i.e. each test examines the plausibility of a null hypothesis that specifies an association of a specific direction (e.g. the expression of group A is greater than or equal to that of group B) [37]. A flat P-value histogram (Figure 1B) does not indicate any violation of the assumptions of methods that operate on P-values but suggests the disappointing result that very few, if any, features are differentially expressed. A P-value histogram with one or more humps (Figure 1C) can indicate that an inappropriate statistical test was used to compute the P-values, some poor quality arrays were included in the analysis, or a strong and extensive correlation structure is present in the data set. A P-value histogram with a choppy appearance (Figure 1D) indicates that the P-values are discretely distributed. A discrete P-value distribution is problematic for the St02, Al02, PM03, PC04, Ch04 and L04 procedures because they rely on an estimate of the null proportion, which is unreliable for discrete P-values, and will therefore probably yield unreliable results.


 
Figure 1: Examples of P-value histograms.

 
Additionally, it can be helpful to add a horizontal reference line to the P-value histogram at the value of the estimated null proportion [6]. A line falling far below the height of the shortest bar suggests that the estimate of the null proportion may be downward biased; therefore, the FDR estimates or control may understate the actual occurrence of false positives. Conversely, a line high above the top of the shortest bar may suggest that the method is overly conservative. It is appropriate to add this line to the histogram to assess the reliability of the null proportion estimates of BH00, St02, Al02, PM03, PC04, Ch04 and L04. Furthermore, adding the estimated density curves to the P-value histogram can aid in assessing model fit [3, 13]. Large discrepancies between the density of the fitted model and the histogram indicate a lack of fit. This diagnostic can identify when some methods produce unreliable results [13]. This is a good graphic diagnostic for any of the smoothing-based and model-based methods that operate on P-values.

There are some items to remember when using a histogram of P-values to evaluate the reliability of an FDR procedure. If the intervals are too narrow, then the histogram will have a spiky appearance, and if the intervals are too wide, it may appear too flat. This should be kept in mind when using the histograms to evaluate model fit and the reliability of the estimate of the null proportion. Wand [38] offers guidance on choosing appropriate interval widths for histograms. Additionally, when a histogram is used to evaluate the reliability of FDR estimation, its bars should be scaled so that the total area equals one. This scaling makes the histogram comparable to the curve of the fitted model and the estimate of the null proportion.
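The scaling and reference-line diagnostics described above can be sketched numerically in Python. The example below scales the histogram so that its total area equals one and compares the shortest bar with an estimate of the null proportion. The null proportion estimator shown (the fraction of P-values above 0.5, rescaled) is one simple choice assumed here for illustration, and the function names, bin count and simulated data are likewise assumptions.

```python
import numpy as np

def density_histogram(pvals, n_bins=20):
    """Histogram of P-values on [0, 1], scaled so the total bar area is one."""
    heights, edges = np.histogram(pvals, bins=n_bins, range=(0.0, 1.0), density=True)
    return heights, edges

def null_proportion_estimate(pvals, lam=0.5):
    """Fraction of P-values above lam, rescaled by 1 / (1 - lam), capped at 1."""
    pvals = np.asarray(pvals, dtype=float)
    return min(1.0, float(np.mean(pvals > lam)) / (1.0 - lam))

# Simulated P-values (illustration only): 800 null and 200 non-null features.
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(size=800), rng.beta(0.3, 4.0, size=200)])

heights, edges = density_histogram(pvals)
pi0_hat = null_proportion_estimate(pvals)
area = float(np.sum(heights * np.diff(edges)))

print(f"total area of the bars: {area:.3f}")
print(f"shortest bar: {heights.min():.3f}; reference line: {pi0_hat:.3f}")
```

Because the bars are on the density scale, the height of the reference line and the height of the shortest bar are directly comparable; changing n_bins shows how narrow intervals make the histogram spiky and wide intervals make it flat.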

Quantile–quantile (QQ) plots [39] are another useful tool to evaluate the reliability of model-based methods. To produce a QQ plot, one needs to compute the empirical quantile of each data value and the quantile of each data value under the fitted model (model quantile). The QQ plot then shows one point for each data value, with the x-coordinate given by the model quantile and the y-coordinate given by the empirical quantile. Good model fit is indicated by all points falling along a straight line. Substantial departure from the straight line indicates poor model fit and suggests that the results of a model-based procedure may be unreliable.
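The coordinates of such a QQ plot can be computed as in the following Python sketch. The fitted model is represented by its cumulative distribution function; a beta-uniform mixture form is assumed purely for illustration, and its parameters lam and a, the function names and the simulated P-values are hypothetical.

```python
import numpy as np

def bum_cdf(p, lam, a):
    """CDF of an assumed beta-uniform mixture: lam * p + (1 - lam) * p**a."""
    p = np.asarray(p, dtype=float)
    return lam * p + (1.0 - lam) * p ** a

def qq_points(pvals, model_cdf):
    """Return (model quantile, empirical quantile) for each ordered P-value."""
    p_sorted = np.sort(np.asarray(pvals, dtype=float))
    n = p_sorted.size
    empirical = (np.arange(1, n + 1) - 0.5) / n  # plotting positions
    model = model_cdf(p_sorted)
    return model, empirical

# Simulated uniform P-values; with lam = 1 the assumed model reduces to the uniform.
rng = np.random.default_rng(2)
pvals = rng.uniform(size=1000)
x, y = qq_points(pvals, lambda p: bum_cdf(p, lam=1.0, a=0.5))

max_departure = float(np.max(np.abs(y - x)))
print(f"largest departure from the line y = x: {max_departure:.3f}")
```

Plotting y against x and comparing the points with the line y = x gives the graphic diagnostic; the largest departure printed here is only a crude numerical summary of it.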

A simple computational tool can be used to check for discrete P-values. One can compute the spacings [40] between P-values, i.e. the differences between consecutive ordered P-values. If a large number of spacings are zero, then the P-value distribution may be discrete. Furthermore, large spacings may lead to unstable estimates of the null proportion. This instability could cause BH00, St02, Al02, PM03, PC04, Ch04 and L04 to yield unreliable results.
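A minimal Python sketch of this check follows. It orders the P-values, computes the spacings, and reports the fraction of zero spacings and the largest spacing. The function name and the example P-values are illustrative assumptions, and no cut-off for "a large number" of zero spacings is implied.

```python
import numpy as np

def spacing_summary(pvals):
    """Summarize the differences between consecutive ordered P-values."""
    p_sorted = np.sort(np.asarray(pvals, dtype=float))
    spacings = np.diff(p_sorted)
    return {
        "fraction_zero": float(np.mean(spacings == 0.0)),  # ties between neighbours
        "largest": float(spacings.max()),                  # widest gap
    }

# A small set of heavily tied P-values, as a rank-based test on few samples might give.
discrete_pvals = np.array([0.05, 0.05, 0.10, 0.10, 0.10, 0.50, 1.00])
summary = spacing_summary(discrete_pvals)
print(summary)
```

In this example half of the spacings are zero and the largest gap is 0.5, both of which point to a discrete P-value distribution.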


    DISCUSSION
 
Multiple testing procedures that operate on marginal P-values are applicable to a wide variety of microarray experiments. These methods can be used in conjunction with classical statistical methods to perform the statistical analysis of microarray data collected across a broad spectrum of experimental designs [3]. Many recently proposed methods for the analysis of microarray data are readily applicable only to two-group comparisons [41], but the family of methods described in this review have been applied in more complex settings. For example, PM03 has been used in conjunction with Cox regression analysis to achieve an objective as complex as identifying features associated with survival after adjusting for known prognostic factors [42].

The reliability of the results produced by these methods clearly depends on the reliability of the statistical analyses used to produce the P-values. In other words, the old adage ‘garbage in—garbage out’ applies. Therefore, it is crucial to carefully choose the statistical method used to compute the P-values and to assess whether it is reliable. Diagnostic methods are available for testing commonly made assumptions of many statistical procedures, including that the data are normally distributed [43], that all experimental groups have equal variance [44], and that those groups are statistically independent [45, 46]. However, application of these diagnostic methods to each feature's data will result in another multiple testing problem, and it is unclear how to use the results of these diagnostic statistical tests to guide further analyses. Rank-based methods, such as the rank–sum or Kruskal–Wallis tests, are based on more flexible assumptions than parametric methods such as the t-test or one-way ANOVA. The flexible assumptions make the rank-based methods robust across many settings [47, 48]. These robustness properties make the rank-based methods an attractive choice whenever the sample size is sufficiently large. Permutation methods offer similar robustness properties.

The classification of methods presented in Table 3 conceptually simplifies the selection of an appropriate procedure for a specific application. However, it is not always clear whether the correlations present in any given data set are ‘mild or limited’ or ‘strong and extensive’. Therefore, it may be difficult to determine which method to use. Unfortunately, there are currently no statistical methods to determine whether correlations in the data are such that the methods in the first row of Table 3 are unreliable [21]. Therefore, the user's understanding of the nature of the dependency is still crucial for ascertaining whether the correlation is mild and limited enough to use the method. A potentially helpful ad hoc approach to this problem would be to compute and examine the correlations between a randomly selected set of features within each experimental group. Further research is needed to develop statistical tools to assist in this determination.
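The ad hoc approach mentioned above might be sketched in Python as follows. For one experimental group, a random subset of features is drawn and the absolute pairwise correlations among them are summarized. The function name, the subset size of 50 and the simulated expression matrix are assumptions for illustration; with few arrays per group, sample correlations are themselves noisy and should be read with caution.

```python
import numpy as np

def sampled_abs_correlations(expr, n_features=50, seed=0):
    """Absolute pairwise correlations among randomly selected features.

    expr is a features-by-samples array for ONE experimental group.
    """
    expr = np.asarray(expr, dtype=float)
    rng = np.random.default_rng(seed)
    k = min(n_features, expr.shape[0])
    rows = rng.choice(expr.shape[0], size=k, replace=False)
    corr = np.corrcoef(expr[rows])            # rows are treated as variables
    upper = corr[np.triu_indices(k, k=1)]     # each pair counted once
    return np.abs(upper)

# Simulated group of 8 arrays and 1000 independent features (illustration only).
rng = np.random.default_rng(3)
group_a = rng.normal(size=(1000, 8))

abs_corr = sampled_abs_correlations(group_a, n_features=50)
print(f"pairs examined: {abs_corr.size}")
print(f"median |r|: {np.median(abs_corr):.2f}; "
      f"90th percentile: {np.percentile(abs_corr, 90):.2f}")
```

Repeating the calculation within each experimental group, and for several random subsets, gives a rough impression of whether strong correlations are rare or widespread among the features.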

To my knowledge, a good statistical method that can use discretely distributed P-values to accurately estimate the FDR is not currently available. A method to determine whether the set of discrete P-values contains significant evidence that at least some of the features are differentially expressed has been developed [36]. However, it is unclear whether or how this method can be extended to estimate the FDR. The inability to accurately estimate the null proportion is a primary obstacle to developing a method that accurately estimates the FDR in the presence of discretely distributed P-values.

It is widely recognized that the multiple testing problem cannot be ignored at the final stage of analysis. However, the multiple testing problem is also present at earlier stages of the statistical analysis of microarray data, particularly at the data filtering stage. The utility of a TEC filter for excluding poorly hybridized Affymetrix® probe sets from subsequent analysis has very recently been demonstrated [37]. Nevertheless, filtering does not always improve the FDR in the final analysis and may actually reduce the ability to discover interesting associations [37]. Future statistical research should consider how to approach sequential multiple testing inference, such as how to integrate the multiple testing procedures used in filtering and the final analysis to achieve reliable error rate control or estimation.


Key Points

  • Modern microarrays query so many features that some erroneous statistical inferences are unavoidable. The findings of a microarray study should be considered preliminary regardless of what methods are used to perform statistical analysis.
  • The results of microarray studies can be properly interpreted only in the context of reliable control or estimation of the prevalence of erroneous inferences, particularly that of false positives.
  • The false discovery rate is a useful measure of the prevalence of false positives. Roughly speaking, the false discovery rate is the proportion of significant results that are expected to be false positives.
  • Several statistical methods that use P-values to estimate or control the false discovery rate and related error rates have been developed in recent years. Most of these methods can be implemented using freely available software (Table 2).
  • Three criteria can help guide the selection of an appropriate method for a specific application (Table 3).
  • The accuracy of P-value based estimation or control of the false discovery rate depends heavily on the reliability of the statistical hypothesis testing procedure used to compute the P-values. The adequacy of the hypothesis testing procedure's assumptions should be kept in mind when interpreting results.
  • A P-value histogram, quantile–quantile plot, or the distribution of P-value spacings can indicate when the assumptions of a false discovery rate method are violated, suggesting that the results obtained by that method are questionable.

 


    Acknowledgements
 
The author wishes to thank colleagues Cheng Cheng, Wei Liu and Wenjian Yang and two anonymous reviewers for their helpful suggestions that improved the quality of this article. The author is also thankful to Angela McArthur for editorial assistance. This work was supported by the NIH Cancer Centre Support Grant CA-21765 and the American Lebanese Syrian Associated Charities (ALSAC).


    FOOTNOTES
 
Stanley Pounds is an Assistant Member of the Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN, USA. His research focuses on developing and improving statistical methods for the analysis of microarray gene expression data, with special emphasis on methods that estimate or control multiple testing error rates.

Submitted: July 21, 2005. Received (in revised form): November 3, 2005.


    REFERENCES
 

  1. Tilstone C. DNA microarrays: vital statistics. Nature 2003; 424:610–12.
  2. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Nat Acad Sci USA 2003; 100:9440–45.
  3. Pounds S, Morris SW. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of P-values. Bioinformatics 2003; 19:1236–42.
  4. Dudoit S, van der Laan MJ, Pollard KS. Multiple Testing. Part I. Single-step procedures for control of general type I error rates. Statistical Applications in Genetics and Molecular Biology 2004; 3: Article 13. Available from http://www.bepress.com/sagmb/vol3/iss1/art13. Last accessed January 11, 2006.
  5. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B 1995; 57:289–300.
  6. Storey JD. A direct approach to false discovery rates. J Roy Stat Soc B 2002; 64:479–98.
  7. Tsai C-A, Hsueh H-M, Chen JJ. Estimation of false discovery rates in multiple testing: application to gene microarray data. Biometrics 2003; 59:1071–81.
  8. Genovese C, Wasserman L. Operating characteristics and extensions of the false discovery rate procedure. J Roy Stat Soc B 2002; 64:499–517.
  9. Cheng C, Pounds S, Boyett JM, et al. Significance threshold selection criteria for massive multiple comparisons with applications to DNA microarray experiments. Statistical Applications in Genetics and Molecular Biology 2004; 3: Article 36. Available from http://www.bepress.com/sagmb/vol3/iss1/art36. Last accessed January 11, 2006.
  10. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Boca Raton, FL: Chapman and Hall/CRC 1993.
  11. Mehta T, Tanik M, Allison DB. Towards sound epistemological foundations of statistical methods for high-dimensional biology. Nat Genet 2004; 36:943–47.
  12. Allison D, Gadbury GL, Heo M, Fernandez JR, Lee C-K, Prolla TA, Weindruch R. A mixture model approach for the analysis of microarray gene expression data. Comput Stat and Data Anal 2002; 39:1–20.
  13. Pounds S, Cheng C. Improving false discovery rate estimation. Bioinformatics 2004; 20:1737–45.
  14. Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: a unified approach. J Roy Stat Soc B 2004; 66:187–205.
  15. Yekutieli D, Benjamini Y. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. J Stat Plann Infer 1999; 82:171–96.
  16. Ge Y, Dudoit S, Speed TP. Resampling-based multiple testing for microarray data analysis. Test 2003; 12:1–77.
  17. Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Statistical Science 2003; 18:71–103.
  18. Reiner A, Yekutieli D, Benjamini Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003; 19:368–75.
  19. Benjamini Y, Hochberg Y. On the adaptive control of the false discovery rate in multiple testing with independent statistics. J Educ Behav Stat 2000; 25:60–83.
  20. Hsueh H, Chen JJ, Kodell RL. Comparison of methods for estimating number of true null hypotheses in multiplicity testing. J Biopharm Stat 2003; 13:675–89.
  21. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat 2001; 29:1165–88.
  22. Storey JD. The positive false discovery rate: a Bayesian interpretation and the q-value. Ann Stat 2003; 31:2013–35.
  23. Liao JG, Lin Y, Selvanayagam ZE, Shih WJ. A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics 2004; 20:2694–701.
  24. Cui XQ, Churchill GA. How many mice and how many arrays? Replication in mouse cDNA microarray experiments. In Johnson KF, Lin SM (Eds.). Methods of Microarray Data Analysis III. Norwell, MA: Kluwer Academic Publishers 2003; 139–54.
  25. Lee M-LT, Kuo FC, Whitmore GA, et al. Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proc Nat Acad Sci USA 2000; 97:9834–39.
  26. Pan W, Lin J, Le CT. How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biology 2002; 3: Available from http://genomebiology.com/2002/3/5/research/0022. Last accessed January 11, 2006.
  27. Simon R, Radmacher MD, Dobbin K. Design of studies using DNA microarrays. Genetic Epidemiology 2002; 23:21–36.
  28. Lee M-L, Whitmore G. Power and sample size for microarray studies. Stat Med 2002; 21:3543–70.
  29. Gadbury GL, Page GP, Edwards J, et al. Power and sample size estimation in high dimensional biology. Stat Methods Med Res 2004; 13:325–38.
  30. Muller P, Parmigiani G, Robert C, et al. Optimal sample size for multiple testing: The case of gene expression microarrays. Journal of the American Statistical Association 2004; 99:990–1001.
  31. Tsai C-A, Wang S-J, Chen D-T, et al. Sample size for gene expression microarray experiments. Bioinformatics 2005; 21:1502–8.
  32. Jung S-H. Sample size for FDR-control in microarray data analysis. Bioinformatics 2005; 21:3097–104.
  33. Jung S-H, Bang H, Young S. Sample size calculation for multiple testing in microarray data analysis. Biostatistics 2005; 6:157–69.
  34. Hu J, Zou F, Wright FA. Practical FDR-based sample size calculations in microarray experiments. Bioinformatics 2005; 21:3264–72.
  35. Pounds S, Cheng C. Sample size determination for the false discovery rate. Bioinformatics 2005; 21:4263–71.
  36. Gadbury GL, Page GP, Heo M, Mountz JD, Allison DB. Randomization tests for small samples: an application for genetic expression data. Appl Stat 2003; 52:365–76.
  37. Pounds S, Cheng C. Statistical development and evaluation of gene expression data filters. J Comput Biol 2005; 12:482–95.
  38. Wand MP. Data-based choice of histogram bin width. The American Statistician 1997; 51:59–64.
  39. Mason RL, Gunst RF, Hess JL. Statistical Design and Analysis of Experiments. New York: John Wiley and Sons 1989.
  40. Pyke R. Spacings. J Roy Stat Soc B 1965; 27:395–449.
  41. Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 2002; 18:546–54.
  42. Morris JS, Guosheng Y, Baggerly K, et al. Pooling information across different studies and oligonucleotide chip types to identify prognostic genes for lung cancer. In Shoemaker JS, Lin SM (Eds.). Methods of Microarray Data Analysis IV. New York: Springer 2005.
  43. Shapiro SS, Wilk MB. An analysis of variance test for normality (complete samples). Biometrika 1965; 52:591–611.
  44. O’Neill ME, Mathews KL. Levene tests of homogeneity of variance for general block and treatment designs. Biometrics 2002; 58:216–24.
  45. O’Brien PC. A test for randomness. Biometrics 1976; 32:391–401.
  46. O’Brien PC, Dyck PJ. A runs test based on run lengths. Biometrics 1985; 41:237–44.
  47. Conover WJ. Practical nonparametric statistics. 3rd edn. New York: John Wiley and Sons 1999.
  48. Hollander M, Wolfe DA. Nonparametric statistical methods. New York: John Wiley and Sons 1999.
