Briefings in Bioinformatics Advance Access originally published online on March 7, 2006
Briefings in Bioinformatics 2006 7(2):166-177; doi:10.1093/bib/bbl002
Normalization and quantification of differential expression in gene expression microarrays
Corresponding author. Christine Steinhoff, Max Planck Institute for Molecular Genetics, Department of Computational Molecular Biology, Ihnestr 73, D-14195 Berlin, Germany. E-mail: steinhof{at}molgen.mpg.de
| ABSTRACT |
|---|
Array-based gene expression studies frequently serve to identify genes that are expressed differently under two or more conditions. The actual analysis of the data, however, may be hampered by a number of technical and statistical problems. Possible remedies on the level of computational analysis lie in appropriate preprocessing steps, proper normalization of the data and application of statistical testing procedures in the derivation of differentially expressed genes. This review summarizes methods that are available for these purposes and provides a brief overview of the available software tools.
Keywords: microarray, normalization, low-level analysis, differential gene expression
| INTRODUCTION |
|---|
Microarray technology has been around for almost 10 years and with it a plethora of computational analysis tools has been developed. Yet, the application of microarray technology in biological research still poses serious problems and causes considerable confusion on the part of the users of the technology. The lack of simple answers to the problems in this field is largely due to the wide scope of questions that can be tackled with the technology, overlaid with its technical aspects, which influence the analysis in very specific ways. This review aims at summarizing existing approaches to the early steps in the analysis pipeline, coupled with methods to tackle the supposedly simple question of finding genes that behave differently under different conditions.
A microarray experiment is performed under the assumption that gene intensities reflect actual mRNA levels. It is, however, well-known that raw gene expression intensities do not fulfill this requirement. Their values are highly influenced by a number of non-biological sources of variation (for an overview see [1, 2]). Thus, for achieving biologically meaningful data, computational preprocessing including normalization steps is essential [3].
Microarray experiments are frequently employed for the purpose of identifying genes that are expressed differently under distinct conditions. This amounts to comparing one group A with another group B and delineating a list of genes ranked according to their respective statistic of differential expression. In a further step, significance is assigned to each gene and a cut-off value can be defined (for an overview see [4-7]). Even for these seemingly simple questions, proper preprocessing and normalization are crucial, and to a certain degree the two aspects are linked with each other.
While we review the computational methods, familiarity with microarray technologies on the part of the reader is assumed. The platforms that will be considered are Affymetrix-type oligonucleotide arrays [13, 14] and two-colour spotted (cDNA-) arrays [14-16]. In the following, we use the abbreviations oligo arrays and two-dye arrays. For the latter technology, the possibility of dye-swap experiments will also be considered.
This review is structured according to the sequence of analysis steps that need to be performed. Preprocessing and normalization are dealt with in the Preprocessing and normalization methods section, while the Differential expression section deals with the quantification of differential expression. We also provide a brief overview of tools that are available to the researcher in order to carry out these analytical steps (Tables 1 and 2). Most computational procedures that are reviewed in this article can be performed by using the open source language R [8] and R packages in the Bioconductor project [9]. We recommend using R and Bioconductor. Presently, the packages provide a wide range of powerful statistical applications for various kinds of genomic analysis. They allow for the integration of different kinds of biological data and for rapid development of new statistical packages.
Recently, a number of books have been published that introduce in detail the process of DNA microarray analysis, discuss problems and drawbacks, and provide different software solutions [10-15].
| PREPROCESSING AND NORMALIZATION METHODS |
|---|
Motivation
The need for what we call preprocessing comes from the fact that in addition to reflecting mRNA levels, spot intensities may also depend on peculiarities of print tips, particular PCR reactions, integration efficiency of a dye or spatial and hybridization specific effects. These problems can partly be remedied by image processing methods, background adjustment, normalization, summary of multiple probes per transcript, or quality control measures [1, 2, 16]. Thus, such procedures are referred to as preprocessing of the data.
A simple self-self comparison will demonstrate the problem. Splitting an RNA sample into two aliquots, labelling them differently and performing a hybridization will show a summary of all these unwanted effects. The variation seen between the two equal samples is entirely due to the experimental variation, which we need to deal with in order to quantify differential gene expression later on [17].
The need for normalization arises from the observation that measurements from different hybridizations may occupy different scales. In order to compare them they need to be normalized. Otherwise, one would deem genes differentially expressed where only the hybridizations behaved differently. Additionally, the variance in the data tends to depend on the absolute intensity of the data. This, too, may lead to false biological conclusions and should be remedied by a normalization method.
For two hybridizations (or two colours of one hybridization) this latter problem is easily visualized with a scatter plot of the average of the two log intensities A versus their log ratio M [18]. This graphical representation is frequently referred to as an MA plot [16]. It shows that the variance of M changes strongly with A, i.e. while the variance is low for high values of A, it is rather large for small values of A. This is a source of possible misinterpretation of the data: a fold change of two may be highly interesting for two strongly expressed genes while it is not noteworthy when the genes come from the region of low expression. For the quantification of differential expression, we require constant variance across the whole dynamic range.
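The M and A values of such a plot can be computed directly from two sets of intensities. The following is a minimal sketch in plain Python, assuming strictly positive intensities and base-2 logarithms; the function name `ma_values` and the example intensities are purely illustrative and not taken from any package.

```python
import math

def ma_values(red, green):
    """Return (A, M) pairs for two equally long lists of positive intensities."""
    pairs = []
    for r, g in zip(red, green):
        m = math.log2(r) - math.log2(g)           # M: log ratio
        a = 0.5 * (math.log2(r) + math.log2(g))   # A: average log intensity
        pairs.append((a, m))
    return pairs

# Hypothetical intensities for four genes on a two-dye array.
red = [100.0, 2000.0, 50.0, 16000.0]
green = [50.0, 2000.0, 100.0, 8000.0]

for a, m in ma_values(red, green):
    print(f"A = {a:.2f}, M = {m:+.2f}")
```

Plotting M against A for all genes of a hybridization then gives the MA plot; the spread of M at a given A is what the normalization methods discussed later try to make uniform.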
It is a common practice to transform gene expression intensities to logarithmic scale. This makes the variation of intensities or differences less dependent on the absolute magnitude and evens out highly skewed distributions. Furthermore, logarithmic transformations convert multiplicative errors into additive ones [19]. Problems with logarithmic scale arise for negative values which occur frequently after background subtraction. For positive values close to zero, logarithmic transformation yields strongly negative values and consequently heavily scattered plots.
In recent years, a number of solutions for preprocessing and normalization have been proposed. A pipeline of the analysis procedure and an overview of frequently used methods are given in Figure 1. One of the basic questions there pertains to the user's assumption as to whether only a small fraction of the genes or large parts of them change under the studied change of conditions. This is usually a reflection of the experimental design. For example, using a specialized array containing genes relevant for a particular biological process, one expects most of the genes to change in the experiment. When normalizing data from such experimental settings, housekeeping genes, internal controls or spikes have to be used. Amongst the genes of a whole-genome array, on the other hand, only a small fraction is expected to change. Here, we will focus on methods for the latter, the general purpose array. Owing to space limitations, we will not go into detail for most methods. For a tutorial guiding through normalization procedures, see the article by Kreil [20] published last year in this journal.
Preprocessing
When analysing oligo arrays, one chooses a background correction and decides how to utilize perfect matches (PM) and mismatches (MM) in order to obtain a summary of intensities. This is frequently called a summary statistic in the literature (see [21, 22] for an overview). Irizarry et al. [21] propose a background correction that ignores MM values altogether. They offer a Bioconductor package called Robust Multi-array Analysis (RMA) that comprises background adjustment and normalization (see the Transformation methods section). Wu et al. [23] introduce a sequence-based statistical model that describes background adjustment specifically for oligo arrays. The components of the error model are estimated by a maximum likelihood approach or an empirical Bayes approach. This approach is implemented in the Bioconductor package gcrma, a modified version of RMA that describes the intensity of probes as a function of their GC-content. Li and Wong [24] establish a statistical framework that comprises an error model for perfect matches and mismatches. This setting is only applicable to oligo arrays. Their approach includes the derivation of a summary statistic. For an overview of different probe set summary methods see [25].
Likewise, for two-dye arrays there exist tools for image analysis including background corrections or to test for the above mentioned artifacts like PCR batch effects and the like [26].
Table 1 provides an overview of the freely available computational tools for preprocessing gene expression data. Having performed preprocessing for either kind of technology platform, we end up with one value per probe set or transcript represented on the array. In the following, these units will shortly be referred to as genes. Since it is not the focus of this review, we will not go into deeper detail regarding preprocessing steps. There are a number of publications dealing with this aspect [5, 11, 12, 14, 18].
Scaling methods
Applying scaling methods, one assumes that different sets of intensities differ by a constant global factor. These methods correct only for global multiplicative effects [27], since all raw intensity values are multiplied by one common (i.e. global) scaling factor. Note that for log-transformed datasets multiplicative effects become additive. The scaling factor might be derived from the mean, median, Z-score, etc. [27, 28]. Preprocessing, including standardization, as provided by the Microarray Suite Software 5.0 (MAS5.0) applies a scaling approach based on a trimmed mean. An implementation of this algorithm, adapted from the available documentation of MAS5.0, is provided in the Bioconductor package affy.
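As an illustration of global scaling, the sketch below rescales one array so that a summary of its raw intensities matches that of a reference array. It is not the MAS5.0 algorithm: the function names, the default trimming fraction and the example values are assumptions made for this example only.

```python
from statistics import median

def trimmed_mean(values, trim=0.1):
    """Mean after dropping the fraction `trim` of values at each end (trim < 0.5)."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def scale_to_reference(sample, reference, summary=median):
    """Multiply `sample` by one global factor so its summary matches the reference."""
    factor = summary(reference) / summary(sample)
    return [x * factor for x in sample], factor

# Hypothetical raw intensities: the sample is the reference measured twice as bright.
reference = [120.0, 300.0, 80.0, 950.0, 410.0]
sample = [2.0 * x for x in reference]

scaled, factor = scale_to_reference(sample, reference)
print(factor)
print(scaled)
```

Passing `summary=trimmed_mean` instead of the median gives a trimmed-mean based variant. Because a single factor is applied to all genes, intensity-dependent effects are left untouched, which is the limitation addressed by the transformation methods.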
Transformation methods
Transformation methods aim at quantitatively mapping one set of intensities to another one. They are non-parametric when no distributional assumptions are made. Mostly, these methods are based on regression. Regression can be applied either over the entire range of intensities [29, 30] or locally [31]. Depending on whether the regression function is a linear function or a polynomial function of degree larger than one, we distinguish linear and polynomial regression.
Especially for local regression, outlier values can strongly influence the regression curve. Therefore, it is advisable to introduce weights that penalize outliers. Local regression via loess/lowess (locally weighted scatter plot smooth) uses a linear (lowess) or quadratic (loess) polynomial weighted regression function with Tukey's biweight function [31] while local regression via locfit applies a tricubic weighting function. With regard to microarray normalization they perform very similarly. Workman et al. [32] proposed a normalization method where intensity pairs of two arrays are interpolated according to a cubic spline function (qspline).
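To make the idea of local regression concrete, the following sketch fits a weighted straight line around each A value and subtracts the fitted trend from M. It is a simplified stand-in, not the loess, lowess or locfit implementation: it uses tricube neighbourhood weights, omits robustness iterations against outliers, and assumes the fit is evaluated only at observed A values. All names and the `span` default are illustrative.

```python
def tricube(u):
    """Tricube weight: 1 at u = 0, decreasing to 0 for |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_linear_fit(xs, ys, x0, span=0.5):
    """Value at x0 of a weighted straight line fitted to the points nearest x0."""
    k = min(len(xs), max(3, round(span * len(xs))))
    h = sorted(abs(x - x0) for x in xs)[k - 1]    # bandwidth: k-th smallest distance
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in zip(xs, ys):
        w = tricube((x - x0) / h) if h > 0.0 else float(x == x0)
        dx = x - x0                                # centre at x0 for numerical stability
        sw += w
        swx += w * dx
        swy += w * y
        swxx += w * dx * dx
        swxy += w * dx * y
    det = sw * swxx - swx * swx
    if det <= 1e-12:                               # too few distinct points: weighted mean
        return swy / sw
    return (swy * swxx - swx * swxy) / det         # intercept = fitted value at x0

def loess_normalize(a_values, m_values, span=0.5):
    """Subtract the fitted intensity-dependent trend from each log ratio."""
    return [m - local_linear_fit(a_values, m_values, a, span)
            for a, m in zip(a_values, m_values)]

# Hypothetical data: a purely intensity-dependent dye bias and no real changes.
a_values = [6.0 + 0.5 * i for i in range(20)]
m_values = [0.8 - 0.05 * a for a in a_values]
normalized = loess_normalize(a_values, m_values)
print(max(abs(v) for v in normalized))
```

In this artificial case the trend is exactly linear, so the normalized log ratios vanish up to rounding. With real data the choice of `span` controls how closely the curve follows the cloud of points.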
Quantile normalization for oligo arrays as proposed by Bolstad et al. [33] aims at making the distribution of gene expression intensities the same for each sample. This approach is applicable to an arbitrary number of samples. Each quantile of intensities is projected to lie along the unit diagonal. This can be achieved by the following procedure: let $X(i, k)$ be the gene expression intensity of the $i$th gene in the $k$th sample. Each sample set of intensities $X(\cdot, k)$ is sorted by a permutation $\pi_k$ according to intensity values, which results in a sorted sample set $X'(\cdot, k)$. Then each intensity value $X'(i, k)$ is substituted by the mean across all samples, $\mathrm{mean}(X'(i, \cdot))$. Finally, the inverse permutation $\pi_k^{-1}$ is applied to each sample set and produces the normalized set of gene expression intensities. The approach is implemented in the Bioconductor package affy.
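The sort, average and un-sort steps of quantile normalization can be sketched in a few lines of plain Python. This is an illustration of the procedure described above, not the code of the affy package; the function name and the toy data are hypothetical, and tied values are simply kept in input order.

```python
def quantile_normalize(samples):
    """samples: list of equally long lists, one list of intensities per array."""
    n_genes = len(samples[0])
    # Sorting permutation of each sample: orders[k][r] is the gene with rank r.
    orders = [sorted(range(n_genes), key=lambda i: col[i]) for col in samples]
    # Mean of the r-th smallest value across all samples.
    rank_means = [
        sum(col[order[r]] for col, order in zip(samples, orders)) / len(samples)
        for r in range(n_genes)
    ]
    normalized = []
    for order in orders:
        out = [0.0] * n_genes
        for r, gene in enumerate(order):   # inverse permutation: rank r back to its gene
            out[gene] = rank_means[r]
        normalized.append(out)
    return normalized

# Two hypothetical arrays with three genes each.
arrays = [[5.0, 2.0, 3.0],
          [4.0, 1.0, 6.0]]
print(quantile_normalize(arrays))
```

After normalization every array contains exactly the same set of values (here 1.5, 3.5 and 5.5), and only their assignment to genes differs between arrays.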
Error model based transformation methods
The basic idea of introducing an error model is to describe the relation between measured signal intensities and the true abundance of RNA molecules. Assume that the true intensity level $x_{kg}$ of the $g$th gene in the $k$th sample is disturbed by random multiplicative ($b_{kg}$) and additive ($a_{kg}$) factors. The measurement $y_{kg}$ of the $g$th gene in the $k$th sample can be described as
$$y_{kg} = a_{kg} + b_{kg}\, x_{kg}$$
Looking at MA representations (see Motivation in the Preprocessing and normalization methods section), we observe scattered plots for low intensities whereas for high intensities this is not the case [2, 18]. This phenomenon is due to the fact that the variance depends on the intensity, i.e. for low mean intensities we find a rather high variance whereas for large intensities the variance is roughly constant. Variance stabilization provides a solution to this problem. When a variance stabilizing transformation as proposed by Huber et al. [35] and Durbin et al. [36] is applied, the variance is approximately constant across the whole dynamic range of expression intensities. Thus, it allows for quantification of differential expression independently of the mean intensities. Furthermore, this approach overcomes the shortcomings of the logarithmic transformation. Variance stabilization is performed by applying an arsinh transformation. In contrast to the logarithmic function, the arsinh function is continuous on the whole real line, has no singularity at zero and is defined for negative values.
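The behaviour of the arsinh transformation relative to the logarithm can be checked numerically. The sketch below is only an illustration: the `offset` and `scale` arguments stand in for calibration parameters that a variance stabilization method would estimate from the data, and their default values here are arbitrary assumptions.

```python
import math

def arsinh_transform(y, offset=0.0, scale=1.0):
    """arsinh of a shifted and scaled intensity; parameters are illustrative only."""
    return math.asinh((y - offset) / scale)

# Background-subtracted intensities may be negative or zero.
for y in [-50.0, 0.0, 10.0, 1000.0, 100000.0]:
    log_value = math.log(2.0 * y) if y > 0 else float("nan")
    print(f"y = {y:>9.1f}  arsinh = {arsinh_transform(y):8.4f}  log(2y) = {log_value:8.4f}")
```

Since arsinh(x) = log(x + sqrt(x^2 + 1)), the transformation approaches log(2x) for large intensities, while remaining finite and approximately linear around zero.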
Kerr et al. [38] propose an analysis of variance (ANOVA) model to capture multiple effects and their interactions. To apply this method, the experiment has to be designed in an ANOVA setting. Dye-swap experiments for example fulfill this requirement. This method provides an integrative approach to adjust for extraneous effects and to assign significance to gene expression changes. These are captured in the variety-gene interaction term. Several other methods have been proposed, e.g. [19, 39].
| DIFFERENTIAL EXPRESSION |
|---|
Motivation
The task of analysing a gene expression experiment for differential genes falls into the following steps:
- Ranking: genes are ranked according to their evidence of differential expression.
- Assigning significance: a statistical significance is being assigned to each gene.
- Cut-off value: to arrive at a limited number of differentially expressed genes a cut-off value for the statistical significance needs to be determined.
Availability of repetitions provides for a richer spectrum of applicable statistical procedures. We distinguish the experimental setting according to the number of conditions that are compared. Either we compare two groups or multiple groups (Figure 2).
In the two-condition case, one considers either a paired or unpaired situation. Comparing a healthy group with a diseased one is an example of an unpaired experiment because the samples are independent. An example of a paired situation is gene expression measurements of one cell line before and after chemical treatment (Figure 2). The availability of replicates allows for a sound statistical procedure because variation between replicates can be considered. Several methods have been published that provide an appropriate statistical framework for analysing two-condition comparisons (for an overview see [4-7, 18, 41, 42]; see the section Two-conditional setting and independent multiconditional setting).
In the case of multiple conditions, one distinguishes independent and dependent settings too (Figure 2). The essential difference between these is the linear order of states in the dependent setting. Statistically, each conditional state, e.g. each time point, is dependent on all the others. Cellular differentiation experiments are examples of a dependent testing structure. An example of the independent sample setting is finding differentially expressed genes by comparing multiple groups of disease stages (for an overview see [4]). Most commonly, multiconditional experiments are time courses. Several methods that provide a statistical framework for analysing multiconditional settings are introduced subsequently (see the sections Two-conditional setting and independent multiconditional setting, and Dependent multiconditional setting).
Two-conditional setting and independent multiconditional setting
The availability of replicates makes it possible to rank genes according to the t-statistic computed for each gene: $t = m / (\mathrm{std} / \sqrt{n})$, where m is the difference of means across replicates, std the within-group standard deviation and n the number of replicates. F-scores are the straightforward generalization of t-scores in the multiconditional case. Problems arise for genes whose intensities hardly vary across replicates: a very small standard deviation can yield a high t-score even when the difference between conditions is negligible, so that such genes occupy top ranks. A remedy lies in artificially enlarging these variances.
Accordingly, a number of methods have been introduced that propose different penalizing factors in the t-statistic [43-46]. Many authors offer freely available computational tools. Table 2 provides an overview of these tools. Lönnstedt and Speed [43] introduce a parametric empirical Bayes approach. In terms of ranking genes according to their evidence of differential expression this is equivalent to a penalized t-statistic [5]: $t = m / \sqrt{(a + \mathrm{std}^2)/n}$. They use the penalty value a, which is estimated from the mean and standard deviation of the variance across samples. Also, Tusher et al. [44] and Efron et al. [45] suggest using a penalizing factor, e.g. the fudge factor. Likewise, low variances are corrected by an enlarging factor. The approach by Tusher et al. [44] is implemented in the computational tool called significance analysis of microarrays (SAM). Recently, SAM has been updated such that time courses can be analysed, too. The authors developed an R package samr [47] as well as an Excel Add-in. Efron et al. [45] suggest applying an additive penalizing factor in the denominator of the t-statistic that is the 90th percentile of the standard deviation across samples. Choosing the penalizing factor to be zero reduces this method to the ordinary t-statistic. Also, Baldi and Long [48] suggest a Bayesian probabilistic approach combined with a modified t-test. Related to the approach by Tusher et al. [44], Broberg [46] suggests a calibrated testing procedure such that estimators for false negative and false positive rates are minimized.
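The effect of a penalty on the ranking can be illustrated with two made-up genes. The sketch below treats the paired case, where each value is a per-replicate log ratio, and compares the ordinary t-statistic with the penalized form given above. The penalty `a = 0.1` is an arbitrary assumption for illustration and is not estimated from the data as in the published methods.

```python
import math
from statistics import mean, stdev

def t_statistic(diffs):
    """Ordinary t-statistic m / (std / sqrt(n)) for per-replicate differences."""
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

def penalized_t(diffs, a):
    """Penalized t-statistic m / sqrt((a + std^2) / n) with penalty a >= 0."""
    n = len(diffs)
    return mean(diffs) / math.sqrt((a + stdev(diffs) ** 2) / n)

# Hypothetical log ratios from four replicates per gene.
genes = {
    "gene_small_change": [0.05, 0.06, 0.05, 0.06],   # tiny change, tiny variance
    "gene_large_change": [1.8, 2.4, 1.5, 2.3],       # large change, larger variance
}

for name, diffs in genes.items():
    print(name, round(t_statistic(diffs), 2), round(penalized_t(diffs, a=0.1), 2))
```

With these numbers the ordinary t-statistic ranks the gene with the negligible change first, whereas the penalized version puts the gene with the large change on top. Setting `a = 0` recovers the ordinary statistic.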
Several linear model approaches for ranking gene expression differences have been introduced [38, 41, 49, 50]. Kerr et al. [38] use ANOVA models for an integrated procedure of normalization and detection of differentially expressed genes. They assume a linear model of specific effects for log intensities of all genes. These effects might be dye, slide, treatment, gene effects and their respective interactions. Smyth et al. [41] propose a modified t-statistic that is proportional to the t-statistic with sample variance offset as used in [44-46]. The approach can be generalized for the multi-conditional case. It has been implemented in the Bioconductor package limma [1, 41]. Using this package, experimental setting, duplicate spots and quality weights can also be considered. The moderated t-statistic is calculated, genes are ranked with respect to the resulting scores and P-values can be assigned. Further developments focus on linear models in a gene wise manner [49]. Also, Jain et al. [51] propose a modified t-statistic. Lin et al. [50] use a robust linear model for each single gene to estimate contrasts of all pairwise comparisons of tested groups.
Furthermore, a number of rank-based (and thus non-parametric) approaches have been developed. These are based on a Wilcoxon rank sum test or permutation t-test. While t-test and F-test based methods assume that the intensity measurements of normalized ratios are normally distributed, rank-based approaches do not do so. Instead of considering numerical values, Wilcoxon rank sum tests use ranks. This is a more robust approach, although frequently with lower power, because one loses information by switching from the numerical to the rank scale. In the multiconditional case, the Kruskal-Wallis test is the straightforward generalization of the Wilcoxon test. Yan et al. [52] present a non-parametric method based on the statistic of relative entropy between two distributions. For the assignment of significance, resampling based permutations [53] are applied [52].
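For small groups, the rank sum statistic and an exact permutation P-value can be computed by enumerating all possible group labellings. The following sketch is one simple way to do this for a single gene under stated assumptions (two unpaired groups, a two-sided test, enumeration feasible only for small sample sizes); the function names and data are hypothetical.

```python
from itertools import combinations

def rank_sum(values_a, values_b):
    """Sum of the ranks of group A in the pooled sample (average ranks for ties)."""
    pooled = sorted(values_a + values_b)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0    # mean of ranks i+1 .. j
        i = j
    return sum(rank_of[v] for v in values_a)

def permutation_p_value(values_a, values_b):
    """Exact two-sided P-value: share of relabellings at least as extreme as observed."""
    pooled = values_a + values_b
    n_a = len(values_a)
    expected = n_a * (len(pooled) + 1) / 2.0      # mean rank sum under the null
    observed = abs(rank_sum(values_a, values_b) - expected)
    hits = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        if abs(rank_sum(a, b) - expected) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical normalized log intensities of one gene in two groups.
group_a = [7.1, 7.4, 7.9, 8.2]
group_b = [5.0, 5.6, 6.1, 6.3]
print(rank_sum(group_a, group_b), permutation_p_value(group_a, group_b))
```

With four replicates per group there are only 70 possible labellings, so the smallest attainable two-sided P-value is 2/70. This illustrates the lower power of rank-based tests when few replicates are available.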
Dependent multiconditional setting
Time course experiments arise for example from cell differentiation processes and constitute multiconditional experiments. Each time point represents one conditional state. Thus, all experiments corresponding to one time point build up one conditional group. The essential difference compared with independent cases is the linear order of states. Statistically, each conditional state, e.g. each time point, is dependent on all the others. This fact requires new concepts of the statistical procedure. Bar-Joseph [54] provides an overview of several recent developments in analysing time course gene expression data.
Recently, the original form of SAM [44] has been generalized to time course experimental settings. Time is being included as one covariate. Storey et al. [55] propose a statistical framework specifically designed for time course analysis. This spline-based approach has been implemented in the open-source software package EDGE (Table 2). To assign significance to each gene or group of genes they use a t-statistic and F-statistic related approach. Guo et al. [56] introduce a robust statistic which is based on the Wald statistic. There, time-relevant dependencies within the gene intensity data set are explicitly integrated. To assign significance, either recent versions of SAM [44] or [57] might be applied. Xu et al. [58] suggest an approach using regression analysis. To estimate the parameters of the regression model they apply least squares estimates. Standard errors are assessed using estimating techniques as introduced in [59]. Significance levels are assigned based on Z-statistic. Bar-Joseph et al. [60] use cubic splines to describe gene expression time courses and significance is assigned by comparing global differences of two aligned curves. Schliep et al. [61] suggest using Hidden Markov models (HMM) for the analysis of time course gene expression data. External biological knowledge can be integrated using a partially supervised learning approach. External biological knowledge for example might be the expression behaviour of several master genes that is known beforehand. The method has been implemented in the freely available software package GQL (Table 2).
Cut-off and multiple testing
After ranking the genes according to a statistical procedure, one has to find a cut-off above which biologically meaningful information is expected. Frequently, researchers choose the P-value cut-off of 0.05 and assume all genes showing a lower P-value to be biologically significant. Performing many tests at a time, however, increases the problem of falsely significant genes. Roughly speaking, when performing 10 000 tests one expects 5% of the genes, i.e. about 500, to show a P-value of less than 0.05 just by chance, even if no gene is truly differentially expressed.
There are a number of multiple testing approaches to overcome this problem. One possibility to lower the problem is to reduce the number of statistical tests by filtering steps. Thus, we have to find a criterion due to which the number of testing procedures can be limited. This might be either external biological knowledge or variance across conditions. That means, the set of intensities can be reduced by neglecting genes which we do not expect any biological information from. Alternatively, one could use only those genes that show a certain minimal amount of variance over all conditional states or apply intensity-based filtering, e.g. neglecting very lowly expressed genes. For oligo-array experimental setting, Pounds and Cheng [62] suggest a filtering procedure using the P-values of present/absent calls. They combine these to one summary P-value that is used for filtering. They also discuss that there might be cases where filtering is not necessarily improving the detection of differentially expressed genes.
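A variance-based filter of the kind mentioned above can be written in a few lines. This is only a sketch: the threshold, the gene names and the expression values are invented for illustration, and a suitable cut-off would have to be chosen for each data set.

```python
from statistics import variance

def filter_by_variance(expression, min_variance):
    """Keep genes whose variance across all samples reaches `min_variance`."""
    return {gene: values for gene, values in expression.items()
            if variance(values) >= min_variance}

# Hypothetical normalized log intensities across four samples.
expression = {
    "g1": [7.0, 7.1, 6.9, 7.0],   # nearly constant
    "g2": [5.0, 8.0, 6.0, 9.0],   # varies across conditions
    "g3": [3.0, 3.0, 3.1, 3.0],   # nearly constant and lowly expressed
}

kept = filter_by_variance(expression, min_variance=0.5)
print(list(kept))
```

Only the retained genes are then tested, which reduces the number of hypotheses that a subsequent multiple testing correction has to account for.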
Given a type I error rate (i.e. a false positive rate), controlling for multiple testing means correcting P-values such that the given error rate can be guaranteed for all tests. Methods can be divided into those that control the family-wise error rate (FWER) and those that control the false discovery rate (FDR). The FWER is the probability of at least one type I error within the significant genes. The FDR is the expected proportion of type I errors within the rejected hypotheses. For an overview see [63-65].
The so-called Bonferroni correction is an extremely conservative approach. Significance levels are being divided by the number of tests that are performed. This one-step multiplicity adjustment controls the FWER. Holm [66] suggests a stepwise procedure which improves the power. Westfall and Young [53] suggest a resampling method to adjust P-values.
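The one-step Bonferroni adjustment and the stepwise Holm procedure can both be expressed as adjusted P-values. The sketch below is a plain-Python illustration with invented P-values; the function names are not taken from any package.

```python
def bonferroni(p_values):
    """One-step adjustment: multiply each P-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Step-down adjustment: the i-th smallest P-value is multiplied by m - i + 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):               # rank 0 = smallest P-value
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

# Hypothetical raw P-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(p_values))
print(holm(p_values))
```

Holm's adjusted P-values are never larger than the Bonferroni ones, which is where the gain in power comes from, while both control the FWER.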
While these methods control the FWER, Benjamini and Hochberg [67] suggest a less conservative approach by controlling the FDR instead. Likewise, different modifications have been proposed [65, 67-76]. Storey and Tibshirani [77] propose using Q-values, which are a measure of statistical significance in terms of the FDR instead of the false positive rate, as is the case for P-values. Estimation of FDR, as proposed in [78], is implemented in SAM. Efron et al. [45] suggest using the local FDR. Given a score for a certain gene, the local FDR determines the probability that the gene is not differentially expressed conditioned on the observed test score. Scheid and Spang [64] derive an estimator for the local false discovery rate. The procedure is implemented in the Bioconductor package twilight.
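The FDR-controlling step-up procedure of Benjamini and Hochberg can likewise be written as an adjustment of P-values. The following is a minimal sketch with invented P-values; it does not cover the modifications, Q-value estimation or local FDR methods mentioned above.

```python
def benjamini_hochberg(p_values):
    """Step-up FDR adjustment: p * m / rank, made monotone from the largest P-value down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, i in enumerate(order):                # pos 0 = largest P-value
        rank = m - pos                             # rank among ascending P-values
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw P-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(p_values))
```

Genes whose adjusted value falls below the chosen FDR level, e.g. 0.05, are called significant. For these example P-values all four genes pass, whereas a FWER-controlling adjustment at the same level would reject fewer hypotheses.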
| CONCLUSION |
|---|
Starting with raw gene expression measurements we summarized numerous approaches to arrive at biologically meaningful expression datasets. We outlined the importance of preprocessing and normalization, various aspects of ranking genes according to a statistic for differential expression, assigning significances to expression changes and deriving meaningful cut-offs [1-3, 5]. There are freely available software packages that enable methodologically sound analyses of microarray data. We provided a brief overview of different tools and recommend using the open source R-packages in the Bioconductor project. These packages not only allow for the integration of various kinds of biological data but are also rapidly evolving and providing current statistical approaches.
Although this review has attempted to present normalization procedures separately from the search for differential genes, it must be realized that the two tasks are in fact linked. Normalization by transformation assumes that most of the genes represented on an array remain unchanged upon a change of condition. This, in turn, means that the normalization method already implicitly flags other genes as differential, and it is these that are more likely to be found in the search for differential genes. Thus, the two problems really are one. One recommendable combination, for example, is to apply variance stabilization, followed by a modified t-test and multiple testing correction using FDR. Variance stabilization overcomes many drawbacks of other methods, as outlined before. The choice of the test depends strongly on the experimental setting. In our experience, modified t-tests have proven valuable in many cases.
Practically, however, appropriate experimental design is crucial for achieving a biologically meaningful interpretation of the experiment. Otherwise, computational analysis needs to focus on troubleshooting rather than providing a solid procedure for biological hypothesis generation. For example, searching for differentially expressed genes, replicates are indispensable for assigning a statistical significance to the changes. Furthermore, the smaller the expected expression changes the more important are repetitions. Working with two-dye arrays, dye-swap experiments may offer an additional opportunity to cheaply generate data for normalization, in particular when only small amounts of sample material are available. An experiment that, right from the start, is designed to be evaluated by the ANOVA approach, may minimize the number of hybridizations necessary to answer a particular question.
In microarray technology, the large number of genes that can be tested is of great appeal to the experimenter. At the same time, this is the statistical curse of the method. The fact that the number of genes on the array is typically much larger than the number of conditions is what makes it so difficult to analyse the data in a statistically sound manner. Remedies lie in filtering techniques and appropriate corrections for multiple testing. In addition, on the level of functional interpretation more information can be gained, e.g. by searching for overrepresentation of genes belonging to a particular functional category, combining gene expression analysis with gene function prediction, or elucidating biological networks [79-81].
Key Points
| Acknowledgements |
|---|
The authors thank Stefanie Scheid and Anja von Heydebreck for their critical reading of the manuscript and useful suggestions. We acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG), Sonderforschungsbereich (SFB) 618: Theoretical Biology: Robustness, Modularity and Evolutionary Design of Living Systems.
| FOOTNOTES |
|---|
Christine Steinhoff is a postdoctoral scientist in the Computational Molecular Biology Department at the Max Planck Institute for Molecular Genetics in Berlin. Her research interest focuses on epigenetic gene regulatory mechanisms especially based on gene expression experimental approaches.
Martin Vingron is the Director at the Max Planck Institute for Molecular Genetics in Berlin and Head of the Department for Computational Molecular Biology. His current research interest lies in utilizing gene expression data as well as evolutionary data for the elucidation of gene regulatory mechanisms.
Received (in revised form): January 28, 2006.
| References |
|---|
- Smyth GK, Speed TP. Normalization of cDNA microarray data. Methods 2003; 31:265–73.
- Huber W, Heydebreck Av, Vingron M. Analysis of Microarray Gene Expression Data. Chichester: John Wiley & Sons 2003.
- Steinhoff C, Vingron M. Normalization Strategies for Microarray Data Analysis. Taylor & Francis Group 2005.
- Scheid S, Spang R. Microarray data analysis: Differential gene expression. Taylor & Francis Group 2005.
- Smyth GK, Yang YH, Speed TP. Statistical Issues in cDNA Microarray Data Analysis. Totowa: Humana Press 2002.
- Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 2002; 18:(4)546–54.
- Elo L, Aittokallio T, Filen S, et al. The effect of replication on gene rankings: a practical comparison of methods for detecting differential expression in microarray experiments. Bioinformatics Research and Education Workshop. Berlin 2005.
- Ihaka R, Gentleman R. R: a language for data analysis and graphics. J Comput Graph Stat 1996; 5:299–314.
- Gentleman R, Carey V, Bates D, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 2004; 5:(10)R80.
- Speed T. Statistical Analysis of Gene Expression Microarray Data. CRC Press 2003.
- Simon R, Korn EL, McShane LM, et al. Design and Analysis of DNA Microarray Investigations. Springer 2003.
- Parmigiani G, Garrett ES, Irizarry R (Eds.). The Analysis of Gene Expression Data. Springer 2003.
- Lee M-LT. Analysis of Microarray Gene Expression Data. Kluwer Academic Publishers 2004.
- Gentleman R, Carey V, Huber W, et al. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer 2005.
- Nuber UA. DNA Microarrays. Taylor & Francis Group: Garland Science Publishing 2005.
- Dudoit S, Yang YH, Callow MJ, et al. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 2002; 12:111–39.
- Yang IV, Chen E, Hasseman JP, et al. Within the fold: assessing differential expression measures and reproducibility in microarray assays. Genome Biol 2002; 3:(11)research0062.1–0062.12.
- Huber W, Heydebreck Av, Vingron M. Low-level analysis of microarray experiments. Wiley-VCH 2005.
- Cui X, Kerr MK, Churchill GA. Transformation for cDNA Microarray Data. Stat Appl Genet Mol Biol 2003; 2:(1) Article 4.
- Kreil DP. There is no silver bullet - a guide to low-level data transforms and normalisation methods for microarray data. Brief Bioinform 2005; 6:(1)86–97.
- Irizarry RA, Hobbs B, Collin F, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003; 4:(2)249–64.
- Irizarry RA, Bolstad BM, Collin F, et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003; 31:(4)e15.
- Wu Z, Irizarry RA, Gentleman R, et al. A model-based background adjustment for oligonucleotide expression arrays. J Am Stat Assoc 2004; 99:(468)909–17.
- Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. PNAS 2001; 98:31–6.
- Choe SE, Boutros M, Michelson AM, et al. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. BMC Bioinformatics 2005; 6:R16.
- Yang YH, Buckley MJ, Dudoit S, et al. Comparison of methods for image analysis on cDNA microarray data. J Comput Graph Statist 2002; 11:(1)108–36.
- Schena M, Shalon D, Davis RW, et al. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995; 270:(5235)467–70.
- DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997; 278:(5338)680–6.
- Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999; 286:531–7.
- Virtaneva K, Wright FA, Tanner SM, et al. Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetics. PNAS 2001; 98:1124–9.
- Yang YH, Dudoit S, Luu P, et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 2002; 30:(4)e15.
- Workman C, Jensen LJ, Jarmer H, et al. A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol 2002; 3:(9)research0048.1–0048.16.
- Bolstad BM, Irizarry RA, Astrand M, et al. A comparison of normalization methods for high density oligonucleotide array based on variance and bias. Bioinformatics 2003; 19:(2)185–93.
- Rocke DM, Durbin BP. A model for measurement error for gene expression arrays. J Computat Biol 2001; 8:(6)557–69.
- Huber W, Heydebreck Av, Sültmann H, et al. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002; 18:(Suppl 1)S96–104.
- Durbin BP, Hardin JS, Hawkins DM, et al. A variance stabilizing transformation for gene-expression microarray data. Bioinformatics 2002; 18:(Suppl 1)S96–104.
- Chen Y, Dougherty ER, Bittner ML. Ratio based decisions and the quantitative analysis of cDNA microarray images. J Biomed Opt 1997; 2:364–74.
- Kerr MK, Martin M, Churchill GA. Analysis of variances from gene expression microarray data. J Comput Biol 2000; 7:(6)819–37.
- Kepler TB, Crosby L, Morgan KT. Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biol 2002; 3:(7)research0037.1–0037.12.
- Newton MA, Kendziorski CM, Richmond CS, et al. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 2001; 8:(1)37–52.
- Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004; 3:(1).
- Troyanskaya OG, Garber ME, Brown PO, et al. Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics 2002; 18:(11)1454–61.
- Lönnstedt I, Speed TP. Replicated microarray data. Statistica Sinica 2002; 12:31–46.
- Tusher V, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001; 98:5116–24.
- Efron B, Tibshirani R, Storey JD, et al. Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 2001; 96:(456)1151–60.
- Broberg P. Statistical methods for ranking differentially expressed genes. Genome Biol 2003; 4:R41.
- Tibshirani R, Chu G, Hastie T. The samr Package. Bioconductor 2005.
- Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001; 17:(6)509–19.
- Thomas JG, Olson JM, Tapscott SJ, et al. An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res 2001; 11:(7)1227–36.
- Lin DM, Yang YH, Scolnick JA, et al. Spatial patterns of gene expression in the olfactory bulb. PNAS 2004; 101:(34)12718–23.
- Jain N, Thatte J, Braciale T, et al. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics 2003; 19:(15)1945–51.
- Yan X, Deng M, Fung WK, et al. Detecting differentially expressed genes by relative entropy. J Theor Biol 2005; 234:395–402.
- Westfall PH, Young SS. Re-Sampling Based Multiple Testing. New York: Wiley 1993.
- Bar-Joseph Z. Analyzing time series gene expression data. Bioinformatics 2004; 20:(16)2493–503.
- Storey JD, Xiao W, Leek JT, et al. Significance analysis of time course microarray experiments. PNAS 2005; in press.
- Guo X, Qi H, Verfaillie CM, et al. Statistical significance analysis of longitudinal gene expression data. Bioinformatics 2003; 19:(13)1628–35.
- Pan W, Lin J, Le CT. A mixture model approach to detecting differentially expressed genes with microarray data. Funct Integr Genomics 2003; 3:(3)117–24.
- Xu XL, Olson JM, Zhao LP. A regression-based method to identify differentially expressed genes in microarray time course studies and its application in an inducible Huntington's disease transgenic model. Hum Mol Genet 2002; 11:(17)1977–85.
- Zhao LP, Prentice RL, Breeden L. Statistical modeling of large microarray data sets to identify stimulus-response profiles. PNAS 2001; 98:5631–6.
- Bar-Joseph Z, Gerber G, Jaakkola T, et al. Comparing the continuous representation of time series expression profiles to identify differentially expressed genes. PNAS 2003; 100:10146–51.
- Schliep A, Costa IG, Steinhoff C, et al. Analyzing gene expression time-courses. IEEE Trans Comput Biol 2005; 2:(3)179–93.
- Pounds S, Cheng C. Statistical development and evaluation of microarray gene expression data filters. J Comput Biol 2005; 12:(4)482–95.
- Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Stat Sci 2003; 18:(1)71–103.
- Scheid S, Spang R. A stochastic downhill search algorithm for estimating the local false discovery rate. IEEE Trans Comput Biol 2004; 1:(3)98–108.
- Tsai CA, Hsueh HM, Chen JJ. Estimation of false discovery rates and multiple testing: application to gene microarray data. Biometrics 2003; 59:(4)1071–81.
- Holm S. A simple sequentially rejective multiple test procedure. Scand J Statist 1979; 6:65–70.
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 1995; 57:(1)289–300.
- Benjamini Y, Hochberg Y. On the adaptive control of the false discovery rate in multiple testing with independent statistics. J Educ Behav Stat 2000; 25:(1)60–83.
- Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple hypothesis testing under dependencies. Ann Statist 2001; 29:(4)1165–88.
- Storey JD. The positive false discovery rate: a Bayesian interpretation and the q-value. Ann Math Statist 2003; 31:(6)2013–35.
- Benjamini Y, Liu W. A step-down multiple hypothesis testing procedure that controls the false discovery rate under independence. J Stat Plan Infer 1999; 82:163–70.
- Yekutieli D, Benjamini Y. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. J Stat Plan Infer 1999; 82:(1–2)171–96.
- Allison DB, Gadbury GL, Heo M, et al. A mixture model approach for the analysis of microarray gene expression data. Comput Statist Data Anal 2002; 39:(1)1–20.
- Pounds S, Morris SW. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of P-values. Bioinformatics 2003; 19:(10).
- Reiner A, Yekutieli D, Benjamini Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003; 19:(3)368–75.
- Pounds S, Cheng C. Improving false discovery rate estimation. Bioinformatics 2004; 20:(11)1737–45.
- Storey JD, Tibshirani R. Statistical significance for genomewide studies. PNAS 2003; 100:(16)9440–5.
- Storey JD. A direct approach to false discovery rates. J R Statist Soc B 2002; 64:(3)479–98.
- Troyanskaya OG. Putting microarrays in a context: integrated analysis of diverse biological data. Brief Bioinformat 2005; 6:(1)34–43.
- Stuart JM, Segal A, Koller D, et al. A gene-coexpression network for global discovery of conserved genetic modules. Science 2003; 302:249–54.
- McCarroll SA, Murphy CT, Zou S, et al. Comparing genomic expression patterns across species identifies shared transcriptional profile in aging. Nature Genet 2004; 36:(2)197–204.
- Dudoit S, Yang YH. Bioconductor R packages for exploratory analysis and normalisation of cDNA microarray data. New York: Springer 2002.



