Briefings in Bioinformatics Advance Access originally published online on May 26, 2006
Briefings in Bioinformatics 2007 8(1):32-44; doi:10.1093/bib/bbl016
Partial least squares: a versatile tool for the analysis of high-dimensional genomic data
Corresponding author. Anne-Laure Boulesteix, Department of Medical Statistics and Epidemiology, Technical University of Munich, Ismaningerstrasse 22, D-81675 Munich, Germany. Tel: +49 89 4140-4347; Fax: +49 89 4140-4840; E-mail: anne-laure.boulesteix{at}tum.de
ABSTRACT
Partial least squares (PLS) is an efficient statistical regression technique that is highly suited for the analysis of genomic and proteomic data. In this article, we review both the theory underlying PLS as well as a host of bioinformatics applications of PLS. In particular, we provide a systematic comparison of the PLS approaches currently employed, and discuss analysis problems as diverse as, e.g. tumor classification from transcriptome data, identification of relevant genes, survival analysis and modeling of gene networks and transcription factor activities.
Keywords: partial least squares (PLS), high-dimensional genomic data, gene expression, classification, dimension reduction
INTRODUCTION
In the last few years, multivariate statistical methods for the analysis of high-dimensional genomic data have been the subject of numerous publications in statistics, machine learning, bioinformatics and biology. A challenging problem connected with these data is that they typically contain many more variables (p, e.g. genes or features) than observations (n, e.g. gene chips or time points). For instance, it is not uncommon to collect expression data for 20 000 genes using only 10 to 20 microarrays. Since many traditional multivariate methods are not applicable in this case, predicting, e.g. the survival time or the tumor class of a patient from such high-dimensional data is a challenging task that requires special techniques such as variable selection or dimension reduction.
In this article, we survey the application of partial least squares (PLS), a powerful yet comparatively unknown approach for analyzing high-dimensional data, to problems in bioinformatics and genomics. The PLS method was first developed by Herman Wold in the 1960s and 1970s to address problems in econometric path modeling, and was subsequently adopted by his son Svante Wold (and many others) in the 1980s for regression problems in chemometric and spectrometric modeling. Early references on path modeling are, e.g. Wold [5, 13]. One of the first applications of PLS to regression is Wold et al. [4]. Two recent studies [6] describe these early developments and provide a detailed chronological overview. PLS is still a highly active research area from a theoretical point of view; see for instance [7] for recent developments on the connections of PLS with Krylov subspaces and conjugate gradients. PLS started to attract the attention of statisticians only about 15 years ago; see e.g. [8-11]. This was mainly due to the ability of PLS to work very well for data with very small sample sizes and a large number of parameters. Thus, it is only natural that in the last few years this methodology has been successfully applied to problems in genomics and proteomics.
PLS methods are in general characterized by high computational and statistical efficiency. They also offer great flexibility and versatility in terms of the analysis problems that may be addressed. However, the literature of PLS is very diverse because of the existence of a large number of algorithmic variants of PLS, which render it very difficult to understand the principles underlying PLS. It is the aim of this article to fill this gap by, firstly, providing a systematic overview of the available PLS methods and, secondly, reviewing the broad range of their applications to genome data.
The remainder of the article is structured as follows. In the Methodological Foundations of Partial Least Squares section, we summarize the main methodological aspects of PLS regression. In the Applications of Partial Least Squares to High-dimensional Genomic Data section, various applications of PLS regression to microarray studies are reviewed. The Outlook and Generalizations of PLS section is devoted to PLS-based methods that are especially designed for particular types of response variables (for instance, survival time or categorical outcome) and to their practical use in microarray data analysis. A recapitulation of the notations and abbreviations that are used throughout the manuscript can be found in the appendix.
METHODOLOGICAL FOUNDATIONS OF PARTIAL LEAST SQUARES
In this section, we provide an introduction to the mathematics of PLS. In a nutshell, PLS is a dimension reduction approach that is coupled with a regression model. Unlike similar approaches such as principal component regression, PLS chooses its latent components with the response variable of the regression in mind.
PLS regression
Suppose we want to predict q continuous response variables Y1, ... , Yq using p continuous predictor variables X1, ... , Xp. The available data sample consisting of n observations is denoted as $(x'_i, y'_i)_{i=1,\dots,n}$, where $x'_i \in \mathbb{R}^p$ and $y'_i \in \mathbb{R}^q$ denote the ith observation of the predictor and response variables, respectively. The prime denotes uncentered basic data, as in [9]. Its removal indicates the subtraction of the sample average, i.e.
$$x_i = x'_i - \frac{1}{n}\sum_{k=1}^{n} x'_k, \qquad y_i = y'_i - \frac{1}{n}\sum_{k=1}^{n} y'_k.$$
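The xi = (xi1, ... , xip)T are collected in the n x p matrix X. Similarly, Y is the n x q matrix containing the yi = (yi1, ... , yiq)T:
$$X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1^T \\ \vdots \\ y_n^T \end{pmatrix}.$$
PLS regression is based on the latent component decomposition
$$X = T P^T + E, \tag{1}$$
$$Y = T Q^T + F, \tag{2}$$
where T is the n x c matrix of latent components, P (p x c) and Q (q x c) are matrices of coefficients, and E (n x p) and F (n x q) are matrices of random errors.
PLS as well as principal component regression and reduced rank regression can all be seen as methods to construct a matrix of latent components T as a linear transformation of X:
$$T = XW, \tag{3}$$
where W is a p x c matrix of weights. In the following, the columns of W and T are denoted as wi = (w1i, ..., wpi)T and ti, respectively, for i = 1, ..., c.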
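The latent components are then used for prediction in place of the original variables: once T is constructed, QT is obtained as the least squares solution of Equation (2):
$$Q^T = (T^T T)^{-1} T^T Y.$$
Finally, the matrix B of regression coefficients for the model Y = XB + F is given as
$$B = W Q^T = W (T^T T)^{-1} T^T Y,$$
and the fitted response matrix $\hat{Y}$ may be written as
$$\hat{Y} = T (T^T T)^{-1} T^T Y.$$
If we have a new (uncentered) raw observation $x'_0$, the prediction $\hat{y}'_0$ of the response is given by
$$\hat{y}'_0 = \frac{1}{n}\sum_{i=1}^{n} y'_i + B^T \left( x'_0 - \frac{1}{n}\sum_{i=1}^{n} x'_i \right).$$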
In PLS, dimension reduction and regression are performed simultaneously, i.e. PLS outputs the matrix of regression coefficients B as well as the matrices W, T, P and Q, and hence the term PLS regression. In the PLS literature, the columns of T are often denoted as latent variables or scores. In this study, we prefer the term latent components, since in PLS the columns of T are rather the result of a matrix decomposition than observations of underlying random variables. P and Q are often denoted as X-loadings and Y-loadings, respectively.
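The chain from weights to prediction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the weight matrix W is filled with random numbers purely to show the algebra (a real PLS method would choose W from the data).

```python
import numpy as np

def latent_component_regression(X_raw, Y_raw, W):
    """Regress Y on the latent components T = XW built from centered data."""
    x_mean = X_raw.mean(axis=0)
    y_mean = Y_raw.mean(axis=0)
    X = X_raw - x_mean                      # centered predictors (n x p)
    Y = Y_raw - y_mean                      # centered responses (n x q)
    T = X @ W                               # latent components (n x c)
    Qt = np.linalg.solve(T.T @ T, T.T @ Y)  # Q^T = (T^T T)^{-1} T^T Y (c x q)
    B = W @ Qt                              # regression coefficients (p x q)
    return B, x_mean, y_mean

def predict(x_new_raw, B, x_mean, y_mean):
    """Predict the raw response for one new raw observation."""
    return y_mean + B.T @ (x_new_raw - x_mean)

rng = np.random.default_rng(0)
n, p, q, c = 8, 30, 2, 3                    # fewer observations than variables
X_raw = rng.normal(size=(n, p))
Y_raw = rng.normal(size=(n, q))
W = rng.normal(size=(p, c))                 # arbitrary weights, illustration only
B, x_mean, y_mean = latent_component_regression(X_raw, Y_raw, W)
y_hat = predict(X_raw[0], B, x_mean, y_mean)
```

Only the c x c matrix T^T T has to be inverted, which is why the approach remains feasible when n is much smaller than p.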
The basic idea of the PLS method is that the response Y should be taken into account for the construction of the components T. More precisely, the components are defined such that they have high covariance with the response, as outlined in the Univariate response and Multivariate response sections. That is why PLS is called a supervised method, in contrast to, e.g. principal component analysis (PCA), which does not use the response for the construction of the new components. This feature explains why PLS usually performs better than PCA in prediction problems.
The characterization of the various PLS regression approaches might be done at four different levels:
- the objective function maximized by the W matrix,
- the W matrix itself,
- the obtained matrix of regression coefficients B and
- the algorithm used to compute W.
These four different levels are connected as follows:
- The same W matrix can maximize several objective functions. But a given objective function is generally satisfied by only one W matrix (and its opposite −W).
- There might be several algorithms that output the same W matrix.
- A given W matrix leads to only one possible matrix of regression coefficients. But two different matrices W and W* can lead to the same regression coefficients if there exists an invertible c x c matrix M such that W* = WM. Note that, although W and W* lead to the same prediction, they do not necessarily satisfy the same objective function.
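The last point can be checked numerically. The sketch below assumes centered data and uses random matrices for illustration only; the helper name `coefficients` is hypothetical. M is built as a triangular matrix with non-zero diagonal so that it is invertible by construction.

```python
import numpy as np

def coefficients(X, Y, W):
    """B = W (T^T T)^{-1} T^T Y with T = XW, for centered X and Y."""
    T = X @ W
    return W @ np.linalg.solve(T.T @ T, T.T @ Y)

rng = np.random.default_rng(1)
n, p, q, c = 10, 25, 2, 3
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
Y = rng.normal(size=(n, q))
Y -= Y.mean(axis=0)

W = rng.normal(size=(p, c))
# Upper triangular with non-zero diagonal, hence invertible
M = np.triu(rng.normal(size=(c, c)), k=1) + np.diag([1.0, 2.0, 0.5])

B = coefficients(X, Y, W)
B_star = coefficients(X, Y, W @ M)          # W* = WM
same_coefficients = np.allclose(B, B_star)
```

The equality holds because T* = TM spans the same column space as T, so the least squares fit is unchanged.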
Univariate response
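In this section, the case of univariate response variables (q = 1) is considered. Thus, Y is an n x 1 matrix, i.e. a vector of length n. Y1 is denoted as Y in the present section. For a fixed weight vector wi = (w1i, ..., wpi)T, the sample covariance between the response variable Y and the random variable Ti = w1iX1 + ... + wpiXp can be computed as
$$\widehat{COV}(T_i, Y) = \frac{1}{n} w_i^T X^T Y.$$
Similarly, the sample variance of Ti is
$$\widehat{VAR}(T_i) = \frac{1}{n} w_i^T X^T X w_i,$$
and the sample covariance between Ti and Tj is
$$\widehat{COV}(T_i, T_j) = \frac{1}{n} w_i^T X^T X w_j \quad (i \neq j,\; i, j = 1, \dots, c).$$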
In PLS univariate regression, there is only one commonly adopted objective function. The columns w1, ..., wc of the p x c weight matrix W are defined such that the squared sample covariance between Y and the latent components is maximal under the condition that the latent components are mutually empirically uncorrelated. Moreover, the vectors w1, ..., wc are constrained to be of unit length.
Objective function 1: Univariate PLS (PLS1)
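For i = 1, ..., c,
$$w_i = \arg\max_{w} \; w^T X^T Y Y^T X w$$
subject to $w_i^T w_i = 1$ and $t_i^T t_j = w_i^T X^T X w_j = 0$ for j = 1, ..., i − 1.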
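For a univariate response, objective function 1 can be solved by a simple projection: each weight vector is the part of X^T Y that is orthogonal to the vectors X^T X w_j of the previous components, rescaled to unit length. The following NumPy sketch illustrates this; it is one possible way to satisfy the objective function, not necessarily the algorithm used by any particular package, and the function name is hypothetical.

```python
import numpy as np

def pls1_weights(X, y, c):
    """Unit-length weights whose components t_i = X w_i are uncorrelated."""
    s = X.T @ y                          # proportional to the covariances with y
    W, V = [], []                        # V: orthonormal basis of {X^T X w_j}
    for _ in range(c):
        r = s.copy()
        for v in V:                      # enforce w^T X^T X w_j = 0
            r -= v * (v @ r)
        w = r / np.linalg.norm(r)        # unit-length constraint
        W.append(w)
        v = X.T @ (X @ w)
        for u in V:                      # Gram-Schmidt on the constraint vectors
            v -= u * (u @ v)
        V.append(v / np.linalg.norm(v))
    return np.column_stack(W)

rng = np.random.default_rng(2)
n, p = 12, 40
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = rng.normal(size=n)
y -= y.mean()

W = pls1_weights(X, y, c=3)
T = X @ W                                # latent components
gram = T.T @ T                           # should be diagonal
```

The first weight vector is simply X^T Y rescaled to unit length; later ones differ only through the orthogonality constraints.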
Multivariate response
The case of a multivariate response is more difficult to handle since one has to find latent components which explain all the responses Y1, ..., Yq simultaneously. There are two main variants for multivariate PLS regression. The first variant is usually denoted as PLS2 in contrast to the univariate method PLS1, or simply PLS. To avoid misunderstandings, we use the term PLS2. The W matrix corresponding to PLS2 may be obtained via several algorithms. The most well-known are the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm and the Kernel-PLS algorithm, which are implemented in the R packages pls and pls.pcr. Recently, ter Braak and de Jong [13] discovered that the PLS2 maximizes the same expression as Statistically Inspired Modification of PLS (SIMPLS) but with different and less intuitive constraints.
Objective function 2: PLS2
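For i = 1, ..., c,
$$w_i = \arg\max_{w} \; w^T X^T Y Y^T X w$$
subject to $w_i^T (I_p - W_{i-1} W_{i-1}^{+}) w_i = 1$ and $t_i^T t_j = w_i^T X^T X w_j = 0$ for j = 1, ..., i − 1, where $I_p$ is the p x p identity matrix and $W_{i-1}^{+}$ is the Moore-Penrose inverse of $W_{i-1} = (w_1, \dots, w_{i-1})$.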
The second important variant of multivariate regression is SIMPLS, which was first introduced by de Jong [14]. In contrast to PLS2, SIMPLS was first developed as an optimality problem. Algorithms were then developed to solve this optimality problem.
Objective function 3: SIMPLS
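For i = 1, ..., c,
$$w_i = \arg\max_{w} \; w^T X^T Y Y^T X w$$
subject to $w_i^T w_i = 1$ and $t_i^T t_j = w_i^T X^T X w_j = 0$ for j = 1, ..., i − 1.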
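The term wTXTYYTXw, which is maximized by both PLS2 and SIMPLS, is the same as in the univariate case. In the case of a multivariate response (q > 1), it can be reformulated as the sum of the squared empirical covariances between T = w1X1 + ... + wpXp and Y1, ..., Yq:
$$w^T X^T Y Y^T X w = n^2 \sum_{j=1}^{q} \widehat{COV}(T, Y_j)^2, \qquad \widehat{COV}(T, Y_j) = \frac{1}{n} w^T X^T y_{(j)},$$
where $y_{(j)}$ denotes the jth column of Y.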
Objective function 4: SIMPLS (equivalent formulation)
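For i = 1, ..., c,
$$w_i = \arg\max_{w} \sum_{j=1}^{q} \widehat{COV}(T_i, Y_j)^2$$
subject to $w_i^T w_i = 1$ and $\widehat{COV}(T_i, T_k) = 0$ for k = 1, ..., i − 1.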
As for PLS2, there exist several algorithms that solve the optimality problem of SIMPLS. One of them is implemented in the function simpls from the R package pls.pcr. A particularity of the R function simpls is that it returns unit length scores instead of unit length weights (as one would expect when considering objective function 3). By transforming the weights to have unit length, one obtains weights satisfying objective function 3. A user-friendly version of SIMPLS implementing this transformation can be found in the R package plsgenomics [16].
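To make objective function 3 concrete, the sketch below computes unit-length SIMPLS weights directly: at each step the cross-product matrix X^T Y is projected away from the directions X^T X w_j of the earlier components, and the dominant left singular vector of the result is taken as the next weight vector. This is a didactic NumPy version under the assumption of centered data; it is not the code of the R function simpls, and the function name is hypothetical.

```python
import numpy as np

def simpls_weights(X, Y, c):
    """Unit-length SIMPLS weights (objective function 3), centered X and Y."""
    S = X.T @ Y                              # p x q cross-product matrix
    p = X.shape[1]
    W = np.zeros((p, c))
    V = np.zeros((p, 0))                     # orthonormal basis of {X^T X w_j}
    for i in range(c):
        S_proj = S - V @ (V.T @ S)           # impose w^T X^T X w_j = 0
        U, _, _ = np.linalg.svd(S_proj, full_matrices=False)
        w = U[:, 0]                          # dominant left singular vector
        W[:, i] = w
        v = X.T @ (X @ w)
        v -= V @ (V.T @ v)                   # orthogonalize against earlier ones
        V = np.column_stack([V, v / np.linalg.norm(v)])
    return W

rng = np.random.default_rng(5)
n, p, q = 15, 40, 3
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
Y = rng.normal(size=(n, q))
Y -= Y.mean(axis=0)

W = simpls_weights(X, Y, c=3)
T = X @ W
gram = T.T @ T                               # diagonal: uncorrelated components
# Objective value of the first component and its covariance formulation
objective = W[:, 0] @ X.T @ Y @ Y.T @ X @ W[:, 0]
sum_sq_cov = n ** 2 * np.sum((T[:, 0] @ Y / n) ** 2)
```

Because the weights come out with unit length, no rescaling step of the kind described for the R function is needed here.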
APPLICATIONS OF PARTIAL LEAST SQUARES TO HIGH-DIMENSIONAL GENOMIC DATA
Regression problems
Any genomic analysis that incorporates a regression model may profit from the application of PLS. Some important recent examples are briefly reviewed in this section.
- A straightforward application of univariate PLS regression to expression data from the yeast Saccharomyces cerevisiae can be found in [17]. In this study, some handpicked gene expression levels are regressed against expression levels of other genes using PLS1 with different numbers of latent components. The magnitude of the obtained regression coefficients is interpreted in terms of interaction strength between genes.
- PLS regression has also been successfully applied to missing values imputation in microarray data by Bras and Menezes [18]. In this approach, the missing values are imputed by PLS regression using all the genes with observed values as predictors. Another reference on PLS imputation in the context of microarray data is Nguyen et al. [19].
- Huang et al. [20] use PLS regression for a prediction purpose. The aim is to model a continuous variable (LVAD support time) using p gene expression levels as predictors. LVAD stands for left ventricular assist device, a successful substitution therapy for heart failure patients waiting for transplantation. Although PLS regression can handle a very large number of predictors and can thus be applied to this problem without adaptation, Huang et al. [20] suggest a penalized version of PLS regression (PPLS), which eliminates genes with poor prediction power. Their method is based on the shrinkage of the p regression coefficients obtained by PLS regression. After the shrinkage procedure, a number of genes (depending on the shrinkage parameter) no longer contribute to the model. Huang et al. [20] suggest using cross-validation to select both the shrinkage parameter and the number c of latent components used to produce the regression coefficients.
- PLS regression is used by Johansson et al. [21] to identify periodically expressed genes. Johansson et al. [21] construct a virtual response Y that represents cyclic behavior with the same periodicity as the cell cycle. The genes that contribute significantly to the PLS regression model are then interpreted as cell-cycle regulated.
- Applications of PLS multivariate regression to other types of data include the prediction of transcription factor activities from combined analysis of gene expression data and chromatin immunoprecipitation (ChIP) data as proposed by Boulesteix and Strimmer [16]. The transcription of genes is regulated by DNA binding proteins, which are known as transcription factors. An issue of interest for biologists is the estimation of the activity levels of these transcription factors. Available data material includes microarray data for the potential target genes under different experimental conditions, and connectivity data (e.g. ChIP data) giving the amount of interaction between the transcription factors and the considered genes. Boulesteix and Strimmer [16] assume that microarray data and connectivity data are related through the linear structure Y = A + XB + F, where Y is the n x q matrix containing the expression levels of n genes (rows) in q conditions (columns), X is the n x p matrix containing the connectivity information for n genes (rows) and p transcription factors (columns), A is an n x q matrix corresponding to the intercepts and F is an n x q error matrix. The p x q matrix B corresponds to the activity levels of the p transcription factors in the q considered conditions. Thus, the estimation of the transcription factor activities can be formulated as a regression problem, which is solved in [16] by employing the SIMPLS method. Using PLS in this context makes it possible not only to extract information on the transcription factor activities but also to identify coherent meta-factors corresponding to the different latent components.
- Other applications of PLS to regression problems in genomic data analysis include, e.g. the prediction of protein structure (e.g. the helix or strand content) using high-dimensional sequence data [22].
Classification problems
The examples above considered only the case of continuous response variables Y. In many studies, however, the response to be predicted is categorical. In other words, Y may take only one of K possible unordered values Y = 0, ..., K − 1. For instance, Y could be the tumor type of a particular cancer patient. If Y is multicategorical (K > 2), it has to be transformed before PLS dimension reduction. A simple transformation consists of converting Y into K − 1 random variables Y1, ..., YK−1 defined as follows:
$$y_{ik} = \begin{cases} 1 & \text{if } y_i = k, \\ 0 & \text{otherwise,} \end{cases} \qquad i = 1, \dots, n, \quad k = 1, \dots, K - 1.$$
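As a small illustration of this coding, assuming class labels 0, ..., K − 1 with class 0 acting as the reference category, the indicator matrix can be built as follows (the function name is hypothetical):

```python
import numpy as np

def dummy_code(labels, K):
    """Convert labels in {0, ..., K-1} into an n x (K-1) indicator matrix."""
    labels = np.asarray(labels)
    # Column k-1 is 1 where the label equals k, for k = 1, ..., K-1
    return (labels[:, None] == np.arange(1, K)[None, :]).astype(float)

labels = [0, 2, 1, 2, 0]
Y = dummy_code(labels, K=3)
# Rows of Y: [0, 0], [0, 1], [1, 0], [0, 1], [0, 0]
```
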
Using this transformation, it can be shown that multivariate PLS dimension reduction (almost) leads to the same components as PCA performed on the between-group sample covariance matrix. A collection of properties on this topic as well as mathematical proofs are given in [23]. These properties can be seen as a justification of PLS dimension reduction with categorical variables. Recently, many researchers have considered the PLS methods for classification:
- In two independent comparative studies by Man et al. [24] and Huang et al. [25], classification based on PLS regression is reported to lead to high prediction accuracy.
- PLS classification analysis for binary response has been investigated by Huang and Pan [26] for leukemia [27] and colon cancer data [28]. Each observation is assigned to one of the two classes 0 or 1, depending on the continuous prediction. Huang and Pan [26] suggest to determine the best number of latent components by leave-one-out cross-validation.
- A similar approach is used in a more applied study by Perez-Enciso and Tenenhaus [29]: various binary outcomes such as (i) before versus after chemotherapy treatment in a case-control study, (ii) estrogen receptor positive versus negative tumors and (iii) tumor type are predicted via PLS discriminant analysis.
- PLS regression is also employed for multiclass classification in [30] for the molecular diagnosis of cancer. Using the software SIMCA, the authors performed classification with the National Cancer Institute (NCI) data set [31] giving the expression levels of 9605 genes in 60 tumor cell lines of eight different types (leukemia, non-small-cell lung, colon, melanoma, ovarian, breast, central nervous system and renal).
- Other classification studies based on PLS regression can be found in [32-36]. A similar approach based on PLS regression to perform classification in the context of meta-analysis is suggested in [37].
There exists another route to classification using partial least squares, first proposed by Nguyen and Rocke [38, 39] and further studied by Boulesteix [40] and compared with other dimension reduction techniques in [41]. This approach first employs PLS as a dimension reduction method and subsequently uses the PLS latent components as predictors in a classical discrimination method (e.g. logistic regression, linear or quadratic discriminant analysis). To apply this method, one has to choose (i) the number of latent components to be extracted in the dimension reduction step and (ii) the classification method to be used for the classification step.
In Nguyen and Rocke [38, 39], three classification methods are studied: logistic regression, linear discriminant analysis and quadratic discriminant analysis. In [40], the only investigated classification method is linear discriminant analysis. Generally, linear discriminant analysis (LDA) turns out to yield the best classification performance, whereas quadratic discriminant analysis gives worse results. In the extensive comparison study performed by Boulesteix [40], which included many currently employed methods, PLS+LDA turns out to range among the best classification procedures for all the eight studied cancer data sets. According to this study, the most successful other methods are the nearest centroids approach by Tibshirani et al. [42] and the support vector machines.
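The two-step PLS+LDA procedure can be sketched as follows for a binary outcome. This is a minimal illustration on simulated data, not the implementation of [38-40] or of the plsgenomics package: the function names, the simulated effect size and the choice of c = 2 components are all assumptions made for the example. Note that the test data are centered with the training means and projected with the training weights.

```python
import numpy as np

def pls1_weights(X, y, c):
    """Unit-length PLS1 weights giving mutually uncorrelated components."""
    s = X.T @ y
    W, V = [], []
    for _ in range(c):
        r = s.copy()
        for v in V:
            r -= v * (v @ r)
        w = r / np.linalg.norm(r)
        W.append(w)
        v = X.T @ (X @ w)
        for u in V:
            v -= u * (u @ v)
        V.append(v / np.linalg.norm(v))
    return np.column_stack(W)

def lda_fit(T, y):
    """Class means, priors and inverse pooled covariance of the components."""
    classes = np.unique(y)
    means = np.array([T[y == k].mean(axis=0) for k in classes])
    priors = np.array([np.mean(y == k) for k in classes])
    resid = T - means[np.searchsorted(classes, y)]
    pooled = resid.T @ resid / (len(y) - len(classes))
    return classes, means, priors, np.linalg.inv(pooled)

def lda_predict(T, model):
    """Assign each row of T to the class with the largest linear score."""
    classes, means, priors, prec = model
    scores = (T @ prec @ means.T
              - 0.5 * np.sum(means @ prec * means, axis=1)
              + np.log(priors))
    return classes[np.argmax(scores, axis=1)]

rng = np.random.default_rng(4)
p, n_signal = 200, 20                        # simulated, illustration only

def simulate(n_per_class):
    y = np.repeat([0, 1], n_per_class)
    X = rng.normal(size=(2 * n_per_class, p))
    X[y == 1, :n_signal] += 2.0              # class 1 shifted in the first genes
    return X, y

X_train, y_train = simulate(20)
X_test, y_test = simulate(10)

x_mean = X_train.mean(axis=0)
W = pls1_weights(X_train - x_mean, y_train - y_train.mean(), c=2)
model = lda_fit((X_train - x_mean) @ W, y_train)
y_pred = lda_predict((X_test - x_mean) @ W, model)
accuracy = np.mean(y_pred == y_test)
```

In practice the number of components would be chosen by cross-validation rather than fixed in advance.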
Feature selection
An issue that is tightly connected with the prediction of a clinical outcome is the identification of genes whose expression levels are associated with the considered outcome. For instance, a physician might want to find out which genes have different expression levels in tumor tissues and normal tissues. The selection of relevant genes is important both for biologists who aim to understand the function of genes and the cell processes and for statisticians who want to apply statistical methods which can handle a restricted number of variables.
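In the case of PLS1 dimension reduction (see Univariate response section) applied to binary classification problems (see Classification problems section), the weight vector w1 = (w11, ..., wp1)T defining the first latent component may be used to order the p genes in terms of their relevance for the classification problem [40]. Let Fj denote the F-statistic used in analysis of variance and computed from X for gene j as:
$$F_j = \frac{n_0(\bar{x}_{j0} - \bar{x}_j)^2 + n_1(\bar{x}_{j1} - \bar{x}_j)^2}{\left[\sum_{i:\,y_i = 0}(x_{ij} - \bar{x}_{j0})^2 + \sum_{i:\,y_i = 1}(x_{ij} - \bar{x}_{j1})^2\right] / (n - 2)},$$
where n0 and n1 are the numbers of observations in classes 0 and 1, $\bar{x}_{j0}$ and $\bar{x}_{j1}$ are the class means of gene j and $\bar{x}_j$ is its overall mean. If the genes are standardized to unit variance, $w_{1j}^2$ is a monotonically increasing function of Fj, so that ordering the genes by $w_{1j}^2$ is equivalent to ordering them by Fj.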
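The link between the first PLS weight vector and the F-statistic can be verified numerically. The sketch below assumes a binary outcome and genes standardized to unit variance; the data are simulated for illustration only. Under these assumptions the squared weights are proportional to Fj / (Fj + n − 2), so both quantities rank the genes identically.

```python
import numpy as np

rng = np.random.default_rng(3)
n0, n1, p = 12, 8, 50
n = n0 + n1
y = np.r_[np.zeros(n0), np.ones(n1)]
X = rng.normal(size=(n, p))
X[y == 1, :5] += 1.5                          # a few differentially expressed genes
X = (X - X.mean(axis=0)) / X.std(axis=0)      # center and scale to unit variance

# First PLS1 weight vector: X^T y rescaled to unit length
w1 = X.T @ (y - y.mean())
w1 /= np.linalg.norm(w1)

def f_statistic(X, y):
    """Two-class ANOVA F-statistic for every column of X."""
    grand = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in (0, 1):
        Xk = X[y == k]
        between += len(Xk) * (Xk.mean(axis=0) - grand) ** 2
        within += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return between / (within / (len(y) - 2))

F = f_statistic(X, y)
same_ranking = np.array_equal(np.argsort(w1 ** 2), np.argsort(F))
ratio = w1 ** 2 / (F / (F + n - 2))           # constant across genes
```
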
A gene selection approach based on several PLS latent components is applied to gene expression data by Musumarra et al. [30, 43]. It is based on all the weight vectors w1, ..., wc and implemented in the software package SIMCA. The 'variable influence' $VIN_{kj}$ of gene j for the k-th PLS component is defined as a function of the weight $w_{kj}$ and the proportion of the sum of squares explained by the k-th latent component. Finally, the genes are ordered according to their variable importance in the projection VIPj, which is defined for each gene j as the sum of the $VIN_{kj}$ over the c PLS latent components. An advantage of this approach is that it captures information on the single genes from all the PLS latent components included in the analysis. Thus, it can also discover non-linear patterns which the F-statistic would fail to detect. A major drawback of the VIP index is its lack of theoretical background. One might investigate its connections to the matrix of regression coefficients.
Survival analysis
Another issue of interest in the statistical analysis of gene expression data is the prediction of the survival time Y of diseased patients using their gene expression profiles. In this context, survival data are usually denoted as a triple (t, δ, x), where:
- t is a continuous variable usually called failure time, which equals the time to death Y if δ = 1 or the time to censoring if δ = 0,
- δ is a binary variable, which equals 1 if the death of the patient was observed before censoring and 0 if the patient was still alive at the end of the study,
- x = (X1, ..., Xp)T is a vector of p continuous gene expression levels which are considered as predictor variables.
Standard approaches to predict survival times using continuous predictors such as the proportional hazard regression model (PH model) by Cox [44] may not be applied directly if n < p. Various approaches based on the clustering of genes or observations have been proposed, with the inconvenience that the results depend on the chosen clustering algorithm. PLS-based survival analysis is another important family of methods for survival analysis with many predictors.
Nguyen and Rocke [45] suggest a two-stage method that (i) performs univariate PLS with the failure time as response variable and X1, ..., Xp as predictors and (ii) uses the first latent components obtained in this way as predictors in classical PH regression. They apply their approach to lymphoma data [46] giving the survival time and expression levels of 5622 genes for 40 lymphoma patients and to breast cancer data [47] giving the survival time and expression levels of 3846 genes for 49 breast cancer patients. In this two-step procedure, dimension reduction and prediction using PH regression are performed successively. The specificity of the failure time is not taken into account during the dimension reduction stage: time to death and time to censoring are treated as the same continuous variable, which is a severe drawback if censoring is non-negligible. Improvements of this approach are proposed in [48-50]. These approaches combine the construction of the successive PLS latent components with PH regression, but in different ways. They are reviewed in the Outlook and Generalizations of PLS section, which deals with PLS-based methods for special response variables.
Available software
There are currently four R packages, together with one set of R programs (plss), that implement partial least squares approaches:
- plsgenomics
(http://cran.r-project.org/src/contrib/Descriptions/plsgenomics.html)
This package implements PLS regression (using the function simpls from the pls.pcr package) with user-friendly features such as the choice of the number of components. It also implements the classification method PLS+LDA presented in Classification problem section and discussed by Nguyen and Rocke [38, 39] and Boulesteix [40] as well as the ridge PLS method [51] mentioned in PLS and generalized linear models section.
- pls.pcr
(http://cran.r-project.org/src/contrib/Descriptions/pls.pcr.html)
This package implements the two main variants of multivariate PLS regression SIMPLS and PLS2 as well as PCR.
- pls
(http://cran.r-project.org/src/contrib/Descriptions/pls.html)
This package is an extension of the earlier package pls.pcr including, e.g. various plot functions and a formula interface.
- gpls
(http://cran.r-project.org/src/contrib/Descriptions/gpls.html)
This package implements the classification method using generalized PLS [52] mentioned in PLS and generalized linear Models section.
- plss
(http://www.math.univ-montp2.fr/~durand/ProgramSources.html)
These programs implement PLS regression based on splines transformations of the predictors [53]. They work only under R for Windows.
Other software
- Classification with PLS regression (PLS-DA; DA, discriminant analysis) is implemented in the software tool SIMCA
(http://www.umetrics.com/default.asp/pagename/software_simcap/c/3/).
- The SAS procedure PLS implements several dimension reduction methods such as PCR, Reduced Rank Regression (RRR) and PLS. The two main versions of multivariate PLS (SIMPLS and PLS2) are available. For PLS2, one may specify the algorithmic variant as an option, for instance NIPALS.
(http://support.sas.com/rnd/app/da/new/dapls.html)
- The PLS Toolbox (by Eigenvector Research Incorporated) for use with MATLAB
(http://software.eigenvector.com/toolbox/3_5/index.html)
includes a wide range of methods for multivariate statistical analysis, some of which are based on PLS regression. In particular, it includes the function plsda, which performs classification (class prediction) based on SIMPLS or PLS2 regression.
- The software tool Unscrambler
(http://www.camo.com/rt/Products/Unscrambler/unscrambler.html)
also implements univariate (PLS1) and multivariate (PLS2) PLS regression as well as PLS-DA.
OUTLOOK AND GENERALIZATIONS OF PLS
So far, we have considered applications of PLS regression to various biological problems. However, applying a regression method designed for continuous responses to categorical responses, or performing dimension reduction with survival data without taking censoring into account, is unappealing, although it is reported to give good results in many cases. In this section, we review methods that use the principle of PLS regression but adapt it to handle special types of responses such as survival time or categorical outcome. These methods can be divided into two categories. In the first category of methods, the structure of the univariate PLS regression algorithm remains unchanged, but the coefficients used to construct the latent components are modified. In the second category of methods, the PLS algorithm is embedded into a more complex generalized regression procedure. Both approaches can be applied to, e.g. survival analysis and classification. In the following, we consider only the univariate case, i.e. Y is an n x 1 matrix (a vector of length n).
Modification of the latent components in PLS regression
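Let us consider objective function 1. Some calculation using the Lagrange multiplier method yields
$$w_1 = \frac{X^T Y}{\lVert X^T Y \rVert}.$$
In the most usual PLS1 algorithm, the weight vectors w2, ..., wc are built sequentially in a similar way as w1, except that X and Y are replaced by deflated matrices. With $y_i$ denoting the ith element of Y and xij denoting the element of X at row i and column j, simple transformations lead to
$$w_{1j} \propto \sum_{i=1}^{n} x_{ij} y_i, \qquad j = 1, \dots, p,$$
i.e. the weight of gene j in the first latent component is proportional to the sample covariance between Xj and Y.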
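For readers who want the intermediate steps, the Lagrange multiplier argument for the first weight vector can be sketched as follows. This is a standard derivation written out under the assumption of a univariate response (so that $X^T Y$ is a vector); the symbol $\lambda$ for the multiplier is introduced here for illustration.

$$
\begin{aligned}
L(w, \lambda) &= w^T X^T Y Y^T X w - \lambda\,(w^T w - 1), \\
\frac{\partial L}{\partial w} &= 2\, X^T Y Y^T X w - 2 \lambda w = 0
\;\;\Longrightarrow\;\; (X^T Y)(X^T Y)^T w = \lambda w .
\end{aligned}
$$

Hence $w$ must be an eigenvector of the rank-one matrix $(X^T Y)(X^T Y)^T$, whose only non-zero eigenvalue is $\lVert X^T Y \rVert^2$ with unit-length eigenvector $X^T Y / \lVert X^T Y \rVert$. Since the objective value equals $\lambda$ at a stationary point, the maximum is attained at this eigenvector (or its opposite).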
A related approach denoted as PLS logistic regression is used in [57] to map complex trait genes using gene expression data. In this setting, the response is a categorical genetic trait and the latent components t2, ..., tc are constructed based on the regression coefficients estimated from a logistic regression model. Perez-Enciso et al. [57] demonstrate the potential of this approach in an extensive simulation study.
PLS and generalized linear models
Marx [58] proposes an extension of the concept of PLS regression into the framework of generalized linear models. This approach, which is denoted as iteratively reweighted partial least squares (IRPLS or IRWPLS), embeds the univariate PLS regression algorithm into the iterative steps of the usual Iteratively Reweighted Least Squares algorithm [59] for generalized linear models, resulting in two nested loops. The loops are iterated a fixed number of times or until a convergence criterion is reached. This apparently appealing approach has a major drawback in practical microarray data analysis: convergence is never reached if X is full row-rank, which is most often the case in high-dimensional microarray data with n << p [51]. The IRPLS method as well as a few adaptations overcoming the convergence problem have been applied both to survival analysis and classification. Binary classification is one of the most common applications of generalized linear models and of Marx's IRPLS algorithm. To our knowledge, the IRPLS algorithm has never been applied directly to classification with microarray data. However, it has inspired at least two recent papers on the generalization of PLS regression to categorical response variables.
The first approach is proposed by Ding and Gentleman [52] and can be seen as an adaptation of Marx's IRPLS method which solves the problem of separation. As already mentioned in the Classification problems section, infinite parameter estimates can occur in binary logistic regression when the two classes are completely or quasi-completely separated [60]. Firth [61] suggests a procedure to remove the first-order term of the asymptotic bias of maximum likelihood estimates in Generalized Linear Models (GLMs). The procedure is based on a modified score function which, when applied to logistic regression, guarantees finite estimates [62]. The binary classification method obtained by using Firth's modified score function in place of the usual score function in the IRPLS algorithm is denoted as IRWPLSF by Ding and Gentleman [52]. They also propose a generalization of the method to multicategorical response variables, which is based on the multinomial logit model and denoted as MIRWPLSF. The IRWPLSF and MIRWPLSF are reported to achieve a slightly better classification performance than usual classification methods such as nearest neighbors or SVM on the colon cancer data [28] and on the NCI cancer data [31]. The second approach to modify Marx's IRPLS is suggested in [51]: the procedure embeds a PLS step into ridge penalty logistic regression and might also be generalized to multicategorical responses. This method is applied with success to the colon cancer data [28], the leukemia data [27] and the prostate cancer data [63].
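The separation problem can be illustrated with plain Iteratively Reweighted Least Squares for logistic regression on a tiny, perfectly separated data set. Note that this sketch uses ordinary IRLS, not IRPLS or Firth's correction; the weighted least squares step marked in the code is the place where, following the description above, a PLS regression would be embedded. The toy data and function name are assumptions made for illustration.

```python
import numpy as np

def irls_logistic(X, y, n_iter):
    """Plain IRLS for logistic regression; returns the coefficient path."""
    beta = np.zeros(X.shape[1])
    path = []
    for _ in range(n_iter):
        eta = X @ beta                       # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))      # fitted probabilities
        v = mu * (1.0 - mu)                  # IRLS weights
        z = eta + (y - mu) / v               # working response
        # Weighted least squares step (the step that IRPLS replaces by PLS)
        beta = np.linalg.solve(X.T @ (v[:, None] * X), X.T @ (v * z))
        path.append(beta.copy())
    return np.array(path)

# One predictor, no intercept, classes perfectly separated at x = 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

path = irls_logistic(X, y, n_iter=8)
norms = np.abs(path[:, 0])                   # grows at every iteration
```

The coefficient keeps increasing from one iteration to the next instead of settling down, which is the behavior that a finite-estimate correction such as Firth's is meant to prevent. Running many more iterations on such data eventually leads to numerical overflow, so the number of iterations is kept small here.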
Another classical application of generalized linear models and IRPLS is survival analysis. As suggested in [64], Park et al. [49] transform the failure time problem into a generalized linear regression problem with logarithmic link function. They propose to use the IRPLS estimation method for generalized linear regression [58]. In contrast to the two-stage scheme developed in [45], this method takes censoring explicitly into account. The number of components is chosen via a cross-validation procedure, which suggests using c = 1 for the lung cancer data set [65]. According to Park et al. [49], convergence is achieved in a few steps. However, this property seems to be controversial: convergence problems are cited as a drawback of the method in the more recent paper by Li and Gui [50].
CONCLUSIONS
The microarray revolution has led to an enormous increase in the availability of high-dimensional biomedical data. Many classical multivariate methods are not applicable to these small n, large p data sets. In this article we have reviewed the PLS approach to regression and dimension reduction, which is well suited for analysing this kind of data.
Specifically, PLS has several advantages over many competing approaches:
- It automatically performs variable selection.
- It can be applied to a diverse set of tasks, including classification, survival analysis and modeling of transcription factor activities.
- It is statistically very efficient.
- It is computationally very fast, which makes it practical for application to large data sets.
As outlined in the "Application of Partial Least Squares to High-dimensional Genomic Data" and "Outlook and Generalizations of PLS" sections of this review, most applications of the PLS method to genomic data reported so far focus on the analysis of microarray data from gene expression experiments. The key advantages that characterize the PLS methodology are versatility and flexibility. On the one hand, it can be directly applied to various types of data of any dimension for different prediction or imputation problems. On the other hand, PLS algorithms adapt easily to a broad range of questions and thus serve as a flexible basis for the development of novel tools for the analysis of biological data. In short, we expect that with the advent of proteomics data, e.g. from mass spectrometric experiments, PLS will in the future also play a major role in analysing many other kinds of high-dimensional omics data.
Key Points
|
| APPENDIX |
|---|
|
|
|---|
List of abbreviations
| FOOTNOTES |
|---|
|
|
|---|
Anne-Laure Boulesteix is a post-doctoral researcher and consultant in biostatistics at the Technical University of Munich. She received her PhD in statistics in 2005 from the University of Munich, and is generally interested in computational statistics and high-dimensional multivariate data analysis.
Korbinian Strimmer is heading the Information Theory and Bioinformatics group at the Department of Statistics of the University of Munich. His research focuses on statistical learning procedures, complex networks and statistical genomics.
| References |
|---|
|
|
|---|
- Wold H. Estimation of principal components and related models by iterative least squares. In Krishnaiah PR (Ed.). Multivariate Analysis. New York: Academic Press 1966 pp. 391–420.
- Wold H. Nonlinear Iterative Partial Least Squares (NIPALS) modeling: some current developments. In Krishnaiah PR (Ed.). Multivariate Analysis. New York: Academic Press 1973 pp. 383–407.
- Wold H. Path models with latent variables: the NIPALS approach. In Blalock HM (Ed.). Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building. New York: Academic Press 1975.
- Wold S, Ruhe A, Wold H, et al. Collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J Sci Comput Stat 1984; 5:735–43.
- Martens H. Reliable and relevant modeling of real world data: a personal account of the development of PLS regression. Chemom Intell Lab Syst 2001; 58:85–95.
- Wold S. Personal memories of the early PLS development. Chemom Intell Lab Syst 2001; 58:83–4.
- Phatak A, de Hoog F. Exploiting the connection between PLSR, Lanczos, and conjugate gradients: alternative proofs of some properties of PLSR. J Chemom 2002; 16:361–7.
- Helland I. On the structure of Partial Least Squares. Comm Stat Simul Comp 1988; 17:581–607.
- Stone M, Brooks RJ. Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal component regression. J Roy Stat Soc B 1990; 52:237–69.
- Frank IE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics 1993; 35:109–35.
- Garthwaite PH. An interpretation of partial least squares. J Am Stat Assoc 1994; 89:122–7.
- Martens H, Naes T. Multivariate Calibration. New York: Wiley 1989.
- ter Braak CJF, de Jong S. The objective function of partial least squares. J Chemom 1998; 12:41–54.
- de Jong S. SIMPLS: an alternative approach to partial least squares regression. Chemom Intell Lab Syst 1993; 18:251–63.
- Rao CR. Linear Statistical Inference and its Applications. New York: Wiley 1993.
- Boulesteix AL, Strimmer K. Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. Theor Biol Med Model 2005; 2:23.
- Datta S. Exploring the relationships in gene expressions: a partial least squares approach. Gene Expression 2001; 9:257–64.
- Bras LP, Menezes JC. Dealing with gene expression missing data. IEE Syst Biol 2006; 153:105–19.
- Nguyen DV, Wang N, Carroll RJ. Evaluation of missing value estimation for microarray data. J Data Sci 2004; 2:347–70.
- Huang X, Pan W, Park S, et al. Modeling the relationship between LVAD support time and gene expression changes in the human heart by penalized partial least squares. Bioinformatics 2004; 20:888–94.
- Johansson D, Lindgren P, Berglund A. A multivariate approach applied to microarray data for identification of genes with cell cycle-coupled transcription. Bioinformatics 2003; 19:467–73.
- Clementi M, Clementi S, Cruciani G, et al. Robust multivariate statistics and the prediction of protein secondary structure content. Protein Eng 1997; 10:747–9.
- Barker M, Rayens W. Partial least squares for discrimination. J Chemom 2003; 17:166–73.
- Man MZ, Dyson G, Johnson K, et al. Evaluating methods for classifying expression data. J Biopharm Stat 2004; 14:1065–84.
- Huang X, Pan W, Grindle S, et al. A comparative study of discriminating human heart failure etiology using gene expression profiles. BMC Bioinformatics 2005; 6:205.
- Huang X, Pan W. Linear regression and two-class classification with gene expression data. Bioinformatics 2003; 19:2072–8.
- Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999; 286:531–7.
- Alon U, Barkai N, Notterman DA, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci 1999; 96:6745–50.
- Perez-Enciso M, Tenenhaus M. Prediction of clinical outcome with microarray data: a partial least squares approach. Hum Genet 2003; 112:581–92.
- Musumarra G, Barresi V, Condorelli DF, et al. Potentialities of multivariate approaches in genome-based cancer research: identification of candidate genes for new diagnostics by PLS discriminant analysis. J Chemom 2004; 18:125–32.
- Ross DT, Scherf U, Eisen MB, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 2000; 24:227–34.
- Alaiya AA, Franzen B, Hagman A, et al. Classification of human ovarian tumors using multivariate data analysis of polypeptide expression patterns. Int J Cancer 2000; 86:731–6.
- Musumarra G, Condorelli DF, Scire S, et al. Shortcuts in genome-scale cancer pharmacology research from multivariate analysis of the National Cancer Institute gene expression data base. Biochem Pharmacol 2001; 62:547–53.
- Cho JH, Lee D, Park JH, et al. Optimal approach for classification of acute leukemia subtypes based on gene expression data. Biotech Progress 2002; 18:847–54.
- Tan Y, Shi L, Tong W, et al. Multi-class tumor classification by discriminant partial least squares using microarray gene expression data and assessment of classification models. Comput Biol Chem 2004; 28:235–44.
- Modlich O, Prisack HB, Munnes M, et al. Predictors of primary breast cancers responsiveness to preoperative epirubicin/cyclophosphamide-based chemotherapy: translation of microarray data into clinically useful predictive signatures. J Transl Med 2005; 3:32.
- Huang X, Pan W, Han X, et al. Borrowing information from relevant microarray studies for sample classification using weighted partial least squares. Comput Biol Chem 2005; 29:204–11.
- Nguyen DV, Rocke D. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 2002; 18:39–50.
- Nguyen DV, Rocke D. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics 2002; 18:1216–26.
- Boulesteix AL. PLS dimension reduction for classification with high-dimensional microarray data. Stat Appl Genet Mol Biol 2004; 3:33.
- Dai JJ, Lieu L, Rocke D. Dimension reduction for classification with gene expression data. Stat Appl Genet Mol Biol 2006; 5:6.
- Tibshirani R, Hastie T, Narasimhan B, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci 2002; 99:6567–72.
- Musumarra G, Barresi V, Condorelli DF, et al. A bioinformatics approach to the identification of candidate genes for the development of new cancer diagnostics. Biol Chem 2003; 384:321–7.
- Cox DR. Regression models and life-tables (with discussion). J Roy Stat Soc B 1972; 34:187–220.
- Nguyen DV, Rocke D. Partial least squares proportional hazards regression for application to DNA microarray survival data. Bioinformatics 2002; 18:1625–32.
- Alizadeh AA, Eisen MB, Davis RE, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000; 403:503–11.
- Sorlie T, Perou CM, Tibshirani R, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci 2001; 98:10869–74.
- Nguyen DV. Partial least squares dimension reduction for microarray gene expression data with a censored response. Math Biosci 2005; 193:119–37.
- Park PJ, Tian L, Kohane IS. Linking gene expression data with patient survival times using partial least squares. Bioinformatics 2002; 18:S120–7.
- Li H, Gui J. Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics 2004; 20:i208–15.
- Fort G, Lambert-Lacroix S. Classification using partial least squares with penalized logistic regression. Bioinformatics 2005; 21:1104–11.
- Ding B, Gentleman R. Classification using penalized partial least squares. J Comput Graph Stat 2005; 14:280–98.
- Durand JF. Local polynomial additive regression through PLS and splines: PLSS. Chemom Intell Lab Syst 2001; 58:235–46.
- Bastien P. PLS-Cox model: application to gene expression data. Proceedings COMPSTAT 04. Springer: Physica-Verlag 2004 pp. 655–62.
- Bastien P, Esposito-Vinzi V, Tenenhaus M. PLS generalized linear regression. Comput Stat Data Anal 2005; 48:17–46.
- Nguyen DV, Rocke D. On partial least squares dimension reduction for microarray-based classification: a simulation study. Comput Stat Data Anal 2004; 46:407–25.
- Perez-Enciso M, Toro MA, Tenenhaus M, et al. Combining gene expression and molecular marker information for mapping complex trait genes: a simulation study. Genetics 2003; 164:1597–606.
- Marx BD. Iteratively reweighted partial least squares. Technometrics 1996; 38:374–81.
- Green P. Iteratively reweighted least squares for maximum likelihood estimation and some robust and resistant alternatives. J Roy Stat Soc B 1984; 46:149–92.
- Albert A, Anderson J. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984; 71:1–10.
- Firth D. Bias reduction of maximum likelihood estimates. Biometrika 1993; 80:27–38.
- Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Stat Med 2002; 21:2409–19.
- Singh D, Febbo PG, Ross K, et al. Gene expression correlates of clinical prostate cancer behaviour. Cancer Cell 2002; 1:203–9.
- Whitehead J. Fitting Cox's regression model to survival data using GLIM. J Roy Stat Soc C 1980; 29:268–75.
- Bhattacharjee A, Richards WG, Staunton J, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci 2001; 98:13790–5.