Bent Jørgensen and Yuri Goegebeur
Department of Statistics

ST02: Multivariate Data Analysis and Chemometrics

Module 7: Partial least squares regression I

Contents
7.1 The PLS1 algorithm
    7.1.1 The steps of the PLS1 algorithm
    7.1.2 Comments on the PLS1 algorithm
7.2 Prediction for PLS1

Rick: I congratulate you.
Victor Laszlo: What for?
Rick: Your work.
Victor: I try.
Rick: We all try. You succeed.
[Casablanca, 1942]

7.1 The PLS1 algorithm

The PCR method from the previous module represents a considerable improvement over MLR and CLS. By using latent variables (scores), it is possible to use a large number of variables (frequencies), just as in CLS, but without having to know about all the interferences. Problems may arise, however, if there is a lot of variation in X that is not due to the analyte as such. PCR finds, somewhat uncritically, those latent variables that describe as much as possible of the variation in X. But sometimes the analyte itself gives rise to only small variations in X, and if the interferences vary a lot, then the latent variables found by PCR may not be particularly good at describing Y. In the worst case, important information may be hidden in directions of the X-space that PCR interprets as noise and therefore leaves out. Partial Least Squares Regression (PLS) copes better with this problem by forming latent variables that are relevant for describing Y. See Examples for a motivating example.

7.1.1 The steps of the PLS1 algorithm

We now consider the general form of the PLS1 algorithm. We assume that X is an n × k centered data matrix and y an n × 1 centered data vector. The so-called PLS2 algorithm, considered in Module 8, may be used for the case of more than one column in Y.
The PLS2 algorithm, however, is more complicated than PLS1, and even when several columns are available in Y, it may be preferable to apply PLS1 separately to each column of Y. On the other hand, PLS2 may be better for initial, more exploratory investigations, or in cases where the different analytes show covariation.

http://statmaster.sdu.dk/courses/ST02    January 29, 2007

The PLS1 algorithm starts with the initialization j = 1, X_1 = X and y_1 = y. The algorithm then proceeds through the following steps to find the first g latent variables:

1. Let w_j = X_j^T y_j / ||X_j^T y_j||.
2. Let t_j = X_j w_j.
3. Let ĉ_j = t_j^T y_j / (t_j^T t_j).
4. Let p_j = X_j^T t_j / (t_j^T t_j).
5. Let X_{j+1} = X_j − t_j p_j^T and y_{j+1} = y_j − t_j ĉ_j.
6. Stop if j = g; otherwise let j = j + 1 and return to Step 1.

Now form the two k × g matrices W and P and the n × g matrix T with columns w_j, p_j and t_j, respectively, and form a column vector ĉ (g × 1) with elements ĉ_j. Let

    X̂ = T P^T = Σ_{j=1}^g t_j p_j^T

and

    ŷ = T ĉ = X W (P^T W)^{-1} ĉ,

which are the predicted values of X and y, respectively. The matrix W is orthogonal, and T has orthogonal columns. The PLS1 algorithm is used here in order to define the method, although there are alternative ways of organizing the computations; see Bro (1996, p. 57). Note that, in spite of the similarities with the NIPALS algorithm, the PLS1 algorithm is recursive and requires exactly g steps, whereas the NIPALS algorithm is iterative: the number of iterations cannot be determined in advance and depends on the choice of a stopping criterion. In this sense, the PLS1 algorithm is simpler than the NIPALS algorithm.

7.1.2 Comments on the PLS1 algorithm

We now comment on each step of the PLS1 algorithm in turn. For simplicity, we explain the first run of the algorithm (j = 1), and then go on to explain the general case.

Step 1. In PLS, we seek the direction in the X-space that yields the largest covariance between X and y.
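The six steps above translate directly into code. The following is a minimal numpy sketch, assuming centered inputs; the function and variable names are ours, not part of the course material:

```python
import numpy as np

def pls1_fit(X, y, g):
    """PLS1 on a centered X (n x k) and centered y (n,).

    Returns the weight matrix W, loading matrix P, score matrix T
    and the regression coefficients c_hat, as defined in Steps 1-6.
    """
    n, k = X.shape
    W = np.zeros((k, g))
    P = np.zeros((k, g))
    T = np.zeros((n, g))
    c = np.zeros(g)
    Xj, yj = X.copy(), y.copy()
    for j in range(g):
        w = Xj.T @ yj
        w /= np.linalg.norm(w)         # Step 1: unit weight vector
        t = Xj @ w                     # Step 2: score vector
        c[j] = (t @ yj) / (t @ t)      # Step 3: regress y_j on t_j
        p = (Xj.T @ t) / (t @ t)       # Step 4: loadings
        Xj = Xj - np.outer(t, p)       # Step 5: deflate X and y
        yj = yj - t * c[j]
        W[:, j], P[:, j], T[:, j] = w, p, t
    return W, P, T, c
```

With a fitted model, the two expressions for the predicted values can be compared: `T @ c` and `X @ W @ inv(P.T @ W) @ c` should agree to numerical precision, and `W.T @ W` should be the identity.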
This direction is given by a unit vector w, and is such that large variations in x-values are accompanied by large variations in the corresponding y-values. The unit vector w_1 (k × 1) is thus formed by normalizing the vector of covariances between y and the columns of X. A further interpretation of w_1 is that its transpose w_1^T is proportional to the CLS regression coefficient

    â = y^T X / (y^T y).

It may hence be useful, for diagnostic purposes, to compare w_1 with any prior knowledge about the spectrum of the pure analyte, although possible interferences may obscure this picture.

Step 2. The n × 1 score vector t_1 is formed as a linear combination of the columns of X with weights w_1. As explained above, the relative weights are given by the covariances between y and each of the columns of X, and t_1 may hence be understood as the best linear combination of the columns of X for the purpose of predicting y. The latent vectors t_j are also called scores, similar to the terminology for PCA.

Step 3. The regression coefficient ĉ_1 is calculated by ordinary linear regression of y on t_1.

Step 4. The k × 1 vector p_1 is the transpose of the vector of regression coefficients obtained from simple linear regressions of the columns of X on t_1.

Step 5. The n × k matrix X_2 = X − t_1 p_1^T contains the residuals after regressing X on t_1, and correspondingly, y_2 = y − t_1 ĉ_1 contains the residuals after regressing y on t_1. This step ensures that the t_j-vectors become orthogonal (just as the corresponding score vectors are in PCR), and thus ensures that the multiple regression of y on T can be calculated one column at a time, as done in Step 3.

After the first run through Steps 1–5, the procedure is repeated using the residuals X_2 and y_2.
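The orthogonality produced by the deflation in Step 5 can be checked numerically. The following self-contained sketch runs Steps 1–5 twice on simulated data (all names are ours); the score vectors t_1, t_2 and the weight vectors w_1, w_2 come out orthogonal, as claimed:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5)); X -= X.mean(axis=0)   # centered data
y = rng.standard_normal(30); y -= y.mean()

# Round 1 (j = 1): Steps 1-5
w1 = X.T @ y / np.linalg.norm(X.T @ y)
t1 = X @ w1
c1 = t1 @ y / (t1 @ t1)
p1 = X.T @ t1 / (t1 @ t1)
X2 = X - np.outer(t1, p1)     # Step 5: residual matrix
y2 = y - t1 * c1              # residual vector

# Round 2 (j = 2): Steps 1-2 on the residuals
w2 = X2.T @ y2 / np.linalg.norm(X2.T @ y2)
t2 = X2 @ w2

print(abs(t1 @ t2) < 1e-10)   # → True: scores are orthogonal
print(abs(w1 @ w2) < 1e-10)   # → True: weights are orthogonal
```

The first identity follows because t_1^T X_2 = t_1^T X − (t_1^T t_1) p_1^T = 0 by the definition of p_1.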
The algorithm then finds the best linear combination of the columns of X_2 for the purpose of predicting y_2, thus picking up any further structure in the connection between X and y not accounted for by t_1. This is repeated again and again, such that each run of the algorithm in principle reveals more information about the connection between X and y. Just as for PCR, the information accounted for usually becomes smaller with each step taken. After the g runs have been completed, the following relations hold:

    X = T P^T + X_{g+1}                          (7.1)
    y = T ĉ + y_{g+1}.

The number of scores g should hence, in principle, be chosen such that X_{g+1} contains no further information about y_{g+1}, or in other words, such that X_{g+1} and y_{g+1} are approximately uncorrelated with each other. In the extreme case where X_j^T y_j becomes zero, the algorithm is stopped prematurely. In summary, further scores should be extracted only as long as each new variable contributes significantly to the description of y. Criteria for deciding when this is the case will be discussed later, in Module 13.

7.2 Prediction for PLS1

Prediction for the PLS1 method is slightly more complicated than for PCR, in spite of the algorithm being simpler. Consider a new prediction sample z (1 × k vector) and its predicted value y⃗ (both uncentered). Note the new notation for the predicted value. Let x̄ (1 × k) and ȳ be the calibration sample averages. The prediction is performed by essentially retracing the steps of the algorithm, letting the row vector z − x̄ follow the same steps as a row of the X matrix.

Let W, T, P and ĉ be the matrices and vector formed after applying the PLS1 algorithm to the calibration data. Initialize by taking j = 1 and x_j = z − x̄. Then proceed through the following steps:

1. Let t_j = x_j w_j.
2. Let x_{j+1} = x_j − t_j p_j^T.
3. Let j = j + 1; if j ≤ g, return to Step 1.
Now form the row vector t̂ = (t_1, ..., t_g), and complete the prediction as follows:

    y⃗ = ȳ + t̂ ĉ.

It is possible, though, to summarize the prediction in a single matrix formula (Bro, 1996, pp. 62–63), as follows:

    y⃗ = ȳ + (z − x̄) b̂,

where b̂, the so-called regression vector, is

    b̂ = W (P^T W)^{-1} ĉ.

There is, however, a slight disadvantage to this method, because the useful information contained in the individual latent variables t_1, ..., t_g is not available here. Like the regression matrix in the previous methods, the regression vector b̂ contains useful information about which areas (frequencies) contribute to the prediction.
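The two routes to the prediction, retracing the algorithm step by step and applying the regression vector b̂, can be checked against each other. A minimal numpy sketch follows; the toy calibration fit is included only to make the example self-contained, and all names are ours:

```python
import numpy as np

def pls1(X, y, g):
    """Fit PLS1 on centered data; return W, P and c_hat."""
    k = X.shape[1]
    W, P, c = np.zeros((k, g)), np.zeros((k, g)), np.zeros(g)
    Xj, yj = X.copy(), y.copy()
    for j in range(g):
        w = Xj.T @ yj; w /= np.linalg.norm(w)
        t = Xj @ w
        c[j] = t @ yj / (t @ t)
        P[:, j] = Xj.T @ t / (t @ t)
        Xj -= np.outer(t, P[:, j])
        yj -= t * c[j]
        W[:, j] = w
    return W, P, c

# Toy (uncentered) calibration data
rng = np.random.default_rng(2)
X0 = rng.standard_normal((25, 4)) + 5.0
y0 = X0 @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(25)
xbar, ybar = X0.mean(axis=0), y0.mean()
W, P, c = pls1(X0 - xbar, y0 - ybar, g=2)

z = rng.standard_normal(4) + 5.0        # new uncentered sample

# Route 1: retrace the prediction steps
xj = z - xbar
t_hat = np.zeros(2)
for j in range(2):
    t_hat[j] = xj @ W[:, j]             # Step 1
    xj = xj - t_hat[j] * P[:, j]        # Step 2
pred_steps = ybar + t_hat @ c

# Route 2: the regression vector b_hat = W (P^T W)^{-1} c_hat
b_hat = W @ np.linalg.solve(P.T @ W, c)
pred_vector = ybar + (z - xbar) @ b_hat

print(np.isclose(pred_steps, pred_vector))   # the two routes agree
```

The second route is convenient for inspecting b̂ across frequencies, while the first keeps the individual scores t_1, ..., t_g available for diagnostics.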