# Partial Least Squares

Bent Jørgensen and Yuri Goegebeur
Department of Statistics
ST02: Multivariate Data Analysis and Chemometrics
Module 7: Partial least squares regression I
Contents

7.1 The PLS1 algorithm
    7.1.1 The steps of the PLS1 algorithm
    7.1.2 Comments on the PLS1 algorithm
7.2 Prediction for PLS1
7.1 The PLS1 algorithm
Rick: I congratulate you. Victor Laszlo: What for? Rick: Your work. Victor: I
try. Rick: We all try. You succeed. [Casablanca, 1942]
The PCR method from the previous module represents a considerable improvement over MLR and CLS. By using latent variables (scores), it is possible to use a large number of variables (frequencies), just as in CLS, but without having to know about all interferences. Problems may arise, however, if there is a lot of variation in X that is not due to the analyte as such. PCR finds, somewhat uncritically, those latent variables that describe as much as possible of the variation in X. But sometimes the analyte itself gives rise to only small variations in X, and if the interferences vary a lot, then the latent variables found by PCR may not be particularly good at describing Y. In the worst case, important information may be hidden in directions of the X-space that PCR interprets as noise and therefore leaves out.

Partial Least Squares Regression (PLS) copes better with this problem by forming latent variables that are relevant for describing Y. See Examples for a motivating example.
7.1.1 The steps of the PLS1 algorithm
We now consider the general form of the PLS1 algorithm. We assume that X is an n × k centered data matrix and y an n × 1 centered data vector. The so-called PLS2 algorithm considered in Module 8 may be used for the case of more than one column in Y. The PLS2 algorithm, however, is more complicated than PLS1, and even when several columns are available in Y, it may be preferable to apply PLS1 separately to each column of Y. On the other hand, PLS2 may be better for initial, more exploratory investigations, or in cases where the different analytes show covariation.
The PLS1 algorithm starts with the initialization $j = 1$, $X_1 = X$ and $y_1 = y$. The algorithm then proceeds through the following steps to find the first $g$ latent variables (a code sketch follows the list):

1. Let $w_j = X_j^\top y_j / \|X_j^\top y_j\|$.
2. Let $t_j = X_j w_j$.
3. Let $\hat{c}_j = t_j^\top y_j / (t_j^\top t_j)$.
4. Let $p_j = X_j^\top t_j / (t_j^\top t_j)$.
5. Let $X_{j+1} = X_j - t_j p_j^\top$ and $y_{j+1} = y_j - t_j \hat{c}_j$.
6. Stop if $j = g$; otherwise let $j = j + 1$ and return to Step 1.
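The steps above translate almost line for line into NumPy. The following is a minimal sketch, not from the notes: the function name `pls1` and its interface are illustrative, and X and y are assumed to be centered beforehand, as the algorithm requires.

```python
import numpy as np

def pls1(X, y, g):
    """Minimal PLS1 sketch: g latent variables from centered X (n x k), y (n,)."""
    n, k = X.shape
    W = np.zeros((k, g))   # weight vectors w_j as columns
    P = np.zeros((k, g))   # loading vectors p_j as columns
    T = np.zeros((n, g))   # score vectors t_j as columns
    c = np.zeros(g)        # regression coefficients c_j
    Xj, yj = X.copy(), y.copy()
    for j in range(g):
        w = Xj.T @ yj                    # Step 1: covariance direction...
        w /= np.linalg.norm(w)           # ...scaled to unit length
        t = Xj @ w                       # Step 2: score vector
        cj = (t @ yj) / (t @ t)          # Step 3: regress y_j on t_j
        p = (Xj.T @ t) / (t @ t)         # Step 4: regress columns of X_j on t_j
        Xj = Xj - np.outer(t, p)         # Step 5: deflate X_j...
        yj = yj - cj * t                 # ...and y_j
        W[:, j], P[:, j], T[:, j], c[j] = w, p, t, cj
    return W, P, T, c
```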
Now form the two $k \times g$ matrices $W$ and $P$ and the $n \times g$ matrix $T$ with columns $w_j$, $p_j$ and $t_j$, respectively, and form a column vector $\hat{c}$ ($g \times 1$) with elements $\hat{c}_j$. Let

$$\hat{X} = T P^\top = \sum_{j=1}^{g} t_j p_j^\top$$

and

$$\hat{y} = T \hat{c} = X W (P^\top W)^{-1} \hat{c},$$

which are the predicted values of X and y, respectively. The matrix W is orthogonal, and T has orthogonal columns.
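Continuing the sketch, both expressions for $\hat{y}$ and the orthogonality of W can be verified numerically on synthetic (made-up) data:

```python
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 6)); X -= X.mean(axis=0)   # centered calibration data
y = rng.standard_normal(20);      y -= y.mean()

W, P, T, c = pls1(X, y, g=3)
y_hat = T @ c                                           # fitted y via the scores
print(np.allclose(y_hat, X @ W @ np.linalg.solve(P.T @ W, c)))  # True
print(np.allclose(W.T @ W, np.eye(3)))                  # True: W is orthogonal
```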
The PLS1 algorithm is used here in order to define the method, although there are alternative ways of organizing the computations; see Bro (1996, p. 57). Note that, in spite of the similarities with the NIPALS algorithm, the PLS1 algorithm is recursive and requires exactly g steps, whereas the NIPALS algorithm is iterative: the number of iterations cannot be determined in advance and depends on the choice of stopping criterion. In this sense, the PLS1 algorithm is simpler than the NIPALS algorithm.
7.1.2 Comments on the PLS1 algorithm
We now comment on each step of the PLS1 algorithm in turn. For simplicity, we explain the first run of the algorithm (j = 1), and then go on to explain the general case.

Step 1. In PLS, we seek the direction in the space of X which yields the biggest covariance between X and y. This direction is given by a unit vector w, and is such that large variations in x-values are accompanied by large variations in the corresponding y-values. The unit vector $w_1$ ($k \times 1$) is thus formed by standardizing the vector of covariances between y and the columns of X. A further interpretation of $w_1$ is that its transpose $w_1^\top$ is proportional to the CLS regression coefficient

$$\hat{a} = \frac{y^\top X}{y^\top y}.$$

It may hence be useful, for diagnostic purposes, to compare $w_1$ with any prior knowledge about the spectrum of the pure analyte, although possible interferences may obscure this picture.
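A quick numerical check of this proportionality, reusing the synthetic X and y from the snippet above: $w_1$ and $\hat{a}$ coincide once both are scaled to unit length.

```python
w1 = X.T @ y / np.linalg.norm(X.T @ y)   # Step 1 weight vector
a_hat = (y @ X) / (y @ y)                # CLS regression coefficient
print(np.allclose(w1, a_hat / np.linalg.norm(a_hat)))   # True
```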
Step 2. The $n \times 1$ score vector $t_1$ is formed as a linear combination of the columns of X with weights $w_1$. As explained above, the relative weights are given by the covariances between y and each of the columns of X, and $t_1$ may hence be understood as the best linear combination of the columns of X for the purpose of predicting y. The latent vectors $t_j$ are also called scores, similar to the terminology for PCA.

Step 3. The regression coefficient $\hat{c}_1$ is calculated by ordinary linear regression of y on $t_1$.
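The same fact in code (continuing the running example): an ordinary no-intercept least-squares fit of y on the single regressor $t_1$ reproduces $\hat{c}_1$.

```python
t1 = X @ w1                                  # Step 2 score vector
c1 = (t1 @ y) / (t1 @ t1)                    # Step 3 coefficient
c1_ols, *_ = np.linalg.lstsq(t1[:, None], y, rcond=None)
print(np.isclose(c1, c1_ols[0]))             # True
```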
Step 4. The $k \times 1$ vector $p_1$ is the transpose of the vector of regression coefficients obtained from simple linear regressions of the columns of X on $t_1$.

Step 5. The $n \times k$ matrix $X_2 = X - t_1 p_1^\top$ represents the residuals after regressing X on $t_1$, and correspondingly, $y_2 = y - t_1 \hat{c}_1$ are the residuals after regressing y on $t_1$. This step ensures that the $t_j$-vectors become orthogonal (just as the corresponding $t_j$ are in PCR), and thus ensures that the multiple regression of y on T can be calculated one column at a time, as done in Step 3 (see the check below).
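Checked on the fit from the running example: $T^\top T$ is diagonal, so the joint multiple regression of y on T decouples into the g simple regressions of Step 3.

```python
G = T.T @ T
print(np.allclose(G, np.diag(np.diag(G))))            # True: scores are orthogonal
print(np.allclose(np.linalg.solve(G, T.T @ y), c))    # True: joint OLS gives the c_j
```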
After the first run through Steps 1–5, the procedure is repeated using the residuals $X_2$ and $y_2$. The algorithm then finds the best linear combination of the columns of $X_2$ for the purpose of predicting $y_2$, thus picking up any further structure in the connection between X and y not accounted for by $t_1$. This is repeated again and again, each run of the algorithm extracting further structure in the connection between X and y. Just as for PCR, the information accounted for usually becomes less and less for each step taken.
After the g runs have been completed, the following relations hold:

$$X = T P^\top + X_{g+1}, \qquad y = T \hat{c} + y_{g+1}. \tag{7.1}$$

The number of scores g should hence, in principle, be chosen such that $X_{g+1}$ contains no further information about $y_{g+1}$, or in other words, such that $X_{g+1}$ and $y_{g+1}$ are approximately uncorrelated with each other. In the extreme case where $X_j^\top y_j$ becomes zero, the algorithm is stopped prematurely. In summary, further scores should be extracted only as long as each new variable contributes significantly to the description of y. Criteria for deciding when this is the case will be discussed later, in Module 13.
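The relations (7.1) hold by construction; the stopping criterion can be illustrated on the running example by taking the maximal number of scores g = k = 6, after which the residual cross-covariance $X_{g+1}^\top y_{g+1}$ is zero to machine precision and no further scores can be extracted.

```python
W6, P6, T6, c6 = pls1(X, y, g=6)     # all k = 6 latent variables
X_res = X - T6 @ P6.T                # X_{g+1} from (7.1)
y_res = y - T6 @ c6                  # y_{g+1} from (7.1)
print(np.linalg.norm(X_res.T @ y_res))   # ~0: nothing left to extract
```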
7.2 Prediction for PLS1
Prediction for the PLS1 method is slightly more complicated than for PCR, in spite of the algorithm being simpler. Consider a new prediction sample z ($1 \times k$ vector) and predicted value $\overrightarrow{y}$ (both uncentered). Note the new notation for the predicted value. Let $\bar{x}$ ($k \times 1$) and $\bar{y}$ be the calibration sample averages. The prediction is performed by essentially retracing the steps of the algorithm, letting the row vector $z - \bar{x}$ follow the same steps as a row of the X matrix.
Let W, T, P and $\hat{c}$ be the matrices and vector formed after applying the PLS1 algorithm to the calibration data. Initialize by taking $j = 1$ and $x_j = z - \bar{x}$. Then proceed through the following steps:

1. Let $t_j = x_j w_j$.
2. Let $x_{j+1} = x_j - t_j p_j^\top$.
3. Let $j = j + 1$, and repeat Steps 1 to 3 until $j = g$.

Now form the row vector $\hat{t} = (t_1, \ldots, t_g)$, and complete the prediction as follows (a code sketch is given below):

$$\overrightarrow{y} = \bar{y} + \hat{t} \hat{c}.$$
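A sketch of the prediction recursion, under the same naming assumptions as before; `pls1_predict` is an illustrative name, z an uncentered 1 × k sample, and `x_bar`, `y_bar` the calibration averages.

```python
def pls1_predict(z, x_bar, y_bar, W, P, c):
    """Retrace the PLS1 steps for one prediction sample z (1 x k, uncentered)."""
    g = W.shape[1]
    xj = z - x_bar                   # center the sample
    t = np.zeros(g)
    for j in range(g):
        t[j] = xj @ W[:, j]          # Step 1: score for latent variable j
        xj = xj - t[j] * P[:, j]     # Step 2: deflate the sample
    return y_bar + t @ c             # y-arrow = y_bar + t-hat c

# Sanity check on the running example (data were centered, so the means are zero):
z = X[0]
print(np.isclose(pls1_predict(z, np.zeros(6), 0.0, W, P, c), (T @ c)[0]))  # True
```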
It is possible, though, to summarize the prediction in a matrix formula (Bro, 1996, pp. 62–63), as follows:

$$\overrightarrow{y} = \bar{y} + (z - \bar{x})^\top \hat{b},$$

where $\hat{b}$, the so-called regression vector, is

$$\hat{b} = W (P^\top W)^{-1} \hat{c}.$$

There is, however, a slight disadvantage to this method, because useful information contained in the individual latent variables $t_1, \ldots, t_g$ is not available here. Like the regression matrix in the previous methods, the regression vector $\hat{b}$ contains useful information about which areas (frequencies) contribute to the prediction.
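In code (same assumed setup), the regression vector collapses the recursion into a single inner product, and it agrees with the step-wise prediction:

```python
b_hat = W @ np.linalg.solve(P.T @ W, c)      # regression vector
print(np.isclose(z @ b_hat,
                 pls1_predict(z, np.zeros(6), 0.0, W, P, c)))   # True
```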