Chapter 11
Application of Discriminant Analysis
and Cross-Validation on Proteomics Data
Julia Kuligowski, David Pérez-Guaita, and Guillermo Quintás
Abstract
High-throughput proteomic experiments have raised the importance and complexity of bioinformatic analysis to extract useful information from raw data. Discriminant analysis is frequently used to identify differences among test groups of individuals or to describe combinations of discriminant variables. However, even in relatively large studies, the number of detected variables typically largely exceeds the number of samples, and classifiers should be thoroughly validated to assess their performance for new samples. Cross-validation is a widely used approach when an external validation set is not available. In this chapter, different approaches for cross-validation are presented, including relevant aspects that should be taken into account to avoid overly optimistic results, as well as the assessment of the statistical significance of cross-validated figures of merit.
Key words Proteomics, Cross-validation, Double cross-validation, Discriminant analysis, Partial least
squares-discriminant analysis
1 Introduction
In recent years, proteomics has become one of the most widely
used research tools in high-throughput biology. Proteomic analysis
plays a key role not only in basic, but also in biomedical research in
fields such as drug discovery or clinical diagnosis. Availability of
large scale “omics” data from genomics, transcriptomics, proteomics, or metabolomics increases the importance and complexity
of bioinformatic and statistical analysis to get insight into huge
amounts of raw data and extract useful information. Proteomic
data is frequently analyzed using discriminant analysis (DA) to assess differences among groups of individuals and to identify the combinations of variables for which these groups are most distinct. For example, in a typical clinical proteomics study, the “case” group includes
subjects diagnosed with a disease while a second group includes
subjects classified as “healthy” or “control.” In this type of study,
the objective is frequently the identification and interpretation of
proteomic disease biomarkers to get further insight into biological
processes related to the disease. A second objective could be to
pinpoint biomarkers to enable the construction of accurate classifiers. One of the most relevant challenges for the analysis of high-throughput proteomic data is their high dimensionality. Even in relatively large studies with a few hundred biological samples, the number of detected variables in most cases largely exceeds the number of samples. Moreover, the majority of the detected proteomic variables are irrelevant for outcome prediction, and their elimination improves classifier performance. In addition, variables may be correlated, so multivariate statistical analysis is generally required to extract the information contained in the dataset.
In proteomic studies, overfitting is a potential pitfall where the
classifier models random variation in the data. Because of that, DA
models should be subjected to thorough statistical validation to
estimate the generalization accuracy of the classifier and to ensure
that the model will work for new samples. This validation evaluates
the performance of the classifier and yields an accurate estimation
of the prediction error and the probability of a chance result.
Statistical validation can be carried out by external validation or by internal cross-validation. External validation uses test samples not included in the calibration set used to build the classifier and is considered the “gold standard.” However, when the sample size is small, using enough samples to develop a reliable classifier may leave too few samples to test its performance, and vice versa. In this situation, cross-validation is used as a suboptimal approximation to external validation that, in spite of its limitations, is one of the most practical methods during model development to provide a point estimate of the performance of a classifier [1].
In this chapter we explain basic procedures used to assess the
generalization accuracy of discriminant classifiers using cross-validation. In Subheading 2, partial least squares-DA (PLS-DA) is outlined. Subheading 3 presents cross-validation and double cross-validation, including relevant aspects that should be taken into account for the selection of the cross-validation strategy, as well as permutation testing for the assessment of the statistical significance of cross-validated figures of merit.
2 Partial Least Squares-Discriminant Analysis
PLS-DA is currently one of the most popular classification methods in multivariate analysis. This methodology is the discriminant
version of PLS regression extensively used in chemometrics. It
aims to model the linear relationship between the matrix X (N × J),
where N and J are the number of samples and predictive variables,
respectively, and the corresponding vector of responses y (N × 1)
[2]. In short, PLS extracts a series of latent variables explaining the
variation in X correlated with y. In a PLS-DA model, the relation
between the predictors X (N × J) and the response y (N × 1) can be
described as
y = Xbᵀ + e
where b (1 × J) is the vector of regression coefficients, and e (N × 1)
is the error vector (i.e., residuals).
For the application of PLS to DA in proteomics, the y vector
consists of dummy variables (e.g., +1, −1) indicating the class of
each sample (e.g., control vs. disease or treatment vs. placebo).
Parameters to be selected during the development of a PLS-DA
model include the type of data pretreatment and the model complexity (i.e., the number of latent variables). The number of latent
variables determines the complexity of the model, and thus the
selection of a high or a low number of latent variables could lead to
overfitting or underfitting the data, respectively. The analysis of cross-validated figures of merit of PLS-DA classifiers obtained using different numbers of latent variables can be used to select the optimum
value. Hence, cross-validation is not only used for estimating the
generalization accuracy of the classifier, but also during method
development.
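As a brief illustration (not part of the original protocol), the sketch below shows how a PLS-DA model of this kind could be fitted in Python using scikit-learn's PLSRegression on a simulated data matrix; all variable names and parameter values (e.g., two latent variables, autoscaling) are arbitrary choices made only for the example.

```python
# Minimal PLS-DA sketch (illustrative, not the authors' code): a PLS regression
# model is fitted on a +1/-1 class-membership vector and predictions are
# thresholded at 0 to assign classes.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))         # 40 samples, 500 proteomic variables (simulated)
y = np.repeat([1, -1], 20)             # dummy class labels: +1 = disease, -1 = control
X[y == 1, :10] += 1.0                  # add a small class-related signal to 10 variables

X_scaled = StandardScaler().fit_transform(X)   # data pretreatment (autoscaling)

pls = PLSRegression(n_components=2)    # number of latent variables = model complexity
pls.fit(X_scaled, y)

y_pred = pls.predict(X_scaled).ravel() # continuous predicted responses
y_class = np.where(y_pred > 0, 1, -1)  # threshold at 0 to assign the predicted class
print("training misclassification rate:", np.mean(y_class != y))
```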
2.1 Figures of Merit
Figures of merit are used to describe the performance of the classifier during method development and validation. Outcomes of a
binary classifier may be, e.g., y > 0 or y < 0 to differentiate disease from healthy samples. The prediction error is an estimate of how well the
classifier will predict the outcome (i.e., the class in a discriminant
analysis) of future, unknown observations drawn from the same
population. It can be defined as the probability of incorrect classification using a DA model, or as the expected difference between
the theoretical and the predicted responses from the model in a
regression model. In addition, a number of measures are available to assess the performance of a classifier. After selecting a threshold value to classify samples (e.g., if y_threshold = 0, samples with y > 0 are classified as “disease”), the proportion of misclassified samples can be used when the classes are of similar size. The numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) can be used to build a more informative confusion or contingency table summarizing combinations of the predicted and actual classes. Using these values, sensitivity, defined as the ratio TP/(TP + FN), and specificity, defined as TN/(FP + TN), can be calculated. Both sensitivity and specificity can be used to build a receiver operating characteristic (ROC) curve by plotting the sensitivity versus (1 − specificity) of the classifier for different threshold values.
The area under the ROC curve (AUROC) is a frequently employed
figure of merit. Alternatively, the Q2 or discriminant-Q2 statistics
can also be used as figures of merit of classification models. Q2
quantifies the closeness of predicted y values to the theoretical y
values. It is calculated as one minus the ratio of the prediction
error sum of squares (PRESS) to the total sum of squares (TSS).
The discriminant-Q2 computes the PRESSD (PRESS discriminant), a PRESS that does not take into account accurately classified
samples [3]. The use of discriminant-Q2 is preferable as it provides
a more reliable diagnosis of the generalization accuracy of the classifier [4].
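The figures of merit described above could be computed from the continuous predictions of a PLS-DA model as sketched below; the example reuses the illustrative y and y_pred arrays from the previous sketch, and the discriminant_q2 function only follows the verbal description given here (residuals of samples predicted beyond their class label are not counted), so it should be checked against ref. 3 before use.

```python
# Sketch of figures of merit for a binary classifier with labels coded as +1/-1;
# y_true holds the actual classes and y_pred the continuous model outputs.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def figures_of_merit(y_true, y_pred, threshold=0.0):
    y_class = np.where(np.asarray(y_pred) > threshold, 1, -1)
    tn, fp, fn, tp = confusion_matrix(y_true, y_class, labels=[-1, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    auroc = roc_auc_score(y_true, y_pred)      # threshold-independent figure of merit
    return {"sensitivity": sensitivity, "specificity": specificity, "AUROC": auroc}

def discriminant_q2(y_true, y_pred):
    # Q2 = 1 - PRESS/TSS; for DQ2 the residual of a sample predicted beyond its
    # class label (e.g., y_pred > +1 for class +1) is set to zero (see ref. 3).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    beyond = ((y_true == 1) & (y_pred > 1)) | ((y_true == -1) & (y_pred < -1))
    residuals[beyond] = 0.0
    return 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(figures_of_merit(y, y_pred), "DQ2:", discriminant_q2(y, y_pred))
```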
3 Validation Methods
Validation of a classifier assesses its generalization accuracy. To
ensure an unbiased estimation of the classification error, the most rigorous approach for testing the predictive performance of a classifier consists of computing model predictions for an independent set of samples (i.e., the external validation set). This set of observations should not be used during model development and so, in practice, one of the first steps is to split the initial sample set into a calibration and a validation subset. This split should be carefully selected, considering that the samples used for both calibration and validation should be representative of the whole population. The
use of an external validation set is illustrated in Fig. 1.
Fig. 1 Scheme of the estimation of PLS-DA model performance using a calibration and an independent test set (i.e., external validation). The initial data set is split into a calibration and a test set. A model is built using the calibration data set. Then, class prediction of samples included in the test set is used to estimate the generalization accuracy of the classifier
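A minimal sketch of the external validation scheme of Fig. 1, reusing the illustrative X, y, and figures_of_merit names from the previous sketches, could look as follows; the split proportion and random seed are arbitrary choices.

```python
# Sketch of external validation: the test samples are never used to build the model.
# X, y, and figures_of_merit are reused from the previous sketches.
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cal, X_test, y_cal, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=1)

model = make_pipeline(StandardScaler(), PLSRegression(n_components=2))
model.fit(X_cal, y_cal)                         # built on the calibration set only
y_test_pred = model.predict(X_test).ravel()     # predictions for the external test set
print(figures_of_merit(y_test, y_test_pred))    # external estimate of performance
```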
3.1 Cross-Validation
As mentioned before, cross-validation is one of the most practical
methods to estimate the predictive model performance when an
external validation set is not available or simply not even possible.
Besides, it is also widely used during the development of the classifier, for example, for the selection of the number of latent variables
in PLS-DA model development. When using other methods such
as support vector machines (SVM-DA), cross-validation is also
employed during model development to select, e.g., the regularization parameter C and kernel parameters such as the γ parameter
for a standard radial Gaussian kernel.
During cross-validation the sample set is split, usually at random, into a number of folds or subsets (k-fold CV). The class of the samples included in one subset is predicted using a classifier built on the samples belonging to the other (k − 1) subsets. This step is repeated until each subset has been predicted, and the model performance is then estimated using the resulting set of test predictions (see Fig. 2). There is no straightforward way to establish the fraction of samples that should be used for model development and testing, and the optimal value strongly depends on the case under study (e.g., number of samples, sample heterogeneity) (see Note 1). When the number of splits k is equal to the number of samples, the procedure is called leave-one-out cross-validation. A rule of thumb recommended in machine learning is to use 2/3 of the samples for calibration and 1/3 for testing. The number of subsets k also depends on the size of the calibration set and hence on the number of samples. For example, whereas for large data sets a threefold cross-validation (k = 3) is usually appropriate, for small data sets leave-one-out cross-validation may have to be selected.
Fig. 2 Scheme of PLS-DA model validation using k-fold cross-validation for k = 3. The full data set is randomly split into k subsets. Then each of these subsets is used to estimate the accuracy of the classifier built from the remaining k − 1 subsets. The figure describes the prediction of one of the test sets (test set #1) resulting from a threefold CV. Likewise, model calculation is repeated employing subsets #1 and #2, and subsets #1 and #3, using subsets #3 and #2, respectively, for prediction
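A k-fold cross-validation of the illustrative PLS-DA classifier (here k = 3, as in Fig. 2) could be sketched as follows; cross_val_predict collects the test predictions of each fold so that cross-validated figures of merit can be computed from them, and all names are reused from the previous sketches.

```python
# Sketch of k-fold cross-validation (k = 3): each sample is predicted exactly once
# by a classifier built on the remaining k-1 subsets. X, y, model, and
# figures_of_merit are reused from the previous sketches.
from sklearn.model_selection import StratifiedKFold, cross_val_predict

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
y_cv = cross_val_predict(model, X, y, cv=cv).ravel()   # cross-validated predictions
print(figures_of_merit(y, y_cv))                       # CV figures of merit
```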
This division can be performed randomly or using sampling
methodologies such as Kennard-Stone, which aims to improve the
representativeness of the sample space in both calibration and test
subsets [5]. An important point to keep in mind is that the number
of independent samples is not necessarily equal to the number of
objects if the study includes sample replicates or repeated measurements (see Note 2). To avoid overly optimistic results, sample replicates must be kept together during cross-validation in the same
subset (calibration or test). On the other hand, it is usually useful
to develop a PCA or PLS-DA model including all sample replicates. A scores plot from the obtained model will provide an overview of the variation within replicates of the same sample, and its
relation to the overall variance [6].
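Keeping replicates of the same sample together in one cross-validation subset can be enforced with a group-aware splitter; the sketch below uses scikit-learn's GroupKFold together with a hypothetical sample_id vector that identifies which rows are replicates of the same sample, and reuses the illustrative names from the previous sketches.

```python
# Sketch: keeping replicates of the same sample in the same CV subset.
# sample_id is a hypothetical vector with one identifier per biological sample,
# repeated for each replicate (here every sample is "measured" twice).
# X, y, rng, model, and figures_of_merit are reused from the previous sketches.
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_predict

X_rep = np.repeat(X, 2, axis=0) + rng.normal(scale=0.1, size=(2 * len(y), X.shape[1]))
y_rep = np.repeat(y, 2)
sample_id = np.repeat(np.arange(len(y)), 2)      # replicate pairs share one identifier

cv = GroupKFold(n_splits=3)                      # splits by group, never across groups
y_cv = cross_val_predict(model, X_rep, y_rep, cv=cv, groups=sample_id).ravel()
print(figures_of_merit(y_rep, y_cv))
```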
Cross-validation is straightforward but has several disadvantages. It can lead to overoptimistic results, especially when the sample-to-variable ratio is low [7], and the samples used in the calibration or test sets may be too few to be representative of the population, leading to an inaccurate estimate of the generalization accuracy of the classifier. Moreover, the figures of merit may vary depending on the subsets chosen during cross-validation. Monte Carlo cross-validation (MCCV) uses a repeated random selection of the CV subsets to circumvent this drawback. The first step in MCCV is a random selection of the calibration and test sets, without replacement. Then, the classifier is built using the calibration set and the error is calculated for the samples retained in the test set. These two steps are repeated n times and an average error is finally calculated from the distribution of the n figures of merit.
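MCCV could be sketched with repeated random, stratified calibration/test splits, as below; the number of repetitions and the split proportion are arbitrary choices, and the illustrative model and data are reused from the previous sketches.

```python
# Sketch of Monte Carlo cross-validation (MCCV): n repeated random calibration/test
# splits without replacement; the n figures of merit are then summarized.
# X, y, and model are reused from the previous sketches.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

mccv = StratifiedShuffleSplit(n_splits=100, test_size=1 / 3, random_state=1)
aurocs = []
for cal_idx, test_idx in mccv.split(X, y):
    m = clone(model).fit(X[cal_idx], y[cal_idx])            # built on the calibration set
    y_test_pred = m.predict(X[test_idx]).ravel()
    aurocs.append(roc_auc_score(y[test_idx], y_test_pred))  # error on the retained test set

print("mean AUROC:", np.mean(aurocs), "+/-", np.std(aurocs))
```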
3.2 Double Cross-Validation
Cross-validation provides internal figures of merit because test samples are used to develop the model (e.g., to select the number of latent variables of the model in PLS-DA) [7, 8]. Double cross-validation (2CV), or cross model validation, is a cross-validation strategy that overcomes this potential pitfall by providing unbiased external figures of merit [7–10]. In double cross-validation, a subset of objects is set aside as a test set. The remaining objects are again split into training and test sets (i.e., internal calibration and test sets) in a k-fold cross-validation procedure for the selection of the number of latent variables of the inner classifier used to predict the samples included in the test set (see Fig. 3). In double cross-validation, samples used for prediction are not used for building the classifier (i.e., scaling, selection of the number of LVs), which improves the generalizability of the accuracy estimates.
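Double cross-validation can be approximated by nesting the selection of the number of latent variables (inner cross-validation) inside an outer cross-validation loop; the sketch below, which is only a simplified illustration of the scheme in Fig. 3 and not the authors' implementation, uses scikit-learn's GridSearchCV for the inner loop.

```python
# Sketch of double (nested) cross-validation: the number of latent variables is
# selected by an inner CV on each training set only; the outer CV test sets are
# used solely to estimate the generalization accuracy.
# X, y, and figures_of_merit are reused from the previous sketches.
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=3)

pls_pipe = Pipeline([("scale", StandardScaler()), ("pls", PLSRegression())])
inner_model = GridSearchCV(
    pls_pipe,
    param_grid={"pls__n_components": [1, 2, 3, 4, 5]},   # candidate numbers of LVs
    scoring="neg_mean_squared_error",
    cv=inner_cv)

y_2cv = cross_val_predict(inner_model, X, y, cv=outer_cv).ravel()
print(figures_of_merit(y, y_2cv))   # external (double cross-validated) figures of merit
```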
3.3 Assessment of CV Figures of Merit Using a Permutation Test
Some regression and discriminant analysis methods such as
PLS-DA or SVM are very potent tools for finding correlations when the number of variables exceeds the number of samples. If
method development is not carried out carefully, overfitted models
lacking predictive capabilities for future samples may be obtained.
Fig. 3 Scheme of PLS-DA model validation using double cross-validation (for k = 3). The full data set is randomly split into k subsets. Then each of these subsets is used to estimate the accuracy of the classifier built from the remaining k − 1 subsets. The selection of the number of latent variables of each PLS-DA sub-model is based on CV figures of merit calculated using each training set. The figure shows the prediction of one of the test sets (test set #1)
Both external validation and cross-validation help to identify model
overfitting. However, a significant limitation of cross-validation
procedures is that they do not assess the statistical significance of
the figures of merit estimating the predictive power of the classifier. For example, if the performance of a classifier is assessed by
the AUROC, results will range between 0 and 1 and the higher
the value, the better the model. However, it is difficult to define a
threshold value that corresponds to a “good” classifier.
To obtain an estimate of the statistical significance of the classifier, one may perform a permutation test. This type of hypothesis testing is nonparametric and does not rely on assumptions about the distribution of the data. In this test, cross-validated figures of merit calculated using the original class labels (i.e., the y vector) are compared to a distribution of the same estimators obtained after a random rearrangement of the class labels. In practice, the number of all possible permutations is too large to evaluate exhaustively, but the permutation distribution can be approximated by using a sufficiently large number of random permutations. The statistical significance of the figures of merit, expressed as a p-value, is then empirically calculated as the fraction of permutation values that are at least as extreme as the statistic obtained from the non-permuted data [11]. Results from the test indicate to what extent the classifier is finding chance correlations between the proteomic data and the classes, and the test is therefore capable of identifying overfitting of the data. Low p-values indicate that the original class label configuration is relevant with respect to the data and support the significance of the model. A typical plot showing the outcome of permutation testing for the assessment of CV and calibration performance is illustrated in Fig. 4. The use of permutation tests in combination with 2CV has been repeatedly shown to be a suitable approach to assess the statistical significance of figures of merit [3, 8, 12].
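A permutation test of cross-validated performance could be sketched with scikit-learn's permutation_test_score, which repeatedly refits the model on randomly permuted class labels and returns an empirical p-value; the number of permutations and the scoring metric below are illustrative choices, and the model and data are reused from the earlier sketches.

```python
# Sketch of a permutation test: the cross-validated score obtained with the true
# class labels is compared with the distribution of scores obtained after random
# permutations of the labels; the p-value is the fraction of permuted scores at
# least as good as the original one. X, y, and model are reused from earlier sketches.
from sklearn.model_selection import StratifiedKFold, permutation_test_score

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
score, perm_scores, p_value = permutation_test_score(
    model, X, y, cv=cv,
    scoring="neg_mean_squared_error",   # any CV figure of merit can be used here
    n_permutations=500, random_state=1)

print("observed score:", score, "empirical p-value:", p_value)
```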
Fig. 4 Typical results from a permutation test to assess the statistical significance of a PLS-DA model. The plot displays the fractional y-variance captured for self-prediction (calibration) and cross-validation (standardized SSQ Y, for a three-component model) versus the correlation of the permuted y to the original y vector, and shows the corresponding regression lines
3.4 Variable Selection
The identification and elimination of variables irrelevant for classification is usually included in the workflow of data analysis. Variable
selection improves the predictive capabilities of multivariate models and facilitates a biological interpretation of the models that
might be relevant for further research based on the results (e.g.,
development of targeted methods). However, when the training data is used to perform variable selection before cross-validation, the cross-validated figures of merit will be overoptimistic and non-generalizable. To avoid this, when no external validation set is available, the cross-validation procedure may include the variable selection step [13]. However, this approach is computationally intensive, and so the use of an external validation set is
usually preferable. In any case, the data analysis strategy including
the CV method employed should be reported (see Note 3).
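One way to include variable selection inside the cross-validation loop is to make the selection step part of the modeling pipeline, so that it is refitted on each calibration subset only; the sketch below uses a simple univariate filter (SelectKBest) merely as a stand-in for whatever selection method is actually applied, and reuses the illustrative names from the previous sketches.

```python
# Sketch: variable selection performed inside the cross-validation loop. Because
# the selector is part of the pipeline, it is refitted on each calibration subset
# and never sees the corresponding test samples.
# X, y, and figures_of_merit are reused from the previous sketches.
from sklearn.cross_decomposition import PLSRegression
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

selective_model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=50)),   # keep the 50 "best" variables
    ("pls", PLSRegression(n_components=2)),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
y_cv = cross_val_predict(selective_model, X, y, cv=cv).ravel()
print(figures_of_merit(y, y_cv))
```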
4 Notes
1. Selecting the type of cross-validation. There are a number of different cross-validation strategies available, such as leave-one-out, k-fold, venetian blinds, h-block, and random subsets.
Whereas each type of scheme has its advantages and drawbacks, the choice of the most effective scheme for the data at hand and the employed classifier is very much data and problem dependent. It is often recommended to take into account some attributes of the data, including the underlying distribution of the data (for example, if the samples in the data set are sorted by, e.g., class, collection time, or instrument, this may influence CV figures of merit); the total number of samples and the distribution of samples within each class; and the presence of replicate measurements (for example, protein profiles of samples collected from the same individual, or instrumental replicates). Double cross-validation
protects against over-optimistic figures of merit and yields a
more realistic estimate of the generalization performance of
the classifier.
k-fold and leave-one-out CV. One of the most commonly used CV approaches is k-fold CV, in which N/k samples are removed and the classifier is built on the remaining samples. Then, the classes of the left-out samples are predicted
using the classifier and the process is repeated until every sample is predicted once. Finally, the predictive capabilities of the
classifier are estimated using, e.g., the number of misclassified
samples, AUROC or other figures of merit. In leave-one-out
CV, k is the number of samples in the data set.
Random subset selection. The data set is randomly split into
k subsets and the same procedure as in k-fold cross-validation
is applied. This type of cross-validation offers the advantage that the procedure can be repeated. The use of iterations is very useful for
a reliable model assessment as it helps to reduce and evaluate
variance due to instability of the training models and to obtain
an interval of the estimate.
Venetian blinds, interleaved, or striped splitting selects segments of consecutive samples or interleaved segments as training and test subsets. It is a very straightforward way of splitting that is useful when samples are randomly distributed in the data set, so that test and training samples span the same data space as far as possible (a minimal sketch of an interleaved split is given after these notes).
2. Avoid traps! Regardless of whether one uses LOO-CV, random subset selection, or any other cross-validation approach, two
well-known traps must be avoided: the ill-conditioned and the
replicate traps [14]. Ill-conditioning is present when the test
set of samples is not representative of training samples and
leads to overly pessimistic figures of merit of the classifier. The
replicate trap occurs when replicates of the same sample are
included in both training and test sets. In this case cross-validation errors are biased downward. Also, cross-validated
errors will be biased downward if an initial selection of the
most discriminant variables is performed using the entire data set. However, unsupervised screening procedures, such as removing variables with near-zero variance, can be included prior to cross-validation [15].
3. Reporting results: To report an estimation of the efficiency of a classifier using cross-validated figures of merit, the cross-validation parameters, such as the type of approach (k-fold, LOO, venetian blinds, etc.), the sample size, the number of iterations, and the data scaling, should be specified. In addition, data
treatment in, e.g., Matlab or R allows the creation of code
integrating the whole data preprocessing and analysis procedure. When reporting the results of an analysis, the availability
of the dataset and a detailed description of the code and/or
software used improve the reproducibility of the results.
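As referenced in Note 1, a venetian blinds (interleaved) split can be generated directly from the sample indices; the sketch below builds such a split for k = 3, where fold i contains samples i, i + k, i + 2k, and so on (the fold count and sample number are illustrative).

```python
# Sketch of a venetian blinds (interleaved) split for k = 3: fold i contains
# samples i, i + k, i + 2k, ... of the data set.
import numpy as np

def venetian_blinds(n_samples, k):
    """Yield (train_idx, test_idx) pairs for an interleaved k-fold split."""
    indices = np.arange(n_samples)
    for i in range(k):
        test_idx = indices[i::k]                     # every k-th sample, offset i
        train_idx = np.setdiff1d(indices, test_idx)  # all remaining samples
        yield train_idx, test_idx

for train_idx, test_idx in venetian_blinds(12, 3):
    print("test samples:", test_idx)
```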
References
1. Esbensen KH, Geladi P (2010) Principles of proper validation: use and abuse of re-sampling for validation. J Chemometr 24:168–187
2. Wold S, Sjöström M, Eriksson L (2001) PLS-regression: a basic tool of chemometrics. Chemometr Intell Lab Syst 58:109–130
3. Westerhuis JA, van Velzen EJJ, Hoefsloot HCJ et al (2008) Discriminant Q2 (DQ2) for improved discrimination in PLSDA models. Metabolomics 4:293–296
4. Szymańska E, Saccenti E, Smilde AK et al (2012) Double-check: validation of diagnostic statistics for PLS-DA models in metabolomics studies. Metabolomics 8:3–16
5. Kennard RW, Stone LA (1969) Computer aided design of experiments. Technometrics 11:137–148
6. Esbensen KH, Guyot D, Westad F et al (2004) Multivariate data analysis—in practice. An introduction to multivariate data analysis and experimental design, 5th edn. CAMO Process AS, Oslo
7. Rubingh CM, Bijlsma S, Derks EPPA et al (2006) Assessing the performance of statistical validation tools for megavariate metabolomics data. Metabolomics 2:53–61
8. Westerhuis JA, Hoefsloot HCJ, Smit S et al (2008) Assessment of PLSDA cross validation. Metabolomics 4:81–89
9. Filzmoser P, Liebmann B, Varmuza K (2009) Repeated double cross validation. J Chemometr 23:160–171
10. Gidskehaug L, Anderssen E, Alsberg BK (2008) Cross model validation and optimisation of bilinear regression models. Chemometr Intell Lab Syst 93:1–10
11. Knijnenburg TA, Wessels LFA, Reinders MJT et al (2009) Fewer permutations, more accurate p-values. Bioinformatics 25:161–168
12. Wongravee K, Lloyd GR, Hall J et al (2009) Monte-Carlo methods for determining optimal number of significant variables. Application to mouse urinary profiles. Metabolomics 5:387–406
13. Kuligowski J, Perez-Guaita D, Escobar J et al (2013) Evaluation of the effect of chance correlations on variable selection using Partial Least Squares-Discriminant Analysis. Talanta 116:835–840
14. Bakeev K (ed) (2010) Process analytical technology: spectroscopic tools and implementation strategies for the chemical and pharmaceutical industries, 2nd edn. Wiley, New York
15. Krstajic D, Buturovic LL, Leahy DE et al (2014) Cross-validation pitfalls when selecting and assessing regression and classification models. J Cheminform 6:10