Chapter 11

Application of Discriminant Analysis and Cross-Validation on Proteomics Data

Julia Kuligowski, David Pérez-Guaita, and Guillermo Quintás

Abstract

High-throughput proteomic experiments have raised the importance and complexity of the bioinformatic analysis required to extract useful information from raw data. Discriminant analysis is frequently used to identify differences among test groups of individuals or to describe combinations of discriminant variables. However, even in relatively large studies, the number of detected variables typically far exceeds the number of samples, and classifiers should therefore be thoroughly validated to assess their performance on new samples. Cross-validation is a widely used approach when an external validation set is not available. In this chapter, different approaches for cross-validation are presented, including relevant aspects that should be taken into account to avoid overly optimistic results and the assessment of the statistical significance of cross-validated figures of merit.

Key words: Proteomics, Cross-validation, Double cross-validation, Discriminant analysis, Partial least squares-discriminant analysis

1 Introduction

In recent years, proteomics has become one of the most widely used research tools in high-throughput biology. Proteomic analysis plays a key role not only in basic, but also in biomedical research in fields such as drug discovery or clinical diagnosis. The availability of large-scale “omics” data from genomics, transcriptomics, proteomics, or metabolomics increases the importance and complexity of the bioinformatic and statistical analysis needed to gain insight into huge amounts of raw data and extract useful information. Proteomic data are frequently analyzed using discriminant analysis (DA) to assess differences among groups of individuals and to identify the combination of variables for which the groups are most distinct. For example, in a typical clinical proteomics study, the “case” group includes subjects diagnosed with a disease while a second group includes subjects classified as “healthy” or “control.” In this type of study, the objective is frequently the identification and interpretation of proteomic disease biomarkers to gain further insight into biological processes related to the disease. A second objective could be to pinpoint biomarkers that enable the construction of accurate classifiers.

One of the most relevant challenges in the analysis of high-throughput proteomic data is their high dimensionality. Even in relatively large studies with a few hundred biological samples, the number of detected variables in most cases largely exceeds the number of samples. Moreover, the majority of the detected proteomic variables are irrelevant for outcome prediction, and their elimination improves classifier performance. In addition, variables may be correlated, so multivariate statistical analysis is generally required to extract the information contained in the dataset. In proteomic studies, overfitting is a potential pitfall in which the classifier models random variation in the data. Because of that, DA models should be subjected to thorough statistical validation to estimate the generalization accuracy of the classifier and to ensure that the model will work for new samples.
This validation evaluates the performance of the classifier and yields an accurate estimate of the prediction error and of the probability of a chance result. Statistical validation can be carried out by external validation or by internal cross-validation. External validation uses test samples not included in the calibration set used to build the classifier and is considered the “gold standard.” However, when the sample size is small, using enough samples to develop a reliable classifier may leave too few samples for testing its performance, and vice versa. In this situation, cross-validation is used for testing as a suboptimal approximation to external validation that, in spite of its limitations, is one of the most practical methods during model development to provide a point estimate of the performance of a classifier [1].

In this chapter we explain basic procedures used to assess the generalization accuracy of discriminant classifiers using cross-validation. In Subheading 2, partial least squares-DA (PLS-DA) is outlined. Subheading 3 presents cross-validation and double cross-validation, including relevant aspects that should be taken into account for the selection of the cross-validation strategy, as well as permutation testing for the assessment of the statistical significance of cross-validated figures of merit.

2 Partial Least Squares-Discriminant Analysis

PLS-DA is currently one of the most popular classification methods in multivariate analysis. It is the discriminant version of PLS regression, which is extensively used in chemometrics. It aims to model the linear relationship between the matrix X (N × J), where N and J are the number of samples and predictive variables, respectively, and the corresponding vector of responses y (N × 1) [2]. In short, PLS extracts a series of latent variables explaining the variation in X that is correlated with y. In a PLS-DA model, the relation between the predictors X (N × J) and the response y (N × 1) can be described as

y = Xb^T + e

where b (1 × J) is the vector of regression coefficients and e (N × 1) is the error vector (i.e., the residuals). For the application of PLS to DA in proteomics, the y vector consists of dummy variables (e.g., +1, −1) indicating the class of each sample (e.g., control vs. disease or treatment vs. placebo).

Parameters to be selected during the development of a PLS-DA model include the type of data pretreatment and the model complexity (i.e., the number of latent variables). The number of latent variables determines the complexity of the model, and thus selecting too high or too low a number of latent variables leads to overfitting or underfitting, respectively. The analysis of cross-validated figures of merit of PLS-DA classifiers obtained using different numbers of latent variables can be used to select the optimum value. Hence, cross-validation is not only used for estimating the generalization accuracy of the classifier, but also during method development.
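As an illustration of this setup, the following minimal sketch fits a PLS-DA model in Python using scikit-learn's PLSRegression on simulated data; the dummy coding (+1/−1), the classification threshold at y = 0, the choice of two latent variables, and the simulated data dimensions are assumptions made only for this example, not prescriptions from the text.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Simulated proteomic data: N samples, J variables (J >> N is typical)
rng = np.random.default_rng(0)
N, J = 40, 500
X = rng.normal(size=(N, J))
y = np.repeat([1, -1], N // 2)            # dummy-coded classes: +1 = disease, -1 = control
X[y == 1, :10] += 1.0                     # inject a small class difference into 10 variables

# PLS-DA: PLS regression on the dummy-coded response
pls = PLSRegression(n_components=2, scale=True)   # 2 latent variables; scale=True autoscales X and y
pls.fit(X, y)

y_hat = pls.predict(X).ravel()            # continuous predicted response
y_class = np.where(y_hat > 0, 1, -1)      # classify by the sign of y_hat (threshold y = 0)

print("Number of latent variables:", pls.n_components)
print("Training misclassification rate:", np.mean(y_class != y))
```

Autoscaling (scale=True) is one common pretreatment choice; in practice the number of latent variables would be selected by cross-validation, as described below.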
2.1 Figures of Merit

Figures of merit are used to describe the performance of the classifier during method development and validation. The outcome of a binary classifier may be expressed as, e.g., y > 0 or y < 0 to differentiate disease from healthy samples. The prediction error is an estimate of how well the classifier will predict the outcome (i.e., the class in a discriminant analysis) of future, unknown observations drawn from the same population. It can be defined as the probability of incorrect classification using a DA model, or, in a regression model, as the expected difference between the theoretical and the predicted responses. In addition, a number of measures are available to assess the performance of a classifier. After selecting a threshold value to classify samples (e.g., if y_threshold = 0, samples with y > 0 are classified as “disease”), the proportion of misclassified samples can be used when the classes are of similar size. The numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) can be used to build a more informative confusion or contingency table summarizing the combinations of predicted and actual classes. Using these values, sensitivity, defined as the ratio TP/(TP + FN), and specificity, defined as TN/(FP + TN), can be calculated. Sensitivity and specificity values can be used to build a receiver operating characteristic (ROC) curve by plotting the sensitivity versus (1 − specificity) of the classifier for different threshold values. The area under the ROC curve (AUROC) is a frequently employed figure of merit. Alternatively, the Q2 or discriminant-Q2 statistics can also be used as figures of merit of classification models. Q2 quantifies the closeness of the predicted y values to the theoretical y values. It is calculated as one minus the ratio of the prediction error sum of squares (PRESS) to the total sum of squares (TSS). The discriminant-Q2 uses the PRESSD (discriminant PRESS), a PRESS that does not take into account accurately classified samples [3]. The use of the discriminant-Q2 is preferable as it provides a more reliable diagnosis of the generalization accuracy of the classifier [4].

3 Validation Methods

Validation of a classifier assesses its generalization accuracy. To ensure an unbiased estimate of the classification error, the most rigorous approach to testing the predictive performance of a classifier consists in computing model predictions for an independent set of samples (i.e., the external validation set). This set of observations should not be used during model development, so, in practice, one of the first steps is to split the initial sample set into a calibration and a validation subset. This split should be carefully selected, considering that the samples used for both calibration and validation should be representative of the whole population. The use of an external validation set is illustrated in Fig. 1.

Fig. 1 Scheme of the estimation of PLS-DA model performance using a calibration and an independent test set (i.e., external validation). The initial data set is split into a calibration and a test set. A model is built using the calibration data set. Then, class prediction of the samples included in the test set is used to estimate the generalization accuracy of the classifier
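To make the external validation scheme of Fig. 1 and the figures of merit of Subheading 2.1 concrete, the sketch below splits simulated data into calibration and test sets and computes sensitivity, specificity, AUROC, and a discriminant-Q2 on the held-out predictions. The one-third test fraction, the simulated data, and the particular DQ2 formulation used here (residuals of samples predicted beyond their class label are set to zero) are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

# Simulated proteomic data (N samples, J variables)
rng = np.random.default_rng(1)
N, J = 60, 400
X = rng.normal(size=(N, J))
y = np.repeat([1, -1], N // 2)               # +1 = disease, -1 = control
X[y == 1, :10] += 1.0                        # small class difference in 10 variables

# External validation: hold out one third of the samples as a test set
X_cal, X_test, y_cal, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

pls = PLSRegression(n_components=2).fit(X_cal, y_cal)
y_score = pls.predict(X_test).ravel()        # continuous predicted response
y_pred = np.where(y_score > 0, 1, -1)        # classify with threshold y = 0

# Confusion-table based figures of merit
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[-1, 1]).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
auroc = roc_auc_score(y_test, y_score)

# Discriminant-Q2: residuals of samples predicted beyond their class label are not penalized
res = y_test - y_score
res[(y_test == 1) & (y_score > 1)] = 0
res[(y_test == -1) & (y_score < -1)] = 0
dq2 = 1 - np.sum(res**2) / np.sum((y_test - y_test.mean())**2)

print(f"Sensitivity={sensitivity:.2f}  Specificity={specificity:.2f}  "
      f"AUROC={auroc:.2f}  DQ2={dq2:.2f}")
```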
3.1 Cross-Validation

As mentioned before, cross-validation is one of the most practical methods to estimate predictive model performance when an external validation set is not available or simply not feasible. In addition, it is widely used during the development of the classifier, for example, for the selection of the number of latent variables in PLS-DA model development. When other methods such as support vector machines (SVM-DA) are used, cross-validation is likewise employed during model development to select, e.g., the regularization parameter C and kernel parameters such as the γ parameter of a standard radial Gaussian kernel.

During cross-validation the sample set is split, usually at random, into a number of folds or subsets (k-fold CV). The class of the samples included in one subset k is predicted using a classifier built on the samples belonging to the other (k − 1) subsets. This step is repeated until each subset has been predicted, and the model performance is then estimated using the resulting set of test predictions (see Fig. 2). There is no straightforward way to establish the fraction of samples that should be used for model development and testing, and the optimal value strongly depends on the case under study (e.g., number of samples, sample heterogeneity) (see Note 1). When the number of splits k is equal to the number of samples, the procedure is called leave-one-out cross-validation. A rule of thumb recommended in machine learning is to use 2/3 of the samples for calibration and 1/3 for testing. The number of subsets k also depends on the size of the calibration set and hence on the number of samples. For example, whereas for large data sets a threefold cross-validation (k = 3) is usually appropriate, for small data sets leave-one-out cross-validation may have to be selected.

Fig. 2 Scheme of PLS-DA model validation using k-fold cross-validation for k = 3. The full data set is randomly split into k subsets. Then each of these subsets is used to estimate the accuracy of the classifier built from the remaining k − 1 subsets. The figure describes the prediction of one of the test sets (test set #1) resulting from a threefold CV. Likewise, subsets #1 and #2, and #1 and #3, are used for model calculation, with subsets #3 and #2, respectively, used for prediction

The division into subsets can be performed randomly or using sampling methodologies such as Kennard-Stone, which aims to improve the representativeness of the sample space in both the calibration and the test subsets [5]. An important point to keep in mind is that the number of independent samples is not necessarily equal to the number of objects if the study includes sample replicates or repeated measurements (see Note 2). To avoid overly optimistic results, replicates of the same sample must be kept together in the same subset (calibration or test) during cross-validation. On the other hand, it is usually useful to develop a PCA or PLS-DA model including all sample replicates: a scores plot from the obtained model provides an overview of the variation within replicates of the same sample and its relation to the overall variance [6].
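The sketch below illustrates such a k-fold cross-validation with scikit-learn, using GroupKFold so that replicates of the same biological sample are always kept together in the same subset while several numbers of latent variables are compared; the duplicate structure, the fold number, and the simulated data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GroupKFold

# Simulated data: 30 biological samples measured in duplicate (60 objects)
rng = np.random.default_rng(2)
n_samples, J = 30, 300
base = rng.normal(size=(n_samples, J))
X = np.repeat(base, 2, axis=0) + 0.1 * rng.normal(size=(2 * n_samples, J))
y = np.repeat(np.repeat([1, -1], n_samples // 2), 2)   # class of each object
X[y == 1, :10] += 1.0
groups = np.repeat(np.arange(n_samples), 2)            # replicate structure

# k-fold CV (k = 3) keeping replicates of the same sample in the same subset
cv = GroupKFold(n_splits=3)
for n_lv in range(1, 6):                               # candidate numbers of latent variables
    errors = []
    for cal_idx, test_idx in cv.split(X, y, groups):
        pls = PLSRegression(n_components=n_lv).fit(X[cal_idx], y[cal_idx])
        y_pred = np.where(pls.predict(X[test_idx]).ravel() > 0, 1, -1)
        errors.append(np.mean(y_pred != y[test_idx]))
    print(f"{n_lv} latent variables: CV misclassification rate = {np.mean(errors):.3f}")
```

Using a plain, ungrouped k-fold split here would allow replicates of one sample to fall into both the calibration and the test subset, which is the replicate trap discussed in Note 2.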
Cross-validation is straightforward but has several disadvantages. It can lead to overoptimistic results, especially when the sample-to-variable ratio is low [7], and the samples used in the calibration or test sets may be too few to be representative of the population, leading to an inaccurate estimate of the generalization accuracy of the classifier. Moreover, the figures of merit may vary depending on the subsets chosen during cross-validation. Monte Carlo cross-validation (MCCV) uses a repeated random selection of the CV subsets to circumvent this drawback. The first step in an MCCV is a random selection of the calibration and test sets, without replacement. Then, the classifier is built using the calibration set and the error is calculated for the samples retained in the test set. These two steps are repeated n times and an average error is finally calculated from the distribution of the n figures of merit.

3.2 Double Cross-Validation

Cross-validation provides internal figures of merit because the test samples are used to develop the model (e.g., to select the number of latent variables of the PLS-DA model) [7, 8]. Double cross-validation (2CV), or cross model validation, is a cross-validation strategy that overcomes this potential pitfall by providing unbiased, external figures of merit [7–10]. In double cross-validation, a subset of objects is set aside as a test set. The remaining objects are again split into training and test sets (i.e., internal calibration and test sets) in a k-fold cross-validation procedure for the selection of the number of latent variables of the inner classifier used to predict the samples included in the (outer) test set (see Fig. 3). In double cross-validation, the samples used for prediction are not used for building the classifier (i.e., scaling, selection of the number of LVs), which improves the generalizability of the accuracy estimates.

Fig. 3 Scheme of PLS-DA model validation using double cross-validation (for k = 3). The full data set is randomly split into k subsets. Then each of these subsets is used to estimate the accuracy of the classifier built from the remaining k − 1 subsets. The selection of the number of latent variables of each PLS-DA sub-model is based on CV figures of merit calculated using each training set. The figure shows the prediction of one of the test sets (test set #1)

3.3 Assessment of CV Figures of Merit Using a Permutation Test

Some regression and discriminant analysis methods, such as PLS-DA or SVM, are very potent tools for finding correlations when the number of variables exceeds the number of samples. If method development is not carried out carefully, overfitted models lacking predictive capability for future samples may be obtained. Both external validation and cross-validation help to identify model overfitting. However, a significant limitation of cross-validation procedures is that they do not assess the statistical significance of the figures of merit estimating the predictive power of the classifier. For example, if the performance of a classifier is assessed by the AUROC, results will range between 0 and 1 and the higher the value, the better the model. However, it is difficult to define a threshold value that corresponds to a “good” classifier. To obtain an estimate of the statistical significance of the classifier, one may perform a permutation test. This type of hypothesis testing is nonparametric and does not make assumptions about the distribution of the data. In this test, cross-validated figures of merit calculated using the original class labels (i.e., the y vector) are compared to a distribution of the same estimators obtained after random rearrangements of the class labels. In practice, the number of all possible permutations is too large, but the permutation distribution can be approximated by using a sufficient number of random permutations. The statistical significance of the figures of merit, expressed as a p-value, is then empirically calculated as the fraction of permutation values that are at least as extreme as the statistic obtained from the non-permuted data [11].
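A minimal sketch of such a permutation test, here wrapped around a double cross-validation in which the inner loop selects the number of latent variables, is shown below; the nested GridSearchCV/permutation_test_score construction, the AUROC scorer, the fold numbers, the number of permutations, and the simulated data are illustrative assumptions rather than part of the original protocol.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, KFold, permutation_test_score
from sklearn.metrics import roc_auc_score

# Simulated proteomic data
rng = np.random.default_rng(3)
N, J = 40, 300
X = rng.normal(size=(N, J))
y = np.repeat([1, -1], N // 2)
X[y == 1, :10] += 0.8

def auroc(estimator, X_, y_):
    # Score the continuous PLS predictions against the dummy-coded classes
    return roc_auc_score(y_, estimator.predict(X_).ravel())

# Inner loop: select the number of latent variables by k-fold CV
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
model = GridSearchCV(PLSRegression(),
                     param_grid={"n_components": [1, 2, 3, 4]},
                     scoring=auroc, cv=inner_cv)

# Outer loop: estimate the generalization AUROC and its permutation p-value
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)
score, perm_scores, p_value = permutation_test_score(
    model, X, y, scoring=auroc, cv=outer_cv, n_permutations=100, random_state=2)

print(f"Cross-validated AUROC = {score:.2f}, permutation p-value = {p_value:.3f}")
```

With permuted labels the cross-validated AUROC is expected to scatter around 0.5, and the reported p-value is the fraction of permutations that score at least as well as the original class labels.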
The results from such a test indicate to what extent the classifier is finding chance correlations between the proteomic data and the classes; the test is therefore capable of identifying overfitting. Low p-values indicate that the original class label configuration is relevant with respect to the data and support the significance of the model. A typical plot showing the outcome of a permutation test for the assessment of calibration and cross-validation figures of merit is shown in Fig. 4. The use of permutation tests in combination with 2CV has repeatedly been shown to be a suitable approach for assessing the statistical significance of figures of merit [3, 8, 12].

Fig. 4 Typical results from a permutation test to assess the statistical significance of a PLS-DA model. The plot displays the fractional y-variance captured for self-prediction (calibration) and cross-validation versus the correlation of the permuted y to the original y vector, and shows the corresponding regression lines

3.4 Variable Selection

The identification and elimination of variables irrelevant for classification is usually included in the data analysis workflow. Variable selection improves the predictive capabilities of multivariate models and facilitates a biological interpretation of the models that might be relevant for further research based on the results (e.g., development of targeted methods). However, when the training data are used to perform variable selection, figures of merit calculated by cross-validation will be overly optimistic and will not generalize. To avoid this, when no external validation set is available, the cross-validation procedure may include the variable selection step [13]. However, this approach is computationally intensive, and so the use of an external validation set is usually preferable. In any case, the data analysis strategy, including the CV method employed, should be reported (see Note 3).

4 Notes

1. Selecting the type of cross-validation. A number of different cross-validation strategies are available, such as leave-one-out, k-fold, venetian blinds, h-block, and random subsets. Whereas each type of scheme has its advantages and drawbacks, the decision of which is the most effective for the data at hand and the employed classifier is very much data and problem dependent. It is often recommended to take into account some attributes of the data, including the underlying distribution of the data (for example, if the samples in the data set are sorted by, e.g., class, collection time, or instrument, the distribution may influence CV figures of merit); the total number of samples and the distribution of samples within each class; and the presence of replicate measurements (for example, protein profiles of samples collected from the same individual, or instrumental replicates). Double cross-validation protects against over-optimistic figures of merit and yields a more realistic estimate of the generalization performance of the classifier.
k-fold and leave-one-out CV. One of the most commonly used CV approaches is k-fold CV, in which N/k samples are removed and the classifier is built on the remaining samples. Then, the classes of the left-out samples are predicted using the classifier, and the process is repeated until every sample has been predicted once. Finally, the predictive capability of the classifier is estimated using, e.g., the number of misclassified samples, the AUROC, or other figures of merit. In leave-one-out CV, k equals the number of samples in the data set.

Random subset selection. The data set is randomly split into k subsets and the same procedure as in k-fold cross-validation is applied. This type of cross-validation offers the advantage that the procedure can be repeated with different random splits (see the sketch after these notes). Such iterations are very useful for a reliable model assessment, as they help to reduce and evaluate the variance due to instability of the training models and to obtain an interval for the estimate.

Venetian blinds (interleaved) or striped splitting selects interleaved samples or segments of consecutive samples as training and test subsets. It is a straightforward way of stratifying the data that is useful when the samples are randomly ordered in the data set, so that test and training samples span the same data space as far as possible.

2. Avoid traps! Regardless of whether one uses LOO-CV, random subset selection, or any other cross-validation approach, two well-known traps must be avoided: the ill-conditioning trap and the replicate trap [14]. Ill-conditioning is present when the test set of samples is not representative of the training samples and leads to overly pessimistic figures of merit for the classifier. The replicate trap occurs when replicates of the same sample are included in both the training and the test set; in this case cross-validation errors are biased downward. Cross-validated errors will also be biased downward if an initial selection of the most discriminant variables is performed using the entire data set. Unsupervised screening procedures that do not use the class labels, such as removing variables with near-zero variance, may however be applied prior to cross-validation [15].

3. Reporting results. When reporting the performance of a classifier estimated by cross-validated figures of merit, the cross-validation parameters, such as the type of approach (k-fold, LOO, venetian blinds, etc.), the sample size, the number of iterations, and the data scaling, should be specified. In addition, data treatment in, e.g., Matlab or R allows the creation of scripts integrating the whole data preprocessing and analysis procedure. When reporting the results of an analysis, making the dataset available and providing a detailed description of the code and/or software used improves the reproducibility of the results.
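As referenced in Note 1, the sketch below illustrates repeated random subset selection (Monte Carlo CV) with scikit-learn's StratifiedShuffleSplit, collecting a distribution of cross-validated AUROC values over many random calibration/test splits; the split fraction, the number of iterations, and the simulated data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import roc_auc_score

# Simulated proteomic data
rng = np.random.default_rng(4)
N, J = 60, 300
X = rng.normal(size=(N, J))
y = np.repeat([1, -1], N // 2)
X[y == 1, :10] += 1.0

# Repeated random subset selection: 100 random calibration/test splits (2/3 vs. 1/3),
# stratified so that each split keeps the class proportions
mccv = StratifiedShuffleSplit(n_splits=100, test_size=1/3, random_state=0)
aurocs = []
for cal_idx, test_idx in mccv.split(X, y):
    pls = PLSRegression(n_components=2).fit(X[cal_idx], y[cal_idx])
    aurocs.append(roc_auc_score(y[test_idx], pls.predict(X[test_idx]).ravel()))

aurocs = np.array(aurocs)
# The spread of the distribution reflects the variability due to the chosen splits
print(f"Mean AUROC = {aurocs.mean():.2f}  (2.5th-97.5th percentile: "
      f"{np.percentile(aurocs, 2.5):.2f}-{np.percentile(aurocs, 97.5):.2f})")
```

The spread of the resulting distribution gives an interval for the performance estimate, as discussed for random subset selection in Note 1.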
References

1. Esbensen KH, Geladi P (2010) Principles of proper validation: use and abuse of re-sampling for validation. J Chemometr 24:168–187
2. Wold S, Sjöström M, Eriksson L (2001) PLS-regression: a basic tool of chemometrics. Chemometr Intell Lab Syst 58:109–130
3. Westerhuis JA, van Velzen EJJ, Hoefsloot HCJ et al (2008) Discriminant Q2 (DQ2) for improved discrimination in PLSDA models. Metabolomics 4:293–296
4. Szymańska E, Saccenti E, Smilde AK et al (2012) Double-check: validation of diagnostic statistics for PLS-DA models in metabolomics studies. Metabolomics 8:3–16
5. Kennard RW, Stone LA (1969) Computer aided design of experiments. Technometrics 11:137–148
6. Esbensen KH, Guyot D, Westad F et al (2004) Multivariate data analysis – in practice. An introduction to multivariate data analysis and experimental design, 5th edn. CAMO Process AS, Oslo
7. Rubingh CM, Bijlsma S, Derks EPPA et al (2006) Assessing the performance of statistical validation tools for megavariate metabolomics data. Metabolomics 2:53–61
8. Westerhuis JA, Hoefsloot HCJ, Smit S et al (2008) Assessment of PLSDA cross validation. Metabolomics 4:81–89
9. Filzmoser P, Liebmann B, Varmuza K (2009) Repeated double cross validation. J Chemometr 23:160–171
10. Gidskehaug L, Anderssen E, Alsberg BK (2008) Cross model validation and optimisation of bilinear regression models. Chemometr Intell Lab Syst 93:1–10
11. Knijnenburg TA, Wessels LFA, Reinders MJT et al (2009) Fewer permutations, more accurate p-values. Bioinformatics 25:161–168
12. Wongravee K, Lloyd GR, Hall J et al (2009) Monte-Carlo methods for determining optimal number of significant variables. Application to mouse urinary profiles. Metabolomics 5:387–406
13. Kuligowski J, Perez-Guaita D, Escobar J et al (2013) Evaluation of the effect of chance correlations on variable selection using partial least squares-discriminant analysis. Talanta 116:835–840
14. Bakeev K (ed) (2010) Process analytical technology: spectroscopic tools and implementation strategies for the chemical and pharmaceutical industries, 2nd edn. Wiley, New York
15. Krstajic D, Buturovic LL, Leahy DE et al (2014) Cross-validation pitfalls when selecting and assessing regression and classification models. J Cheminform 6:10