Machine Learning 1: Statistical Learning
Applied Statistics

Outline
■ Supervised learning
■ Loss functions
■ Metrics
■ Model validation
■ Bias-variance tradeoff
■ Feature engineering

Supervised learning

In a supervised learning problem, datasets come as a collection of pairs $(x_i, y_i)$ such that $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. The elements of $\mathcal{X}$ are called explanatory variables or independent variables, and the elements of $\mathcal{Y}$ are called target variables or dependent variables. We will assume that there exists a map $f \colon \mathcal{X} \to \mathcal{Y}$ such that

$$y = f(x) + \varepsilon,$$

where $\varepsilon$ is zero-mean Gaussian noise. The goal is then to find a map $\hat{f}$ that approximates $f$ as well as possible, that is, such that $\hat{f}(x) \approx f(x)$.

Usually we don't search for the best function in general, since that is an intractable problem. We usually fix a family of functions $\mathcal{F} = \{ f_\theta \mid \theta \in \Theta \}$, such that each function in the family has the same structure and the exact values computed by each function in $\mathcal{F}$ depend on a set of parameters $\theta$. Then, our goal is to find the best values of the parameters $\hat{\theta}$ such that $f_{\hat{\theta}} \approx f$, that is, to find a model $\hat{f} = f_{\hat{\theta}}$.

Loss functions

To measure how well (or how badly) a model is fitting a dataset, we can define a loss function, which computes a value comparing the predictions of the model with the expected output values. Note, however, that all datasets will contain noise, so we have to consider the target values of our dataset as following a random distribution. That is, $y \sim p(y \mid x)$.

Conditional probability

In the supervised learning setting, we can think of the model as modelling the conditional probability $p(y \mid x)$. Assuming, as before, that the output follows a parametric distribution $p(y \mid \eta)$ with parameters $\eta$, we then consider that the model is predicting the parameters of the distribution given some input:

$$\eta = f_\theta(x), \qquad p(y \mid x) = p(y \mid \eta = f_\theta(x)).$$

Maximum likelihood estimation

With this setting, we can measure the probability of an output using the probability distribution determined by the parameters predicted by the model: $p(y_i \mid f_\theta(x_i))$. We want this probability to be high, so we can optimize the parameters of the model to find the maximum of the joint probability for the examples in the dataset:

$$\hat{\theta} = \arg\max_\theta \prod_{i=1}^{N} p(y_i \mid f_\theta(x_i)).$$

This is the method of maximum likelihood estimation.

Negative log-likelihood

Since the product of several numbers less than 1 can get small very fast, we can transform the likelihood by applying the logarithm, without changing the optimal parameters:

$$\hat{\theta} = \arg\max_\theta \sum_{i=1}^{N} \log p(y_i \mid f_\theta(x_i)).$$

In optimization problems it is customary to minimize the objective function, so we can negate the log-likelihood to obtain the negative log-likelihood (NLL):

$$\mathrm{NLL}(\theta) = -\sum_{i=1}^{N} \log p(y_i \mid f_\theta(x_i)).$$

Mean squared error

Assume that a univariate output follows a Gaussian distribution with fixed variance, $y \sim \mathcal{N}(\mu, \sigma^2)$. We can use the model to predict the mean of the distribution, $\mu = f_\theta(x)$. The maximum likelihood estimator is then obtained by minimizing the mean squared error:

$$\mathrm{MSE}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f_\theta(x_i) \right)^2.$$

Log loss

Assume that a binary output follows a Bernoulli distribution, $y \sim \mathrm{Bernoulli}(p)$. We can use the model to predict the rate of the distribution, $p = f_\theta(x)$. The maximum likelihood estimator is then obtained by minimizing the log loss:

$$\mathrm{LogLoss}(\theta) = -\sum_{i=1}^{N} \left[ y_i \log f_\theta(x_i) + (1 - y_i) \log\left(1 - f_\theta(x_i)\right) \right].$$

Cross-entropy

Assume that a classification output follows a categorical distribution with $K$ classes. We can use the model to predict the rates of the distribution, $(p_1, \dots, p_K) = f_\theta(x)$. The maximum likelihood estimator is then obtained by minimizing the cross-entropy:

$$\mathrm{CE}(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_\theta(x_i)_k,$$

where $y_{ik} = 1$ if example $i$ belongs to class $k$, and $0$ otherwise.

Metrics

A metric is a function similar to a loss function, but more interpretable and closer to the business point of view. All loss functions could be used as metrics, but not all metrics are suitable to be used as loss functions. Depending on the problem, we have different metrics available. The choice of metric for each use case is usually based on the decision-making process that uses the predictions of the model.
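To make the three estimators above concrete, here is a minimal NumPy sketch of the corresponding losses (the function names and the clipping trick are our own illustration, not part of the slides):

```python
import numpy as np

def mse(y_true, y_pred):
    """Gaussian likelihood with fixed variance: the NLL reduces to the MSE."""
    return np.mean((y_true - y_pred) ** 2)

def log_loss(y_true, p_pred, eps=1e-12):
    """Bernoulli likelihood: the NLL is the (binary) log loss.
    Probabilities are clipped to avoid log(0)."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def cross_entropy(y_onehot, p_pred, eps=1e-12):
    """Categorical likelihood: the NLL is the cross-entropy.
    y_onehot has shape (N, K); each row of p_pred sums to 1."""
    p = np.clip(p_pred, eps, 1.0)
    return -np.sum(y_onehot * np.log(p))
```

Minimizing any of these losses over the model parameters is equivalent to maximizing the corresponding likelihood.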
Regression metrics

Some of the metrics used in a regression setting are the following:
■ Mean squared error: $\mathrm{MSE} = \frac{1}{N} \sum_i (y_i - \hat{y}_i)^2$
■ Mean absolute error: $\mathrm{MAE} = \frac{1}{N} \sum_i |y_i - \hat{y}_i|$
■ Root mean squared error: $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
■ Mean absolute percentage error: $\mathrm{MAPE} = \frac{1}{N} \sum_i \left| \frac{y_i - \hat{y}_i}{y_i} \right|$

Classification metrics

Some of the metrics used in a binary classification setting are derived from the confusion matrix:

                       Real positive    Real negative
Predicted positive     TP               FP
Predicted negative     FN               TN

■ Accuracy = (TP + TN) / (TP + FP + FN + TN)
■ Precision = TP / (TP + FP)
■ Recall, or true positive rate (TPR) = TP / (TP + FN)
■ False positive rate (FPR) = FP / (FP + TN)

ROC curve

A classification model usually predicts a probability value. Then, we can use a threshold to transform the predicted probability into a single 0 or 1 value. For each possible threshold we will obtain a confusion matrix, from which we can compute other metrics. If we compute the TPR and FPR for all possible thresholds we obtain the ROC curve.

Precision-recall curve

If we apply the same procedure of considering all possible thresholds but instead compute precision and recall values, we obtain the precision-recall curve. The main difference between the ROC and the precision-recall curve is that the latter does not consider TN. The baseline value for the PR curve depends on the dataset: it is the proportion of 1's.

Multiclass confusion matrix

If we have more than two classes, we can compute some classification metrics as one-vs-all. The confusion matrix generalizes to one row per predicted class and one column per real class, and accuracy is still the fraction of correct predictions (the diagonal of the matrix).

Exercise

Given the following dataset and predictions, compute some of the classification metrics for different thresholds (a code sketch for this exercise appears at the end of the Model validation section below):

Input x:                 0.14   0.39   3.91   2.61   3.09   9.41   9.94   5.50   8.06   5.76
Target y:                0      0      0      0      0      1      1      1      1      1
Predicted probability:   0.002  0.004  0.518  0.006  0.020  0.999  0.999  0.441  0.999  0.907

Model validation

Metrics can be used for two related goals:
■ Selection of the best model
■ Estimation of performance on future data
But if we compute metrics on the dataset used to fit the model, we will overestimate the performance of the model. We need some strategies to accurately estimate the performance of the model on unseen data.

Train-test split

We can split the dataset in two parts and use one for training and the other one for performance evaluation:
■ Train data: fit the model
■ Test data: estimate performance

Train-val-test split

If we need to do a large number of experiments, we can add a third partition to choose the best model and then evaluate the performance on the test set:
■ Train data: fit the model
■ Validation data: choose the model
■ Test data: estimate performance

K-fold cross validation

When the dataset is small we can split the dataset into several folds and train different models, each one with a specific subset of folds as the training set. With five folds, for example, we train five times, each time holding out a different fold to measure performance (Performance 1 to Performance 5), and we report the average performance.
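As a minimal sketch of the five-fold procedure just described, assuming scikit-learn is available (the linear model and the synthetic dataset are our own illustration, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))           # hypothetical inputs
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=100)  # noisy linear target

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    # Fit on four folds, evaluate on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

print("per-fold MSE:", np.round(scores, 3))
print(f"average MSE: {np.mean(scores):.3f}")
```

Each fold contributes one performance estimate, and the average over the folds is the reported performance.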
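Returning to the exercise above, here is a short sketch that computes the confusion-matrix metrics at a few representative thresholds (the threshold grid is our choice; the labels and probabilities are the ones given in the exercise):

```python
import numpy as np

# Data from the exercise: true labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
p_pred = np.array([0.002, 0.004, 0.518, 0.006, 0.020,
                   0.999, 0.999, 0.441, 0.999, 0.907])

for threshold in [0.1, 0.5, 0.9]:
    y_hat = (p_pred >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # true positive rate
    fpr = fp / (fp + tn)      # false positive rate
    accuracy = (tp + tn) / len(y_true)
    print(f"t={threshold}: acc={accuracy:.2f} prec={precision:.2f} "
          f"rec={recall:.2f} fpr={fpr:.2f}")
```

Sweeping the threshold over a fine grid and plotting (FPR, TPR) pairs yields the ROC curve; plotting (recall, precision) pairs yields the precision-recall curve.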
Overfitting vs underfitting

[Figure: three fits of the same dataset, illustrating underfitting, a good model, and overfitting.]

Improve generalization

To avoid underfitting:
■ Use a more powerful model
To avoid overfitting:
■ Obtain more data (not always possible)
■ Use a less powerful model (not always desirable)
■ Add regularization to the model (usually a good idea)
■ Use the average of several models (maybe not cost effective)

Train vs test error

                       Low testing error    High testing error
Low training error     Good model           Overfitting
High training error    —                    Underfitting

Bias-variance tradeoff

It turns out that, whichever function $\hat{f}$ we use to approximate our data, we can decompose its expected error on an unseen sample $(x_0, y_0)$ as follows:

$$E\left[ \left( y_0 - \hat{f}(x_0) \right)^2 \right] = \mathrm{Bias}\left( \hat{f}(x_0) \right)^2 + \mathrm{Var}\left( \hat{f}(x_0) \right) + \sigma^2,$$

where $\mathrm{Bias}(\hat{f}(x_0)) = E[\hat{f}(x_0)] - f(x_0)$, $\mathrm{Var}(\hat{f}(x_0)) = E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2]$, and $\sigma^2$ is the variance of the noise $\varepsilon$.

A model can have any combination of high or low bias and high or low variance.

The optimal level of flexibility is problem and data dependent, so it cannot be known in advance. In general we cannot observe the bias and the variance, so we have to focus only on the MSE.

Feature engineering

■ All machine learning models are based on mathematical operations.
■ That means that the features in our dataset must be transformed into numbers.
■ If a feature is non-numeric, it is usually categorical, although sometimes the number of categories is very large (e.g. text datasets).
■ Even if a feature is numeric, it is often useful to transform it to a specific range of values.
■ It might be the case that a feature is categorical in nature but is represented with numbers (e.g. postal codes). In these cases it is also important to consider a feature transformation.

Categorical variables

A categorical variable can have two or more discrete values (e.g. yes / no, city name, postal code, browser, etc.). We should not convert the categorical values to numbers directly, since:
■ We would be fixing an order in the categories that might not be real in the data.
■ We would be fixing a distance between values that might not be accurate.
The solution is to use one-hot encoding:
A → (1, 0, 0)
B → (0, 1, 0)
C → (0, 0, 1)

Numerical variables

When we have numerical variables we could use them directly in the model, but we need to be careful if they are on different scales (e.g. ages of people vs prices of houses). To normalize the numerical variables we usually use two different methods:
■ Standard scaling: $x' = \frac{x - \mu}{\sigma}$
■ MinMax scaling: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
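As a closing sketch, both transformations can be written directly in NumPy (the column names and values are made-up examples, not from the slides):

```python
import numpy as np

# --- One-hot encoding of a categorical column ---
cities = np.array(["A", "B", "C", "A", "C"])   # hypothetical categories
categories = np.unique(cities)                 # ["A", "B", "C"]
one_hot = (cities[:, None] == categories).astype(int)
# "A" -> (1, 0, 0), "B" -> (0, 1, 0), "C" -> (0, 0, 1)

# --- Scaling of a numerical column ---
ages = np.array([21.0, 35.0, 48.0, 62.0, 29.0])  # hypothetical values

standard = (ages - ages.mean()) / ages.std()     # zero mean, unit variance
minmax = (ages - ages.min()) / (ages.max() - ages.min())  # values in [0, 1]
```

In practice, libraries such as scikit-learn provide these transformations ready-made (e.g. OneHotEncoder, StandardScaler, MinMaxScaler); the NumPy version above just makes the two formulas explicit.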