Machine Learning 1
Statistical Learning
Applied Statistics
Outline
■ Supervised learning
■ Loss functions
■ Metrics
■ Model validation
■ Bias-variance tradeoff
■ Feature engineering
Supervised learning
Supervised learning
In a supervised learning problem, datasets come as a collection of pairs $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ such that $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. The elements of $\mathcal{X}$ are called explanatory variables or independent variables, and the elements of $\mathcal{Y}$ are called target variables or dependent variables.

Then, we will assume that there exists a map $f: \mathcal{X} \to \mathcal{Y}$ such that

$y = f(x) + \varepsilon$

where $\varepsilon$ is zero-mean Gaussian noise. The goal is then to find a map $\hat{f}$ such that $\hat{f}$ approximates $f$ as best as possible. That is, $\hat{f}(x) \approx f(x)$.
Supervised learning
Usually we don't search for the best function in general, since that is an intractable problem. Instead, we fix a family of functions $\mathcal{F} = \{f_\theta\}$, such that each function in the family has the same structure and the exact behaviour of each function in $\mathcal{F}$ depends on a set of parameters $\theta$.

Then, our goal is to find the best values of the parameters $\theta^*$ such that $f_{\theta^*} \approx f$, that is, to find a model $\hat{f} = f_{\theta^*}$.
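For example (a standard illustration, not specific to these slides), simple linear regression fixes the family

$\mathcal{F} = \{\, f_\theta(x) = \theta_0 + \theta_1 x \;:\; \theta = (\theta_0, \theta_1) \in \mathbb{R}^2 \,\}$

Every member of the family is a line, and choosing a model amounts to choosing the two parameter values.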
Loss functions
Loss functions
To measure how well (or how badly) a model is fitting a dataset, we can define a loss function that computes a value comparing the predictions of the model with the expected output values.

Note, however, that all datasets will contain noise, so we need to consider the target values of our dataset as following a random distribution. That is, $y \sim p(y \mid x)$.
Conditional probability
In the supervised learning setting, we can think of the model as modelling the conditional probability $p(y \mid x)$.

Assuming, as before, that the output follows a parametric distribution, we have $y \mid x \sim p(y \mid \theta)$, and then we consider that the model is predicting the parameters of the distribution given some input: $\theta = f_w(x)$.
Maximum likelihood estimation
With this setting, we can measure the probability of an output using the probability distribution determined by the parameters predicted by the model: $p(y_i \mid f_w(x_i))$.

We want this probability to be high, so we can optimize the parameters of the model to find the maximum of the joint probability for the examples in the dataset:

$w^* = \arg\max_w \prod_{i=1}^{N} p(y_i \mid f_w(x_i))$

This is the method of maximum likelihood estimation.
Negative log-likelihood
Since the product of several numbers less than 1 can get small very fast, we can transform the likelihood by applying the logarithm, without changing the optimal parameters:

$w^* = \arg\max_w \sum_{i=1}^{N} \log p(y_i \mid f_w(x_i))$

In optimization problems, it is customary to minimize the objective function, so we can negate the log-likelihood to obtain the negative log-likelihood (NLL):

$\mathrm{NLL}(w) = -\sum_{i=1}^{N} \log p(y_i \mid f_w(x_i))$
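As an illustration, here is a minimal sketch (my own example, not from the slides) that fits the mean of a Gaussian by numerically minimizing the NLL; the optimum recovers the sample mean, as maximum likelihood predicts:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=100)  # toy data with true mean 3

def nll(params):
    # Negative log-likelihood of a Gaussian with fixed scale 1.
    mu = params[0]
    return -np.sum(norm.logpdf(y, loc=mu, scale=1.0))

result = minimize(nll, x0=[0.0])
print(result.x[0], y.mean())  # the two values should nearly coincide
```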
Mean Squared Error
Assume that a univariate output follows a Gaussian distribution with fixed variance. We can use the model to predict the mean of the distribution. The maximum likelihood estimator is then given by:

$w^* = \arg\min_w \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f_w(x_i) \right)^2$
Log loss
Assume that a binary output follows a Bernoulli distribution. We can use the model to predict the rate of the distribution. The maximum likelihood estimator is then given by:

$w^* = \arg\min_w -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log f_w(x_i) + (1 - y_i) \log\left(1 - f_w(x_i)\right) \right]$
Cross-entropy
Assume that a classification output follows a categorical distribution with $K$ classes. We can use the model to predict the rates of the distribution. The maximum likelihood estimator is then given by:

$w^* = \arg\min_w -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_w(x_i)_k$
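The three estimators above translate directly into NumPy; this is a minimal sketch of the standard formulas (the function names and the `eps` clipping are my own choices):

```python
import numpy as np

def mse(y_true, y_pred):
    # Gaussian NLL with fixed variance reduces to the mean squared error.
    return np.mean((y_true - y_pred) ** 2)

def log_loss(y_true, p_pred, eps=1e-12):
    # Bernoulli NLL; clipping with eps avoids log(0).
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def cross_entropy(y_onehot, p_pred, eps=1e-12):
    # Categorical NLL; y_onehot has shape (N, K) and rows of p_pred sum to 1.
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))
```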
Metrics
Metrics
A metric is a function similar to a loss function, but more interpretable and closer
to the business point of view.
All loss functions could be used as metrics, but not all metrics are suitable to be
used as loss functions.
Depending on the problem, we have different metrics available. The choice of
metric for each use case is usually based on the decision making process using
the predictions of the model.
Regression metrics
Some of the metrics used in a regression setting are the following:
■ Mean squared error: $\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
■ Mean absolute error: $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$
■ Root mean squared error: $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
■ Mean absolute percentage error: $\mathrm{MAPE} = \frac{100}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$
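As a sketch (standard formulas, the function name is my own), all four metrics are one-liners in NumPy; note that MAPE assumes no target value is zero:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    err = y_true - y_pred
    return {
        "MSE": np.mean(err ** 2),
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAPE": 100 * np.mean(np.abs(err / y_true)),  # undefined if y_true has zeros
    }
```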
Classification metrics
Some of the metrics used in a binary classification setting are the following:
Confusion matrix:

                           Real
                     Positive    Negative
Prediction  Positive    TP          FP
            Negative    FN          TN

■ Accuracy = (TP + TN) / (TP + FP + FN + TN)
■ Precision = TP / (TP + FP)
■ Recall = true positive rate (TPR) = TP / (TP + FN)
■ False positive rate (FPR) = FP / (FP + TN)
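These definitions translate directly into code; a minimal sketch (function name and zero-division guards are my own):

```python
def classification_metrics(tp, fp, fn, tn):
    # Derived metrics from the four confusion-matrix counts.
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall (TPR)": tp / (tp + fn) if tp + fn else 0.0,
        "FPR": fp / (fp + tn) if fp + tn else 0.0,
    }
```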
ROC curve
A classification model usually predicts a probability value. Then, we can use a threshold to transform the predicted probability into a single 0 or 1 value.

For each possible threshold we will obtain a confusion matrix, from which we can compute other metrics. If we compute TPR and FPR for all possible thresholds, we obtain the ROC curve.
Precision-recall curve
If we apply the same procedure of considering all possible thresholds, but instead compute precision and recall values, we obtain the precision-recall curve.

The main difference between the ROC and the precision-recall curve is that the latter does not consider TN. The base value for the PR curve depends on the dataset: it is the proportion of 1's (positive examples).
Multiclass confusion matrix
If we have more than two classes we can compute some classification metrics as
one-vs-all.
[Figure: multiclass confusion matrix, Real vs. Prediction; accuracy is the fraction of examples on the diagonal.]
Exercise
Given the following dataset and predictions, compute some of the classification
metrics for different thresholds.
x       y   prediction
0.14    0   0.002
0.39    0   0.004
3.91    0   0.518
2.61    0   0.006
3.09    0   0.020
9.41    1   0.999
9.94    1   0.999
5.50    1   0.441
8.06    1   0.999
5.76    1   0.907
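One possible way to work through the exercise (a sketch, with an arbitrary choice of thresholds) is to sweep the threshold and recompute the confusion matrix each time; collecting TPR and FPR over all thresholds would trace the ROC curve from the previous slides:

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
p = np.array([0.002, 0.004, 0.518, 0.006, 0.020,
              0.999, 0.999, 0.441, 0.999, 0.907])

for thr in (0.1, 0.5, 0.9):
    y_hat = (p >= thr).astype(int)
    tp = int(np.sum((y_hat == 1) & (y == 1)))
    fp = int(np.sum((y_hat == 1) & (y == 0)))
    fn = int(np.sum((y_hat == 0) & (y == 1)))
    tn = int(np.sum((y_hat == 0) & (y == 0)))
    print(f"thr={thr}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"TPR={tp / (tp + fn):.2f} FPR={fp / (fp + tn):.2f}")
```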
Model validation
Model validation
Metrics can be used for two related goals:
■ Selection of the best model
■ Estimation of performance on future data
But, if we compute metrics on the dataset used to fit the model, we will
overestimate the performance of the model.
We need some strategies to accurately estimate the performance of the model on
unseen data.
Train-test split
We can split the dataset into two parts and use one for training and the other for performance evaluation.
Dataset
├─ Train data → fit the model
└─ Test data → estimate performance
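A minimal sketch with scikit-learn (the toy data and model choice are placeholders, not from the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # toy features
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)            # 80% train, 20% test
model = LinearRegression().fit(X_train, y_train)    # fit on train data only
print(mean_squared_error(y_test, model.predict(X_test)))  # performance estimate
```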
Train-val-test split
If we need to do a large number of experiments, we can add a third partition to
choose the best model and then evaluate the performance on the test set.
Dataset
├─ Train data → fit the model
├─ Val data → choose the model
└─ Test data → estimate performance
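The same idea extends with a second split; a sketch continuing the toy data from the previous example (the proportions are an assumption):

```python
# Two successive splits: 60% train, 20% validation, 20% test overall.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)
# Fit candidate models on X_train, pick the best one on X_val,
# and report the final performance estimate on X_test only once.
```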
K-fold cross validation
When the dataset is small, we can split it into several folds and train different models, each one with a specific subset of folds as its training set.
Dataset → Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5

In round k, Fold k is held out for evaluation and the remaining four folds form the training set, giving Performance k. Repeating for k = 1, …, 5 and averaging the five results gives the average performance.
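A sketch with scikit-learn, where `cross_val_score` handles the five rounds (the toy data and scoring choice are assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

scores = cross_val_score(
    LinearRegression(), X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error")   # one score per fold
print(-scores.mean())                   # average performance over the 5 folds
```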
Overfitting vs underfitting
[Figure: three fits of increasing flexibility, labelled underfitting, good model and overfitting.]
Improve generalization
To avoid underfitting:
■ Use a more powerful model

To avoid overfitting:
■ Obtain more data (not always possible)
■ Use a less powerful model (not always desirable)
■ Add regularization to the model (usually a good idea)
■ Use the average of several models (maybe not cost effective)
Train vs test error
                      Low testing error    High testing error
Low training error    Good model           Overfitting
High training error   —                    Underfitting
Bias-variance tradeoff
Bias-variance tradeoff
It turns out that, whichever function we use to approximate our data, we can decompose its expected error on an unseen sample $x_0$ as follows:

$\mathbb{E}\left[ \left( y_0 - \hat{f}(x_0) \right)^2 \right] = \mathrm{Var}\left( \hat{f}(x_0) \right) + \left[ \mathrm{Bias}\left( \hat{f}(x_0) \right) \right]^2 + \mathrm{Var}(\varepsilon)$

where $\mathrm{Bias}(\hat{f}(x_0)) = \mathbb{E}[\hat{f}(x_0)] - f(x_0)$ and $\mathrm{Var}(\varepsilon)$ is the irreducible error coming from the noise.
Bias-variance tradeoff
A model can have any combination of high or low bias and high or low variance.
Bias-variance tradeoff
The optimal level of flexibility is problem and data dependent, so it cannot be known in advance. In general we cannot observe the bias and variance, so we have to focus only on the MSE.
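Although bias and variance are not observable on real data, they can be estimated by simulation when the true function is known; an illustrative sketch (the sine target, noise level and polynomial degrees are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)            # assumed true function
x0, n, trials = 2.0, 30, 500       # test point, sample size, resampled datasets

for degree in (1, 9):              # low vs high flexibility
    preds = []
    for _ in range(trials):
        x = rng.uniform(0, np.pi, n)
        y = f(x) + rng.normal(scale=0.3, size=n)   # noisy training set
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    preds = np.array(preds)
    print(f"degree {degree}: bias^2={(preds.mean() - f(x0))**2:.4f}, "
          f"variance={preds.var():.4f}")
```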
Feature engineering
Feature engineering
■ All machine learning models are based on mathematical operations.
■ That means that the features in our dataset must be transformed to numbers.
■ If a feature is non-numeric, it is usually categorical, although sometimes the number of categories is very large (e.g. text datasets).
■ Even if a feature is numeric, it is useful to transform it to a specific range of values.
■ It might be the case that a feature is categorical in nature but represented with numbers (e.g. postal codes). In these cases it is also important to consider a feature transformation.
Categorical variables
A categorical variable can have two or more discrete values (e.g. yes / no, city
name, postal code, browser, etc).
We should not simply convert the categorical values to numbers, since:
■ We would be fixing an order in the categories that might not be real in the data.
■ We would be fixing a distance between values that might not be accurate.

The solution is to use one-hot encoding:

A → (1, 0, 0)
B → (0, 1, 0)
C → (0, 0, 1)
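A sketch with scikit-learn (assuming a recent version, where the parameter is `sparse_output`):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

cities = np.array([["A"], ["B"], ["C"], ["B"]])   # one categorical column
encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform(cities))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```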
Numerical variables
When we have numerical variables we could use them directly in the model, but
we need to be careful if they are in different scales (e.g. age of people vs prices of
houses).
To normalize the numerical variables we usually use two different methods:
■ Standard scaling: $x' = \dfrac{x - \mu}{\sigma}$
■ MinMax scaling: $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$
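Both methods are one-line formulas; a minimal NumPy sketch with toy values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
standard = (x - x.mean()) / x.std()            # zero mean, unit variance
minmax = (x - x.min()) / (x.max() - x.min())   # values in [0, 1]
```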