# english

Degree in Engineering
Topic: Regression
Regression
Introduction. Non-deterministic relationships
Simple Linear Regression
Model
Estimation
Diagnosis / Inference
Multiple Regression
Multiple scatter plots
Estimation
Multicollinearity
Dummy variables / Interactions
Slide number: 2
Objectives
- To know how to analyse the relationship between variables using a
linear regression model that describes the influence of a variable X on
another variable Y.
- To know how to obtain point estimates of the parameters of that model.
- To know how to construct confidence intervals and carry out hypothesis
tests.
- To know how to estimate the mean of Y for a given value of X.
- To know how to predict future values of the variable Y.
Slide number: 3
Relationships between variables
Regression studies the relationship between variables.
What types of relationship can exist?
- Deterministic relationship (exact)
- Non-deterministic relationship (not exact)
Slide number: 4
Deterministic relationships
We call a relationship between two variables deterministic when,
by knowing the value of one of the variables, we are able to know
the value of the other EXACTLY.
This corresponds to an exact mathematical relationship, a function:
Y = f(X)
Slide number: 5
Non-deterministic relationships
The relationship between the two variables is not exact: knowing
the value of one does not allow us to know the exact value of the
other.
We know that a relationship exists between the variables,
but it isn't exact!
Slide number: 6
Regression
What does regression do?
It creates a linear model to approximate the relationship between
the variables.
The relationship isn't exact, and the model is not exact
=> but it is very useful!
Slide number: 7
Regression: residuals
If the relationship is not exact, then we will always commit an error:
e = residual
The distance between each point (the real data) and the regression line
is the part of Y that the model cannot predict.
We estimate the regression line so that the errors we commit
are minimised (criterion: least squares), with the condition that the
mean error is zero.
Slide number: 8
How is the regression line calculated?
Slide number: 9
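The content of this slide did not survive extraction. Under the least-squares criterion named on the previous slide, the estimates have a standard closed form: the slope is the sum of cross-products of X and Y around their means divided by the sum of squares of X, and the intercept makes the line pass through the point of means. A minimal sketch follows; the helper `fit_line` and the sample data are illustrative assumptions, not the Statgraphics computation.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = s_xy / s_xx            # slope
    b0 = mean_y - b1 * mean_x   # intercept: the line passes through the means
    return b0, b1


# Hypothetical data lying exactly on y = 1 + 2x, so the fit should recover it.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_line(x, y)
print(b0, b1)
```

With real data the points do not lie on the line, and the same two formulas give the line with the smallest sum of squared residuals.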
How do we name the variables?
X
- Independent variable
- Explanatory variable
- The value that we know
Y
- Dependent variable
- The response to be explained
- What we want to predict
Slide number: 10
Regression: an example
An example: we will analyse the relationship between the production cost
of a process and the number of pieces produced.
[Figure: scatter plot of production cost (Y, axis from 1.7 to 5.7) against number of pieces produced (X, axis from 2.1 to 3.9)]
Y = production cost
X = number of pieces produced
We will calculate the regression line using Statgraphics.
Slide number: 11
Regression: an example
[Figure: the same scatter plot of production cost against number of pieces produced, with the fitted regression line]
production cost = 0.783429 + 0.669509 * pieces produced
Slide number: 12
Regression: an example
[Figure: scatter plot of production cost against number of pieces produced, with the fitted regression line]
production cost = 0.783429 + 0.669509 * pieces produced
So, according to the model, a factory that produces a million units will have
a production cost of:
production cost = 0.783429 + 0.669509 * 1 ≈ 1.45 million €
Will all the factories with this volume of production have the same cost?
Slide number: 13
Regression: an example
Will all the factories with this volume of production have the same cost?
[Figure: scatter plot of production cost against number of pieces produced, with the fitted regression line]
There is a range of production costs, from 2.8 to 4.8 million €.
Specifically, for factory A: production cost = 1.66 million €
But the model says:
production cost = 0.783429 + 0.669509 * 1 ≈ 1.45 million €
Therefore the error that is committed is 1.66 - 1.45 = 0.21 million €
Slide number: 14
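The arithmetic for factory A can be checked with a few lines of Python. This is only a sketch: the coefficients and the observed cost are taken from the slides, while the function name `predict_cost` and the units (millions of pieces, million €) are assumptions made for the illustration.

```python
B0 = 0.783429   # intercept reported on the slides
B1 = 0.669509   # slope reported on the slides


def predict_cost(pieces_millions):
    """Production cost (million EUR) predicted by the fitted line."""
    return B0 + B1 * pieces_millions


observed_cost = 1.66                  # factory A, in million EUR
predicted_cost = predict_cost(1.0)    # one million pieces
residual = observed_cost - predicted_cost
print(round(predicted_cost, 2), round(residual, 2))
```

The residual is the part of factory A's cost that the line does not explain. With the unrounded coefficients it is about 0.21 million €.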
Assumptions of the model
Can we apply the regression model to all types of data?
No. For the conclusions that we draw from our models to be correct, the data
that we use must satisfy the following properties:
1. Linearity
2. Homoscedasticity
3. Independence
4. Normality
Slide number: 15
Linearity
This is a fundamental assumption: the data must follow a
linear trend and be highly correlated.
Slide number: 16
Linearity: what happens if the data are not linear?
The regression will not correctly represent the
relationship between the variables.
If the data are not linear, we can look for a mathematical
transformation (e.g. log, sqrt) that improves the linearity.
Slide number: 17
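As a rough illustration of how a transformation can improve linearity, the sketch below compares the correlation coefficient before and after taking logarithms. The data are hypothetical (an exact power law, y = 3x^2) and `correlation` is a helper written for this example, not a Statgraphics function.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical curved relationship: y = 3 * x^2.
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
y = [3.0 * xi ** 2 for xi in x]

r_raw = correlation(x, y)
r_log = correlation([math.log(v) for v in x], [math.log(v) for v in y])
print(r_raw, r_log)
```

On the log scale the power law becomes log(y) = log(3) + 2 log(x), a straight line, so the correlation is essentially 1, whereas on the raw scale it is lower.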
Homoscedasticity
This assumption means that the data have constant variance,
so that their graph is of the following type:
[Figure: scatter plot of data with constant variance]
• When the variance of the data is constant we say that the data are HOMOSCEDASTIC.
• What happens if the data are not homoscedastic?
Slide number: 18
Homoscedasticity: heteroscedastic data
When the variance is not constant (for example, it grows with the
independent variable) we say the data are HETEROSCEDASTIC.
How does this affect the regression?
[Figure: scatter plot of Gastos (expenses, axis from 0 to 1 x 1,000,000) against Ingresos (income, axis from 0 to 8 x 100,000)]
The prediction errors grow with the value of the variables!
We shouldn't apply regression directly to such heteroscedastic data.
We have to transform the data first, using logarithms.
Slide number: 19
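Why the logarithm helps can be seen with a small sketch. The data are hypothetical: Y deviates from the trend Y = X by a constant percentage (alternately 20% above and below), which is one common way for the spread to grow with X. The variable names are assumptions made for the illustration.

```python
import math

x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
# Hypothetical multiplicative noise around the trend y = x.
factors = [1.2, 1 / 1.2, 1.2, 1 / 1.2, 1.2, 1 / 1.2]
y = [xi * f for xi, f in zip(x, factors)]

# Size of the deviation from the trend, on the raw scale and on the log scale.
raw_deviation = [abs(yi - xi) for xi, yi in zip(x, y)]
log_deviation = [abs(math.log(yi) - math.log(xi)) for xi, yi in zip(x, y)]

print([round(d, 3) for d in raw_deviation])   # grows with x
print([round(d, 3) for d in log_deviation])   # constant
```

On the raw scale the deviations grow with X (heteroscedastic); after taking logarithms they all have the same size, log(1.2), so the transformed data have constant spread.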
Testing for linearity and homoscedasticity
We test the assumptions of linearity and homoscedasticity
by a graphical analysis of the data
(scatter plots / X-Y plots).
[Figure: scatter plot of production cost against number of pieces produced]
If the data satisfy these assumptions, then we can
continue with the analysis.
Slide number: 20
Independence
The data that we analyse must be mutually independent
(each datum independent of the others):
- If we analyse the production cost against
production volume for different factories, we assume
that the data from one factory do not affect the data from
another.
You CANNOT use this regression analysis to analyse data
from a time series, as each datum depends on
previous data.
Slide number: 21
Normally distributed
The last assumption that the model requires is that the data
analysed are normally distributed. What does this mean?
[Figure: scatter plot of production cost against number of pieces produced]
We have said that, for each value of X, Y can take values in a
certain range.
We assume that the values of Y for each value of X follow a
normal distribution.
Slide number: 22
The model
If the data satisfy the (four) assumptions discussed, we
can estimate the model.
production cost = 0.783429 + 0.669509 * pieces produced
Slide number: 23
The model
production cost = 0.783429 + 0.669509 * pieces produced
β0 = 0.783429, β1 = 0.669509
β0 is the value of Y when X has value 0
(not always a feasible condition).
β1 tells us how Y changes in response to changes in X:
∆Y = β1 ∆X
Therefore, in our previous example, how much will the production cost
increase if the number of pieces produced increases by one million?
∆(production cost) = 0.669509 * ∆(pieces produced) ≈ 0.67 million €
Slide number: 24
Regression: a problem...
In regression we start with a data sample and from that we estimate
the model.
[Figure: scatter plot of production cost against number of pieces produced, with the fitted regression line]
production cost = 0.783429 + 0.669509 * pieces produced
Slide number: 25
Regression: a problem...
If we change the data sample, the parameters of the model (the numbers
that we have calculated) will change.
Is it possible to select a sample that would give us the following
result?
[Figure: scatter plot of a sample with X from -3 to 3 and Y from -2.5 to 2]
If this happens, the slope of the line, β1, is ZERO and we say that
THE REGRESSION IS NOT SIGNIFICANT
Slide number: 26
Regression: a problem...
[Figure: scatter plot of a sample with X from -3 to 3 and Y from -2.5 to 2]
We want to be sure that our regression is valid regardless of the
sample considered.
We want to be sure that the regression is valid for the whole
population studied and not just for one specific sample.
WE WANT TO BE SURE THAT β1 IS NOT EQUAL TO ZERO
Slide number: 27
In order to analyse whether β1 is zero we have three tools:
- Confidence intervals
- Hypothesis tests: t-statistic
- Hypothesis tests: p-value
Slide number: 28
Confidence intervals
We calculate a range of plausible values for β1.
We do this with a given confidence level (generally 95%): an interval
built in this way contains the true β1 for about 95% of the samples
that we could take.
( β1 - 2 x SE(β1) ; β1 + 2 x SE(β1) )
If the value "0" does not belong to the interval,
the parameter is SIGNIFICANT!
Slide number: 29
Confidence intervals
production cost = 0.783429 + 0.669509 * pieces produced
( β1 - 2 x SE(β1) ; β1 + 2 x SE(β1) )
( 0.67 - 2 x 0.07 ; 0.67 + 2 x 0.07 ) = (0.53 ; 0.81)
"0" does not belong to the interval => the parameter is significant!
Slide number: 30
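The interval on this slide can be reproduced with a short sketch. The estimate 0.67 and standard error 0.07 come from the slide. The function name `confidence_interval` is an assumption, and the multiplier 2 is the slide's rule of thumb for an approximately 95% interval.

```python
def confidence_interval(estimate, std_error, multiplier=2.0):
    """Approximate 95% interval: estimate +/- multiplier * standard error."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width


low, high = confidence_interval(0.67, 0.07)
# The parameter is declared significant when 0 lies outside the interval.
significant = not (low <= 0.0 <= high)
print(round(low, 2), round(high, 2), significant)
```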
Hypothesis test
An alternative way of checking that β1 is not zero is to
set up a hypothesis test (in the standard form):
H0: β1 = 0
H1: β1 ≠ 0
Statgraphics gives us the p-value for this test directly:
If p < 0.05, we reject H0
=> The regression is significant
Slide number: 31
Hypothesis test: t-statistic
As an alternative to the p-value, we can also resolve the
hypothesis test with the t-statistic:
H0: β1 = 0
H1: β1 ≠ 0
If |t| > 2 we reject H0
If |t| < 2 we do not reject H0
In our example |t| > 2 => we reject H0
The regression is significant
Slide number: 33
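The slides do not show how t is obtained. In the standard setting it is the estimate divided by its standard error, which is also what the Statgraphics tables later in this deck report in the "T Statistic" column. A minimal sketch, using the slope 0.67 and standard error 0.07 from the confidence-interval slide (the function name is an assumption):

```python
def t_statistic(estimate, std_error):
    """t-statistic for H0: the parameter equals zero."""
    return estimate / std_error


t = t_statistic(0.67, 0.07)
reject_h0 = abs(t) > 2   # rule of thumb used on the slides
print(round(t, 2), reject_h0)
```

Comparing |t| with 2 is equivalent to checking whether 0 lies inside the interval "estimate ± 2 standard errors", so the two tools always agree.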
How good is the model? => R2
The coefficient R2 (R-squared) indicates what proportion of the variability of Y is explained by X
(using the model).
Example:
R2 = 71.76%
In simple regression, R2 = r^2 (the squared correlation coefficient).
Slide number: 34
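The two readings of R2 (share of variability explained, and squared correlation) can be checked numerically. The sketch below uses a small hypothetical data set and the usual definition R2 = 1 - SSE/SST. All names and data are assumptions for the illustration.

```python
import math

# Hypothetical sample.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
s_yy = sum((b - mean_y) ** 2 for b in y)   # total variability of Y (SST)

# Least-squares line and its residual sum of squares (SSE).
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1.0 - sse / s_yy            # share of variability explained
r = s_xy / math.sqrt(s_xx * s_yy)       # correlation coefficient
print(round(r_squared, 4), round(r ** 2, 4))
```

Both numbers coincide (0.64 for this sample), which is the identity R2 = r^2 for simple regression.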
Summary
- We study the data and check whether the assumptions are satisfied.
- If not, we transform the data using mathematical functions.
- We fit the model.
- We use confidence intervals and hypothesis tests to see whether X is
significant for Y (does X influence Y?).
Slide number: 35
Diagnostics
The decisions that we can take thanks to the information given by a
regression model are important.
We need to be sure that our conclusions are correct.
For this we use:
- Inference: tests, confidence intervals...
- Diagnostics: to test once more whether the assumptions made remain valid.
In the diagnosis of the model, we check that the random part of the
model (the residuals) does not contain any additional information
and does not show that the assumptions (linearity,
homoscedasticity, independence, normality) are invalid.
Slide number: 36
Diagnostics
The diagnosis is performed by visual inspection of the residual
plots.
They should have the following general appearance:
[Figure: residual plot]
Slide number: 37
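The residual plots themselves are not recoverable here, but the idea can be sketched numerically. In the hypothetical example below a straight line is fitted to curved data (y = x^2). The residuals still average zero, as least squares guarantees, yet their signs follow a clear pattern, which is exactly what the visual inspection is meant to catch.

```python
# Hypothetical curved data: y = x^2.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [xi ** 2 for xi in x]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
b1 = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
      / sum((a - mean_x) ** 2 for a in x))
b0 = mean_y - b1 * mean_x

# Residual = observed value minus value predicted by the line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(residuals)          # positive at both ends, negative in the middle
print(sum(residuals))     # zero: the mean error is zero by construction
```

A U-shaped sequence of residuals like this one indicates that the linearity assumption fails, even though the mean residual is zero.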
Diagnostics
We cannot accept residuals that display other types of behaviour:
[Figure: two example residual plots with unacceptable behaviour]
Slide number: 38
Multiple Regression
In a multiple regression model, we want to explain the value of a
response variable using more than one explanatory
variable:
Y = β0 + β1 X1 + β2 X2 + ... + βk Xk + e
In this expression, each one of the β coefficients represents the
individual influence that the corresponding X variable has on Y.
The assumptions of the model are the same as for simple regression.
So are the hypothesis tests, diagnosis, etc.
Slight inconveniences:
- The visualisation of the graphs is slightly more complicated.
- We need to redefine the R2 coefficient.
Slide number: 40
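As a sketch of how the β coefficients of a multiple regression can be estimated, the code below solves the least-squares normal equations (X'X)β = X'y with a small Gaussian elimination written for this example. The helpers `solve` and `fit_multiple` and the data are assumptions. Statgraphics (or any statistics package) performs the equivalent computation internally.

```python
def solve(a, b):
    """Solve the linear system a * beta = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of the column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


def fit_multiple(xs, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + list(r) for r in xs]             # prepend the constant term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data generated exactly from y = 1 + 2*x1 - 3*x2.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in xs]
coefficients = fit_multiple(xs, y)
print([round(c, 6) for c in coefficients])
```

Because the data were generated without error, the fit recovers the coefficients 1, 2 and -3. With real data the same procedure gives the estimates that minimise the sum of squared residuals.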
Multiple Regression: graphs
Each cell of the graph represents the pairwise relationship between two
variables.
[Figure: scatter plot matrix of TOT_COST, UDS, MANPOWER, ENERGY, INVEST, MAINT, MAT and ENV]
Slide number: 41
Multiple Regression: adjusted R2 value
The R2 coefficient increases as the number of variables in the
model increases (whether they are significant or not). To
compensate for this effect, in multiple
regression we use the corrected (or adjusted) R2.
Dependent variable: log(TOT_COST)
-----------------------------------------------------------------------------
                                  Standard          T
Parameter          Estimate         Error       Statistic      P-Value
-----------------------------------------------------------------------------
CONSTANT          -1.82352        0.313487      -5.81689       0.0000
log(UDS)           0.666417       0.116524       5.71913       0.0000
log(MANPOWER)      0.157212       0.0551564      2.85029       0.0052
log(ENERGY)        0.174001       0.0489637      3.55367       0.0005
log(INVEST)        0.216335       0.0365883      5.91267       0.0000
log(MAINT)        -0.0199751      0.0594171     -0.336185      0.7373
log(MAT)           0.139431       0.0221418      6.2972        0.0000
log(ENV)           0.0027926      0.0178724      0.156252      0.8761
-----------------------------------------------------------------------------
Slide number: 42
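The slides do not give the correction formula. The usual definition, which we assume here, is adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of explanatory variables. A minimal sketch with hypothetical values:

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R2 under the usual degrees-of-freedom correction."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical values: the same raw R2 of 0.70 obtained with 1 or with 7 predictors.
adj_one = adjusted_r_squared(0.70, n_obs=50, n_predictors=1)
adj_seven = adjusted_r_squared(0.70, n_obs=50, n_predictors=7)
print(round(adj_one, 4), round(adj_seven, 4))
```

For the same raw R2, the model with more variables receives the lower adjusted value, which is the compensation described above.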
Example
Number of accidents in Spanish provinces as a function of the number of
registered vehicles
[Figure: scatter plot of nacciden (accidents, axis from 0 to 3 x 1000) against matricul (registered vehicles, axis from 0 to 24 x 1000)]
Dependent variable: nacciden
-----------------------------------------------------------------------------
                                  Standard          T
Parameter          Estimate         Error       Statistic      P-Value
-----------------------------------------------------------------------------
CONSTANT         278.24         102.518          2.71406       0.0265
matricul           0.0993373      0.00850344    11.682         0.0000
-----------------------------------------------------------------------------
R-squared (adjusted for d.f.) = 93.7703 percent
Slide number: 44
Example
Number of accidents in Spanish provinces as a function of the number of
driving licences
[Figure: scatter plot of nacciden (accidents, axis from 0 to 3 x 1000) against permisos (driving licences, axis from 0 to 24 x 1000)]
Dependent variable: nacciden
-----------------------------------------------------------------------------
                                  Standard          T
Parameter          Estimate         Error       Statistic      P-Value
-----------------------------------------------------------------------------
CONSTANT         216.481        127.099          1.70325       0.1269
permisos           0.107617       0.0109657      9.81395       0.0000
-----------------------------------------------------------------------------
R-squared (adjusted for d.f.) = 91.3722 percent
Slide number: 45
Regressions
Accid = 278.2 + 0.1 Matriculas
(t-statistic = 11.68)
Accid = 216.5 + 0.1 Permisos
(t-statistic = 9.81)
Slide number: 46
Regression for both variables
Dependent variable: nacciden
-----------------------------------------------------------------------------
                                  Standard          T
Parameter          Estimate         Error       Statistic      P-Value
-----------------------------------------------------------------------------
CONSTANT         250.63         113.216          2.21373       0.0625
matricul           0.0725492      0.0395634      1.83374       0.1093
permisos           0.0301069      0.043353       0.694461      0.5098
-----------------------------------------------------------------------------
Slide number: 47
Regressions (t-statistics in parentheses)
Accid = 278.2 + 0.1 Matriculas
(t = 11.68)
Accid = 216.5 + 0.1 Permisos
(t = 9.81)
Accid = 250.6 + 0.07 Matriculas + 0.03 Permisos
(t = 1.8 for Matriculas, t = 0.69 for Permisos)
Slide number: 49
What's happening?
[Figure: scatter plot of matricul against permisos (both axes from 0 to 24 x 1000)]
Correlation = 0.975
Slide number: 50
Regression: a problem
Sometimes the independent variables are very similar:
they contain the same information.
[Figure: diagram relating the independent variables to the dependent variable]
Slide number: 51
Regression: a problem
The model cannot distinguish between the two variables.
[Figure: diagram relating the independent variables to the dependent variable]
Slide number: 52
In our example:
[Figure: diagram relating Registered cars and Driving licences to Num Accid]
Both are too similar for the model to
distinguish between them.
Slide number: 53
In our example:
The solution?
Eliminate one of the variables.
We lose almost no information.
[Figure: diagram relating Registered cars to Num Accid, with Driving licences removed]
Slide number: 55
The problem of (multi)collinearity appears frequently in
statistics:
we tend to measure one thing in many ways.
It is detected when:
- in simple regression, the variables are significant
- on introducing new variables, those variables stop
being significant
Slide number: 56
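One common way to quantify this effect, not mentioned in the slides and introduced here only as an assumption, is the variance inflation factor (VIF). With two explanatory variables whose correlation is r, VIF = 1 / (1 - r^2), and the standard error of each coefficient is inflated by roughly the square root of the VIF compared with uncorrelated predictors. A sketch using the correlation 0.975 between matricul and permisos reported above:

```python
import math


def variance_inflation_factor(r):
    """VIF for a model with two predictors whose correlation is r."""
    return 1.0 / (1.0 - r ** 2)


vif = variance_inflation_factor(0.975)
se_inflation = math.sqrt(vif)   # approximate inflation of the standard errors
print(round(vif, 2), round(se_inflation, 2))
```

A standard error inflated by a factor of about 4.5 is consistent with the t-statistics dropping below 2 when both variables enter the model together.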
A weight-height study
Is the relation the same for women and men?
[Figure: scatter plot of Weight against Height]
Slide number: 58
A weight-height study
Is the relation the same for women and men?
[Figure: two scatter plots of Weight against Height]
Slide number: 59
A weight-height study
If the relation is not the same, we could commit serious errors:
[Figure: two scatter plots of Weight against Height]
Slide number: 60
Examples
Variable Y                       Variable X               Group that could influence
-----------------------------------------------------------------------------------------------
Weight                           Height                   Sex: Male or Female
Consumption of a worker          Earnings of the worker   Labour status: Unemployed or Employed
Automobile consumption           Power / Engine size      Engine type: Diesel or Petrol
Profit margin of a bank branch   Bank charges             Branch: Urban or Rural
-----------------------------------------------------------------------------------------------
Slide number: 61
It is necessary to introduce a group:
In this case:
• we define a variable Z that takes the following values:
Zi = 0 if the observation belongs to group A
Zi = 1 if the observation belongs to group B
• and we estimate the following regression model:
ŷ = β̂0 + β̂1 X + β̂2 Z
Slide number: 62
The model is estimated:
ŷ = β̂0 + β̂1 X + β̂2 Z
• Women are assigned Z=0, so that:
ŷ = β̂0 + β̂1 X
• Men are assigned Z=1, so that:
ŷ = (β̂0 + β̂2) + β̂1 X
Slide number: 63
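The effect of the dummy variable can be sketched in a few lines. The coefficients below are hypothetical, and `encode_group` and `predict` are helper names invented for the illustration. The point is only that, for the same X, the two groups differ by exactly the coefficient of Z.

```python
def encode_group(label, reference="A"):
    """Dummy variable Z: 0 for the reference group, 1 for the other group."""
    return 0 if label == reference else 1


def predict(x, z, b0, b1, b2):
    """Fitted value of the model y = b0 + b1*X + b2*Z."""
    return b0 + b1 * x + b2 * z


# Hypothetical coefficients, chosen only for illustration.
b0, b1, b2 = 10.0, 0.5, 4.0
y_group_a = predict(100.0, encode_group("A"), b0, b1, b2)
y_group_b = predict(100.0, encode_group("B"), b0, b1, b2)
gap = y_group_b - y_group_a
print(y_group_a, y_group_b, gap)
```

The gap does not depend on the chosen value of X: the two fitted lines are parallel and separated vertically by b2.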
Therefore:
[Figure: two parallel lines of Weight against Height, ŷ = (β̂0 + β̂2) + β̂1 X and ŷ = β̂0 + β̂1 X, separated vertically by β̂2]
The effect is that a man of a certain height weighs β̂2 kg more than a
woman of the same height.
Or does he? ...
Slide number: 64
Let's do it:
Dependent variable: peso
-----------------------------------------------------------------------------
                                  Standard          T
Parameter          Estimate         Error       Statistic      P-Value
-----------------------------------------------------------------------------
CONSTANT         -77.7888        16.0908        -4.83438       0.0000
altura             0.842013       0.0905752      9.29628       0.0000
sexo              -5.17748        2.20877       -2.34405       0.0208
-----------------------------------------------------------------------------
R-squared = 60.8791 percent
R-squared (adjusted for d.f.) = 60.1927 percent
Variables: peso = weight, altura = height, sexo = sex
Sexo=0 : Men
Sexo=1 : Women
(This coding is the reverse of the Z coding used before, so the coefficient of sexo is negative.)
Therefore: a man of height 180 is predicted to weigh: -78 + 0.84 x 180 ≈ 73 kilos
... and a woman of the same height: -78 + 0.84 x 180 - 5.17 ≈ 68 kilos
The difference is significant because t = -2.34 and its abs. value is > 2
Slide number: 65
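The two predictions can be reproduced from the table with the unrounded coefficients. This is a sketch: the function name `predict_weight` is an assumption, and the coding sexo = 0 for men and 1 for women is the one stated on the slide.

```python
B_CONSTANT = -77.7888
B_ALTURA = 0.842013    # coefficient of height
B_SEXO = -5.17748      # coefficient of the dummy (0 = man, 1 = woman)


def predict_weight(altura, sexo):
    """Predicted weight (kg) from the fitted model on the slide."""
    return B_CONSTANT + B_ALTURA * altura + B_SEXO * sexo


man = predict_weight(180, sexo=0)
woman = predict_weight(180, sexo=1)
print(round(man, 1), round(woman, 1), round(man - woman, 2))
```

With the unrounded coefficients the predictions are about 73.8 and 68.6 kilos. The slide's 73 and 68 come from using the rounded values -78 and 0.84.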
The result
[Figure: parallel regression lines of Weight against Height for Men and Women, separated by 5 kg]
Slide number: 66
Interactions
We have supposed that the lines are parallel.
And what if they aren't?
[Figure: non-parallel lines for groups A and B in a plot of Y against X]
Slide number: 67
Including interactions in the model
Modelling an interaction is easy. One has to estimate a
regression model between:
· the Y variable
· the X variable
· the Z variable
· the X-Z interaction, which is modelled by the product (XZ).
ŷ = β̂0 + β̂1 X + β̂2 Z + β̂3 XZ
For the group with Z=0: ŷ = β̂0 + β̂1 X
For the group with Z=1: ŷ = β̂0 + β̂1 X + β̂2 + β̂3 X = (β̂0 + β̂2) + (β̂1 + β̂3) X
Therefore, analysing whether an interaction exists amounts to estimating this regression
model and checking whether the interaction parameter is significant (abs. value of t-statistic > 2).
Slide number: 68
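In practice, the interaction enters the regression as one more column, the product of X and Z. The sketch below builds that column and checks that the slope differs between the two groups. The coefficients are hypothetical, and `design_row` and `predict` are names invented for the illustration.

```python
def design_row(x, z):
    """Regressors for one observation: constant, X, Z and the product XZ."""
    return [1.0, x, float(z), x * z]


def predict(row, coefficients):
    """Fitted value: sum of regressor * coefficient."""
    return sum(value * beta for value, beta in zip(row, coefficients))


coefficients = [1.0, 2.0, 0.5, -1.0]   # hypothetical b0, b1, b2, b3

# Slope within each group = change in the prediction when X increases by 1.
slope_z0 = predict(design_row(4.0, 0), coefficients) - predict(design_row(3.0, 0), coefficients)
slope_z1 = predict(design_row(4.0, 1), coefficients) - predict(design_row(3.0, 1), coefficients)
print(slope_z0, slope_z1)
```

For Z=0 the slope is b1, and for Z=1 it is b1 + b3, so a non-zero interaction coefficient is what makes the two lines non-parallel.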
Example: sales of companies in the service sector in Madrid as a
function of their investment in research and development (R&D)
[Figure: plot of ventas (sales, axis from 0 to 240) vs id (R&D investment, axis from 0 to 3 x 1000)]
[Figure: plot of log(ventas) (axis from 2.7 to 5.7) vs log(id) (axis from 3.1 to 8.1)]
log(ventas) = 1.762 + 0.393 log(id)
t-statistics: 7.88 (constant), 10.34 (log(id))
R2 = 45.7%
Slide number: 69
Example: sales of companies in the service sector in Madrid as a
function of their investment in research and development (R&D)
We want to study whether being in the telecommunications
sector makes a difference.
TELECO=1 : if in telecom sector
TELECO=0 : if not in telecom sector
log(ventas) = 2.25 + 0.288 log(id) + 0.527 TELECO
t-statistics: 11.12 (constant), 8.08 (log(id)), 7.03 (TELECO)
R2 = 61.05%
• If the company is in the telecom sector:
log(ventas) = 2.78 + 0.288 log(id)
• If it is in another sector:
log(ventas) = 2.25 + 0.288 log(id)
We estimate the interaction:
log(ventas) = 1.99 + 0.334 log(id) + 1.80 TELECO - 0.202 TELECO x log(id)
t-statistics: 8.84 (constant), 8.40 (log(id)), 3.40 (TELECO), -2.43 (TELECO x log(id))
R2 = 62.8%
• If the company is in the telecom sector:
log(ventas) = 3.79 + 0.13 log(id)
• If it is in another sector:
log(ventas) = 1.99 + 0.334 log(id)
Slide number: 70
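The group-specific lines on this slide follow from the fitted coefficients by simple arithmetic, which can be checked as below. The coefficients are those reported on the slide. The function `group_line` is a helper written for this sketch.

```python
def group_line(b_const, b_logid, b_teleco, b_interaction, teleco):
    """Intercept and slope of the log(ventas) vs log(id) line for one group."""
    intercept = b_const + b_teleco * teleco
    slope = b_logid + b_interaction * teleco
    return intercept, slope


# Model with the dummy only (parallel lines): no interaction term.
parallel = {t: group_line(2.25, 0.288, 0.527, 0.0, t) for t in (0, 1)}
# Model with the interaction TELECO x log(id).
interaction = {t: group_line(1.99, 0.334, 1.80, -0.202, t) for t in (0, 1)}

for name, model in (("parallel", parallel), ("interaction", interaction)):
    for teleco, (intercept, slope) in model.items():
        print(name, teleco, round(intercept, 3), round(slope, 3))
```

In the interaction model the telecom line has a higher intercept but a smaller slope, so the two lines are no longer parallel.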