Grado en Ingeniería. Course: Statistics. Topic 4: Regression

Regression
- Introduction. Non-deterministic relationships
- Simple Linear Regression Model
  - Estimation
  - Diagnosis / Inference
- Multiple Regression
  - Multiple dispersion graphs
  - Estimation
  - Multicollinearity
  - Dummy variables / Interactions

Objectives
- To know how to analyse the relationship between variables using a linear regression model that describes the influence of a variable X on another variable Y.
- To know how to obtain point estimates of the parameters of this model.
- To know how to construct confidence intervals and carry out hypothesis tests about these parameters.
- To know how to estimate the mean of Y for a given value of X.
- To know how to predict future values of the variable Y.

Relationships between variables
Regression studies the relationship between variables. What types of relationship can exist?
- Deterministic relationship (exact)
- Non-deterministic relationship (not exact)

Deterministic relationships
We call a relationship between two variables deterministic when knowing the value of one of the variables lets us know the value of the other EXACTLY.
This corresponds to an exact mathematical relationship, a function: Y = f(x)

Non-deterministic relationships
The relationship between the two variables is not exact: knowing the value of one does not allow us to know the exact value of the other.
We know that a relationship exists between the variables, but it isn't exact!

Regression
What does regression do?
It creates a linear model to approximate the relationship between the variables.
The relationship isn't exact, and the model is not exact, but it is very useful!

Regression: residuals
- If the relationship is not exact, then we will always commit an error: e = residual.
- The distance between each point (real data) and the regression line is the part that the model cannot predict.
- We estimate the regression line so that the errors we commit are minimised (criterion: least squares), with the mean error equal to zero.

How is the regression line calculated?
[Formula not recovered from the slide; the only surviving label is "gradient", the slope of the line.]

How do we term the variables?
- X: the independent or explanatory variable; the value that we know.
- Y: the dependent or response variable (the one to be explained); what we want to predict.

Regression: an example
We will analyse the relationship between the production cost of a process and the number of pieces produced.
- Y = production cost
- X = number of pieces produced
[Figure: scatter plot of production cost (Y) against number of pieces produced (X)]
We calculate the regression line using Statgraphics:

production cost = 0.783429 + 0.669509 × pieces produced

According to this line, a factory that produces a million units will have a production cost of:

production cost = 0.783429 + 0.669509 × 1 ≈ 1.45 million €

Will all the factories with this volume of production have the same cost?
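The least-squares criterion mentioned above can be sketched in a few lines. This is a minimal illustration, not the Statgraphics procedure: the closed-form estimates (slope = Sxy / Sxx, intercept = mean(y) − slope × mean(x)) are the standard ones for a line with an intercept, and the data are hypothetical, not the points behind the slides' plot.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for the line y = b0 + b1*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx              # gradient (slope)
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1

# Hypothetical data (assumption): NOT the data used in the slides.
pieces = [2.1, 2.4, 2.7, 3.0, 3.3, 3.6, 3.9]
cost = [2.2, 2.3, 2.7, 2.8, 2.9, 3.3, 3.4]

b0, b1 = fit_line(pieces, cost)

# Residual = observed value minus value predicted by the line.
residuals = [y - (b0 + b1 * x) for x, y in zip(pieces, cost)]
mean_residual = mean(residuals)   # zero up to rounding error

# Prediction with the line reported on the slides, for X = 1.
slide_prediction = 0.783429 + 0.669509 * 1
print(round(slide_prediction, 2))
```

The residuals of a least-squares line with an intercept always average to zero, which is the "mean error is zero" property stated on the residuals slide.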
Regression: an example
Will all the factories with this volume of production have the same cost?
[Figure: scatter plot of production cost (Y) against number of pieces produced (X)]
- There is a range of production costs, from 2.8 to 4.8 million €.
- Specifically, for factory A: production cost = 1.66 million €
- But the model says: production cost = 0.783429 + 0.669509 × 1 ≈ 1.45 million €
- Therefore the error that is committed (the residual) is 1.66 − 1.45 ≈ 0.2 million €

Assumptions of the model
Can we apply the regression model to all types of data? No.
For the conclusions we draw from our models to be correct, the data we use must satisfy the following properties:
1. Linearity
2. Homoscedasticity (homocedasticidad)
3. Independence
4. Normality

Linearity
This is a fundamental assumption: the data must follow a linear trend and be highly correlated.

Linearity: what happens if the data are not linear?
- The regression will not correctly represent the relationship between the variables.
- If the data are not linear, we can look for a mathematical transformation (e.g. log, sqrt) that improves the linearity.

Homoscedasticity
This assumption means that the data have constant variance, with a graph of the following type:
[Figure not recovered]
- When the variance of the data is constant we say that the data are HOMOSCEDASTIC.
- What happens if the data are not homoscedastic?

Homoscedasticity: heteroscedastic data
When the variance is not constant (it grows with the independent variable) we say the data are HETEROSCEDASTIC.
How does this affect the regression?
[Figure: scatter plot of Gastos (expenses, ×1,000,000) against Ingresos (income, ×100,000)]
- The prediction errors grow with the value of the variables!
- We shouldn't apply regression directly to such heteroscedastic data. We have to transform the data first, using logarithms (LOG).

Testing for linearity and homoscedasticity
We test the assumptions of linearity and homoscedasticity by a graphical analysis of the data (scatter plot / X-Y plot).
[Figure: scatter plot of the production cost data]
If the data satisfy these assumptions, we can continue with the analysis.

Independence
The data that we analyse must be mutually independent (each datum from every other):
- If we analyse production cost against production volume for different factories, we assume that the data from one factory do not affect the data from another.
- You CANNOT use this regression analysis on data from a time series, as each datum depends on previous data.

Normally distributed
The last assumption that the model requires is that the data analysed are normally distributed. What does this mean?
[Figure: scatter plot of the production cost data]
- We have said that, for each value of X, Y can take values in a certain range.
- We assume that the values of Y for each value of X follow a normal distribution.

The model
If the data satisfy the four assumptions discussed, we can fit the model to them:

production cost = 0.783429 + 0.669509 × pieces produced
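The remark about transformations (log, sqrt) can be illustrated numerically. The sketch below assumes hypothetical data that follow a power law exactly, so that the relationship is curved on the original scale and exactly linear after taking logarithms of both variables; the helper name `correlation` is illustrative.

```python
import math
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical curved data (assumption): y = 2 * x**1.5, with no noise.
xs = [1, 2, 4, 8, 16, 32, 64]
ys = [2 * x ** 1.5 for x in xs]

r_raw = correlation(xs, ys)                       # below 1: not a straight line
r_log = correlation([math.log(x) for x in xs],
                    [math.log(y) for y in ys])    # essentially 1: a straight line

print(r_raw < r_log)
```

With real data the improvement is rarely this clean, so the scatter plot of the transformed variables should still be inspected, as the slides recommend.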
The model
production cost = 0.783429 + 0.669509 × pieces produced, where β0 = 0.783429 and β1 = 0.669509.
- β0 is the value of Y when X has value 0 (not always a feasible condition).
- The sign of β1 gives the direction of the relationship: a "+" sign indicates the two variables grow together; a "−" sign indicates one variable grows as the other decreases.
- β1 also tells us how Y changes with changes in X: ∆Y = β1 ∆X
Therefore, in our previous example, how much will the production cost increase if the number of pieces produced increases by one million?
∆(production cost) = 0.669509 × ∆(pieces produced) ≈ 0.67 million €

Regression: a problem…
In regression we start with a data sample, and from that sample we estimate the model:
[Figure: scatter plot of the production cost data]
production cost = 0.783429 + 0.669509 × pieces produced

If we change the data sample, the parameters of the model (the numbers that we have calculated) will change.
Is it possible to select a sample that would give us the following result?
[Figure: scatter plot of such a sample]
If this happens, the gradient of the line, β1, is ZERO and we say that THE REGRESSION IS NOT SIGNIFICANT.

- We want to be sure that our regression is valid independently of the sample considered.
- We want to be sure that the regression is valid for the whole population studied, and not just for one specific sample.
- WE WANT TO BE CONFIDENT THAT β1 IS NOT EQUAL TO ZERO.

Inferences about the regression
To analyse whether β1 is zero we have three tools:
- Confidence intervals
- Hypothesis tests, using the t-statistic
- Hypothesis tests, using the p-value
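The sampling problem described above can be illustrated by simulation. This is a minimal sketch under stated assumptions: a hypothetical population in which the true slope is exactly zero, with an arbitrary sample size, noise level and range of X that are not taken from the slides.

```python
import random
from statistics import mean

def slope(xs, ys):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

random.seed(0)  # fixed seed so that the run is reproducible

def sample_slope(true_slope, n=20):
    """Draw one hypothetical sample of size n and return its estimated slope."""
    xs = [random.uniform(-2, 3) for _ in range(n)]
    ys = [true_slope * x + random.gauss(0, 1) for x in xs]
    return slope(xs, ys)

# 1000 samples from a population where X has no influence on Y.
slopes = [sample_slope(0.0) for _ in range(1000)]

print(min(slopes) < 0 < max(slopes))
```

Every sample gives a different estimate, some positive and some negative, even though the true slope is zero. This is why a single non-zero estimate is not enough, and why the inference tools listed above are needed.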
Confidence intervals
We calculate an interval that contains the true value of β1 with a given confidence level (generally 95%):

( β̂1 − 2×SE(β̂1) ; β̂1 + 2×SE(β̂1) )

where β̂1 is the estimate of β1 and SE(β̂1) is its standard error.
If the value "0" does not belong to the interval, the parameter is SIGNIFICANT!

Confidence intervals: example
production cost = 0.783429 + 0.669509 × pieces produced
( β̂1 − 2×SE(β̂1) ; β̂1 + 2×SE(β̂1) ) = (0.67 − 2×0.07 ; 0.67 + 2×0.07) = (0.53 ; 0.81)
"0" does not belong to the interval => the parameter is significant!

Hypothesis test
An alternative way of checking that β1 is not zero is to set up a hypothesis test in the standard form:
H0: β1 = 0
H1: β1 ≠ 0
Statgraphics gives us the p-value for this test directly:
p < 0.05 => we reject H0 => the regression is significant

Hypothesis test: t-statistic
As an alternative to the p-value, we can resolve the same hypothesis test with the t-statistic, t = β̂1 / SE(β̂1):
H0: β1 = 0
H1: β1 ≠ 0
- |t| > 2: we reject H0; the regression is significant
- |t| < 2: we do not reject H0

How good is the model? R²
The coefficient R² (R-squared) indicates what proportion of the variability of Y is explained by X (using the model).
Example: R² = 71.76%
In simple regression, R² = r² (the squared correlation coefficient).
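The interval and the t-statistic rule of thumb above can be written as two small functions. This is a sketch using the rounded values from the slides (estimate 0.67, standard error 0.07); the function names are illustrative, and the factor 2 is the slides' approximation to the 95% critical value.

```python
def approx_interval(estimate, std_error, k=2.0):
    """Approximate 95% confidence interval: estimate +/- k standard errors."""
    return estimate - k * std_error, estimate + k * std_error

def t_statistic(estimate, std_error):
    """t-statistic for the null hypothesis that the parameter is zero."""
    return estimate / std_error

def is_significant(estimate, std_error, k=2.0):
    """True when 0 lies outside the interval, which is the same as |t| > k."""
    low, high = approx_interval(estimate, std_error, k)
    return not (low <= 0.0 <= high)

# Rounded values from the production cost example.
low, high = approx_interval(0.67, 0.07)
t = t_statistic(0.67, 0.07)

print(round(low, 2), round(high, 2), round(t, 2))  # 0.53 0.81 9.57
print(is_significant(0.67, 0.07))                  # True
```

The two tools always agree: zero falls outside the interval exactly when the absolute value of t exceeds 2.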
Summary
- We study the data and see whether the assumptions are satisfied.
- If not, we transform the data using mathematical functions.
- We fit the model.
- We use confidence intervals and hypothesis tests to see whether X is significant for Y (does X influence Y?).

Diagnostics
The decisions that we can take thanks to the information given by a regression model are important, so we need to be sure that our conclusions are correct. For this we use:
- Inference: tests, confidence intervals, …
- Diagnostics: to test once more whether the assumptions made remain valid.
In the diagnosis of the model, we check that the random part of the model (the residuals) does not contain any additional information and does not show that the assumptions (linearity, homoscedasticity, independence, normality) are invalid.

Diagnostics
The diagnosis is performed by visual inspection of the residual plots. They should have the following general appearance:
[Figure not recovered]

Diagnostics
We cannot accept residuals that display other types of behaviour:
[Figure: two residual plots showing unacceptable behaviour]
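The slides diagnose the model by looking at residual plots. As a rough numerical complement (an assumption of this sketch, not a method from the slides), one can compare the spread of the residuals for small and large values of X: a ratio far from 1 suggests heteroscedasticity. The data below are hypothetical and deliberately artificial.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = b0 + b1*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def spread_ratio(xs, ys):
    """Residual spread for the upper half of X divided by that of the lower half."""
    b0, b1 = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))                       # order observations by X
    residuals = [y - (b0 + b1 * x) for x, y in pairs]
    half = len(residuals) // 2
    return pstdev(residuals[half:]) / pstdev(residuals[:half])

# Hypothetical data (assumption): alternating errors around the line y = 2x.
xs = list(range(1, 21))
signs = [1 if i % 2 == 0 else -1 for i in range(20)]
constant_spread = [2 * x + 0.5 * s for x, s in zip(xs, signs)]      # same error size everywhere
growing_spread = [2 * x + 0.1 * x * s for x, s in zip(xs, signs)]   # error grows with X

ratio_constant = spread_ratio(xs, constant_spread)
ratio_growing = spread_ratio(xs, growing_spread)
print(ratio_constant < ratio_growing)
```

This crude check does not replace the plots: it says nothing about curvature, dependence between observations or normality.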
Multiple Regression
In a multiple regression model, we want to know the value of a response variable that results from more than one explanatory variable:

Y = β0 + β1 X1 + β2 X2 + … + βk Xk + e

In this expression, each of the β coefficients represents the individual influence that the corresponding X variable has on Y.
Advantages:
- The assumptions of the model are the same as for simple regression.
- So are the hypothesis tests, diagnosis, etc.
Slight inconveniences:
- The visualisation of the graphs is slightly more complicated.
- We need to redefine the R² coefficient.

Multiple Regression: graphs
Each cell of the graph represents the pairwise relationship between two variables.
[Figure: scatter plot matrix of TOT_COST, UDS, MANPOWER, ENERGY, INVEST, MAINT, MAT and ENV]

Multiple Regression: adjusted R² value
The R² coefficient increases as the number of variables in the model increases (whether they are significant or not). To compensate for this effect, in multiple regression we use the corrected (or adjusted) R² value.

Dependent variable: log(TOT_COST)
-----------------------------------------------------------------------------
Parameter        Estimate      Standard Error   T Statistic   P-Value
-----------------------------------------------------------------------------
CONSTANT         -1.82352      0.313487         -5.81689      0.0000
log(UDS)          0.666417     0.116524          5.71913      0.0000
log(MANPOWER)     0.157212     0.0551564         2.85029      0.0052
log(ENERGY)       0.174001     0.0489637         3.55367      0.0005
log(INVEST)       0.216335     0.0365883         5.91267      0.0000
log(MAINT)       -0.0199751    0.0594171        -0.336185     0.7373
log(MAT)          0.139431     0.0221418         6.2972       0.0000
log(ENV)          0.0027926    0.0178724         0.156252     0.8761
-----------------------------------------------------------------------------
Adjusted R² = 81.73%
Example
Number of accidents in Spanish provinces as a function of the number of registered vehicles.
[Figure: scatter plot of nacciden (×1000) against matricul (×1000)]

Dependent variable: nacciden
-----------------------------------------------------------------------------
Parameter        Estimate      Standard Error   T Statistic   P-Value
-----------------------------------------------------------------------------
CONSTANT         278.24        102.518           2.71406      0.0265
matricul           0.0993373     0.00850344     11.682        0.0000
-----------------------------------------------------------------------------
R-squared (adjusted for d.f.) = 93.7703 percent

Example
Number of accidents in Spanish provinces as a function of the number of driving licences.
[Figure: scatter plot of nacciden (×1000) against permisos (×1000)]

Dependent variable: nacciden
-----------------------------------------------------------------------------
Parameter        Estimate      Standard Error   T Statistic   P-Value
-----------------------------------------------------------------------------
CONSTANT         216.481       127.099           1.70325      0.1269
permisos           0.107617      0.0109657       9.81395      0.0000
-----------------------------------------------------------------------------
R-squared (adjusted for d.f.) = 91.3722 percent
Regressions
Accid = 278.2 + 0.1 Matriculas   (t-statistic = 11.68)
Accid = 216.4 + 0.1 Permisos     (t-statistic = 9.81)

Regression for both variables
Dependent variable: nacciden
-----------------------------------------------------------------------------
Parameter        Estimate      Standard Error   T Statistic   P-Value
-----------------------------------------------------------------------------
CONSTANT         250.63        113.216           2.21373      0.0625
matricul           0.0725492     0.0395634       1.83374      0.1093
permisos           0.0301069     0.043353        0.694461     0.5098
-----------------------------------------------------------------------------

Regressions
Accid = 278.2 + 0.1 Matriculas                  (t-statistic: 11.68)
Accid = 216.4 + 0.1 Permisos                    (t-statistic: 9.81)
Accid = 250 + 0.07 Matriculas + 0.03 Permisos   (t-statistics: 1.8 and 0.69)

What's happening?
[Figure: scatter plot of matricul (×1000) against permisos (×1000); correlation = 0.975]
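The effect of the 0.975 correlation can be quantified with the variance inflation factor (VIF). The VIF is not mentioned in the slides; it is brought in here as a standard diagnostic. For a model with exactly two explanatory variables it equals 1 / (1 − r²), where r is the correlation between them, and its square root approximates how much the standard error of each coefficient grows.

```python
import math

def vif_two_predictors(r):
    """Variance inflation factor when a model has exactly two predictors
    whose mutual correlation is r."""
    return 1.0 / (1.0 - r ** 2)

# Correlation between matricul and permisos reported on the slide.
vif = vif_two_predictors(0.975)
se_inflation = math.sqrt(vif)

print(round(vif, 2), round(se_inflation, 1))  # 20.25 4.5
```

This is of the same order as the change seen in the tables, where the standard error of matricul rises from 0.0085 in the simple regression to 0.0396 in the joint one. The match is only approximate, because the residual variance and the degrees of freedom also change between the two fits.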
Regression: a problem
- Sometimes the independent variables are very similar: they contain the same information.
- The model cannot distinguish between the two variables.
[Figure: diagram of the independent variables and the dependent variable]

In our example:
[Figure: diagram relating Registered cars and Driving licences to Num Accid]
- Both are too similar to distinguish between them.
- The solution? Eliminate one of the variables. We lose almost no information.

The problem of (multi)collinearity
- It appears frequently in statistics: we tend to measure one thing in many ways.
- It is detected when:
  - in simple regression, the variables are significant;
  - on introducing new variables, these variables stop being significant.

A weight–height study
Is the relation the same for women and men?
[Figure: plot of weight against height]
[Figure: two plots of weight against height]

A weight–height study
If the relation is not the same, we could commit serious errors:
[Figure: two plots of weight against height]

Examples
Variable Y                       Variable X               Group that could influence
------------------------------------------------------------------------------------------------
Weight                           Height                   Sex: male or female
Consumption of a worker          Earnings of the worker   Labour status: unemployed or employed
Automobile consumption           Power / engine size      Engine type: diesel or petrol
Profit margin of a bank branch   Bank charges             Branch: urban or rural

It is necessary to introduce a group
In this case:
- we define a variable Z that takes the following values:
  Zi = 0 if the observation belongs to group A
  Zi = 1 if the observation belongs to group B
- and we estimate the following regression model:
  ŷ = β̂0 + β̂1 X + β̂2 Z

The model is estimated:
ŷ = β̂0 + β̂1 X + β̂2 Z
- Women are assigned Z = 0, so that: ŷ = β̂0 + β̂1 X
- Men are assigned Z = 1, so that: ŷ = (β̂0 + β̂2) + β̂1 X

Therefore:
[Figure: weight against height with two parallel lines, ŷ = (β̂0 + β̂2) + β̂1 X and ŷ = β̂0 + β̂1 X, separated vertically by β̂2]
The effect is that a man of a given height weighs β̂2 kg more than a woman of the same height.
Or does he? …
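The dummy-variable model above can be estimated by ordinary least squares like any other multiple regression. The sketch below is a minimal pure-Python illustration, not the Statgraphics routine: it solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination. The data are hypothetical and generated without noise from known coefficients, so the fit should recover them.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, ys):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data (assumption): weight = -80 + 0.85*height + 5*Z, no noise.
heights = [160, 165, 170, 175, 180, 185]
z = [0, 1, 0, 1, 0, 1]                      # dummy variable: group A = 0, group B = 1
weights = [-80 + 0.85 * h + 5 * g for h, g in zip(heights, z)]

# Each row of the design matrix is [1, X, Z]; the leading 1 gives the intercept.
rows = [[1.0, float(h), float(g)] for h, g in zip(heights, z)]
b0, b1, b2 = ols(rows, weights)

print(round(b0, 3), round(b1, 3), round(b2, 3))
```

Here b2 is the vertical distance between the two parallel lines. With real data it would come with a standard error, and its t-statistic would decide whether the two groups really differ.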
Let's do it:
Dependent variable: peso
-----------------------------------------------------------------------------
Parameter        Estimate      Standard Error   T Statistic   P-Value
-----------------------------------------------------------------------------
CONSTANT         -77.7888      16.0908          -4.83438      0.0000
altura             0.842013     0.0905752        9.29628      0.0000
sexo              -5.17748      2.20877         -2.34405      0.0208
-----------------------------------------------------------------------------
R-squared = 60.8791 percent
R-squared (adjusted for d.f.) = 60.1927 percent

Sexo = 0: men
Sexo = 1: women
(Note that this coding is the reverse of the Z coding defined earlier, so here the coefficient of sexo is the difference for women.)
Therefore, a man of height 180 will weigh: −78 + 0.84×180 ≈ 73 kilos
… and a woman of the same height will weigh: −78 + 0.84×180 − 5.17 ≈ 68 kilos
The difference is significant because t = −2.34 and its absolute value is > 2.

The result
[Figure: weight against height with two parallel lines, Men and Women, 5 kg apart]

Interactions
We have supposed that the lines are parallel. And if they aren't?
[Figure: Y against X with two lines, A and B]

Including interactions in the model
Modelling an interaction is easy. One has to estimate a regression model between:
- the Y variable
- the X variable
- the Z variable
- the X–Z interaction, which is modelled by the product XZ.

ŷ = β̂0 + β̂1 X + β̂2 Z + β̂3 XZ

For the group with Z = 0: ŷ = β̂0 + β̂1 X
For the group with Z = 1: ŷ = β̂0 + β̂1 X + β̂2 + β̂3 X = (β̂0 + β̂2) + (β̂1 + β̂3) X

Therefore, analysing whether an interaction exists amounts to estimating this regression model and checking whether the interaction parameter β̂3 is significant (absolute value of the t-statistic > 2).
Example
Sales of companies in the service sector in Madrid as a function of their investment in research and development (R&D).
[Figure: plot of ventas against id (×1000)]
[Figure: plot of log(ventas) against log(id)]

LOG(VENTAS) = 1.762 + 0.393 Log(ID)     R² = 45.7%
t-statistics: constant 7.88; Log(ID) 10.34

Example
Sales of companies in the service sector in Madrid as a function of their investment in research and development (R&D).
We want to study whether being in the telecommunications sector makes a difference:
TELECO = 1 if in the telecom sector
TELECO = 0 if not in the telecom sector

LOG(VENTAS) = 2.25 + 0.288 Log(ID) + 0.527 TELECO     R² = 61.05%
t-statistics: constant 11.12; Log(ID) 8.08; TELECO 7.03

- If the company is in the telecom sector: Log(VENTAS) = 2.78 + 0.288 log(ID)
- If it is in another sector: Log(VENTAS) = 2.25 + 0.288 log(ID)

We estimate the interaction:

Log(VENTAS) = 1.99 + 0.334 Log(ID) + 1.80 TELECO − 0.202 TELECO×Log(ID)     R² = 62.8%
t-statistics: constant 8.84; Log(ID) 8.40; TELECO 3.40; TELECO×Log(ID) −2.43

- If the company is in the telecom sector: Log(VENTAS) = 3.8 + 0.13 log(ID)
- If it is in another sector: Log(VENTAS) = 1.99 + 0.334 log(ID)
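The per-sector equations above follow directly from the fitted coefficients: for TELECO = 1 the intercept becomes β̂0 + β̂2 and the slope becomes β̂1 + β̂3. A minimal sketch of that arithmetic, using the coefficients reported in the example (the function name `group_lines` is illustrative):

```python
def group_lines(b0, b1, b2, b3=0.0):
    """Return the (intercept, slope) of the fitted line for Z = 0 and for Z = 1.

    Model: y = b0 + b1*X + b2*Z + b3*X*Z. With b3 = 0 the lines are parallel.
    """
    return {0: (b0, b1), 1: (b0 + b2, b1 + b3)}

# Model with the dummy variable only (parallel lines).
parallel = group_lines(2.25, 0.288, 0.527)

# Model with the TELECO x Log(ID) interaction (different slopes).
interaction = group_lines(1.99, 0.334, 1.80, -0.202)

for z, (intercept, slope) in interaction.items():
    print(z, round(intercept, 2), round(slope, 3))
```

In the interaction model the telecom line starts higher (3.79 against 1.99) but rises more slowly (0.132 against 0.334), which is what the significant negative interaction coefficient expresses.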