Uploaded by Marwa Ajouz

Assignment 1

Problem 1:
a- bp Histogram:
Code:
library(alr4)  # package that provides the Forbes data (columns bp, pres, lpres)
data(Forbes)
hist(Forbes$bp,main='bp Histogram',xlab='bp')
The resulting histogram:
Log(pres) Histogram:
Code:
hist(Forbes$lpres,main='Log Pres Histogram',xlab='Log (Pres)')
The resulting histogram:
b- Code:
plot(Forbes$bp,Forbes$lpres,main='logpres Against bp',xlab='bp',ylab='logpress')
The resulting plot:
c- Code used:
model <- lm(lpres ~ bp, data = Forbes)
model
Result:
B0 = -42.14, B1 = 0.8955
Fitted model: lpres = -42.14 + 0.8955 × bp + error
d- Derived Line:
Code:
abline(lm(lpres ~ bp, data = Forbes), col = "blue")
Result plot:
e- For bp = 207, lpres ≈ -42.14 + 0.8955 × 207 = 143.23
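The arithmetic in part e can be reproduced with a minimal sketch that uses the rounded coefficients reported in part c. The back-transformation to pressure assumes that lpres is defined as 100 × log10(pres), as described in the data set's documentation; the variable names (b0, b1, bp_new, lpres_hat, pres_hat) are illustrative only.

```r
# Rounded coefficients as reported in part c
b0 <- -42.14
b1 <- 0.8955

# New boiling point at which to predict
bp_new <- 207

# Point prediction on the lpres scale
lpres_hat <- b0 + b1 * bp_new
print(lpres_hat)

# Assumption: lpres = 100 * log10(pres), so invert that transformation
pres_hat <- 10^(lpres_hat / 100)
print(pres_hat)
```

With the fitted model object available, predict(model, newdata = data.frame(bp = 207)) gives the same kind of prediction from the unrounded coefficients.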
f- Code:
residual <- resid(model)
plot(fitted(model), residual)
abline(0,0)
Resulting Plot:
g- Part a shows which boiling points and pressures are recorded most often. The boiling-point
histogram has one peak, and the most frequent boiling points are 200–205 °F. The most frequent
log(pres) values lie between 135 and 140. Part b shows a positive relation between boiling point
and log(pres): as boiling point increases, log(pres) increases. In part c, the slope (B1) is
positive, as expected from the observation in part b. The intercept is negative; it is the fitted
value at bp = 0, far outside the observed range, so it only indicates that the line predicts low
log(pres) at low boiling points. In part d, few points lie away from the regression line, which
suggests that the model is acceptable. Part e is a prediction computed from the fitted regression
model. In part f, the spread of the residuals is acceptable, so the model does not need to be
changed.
Problem 2:
1- Predictor: ppgdp
Response: fertility
2- After importing the Excel file UN11 into R, the following code was used:
library(alr4)
data(UN11)
plot(UN11$ppgdp,UN11$fertility,main='Scatterplot',xlab='ppgdp',ylab='Fertility')
The resulting scatterplot is as follows:
The above scatterplot shows that fertility varies from about 1 to 7 within the same range of
ppgdp. The variance is not constant and the relationship is not linear. Hence, a straight-line
mean function is not a plausible summary of this graph.
3-
Used Code:
plot(log(UN11$ppgdp),log(UN11$fertility),main='Log Scatterplot',xlab='Log ppgdp',ylab='Log Fertility')
log() computes the natural logarithm by default, so no base argument is needed.
Resulting scatter plot:
Simple linear regression seems a plausible summary of the graph on the log scale, since
the variance is almost constant.
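Why logging both variables can make a straight-line fit reasonable is easiest to see on synthetic data. The sketch below does not use UN11: it simulates a hypothetical power-law relation with multiplicative noise (all names and parameter values are made up for illustration) and shows that a straight-line fit on the log-log scale recovers the exponent.

```r
set.seed(1)

# Synthetic data only (not UN11): a skewed positive predictor
n <- 500
x <- exp(runif(n, log(100), log(1e5)))

# Hypothetical power law y = 5 * x^b_true with multiplicative noise
b_true <- -0.2
y <- 5 * x^b_true * exp(rnorm(n, sd = 0.3))

# On the log-log scale the mean function is a straight line:
# log(y) = log(5) + b_true * log(x) + noise with constant variance
fit_log <- lm(log(y) ~ log(x))
slope_log <- unname(coef(fit_log)[2])
print(slope_log)
```

The estimated slope should be close to b_true, because the multiplicative noise becomes additive with constant variance after taking logs.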
Problem 3:
1- Computation of the mean and variance using R:
Code:
data(wblake)
attach(wblake)
meanLength <- with(wblake, tapply(Length, Age, mean))
print(meanLength)
varLength <- with(wblake, tapply(Length, Age, var))
print(varLength)
The resulting mean values are as follows:
The resulting variance values are as follows:
2- Average Length vs. Age Groups
AgeGroup <- sort(unique(wblake$Age))
plot(AgeGroup,meanLength,main='Average Length vs. Age Groups',xlab='Age Group',ylab='Average Length')
Code to plot the graph showing all the recorded lengths with average line:
plot(wblake$Age,wblake$Length,main='Length vs. Age Groups with Average Line',xlab='Age Group',ylab='Length')
lines(AgeGroup,tapply(wblake$Length,wblake$Age,mean))
Resulting Graph:
Comparison with Figure 1.5:
This graph matches Figure 1.5, except that Figure 1.5 is drawn at a different scale and also shows
the linear regression fit to the data.
3- Standard Deviation vs. Age:
Code:
stdLength <- with(wblake, tapply(Length, Age, sd))
plot(AgeGroup,stdLength,main='Std Length vs. Age Groups',xlab='Age Group',ylab='STD Length')
The plot shows that the standard deviation differs across age groups. The plot is not a
null plot. Hence, the variance function is not constant.
Problem 4:
Code used:
data(water)
# One-sided formula with data = water avoids the need for attach()
pairs(~ Year + BSAAM + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water)
Resulting Plot:
From this scatterplot matrix, it seems that Year would not be used for the prediction, because the
scatterplots do not show a steady relation between Year and water supply.
Other insights from the matrix:
The relation between BSAAM and the O sites (OPBPC, OPRC and OPSLAKE) is stronger than the relation
between BSAAM and the A sites (APMAM, APSAB and APSLAKE). The A sites are more strongly related to
each other than to the O sites. Similarly, the O sites are more strongly related to each other than to
the A sites. Conclusion: the O-site measurements are the more useful indicators for predicting the
runoff BSAAM. Knowing the measurement at one A site could be an indicator for the other A sites.
Finally, knowing the measurement at one O site could be an indicator for the other O sites.
Problem 5:
Example 1: X is uniformly distributed on the interval [-1, 1] (density 1/2) and Y = |X|. In this case, if X < 0, Y = -X
and if X > 0, Y = X. These two RVs are dependent, since Y is a function of X. Let's check for correlation:
Check if Cov [X, Y] = 0, where Cov [X, Y] = E [XY] – E[X] x E[Y]
$E[X] = \int_{-1}^{1} \tfrac{1}{2} x \, dx = \tfrac{1}{4} - \tfrac{1}{4} = 0$ => E[X] x E[Y] = 0
$XY = -X^2$ if $X < 0$ and $XY = X^2$ if $X > 0$
$E[XY] = \int_{-1}^{0} -\tfrac{1}{2} x^2 \, dx + \int_{0}^{1} \tfrac{1}{2} x^2 \, dx = -\tfrac{1}{6} + \tfrac{1}{6} = 0$
Therefore Cov [X, Y] = E [XY] – E[X] x E[Y] = 0 – 0 = 0. Hence X and Y are dependent but uncorrelated.
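Example 1 can be checked numerically. The following is a minimal simulation sketch (the seed, sample size and variable names are arbitrary choices): the sample covariance of X and Y should be close to 0, while the covariance between X² and Y stays clearly positive, which would be impossible if X and Y were independent.

```r
set.seed(42)
n <- 1e5

# X uniform on [-1, 1], Y = |X|
x <- runif(n, min = -1, max = 1)
y <- abs(x)

# Sample covariance of X and Y: close to 0 (uncorrelated)
sample_cov <- cov(x, y)
print(sample_cov)

# Covariance of X^2 and Y: clearly nonzero, so X and Y are dependent
# (theoretical value: E|X|^3 - E[X^2] E|X| = 1/4 - 1/6 = 1/12)
cov_sq <- cov(x^2, y)
print(cov_sq)
```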
Example 2: X is a discrete RV that takes three values: -1, 0 and 1 with P(X=-1) = P(X=0) = P(X=1) = 1/3
And Y = 1 if X = 0, and Y = 0 otherwise. Y is a function of X, so the two RVs are dependent. Let's check for correlation:
Check if Cov [X, Y] = 0, where Cov [X, Y] = E [XY] – E[X] x E[Y]
E[X] = -1 x P (X=-1) + 0 x P (X=0) + 1 x P (X=1) = -1/3 + 0 + 1/3 = 0. Hence, E[X] x E[Y] = 0.
E[XY] is the sum of xy x P(X=x) over x = -1, 0, 1. If X=1 or X=-1, Y = 0. Hence, XY = 0. If X=0, Y = 1 and XY = 0.
Hence, E[XY] = 0 + 0 + 0 = 0
Therefore Cov [X, Y] = E [XY] – E[X] x E[Y] = 0 – 0 = 0. Hence X and Y are dependent but uncorrelated.
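Because Example 2 is discrete, the covariance can be computed exactly from the probability table rather than simulated. A small sketch (variable names are illustrative) that also confirms dependence by comparing P(Y = 1) with P(Y = 1 | X = 0):

```r
# Support and probabilities of X
x <- c(-1, 0, 1)
p <- rep(1/3, 3)

# Y = 1 if X = 0, and 0 otherwise
y <- as.numeric(x == 0)

# Exact expectations from the probability table
EX  <- sum(p * x)
EY  <- sum(p * y)
EXY <- sum(p * x * y)
cov_xy <- EXY - EX * EY
print(cov_xy)

# Dependence: the conditional probability differs from the marginal one
p_y1 <- sum(p[y == 1])
p_y1_given_x0 <- sum(p[x == 0 & y == 1]) / sum(p[x == 0])
print(c(marginal = p_y1, conditional = p_y1_given_x0))
```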
Problem 6:
Formula to Prove: $\mathrm{Var}[Y] = E[\mathrm{Var}[Y|X]] + \mathrm{Var}[E[Y|X]]$
Let $f(X) = E[Y|X]$. By the law of total expectation, $E[E[Y^2|X]] = E[Y^2]$ and $E[f(X)] = E[Y]$.
$E[\mathrm{Var}[Y|X]] = E\{E[Y^2|X] - (E[Y|X])^2\} = E\{E[Y^2|X] - f(X)^2\} = E[Y^2] - E[f(X)^2]$
$\mathrm{Var}[E[Y|X]] = E[f(X)^2] - (E[f(X)])^2 = E[f(X)^2] - (E[Y])^2$
Therefore, $E[\mathrm{Var}[Y|X]] + \mathrm{Var}[E[Y|X]] = E[Y^2] - E[f(X)^2] + E[f(X)^2] - (E[Y])^2 = E[Y^2] - (E[Y])^2 = \mathrm{Var}[Y]$
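The identity can be verified on a small example. The sketch below uses a made-up joint distribution of (X, Y) with four support points (the values and probabilities are arbitrary) and computes both sides exactly.

```r
# Hypothetical joint pmf: each row is one (x, y) pair with its probability
joint <- data.frame(
  x = c(0, 0, 1, 1),
  y = c(1, 3, 2, 6),
  p = c(0.20, 0.20, 0.15, 0.45)
)

# Left-hand side: Var[Y] computed directly
EY <- sum(joint$p * joint$y)
var_direct <- sum(joint$p * joint$y^2) - EY^2

# Marginal pmf of X, then conditional mean and variance of Y given X
px        <- tapply(joint$p, joint$x, sum)
cond_mean <- tapply(joint$p * joint$y, joint$x, sum) / px
cond_var  <- tapply(joint$p * joint$y^2, joint$x, sum) / px - cond_mean^2

# Right-hand side: E[Var[Y|X]] + Var[E[Y|X]]
within  <- sum(px * cond_var)
between <- sum(px * cond_mean^2) - sum(px * cond_mean)^2

print(c(direct = var_direct, decomposed = within + between))
```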
Problem 7:
Case 1: E [e] = 0
$E[e^2] = \mathrm{Var}[e] = E[\mathrm{Var}[e|X]] + \mathrm{Var}[E[e|X]]$
$E[\mathrm{Var}[e|X]] = \sigma^2$ and $\mathrm{Var}[E[e|X]] \ge 0$
Therefore, $E[e^2] \ge \sigma^2$
Case 2: E [e] = c ≠ 0
$E[e^2] = \mathrm{Var}[e] + (E[e])^2 = E[\mathrm{Var}[e|X]] + \mathrm{Var}[E[e|X]] + (E[e])^2 = \sigma^2 + \mathrm{Var}[E[e|X]] + (E[e])^2$
In this case $(E[e])^2 > 0$ and $\mathrm{Var}[E[e|X]] \ge 0$. Therefore, $E[e^2] > \sigma^2$.
Hence $E[e^2] = \sigma^2$ can hold only if E[e] = 0 (and $\mathrm{Var}[E[e|X]] = 0$). Otherwise, if
$(E[e])^2 > 0$, $E[e^2]$ is strictly higher than $\sigma^2$.
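Case 1 can be illustrated with a simulation sketch. The conditional mean function 0.5·X, the seed and the sample size below are arbitrary assumptions chosen for illustration: the errors have E[e] = 0 and Var[e|X] = σ², but E[e|X] is not constant, so E[e²] exceeds σ² by Var[E[e|X]].

```r
set.seed(7)
n <- 2e5
sigma <- 1

# X uniform on [-1, 1]; hypothetical conditional mean E[e|X] = 0.5 * X
x <- runif(n, -1, 1)
m <- 0.5 * x

# Errors with Var[e|X] = sigma^2 and E[e] = 0 overall
e <- m + rnorm(n, sd = sigma)

# Empirical E[e^2] versus sigma^2 + Var[E[e|X]], where
# Var[0.5 * X] = 0.25 * Var[X] = 0.25 * (1/3) = 1/12
mean_e2 <- mean(e^2)
theory  <- sigma^2 + 0.25 / 3
print(c(empirical = mean_e2, theoretical = theory))
```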