For this practical, we will need the GGally
and the lmodel2
libraries. Make sure that they are both installed:
install.packages("GGally")
install.packages("lmodel2")
Now load required libraries:
library(GGally)
library(ggplot2)
library(lmodel2)
library(dplyr)
We have data (x, y), where x is a deterministic variable and y is a random variable that is dependent on x. We fit a linear regression line to the data with slope \(\beta_1\) and intercept \(\beta_0\) by finding the minimum of the RSS:
\(RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_{i} - \beta_0 - \beta_1 x_i)^2\)
The minimum of the RSS is found at:
\(\hat{\beta}_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)
This approach requires that the variance in y is the same for all \(x_i\).
If this is not the case, then we apply weighted regression by minimizing:
\(RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} w_i (y_{i} - \beta_0 - \beta_1 x_i)^2\)
with \(w_i = 1 / \sigma_i^2\)
In R, the slope and intercept can be calculated by the command:
b1 = cov(x,y)/var(x)
b0 = mean(y) - b1*mean(x)
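As a quick check, these two lines give the same values as the coefficients reported by the lm() command introduced below. A minimal sketch, using made-up example vectors x and y:
x = c(1, 2, 3, 4, 5, 6, 7, 8)                      # made-up example data
y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
b1 = cov(x, y) / var(x)                            # slope
b0 = mean(y) - b1 * mean(x)                        # intercept
c(b0, b1)                                          # manual estimates
coef(lm(y ~ x))                                    # same intercept and slope from lm()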
For the general linear regression, we use the lm()
command. The syntax you will need to use looks like:
z = lm(formula, data, weights)
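For example, a weighted fit as described above could be written as follows (just a sketch, assuming a data frame d with columns x and y and a column sigma holding the known measurement errors):
z = lm(y ~ x, data = d, weights = 1/d$sigma^2)   # weights w_i = 1/sigma_i^2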
A summary of what you see when typing ?lm into the console is:
formula - a symbolic description of the model to be fitted, of the form response ~ terms, where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. first + second indicates all the terms in first together with all the terms in second, with duplicates removed.
data - the data frame used as input. The names you specify in formula will correspond to the column names in this data frame.
weights - an optional vector of weights to be used in the fitting process.
Throughout this practical, we’ll use a data frame containing Anscombe’s Quartet as an example. The data consists of four hypothetical collections of data that each have x and y values. The data frame anscombe has eight columns which represent all four of these data sets. For example, the first data set uses the columns x1 and y1, the second uses x2 and y2, etc. You can type ?anscombe into the console to find out more about this.
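To get a first look at its structure, you can print the first few rows yourself:
head(anscombe)   # the full data frame has 11 rows, with columns x1-x4 and y1-y4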
We’ll first look at some basic statistical properties. In the examples below, we use c()
to group together the results into a vector so that they’re easier to see all in a row instead of being printed out on separate lines.
As you can see, the mean of the x values is the same for all the data sets:
c(mean(anscombe$x1), mean(anscombe$x2), mean(anscombe$x3), mean(anscombe$x4))
## [1] 9 9 9 9
As is the variance of x values:
c(var(anscombe$x1), var(anscombe$x2), var(anscombe$x3), var(anscombe$x4))
## [1] 11 11 11 11
The mean of y values is the same to several decimal places:
c(mean(anscombe$y1), mean(anscombe$y2), mean(anscombe$y3), mean(anscombe$y4))
## [1] 7.500909 7.500909 7.500000 7.500909
The variance of y values is also very similar:
c(var(anscombe$y1), var(anscombe$y2), var(anscombe$y3), var(anscombe$y4))
## [1] 4.127269 4.127629 4.122620 4.123249
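The correlation between x and y is also nearly identical across the four data sets, which you can check in the same way:
c(cor(anscombe$x1, anscombe$y1), cor(anscombe$x2, anscombe$y2),
  cor(anscombe$x3, anscombe$y3), cor(anscombe$x4, anscombe$y4))   # all approximately 0.82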
However, when we plot the different data sets, we see that they all look quite different.
For the example below, we’ll use the x and y values from the first data set in Anscombe’s Quartet.
anscombe1 = data.frame(x = anscombe$x1,
y = anscombe$y1)
Now we perform a linear regression on the data:
z = lm(y~x, data=anscombe1)
data=anscombe1 means that we want to use the anscombe1 data frame that we just created.
y~x says that we’re trying to predict a response (the value y) given the value of a term (x). Here x and y correspond to the actual column names of the data frame. If we were to use a different data frame (like mtcars) we would update the formula (for example, to predict miles per gallon given horsepower, we would use mpg~hp).
Using summary we can get an overview of the results of the regression:
summary(z)
##
## Call:
## lm(formula = y ~ x, data = anscombe1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## x 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
For some data sets, you will also have information available on the error for particular values. Anscombe’s quartet doesn’t contain this information, but for the sake of demonstrating the R syntax to create a plot with error bars, we add a y_error
column using random values.
anscombe1$y_error = runif(nrow(anscombe1))
Now we make a scatter plot using geom_point, and we also add geom_errorbar to draw error bars on top of those points. The code ymin = y - y_error
and ymax = y + y_error
shows how we use the values from the y
and y_error
columns to calculate the top and bottom of the error bars.
ggplot(anscombe1, aes(x=x, y=y)) + geom_point() +
geom_errorbar(aes(ymin = y - y_error,
ymax = y + y_error,
width=0.3)) # width changes the width of the horizontal bar at the top of the error bar
In this example, the error bars are “mirrored” around the value of y. Since we specify the value of ymin independently of ymax, it’s possible to use a different formula and/or column so that the observed value of y does not lie directly in the middle of the error bars but is skewed to one side.
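For instance, a sketch of skewed error bars that extend twice as far above each point as below it (reusing the same made-up y_error column):
ggplot(anscombe1, aes(x=x, y=y)) + geom_point() +
  geom_errorbar(aes(ymin = y - y_error,
                    ymax = y + 2*y_error,   # bars reach further up than down
                    width=0.3))
In the rest of the practical we stick with the mirrored version shown above.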
We will reuse the same code from above, but will now add in a regression line. To do this, we need to use the coefficients that were found from the linear regression.
z$coefficients
## (Intercept) x
## 3.0000909 0.5000909
As you can see, the first value is the intercept. The second value is labelled x, although it refers to the slope of the regression line. We will use these values directly in the geom_abline
command:
ggplot(anscombe1, aes(x=x, y=y)) + geom_point() +
geom_errorbar(aes(ymin=y - y_error,
ymax=y + y_error,
width=0.3)) +
geom_abline(intercept = z$coefficients[1],
slope = z$coefficients[2])
Alternatively, we can also use geom_smooth
to perform a linear regression, although for the examples in this practical, we’ll use geom_abline
. Using geom_smooth
is a good idea when you have to create a plot with multiple linear regressions that are done on subsets of the data.
ggplot(anscombe1, aes(x=x, y=y)) + geom_point() +
geom_errorbar(aes(ymin=y - y_error,
ymax=y + y_error,
width=0.3)) +
geom_smooth(aes(x=x, y=y), method="lm", formula=y~x, se=FALSE)
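As an illustration of the subsets case mentioned above, here is a sketch using the built-in mtcars data; mapping color to a grouping column makes geom_smooth fit a separate regression line per group (the choice of mpg, hp and cyl here is just for illustration):
ggplot(mtcars, aes(x=hp, y=mpg, color=factor(cyl))) +
  geom_point() +
  geom_smooth(method="lm", formula=y~x, se=FALSE)   # one regression line per value of cyl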
One thing which we’ll come back to later is that the linear regressions for the four data sets in Anscombe’s Quartet are nearly identical, even though visually the data sets look quite different.
For this exercise, download the data Practical6_Data.txt and load it into R.
You should see:
Using summary
on the results, you should see:
##
## Call:
## lm(formula = y_values ~ x_values, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.066718 -0.009398 -0.003966 0.004625 0.082318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01251 0.01131 -1.106 0.285
## x_values 0.88361 0.08043 10.986 7.31e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03105 on 16 degrees of freedom
## Multiple R-squared: 0.8829, Adjusted R-squared: 0.8756
## F-statistic: 120.7 on 1 and 16 DF, p-value: 7.311e-09
For the linear regression, we need to specify the weights as 1/(data1$y_errors^2)
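In other words, the call looks like this (assuming the data was loaded into a data frame named data1 with columns x_values, y_values and y_errors, matching the Call line of the output below):
z = lm(y_values ~ x_values, data = data1, weights = 1/(data1$y_errors^2))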
The summary of the linear regression results should show:
##
## Call:
## lm(formula = y_values ~ x_values, data = data1, weights = 1/(data1$y_errors^2))
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.7018 -0.1766 0.5083 0.8713 1.3991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.007306 0.006987 -1.046 0.311
## x_values 0.623074 0.099661 6.252 1.16e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8809 on 16 degrees of freedom
## Multiple R-squared: 0.7095, Adjusted R-squared: 0.6914
## F-statistic: 39.09 on 1 and 16 DF, p-value: 1.156e-05
Remember that with ggplot, you can append commands like + geom_abline(..., color="blue") + geom_abline(..., color="red").
For plotting residuals against fitted values, we’ll again use the anscombe1
data frame that we created above, using the same code for the linear regression:
z = lm(y~x, anscombe1)
Now we get the fitted values corresponding to each value of x:
fit <- fitted(z)
print(fit)
## 1 2 3 4 5 6 7
## 8.001000 7.000818 9.501273 7.500909 8.501091 10.001364 6.000636
## 8 9 10 11
## 5.000455 9.001182 6.500727 5.500545
We then get the residuals:
res <- residuals(z)
print(res)
## 1 2 3 4 5 6
## 0.03900000 -0.05081818 -1.92127273 1.30909091 -0.17109091 -0.04136364
## 7 8 9 10 11
## 1.23936364 -0.74045455 1.83881818 -1.68072727 0.17945455
Add columns to the anscombe1 data frame for the fit and res values:
anscombe1$fit = fit
anscombe1$res = res
We can see that the fitted values follow the regression line:
ggplot(anscombe1) +
geom_point(aes(x=x, y=y, color="original value"), size=3) +
geom_point(aes(x=x, y=fit, color="fitted value"), size=3) +
geom_abline(intercept = z$coefficients[1], slope=z$coefficients[2])
In the code above, we have the statements color="original value"
and color="fitted value"
. When ggplot sees statements like these within aes(...)
it will automatically assign a color to each of these text values. What is happening is the same as what you would get when using something like aes(..., color=some_column_name)
. In this case, every distinct value of some_column_name
will be assigned a unique color.
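As a sketch of that equivalent form, the same plot could be built from a “long” data frame with an explicit label column (the names anscombe1_long and value_type are made up for this example):
anscombe1_long = data.frame(x = rep(anscombe1$x, 2),
                            value = c(anscombe1$y, anscombe1$fit),
                            value_type = rep(c("original value", "fitted value"),
                                             each = nrow(anscombe1)))
ggplot(anscombe1_long, aes(x=x, y=value, color=value_type)) + geom_point(size=3)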
Show the residuals that were calculated:
ggplot(anscombe1) +
geom_point(aes(x=x, y=res))
Putting this all together, we can see how subtracting the residual from the original y values gives us the fitted values in fit. Note that geom_segment allows us to draw line segments from the points defined at x, y to the points at xend, yend:
ggplot(anscombe1) +
geom_point(aes(x=x, y=y, color="original value"), size=3) +
geom_point(aes(x=x, y=fit, color="fitted value"), size=3) +
geom_segment(aes(x=x, xend=x, y=y, yend=y-res)) +
geom_abline(intercept = z$coefficients[1], slope=z$coefficients[2])
Plot residual against fitted values:
ggplot(anscombe1, aes(x=fit, y=res)) + geom_point() + ggtitle("residual vs. fitted values")
Perform a qqnorm plot of residuals to see if they are normally distributed:
qqnorm(res)
qqline(res)
Make the plots of fitted values vs residuals for the data from Practical6_Data.txt
Perform a linear regression without weights:
Make a qqnorm plot of the residuals:
Linear regression with weights (performed same as earlier in the practical):
Make a qqnorm plot of the residuals (using the weighted model):
The Cook’s distance of a point \(X_i\) is a measure of the difference in regression parameters when the point is included or omitted. If omitting a point changes the regression parameters strongly, this point has a large Cook’s distance.
To see quantitatively if the Cook’s distance of a data point is too large and it has an undue influence on the fit, you need to compare it with a theoretical cut-off. This cut-off value can be calculated with the F-statistic using qf(0.5, p, n-p, lower.tail = FALSE) where:
p - the number of fitted parameters (= 2 for simple linear regression)
n - the number of data points
Below we use the anscombe1 data frame again.
#Cook's distance
z = lm(y~x, data=anscombe1)
cook <- cooks.distance(z)
# put the fitted values and Cook's distance into a data frame
data = data.frame(fit = fitted(z),
cooks_distance = cook)
#cut-off value
cut <- qf(0.5, 2, length(data$fit)-2, lower.tail = FALSE)
ggplot(data, aes(x=fit, y=cooks_distance)) + geom_point() + geom_hline(yintercept=cut)
If we look at all four data sets in Anscombe’s Quartet, there are quite different outcomes when plotting the Cook’s distance for each.
For the unweighted and weighted fits, compare the Cook’s distance:
Unweighted fit:
Weighted fit:
We can create a series of diagnostic plots for the linear regression results by simply using plot(z)
(where z
is the output of a linear regression) instead of ggplot
. What’s happening here is that R detects that z
is the result of a linear regression, and whenever it sees this, it will create a series of four plots: residuals vs. fitted values, a normal Q-Q plot of the residuals, a scale-location plot, and residuals vs. leverage (which includes Cook’s distance contours).
Here we create the diagnostic plots for the linear regression of the anscombe1
data frame:
z = lm(y~x, data=anscombe1)
plot(z)
Create diagnostic plots for both the unweighted and weighted regressions that you just did.
Reduced major axis regression minimizes the area of the triangles formed by the observations and the line:
model <-lmodel2(y~x, data = dataframe, range.y = "interval",
range.x = "interval", nperm = 100)
Arguments:
formula - A formula specifying the bivariate model, as in lm.
data - A data frame containing the two variables specified in the formula.
range.y, range.x - Parameters for ranged major axis regression (RMA). If range.y = NULL and range.x = NULL, RMA will not be computed. If only one of them is NULL, the program will stop. If range.y = "relative": variable y has a true zero (relative-scale variable). If range.y = "interval": variable y possibly includes negative values (interval-scale variable). If range.x = "relative": variable x has a true zero (relative-scale variable). If range.x = "interval": variable x possibly includes negative values (interval-scale variable).
nperm - Number of permutations for the tests. If nperm = 0, tests will not be computed.
Here we again use the anscombe1 data frame:
library(lmodel2) # make sure that this library is loaded in order to use the function below
model <-lmodel2(y~x, data = anscombe1, range.y = "interval",
range.x = "interval", nperm = 100)
From the regression results, we want to retrieve the Intercept
and Slope
values for the RMA method. Below we can see the results that we have to search through.
model$regression.results
## Method Intercept Slope Angle (degrees) P-perm (1-tailed)
## 1 OLS 3.000091 0.5000909 26.56922 0.00990099
## 2 MA 2.511327 0.5543980 29.00390 0.00990099
## 3 SMA 1.988042 0.6125408 31.48917 NA
## 4 RMA 2.075872 0.6027819 31.08081 0.00990099
There are two ways in which you can extract these values. First, since model$regression.results
is a data frame (which you can check by typing class(model$regression.results)
), you can use indices to directly extract data from a cell at a specific row and column number:
intercept <- model$regression.results[4,2]
slope <- model$regression.results[4,3]
Another way is to use dplyr
as discussed in the previous practical. Here we filter
the data frame model$regression.results
for all rows where Method == "RMA"
, and then we select
the value in the Intercept
column. This will give us a data frame with only one row and one column. Since we want the number in the data frame, and not the data frame itself, we need to add the extra step of %>% unlist()
.
intercept = model$regression.results %>% filter(Method == "RMA") %>% select(Intercept) %>% unlist()
slope = model$regression.results %>% filter(Method == "RMA") %>% select(Slope) %>% unlist()
Make a summary of the regression model:
summary(model)
## Length Class Mode
## y 11 -none- numeric
## x 11 -none- numeric
## regression.results 5 data.frame list
## confidence.intervals 5 data.frame list
## eigenvalues 2 -none- numeric
## H 1 -none- numeric
## n 1 -none- numeric
## r 1 -none- numeric
## rsquare 1 -none- numeric
## P.param 1 -none- numeric
## theta 1 -none- numeric
## nperm 1 -none- numeric
## epsilon 1 -none- numeric
## info.slope 1 -none- numeric
## info.CI 1 -none- numeric
## call 6 -none- call
Plot the results of the regression model:
ggplot(anscombe1, aes(x=x, y=y)) + geom_point() + geom_abline(slope=slope, intercept=intercept)
Make a reduced major axis regression for x and y in Practical6_Data.txt and add the regression line to the plot.
## Length Class Mode
## y 18 -none- numeric
## x 18 -none- numeric
## regression.results 5 data.frame list
## confidence.intervals 5 data.frame list
## eigenvalues 2 -none- numeric
## H 1 -none- numeric
## n 1 -none- numeric
## r 1 -none- numeric
## rsquare 1 -none- numeric
## P.param 1 -none- numeric
## theta 1 -none- numeric
## nperm 1 -none- numeric
## epsilon 1 -none- numeric
## info.slope 1 -none- numeric
## info.CI 1 -none- numeric
## call 6 -none- call
\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \epsilon_i\)
Interpretation of the parameters: each slope \(\beta_j\) is the expected change in \(Y\) for a one-unit increase in \(X_j\), holding the other variables constant, and \(\beta_0\) is the expected value of \(Y\) when all of the \(X_j\) are zero.
An anova test can be done to see if removing one variable significantly changes the fit. If the p-value of the anova test is < 0.05, you should keep the variable (as a rule of thumb).
Say you have several variables (y, x1, x2, x3, …) in the data frame dat:
z = lm(y~ x1 + x2 + x3 + …, data = dat)
A pairs plot is useful for doing an initial exploration of how different variables (i.e. columns in a data frame) may be related to each other. A pairs plot is presented as a matrix containing scatter plots of all the combinations of variables with each other.
To be able to do this, make sure to install and load the GGally
library as described above.
install.packages("GGally")
library(GGally)
As an example, we’ll use a pairs plot on the mtcars
dataset. For this, we only use the mpg
, disp
, hp
and qsec
columns. Looking at the first few rows in these columns, we see:
head(mtcars[,c("mpg", "disp", "hp", "qsec")])
## mpg disp hp qsec
## Mazda RX4 21.0 160 110 16.46
## Mazda RX4 Wag 21.0 160 110 17.02
## Datsun 710 22.8 108 93 18.61
## Hornet 4 Drive 21.4 258 110 19.44
## Hornet Sportabout 18.7 360 175 17.02
## Valiant 18.1 225 105 20.22
The pairs plot for these four columns looks like this:
ggpairs(mtcars[,c("mpg", "disp", "hp", "qsec")])
Reading across the rows and columns we can see which variables are being plotted together. In the first column and second row, we have the disp
vs. the mpg
, directly below that is the hp
vs. the mpg
and so on. On the top right we can see the correlations calculated between the variables. On the diagonal we see a density plot of the variables. For example, most of the values for hp
are around 100, although the distribution is skewed to one side.
Now we perform a linear regression where we try to predict mpg
given the values for disp
, hp
and qsec:
z = lm(mpg~disp + hp + qsec, data=mtcars)
summary(z)
##
## Call:
## lm(formula = mpg ~ disp + hp + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5029 -2.6417 -0.6002 1.9749 7.2263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.622206 9.668300 3.995 0.000426 ***
## disp -0.028468 0.007787 -3.656 0.001049 **
## hp -0.034642 0.017967 -1.928 0.064041 .
## qsec -0.385561 0.468127 -0.824 0.417114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.144 on 28 degrees of freedom
## Multiple R-squared: 0.7542, Adjusted R-squared: 0.7279
## F-statistic: 28.64 on 3 and 28 DF, p-value: 1.118e-08
We can also extract the \(R^2\) value from the summary:
summary_for_z <- summary(z)
print(summary_for_z$r.squared)
## [1] 0.7541953
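The adjusted \(R^2\) is stored in the same summary object, under adj.r.squared:
print(summary_for_z$adj.r.squared)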
Create a series of diagnostic plots:
plot(z)
We can perform an anova test using the following setup (assuming a data frame named dat with columns named y, x1, x2, and x3):
z1 = lm(y~ x1 + x2 + x3, data = dat)
z2 = lm(y~ x1 + x2, data = dat)
anova(z1,z2)
Using the mtcars
example, we can do:
z1 = lm(mpg~disp + hp + qsec, data=mtcars)
z2 = lm(mpg~disp + hp, data=mtcars) # leave out `qsec`
results = anova(z1, z2)
print(results)
## Analysis of Variance Table
##
## Model 1: mpg ~ disp + hp + qsec
## Model 2: mpg ~ disp + hp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 276.79
## 2 29 283.49 -1 -6.7057 0.6784 0.4171
For this we’ll use the stackloss data set which was mentioned in Lecture 8. This data frame comes pre-loaded with R, and you can get more details about it by typing ?stackloss in the console.
## Air.Flow Water.Temp Acid.Conc. stack.loss
## Min. :50.00 Min. :17.0 Min. :72.00 Min. : 7.00
## 1st Qu.:56.00 1st Qu.:18.0 1st Qu.:82.00 1st Qu.:11.00
## Median :58.00 Median :20.0 Median :87.00 Median :15.00
## Mean :60.43 Mean :21.1 Mean :86.29 Mean :17.52
## 3rd Qu.:62.00 3rd Qu.:24.0 3rd Qu.:89.00 3rd Qu.:19.00
## Max. :80.00 Max. :27.0 Max. :93.00 Max. :42.00
For this we are trying to predict stack.loss given all the other variables.
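The call used here (matching the Call line in the output below, with the model again named z1) is:
z1 = lm(stack.loss ~ Acid.Conc. + Air.Flow + Water.Temp, data = stackloss)
summary(z1)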
##
## Call:
## lm(formula = stack.loss ~ Acid.Conc. + Air.Flow + Water.Temp,
## data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2377 -1.7117 -0.4551 2.3614 5.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.9197 11.8960 -3.356 0.00375 **
## Acid.Conc. -0.1521 0.1563 -0.973 0.34405
## Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
## Water.Temp 1.2953 0.3680 3.520 0.00263 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
## F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp, data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5290 -1.7505 0.1894 2.1156 5.6588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.3588 5.1383 -9.801 1.22e-08 ***
## Air.Flow 0.6712 0.1267 5.298 4.90e-05 ***
## Water.Temp 1.2954 0.3675 3.525 0.00242 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.239 on 18 degrees of freedom
## Multiple R-squared: 0.9088, Adjusted R-squared: 0.8986
## F-statistic: 89.64 on 2 and 18 DF, p-value: 4.382e-10
## Analysis of Variance Table
##
## Model 1: stack.loss ~ Acid.Conc. + Air.Flow + Water.Temp
## Model 2: stack.loss ~ Air.Flow + Water.Temp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17 178.83
## 2 18 188.79 -1 -9.9654 0.9473 0.344
##
## Call:
## lm(formula = stack.loss ~ Air.Flow, data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2896 -1.1272 -0.0459 1.1166 8.8728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -44.13202 6.10586 -7.228 7.31e-07 ***
## Air.Flow 1.02031 0.09995 10.208 3.77e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.098 on 19 degrees of freedom
## Multiple R-squared: 0.8458, Adjusted R-squared: 0.8377
## F-statistic: 104.2 on 1 and 19 DF, p-value: 3.774e-09
Perform the anova test:
## Analysis of Variance Table
##
## Model 1: stack.loss ~ Air.Flow + Water.Temp
## Model 2: stack.loss ~ Air.Flow
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 18 188.80
## 2 19 319.12 -1 -130.32 12.425 0.002419 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For this, create a set of linear regressions for combinations of the individual variables:
stack.loss~Air.Flow
Value for \(R^{2}\):
## [1] 0.8457809
stack.loss~Water.Temp
Value for \(R^{2}\):
## [1] 0.766508
Air.Flow~Water.Temp
Value for \(R^{2}\):
## [1] 0.6112931