For this practical, we will need the GGally
and the lmodel2
libraries. Make sure that they are both installed:
install.packages("GGally")
install.packages("lmodel2")
Now load required libraries:
library(GGally)
library(ggplot2)
library(lmodel2)
library(dplyr)
We have data (x, y), where x is a deterministic variable and y is a random variable that is dependent on x. We fit a linear regression line to the data with slope \(\beta_1\) and intercept \(\beta_0\) by finding the minimum of the RSS:
\(RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_{i} - \beta_0 - \beta_1 x_i)^2\)
The minimum of the RSS is found at:
\(\hat{\beta}_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)
This approach requires that the variance in y is the same for all \(x_i\).
If this is not the case, then we apply weighted regression by minimizing:
\(RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} w_i (y_{i} - \beta_0 - \beta_1 x_i)^2\)
with \(w_i = 1 / \sigma_i^2\)
In R, the slope and intercept can be calculated by the command:
b1 = cov(x,y)/var(x)
b0 = mean(y) - b1*mean(x)
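As a quick check, these two lines give the same values as the coefficients reported by the lm() command introduced below. A minimal sketch, using made-up example vectors x and y:
x = c(1, 2, 3, 4, 5, 6, 7, 8)                      # made-up example data
y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
b1 = cov(x, y) / var(x)                            # slope
b0 = mean(y) - b1 * mean(x)                        # intercept
c(b0, b1)                                          # manual estimates
coef(lm(y ~ x))                                    # same intercept and slope from lm()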
For the general linear regression, we use the lm()
command. The syntax you will need to use looks like:
z = lm(formula, data, weights)
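For example, a weighted fit as described above could be written as follows (just a sketch, assuming a data frame d with columns x and y and a column sigma holding the known measurement errors):
z = lm(y ~ x, data = d, weights = 1/d$sigma^2)   # weights w_i = 1/sigma_i^2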
A summary of what you see when typing ?lm into the console is:
formula - a symbolic description of the model to be fitted, of the form response ~ terms, where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. first + second indicates all the terms in first together with all the terms in second, with duplicates removed.
data - the data frame used as input. The names you specify in formula will correspond to the column names in this data frame.
weights - an optional vector of weights to be used in the fitting process.
Throughout this practical, we’ll use a data frame containing Anscombe’s Quartet as an example. The data consists of four hypothetical collections of data that each have x and y values. The data frame anscombe has eight columns which represent all four of these data sets. For example, the first data set uses the columns x1 and y1, the second uses x2 and y2, etc. You can type ?anscombe into the console to find out more about this.
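To get a first look at its structure, you can print the first few rows yourself:
head(anscombe)   # the full data frame has 11 rows, with columns x1-x4 and y1-y4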
We’ll first look at some basic statistical properties. In the examples below, we use c()
to group together the results into a vector so that they’re easier to see all in a row instead of being printed out on separate lines.
As you can see, the mean of the x values is the same for all the data sets:
c(mean(anscombe$x1), mean(anscombe$x2), mean(anscombe$x3), mean(anscombe$x4))
## [1] 9 9 9 9
As is the variance of x values:
c(var(anscombe$x1), var(anscombe$x2), var(anscombe$x3), var(anscombe$x4))
## [1] 11 11 11 11
The mean of y values is the same to several decimal places:
c(mean(anscombe$y1), mean(anscombe$y2), mean(anscombe$y3), mean(anscombe$y4))
## [1] 7.500909 7.500909 7.500000 7.500909
The variance of y values is also very similar:
c(var(anscombe$y1), var(anscombe$y2), var(anscombe$y3), var(anscombe$y4))
## [1] 4.127269 4.127629 4.122620 4.123249
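The correlation between x and y is also nearly identical across the four data sets, which you can check in the same way:
c(cor(anscombe$x1, anscombe$y1), cor(anscombe$x2, anscombe$y2),
  cor(anscombe$x3, anscombe$y3), cor(anscombe$x4, anscombe$y4))   # all approximately 0.82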
However, when we plot the different data sets, we see that they all look quite different.
For the example below, we’ll use the x and y values from the first data set in Anscombe’s Quartet.
anscombe1 = data.frame(x = anscombe$x1,
y = anscombe$y1)
Now we perform a linear regression on the data:
z = lm(y~x, data=anscombe1)
data=anscombe1 means that we want to use the anscombe1 data frame that we just created.
y~x says that we’re trying to predict a response (the value y) given the value of a term (x). Here x and y correspond to the actual column names of the data frame. If we were to use a different data frame (like mtcars) we would update the formula (for example, to predict miles per gallon given horsepower, we would use mpg~hp).
Using summary we can get an overview of the results of the regression:
summary(z)
##
## Call:
## lm(formula = y ~ x, data = anscombe1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## x 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
For some data sets, you will also have information available on the error for particular values. Anscombe’s quartet doesn’t contain this information, but for the sake of demonstrating the R syntax to create a plot with error bars, we add a y_error
column using random values.
anscombe1$y_error = runif(nrow(anscombe1))
Now we make a scatter plot using geom_point, and we also add geom_errorbar to draw error bars on top of those points. The code ymin = y - y_error
and ymax = y + y_error
shows how we use the values from the y
and y_error
columns to calculate the top and bottom of the error bars.
ggplot(anscombe1, aes(x=x, y=y)) + geom_point() +
geom_errorbar(aes(ymin = y - y_error,
ymax = y + y_error,
width=0.3)) # width changes the width of the horizontal bar at the top of the error bar
In this example, the error bars are “mirrored” around the value of y. Since we specify the value of ymin independently of ymax, it’s possible to use a different formula and/or column so that the observed value of y does not lie directly in the middle of the error bars but is skewed to one side.
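For instance, a sketch of skewed error bars that extend twice as far above each point as below it (reusing the same made-up y_error column):
ggplot(anscombe1, aes(x=x, y=y)) + geom_point() +
  geom_errorbar(aes(ymin = y - y_error,
                    ymax = y + 2*y_error,   # bars reach further up than down
                    width=0.3))
In the rest of the practical we stick with the mirrored version shown above.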
We will reuse the same code from above, but will now add in a regression line. To do this, we need to use the coefficients that were found from the linear regression.
z$coefficients
## (Intercept) x
## 3.0000909 0.5000909
As you can see, the first value is the intercept. The second value is labelled x, although it refers to the slope of the regression line. We will use these values directly in the geom_abline
command:
ggplot(anscombe1, aes(x=x, y=y)) + geom_point() +
geom_errorbar(aes(ymin=y - y_error,
ymax=y + y_error,
width=0.3)) +
geom_abline(intercept = z$coefficients[1],
slope = z$coefficients[2])
Alternatively, we can also use geom_smooth
to perform a linear regression, although for the examples in this practical, we’ll use geom_abline
. Using geom_smooth
is a good idea when you have to create a plot with multiple linear regressions that are done on subsets of the data.
ggplot(anscombe1, aes(x=x, y=y)) + geom_point() +
geom_errorbar(aes(ymin=y - y_error,
ymax=y + y_error,
width=0.3)) +
geom_smooth(aes(x=x, y=y), method="lm", formula=y~x, se=FALSE)
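As an illustration of the subsets case mentioned above, here is a sketch using the built-in mtcars data; mapping color to a grouping column makes geom_smooth fit a separate regression line per group (the choice of mpg, hp and cyl here is just for illustration):
ggplot(mtcars, aes(x=hp, y=mpg, color=factor(cyl))) +
  geom_point() +
  geom_smooth(method="lm", formula=y~x, se=FALSE)   # one regression line per value of cyl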
One thing which we’ll come back to later is that the linear regressions for the four data sets in Anscombe’s Quartet are nearly identical, even though visually the data sets look quite different.
For this exercise, download the data Practical6_Data.txt and load it into R.
You should see:
Using summary
on the results, you should see:
##
## Call:
## lm(formula = y_values ~ x_values, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.066718 -0.009398 -0.003966 0.004625 0.082318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01251 0.01131 -1.106 0.285
## x_values 0.88361 0.08043 10.986 7.31e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03105 on 16 degrees of freedom
## Multiple R-squared: 0.8829, Adjusted R-squared: 0.8756
## F-statistic: 120.7 on 1 and 16 DF, p-value: 7.311e-09
For the linear regression, we need to specify the weights as 1/(data1$y_errors^2)
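In other words, the call looks like this (assuming the data was loaded into a data frame named data1 with columns x_values, y_values and y_errors, matching the Call line of the output below):
z = lm(y_values ~ x_values, data = data1, weights = 1/(data1$y_errors^2))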
The summary of the linear regression results should show:
##
## Call:
## lm(formula = y_values ~ x_values, data = data1, weights = 1/(data1$y_errors^2))
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.7018 -0.1766 0.5083 0.8713 1.3991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.007306 0.006987 -1.046 0.311
## x_values 0.623074 0.099661 6.252 1.16e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8809 on 16 degrees of freedom
## Multiple R-squared: 0.7095, Adjusted R-squared: 0.6914
## F-statistic: 39.09 on 1 and 16 DF, p-value: 1.156e-05
Remember that with ggplot, you can append commands like + geom_abline(..., color="blue") + geom_abline(..., color="red").
For plotting residuals against fitted values, we’ll again use the anscombe1
data frame that we created above, using the same code for the linear regression:
z = lm(y~x, anscombe1)
Now we get the fitted values corresponding to each value of x:
fit <- fitted(z)
print(fit)
## 1 2 3 4 5 6 7
## 8.001000 7.000818 9.501273 7.500909 8.501091 10.001364 6.000636
## 8 9 10 11
## 5.000455 9.001182 6.500727 5.500545
We then get the residuals:
res <- residuals(z)
print(res)
## 1 2 3 4 5 6
## 0.03900000 -0.05081818 -1.92127273 1.30909091 -0.17109091 -0.04136364
## 7 8 9 10 11
## 1.23936364 -0.74045455 1.83881818 -1.68072727 0.17945455
Add columns to the anscombe1 data frame for the fit and res values:
anscombe1$fit = fit
anscombe1$res = res
We can see that the fitted values follow the regression line:
ggplot(anscombe1) +
geom_point(aes(x=x, y=y, color="original value"), size=3) +
geom_point(aes(x=x, y=fit, color="fitted value"), size=3) +
geom_abline(intercept = z$coefficients[1], slope=z$coefficients[2])
In the code above, we have the statements color="original value"
and color="fitted value"
. When ggplot sees statements like these within aes(...)
it will automatically assign a color to each of these text values. What is happening is the same as what you would get when using something like aes(..., color=some_column_name)
. In this case, every distinct value of some_column_name
will be assigned a unique color.
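As a sketch of that equivalent form, the same plot could be built from a “long” data frame with an explicit label column (the names anscombe1_long and value_type are made up for this example):
anscombe1_long = data.frame(x = rep(anscombe1$x, 2),
                            value = c(anscombe1$y, anscombe1$fit),
                            value_type = rep(c("original value", "fitted value"),
                                             each = nrow(anscombe1)))
ggplot(anscombe1_long, aes(x=x, y=value, color=value_type)) + geom_point(size=3)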
Show the residuals that were calculated:
ggplot(anscombe1) +
geom_point(aes(x=x, y=res))
Putting this all together, we can see how subtracting the residual from the original y values gives us the fitted values in fit. Note that geom_segment allows us to draw line segments from the points defined at x, y to the points at xend, yend:
ggplot(anscombe1) +
geom_point(aes(x=x, y=y, color="original value"), size=3) +
geom_point(aes(x=x, y=fit, color="fitted value"), size=3) +
geom_segment(aes(x=x, xend=x, y=y, yend=y-res)) +
geom_abline(intercept = z$coefficients[1], slope=z$coefficients[2])
Plot residual against fitted values:
ggplot(anscombe1, aes(x=fit, y=res)) + geom_point() + ggtitle("residual vs. fitted values")
Perform a qqnorm plot of residuals to see if they are normally distributed:
qqnorm(res)
qqline(res)
Make the plots of fitted values vs residuals for the data from Practical6_Data.txt
Perform a linear regression without weights:
Make a qqnorm plot of the residuals:
Linear regression with weights (performed same as earlier in the practical):
Make a qqnorm plot of the residuals (using the weighted model):
The Cook’s distance of a point \(X_i\) is a measure of the difference in regression parameters when the point is included or omitted. If omitting a point changes the regression parameters strongly, this point has a large Cook’s distance.
To see quantitatively if the Cook’s distance of a data point is too large and it has an undue influence on the fit, you need to compare it with a theoretical cut-off. This cut-off value can be calculated with the F-statistic using qf(0.5, p, n-p, lower.tail = FALSE) where:
p - the number of fitted parameters (= 2 for simple linear regression)
n - the number of data points
Below we use the anscombe1 data frame again.
#Cook's distance
z = lm(y~x, data=anscombe1)
cook <- cooks.distance(z)
# put the fitted values and Cook's distance into a data frame
data = data.frame(fit = fitted(z),
cooks_distance = cook)
#cut-off value
cut <- qf(0.5, 2, length(data$fit)-2, lower.tail = FALSE)
ggplot(data, aes(x=fit, y=cooks_distance)) + geom_point() + geom_hline(yintercept=cut)
If we look at all four data sets in Anscombe’s Quartet, there are quite different outcomes when plotting the Cook’s distance for each.
For the unweighted and weighted fits, compare the Cook’s distance:
Unweighted fit:
Weighted fit:
We can create a series of diagnostic plots for the linear regression results by simply using plot(z)
(where z
is the output of a linear regression) instead of ggplot
. What’s happening here is that R detects that z
is the result of a linear regression, and whenever it sees this, it will create a series of four plots: residuals vs. fitted values, a normal Q-Q plot of the residuals, a scale-location plot, and residuals vs. leverage (which includes Cook’s distance contours).
Here we create the diagnostic plots for the linear regression of the anscombe1
data frame:
z = lm(y~x, data=anscombe1)
plot(z)
Create diagnostic plots for both the unweighted and weighted regressions that you just did.
Reduced major axis regression minimizes the area of the triangles formed by the observations and the line:
model <-lmodel2(y~x, data = dataframe, range.y = "interval",
range.x = "interval", nperm = 100)
Arguments:
formula - A formula specifying the bivariate model, as in lm.
data - A data frame containing the two variables specified in the formula.
range.y, range.x - Parameters for ranged major axis regression (RMA). If range.y = NULL and range.x = NULL, RMA will not be computed. If only one of them is NULL, the program will stop. If range.y = "relative": variable y has a true zero (relative-scale variable). If range.y = "interval": variable y possibly includes negative values (interval-scale variable). If range.x = "relative": variable x has a true zero (relative-scale variable). If range.x = "interval": variable x possibly includes negative values (interval-scale variable).
nperm - Number of permutations for the tests. If nperm = 0, tests will not be computed.
Here we again use the anscombe1 data frame:
library(lmodel2) # make sure that this library is loaded in order to use the function below
model <-lmodel2(y~x, data = anscombe1, range.y = "interval",
range.x = "interval", nperm = 100)
From the regression results, we want to retrieve the Intercept
and Slope
values for the RMA method. Below we can see the results that we have to search through.
model$regression.results
## Method Intercept Slope Angle (degrees) P-perm (1-tailed)
## 1 OLS 3.000091 0.5000909 26.56922 0.00990099
## 2 MA 2.511327 0.5543980 29.00390 0.00990099
## 3 SMA 1.988042 0.6125408 31.48917 NA
## 4 RMA 2.075872 0.6027819 31.08081 0.00990099
There are two ways in which you can extract these values. First, since model$regression.results
is a data frame (which you can check by typing class(model$regression.results)
), you can use indices to directly extract data from a cell at a specific row and column number:
intercept <- model$regression.results[4,2]
slope <- model$regression.results[4,3]
Another way is to use dplyr
as discussed in the previous practical. Here we filter
the data frame model$regression.results
for all rows where Method == "RMA"
, and then we select
the value in the Intercept
column. This will give us a data frame with only one row and one column. Since we want the number in the data frame, and not the data frame itself, we need to add the extra step of %>% unlist()
.
intercept = model$regression.results %>% filter(Method == "RMA") %>% select(Intercept) %>% unlist()
slope = model$regression.results %>% filter(Method == "RMA") %>% select(Slope) %>% unlist()
Make a summary of the regression model:
summary(model)
## Length Class Mode
## y 11 -none- numeric
## x 11 -none- numeric
## regression.results 5 data.frame list
## confidence.intervals 5 data.frame list
## eigenvalues 2 -none- numeric
## H 1 -none- numeric
## n 1 -none- numeric
## r 1 -none- numeric
## rsquare 1 -none- numeric
## P.param 1 -none- numeric
## theta 1 -none- numeric
## nperm 1 -none- numeric
## epsilon 1 -none- numeric
## info.slope 1 -none- numeric
## info.CI 1 -none- numeric
## call 6 -none- call
Plot the results of the regression model:
ggplot(anscombe1, aes(x=x, y=y)) + geom_point() + geom_abline(slope=slope, intercept=intercept)
Make a reduced major axis regression for x and y in Practical6_Data.txt and add the regression line to the plot.
## Length Class Mode
## y 18 -none- numeric
## x 18 -none- numeric
## regression.results 5 data.frame list
## confidence.intervals 5 data.frame list
## eigenvalues 2 -none- numeric
## H 1 -none- numeric
## n 1 -none- numeric
## r 1 -none- numeric
## rsquare 1 -none- numeric
## P.param 1 -none- numeric
## theta 1 -none- numeric
## nperm 1 -none- numeric
## epsilon 1 -none- numeric
## info.slope 1 -none- numeric
## info.CI 1 -none- numeric
## call 6 -none- call
\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \epsilon_i\)
Interpretation of the parameters: each slope \(\beta_j\) is the expected change in \(Y\) for a one-unit increase in \(X_j\), holding the other variables constant, and \(\beta_0\) is the expected value of \(Y\) when all of the \(X_j\) are zero.
An anova test can be done to see if removing one variable significantly changes the fit. If the p-value of the anova test is < 0.05, you should keep the variable (as a rule of thumb).
Say you have several variables (y, x1, x2, x3, …) in the data frame dat:
z = lm(y~ x1 + x2 + x3 + …, data = dat)
A pairs plot is useful for doing an initial exploration of how different variables (i.e. columns in a data frame) may be related to each other. A pairs plot is presented as a matrix containing scatter plots of all the combinations of variables with each other.
To be able to do this, make sure to install and load the GGally
library as described above.
install.packages("GGally")
library(GGally)
As an example, we’ll use a pairs plot on the mtcars
dataset. For this, we only use the mpg
, disp
, hp
and qsec
columns. Looking at the first few rows in these columns, we see:
head(mtcars[,c("mpg", "disp", "hp", "qsec")])
## mpg disp hp qsec
## Mazda RX4 21.0 160 110 16.46
## Mazda RX4 Wag 21.0 160 110 17.02
## Datsun 710 22.8 108 93 18.61
## Hornet 4 Drive 21.4 258 110 19.44
## Hornet Sportabout 18.7 360 175 17.02
## Valiant 18.1 225 105 20.22
The pairs plot for these four columns looks like this:
ggpairs(mtcars[,c("mpg", "disp", "hp", "qsec")])
Reading across the rows and columns we can see which variables are being plotted together. In the first column and second row, we have the disp
vs. the mpg
, directly below that is the hp
vs. the mpg
and so on. On the top right we can see the correlations calculated between the variables. On the diagonal we see a density plot of the variables. For example, most of the values for hp
are around 100, although the distribution is skewed to one side.
Now we perform a linear regression where we try to predict mpg
given the values for disp
, hp
and qsec:
z = lm(mpg~disp + hp + qsec, data=mtcars)
summary(z)
##
## Call:
## lm(formula = mpg ~ disp + hp + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5029 -2.6417 -0.6002 1.9749 7.2263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.622206 9.668300 3.995 0.000426 ***
## disp -0.028468 0.007787 -3.656 0.001049 **
## hp -0.034642 0.017967 -1.928 0.064041 .
## qsec -0.385561 0.468127 -0.824 0.417114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.144 on 28 degrees of freedom
## Multiple R-squared: 0.7542, Adjusted R-squared: 0.7279
## F-statistic: 28.64 on 3 and 28 DF, p-value: 1.118e-08
We can also extract the \(R^2\) value from the summary:
summary_for_z <- summary(z)
print(summary_for_z$r.squared)
## [1] 0.7541953
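The adjusted \(R^2\) is stored in the same summary object, under adj.r.squared:
print(summary_for_z$adj.r.squared)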
Create a series of diagnostic plots:
plot(z)
We can perform an anova test using the following setup (assuming a data frame named dat with columns named y, x1, x2, and x3):
z1 = lm(y~ x1 + x2 + x3, data = dat)
z2 = lm(y~ x1 + x2, data = dat)
anova(z1,z2)
Using the mtcars
example, we can do:
z1 = lm(mpg~disp + hp + qsec, data=mtcars)
z2 = lm(mpg~disp + hp, data=mtcars) # leave out `qsec`
results = anova(z1, z2)
print(results)
## Analysis of Variance Table
##
## Model 1: mpg ~ disp + hp + qsec
## Model 2: mpg ~ disp + hp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 276.79
## 2 29 283.49 -1 -6.7057 0.6784 0.4171
For this we’ll use the stackloss data set which was mentioned in Lecture 8. This data frame comes pre-loaded with R, and you can get more details about it by typing ?stackloss in the console.
## Air.Flow Water.Temp Acid.Conc. stack.loss
## Min. :50.00 Min. :17.0 Min. :72.00 Min. : 7.00
## 1st Qu.:56.00 1st Qu.:18.0 1st Qu.:82.00 1st Qu.:11.00
## Median :58.00 Median :20.0 Median :87.00 Median :15.00
## Mean :60.43 Mean :21.1 Mean :86.29 Mean :17.52
## 3rd Qu.:62.00 3rd Qu.:24.0 3rd Qu.:89.00 3rd Qu.:19.00
## Max. :80.00 Max. :27.0 Max. :93.00 Max. :42.00
For this we are trying to predict stack.loss given all the other variables.
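The call used here (matching the Call line in the output below, with the model again named z1) is:
z1 = lm(stack.loss ~ Acid.Conc. + Air.Flow + Water.Temp, data = stackloss)
summary(z1)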
##
## Call:
## lm(formula = stack.loss ~ Acid.Conc. + Air.Flow + Water.Temp,
## data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2377 -1.7117 -0.4551 2.3614 5.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.9197 11.8960 -3.356 0.00375 **
## Acid.Conc. -0.1521 0.1563 -0.973 0.34405
## Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
## Water.Temp 1.2953 0.3680 3.520 0.00263 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
## F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp, data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5290 -1.7505 0.1894 2.1156 5.6588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.3588 5.1383 -9.801 1.22e-08 ***
## Air.Flow 0.6712 0.1267 5.298 4.90e-05 ***
## Water.Temp 1.2954 0.3675 3.525 0.00242 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.239 on 18 degrees of freedom
## Multiple R-squared: 0.9088, Adjusted R-squared: 0.8986
## F-statistic: 89.64 on 2 and 18 DF, p-value: 4.382e-10
## Analysis of Variance Table
##
## Model 1: stack.loss ~ Acid.Conc. + Air.Flow + Water.Temp
## Model 2: stack.loss ~ Air.Flow + Water.Temp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17 178.83
## 2 18 188.79 -1 -9.9654 0.9473 0.344
##
## Call:
## lm(formula = stack.loss ~ Air.Flow, data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2896 -1.1272 -0.0459 1.1166 8.8728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -44.13202 6.10586 -7.228 7.31e-07 ***
## Air.Flow 1.02031 0.09995 10.208 3.77e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.098 on 19 degrees of freedom
## Multiple R-squared: 0.8458, Adjusted R-squared: 0.8377
## F-statistic: 104.2 on 1 and 19 DF, p-value: 3.774e-09
Perform the anova test:
## Analysis of Variance Table
##
## Model 1: stack.loss ~ Air.Flow + Water.Temp
## Model 2: stack.loss ~ Air.Flow
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 18 188.80
## 2 19 319.12 -1 -130.32 12.425 0.002419 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For this, create a set of linear regressions for combinations of the individual variables:
stack.loss~Air.Flow
Value for \(R^{2}\):
## [1] 0.8457809
stack.loss~Water.Temp
Value for \(R^{2}\):
## [1] 0.766508
Air.Flow~Water.Temp
Value for \(R^{2}\):
## [1] 0.6112931