Basic linear fitting with and without weights

Getting Started

For this practical, we will need the GGally and the lmodel2 libraries. Make sure that they are both installed. You only have to do this once.

install.packages("GGally")
install.packages("lmodel2")

Now load the required libraries (we also load the tidyverse, which is used for plotting throughout this practical).

library(GGally)
library(lmodel2)
library(tidyverse)

Reminder from the lecture

  • We have data (x, y), where x is a deterministic variable and y is a random variable that is dependent on x.
  • We fit a linear regression line to the data with slope \(\beta_1\) and intercept \(\beta_0\) by finding the minimum of the RSS (Residual Sum of Squares)

\(RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_{i} - \beta_0 - \beta_1 x_i)^2\)

where the estimates \(\hat{\beta}_1\) and \(\hat{\beta}_0\) are calculated as:

  • \(\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i-\bar{x})^2}\)
  • \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)

This requires that the variance in y is the same for all \(x_i\).

If this is not the case, we apply weighted regression by minimizing:

\(RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} w_i (y_{i} - \beta_0 - \beta_1 x_i)^2\)

with \(w_i = 1 / \sigma_i^2\)

Syntax

In R, the slope (\(\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i-\bar{x})^2}\)) and intercept (\(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)) can be calculated by these commands:

b1 = cov(x,y)/var(x)
b0 = mean(y) - b1*mean(x)
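
As a quick check, these hand-calculated values should match the coefficients reported by lm() (introduced below). A minimal sketch using the first data set of the built-in anscombe data frame:

x = anscombe$x1
y = anscombe$y1
b1 = cov(x,y)/var(x)          # slope
b0 = mean(y) - b1*mean(x)     # intercept
c(b0, b1)
coef(lm(y ~ x))               # should agree with b0 and b1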

For the general linear regression, we use the lm() command. The syntax you will need to use looks like:

z = lm(formula, data, weights)

A summary of what you see when typing ?lm into the console is:

  • formula - a symbolic description of the model to be fitted
    • A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response.
    • A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed.
  • data - the data frame used as input. The names you specify in formula will correspond to the column names in this data frame.
  • weights - an optional vector of weights to be used in the fitting process.
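
Putting these arguments together, a minimal sketch of a call that uses all three (the weights here are made-up equal weights, purely to show the syntax; with equal weights the fit is the same as an unweighted one):

# sketch: lm() with formula, data and (made-up, equal) weights
w = rep(1, nrow(anscombe))
z_example = lm(y1 ~ x1, data = anscombe, weights = w)
coef(z_example)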

Example

Anscombe’s Quartet

Throughout this practical, we’ll use a data frame containing Anscombe’s Quartet as an example. The data consists of four hypothetical collections of data that each have x and y values. The data frame anscombe has eight columns which represent all four of these data sets. For example, the first data set uses the columns x1 and y1, the second uses x2 and y2, etc. You can type ?anscombe into the console to find out more about this.

We’ll first look at some basic statistical properties. In the examples below, we use c() to group together the results into a vector so that they’re easier to see all in a row instead of being printed out on separate lines.

As you can see, the mean of the x values is the same for all four data sets:

c(mean(anscombe$x1), mean(anscombe$x2), mean(anscombe$x3), mean(anscombe$x4))
## [1] 9 9 9 9

As is the variance of x values:

c(var(anscombe$x1), var(anscombe$x2), var(anscombe$x3), var(anscombe$x4))
## [1] 11 11 11 11

The mean of the y values is the same to two decimal places:

c(mean(anscombe$y1), mean(anscombe$y2), mean(anscombe$y3), mean(anscombe$y4))
## [1] 7.500909 7.500909 7.500000 7.500909

The variance of y values is also very similar:

c(var(anscombe$y1), var(anscombe$y2), var(anscombe$y3), var(anscombe$y4))
## [1] 4.127269 4.127629 4.122620 4.123249

However, when we plot the different data sets, we see that they all look quite different.
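
One way to make such a plot (a minimal sketch: we reshape the four data sets by hand into a new data frame, called anscombe_long here, and then use facet_wrap):

# sketch: put the four data sets into long format and plot them side by side
anscombe_long = data.frame(
  set = rep(c("data set 1", "data set 2", "data set 3", "data set 4"), each = nrow(anscombe)),
  x   = c(anscombe$x1, anscombe$x2, anscombe$x3, anscombe$x4),
  y   = c(anscombe$y1, anscombe$y2, anscombe$y3, anscombe$y4))

ggplot(anscombe_long, aes(x=x, y=y)) + 
  geom_point() + 
  facet_wrap(~set)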

Anscombe’s work has been taken a step further by a recent paper that showed how to generate datasets that look different but have nearly identical statistical properties.

Perform a linear regression

For this example, we’ll use the x and y values from the first data set in Anscombe’s Quartet. We’ll create a second data frame to avoid overwriting the original data.

anscombe1 = data.frame(x = anscombe$x1, 
                       y = anscombe$y1)

Now we perform a linear regression on the data:

z = lm(y~x, data=anscombe1)

  • data=anscombe1 means that we want to use the anscombe1 data frame that we just created
  • y~x says that we’re trying to predict a response (the value y) given the value of a term (x). Here x and y correspond to the actual column names of the data frame. If we were to use a different data frame (like mtcars), we would update the formula (for example, to predict miles per gallon given horsepower, we would use mpg~hp, as sketched below).
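
For instance, a minimal sketch of that mtcars version (z_cars is just an illustrative name; this model isn’t needed for the rest of the practical):

# sketch: the same lm() pattern with a different data frame and different column names
z_cars = lm(mpg ~ hp, data = mtcars)
coef(z_cars)   # intercept and slope for predicting mpg from hp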

Basic regression output

Using summary, we can get an overview of the results of the regression:

summary(z)
## 
## Call:
## lm(formula = y ~ x, data = anscombe1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x             0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Error bar plots in ggplot

For some data sets, you will also have information available on the error for particular values. Anscombe’s quartet doesn’t contain this information, but for the sake of demonstrating the R syntax to create a plot with error bars, we just add a y_error column using random values and pretend that this is a real error measurement.

# create a column with bogus error values so we can show how geom_errorbar() works
anscombe1$y_error = runif(nrow(anscombe1))

We again make a scatter plot using geom_point, but this time we also add geom_errorbar to draw error bars on top of those points. The code ymin = y - y_error and ymax = y + y_error shows how we use the values from the y and y_error columns to calculate the bottom and top of the error bars.

ggplot(anscombe1, aes(x=x, y=y)) + 
  geom_point() + 
  geom_errorbar(aes(ymin = y - y_error,
                    ymax = y + y_error,
                    width=0.3)) # width of the horizontal bar at the top of the error bar

In this example, the error bars are “mirrored” around the value of y. Since we specify ymin independently of ymax, you could use a different formula and/or column so that the observed value of y does not lie directly in the middle of the error bars but is skewed to one side.

In the example below, we create columns for y_error_lower and y_error_upper, in order to show how to create error bars where the error is not symmetrical around the data point.

# Create columns with more bogus error values, 
# so we can show how geom_errorbar() works for more skewed error bars
# The values assigned here just make sure that the upper error bar 
# is further away from the data point than the lower error bar
anscombe1$y_error_upper = anscombe1$y + 3 * sd(anscombe1$y)/5
anscombe1$y_error_lower = anscombe1$y - sd(anscombe1$y)/5

ggplot(anscombe1, aes(x=x, y=y)) + 
  geom_point() + 
  geom_errorbar(aes(ymin = y_error_lower,
                    ymax = y_error_upper,
                    width=0.3))

Add a regression line to the error bar plot

We will reuse the same code from above, but will now add in a regression line. To do this, we need to use the coefficients that were found from the linear regression.

z$coefficients
## (Intercept)           x 
##   3.0000909   0.5000909

As you can see, the first value is the intercept. The second value is labelled x, but it is the slope of the regression line. We will use these values directly in the geom_abline command:

ggplot(anscombe1, aes(x=x, y=y)) + geom_point() + 
  geom_errorbar(aes(ymin=y - y_error,
                    ymax=y + y_error,
                    width=0.3)) +             # same plotting code as above
  geom_abline(intercept = z$coefficients[1],  # but add in the regression line
              slope = z$coefficients[2])

Alternatively, we can also use geom_smooth to perform a linear regression, although for the examples in this practical, we’ll use geom_abline. Using geom_smooth is a good idea when you have to create a plot with multiple linear regressions that are done on subsets of the data.

ggplot(anscombe1, aes(x=x, y=y)) + geom_point() + 
  geom_errorbar(aes(ymin=y - y_error,
                    ymax=y + y_error,
                    width=0.3)) + 
  geom_smooth(aes(x=x, y=y), method="lm", formula=y~x, se=FALSE)

One thing which we’ll come back to later is that the linear regressions for the four data sets in Anscombe’s Quartet are nearly identical, even though visually they look quite different.
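
A quick sketch that shows this by comparing the fitted coefficients of the four data sets (the column names come from the first fit):

# sketch: intercept and slope for each of the four Anscombe data sets
rbind(coef(lm(y1 ~ x1, data = anscombe)),
      coef(lm(y2 ~ x2, data = anscombe)),
      coef(lm(y3 ~ x3, data = anscombe)),
      coef(lm(y4 ~ x4, data = anscombe)))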

Exercise

For this exercise, download the data Practical6_Data.txt and load it into R. Assume that the error is symmetrical around the data points.

Make an error bar plot

You should see:

Perform a linear regression

Using summary on the results, you should see:

## 
## Call:
## lm(formula = y_values ~ x_values, data = data1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.066718 -0.009398 -0.003966  0.004625  0.082318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01251    0.01131  -1.106    0.285    
## x_values     0.88361    0.08043  10.986 7.31e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03105 on 16 degrees of freedom
## Multiple R-squared:  0.8829, Adjusted R-squared:  0.8756 
## F-statistic: 120.7 on 1 and 16 DF,  p-value: 7.311e-09

Add the regression line to the error bar plot

Perform a linear regression with weights specified

For the linear regression, we need to tell lm() that the weights are 1/(data1$y_errors^2)

The summary of the linear regression results should show:

## 
## Call:
## lm(formula = y_values ~ x_values, data = data1, weights = 1/(data1$y_errors^2))
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7018 -0.1766  0.5083  0.8713  1.3991 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.007306   0.006987  -1.046    0.311    
## x_values     0.623074   0.099661   6.252 1.16e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8809 on 16 degrees of freedom
## Multiple R-squared:  0.7095, Adjusted R-squared:  0.6914 
## F-statistic: 39.09 on 1 and 16 DF,  p-value: 1.156e-05

Plot both linear regressions (with and without weights)

Remember that with ggplot, you can append commands like + geom_abline(..., color="blue") + geom_abline(..., color="red")

Regression diagnostic

Residuals (\(\epsilon_i\))

Reminder from the lecture

  • \(E(\epsilon_i|X_i) = 0\) (residuals should have mean 0 for all \(X_i\))
  • \(\epsilon_i\) is normally distributed (residuals are normally distributed with mean = 0 and variance \(Var(\epsilon_i) = \sigma^2\), for all \(\epsilon_i\))

Syntax

For plotting residuals against fitted values, we’ll again use the anscombe1 data frame that we created above, using the same code for the linear regression:

z = lm(y~x, anscombe1)

Now we get the fitted values corresponding to each value of x:

fitted(z)
##         1         2         3         4         5         6         7 
##  8.001000  7.000818  9.501273  7.500909  8.501091 10.001364  6.000636 
##         8         9        10        11 
##  5.000455  9.001182  6.500727  5.500545

We then get the residuals:

residuals(z)
##           1           2           3           4           5           6 
##  0.03900000 -0.05081818 -1.92127273  1.30909091 -0.17109091 -0.04136364 
##           7           8           9          10          11 
##  1.23936364 -0.74045455  1.83881818 -1.68072727  0.17945455

Add columns to the anscombe1 data frame for the fitted (fit) and residual (res) values:

anscombe1$fit = fitted(z)
anscombe1$res = residuals(z)

We can see that the fitted values follow the regression line:

ggplot(anscombe1) + 
  geom_point(aes(x=x, y=y, color="original value"), size=3) + 
  geom_point(aes(x=x, y=fit, color="fitted value"), size=3) + 
  geom_abline(intercept = z$coefficients[1], slope=z$coefficients[2])

In the code above, we have the statements color="original value" and color="fitted value". When ggplot sees statements like these within aes(...), it will automatically assign a color to each of these text values. What is happening is the same as what you would get when using something like aes(..., color=some_column_name). In this case, every distinct value of some_column_name will be assigned a unique color.

Show the residuals that were calculated:

ggplot(anscombe1) + 
  geom_point(aes(x=x, y=res))

Putting this all together, we can see how subtracting the residual from the original y values gives us the fitted values in fit. Note that geom_segment allows us to draw line segments from the points defined at x,y to the points at xend,yend.

ggplot(data = anscombe1) + 
  geom_point(aes(x=x, y=y, color="original value"), size=3) + 
  geom_point(aes(x=x, y=fit, color="fitted value"), size=3) + 
  geom_segment(aes(x=x, xend=x, y=y, yend=y-res), 
               arrow = arrow(length = unit(0.3,"cm"))) + # add an arrow to this line segment
  geom_abline(intercept = z$coefficients[1], slope=z$coefficients[2])

Plot the residuals against the fitted values. If our original data came from a perfect straight line, all the residuals would be zero.

ggplot(anscombe1, aes(x=fit, y=res)) + geom_point() + ggtitle("residual vs. fitted values")

Perform a qqnorm plot of residuals to see if they are normally distributed:

qqnorm(anscombe1$res)
qqline(anscombe1$res)
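
As a side note, a similar Q-Q plot can be made with ggplot (a small sketch, assuming a reasonably recent version of ggplot2 that provides stat_qq() and stat_qq_line()):

ggplot(anscombe1, aes(sample = res)) + 
  stat_qq() +         # Q-Q points of the residuals
  stat_qq_line()      # reference line, comparable to qqline()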

Exercise

Make the plots of fitted values vs residuals for the data from Practical6_Data.txt

Perform a linear regression without weights:

Make a qqnorm plot of the residuals:

Linear regression with weights (performed same as earlier in the practical):

Make a qqnorm plot of the residuals (using the weighted model):

Cook’s distance

Reminder from the lecture

The Cook’s distance of point \(X_i\) is a measure of the difference in regression parameters when the point is included or omitted. If omitting a point changes the regression parameters strongly, this point has a large Cook’s distance.

To see quantitatively whether the Cook’s distance of a data point is too large, i.e. whether it has an undue influence on the fit, you need to compare it with a theoretical cut-off.

This cut-off value can be calculated from the F-distribution using qf(0.5, p, n-p, lower.tail = FALSE), where:

  • p - the number of fitted parameters (2 for simple linear regression)
  • n - the number of data points

Syntax

Below we use the anscombe1 data frame again.

##Cook's distance

z = lm(y~x, data=anscombe1)
cook <- cooks.distance(z)

## put the fitted values and Cook's distance into a data frame
data = data.frame(fit = fitted(z),
                  cooks_distance = cook)

##cut-off value
cut <- qf(0.5, 2, length(data$fit)-2, lower.tail = FALSE)

ggplot(data, aes(x=fit, y=cooks_distance)) + 
  geom_point() +                   # plot Cook's distance against each fitted value
  geom_hline(yintercept=cut)       # add a horizontal line for the cut-off we calculated

If we look at all four data sets in Anscombe’s Quartet, there are quite different outcomes when plotting the Cook’s distance for each.

Exercise

For the unweighted and weighted fit you performed on Practical6_Data.txt, compare the Cook’s distance.

Unweighted fit:

Weighted fit:

Automatic diagnostic plots

We can create a series of diagnostic plots for the linear regression results by simply using plot(z) (where z is the output of a linear regression) instead of ggplot. What’s happening here is that R detects that z is the result of a linear regression, and whenever it sees this, it will create a series of four plots:

  • Residuals vs. Fitted
  • Normal Q-Q
  • Scale-Location (not covered in this practical)
  • Residuals vs. Leverage (related to Cook’s distance)

Here we create the diagnostic plots for the linear regression of the anscombe1 data frame:

z = lm(y~x, data=anscombe1)
plot(z)
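
By default this shows the plots one after another; a small sketch of two common variations (which is an argument of plot() for lm objects):

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(z)
par(mfrow = c(1, 1))   # reset the plotting layout

plot(z, which = 2)     # show only one of the plots, here the Normal Q-Q plot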

Exercise

Create diagnostic plots for both the unweighted and weighted regressions that you just did on the data for from Practical6_Data.txt.

Diagnostic plots for unweighted fit

Diagnostic plots for weighted fit

Multiple regression

Reminder from the lecture

\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \epsilon_i\)

Interpretation of the parameters:

  • If \(\beta_j > 0\), then \(\beta_j\) is the average increase of the response when predictor \(x_j\) increases by one unit, holding all other variables constant.
  • If \(\beta_j < 0\), then \(\beta_j\) is the average decrease of the response when predictor \(x_j\) increases by one unit, holding all other variables constant.

An ANOVA test can be done to see if removing one variable significantly changes the fit. If the p-value of the ANOVA test is < 0.05, you should keep the variable (as a rule of thumb).

Syntax

Linear model as a function of several parameters

If you have several variables (y, x1, x2, x3, …) in the data frame dat, you can perform multiple regression via this syntax:

z = lm(y~ x1 + x2 + x3 + …, data = dat) 

A pairs plot

A pairs plot is useful for an initial exploration of how different variables (i.e. columns in a data frame) may be related to each other. It is presented as a matrix containing scatter plots of all pairwise combinations of the variables.

To be able to do this, make sure to install and load the GGally library as described above.

install.packages("GGally")
library(GGally)

As an example, we’ll use a pairs plot on the mtcars dataset. For this, we only use the mpg, disp, hp and wt columns. Looking at the first few rows in these columns, we see:

head(mtcars[,c("mpg", "disp", "hp", "wt")])
##                    mpg disp  hp    wt
## Mazda RX4         21.0  160 110 2.620
## Mazda RX4 Wag     21.0  160 110 2.875
## Datsun 710        22.8  108  93 2.320
## Hornet 4 Drive    21.4  258 110 3.215
## Hornet Sportabout 18.7  360 175 3.440
## Valiant           18.1  225 105 3.460

The pairs plot for these four columns looks like this:

ggpairs(mtcars[,c("mpg", "disp", "hp", "wt")])

Reading across the rows and columns, we can see which variables are being plotted together. In the first column and second row we have disp vs. mpg, directly below that is hp vs. mpg, and so on. In the top right we can see the correlations calculated between the variables. On the diagonal we see a density plot of each variable. For example, most of the values for hp are around 100, although the distribution is skewed to one side.
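
The correlations shown in the upper panels can also be computed directly with cor() (a small sketch):

# sketch: pairwise correlations between the four columns
cor(mtcars[,c("mpg", "disp", "hp", "wt")])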

Now we perform a linear regression where we try to predict mpg given the values for disp, hp and wt:

z = lm(mpg~disp + hp + wt, data=mtcars)
summary(z)
## 
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.891 -1.640 -0.172  1.061  5.861 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.105505   2.110815  17.579  < 2e-16 ***
## disp        -0.000937   0.010350  -0.091  0.92851    
## hp          -0.031157   0.011436  -2.724  0.01097 *  
## wt          -3.800891   1.066191  -3.565  0.00133 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8083 
## F-statistic: 44.57 on 3 and 28 DF,  p-value: 8.65e-11

We can also extract the \(R^2\) value from the summary:

summary_for_z <- summary(z)
print(summary_for_z$r.squared)
## [1] 0.8268361

Create a series of diagnostic plots:

plot(z)

ANOVA test

We can perform an ANOVA test using the following setup (assuming a data frame named dat with columns named y, x1, x2, and x3):

  • z1 = lm(y~ x1 + x2 + x3, data = dat)
  • z2 = lm(y~ x1 + x2, data = dat)
  • anova(z1,z2)

Using the mtcars example, we can do:

z1 = lm(mpg~disp + hp + wt, data=mtcars)
z2 = lm(mpg~disp + hp, data=mtcars)  # leave out `wt`
results = anova(z1, z2)
print(results)
## Analysis of Variance Table
## 
## Model 1: mpg ~ disp + hp + wt
## Model 2: mpg ~ disp + hp
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     28 194.99                                
## 2     29 283.49 -1   -88.503 12.709 0.001331 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As stated above, if the p-value of the ANOVA test is < 0.05, you should keep the variable (as a rule of thumb). Here it’s clear that the wt column (weight) is quite important in predicting the fuel economy of a car.

Exercise

For this we’ll use the stackloss data set, which was mentioned in Lecture 8. This data frame comes pre-loaded with R, and you can get more details about it by typing ?stackloss in the console.

Look at a summary of the data set

##     Air.Flow       Water.Temp     Acid.Conc.      stack.loss   
##  Min.   :50.00   Min.   :17.0   Min.   :72.00   Min.   : 7.00  
##  1st Qu.:56.00   1st Qu.:18.0   1st Qu.:82.00   1st Qu.:11.00  
##  Median :58.00   Median :20.0   Median :87.00   Median :15.00  
##  Mean   :60.43   Mean   :21.1   Mean   :86.29   Mean   :17.52  
##  3rd Qu.:62.00   3rd Qu.:24.0   3rd Qu.:89.00   3rd Qu.:19.00  
##  Max.   :80.00   Max.   :27.0   Max.   :93.00   Max.   :42.00

Make a pairs plot

Fit a linear model using all variables and summarize the output of the model

For this we are trying to predict stack.loss given all the other variables

## 
## Call:
## lm(formula = stack.loss ~ Acid.Conc. + Air.Flow + Water.Temp, 
##     data = stackloss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2377 -1.7117 -0.4551  2.3614  5.6978 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
## Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    
## Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
## Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared:  0.9136, Adjusted R-squared:  0.8983 
## F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

Plot the standard diagnostic plots

Make a Cook’s distance plot by hand with a cutoff line

Leave out the least important parameter, acid concentration (Acid.Conc.)

## 
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp, data = stackloss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5290 -1.7505  0.1894  2.1156  5.6588 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -50.3588     5.1383  -9.801 1.22e-08 ***
## Air.Flow      0.6712     0.1267   5.298 4.90e-05 ***
## Water.Temp    1.2954     0.3675   3.525  0.00242 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.239 on 18 degrees of freedom
## Multiple R-squared:  0.9088, Adjusted R-squared:  0.8986 
## F-statistic: 89.64 on 2 and 18 DF,  p-value: 4.382e-10

See if that made a difference for the overall model

## Analysis of Variance Table
## 
## Model 1: stack.loss ~ Acid.Conc. + Air.Flow + Water.Temp
## Model 2: stack.loss ~ Air.Flow + Water.Temp
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     17 178.83                           
## 2     18 188.79 -1   -9.9654 0.9473  0.344

Now leave out water temp

## 
## Call:
## lm(formula = stack.loss ~ Air.Flow, data = stackloss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2896  -1.1272  -0.0459   1.1166   8.8728 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -44.13202    6.10586  -7.228 7.31e-07 ***
## Air.Flow      1.02031    0.09995  10.208 3.77e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.098 on 19 degrees of freedom
## Multiple R-squared:  0.8458, Adjusted R-squared:  0.8377 
## F-statistic: 104.2 on 1 and 19 DF,  p-value: 3.774e-09

Perform the anova test:

## Analysis of Variance Table
## 
## Model 1: stack.loss ~ Air.Flow + Water.Temp
## Model 2: stack.loss ~ Air.Flow
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     18 188.80                                
## 2     19 319.12 -1   -130.32 12.425 0.002419 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fits of individual variables

For this, create a set of linear regressions for combinations of the individual variables

Regression for stack.loss~Air.Flow

Value for \(R^{2}\):

## [1] 0.8457809

Regression for stack.loss~Water.Temp

Value for \(R^{2}\):

## [1] 0.766508

Regression for Air.Flow~Water.Temp

Value for \(R^{2}\):

## [1] 0.6112931

Replacing values with NA

Real data may often have missing values, which are represented in R as NA. In some data sets you may see them represented in other ways, such as NaN (not a number) or an unrealistic number (e.g., -999). For R to handle missing values correctly, you have to recode them as NA; if they are not coded like that, R does not recognize them as missing and will try to process them as actual numbers.

Let’s say that in a data frame called df, missing values are currently represented as -999. If we type df == -999 into the console, we’ll get a matrix of TRUE, FALSE or NA values (NA where NA values are already present).

To replace these -999 values, we use the command below which says that for all locations in the data frame with a value of -999, we should assign to those locations the value NA instead.

df[df == -999] <- NA
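
For example, a minimal sketch with a small made-up data frame (the values are invented purely for illustration):

# sketch: a made-up data frame in which missing values were coded as -999
df = data.frame(x = c(1, 2, 3, 4),
                y = c(5.1, -999, 7.3, -999))
df == -999            # logical matrix marking the -999 entries
df[df == -999] <- NA
df                    # the -999 values are now NA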

Also be aware that many R functions have arguments that allow you to specify how they will process NA values. For example, if you type ?mean in the R console, you’ll see that the mean() function has a na.rm argument that specifies if NA values should be removed before computing the mean.

Calculating mean() with NA values included:

mean(c(3,7,3,NA,5))
## [1] NA

Calculating mean() and removing NA values:

mean(c(3,7,3,NA,5), na.rm=TRUE)
## [1] 4.5

The lm() function has a na.action argument that specifies how to deal with missing values.

df <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))

# remove rows from the data frame containing NA
na.omit(df)
##   x  y
## 1 1  0
## 2 2 10
lm(y~x, df, na.action=na.fail)
## Error in na.fail.default(structure(list(y = c(0, 10, NA), x = c(1, 2, : missing values in object
lm(y~x, df, na.action=na.omit)
## 
## Call:
## lm(formula = y ~ x, data = df, na.action = na.omit)
## 
## Coefficients:
## (Intercept)            x  
##         -10           10
lm(y~x, df, na.action=na.pass)
## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): NA/NaN/Inf in 'y'

The dplyr library also deals with NA values in a specific way:

  • If you are doing a numeric comparison (e.g. hp >= 250) using the filter function, then rows where the comparison evaluates to NA will be filtered out.
  • If you want to do a numeric comparison and also keep the NA values, then you need to specify something like hp >= 250 | is.na(hp). The function is.na() will return TRUE for any NA values. A short sketch of this is shown below.
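
A minimal sketch of this behaviour (using a copy of mtcars in which one hp value is set to NA purely for illustration; filter comes from dplyr, which is loaded as part of the tidyverse):

cars_na = mtcars
cars_na$hp[1] = NA                                # introduce a missing value for illustration

nrow(filter(cars_na, hp >= 250))                  # the NA row is dropped
nrow(filter(cars_na, hp >= 250 | is.na(hp)))      # the NA row is kept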

Exercise

For this exercise, download the Carbon.csv data and load it into R as a data frame called input1.

After loading the csv file into R, try to replace the -999 values with NA. If you run summary before the replacement, you should see:

##           date_start elementary.carbon.from.fossil.fuels     BB_EC        
##  01/03/12 12:35: 1   Min.   :0.1077                      Min.   :0.01385  
##  01/06/11 13:30: 1   1st Qu.:0.2289                      1st Qu.:0.02643  
##  02/05/11 15:15: 1   Median :0.3446                      Median :0.06442  
##  02/12/11 09:15: 1   Mean   :0.5839                      Mean   :0.12673  
##  06/06/11 11:45: 1   3rd Qu.:0.6550                      3rd Qu.:0.14614  
##  07/07/11 13:30: 1   Max.   :2.2389                      Max.   :0.66726  
##  (Other)       :17                                                        
##      FF_OC               BB_OC               BIO_OC      
##  Min.   :-999.0000   Min.   :-999.0000   Min.   :0.2535  
##  1st Qu.:   0.2770   1st Qu.:   0.1179   1st Qu.:0.4915  
##  Median :   0.4028   Median :   0.3220   Median :0.7457  
##  Mean   : -42.8049   Mean   : -42.8112   Mean   :1.1032  
##  3rd Qu.:   0.5454   3rd Qu.:   0.7306   3rd Qu.:1.2399  
##  Max.   :   2.2815   Max.   :   3.3353   Max.   :4.0950  
##                                                          
##     FF_WIOC              C_WIOC           FF_WSOC        
##  Min.   :-999.0000   Min.   :0.04875   Min.   :-0.16657  
##  1st Qu.:   0.1436   1st Qu.:0.12869   1st Qu.: 0.09897  
##  Median :   0.1950   Median :0.26482   Median : 0.17624  
##  Mean   : -86.5462   Mean   :0.42041   Mean   : 0.28935  
##  3rd Qu.:   0.3089   3rd Qu.:0.42538   3rd Qu.: 0.35644  
##  Max.   :   1.4266   Max.   :1.80312   Max.   : 1.05329  
##                                                          
##      C_WSOC         
##  Min.   :-999.0000  
##  1st Qu.:   0.4852  
##  Median :   0.8561  
##  Mean   : -42.1394  
##  3rd Qu.:   1.5079  
##  Max.   :   4.2263  
## 

After replacing them, using summary you should see that several columns now have NA values:

##           date_start elementary.carbon.from.fossil.fuels     BB_EC        
##  01/03/12 12:35: 1   Min.   :0.1077                      Min.   :0.01385  
##  01/06/11 13:30: 1   1st Qu.:0.2289                      1st Qu.:0.02643  
##  02/05/11 15:15: 1   Median :0.3446                      Median :0.06442  
##  02/12/11 09:15: 1   Mean   :0.5839                      Mean   :0.12673  
##  06/06/11 11:45: 1   3rd Qu.:0.6550                      3rd Qu.:0.14614  
##  07/07/11 13:30: 1   Max.   :2.2389                      Max.   :0.66726  
##  (Other)       :17                                                        
##      FF_OC            BB_OC             BIO_OC          FF_WIOC       
##  Min.   :0.1173   Min.   :0.06923   Min.   :0.2535   Min.   :0.08086  
##  1st Qu.:0.2867   1st Qu.:0.12886   1st Qu.:0.4915   1st Qu.:0.17058  
##  Median :0.4328   Median :0.32871   Median :0.7457   Median :0.19604  
##  Mean   :0.6585   Mean   :0.65190   Mean   :1.1032   Mean   :0.35414  
##  3rd Qu.:0.5580   3rd Qu.:0.77068   3rd Qu.:1.2399   3rd Qu.:0.31171  
##  Max.   :2.2815   Max.   :3.33533   Max.   :4.0950   Max.   :1.42661  
##  NA's   :1        NA's   :1                          NA's   :2        
##      C_WIOC           FF_WSOC             C_WSOC      
##  Min.   :0.04875   Min.   :-0.16657   Min.   :0.2886  
##  1st Qu.:0.12869   1st Qu.: 0.09897   1st Qu.:0.5394  
##  Median :0.26482   Median : 0.17624   Median :0.8566  
##  Mean   :0.42041   Mean   : 0.28935   Mean   :1.3542  
##  3rd Qu.:0.42538   3rd Qu.: 0.35644   3rd Qu.:1.5549  
##  Max.   :1.80312   Max.   : 1.05329   Max.   :4.2263  
##                                       NA's   :1