For this practical, we will need the GGally and lmodel2 libraries. Make sure that they are both installed. You only have to do this once.
install.packages("GGally")
install.packages("lmodel2")
Now load the required libraries.
library(GGally)
library(lmodel2)
library(tidyverse)
\(RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_{i} - \beta_0 - \beta_1 x_i)^2\)
where \(\beta_1\) and \(\beta_0\) are the slope and intercept, estimated by least squares (their formulas are given below).
This equation requires that the variance in y is the same for all \(x_i\).
If this is not the case, we apply weighted regression by minimizing:
\(RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} w_i (y_{i} - \beta_0 - \beta_1 x_i)^2\)
with \(w_i = 1 / \sigma_i^2\)
In R, the slope (\(\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i-\bar{x})^2}\)) and intercept (\(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)) can be calculated by these commands:
b1 = cov(x,y)/var(x)
b0 = mean(y) - b1*mean(x)
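As a quick check of these formulas, the short sketch below (not part of the original practical; the x and y vectors are invented for illustration) computes the slope and intercept by hand and compares them with what lm() reports.
# Hand-computed slope and intercept vs. lm(), on made-up data
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 = cov(x, y) / var(x)        # slope: cov(x, y) / var(x)
b0 = mean(y) - b1 * mean(x)    # intercept: mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(y ~ x))                # should match (up to rounding)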
For the general linear regression, we use the lm() command. The syntax you will need to use looks like:
z = lm(formula, data, weights)
A summary of what you see when typing ?lm into the console is:
formula - a symbolic description of the model to be fitted, of the form response ~ terms, where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. first + second indicates all the terms in first together with all the terms in second, with duplicates removed.
data - the data frame used as input. The names you specify in formula will correspond to the column names in this data frame.
weights - an optional vector of weights to be used in the fitting process.
Throughout this practical, we’ll use a data frame containing Anscombe’s Quartet as an example. The data consists of four hypothetical collections of data that each have x and y values. The data frame anscombe has eight columns which represent all four of these data sets. For example, the first data set uses the columns x1 and y1, the second uses x2 and y2, etc. You can type ?anscombe into the console to find out more about this.
We’ll first look at some basic statistical properties. In the examples below, we use c() to group the results together into a vector so that they’re easier to see all in a row instead of being printed out on separate lines.
As you can see, the mean of the x values is the same for all the data sets:
c(mean(anscombe$x1), mean(anscombe$x2), mean(anscombe$x3), mean(anscombe$x4))
## [1] 9 9 9 9
As is the variance of x values:
c(var(anscombe$x1), var(anscombe$x2), var(anscombe$x3), var(anscombe$x4))
## [1] 11 11 11 11
The mean of y values is the same to several decimal places:
c(mean(anscombe$y1), mean(anscombe$y2), mean(anscombe$y3), mean(anscombe$y4))
## [1] 7.500909 7.500909 7.500000 7.500909
The variance of y values is also very similar:
c(var(anscombe$y1), var(anscombe$y2), var(anscombe$y3), var(anscombe$y4))
## [1] 4.127269 4.127629 4.122620 4.123249
However, when we plot the different data sets, we see that they all look quite different:
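A minimal sketch of one way to produce such a plot is shown below; the reshaping into a single long data frame (and the column name set) is our own choice, not code from the practical.
# Sketch: combine the four data sets into one long data frame and facet by set
anscombe_long = rbind(
  data.frame(set = "1", x = anscombe$x1, y = anscombe$y1),
  data.frame(set = "2", x = anscombe$x2, y = anscombe$y2),
  data.frame(set = "3", x = anscombe$x3, y = anscombe$y3),
  data.frame(set = "4", x = anscombe$x4, y = anscombe$y4))
ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~set)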
Anscombe’s work has been taken a step further by a recent paper that showed how to generate datasets that look different but have nearly identical statistical properties.
For this example, we’ll use the x and y values from the first data set in Anscombe’s Quartet. We’ll create a second data frame to avoid overwriting the original data.
anscombe1 = data.frame(x = anscombe$x1,
y = anscombe$y1)
Now we perform a linear regression on the data:
z = lm(y~x, data=anscombe1)
data=anscombe1 means that we want to use the anscombe1 data frame that we just created.
y~x says that we’re trying to predict a response (the value y) given the value of a term (x). Here x and y correspond to the actual column names of the data frame. If we were to use a different data frame (like mtcars) we would update the formula (for example, to predict miles per gallon given horsepower, we would use mpg~hp).
Using summary we can get an overview of the results of the regression:
summary(z)
##
## Call:
## lm(formula = y ~ x, data = anscombe1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## x 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
For some data sets, you will also have information available on the error for particular values. Anscombe’s quartet doesn’t contain this information, but for the sake of demonstrating the R syntax to create a plot with error bars, we just add a y_error column using random values and pretend that this is a real error measurement.
# create a column with bogus error values so we can show how geom_errorbar() works
anscombe1$y_error = runif(nrow(anscombe1))
Now we make a scatter plot using geom_point, but we also add geom_errorbar to put error bars on top of those points. The code ymin = y - y_error and ymax = y + y_error shows how we use the values from the y and y_error columns to calculate the top and bottom of the error bars.
ggplot(anscombe1, aes(x=x, y=y)) +
geom_point() +
geom_errorbar(aes(ymin = y - y_error,
ymax = y + y_error,
width=0.3)) # width of the horizontal bar at the top of the error bar
In this example, the error bars are “mirrored” around the value of y. Since we specify the value of ymin independently of ymax, you could use a different formula and/or column so that the observed value of y does not lie directly in the middle of the error bars but is skewed to one side.
In the example below, we create columns for y_error_lower and y_error_upper, in order to show how to create error bars where the error is not symmetrical around the data point.
# Create columns with more bogus error values,
# so we can show how geom_errorbar() works for more skewed error bars
# The values assigned here just make sure that the upper error bar
# is further away from the data point than the lower error bar
anscombe1$y_error_upper = anscombe1$y + 3 * sd(anscombe1$y)/5
anscombe1$y_error_lower = anscombe1$y - sd(anscombe1$y)/5
ggplot(anscombe1, aes(x=x, y=y)) +
geom_point() +
geom_errorbar(aes(ymin = y_error_lower,
ymax = y_error_upper,
width=0.3))
We will reuse the same code from above, but will now add in a regression line. To do this, we need to use the coefficients that were found from the linear regression.
z$coefficients
## (Intercept) x
## 3.0000909 0.5000909
As you can see, the first value is the intercept. The second value is labelled x, although it is actually the slope of the regression line. We will use these values directly in the geom_abline command:
ggplot(anscombe1, aes(x=x, y=y)) + geom_point() +
geom_errorbar(aes(ymin=y - y_error,
ymax=y + y_error,
width=0.3)) + # same plotting code as above
geom_abline(intercept = z$coefficients[1], # but add in the regression line
slope = z$coefficients[2])
Alternatively, we can also use geom_smooth to perform a linear regression, although for the examples in this practical we’ll use geom_abline. Using geom_smooth is a good idea when you have to create a plot with multiple linear regressions that are done on subsets of the data.
ggplot(anscombe1, aes(x=x, y=y)) + geom_point() +
geom_errorbar(aes(ymin=y - y_error,
ymax=y + y_error,
width=0.3)) +
geom_smooth(aes(x=x, y=y), method="lm", formula=y~x, se=FALSE)
One thing which we’ll come back to later is that the linear regressions for the four data sets in Anscombe’s Quartet are nearly identical, even though visually they look quite different:
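A quick numerical check of this (our own sketch, not code from the practical) is to fit all four regressions and print their coefficients:
# Sketch: the coefficients of the four separate regressions are nearly identical
coef(lm(y1 ~ x1, data = anscombe))
coef(lm(y2 ~ x2, data = anscombe))
coef(lm(y3 ~ x3, data = anscombe))
coef(lm(y4 ~ x4, data = anscombe))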
For this exercise, download the data Practical6_Data.txt and load it into R. Assume that the error is symmetrical around the data points.
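A sketch of how you might load and plot the data is given below. The column names x_values, y_values and y_errors are assumptions based on the output shown later in this exercise, and read.table() with a header is an assumption about the file format; adjust both to match the actual file.
# Sketch: load the data and plot it with symmetrical error bars
data1 = read.table("Practical6_Data.txt", header = TRUE)
ggplot(data1, aes(x = x_values, y = y_values)) +
  geom_point() +
  geom_errorbar(aes(ymin = y_values - y_errors,
                    ymax = y_values + y_errors))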
You should see:
Using summary on the results, you should see:
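A sketch of the unweighted fit, again assuming the data1 data frame and column names from the loading sketch above (the object name z_unweighted is our own):
z_unweighted = lm(y_values ~ x_values, data = data1)
summary(z_unweighted)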
##
## Call:
## lm(formula = y_values ~ x_values, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.066718 -0.009398 -0.003966 0.004625 0.082318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01251 0.01131 -1.106 0.285
## x_values 0.88361 0.08043 10.986 7.31e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03105 on 16 degrees of freedom
## Multiple R-squared: 0.8829, Adjusted R-squared: 0.8756
## F-statistic: 120.7 on 1 and 16 DF, p-value: 7.311e-09
For the linear regression, we need to tell lm() that the weights are 1/(data1$y_errors^2).
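A sketch of this weighted fit, reusing the assumed data1 columns (the object name z_weighted is our own):
z_weighted = lm(y_values ~ x_values, data = data1,
                weights = 1/(data1$y_errors^2))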
The summary of the linear regression results should show:
##
## Call:
## lm(formula = y_values ~ x_values, data = data1, weights = 1/(data1$y_errors^2))
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.7018 -0.1766 0.5083 0.8713 1.3991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.007306 0.006987 -1.046 0.311
## x_values 0.623074 0.099661 6.252 1.16e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8809 on 16 degrees of freedom
## Multiple R-squared: 0.7095, Adjusted R-squared: 0.6914
## F-statistic: 39.09 on 1 and 16 DF, p-value: 1.156e-05
Remember that with ggplot, you can append commands like + geom_abline(..., color="blue") + geom_abline(..., color="red")
For plotting residuals against fitted values, we’ll again use the anscombe1 data frame that we created above, using the same code for the linear regression:
z = lm(y~x, anscombe1)
Now we get the fitted values corresponding to each value of x:
fitted(z)
## 1 2 3 4 5 6 7
## 8.001000 7.000818 9.501273 7.500909 8.501091 10.001364 6.000636
## 8 9 10 11
## 5.000455 9.001182 6.500727 5.500545
We then get the residuals:
residuals(z)
## 1 2 3 4 5 6
## 0.03900000 -0.05081818 -1.92127273 1.30909091 -0.17109091 -0.04136364
## 7 8 9 10 11
## 1.23936364 -0.74045455 1.83881818 -1.68072727 0.17945455
Add columns to the anscombe1 data frame for the fit and res values:
anscombe1$fit = fitted(z)
anscombe1$res = residuals(z)
We can see that the fitted values follow the regression line:
ggplot(anscombe1) +
geom_point(aes(x=x, y=y, color="original value"), size=3) +
geom_point(aes(x=x, y=fit, color="fitted value"), size=3) +
geom_abline(intercept = z$coefficients[1], slope=z$coefficients[2])
In the code above, we have the statements color="original value" and color="fitted value". When ggplot sees statements like these within aes(...), it will automatically assign a color to each of these text values. What is happening is the same as what you would get when using something like aes(..., color=some_column_name). In this case, every distinct value of some_column_name will be assigned a unique color.
Show the residuals that were calculated:
ggplot(anscombe1) +
geom_point(aes(x=x, y=res))
Putting this all together, we can see how subtracting the residual from the original y values gives us the fitted values in fit. Note that geom_segment allows us to draw line segments from the point defined at x,y to the point at xend,yend.
ggplot(data = anscombe1) +
geom_point(aes(x=x, y=y, color="original value"), size=3) +
geom_point(aes(x=x, y=fit, color="fitted value"), size=3) +
geom_segment(aes(x=x, xend=x, y=y, yend=y-res),
arrow = arrow(length = unit(0.3,"cm"))) + # add an arrow to this line segment
geom_abline(intercept = z$coefficients[1], slope=z$coefficients[2])
Plot the residuals against the fitted values. If our original data came from a perfect straight line, the residuals would all be zero.
ggplot(anscombe1, aes(x=fit, y=res)) + geom_point() + ggtitle("residual vs. fitted values")
Perform a qqnorm plot of residuals to see if they are normally distributed:
qqnorm(anscombe1$res)
qqline(anscombe1$res)
Make the plots of fitted values vs. residuals for the data from Practical6_Data.txt.
Perform a linear regression without weights:
Make a qqnorm plot of the residuals:
Linear regression with weights (performed the same way as earlier in the practical):
Make a qqnorm plot of the residuals (using the weighted model):
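A sketch of these steps, reusing the z_unweighted and z_weighted fits from the sketches above (the object names are our own):
# fitted values vs. residuals, for the unweighted and the weighted model
plot(fitted(z_unweighted), residuals(z_unweighted))
plot(fitted(z_weighted), residuals(z_weighted))
# normal Q-Q plots of the residuals
qqnorm(residuals(z_unweighted))
qqline(residuals(z_unweighted))
qqnorm(residuals(z_weighted))
qqline(residuals(z_weighted))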
The Cook’s distance of point \(X_i\) is a measure of the difference in regression parameters when the point is included or omitted. If omitting a point changes the regression parameters strongly, this point has a large Cook’s distance.
To see quantitatively if the Cook’s distance of a data point is too large and it has an undue influence on the fit, you need to compare it with a theoretical cut-off.
This cut-off value can be calculated with the F-statistic using qf(0.5, p, n-p, lower.tail = FALSE), where:
p - the number of fitted parameters (= 2 for simple linear regression)
n - the number of data points
Below we use the anscombe1 data frame again.
##Cook's distance
z = lm(y~x, data=anscombe1)
cook <- cooks.distance(z)
## put the fitted values and Cook's distance into a data frame
data = data.frame(fit = fitted(z),
cooks_distance = cook)
##cut-off value
cut <- qf(0.5, 2, length(data$fit)-2, lower.tail = FALSE)
ggplot(data, aes(x=fit, y=cooks_distance)) +
geom_point() + # plot fitted values vs. cooks distance for each fitted values
geom_hline(yintercept=cut) # add a horizontal line for the cut-off we calculated
If we look at all four data sets in Anscombe’s Quartet, there are quite different outcomes when plotting the Cook’s distance for each.
For the unweighted and weighted fit you performed on Practical6_Data.txt, compare the Cook’s distance.
Unweighted fit:
Weighted fit:
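A sketch of this comparison, following the same pattern as the anscombe1 example above but using base graphics for brevity (again reusing the assumed data1, z_unweighted and z_weighted objects):
# cut-off based on the F-statistic, with p = 2 fitted parameters
cut_off = qf(0.5, 2, nrow(data1) - 2, lower.tail = FALSE)
# unweighted fit
plot(fitted(z_unweighted), cooks.distance(z_unweighted))
abline(h = cut_off)
# weighted fit
plot(fitted(z_weighted), cooks.distance(z_weighted))
abline(h = cut_off)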
We can create a series of diagnostic plots for the linear regression results by simply using plot(z) (where z is the output of a linear regression) instead of ggplot. What’s happening here is that R detects that z is the result of a linear regression, and whenever it sees this, it will create a series of four plots: residuals vs. fitted values, a normal Q-Q plot of the residuals, a scale-location plot, and residuals vs. leverage.
Here we create the diagnostic plots for the linear regression of the anscombe1 data frame:
z = lm(y~x, data=anscombe1)
plot(z)
Create diagnostic plots for both the unweighted and weighted regressions that you just did on the data from Practical6_Data.txt.
\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \ldots + \epsilon_i\)
Interpretation of the parameters: \(\beta_0\) is the expected value of \(Y\) when all predictors are zero, and each \(\beta_j\) is the expected change in \(Y\) for a one-unit increase in \(X_j\) while the other predictors are held constant.
An ANOVA test can be done to see if removing one variable significantly changes the fit. If the p-value of the ANOVA test is < 0.05, you should keep the variable (as a rule of thumb).
If you have several variables (y, x1, x2, x3, …) in the data frame dat, you can perform multiple regression via this syntax:
z = lm(y~ x1 + x2 + x3 + …, data = dat)
A pairs plot is useful for doing an initial exploration of how different variables (i.e. columns in a data frame) may be related to each other. A pairs plot is presented as a matrix containing scatter plots of all the combinations of variables with each other.
To be able to do this, make sure to install and load the GGally library as described above.
install.packages("GGally")
library(GGally)
As an example, we’ll use a pairs plot on the mtcars dataset. For this, we only use the mpg, disp, hp and wt columns. Looking at the first few rows in these columns, we see:
head(mtcars[,c("mpg", "disp", "hp", "wt")])
## mpg disp hp wt
## Mazda RX4 21.0 160 110 2.620
## Mazda RX4 Wag 21.0 160 110 2.875
## Datsun 710 22.8 108 93 2.320
## Hornet 4 Drive 21.4 258 110 3.215
## Hornet Sportabout 18.7 360 175 3.440
## Valiant 18.1 225 105 3.460
The pairs plot for these four columns looks like this:
ggpairs(mtcars[,c("mpg", "disp", "hp", "wt")])
Reading across the rows and columns, we can see which variables are being plotted together. In the first column and second row, we have disp vs. mpg; directly below that is hp vs. mpg, and so on. On the top right we can see the correlations calculated between the variables. On the diagonal we see a density plot of each variable. For example, most of the values for hp are around 100, although the distribution is skewed to one side.
Now we perform a linear regression where we try to predict mpg given the values for disp, hp and wt:
z = lm(mpg~disp + hp + wt, data=mtcars)
summary(z)
##
## Call:
## lm(formula = mpg ~ disp + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.891 -1.640 -0.172 1.061 5.861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
## disp -0.000937 0.010350 -0.091 0.92851
## hp -0.031157 0.011436 -2.724 0.01097 *
## wt -3.800891 1.066191 -3.565 0.00133 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.639 on 28 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
## F-statistic: 44.57 on 3 and 28 DF, p-value: 8.65e-11
We can also extract the \(R^2\) value from the summary:
summary_for_z <- summary(z)
print(summary_for_z$r.squared)
## [1] 0.8268361
Create a series of diagnostic plots:
plot(z)
We can perform an ANOVA test using the following setup (assuming a data frame named dat with columns named y, x1, x2, and x3):
z1 = lm(y~ x1 + x2 + x3, data = dat)
z2 = lm(y~ x1 + x2, data = dat)
anova(z1,z2)
Using the mtcars example, we can do:
z1 = lm(mpg~disp + hp + wt, data=mtcars)
z2 = lm(mpg~disp + hp, data=mtcars) # leave out `wt`
results = anova(z1, z2)
print(results)
## Analysis of Variance Table
##
## Model 1: mpg ~ disp + hp + wt
## Model 2: mpg ~ disp + hp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 194.99
## 2 29 283.49 -1 -88.503 12.709 0.001331 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As stated above, if the p-value of the ANOVA test is < 0.05, you should keep the variable (as a rule of thumb). Here it’s clear that the wt column (weight) is quite important in predicting the fuel economy of a car.
For this we’ll use the stackloss data set which was mentioned in Lecture 8. This data frame is already pre-loaded in R, and you can get more details about it by typing ?stackloss in the console.
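The summary shown below can presumably be reproduced with:
summary(stackloss)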
## Air.Flow Water.Temp Acid.Conc. stack.loss
## Min. :50.00 Min. :17.0 Min. :72.00 Min. : 7.00
## 1st Qu.:56.00 1st Qu.:18.0 1st Qu.:82.00 1st Qu.:11.00
## Median :58.00 Median :20.0 Median :87.00 Median :15.00
## Mean :60.43 Mean :21.1 Mean :86.29 Mean :17.52
## 3rd Qu.:62.00 3rd Qu.:24.0 3rd Qu.:89.00 3rd Qu.:19.00
## Max. :80.00 Max. :27.0 Max. :93.00 Max. :42.00
For this we are trying to predict stack.loss given all the other variables:
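A sketch of the corresponding fit (the formula and data frame are taken from the Call line in the output below; the object name z1 is our own):
z1 = lm(stack.loss ~ Acid.Conc. + Air.Flow + Water.Temp, data = stackloss)
summary(z1)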
##
## Call:
## lm(formula = stack.loss ~ Acid.Conc. + Air.Flow + Water.Temp,
## data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2377 -1.7117 -0.4551 2.3614 5.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.9197 11.8960 -3.356 0.00375 **
## Acid.Conc. -0.1521 0.1563 -0.973 0.34405
## Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
## Water.Temp 1.2953 0.3680 3.520 0.00263 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
## F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
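The next summary drops Acid.Conc., which had the largest p-value. A sketch of that fit (object name z2 is our own):
z2 = lm(stack.loss ~ Air.Flow + Water.Temp, data = stackloss)
summary(z2)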
##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp, data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5290 -1.7505 0.1894 2.1156 5.6588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.3588 5.1383 -9.801 1.22e-08 ***
## Air.Flow 0.6712 0.1267 5.298 4.90e-05 ***
## Water.Temp 1.2954 0.3675 3.525 0.00242 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.239 on 18 degrees of freedom
## Multiple R-squared: 0.9088, Adjusted R-squared: 0.8986
## F-statistic: 89.64 on 2 and 18 DF, p-value: 4.382e-10
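The ANOVA comparison below can be reproduced with the following sketch, using the z1 and z2 objects from the sketches above:
anova(z1, z2)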
## Analysis of Variance Table
##
## Model 1: stack.loss ~ Acid.Conc. + Air.Flow + Water.Temp
## Model 2: stack.loss ~ Air.Flow + Water.Temp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17 178.83
## 2 18 188.79 -1 -9.9654 0.9473 0.344
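A sketch of the single-variable fit whose summary appears below (object name z3 is our own):
z3 = lm(stack.loss ~ Air.Flow, data = stackloss)
summary(z3)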
##
## Call:
## lm(formula = stack.loss ~ Air.Flow, data = stackloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2896 -1.1272 -0.0459 1.1166 8.8728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -44.13202 6.10586 -7.228 7.31e-07 ***
## Air.Flow 1.02031 0.09995 10.208 3.77e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.098 on 19 degrees of freedom
## Multiple R-squared: 0.8458, Adjusted R-squared: 0.8377
## F-statistic: 104.2 on 1 and 19 DF, p-value: 3.774e-09
Perform the anova test:
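A sketch, comparing the two-variable and one-variable models from the sketches above:
anova(z2, z3)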
## Analysis of Variance Table
##
## Model 1: stack.loss ~ Air.Flow + Water.Temp
## Model 2: stack.loss ~ Air.Flow
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 18 188.80
## 2 19 319.12 -1 -130.32 12.425 0.002419 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For this, create a set of linear regressions for combinations of the individual variables (a sketch of one way to compute these values is given after the list below).
stack.loss~Air.Flow
Value for \(R^{2}\):
## [1] 0.8457809
stack.loss~Water.Temp
Value for \(R^{2}\):
## [1] 0.766508
Air.Flow~Water.Temp
Value for \(R^{2}\):
## [1] 0.6112931
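A sketch of one way to obtain the three \(R^{2}\) values listed above, extracting r.squared from each summary as shown earlier with mtcars:
summary(lm(stack.loss ~ Air.Flow, data = stackloss))$r.squared
summary(lm(stack.loss ~ Water.Temp, data = stackloss))$r.squared
summary(lm(Air.Flow ~ Water.Temp, data = stackloss))$r.squared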
Real data may often have missing values, which are represented in R as NA. In some data sets you may see them represented in other forms, like NaN (not a number), or by an unrealistic number (e.g., -999). For R to handle the missing values correctly, you have to recode them as NA; if they are not coded like that, R does not recognize them and will try to process them as actual numbers.
Let’s say that for a data frame we have called df, NA values are currently represented as -999. If we type df == -999 into the console, we’ll get a matrix of TRUE, FALSE or NA values (if NA is already present).
To replace these -999 values, we use the command below, which says that for all locations in the data frame with a value of -999, we should assign the value NA instead.
df[df == -999] <- NA
Also be aware that many R functions have arguments that allow you to specify how they will process NA values. For example, if you type ?mean in the R console, you’ll see that the mean() function has a na.rm argument that specifies if NA values should be removed before computing the mean.
Calculating mean() with NA values included:
mean(c(3,7,3,NA,5))
## [1] NA
Calculating mean() and removing NA values:
mean(c(3,7,3,NA,5), na.rm=TRUE)
## [1] 4.5
The lm() function has a na.action argument that specifies how to deal with missing values.
df <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
# remove rows from the data frame containing NA
na.omit(df)
## x y
## 1 1 0
## 2 2 10
lm(y~x, df, na.action=na.fail)
## Error in na.fail.default(structure(list(y = c(0, 10, NA), x = c(1, 2, : missing values in object
lm(y~x, df, na.action=na.omit)
##
## Call:
## lm(formula = y ~ x, data = df, na.action = na.omit)
##
## Coefficients:
## (Intercept) x
## -10 10
lm(y~x, df, na.action=na.pass)
## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): NA/NaN/Inf in 'y'
The dplyr library also deals with NA values in a specific way: if you filter rows based on a condition (e.g. hp >= 250) using the filter function, then rows with NA values will be filtered out. If you want to keep the rows with NA values, then you need to specify something like hp >= 250 | is.na(hp). The function is.na() will return TRUE for any NA values (a short sketch of this behaviour is shown below).
For this exercise, download the Carbon.csv data into R as a data frame called input1.
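Before starting the exercise, here is a minimal sketch of the filter()/is.na() behaviour described above, using the built-in mtcars data (the introduced NA is artificial, for demonstration only):
library(dplyr)
mt = mtcars
mt$hp[1] = NA                        # introduce a missing value for demonstration
filter(mt, hp >= 250)                # the row with NA in hp is dropped
filter(mt, hp >= 250 | is.na(hp))    # the row with NA in hp is kept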
After loading the csv file into R, try to replace the -999 values with NA. If you run summary before the replacement, you should see:
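A sketch of the loading step; read.csv() with its default settings and the file name are assumptions, so adjust them to match how you downloaded and saved the file:
input1 = read.csv("Carbon.csv")
summary(input1)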
## date_start elementary.carbon.from.fossil.fuels BB_EC
## 01/03/12 12:35: 1 Min. :0.1077 Min. :0.01385
## 01/06/11 13:30: 1 1st Qu.:0.2289 1st Qu.:0.02643
## 02/05/11 15:15: 1 Median :0.3446 Median :0.06442
## 02/12/11 09:15: 1 Mean :0.5839 Mean :0.12673
## 06/06/11 11:45: 1 3rd Qu.:0.6550 3rd Qu.:0.14614
## 07/07/11 13:30: 1 Max. :2.2389 Max. :0.66726
## (Other) :17
## FF_OC BB_OC BIO_OC
## Min. :-999.0000 Min. :-999.0000 Min. :0.2535
## 1st Qu.: 0.2770 1st Qu.: 0.1179 1st Qu.:0.4915
## Median : 0.4028 Median : 0.3220 Median :0.7457
## Mean : -42.8049 Mean : -42.8112 Mean :1.1032
## 3rd Qu.: 0.5454 3rd Qu.: 0.7306 3rd Qu.:1.2399
## Max. : 2.2815 Max. : 3.3353 Max. :4.0950
##
## FF_WIOC C_WIOC FF_WSOC
## Min. :-999.0000 Min. :0.04875 Min. :-0.16657
## 1st Qu.: 0.1436 1st Qu.:0.12869 1st Qu.: 0.09897
## Median : 0.1950 Median :0.26482 Median : 0.17624
## Mean : -86.5462 Mean :0.42041 Mean : 0.28935
## 3rd Qu.: 0.3089 3rd Qu.:0.42538 3rd Qu.: 0.35644
## Max. : 1.4266 Max. :1.80312 Max. : 1.05329
##
## C_WSOC
## Min. :-999.0000
## 1st Qu.: 0.4852
## Median : 0.8561
## Mean : -42.1394
## 3rd Qu.: 1.5079
## Max. : 4.2263
##
After replacing them, using summary you should see that several columns now have NA values:
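A sketch of the replacement, using the same approach shown earlier for df:
input1[input1 == -999] <- NA
summary(input1)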
## date_start elementary.carbon.from.fossil.fuels BB_EC
## 01/03/12 12:35: 1 Min. :0.1077 Min. :0.01385
## 01/06/11 13:30: 1 1st Qu.:0.2289 1st Qu.:0.02643
## 02/05/11 15:15: 1 Median :0.3446 Median :0.06442
## 02/12/11 09:15: 1 Mean :0.5839 Mean :0.12673
## 06/06/11 11:45: 1 3rd Qu.:0.6550 3rd Qu.:0.14614
## 07/07/11 13:30: 1 Max. :2.2389 Max. :0.66726
## (Other) :17
## FF_OC BB_OC BIO_OC FF_WIOC
## Min. :0.1173 Min. :0.06923 Min. :0.2535 Min. :0.08086
## 1st Qu.:0.2867 1st Qu.:0.12886 1st Qu.:0.4915 1st Qu.:0.17058
## Median :0.4328 Median :0.32871 Median :0.7457 Median :0.19604
## Mean :0.6585 Mean :0.65190 Mean :1.1032 Mean :0.35414
## 3rd Qu.:0.5580 3rd Qu.:0.77068 3rd Qu.:1.2399 3rd Qu.:0.31171
## Max. :2.2815 Max. :3.33533 Max. :4.0950 Max. :1.42661
## NA's :1 NA's :1 NA's :2
## C_WIOC FF_WSOC C_WSOC
## Min. :0.04875 Min. :-0.16657 Min. :0.2886
## 1st Qu.:0.12869 1st Qu.: 0.09897 1st Qu.:0.5394
## Median :0.26482 Median : 0.17624 Median :0.8566
## Mean :0.42041 Mean : 0.28935 Mean :1.3542
## 3rd Qu.:0.42538 3rd Qu.: 0.35644 3rd Qu.:1.5549
## Max. :1.80312 Max. : 1.05329 Max. :4.2263
## NA's :1