Make sure to load the tidyverse library so we can do plotting later on.
library(tidyverse)
One of the tasks we’ll show in this practical relates to constructing tables from categorical data. Before doing this, we should first review a few useful functions for working with tabular data in data frames.
The iris data frame is an example data set that is included with R, just like the mtcars data we worked with in previous practicals.
Get a list of columns in a data frame:
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
Summarize a data frame, showing the columns and statistical information about the values in those columns:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Extracting columns with the $ symbol. Note that if you type iris$ and then press the tab key, you will get a list of suggestions for all of the column names.
iris$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
Cross-tabulation, or a contingency table, is a technique that counts the number of observations for each combination of values of two or more variables. As an example, if you collected data on people's hair color and eye color, a contingency table could be used to summarize the data and show how many people have particular combinations of hair color and eye color. Using this you could see how many people have both brown hair and green eyes, how many people have both blond hair and blue eyes, etc.
Cross tabulation can be performed in R using the table() function.
By typing ?table in the console, you see: "table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels."
By "factor levels", this just means the distinct values of the factors. Factors are the same thing as the categories or categorical data mentioned in the lecture.
table(...,
exclude = if (useNA == "no") c(NA, NaN),
useNA = c("no", "ifany", "always"),
dnn = list.names(...), deparse.level = 1)
The relevant argument for this practical is:
... - one or more objects which can be interpreted as factors (including character strings), or a list (or data frame) whose components can be so interpreted. (For as.table, arguments passed to specific methods; for as.data.frame, unused.)
In the examples below you'll see that ... takes the form of one or two variables which we use to construct the contingency table. Whenever you see ... in documentation for R, it generally means that multiple objects can be passed to the function.
Here we create a table from the cyl (number of cylinders) column of the mtcars data set:
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
Reading across (bottom left -> top left, bottom center -> top center, etc.), we see that there are:
- 11 cars with 4 cylinders
- 7 cars with 6 cylinders
- 14 cars with 8 cylinders
Next we create a table from the cyl and gear (number of forward gears) columns of the mtcars data set:
table(mtcars$cyl, mtcars$gear)
##
## 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
With the table function, the first value you enter (mtcars$cyl) will show up on the rows, and the second (mtcars$gear) will show up in the columns. For example, from the table we can see that there are 12 cars with 8 cylinders and 3 gears, while there are 0 cars with 8 cylinders and 4 gears.
Load in data on heart attack victims. With R you can download a file directly from the Internet. Note that sep="\t" means that the file is tab-delimited.
df = read.csv("https://github.com/cbdavis/DASM/raw/master/2017/data/heartatk4R.txt", sep="\t")
Now construct a two-way table from the categorical variables SEX and DIED. You should see:
##
## 0 1
## F 4298 767
## M 7136 643
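One way to produce this table, using the table() function from earlier and the df data frame loaded above:
table(df$SEX, df$DIED)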
Hypothesis: \(k\) possible events that occur with probabilities \(p_i\).
Observations: counts for each event (\(n_1\), \(n_2\), …, \(n_i\), …, \(n_k\)), with \(n\) observations in total.
Test statistic:
\(\chi^{2} = \sum{}\frac{(observed - expected)^{2}}{expected} = \sum_{i=1}^{k} \frac{(n_i-E_i)^2}{E_i}\)
with \(E_i = n * p_i\)
For the chi-square test for goodness of fit, you only need to have two arguments specified:
chisq.test(x, p)
Arguments:
x - a numeric vector or matrix. x can also be a vector of factors.
p - a vector of probabilities of the same length as x. An error is given if any entry of p is negative.
If p is not specified, then chisq.test automatically assumes equal probabilities.
This is the dice example mentioned in Lecture 6, Slide 5.
You roll a die 100 times, and these are the counts observed for each value of the die.
die = c(1,2,3,4,5,6)
count = c(11,16,25,13,21,14)
We would expect \(p_i = \frac{1}{6}\) for each value of \(i\).
For the counts at each die value, we would expect:
exp = rep(1/6, 6) * 100
exp
## [1] 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667
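From the observed counts and these expected counts we can compute the test statistic by hand (a small check, added here for comparison with the chisq.test() output below):
sum((count - exp)^2 / exp)   # gives 8.48, the same X-squared value reported by chisq.test() below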
Use the chisq.test function to see the same results:
chisq.test(count, p=rep(1/6, 6))
##
## Chi-squared test for given probabilities
##
## data: count
## X-squared = 8.48, df = 5, p-value = 0.1317
If we plot the \(\chi^2\) distribution with 5 degrees of freedom, its peak is at 3, which means that with a fair die we have the highest probability of getting a \(\chi^2\) value of about 3:
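A minimal sketch of such a plot, using base R (an assumed implementation, not necessarily the original code):
curve(dchisq(x, df = 5), from = 0, to = 20, xlab = "chi-squared value", ylab = "density")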
This is the example of murder cases from the lecture.

| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|---|---|---|---|---|---|---|
| 53 | 42 | 51 | 45 | 36 | 37 | 65 |
Are the murders equally distributed through the week or not at a confidence level of 0.95?
What are the test hypotheses, df, and p-value?
If you do things correctly (and set correct=FALSE), you should see an output of:
##
## Chi-squared test for given probabilities
##
## data: murders
## X-squared = 13.319, df = 6, p-value = 0.03824
Based on these results, we reject \(H_0\), since p < 0.05: for at least one day of the week, the murder rate is different from \(\frac{1}{7}\) at the 95% confidence level.
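A sketch of one way to reproduce this output (the variable name murders matches the output above):
murders = c(53, 42, 51, 45, 36, 37, 65)
chisq.test(murders, p = rep(1/7, 7), correct = FALSE)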
A 2-way table contains the counts \(n_{ij}\) for each combination of the row variable (\(i = 1, \ldots, k\)) and the column variable (\(j = 1, \ldots, l\)), with row totals \(n_{i\cdot}\), column totals \(n_{\cdot j}\), and \(n\) observations in total.
Hypothesis test: \(H_0\): the two variables are independent; \(H_a\): the two variables are not independent.
Test statistic:
\(\chi^2 = \sum_{i=1}^{k}\sum_{j=1}^{l} \frac{(n_{ij}-n_{i\cdot}*n_{\cdot j}/n)^2}{n_{i\cdot}*n_{\cdot j}/n}\)
The syntax for performing a chisq.test for a two-way table is very simple - in this case x is the matrix containing the data for the two-way table:
chisq.test(x)
A set of cars is observed, and for each car it is recorded whether the parents and the children are wearing their seatbelts. We can construct the two-way table of the observations with the code below:
seatbelt = rbind(c(56, 8), c(2, 16))
colnames(seatbelt) = c("child_buckled", "child_unbuckled")
rownames(seatbelt) = c("parent_buckled", "parent_unbuckled")
seatbelt
## child_buckled child_unbuckled
## parent_buckled 56 8
## parent_unbuckled 2 16
We use the following hypotheses and confidence level: \(H_0\): parent and child seatbelt use are independent; \(H_a\): parent and child seatbelt use are not independent; confidence level 0.95.
We now need to create the expected matrix, which shows what the values would be if the two variables were independent. For this we need to find \(p(A \cap B) = p(A)p(B)\).
# calculate sums over the rows and columns
parent = rowSums(seatbelt) / sum(seatbelt)
parent
## parent_buckled parent_unbuckled
## 0.7804878 0.2195122
child = colSums(seatbelt) / sum(seatbelt)
child
## child_buckled child_unbuckled
## 0.7073171 0.2926829
probabilities = parent %*% t(child)
# add back in the row names and column names
colnames(probabilities) = c("child_buckled", "child_unbuckled")
rownames(probabilities) = c("parent_buckled", "parent_unbuckled")
probabilities
## child_buckled child_unbuckled
## parent_buckled 0.5520523 0.22843546
## parent_unbuckled 0.1552647 0.06424747
expected_matrix = probabilities * sum(seatbelt)
expected_matrix
## child_buckled child_unbuckled
## parent_buckled 45.26829 18.731707
## parent_unbuckled 12.73171 5.268293
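From the observed and expected counts we can also compute the test statistic by hand (a quick check; it should match the chisq.test() result below):
sum((seatbelt - expected_matrix)^2 / expected_matrix)   # approximately 39.6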
While the lecture showed how to do the chi-squared test by hand, here we will show how to do it in R using the chisq.test() function. Make sure to specify correct=FALSE, as by default the chisq.test function applies a continuity correction to 2x2 tables. If you leave the value at its default or specify correct=TRUE, you'll get a slightly different value.
chisq.test(seatbelt, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: seatbelt
## X-squared = 39.599, df = 1, p-value = 3.118e-10
Using the example from question 10.60 in the book (page 556): A study of potential age discrimination considers promotions among middle managers in a large company. The data are as follows:
| | Under 30 | 30–39 | 40–49 | 50 and Over | Total |
|---|---|---|---|---|---|
| Promoted | 9 | 29 | 32 | 10 | 80 |
| Not promoted | 41 | 41 | 48 | 40 | 170 |
| Totals | 50 | 70 | 80 | 50 | |
Create the above table in R and perform the chisq.test on it. Is there a statistically significant relation between age and promotions?
##
## Pearson's Chi-squared test
##
## data: vals
## X-squared = 13.025, df = 3, p-value = 0.004582
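A sketch of one possible solution (the variable name vals matches the output above; only the counts are entered, not the row and column totals):
vals = rbind(c(9, 29, 32, 10), c(41, 41, 48, 40))
chisq.test(vals)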
A Q-Q plot (or Quantile-Quantile plot) can be used to see if the values in a sample are normally distributed.
Suppose you have a sample of values from an unknown population. In the example below, we’re actually sampling from a normal distribution, but pretend for now that we don’t know where the data is from.
x = rnorm(50)
x
## [1] -2.037687456 0.982705627 -0.150894325 0.098180946 0.005617086
## [6] -1.127875624 0.006083578 1.222265650 -1.097998043 0.308884684
## [11] 0.423084752 0.045752119 1.503858226 1.982813949 0.688540191
## [16] -1.910151709 -0.227651442 -1.758438936 -0.280546433 0.840160655
## [21] -0.103500051 -0.982532328 0.165145652 -0.054335538 1.101896051
## [26] 1.381677729 -0.836436761 0.153019057 1.421888305 -0.894206718
## [31] -0.299182770 -0.660221177 -0.798789553 -0.286455477 -0.747798422
## [36] -0.829770184 0.600284224 -0.317889265 -1.365338820 -1.811824106
## [41] 1.416879322 1.066057759 -0.276979265 -0.071097830 0.078951984
## [46] -0.451363915 1.662397583 1.218293663 2.321434501 0.620630022
First you order the sample values from the lowest to the highest value. We can use the sort() function for this:
x = sort(x)
x
## [1] -2.037687456 -1.910151709 -1.811824106 -1.758438936 -1.365338820
## [6] -1.127875624 -1.097998043 -0.982532328 -0.894206718 -0.836436761
## [11] -0.829770184 -0.798789553 -0.747798422 -0.660221177 -0.451363915
## [16] -0.317889265 -0.299182770 -0.286455477 -0.280546433 -0.276979265
## [21] -0.227651442 -0.150894325 -0.103500051 -0.071097830 -0.054335538
## [26] 0.005617086 0.006083578 0.045752119 0.078951984 0.098180946
## [31] 0.153019057 0.165145652 0.308884684 0.423084752 0.600284224
## [36] 0.620630022 0.688540191 0.840160655 0.982705627 1.066057759
## [41] 1.101896051 1.218293663 1.222265650 1.381677729 1.416879322
## [46] 1.421888305 1.503858226 1.662397583 1.982813949 2.321434501
You can assign a “cumulative probability” \(P_i\) to each measurement with \(P_i = \frac{(i - 0.5)}{n}\).
i = seq_along(x) # creates a sequence of 1, 2, ... up to the length of x (50 in this example)
P = (i - 0.5) / length(x) # vector of values for each P_i
P
## [1] 0.01 0.03 0.05 0.07 0.09 0.11 0.13 0.15 0.17 0.19 0.21 0.23 0.25 0.27
## [15] 0.29 0.31 0.33 0.35 0.37 0.39 0.41 0.43 0.45 0.47 0.49 0.51 0.53 0.55
## [29] 0.57 0.59 0.61 0.63 0.65 0.67 0.69 0.71 0.73 0.75 0.77 0.79 0.81 0.83
## [43] 0.85 0.87 0.89 0.91 0.93 0.95 0.97 0.99
Then at least a fraction \(P_i\) of the data values are smaller than or equal to \(x_i\).
\(x_i(P_i)\) is the \(P_{i}\)-th quantile of the sample.
As an example of this, our 10th element is:
x[10]
## [1] -0.8364368
and at the 10th location, our value for P is
P[10]
## [1] 0.19
This means that at least 19% of the values in the data set are smaller than or equal to -0.8364368.
We then create a new variable z where we find the locations of the quantiles specified in P:
z = qnorm(P, mean = mean(x), sd=sd(x))
We then plot z versus x. If the points lie approximately on a straight line, then the data appears to be normally distributed:
plot(z,x)
In R we can automatically perform all these steps just by using qqnorm() and qqline(). The function qqnorm() draws the points in the scatter plot, and qqline() draws the line showing where we would expect the points to lie if they are normally distributed.
qqnorm(x)
qqline(x)
Even with a low number of samples, many of the points are on the diagonal line, which indicates that our data (which is actually sampled from a normal distribution) looks like it could be normally distributed.
What if we try 500 samples instead of just 50?
x <- rnorm(500)
qqnorm(x)
qqline(x)
What about 10000 samples?
x <- rnorm(10000)
qqnorm(x)
qqline(x)
How does a uniform distribution look?
x <- runif(100) # 100 samples from a uniform distribution
qqnorm(x)
qqline(x)
We next use the faithful data set which is included with R. You can type ?faithful in the RStudio console to get more information on it. This data is about the Old Faithful geyser in Yellowstone National Park in the US. This geyser is famous for erupting on a very regular schedule, and the data set has information on how long the eruptions are (faithful$eruptions) and the amount of time until the next eruption (faithful$waiting).
Make a Q-Q plot of the waiting times (faithful$waiting):
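A sketch using the functions introduced above:
qqnorm(faithful$waiting)
qqline(faithful$waiting)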
This tells us that the waiting times are clearly not normally distributed. To investigate further, create a histogram of the values:
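For example (a sketch, using base R's hist() as elsewhere in this practical):
hist(faithful$waiting)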
From the histogram, we see that the distribution is clearly bi-modal, as there are two distinct peaks.
To investigate further, create a scatter plot showing how the waiting time might be related to the length of the eruption.
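One way to do this (a sketch; since the tidyverse is loaded, ggplot can be used, although base R's plot() would work equally well):
ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point()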
We see a few things here: the points fall into two distinct clusters, with a gap around eruptions of roughly three minutes, and shorter eruptions tend to be followed by shorter waiting times while longer eruptions tend to be followed by longer waiting times.
We should split up the data into two sets, where one lists all the eruptions that lasted less than three minutes, and the other one contains those which are longer.
Now create two separate data frames from the faithful data frame, where faithful_short has eruptions < 3 and faithful_long has eruptions >= 3.
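A sketch of one way to do this, using dplyr's filter() from the tidyverse loaded at the start (base R subsetting would also work):
faithful_short <- faithful %>% filter(eruptions < 3)
faithful_long <- faithful %>% filter(eruptions >= 3)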
Create a Q-Q plot for the short eruptions:
Create a Q-Q plot for the long eruptions:
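A sketch covering both plots, assuming we again examine the waiting times within each subset (the eruption durations could be checked the same way):
qqnorm(faithful_short$waiting)
qqline(faithful_short$waiting)
qqnorm(faithful_long$waiting)
qqline(faithful_long$waiting)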
If we split the data into two clusters, do they seem to be normally distributed?
For this, we want to test whether any of several means are different. We compare the within-group variation (\({SSD}_{W}\)) to the between-group variation (\({SSD}_{B}\)); dividing each by its degrees of freedom gives the mean squares \({MS}_{W} = {SSD}_{W}/(n-k)\) and \({MS}_{B} = {SSD}_{B}/(k-1)\). If the means are all equal (i.e. the data are all from the same distribution) then the statistic \(F\) follows an F-distribution with \((k-1)\), \((n-k)\) degrees of freedom (\(F_{k-1,n-k}\)).
\(F = \frac{{MS}_{B}}{{MS}_{W}}\)
If the means are different, then \({MS}_{B} \gg {MS}_{W}\).
oneway.test(formula, data, subset, na.action, var.equal = FALSE)
Relevant arguments:
formula - a formula of the form lhs ~ rhs where lhs gives the sample values and rhs the corresponding groups.
data - an optional matrix or data frame (or similar: see model.frame) containing the variables in the formula. By default the variables are taken from environment(formula).
subset - an optional vector specifying a subset of observations to be used. Normally we don't use this option.
var.equal - a logical variable indicating whether to treat the variances in the samples as equal. If TRUE, then a simple F test for the equality of means in a one-way analysis of variance is performed. If FALSE, an approximate method of Welch (1951) is used, which generalizes the commonly known 2-sample Welch test to the case of arbitrarily many samples.
The formula we will use is cal~month (calories grouped by month). Both cal and month correspond to the names of the columns in a data frame.
This is the example on number of calories consumed by month (Lecture 6, slide 43).
Fifteen subjects were randomly assigned to three time groups, and their amount of daily calories consumed was recorded (may, sep, dec).
We want to know if there is a difference in calorie consumption between the months.
# values for the different months
may <- c(2166,1568,2233,1882,2019)
sep <- c(2279,2075,2131,2009,1793)
dec <- c(2226,2154,2583,2010,2190)
# create a vector containing the 5 values from may, the 5 values from sep, etc.
cal = c(may, sep, dec)
cal
## [1] 2166 1568 2233 1882 2019 2279 2075 2131 2009 1793 2226 2154 2583 2010
## [15] 2190
# The cal vector shows the numbers, but we need another vector which helps us
# to know which months those values are associated with.
# We now need to create a vector of the same length as the cal vector
# where each element is text that corresponds to the month of the observation in cal
# since we have 5 values for may, sep & dec, we repeat "may", "sep", and "dec" 5 times each
month = c(rep("may", 5),
rep("sep", 5),
rep("dec", 5))
month
## [1] "may" "may" "may" "may" "may" "sep" "sep" "sep" "sep" "sep" "dec"
## [12] "dec" "dec" "dec" "dec"
# now we join these vectors together into a dataframe
data1= data.frame(cal, month)
# show what the data looks like
data1
## cal month
## 1 2166 may
## 2 1568 may
## 3 2233 may
## 4 1882 may
## 5 2019 may
## 6 2279 sep
## 7 2075 sep
## 8 2131 sep
## 9 2009 sep
## 10 1793 sep
## 11 2226 dec
## 12 2154 dec
## 13 2583 dec
## 14 2010 dec
## 15 2190 dec
# perform the 1-way ANOVA test
oneway.test(cal~month, data1)
##
## One-way analysis of means (not assuming equal variances)
##
## data: cal and month
## F = 1.5554, num df = 2.000, denom df = 7.813, p-value = 0.27
Based on these values, we cannot reject \(H_0\), i.e. the results do not support the hypothesis that calorie intake varies by month.
A set of four groups of infants were observed and data was compiled on their age at walking (in months). The four groups are defined as follows:
The values for each group are:
c(9.00, 9.50, 9.75, 10.00, 13.00, 9.50)
c(11.00, 10.00, 10.00, 11.75, 10.50, 15.00)
c(11.50, 12.00, 9.00, 11.50, 13.25, 13.00)
c(13.25, 11.50, 12.00, 13.50, 11.50)
Using what you learned above, first put the data into a format that can be used with the one way test, and conduct the test.
##
## One-way analysis of means (not assuming equal variances)
##
## data: walking_age_months and group
## F = 2.7759, num df = 3.000, denom df = 10.506, p-value = 0.09373
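A sketch of one possible solution, following the same pattern as the calories example. The variable names walking_age_months and group match the output above; the group labels group1 to group4 are placeholders, since the group definitions are not shown here:
group1 <- c(9.00, 9.50, 9.75, 10.00, 13.00, 9.50)
group2 <- c(11.00, 10.00, 10.00, 11.75, 10.50, 15.00)
group3 <- c(11.50, 12.00, 9.00, 11.50, 13.25, 13.00)
group4 <- c(13.25, 11.50, 12.00, 13.50, 11.50)
walking_age_months <- c(group1, group2, group3, group4)
group <- c(rep("group1", 6), rep("group2", 6), rep("group3", 6), rep("group4", 5))
walking_data <- data.frame(walking_age_months, group)
oneway.test(walking_age_months ~ group, walking_data)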
We have a function \(x = f(u,v)\) where the inputs \(u\) and \(v\) are uncertain, each with a known (or assumed) mean and standard deviation.
We then:
- draw many random samples of \(u\) and \(v\) from their assumed distributions
- calculate \(x = f(u,v)\) for each pair of sampled values
- use the resulting distribution of \(x\) values to estimate the mean and the standard deviation (uncertainty) of \(x\)
Assume that for a research project you are told to calculate the uncertainty for the following formula:
\[f(p,q) = \frac{p \cdot q}{p + q}\]
For both \(p\) and \(q\) we have evidence of the following means and standard deviations: \(p\) has mean 2 and standard deviation 0.5, and \(q\) has mean 5 and standard deviation 1. We assume that the uncertainties for \(p\) and \(q\) follow normal distributions.
# take n samples
n <- 1000
# take n samples from normal distributions with different means and standard deviations
p <- rnorm(n, mean = 2, sd = 0.5)
q <- rnorm(n, mean = 5, sd = 1)
Calculate values for \(f(p,q) = \frac{p \cdot q}{p + q}\)
f <- p*q/(p+q)
Plot the histogram, using 100 bins (use breaks=100):
hist(f, breaks=100)
We now calculate the mean:
f_bar <- mean(f)
f_bar
## [1] 1.405288
and then calculate the standard deviation:
sig <- sd(f)
sig
## [1] 0.2680724
Below we can see how both the mean and the standard deviation converge on a single value as we increase the number of samples from the distributions we defined for \(p\) and \(q\) (see page 31,32 of Lecture 7 for more explanation). Note the log scale for the x axis.
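The convergence plot itself is not reproduced here, but a minimal sketch of how it could be generated (an assumed implementation, not the original code):
# recompute the mean and standard deviation of f for increasing sample sizes
ns <- round(10^seq(1, 5, by = 0.25))
results <- sapply(ns, function(n) {
  p <- rnorm(n, mean = 2, sd = 0.5)
  q <- rnorm(n, mean = 5, sd = 1)
  f <- p * q / (p + q)
  c(mean = mean(f), sd = sd(f))
})
par(mfrow = c(1, 2))
plot(ns, results["mean", ], log = "x", type = "b", xlab = "number of samples", ylab = "mean of f")
plot(ns, results["sd", ], log = "x", type = "b", xlab = "number of samples", ylab = "sd of f")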
Now you’ve been given a different formula:
\[f(u,v) = u^2 + u + v\]
…and you've been told to perform a Monte Carlo simulation with 10000 samples this time. We again assume that the variables u and v are normally distributed with the following means and standard deviations:
Now try to recreate the following results on your own.
Plot the histogram of \(x = f(u,v) = u^2 + u + v\)
Calculate mean(x) and std(x) for \(x = f(u,v)\)
## [1] 16.51136
## [1] 16.35522