A test of the variances of two populations with sizes \(n_1\) and \(n_2\) involves the following:
The var.test function:
var.test(x, y, ratio = 1,
         alternative = c("two.sided", "less", "greater"),
         conf.level = 0.95, ...)
Relevant arguments for this practical:
x, y
- numeric vectors of data values
ratio
- the hypothesized ratio of the population variances of x and y
alternative
- a character string specifying the alternative hypothesis; must be one of “two.sided” (default), “greater” or “less”. You can specify just the initial letter.
conf.level
- confidence level for the returned confidence interval
Take 100 samples from two normal distributions with different means but the same variance. Will the test recognize that we should not reject the null hypothesis?
x <- rnorm(100, mean=0) # if you don't specify the sd, it is set to 1 by default
y <- rnorm(100, mean=1)
F test to compare two variances:
var.test(x, y, ratio = 1, alternative = "two.sided", conf.level = 0.95)
##
## F test to compare two variances
##
## data: x and y
## F = 0.53233, num df = 99, denom df = 99, p-value = 0.001913
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.3581720 0.7911629
## sample estimates:
## ratio of variances
## 0.5323273
Now take 100 samples from two normal distributions with the same mean (mean = 0), but different standard deviations (sd = 1 and sd = 1.2). Run var.test again. You should see:
##
## F test to compare two variances
##
## data: x and y
## F = 0.63133, num df = 99, denom df = 99, p-value = 0.02306
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4247872 0.9383086
## sample estimates:
## ratio of variances
## 0.6313331
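The commands below are a sketch of how this output was produced. Since rnorm draws random samples, your numbers will differ from those shown unless you happen to use the same seed; the seed value here is an arbitrary choice for reproducibility.

```r
# same mean, different standard deviations
set.seed(1)  # arbitrary seed so the run is reproducible
x <- rnorm(100, mean = 0, sd = 1)
y <- rnorm(100, mean = 0, sd = 1.2)
var.test(x, y, ratio = 1, alternative = "two.sided", conf.level = 0.95)
```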
For different means but the same variance (here with samples of size 10), you should see:
##
## F test to compare two variances
##
## data: x and y
## F = 0.99091, num df = 9, denom df = 9, p-value = 0.9894
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.2461272 3.9893853
## sample estimates:
## ratio of variances
## 0.9909069
For the same mean but different variances (here with samples of size 10), you should see:
##
## F test to compare two variances
##
## data: x and y
## F = 0.69366, num df = 9, denom df = 9, p-value = 0.5946
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1722954 2.7926725
## sample estimates:
## ratio of variances
## 0.6936604
Before we begin constructing tables from categorical data, we should review a few useful functions for working with tabular data.
Get a list of columns in a data frame:
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
Summarize a data frame, showing the columns and statistical information about the values in those columns:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Extracting columns with the $ operator:
iris$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
The table command
By typing ?table in the console, you see that table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
table(...,
exclude = if (useNA == "no") c(NA, NaN),
useNA = c("no", "ifany", "always"),
dnn = list.names(...), deparse.level = 1)
The relevant argument for this practical is:
...
- one or more objects which can be interpreted as factors (including character strings), or a list (or data frame) whose components can be so interpreted. (For as.table, arguments passed to specific methods; for as.data.frame, unused.)
In the examples below you’ll see that ... takes the form of one or two variables which we use to construct the contingency table.
Here we create a table from the cyl
(number of cylinders) column of the mtcars
data set:
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
Reading across, we see that there are 11 cars with 4 cylinders, 7 with 6 cylinders, and 14 with 8 cylinders.
Next we create a table from the cyl
and gear
(number of forward gears) columns of the mtcars
data set:
table(mtcars$cyl, mtcars$gear)
##
## 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
With the table
function, the first value you enter (mtcars$cyl
) will show up on the rows, and the second (mtcars$gear
) will show up in the columns. For example, from the table we can see that there are 12 cars with 8 cylinders and 3 gears, while there are 0 cars with 8 cylinders and 4 gears.
Load in data on heart attack victims. With R you can download a file directly from the Internet. Note that sep="\t" means that the file is tab-delimited:
df = read.csv("http://courses.statistics.com/Intro1/Lesson2/heartatk4R.txt", sep="\t")
Now construct a two way table from the categorical variables SEX
and DIED
. You should see:
##
## 0 1
## F 4298 767
## M 7136 643
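The command needed here is simply table(df$SEX, df$DIED). As a self-contained sketch that does not require the download, the same call works on a tiny made-up data frame with the same column names (the five rows below are invented for illustration):

```r
# toy stand-in for the heart attack data; only the SEX and DIED columns matter
df <- data.frame(SEX  = c("F", "F", "M", "M", "M"),
                 DIED = c(0,   1,   0,   0,   1))
# first argument on the rows (SEX), second in the columns (DIED)
table(df$SEX, df$DIED)
```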
Hypothesis: \(k\) possible events occur with probabilities \(p_i\)
Observations: counts for each event (\(n_1\), \(n_2\), …, \(n_i\), …, \(n_k\))
Test statistic:
\(\chi^{2} = \sum{}\frac{(observed - expected)^{2}}{expected} = \sum_{i=1}^{k} \frac{(n_i-E_i)^2}{E_i}\)
with \(E_i = n * p_i\)
For the chi-square test for goodness of fit, you only need to have two arguments specified:
chisq.test(x, p)
Arguments:
x
- a numeric vector or matrix. x can also be a vector of factors.
p
- a vector of probabilities of the same length as x. An error is given if any entry of p is negative.
If p is not specified, then chisq.test automatically assumes equal probabilities.
This is the dice example mentioned in lecture 6.
You roll a die 100 times, and these are the number of counts associated with each value of the die that was observed.
die = c(1,2,3,4,5,6)
count = c(11,16,25,13,21,14)
We would expect \(p_i = \frac{1}{6}\) for each value of \(i\)
For the counts at each die value, we would expect:
exp = rep(1/6, 6) * 100
exp
## [1] 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667
Use the chisq.test function to compute the test statistic and p-value:
chisq.test(count, p=rep(1/6, 6))
##
## Chi-squared test for given probabilities
##
## data: count
## X-squared = 8.48, df = 5, p-value = 0.1317
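We can verify this statistic by hand from the formula above:

```r
count <- c(11, 16, 25, 13, 21, 14)
exp   <- rep(1/6, 6) * 100           # expected counts under a fair die
chi2  <- sum((count - exp)^2 / exp)  # the chi-squared test statistic
chi2                                 # 8.48, matching the output above
pval  <- 1 - pchisq(chi2, df = 5)    # upper-tail probability
pval                                 # 0.1317
```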
The peak of the \(\chi^2\) density with 5 degrees of freedom lies at 3, meaning that with a fair die the most likely value of the \(\chi^2\) statistic is around 3.
This is the example of murder cases from the lecture
Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
---|---|---|---|---|---|---|
53 | 42 | 51 | 45 | 36 | 37 | 65 |
Are the murders equally distributed through the week or not at a confidence level of 0.95?
What are the test hypotheses, df, and p-value?
If you do things correctly (and set correct=FALSE
), you should see an output of:
##
## Chi-squared test for given probabilities
##
## data: murders
## X-squared = 13.319, df = 6, p-value = 0.03824
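One way to produce this output (the vector name murders is taken from the data: line in the output above):

```r
# observed murder counts, Sunday through Saturday
murders <- c(53, 42, 51, 45, 36, 37, 65)
# test against equal probability 1/7 for every day of the week
chisq.test(murders, p = rep(1/7, 7), correct = FALSE)
```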
Based on these results, we reject \(H_0\), since p < 0.05. For at least one day of the week, the murder rate differs from \(\frac{1}{7}\) at the 95% confidence level.
A 2-way table has entries \(n_{ij}\) for rows \(i = 1, \ldots, k\) and columns \(j = 1, \ldots, l\), with row totals \(n_{i\cdot}\), column totals \(n_{\cdot j}\), and grand total \(n\).
Hypothesis test: \(H_0\): the two variables are independent; \(H_a\): the two variables are not independent.
Test statistic:
\(\chi^2 = \sum_{i=1}^{k}\sum_{j=1}^{l} \frac{(n_{ij}-n_{i\cdot}*n_{\cdot j}/n)^2}{n_{i\cdot}*n_{\cdot j}/n}\)
The syntax for performing a chisq.test
for a two way table is very simple - in this case x is the matrix containing the data for the two way table:
chisq.test(x)
A set of cars are observed and per car it is recorded whether the parents and the children are wearing their seatbelts. We can construct the two-way table of the observations in the code below:
seatbelt = rbind(c(56, 8), c(2, 16))
colnames(seatbelt) = c("child_buckled", "child_unbuckled")
rownames(seatbelt) = c("parent_buckled", "parent_unbuckled")
seatbelt
## child_buckled child_unbuckled
## parent_buckled 56 8
## parent_unbuckled 2 16
We use the following hypotheses and confidence level: \(H_0\): parent and child seatbelt use are independent; \(H_a\): they are not independent; confidence level 0.95.
We now need to create the matrix of expected values, which shows what the counts would be if the variables were independent. Under independence, \(p(A \cap B) = p(A)p(B)\).
# calculate sums over the rows and columns
parent = rowSums(seatbelt)
child = colSums(seatbelt)
parent
## parent_buckled parent_unbuckled
## 64 18
child
## child_buckled child_unbuckled
## 58 24
For the sake of convenience, we want to have the rowSums
and colSums
values available in such a way that we can access the values with the $
operator like parent$parent_buckled
. To do this, we can convert the results to a list using as.list
. As you noticed in the previous results, we already have the names associated with the values (i.e. when printing the value of the parent
variable, you see the names parent_buckled
and parent_unbuckled
). If we didn’t see any names like this, then converting it to a list wouldn’t be useful.
parent = as.list(rowSums(seatbelt))
child = as.list(colSums(seatbelt))
# total population size (all adults & children)
total = sum(seatbelt)
# Expected value for parent buckled, child buckled
# The n11, n12, etc correspond to the notation used in the lecture slides
EV_pb_cb = (parent$parent_buckled * child$child_buckled) / total # n11
EV_pb_cu = (parent$parent_buckled * child$child_unbuckled) / total # n12
EV_pu_cb = (parent$parent_unbuckled * child$child_buckled) / total # n21
EV_pu_cu = (parent$parent_unbuckled * child$child_unbuckled) / total # n22
# create a new matrix of the expected values
# rbind combines the rows together
seatbelt_EV = rbind(c(EV_pb_cb, EV_pb_cu), c(EV_pu_cb, EV_pu_cu))
# copy the row and column names from the seatbelt matrix
rownames(seatbelt_EV) = rownames(seatbelt)
colnames(seatbelt_EV) = colnames(seatbelt)
seatbelt_EV
## child_buckled child_unbuckled
## parent_buckled 45.26829 18.731707
## parent_unbuckled 12.73171 5.268293
The method above is fine if your table contains only a few elements. A more efficient way to calculate seatbelt_EV, especially if you are dealing with a large table, is to use vector operations directly. Here t() transposes the vector returned by the colSums function, so we multiply a 2x1 matrix by a 1x2 matrix.
Note the %*% operator, which specifies that we want matrix multiplication. Simply using * would perform element-wise multiplication.
seatbelt_EV = (rowSums(seatbelt) %*% t(colSums(seatbelt)))/total
# add in the rownames, the column names are already present
rownames(seatbelt_EV) = rownames(seatbelt)
seatbelt_EV
## child_buckled child_unbuckled
## parent_buckled 45.26829 18.731707
## parent_unbuckled 12.73171 5.268293
We can now calculate \(\chi^2\) using the matrices:
sum(((seatbelt - seatbelt_EV)^2)/seatbelt_EV)
## [1] 39.5993
We can also do this directly via the chisq.test
function. Make sure to specify correct=FALSE
, as by default the chisq.test
function applies a correction if some counts in the table are low. If you leave the value empty or specify correct=TRUE
you’ll get a slightly deviating value.
chisq.test(seatbelt, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: seatbelt
## X-squared = 39.599, df = 1, p-value = 3.118e-10
Using the example from question 10.60 in the book (page 556): A study of potential age discrimination considers promotions among middle managers in a large company. The data are as follows:
 | Under 30 | 30–39 | 40–49 | 50 and Over | Total |
---|---|---|---|---|---|
Promoted | 9 | 29 | 32 | 10 | 80 |
Not promoted | 41 | 41 | 48 | 40 | 170 |
Totals | 50 | 70 | 80 | 50 | 250 |
Create the above table in R and perform the chisq.test
on it. Is there a statistically significant relation between age and promotions?
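A sketch of one way to set the table up, mirroring the rbind approach used for the seatbelt example (the row and column labels below are illustrative choices, not fixed names; the conclusion is left for you to draw):

```r
# promotion counts by age group (rows: promoted / not promoted)
promotions <- rbind(c(9, 29, 32, 10),
                    c(41, 41, 48, 40))
rownames(promotions) <- c("promoted", "not_promoted")
colnames(promotions) <- c("under30", "30-39", "40-49", "50+")
# correct=FALSE matches the earlier examples (it only affects 2x2 tables)
chisq.test(promotions, correct = FALSE)
```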
Here we want to test whether any of several means differ. We compare the within-group variance (\({SSD}_{W}\)) to the between-group variance (\({SSD}_{B}\)). If the means are all equal (i.e. the data all come from the same distribution), then the statistic \(F\) follows an F-distribution with \((k-1)\), \((n-k)\) degrees of freedom (\(F_{k-1,n-k}\)):
\(F = \frac{{MS}_{B}}{{MS}_{W}}\), where \({MS}_{B} = {SSD}_{B}/(k-1)\) and \({MS}_{W} = {SSD}_{W}/(n-k)\)
If the means are different, then \({MS}_{B} \gg {MS}_{W}\).
oneway.test(formula, data, subset, na.action, var.equal = FALSE)
Relevant arguments:
formula
- a formula of the form lhs ~ rhs, where lhs gives the sample values and rhs the corresponding groups
data
- an optional matrix or data frame (or similar: see model.frame) containing the variables in the formula. By default the variables are taken from environment(formula).
subset
- an optional vector specifying a subset of observations to be used. Normally we don’t use this option.
var.equal
- a logical variable indicating whether to treat the variances in the samples as equal. If TRUE, a simple F test for the equality of means in a one-way analysis of variance is performed. If FALSE, an approximate method of Welch (1951) is used, which generalizes the commonly known 2-sample Welch test to the case of arbitrarily many samples.
The formula we will use is cal~month (calories grouped by month). Both cal and month correspond to the names of columns in a data frame.
This is the example on number of calories consumed by month (page 37 of Lecture 6).
Fifteen subjects were randomly assigned to three time groups, and their daily calorie consumption was recorded (may, sep, dec).
We want to know if there is a difference in calorie consumption between the months.
# values for the different months
may <- c(2166,1568,2233,1882,2019)
sep <- c(2279,2075,2131,2009,1793)
dec <- c(2226,2154,2583,2010,2190)
# create a vector containing the 5 values from may, the 5 values from sep, etc.
cal = c(may, sep, dec)
cal
## [1] 2166 1568 2233 1882 2019 2279 2075 2131 2009 1793 2226 2154 2583 2010
## [15] 2190
# The cal vector shows the numbers, but we need another vector which helps us
# to know which months those values are associated with.
# We now need to create a vector of the same length as the cal vector
# where each element is text that corresponds to the month of the observation in cal
# since we have 5 values for may, sep & dec, we repeat "may", "sep", and "dec" 5 times each
month = c(rep("may", 5),
rep("sep", 5),
rep("dec", 5))
month
## [1] "may" "may" "may" "may" "may" "sep" "sep" "sep" "sep" "sep" "dec"
## [12] "dec" "dec" "dec" "dec"
# now we join these vectors together into a dataframe
data1= data.frame(cal, month)
# show what the data looks like
data1
## cal month
## 1 2166 may
## 2 1568 may
## 3 2233 may
## 4 1882 may
## 5 2019 may
## 6 2279 sep
## 7 2075 sep
## 8 2131 sep
## 9 2009 sep
## 10 1793 sep
## 11 2226 dec
## 12 2154 dec
## 13 2583 dec
## 14 2010 dec
## 15 2190 dec
# perform the 1-way ANOVA test
oneway.test(cal~month, data1)
##
## One-way analysis of means (not assuming equal variances)
##
## data: cal and month
## F = 1.5554, num df = 2.000, denom df = 7.813, p-value = 0.27
Based on these values, we cannot reject \(H_0\), i.e. the results do not support the hypothesis that calorie intake varies by month.
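For comparison, setting var.equal = TRUE (described in the arguments above) runs the classic equal-variance one-way ANOVA on the same data. A self-contained sketch; the resulting F and p-value will differ slightly from the Welch output shown above:

```r
may <- c(2166, 1568, 2233, 1882, 2019)
sep <- c(2279, 2075, 2131, 2009, 1793)
dec <- c(2226, 2154, 2583, 2010, 2190)
data1 <- data.frame(cal   = c(may, sep, dec),
                    month = rep(c("may", "sep", "dec"), each = 5))
# classic one-way ANOVA F test, assuming equal variances in all groups
oneway.test(cal ~ month, data1, var.equal = TRUE)
```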
Four groups of infants were observed, and data were compiled on their age at walking (in months). The four groups are defined as follows:
The values for each group are:
9.00 9.50 9.75 10.00 13.00 9.50
11.00 10.00 10.00 11.75 10.50 15.00
11.50 12.00 9.00 11.50 13.25 13.00
13.25 11.50 12.00 13.50 11.50
Using what you learned above, first put the data into a format that can be used with the one-way test, and conduct the test. Create a box plot or a normal probability plot (qqnorm and qqline from the previous practical) to see if the data are roughly normally distributed.
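A sketch of setting the data up, assuming each row of values above corresponds to one group. The labels g1 through g4 are placeholders standing in for the real group definitions:

```r
# ages at walking (months) per group; g1..g4 are placeholder labels
g1 <- c(9.00, 9.50, 9.75, 10.00, 13.00, 9.50)
g2 <- c(11.00, 10.00, 10.00, 11.75, 10.50, 15.00)
g3 <- c(11.50, 12.00, 9.00, 11.50, 13.25, 13.00)
g4 <- c(13.25, 11.50, 12.00, 13.50, 11.50)
age   <- c(g1, g2, g3, g4)
group <- rep(c("g1", "g2", "g3", "g4"), times = c(6, 6, 6, 5))
walking <- data.frame(age, group)
# Welch one-way test, as in the calories example
oneway.test(age ~ group, walking)
# quick normality check for one group
qqnorm(g1); qqline(g1)
```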