A test of the variances of two populations with sizes \(n_1\) and \(n_2\) involves the following:
The var.test function:
var.test(x, y, ratio = 1,
         alternative = c("two.sided", "less", "greater"),
         conf.level = 0.95, ...)
Relevant arguments for this practical:
x, y
- numeric vectors of data values
ratio
- the hypothesized ratio of the population variances of x and y
alternative
- a character string specifying the alternative hypothesis; must be one of “two.sided” (default), “greater” or “less”. You can specify just the initial letter.
conf.level
- confidence level for the returned confidence interval
Take 100 samples from two normal distributions with different means but the same variance. Will the test recognize that we should not reject the null hypothesis?
x <- rnorm(100, mean=0) # if you don't specify the sd, it is set to 1 by default
y <- rnorm(100, mean=1)
F test to compare two variances:
var.test(x, y, ratio = 1, alternative = "two.sided", conf.level = 0.95)
##
## F test to compare two variances
##
## data: x and y
## F = 0.53233, num df = 99, denom df = 99, p-value = 0.001913
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.3581720 0.7911629
## sample estimates:
## ratio of variances
## 0.5323273
Now take 100 samples from two normal distributions with the same mean (mean = 0), but different standard deviations (sd = 1 and sd = 1.2). Run var.test again. You should see:
##
## F test to compare two variances
##
## data: x and y
## F = 0.63133, num df = 99, denom df = 99, p-value = 0.02306
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4247872 0.9383086
## sample estimates:
## ratio of variances
## 0.6313331
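The commands below are a sketch of how this output was produced. Since rnorm draws random samples, your numbers will differ from those shown unless you happen to use the same seed; the seed value here is an arbitrary choice for reproducibility.

```r
# same mean, different standard deviations
set.seed(1)  # arbitrary seed so the run is reproducible
x <- rnorm(100, mean = 0, sd = 1)
y <- rnorm(100, mean = 0, sd = 1.2)
var.test(x, y, ratio = 1, alternative = "two.sided", conf.level = 0.95)
```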
For different means but the same variance (here with samples of size 10), you should see:
##
## F test to compare two variances
##
## data: x and y
## F = 0.99091, num df = 9, denom df = 9, p-value = 0.9894
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.2461272 3.9893853
## sample estimates:
## ratio of variances
## 0.9909069
For the same mean but different variances (here with samples of size 10), you should see:
##
## F test to compare two variances
##
## data: x and y
## F = 0.69366, num df = 9, denom df = 9, p-value = 0.5946
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1722954 2.7926725
## sample estimates:
## ratio of variances
## 0.6936604
Before we begin constructing tables from categorical data, we should review a few useful functions for working with tabular data.
Get a list of columns in a data frame:
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
Summarize a data frame, showing the columns and statistical information about the values in those columns:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Extracting columns with the $ operator:
iris$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
The table command
By typing ?table in the console, you see that table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
table(...,
exclude = if (useNA == "no") c(NA, NaN),
useNA = c("no", "ifany", "always"),
dnn = list.names(...), deparse.level = 1)
The relevant argument for this practical is:
...
- one or more objects which can be interpreted as factors (including character strings), or a list (or data frame) whose components can be so interpreted. (For as.table, arguments passed to specific methods; for as.data.frame, unused.)
In the examples below you’ll see that ... takes the form of one or two variables which we use to construct the contingency table.
Here we create a table from the cyl
(number of cylinders) column of the mtcars
data set:
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
Reading across, we see that there are 11 cars with 4 cylinders, 7 with 6 cylinders, and 14 with 8 cylinders.
Next we create a table from the cyl
and gear
(number of forward gears) columns of the mtcars
data set:
table(mtcars$cyl, mtcars$gear)
##
## 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
With the table
function, the first value you enter (mtcars$cyl
) will show up on the rows, and the second (mtcars$gear
) will show up in the columns. For example, from the table we can see that there are 12 cars with 8 cylinders and 3 gears, while there are 0 cars with 8 cylinders and 4 gears.
Load in data on heart attack victims. With R you can download a file directly from the Internet. Note that sep="\t" means that the file is tab-delimited:
df = read.csv("http://courses.statistics.com/Intro1/Lesson2/heartatk4R.txt", sep="\t")
Now construct a two way table from the categorical variables SEX
and DIED
. You should see:
##
## 0 1
## F 4298 767
## M 7136 643
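The command needed here is simply table(df$SEX, df$DIED). As a self-contained sketch that does not require the download, the same call works on a tiny made-up data frame with the same column names (the five rows below are invented for illustration):

```r
# toy stand-in for the heart attack data; only the SEX and DIED columns matter
df <- data.frame(SEX  = c("F", "F", "M", "M", "M"),
                 DIED = c(0,   1,   0,   0,   1))
# first argument on the rows (SEX), second in the columns (DIED)
table(df$SEX, df$DIED)
```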
Hypothesis: \(k\) possible events occur with probabilities \(p_i\)
Observations: counts for each event (\(n_1\), \(n_2\), …, \(n_i\), …, \(n_k\))
Test statistic:
\(\chi^{2} = \sum{}\frac{(observed - expected)^{2}}{expected} = \sum_{i=1}^{k} \frac{(n_i-E_i)^2}{E_i}\)
with \(E_i = n * p_i\)
For the chi-square test for goodness of fit, you only need to have two arguments specified:
chisq.test(x, p)
Arguments:
x
- a numeric vector or matrix. x can also be a vector of factors.
p
- a vector of probabilities of the same length as x. An error is given if any entry of p is negative.
If p is not specified, then chisq.test automatically assumes equal probabilities.
This is the dice example mentioned in lecture 6.
You roll a die 100 times, and these are the number of counts associated with each value of the die that was observed.
die = c(1,2,3,4,5,6)
count = c(11,16,25,13,21,14)
We would expect \(p_i = \frac{1}{6}\) for each value of \(i\)
For the counts at each die value, we would expect:
exp = rep(1/6, 6) * 100
exp
## [1] 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667
Use the chisq.test function to compute the test statistic and p-value:
chisq.test(count, p=rep(1/6, 6))
##
## Chi-squared test for given probabilities
##
## data: count
## X-squared = 8.48, df = 5, p-value = 0.1317
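We can verify this statistic by hand from the formula above:

```r
count <- c(11, 16, 25, 13, 21, 14)
exp   <- rep(1/6, 6) * 100           # expected counts under a fair die
chi2  <- sum((count - exp)^2 / exp)  # the chi-squared test statistic
chi2                                 # 8.48, matching the output above
pval  <- 1 - pchisq(chi2, df = 5)    # upper-tail probability
pval                                 # 0.1317
```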
The peak of the \(\chi^2\) density with 5 degrees of freedom lies at 3, meaning that with a fair die the most likely value of the \(\chi^2\) statistic is around 3.
This is the example of murder cases from the lecture
Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
---|---|---|---|---|---|---|
53 | 42 | 51 | 45 | 36 | 37 | 65 |
Are the murders equally distributed through the week or not at a confidence level of 0.95?
What are the test hypotheses, df, and p-value?
If you do things correctly (and set correct=FALSE
), you should see an output of:
##
## Chi-squared test for given probabilities
##
## data: murders
## X-squared = 13.319, df = 6, p-value = 0.03824
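One way to produce this output (the vector name murders is taken from the data: line in the output above):

```r
# observed murder counts, Sunday through Saturday
murders <- c(53, 42, 51, 45, 36, 37, 65)
# test against equal probability 1/7 for every day of the week
chisq.test(murders, p = rep(1/7, 7), correct = FALSE)
```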
Based on these results, we reject \(H_0\), since p < 0.05. For at least one day of the week, the murder rate differs from \(\frac{1}{7}\) at the 95% confidence level.
A 2-way table has entries \(n_{ij}\) for rows \(i = 1, \ldots, k\) and columns \(j = 1, \ldots, l\), with row totals \(n_{i\cdot}\), column totals \(n_{\cdot j}\), and grand total \(n\).
Hypothesis test: \(H_0\): the two variables are independent; \(H_a\): the two variables are not independent.
Test statistic:
\(\chi^2 = \sum_{i=1}^{k}\sum_{j=1}^{l} \frac{(n_{ij}-n_{i\cdot}*n_{\cdot j}/n)^2}{n_{i\cdot}*n_{\cdot j}/n}\)
The syntax for performing a chisq.test
for a two way table is very simple - in this case x is the matrix containing the data for the two way table:
chisq.test(x)
A set of cars are observed and per car it is recorded whether the parents and the children are wearing their seatbelts. We can construct the two-way table of the observations in the code below:
seatbelt = rbind(c(56, 8), c(2, 16))
colnames(seatbelt) = c("child_buckled", "child_unbuckled")
rownames(seatbelt) = c("parent_buckled", "parent_unbuckled")
seatbelt
## child_buckled child_unbuckled
## parent_buckled 56 8
## parent_unbuckled 2 16
We use the following hypotheses and confidence level: \(H_0\): parent and child seatbelt use are independent; \(H_a\): they are not independent; confidence level 0.95.
We now need to create the matrix of expected values, which shows what the counts would be if the variables were independent. Under independence, \(p(A \cap B) = p(A)p(B)\).
# calculate sums over the rows and columns
parent = rowSums(seatbelt)
child = colSums(seatbelt)
parent
## parent_buckled parent_unbuckled
## 64 18
child
## child_buckled child_unbuckled
## 58 24
For the sake of convenience, we want to have the rowSums
and colSums
values available in such a way that we can access the values with the $
operator like parent$parent_buckled
. To do this, we can convert the results to a list using as.list
. As you noticed in the previous results, we already have the names associated with the values (i.e. when printing the value of the parent
variable, you see the names parent_buckled
and parent_unbuckled
). If we didn’t see any names like this, then converting it to a list wouldn’t be useful.
parent = as.list(rowSums(seatbelt))
child = as.list(colSums(seatbelt))
# total population size (all adults & children)
total = sum(seatbelt)
# Expected value for parent buckled, child buckled
# The n11, n12, etc correspond to the notation used in the lecture slides
EV_pb_cb = (parent$parent_buckled * child$child_buckled) / total # n11
EV_pb_cu = (parent$parent_buckled * child$child_unbuckled) / total # n12
EV_pu_cb = (parent$parent_unbuckled * child$child_buckled) / total # n21
EV_pu_cu = (parent$parent_unbuckled * child$child_unbuckled) / total # n22
# create a new matrix of the expected values
# rbind combines the rows together
seatbelt_EV = rbind(c(EV_pb_cb, EV_pb_cu), c(EV_pu_cb, EV_pu_cu))
# copy the row and column names from the seatbelt matrix
rownames(seatbelt_EV) = rownames(seatbelt)
colnames(seatbelt_EV) = colnames(seatbelt)
seatbelt_EV
## child_buckled child_unbuckled
## parent_buckled 45.26829 18.731707
## parent_unbuckled 12.73171 5.268293
The method above is fine if your table contains only a few elements. A more efficient way to calculate seatbelt_EV, especially if you are dealing with a large table, is to use vector operations directly. Here t() transposes the vector returned by the colSums function, so we multiply a 2x1 matrix by a 1x2 matrix.
Note the %*% operator, which specifies that we want matrix multiplication. Simply using * would perform element-wise multiplication.
seatbelt_EV = (rowSums(seatbelt) %*% t(colSums(seatbelt)))/total
# add in the rownames, the column names are already present
rownames(seatbelt_EV) = rownames(seatbelt)
seatbelt_EV
## child_buckled child_unbuckled
## parent_buckled 45.26829 18.731707
## parent_unbuckled 12.73171 5.268293
We can now calculate \(\chi^2\) using the matrices:
sum(((seatbelt - seatbelt_EV)^2)/seatbelt_EV)
## [1] 39.5993
We can also do this directly via the chisq.test
function. Make sure to specify correct=FALSE
, as by default the chisq.test
function applies a correction if some counts in the table are low. If you leave the value empty or specify correct=TRUE
you’ll get a slightly deviating value.
chisq.test(seatbelt, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: seatbelt
## X-squared = 39.599, df = 1, p-value = 3.118e-10
Using the example from question 10.60 in the book (page 556): A study of potential age discrimination considers promotions among middle managers in a large company. The data are as follows:
 | Under 30 | 30–39 | 40–49 | 50 and Over | Total |
---|---|---|---|---|---|
Promoted | 9 | 29 | 32 | 10 | 80 |
Not promoted | 41 | 41 | 48 | 40 | 170 |
Totals | 50 | 70 | 80 | 50 | 250 |
Create the above table in R and perform the chisq.test
on it. Is there a statistically significant relation between age and promotions?
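A sketch of one way to set the table up, mirroring the rbind approach used for the seatbelt example (the row and column labels below are illustrative choices, not fixed names; the conclusion is left for you to draw):

```r
# promotion counts by age group (rows: promoted / not promoted)
promotions <- rbind(c(9, 29, 32, 10),
                    c(41, 41, 48, 40))
rownames(promotions) <- c("promoted", "not_promoted")
colnames(promotions) <- c("under30", "30-39", "40-49", "50+")
# correct=FALSE matches the earlier examples (it only affects 2x2 tables)
chisq.test(promotions, correct = FALSE)
```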
Here we want to test whether any of several means differ. We compare the within-group variance (\({SSD}_{W}\)) to the between-group variance (\({SSD}_{B}\)). If the means are all equal (i.e. the data all come from the same distribution), then the statistic \(F\) follows an F-distribution with \((k-1)\), \((n-k)\) degrees of freedom (\(F_{k-1,n-k}\)):
\(F = \frac{{MS}_{B}}{{MS}_{W}}\), where \({MS}_{B} = {SSD}_{B}/(k-1)\) and \({MS}_{W} = {SSD}_{W}/(n-k)\)
If the means are different, then \({MS}_{B} \gg {MS}_{W}\).
oneway.test(formula, data, subset, na.action, var.equal = FALSE)
Relevant arguments:
formula
- a formula of the form lhs ~ rhs, where lhs gives the sample values and rhs the corresponding groups
data
- an optional matrix or data frame (or similar: see model.frame) containing the variables in the formula. By default the variables are taken from environment(formula).
subset
- an optional vector specifying a subset of observations to be used. Normally we don’t use this option.
var.equal
- a logical variable indicating whether to treat the variances in the samples as equal. If TRUE, a simple F test for the equality of means in a one-way analysis of variance is performed. If FALSE, an approximate method of Welch (1951) is used, which generalizes the commonly known 2-sample Welch test to the case of arbitrarily many samples.
The formula we will use is cal~month (calories grouped by month). Both cal and month correspond to the names of columns in a data frame.
This is the example on number of calories consumed by month (page 37 of Lecture 6).
Fifteen subjects were randomly assigned to three time groups, and their daily calorie consumption was recorded (may, sep, dec).
We want to know if there is a difference in calorie consumption between the months.
# values for the different months
may <- c(2166,1568,2233,1882,2019)
sep <- c(2279,2075,2131,2009,1793)
dec <- c(2226,2154,2583,2010,2190)
# create a vector containing the 5 values from may, the 5 values from sep, etc.
cal = c(may, sep, dec)
cal
## [1] 2166 1568 2233 1882 2019 2279 2075 2131 2009 1793 2226 2154 2583 2010
## [15] 2190
# The cal vector shows the numbers, but we need another vector which helps us
# to know which months those values are associated with.
# We now need to create a vector of the same length as the cal vector
# where each element is text that corresponds to the month of the observation in cal
# since we have 5 values for may, sep & dec, we repeat "may", "sep", and "dec" 5 times each
month = c(rep("may", 5),
rep("sep", 5),
rep("dec", 5))
month
## [1] "may" "may" "may" "may" "may" "sep" "sep" "sep" "sep" "sep" "dec"
## [12] "dec" "dec" "dec" "dec"
# now we join these vectors together into a dataframe
data1= data.frame(cal, month)
# show what the data looks like
data1
## cal month
## 1 2166 may
## 2 1568 may
## 3 2233 may
## 4 1882 may
## 5 2019 may
## 6 2279 sep
## 7 2075 sep
## 8 2131 sep
## 9 2009 sep
## 10 1793 sep
## 11 2226 dec
## 12 2154 dec
## 13 2583 dec
## 14 2010 dec
## 15 2190 dec
# perform the 1-way ANOVA test
oneway.test(cal~month, data1)
##
## One-way analysis of means (not assuming equal variances)
##
## data: cal and month
## F = 1.5554, num df = 2.000, denom df = 7.813, p-value = 0.27
Based on these values, we cannot reject \(H_0\), i.e. the results do not support the hypothesis that calorie intake varies by month.
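For comparison, setting var.equal = TRUE (described in the arguments above) runs the classic equal-variance one-way ANOVA on the same data. A self-contained sketch; the resulting F and p-value will differ slightly from the Welch output shown above:

```r
may <- c(2166, 1568, 2233, 1882, 2019)
sep <- c(2279, 2075, 2131, 2009, 1793)
dec <- c(2226, 2154, 2583, 2010, 2190)
data1 <- data.frame(cal   = c(may, sep, dec),
                    month = rep(c("may", "sep", "dec"), each = 5))
# classic one-way ANOVA F test, assuming equal variances in all groups
oneway.test(cal ~ month, data1, var.equal = TRUE)
```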
Four groups of infants were observed, and data were compiled on their age at walking (in months). The four groups are defined as follows:
The values for each group are:
9.00 9.50 9.75 10.00 13.00 9.50
11.00 10.00 10.00 11.75 10.50 15.00
11.50 12.00 9.00 11.50 13.25 13.00
13.25 11.50 12.00 13.50 11.50
Using what you learned above, first put the data into a format that can be used with the one-way test, and conduct the test. Create a box plot or a normal probability plot (qqnorm and qqline from the previous practical) to see if the data are roughly normally distributed.
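A sketch of setting the data up, assuming each row of values above corresponds to one group. The labels g1 through g4 are placeholders standing in for the real group definitions:

```r
# ages at walking (months) per group; g1..g4 are placeholder labels
g1 <- c(9.00, 9.50, 9.75, 10.00, 13.00, 9.50)
g2 <- c(11.00, 10.00, 10.00, 11.75, 10.50, 15.00)
g3 <- c(11.50, 12.00, 9.00, 11.50, 13.25, 13.00)
g4 <- c(13.25, 11.50, 12.00, 13.50, 11.50)
age   <- c(g1, g2, g3, g4)
group <- rep(c("g1", "g2", "g3", "g4"), times = c(6, 6, 6, 5))
walking <- data.frame(age, group)
# Welch one-way test, as in the calories example
oneway.test(age ~ group, walking)
# quick normality check for one group
qqnorm(g1); qqline(g1)
```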