Tabular Data

Getting started

Make sure to load the tidyverse library so we can do plotting later on.

library(tidyverse)

Working with data frame columns (review)

One of the tasks we’ll show in this practical relates to constructing tables from categorical data. Before doing this, we should first review a few useful functions for working with tabular data in data frames.

The iris data frame is an example data set that is included with R, just like the mtcars data we worked with in previous practicals.

Get a list of columns in a data frame:

colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

Summarize a data frame, showing the columns and statistical information about the values in those columns:

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Extract a column with the $ symbol. Note that if you type iris$ and then press the Tab key, you will get a list of suggestions for all of the column names:

iris$Sepal.Length
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
##  [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
##  [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
##  [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
##  [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
##  [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

Cross Tabulation

Cross-tabulation, also known as a contingency table, is a technique that counts the number of observations at each combination of variable values. As an example, if you collected data on people’s hair color and eye color, a contingency table could be used to analyze the data and show how many people have each combination of hair color and eye color. Using this you could see how many people have both brown hair and green eyes, how many people have both blond hair and blue eyes, etc.

Cross tabulation can be performed in R using the table() function.

Syntax of the table() function

By typing ?table in the console, you will see: “table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.”

By “factor levels”, this just means the distinct values of the factors. Factors are the same thing as the categories or categorical data mentioned in the lecture.

table(...,
      exclude = if (useNA == "no") c(NA, NaN),
      useNA = c("no", "ifany", "always"),
      dnn = list.names(...), deparse.level = 1)

The relevant argument for this practical is:

  • ... - one or more objects which can be interpreted as factors (including character strings), or a list (or data frame) whose components can be so interpreted. (For as.table, arguments passed to specific methods; for as.data.frame, unused.)

In the examples below you’ll see that ... takes the form of one or two variables which we use to construct the contingency table.

Whenever you see ... in documentation for R, it generally means that multiple objects can be passed to the function.

Construct a table from one or two variables

Here we create a table from the cyl (number of cylinders) column of the mtcars data set:

table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14

Reading down each column (the top row shows the value of cyl, the bottom row the count), we see that there are:

  • 11 cars with 4 cylinders
  • 7 cars with 6 cylinders
  • 14 cars with 8 cylinders

Next we create a table from the cyl and gear (number of forward gears) columns of the mtcars data set:

table(mtcars$cyl, mtcars$gear)
##    
##      3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2

With the table function, the first variable you enter (mtcars$cyl) shows up in the rows, and the second (mtcars$gear) shows up in the columns. For example, from the table we can see that there are 12 cars with 8 cylinders and 3 gears, while there are 0 cars with 8 cylinders and 4 gears.

Exercise

Load in data on heart attack victims. With R you can download a file directly from the Internet. Note that sep="\t" means that the file is tab-delimited.

df = read.csv("https://github.com/cbdavis/DASM/raw/master/2017/data/heartatk4R.txt", sep="\t")

Now construct a two-way table from the categorical variables SEX and DIED. You should see:

##    
##        0    1
##   F 4298  767
##   M 7136  643
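
One possible solution is a single call to table(); since SEX is the first argument, it forms the rows:

table(df$SEX, df$DIED)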

Chi-square test for goodness of fit

Review from lecture

Hypothesis: \(k\) possible events that occur with probabilities \(p_i\)

Observations: counts for each event (\(n_1\), \(n_2\), …, \(n_i\), …, \(n_k\))

Test statistic:

\(\chi^{2} = \sum \frac{(\text{observed} - \text{expected})^{2}}{\text{expected}} = \sum_{i=1}^{k} \frac{(n_i-E_i)^2}{E_i}\)

with \(E_i = n * p_i\), where \(n\) is the total number of observations.

Syntax of chi-square test for goodness of fit

For the chi-square test for goodness of fit, you only need to have two arguments specified:

chisq.test(x, p) 

Arguments:

  • x - a numeric vector or matrix. x can also be a vector of factors.
  • p - a vector of probabilities of the same length as x. An error is given if any entry of p is negative.

If p is not specified, then chisq.test automatically assumes equal probabilities.

Example

This is the dice example mentioned in Lecture 6, Slide 5.

You roll a die 100 times; these are the observed counts for each face of the die.

die = c(1,2,3,4,5,6)
count = c(11,16,25,13,21,14)

For a fair die: what would be the expected probabilities?

We would expect \(p_i = \frac{1}{6}\) for each value of \(i\)

For the counts at each die value, we would expect:

exp = rep(1/6, 6) * 100
exp
## [1] 16.66667 16.66667 16.66667 16.66667 16.66667 16.66667

Question: Is the die fair at the 95% confidence level?

  • \(H_0\): the probability of throwing each number is equal = \(\frac{1}{6}\)
    • i.e. \(p_i = \frac{1}{6}\) for all \(i\)
  • \(H_a\): at least one number comes at a different frequency
    • i.e. \(p_i \neq \frac{1}{6}\) for at least one \(i\)

Use the chisq.test function to perform the same test in R:

chisq.test(count, p=rep(1/6, 6))
## 
##  Chi-squared test for given probabilities
## 
## data:  count
## X-squared = 8.48, df = 5, p-value = 0.1317
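
Since the p-value (0.1317) is greater than 0.05, we cannot reject \(H_0\): the observed counts are consistent with a fair die.

As a check, we can compute the same test statistic and p-value by hand, using the formula from the lecture and the count and exp values we defined above:

# test statistic: sum over all outcomes of (observed - expected)^2 / expected
sum((count - exp)^2 / exp)   # 8.48, matching X-squared above
# p-value: probability of a value at least this large under the chi-squared
# distribution with k - 1 = 5 degrees of freedom
1 - pchisq(8.48, df = 5)     # 0.1317, matching the p-value above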

[Figure: the \(\chi^2\) density with 5 degrees of freedom. The peak of this distribution lies at \(\chi^2 = 3\), meaning that with a fair die we have the highest probability of getting a \(\chi^2\) value of around 3.]

Exercise

This is the example of murder cases from the lecture.

Day:      Sunday  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday
Murders:      53      42       51         45        36      37        65

Are the murders equally distributed through the week or not at a confidence level of 0.95?

What are the test hypotheses, df, and p-value?

If you do things correctly, you should see an output of:

## 
##  Chi-squared test for given probabilities
## 
## data:  murders
## X-squared = 13.319, df = 6, p-value = 0.03824
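
One possible solution (a sketch; since equal probabilities are the default, p does not need to be specified):

murders = c(53, 42, 51, 45, 36, 37, 65)
chisq.test(murders)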

Based on these results, we reject \(H_0\), since p < 0.05: for at least one day of the week, the proportion of murders differs from \(\frac{1}{7}\) at the 95% confidence level.

Chi-square test of independence

Review from lecture 6

A 2-way table contains the counts \(n_{ij}\) for each combination of the two variables, together with the row totals \(n_{i\cdot}\), the column totals \(n_{\cdot j}\), and the grand total \(n\).

Hypothesis test:

  • \(\alpha\) = 0.05
  • \(H_0\) : the outcomes are independent
  • \(H_a\) : outcomes are dependent

Test statistic:

\(\chi^2 = \sum_{i=1}^{k}\sum_{j=1}^{l} \frac{(n_{ij}-n_{i\cdot}*n_{\cdot j}/n)^2}{n_{i\cdot}*n_{\cdot j}/n}\)

Syntax

The syntax for performing a chisq.test for a two-way table is very simple; in this case x is the matrix containing the data for the two-way table:

chisq.test(x) 

Example

A set of cars is observed, and for each car it is recorded whether the parent and the child are wearing their seatbelts. We can construct the two-way table of the observations in the code below:

seatbelt = rbind(c(56, 8), c(2, 16))
colnames(seatbelt) = c("child_buckled", "child_unbuckled")
rownames(seatbelt) = c("parent_buckled", "parent_unbuckled")

seatbelt
##                  child_buckled child_unbuckled
## parent_buckled              56               8
## parent_unbuckled             2              16

We use the following hypotheses and confidence level:

  • \(H_0\): the variables are independent
  • \(H_a\): the variables are dependent
  • \(\alpha=0.05\)

We now need to create an expected matrix, showing what the values would be if the variables were independent. For this we use the fact that under independence \(p(A \cap B) = p(A)p(B)\):

# calculate sums over the rows and columns
parent = rowSums(seatbelt) / sum(seatbelt)
parent
##   parent_buckled parent_unbuckled 
##        0.7804878        0.2195122
child = colSums(seatbelt) / sum(seatbelt)
child
##   child_buckled child_unbuckled 
##       0.7073171       0.2926829
probabilities = parent %*% t(child)
# add back in the row names and column names
colnames(probabilities) = c("child_buckled", "child_unbuckled")
rownames(probabilities) = c("parent_buckled", "parent_unbuckled")
probabilities
##                  child_buckled child_unbuckled
## parent_buckled       0.5520523      0.22843546
## parent_unbuckled     0.1552647      0.06424747
expected_matrix = probabilities * sum(seatbelt)
expected_matrix
##                  child_buckled child_unbuckled
## parent_buckled        45.26829       18.731707
## parent_unbuckled      12.73171        5.268293

While the lecture showed how to do the chi-squared test by hand, here we will show how to do it in R using the chisq.test() function. Make sure to specify correct=FALSE: for a 2-by-2 table, chisq.test applies a continuity correction by default, so if you leave the argument out or specify correct=TRUE you’ll get a slightly different value.

chisq.test(seatbelt, correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  seatbelt
## X-squared = 39.599, df = 1, p-value = 3.118e-10
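
Since the p-value (3.118e-10) is far below \(\alpha = 0.05\), we reject \(H_0\) and conclude that the seatbelt use of parents and children is dependent.

As a check, computing the test statistic by hand from the expected matrix we constructed above gives the same value:

# sum over all four cells of (observed - expected)^2 / expected
sum((seatbelt - expected_matrix)^2 / expected_matrix)   # 39.599, matching X-squared above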

Exercise

Using the example from question 10.60 in the book (page 556): A study of potential age discrimination considers promotions among middle managers in a large company. The data are as follows:

              Under 30   30–39   40–49   50 and Over   Total
Promoted             9      29      32            10      80
Not promoted        41      41      48            40     170
Total               50      70      80            50     250

Create the above table in R and perform the chisq.test on it. Is there a statistically significant relation between age and promotions?

## 
##  Pearson's Chi-squared test
## 
## data:  vals
## X-squared = 13.025, df = 3, p-value = 0.004582
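
One possible solution (a sketch; only the counts are passed to chisq.test, not the totals, and the row/column names are optional):

vals = rbind(c(9, 29, 32, 10), c(41, 41, 48, 40))
rownames(vals) = c("Promoted", "Not promoted")
colnames(vals) = c("Under 30", "30-39", "40-49", "50 and Over")
chisq.test(vals)

Since the p-value (0.004582) is below 0.05, there is a statistically significant relation between age and promotion.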

Q-Q Plots

Overview

A Q-Q plot (or Quantile-Quantile plot) can be used to see if the values in a sample are normally distributed.

Suppose you have a sample of values from an unknown population. In the example below we’re actually sampling from a normal distribution, but pretend for now that we don’t know where the data came from. (Because rnorm() draws random samples, your values will differ from those shown unless you set a random seed with set.seed().)

x = rnorm(50)
x
##  [1] -2.037687456  0.982705627 -0.150894325  0.098180946  0.005617086
##  [6] -1.127875624  0.006083578  1.222265650 -1.097998043  0.308884684
## [11]  0.423084752  0.045752119  1.503858226  1.982813949  0.688540191
## [16] -1.910151709 -0.227651442 -1.758438936 -0.280546433  0.840160655
## [21] -0.103500051 -0.982532328  0.165145652 -0.054335538  1.101896051
## [26]  1.381677729 -0.836436761  0.153019057  1.421888305 -0.894206718
## [31] -0.299182770 -0.660221177 -0.798789553 -0.286455477 -0.747798422
## [36] -0.829770184  0.600284224 -0.317889265 -1.365338820 -1.811824106
## [41]  1.416879322  1.066057759 -0.276979265 -0.071097830  0.078951984
## [46] -0.451363915  1.662397583  1.218293663  2.321434501  0.620630022

First you order the sample values from the lowest to the highest values. We can use the sort() function for this:

x = sort(x)
x
##  [1] -2.037687456 -1.910151709 -1.811824106 -1.758438936 -1.365338820
##  [6] -1.127875624 -1.097998043 -0.982532328 -0.894206718 -0.836436761
## [11] -0.829770184 -0.798789553 -0.747798422 -0.660221177 -0.451363915
## [16] -0.317889265 -0.299182770 -0.286455477 -0.280546433 -0.276979265
## [21] -0.227651442 -0.150894325 -0.103500051 -0.071097830 -0.054335538
## [26]  0.005617086  0.006083578  0.045752119  0.078951984  0.098180946
## [31]  0.153019057  0.165145652  0.308884684  0.423084752  0.600284224
## [36]  0.620630022  0.688540191  0.840160655  0.982705627  1.066057759
## [41]  1.101896051  1.218293663  1.222265650  1.381677729  1.416879322
## [46]  1.421888305  1.503858226  1.662397583  1.982813949  2.321434501

You can assign a “cumulative probability” \(P_i\) to each measurement with \(P_i = \frac{(i - 0.5)}{n}\).

i = seq_along(x)   # creates a sequence of 1, 2, ... up to the length of x (50 in this example)
P = (i - 0.5) / length(x)  # vector of values for each P_i
P
##  [1] 0.01 0.03 0.05 0.07 0.09 0.11 0.13 0.15 0.17 0.19 0.21 0.23 0.25 0.27
## [15] 0.29 0.31 0.33 0.35 0.37 0.39 0.41 0.43 0.45 0.47 0.49 0.51 0.53 0.55
## [29] 0.57 0.59 0.61 0.63 0.65 0.67 0.69 0.71 0.73 0.75 0.77 0.79 0.81 0.83
## [43] 0.85 0.87 0.89 0.91 0.93 0.95 0.97 0.99

Then at least a fraction \(P_i\) of the data values are smaller than or equal to \(x_i\); in other words, \(x_i\) is the \(P_i\)-th quantile of the sample.

As an example of this, our 10th element is:

x[10]
## [1] -0.8364368

and at the 10th location, our value for P is

P[10]
## [1] 0.19

This means that at least 19% of the values in the data set are smaller than -0.8364368.

We then create a new variable z containing the theoretical quantiles of a normal distribution (with the sample’s mean and standard deviation) at the probabilities in P:

z = qnorm(P, mean = mean(x), sd=sd(x))

We then plot z versus x. If the points fall along a straight line, then the data appear to be normally distributed:

plot(z,x)

In R we can automatically perform all these steps just by using qqnorm() and qqline().

The function qqnorm() draws the points in the scatter plot, and qqline() draws the line showing where we would expect the points to lie if they are normally distributed.

qqnorm(x)
qqline(x)   

Even with a low number of samples, many of the points are on the diagonal line, which indicates that our data (which is actually sampled from a normal distribution) looks like it could be normally distributed.

What if we try 500 samples instead of just 50?

x <- rnorm(500)
qqnorm(x)
qqline(x)

What about 10000 samples?

x <- rnorm(10000)
qqnorm(x)
qqline(x)

How does a sample from a uniform distribution look?

x <- runif(100) # 100 samples from a uniform distribution
qqnorm(x)
qqline(x)

Q-Q Plots Exercise

We next use the faithful data set which is included with R. You can type ?faithful in the RStudio console to get more information on it. This data is about the Old Faithful geyser in Yellowstone National Park in the US. This geyser is famous for erupting on a very regular schedule and the data set has information on how long the eruptions are (faithful$eruptions) and the amount of time until the next eruption (faithful$waiting).

Make a Q-Q plot of the waiting times (faithful$waiting):
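
One possible way, reusing qqnorm() and qqline() from above:

qqnorm(faithful$waiting)
qqline(faithful$waiting)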

This tells us that the waiting times are clearly not normally distributed. To investigate further, create a histogram of the values:
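
For example, with base R’s hist():

hist(faithful$waiting)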

From the histogram, we see that the distribution is clearly bimodal, as there are two distinct peaks.

To investigate further, create a scatter plot showing how the waiting time might be related to the length of the eruption.
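
One way to make this scatter plot is with ggplot from the tidyverse we loaded at the start (base R’s plot(faithful$eruptions, faithful$waiting) would also work):

ggplot(faithful, aes(x = eruptions, y = waiting)) + geom_point()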

We see a few things here:

  • The longer the eruption, the longer we will have to wait until the next one.
  • There seem to be two distinct clusters. It’s not clear what is causing this; since the data doesn’t record the date of each eruption, we don’t know whether the geyser randomly switches between short and long eruptions, or whether, for example, there used to be long eruptions but now only short ones occur due to factors such as earthquakes changing the water supply to the geyser.

We should split up the data into two sets: one containing all the eruptions that lasted less than three minutes, and the other containing those that lasted longer.

Now create two separate data frames from the faithful data frame, where faithful_short has eruptions < 3 and faithful_long has eruptions >= 3
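
One possible approach, using dplyr’s filter() from the tidyverse:

faithful_short = faithful %>% filter(eruptions < 3)
faithful_long = faithful %>% filter(eruptions >= 3)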

Create a Q-Q plot for the short eruptions:
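
For example (a sketch, plotting the eruption durations of the short group):

qqnorm(faithful_short$eruptions)
qqline(faithful_short$eruptions)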

Create a Q-Q plot for the long eruptions:
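
And similarly for the long group:

qqnorm(faithful_long$eruptions)
qqline(faithful_long$eruptions)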

If we split the data into two clusters, do they seem to be normally distributed?

One-way ANOVA test

Review from lecture

For this, we want to test if any of several means are different. We want to compare the within group variance (\({SSD}_{W}\)) to the between group variance (\({SSD}_{B}\)). If the means are all equal (i.e. the data are all from the same distribution) then the statistic \(F\) follows an F-distribution with \((k-1)\), \((n-k)\) degrees of freedom (\(F_{k-1,n-k}\)).

  • \({SSD}_W = \sum_{i} \sum_{j} (x_{ij}-\bar{x}_i)^2\)
  • \({SSD}_B = \sum_{i} \sum_{j} (\bar{x}_i - \bar{x}_{\cdot})^2 = \sum_{i} n_{i} (\bar{x}_i - \bar{x}_{\cdot})^2\)
  • \({MS}_W = \frac{{SSD}_{W}}{(n-k)}\)
  • \({MS}_B = \frac{{SSD}_{B}}{(k-1)}\)
  • \(F = \frac{{MS}_{B}}{{MS}_{W}}\)

  • If the group means are all the same, then \(F\) should be around 1
  • If the means are different, then \({MS}_{B} >> {MS}_{W}\)

Syntax of the 1-way test

oneway.test(formula, data, subset, na.action, var.equal = FALSE)

Relevant arguments:

  • formula - a formula of the form lhs ~ rhs where lhs gives the sample values and rhs the corresponding groups.
  • data - an optional matrix or data frame (or similar: see model.frame) containing the variables in the formula formula. By default the variables are taken from environment(formula).
  • subset - an optional vector specifying a subset of observations to be used. Normally we don’t use this option.
  • var.equal - a logical variable indicating whether to treat the variances in the samples as equal. If TRUE, then a simple F test for the equality of means in a one-way analysis of variance is performed. If FALSE, an approximate method of Welch (1951) is used, which generalizes the commonly known 2-sample Welch test to the case of arbitrarily many samples.

The formula we will use is cal~month (calories grouped by month). Both cal and month correspond to the names of the columns in a data frame.

Example

This is the example on number of calories consumed by month (Lecture 6, slide 43).

Fifteen subjects were randomly assigned to three time groups (may, sep, dec), and their daily calorie consumption was recorded.

We want to know if there is a difference in calorie consumption between the months.

# values for the different months
may <- c(2166,1568,2233,1882,2019) 
sep <- c(2279,2075,2131,2009,1793) 
dec <- c(2226,2154,2583,2010,2190) 

# create a vector containing the 5 values from may, the 5 values from sep, etc.
cal = c(may, sep, dec) 
cal
##  [1] 2166 1568 2233 1882 2019 2279 2075 2131 2009 1793 2226 2154 2583 2010
## [15] 2190
# The cal vector shows the numbers, but we need another vector which helps us 
# to know which months those values are associated with.
# We now need to create a vector of the same length as the cal vector
# where each element is text that corresponds to the month of the observation in cal
# since we have 5 values for may, sep & dec, we repeat "may", "sep", and "dec" 5 times each
month = c(rep("may", 5), 
          rep("sep", 5), 
          rep("dec", 5)) 
month
##  [1] "may" "may" "may" "may" "may" "sep" "sep" "sep" "sep" "sep" "dec"
## [12] "dec" "dec" "dec" "dec"
# now we join these vectors together into a dataframe
data1= data.frame(cal, month)
# show what the data looks like
data1
##     cal month
## 1  2166   may
## 2  1568   may
## 3  2233   may
## 4  1882   may
## 5  2019   may
## 6  2279   sep
## 7  2075   sep
## 8  2131   sep
## 9  2009   sep
## 10 1793   sep
## 11 2226   dec
## 12 2154   dec
## 13 2583   dec
## 14 2010   dec
## 15 2190   dec
# perform the 1-way ANOVA test
oneway.test(cal~month, data1)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  cal and month
## F = 1.5554, num df = 2.000, denom df = 7.813, p-value = 0.27

Based on these values (p = 0.27 > 0.05), we cannot reject \(H_0\); the results do not support the hypothesis that calorie intake varies by month.

Exercise

Four groups of infants were observed, and data were compiled on their age at walking (in months). The four groups are defined as follows:

  • active - test group receiving active training; these children had their walking and placing reflexes trained during four three-minute sessions that took place every day from their second to their eighth week of life.
  • passive - passive training group; these children received the same types of social and gross motor stimulation, but did not have their specific walking and placing reflexes trained.
  • none - no training; these children had no special training, but were tested along with the children who underwent active or passive training.
  • ctr.8w - eighth-week controls; these children had no training and were only tested at the age of 8 weeks.

The values for each group are:

  • active: c(9.00, 9.50, 9.75, 10.00, 13.00, 9.50)
  • passive: c(11.00, 10.00, 10.00, 11.75, 10.50, 15.00)
  • none: c(11.50, 12.00, 9.00, 11.50, 13.25, 13.00)
  • ctr.8w: c(13.25, 11.50, 12.00, 13.50, 11.50)

Using what you learned above, first put the data into a format that can be used with the one way test, and conduct the test.
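
One possible setup, following the same pattern as the calories example (a sketch; the variable names walking_age_months and group are chosen to match the output below):

active = c(9.00, 9.50, 9.75, 10.00, 13.00, 9.50)
passive = c(11.00, 10.00, 10.00, 11.75, 10.50, 15.00)
none = c(11.50, 12.00, 9.00, 11.50, 13.25, 13.00)
ctr.8w = c(13.25, 11.50, 12.00, 13.50, 11.50)

# combine the values into one vector, with a parallel vector giving each value's group
walking_age_months = c(active, passive, none, ctr.8w)
group = c(rep("active", 6), rep("passive", 6), rep("none", 6), rep("ctr.8w", 5))
walking = data.frame(walking_age_months, group)

# perform the 1-way ANOVA test
oneway.test(walking_age_months ~ group, walking)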

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  walking_age_months and group
## F = 2.7759, num df = 3.000, denom df = 10.506, p-value = 0.09373

Monte Carlo Simulation

Quick Review

We have a function \(x = f(u,v)\) where:

  • \(u\) has uncertainty \(s_u\)
  • \(v\) has uncertainty \(s_v\)

We then:

  • Construct a normal distribution \(f_{n,u}\) using the mean found for \(u\) and a standard deviation \(s_u\)
  • Construct a normal distribution \(f_{n,v}\) using the mean found for \(v\) and a standard deviation \(s_v\)
  • Take a random value \(u_1\) from \(f_{n,u}\) and \(v_1\) from \(f_{n,v}\) and calculate \(x_1 = f(u_1, v_1)\)
  • Repeat \(n\) times
  • Take the average (or median) of all \(x_i\) for \(i = 1, \ldots, n\)

Example

Assume that for a research project you are told to calculate the uncertainty for the following formula:

\[f(p,q) = \frac{p \cdot q}{p + q}\]

For both \(p\) and \(q\) we have evidence of the following means and standard deviations:

  • \(p\): \(\mu = 2\) and \(\sigma = 0.5\)
  • \(q\): \(\mu = 5\) and \(\sigma = 1\)

We assume that the uncertainties for \(p\) and \(q\) follow normal distributions

# number of samples
n <- 1000
# take n samples from normal distributions with different means and standard deviations
p <- rnorm(n, mean = 2, sd = 0.5)
q <- rnorm(n, mean = 5, sd = 1)

Calculate values for \(f(p,q) = \frac{p \cdot q}{p + q}\)

f <- p*q/(p+q)

Plot the histogram, using 100 bins (use breaks=100)

hist(f, breaks=100)

We now calculate the mean:

f_bar <- mean(f)
f_bar
## [1] 1.405288

and then calculate the standard deviation:

sig <- sd(f)
sig
## [1] 0.2680724

If we increase the number of samples drawn from the distributions we defined for \(p\) and \(q\), both the mean and the standard deviation converge on a single value (see pages 31-32 of Lecture 7 for more explanation).

[Figure: mean and standard deviation of \(f\) versus the number of samples, converging as the sample count grows; note the log scale for the x axis.]
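
A sketch of how you could check this convergence yourself (the exact values will vary from run to run):

for (n in c(10, 100, 1000, 10000, 100000)) {
  p = rnorm(n, mean = 2, sd = 0.5)
  q = rnorm(n, mean = 5, sd = 1)
  f = p * q / (p + q)
  # print the number of samples, the mean, and the standard deviation of f
  cat(n, mean(f), sd(f), "\n")
}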

Exercise

Now you’ve been given a different formula:

\[f(u,v) = u^2 + u + v\]

…and you’ve been told to perform a Monte Carlo simulation, this time with 10000 samples. We again assume that the variables u and v are normally distributed with the following means and standard deviations:

  • \(u\): \(\mu = 1.2\) and \(\sigma = 3\)
  • \(v\): \(\mu = 5\) and \(\sigma = 1.7\)

Now try to recreate the following results on your own.

Plot the histogram of \(x = f(u,v) = u^2 + u + v\)

Calculate mean(x) and sd(x) for \(x = f(u,v)\)
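
One possible solution (your exact values will differ slightly, since the samples are random):

n = 10000
u = rnorm(n, mean = 1.2, sd = 3)
v = rnorm(n, mean = 5, sd = 1.7)
x = u^2 + u + v
hist(x, breaks = 100)
mean(x)
sd(x)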

## [1] 16.51136
## [1] 16.35522