1 Getting started

Make sure to load the ggplot2 library so we can do plotting.

library(ggplot2)

2 Bayes' Theorem

This example is based on the one presented in the lecture slides, but it goes into more of a visual explanation and also shows how you can use R to calculate Bayesian probabilities. Bayes' Theorem can be difficult to understand because you're calculating probabilities on different subsets of the whole population, and sometimes you really have to think about why \(P(A,B)\), \(P(A|B)\) and \(P(B|A)\) are all different, even though certain observations in the data influence the calculation of all three.

2.1 Problem Statement

We have the following definition of the events and probabilities:

  • Event A: the cow has BSE, with \(P(A)\) = 0.01
  • Event B: the test for BSE returns a positive result
  • \(P(B|A)\) = 0.9, the test's accuracy (i.e. the probability that the test is positive if the cow is infected)
  • \(P(B|A^{c})\) = 0.1, the false positive rate (i.e. the probability that the test is positive if the cow is not infected)

To visually explain how these different probabilities relate, this gray box represents 100% of the population of cows:

We know that 1% of the cow population has BSE:

But we also know that there are a lot of positive results:

Just from this, we can see:

  • there are way more positive test results than the number of cows which have BSE
  • therefore, getting a positive result may not be a good indication that the cow actually has BSE.

Combining everything together, we get the image shown below:

As you can see:

  • if a cow actually has BSE, then there is a high probability that the test will spot it.
  • However, since most cows (99%) are not infected and the test has a 10% false positive rate, a large part of the population gets positive test results without actually having BSE.
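To put a number on this, we can apply Bayes' Theorem, \(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\), directly to the values from the problem statement. This is a quick check (the variable names are just for this snippet); the same calculation is generalized in the code later in this section:

# P(A): cow has BSE
p_A = 0.01
# P(B|A): test is positive given the cow is infected
p_B_given_A = 0.9
# P(B|A^c): test is positive given the cow is not infected
p_B_given_not_A = 0.1

# P(B) by the law of total probability
p_B = (p_B_given_A * p_A) + (p_B_given_not_A * (1 - p_A))

# P(A|B): probability that a cow with a positive test actually has BSE
(p_B_given_A * p_A) / p_B
## [1] 0.08333333

So even with a positive test, the probability that the cow actually has BSE is only about 8.3%.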

Now what would happen if we had different values for the probability that the test was accurate (i.e. different values for \(P(B|A)\) and \(P(B|A^{c})\))?

To evaluate this:

  • We still use the same probability of a cow having BSE, so we keep \(P(A)\)=0.01.
  • We need to recalculate \(P(B)\), the probability of getting a positive result regardless of whether the cow has BSE. By the law of total probability, \(P(B) = P(B|A)P(A) + P(B|A^{c})P(A^{c})\).
  • All calculations are based on \(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\)

In R, we represent \(P(B|A)\) as the variable p_B_given_A and create a sequence from 0.8 to 1 in steps of 0.005 (seq(0.8,1,0.005)), covering a range where the test is anywhere from 80% to 100% accurate at spotting BSE given that the cow actually has BSE. Note that the code below assumes the false positive rate mirrors the accuracy, i.e. \(P(B|A^{c}) = 1 - P(B|A)\), just as in the original problem statement (0.1 = 1 - 0.9).

# probability has BSE
p_A = 0.01

# probability tests positive given has BSE (i.e. how accurate the test is if you have BSE)
p_B_given_A = seq(0.8,1,0.005)

# probability that the test is positive (independent of if you have BSE or not)
p_B = (p_A * p_B_given_A) +  # probability that the test is positive if you have BSE
  ((1-p_A) * (1-p_B_given_A))  # probability that the test is positive if you don't have BSE

df = data.frame(p_A = p_A,
                p_B_given_A = p_B_given_A,
                p_B = p_B)

df$p_A_given_B = (df$p_B_given_A * df$p_A) / df$p_B

We now have a data frame that looks like this:

head(df)
##    p_A p_B_given_A    p_B p_A_given_B
## 1 0.01       0.800 0.2060  0.03883495
## 2 0.01       0.805 0.2011  0.04002984
## 3 0.01       0.810 0.1962  0.04128440
## 4 0.01       0.815 0.1913  0.04260324
## 5 0.01       0.820 0.1864  0.04399142
## 6 0.01       0.825 0.1815  0.04545455

We can then plot the data to show the relationship between the different probabilities.

ggplot(df, aes(x=p_B_given_A, y=p_A_given_B)) + 
  geom_point() + 
  xlab("P(B|A) = P(Positive Test Results|Has BSE)
       If the cow has BSE, probability of spotting it with the test") + 
  ylab("P(A|B) = P(Has BSE|Positive Test Results)
       If cow has positive test results, probability that it actually has BSE")

There are two interesting things that we see here:

  • Past a certain point, as the test becomes more accurate at spotting cows with BSE, \(P(A|B)\) climbs faster than the corresponding increase in accuracy, because the false positives drop off more quickly than the accuracy grows.
  • If BSE spreads (i.e. \(P(A)\) increases), the share of positive results that are false positives goes down, and \(P(A|B)\) goes up without anything about the test itself changing. In other words, the predictive value of a positive test is partly a function of how prevalent the disease is in the population, as the quick check below shows.
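To see this second point concretely, here is a quick check (not part of the original code): we hold the test fixed at \(P(B|A) = 0.9\) and vary the prevalence \(P(A)\):

# hold the test fixed and vary the prevalence P(A)
p_B_given_A = 0.9
p_A = c(0.01, 0.05, 0.1, 0.25, 0.5)

# P(B), again assuming P(B|A^c) = 1 - P(B|A)
p_B = (p_A * p_B_given_A) + ((1 - p_A) * (1 - p_B_given_A))

# P(A|B) rises sharply as the disease becomes more prevalent
(p_B_given_A * p_A) / p_B
## [1] 0.08333333 0.32142857 0.50000000 0.75000000 0.90000000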

3 Distributions

3.1 Background

R contains functions that allow you to easily work with distributions. The names of these functions follow a standard format - you’ll see a d, r, p or q and then the name of the distribution. These letters stand for:

  • d - density function for distribution
  • r - random sample from distribution
  • p - cumulative distribution
  • q - quantile function
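For example, here is how the four prefixes look for the normal distribution (a quick illustration; the examples below go into more detail):

dnorm(0)    # density of the standard normal at x = 0
## [1] 0.3989423
pnorm(0)    # cumulative probability P(X <= 0)
## [1] 0.5
qnorm(0.5)  # the value below which 50% of the distribution lies
## [1] 0
rnorm(3)    # three random draws (output varies from run to run)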

Below you can see how these letters are combined with the names of the distributions:

Distribution Name   Random Samples   Density Function   Cumulative Distribution   Quantile
Normal              rnorm            dnorm              pnorm                     qnorm
Poisson             rpois            dpois              ppois                     qpois
Binomial            rbinom           dbinom             pbinom                    qbinom
Uniform             runif            dunif              punif                     qunif
Student t           rt               dt                 pt                        qt
Chi-Squared         rchisq           dchisq             pchisq                    qchisq
F                   rf               df                 pf                        qf

Main Source: Base R Cheat Sheet

Type ?Distributions into the RStudio Console to see the documentation on all of the distributions.

3.2 Examples

3.2.0.1 Plot a normal distribution and its cumulative distribution

# create a series of x values
xvals = seq(-3, 3, 0.01)

# create a data frame containing values for the density function (sampled at
# each value of xvals) and the cumulative distribution sampled at the same values

dist_data = data.frame(x=xvals, 
                       dist = dnorm(xvals),
                       cumulative_dist = pnorm(xvals))

# take a look at the data frame
head(dist_data)
##       x        dist cumulative_dist
## 1 -3.00 0.004431848     0.001349898
## 2 -2.99 0.004566590     0.001394887
## 3 -2.98 0.004704958     0.001441242
## 4 -2.97 0.004847033     0.001488999
## 5 -2.96 0.004992899     0.001538195
## 6 -2.95 0.005142641     0.001588870
# plot probability vs. cumulative probability
ggplot() + 
  geom_line(data=dist_data, aes(x=x, y=dist)) + 
  geom_line(data=dist_data, aes(x=x, y=cumulative_dist)) + 
  ylab("Probability") + 
  xlab("X Values")

3.2.0.2 Find the probability that x > x0 or x < x0

If you type ?pnorm in the console, you’ll find that the documentation for lower.tail states that “if TRUE (default), probabilities are P[X ≤ x] otherwise, P[X > x]”. This applies for the other distributions as well.

x0 = 1

# P(x > x0)
pnorm(x0, lower.tail=FALSE)
## [1] 0.1586553
1 - pnorm(x0, lower.tail=TRUE)
## [1] 0.1586553
# P(x < x0)
pnorm(x0, lower.tail=TRUE)
## [1] 0.8413447
1 - pnorm(x0, lower.tail=FALSE)
## [1] 0.8413447

3.2.0.3 Use the distributions to find quantiles

Below you can see the output of several example quantiles; each call returns the value that is greater than 80% (p=0.8) of all the values in the distribution.

qnorm(p=0.8)
## [1] 0.8416212
qunif(p=0.8, min=-1, max=1)
## [1] 0.6
qbinom(p=0.8, size=10, prob=0.5) # 10 trials, 50% probability of "success" per trial
## [1] 6

Here we plot a series of quantiles on a plot also showing the probability distribution and the cumulative distribution.

In the code, geom_vline(xintercept = qnorm(0.5), linetype="dashed") means that we should draw a vertical dashed line at the value specified by qnorm(0.5).

ggplot() + 
  geom_line(data=dist_data, aes(x=x, y=dist), colour="blue") + 
  geom_line(data=dist_data, aes(x=x, y=cumulative_dist), color="red") + 
  ylab("Probability") + 
  xlab("X Values") + 
  geom_vline(xintercept = qnorm(0.5), linetype="dashed") + 
  geom_vline(xintercept = qnorm(0.9), linetype="dashed") +
  geom_vline(xintercept = qnorm(0.95), linetype="dashed") + 
  geom_vline(xintercept = qnorm(0.99), linetype="dashed") + 
  ggtitle("Quantitles for normal distribution at 0.5, 0.9, 0.95, 0.99")

As we’d expect, the value for qnorm(0.5) is exactly where the cumulative probability is 0.5 as well. The same goes for values at 0.9, 0.95 and 0.99.

3.3 Take random samples from distributions

rnorm(10)
##  [1] -1.8027545  0.4551979 -0.9943766  0.6810542 -0.6473601 -0.4276190
##  [7] -0.4358254  0.3675553 -0.9970416  0.8915449
runif(10, min=-1, max=1)
##  [1] -0.4753282 -0.8342742  0.8763450  0.7309914  0.4790265  0.3773961
##  [7] -0.1567662  0.5277522  0.3783993 -0.1621227
rbinom(10, size=10, prob=0.5)
##  [1] 6 5 7 5 5 6 7 5 2 4

3.4 Using Q-Q plots to determine if a distribution is normal

As a first check, a Q-Q plot of samples drawn from a normal distribution should confirm that the data is indeed normally distributed, with the points falling along the reference line:

x <- rnorm(100) # 100 samples from a normal distribution
qqnorm(x)
qqline(x)

How does a uniform distribution look?

x <- runif(100) # 100 samples from a uniform distribution
qqnorm(x)
qqline(x)

3.4.1 Using Q-Q plots on a binomial distribution

Typing ?rbinom into the console will bring up a page showing the following parameters that need to be specified to generate random numbers from a binomial distribution:

  • n: number of observations
  • size: number of trials
  • prob: probability of success on each trial

In the code below, n=1000 means that we have 1000 observations. During each observation we do 50 trials (size=50) where the probability of a success is defined by prob=0.5.

x <- rbinom(n=1000, size=50, prob=0.5)

We can use range to see the minimum and maximum values:

range(x)
## [1] 14 36

In other words, there exists at least one observation where there were 14 successes and at least one observation where there were 36 successes.

We use + xlim(c(0,50)) to show the x axis for values from 0 to 50.

# Create a data frame for the values of x
# This results in a data frame with one column: data$x
data = as.data.frame(x)
ggplot(data, aes(x=x)) +  geom_histogram(binwidth = 1) + xlim(c(0,50))

qqnorm(x)
qqline(x)

If we try a more skewed distribution with prob=0.9, we get:

x <- rbinom(n=1000, size=50, prob=0.9)
data = as.data.frame(x)
ggplot(data, aes(x=x)) +  geom_histogram(binwidth = 1) + xlim(c(0,50))

qqnorm(x)
qqline(x)

3.4.2 Using Q-Q plots on real-world data sets

We next use the faithful data set which is included with R. You can type ?faithful in the RStudio console to get more information on it. This data is about the Old Faithful geyser in Yellowstone National Park in the US. This geyser is famous for erupting on a very regular schedule and the data set has information on how long the eruptions are (faithful$eruptions) and the amount of time until the next eruption (faithful$waiting).

We can first make a Q-Q plot of the waiting times:

x = faithful$waiting
qqnorm(x)
qqline(x)

This tells us that the waiting times are clearly not normally distributed. To investigate further, we can plot a histogram of the values:

ggplot(faithful, aes(x=waiting)) + geom_histogram(binwidth=2)

From the histogram, we see that the distribution is clearly bimodal, as there are two distinct peaks.

To investigate further, we can do a scatter plot showing how the waiting time might be related to the length of the eruption.

ggplot(faithful, aes(x=eruptions, y=waiting)) + geom_point() + 
  xlab("Length of eruption (minutes)") + 
  ylab("Waiting time until next eruption (minutes)")

We see a few things here:

  • The longer the eruption, the longer we will have to wait until the next one.
  • There seem to be two distinct clusters. It's not clear what is causing this, and since the data doesn't include the dates of the eruptions, we don't know whether the geyser randomly switches between short and long eruptions, or whether there were long eruptions for years and now only short ones, due to factors such as earthquakes changing the water supply to the geyser.

We can also split the data into two sets: one containing all the eruptions that lasted less than three minutes, and the other containing those that lasted three minutes or more.

For this, we use the which command, which returns the indices of the matching locations in a vector:

which(faithful$eruptions < 3)
##  [1]   2   4   6   9  11  14  16  17  19  21  22  27  36  37  39  42  44
## [18]  48  50  53  55  58  61  63  65  69  72  75  77  84  89  91  93  95
## [35]  99 101 103 106 108 112 115 117 119 121 124 127 129 131 133 135 137
## [52] 139 142 146 148 150 153 159 161 163 167 169 171 172 178 181 185 188
## [69] 190 192 199 201 204 206 209 211 213 217 219 221 223 232 234 236 237
## [86] 240 242 244 247 249 251 259 263 265 266 269 271

The numbers above correspond to the rows in the faithful data set where the eruptions are less than three minutes.

Now we create the two separate data frames from the faithful data frame:

faithful_short = faithful[which(faithful$eruptions < 3),]
faithful_long = faithful[which(faithful$eruptions >= 3),]

Q-Q plot for the short eruptions:

qqnorm(faithful_short$waiting)
qqline(faithful_short$waiting)

Q-Q plot for the long eruptions:

qqnorm(faithful_long$waiting)
qqline(faithful_long$waiting)

This shows that if we split the data into two clusters, then data in those clusters seems to be normally distributed.

3.4.3 Using Q-Q plots on other real-world data sets

If you type help(package="datasets") into the RStudio Console you’ll see a list of other data sets which are included with R. You can experiment with these just as we did with the faithful data set to see if this data is normally distributed as well.

3.5 Using quantile functions to calculate confidence intervals

Take a random sample of size n = 30 from a normal distribution:

n = 30 # number of random samples

# set the mean and standard deviation of the distribution
true_mean = 0
true_sd = 1

# take random samples
vals = rnorm(n, mean=true_mean, sd=true_sd)

Now we calculate the sample mean and standard deviation and compare them to \(\mu\) and \(\sigma\) of the original distribution.

mean_pts = mean(vals)
mean_pts
## [1] -0.2514718

In the original distribution \(\mu = 0\), while in the new sample it’s -0.2514718.

sd_pts = sd(vals)
sd_pts
## [1] 0.9297695

In the original distribution \(\sigma = 1\), while in the new sample it’s 0.9297695.

We now will calculate the 95% confidence interval of the mean and compare to that of the original distribution. From the book, we know that we can calculate the 95% confidence interval using \(\bar{y} \pm 1.96 \cdot \sigma_{\bar{y}}\) where \(\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}\).
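For instance, for the single sample we drew above, the interval can be computed directly (a quick illustration; the exact numbers depend on the random sample):

# 95% confidence interval for the sample drawn above
c(mean_pts - 1.96 * sd_pts / sqrt(n),
  mean_pts + 1.96 * sd_pts / sqrt(n))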

We will now see how often the 95% confidence interval includes the true mean. For 500 iterations, we take n random numbers from the distribution and calculate the confidence interval for each sample.

times_true_mean_in_confidence_interval = 0
num_tests = 500
# create a loop where the value of "count" will be 1, 2, 3... num_tests
for (count in c(1:num_tests)){
  
  # sample random values from the distribution
  vals = rnorm(n, mean=0, sd=1)
  
  # calculate mean and standard deviation
  mean_pts = mean(vals)
  sd_pts = sd(vals)
  
  # calculate left and right sides of the 95% confidence interval
  left = mean_pts - (1.96 * sd_pts/sqrt(n))
  right = mean_pts + (1.96 * sd_pts/sqrt(n))
  
  # see if the true_mean is within the bounds of the interval
  if (true_mean >= left & true_mean <= right){
    # if it's in the interval, add one to this variable
    times_true_mean_in_confidence_interval = times_true_mean_in_confidence_interval + 1
  }
}

# show how many times the true_mean falls within the intervals calculated
times_true_mean_in_confidence_interval
## [1] 472
# what percentage of the time does the true_mean fall within the intervals calculated?
times_true_mean_in_confidence_interval / num_tests
## [1] 0.944

What if we ran this 10,000 times? How would this percentage change as we repeatedly took more random samples?

Note that the x axis is scaled logarithmically.
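The code for this experiment isn't shown above, but a minimal sketch that reproduces it could look like this (we record whether each interval contains the true mean, then plot the running fraction):

num_tests = 10000
in_interval = logical(num_tests) # TRUE/FALSE result for each iteration

for (count in c(1:num_tests)){
  vals = rnorm(n, mean=true_mean, sd=true_sd)
  left = mean(vals) - (1.96 * sd(vals)/sqrt(n))
  right = mean(vals) + (1.96 * sd(vals)/sqrt(n))
  in_interval[count] = (true_mean >= left & true_mean <= right)
}

# running fraction of intervals that contain the true mean
coverage = data.frame(iteration = c(1:num_tests),
                      fraction = cumsum(in_interval) / c(1:num_tests))

ggplot(coverage, aes(x=iteration, y=fraction)) + 
  geom_line() + 
  scale_x_log10() + # log-scaled x axis
  geom_hline(yintercept = 0.95, linetype="dashed") + 
  xlab("Number of random samples taken (log scale)") + 
  ylab("Fraction of intervals containing the true mean")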

This shows that there is a lot of variation at first, but as expected, with more samples, we settle into the 95% range.

3.6 Binomial Distributions

This example is from Lecture 2, and is worked out in R below.

You observe cars at an intersection: on a usual day, 50% go right and 50% go left. Today you walk past quickly and only have time to observe 10 cars.

3.6.0.1 You want to know the chance that x cars will go to the right. What distribution do you use with which parameters?

As mentioned in the lecture, this is a binomial distribution with p = 0.5 and n = 10. We define “success” as a car turning right.

n = 10 # number of observations
p = 0.5 # probability

# in n observations, the number of cars that can turn right is somewhere between 0 and n
x = c(0:n)
data = data.frame(x = x, 
                  vals = dbinom(x, size=n, prob=p))

ggplot(data, aes(x=x, y = vals)) + 
  geom_point() + # show the discrete probabilities
  geom_line() + # connect the dots to show the distribution better
  xlab("Number of Cars Turning Right") + 
  ylab("Probability") + 
  scale_x_continuous(breaks = c(0:10)) # set up the x axis so that it shows integers instead of 0, 2.5, etc

3.6.0.2 What is the chance that only 2 will go right?

dbinom(2, size=10, prob=0.5)
## [1] 0.04394531

3.6.0.3 What is the chance that more than 7 go right?

We can add up the probabilities in a series of statements like this:

dbinom(8, size=10, prob=0.5) + 
  dbinom(9, size=10, prob=0.5) + 
  dbinom(10, size=10, prob=0.5)
## [1] 0.0546875

This isn’t a good approach if we have to sum up the probabilities for a lot of numbers like 8, 9, … 1000.

A much more efficient way is to pass a vector of numbers to dbinom and then use the sum function to sum up the probabilities:

# pass a vector of 8, 9, 10 and then sum up the probabilities
sum(dbinom(c(8:10), size=10, prob=0.5))
## [1] 0.0546875
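Alternatively, since \(P(X > 7) = 1 - P(X \leq 7)\), we can get the same number from the cumulative distribution function:

pbinom(7, size=10, prob=0.5, lower.tail=FALSE)
## [1] 0.0546875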

3.6.0.4 Now you stop and observe 1000 cars: what is the most likely number to go right and what is the standard deviation?

To find the most likely number of cars to go right, we calculate the expected value \(E(x) = np\) (for a binomial distribution, when \(np\) is an integer, the expected value is also the most likely value):

n = 1000
p = 0.5
n * p
## [1] 500

To find the standard deviation:

  • First find the variance: \(Var(x) = np(1-p)\)
  • Use \(\sigma = \sqrt{Var(x)}\)

var = n*p*(1-p)
sqrt(var)
## [1] 15.81139

3.6.0.5 You observe 100 cars and see that 43 go to the right. Which uncertainty should you quote?

Again using \(Var(x) = np(1-p)\) and \(\sigma = \sqrt{Var(x)}\)

n = 100
sqrt(n * p * (1-p))
## [1] 5

3.6.0.6 What is the chance that this observation was made on a typical day?

Probability of observing exactly 43 cars turning right:

dbinom(43, size=n, prob=0.5)
## [1] 0.03006864

Probability of observing at most 43 cars turning right:

# sum up the probabilities for observing between 0 and 43 cars
sum(dbinom(c(0:43), size=n, prob=0.5))
## [1] 0.09667395
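As before, the cumulative distribution function gives the same result directly:

pbinom(43, size=n, prob=0.5)
## [1] 0.09667395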

4 Exercises

Try what you have learned above on the examples discussed in the lecture. Using R, you should arrive at the same numbers that were presented in class.

4.1 Example 2

The government says that 26% of the population smokes. You want to test this by asking 30 of your friends.

  • If this sample is a good representation of the population, how many of your friends do you expect to smoke if the government is telling the truth?
  • Because of the small sample, you decide that if 6, 7, 8, or 9 smoke, you believe the government; otherwise you reject the claim. Is this a good choice?
  • What about trusting the government if 4 to 11 people in the sample smoke?

4.2 Example 3

A large detector is set up to detect neutrinos. It usually counts 2/day.

  • What is the distribution you would use to estimate the probability of counting x neutrinos in a given day?
  • What is the probability of detecting 8/day? Are scientists justified in calling this a special event?
  • What if it counts 4/day?