Bayes' Theorem

Getting started

Make sure to load the tidyverse library so we can do plotting later on.

library(tidyverse)

Example

This example is based on the one presented in the lecture slides, although it goes into more of a visual explanation and also shows how you can use R code to calculate Bayesian probabilities. Bayes' Theorem can be difficult to understand since you’re calculating probabilities on different subsets of the whole population, and sometimes you really have to think about why \(P(A,B)\), \(P(A|B)\) and \(P(B|A)\) are all different, even though the same observations in the data influence the calculation of all three.

Problem Statement

We have the following definition of the events and probabilities:

  • Event A: The cow has BSE, with \(P(A) = 0.01\)
  • Event B: The test for BSE returns a positive result
  • \(P(B|A) = 0.9\), the accuracy of the test (i.e. the probability that the test is positive if the cow is infected)
  • \(P(B|A^{c}) = 0.1\), the false positive rate (i.e. the probability that the test is positive if the cow is not infected)

To visually explain how these different probabilities relate, this gray box represents 100% of the population of cows:

We know that 1% of the cow population has BSE:

But we also know that there are a lot of positive results:

Just from this, we can see:

  • there are way more positive test results than the number of cows which have BSE
  • therefore, getting a positive result may not be a good indication that the cow actually has BSE.

Combining everything together, we get the image shown below:

As you can see:

  • if a cow actually has BSE, then there is a high probability (90%) that the test will spot it.
  • However, since most cows are not infected (99%) and the test has a ten percent false positive rate, a large part of the population gets positive test results without actually having BSE.
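We can put a number on this intuition by plugging the probabilities above directly into Bayes' Theorem (a quick check in R; the calculation is generalized in the next section, and p_B_given_not_A is just an illustrative name for \(P(B|A^{c})\)):

p_A = 0.01             # P(A): cow has BSE
p_B_given_A = 0.9      # P(B|A): test positive given BSE
p_B_given_not_A = 0.1  # P(B|A^c): false positive rate

# P(B): overall probability of a positive test (law of total probability)
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# P(A|B): probability of BSE given a positive test
(p_B_given_A * p_A) / p_B
## [1] 0.08333333

So even with a 90% accurate test, a positive result only implies about an 8% chance that the cow actually has BSE.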

Now what would happen if we had different values for the probability that the test was accurate (i.e. different values for \(P(B|A)\) and \(P(B|A^{c})\))?

To evaluate this:

  • We still use the same probability of a cow having BSE, so we keep \(P(A)\)=0.01.
  • We need to recalculate \(P(B)\), the probability of getting a positive result regardless of whether or not the cow has BSE.
  • All calculations are based on \(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\)
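Writing \(P(B)\) out with the law of total probability gives the form that the code below implements:

\(P(B) = P(B|A)P(A) + P(B|A^{c})P(A^{c}) = P(B|A)P(A) + P(B|A^{c})(1 - P(A))\)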

In R, we represent \(P(B|A)\) as the variable p_B_given_A and make a sequence from 0.8 to 1 in steps of 0.005 (seq(0.8, 1, 0.005)), which covers a range of tests from 80% accurate all the way up to 100% accurate at spotting BSE, given that the cow actually has it. Note that in this sweep the false positive rate is tied to the accuracy: we assume \(P(B|A^{c}) = 1 - P(B|A)\), which is what the (1-p_B_given_A) term in the code below represents.

# probability has BSE
p_A = 0.01

# probability tests positive given has BSE (i.e. how accurate the test is if you have BSE)
p_B_given_A = seq(0.8,1,0.005)

# probability that the test is positive (whether or not the cow has BSE)
p_B = (p_A * p_B_given_A) +      # cow has BSE and tests positive: P(B|A) * P(A)
  ((1-p_A) * (1-p_B_given_A))    # cow is not infected but tests positive: P(B|A^c) * P(A^c)

df = data.frame(p_A = p_A,
                p_B_given_A = p_B_given_A,
                p_B = p_B)

df$p_A_given_B = (df$p_B_given_A * df$p_A) / df$p_B

We now have a data frame that looks like this:

head(df)
##    p_A p_B_given_A    p_B p_A_given_B
## 1 0.01       0.800 0.2060  0.03883495
## 2 0.01       0.805 0.2011  0.04002984
## 3 0.01       0.810 0.1962  0.04128440
## 4 0.01       0.815 0.1913  0.04260324
## 5 0.01       0.820 0.1864  0.04399142
## 6 0.01       0.825 0.1815  0.04545455

We can then plot the data to show the relationship between the different probabilities.

ggplot(df, aes(x=p_B_given_A, y=p_A_given_B)) + 
  geom_point() + 
  xlab("P(B|A) = P(Positive Test Results|Has BSE)
       If the cow has BSE, probability of spotting it with the test") + 
  ylab("P(A|B) = P(Has BSE|Positive Test Results)
       If cow has positive test results, probability that it actually has BSE")

There are two interesting things that we see here:

  • Past a certain point, as the test becomes more accurate at spotting cows with BSE, \(P(A|B)\) climbs faster than the accuracy itself: because we tied the false positive rate to \(1 - P(B|A)\), a more accurate test also produces far fewer false positives, so near 100% accuracy even small improvements dramatically raise the chance that a positive result means the cow actually has BSE.
  • If BSE spreads (i.e. \(P(A)\) increases), then the share of positive results that are false positives goes down, and \(P(A|B)\) goes up without anything about the test itself changing. In other words, the predictive value of a positive test is partly a function of the percentage of cows with BSE, as the sketch below illustrates.
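To illustrate that second point, here is a minimal sketch (not from the lecture slides) that holds the test fixed and varies the prevalence \(P(A)\) instead:

# fix the test: 90% accuracy, 10% false positive rate
p_B_given_A = 0.9
p_B_given_not_A = 0.1

# vary the prevalence of BSE from 1% to 50%
p_A = seq(0.01, 0.5, 0.01)

# same calculation as before
p_B = (p_A * p_B_given_A) + ((1 - p_A) * p_B_given_not_A)
p_A_given_B = (p_B_given_A * p_A) / p_B

# P(A|B) rises steeply as BSE becomes more common,
# even though the test itself never changes
plot(p_A, p_A_given_B)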

Distributions

Background

R contains functions that allow you to easily work with distributions. The names of these functions follow a standard format - you’ll see a d, r, p or q and then the name of the distribution. These letters stand for:

  • d - density function for distribution
  • r - random sample from distribution
  • p - cumulative distribution
  • q - quantile function

Below you can see how these letters are combined with the names of the distributions. The notations (\(f_{n}\), etc.) correspond with those used in the lectures.

Distribution Name   Random Samples   Density Function     Cumulative Distribution   Quantile
Normal              rnorm            dnorm (\(f_{n}\))    pnorm (\(F_{n}\))         qnorm
Poisson             rpois            dpois (\(f_{p}\))    ppois (\(F_{p}\))         qpois
Binomial            rbinom           dbinom (\(f_{B}\))   pbinom (\(F_{B}\))        qbinom
Uniform             runif            dunif (\(f_{U}\))    punif (\(F_{U}\))         qunif
Student t           rt               dt (\(f_{t}\))       pt (\(F_{t}\))            qt
Chi-Squared         rchisq           dchisq (\(f_{C}\))   pchisq (\(F_{C}\))        qchisq
F                   rf               df (\(f_{F}\))       pf (\(F_{F}\))            qf

Main Source: Base R Cheat Sheet

Type ?Distributions into the RStudio Console to see documentation on all of the distributions.
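As a quick illustration of the naming convention, here are all four function types for the standard normal distribution:

dnorm(0)    # density at x = 0
## [1] 0.3989423
pnorm(0)    # cumulative probability P(X <= 0)
## [1] 0.5
qnorm(0.5)  # quantile: the value with 50% of the distribution below it
## [1] 0
rnorm(3)    # three random samples (your output will differ each run)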

Understanding the Documentation for Functions

If you type ?pnorm in the console, you’ll bring up the documentation for this function. You’ll see the following description of the function:

pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

From this, we can see four things:

  • The name of the function - pnorm
  • The order in which the function expects the arguments - q is first, mean is second, etc.
  • The default values of the arguments - if we don’t explicitly specify a value, the function will use these stated defaults:
    • mean = 0
    • sd = 1
    • lower.tail = TRUE
    • log.p = FALSE
  • The names of the arguments - we can specify the name of the argument when we tell the function which value to use (examples below)

When you use the command pnorm(1), you’re saying the same thing as pnorm(q=1), since the function description shows that the first argument in the pnorm() function is q, which the documentation states is a vector of quantiles.

These are all the same due to the default values and the specification of the expected order of the arguments:

  • pnorm(1)
  • pnorm(1, 0, 1)
  • pnorm(1, mean=0, sd=1)

These two examples are the same as the three above, except that we’ve changed the order of the arguments. Any time you change the order of the arguments, you need to explicitly specify the names of the arguments (like sd=) so the function knows which value belongs to which argument, as the short illustration after these examples shows.

  • pnorm(1, sd=1, mean=0)
  • pnorm(1, sd=1)
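To see why the names matter, compare what happens when we drop the name from the second argument (the values here follow from the defaults listed above):

pnorm(1, sd=1)  # mean defaults to 0
## [1] 0.8413447
pnorm(1, 1)     # positionally, 1 is taken as the mean, so this is P(X <= 1) with mean=1
## [1] 0.5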

Examples

Find the probability density of a normal distribution at a value of 8, with mean of 10, and standard deviation of 3.

By typing ?dnorm in the console, we see that we need to use the following syntax:

dnorm(x, mean = 0, sd = 1, log = FALSE)

dnorm(x=8, mean=10, sd=3, log=FALSE)
## [1] 0.1064827

Since log=FALSE is specified to be a default value, we can shorten this to just:

dnorm(x=8, mean=10, sd=3)

Also, since our quantile, mean, and standard deviation are in the order expected by the function, we can shorten this further to just:

dnorm(8, 10, 3)
## [1] 0.1064827

However, if we change the order of the arguments without naming them, R will assume that we want a mean of 3 and a standard deviation of 10, and we will get a different answer than what we want.

dnorm(8, 3, 10)
## [1] 0.03520653

We can also use a vector with these functions to evaluate a range of values. For example, here we calculate the probability density at values of 1, 2, …, 20:

dnorm(1:20, 10, 3)
##  [1] 0.001477283 0.003798662 0.008740630 0.017996989 0.033159046
##  [6] 0.054670025 0.080656908 0.106482669 0.125794409 0.132980760
## [11] 0.125794409 0.106482669 0.080656908 0.054670025 0.033159046
## [16] 0.017996989 0.008740630 0.003798662 0.001477283 0.000514093

If we plot these values, we can see the distribution:

x = 1:20
y = dnorm(x, 10, 3)
plot(x, y)

Plot a normal distribution and its cumulative distribution with mean of 0 and standard deviation of 1

# create a series of x values
xvals = seq(-3, 3, 0.01)

# create a data frame containing values for the density function (sampled at
# each value of xvals) and the cumulative distribution sampled at the same values

dist_data = data.frame(x=xvals, 
                       dist = dnorm(xvals, mean=0, sd=1),
                       cumulative_dist = pnorm(xvals, mean=0, sd=1))

# take a look at the first few rows of the data frame
head(dist_data)
##       x        dist cumulative_dist
## 1 -3.00 0.004431848     0.001349898
## 2 -2.99 0.004566590     0.001394887
## 3 -2.98 0.004704958     0.001441242
## 4 -2.97 0.004847033     0.001488999
## 5 -2.96 0.004992899     0.001538195
## 6 -2.95 0.005142641     0.001588870
# plot probability vs. cumulative probability
ggplot() + 
  geom_line(data=dist_data, aes(x=x, y=dist, color="probability density")) + 
  geom_line(data=dist_data, aes(x=x, y=cumulative_dist, color="cumulative distribution")) + 
  ylab("Probability") + 
  xlab("X Values")

You’ll see that in the example above, we added code like color="probability density" within the aesthetics specification (aes()). When we do this, ggplot will automatically assign a color for each unique label, and the labels will show up in the legend as well.

Note that the following three plotting commands all produce the same plot. For this example, the two lines we are plotting both use the same data frame and the same column for x, so we can specify this once, like ggplot(data=dist_data, aes(x=x)). These settings will then be used by all the following geom_* statements (geom_line, geom_point, etc.), instead of us having to specify the same parameters multiple times.

# specify data frame, x & y in the geom_line statements
ggplot() + 
  geom_line(data=dist_data, aes(x=x, y=dist)) + 
  geom_line(data=dist_data, aes(x=x, y=cumulative_dist))

# specify data frame in ggplot(), then specify x & y in the geom_line statements
ggplot(data=dist_data) + 
  geom_line(aes(x=x, y=dist)) + 
  geom_line(aes(x=x, y=cumulative_dist))

# specify data frame and x in ggplot(), then specify y in the geom_line statements
ggplot(data=dist_data, aes(x=x)) + 
  geom_line(aes(y=dist)) +           
  geom_line(aes(y=cumulative_dist))

Find the probability that x > x0 or x < x0

Type ?pnorm to bring up the documentation for the cumulative distribution function of the normal distribution. For this example, note what the documentation says about lower.tail: “if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x]”. This applies to the other distributions as well.

x0 = 1

# P(x > x0)
pnorm(x0, lower.tail=FALSE)
## [1] 0.1586553
1 - pnorm(x0, lower.tail=TRUE)
## [1] 0.1586553
# P(x < x0)
pnorm(x0, lower.tail=TRUE)
## [1] 0.8413447
1 - pnorm(x0, lower.tail=FALSE)
## [1] 0.8413447

Use the distributions to find quantiles

The examples below show quantiles for several distributions. In each case, the quantile at p=0.8 is the value which is greater than 80% of all the values in the distribution.

qnorm(p=0.8)
## [1] 0.8416212
qunif(p=0.8, min=-1, max=1)
## [1] 0.6
qbinom(p=0.8, size=10, prob=0.5) # 10 trials, 50% probability of "success" per trial
## [1] 6
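For a continuous distribution, the quantile function is the inverse of the cumulative distribution, so passing a quantile back into the matching p function returns the original probability:

pnorm(qnorm(0.8))
## [1] 0.8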

Here we plot a series of quantiles on a plot also showing the probability distribution and the cumulative distribution.

In the code, geom_vline(xintercept = qnorm(0.5), linetype="dashed") means that we should draw a vertical dashed line at the value specified by qnorm(0.5).

ggplot() + 
  geom_line(data=dist_data, aes(x=x, y=dist), colour="blue") + 
  geom_line(data=dist_data, aes(x=x, y=cumulative_dist), color="red") + 
  ylab("Probability") + 
  xlab("X Values") + 
  geom_vline(xintercept = qnorm(0.5), linetype="dashed") + 
  geom_vline(xintercept = qnorm(0.9), linetype="dashed") +
  geom_vline(xintercept = qnorm(0.95), linetype="dashed") + 
  geom_vline(xintercept = qnorm(0.99), linetype="dashed") + 
  ggtitle("Quantitles for normal distribution at 0.5, 0.9, 0.95, 0.99")

As we’d expect, the value for qnorm(0.5) is exactly where the cumulative probability is 0.5 as well. The same goes for values at 0.9, 0.95 and 0.99.

Take random samples from distributions

10 samples from a normal distribution

rnorm(10)
##  [1] -0.4282528 -1.3192845 -0.4675785 -0.5589134  0.1868809  0.4415013
##  [7]  0.9431222  1.9556165 -0.6606099  0.5419178

10 samples from a uniform distribution where random values are between -1 and 1

runif(10, min=-1, max=1)  
##  [1]  0.2298181  0.1107643  0.9412368 -0.3168448  0.2239617  0.3341198
##  [7] -0.7685436  0.5172092 -0.2406649 -0.6417861

10 samples from a binomial distribution, 100 trials (size=100) with probability of success of 0.5 (prob=0.5)

rbinom(10, size=100, prob=0.5)
##  [1] 71 48 50 55 52 52 44 50 44 52

We can use the hist() command to quickly make a histogram and show that the random samples do approximate the distributions that they are drawn from, especially for a large number of samples:

hist(rnorm(10000), breaks=100)

hist(runif(10000, min=-1, max=1), breaks=100)

hist(rbinom(10000, size=100, prob=0.5), breaks=100)

Adding multiple probability values

If you’re working with a binomial distribution, sometimes you will need to sum a set of probabilities to determine the probability of getting more than or fewer than a certain number of successes.

For example, given a situation where we have 5 trials, the probability of success on each trial is 0.3, what is the probability that we will have less than 3 successes?

To calculate this, we have to first consider that less than 3 successes means that we could have 0, 1, or 2 successes.

For each of these 3 cases (0, 1, or 2 successes), we can add up the probabilities in a series of statements like this:

dbinom(0, size=5, prob=0.3) +    # probability of 0 successes
  dbinom(1, size=5, prob=0.3) +  # probability of 1 success
  dbinom(2, size=5, prob=0.3)    # probability of 2 successes
## [1] 0.83692

This isn’t a good approach if we have to sum up the probabilities for a lot of numbers like 8, 9, … 1000.

A much more efficient way is to pass a sequence of numbers to dbinom and then use the sum function to sum up the probabilities:

# pass a vector of 0, 1, 2 and then sum up the probabilities
sum(dbinom(c(0:2), size=5, prob=0.3))
## [1] 0.83692
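Since this sum is just the cumulative probability \(P[X \leq 2]\), pbinom gives the same answer directly:

pbinom(2, size=5, prob=0.3)
## [1] 0.83692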

Distributions - Exercises

Exercise 1

Plot the probability density and cumulative distribution for a normal distribution with a mean of 5 and a standard deviation of 10. Add vertical lines to show the 80%, 90%, and 99% quantiles. Note that your plot may look slightly different on the x axis depending on which range of x values you choose.

Exercise 2

Plot the probability density and cumulative probability for a binomial distribution with 100 trials and a probability of success of 0.3. Add vertical lines to show the 50%, 90%, and 99% quantiles.

Exercise 3

Assume a normal distribution with a mean of 5 and a standard deviation of 10.

Take 10 random samples, calculate the mean, and observe how the calculated mean differs from the actual mean (5). Note that your own values may be slightly different from what is shown here since we’re taking random samples. If you run your code multiple times you will see that the values are a bit different each time.

## [1] 8.164448

Do the same, except with 100 samples

## [1] 6.425474

… and again with 1000 values

## [1] 5.193759

Exercise 4

This example is from Lecture 2, so you can compare your results also with the lecture slides.

You observe cars at an intersection: on a usual day, 50% go right and 50% go left. Today you walk past quickly and only have time to observe 10 cars.

You want to know the chance that x cars will go to the right. What distribution do you use with which parameters?

Plot the probability density for the number of cars turning right

What is the chance that only 2 will go right?

## [1] 0.04394531

What is the chance that more than 7 go right?

## [1] 0.0546875

Now you stop and observe 1000 cars: what is the most likely number to go right and what is the standard deviation?

To find the most likely number to go right, we calculate the expected value \(E(x) = np\)

## [1] 500

To find the standard deviation:

  • First find the variance: \(Var(x) = np(1-p)\)
  • Use \(\sigma = \sqrt{Var(x)}\)
## [1] 15.81139
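For reference, both numbers above can be reproduced with one line each, using \(n = 1000\) and \(p = 0.5\):

1000 * 0.5              # expected value E(x) = np
## [1] 500
sqrt(1000 * 0.5 * 0.5)  # standard deviation sigma = sqrt(np(1-p))
## [1] 15.81139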

You observe 100 cars and see that 43 go to the right. Which uncertainty should you quote?

Again using \(Var(x) = np(1-p)\) and \(\sigma = \sqrt{Var(x)}\), find \(\sigma\)

## [1] 5

What is the chance that this observation was made on a typical day?

Probability of observing exactly 43 cars turning right:

## [1] 0.03006864

Probability of observing at most 43 cars turning right:

## [1] 0.09667395

Confidence Intervals

Using quantile functions to calculate confidence intervals

A factory produces bolts that should be 10 cm wide. You take a sample of 10 bolts and measure on average 10.25 cm with a standard deviation of 0.8 cm.

What is the 95% confidence interval of the mean?

Because we’re looking for the 95% confidence interval, \(\alpha = 0.05\). Since we’re looking at 10 samples, our degrees of freedom are \(df = n - 1 = 9\).

The standard error is defined as \(\frac{s}{\sqrt{n}}\), and we use \(\bar{x}\) to represent the mean of the samples.

x_bar = 10.25      # mean of the samples
sd_samples = 0.8   # standard deviation of the samples
num_samples = 10
standard_error = sd_samples / sqrt(num_samples)

Using the Student t distribution, calculate \(t_{\alpha/2}\):

t_0.025 = qt(0.025, 9)
t_0.025
## [1] -2.262157

Calculate \(t_{1 - \alpha/2}\):

t_0.975 = qt(0.975, 9)
t_0.975
## [1] 2.262157
-qt(0.025, 9) # another way to reach the same value
## [1] 2.262157

Our confidence interval will then be between \(\left[\bar{x} + t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}, \bar{x} + t_{1 - \alpha/2} \cdot \frac{s}{\sqrt{n}} \right]\)

In R, this looks like:

lower_limit = x_bar + t_0.025 * standard_error
upper_limit = x_bar + t_0.975 * standard_error
c(lower_limit, upper_limit)
## [1]  9.677714 10.822286

What is the 95% confidence interval on the standard deviation?

Remember that variance is \(\sigma^{2}\), while the standard deviation is \(\sigma\).

We can calculate the confidence interval for \(\sigma^{2}\) by using \(\left[ \frac{s^{2}(n-1)}{\chi^{2}_{1-\alpha/2}}, \frac{s^{2}(n-1)}{\chi^{2}_{\alpha/2}}\right]\)

Note that when calculating the confidence interval for the standard deviation, the lower limit in the confidence interval will be based on the quantile at 97.5%, while the upper limit will be based on that at 2.5% (see page 346 in the textbook).

In R, we can write it like this:

lower_limit = (sd_samples^2 * (num_samples - 1)) / 
                        qchisq(0.975, 9)
upper_limit = (sd_samples^2 * (num_samples - 1)) / 
                        qchisq(0.025, 9)
c(lower_limit, upper_limit)
## [1] 0.3027951 2.1330256

Since we are looking for the confidence interval for \(\sigma\) instead of \(\sigma^{2}\), we can just take the square root of the previous values:

sqrt(c(lower_limit, upper_limit))
## [1] 0.5502682 1.4604881

Exercise

At the same factory, we take a sample of 20 bolts and discover that with this larger sample, we have an average of 10.13 cm with a standard deviation of 0.7 cm.

What are the 90% and 99% confidence intervals of the mean?

90% confidence interval

## [1]  9.859348 10.400652

99% confidence interval

## [1]  9.682193 10.577807

What are the 90% and 99% confidence intervals on the standard deviation?

90% confidence interval

## [1] 0.5557479 0.9592873

99% confidence interval

## [1] 0.4912256 1.1663281