Make sure to load the ggplot2
library so we can do plotting.
library(ggplot2)
This example is based on the one already presented in the lecture slides, although it goes into more of a visual explanation and also shows how you can use R code to calculate Bayesian probabilities. Bayes' Theorem can be difficult to understand since you're calculating probabilities on different subsets of the whole population, and sometimes you really have to think about why \(P(A,B)\), \(P(A|B)\) and \(P(B|A)\) are all different, even though certain observations in the data influence the calculation of all three.
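For reference, Bayes' Theorem, which the code below implements, relates these quantities as:
\[
P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}, \qquad P(A,B) = P(A|B)\,P(B) = P(B|A)\,P(A)
\]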
We have the following definition of the events and probabilities: event \(A\) is that a cow has BSE, and event \(B\) is that the cow tests positive for BSE.
To visually explain how these different probabilities relate, this gray box represents 100% of the population of cows:
We know that 1% of the cow population has BSE:
But we also know that there are a lot of positive results:
Just from this, we can see:
Combining everything together, we get the image shown below:
As you can see:
Now what would happen if we had different values for the probability that the test was accurate (i.e. different values for \(P(B|A)\) and \(P(B|A^{c})\))?
To evaluate this:
In R, we represent \(P(B|A)\) as the variable p_B_given_A, and we make a sequence from 0.8 to 1 in steps of 0.005 (seq(0.8,1,0.005)), which indicates a range of values for the test being 80% accurate all the way up to 100% accurate at spotting BSE, given that the cow actually has BSE.
# probability has BSE
p_A = 0.01
# probability tests positive given has BSE (i.e. how accurate the test is if you have BSE)
p_B_given_A = seq(0.8,1,0.005)
# probability that the test is positive (whether or not the cow has BSE)
p_B = (p_A * p_B_given_A) + # probability of having BSE and testing positive
  ((1-p_A) * (1-p_B_given_A)) # probability of not having BSE and testing positive
df = data.frame(p_A = p_A,
p_B_given_A = p_B_given_A,
p_B = p_B)
df$p_A_given_B = (df$p_B_given_A * df$p_A) / df$p_B
We now have a data frame that looks like this:
head(df)
## p_A p_B_given_A p_B p_A_given_B
## 1 0.01 0.800 0.2060 0.03883495
## 2 0.01 0.805 0.2011 0.04002984
## 3 0.01 0.810 0.1962 0.04128440
## 4 0.01 0.815 0.1913 0.04260324
## 5 0.01 0.820 0.1864 0.04399142
## 6 0.01 0.825 0.1815 0.04545455
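As a quick check of the first row by hand, using the formula for \(P(B)\) from the code above:
\[
P(B) = 0.01 \cdot 0.8 + 0.99 \cdot 0.2 = 0.206, \qquad
P(A|B) = \frac{0.8 \cdot 0.01}{0.206} \approx 0.0388
\]
which matches the value of p_A_given_B in the first row.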
We can then plot the data to show the relationship between the different probabilities.
ggplot(df, aes(x=p_B_given_A, y=p_A_given_B)) +
geom_point() +
xlab("P(B|A) = P(Positive Test Results|Has BSE)
If the cow has BSE, probability of spotting it with the test") +
ylab("P(A|B) = P(Has BSE|Positive Test Results)
If cow has positive test results, probability that it actually has BSE")
There are two interesting things that we see here:
R contains functions that allow you to easily work with distributions. The names of these functions follow a standard format: you'll see a d, r, p or q followed by the name of the distribution. These letters stand for:
d: the density function
p: the cumulative distribution function
q: the quantile function
r: random number generation
These letters are combined with the names of the distributions; for the normal distribution this gives dnorm, pnorm, qnorm and rnorm, and likewise dunif/punif/qunif/runif for the uniform distribution and dbinom/pbinom/qbinom/rbinom for the binomial distribution.
Type ?Distributions into the RStudio Console in order to see documentation on all of the distributions.
# create a series of x values
xvals = seq(-3, 3, 0.01)
# create a data frame containing values for the density function (sampled at
# each value of xvals) and the cumulative distribution sampled at the same values
dist_data = data.frame(x=xvals,
dist = dnorm(xvals),
cumulative_dist = pnorm(xvals))
# take a look at the data frame
head(dist_data)
## x dist cumulative_dist
## 1 -3.00 0.004431848 0.001349898
## 2 -2.99 0.004566590 0.001394887
## 3 -2.98 0.004704958 0.001441242
## 4 -2.97 0.004847033 0.001488999
## 5 -2.96 0.004992899 0.001538195
## 6 -2.95 0.005142641 0.001588870
# plot probability vs. cumulative probability
ggplot() +
geom_line(data=dist_data, aes(x=x, y=dist)) +
geom_line(data=dist_data, aes(x=x, y=cumulative_dist)) +
ylab("Probability") +
xlab("X Values")
If you type ?pnorm in the console, you'll find that the documentation for lower.tail states that “if TRUE (default), probabilities are P[X ≤ x] otherwise, P[X > x]”. This applies to the other distributions as well.
x0 = 1
# P(x > x0)
pnorm(x0, lower.tail=FALSE)
## [1] 0.1586553
1 - pnorm(x0, lower.tail=TRUE)
## [1] 0.1586553
# P(x < x0)
pnorm(x0, lower.tail=TRUE)
## [1] 0.8413447
1 - pnorm(x0, lower.tail=FALSE)
## [1] 0.8413447
Below you can see the output of several example quantiles, which show which value is greater than 80% (p=0.8) of all the values in each distribution.
qnorm(p=0.8)
## [1] 0.8416212
qunif(p=0.8, min=-1, max=1)
## [1] 0.6
qbinom(p=0.8, size=10, prob=0.5) # 10 trials, 50% probability of "success" per trial
## [1] 6
Here we plot a series of quantiles on a plot also showing the probability distribution and the cumulative distribution. In the code, geom_vline(xintercept = qnorm(0.5), linetype="dashed") means that we should draw a vertical dashed line at the value specified by qnorm(0.5).
ggplot() +
geom_line(data=dist_data, aes(x=x, y=dist), colour="blue") +
geom_line(data=dist_data, aes(x=x, y=cumulative_dist), color="red") +
ylab("Probability") +
xlab("X Values") +
geom_vline(xintercept = qnorm(0.5), linetype="dashed") +
geom_vline(xintercept = qnorm(0.9), linetype="dashed") +
geom_vline(xintercept = qnorm(0.95), linetype="dashed") +
geom_vline(xintercept = qnorm(0.99), linetype="dashed") +
ggtitle("Quantiles for normal distribution at 0.5, 0.9, 0.95, 0.99")
As we'd expect, the value for qnorm(0.5) is exactly where the cumulative probability is 0.5 as well. The same goes for the values at 0.9, 0.95 and 0.99.
rnorm(10)
## [1] -1.8027545 0.4551979 -0.9943766 0.6810542 -0.6473601 -0.4276190
## [7] -0.4358254 0.3675553 -0.9970416 0.8915449
runif(10, min=-1, max=1)
## [1] -0.4753282 -0.8342742 0.8763450 0.7309914 0.4790265 0.3773961
## [7] -0.1567662 0.5277522 0.3783993 -0.1621227
rbinom(10, size=10, prob=0.5)
## [1] 6 5 7 5 5 6 7 5 2 4
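Note that these random numbers will be different every time the code is run. If you want reproducible results, you can set the random seed first (the seed value below is just an arbitrary example, not part of the original code):
# fix the random number generator's seed so the same "random" numbers are produced on each run
set.seed(42)
rnorm(10)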
A Q-Q plot of samples drawn from a normal distribution should, of course, confirm that the data is normally distributed:
x <- rnorm(100) # 100 samples from a normal distribution
qqnorm(x)
qqline(x)
How does a uniform distribution look?
x <- runif(100) # 100 samples from a uniform distribution
qqnorm(x)
qqline(x)
Typing ?rbinom into the console will bring up a page showing the following parameters that need to be specified to generate random numbers from a binomial distribution:
n: number of observations
size: number of trials
prob: probability of success on each trial
In the code below, n=1000 means that we have 1000 observations. During each observation we do 50 trials (size=50), where the probability of a success is defined by prob=0.5.
x <- rbinom(n=1000, size=50, prob=0.5)
We can use range to see the minimum and maximum values:
range(x)
## [1] 14 36
In other words, there exists at least one observation where there were 14 successes and at least one observation where there were 36 successes.
We use + xlim(c(0,50)) to show the x axis for values from 0 to 50.
# Create a data frame for the values of x
# This results in a data frame with one column: data$x
data = as.data.frame(x)
ggplot(data, aes(x=x)) + geom_histogram(binwidth = 1) + xlim(c(0,50))
qqnorm(x)
qqline(x)
If we try a more skewed distribution with prob=0.9, we get:
x <- rbinom(n=1000, size=50, prob=0.9)
data = as.data.frame(x)
ggplot(data, aes(x=x)) + geom_histogram(binwidth = 1) + xlim(c(0,50))
qqnorm(x)
qqline(x)
We next use the faithful data set, which is included with R. You can type ?faithful in the RStudio console to get more information on it. This data is about the Old Faithful geyser in Yellowstone National Park in the US. This geyser is famous for erupting on a very regular schedule, and the data set has information on how long the eruptions are (faithful$eruptions) and the amount of time until the next eruption (faithful$waiting).
We can first make a Q-Q plot of the waiting times:
x = faithful$waiting
qqnorm(x)
qqline(x)
This tells us that the waiting times are clearly not normally distributed. To investigate further, we can plot a histogram of the values:
ggplot(faithful, aes(x=waiting)) + geom_histogram(binwidth=2)
From the histogram, we see that the data is clearly bimodal, as there are two distinct peaks.
To investigate further, we can do a scatter plot showing how the waiting time might be related to the length of the eruption.
ggplot(faithful, aes(x=eruptions, y=waiting)) + geom_point() +
xlab("Length of eruption (minutes)") +
ylab("Waiting time until next eruption (minutes)")
We see a few things here:
We can also split up the data into two sets, where one contains all the eruptions that lasted less than three minutes, and the other contains those that lasted three minutes or longer. For this, we use the which command, which returns the indices of the matching locations in a vector:
which(faithful$eruptions < 3)
## [1] 2 4 6 9 11 14 16 17 19 21 22 27 36 37 39 42 44
## [18] 48 50 53 55 58 61 63 65 69 72 75 77 84 89 91 93 95
## [35] 99 101 103 106 108 112 115 117 119 121 124 127 129 131 133 135 137
## [52] 139 142 146 148 150 153 159 161 163 167 169 171 172 178 181 185 188
## [69] 190 192 199 201 204 206 209 211 213 217 219 221 223 232 234 236 237
## [86] 240 242 244 247 249 251 259 263 265 266 269 271
The numbers above correspond to the rows in the faithful data set where the eruptions are less than three minutes.
Now we create the two separate data frames from the faithful data frame:
faithful_short = faithful[which(faithful$eruptions < 3),]
faithful_long = faithful[which(faithful$eruptions >= 3),]
Q-Q plot for the short eruptions:
qqnorm(faithful_short$waiting)
qqline(faithful_short$waiting)
Q-Q plot for the long eruptions:
qqnorm(faithful_long$waiting)
qqline(faithful_long$waiting)
This shows that if we split the data into two clusters, then the data within each cluster seems to be normally distributed.
If you type help(package="datasets") into the RStudio Console, you'll see a list of other data sets which are included with R. You can experiment with these just as we did with the faithful data set, to see whether their data is normally distributed as well.
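For example, you could look at the precip data set (annual precipitation for US cities); this particular choice is just an illustration and not part of the original exercise:
# Q-Q plot for the precip data set included with R
qqnorm(precip)
qqline(precip)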
Take a random sample of n = 30 values from a normal distribution:
n = 30 # number of random samples
# set the mean and standard deviation of the distribution
true_mean = 0
true_sd = 1
# take random samples
vals = rnorm(n, mean=true_mean, sd=true_sd)
Now we calculate the mean and standard deviation of the sample and compare them to \(\mu\) and \(\sigma\) of the original distribution.
mean_pts = mean(vals)
mean_pts
## [1] -0.2514718
In the original distribution \(\mu = 0\), while in the new sample it’s -0.2514718.
sd_pts = sd(vals)
sd_pts
## [1] 0.9297695
In the original distribution \(\sigma = 1\), while in the new sample it’s 0.9297695.
We now will calculate the 95% confidence interval of the mean and compare to that of the original distribution. From the book, we know that we can calculate the 95% confidence interval using \(\bar{y} \pm 1.96 \cdot \sigma_{\bar{y}}\) where \(\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}\).
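For example, applying this formula to the single sample drawn above (this short check is not in the original text) gives an interval of roughly (-0.58, 0.08), which does contain the true mean of 0:
# 95% confidence interval for the sample of n = 30 values drawn above
left = mean_pts - (1.96 * sd_pts/sqrt(n))
right = mean_pts + (1.96 * sd_pts/sqrt(n))
c(left, right)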
We will now see for how many samples the 95% confidence interval includes the true mean. For each of 500 iterations, we take n random numbers from the distribution and calculate the confidence interval.
times_true_mean_in_confidence_interval = 0
num_tests = 500
# create a loop where the value of "count" will be 1, 2, 3... num_tests
for (count in c(1:num_tests)){
  # sample random values from the distribution
  vals = rnorm(n, mean=0, sd=1)
  # calculate mean and standard deviation
  mean_pts = mean(vals)
  sd_pts = sd(vals)
  # calculate left and right sides of the 95% confidence interval
  left = mean_pts - (1.96 * sd_pts/sqrt(n))
  right = mean_pts + (1.96 * sd_pts/sqrt(n))
  # see if the true_mean is within the bounds of the interval
  if (true_mean >= left & true_mean <= right){
    # if it's in the interval, add one to this variable
    times_true_mean_in_confidence_interval = times_true_mean_in_confidence_interval + 1
  }
}
# show how many times the true_mean falls within the intervals calculated
times_true_mean_in_confidence_interval
## [1] 472
# what percentage of the time does the true_mean fall within the intervals calculated?
times_true_mean_in_confidence_interval / num_tests
## [1] 0.944
What if we ran this 10,000 times? How would this percentage change as we repeatedly took more random samples?
Note that the x axis is scaled logarithmically.
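The code that generated that plot is not shown here; a minimal sketch of how it could be reproduced (the variable names num_tests, in_interval and running_pct are illustrative, not from the original) is:
# run the same experiment 10,000 times and track the running fraction of
# confidence intervals that contain the true mean
num_tests = 10000
in_interval = logical(num_tests)
for (count in c(1:num_tests)){
  vals = rnorm(n, mean=true_mean, sd=true_sd)
  left = mean(vals) - (1.96 * sd(vals)/sqrt(n))
  right = mean(vals) + (1.96 * sd(vals)/sqrt(n))
  in_interval[count] = (true_mean >= left & true_mean <= right)
}
running_pct = cumsum(in_interval) / c(1:num_tests)
running_df = data.frame(iteration = c(1:num_tests), pct = running_pct)
ggplot(running_df, aes(x=iteration, y=pct)) +
  geom_line() +
  scale_x_log10() + # logarithmic x axis
  geom_hline(yintercept = 0.95, linetype="dashed") +
  xlab("Number of random samples taken") +
  ylab("Fraction of intervals containing the true mean")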
This shows that there is a lot of variation at first, but as expected, with more samples, we settle into the 95% range.
This example is from Lecture 2, and is worked out in R below.
You observe cars at an intersection: on a usual day, 50% go right and 50% go left. Today you walk past quickly and have time to observe only 10 cars.
As mentioned in the lecture, this is a binomial distribution with p = 0.5 and n = 10. We define “success” as a car turning right.
n = 10 # number of observations
p = 0.5 # probability
# in n observations, the number of cars that can turn right is somewhere between 0 and n
x = c(0:n)
data = data.frame(x = x,
vals = dbinom(x, size=10, prob=0.5))
ggplot(data, aes(x=x, y = vals)) +
geom_point() + # show the discrete probabilities
geom_line() + # connect the dots to show the distribution better
xlab("Number of Cars Turning Right") +
ylab("Probability") +
scale_x_continuous(breaks = c(0:10)) # set up the x axis so that it shows integers instead of 0, 2.5, etc
The probability that exactly 2 of the 10 cars turn right is:
dbinom(2, size=10, prob=0.5)
## [1] 0.04394531
To find the probability that at least 8 of the 10 cars turn right, we can add up the probabilities in a series of statements like this:
dbinom(8, size=10, prob=0.5) +
dbinom(9, size=10, prob=0.5) +
dbinom(10, size=10, prob=0.5)
## [1] 0.0546875
This isn't a good approach if we have to sum up the probabilities for a lot of numbers like 8, 9, … 1000. A much more efficient way is to pass a vector of numbers to dbinom and then use the sum function to add up the probabilities:
# pass a vector of 8, 9, 10 and then sum up the probabilities
sum(dbinom(c(8:10), size=10, prob=0.5))
## [1] 0.0546875
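Equivalently, since summing dbinom over 8, 9 and 10 is the same as asking for \(P(X > 7)\), the cumulative distribution function pbinom can be used (this alternative is not part of the original example):
# P(X >= 8) expressed as P(X > 7) using the cumulative distribution function
pbinom(7, size=10, prob=0.5, lower.tail=FALSE) # gives the same 0.0546875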
To find the most likely number of cars to go right when observing \(n = 1000\) cars, we calculate the expected value \(E(x) = np\):
n = 1000
p = 0.5
n * p
## [1] 500
To find the standard deviation:
var = n*p*(1-p)
sqrt(var)
## [1] 15.81139
Again using \(Var(x) = np(1-p)\) and \(\sigma = \sqrt{Var(x)}\), this time for \(n = 100\) cars:
n = 100
sqrt(n * p * (1-p))
## [1] 5
Probability of observing exactly 43 cars turning right:
dbinom(43, size=n, prob=0.5)
## [1] 0.03006864
Probability of observing at most 43 cars turning right:
# sum up the probabilities for observing between 0 and 43 cars
sum(dbinom(c(0:43), size=n, prob=0.5))
## [1] 0.09667395
Try what you have learned above on the examples discussed in the lecture. Using R, you should arrive at the same numbers that were presented in class.
The government says that 26% of the population smokes. You want to test this by asking 30 of your friends.
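One way to start on this exercise (a sketch only; the specific questions from the lecture are not reproduced here) is to model the number of smokers among your 30 friends as a binomial distribution with size = 30 and prob = 0.26:
# sketch: number of smokers among 30 friends, assuming the government's 26% figure is correct
n = 30
p = 0.26
n * p # expected number of smokers
sqrt(n * p * (1-p)) # standard deviation
dbinom(5, size=n, prob=p) # probability that exactly 5 friends smoke (5 is just an example value)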
A large detector is set up to detect neutrinos. It usually counts 2/day.
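Counts per day like this are typically modeled with a Poisson distribution; a sketch assuming a rate of \(\lambda = 2\) counts per day (the lecture's specific questions are not reproduced here):
# sketch: Poisson model for the neutrino detector with a mean rate of 2 counts per day
lambda = 2
dpois(0, lambda=lambda) # probability of detecting no neutrinos in a day
dpois(4, lambda=lambda) # probability of detecting exactly 4 (4 is just an example value)
ppois(4, lambda=lambda, lower.tail=FALSE) # probability of detecting more than 4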