Make sure to load the tidyverse
library so we can do plotting later on.
library(tidyverse)
This example is based on the one already presented in the lecture slides, although it goes into more of a visual explanation and also shows how you can use R code to calculate Bayesian probabilities. Bayes’ Theorem can be difficult to understand since you’re calculating probabilities on different subsets of the whole population, and sometimes you really have to think about why \(P(A,B)\), \(P(A|B)\) and \(P(B|A)\) are all different, even though certain observations in the data influence the calculation of all three.
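As a reminder, Bayes’ Theorem relates these quantities: \(P(A|B) = \frac{P(B|A) \, P(A)}{P(B)}\), where the total probability of a positive test can be expanded as \(P(B) = P(B|A)P(A) + P(B|A^{c})P(A^{c})\).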
We have the following definition of the events and probabilities:
To visually explain how these different probabilities relate, this gray box represents 100% of the population of cows:
We know that 1% of the cow population has BSE:
But we also know that the test returns a lot of positive results, both for cows that have BSE and for cows that don’t:
Just from this, we can see:
Combining everything together, we get the image shown below:
As you can see:
Now what would happen if we had different values for the probability that the test was accurate (i.e. different values for \(P(B|A)\) and \(P(B|A^{c})\))?
To evaluate this:
In R, we represent \(P(B|A)\) as the variable p_B_given_A, and we make a sequence from 0.8 to 1 in steps of 0.005 (seq(0.8, 1, 0.005)), which indicates a range of values for the test being 80% accurate all the way up to 100% accurate at spotting BSE given that the cow actually has BSE.
# probability has BSE
p_A = 0.01
# probability tests positive given has BSE (i.e. how accurate the test is if you have BSE)
p_B_given_A = seq(0.8,1,0.005)
# probability that the test is positive (independent of if you have BSE or not)
p_B = (p_A * p_B_given_A) +       # probability that the test is positive and the cow has BSE
      ((1-p_A) * (1-p_B_given_A)) # probability that the test is positive and the cow doesn't have BSE
                                  # (the false positive rate is assumed here to be 1 - P(B|A))
df = data.frame(p_A = p_A,
p_B_given_A = p_B_given_A,
p_B = p_B)
df$p_A_given_B = (df$p_B_given_A * df$p_A) / df$p_B
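For example, taking the first element of the sequence, \(P(B|A) = 0.8\): \(P(B) = 0.01 \cdot 0.8 + 0.99 \cdot 0.2 = 0.206\), and so \(P(A|B) = (0.8 \cdot 0.01) / 0.206 \approx 0.0388\), which matches the first row of the data frame below.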
We now have a data frame that looks like this:
head(df)
## p_A p_B_given_A p_B p_A_given_B
## 1 0.01 0.800 0.2060 0.03883495
## 2 0.01 0.805 0.2011 0.04002984
## 3 0.01 0.810 0.1962 0.04128440
## 4 0.01 0.815 0.1913 0.04260324
## 5 0.01 0.820 0.1864 0.04399142
## 6 0.01 0.825 0.1815 0.04545455
We can then plot the data to show the relationship between the different probabilities.
ggplot(df, aes(x=p_B_given_A, y=p_A_given_B)) +
geom_point() +
xlab("P(B|A) = P(Positive Test Results|Has BSE)
If the cow has BSE, probability of spotting it with the test") +
ylab("P(A|B) = P(Has BSE|Positive Test Results)
If cow has positive test results, probability that it actually has BSE")
There are two interesting things that we see here:
R contains functions that allow you to easily work with distributions. The names of these functions follow a standard format: you’ll see a d, r, p or q and then the name of the distribution. These letters stand for:

- d - the probability density function (or probability mass function, for discrete distributions)
- r - random sampling from the distribution
- p - the cumulative distribution function
- q - the quantile function (the inverse of the cumulative distribution function)
Below you can see how these letters are combined with the names of the distributions. The notations (\(f_{n}\), etc.) correspond with those used in the lectures.
| | d (density, \(f_{n}\)) | p (cumulative, \(F_{n}\)) | q (quantile, \(F_{n}^{-1}\)) | r (random) |
|---|---|---|---|---|
| Normal | dnorm | pnorm | qnorm | rnorm |
| Uniform | dunif | punif | qunif | runif |
| Binomial | dbinom | pbinom | qbinom | rbinom |
Type ?Distributions into the RStudio Console in order to see documentation on all of the distributions.
If you type ?pnorm in the console, you’ll bring up the documentation for this function. You’ll see the following description of the function:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
From this, we can see four things:

- The name of the function is pnorm.
- The arguments it accepts: q, mean, sd, lower.tail and log.p.
- The expected order of the arguments: q is first, mean is second, etc.
- The default values, written like sd = 1. If we don’t explicitly specify a value, it will choose these stated defaults:
  - mean = 0
  - sd = 1
  - lower.tail = TRUE
  - log.p = FALSE
When you use the command pnorm(1), you’re saying the same thing as pnorm(q=1), since the function description shows that the first argument in the pnorm() function is q, which the documentation states is a vector of quantiles.
These are all the same due to the default values and the specification of the expected order of the arguments:
pnorm(1)
pnorm(1, 0, 1)
pnorm(1, mean=0, sd=1)
These two examples are the same as the three above, except that we’ve changed the order of the arguments. Any time you change the order of the arguments, you need to explicitly specify the name of the argument (like sd=) so the function understands where to find that particular argument.
pnorm(1, sd=1, mean=0)
pnorm(1, sd=1)
By typing ?dnorm in the console, we see that we need to use the following syntax:
dnorm(x, mean = 0, sd = 1, log = FALSE)
dnorm(x=8, mean=10, sd=3, log=FALSE)
## [1] 0.1064827
Since log=FALSE is specified as a default value, we can shorten this to just:
dnorm(x=8, mean=10, sd=3)
Also, since our quantile, mean, and standard deviation are in the order expected by the function, we can shorten this further to just:
dnorm(8, 10, 3)
## [1] 0.1064827
However, if we change the order of the arguments without naming them, R will assume that we want a mean of 3 and a standard deviation of 10, and we will get a different answer than what we want.
dnorm(8, 3, 10)
## [1] 0.03520653
We can also use a vector with these functions to evaluate a range of values. For example, here we calculate the probability density at values of 1, 2, …, 20:
dnorm(1:20, 10, 3)
## [1] 0.001477283 0.003798662 0.008740630 0.017996989 0.033159046
## [6] 0.054670025 0.080656908 0.106482669 0.125794409 0.132980760
## [11] 0.125794409 0.106482669 0.080656908 0.054670025 0.033159046
## [16] 0.017996989 0.008740630 0.003798662 0.001477283 0.000514093
If we plot these values, we can see the distribution:
x = 1:20
y = dnorm(x, 10, 3)
plot(x, y)
# create a series of x values
xvals = seq(-3, 3, 0.01)
# create a data frame containing values for the density function (sampled at
# each value of xvals) and the cumulative distribution sampled at the same values
dist_data = data.frame(x=xvals,
dist = dnorm(xvals, mean=0, sd=1),
cumulative_dist = pnorm(xvals, mean=0, sd=1))
# take a look at the first few rows of the data frame
head(dist_data)
## x dist cumulative_dist
## 1 -3.00 0.004431848 0.001349898
## 2 -2.99 0.004566590 0.001394887
## 3 -2.98 0.004704958 0.001441242
## 4 -2.97 0.004847033 0.001488999
## 5 -2.96 0.004992899 0.001538195
## 6 -2.95 0.005142641 0.001588870
# plot probability vs. cumulative probability
ggplot() +
geom_line(data=dist_data, aes(x=x, y=dist, color="probability density")) +
geom_line(data=dist_data, aes(x=x, y=cumulative_dist, color="cumulative distribution")) +
ylab("Probability") +
xlab("X Values")
You’ll see that in the example above, we added code like color="probability density" within the aesthetics specification (aes()). When we do this, R will automatically assign a color for that particular text, and it will show up in the legend as well.
Note that the following three plotting commands all produce the same plot. For this example, the two lines we are plotting both use the same data frame and column for x, so we can specify this like ggplot(data=dist_data, aes(x=x)). This means that these settings will be used for all the following geom_* statements (geom_line, geom_point, etc.). Instead of specifying the same parameter settings multiple times, we can just specify them once.
# specify data frame, x & y in the geom_line statements
ggplot() +
geom_line(data=dist_data, aes(x=x, y=dist)) +
geom_line(data=dist_data, aes(x=x, y=cumulative_dist))
# specify data frame in ggplot(), then specify x & y in the geom_line statements
ggplot(data=dist_data) +
geom_line(aes(x=x, y=dist)) +
geom_line(aes(x=x, y=cumulative_dist))
# specify data frame and x in ggplot(), then specify y in the geom_line statements
ggplot(data=dist_data, aes(x=x)) +
geom_line(aes(y=dist)) +
geom_line(aes(y=cumulative_dist))
Type ?pnorm to bring up the documentation for the cumulative distribution function of the normal distribution. For this example, you need to note that for lower.tail, “if TRUE (default), probabilities are P[X ≤ x] otherwise, P[X > x]”. This applies for the other distributions as well.
x0 = 1
# P(x > x0)
pnorm(x0, lower.tail=FALSE)
## [1] 0.1586553
1 - pnorm(x0, lower.tail=TRUE)
## [1] 0.1586553
# P(x < x0)
pnorm(x0, lower.tail=TRUE)
## [1] 0.8413447
1 - pnorm(x0, lower.tail=FALSE)
## [1] 0.8413447
You can see the output of several example quantiles below, each showing which value is greater than 80% (p=0.8) of all the values in its distribution.
qnorm(p=0.8)
## [1] 0.8416212
qunif(p=0.8, min=-1, max=1)
## [1] 0.6
qbinom(p=0.8, size=10, prob=0.5) # 10 trials, 50% probability of "success" per trial
## [1] 6
Here we plot a series of quantiles on a plot also showing the probability distribution and the cumulative distribution.
In the code, geom_vline(xintercept = qnorm(0.5), linetype="dashed") means that we should draw a vertical dashed line at the value specified by qnorm(0.5).
ggplot() +
geom_line(data=dist_data, aes(x=x, y=dist), colour="blue") +
geom_line(data=dist_data, aes(x=x, y=cumulative_dist), color="red") +
ylab("Probability") +
xlab("X Values") +
geom_vline(xintercept = qnorm(0.5), linetype="dashed") +
geom_vline(xintercept = qnorm(0.9), linetype="dashed") +
geom_vline(xintercept = qnorm(0.95), linetype="dashed") +
geom_vline(xintercept = qnorm(0.99), linetype="dashed") +
ggtitle("Quantitles for normal distribution at 0.5, 0.9, 0.95, 0.99")
As we’d expect, the value for qnorm(0.5) is exactly where the cumulative probability is 0.5 as well. The same goes for the values at 0.9, 0.95 and 0.99.
10 samples from a normal distribution
rnorm(10)
## [1] -0.4282528 -1.3192845 -0.4675785 -0.5589134 0.1868809 0.4415013
## [7] 0.9431222 1.9556165 -0.6606099 0.5419178
10 samples from a uniform distribution where random values are between -1 and 1
runif(10, min=-1, max=1)
## [1] 0.2298181 0.1107643 0.9412368 -0.3168448 0.2239617 0.3341198
## [7] -0.7685436 0.5172092 -0.2406649 -0.6417861
10 samples from a binomial distribution, 100 trials (size=100) with probability of success of 0.5 (prob=0.5)
rbinom(10, size=100, prob=0.5)
## [1] 71 48 50 55 52 52 44 50 44 52
We can use the hist()
command to quickly make a histogram and show that the random samples do approximate the distributions that they are drawn from, especially for a large number of samples:
hist(rnorm(10000), breaks=100)
hist(runif(10000, min=-1, max=1), breaks=100)
hist(rbinom(10000, size=100, prob=0.5), breaks=100)
If you’re working with a binomial distribution, you will sometimes need to sum up a set of probabilities in order to determine the probability of having more than or fewer than a certain number of successful results.
For example, given a situation with 5 trials where the probability of success on each trial is 0.3, what is the probability that we will have fewer than 3 successes?
To calculate this, we first have to consider that fewer than 3 successes means that we could have 0, 1, or 2 successes.
For each of these 3 cases (0, 1, or 2 successes), we can add up the probabilities in a series of statements like this:
dbinom(0, size=5, prob=0.3) + # probability of 0 successes
dbinom(1, size=5, prob=0.3) + # probability of 1 success
dbinom(2, size=5, prob=0.3)   # probability of 2 successes
## [1] 0.83692
This isn’t a good approach if we have to sum up the probabilities for a lot of numbers like 8, 9, … 1000.
A much more efficient way is to pass a sequence of numbers to dbinom and then use the sum function to sum up the probabilities:
# pass a vector of 0, 1, 2 and then sum up the probabilities
sum(dbinom(c(0:2), size=5, prob=0.3))
## [1] 0.83692
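Equivalently, since pbinom gives the cumulative probability \(P(X \le x)\), the same value comes from a single call:
pbinom(2, size=5, prob=0.3)
## [1] 0.83692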
Plot the density and cumulative distribution for a normal distribution with a mean of 5 and a standard deviation of 10. Add vertical lines to show the 80%, 90%, and 99% quantiles. Note that your plot may look slightly different on the x axis depending on which range of x values you choose.
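A minimal sketch of one possible solution, following the same pattern as the standard normal plot earlier (the x range of -25 to 35 is just one choice):
xvals = seq(-25, 35, 0.1)
norm_data = data.frame(x = xvals,
                       dist = dnorm(xvals, mean=5, sd=10),
                       cumulative_dist = pnorm(xvals, mean=5, sd=10))
ggplot(norm_data, aes(x=x)) +
  geom_line(aes(y=dist)) +
  geom_line(aes(y=cumulative_dist)) +
  # qnorm accepts a vector, so one geom_vline draws all three quantile lines
  geom_vline(xintercept = qnorm(c(0.8, 0.9, 0.99), mean=5, sd=10), linetype="dashed") +
  ylab("Probability") +
  xlab("X Values")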
Plot the probability density and cumulative probability for a binomial distribution with 100 trials and a probability of success of 0.3. Add vertical lines to show the 50%, 90%, and 99% quantiles.
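A similar sketch for the binomial case, using dbinom, pbinom and qbinom (since the distribution is discrete, geom_point would also be a reasonable choice):
xvals = 0:100
binom_data = data.frame(x = xvals,
                        dist = dbinom(xvals, size=100, prob=0.3),
                        cumulative_dist = pbinom(xvals, size=100, prob=0.3))
ggplot(binom_data, aes(x=x)) +
  geom_line(aes(y=dist)) +
  geom_line(aes(y=cumulative_dist)) +
  geom_vline(xintercept = qbinom(c(0.5, 0.9, 0.99), size=100, prob=0.3), linetype="dashed")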
Assume a normal distribution with a mean of 5 and a standard deviation of 10.
Take 10 random samples, calculate the mean, and observe how the calculated mean differs from the actual mean (5). Note that your own values may be slightly different from what is shown here since we’re taking random samples. If you run your code multiple times you will see that the values are a bit different each time.
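One way to write this:
mean(rnorm(10, mean=5, sd=10))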
## [1] 8.164448
Do the same, except with 100 samples
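For example:
mean(rnorm(100, mean=5, sd=10))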
## [1] 6.425474
… and again with 1000 values
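For example:
mean(rnorm(1000, mean=5, sd=10))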
## [1] 5.193759
This example is from Lecture 2, so you can compare your results also with the lecture slides.
You observe cars at an intersection: on a usual day, 50% go right and 50% go left. Today you walk past quickly and only have time to observe 10 cars.
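A sketch that reproduces the first result below, assuming the question asks for the probability that exactly 8 of the 10 cars go right:
dbinom(8, size=10, prob=0.5)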
## [1] 0.04394531
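And assuming the follow-up asks for the probability that 8 or more of the 10 cars go right:
sum(dbinom(8:10, size=10, prob=0.5))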
## [1] 0.0546875
To find the most likely number to go right, we calculate the expected value \(E(x) = np\)
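A sketch that reproduces the value below; the output of 500 implies that this part of the example uses \(n = 1000\) cars (an assumption inferred from the result) with \(p = 0.5\):
n = 1000 # assumed: implied by E(x) = 500
p = 0.5
n * p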
## [1] 500
To find the standard deviation:
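Using \(\sigma = \sqrt{np(1-p)}\) with the same assumed \(n = 1000\):
sqrt(n * p * (1 - p))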
## [1] 15.81139
Again using \(Var(x) = np(1-p)\) and \(\sigma = \sqrt{Var(x)}\), find \(\sigma\)
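The result of 5 implies \(n = 100\) cars for this part (again, an assumption inferred from the output, since \(\sqrt{100 \cdot 0.5 \cdot 0.5} = 5\)):
sqrt(100 * 0.5 * 0.5)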
## [1] 5
Probability of observing exactly 43 cars turning right:
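Assuming the same \(n = 100\) cars:
dbinom(43, size=100, prob=0.5)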
## [1] 0.03006864
Probability of observing at most 43 cars turning right:
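Using pbinom, which gives \(P(X \le x)\), with the same assumed \(n = 100\):
pbinom(43, size=100, prob=0.5)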
## [1] 0.09667395
A factory produces bolts that should be 10 cm wide. You take a sample of 10 bolts and measure an average width of 10.25 cm with a standard deviation of 0.8 cm.
Because we’re looking for the 95% confidence interval, \(\alpha = 0.05\). Since we have 10 samples, the degrees of freedom are df = n - 1 = 9.
The standard error is defined as \(\frac{s}{\sqrt{n}}\), and we use \(\bar{x}\) to represent the sample mean:
x_bar = 10.25 # mean of the samples
sd_samples = 0.8 # standard deviation of the samples
num_samples = 10
standard_error = sd_samples / sqrt(num_samples)
Using the Student t distribution, calculate \(t_{\alpha/2}\):
t_0.025 = qt(0.025, 9)
t_0.025
## [1] -2.262157
Calculate \(t_{1 - \alpha/2}\):
t_0.975 = qt(0.975, 9)
t_0.975
## [1] 2.262157
-qt(0.025, 9) # another way to reach the same value
## [1] 2.262157
Our confidence interval will then be between \(\left[\bar{x} + t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}, \bar{x} + t_{1 - \alpha/2} \cdot \frac{s}{\sqrt{n}} \right]\)
In R, this looks like:
lower_limit = x_bar + t_0.025 * standard_error
upper_limit = x_bar + t_0.975 * standard_error
c(lower_limit, upper_limit)
## [1] 9.677714 10.822286
Remember that variance is \(\sigma^{2}\), while the standard deviation is \(\sigma\).
We can calculate the confidence interval for \(\sigma^{2}\) by using \(\left[ \frac{s^{2}(n-1)}{\chi^{2}_{1-\alpha/2}}, \frac{s^{2}(n-1)}{\chi^{2}_{\alpha/2}}\right]\)
Note that when calculating the confidence interval for the variance (and hence the standard deviation), the lower limit will be based on the chi-squared quantile at 97.5%, while the upper limit will be based on the quantile at 2.5% (see page 346 in the textbook).
In R, we can write it like this:
lower_limit = (sd_samples^2 * (num_samples - 1)) /
qchisq(0.975, 9)
upper_limit = (sd_samples^2 * (num_samples - 1)) /
qchisq(0.025, 9)
c(lower_limit, upper_limit)
## [1] 0.3027951 2.1330256
Since we are looking for the confidence interval for \(\sigma\) instead of \(\sigma^{2}\), we can just take the square root of the previous values
sqrt(c(lower_limit, upper_limit))
## [1] 0.5502682 1.4604881
At the same factory, we now take a sample of 20 bolts and discover that with this larger sample, we now have an average of 10.13 cm with a standard deviation of 0.7.
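These exercise results can be reproduced with a minimal sketch that reuses the qt and qchisq approach from above (the variable names are our own):
x_bar = 10.13      # mean of the 20 samples
sd_samples = 0.7   # standard deviation of the samples
num_samples = 20
standard_error = sd_samples / sqrt(num_samples)
deg_f = num_samples - 1

# confidence intervals for the mean: x_bar + t quantiles * standard error
x_bar + qt(c(0.05, 0.95), deg_f) * standard_error   # 90%
x_bar + qt(c(0.005, 0.995), deg_f) * standard_error # 99%

# confidence intervals for the standard deviation: note the flipped chi-squared quantiles
sqrt(sd_samples^2 * deg_f / qchisq(c(0.95, 0.05), deg_f))   # 90%
sqrt(sd_samples^2 * deg_f / qchisq(c(0.995, 0.005), deg_f)) # 99%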
90% confidence interval for the mean
## [1] 9.859348 10.400652
99% confidence interval for the mean
## [1] 9.682193 10.577807
90% confidence interval for the standard deviation
## [1] 0.5557479 0.9592873
99% confidence interval for the standard deviation
## [1] 0.4912256 1.1663281