Load the familiar data again, removing individuals with no listed height:
library(mosaic)
library(dplyr)
library(ggplot2)
library(okcupiddata)
data(profiles)
set.seed(76)
profiles <- profiles %>%
  filter(!is.na(height))
We draw 10,000 samples of size 5 (with replacement) and record the sample mean of each, and then do the same for samples of size 50:
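# 10,000 resampled means for samples of size n = 5
n <- 5
samples_5 <- do(10000) *
  mean(resample(profiles$height, size = n, replace = TRUE))
# Convert to a tibble; the simulated means are in the column `mean`
samples_5 <- samples_5 %>%
  as_tibble()
# The same for samples of size n = 50
n <- 50
samples_50 <- do(10000) *
  mean(resample(profiles$height, size = n, replace = TRUE))
samples_50 <- samples_50 %>%
  as_tibble()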
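For readers without mosaic at hand, the resampling above can be sketched in base R. This is only an illustration under assumptions: `population` is a made-up stand-in for profiles$height (normal, mean 68.3, SD 4), and simulate_means() is a hypothetical helper, not a mosaic function.

```r
# Stand-in population (assumption): 60,000 made-up heights, not the OkCupid data
set.seed(76)
population <- rnorm(60000, mean = 68.3, sd = 4)

# Base-R analogue of do(10000) * mean(resample(...)):
# draw n values with replacement, take their mean, repeat `reps` times
simulate_means <- function(x, n, reps = 10000) {
  replicate(reps, mean(sample(x, size = n, replace = TRUE)))
}

means_5  <- simulate_means(population, n = 5)
means_50 <- simulate_means(population, n = 50)

# The spread of the simulated means is the standard error
print(c(se_5 = sd(means_5), se_50 = sd(means_50)))
```

The two vectors play the role of samples_5$mean and samples_50$mean; the exact numbers will differ from those reported here because the population is simulated.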
Construct 95% confidence intervals for \(\mu\), the true average height of all 60K OkCupid users, using n=5 and n=50.
Hint: Look at the histograms of the 10000 simulations.
To get a 95% confidence interval from the quantiles, we need quantile() with probs=c(0.025, 0.975).
For the normal-curve approach, we need the mean and the standard deviation of the sampling distribution, i.e. the distribution of the 10,000 values of \(\overline{x}\). The latter is the standard error.
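Both approaches can be wrapped in one small function. The sketch below is an illustration, not part of the original solution: ci_both() is a hypothetical helper, and `fake_means` is made-up stand-in data (normal, mean 68.3, SD 1.8) so that the block runs on its own.

```r
# Hypothetical helper: given a vector of simulated sample means,
# return the two 95% intervals described above
ci_both <- function(xbar) {
  # Quantile approach: cut off 2.5% of the simulations in each tail
  q <- quantile(xbar, probs = c(0.025, 0.975), names = FALSE)
  # Normal-curve approach: mean of the simulations +/- 2 standard errors
  se <- sd(xbar)
  normal <- mean(xbar) + c(-2, 2) * se
  out <- rbind(quantile = q, normal = normal)
  colnames(out) <- c("lower", "upper")
  out
}

# Stand-in data (assumption): 10,000 made-up sample means
set.seed(76)
fake_means <- rnorm(10000, mean = 68.3, sd = 1.8)
intervals <- ci_both(fake_means)
print(round(intervals, 2))
```

With the real simulations, one would call ci_both(samples_5$mean) or ci_both(samples_50$mean) instead.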
Quantiles (n=5): Using the fact above
quantile(samples_5$mean, probs=c(0.025, 0.975))
## 2.5% 97.5%
## 64.8 71.8
i.e. we have (64.8, 71.8)
Normal-Curve Approach:
mean(samples_5$mean)
## [1] 68.3246
sd(samples_5$mean)
## [1] 1.797184
So using the rule of mean +/- 2 standard errors, we have: \((68.3 - 2\times 1.80, 68.3 + 2\times 1.80)\) = (64.7, 71.9).
We plot the resulting confidence intervals, i.e. our nets, with the quantile interval in blue and the normal-curve interval in red:
ggplot(samples_5, aes(x=mean)) +
  geom_histogram(binwidth = 1) +
  xlim(c(60,76)) +
  geom_vline(xintercept = 68.29, linetype="dashed") +
  geom_vline(xintercept = c(64.8, 71.8), col="blue") +
  geom_vline(xintercept = c(64.69, 71.89), col="red")
Quantiles (n=50): Using the fact above
quantile(samples_50$mean, probs=c(0.025, 0.975))
## 2.5% 97.5%
## 67.18 69.38
i.e. we have (67.18, 69.38)
Normal-Curve Approach:
mean(samples_50$mean)
## [1] 68.29883
sd(samples_50$mean)
## [1] 0.5596155
So using the rule of mean +/- 2 standard errors, we have: \((68.30 - 2\times 0.56, 68.30 + 2\times 0.56)\) = (67.18, 69.42).
We plot the resulting confidence intervals, i.e. our nets, with the quantile interval in blue and the normal-curve interval in red:
ggplot(samples_50, aes(x=mean)) +
  geom_histogram(binwidth = 0.5) +
  xlim(c(60,76)) +
  geom_vline(xintercept = 68.30, linetype="dashed") +
  geom_vline(xintercept = c(67.18, 69.38), col="blue") +
  geom_vline(xintercept = c(67.18, 69.42), col="red")
As expected, the confidence interval is narrower for n=50. This is because the SE is smaller when n=50 (about 0.56, versus about 1.80 when n=5).
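How much smaller the SE should be can be checked against the standard result for the mean of n draws made with replacement. The symbol \(\sigma\), the standard deviation of all the heights, is introduced here for illustration and does not appear in the original; the comparison is a rough sanity check, not an exact identity for simulated values.

```latex
\[
SE = \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
\frac{SE_{n=5}}{SE_{n=50}} = \sqrt{\frac{50}{5}} = \sqrt{10} \approx 3.16
\]
```

The simulated standard errors give a ratio of 1.797 / 0.560, roughly 3.21, which is close to this value.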