Load the familiar data again, removing individuals with no listed height

library(mosaic)
library(dplyr)
library(ggplot2)
library(okcupiddata)
data(profiles)
set.seed(76)
profiles <- profiles %>% 
  filter(!is.na(height))

We take many, many, many samples of size 5 and then take the sample mean, and then do the same for samples of size 50:

n <- 5
samples_5 <- do(10000) * 
  mean(resample(profiles$height, size=n, replace=TRUE))
samples_5 <- samples_5 %>% 
  as_data_frame() 

n <- 50
samples_50 <- do(10000) * 
  mean(resample(profiles$height, size=n, replace=TRUE))
samples_50 <- samples_50 %>% 
  as_data_frame() 

Learning Checks

Construct 95% confidence intervals for \(\mu\), the true average height of all 60K OkCupid users

  • Using both methods for constructing confidence intervals: quantiles and bell-curve aproach
  • For both when n=5 and n=50

Hint: Look at the histograms of the 10000 simulations.

Notes

Quantiles

To get 95% confidence interval, we need

  • the middle 95% (blue)
  • so the tails represent 5% (red), i.e. 2.5% each
  • thus we need the 2.5% and 97.5% percentiles
  • thus we use quantile() with prob=c(0.025, 0.975)

Normal Curve

We need the mean and the standard deviation of the sampling distribution, i.e. the distribution of the 10,000 \(\overline{x}\). The latter is the standard error.

n=5

Quantiles: Using the fact above

quantile(samples_5$mean, prob=c(0.025, 0.975))
##  2.5% 97.5% 
##  64.8  71.8

i.e. we have (64.8, 71.8)

Normal-Curve Approach:

mean(samples_5$mean)
## [1] 68.3246
sd(samples_5$mean)
## [1] 1.797184

So using the +/- 2 SD from the mean rule, we have: \((68.3 - 2\times 1.80, 68.3 + 2\times 1.80)\) = (64.69, 71.89).

We plot the resulting confidence intervals i.e. our net

  • Blue: Quantile contstructed
  • Red: Normal curve constructed
ggplot(samples_5, aes(x=mean)) +
  geom_histogram(binwidth = 1) +
  xlim(c(60,76)) +
  geom_vline(xintercept = 68.29, linetype="dashed") +
  geom_vline(xintercept = c(64.8,  71.8), col="blue") +
  geom_vline(xintercept = c(64.69, 71.89), col="red")

n=50

Quantiles: Using the fact above

quantile(samples_50$mean, prob=c(0.025, 0.975))
##  2.5% 97.5% 
## 67.18 69.38

i.e. we have (67.18, 69.38)

Normal-Curve Approach:

mean(samples_50$mean)
## [1] 68.29883
sd(samples_50$mean)
## [1] 0.5596155

So using the +/- 2 SD from the mean rule, we have: \((68.30 - 2\times 0.57, 68.30 + 2\times 0.57)\) = (67.16, 69.44).

We plot the resulting confidence intervals i.e. our net

  • Blue: Quantile contstructed
  • Red: Normal curve constructed
ggplot(samples_50, aes(x=mean)) +
  geom_histogram(binwidth = 0.5) +
  xlim(c(60,76)) +
  geom_vline(xintercept = 68.30, linetype="dashed") +
  geom_vline(xintercept = c(67.18, 69.40), col="blue") +
  geom_vline(xintercept = c(67.16, 69.44), col="red")

Conclusion

As expected, confidence interval is narrower for n=50. This is because the SE is smaller when n=50.