Let’s revisit the OkCupid profile data. Run the following in your console:
library(mosaic)
library(dplyr)
library(ggplot2)
library(okcupiddata)
data(profiles)
# Added line: let's remove all users (i.e. rows) who did not list a height. is.na() returns
# TRUE if missing, so we want those that are NOT missing.
profiles <- profiles %>%
  filter(!is.na(height))
Then run the following:
n <- 5
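# Repeat 10,000 times: draw n heights with replacement and record their mean
samples <- do(10000) *
  mean(resample(profiles$height, size = n, replace = TRUE))
# Convert the result to a tibble
samples <- samples %>%
  as_tibble()
# Histogram of the 10,000 sample means
ggplot(samples, aes(x = mean)) +
  geom_histogram(binwidth = 1) +
  xlim(c(40, 90))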
Now try varying n. What does this correspond to doing? This code draws n=5 OkCupid users with replacement from the population of OkCupid users, computes their mean height, and repeats this 10,000 times. Varying n corresponds to sampling a different number of people. Here are the histograms of sample means for n=5 and n=50. The histogram for the latter (n=50) is narrower, i.e. less variable, i.e. more precise.
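One rough way to check the narrowing numerically is to compare the standard deviation of the sample means for the two sample sizes. The sketch below uses base R rather than mosaic so that it runs on its own. The `heights` vector is a simulated stand-in for `profiles$height` (its standard deviation of 4 inches is an assumption, not a value from the data), and `sample_means()` is a hypothetical helper introduced only for illustration.

```r
# Stand-in population: simulated heights, NOT the real OkCupid data.
# The mean of 68.3 is taken from the text; the sd of 4 inches is assumed.
set.seed(76)
heights <- rnorm(60000, mean = 68.3, sd = 4)

# Hypothetical helper: draw `reps` samples of size n with replacement
# and return the mean of each sample.
sample_means <- function(x, n, reps = 10000) {
  replicate(reps, mean(sample(x, size = n, replace = TRUE)))
}

means_5  <- sample_means(heights, n = 5)
means_50 <- sample_means(heights, n = 50)

# Spread of the sample means: smaller for the larger sample size.
sd_5  <- sd(means_5)
sd_50 <- sd(means_50)
c(n5 = sd_5, n50 = sd_50)
```

For means of independent draws, the spread shrinks roughly in proportion to 1/sqrt(n), so going from n=5 to n=50 should narrow it by a factor of about 3.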
Recall that in both cases, the sample mean \(\overline{x}\) is a point estimate of the true population mean \(\mu\), i.e. the true mean height of all 60K OkCupid users, which is \(\mu\) = 68.3 inches (or about 5’8’’). Let’s plot this value in red:
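A minimal sketch of how such a plot could be drawn, assuming a `geom_vline()` layer is added to the earlier histogram. To keep the block self-contained, the sample means here come from a simulated stand-in population (standard deviation of 4 inches assumed), not from the real `profiles` data.

```r
library(ggplot2)

# Stand-in sample means: simulated, NOT the real OkCupid data (sd of 4 is assumed).
set.seed(76)
heights <- rnorm(60000, mean = 68.3, sd = 4)
samples <- data.frame(
  mean = replicate(10000, mean(sample(heights, size = 5, replace = TRUE)))
)

mu <- 68.3  # population mean quoted in the text

p <- ggplot(samples, aes(x = mean)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = mu, colour = "red") +  # vertical red line at mu
  xlim(c(40, 90))
p
```

With the `samples` data frame built earlier from `profiles$height`, only the `geom_vline()` line would need to be added to the existing plotting code.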
We see that with n=50, our sample means \(\overline{x}\), i.e. our point estimates of \(\mu\), tend to fall closer to the true population mean in red than they do with n=5. This is why sample size matters!
Let’s look at the middle 95% of values for n=5. It is [65.0, 71.6].
Let’s look at the middle 95% of values for n=50. It is [67.18, 69.40], a much narrower interval.
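The middle 95% of a set of sample means can be read off as the 2.5th and 97.5th percentiles. The sketch below shows one way to compute this with base R's `quantile()`. It again uses a simulated stand-in population (standard deviation of 4 inches assumed), so its numbers will not match the intervals quoted above exactly; `sample_means()` and `middle_95()` are hypothetical helpers.

```r
# Stand-in population: simulated heights, NOT the real OkCupid data.
set.seed(76)
heights <- rnorm(60000, mean = 68.3, sd = 4)

# Hypothetical helper: means of `reps` samples of size n, drawn with replacement.
sample_means <- function(x, n, reps = 10000) {
  replicate(reps, mean(sample(x, size = n, replace = TRUE)))
}

# Hypothetical helper: 2.5th and 97.5th percentiles, i.e. the middle 95%.
middle_95 <- function(means) {
  quantile(means, probs = c(0.025, 0.975))
}

int_5  <- middle_95(sample_means(heights, n = 5))
int_50 <- middle_95(sample_means(heights, n = 50))

int_5
int_50
```

With the `samples` data frame from earlier, the equivalent call would be `quantile(samples$mean, probs = c(0.025, 0.975))`, run once for each value of n.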