Let’s revisit the OkCupid profile data. Run the following in your console:
library(mosaic)
library(dplyr)
library(ggplot2)
library(okcupiddata)
data(profiles)
# Added line: let's remove all users (i.e. rows) who did not list a height. is.na() returns
# TRUE if missing, so we want those that are NOT missing.
profiles <- profiles %>%
  filter(!is.na(height))
Then run the following:
n <- 5
samples <- do(10000) *
  mean(resample(profiles$height, size=n, replace=TRUE))
# as_data_frame() is deprecated; as_tibble() is its replacement
samples <- samples %>%
  as_tibble()
ggplot(samples, aes(x=mean)) +
  geom_histogram(binwidth = 1) +
  xlim(c(40,90))
Try varying n. What does this correspond to doing?
This code samples n = 5 OkCupid users with replacement from the population of OkCupid users and computes their mean height, repeating this 10000 times. Varying n corresponds to sampling a different number of people. Here are the histograms of sample means for different values of n:
n=5
n=50
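If you would like to see both sample sizes side by side rather than re-running the code for each n, here is one possible sketch. It uses base R's sample() and replicate() in place of mosaic's resample() and do(), which is a swap made for illustration. The helper sample_means(), the seed value, and the narrower binwidth are assumptions of this sketch, not part of the original code.

```r
library(okcupiddata)
library(ggplot2)

data(profiles)
# Keep only the heights that were actually reported
heights <- profiles$height[!is.na(profiles$height)]

# Hypothetical helper (not part of mosaic): draw `reps` samples of size n
# with replacement and return the mean of each sample
sample_means <- function(x, n, reps = 10000) {
  replicate(reps, mean(sample(x, size = n, replace = TRUE)))
}

set.seed(76)  # arbitrary seed so the results are reproducible

# Stack the sample means for both sample sizes into one data frame
results <- rbind(
  data.frame(n = "n = 5",  mean = sample_means(heights, 5)),
  data.frame(n = "n = 50", mean = sample_means(heights, 50))
)

# One panel per sample size, on a shared x-axis for easy comparison
ggplot(results, aes(x = mean)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~n, ncol = 1) +
  xlim(c(40, 90))

# Standard deviation of the sample means for each n
tapply(results$mean, results$n, sd)
```

Both histograms should be centered at roughly the same height, but the one for n = 50 should be noticeably narrower: means of larger samples vary less from sample to sample.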