Let’s revisit the OkCupid profile data. Run the following in your console:

library(mosaic)
library(dplyr)
library(ggplot2)

library(okcupiddata)
data(profiles)

# Added line: remove all users (i.e. rows) who did not list a height.
# is.na() returns TRUE if a value is missing, so we keep the rows that are NOT missing.
profiles <- profiles %>%
  filter(!is.na(height))

Then run the following:

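# Sample size: number of users drawn in each sample
n <- 5
# Draw n heights with replacement and compute their mean; repeat 10000 times
samples <- do(10000) *
  mean(resample(profiles$height, size=n, replace=TRUE))
# as_tibble() replaces the deprecated as_data_frame()
samples <- samples %>%
  as_tibble()
# Histogram of the 10000 sample means
ggplot(samples, aes(x=mean)) +
  geom_histogram(binwidth = 1) +
  xlim(c(40,90))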

## Learning Checks

1. Discuss with your seatmates what the code above does.
2. Try varying n. What does this correspond to doing?
3. How does the histogram change as n varies?

#### LC1

This code

• Samples n=5 OkCupid users with replacement from the population of OkCupid users
• Computes the sample mean height $$\overline{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$$, where $$x_i$$ is the height of the $$i$$th sampled user out of $$n$$.
• Repeats the two steps above many times, here 10000 times
• Plots the distribution of these 10000 values of $$\overline{x}$$ via a histogram
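
The four steps above can also be sketched in base R, without mosaic. This is only an illustration under stated assumptions: `heights` is a simulated stand-in vector (not the OkCupid data), and `replicate()` with `sample()` plays the role that `do()` and `resample()` play in the code above.

```r
set.seed(76)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in for profiles$height: 1000 simulated heights in inches
heights <- rnorm(1000, mean = 68, sd = 4)

n <- 5
# Draw n values with replacement, take their mean, and repeat 10000 times
sample_means <- replicate(10000, mean(sample(heights, size = n, replace = TRUE)))

# Histogram of the 10000 sample means
hist(sample_means, breaks = 40, xlab = "Sample mean height", main = "n = 5")
```

Each element of `sample_means` is one value of $$\overline{x}$$, so the histogram shows how $$\overline{x}$$ varies from sample to sample.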

#### LC2

Varying n corresponds to sampling a different number of people each time. Here are the histograms of sample means when

• n=5
• n=50
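
To put a number on the difference between the two sample sizes, one option is to compare the standard deviation of the sample means for each n. The sketch below is an assumption-laden illustration: `heights` is a simulated stand-in for the real heights, and `sample_means_for()` is a hypothetical helper name, not a function from mosaic.

```r
set.seed(76)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in for profiles$height: 1000 simulated heights in inches
heights <- rnorm(1000, mean = 68, sd = 4)

# Hypothetical helper: reps sample means, each based on n draws with replacement
sample_means_for <- function(n, reps = 10000) {
  replicate(reps, mean(sample(heights, size = n, replace = TRUE)))
}

means_5  <- sample_means_for(5)
means_50 <- sample_means_for(50)

# Spread of each set of sample means; the n = 50 spread should be the smaller one
sd(means_5)
sd(means_50)
```

With a larger n, each sample mean averages over more people, so the sample means vary less from sample to sample and the histogram is narrower.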