Let’s revisit the OkCupid profile data. Run the following in your console:

```
library(mosaic)
library(dplyr)
library(ggplot2)
library(okcupiddata)
data(profiles)
# Added line: let's remove all users (i.e. rows) who did not list a height. is.na() returns
# TRUE if missing, so we want those that are NOT missing.
profiles <- profiles %>%
  filter(!is.na(height))
```

Then run the following:

```
n <- 5
samples <- do(10000) *
  mean(resample(profiles$height, size=n, replace=TRUE))
# as_tibble() replaces the deprecated as_data_frame()
samples <- samples %>%
  as_tibble()
ggplot(samples, aes(x=mean)) +
  geom_histogram(binwidth = 1) +
  xlim(c(40,90))
```

- Discuss with your seatmates what the code above does.
- Try varying `n`. What does this correspond to doing?
- How does the histogram change?

This code:

- Samples `n=5` OkCupid users with replacement from the **population** of OkCupid users
- Computes the sample mean height \(\overline{x} = \frac{1}{n}\sum_{i=1}^{n}x_i\), where \(x_i\) is the height of the \(i\)th sampled user out of \(n\)
- Does this many, many, many times, i.e. 10000 times
- Plots the **distribution** of these 10000 values of \(\overline{x}\) via a histogram
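The same four steps can be spelled out in base R, which may make it easier to see what `do()` and `resample()` are doing. This is only a sketch: `sample_means` is a hypothetical helper, not a `mosaic` function, and the population here is simulated stand-in data rather than the OkCupid heights.

```r
# Base-R sketch of the resampling steps; `sample_means` is a hypothetical helper.
sample_means <- function(x, n, reps = 10000) {
  # replicate() evaluates the expression `reps` times and collects the results
  replicate(reps, {
    # Step 1: draw n values from the population, with replacement
    draw <- sample(x, size = n, replace = TRUE)
    # Step 2: compute the sample mean of this draw
    mean(draw)
  })
}

set.seed(76)  # arbitrary seed so the sketch is reproducible

# Stand-in population (an assumption, NOT the OkCupid data):
# 5000 simulated heights in inches
heights <- rnorm(5000, mean = 68, sd = 4)

# Step 3: repeat the sampling 10000 times
xbar <- sample_means(heights, n = 5)

# Step 4: plot the distribution of the 10000 sample means
hist(xbar, breaks = 50, xlab = "sample mean height (inches)", main = "n = 5")
```

With the course data loaded, the same helper could be called as `sample_means(profiles$height, n = 5)`.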

Varying `n` corresponds to sampling a different number of people. Here are the histograms of the sample means for two sample sizes:

`n=5`

`n=50`
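One way to put a number on how the histogram changes is to compare the standard deviation of the sample means for the two values of `n`. The sketch below again uses simulated stand-in heights (an assumption, not the OkCupid data) and a hypothetical helper `spread_of_means`. For draws made with replacement, the spread of the sample means should be close to the population standard deviation divided by \(\sqrt{n}\), so it is printed alongside for comparison.

```r
set.seed(76)  # arbitrary seed so the sketch is reproducible

# Stand-in population (an assumption, NOT the OkCupid data)
heights <- rnorm(5000, mean = 68, sd = 4)

# Hypothetical helper: standard deviation of `reps` sample means of size n
spread_of_means <- function(x, n, reps = 10000) {
  sd(replicate(reps, mean(sample(x, size = n, replace = TRUE))))
}

sd_n5  <- spread_of_means(heights, n = 5)
sd_n50 <- spread_of_means(heights, n = 50)

# Population standard deviation (divides by N, not N - 1)
sigma <- sqrt(mean((heights - mean(heights))^2))

# Observed spread of the sample means next to sigma / sqrt(n)
data.frame(
  n = c(5, 50),
  sd_of_means = c(sd_n5, sd_n50),
  sigma_over_sqrt_n = sigma / sqrt(c(5, 50))
)
```

A larger `n` gives a smaller spread, which shows up as a narrower histogram centered at the same place.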