Let’s revisit the OkCupid profile data. Run the following in your console:

library(mosaic)
library(dplyr)
library(ggplot2)

library(okcupiddata)
data(profiles)

# Added line: let's remove all users (i.e. rows) who did not list a height. is.na() returns
# TRUE if missing, so we want those that are NOT missing.
profiles <- profiles %>% 
  filter(!is.na(height))

Then run the following:

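# Sample size: the number of users drawn in each sample
n <- 5
# Repeat 10000 times: resample n heights with replacement and take their mean
samples <- do(10000) * 
  mean(resample(profiles$height, size=n, replace=TRUE))
# as_data_frame() is deprecated; as_tibble() is its replacement
samples <- samples %>% 
  as_tibble() 
# Histogram of the 10000 sample means (stored in the column named mean)
ggplot(samples, aes(x=mean)) +
  geom_histogram(binwidth = 1) +
  xlim(c(40,90))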

Learning Checks

  1. Discuss with your seatmates what the code above does.
  2. Try varying n. What does this correspond to doing?
  3. How does the histogram change as n varies?

LC1

This code

  • Samples n=5 OkCupid users with replacement from the population of OkCupid users
  • Computes the sample mean height \(\overline{x} = \frac{1}{n}\sum_{i=1}^{n}x_i\), where \(x_i\) is the height of the \(i\)th sampled user out of \(n\).
  • Repeats this many times, here 10000 times
  • Plots the distribution of these 10000 values of \(\overline{x}\) via a histogram
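The four steps above can also be sketched in base R, without mosaic, which may make the mechanics easier to see. This is only an illustrative sketch: sample_means is a hypothetical helper name, and the heights vector is simulated stand-in data (an assumption), not the OkCupid heights.

```r
# Base R sketch of the steps above. `sample_means` is a hypothetical helper,
# not a mosaic function.
sample_means <- function(heights, n, reps = 10000) {
  # Each replicate: draw n values with replacement, then take their mean
  replicate(reps, mean(sample(heights, size = n, replace = TRUE)))
}

set.seed(76)
# Stand-in population (assumption): simulated heights in inches
heights <- rnorm(1000, mean = 68, sd = 4)

# 10000 sample means, each based on n = 5 sampled heights
xbar <- sample_means(heights, n = 5)

# Distribution of the sample means
hist(xbar, breaks = 30, xlab = "Sample mean height", main = "n = 5")
```

To run the same sketch on the real data, pass profiles$height in place of the simulated heights vector.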

LC2

Varying n corresponds to sampling a different number of people, i.e. changing the sample size. Here are the histograms of sample means when

  • n=5
  • n=50
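One possible way to produce the two histograms side by side is sketched below. It reuses the same mosaic calls as the code earlier in this section. The object names means_5, means_50 and both, and the column name sample_size, are chosen here for illustration only.

```r
library(mosaic)
library(dplyr)
library(ggplot2)
library(okcupiddata)

data(profiles)
# Keep only users who listed a height
profiles <- profiles %>%
  filter(!is.na(height))

# 10000 sample means for each of the two sample sizes
means_5 <- do(10000) *
  mean(resample(profiles$height, size = 5, replace = TRUE))
means_50 <- do(10000) *
  mean(resample(profiles$height, size = 50, replace = TRUE))

# Label each set of sample means with its sample size and stack them
means_5 <- means_5 %>%
  as_tibble() %>%
  mutate(sample_size = "n = 5")
means_50 <- means_50 %>%
  as_tibble() %>%
  mutate(sample_size = "n = 50")
both <- bind_rows(means_5, means_50)

# One histogram panel per sample size, on the same x-axis
ggplot(both, aes(x = mean)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~sample_size, ncol = 1) +
  xlim(c(40, 90))
```

Stacking the panels vertically with a shared x-axis makes it easier to compare the spread of the two distributions.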