Let’s revisit the OkCupid profile data. Run the following in your console:

library(mosaic)
library(dplyr)
library(ggplot2)

library(okcupiddata)
data(profiles)

# Remove individuals with no listed height
profiles <- profiles %>% 
  filter(!is.na(height))

We take 10,000 samples of size 5 (sampling with replacement) and compute the sample mean of each:

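n <- 5
# Draw 10000 samples of size n and record the mean of each one
samples_5 <- do(10000) * 
  mean(resample(profiles$height, size=n, replace=TRUE))
# Convert the result to a tibble for use with dplyr
samples_5 <- samples_5 %>% 
  as_tibble()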

We then do the same with 10,000 samples of size 50:

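n <- 50
# Draw 10000 samples of size n and record the mean of each one
samples_50 <- do(10000) * 
  mean(resample(profiles$height, size=n, replace=TRUE))
# Convert the result to a tibble for use with dplyr
samples_50 <- samples_50 %>% 
  as_tibble()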

Learning Checks

  1. Explicitly compute the standard error when taking samples of size n=5 and n=50.
  2. Discuss with your peers why standard errors matter in any study that involves some kind of sampling.

LC1

In General: The standard error is the standard deviation of the point estimate. In this case, it quantifies how much the sample means vary from sample to sample.

In Our Case: If we take a sample of 5 OkCupid users and compute the (sample) mean height, are we going to get the same value each time? No. The SE measures how much this sample mean varies.

Mathematically: You can derive the standard error mathematically, but this is for a more advanced class in Probability/Statistics. See Advanced section below.

Computationally: It is the standard deviation of our 10000 sample means:

samples_5 %>% 
  summarise(SE = sd(mean))
## # A tibble: 1 × 1
##         SE
##      <dbl>
## 1 1.803266
samples_50 %>% 
  summarise(SE = sd(mean))
## # A tibble: 1 × 1
##          SE
##       <dbl>
## 1 0.5730157
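
As a rough cross-check of this computation, the sketch below repeats it in base R on a simulated population and compares each simulated SE with \(\sigma/\sqrt{n}\), the standard formula for the standard error of the mean of n independent draws. The population (normal heights with made-up mean and standard deviation), the helper name `sample_means`, and the seed are all assumptions for illustration; none of it uses the OkCupid data.

```r
# Hypothetical stand-in population: 10,000 simulated heights in inches.
# This is NOT the OkCupid data; the mean 68 and sd 4 are made-up values.
set.seed(76)
population <- rnorm(10000, mean = 68, sd = 4)

# Draw `reps` samples of size n (with replacement) and return their means.
sample_means <- function(x, n, reps = 10000) {
  replicate(reps, mean(sample(x, size = n, replace = TRUE)))
}

means_5  <- sample_means(population, n = 5)
means_50 <- sample_means(population, n = 50)

# Simulated standard errors: the standard deviation of the sample means.
se_5  <- sd(means_5)
se_50 <- sd(means_50)

# Population standard deviation, using divisor N because `population`
# is treated as the entire population rather than a sample from it.
sigma <- sqrt(mean((population - mean(population))^2))

# Compare simulation with the formula sigma / sqrt(n).
c(simulated = se_5,  theory = sigma / sqrt(5))
c(simulated = se_50, theory = sigma / sqrt(50))
```

In each pair the two numbers should be close but not identical, because the simulated value is itself based on a finite number of samples.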

Results:

  1. The SE with n=50 is smaller, i.e.
  2. The sample means \(\overline{x}\) are less variable when n=50, i.e.
  3. The sample mean \(\overline{x}\) is a more precise estimate when n=50, i.e.
  4. Our estimates are on average closer to the true mean when n=50

Visualization: Recall that the sampling distribution is the distribution of the point estimate. We see that for n=50

  • the distribution is narrower, i.e.
  • it has a smaller standard deviation, i.e.
  • the standard error is smaller
  • our estimates are on average closer to the true mean when n=50. A larger sample size gives more precise estimates.
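
One way to draw this comparison is to stack the two sampling distributions as histograms on a shared x-axis. The following is a minimal sketch that assumes ggplot2 is installed and, to stay self-contained, uses a simulated population rather than the OkCupid heights. The population parameters, the helper `sample_means`, and the data frame name `sampling_dists` are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in population (NOT the OkCupid data).
set.seed(76)
population <- rnorm(10000, mean = 68, sd = 4)

# Draw `reps` samples of size n (with replacement) and return their means.
sample_means <- function(x, n, reps = 10000) {
  replicate(reps, mean(sample(x, size = n, replace = TRUE)))
}

# One row per sample mean, labelled by the sample size that produced it.
sampling_dists <- data.frame(
  mean = c(sample_means(population, n = 5),
           sample_means(population, n = 50)),
  n    = factor(rep(c("n = 5", "n = 50"), each = 10000))
)

# Stacked histograms; facet_wrap() keeps the x-axis fixed across panels
# by default, so the difference in spread is directly comparable.
p <- ggplot(sampling_dists, aes(x = mean)) +
  geom_histogram(bins = 40) +
  facet_wrap(~ n, ncol = 1) +
  labs(x = "Sample mean height", y = "Count")
p
```

With the real data, the same plot could be built by binding `samples_5` and `samples_50` together with a column recording the sample size.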