Let’s revisit the OkCupid profile data. Run the following in your console:

library(mosaic)
library(dplyr)
library(ggplot2)

library(okcupiddata)
data(profiles)

# Remove individuals with no listed height
profiles <- profiles %>% 
  filter(!is.na(height))

We take many, many, many samples of size 5 and then take the sample mean:

n <- 5
samples_5 <- do(10000) * 
  mean(resample(profiles$height, size=n, replace=TRUE))
samples_5 <- samples_5 %>% 
  as_data_frame() 

We take many, many, many samples of size 50 and then take the sample mean:

n <- 50
samples_50 <- do(10000) * 
  mean(resample(profiles$height, size=n, replace=TRUE))
samples_50 <- samples_50 %>% 
  as_data_frame() 

Learning Checks

  1. Explicitly compute the standard error when taking samples of size n=5 and n=50.
  2. Discuss with your peers why standard errors matter in any study that involves some kind of sampling.

LC1

In General: The standard error is the standard deviation of the point estimate. In this case, it quantifies how much the sample means vary from sample to sample.

In Our Case: If we take a sample of 5 OkCupid users and compute the (sample) mean height, are we going to get the same value each time? No. The SE measures how much this sample mean varies.

Mathematically: You can derive the standard error mathematically, but this is left for a more advanced class in Probability/Statistics. See the Advanced section below.

Computationally: It is the standard deviation of our 10000 sample means:

samples_5 %>% 
  summarise(SE = sd(mean))
## # A tibble: 1 × 1
##         SE
##      <dbl>
## 1 1.803266
samples_50 %>% 
  summarise(SE = sd(mean))
## # A tibble: 1 × 1
##          SE
##       <dbl>
## 1 0.5730157

Results:

  1. The SE with n=50 is smaller, i.e.
  2. The sample mean \(\overline{x}\) is less variable when n=50
  3. The sample mean \(\overline{x}\) is more precise when n=50
  4. Our estimates are on average better when n=50

Visualization: Recall that the sampling distribution is the distribution of the point estimate; one way to plot the two simulated sampling distributions is sketched after the list below. We see that for n=50

  • the distribution is narrower, i.e.
  • it has a smaller standard deviation, i.e.
  • the standard error is smaller
  • our estimates are on average better when n=50. Bigger sample size is better.
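
For example, here is one way to draw this comparison (a sketch, not part of the original output: the column of simulated sample means produced by do() is named mean in both data frames, while sampling_dists, sample_size, and the bin width are choices made for illustration):

# Stack the two sets of simulated sample means, labeled by sample size
sampling_dists <- bind_rows(
  samples_5 %>% mutate(sample_size = "n = 5"),
  samples_50 %>% mutate(sample_size = "n = 50")
)

# Overlay the two simulated sampling distributions of the sample mean
ggplot(sampling_dists, aes(x = mean, fill = sample_size)) +
  geom_histogram(binwidth = 0.25, alpha = 0.5, position = "identity") +
  labs(x = "Sample mean height (inches)", y = "Count")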

LC2

From sample to sample, your point estimate (in this case the sample mean) will vary, i.e. there is uncertainty. How much uncertainty? The SE quantifies this!

Advanced

If you’re curious, the mathematically/probabilistically derived formula for the Standard Error is \(\mbox{SE}_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\) where

  • \(\sigma\) is the population standard deviation: the true standard deviation of all ~60K listed heights (and not the standard deviation of the 10000 many, many, many sample means)
  • \(n\) is the sample size (and not the number of many, many, many resamples, which was 10000)

Note how \(n\) is in the denominator. So

  • As \(n \longrightarrow \infty\), we have \(\mbox{SE}_{\overline{x}} \longrightarrow 0\) i.e.
  • As \(n \longrightarrow \infty\), there is no uncertainty in our estimate \(\overline{x}\) of \(\mu\), we know the answer exactly!
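
To see this shrinking numerically, here is a quick sketch using the formula above, with \(\sigma\) estimated by sd(profiles$height) (the names sigma and n_sizes are placeholders chosen for illustration):

# SE = sigma / sqrt(n) shrinks toward 0 as the sample size n grows
sigma <- sd(profiles$height)
n_sizes <- c(5, 50, 500, 5000, 50000)
data.frame(n = n_sizes, SE = sigma / sqrt(n_sizes))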

In our case, since \(\sigma\) = sd(profiles$height) = 3.995, we have for

  • n=5: SE = \(\frac{3.995}{\sqrt{5}} = 1.787\)
  • n=50: SE = \(\frac{3.995}{\sqrt{50}} = 0.565\)
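
A sketch of these two calculations in R (reusing sd(profiles$height) as a stand-in for \(\sigma\); sigma is a placeholder name):

# Formula-based standard errors for the two sample sizes used above
sigma <- sd(profiles$height)
sigma / sqrt(5)    # roughly 1.787
sigma / sqrt(50)   # roughly 0.565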

Note these are pretty close to the values computed via simulation above. Why is \(\mbox{SE}_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\)? You need the tools from Probability (MATH 310) to show this.