Let’s revisit the OkCupid profile data. Run the following in your console:
library(mosaic)
library(dplyr)
library(ggplot2)
library(okcupiddata)
data(profiles)
# Remove individuals with no listed height
profiles <- profiles %>%
filter(!is.na(height))
We take 10,000 samples of size 5 and compute the sample mean of each:
n <- 5
samples_5 <- do(10000) *
  mean(resample(profiles$height, size=n, replace=TRUE))
# as_tibble() replaces the deprecated as_data_frame()
samples_5 <- samples_5 %>%
  as_tibble()
We take 10,000 samples of size 50 and compute the sample mean of each:
n <- 50
samples_50 <- do(10000) *
  mean(resample(profiles$height, size=n, replace=TRUE))
# as_tibble() replaces the deprecated as_data_frame()
samples_50 <- samples_50 %>%
  as_tibble()
We now compare the two cases, n=5 and n=50.
In General: The standard error (SE) is the standard deviation of the point estimate. In this case, it quantifies how much the sample means vary from sample to sample.
In Our Case: If we take a sample of 5 OkCupid users and compute the (sample) mean height, are we going to get the same value each time? No. The SE measures how much this sample mean varies.
Mathematically: You can derive the standard error mathematically, but this is for a more advanced class in Probability/Statistics. See Advanced section below.
Computationally: It is the standard deviation of our 10000 sample means:
samples_5 %>%
summarise(SE = sd(mean))
## # A tibble: 1 × 1
## SE
## <dbl>
## 1 1.803266
samples_50 %>%
summarise(SE = sd(mean))
## # A tibble: 1 × 1
## SE
## <dbl>
## 1 0.5730157
Results: The SE for n=50 is smaller, i.e. the sample means based on n=50 vary less from sample to sample than those based on n=5.
Visualization: Recall, the sampling distribution is the distribution of the point estimate. We see that for n=50 the sampling distribution is narrower than for n=5. A bigger sample size is better.
From sample to sample, your point estimate, in this case the sample mean, will vary! i.e. there is uncertainty. How much uncertainty? The SE quantifies this!
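The comparison of the two sampling distributions can be sketched roughly as follows. This is a minimal illustration, not the code behind the original figure: it uses base R only, and `heights` is a simulated stand-in for `profiles$height` (an assumption, so the numbers will not match the OkCupid results exactly). The helper `sample_means()` is a hypothetical name introduced here.

```r
set.seed(76)

# ASSUMPTION: simulated stand-in for profiles$height (inches), not the real data
heights <- rnorm(50000, mean = 68, sd = 4)

# Hypothetical helper: draw `reps` samples of size n (with replacement)
# and return the mean of each sample
sample_means <- function(x, n, reps = 10000) {
  replicate(reps, mean(sample(x, size = n, replace = TRUE)))
}

means_5  <- sample_means(heights, n = 5)
means_50 <- sample_means(heights, n = 50)

# Use a shared x-axis so the widths of the two histograms are comparable
x_range <- range(means_5, means_50)
par(mfrow = c(2, 1))
hist(means_5,  breaks = 40, xlim = x_range,
     main = "n = 5",  xlab = "Sample mean height")
hist(means_50, breaks = 40, xlim = x_range,
     main = "n = 50", xlab = "Sample mean height")

# The spread of each histogram is the standard error
c(SE_5 = sd(means_5), SE_50 = sd(means_50))
```

Both histograms should be centered near the same value, with the n = 50 histogram visibly narrower.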
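If you’re curious, the mathematically/probabilistically derived formula for the Standard Error is \(\mbox{SE}_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\) where \(\sigma\) is the population standard deviation and \(n\) is the sample size.
Note how \(n\) is in the denominator. So as \(n\) increases, the SE decreases.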
In our case, since \(\sigma\) = sd(profiles$height) = 3.995, we have for
n=5: SE = \(\frac{3.995}{\sqrt{5}} = 1.787\)
n=50: SE = \(\frac{3.995}{\sqrt{50}} = 0.565\)
Note these are pretty close to the simulated values above (1.803 and 0.573). Why is \(\mbox{SE}_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\)? You need the tools from Probability (MATH 310) to show this.
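To see the formula and the simulation agree for more than two sample sizes, here is a small sketch in base R. As an assumption, `heights` is again a simulated stand-in for `profiles$height`, and `simulated_se()` is a hypothetical helper written for this illustration. It is a numerical check, not a proof.

```r
set.seed(310)

# ASSUMPTION: simulated stand-in for profiles$height, not the real data
heights <- rnorm(50000, mean = 68, sd = 4)
sigma <- sd(heights)

# Hypothetical helper: estimate the SE as the standard deviation
# of many resampled means of size n
simulated_se <- function(x, n, reps = 5000) {
  sd(replicate(reps, mean(sample(x, size = n, replace = TRUE))))
}

sizes <- c(5, 50, 500)
comparison <- data.frame(
  n            = sizes,
  formula_SE   = sigma / sqrt(sizes),
  simulated_SE = sapply(sizes, function(n) simulated_se(heights, n))
)
comparison
```

The two SE columns should be close in every row, and both should shrink by a factor of about \(\sqrt{10}\) each time \(n\) grows tenfold.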