Let’s revisit the OkCupid profile data. Run the following in your console:

```
library(mosaic)
library(dplyr)
library(ggplot2)
library(okcupiddata)
data(profiles)
# Remove individuals with no listed height
profiles <- profiles %>%
  filter(!is.na(height))
```

We take many, many, many samples of size 5 and compute the sample mean of each:

```
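n <- 5
# Draw 10000 samples of size n (with replacement) and record each sample's mean
samples_5 <- do(10000) *
  mean(resample(profiles$height, size=n, replace=TRUE))
# as_tibble() replaces the deprecated as_data_frame()
samples_5 <- samples_5 %>%
  as_tibble()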
```

We take many, many, many samples of size 50 and compute the sample mean of each:

```
n <- 50
# Same procedure, now with samples of size 50
samples_50 <- do(10000) *
  mean(resample(profiles$height, size=n, replace=TRUE))
# as_tibble() replaces the deprecated as_data_frame()
samples_50 <- samples_50 %>%
  as_tibble()
```
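For readers who want to see what `do()` and `resample()` are doing here, the same resampling can be sketched in base R. This is a minimal sketch, not the mosaic implementation: `simulate_means()` is a hypothetical helper, and `heights` is simulated stand-in data (not the OkCupid heights) so that the block runs on its own.

```r
set.seed(76)

# Stand-in population (an assumption for illustration, NOT the OkCupid data):
# 10000 simulated heights in inches.
heights <- rnorm(10000, mean = 68, sd = 4)

# Hypothetical helper: draw `reps` samples of size `n` with replacement
# and return the mean of each sample.
simulate_means <- function(x, n, reps = 10000) {
  replicate(reps, mean(sample(x, size = n, replace = TRUE)))
}

means_5  <- simulate_means(heights, n = 5)
means_50 <- simulate_means(heights, n = 50)

# One sample mean per replicate
length(means_5)
length(means_50)
```

With the real data, `profiles$height` would take the place of `heights`.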

- Explicitly compute the **standard error** when taking samples of size `n=5` and `n=50`.
- Discuss with your peers **why** standard errors matter in any study that involves some kind of sampling.

**In General**: The standard error is the standard deviation of the point estimate. In this case, it quantifies how much the sample means vary from sample to sample.

**In Our Case**: If we take a sample of 5 OkCupid users and compute the (sample) mean height, are we going to get the same value each time? No. The SE measures how much this sample mean varies.

**Mathematically**: You can derive the standard error mathematically, but this is for a more advanced class in Probability/Statistics. See Advanced section below.

**Computationally**: It is the standard deviation of our 10000 sample means:

```
samples_5 %>%
  summarise(SE = sd(mean))
```

```
## # A tibble: 1 × 1
##         SE
##      <dbl>
## 1 1.803266
```

```
samples_50 %>%
  summarise(SE = sd(mean))
```

```
## # A tibble: 1 × 1
##          SE
##       <dbl>
## 1 0.5730157
```
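As a quick sanity check on these two numbers: a standard result, assumed here rather than derived, is that the standard error of a sample mean shrinks in proportion to \(1/\sqrt{n}\). If that holds, going from `n=5` to `n=50` should shrink the SE by a factor of about \(\sqrt{50/5}\). The sketch below compares that factor with the ratio of the two SE values reported above; the variable names are only for illustration.

```r
# SE values copied from the output above
se_5  <- 1.803266
se_50 <- 0.5730157

# Ratio actually observed in the simulation
observed_ratio <- se_5 / se_50

# Ratio expected if the SE scales like 1 / sqrt(n) (assumed, not derived here)
expected_ratio <- sqrt(50 / 5)

round(c(observed = observed_ratio, expected = expected_ratio), 2)
```

The two ratios should be close but not identical, since the observed SEs come from a finite number of simulated samples.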

**Results**:

- The SE with `n=50` is smaller, i.e.
- The sample means \(\overline{x}\) are less variable when `n=50`, i.e.
- The sample mean \(\overline{x}\) is more precise when `n=50`

**Our estimates are on average better when** `n=50`.

**Visualization**: Recall, the sampling distribution is the distribution of the point estimate. We see that for `n=50`

- the distribution is narrower, i.e.
- it has a smaller standard deviation, i.e.
- the standard error is smaller

**Our estimates are on average better when** `n=50`. A bigger sample size is better.
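One way to draw the two sampling distributions for comparison is sketched below with ggplot2. It is a hedged illustration rather than the original figure: `heights` is simulated stand-in data (not the OkCupid heights), `sample_means()` is a hypothetical helper, and the bin width is an arbitrary choice.

```r
library(ggplot2)

set.seed(76)

# Stand-in population (an assumption for illustration, NOT the OkCupid data)
heights <- rnorm(10000, mean = 68, sd = 4)

# Hypothetical helper: 10000 sample means for a given sample size n
sample_means <- function(n) {
  replicate(10000, mean(sample(heights, size = n, replace = TRUE)))
}

# Stack both sets of sample means into one data frame, labelled by sample size
both <- rbind(
  data.frame(mean = sample_means(5),  n = "n = 5"),
  data.frame(mean = sample_means(50), n = "n = 50")
)

# Stacked histograms share an x-axis, so the difference in spread is visible
sampling_plot <- ggplot(both, aes(x = mean)) +
  geom_histogram(binwidth = 0.25) +
  facet_wrap(~n, ncol = 1) +
  labs(x = "Sample mean height (inches)", y = "Number of samples")
sampling_plot
```

With the real data, the `mean` columns of `samples_5` and `samples_50` would replace the simulated values.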