Thu Dec 01, 2016

Recall: Point of Statistics

Taking a sample in order to infer about the population:

Drawing

Confidence Intervals

  • We want to know the population mean \(\mu\), so we take a sample mean \(\overline{x}\).
  • But instead of using a single value \(\overline{x}\), why not a range of plausible values?

Confidence Intervals

Imagine the \(\mu\) is a fish:

Sample Mean \(\overline{x}\) Confidence Interval
Drawing Drawing

Back to LC from Lec29

Let's focus on the case of taking many, many, many samples of size n=5 and n=50 from the population of OkCupid users:

n <- 5
samples_5 <- do(10000) * 
  mean(resample(profiles$height, size=n, replace=TRUE))
samples_5 <- samples_5 %>% 
  as_data_frame()

n <- 50
samples_50 <- do(10000) * 
  mean(resample(profiles$height, size=n, replace=TRUE))
samples_50 <- samples_50 %>% 
  as_data_frame()

Standard Errors

We computed

Sample Size Standard Error
5 1.768
50 0.561

Constructing Confidence Intervals

Let's construct 95% confidence intervals via two methods:

  1. Looking at the middle 95% of data using quantile()
  2. If the data is bell-shaped, using the rule of thumb

Method 1: Quantile Function

Recall from Lec28 Chalk Talk: quantiles. Run the following in your console:

x <- c(0:12)
x
quantile(x, prob=c(0.25, 0.5, 0.75))

This is saying

  • ~25% of x values are less than 3
  • ~50% of x values are less than 6
  • ~75% of x values are less than 9

Method 2: Bell-Shaped Data

Behold: the normal distribution i.e. the bell curve.

Method 2: Data Bell Shaped

Properties of the Normal Distribution:

  • Unimodel, symmetric
  • Entirely defined by its
    • mean \(\mu\): i.e. center
    • standard deviation \(\sigma\): i.e. spread
  • Key: 95% of values lie within \(2\sigma\) of \(\mu\)
    • i.e. the interval \([\mu - 2\sigma, \mu + 2\sigma]\) contains ~95% of values

Method 2: Data Bell Shaped

Below we have a Normal Distribution with \(\mu=5\) and \(\sigma=2\)…

Method 2: Data Bell Shaped

… Interval \([\mu -2\sigma, \mu +2\sigma] = [5 -2\times 2, 5 +2\times 2] = [1, 9]\) contains 95% of values (in purple).

Learning Check

Construct 95% confidence intervals for \(\mu\), the true average height of all 60K OkCupid users

  • Using both methods for constructing confidence intervals: quantiles and bell-curve aproach
  • For both when n=5 and n=50

Hint: Look at the histograms of the 10000 simulations.