---
title: "Problem Set 10"
author: "WRITE YOUR NAME HERE"
date: "2018-04-10"
output:
  html_document:
    highlight: tango
    theme: cosmo
    toc: yes
    toc_depth: 2
    toc_float:
      collapsed: false
    df_print: kable
---
```{r, include=FALSE}
# Do not edit this code block/chunk
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning = FALSE, fig.width = 16/2, fig.height = 9/2)
```
# Collaboration {-}
Please indicate who you collaborated with on this problem set:
# Background
```{r}
library(ggplot2)
library(dplyr)
library(moderndive)
# Set random number generator seed value to get replicable random sampling:
set.seed(76)
```
**Hints**:
1. Knit the problem set once before starting and read over the HTML file first.
1. I *highly* recommend `View()`ing all data frames in your console as you go, but remember that you cannot include `View()` code in your `.Rmd` file.
In this problem set, instead of considering a "bowl" with a population of $N=2400$ red and white balls where we're interested in the population proportion red $p$, we'll be considering a "sack" with a population of $N=800$ pennies where we're interested in the population mean year of minting $\mu$. In the `moderndive` package there is a data frame `pennies` that is a virtual representation of the sack of $N=800$ pennies, much in the same way `bowl` was a virtual representation of the bowl:
```{r}
head(pennies)
```
Below we
1. Compute the population mean year of minting $\mu$, which we see is about 1989.84
1. Plot the *population distribution* of the year of minting of the 800 pennies via a histogram and mark $\mu$. Note this distribution is not normally shaped.
```{r}
# Population mean
mu <- mean(pennies$year)
mu
# Also population standard deviation
sigma <- sd(pennies$year)
sigma
# Plot
ggplot(pennies, aes(x = year)) +
  geom_histogram(binwidth = 5) +
  labs(x = "year", title = "Fig 1: Population distribution of year of 800 pennies") +
  geom_vline(xintercept = mu, col = "red", size = 1)
```
In this problem set
1. Instead of sampling from the bowl using the shovel with $n$ slots and then using the resulting sample proportion red $\widehat{p}$ to estimate the true population proportion red $p$
1. We'll be sampling from the sack using "handfuls" of $n$ pennies and then using the resulting sample mean year $\overline{x}$ to estimate the true population mean year $\mu$
1. Unlike with the bowl, however, we'll skip over any tactile sampling (having an actual sack of pennies) and jump straight into virtual sampling.
You might be asking yourself, if we already know that the true population mean year $\mu = 1989.84$, then why are we sampling to estimate it? Because this is only a theoretical *simulation* scenario to understand how the sample mean $\overline{x}$ behaves across multiple samples of size $n$. This gives us a sense of its typical behavior for a given sample size $n$: what are frequently occurring values of $\overline{x}$, what are rare values of $\overline{x}$, etc. In any real-life non-census scenario, however, we wouldn't know the true $\mu$, and hence would need to resort to sampling to estimate it.
Recall that any simulation consists of a certain number of "trials" and there are 3 elements to specify:
1. What's being repeated in each trial of our simulation?
1. What's being measured/computed at each trial of our simulation?
1. How do we summarize the trials?
# Question 1: 1000 virtual handfuls of 50 pennies
As we did in ModernDive 8.3.3 for $\widehat{p}$, let's conduct a simulation to understand how $\overline{x}$ behaves:
1. Virtually extract samples of size `n=50` pennies; repeat this 1000 times.
1. Compute 1000 values of the sample mean $\overline{x}$ based on these 1000 virtual samples.
1. Summarize the 1000 values of the sample mean $\overline{x}$ with:
    1. **Sampling distribution**: The histogram of these 1000 sample means $\overline{x}$
    1. **Standard error**: The standard deviation of these 1000 sample means $\overline{x}$. In other words, how *precise* are they? What is the typical error of our estimate?
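
In symbols, writing $\overline{x}_1, \ldots, \overline{x}_{1000}$ for the 1000 simulated sample means (notation introduced here only for illustration), the standard error is estimated by their standard deviation, which is what `sd()` computes:

$$
\text{SE} \approx \sqrt{\frac{1}{1000 - 1} \sum_{i=1}^{1000} \left(\overline{x}_i - \overline{\overline{x}}\right)^2},
\qquad
\overline{\overline{x}} = \frac{1}{1000} \sum_{i=1}^{1000} \overline{x}_i
$$
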
The copied/pasted code below performs the simulation for the `bowl`. Tweak all the code (including variable names and labels) to now virtually sample handfuls from `pennies`:
```{r}
# Take 1000 samples of size n = 50 pennies
virtual_samples <- pennies %>%
  rep_sample_n(size = 50, reps = 1000)
# Compute the 1000 resulting point estimates (the sample means), along with
# each sample's standard deviation.
virtual_sample_mean <- virtual_samples %>%
  group_by(replicate) %>%
  summarize(sample_mean = mean(year), s = sd(year))
# Look at first 6 of 1000 values
head(virtual_sample_mean)
# Plot sampling distribution:
ggplot(virtual_sample_mean, aes(x = sample_mean)) +
  geom_histogram(binwidth = 1, color = "white") +
  labs(x = "Sample mean based on n = 50", title = "Fig 2: Histogram of 1000 sample means based on 1000 virtual samples of size n=50")
# Compute standard error:
virtual_sample_mean %>%
  summarize(SE = sd(sample_mean))
```
Questions:
1. Why is the resulting sampling distribution centered at the true population mean $\mu$?
1. Approximately in what range of values do 95% of resulting sample means $\overline{x}$ lie?
1. Your friend claims to have sampled 50 pennies from this sack and obtained a sample mean $\overline{x} = 1979$. Do you believe them?
1. Your friend claims to have sampled 50 pennies from this sack and obtained a sample mean $\overline{x} = 1986$. Do you believe them?
Your answers:
**Question 1**: Because the sampling was random! Suppose that, instead of this sack of $N=800$ pennies, our population of interest were *all* pennies in circulation in the US. Now say:
* We went to the US Mint and collected a sample of pennies coming off the press. This would clearly be a non-random biased sample. In fact, since the pennies would be newer, the center of the above 1000 sample means $\overline{x}$ would be shifted to the right.
* Say we went to a rural area that is far from any branch of the Federal Reserve bank. Any sample of pennies from there would also be biased, but this time towards the left, so the above sampling distribution would shift to the left i.e. older.
**Question 2**: We could eyeball it from the histogram: the range is approximately 1986 through 1993. Or we could compute it from the mean and standard error of the 1000 sample means:
```{r}
# Compute the mean and standard error of the 1000 sample means:
virtual_sample_mean %>%
  summarize(mean = mean(sample_mean), SE = sd(sample_mean))
```
Since the sampling distribution is approximately Normal in shape, we can use the Normal model to get the middle 95%:
$$
[1990 - 1.96 \times 1.73, 1990 + 1.96 \times 1.73] = [1986.609, 1993.391]
$$
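
As a quick check of the arithmetic, the interval can be recomputed in R. This is a minimal sketch that assumes the rounded center 1990 and standard error 1.73 quoted above; `center_50`, `se_50`, and `interval_50` are illustrative names, not part of the problem set.

```{r}
# Rounded center and standard error of the 1000 sample means (n = 50), as quoted above
center_50 <- 1990
se_50 <- 1.73
# Middle 95% under the Normal model: center +/- z * SE, where z = qnorm(0.975) (about 1.96)
interval_50 <- center_50 + c(-1, 1) * qnorm(0.975) * se_50
round(interval_50, 3)
```
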
**Question 3**: (Subjective) Certainly not!
**Question 4**: (Subjective) 1986 doesn't occur often, but it still would not be surprising to observe $\overline{x} = 1986$.
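
One way to make the answers to Questions 3 and 4 a little less subjective is to ask how many standard errors each claimed value lies below the center. The following is a rough sketch, assuming the rounded center 1990, the standard error 1.73, and the Normal model; `claimed_means` and `z_50` are illustrative names.

```{r}
# The two sample means your friend claims to have observed
claimed_means <- c(1979, 1986)
# Number of standard errors each claim lies from the center (negative = below)
z_50 <- (claimed_means - 1990) / 1.73
z_50
# Probability, under the Normal model, of a sample mean at least this low
pnorm(z_50)
```

Under these assumptions, 1979 sits more than six standard errors below the center, while 1986 sits a bit more than two below it: uncommon, but nowhere near as extreme.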
# Question 2: 1000 virtual handfuls of 100 pennies
Repeat the above but where you are sampling $n=100$ pennies instead of $n=50$.
```{r}
# Take 1000 samples of size n = 100 pennies
virtual_samples <- pennies %>%
  rep_sample_n(size = 100, reps = 1000)
# Compute the 1000 resulting point estimates: the sample mean
virtual_sample_mean <- virtual_samples %>%
  group_by(replicate) %>%
  summarize(sample_mean = mean(year))
# Plot sampling distribution:
ggplot(virtual_sample_mean, aes(x = sample_mean)) +
  geom_histogram(binwidth = 1, color = "white") +
  labs(x = "Sample mean based on n = 100", title = "Fig 3: Histogram of 1000 sample means based on 1000 virtual samples of size n=100")
# Compute standard error:
virtual_sample_mean %>%
  summarize(SE = sd(sample_mean))
```
1. Approximately in what range of values do 95% of resulting sample means $\overline{x}$ lie?
1. Why is this range smaller than the previous range above? Be more specific than "because the sample size is larger."
1. Your friend claims to have sampled 100 pennies from this sack and obtained a sample mean $\overline{x} = 1979$. Do you believe them?
1. Your friend claims to have sampled 100 pennies from this sack and obtained a sample mean $\overline{x} = 1986$. Do you believe them?
Your answers:
**Question 1**: Eyeballing it: [1988, 1992]. Exactly:
```{r}
# Compute the mean and standard error of the 1000 sample means:
virtual_sample_mean %>%
  summarize(mean = mean(sample_mean), SE = sd(sample_mean))
```
Since the sampling distribution is approximately Normal in shape, we can use the Normal model to get the middle 95%:
$$
[1990 - 1.96 \times 1.14, 1990 + 1.96 \times 1.14] = [1987.766, 1992.234]
$$
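**Question 2**: Because the sample size increased from 50 to 100, the standard error decreased from 1.73 years to 1.14 years. Each sample mean now averages over more pennies, so the sample means vary less from sample to sample and the middle 95% range narrows.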
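
As a rough sanity check on those two numbers, recall the usual approximation that the standard error of a sample mean is about $\sigma / \sqrt{n}$. This ignores the small finite-population correction that applies here, because each handful is drawn without replacement from only $N=800$ pennies. Under that approximation, doubling $n$ should shrink the standard error by a factor of about $1/\sqrt{2}$:

$$
\text{SE}(\overline{x}) \approx \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
\frac{\text{SE}_{n=100}}{\text{SE}_{n=50}} \approx \sqrt{\frac{50}{100}} \approx 0.71,
\qquad
\text{observed: } \frac{1.14}{1.73} \approx 0.66
$$

The observed ratio is in the same ballpark; the remaining gap is plausibly a mix of the ignored correction and simulation noise.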
**Question 3**: (Subjective) Same as before.
**Question 4**: (Subjective) I believe them less than when $n=50$. Observing 1986 is now rarer, as the above 1000 $\overline{x}$ are more tightly wrapped around the true population mean $\mu$.
# The Moral: Population vs sampling distributions
Even if the *population distribution* is not normally shaped (like the individual pennies' years), for a large enough sample size $n$ the sampling distribution of the sample mean $\overline{x}$ is approximately normal, by the Central Limit Theorem!
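
The moral can be checked empirically with a sketch like the one below. It repeats the virtual sampling for a few handful sizes and, for each, compares the simulated standard error against the textbook value for sampling without replacement, and reports the share of sample means within 1.96 standard errors of their center (about 0.95 if the sampling distribution is roughly Normal). `summarize_handfuls` and `clt_check` are hypothetical names introduced for this illustration, and the handful sizes 25, 50, and 100 are an arbitrary choice.

```{r}
library(dplyr)
library(moderndive)

# Hypothetical helper (not a moderndive function): draw `reps` handfuls of
# `handful_size` pennies and summarize the resulting sample means.
summarize_handfuls <- function(handful_size, reps = 1000) {
  sample_means <- pennies %>%
    rep_sample_n(size = handful_size, reps = reps) %>%
    group_by(replicate) %>%
    summarize(sample_mean = mean(year)) %>%
    pull(sample_mean)
  data.frame(
    n = handful_size,
    # Standard deviation of the simulated sample means
    simulated_SE = sd(sample_means),
    # Textbook SE for sampling without replacement from a finite population
    theoretical_SE = sd(pennies$year) / sqrt(handful_size) *
      sqrt(1 - handful_size / nrow(pennies)),
    # Share of sample means within 1.96 SEs of their center
    share_within_1.96_SE = mean(
      abs(sample_means - mean(sample_means)) <= 1.96 * sd(sample_means)
    )
  )
}

# One row per handful size
clt_check <- bind_rows(lapply(c(25, 50, 100), summarize_handfuls))
clt_check
```

Because this chunk draws new random samples, its exact numbers will differ slightly from those reported earlier in the problem set.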