So far we’ve seen simple linear regression
- Simple means only one predictor/independent variable \(x\)
- Outcome/dependent variable \(y\)
- \(x\) can be either numerical or categorical
In the Lec 36 Learning Check we saw the relationship between \(x =\) departure delay & \(y =\) arrival delay for Alaska Airlines flights.

- Since we only have Alaska flights, the variable `carrier` doesn't vary.
- But now let's also consider Frontier Airlines (`carrier == "F9"`).
So we have:
- \(y =\) arrival delay
- \(x_1 =\) departure delay (numerical variable)
- \(x_2 =\) carrier (categorical variable with \(k=2\) levels. In other words, carrier now varies.)
Is there a difference in delays between Alaska and Frontier?
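As a hedged sketch, such a model can be fit in R with `lm()`. The data frame name `alaska_frontier` is hypothetical, and the `nycflights13` package is assumed:

library(dplyr)
library(nycflights13)

# Keep only Alaska (AS) and Frontier (F9) flights
alaska_frontier <- flights %>%
  filter(carrier %in% c("AS", "F9"))

# One numerical predictor (dep_delay) and one categorical predictor (carrier)
delay_model <- lm(arr_delay ~ dep_delay + carrier, data = alaska_frontier)
summary(delay_model)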
- Continuing Regression Outputs: Lec36 Learning Check
- Categorical Predictors
What does “best fitting line” mean?
Consider ANY point (in blue).
Now consider this point’s deviation from the regression line.
Do this for another point… and another…
Regression line minimizes the sum of squared arrow lengths.
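As a quick sketch of this idea in R (assuming an `alaska_flights` data frame with `dep_delay` and `arr_delay` columns):

# Fit the simple linear regression
model <- lm(arr_delay ~ dep_delay, data = alaska_flights)

# The "arrow lengths" are the residuals y - y_hat; the fitted line
# minimizes their sum of squares:
sum(residuals(model)^2)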
- Residuals
- Review of Lec36 Learning Check outputs
- Regression viewed through the lens of sampling
Each of you took a random sample of size `n = 100` and computed a sample proportion. Here are your 12 resulting \(\widehat{p}\)'s…
| | p_hat |
|---|---|
| aghall | 0.360 |
| ccrobinson | 0.402 |
| chimstead | 0.380 |
| cwhitedzuro | 0.440 |
| dmortime | 0.430 |
| efeldman | 0.370 |
| jobrien | 0.400 |
| jvolz | 0.420 |
| lschroer | 0.402 |
| rlightman | 0.400 |
| rstoreyfisher | 0.390 |
| zmillslagle | 0.402 |
Let me add 8 of my own so we have 20…
| | p_hat |
|---|---|
| aghall | 0.360 |
| ccrobinson | 0.402 |
| chimstead | 0.380 |
| cwhitedzuro | 0.440 |
| dmortime | 0.430 |
| efeldman | 0.370 |
| jobrien | 0.400 |
| jvolz | 0.420 |
| lschroer | 0.402 |
| rlightman | 0.400 |
| rstoreyfisher | 0.390 |
| zmillslagle | 0.402 |
| aykim | 0.420 |
| aykim | 0.360 |
| aykim | 0.300 |
| aykim | 0.360 |
| aykim | 0.360 |
| aykim | 0.400 |
| aykim | 0.340 |
| aykim | 0.400 |
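As an aside, each of these \(\widehat{p}\)'s behaves like the outcome of a random sample. A hypothetical one-line simulation in base R:

# Simulate one sample of size n = 100 from a population with true p = 0.4023,
# then convert the count of successes to a proportion:
rbinom(1, size = 100, prob = 0.4023) / 100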
Let’s compute \(\mbox{SE} = \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\)…
p_hat <- p_hat %>%
mutate(
n = 100,
SE = sqrt(p_hat*(1-p_hat)/n)
)
| | p_hat | n | SE |
|---|---|---|---|
| aghall | 0.360 | 100 | 0.048 |
| ccrobinson | 0.402 | 100 | 0.049 |
| chimstead | 0.380 | 100 | 0.049 |
| cwhitedzuro | 0.440 | 100 | 0.050 |
| dmortime | 0.430 | 100 | 0.050 |
| efeldman | 0.370 | 100 | 0.048 |
| jobrien | 0.400 | 100 | 0.049 |
| jvolz | 0.420 | 100 | 0.049 |
| lschroer | 0.402 | 100 | 0.049 |
| rlightman | 0.400 | 100 | 0.049 |
| rstoreyfisher | 0.390 | 100 | 0.049 |
| zmillslagle | 0.402 | 100 | 0.049 |
| aykim | 0.420 | 100 | 0.049 |
| aykim | 0.360 | 100 | 0.048 |
| aykim | 0.300 | 100 | 0.046 |
| aykim | 0.360 | 100 | 0.048 |
| aykim | 0.360 | 100 | 0.048 |
| aykim | 0.400 | 100 | 0.049 |
| aykim | 0.340 | 100 | 0.047 |
| aykim | 0.400 | 100 | 0.049 |
Finally, here are the left and right endpoints of the 95% confidence intervals. Whose CIs captured the true \(p=0.4023\)?
p_hat <- p_hat %>%
mutate(
left = p_hat - 1.96*SE,
right = p_hat + 1.96*SE
)
| | p_hat | n | SE | left | right |
|---|---|---|---|---|---|
| aghall | 0.360 | 100 | 0.048 | 0.266 | 0.454 |
| ccrobinson | 0.402 | 100 | 0.049 | 0.306 | 0.498 |
| chimstead | 0.380 | 100 | 0.049 | 0.285 | 0.475 |
| cwhitedzuro | 0.440 | 100 | 0.050 | 0.343 | 0.537 |
| dmortime | 0.430 | 100 | 0.050 | 0.333 | 0.527 |
| efeldman | 0.370 | 100 | 0.048 | 0.275 | 0.465 |
| jobrien | 0.400 | 100 | 0.049 | 0.304 | 0.496 |
| jvolz | 0.420 | 100 | 0.049 | 0.323 | 0.517 |
| lschroer | 0.402 | 100 | 0.049 | 0.306 | 0.498 |
| rlightman | 0.400 | 100 | 0.049 | 0.304 | 0.496 |
| rstoreyfisher | 0.390 | 100 | 0.049 | 0.294 | 0.486 |
| zmillslagle | 0.402 | 100 | 0.049 | 0.306 | 0.498 |
| aykim | 0.420 | 100 | 0.049 | 0.323 | 0.517 |
| aykim | 0.360 | 100 | 0.048 | 0.266 | 0.454 |
| aykim | 0.300 | 100 | 0.046 | 0.210 | 0.390 |
| aykim | 0.360 | 100 | 0.048 | 0.266 | 0.454 |
| aykim | 0.360 | 100 | 0.048 | 0.266 | 0.454 |
| aykim | 0.400 | 100 | 0.049 | 0.304 | 0.496 |
| aykim | 0.340 | 100 | 0.047 | 0.247 | 0.433 |
| aykim | 0.400 | 100 | 0.049 | 0.304 | 0.496 |
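One way to check capture programmatically, continuing with the `p_hat` data frame built above (the `captured` column name is ours):

p_hat <- p_hat %>%
  mutate(captured = left <= 0.4023 & 0.4023 <= right)

# Proportion of the 20 intervals that captured the true p:
mean(p_hat$captured)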
Plot of the 20 confidence intervals:

- Dots are \(\widehat{p}\)
- Dashed line is the true \(p=0.4023\)
- Final topic for this course!
- Correlation Coefficient
Recall the `nycflights` data set. For Alaska Air flights, let's explore the relationship between \(x =\) departure delay and \(y =\) arrival delay.
The correlation coefficient is computed as follows:
cor(alaska_flights$dep_delay, alaska_flights$arr_delay)
## [1] 0.8373792
A correlation coefficient of 0.837 indicates a fairly strong positive association!
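For reference, a sketch of how `alaska_flights` could have been constructed, assuming the `nycflights13` package (missing delays are dropped so that `cor()` returns a number):

library(dplyr)
library(nycflights13)

# Alaska Airlines flights (carrier code "AS") with non-missing delays
alaska_flights <- flights %>%
  filter(carrier == "AS", !is.na(dep_delay), !is.na(arr_delay))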
Chalk talk
For large \(n\), the sampling distributions of these point estimates are bell-shaped, thus a 95% C.I. is \(\mbox{PE} \pm 1.96\times \mbox{SE}\).
| Population Parameter | Sample Statistic |
|---|---|
| Mean \(\mu\) | Sample Mean \(\overline{x}\) |
| Proportion \(p\) | Sample Proportion \(\widehat{p}\) |
| Diff of Means \(\mu_1 - \mu_2\) | \(\overline{x}_1 - \overline{x}_2\) |
| Diff of Proportions \(p_1 - p_2\) | \(\widehat{p}_1 - \widehat{p}_2\) |
NPR report on Obama from 2013. Chalk talk…
We are estimating a population parameter using a point estimate based on a sample. Example: Mean (Chalk Talk)
Imagine \(\mu\) is a fish:

(Two images: catching it with a Point Estimate \(\overline{x}\) vs. with a Confidence Interval.)
- Lec33 Learning Check Discussion
- Chalk Talk.
Age example:
- I picked a random sample of `n = 3` students
- I computed the sample mean age \(\overline{x}\)
- I did this three times
Note:
- They are not the same because of sampling variability
- What quantifies how much these point estimates vary?
From the OkCupid population:
- Take samples of size `n`
- Compute the sample mean height \(\overline{x}\)
- Do this many, many, many times (10,000)
- Visualize the distribution of these sample means
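A simulation sketch of these four steps, assuming the `profiles` data frame from the `okcupiddata` package (with a numeric `height` column):

library(mosaic)
library(okcupiddata)

# Take a sample of size n = 100 and compute the mean height; repeat 10,000 times
sample_means <- do(10000) * mean(~ height,
                                 data = sample(profiles, size = 100),
                                 na.rm = TRUE)

# Visualize the sampling distribution of the sample means
histogram(~ mean, data = sample_means)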
Taking a sample in order to infer about a population:
Let’s Google “define infer”…
library(lubridate)
library(mosaic)
library(dplyr)
# Randomly sample three people:
students <-
c("Arthur", "Caroline", "Claire", "Clare", "Conor", "Daniel",
"Dylan", "Elana", "Jacob", "Jay", "Joe", "Julian", "Kelsie",
"Lisa", "Maya", "Naing", "Parker", "Rebecca", "Ry", "Theodora",
"Zebediah", "Albert")
resample(students, size=3, replace=FALSE)
# Get average age:
birthdays <- c("1980-11-05", "2000-01-01", "1955-08-05")
ages <- as.numeric(as.Date("2017-04-27") - as.Date(birthdays))/365.25
ages
mean(ages)
- We randomly sample 3 students and get mean age
- We randomly sample 3 students and get mean age
- We randomly sample 3 students and get mean age…
Questions:
- Why is the mean (AKA average) age different each time?
- What numerical summary quantifies how these means vary?
Chalk talk…
- Hypothesis testing in general
- Background statistical theory
- View Lec29 Learning Check
- Chalk talk
If we assume \(H_0\) is true (there is no difference in test scores between evens and odds), then the `even_vs_odd` variable is irrelevant.

From last lecture: How do we construct the null distribution?
In this case, the null distribution is a barplot, and we can construct it in two ways:

(Two images: Analytically vs. Via Simulation.)
- Analytically/Mathematically: necessitates a probability background. Covered in MATH 310.
- Via simulation: necessitates a random number generator. We take this approach.
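A sketch of the simulation approach using `mosaic`, assuming a hypothetical data frame `scores` with a numeric `score` column and the two-level `even_vs_odd` variable:

library(mosaic)

# Observed difference in mean scores between the two groups
obs_diff <- diffmean(score ~ even_vs_odd, data = scores)

# Under H0 the group labels are irrelevant, so shuffle() them many times
null_dist <- do(10000) * diffmean(score ~ shuffle(even_vs_odd), data = scores)

# The null distribution of the difference in means
histogram(~ diffmean, data = null_dist)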
Only chalk talk today, based on Learning Checks for Lec26.
How likely is that? Not very! It only occurs 0.34% of the time.
p-value: Chalk Talk
If guessing at random, here are hypothetical outcomes:
She got 8/8 right!
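A sketch of how such hypothetical outcomes can be simulated with `mosaic`, treating each of the 8 guesses as a fair coin flip:

library(mosaic)

# Guess at random on 8 cups; repeat many times
guesses <- do(10000) * rflip(8)

# How often does pure guessing get all 8 right?
prop(~ heads == 8, data = guesses)

For reference, the exact probability is \((1/2)^8 \approx 0.39\%\); a simulated estimate, such as the 0.34% figure above, will bounce around that value.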
Critical chalk talk.
Binary situations, like

- True vs False
- Correct vs Incorrect
- Yes vs No

are often coded as `1` vs `0` in many programming languages.
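In R, for instance, logical values coerce to exactly these codes:

# TRUE/FALSE become 1/0 under numeric coercion
as.numeric(c(TRUE, FALSE, TRUE, TRUE))
## [1] 1 0 1 1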
- Correlation is not necessarily causation
- Spurious correlations
- Confounding variables
- Two types of studies
- Principles of designing experiments
Ezell’s Fried Chicken is a famous chicken restaurant in Seattle. Oprah Winfrey has it flown in to Chicago.
One day I was raving about Ezell’s Chicken, but my friend accused me of “buying into the hype”.
So what did we do?
Fried Chicken Face Off:
(Two images: Do people prefer this? Or this?)
How would you design a taste test to ascertain, independent of hype, which fried chicken tastes better?
Use the relevant principles of designing experiments from above.
The `mosaic` package has functions for random simulation:

- `rflip()`: flip a coin
- `shuffle()`: shuffle a set of values
- `do()`: do the same thing many, many, many times
- `resample()`: the Swiss Army knife for sampling

Run the following in your console:
library(mosaic)
# Define a vector fruit
fruit <- c("apple", "orange", "mango")
# Do this multiple times:
shuffle(fruit)
Two types of sampling: with replacement and without replacement.

`resample()` by default samples with replacement. Run this in the console multiple times:

resample(fruit)

`resample()` Chalk Talk
Chalk Talk 1
- In short: Probability is the study of randomness.
- Its roots lie in one of history’s constants: games of chance
- It is the theoretical backbone of statistics.
There are two approaches to studying probability:
(Two images: studying probability mathematically, as in MATH 310, vs. via simulations.)
Doing this repeatedly by hand is tiring:
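So we let R do the repeating instead. A minimal sketch with `mosaic`:

library(mosaic)

# One set of 10 coin flips
rflip(10)

# The computer happily repeats it 1000 times
do(1000) * rflip(10)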