Lec41 - Mon 5/15: Midterm III Review

Administrative

  • Fri 5/19 7pm-10pm in Warner 506
  • Not a final, but 3rd midterm. Timed at ~1h15m to 1h30m
  • Bring your cheatsheets
  • Bring a calculator or your smartphone with a calculator app

Philosophy

  • More conceptual in nature
  • Code:
    • Reading/understanding: Fair game
    • Writing: No direct code to write, but pseudocode
  • Question difficulty will follow a normal curve

Sources

  • Lectures 01 through 38 inclusive and cumulative
    • Slides from each lecture
    • Learning Checks
    • Problem set solutions!

Major Topics: Midterm I

  • Tidy data. What are the components?
  • What is the Grammar of Graphics? How does it tie in with ggplot2?
  • What are the first four of the 5NG? What are their distinguishing features?

Major Topics: Midterm II

Major Topics: Midterm III

  • Hypothesis testing
    • Lady tasting tea.
    • There is only one test; it has 5 components.
  • Confidence intervals
    • Theory: Sampling distribution and standard errors
    • Interpretation of CI
    • If sampling distribution is normal, the general formula for creating a 95% C.I.

  • Regression
    • Regression line is best fitting line in what sense?
    • Interpret ALL regression table outputs
    • Study residuals
    • Categorical variables
    • Multiple Regression

Lec39 - Thu 5/11: Multiple Regression

Recall

So far we’ve seen simple linear regression

  • Simple means only one predictor/independent variable \(x\)
  • Outcome/dependent variable \(y\)
  • \(x\) can be either numerical or categorical

Recall

In Lec 36 LC we saw the relationship between \(x =\) dep delay & \(y =\) arr delay for Alaska Airlines flights.

Today

  • Since we only have Alaska flights, the variable carrier doesn’t vary.
  • But now let’s also consider Frontier Airlines (carrier == "F9")

So we have:

  • \(y =\) arrival delay
  • \(x_1 =\) departure delay (numerical variable)
  • \(x_2 =\) carrier (categorical variable with \(k=2\) levels. In other words, carrier now varies.)
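
A minimal sketch of how a model with these three variables might be fit in R. The data frame `flights_sim`, its values, and the numbers used to generate it are all made up for illustration (this is not the real flights data); only the roles of \(y\), \(x_1\), and \(x_2\) mirror the setup above.

```r
# Simulated (hypothetical) data: NOT the real flights data
set.seed(76)
n <- 200
flights_sim <- data.frame(
  dep_delay = runif(n, min = -10, max = 120),
  carrier   = factor(rep(c("AS", "F9"), each = n / 2))
)
# Made-up truth: F9 arrives 15 minutes later for the same departure delay
flights_sim$arr_delay <- -10 + 1 * flights_sim$dep_delay +
  15 * (flights_sim$carrier == "F9") + rnorm(n, sd = 5)

# y = arrival delay, x1 = departure delay (numerical), x2 = carrier (categorical)
model <- lm(arr_delay ~ dep_delay + carrier, data = flights_sim)
coef(model)
```

With \(k=2\) levels, R creates a single 0/1 indicator (`carrierF9`). Its coefficient estimates the difference in arrival delay between Frontier and the baseline Alaska at the same departure delay.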

Today

Is there a difference in delays between Alaska and Frontier?

Lec38 - Wed 5/10: Interpretation + Categorical Predictors

Chalk Talk for Today

  • Continuing Regression Outputs: Lec36 Learning Check
  • Categorical Predictors

Lec37 - Mon 5/8: Least-Squares Line + Regression Output

Best Fitting Line

What does “best fitting line” mean?

Drawing

Best Fitting Line

Consider ANY point (in blue).

Drawing

Best Fitting Line

Now consider this point’s deviation from the regression line.

Drawing

Best Fitting Line

Do this for another point…

Drawing

Best Fitting Line

Do this for another point…

Drawing

Best Fitting Line

Regression line minimizes the sum of squared arrow lengths.

Drawing
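
To make "minimizes the sum of squared arrow lengths" concrete, here is a small sketch on made-up data. The helper `sum_sq()` and the data are hypothetical. It compares the sum of squared deviations for the line found by `lm()` against two slightly different lines.

```r
# Hypothetical data for illustration
set.seed(37)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 4)

# Sum of squared deviations ("squared arrow lengths") for any candidate line
sum_sq <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

fit <- lm(y ~ x)
b <- coef(fit)

best   <- sum_sq(b[1], b[2])        # the regression line
other1 <- sum_sq(b[1] + 1, b[2])    # shift the line up by 1
other2 <- sum_sq(b[1], b[2] + 0.1)  # tilt the line slightly
c(best = best, other1 = other1, other2 = other2)
```

Any line other than the regression line gives a larger sum, which is the sense in which the regression line is "best fitting".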

Chalk Talk

  • Residuals
  • Review of Lec36 Learning Check outputs
  • Regression viewed through the lens of sampling

Lec36 - Thu 5/4: Correlation

Recall

  • In Lec35 LC you all created your own 95% C.I. to estimate the proportion \(p\) of the OkCupid dataset which is female
  • We took a single sample of size n=100
  • We pretended we didn’t know the true \(p = 0.4023\) = 40.23%

Results 1

Here are your 12 resulting \(\widehat{p}\)’s…

email p_hat
aghall 0.360
ccrobinson 0.402
chimstead 0.380
cwhitedzuro 0.440
dmortime 0.430
efeldman 0.370
jobrien 0.400
jvolz 0.420
lschroer 0.402
rlightman 0.400
rstoreyfisher 0.390
zmillslagle 0.402

Results 2

Let me add 8 of my own so we have 20…

email p_hat
aghall 0.360
ccrobinson 0.402
chimstead 0.380
cwhitedzuro 0.440
dmortime 0.430
efeldman 0.370
jobrien 0.400
jvolz 0.420
lschroer 0.402
rlightman 0.400
rstoreyfisher 0.390
zmillslagle 0.402
aykim 0.420
aykim 0.360
aykim 0.300
aykim 0.360
aykim 0.360
aykim 0.400
aykim 0.340
aykim 0.400

Results 3

Let’s compute \(\mbox{SE} = \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\)

library(dplyr)

# Add the sample size and the standard error of each p_hat
p_hat <- p_hat %>%
  mutate(
    n = 100,
    SE = sqrt(p_hat*(1-p_hat)/n)
  )
email p_hat n SE
aghall 0.360 100 0.048
ccrobinson 0.402 100 0.049
chimstead 0.380 100 0.049
cwhitedzuro 0.440 100 0.050
dmortime 0.430 100 0.050
efeldman 0.370 100 0.048
jobrien 0.400 100 0.049
jvolz 0.420 100 0.049
lschroer 0.402 100 0.049
rlightman 0.400 100 0.049
rstoreyfisher 0.390 100 0.049
zmillslagle 0.402 100 0.049
aykim 0.420 100 0.049
aykim 0.360 100 0.048
aykim 0.300 100 0.046
aykim 0.360 100 0.048
aykim 0.360 100 0.048
aykim 0.400 100 0.049
aykim 0.340 100 0.047
aykim 0.400 100 0.049

Results 4

Finally the left and right end points of the 95% confidence interval. Whose CI’s captured the true \(p=0.4023\)?

p_hat <- p_hat %>% 
  mutate(
    left = p_hat - 1.96*SE,
    right = p_hat + 1.96*SE
  )
email p_hat n SE left right
aghall 0.360 100 0.048 0.266 0.454
ccrobinson 0.402 100 0.049 0.306 0.498
chimstead 0.380 100 0.049 0.285 0.475
cwhitedzuro 0.440 100 0.050 0.343 0.537
dmortime 0.430 100 0.050 0.333 0.527
efeldman 0.370 100 0.048 0.275 0.465
jobrien 0.400 100 0.049 0.304 0.496
jvolz 0.420 100 0.049 0.323 0.517
lschroer 0.402 100 0.049 0.306 0.498
rlightman 0.400 100 0.049 0.304 0.496
rstoreyfisher 0.390 100 0.049 0.294 0.486
zmillslagle 0.402 100 0.049 0.306 0.498
aykim 0.420 100 0.049 0.323 0.517
aykim 0.360 100 0.048 0.266 0.454
aykim 0.300 100 0.046 0.210 0.390
aykim 0.360 100 0.048 0.266 0.454
aykim 0.360 100 0.048 0.266 0.454
aykim 0.400 100 0.049 0.304 0.496
aykim 0.340 100 0.047 0.247 0.433
aykim 0.400 100 0.049 0.304 0.496
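
One way to answer "whose CI's captured the true \(p\)?" in code. This sketch retypes the 20 \(\widehat{p}\) values from the table above into a plain vector (the names `p_hats`, `true_p`, and `captured` are chosen here for illustration) and checks each interval.

```r
# The 20 sample proportions from the table above
p_hats <- c(0.360, 0.402, 0.380, 0.440, 0.430, 0.370, 0.400, 0.420, 0.402,
            0.400, 0.390, 0.402, 0.420, 0.360, 0.300, 0.360, 0.360, 0.400,
            0.340, 0.400)
n <- 100
true_p <- 0.4023

SE    <- sqrt(p_hats * (1 - p_hats) / n)
left  <- p_hats - 1.96 * SE
right <- p_hats + 1.96 * SE

# TRUE if the interval contains the true p
captured <- left <= true_p & true_p <= right
sum(captured)       # how many intervals captured the true p
p_hats[!captured]   # the p_hat values whose intervals missed
```

Here 19 of the 20 intervals contain 0.4023; the only miss is the interval built from \(\widehat{p} = 0.300\), whose right end point is 0.390.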

Results 5

  • Dots are \(\widehat{p}\)
  • Dashed line is true \(p=0.4023\)

Regression

  • Final topic for this course!
  • Correlation Coefficient

Correlation Coefficient

Example

Recall the nycflights data set. For Alaska Air flights, let’s explore the relationship between

  • Departure delay
  • Arrival delay

Example

Example

The correlation coefficient is computed as follows:

cor(alaska_flights$dep_delay, alaska_flights$arr_delay)
## [1] 0.8373792

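A correlation coefficient of 0.837 indicates a fairly strong positive association! (Note that it is a unitless number between -1 and 1, not a percentage.)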
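
For intuition about what `cor()` computes, here is a sketch that builds the correlation coefficient by hand as the average product of standardized values. The delay values below are made up (they are not the real `alaska_flights` data), and the names `z_x`, `z_y`, and `r_by_hand` are just for illustration.

```r
# Hypothetical delays in minutes; not the real alaska_flights data
dep_delay <- c(-5, 0, 3, 10, 25, 40, 60)
arr_delay <- c(-12, -8, 5, 2, 20, 45, 50)

# Standardize each variable, then average the products (dividing by n - 1)
n <- length(dep_delay)
z_x <- (dep_delay - mean(dep_delay)) / sd(dep_delay)
z_y <- (arr_delay - mean(arr_delay)) / sd(arr_delay)
r_by_hand <- sum(z_x * z_y) / (n - 1)

r_by_hand
cor(dep_delay, arr_delay)  # built-in function; should agree
```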

Bored?

Play Guess the Correlation

Lec35 - Wed 5/3: Confidence Intervals in General

Recall

Chalk talk

Point Estimates

For large \(n\), the sampling distributions of these point estimates (PE) are bell-shaped, thus a 95% C.I. is \(\mbox{PE} \pm 1.96\times \mbox{SE}\), where SE is the standard error of the point estimate.

Population Parameter                   Sample Statistic
Mean \(\mu\)                           Sample Mean \(\overline{x}\)
Proportion \(p\)                       Sample Proportion \(\widehat{p}\)
Diff of Means \(\mu_1 - \mu_2\)        \(\overline{x}_1 - \overline{x}_2\)
Diff of Proportions \(p_1 - p_2\)      \(\widehat{p}_1 - \widehat{p}_2\)
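
As a sketch of the \(\mbox{PE} \pm 1.96\times \mbox{SE}\) recipe for a mean: the sample below is simulated (hypothetical heights), and the standard error \(s/\sqrt{n}\) used here is the usual textbook formula for a sample mean, which is an assumption since these notes only write out the SE for a proportion.

```r
# Hypothetical sample of n = 50 heights in inches (simulated)
set.seed(35)
heights <- rnorm(50, mean = 68, sd = 4)

PE <- mean(heights)                        # point estimate: sample mean
SE <- sd(heights) / sqrt(length(heights))  # standard error of the mean
ci <- c(left = PE - 1.96 * SE, right = PE + 1.96 * SE)
ci
```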

Example: Polls

NPR report on Obama from 2013. Chalk talk…

Lec34 - Mon 5/1: Confidence Intervals

Recall

We are estimating a population parameter using a point estimate based on a sample. Example: Mean (Chalk Talk)

Accuracy vs Precision

Drawing

Confidence Intervals

Imagine \(\mu\) is a fish:

Point Estimate \(\overline{x}\) Confidence Interval
Drawing Drawing

Learning Check Discussion

Lec33 - Fri 4/28: Sampling Distributions and Standard Errors

Recall

Age example:

  1. I picked a random sample of n=3 students
  2. I computed sample mean age \(\overline{x}\)
  3. I did this three times

Note:

  1. They are not the same because of sampling variability
  2. What quantifies how much these point estimates vary?

Lec32 Learning Check

From the OkCupid population:

  1. Take samples of size n
  2. Compute sample mean height \(\overline{x}\)
  3. Do this many, many, many times (10000)
  4. Visualize distribution of these sample means
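
The four steps above can be sketched in base R as follows. The population here is simulated (a hypothetical stand-in for the OkCupid heights), the sample size n = 25 is an arbitrary choice, and `replicate()` plays the role of "do this many, many, many times".

```r
# Hypothetical population of heights (inches); stands in for the OkCupid data
set.seed(32)
population <- rnorm(50000, mean = 68, sd = 4)

n <- 25
# Steps 1-3: draw a sample of size n and compute its mean, 10000 times
sample_means <- replicate(10000, mean(sample(population, size = n)))

# Step 4: visualize the distribution of the sample means
hist(sample_means, main = "Sampling distribution of the sample mean")

# The spread of this distribution is the standard error
sd(sample_means)
sd(population) / sqrt(n)   # theoretical value, for comparison
```

The two numbers at the end should be close: the standard deviation of the sampling distribution is what we call the standard error.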

Lec32 Learning Check

Accuracy vs Precision

Drawing

Lec32 - Thu 4/27: Back to Sampling

Recall: Point of Statistics

Taking a sample in order to infer about a population:

Drawing


Let’s Google “define infer”…

Demo for Today

library(lubridate)
library(mosaic)
library(dplyr)

# Randomly sample three people:
students <- 
  c("Arthur", "Caroline", "Claire", "Clare", "Conor", "Daniel", 
    "Dylan", "Elana", "Jacob", "Jay", "Joe", "Julian", "Kelsie", 
    "Lisa", "Maya", "Naing", "Parker", "Rebecca", "Ry", "Theodora", 
    "Zebediah", "Albert")
resample(students, size=3, replace=FALSE)

# Get average age:
birthdays <- c("1980-11-05", "2000-01-01", "1955-08-05")
ages <- as.numeric(as.Date("2017-04-27") - as.Date(birthdays))/365.25
ages
mean(ages)

Demo for Today

  • We randomly sample 3 students and get mean age
  • We randomly sample 3 students and get mean age
  • We randomly sample 3 students and get mean age…

Questions:

  1. Why is the mean (AKA average) age different each time?
  2. What numerical summary quantifies how these means vary?

Chalk talk…

Lec31 - Wed 4/26: Background Statistical Theory

There is Only One Test

Drawing

Today: Chalk talk

  • Hypothesis testing in general
  • Background statistical theory

Lec30 - Mon 4/24: Finishing Hypothesis Testing

Today

  • View Lec29 Learning Check
  • Chalk talk

Lec29 - Thu 4/20: Permutation Test

Recall

If we assume \(H_0\) is true (there is no difference in test scores between evens and odds) then:

  1. Whether you have an even number of letters or odd is irrelevant
  2. Hence the categorical variable even_vs_odd is irrelevant
  3. Hence we can permute/shuffle it to no consequence
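
A minimal sketch of this shuffling idea in base R. The scores and labels are made up, the helper `diff_means()` is hypothetical, and base R's `sample()` is used here to do the permuting.

```r
# Hypothetical test scores and name-length labels (made up for illustration)
set.seed(29)
scores <- c(82, 75, 91, 68, 88, 79, 95, 73, 85, 77)
even_vs_odd <- c("even", "odd", "odd", "even", "even",
                 "odd", "even", "odd", "odd", "even")

# Test statistic: difference in mean score, evens minus odds
diff_means <- function(labels) {
  mean(scores[labels == "even"]) - mean(scores[labels == "odd"])
}
observed <- diff_means(even_vs_odd)

# Under H0 the labels are irrelevant, so shuffle them many times
null_dist <- replicate(10000, diff_means(sample(even_vs_odd)))

# Two-sided p-value: share of shuffles at least as extreme as observed
p_value <- mean(abs(null_dist) >= abs(observed))
p_value
```

Each shuffle keeps the scores fixed and only reassigns the labels, so `null_dist` shows the differences we would expect to see if even vs odd truly did not matter.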

In Other Words, All These Are the Same:

Lec28 - Wed 4/19: Constructing the Null Hypothesis

Recall

From last lecture: How do we construct null distribution?

Lady Tasting Tea

In this case, the null distribution is a barplot:

Two Ways

Analytically Via Simulation
Drawing Drawing

Two Ways

  • Analytically/Mathematically: Necessitates probability background. Covered in MATH 310.
  • Simulation: Necessitates random number generator. We take this approach.

Constructing the Null

  • Lady Tasting Tea: Assuming she is guessing at random, we simulated many, many, many instances of “the number she got right”.
  • Odds vs Evens Test Score: Chalk talk
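
For the Lady Tasting Tea, the simulation might look like the sketch below. It assumes each of the 8 cups is an independent guess that is correct with probability 0.5, and uses base R's `rbinom()` as the random number generator.

```r
set.seed(28)
# Under H0 each of the 8 guesses is correct with probability 0.5;
# repeat the 8-cup experiment 10000 times
n_correct <- rbinom(10000, size = 8, prob = 0.5)

# Null distribution: how often each number of correct guesses occurred
null_counts <- table(factor(n_correct, levels = 0:8))
barplot(null_counts, xlab = "Number correct out of 8", ylab = "Count")

# Proportion of simulated experiments with all 8 correct
mean(n_correct == 8)
```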

Lec27 - Mon 4/17: Tying Hypothesis Testing with Sampling

Chalk Talk

Only chalk talk today, based on Learning Checks for Lec26.

Lec26 - Fri 4/14: p-Values

Recall: Framework and Terminology

  1. Null hypothesis \(H_0\): she is guessing at random
  2. Alternative hypothesis \(H_A\): she is not, i.e. she can really tell which came first
  3. Test statistic: # of correct guesses out of 8
  4. Observed test statistic: 8 correct
  5. Null distribution: the bar chart
  6. Decision: compare observed test statistic to null distribution

Recall: Framework and Terminology

  • We are going to assume \(H_0\) is true and see how likely the observed test statistic is under the null distribution.
  • How likely was the observed test statistic?

Recall: Framework and Terminology

Not very! Only occurs 0.34% of the time
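
As a cross-check on the simulated figure, the same probability can be computed exactly, assuming 8 independent guesses that are each correct with probability 0.5 under \(H_0\). The 34 out of 10000 is the simulation count from the Lady Tasting Tea recap in these notes.

```r
# Assumption: 8 independent guesses, each correct with probability 0.5 under H0
p_exact <- dbinom(8, size = 8, prob = 0.5)   # P(all 8 correct) = 0.5^8 = 1/256

# Simulated counterpart: 34 of 10000 simulated experiments had 8/8 correct
p_simulated <- 34 / 10000

c(exact = p_exact, simulated = p_simulated)
```

The exact value is about 0.39%, so the simulated 0.34% is in the right neighborhood; the gap is just simulation noise.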

Today’s Definition

p-value: Chalk Talk

Lec25 - Thu 4/13: Hypothesis Testing Framework and Terminology

Recall: Lady Tasting Tea

  • Lady Tasting Tea claims she can tell if tea or milk was poured first into a cup
  • You run an experiment with 8 cups to see if she can tell or if she is bullshitting
  • Let’s assume a hypothetical world where she is guessing at random.

Recall: Lady Tasting Tea

If guessing at random, here are hypothetical outcomes:

Recall: Lady Tasting Tea

She got 8/8 right!

Recall: Lady Tasting Tea

  • In our hypothetical world of guessing at random, 8/8 occurred 34 times out of 10000, i.e. 0.34% of the time.
  • Can she tell, or is she bullshitting?

Hypothesis Testing Framework

Critical chalk talk.

Lec23 - Mon 4/10: Midterm II Review

Administrative

  • Evening Exam: Wed 4/7 7pm in Warner 506
  • Closed book, no calculators
  • Bring your cheatsheets

Philosophy

  • More conceptual in nature
  • Code:
    • Reading/understanding: Fair game
    • Writing: No direct code to write, but pseudocode
  • Question difficulty will follow a normal curve

Sources

  • Lectures 01 through 21 inclusive and cumulative
    • Slides from each lecture
    • Corresponding textbook material (if any)
    • Learning Checks
    • Problem Sets

Major Topics: Midterm I

  • Tidy data. What are the components?
  • What is the Grammar of Graphics? How does it tie in with ggplot2?
  • What are the first four of the 5NG? What are their distinguishing features?

Major Topics: Midterm II

Practice Midterm

  • Disclaimer, disclaimer, disclaimer
    • Do not overly interpret the content of this midterm.
    • Rather, view it to get a rough sense of my exam philosophy.
  • Note: There was no probability in last year’s Midterm II

Lec22 - Fri 4/7: Lady Tasting Tea

Scenario for Today

  • Say you are a statistician and you meet someone called the “Lady Tasting Tea.”
  • She claims she can tell by tasting whether the tea or the milk was added first to a cup.
  • You want to test whether
    • She can actually tell which came first OR
    • She’s lying and is only guessing at random
  • Say you have just enough tea/milk to pour into 8 cups.

Coding Note

Binary situations, like

  • True vs False
  • Correct vs Incorrect
  • Yes vs No

are often coded as 1 vs 0 in many programming languages.
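
In R, for example, TRUE and FALSE behave as 1 and 0 in arithmetic, which makes counting easy. The vector `correct` below is a made-up set of results for 8 cups.

```r
# Hypothetical results for 8 cups: was each guess correct?
correct <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)

as.numeric(correct)  # TRUE becomes 1, FALSE becomes 0
sum(correct)         # number of correct guesses
mean(correct)        # proportion of correct guesses
```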

Lec21 - Thu 4/6: Confounding Variables and Designed Experiments

Today: Other Use of Randomness

  1. Random sampling: To obtain a representative sample from a population.
  2. Random assignment: To design an experiment.
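
A small sketch contrasting the two uses of randomness, using base R's `sample()`. The population of 100 numbered people and the group sizes are hypothetical.

```r
set.seed(21)
# Hypothetical population of 100 people, identified by number
population <- 1:100

# Random sampling: choose 10 people to study, without replacement
participants <- sample(population, size = 10)

# Random assignment: shuffle treatment labels across the 10 participants
group <- sample(rep(c("treatment", "control"), each = 5))
data.frame(participants, group)
```

Sampling decides who is in the study; assignment decides which of them gets the treatment.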

Mantra of Statistics

Chalk Talk

  • Confounding variables
  • Two types of studies
  • Principles of designing experiments

Learning Check

Ezell’s Fried Chicken is a famous chicken restaurant in Seattle. Oprah Winfrey has it flown in to Chicago.

Drawing

Learning Check

One day I was raving about Ezell’s Chicken, but my friend accused me of “buying into the hype”.

So what did we do?

Learning Check

Fried Chicken Face Off:

Do people prefer this? Or this?
Drawing Drawing

Learning Check

How would you design a taste test to ascertain, independent of hype, which fried chicken tastes better?

Use the relevant principles of designing experiments from above.

Lec20 - Wed 4/5: Introduction to Sampling

Recall

The mosaic package has functions for random simulation.

  1. rflip(): Flip a coin
  2. shuffle(): Shuffle a set of values
  3. do(): Do the same thing many, many, many times
  4. resample(): The Swiss Army knife for sampling

Shuffling AKA Permuting

Run the following in your console:

library(mosaic)
# Define a vector fruit
fruit <- c("apple", "orange", "mango")

# Do this multiple times:
shuffle(fruit)

Sampling: Key Distinction

Two types of sampling:

  1. Sampling with replacement
  2. Sampling without replacement

Resampling

resample() by default samples with replacement. Run this in the console multiple times:

resample(fruit)
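
To see the key distinction side by side, here is a sketch using base R's `sample()`, which takes an explicit `replace` argument. The names `without` and `with_repl` are just for illustration.

```r
fruit <- c("apple", "orange", "mango")

# Without replacement: each fruit appears exactly once (a shuffle)
without <- sample(fruit, replace = FALSE)
without

# With replacement: the same fruit can appear more than once
with_repl <- sample(fruit, replace = TRUE)
with_repl
```

Run it a few times: `without` is always a reordering of the three fruits, while `with_repl` will sometimes repeat a fruit and leave another out.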

Possible Inputs to resample()

Chalk Talk

Lec19 - Mon 4/3: Intro to Probability via Simulation

Recall

Chalk Talk 1

Probability

  • In short: Probability is the study of randomness.
  • Its roots lie in one historical constant
  • It is the theoretical backbone of statistics.

Two Approaches to Probability

There are two approaches to studying probability:

Mathematically (MATH 310) Via Simulations
Drawing Drawing

Two Approaches to Probability

  • The mathematical approach requires A LOT of math background, whereas the simulation approach does not.
  • To do simulations, we need a computer’s random number generator. Why?

Simulations via Computer

Doing this repeatedly by hand is tiring:

Drawing
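
A computer can do the repetition instead. The sketch below assumes the repeated task is something like flipping a fair coin (an assumption on my part), and uses base R's random number generator via `rbinom()`.

```r
set.seed(19)
# Flip a fair coin 10 times: 1 = heads, 0 = tails
flips <- rbinom(10, size = 1, prob = 0.5)
flips
sum(flips)   # number of heads in these 10 flips

# Repeat the 10-flip experiment 1000 times and tabulate the head counts
heads <- rbinom(1000, size = 10, prob = 0.5)
table(heads)
```

Setting the seed makes the "random" results reproducible, so you get the same flips each time you rerun the code.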