Lec39 - Thu 5/11: Multiple Regression

Using the starter code below, run a multiple regression of

  • \(y =\) arrival delay
  • \(x_1 =\) departure delay (numerical variable)
  • \(x_2 =\) carrier (categorical variable with \(k=2\) levels. In other words, carrier now varies.)
  1. Interpret the resulting coefficients.
  2. Come up with an airline industry explanation on why the arrival delays for Alaska and Frontier may differ.
library(ggplot2)
library(dplyr)
library(mosaic)
library(broom)
library(nycflights13)
data(flights)

alaska_frontier_flights <- flights %>% 
  filter(carrier == "AS" | carrier == "F9") %>% 
  filter(dep_delay < 250)

ggplot(alaska_frontier_flights, aes(x=dep_delay, y=arr_delay, col=carrier)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)

Discussion 1

Note: We are only investigating flights where the delay was not extreme (less than 250min). Let’s look at the regression output:

model <- lm(arr_delay~dep_delay+carrier, data=alaska_frontier_flights)
tidy(model, conf.int = TRUE)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -15.642 0.800 -19.546 0 -17.211 -14.072
dep_delay 0.979 0.016 62.388 0 0.949 1.010
carrierF9 17.734 1.151 15.409 0 15.477 19.992

It appears that, for the same departure delay, Frontier arrives on average about 17.7 min later than Alaska. Why? Let’s dig deeper:

alaska_frontier_flights %>% 
  group_by(carrier, origin, dest) %>% 
  summarise(count=n())
carrier origin dest count
AS EWR SEA 712
F9 LGA DEN 675
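To make the interpretation of the coefficients concrete, the fitted equation from the regression table above can be evaluated by hand. The sketch below hard-codes the three reported estimates; the helper name predict_arr_delay and the 10-minute departure delay are illustrative assumptions, not part of the original analysis.

```r
# Estimates copied from the tidy(model) output above
b_intercept <- -15.642
b_dep_delay <- 0.979
b_carrierF9 <- 17.734

# Hypothetical helper: fitted arrival delay (minutes) for a given
# departure delay and carrier ("AS" is the baseline level)
predict_arr_delay <- function(dep_delay, carrier) {
  b_intercept + b_dep_delay * dep_delay + b_carrierF9 * (carrier == "F9")
}

pred_AS <- predict_arr_delay(10, "AS")
pred_F9 <- predict_arr_delay(10, "F9")
c(AS = pred_AS, F9 = pred_F9)

# For the same departure delay, the gap is always the carrierF9 coefficient
pred_F9 - pred_AS
```

Because the model has no interaction term, the two fitted lines are parallel: the gap between carriers is the same at every departure delay.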

Discussion 2

What about temp? Does that impact arr_delay?

data("weather")
alaska_frontier_flights <- alaska_frontier_flights %>% 
  left_join(weather, by=c("year", "month", "day", "hour", "origin"))

model_2 <- lm(arr_delay~dep_delay+carrier+temp, data=alaska_frontier_flights)
tidy(model_2, conf.int = TRUE)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -4.681 1.885 -2.484 0.013 -8.378 -0.984
dep_delay 0.991 0.016 63.679 0.000 0.960 1.021
carrierF9 17.685 1.135 15.578 0.000 15.458 19.912
temp -0.196 0.030 -6.454 0.000 -0.256 -0.137

Discussion 3

What about temp AND humid? Does that impact arr_delay?

  • Recall: In PS12, we showed that temp and humid are highly correlated.
  • In other words, they provide somewhat similar information.
  • Question: How much new information is added by considering humid?
model_3 <- lm(arr_delay~dep_delay+carrier+temp+humid, data=alaska_frontier_flights)
tidy(model_3, conf.int = TRUE)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -2.008 2.640 -0.760 0.447 -7.187 3.171
dep_delay 0.994 0.016 63.357 0.000 0.963 1.025
carrierF9 17.430 1.148 15.176 0.000 15.177 19.683
temp -0.199 0.030 -6.522 0.000 -0.258 -0.139
humid -0.042 0.029 -1.445 0.149 -0.098 0.015
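One way to see why a predictor that is highly correlated with another adds little is a small simulation. The sketch below uses made-up data (not the flights or weather data): x2 is built to be nearly a copy of x1, and y depends only on x1. All names and numbers are assumptions for illustration.

```r
set.seed(76)
n_obs <- 500
x1 <- rnorm(n_obs)                  # stand-in for temp
x2 <- x1 + rnorm(n_obs, sd = 0.3)   # stand-in for humid, highly correlated with x1
y  <- 2 * x1 + rnorm(n_obs)         # response depends only on x1

cor(x1, x2)

fit_one  <- lm(y ~ x1)
fit_both <- lm(y ~ x1 + x2)

# R-squared barely moves when the redundant predictor is added
r2_one  <- summary(fit_one)$r.squared
r2_both <- summary(fit_both)$r.squared
c(one = r2_one, both = r2_both, gain = r2_both - r2_one)
```

This mirrors what the table above suggests: the confidence interval for humid includes 0, and the other estimates barely change once humid is added.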

Lec38 - Wed 5/10: Categorical Predictors

United Airlines is in the news a lot lately. Do they have bigger departure delays than their rivals?

library(ggplot2)
library(dplyr)
library(mosaic)
library(broom)
library(nycflights13)
data(flights)

jan_31_flights <- flights

Using the starter code above

  1. Create a data frame jan_31_flights that consists of flights
    • Only on January 31st
    • Only flights for the following airlines: United Airlines (UA), American Airlines (AA)
  2. Create a visualization that illustrates the difference in departure delays between UA and AA. Which initially appears to have larger delays?
  3. What are the mean departure delays for both airlines?
  4. Run a linear regression with
    • \(y\) = departure delay (continuous/numerical variable)
    • \(x\) = carrier (categorical variable with two levels)
  5. Compare the two means in Step 3 with the regression results.

Discussion

Remember, to pick out UA and AA flights, you need the OR operator: for all rows where carrier == UA or carrier == AA.

jan_31_flights <- flights %>% 
  filter(month == 1 & day == 31) %>% 
  filter(carrier == "UA" | carrier == "AA")

As you now know, I’m a big fan of boxplots: comparisons across groups with a single line! In this case, the median delay for United appears to be higher!

ggplot(jan_31_flights, aes(x=carrier, y=dep_delay)) + 
  geom_boxplot()

The mean departure delays are shown below. We see that United is on average 15.665 - 15.057 = 0.608 minutes later in its departures.

jan_31_flights %>% 
  group_by(carrier) %>% 
  summarise(mean_dep_delay=mean(dep_delay, na.rm=TRUE))
carrier mean_dep_delay
AA 15.057
UA 15.665

Let’s fit the regression using a categorical predictor \(x\)

model <- lm(dep_delay~carrier, data=jan_31_flights)
tidy(model, conf.int = TRUE)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 15.057 3.710 4.059 0.000 7.749 22.364
carrierUA 0.608 4.629 0.131 0.896 -8.510 9.726

\(\beta_{\mbox{carrierUA}}\) is 0.608! i.e. United has on average 0.608 minute bigger delays, just like we computed above! However notice:

  • The 95% confidence interval [-8.510, 9.726] includes 0! In other words, a differential of 0 is still plausible!
  • The p-value is very high! In other words, we would not reject \(H_0\). There may still be a true difference in delays between United and American, but the data from January 31st doesn’t support this claim.
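The link between the group means and the regression coefficients holds for any two-level categorical predictor. Here is a minimal sketch on made-up delays (the data frame toy_flights is hypothetical, not the January 31st data):

```r
# Made-up departure delays (minutes) for two carriers
toy_flights <- data.frame(
  carrier   = c("AA", "AA", "AA", "UA", "UA", "UA", "UA"),
  dep_delay = c(5, 20, 11, 30, -2, 14, 10)
)

# Mean delay per carrier
group_means <- tapply(toy_flights$dep_delay, toy_flights$carrier, mean)
group_means

toy_model <- lm(dep_delay ~ carrier, data = toy_flights)
coef(toy_model)

# Intercept = mean of the baseline group (AA, first alphabetically);
# carrierUA = mean(UA) - mean(AA)
```

The baseline level is whichever carrier comes first alphabetically, which is why AA plays the role of the intercept here and in the output above.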

Lec36 - Thu 5/4: Correlation

library(ggplot2)
library(dplyr)
library(nycflights13)
data(flights)

# 1. Load Alaska data, deleting rows that have missing dep or arr data
alaska_flights <- flights %>% 
  filter(carrier == "AS") %>% 
  filter(!is.na(dep_delay) & !is.na(arr_delay))

# 2. Number of observations
nrow(alaska_flights)

# 3. Plot
ggplot(data=alaska_flights, aes(x = dep_delay, y = arr_delay)) + 
  geom_point() +
  geom_smooth(method="lm", se=FALSE)

# 4. Output regression results
model <- lm(arr_delay ~ dep_delay, data=alaska_flights)
model
summary(model)

# 5. Output regression results in tidy format using broom package:
library(broom)

# Summary table + confidence intervals
model_table <- tidy(model, conf.int = TRUE)
View(model_table)

# Point by point values:
model_values <- augment(model) %>% 
  select(arr_delay, dep_delay, .fitted, .resid)
View(model_values)

Discussion

Let’s study the three outputs:

1. Plot of points and “best-fitting line”:

2. The regression table model_table:

term estimate std.error statistic p.value conf.low conf.high
(Intercept) -15.599 0.762 -20.463 0 -17.096 -14.102
dep_delay 0.972 0.024 40.733 0 0.925 1.019

3. The point-by-point values model_values (only first 6 of 709 rows):

arr_delay dep_delay .fitted .resid
-10 -1 -16.571 6.571
-19 -7 -22.404 3.404
-41 -3 -18.515 -22.485
1 3 -12.683 13.683
-18 -1 -16.571 -1.429
-9 2 -13.655 4.655
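The .fitted and .resid columns can be reproduced by hand from the regression table: the fitted value is the point on the line, and the residual is observed minus fitted. A minimal sketch using the rounded estimates and the 6 rows shown above (so results agree with the table only up to rounding):

```r
# Estimates copied from model_table above
b0 <- -15.599
b1 <- 0.972

# First 6 rows of observed values from model_values
first_rows <- data.frame(
  arr_delay = c(-10, -19, -41, 1, -18, -9),
  dep_delay = c(-1, -7, -3, 3, -1, 2)
)

# Fitted value = point on the regression line at each dep_delay
first_rows$fitted <- b0 + b1 * first_rows$dep_delay
# Residual = observed arrival delay minus fitted value
first_rows$resid <- first_rows$arr_delay - first_rows$fitted
first_rows
```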

Lec35 - Wed 5/3: Confidence Intervals in General

We are revisiting the Lec33 LC where

  • We estimated the population mean height \(\mu\) of OkCupid users using the sample mean \(\overline{x}\) as a point estimate based on samples of size n=5.
  • We did this many, many, many times to create the sampling distribution.
  • Using the standard error, we created left_value and right_value such that it captured 95% of \(\overline{x}\) values.

Today you will do the same but for a different set up:

  • Population parameter: the population proportion \(p\) that is female. Spoiler alert: it’s 40.2%.
  • Point estimate: the sample proportion \(\widehat{p}\) i.e. the proportion of the sample that is female.
    • To this end we create a new variable is_female below.
    • Note: A proportion is just a mean of 0’s and 1’s. Ex: Try running mean(c(0,0,0,1)) in your console.
library(ggplot2)
library(dplyr)
library(mosaic)
library(okcupiddata)
data(profiles)

profiles <- profiles %>% 
  mutate(is_female = ifelse(sex == "f", 1, 0)) %>% 
  select(sex, is_female, height)
View(profiles)

n <- 5

Using the starter code above

  1. Recreate the plot from Lec32 LC Discussion below for n=5 and interpret it.
  2. Recreate the plot from Lec32 LC Discussion below for n=100 and interpret it. Compare it to the plot for n=5. What is different?
  3. Now switch gears to the more realistic situation where
    1. We don’t know the true population proportion \(p\). So pretend you don’t know it’s 40.2%.
    2. We take a single sample of size n=100, and not many, many, many samples.
    3. Compute a 95% confidence interval. You will need to use the mathematical approximation to the standard error for \(\widehat{p}\), which is below.
    4. Fill out this Google Form with your resulting values.

\[ \mbox{SE}_{\widehat{p}} = \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}} \]

Discussion

profiles <- profiles %>%
  mutate(is_female = ifelse(sex == "f", 1, 0)) %>%
  select(is_female)

# Center of sampling distribution
p <- mean(profiles$is_female)

1) n=5

Let’s do this for n = 5 first.

# Same as Lec32 LC Part 3, but with replace=FALSE as is more realistic
n <- 5
samples <- do(10000)*mean(resample(profiles$is_female, size=n, replace=FALSE))

# Standard error = sd of sampling distribution
SE <- sd(samples$mean)
SE
## [1] 0.2202745
# Use +/- 2 standard deviations of the mean rule-of-thumb for bell-shaped
# data
left_value <- p - 2*SE
right_value <- p + 2*SE
c(left_value, right_value)
## [1] -0.0382369  0.8428611
# Plot!
title <- paste("10000 Simulations of Sample Proportion Female Based on n =", n)
ggplot(samples, aes(x=mean)) +
  geom_histogram(binwidth = 0.01) +
  labs(x="Sample Proportion p-hat", title=title) +
  geom_vline(xintercept=p, col="red") +
  geom_vline(xintercept=left_value, linetype="dashed") +
  geom_vline(xintercept=right_value, linetype="dashed")

Looks kind of choppy! Why? Because if we sample only 5 people, there are only 6 possible proportions: \(\frac{0}{5}, \frac{1}{5}, \ldots, \frac{5}{5}\). We need to boost n to get a smoother sampling distribution.

2) n=100

Let’s do this for n = 100 instead.

# Same as Lec32 LC Part 3, but with replace=FALSE as is more realistic
n <- 100
samples <- do(10000)*mean(resample(profiles$is_female, size=n, replace=FALSE))

# Standard error = sd of sampling distribution
SE <- sd(samples$mean)
SE
## [1] 0.04920429
# Use +/- 2 standard deviations of the mean rule-of-thumb for bell-shaped
# data
left_value <- p - 2*SE
right_value <- p + 2*SE
c(left_value, right_value)
## [1] 0.3039035 0.5007207
# Plot!
title <- paste("10000 Simulations of Sample Proportion Female Based on n =", n)
ggplot(samples, aes(x=mean)) +
  geom_histogram(binwidth = 0.01) +
  labs(x="Sample Proportion p-hat", title=title) +
  geom_vline(xintercept=p, col="red") +
  geom_vline(xintercept=left_value, linetype="dashed") +
  geom_vline(xintercept=right_value, linetype="dashed")

Much better! Also, notice that since

  • the sample size went from n=5 to n=100
  • the standard error SE went down from 0.220 to 0.049
  • so the confidence interval got narrower!
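The drop in the standard error can also be checked against the formula \(\sqrt{p(1-p)/n}\), without any simulation. A quick sketch, plugging in the known population proportion of 0.402 (in a real study you would not know this value):

```r
p <- 0.402            # population proportion from the text
n_values <- c(5, 100)

# Formula-based standard error of the sample proportion for each n
SE_formula <- sqrt(p * (1 - p) / n_values)
names(SE_formula) <- paste0("n=", n_values)
SE_formula

# Multiplying n by 20 divides the standard error by sqrt(20)
SE_formula[1] / SE_formula[2]
```

The formula gives roughly 0.219 and 0.049, close to the two simulated values of SE above.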

3) Realistic Situation

There is no many, many, many times; we take ONE sample.

n <- 100
sample <- resample(profiles$is_female, size=n, replace=FALSE)

The point estimate \(\widehat{p}\) of \(p\) is:

p_hat <- mean(sample)
p_hat
## [1] 0.35

Since there is only one sample, and not many, many, many samples, we use the mathematical approximation to the standard error of \(\widehat{p}\)

\[ \mbox{SE}_{\widehat{p}} = \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}} \]

SE <- sqrt(p_hat*(1-p_hat)/n)
SE
## [1] 0.04769696

Key Observation: Look how similar this value is to the SE <- sd(samples$mean) value above!

We now use this to build our confidence interval using the rule \(\mbox{PE} \pm 1.96\times\mbox{SE}\) since n is large

c(p_hat -1.96*SE, p_hat + 1.96*SE)
## [1] 0.256514 0.443486

Our confidence interval is (0.257, 0.443), which in this case contains the true \(p=0.402\). Remember: 95% confidence intervals will

  • Capture the true population parameter 95% of the time
  • Not capture it, i.e. fail, 5% of the time
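The "95% of the time" claim can be checked by simulation: build many intervals, each from its own single sample, and count how often they contain the true \(p\). The sketch below is an assumption-laden stand-in for the class code: it draws sample proportions from a binomial model with rbinom() rather than resampling the profiles data, which is nearly equivalent for a population this large.

```r
set.seed(2017)
p_true <- 0.402   # population proportion from the text
n <- 100
n_sims <- 10000

# One simulated sample proportion per simulation, from a binomial model
p_hats <- rbinom(n_sims, size = n, prob = p_true) / n

# Formula-based standard error and 95% interval for each sample
SEs   <- sqrt(p_hats * (1 - p_hats) / n)
lower <- p_hats - 1.96 * SEs
upper <- p_hats + 1.96 * SEs

# Proportion of intervals that capture the true p
coverage <- mean(lower <= p_true & p_true <= upper)
coverage
```

The coverage should come out near 0.95, though not exactly, since the formula for the standard error is itself an approximation.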

Lec33 - Fri 4/28: Sampling Distributions and Standard Errors

  1. Change left_value and right_value below so that the dashed vertical lines in the plot capture 95% of sample means. Note these two values.
  2. Change n=5 to n=50. What happens to left_value and right_value? Compare to above.
library(ggplot2)
library(dplyr)
library(mosaic)
library(okcupiddata)
data(profiles)

# Let's make our lives easier by removing all 3 users who did not list their
# height:
# -is.na() returns true if a value is NA missing.
# -!is.na() returns true if a value is NOT NA missing.
profiles <- profiles %>% 
  select(height) %>% 
  filter(!is.na(height))

# Recall, this is the true population mean. In a typical real-life situation,
# you won't know this value! This is a rhetorical/theoretical exercise to show
# how sampling influences our estimates.
mu <- mean(profiles$height)

# Same as Lec32 LC Part 3, but with replace=FALSE as is more realistic
n <- 5
samples <- do(10000) * mean(resample(profiles$height, size=n, replace=FALSE))
View(samples)

# Change these two values
left_value <- 60
right_value <- 80

# Plot!
title <- paste("10000 Simulations of Sample Mean Height Based on n =", n)
ggplot(samples, aes(x=mean)) +
  geom_histogram(binwidth = 1) +
  labs(x="Sample Mean xbar", title=title) + 
  geom_vline(xintercept=mu, col="red") + 
  geom_vline(xintercept=left_value, linetype="dashed") + 
  geom_vline(xintercept=right_value, linetype="dashed")

Discussion

library(ggplot2)
library(dplyr)
library(mosaic)
library(okcupiddata)
data(profiles)
profiles <- profiles %>% 
  select(height) %>% 
  filter(!is.na(height))

# Center of sampling distribution
mu <- mean(profiles$height)

# Same as Lec32 LC Part 3, but with replace=FALSE as is more realistic
n <- 5
samples <- do(10000) * mean(resample(profiles$height, size=n, replace=FALSE))

# Standard error = sd of sampling distribution
SE <- sd(samples$mean)

# Use +/- 2 standard deviations of the mean rule-of-thumb for bell-shaped
# data
left_value <- mu - 2*SE
right_value <- mu + 2*SE
c(left_value, right_value)
## [1] 64.81036 71.78021
# Plot!
title <- paste("10000 Simulations of Sample Mean Height Based on n =", n)
ggplot(samples, aes(x=mean)) +
  geom_histogram(binwidth = 1) +
  labs(x="Sample Mean xbar", title=title) + 
  geom_vline(xintercept=mu, col="red") + 
  geom_vline(xintercept=left_value, linetype="dashed") + 
  geom_vline(xintercept=right_value, linetype="dashed")
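To check that mu plus or minus 2 standard errors really does capture about 95% of the sample means, you can count them directly. The sketch below uses base R and a made-up population of heights (an assumption: normally distributed with mean 68.3 and sd 4 inches), not the OkCupid data, with replicate() and sample() standing in for do() and resample().

```r
set.seed(33)
# Made-up population of 60000 heights (inches); NOT the OkCupid data
population <- rnorm(60000, mean = 68.3, sd = 4)
mu <- mean(population)

n <- 5
# Base-R stand-in for do(10000) * mean(resample(..., replace=FALSE))
sample_means <- replicate(10000, mean(sample(population, size = n)))

# Standard error = sd of the sampling distribution
SE <- sd(sample_means)
left_value  <- mu - 2 * SE
right_value <- mu + 2 * SE

# Fraction of sample means that land between the two dashed lines
captured <- mean(sample_means >= left_value & sample_means <= right_value)
captured
```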

Lec32 - Thu 4/27: Back to Sampling

Let’s revisit the OkCupid profile data. Note the command xlim(c(50,80)) fixes the x-axis range to be between 50 and 80 inches.

  1. Discuss with your seatmates what all 4 code parts below are doing.
  2. Change the xintercept in the geom_vline() to be the true population mean \(\mu\)
  3. Try increasing n and repeating. What does this correspond to doing in real life?
  4. How does the histogram change?
  5. Describe using statistical language the role n plays when it comes to estimating \(\mu\).
library(ggplot2)
library(dplyr)
library(mosaic)
library(okcupiddata)
data(profiles)

n <- 5

# Parts 1 & 2:
resample(profiles$height, size=n, replace=TRUE)
mean(resample(profiles$height, size=n, replace=TRUE))

# Part 3:
samples <- do(10000) * mean(resample(profiles$height, size=n, replace=TRUE))
View(samples)

# Part 4:
ggplot(samples, aes(x=mean)) +
  geom_histogram(binwidth = 1) +
  labs(x="sample mean height") + 
  xlim(c(50,80)) +
  geom_vline(xintercept=50, col="red")

Discussion

Note:

  • The code originally had replace=TRUE, but we should use replace=FALSE. Think of a poll; once you’ve polled someone you shouldn’t call them again.
    • Question: Does the difference of sampling with vs without replacement matter?
    • Answer: Not for relatively small samples for a population as large as this one (~60K users); think back to the stadium vs room question from PS-08.
  • To calculate \(\mu\), there was the issue of missing data, so you had to run mean(profiles$height, na.rm=TRUE) to get 68.29 inches ≈ 5’8’’ ≈ 173cm. But beware:
    • If there is a systematic reason why some values are missing (like really short individuals being reluctant to post their height), then removing these missing values will bias your results.
    • If there is no systematic reason, then removing them won’t bias your results.
    • In this case, running sum(is.na(profiles$height)) will show there are only 3 missing values, so ignoring them doesn’t influence the results much.

Learning Checks:

  1. Parts
    • 1 & 2: Demonstrate sampling n people from the population and computing their sample mean height \(\overline{x}\)
    • 3: Does Parts 1 & 2 many, many, many times
    • 4: Plots these so we can observe the effect of sampling variability on \(\overline{x}\)
  2. See plot below faceted for n=5, 50, 100
  3. This corresponds to sampling a larger number of people from the population.
  4. See below; As you increase the sample size n, the histogram of \(\overline{x}\) narrows.
  5. As you increase the sample size n
    • The sampled \(\overline{x}\) vary less about/around \(\mu\)
    • The \(\overline{x}\)’s estimate \(\mu\) more precisely.
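The narrowing can be put into numbers by computing the standard deviation of the simulated sample means for each n. A base-R sketch on a made-up population (an assumption: normal heights with mean 68.3 and sd 4 inches, not the OkCupid data); the helper sd_of_sample_means is hypothetical.

```r
set.seed(32)
# Made-up population of heights (inches); NOT the OkCupid data
population <- rnorm(60000, mean = 68.3, sd = 4)

# Hypothetical helper: sd of 2000 simulated sample means for a given n
sd_of_sample_means <- function(n) {
  sd(replicate(2000, mean(sample(population, size = n))))
}

# Spread of the sampling distribution for each sample size
spread <- sapply(c(5, 50, 100), sd_of_sample_means)
names(spread) <- c("n=5", "n=50", "n=100")
spread
```

The spread shrinks as n grows, which is the numerical version of the histogram getting narrower.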

Lec29 - Thu 4/20: Permutation Test

  1. Compute the p-value from Learning Check below.
  2. Start answering the question “Did Econ majors do better on the intro stats final than non-Econ majors?” This will be a question on Problem Set 10.

Discussion

1. Refresher

Running the code from Lec28 Learning Check discussion below, we get the following graph:

Recall

  • Since \(H_A: \mu_O - \mu_E \neq 0\), more extreme can mean both
    1. A more extreme negative difference than -0.073 OR
    2. A more extreme positive difference than 0.073
  • The p-value is the probability of being in the blue areas below
  • In other words: what proportion of the 1000 simulations ended up in the blue areas:

2. Computing the p-Value

We mutate() a new variable based on this using the OR operator |:

simulations <- simulations %>% 
  mutate(more_extreme = difference < -0.073  | difference > 0.073)

For example, here are the first 6 of 1000 simulations that assumed \(H_0: \mu_O - \mu_E = 0\).

even odd difference more_extreme
834 0.68 0.74 0.06 FALSE
365 0.71 0.71 0.00 FALSE
509 0.70 0.72 0.02 FALSE
264 0.70 0.72 0.02 FALSE
39 0.74 0.65 -0.09 TRUE
715 0.71 0.70 -0.01 FALSE

Let’s count the number that are “more extreme” by summing the variable more_extreme:

simulations %>% 
  summarise(n_more_extreme = sum(more_extreme)) 
##   n_more_extreme
## 1            263

So 263 times out of 1000 we observed a simulated test statistic either

  1. less than -0.073
  2. greater than 0.073

The p-value is 0.264. Note that we add one to the numerator to include the actually observed difference of means of -0.073.

\[ \mbox{p-value} = \frac{263+1}{1000} = 0.264. \]
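The same calculation can be wrapped in a small function. This is a sketch, not the class code: the helper two_sided_p_value is a hypothetical name, and it follows the convention above of adding one to the numerator only.

```r
# Hypothetical helper: two-sided permutation p-value, counting the observed
# difference itself as one extra "extreme" outcome in the numerator
two_sided_p_value <- function(sim_differences, observed_diff) {
  n_more_extreme <- sum(abs(sim_differences) > abs(observed_diff))
  (n_more_extreme + 1) / length(sim_differences)
}

# The 6 simulated differences shown in the table above
six_diffs <- c(0.06, 0.00, 0.02, 0.02, -0.09, -0.01)
flags <- abs(six_diffs) > 0.073
flags

# Reproducing the arithmetic for all 1000 simulations
(263 + 1) / 1000
```

Using abs() is just a compact way of writing the OR condition: a difference is more extreme if it is below -0.073 or above 0.073.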

Lec28 - Wed 4/19: Constructing the Null Hypothesis

Copy the following code into a .R script in RStudio

library(ggplot2)
library(dplyr)
library(mosaic)

# Load grades.csv into R:


# Step 0: Exploratory Data Analysis ---------------------------------------

# Before doing any statistical testing, always do an Exploratory Data Analysis.
# The answer might be so stinking obvious, you don't need to use stats.

# ALWAYS View() your data
View(grades)

# Learning Check 1: Create a visualization that attempts to graphically answer
# the question of "Did students with an even number of letters in their last
# name do better on the final exam than those who didn't?" Can you answer the
# question? Write your code below:



# Step 1: Compute the Observed Difference ---------------------------------

# Let's compute the observed difference in means. i.e. what really happened. But
# first, two sidenotes:

# Sidenote 1.1: Wrapper Functions ---------------------------------------------

# The following two bits of code produce the exact same results, but formatted
# differently: the difference in mean final scores

# Bit 1: Using the dplyr tools
grades %>%
  group_by(even_vs_odd) %>%
  summarise(mean=mean(final))

# Bit 2: Using the wrapper function
mean(final ~ even_vs_odd, data=grades)

# Because the wrapper function does the same task, but in much more succinct
# fashion, we'll use the Bit 2 approach!

# Sidenote 1.2: Computing differences ----------------------------------------

# We introduce the diff() function to take the difference of two values stored
# in a vector. Watch out for the order!
c(1, 3)
c(1, 3) %>% diff()
c(3, 1) %>% diff()

# Back to Step 1: ---------------------------------------------

# Now let's take the difference in means and compute the difference. Note the
# order of the subtraction! odd-even
mean(final ~ even_vs_odd, data=grades)
mean(final ~ even_vs_odd, data=grades) %>% diff()

# Students with an odd number of letters did on average 7.3% worse on the final
# than those with an even number! But is this difference of 7.3% statistically
# significant? Or is it just due to random chance? This is where statistics
# comes in.

# Assign this difference to observed_diff. We will use this later.
observed_diff <- mean(final ~ even_vs_odd, data=grades) %>% diff()
observed_diff



# Step 2: Simulate the Null Distribution ------------------------------------

# Recall that assuming the null hypothesis, we can permute/shuffle the variable
# even_vs_odd and it doesn't matter! A sidenote first on using shuffle() as our
# simulation tool

# Sidenote 2.1: shuffle() ---------------------------------------------
shuffled_grades <- grades %>%
  mutate(even_vs_odd = shuffle(even_vs_odd))

# Compare the two. What is different about them?
View(grades)
View(shuffled_grades)

# This is one simulated shuffle, assuming H0 is true
mean(final ~ shuffle(even_vs_odd), data=grades)

# But think to the lady tasting tea, we need to do this many, many, many times
# to get a sense of the typical random behavior! i.e. the null distribution
# We do() the shuffle many, many, many times.
simulations <- do(10000) * mean(final ~ shuffle(even_vs_odd), data=grades)

# Let's look at the contents:
View(simulations)

# Now for each of the 10000 shuffles, let's compute the difference MAKING SURE
# it matches the order from when we computed the observed_diff. i.e. odd-even
# and NOT even-odd
simulations <- simulations %>%
  mutate(difference=odd-even)
simulations



# Learning Checks:
# 1. Plot the results by changing the three SOMETHINGS with the appropriate
# "something"s
ggplot(data=SOMETHING , aes(x=SOMETHING)) +
  geom_SOMETHING() +
  geom_vline(xintercept = SOMETHING, col="red")

# 2. What is your answer to the question "Did students with an even number of
# letters in their last name do better on the final exam than those who didn't?"

# 3. Why is the "null distribution" centered where it is?

Discussion

After loading the necessary packages and the grades.csv spreadsheet into RStudio:

Step 0: Exploratory Data Analysis

ggplot(grades, aes(x=even_vs_odd, y=final)) + 
  geom_boxplot()

There seems to be a slight difference in the median test scores between the even and odds, but is this difference statistically significant? Note that while we could’ve also done faceted histograms, boxplots allow us to compare groups with a single horizontal line!

Step 1: Compute the Observed Test Statistic

mean(final ~ even_vs_odd, data=grades)
##      even       odd 
## 0.7323734 0.6595197
mean(final ~ even_vs_odd, data=grades) %>% diff()
##         odd 
## -0.07285366
observed_diff <- mean(final ~ even_vs_odd, data=grades) %>% diff()

We observe a difference of -0.073 = -7.3%. Note above that diff() does odd-even i.e. \(\overline{x}_O - \overline{x}_E\). While choosing between odd-even or even-odd is inconsequential, it is important to stay consistent. Note: this is the reverse of what is in the class notes.

Step 2: Simulate the Null Distribution

Recall that assuming the null hypothesis, we can permute/shuffle the variable even_vs_odd and it doesn’t matter! In the interest of time, let’s only do 1000 simulations, not 10000.

Crucial: Note in the mutate() we do odd-even and not even-odd to stay consistent.

simulations <- do(1000) * mean(final ~ shuffle(even_vs_odd), data=grades)
simulations <- simulations %>%
  mutate(difference=odd-even)

Let’s look at the first 6 rows of 1000:

Now let’s plot the 1000 simulated differences in average test scores \(\overline{x}_O - \overline{x}_E\):

ggplot(data=simulations , aes(x=difference)) +
  geom_histogram(binwidth=0.025) +
  labs(x="Avg of Odds - Avg of Evens")

Step 3: Compare Null Distribution to Observed Test Statistic

Recall from Step 1, the observed difference in average test scores \(\overline{x}_O - \overline{x}_E\) was -0.073 = -7.3%, which was saved in observed_diff. Let’s draw a red line on the null distribution! How likely is this to occur?

ggplot(data=simulations , aes(x=difference)) +
  geom_histogram(binwidth=0.025) +
  labs(x="Avg of Odds - Avg of Evens") +
  geom_vline(xintercept = observed_diff, col="red")

  • LC: What is your answer to the question “Did students with an even number of letters in their last name do better on the final exam than those who didn’t?” IMO: if there were no difference between the two groups, observing a difference of -0.073 is not implausible. So we see no reason to say \(H_0\) is false.
  • LC: Why is the “null distribution” centered where it is? Because in our hypothetical universe of no true difference between even and odds, the typical difference in means \(\overline{x}_O - \overline{x}_{E}\) is 0.
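For anyone curious what shuffle() and do() are doing under the hood, the whole permutation procedure can be sketched in base R. The data below are made up (uniform random scores, not grades.csv), and the helper diff_in_means is a hypothetical name.

```r
set.seed(28)
# Made-up grades; NOT the class grades.csv
toy_grades <- data.frame(
  even_vs_odd = rep(c("even", "odd"), times = c(20, 20)),
  final       = runif(40, min = 0.4, max = 1)
)

# Hypothetical helper: mean(odd) - mean(even), matching the odd-even order
diff_in_means <- function(scores, labels) {
  mean(scores[labels == "odd"]) - mean(scores[labels == "even"])
}

observed <- diff_in_means(toy_grades$final, toy_grades$even_vs_odd)

# sample() with no size argument permutes the labels, like shuffle()
null_diffs <- replicate(
  1000,
  diff_in_means(toy_grades$final, sample(toy_grades$even_vs_odd))
)

# Close to 0: the null distribution is centered at "no difference"
mean(null_diffs)
```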

Lec26 - Fri 4/14: p-Values

Say you are given a data set where the rows correspond to students who took a test and the columns are two variables:

  • Even or odd number of letters in last name
  • Test score

For example, the first three rows of such a data set might look like:

id num_letters test_score
1 odd 0.7
2 even 0.6
3 odd 0.8

You want to answer the scientific question: is there a difference in test score between people with an even number of letters in their last name vs people with an odd number of letters in their last name?

Questions:

  1. Identify all 5 components of the hypothesis testing framework to answer this question
  2. Concept: If there truly is no difference, then what can you do to the data set?

Discussion

Five Components

  1. Null Hypothesis: \(H_0\): no difference in test scores between odd vs even
  2. Alternative Hypothesis: \(H_A\): there is a difference
  3. Test Statistic: the mean test score of odd MINUS the mean test score of even
  4. Observed Test Statistic: the difference in sample means \(\overline{x}_O - \overline{x}_E\)
  5. Null distribution: the typical behavior of the test statistic assuming \(H_0\) is true. That way we can say how likely/unlikely the observed test statistic is. We need to construct this. But how?

Assuming \(H_0\) Allows You To…

All hypothesis testing assumes the null hypothesis is true. In our case:

  1. We assume no difference in test scores between evens and odds
  2. So for each student, it doesn’t matter if they have even or odd
  3. In other words, the variable num_letters is meaningless
  4. If num_letters is meaningless, then we can permute its values to no consequence

Thus assuming \(H_0\) is true, the above observed data is the same as the following permuted data

id num_letters test_score
1 odd 0.7
2 even 0.6
3 odd 0.8

which is the same as the following permuted data

id num_letters test_score
1 odd 0.8
2 odd 0.7
3 even 0.6
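With only three students, we can list every distinct way of relabeling num_letters and compute the test statistic (mean of odd minus mean of even) for each. A minimal sketch using the three test scores from the example:

```r
test_score <- c(0.7, 0.6, 0.8)

# The three distinct ways to assign one "even" and two "odd" labels
labelings <- list(
  c("odd", "even", "odd"),   # the observed labels
  c("even", "odd", "odd"),
  c("odd", "odd", "even")
)

# Test statistic: mean(odd) - mean(even) under each labeling
stat <- sapply(labelings, function(labels) {
  mean(test_score[labels == "odd"]) - mean(test_score[labels == "even"])
})
stat
```

These three values are the entire null distribution for this tiny data set; with real data there are far too many relabelings to list, which is why we shuffle at random many, many, many times instead.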

Lec22 - Fri 4/7: Lady Tasting Tea

Scenario

  • Say you are a statistician and you meet someone called the “Lady Tasting Tea.”
  • She claims she can tell by tasting whether the tea or the milk was added first to a cup.
  • You want to test whether
    • She can actually tell which came first
    • She’s lying and is really guessing at random
  • Say you have just enough tea/milk to pour into 8 cups.

1. Background Questions

  1. What principles of experimental design should you follow in your test?

Now let’s suppose/assume she really can’t tell i.e. she is guessing at random:

  1. What is the probability she guesses one cup right?
  2. If you are counting the number of guesses out of 8 she gets correct, what are the possible outcomes?
  3. What’s more likely? That she gets 4 correct or gets 7 correct?

2. Coding Questions

Continuing to suppose/assume she really can’t tell i.e. is guessing at random, how would you use the mosaic (in particular the 4 functions), dplyr, and ggplot2 packages

  1. To simulate the Lady Tasting Tea making a guess for a particular cup?
  2. To simulate the Lady Tasting Tea guessing 8 times?
  3. To count the number she got correct out of 8?
  4. To repeat the above procedure many, many, many times? Say 10000 times.
  5. To visualize the statistical distribution of the number out of 8 she gets correct?

Create a new R Script (File -> New File -> R Script) and copy the following starter code:

library(ggplot2)
library(dplyr)
library(mosaic)

# Single cup outcome, where
#   1 indicates correct
#   0 indicates incorrect
single_cup_outcome <- c(1, 0)

Discussion

library(ggplot2)
library(dplyr)
library(mosaic)

# Single cup outcome, where
#   1 indicates correct
#   0 indicates incorrect
single_cup_outcome <- c(1, 0)

# 8 guesses:
do(8) * resample(single_cup_outcome, size=1)
resample(single_cup_outcome, size=8)

# 8 guesses, many, many, many times:
simulation <- do(10000) * resample(single_cup_outcome, size=8)
View(simulation)

# Count the number right by adding the columns. Note
# summarise(sum=sum(variable)) sums one column down its rows; it does not add
# across columns within a row, which is what we need here:
simulation <- simulation %>% 
  mutate(n_correct = V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8) 
View(simulation)

# Visualize:
ggplot(simulation, aes(x=n_correct)) + 
  geom_bar() +
  labs(x="Number of Guesses Correct", title="Assuming she is guessing at random")
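The simulated bar chart can be compared with exact probabilities. Assuming, as the simulation does, that each of the 8 guesses is an independent 50/50 coin flip, the number correct follows a binomial distribution, which base R computes with dbinom():

```r
# Probability of each possible number correct out of 8, assuming each guess
# is an independent 50/50 coin flip
n_cups <- 8
exact_probs <- dbinom(0:n_cups, size = n_cups, prob = 0.5)
names(exact_probs) <- 0:n_cups
round(exact_probs, 4)

# How many times more likely is 4 correct than 7 correct?
exact_probs["4"] / exact_probs["7"]
```

Multiplying these probabilities by 10000 gives roughly the bar heights you should expect in the plot above.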

Lec21 - Thu 4/6: Confounding Variables and Designed Experiments

Fried Chicken Face Off

Do people prefer this? Or this?
[Images of the two contenders: Ezell’s and KFC]

How would you design a taste test to ascertain, independent of hype, which fried chicken tastes better? Use the relevant principles of designing experiments from above.

Design Principles Put in Place:

  • Single (but not double) blinded: The taster doesn’t know which (Ezell’s or KFC) chicken they are eating, but the server does.
  • Randomizing:
    • Which order of chicken you eat: KFC first or not
    • Which kind of meat (wing, breast, leg) between tasters. Each taster would try two kinds of meat.
  • Accounting for:
    • Which kind of meat within a taster. Ex: if you eat a KFC wing, you will necessarily eat an Ezell’s wing
    • Temperature: We picked a place that is central to both Ezell’s and KFC, given the cooling down of the chicken that can occur during travel.
    • Kind of batter: We can’t do KFC crispy chicken b/c Ezell’s doesn’t have that type of batter. This is a limitation of the study b/c some feel the crispy chicken is better.
  • Replicates: Just one replicate of each kind of meat due to finite budget and finite stomach space.

Results:

Final score: KFC 8, Ezell’s 4. Some notes:

  • Even though people were “blinded”, most knew which of the two pieces was from KFC.
  • People generally felt
    • The meat from Ezell’s was better, and this was magnified as the chicken went cold.
    • The skin was better at KFC. Given that fried chicken is what it is b/c of the skin, people voted for KFC.
  • Future studies should
    • Consider the chicken and the skin separately.
    • Have “overall experience” scores.
    • Block tasters into two groups first: those who’ve had Ezell’s before and those who haven’t.
  • This can be viewed as an example of a pilot study used to inform how to design a study appropriately.

Lec20 - Wed 4/5: Introduction to Sampling

Example Code

Go over this code first:

# Load packages
library(dplyr)
library(ggplot2)
library(mosaic)

# Define vector to sample from
fruit <- c("apple", "orange", "mango")

# 1. Shuffling works with do()
# Do this many times to get a feel for it:
do(5) * shuffle(fruit)

# 2. Shuffling works with mutate as well:
example_data <- tibble(
  name = c("Ilana", "Abbi", "Hannibal"),
  fruit = c("apple", "orange", "mango")
)
# Do this many times to get a feel for it:
example_data %>% 
  mutate(fruit = shuffle(fruit))

# 3. Testing the various inputs. Discuss with your peers what each is doing:
resample(fruit, size=1)
resample(fruit, replace=FALSE)
resample(fruit, prob=c(0.495, 0.495, 0.01))

Learning Checks 1-5

  1. Rewrite rflip(10) using the resample() command. Hint: coin <- c("H", "T")
  2. Rewrite shuffle(fruit) command by changing the minimal number of default settings of resample(). Test this on fruit
  3. Write code that will allow you to generate a sample of 15 fruit without replacement.
  4. Write code that will allow you to generate a sample of 15 fruit with replacement.
  5. What’s the fastest way to do the above 5 times? Write it out

Learning Check

A medical doctor pores over some of his patients’ medical records and observes:

People who do this: [drawing of sleeping with shoes on]. Wake up with this: [drawing of a headache].

He then asserts the following causal relationship:

  • Explanatory AKA treatment variable: sleeping with shoes on
  • Response variable: waking up with a headache

What’s wrong with the doctor’s logic? What is really going on?

Discussion

fruit <- c("apple", "orange", "mango")

# LC1: rflip(10)
coin <- c("H", "T")
resample(coin, size=10)
rflip(10)

# LC2: shuffle(fruit)
resample(fruit, replace=FALSE)
shuffle(fruit)

# LC3: The following yields an error. You can't sample more elements without
# replacement than there are in the vector. In other words, the largest sample
# without replacement of fruit is of size 3.
resample(fruit, size=15, replace=FALSE)

# LC4: Note the following two commands are the same because of the way the
# defaults are set:
resample(fruit, size=15)
resample(fruit, size=15, replace=TRUE)

# LC5: Fastest way to repeat 5 times. Use do()!
do(5) * resample(fruit, size=15)
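
The same five answers can also be sketched in base R, which makes the defaults explicit. This is only an analogue under stated assumptions, not mosaic's own implementation: base R's sample() defaults to replace = FALSE (whereas resample() defaults to replace = TRUE), and replicate() stands in for do(). The seed value is arbitrary.

```r
# Base R sketch of LC1-LC5 (no mosaic required)
set.seed(76)  # arbitrary seed, only so the results are reproducible

fruit <- c("apple", "orange", "mango")
coin <- c("H", "T")

# LC1: ten coin flips, i.e. sampling WITH replacement
flips <- sample(coin, size = 10, replace = TRUE)

# LC2: a shuffle is a full-size sample WITHOUT replacement
shuffled <- sample(fruit, size = length(fruit), replace = FALSE)

# LC3: asking for 15 out of 3 without replacement fails; capture the error
lc3 <- tryCatch(
  sample(fruit, size = 15, replace = FALSE),
  error = function(e) "error"
)

# LC4: 15 fruit with replacement works fine
fifteen <- sample(fruit, size = 15, replace = TRUE)

# LC5: repeat 5 times; each column of the result is one repetition
five_times <- replicate(5, sample(fruit, size = 15, replace = TRUE))
dim(five_times)
```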

As for our doctor:

  • Shoes do not cause headaches.
  • Alcohol is acting as a confounding variable.

Lec19 - Mon 4/3: Intro to Probability via Simulation

Example Code

Go over this code first:

# Load packages
library(dplyr)
library(ggplot2)

# New package
library(mosaic)

# Flip a coin once. Try this multiple times:
rflip()

# Flip a coin 10 times. Try this multiple times:
rflip(10)

# Flip a coin 10 times, but do this 5 times. Try this multiple times
do(5) * rflip(10)

# Flip a coin 10 times, but do this 500 times
do(500) * rflip(10)

# Gah! There are too many rows!
simulations <- do(500) * rflip(10)

# Convert to data frame format; this allows us to better view in console
simulations <- simulations %>% 
  as_tibble()

# We could also View() it
View(simulations)

Learning Checks

  • LC1: Create a histogram of the number of heads, illustrating the long-run behavior of flipping a coin 10 times.
    • Where is it centered?
    • Describe the shape of the distribution of values
  • LC2: Try to replicate the above, but for the sum of two die rolls. Hint: resample(c(1:6), 2)

Discussion

LC1

coin_flips <- do(500) * rflip(10)
coin_flips <- coin_flips %>% 
  as_tibble()

If we View(coin_flips), the first 6 rows show that the data are in tidy format:

n heads tails prop
10 8 2 0.8
10 4 6 0.4
10 6 4 0.6
10 4 6 0.4
10 8 2 0.8
10 6 4 0.6

So we plot a histogram of the heads variable with binwidth=1 since we are dealing with integers i.e. whole numbers.

ggplot(coin_flips, aes(x=heads)) +
  geom_histogram(binwidth = 1)

  • Where is it centered? Answer: At 5 i.e. half of 10.
  • Describe the shape of the distribution of values. Answer: bell-shaped, i.e. like a Normal distribution.
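
The center and shape seen in the histogram can be checked against exact probabilities. A minimal sketch, assuming a fair coin so that the number of heads in 10 flips follows a Binomial(10, 0.5) distribution; dbinom() and rbinom() are base R functions rather than part of mosaic, and the seed is arbitrary:

```r
# Exact probabilities of 0, 1, ..., 10 heads in 10 fair flips
heads <- 0:10
probs <- dbinom(heads, size = 10, prob = 0.5)

# Expected number of heads: sum of value * probability
expected_heads <- sum(heads * probs)
expected_heads

# The distribution is symmetric about its center
all.equal(probs, rev(probs))

# A simulation analogous to do(500) * rflip(10), using rbinom()
set.seed(76)  # arbitrary seed for reproducibility
sim_heads <- rbinom(500, size = 10, prob = 0.5)
mean(sim_heads)
```

The simulated mean will not be exactly equal to the expected value, but with 500 repetitions it should land close to it.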

LC2

Let’s unpack resample(c(1:6), 2):

  • Running c(1:6) in the console returns six values, 1 2 3 4 5 6, one for each possible die roll value.
  • resample(c(1:6), 2) says: sample a value from 1 to 6 twice. This is akin to rolling a die twice.
two_dice <- do(500) * resample(c(1:6), 2)
two_dice <- two_dice %>% 
  as_tibble()

If we View(two_dice), the first 6 rows show that the data are in tidy format:

V1 V2
5 4
2 3
1 5
4 3
5 4
6 2

So to get the sum of the two dice, we mutate() a new variable sum based on the sum of the two dice:

two_dice <- two_dice %>% 
  mutate(sum = V1 + V2)
V1 V2 sum
5 4 9
2 3 5
1 5 6
4 3 7
5 4 9
6 2 8

And now we plot it:

ggplot(two_dice, aes(x=sum)) +
  geom_histogram(binwidth=1)

Advanced

What’s the deal with the ugly axes tick marks? This is again b/c computers are stupid, and ggplot does not know we are dealing only with whole numbers i.e. integers. We can:

  • Convert the sum variable from numerical to categorical using as.factor(sum)
  • Then plot using geom_bar() (for a categorical x-variable) instead of geom_histogram()
ggplot(two_dice, aes(x=as.factor(sum))) +
  geom_bar()
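
For comparison with the simulated histogram, the exact distribution of the sum of two dice can be tabulated directly. A small base R sketch, assuming two fair six-sided dice so that all 36 outcomes are equally likely:

```r
# All 36 equally likely outcomes of two fair dice, as a 6 x 6 table of sums
sums <- outer(1:6, 1:6, "+")

# Exact probability of each sum from 2 to 12
exact <- table(sums) / 36
exact

# The most likely sum
most_likely <- names(exact)[which.max(exact)]
most_likely
```

The simulated barplot should have roughly this triangular shape, peaking in the middle.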

Lec17 - Wed 3/22: Intro to Statistical Inference

For each of the following 4 scenarios

  1. Identify
    • The population of interest and if applicable the population parameter
    • The sample used and if applicable the statistic
  2. Comment on the representativeness/generalizability of the results of the sample to the population.

Scenario 1

The Royal Air Force wants to study how resistant their airplanes are to bullets. They study the bullet holes on all the airplanes on the tarmac after an air battle against the Luftwaffe (German Air Force).

  1. Statements about population
    • population: ALL Royal Air Force airplanes
    • population parameter: It wasn’t explicitly defined here, but imagine some aircraft engineering measure of resistance/strength.
  2. Statements about sample
    • sample: Only the airplanes that returned from an air battle
    • statistic: The same measure above, but applied only to the returning aircraft.
  3. Representativeness/generalizability:
    • The sample suffers from survivor’s bias, i.e. only planes that didn’t get shot down are included in your sample. You don’t have information on the more important cases of when planes do get shot down. Abraham Wald, a statistician working in the United States during World War II, suggested that they reinforce the parts of the planes where bullet holes were not present, since planes hit in those places tended not to make it back.
    • Also, this was a fight only against the German Air Force. Perhaps the Italian and Japanese Air Forces used different bullets, but we don’t have a sample representative of these groups.

Scenario 2

You want to know the average income of Middlebury graduates in the last 10 years. So you get the records of 10 randomly chosen Midd Kids. They all answer and you take the average.

  1. Statements about population
    • population: All Middlebury graduates in the last ten years.
    • population parameter: The population mean \(\mu\) (Greek letter “mu”): the average income of all these graduates.
  2. Statements about sample
    • sample: The 10 chosen Midd Kids
    • statistic: The sample mean \(\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i\) of their incomes.
  3. Representativeness/generalizability:
    • While the sample size is small (i.e. our estimate will be highly variable and so not very precise), it is still representative (i.e. still accurate). We’ll see that accuracy and precision are different concepts.
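
The difference between accuracy and precision can be illustrated with a small simulation. This is a sketch under loud assumptions: the population of incomes is entirely made up (6000 values drawn from a log-normal distribution), sample_means() is a hypothetical helper written for this illustration, and the seed is arbitrary.

```r
set.seed(76)  # arbitrary seed for reproducibility

# Hypothetical population of 6000 graduate incomes (made-up numbers, in dollars)
population <- rlnorm(6000, meanlog = 11, sdlog = 0.5)
mu <- mean(population)

# Hypothetical helper: draw many random samples of size n, record each sample mean
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, size = n)))
}
means_10 <- sample_means(10)
means_1000 <- sample_means(1000)

# Accuracy: both sets of sample means are centered near mu
c(mu = mu, n10 = mean(means_10), n1000 = mean(means_1000))

# Precision: the n = 10 sample means vary much more from sample to sample
c(n10 = sd(means_10), n1000 = sd(means_1000))
```

Random sampling keeps both estimates centered on the truth; the sample size only controls how much an individual estimate bounces around.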

Scenario 3

Imagine it’s 1993 i.e. almost all households have landlines. You want to know the average number of people in each household in Middlebury. You randomly pick out 500 phone numbers from the phone book and conduct a phone survey.

  1. Statements about population
    • population: All Middlebury households
    • population parameter: the population mean \(\mu\): average number of people in a household
  2. Statements about sample
    • sample: Of the 500 households chosen, those who answer the phone
    • statistic: The sample mean \(\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i\) of the number of people in the households.
  3. Representativeness/generalizability:
    • We assumed that all households have landlines, so the real issue of poorer individuals not having phones is not in question here.
    • Rather, households with larger numbers of people are more likely to have at least one person at home, and thus someone able to pick up the phone. Our results will be biased towards larger households.
    • One way to address this is to keep trying until every house on your list picks up. But especially for single young professionals, this might be very hard to do.
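
The bias towards larger households can be sketched with a simulation. Everything here is an assumption made for illustration: the town of 5000 households, the shares of each household size, and the rule that each person is home with probability 0.3 independently of the others.

```r
set.seed(76)  # arbitrary seed for reproducibility

# Hypothetical town of 5000 households with 1 to 6 people each (made-up shares)
household_size <- sample(1:6, size = 5000, replace = TRUE,
                         prob = c(0.25, 0.30, 0.20, 0.15, 0.07, 0.03))
mu <- mean(household_size)

# Randomly pick 500 households from the phone book
called <- sample(household_size, size = 500)

# Assumption: each person is home with probability 0.3, independently,
# so a household of size k answers with probability 1 - 0.7^k
answers <- runif(500) < 1 - 0.7^called
x_bar <- mean(called[answers])

# The mean among responders tends to overshoot the population mean
c(population_mean = mu, responders_mean = x_bar)
```

Even though the 500 phone numbers were chosen at random, the households that actually respond are not a random subset of them.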

Scenario 4

You want to know the prevalence of illegal downloading of TV shows among Middlebury students. You get the emails of 100 randomly chosen Midd Kids and ask them “How many times did you download a pirated TV show last week?”

  1. Statements about population
    • population: Current Midd Kids
    • population parameter: the population proportion \(p\) of Midd Kids who downloaded a pirated TV show last week.
  2. Statements about sample
    • sample: The 100 randomly chosen Midd kids
    • statistic: The sample proportion \(\widehat{p}\) of Midd Kids who self-report to have done so
  3. Representativeness/generalizability:
    • This study could suffer from volunteer bias, where different people might have different probabilities of willingness to report the truth. Since we are asking Midd Kids to fess up to illegal activity, your results might get skewed.

Lec15 - Fri 3/17: 5MV#5 arrange() & _join

In ModernDive, LC5.13 thru 5.17 in Chapters 5.2.5-5.5.3.

Discussion

LC 5.13-5.14

  • 5.13 Looking at Figure 5.7, when joining flights and weather, or in other words matching the hourly weather values with each flight, why do we need to join by all of year, month, day, hour, and origin, and not just hour? Because hour is simply a value between 0 and 23; to identify a specific hour, we need to know which year, month, day, and at which airport.
  • 5.14 What surprises you about the top 10 destinations from NYC in 2013? Subjective! What surprises me is the high number of flights to Boston. Wouldn’t it be easier and quicker to take the train?
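
The point about joining on the full key can be seen on toy data. A minimal sketch with made-up values, using base R's merge() as a stand-in for the dplyr join functions; year is left out of the toy tables because every row would be 2013.

```r
# Toy stand-ins for flights and weather (made-up values)
toy_flights <- data.frame(
  month = c(1, 1), day = c(1, 2), hour = c(8, 8),
  origin = c("EWR", "JFK"), flight = c(11, 7)
)
toy_weather <- data.frame(
  month = c(1, 1, 1, 1), day = c(1, 1, 2, 2), hour = c(8, 8, 8, 8),
  origin = c("EWR", "JFK", "EWR", "JFK"), temp = c(39, 40, 28, 30)
)

# Joining on hour alone pairs every flight with every 8am weather record
by_hour <- merge(toy_flights, toy_weather, by = "hour")
nrow(by_hour)

# Joining on the full key gives exactly one weather record per flight
by_all <- merge(toy_flights, toy_weather,
                by = c("month", "day", "hour", "origin"))
nrow(by_all)
```

With hour as the only key, each flight is matched to weather from the wrong day and the wrong airport as well as the right one.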

LC 5.15-5.17

5.15 What are some ways to select all three of the dest, air_time, and distance variables from flights? Give the code showing how to do this in at least three different ways.

library(dplyr)
library(nycflights13)
# The regular way:
flights %>% 
  select(dest, air_time, distance)

# Since they are sequential columns in the data set
flights %>% 
  select(dest:distance)

# Not as effective, by removing everything else
flights %>% 
  select(-year, -month, -day, -dep_time, -sched_dep_time, -dep_delay, -arr_time,
         -sched_arr_time, -arr_delay, -carrier, -flight, -tailnum, -origin, 
         -hour, -minute, -time_hour)

5.16 How could one use starts_with, ends_with, and contains to select columns from the flights data frame? Provide three different examples in total: one for starts_with, one for ends_with, and one for contains.

# Anything that starts with "d"
flights %>% 
  select(starts_with("d"))
# Anything related to delays:
flights %>% 
  select(ends_with("delay"))
# Anything related to departures:
flights %>% 
  select(contains("dep"))

5.17 Why might we want to use the select function on a data frame? To narrow down the data frame, to make it easier to look at. Using View() for example.

Lec14 - Thu 3/16: 5MV#3 group_by() & 5MV#4 mutate()

In ModernDive, LC5.5 thru 5.12 in Chapters 5.2.3-5.2.4.

Discussion

LC 5.5-5.9

5.5 What does the standard deviation column in the summary_temp_by_month data frame tell us about temperatures in New York City throughout the year?

library(dplyr)
library(nycflights13)
summary_temp_by_month <- weather %>% 
  group_by(month) %>% 
  summarize(
          mean = mean(temp, na.rm = TRUE),
          std_dev = sd(temp, na.rm = TRUE)
          )
month mean std_dev
1 35.64127 10.185459
2 34.15454 6.940228
3 39.81404 6.224948
4 51.67094 8.785250
5 61.59185 9.608687
6 72.14500 7.603357
7 80.00967 7.147631
8 74.40495 5.171365
9 67.42582 8.475824
10 60.03305 8.829652
11 45.10893 10.502249
12 38.36811 9.940822

The standard deviation is a quantification of spread and variability. We see that the period in November, December, and January has the most variation in temperature, so you can expect very different temperatures on different days.

5.6 What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC?

summary_temp_by_day <- weather %>% 
  group_by(year, month, day) %>% 
  summarize(
          mean = mean(temp, na.rm = TRUE),
          std_dev = sd(temp, na.rm = TRUE)
          )
summary_temp_by_day

Note: group_by(day) is not enough, because day is a value between 1-31. We need to group_by(year, month, day)

5.7 Recreate by_monthly_origin, but instead of grouping via group_by(origin, month), group variables in a different order group_by(month, origin). What differs in the resulting data set?

by_monthly_origin <- flights %>% 
  group_by(month, origin) %>% 
  summarize(count = n())
month origin count
1 EWR 9893
1 JFK 9161
1 LGA 7950
2 EWR 9107
2 JFK 8421
2 LGA 7423
3 EWR 10420
3 JFK 9697
3 LGA 8717
4 EWR 10531
4 JFK 9218
4 LGA 8581
5 EWR 10592
5 JFK 9397
5 LGA 8807
6 EWR 10175
6 JFK 9472
6 LGA 8596
7 EWR 10475
7 JFK 10023
7 LGA 8927
8 EWR 10359
8 JFK 9983
8 LGA 8985
9 EWR 9550
9 JFK 8908
9 LGA 9116
10 EWR 10104
10 JFK 9143
10 LGA 9642
11 EWR 9707
11 JFK 8710
11 LGA 8851
12 EWR 9922
12 JFK 9146
12 LGA 9067

The difference is that they are organized/sorted by month first, then origin.

5.8 How could we identify how many flights left each of the three airports in each of the months of 2013?

We could summarize the count from each airport using the n() function, which counts rows.

count_flights_by_airport <- flights %>% 
  group_by(origin, month) %>% 
  summarize(count=n())
origin month count
EWR 1 9893
EWR 2 9107
EWR 3 10420
EWR 4 10531
EWR 5 10592
EWR 6 10175
EWR 7 10475
EWR 8 10359
EWR 9 9550
EWR 10 10104
EWR 11 9707
EWR 12 9922
JFK 1 9161
JFK 2 8421
JFK 3 9697
JFK 4 9218
JFK 5 9397
JFK 6 9472
JFK 7 10023
JFK 8 9983
JFK 9 8908
JFK 10 9143
JFK 11 8710
JFK 12 9146
LGA 1 7950
LGA 2 7423
LGA 3 8717
LGA 4 8581
LGA 5 8807
LGA 6 8596
LGA 7 8927
LGA 8 8985
LGA 9 9116
LGA 10 9642
LGA 11 8851
LGA 12 9067

All remarkably similar!

Note: the n() function counts rows, whereas the sum(VARIABLE_NAME) function sums all values of a certain numerical variable VARIABLE_NAME.
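
The difference between n() and sum() is easiest to see side by side. A minimal sketch on a toy data frame; the passengers variable and its values are made up for illustration and do not exist in flights.

```r
library(dplyr)

# Toy data: three flights with a made-up number of passengers on each
toy <- tibble(
  origin = c("EWR", "EWR", "JFK"),
  passengers = c(100, 150, 80)
)

result <- toy %>%
  group_by(origin) %>%
  summarize(
    count = n(),                  # number of rows in each group
    total_pax = sum(passengers)   # sum of a numerical variable in each group
  )
result
```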

5.9 How does the filter operation differ from a group_by followed by a summarize?

  • filter picks out rows from the original data set without modifying them, whereas
  • group_by %>% summarize computes summaries of numerical variables, and hence reports new values.

LC 5.10-5.12

5.10 What do positive values of the gain variable in flights correspond to? What about negative values? And what about a zero value?

  • Say a flight departed 20 minutes late, i.e. dep_delay=20
  • Then arrived 10 minutes late, i.e. arr_delay=10.
  • Then gain = arr_delay - dep_delay = 10 - 20 = -10 is negative, so it “made up time in the air”.

0 means the departure and arrival delays were the same, so no time was made up in the air. We see in most cases that the gain is near 0 minutes.
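
The three cases can be checked on a few made-up flights, using the definition of gain given in the bullets above (arrival delay minus departure delay):

```r
# Made-up delays (in minutes) for three flights
dep_delay <- c(20, 5, -3)
arr_delay <- c(10, 5, 12)

# gain as defined above: arrival delay minus departure delay
gain <- arr_delay - dep_delay
gain

# -1: made up time in the air; 0: no change; 1: lost additional time in the air
sign(gain)
```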

I never understood this. If the pilot says “we’re going to make up time in the air” because of a delay by flying faster, why don’t you always just fly faster to begin with?

5.11 Could we create the dep_delay and arr_delay columns by simply subtracting dep_time from sched_dep_time and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights?

No, because you can’t do direct arithmetic on times stored as numbers like 1203. The difference in time between 12:03 and 11:59 is 4 minutes, but 1203 - 1159 = 44.
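
One way to do the arithmetic properly is to convert each clock time to minutes since midnight first. A minimal sketch: hhmm_to_minutes() is a hypothetical helper written for this illustration (it is not part of dplyr or nycflights13), and it does not handle flights that cross midnight.

```r
# Hypothetical helper: convert a clock time stored as an HHMM number
# into minutes since midnight
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

# Naive subtraction treats 12:03 and 11:59 as ordinary numbers
naive <- 1203 - 1159
naive

# Converting first gives the elapsed minutes
elapsed <- hhmm_to_minutes(1203) - hhmm_to_minutes(1159)
elapsed
```

The integer division %/% pulls out the hours and the remainder %% pulls out the minutes.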

5.12 What can we say about the distribution of gain? Describe it in a few sentences using the plot and the gain_summary data frame values.

Most of the time the gain is a little under zero, and most gains fall between -50 and 50 minutes. There are some extreme cases however!

Lec13 - Wed 3/15: Piping, 5MV#1 filter() & 5MV#2 summarize()

In ModernDive, LC5.1 thru 5.4 in Chapters 5-5.2.2.

Discussion

LC 5.1

All the following are the same!

library(nycflights13)
library(dplyr)
data(flights)
# Original in book
not_BTV_SEA <- flights %>% 
  filter(!(dest == "BTV" | dest == "SEA"))

# Alternative way
not_BTV_SEA <- flights %>% 
  filter(!dest == "BTV" & !dest == "SEA")

# Or even
not_BTV_SEA <- flights %>% 
  filter(dest != "BTV" & dest != "SEA")
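
The equivalence of the three conditions can be verified on a toy vector without loading flights. A small base R sketch with made-up destinations; it relies on the fact that in R the `!` operator has lower precedence than `==`, so `!dest == "BTV"` is read as `!(dest == "BTV")`.

```r
# Toy destination vector (made-up values)
dest <- c("BTV", "SEA", "LAX", "ORD")

v1 <- !(dest == "BTV" | dest == "SEA")
v2 <- !dest == "BTV" & !dest == "SEA"   # `!` binds more loosely than `==`
v3 <- dest != "BTV" & dest != "SEA"

# All three give the same TRUE/FALSE pattern
identical(v1, v2)
identical(v2, v3)
v1
```

This is De Morgan's law in action: "not (A or B)" is the same as "(not A) and (not B)".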

LC 5.2-5.4

  • 5.2 Say a doctor is studying the effect of smoking on lung cancer of a large number of patients who have records measured at five year intervals. He notices that a large number of patients have missing data points because the patient has died, so he chooses to ignore these patients in his analysis. What is wrong with this doctor’s approach? The missing patients may have died of lung cancer! So to ignore them might seriously bias your results! It is very important to think of what the consequences on your analysis are of ignoring missing data! Ask yourself:
    • Is there a systematic reason why certain values are missing? If so, you might be biasing your results!
    • If there isn’t, then it might be ok to “sweep missing values under the rug.”
  • 5.3 Modify the above summarize function to use the n() summary function: summarize(count=n()). What does the returned value correspond to? It corresponds to a count of the number of observations/rows:
data(weather)
weather %>% 
  summarize(count = n())
  • 5.4 Why doesn’t the following code work? You may want to run the code line by line:
summary_temp <- weather %>%   
  summarize(mean = mean(temp, na.rm = TRUE)) %>% 
  summarize(std_dev = sd(temp, na.rm = TRUE))

Consider the output of only running the first two lines:

weather %>%   
  summarize(mean = mean(temp, na.rm = TRUE))

Because after the first summarize(), the variable temp disappears as it has been collapsed to the value mean. So when we try to run the second summarize(), it can’t find the variable temp to compute the standard deviation of.
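
One possible fix is to compute both summaries inside a single summarize() call. A minimal sketch on a toy stand-in for weather; the temperatures are made up so the example runs without nycflights13.

```r
library(dplyr)

# Toy stand-in for the weather data frame (made-up temperatures)
toy_weather <- tibble(temp = c(30, 40, 50, NA))

# After one summarize(), only the new column is left; temp is gone
after_first <- toy_weather %>%
  summarize(mean = mean(temp, na.rm = TRUE))
names(after_first)

# Fix: compute both summaries inside a single summarize()
summary_temp <- toy_weather %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
summary_temp
```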

Lec11 - Thu 3/9: 5NG#5 Barplots

In ModernDive, LC4.26 thru 4.37 in Chapters 4.7.

Discussion

Note on Wed March 15: The learning checks originally posted were from the previous version of the book, therefore the discussion below might differ slightly from what you wrote originally. The above link has been fixed.

LC 4.26-4.29

  • 4.26: Why are histograms inappropriate for visualizing categorical variables? Histograms are for continuous variables i.e. the horizontal part of each histogram bar represents an interval, whereas for a categorical variable each bar represents only one level of the categorical variable.
  • 4.27: What is the difference between histograms and barplots? See above.
  • 4.28: How many Envoy Air flights departed NYC in 2013? Envoy Air is carrier code MQ and thus 26397 flights departed NYC in 2013.
  • 4.29: What was the seventh highest airline in terms of departed flights from NYC in 2013? How could we better present the table to get this answer quickly? What a pain! We’ll see in Chapter 5 on Data Wrangling that applying arrange(desc(n)) will sort this table in descending order of n!

LC 4.30-4.31

  • 4.30: Why should pie charts be avoided and replaced by barplots? In my opinion, comparisons using horizontal lines are easier than comparing angles and areas of circles.
  • 4.31: What is your opinion as to why pie charts continue to be used? Legacy?

LC 4.32-4.37

  • 4.32 What kinds of questions are not easily answered by looking at the above figure? Because the red, green, and blue bars don’t all start at 0 (only red does), it makes comparing counts hard.
  • 4.33 What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights? The different airlines prefer different airports. For example, United is mostly a Newark carrier and JetBlue is a JFK carrier. If airlines didn’t prefer airports, each color would be roughly one third of each bar.
  • 4.34 Why might the side-by-side barplot be preferable to a stacked barplot in this case? We can easily compare the different airports for a given carrier using a single comparison line, i.e. things are lined up.
  • 4.35 What are the disadvantages of using a side-by-side barplot, in general? Hard to get totals for each airline.
  • 4.36 Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case? Not that different than using side-by-side; depends on how you want to organize your presentation.
  • 4.37 What information about the different carriers at different airports is more easily seen in the faceted barplot? Now we can also compare the different carriers within a particular airport easily too. For example, we can read off who the top carrier for each airport is easily using a single horizontal line.

Lec09 - Thu 3/2: 5NG#4 Boxplots

In ModernDive

Discussion

# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)

# Load weather data set in nycflights
data(weather)

LC 4.22-4.25

4.22: What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.

ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
  geom_boxplot()

It appears to be an outlier. Let’s revisit the use of the filter command to hone in on it. We want all data points where the month is 5 and temp<25

weather %>% 
  filter(month==5 & temp < 25)
origin year month day hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour
JFK 2013 5 9 2 13.1 12.02 95.34 80 8.05546 9.270062 0 1016.9 10 2013-05-08 21:00:00

There appears to be only one hour and only at JFK that recorded 13.1 F (-10.5 C) in the month of May. This is probably a data entry mistake! Why wasn’t the weather at least similar at EWR (Newark) and LGA (La Guardia)?

4.23: Which months have the highest variability in temperature? What reasons do you think this is?

We are now interested in the spread of the data. One measure some of you may have seen previously is the standard deviation. But in this plot we can read off the Interquartile Range (IQR):

  • The distance from the 1st to the 3rd quartiles i.e. the length of the boxes
  • You can also think of this as the spread of the middle 50% of the data

Just from eyeballing it, it seems

  • November has the biggest IQR, i.e. the widest box, so has the most variation in temperature
  • August has the smallest IQR, i.e. the narrowest box, so is the most consistent temperature-wise

Here’s how we compute the exact IQR values for each month (we’ll see this in more depth in Chapter 5 of the text):

  1. group the observations by month then
  2. for each group, i.e. month, summarise it by applying the summary statistic function IQR(), while making sure to skip over missing data via na.rm=TRUE then
  3. arrange the table in descending order of IQR
weather %>% 
  group_by(month) %>% 
  summarise(IQR = IQR(temp, na.rm=TRUE)) %>% 
  arrange(desc(IQR))
month IQR
11 16.02
12 13.68
1 12.96
9 12.06
4 12.06
5 11.88
6 10.98
10 10.98
2 10.08
7 9.18
3 9.00
8 7.02
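
The definition of the IQR as the distance from the 1st to the 3rd quartile can be checked by hand on a few values. A small base R sketch with made-up temperatures:

```r
# Made-up temperatures in degrees F
temps <- c(31, 35, 40, 44, 50, 52, 58, 61, 70)

# First and third quartiles
q <- quantile(temps, probs = c(0.25, 0.75))
q

# The IQR is the distance between them, which is what IQR() returns
by_hand <- unname(q[2] - q[1])
by_hand
IQR(temps)
```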

4.24: We looked at the distribution of a continuous variable over a categorical variable here with this boxplot. Why can’t we look at the distribution of one continuous variable over the distribution of another continuous variable? Say, temperature across pressure, for example?

Because we need a way to group many continuous observations together, say by grouping by month. For pressure, nearly every value is unique, i.e. there are no groups, so we can’t make boxplots.

4.25: Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?

In a histogram, the bin corresponding to where an outlier lies may not be high enough for us to see. In a boxplot, they are explicitly labelled separately.

Lec08 - Wed 3/1: 5NG#3 Histograms

In ModernDive, LC4.14 thru 4.21 in Chapters 4.5-4.6.

Discussion

# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)

# Load weather data set in nycflights
data(weather)

LC 4.14-4.17

4.14: What does changing the number of bins from 30 to 60 tell us about the distribution of temperatures?

ggplot(data = weather, aes(x = temp)) +
  geom_histogram(bins = 30)

ggplot(data = weather, aes(x = temp)) +
  geom_histogram(bins = 60)

The distribution doesn’t change much. But by refining the bin width, we see that the temperature data has a high degree of precision. What do I mean by precision? Looking at the temp variable by View(weather), we see that each temperature is recorded to 2 decimal places.

4.15: Would you classify the distribution of temperatures as symmetric or skewed?

It is rather symmetric, i.e. there are no long tails on only one side of the distribution

4.16: What would you guess is the “center” value in this distribution? Why did you make that choice?

The center is around 55°F. By running the summary() command, we see that the mean and median are very similar. In fact, when the distribution is perfectly symmetric the mean equals the median.
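
The link between symmetry and the mean/median agreement can be seen on a handful of made-up values. A small base R sketch:

```r
# A perfectly symmetric set of values: mean and median agree
symmetric <- c(35, 45, 55, 65, 75)
c(mean = mean(symmetric), median = median(symmetric))

# Add one extreme value: the mean is pulled toward the long tail,
# while the median moves much less
skewed <- c(symmetric, 200)
c(mean = mean(skewed), median = median(skewed))
```

A large gap between the mean and the median is therefore a quick hint that a distribution is skewed.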

4.17: Is this data spread out greatly from the center or is it close? Why?

This can only be answered relatively speaking! Let’s pick things to be relative to Seattle, WA temperatures:

While it appears that Seattle weather has a similar center of 55°F, its temperatures are almost entirely between 35°F and 75°F, for a range of about 40°F. Seattle temperatures are much less spread out than New York’s, i.e. much more consistent over the year. New York on the other hand has much colder days in the winter and much hotter days in the summer. Expressed differently, the spread of the middle 50% of New York values, as measured by the interquartile range, is about 30°F:

IQR(weather$temp, na.rm=TRUE)
## [1] 30.06
summary(weather$temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.94   39.92   55.04   55.20   69.98  100.00       1

LC 4.18-4.21

4.18: What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables?

  • Certain months have much more consistent weather (August in particular), while others have crazy variability like January and October, representing changes in the seasons.
  • The two variables whose relationship we are seeing are temp and month.

4.19: What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?

  • While month is technically a number between 1-12, we’re viewing it as a categorical variable here. Specifically an ordinal categorical variable since there is an ordering to the categories
  • 25, 50, 75, 100 are temperatures

4.20: For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the variability of the variables and other important characteristics?

Having histograms split by day would not be great:

  • We’d have 365 facets to look at. Way too many.
  • We don’t really care about day-to-day fluctuation in weather so much, but maybe more week-to-week variation. We’d like to focus on seasonal trends.

4.21: Does the temp variable in the weather data set have a lot of variability? Why do you say that?

Again, like in LC 4.17, this is a relative question. I would say yes, because in New York City, you have 4 clear seasons with different weather. Whereas in Seattle WA and Portland OR, you have two seasons: summer and rain!

Lec07 - Mon 2/27: 5NG#2 Linegraphs

In ModernDive, LC4.9 thru 4.13 in Chapter 4.4. Hint: For LC4.10, Google “NYC Timezone” and note the number next to UTC. UTC stands for Coordinated Universal Time.

Discussion

  • LC4.9: Take a look at both the weather and early_january_weather data frames by running View(weather) and View(early_january_weather) in the console. In what respect do these data frames differ? The rows of early_january_weather are a subset of weather.
  • LC4.10: The weather data is recorded hourly. Why does the time_hour variable correctly identify the hour of the measurement whereas the hour variable does not? Because to uniquely identify an hour, we need the year/month/day/hour sequence, whereas there are only 24 possible hours. Note that in the case of weather, there is a timezone bug: the time_hour variable is off by 5 hours from the year/month/day/hour sequence, since the Eastern Time Zone is 5 hours off UTC.
  • LC4.11: Why should line-graphs be avoided when there is not a clear ordering of the horizontal axis? Because lines suggest connectedness and ordering.
  • LC4.12: Why are line-graphs frequently used when time is the explanatory variable? Because time is sequential: subsequent observations are closely related to each other.
  • LC4.13: Plot a time series of a variable other than temp for Newark Airport in the first 15 days of January 2013. Humidity is a good one to look at, since it is very closely related to the cycles of a day.
data(weather)
early_january_weather <- weather %>% 
  filter(origin == "EWR" & month == 1 & day <= 15)
ggplot(data = early_january_weather, aes(x = time_hour, y = humid)) +
  geom_line()

Lec06 - Fri 2/24: 5NG#1 Scatterplots

In ModernDive, LC4.1 thru 4.8, which cover in Chapter 4:

  • Review Readings: Start of Chapter 4 to 4.2
  • Chapter 4.3 to 4.3.1: 5NG#1 Scatterplots. Drill down on geom_point()
  • Chapter 4.3.2: Two ways for dealing with overplotting:
    1. alpha to control transparency
    2. geom_jitter(): a variation of geom_point() where we add a little jitter (i.e. random noise) to the points to break log-jams

Discussion

Load necessary data and packages:

library(dplyr)
library(ggplot2)
library(nycflights13)
data(flights)
all_alaska_flights <- flights %>% 
  filter(carrier == "AS")

LC4.1: flights includes all flights, whereas all_alaska_flights only includes Alaska Airlines flights.

LC 4.2-4.6:

ggplot(data=all_alaska_flights, aes(x = dep_delay, y = arr_delay)) + 
  geom_point()

  • 4.2: What are some practical reasons why dep_delay and arr_delay have a positive relationship? The later a plane departs, typically the later it will arrive.
  • 4.3: What does (0, 0) correspond to from the point of view of a passenger on an Alaskan flight? Why do you believe there is a cluster of points near (0, 0)? The point (0,0) means no delay in departure and arrival. From the passenger’s point of view, this means the flight was on time. It seems most flights are at least close to being on time.
  • 4.4: Create a similar plot, but one showing the relationship between departure time and departure delay. What hypotheses do you have about the patterns you see? We now put dep_time as the x-aesthetic and dep_delay as the y-aesthetic
ggplot(data=all_alaska_flights, aes(x = dep_time, y = dep_delay)) + 
  geom_point()

Hint: Look at Alaska Airlines’ route map. In fact, there are only two flights, Flights 7 and 11, both flying from Newark (EWR) to Seattle (SEA).

LC 4.7-4.8:

  • 4.7: Why is setting the alpha argument value useful with scatter-plots? It thins out the points so we address over-plotting. But more importantly it hints at the (statistical) density and distribution of the points: where are the points concentrated, where do they occur. We will see more about densities and distributions in Chapter 6 when we switch gears to statistical topics.
  • 4.8: After viewing the above plot, give a range of arrival delays and departure delays that occur most frequently? How has that region changed compared to when you observed the same plot without the alpha = 0.2 set in lower plot? The lower plot suggests that most Alaska flights from NYC depart between 12 minutes early and on time and arrive between 50 minutes early and on time.

Lec05 - Thu 2/23: More 5NG

LC1-5 Consider the following data in tidy format:

A B C D
1 1 3 Hot
2 2 2 Hot
3 3 1 Cold
4 4 2 Cold

Letting

  • the x-axis correspond to variable A
  • the y-axis is variable B

draw using pen & paper the 5 graphics below:

  1. A scatter plot
  2. A scatter plot where the color of the points corresponds to D
  3. A scatter plot where the size of the points corresponds to C
  4. A line graph
  5. A line graph where the color of the line corresponds to D

Reach for the Stars #1: A little ambitious right now, but see if you can tweak the code below to create baby’s first ggplot2 graphic, graphic #1 above: just the scatter plot. I suggest you:

  1. Create a new .R script file
  2. Cut and paste the code below
  3. Tweak the code in the ggplot() function from your Editor (not directly in console)
library(dplyr)
library(ggplot2)

simple_ex <-
  tibble(
    A = c(1, 2, 3, 4),
    B = c(1, 2, 3, 4),
    C = c(3, 2, 1, 2),
    D = c("Hot", "Hot", "Cold", "Cold")
  )
View(simple_ex)

ggplot(data=DATASETNAME, mapping=aes(x=VARIABLENAME, y=VARIABLENAME)) +
  geom_point()

Reach for the Stars #2: Even more ambitious right now, but see if you can tweak the same code to make graphic #2 above: a scatter plot where the color of the points corresponds to D. Hint:

  • open the help file for geom_point by running ?geom_point in the console
  • scroll down to the Aesthetics section.

Discussion

LC1-5: Chalk Talk

Reach for the Stars #1:

ggplot(data=simple_ex, mapping=aes(x=A, y=B)) +
  geom_point()

Reach for the Stars #2:

ggplot(data=simple_ex, mapping=aes(x=A, y=B, color=D)) +
  geom_point()

Notice:

  1. We simply added color=D in the aes()thetic mapping statement!
  2. How Cold gets mapped to red and Hot to blue. Computers don’t know one color represents heat better than another! How do you change these colors? We’ll see later; let’s keep things simple for now.
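
Graphics #3 to #5 were done as chalk talk, but the same pattern carries over. Here is one possible sketch (not the only way to write them); the object names plot_size, plot_line, and plot_line_color are just labels chosen for this example.

```r
library(dplyr)
library(ggplot2)

# Same toy data as above
simple_ex <- tibble(
  A = c(1, 2, 3, 4),
  B = c(1, 2, 3, 4),
  C = c(3, 2, 1, 2),
  D = c("Hot", "Hot", "Cold", "Cold")
)

# Graphic 3: scatter plot where the size of the points corresponds to C
plot_size <- ggplot(data = simple_ex, mapping = aes(x = A, y = B, size = C)) +
  geom_point()

# Graphic 4: line graph
plot_line <- ggplot(data = simple_ex, mapping = aes(x = A, y = B)) +
  geom_line()

# Graphic 5: line graph where the color of the line corresponds to D
# (mapping color to D also splits the data into one line per value of D)
plot_line_color <- ggplot(data = simple_ex, mapping = aes(x = A, y = B, color = D)) +
  geom_line()

plot_size
plot_line
plot_line_color
```

In each case only the aes()thetic mapping or the geom_etric object changes; the data stays the same.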

Lec04 - Wed 2/22: 5NG

Based on the 5NG examples in today’s slides

  • Learning Check 1: Following the example of Napoleon’s march, identify the elements of the Grammar of Graphics:
    1. identify the data variables being displayed and what type of variable they are
    2. identify the aes()thetic attribute of the geom_etric object the above data variables are being mapped to
  • Learning Check 2: Answer the following questions:
    1. Does spending more on a movie yield higher IMDB ratings?
    2. Why are there drops in the number of flights?
    3. What are the smallest and largest visible heights and what do you think of them? Also, think of one graph improvement to better convey information about SF OkCupid users.
    4. Click here for an explanation of boxplots. About what proportion of manual car models sold between 1984 and 2015 got 20 mpg or worse mileage?
    5. About how many babies were named “Hayden” between 1990-2014?

Discussion

5NG#1: Scatterplot

Let’s look at a random sample of 3 of the movies:

title budget rating
I Walk the Line 4e+06 6.0
Auteur Theory, The 7e+04 4.5
Ding-a-ling-Less 1e+06 8.1

Both variables are numerical. Here are the components of the Grammar of Graphics:

data variable aes()thetic attribute geom_etric object
budget x point
rating y point

Question: Does spending more on a movie yield higher IMDB ratings?

5NG#2: Linegraph

Let’s look at a random sample of 3 of the dates:

date n
2013-01-08 899
2013-01-26 680
2013-01-28 923

Both variables are numerical (dates are technically numerical since they are an abstraction of time). Here are the components of the Grammar of Graphics:

data variable aes()thetic attribute geom_etric object
date x line
n y line

Note: Why did we use line as the geom_etric object? Because lines suggest sequence/relationship, and points don’t.

Question: Why are there drops in the number of flights? 2013/01/14 was a Monday.
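
If you want to recreate a plot like this yourself, here is a rough sketch. It assumes the slide counted the number of flights per calendar day in the nycflights13 flights data; flights_per_day and daily_plot are names invented for this example.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Build a date from the year, month, day columns, then count flights per date
flights_per_day <- flights %>%
  mutate(date = as.Date(sprintf("%d-%02d-%02d", year, month, day))) %>%
  count(date)

# date goes on the x-axis, the daily count n on the y-axis, drawn as a line
daily_plot <- ggplot(data = flights_per_day, mapping = aes(x = date, y = n)) +
  geom_line()
daily_plot
```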

5NG#3: Histogram

Let’s look at a random sample of 3 of the users:

sex height
m 72
m 71
f 61

Height is numerical. Here are the components of the Grammar of Graphics:

data variable aes()thetic attribute geom_etric object
height x histogram

Note: We’ll see later there is no y aesthetic here, because there is no explicit variable that maps to it, but rather it is computed internally.
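
To see that the y values really are computed internally, here is a tiny sketch using only the three sampled users from the table above. The names sampled_users, height_hist, and bin_counts are made up for this example, and the binwidth of 1 is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)

# The three sampled users from the table above
sampled_users <- tibble(
  sex = c("m", "m", "f"),
  height = c(72, 71, 61)
)

# Only an x aesthetic is supplied; no y aesthetic
height_hist <- ggplot(data = sampled_users, mapping = aes(x = height)) +
  geom_histogram(binwidth = 1)
height_hist

# layer_data() shows what ggplot2 computed for this layer,
# including the per-bin count that ends up on the y-axis
bin_counts <- layer_data(height_hist, 1)
sum(bin_counts$count)
```

The counts across all bins add up to the number of rows in the data, which is how the histogram gets its heights without us ever naming a y variable.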

Question: What are the smallest and largest visible heights and what do you think of them? Also, think of one graph improvement to better convey information about SF OkCupid users.

5NG#4: Boxplot

Let’s look at a random sample of 3 of the car year/make/model matchings:

name trans hwy
2012 Bentley Continental Supersports Automatic 19
1995 Dodge Caravan C/V/Grand Caravan 2WD Automatic 22
1989 Dodge Daytona Manual 26

trans type is categorical, whereas hwy is numerical. Here are the components of the Grammar of Graphics:

data variable aes()thetic attribute geom_etric object
trans x boxplot
hwy y boxplot

Question: About what proportion of manual car models sold between 1984 and 2015 got 20 mpg or worse mileage? Answer: 25%

5NG#5: Bar Plot

Let’s look at all the data:

name n
Carlos 155711
Ethan 359506
Hayden 105716

Name is categorical. Here are the components of the Grammar of Graphics:

data variable aes()thetic attribute geom_etric object
name x bar
n y bar
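
Since the counting has already been done in this table, one way to draw the bars is geom_col(), which uses the n column directly as bar heights. A minimal sketch; name_counts and name_bars are names chosen for this example.

```r
library(dplyr)
library(ggplot2)

# All the data from the table above
name_counts <- tibble(
  name = c("Carlos", "Ethan", "Hayden"),
  n = c(155711, 359506, 105716)
)

# name is mapped to x and the pre-computed count n to y
name_bars <- ggplot(data = name_counts, mapping = aes(x = name, y = n)) +
  geom_col()
name_bars
```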

Question: About how many babies were named “Hayden” between 1990-2014? Answer: About 100,000. 1e+05 is R’s shorthand notation for \(1 \times 10^5 = 100,000\). To help me remember exponents, I just memorize that \(1\times 10^6 = 1,000,000\) i.e. one million.
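
You can check this notation yourself in the console. A small sketch; the variable names are just for illustration.

```r
# 1e+05 is the same number as 100000 (a 1 followed by five zeros)
is_hundred_thousand <- 1e+05 == 100000
is_hundred_thousand

# 1e+06 is one million
is_one_million <- 1e+06 == 1000000
is_one_million

# Ask R to print all the digits, with commas, instead of the shorthand
written_out <- format(1e+05, scientific = FALSE, big.mark = ",")
written_out

# Going the other way: the count for Hayden, rounded to one significant digit
hayden_shorthand <- format(105716, scientific = TRUE, digits = 1)
hayden_shorthand
```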

Lec03 - Mon 2/20: Tidy Data

Do the 16 Learning Checks in Chapter 3 of ModernDive: Tidy Data. You do not need to submit these.

Discussion

3.1 See 3.2 for an example!

3.2 Since there are three variables at play (Date, Price, Stock Name), there should be three columns!

Date Stock Name Price
2009-01-01 Boeing $173.55
2009-01-02 Boeing $172.61
2009-01-03 Boeing $173.86
2009-01-04 Boeing $170.77
2009-01-05 Boeing $174.29
2009-01-01 Amazon $174.90
2009-01-02 Amazon $171.42
2009-01-03 Amazon $171.58
2009-01-04 Amazon $173.89
2009-01-05 Amazon $170.16
2009-01-01 Google $174.34
2009-01-02 Google $170.04
2009-01-03 Google $173.65
2009-01-04 Google $174.87
2009-01-05 Google $172.19
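
We haven't covered how to do this conversion in R, but for the curious here is a sketch using pivot_longer() from the tidyr package (not one of the packages used in class so far). The wide layout stocks_wide is a hypothetical starting point with one column per stock, using the first two dates from the table above.

```r
library(dplyr)
library(tidyr)

# Hypothetical non-tidy layout: one column per stock
stocks_wide <- tibble(
  Date = as.Date(c("2009-01-01", "2009-01-02")),
  Boeing = c(173.55, 172.61),
  Amazon = c(174.90, 171.42),
  Google = c(174.34, 170.04)
)

# Stack the three stock columns into two: one for the name, one for the price
stocks_tidy <- stocks_wide %>%
  pivot_longer(cols = -Date, names_to = "Stock Name", values_to = "Price")
stocks_tidy
```

The result has one row per date and stock, and exactly three columns: Date, Stock Name, and Price.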

3.3 What does any ONE row in the flights dataset refer to? Data on a flight. Not a flight path! Example:

  • a flight path would be United 1545 to Houston
  • a flight would be United 1545 to Houston 2013/1/1 at 5:15am

3.4 What are some examples in this dataset of categorical variables? What makes them different than quantitative variables?

Hint: Type ?flights in the console to see what all the variables mean!

  • Categorical:
    • carrier the company
    • dest the destination
    • flight the flight number. Even though this is a number, it’s simply a label. Example: United 1545 isn’t “less than” United 1714
  • Quantitative:
    • distance the distance in miles
    • time_hour the scheduled date and hour of the flight

3.5 What do int, dbl, and chr mean in the output above?

  • int: integer. Used to count things i.e. a discrete value. Ex: the # of cars parked in a lot
  • dbl: double. Used to measure things. i.e. a continuous value. Ex: your height in inches
  • chr: character. i.e. text
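
You can ask R directly what type a value is with typeof(). A quick sketch with made-up values:

```r
# The L suffix tells R to store a whole number as an integer
cars_in_lot <- 5L
typeof(cars_in_lot)      # "integer"

# Numbers with decimals are stored as doubles
height_in_inches <- 68.5
typeof(height_in_inches) # "double"

# Text in quotes is stored as character
airport_code <- "EWR"
typeof(airport_code)     # "character"
```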

3.6 & 3.7 19 columns (variables) and 336,776 rows (observations i.e. flights)
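
One way to get these counts without scrolling through the viewer (the column count may differ if you have a different version of nycflights13 installed):

```r
library(nycflights13)

# Number of rows (observations, i.e. flights)
n_flights <- nrow(flights)
n_flights

# Number of columns (variables)
n_variables <- ncol(flights)
n_variables
```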

3.8 weather, planes, airports, airlines data sets.

The observational units, i.e. what each row corresponds to:

  • weather: weather at a given origin (EWR, JFK, LGA) for a given hour i.e. year, month, day, hour
  • planes: a physical aircraft
  • airports: an airport in the US
  • airlines: an airline company

3.9 See ?airports help file

3.10 Identification Variables

  • In the weather example in LC3.8, the combination of origin, year, month, day, hour are identification variables as they identify the observation in question.
  • Anything else pertains to observations: temp, humid, wind_speed, etc.
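
If you want to check whether a set of variables really identifies each row, one approach is to count how often each combination occurs and look for repeats. A sketch, with weather_id_check as a made-up name; we haven't covered count() and filter() yet, so treat this as a preview.

```r
library(dplyr)
library(nycflights13)

# Count rows for each origin/year/month/day/hour combination,
# then keep only combinations that show up more than once
weather_id_check <- weather %>%
  count(origin, year, month, day, hour) %>%
  filter(n > 1)
weather_id_check
```

Any rows that come back are combinations that appear more than once, i.e. places where these variables alone don't pin down a single observation.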

3.11 What are common characteristics of “tidy” datasets? Described in Lecture slides.

3.12 What makes “tidy” datasets useful for organizing data? Organized way of viewing data. We’ll see later that this format is required for the ggplot2 and dplyr packages for data visualization and manipulation.

3.13 There are 2 variables below, but what does each row correspond to? We don’t know b/c there are no identification variables.

students faculty
4 2
6 3

3.14 We need at least a third variable to identify the observations. For example a variable “Department”.

3.15 Sociology example

  • Each row is a member of a university.
  • Variables are the columns
  • TRUE and FALSE. This is called a logical variable AKA a boolean variable. 1 and 0 can also be used.

3.16 We can easily _join them with other data sets! For example, can we join the flights data with the planes data? We’ll see this more in Chapter 4!
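
As a preview, here is roughly what such a join could look like. It assumes tailnum (the plane's tail number) is the variable the two data sets share; flights_with_planes is a name made up for this example.

```r
library(dplyr)
library(nycflights13)

# For each flight, look up the matching plane by its tail number
# and attach the plane's information (manufacturer, model, seats, ...)
flights_with_planes <- flights %>%
  left_join(planes, by = "tailnum")
flights_with_planes
```

Flights whose tail number isn't in planes are kept, with missing values (NA) in the plane columns.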


Lec02 - Thu 2/16: R Packages

You will be getting your first experience with:

  • RStudio: i.e. The Dashboard
  • R Packages: Extending the functionality
  • View(): A command for viewing your data
  • R Scripts: Text files to save your work
  • Answering questions with data. In this case: baby name popularity

Setting Up

As described in today’s lecture slides: In RStudio (not DataCamp):

  1. Install the ggplot2, dplyr, and babynames packages.
  2. Load the ggplot2, dplyr, and babynames packages.
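
If you prefer typing commands over clicking through the RStudio menus, the two steps can be sketched as follows. Installing needs an internet connection and only has to be done once per computer; loading has to be done every time you restart RStudio.

```r
# Step 1: install (once per computer)
install.packages(c("ggplot2", "dplyr", "babynames"))

# Step 2: load (every new R session)
library(ggplot2)
library(dplyr)
library(babynames)
```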

View() Your Data

Load the babynames dataset in the RStudio viewer by running the following in the console. You should get in the habit of always View()ing your data first!
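
The command itself isn't shown above; it was presumably something along these lines (this assumes the babynames package is installed, and the interactive() check is only there so the line is skipped when the code is run outside of RStudio or another interactive session):

```r
library(babynames)

# Opens the babynames data in a spreadsheet-style viewer
if (interactive()) {
  View(babynames)
}
```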

  • Scroll through the viewer to get an overall feel for your data.
  • Filter your data in the viewer
    • Click on the Filter button
    • Click in the white boxes under the variable names and
      • Play with any sliders for numerical variables: year, n, prop
      • Enter in values to view subsets of rows: sex, name

R Scripts

There are two ways to run commands in the R console: Either

  1. Typing them directly in the console and pressing enter (as you just did).
  2. Saving them in a .R R Script and passing them to the console.

Do the following:

  • In RStudio menu bar -> File -> New File -> R Script
  • Save the file as babynames. Note:
    • A file extension .R gets added: babynames.R
    • You should see babynames.R in the File panel of RStudio.
  • Cut/paste the contents of the grey block below into babynames.R and save it again.
  • Run all the code in the console by highlighting it and pressing
    • Mac users: COMMAND+ENTER
    • PC users: CTRL+ENTER
  • Comment on what you see

Today’s Exercise

Investigate your hypothesized names that are “modern”, “old-fashioned”, and “back in vogue” by

  • Changing only what gets assigned to
    • baby_name
    • baby_sex
  • Running the appropriate lines of code in the console.
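
The grey block itself isn't reproduced on this page. As a rough idea of what a script like this could look like, here is a sketch. Only baby_name and baby_sex come from the instructions above; the values "Hayden" and "M", the object names single_name and name_plot, and the choice to plot prop over year as a line are all assumptions for illustration.

```r
library(ggplot2)
library(dplyr)
library(babynames)

# Change only these two lines
baby_name <- "Hayden"
baby_sex <- "M"

# Keep only the rows for the chosen name and sex
single_name <- babynames %>%
  filter(name == baby_name & sex == baby_sex)

# Proportion of babies given this name, year by year
name_plot <- ggplot(data = single_name, mapping = aes(x = year, y = prop)) +
  geom_line()
name_plot
```

A name that is “modern” should show a line that rises in recent years, an “old-fashioned” one a line that peaked long ago, and a “back in vogue” one a dip followed by a rise.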