Alaska Flights

Load Alaska Flights data:

library(ggplot2)
library(dplyr)
library(nycflights13)
library(knitr)
data(flights)

# Load Alaska data, deleting rows that have missing dep or arr data
alaska_flights <- flights %>% 
  filter(carrier == "AS") %>% 
  filter(!is.na(dep_delay) & !is.na(arr_delay))

# Instead of looking at all 709 Alaska flights, let's look at a random sample
# of 10 of them:
set.seed(76)
alaska_flights_sample <- alaska_flights %>% 
  sample_n(10)

# Plot the 25 pairs of points and regression line:
ggplot(data=alaska_flights_sample, aes(x = dep_delay, y = arr_delay)) + 
  geom_point() +
  geom_smooth(method="lm", se=FALSE)

Regression:

Let’s take

  • The outcome variable y to be departure delay dep_delay
  • The predictor variable x to be arrival delay arr_delay

Running Regression

The way you run regression in R is via the lm() command for linear model and assign this to model

lm(OUTCOME_VARIABLE ~ PREDICTOR_VARIABLE, data=DATA_SET_NAME)

Let’s run the regression using only the 10 randomly sampled flights:

model <- lm(arr_delay ~ dep_delay, data=alaska_flights_sample)

Looking at Output

There are two useful functions from the broom package to look at the output of a regression:

  • tidy() to get the regression table
  • augment() to get point-by-point values. This table also contains more advanced output
library(broom)
regression_table <- tidy(model, conf.int = TRUE)
kable(regression_table, digits=3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -18.640 5.544 -3.362 0.010 -31.424 -5.856
dep_delay 0.834 0.227 3.670 0.006 0.310 1.358
point_by_point <- augment(model) %>% 
  as_data_frame()
kable(point_by_point, digits=3)
arr_delay dep_delay .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
42 56 28.070 11.910 13.930 0.531 15.699 0.874 1.244
-41 -3 -21.143 5.825 -19.857 0.127 15.524 0.123 -1.300
-25 -1 -19.474 5.630 -5.526 0.119 17.337 0.009 -0.360
-12 -7 -24.479 6.295 12.479 0.148 16.715 0.060 0.827
-26 -7 -24.479 6.295 -1.521 0.148 17.468 0.001 -0.101
-38 -2 -20.308 5.724 -17.692 0.123 15.955 0.093 -1.155
6 49 22.232 10.499 -16.232 0.412 15.539 0.588 -1.295
-25 -7 -24.479 6.295 -0.521 0.148 17.477 0.000 -0.035
2 15 -6.128 5.359 8.128 0.107 17.174 0.017 0.526
4 -5 -22.811 6.047 26.811 0.137 13.658 0.247 1.765

Learning Check

  1. Identify in the point-by-point table which columns correspond to
    • The observed outcome variable \(y\): arr_delay
    • The fitted values on the regression line \(\widehat{y}\): .fitted
    • The residuals and which two columns they are computed based on: .resid
  2. Re-run the regression, but using all 907 flights in the alaska_flights data set and plot a histogram of the residuals. What shape does the distribution have?

Residual Plots

Before we plot the histogram of residuals, let’s plot the raw residuals. Imagine we take the red line in the left-hand plot and flatten it out in the right hand plot. This is equivalent to computing the residuals.

We see that the residuals just form random scatter/noise around the \(y=0\) line i.e. there is no systematic pattern in the residuals

The residuals are \(\epsilon_i = \mbox{observed} - \mbox{fitted} = y_i - \widehat{y}_i\). They roughly follow a Normal distribution centered at 0. i.e. on average the error is 0. This is good!

model <- lm(arr_delay ~ dep_delay, data=alaska_flights)
point_by_point <- augment(model) %>% 
  as_data_frame()
ggplot(point_by_point, aes(x=.resid)) +
  geom_histogram(binwidth = 10)