Residuals

Alaska Flights

Load Alaska Flights data:

library(ggplot2)
library(dplyr)
library(nycflights13)
library(knitr)
data(flights)

# Load Alaska data, deleting rows that have missing dep or arr data
alaska_flights <- flights %>% 
  filter(carrier == "AS") %>% 
  filter(!is.na(dep_delay) & !is.na(arr_delay))

# Instead of looking at all 709 Alaska flights, let's look at a random sample
# of 10 of them:
set.seed(76)
alaska_flights_sample <- alaska_flights %>% 
  sample_n(10)

# Plot the 25 pairs of points and regression line:
ggplot(data=alaska_flights_sample, aes(x = dep_delay, y = arr_delay)) + 
  geom_point() +
  geom_smooth(method="lm", se=FALSE)

Regression:

Let’s take

The outcome variable y to be departure delay dep_delay
The predictor variable x to be arrival delay arr_delay

Running Regression

The way you run regression in R is via the lm() command for linear model and assign this to model

lm(OUTCOME_VARIABLE ~ PREDICTOR_VARIABLE, data=DATA_SET_NAME)

Let’s run the regression using only the 10 randomly sampled flights:

model <- lm(arr_delay ~ dep_delay, data=alaska_flights_sample)

Looking at Output

There are two useful functions from the broom package to look at the output of a regression:

tidy() to get the regression table
augment() to get point-by-point values. This table also contains more advanced output

library(broom)
regression_table <- tidy(model, conf.int = TRUE)
kable(regression_table, digits=3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-18.640	5.544	-3.362	0.010	-31.424	-5.856
dep_delay	0.834	0.227	3.670	0.006	0.310	1.358

point_by_point <- augment(model) %>% 
  as_data_frame()
kable(point_by_point, digits=3)

arr_delay	dep_delay	.fitted	.se.fit	.resid	.hat	.sigma	.cooksd	.std.resid
42	56	28.070	11.910	13.930	0.531	15.699	0.874	1.244
-41	-3	-21.143	5.825	-19.857	0.127	15.524	0.123	-1.300
-25	-1	-19.474	5.630	-5.526	0.119	17.337	0.009	-0.360
-12	-7	-24.479	6.295	12.479	0.148	16.715	0.060	0.827
-26	-7	-24.479	6.295	-1.521	0.148	17.468	0.001	-0.101
-38	-2	-20.308	5.724	-17.692	0.123	15.955	0.093	-1.155
6	49	22.232	10.499	-16.232	0.412	15.539	0.588	-1.295
-25	-7	-24.479	6.295	-0.521	0.148	17.477	0.000	-0.035
2	15	-6.128	5.359	8.128	0.107	17.174	0.017	0.526
4	-5	-22.811	6.047	26.811	0.137	13.658	0.247	1.765

Learning Check

Identify in the point-by-point table which columns correspond to
- The observed outcome variable \(y\): arr_delay
- The fitted values on the regression line \(\widehat{y}\): .fitted
- The residuals and which two columns they are computed based on: .resid
Re-run the regression, but using all 907 flights in the alaska_flights data set and plot a histogram of the residuals. What shape does the distribution have?

Residual Plots

Before we plot the histogram of residuals, let’s plot the raw residuals. Imagine we take the red line in the left-hand plot and flatten it out in the right hand plot. This is equivalent to computing the residuals.

We see that the residuals just form random scatter/noise around the \(y=0\) line i.e. there is no systematic pattern in the residuals

The residuals are \(\epsilon_i = \mbox{observed} - \mbox{fitted} = y_i - \widehat{y}_i\). They roughly follow a Normal distribution centered at 0. i.e. on average the error is 0. This is good!

model <- lm(arr_delay ~ dep_delay, data=alaska_flights)
point_by_point <- augment(model) %>% 
  as_data_frame()
ggplot(point_by_point, aes(x=.resid)) +
  geom_histogram(binwidth = 10)