library(ggplot2)
library(dplyr)
library(nycflights13)
library(knitr)
data(flights)
# Load Alaska data, deleting rows that have missing dep or arr data
alaska_flights <- flights %>%
filter(carrier == "AS") %>%
filter(!is.na(dep_delay) & !is.na(arr_delay))
# Instead of looking at all 709 Alaska flights, let's look at a random sample
# of 10 of them:
set.seed(76)
alaska_flights_sample <- alaska_flights %>%
sample_n(10)
# Plot the 25 pairs of points and regression line:
ggplot(data=alaska_flights_sample, aes(x = dep_delay, y = arr_delay)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
Let’s take
dep_delay
arr_delay
The way you run regression in R is via the lm()
command for linear model and assign this to model
lm(OUTCOME_VARIABLE ~ PREDICTOR_VARIABLE, data=DATA_SET_NAME)
Let’s run the regression using only the 10 randomly sampled flights:
model <- lm(arr_delay ~ dep_delay, data=alaska_flights_sample)
There are two useful functions from the broom package to look at the output of a regression:
tidy()
to get the regression tableaugment()
to get point-by-point values. This table also contains more advanced outputlibrary(broom)
regression_table <- tidy(model, conf.int = TRUE)
kable(regression_table, digits=3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -18.640 | 5.544 | -3.362 | 0.010 | -31.424 | -5.856 |
dep_delay | 0.834 | 0.227 | 3.670 | 0.006 | 0.310 | 1.358 |
point_by_point <- augment(model) %>%
as_data_frame()
kable(point_by_point, digits=3)
arr_delay | dep_delay | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|
42 | 56 | 28.070 | 11.910 | 13.930 | 0.531 | 15.699 | 0.874 | 1.244 |
-41 | -3 | -21.143 | 5.825 | -19.857 | 0.127 | 15.524 | 0.123 | -1.300 |
-25 | -1 | -19.474 | 5.630 | -5.526 | 0.119 | 17.337 | 0.009 | -0.360 |
-12 | -7 | -24.479 | 6.295 | 12.479 | 0.148 | 16.715 | 0.060 | 0.827 |
-26 | -7 | -24.479 | 6.295 | -1.521 | 0.148 | 17.468 | 0.001 | -0.101 |
-38 | -2 | -20.308 | 5.724 | -17.692 | 0.123 | 15.955 | 0.093 | -1.155 |
6 | 49 | 22.232 | 10.499 | -16.232 | 0.412 | 15.539 | 0.588 | -1.295 |
-25 | -7 | -24.479 | 6.295 | -0.521 | 0.148 | 17.477 | 0.000 | -0.035 |
2 | 15 | -6.128 | 5.359 | 8.128 | 0.107 | 17.174 | 0.017 | 0.526 |
4 | -5 | -22.811 | 6.047 | 26.811 | 0.137 | 13.658 | 0.247 | 1.765 |
arr_delay
.fitted
.resid
alaska_flights
data set and plot a histogram of the residuals. What shape does the distribution have?Before we plot the histogram of residuals, let’s plot the raw residuals. Imagine we take the red line in the left-hand plot and flatten it out in the right hand plot. This is equivalent to computing the residuals.
We see that the residuals just form random scatter/noise around the \(y=0\) line i.e. there is no systematic pattern in the residuals
The residuals are \(\epsilon_i = \mbox{observed} - \mbox{fitted} = y_i - \widehat{y}_i\). They roughly follow a Normal distribution centered at 0. i.e. on average the error is 0. This is good!
model <- lm(arr_delay ~ dep_delay, data=alaska_flights)
point_by_point <- augment(model) %>%
as_data_frame()
ggplot(point_by_point, aes(x=.resid)) +
geom_histogram(binwidth = 10)