Load Packages and Data

# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)

# Load flights data set in nycflights
data(flights)

LC

What do positive values of the gain variable in flights_plus correspond to? What about negative values? And what about a zero value?

flights_plus <- flights %>% 
  mutate(gain = arr_delay - dep_delay)

Solution

ggplot(data=flights_plus, aes(x=gain)) +
  geom_histogram()

  • Say a flight departed 20 minutes late, i.e. dep_delay=20
  • Then arrived 10 minutes late, i.e. arr_delay=10.
  • Then gain = arr_delay - dep_delay = 10 - 20 = -10 is negative, so it “made up time in the air”.

0 means the departure and arrival time were the same, so no time was made up in the air. We see in most cases that the gain is near 0 minutes.

I never understood this. If the pilot says “we’re going make up time in the air” because of delay by flying faster, why don’t you always just fly faster to begin with?

LC

Could we create the dep_delay and arr_delay columns by simply subtracting dep_time from sched_dep_time and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights.

Solution

Let’s focus on delays first, but let’s assign our newly computed departure delay variable to dep_delay_new so as to not overwrite the original dep_delay variable. Furthermore, let’s only select the relevant columns to make View()ing easier:

new_flights_plus <- flights %>% 
  mutate(dep_delay_new = sched_dep_time - dep_time) %>% 
  select(sched_dep_time, dep_time, dep_delay, dep_delay_new)

Let’s take a look at the first 5 rows:

sched_dep_time dep_time dep_delay dep_delay_new
515 517 2 -2
529 533 4 -4
540 542 2 -2
545 544 -1 1
600 554 -6 46

Observations:

  1. Looking at the first 4 rows we see that when computing delays, you need to substract sched_dep_time from dep_time, and not vice versa like was asked. A typo to be fixed before Chester and I officially release this textbook later this year after testing!
  2. But why are the values in the 5th row different? Because hours don’t have 100 minutes, but 60 minutes!

Moral of the story: Always look your data before and after you do anything!