# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)
# Load flights data set in nycflights
data(flights)
What do positive values of the gain variable in flights_plus correspond to? What about negative values? And what about a zero value?
flights_plus <- flights %>%
  mutate(gain = arr_delay - dep_delay)
ggplot(data = flights_plus, aes(x = gain)) +
  geom_histogram()
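Say a flight has dep_delay = 20 and arr_delay = 10. Then gain = arr_delay - dep_delay = 10 - 20 = -10 is negative, so the flight “made up time in the air”. By the same formula, a positive gain means the arrival delay was larger than the departure delay, so the flight lost time in the air. A gain of 0 means the departure and arrival delays were the same, so no time was made up in the air. We see in most cases that the gain is near 0 minutes.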
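To make the sign convention concrete, here is a small sketch in base R using three made-up flights. The data frame toy_flights and its values are hypothetical and not taken from flights; only the formula gain = arr_delay - dep_delay comes from the exercise.

```r
# Hypothetical delays (in minutes) for three made-up flights;
# these values are illustrative and do not come from nycflights13.
toy_flights <- data.frame(
  dep_delay = c(20, 15, 5),
  arr_delay = c(10, 15, 25)
)

# Same definition of gain as in the exercise
toy_flights$gain <- toy_flights$arr_delay - toy_flights$dep_delay

# Label each case by the sign of gain
toy_flights$interpretation <- ifelse(
  toy_flights$gain < 0, "made up time in the air",
  ifelse(toy_flights$gain > 0, "lost time in the air", "no change")
)

print(toy_flights)
```

The first flight reproduces the 10 - 20 = -10 case from the solution; the other two cover a zero and a positive gain.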
I never understood this. If the pilot says “we’re going to make up time in the air” by flying faster after a delay, why don’t they always just fly faster to begin with?
Could we create the dep_delay and arr_delay columns by simply subtracting dep_time from sched_dep_time and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights.
Let’s focus on departure delays first, but let’s assign our newly computed departure delay variable to dep_delay_new so as not to overwrite the original dep_delay variable. Furthermore, let’s only select the relevant columns to make View()ing easier:
new_flights_plus <- flights %>%
  mutate(dep_delay_new = sched_dep_time - dep_time) %>%
  select(sched_dep_time, dep_time, dep_delay, dep_delay_new)
Let’s take a look at the first 5 rows:
| sched_dep_time | dep_time | dep_delay | dep_delay_new |
|---|---|---|---|
| 515 | 517 | 2 | -2 |
| 529 | 533 | 4 | -4 |
| 540 | 542 | 2 | -2 |
| 545 | 544 | -1 | 1 |
| 600 | 554 | -6 | 46 |
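Observations:
In the first four rows, dep_delay_new has the opposite sign of dep_delay: we should be subtracting sched_dep_time from dep_time, and not vice versa like was asked. A typo to be fixed before Chester and I officially release this textbook later this year after testing!
In the fifth row the magnitude is off as well: 600 - 554 = 46, whereas the actual dep_delay is -6. The times are recorded as clock readings (600 for 6:00, 554 for 5:54), so subtracting them directly does not give minutes whenever the two times fall in different hours.
Moral of the story: Always look at your data before and after you do anything!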
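One way to get a subtraction that agrees with dep_delay is to convert each clock reading to minutes since midnight first. The sketch below does this in base R for the five rows shown in the table. The helper hhmm_to_minutes is a hypothetical name introduced here, not a function from dplyr or nycflights13, and it assumes the scheduled and actual departure fall on the same calendar day, so a flight that slips past midnight would need extra handling.

```r
# Hypothetical helper: convert a clock time stored as an HHMM number
# (e.g. 554 for 5:54) into minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

# The five rows from the table above
sched_dep_time <- c(515, 529, 540, 545, 600)
dep_time       <- c(517, 533, 542, 544, 554)

# Actual minus scheduled, both measured in minutes since midnight
dep_delay_minutes <- hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)

print(dep_delay_minutes)
```

For these five rows the result matches the dep_delay column (2, 4, 2, -1, -6), including the fifth row where the naive subtraction gave 46.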