# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)
# Load flights data set in nycflights
data(flights)
What do positive values of the gain variable in flights_plus correspond to? What about negative values? And what about a zero value?
flights_plus <- flights %>%
  mutate(gain = arr_delay - dep_delay)
ggplot(data = flights_plus, aes(x = gain)) +
  geom_histogram()
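Suppose a flight has dep_delay=20 and arr_delay=10. Then gain = arr_delay - dep_delay = 10 - 20 = -10 is negative, so the flight “made up time in the air”. A positive gain means the opposite: the arrival delay exceeded the departure delay, so the flight lost time in the air. A gain of 0 means the departure and arrival delays were the same, so no time was made up in the air. We see in most cases that the gain is near 0 minutes.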
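To make the three cases concrete, here is a minimal sketch on a made-up data frame. The name `toy`, its values, and the `meaning` labels are illustrative assumptions, not part of `flights`. It uses base R only so that it runs without any packages loaded.

```r
# Made-up delays (in minutes) for three hypothetical flights
toy <- data.frame(
  dep_delay = c(20, 5, 10),
  arr_delay = c(10, 15, 10)
)

# Same definition as in flights_plus: arrival delay minus departure delay
toy$gain <- toy$arr_delay - toy$dep_delay

# Label each flight by the sign of its gain
toy$meaning <- ifelse(toy$gain < 0, "made up time in the air",
               ifelse(toy$gain > 0, "lost time in the air", "no change"))

print(toy)
```

The first flight reproduces the worked example above (gain of -10), the second lost time in the air, and the third arrived exactly as late as it departed.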
I never understood this. If the pilot says “we’re going to make up time in the air” by flying faster after a delay, why don’t they always just fly faster to begin with?
Could we create the dep_delay and arr_delay columns by simply subtracting dep_time from sched_dep_time and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights.
Let’s focus on delays first, but let’s assign our newly computed departure delay variable to dep_delay_new so as to not overwrite the original dep_delay variable. Furthermore, let’s only select the relevant columns to make View()ing easier:
new_flights_plus <- flights %>%
  mutate(dep_delay_new = sched_dep_time - dep_time) %>%
  select(sched_dep_time, dep_time, dep_delay, dep_delay_new)
Let’s take a look at the first 5 rows:
sched_dep_time | dep_time | dep_delay | dep_delay_new |
---|---|---|---|
515 | 517 | 2 | -2 |
529 | 533 | 4 | -4 |
540 | 542 | 2 | -2 |
545 | 544 | -1 | 1 |
600 | 554 | -6 | 46 |
Observations:
We should have subtracted sched_dep_time from dep_time, and not vice versa as was asked. A typo to be fixed before Chester and I officially release this textbook later this year after testing!

Moral of the story: Always look at your data before and after you do anything!
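Fixing the order of subtraction would still not reproduce dep_delay in every row. In the fifth row of the table, 554 - 600 gives -46 while dep_delay is -6, because these columns store clock times as HHMM integers rather than as minutes. One possible fix, sketched below, is to convert both times to minutes since midnight before subtracting. The helper `hhmm_to_minutes` is a hypothetical name introduced here, not a function from nycflights13 or dplyr, and the vectors simply copy the five rows shown above.

```r
# Hypothetical helper: convert an HHMM clock time (e.g. 554 for 5:54)
# into minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

# The five rows from the table above
sched_dep_time <- c(515, 529, 540, 545, 600)
dep_time       <- c(517, 533, 542, 544, 554)

# Actual minus scheduled, both measured in minutes since midnight
dep_delay_new <- hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)
print(dep_delay_new)
```

For these five rows the result agrees with the dep_delay column. This sketch assumes the scheduled and actual departures fall on the same calendar day; a flight delayed past midnight would need extra handling.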