Load Packages and Data

# Load necessary packages and data
library(ggplot2)
library(dplyr)
library(nycflights13)

data(flights)
data(airports)
data(airlines)

LC 4.27, 4.29, and 4.30

ggplot(data = flights, aes(x = carrier)) +
  geom_bar()

flights_table <- count(x = flights, vars = carrier) 
# Raw table:
flights_table
## # A tibble: 16 × 2
##     vars     n
##    <chr> <int>
## 1     9E 18460
## 2     AA 32729
## 3     AS   714
## 4     B6 54635
## 5     DL 48110
## 6     EV 54173
## 7     F9   685
## 8     FL  3260
## 9     HA   342
## 10    MQ 26397
## 11    OO    32
## 12    UA 58665
## 13    US 20536
## 14    VX  5162
## 15    WN 12275
## 16    YV   601
# Table sorted in descending order:
flights_table %>% 
  arrange(desc(n))
## # A tibble: 16 × 2
##     vars     n
##    <chr> <int>
## 1     UA 58665
## 2     B6 54635
## 3     EV 54173
## 4     DL 48110
## 5     AA 32729
## 6     MQ 26397
## 7     US 20536
## 8     9E 18460
## 9     WN 12275
## 10    VX  5162
## 11    FL  3260
## 12    AS   714
## 13    F9   685
## 14    YV   601
## 15    HA   342
## 16    OO    32
  • Why are histograms inappropriate for visualizing categorical variables?
  • How many Envoy Air flights departed NYC in 2013?
  • What was the seventh highest airline in terms of departed flights from NYC in 2013?

Solution

  • Histograms are for continuous variables: the horizontal extent of each histogram bar represents an interval of values. For a categorical variable, each bar represents exactly one level, so there is no interval to bin.
  • Envoy Air is carrier code MQ, so 26397 Envoy Air flights departed NYC in 2013. The flights_table and airlines datasets should be joined so that we know what these carrier codes mean!
  • US, i.e. US Airways, was seventh with 20536 flights. The arrange(desc(n)) command came in really handy here!
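
The join suggested in the second bullet could look like the sketch below. It assumes the airlines data frame from nycflights13 (columns carrier and name); the object names carrier_counts and carrier_counts_named are illustrative and not part of the original solution.

```r
library(dplyr)
library(nycflights13)

# Count flights per carrier; the count column is `n` and the key column
# is named `carrier` here (the table above used the name `vars` instead).
carrier_counts <- count(flights, carrier)

# Attach the full airline name from the `airlines` lookup table,
# then sort from most to fewest flights.
carrier_counts_named <- carrier_counts %>%
  inner_join(airlines, by = "carrier") %>%
  arrange(desc(n))

carrier_counts_named
```

With the name column attached, both the Envoy Air count and the seventh-ranked airline can be read off without looking up carrier codes by hand.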

LC 4.31-4.32

flights_namedports <- inner_join(flights, airports, by = c("origin" = "faa"))
ggplot(data = flights_namedports, aes(x = carrier, fill = name)) +
  geom_bar()

  • What kinds of questions are not easily answered by looking at the above figure?
  • What can you say, if anything, about the relationship between airline and airport in NYC in 2013 with regard to the number of departing flights?

Solution

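  • Questions that require comparing counts for a single airport across carriers. Only the red segments start at 0; the green and blue segments start at different heights in each bar, which makes their counts hard to compare.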
  • The different airlines prefer different airports. For example, United is mostly a Newark carrier and JetBlue is a JFK carrier. If airlines didn’t prefer airports, each color would be roughly one third of each bar.
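
The "one third" reasoning can also be checked numerically. The following is a minimal sketch, not part of the original solution, that computes each airport's share of a carrier's departures; the names carrier_origin_share and share are made up for illustration.

```r
library(dplyr)
library(nycflights13)

# For each carrier, compute the fraction of its NYC departures
# that left from each origin airport (EWR, JFK, LGA).
carrier_origin_share <- flights %>%
  count(carrier, origin) %>%
  group_by(carrier) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# Inspect the two carriers mentioned above: United (UA) and JetBlue (B6).
carrier_origin_share %>%
  filter(carrier %in% c("UA", "B6"))
```

If a carrier had no airport preference, its share values would each be close to the airports' overall shares of departures, which are roughly one third apiece.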

LC 4.35-4.36

ggplot(data = flights_namedports, aes(x = carrier, fill = name)) +
  geom_bar() +
  facet_wrap(~name, ncol=1)

  • Why is the faceted barplot preferred to the stacked barplot in this case?
  • What information about the different carriers at different airports is more easily seen in the faceted barplot?

Solution

  • We can easily compare the different airports for a given carrier using a single vertical comparison line, i.e. the bars for each carrier are lined up across the facets.
  • We can also easily compare the different carriers within a particular airport. For example, we can read off the top carrier for each airport using a single horizontal line.
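
What the faceted plot shows visually, the top carrier at each airport, can be confirmed with a short summary. This is a sketch under the assumption that grouping by the origin code is sufficient (no airport names are joined in); top_carrier_by_origin is an illustrative name.

```r
library(dplyr)
library(nycflights13)

# Count flights for every (origin, carrier) pair, then keep the
# carrier(s) with the largest count within each origin airport.
top_carrier_by_origin <- flights %>%
  count(origin, carrier) %>%
  group_by(origin) %>%
  filter(n == max(n)) %>%
  ungroup()

top_carrier_by_origin
```

Using filter(n == max(n)) keeps every carrier tied for first place at an airport, so the result could in principle have more than one row per origin.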