Load Packages and Data
# Load necessary packages and data
library(ggplot2)
library(dplyr)
library(nycflights13)
data(flights)
data(airports)
data(airlines)
LC 4.27, 4.29, and 4.30
ggplot(data = flights, aes(x = carrier)) +
  geom_bar()
flights_table <- count(x = flights, vars = carrier)
# Raw table:
flights_table
## # A tibble: 16 × 2
## vars n
## <chr> <int>
## 1 9E 18460
## 2 AA 32729
## 3 AS 714
## 4 B6 54635
## 5 DL 48110
## 6 EV 54173
## 7 F9 685
## 8 FL 3260
## 9 HA 342
## 10 MQ 26397
## 11 OO 32
## 12 UA 58665
## 13 US 20536
## 14 VX 5162
## 15 WN 12275
## 16 YV 601
# Table sorted in descending order:
flights_table %>%
  arrange(desc(n))
## # A tibble: 16 × 2
## vars n
## <chr> <int>
## 1 UA 58665
## 2 B6 54635
## 3 EV 54173
## 4 DL 48110
## 5 AA 32729
## 6 MQ 26397
## 7 US 20536
## 8 9E 18460
## 9 WN 12275
## 10 VX 5162
## 11 FL 3260
## 12 AS 714
## 13 F9 685
## 14 YV 601
## 15 HA 342
## 16 OO 32
- Why are histograms inappropriate for visualizing categorical variables?
- How many Envoy Air flights departed NYC in 2013?
- What was the seventh highest airline in terms of departed flights from NYC in 2013?
Solution
- Histograms are for continuous variables: the horizontal extent of each histogram bar represents an interval of values, whereas for a categorical variable each bar represents only one level of the variable.
- Envoy Air is carrier code MQ, and thus 26397 Envoy Air flights departed NYC in 2013. The flights_table and airlines datasets should be joined so that we know what these carrier codes mean!
- The seventh highest was carrier code US, i.e. US Airways, with 20536 flights. The arrange(desc(n)) command came in really handy here!
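The join suggested above can be sketched as follows. This is a minimal example, not part of the original solution: it assumes the airlines data frame from nycflights13 has a carrier column holding the two-character codes, and the name flights_named is made up for illustration.

```r
library(dplyr)
library(nycflights13)

# Same count table as above; the carrier code lands in a column called `vars`
flights_table <- count(x = flights, vars = carrier)

# Attach the full airline name to each code, then sort by number of flights
# (flights_named is a hypothetical name used only in this sketch)
flights_named <- flights_table %>%
  inner_join(airlines, by = c("vars" = "carrier")) %>%
  arrange(desc(n))

flights_named
```

With the name column attached, the Envoy Air row and the seventh row can be read off directly instead of looking up the codes by hand.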
LC 4.31-4.32
flights_namedports <- inner_join(flights, airports, by = c("origin" = "faa"))
ggplot(data = flights_namedports, aes(x = carrier, fill = name)) +
  geom_bar()
- What kinds of questions are not easily answered by looking at the above figure?
- What can you say, if anything, about the relationship between airline and airport in NYC in 2013 with regard to the number of departing flights?
Solution
- Because the red, green, and blue bars don’t all start at 0 (only red does), it is hard to compare counts across carriers for the airports whose bars are stacked on top.
- The different airlines prefer different airports. For example, United is mostly a Newark carrier and JetBlue is a JFK carrier. If airlines didn’t prefer airports, each color would be roughly one third of each bar.
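One way to check the "roughly one third" reasoning numerically is to compute, for each carrier, the share of its flights leaving from each origin airport. This is a sketch under the assumption that the origin column of flights is an adequate stand-in for the airport name; carrier_shares and prop are names introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Count flights per carrier and origin airport, then convert the counts
# to within-carrier proportions
# (carrier_shares and prop are hypothetical names used only in this sketch)
carrier_shares <- flights %>%
  count(carrier, origin) %>%
  group_by(carrier) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Inspect the two carriers mentioned in the solution
carrier_shares %>%
  filter(carrier %in% c("UA", "B6"))
```

If a carrier had no preference among airports, its prop values would sit near one third each. The same idea can be shown graphically by using geom_bar(position = "fill") in the stacked barplot, which rescales every bar to the same height.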
LC 4.35-4.36
ggplot(data = flights_namedports, aes(x = carrier, fill = name)) +
  geom_bar() +
  facet_wrap(~ name, ncol = 1)
- Why is the faceted barplot preferred to the stacked barplot in this case?
- What information about the different carriers at different airports is more easily seen in the faceted barplot?
Solution
- We can easily compare the different airports for a given carrier using a single vertical comparison line, i.e. things are lined up.
- Now we can also easily compare the different carriers within a particular airport. For example, we can read off the top carrier for each airport using a single horizontal line.
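The "top carrier for each airport" that the faceted barplot lets us read off can also be computed directly. The following is a minimal sketch, assuming a dplyr version that provides slice_max() (1.0.0 or later); top_carriers is a name introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Count flights per origin airport and carrier, then keep the carrier
# with the most flights within each airport
# (top_carriers is a hypothetical name used only in this sketch)
top_carriers <- flights %>%
  count(origin, carrier) %>%
  group_by(origin) %>%
  slice_max(order_by = n, n = 1) %>%
  ungroup()

top_carriers
```

Each row of the result corresponds to the tallest bar in one facet, so it serves as a quick check on what is read from the plot.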