# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)
# Load the weather data set from nycflights13
data(weather)
What kind of variable is on the x-axis in this plot:
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
  geom_boxplot()
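To see what factor() does to a numerically coded variable, here is a small base R sketch. The vector month_num is a made-up stand-in for the month column, not data taken from weather.

```r
# A few month numbers, stored as plain numbers (illustrative values only)
month_num <- c(1, 5, 12, 5)
class(month_num)   # "numeric"

# factor() turns the numbers into categories; the distinct values become labels
month_cat <- factor(month_num)
class(month_cat)   # "factor"
levels(month_cat)  # "1" "5" "12"
```

With a factor on the x-axis, ggplot2 draws one box per level rather than treating the axis as a continuous number line.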
month by itself was originally coded as a numerical variable whose values were 1 through 12. By wrapping it in factor(month) we are converting it to a categorical variable whose “labels” are 1 through 12. Ideally we would convert the month variable to a categorical one in the data set weather itself, but we don’t have the data manipulation tools to do that yet!
What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.
It appears to be an outlier. Let’s revisit the use of the filter command to home in on it. We want all data points where the month is 5 and temp < 25:
filter(weather, month == 5 & temp < 25)
origin | year | month | day | hour | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
JFK | 2013 | 5 | 9 | 2 | 13.1 | 12.02 | 95.34 | 80 | 8.05546 | 9.270062 | 0 | 1016.9 | 10 | 2013-05-08 21:00:00 |
Only one hourly observation, and only at JFK, recorded 13.1 F (-10.5 C) in the month of May. This is probably a data entry mistake! Otherwise, why wasn’t the weather at least similar at EWR (Newark) and LGA (LaGuardia)?
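The Celsius figure quoted for this reading follows from the usual Fahrenheit-to-Celsius conversion. As a quick check, here is a minimal sketch; f_to_c is a hypothetical helper name, not a function from any of the loaded packages.

```r
# Hypothetical helper: convert degrees Fahrenheit to degrees Celsius
f_to_c <- function(temp_f) (temp_f - 32) * 5 / 9

# The suspicious JFK reading from the table above, rounded to one decimal place
round(f_to_c(13.1), 1)  # -10.5
```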
Which months tend to have the highest temperature? What do you think the reasons are?
The solid black lines inside each box indicate medians (not the means!!!), which are a measure of center. For example, about half of the August data points are about 73 F or higher. I hate heat and humidity! Blech!
As is fairly obvious, the summer months of June, July, and August are the hottest months.
Which months tend to have the highest variability in temperature? What do you think the reasons are?
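We are now interested in the spread of the data. One measure some of you may have seen previously is the standard deviation. But in this plot we can read off the Interquartile Range (IQR), the distance between the 25th and 75th percentiles, which is the height of each box.
Just from eyeballing it, it seems November has the tallest box and hence the largest IQR.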
Here’s how we compute the exact IQR values for each month (we’ll see this in more depth in Chapter 5 of the text):
group the observations by month, then
for each group, i.e. month, summarise it by applying the summary statistic function IQR(), while making sure to skip over missing data via na.rm=TRUE, then
arrange the table in descending order of IQR
group_by(weather, month) %>%
  summarise(IQR = IQR(temp, na.rm = TRUE)) %>%
  arrange(desc(IQR))
month | IQR |
---|---|
11 | 16.02 |
12 | 13.68 |
1 | 12.96 |
9 | 12.06 |
4 | 12.06 |
5 | 11.88 |
6 | 10.98 |
10 | 10.98 |
2 | 10.08 |
7 | 9.18 |
3 | 9.00 |
8 | 7.02 |
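As a sanity check on what IQR() computes and why na.rm=TRUE matters, here is a small base R sketch. The vector temps holds made-up temperatures with one missing value. It is not taken from weather.

```r
# Made-up temperatures (F) with one missing value, for illustration only
temps <- c(30.2, 41.0, 44.6, 50.0, 53.6, 59.0, NA)

# IQR by hand: 75th percentile minus 25th percentile, skipping the NA
quartiles <- quantile(temps, probs = c(0.25, 0.75), na.rm = TRUE)
iqr_by_hand <- unname(quartiles[2] - quartiles[1])

# The built-in function gives the same answer
iqr_builtin <- IQR(temps, na.rm = TRUE)
all.equal(iqr_by_hand, iqr_builtin)

# Without na.rm = TRUE the missing value makes IQR() stop with an error,
# which we catch here and replace with the string "error"
iqr_with_na <- tryCatch(IQR(temps), error = function(e) "error")
iqr_with_na
```

This is why the summarise() call above passes na.rm=TRUE: a single missing temp in any month would otherwise stop the computation.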
Create a boxplot similar to the one above but for wind_speed. What do you observe and does this make sense?
ggplot(data = weather, mapping = aes(x = factor(month), y = wind_speed)) +
  geom_boxplot()
There is clearly an outlier in February. What are the units of wind speed? Look at the help file of the weather data set by typing ?weather in your console; you’ll see that they are in mph.
But does a wind speed of 1000 mph even make sense? Google the following: “fastest wind speed ever recorded”. Mount Washington in New Hampshire held the record of 231 mph (recorded in 1934) until 1996, when it was eclipsed by Barrow Island in Australia with a speed of 253 miles per hour during Cyclone Olivia. Not even close to 1000 mph!
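The comparison against the record can be turned into a simple plausibility check. The sketch below uses a made-up vector of wind speeds, with one value mimicking the outlier. The name record_mph is just an illustrative label for the 253 mph figure cited above.

```r
# Fastest recorded wind speed cited above, in mph
record_mph <- 253

# Made-up hourly wind speeds (mph); the second value mimics the outlier
wind_speed <- c(10.4, 1048.4, 17.3, NA)

# which() skips NA comparisons and returns the positions of impossible values
suspect <- which(wind_speed > record_mph)
suspect  # 2
```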
This is clearly a data entry mistake. What can we do to deal with this? We can either:
Change the range on the y-axis using the ylim() command:
ggplot(data = weather, mapping = aes(x = factor(month), y = wind_speed)) +
  geom_boxplot() +
  ylim(0, 45)
or
Delete that observation entirely using, again, the filter() command. We create a new cleaned weather data set called weather_cleaned which keeps all rows with wind_speed less than 500 mph:
weather_cleaned <- filter(weather, wind_speed < 500)
ggplot(data = weather_cleaned, mapping = aes(x = factor(month), y = wind_speed)) +
  geom_boxplot()
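One side effect of the filter() approach is worth knowing: a condition like wind_speed < 500 evaluates to NA for rows with a missing wind speed, and filter() drops those rows too. The sketch below shows this on a made-up data frame called toy, which is not a subset of weather. The second call is one possible way to keep the missing rows, if that is what you want.

```r
library(dplyr)

# Made-up data: one plausible value, one impossible value, one missing value
toy <- data.frame(
  month = c(2, 2, 2, 3),
  wind_speed = c(10.4, 1048.4, NA, 8.1)
)

# Same pattern as weather_cleaned: drops the outlier AND the NA row
dropped_na <- filter(toy, wind_speed < 500)
nrow(dropped_na)  # 2

# Keeps rows with a missing wind speed while still removing the outlier
kept_na <- filter(toy, is.na(wind_speed) | wind_speed < 500)
nrow(kept_na)  # 3
```

For the boxplot itself the difference does not matter, since rows with a missing wind_speed cannot be drawn either way, but it does matter if weather_cleaned is reused for other variables.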