Boxplots Learning Checks

Load Packages and Data

# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)

# Load weather data set in nycflights
data(weather)

LC 4.17

What kind of variable is on the x-axis in this plot?

ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
  geom_boxplot()

Solution

month by itself is coded as a numerical variable whose values are 1 through 12.
By using factor(month) we convert it to a categorical variable whose “labels” are 1 through 12.
Note: we could’ve changed the month variable to a categorical one in the data set weather itself, but we don’t have the data manipulation tools to do that yet!
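
A minimal sketch of what factor() does, using a small made-up vector of month numbers rather than the weather data itself (the names month_num and month_cat are purely illustrative):

```r
# Hypothetical toy vector standing in for weather$month
month_num <- c(1L, 5L, 5L, 12L)

# As stored, the values are numbers, so arithmetic on them works
is.numeric(month_num)
mean(month_num)

# factor() turns each distinct value into a category label
month_cat <- factor(month_num)
is.numeric(month_cat)
levels(month_cat)
nlevels(month_cat)
```

Because a factor has a fixed set of levels, geom_boxplot() can draw one box per level, which is what the plot above relies on.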

LC 4.18

What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.

Solution

It appears to be an outlier. Let’s revisit the filter command to home in on it. We want all data points where the month is 5 and temp is below 25.

filter(weather, month == 5 & temp < 25)

origin	year	month	day	hour	temp	dewp	humid	wind_dir	wind_speed	wind_gust	precip	pressure	visib	time_hour
JFK	2013	5	9	2	13.1	12.02	95.34	80	8.05546	9.270062	0	1016.9	10	2013-05-08 21:00:00

Only one hour, and only at JFK, recorded 13.1 F (-10.5 C) in the month of May. This is probably a data entry mistake! Otherwise, why wasn’t the weather at least similar at EWR (Newark) and LGA (LaGuardia)?
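
As a sketch of why the boxplot draws that observation as a separate dot: by default, geom_boxplot() plots individually any point lying more than 1.5 × IQR beyond the edges of the box. The example below applies the same idea with base R's boxplot.stats(), which computes the box edges slightly differently from ggplot2. The vector may_temps is made up for illustration; only the 13.1 reading comes from the output above.

```r
# Made-up May temperatures (deg F) plus the suspicious 13.1 reading
may_temps <- c(55, 58, 60, 61, 63, 65, 66, 68, 70, 13.1)

# coef = 1.5 is the usual "1.5 times the IQR" rule
stats <- boxplot.stats(may_temps, coef = 1.5)

# Lower whisker end, lower box edge, median, upper box edge, upper whisker end
stats$stats

# Points beyond the whiskers, i.e. the ones drawn as dots
stats$out
```

Here the box runs from 58 to 66, so anything below 58 - 1.5 × 8 = 46 is flagged, and 13.1 is the only such value.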

LC 4.19

Which months tend to have the highest temperature? What reasons do you think this is?

Solution

The solid black lines inside each box indicate medians (not the means!!!), which are a measure of center. For example, at least half of the August data points are about 73 F or higher. I hate heat and humidity! Blech!
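
To see why the distinction between median and mean matters, here is a small sketch with made-up temperatures (the vector temps is hypothetical, not drawn from weather):

```r
# Made-up hourly temperatures (deg F), with one suspiciously low value
temps <- c(70, 72, 73, 74, 76, 13.1)

median(temps)  # middle of the sorted values
mean(temps)    # pulled down by the single low reading
```

The median stays near the bulk of the data, while the mean is dragged toward the low value.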

As is fairly obvious, the summer months of June, July, and August are the hottest months.

LC 4.20

Which months tend to have the highest variability in temperature? What reasons do you think this is?

Solution

We are now interested in the spread of the data. One measure some of you may have seen previously is the standard deviation. But in this plot we can read off the Interquartile Range (IQR):

The distance from the 1st to the 3rd quartiles i.e. the length of the boxes
You can also think of this as the spread of the middle 50% of the data
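
As a small sketch of that definition, the IQR can be computed by hand from the two quartiles and compared with R's built-in IQR() function. The vector temps is made up for illustration:

```r
# Made-up temperatures (deg F), including one missing value
temps <- c(40, 45, 50, 55, 60, 65, 70, NA)

# 1st and 3rd quartiles, skipping the missing value
q <- quantile(temps, probs = c(0.25, 0.75), na.rm = TRUE)
q

# The IQR is the distance between them ...
unname(q[2] - q[1])

# ... which is what IQR() computes directly
IQR(temps, na.rm = TRUE)
```

Without na.rm = TRUE, IQR() stops with an error when the data contain missing values.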

Just from eyeballing it, it seems that:

November has the biggest IQR, i.e. the tallest box, so has the most variation in temperature
August has the smallest IQR, i.e. the shortest box, so is the most consistent temperature-wise

Here’s how we compute the exact IQR values for each month (we’ll see this in more depth in Chapter 5 of the text):

group the observations by month then
for each group, i.e. month, summarise it by applying the summary statistic function IQR(), while making sure to skip over missing data via na.rm=TRUE then
arrange the table in descending order of IQR

group_by(weather, month) %>% 
  summarise(IQR = IQR(temp, na.rm=TRUE)) %>% 
  arrange(desc(IQR))

month	IQR
11	16.02
12	13.68
1	12.96
9	12.06
4	12.06
5	11.88
6	10.98
10	10.98
2	10.08
7	9.18
3	9.00
8	7.02

LC 4.21

Create a similar boxplot as above but for wind_speed. What do you observe and does this make sense?

Solution

ggplot(data = weather, mapping = aes(x = factor(month), y = wind_speed)) +
  geom_boxplot()

There is clearly an outlier in February. What are the units of wind speed? If you look at the help file of the weather data set by typing ?weather in your console, you’ll see that they are in mph.

But does a wind speed of 1000 mph even make sense? Google the following: “fastest wind speed ever recorded”. Mount Washington in New Hampshire held the record of 231 mph (recorded in 1934) until 1996, when it was eclipsed by Barrow Island in Australia, which recorded 253 mph during Tropical Cyclone Olivia. Neither is even close to 1000 mph.

This is clearly a data entry mistake. What can we do to deal with this? We can either:

Change the range on the y-axis using the ylim() command

ggplot(data = weather, mapping = aes(x = factor(month), y = wind_speed)) +
  geom_boxplot() +
  ylim(0, 45)

Delete that observation entirely using, again, the filter() command. We create a new cleaned weather data set called weather_cleaned which keeps all rows with wind_speed less than 500 mph:

weather_cleaned <- filter(weather, wind_speed < 500)
ggplot(data = weather_cleaned, mapping = aes(x = factor(month), y = wind_speed)) +
  geom_boxplot()
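
One caveat about the first option: ylim() does more than zoom in. It drops every observation outside the limits before the boxes are computed, so the medians and quartiles can shift. A possible alternative is coord_cartesian(ylim = ...), which zooms the view while still computing the boxes from all the data. The sketch below uses a made-up data frame called wind (an assumption for illustration, not the real weather data), and the plotting step only runs if ggplot2 is installed:

```r
# Made-up wind speeds (mph) with one impossible value, standing in for weather
wind <- data.frame(
  month = factor(rep(2, 8)),
  wind_speed = c(5, 8, 10, 12, 15, 20, 50, 1000)
)

# What ylim(0, 45) effectively does to the data: values outside are dropped
kept <- wind$wind_speed[wind$wind_speed >= 0 & wind$wind_speed <= 45]
median(wind$wind_speed)  # median of all eight values
median(kept)             # median after dropping 50 and 1000

# Zooming instead of dropping: the boxes are computed from all the data
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data = wind, mapping = aes(x = month, y = wind_speed)) +
    geom_boxplot() +
    coord_cartesian(ylim = c(0, 45))
  # print(p) draws the zoomed plot
}
```

In this toy example the median moves from 13.5 to 11 once the two largest values are dropped, which is the kind of shift ylim() can introduce silently.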

Albert Y. Kim

Mon Oct 3, 2016
