Piping, Grouping, and Summarize Learning Checks

Load Packages and Data

# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)

# Load weather data set in nycflights
data(weather)

LC 5.5

What does the standard deviation column in the summary_temp_by_month data frame tell us about temperatures in New York City throughout the year?

summary_temp_by_month <- weather %>% 
  group_by(month) %>% 
  summarize(
          mean = mean(temp, na.rm = TRUE),
          std_dev = sd(temp, na.rm = TRUE)
          )

summary_temp_by_month

month	mean	std_dev
1	35.64127	10.185459
2	34.15454	6.940228
3	39.81404	6.224948
4	51.67094	8.785250
5	61.59185	9.608687
6	72.14500	7.603357
7	80.00967	7.147631
8	74.40495	5.171365
9	67.42582	8.475824
10	60.03305	8.829652
11	45.10893	10.502249
12	38.36811	9.940822

Solution

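The standard deviation quantifies spread: how much the recorded temperatures within a month vary around that month's mean. November, December, and January have the largest standard deviations (about 10 degrees Fahrenheit each), so temperatures in those months vary the most and you can expect very different temperatures on different days. August, at about 5 degrees, is the most consistent.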

Note: both mean(temp, na.rm = TRUE) and sd(temp, na.rm = TRUE) set na.rm = TRUE to ignore missing values; without it, a single NA makes the whole result NA. This should only be used when necessary, i.e. when there actually are missing values in the data set.
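To make the effect of na.rm concrete, here is a minimal sketch on a small made-up vector (temps is a hypothetical example, not part of the weather data). The same is.na() check could be run on weather$temp to decide whether na.rm = TRUE is needed at all.

```r
# Hypothetical toy vector with one missing value (not taken from weather)
temps <- c(35, NA, 41)

# How many values are missing? If this is 0, na.rm = TRUE is unnecessary.
n_missing <- sum(is.na(temps))

# Without na.rm, the missing value propagates and the result is NA
mean_with_na <- mean(temps)

# With na.rm = TRUE, the NA is dropped before computing the mean and sd
mean_without_na <- mean(temps, na.rm = TRUE)
sd_without_na <- sd(temps, na.rm = TRUE)
```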

LC 5.6

What code would be required to get the mean and standard deviation temperature for each airport in NYC? Do this with and without using the %>% operator.

Solution

Just switch month above with origin. First without %>% piping. I find this awkward, as we first need to create an intermediate variable weather_group_by_airport.

weather_group_by_airport <- group_by(weather, origin) 
summary_temp_by_airport <- 
  summarize(weather_group_by_airport,
          mean = mean(temp, na.rm = TRUE),
          std_dev = sd(temp, na.rm = TRUE)
          )
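As an aside, the intermediate variable can also be avoided without %>% by nesting the calls, at the cost of having to read the code inside out. This is only a sketch of that alternative; the name summary_temp_by_airport_nested is made up for illustration.

```r
library(dplyr)
library(nycflights13)

# Nested version: group_by() is evaluated first, then summarize(),
# even though summarize() appears first when reading left to right
summary_temp_by_airport_nested <-
  summarize(group_by(weather, origin),
            mean = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE)
            )
```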

Then with %>% piping, which IMO is less awkward. Recall that %>% is pronounced “then”.

summary_temp_by_airport <- weather %>% 
  group_by(origin) %>% 
  summarize(
          mean = mean(temp, na.rm = TRUE),
          std_dev = sd(temp, na.rm = TRUE)
          )

We output summary_temp_by_airport:

summary_temp_by_airport

origin	mean	std_dev
EWR	55.48703	18.34351
JFK	54.42183	17.05592
LGA	55.70181	17.89875

Is JFK significantly colder than Newark or LaGuardia? Is that difference meaningful?

LC 5.7

How could we identify how many flights left each of the three airports in each of the months of 2013?

Solution

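We could summarize the count from each airport using the n() function, which counts rows. Be aware that the code below groups the weather data frame by origin only, so it counts weather observations per airport, not flights. To answer the question as posed, apply the same pattern to the flights data frame and group by both origin and month.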

count_flights_by_airport <- weather %>% 
  group_by(origin) %>% 
  summarize(count=n())

count_flights_by_airport

origin	count
EWR	8708
JFK	8711
LGA	8711

All remarkably similar!

Note: the n() function counts rows, whereas the sum(VARIABLE_NAME) function sums all values of a certain variable VARIABLE_NAME.
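Since the question asks about flights per airport per month, here is a sketch of the same n() pattern applied to the flights data frame from nycflights13, grouped by both origin and month. The name count_flights_by_airport_month is made up for illustration, and no output is shown since it was not part of the original solution.

```r
library(dplyr)
library(nycflights13)

# Each row of flights is one departing flight, so n() counts flights.
# Grouping by origin and month gives one row per airport-month combination.
count_flights_by_airport_month <- flights %>% 
  group_by(origin, month) %>% 
  summarize(count = n())

count_flights_by_airport_month
```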

LC 5.8

How could we identify the coldest temperature recorded at each airport without using the View() command?

Solution

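There are three airports: EWR, JFK, and LGA. Note that the steps below rely on the View() spreadsheet viewer; grouping by origin and calling summarize() with min(temp, na.rm = TRUE) gives the same answer without it.

1. Run View(weather)
2. Click on “Filter”
3. Under origin type in EWR
4. Then click the arrow next to temp to sort and get the coldest temperature. For example, for EWR it was 10.94 degrees
5. Repeat for all remaining airports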
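The question asks for an approach that avoids View(), so here is a sketch using the same group_by() and summarize() pattern as the earlier learning checks. The names coldest_temp_by_airport and min_temp are made up for illustration.

```r
library(dplyr)
library(nycflights13)

# For each airport, take the smallest recorded temperature,
# ignoring any missing values in temp
coldest_temp_by_airport <- weather %>% 
  group_by(origin) %>% 
  summarize(min_temp = min(temp, na.rm = TRUE))

coldest_temp_by_airport
```

This returns one row per airport, so all three minimums appear at once instead of being looked up one filter at a time.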

Albert Y. Kim

Fri Oct 14, 2016
