Load Packages and Data

# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)

# Load the weather data set from nycflights13
data(weather)

LC 5.5

What does the standard deviation column in the summary_temp_by_month data frame tell us about temperatures in New York City throughout the year?

summary_temp_by_month <- weather %>% 
  group_by(month) %>% 
  summarize(
          mean = mean(temp, na.rm = TRUE),
          std_dev = sd(temp, na.rm = TRUE)
          )
summary_temp_by_month
month mean std_dev
1 35.64127 10.185459
2 34.15454 6.940228
3 39.81404 6.224948
4 51.67094 8.785250
5 61.59185 9.608687
6 72.14500 7.603357
7 80.00967 7.147631
8 74.40495 5.171365
9 67.42582 8.475824
10 60.03305 8.829652
11 45.10893 10.502249
12 38.36811 9.940822

Solution

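The standard deviation quantifies spread, that is, how much temperatures vary around the monthly mean. November, January, and December have the largest standard deviations (roughly 10 degrees Fahrenheit), so you can expect very different temperatures on different days in those months. August has the smallest (about 5 degrees), so its temperatures are the most consistent.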
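One way to read off the most variable months is to sort the summary by std_dev. The sketch below is only illustrative: it re-enters the printed summary by hand (rounded to two decimals) so that it runs without recomputing from weather, and the name most_variable is hypothetical.

```r
library(dplyr)

# Monthly summary re-entered by hand from the printed output (rounded)
summary_temp_by_month <- data.frame(
  month   = 1:12,
  mean    = c(35.64, 34.15, 39.81, 51.67, 61.59, 72.15,
              80.01, 74.40, 67.43, 60.03, 45.11, 38.37),
  std_dev = c(10.19, 6.94, 6.22, 8.79, 9.61, 7.60,
              7.15, 5.17, 8.48, 8.83, 10.50, 9.94)
)

# Sort months from most to least variable
most_variable <- summary_temp_by_month %>%
  arrange(desc(std_dev))

# The first three rows are months 11, 1, and 12
head(most_variable, 3)
```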

Note: both mean(temp, na.rm = TRUE) and sd(temp, na.rm = TRUE) set na.rm = TRUE to ignore missing values. Use it only when necessary, i.e. when the variable actually contains missing values.
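To see what na.rm does, and how to check whether it is needed, here is a small sketch on a made-up vector. The values in temps are hypothetical and are not taken from weather.

```r
# Hypothetical temperature readings, one of them missing
temps <- c(35.1, NA, 41.0, 38.3)

# Check first whether there are any missing values at all
n_missing <- sum(is.na(temps))

# Without na.rm, a single NA makes the result NA
mean_with_na <- mean(temps)

# With na.rm = TRUE, the NA is dropped before averaging
mean_without_na <- mean(temps, na.rm = TRUE)
```

The same check on the real data would be sum(is.na(weather$temp)).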

LC 5.6

What code would be required to get the mean and standard deviation temperature for each airport in NYC? Do this with and without using the %>% operator.

Solution

Just replace month with origin in the code above. First, without %>% piping. I find this awkward, as we first need to create an intermediate variable weather_group_by_airport.

weather_group_by_airport <- group_by(weather, origin) 
summary_temp_by_airport <- 
  summarize(weather_group_by_airport,
          mean = mean(temp, na.rm = TRUE),
          std_dev = sd(temp, na.rm = TRUE)
          )

Then with %>% piping, which IMO is less awkward. Recall that %>% is pronounced “then”.

summary_temp_by_airport <- weather %>% 
  group_by(origin) %>% 
  summarize(
          mean = mean(temp, na.rm = TRUE),
          std_dev = sd(temp, na.rm = TRUE)
          )

We output summary_temp_by_airport:

summary_temp_by_airport
origin mean std_dev
EWR 55.48703 18.34351
JFK 54.42183 17.05592
LGA 55.70181 17.89875
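A third way to write this without %>% is to nest the calls, which avoids the intermediate variable but has to be read from the inside out. The sketch below uses a hypothetical toy data frame (toy_weather) rather than weather so it runs quickly, and checks that the nested and piped versions agree.

```r
library(dplyr)

# Hypothetical toy stand-in for weather
toy_weather <- data.frame(
  origin = c("EWR", "EWR", "JFK", "JFK", "LGA", "LGA"),
  temp   = c(30, 50, 32, 48, 35, NA)
)

# Nested version: group_by() runs first, then summarize()
nested <- summarize(
  group_by(toy_weather, origin),
  mean = mean(temp, na.rm = TRUE),
  std_dev = sd(temp, na.rm = TRUE)
)

# Piped version: same steps, read top to bottom
piped <- toy_weather %>%
  group_by(origin) %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )

# Both approaches should give the same summary
all.equal(nested, piped)
```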

Is JFK significantly colder than Newark or LaGuardia? Is a difference of about one degree meaningful when the standard deviations are around 17 to 18 degrees?

LC 5.7

How could we identify how many flights left each of the three airports in each of the months of 2013?

Solution

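We can count rows per group with the n() function. Note that the code below uses weather, so it counts hourly weather records per airport rather than flights, and it does not break the counts down by month. To answer the question as posed, apply the same pattern to the flights data frame and group by both origin and month.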

count_flights_by_airport <- weather %>% 
  group_by(origin) %>% 
  summarize(count=n())
count_flights_by_airport
origin count
EWR 8708
JFK 8711
LGA 8711

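All remarkably similar, which makes sense: these are counts of hourly weather records, roughly one per hour of the year at each airport.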
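The question asks about flights per airport per month, which calls for the flights data frame from nycflights13 rather than weather. A minimal sketch, assuming dplyr 1.0.0 or later for the .groups argument; the name count_flights_by_airport_month is hypothetical.

```r
library(dplyr)
library(nycflights13)

# Count flights for every airport-month combination
count_flights_by_airport_month <- flights %>%
  group_by(origin, month) %>%
  summarize(count = n(), .groups = "drop")  # drop grouping after summarizing

count_flights_by_airport_month
```

This should return one row per airport and month, so 36 rows for three airports over twelve months.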

Note: the n() function counts rows, whereas the sum(VARIABLE_NAME) function sums all values of a certain variable VARIABLE_NAME.
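The difference between n() and sum() is easiest to see side by side. The sketch below uses a hypothetical toy data frame; the seats column and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: three flights from two airports
toy <- data.frame(
  origin = c("EWR", "EWR", "JFK"),
  seats  = c(100, 150, 200)
)

toy_summary <- toy %>%
  group_by(origin) %>%
  summarize(
    count = n(),              # number of rows in each group
    total_seats = sum(seats)  # sum of the seats values in each group
  )

toy_summary
```

Here EWR has a count of 2 and a total of 250 seats, while JFK has a count of 1 and a total of 200 seats.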

LC 5.8

How could we identify the coldest temperature recorded at each airport without using the View() command?

Solution

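There are three airports: EWR, JFK, and LGA. The manual approach, which relies on View() and so does not satisfy the question, is:

  • Run View(weather)
  • Click on “Filter”
  • Under origin, type in EWR
  • Then click the arrow next to temp to sort by temperature and find the coldest reading. For example, for EWR it was 10.94 degrees
  • Repeat for the remaining airports

Without View(), group weather by origin and summarize with min(temp, na.rm = TRUE), following the same pattern as the mean and standard deviation summaries above.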
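A minimal sketch of a code-only approach, which avoids View() as the question asks. It reuses the group_by() and summarize() pattern from the earlier solutions; the names coldest_temp_by_airport and min_temp are hypothetical.

```r
library(dplyr)
library(nycflights13)

# Coldest recorded temperature at each airport
coldest_temp_by_airport <- weather %>%
  group_by(origin) %>%
  summarize(min_temp = min(temp, na.rm = TRUE))  # ignore missing readings

coldest_temp_by_airport
```

The result has one row per airport, so all three minimums appear at once instead of being looked up one filter at a time.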