# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)
# Load weather data set in nycflights
data(weather)
What does the standard deviation column in the summary_temp_by_month data frame tell us about temperatures in New York City throughout the year?
summary_temp_by_month <- weather %>%
  group_by(month) %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
summary_temp_by_month
| month | mean | std_dev |
|---|---|---|
| 1 | 35.64127 | 10.185459 |
| 2 | 34.15454 | 6.940228 |
| 3 | 39.81404 | 6.224948 |
| 4 | 51.67094 | 8.785250 |
| 5 | 61.59185 | 9.608687 |
| 6 | 72.14500 | 7.603357 |
| 7 | 80.00967 | 7.147631 |
| 8 | 74.40495 | 5.171365 |
| 9 | 67.42582 | 8.475824 |
| 10 | 60.03305 | 8.829652 |
| 11 | 45.10893 | 10.502249 |
| 12 | 38.36811 | 9.940822 |
The standard deviation quantifies spread, or variability. November, December, and January have the largest standard deviations (roughly 10 degrees each), so temperatures vary the most in those months, and you can expect very different temperatures on different days.
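One way to check this reading against the table is to sort the summary by std_dev. The following is a minimal sketch that rebuilds summary_temp_by_month so it runs on its own; most_variable_months is a name introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Rebuild the monthly summary so this snippet is self-contained
summary_temp_by_month <- weather %>%
  group_by(month) %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )

# Sort months from most to least variable
most_variable_months <- summary_temp_by_month %>%
  arrange(desc(std_dev))
most_variable_months
```

With the values shown in the table above, months 11, 1, and 12 would appear at the top.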
Note: both mean(temp, na.rm = TRUE) and sd(temp, na.rm = TRUE) set na.rm = TRUE to ignore missing values. This should only be used when necessary, i.e., when there actually are missing values in the data set.
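To see what na.rm = TRUE changes, here is a small sketch on a made-up vector (the values in temps are hypothetical and are not taken from weather). It also shows one way to check whether missing values are present before reaching for na.rm.

```r
# A small made-up vector with one missing value (not from the weather data)
temps <- c(30.2, NA, 41.0)

mean(temps)                # NA: a single missing value propagates
mean(temps, na.rm = TRUE)  # 35.6: computed from the two non-missing values
sd(temps, na.rm = TRUE)    # standard deviation of the non-missing values

# Count missing values first to see whether na.rm = TRUE is needed at all
sum(is.na(temps))          # 1
```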
What code would be required to get the mean and standard deviation of temperature for each airport in NYC? Do this with and without using the %>% operator.
Just switch month above with origin. First without %>% piping. I find this awkward, as we first need to create an intermediate variable weather_group_by_airport.
weather_group_by_airport <- group_by(weather, origin)
summary_temp_by_airport <-
  summarize(weather_group_by_airport,
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
Then with %>% piping, which IMO is less awkward. Recall %>% is pronounced “then”.
summary_temp_by_airport <- weather %>%
  group_by(origin) %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
We output summary_temp_by_airport:
summary_temp_by_airport
| origin | mean | std_dev |
|---|---|---|
| EWR | 55.48703 | 18.34351 |
| JFK | 54.42183 | 17.05592 |
| LGA | 55.70181 | 17.89875 |
Is JFK significantly colder than Newark or LaGuardia? Is that difference meaningful?
How could we identify how many flights left each of the three airports in each of the months of 2013?
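We could summarize the count from each airport using the n() function, which counts rows. Note that weather has one row per hourly weather observation, not per flight, so the code below counts observations at each airport; to count departures by airport and month, apply the same pattern to the flights data frame, grouped by origin and month.
count_flights_by_airport <- weather %>%
  group_by(origin) %>%
  summarize(count = n())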
count_flights_by_airport
| origin | count |
|---|---|
| EWR | 8708 |
| JFK | 8711 |
| LGA | 8711 |
All remarkably similar, as you would expect for hourly weather observations recorded at each airport.
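Since the question asks about flights per airport per month, a sketch using the flights data frame from nycflights13 (one row per departing flight) might look like the following. The name count_flights_by_airport_month is introduced here for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(nycflights13)

# flights has one row per departing flight, so n() counts flights here
count_flights_by_airport_month <- flights %>%
  group_by(origin, month) %>%
  summarize(count = n(), .groups = "drop")
count_flights_by_airport_month
```

The result has one row for each combination of origin and month.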
Note: the n() function counts rows, whereas the sum(VARIABLE_NAME) function sums all values of a certain variable VARIABLE_NAME.
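The difference between n() and sum() is easiest to see side by side. This sketch uses a tiny made-up data frame (toy and its seats column are hypothetical, for illustration only):

```r
library(dplyr)

# A small made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(
  origin = c("EWR", "EWR", "JFK"),
  seats  = c(100, 150, 200)
)

toy_summary <- toy %>%
  group_by(origin) %>%
  summarize(
    count = n(),              # number of rows in each group
    total_seats = sum(seats)  # sum of the seats values in each group
  )
toy_summary
# EWR: count = 2, total_seats = 250
# JFK: count = 1, total_seats = 200
```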
How could we identify the coldest temperature recorded at each airport without using the View() command?
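There are three airports: EWR, JFK, and LGA.
With View(weather), you would filter origin by typing in EWR and then sort by temp to find the coldest reading. For example, for EWR it was 10.94 degrees. The same lookup can be done in code, without View(), by grouping by origin and taking the minimum of temp.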
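A minimal sketch of an approach that avoids View() altogether, reusing the group_by() and summarize() pattern from the earlier answers with min() as the summary function. The names coldest_temp_by_airport and min_temp are introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Coldest recorded temperature at each airport, ignoring missing values
coldest_temp_by_airport <- weather %>%
  group_by(origin) %>%
  summarize(min_temp = min(temp, na.rm = TRUE))
coldest_temp_by_airport
```

This returns one row per airport, so all three minimums can be read off at once instead of filtering and sorting by hand for each origin.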