# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)
# Load the weather data set from nycflights13
data(weather)
What does the standard deviation column in the `summary_temp_by_month` data frame tell us about temperatures in New York City throughout the year?
summary_temp_by_month <- weather %>%
  group_by(month) %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
summary_temp_by_month
month | mean | std_dev |
---|---|---|
1 | 35.64127 | 10.185459 |
2 | 34.15454 | 6.940228 |
3 | 39.81404 | 6.224948 |
4 | 51.67094 | 8.785250 |
5 | 61.59185 | 9.608687 |
6 | 72.14500 | 7.603357 |
7 | 80.00967 | 7.147631 |
8 | 74.40495 | 5.171365 |
9 | 67.42582 | 8.475824 |
10 | 60.03305 | 8.829652 |
11 | 45.10893 | 10.502249 |
12 | 38.36811 | 9.940822 |
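The standard deviation quantifies spread and variability. November, December, and January have the largest standard deviations (roughly 10 degrees), so temperatures vary the most in those months and you can expect very different temperatures on different days. August has the smallest (about 5.2 degrees), so its temperatures are the most consistent.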
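To read off the most and least variable months without scanning the table by eye, the summary can be sorted by its standard deviation column. This is a minimal sketch that recomputes the summary so it runs on its own; `months_by_spread` is a name introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Recompute the monthly summary from the weather data frame
summary_temp_by_month <- weather %>%
  group_by(month) %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )

# Sort months from most variable to least variable
months_by_spread <- summary_temp_by_month %>%
  arrange(desc(std_dev))

months_by_spread
```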
Note: both `mean(temp, na.rm = TRUE)` and `sd(temp, na.rm = TRUE)` set `na.rm = TRUE` to ignore missing values. This should only be used when necessary, i.e., when there actually are missing values in the data set.
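A small illustration of what `na.rm = TRUE` changes, using a made-up vector rather than the `weather` data. The names `temps`, `mean_with_na`, `mean_no_na`, and `n_missing` are hypothetical and exist only for this sketch.

```r
# Toy vector with one missing value (made-up numbers, not from weather)
temps <- c(30, NA, 50)

# Without na.rm, the missing value propagates and the result is NA
mean_with_na <- mean(temps)

# With na.rm = TRUE, the NA is dropped and the mean of 30 and 50 is returned
mean_no_na <- mean(temps, na.rm = TRUE)

# Counting missing values first shows whether na.rm = TRUE is needed at all
n_missing <- sum(is.na(temps))
```

Running the same `sum(is.na(...))` check on a column such as `weather$temp` is one way to decide whether the argument is warranted.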
What code would be required to get the mean and standard deviation of temperature for each airport in NYC? Do this with and without using the `%>%` operator.
Just switch `month` above with `origin`. First without `%>%` piping. I find this awkward, as we first need to create an intermediate variable `weather_group_by_airport`.
weather_group_by_airport <- group_by(weather, origin)
summary_temp_by_airport <-
  summarize(weather_group_by_airport,
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
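Without the pipe, the intermediate variable can also be avoided by nesting the calls, at the cost of having to read the code inside-out. A sketch of that alternative follows; `summary_temp_by_airport_nested` is a name introduced here so it does not overwrite the result above.

```r
library(dplyr)
library(nycflights13)

# Nested form: group_by() is evaluated first and its result is passed
# directly to summarize() as the first argument
summary_temp_by_airport_nested <- summarize(
  group_by(weather, origin),
  mean = mean(temp, na.rm = TRUE),
  std_dev = sd(temp, na.rm = TRUE)
)
```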
Then with `%>%` piping, which IMO is less awkward. Recall `%>%` is pronounced “then”.
summary_temp_by_airport <- weather %>%
  group_by(origin) %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )
We output `summary_temp_by_airport`:
summary_temp_by_airport
origin | mean | std_dev |
---|---|---|
EWR | 55.48703 | 18.34351 |
JFK | 54.42183 | 17.05592 |
LGA | 55.70181 | 17.89875 |
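Is JFK significantly colder than Newark or La Guardia? Is that difference meaningful? JFK's mean is only about 1 degree lower, while temperatures at each airport have a standard deviation of roughly 17 to 18 degrees.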
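One rough way to put a number on this is to express each airport's gap from JFK in units of the standard deviation. This is an informal yardstick under that assumption, not a significance test; `gap_vs_jfk`, `diff_from_jfk`, and `diff_in_sd` are names introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Recompute the per-airport summary so the sketch runs on its own
summary_temp_by_airport <- weather %>%
  group_by(origin) %>%
  summarize(
    mean = mean(temp, na.rm = TRUE),
    std_dev = sd(temp, na.rm = TRUE)
  )

# Express each airport's mean relative to JFK, in degrees and in units of
# that airport's own standard deviation
gap_vs_jfk <- summary_temp_by_airport %>%
  mutate(
    diff_from_jfk = mean - mean[origin == "JFK"],
    diff_in_sd = diff_from_jfk / std_dev
  )

gap_vs_jfk
```

A `diff_in_sd` value close to zero suggests the gap is small compared with the ordinary variation in temperature at that airport.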
How could we identify how many flights left each of the three airports in each of the months of 2013?
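We could summarize the count for each airport using the `n()` function, which counts rows. Note that the code below is applied to `weather`, so it counts hourly weather records per airport rather than flights; counting flights per airport and month would use the same pattern on the `flights` data frame, grouped by both `origin` and `month`.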
count_flights_by_airport <- weather %>%
  group_by(origin) %>%
  summarize(count = n())
count_flights_by_airport
origin | count |
---|---|
EWR | 8708 |
JFK | 8711 |
LGA | 8711 |
All remarkably similar!
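To count flights per airport and month, as the question asks, the same `n()` pattern could be applied to the `flights` data frame from `nycflights13` with both `origin` and `month` as grouping variables. A minimal sketch, where `count_flights_by_airport_month` is a name introduced here:

```r
library(dplyr)
library(nycflights13)

# One row per airport-month combination, with the number of flights in each
count_flights_by_airport_month <- flights %>%
  group_by(origin, month) %>%
  summarize(count = n(), .groups = "drop")

count_flights_by_airport_month
```

The `.groups = "drop"` argument returns an ungrouped result, which avoids carrying the `origin` grouping into later steps.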
Note: the `n()` function counts rows, whereas the `sum(VARIABLE_NAME)` function sums all values of a certain variable `VARIABLE_NAME`.
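The difference is easiest to see side by side on a tiny made-up data frame. Everything here (`toy`, `seats`, `toy_summary`, and the numbers) is invented for illustration and is not part of `nycflights13`.

```r
library(dplyr)

# Made-up data: three rows, two for EWR and one for JFK
toy <- tibble(
  origin = c("EWR", "EWR", "JFK"),
  seats = c(100, 150, 200)
)

toy_summary <- toy %>%
  group_by(origin) %>%
  summarize(
    n_rows = n(),             # number of rows in each group
    total_seats = sum(seats)  # sum of the seats values in each group
  )

toy_summary
```

For `EWR` this gives 2 rows and 250 total seats; for `JFK`, 1 row and 200 total seats.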
How could we identify the coldest temperature recorded at each airport without using the `View()` command?
There are three airports: `EWR`, `JFK`, and `LGA`. Run `View(weather)`, filter the `origin` column by typing in `EWR`, then sort by `temp` to get the coldest day. For example, for `EWR` it was 10.94 degrees Fahrenheit.
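Since the question asks for an approach that does not rely on `View()`, the same grouped-summary pattern used earlier could be applied with `min()` instead of `mean()`. A minimal sketch, where `coldest_temp_by_airport` and `min_temp` are names introduced here:

```r
library(dplyr)
library(nycflights13)

# Lowest recorded temperature at each airport, ignoring missing values
coldest_temp_by_airport <- weather %>%
  group_by(origin) %>%
  summarize(min_temp = min(temp, na.rm = TRUE))

coldest_temp_by_airport
```

If the installed version of `nycflights13` matches the data used above, the `EWR` row should agree with the value found through the viewer.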