Load Packages and Data

# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)

# Load the weather data set from nycflights13
data(weather)

LC 4.23-4.25

ggplot(data = weather, aes(x = temp)) +
  geom_histogram(bins = 30)

ggplot(data = weather, aes(x = temp)) +
  geom_histogram(bins = 60)

  1. What does changing the number of bins from 30 to 60 tell us about the distribution of temperatures?
  2. Would you classify the distribution of temperatures as symmetric or skewed?
  3. What would you guess is the “center” value in this distribution? Why did you make that choice?

Solution

  1. The distribution doesn’t change much. But by refining the bin width, we see that the temperature data is recorded with a high degree of precision. Looking at the temp variable with View(weather), we see that each temperature is recorded to 2 decimal places.
  2. It is rather symmetric, i.e. there are no long tails on either side.
  3. The center is around 55°F. By running the summary() command, we see that the mean and median are very similar. In fact, when a distribution is perfectly symmetric, the mean equals the median.
summary(weather$temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.94   39.92   55.04   55.20   69.98  100.00       1
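
The symmetry claim can also be checked numerically. In the summary output above, the median sits 15.12°F above the 1st quartile (55.04 - 39.92) and 14.94°F below the 3rd quartile (69.98 - 55.04), so the two halves of the middle 50% are nearly the same width. The following is a minimal sketch of that check; symmetry_summary() is a hypothetical helper written for illustration, not a function from any package.

```r
# Hypothetical helper (not from any package): compares the mean with the
# median and measures how far each quartile lies from the median.
symmetry_summary <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(
    mean      = mean(x, na.rm = TRUE),
    median    = q[2],
    lower_gap = q[2] - q[1],  # median minus 1st quartile
    upper_gap = q[3] - q[2]   # 3rd quartile minus median
  )
}

# Apply it to the temperature data when nycflights13 is installed
if (requireNamespace("nycflights13", quietly = TRUE)) {
  print(symmetry_summary(nycflights13::weather$temp))
}
```

If lower_gap and upper_gap are close, and the mean is close to the median, the distribution is roughly symmetric around its center.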

LC 4.26

  1. Relative to Seattle, WA temperatures, are these data spread out greatly from the center or are they close to it? Use this chart as a reference for Seattle.

Solution

While it appears that Seattle weather has a similar center of 55°F, its temperatures are almost entirely between 35°F and 75°F, for a range of about 40°F. Seattle temperatures are much less spread out than New York's, i.e. much more consistent over the year. New York, on the other hand, has much colder days in the winter and much hotter days in the summer.

Expressed differently, the middle 50% of New York temperatures, as delineated by the interquartile range, spans about 30°F:

IQR(weather$temp, na.rm = TRUE)
## [1] 30.06
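
For a like-for-like comparison with Seattle's roughly 40°F range, the full range of the New York data can be computed alongside the IQR. From the summary() output earlier, this is 100.00 - 10.94 = 89.06°F. Below is a minimal sketch; spread_summary() is a hypothetical helper introduced here for illustration only.

```r
# Hypothetical helper (not from any package): collects the minimum, maximum,
# full range, and interquartile range of a numeric vector, ignoring NAs.
spread_summary <- function(x) {
  r <- range(x, na.rm = TRUE)
  c(
    min   = r[1],
    max   = r[2],
    range = r[2] - r[1],          # full spread of the data
    iqr   = IQR(x, na.rm = TRUE)  # spread of the middle 50%
  )
}

# Apply it to the temperature data when nycflights13 is installed
if (requireNamespace("nycflights13", quietly = TRUE)) {
  print(spread_summary(nycflights13::weather$temp))
}
```

The range is sensitive to a single extreme reading, while the IQR is not, so reporting both gives a fuller picture of spread.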