Using the starter code below, run a multiple regression of
- \(y =\) arrival delay
- \(x_1 =\) departure delay (numerical variable)
- \(x_2 =\) carrier (categorical variable with \(k=2\) levels. In other words, carrier now varies.)
library(ggplot2)
library(dplyr)
library(mosaic)
library(broom)
library(nycflights13)
data(flights)
alaska_frontier_flights <- flights %>%
filter(carrier == "AS" | carrier == "F9") %>%
filter(dep_delay < 250)
ggplot(alaska_frontier_flights, aes(x=dep_delay, y=arr_delay, col=carrier)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
Note: We are only investigating flights where the delay was not extreme (less than 250min). Let’s look at the regression output:
model <- lm(arr_delay~dep_delay+carrier, data=alaska_frontier_flights)
tidy(model, conf.int = TRUE)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -15.642 | 0.800 | -19.546 | 0 | -17.211 | -14.072 |
dep_delay | 0.979 | 0.016 | 62.388 | 0 | 0.949 | 1.010 |
carrierF9 | 17.734 | 1.151 | 15.409 | 0 | 15.477 | 19.992 |
It appears that, holding departure delay fixed, Frontier flights arrive on average about 17.7 minutes later than Alaska flights. Why? Let's dig deeper:
alaska_frontier_flights %>%
group_by(carrier, origin, dest) %>%
summarise(count=n())
carrier | origin | dest | count |
---|---|---|---|
AS | EWR | SEA | 712 |
F9 | LGA | DEN | 675 |
What about temp? Does it impact arr_delay?
data("weather")
alaska_frontier_flights <- alaska_frontier_flights %>%
left_join(weather, by=c("year", "month", "day", "hour", "origin"))
model_2 <- lm(arr_delay~dep_delay+carrier+temp, data=alaska_frontier_flights)
tidy(model_2, conf.int = TRUE)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -4.681 | 1.885 | -2.484 | 0.013 | -8.378 | -0.984 |
dep_delay | 0.991 | 0.016 | 63.679 | 0.000 | 0.960 | 1.021 |
carrierF9 | 17.685 | 1.135 | 15.578 | 0.000 | 15.458 | 19.912 |
temp | -0.196 | 0.030 | -6.454 | 0.000 | -0.256 | -0.137 |
What about temp AND humid? Do they impact arr_delay? Note that temp and humid are highly correlated.
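As a quick check of this claim, here is a minimal sketch (using the joined alaska_frontier_flights data frame from above) that computes the correlation between the two:
# Correlation between temperature and humidity among these flights:
alaska_frontier_flights %>%
  summarize(correlation = cor(temp, humid, use = "complete.obs"))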
model_3 <- lm(arr_delay~dep_delay+carrier+temp+humid, data=alaska_frontier_flights)
tidy(model_3, conf.int = TRUE)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -2.008 | 2.640 | -0.760 | 0.447 | -7.187 | 3.171 |
dep_delay | 0.994 | 0.016 | 63.357 | 0.000 | 0.963 | 1.025 |
carrierF9 | 17.430 | 1.148 | 15.176 | 0.000 | 15.177 | 19.683 |
temp | -0.199 | 0.030 | -6.522 | 0.000 | -0.258 | -0.139 |
humid | -0.042 | 0.029 | -1.445 | 0.149 | -0.098 | 0.015 |
United Airlines is in the news a lot lately. Do they have bigger departure delays than their rivals?
library(ggplot2)
library(dplyr)
library(mosaic)
library(broom)
library(nycflights13)
data(flights)
jan_31_flights <- flights
Using the starter code above, create a new data frame jan_31_flights that consists of flights on January 31 by United and American. Remember, to pick out UA and AA flights, you need the OR operator: keep all rows where carrier == "UA" or carrier == "AA".
jan_31_flights <- flights %>%
filter(month == 1 & day == 31) %>%
filter(carrier == "UA" | carrier == "AA")
As you now know, I’m a big fan of boxplots: comparisons across groups with a single line! In this case, the median delay for United appears to be higher!
ggplot(jan_31_flights, aes(x=carrier, y=dep_delay)) +
geom_boxplot()
The mean departure delays are computed below. We see that United is on average 15.665 - 15.057 = 0.608 minutes later in their departures.
jan_31_flights %>%
group_by(carrier) %>%
summarise(mean_dep_delay=mean(dep_delay, na.rm=TRUE))
carrier | mean_dep_delay |
---|---|
AA | 15.057 |
UA | 15.665 |
Let’s fit the regression using a categorical predictor \(x\)
model <- lm(dep_delay~carrier, data=jan_31_flights)
tidy(model, conf.int = TRUE)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 15.057 | 3.710 | 4.059 | 0.000 | 7.749 | 22.364 |
carrierUA | 0.608 | 4.629 | 0.131 | 0.896 | -8.510 | 9.726 |
\(\beta_{\mbox{carrierUA}}\) is 0.608! i.e. United has on average 0.608 minutes bigger delays, just like we computed above! However, notice that the p-value is 0.896 and the 95% confidence interval (-8.510, 9.726) contains 0, so this difference is not statistically significant.
library(ggplot2)
library(dplyr)
library(nycflights13)
data(flights)
# 1. Load Alaska data, deleting rows that have missing dep or arr data
alaska_flights <- flights %>%
filter(carrier == "AS") %>%
filter(!is.na(dep_delay) & !is.na(arr_delay))
# 2. Number of observations
nrow(alaska_flights)
# 3. Plot
ggplot(data=alaska_flights, aes(x = dep_delay, y = arr_delay)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
# 4. Output regression results
model <- lm(arr_delay ~ dep_delay, data=alaska_flights)
model
summary(model)
# 5. Output regression results in tidy format using broom package:
library(broom)
# Summary table + confidence intervals
model_table <- tidy(model, conf.int = TRUE)
View(model_table)
# Point by point values:
model_values <- augment(model, conf.int = TRUE) %>%
select(arr_delay, dep_delay, .fitted, .resid)
View(model_values)
Let’s study the three outputs:
1. Plot of points and “best-fitting line”:
2. The regression table model_table:
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -15.599 | 0.762 | -20.463 | 0 | -17.096 | -14.102 |
dep_delay | 0.972 | 0.024 | 40.733 | 0 | 0.925 | 1.019 |
3. The point-by-point values model_values (only first 6 of 709 rows):
arr_delay | dep_delay | .fitted | .resid |
---|---|---|---|
-10 | -1 | -16.571 | 6.571 |
-19 | -7 | -22.404 | 3.404 |
-41 | -3 | -18.515 | -22.485 |
1 | 3 | -12.683 | 13.683 |
-18 | -1 | -16.571 | -1.429 |
-9 | 2 | -13.655 | 4.655 |
We are revisiting the Lec33 LC where n=5: there you found left_value and right_value such that they captured 95% of the \(\overline{x}\) values. Today you will do the same, but for a different setup: the proportion of OkCupid users who are female, encoded in the variable is_female created below. To convince yourself that a proportion is just a mean of 0/1 values, run mean(c(0,0,0,1)) in your console.
library(ggplot2)
library(dplyr)
library(mosaic)
library(okcupiddata)
data(profiles)
profiles <- profiles %>%
mutate(is_female = ifelse(sex == "f", 1, 0)) %>%
select(sex, is_female, height)
View(profiles)
n <- 5
Using the starter code above:
1. Create the sampling distribution of \(\widehat{p}\) based on n=5 and interpret it.
2. Do the same for n=100 and interpret it. Compare it to the plot for n=5. What is different?
3. Now pretend you only have ONE sample of size n=100, and not many, many, many samples. Use the mathematical approximation to the standard error
\[ \mbox{SE}_{\widehat{p}} = \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}} \]
profiles <- profiles %>%
mutate(is_female = ifelse(sex == "f", 1, 0)) %>%
select(is_female)
# Center of sampling distribution
p <- mean(profiles$is_female)
n=5
Let's do this for n = 5 first.
# Same as Lec32 LC Part 3, but with replace=FALSE as is more realistic
n <- 5
samples <- do(10000)*mean(resample(profiles$is_female, size=n, replace=FALSE))
# Standard error = sd of sampling distribution
SE <- sd(samples$mean)
SE
## [1] 0.2202745
# Use +/- 2 standard deviations of the mean rule-of-thumb for bell-shaped
# data
left_value <- p - 2*SE
right_value <- p + 2*SE
c(left_value, right_value)
## [1] -0.0382369 0.8428611
# Plot!
title <- paste("10000 Simulations of Sample Proportion Female Based on n =", n)
ggplot(samples, aes(x=mean)) +
geom_histogram(binwidth = 0.01) +
labs(x="Sample Proportion p_hat", title=title) +
geom_vline(xintercept=p, col="red") +
geom_vline(xintercept=left_value, linetype="dashed") +
geom_vline(xintercept=right_value, linetype="dashed")
Looks kind of choppy! Why? Because if we sample only 5 people, there are only 6 possible proportions: \(\frac{0}{5}, \frac{1}{5}, \ldots, \frac{5}{5}\). We need to boost n to get a smoother sampling distribution.
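As a quick sanity check, with n = 5 the only values \(\widehat{p}\) can take are:
# The 6 possible sample proportions when n = 5:
(0:5) / 5
## [1] 0.0 0.2 0.4 0.6 0.8 1.0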
n=100
Let's do this for n = 100 instead.
# Same as Lec32 LC Part 3, but with replace=FALSE as is more realistic
n <- 100
samples <- do(10000)*mean(resample(profiles$is_female, size=n, replace=FALSE))
# Standard error = sd of sampling distribution
SE <- sd(samples$mean)
SE
## [1] 0.04920429
# Use +/- 2 standard deviations of the mean rule-of-thumb for bell-shaped
# data
left_value <- p - 2*SE
right_value <- p + 2*SE
c(left_value, right_value)
## [1] 0.3039035 0.5007207
# Plot!
title <- paste("10000 Simulations of Sample Proportion Female Based on n =", n)
ggplot(samples, aes(x=mean)) +
geom_histogram(binwidth = 0.01) +
labs(x="Sample Proportion p_hat", title=title) +
geom_vline(xintercept=p, col="red") +
geom_vline(xintercept=left_value, linetype="dashed") +
geom_vline(xintercept=right_value, linetype="dashed")
Much better! Also, notice that as we went from n=5 to n=100, the SE went down from 0.221 to 0.049.

In real life, however, there is no many, many, many times; we take ONE sample.
n <- 100
sample <- resample(profiles$is_female, size=n, replace=FALSE)
The point estimate \(\widehat{p}\) of \(p\) is:
p_hat <- mean(sample)
p_hat
## [1] 0.35
Since there is only one sample, and not many, many, many samples, we use the mathematical approximation to the standard error of \(\widehat{p}\)
\[ \mbox{SE}_{\widehat{p}} = \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}} \]
SE <- sqrt(p_hat*(1-p_hat)/n)
SE
## [1] 0.04769696
Key Observation: Look how similar this value is to the SE <- sd(samples$mean)
value above!
We now use this to build our confidence interval using the rule \(\mbox{PE} \pm 1.96\times\mbox{SE}\) since n is large:
c(p_hat -1.96*SE, p_hat + 1.96*SE)
## [1] 0.256514 0.443486
Our confidence interval is (0.257, 0.443), which in this case contains the true \(p=0.402\). Remember: 95% confidence intervals will capture the true value for roughly 95% of samples.

Now change left_value and right_value below so that the dashed vertical lines in the plot capture 95% of sample means. Note these two values. Then change n=5 to n=50. What happens to left_value and right_value? Compare to above.
library(ggplot2)
library(dplyr)
library(mosaic)
library(okcupiddata)
data(profiles)
# Let's make our lives easier by removing all 3 users who did not list their
# height:
# -is.na() returns true if a value is NA missing.
# -!is.na() returns true if a value is NOT NA missing.
profiles <- profiles %>%
select(height) %>%
filter(!is.na(height))
# Recall, this is the true population mean. In a typical real-life situation,
# you won't know this value! This is a rhetorical/theoretical exercise to show
# how sampling influences our estimates.
mu <- mean(profiles$height)
# Same as Lec32 LC Part 3, but with replace=FALSE as is more realistic
n <- 5
samples <- do(10000) * mean(resample(profiles$height, size=n, replace=FALSE))
View(samples)
# Change these two values
left_value <- 60
right_value <- 80
# Plot!
title <- paste("10000 Simulations of Sample Mean Height Based on n =", n)
ggplot(samples, aes(x=mean)) +
geom_histogram(binwidth = 1) +
labs(x="Sample Mean xbar", title=title) +
geom_vline(xintercept=mu, col="red") +
geom_vline(xintercept=left_value, linetype="dashed") +
geom_vline(xintercept=right_value, linetype="dashed")
library(ggplot2)
library(dplyr)
library(mosaic)
library(okcupiddata)
data(profiles)
profiles <- profiles %>%
select(height) %>%
filter(!is.na(height))
# Center of sampling distribution
mu <- mean(profiles$height)
# Same as Lec32 LC Part 3, but with replace=FALSE as is more realistic
n <- 5
samples <- do(10000) * mean(resample(profiles$height, size=n, replace=FALSE))
# Standard error = sd of sampling distribution
SE <- sd(samples$mean)
# Use +/- 2 standard deviations of the mean rule-of-thumb for bell-shaped
# data
left_value <- mu - 2*SE
right_value <- mu + 2*SE
c(left_value, right_value)
## [1] 64.81036 71.78021
# Plot!
title <- paste("10000 Simulations of Sample Mean Height Based on n =", n)
ggplot(samples, aes(x=mean)) +
geom_histogram(binwidth = 1) +
labs(x="Sample Mean xbar", title=title) +
geom_vline(xintercept=mu, col="red") +
geom_vline(xintercept=left_value, linetype="dashed") +
geom_vline(xintercept=right_value, linetype="dashed")
Let's revisit the OkCupid profile data. Note the command xlim(c(50,80)) fixes the x-axis range to be between 50 and 80 inches.
1. Change the xintercept in the geom_vline() to be the true population mean \(\mu\).
2. Try re-running the simulation and repeating. What does this correspond to doing in real life?
3. Comment on the role n plays when it comes to estimating \(\mu\).
library(ggplot2)
library(dplyr)
library(mosaic)
library(okcupiddata)
data(profiles)
n <- 5
# Parts 1 & 2:
resample(profiles$height, size=n, replace=TRUE)
mean(resample(profiles$height, size=n, replace=TRUE))
# Part 3:
samples <- do(10000) * mean(resample(profiles$height, size=n, replace=TRUE))
View(samples)
# Part 4:
ggplot(samples, aes(x=mean)) +
geom_histogram(binwidth = 1) +
labs(x="sample mean height") +
xlim(c(50,80)) +
geom_vline(xintercept=50, col="red")
Note:
- The code above uses replace=TRUE, but we should use replace=FALSE. Think of a poll; once you've polled someone you shouldn't call them again.
- PS-08.
- Run mean(profiles$height, na.rm=TRUE) to get 68.29 inches = 5'8'' = 172cm. But beware: sum(is.na(profiles$height)) will show there are only 3 missing values, so ignoring them doesn't influence the results much.

Learning Checks:
- Each value in the histogram corresponds to randomly sampling n people from the population and computing their sample mean height \(\overline{x}\).
- We did this for n=5, 50, 100.
- As we increase n, the histogram of \(\overline{x}\) narrows.

Running the code from Lec28 Learning Check discussion below, we get the following graph:
Recall that the observed difference in average test scores was \(\overline{x}_O - \overline{x}_E = -0.073\). We mutate() a new variable based on this using the OR operator |:
simulations <- simulations %>%
mutate(more_extreme = difference < -0.073 | difference > 0.073)
For example, here are the first 6 of 1000 simulations that assumed \(H_0: \mu_O - \mu_E = 0\).
 | even | odd | difference | more_extreme |
---|---|---|---|---|
834 | 0.68 | 0.74 | 0.06 | FALSE |
365 | 0.71 | 0.71 | 0.00 | FALSE |
509 | 0.70 | 0.72 | 0.02 | FALSE |
264 | 0.70 | 0.72 | 0.02 | FALSE |
39 | 0.74 | 0.65 | -0.09 | TRUE |
715 | 0.71 | 0.70 | -0.01 | FALSE |
Let’s count the number that are “more extreme” by summing the variable more_extreme
:
simulations %>%
summarise(n_more_extreme = sum(more_extreme))
## n_more_extreme
## 1 263
So 263 times out of 1000 we observed a simulated test statistic either less than -0.073 or greater than +0.073.
The p-value is 0.264. Note that we add one to the numerator to include the actually observed difference of means of -0.073.
\[ p-\mbox{Value} = \frac{263+1}{1000} = 0.264. \]
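The same calculation can be done directly in R; a minimal sketch, assuming the simulations data frame with the more_extreme variable from above:
# Simulated differences at least as extreme, plus the observed one:
(sum(simulations$more_extreme) + 1) / 1000
## [1] 0.264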
Copy the following code into a .R script in RStudio:
library(ggplot2)
library(dplyr)
library(mosaic)
# Load grades.csv into R:
# Step 0: Exploratory Data Analysis ---------------------------------------
# Before doing any statistical testing, always do an Exploratory Data Analysis.
# The answer might be so stinking obvious, you don't need use stats.
# ALWAYS View() your data
View(grades)
# Learning Check 1: Create a visualization that attempts to graphically answer
# the question of "Did students with an even number of letters in their last
# name do better on the final exam than those who didn't?" Can you answer the
# question? Write your code below:
# Step 1: Compute the Observed Difference ---------------------------------
# Let's compute the observed difference in means. i.e. what really happened. But
# first, two sidenotes:
# Sidenote 1.1: Wrapper Functions ---------------------------------------------
# The following two bits of code produce the exact same results, but formatted
# differently: the difference in mean final scores
# Bit 1: Using the dplyr tools
grades %>%
group_by(even_vs_odd) %>%
summarise(mean=mean(final))
# Bit 2: Using the wrapper function
mean(final ~ even_vs_odd, data=grades)
# Because the wrapper function does the same task, but in much more succinct
# fashion, we'll use the Bit 2 approach!
# Sidenote 1.2: Computing differences ----------------------------------------
# We introduce the diff() function to take the difference of two values stored
# in a vector. Watch out for the order!
c(1, 3)
c(1, 3) %>% diff()
c(3, 1) %>% diff()
# Back to Step 1: ---------------------------------------------
# Now let's take the difference in means and compute the difference. Note the
# order of the subtraction! odd-even
mean(final ~ even_vs_odd, data=grades)
mean(final ~ even_vs_odd, data=grades) %>% diff()
# Students with an odd number of letters did on average 7.3% worse on the final
# than those with an even number! But is this difference of 7.3% statistically
# significant? Or is it just due to random chance? This is where statistics
# comes in.
# Assign this difference to observed_diff. We will use this later.
observed_diff <- mean(final ~ even_vs_odd, data=grades) %>% diff()
observed_diff
# Step 2: Simulate the Null Distribution ------------------------------------
# Recall that assuming the null hypothesis, we can permute/shuffle the variable
# even_vs_odd and it doesn't matter! A sidenote first on using shuffle() as our
# simulation tool
# Sidenote 2.1: shuffle() ---------------------------------------------
shuffled_grades <- grades %>%
mutate(even_vs_odd = shuffle(even_vs_odd))
# Compare the two. What is different about them?
View(grades)
View(shuffled_grades)
# This is one simulated shuffle, assuming H0 is true
mean(final ~ shuffle(even_vs_odd), data=grades)
# But think to the lady tasting tea, we need to do this many, many, many times
# to get a sense of the typical random behavior! i.e. the null distribution
# We do() the shuffle many, many, many times.
simulations <- do(10000) * mean(final ~ shuffle(even_vs_odd), data=grades)
# Let's look at the contents:
View(simulations)
# Now for each of the 10000 shuffles, let's compute the difference MAKING SURE
# it matches the order from when we computed the observed_diff. i.e. odd-even
# and NOT even-odd
simulations <- simulations %>%
mutate(difference=odd-even)
simulations
# Learning Checks:
# 1. Plot the results by changing the three SOMETHINGS with the appropriate
# "something"s
ggplot(data=SOMETHING , aes(x=SOMETHING)) +
geom_SOMETHING() +
geom_vline(xintercept = SOMETHING, col="red")
# 2. What is your answer to the question "Did students with an even number of
# letters in their last name do better on the final exam than those who didn't?"
# 3. Why is the "null distribution" centered where it is?
After loading the necessary packages and the grades.csv
spreadsheet into RStudio:
Step 0: Exploratory Data Analysis
ggplot(grades, aes(x=even_vs_odd, y=final)) +
geom_boxplot()
There seems to be a slight difference in the median test scores between the even and odds, but is this difference statistically significant? Note that while we could’ve also done faceted histograms, boxplots allow us to compare groups with a single horizontal line!
Step 1: Compute the Observed Test Statistic
mean(final ~ even_vs_odd, data=grades)
## even odd
## 0.7323734 0.6595197
mean(final ~ even_vs_odd, data=grades) %>% diff()
## odd
## -0.07285366
observed_diff <- mean(final ~ even_vs_odd, data=grades) %>% diff()
We observe a difference of -0.073 = -7.3%. Note above that diff()
does odd-even
i.e. \(\overline{x}_O - \overline{x}_E\). While choosing between odd-even
or even-odd
is inconsequential, it is important to stay consistent. Note: this is the reverse of what is in the class notes.
Step 2: Simulate the Null Distribution
Recall that assuming the null hypothesis, we can permute/shuffle the variable even_vs_odd
and it doesn’t matter! In the interest of time, let’s only do 1000 simulations, not 100000.
Crucial: Note in the mutate()
we do odd-even
and not even-odd
to stay consistent.
simulations <- do(1000) * mean(final ~ shuffle(even_vs_odd), data=grades)
simulations <- simulations %>%
mutate(difference=odd-even)
Let’s look at the first 6 rows of 1000:
Now let’s plot the 1000 simulated differences in average test scores \(\overline{x}_O - \overline{x}_E\):
ggplot(data=simulations , aes(x=difference)) +
geom_histogram(binwidth=0.025) +
labs(x="Avg of Odds - Avg of Evens")
Step 3: Compare Null Distribution to Observed Test Statistic
Recall from Step 1, the observed difference in average test scores \(\overline{x}_O - \overline{x}_E\) was -0.073 = -7.3%, which was saved in observed_diff
. Let’s draw a red line on the null distribution! How likely is this to occur?
ggplot(data=simulations , aes(x=difference)) +
geom_histogram(binwidth=0.025) +
labs(x="Avg of Odds - Avg of Evens") +
geom_vline(xintercept = observed_diff, col="red")
Say you are given a data set where the rows correspond to students who took a test and the columns are two variables:
For example, the first three rows of such a data set might look like:
id | num_letters | test_score |
---|---|---|
1 | odd | 0.7 |
2 | even | 0.6 |
3 | odd | 0.8 |
You want to answer the scientific question: is there a difference in test scores between people with an even number of letters in their last name vs people with an odd number of letters in their last name?
Questions:
All hypothesis testing assumes the null hypothesis is true. In our case, \(H_0\) says that num_letters is meaningless, i.e. it has no bearing on test_score. If num_letters is meaningless, then we can permute its values to no consequence. Thus, assuming \(H_0\) is true, the above observed data is the same as the following permuted data
id | num_letters | test_score |
---|---|---|
1 | odd | 0.7 |
2 | even | 0.6 |
3 | odd | 0.8 |
which is the same as the following permuted data
id | num_letters | test_score |
---|---|---|
1 | odd | 0.8 |
2 | odd | 0.7 |
3 | even | 0.6 |
Now let’s suppose/assume she really can’t tell i.e. she is guessing at random:
Continuing to suppose/assume she really can't tell, i.e. she is guessing at random, how would you use the mosaic (in particular the 4 functions), dplyr, and ggplot2 packages to simulate this?
Create a new R Script (File -> New File -> R Script) and copy the following starter code:
library(ggplot2)
library(dplyr)
library(mosaic)
# Single cup outcome, where
# -1 indicates correct
# -0 indicates incorrect
single_cup_outcome <- c(1, 0)
library(ggplot2)
library(dplyr)
library(mosaic)
# Single cup outcome, where
# -1 indicates correct
# -0 indicates incorrect
single_cup_outcome <- c(1, 0)
# 8 guesses:
do(8) * resample(single_cup_outcome, size=1)
resample(single_cup_outcome, size=8)
# 8 guesses, many, many, many times:
simulation <- do(10000) * resample(single_cup_outcome, size=8)
View(simulation)
# Count the number right by adding across the V1-V8 columns. Note that
# summarise(sum = sum(variable)) collapses one column over all its rows; it does
# not add values across columns within each row, so we use mutate() instead:
simulation <- simulation %>%
mutate(n_correct = V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8)
View(simulation)
# Visualize:
ggplot(simulation, aes(x=n_correct)) +
geom_bar() +
labs(x="Number of Guesses Correct", title="Assuming she is guessing at random")
Do people prefer this? | Or this? |
---|---|
How would you design a taste test to ascertain, independent of hype, which fried chicken tastes better? Use the relevant principles of designing experiments from above.
Final score: KFC 8, Ezell’s 4. Some notes:
Go over this code first:
# Load packages
library(dplyr)
library(ggplot2)
library(mosaic)
# Define vector to sample from
fruit <- c("apple", "orange", "mango")
# 1. Shuffling works with do()
# Do this many times to get a feel for it:
do(5) * shuffle(fruit)
# 2. Shuffling works with mutate as well:
example_data <- data_frame(
name = c("Ilana", "Abbi", "Hannibal"),
fruit = c("apple", "orange", "mango")
)
# Do this many times to get a feel for it:
example_data %>%
mutate(fruit = shuffle(fruit))
# 3. Testing the various inputs. Discuss with your peers what each is doing:
resample(fruit, size=1)
resample(fruit, replace=FALSE)
resample(fruit, prob=c(0.495, 0.495, 0.01))
LC1: Recreate rflip(10) using the resample() command. Hint: coin <- c("H", "T")
LC2: Recreate the shuffle(fruit) command by changing the minimal number of default settings of resample(). Test this on fruit.
A medical doctor pores over some of his patients' medical records and observes:
People who do this: | Wake up with this: |
---|---|
He then asserts the following causal relationship:
What’s wrong with the doctor’s logic? What is really going on?
fruit <- c("apple", "orange", "mango")
# LC1: rflip(10)
coin <- c("H", "T")
resample(coin, size=10)
rflip(10)
# LC2: shuffle(fruit)
resample(fruit, replace=FALSE)
shuffle(fruit)
# LC3: The following yields an error. You can't sample more elements without
# replacement than there are in the vector. In other words, the largest sample
# without replacement of fruit is of size 3.
resample(fruit, size=15, replace=FALSE)
# LC4: Note the following two commands are the same because of the way the
# defaults are set:
resample(fruit, size=15)
resample(fruit, size=15, replace=TRUE)
# LC5: Fastest way to repeat 5 times. Use do()!
do(5) * resample(fruit, size=15)
As for our doctor:
Go over this code first:
# Load packages
library(dplyr)
library(ggplot2)
# New package
library(mosaic)
# Flip a coin once. Try this multiple times:
rflip()
# Flip a coin 10 times. Try this multiple times:
rflip(10)
# Flip a coin 10 times, but do this 5 times. Try this multiple times
do(5) * rflip(10)
# Flip a coin 10 times, but do this 500 times
do(500) * rflip(10)
# Gah! There are too many rows!
simulations <- do(500) * rflip(10)
# Convert to data frame format; this allows us to better view in console
simulations <- simulations %>%
as_data_frame()
# We could also View() it
View(simulations)
resample(c(1:6), 2)
coin_flips <- do(500) * rflip(10)
coin_flips <- coin_flips %>%
as_data_frame()
If we View(coin_flips)
the first 6 rows, we see that we have in tidy format:
n | heads | tails | prop |
---|---|---|---|
10 | 8 | 2 | 0.8 |
10 | 4 | 6 | 0.4 |
10 | 6 | 4 | 0.6 |
10 | 4 | 6 | 0.4 |
10 | 8 | 2 | 0.8 |
10 | 6 | 4 | 0.6 |
So we plot a histogram of the heads
variable with binwidth=1
since we are dealing with integers i.e. whole numbers.
ggplot(coin_flips, aes(x=heads)) +
geom_histogram(binwidth = 1)
Let's unpack resample(c(1:6), 2):
- Typing c(1:6) in the console returns six values, 1 2 3 4 5 6, one for each possible die roll value.
- resample(c(1:6), 2) says: sample a value from 1 to 6 twice. This is akin to rolling a die twice.
two_dice <- do(500) * resample(c(1:6), 2)
two_dice <- two_dice %>%
as_data_frame()
If we View(two_dice)
the first 6 rows, we see that we have in tidy format:
V1 | V2 |
---|---|
5 | 4 |
2 | 3 |
1 | 5 |
4 | 3 |
5 | 4 |
6 | 2 |
So to get the sum of the two dice, we mutate()
a new variable sum
based on the sum of the two die:
two_dice <- two_dice %>%
mutate(sum = V1 + V2)
V1 | V2 | sum |
---|---|---|
5 | 4 | 9 |
2 | 3 | 5 |
1 | 5 | 6 |
4 | 3 | 7 |
5 | 4 | 9 |
6 | 2 | 8 |
And now we plot it:
ggplot(two_dice, aes(x=sum)) +
geom_histogram(binwidth=1)
What’s the deal with the ugly axes tick marks? This is again b/c computers are stupid, and ggplot does not know we are dealing only with whole numbers i.e. integers. We can:
- convert the sum variable from numerical to categorical using as.factor(sum)
- use geom_bar() (for a categorical x-variable) instead of geom_histogram()
ggplot(two_dice, aes(x=as.factor(sum))) +
geom_bar()
For each of the following 4 scenarios
The Royal Air Force wants to study how resistant their airplanes are to bullets. They study the bullet holes on all the airplanes on the tarmac after an air battle against the Luftwaffe (German Air Force).
You want to know the average income of Middlebury graduates in the last 10 years. So you get the records of 10 randomly chosen Midd Kids. They all answer and you take the average.
Imagine it’s 1993 i.e. almost all households have landlines. You want to know the average number of people in each household in Middlebury. You randomly pick out 500 phone numbers from the phone book and conduct a phone survey.
You want to know the prevalence of illegal downloading of TV shows among Middlebury students. You get the emails of 100 randomly chosen Midd Kids and ask them ``How many times did you download a pirated TV show last week?’’
arrange() & _join
In ModernDive, LC5.13 thru 5.17 in Chapters 5.2.5-5.5.3.
To join flights and weather, or in other words match the hourly weather values with each flight, why do we need to join by all of year, month, day, hour, and origin, and not just hour? Because hour is simply a value between 0 and 23; to identify a specific hour, we need to know which year, month, day and at which airport.

5.15 What are some ways to select all three of the dest, air_time, and distance variables from flights? Give the code showing how to do this in at least three different ways.
library(dplyr)
library(nycflights13)
# The regular way:
flights %>%
select(dest, air_time, distance)
# Since they are sequential columns in the data set
flights %>%
select(dest:distance)
# Not as effective, by removing everything else
flights %>%
select(-year, -month, -day, -dep_time, -sched_dep_time, -dep_delay, -arr_time,
-sched_arr_time, -arr_delay, -carrier, -flight, -tailnum, -origin,
-hour, -minute, -time_hour)
5.16 How could one use starts_with
, ends_with
, and contains
to select columns from the flights
data frame? Provide three different examples in total: one for starts_with
, one for ends_with
, and one for contains
.
# Anything that starts with "d"
flights %>%
select(starts_with("d"))
# Anything related to delays:
flights %>%
select(ends_with("delay"))
# Anything related to departures:
flights %>%
select(contains("dep"))
5.17 Why might we want to use the select
function on a data frame? To narrow down the data frame, to make it easier to look at. Using View()
for example.
5MV#3 group_by() & 5MV#4 mutate()
In ModernDive, LC5.5 thru 5.12 in Chapters 5.2.3-5.2.4.
5.5 What does the standard deviation column in the summary_temp_by_month
data frame tell us about temperatures in New York City throughout the year?
library(dplyr)
library(nycflights13)
summary_temp_by_month <- weather %>%
group_by(month) %>%
summarize(
mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE)
)
month | mean | std_dev |
---|---|---|
1 | 35.64127 | 10.185459 |
2 | 34.15454 | 6.940228 |
3 | 39.81404 | 6.224948 |
4 | 51.67094 | 8.785250 |
5 | 61.59185 | 9.608687 |
6 | 72.14500 | 7.603357 |
7 | 80.00967 | 7.147631 |
8 | 74.40495 | 5.171365 |
9 | 67.42582 | 8.475824 |
10 | 60.03305 | 8.829652 |
11 | 45.10893 | 10.502249 |
12 | 38.36811 | 9.940822 |
The standard deviation is a quantification of spread and variability. We see that the period in November, December, and January has the most variation in weather, so you can expect very different temperatures on different days.
5.6 What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC?
summary_temp_by_day <- weather %>%
group_by(year, month, day) %>%
summarize(
mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE)
)
summary_temp_by_day
Note: group_by(day) is not enough, because day is a value between 1-31. We need to group_by(year, month, day).
5.7 Recreate by_monthly_origin
, but instead of grouping via group_by(origin, month)
, group variables in a different order group_by(month, origin)
. What differs in the resulting data set?
by_monthly_origin <- flights %>%
group_by(month, origin) %>%
summarize(count = n())
month | origin | count |
---|---|---|
1 | EWR | 9893 |
1 | JFK | 9161 |
1 | LGA | 7950 |
2 | EWR | 9107 |
2 | JFK | 8421 |
2 | LGA | 7423 |
3 | EWR | 10420 |
3 | JFK | 9697 |
3 | LGA | 8717 |
4 | EWR | 10531 |
4 | JFK | 9218 |
4 | LGA | 8581 |
5 | EWR | 10592 |
5 | JFK | 9397 |
5 | LGA | 8807 |
6 | EWR | 10175 |
6 | JFK | 9472 |
6 | LGA | 8596 |
7 | EWR | 10475 |
7 | JFK | 10023 |
7 | LGA | 8927 |
8 | EWR | 10359 |
8 | JFK | 9983 |
8 | LGA | 8985 |
9 | EWR | 9550 |
9 | JFK | 8908 |
9 | LGA | 9116 |
10 | EWR | 10104 |
10 | JFK | 9143 |
10 | LGA | 9642 |
11 | EWR | 9707 |
11 | JFK | 8710 |
11 | LGA | 8851 |
12 | EWR | 9922 |
12 | JFK | 9146 |
12 | LGA | 9067 |
The difference is they are organized/sorted by month first, then origin.
5.8 How could we identify how many flights left each of the three airports in each of the months of 2013?
We could summarize the count from each airport using the n()
function, which counts rows.
count_flights_by_airport <- flights %>%
group_by(origin, month) %>%
summarize(count=n())
origin | month | count |
---|---|---|
EWR | 1 | 9893 |
EWR | 2 | 9107 |
EWR | 3 | 10420 |
EWR | 4 | 10531 |
EWR | 5 | 10592 |
EWR | 6 | 10175 |
EWR | 7 | 10475 |
EWR | 8 | 10359 |
EWR | 9 | 9550 |
EWR | 10 | 10104 |
EWR | 11 | 9707 |
EWR | 12 | 9922 |
JFK | 1 | 9161 |
JFK | 2 | 8421 |
JFK | 3 | 9697 |
JFK | 4 | 9218 |
JFK | 5 | 9397 |
JFK | 6 | 9472 |
JFK | 7 | 10023 |
JFK | 8 | 9983 |
JFK | 9 | 8908 |
JFK | 10 | 9143 |
JFK | 11 | 8710 |
JFK | 12 | 9146 |
LGA | 1 | 7950 |
LGA | 2 | 7423 |
LGA | 3 | 8717 |
LGA | 4 | 8581 |
LGA | 5 | 8807 |
LGA | 6 | 8596 |
LGA | 7 | 8927 |
LGA | 8 | 8985 |
LGA | 9 | 9116 |
LGA | 10 | 9642 |
LGA | 11 | 8851 |
LGA | 12 | 9067 |
All remarkably similar!
Note: the n()
function counts rows, whereas the sum(VARIABLE_NAME)
function sums all values of a certain numerical variable VARIABLE_NAME
.
5.9 How does the filter
operation differ from a group_by
followed by a summarize
?
filter
picks out rows from the original data set without modifying them, whereas group_by %>% summarize
computes summaries of numerical variables, and hence reports new values.

5.10 What do positive values of the gain
variable in flights
correspond to? What about negative values? And what about a zero value?
Say a flight departed 20 minutes late (dep_delay=20) but arrived only 10 minutes late (arr_delay=10). Then gain = arr_delay - dep_delay = 10 - 20 = -10 is negative, so it "made up time in the air". Conversely, a positive gain means the arrival delay was bigger than the departure delay, i.e. the flight lost time in the air. A gain of 0 means the departure delay and arrival delay were the same, so no time was made up in the air. We see in most cases that the gain is near 0 minutes.
I never understood this. If the pilot says "we're going to make up time in the air" because of a delay by flying faster, why don't they always just fly faster to begin with?
5.11 Could we create the dep_delay
and arr_delay
columns by simply subtracting dep_time
from sched_dep_time
and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights
?
No, because you can't do direct arithmetic on times: dep_time and sched_dep_time are stored as HHMM integers. The difference in time between 12:03 and 11:59 is 4 minutes, but 1203 - 1159 = 44.
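To see this for yourself, here is a small sketch (assuming dplyr and nycflights13 are loaded) comparing naive subtraction of the HHMM-coded times with the actual dep_delay:
flights %>%
  mutate(naive_delay = dep_time - sched_dep_time) %>%
  select(dep_time, sched_dep_time, dep_delay, naive_delay)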
5.12 What can we say about the distribution of gain
? Describe it in a few sentences using the plot and the gain_summary
data frame values.
Most of the time the gain is a little under zero, and it generally falls between -50 and 50 minutes. There are some extreme cases however!
5MV#1 filter() & 5MV#2 summarize()
In ModernDive, LC5.1 thru 5.4 in Chapters 5-5.2.2.
All the following are the same!
library(nycflights13)
library(dplyr)
data(flights)
# Original in book
not_BTV_SEA <- flights %>%
filter(!(dest == "BTV" | dest == "SEA"))
# Alternative way
not_BTV_SEA <- flights %>%
filter(!dest == "BTV" & !dest == "SEA")
# Or even
not_BTV_SEA <- flights %>%
filter(dest != "BTV" & dest != "SEA")
Recall the n() summary function: summarize(count=n()). What does the returned value correspond to? It corresponds to a count of the number of observations/rows:
data(weather)
weather %>%
summarize(count = n())
Why doesn't the following code produce what you might expect?
summary_temp <- weather %>%
summarize(mean = mean(temp, na.rm = TRUE)) %>%
summarize(std_dev = sd(temp, na.rm = TRUE))
Consider the output of only running the first two lines:
weather %>%
summarize(mean = mean(temp, na.rm = TRUE))
Because after the first summarize()
, the variable temp
disappears as it has been collapsed to the value mean
. So when we try to run the second summarize()
, it can’t find the variable temp
to compute the standard deviation of.
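One fix, sketched below: compute both summaries in a single summarize() call, so that temp is still available for both calculations:
summary_temp <- weather %>%
  summarize(mean = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))
summary_temp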
In ModernDive, LC4.26 thru 4.37 in Chapters 4.7.
Note on Wed March 15: The learning checks originally posted were from the previous version of the book, therefore the discussion below might differ slightly from what you wrote originally. The above link has been fixed.
- MQ: 26397 flights departed NYC in 2013 on this carrier.
- arrange(desc(n)) will sort this table in descending order of n!

In ModernDive, LC4.22 thru 4.25.
# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)
# Load weather data set in nycflights
data(weather)
4.22: What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
geom_boxplot()
It appears to be an outlier. Let’s revisit the use of the filter
command to hone in on it. We want all data points where the month
is 5 and temp<25
weather %>%
filter(month==5 & temp < 25)
origin | year | month | day | hour | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
JFK | 2013 | 5 | 9 | 2 | 13.1 | 12.02 | 95.34 | 80 | 8.05546 | 9.270062 | 0 | 1016.9 | 10 | 2013-05-08 21:00:00 |
There appears to be only one hour and only at JFK that recorded 13.1 F (-10.5 C) in the month of May. This is probably a data entry mistake! Why wasn’t the weather at least similar at EWR (Newark) and LGA (La Guardia)?
4.23: Which months have the highest variability in temperature? What reasons do you think this is?
We are now interested in the spread of the data. One measure some of you may have seen previously is the standard deviation. But in this plot we can read off the Interquartile Range (IQR):
Just from eyeballing it, it seems the late fall and winter months (November, December, January) vary the most.
Here’s how we compute the exact IQR values for each month (we’ll see this more in depth Chapter 5 of the text):
- group the observations by month, then
- for each group, i.e. month, summarise it by applying the summary statistic function IQR(), while making sure to skip over missing data via na.rm=TRUE, then
- arrange the table in descending order of IQR
weather %>%
group_by(month) %>%
summarise(IQR = IQR(temp, na.rm=TRUE)) %>%
arrange(desc(IQR))
month | IQR |
---|---|
11 | 16.02 |
12 | 13.68 |
1 | 12.96 |
9 | 12.06 |
4 | 12.06 |
5 | 11.88 |
6 | 10.98 |
10 | 10.98 |
2 | 10.08 |
7 | 9.18 |
3 | 9.00 |
8 | 7.02 |
4.24: We looked at the distribution of a continuous variable over a categorical variable here with this boxplot. Why can’t we look at the distribution of one continuous variable over the distribution of another continuous variable? Say, temperature across pressure, for example?
Because we need a way to group many continuous observations together, say by grouping by month. For pressure, we have near unique values for pressure, i.e. no groups, so we can’t make boxplots.
4.25: Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?
In a histogram, the bin corresponding to where an outlier lies may not be high enough for us to see. In a boxplot, they are explicitly labelled separately.
In ModernDive, LC4.14 thru 4.21 in Chapters 4.5-4.6.
# Load necessary packages
library(ggplot2)
library(dplyr)
library(nycflights13)
# Load weather data set in nycflights
data(weather)
4.14: What does changing the number of bins from 30 to 60 tell us about the distribution of temperatures?
ggplot(data = weather, aes(x = temp)) +
geom_histogram(bins = 30)
ggplot(data = weather, aes(x = temp)) +
geom_histogram(bins = 60)
The distribution doesn't change much. But by refining the bin width, we see that the temperature data has a high degree of accuracy. What do I mean by accuracy? Looking at the temp variable by View(weather), we see that the precision of each temperature recording is 2 decimal places.
4.15: Would you classify the distribution of temperatures as symmetric or skewed?
It is rather symmetric, i.e. there are no long tails on only one side of the distribution
4.16: What would you guess is the “center” value in this distribution? Why did you make that choice?
The center is around 55°F. By running the summary()
command, we see that the mean and median are very similar. In fact, when the distribution is symmetric the mean equals the median.
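A quick way to check this in the console (assuming the weather data from nycflights13 is loaded):
mean(weather$temp, na.rm = TRUE)
median(weather$temp, na.rm = TRUE)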
4.17: Is this data spread out greatly from the center or is it close? Why?
This can only be answered relatively speaking! Let’s pick things to be relative to Seattle, WA temperatures:
While it appears that Seattle weather has a similar center of 55°F, its temperatures are almost entirely between 35°F and 75°F for a range of about 40°F. Seattle temperatures are much less spread out than New York i.e. much more consistent over the year. New York on the other hand has much colder days in the winter and much hotter days in the summer. Expressed differently, the middle 50% of values, as delineated by the interquartile range, is 30°F:
IQR(weather$temp, na.rm=TRUE)
## [1] 30.06
summary(weather$temp)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.94 39.92 55.04 55.20 69.98 100.00 1
4.18: What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables? It lets us see the relationship between temp and month.
4.19: What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?
4.20: For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the variability of the variables and other important characteristics?
Having histograms split by day would not be great:
4.21: Does the temp
variable in the weather
data set have a lot of variability? Why do you say that?
Again, like in LC 4.17, this is a relative question. I would say yes, because in New York City, you have 4 clear seasons with different weather. Whereas in Seattle WA and Portland OR, you have two seasons: summer and rain!
In ModernDive, LC4.9 thru 4.13 in Chapter 4.4. Hint: For LC4.10, Google “NYC Timezone” and note the number next to UTC. UTC stands for Coordinated Universal Time.
weather
and early_january_weather
data frames by running View(weather)
and View(early_january_weather)
in the console. In what respect do these data frames differ? The rows of early_january_weather
are a subset of weather
.weather
data is recorded hourly. Why does the time_hour
variable correctly identify the hour of the measurement whereas the hour variable does not? Because to uniquely identify an hour, we need the year
/month
/day
/hour
sequence, whereas there are only 24 possible hour
’s. Note that in the case of weather
, there is a timezone bug: the time_hour
variable is off by 5 hours from the year
/month
/day
/hour
sequence, since the Eastern Time Zone is 5 hours off UTC.
data(weather)
early_january_weather <- weather %>%
filter(origin == "EWR" & month == 1 & day <= 15)
ggplot(data = early_january_weather, aes(x = time_hour, y = humid)) +
geom_line()
In ModernDive, LC4.1 thru 4.8, which cover in Chapter 4:
- geom_point()
- alpha to control transparency
- geom_jitter(): a variation of geom_point() where we add a little jitter (i.e. random noise) to the points to break log-jams

Load necessary data and packages:
library(dplyr)
library(ggplot2)
library(nycflights13)
data(flights)
all_alaska_flights <- flights %>%
filter(carrier == "AS")
LC4.1: flights
includes all flights, whereas all_alaska_flights
only includes Alaska Airlines flights.
LC 4.2-4.6:
ggplot(data=all_alaska_flights, aes(x = dep_delay, y = arr_delay)) +
geom_point()
Why do dep_delay and arr_delay have a positive relationship? The later a plane departs, typically the later it will arrive.

Now use dep_time as the x-aesthetic and dep_delay as the y-aesthetic:
ggplot(data=all_alaska_flights, aes(x = dep_time, y = dep_delay)) +
geom_point()
Hint: Look at Alaska Airlines’ route map. In fact, there are only two flights paths: Flights 7 and 11 flying from Newark (EWR) to Seattle (SEA).
LC 4.7-4.8:
- Why is the alpha argument value useful with scatter-plots? It thins out the points so we address over-plotting. But more importantly it hints at the (statistical) density and distribution of the points: where are the points concentrated, where do they occur. We will see more about densities and distributions in Chapter 6 when we switch gears to statistical topics.
- What do we see with alpha = 0.2 set in the lower plot? The lower plot suggests that most Alaska flights from NYC depart between 12 minutes early and on time and arrive between 50 minutes early and on time.

LC1-5 Consider the following data in tidy format:
A | B | C | D |
---|---|---|---|
1 | 1 | 3 | Hot |
2 | 2 | 2 | Hot |
3 | 3 | 1 | Cold |
4 | 4 | 2 | Cold |
Letting the x-axis correspond to variable A and the y-axis to variable B, draw using pen & paper the 5 graphics below:
1. a scatter plot
2. a scatter plot where the color of the points corresponds to D
3. a scatter plot where the size of the points corresponds to C
4. a line graph
5. a line graph where the color of the line corresponds to D
Reach for the Stars #1: A little ambitious right now, but see if you can tweak the code below to create baby’s first ggplot2
graphic, graphic #1 above: just the scatter plot. I suggest you:
- copy the code below into a .R script file
- run the ggplot() function from your Editor (not directly in console)

library(dplyr)
library(ggplot2)
simple_ex <-
data_frame(
A = c(1, 2, 3, 4),
B = c(1, 2, 3, 4),
C = c(3, 2, 1, 2),
D = c("Hot", "Hot", "Cold", "Cold")
)
View(simple_ex)
ggplot(data=DATASETNAME, mapping=aes(x=VARIABLENAME, y=VARIABLENAME)) +
geom_point()
Reach for the Stars #2: Even more ambitious right now, but see if you can tweak the same code to make graphic #2 above: a scatter plot where the color of the points corresponds to D. Hint: look at the help file for geom_point by running ?geom_point in the console.

LC1-5: Chalk Talk
Reach for the Stars #1:
ggplot(data=simple_ex, mapping=aes(x=A, y=B)) +
geom_point()
Reach for the Stars #2:
ggplot(data=simple_ex, mapping=aes(x=A, y=B, color=D)) +
geom_point()
Notice: the only change is color=D inside the aes()thetic mapping statement!

Based on the 5NG examples in today's slides, identify:
- the data variables being displayed and what type of variable they are
- which aes()thetic attribute of the geom_etric object the above data variables are being mapped to

Let's look at a random sample of 3 of the movies:
title | budget | rating |
---|---|---|
I Walk the Line | 4e+06 | 6.0 |
Auteur Theory, The | 7e+04 | 4.5 |
Ding-a-ling-Less | 1e+06 | 8.1 |
Both variables are numerical. Here are the components of the Grammar of Graphics:
data variable | aes()thetic attribute | geom_etric object |
---|---|---|
budget | x | point |
rating | y | point |
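For reference, a hypothetical sketch of the corresponding ggplot2 call, assuming a data frame named movies with the budget and rating columns shown above:
ggplot(data = movies, aes(x = budget, y = rating)) +
  geom_point()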
Question: Does spending more on a movie yield higher IMDB ratings?
Let’s look at a random sample of 3 of the dates:
date | n |
---|---|
2013-01-08 | 899 |
2013-01-26 | 680 |
2013-01-28 | 923 |
Both variables are numerical (dates are technically numerical since they are an abstraction of time). Here are the components of the Grammar of Graphics:
data variable | aes()thetic attribute | geom_etric object |
---|---|---|
date | x | line |
n | y | line |
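For reference, a hypothetical sketch of the corresponding ggplot2 call, assuming a data frame named flights_per_day with the date and n columns shown above:
ggplot(data = flights_per_day, aes(x = date, y = n)) +
  geom_line()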
Note: Why did we use line
as the geom_
etric object? Because lines suggest sequence/relationship, and points don’t.
Question: Why are there drops in the number of flights? 2013/01/14 was a Monday.
Let’s look at a random sample of 3 of the users:
sex | height |
---|---|
m | 72 |
m | 71 |
f | 61 |
Height is numerical. Here are the components of the Grammar of Graphics:
data variable | aes()thetic attribute | geom_etric object |
---|---|---|
height | x | histogram |
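For reference, a sketch of the corresponding ggplot2 call using the profiles data from the okcupiddata package:
library(ggplot2)
library(okcupiddata)
data(profiles)
ggplot(data = profiles, aes(x = height)) +
  geom_histogram()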
Note: We’ll see later there is no y
aesthetic here, because there is no explicit variable that maps to it, but rather it is computed internally.
Question: What are the smallest and largest visible heights and what do you think of them? Also, think of one graph improvement to better convey information about SF OkCupid users.
Let’s look at a random sample of 3 of the car year/make/model matchings:
name | trans | hwy |
---|---|---|
2012 Bentley Continental Supersports | Automatic | 19 |
1995 Dodge Caravan C/V/Grand Caravan 2WD | Automatic | 22 |
1989 Dodge Daytona | Manual | 26 |
trans
type is categorical, whereas hwy
is numerical. Here are the components of the Grammar of Graphics:
data variable | aes()thetic attribute | geom_etric object |
---|---|---|
trans | x | boxplot |
hwy | y | boxplot |
Question: About what proportion of manual car models sold between 1984 and 2015 got 20 mpg or worse mileage? Answer: 25%
Let’s look at all the data:
name | n |
---|---|
Carlos | 155711 |
Ethan | 359506 |
Hayden | 105716 |
Name is categorical. Here are the components of the Grammar of Graphics:
data variable | aes()thetic attribute | geom_etric object |
---|---|---|
name | x | bar |
n | y | bar |
Question: About how many babies were named "Hayden" between 1990-2014? Answer: 1e+05 is R's shorthand notation for \(1 \times 10^5 = 100,000\). To help me remember exponents, I just memorize that \(1\times 10^6 = 1,000,000\) i.e. one million.
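A quick console check of this notation (base R):
1e+05 == 100000
## [1] TRUE
format(1e+05, big.mark = ",", scientific = FALSE)
## [1] "100,000"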
Do the 16 Learning Checks in Chapter 3 of ModernDive: Tidy Data. You do not need to submit these.
3.1 3.2 is an example!
3.2 Since there are three variables at play (Date, Price, Stock Name), there should be three columns!
Date | Stock Name | Price |
---|---|---|
2009-01-01 | Boeing | $173.55 |
2009-01-02 | Boeing | $172.61 |
2009-01-03 | Boeing | $173.86 |
2009-01-04 | Boeing | $170.77 |
2009-01-05 | Boeing | $174.29 |
2009-01-01 | Amazon | $174.90 |
2009-01-02 | Amazon | $171.42 |
2009-01-03 | Amazon | $171.58 |
2009-01-04 | Amazon | $173.89 |
2009-01-05 | Amazon | $170.16 |
2009-01-01 | | $174.34 |
2009-01-02 | | $170.04 |
2009-01-03 | | $173.65 |
2009-01-04 | | $174.87 |
2009-01-05 | | $172.19 |
3.3 What does any ONE row in the flights
dataset refer to? Data on a flight. Not a flight path! Example:
3.4 What are some examples in this dataset of categorical variables? What makes them different than quantitative variables?
Hint: Type ?flights
in the console to see what all the variables mean!
carrier
the companydest
the destinationflight
the flight number. Even though this is a number, its simply a label. Example United 1545 isn’t “less than” United 1714distance
the distance in milestime_hour
time3.5 What does int
, dbl
, and chr
mean in the output above?
int
: integer. Used to count things i.e. a discrete value. Ex: the # of cars parked in a lotdbl
: double. Used to measure things. i.e. a continuous value. Ex: your height in incheschr
: character. i.e. text3.6 & 3.7 19 columns (variables) and 336,776 rows (observations i.e. flights)
3.8 weather
, planes
, airports
, airlines
data sets.
The observational units, i.e. what each row corresponds to:
weather
: weather at a given origin
(EWR, JFK, LGA) for a given hour i.e. year
, month
, day
, hour
planes
: a physical aircraftairports
: an airport in the USairlines
: an airline company3.9 See ?airports
help file
3.10 Identification Variables
weather
example in LC3.8, the combination of origin
, year
, month
, day
, hour
are identification variables as they identify the observation in question.temp
, humid
, wind_speed
, etc.3.11 What are common characteristics of “tidy” datasets? Described in Lecture slides.
3.12 What makes “tidy” datasets useful for organizing data? Organized way of viewing data. We’ll see later that this format is required for the ggplot2
and dplyr
packages for data visualization and manipulation.
3.13 There are 2 variables below, but what does each row correspond to? We don’t know b/c there are no identification variables.
students | faculty |
---|---|
4 | 2 |
6 | 3 |
3.14 We need at least a third variable to identify the observations. For example a variable “Department”.
3.15 Sociology example
TRUE
and FALSE
. This is called a logical variable AKA a boolean variable. 1
and 0
can also be used3.16 We can easily _join
them with other data sets! For example, can we join the flights
data with the planes
data? We’ll see this more in Chapter 4!
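A minimal sketch of such a join, assuming dplyr and nycflights13 are loaded; flights and planes share the tailnum key:
library(dplyr)
library(nycflights13)
# Note: both tables have a year column (flight year vs. year built),
# so dplyr will suffix them in the result.
flights_with_planes <- flights %>%
  left_join(planes, by = "tailnum")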
You will be getting your first experience with:
- View(): A command for viewing your data

As described in today's lecture slides, in RStudio (not DataCamp):
- Install the ggplot2, dplyr, and babynames packages.
- Load the ggplot2, dplyr, and babynames packages.

View() Your Data

Load the babynames dataset in the RStudio viewer by running the following in the console. You should get in the habit of always View()ing your data first!

There are two ways to run commands in the R console: either typing them directly into the console, or typing them in a .R R Script and passing them to the console.

Do the following:
- Create a new R Script and save it as babynames. Note: the file extension .R gets added: babynames.R
- Find babynames.R in the File panel of RStudio.
- Make a change to babynames.R and save it again.

Investigate your hypothesized names that are "modern", "old-fashioned", and "back in vogue" by changing baby_name and baby_sex.