The lectures are numbered according to the rough topic outline below. A detailed topic schedule can be found here.

- ~~Getting started: R, RStudio, R packages~~
- ~~Data science: Data visualization via `ggplot2`, tidy data formatting, and data wrangling via `dplyr`~~
- ~~Data modeling: Correlation, simple linear regression, and multiple regression~~
- ~~Data collection: Random sampling for statistical inference and random assignment for causal inference~~
- Statistical inference: Sampling, sampling distributions, standard errors, confidence intervals, and hypothesis testing.

The source code for this webpage is available by clicking the icon on the top-right of the nav bar.

Announcements:

- Solutions to PS10, PS11, and PS12 now posted. Solutions to PS13 will be posted on Saturday.
- Office hours:
- Tomorrow (Wed) 2-5pm in Frost Cafe, not SMudd 208
- Before exam: Sun 12/17 1pm-3pm in SMudd 208
- Before project: Thu 12/21 2pm-5pm in SMudd 208

- Don’t forget the exit survey posted in the syllabus, due by Wed 12/20 noon. It is worth 10% of your individual final grade for the project.

- Closed book, no coding, **bring calculators**.
- No practice exam, but no “curveball” questions. What does this mean? Only “straightforward” questions. What does this mean *in practice*? If you understand the lecture notes and the problem sets, you will be set.

- Lectures 4.1 thru 5.7
- Problem sets 10 thru 13

- Where does randomness come into play?
- Random sampling: statistical inference
- Random assignment: causal inference

- Understand all elements of the sampling from the tub-o-balls interactive app and how they matter for
- Building and interpreting confidence intervals
- Conducting hypothesis tests. Note: you will need to draw all \(p\)-values, not compute them, since you won’t have access to a computer

- Inference for regression: ModernDive Chapter 6.7 Inference for regression
- Interpret output of a regression table
- Why did those investigations of “no pattern in the residuals” matter?

- Central limit theorem: Bunny rabbit video

Today’s lecture:

- 10% Condition: the sample size \(n\) has to be less than 10% of the total population size \(N\) when sampling to estimate the true population proportion \(p\) with \(\widehat{p}\) (SDM p. 448)
- Continuing from last lecture: what to do when you don’t know \(\sigma\)? Use the \(t\)-distribution, whose shape depends on its degrees of freedom.
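As a quick aside (not from lecture), you can check in R how the \(t\)-distribution's critical values approach the normal's 1.96 as the degrees of freedom grow:

```r
# 97.5th percentiles of the t-distribution for increasing degrees of freedom.
# They shrink toward the standard normal's 1.96, which is why t-based
# confidence intervals are wider for small samples.
dfs <- c(2, 5, 10, 30, 100)
qt(0.975, df = dfs)
qnorm(0.975)  # approximately 1.96
```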

**Final problem set of the year! Assigned on Fri 12/8. Due on Wed 12/13 4pm in the mailbox outside my Converse Hall 316 office.**

From Stats, Data, and Models 4th edition:

- Exercise 19.2 (page 512) on being psychic
- Exercise 19.4 (page 512) on being psychic
- Exercise 19.7 (page 512) on being psychic
- Exercise 19.9 (page 513) on bad medicine
- Exercise 19.24 (page 515) on educated mothers
- Exercise 25.1 (page 718) on graduation
- Exercise 25.3 (page 719) on graduation
- Exercise 25.7 (page 719) on graduation
- Exercise 25.11 (page 719) on graduation
- Exercise 25.13 (page 719) on graduation
- Exercise 25.17 (page 719) on graduation

Announcements:

- Drop-in tutoring during the reading + finals period: 7pm-9pm every day from Thu 12/14 to Wed 12/20.

Today’s lecture:

- The Central Limit Theorem as explained via Bunnies and Dragons in a 3m38s video.

Today’s lecture:

- Back to ModernDive: (New) Chapter 6.7 Inference for regression
- Corresponds to Chapter 25 in SDM. In particular, note that SDM 25.2 on “Assumptions and Conditions” corresponds to MD 6.7.5 on “Refresher: Residual analysis”. In particular, we’re answering the question “Why is it important that there be no strong pattern in the residuals?”

Announcements:

- Install the `mosaic` package.

Today’s lecture:

- Why is the rule of thumb: *95% of observations lie within plus/minus 1.96 standard deviations of the mean*?
- Consider the standard normal curve that has mean 0 and standard deviation 1 below:
- The area to the left of 1.96 is 0.975
- The area to the right of 1.96 is 0.025
- Thus the area to the left of -1.96 is 0.025 by symmetry
- Thus the area from -1.96 to 1.96 is 0.975-0.025 = 0.95 = 95%.
- In other words, the middle 95% is from \([0 - 1.96\times 1, 0 + 1.96 \times 1] = [-1.96, 1.96]\)

```
library(mosaic)
xpnorm(1.96, mean = 0, sd = 1)
```

`## [1] 0.9750021`

Announcements:

- Final projects: Please test out the STAT 231 Data Science cleaned datasets sooner rather than later. Make sure they suit your needs.

Today’s lecture:

- Back to interactive app: varying the sample size \(n\). Interpreting confidence intervals.
- In today’s handout, which you can see below.
- The red line corresponds to the true population proportion red \(p = \frac{900}{2400} = 0.375\)
- The top left corresponds to how the sample proportion red \(\widehat{p}\) behaves based on different samples of size \(n=50\).
- The bottom left corresponds to how \(\widehat{p}\) behaves based on samples of size \(n=500\).

- Questions:
- Say I told you I drew a sample of size \(n=50\) balls from the tub, and got 20 red balls i.e. \(\widehat{p}=0.4\), would you believe me?
- Say I told you I drew a sample of size \(n=50\) balls from the tub, and got 25 red balls i.e. \(\widehat{p}=0.5\), would you believe me?
- Say I told you I drew a sample of size \(n=50\) balls from the tub, and got 35 red balls i.e. \(\widehat{p}=0.7\), would you believe me?
- Now say I told you I drew a sample of size \(n=500\) balls from the tub, and got 200 red balls i.e. \(\widehat{p}=0.4\), would you believe me?

- Say I told you I drew a sample of size \(n=500\) balls from the tub, and got 250 red balls i.e. \(\widehat{p}=0.5\), would you believe me?
- Now say the tub truly had population proportion red \(p = \frac{1500}{2400} = 0.625\). Draw in the top right panel how the sample proportion red \(\widehat{p}\) would behave based on different samples of size \(n=50\).
- Now say I told you I drew a sample of size \(n=50\) balls from the tub, and got 35 red balls, would you believe me?
- Again, assuming the tub truly had population proportion red \(p = \frac{1500}{2400} = 0.625\), draw in the bottom right panel how the sample proportion red \(\widehat{p}\) would behave based on different samples of size \(n=500\).
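These questions can be explored with a quick simulation. The following sketch (my own, not the in-class app) draws repeated samples from a simulated tub with 900 red balls out of 2400, i.e. \(p = 0.375\):

```r
# Simulate the tub: 900 red and 1500 white balls, so p = 900/2400 = 0.375.
set.seed(76)
tub <- c(rep("red", 900), rep("white", 1500))

# Sample proportion red for one sample of size n, drawn without replacement.
p_hat <- function(n) mean(sample(tub, n) == "red")

# Sampling distributions of p-hat for n = 50 vs n = 500.
p_hat_50 <- replicate(10000, p_hat(50))
p_hat_500 <- replicate(10000, p_hat(500))

# Both are centered near p = 0.375, but the n = 500 distribution is much
# narrower, so values like p-hat = 0.5 become far less plausible for n = 500.
c(mean(p_hat_50), sd(p_hat_50))
c(mean(p_hat_500), sd(p_hat_500))
```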

**Assigned on Fri 12/01. Due on Thu 12/7 11:30am.**

From Stats, Data, and Models 4th edition:

- Exercise 17.5 (page 464) on marriage
- Exercise 17.22 (page 465) on green M&M’s
- Exercise 17.30 (page 466) on binge drinking
- Exercise 18.1 (page 488) on age
- Exercise 18.3 (page 488) on age
- Exercise 18.6 (page 489) on age. Note a 99% confidence interval has a margin of error of \(2.58 \times \text{SE}\), unlike a 95% confidence interval which as a margin of error of \(1.96 \times \text{SE}\).
- Exercise 18.19 (page 490) on seafood
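Where those 1.96 and 2.58 multipliers come from (a quick check in R, not part of the assignment): they are the normal quantiles leaving 2.5% and 0.5% in each tail.

```r
qnorm(0.975)  # approximately 1.96: multiplier for a 95% confidence interval
qnorm(0.995)  # approximately 2.58: multiplier for a 99% confidence interval
```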

Today’s lecture:

- Added blue lines denoting the middle 95% of values to the interactive app from Lec 5.2 of the sampling distribution of \(\widehat{p}\)’s based on samples of size \(n\).
- 2016 election

Announcements:

- Project data was sent back from the STAT 231 Data Science class. Please take a look and let me know if you have questions.
- Office hours today from 2-4pm and not 1-4pm

Today’s lecture:

- Still Chapter 17 in SDM
- Interactive app for today: varying the sample size \(n\)

**Assigned on Thu 11/23. Due on Thu 11/30 11:30am.**

From Stats, Data, and Models 4th edition:

- Chapter 17 (p.463): Exercises 1 (website) and 2 (marketing). Keep in mind the exercise done in class with the “tub-o-balls” where we counted the proportion red.

Announcements:

Thanks for the great feedback! Here is some of my motivation for why I taught the course the way I did.

Not so much computer science, but rather coding. This is not an endorsement of President Obama, but rather the message:

That being said, there are “bread and butter” topics that a student leaving intro stats needs to understand:

- Interpreting results of a scientific paper that uses statistics: t-test, chi-squared tests, ANOVA, confidence intervals and p-values.
- Understanding the meaningfulness of *models* for scientific phenomena. For example:
- “Should I include an interaction term in my model or not?”
- “Is diet type a *statistically significant* explanatory variable for disease?”
- Machine learning: “Can I predict someone’s future income based on their high school GPA?”

~~Probability~~

> @minebocek emphasizing that most students won't get excited about stats/data science via urn models/probability #USCOTS2017 pic.twitter.com/ume0kuvFYT
>
> — Albert Y. Kim (@rudeboybert) May 20, 2017

How to achieve these goals? Slides from the first lecture on course objectives.

(Two side-by-side images from the slides: “Less of this” vs. “But more of this”.)

I believe the most effective way is to use the following toolbox, which takes time to develop:

- Data visualization
- Data modeling

Let’s revisit our tub-o-balls exercise from Lec 4.1

 | Group | red | white | green | n | prop_red |
---|---|---|---|---|---|---|
1 | Kathleen and Max | 18 | 32 | 0 | 50 | 0.36 |
2 | Sean, Jack, and CJ | 18 | 32 | 0 | 50 | 0.36 |
3 | Orianna and Judy | 22 | 28 | 0 | 50 | 0.44 |
4 | James and Jacob | 21 | 29 | 0 | 50 | 0.42 |
5 | Hannah and Siya | 16 | 34 | 0 | 50 | 0.32 |
6 | Niko, Sophie, and Caitlin | 14 | 36 | 0 | 50 | 0.28 |
7 | Niko, Sophie, and Caitlin | 19 | 31 | 0 | 50 | 0.38 |
8 | Aleja and Ray | 20 | 30 | 0 | 50 | 0.40 |
9 | Yaw and Drew | 16 | 34 | 0 | 50 | 0.32 |
10 | Yaw and Drew | 21 | 29 | 0 | 50 | 0.42 |

Let’s plot a histogram of the 10 sample proportions \(\widehat{p}\) based on samples of size \(n=50\).

Turns out there are 900 red balls. So the true population parameter, in this case the population proportion is \(p=\frac{900}{2400} = 0.375\), marked in red.
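A sketch of how such a histogram could be drawn in `ggplot2` (the data frame and variable names here are my own assumptions, not the in-class code):

```r
library(ggplot2)

# The 10 sample proportions from the table above.
samples <- data.frame(
  prop_red = c(0.36, 0.36, 0.44, 0.42, 0.32, 0.28, 0.38, 0.40, 0.32, 0.42)
)

# Histogram of the p-hats with the true p = 900/2400 = 0.375 marked in red.
ggplot(samples, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.02) +
  geom_vline(xintercept = 900/2400, col = "red") +
  labs(x = "Sample proportion red", y = "Count",
       title = "10 sample proportions based on samples of size n = 50")
```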

Say I made Yaw and Drew do this **one million** times. Let’s draw the histogram:

Let’s draw the same histogram with `binwidth = 0.02`:

The standard deviation of this is 0.068 (and not 0.1 as stated in lecture).
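That 0.068 matches the theoretical standard error of \(\widehat{p}\), \(\sqrt{p(1-p)/n}\) (a standard result, not derived in lecture). A quick check:

```r
# Theoretical standard error of p-hat: sqrt(p * (1 - p) / n).
p <- 900 / 2400  # true proportion red, 0.375
n <- 50
sqrt(p * (1 - p) / n)  # approximately 0.068
```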

Announcements:

- Course chartered for the rest of the semester; see lecture outline here for topics and readings from SDM textbook.

Today’s lecture:

- Midterm II review
- Solutions/discussion to PS09 posted below

- In-class
- Closed book, no calculators, no coding.
- No practice exam, but no “curveball” questions on Thursday. What does this mean? Only “straightforward” questions. What does this mean *in practice*? If you understand the lecture notes, ModernDive, and the two problem sets, you will be set.

- Lectures 3.1 thru 3.9
- Problem sets 08 and 09.

- What is the general modeling framework? \(y=f(x)+\epsilon\) What are the components?
- We’ve seen 5 regression scenarios:
- Simple linear regression with 1 numerical explanatory/predictor variable. ModernDive 6.2
- Simple linear regression with 1 categorical explanatory/predictor variable. ModernDive 6.3
- Multiple regression with 2 numerical explanatory/predictor variables. ModernDive 6.4
- Multiple regression with 1 numerical and 1 categorical explanatory/predictor variable. ModernDive 6.5
- Multiple regression with 2 categorical explanatory/predictor variables. ModernDive 6.6

- For each of the above:
- What is an appropriate exploratory data visualization?
- **Most important**: How do we interpret the fitted values for the intercept and any slopes, both
- Statistically/mathematically
- In the context of the scientific question
- How do we compute fitted values and residuals using the equation for the line?
- How do we conduct residual analysis? Why do we conduct this?

- Other topics
- How do you interpret the slopes for a categorical variable?
- What is the correlation coefficient? ModernDive 6.2.2.
- What does it mean for a line to be “best fitting”? ModernDive 6.2.6.
- In multiple regression, we can also consider *interaction effects*. Why would we consider these and how do we interpret the results? We saw this in class for scenarios 4 and 5 above. Introduced in ModernDive 6.5.3.
- Why do we need log-transformations? What do they do to the data? ModernDive 6.6.1
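On that last point, a minimal illustration (not from the lectures) of what a log10 transformation does to data spanning several orders of magnitude:

```r
# Equal multiplicative steps (powers of 10) become equal additive steps,
# which is why log transformations tame highly right-skewed variables.
x <- c(1, 10, 100, 1000, 10000)
log10(x)  # 0 1 2 3 4
```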

**Assigned on Sat 11/11. Due on Fri 11/17 1pm.**

Notes:

- On paper, no email submissions.
- Submitted at my office in Converse 316
- Can also be submitted day before in lecture

From Stats, Data, and Models 4th edition:

- Chapter 11 (p.314): Exercise 16
- Do the first portion, parts a) thru c), on the Gallup poll
- Do the second portion, parts a) thru g), for exercises 17 (medical treatments) and 18 (social life) only.

- Chapter 11 (p. 316): Exercise 38 (arm length)
- Chapter 12 (top right of p. 337): Answer parts a) thru h) listed under Exercise “21-34” for the following 3 exercises only
- Exercise 21 (fish oil and bipolar disorder)
- Exercise 22 (vitamin B12 and depression)
- Exercise 24 (diet and insomnia)

Announcements:

- Project proposals are due in the RStudio Server Shared Project by tomorrow at 5pm. In your instructions to the STAT 231 Data Science students, be clear about your objectives. If you communicate exactly what it is you hope to do, it will help inform the data wrangling they need to do.

Today’s lecture:

- Design of experiments
- Kinsey reports on human sexuality. Large sample, but biased sampling, and thus non-generalizable results. Watch the following 2m04s video:

Announcements:

- Do not work on RStudio Server between 1pm-2pm today as there will be some maintenance work done during that time.

Today’s lecture:

- Finish ModernDive 7.2 Sampling
- ModernDive 7.3 Representative sampling examples

From last time, the results of sampling exercise!

Let’s look at the sample proportion of the \(n=50\) balls that were red.

 | Group | red | white | green | n | prop_red |
---|---|---|---|---|---|---|
1 | Kathleen and Max | 18 | 32 | 0 | 50 | 0.36 |
2 | Sean, Jack, and CJ | 18 | 32 | 0 | 50 | 0.36 |
3 | Orianna and Judy | 22 | 28 | 0 | 50 | 0.44 |
4 | James and Jacob | 21 | 29 | 0 | 50 | 0.42 |
5 | Hannah and Siya | 16 | 34 | 0 | 50 | 0.32 |
6 | Niko, Sophie, and Caitlin | 14 | 36 | 0 | 50 | 0.28 |
7 | Niko, Sophie, and Caitlin | 19 | 31 | 0 | 50 | 0.38 |
8 | Aleja and Ray | 20 | 30 | 0 | 50 | 0.40 |
9 | Yaw and Drew | 16 | 34 | 0 | 50 | 0.32 |
10 | Yaw and Drew | 21 | 29 | 0 | 50 | 0.42 |

Ehhhhh, this is hard to digest. Let’s visualize:

What was the true proportion? I asked Professor X, who replied “From what I recall it was 2400 balls total, 1600 white, 100 green, so 700 red? But that’s assuming we didn’t lose any.” In other words, the true proportion \(p=\frac{700}{2400} = 0.2916\). Let’s plot this. **Question**: Is Professor X right about the true proportion red?

Announcements:

- In a Slack DM or a channel, type `/appear NAME_OF_UNIQUE_CHATROOM`. This will be useful when working on Final Projects.
- Sports fans: Tomorrow 4:30pm in SMudd 206, Prof. Michael Lopez will give a talk on “How often does the best team win?” where they model the degree to which chance affects the outcomes of 4 men’s pro leagues: NFL, NBA, MLB, and NHL (sorry soccer fans, no MLS). Which sport is the most akin to coin flips? Which sport is the most deterministic, in that better teams win more consistently? Find out tomorrow!
- All team leaders: coordinate the following:
- Login to the RStudio Server Shared Project and ensure that I am added as a collaborator.
- Ask your teammates to see if they can login as well.
- In the group DM on Slack, send me a message letting me know if your teammates can login.

Today’s lecture:

- After today’s “hands-on” exercise, fill in this Google Form
- Chalk talk 4.1 Sampling terminology

Announcements:

- PS09 due tomorrow Fri 11/3 at noon.
- “Team information” stage of final project due today.
- No problem set this week to allot time for final project proposals.

Today’s lecture:

- Multiple regression with two categorical predictors (ModernDive 6.6)

Announcements:

- Demo of RStudio Server Shared Projects

Today’s lecture:

- Multiple regression with one categorical and one numerical predictor (ModernDive 6.5)
- Multiple regression with two categorical predictors (ModernDive 6.6)

Announcements:

- Curious about Spring 2018 math or stats courses? Want to learn more about majoring in Stats or Math? Then come to one of the Information Sessions for Majors and Prospective Majors!
- Information Session for Majors and Prospective Majors in Math: Wed 11/1 6:30-7:30pm in SMudd 206
- Information Session for Majors and Prospective Majors in Stats: Thu 11/2 6:00-7:00pm in SMudd 207

- Problem sets
- Solutions/discussion to PS08 posted below.
- PS09 is now due on Fri 11/3 at noon. Please take it seriously as it is very similar in flavor to the kind of work you’ll be doing for your final project.
- No problem set assigned this Thu 11/2 to give time for you to work on your project proposals.

Today’s lecture:

- Final project discussion on syllabus under final project.

*Assigned on Thu 10/26, due on Fri 11/3 noon.*

You will do a thorough regression analysis, in a similar vein to what you will do for your final project!

Instructions:

- Download the `PS09_YourFirstName_YourLastName.Rmd` from Slack and change the filename to reflect your name.
- Slack Direct Message me `@albert` and the grader `@mchien20` whenever you want:
- A working link on Rpubs.com to your report
- Your `PS09_YourFirstName_YourLastName.Rmd` file.

- If you are having R Markdown errors:
- Go over the R Markdown Debugging Google Doc first. These steps solve about 85% of all problems.
- Go to drop-in tutoring at Q-Center (see syllabus for times).

Available here: `PS09_solutions.Rmd`

- Make sure to load all the necessary packages and create the new categorical variable `trump_support_level`.
- Read the following:
- The first four paragraphs of the Wikipedia entry for the Gini coefficient/index: a statistical measure of income inequality.
- The Jan 23, 2017 FiveThirtyEight article “Higher Rates Of Hate Crimes Are Tied To Income Inequality”. You will be partially reconstructing their analysis.
- The dataset used for this article is included in the fivethirtyeight package: `hate_crimes`.
- Read the help file corresponding to this data by running `?hate_crimes`
- Run `View(hate_crimes)` and explore the dataset

```
library(ggplot2)
library(dplyr)
library(broom)
library(knitr)
library(fivethirtyeight)
# Create a categorical variable of Trump support level based on the numerical
# variable of share of 2016 U.S. presidential voters who voted for Donald Trump.
# Each of the low, medium, and high levels have roughly one third of states.
hate_crimes <- hate_crimes %>%
  mutate(
    trump_support_level = cut_number(share_vote_trump, 3, labels = c("low", "medium", "high"))
  )
```

- What value of the Gini index indicates perfect income equality?
- What value of the Gini index indicates perfect income inequality?
- Why are the two hate crime variables based on counts per 100,000 individuals and not based on just the count?
- Name some differences on how and when the FBI reports-based and the Southern Poverty Law Center-based rates of hate crime were collected, and how this could impact any analyses.

**Answers:**

- 0
- 1
- We need to standardize comparisons between states that have massively different population sizes. We do this by comparing rates per 100,000 individuals instead of raw counts.
- The FBI data is voluntarily submitted, thus could suffer from *volunteer bias*, and only focuses on incidents that are prosecutable. The SPLC data focuses on incidents, which is a broader and at the same time more subjective definition IMO. Furthermore, the SPLC counts are based on media accounts, which could suffer from a different form of bias.

Let’s model the relationship, both visually and via regression, between:

- \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election
- \(x\): Level of Trump support in the state: low, medium, or high

**a) Visual model**

Create a visual model of this data (do not forget to include appropriate axes labels and title):

```
# Write your code to create visualization below:
ggplot(hate_crimes, aes(x=trump_support_level, y=hate_crimes_per_100k_splc)) +
geom_boxplot() +
labs(x="State's Trump support level", y="Hate crimes per 100k (SPLC reports)",
title="Relationship between hate crimes and Trump support level for US States")
```

**b) Regression model**

Output the regression table and interpret the results

```
lm(hate_crimes_per_100k_splc ~ trump_support_level, data=hate_crimes) %>%
tidy() %>%
kable(digits=3)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.460 | 0.053 | 8.692 | 0.000 |
trump_support_levelmedium | -0.238 | 0.079 | -3.029 | 0.004 |
trump_support_levelhigh | -0.269 | 0.080 | -3.363 | 0.002 |

We see that:

- the “low” Trump support states had on average 0.46 incidents per 100K
- the “medium” Trump support states had on average 0.46 - 0.238 = 0.222 incidents per 100K
- the “high” Trump support states had on average 0.46 - 0.269 = 0.191 incidents per 100K

**Say what?** That’s not what I expected! I thought the relationship would’ve been the other way around: as there is a higher level of Trump support, there would be a higher rate of incidents. Keep in mind, there could be other factors we are not accounting for!

**c) Conclusion**

- Give a one-sentence answer as to whether or not your results above are consistent with
- the green choropleth map in the FiveThirtyEight article
- the 2016 election results map?

- Which state was the outlier in the “low” group?

Write your answers here:

- Yes. Some surprising results: Washington, Oregon, and Massachusetts, which were solidly in Hillary Clinton’s camp, had a surprisingly large number of hate crime incidents.
- Washington DC was the outlier

Create two separate visualizations (do not forget to include appropriate axes labels and title) and run two separate simple linear regressions (using only one predictor) for \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election with

- \(x\): the Gini index
- \(x\): high school education level

and interpret any slope values.

**a) Gini Index**

```
ggplot(hate_crimes, aes(x=gini_index, y=hate_crimes_per_100k_splc)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(x="Gini index", y="Hate crimes per 100k (SPLC reports)", title="Relationship between hate crimes and Trump support level for US States")
```

```
lm(hate_crimes_per_100k_splc~gini_index, data=hate_crimes) %>%
tidy() %>%
kable(digits=3)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -1.527 | 0.783 | -1.950 | 0.057 |
gini_index | 4.021 | 1.718 | 2.341 | 0.024 |

We see that the relationship is positive: as the Gini index increases by 1 unit, there is on average an associated increase of 4.021 hate crimes per 100K individuals. In other words, states with more income inequality tended to have more hate crimes.

However, note that the scale of the Gini index is between 0 (perfect equality) and 1 (perfect inequality)! So it makes sense to rescale the units. For example: as the Gini index increases by 1/100 = 0.01 unit, there is on average an associated increase of 4.021/100 = 0.04021 hate crimes per 100K individuals.

**b) High school education level**

```
ggplot(hate_crimes, aes(x=share_pop_hs, y=hate_crimes_per_100k_splc)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(x="Share of population over 25 with HS degree (2009)", y="Hate crimes per 100k (SPLC reports)", title="Relationship between hate crimes and high school US States")
```

```
lm(hate_crimes_per_100k_splc ~ share_pop_hs, data=hate_crimes) %>%
tidy() %>%
kable(digits=3)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -1.705 | 0.923 | -1.848 | 0.071 |
share_pop_hs | 2.320 | 1.065 | 2.179 | 0.035 |

We see that the relationship is positive: as the proportion of the population with at least a high school degree increases by 1 unit, there is an associated increase of 2.320 hate crimes per 100K individuals.

Two questions:

- Proportion increases by 1? That’s everybody! Let’s rescale our interpretation: as the proportion of the population that has at least a high school degree increases by 0.01 = 1/100 = 1 percentage point, there is an associated increase in hate crimes of 2.320/100 = 0.0232.
- Wait, as a higher proportion of the people has at least a high school degree, the more hate crimes? Shouldn’t it be the other way? More on this in class.

Run a multiple regression for

- \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election
- Using the following predictor variables simultaneously:
- \(x_1\): the Gini index
- \(x_2\): high school education level

and interpret both slope coefficients.

```
lm(hate_crimes_per_100k_splc ~ share_pop_hs + gini_index, data=hate_crimes) %>%
tidy() %>%
kable(digits=3)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -8.212 | 1.419 | -5.787 | 0 |
share_pop_hs | 5.256 | 1.003 | 5.241 | 0 |
gini_index | 8.702 | 1.630 | 5.340 | 0 |

Write your interpretation below:

- All else being equal, for every increase in 1 percentage point of the population that has at least a high school degree, there is an associated increase of on average 5.256/100 = 0.05256 hate crimes per 100K
- All else being equal, for every increase in 0.01 units in the Gini index, there is an associated increase of on average 8.702/100 = 0.08702 hate crimes per 100K

Create two new data frames:

- `hate_crimes_no_new_york`: the `hate_crimes` dataset without New York
- `hate_crimes_no_DC`: the `hate_crimes` dataset without the District of Columbia

Repeat the multiple regression from Question 3 and indicate the removal of which state from the dataset has a bigger impact on the analysis. Why do you think this is?

```
hate_crimes_no_new_york <- hate_crimes %>%
filter(state != "New York")
lm(hate_crimes_per_100k_splc ~ share_pop_hs + gini_index, data=hate_crimes_no_new_york) %>%
tidy() %>%
kable(digits=3)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -8.655 | 1.448 | -5.977 | 0 |
share_pop_hs | 5.399 | 1.001 | 5.394 | 0 |
gini_index | 9.415 | 1.706 | 5.517 | 0 |

```
hate_crimes_no_DC <- hate_crimes %>%
filter(state != "District of Columbia")
lm(hate_crimes_per_100k_splc ~ share_pop_hs + gini_index, data=hate_crimes_no_DC) %>%
tidy() %>%
kable(digits=3)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -3.989 | 1.508 | -2.645 | 0.011 |
share_pop_hs | 3.284 | 0.943 | 3.481 | 0.001 |
gini_index | 3.136 | 1.835 | 1.709 | 0.095 |

**Answer**: Removing DC changed the resulting slopes a lot more! Why? Because it is an outlier, and removing outliers has a bigger impact on analyses!

Announcements:

- Final project details coming on Monday

The multiple regression generalization of a simple linear regression line is a *regression plane*, i.e. a flat surface. Note in the following **interactive** plot:

- The blue points are the 400 observed values, each corresponding to one row in the `Credit` data frame from the ISLR package. Hover over a blue point to get the observed (Income, Credit Limit, Credit Card Balance) values.
- The blue/green/yellow surface is the “best” fitting regression plane thru these points. Hover over any portion of the plane. The z value in the pop-up is the fitted value \(\widehat{y}\) corresponding to particular values of
- x: \(x_1\) the Income (in $10K) variable
- y: \(x_2\) the Credit Limit variable

Announcements:

- Install the `plotly` and `ISLR` packages

Today’s lecture:

- Finish regression table
- Look at point-by-point summaries
- Residual analysis

Run following to create a 3D scatterplot in R:

```
library(ISLR)
library(plotly)
plot_ly(showscale = FALSE) %>%
  add_markers(
    x = Credit$Income,
    y = Credit$Limit,
    z = Credit$Balance,
    hoverinfo = "text",
    text = ~paste("x1 - Income: ", Credit$Income,
                  "</br> x2 - Limit: ", Credit$Limit,
                  "</br> y - Balance: ", Credit$Balance)
  ) %>%
  layout(
    scene = list(
      xaxis = list(title = "x1 - Income (in $10K)"),
      yaxis = list(title = "x2 - Limit ($)"),
      zaxis = list(title = "y - Balance ($)")
    )
  )
```

Announcements:

- When submitting problem sets, submit:
- A working Rpubs.com link. Example: http://rpubs.com/rudeboybert/PS07
- The corresponding `.Rmd` R Markdown file, not the `.html` file.
- Note that `.html` files are the *output*, while `.Rmd` files are the *source code*. The `.Rmd` file is the document you need to:
- replicate/reproduce your HTML file
- modify if you want to update your analysis

- Solutions/discussion to PS06 on available seat miles posted below.
- Solutions to PS07 on Midterm I questions/R Markdown posted below.

Today’s lecture:

- Defining what we mean by “best” fitting line (ModernDive 6.2.6)
- Started regression with categorical predictor (ModernDive 6.3)

*Assigned on Fri 10/20, due on Thu 10/26 11:30am.*

You will run a regression analysis on the `evals` dataset described in Section 6.1.1 of ModernDive. In particular, you will study the relationship between

- Outcome variable \(y\): teaching `score`
- Predictor variable \(x\): `age`

Instructions:

- Download the `PS08_YourFirstName_YourLastName.Rmd` from Slack and change the filename to reflect your name.
- Slack Direct Message me `@albert` and the grader `@mchien20` whenever you want:
- A working link on Rpubs.com to your report
- Your `PS08_YourFirstName_YourLastName.Rmd` file.

- If you are having R Markdown errors:
- Go over the R Markdown Debugging Google Doc first. These steps solve about 85% of all problems.
- Go to drop-in tutoring at Q-Center (see syllabus for times).

Available here: `PS08_solutions.Rmd`

```
library(ggplot2)
library(dplyr)
library(broom)
library(knitr)
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- evals %>%
select(score, ethnicity, gender, language, age, bty_avg, rank)
```

Exploratory data analysis: Create a visualization that shows the relationship between:

- \(y\): instructor teaching score
- \(x\): instructor age

Comment on this relationship.

**Answer**: While the following plot using `geom_point()` works, there is *overplotting*.

```
ggplot(evals, aes(x=age, y=score)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(x="Age of instructor", y="Teaching score", title="Relationship between teaching score and age")
```

So as a visualization tool, you should really consider using `geom_jitter()` so as to break things up:

```
ggplot(evals, aes(x=age, y=score)) +
geom_jitter() +
geom_smooth(method="lm", se=FALSE) +
labs(x="Age of instructor", y="Teaching score", title="Relationship between teaching score and age (jittered)")
```

Remarks:

- What does `se=FALSE` do? It removes the “error/uncertainty” bands. We’ll see what these are later!
- There appears to be a slight negative relationship, meaning as age goes up, average teaching score goes down. This is further evidenced by the correlation coefficient, which is negative.

`cor(evals$age, evals$score)`

`## [1] -0.107032`

**Comment on relationship**: The relationship appears to be negative: as professors have higher ages, there is an associated drop in teaching score.

- Display the *regression table* that shows both the 1) fitted intercept and 2) fitted slope of the regression line. Pipe `%>%` the table into `kable(digits=3)` to get a cleanly outputted table with 3 significant digits.
- Interpret the slope.
- For an instructor that is 67 years old, what would you guess their teaching score to be?

**Answer**: First, the table:

```
lm(score~age, data=evals) %>%
tidy() %>%
kable(digits=3)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.462 | 0.127 | 35.195 | 0.000 |
age | -0.006 | 0.003 | -2.311 | 0.021 |

Second, the slope for age of -0.006 indicates: for every additional year in age of the instructor, there is an associated decrease of on average 0.006 out of 5 in teaching score. Note, we cautiously use the word *associated*, because we can’t make any claims that age directly *causes* drops in teaching score, but rather that these two variables are merely associated.

Third, for an instructor who is \(x=67\), we would guess a teaching score of

\[ \widehat{y} = 4.462 - 0.006 \times 67 = 4.06 \]
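As a sanity check of this arithmetic, done with the rounded table values (using `predict()` on the fitted `lm` object would give the unrounded answer):

```r
# Hand-computed prediction from the (rounded) regression table values:
intercept <- 4.462
slope <- -0.006
y_hat <- intercept + slope * 67
round(y_hat, 2)  # 4.06
```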

Does there seem to be a systematic pattern in the lack-of-fit of the model? In other words, is there a pattern in the error between the fitted score \(\widehat{y}\) and the observed score \(y\)? Hint: `geom_hline(yintercept=0, col="blue")`

adds a blue horizontal line at \(y=0\).

**Answer**: Here is a random sample of 5 rows of point-by-point info out of a possible 463.

```
point_by_point_info <- lm(score~age, data=evals) %>%
augment() %>%
select(score, age, .fitted, .resid)
point_by_point_info %>%
sample_n(5) %>%
kable()
```

 | score | age | .fitted | .resid |
---|---|---|---|---|
64 | 4.2 | 37 | 4.242218 | -0.0422180 |
309 | 3.6 | 35 | 4.254094 | -0.6540945 |
89 | 4.4 | 56 | 4.129392 | 0.2706083 |
415 | 4.9 | 54 | 4.141268 | 0.7587318 |
301 | 4.4 | 43 | 4.206589 | 0.1934113 |
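Each `.resid` is simply observed minus fitted, \(y - \widehat{y}\). A quick check using the row above with score 4.2 and fitted value 4.242218:

```r
# residual = observed score - fitted score
observed <- 4.2
fitted <- 4.242218
observed - fitted  # -0.042218, matching the .resid column
```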

Let’s look for any pattern in the *residuals*. First, in the plot of the residuals over age, there seems to be a pretty even scatter of the residuals (AKA what’s left over, AKA the lack-of-fit) across all ages. Maybe not so much for profs over 70, but there aren’t many of them to begin with.

```
ggplot(point_by_point_info, aes(x=age, y=.resid)) +
geom_point() +
geom_hline(yintercept=0, col="blue") +
labs(x="Age", y="Residual", title="Residuals over age")
```

Looking at the histogram, there seems to be a slight left skew, with a longer tail of negative residuals than positive ones. This is not ideal; we would rather they be *normally distributed*, i.e. bell-shaped. More on this later in the course.

```
ggplot(point_by_point_info, aes(x=.resid)) +
geom_histogram(binwidth = 0.25) +
labs(x="Residual", title="Histogram of residuals")
```

Say a college administrator wants to model teaching scores using more than one predictor/explanatory variable beyond just age, in particular using the instructor’s gender as well. Create a visualization that summarizes this relationship and comment on the observed relationship.

**Answer**: While you could do this with facets, IMO having a single *colored scatterplot* is more succinct. In other words, I find the following easier to digest:

```
ggplot(evals, aes(x=age, y=score, col=gender)) +
geom_jitter() +
geom_smooth(method="lm", se=FALSE) +
labs(x="Age of instructor", y="Teaching score", title="Relationship between teaching score and age+gender")
```

than this:

```
ggplot(evals, aes(x=age, y=score)) +
geom_jitter() +
geom_smooth(method="lm", se=FALSE) +
labs(x="Age of instructor", y="Teaching score", title="Relationship between teaching score and age+gender") +
facet_wrap(~gender)
```

**Comment on observed relationship**: the relationship between age and teaching score seems to differ between male and female faculty. The slope is more steeply negative for female faculty! Also note, there are fewer older female faculty.

Announcements:

- Problem set posted later this PM/tomorrow AM.
- **Developing correlation coefficient intuition**: Play the Guess the Correlation game online.
- **Correlation is not necessarily causation**: Just because two variables are highly correlated doesn’t mean that one causes the other! They are merely associated! Check out Spurious Correlations.

Today’s lecture:

- Finishing “Observed values, fitted values, and residuals” (ModernDive 6.2.4): point-by-point info.
- Residual analysis (ModernDive 6.2.5).

Announcements:

- Install the `broom` package, which is used to cleanly output regression tables and other regression results.
- Topics for the course are switching: going from data science to data modeling & statistical inference.
- Accordingly, the lecture format is switching: instead of lectures centering around the ModernDive eBook with chalk talks supplementing it, lectures will now center around chalk talks and will refer to ModernDive for
  - Section numbers
  - Table numbers: tables of data
  - Code
  - Most importantly, figure numbers: visualizations.

Today’s lecture:

- New dataset: Professor evaluations (ModernDive 6.1.1)
- Modeling topics
- “Correlation coefficient” (ModernDive 6.2.2)
- “Simple linear regression” (ModernDive 6.2.3)
- “Observed values, fitted values, and residuals” (ModernDive 6.2.4)

Announcements:

- Install the `readr` package, which is used to read CSV files (see Lec 2.6 and ModernDive Section 4.2).

Topics:

- R Markdown is a tool for generating HTML webpage reports that combine
- Text
- R code
- R output: tables and graphs

- R Markdown files have the extension `.Rmd`, instead of the `.R` files we’ve been using until now.
- Biggest source of confusion: R Markdown has its own **environment**. Just because something exists in your console doesn’t mean it exists in R Markdown. Whatever code you run in your console needs to be included in your `.Rmd` file.
- Your final projects will be submitted in R Markdown. Details coming shortly.

*Assigned on Mon 10/16, due on Thu 10/19 11:30am. This problem set will be shorter than usual.*

You will create an R Markdown report containing two plots and one table based entirely on midterm questions and publish these results to the web.

Instructions:

- Download the `PS07_YourFirstName_YourLastName.Rmd` file from Slack and change the filename to reflect your name.
- Press “Knit” and say yes to all prompts to install all packages. An HTML page should pop up in RStudio.
- I suggest you write/test your code in a separate `.R` scratch pad first. When you feel confident about the output, copy the code over to the appropriate grey *code block*.
- Knit early, and knit often!
- When you are done
- Knit one final time.
- Click “Publish” in the HTML page that pops up in RStudio
- Select Rpubs
- Publish
- Create an account
- Give your document an appropriate title and description. Set the “slug” (meaning HTML link) to be “PS07”. After you hit continue, an online webpage should pop up. Copy the link.

- Slack Direct Message me `@albert` and the grader `@mchien20` whenever you want:
  - A working link to your report
  - Your `PS07_YourFirstName_YourLastName.Rmd` file.

Notes:

- If you’ve updated your HTML file, you can republish by following Step 5 again.
- If you’re stuck:
- Go over the R Markdown Debugging Google Doc first. These steps solve about 85% of all problems.
- Go to drop-in tutoring at Q-Center (see syllabus for times).

Available here: `PS07_solutions.Rmd`

Here is a scatterplot of the number of Starbucks/Dunkin Donuts locations per 1000 individuals over median income for 1024 census tracts in Western Massachusetts:

```
library(ggplot2)
library(dplyr)
library(nycflights13)
library(knitr)
library(readr)
DD_vs_SB <- read_csv("https://rudeboybert.github.io/STAT135/content/PS07/DD_vs_SB.csv")
ggplot(data = DD_vs_SB, mapping = aes(x=median_income, y=shops_per_1000, color=Type)) +
geom_point() +
facet_wrap(~Type) +
geom_smooth(method="lm", se=FALSE, color="blue") +
scale_color_manual(values = c("orange", "darkgreen")) +
labs(x="Median income in $", y="Shops per 1000 individuals",
title="# of Dunkin Donuts & Starbucks per capita vs median income")
```

Here is a plot comparing beer vs spirit vs wine consumption worldwide:

```
drinks <- read_csv("https://rudeboybert.github.io/STAT135/content/PS07/drinks.csv")
# Either a boxplot:
ggplot(data = drinks, mapping = aes(x=type, y=servings)) +
geom_boxplot() +
labs(x="Alcohol type", y="Average number of servings",
title="Comparison of beer vs spirits vs wine consumption worldwide")
```

```
# Or a faceted histogram works:
ggplot(data = drinks, mapping = aes(x=servings)) +
geom_histogram() +
facet_wrap(~type) +
labs(x="Average number of servings",
title="Comparison of beer vs spirits vs wine consumption worldwide")
```

Here is a table that shows the median departure delay for each airline leaving Newark:

```
EWR_median_delays <- flights %>%
filter(origin == "EWR") %>%
group_by(carrier) %>%
summarize(avg_dep_delay = median(dep_delay, na.rm=TRUE))
kable(EWR_median_delays)
```

carrier | avg_dep_delay |
---|---|
9E | -5 |
AA | -3 |
AS | -3 |
B6 | -3 |
DL | -2 |
EV | -1 |
MQ | -2 |
OO | -1 |
UA | 0 |
US | -4 |
VX | -1 |
WN | 1 |

Recall:

- All the `kable()` function from the `knitr` package does is print out the table in a nice format.
- Because there are missing values in `dep_delay`, we set `na.rm=TRUE` so that we ignore them in the calculation of the median. However, do this with caution! Is there a *systematic reason/pattern in the missingness*?
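To see concretely why `na.rm=TRUE` matters, here is a minimal sketch with toy delay values (made-up numbers, not the real `flights` data):

```r
# Toy departure delays, one of which is missing:
delays <- c(5, -3, NA, 12)
median(delays)               # NA: a single NA makes the whole median NA
median(delays, na.rm = TRUE) # 5: the NA is dropped before computing
```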

Run this code first; the `babynames` package contains, for each year from 1880 to 2015, the number of children of each sex given each name. All names with more than 5 uses are given.

```
install.packages("babynames") # Install package only once!
library(babynames)
View(babynames)
```

Let’s look at the distribution of females named “Taylor” over time:

```
library(ggplot2)
library(dplyr)
babynames_subset <- babynames %>%
filter(name == "Taylor" & sex == "F")
ggplot(babynames_subset, aes(x=year, y=n)) +
geom_line() +
labs(x="year", y="count", title="Popularity of name Taylor for females over time")
```

- In-class
- Closed book, no calculators; a `dplyr` cheatsheet will be provided.
- More conceptual in nature
- You won’t need to write explicit R code, but you may need to
  - Understand given code
  - Write code “sketches”; this is also known as *pseudocode*. For more details, in the following link, read the Part I boxes:
    - Know what pseudocode is
    - Understand why pseudocode is useful
    - Remember that pseudocode is subjective and nonstandard

- Lectures 2.11 back to 1.1 (but not 3.1), the corresponding readings, and especially the *learning checks*.
- Problem sets 03, 04, and 05

- What is the grammar of graphics?
- 5NG:
- What are the Five Named Graphs (5NG)?
- What are their distinguishing features?
- When would you use one versus another?

- Explicitly tie the components of the grammar of graphics to the 5NG:
- What data variables are we considering? Are they categorical or numerical?
- What’s the `geom`etric object in question?
- What is the mapping from data variables to `aes`thetic attributes?
- Which of the 5NG consider *relationships* between variables?

- What does `facet`ing do? What kind of variables can we `facet` by?

- 4.1: Tidy data:
- What is the definition of “tidy” data formatting?
- What is the alternative to tidy data formatting?

- 4.4: Observational units, identification vs measurement variables. Why do we need identification variables for `join`s?
- You can skip Chapters 4.2, 4.3, and 4.5

- Understand the 5MV + joins. The images on the `dplyr` cheatsheet illustrate these well. By understand I mean:
  - What effect does each one have on a data frame?
  - How can you sequence them using `%>%` “thens” to format your data the way you want?

- Go over examples of data wrangling in the learning checks, the textbook, and Problem Sets and see if you can reconstruct the corresponding code sketches/pseudocode on your own.

Announcements:

- Midterm I next Thursday 10/12. Topics: follow this page linearly
- Lectures 2.11 back to 1.1 (but not 3.1), the corresponding readings, and learning checks.
- Problem sets 03, 04, and 05
- No direct coding questions, but you might need to understand code, or write code “sketches”.
- Cheatsheets will be distributed the day of lecture.

- Review session on Wed 10/11 (Monday schedule)
- Please Slack me any topics you would like covered before Tuesday afternoon.

Chalk talk:

- Tao of data analysis
- Solutions/discussion to PS05 posted below.
- Start problem set 06, which is due on Fri 10/13 at noon instead of usual Thursday at 11:30am.

*Assigned on Thu 10/5, due on Fri 10/13* **noon** *(instead of usual Thursday at 11:30am)*.

You are a junior analyst in the airline industry, specifically working for Virgin America. You are tasked with calculating Virgin America’s total available seat miles for all flights leaving each of the three New York Metropolitan area airports separately, and comparing these figures with those of ExpressJet Airlines Inc, a close rival to your company. Available seat miles are the fundamental units of production for a passenger-carrying airline; it is a measure of **capacity**.

Deliverables:

- Have a file called `PS06_YourFirstName_YourLastName.R` (Ex: `PS06_Albert_Kim.R`) ready to submit with
  - Your results in an easy to read table
  - Your results in a cleanly formatted visualization
  - A two sentence “executive summary” on the comparison in available seat miles between Virgin America and ExpressJet Airlines Inc for the three NYC airports. In other words, what insight would you provide the CEO of Virgin America if you only had 1 minute of their time.
- To be submitted on Slack in the `#ps06` channel. This week only, submit via a Direct Message that includes: 1) Albert 2) the grader Margaret Chien. You should already have a string of such messages from when the grader gave you feedback on previous assignments.
- Tips: I **highly** recommend you:
  - Reread the ChalkTalk on the “Tao of data analysis” first.
- Work in groups, especially in the “sketching out your plan” step.
- If you’re stuck, go to the drop-in help center on Tuesday evening and my office hours on Wednesday. Both I and the drop-in center TA’s will want to see your sketched plan first before helping you with any code.

```
# Problem Set 06
# Please Indicate
# -Your name:
# -Who you collaborated with:
# -Approximately how much time did you spend on this problem set:
# -What, if anything, gave you the most trouble:
# Needed packages. Make sure you have all these installed.
# For visualization:
library(ggplot2)
# For data wrangling:
library(dplyr)
# Data necessary:
library(nycflights13)
View(flights)
View(planes)
View(airlines)
View(weather)
View(airports)
# For printing clean outputs of data frames using kable() function:
library(knitr)
# Deliverable 1: Easy to read table. Save the table in a data frame called results
# View it here:
View(results)
# Deliverable 2: Visualization
# Executive Summary:
```

```
library(ggplot2)
library(dplyr)
library(knitr)
library(nycflights13)
# Clean up our workspace: Reduce both the flights and planes data sets to only
# those columns and rows we need.
flights_cleaned <- flights %>%
filter(carrier == "VX" | carrier == "EV") %>%
select(tailnum, carrier, distance, origin)
planes_cleaned <- planes %>%
select(tailnum, seats)
# Now we join the two by the variable "tailnum".
flights_planes_joined <- flights_cleaned %>%
left_join(planes_cleaned, by = "tailnum")
results <- flights_planes_joined %>%
# Now we can compute ASM:
mutate(seat_miles = seats * distance) %>%
# Since we want all 6 possible origin X carrier pairs, we group by
# those variables and THEN summarize
group_by(origin, carrier) %>%
summarise(total_seat_miles = sum(seat_miles)) %>%
# While not necessary, I decide to sort in descending order of ASM
arrange(desc(total_seat_miles))
# Recall, all the kable() function from the knitr package does is output tables
# in an easy to read format.
kable(results)
```

origin | carrier | total_seat_miles |
---|---|---|
JFK | VX | 1632349744 |
EWR | EV | 1465958750 |
EWR | VX | 664331034 |
LGA | EV | 333516285 |
JFK | EV | 17761240 |

We have two categorical variables in the table above, `origin` and `carrier`, and a *pre-aggregated* value `total_seat_miles` that we want to show on the \(y\)-axis. Thus a barplot using `geom_col()` would be appropriate here:

```
ggplot(results, aes(x=origin, y=total_seat_miles)) +
geom_col() +
facet_wrap(~carrier) +
labs(x="NYC Airport", y="Total Available Seat Miles",
title="Virgin vs ExpressJet Seat Miles in NYC in 2013")
```

There are numerous ways to summarize this, but one way, told from the perspective of a Virgin America executive: we have about the same total capacity, but definitely spread out over different airports. We’ve got JFK covered, but definitely don’t have the capacity that ExpressJet does at Newark and La Guardia. Sure, we don’t have any capacity at La Guardia, but is this a problem given that some think it’s the worst airport in the country?

I suggest we target our marketing so that

- We promote our great capacity to people who tend to fly out of JFK. As such, let’s run an ad campaign at subways stations along the A, E, J, and Z trains.
- We start running ads that try to take a bite out of ExpressJet’s Newark market.

Announcements:

- Check-in with Slack notifications.

Topics

- Chalk talk: 2.11: Means vs medians
- Learning check 5.13 on joins discussion
- Chalk talk: Introduction to modeling
- In-class exercise: StitchFix

LC 5.13 on joins. Let’s try to join the `flights` and `weather` data frames so we have a unified data set. Note that weather is recorded hourly at each of the 3 NYC airports (EWR, JFK, and LGA).

Run the following in your console and look at the `hour` column:

```
library(nycflights13)
library(dplyr)
View(weather)
View(flights)
```

`hour` is just a number between `0` and `23`! To join the two datasets, you need to join by ALL identification variables, as per the five arrows (in red) that connect `flights` with `weather`: `year`, `month`, `day`, `hour`, `origin`:

Run the following in your console and observe the difference between `flights` and `flights_with_weather`: weather data is merged!

```
flights_with_weather <- flights %>%
left_join(weather, by=c("year", "month", "day", "hour", "origin"))
View(flights_with_weather)
```
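For intuition, here is a minimal toy version of a multi-key join using base R’s `merge()`, so it runs without any packages; `left_join()` with `by = c("day", "origin")` behaves the same way. All names and values below are made up for illustration:

```r
# Toy "flights" and "weather" frames sharing two identification variables:
flights_toy <- data.frame(
  day = c(1, 1, 2),
  origin = c("JFK", "EWR", "JFK"),
  dep_delay = c(4, -2, 10)
)
weather_toy <- data.frame(
  day = c(1, 2),
  origin = c("JFK", "JFK"),
  wind = c(8, 15)
)
# Rows match only when BOTH day and origin agree; all.x = TRUE keeps every
# flight, so the EWR flight (no matching weather record) gets wind = NA.
merge(flights_toy, weather_toy, by = c("day", "origin"), all.x = TRUE)
```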

Watch the following video from Good Morning America, then fill out your “style profile” on StitchFix.com. Feel free to use a “temporary email” from 10 minute email.

*Assigned on Thu 9/28, due on Thu 10/5 11:30am.*

You will be performing your first data wrangling of real data! The goal of this problem set is to get practice. In some cases, you’ll make a visualization too.

- Deliverables:
  - Have a file called `PS05_YourFirstName_YourLastName.R` (Ex: `PS05_Albert_Kim.R`) ready to submit. Create a new `.R` file, save it with the appropriate name as indicated above, and cut and paste the content below. You should be cutting and pasting the commands that create your graphs into this file!
  - To be submitted on Slack in the `#ps05` channel, which I will create on Thu 10/5 11:30am.
- Tips:
  - When generating tables, always sketch out beforehand what you think the table should look like, and only then try to code it using `dplyr` commands. Again, separate out the what (the data wrangling “verbs”) from the how (the code).

```
# Problem Set 05
# Please Indicate
# -Your name:
# -Who you collaborated with:
# -Approximately how much time did you spend on this problem set:
# -What, if anything, gave you the most trouble:
library(ggplot2)
library(dplyr)
# Question 1: Drug Use Amongst OkCupid Users -------------------------------
# Let's look at 60,000 San Francisco OkCupid users in 2012 and consider the
# variable `drugs` reflecting users' answers to a question on their drug use.
# Run ALL this code first:
library(okcupiddata)
profiles <- profiles %>%
select(sex, drugs) %>%
mutate(drugs = ifelse(is.na(drugs), "N/A", drugs))
View(profiles)
# a) Type in a series of commands that will output a table of how many men and
# women there are in this population.
# b) Create a visualization that shows the distribution of the different responses
# to the variable `drugs`.
# c) Create a visualization that shows the same information as in part b), but
# for men and women separately. Who is more likely to say they never use drugs?
# Men or women?
# Question 2: Gapminder -------------------------------
# We're going to revisit the Gapminder data
# Run ALL this code first:
library(gapminder)
View(gapminder)
# a) Output a table that shows the mean and median GDP per capita of all
# countries in the year 2007, but split by continent. Your table should be 5
# rows and 3 columns.
# Question 3: Titanic -------------------------------
# Let's study data relating to who survived the Titanic disaster.
# Run ALL this code first:
data(Titanic)
Titanic <- Titanic %>%
as_data_frame()
View(Titanic)
# a) Output tables that compares survivor counts
# 1. between men and women
# 2. between the different classes
# b) For each comparison above, show who was most likely to survive. Spoiler: it
# was women and 1st class passengers; show your work.
```

```
library(ggplot2)
library(dplyr)
```

Let’s look at 60,000 San Francisco OkCupid users in 2012 and consider the variable `drugs`, reflecting users’ answers to a question on their drug use.

Run ALL this code first:

```
library(okcupiddata)
profiles <- profiles %>%
select(sex, drugs) %>%
mutate(drugs = ifelse(is.na(drugs), "N/A", drugs))
View(profiles)
```

Let’s peek at the first 6 rows. We see that we have 2 categorical variables:

sex | drugs |
---|---|
m | never |
m | sometimes |
m | N/A |
m | N/A |
m | never |
m | N/A |

**a)** Type in a series of commands that will output a table of how many men and women there are in this population.

```
profiles %>%
group_by(sex) %>%
summarize(count=n())
```

sex | count |
---|---|
f | 24117 |
m | 35829 |

Wow, this population is almost 60% male! A large factor could be that San Francisco has a lot of technology/engineering jobs, which unfortunately still suffer from gender imbalances.

**b)** Create a visualization that shows the distribution of the different responses to the variable `drugs`.

Here we are only considering one categorical variable, so let’s use a barplot! If you look at the `profiles` data, we see that this is an *unaggregated table*, so we use `geom_bar` and not `geom_col`:

```
ggplot(profiles, aes(x=drugs)) +
geom_bar()
```

Most people say “never” and some people say “often”. However, do you think all the “never”s are being completely honest? These are self-reported values, and thus they might suffer from *response* bias.

**c)** Create a visualization that shows the same information as in part b), but for men and women separately. Who is more likely to say they never use drugs? Men or women?

```
ggplot(profiles, aes(x=drugs, fill=sex)) +
geom_bar(position="dodge") +
labs(title="Drug use counts by sex")
```

Looking at this graph, there are more men who said they never do drugs. But remember, this population is 60% male! So we need to compare proportions, not counts! In other words, divide the height of the blue bars by 35829 and the red bars by 24117.

Here is the updated plot. It seems that women actually are more likely to say they never use drugs!

**Note**: Converting to proportions is unfortunately a tricky task in `dplyr`, thus you won’t be expected to do this going forward and the topic is left to another course. However, if you are curious, the code is below:

```
proportion_drugs_by_sex <- profiles %>%
group_by(sex, drugs) %>%
summarise(count=n()) %>%
# Override grouping structure
group_by(sex) %>%
# Get total count for each group separately, i.e. group = sex
mutate(count_in_sex = sum(count)) %>%
mutate(proportion = count/count_in_sex)
View(proportion_drugs_by_sex)
ggplot(proportion_drugs_by_sex, aes(x=drugs, y=proportion, fill=sex)) +
geom_col(position="dodge")
```

We’re going to revisit the Gapminder data.

Run ALL this code first:

```
library(gapminder)
View(gapminder)
```

Let’s peek at the first 6 rows of the data set:

country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |

**a)** Output a table that shows the mean and median GDP per capita of all countries in the year 2007, but split by continent. Your table should be 5 rows and 3 columns.

```
gapminder %>%
filter(year==2007) %>%
group_by(continent) %>%
summarize(mean_gdp=mean(gdpPercap), median_gdp=median(gdpPercap))
```

continent | mean_gdp | median_gdp |
---|---|---|
Africa | 3089.033 | 1452.267 |
Americas | 11003.032 | 8948.103 |
Asia | 12473.027 | 4471.062 |
Europe | 25054.482 | 28054.066 |
Oceania | 29810.188 | 29810.188 |
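Side note: Oceania’s mean and median are identical, and that’s no coincidence. Oceania contains only two countries here (Australia and New Zealand), and for any two numbers the median equals the mean:

```r
# For exactly two values, median and mean coincide, so Oceania's columns
# must match. The two values below are chosen to average to 29810.188:
two_gdps <- c(25185.009, 34435.367)
mean(two_gdps) == median(two_gdps)  # TRUE
```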

Let’s study data relating to who survived the Titanic disaster.

Run ALL this code first:

```
data(Titanic)
Titanic <- Titanic %>%
as_data_frame()
View(Titanic)
```

Note: there are 4 categorical variables that categorize each passenger:

- `Class`: `1st`, `2nd`, `3rd`, or `Crew`, i.e. *4 levels*
- `Sex`: `Male` or `Female`, i.e. *2 levels*
- `Age`: `Child` or `Adult`, i.e. *2 levels*
- `Survived`: `No` or `Yes`, i.e. *2 levels*

So each of the 2201 passengers fits into one of \(4 \times 2 \times 2 \times 2 = 32\) possible categories, so the `Titanic` data frame has 32 rows, with a column `n` that indicates the number of people in each category:

Class | Sex | Age | Survived | n |
---|---|---|---|---|
1st | Male | Child | No | 0 |
2nd | Male | Child | No | 0 |
3rd | Male | Child | No | 35 |
Crew | Male | Child | No | 0 |
1st | Female | Child | No | 0 |
2nd | Female | Child | No | 0 |
3rd | Female | Child | No | 17 |
Crew | Female | Child | No | 0 |
1st | Male | Adult | No | 118 |
2nd | Male | Adult | No | 154 |
3rd | Male | Adult | No | 387 |
Crew | Male | Adult | No | 670 |
1st | Female | Adult | No | 4 |
2nd | Female | Adult | No | 13 |
3rd | Female | Adult | No | 89 |
Crew | Female | Adult | No | 3 |
1st | Male | Child | Yes | 5 |
2nd | Male | Child | Yes | 11 |
3rd | Male | Child | Yes | 13 |
Crew | Male | Child | Yes | 0 |
1st | Female | Child | Yes | 1 |
2nd | Female | Child | Yes | 13 |
3rd | Female | Child | Yes | 14 |
Crew | Female | Child | Yes | 0 |
1st | Male | Adult | Yes | 57 |
2nd | Male | Adult | Yes | 14 |
3rd | Male | Adult | Yes | 75 |
Crew | Male | Adult | Yes | 192 |
1st | Female | Adult | Yes | 140 |
2nd | Female | Adult | Yes | 80 |
3rd | Female | Adult | Yes | 76 |
Crew | Female | Adult | Yes | 20 |
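As a sanity check that the table exhausts all combinations, the 32 categories are just the Cartesian product of the four variables’ levels, which base R’s `expand.grid()` builds directly:

```r
# All combinations of the four categorical variables' levels:
categories <- expand.grid(
  Class = c("1st", "2nd", "3rd", "Crew"),
  Sex = c("Male", "Female"),
  Age = c("Child", "Adult"),
  Survived = c("No", "Yes")
)
nrow(categories)  # 32 = 4 * 2 * 2 * 2
```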

**a)** Output tables that compare survivor counts

- between men and women
- between the different classes

The following give the survivor counts split by sex and by class. Note we use the `sum()` function to add the values in the `n` variable:

```
Titanic %>%
filter(Survived == "Yes") %>%
group_by(Sex) %>%
summarise(n=sum(n))
```

Sex | n |
---|---|
Female | 344 |
Male | 367 |

```
Titanic %>%
filter(Survived == "Yes") %>%
group_by(Class) %>%
summarise(n=sum(n))
```

Class | n |
---|---|
1st | 203 |
2nd | 118 |
3rd | 178 |
Crew | 212 |

**b)** For each comparison above, show who was most likely to survive. Spoiler: it was women and 1st class passengers; show your work.

The counts alone aren’t enough, however! When we ask who was more *likely* to survive, we also need to consider those who didn’t survive!

```
Titanic %>%
filter(Survived == "No") %>%
group_by(Sex) %>%
summarise(n=sum(n))
```

Sex | n |
---|---|
Female | 126 |
Male | 1364 |

```
Titanic %>%
filter(Survived == "No") %>%
group_by(Class) %>%
summarise(n=sum(n))
```

Class | n |
---|---|
1st | 122 |
2nd | 167 |
3rd | 528 |
Crew | 673 |

So

- `Survival` by `Sex`: women were much more likely to survive:
  - Female survival: 344/(344+126) = 73.2%
  - Male survival: 367/(367+1364) = 21.2%
- `Survival` by `Class`: 1st class was most likely to survive:
  - 1st class survival: 203/(203+122) = 62.5%
  - 2nd class survival: 118/(118+167) = 41.4%
  - 3rd class survival: 178/(178+528) = 25.2%
  - Crew survival: 212/(212+673) = 24.0%
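These percentages are straightforward arithmetic on the counts above; a quick check in R:

```r
# Survival proportions = survivors / (survivors + non-survivors):
female <- 344 / (344 + 126)
male <- 367 / (367 + 1364)
round(100 * c(female, male), 1)  # 73.2 and 21.2 percent
```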

**Important note**: We can `group_by` more than one categorical variable to make the above task easier!

```
Titanic %>%
group_by(Sex, Survived) %>%
summarise(n=sum(n))
```

Sex | Survived | n |
---|---|---|
Female | No | 126 |
Female | Yes | 344 |
Male | No | 1364 |
Male | Yes | 367 |

```
Titanic %>%
group_by(Class, Survived) %>%
summarise(n=sum(n))
```

Class | Survived | n |
---|---|---|
1st | No | 122 |
1st | Yes | 203 |
2nd | No | 167 |
2nd | Yes | 118 |
3rd | No | 528 |
3rd | Yes | 178 |
Crew | No | 673 |
Crew | Yes | 212 |

**Optional exercise**:

```
Titanic %>%
group_by(Sex, Survived) %>%
summarise(n=sum(n)) %>%
# Override previous grouping metadata with new one
group_by(Sex) %>%
mutate(n_in_sex = sum(n)) %>%
mutate(prop = n/n_in_sex)
```

Sex | Survived | n | n_in_sex | prop |
---|---|---|---|---|
Female | No | 126 | 470 | 0.268 |
Female | Yes | 344 | 470 | 0.732 |
Male | No | 1364 | 1731 | 0.788 |
Male | Yes | 367 | 1731 | 0.212 |

Announcements:

- Information session for statistics and biostatistics graduate programs
- Where: Merrill 300A.
- When: Friday, October 6 12:00 pm. A light lunch will be served.
- More information here.

Topics:

- Chalk talk:
- 2.9 Mutate, reorder, joins

- Tech time:
- Read Chapters [5.2.3, 5.3] of ModernDive.

Announcements:

- Strategies for problem sets:
- Please read chalk talk notes first. They either
  - Give executive summaries of important ideas
  - Explain the more difficult concepts slowly

- When `ggplot`ing, focus on the *grammar of graphics* first (the *what*), and only then the `ggplot2` code (the *how*). You don’t want to conflate the *what* and the *how*, or you’ll be doubly confused! Lay out a plan of attack on paper, blackboard, whiteboard, etc. first!
  - Figure out, based on the chart from Chalk Talk 2.5, which plot is appropriate given your data.
  - Sketch out what you want your plot to look like first!
  - Create the table that shows the mapping of variables in your data frame to `aes()`thetic attributes.
  - Find an example in the book that looks similar, and cut/paste/modify the code!

- Example:

Topics:

- Chalk talk
- Example of above strategy
- Learning check: and/or operators 5.1
- Graphical illustration of `filter()`, `summarize()`, and `group_by()`

- Tech time:
- Read Chapters [5.2.2, 5.2.3] of ModernDive.

- Only `summarize`ing (summarizing rows): `group_by()` first (in this case by color: grey, blue, green), then `summarize`ing for each color separately:

Topics:

- Chalk talk:
- Learning checks: ID vs measurement variables 4.4
- Function inputs/outputs, five main verbs (5MV), 5MV#1: filtering rows.

- Tech time:
- Read Chapters [5, 5.2.1] of ModernDive.

`filter`ing (subsetting rows):

- Assigned on Thu 9/21, due on Thu 9/28 11:30am.

You will get practice with more of the 5NG and begin to “think with data”.

- Deliverables:
  - Complete the following survey on the two DataCamp courses you took. This survey is considered part of the problem set.
  - Code component:
    - Have a file called `PS04_YourFirstName_YourLastName.R` (Ex: `PS04_Albert_Kim.R`) ready to submit. Create a new `.R` file, save it with the appropriate name as indicated above, and cut and paste the content below. You should be cutting and pasting the commands that create your graphs into this file!
    - To be submitted on Slack in the `#ps04` channel, which I will create on Thu 9/28 11:30am.
- Textbook component:
  - Stats, Data, and Models 4th edition:
    - Chapter 2 (p.36): #4, #28 (add part d: sketch a “dodged barchart” showing this data).
    - Chapter 3 (p.76): #20 parts a) and b) only.
    - Chapter 4 (p.107): #31
  - To be submitted on paper in class. No emailed submissions will be accepted.

- Tips:
  - Starting with this problem set we will ramp things up; it will take longer.
  - When working in RStudio, `View()`ing any data frame first is **always** a good idea.

```
# Problem Set 04
# Please Indicate
# -Your name:
# -Who you collaborated with:
# -Approximately how much time did you spend on this problem set:
# -What, if anything, gave you the most trouble:
# Needed packages. Make sure you have all these installed.
library(ggplot2)
library(dplyr)
library(fivethirtyeight)
library(nycflights13)
library(gapminder)
library(tidyr)
# Question 1: Departure delays for NYC domestic flights --------------------
# a) The following bit of code takes the nycflights13 flights data and restricts
# it to those for United (UA) and American (AA)
flights_subset <- flights %>%
  filter(carrier == "UA" | carrier == "AA")
# Create a visualization that compares the distribution of departure delays for
# both carriers.
# b) Based on what you see, who tends to have higher departure delays? United
# or American?
# Question 2: Drinks ------------------------------------------------------
# Read the 538 articles corresponding to the drinks data frame linked in the
# help file:
?drinks
# Let's look at the data set drinks from the fivethirtyeight package.
View(drinks)
# a) Which 3 countries drink the most total litres of pure alcohol according to
# this data? No need to show your work here.
# b) The following bit of code uses the gather() function from the tidyr package
# to convert the data to tidy format and then arranges it by country, and then
# only considers the US, Canada, UK. Don't worry about understanding how this
# works for now, just run it and View() the new data frame.
drinks_tidy <- drinks %>%
  gather(type, servings, -c(country, total_litres_of_pure_alcohol)) %>%
  filter(country %in% c("USA", "Canada", "United Kingdom"))
View(drinks_tidy)
# Create an appropriate visualization that shows the # of servings of beer,
# wine, spirits consumed by each of the 3 countries listed above
# c) Who drinks the most wine of the group?
# Question 3: Gapminder ----------------------------------------------------
# Let's create a plot similar to the one we saw in Chapter 3 of ModernDive:
# https://ismayc.github.io/moderndiver-book/3-viz.html#components-of-the-grammar
# Specifically, in the TED talk "The best stats you've ever seen" at
# https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen
# Hans Rosling (RIP 2017) presents human and international development data. The
# data seen in the video is accessible in the gapminder data set within the
# gapminder package
View(gapminder)
# The next bit of code
# -Filters the data set to only consider observations for two years: 1952 & 2007
# -Renames the "pop" variable to "population"
gapminder <- filter(gapminder, year == 1952 | year == 2007) %>%
  rename(population = pop)
# a) Recreate the scatterplot of "Child Survival (%)" over "GDP per capita ($)"
# seen at 12:00 in the video, but
# * Making a comparison between 1952 and 2007
# * Displaying "life expectancy" instead of "Child Survival"
# Do so by changing the code snippet below.
ggplot(data=DATASETNAME, aes(AES1=VAR1, AES2=VAR2, AES3=VAR3, AES4=VAR4)) +
  geom_point() +
  facet_wrap(~VAR5) +
  scale_x_log10() +
  labs(x="WRITE LABEL HERE", y="WRITE LABEL HERE", title="WRITE TITLE HERE")
# I suggest you leave the above template untouched, copy and paste the above
# 5 lines of code below:
# b) Describe two facts that would be of interest to international development
# organizations.
```

Written solutions are here.

```
library(ggplot2)
library(dplyr)
library(fivethirtyeight)
library(nycflights13)
library(gapminder)
library(tidyr)
```

**a)** The following bit of code takes the nycflights13 flights data and restricts it to those for United (UA) and American (AA).

```
flights_subset <- flights %>%
  filter(carrier == "UA" | carrier == "AA")
# Create a visualization that compares the distribution of departure delays for
# both carriers.
ggplot(flights_subset, aes(x=dep_delay)) +
  geom_histogram() +
  facet_wrap(~carrier)
```

```
ggplot(flights_subset, aes(x=carrier, y=dep_delay)) +
  geom_boxplot()
```

**b)** Based on what you see, who tends to have higher departure delays? United or American? It’s hard to see because of the outliers! But if you zoom in on the plot, we see that United has a larger **median** delay.

If I had to choose based on this data? I’d go with American!
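One way to zoom in without dropping the outlier rows is `coord_cartesian()`, which crops the view rather than removing data. A sketch; the y-limits here are arbitrary choices:

```r
library(ggplot2)
library(dplyr)
library(nycflights13)

flights_subset <- flights %>%
  filter(carrier == "UA" | carrier == "AA")

# coord_cartesian() crops the visible window; the boxplot statistics are
# still computed from the full data, so the medians become easy to compare.
ggplot(flights_subset, aes(x=carrier, y=dep_delay)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(-25, 50))  # arbitrary zoom window
```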

**a)** Which 3 countries drink the most total litres of pure alcohol according to this data? No need to show your work here.

You can just `View()` the data and sort in the sheet!

`View(drinks)`

Belarus, Lithuania, and Andorra!
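Alternatively, a short `dplyr` pipeline gets the same answer without clicking around in the viewer (a sketch using `arrange()`):

```r
library(dplyr)
library(fivethirtyeight)

# Sort countries by total litres of pure alcohol, largest first,
# and peek at the top 3 rows
drinks %>%
  arrange(desc(total_litres_of_pure_alcohol)) %>%
  head(3)
```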

**b)** The following bit of code uses the gather() function from the tidyr package to convert the data to tidy format and then arranges it by country, and then only considers the US, Canada, UK. Don’t worry about understanding how this works for now, just run it and `View()` the new data frame.

```
drinks_tidy <- drinks %>%
  gather(type, servings, -c(country, total_litres_of_pure_alcohol)) %>%
  filter(country %in% c("USA", "Canada", "United Kingdom"))
```

Create an appropriate visualization that shows the # of servings of beer, wine, spirits consumed by each of the 3 countries listed above

```
ggplot(drinks_tidy, aes(x=country, y=servings, fill=type)) +
  geom_col(position="dodge")
```

**c)** Who drinks the most wine of the group?

The United Kingdom!

Let’s create a plot similar to the one we saw in Chapter 3 of ModernDive.

Specifically, in the TED talk “The best stats you’ve ever seen”, Hans Rosling (RIP 2017) presents human and international development data. The data seen in the video is accessible in the gapminder data set within the gapminder package.

The next bit of code

- Filters the data set to only consider observations for two years: 1952 & 2007
- Renames the “pop” variable to “population”

```
gapminder <- filter(gapminder, year == 1952 | year == 2007) %>%
  rename(population = pop)
```

**a)** Recreate the scatterplot of “Child Survival (%)” over “GDP per capita ($)” seen at 12:00 in the video, but

- Making a comparison between 1952 and 2007
- Displaying “life expectancy” instead of “Child Survival”

```
ggplot(data=gapminder, aes(x=gdpPercap, y=lifeExp, col=continent, size=population)) +
  geom_point() +
  facet_wrap(~year) +
  scale_x_log10() +
  labs(x="GDP per capita US$", y="Life expectancy", title="Life expectancy over GDP 1952 vs 2007")
```

**b)** Describe two facts that would be of interest to international development organizations.

I would say:

- The relationship between the GDP and life expectancy is positive, as one would expect.
- Alas, Africa hasn’t made as significant gains in GDP and life expectancy as Asia.
- Look at China and India, the two biggest green dots, jump!

Announcements:

- Problem set submission via Slack at 11:30.
- Amherst College Department of Mathematics and Statistics Colloquium tonight.
- Title: Lessons for official statistics production around the world from the experience of Greece
- Speaker: Andreas Georgiou ’83, Visiting Lecturer in Economics Amherst College
- When/where: Thursday, September 21, 2017 7:00 pm in Seeley Mudd 206.
- Abstract: The seminar will address what lessons can be gleaned from the experience with official statistics production in Greece, both in the period up to the onset of the debt crisis and since then. The seminar will discuss lessons about the appropriate institutional setting, legal framework, statistical culture and statistical practices.

Topics:

- Chalk talk:
- Learning checks: Barplots 3.33
- Tidy data format and spreadsheets + Comma Separated Values (CSV) files. CSVs are all stuff, no fluff.

- Tech time:
- Go over Chapter 4 of ModernDive sections, but
- Skip 4.3 on converting from wide format to long/tidy format
- Skip 4.5 on normal forms of data

- Corresponding chapters in “Stats, Data, & Models” (SDM): Chapter 1.


Announcements:

- Now we’re done with the Five Named Graphs (5NG), but there are dozens of other possible plots! See here for a pretty exhaustive list!

Topics:

- Chalk talk:
- 2.5 Barplots + Wrapup of 5NG and what each is showing.
- Learning checks: Boxplots 3.22

- Tech time:
- Read Chapters [3.8.1, end] of ModernDive. The DataCamp exercises are not quite ready for primetime, so you may skip these.

Topics:

- Chalk talk: 2.4 Boxplots & aggregated/unaggregated tables
- Tech time:
- Learning checks:
- Read Chapters 3.7 and the beginning of Chapter 3.8 (but stop just before 3.8.1) of ModernDive.

*Assigned on Thu 9/14, due on Thu 9/21 11:30am. The submission method will be announced then.*

Baby’s first data visualizations! We’re going to use data from the `fivethirtyeight` package, which contains data corresponding to the articles on http://fivethirtyeight.com/.

- Deliverables:
- Have a file called `PS03_YourFirstName_YourLastName.R` (Ex: `PS03_Albert_Kim.R`) ready to submit on Thursday 9/21 11:30am.
- Tips: I can’t emphasize these enough, especially early on when learning to code:
- Computers are stupid!
- Do not code from scratch; copy/paste something that works and tweak it!
- Learning to code is like learning a new language! It’s hard at first, but you get better with practice!
- Do not spin your wheels! Work in groups, go to the weeknightly 7pm-9pm drop-in tutoring in Merrill Science Center 300B, and/or come to office hours.

- Create a new `.R` file, save it with the appropriate name as indicated above, and cut and paste the following content. You should be cutting and pasting the commands that create your graphs into this file!

```
# Problem Set 03
library(ggplot2)
library(dplyr)
library(fivethirtyeight)
# Question 1 --------------------------------------------------------------
# Let's consider the bechdel data frame that's included in the fivethirtyeight
# package:
View(bechdel)
# Read the article on fivethirtyeight.com linked in the help file for the data
# frame
?bechdel
# Graphic creation: Create a scatterplot where the
# -x-axis represents the film's budget
# -y-axis represents the film's domestic gross
# -color represents whether or not the film passed the Bechdel Test
# Interpretation: Your grandma looks over your shoulder and sees the resulting
# plot. She asks "What does this plot ultimately mean?" What do you tell her?
# Write your answer below:
# Question 2 --------------------------------------------------------------
# Let's consider the US_births_1994_2003 data frame that's included in the
# fivethirtyeight package:
View(US_births_1994_2003)
# Read the article on fivethirtyeight.com linked in the help file for the data
# frame
?US_births_1994_2003
# Let's not consider all years between 1994 and 2003, but rather only 1999.
# Don't worry if you don't understand the two lines of code below; we'll see
# this in Chapter 5 of ModernDive: Data Wrangling
US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)
View(US_births_1999)
# Graphic creation: Create a "time series" plot of the number of births for
# each day in 1999
# Interpretation: Some one asks you: "Why is there that spike of more than
# 14,000 births in the second half of the year? What's going on?" What do you
# say? Write your answer below:
# Question 3 --------------------------------------------------------------
# Let's consider the airline_safety data frame that's included in the
# fivethirtyeight package:
View(airline_safety)
# Read the article on fivethirtyeight.com linked in the help file for the data
# frame
?airline_safety
# Graphic creation: Create a histogram of the number of incidents for all
# airlines between the years 2000 and 2014.
# Interpretation: Someone asks: "What is the typical number of incidents
# airlines had between 2000 and 2014?" What do you say? Write your
# answer below:
```

```
library(ggplot2)
library(dplyr)
library(fivethirtyeight)
```

```
ggplot(bechdel, aes(x=budget, y=domgross, col=binary)) +
  geom_point()
```

**Interpretation**: Your grandma looks over your shoulder and sees the resulting plot. She asks “What does this plot ultimately mean?” What do you tell her? Write your answer below:

**Discussion**: There is a positive relationship between budget and domestic gross, probably due to bigger budget films also having bigger marketing and distribution budgets. It’s kinda hard to tell from this plot if this relationship between budget and domestic gross is “significantly different” between movies that pass the Bechdel test and those that don’t. We’ll study this more when we see regression and statistical inference. Here is a preview!

```
# Let's not consider all years between 1994 and 2003, but rather only 1999.
# Don't worry if you don't understand the two lines of code below; we'll see
# this in Chapter 5 of ModernDive: Data Wrangling
US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)
# Graphic creation: Create a "time series" plot of the number of births for
# each day in 1999
ggplot(US_births_1999, aes(x=date, y=births)) +
  geom_line()
```

You should really be using a linegraph here and not a scatterplot, because sequential observations are related to each other, i.e. they are connected. Linegraphs are suggestive of this.

**Interpretation**: Someone asks you: “Why is there that spike of more than 14,000 births in the second half of the year? What’s going on?” What do you say? Write your answer below:

| year | month | date_of_month | date | day_of_week | births |
|------|-------|---------------|------------|-------------|--------|
| 1999 | 9 | 9 | 1999-09-09 | Thurs | 14540 |
| 1999 | 12 | 21 | 1999-12-21 | Tues | 13508 |
| 1999 | 9 | 8 | 1999-09-08 | Wed | 13437 |
| 1999 | 9 | 21 | 1999-09-21 | Tues | 13384 |
| 1999 | 9 | 28 | 1999-09-28 | Tues | 13358 |
| 1999 | 7 | 7 | 1999-07-07 | Wed | 13343 |

The spike in births on Sept 9, 1999 is probably not due to chance; there is probably something going on. How do you write Sept 9, 1999? 9/9/99! People were inducing births deliberately on that day!
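A table like the one above can be produced by sorting the days of 1999 by number of births, most first (a sketch using `arrange()`):

```r
library(dplyr)
library(fivethirtyeight)

US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)

# Sort days by number of births, largest first; the top rows reveal
# the 9/9/99 spike
US_births_1999 %>%
  arrange(desc(births)) %>%
  head(6)
```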

We can play with the binning structure with either `binwidth` or `bins`:

```
ggplot(airline_safety, aes(x=incidents_00_14)) +
  geom_histogram(binwidth = 2)
```

```
ggplot(airline_safety, aes(x=incidents_00_14)) +
  geom_histogram(bins = 12)
```

**Interpretation**: Someone asks: “What is the typical number of incidents airlines had between 2000 and 2014?” What do you say? Write your answer below:

“Typical” can mean lots of different things. It can be

- the mode: the most frequently occurring value, in this case values near 0
- the mean: the average
- the median: the middle value that splits the data 50/50.

*In my opinion*, I would say that most airlines didn’t have that many incidents, so the typical value is 0. Thank heavens!
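To put numbers on the other two notions of “typical”, a sketch computing the mean and median with `summarize()`:

```r
library(dplyr)
library(fivethirtyeight)

# Mean and median number of incidents across airlines, 2000-2014
airline_safety %>%
  summarize(
    mean_incidents   = mean(incidents_00_14),
    median_incidents = median(incidents_00_14)
  )
```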

Announcements:

- “Gah! I can’t tell who you are from your Slack names!” Please go to: “STAT 495 Amherst…” on top left of Slack -> “Preferences” -> “Messages & Media” -> “Names” -> Click the “Full & display names” radio button.
- Convenient button in RStudio: press “Up” button in Console to scroll thru history.

Topics:

- Chalk talk 2.3
- Tech time:
- Be sure to always load the packages outlined at the beginning of chapter in ModernDive.
- Create new “R Script” i.e. a “scratch pad”
- Read Chapters [3.5, 3.6] of ModernDive on histograms and facets, completing all *Learning Checks* along the way.
- Start PS03 below!

Announcements:

- Syllabus finalized, so please read it!
- Midterm dates
- Topic schedule (posted above also)
- New office hours location: Seeley Mudd 208 (lounge)

- Learning check discussions/solutions from “Lec 1.2 Data and R packages” readings are embedded inside book (See Chapter 2.4.3 of ModernDive for example).

Topics:

- Chalk talk 2.2: The “Five Named Graphs” (5NG) and an example of using the `ggplot()` function to create the plot from last time.

  > Intro stats & data science #chalktalk of grammar of graphics + homage to @katyperry today, #ggplot2 tomorrow #rstats pic.twitter.com/1CQksTGqeM
  >
  > — Albert Y. Kim (@rudeboybert) September 11, 2017

- Tech time: Read Chapters [3.3, 3.4] of ModernDive on scatterplots and linegraphs, completing all *Learning Checks* along the way. Be sure to always load the packages outlined at the beginning of the chapter.

*Assigned on Mon 9/11, due on Thu 9/14 11:30am.*

- Complete the following chapters from the “Intermediate R” DataCamp course:
- Chapter 1 Conditionals and Control Flow
- *Skip: Chapter 2 Loops*
- Chapter 3 Functions
- *Skip rest: Chapters 4-5*

- There is nothing to submit as DataCamp logs your completion.

Announcements:

- Math/stat table 12-1:30 in Terrace Room A, Valentine Hall
- Raffle tickets during drop-in hours. Something called “R Markdown Fest/Workshop” will be taking place; at the moment, participate only if you’re curious as we’ll cover this later. You can submit raffle tickets Monday and Tuesday:
- 4-6pm (just this week on M & Tu)
- 7-9pm (Normal S, M, Tu, W, Th office hours)

- Problem sets will be posted/due Thursdays, so problem set 2 is the last out-of-sync problem set.
- If you were spinning your wheels with DataCamp, come to office hours or drop-in hours.
- Syllabus will be finalized by tomorrow, including
- Midterm dates
- Topic schedule

- Today we start topic 2: Data Science!

Topics:

- Take a look at the graph in Section 3.1.2 of ModernDive
- Chalk talk: 2.1 Grammar of graphics
- Tech time: Read Chapters [3, 3.2] of ModernDive

Announcements:

- Drop-in tutoring hours posted on syllabus
- Refresher on Slack notifications

Topics:

- Chalk talk:
- 1.2 R vs RStudio, coding
- 1.2 R packages
- 1.2 Data

- Tech time:
- Read Chapters [2.3, 2.5] of ModernDive
- Do the *Learning Checks* in Chapter 2.4. These are in-class exercises that will not be collected, but discussions/solutions will be posted by the start of the next lecture.
- (Optional reading) Corresponding chapters in “Stats, Data, & Models” (SDM) 4th edition: Chapter 1.

*Assigned on Tue 9/5, due on Mon 9/11 11:30am.*

- Watch:

  > I gave a TED talk on 3 ways to spot a bad statistic. I really hope you like it https://t.co/IjijWfyRTk pic.twitter.com/36BJZrpirx
  >
  > — Mona Chalabi (@MonaChalabi) March 24, 2017

- Accept the invitation from DataCamp and complete the following chapters from the “Introduction to R” DataCamp course:
- Chapter 1 Intro to basics
- Chapter 2 Vectors
- *Skip: Chapter 3 Matrices*
- Chapter 4 Factors
- Chapter 5 Data frames
- *Skip: Chapter 6 Lists*

- There is nothing to submit as DataCamp logs your completion.

- Albert’s background: http://rudeboybert.rbind.io/
- Go over syllabus (slides)
- Setting up infrastructure
- Intro to Slack
- Toolbox: R, RStudio, and DataCamp

- Slack is a medium for communication that has many pros (and some cons)
- Student feedback
- Two ways to use it. Try both for a few lectures and stick with what you prefer
- Via the Desktop App
- Via the browser

- Make sure you’ve completed the intro survey.
- Install the Slack Desktop App.
- Browse the past final projects.
- Take a break

- Chalk talk: 1.1 Slack
- Exercise:
- Send me a direct message (DM)
- Practice sending group DM with seatmates

- Keys for Slack success:
- Setting notifications.
- If you would like to still receive email notifications.

- R and RStudio
- DataCamp: an online interactive environment for learning data science (currently via R and Python).
- Student feedback from a previous intro course.
- Note: DataCamp is not a tool for long-term retention, but for introduction.

- Read Chapters [2, 2.2] of ModernDive (Chapters 2 thru 2.2 inclusive, but not Chapter 2.3):
- Understand the difference between R and RStudio
- Install them on your machines
- Chapter 2.2 introduces what you’ll be learning in problem set 1

- Accept the invitation in your Amherst emails to the DataCamp group “STAT/MATH 135 Intro to Stats via Modeling (Fall 2017)”