The lectures are numbered according to the following rough topic outline:

- Getting started (orange): R, RStudio, R packages
- Data science (pink): Data visualization via `ggplot2`, “tidy” data format, and data wrangling via `dplyr`

- Data modeling (blue): Correlation, basic regression, and multiple regression
- Statistical background (green): Random assignment for causal inference, random sampling for statistical inference, sampling distributions, and standard errors.
- Statistical inference (yellow): confidence intervals, hypothesis testing, inference for regression, other inferential methods

**Dates/times**:

- Time/place listed here.
- Midterm review Wed 5/2 7:30pm in Merrill 4.
  - Go over the practice exam. Note:
    - No solutions will be posted, as I do not have a digital copy.
    - This was a non-cumulative “third midterm,” not a cumulative final exam; expect your final to be longer.
  - Go over past problem sets.
  - Any questions you may have; if possible, Slack them to me in advance.
- Office hours in SMudd 208:
  - Reading week: Thurs 5/3 2:30pm-5:30pm
  - Finals week: Mon 5/7 1pm-5pm

- Study center drop-in hours in Merrill 300B.
  - Reading week:
    - Sun 4/29 8pm-9pm
    - Mon 4/30 7pm-9pm
    - Tues 5/1 7pm-9pm
    - Wed 5/2 7pm-9pm
    - Thurs 5/3 8pm-9pm
  - Finals week:
    - Sun 5/6 7pm-9pm
    - Mon 5/7 7pm-9pm


**Topics**:

- Please bring:
  - A calculator (cell phone calculators not allowed). If you do not have access to one, please Slack me.
  - (Optional) A single 3 x 5 inch double-sided index card “cheatsheet” with your name on it. You’ll be submitting these with your final exam.
- Cumulative:
  - All lectures: Lec 1.1 thru Lec 5.11
  - All problem sets, both R and written components, including PS13 posted below.

- Use your class notes as your roadmap:
- They summarize what’s been covered in ModernDive + SDM and emphasize tricky ideas.
- They point to extra material on the course webpage.

- You may be asked questions involving basic pseudocode, as seen in Lec 2.9 below on available seat miles.

Recall the results of the “How many countries are there in Africa?” A/B test from Lec 4.3 where we randomly primed

- Roughly half the students with “Are there less than 14 countries in Africa?”
- Roughly half the students with “Are there more than 94 countries in Africa?”

```
library(readr)
library(dplyr)
africa <- read_csv("https://rudeboybert.github.io/STAT135/static/africa/africa.csv") %>%
  select(group, number_of_countries)
head(africa)
```

group | number_of_countries |
---|---|
More than 94 | 70 |
Less than 14 | 27 |
More than 94 | 50 |
More than 94 | 93 |
Less than 14 | 53 |
More than 94 | 80 |

```
library(ggplot2)
ggplot(africa, aes(x = group, y = number_of_countries)) +
  geom_boxplot() +
  labs(x = "Priming statement", y = "Guessed number of countries in Africa",
       title = "Effects of priming on guessed number of countries in Africa")
```

**Analysis approach 1: Regression**

```
library(moderndive)
africa_regression <- lm(number_of_countries ~ group, data = africa)
get_regression_table(africa_regression)
```

term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 33.829 | 3.455 | 9.792 | 0 | 26.936 | 40.721 |
groupMore than 94 | 26.310 | 4.852 | 5.423 | 0 | 16.631 | 35.989 |

**Analysis approach 2: Two-sample t-test**

`t.test(number_of_countries ~ group, data = africa)`

```
##
## Welch Two Sample t-test
##
## data: number_of_countries by group
## t = -5.4438, df = 65.129, p-value = 8.488e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -35.96224 -16.65839
## sample estimates:
## mean in group Less than 14 mean in group More than 94
## 33.82857 60.13889
```
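
Note how the two approaches agree: the regression slope for `groupMore than 94` is exactly the difference between the two group means from the t-test output. A quick arithmetic check in R:

```r
# Group means from the t-test output above
mean_less_than_14 <- 33.82857
mean_more_than_94 <- 60.13889
# Their difference matches the regression table's slope estimate of 26.310
mean_more_than_94 - mean_less_than_14
```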

- Assigned Tuesday 4/25, **but you will not submit it**. Solutions to be posted during reading week.
- Only written component
- Learning goals:
- Inference for means: confidence intervals and hypothesis testing
- ANOVA
- Two-sample inference

- Exercise 20.7 on home sales (page 540). For part a), assume all conditions are met since the sample size \(n=36\) is “large enough.”
- Exercise 20.24 on parking
- Exercise 20.38 on saving gas. Assume since the sample size \(n=50\) is large, we can use the standard normal \(z\)-curve to compute \(p\)-values and not the \(t\)-distribution.
- Exercise 26.1 on popcorn (page 774). Skip parts b) and d); do parts a) and c), where the p-value is 0.00037.
- Exercise 26.5 on baking yeast. Skip part c)
- Exercise 22.28 on graduation (page 618). Skip part a)
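
Regarding the hint for Exercise 20.38: with \(n=50\), the \(t\)-distribution with 49 degrees of freedom is already very close to the standard normal, which is why the \(z\)-curve is a reasonable shortcut. A quick sketch comparing the two 95% critical values:

```r
# 95% critical values: t with 49 degrees of freedom vs standard normal z
qt(0.975, df = 49)   # about 2.01
qnorm(0.975)         # about 1.96; already close for n = 50
```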

Recall from ModernDive Chapter 6.2.2 we saw simple regression with one categorical explanatory variable:

- \(y\): Life expectancy
- \(x\): Continent (categorical with Africa set as baseline)

```
library(ggplot2)
library(dplyr)
library(gapminder)
library(moderndive)
# Wrangle data:
gapminder2007 <- gapminder %>%
  filter(year == 2007) %>%
  select(country, continent, lifeExp, gdpPercap)
# Exploratory data analysis:
ggplot(gapminder2007, aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Life expectancy", title = "Fig 1: Life expectancy by continent")
```

```
# Fit regression
lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)
get_regression_table(lifeExp_model)
```

term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 54.806 | 1.025 | 53.446 | 0 | 52.778 | 56.834 |
continentAmericas | 18.802 | 1.800 | 10.448 | 0 | 15.243 | 22.361 |
continentAsia | 15.922 | 1.646 | 9.675 | 0 | 12.668 | 19.177 |
continentEurope | 22.843 | 1.695 | 13.474 | 0 | 19.490 | 26.195 |
continentOceania | 25.913 | 5.328 | 4.863 | 0 | 15.377 | 36.450 |

For Americas, Asia, Europe, and Oceania: since all four 95% confidence intervals for the “bump” relative to the baseline group Africa **do not** contain 0, this suggests that all four continents have significantly higher mean life expectancies than Africa.
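
You can reconstruct any of these confidence intervals by hand from the table: each is the estimate plus/minus the appropriate \(t\)-quantile (here with 137 residual degrees of freedom) times the standard error. For example, for `continentAmericas`:

```r
# 95% CI for the continentAmericas "bump": estimate +/- t* x SE
estimate <- 18.802
std_error <- 1.800
t_star <- qt(0.975, df = 137)   # 137 residual degrees of freedom
c(estimate - t_star * std_error, estimate + t_star * std_error)
# Matches the (lower_ci, upper_ci) = (15.243, 22.361) row above
```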

Look at the above boxplot. Keep in mind that the solid black lines inside the boxes represent medians, not means. The question ANOVA answers is: are the mean life expectancies of the 5 continents the same, or is at least one of them different?

Here we have \(k=5\) groups (each being a continent) and \(n = 142\) observations (each being a country in 2007). Let’s generate the ANOVA table by applying the `anova()` function to our saved model in `lifeExp_model`:

`anova(lifeExp_model)`

term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
continent | 4 | 13060.676 | 3265.1691 | 59.714 | 0 |
Residuals | 137 | 7491.177 | 54.6801 | | |

The “null distribution” is the \(F\) distribution with degrees of freedom

- \(df_1 = k-1 = 5-1 = 4\)
- \(df_2 = n-k = 142-5 = 137\)
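
Equivalently, you can get the p-value for the observed \(F\)-statistic directly with base R’s `pf()`:

```r
# Area to the right of the observed F-statistic under the F(4, 137) curve
pf(59.714, df1 = 4, df2 = 137, lower.tail = FALSE)
# Essentially 0, matching the p.value column of the ANOVA table
```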

You can create a similar plot using `xpf()` (similar to the `xpnorm()` function we used before when the null distribution is normal):

```
library(mosaic)
xpf(59.714, df1 = 4, df2 = 137)
```

**On using p-values vs confidence intervals**:

- Psychology journal bans papers containing p-values “because the statistics were too often used to support lower-quality research.”
- The American Statistical Association put out a statement advising researchers to “avoid drawing scientific conclusions or making policy decisions based on p-values alone.”

- Assigned Wednesday 4/18, due Tuesday 4/24.
- Only written component
- Learning goals: Practice inference for regression

- Exercise 19.24 (page 515) on educated mothers. You can skip part b)
- Exercise 25.1 (page 718) on graduation
- ~~Exercise 25.3 on graduation~~ Skip.
- Exercise 25.7 on graduation
- Exercise 25.11 on graduation
- Exercise 25.13 on graduation
- Exercise 25.17 on graduation
- Exercise 25.22 on house prices. Skip part c)
- Exercise 25.24 on second home
- Part VI Review Exercises 42 (page 742) on Old Faithful. Skip parts e) and f)
- Revisit the solutions to the PS08 R Markdown component `PS08_solutions.html`:
  - Recompute by hand the 95% confidence interval [0.399, 0.980] for the population slope for `log10_sqft_living`, as we did in Lec 5.7.
  - Using \(\alpha=0.01\), conduct a hypothesis test for the “condition 5” row. State your conclusion using both statistical and non-statistical language.

- Assigned Wednesday 4/11, due Tuesday 4/17.
- Two components:
- Written component: To be handed in on paper at the start of lecture; no email submissions will be accepted.
- Central Limit Theorem quiz component. You’ll have a quiz at the start of lecture on Tuesday 4/17.

Note SDM is the textbook Stats: Data and Models 4th edition.

- SDM Exercise 18.15 on confidence intervals (p.490)
- SDM Exercise 18.16 on confidence intervals
- SDM Exercise 19.2 on being psychic (p.512)
- SDM Exercise 19.4 on being psychic
- SDM Exercise 19.7 on being psychic
- SDM Exercise 19.9 on bad medicine

Solutions `PS11_written_solutions.pdf`

Watch the above video on the Central Limit Theorem before lecture on Tuesday 4/17. At the start of lecture, you will have a brief quiz worth one tenth of your engagement grade (thus 1% overall).

**Question 1**: The Central Limit Theorem guarantees what?

- **The averages of samples have approximately normal distributions.** In other words, the sampling distribution of \(\overline{x}\) is normal.
- **As sample size gets bigger, the distribution of averages gets more normal and narrower.** In other words, as \(n \longrightarrow \infty\) the sampling distribution of \(\overline{x}\) will more and more resemble a perfect normal curve AND it gets narrower, i.e. the spread shrinks, i.e. the standard error gets smaller.

**Question 2**: For the Central Limit Theorem to “work,” the population distribution from which you extract the samples to compute the sample averages should be normal as well.

- a) TRUE or FALSE
- b) What example is used in the video? Be specific about the nature of the distribution.

FALSE. For example, say we draw samples of size \(n\) from the population distribution of dragon wing-spans, which everyone knows is bimodal. Even if this population distribution of individual dragon wing-spans isn’t normal, the sampling distribution of \(\overline{x}\) based on samples of size \(n\) will still be normal.
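
The dragon example is from the video; here is a sketch simulating the same idea with an assumed bimodal population (a mixture of two normal distributions), showing that the sample means are nevertheless approximately normally distributed:

```r
set.seed(76)
# Assumed bimodal population: a mixture of two normal distributions
population <- c(rnorm(50000, mean = 4, sd = 0.5),
                rnorm(50000, mean = 10, sd = 0.5))
# 1000 sample means, each based on a sample of size n = 50
xbars <- replicate(1000, mean(sample(population, size = 50)))
# The sampling distribution of x-bar is centered at the population mean
# and looks bell-shaped despite the bimodal population:
mean(population)
mean(xbars)
hist(xbars)
```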

“Replicable” random values. Copy the following into a scratchpad and run the following code portions individually:

```
# Run this
sample(1:10)
# Then this
sample(1:10)
# Then this
sample(1:10)
# Note how the results are different, due to randomness. Now run this:
set.seed(76)
sample(1:10)
# Then this:
set.seed(76)
sample(1:10)
# Then this. Because the seed is different, the results are different:
set.seed(79)
sample(1:10)
```

We see that if we set the seed value, we get “replicable” randomness. Pick any number you like as the seed value. For example, 76 is a favorite number of mine.
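
A minimal check that the same seed reproduces the same draw:

```r
set.seed(76)
first_draw <- sample(1:10)
set.seed(76)
second_draw <- sample(1:10)
identical(first_draw, second_draw)   # TRUE: same seed, same "random" ordering
```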

- Assigned Friday 4/6, due Tuesday 4/10.
- Two components:
  - R Markdown component: Submit both your `PS10_FirstName_LastName.Rmd` and `PS10_FirstName_LastName.html` files on DropBox by 9am.
  - Written component: To be handed in on paper at the start of lecture; no email submissions will be accepted.
- Learning goals:
  - Inferring about an unknown population proportion \(p\) using a confidence interval centered around the sample proportion: \(\widehat{p} \pm \text{MoE}\)
  - Generalizing the in-class “simulation of the sampling distribution”
    - From what we did in class: Estimating the population proportion red \(p\) in the sampling bowl using the sample proportion \(\widehat{p}\) based on a sample from the shovel with \(n=50\) slots (both tactile and virtual samples).
    - To a new scenario: Estimating the population mean age \(\mu\) in a sack of pennies using the sample mean \(\overline{x}\) based on a sampled handful of \(n=50\) pennies (only virtual samples).

**Hints**:

- Knit the problem set once before starting and read over the HTML file first.
- I *highly* recommend `View()`ing all data frames in your console as you go, but remember that you cannot include `View()` code in your `.Rmd` file.

- Download the following R Markdown template file, upload it to RStudio Server, and rename it to reflect your name: `PS10_FirstName_LastName.Rmd`
- Complete the problems in the `.Rmd` file and make sure your file knits.
- Export **both** of these files and submit them using this DropBox link:
  - `PS10_FirstName_LastName.Rmd`
  - `PS10_FirstName_LastName.html`

- HTML file of solutions: `PS10_solutions.html`
- `.Rmd` R Markdown source file of solutions: `PS10_solutions.Rmd`

Note SDM is the textbook Stats: Data and Models 4th edition.

- SDM Exercise 17.11 on market research (p.464).
- SDM Exercise 17.20 on M&M parts b) and c) only
- SDM Exercise 18.2 on “How’s life?” (p.488)
- SDM Exercise 18.6 on “How’s life?”. Hint:

```
library(mosaic)
xqnorm(0.975, mean = 0, sd = 1)
```
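
That quantile is the \(z^*\) multiplier in the margin of error. As a sketch with made-up numbers (the \(\widehat{p}\) and \(n\) below are illustration values, not from the exercise), a 95% confidence interval for a proportion is:

```r
# Hypothetical sample: phat and n are made-up illustration values
phat <- 0.53
n <- 1000
z_star <- qnorm(0.975)   # the same 1.96 value the xqnorm() hint above reports
moe <- z_star * sqrt(phat * (1 - phat) / n)
c(phat - moe, phat + moe)   # the 95% CI: phat +/- MoE
```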

**Admin**:

- Extra office hours this Wednesday 4/4 3:30pm-5:15pm in SMudd 208.
- If you require special accommodations, please fill out this Google Form by Tuesday at 5pm.
- Midterm review tomorrow Tue 4/3 7pm-8pm in Merrill Science 4.
  - Go over the practice midterm. Note:
    - No solutions will be posted, as I do not have a digital copy.
    - The midterm last semester only covered modeling; there were no questions on statistical background (the green topics in the spreadsheet above).
  - Go over past problem sets.
  - Any questions you may have; if possible, Slack them to me in advance.

**Topics**:

- Midterm covers
- Lectures 3.1-4.6 (blue and green topics in schedule)
- PS06-PS09.

- Use your class notes as your roadmap:
- They summarize what’s been covered in ModernDive + SDM and emphasize tricky ideas.
- They point to extra material on the course webpage.

- While you might need to understand given code, you will not be asked to directly write fully-functioning code. I might however ask you to write pseudocode as seen in Lec 2.9 on available seat miles.

Recall:

- **Accuracy** relates to “how on target are we?” In other words, is the sampling distribution for the 1000 values of \(\widehat{p}\) centered at the true population proportion \(p\)?
- **Precision** relates to “how consistent are we?” or “how much do our estimates vary?” In other words, what is the spread of the sampling distribution for the 1000 values of \(\widehat{p}\), as quantified by the standard deviation, which in this special case is called the standard error?
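
A sketch simulating this, assuming a bowl whose true proportion red is \(p = 0.375\) (an illustrative value) and a shovel of size \(n = 50\):

```r
set.seed(76)
p <- 0.375   # assumed true proportion red in the bowl
n <- 50      # shovel size
# 1000 simulated values of p-hat
phats <- replicate(1000, mean(runif(n) < p))
mean(phats)            # accuracy: centered near the true p
sd(phats)              # precision: the standard error of p-hat
sqrt(p * (1 - p) / n)  # theoretical standard error, about 0.068
```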

- Assigned Wed 3/28, due Tuesday 4/3
- Only one component:
- Written component: To be handed in on paper at the start of lecture; no email submissions will be accepted.

- Learning goals:
- Think about bias, random sampling, observational studies and random assignment for designed experiments.
- Study the normal distribution

Note SDM is the textbook Stats: Data and Models 4th edition.

- SDM Exercise 11.8 on Satisfactory satisfaction samples (p.314)
- SDM Exercise 11.18 on social life. Questions are listed at the end of Exercise 11.16; do parts a), b), d), e), f), g)
- SDM Exercise 12.2 on E-commerce (p.336)
- SDM Exercise 12.20 on honesty
- SDM Exercise 12.22 on vitamin B12. Questions are listed just before Exercise 12.21.
- SDM Exercise 12.24 on insomnia. Questions are listed just before Exercise 12.21.
- SDM Exercise 12.30 on greyhounds. Questions are listed just before Exercise 12.21.
- SDM Exercise 5.2 on Mensa (p.134)
- SDM Exercise 5.8 on IQ
- Part a) is asking you to draw the normal curve and fill in the areas like on p.122
- Hint: You can use the `xpnorm()` function from the `mosaic` package we saw in Lec 4.4 to verify your answers.

- SDM Exercise 5.18 on Checkup
- SDM Exercise 5.42 on Customer database
- SDM Exercise 5.46 on IQ. Hint: In the `mosaic` package:
  - `xpnorm()` takes as input an `x` value and returns the proportion of the area of the normal curve on either side of `x`.
  - `xqnorm()` does the reverse: it takes as input a desired proportion of the area of the normal curve and returns the `x` value.
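
If you don’t have `mosaic` loaded, base R’s `pnorm()` and `qnorm()` are the same pair of operations without the plot. For example, on the IQ scale (mean 100, sd 15):

```r
# pnorm(): x value -> proportion of the normal curve's area to its left
pnorm(100, mean = 100, sd = 15)   # 0.5: half the curve lies below the mean
# qnorm(): proportion of area -> x value (the reverse operation)
qnorm(0.5, mean = 100, sd = 15)   # 100
```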

Interactive Shiny app on Normal Distributions, in particular sliders to vary the orange normal curve’s

- mean \(\mu\) i.e. center
- standard deviation \(\sigma\) i.e. spread

- “Number of countries in Africa” results to be posted here
- Google Form to enter sampling exercise data from today.

How Facebook makes $100 million a day (based on 2017 2nd quarter results):

- Split testing, in particular:
  - The video
  - The “How split testing works” bullet points

- From this additional page, consider this image:

- Assigned Tuesday 3/20, due Tuesday 3/27
- ~~Three~~ Two components:
  - R Markdown component: Submit **both** your `PS08_FirstName_LastName.Rmd` and `PS08_FirstName_LastName.html` files on DropBox by 9am.
  - Netflix “Black Mirror” quiz component. You’ll have a quiz at the start of lecture on Tuesday 3/27.
  - ~~Written component~~ No written component this week.
- Learning goals: Wrap up multiple regression.

**Do this first!** Run the following line of code in your console to update the moderndive package to the latest development version: `devtools::install_github("moderndive/moderndive")`

- Download the following R Markdown template file, upload it to RStudio Server, and rename it to reflect your name: `PS08_FirstName_LastName.Rmd`
- **Do this first before editing the `.Rmd` file!** Knit the file once and read over the questions first.
- Complete the problems in the `.Rmd` file and make sure your file knits. If you are having trouble “knitting” your document, before asking for help, see if one of these 6 fixes works. In my experience, these fixes cover 90% of “knitting” errors.
- Export **both** of these files and submit them using this DropBox link:
  - `PS08_FirstName_LastName.Rmd`
  - `PS08_FirstName_LastName.html`

- HTML file of solutions: `PS08_solutions.html`
- `.Rmd` R Markdown source file of solutions: `PS08_solutions.Rmd`

Watch Netflix -> Black Mirror -> Season 4 -> Episode 4 “Hang the DJ” (about 51 minutes) before lecture on Tuesday 3/27. At the start of lecture, you will have a brief quiz worth one tenth of your engagement grade (thus 1% overall). Note, if you are uncomfortable watching sexually explicit scenes and would rather skip them, they occur at:

- 15:10 - 16:32
- 22:02 - 22:29
- 24:46 - 24:53
- 42:19 - 43:09

I encourage you to watch with your classmates! If you do not have access to Netflix, please complete this Google Form by Thursday 3/22 at 5pm.

**First scenario**: A doctor goes over medical records and finds that individuals who sleep with their shoes on tend to wake up with headaches. Does sleeping with your shoes on cause you to get a headache?

**Other scenarios**:

- You look at income data and find that individuals with more years of education, in particular a college degree, tend to also have higher earnings. Does college cause individuals to have higher earnings?
- You look at the annual income of all Middlebury College faculty and find that those faculty who identify as female tend to earn less than those who identify as male. Does being female cause faculty to earn less than males at Middlebury?
- You look at health records data and find that those individuals who take multivitamins daily tend to be in better health. Do multivitamins cause better health?

- Assigned Wednesday 3/7, due Tuesday 3/20 (after break)
- Three components: Two required and one optional
  - R Markdown component: Submit **both** your `PS07_FirstName_LastName.Rmd` and `PS07_FirstName_LastName.html` files on DropBox by 9am.
  - **Optional**: DataCamp component.
- Learning goals:
- Basic regression with a categorical explanatory variable
- Starting multiple regression questions
- Unpacking hate crime occurrence in the US; extending on Problem Set 02

- Hints:
- If you are having trouble “knitting” your document, before asking for help, see if one of these 6 fixes work. In my experience, these fixes cover 90% of “knitting” errors.

- Download the following R Markdown template file, upload it to RStudio Server, and rename it to reflect your name: `PS07_FirstName_LastName.Rmd`
- Complete the problems in the `.Rmd` file and make sure your file knits.
- Export **both** of these files and submit them using this DropBox link:
  - `PS07_FirstName_LastName.Rmd`
  - `PS07_FirstName_LastName.html`
- You should see the following confirmation screen:

- HTML file of solutions: `PS07_solutions.html`
- `.Rmd` R Markdown source file of solutions: `PS07_solutions.Rmd`

Note SDM is the textbook Stats: Data and Models 4th edition.

- SDM Exercise 28.2 on candy sales (p.838)
- SDM Exercise 28.4 on movie profits
- SDM Exercise 28.12 more interpretations
- SDM Exercise 28.14 parts a) and c) on Scottish hill races
- SDM Exercise 28.20 parts a) and b) on GPA and SAT’s

Solutions `PS07_written_solutions.pdf`

This week’s DataCamp course Multiple and Logistic Regression is **optional**. If you would like:

- Another perspective on the same material we are learning, I recommend in particular that you watch the videos.
- More practice coding, I recommend in particular that you go over the coding exercises.

Login to DataCamp, in the top navbar click Groups -> STAT/MATH 135 … -> My Assignments -> 3 newly listed chapters:

- Parallel slopes
- Evaluating and extending parallel slopes model
- Multiple Regression

In ModernDive 7.2, we’ll be modeling

- \(y\): outcome variable teaching score
- Explanatory variables:
  - \(x_1\): numerical variable age
  - \(x_2\): categorical (in this case binary) variable gender

but using two approaches:

- ModernDive 7.2.2: Parallel slopes model
- ModernDive 7.2.3: Interaction model

**Midterm I Question 4.b)** Both the following plots show the relationship between life expectancy and GDP per capita in 2007 in the `gapminder` data, but the right one has GDP per capita (USD) on the x-axis on a \(\log10\)-scale using `scale_x_log10()`. Thus:

- Left plot: the white grid lines on the x-axis denote **additive** differences of \(+10000\)
- Right plot: the white grid lines on the x-axis denote **multiplicative** differences of \(\times 10\)
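
To see why the log10 gridlines are “multiplicative,” note that equal distances on a log10 axis correspond to equal ratios:

```r
# Each gridline step on a log10 axis is the same distance...
log10(1000) - log10(100)   # 1
log10(100)  - log10(10)    # 1
# ...even though the jumps are x10 apart in the original units
```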

**Midterm I Question 5.f)** Counterexample. Both these datasets have 11 values and the same quartiles as Nashville (32K, 60K, and 100K), but have different proportions of values greater than 80K:

Dataset | | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|
Data 1 | 16 | 17 | 32 | 33 | 54 | 60 | 61 | 62 | 100 | 110 | 115 |
Data 2 | 16 | 17 | 32 | 33 | 54 | 60 | 98 | 99 | 100 | 110 | 115 |
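
You can verify the counterexample directly in R. (Exact quartile values depend on the convention used; the medians and the proportions above 80K below are convention-free.)

```r
data1 <- c(16, 17, 32, 33, 54, 60, 61, 62, 100, 110, 115)
data2 <- c(16, 17, 32, 33, 54, 60, 98, 99, 100, 110, 115)
median(data1)      # 60
median(data2)      # 60: same median...
mean(data1 > 80)   # 3/11: ...but different proportions above 80K
mean(data2 > 80)   # 5/11
```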

- Assigned Wednesday 2/28, due Tuesday 3/6
- Three components: Two required and one optional
  - R Markdown component: Submit **both** your `PS06_FirstName_LastName.Rmd` and `PS06_FirstName_LastName.html` files on DropBox by 9am.
  - **Optional**: DataCamp component.
- Learning goals:
- Baby’s first regression in R.
- Interpreting regression.

- Hints:
  - It’s best to knit your `.Rmd` file early and knit often, so that you can catch errors early.
  - If you are having trouble “knitting” your document, before asking for help, see if one of these 6 fixes works. In my experience, these fixes cover 90% of “knitting” errors.

- Download the following R Markdown template file, upload it to RStudio Server, and rename it to reflect your name: `PS06_FirstName_LastName.Rmd`
- Complete the problems in the `.Rmd` file and make sure your file knits.
- Export **both** of these files and submit them using this DropBox link:
  - `PS06_FirstName_LastName.Rmd`
  - `PS06_FirstName_LastName.html`
- You should see the following confirmation screen:

- HTML file of solutions: `PS06_solutions.html`
- `.Rmd` R Markdown source file of solutions: `PS06_solutions.Rmd`

Note SDM is the textbook Stats: Data and Models 4th edition.

- Read ModernDive 6.3.3 and answer this question: the “best-fitting” regression line is “best” in that it minimizes what?
- SDM Exercise 7.16 on engine size (p.207)
- SDM Exercise 7.18 on engine size
- SDM Exercise 7.20 on engine size
- SDM Exercise 7.42 on roller coasters
- SDM Exercise 7.58 parts b) and c) on wildfires
- SDM Exercise 8.12 on cell phones (p.238)
- SDM Exercise 8.36 on marriage age parts b)-d)

Solutions `PS06_written_solutions.pdf`

This week’s DataCamp course Correlation and Regression is **optional**. If you would like:

- Another perspective on the same material we are learning, I recommend in particular that you watch the videos.
- More practice coding, I recommend in particular that you go over the coding exercises.

Login to DataCamp, in the top navbar click Groups -> STAT/MATH 135 … -> My Assignments -> 4 newly listed chapters:

- Correlation and Regression: Visualizing two variables
- Correlation and Regression: Correlation
- Correlation and Regression: Simple linear regression
- Correlation and Regression: Interpreting regression models

- Assigned Tuesday 2/20, due Tuesday 2/27
- Two components:
  - Google Form component: Fill out this Google Form with what you thought were the 2-3 most salient differences between the male and female StitchFix “style guide” quizzes.
  - R Markdown component: Submit your `PS05_FirstName_LastName.Rmd` file (not the `PS05_FirstName_LastName.html` file) on DropBox by 9am; no submissions will be accepted after 9am. In the outside chance there is a problem with the DropBox link, Slack direct message me your problem set before 9am.

- Learning goals:
- Shorter problem set for you to get used to working in RMarkdown.
- More practice data wrangling.
- Start thinking of modeling an outcome variable in terms of “input” explanatory/predictor variables.

- Hints:
  - For any data wrangling exercises, I highly suggest you write out the *pseudocode* first, as we did in Lec 2.9 in the “available seat miles” exercise.
  - It’s best to knit your `.Rmd` file early and knit often, so that you can catch errors early.
  - If you are having trouble “knitting” your document, before asking for help, see if one of these 6 fixes works. In my experience, these fixes cover 90% of “knitting” errors.
  - Do not spin your wheels! I **highly** recommend you allow yourself enough time to come to office hours and/or drop-in center hours on Monday in case you get stuck and are unable to knit.

- Read the following article on StitchFix
- Watch the following video:

- Go to StitchFix -> Select your (binary) gender -> Complete the “style guide” quiz. If you would rather not use your own email, copy and paste a temporary email from https://www.tempmailaddress.com/.
- Logout of StitchFix, create a new account using a different email, and repeat the above steps but for the *opposite gender* from the one you selected.

- Download the following R Markdown template file, upload it to RStudio Server, and rename it to reflect your name: `PS05_FirstName_LastName.Rmd`
- Complete the problems in the `.Rmd` file and make sure your file knits.
- Export the file `PS05_FirstName_LastName.Rmd` to your computer (not the `PS05_FirstName_LastName.html` file) and submit it using this DropBox link.

- HTML file of solutions: `PS05_solutions.html`
- `.Rmd` R Markdown source file of solutions: `PS05_solutions.Rmd`

**Admin**:

- Extra office hours this Wednesday 2/21 in SMudd 208 4:30-6:30.
- If you require special accommodations, please fill out this Google Form by Tuesday at 5pm.
- Midterm review tomorrow 7pm in Merrill Science 4
  - Go over the practice midterm. Skip Q3.
  - Any questions you may have; if possible, Slack them to me in advance.

**Topics**:

- Midterm covers Lectures 1.1-2.9 (Orange and pink topics in schedule) & PS01-PS04.
- Use your class notes as your roadmap:
- They summarize what’s been covered in ModernDive and emphasize tricky ideas.
- They point to extra material on the course webpage below.

- While you might need to understand given code, you will not be asked to directly write fully-functioning code. I might however ask you to write pseudocode as seen in Lec 2.9 on available seat miles.

- First, change RStudio default configurations:
  - Go to RStudio menu bar -> Tools -> Global Options… -> R Markdown
  - Uncheck the box next to “Show output inline for all R Markdown Documents”
- Second, create a new R Markdown `.Rmd` file and “Knit” it:
  - Go to RStudio menu bar -> File -> New File -> R Markdown -> Set Title to “testing”. A file `testing.Rmd` should show up.
  - Click the arrow next to “Knit” -> “Knit to HTML” -> Save as “testing”. An HTML (webpage) file should pop up; you may need to disable your browser’s pop-up blocker.

**Note**: This is a bit harder than what will be expected of you in this course, but it is a nice all-encompassing and real example.

- Run the starter code below to view the data frame of interest for today. (Note: `joins` are not a required topic in this class; if you’re still interested in learning how to join two datasets, read the optional ModernDive Section 5.8.)
- In groups of 2-3, sketch out on the board the steps to wrangle a data frame of the available seat miles for each airline in 2013 in descending order. Available seat miles is a measure of airline capacity.

**Hints**:

- Take a close look at the data at your disposal via `View(flights_plus)`.
- Sketch out what your end goal is roughly going to look like and then fill in all the steps in between. This way you won’t confuse *what* you are trying to do (the algorithm) with *how* you are going to do it (writing code).
- Recall the 5 main verbs (5MV) for data wrangling we’ve seen so far:
  - `filter()` rows
  - `summarize()` many values to one using a summary statistic function like `mean()`, `median()`, etc.
  - `group_by()` to add grouping meta-data
  - `mutate()` existing variables to create new ones
  - `arrange()` the rows of a data frame in order of one of the variables
  - Extra: `select()` specific columns (i.e. variables) of your dataset.

**Starter Code**:

```
library(dplyr)
library(nycflights13)
# Don't worry if you don't fully understand what's going on here.
flights_plus <- flights %>%
  left_join(planes, by = "tailnum") %>%
  rename(year = year.x, year_manufactured = year.y) %>%
  left_join(airlines, by = "carrier") %>%
  filter(!is.na(seats))
View(flights_plus)
```

Based on the pseudocode we wrote in class:

```
flights_plus %>%
  filter(year == 2013) %>%
  select(name, seats, distance) %>%
  mutate(flight_ASM = seats * distance) %>%
  group_by(name) %>%
  summarize(total_ASM = sum(flight_ASM)) %>%
  arrange(desc(total_ASM))
```

name | total_ASM |
---|---|
United Air Lines Inc. | 15516377526 |
Delta Air Lines Inc. | 10532885801 |
JetBlue Airways | 9618222135 |
American Airlines Inc. | 3677292231 |
US Airways Inc. | 2533505829 |
Virgin America | 2296680778 |
ExpressJet Airlines Inc. | 1817236275 |
Southwest Airlines Co. | 1718116857 |
Endeavor Air Inc. | 776970310 |
Hawaiian Airlines Inc. | 642478122 |
Alaska Airlines Inc. | 314104736 |
AirTran Airways Corporation | 219628520 |
Frontier Airlines Inc. | 184832280 |
Mesa Airlines Inc. | 20163632 |
Envoy Air | 7162420 |
SkyWest Airlines Inc. | 1299835 |

First, let’s keep only the rows and variables we’re interested in. This lets us spend less mental energy looking at the data, since we’ve eliminated all the rows and columns we don’t need. Note, however, that since we only had 2013 flights to begin with, the `filter(year == 2013)` was not actually necessary.

```
flights_plus %>%
  filter(year == 2013) %>%
  select(name, seats, distance)
```

```
## # A tibble: 284,170 x 3
## name seats distance
## <chr> <int> <dbl>
## 1 United Air Lines Inc. 149 1400.
## 2 United Air Lines Inc. 149 1416.
## 3 American Airlines Inc. 178 1089.
## 4 JetBlue Airways 200 1576.
## 5 Delta Air Lines Inc. 178 762.
## 6 United Air Lines Inc. 191 719.
## 7 JetBlue Airways 200 1065.
## 8 ExpressJet Airlines Inc. 55 229.
## 9 JetBlue Airways 200 944.
## 10 JetBlue Airways 200 1028.
## # ... with 284,160 more rows
```

Let’s calculate each individual flight’s available seat miles and save this in a new variable `flight_ASM`. Notice that there are still 284,170 rows in the resulting data frame, one for each flight:

```
flights_plus %>%
  filter(year == 2013) %>%
  select(name, seats, distance) %>%
  mutate(flight_ASM = seats * distance)
```

```
## # A tibble: 284,170 x 4
## name seats distance flight_ASM
## <chr> <int> <dbl> <dbl>
## 1 United Air Lines Inc. 149 1400. 208600.
## 2 United Air Lines Inc. 149 1416. 210984.
## 3 American Airlines Inc. 178 1089. 193842.
## 4 JetBlue Airways 200 1576. 315200.
## 5 Delta Air Lines Inc. 178 762. 135636.
## 6 United Air Lines Inc. 191 719. 137329.
## 7 JetBlue Airways 200 1065. 213000.
## 8 ExpressJet Airlines Inc. 55 229. 12595.
## 9 JetBlue Airways 200 944. 188800.
## 10 JetBlue Airways 200 1028. 205600.
## # ... with 284,160 more rows
```
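As a quick sanity check of the `mutate()` step, the first row’s available seat miles can be verified by hand:

```r
# sanity check: first row of the output above
seats <- 149
distance <- 1400
seats * distance  # 208600, matching the first flight_ASM value above
```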

Now let’s add grouping structure “meta-data”. Note the `Groups: name [16]` meta-data at the top of the output, indicating that the rows are grouped by `name`, of which there are 16 possible values (one for each carrier). Also note that the data itself has not changed; we still have 284,170 rows:

```
flights_plus %>%
  filter(year == 2013) %>%
  select(name, seats, distance) %>%
  mutate(flight_ASM = seats * distance) %>%
  group_by(name)
```

```
## # A tibble: 284,170 x 4
## # Groups: name [16]
## name seats distance flight_ASM
## <chr> <int> <dbl> <dbl>
## 1 United Air Lines Inc. 149 1400. 208600.
## 2 United Air Lines Inc. 149 1416. 210984.
## 3 American Airlines Inc. 178 1089. 193842.
## 4 JetBlue Airways 200 1576. 315200.
## 5 Delta Air Lines Inc. 178 762. 135636.
## 6 United Air Lines Inc. 191 719. 137329.
## 7 JetBlue Airways 200 1065. 213000.
## 8 ExpressJet Airlines Inc. 55 229. 12595.
## 9 JetBlue Airways 200 944. 188800.
## 10 JetBlue Airways 200 1028. 205600.
## # ... with 284,160 more rows
```

Let’s now `summarize()` each group using the `sum()` function. We now have only 16 rows, one for each carrier:

```
flights_plus %>%
  filter(year == 2013) %>%
  select(name, seats, distance) %>%
  mutate(flight_ASM = seats * distance) %>%
  group_by(name) %>%
  summarize(total_ASM = sum(flight_ASM))
```

```
## # A tibble: 16 x 2
## name total_ASM
## <chr> <dbl>
## 1 AirTran Airways Corporation 219628520.
## 2 Alaska Airlines Inc. 314104736.
## 3 American Airlines Inc. 3677292231.
## 4 Delta Air Lines Inc. 10532885801.
## 5 Endeavor Air Inc. 776970310.
## 6 Envoy Air 7162420.
## 7 ExpressJet Airlines Inc. 1817236275.
## 8 Frontier Airlines Inc. 184832280.
## 9 Hawaiian Airlines Inc. 642478122.
## 10 JetBlue Airways 9618222135.
## 11 Mesa Airlines Inc. 20163632.
## 12 SkyWest Airlines Inc. 1299835.
## 13 Southwest Airlines Co. 1718116857.
## 14 United Air Lines Inc. 15516377526.
## 15 US Airways Inc. 2533505829.
## 16 Virgin America 2296680778.
```

And finally, `arrange()` in `desc`ending order of `total_ASM`:

```
flights_plus %>%
  filter(year == 2013) %>%
  select(name, seats, distance) %>%
  mutate(flight_ASM = seats * distance) %>%
  group_by(name) %>%
  summarize(total_ASM = sum(flight_ASM)) %>%
  arrange(desc(total_ASM))
```

```
## # A tibble: 16 x 2
## name total_ASM
## <chr> <dbl>
## 1 United Air Lines Inc. 15516377526.
## 2 Delta Air Lines Inc. 10532885801.
## 3 JetBlue Airways 9618222135.
## 4 American Airlines Inc. 3677292231.
## 5 US Airways Inc. 2533505829.
## 6 Virgin America 2296680778.
## 7 ExpressJet Airlines Inc. 1817236275.
## 8 Southwest Airlines Co. 1718116857.
## 9 Endeavor Air Inc. 776970310.
## 10 Hawaiian Airlines Inc. 642478122.
## 11 Alaska Airlines Inc. 314104736.
## 12 AirTran Airways Corporation 219628520.
## 13 Frontier Airlines Inc. 184832280.
## 14 Mesa Airlines Inc. 20163632.
## 15 Envoy Air 7162420.
## 16 SkyWest Airlines Inc. 1299835.
```

- Assigned Tuesday 2/13, due Tuesday 2/20
- Three components:
- DataCamp: Due at 9am. Two chapters and a feedback survey.
- Code component: Due on DropBox at 9am; no submissions will be accepted after 9am. In the outside chance there is a problem with the DropBox link, Slack direct message me your problem set before 9am.

- Learning goals:
- Practice, practice, practice data wrangling.

- Hints:
- Coding component: Read Chapter 5 in ModernDive and complete the DataCamp portion first, then do the coding component. In particular, see the new Subsection 5.5.1 on `group_by()` with more than one variable.
- Coding component: Sketch out your “algorithm” on paper first and only then start coding, just as we did in the in-class exercise for Lec 2.9 involving available seat miles.
- DataCamp: We’ve seen that one way to add a title is to add a `labs(title = "TITLE TEXT")` layer to a plot. A second way is to add a `ggtitle("TITLE TEXT")` layer to a plot.
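As a minimal sketch of the two title approaches from the hint above (using the built-in `mtcars` data and an arbitrary title):

```r
library(ggplot2)

p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# Option 1: a labs() layer
p + labs(title = "Fuel efficiency vs. weight")

# Option 2: an equivalent ggtitle() layer
p + ggtitle("Fuel efficiency vs. weight")
```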

- Login to DataCamp, in the top navbar click Groups -> STAT/MATH 135 … -> My Assignments -> Complete the 2 newly listed chapters:
- Introduction to the Tidyverse: Grouping and summarizing (do this one first)
- Introduction to the Tidyverse: Types of visualizations

- After you’ve finished the above, fill out this Feedback Survey on the “Introduction to the Tidyverse” DataCamp courses from PS03 and PS04, comparing them to the “Intro to R” and “Intermediate R” DataCamp courses from PS01 and PS02.

First, read the following links to get an idea of the context of the datasets you’ll work with in the code component.

- FiveThirtyEight: The Most Common Unisex Names In America: Is Yours One Of Them?
- FiveThirtyEight: How to Tell Someone’s Age When All You Know Is Her Name

Then answer the following questions, some of which will require you to complete the coding component below:

- Question 1: Babynames
- What makes a name “unisex” according to 538’s definition?
- In the “How to Tell Someone’s Age When All You Know Is Her Name” article, the Figure “Median Ages for Males with the 25 Most Common Names” involves 25 instances of a) what type of plot, shown without b) what feature?
- In the “How to Tell Someone’s Age When All You Know Is Her Name” article, consider the names in the Figure “Youngest Male Names”. Which name has the largest spread (i.e. variation) in age? Compute by hand a previously seen *summary statistic* of this spread based on the values in this figure. In other words, you don’t need to use the `babynames` dataset; rather, just read off approximate values from the graphics.
- Repeat the above for the names in the Figure “Youngest Female Names”.

- Question 1: Babynames
- If at least one third of babies with that name were female and at least one third of babies with that name were male.
- These were 25 instances of a) boxplots without the b) whiskers.
- “Oliver” has the largest gap between the 25th and 75th percentiles, at about 2.5 years and 35 years respectively. In other words, 25% of Olivers are younger than about 2.5 years while 25% of Olivers are older than about 35 years old. A measure of spread we’ve seen based on these two values is the *interquartile range*, the difference between the 3rd and 1st quartiles: 35 - 2.5 = 32.5 years. **What does this all mean?** You’ll encounter a lot of really young Olivers and a lot of older Olivers. On the other hand, Jaydens will tend to be of similar age.
- “Ella” has the largest gap between the 25th and 75th percentiles, at about 5 years and 55 years respectively. In other words, 25% of Ellas are younger than 5 years while 25% of Ellas are older than about 55 years old. The *interquartile range* is 55 - 5 = 50 years, a much larger spread than the Olivers! **What does this all mean?** There is even more variation in ages among Ellas than among Olivers!
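The two interquartile range computations above are easy to double-check in R (the quartile values are approximate, eyeballed from the figures):

```r
# approximate quartiles eyeballed from the FiveThirtyEight figures
oliver_iqr <- 35 - 2.5
ella_iqr <- 55 - 5
oliver_iqr  # 32.5 years
ella_iqr    # 50 years
```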

- Follow the same procedure as in the PS02 code component to create, save, and submit a `PS04_YourFirstName_YourLastName.R` scratch pad.
- Submit using this DropBox link.
- Please note that all questions in the coding component can be answered by copying/pasting/tweaking existing code in ModernDive:
- Find something similar to what you want in ModernDive.
- Copy and paste the code that made it into `PS04_YourFirstName_YourLastName.R`.
- Tweak the code to match what you want.

```
# Name: WRITE YOUR NAME HERE
# Collaboration: INDICATE WHO YOU COLLABORATED WITH FOR THIS PROBLEM SET
# Load all necessary packages:
library(ggplot2)
library(dplyr)
library(fivethirtyeight)
library(babynames)
# Question 1 --------------------------------------------------------------
# Let's load the evals data from PS03 and View() it
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- as_data_frame(evals)
View(evals)
# a) In PS03 -> Coding Component -> Q1.c), you created a boxplot comparing
# teaching scores between men and women. Copy/paste/tweak a data wrangling
# example in ModernDive Chapter 5 to compute the two medians in these plots
# exactly.
# b) In PS03 -> Coding Component -> Q1.d), you created a boxplot comparing
# teaching scores between men and women but this time faceted by ethnicity.
# Copy/paste/tweak a data wrangling example in ModernDive Chapter 5 to compute
# these four medians exactly in these plots exactly and arrange them in
# descending order.
# Question 2 --------------------------------------------------------------
# Let's View() the bechdel dataset from PS03 from the fivethirtyeight package
View(bechdel)
# a) In PS03 -> Coding Component -> Q2 you created a visualization that shows
# the distribution of the variable clean_test, which is the result of the
# bechdel test for each movie. Copy/paste/tweak a data wrangling example in
# ModernDive Chapter 5 to output a data frame showing the counts of movies for
# each of the 5 levels i.e. outcomes of the bechdel test.
# Question 3 --------------------------------------------------------------
# Let's View() the babynames dataset from the babynames package
View(babynames)
# Read the help file associated with this dataset to get a sense of the
# variables included:
?babynames
# a) Create a time series plot comparing the number (not percentage for now) of
# male babies born with the name from PS04 -> Written Component -> Q1.c). You'll
# need to copy/paste/tweak code from both ModernDive Chapter 3 (data
# visualization) and 5 (data wrangling) to achieve this.
# b) Create a time series plot comparing the number of female babies born with
# the name from PS04 -> Written Component -> Q1.d). You'll need to
# copy/paste/tweak code from both ModernDive Chapter 3 (data visualization) and
# 5 (data wrangling) to achieve this.
# c) Create a time series plot comparing the number of babies born with one of
# the "unisex" names from the fivethirtyeight articles for each year but making
# sure to distinguish between the sexes. There is more than one way to do this.
# You'll need to copy/paste/tweak code from both ModernDive Chapter 3 (data
# visualization) and 5 (data wrangling) to achieve this.
```

Load all necessary packages:

```
library(ggplot2)
library(dplyr)
library(fivethirtyeight)
library(babynames)
```

**a) Compute median teaching scores exactly**

```
# Load evals data
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- as_data_frame(evals)
evals %>%
  group_by(gender) %>%
  summarize(median = median(score))
```

gender | median |
---|---|
female | 4.1 |
male | 4.3 |

```
evals %>%
  group_by(gender, ethnicity) %>%
  summarize(median = median(score)) %>%
  arrange(desc(median))
```

gender | ethnicity | median |
---|---|---|
female | not minority | 4.30 |
male | not minority | 4.30 |
male | minority | 4.25 |
female | minority | 4.00 |

These are the values of the solid horizontal lines (the medians) of the boxplots from PS03 -> coding component -> Q1.c) and d):

**a) Showing the counts of movies for each of the 5 levels i.e. outcomes of the bechdel test.**

All we are doing here is explicitly computing the height of the bars of the following barplot from PS03 -> coding component -> Q2:

```
bechdel %>%
  group_by(clean_test) %>%
  summarize(count = n())
```

clean_test | count |
---|---|
nowomen | 141 |
notalk | 514 |
men | 194 |
dubious | 142 |
ok | 803 |

**a) Time series plot of males named “Oliver”**

```
olivers <- babynames %>%
  filter(sex == "M", name == "Oliver")

ggplot(data = olivers, mapping = aes(x = year, y = n)) +
  geom_line() +
  labs(x = "Year", y = "Number of births", title = "Number of male Olivers born in the US")
```

Some notes:

- After a bump in the 1920’s, Oliver is back in style and trending! Keep in mind we are only showing counts, so the US population size is not incorporated in this plot.
- We’re only interested in the males, so we need to `filter()`. If we left the females in, the plot would have a weird zig-zaggy shape because of all the small counts of females named Oliver. Could these be data entry errors?
- In the filter above we could’ve also done `filter(sex == "M" & name == "Oliver")`, or `filter(sex == "M") %>% filter(name == "Oliver")`.
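A small sketch showing that the three `filter()` variants mentioned above are interchangeable (the toy data frame is made up for illustration):

```r
library(dplyr)

# toy stand-in data, made up for illustration
toy <- tibble(
  sex  = c("M", "F", "M"),
  name = c("Oliver", "Oliver", "Ella")
)

f1 <- toy %>% filter(sex == "M", name == "Oliver")
f2 <- toy %>% filter(sex == "M" & name == "Oliver")
f3 <- toy %>% filter(sex == "M") %>% filter(name == "Oliver")
# all three keep only the first row
```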

**b) Time series plot of females named “Ella”**

```
ellas <- babynames %>%
  filter(sex == "F", name == "Ella")

ggplot(data = ellas, mapping = aes(x = year, y = n)) +
  geom_line() +
  labs(x = "Year", y = "Number of births", title = "Number of female Ellas born in the US")
```

- After a bump in the 1920’s, Ella was trending up until about 2010, but is now on the decline! Keep in mind we are only showing counts, so the US population size is not incorporated in this plot.
- We’re only interested in the females, so we need to `filter()`. If we left the males in, the plot would have a weird zig-zaggy shape because of all the small counts of males named Ella. Could these be data entry errors?

**c) Time series plot of a unisex name, distinguishing between the sexes.** Let’s focus on Rileys. Here are two ways: one with the color aesthetic, the other with facets. Which do you prefer?

```
rileys <- babynames %>%
  filter(name == "Riley")

ggplot(data = rileys, mapping = aes(x = year, y = n, color = sex)) +
  geom_line() +
  labs(x = "Year", y = "Number of births", title = "Number of Rileys born in the US (using color)")
```

```
ggplot(data = rileys, mapping = aes(x = year, y = n)) +
  geom_line() +
  labs(x = "Year", y = "Number of births", title = "Number of Rileys born in the US (using facets)") +
  facet_wrap(~sex)
```

Note this approach is preferred to creating two entirely separate plots with two separate `ggplot()` calls, as the above two plots ensure that both the x and y axes are on the same scale. See below: the y-axis range is different, making it harder to compare the two time series plots.
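For contrast, here is a minimal self-contained sketch (using made-up stand-in data) showing how freeing the y-axis scales in `facet_wrap()` recreates the comparison problem described above:

```r
library(ggplot2)

# made-up stand-in for the rileys data frame above
toy_rileys <- data.frame(
  year = rep(2000:2002, times = 2),
  n = c(1000, 1100, 1200, 300, 350, 400),
  sex = rep(c("F", "M"), each = 3)
)

# scales = "free_y" gives each panel its own y-axis range, which makes
# cross-panel comparisons harder -- the default shared scales avoid this
ggplot(data = toy_rileys, mapping = aes(x = year, y = n)) +
  geom_line() +
  facet_wrap(~sex, scales = "free_y")
```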

Baby’s first R Markdown report exercise. More on Tuesday 2/20.

- Download `Term_Project_Proposal.Rmd` from Course Webpage -> Term Project -> Section 2.3
- Upload to RStudio Server
- Click the “Knit” button. A webpage should pop up. If not, you may need to disable your browser’s pop-up blocker.
- Click on the blue “Publish” button
- Create a new account on RPubs.com
- Set the page’s:
- Title: testing
- Description: testing
- Slug (i.e. webpage URL address): testing
- A webpage that you can share should show up in your browser

How to import an Excel file into R:

- Download this Excel file
- If you are using RStudio Server (Cloud version): upload
- In the Files panel -> click on the file -> Import dataset…
- Click on Import (bottom right)

- Assigned Wednesday 2/7, due Tuesday 2/13
- Three components:
- DataCamp: Due at 9am. Your completion of the courses will be logged automatically by DataCamp.
- Code component: Due on DropBox at 9am; no submissions will be accepted after 9am. In the outside chance there is a problem with the DropBox link, Slack direct message me your problem set before 9am.

- Learning goals:
- DataCamp: Gentle introduction to topics from Chapter 5 on “data wrangling”, a less sinister sounding term for “data manipulation”.
- More practice with data visualization. In particular histograms, boxplots, and barplots.

Login to DataCamp, in the top navbar click Groups -> STAT/MATH 135 … -> My Assignments -> Complete the 2 newly listed chapters in this order:

- Introduction to the Tidyverse: Data wrangling (do this one first)
- Introduction to the Tidyverse: Data visualization

First, read the following links to get an idea of the context of the datasets you’ll work with in the code component. After all: *numbers are numbers, but data has context.*

- Teacher evaluations at the University of Texas Austin
- FiveThirtyEight: The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women
- FiveThirtyEight: Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?

Then answer the following questions:

- Question 1: Teacher evaluations
- What does the variable `bty_avg` in the `evals` data set represent and how is it computed?
- Assuming men and women are of equal teaching ability, based on your visualizations below give an approximate numerical value of a “gender penalty” that women instructors pay on their teaching evaluation scores. This value doesn’t have to be precise; just eyeball it.
- Say we have four demographic groups of professors: female minorities, male minorities, female non-minorities, and male non-minorities. Rank them from lowest to highest in terms of their median teaching scores.

- Question 2: Bechdel test
- What is the purpose of the Bechdel test?
- What are the five possible outcomes of the Bechdel test?
- Based on your visualization, about what proportion of the \(n\) = 1794 movies included in the dataset “passed” the Bechdel test?

- Question 3: Drinks
- What is the source of the drinks data?
- What proportion of US alcohol servings consumption is beer?
- Among these three countries, what proportion of wine servings are consumed by the French?
- **Bonus**: Based on your visualization below, you should see a relationship between the categorical variables country and alcohol type: for each alcohol type, different countries consume them at different levels. Put differently, for each alcohol type, the relative amount consumed depends on which country we are considering. Now say instead, hypothetically, that consumption was *independent* of country. Sketch out by hand what the barplot below would look like.

- Question 1: Teacher evaluations
- It is the average beauty rating of an instructor based on a panel of 6 students’ evaluations, saved in the variables `bty_f1lower`, `bty_f1upper`, `bty_f2upper`, `bty_m1lower`, `bty_m1upper`, and `bty_m2upper` in the `evals` data frame.
- Using the figure from Q1.c) of the coding component below, one example of a numerical “gender penalty” would be the difference in medians of about 4.3 - 4.1 = 0.2 points out of 5.
- Using the figure from Q1.d) of the coding component below and the “eyeball test”, from lowest to highest: female minorities, then male minorities, and then the male and female non-minorities, which are about the same.

- Question 2: Bechdel test
- A rough metric to evaluate female representation in a movie.
- The outcomes form a hierarchy from lowest to highest:
- There are at least two named women.
- They talk to each other.
- They talk to each other about something other than a male character.
- Some end up being difficult to classify; these are lumped in as “dubious”.

- If you leave out the “dubious” counts, then 800/1794 = 0.4459 = 44.59%. For such a non-strict test of female representation, a little less than half!
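This proportion can be recomputed from the exact counts in the PS03 coding component output (note the text rounds 803 down to 800):

```r
# counts of each Bechdel test outcome, from the PS03 solution output
counts <- c(nowomen = 141, notalk = 514, men = 194, dubious = 142, ok = 803)
sum(counts)                   # 1794 movies in total
counts[["ok"]] / sum(counts)  # about 0.45
```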

- Question 3: Drinks
- World Health Organization
- 249/(249+158+84) = 0.507 = 50.7%, the highest of these three countries.
- 370/(370+195+84) = 0.570 = 57.0%, very much as one would expect.
- **Bonus**:

Let’s compute the total consumed for each alcohol type and then the average across the 3 countries:

country | type | servings | total | avg |
---|---|---|---|---|
France | beer | 127 | 595 | 198.33 |
United Kingdom | beer | 219 | 595 | 198.33 |
USA | beer | 249 | 595 | 198.33 |
France | spirit | 151 | 435 | 145.00 |
United Kingdom | spirit | 126 | 435 | 145.00 |
USA | spirit | 158 | 435 | 145.00 |
France | wine | 370 | 649 | 216.33 |
United Kingdom | wine | 195 | 649 | 216.33 |
USA | wine | 84 | 649 | 216.33 |
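A base R sketch of the bonus computation: under independence, each country’s bar for a given alcohol type would have the same height, namely the per-type average across the three countries shown in the `avg` column above.

```r
# reconstruct the servings from the table above
servings <- data.frame(
  country = rep(c("France", "United Kingdom", "USA"), times = 3),
  type = rep(c("beer", "spirit", "wine"), each = 3),
  servings = c(127, 219, 249, 151, 126, 158, 370, 195, 84)
)

# per-type totals and averages, matching the total and avg columns above
totals <- tapply(servings$servings, servings$type, sum)
avgs <- totals / 3
totals          # beer 595, spirit 435, wine 649
round(avgs, 2)  # beer 198.33, spirit 145.00, wine 216.33
```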

- Follow the same procedure as in the PS02 code component to create, save, and submit a `PS03_YourFirstName_YourLastName.R` scratch pad.
- Submit using this DropBox link.
- Please note that all questions in the coding component can be answered by copying/pasting/tweaking existing code in ModernDive:
- Find a similar looking plot to the one you want in ModernDive.
- Copy and paste the code that made this plot into `PS03_YourFirstName_YourLastName.R`.
- Tweak the code to match what you want.

```
# Name: WRITE YOUR NAME HERE
# Collaboration: INDICATE WHO YOU COLLABORATED WITH FOR THIS PROBLEM SET
# Load all necessary packages:
library(ggplot2)
library(dplyr)
library(fivethirtyeight)
# Question 1 --------------------------------------------------------------
# Let's load the evals data from the web and View() it
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- as_data_frame(evals)
View(evals)
# a) Visualize the distribution of the numerical variable score, which is the
# teaching score. Have the bin widths be 0.25 teaching score units.
# Do this below:
# b) Now show the same histogram as above, but for men and women separately.
# Do this below:
# c) Now compare teaching scores between men and women using a boxplot.
# Do this below:
# d) Now facet the above boxplot by teacher ethnicity.
# Do this below:
# Question 2 --------------------------------------------------------------
# Let's View() the bechdel dataset from the fivethirtyeight package
View(bechdel)
# Read the help file associated with this dataset to get a sense of the
# variables included:
?bechdel
# Create a visualization that shows the distribution of the variable clean_test,
# which is the result of the bechdel test for each movie.
# Do this below:
# Question 3 --------------------------------------------------------------
# Let's View() the drinks dataset from the fivethirtyeight package
View(drinks)
# Read the help file associated with this dataset to get a sense of the
# variables included:
?drinks
# We're going to consider a slightly modified and simplified version of this data
# in the drinks_subset data frame. Let's load the drinks_subset data from the
# web and View() it
load(url("https://rudeboybert.github.io/STAT135/static/PS/drinks_subset.RData"))
View(drinks_subset)
# Create "dodged" barplot that visualizes the distribution of the categorical
# variables type and country at the same time, where the bar colors
# represent the country.
# Do this below:
```

**Hint**: your visualization for Question 3 should look something like this, but with the heights of the bars reflecting the true serving counts in `drinks_subset`. Here all the heights are set to fake values of 100.

Load all necessary packages:

```
library(ggplot2)
library(dplyr)
library(fivethirtyeight)
```

**a) Visualize the distribution of the numerical variable score, which is the teaching score. Have the bin widths be 0.25 teaching score units.**

```
# Load evals data
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- as_data_frame(evals)
ggplot(data = evals, mapping = aes(x = score)) +
  geom_histogram(binwidth = 0.25) +
  labs(x = "Teaching score", y = "Count", title = "Teaching scores at UT Austin")
```

We say this distribution is *left-skewed*: left because there is a tail to the left, and skewed because it’s not symmetric. Using the eyeball test, it seems a typical evaluation is around 4.2 out of 5. Some got 5/5, whereas a few even got around 2/5! Yikes!

**b) Now show the same histogram as above, but for men and women separately.** Easy, just add a `facet_wrap(~gender)` layer and update the title.

```
ggplot(data = evals, mapping = aes(x = score)) +
  geom_histogram(binwidth = 0.25) +
  labs(x = "Teaching score", y = "Count", title = "Teaching scores at UT Austin by gender") +
  facet_wrap(~gender)
```

We observe that there are more male instructors, but are their teaching scores higher? It’s hard to say just looking at this plot.

**c) Now compare teaching scores between men and women using a boxplot.**

```
ggplot(data = evals, mapping = aes(x = gender, y = score)) +
  geom_boxplot() +
  labs(x = "Gender", y = "Teaching score", title = "Teaching scores at UT Austin by gender")
```

Ah! The comparison is much easier! Men had a higher median teaching evaluation score of around 4.3, whereas women had a median teaching evaluation score of around 4.1. The lengths of the boxes seem about the same, indicating similar spread in both groups.

**d) Now facet the above boxplot by teacher ethnicity.** Easy, just add a `facet_wrap(~ethnicity)` layer and update the title:

```
ggplot(data = evals, mapping = aes(x = gender, y = score)) +
  geom_boxplot() +
  labs(x = "Gender", y = "Teaching score", title = "Teaching scores at UT Austin by gender and ethnicity") +
  facet_wrap(~ethnicity)
```

**Create a visualization that shows the distribution of the variable clean_test, which is the result of the bechdel test for each movie.** Since `clean_test` is categorical, a barplot best shows its distribution. Also, `bechdel` is NOT pre-counted; in other words it’s like Table 3.3 in ModernDive and not Table 3.4, and thus we need to use `geom_bar()`:

```
ggplot(data = bechdel, mapping = aes(x = clean_test)) +
  geom_bar() +
  labs(x = "Bechdel test result", y = "Count", title = "Bechdel test results")
```

**Create a “dodged” barplot that visualizes the distribution of the categorical variables type and country at the same time, where the bar colors represent the country.** Figure 3.24 is most similar to what we need. Let’s set:

- The categorical variable `type` on the x-axis
- The categorical variable `country` as the fill of the bars
- The numerical variable `servings` on the y-axis
- We use `geom_col(position = "dodge")` since the servings are pre-counted and we want a “dodged”, i.e. side-by-side, barplot.

```
# Load data
load(url("https://rudeboybert.github.io/STAT135/static/PS/drinks_subset.RData"))
ggplot(data = drinks_subset, mapping = aes(x = type, y = servings, fill = country)) +
  geom_col(position = "dodge") +
  labs(x = "Alcohol type", y = "Number of servings", title = "Alcohol consumption in USA, France, and UK")
```

Say the following piecharts represent results of an election poll at time points A, B, then C. At each time point we present the proportion of the poll respondents who say they will support one of 5 candidates: 1 through 5. Based on these 3 piecharts, answer the following questions:

- At time point A, is candidate 5 doing better than candidate 4?
- Did candidate 3 do better at time point B or time point C?
- Who gained more support between time point A and time point B, candidate 2 or candidate 4?

Now answer these questions using barplots. Much easier as you can compare the 5 *levels* of the categorical variable “candidate” (1 through 5) using imaginary horizontal lines:

- At time point A, candidate 5 is doing better than candidate 4.
- Candidate 3 (green) did better at time point C.
- Between time point A and time point B, candidate 2 (blue) increased whereas candidate 4 (green) decreased.

Say we want to plot a boxplot of the following 12 values which are pre-sorted:

1, 3, 5, 6, 7, 8, 9, 12, 13, 14, 15, 30

They have the following *summary statistics*:

Min. | 1st Qu. | Median | 3rd Qu. | Max. |
---|---|---|---|---|
1 | 5.5 | 8.5 | 13.5 | 30 |
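These summary statistics can be reproduced in R with `fivenum()`; note that `summary()` uses a slightly different quartile convention and reports 5.75 and 13.25 for the quartiles instead:

```r
x <- c(1, 3, 5, 6, 7, 8, 9, 12, 13, 14, 15, 30)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
fivenum(x)  # 1.0 5.5 8.5 13.5 30.0
```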

Let’s compare the points and the corresponding boxplot side-by-side with the values on the \(y\)-axis matching:

- Assigned Wednesday 1/31, due Tuesday 2/6
- Three components:
- DataCamp: Due at 9am. Your completion of these 3 courses will be logged automatically by DataCamp.
- Written component: To be handed in on paper at the start of lecture.
- Code component: Due at 9am. To be submitted electronically via this DropBox link; this link will close promptly just after 9am.

- Learning goals:
- Baby’s first data analyses!
- Get used to problem set workflow for the semester.

Login to DataCamp, in the top navbar click Groups -> STAT/MATH 135 … -> My Assignments -> Complete the 3 newly listed courses in this order:

- Intermediate R: Conditionals and Control Flow
- Intermediate R: Loops. We won’t be seeing loops that much in this class, so it’s ok if you feel your understanding of this chapter is less complete.
- Intermediate R: Functions

First, read the following two articles on FiveThirtyEight.com to get an idea of the context of the datasets you’ll work with in the code component. After all: *numbers are numbers, but data has context.*

- Some People Are Too Superstitious To Have A Baby On Friday The 13th
- Higher Rates Of Hate Crimes Are Tied To Income Inequality

Then answer the following questions:

- Question 1: Hate crimes
- What value of the Gini index indicates perfect income equality? What value of the Gini index indicates perfect income inequality?
- Why are the two hate crime variables based on counts per 100,000 individuals and not based on just the count?
- Name some differences on how and when the FBI reports-based and the Southern Poverty Law Center-based rates of hate crime were collected, and how this could impact any analyses.
- Based on the plot you’ll generate in the coding component below, answer this question: Describe in one sentence the *relationship* between the two variables income inequality and the incidence of hate crimes in the days following the 2016 election as reported by the Southern Poverty Law Center.
- When creating this plot, you’ll get a warning message: `Removed 4 rows containing missing values (geom_point).` Why is R returning this message?

- Question 2: US Births
- Based on the plot you’ll generate in the coding component below, answer this question: why is there a seasonal pattern to the resulting time series plot of the number of births in the US in 1999?
- Based on the plot you’ll generate in the coding component below, answer this question: There is a clear spike in births (over 14,000) for one day in 1999. Identify this date using `View()`. What possible explanations are there for this anomalous spike?

- Question 1: Hate crimes
- A quick Google search will yield that a Gini index is a scale between 0 and 1 in some cases, and between 0 and 100 in others. They are the same, kind of like percentages. 0 indicates perfect income equality, whereas 1 (or 100) indicates perfect inequality. In the case of our data, they use the 0 to 1 scale.
- Comparisons of counts between large population states (California, New York, Texas) and small states (Vermont, Montana, Delaware) would not be fair otherwise.
- The FBI data is voluntarily submitted, thus could suffer from volunteer bias, and only focuses on incidents that are *prosecutable*. The SPLC data focuses on any incident, which is a broader and at the same time more subjective definition IMO. Furthermore, the SPLC counts are based on media accounts, which could suffer from different forms of bias.
- Whereas the article stated there is a positive relationship, just by eyeballing the scatterplot below it isn’t immediately apparent. Also, what is the effect of that single outlier point on the top right? We’ll study methods for explicitly quantifying relationships between variables in Chapters 6 and 7 of ModernDive: Data Modeling with Regression.
- Because four states did not have values for the variable `hate_crimes_per_100k_splc`: Hawaii, North Dakota, South Dakota, and Wyoming. So this plot only has 51 - 4 = 47 points. Could there be a systematic reason these four states in particular had missing values? Or was it a coincidence?
- Run `View(hate_crimes)` in the console, click on `hate_crimes_per_100k_splc` twice to sort this variable in descending order, and you’ll see this outlier was the District of Columbia. I spoke to the author of this article, Maia Majumder, who said DC is a very special case: it is very small geographically and it has among the most accurate data collection of such incidents of all jurisdictions in the US. So is DC an outlier because hate crimes and inequality are really that high? Or is it because they suffer the least from *underreporting* of hate crimes? Hard to say…
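The counts-per-100,000 point above can be made concrete with a quick base R sketch (the populations and incident counts below are hypothetical, chosen only for illustration):

```r
# hypothetical: a populous state with more total incidents can still have a
# lower *rate* per 100,000 residents than a small state
big_state_rate <- 1000 / 39.5e6 * 1e5    # about 2.5 incidents per 100,000
small_state_rate <- 60 / 0.6e6 * 1e5     # 10 incidents per 100,000
small_state_rate > big_state_rate        # the small state has the higher rate
```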

- Question 2: US Births
- A seasonal pattern means there is a cycle that follows some fixed time interval, not necessarily a “season” like winter -> spring -> summer -> fall. For example, electricity consumption follows a *daily* pattern, since consumption tends to peak in the evening then drop at night. Cough medicine sales follow a *yearly* pattern since they peak in winter and then drop for the rest of the year. In this case, the fixed time interval is *weekly*: births tend to be high on weekdays, then drop on the weekends. As suggested by the article, this is most likely due to there being fewer induced births on the weekend.
- Run `View(US_births_1999)` in the console, click on `births` twice to sort this variable in descending order, and you’ll find the day in question is September 9, 1999, which in two-digit year format is written 9/9/99. While I have no proof of this theory, it could be because parents deliberately wanted their kids’ birthdate to only involve the number 9!
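To see the weekly cycle numerically rather than visually, one could average births by day of the week. A sketch using `dplyr` (covered in Chapter 5 of ModernDive); `day_of_week` is a column in the `US_births_1994_2003` data frame:

```
library(dplyr)
library(fivethirtyeight)

US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)

# Average number of births for each day of the week in 1999:
US_births_1999 %>%
  group_by(day_of_week) %>%
  summarize(mean_births = mean(births))
```

If the weekly pattern holds, the Monday-through-Friday means should be noticeably higher than the Saturday and Sunday means.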


- Create a new “scratchpad” `.R` file as seen in class and save it as `PS02_YourFirstName_YourLastName.R`. So for example in my case the file name would be `PS02_Albert_Kim.R`.
- Copy and paste the code below into your `PS02_YourFirstName_YourLastName.R` file.
- Once you’ve completed your work, go over all the code you’ve copied/pasted/tweaked and make sure it runs.
- Download this file: Go to the Files Panel of RStudio -> Click the checkbox next to `PS02_YourFirstName_YourLastName.R` -> Click “More” -> “Export”.
- Submit this file electronically via this DropBox link.

```
# Name: WRITE YOUR NAME HERE
# Collaboration: INDICATE WHO YOU COLLABORATED WITH FOR THIS PROBLEM SET
# Load all necessary packages:
library(ggplot2)
library(dplyr)
library(fivethirtyeight)
# Question 1 --------------------------------------------------------------
# Let's consider the hate_crimes data frame that's included in the
# fivethirtyeight package:
View(hate_crimes)
# Read the help file associated with this dataset to get a sense of the
# variables included:
?hate_crimes
# Copy/paste/tweak code that will create a scatterplot with
# -y-axis: # of hate crimes per 100K, Southern Poverty Law Center, Nov 9-18 2016
# -x-axis: Gini index (measure of inequality).
# Do this below:
# Question 2 --------------------------------------------------------------
# Let's look at the US_births_1994_2003 data frame that's included in the
# fivethirtyeight package:
View(US_births_1994_2003)
# Read the help file associated with this dataset to get a sense of the
# variables included:
?US_births_1994_2003
# Let's not consider all years between 1994 and 2003, but rather only 1999.
# Let's filter for only rows where year equals 1999.
# Don't worry if you don't understand the two lines of code below; we'll see
# this in Chapter 5 of ModernDive: Data Wrangling
US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)
View(US_births_1999)
# Copy/paste/tweak code below that will create a "time series" plot of the
# number of births for each day in 1999:
```

Load all necessary packages:

```
library(ggplot2)
library(dplyr)
library(fivethirtyeight)
```

Note I added a `labs` layer to the plot to have informative axis labels and a title. While not required for now, as the semester progresses these will be stressed, as they give the points *context*. Remember: numbers are numbers, but data has context.

```
ggplot(data = hate_crimes, mapping = aes(x = gini_index, y = hate_crimes_per_100k_splc)) +
  geom_point() +
  labs(x = "Gini index", y = "Hate crimes per 100K (Nov 9-18 2016)",
       title = "Hate crime incidence vs income inequality")
```

In a more intermediate class, you’ll learn how to replace the points with, in this case, the state abbreviations using `geom_text()`. What states have the highest incidence of hate crimes in the days after the election? DC, Oregon, Washington, Maine, Minnesota, and Massachusetts. What states had the lowest? Mississippi and Arkansas. Interesting…
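As a preview (not required for this course), here is a sketch of what that might look like, using the `state` variable from `hate_crimes` as the label aesthetic:

```
library(ggplot2)
library(fivethirtyeight)

# Same scatterplot, but with state names drawn instead of points:
ggplot(data = hate_crimes,
       mapping = aes(x = gini_index, y = hate_crimes_per_100k_splc, label = state)) +
  geom_text(size = 2) +
  labs(x = "Gini index", y = "Hate crimes per 100K (Nov 9-18 2016)",
       title = "Hate crime incidence vs income inequality")
```

The small `size = 2` is just to reduce overlap between neighboring state names.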

```
US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)

ggplot(data = US_births_1999, mapping = aes(x = date, y = births)) +
  geom_line() +
  labs(x = "Date", y = "Number of births", title = "Daily 1999 US births")
```

Note:

- The peaks and valleys corresponding to weekdays vs weekends. This is best seen by running `View(US_births_1999)`.
- The overall dip at the end of November, most likely due to US Thanksgiving.
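To inspect that Thanksgiving dip more closely, one could filter down to November 1999 before plotting. A sketch, assuming the same packages as above (`month` is a column in the data frame):

```
library(ggplot2)
library(dplyr)
library(fivethirtyeight)

# Zoom in on November 1999 to look for the Thanksgiving dip:
US_births_1994_2003 %>%
  filter(year == 1999, month == 11) %>%
  ggplot(mapping = aes(x = date, y = births)) +
  geom_line() +
  labs(x = "Date", y = "Number of births", title = "November 1999 US births")
```

Note that a data frame can be piped directly into `ggplot()`, which is handy for quick exploratory plots like this one.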

*Assigned: Monday 1/22. Due: Tuesday 1/30 9am.*

- Slack setup (25%)
- Watch this 2m36s video on Slack
- Change your Slack profile picture to a recent photo of you
- Turn on Slack email notifications

- Administrative stuff (25%):
- Complete this intro survey on Google Forms
- Print, sign, date, and staple a copy of the syllabus available here

- DataCamp assignment (50%):
- Click this link to create a free DataCamp account using your Amherst email. If this link won’t let you join the group, then Slack me a note along with your Amherst email.
- In the navbar on top, click Groups -> STAT/MATH 135 … -> My Assignments -> Complete the 5 listed courses in order starting with “Intro to basics” and ending with “Data frames”.
- Your completion of these 5 courses will be logged automatically by DataCamp. Don’t worry if you feel like things don’t stick at first; we’re focused on a short-term introduction and not long-term retention for now.

- If you want to be put on the waitlist for this course, please fill out this Google Form
- Syllabus slides