The lectures are numbered according to the rough topic outline below. A detailed topic schedule can be found here.

  1. Getting started: R, RStudio, R packages
  2. Data science: Data visualization via ggplot2, tidy data formatting, and data wrangling via dplyr
  3. Data modeling: Correlation, simple linear regression, and multiple regression
  4. Data collection: Random sampling for statistical inference and random assignment for causal inference
  5. Statistical inference: Sampling, sampling distributions, standard errors, confidence intervals, and hypothesis testing

The source code for this webpage is available by clicking the icon on the top-right of the nav bar.


5.9 Finishing testing + Midterm III review

Tue 12/12

Announcements:

  • Solutions to PS10, PS11, and PS12 now posted. Solutions to PS13 will be posted on Saturday.
  • Office hours:
    • Tomorrow (Wed) 2-5pm in Frost Cafe, not SMudd 208
    • Before exam: Sun 12/17 1pm-3pm in SMudd 208
    • Before project: Thu 12/21 2pm-5pm in SMudd 208
  • Don’t forget the exit survey posted in the syllabus, due Wed 12/20 at noon; it is worth 10% of your individual final grade for the project.

ANOVA


Midterm III Format

  • Closed book, no coding, bring calculators.
  • No practice exam, but no “curveball” questions. What does this mean? Only “straightforward” questions. What does this mean in practice? If you understand the lecture notes and the problem sets, you will be set.

Sources

  • Lectures 4.1 thru 5.7
  • Problem sets 10 thru 13

Topics:

  1. Where does randomness come into play?
    • Random sampling: statistical inference
    • Random assignment: causal inference
  2. Understand all elements of the sampling from tub-o-balls interactive app and how they matter for
    • Building and interpreting confidence intervals
    • Conducting hypothesis tests. Note: you will need to draw all \(p\)-values, not compute them, since you don’t have access to a computer
  3. Inference for regression: ModernDive Chapter 6.7 Inference for regression
    • Interpret output of a regression table
    • Why did those investigations of “no pattern in the residuals” matter?
  4. Central limit theorem: Bunny rabbit video

5.8 \(t\)-Test

Mon 12/11

Today’s lecture:

  • 10% Condition: the sample size \(n\) has to be less than 10% of the total population size \(N\) when sampling to estimate the true population proportion \(p\) with \(\widehat{p}\) (SDM p. 448)
  • Continuing from last lecture: what do we do when we don’t know \(\sigma\)? We use the \(t\)-distribution. The \(t\)-distribution for different degrees of freedom looks like:
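Here is a minimal sketch (assumed code, not taken from lecture) of one way to draw these curves in R; note how the \(t\)-distribution’s heavier tails approach the standard normal as the degrees of freedom grow:

library(ggplot2)

# Evaluate each density on a fine grid of x values:
x <- seq(-4, 4, length.out = 200)
curves <- data.frame(
  x = rep(x, 4),
  density = c(dt(x, df = 2), dt(x, df = 5), dt(x, df = 30), dnorm(x)),
  distribution = rep(c("t: df = 2", "t: df = 5", "t: df = 30", "normal"), each = length(x))
)
ggplot(curves, aes(x = x, y = density, col = distribution)) +
  geom_line() +
  labs(title = "t-distributions vs the standard normal")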


Problem set 13

Information

Final problem set of the year! Assigned on Fri 12/8. Due on Wed 12/13 4pm in the mailbox outside my Converse Hall 316 office.

From Stats, Data, and Models 4th edition:

  1. Exercise 19.2 (page 512) on being psychic
  2. Exercise 19.4 (page 512) on being psychic
  3. Exercise 19.7 (page 512) on being psychic
  4. Exercise 19.9 (page 513) on bad medicine
  5. Exercise 19.24 (page 515) on educated mothers
  6. Exercise 25.1 (page 718) on graduation
  7. Exercise 25.3 (page 719) on graduation
  8. Exercise 25.7 (page 719) on graduation
  9. Exercise 25.11 (page 719) on graduation
  10. Exercise 25.13 (page 719) on graduation
  11. Exercise 25.17 (page 719) on graduation

Example discussion/solutions

Written solutions are here.


5.7 Inference for mean \(\mu\)

Thu 12/07

Announcements:

  • Drop-in tutoring hours during the reading+finals period: 7pm-9pm every day from Thu 12/14 to Wed 12/20.

Today’s lecture:


5.6 Inference for regression

Tue 12/05

Today’s lecture:

  • Back to ModernDive: (New) Chapter 6.7 Inference for regression
  • Corresponds to Chapter 25 in SDM. In particular, note that SDM 25.2 on “Assumptions and Conditions” corresponds to MD 6.7.5 on “Refresher: Residual analysis”. Specifically, we’re answering the question “Why is it important that there be no strong pattern in the residuals?”

5.5 Finishing hypothesis testing for prop’n \(p\)

Mon 12/04

Announcements:

  • Install mosaic package.

Today’s lecture:

  • Why is the rule of thumb that 95% of observations lie within plus/minus 1.96 standard deviations of the mean?
  • Consider the standard normal curve that has mean 0 and standard deviation 1 below:
    • The area to the left of 1.96 is 0.975
    • The area to the right of 1.96 is 0.025
    • Thus the area to the left of -1.96 is 0.025 by symmetry
    • Thus the area from -1.96 to 1.96 is 0.975-0.025 = 0.95 = 95%.
    • In other words, the middle 95% is from \([0 - 1.96\times 1, 0 + 1.96 \times 1] = [-1.96, 1.96]\)
library(mosaic)
xpnorm(1.96, mean = 0, sd = 1)

## [1] 0.9750021
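Going the other way, qnorm() returns the quantile with a given area to its left; this is where the 1.96 itself comes from:

qnorm(0.975, mean = 0, sd = 1)

## [1] 1.959964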

5.4 Hypothesis testing for prop’n \(p\)

Thu 11/30

Announcements:

  • Final projects: Please test out the datasets cleaned by the STAT 231 Data Science class sooner rather than later. Make sure the data suits your needs.

Today’s lecture:

  • Back to interactive app: varying the sample size \(n\). Interpreting confidence intervals.
  • Consider today’s handout, which you can see below.
    • The red line corresponds to the true population proportion red \(p = \frac{900}{2400} = 0.375\)
    • The top left corresponds to how the sample proportion red \(\widehat{p}\) behaves based on different samples of size \(n=50\).
    • The bottom left corresponds to how \(\widehat{p}\) behaves based on samples of size \(n=500\).
  • Questions:
    1. Say I told you I drew a sample of size \(n=50\) balls from the tub, and got 20 red balls i.e. \(\widehat{p}=0.4\), would you believe me?
    2. Say I told you I drew a sample of size \(n=50\) balls from the tub, and got 25 red balls i.e. \(\widehat{p}=0.5\), would you believe me?
    3. Say I told you I drew a sample of size \(n=50\) balls from the tub, and got 35 red balls i.e. \(\widehat{p}=0.7\), would you believe me?
    4. Now say I told you I drew a sample of size \(n=500\) balls from the tub, and got 200 red balls i.e. \(\widehat{p}=0.4\), would you believe me?
    5. Say I told you I drew a sample of size \(n=500\) balls from the tub, and got 250 red balls i.e. \(\widehat{p}=0.5\), would you believe me?
    6. Now say the tub truly had population proportion red \(p = \frac{1500}{2400} = 0.625\). Draw in the top right panel how the sample proportion red \(\widehat{p}\) would behave based on different samples of size \(n=50\).
    7. Now say I told you I drew a sample of size \(n=50\) balls from the tub, and got 35 red balls, would you believe me?
    8. Again, assuming the tub truly had population proportion red \(p = \frac{1500}{2400} = 0.625\), draw in the bottom right panel how the sample proportion red \(\widehat{p}\) would behave based on different samples of size \(n=500\).
[Drawing: today’s handout]
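A minimal simulation sketch (assumed code, not shown in class) mimics the handout: draw repeated samples from a tub with 900 red balls out of 2400 and compare how \(\widehat{p}\) spreads out for \(n=50\) versus \(n=500\):

# Tub with 900 red balls out of 2400 total, i.e. p = 0.375:
tub <- c(rep("red", 900), rep("not red", 1500))
# 10,000 sample proportions red for each sample size:
p_hat_50 <- replicate(10000, mean(sample(tub, size = 50) == "red"))
p_hat_500 <- replicate(10000, mean(sample(tub, size = 500) == "red"))
sd(p_hat_50)   # roughly 0.07: much sample-to-sample variation
sd(p_hat_500)  # roughly 0.02: clustered tightly around p = 0.375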

Problem set 12

Information

Assigned on Fri 12/01. Due on Thu 12/7 11:30am.

From Stats, Data, and Models 4th edition:

  1. Exercise 17.5 (page 464) on marriage
  2. Exercise 17.22 (page 465) on green M&M’s
  3. Exercise 17.30 (page 466) on binge drinking
  4. Exercise 18.1 (page 488) on age
  5. Exercise 18.3 (page 488) on age
  6. Exercise 18.6 (page 489) on age. Note a 99% confidence interval has a margin of error of \(2.58 \times \text{SE}\), unlike a 95% confidence interval which has a margin of error of \(1.96 \times \text{SE}\); see the quick check below this list.
  7. Exercise 18.19 (page 490) on seafood
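As a quick check of where these two multipliers come from (a sketch using base R), both are quantiles of the standard normal:

qnorm(0.975)  # 95% of the area lies within +/- this value

## [1] 1.959964

qnorm(0.995)  # 99% of the area lies within +/- this value

## [1] 2.575829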

Example discussion/solutions

Written solutions are here.


5.3 Confidence interval for prop’n \(p\)

Tue 11/28

Today’s lecture:

  • Added blue lines denoting the middle 95% of values to the interactive app from Lec 5.2 showing the sampling distribution of \(\widehat{p}\)’s based on samples of size \(n\).
  • 2016 election
    1. FiveThirtyEight’s forecast was among those that were the “least wrong.” They do a “poll aggregator” based on several pollsters of varying quality.
    2. One highly rated pollster is Monmouth University; see 2017/11/7 poll results
    3. 2016/11/14 NPR article on four possible reasons polls were so off in 2016.

5.2 More on sampling distributions

Mon 11/27

Announcements:

  • Project data was sent back from the STAT 231 Data Science class. Please take a look and let me know if you have questions.
  • Office hours today from 2-4pm and not 1-4pm

Today’s lecture:

  • Still Chapter 17 in SDM
  • Interactive app for today: varying the sample size \(n\)

Problem set 11

Information

Assigned on Thu 11/23. Due on Thu 11/30 11:30am.

From Stats, Data, and Models 4th edition:

  1. Chapter 17 (p.463): Exercises 1 (website) and 2 (marketing). Keep in mind the exercise done in class with the “tub-o-balls” where we counted the proportion red.

Example discussion/solutions

Written solutions are here.


5.1 Sampling distributions

Tue 11/14

Announcements:

Feedback

Thanks for the great feedback! Here is some of my motivation for why I taught the course the way I did.

1. Big picture

Not so much computer science, but rather coding. This is not an endorsement of President Obama, but rather of the message:


2. Standard topics

That being said, there are “bread and butter” topics that a student leaving intro stats needs to understand:

  1. Interpreting results of a scientific paper that uses statistics: t-test, chi-squared tests, ANOVA, confidence intervals and p-values.
  2. Understanding the meaningfulness of models for scientific phenomena. For example:
    • “Should I include an interaction term in my model or not?”
    • “Is diet type a statistically significant explanatory variable for disease?”
    • Machine learning: “Can I predict someone’s future income based on their high school GPA?”
  3. Probability


3. Recipe for success

How to achieve 1. and 2. above? Slides from first lecture on course objectives.


I believe the most effective way is to use the following toolbox, which takes time to develop:

  • Data visualization
  • Data modeling

Back to sampling

Let’s revisit our tub-o-balls exercise from Lec 4.1

Group red white green n prop_red
1 Kathleen and Max 18 32 0 50 0.36
2 Sean, Jack, and CJ 18 32 0 50 0.36
3 Orianna and Judy 22 28 0 50 0.44
4 James and Jacob 21 29 0 50 0.42
5 Hannah and Siya 16 34 0 50 0.32
6 Niko, Sophie, and Caitlin 14 36 0 50 0.28
7 Niko, Sophie, and Caitlin 19 31 0 50 0.38
8 Aleja and Ray 20 30 0 50 0.40
9 Yaw and Drew 16 34 0 50 0.32
10 Yaw and Drew 21 29 0 50 0.42

Let’s plot a histogram of the 10 sample proportions \(\widehat{p}\) based on samples of size \(n=50\).

Turns out there are 900 red balls. So the true population parameter, in this case the population proportion, is \(p=\frac{900}{2400} = 0.375\), marked in red.

Say I made Yaw and Drew do this one million times. Let’s draw the histogram:

Let’s draw the same histogram with binwidth = 0.02:

The standard deviation of this is 0.068 (and not 0.1 as stated in lecture).
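A minimal simulation sketch (assumed code, not what was actually run in lecture) reproduces this standard deviation:

# Tub with 900 red balls out of 2400 total, i.e. p = 0.375:
tub <- c(rep("red", 900), rep("not red", 1500))
# One million sample proportions red based on samples of size n = 50.
# Warning: this takes a while to run!
p_hat <- replicate(1e6, mean(sample(tub, size = 50) == "red"))
sd(p_hat)  # roughly 0.068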


Midterm II review

Mon 11/13

Announcements:

  • Course charted for the rest of the semester; see lecture outline here for topics and readings from the SDM textbook.

Today’s lecture:

  • Midterm II review
  • Solutions/discussion to PS09 posted below

Format

  • In-class
  • Closed book, no calculators, no coding.
  • No practice exam, but no “curveball” questions on Thursday. What does this mean? Only “straightforward” questions. What does this mean in practice? If you understand the lecture notes, ModernDive, and the two problem sets, you will be set.

Sources

  • Lectures 3.1 thru 3.9
  • Problem sets 08 and 09.

ModernDive Chapter 6: Data Modeling

  • What is the general modeling framework? \(y = f(x) + \epsilon\). What are its components?
  • We’ve seen 5 regression scenarios:
    1. Simple linear regression with 1 numerical explanatory/predictor variable. ModernDive 6.2
    2. Simple linear regression with 1 categorical explanatory/predictor variable. ModernDive 6.3
    3. Multiple regression with 2 numerical explanatory/predictor variables. ModernDive 6.4
    4. Multiple regression with 1 numerical and 1 categorical explanatory/predictor variable. ModernDive 6.5
    5. Multiple regression with 2 categorical explanatory/predictor variables. ModernDive 6.6
  • For each of the above
    1. What is an appropriate exploratory data visualization?
    2. Most important: How do we interpret the fitted values for the intercept and any slopes, both
      • Statistically/mathematically
      • In the context of the scientific question
    3. How do we compute fitted values and residuals using the equation for the line?
    4. How do we conduct residual analysis? Why do we conduct this?
  • Other topics
    • How do you interpret the slopes for a categorical variable?
    • What is the correlation coefficient? ModernDive 6.2.2.
    • What does it mean for a line to be “best fitting”? ModernDive 6.2.6.
    • In multiple regression, we can also consider interaction effects. Why would we consider these and how do we interpret the results? We saw this in class for scenarios 4 and 5 above. Introduced in ModernDive 6.5.3.
    • Why do we need log-transformations? What do they do to the data? ModernDive 6.6.1

Problem set 10

Information

Assigned on Sat 11/11. Due on Fri 11/17 1pm.

Notes:

  • On paper, no email submissions.
  • Submitted at my office in Converse 316
  • Can also be submitted day before in lecture

From Stats, Data, and Models 4th edition:

  1. Chapter 11 (p.314): Exercise 16
    • Do the first portion, parts a) thru c), on the Gallup poll
    • Do the second portion, parts a) thru g), for exercises 17 (medical treatments) and 18 (social life) only.
  2. Chapter 11 (p. 316): Exercise 38 (arm length)
  3. Chapter 12 (top right of p. 337): Answer parts a) thru h) listed under Exercise “21-34” for the following 3 exercises only
    • Exercise 21 (fish oil and bipolar disorder)
    • Exercise 22 (vitamin B12 and depression)
    • Exercise 24 (diet and insomnia)

Example discussion/solutions

Written solutions are here.


4.3 Design of Experiments

Thu 11/9

Announcements:

  • Project proposals are due in the RStudio Server Shared Project by tomorrow at 5pm. In your instructions to the STAT 231 Data Science students, be clear about your objectives. If you communicate exactly what it is you hope to do, it will help inform the data wrangling they need to do.

Today’s lecture:

  • Design of experiments
  • The Kinsey Reports on human sexuality: a large sample, but biased sampling, and thus non-generalizable results. Watch the following 2m04s video:

4.2 More on sampling

Tue 11/7

Announcements:

  • Do not work on RStudio Server between 1pm-2pm today as there will be some maintenance work done during that time.

Today’s lecture:

  • Finish ModernDive 7.2 Sampling
  • ModernDive 7.3 Representative sampling examples

Hands-on sampling exercise

From last time, the results of the sampling exercise!

Let’s look at the sample proportion of the \(n=50\) balls that were red.

Group red white green n prop_red
1 Kathleen and Max 18 32 0 50 0.36
2 Sean, Jack, and CJ 18 32 0 50 0.36
3 Orianna and Judy 22 28 0 50 0.44
4 James and Jacob 21 29 0 50 0.42
5 Hannah and Siya 16 34 0 50 0.32
6 Niko, Sophie, and Caitlin 14 36 0 50 0.28
7 Niko, Sophie, and Caitlin 19 31 0 50 0.38
8 Aleja and Ray 20 30 0 50 0.40
9 Yaw and Drew 16 34 0 50 0.32
10 Yaw and Drew 21 29 0 50 0.42

Ehhhhh, this is hard to digest. Let’s visualize:

What was the true proportion? I asked Professor X, who replied “From what I recall it was 2400 balls total, 1600 white, 100 green, so 700 red? But that’s assuming we didn’t lose any.” In other words, the true proportion \(p=\frac{700}{2400} = 0.2916\). Let’s plot this. Question: Is Professor X right about the true proportion red?

4.1 Introduction to sampling

Mon 11/6

Announcements:

  • In a Slack DM or a channel, type /appear NAME_OF_UNIQUE_CHATROOM. This will be useful when working on Final Projects.
  • Sports fans: Tomorrow 4:30pm in SMudd 206, Prof. Michael Lopez will give a talk on “How often does the best team win?” where they model the degree to which chance affects the outcomes of 4 men’s pro leagues: NFL, NBA, MLB, and NHL (sorry soccer fans, no MLS).
    Which sport is the most akin to coin flips? Which sport is the most deterministic, in that better teams win more consistently? Find out tomorrow!
  • All team leaders: coordinate the following:
    1. Login to the RStudio Server Shared Project and ensure that I am added as a collaborator.
    2. Ask your teammates to see if they can login as well.
    3. In the group DM on Slack send me a message letting me know if your teammates can login.

Today’s lecture:

  • After today’s “hands-on” exercise, fill in this Google Form
  • Chalk talk 4.1 Sampling terminology

3.9 Multiple regression: two categorical predictors

Thu 11/2

Announcements:

  • PS09 due tomorrow Fri 11/3 at noon.
  • “Team information” stage of final project due today.
  • No problem set this week to allot time for final project proposals.

Today’s lecture:

  • Multiple regression with two categorical predictors (ModernDive 6.6)

3.8 Multiple regression: one numerical & one categorical predictor

Tue 10/31

Announcements:

  • Demo of RStudio Server Shared Projects

Today’s lecture:

  • Multiple regression with one categorical and one numerical predictor (ModernDive 6.5)
  • Multiple regression with two categorical predictors (ModernDive 6.6)

3.7 Final project discussion

Mon 10/30

Announcements:

  • Curious about Spring 2018 math or stats courses? Want to learn more about majoring in Stats or Math? Then come to one of the Information Sessions for Majors and Prospective Majors!
    • Information Session for Majors and Prospective Majors in Math: Wed 11/1 6:30-7:30pm in SMudd 206
    • Information Session for Majors and Prospective Majors in Stats: Thu 11/2 6:00-7:00pm in SMudd 207
  • Problem sets
    • Solutions/discussion to PS08 posted below.
    • PS09 is now due on Fri 11/3 at noon. Please take it seriously as it is very similar in flavor to the kind of work you’ll be doing for your final project.
    • No problem set assigned this Thu 11/2 to give time for you to work on your project proposals.

Today’s lecture:


Problem set 9

Information

Assigned on Thu 10/26, due on Fri 11/3 noon.

You will do a thorough regression analysis, in a similar vein to what you will be doing for your final projects!

Instructions:

  1. Download the PS09_YourFirstName_YourLastName.Rmd from Slack and change the filename to reflect your name.
  2. Slack Direct Message me @albert and the grader @mchien20 whenever you want:
    • A working link on Rpubs.com to your report
    • Your PS09_YourFirstName_YourLastName.Rmd file.
  3. If you are having R Markdown errors:
    1. Go over the R Markdown Debugging Google Doc first. These steps solve about 85% of all problems.
    2. Go to drop-in tutoring at Q-Center (see syllabus for times).

Example discussion/solutions

Source R Markdown file

Available here: PS09_solutions.Rmd

Preparation

  • Make sure to load all the necessary packages and create the new categorical variable trump_support_level.
  • Read the following
    1. The first four paragraphs on the Wikipedia entry for the Gini coefficient/index: a statistical measure of income inequality.
    2. The Jan 23, 2017 FiveThirtyEight article “Higher Rates Of Hate Crimes Are Tied To Income Inequality”. You will be partially reconstructing their analysis.
  • The dataset used for this article is included in the FiveThirtyEight package: hate_crimes.
    • Read the help file corresponding to this data by running ?hate_crimes
    • Run View(hate_crimes) and explore the dataset


library(ggplot2)
library(dplyr)
library(broom)
library(knitr)
library(fivethirtyeight)

# Create a categorical variable of Trump support level based on the numerical
# variable of share of 2016 U.S. presidential voters who voted for Donald Trump.
# Each of the low, medium, and high levels have roughly one third of states.
hate_crimes <- hate_crimes %>% 
  mutate(
    trump_support_level = cut_number(share_vote_trump, 3, labels=c("low", "medium", "high"))
    )

Question 0: Preliminary questions

  1. What value of the Gini index indicates perfect income equality?
  2. What value of the Gini index indicates perfect income inequality?
  3. Why are the two hate crime variables based on counts per 100,000 individuals and not based on just the count?
  4. Name some differences in how and when the FBI reports-based and the Southern Poverty Law Center-based rates of hate crime were collected, and how this could impact any analyses.

Answers:

  1. 0
  2. 1
  3. We need to standardize comparisons between states that have massively different population sizes. We do this by comparing rates per 100,000 individuals instead of raw counts.
  4. The FBI data is voluntarily submitted, and thus could suffer from volunteer bias; it also only focuses on incidents that are prosecutable. The SPLC data focuses on incidents, which is a broader and at the same time more subjective definition IMO. Furthermore, the SPLC counts are based on media accounts, which could suffer from a different form of bias.

Question 1: Hate crimes and Trump support

Let’s model the relationship, both visually and via regression, between:

  • \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election
  • \(x\): Level of Trump support in the state: low, medium, or high

a) Visual model

Create a visual model of this data (do not forget to include appropriate axes labels and title):

# Write your code to create visualization below:
ggplot(hate_crimes, aes(x=trump_support_level, y=hate_crimes_per_100k_splc)) +
  geom_boxplot() +
  labs(x="State's Trump support level", y="Hate crimes per 100k (SPLC reports)",
       title="Relationship between hate crimes and Trump support level for US States")

b) Regression model

Output the regression table and interpret the results

lm(hate_crimes_per_100k_splc ~ trump_support_level, data=hate_crimes) %>% 
  tidy() %>% 
  kable(digits=3)
term estimate std.error statistic p.value
(Intercept) 0.460 0.053 8.692 0.000
trump_support_levelmedium -0.238 0.079 -3.029 0.004
trump_support_levelhigh -0.269 0.080 -3.363 0.002

We see that on average

  • the “low” Trump support states had on average 0.46 incidents per 100K
  • the “medium” Trump support states had on average 0.46 - 0.238 = 0.222 incidents per 100K
  • the “high” Trump support states had on average 0.46 - 0.269 = 0.191 incidents per 100K

Say what? That’s not what I expected! I thought the relationship would’ve been the other way around: as there is a higher level of Trump support, there would be a higher rate of incidents. Keep in mind, there could be other factors we are not accounting for!

c) Conclusion

  1. Give a one-sentence answer as to whether or not your results above are consistent with your visual model from a).
  2. Which state was the outlier in the “low” group?

Write your answers here:

  1. Yes. Some surprising results: Washington, Oregon, and Massachusetts, which were solidly in Hillary Clinton’s camp, had a surprisingly large number of hate crime incidents.
  2. Washington DC was the outlier

Question 2: Two simple linear regressions

Create two separate visualizations (do not forget to include appropriate axes labels and title) and run two separate simple linear regressions (using only one predictor) for \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election with

  1. \(x\): the Gini index
  2. \(x\): high school education level

and interpret any slope values.

a) Gini Index

ggplot(hate_crimes, aes(x=gini_index, y=hate_crimes_per_100k_splc)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(x="Gini index", y="Hate crimes per 100k (SPLC reports)", title="Relationship between hate crimes and Trump support level for US States")

lm(hate_crimes_per_100k_splc~gini_index, data=hate_crimes) %>% 
  tidy() %>% 
  kable(digits=3)
term estimate std.error statistic p.value
(Intercept) -1.527 0.783 -1.950 0.057
gini_index 4.021 1.718 2.341 0.024

We see that the relationship is positive: as the Gini index increases by 1 unit, there is on average an associated increase of 4.021 hate crimes per 100K individuals. In other words, states with more inequality tended to have more hate crimes.

However, note that the scale of the Gini index is between 0 (perfect equality) and 1 (perfect inequality)! So it makes sense to rescale the units. For example: as the Gini index increases by 1/100 = 0.01 unit, there is on average an associated increase of 4.021/100 = 0.04021 hate crimes per 100K individuals.

b) High school education level

ggplot(hate_crimes, aes(x=share_pop_hs, y=hate_crimes_per_100k_splc)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(x="Share of population over 25 with HS degree (2009)", y="Hate crimes per 100k (SPLC reports)", title="Relationship between hate crimes and high school US States")

lm(hate_crimes_per_100k_splc ~ share_pop_hs, data=hate_crimes) %>% 
  tidy() %>% 
  kable(digits=3)
term estimate std.error statistic p.value
(Intercept) -1.705 0.923 -1.848 0.071
share_pop_hs 2.320 1.065 2.179 0.035

We see that the relationship is positive: as the proportion of the population that has at least a high school degree increases by 1 unit, there is an associated increase of 2.320 hate crimes per 100K.

Two questions:

  1. Proportion increases by 1? That’s everybody! Let’s rescale our interpretation: as the proportion of the population that has at least a high school degree increases by 0.01 = 1/100 = 1 percentage point, there is an associated increase in hate crimes of 2.320/100 = 0.0232.
  2. Wait: the higher the proportion of people with at least a high school degree, the more hate crimes? Shouldn’t it be the other way around? More on this in class.

Question 3: Multiple regression

Run a multiple regression for

  1. \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election
  2. Using the following predictor variables simultaneously:
    1. \(x_1\): the Gini index
    2. \(x_2\): high school education level

and interpret both slope coefficients:

lm(hate_crimes_per_100k_splc ~ share_pop_hs + gini_index, data=hate_crimes) %>% 
  tidy() %>% 
  kable(digits=3)
term estimate std.error statistic p.value
(Intercept) -8.212 1.419 -5.787 0
share_pop_hs 5.256 1.003 5.241 0
gini_index 8.702 1.630 5.340 0

Write your interpretation below:

  • All else being equal, for every increase in 1 percentage point of the population that has at least a high school degree, there is an associated increase of on average 5.256/100 = 0.05256 hate crimes per 100K
  • All else being equal, for every increase in 0.01 units in the Gini index, there is an associated increase of on average 8.702/100 = 0.08702 hate crimes per 100K

Question 4: Impact of DC on analyses

Create two new data frames:

  1. hate_crimes_no_new_york: the hate_crimes dataset without New York
  2. hate_crimes_no_DC: the hate_crimes data without the District of Columbia

Repeat the multiple regression from Question 3 and indicate which state’s removal from the dataset has a bigger impact on the analysis. Why do you think this is?

hate_crimes_no_new_york <- hate_crimes %>%
  filter(state != "New York")
lm(hate_crimes_per_100k_splc ~ share_pop_hs + gini_index, data=hate_crimes_no_new_york) %>% 
  tidy() %>% 
  kable(digits=3)
term estimate std.error statistic p.value
(Intercept) -8.655 1.448 -5.977 0
share_pop_hs 5.399 1.001 5.394 0
gini_index 9.415 1.706 5.517 0
hate_crimes_no_DC <- hate_crimes %>%
  filter(state != "District of Columbia")
lm(hate_crimes_per_100k_splc ~ share_pop_hs + gini_index, data=hate_crimes_no_DC) %>% 
  tidy() %>% 
  kable(digits=3)
term estimate std.error statistic p.value
(Intercept) -3.989 1.508 -2.645 0.011
share_pop_hs 3.284 0.943 3.481 0.001
gini_index 3.136 1.835 1.709 0.095

Answer: Removing DC changed the resulting slopes a lot more! Why? Because it is an outlier, and removing outliers has a bigger impact on analyses!

3.6 Multiple regression: two numerical predictors

Thu 10/26

Announcements:

  • Final project details coming on Monday

3D Scatterplot

The multiple regression generalization of a simple linear regression line is a regression plane i.e. a flat surface. Note in the following interactive plot:

  • The blue points are the 400 observed values, each corresponding to one row in the Credit data frame from the ISLR package. Hover over a blue point to get the observed (Income, Credit Limit, Credit Card Balance) values
  • The blue/green/yellow surface is the “best” fitting regression plane thru these points. Hover over any portion of the plane. The z value in the pop-up is the fitted value \(\widehat{y}\) corresponding to particular values of
    • x: \(x_1\) the income (in $10K) variable
    • y: \(x_2\) the credit limit variable
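As a minimal sketch of how such a plane can be computed and drawn (assuming the Credit data from the ISLR package; this is not necessarily the code behind the plot above): fit the regression with lm(), evaluate the fitted equation on a grid of (Income, Limit) values, and pass that grid to plotly as a surface:

library(ISLR)
library(plotly)

# Fit the multiple regression model:
model <- lm(Balance ~ Income + Limit, data = Credit)

# Evaluate the fitted plane b0 + b1 * Income + b2 * Limit on a grid:
income_grid <- seq(min(Credit$Income), max(Credit$Income), length.out = 50)
limit_grid <- seq(min(Credit$Limit), max(Credit$Limit), length.out = 50)
balance_grid <- outer(income_grid, limit_grid, function(x1, x2) {
  coef(model)[1] + coef(model)[2] * x1 + coef(model)[3] * x2
})

# plotly expects z as a matrix with rows indexed by y and columns by x,
# hence the transpose:
plot_ly(showscale = FALSE) %>%
  add_surface(x = income_grid, y = limit_grid, z = t(balance_grid))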

3.5 Regression with categorical predictor

Tue 10/24

Announcements:

  • Install plotly and ISLR packages

Today’s lecture:

  • Finish regression table
  • Look at point-by-point summaries
  • Residual analysis

Run following to create a 3D scatterplot in R:

library(ISLR)
library(plotly)
plot_ly(showscale=FALSE) %>%
  add_markers(
    x = Credit$Income,
    y = Credit$Limit,
    z = Credit$Balance,
    hoverinfo = 'text',
    text = ~paste("x1 - Income: ", Credit$Income, 
                  "</br> x2 - Limit: ", Credit$Limit, 
                  "</br> y - Balance: ", Credit$Balance)
  ) %>% 
  layout(
    scene = list(
      xaxis = list(title = "x1 - Income (in $10K)"),
      yaxis = list(title = "x2 - Limit ($)"),
      zaxis = list(title = "y - Balance ($)")
    )
  )

3.4 Best fitting line

Mon 10/23

Announcements:

  • When submitting problem sets, submit both your .html and .Rmd files.
  • Note that:
    • .html files are the output.
    • .Rmd files are the source code. This is the document you need to
      • replicate/reproduce your HTML file.
      • Modify if you want to update your analysis
  • Solutions/discussion to PS06 on available seat miles posted below.
  • Solutions to PS07 on Midterm I questions/R Markdown posted below.

Today’s lecture:

  • Defining what we mean by “best” fitting line (ModernDive 6.2.6)
  • Started regression with categorical predictor (ModernDive 6.3)

Problem set 8

Information

Assigned on Fri 10/20, due on Thu 10/26 11:30am.

You will run a regression analysis on the evals dataset described in Section 6.1.1 of ModernDive. In particular, you study the relationship between

  • Outcome variable \(y\): teaching score
  • Predictor variable \(x\): age

Instructions:

  1. Download the PS08_YourFirstName_YourLastName.Rmd from Slack and change the filename to reflect your name.
  2. Slack Direct Message me @albert and the grader @mchien20 whenever you want:
    • A working link on Rpubs.com to your report
    • Your PS08_YourFirstName_YourLastName.Rmd file.
  3. If you are having R Markdown errors:
    1. Go over the R Markdown Debugging Google Doc first. These steps solve about 85% of all problems.
    2. Go to drop-in tutoring at Q-Center (see syllabus for times).

Example discussion/solutions

Source R Markdown file

Available here: PS08_solutions.Rmd

Question 1

library(ggplot2)
library(dplyr)
library(broom)
library(knitr)

load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- evals %>%
  select(score, ethnicity, gender, language, age, bty_avg, rank)

Exploratory data analysis: Create a visualization that shows the relationship between:

  • \(y\): instructor teaching score
  • \(x\): instructor age

Comment on this relationship.

Answer: While the following plot using geom_point works, there is overplotting.

ggplot(evals, aes(x=age, y=score)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(x="Age of instructor", y="Teaching score", title="Relationship between teaching score and age")

So as a visualization tool, you should really consider using a geom_jitter() plot so as to break things up:

ggplot(evals, aes(x=age, y=score)) +
  geom_jitter() +
  geom_smooth(method="lm", se=FALSE) +
  labs(x="Age of instructor", y="Teaching score", title="Relationship between teaching score and age (jittered)")

Remarks

  • What does se=FALSE do? It removes the “error/uncertainty” bands. We’ll see what these are later!
  • There appears to be a slight negative relationship, meaning as age goes up, average teaching score goes down. This is further evidenced by the correlation coefficient, which is negative:
cor(evals$age, evals$score)
## [1] -0.107032

Comment on relationship: The relationship appears to be negative: as professors have higher ages, there is an associated drop in teaching score.

Question 2

  1. Display the regression table that shows both the 1) fitted intercept and 2) fitted slope of the regression line. Pipe %>% the table into kable(digits=3) to get a cleanly outputted table rounded to 3 decimal places.
  2. Interpret the slope.
  3. For an instructor who is 67 years old, what would you guess their teaching score would be?

Answer: First, the table:

lm(score~age, data=evals) %>% 
  tidy() %>% 
  kable(digits=3)
term estimate std.error statistic p.value
(Intercept) 4.462 0.127 35.195 0.000
age -0.006 0.003 -2.311 0.021

Second, the slope for age of -0.006 indicates: for every additional year in age of the instructor, there is an associated decrease of on average 0.006 out of 5 in teaching score. Note, we cautiously use the word associated, because we can’t make any claims that age directly causes drops in teaching score, but rather that these two variables are merely associated.

Third, for an instructor who is \(x=67\), we would guess a teaching score of

\[ \widehat{y} = 4.462 - 0.006 \times 67 = 4.06 \]
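The same arithmetic as a quick check in R:

4.462 - 0.006 * 67

## [1] 4.06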

Question 3

Does there seem to be a systematic pattern in the lack-of-fit of the model? In other words, is there a pattern in the error between the fitted score \(\widehat{y}\) and the observed score \(y\)? Hint: geom_hline(yintercept=0, col="blue") adds a blue horizontal line at \(y=0\).

Answer: Here is a random sample of 5 rows of point-by-point info out of a possible 463.

point_by_point_info <- lm(score~age, data=evals) %>% 
  augment() %>% 
  select(score, age, .fitted, .resid)

point_by_point_info %>% 
  sample_n(5) %>% 
  kable()
score age .fitted .resid
64 4.2 37 4.242218 -0.0422180
309 3.6 35 4.254094 -0.6540945
89 4.4 56 4.129392 0.2706083
415 4.9 54 4.141268 0.7587318
301 4.4 43 4.206589 0.1934113

Let’s look for any pattern in the residuals. First, in the plot of the residuals over age, there seems to be a pretty even scatter of the residuals (AKA what’s left over, AKA the lack-of-fit) across all ages. Maybe not so much for profs over 70, but there aren’t many of them to begin with.

ggplot(point_by_point_info, aes(x=age, y=.resid)) +
  geom_point() +
  geom_hline(yintercept=0, col="blue") +
  labs(x="Age", y="Residual", title="Residuals over age")

Looking at the histogram, there seems to be a slight left skew, i.e. a longer tail of negative residuals. This is not ideal; we would rather they be normally distributed i.e. bell-shaped. More on this later in the course.

ggplot(point_by_point_info, aes(x=.resid)) +
  geom_histogram(binwidth = 0.25) +
  labs(x="Residual", title="Histogram of residuals")

Question 4

Say a college administrator wants to model teaching scores using more than one predictor/explanatory variable, not just age; in particular, using the instructor’s gender as well. Create a visualization that summarizes this relationship and comment on the observed relationship.

Answer: While you could do this with facets, IMO having a single colored scatterplot is more succinct. In other words, I find the following easier to digest:

ggplot(evals, aes(x=age, y=score, col=gender)) +
  geom_jitter() +
  geom_smooth(method="lm", se=FALSE) +
  labs(x="Age of instructor", y="Teaching score", title="Relationship between teaching score and age+gender")

than this:

ggplot(evals, aes(x=age, y=score)) +
  geom_jitter() +
  geom_smooth(method="lm", se=FALSE) +
  labs(x="Age of instructor", y="Teaching score", title="Relationship between teaching score and age+gender") +
  facet_wrap(~gender)

Comment on observed relationship: the relationship between age and teaching score seems to differ between male and female faculty. The slope is more steeply negative for female faculty! Also note, there are fewer older female faculty.


3.3 Residual analysis

Thu 10/19

Announcements:

  • Problem set posted later this PM/tomorrow AM.
  • Developing correlation coefficient intuition: Play the Guess the Correlation game online.
  • Correlation is not necessarily causation: Just because two variables are highly correlated, doesn’t mean that one causes the other! They are merely associated! Check out Spurious Correlations.

Today’s lecture:

  • Finishing “Observed values, fitted values, and residuals” (ModernDive 6.2.4): point-by-point info.
  • Residual analysis (ModernDive 6.2.5).

3.2 Simple linear regression

Tue 10/17

Announcements:

  • Install the broom package, which is used to cleanly output regression tables and other regression results
  • Topics for the course are switching: going from data science to data modeling & statistical inference.
  • Accordingly, the lecture format is switching: instead of lectures centering around the ModernDive eBook with chalk talks supplementing it, lectures will now center around chalk talks and will refer to ModernDive for
    1. Section numbers
    2. Table numbers: tables of data
    3. Code
    4. Most importantly, figure numbers: visualizations.

Today’s lecture:

  • New dataset: Professor evaluations (ModernDive 6.1.1)
  • Modeling topics
    • “Correlation coefficient” (ModernDive 6.2.2)
    • “Simple linear regression” (ModernDive 6.2.3)
    • “Observed values, fitted values, and residuals” (ModernDive 6.2.4)

2.13 R Markdown for generating reports

Mon 10/16

Announcements:

  • Install the readr package, which is used to read CSV files (See Lec 2.6 and ModernDive Section 4.2).

Topics:

  • R Markdown is a tool for generating HTML webpage reports that combine
    • Text
    • R code
    • R output: tables and graphs
  • R Markdown files have extension .Rmd, instead of the .R files we’ve been using until now.
  • Biggest source of confusion: R Markdown has its own environment. Just because something exists in your console doesn’t mean it exists in R Markdown. Whatever code you run in your console needs to be included in your .Rmd file
  • Your final projects will be submitted in R Markdown. Details coming shortly.

Problem set 7

Information

Assigned on Mon 10/16, due on Thu 10/19 11:30am. This problem set will be shorter than usual.

You will create an R Markdown report containing two plots and one table based entirely on midterm questions and publish these results to the web.

Instructions:

  1. Download the PS07_YourFirstName_YourLastName.Rmd from Slack and change the filename to reflect your name.
  2. Press “Knit” and say yes to all prompts to install all packages. An HTML page should pop up in RStudio.
  3. I suggest you write/test your code in a separate .R scratch pad first. When you feel confident about the output, copy the code over to the appropriate grey code block.
  4. Knit early, and knit often!
  5. When you are done
    • Knit one final time.
    • Click “Publish” in the HTML page that pops up in RStudio
    • Select Rpubs
    • Publish
    • Create an account
    • Give your document an appropriate title and description. Set the “slug” (meaning HTML link) to be “PS07”. After you hit continue, an online webpage should pop up. Copy the link.
  6. Slack Direct Message me @albert and the grader @mchien20 whenever you want:
    • A working link to your report
    • Your PS07_YourFirstName_YourLastName.Rmd file.

Notes:

  • If you’ve updated your HTML file, you can republish by following Step 5 again.
  • If you’re stuck:
    1. Go over the R Markdown Debugging Google Doc first. These steps solve about 85% of all problems.
    2. Go to drop-in tutoring at Q-Center (see syllabus for times).

Example discussion/solutions

Source R Markdown file

Available here: PS07_solutions.Rmd

Question 1

Here is a scatterplot of the number of Starbucks/Dunkin Donuts locations per 1000 individuals over median income for 1024 census tracts in Western Massachusetts:

library(ggplot2)
library(dplyr)
library(nycflights13)
library(knitr)
library(readr)

DD_vs_SB <- read_csv("https://rudeboybert.github.io/STAT135/content/PS07/DD_vs_SB.csv")

ggplot(data = DD_vs_SB, mapping = aes(x=median_income, y=shops_per_1000, color=Type)) + 
  geom_point() + 
  facet_wrap(~Type) + 
  geom_smooth(method="lm", se=FALSE, color="blue") +
  scale_color_manual(values = c("orange", "darkgreen")) +
  labs(x="Median income in $", y="Shops per 1000 individuals", 
       title="# of Dunkin Donuts & Starbucks per capita vs median income")

Question 2

Here is a plot comparing beer vs spirit vs wine consumption worldwide:

drinks <- read_csv("https://rudeboybert.github.io/STAT135/content/PS07/drinks.csv")

# Either a boxplot:
ggplot(data = drinks, mapping = aes(x=type, y=servings)) + 
  geom_boxplot() +
  labs(x="Alcohol type", y="Average number of servings", 
       title="Comparison of beer vs spirits vs wine consumption worldwide")

# Or a faceted histogram works:
ggplot(data = drinks, mapping = aes(x=servings)) + 
  geom_histogram() + 
  facet_wrap(~type) +
  labs(x="Average number of servings", 
       title="Comparison of beer vs spirits vs wine consumption worldwide")

Question 3

Here is a table that shows the median departure delay for each airline leaving Newark:

EWR_median_delays <- flights %>%
  filter(origin == "EWR") %>%
  group_by(carrier) %>%
  summarize(median_dep_delay = median(dep_delay, na.rm=TRUE))

kable(EWR_median_delays)
carrier median_dep_delay
9E -5
AA -3
AS -3
B6 -3
DL -2
EV -1
MQ -2
OO -1
UA 0
US -4
VX -1
WN 1

Recall:

  • All kable() from the knitr package does is print out the table in a nice format.
  • Because there are missing values in dep_delay, we set na.rm=TRUE so that we ignore them in the calculation of the median. However, do this with caution! Is there a systematic reason/pattern in the missingness?
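A tiny illustration (made-up numbers) of why na.rm matters:

median(c(1, 2, NA))

## [1] NA

median(c(1, 2, NA), na.rm = TRUE)

## [1] 1.5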

2.12 Midterm I Review

Wed 10/11 (Mon schedule)

Run this code first; the babynames package contains, for each year from 1880 to 2015, the number of children of each sex given each name. Only names with more than 5 uses are included.

install.packages("babynames") # Install package only once!
library(babynames)
View(babynames)

Let’s look at the distribution of females named “Taylor” over time:

library(ggplot2)
library(dplyr)

babynames_subset <- babynames %>% 
  filter(name == "Taylor" & sex == "F")

ggplot(babynames_subset, aes(x=year, y=n)) +
  geom_line() +
  labs(x="year", y="count", title="Popularity of name Taylor for females over time")

Format

  • In-class
  • Closed book, no calculators, dplyr cheatsheet will be provided.
  • More conceptual in nature
  • You won’t need to write explicit R code, but you may need to
    • Understand given code
    • Write code “sketches”; this is also known as pseudocode. For more details, read Part I of the following link, in particular the boxes:
      1. Know what pseudocode is
      2. Understand why pseudocode is useful
      3. Remember that pseudocode is subjective and nonstandard

Sources

  • Lectures 2.11 back to 1.1 (but not 3.1), corresponding readings, and especially learning checks.
  • Problem sets 03, 04, and 05

ModernDive Chapter 3: Data visualization

  • What is the grammar of graphics?
  • 5NG:
    • What are the Five Named Graphs (5NG)?
    • What are their distinguishing features?
    • When would you use one versus another?
  • Explicitly tie the components of the grammar of graphics to the 5NG:
    • What data variables are we considering? Are they categorical or numerical?
    • What’s the geometric object in question?
    • What is the mapping from data variables to aesthetic attributes?
    • Which of the 5NG consider relationships between variables?
  • What does faceting do? What kind of variables can we facet by?

ModernDive Chapter 4: Data tidying

  • 4.1: Tidy data:
    • What is the definition of “tidy” data formatting?
    • What is the alternative to tidy data formatting?
  • 4.4: Observational units, identification vs measurement variables. Why do we need identification variables for joins?
  • You can skip Chapters 4.2, 4.3, and 4.5

ModernDive Chapter 5: Data Wrangling

  • Understand the 5MV + joins. The images on the dplyr cheatsheet illustrate these well. By understand I mean:
    • What effect does each one have on a data frame?
    • How can you sequence them using %>% “thens” to format your data the way you want?
  • Go over examples of data wrangling in the learning checks, the textbook, and the problem sets, and see if you can reconstruct the corresponding code sketches/pseudocode on your own; a minimal example follows below.
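For example (a made-up sketch in the spirit of the problem sets, using the nycflights13 data), the pseudocode “take flights, then group by carrier, then compute the mean departure delay, then sort in descending order” corresponds to:

library(dplyr)
library(nycflights13)

flights %>%
  group_by(carrier) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_dep_delay))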

2.11 Finishing data wrangling

Thu 10/5

Announcements:

  • Midterm I next Thursday 10/12. Topics: follow this page linearly
    • Lectures 2.11 back to 1.1 (but not 3.1), corresponding readings, and learning checks.
    • Problem sets 03, 04, and 05
    • No direct coding questions, but you might need to understand code, or write code “sketches”.
    • Cheatsheets will be distributed the day of lecture.
  • Review session on Wed 10/11 (Monday schedule)
    • Please Slack me any topics you would like covered before Tuesday afternoon.

Chalk talk:

  • Tao of data analysis
  • Solutions/discussion to PS05 posted below.
  • Start problem set 06, which is due on Fri 10/13 at noon instead of the usual Thursday at 11:30am.

Problem set 6

Information

Assigned on Thu 10/5, due on Fri 10/13 noon (instead of the usual Thursday at 11:30am).

You are a junior analyst in the airline industry, specifically working for Virgin America. You are tasked with calculating Virgin America’s total available seat miles for all flights leaving each of the three New York Metropolitan area airports separately, and comparing these figures with those of ExpressJet Airlines Inc, a close rival to your company. Available seat miles are the fundamental units of production for a passenger-carrying airline; they are a measure of capacity.

Deliverables:

  • Have a file called PS06_YourFirstName_YourLastName.R (Ex: PS06_Albert_Kim.R) ready to submit with
    1. Your results in an easy to read table
    2. Your results in a cleanly formatted visualization
    3. A two sentence “executive summary” on the comparison in available seat miles between Virgin America and ExpressJet Airlines Inc for the three NYC airports. In other words, what insight would you provide the CEO of Virgin America if you only had 1 minute of their time.
  • To be submitted on Slack in #ps06 channel. This week only, submit via a Direct Message that includes: 1) Albert 2) the grader Margaret Chien. You should already have a string of such messages when the grader gave you feedback on previous assignments.
  • Tips: I highly recommend you:
    • Reread the ChalkTalk on the “Tao of data analysis” first.
    • Work in groups, especially in the “sketching out your plan” step.
    • If you’re stuck, go to the drop-in help center on Tuesday evening and my office hours on Wednesday. Both I and the drop-in center TA’s will want to see your sketched plan first before helping you with any code.
# Problem Set 06

# Please Indicate
# -Your name:
# -Who you collaborated with:
# -Approximately how much time did you spend on this problem set:
# -What, if anything, gave you the most trouble:

# Needed packages. Make sure you have all these installed.
# For visualization:
library(ggplot2)
# For data wrangling:
library(dplyr)
# Data necessary:
library(nycflights13)
View(flights)
View(planes)
View(airlines)
View(weather)
View(airports)
# For printing clean outputs of data frames using kable() function:
library(knitr)

# Deliverable 1: Easy to read table. Save the table in a data frame called results

# View it here:
View(results)

# Deliverable 2: Visualization

# Executive Summary:

Example discussion/solutions

Deliverable 1: Table

library(ggplot2)
library(dplyr)
library(knitr)
library(nycflights13)

# Clean up our workspace: Reduce both the flights and planes data sets to only
# those columns and rows we need.
flights_cleaned <- flights %>%
  filter(carrier == "VX" | carrier == "EV") %>% 
  select(tailnum, carrier, distance, origin)

planes_cleaned <- planes %>%
  select(tailnum, seats)

# Now we join the two by the variable "tailnum".
flights_planes_joined <- flights_cleaned %>% 
  left_join(planes_cleaned, by = "tailnum") 

results <- flights_planes_joined %>% 
  # Now we can compute ASM:
  mutate(seat_miles = seats * distance) %>% 
  # Since we want all 6 possible origin X carrier pairs, we group by
  # those variables and THEN summarize
  group_by(origin, carrier) %>% 
  summarise(total_seat_miles = sum(seat_miles)) %>% 
  # While not necessary, I decide to sort in descending order of ASM
  arrange(desc(total_seat_miles))

# Recall, all the kable() function from the knitr package does is output tables
# in an easy to read format. 
kable(results)
origin carrier total_seat_miles
JFK VX 1632349744
EWR EV 1465958750
EWR VX 664331034
LGA EV 333516285
JFK EV 17761240

Deliverable 2: Visualization

We have two categorical variables in the table above, origin and carrier, and a pre-aggregated value total_seat_miles that we want to show on the \(y\)-axis. Thus a barplot using geom_col() would be appropriate here:

ggplot(results, aes(x=origin, y=total_seat_miles)) +
  geom_col() +
  facet_wrap(~carrier) +
  labs(x="NYC Airport", y="Total Available Seat Miles", 
       title="Virgin vs ExpressJet Seat Miles in NYC in 2013")

Deliverable 3: Executive summary

There are numerous ways to summarize this, but one way is, told from the perspective of a Virgin America executive: we have about the same total capacity, but definitely spread out over different airports. We’ve got JFK covered, but definitely don’t have the capacity that ExpressJet does at Newark and La Guardia. Sure, we don’t have any capacity at La Guardia, but is this a problem given that some think it’s the worst airport in the country?

I suggest we target our marketing so that

  • We promote our great capacity to people who tend to fly out of JFK. As such, let’s run an ad campaign at subway stations along the A, E, J, and Z trains.
  • We start running ads that try to take a bite out of ExpressJet’s Newark market.

3.1 Intro to modeling

Tue 10/3

Announcements:

  • Check-in with Slack notifications.

Topics

  • Chalk talk: 2.11: Means vs medians
  • Learning check 5.13 on joins discussion
  • Chalk talk: Introduction to modeling
  • In-class exercise: StitchFix

Learning check 5.13

LC 5.13 on joins. Let’s try to join the flights and weather data frames so we have a unified data set. Note that weather is recorded hourly at each of the 3 NYC airports (EWR, JFK, and LGA).

Run the following in your console and look at the hour column:

library(nycflights13)
library(dplyr)
View(weather)
View(flights)

hour is just a number between 0 and 23! To join the two datasets, you need to join by ALL identification variables as per the five arrows that connect flights with weather in red: year, month, day, hour, origin:

[Figure: nycflights13 relational diagram]

Run the following in your console and observe the difference between flights and flights_with_weather: weather data is merged!

flights_with_weather <- flights %>% 
  left_join(weather, by=c("year", "month", "day", "hour", "origin"))
View(flights_with_weather)

StitchFix

Watch the following video from Good Morning America, then fill out your “style profile” on StitchFix.com. Feel free to use a “temporary email” from 10 minute email.



2.10 Review of PS03 & PS04

Mon 10/2

Chalk talk:

  • Solutions/discussion to PS03 posted below.
  • Solutions/discussion to PS04 posted below.

Problem set 5

Information

Assigned on Thu 9/28, due on Thu 10/5 11:30am.

You will be performing your first data wrangling of real data! The goal of this problem set is to get practice. In some cases, you’ll make a visualization too.

  • Deliverables:
    1. Have a file called PS05_YourFirstName_YourLastName.R (Ex: PS05_Albert_Kim.R) ready to submit. Create a new .R file, save it with the appropriate name as indicated above, and cut and paste the content below. You should be cutting and pasting the commands that create your graphs into this file!
    2. To be submitted on Slack in the #ps05 channel, which I will create on Thu 10/5 11:30am.
  • Tips:
    • When generating tables, always sketch out beforehand what you think the table should look like, and only then try to code it using dplyr commands. Again, separate out the what (the data wrangling “verbs”) from the how (the code).
# Problem Set 05

# Please Indicate
# -Your name:
# -Who you collaborated with:
# -Approximately how much time did you spend on this problem set:
# -What, if anything, gave you the most trouble:

library(ggplot2)
library(dplyr)

# Question 1: Drug Use Amongst OkCupid Users -------------------------------
# Let's look at 60,000 San Francisco OkCupid users in 2012 and consider the
# variable `drugs` reflecting users' answers to a question on their drug use.

# Run ALL this code first:
library(okcupiddata)
profiles <- profiles %>%
  select(sex, drugs) %>%
  mutate(drugs = ifelse(is.na(drugs), "N/A", drugs))
View(profiles)

# a) Type in a series of commands that will output a table of how many men and
# women there are in this population.

# b) Create a visualization that shows the distribution of the different responses
# in the variable `drugs`.

# c) Create a visualization that shows the same information as in part b), but
# for men and women separately. Who is more likely to say they never use drugs?
# Men or women?


# Question 2: Gapminder -------------------------------
# We're going to revisit the Gapminder data

# Run ALL this code first:
library(gapminder)
View(gapminder)

# a) Output a table that shows the mean and median GDP per capita of all
# countries in the year 2007, but split by continent. Your table should be 5
# rows and 3 columns.


# Question 3: Titanic -------------------------------
# Let's study data relating to who survived the Titanic disaster.

# Run ALL this code first:
data(Titanic)
Titanic <- Titanic %>%
  as_data_frame()
View(Titanic)

# a) Output tables that compare survivor counts
# 1. between men and women
# 2. between the different classes

# b) For each comparison above, show who was most likely to survive. Spoiler: it
# was women and 1st class passengers; show your work.

Example discussion/solutions

library(ggplot2)
library(dplyr)

Question 1: Drug Use Amongst OkCupid Users

Let’s look at 60,000 San Francisco OkCupid users in 2012 and consider the variable drugs reflecting users’ answers to a question on their drug use.

Run ALL this code first:

library(okcupiddata)
profiles <- profiles %>%
  select(sex, drugs) %>%
  mutate(drugs = ifelse(is.na(drugs), "N/A", drugs))
View(profiles)

Let’s peek at the first 6 rows. We see that we have 2 categorical variables:

sex drugs
m never
m sometimes
m N/A
m N/A
m never
m N/A

a) Type in a series of commands that will output a table of how many men and women there are in this population.

profiles %>%
  group_by(sex) %>%
  summarize(count=n())
sex count
f 24117
m 35829

Wow, this population is almost 60% male! A large factor could be that San Francisco has a lot of technology/engineering jobs, which unfortunately still suffer from gender imbalances.

b) Create a visualization that shows the distribution of the different responses in the variable drugs.

Here we are only considering one categorical variable, so let’s use a barplot! If you look at the profiles data, we see that this is an unaggregated table, so we use geom_bar and not geom_col:

ggplot(profiles, aes(x=drugs)) +
  geom_bar()

Most people say “never” and some people say “often”. However, do you think all the “never”s are being completely honest? These are self-reported values, and thus they might suffer from response bias.

c) Create a visualization that shows the same information as in part b), but for men and women separately. Who is more likely to say they never use drugs? Men or women?

ggplot(profiles, aes(x=drugs, fill=sex)) +
  geom_bar(position="dodge") +
  labs(title="Drug use counts by sex")

Looking at this graph, there are more men who said they never do drugs. But remember, this population is 60% male! So we need to compare proportions, not counts! In other words, divide the height of the blue bars by 35829 and the red bars by 24117.

Here is the updated plot. It seems that women actually are more likely to say they never use drugs!

Note: Converting to proportions is unfortunately a tricky task in dplyr, thus you won’t be expected to do this going forward and the topic is left to another course. However, if you are curious, the code is below:

proportion_drugs_by_sex <- profiles %>% 
  group_by(sex, drugs) %>% 
  summarise(count=n()) %>% 
  # Override grouping structure
  group_by(sex) %>% 
  # Get total count for each group separately, i.e. group = sex
  mutate(count_in_sex = sum(count)) %>% 
  mutate(proportion = count/count_in_sex)
View(proportion_drugs_by_sex)
ggplot(proportion_drugs_by_sex, aes(x=drugs, y=proportion, fill=sex)) +
  geom_col(position="dodge")

Question 2: Gapminder

We’re going to revisit the Gapminder data.

Run ALL this code first:

library(gapminder)
View(gapminder)

Let’s peek at the first 6 rows of the data set:

country      continent  year  lifeExp  pop       gdpPercap
Afghanistan  Asia       1952  28.801    8425333  779.4453
Afghanistan  Asia       1957  30.332    9240934  820.8530
Afghanistan  Asia       1962  31.997   10267083  853.1007
Afghanistan  Asia       1967  34.020   11537966  836.1971
Afghanistan  Asia       1972  36.088   13079460  739.9811
Afghanistan  Asia       1977  38.438   14880372  786.1134

a) Output a table that shows the mean and median GDP per capita of all countries in the year 2007, but split by continent. Your table should be 5 rows and 3 columns.

gapminder %>%
  filter(year==2007) %>%
  group_by(continent) %>%
  summarize(mean_gdp=mean(gdpPercap), median_gdp=median(gdpPercap))
continent   mean_gdp   median_gdp
Africa      3089.033    1452.267
Americas   11003.032    8948.103
Asia       12473.027    4471.062
Europe     25054.482   28054.066
Oceania    29810.188   29810.188

Question 3: Titanic

Let’s study data relating to who survived the Titanic disaster.

Run ALL this code first:

data(Titanic)
Titanic <- Titanic %>%
  as_data_frame()
View(Titanic)

Note: there are 4 categorical variables that categorize each passenger:

  • Class: 1st, 2nd, 3rd, or Crew. i.e. 4 levels
  • Sex: Male or Female. i.e. 2 levels
  • Age: Child or Adult. i.e. 2 levels
  • Survived: No or Yes. i.e. 2 levels

So each of the 2201 passengers fits into exactly one of \(4 \times 2 \times 2 \times 2 = 32\) possible categories, and thus the Titanic data frame has 32 rows, with a column n that indicates the number of people in each category:

Class  Sex     Age    Survived    n
1st    Male    Child  No          0
2nd    Male    Child  No          0
3rd    Male    Child  No         35
Crew   Male    Child  No          0
1st    Female  Child  No          0
2nd    Female  Child  No          0
3rd    Female  Child  No         17
Crew   Female  Child  No          0
1st    Male    Adult  No        118
2nd    Male    Adult  No        154
3rd    Male    Adult  No        387
Crew   Male    Adult  No        670
1st    Female  Adult  No          4
2nd    Female  Adult  No         13
3rd    Female  Adult  No         89
Crew   Female  Adult  No          3
1st    Male    Child  Yes         5
2nd    Male    Child  Yes        11
3rd    Male    Child  Yes        13
Crew   Male    Child  Yes         0
1st    Female  Child  Yes         1
2nd    Female  Child  Yes        13
3rd    Female  Child  Yes        14
Crew   Female  Child  Yes         0
1st    Male    Adult  Yes        57
2nd    Male    Adult  Yes        14
3rd    Male    Adult  Yes        75
Crew   Male    Adult  Yes       192
1st    Female  Adult  Yes       140
2nd    Female  Adult  Yes        80
3rd    Female  Adult  Yes        76
Crew   Female  Adult  Yes        20

a) Output tables that compare survivor counts

  1. between men and women
  2. between the different classes

The following tables give the survivor counts split by sex and by class. Note that we use the sum() function to add up the values in the n variable:

Titanic %>%
  filter(Survived == "Yes") %>% 
  group_by(Sex) %>%
  summarise(n=sum(n))
Sex     n
Female  344
Male    367
Titanic %>%
  filter(Survived == "Yes") %>% 
  group_by(Class) %>%
  summarise(n=sum(n))
Class  n
1st    203
2nd    118
3rd    178
Crew   212

b) For each comparison above, show who was most likely to survive. Spoiler: it was women and 1st class passengers; show your work.

We don’t have enough information, however! To answer “who was more likely to survive?”, we also need to count those who didn’t!

Titanic %>%
  filter(Survived == "No") %>% 
  group_by(Sex) %>%
  summarise(n=sum(n))
Sex     n
Female  126
Male    1364
Titanic %>%
  filter(Survived == "No") %>% 
  group_by(Class) %>%
  summarise(n=sum(n))
Class  n
1st    122
2nd    167
3rd    528
Crew   673

So

  • Survival by Sex: Women were much more likely to survive:
    • Female survival 344/(344+126) = 73.2%
    • Male survival 367/(367+1364) = 21.2%
  • Survival by Class: 1st class was most likely to survive:
    • 1st class survival 203/(203+122) = 62.5%
    • 2nd class survival 118/(118+167) = 41.4%
    • 3rd class survival 178/(178+528) = 25.2%
    • Crew survival 212/(212+673) = 24.0%

Important note: We can group_by more than one categorical variable to make the above task easier!

Titanic %>%
  group_by(Sex, Survived) %>%
  summarise(n=sum(n))
Sex     Survived  n
Female  No         126
Female  Yes        344
Male    No        1364
Male    Yes        367
Titanic %>%
  group_by(Class, Survived) %>%
  summarise(n=sum(n))
Class  Survived  n
1st    No        122
1st    Yes       203
2nd    No        167
2nd    Yes       118
3rd    No        528
3rd    Yes       178
Crew   No        673
Crew   Yes       212

Optional exercise:

Titanic %>%
  group_by(Sex, Survived) %>%
  summarise(n=sum(n)) %>% 
  # Override previous grouping metadata with new one
  group_by(Sex) %>% 
  mutate(n_in_sex = sum(n)) %>% 
  mutate(prop = n/n_in_sex)
Sex     Survived  n     n_in_sex  prop
Female  No         126   470      0.268
Female  Yes        344   470      0.732
Male    No        1364  1731      0.788
Male    Yes        367  1731      0.212
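
If you want extra practice, the same pattern recovers the survival proportions by class; a sketch mirroring the code above:

Titanic %>%
  group_by(Class, Survived) %>%
  summarise(n=sum(n)) %>%
  # Override previous grouping metadata with new one
  group_by(Class) %>%
  mutate(n_in_class = sum(n)) %>%
  mutate(prop = n/n_in_class)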

2.9 Mutate, reorder, joins

Thu 9/28

Announcements:

  • Information session for statistics and biostatistics graduate programs
    • Where: Merrill 300A.
    • When: Friday, October 6 12:00 pm. A light lunch will be served.
    • More information here.

Topics:

  • Chalk talk:
    • 2.9 Mutate, reorder, joins
  • Tech time:
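
For a first flavor of the three verbs in the title, here is a minimal sketch using the nycflights13 data from our problem sets (an illustration, not the chalk talk notes themselves):

library(dplyr)
library(nycflights13)

# mutate(): create a new variable from existing ones
flights <- flights %>%
  mutate(gain = dep_delay - arr_delay)

# arrange(): reorder the rows, here by largest gain first
flights %>%
  arrange(desc(gain))

# join: look up each carrier code's full airline name in the airlines data frame
flights %>%
  left_join(airlines, by = "carrier")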

2.8 Summarizing and grouping

Tue 9/26

Announcements:

  • Strategies for problem sets:
    • Please read chalk talk notes first. They are either
      • Executive summaries of important ideas, or
      • Slow, careful explanations of the more difficult concepts
    • When ggplot()ing, focus on the grammar of graphics first (the what), and then the ggplot2 code after (the how). You don’t want to conflate the what and the how, or you’ll be doubly confused! Lay out a plan of attack on paper, blackboard, whiteboard, etc., first! (A small sketch of this strategy follows this list.)
      1. Figure out based on chart from Chalk Talk 2.5 which plot is appropriate given your data.
      2. Sketch out what you want your plot to look like first!
      3. Create the table that shows the mapping of variables in your data frame to aes()thetic attributes
      4. Find an example in the book that looks similar, and cut/paste/modify code!
  • Example:
    CSV
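
As promised above, here is what steps 3 and 4 of the plotting strategy might look like; a made-up sketch using the gapminder data, not an assigned problem:

library(ggplot2)
library(gapminder)

# Step 3: mapping of data frame variables to aes()thetic attributes:
#   gdpPercap -> x
#   lifeExp   -> y
#   continent -> color
# Step 4: copy/paste/modify example code from the book:
ggplot(gapminder, aes(x=gdpPercap, y=lifeExp, color=continent)) +
  geom_point()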

Topics:

  • Chalk talk
    • Example of above strategy
    • Learning check: and/or operators 5.1
    • Graphical illustration of filter(), summarize(), and group_by()
  • Tech time:

Images

  • Only summarizing (summarizing rows): CSV
  • group_by() first (in this case by color: grey, blue, green), then summarize()ing for each color separately: CSV
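
In code, the distinction looks roughly like this; a sketch using nycflights13 in place of the colored rows in the images:

library(dplyr)
library(nycflights13)

# Only summarizing: all rows collapse into a single row
flights %>%
  summarize(mean_delay = mean(dep_delay, na.rm=TRUE))

# group_by() first, then summarizing: one row per carrier
flights %>%
  group_by(carrier) %>%
  summarize(mean_delay = mean(dep_delay, na.rm=TRUE))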

2.7 Piping and filtering

Mon 9/25

Topics:

  • Chalk talk:
    • Learning checks: ID vs measurement variables 4.4
    • Function inputs/outputs, five main verbs (5MV), 5MV#1: filtering rows.
  • Tech time:

Images

filtering (subsetting rows):

CSV
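
In code, 5MV#1 looks like this; a sketch, assuming the nycflights13 package from the problem sets:

library(dplyr)
library(nycflights13)

# filter(): keep only the rows where the condition is TRUE
flights %>%
  filter(dest == "PDX")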


Problem set 4

Information

  • Assigned on Thu 9/21, due on Thu 9/28 11:30am.

You will get practice with more of the 5NG and begin to “think with data”.

  • Deliverables:
    1. Complete the following survey on the two DataCamp courses you took. This survey is considered part of the problem set.
    2. Code component:
      • Have a file called PS04_YourFirstName_YourLastName.R (Ex: PS04_Albert_Kim.R) ready to submit. Create a new .R file, save it with the appropriate name as indicated above, and cut and paste the content below. You should be cutting and pasting the commands that create your graphs into this file!
      • To be submitted on Slack in #ps04 channel, which I will create on Thu 9/28 11:30am.
    3. Textbook component:
      • Stats, Data, and Models 4th edition:
        • Chapter 2 (p.36): #4, #28 (add part d: sketch a “dodged barchart” showing this data).
        • Chapter 3 (p.76): #20 parts a) and b) only.
        • Chapter 4 (p.107): #31
      • To be submitted on paper in class. No emailed submissions will be accepted.
  • Tips:
    • Starting this problem set we will ramp things up; this will take longer.
    • When working in RStudio, View()ing any data frame first is always a good idea.
# Problem Set 04

# Please Indicate
# -Your name:
# -Who you collaborated with:
# -Approximately how much time did you spend on this problem set:
# -What, if anything, gave you the most trouble:

# Needed packages. Make sure you have all these installed.
library(ggplot2)
library(dplyr)
library(fivethirtyeight)
library(nycflights13)
library(gapminder)
library(tidyr)

# Question 1: Departure delays for NYC domestic flights --------------------

# a) The following bit of code takes the nycflights13 flights data and restricts
# it to those for United (UA) and American (AA)
flights_subset <- flights %>%
  filter(carrier == "UA" | carrier == "AA")

# Create a visualization that compares the distribution of departure delays for
# both carriers.

# b) Based on what you see, who tends to have higher departure delays? United
# or American?


# Question 2: Drinks ------------------------------------------------------
# Read the 538 articles corresponding to the drinks data frame linked in the
# help file:
?drinks

# Let's look at the data set drinks from the fivethirtyeight package.
View(drinks)

# a) Which 3 countries drink the most total litres of pure alcohol according to
# this data? No need to show your work here.

# b) The following bit of code uses the gather() function from the tidyr package
# to convert the data to tidy format and then arranges it by country, and then
# only considers the US, Canada, UK. Don't worry about understanding how this
# works for now, just run it and View() the new data frame.
drinks_tidy <- drinks %>%
  gather(type, servings, -c(country, total_litres_of_pure_alcohol)) %>%
  filter(country %in% c("USA", "Canada", "United Kingdom"))
View(drinks_tidy)

# Create an appropriate visualization that shows the # of servings of beer,
# wine, spirits consumed by each of the 3 countries listed above

# c) Who drinks the most wine of the group?


# Question 3: Gapminder ----------------------------------------------------
# Let's create a plot similar to the one we saw in Chapter 3 of ModernDive:
# https://ismayc.github.io/moderndiver-book/3-viz.html#components-of-the-grammar

# Specifically, in the TED talk "The best stats you've ever seen" at
# https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen
# Hans Rosling (RIP 2017) presents human and international development data. The
# data seen in the video is accessible in the gapminder data set within the
# gapminder package
View(gapminder)

# The next bit of code
# -Filters the data set to only consider observations for two years: 1952 & 2007
# -Renames the "pop" variable to "population"
gapminder <-
  filter(gapminder, year==1952 | year==2007) %>%
  rename(population=pop)

# a) Recreate the scatterplot of "Child Survival (%)" over "GDP per capita ($)"
# seen at 12:00 in the video, but
# * Making a comparison between 1952 and 2007
# * Displaying "life expectancy" instead of "Child Survival"
# Do so by changing the code snippet below.
ggplot(data=DATASETNAME, aes(AES1=VAR1, AES2=VAR2, AES3=VAR3, AES4=VAR4)) +
  geom_point() +
  facet_wrap(~VAR5) +
  scale_x_log10() +
  labs(x="WRITE LABEL HERE", y="WRITE LABEL HERE", title="WRITE TITLE HERE")

# I suggest you leave the above template untouched, copy and paste the above
# 5 lines of code below:

# b) Describe two facts that would be of interest to international development
# organizations.

Written solutions

Written solutions are here.

Example code discussion/solutions

library(ggplot2)
library(dplyr)
library(fivethirtyeight)
library(nycflights13)
library(gapminder)
library(tidyr)

Question 1: Departure delays for NYC domestic flights

a) The following bit of code takes the nycflights13 flights data and restricts it to those for United (UA) and American (AA).

flights_subset <- flights %>%
  filter(carrier == "UA" | carrier == "AA")

# Create a visualization that compares the distribution of departure delays for
# both carriers.
ggplot(flights_subset, aes(x=dep_delay)) +
  geom_histogram() +
  facet_wrap(~carrier)

ggplot(flights_subset, aes(x=carrier, y=dep_delay)) +
  geom_boxplot()

b) Based on what you see, who tends to have higher departure delays? United or American?

It’s hard to see because of the outliers! But if you zoom in on the plot, we see that United has a larger median delay.

If I had to choose based on this data? I’d go with American!
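
If you don’t trust your eyeballs, here is a quick numerical check; a sketch, noting that dep_delay has missing values, hence the na.rm=TRUE:

flights_subset %>%
  group_by(carrier) %>%
  summarize(median_delay = median(dep_delay, na.rm=TRUE))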

Question 2: Drinks

a) Which 3 countries drink the most total litres of pure alcohol according to this data? No need to show your work here.

You can just View the data and sort in the sheet!

View(drinks)

Belarus, Lithuania, and Andorra!
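
If you prefer code to clicking, arrange() from dplyr gets you the same answer; a sketch:

drinks %>%
  arrange(desc(total_litres_of_pure_alcohol)) %>%
  head(3)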

b) The following bit of code uses the gather() function from the tidyr package to convert the data to tidy format and then arranges it by country, and then only considers the US, Canada, UK. Don’t worry about understanding how this works for now, just run it and View() the new data frame.

drinks_tidy <- drinks %>%
  gather(type, servings, -c(country, total_litres_of_pure_alcohol)) %>%
  filter(country %in% c("USA", "Canada", "United Kingdom"))

Create an appropriate visualization that shows the # of servings of beer, wine, spirits consumed by each of the 3 countries listed above

ggplot(drinks_tidy, aes(x=country, y=servings, fill=type)) +
  geom_col(position="dodge")

c) Who are the wine snobs of the group?

I should have said “who drinks the most wine?” The United Kingdom!

Question 3: Gapminder

Let’s create a plot similar to the one we saw in Chapter 3 of ModernDive.

Specifically, in the TED talk “The best stats you’ve ever seen” at https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen, Hans Rosling (RIP 2017) presents human and international development data. The data seen in the video is accessible in the gapminder data set within the gapminder package.

The next bit of code

  • Filters the data set to only consider observations for two years: 1952 & 2007
  • Renames the “pop” variable to “population”
gapminder <-
  filter(gapminder, year==1952 | year==2007) %>%
  rename(population=pop)

a) Recreate the scatterplot of “Child Survival (%)” over “GDP per capita ($)” seen at 12:00 in the video, but

  • Making a comparison between 1952 and 2007
  • Displaying “life expectancy” instead of “Child Survival”
ggplot(data=gapminder, aes(x=gdpPercap, y=lifeExp, col=continent, size=population)) +
  geom_point() +
  facet_wrap(~year) +
  scale_x_log10() +
  labs(x="GDP per capita US$", y="Life expectancy", title="Life expectancy over GDP 1952 vs 2007")

b) Describe two facts that would be of interest to international development organizations.

I would say:

  • The relationship between GDP and life expectancy is positive, as one would expect.
  • Alas, Africa hasn’t made gains in GDP and life expectancy as significant as Asia’s.
  • Look at China and India, the two biggest green dots, jump!

2.6 Tidy data formatting

Thu 9/21

Announcements:

  • Problem set submission via Slack at 11:30.
  • Amherst College Department of Mathematics and Statistics Amherst College Statistics Colloquia tonight.
    • Title: Lessons for official statistics production around the world from the experience of Greece
    • Speaker: Andreas Georgiou ’83, Visiting Lecturer in Economics Amherst College
    • When/where: Thursday, September 21, 2017 7:00 pm in Seeley Mudd 206.
    • Abstract: The seminar will address what lessons can be gleaned from the experience with official statistics production in Greece, both in the period up to the onset of the debt crisis and since then. The seminar will discuss lessons about the appropriate institutional setting, legal framework, statistical culture and statistical practices.

Topics:

  • Chalk talk:
    • Learning checks: Barplots 3.33
    • Tidy data format and spreadsheets + Comma Separated Values (CSV) files. CSV’s are all stuff, no fluff:
      CSV
  • Tech time:
    • Go over Chapter 4 of ModernDive sections, but
      • Skip 4.3 on converting from wide format to long/tidy format
      • Skip 4.5 on normal forms of data
    • Corresponding chapters in “Stats, Data, & Models” (SDM): Chapter 1.

2.5 Barplots + Summary of 5NG

Tue 9/19

Announcements:

  • Now we’re done with the Five Named Graphs (5NG), but there are dozens of other possible plots! See here for a pretty exhaustive list!

Topics:

  • Chalk talk:
    • 2.5 Barplots + Wrapup of 5NG and what each is showing.
    • Learning checks: Boxplots 3.22
  • Tech time:
    • Read Chapters [3.8.1, end] of ModernDive. The DataCamp exercises are not quite ready for primetime, so you may skip these.

2.4 Boxplots

Mon 9/18

Topics:

  • Chalk talk: 2.4 Boxplots & aggregated/unaggregated tables
  • Tech time:
    • Learning checks:
      • Histograms 3.17 from last time.
      • Facets 3.20
    • Read Chapters 3.7 and the beginning of Chapter 3.8 (but stop just before 3.8.1) of ModernDive.

Problem set 3

Information

FiveThirtyEight

Assigned on Thu 9/14, due on Thu 9/21 11:30am. The submission method will be announced then.

Baby’s first data visualizations! We’re going to use data from the fivethirtyeight package, which is data corresponding to the articles on http://fivethirtyeight.com/.

  • Deliverables:
    • Have a file called PS03_YourFirstName_YourLastName.R (Ex: PS03_Albert_Kim.R) ready to submit on Thursday 9/21 11:30am
  • Tips: I can’t emphasize these enough, especially early on when learning to code:
    • Computers are stupid!
    • Do not code from scratch; copy/paste something that works and tweak it!
    • Learning to code is like learning a new language! It’s hard at first, but you get better with practice!
    • Do not spin your wheels! Work in groups, go to the weeknightly 7pm-9pm drop-in tutoring in Merrill Science Center 300B, and/or come to office hours.
  • Create a new .R file, save it with the appropriate name as indicated above, and cut and paste the following content. You should be cutting and pasting the commands that create your graphs into this file!
# Problem Set 03
library(ggplot2)
library(dplyr)
library(fivethirtyeight)

# Question 1 --------------------------------------------------------------
# Let's consider the bechdel data frame that's included in the fivethirtyeight
# package:
View(bechdel)

# Read the article on fivethirtyeight.com linked in the help file for the data
# frame
?bechdel

# Graphic creation: Create a scatterplot where the
# -x-axis represents the film's budget
# -y-axis represents the film's domestic gross
# -color represents whether or not the film passed the Bechdel Test

# Interpretation: Your grandma looks over your shoulder and sees the resulting
# plot. She asks "What does this plot ultimately mean?" What do you tell her?
# Write your answer below:


# Question 2 --------------------------------------------------------------
# Let's consider the US_births_1994_2003 data frame that's included in the
# fivethirtyeight package:
View(US_births_1994_2003)

# Read the article on fivethirtyeight.com linked in the help file for the data
# frame
?US_births_1994_2003

# Let's not consider all years between 1994 and 2003, but rather only 1999.
# Don't worry if you don't understand the two lines of code below; we'll see
# this in Chapter 5 of ModernDive: Data Wrangling
US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)
View(US_births_1999)

# Graphic creation: Create a "time series" plot of the number of births for
# each day in 1999

# Interpretation: Some one asks you: "Why is there that spike of more than
# 14,000 births in the second half of the year? What's going on?" What do you
# say? Write your answer below:


# Question 3 --------------------------------------------------------------
# Let's consider the airline_safety data frame that's included in the
# fivethirtyeight package:
View(airline_safety)

# Read the article on fivethirtyeight.com linked in the help file for the data
# frame
?airline_safety

# Graphic creation: Create a histogram of the number of incidents for all
# airlines between the years 2000 and 2014.

# Interpretation: Someone asks: "What is the typical number of incidents
# airlines had between 2000 and 2014?" What do you say? Write your
# answer below:

Example discussion/solutions

library(ggplot2)
library(dplyr)
library(fivethirtyeight)

Question 1: bechdel data frame

ggplot(bechdel, aes(x=budget, y=domgross, col=binary)) +
  geom_point()

Interpretation: Your grandma looks over your shoulder and sees the resulting plot. She asks “What does this plot ultimately mean?” What do you tell her? Write your answer below:

Discussion: There is a positive relationship between budget and domestic gross, probably due to bigger budget films also having bigger marketing and distribution budgets. It’s kinda hard to tell from this plot if this relationship between budget and domestic gross is “significantly different” between movies that pass the Bechdel test and those that don’t. We’ll study this more when we see regression and statistical inference. Here is a preview!
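
One way to make such a preview yourself is to overlay a regression line for each group; a sketch using geom_smooth(), which we’ll cover properly in the regression unit:

ggplot(bechdel, aes(x=budget, y=domgross, col=binary)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)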

Question 2: US births

# Let's not consider all years between 1994 and 2003, but rather only 1999.
# Don't worry if you don't understand the two lines of code below; we'll see
# this in Chapter 5 of ModernDive: Data Wrangling
US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)

# Graphic creation: Create a "time series" plot of the number of births for
# each day in 1999
ggplot(US_births_1999, aes(x=date, y=births)) +
  geom_line()

You should really be using a linegraph here and not a scatterplot, because sequential observations are related to each other, i.e., they are connected. Linegraphs are suggestive of this.

Interpretation: Some one asks you: “Why is there that spike of more than 14,000 births in the second half of the year? What’s going on?” What do you say? Write your answer below:

year  month  date_of_month  date        day_of_week  births
1999   9      9             1999-09-09  Thurs        14540
1999  12     21             1999-12-21  Tues         13508
1999   9      8             1999-09-08  Wed          13437
1999   9     21             1999-09-21  Tues         13384
1999   9     28             1999-09-28  Tues         13358
1999   7      7             1999-07-07  Wed          13343

The spike in births on Sept 9, 1999 is probably not due to chance; there is probably something going on. How do you write Sept 9, 1999? 9/9/99! People were inducing births deliberately on that day!
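
The table of top birth days above can be reproduced by sorting; a sketch using arrange() from dplyr:

US_births_1999 %>%
  arrange(desc(births)) %>%
  head(6)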

Question 3: Airline safety:

We can play with the binning structure either with binwidth or bins:

ggplot(airline_safety, aes(x=incidents_00_14)) +
  geom_histogram(binwidth = 2)

ggplot(airline_safety, aes(x=incidents_00_14)) +
  geom_histogram(bins = 12)

Interpretation: Someone asks: “What is the typical number of incidents airlines had between 2000 and 2014?” What do you say? Write your answer below:

“Typical” can mean lots of different things. It can be

  • the mode: the most frequently occurring value, in this case values near 0
  • the mean: the average
  • the median: the middle value that splits the data 50/50.

In my opinion, I would say that most airlines didn’t have that many incidents, so the typical value is 0. Thank heavens!
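
For the record, the mean and median are one summarize() away; a sketch:

airline_safety %>%
  summarize(mean_incidents = mean(incidents_00_14),
            median_incidents = median(incidents_00_14))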


2.3 Histograms and facets

Thu 9/14

Announcements:

  • “Gah! I can’t tell who you are from your Slack names!” Please go to: “STAT 495 Amherst…” on top left of Slack -> “Preferences” -> “Messages & Media” -> “Names” -> Click the “Full & display names” radio button.
  • Convenient button in RStudio: press “Up” button in Console to scroll thru history.

Topics:

  • Chalk talk 2.3
  • Tech time:
    • Be sure to always load the packages outlined at the beginning of chapter in ModernDive.
    • Create new “R Script” i.e. a “scratch pad”
    • Read Chapters [3.5, 3.6] of ModernDive on histograms and facets, completing all Learning Checks along the way.
    • Start PS03 below!

2.2 Scatterplots and linegraphs

Tue 9/12

Announcements:

  • Syllabus finalized, so please read it!
    • Midterm dates
    • Topic schedule (posted above also)
    • New office hours location: Seeley Mudd 208 (lounge)
  • Learning check discussions/solutions from “Lec 1.2 Data and R packages” readings are embedded inside book (See Chapter 2.4.3 of ModernDive for example).

Topics:

  • Chalk talk 2.2: The “Five Named Graphs” (5NG) and example of ggplot() function to create plot from last time.
  • Tech time: Read Chapters [3.3, 3.4] of ModernDive on scatterplots and linegraphs, completing all Learning Checks along the way. Be sure to always load the packages outlined at the beginning of chapter.

Problem set 2

Information

DataCamp

Assigned on Mon 9/11, due on Thu 9/14 11:30am.

  1. Complete the following chapters from the “Intermediate R” DataCamp course:
    • Chapter 1 Conditionals and Control Flow
    • Skip: Chapter 2 Loops
    • Chapter 3 Functions
    • Skip rest: Chapters 4-5
  2. There is nothing to submit as DataCamp logs your completion.

2.1 Grammar of Graphics

Mon 9/11

Announcements:

  • Math/stat table 12-1:30 in Terrace Room A Valentine Hall
  • Raffle tickets during drop-in hours. Something called “R Markdown Fest/Workshop” will be taking place; at the moment, participate only if you’re curious as we’ll cover this later. You can submit raffle tickets Monday and Tuesday:
    • 4-6pm (just this week on M & Tu)
    • 7-9pm (Normal S, M, Tu, W, Th office hours)
  • Problem sets will be posted/due Thursdays, so problem set 2 is the last out-of-sync problem set.
  • If you were spinning your wheels with DataCamp, come to office hours or drop-in hours.
  • Syllabus will be finalized by tomorrow, including
    • Midterm dates
    • Topic schedule
  • Today we start topic 2: Data Science!

Topics:

  • Take a look at the graph in Section 3.1.2 of ModernDive
  • Chalk talk: 2.1 Grammar of graphics
  • Tech time: Read Chapters [3, 3.2] of ModernDive

1.2 Data and R packages

Thu 9/7

Announcements:

  • Drop-in tutoring hours posted on syllabus
  • Refresher on Slack notifications

Topics:

  • Chalk talk:
    • 1.2 R vs RStudio, coding
    • 1.2 R packages
    • 1.2 Data
  • Tech time:
    • Read Chapters [2.3, 2.5] of ModernDive
    • Do the Learning Checks in Chapter 2.4. These are in-class exercises that will not be collected, but discussions/solutions will be posted by the start of the next lecture.
    • (Optional reading) Corresponding chapters in “Stats, Data, & Models” (SDM) 4th edition: Chapter 1.

Problem set 1

Information

DataCamp

Assigned on Tue 9/5, due on Mon 9/11 11:30am.

  1. Watch
  2. Accept the invitation from DataCamp and complete the following chapters from the “Introduction to R” DataCamp course:
    • Chapter 1 Intro to basics
    • Chapter 2 Vectors
    • Skip: Chapter 3 Matrices
    • Chapter 4 Factors
    • Chapter 5 Data frames
    • Skip: Chapter 6 Lists
  3. There is nothing to submit as DataCamp logs your completion.

1.1 Syllabus and background

Tue 9/5

Intro to Slack

  • Slack is a medium for communication that has many pros (and some cons)
  • Student feedback
  • Two ways to use it. Try both for a few lectures and stick with what you prefer
    • Via the Desktop App
    • Via the browser

Tech time

Slack

  • Chalk talk: 1.1 Slack
  • Exercise:
    • Send me a direct message (DM)
    • Practice sending group DM with seatmates
  • Keys for Slack success:

Toolbox

  • R and RStudio
  • DataCamp: an online interactive environment for learning data science (currently via R and Python).
    • Student feedback from a previous intro course.
    • Note: DataCamp is not a tool for long-term retention, but for introduction.

Tech time

  • Read Chapters [2, 2.2] of ModernDive (Chapters 2 thru 2.2 inclusive, but not Chapter 2.3):
    • Understand the difference between R and RStudio
    • Install them on your machines
    • Chapter 2.2 introduces what you’ll be learning in problem set 1
  • Accept the invitation in your Amherst emails to the DataCamp group “STAT/MATH 135 Intro to Stats via Modeling (Fall 2017)”