Schedule

The lectures are numbered according to the following rough topic outline:

  1. Getting started (orange): R, RStudio, R packages
  2. Data science (pink): Data visualization via ggplot2, “tidy” data format, and data wrangling via dplyr
  3. Data modeling (blue): Correlation, basic regression, and multiple regression
  4. Statistical background (green): Random assignment for causal inference, random sampling for statistical inference, sampling distributions, and standard errors
  5. Statistical inference (yellow): Confidence intervals, hypothesis testing, inference for regression, and other inferential methods

Lec 2.9

In-class data wrangling exercise

Note: This is a bit harder than what will be expected of you in this course, but it is a nice, all-encompassing, real-data example.

  1. Run the starter code below to view the data frame of interest for today. (Note that joins are not a required topic in this class; if you’re interested in learning how to join two datasets, read the optional ModernDive Section 5.8.)
  2. In groups of 2-3, sketch out on the board the steps needed to wrangle a data frame of the available seat miles for each airline in 2013, arranged in descending order. Available seat miles is a measure of airline capacity.

Hints:

  1. Take a close look at the data at your disposal in View(flights_plus).
  2. Sketch out what your end goal is roughly going to look like and then fill in all steps in between. This way you won’t confuse what you are trying to do (the algorithm) with how you are going to do it (writing code).
  3. Recall the 5 main verbs (5MV) for data wrangling we’ve seen so far:
    1. filter() rows
    2. summarize() many values to one using a summary statistic function like mean(), median(), etc.
    3. group_by() to add grouping meta-data
    4. mutate() existing variables to create new ones
    5. arrange() the rows of a data frame in order of one of the variables
    6. Extra: select() specific columns (i.e. variables) of your dataset.
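
Chained together with the pipe %>%, these verbs typically combine as follows. This is a generic sketch only; my_data and all the variable names in it are hypothetical placeholders:

# Generic wrangling pipeline sketch (hypothetical names):
my_data %>%
  filter(some_condition) %>%           # keep only the rows you need
  mutate(new_var = var1 * var2) %>%    # create new variables from existing ones
  group_by(some_category) %>%          # add grouping meta-data
  summarize(total = sum(new_var)) %>%  # collapse each group to one summary row
  arrange(desc(total))                 # sort the result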

Starter Code:

library(dplyr)
library(nycflights13)
# Don't worry if you don't fully understand what's going on here.
flights_plus <- flights %>% 
  left_join(planes, by = "tailnum") %>% 
  rename(year = year.x, year_manufactured = year.y) %>% 
  left_join(airlines, by = "carrier") %>% 
  filter(!is.na(seats))
View(flights_plus)

Quick Solutions

Based on the pseudocode we wrote in class:

flights_plus %>% 
  filter(year == 2013) %>% 
  select(name, seats, distance) %>% 
  mutate(flight_ASM = seats * distance) %>% 
  group_by(name) %>% 
  summarize(total_ASM = sum(flight_ASM)) %>% 
  arrange(desc(total_ASM))
name                          total_ASM
United Air Lines Inc.       15516377526
Delta Air Lines Inc.        10532885801
JetBlue Airways              9618222135
American Airlines Inc.       3677292231
US Airways Inc.              2533505829
Virgin America               2296680778
ExpressJet Airlines Inc.     1817236275
Southwest Airlines Co.       1718116857
Endeavor Air Inc.             776970310
Hawaiian Airlines Inc.        642478122
Alaska Airlines Inc.          314104736
AirTran Airways Corporation   219628520
Frontier Airlines Inc.        184832280
Mesa Airlines Inc.             20163632
Envoy Air                       7162420
SkyWest Airlines Inc.           1299835

Detailed Solutions

First, let’s keep only the rows and variables we’re interested in. This lets us spend less mental energy looking at the data since we’ve eliminated all rows and columns we don’t need. Note, however, that since flights only contains 2013 flights to begin with, the filter(year == 2013) step was not strictly necessary.

flights_plus %>%
  filter(year == 2013) %>%
  select(name, seats, distance)
## # A tibble: 284,170 x 3
##    name                     seats distance
##    <chr>                    <int>    <dbl>
##  1 United Air Lines Inc.      149     1400
##  2 United Air Lines Inc.      149     1416
##  3 American Airlines Inc.     178     1089
##  4 JetBlue Airways            200     1576
##  5 Delta Air Lines Inc.       178      762
##  6 United Air Lines Inc.      191      719
##  7 JetBlue Airways            200     1065
##  8 ExpressJet Airlines Inc.    55      229
##  9 JetBlue Airways            200      944
## 10 JetBlue Airways            200     1028
## # ... with 284,160 more rows

Let’s calculate each individual flight’s available seat miles and save this in a new variable flight_ASM. Notice how there are still 284,170 rows in the resulting data frame, one for each flight:

flights_plus %>%
  filter(year == 2013) %>%
  select(name, seats, distance) %>%
  mutate(flight_ASM = seats * distance)
## # A tibble: 284,170 x 4
##    name                     seats distance flight_ASM
##    <chr>                    <int>    <dbl>      <dbl>
##  1 United Air Lines Inc.      149     1400     208600
##  2 United Air Lines Inc.      149     1416     210984
##  3 American Airlines Inc.     178     1089     193842
##  4 JetBlue Airways            200     1576     315200
##  5 Delta Air Lines Inc.       178      762     135636
##  6 United Air Lines Inc.      191      719     137329
##  7 JetBlue Airways            200     1065     213000
##  8 ExpressJet Airlines Inc.    55      229      12595
##  9 JetBlue Airways            200      944     188800
## 10 JetBlue Airways            200     1028     205600
## # ... with 284,160 more rows

Now let’s add grouping “meta-data”. Note the Groups: name [16] meta-data at the top of the data frame output, indicating that the rows are grouped by name, of which there are 16 possible values (one for each carrier). Also note that the data itself has not changed; we still have 284,170 rows:

flights_plus %>%
  filter(year == 2013) %>%
  select(name, seats, distance) %>%
  mutate(flight_ASM = seats * distance) %>%
  group_by(name)
## # A tibble: 284,170 x 4
## # Groups:   name [16]
##    name                     seats distance flight_ASM
##    <chr>                    <int>    <dbl>      <dbl>
##  1 United Air Lines Inc.      149     1400     208600
##  2 United Air Lines Inc.      149     1416     210984
##  3 American Airlines Inc.     178     1089     193842
##  4 JetBlue Airways            200     1576     315200
##  5 Delta Air Lines Inc.       178      762     135636
##  6 United Air Lines Inc.      191      719     137329
##  7 JetBlue Airways            200     1065     213000
##  8 ExpressJet Airlines Inc.    55      229      12595
##  9 JetBlue Airways            200      944     188800
## 10 JetBlue Airways            200     1028     205600
## # ... with 284,160 more rows

Let’s now summarize() each group using the sum() function. We now have only 16 rows, one for each carrier:

flights_plus %>%
  filter(year == 2013) %>%
  select(name, seats, distance) %>%
  mutate(flight_ASM = seats * distance) %>%
  group_by(name) %>%
  summarize(total_ASM = sum(flight_ASM))
## # A tibble: 16 x 2
##    name                          total_ASM
##    <chr>                             <dbl>
##  1 AirTran Airways Corporation   219628520
##  2 Alaska Airlines Inc.          314104736
##  3 American Airlines Inc.       3677292231
##  4 Delta Air Lines Inc.        10532885801
##  5 Endeavor Air Inc.             776970310
##  6 Envoy Air                       7162420
##  7 ExpressJet Airlines Inc.     1817236275
##  8 Frontier Airlines Inc.        184832280
##  9 Hawaiian Airlines Inc.        642478122
## 10 JetBlue Airways              9618222135
## 11 Mesa Airlines Inc.             20163632
## 12 SkyWest Airlines Inc.           1299835
## 13 Southwest Airlines Co.       1718116857
## 14 United Air Lines Inc.       15516377526
## 15 US Airways Inc.              2533505829
## 16 Virgin America               2296680778

And finally, arrange() in descending order of total_ASM:

flights_plus %>%
  filter(year == 2013) %>%
  select(name, seats, distance) %>%
  mutate(flight_ASM = seats * distance) %>%
  group_by(name) %>%
  summarize(total_ASM = sum(flight_ASM)) %>%
  arrange(desc(total_ASM))
## # A tibble: 16 x 2
##    name                          total_ASM
##    <chr>                             <dbl>
##  1 United Air Lines Inc.       15516377526
##  2 Delta Air Lines Inc.        10532885801
##  3 JetBlue Airways              9618222135
##  4 American Airlines Inc.       3677292231
##  5 US Airways Inc.              2533505829
##  6 Virgin America               2296680778
##  7 ExpressJet Airlines Inc.     1817236275
##  8 Southwest Airlines Co.       1718116857
##  9 Endeavor Air Inc.             776970310
## 10 Hawaiian Airlines Inc.        642478122
## 11 Alaska Airlines Inc.          314104736
## 12 AirTran Airways Corporation   219628520
## 13 Frontier Airlines Inc.        184832280
## 14 Mesa Airlines Inc.             20163632
## 15 Envoy Air                       7162420
## 16 SkyWest Airlines Inc.           1299835

Lec 2.8

Baby names trend analysis (plot presented in class).


Problem set 4

Overview

  • Assigned Tuesday 2/13, due Tuesday 2/20
  • Three components:
    1. DataCamp: Due at 9am. Two chapters and a feedback survey.
    2. Written component: To be handed in on paper at the start of lecture; no email submissions will be accepted.
    3. Code component: Due on DropBox at 9am; no submissions will be accepted after 9am. In the off chance there is a problem with the DropBox link, send me your problem set via Slack direct message before 9am.
  • Learning goals:
    1. Practice, practice, practice data wrangling.
  • Hints:
    1. Coding component: Read Chapter 5 in ModernDive and complete the DataCamp portion first, then do the coding component. In particular, see the new Subsection 5.5.1 on using group_by() with more than one variable.
    2. Coding component: Sketch out your “algorithm” on paper first and only then start coding, just as we did in the in-class exercise for Lec 2.9 involving available seat miles.
    3. DataCamp: We’ve seen that one way to add a title is to add a labs(title = "TITLE TEXT") layer to a plot. A second way is to add a ggtitle("TITLE TEXT") layer to a plot; see the sketch below.
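
For example, here are both ways applied to the same scatterplot. This is a sketch using the mpg dataset that ships with ggplot2, not one of the problem set datasets:

library(ggplot2)
# Way 1: a labs() layer
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  labs(title = "TITLE TEXT")
# Way 2: a ggtitle() layer; the resulting plot is identical
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  ggtitle("TITLE TEXT")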

1. DataCamp

  1. Login to DataCamp, in the top navbar click Groups -> STAT/MATH 135 … -> My Assignments -> Complete the 2 newly listed chapters:
    1. Introduction to the Tidyverse: Grouping and summarizing (do this one first)
    2. Introduction to the Tidyverse: Types of visualizations
  2. After you’re done with the above, fill out this Feedback Survey on the “Introduction to the Tidyverse” DataCamp courses from PS03 and PS04, comparing them to the “Intro to R” and “Intermediate R” DataCamp courses from PS01 and PS02.

2. Written component

First, read the following links to get an idea of the context of the datasets you’ll work with in the code component.

Then answer the following questions, some of which will require you to complete the coding component below:

  • Question 1: Babynames
    1. What makes a name “unisex” according to 538’s definition?
    2. In the “How to Tell Someone’s Age When All You Know Is Her Name” article, the Figure “Median Ages for Males with the 25 Most Common Names” involves 25 instances of a)                 (a graphic we’ve seen in class) but each without the b)                . What are a) and b)?
    3. In the “How to Tell Someone’s Age When All You Know Is Her Name” article, consider the names in the Figure “Youngest Male Names”. Which name has the largest spread (i.e. variation) in age? Compute by hand a previously seen summary statistic of this spread based on the values in this figure. In other words, you don’t need to use the babynames dataset; rather, just read off approximate values from the graphics.
    4. Repeat the above for the names in the Figure “Youngest Female Names”. (A sketch of this kind of by-hand computation follows below.)
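
As a sketch of this kind of by-hand computation: one previously seen summary statistic of spread is the interquartile range (IQR). The ages below are made-up placeholder values, not values read off 538’s figures:

# Hypothetical approximate ages read off a figure (placeholder values):
ages <- c(8, 11, 14, 19, 23)
# The IQR is the 3rd quartile minus the 1st quartile:
quantile(ages, 0.75) - quantile(ages, 0.25)
# Or equivalently, using the built-in function:
IQR(ages)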

3. Code component

  • Follow the same procedure as in the PS02 code component to create, save, and submit a PS04_YourFirstName_YourLastName.R scratch pad.
  • Submit using this DropBox link.
  • Please note that all questions in the coding component can be answered by copying/pasting/tweaking existing code in ModernDive:
    • Find something similar to what you want in ModernDive.
    • Copy and paste that code into PS04_YourFirstName_YourLastName.R.
    • Tweak the code to match what you want.
# Name: WRITE YOUR NAME HERE
# Collaboration: INDICATE WHO YOU COLLABORATED WITH FOR THIS PROBLEM SET

# Load all necessary packages:
library(ggplot2)
library(dplyr)
library(fivethirtyeight)
library(babynames)

# Question 1 --------------------------------------------------------------
# Let's load the evals data from PS03 and View() it
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- as_tibble(evals)
View(evals)

# a) In PS03 -> Coding Component -> Q1.c), you created a boxplot comparing
# teaching scores between men and women. Copy/paste/tweak a data wrangling
# example in ModernDive Chapter 5 to compute the two medians in these plots
# exactly.


# b) In PS03 -> Coding Component -> Q1.d), you created a boxplot comparing
# teaching scores between men and women, but this time faceted by ethnicity.
# Copy/paste/tweak a data wrangling example in ModernDive Chapter 5 to compute
# the four medians in these plots exactly and arrange them in descending order.



# Question 2 --------------------------------------------------------------
# Let's View() the bechdel dataset from PS03 from the fivethirtyeight package
View(bechdel)

# a) In PS03 -> Coding Component -> Q2 you created a visualization that shows
# the distribution of the variable clean_test, which is the result of the
# Bechdel test for each movie. Copy/paste/tweak a data wrangling example in
# ModernDive Chapter 5 to output a data frame showing the counts of movies for
# each of the 5 levels (i.e. outcomes) of the Bechdel test.



# Question 3 --------------------------------------------------------------
# Let's View() the babynames dataset from the babynames package
View(babynames)

# Read the help file associated with this dataset to get a sense of the
# variables included:
?babynames

# a) Create a time series plot comparing the number (not percentage for now) of
# male babies born with the name from PS04 -> Written Component -> Q1.c). You'll
# need to copy/paste/tweak code from both ModernDive Chapter 3 (data
# visualization) and 5 (data wrangling) to achieve this.

# b) Create a time series plot comparing the number of female babies born with
# the name from PS04 -> Written Component -> Q1.d). You'll need to
# copy/paste/tweak code from both ModernDive Chapter 3 (data visualization) and
# 5 (data wrangling) to achieve this.

# c) Create a time series plot comparing the number of babies born with one of
# the "unisex" names from the fivethirtyeight articles for each year but making
# sure to distinguish between the sexes. There is more than one way to do this.
# You'll need to copy/paste/tweak code from both ModernDive Chapter 3 (data
# visualization) and 5 (data wrangling) to achieve this.
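
As a starting point for Question 3, the general pattern combines a dplyr filter with a ggplot2 line plot. Here is a minimal sketch; the name "Riley" is a hypothetical placeholder for whichever name your written-component answers point to:

# Pattern sketch: number of male babies named "Riley" per year
# ("Riley" is a placeholder; swap in the name from your written component)
babynames %>%
  filter(name == "Riley", sex == "M") %>%
  ggplot(mapping = aes(x = year, y = n)) +
  geom_line() +
  labs(x = "Year", y = "Number of male births", title = "Babies named Riley")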

Lec 2.7

Baby’s first R Markdown report exercise. More on Tuesday 2/20.

  1. Download Term_Project_Proposal.Rmd from Course Webpage -> Term Project -> Section 2.3
  2. Upload to RStudio Server
  3. Click the “Knit” button. A webpage should pop up. If not, you may need to disable your browser’s pop-up blocker.
  4. Click on the blue “Publish” button
  5. Create a new account on RPubs.com
  6. Set the page’s:
    • Title: testing
    • Description: testing
    • Slug (i.e. webpage URL address): testing
  7. A webpage that you can share should show in your browser
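
For reference, an R Markdown file like Term_Project_Proposal.Rmd has roughly this anatomy. This is a generic sketch, not the actual contents of the course file:

---
title: "testing"
author: "Your Name"
output: html_document
---

This is plain Markdown text. Below is an R code chunk that runs when you knit:

```{r}
library(ggplot2)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
```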

Lec 2.6

How to import an Excel file into R:

  • Download this Excel file
  • If you are using RStudio Server (cloud version): upload the file first
  • In the Files panel -> click on the file -> Import dataset…
  • Click on Import (bottom right)
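
Behind the scenes, the Import Dataset dialog generates and runs code along these lines using the readxl package. This is a sketch; the file name below is a placeholder for whatever you named the downloaded file:

library(readxl)
# Read the first sheet of the uploaded Excel file into a data frame:
my_excel_data <- read_excel("my_file.xlsx")
View(my_excel_data)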

Problem set 3

Overview

  • Assigned Wednesday 2/7, due Tuesday 2/13
  • Three components:
    1. DataCamp: Due at 9am. Your completion of the courses will be logged automatically by DataCamp.
    2. Written component: To be handed in on paper at the start of lecture; no email submissions will be accepted.
    3. Code component: Due on DropBox at 9am; no submissions will be accepted after 9am. In the off chance there is a problem with the DropBox link, send me your problem set via Slack direct message before 9am.
  • Learning goals:
    1. DataCamp: Gentle introduction to topics from Chapter 5 on “data wrangling”, a less sinister-sounding term for “data manipulation”.
    2. More practice with data visualization, in particular histograms, boxplots, and barplots.

1. DataCamp

Login to DataCamp, in the top navbar click Groups -> STAT/MATH 135 … -> My Assignments -> Complete the 2 newly listed chapters in this order:

  1. Introduction to the Tidyverse: Data wrangling (do this one first)
  2. Introduction to the Tidyverse: Data visualization

2. Written component

First, read the following links to get an idea of the context of the datasets you’ll work with in the code component. After all: numbers are numbers, but data has context.

Then answer the following questions:

  • Question 1: Teacher evaluations
    1. What does the variable bty_avg in the evals data set represent and how is it computed?
    2. Assuming men and women are of equal teaching ability, based on your visualizations below, give an approximate numerical value of a “gender penalty” that women instructors pay on their teaching evaluation scores. This value doesn’t have to be precise; just eyeball it.
    3. Say we have four demographic groups of professors: female minorities, male minorities, female non-minorities, and male non-minorities. Rank them from lowest to highest in terms of their median teaching scores.
  • Question 2: Bechdel test
    1. What is the purpose of the Bechdel test?
    2. What are the five possible outcomes of the Bechdel test?
    3. Based on your visualization, about what proportion of the \(n\) = 1794 movies included in the dataset “passed” the Bechdel test?
  • Question 3: Drinks
    1. What is the source of the drinks data?
    2. What proportion of US alcohol servings consumption is beer?
    3. Among these three countries, what proportion of wine servings are consumed by the French?
    4. Bonus: Based on your visualization below, you should see a relationship between the categorical variables country and alcohol type: for each alcohol type, different countries consume it at different levels. Put differently, for each alcohol type, the relative amount consumed depends on which country we are considering. Now say instead, hypothetically, that consumption was independent of country. Sketch out by hand what the barplot below would look like in that case.

2. Written component solutions

  • Question 1: Teacher evaluations
    1. It is the average beauty rating of an instructor based on a panel of 6 students’ evaluations saved in the variables bty_f1lower, bty_f1upper, bty_f2upper, bty_m1lower, bty_m1upper, bty_m2upper in the evals data frame.
    2. Using the figure from Q1.c) from the coding component below, one example of a numerical “gender penalty” would be the difference in medians of about 4.3-4.1 = 0.2 points out of 5.
    3. Using the figure from Q1.d) from the coding component below and the “eyeball test”, from lowest to highest: female minorities, male minorities, and then the male and female non-minorities, which are about the same.
  • Question 2: Bechdel test
    1. A rough metric to evaluate female representation in a movie.
    2. The outcomes form a hierarchy from lowest to highest:
      • There are at least two named women.
      • They talk to each other.
      • They talk to each other about something other than a male character.
      • Some end up difficult to classify; these are lumped in as “dubious”.
    3. If you leave out the “dubious” counts, then 800/1794 = 0.4459 = 44.59%. For such a non-strict test of female representation, a little less than half!
  • Question 3: Drinks
    1. World Health Organization
    2. 249/(249+158+84) = 0.507 = 50.7%, the highest of these three countries.
    3. 370/(370+195+84) = 0.570 = 57.0%, very much as one would expect.
    4. Bonus:

Let’s compute the total consumed for each alcohol type and then the average across the 3 countries:

country         type    servings  total     avg
France          beer         127    595  198.33
United Kingdom  beer         219    595  198.33
USA             beer         249    595  198.33
France          spirit       151    435  145.00
United Kingdom  spirit       126    435  145.00
USA             spirit       158    435  145.00
France          wine         370    649  216.33
United Kingdom  wine         195    649  216.33
USA             wine          84    649  216.33
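
Here is a sketch of the wrangling that produces the table above, assuming drinks_subset has been loaded as in the code component below:

drinks_subset %>%
  group_by(type) %>%
  mutate(total = sum(servings),     # total servings of this type across the 3 countries
         avg = mean(servings)) %>%  # average servings of this type per country
  arrange(type, country)

Under the hypothetical independence scenario, within each alcohol type every country’s bar would have roughly the same height, namely this avg value.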

3. Code component

  • Follow the same procedure as in the PS02 code component to create, save, and submit a PS03_YourFirstName_YourLastName.R scratch pad.
  • Submit using this DropBox link.
  • Please note that all questions in the coding component can be answered by copying/pasting/tweaking existing code in ModernDive:
    • Find a plot in ModernDive that looks similar to the one you want.
    • Copy and paste the code that made this plot into PS03_YourFirstName_YourLastName.R.
    • Tweak the code to match what you want.
# Name: WRITE YOUR NAME HERE
# Collaboration: INDICATE WHO YOU COLLABORATED WITH FOR THIS PROBLEM SET

# Load all necessary packages:
library(ggplot2)
library(dplyr)
library(fivethirtyeight)

# Question 1 --------------------------------------------------------------
# Let's load the evals data from the web and View() it
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- as_tibble(evals)
View(evals)

# a) Visualize the distribution of the numerical variable score, which is the
# teaching score. Have the bin widths be 0.25 teaching score units.
# Do this below:

# b) Now show the same histogram as above, but for men and women separately.
# Do this below:

# c) Now compare teaching scores between men and women using a boxplot.
# Do this below:

# d) Now facet the above boxplot by teacher ethnicity.
# Do this below:


# Question 2 --------------------------------------------------------------
# Let's View() the bechdel dataset from the fivethirtyeight package
View(bechdel)

# Read the help file associated with this dataset to get a sense of the
# variables included:
?bechdel

# Create a visualization that shows the distribution of the variable clean_test,
# which is the result of the bechdel test for each movie.
# Do this below:


# Question 3 --------------------------------------------------------------
# Let's View() the drinks dataset from the fivethirtyeight package
View(drinks)

# Read the help file associated with this dataset to get a sense of the
# variables included:
?drinks

# We're going to consider a slightly modified and simplified version of this data
# in the drinks_subset data frame. Let's load the drinks_subset data from the
# web and View() it
load(url("https://rudeboybert.github.io/STAT135/static/PS/drinks_subset.RData"))
View(drinks_subset)

# Create "dodged" barplot that visualizes the distribution of the categorical
# variables type and country at the same time, where the bar colors
# represent the country.
# Do this below:

Hint: your visualization for Question 3 should look something like this, but with the heights of the bars reflecting the true serving counts in drinks_subset. Here all the heights are set to fake values of 100.


3. Code component solutions

Load all necessary packages:

library(ggplot2)
library(dplyr)
library(fivethirtyeight)

Question 1

a) Visualize the distribution of the numerical variable score, which is the teaching score. Have the bin widths be 0.25 teaching score units.

# Load evals data
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- as_tibble(evals)

ggplot(data = evals, mapping = aes(x = score)) +
  geom_histogram(binwidth = 0.25) +
  labs(x = "Teaching score", y = "Count", title = "Teaching scores at UT Austin")

We say this distribution is left-skewed: “left” because there is a tail to the left, and “skewed” because it’s not symmetric. Using the eyeball test, it seems a typical evaluation is around 4.2 out of 5. Some got 5/5, whereas a few even got around 2/5! Yikes!

b) Now show the same histogram as above, but for men and women separately. Easy, just add a facet_wrap(~gender) layer and update the title.

ggplot(data = evals, mapping = aes(x = score)) +
  geom_histogram(binwidth = 0.25) +
  labs(x = "Teaching score", y = "Count", title = "Teaching scores at UT Austin by gender") +
  facet_wrap(~gender)

We observe that there are more male instructors, but are their teaching scores higher? It’s hard to say just looking at this plot.

c) Now compare teaching scores between men and women using a boxplot.

ggplot(data = evals, mapping = aes(x = gender, y = score)) +
  labs(x = "Gender", y = "Teaching score", title = "Teaching scores at UT Austin by gender") +
  geom_boxplot()

Ah! The comparison is much easier! Men had a higher median teaching evaluation score of around 4.3, whereas women had a median teaching evaluation score of around 4.1. The lengths of the boxes seem about the same, indicating similar spreads.

d) Now facet the above boxplot by teacher ethnicity. Easy, just add a facet_wrap(~ethnicity) layer and update the title:

ggplot(data = evals, mapping = aes(x = gender, y = score)) +
  labs(x = "Gender", y = "Teaching score", title = "Teaching scores at UT Austin by gender and ethnicity") +
  geom_boxplot() +
  facet_wrap(~ethnicity)

Question 2

Create a visualization that shows the distribution of the variable clean_test, which is the result of the Bechdel test for each movie. Since clean_test is categorical, a barplot best shows its distribution. Also, bechdel is NOT pre-counted; in other words, it’s like Table 3.3 in ModernDive and not Table 3.4, and thus we need to use geom_bar():

ggplot(data = bechdel, mapping = aes(x = clean_test)) +
  geom_bar() +
  labs(x = "Bechdel test result", y = "Count", title = "Bechdel test results")

Question 3

Create a “dodged” barplot that visualizes the distribution of the categorical variables type and country at the same time, where the bar colors represent the country. Figure 3.24 is most similar to what we need. Let’s set:

  • The categorical variable type on the x-axis
  • The categorical variable country as the fill of the bars
  • The numerical variable servings on the y-axis
  • geom_col(position = "dodge"), since the counts are pre-counted and we want a “dodged” barplot, i.e. a side-by-side barplot.
# Load data
load(url("https://rudeboybert.github.io/STAT135/static/PS/drinks_subset.RData"))

ggplot(data = drinks_subset, mapping = aes(x=type, y=servings, fill=country)) +
  geom_col(position = "dodge") +
  labs(x = "Alcohol type", y = "Number of servings", title = "Alcohol consumption in USA, France, and UK")

Lec 2.5

Say the following piecharts represent results of an election poll at time points A, B, then C. At each time point we present the proportion of the poll respondents who say they will support one of 5 candidates: 1 through 5. Based on these 3 piecharts, answer the following questions:

  1. At time point A, is candidate 5 doing better than candidate 4?
  2. Did candidate 3 do better at time point B or time point C?
  3. Who gained more support between time point A and time point B, candidate 2 or candidate 4?

Now answer these questions using barplots. Much easier, as you can compare the 5 levels of the categorical variable “candidate” (1 through 5) using imaginary horizontal lines:

  1. At time point A, candidate 5 is doing better than candidate 4.
  2. Candidate 3 (green) did better at time point C.
  3. Between time point A and time point B, candidate 2 (blue) increased in support whereas candidate 4 (green) decreased.
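
Here is a minimal sketch of how such a barplot is built from pre-counted proportions. The numbers below are made-up placeholders, since the actual poll values lived in the pie chart images:

library(ggplot2)
# Hypothetical poll results (placeholder values, not the real ones):
poll <- data.frame(
  time = rep(c("A", "B", "C"), each = 5),
  candidate = factor(rep(1:5, times = 3)),
  prop = c(0.30, 0.25, 0.20, 0.15, 0.10,
           0.25, 0.30, 0.20, 0.10, 0.15,
           0.20, 0.20, 0.30, 0.15, 0.15)
)
# Values are pre-counted, so use geom_col(); faceting by time point lets you
# compare candidates across panels with imaginary horizontal lines:
ggplot(data = poll, mapping = aes(x = candidate, y = prop)) +
  geom_col() +
  facet_wrap(~time)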

Lec 2.4

Say we want to plot a boxplot of the following 12 values, which are pre-sorted:

1, 3, 5, 6, 7, 8, 9, 12, 13, 14, 15, 30

They have the following summary statistics:

Min.  1st Qu.  Median  3rd Qu.  Max.
   1      5.5     8.5     13.5    30

Let’s compare the points and the corresponding boxplot side-by-side with the values on the \(y\)-axis matching:
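
A sketch of this comparison in R. Note that fivenum() returns exactly the five summary statistics above; base R’s summary() uses a slightly different quartile rule, so its quartiles can differ slightly:

library(ggplot2)
values <- c(1, 3, 5, 6, 7, 8, 9, 12, 13, 14, 15, 30)
# Five-number summary: min, 1st quartile, median, 3rd quartile, max
fivenum(values)
## [1]  1.0  5.5  8.5 13.5 30.0
# The corresponding boxplot, with the values on the y-axis:
ggplot(data = data.frame(values), mapping = aes(x = "", y = values)) +
  geom_boxplot() +
  labs(x = "", y = "Value")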


Problem set 2

Overview

  • Assigned Wednesday 1/31, due Tuesday 2/6
  • Three components:
    1. DataCamp: Due at 9am. Your completion of these 3 courses will be logged automatically by DataCamp.
    2. Written component: To be handed in on paper at the start of lecture.
    3. Code component: Due at 9am. To be submitted electronically via this DropBox link; this link will close promptly just after 9am.
  • Learning goals:
    1. Baby’s first data analyses!
    2. Get used to problem set workflow for the semester.

1. DataCamp

Login to DataCamp, in the top navbar click Groups -> STAT/MATH 135 … -> My Assignments -> Complete the 3 newly listed courses in this order:

  1. Intermediate R: Conditionals and Control Flow
  2. Intermediate R: Loops. We won’t be seeing loops that much in this class, so it’s ok if you feel your understanding of this chapter is less complete.
  3. Intermediate R: Functions
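
For a taste of what these three chapters cover, here is a generic sketch:

# Conditionals and control flow:
x <- 7
if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is 5 or less")
}
# Loops:
for (i in 1:3) {
  print(i)
}
# Functions:
square <- function(x) {
  x^2
}
square(4)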

2. Written component

First, read the following two articles on FiveThirtyEight.com to get an idea of the context of the datasets you’ll work with in the code component. After all: numbers are numbers, but data has context.

Then answer the following questions:

  • Question 1: Hate crimes
    1. What value of the Gini index indicates perfect income equality? What value of the Gini index indicates perfect income inequality?
    2. Why are the two hate crime variables based on counts per 100,000 individuals and not based on just the count?
    3. Name some differences in how and when the FBI-reports-based and the Southern Poverty Law Center-based rates of hate crime were collected, and how this could impact any analyses.
    4. Based on the plot you’ll generate in the coding component below, answer this question: describe in one sentence the relationship between income inequality and the incidence of hate crimes in the days following the 2016 election, as reported by the Southern Poverty Law Center.
    5. When creating this plot, you’ll get a warning message: Removed 4 rows containing missing values (geom_point). Why is R returning this message?
  • Question 2: US Births
    1. Based on the plot you’ll generate in the coding component below, answer this question: why is there a seasonal pattern to the resulting time series plot of the number of births in the US in 1999?
    2. Based on the plot you’ll generate in the coding component below, answer this question: There is a clear spike in births (over 14,000) for one day in 1999. Identify this date using View(). What possible explanations are there for this anomalous spike?

2. Written component solutions

  • Question 1: Hate crimes
    1. A quick Google search will yield that the Gini index is on a scale between 0 and 1 in some cases, and between 0 and 100 in others. They are equivalent, kind of like proportions and percentages. 0 indicates perfect income equality, whereas 1 (or 100) indicates perfect inequality. In the case of our data, the 0 to 1 scale is used.
    2. Comparisons of counts between large population states (California, New York, Texas) and small states (Vermont, Montana, Delaware) would not be fair otherwise.
    3. The FBI data is voluntarily submitted, thus could suffer from volunteer bias, and only focuses on incidents that are prosecutable. The SPLC data focuses on any incident, which is a broader and at the same time more subjective definition IMO. Furthermore, the SPLC counts are based on media accounts, which could suffer from different forms of bias.
    4. Whereas the article stated there is a positive relationship, just by eyeballing the scatterplot below, it isn’t immediately apparent. Also what is the effect of that single outlier point on the top right? We’ll study methods for explicitly quantifying relationships between variables in Chapters 6 and 7 of ModernDive: Data Modeling with Regression.
    5. Because four states did not have values for the variable hate_crimes_per_100k_splc: Hawaii, North and South Dakota, and Wyoming. So this plot only has 51-4 = 47 points. Could there be a systematic reason these four states in particular had missing values? Or was it a coincidence?
    6. Run View(hate_crimes) in the console, click on hate_crimes_per_100k_splc twice to sort this variable in descending order, and you’ll see this outlier was the District of Columbia. I spoke to the author of this article Maia Majumder who said DC is a very special case: it is very small geographically and it has among the most accurate data collection of such incidents of all jurisdictions in the US. So is DC an outlier because hate crimes and inequality are really that high? Or is it because they suffer the least from underreporting of hate crimes? Hard to say…
  • Question 2: US Births
    1. A seasonal pattern means there is a cycle that follows some fixed time interval, not necessarily a “season” like winter -> spring -> summer -> fall. For example, electricity consumption follows a daily pattern, since consumption tends to peak in the evening then drop at night. Cough medicine sales follow a yearly pattern, since they peak in winter and then drop for the rest of the year. In this case, the fixed time interval is weekly: births tend to be high on weekdays, then drop on the weekends. As suggested by the article, this is most likely due to there being fewer induced births on the weekend.
    2. Run View(US_births_1999) in the console, click on births twice to sort this variable in descending order, and you’ll find the day in question is September 9, 1999, which in two-digit-year format is written 9/9/99. While I have no proof of this theory, it could be that parents deliberately wanted their kids’ birthdate to involve only the number 9!

3. Code component

  • Create a new “scratchpad” .R file as seen in class and save it as PS02_YourFirstName_YourLastName.R. So for example in my case the file name would be PS02_Albert_Kim.R.
  • Copy and paste the code below into your PS02_YourFirstName_YourLastName.R file
  • Once you’ve completed your work, go over all the code you’ve copied/pasted/tweaked and make sure it runs.
  • Download this file: Go to the Files Panel of RStudio -> Click the checkbox next to PS02_YourFirstName_YourLastName.R -> Click “More” -> “Export”.
  • Submit this file electronically via this DropBox link
# Name: WRITE YOUR NAME HERE
# Collaboration: INDICATE WHO YOU COLLABORATED WITH FOR THIS PROBLEM SET

# Load all necessary packages:
library(ggplot2)
library(dplyr)
library(fivethirtyeight)

# Question 1 --------------------------------------------------------------
# Let's consider the hate_crimes data frame that's included in the
# fivethirtyeight package:
View(hate_crimes)

# Read the help file associated with this dataset to get a sense of the
# variables included:
?hate_crimes

# Copy/paste/tweak code that will create a scatterplot with
# -y-axis: # of hate crimes per 100K, Southern Poverty Law Center, Nov 9-18 2016
# -x-axis: Gini index (measure of inequality).
# Do this below:


# Question 2 --------------------------------------------------------------
# Let's look at the US_births_1994_2003 data frame that's included in the
# fivethirtyeight package:
View(US_births_1994_2003)

# Read the help file associated with this dataset to get a sense of the
# variables included:
?US_births_1994_2003

# Let's not consider all years between 1994 and 2003, but rather only 1999.
# Let's filter for only rows where year equals 1999.
# Don't worry if you don't understand the two lines of code below; we'll see
# this in Chapter 5 of ModernDive: Data Wrangling
US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)
View(US_births_1999)

# Copy/paste/tweak code below that will create a "time series" plot of the
# number of births for each day in 1999:

3. Code component solutions

Load all necessary packages:

library(ggplot2)
library(dplyr)
library(fivethirtyeight)

Question 1

Note I added a labs() (labels) layer to the plot to have informative axis labels and a title. While not required for now, these will be stressed as the semester progresses, as they give the points context. Remember: numbers are numbers, but data has context.

ggplot(data = hate_crimes, mapping = aes(x = gini_index, y = hate_crimes_per_100k_splc)) +
  geom_point() +
  labs(x = "Gini index", y = "Hate crimes per 100K (Nov 9-18 2016)",
       title = "Hate crime incidence vs income inequality")

In a more intermediate class, you’ll learn how to replace the points with, in this case, the state abbreviations using geom_text(). Which states have the highest incidence of hate crimes in the days after the election? DC, Oregon, Washington, Maine, Minnesota, and Massachusetts. Which states had the lowest? Mississippi and Arkansas. Interesting…

Question 2

US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)

ggplot(data = US_births_1999, mapping = aes(x = date, y = births)) +
  geom_line() +
  labs(x = "Date", y = "Number of births", title = "Daily 1999 US births")

Note:

  • The peaks and valleys correspond to weekdays vs weekends. This is best seen by running View(US_births_1999).
  • The overall dip at the end of November, most likely due to US Thanksgiving.

Problem set 1

Assigned: Monday 1/22. Due: Tuesday 1/30 9am.

  1. Slack setup (25%)
  2. Administrative stuff (25%):
    • Complete this intro survey on Google Forms
    • Print, sign, date, and staple a copy of the syllabus available here
  3. DataCamp assignment (50%):
    • Click this link to create a free DataCamp account using your Amherst email. If this link won’t let you join the group, then Slack me a note along with your Amherst email.
    • In the navbar on top, click Groups -> STAT/MATH 135 … -> My Assignments -> Complete the 5 listed courses in order starting with “Intro to basics” and ending with “Data frames”.
    • Your completion of these 5 courses will be logged automatically by DataCamp. Don’t worry if you feel like things don’t stick at first; we’re focused on short-term introduction and not long-term retention for now.

Lec 1.1