Schedule


Lec 39: Fri 4/29

No lecture: Online office hours instead


Lec 38: Wed 4/27

Announcements

  1. Midterms
    1. Go over Midterm III
    2. Be sure to read Midterm IV info
    3. Be sure to check that all your proficiency scores on Moodle under Grades are correct
  2. Office hours
    1. Online: Friday during lecture time
    2. Exam week office hours posted on calendar in Syllabus
  3. Unlike textbooks that students often sell at the end of the semester, ModernDive will always be available at moderndive.com

Closing notes

  1. See Slack #general
  2. Reflection on standards-based grading and how semester went
  3. Time for course evals
  4. Thank you to Beth Brown and all Spinelli Tutors
  5. Congrats to all seniors

Lec 37: Mon 4/25

No lecture


Lec 36: Fri 4/22

Announcements

  1. Discussion on what we’ll cover next week

Today’s Topics/Activities

1. Chalk talk

  1. (If anything remains from Lec 35) LINE conditions required for theory/formula-based regression SE, HT, and CI to be valid:
    1. Linearity of relationship between variables
    2. Independence of the residuals
    3. Normality of the residuals
    4. Equality of variance of the residuals
  2. Simulation-based inference for population regression slope \(\beta_1\) with infer framework (as opposed to R’s regression tables which are based on theory/formula-based approach)

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 35: Wed 4/20

Announcements

  1. Anecdote about Harvard and UC Berkeley
  2. Don’t forget about video lectures recorded during pandemic for select topics! In far-right column in schedule above.
  3. Midterm IV during finals week information posted
  4. Resubmission phase of final project
    1. Instructions posted
    2. Basic principle: minimize ink to information ratio

Today’s Topics/Activities

1. Chalk talk

  1. Interpreting remaining columns of a regression table: standard error, (test) statistic, p-value, (95%) confidence interval
  2. How are values in table computed? Not via simulations (like with infer) but using theory/formula-based approach.
  3. (As much as we can get done today) LINE conditions required for theory/formula-based regression SE, HT, and CI to be valid:
    1. Linearity of relationship between variables
    2. Independence of the residuals
    3. Normality of the residuals
    4. Equality of variance of the residuals

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 34: Mon 4/18

Announcements

  1. Discussion about Midterm III posted
  2. By Wednesday:
    1. Instructions for final phase of final project posted
    2. Feedback delivered
  3. Repeat announcement: SDS Focus Group

Today’s Topics/Activities

1. Chalk talk

  1. Sampling scenario 4: Difference in population means \(\mu_1 - \mu_2\)
  2. Theory-based hypothesis tests
    1. t-tests
    2. Fig 9.22 compares theory-based approach and simulation-based approach
  3. Controversy around p-values
  4. Starting ModernDive Chapter 10: Inference for regression (Learning Goals 14 & 15 in blue)
    1. Refresher of simple linear regression: Fig 10.1 and Table 10.1
    2. Sampling scenario 5: Population regression slope \(\beta_1\)

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 33: Fri 4/15

Announcements

  1. SDSCC announcement in Slack #random: Alumni speed networking

Today’s Topics/Activities

1. Chalk talk

  1. Interpreting hypothesis tests:
    1. Pre-determined \(\alpha\)-significance level: false positive rate you’re willing to tolerate
    2. Two possible outcomes of HT
    3. Two types of incorrect decisions. Analogous to COVID test
    4. How do we choose \(\alpha\)

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 32: Wed 4/13

Announcements

  1. SDS will be having a focus group on Diversity, Equity, and Inclusion (DEI) in SDS on Wednesday, 4/20 at 12:15 PM in McConnell 103. This event will also be cosponsored by Smithies in SDS and SDSCC. Anyone who in an SDS major/minor or has taken a class is welcome to join, and the first half will be moderated by faculty, while SSDS and SDSCC student liaisons will take over the rest.

Today’s Topics/Activities

1. Chalk talk

  1. Originally skipped in Lec 28: MD 8.7.2 (LG 11)
    1. Theory-based method for computing standard errors (instead of simulations): PS07 Q10 table
    2. Theory-based method for constructing confidence intervals (instead of simulations): Assuming normality
  2. Only handout for the semester: Example relevant to LG7, LG8, and LG9 on sampling: accuracy vs precision and representativeness in the statistical sense, sampling methodology, bias

Lec 31: Mon 4/11

Announcements

Today’s Topics/Activities

1. Chalk talk

  1. Discussion on Midterm II
  2. Conducting hypothesis testing with the infer package

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 30: Fri 4/8

Announcements

  1. Copy code below into your classnotes.Rmd. It loads the data from the Google Sheet you filled in last lecture

Today’s Topics/Activities

1. Chalk talk

  1. Putting terminology and a framework describing what you just did
  2. Fig 9.9: infer package for hypothesis testing
library(tidyverse)

# Load data:
resumes <- 
  "https://docs.google.com/spreadsheets/d/e/2PACX-1vTlmubr007CcPwa3j-w1CydqNAQkq_cqA-QNxnjpSFno7OjCi8lW0noTgVM1-Fn9mtZ29DYd7PaC32I/pub?gid=1948991627&single=true&output=csv" %>% 
  read_csv()
# View(resumes)

# You don't need to understand what this code is doing for this course:
resumes <- resumes %>% 
  slice(n()) %>% 
  select(-c(person, decision, observed_gender)) %>% 
  .[1, ]%>% 
  as.numeric() %>% 
  tibble(diff_m_minus_f = . )

# Visualize distribution of male - female differences:
ggplot(resumes, aes(x = diff_m_minus_f)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "Difference in prop promoted: male - female") +
  geom_vline(xintercept = 0.292, col = "red", size = 2)

2. In-class exercise

  1. Don’t forget to do the readings for Lec 29.
  2. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  3. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 29: Wed 4/6

Announcements

Today’s Topics/Activities

1. Chalk talk

  1. In-class tactile activity for Chapter 9 on Hypothesis Testing:
    1. Describe promotions at bank experiment
    2. Observed numerical results
    3. Are these results (statistically) significant i.e. there IS gender-based discrimination, or could it have been by chance?
    4. In a “hypothetical universe” of no gender-based discrimination…
    5. Permutations (shuffling) using a deck of cards (Fig 9.2): resampling without replacement (don’t put cards back)

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 28: Mon 4/4

Announcements

  1. Some students still need to take Midterm II, so no discussing it please.

Today’s Topics/Activities

1. Chalk talk

  1. Two factors that impact width of confidence interval i.e. size of net
    1. Confidence level: Fig 8.29
    2. Sample size: Fig 8.30
  2. New sampling scenario in MD 8.6: Difference in population proportions \(p_1 - p_2\) of two groups. Is yawning contagious?
  3. Recap of infer workflow: Fig 8.21. We’ll use this framework for Hypothesis Testing (which starts on Wednesday)

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 27: Fri 4/1

Announcements

  1. Additional notes about Midterm II
  2. Roadmap of what we have left for final 4 weeks of class

Today’s Topics/Activities

1. Chalk talk

  1. Re-visit Fig 8.27 from Lec 26: How to interpret 95% confidence intervals
  2. Moved to Lec 28 on Mon 4/4: Two factors that impact width of confidence interval i.e. size of net
    1. Confidence level: Fig 8.29
    2. Sample size: Fig 8.30

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 26: Wed 3/30

Announcements

  1. Instructions for next (shorter) phase of final project are posted.
  2. There are 13 midterms (450 students) this weekend. Self Scheduled Exams organizers expect a lot of traffic on Sunday afternoon.

Today’s Topics/Activities

1. Chalk talk

  1. Revisiting bowl sampling exercise (note bowl in book is from Amherst College)
  2. Comparing 95% to 80% confidence intervals Fig 8.27 vs Fig 8.28
  3. Correct interpretation of CI’s
  4. Moved to Lec 28 on Mon 4/4: Two factors that impact width of confidence interval i.e. size of net
    1. Confidence level: Fig 8.29
    2. Sample size: Fig 8.30

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 25: Mon 3/28

Announcements

  1. Added new optional “Resource: Video lecture” column to Google Sheet schedule above: Video chalk talks from Spring 2020 pandemic semester.
  2. Final project
    1. Watch video EDA feedback
    2. Instructions will be updated on Project page by Wednesday
    3. The last step of EDA phase ties into Learning Goal 06 on Midterm II: Multiple regression via interaction and parralel slopes model

Today’s Topics/Activities

1. Chalk talk

  1. Constructing confidence intervals using infer package (instead of rep_sample_n())

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 24: Fri 3/25

Announcements

  1. Gradebook on Moodle updated with Learning Goal proficiencies from Midterm I.
  2. Extension requests will be processed within 24h, usually by the morning after the original due date.

Today’s Topics/Activities

1. Chalk talk

  1. Recap of Lec 23: MD Figure 8.14
  2. MD Ch 8.7.2 touches on Learning Goal 10: Comparing sampling vs bootstrap distributions
  3. The thinking behind confidence intervals (CI): Fig 8.15
  4. Revisit Obama poll from Ch 7.4
  5. Two methods for constructing CI’s

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 23: Wed 3/23

Announcements

  1. You’ll get project feedback by Monday
  2. Talk about Midterm I
  3. In Slack #general: Smithies in SDS Data Ethics event on 2022/4/14 5pm with Dr. Emily Bender. Read more about her work here.
  4. Copy code below into your classnotes.Rmd. It loads the data from the Google Sheet you filled in last lecture
library(tidyverse)

# Load data and do minor data wrangling:
pennies <- 
  "https://docs.google.com/spreadsheets/d/e/2PACX-1vTlmubr007CcPwa3j-w1CydqNAQkq_cqA-QNxnjpSFno7OjCi8lW0noTgVM1-Fn9mtZ29DYd7PaC32I/pub?gid=101369328&single=true&output=csv" %>% 
  read_csv() %>% 
  select(group, group_average)

# Each row does not represent an individually resampled penny, rather each group's
# pre-summarized mean year:
View(pennies)

# Plot the bootstrap distribution:
ggplot(pennies, aes(x=group_average)) +
  geom_histogram(binwidth = 2) +
  labs(
    x = "Average year",
    title = "Bootstrap distribution for 18 groups' 50 pennies resampled with replacement",
    subtitle = "Vertical red line is sample mean of original sample of 50 pennies"
  ) +
  geom_vline(xintercept = 1995.44, col = "red")

Today’s Topics/Activities

From Black Mirror S4E4 "Hang the DJ"

From Black Mirror S4E4 “Hang the DJ”

1. Chalk talk

  1. Goal for this chapter: Table 8.1
  2. Go over classnotes.Rmd Lec 23 code
  3. What in-class tactile exercises have we done so far?
    1. Chapter 7: In-class we took 42 samples from bowl to create sampling distribution (See classnotes.Rmd Lec 19 code). IRL however you don’t take many samples, you sample only once.
    2. Chapter 8: In-class we only have 1 sample of 50 pennies from back. We can’t create sampling distribution. Rather, we approximate with the bootstrap method. Copy code below into classnotes.Rmd Lec 23 code.
  4. Today we mimic the pennies exercise virtually with 1000 simulations to build Fig 8.14

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 22: Mon 3/21

Announcements

  1. Re-visit learning goals

Today’s Topics/Activities

1. Chalk talk

  1. Example of CLT for non-normal population distributions here
  2. Next tactile activity: re-sampling with replacement from a sample of \(n=50\) pennies (Google Sheet)

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 21: Fri 3/11

Announcements

  1. Revisit learning goals: the importance of chapter 7 on sampling
  2. Midterm I discussion

Today’s Topics/Activities

1. Chalk talk

  1. Finish sampling framework from Lec 20: terminology, notations and definitions
  2. Statistical definitions: sampling distribution & standard error
  3. Obama poll in MD 7.4
  4. Central Limit Theorem: Watch bunny and dragon video
  5. Table 7.6.1 of all population parameters we’re going to infer about

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 20: Wed 3/9

Announcements

  1. What is “hump day”?
  2. Project EDA due tomorrow 9pm

Today’s Topics/Activities

1. Chalk talk

  1. Recap of Lec 19 readings, go over:
    1. Sampling exercise data on this Google Sheet can be loaded using code below. Copy to classnotes.Rmd:
    library(tidyverse)
    
    sampling_data <- 
      "https://docs.google.com/spreadsheets/d/e/2PACX-1vTlmubr007CcPwa3j-w1CydqNAQkq_cqA-QNxnjpSFno7OjCi8lW0noTgVM1-Fn9mtZ29DYd7PaC32I/pub?gid=0&single=true&output=csv" %>%
      read_csv()
    
    sampling_data <- sampling_data %>% 
      mutate(prop_red = number_of_red_balls/50)
    
    ggplot(sampling_data, aes(x = prop_red)) +
      geom_histogram(binwidth = 0.04) +
      labs(x = "Proportion of balls that are red")
    1. Code from Ch 7.2.1 (below Fig 7.8). Copy to classnotes.Rmd:
    library(tidyverse)
    library(moderndive)
    
    # Segment 1: sample size = 25 ------------------------------
    # 1.a) Virtually use shovel 1000 times
    virtual_samples_25 <- bowl %>% 
      rep_sample_n(size = 25, reps = 1000)
    
    # 1.b) Compute resulting 1000 replicates of proportion red
    virtual_prop_red_25 <- virtual_samples_25 %>% 
      group_by(replicate) %>% 
      summarize(red = sum(color == "red")) %>% 
      mutate(prop_red = red / 25)
    1. Fig 7.9
    2. Table 7.1
  2. Sampling framework: terminology, notations and definitions

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 19: Mon 3/7

Announcements

  1. Midterm I: Extra Self Scheduled Exam window scheduled by Science Center this week only: Wed 3/9 4-9pm. Staff will be available to hand out printed exams during this window and quiet rooms will be setup.
  2. SDS Presentation of the Major Tue 3/22 (after break). Undeclared majors and minors can attend in-person. If you’d like a grab-n-go lunch, please register in advance.

Today’s Topics/Activities

1. Chalk talk

  1. In-class tactile (hands-on) activity.
    1. Guess the proportion of balls in the bowl that are red in this Google Sheet
    2. Enter the number of balls in your shovel that are red (out of 50) in this Google Sheet
  2. Mimicking sampling virtually on computer

2. In-class exercise

  1. Do ModernDive readings for Wed 2022/3/2 Lec 17 (model selection) as well as today’s.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 18: Fri 3/4

Announcements

  1. Midterm discussion

Today’s Topics/Activities

1. Chalk talk

  1. Random assignment for causal inference

2. In-class exercise

None


Lec 17: Wed 3/2

Announcements

  1. Added cheat sheet policy to Midterm I
  2. Message in Slack #general: What are Shiny Apps?
  3. Discuss project EDA phase of final project:
    1. Download final_project.Rmd from Moodle
    2. Join the #final_project channel on Slack

Today’s Topics/Activities

1. Chalk talk

  1. Revisit Africa data in terms of learning goals for Midterm I: Put code in classnotes.Rmd

    library(readr)
    library(dplyr)
    library(ggplot2)
    library(moderndive)
    africa <- read_csv("https://rudeboybert.github.io/SDS220/static/africa_results_2022.csv")
    
    # 1. Regression with one numerical x
    # 1.a) EDA
    ggplot(data = africa, mapping = aes(x = height, y = countries_in_africa)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE) +
      labs(x = "Height", y = "Number of countries guessed", title = "Africa survey")
    
    # 1.b) Fit model and get regression table
    africa_model_num_x <- lm(countries_in_africa ~ height, data = africa)
    get_regression_table(africa_model_num_x)
    
    # 2. Regression with one categorical x
    # 2.a) EDA
    ggplot(data = africa, mapping = aes(x = group, y = countries_in_africa)) +
      geom_boxplot() +
      labs(x = "Group", y = "Number of countries guessed", title = "Africa survey")
    
    # 2.b) Fit model and get regression table
    africa_model_cat_x <- lm(countries_in_africa ~ group, data = africa)
    get_regression_table(africa_model_cat_x)
  2. Model selection using visualizations. In particular, choosing between the interaction and parallel slopes models by comparing

    1. Figure 6.7 for UT Austin data
    2. Figure 6.8 for Massachusetts high school data

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 16: Mon 2/28

Announcements

  1. Self-schedule Seelye Midterm I this weekend. See Midterms tab
  2. Next phase of project “Project EDA” assigned on Wednesday

Today’s Topics/Activities

1. Chalk talk

  1. Recap of Lec 15: Interaction models are one type of multiple regression model for \(x_1\) numerical and \(x_2\) categorical (with \(k\) levels/groups).
  2. Parallel slopes models are another type of model.

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 15: Fri 2/25

Cancelled: Snow day


Lec 14: Wed 2/23

Announcements

Today’s Topics/Activities

1. Chalk talk

  1. Multiple regression for \(y\) as a function of two explanatory/predictor variables:
    1. \(x_1\) numerical
    2. \(x_2\) categorical
  2. First type of multiple regression model: interaction model. What is an interaction effect?
  3. Using teaching evals data from Chapter 5 again:
    1. Figure 6.1: EDA Visualization of interaction model
    2. Table 6.2: Regression table of coefficients for interaction model
    3. Equation for fitted values \(\hat{y}\)
    4. Table 6.4: Intercepts and slopes of regression line for both genders (recorded as binary at time of study in 2005)
    5. Interpretation of results

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 13: Mon 2/21

Announcements

  1. Project proposal due during lab on Thursday

Today’s Topics/Activities

1. Chalk talk

More on regression using a categorical explanatory/predictor variable \(x\). Reusing categ_regression_ex code example from Lec12:

  1. Write out equation for fitted value \(\hat{y}\) using indicator functions
  2. Computing all fitted values and residuals by hand
  3. Re-using code example from Lec12, apply get_regression_points() function (additional code is posted below in Lec12)

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 12: Fri 2/18

Announcements

  1. See Slack #general: Newest SDS faculty Shiya Cao visit today

Today’s Topics/Activities

1. Chalk talk

  1. Lec 11 in-class exercise: sum of squared residuals computation of dashed black line
  2. Regression with a categorical predictor variable. Copy the following code into your classnotes.Rmd that creates a data frame called categ_regression_ex with 9 rows and 2 variables:
    1. A numerical outcome variable \(y\) = value
    2. An categorical explanatory variable \(x\) = name with 3 levels: Bert, Flo, Ivy.
library(ggplot2)
library(dplyr)
library(moderndive)
categ_regression_ex <- tibble(
  name = c("Bert", "Bert", "Bert", "Flo", "Flo", "Flo", "Ivy", "Ivy", "Ivy"),
  value = c(9, 10, 11, 11, 12, 13, 8, 9, 10)
)
categ_regression_ex
# Code we typed during Lec12 to view regression table:
# Step 1: Fit and save model
categ_model <- lm(value ~ name, data = categ_regression_ex)
# Step 2: Output regression table
get_regression_table(categ_model)
# Code we typed during Lec13 to view information on all 9 points
# where:
# value = y outcome variable
# name = x explanatory variable (categorical)
# value_yat = y_hat fitted values (group/category means)
# residual = y - y_hat errors
get_regression_points(categ_model)

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 11: Wed 2/16

Announcements

  1. If you need a project group fill out the Google Form found on the Project page by 5pm today.

Today’s Topics/Activities

1. Chalk talk

  1. PS02 Q11: Skew
  2. Correlation is not necessarily causation:
    1. Spurious correlations
    2. What are X = treatment, Y = response, and Z = confounding variables?
    3. What are causal graphs?
  3. Recall simple_regression_ex
    1. Draw the 3 points and the “best-fitting” regression line in blue.
    2. Draw the 3 fitted values \(\widehat{y}\) using the equation for the regression line \(\widehat{y} = b_0 + b_1\cdot x\)
    3. Draw the 3 residuals \(y - \widehat{y}\)
    4. Compute the sum of squared residuals for this line.

2. In-class exercise

  1. Repeat sum of squared errors calculation for dashed black line, show the value is “worse” i.e. bigger than the value for the blue regression line.
  2. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  3. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 10: Mon 2/14

Announcements

  1. Slack announcement about SDSCC
  2. Revisit learning goals
  3. Project proposal posted
  4. Answer previous question

Today’s Topics/Activities

1. Chalk talk

Consider the following 3 points saved in a data frame called simple_regression_ex.

We display the regression table. This is done in two steps:

# Step 1: Fit regression model
model_ex <- lm(y ~ x, data = simple_regression_ex)

# Step 2: Get regression table
get_regression_table(model_ex)

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 09: Fri 2/11

Announcements

  1. First phase of project released on Monday: Project proposal due Thu 2/24
    • Choose groups of 2-3 students, ideally in same lab section
    • Choose from a basket of datasets
  2. Keyboard shortcut for %>%:
    • macOS: Command + Shift + M
    • Windows: Control + Shift + M

Today’s Topics/Activities

1. Chalk talk

  1. For ModernDive Ch5 you need to install some new packages, including the tidyverse package
  2. Normal distribution
  3. Correlation coefficient:
    • Definition
    • Examples
    • Play “guess the correlation” game
  4. Regression line:
    • Outcome/response \(y\) and explanatory/predictor variables \(x\)
    • Teaching evaluations

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 08: Wed 2/9

Announcements

Today’s Topics/Activities

1. Chalk talk

Using same fruit_basket example from chalk talk in Lec 07:

  1. “Summary functions” are many-to-one functions to compute “summary statistics,” like mean() & median()
  2. Using summary functions to then summarize() rows
  3. Setting “group meta-data” of a data frame using group_by(), then summarize() rows. In other words, group_by() by itself does not change the “data”, rather only the “meta-data”
  4. mutate() new variables from existing variables: price in cents

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 07: Mon 2/7

Announcements

  • Copy the following code in a new R code chunk in your classnotes.Rmd file

    library(readr)
    library(dplyr)
    library(ggplot2)
    africa <- read_csv("https://rudeboybert.github.io/SDS220/static/africa_results_2022.csv")
    
    # Look at variable types
    glimpse(africa)
    
    # 1. Scatterplot
    ggplot(data = africa, mapping = aes(x = height, y = countries_in_africa)) +
      geom_point()
    
    # 2. Histogram
    ggplot(data = africa, mapping = aes(x = height)) +
      geom_histogram()
    
    # 3. Boxplot
    ggplot(data = africa, mapping = aes(x = group, y = countries_in_africa)) +
      geom_boxplot()

Today’s Topics/Activities

1. Chalk talk

  1. Finish data visualization
    • Recap 3 data visualizations
    • Note about boxplots: width/height of the box = IQR = measure of spread
    • How to add labels to a ggplot() with labs()
  2. Data wrangling
    1. Define the “pipe operator” %>%, pronounced then
    2. Logical AKA boolean operators in computer programming
    3. filter() rows that match a criteria
    4. Example data frame we’ll use is fruit_basket

Posted after lecture: In case you want to run the chalk talk example on your computer:

# Create fruit_basket data frame from scratch
# (we won't do this much in this class)
library(tibble)
fruit_basket <- tibble(
  type = c("mango", "kiwi", "mango", "grape", "grape", "mango"),
  price = c(2, 3, 1, 3, 1, 3)
)

# Filter the basket only for grapes
library(dplyr)
grape_basket <- fruit_basket %>% 
  filter(type == "grape")

# View contents of grapes
grape_basket

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive.

Lec 06: Fri 2/4

Cancelled: Snow day


Lec 05: Wed 2/2

Announcements

  1. At the beginning of every lecture you should have the following open:
    1. This webpage
    2. ModernDive
    3. Slack team for this course
    4. RStudio
  2. Install the readr package
  3. Spinelli Center drop-in tutoring information posted in syllabus
  4. Recap of classnotes.Rmd: where you can save all code from all in-class exercises
  5. Recap in-class activity based on student ID numbers from last lecture

Today’s Topics/Activities

1. Chalk talk

We want to visualize the distribution of the following 12 values with a boxplot: 1, 3, 5, 6, 7, 8, 9, 12, 13, 14, 15, 30. They have the following summary statistics (a single numerical value summarizing many values):

  1. Quartiles (1st, 2nd, and 3rd) that cut up the data into 4 parts, each containing roughly one quarter = 25% of the data:
    1. 1st quartile: 5.5
    2. 2nd quartile = median: 8.5
    3. 3rd quartiles: 13.5
  2. Interquartile-range (IQR): the distance between the 3rd and 1st quartiles:
    1. 8 = 13.5 - 5.5

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive
  3. At 11:55am: Continue Africa activity

Lec 04: Mon 1/31

Announcements

  1. At the beginning of every lecture you should have the following open:
    1. This webpage
    2. ModernDive
    3. Slack team for this course
    4. RStudio

Today’s Topics/Activities

1. Chalk talk

  1. Mega-mega fun times in-class activity
  2. Draw histogram by hand, emphasizing that bins correspond to intervals (right-edge inclusive).
  3. Adjust the binning structure of the previous hand-drawn histogram two ways:
    • Adjusting the binwidth
    • Adjusting the number of bins
  4. Facets split one graphic by the values of another variable (that does not have too many unique values).

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive

Lec 03: Fri 1/28

Announcements

  1. Show example of thread in #questions on Slack.
  2. The Smith College CS Department Liaisons’ Software Install Fair is today 2pm-4:30pm, especially if you want help installing RStudio Desktop
  3. Lecture recording:
  4. Lecture notes

Today’s Topics/Activities

1. Chalk talk

  1. Draw scatterplot by hand
  2. Define the Grammar of Graphics
  3. Write example ggplot() code
  4. Setting up a classnotes.Rmd R Markdown file for the in-class exercises. This follows exactly what you did for Lab01 on Thu 1/27.

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. Today’s reading is “2 - 2.3.1”, which means the start of Chapter 2 up to and including 2.3.1. In other words you don’t need to read 2.3.2.
  3. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive

Lec 02: Wed 1/26

Announcements

  1. Waitlist: Fill out Google Form on Moodle by 1pm. I will start admitting people this afternoon.
  2. Labs and problem sets:
    • First lab with Beth Brown tomorrow using same Zoom link.
    • All labs and problem sets are going to be posted on Moodle (not this webpage)
    • All problem sets will be due on Gradescope at 5pm 9pm so that you can visit Spinelli drop-in tutoring hours one last time.
    • First problem set (Intro survey, syllabus quiz, practice tagging individual questions on Gradescope) is due tomorrow, Thu 1/27 at 9pm.
  3. The Smith College CS Department Liaisons will be holding a Software Install Fair on Fri Jan 28 2pm-4:30pm.
  4. Lecture recording
  5. Lecture notes

Today’s Topics/Activities

1. Chalk talk

  1. R vs RStudio
  2. RStudio server rstudio.smith.edu vs RStudio Desktop (see note about Software Install Fair this Friday above)
  3. R packages

2. In-class exercise

  1. Go over ModernDive reading in schedule above, which you need to complete before next lecture.
  2. You do not need to submit the answers of all Learning Checks; they are meant for your practice. The solutions to all Learning Checks can be found in Appendix D of ModernDive

Lec 01: Mon 1/24

  1. Go over TODO list posted on Moodle
  2. Lecture recording (started a little late, apologies)