Schedule

Topics:

  1. Data visualization (pink): Grammar of Graphics, Five Named Graphs (5NG), color theory.
  2. Working with data (blue): data wrangling, importing, and formatting.
  3. Maps and spatial data (green): Maps and geospatial data.
  4. Learning how to learn new data science tools (yellow): SQL, TBD.

Note that while topics and topic dates may change, problem set (PS), project, and midterm dates will not.


Lec 38: Fri 12/10

Announcements

  • Today: the in-class data assistants will hold office hours
  • Final project due date/time is Friday 12/17 at 2pm (not 9pm)

Today’s Topics/Activities

1. In-class exercise

  • Work on Final Project

Lec 37: Wed 12/8

Announcements

  • ModernDive will always be free, always available at ModernDive.com
  • For Friday’s lecture the in-class data assistants will hold office hours
  • Finals week:
    • I’ve posted office hours for next week
    • Final project due date/time is Friday 12/17 at 2pm (not 9pm)
  • Spinelli Center notes:
    • Friday 12/10 is the final day of Spinelli drop-in tutoring hours
    • Thank you to the in-class data assistants Marium, Emma, Sunni, and Swaha
    • Thank you to the Spinelli Center tutors
  • Reflection exercise
  • Time to fill out course evaluations

Today’s Topics/Activities

1. In-class exercise

  • Work on Final Project

Lec 36: Mon 12/6

Announcements

  • Go over final project submission instructions
  • Wednesday is the final lecture at which I’ll be present; on Friday the Spinelli data assistants will hold in-class office hours.
  • Note on coding style

Today’s Topics/Activities

1. In-class exercise

  • Test two useful packages below:
    • patchwork: Combine two ggplots together
    • janitor: Clean-up messy variable names
  • Work on Final Project
library(tidyverse)

# 1. Combine two ggplots together using patchwork package:
library(patchwork)

plot_1 <- ggplot(mtcars) + geom_point(aes(mpg, disp))
plot_2 <- ggplot(mtcars) + geom_boxplot(aes(gear, disp, group = gear))

# Side-by-side:
plot_1 + plot_2

# On top of each other:
plot_1 / plot_2


# 2. Say we have a data frame with really messy names:
data_frame_ugly <- tibble(
  `asdf ?!? qwerty%` = c(1, 2),
  variable.name...NAMES = c(2,1)
)
data_frame_ugly

# You can clean them very easily using the clean_names() function from the 
# janitor package
library(janitor)
data_frame_clean <- data_frame_ugly %>% 
  clean_names()
data_frame_clean

Lec 35: Fri 12/3

Announcements

  • MTH/SDS tenure track search email
  • Final project:
    • Open Slack to #final_project channel
    • Group leader: Create a Slack DM with all members AND myself and say “we’re a group”
    • Submission details for Final Project to be posted on Monday.
  • If you haven’t already, download and install MySQL Workbench

Today’s Topics/Activities

1. Chalk Talk

  • dplyr and SQL are very similar. Both are based on the same idea of database normalization (1970).
  • Moral of the story: If you know the dplyr package for data wrangling, you can learn SQL very quickly.
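
To make the dplyr-to-SQL analogy concrete, here is a minimal sketch using the built-in mtcars data frame. The SQL in the comments is an illustration only, not one of the queries from the in-class exercise, and it assumes a database table named mtcars with the same columns.

```r
library(dplyr)

# Assumed SQL equivalent (hypothetical table named mtcars):
#
#   SELECT cyl, AVG(mpg) AS mean_mpg, COUNT(*) AS n
#   FROM mtcars
#   WHERE hp > 100
#   GROUP BY cyl
#   ORDER BY mean_mpg DESC;
#
# Each SQL clause lines up with one dplyr verb:
by_cyl <- mtcars %>%
  filter(hp > 100) %>%                          # WHERE
  group_by(cyl) %>%                             # GROUP BY
  summarize(mean_mpg = mean(mpg), n = n()) %>%  # SELECT with AVG() and COUNT(*)
  arrange(desc(mean_mpg))                       # ORDER BY ... DESC
by_cyl
```

The main difference is order: SQL states what to select first, while a dplyr pipeline lists the steps in the order they are carried out.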

2. In-class exercise

Perform 2-3 SQL queries to convince yourself that if you know dplyr, you can learn SQL very quickly. This exercise is based on Prof. Baumer’s lecture notes.

  1. Setup
    1. Install and open MySQL Workbench; this whole process takes about 10 minutes and requires creating an account with Oracle.
    2. Close the “Welcome to MySQL Workbench” message
    3. Click the plus sign next to “MySQL Connections” to add a connection to a SQL database.
    4. Set up a new connection as shown in #general in Slack
    5. Click the resulting “Playing with SQL” connection and input the password in #general in Slack
  2. Running SQL code
    1. Copy and paste the code below into the Query window
    2. For each of the 10 code segments: highlight it and then run it by clicking the “lightning” icon.

Lec 34: Wed 12/1

Announcements

  • Download and install MySQL Workbench before Friday’s lecture
  • Before chalk talk:
    1. Download the following zip file: example_webpage.zip
    2. Move example_webpage.zip to your SDS192 folder on your computer
    3. Unzip example_webpage.zip. Windows users: be sure to “Extract all”
    4. In the resulting example_webpage folder, double-click the RStudio Project example_webpage.Rproj icon to open RStudio Project mode

Today’s topics/activities

1. Chalk talk

example_webpage is a portion of the R Markdown Websites code for this course webpage:

  • Inputs: .Rmd and _site.yml files
  • Output: webpage in docs/ folder, in particular the index.html mainpage
  • “Deploying” your webpage: Many ways

2. In-class exercise

Today you’ll modify the source code for example_webpage and then deploy this webpage using Netlify Drop:


  1. Create an account on netlify.com using your GitHub account
    • Log into GitHub first. If you haven’t created an account, do so using your Smith email address.
    • Sign into Netlify using your GitHub account
  2. Change and build your website locally (on your computer)
    • In index.Rmd change the author from "Albert Y. Kim" to you and change the title
    • Build your R Markdown Website by going to the “Build” panel of RStudio -> Clicking “Build Website”.
      You can also use the keyboard shortcuts:
      • macOS: Command+Shift+B
      • Windows: Control+Shift+B
    • Inspect your webpage in your browser
  3. Deploy your R Markdown Website
    • Go to Netlify Drop
    • Drag-and-drop the docs/ folder output in your example_webpage RStudio Project folder.
    • If you want to rename your webpage’s URL rather than use the default one you’ve been assigned: Click on “Domain settings” -> Click on the “…” next to your default site name -> Click on “Edit site name” -> Rename your site

Lec 33: Mon 11/29

Announcements

  • Candidates for two new SDS faculty will be on campus this week and next
    • For data science position: See “Meet the SDS Data Science New Faculty Candidates!” email sent to SDS student mailing list.
    • For joint MTH/SDS position: To be confirmed soon.
    • Because of this, office hours are highly variable this week. However, they will always be confirmed and posted at least 24h in advance.
  • Final project in groups of 2-3 will be assigned on Wednesday and due Fri 12/17 9pm (last day of exams). You can choose your group.
  • PS07 posted
  • To do before chalk talk
    • If you haven’t already, create an account on GitHub.com using your Smith email address. If you already have a GitHub account, make sure your Smith email is in your Profile settings.
    • Open the GitHub repo for the fivethirtyeight R package

Today’s topics/activities

1. Chalk talk

GitHub: Theory and terminology

  • What is git?
  • What is GitHub?
  • Terminology: Repo, local vs remote, clone, pull, commit/push
  • Most important files in any repo: README.md

2. In-class exercise

  • Work on MP3 (due tomorrow at 9pm). Don’t forget to submit your Peer Evaluation Google Form.
  • Work on PS07

Lec 32: Mon 11/22

Announcements

  • MP3 now due Tuesday 11/30 at 9pm (after break). There will be no extensions past this due date/time.

Today’s topics/activities

1. Chalk talk

  • None

2. In-class exercise

  • Work on MP3

Lec 31: Fri 11/19

Announcements

  • MP3 now due Tuesday 11/30 at 9pm (after break). There will be no extensions past this due date/time.

Today’s topics/activities

1. Chalk talk

  • None

2. In-class exercise

  • Work on MP3

Lec 30: Wed 11/17

Announcements

Today’s topics/activities

1. Chalk talk

  • Recap of all sf data frames seen so far in MP3 Project -> examples.Rmd
  • Federal Information Processing Standard (FIPS) codes for counties
  • Example: Looking up the database:
    • 25XXX = Massachusetts counties
    • 25015 = Hampshire County, Massachusetts
  • From MP3 Project -> examples.Rmd -> Section 3 -> Look at contents of mass_pop_orig -> GEOID variable:
> mass_pop_orig
Simple feature collection with 14 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -73.50814 ymin: 41.23796 xmax: -69.92839 ymax: 42.88659
Geodetic CRS:  NAD83
First 10 features:
   GEOID                             NAME   variable estimate moe
1  25017  Middlesex County, Massachusetts B01003_001  1600842  NA
2  25025    Suffolk County, Massachusetts B01003_001   796605  NA
3  25001 Barnstable County, Massachusetts B01003_001   213496  NA
4  25027  Worcester County, Massachusetts B01003_001   824772  NA
5  25011   Franklin County, Massachusetts B01003_001    70577  NA
6  25013    Hampden County, Massachusetts B01003_001   467871  NA
7  25015  Hampshire County, Massachusetts B01003_001   161032  NA
8  25021    Norfolk County, Massachusetts B01003_001   700437  NA
9  25005    Bristol County, Massachusetts B01003_001   561037  NA
10 25009      Essex County, Massachusetts B01003_001   783676  NA
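
A county GEOID is a five-character code: the first two characters are the state FIPS code and the last three identify the county within that state. A minimal sketch of pulling the two parts apart, using three GEOID values copied from the output above (the object names are hypothetical):

```r
# Three GEOID values copied from the mass_pop_orig output above.
# They are kept as text here so that leading zeros are never dropped.
geoid <- c("25017", "25025", "25015")

state_fips  <- substr(geoid, 1, 2)  # first two characters: state
county_fips <- substr(geoid, 3, 5)  # last three characters: county within state

data.frame(geoid, state_fips, county_fips)
```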

2. In-class exercise

  • Work on MP3

Lec 29: Mon 11/15

Announcements

  • Work on MP3 this Wed, Fri, and Mon before Thanksgiving break.

Today’s topics/activities

1. Chalk talk

  • Application Programming Interfaces (APIs)
  • Choropleth maps. In particular, how you set the bins corresponding to the color gradient can affect how your map looks. As indicated here, there are two approaches:
    • Equally sized interval bins
    • Quantile based bins
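
To see how much the binning choice matters, here is a rough sketch in base R. It uses the ten county population estimates printed in the Lec 30 mass_pop_orig output; the choice of four bins and all object names are assumptions made for illustration.

```r
# Population estimates for the ten counties shown in the Lec 30 output
pop <- c(
  Middlesex = 1600842, Suffolk = 796605, Barnstable = 213496,
  Worcester = 824772, Franklin = 70577, Hampden = 467871,
  Hampshire = 161032, Norfolk = 700437, Bristol = 561037, Essex = 783676
)
n_bins <- 4  # arbitrary choice for this illustration

# Approach 1: equally sized intervals between the minimum and maximum
equal_breaks <- seq(min(pop), max(pop), length.out = n_bins + 1)
equal_bins <- cut(pop, breaks = equal_breaks, include.lowest = TRUE)

# Approach 2: quantile-based breaks, so each bin holds a similar number of counties
quantile_breaks <- quantile(pop, probs = seq(0, 1, length.out = n_bins + 1))
quantile_bins <- cut(pop, breaks = quantile_breaks, include.lowest = TRUE)

# Number of counties falling in each bin under the two approaches
table(equal_bins)
table(quantile_bins)
```

With these ten counties, the equal-width breaks leave one bin empty and put Middlesex County alone in the top bin, while the quantile breaks spread the counties roughly evenly across the bins. The same data would therefore produce two visibly different maps.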

2. In-class exercise

  • Go over code in MP3 folder -> examples.Rmd -> Section 3 on “Choropleth maps using census data”
  • You will need to register for an API key from the Census Bureau. Carefully read the warning message to do so.

Lec 28: Fri 11/12

Announcements

  • Grad school panel on Mon 11/22 featuring SDS alumna. More info in #general

Today’s topics/activities

1. Chalk talk

2. In-class exercise

  1. Work on PS06. This is direct practice for MP3.
  2. Then start on MP3.

Lec 27: Wed 11/10

Announcements

  • Check important Slack announcement in #general
  • Mini-Project 3 posted.

Today’s topics/activities

1. Chalk talk

  1. Example solutions to examples.Rmd Section 1 exercise. Code posted in #mp3; screencast below
  2. Shapefiles

2. In-class exercise

While you are free to work in any order you like, I suggest you:

  1. Go over solutions to examples.Rmd -> Section 1 on “Converting data frames to sf objects”
  2. Go over code for Section 2 “Loading shapefiles into R”. This is direct practice for PS06.
  3. Work on PS06. This is direct practice for MP3.
  4. Then start on MP3.

Lec 26: Mon 11/8

Announcements

  • Mini-project 3 (MP3) assigned on Wednesday, due Tuesday 11/23 at 9pm.
    • Add yourselves to the #mp3 channel. Please ask all questions about MP3 in #mp3, not in #questions
    • By Tuesday 5pm I will post the new groups (of two) in the #mp3 channel. Until MP3 is due, you will sit next to your partner in class.
    • Please reach out to your partner with a Slack DM before Wednesday’s lecture to coordinate meeting before lecture so you can sit next to each other.
    • If you have seating restrictions due to hearing, sight, or mobility issues, please DM me.
  • PS06 to be posted by this evening
  • Compare the following London underground maps. As stated in this article, the transit map on the right sacrifices accuracy for clarity.
    • The map of stations as they truly exist:
    • The transit map inside the stations and trains. All lines are either straight or at 45 degrees and furthermore the geographic space is distorted.

Today’s topics/activities

  • Download MP3.zip
  • Move MP3.zip to your SDS192 folder on your computer
  • Unzip MP3.zip. Windows users: be sure to “Extract all”
  • In the resulting MP3 folder, double-click the RStudio Project MP3.Rproj icon

  • Verify that RStudio opens with MP3 written in the top-right

1. Chalk talk

  • RStudio Projects
  • sf package for static maps in ggplot2 and loading shapefiles. See Spatial Data Science for more.

2. In-class exercise

  • In MP3 folder -> examples.Rmd -> Section 1 on “Converting data frames to sf objects”, do exercises

Lec 25: Fri 11/5

No lecture today, instead optional in-class office hours:

  • Sec02 Sabin-Reed 220: 9:25-10:25
  • Sec01 Stoddard G2: 10:55-11:55

Lec 24: Wed 11/3

Announcements

  • Slack note in #random
  • No office hours tomorrow (Thursday). Instead, optional in-class office hours on Friday.

Today’s topics/activities

1. Chalk talk

  • Practice midterm posted in #midterms
  • “When would you use left or right join?”
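
One way to build intuition for that question is to try each join on two tiny data frames. The sketch below uses made-up data (flights_toy and airlines_toy are hypothetical, not course data) in which each data frame contains one carrier code that the other lacks.

```r
library(dplyr)

# Hypothetical data: "ZZ" appears only in flights_toy, "DL" only in airlines_toy
flights_toy <- tibble(
  carrier = c("AA", "UA", "ZZ"),
  n_flights = c(10, 20, 5)
)
airlines_toy <- tibble(
  carrier = c("AA", "UA", "DL"),
  name = c("American", "United", "Delta")
)

# Keep every row of the left data frame; ZZ gets a missing name
left_result <- left_join(flights_toy, airlines_toy, by = "carrier")

# Keep every row of the right data frame; DL gets a missing n_flights
right_result <- right_join(flights_toy, airlines_toy, by = "carrier")

# Keep only carriers found in both data frames
inner_result <- inner_join(flights_toy, airlines_toy, by = "carrier")

left_result
right_result
inner_result
```

The choice comes down to which data frame's rows you cannot afford to lose: left_join() protects the first one, right_join() protects the second, and inner_join() protects neither.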

2. In-class exercise

  • Open office hours

Lec 23: Mon 11/1

Announcements

  • Practice midterm posted on Slack in #midterms; we’ll go over solutions on Wednesday
  • Midterm II discussion: see midterms page

Today’s topics/activities

1. In-class exercise

  • Work on MP2

Lec 22: Fri 10/29

Announcements

  • Useful RStudio cheatsheets: Go to RStudio menu bar on top -> Help -> Cheatsheets:
    • Data Transformation with dplyr
    • Data Visualization with ggplot2

Today’s topics/activities

1. Chalk talk

  • Install and then load the tidyverse package: An umbrella package that installs/loads many useful packages for data science all at once.
    # Don't do all this:
    library(ggplot2)
    library(dplyr)
    library(readr)
    library(tidyr)
    library(stringr)
    library(tibble)
    library(forcats)
    library(purrr)
    
    # Instead, do this:
    library(tidyverse)

2. In-class exercise

  • Work on MP2

Lec 21: Wed 10/27

Announcements

  • Mid-Semester Assessment:
  • Talk about Spring 2022 SDS courses
  • Added “Tips & Tricks” tab to menu bar of course webpage
  • Update to syllabus
  • For a truly unique perspective on Data Visualization: Mona Chalabi @monachalabi. See video below:

Today’s topics/activities

1. Chalk talk

  • None

2. In-class exercise

  • Work on MP2

Lec 20: Mon 10/25

Announcements

  • MP1 grades posted: See Slack #mp1 for details
  • SDS is currently working hard to hire 3 new faculty who will start in July 2022; I’m chairing one committee and sitting on another. As a result, for the next month:
    • My office hours will be highly variable; consult the calendars in the syllabus often
    • There will unfortunately be lags in returning grading
    • Sec01 Stoddard only: I won’t be able to stay past 12:05pm so that I can attend lunch meetings.

Today’s topics/activities

1. Chalk talk

  • Recap of Lec19: Why did we use inner_join() for the solution to LC 3.20 on computing Available Seat Miles?
  • Importing spreadsheet data into R. Either Excel files or .csv Comma-Separated Values files. See image of example .csv file below.
  • Data formats: “tidy” AKA long/tall/narrow format versus “non-tidy” AKA wide format
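
As a rough illustration of the two formats, the sketch below converts a small wide data frame to tidy (long) format with pivot_longer() from the tidyr package. The data and column names are made up for this example.

```r
library(tidyr)

# Hypothetical wide data: one row per country, one column per year
wide <- data.frame(
  country = c("A", "B"),
  y2006 = c(1, 2),
  y2007 = c(3, 4)
)

# Tidy (long) version: one row per country-year combination
long <- pivot_longer(
  wide,
  cols = c(y2006, y2007),
  names_to = "year",
  values_to = "value"
)

wide
long
```

The long version has more rows and fewer columns, which is the shape ggplot2 and dplyr generally expect: each variable in its own column and each observation in its own row.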


2. In-class exercise

  • Go over ModernDive reading in schedule above

Lec 19: Fri 10/22

Announcements

Today’s topics/activities

1. Chalk talk

  • Pseudocode to compute Available Seat Miles
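
A sketch of how such pseudocode might translate into dplyr code. It assumes available seat miles are defined as seats multiplied by distance, summed over each airline's flights; this is one possible approach and may differ in details from what was written on the board.

```r
library(dplyr)
library(nycflights13)

# Pseudocode:
#   1. Attach the number of seats of each plane to each flight
#   2. For each flight, multiply seats by distance
#   3. Add up those products separately for each airline
#   4. Sort airlines from most to fewest available seat miles
asm_by_carrier <- flights %>%
  inner_join(planes, by = "tailnum") %>%     # step 1: seats lives in planes
  mutate(asm = seats * distance) %>%         # step 2
  group_by(carrier) %>%                      # step 3
  summarize(asm = sum(asm, na.rm = TRUE)) %>%
  arrange(desc(asm))                         # step 4
asm_by_carrier
```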

2. In-class exercise

  • Work on MP2

Lec 18: Wed 10/20

Announcements

  • Slack:
    • See #general Slack channel and give feedback on Spinelli tutors
    • Practice making text look like code: create a DM with your project partner and let’s practice.
  • Discuss Mini-Project 2 in full detail
  • Feel free to message me on weekends, I just likely won’t respond.

Today’s topics/activities

1. Chalk talk

  • What is pseudocode?
  • Types of joins:
    • Copy code below to your classnotes.Rmd
    • Refer to image (ignore semi_join())

2. In-class exercise

With your MP2 partner, practice data wrangling! Complete ModernDive Learning Check 3.20: Using data in the nycflights13 package, compute available seat miles for each airline separately:

  1. Write out the pseudocode first
  2. Then code it


Lec 17: Mon 10/18

Announcements

  • Mini-Project 2 posted
  • PS05 to be posted by this evening
  • If you’re curious about my experiences in grad school, working at Google, switching to academia, and advice for aspiring data scientists, check out my appearance on episode #43 of the DataBytes podcast “To Google and Back.” Also available on Apple Podcasts and Google Play.


Today’s topics/activities

1. Chalk talk

  • Adding to previous lectures:
    • Lec15 on group_by() and summarize(): difference between sum() and n() summary functions.
    • Lec16 on mutate(): ifelse() function
  • _join(), select(), and rename() functions
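
For the two additions to previous lectures, a minimal sketch on made-up data (the orders data frame and its columns are hypothetical): sum() adds up the values in a column, n() counts rows, and ifelse() builds a new column from a condition.

```r
library(dplyr)

# Hypothetical data: three orders placed by two customers
orders <- tibble(
  customer = c("a", "a", "b"),
  amount = c(10, 20, 5)
)

# sum() totals the amount column; n() counts the rows in each group
per_customer <- orders %>%
  group_by(customer) %>%
  summarize(total_amount = sum(amount), n_orders = n())

# ifelse(condition, value if TRUE, value if FALSE) inside mutate()
orders_flagged <- orders %>%
  mutate(size = ifelse(amount >= 10, "large", "small"))

per_customer
orders_flagged
```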

2. In-class exercise

  • Go over ModernDive reading in schedule above

Lec 16: Fri 10/15

Announcements

  • Mini-project 2 (MP2) assigned on Monday
    • Add yourselves to the #mp2 channel. Please ask all questions about MP2 in #mp2, not in #questions
    • By Sunday 5pm I will post the new groups (of two) in the #mp2 channel. Until MP2 is due, you will sit next to your partner in class.
    • Please reach out to your partner with a Slack DM before Monday’s lecture to coordinate meeting before lecture so you can sit next to each other.
    • If you have seating restrictions due to hearing, sight, or mobility issues, please DM me.

Today’s topics/activities

1. Chalk talk

  • mutate() new columns/variables and arrange() i.e. sort rows
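
A short sketch of the two verbs together on the built-in mtcars data, where wt is recorded in thousands of pounds. The new column name hp_per_1000lb is made up for this example.

```r
library(dplyr)

# mutate() adds a new column computed from existing ones;
# arrange() then sorts the rows, here from largest to smallest
power_to_weight <- mtcars %>%
  mutate(hp_per_1000lb = hp / wt) %>%
  arrange(desc(hp_per_1000lb))

head(power_to_weight)
```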

2. In-class exercise

  • Go over ModernDive reading in schedule above

Lec 15: Wed 10/13

Announcements

  • See Slack #general for info about presentation of SDS major
  • PS04 (shorter) will be posted this afternoon
  • Keyboard shortcuts for:
    1. %>% in RStudio: command + shift + m on macOS, control + shift + m on Windows
    2. Running code in RStudio: command + enter on macOS, control + enter on Windows
    3. Quickly jumping between apps: command + tab on macOS, alt + tab on Windows
    4. Selecting many files at once: click first file, hold shift, click last file
    5. Deleting files: command + delete on macOS, delete on Windows

Today’s topics/activities

1. Chalk talk

  • summarize() rows and group_by() %>% summarize()

2. In-class exercise

  • Put finishing touches on MP1
  • Go over ModernDive reading in schedule above

Lec 14: Fri 10/8

Announcements

  • You are responsible for completing the ModernDive readings for Lec13 on the %>% operator and filter() before Wednesday’s lecture
  • A shorter PS04 will be assigned on Wednesday, due on Monday 10/18 9pm

Today’s topics/activities

1. Chalk talk

  • None

2. In-class exercise

  • Work on MP1

Lec 13: Wed 10/6

Announcements

Today’s topics/activities

1. Chalk talk

  • Computer file theory
    • What are folders/directories?
    • How does R Markdown find the .ics file?
    • What are .zip files? Special note for Windows users
    • Computer file hygiene: Delete files you don’t need anymore
  • Intro to data wrangling
    • Pipe operator %>% pronounced “then”
    • filter() rows that meet certain criteria
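
A minimal sketch of both ideas using the built-in mtcars data (the object names are made up for this example). Reading %>% as “then”, the first statement says: take mtcars, then keep only the rows for four-cylinder cars that get more than 30 miles per gallon.

```r
library(dplyr)

# With the pipe: the data frame on the left becomes the first argument of filter()
efficient_four_cyl <- mtcars %>%
  filter(cyl == 4, mpg > 30)

# Without the pipe: exactly the same operation written as a regular function call
same_without_pipe <- filter(mtcars, cyl == 4, mpg > 30)

efficient_four_cyl
```

Separating conditions with a comma inside filter() means all of them must hold, the same as combining them with &.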

2. In-class exercise

  • Go over ModernDive reading in schedule above

Lec 12: Mon 10/4

Announcements

  • Midterm:
    • No talking about it until after 5pm today please; one or more students still need to take it.
    • Why student ID and not name? For anonymized grading.
  • Update to office hours on syllabus
  • MP1:
    • Lecture schedule for Wed, Fri, and Wed after break
    • Post questions about MP1 in #mp1 on Slack
  • Discussion on managing group dynamics:
    • Life happens. If it does and it will affect your work, at the very least communicate and give your partner a heads up (text, Slack, etc.)
    • What to do when issues arise?
    • Don’t forget you’ll be filling out peer evaluation Google Form

Today’s topics/activities

1. Chalk talk

  • Trend lines via a geom_smooth() layer. Two types:
    • Linear regression
    • LOESS, also called LOWESS: Locally Weighted Scatterplot Smoothing
  • Example code:
library(ggplot2)
library(dplyr)
library(gapminder)

# 1. Recreate plot from PS02 but with no color:
gapminder_2007 <- gapminder %>% 
  filter(year == 2007)

# 1.a) Add LOESS smoother layer with geom_smooth()
ggplot(data = gapminder_2007, mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  geom_smooth()

# 1.b) Remove standard error bars by setting se = FALSE
ggplot(data = gapminder_2007, mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  geom_smooth(se = FALSE)

# 1.c) Change span of "smoothing" window by changing the value of span
ggplot(data = gapminder_2007, mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  geom_smooth(se = FALSE, span = 0.25)

# 1.d) Force line to be straight. i.e. linear regression
ggplot(data = gapminder_2007, mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

2. In-class exercise

  • Copy the example code above to your classnotes.Rmd and go over the code
  • Optional: Go over ModernDive reading in schedule above (this topic is covered in SDS 201/220 intro stats)
  • Work on MP1

Lec 11: Fri 10/1

Announcements

  • Open Slack at the start of every lecture
    • Check for DM’s
    • Check #midterms channel
  • In order to not disadvantage students who take the midterm earlier
    • I won’t be answering any Slack #midterms after 3pm today
    • I’ve instructed the Friday Spinelli tutor not to answer questions about the midterm

Today’s topics/activities

1. Chalk talk

  • Go over practice Midterm I. Boxplot for question 3.c):

2. In-class exercise

  • Work on MP1

Lec 10: Wed 9/29

Announcements

  • Sit next to your MP1 partner; your partner was posted in the #mp1 channel on Sunday 5pm.
  • If you have an Office of Disability Services accommodations letter and you haven’t already, please Slack DM it to me.
  • Midterm I info posted
  • Mini-Project 1 info posted

Today’s topics/activities

1. Chalk talk

2. In-class exercise

  • With your partner, build a minimally viable product of your MP1

Lec 09: Fri 9/24

Announcements

  • On Slack #general: new SDS student lounge in McConnell 209
  • Additional resource: Prof. Ben Baumer’s book Modern Data Science with R used in his version of SDS 192.
  • Mini-project 1 (MP1) assigned on Monday
    • Slack demo of how to subscribe to a #channel: Adding yourselves to the #mp1 channel. Please ask all questions about MP1 there
    • You will be assigned groups for MP1, MP2, and MP3. You can choose your groups for the final project.
    • By Sunday 5pm I will post the groups (of two) in the #mp1 channel. Until MP1 is due, you will sit next to your partner in class.
    • Please reach out to your partner with a Slack DM before Monday’s lecture to coordinate meeting before lecture so you can sit next to each other.

Today’s topics/activities

1. Chalk talk

  • Recap of barplots: Exercise on pie charts vs barplots below
  • Color theory
    1. color vs fill aesthetics in ggplot2
    2. Selecting an appropriate color palette from colorbrewer2.org
    3. How does ggplot2 pick default colors? Using a color wheel
    4. Also define colors in terms of hex codes

2. Exercise on pie charts vs barplots

Say the following pie charts represent results of an election poll at three time points: A = September, B = October, and C = November. At each time point we present the proportion of the poll respondents who say they will support one of 5 candidates: 1 through 5.

Based on these 3 pie charts, answer the following questions:

  1. At time point A, is candidate 5 doing better than candidate 4?
  2. Did candidate 3 do better at time point B or time point C?
  3. Who gained more support between time point A and time point B, candidate 2 or candidate 4?

Compare that to using barplots. Which do you prefer?

3. In-class exercise

  • Go over ModernDive reading in schedule above.
  • Changing default color and fill color aesthetics:
    1. Copy and paste the code below into your classnotes.Rmd file
    2. Change both the color of the scatterplot points and the fill of the bars. You can do this by selecting a palette from colorbrewer2.org or by setting them manually
    3. Run colors() in your console to get English names of all colors in R
library(ggplot2)
library(dplyr)
library(nycflights13)
library(gapminder)

# 1. Recreate plot from PS02, but change default "color" palette of points:
gapminder_2007 <- gapminder %>% 
  filter(year == 2007)
ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) +
  geom_point() +
  scale_color_brewer(palette = "Set1")

# 2.a) Recreate Figure 2.26 but change default "fill" color of bars by adding a 
# palette layer:
ggplot(flights, aes(x = carrier, fill = origin)) +
  geom_bar(position = position_dodge(preserve = "single")) +
  scale_fill_brewer(palette = "Set1")

# 2.b) Recreate Figure 2.26 but change default "fill" color of bars by manually 
# changing colors in a layer:
ggplot(flights, aes(x = carrier, fill = origin)) +
  geom_bar(position = position_dodge(preserve = "single")) +
  scale_fill_manual(values = c("darkorange", "forestgreen", "navyblue"))

# 2.c) Recreate Figure 2.26 but change default "fill" color of bars by manually 
# changing colors in a layer using hex codes from: 
# https://www.color-hex.com/color-palette/114219
ggplot(flights, aes(x = carrier, fill = origin)) +
  geom_bar(position = position_dodge(preserve = "single")) +
  scale_fill_manual(values = c("#dc323a", "#003f77", "#c4c1c1"))

Lec 08: Wed 9/22

Announcements

  • Problem sets:
    • PS01 graded
    • PS03 posted

Today’s topics/activities

1. Chalk talk

  • Searching the internet effectively: a critical data science tool
  • Wrapping-up boxplots:
    • For a side-by-side boxplot, the x variable has to be categorical
    • Summary statistics that are robust to outliers: median and IQR
    • Why 1.5 x IQR?
  • Barplots
    • geom_bar() when counts are not pre-computed i.e. listed individually
    • geom_col() when counts are pre-computed and saved in a variable

2. In-class exercise

  • Go over ModernDive reading in schedule above.

Lec 07: Mon 9/20

Announcements

  • Office of Disability Services is looking to hire a note taker for the class. If interested, see note in Slack #general.
  • Problem sets:
    • PS02 due today at 5pm
    • PS03 to be posted by 6pm today

Today’s topics/activities

1. Chalk talk

  • Recap of histograms
  • Facets to split a visualization by the values of another variable
  • Default argument ordering in functions such as ggplot(), where data = is assumed first and mapping = is assumed second
  • Boxplots! Powerful, but tricky!

Say we want to study the distribution of the following 12 values which are pre-sorted:

1, 3, 5, 6, 7, 8, 9, 12, 13, 14, 15, 30

They have the following summary statistics. A summary statistic is a single numerical value summarizing many values. Examples include the immediately obvious mean AKA average and median. Other less immediately obvious examples include:

  • Quartiles (1st, 2nd, and 3rd) that cut up the data into 4 parts, each containing roughly one quarter = 25% of the data
  • Minimum & maximum
  • Interquartile-range (IQR): the distance between the 3rd and 1st quartiles
Min.   1st Quartile   Median = 2nd Quartile   3rd Quartile   Max.   IQR
1      5.5            8.5                     13.5           30     8 = 13.5 - 5.5

Let’s compare the points and the corresponding boxplot side-by-side with the values on the \(y\)-axis matching:
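
The summary statistics in the table, and the 1.5 x IQR rule for flagging outliers, can be checked with a few lines of base R. This is a sketch; the object names are made up, and fivenum() is used because its quartile rule matches the values in the table.

```r
values <- c(1, 3, 5, 6, 7, 8, 9, 12, 13, 14, 15, 30)

# Five-number summary: min, 1st quartile, median, 3rd quartile, max
five <- fivenum(values)
five

# Interquartile range: distance between the 3rd and 1st quartiles
iqr <- five[4] - five[2]

# Points more than 1.5 x IQR beyond the quartiles are drawn as outliers
lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr
outliers <- values[values < lower_fence | values > upper_fence]
outliers

# Base R boxplot of the same values
boxplot(values)
```

Here the upper fence is 13.5 + 1.5 x 8 = 25.5, so the value 30 is the only point flagged as an outlier. Note that quantile(values) uses a different interpolation rule by default and reports slightly different quartiles (5.75 and 13.25), so small discrepancies between tools are expected.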

2. In-class exercise

  • Go over ModernDive reading in schedule above.

Lec 06: Fri 9/17

Announcements

Today’s topics/activities

1. Chalk talk

  • In-class demo of using RMarkdown features in a classnotes.Rmd file to save lecture code
  • Take screenshots of your screen!
  • Histograms for visualizing the distribution of a numerical variable
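
A minimal histogram sketch using the built-in faithful data frame, whose waiting variable records the time in minutes between eruptions of the Old Faithful geyser. The dataset and the bin width of 5 are choices made for this illustration, not taken from the lecture demo.

```r
library(ggplot2)

# One numerical variable on the x-axis; the heights of the bars are the
# number of observations falling in each 5-minute bin
waiting_hist <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 5, color = "white")

waiting_hist
```

Changing binwidth changes how smooth or jagged the distribution looks, so it is worth trying a few values.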

Section 1 (Stoddard G6) Demo

Section 2 (Sabin-Reed 220) Demo

2. In-class exercise

  • If you still haven’t been able to “Knit to PDF”, please ask for help
  • Go over ModernDive reading in schedule above.

Lec 05: Wed 9/15

Announcements

  • PS02 was posted after Monday’s lecture.

Today’s topics/activities

1. Chalk talk

  • Overplotting and two approaches for addressing it
  • Linegraphs
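
The two approaches are not named in these notes; the sketch below assumes they are the two covered in the ModernDive reading, namely making points transparent and jittering them. It uses the mpg data frame that ships with ggplot2, and the alpha, width, and height values are arbitrary choices.

```r
library(ggplot2)

# Many cars share the same (cty, hwy) pair, so points are drawn on top of each other

# Approach 1: transparency. Darker areas indicate more overlapping points.
plot_alpha <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.2)

# Approach 2: jittering. Each point is nudged by a small random amount.
plot_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3)

plot_alpha
plot_jitter
```

Jittering changes only where points are drawn, not the underlying data, so keep the nudges small enough that the plot is not misleading.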

2. In-class exercise

  • Explore the different formatting tools in R Markdown: go to RStudio top menu bar -> Help -> Markdown quick reference.
  • Sec01 in Stoddard: There was a typo in Step 8 in last lecture’s in-class exercise. If you weren’t able to Knit directly to PDF, please re-attempt Steps 8-9. Knitting directly to PDF, instead of Knitting to Word and then saving to PDF, is the preferred submission format for all problem sets. It will be less hassle for you and provide consistency for the graders.
  • Go over ModernDive reading in schedule above.

Lec 04: Mon 9/13

Announcements

  • Problem Set 02 due next Monday 5pm, now posted under Problem Sets

Today’s topics/activities

1. Chalk talk

  • Recap of previous lecture
  • “Where can I save all the code I run in class?” In an R Markdown .Rmd file; R Markdown is a tool for reproducible research
    • Input: an .Rmd file
    • Output: an .html, .docx, or .pdf file

2. In-class exercise

In-class battle-testing and practicing for PS02:

  1. At a couple of steps in this process, you will be asked to install packages. Say yes to all of them.
  2. If at any point your code won’t knit, go through these 6 R Markdown Fixes first, then seek assistance. These 6 fixes will resolve 85% of issues.
  3. Create new R Markdown .Rmd file:
    • Go to RStudio menu bar -> File -> New File -> R Markdown
    • Set “Title” to “My first R Markdown report” and “Author” as your name.
    • Save this file as testing somewhere on your computer. This will create a file called testing.Rmd
  4. Method 1: “Knit” a report to HTML:
    • Click the arrow next to “Knit” -> “Knit to HTML”.
    • An HTML webpage should pop up. However, it may be blocked by your browser. If so, in your browser’s URL bar, click on “Always allow pop-ups”.
  5. Method 1: Publish HTML report on web:
    • Click on the blue “Publish” button on the top right of the resulting pop-up HTML page.
    • Select RPubs.
    • If you haven’t previously, create an account on RPubs.com. If you have previously, log in.
    • Set “Title” to “My first R Markdown report” and “Slug” to “testing”
    • You should end up with a webpage that looks like this one. This is live on the web!
  6. Method 1: Update HTML report on web:
    • Make some trivial change to your testing.Rmd file.
    • “Re-knit” your report and make sure your trivial change is reflected.
    • The blue “Publish” button should now read “Republish”
    • Click “Update existing”
    • Your updates are now live on the web!
  7. Method 2: “Knit” a report to Word
    • Click the arrow next to “Knit” -> “Knit to Word”.
    • Save the resulting Word document as a pdf file.
  8. Only if you are a macOS user:
    • Next to “Console” go to “Terminal”
    • Run this line of code:
    sudo chown -R `whoami`:admin /usr/local/bin
    • Enter your password. Note: Terminal has weird behavior whereby as you enter your password, the cursor will not move. Don’t worry, your password is registering.
  9. Method 3: “Knit” a report to PDF
    • Run the following code in your console just once:
    install.packages('tinytex')
    tinytex::install_tinytex()
    • Click the arrow next to “Knit” -> “Knit to PDF”.

Lec 03: Fri 9/10

Announcements

  • Spinelli Center SDS drop-in tutoring hours now open! Get individual attention from SDS majors! In Sabin-Reed 301
    • Sunday through Thursday 7-9pm
    • Friday 2:35-3:30pm
  • By popular request:
    • Sec 01 in Stoddard G2 will now start 5 minutes later: 10:55 AM instead of 10:50AM
    • Sec 02 in Sabin-Reed 220 will now end 5 minutes earlier: 10:35 AM instead of 10:40 AM
  • I added extra instructions for Problem Set 01 after lecture, posted under Problem Sets
    • Show, don’t tell: how to tag questions on Gradescope
  • ASA StatFest 2021 Sat 9/18 thru Sun 9/19 flyer and event webpage
    • Sunday 9/19 at 11:50AM: Opportunities in Statistics & Data Science in Academia, Government, & Non-Profit featuring SDS’s Prof. Randi Garcia!
    • Keynote address by Robert Santos, 116th President of the ASA, and President Biden’s nominee to serve as Director of the United States Census Bureau! If approved by the Senate, he would be the first Latinx Director of the Bureau!

Today’s topics/activities

1. Chalk talk

  • Recap of previous lecture
  • Grammar of Graphics
  • 5NG1: Scatterplots
  • Next time:
    • Question: Do I need to re-type my code in the Console every single time?
    • Answer: No! Save your work in an RMarkdown document

2. In-class exercise

  • Go over ModernDive reading in schedule above.

Lec 02: Wed 9/8

Announcements

  • Problem Set 01 due this Monday 5pm, posted under Problem Sets.

Today’s topics/activities

1. Chalk talk

  • Intro to Slack
  • What is the difference between R and RStudio?
  • What are R packages?

2. In-class exercise

  • Go over ModernDive reading in schedule above.

About readings in this course:

  • You are responsible for completing a lecture’s readings before the next lecture. Ex: you are responsible for reading all of ModernDive Chapter 1 before Wednesday.
  • I teach lectures assuming you have not done the readings beforehand. However, if it suits your learning style better, please do read beforehand.
  • While you don’t need to turn in your learning check answers, I highly recommend you still do them. The solutions are in Appendix D of the book.
  • If you have your headphones, you may listen to music during in-class reading time.

Lec 01: Fri 9/3

Announcements

Welcome!

Today’s topics/activities

  • Course webpage: bit.ly/sds192kim
  • My story
  • “Knock on wood if you’re with me”
  • What this class is about: Answering questions with data
    1. Data viz
    2. Data wrangling
    3. Maps
    4. Websites
  • Break!
  • Executive summary of syllabus
  • This weekend: Complete intro survey

Code examples from class

# Data visualization
library(fivethirtyeight)
library(ggplot2)
library(dplyr)
year_bins <- c("'70-'74", "'75-'79", "'80-'84", "'85-'89", "'90-'94",
               "'95-'99", "'00-'04", "'05-'09", "'10-'13")

bechdel <- bechdel %>%
  mutate(five_year = cut(year, breaks = seq(1969, 2014, 5), labels = year_bins))

ggplot(bechdel, aes(x = five_year, fill = clean_test)) +
  geom_bar(position = "fill", color = "black") +
  labs(x = "Year", y = "Proportion", fill = "Bechdel Test") +
  scale_fill_brewer(palette = "YlGnBu")

# Data wrangling
library(fec16)
all_transactions <- read_all_transactions()
View(all_transactions)

# Maps
library(leaflet)
leaflet() %>%
  addTiles() %>% 
  addMarkers(lng=-72.64022, lat=42.31706, popup="Smith College")