Schedule

Topics:

  1. Data visualization (pink): Grammar of Graphics, Five Named Graphs (5NG), color theory.
  2. Working with data (blue): data wrangling, importing, and formatting.
  3. Maps and spatial data (green): Maps and geospatial data.
  4. Learning how to learn new data science tools (yellow): SQL, TBD.

Note that while topics and topic dates may change, the dates of all problem sets (PS), projects, and midterms will not.


Lec 31: Wed 11/20

Announcements

  • MP2 update

Today’s topics/activities

1. Chalk talk

  • Yingke’s observation about “tidy” data using the example from the Wide and narrow data Wikipedia page: the “narrow” data example is not a good example of “tidy” data because the Value variable does not have consistent units (years, lbs/kg, m/feet)
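One way to see this observation in code is the sketch below. It uses made-up values in the style of the Wikipedia example (the names, numbers, and units are assumptions, not the page's actual data) and tidyr's pivot_wider(), which needs tidyr 1.0.0 or later. Giving each variable its own column means every column carries a single unit.

```r
library(tidyr)

# Hypothetical "narrow" data: the Value column mixes years, pounds, and feet.
narrow <- data.frame(
  Person   = c("Bob", "Bob", "Bob", "Alice", "Alice", "Alice"),
  Variable = c("Age", "Weight", "Height", "Age", "Weight", "Height"),
  Value    = c(32, 168, 5.9, 24, 124, 5.4)
)

# Give each variable its own column so that each column has consistent units.
tidy <- pivot_wider(narrow, names_from = Variable, values_from = Value)
tidy
```

In the result there is one row per person, and Age, Weight, and Height can each be summarized or plotted without mixing units.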

2. In-class exercise

  1. Sec01: Visit from Smith GIS Lab at 11:30am. Slides on how to find spatial data
  2. Sec02: Get a version of your R Markdown Website deployed on Netlify

Lec 30: Mon 11/18

Announcements

  • MP3 due on Friday at 9pm (changed from 5pm)
  • Updates: MP2 feedback and Midterm II
  • DataCamp article:

Today’s topics/activities

1. Chalk talk

You’ll be submitting your final project using R Markdown Websites and, if your project has no sensitive data, publishing it on the web as well. Background:

  • Inputs: .Rmd and _site.yml files
  • Output: webpage in docs/ folder, in particular the index.html mainpage
  • “Deploying” your webpage: Many ways

2. In-class exercise

Today you’ll be building your own R Markdown Website, which is the technology I use for this 192 webpage. Then you’ll be deploying it to the web using Netlify drop.


  1. Set up the RStudio Project for an example R Markdown Website
    • Download the RStudio Project example_webpage.zip
    • Move example_webpage.zip to where you keep your 192 work, then unzip it to open the RStudio Project folder
    • Click on the example_webpage.Rproj RStudio Project file to open RStudio in “Project Mode”. You must do this.
  2. Create an account on netlify.com using your GitHub account
    • Log into GitHub first
    • Sign into Netlify using your GitHub account
  3. Change and build your website locally (on your computer)
    • In index.Rmd change the author from "Albert Y. Kim" to you and change the title
    • Build your R Markdown Website by going to the “Build” panel of RStudio -> Clicking “Build Website”.
      You can also use the keyboard shortcuts:
      • macOS: Command+Shift+B
      • Windows: Control+Shift+B
    • Inspect your webpage in your browser
  4. Deploy your R Markdown Website
    • Go to Netlify Drop
    • Drag-and-drop the docs/ folder output in your example_webpage RStudio Project folder.
    • If you want to rename your webpage’s URL rather than use the default one you’ve been assigned: Click on “Domain settings” -> Click on the “…” next to your default site name -> Click on “Edit site name” -> Rename your site
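The “Build Website” step above can also be run from the R console with rmarkdown::render_site(). The sketch below is a minimal illustration under stated assumptions: it builds a throwaway site in a temporary folder (the folder name my_site and the file contents are made up, not the course's example_webpage project) so that the two inputs, _site.yml and index.Rmd, and the docs/ output are visible in one place.

```r
library(rmarkdown)

# Hypothetical site folder in a temporary directory.
site_dir <- file.path(tempdir(), "my_site")
dir.create(site_dir, showWarnings = FALSE)

# _site.yml: site-wide settings; output_dir says where the built site goes.
writeLines(c(
  'name: "my-site"',
  'output_dir: "docs"'
), file.path(site_dir, "_site.yml"))

# index.Rmd: the page that becomes docs/index.html.
writeLines(c(
  "---",
  'title: "My website"',
  'author: "Your Name"',
  "---",
  "",
  "Hello, world!"
), file.path(site_dir, "index.Rmd"))

# Rendering needs pandoc, which ships with RStudio.
if (pandoc_available()) {
  render_site(site_dir, quiet = TRUE)
  file.exists(file.path(site_dir, "docs", "index.html"))
}
```

The resulting docs/ folder is what gets dragged onto Netlify Drop.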

Lec 29: Fri 11/15

Announcements

  • Sit next to your MP3 partner
  • Guest lecturer: Prof. Ben Baumer

Mega Announcement

Recently, the Five Colleges (with Smith as the lead institution) received a three-year grant from the NSF (totaling $1.2 million) for workforce development in data science. Most of that money will go to students, who will work on real-world data science projects sponsored by local community-based organizations. The Jandon Center for Community Engagement will be integrated into the project to bind all parties together.

We are looking for students to join the first cohort. Smith students can earn up to $2500 per semester, which is approximately 60% more than you would make in any other campus job. The minimum requirements for data science students are:

  • Has taken at least one course in computer programming
  • Has taken at least one course in statistics
  • Has taken at least one additional course in data science, data analysis, etc.
  • Is a sophomore or junior
  • Ability to commit to 8-10 hours per week through the spring semester

Unfortunately, only US citizens or permanent residents are eligible for funding from the NSF. It is possible that international students could be funded through a different mechanism, but there are no such alternative funding sources in place at present.

The application is here, please APPLY NOW!! Please contact Ben Baumer if you have more questions.

Today’s topics/activities

1. In-class exercise

  • Work on MP3

Lec 28: Wed 11/13

Announcements

  • Office hours today 3-4pm (changed from 3-4:30pm)
  • This Friday the 15th: I’m out of town
    • No office hours 9-10:30am
    • Guest lecturer: Prof. Ben Baumer
    • I will post PS06 solutions no later than Friday 5pm
  • MP3 examples and rubric posted
  • Members of the Smith College Spatial Analysis Lab will be making a brief presentation next week:
    • Sec01 AM: Wed 11/20 at 11:30am
    • Sec02 PM: Mon 11/18 at beginning of lecture
  • Check out the Jill Ker Conway Innovation & Entrepreneurship Center’s feature on SDS student Rachel Laflamme. In particular the quotes from the final paragraph:

Rachel indicated that being surrounded by “so many really intimidating Smithies” in classes, students who she believed were much smarter than she, caused her to question whether she should even pursue a career as a statistical and data scientist. Needless to say, those of us in attendance during her presentation - one delivered by a knowledgeable, talented and confident Smithie - were surprised by her revelation…

Asked about her most important take-away from the summer, Rachel thoughtfully replied:
“I learned that you have to understand your value and that you have a right to be compensated for the value that you bring. It’s a bold thing to fight for what you know you’re worth.”

Mega Announcement

SDS and GOV are hiring for a new joint position! The search committee would like your feedback, which they take very seriously.

  1. Tea Times: They are looking for students who would like to have tea with the candidates on 11/19, 11/21, and 11/22 at 3:30-4:30pm in Bass 418. Sign up here.
  2. Teaching demos: Please bring your laptop! We need students here since this is a teaching demonstration!
    • Ju Yeon (Julia) Park (University of Pittsburgh - Postdoc)
      Tue 11/19, 5-6pm, Sabin-Reed 301.
    • Scott LaCombe (University of Iowa - Doctoral Student)
      Thu 11/21, 5-6pm, Sabin-Reed 301.
    • Stan Oklobdzija (Claremont McKenna College - Postdoc)
      Fri 11/22, 5-6pm, Sabin-Reed 301.
  3. Research Talks: A limited number of lunches will be available (first come, first served) at the lunchtime research talks. This is a chance for you to see how scholars present their work to other scholars. The talks will be aimed at the professors in the audience. Students are very welcome to attend, though we invite students to observe but not ask questions.
    • Ju Yeon (Julia) Park (University of Pittsburgh - Postdoc)
      Tue 11/19, 12:15-1:15pm, Campus Center 103/104
      Title: When Do Politicians Grandstand? Measuring Message Politics in Committee Hearings
    • Scott LaCombe (University of Iowa - Doctoral Student)
      Thu 11/21, 12:15-1:15pm, Campus Center 103/104
      Title: Institutional Design and Policy Responsiveness in US States
    • Stan Oklobdzija (Claremont McKenna College - Postdoc)
      Fri 11/22, 12:15-1:15pm, McConnell B15
      Title: Dark Money and Political Parties After Citizens United

Today’s topics/activities

1. Chalk talk

  • Choropleth maps. In particular, how you set the bins corresponding to the color gradient can affect how your map looks. As indicated here, there are two approaches:
    • Equally sized interval bins
    • Quantile based bins
  • Application Programmer Interfaces
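To make the two choropleth binning approaches concrete, here is a small base R sketch. The values are made up (say, one number per region, with one large outlier), and the choice of four bins is an assumption for illustration only.

```r
# Hypothetical values for 8 regions; note the one large outlier.
values <- c(20, 22, 25, 27, 30, 35, 40, 200)

# Approach 1: four equally sized intervals spanning the range of the data.
equal_bins <- cut(values, breaks = 4)

# Approach 2: four quantile-based bins, each holding about a quarter of the regions.
quantile_breaks <- quantile(values, probs = seq(0, 1, by = 0.25))
quantile_bins <- cut(values, breaks = quantile_breaks, include.lowest = TRUE)

table(equal_bins)     # the outlier pushes almost every region into the lowest bin
table(quantile_bins)  # regions are spread evenly across the bins
```

With equal intervals, a map of these values would be nearly one color; with quantile bins, each color would cover the same number of regions, at the cost of bins with very different widths.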

2. In-class exercise

  • tidycensus package

Lec 27: Mon 11/11

Announcements

  • I filled out a rough schedule of topics for the rest of the semester in the spreadsheet above.
  • Meet Osman Keshawarz, Data Research and Statistics Counselor from the Spinelli Center. You can book office hours here.
  • PS06 posted: Two components
    • R component due on Friday 11/15 10:45am: Straightforward and serves as practice for MP3.
    • Ethics reading quiz in-class on Monday 11/18. This is to ensure you are prepared for an in-class discussion. What do we do when “bad” people make “good” technology?
  • MP3 details posted.

Today’s topics/activities

1. Chalk talk

  1. Absolute vs relative filepaths
  2. Converting data in R into sf objects. We can then do both
    • Data wrangling using dplyr on them
    • Plot them in ggplot2 using geom_sf() layers
  3. Centroids: Geographic and population weighted. Not coincidentally, check out where FedEx and UPS main airport hubs are located.
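A minimal sketch of items 2 and 3, assuming the sf, dplyr, and ggplot2 packages are installed. The data frame, its column names, coordinates, and populations are all made up for illustration. Averaging longitude and latitude directly is a rough approximation of a centroid, used here only to show the effect of weighting.

```r
library(sf)
library(dplyr)
library(ggplot2)

# Hypothetical places with made-up coordinates and populations.
cities <- data.frame(
  name = c("A", "B", "C"),
  lon  = c(-72.6, -71.1, -74.0),
  lat  = c(42.3, 42.4, 40.7),
  pop  = c(30000, 700000, 8000000)
)

# Convert the data frame into an sf object; EPSG:4326 is longitude/latitude.
cities_sf <- st_as_sf(cities, coords = c("lon", "lat"), crs = 4326)

# dplyr verbs work on sf objects; the geometry column comes along.
big_cities_sf <- cities_sf %>% filter(pop > 100000)

# Plot with a geom_sf() layer.
cities_plot <- ggplot() + geom_sf(data = cities_sf, aes(size = pop))

# Centroid of the three points: unweighted vs weighted by population.
geographic_centroid <- c(lon = mean(cities$lon), lat = mean(cities$lat))
population_centroid <- c(
  lon = weighted.mean(cities$lon, cities$pop),
  lat = weighted.mean(cities$lat, cities$pop)
)
```

The population-weighted centroid is pulled toward place C, which has by far the largest population.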

2. In-class exercise

While you are free to work in any order you like, I suggest you:

  1. Go through the examples.Rmd file in the MP3 RStudio Project (be sure to be working in RStudio Project mode):
    • Section 1: Interactive maps using leaflet
    • Section 2: Convert data frames into sf objects
    • Section 3: Loading shapefiles into R
    • Section 4: (We’ll do this on Wednesday) Choropleth maps using census data.
  2. Work on PS06. This is direct practice for MP3.
  3. Then start on MP3.

3. Tweet of the day

We’re at the cutting edge of map technology in R!


Lec 26: Fri 11/8

Announcements

  • MP3: Maps!
    • Groups posted on Slack in #MP3 channel.
    • You will be making one interactive and one static map. More details on Monday.

Today’s topics/activities

1. Chalk talk

  • Reading of “On Exactitude in Science” by Borges.
  • Slides on GIS
  • London underground maps. Compare them. As stated in this article, the transit map sacrifices accuracy for clarity.
    • The map of stations as they truly exist:
    • The transit map inside the stations and trains. All lines are either straight or at 45 degrees and furthermore the geographic space is distorted.

2. In-class exercise

Two chief tools we’ll be using.

Note: the code that was posted here has now been moved to the example.Rmd file in the MP3 RStudio Project folder.


Lec 25: Wed 11/6

Announcements


Today’s topics/activities

1. Chalk talk

  • Go over Midterm II practice exam solutions. Practice exam is posted on Slack under #general_announcements.

Lec 24: Mon 11/4

Announcements

  • Midterm II this weekend.
  • Coming on Wednesday. Organizers have:
    • Stated “This space is not for debate between speakers and audience. It is for speakers to be heard and listened to.”
    • Asked that you wear black in solidarity.


Today’s topics/activities

1. Chalk talk

  • Fact that will 🤯 about MP2 data.
  • Midterm II discussion. See midterms page.

2. In-class exercise

  • Practice midterm posted on Slack; we’ll go over solutions on Wednesday.

Lec 23: Fri 11/1

Announcements

Today’s topics/activities

1. In-class exercise

  • Work on MP2

Lec 22: Wed 10/30

Announcements

  • If you would like to give mid-semester feedback on how this course is going so far, you can leave feedback (either anonymously or not) via the Google Forms link posted in the #general_announcements channel in Slack. The form will be open until Friday Nov 1st at 5pm.
  • Smithies in SDS FaculTEA on Friday 4pm in CC 103/104 with Albert Y. Kim, Ben Baumer, David Rockoff, Katherine Halvorsen, Miles Ott, and Randi Garcia.


Today’s topics/activities

1. Chalk talk

  • Strategies for writing reports.

2. In-class exercise

Work on MP2!


Lec 21: Mon 10/28

Announcements

  • Sit next to your MP2 partner today
  • Look at some posts in #random channel
  • Lecture policy reminder
  • Added “Tips & Tricks” tab to menu bar of course webpage
  • Reminder: joint SDS/PSY faculty member Randi Garcia will be honored with the Sherrerd Prize teaching award today at 4:30pm in the Campus Center Carroll Room. Please attend!
  • Biennial (every two years) graduate school panel and discussion on MVP’s.


Today’s topics/activities

1. Chalk talk

Solutions to PS05 Q1.c). What happened to the average age above 60? The hint given on Slack was to look at the ggplot2 cheatsheet -> 2nd page -> Bottom right corner -> “Zooming”

  • ylim(a, b) sets the limits of the y-axis to be between a and b and “clips” (throws out) any points outside this interval
  • coord_cartesian(ylim = c(a, b)) zooms in on the y-axis to be between a and b but does not “clip” (throw out) the points outside this interval

For example, consider the following regression line:

Let’s set the y-axis limit to be between 0 and 3. Using ylim(0, 3) clips out the point (5, 5) and thus the regression line is flat:

However, using coord_cartesian(ylim = c(0, 3)) merely zooms in on this part of the y-axis without clipping the point (5, 5) and thus the regression line is the original one:
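The comparison above can be reproduced with a sketch like the following. The data frame is an assumption: four points at y = 1 plus the point (5, 5), chosen so that dropping (5, 5) leaves a flat line. The lm() calls at the end compute the same two fits directly so the slopes can be compared as numbers.

```r
library(ggplot2)

# Hypothetical data: four points at y = 1 plus the point (5, 5).
df <- data.frame(x = 1:5, y = c(1, 1, 1, 1, 5))

base_plot <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# ylim() drops (5, 5) before the line is fitted, so the line is flat.
# ggplot2 warns about the removed row when this plot is printed.
p_clip <- base_plot + ylim(0, 3)

# coord_cartesian() only zooms; the line is still fitted to all five points.
p_zoom <- base_plot + coord_cartesian(ylim = c(0, 3))

# The same two fits computed directly with lm().
slope_all     <- unname(coef(lm(y ~ x, data = df))["x"])
slope_clipped <- unname(coef(lm(y ~ x, data = subset(df, y <= 3)))["x"])
```

Here slope_all is positive while slope_clipped is zero, which is the difference between the two plots.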

2. In-class exercise

Work on MP2!


Lec 20: Fri 10/25

Announcements

  • Open floor to share thoughts/feeling about President McCartney’s email.
  • MP1 grading update

Today’s topics/activities

1. Chalk talk

  • Recap of Lec19: In-class exercise, computing available seat miles. Note solution is in ModernDive Appendix D -> Learning Check 3.20.
    1. Order of arithmetic matters!
    2. Difference between inner_join() and left_join()
  • Importing spreadsheet data into R. Either Excel files or .csv Comma-Separated Values files. See image of example .csv file below.
  • “Tidy” data format, in other words, long/tall/narrow format. This is as opposed to wide format.
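The two recap points about joins and order of arithmetic can be sketched on toy data. The tables below are made-up miniature stand-ins for the flights and planes data frames (the names flights_small and planes_small and all values are assumptions), and the “wrong” calculation shows just one way the order of operations can matter.

```r
library(dplyr)

# Hypothetical miniature versions of the flights and planes tables.
flights_small <- tibble(
  tailnum  = c("N1", "N1", "N2", "N3"),
  distance = c(100, 200, 300, 400)
)
planes_small <- tibble(
  tailnum = c("N1", "N2"),
  seats   = c(50, 100)
)

# inner_join() keeps only flights whose plane appears in planes_small: N3 is dropped.
inner <- flights_small %>% inner_join(planes_small, by = "tailnum")

# left_join() keeps every flight: N3 stays, with seats = NA.
left <- flights_small %>% left_join(planes_small, by = "tailnum")

# Order of arithmetic: multiply within each flight first, then add up.
asm_correct <- inner %>%
  mutate(seat_miles = seats * distance) %>%
  summarize(total = sum(seat_miles))

# Summing first and multiplying afterwards gives a different number.
asm_wrong <- inner %>%
  summarize(total = sum(seats) * sum(distance))
```

Comparing asm_correct and asm_wrong shows that the sum of products is not the product of sums.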


2. In-class exercise

  • Go over ModernDive 4.1 - 4.2

Lec 19: Wed 10/23

Announcements

  • Sit next to your MP2 partner for today’s in-class exercise
  • Are you interested in Majoring in SDS? Check out our presentation of the major tomorrow!


Today’s topics/activities

1. Chalk talk

Recap of Lec18: Click on tweet below to see all 6 different types of joins:

Copy this to your classnotes.Rmd

2. In-class exercise

Complete ModernDive Learning Check 3.20: Using data in nycflights13 package, compute available seat miles:



Lec 18: Mon 10/21

Announcements

  • MP2 rubric posted
  • Posting in #maps channel on Slack
  • For Wednesday’s lecture sit next to your MP2 partner. We’ll be doing an in-class data-wrangling exercise

Today’s topics/activities

1. Chalk talk

  • Go over PS04 solutions
  • Recap of Lec17: How to arrange() by more than one variable
  • Last major set of verbs: _join() data frames, select(), and rename() variables
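A short sketch of sorting by more than one variable, followed by select() and rename(). The students data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical data frame.
students <- tibble(
  name  = c("Ana", "Bo", "Cy", "Di"),
  year  = c(2, 1, 2, 1),
  score = c(90, 85, 95, 70)
)

# Sort by year first, then by descending score within each year.
sorted <- students %>% arrange(year, desc(score))

# Keep only some columns, then rename one (new_name = old_name).
trimmed <- students %>%
  select(name, score) %>%
  rename(points = score)
```

In arrange(), later variables only break ties left by the earlier ones.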

2. In-class exercise

  • Go over ModernDive 3.7, 3.8.1 - 3.8.2.

Lec 17: Fri 10/18

Announcements

  • PS05 posted: Due in one week.
  • Mini-Project 2 posted: Due in two weeks.
  • SDS talk on Monday at 5pm by Yeshimabeith Milner, founder of Data for Black Lives. Shout out to Smithie Laneé Jung for her important role in making this happen.


Today’s topics/activities

1. Chalk talk

  • Recap of Lec16:
    • Go over diagram of group_by() %>% summarize()
    • Difference between sum() and n() summary functions.
  • mutate() new columns/variables and arrange() i.e. sort rows
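The verbs in this recap can be chained in one pipeline. The sketch below uses a made-up orders data frame to contrast n(), which counts rows, with sum(), which adds up a variable's values.

```r
library(dplyr)

# Hypothetical data: one row per order.
orders <- tibble(
  customer = c("a", "a", "b", "b", "b"),
  items    = c(2, 3, 1, 1, 4)
)

per_customer <- orders %>%
  group_by(customer) %>%
  summarize(
    n_orders    = n(),        # counts the rows in each group
    total_items = sum(items)  # adds up the values of a variable in each group
  ) %>%
  mutate(items_per_order = total_items / n_orders) %>%  # new column
  arrange(desc(items_per_order))                        # sort rows
```

Customer b has more rows (n() is larger) but customer a has more items per order, so a is listed first.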

2. In-class exercise

  • Go over ModernDive 3.5 - 3.6

Lec 16: Wed 10/16

Announcements

  • Slack: Post on #random

Today’s topics/activities

1. Chalk talk

  • Recap of Lec14
  • summarize() rows and group_by() %>% summarize()

2. In-class exercise

  • Keyboard shortcut in RStudio
  • Go over ModernDive 3.3 - 3.4

Lec 15: Fri 10/11

Announcements

  • PS04 posted
  • MP1: Due at 5pm on Friday.
    1. Files on Moodle
    2. Peer evaluation Google Form
    3. Group leader only: PDF of reflection piece on Moodle

Today’s topics/activities

1. In-class exercise

  • Put finishing touches on MP1
  • Do readings from Lec14 on Wed 10/9: ModernDive 3 - 3.2
  • Start PS04 after you’ve done ModernDive readings

Lec 14: Wed 10/9

Announcements

Today’s topics/activities

1. Chalk talk

  • Intro to data wrangling
  • Pipe operator %>%
  • filter() rows that meet certain criteria

Example:
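A minimal sketch of the pipe and filter(), using a made-up pets data frame (not the data used in lecture):

```r
library(dplyr)

# Hypothetical data frame.
pets <- tibble(
  name    = c("Rex", "Tom", "Ivy", "Max"),
  species = c("dog", "cat", "cat", "dog"),
  age     = c(3, 7, 2, 10)
)

# Without the pipe: the data frame is the first argument of filter().
old_dogs_nested <- filter(pets, species == "dog", age > 5)

# With the pipe: x %>% f(y) is the same as f(x, y).
old_dogs_piped <- pets %>%
  filter(species == "dog", age > 5)
```

Both versions keep only the rows where every condition is TRUE; separating conditions with a comma acts like “and”.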

2. In-class exercise

  • Go over ModernDive 3 - 3.2

Lec 13: Mon 10/7

Announcements

  • No talking about midterm until Wednesday’s lecture please.
  • Sit next to your MP1 partner for today’s lecture.
  • If you’re curious about my experiences in grad school, working at Google, switching to academia, and advice for aspiring data scientists, check out my appearance on episode #43 of the DataBytes podcast “To Google and Back.” Also available on Apple Podcasts and Google Play.


Today’s topics/activities

1. Chalk talk

  • Recap of Lec09:
    • span argument from geom_smooth()
    • What does se = FALSE mean
  • What is a “minimally viable product”?
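The two geom_smooth() arguments from the recap can be tried out on simulated data. Everything about the data below (the sine curve, the noise level, the seed) is an assumption chosen only to make the effect of span visible.

```r
library(ggplot2)

# Simulated noisy data along a sine curve.
set.seed(192)
df <- data.frame(x = seq(0, 10, length.out = 100))
df$y <- sin(df$x) + rnorm(100, sd = 0.3)

# Small span: each local fit uses fewer nearby points, giving a wigglier curve.
# se = FALSE hides the grey band around the curve.
p_wiggly <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.2, se = FALSE)

# Large span: a smoother curve. se = TRUE (the default) draws the band.
p_smooth <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.9, se = TRUE)
```

Printing p_wiggly and p_smooth side by side shows the trade-off: a small span follows the data closely, while a large span can miss real bends in the trend.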

2. In-class exercise

  • Work on MP1

Lec 12: Fri 10/4

Announcements

  • Part of class-time on Monday to work on MP1
  • Go over updated Midterm I instructions

Today’s topics/activities

1. Chalk talk

  • Trend lines. Two types (among many):
    • Linear regression
    • LOESS smoother:
  • Let’s make a Shiny interactive visualization! If this is your first shiny app, you will need to install some packages: say “yes” to any prompts.
    1. Go to RStudio menu bar -> File -> New File… -> Shiny -> Give it title “LOESS smoother”
    2. Let’s keep things simple and delete everything after line 38
    3. Save it as loess.Rmd
    4. Click “Run Document”
    5. See all possible input methods by looking at cheatsheet. Go to RStudio menu bar -> Help -> Cheatsheets -> Web Applications with Shiny -> Look at right-side of first page.

Lec 11: Wed 10/2

Announcements

  • None

Today’s topics/activities

1. Chalk talk

  • Go over Midterm I

Lec 10: Mon 9/30

Announcements

Today’s topics/activities

1. Chalk talk

  • Recap of Lec09: Color palettes from colorbrewer2.org
  • Student question: What is a tibble?
  • Recap of “five named graphs”: ModernDive Table 2.4

Lec 09: Fri 9/27

Announcements

  • Slack:
    • How to subscribe to a #channel
    • Using threads to keep conversations organized
  • Go over PS03 solutions

Today’s topics/activities

1. Chalk talk

  • Recap of barplots: Exercise on pie charts vs barplots below
  • Color theory
    1. color vs fill aesthetics in ggplot2
    2. Selecting a color palette from colorbrewer2.org
    3. How does ggplot2 pick default colors? Using a color wheel
    4. Get the color hex codes of ggplot2 default color palette:
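One way to explore the points above, assuming the scales package (installed alongside ggplot2) is available. hue_pal() returns the default palette's hex codes, which are equally spaced hues on the color wheel. The small data frame and the choice of the "Set1" ColorBrewer palette are for illustration only.

```r
library(ggplot2)
library(scales)

# Hex codes of the default ggplot2 palette for 3 groups.
default_colors <- hue_pal()(3)
default_colors

# Hypothetical pre-counted data.
df <- data.frame(group = c("a", "b", "c"), count = c(3, 5, 2))

# color controls the outline of the bars; fill controls their interior.
p_default <- ggplot(df, aes(x = group, y = count, fill = group)) +
  geom_col(color = "black")

# Swap the default colors for a ColorBrewer palette.
p_brewer <- p_default + scale_fill_brewer(palette = "Set1")
```

Changing the 3 in hue_pal()(3) shows how the default colors shift as the number of groups changes.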

2. Exercise on pie charts vs barplots

Say the following piecharts represent results of an election poll at time points: A = September, B = October, and C = November. At each time point we present the proportion of the poll respondents who say they will support one of 5 candidates: 1 through 5.

Based on these 3 piecharts, answer the following questions:

  1. At time point A, is candidate 5 doing better than candidate 4?
  2. Did candidate 3 do better at time point B or time point C?
  3. Who gained more support between time point A and time point B, candidate 2 or candidate 4?

Compare that to using barplots. Which do you prefer?

3. In-class exercise

  • Quiz on podcast
  • First phase of Mini-Project 1 roll-out

Lec 08: Mon 9/23

Announcements

  • No office hours on Wednesday
  • In Lec07 below, added image of chalk talk data and boxplot.
  • Problem sets: PS01 handed back

Tweet of the Day

Today’s topics/activities

1. Chalk talk

  • Recap of boxplots
  • Summary statistics that are robust to outliers: median and IQR
  • Barplots

2. In-class exercise

  • Go over ModernDive 2.8

Lec 07: Fri 9/20

Announcements

  • Slack:
    • Prof. Katie Kinnaird’s TRIPODS+X - Data Science Education Investigation
    • Post on #random by Ray
  • Problem sets:
  • Announcement from Smithies in SDS:


Tweet of the Day

Today’s topics/activities

1. Chalk talk

  • Recap of histograms
  • Facets to split a visualization by the values of another variable
  • Boxplots! Powerful, but tricky!

Say we want to study the distribution of the following 12 values which are pre-sorted:

1, 3, 5, 6, 7, 8, 9, 12, 13, 14, 15, 30

They have the following summary statistics. A summary statistic is a single numerical value summarizing many values. Examples include the immediately obvious mean AKA average and median. Other less immediately obvious examples include:

  • Quartiles (1st, 2nd, and 3rd) that cut up the data into 4 parts, each containing roughly one quarter = 25% of the data
  • Minimum & maximum
  • Interquartile-range (IQR): the distance between the 3rd and 1st quartiles

  Min.   1st Quartile   Median = 2nd Quartile   3rd Quartile   Max.   IQR
  1      5.5            8.5                     13.5           30     8 = 13.5 - 5.5
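These summary statistics can be checked in base R. The sketch uses fivenum(), which takes the quartiles as the medians of the lower and upper halves of the data and so matches the values above. The last part, which replaces 30 with a much larger made-up value, is an added illustration of why the median is less sensitive to outliers than the mean.

```r
x <- c(1, 3, 5, 6, 7, 8, 9, 12, 13, 14, 15, 30)

# Five-number summary: min, 1st quartile, median, 3rd quartile, max.
five <- fivenum(x)
five

# IQR: 3rd quartile minus 1st quartile.
iqr <- five[4] - five[2]
iqr

# Replace the largest value with an extreme outlier:
# the mean moves a lot, the median does not move at all.
x_outlier <- replace(x, length(x), 3000)
c(mean = mean(x), median = median(x))
c(mean = mean(x_outlier), median = median(x_outlier))
```

Note that quantile() and IQR() use a different interpolation rule by default, giving quartiles of 5.75 and 13.25 and an IQR of 7.5 for these 12 values. Both conventions are in common use.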

Let’s compare the points and the corresponding boxplot side-by-side with the values on the \(y\)-axis matching:

2. In-class exercise

  • Go over ModernDive 2.6 - 2.7
  • Start PS03

I don’t mind what you do with your class time, but it is very important that you complete the reading before next lecture. Boxplots take practice.


Lec 06: Wed 9/18

Announcements

  • Prof. Katie Kinnaird’s TRIPODS+X - Data Science Education Investigation

Today’s topics/activities

1. Chalk talk

  • Recap of previous lecture
  • Live-demo of creating classnotes.Rmd, an R Markdown file of all in-class exercise code: Write and copy/paste/tweak code in classnotes.Rmd and not in console. That way you can save it!
  • Histograms for visualizing distribution of a numerical variable.

2. In-class exercise

  • Go over ModernDive 2.5

Lec 05: Mon 9/16

Announcements

  • Slack message: Abandoning RStudio Cloud in favor of RStudio Desktop.
  • The art of managing Slack notifications

Today’s topics/activities

1. Chalk talk

  • Recap of previous lecture
  • Overplotting and two approaches for addressing it
  • Linegraphs
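For overplotting, the sketch below assumes the two approaches are the ones covered in the ModernDive 2.3.2 reading: adjusting transparency and jittering. The data are simulated so that many points land on exactly the same coordinates.

```r
library(ggplot2)

# Simulated data: 500 points that can only fall on a 5 x 5 grid.
set.seed(192)
df <- data.frame(
  x = sample(1:5, 500, replace = TRUE),
  y = sample(1:5, 500, replace = TRUE)
)

# Approach 1: make points transparent so stacked points look darker.
p_alpha <- ggplot(df, aes(x = x, y = y)) +
  geom_point(alpha = 0.1)

# Approach 2: nudge each point by a small random amount.
p_jitter <- ggplot(df, aes(x = x, y = y)) +
  geom_jitter(width = 0.2, height = 0.2)
```

Jittering changes only how the points are drawn, not the underlying data, so the width and height should stay small relative to the spacing of the values.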

2. In-class exercise

  • Go over ModernDive 2.3.2 - 2.4

Lec 04: Fri 9/13

Announcements

  • Screencast from last lecture posted
  • I’m currently investigating issue with RStudio Cloud being slow
  • PS02 posted under Problem Sets

Today’s topics/activities

1. Chalk talk

  • Recap of previous lecture
  • R Markdown for reproducible research
    • Input: An .Rmd file
    • Output: An .html webpage

2. In-class exercise

  1. At a couple of steps in this process, you will be asked to install packages. Say yes to all of them!
  2. Fiddle with RStudio settings:
    • Go to RStudio menu bar -> Tools -> Global Options… -> R Markdown
    • Uncheck box next to “Show output inline for all R Markdown Documents”
  3. Create new R Markdown .Rmd file:
    • Go to RStudio menu bar -> File -> New File -> R Markdown
    • Set “Title” to “My first R Markdown report” and “Author” as your name.
  4. “Knit” a report:
    • Click on the disk icon and save this file as testing somewhere on your computer. This will create a file called testing.Rmd
    • Click the arrow next to “Knit” -> “Knit to HTML”.
    • An HTML webpage should pop up. However, it may be blocked by your browser. If so, in your browser’s URL bar on the right, click on “Always allow pop-ups”.
  5. Publish this report on web:
    • Click on blue “Publish” button on top right of the resulting pop-up html.
    • Select RPubs.
    • If you haven’t previously, create an account on RPubs.com. If you have previously, log in.
    • Set “Title” to “My first R Markdown report” and “Slug” to “testing”
    • You should end up with a webpage that looks like this one. This is live on the web!
  6. Update your report on web:
    • Make some trivial change to your testing.Rmd file.
    • “Re-knit” your report and make sure your trivial change is reflected.
    • The blue “Publish” button should now read “Republish”
    • Click “Update existing”
    • Your updates are now live on the web!
  7. Bonus: Play around with different formatting tools in R Markdown to customize your report! Go to RStudio menu bar -> Help -> Markdown quick reference.
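The “Knit to HTML” step above can also be run from the R console with rmarkdown::render(). The sketch below is self-contained under one assumption: it writes a tiny testing.Rmd to a temporary folder rather than using a file you saved yourself.

```r
library(rmarkdown)

# Write a minimal .Rmd file to a temporary folder (hypothetical contents).
rmd_path <- file.path(tempdir(), "testing.Rmd")
writeLines(c(
  "---",
  'title: "My first R Markdown report"',
  'author: "Your Name"',
  "output: html_document",
  "---",
  "",
  "Some *formatted* text."
), rmd_path)

# Input: the .Rmd file. Output: an .html file next to it.
# Rendering needs pandoc, which ships with RStudio.
if (pandoc_available()) {
  html_path <- render(rmd_path, quiet = TRUE)
  file.exists(html_path)
}
```

render() returns the path of the output file, which can then be opened in a browser.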

Tips on R Markdown:

  1. Knit early, knit often! If you wait until after you’ve added a ton of code to knit and something doesn’t work, you’ll have a hard time figuring out where the error is. If you make incremental changes and knit after every step, you’ll be better able to isolate where errors are.
  2. If you get stuck, go through these 6 R Markdown Fixes first, then seek assistance. These 6 fixes resolve 85% of issues in my experience.

Lec 03: Wed 9/11

Announcements

  • Slack updates: custom emojis and vote in today’s poll!

Today’s topics/activities

1. Chalk talk

  • Recap of previous lecture
  • Grammar of Graphics
  • Screencast of “Doing ModernDive readings”. In particular the idea of “Running R code in RStudio”:


2. In-class exercise

  • Go over ModernDive 2 - 2.3.1.

Lec 02: Mon 9/9

Announcements

Today’s topics/activities

1. Chalk talk

  • Intro to Slack slides
  • What is difference between R and RStudio?
  • What are R packages?

2. In-class exercise

  • Set up RStudio Cloud:
    • Click here to join the “SDS192” Workspace.
    • Click on “New Project”
    • Name it “Class Notes”
  • Go over ModernDive reading in schedule above.

About readings in this course:

  • You are responsible for completing a lecture’s readings before the next lecture. Ex: you are responsible for reading all of ModernDive Chapter 1 before Wednesday.
  • I teach lectures assuming you have not done the readings beforehand. However, if it suits your learning style better, please do read beforehand.
  • While you don’t need to turn in your learning check answers, I highly recommend you still do them. The solutions are in Appendix D of the book.
  • If you have your headphones, you may listen to music during in-class reading time.

Lec 01: Fri 9/6

Announcements

Welcome!

Today’s topics/activities

  • My story.
  • What this class is about: Answering questions with data.
  • Executive summary of syllabus; finalized syllabus will be published next week.
  • Coding: it’s normal to be 😱