Why is it important to write clean code that is well documented and actually works? To be mindful of the work you create for your most important collaborator.
Today’s Topics/Activities
1. In-class exercise
Test two useful packages below:
patchwork: Combine two ggplots together
janitor: Clean up messy variable names
Work on Final Project
    library(tidyverse)

    # 1. Combine two ggplots together using patchwork package:
    library(patchwork)
    plot_1 <- ggplot(mtcars) +
      geom_point(aes(mpg, disp))
    plot_2 <- ggplot(mtcars) +
      geom_boxplot(aes(gear, disp, group = gear))

    # Side-by-side:
    plot_1 + plot_2

    # On top of each other:
    plot_1 / plot_2

    # 2. Say we have a data frame with really messy names:
    data_frame_ugly <- tibble(
      `asdf ?!? qwerty%` = c(1, 2),
      variable.name...NAMES = c(2, 1)
    )
    data_frame_ugly

    # You can clean them very easily using the clean_names() function from the
    # janitor package
    library(janitor)
    data_frame_clean <- data_frame_ugly %>%
      clean_names()
    data_frame_clean
Lec 35: Fri 12/3
Announcements
MTH/SDS tenure track search email
Final project:
Open Slack to #final_project channel
Group leader: Create a Slack DM with all members AND myself and say “we’re a group”
Submission details for Final Project to be posted on Monday.
dplyr and SQL are very similar. Both are based on the same idea of database normalization (1970).
Moral of the story: If you know the dplyr package for data wrangling, you can learn SQL very quickly.
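To make the parallel concrete, here is a minimal sketch that pairs a dplyr pipeline with a roughly equivalent SQL query written as a comment. It uses the built-in mtcars data rather than the course database, and the SQL assumes a hypothetical table that is also named mtcars.

```r
library(dplyr)

# dplyr: average mpg of 4-cylinder cars, by number of gears
mpg_by_gear <- mtcars %>%
  filter(cyl == 4) %>%
  group_by(gear) %>%
  summarize(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))
mpg_by_gear

# Roughly equivalent SQL (assumes a hypothetical table named mtcars):
# SELECT gear, AVG(mpg) AS mean_mpg
# FROM mtcars
# WHERE cyl = 4
# GROUP BY gear
# ORDER BY mean_mpg DESC;
```

Each dplyr verb lines up with a SQL clause: filter() with WHERE, group_by() with GROUP BY, summarize() with the aggregate in SELECT, and arrange() with ORDER BY.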
2. In-class exercise
Perform 2-3 SQL queries to convince yourself that if you know dplyr, you can learn SQL very quickly. This exercise is based on Prof. Baumer’s lecture notes.
Setup
Install and open MySQLWorkbench; this whole process takes about 10 minutes and necessitates creating an account with Oracle.
Close the “Welcome to MySQL Workbench” message
Click the plus sign next to “MySQL Connections” to add a connection to a SQL database.
Set up a new connection as shown in #general in Slack
Click the resulting “Playing with SQL” connection and input the password in #general in Slack
Running SQL code
Copy and paste the code below into the Query window
For each of the 10 code segments: highlight it and then run it by clicking the “lightning” icon.
Lec 34: Wed 12/1
Announcements
Download and install MySQL Workbench before Friday’s lecture
Drag-and-drop the docs/ folder output in your example_webpage RStudio Project folder.
If you want to rename your webpage’s URL rather than use the default one you’ve been assigned: Click on “Domain settings” -> Click on the “…” next to your default site name -> Click on “Edit site name” -> Rename your site
Lec 33: Mon 11/29
Announcements
Candidates for two new SDS faculty will be on campus this week and next
For data science position: See “Meet the SDS Data Science New Faculty Candidates!” email sent to SDS student mailing list.
For joint MTH/SDS position: To be confirmed soon.
Because of this, office hours are highly inconsistent and variable this week. However, they will always be confirmed and posted at least 24h in advance.
Final project in groups of 2-3 will be assigned on Wednesday and due Fri 12/17 9pm (last day of exams). You can choose your group.
PS07 posted
To do before chalk talk
If you haven’t already, create an account on GitHub.com using your Smith email address. If you already have a GitHub account, make sure your Smith email is in your Profile settings.
From MP3 Project -> examples.Rmd -> Section 3 -> Look at contents of mass_pop_orig -> GEOID variable:
    > mass_pop_orig
    Simple feature collection with 14 features and 7 fields
    Geometry type: MULTIPOLYGON
    Dimension:     XY
    Bounding box:  xmin: -73.50814 ymin: 41.23796 xmax: -69.92839 ymax: 42.88659
    Geodetic CRS:  NAD83
    First 10 features:
       GEOID NAME                             variable   estimate moe
    1  25017 Middlesex County, Massachusetts  B01003_001  1600842  NA
    2  25025 Suffolk County, Massachusetts    B01003_001   796605  NA
    3  25001 Barnstable County, Massachusetts B01003_001   213496  NA
    4  25027 Worcester County, Massachusetts  B01003_001   824772  NA
    5  25011 Franklin County, Massachusetts   B01003_001    70577  NA
    6  25013 Hampden County, Massachusetts    B01003_001   467871  NA
    7  25015 Hampshire County, Massachusetts  B01003_001   161032  NA
    8  25021 Norfolk County, Massachusetts    B01003_001   700437  NA
    9  25005 Bristol County, Massachusetts    B01003_001   561037  NA
    10 25009 Essex County, Massachusetts      B01003_001   783676  NA
2. In-class exercise
Work on MP3
Lec 29: Mon 11/15
Announcements
Work on MP3 this Wed, Fri, and Mon before Thanksgiving break.
Choropleth maps. In particular, how you set the bins corresponding to the color gradient can affect how your map looks. As indicated here, there are two approaches:
Equally sized interval bins
Quantile based bins
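The two binning approaches can be sketched in base R as follows. This is only an illustration, not the code from examples.Rmd: it reuses the 10 county population estimates printed in the mass_pop_orig output above, and the choice of 5 bins is an assumption.

```r
# County population estimates copied from the mass_pop_orig output above
# (first 10 counties only)
estimate <- c(1600842, 796605, 213496, 824772, 70577,
              467871, 161032, 700437, 561037, 783676)

# Approach 1: five equally sized intervals (every bin spans the same range)
equal_bins <- cut(estimate, breaks = 5)
table(equal_bins)

# Approach 2: five quantile-based bins (every bin holds the same number of counties)
quantile_breaks <- quantile(estimate, probs = seq(0, 1, by = 0.2))
quantile_bins <- cut(estimate, breaks = quantile_breaks, include.lowest = TRUE)
table(quantile_bins)
```

With skewed data like population counts, equally sized intervals can leave some bins sparse or empty, while quantile-based bins spread the counties evenly across colors.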
2. In-class exercise
Go over code in MP3 folder -> examples.Rmd -> Section 3 on “Choropleth maps using census data”
You will need to register an API key from the Census Bureau. Carefully read the warning message to do so.
Lec 28: Fri 11/12
Announcements
Grad school panel on Mon 11/22 featuring SDS alumna. More info in #general
    # Don't do all this:
    library(ggplot2)
    library(dplyr)
    library(readr)
    library(tidyr)
    library(stringr)
    library(tibble)
    library(forcats)
    library(purrr)

    # Instead, do this:
    library(tidyverse)
Added “Tips & Tricks” tab to menu bar of course webpage
Update to syllabus
For a truly unique perspective on Data Visualization: Mona Chalabi @monachalabi. See video below:
Today’s topics/activities
1. Chalk talk
None
2. In-class exercise
Work on MP2
Lec 20: Mon 10/25
Announcements
MP1 grades posted: See Slack #mp1 for details
SDS is currently working hard to hire 3 new faculty who will start in July 2022; I’m chairing one committee and sitting on another. As a result, for the next month:
My office hours will be highly variable; consult the calendars in the syllabus often
There will unfortunately be lags in returning grading
Sec01 Stoddard only: I won’t be able to stay past 12:05pm so that I can attend lunch meetings.
Today’s topics/activities
1. Chalk talk
Recap of Lec19: Why did we use inner_join() for the solution to LC 3.20 on computing Available Seat Miles?
Importing spreadsheet data into R. Either Excel files or .csv Comma-Separated Values files. See image of example .csv file below.
Data formats: “tidy” AKA long/tall/narrow format versus “non-tidy” AKA wide format
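Here is a minimal sketch of both ideas: importing a .csv file and converting wide data to tidy format. The file contents, the column names, and the name csv_path are made up for illustration; in practice you would point read_csv() at your own file.

```r
library(readr)
library(tidyr)
library(dplyr)

# Write a small made-up spreadsheet in "wide" format to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("country,2019,2020",
    "A,10,12",
    "B,20,18"),
  csv_path
)

# Import the .csv file into R as a data frame
wide <- read_csv(csv_path)
wide

# Convert from wide to tidy (long) format: one row per country-year pair
tidy <- wide %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "value")
tidy
```

In the tidy version each variable (country, year, value) has its own column, which is the layout ggplot2 and dplyr expect.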
Feel free to message me on weekends; I just likely won’t respond.
Today’s topics/activities
1. Chalk talk
What is pseudocode?
Types of joins:
Copy code below to your classnotes.Rmd
Refer to image (ignore semi_join())
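As a rough stand-in for the join diagram, the sketch below runs each type of join on two tiny made-up data frames. The names x, y, id, x_val, and y_val are illustrative only.

```r
library(dplyr)

# Two small made-up data frames that share the key variable "id"
x <- tibble(id = c(1, 2, 3), x_val = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), y_val = c("y1", "y2", "y4"))

inner <- inner_join(x, y, by = "id")  # only ids found in both x and y
left  <- left_join(x, y, by = "id")   # all ids in x; y_val is NA where unmatched
right <- right_join(x, y, by = "id")  # all ids in y; x_val is NA where unmatched
full  <- full_join(x, y, by = "id")   # all ids found in either x or y
anti  <- anti_join(x, y, by = "id")   # ids in x that have no match in y

inner
left
right
full
anti
```

Comparing the number of rows in each result is a quick way to see which rows every join keeps or drops.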
2. In-class exercise
With your MP2 partner, practice data wrangling! Complete ModernDive Learning Check 3.20: Using data in nycflights13 package, compute available seat miles for each airline separately:
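One possible approach, written first as pseudocode and then as code, is sketched below. Treat it as one way to attempt the problem rather than the official solution; the variable names seat_miles and ASM are my own.

```r
library(dplyr)
library(nycflights13)

# Pseudocode:
# 1. Attach each plane's number of seats to its flights, matching on tailnum
# 2. For each flight, compute seat miles = seats * distance
# 3. For each airline, add up the seat miles

available_seat_miles <- flights %>%
  inner_join(planes, by = "tailnum") %>%
  mutate(seat_miles = seats * distance) %>%
  group_by(carrier) %>%
  summarize(ASM = sum(seat_miles, na.rm = TRUE)) %>%
  arrange(desc(ASM))
available_seat_miles
```

Note that inner_join() keeps only flights whose tailnum appears in planes, so flights with no recorded seat count are left out of the totals.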
If you’re curious about my experiences in grad school, working at Google, switching to academia, and advice for aspiring data scientists, check out my appearance on episode #43 of the DataBytes podcast “To Google and Back.” Also available on Apple Podcasts and Google Play.
Today’s topics/activities
1. Chalk talk
Adding to previous lectures:
Lec15 on group_by() and summarize(): difference between sum() and n() summary functions.
Lec16 on mutate(): ifelse() function
_join(), select(), and rename() functions
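The three additions above can be sketched on the built-in mtcars data. The cutoff of 20 mpg and the new variable names are arbitrary choices for illustration.

```r
library(dplyr)

# sum() adds up the values of a variable; n() counts the number of rows
cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarize(total_hp = sum(hp), num_cars = n())
cyl_summary

# ifelse(test, value_if_true, value_if_false) inside mutate()
cars_labeled <- mtcars %>%
  mutate(efficiency = ifelse(mpg > 20, "high", "low"))

# select() keeps only some columns; rename() uses new_name = old_name
cars_small <- cars_labeled %>%
  select(mpg, cyl, efficiency) %>%
  rename(miles_per_gallon = mpg)
head(cars_small)
```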
2. In-class exercise
Go over ModernDive reading in schedule above
Lec 16: Fri 10/15
Announcements
Mini-project 2 (MP2) assigned on Monday
Add yourselves to the #mp2 channel. Please ask all questions about MP2 in #mp2, not in #questions
By Sunday 5pm I will post the new groups (of two) in the #mp2 channel. Until #mp2 is due, you will sit next to your partner in class.
Please reach out to your partner with a Slack DM before Monday’s lecture to coordinate meeting before lecture so you can sit next to each other.
If you have seating restrictions due to hearing, sight, or mobility issues, please DM me.
Today’s topics/activities
1. Chalk talk
mutate() new columns/variables and arrange() i.e. sort rows
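A short sketch of both verbs, assuming the nycflights13 flights data; the new variable name gain is my own choice.

```r
library(dplyr)
library(nycflights13)

flights_gain <- flights %>%
  # mutate(): create a new variable from existing ones
  mutate(gain = dep_delay - arr_delay) %>%
  # arrange(): sort rows, here from largest to smallest gain
  arrange(desc(gain)) %>%
  select(carrier, flight, dep_delay, arr_delay, gain)
flights_gain
```

By default arrange() sorts in ascending order; wrapping a variable in desc() reverses that.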
2. In-class exercise
Go over ModernDive reading in schedule above
Lec 15: Wed 10/13
Announcements
See Slack #general for info about presentation of SDS major
PS04 (shorter) will be posted this afternoon
Keyboard shortcuts for:
%>% in RStudio: command + shift + m on macOS, control + shift + m on Windows
Running code in RStudio: command + enter on macOS, control + enter on Windows
Quickly jumping between apps: command + tab on macOS, alt + tab on Windows
Selecting many files at once: click first file, hold shift, click last file
Deleting files: command + delete on macOS, delete on Windows
Today’s topics/activities
1. Chalk talk
summarize() rows and group_by() %>% summarize()
2. In-class exercise
Put finishing touches on MP1
Go over ModernDive reading in schedule above
Lec 14: Fri 10/8
Announcements
You are responsible for completing the ModernDive readings for Lec13 on the %>% operator and filter() before Wednesday’s lecture
A shorter PS04 will be assigned on Wednesday, due on Monday 10/18 9pm
    library(ggplot2)
    library(dplyr)
    library(gapminder)

    # 1. Recreate plot from PS02 but with no color:
    gapminder_2007 <- gapminder %>%
      filter(year == 2007)

    # 1.a) Add LOESS smoother layer with geom_smooth()
    ggplot(data = gapminder_2007, mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
      geom_point() +
      geom_smooth()

    # 1.b) Remove standard error bars by setting se = FALSE
    ggplot(data = gapminder_2007, mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
      geom_point() +
      geom_smooth(se = FALSE)

    # 1.c) Change span of "smoothing" window by changing the value of span
    ggplot(data = gapminder_2007, mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
      geom_point() +
      geom_smooth(se = FALSE, span = 0.25)

    # 1.d) Force line to be straight, i.e. linear regression
    ggplot(data = gapminder_2007, mapping = aes(x = gdpPercap, y = lifeExp, size = pop)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE)
2. In-class exercise
Copy the example code above to your classnotes.Rmd and go over the code
Optional: Go over ModernDive reading in schedule above (this topic is covered in SDS 201/220 intro stats)
Work on MP1
Lec 11: Fri 10/1
Announcements
Open Slack at the start of every lecture
Check for DM’s
Check #midterms channel
In order to not disadvantage students who take the midterm earlier
I won’t be answering any Slack #midterms questions after 3pm today
I’ve instructed the Friday Spinelli tutor not to answer questions about the midterm
Today’s topics/activities
1. Chalk talk
Go over practice Midterm I. Boxplot for question 3.c):
2. In-class exercise
Work on MP1
Lec 10: Wed 9/29
Announcements
Sit next to your MP1 partner; your partner was posted in the #mp1 channel on Sunday 5pm.
If you have an Office of Disability Services accommodations letter and you haven’t already, please Slack DM it to me.
Our department is offering a pre-application review service (PARS) initiative to provide support and mentorship to PhD applicants from historically marginalized groups. See details here: https://t.co/1g8lol4dRf pic.twitter.com/rHaMFHr5GF
Say the following piecharts represent results of an election poll at time points: A = September, B = October, and C = November. At each time point we present the proportion of the poll respondents who say they will support one of 5 candidates: 1 through 5.
Based on these 3 piecharts, answer the following questions:
At time point A, is candidate 5 doing better than candidate 4?
Did candidate 3 do better at time point B or time point C?
Who gained more support between time point A and time point B, candidate 2 or candidate 4?
Compare that to using barplots. Which do you prefer?
3. In-class exercise
Go over ModernDive reading in schedule above.
Changing default color and fill color aesthetics:
Copy and paste the code below into your classnotes.Rmd file
Change both the color of the scatterplot points and the fill of the bars. You can do this by selecting a palette from colorbrewer2.org or by setting them manually
Run colors() in your console to get English names of all colors in R
    library(ggplot2)
    library(dplyr)
    library(nycflights13)
    library(gapminder)

    # 1. Recreate plot from PS02, but change default "color" palette of points:
    gapminder_2007 <- gapminder %>%
      filter(year == 2007)
    ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) +
      geom_point() +
      scale_color_brewer(palette = "Set1")

    # 2.a) Recreate Figure 2.26 but change default "fill" color of bars by adding a
    # palette layer:
    ggplot(flights, aes(x = carrier, fill = origin)) +
      geom_bar(position = position_dodge(preserve = "single")) +
      scale_fill_brewer(palette = "Set1")

    # 2.b) Recreate Figure 2.26 but change default "fill" color of bars by manually
    # changing colors in a layer:
    ggplot(flights, aes(x = carrier, fill = origin)) +
      geom_bar(position = position_dodge(preserve = "single")) +
      scale_fill_manual(values = c("darkorange", "forestgreen", "navyblue"))

    # 2.c) Recreate Figure 2.26 but change default "fill" color of bars by manually
    # changing colors in a layer using hex codes from:
    # https://www.color-hex.com/color-palette/114219
    ggplot(flights, aes(x = carrier, fill = origin)) +
      geom_bar(position = position_dodge(preserve = "single")) +
      scale_fill_manual(values = c("#dc323a", "#003f77", "#c4c1c1"))
Lec 08: Wed 9/22
Announcements
Problem sets:
PS01 graded
PS03 posted
Today’s topics/activities
1. Chalk talk
Searching the internet effectively: a critical data science tool
Wrapping-up boxplots:
For a side-by-side boxplot, the x variable has to be categorical
Summary statistics that are robust to outliers: median and IQR
Why 1.5 x IQR?
Barplots
geom_bar() when counts are not pre-computed i.e. listed individually
geom_col() when counts are pre-computed and saved in a variable
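The difference between the two barplot geoms is easiest to see on the same made-up data stored two ways. The data frames fruits and fruits_counted below are invented for illustration.

```r
library(ggplot2)
library(dplyr)

# Counts NOT pre-computed: one row per individual fruit
fruits <- tibble(fruit = c("apple", "apple", "orange", "apple", "orange"))
bar_plot <- ggplot(fruits, aes(x = fruit)) +
  geom_bar()

# Counts pre-computed and saved in the variable "number"
fruits_counted <- tibble(fruit = c("apple", "orange"), number = c(3, 2))
col_plot <- ggplot(fruits_counted, aes(x = fruit, y = number)) +
  geom_col()

# Both plots show the same bars
bar_plot
col_plot
```

geom_bar() only needs an x aesthetic because it does the counting itself, while geom_col() needs a y aesthetic that holds the counts.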
2. In-class exercise
Go over ModernDive reading in schedule above.
Lec 07: Mon 9/20
Announcements
Office of Disability Services is looking to hire a note taker for the class. If interested, see note in Slack #general.
Problem sets:
PS02 due today at 5pm
PS03 to be posted by 6pm today
Today’s topics/activities
1. Chalk talk
Recap of histograms
Facets to split a visualization by the values of another variable
Default ordering of functions such as ggplot() where data = is assumed first and mapping = is assumed second
Boxplots! Powerful, but tricky!
Say we want to study the distribution of the following 12 values which are pre-sorted:
1, 3, 5, 6, 7, 8, 9, 12, 13, 14, 15, 30
They have the following summary statistics. A summary statistic is a single numerical value summarizing many values. Examples include the immediately obvious mean AKA average and median. Other less immediately obvious examples include:
Quartiles (1st, 2nd, and 3rd) that cut up the data into 4 parts, each containing roughly one quarter = 25% of the data
Minimum & maximum
Interquartile range (IQR): the distance between the 3rd and 1st quartiles
| Min. | 1st Quartile | Median = 2nd Quartile | 3rd Quartile | Max. | IQR |
|---|---|---|---|---|---|
| 1 | 5.5 | 8.5 | 13.5 | 30 | 8 = 13.5 - 5.5 |
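These summary statistics can be checked in R. The sketch below uses fivenum(), which matches the quartiles listed above for these 12 values; the fence and outlier variable names are my own, and the 1.5 x IQR rule is the usual boxplot convention for flagging outliers.

```r
values <- c(1, 3, 5, 6, 7, 8, 9, 12, 13, 14, 15, 30)

# fivenum() returns: minimum, 1st quartile, median, 3rd quartile, maximum
five <- fivenum(values)
five

# IQR = 3rd quartile - 1st quartile
iqr <- five[4] - five[2]
iqr

# Boxplot convention: points more than 1.5 x IQR beyond the quartiles are outliers
lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr
outliers <- values[values < lower_fence | values > upper_fence]
outliers
```

There are several conventions for computing quartiles, so other functions such as quantile() with its default settings can return slightly different values for the 1st and 3rd quartiles.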
Let’s compare the points and the corresponding boxplot side-by-side with the values on the \(y\)-axis matching:
2. In-class exercise
Go over ModernDive reading in schedule above.
Lec 06: Fri 9/17
Announcements
Oh snap! @SmithCollegeSDS is on a hiring spree! Put 👀 on these 3⃣ tenure track positions, apps due:
- 10/8 Biostatistics, statistics, or related
- 10/15 joint hire with the Math dept
- 10/22 candidates with a Ph.D. in stats, CS, information sciences, math, or related
https://t.co/RCmtlSzg3S
Histograms for visualizing the distribution of a numerical variable
Section 1 (Stoddard G6) Demo
Section 2 (Sabin-Reed 220) Demo
2. In-class exercise
If you still haven’t been able to “Knit to PDF”, please ask for help
Go over ModernDive reading in schedule above.
Lec 05: Wed 9/15
Announcements
PS02 was posted after Monday’s lecture.
Today’s topics/activities
1. Chalk talk
Overplotting and two approaches for addressing it
Linegraphs
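A minimal sketch of these topics follows. It assumes the two approaches to overplotting are the ones covered in ModernDive, transparency and jittering, and it uses made-up data plus the economics data that ships with ggplot2.

```r
library(ggplot2)

# Made-up data with many repeated (x, y) pairs, so points stack on top of each other
set.seed(192)
overplotted <- data.frame(
  x = sample(1:5, size = 500, replace = TRUE),
  y = sample(1:5, size = 500, replace = TRUE)
)

# Approach 1: make points transparent so darker areas signal more points
alpha_plot <- ggplot(overplotted, aes(x = x, y = y)) +
  geom_point(alpha = 0.1)

# Approach 2: jitter, i.e. nudge each point by a small random amount
jitter_plot <- ggplot(overplotted, aes(x = x, y = y)) +
  geom_jitter(width = 0.2, height = 0.2)

# Linegraph: for when the x variable is sequential, such as time
line_plot <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

alpha_plot
jitter_plot
line_plot
```

The values of alpha, width, and height are arbitrary starting points; adjust them until the density of points is easy to read.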
2. In-class exercise
Explore the different formatting tools in R Markdown: go to RStudio top menu bar -> Help -> Markdown quick reference.
Sec01 in Stoddard: There was a typo in Step 8 in last lecture’s in-class exercise. If you weren’t able to Knit directly to PDF, please re-attempt Steps 8-9. Knitting directly to PDF, instead of Knitting to Word and then saving to PDF, is the preferred submission format for all problem sets. It will be less hassle for you and provide consistency for the graders.
Go over ModernDive reading in schedule above.
Lec 04: Mon 9/13
Announcements
Problem Set 02 due next Monday 5pm, now posted under Problem Sets
Today’s topics/activities
1. Chalk talk
Recap of previous lecture
“Where can I save all the code I run in class?” In an R Markdown .Rmd file; R Markdown is a tool for reproducible research
Input: An .Rmd file
Output: An .html, .docx, or .pdf file.
2. In-class exercise
In-class battle-testing and practicing for PS02:
At a couple of steps in this process, you will be asked to install packages. Say yes to all of them.
If at any point your code won’t knit, go through these 6 R Markdown Fixes first, then seek assistance. These 6 fixes will resolve 85% of issues.
Create new R Markdown .Rmd file:
Go to RStudio menu bar -> File -> New File -> R Markdown
Set “Title” to “My first R Markdown report” and “Author” as your name.
Save this file as testing somewhere on your computer. This will create a file called testing.Rmd
Method 1: “Knit” a report to HTML:
Click the arrow next to “Knit” -> “Knit to HTML”.
An HTML webpage should pop up. However, it may be blocked by your browser. If so, in your browser’s URL bar, click on “Always allow pop-ups”.
Method 1: Publish HTML report on web:
Click on blue “Publish” button on top right of the resulting pop-up html.
Select RPubs.
If you haven’t previously, create an account on RPubs.com. If you have previously, log in.
Set “Title” to “My first R Markdown report” and “Slug” to “testing”
You should end up with a webpage that looks like this one. This is live on the web!
Method 1: Update HTML report on web:
Make some trivial change to your testing.Rmd file.
“Re-knit” your report and make sure your trivial change is reflected.
The blue “Publish” button should now read “Republish”
Click “Update existing”
Your updates are now live on the web!
Method 2: “Knit” a report to Word
Click the arrow next to “Knit” -> “Knit to Word”.
Save the resulting Word document as a pdf file.
Only if you are a macOS user:
Next to “Console” go to “Terminal”
Run this line of code:
sudo chown -R `whoami`:admin /usr/local/bin
Enter your password. Note: Terminal has weird behavior whereby as you enter your password, the cursor will not move. Don’t worry, your password is registering.
Sunday 9/19 at 11:50AM: Opportunities in Statistics & Data Science in Academia, Government, & Non-Profit featuring SDS’s Prof. Randi Garcia!
Keynote address by Robert Santos, 116th President of the ASA, and President Biden’s nominee to serve as Director of the United States Census Bureau! If approved by the Senate, he would be the first Latinx Director of the Bureau!
Today’s topics/activities
1. Chalk talk
Recap of previous lecture
Grammar of Graphics
5NG1: Scatterplots
Next time:
Question: Do I need to re-type my code in the Console every single time?
Problem Set 01 due this Monday 5pm, posted under Problem Sets.
Today’s topics/activities
1. Chalk talk
Intro to Slack
What is the difference between R and RStudio?
What are R packages?
2. In-class exercise
Go over ModernDive reading in schedule above.
About readings in this course:
You are responsible for completing a lecture’s readings before the next lecture. Ex: you are responsible for reading all of ModernDive Chapter 1 before Wednesday.
I teach lectures assuming you have not done the readings beforehand. However, if it suits your learning style better, please do read beforehand.
While you don’t need to turn in your learning check answers, I highly recommend you still do them. The solutions are in Appendix D of the book.
If you have your headphones, you may listen to music during in-class reading time.
Lec 01: Fri 9/3
Announcements
Welcome!
Today’s topics/activities
Course webpage: bit.ly/sds192kim
My story
“Knock on wood if you’re with me”
What this class is about: Answering questions with data