Wed Oct 12, 2016

Switching Gears

With the internet, we are in a new age of data.

Bridging the Gap

  • Meet Jenny Bryan at UBC: GitHub profile
  • She teaches a graduate-level class, STAT 545: Data wrangling, exploration, and analysis with R. Note the ordering.

Classroom vs Real Data

Jenny Bryan said: "Classroom data are like teddy bears and real data are like a grizzly bear with salmon blood dripping out its mouth."

[Two images, side by side: Traditional Classroom Data | Real Data]

Real Data

Some attributes of real data:

  • Often not in a format ready for analysis
  • Messy and needs cleaning
  • Typos, weird outliers
  • Missing values
  • Inconsistent formatting

Real Data

Inconsistent formatting is a real pain:

  • Dates: "2016/10/12" vs "2016-10-12" vs "10/12/16" vs "10/12/2016" vs "Oct 12, 2016"
  • "DC" vs "D.C." vs "District of Columbia"
  • "Beyonce" vs "Beyoncé"
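As a rough illustration of what cleaning these inconsistencies can involve, here is a minimal base R sketch. The variable names (raw_dates, formats, dc_lookup, and so on) are invented for this example, and it assumes the slash-separated short dates are month/day/year. "Oct 12, 2016" is left out because parsing month abbreviations depends on the locale.

```r
# The same calendar day written four ways (strings from the list above)
raw_dates <- c("2016/10/12", "2016-10-12", "10/12/16", "10/12/2016")

# One format string per entry; as.Date() hands these to strptime(),
# which recycles the formats alongside the input strings.
# Assumption: the short forms are month/day/year, not day/month/year.
formats <- c("%Y/%m/%d", "%Y-%m-%d", "%m/%d/%y", "%m/%d/%Y")
clean_dates <- as.Date(raw_dates, format = formats)

# Three spellings of the same place, mapped to one label
# with a named lookup vector (hypothetical, for illustration only)
places <- c("DC", "D.C.", "District of Columbia")
dc_lookup <- c("DC" = "DC", "D.C." = "DC", "District of Columbia" = "DC")
clean_places <- unname(dc_lookup[places])

print(clean_dates)
print(clean_places)
```

Note that someone still had to know which format each string was in. That is exactly the pain point.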

dplyr Package

To take this on, we now officially introduce the dplyr package: a grammar of data manipulation.

Pedagogical Note

Were it not for this package, I probably wouldn't be taking a data-centric approach to this course.

Why do I have a dplyr sticker on my laptop? Why is dplyr so good IMO?

  • The verb describing the action you want to perform on your data IS the name of the function() you use.
  • So you don't need extensive programming experience (indexing, for loops, etc.) to be able to manipulate data.
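To make the point above concrete, here is a small sketch that sorts the same data two ways: once with base R indexing and once with a dplyr verb. The scores data frame is toy data invented for illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
scores <- data.frame(
  name  = c("Ana", "Ben", "Cy"),
  score = c(72, 95, 88),
  stringsAsFactors = FALSE
)

# Base R: sort rows from highest to lowest score by indexing with order()
sorted_base <- scores[order(scores$score, decreasing = TRUE), ]

# dplyr: the verb "arrange" is the name of the function
sorted_dplyr <- arrange(scores, desc(score))

print(sorted_dplyr)
```

Both give the same ordering of rows, but the second line reads close to plain English.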

5MV

Say hello to the 5MV: the five main verbs

  1. select() columns by variable name
  2. filter() rows matching criteria
  3. mutate() existing variables to create new ones
  4. arrange() rows
  5. summarise() numerical variables, typically within groups defined by group_by() on categorical variables
  6. Also, later: the _join() family (e.g. left_join()) to combine two separate data frames by corresponding variables
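Here is a minimal sketch of the five main verbs applied one after another. The trips data frame and its column names are made up for illustration, and each step is stored in its own object so you can inspect the result of every verb.

```r
library(dplyr)

# Hypothetical toy data: distance in miles, air_time in minutes
trips <- data.frame(
  carrier  = c("AA", "AA", "UA", "UA", "DL"),
  distance = c(500, 1500, 800, 2400, 300),
  air_time = c(90, 210, 120, 300, 60),
  note     = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# 1. select() columns by variable name (drops the note column)
step1 <- select(trips, carrier, distance, air_time)

# 2. filter() rows matching criteria (keeps trips of 500 miles or more)
step2 <- filter(step1, distance >= 500)

# 3. mutate() existing variables to create a new one (speed in miles per hour)
step3 <- mutate(step2, speed = distance / (air_time / 60))

# 4. arrange() rows from fastest to slowest
step4 <- arrange(step3, desc(speed))

# 5. summarise() a numerical variable within group_by() groups
by_carrier <- group_by(step4, carrier)
result <- summarise(by_carrier, mean_speed = mean(speed))

print(result)
```

The result has one row per carrier that survived the filter() step.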

Today:

  1. select() columns by variable name: front of cheatsheet, bottom right
  2. filter() rows matching criteria: front of cheatsheet, bottom middle. We've already used this in Chapter 3 on Data Viz.
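A short sketch of today's two verbs, using a pets data frame invented for illustration. It shows select() keeping or dropping columns and filter() with one and with two criteria.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
pets <- data.frame(
  name      = c("Rex", "Tom", "Fido", "Kit"),
  species   = c("dog", "cat", "dog", "cat"),
  age       = c(5, 2, 1, 7),
  weight_kg = c(30, 4, 12, 5),
  stringsAsFactors = FALSE
)

# select(): keep columns by name
name_age <- select(pets, name, age)

# select(): drop a column with a minus sign
no_weight <- select(pets, -weight_kg)

# filter(): keep rows matching one criterion
dogs <- filter(pets, species == "dog")

# filter(): criteria separated by commas must all be true
older_dogs <- filter(pets, species == "dog", age > 3)

print(older_dogs)
```

Note that select() only ever changes the columns and filter() only ever changes the rows.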

Keep looking back and forth between book and cheatsheet!