April 20, 2016

Background

Myself

Middlebury College

  • Small (2400 students), undergraduate-only liberal arts college
  • Typical class size ~20 students
  • Stats within the math dept: 1.5 statisticians
  • No major/minor in stats
  • Courses:
    • Intro stats
    • Year-long 300-level prob & stats sequence
    • Applied/methods classes in other depts

Outline of Talk

Various Data Science initiatives at Middlebury:

  • Description of New MATH216 Intro to Data Science course
  • Proposed Minor in Data Science
  • ASA DataFest

Intro to Data Science

Course Structure

Prereqs

Only intro stats and some exposure to R or any other programming language

Syllabus

  • 5 biweekly analyses
    • Submitted in R Markdown: reproducible research
    • Feedback delivered via GitHub
  • Term project: both written report and 12 min presentation
  • In-class participation

Demographics

14 students: 6 Seniors, 6 Juniors, & 2 Sophomores

Of which:

  • Double majors:
    • Environmental Sciences-Econ
    • Econ-Linguistics
    • CS-Econ
  • Single majors:
    • Economics x 4
    • Molecular Bio & Biochem x 3
    • International Politics and Econ, Neuroscience, Bio, CS

Principles

  • Mixture of lab & lecture: students bring their own laptops to class.
  • Use real, messy, complex data.
  • Discussions in class
  • This class uses R, but is not a class on R. I try to teach things in a language-agnostic fashion: transferable ideas & concepts.
  • "Minimizing prerequisites to research." (George Cobb)

Environment: RStudio

How to get students to learn R?

  • Key: Forget Base R
  • How? The Hadleyverse.
  • In particular
    • dplyr package for data wrangling/manipulation
    • ggplot2 package for data visualization

dplyr Verbs

Data manipulation via the following verbs on tidy data. The command name is the action we want to perform!

  1. filter: keep observations matching criteria
  2. summarise: reduce many values to one
  3. mutate: create new variables from existing ones
  4. arrange: reorder rows
  5. select: pick columns by name
  6. join: join two data sets
  7. group_by: group subsets of observations together
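
The verbs above can be chained into a single pipeline. The following is a minimal sketch, assuming the dplyr package is installed; the two small data frames and their column names are invented for illustration and are not the course data. Note that dplyr has no function literally called join: the join verb is provided by a family of functions such as inner_join() and left_join().

```r
library(dplyr)

# Toy data invented for illustration (hypothetical values and column names).
flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(10, -5, 30, 0, 60),
  distance  = c(500, 500, 1200, 1200, 800),
  stringsAsFactors = FALSE
)
carriers <- data.frame(
  carrier = c("AA", "UA"),
  name    = c("American", "United"),
  stringsAsFactors = FALSE
)

delay_summary <- flights %>%
  filter(distance > 400) %>%                # keep observations matching criteria
  mutate(late = dep_delay > 0) %>%          # create a new variable from an existing one
  group_by(carrier) %>%                     # group subsets of observations together
  summarise(mean_delay = mean(dep_delay),   # reduce many values to one per group
            prop_late  = mean(late)) %>%
  inner_join(carriers, by = "carrier") %>%  # join two data sets
  arrange(desc(mean_delay)) %>%             # reorder rows
  select(name, mean_delay, prop_late)       # pick columns by name

delay_summary
```

Each step reads as the action it performs, which is the point of the verb naming.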

ggplot2: the Grammar of Graphics

A statistical graphic is a mapping of variables in a

  • data set to
  • aes()thetic attributes of
  • geom_etric objects.

ggplot2 allows us to construct graphics in a modular fashion by specifying these components.
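
As a minimal sketch of these three components, assuming the ggplot2 package is installed: the example below uses the mpg data set that ships with ggplot2 rather than any data from the course, and the choice of variables is purely illustrative.

```r
library(ggplot2)

# data: the mpg data set bundled with ggplot2
# aes(): engine displacement -> x, highway mpg -> y, vehicle class -> color
# geom_: points
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

p
```

Swapping geom_point() for another geom, or remapping a variable inside aes(), changes one component while leaving the others untouched.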

6 dimensions of information on a 2 dimensional page:

data                 aes()    geom_
longitude            x        point
latitude             y        point
army size            size     path
forward vs retreat   color    path
date                 x, y     text
temperature          x, y     line

Example Analyses

Dataset: Houston Flights

Domestic flights leaving Houston airport (IAH) in 2011. Four data sets:

  • flights: info on all 227,496 flights
  • weather: hourly weather info
  • planes: information on all 2853 airplanes
  • airports: information on all 3376 destination airports

Delayed Flights

Age of Airplanes

Dataset: OkCupid Data

  • Sample of 10% of San Francisco OkCupid users in June 2012 (\(n=59946\))
  • 40.2% of the sample was female
  • Use logistic regression to predict gender
  • Overfitting, out-of-sample prediction, cross-validation
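
The idea of out-of-sample prediction via cross-validation can be sketched as follows. This uses simulated data invented for illustration, not the OkCupid sample, and a single hypothetical predictor x; the number of folds and the 0.5 classification threshold are also assumptions.

```r
set.seed(1)

# Simulated data: a binary outcome y that depends on one numeric predictor x.
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.4 + 1.2 * x))
profiles <- data.frame(x = x, y = y)

# Assign each row to one of k folds at random.
k <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, then predict the held-out fold.
accuracy <- sapply(1:k, function(i) {
  train <- profiles[fold != i, ]
  test  <- profiles[fold == i, ]
  fit   <- glm(y ~ x, data = train, family = binomial)
  p_hat <- predict(fit, newdata = test, type = "response")
  mean((p_hat > 0.5) == test$y)  # share of held-out rows classified correctly
})

mean(accuracy)  # cross-validated estimate of out-of-sample accuracy
```

Because every prediction is made on rows the model did not see during fitting, the averaged accuracy is not inflated by overfitting in the way an in-sample figure can be.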

Self-Referenced Body Type

Best predictors have distinct differences (in gender) in large segments of the population.

Self-Referenced Body Type

Need to normalize to compare proportions, not counts!
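
A minimal sketch of this normalization, assuming dplyr is installed. The counts below are invented for illustration and are not the OkCupid figures, and normalizing within each sex is one reasonable reading of the slide rather than the only possible one.

```r
library(dplyr)

# Hypothetical counts: the two groups have different sizes (400 vs 600).
body_type <- data.frame(
  body_type = c("athletic", "athletic", "average", "average"),
  sex       = c("f", "m", "f", "m"),
  n         = c(100, 300, 300, 300),
  stringsAsFactors = FALSE
)

# Convert counts to proportions within each sex.
proportions <- body_type %>%
  group_by(sex) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

proportions
```

In this toy example the raw counts for "average" are identical across groups (300 each), yet the proportions differ (0.75 vs 0.5) because the groups are of unequal size.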

Dataset: Reed College Jukebox

All 222,540 songs played on the Reed College pool hall room jukebox from 2003-2009.

date_time artist album track
Sun Dec 7 05:12:57 2003 Tom Petty and the Heartbreakers Into the Great Wide Open
Sun Dec 7 05:15:56 2003 Jefferson Airplane Somebody To Love
Sun Dec 7 05:23:04 2003 Led Zeppelin Led Zeppelin IV 08 When The Levee Breaks

Importance of EDA

Time Series

Maps

Interactivity

Two R packages for interactivity:

  • Shiny: a framework for building interactive web applications in R, with no HTML, CSS, or JavaScript knowledge required.
  • Leaflet: interactive maps built on OpenStreetMap tiles

Example on Middlebury Shiny Server Pro: VT Census Tracts.

Proposed Data Science Minor

What is Data Science?

Proposed Minor

  • Targeted at students in fields where facility with data is a valued skill, excluding MATH and CS majors.
  • Courses
    • 3 computer science
    • 2 stats
    • 2 domain courses in bio, econ, psych, chem, physics, political science, etc.
  • Full details are here

DataFest

DataFest is an internationally coordinated undergraduate data science hackathon run by the ASA.

Biggest ones:

DataFest

Prof. Philip Yates of Saint Michael's College and I organized the inaugural DataFest Vermont 802 there on the weekend of April 8th-10th.

Example Work

Some Thoughts

Data Visualization

  • Data visualization is a gateway drug to statistics.
  • Prez from Season 4 of "The Wire":
  • Students got really excited by ggplot, maps, and Shiny apps.

Pedagogical Issue: Programming

It's like learning a language. Frustration!

  • Point-and-click vs command line.
  • Thinking algorithmically
  • Debugging: help files and Google
  • Hadley's wisdom

Learning to Code

  • Learning in class should reflect how we learn in real life
  • Experimenting with an open-learning format: students can collaborate completely and see each other's work. Still haven't decided on its merits.
  • However, new tools like Datacamp are increasing the ratio: \[\frac{\mbox{Payoff from learning R}}{\mbox{Startup costs}}\]

Importance of Feedback

  • Developing skills and intuition takes time. At Middlebury, classes are small, so students can get individual attention and good feedback.
  • Feedback via GitHub: students won't learn Git; they just use it, through a GUI, for feedback delivery.
  • Like giving feedback on a paper: more art than science.

Resources