April 20, 2016
Various Data Science initiatives at Middlebury:
Only intro stats and some exposure to R or any other programming language
14 students: 6 Seniors, 6 Juniors, & 2 Sophomores
Of which:
dplyr: package for data wrangling/manipulation
ggplot2: package for data visualization

Data manipulation via the following verbs on tidy data. The command name is the action we want to perform!
filter: keep observations matching criteria
summarise: reduce many values to one
mutate: create new variables from existing ones
arrange: reorder rows
select: pick columns by name
join: join two data sets
group_by: group subsets of observations together

A statistical graphic is a mapping of variables in a data set to aes()thetic attributes of geom_etric objects. ggplot2 allows us to construct graphics in a modular fashion by specifying these components.
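A minimal sketch of that modular construction, assuming ggplot2 is installed. The `marches` data frame, its column names, and its values are made up for illustration; the point is only to show where the data, the aes() mappings, and the geom_ layers go.

```r
library(ggplot2)

# Hypothetical toy data: names and values are invented for illustration only.
marches <- data.frame(
  longitude = c(24.0, 28.5, 33.0, 37.6),
  latitude  = c(54.9, 55.0, 54.8, 55.8),
  army_size = c(340000, 280000, 145000, 100000),
  direction = c("forward", "forward", "retreat", "retreat")
)

# data: marches
# aes(): longitude -> x, latitude -> y, army_size -> size, direction -> color
# geom_: a path connecting the rows in order, plus one point per row
p <- ggplot(data = marches, mapping = aes(x = longitude, y = latitude)) +
  geom_path() +
  geom_point(aes(size = army_size, color = direction))

print(p)
```

Swapping a geom_ layer or changing one aes() mapping changes the graphic without touching the other components.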
6 dimensions of information on a 2 dimensional page:
data | aes() | geom_ |
---|---|---|
longitude | x | point |
latitude | y | point |
army size | size | path |
forward vs retreat | color | path |
date | x, y | text |
temperature | x, y | line |
Domestic flights leaving Houston airport (IAH) in 2011. Four data sets:
flights: info on all 227,496 flights
weather: hourly weather info
planes: information on all 2853 airplanes
airports: information on all 3376 destination airports

Best predictors have distinct differences (in gender) in large segments of the population.
Need to normalize to compare proportions, not counts!
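One way to do that normalization with the dplyr verbs, sketched on a toy table. The `counts` data frame, its column names, and its values are hypothetical and only stand in for whatever counts are being compared.

```r
library(dplyr)

# Hypothetical counts: names and values are invented for illustration only.
counts <- data.frame(
  gender   = c("F", "F", "M", "M"),
  category = c("A", "B", "A", "B"),
  n        = c(300, 700, 200, 50)
)

proportions <- counts %>%
  group_by(gender) %>%            # treat each gender as its own group
  mutate(prop = n / sum(n)) %>%   # share of each category within its gender
  ungroup() %>%
  arrange(category, gender)       # put the rows to compare next to each other

print(proportions)
```

In this toy table, category A has the larger raw count for F (300 vs. 200), yet it makes up 30% of the F rows and 80% of the M rows, so the counts and the proportions point in opposite directions.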
All 222,540 songs played on the Reed College pool hall room jukebox from 2003-2009.
date_time | artist | album | track |
---|---|---|---|
Sun Dec 7 05:12:57 2003 | Tom Petty and the Heartbreakers | Into the Great Wide Open | |
Sun Dec 7 05:15:56 2003 | Jefferson Airplane | Somebody To Love | |
Sun Dec 7 05:23:04 2003 | Led Zeppelin | Led Zeppelin IV | 08 When The Levee Breaks |
quandl.com has a great R interface
Two R packages for interactivity:
Example on Middlebury Shiny Server Pro: VT Census Tracts.
DataFest is an internationally coordinated undergraduate data science hackathon run by the ASA.
Biggest ones:
Prof. Philip Yates at Saint Michael's College and I organized the inaugural DataFest Vermont 802 on the weekend of April 8th-10th at Saint Michael's College.
It's like learning a language. Frustration!