Last updated on 2017-05-15

Lec31 - Mon 5/15: Midterm III Review


  • Fri 5/19 9am-12pm in Warner 507
  • Closed book
  • Bring calculator or phone with calculator app


  • One question which will be a slight variation from a question from Midterm II
  • New topics:
    • k-Nearest Neighbors
    • Classification and Regression Trees
    • Principal Components Analysis
    • k-Means Clustering

Kaggle AirBnb Competition

  • Given a set of US users, we have
    • \(\vec{X}\): Demographic and web session predictor variables
    • \(y\): Categorical variable of new users' first booking destination, where NDF = no booking recorded
  • Score: If a single prediction is submitted for each user, the score becomes "Proportion Classified Correctly"
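
As a minimal sketch of "Proportion Classified Correctly", assuming one predicted destination per user: the vectors `truth` and `prediction` below are made-up toy data, not competition data.

```r
# Toy data, made up for illustration only (not from the competition)
truth      <- c("NDF", "US", "NDF", "FR", "NDF")
prediction <- c("NDF", "NDF", "NDF", "FR", "US")

# Proportion classified correctly: share of users whose single
# predicted destination matches the true destination
prop_correct <- mean(prediction == truth)
prop_correct
```

Here 3 of the 5 toy users are classified correctly, so the proportion is 0.6.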


In the training set, 58.3% of users did not book (NDF)

Sample Submission

The sample submission CSV has NDF predicted for all test users. Let's submit this…

id country
5uwns89zht NDF
jtl0dijy2j NDF
xx0ulgorjt NDF
6c6puo6ix0 NDF
czqhjk3yfe NDF
szx28ujmhf NDF


… we get 68% correct. In other words, roughly 68% of test users did not book.
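
A submission like this could be built roughly as follows. This is a sketch only: the three ids are copied from the sample submission above, while the output file name and the two-column CSV layout (`id`, `country`) are assumptions.

```r
# Three ids copied from the sample submission; a real submission
# would use the ids of all test users
test_ids <- c("5uwns89zht", "jtl0dijy2j", "xx0ulgorjt")

# Predict the same class, NDF, for every test user
submission <- data.frame(
  id = test_ids,
  country = "NDF",
  stringsAsFactors = FALSE
)

# Write the CSV to a temporary folder (file name is hypothetical)
out_file <- file.path(tempdir(), "submission_all_NDF.csv")
write.csv(submission, out_file, row.names = FALSE)
```

Because every user gets the same prediction, the proportion classified correctly is simply the share of scored users whose true outcome is NDF.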



Why do you think we have

  • Training set: 58.3% of users did not book
  • Test set: ~68% of users did not book

Train vs Test Data

Look at the Account Creation Dates:
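
One way to make this comparison in code, as a sketch on made-up toy data: the data frames and the column names `date_account_created` and `country_destination` are assumptions for illustration, and the dates are invented.

```r
# Toy data, made up for illustration; column names are assumptions
train <- data.frame(
  date_account_created = as.Date(c("2013-01-05", "2013-09-20",
                                   "2014-03-11", "2014-06-30")),
  country_destination  = c("US", "NDF", "NDF", "US")
)
test <- data.frame(
  date_account_created = as.Date(c("2014-07-01", "2014-08-15", "2014-09-30"))
)

# Compare the range of account creation dates in each set
range(train$date_account_created)
range(test$date_account_created)

# TRUE when every test account was created after the last training account
no_overlap <- min(test$date_account_created) > max(train$date_account_created)
no_overlap
```

If the two sets cover different time periods, the outcome proportions in train need not match those in test.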

Outcome Variable in Training Data

Recall Lec03: Resampling


Google Flu

  • In 2009, Google designed an algorithm that tried to predict the number of flu cases based on people's searches.
  • At first it worked amazingly well!
  • Then things fell apart
  • Post Mortem

Lec24 - Wed 4/19: Final Project & log-Transformations

Final Project

  • Due Tuesday May 23 12pm
  • Alone or in pairs
  • Kaggle Data: Any data set that has a competition/leaderboard
  • Fill out this Google Form with your information.


A folder in the GitHub organization (I'll set it up) with a file FINAL_PROJECT_FIRSTNAME_LASTNAME.R that

  • Loads all data
  • Has an exploratory data analysis (EDA)
  • Shows your thinking
  • Creates a submission CSV file

Reproducible Research

  • Crucial for participating in open-source research/coding community.
  • Someone else should be able to pick up your code and replicate and understand everything easily.
  • Especially me!
  • More especially: future you!


Ex1: The following will not work on other people's computers!

train <- read_csv("~/phoxie_hoxie/his_crazy_directory_structure/train.csv")
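
A more portable alternative is a path relative to the project folder. The sketch below assumes a hypothetical layout where train.csv sits in a data/ subfolder next to the script, and uses base R's `read.csv()` so it runs without extra packages.

```r
# Hypothetical layout: train.csv lives in a data/ folder next to the script
train_path <- file.path("data", "train.csv")

if (file.exists(train_path)) {
  # Works on any computer where the project folder is the working directory
  train <- read.csv(train_path)
} else {
  message("Expected to find ", train_path,
          " relative to the working directory: ", getwd())
}
```

Anyone who downloads the whole project folder and runs the script from inside it gets the same result, whatever the rest of their directory structure looks like.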

Ex2: If you create a variable in your console but don't include the code that creates it in FINAL_PROJECT_FIRSTNAME_LASTNAME.R, the script won't work for others.

Ex3: Comments!

log10 Function

  • Lines mark \(x=(0.01, 0.1, 1, 10) = (10^{-2}, 10^{-1}, 10^{0}, 10^{1})\)
  • Points mark the powers of 10: \((-2, -1, 0, 1)\)
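
The same values can be checked quickly in R. This is only a small sketch; the variable names are arbitrary.

```r
# The x values marked by the lines
x <- c(0.01, 0.1, 1, 10)

# log10() returns the power that 10 must be raised to in order to get x
powers <- log10(x)
powers

# Going the other way: raising 10 to those powers recovers x
x_back <- 10^powers
x_back
```

Each step of 1 on the log10 scale corresponds to multiplying by 10 on the original scale.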

Cheese vs Milk Production in the US