Basic information

  • Course title: STAT/MATH 495 - Advanced Data Analysis
  • Instructor: Albert Y. Kim - Lecturer of Statistics
  • Slack team: stat495-fall-2017.slack.com. I will respond to Slack messages within 24h, but not during weekends. Please only Slack me with administrative and briefer questions, as I prefer having more substantive conversations in person.
  • Meeting locations/times:
    • Lectures: M 9:00-9:50 and Tu/Th 8:30-9:50 in Merrill Science Center 131.
    • Office hours: M 1:00-4:00 and W 2:00-5:00 in Seeley Mudd 208 (conference room in lounge), or by appointment. Late changes to office hours will be posted on Slack.
    • StatFellow Office Hours: Andrew Kim (akim17 on Slack): M 6:00-8:00pm in Seeley Mudd 205.
    • Extra office hours: For bouncing around ideas for the final project, in Frost Cafe:
      • Fri 11/3 1:30-3:30pm, Tue 11/7 1:30-3pm, Thu 11/9 1:30-3pm, and Tue 11/14 1:30-3pm
    • Exam week office hours
      • Tue 12/19 9am-12pm in SMudd 208
      • Wed 12/20 3pm-5pm in SMudd 208
      • Thu 12/21 2pm-5pm in SMudd 208
  • Important dates:
    • Midterm I: Thu 10/5
    • Midterm II: Thu 11/9
    • Final project presentations (see order below): Thu 12/7, Mon 12/11, and Tue 12/12
    • Exit survey will be posted here on Wed 12/20 at noon. This is due at the same time as your final project submission.
    • Midterm III: Thu 12/21 9:00 a.m. Chapin Hall 201
    • Final projects due Fri 12/22 5pm
| Name | Final project group | Order |
| --- | --- | --- |
| Brendan, Leonard, Vickie | C | Th 1 |
| Meron, Wayne | D | Th 2 |
| Abbas, Caleb, Kiryu | F | Th 3 |
| Christien, Harrison, Meredith | H | Mon 1 |
| Brenna, Sara | G | Mon 2 |
| Anthony, Jenn, Pei | E | Tue 1 |
| Jeff, Luke, Tasheena | B | Tue 2 |
| Jonathan, Sarah, Tim | A | Tue 3 |

Course Description

  • Official course description: On Amherst College Webpage
  • Unofficial course description: A course on the theoretical underpinnings of machine learning. Fundamental concepts of machine learning, such as crossvalidation and the bias-variance tradeoff, will be viewed through both statistical and mathematical lenses. As much of the coursework will center around participation in Kaggle competitions, there will be greater emphasis on supervised learning techniques including regression, smoothing methods, classification, and regularization/shrinkage methods. To this end, there will be a large computational component to the course, in particular the use of tools for data visualization, data wrangling, and data modeling. Furthermore, to encourage engagement with the open-source statistics, data science, and machine learning communities, work and collaboration will center around the use of GitHub.
  • Objectives: This semester you will
    1. Synthesize what you’ve learned in your statistics, data science, mathematics, and computer science courses in a less structured setting than a typical course.
    2. Learn to communicate empathetically when doing group/collaborative work.
    3. Develop your presentation skills and ability to think on your feet.
    4. Implement the ideas behind a “minimum viable product” into your workflow.
    5. Empower yourselves to actively participate in the open-source code/data ecosystem, necessitating understanding of GitHub pull requests.

Lectures

  • A typical lecture will consist of some balance of:
    • “Chalk talk”: Old-school talks on the blackboard where we’ll either
      • Prime the discussion for the day
      • Give an “executive summary” or a “bird’s eye view” of a topic
      • Go over more theoretical content
    • “Tech time”: Unstructured time for you to either
      • Read digital content
      • Work with your seatmates on in-class exercises
      • Start problem sets
  • The flow of the lecture notes will follow the main page of the course.

Attendance policy

I do not take explicit attendance, so there is no need to inform me of occasional absences. That being said, abuse of this policy will eventually catch up to you (see Engagement below). If you are absent, please ask your peers what you missed.

Lecture schedule

Roughly speaking we will cover the following topics. A more detailed outline and corresponding readings can be found here.

  1. Background
    • Intro to modeling
    • Simple case to start: splines
    • Out-of-sample prediction, sampling/resampling, crossvalidation
    • Bias-variance tradeoff
  2. Continuous outcomes I
    • LOESS smoother
    • Regression for prediction
  3. Categorical outcomes i.e. classification
    • Logistic regression for prediction + ROC curves
    • k-Nearest Neighbors
    • Classification and regression trees (CART)
  4. Continuous outcomes II
    • Regularization/shrinkage methods: Ridge regression and LASSO
  5. Other methods
    • Boosting and bagging
    • Random forests
    • Neural nets
  6. Unsupervised learning (time permitting)
    • Principal components analysis
    • k-Means Clustering

Materials/Readings

We will chiefly use the following available textbooks:

  1. “Happy Git and GitHub for the useR” by Jenny Bryan.
  2. “An Introduction to Statistical Learning” by James, Witten, Hastie, and Tibshirani. We’ll refer to this as “ISLR.”
  3. “Computer Age Statistical Inference” by Efron and Hastie. We’ll refer to this as “CASI.”
  4. “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman. We’ll refer to this as “ESL.”

The latter three are all supplemental resources: use these if you find the chalk talks and the course main page lacking in any regard. That being said, here is a rough hierarchy of how much I refer to them:

  1. ISLR: Most of my notes center around this text, which is targeted at undergraduates.
  2. ESL: This is the graduate version of ISLR, and predates ISLR by a few years.
  3. CASI: A 30,000-foot overview of the past, present, and future of statistics/data science/machine learning. We’ll refer to this occasionally.

Evaluation

Group Final Project 30%

Much of this course is a build up to the final project. You will participate in any Kaggle competition that has a leaderboard, but in a more exhaustive fashion. This could mean:

  • Comparing and contrasting methods
  • Drilling down on only one method
  • Thinking of creative ways of transforming existing variables

Details:

  • Due Friday 12/22 5pm (last day of exams).
  • All group members are expected to contribute, and a system will be put in place to hold your group peers accountable for their work.
  • As per the Amherst College Student Code of Conduct’s statement on intellectual responsibility and plagiarism, all external sources must be cited in your submissions.

Deliverables:

  1. ASAP: A Slack DM including all your team members, myself, and Andrew indicating who your team leader is.
  2. On Thu 12/7, Mon 12/11, and Tue 12/12: in-class presentations. See top of syllabus for group order.
    • Should contain a minimum of 17 and a maximum of 22 minutes of content, and allow 3 minutes for questions.
    • Please Slack your project group DM a link/PDF/PPT beforehand.
    • Any format you like, but keep in mind:
      • People can only absorb a finite amount of information in a talk.
      • Try to take a “less is more” approach to content, i.e. for the same amount of information conveyed, try to use less ink.
      • Visualizations/tables with not too many significant figures are great in this regard.
      • Just remember you are “marketing/selling your ideas”.
    • A print-out of this evaluation rubric will be handed out to all of you for each group so that you may evaluate your peers. After all presentations are done, I will return these to all groups.
  3. On due date Fri 12/22 5pm
    1. A README.md file that’s been updated.
    2. A Final_Project.Rmd file that is completely reproducible and easy for a new user to pick up, understand, and replicate. New users include: other people, future you, and especially me!
    3. A two-page max summary PDF of your thinking:
      1. All the approaches you tried.
      2. An easy-to-read description of what went into your ultimate submission.
      3. What went well, what went wrong, and what you learned.
      4. Examples
    4. A completed Google Forms exit survey that will be posted here on Wed 12/20 at noon. Completion of this survey is worth 10% of your individual final project grade.

Numerical outcome variables:

Competition URL
House Prices: Advanced Regression Techniques https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Allstate Claims Severity https://www.kaggle.com/c/allstate-claims-severity
AMS Solar Energy Prediction https://www.kaggle.com/c/ams-2014-solar-energy-prediction-contest
Africa Soil Property Prediction Challenge https://www.kaggle.com/c/afsis-soil-properties
Bike Sharing Demand https://www.kaggle.com/c/bike-sharing-demand/data
Display Advertising Challenge https://www.kaggle.com/c/criteo-display-ad-challenge
Don’t Overfit https://www.kaggle.com/c/overfitting#description
ECML/PKDD 15: Taxi Trip Time Prediction (II) https://www.kaggle.com/c/pkdd-15-taxi-trip-time-prediction-ii
Grupo Bimbo Inventory Demand https://www.kaggle.com/c/grupo-bimbo-inventory-demand/data
How Much Did It Rain? II https://www.kaggle.com/c/how-much-did-it-rain-ii
Loan Default Prediction - Imperial College London https://www.kaggle.com/c/loan-default-prediction/data
Predict HIV Progression https://www.kaggle.com/c/hivprogression#description
The Hewlett Foundation: Automated Essay Scoring https://www.kaggle.com/c/asap-aes#evaluation
Two Sigma Connect: Rental Listing Inquiries https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries
Two Sigma Financial Modeling Challenge https://www.kaggle.com/c/two-sigma-financial-modeling/data
World Cup 2010 https://www.kaggle.com/c/worldcup2010

Categorical outcome variables:

Competition URL
March Machine Learning Mania 2017 https://www.kaggle.com/c/march-machine-learning-mania-2017
San Francisco Crime Classification https://www.kaggle.com/c/sf-crime
AirBnb New User Bookings https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
Kobe Bryant Shot Selection https://www.kaggle.com/c/kobe-bryant-shot-selection
Shelter Animal Outcomes https://www.kaggle.com/c/shelter-animal-outcomes/data
Accelerometer Biometric Competition https://www.kaggle.com/c/accelerometer-biometric-competition/data
Credit Card Fraud Detection https://www.kaggle.com/dalpozz/creditcardfraud
Expedia Hotel Recommendations https://www.kaggle.com/c/expedia-hotel-recommendations
Facebook V: predicting check ins https://www.kaggle.com/c/facebook-v-predicting-check-ins/data
Give Me Some Credit https://www.kaggle.com/c/GiveMeSomeCredit
Intel & MobileODT Cervical Cancer Screening https://www.kaggle.com/c/intel-mobileodt-cervical-cancer-screening
Iris Species https://www.kaggle.com/uciml/iris
Leaf Classification https://www.kaggle.com/c/leaf-classification
Personality Prediction Based on Twitter Stream https://www.kaggle.com/c/twitter-personality-prediction#description
Predict Closed Questions on Stack Overflow https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow
Quora Question Pairs https://www.kaggle.com/c/quora-question-pairs
The Allen AI Science Challenge https://www.kaggle.com/c/the-allen-ai-science-challenge#evaluation
World Cup 2010 https://www.kaggle.com/c/worldcup2010

Weekly Problem Sets 20%

The problem sets in this class should be viewed as low-stakes opportunities to practice, instead of evaluative tools used by the instructor to assign grades.

  • Basic structure: Updated below
    1. You will be randomly assigned into pairs.
    2. You will complete the problem set and submit it via GitHub pull request.
    3. On the due date, I will pick a certain number of teams at random to present their findings.
    4. You will have the option of updating your submissions.
  • Sometimes problem sets will be individual, sometimes you will choose groups, sometimes you’ll be randomly assigned into groups.
  • Typical problem set timeline
    • Tuesday: problem set assigned
    • The following Tuesday 8:30am:
      • Submitted via a synchronized pull request by you/your team leader
      • Presentations (if applicable)
    • By Wednesday 9am: The TA will give feedback over GitHub. Do not commit/push any new changes until you’ve received the feedback.
    • By Thursday 8:30am: Revise/resubmit your work. There is no need to create a new pull request; the existing one will be updated.
    • By Monday: Albert will give his feedback and the grade.
  • Policy on collaboration:
    • While I encourage you to discuss problem sets with your peers, you must submit your own answers and not simple rewordings of another’s work.
    • As per the Amherst College Student Code of Conduct’s statement on intellectual responsibility and plagiarism, all collaborations must be explicitly acknowledged in your submissions.
  • Lowest two scores dropped.
  • No extensions for problem sets will be granted.
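
To illustrate the "lowest two scores dropped" rule, here is a minimal Python sketch. It assumes that all problem sets are weighted equally and scored on a common scale; neither assumption is stated in this syllabus, and the function name `problem_set_average` is purely hypothetical.

```python
def problem_set_average(scores, n_dropped=2):
    """Mean of the scores left after dropping the n_dropped lowest ones.

    Assumption (not from the syllabus): every problem set carries equal
    weight and all scores are on the same scale.
    """
    if len(scores) <= n_dropped:
        raise ValueError("need more scores than the number dropped")
    # Sort ascending and discard the lowest n_dropped scores.
    kept = sorted(scores)[n_dropped:]
    return sum(kept) / len(kept)


# Hypothetical scores out of 10: the 4 and the 6 are dropped.
print(problem_set_average([10, 4, 8, 6, 9]))  # prints 9.0
```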

Three Midterms 40%

There will be three midterms: two during the semester, one during finals week (dates posted under Basic information).

  • Two highest scores are weighted 15% each, lowest is weighted 10%.
  • All midterms are cumulative.
  • There will be no make-up or rescheduled midterms, except in the following cases if documentation is provided (e.g. a dean’s note):
    • serious illness or death in the family.
    • athletic commitments, religious obligations, and job interviews if prior notice is given. In such cases, rescheduled exams must be taken after the rest of the class.
  • There will be no extra-credit work to improve midterm scores after the fact.
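
As a worked example of the midterm weighting above (two highest scores at 15% each, lowest at 10%), here is a minimal Python sketch. It assumes each midterm is scored on a 0-100 scale, which this syllabus does not specify, and the function name `midterm_component` is hypothetical.

```python
def midterm_component(scores):
    """Points contributed to the final grade (out of 40) by three midterms.

    Assumption (not from the syllabus): each score is on a 0-100 scale.
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three midterm scores")
    lowest, middle, highest = sorted(scores)
    # Two highest scores count 15% each; the lowest counts 10%.
    return (15 * highest + 15 * middle + 10 * lowest) / 100


# Hypothetical scores of 70, 90, and 80: the 70 gets the 10% weight.
print(midterm_component([70, 90, 80]))  # prints 32.5
```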

Engagement 10%

It is difficult to explicitly codify what constitutes “an engaged student,” so instead I present the following rough principle I will follow: you’ll only get out of this class as much as you put in. Some examples of behavior counter to this principle:

  • Not participating in in-class exercises.
  • Engaging so little with me, either in class or during office hours, that I don’t know what your voice sounds like.
  • Submitting a problem set that has code or content that is copied from or is only a slightly modified version of your peers’ work.

Inclusion and Accessibility

I strive to make this course welcoming to all students. If you would like to discuss your learning needs with me, please schedule a meeting. I look forward to working with you to understand and support your academic success.

In particular, if you have a documented disability that requires accommodations, you will need to register with Accessibility Services for coordination of your academic accommodations. You can reach them via email at accessibility@amherst.edu, or via phone at 413.542.2337. Once you have your accommodations in place, I will be glad to meet with you privately during my office hours or at another agreed upon time to discuss the best implementation of your accommodations.