While I encourage you to discuss problem sets with your peers, you must submit your own answers and not simple rewordings of another’s work. Furthermore, all collaborations must be explicitly acknowledged at the top of your submissions.

All problem sets are to be submitted on the GitHub organization for this class.

Final Project

  • Final Project & Google Forms Exit Survey Due: Tue 5/23 12pm
  • The Exit Survey is here.
  • The Final Project should be submitted on GitHub in a folder Team_X where X is your team letter.
  • In the above GitHub page, there is a template Team_X folder. Download a .zip file of this folder here, which contains:
    • A Files folder: Put all data files there
    • report.pdf: Up to one page PDF describing
      • The approach you ultimately used for your submission.
      • Other things you tried. Refer to Sections in Extras in Final_Project_Code.R.
    • Final_Project_Code.R: Write your code here. This file includes the following template outline:
      1. Load All Necessary Packages
      2. Load Data Files & Data Cleaning
        • CSV/data files should be read assuming they are in the Files folder and the current working directory is Team_X. In other words, do this: read_csv("Files/CSV_NAME.csv") but not something like this: read_csv("/Users/aykim/Documents/MATH218/Team_X/Files/CSV_NAME.csv")
      3. Top 4-5 Visualizations/Tables of Exploratory Data Analysis:
        • Guidelines: If you had to illustrate, using no modeling but only graphs and tables, which variables have the most predictive power, which would you include?
      4. Cross-Validation of Final Model
        • Perform a cross-validation on only the final/ultimate model used for your submission.
        • The “score” in question should be the same as used to compute the Kaggle leaderboard. In other words, your estimated score should be roughly equal to the score returned by Kaggle after your submission.
      5. Create Submission
        • Fit model to all training data using any cross-validated “knobs”.
        • Make predictions for test data using broom::augment(newdata=) or predict()
        • Output a CSV using readr::write_csv(DATAFRAME_NAME, path="Files/SUBMISSION_NAME.csv") that is Kaggle-submittable. This submission should return a Kaggle score that is close to your cross-validated score.
      6. Extras
        • Anything else you’ve tried that you’d like to include but isn’t essential to the above, like other EDAs, other modeling approaches you’ve tried, etc.
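Steps 4 and 5 above can be sketched in base R as follows, using simulated stand-in data and a hypothetical linear model; your real data, model, and score will differ (the score here is assumed to be RMSE):

```r
set.seed(76)
train <- data.frame(x = runif(100))
train$y <- 2 * train$x + rnorm(100, sd = 0.1)
test <- data.frame(id = 1:20, x = runif(20))

# Step 4: 5-fold cross-validated RMSE of the final model only
fold <- sample(rep(1:5, length.out = nrow(train)))
rmse <- sapply(1:5, function(k) {
  fit  <- lm(y ~ x, data = train[fold != k, ])
  pred <- predict(fit, newdata = train[fold == k, ])
  sqrt(mean((train$y[fold == k] - pred)^2))
})
mean(rmse)  # should roughly match the Kaggle score if Kaggle scores on RMSE

# Step 5: refit on ALL training data, predict on test, write the submission
fit_final  <- lm(y ~ x, data = train)
submission <- data.frame(id = test$id, y = predict(fit_final, newdata = test))
out <- tempfile(fileext = ".csv")  # in the real project: "Files/SUBMISSION_NAME.csv"
write.csv(submission, out, row.names = FALSE)
```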

Problem Set 10

  • Info:
    1. Your final capstone assignment is to write a one-page essay addressing the following question: How can machine learning methods help take the ‘con’ out of ‘econometrics’? (as defined by Russ Roberts at 17:17 in the required listening below).
    2. Note that the ‘con’ is not just an issue in econometrics, but also any domain that uses statistics (in particular p-values) for causal inference (Wikipedia link), including psychology, biology, sociology, etc.
    3. Assigned Sat 4/29. Due by Fri 5/5 9:05am. No coding for this week.
  • Required Listening:
    1. Listen to the Econ Talk podcast’s interview (time 1h01m) of Stanford’s Prof Susan Athey.
    2. The following times are very domain specific and are optional listening:
      • 36:17-41:48 on the causal effects of economic policies (NAFTA & stimulus packages)
      • 41:49-47:02 on how Silicon Valley internet companies drive innovation
    3. Some terminology from the podcast explained:
      • Data Mining: Dilbert cartoon
      • Goodness-of-fit: How well do the \(\widehat{y}_i\) fit the \(y_i\)?
      • Marginal effect: The effect of a single covariate/predictor \(x_j\) in a multiple covariate/predictor setting, like multiple regression. i.e. All other things being equal, the associated marginal effect of an increase of 1 unit in \(x_j\) on the outcome variable \(y\) is \(\beta_j\).
      • Counterfactual: Used in causal inference: what is counter to what was observed (more on Monday). For example:
        • Factual: The (observed) income of an individual who graduated from college.
        • Counterfactual: the (unobserved) income had that same individual not graduated from college.
  • Guidelines:
    1. Format: No more than a single printed page in 11pt font.
    2. Submission: Both
      • Printed copy (single page)
      • A file named PS10_FIRSTNAME_LASTNAME.PDF on GitHub. PDF format only please (I hate MS Word).
    3. Header: Should include your name, your major, & your year. For example: “Albert Y. Kim, Mathematics & Computer Science, ‘04”
    4. Target audience: other Middlebury students and faculty in your respective fields who are not as well versed in machine learning methods as you are.

Problem Set 09

  • Info: Assigned Thu 4/20. Due by Fri 4/28 9:05am. No starter code for this week.
    1. Pick a Kaggle competition from the compiled list of student-identified competitions that
      • Involves a categorical outcome variable \(y\) with at least 3 levels
      • Has a leaderboard
    2. Use
      • \(k\)-Nearest Neighbors as your model
      • Whatever mechanism used in the Kaggle leaderboard as your notion of “score”
      • Cross-validation to find the optimal \(k\). You can either code the cross-validation to find the optimal \(k\) using your own for loop, or use a function that does this for you much like glmnet::cv.glmnet() did for ridge regression/LASSO if there is one!
    3. Submit
      • Your predictions on Kaggle and note your score.
      • A file named PS09_FIRSTNAME_LASTNAME.R on GitHub.
    4. Fill out this Google Form.
  • Learning Goals:
    1. Getting experience with the final type of outcome variable of interest: categorical outcome variable that isn’t binary.
    2. Gearing up for the final project.
  • Example Solutions:
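As a rough sketch of this workflow, here is a cross-validation over \(k\) using class::knn() on the built-in iris data, which stands in for your Kaggle data (it has a 3-level categorical outcome); accuracy stands in for whatever score your chosen leaderboard uses:

```r
library(class)  # knn()

set.seed(76)
fold <- sample(rep(1:5, length.out = nrow(iris)))  # 5-fold assignment
ks   <- 1:15                                       # candidate values of k

# For each k, average the out-of-fold accuracy over the 5 folds
cv_accuracy <- sapply(ks, function(k) {
  mean(sapply(1:5, function(f) {
    pred <- knn(train = iris[fold != f, 1:4],
                test  = iris[fold == f, 1:4],
                cl    = iris$Species[fold != f],
                k     = k)
    mean(pred == iris$Species[fold == f])
  }))
})
k_star <- ks[which.max(cv_accuracy)]  # the cross-validated optimal k
```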

Problem Set 08

  • Info:
    • Code portion: Assigned Fri 4/14. Due by Sat 4/22 9:05am. No starter code for this week. Submit an entry to one of the two Kaggle competitions below. No requirements on the complexity of the submission; just do what you can in the time given for this problem set.
      1. Give Me Some Credit
      2. StumbleUpon Evergreen Classification Challenge
  • Learning Goals:
    1. Observe how hard it actually is to get a high Area Under the (ROC) curve!
  • Homework: By Sat 4/22 9:05am you need to submit a file named PS08_FIRSTNAME_LASTNAME.R on GitHub.

Problem Set 07

  • Info:
    • Code portion: Assigned Mon 4/10. Due by Fri 4/14 9:05am. No starter code for this week. This week we won’t worry about out-of-sample prediction but rather just within-sample to keep things simpler. Using the profiles data from logistic_regression.R in Lec20 we will continue to try to predict when users are sex == female:
      1. Create the ROC curve corresponding to a model that makes random guesses as reflected in \(\widehat{p}\). What is the area under the ROC curve?
      2. Create the ROC curve corresponding to a model that makes perfect guesses as reflected in \(\widehat{p}\). What is the area under the ROC curve?
      3. Say you are trying to predict sex == female using height and age as predictor variables for the purposes of advertisement targeting. Marketing feels that it is twice as costly to incorrectly predict a user to be male when they are really female than to incorrectly predict a user to be female when they are really male. Roughly what threshold \(p^*\) should you use in the decision rule: “Predict a user to be female if \(\widehat{p} > p^*\)”?
  • Learning Goals:
    1. Evaluate ROC curves.
    2. Experiment with \( p^* \) thresholds and cost functions
  • Homework: By Fri 4/14 9:05am, you need to submit a file named PS07_FIRSTNAME_LASTNAME.R on GitHub.
  • Example Solutions:
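One way to sanity-check questions 1-2 is with simulated labels in place of the profiles data. This sketch computes the area under the ROC curve directly, as the probability that a random positive case receives a higher \(\widehat{p}\) than a random negative case (ties count as 1/2):

```r
set.seed(76)
n <- 1000
y <- rbinom(n, 1, 0.5)  # stand-in labels for sex == "female"

# AUC = P(p-hat for a random positive > p-hat for a random negative)
auc <- function(p, y) {
  pos <- p[y == 1]
  neg <- p[y == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

a_random  <- auc(runif(n), y)  # random guesses: AUC near 0.5
a_perfect <- auc(y, y)         # perfect guesses: AUC exactly 1

# For question 3, one approach: sweep candidate p* values and pick the one
# minimizing the total cost 2 * (# false negatives) + 1 * (# false positives)
```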

Problem Set 06

  • Info:
    • Code portion: Assigned Sat 4/1. Due by Fri 4/7 9:05am. Here is the PS06_FIRSTNAME_LASTNAME.R starter code. Goals (each its own section in starter code):
      1. Obtain the optimal \( \lambda \) for ridge regression: \( \lambda^*_{\mbox{ridge}} \)
      2. Obtain the optimal \( \lambda \) for LASSO: \( \lambda^*_{\mbox{lasso}} \)
      3. Compare linear regression, ridge regression, and LASSO using 10-fold cross-validation. Which is best?
      4. Make a submission on Kaggle.
    • Code review: Assigned Fri 4/7. Due by Mon 4/10 9:05am.
  • Learning Goals:
    1. Answer the question: “Why would I use regularization/shrinkage methods?”
    2. Make your first entry in the House Prices: Advanced Regression Techniques Kaggle Competition.
  • Homework: By Friday 4/7 9:05am, you need to complete the following:
    1. Submit the following on GitHub:
      1. PS06_FIRSTNAME_LASTNAME_ridge_coefficients.pdf
      2. PS06_FIRSTNAME_LASTNAME_LASSO_coefficients.pdf
      3. PS06_FIRSTNAME_LASTNAME_submissions.csv
    2. After submitting PS06_FIRSTNAME_LASTNAME_submissions.csv on Kaggle, fill out the following Google Form. Note the score on Kaggle is a slight variation on the root MSE known as the Root Mean Squared Logarithmic Error: the root mean squared error computed on log-transformed values.
  • Example Solutions:
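As a sketch of the score mentioned above, here is a minimal RMSLE function; note that Kaggle's definition typically adds 1 inside the logarithm to guard against log(0), so check the competition's evaluation page for the exact formula:

```r
# Root Mean Squared Logarithmic Error (with the common +1 inside the log)
rmsle <- function(y, y_hat) {
  sqrt(mean((log(y_hat + 1) - log(y + 1))^2))
}

rmsle(c(100000, 200000), c(100000, 200000))  # perfect predictions score 0
rmsle(c(100000, 200000), c(110000, 180000))  # small but nonzero error
```

Because errors are taken on the log scale, being off by $20,000 on an expensive house costs you less than being off by $20,000 on a cheap one.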

Problem Set 05

  • Info:
    • Reading: Read Scott Fortmann Roe’s “Understanding the Bias-Variance Tradeoff.”
    • Code portion: Assigned Sun 3/19. To be submitted on Fri 3/24 9:05am. Here is the PS05.R starter code. Goal: Using smooth.spline(), recreate the left-most plot in Figure 2.12 on page 36 of book.
    • Code review: Assigned Fri 3/24. Due by Mon 4/3 9:05am.
  • Learning Goals:
    1. Preview MSE breakdown and bias-variance tradeoff in other contexts (above reading).
    2. Express in code that \( \mbox{Bias}\left[\widehat{f}(x)\right]^2 \) and \( \mbox{Var}\left[\widehat{f}(x)\right] \) are inversely related.
  • Homework: On Friday 3/24:
    1. You will submit the following on GitHub:
      1. PS05_FirstName_LastName.R
      2. PS05_FirstName_LastName_MSE.pdf: plot generated at the end of PS05.R
      3. PS05_FirstName_LastName_bias_sq_vs_var.pdf: plot generated at the end of PS05.R
    2. Take a light quiz at the beginning of lecture on the above reading. Try to tie in the elements of the reading to your algorithm in PS05.R.
  • Example Solutions:
Code Reviewer A | Code Reviewer B
Connor McCormick | Jewel Chen, Bianca Gonzalez
Aayam Poudel | Emily Miller
Kelsey Hoekstra | David Valentin
Will Ernst | Kyra Gray
Alexander Pastora | Ben Czekanski
Shannia Fu | Marcos Barrozo
Sierra Moen | Tina Chen
Nina Sonneborn | Elias Van Sickle
Ryan Rizzo | Rebecca Conover
Alfred Hurley | Xiaoli Jin
Brenda Li | Phil Hoxie
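The PS05 idea can be sketched as follows, using an assumed true function \(f\) (the book's example differs): estimate the bias squared and variance of \(\widehat{f}(x_0)\) over repeated samples, for a rigid and a wiggly choice of df:

```r
set.seed(76)
f  <- function(x) sin(2 * pi * x)  # assumed "true" f for illustration
x  <- seq(0, 1, length.out = 100)
x0 <- 0.25                         # evaluation point (a peak of f)

# Repeatedly sample new y's, fit smooth.spline() at a given df, and record
# the fitted value at x0
est <- function(df) {
  replicate(200, {
    y <- f(x) + rnorm(length(x), sd = 0.3)
    predict(smooth.spline(x, y, df = df), x = x0)$y
  })
}

low  <- est(df = 3)   # rigid fit: high bias, low variance
high <- est(df = 30)  # wiggly fit: low bias, high variance
c(bias_sq_low  = (mean(low)  - f(x0))^2, var_low  = var(low),
  bias_sq_high = (mean(high) - f(x0))^2, var_high = var(high))
```

Sweeping df over a grid and plotting bias squared against variance recovers the inverse relationship from the reading.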

Problem Set 04

  • Info:
    • Code portion: Assigned Mon 3/13. To be submitted on Fri 3/17 9:05am. Here is the PS04.R starter code.
    • Code review: Assigned Fri 3/17. Due by Mon 3/20 9:05am.
    • Kaggle: Peruse the active and non-active Kaggle competitions and scout two competitions you think are appropriate for this class to participate in, ideally one of each:
      1. The outcome variable is continuous: regression
      2. The outcome variable is categorical (including binary): classification
  • Learning Goals:
    • Unpack the behavior of the MSE i.e. prediction error.
  • Homework: On Friday 3/17, the following are due:
    1. PS04_FirstName_LastName.R on GitHub.
    2. Kaggle findings on this Google Form
  • Example Solutions:
Code Reviewer A | Code Reviewer B
Jewel Chen | Tina Chen
Emily Miller | Bianca Gonzalez
Alfred Hurley | Aayam Poudel
Alexander Pastora | Shannia Fu
Brenda Li | Will Ernst
Nina Sonneborn | Rebecca Conover
Connor McCormick | Kelsey Hoekstra
Phil Hoxie | Malik Gomez
Xiaoli Jin | Elias Van Sickle
David Valentin | Marcos Antonio de Souza Barrozo Filho
Otto Nagengast | Kyra Gray
Ben Czekanski | Ryan Rizzo

Problem Set 03

  • Info:
    • Code portion: Assigned Sun 3/5. To be submitted on Fri 3/10 9:05am.
    • Code review: Assigned Fri 3/10. Due by Mon 3/13 9:05am.
    • Ethics:
      1. Listen to Econ Talk podcast interview (time 1h11m) of Cathy O’Neil, author of Weapons of Math Destruction.
      2. Explain in three paragraphs Cathy O’Neil’s argument about how supposedly objective mathematical/algorithmic models reinforce inequality in the context of 1) crime recidivism, 2) the thought experiment of hiring in tech firms, and 3) teacher evaluations.
  • Learning Goals:
    1. Baby’s first true model fitting experience!
    2. However we keep things simple in that there is only one true predictor and you’ll know it exactly.
  • Homework: Data!!! If you’ve never downloaded a CSV off GitHub, ask the person next to you. Chances are they’ll know!
    • You will be submitting three files on Friday at 9:05am:
      1. PS03_FirstName_LastName.R with your model code + MSE on the training data
      2. PS03_submission_FirstName_LastName.csv with your predictions
      3. PS03_discussion_FirstName_LastName.txt with your discussion points. Text files only please!
  • Notes:
    • Fit any model you like!
    • What is an appropriate “score” criterion in this situation?
    • After each model fit, compute the score of your model on the entirety of the train data
    • I will give you a score only after you submit your problem set on Friday. What can you do until then?
  • Example Solutions:
    • PS03_solutions.R
    • Shiny app look at data!!!
    • Source code for Shiny app here. To publish to shiny.middlebury.edu:
      • Click on Publish
      • Destination Account -> Add New Accounts
      • Enter the URL of the RStudio Connect Server: https://shiny.middlebury.edu:3737
      • Follow all steps
Code Reviewer A | Code Reviewer B
Marcos Antonio de Souza Barrozo Filho | Connor McCormick + Will Ernst
Bianca Gonzalez | Ryan Rizzo
Sierra Moen | Phil Hoxie
Malik Gomez | Nina Sonneborn
David Valentin | Ben Czekanski
Rebecca Conover | Kelsey Hoekstra
Xiaoli Jin | Aayam Poudel
Tina Chen | Emily Miller
Jewel Chen | Otto Nagengast
Kyra Gray | Alfred Hurley
Alexander Pastora | Brenda Li
Elias Van Sickle | Shannia Fu
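A minimal sketch of the PS03 loop, using simulated stand-in data in place of the posted CSV: fit a model of your choosing, then compute its MSE on the entirety of the training data:

```r
set.seed(76)
train <- data.frame(x = runif(500))
train$y <- 3 * train$x^2 + rnorm(500, sd = 0.5)

fit <- lm(y ~ poly(x, 2), data = train)    # fit any model you like
mse <- mean((train$y - fitted(fit))^2)     # score on ALL the train data
mse
```

Beware: training MSE tends to only improve as the model gets more complex, which is exactly why the Kaggle-style held-out score can tell a different story.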

Problem Set 02

  • Info:
    • Code portion: Assigned Fri 2/24. Originally due Wed 3/1 9am; now due Fri 3/3 at the beginning of lecture (standby for submission format).
    • Code review: Assigned Fri 3/3. Due by Mon 3/6 9:05am
  • Learning Goals:
    1. Implement cross-validation from scratch. Later we’ll use existing R packages.
    2. dplyr newbies: perform your first substantive data manipulation!
    3. Code review to critique each other’s code! What’s done in practice!
  • Homework:
    1. Code portion:
      • Using the train.csv data from the Kaggle Titanic competition and the gender survival model, compute what I called the “pseudo-scores”: estimates of the Kaggle “score” of 76.555%. Do this via both
        1. leave-one-out cross-validation
        2. \(k = 5\) fold cross-validation.
      • Save your work in a file called PS02_FirstName_LastName.R. Ex: in my case PS02_Albert_Kim.R. Have this ready to submit at the beginning of lecture on Fri 3/3.
      • In this same .R file, answer the following question in a commented section: We saw that Kaggle takes the test data (418 rows), only reports your score on the leaderboard based on half of these data, and declares the winner based on the other half, which is withheld until the very end of the competition. Not only that, Kaggle does not tell you how they split the 418 rows. Say Kaggle didn’t do this and instead reported your leaderboard score based on the entire test data (418 rows). Write 2-3 sentences outlining a strategy of how you could exploit the information given by the leaderboard to get a perfect score of 100%.
    2. Code review: See groups below
      • Read the following document on code review practices at least once.
      • Pick out only the top 3 points you’d like to point out/give feedback on
      • Create a Slack direct message with your code review partner and the instructor “albert”.
      • Exchange feedback in whatever format you like
  • Notes:
    • Please feel free to work collaboratively!
  • Example Solutions: PS02_solutions.R. Please note there are many ways of completing this assignment.
Code Reviewer A | Code Reviewer B
Bianca Gonzalez | Ryan Rizzo + Will Ernst
Rebecca Conover | Tina Chen
Elias Van Sickle | Sierra Moen
Connor McCormick | Malik Gomez
Kyra Gray | Aayam Poudel
Kelsey Hoekstra | Xiaoli Jin
Alfred Hurley | Otto Nagengast
Brenda Li | Alexander Pastora
Emily Miller | Marcos Antonio de Souza Barrozo Filho
Nina Sonneborn | Ben Czekanski
David Valentin | Phil Hoxie
Shannia Fu | Jewel Chen
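Both cross-validation schemes can be sketched from scratch with simulated stand-in data (since train.csv isn't bundled here) and the fitting-free "gender model"; the pseudo-score is the proportion of correct predictions:

```r
set.seed(76)
n   <- 100
dat <- data.frame(female = rbinom(n, 1, 0.35))
dat$survived <- rbinom(n, 1, ifelse(dat$female == 1, 0.75, 0.2))

# The "gender model" predicts survival iff female; it needs no fitting, but
# in general your model would be fit using train_rows only
score <- function(train_rows, test_rows) {
  mean(dat$survived[test_rows] == dat$female[test_rows])
}

# Leave-one-out CV: each row is its own test set
loocv <- mean(sapply(1:n, function(i) score(setdiff(1:n, i), i)))

# k = 5 fold CV: partition the rows into 5 folds
fold  <- sample(rep(1:5, length.out = n))
kfold <- mean(sapply(1:5, function(f) score(which(fold != f), which(fold == f))))

c(loocv = loocv, kfold = kfold)
```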

Problem Set 01

  • Info:
    • Assigned Wed 2/15
    • Due Fri 2/24 9am
  • Learning Goals:
    1. Getting familiar with Kaggle competition workflow
    2. Allow time for tidyverse newbies to get up to speed
  • Homework:
    • tidyverse newbies: Do DataCamp classes listed in Lec02.
    • Baby’s first Kaggle competition: Titanic
      • Create a model that predicts Survival other than ones we’ve seen.
      • Submit the predictions CSV so that your ranking shows.
      • Slack message me your Kaggle name so I can find you in the rankings.
  • Notes:
    • Don’t focus on memorizing anything for now, just complete the assignment.
    • If you find yourself spinning your wheels, let me know.
  • Discussion:
    • I don’t actually care what your score was, this was about process.
    • Kaggle is very finicky about the submission format:
      • Number of rows and columns have to match exactly
      • Variable names have to match exactly
      • There can’t be missing values (NAs) in your predictions for Survived
      • Tricky: Survived has to be integers and not doubles i.e. 1 and not 1.0
      • See kaggle.R for example code.
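A sketch of producing a format-valid submission in base R, assuming the standard Titanic test PassengerIds and stand-in predicted probabilities; the key detail is as.integer() so Survived is written as 1, not 1.0:

```r
set.seed(76)
test_ids <- 892:1309                  # the 418 test PassengerIds
p_hat    <- runif(length(test_ids))   # stand-in predicted probabilities

submission <- data.frame(
  PassengerId = test_ids,
  Survived    = as.integer(p_hat > 0.5)  # integers, and no NAs allowed
)
stopifnot(nrow(submission) == 418, !anyNA(submission$Survived))
write.csv(submission, tempfile(fileext = ".csv"), row.names = FALSE)
```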

Code Review Assigner Code

class <- c("Aayam Poudel", "Alexander Pastora", "Alfred Hurley", "Ben Czekanski",
"Bianca Gonzalez", "Brenda Li", "Connor McCormick", "David Valentin",
"Elias Van Sickle", "Emily Miller", "Jewel Chen", "Kelsey Hoekstra",
"Kyra Gray", "Malik Gomez", "Marcos Antonio de Souza Barrozo Filho", "Nina Sonneborn",
"Otto Nagengast", "Phil Hoxie", "Rebecca Conover", "Ryan Rizzo", "Shannia Fu",
"Sierra Moen", "Tina Chen", "Xiaoli Jin", "Will Ernst", NA)

library(dplyr)

# Note: the trailing NA pads the roster to an even length so matrix(ncol = 2)
# splits it cleanly into pairs
sample(class) %>%                # shuffle the class roster
  matrix(ncol = 2) %>%           # pair students into two columns
  as_data_frame() %>%            # convert the matrix to a data frame
  rename(`Code Reviewer A` = V1, `Code Reviewer B` = V2)