---
title: "Problem Set 08"
author: "WRITE YOUR NAME HERE"
date: "2018-03-27"
output:
  html_document:
    highlight: tango
    theme: cosmo
    toc: yes
    toc_depth: 2
    toc_float:
      collapsed: false
    df_print: kable
---
```{r, include=FALSE}
# Do not edit this code block/chunk
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning = FALSE, fig.width = 16/2, fig.height = 9/2)
```
# Collaboration {-}
Please indicate who you collaborated with on this problem set:
# Background {-}
```{r}
library(ggplot2)
library(dplyr)
library(moderndive)
```
[Kaggle.com](https://www.kaggle.com/competitions) is a platform for predictive
modelling and analytics competitions in which statisticians, data scientists,
and machine learning experts compete to produce the best predictive models for
datasets uploaded by companies and users. Individuals and teams that make the
best predictions can win prizes. One of the datasets on Kaggle.com is the [House
Sales in King County,
USA](https://www.kaggle.com/harlfoxem/housesalesprediction) dataset, where the
goal is to fit models to predict house prices based on features of the houses,
like size and number of rooms. This dataset is included in the `house_prices`
data frame in the `moderndive` package. We are going to model:

* Outcome variable $y$: `price`, the house price
* Predictor variables
    1. $x_1$: numerical variable `sqft_living`, the square footage of the home
    1. $x_2$: categorical variable `condition`, the condition of the house (1 = worst through 5 = best)

Run the following in your console to get a sense of the data:

1. `View(house_prices)` to view the data in the spreadsheet viewer
1. `?house_prices` to see the help file

# Question 1: Exploratory data analysis
Create histograms for both `price` and `sqft_living`:
```{r}
# Code to create histogram of price:
# Code to create histogram of sqft_living:
```
Describe the shape of both histograms and what this tells you about the housing
market in King County, which includes Seattle.
# Question 2: Create new variables
Create new variables `log10_price` and `log10_sqft_living` in the `house_prices`
data frame containing the `log10()` of `price` and `sqft_living`, respectively.
Then create histograms of both variables:
```{r}
# Code to create new variables log10_price and log10_sqft_living in house_prices:
# Code to create histogram of log10_price:
# Code to create histogram of log10_sqft_living:
```
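As a reminder of what the transformation does, here is a small worked example with illustrative round numbers (not values taken from `house_prices`): `log10()` turns multiplication by 10 into addition of 1, and raising 10 to a power undoes it.

$$
\begin{aligned}
\log_{10}(100{,}000) &= 5 \\
\log_{10}(1{,}000{,}000) &= 6 \\
10^{6} &= 1{,}000{,}000
\end{aligned}
$$

So a house ten times as expensive as another sits exactly one unit further to the right on the log10 scale.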
# Question 3: Eyeballing the relationship
Let's eyeball the relationship between the outcome and explanatory variables.
Instead of the three variables originally named:

* Outcome variable $y$: `price`, the house price
* Explanatory variables
    1. $x_1$: numerical variable `sqft_living`, the square footage of the home
    1. $x_2$: categorical variable `condition`, the condition of the house (1 = worst through 5 = best)

let's consider these three variables:

* Outcome variable $y$: `log10_price`, the log base 10 of the house price
* Explanatory variables
    1. $x_1$: numerical variable `log10_sqft_living`, the log base 10 of the square footage of the home
    1. $x_2$: categorical variable `condition`, the condition of the house (1 = worst through 5 = best)

Create a visualization that will allow you to eyeball the relationship between
all three variables indicated:
```{r}
# Code to create a visualization of the relationship:
```
Comment on this relationship using non-statistical language.
# Question 4: Quantifying the relationship
Let's now quantify the relationship between the outcome and explanatory
variables. What are the slopes and intercepts of the condition = 1, condition = 2,
and condition = 5 lines? Write your answers in the form
$\log_{10}(\text{price}) = a + b \cdot \log_{10}(\text{square footage})$,
where $a$ is the intercept and $b$ is the slope.
Recall you can use the visualization above to check your results.
```{r}
# Do something here:
```
1. Condition 1: $\log_{10}(\text{price}) = a + b \cdot \log_{10}(\text{square footage})$
1. Condition 2: $\log_{10}(\text{price}) = a + b \cdot \log_{10}(\text{square footage})$
1. Condition 5: $\log_{10}(\text{price}) = a + b \cdot \log_{10}(\text{square footage})$
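
One possible way to read these lines off a regression table, assuming you fit an interaction model in which condition = 1 is the baseline group. The symbols $b_0$, $b_1$, $d_k$, and $e_k$ are notation introduced here for illustration, not names that appear in any output:

$$
\widehat{\log_{10}(\text{price})} = \underbrace{(b_0 + d_k)}_{a} + \underbrace{(b_1 + e_k)}_{b} \cdot \log_{10}(\text{square footage})
$$

Here $b_0$ and $b_1$ are the baseline intercept and slope, $d_k$ is the intercept offset for condition $k$, and $e_k$ is the slope offset for condition $k$, with $d_1 = e_1 = 0$ for the baseline group. If you fit a different kind of model, the bookkeeping will differ.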
# Question 5: Prediction
Say you are a realtor and someone calls to ask how much their home will
sell for. They tell you that it is in condition = 5 and is 1900 square
feet in size. What do you tell them?
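
Because the model works on the log10 scale, a fitted value has to be converted back to dollars before it is reported. A sketch of the arithmetic, where $a$ and $b$ stand for whatever intercept and slope you found for the condition = 5 line in Question 4 (no numerical values are assumed for them here):

$$
\begin{aligned}
\log_{10}(1900) &\approx 3.279 \\
\widehat{\log_{10}(\text{price})} &= a + b \cdot 3.279 \\
\widehat{\text{price}} &= 10^{\,a + b \cdot 3.279}
\end{aligned}
$$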