```{r, message=FALSE, warning=FALSE, echo=FALSE}
require(knitr)
opts_chunk$set(eval=FALSE)
```
This lab on Linear Regression in R comes from p. 109-119 of "Introduction to
Statistical Learning with Applications in R" by Gareth James, Daniela Witten,
Trevor Hastie and Robert Tibshirani. It was
[re-implemented](https://github.com/SmithCollege-SDS/tidy-islr) in Fall 2016 in
by Amelia McNamara and R. Jordan Crouser at Smith College, and adapted by Albert
Y. Kim.
# 3.6.1 Libraries
The `library()` function is used to load libraries, or groups of functions and
data sets that are not included in the base R distribution. Basic functions that
perform least squares linear regression and other simple analyses come standard
with the base distribution, but more exotic functions require additional
libraries. Here we load the following packages
* `MASS`: a very large collection of data sets and functions.
* `ISLR`: includes the data sets associated with this book.
* `broom`: takes the messy output of built-in functions in R, such as `lm`,
`nls`, or `t.test`, and turns them into tidy data frames.
* `car` packages, which provide very niche funtionality.
```{r}
library(tidyverse)
library(MASS)
library(ISLR)
library(broom)
```
# 3.6.2 Simple Linear Regression
The `MASS` library contains the `Boston` data set, which records house prices
for 506 neighborhoods around Boston, which we seek to predict using 13
predictors such as average number of rooms per house, average age of houses,
etc.
Let's start with a simple predictive model:
* $Y$ is `medv`: median house value
* $X$ is `lstat`: percent of households with low socioeconomic status
* Fit model $Y = f(\vec{X}) + \epsilon$ via simple linear regression $Y = \beta_0 + \beta_1 X + \epsilon$
## Exploratory Data Analysis
Before conducting any model fitting, you should always conduct an **exploratory data analysis** first:
* Read any documentation
* Look at your raw data
* Visualize your data
```{r}
# Help file:
?Boston
# RStudio Viewer:
View(Boston)
# Quick summary:
glimpse(Boston)
```
We will now plot `medv` and `lstat` along with the least squares regression line
using `ggplot()` and `geom_smooth()`:
```{r}
ggplot(Boston, aes(x=lstat, y=medv)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x="% of households with low SES", y="Median House Value", title="Boston Neighborhood House Prices")
```
No surprises here. We can remove the Standard Error bar from the regression line
by setting `geom_smooth(method = "lm", se=FALSE)`
## Fitting a Simple Linear Model
Simple linear regression is a case of a class of models known as *linear
models*, since the model is based on a linear equation. We can fit linear models
using the `lm()` function and the `~` formula function (just below ESC key). In
our simple linear regression case:
```{r}
model_SL <- lm(medv~lstat, data=Boston)
```
Note that
* the outcome variable `medv` comes first
* the output is saved in an object of class `lm` called `model_SL`
## Exploring a Simple Linear Model
Base R provides tools for exploring the output of a regression. Run these
one-by-one:
```{r}
# Basic summary:
model_SL
# Detailed summary:
summary(model_SL)
# All the different variables in an lm object. For example: model_SL$residuals
names(model_SL)
# Just the coefficients:
coef(model_SL)
# Residuals:
residuals(model_SL)
```
These outputs are however fragmented, disorganized, and hard to access. The
`broom` package allows for tidy output of many R objects classes (see a [current
list](https://github.com/tidyverse/broom#available-tidiers) of supported
classes). Test drive time:
```{r}
broom::tidy(model_SL)
broom::glance(model_SL)
broom::augment(model_SL)
```
In increasing order of usefulness in this class:
1. `tidy()` gives the standard regression table output in *tidy format*
1. `glance()` gives summary values of a regression in *tidy format*
1. **Most Useful for Us**: `augment()` augments the original input data (in our case `mdev` & `lstat`) with a table of regression outputs corresponding to each observation in *tidy format*. Note that each of the new columns begins with a `.` to avoid overwriting any of the original columns. In particular, we'll use:
+ `.fitted`: the fitted/predicted value $\widehat{y}_i$
+ `.resid`: the error term i.e. residual $y_i - \widehat{y}_i$
We see from the `tidy()` output that our fitted model $\widehat{y} = \widehat{\beta}_0 + \widehat{\beta}_1x$ is
$$
\widehat{medv} = 34.55 - 0.95 \times lstat
$$
## Predicting Using a Simple Linear Model
The `predict()` function can be used to quickly apply the fit of a model to a
`newdata` set instead of computing $\widehat{y}$ by hand. First, we'll make a
data frame with some new values for `lstat`. The variable namee has to match
the name of the variable used in the `lm()` call above:
```{r}
new_values <- data_frame(lstat=c(5,10,15))
```
Now, we'll call the `predict()` function from base R to see what our model predicts for the corresponding `medv` value:
```{r}
predict(model_SL, newdata = new_values)
```
Note, this is not in tidy format, but rather a vector of values. We can also do this
using `augment`:
```{r}
broom::augment(model_SL, newdata = new_values)
```
Now that we are familiar with the `broom::tidy()`, `broom::glance()`, and
``broom::augment()`` functions, I'll no longer preface them with the `broom::`.
## Exercises
1. Compute the Mean-Squared Error for the model fit in `model_SL` using the `Boston` data.
1. Compute the Mean-Squared Error for the model fit in `model_SL` using the `new_values` data.
1. Which of the two data sets would be the training data if this exercise were viewed through the lens of a Kaggle competition? Which would be the test data?
# 3.6.3 Multiple Linear Regression
In order to fit a multiple linear regression model using least squares, we again
use the `lm()` function. The syntax `lm(y∼x1+x2+x3)` is used to fit a model with
three predictors, `x1`, `x2`, and `x3`. The `summary()` function now outputs the
regression coefficients for all the predictors.
```{r}
model_ML <- lm(medv ~ lstat + age, data=Boston)
tidy(model_ML)
```
The Boston data set contains 13 variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:
```{r}
model_ML_all <- lm(medv~., data=Boston)
tidy(model_ML_all)
```
What if we would like to perform a regression using all of the variables but
one? For example, say we want to exclude `age`. The following syntax results in
a regression using all predictors except `age`:
```{r}
model_ML_no_age = lm(medv~.-age, data=Boston)
tidy(model_ML_no_age)
```
# 3.6.4 Interaction Terms
It is easy to include interaction terms in a linear model using the `lm()`
function. The syntax `lstat:black` tells R to include an interaction term
between `lstat` and `black`. The syntax `lstat*age` simultaneously includes
`lstat`, `age`, and the interaction term `lstat×age` as predictors; it is a
shorthand for `lstat+age+lstat:age`.
```{r}
model_interaction <- lm(medv~lstat*age, data=Boston)
tidy(model_interaction)
```
# 3.6.5 Non-linear Transformations of the Predictors
The `lm()` function can also accommodate non-linear transformations of the
predictors. For instance, given a predictor `X`, we can create a predictor `X2`
using `I(X^2)`. The function `I()` is needed since the ^ has a special meaning
in a formula; wrapping as we do allows the standard usage in R, which is to
raise `X` to the power 2. We now perform a regression of `medv` onto `lstat` and
`lstat2`.
```{r}
model_ML_quadratic <- lm(medv~lstat+I(lstat^2), data=Boston)
tidy(model_ML_quadratic)
```
In order to create a cubic fit, we can include a predictor of the form `I(X^3)`.
However, this approach can start to get cumbersome for higher order polynomials.
A better approach involves using the `poly()` function to create the polynomial
within `lm()`. For example, the following command produces a fifth-order
polynomial fit:
```{r}
model_ML_5th_order_poly = lm(medv~poly(lstat, 5, raw=TRUE), data=Boston)
tidy(model_ML_5th_order_poly)
```
Of course, we are in no way restricted to using polynomial transformations of
the predictors. Here we try a log transformation:
```{r}
model_log <- lm(medv~log(rm), data=Boston)
tidy(model_log)
```
# 3.6.6 Qualitative Predictors
We will now examine the `Carseats` data, which is part of the `ISLR` library. We
will attempt to predict `Sales` (child car seat sales) in 400 locations based on
a number of predictors.
```{r}
?Carseats
View(Carseats)
glimpse(Carseats)
```
The `Carseats` data includes qualitative predictors such as `Shelveloc`, an
indicator of the quality of the shelving location—that is, the space within a
store in which the car seat is displayed—at each location. The predictor
`Shelveloc` takes on three possible values, `Bad`, `Medium`, and `Good`.
Given a qualitative variable such as `Shelveloc`, R generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.
```{r}
lm_fit <- lm(Sales ~ . + Income:Advertising + Price:Age, data=Carseats)
tidy(lm_fit)
```