```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width = 16/2, fig.height = 9/2)
library(readr)
library(dplyr)
library(ggplot2)
library(janitor)
library(moderndive)

# Set the seed value of the random number generator to get "replicable" random
# numbers. The choice of seed value of 76 was an arbitrary one on my part.
set.seed(76)
```

Click here to download the source R Markdown file: `multiple_regression.Rmd`.

For this R Markdown file to "knit" you'll need to ensure you've installed the "development" version of the `moderndive` package by running:

```{r, eval = FALSE}
library(devtools)
install_github("moderndive/moderndive", ref = "geom_parallel_slopes")
```

# Data set 1

Let's use the same data as in Jenny and Albert's example [project proposal](https://rudeboybert.github.io/SDS220/static/term_project/project_proposal_example.html){target="_blank"}.

```{r, warning=FALSE, message=FALSE, echo = FALSE}
# This does all the data wrangling in a single pipe chain:
ma_schools <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSrWSNyNqRVA950sdYa1QazAT-l0T7dl6pE5Ewvt7LkSm9LXmeVNbCbqEcrbygFmFyK4B6VQQGebuk9/pub?gid=1469057204&single=true&output=csv") %>%
  # Clean variable names:
  clean_names() %>%
  # Convert total enrollment to a categorical variable school_size:
  mutate(
    school_size = cut_number(total_enrollment, n = 3),
    school_size = recode_factor(school_size,
                                "[0,341]" = "small",
                                "(341,541]" = "medium",
                                "(541,4.26e+03]" = "large")
  ) %>%
  # Drop schools with no 11th and 12th grade students:
  filter(x11_enrollment > 0 & x12_enrollment > 0) %>%
  # Drop rows with missing values for math SAT:
  filter(!is.na(average_sat_math)) %>%
  # Select only the variables we need:
  select(school_name, average_sat_math, percent_economically_disadvantaged, school_size)

ma_schools %>%
  sample_n(5)
```

## Interaction model

An interaction model is flexible: it allows for both different slopes and different intercepts. To plot an interaction model, we use `geom_smooth()` like we did in Chapter 6. Note that while the three lines seem parallel at first glance, they aren't quite.

```{r}
ggplot(ma_schools, aes(x = percent_economically_disadvantaged, y = average_sat_math, color = school_size)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Math SAT Score", x = "Percentage of Economically Disadvantaged Students")
```

So of all possible sets of three lines with different slopes and different intercepts, these three are "best fitting" in that they have the smallest sum of squared residuals, as we saw in ModernDive Chapter 6.3.3.

Let's get the regression table for the corresponding interaction model. Note the use of `*`. Also note that although R's default behavior would make `large` the baseline group because it comes first alphabetically, we changed the baseline group so that it is the `small` schools.

```{r}
model_1_interaction <- lm(average_sat_math ~ percent_economically_disadvantaged * school_size, data = ma_schools)
get_regression_table(model_1_interaction)
```

Notice how there are 6 rows in the regression table, reflecting the 6 elements of the equation used to obtain fitted values $\widehat{y}$. This is a measure of this model's complexity.
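To make those 6 elements concrete, here is one way to write out the interaction model's equation for fitted values, where "disadv" is shorthand for `percent_economically_disadvantaged` and $\mathbb{1}_{\text{medium}}$ and $\mathbb{1}_{\text{large}}$ are indicator variables equal to 1 for medium and large schools respectively (and 0 otherwise, with `small` schools as the baseline):

$$
\begin{aligned}
\widehat{y} = b_0 &+ b_1 \cdot \text{disadv} + b_2 \cdot \mathbb{1}_{\text{medium}} + b_3 \cdot \mathbb{1}_{\text{large}} \\
&+ b_4 \cdot \text{disadv} \cdot \mathbb{1}_{\text{medium}} + b_5 \cdot \text{disadv} \cdot \mathbb{1}_{\text{large}}
\end{aligned}
$$

Each $b$ corresponds to one row of the regression table; the two interaction coefficients $b_4$ and $b_5$ are what allow medium and large schools to have slopes that differ from the small-school slope $b_1$.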
## Parallel slopes model

A parallel slopes model is less flexible: it allows for different intercepts but forces a common slope. To plot a parallel slopes model, we unfortunately need to use a function written by me, as there isn't one included in the `ggplot2` package. This is what I had you install in Lec14.

```{r}
gg_parallel_slopes(y = "average_sat_math", num_x = "percent_economically_disadvantaged",
                   cat_x = "school_size", data = ma_schools)
```

Let's get the regression table for the corresponding parallel slopes model. Note the use of `+`.

```{r}
model_1_parallel_slopes <- lm(average_sat_math ~ percent_economically_disadvantaged + school_size, data = ma_schools)
get_regression_table(model_1_parallel_slopes)
```

Notice how there are 4 rows in the regression table, reflecting the 4 elements of the equation used to obtain fitted values $\widehat{y}$. This is a measure of this model's complexity. In other words, the parallel slopes model is **simpler** than the interaction model.
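Written out the same way as before, the parallel slopes model drops the two interaction terms, leaving offsets to the intercept for medium and large schools but a single common slope $b_1$:

$$
\widehat{y} = b_0 + b_1 \cdot \text{disadv} + b_2 \cdot \mathbb{1}_{\text{medium}} + b_3 \cdot \mathbb{1}_{\text{large}}
$$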
## Conclusion

The slopes of the lines in the plots of the interaction model and the parallel slopes model are not that different. So the additional complexity of the interaction model **is IMO not warranted**; go with the simpler parallel slopes model.

***

# Data set 2

Let's use the same data as ModernDive Chapter 7.2.

```{r, warning=FALSE, message=FALSE, echo = FALSE}
evals_ch7 <- evals %>%
  select(score, age, gender)
evals_ch7 %>%
  sample_n(5)
```

## Interaction model

An interaction model is flexible: it allows for both different slopes and different intercepts. To plot an interaction model, we use `geom_smooth()` like we did in Chapter 6. Note that the two lines are nowhere near parallel.

```{r}
ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
  geom_point() +
  labs(x = "Age", y = "Teaching Score", color = "Gender") +
  geom_smooth(method = "lm", se = FALSE)
```

So of all possible pairs of lines with different slopes and different intercepts, these two are "best fitting" in that they have the smallest sum of squared residuals, as we saw in ModernDive Chapter 6.3.3.

Let's get the regression table for the corresponding interaction model. Note the use of `*`.

```{r}
model_2_interaction <- lm(score ~ age * gender, data = evals_ch7)
get_regression_table(model_2_interaction)
```

Notice how there are 4 rows in the regression table, reflecting the 4 elements of the equation used to obtain fitted values $\widehat{y}$: an intercept, a slope for `age`, an offset to the intercept for one gender, and an offset to the slope (the interaction term). This is a measure of this model's complexity.

## Parallel slopes model

A parallel slopes model is less flexible: it allows for different intercepts but forces a common slope. To plot a parallel slopes model, we unfortunately need to use a function written by me, as there isn't one included in the `ggplot2` package. This is what I had you install in Lec14.

```{r}
gg_parallel_slopes(y = "score", num_x = "age", cat_x = "gender", data = evals_ch7)
```

Let's get the regression table for the corresponding parallel slopes model. Note the use of `+`.

```{r}
model_2_parallel_slopes <- lm(score ~ age + gender, data = evals_ch7)
get_regression_table(model_2_parallel_slopes)
```

Notice how there are 3 rows in the regression table, reflecting the 3 elements of the equation used to obtain fitted values $\widehat{y}$. This is a measure of this model's complexity. In other words, the parallel slopes model is **simpler** than the interaction model.

## Conclusion

The slopes of the lines in the plots of the interaction model and the parallel slopes model are somewhat different. So the additional complexity of the interaction model **is IMO warranted**; go with the more complex interaction model.

***

# Data set 1 revisited

Recall our Massachusetts high school math SAT example from above. We stated that since the "more flexible" interaction model yielded three lines with near-identical slopes, the additional complexity of the interaction model may not be warranted, and thus we are inclined to favor the simpler parallel slopes model:

```{r, echo = FALSE}
gg_parallel_slopes(y = "average_sat_math", num_x = "percent_economically_disadvantaged",
                   cat_x = "school_size", data = ma_schools)
```

However, upon closer inspection of the parallel slopes model plot above, it furthermore appears that the three lines have near-identical intercepts as well! In other words, the three lines are almost entirely identical! So taking our "all other things being equal, simpler models are to be preferred" logic one step further, it can be argued that we shouldn't even have three separate lines, but rather just one!

```{r}
ggplot(ma_schools, aes(x = percent_economically_disadvantaged, y = average_sat_math)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Math SAT Score", x = "Percentage of Economically Disadvantaged Students")
```

In other words, **perhaps we should not even be using a multiple regression** that includes both:

* The numerical explanatory variable `percent_economically_disadvantaged`
* The categorical explanatory variable `school_size`

but rather **perhaps we should be using an even simpler basic linear regression model** using only:

* The numerical explanatory variable `percent_economically_disadvantaged`

```{r}
model_1_simple <- lm(average_sat_math ~ percent_economically_disadvantaged, data = ma_schools)
get_regression_table(model_1_simple)
```

Compare this to the regression table for our parallel slopes model:

```{r}
model_1_parallel_slopes <- lm(average_sat_math ~ percent_economically_disadvantaged + school_size, data = ma_schools)
get_regression_table(model_1_parallel_slopes)
```

Expressed differently, there is no need to distinguish between small, medium, and large schools; the relationship between math SAT scores and the percentage of the student body that is economically disadvantaged is the same for all three. This relationship is a steeply negative one: as the proportion of a high school's student body that is economically disadvantaged increases, there is an associated decrease in its average math SAT score. Sad, but true.
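If you want to put a rough number on this visual argument, one option is to compare the two models' summary statistics using `get_regression_summaries()` from the `moderndive` package:

```{r}
# Compare overall model summaries. If school_size contributes little beyond
# percent_economically_disadvantaged, the r_squared values should be nearly
# identical.
get_regression_summaries(model_1_simple)
get_regression_summaries(model_1_parallel_slopes)
```

If the `r_squared` values come out nearly identical, that is further numerical support for preferring the simpler model.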