Click here to download the source R Markdown file: multiple_regression.Rmd
. For this R Markdown file to “knit” you’ll need to ensure you’ve installed the “development” version of the moderndive
package by running:
library(devtools)
install_github("moderndive/moderndive", ref = "geom_parallel_slopes")
Let’s use the same data as in Jenny and Albert’s example project proposal.
school_name | average_sat_math | percent_economically_disadvantaged | school_size |
---|---|---|---|
Newton South High | 620 | 8.8 | large |
Quincy High | 502 | 34.7 | large |
Hopedale Jr Sr High | 535 | 9.0 | medium |
Madison Park High | 358 | 68.1 | large |
Center For Technical Education Innovation | 555 | 36.5 | large |
An interaction model is flexible: it allows for different slopes and different intercepts.
To plot an interaction model, we use geom_smooth() like we did in Chapter 6. Note that while the three lines at first glance seem parallel, they aren’t quite.
ggplot(ma_schools, aes(x = percent_economically_disadvantaged, y = average_sat_math, color = school_size)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Math SAT Score", x = "Percentage of Economically Disadvantaged Students")
So of all possible sets of three lines with different slopes and different intercepts, these three are “best fitting”: they have the smallest sum of squared residuals, as we saw in moderndive Ch 6.3.3.
Let’s get the regression table for the corresponding interaction model. Note the use of *. Also, although R’s default behavior would make large the baseline group because it comes first alphabetically, we changed the baseline group so that it is the small schools.
model_1_interaction <- lm(average_sat_math ~ percent_economically_disadvantaged * school_size, data = ma_schools)
get_regression_table(model_1_interaction)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 594.327 | 13.288 | 44.726 | 0.000 | 568.186 | 620.469 |
percent_economically_disadvantaged | -2.932 | 0.294 | -9.961 | 0.000 | -3.511 | -2.353 |
school_sizemedium | -17.764 | 15.827 | -1.122 | 0.263 | -48.899 | 13.371 |
school_sizelarge | -13.293 | 13.813 | -0.962 | 0.337 | -40.466 | 13.880 |
percent_economically_disadvantaged:school_sizemedium | 0.146 | 0.371 | 0.393 | 0.694 | -0.585 | 0.877 |
percent_economically_disadvantaged:school_sizelarge | 0.189 | 0.323 | 0.586 | 0.559 | -0.446 | 0.824 |
Notice how there are 6 rows in the regression table, reflecting the 6 elements of the equation used to obtain fitted values \(\widehat{y}\). This is a measure of this model’s complexity.
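To see how those 6 estimates combine, here is a hand computation of a fitted value. This is a minimal sketch: the estimates are copied from the regression table above, while the 30% figure is a made-up illustrative input, not a school from the data.

```r
# Estimates from the interaction model's regression table (baseline: small schools)
b0     <- 594.327   # intercept
b1     <- -2.932    # slope for percent_economically_disadvantaged
b_med  <- -17.764   # intercept offset for medium schools
b_lg   <- -13.293   # intercept offset for large schools
b1_med <- 0.146     # slope offset for medium schools
b1_lg  <- 0.189     # slope offset for large schools

# Hypothetical medium school with 30% economically disadvantaged students:
# offset BOTH the intercept and the slope before plugging in x.
x <- 30
y_hat <- (b0 + b_med) + (b1 + b1_med) * x
y_hat   # 492.983
```

Because both the intercept and the slope get a school-size offset, each of the three groups gets its own line, which is exactly the flexibility the plot above shows.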
A parallel slopes model is less flexible: it allows for different intercepts but forces a common slope.
To plot a parallel slopes model, we unfortunately need to use a function I wrote, since there isn’t one included in the ggplot2 package. This is what I had you install in Lec14.
gg_parallel_slopes(y = "average_sat_math", num_x = "percent_economically_disadvantaged",
cat_x = "school_size", data = ma_schools)
Let’s get the regression table for the corresponding parallel slopes model. Note the use of +.
model_1_parallel_slopes <- lm(average_sat_math ~ percent_economically_disadvantaged + school_size, data = ma_schools)
get_regression_table(model_1_parallel_slopes)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 588.190 | 7.607 | 77.325 | 0.000 | 573.226 | 603.154 |
percent_economically_disadvantaged | -2.777 | 0.106 | -26.120 | 0.000 | -2.986 | -2.568 |
school_sizemedium | -11.913 | 7.535 | -1.581 | 0.115 | -26.736 | 2.909 |
school_sizelarge | -6.362 | 6.923 | -0.919 | 0.359 | -19.981 | 7.258 |
Notice how there are 4 rows in the regression table, reflecting the 4 elements of the equation used to obtain fitted values \(\widehat{y}\). This is a measure of this model’s complexity. In other words, the parallel slopes model is simpler than the interaction model.
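The corresponding hand computation for the parallel slopes model makes the “common slope” constraint concrete. Again a minimal sketch: estimates are from the table above, and the 30% input is a made-up example.

```r
# Estimates from the parallel slopes model's regression table (baseline: small schools)
b0    <- 588.190
b1    <- -2.777    # ONE common slope, shared by all three school sizes
b_med <- -11.913   # intercept offset for medium schools
b_lg  <- -6.362    # intercept offset for large schools

# Fitted values for a hypothetical 30% economically disadvantaged school:
x <- 30
y_small  <- b0 + b1 * x             # 504.880
y_medium <- (b0 + b_med) + b1 * x   # 492.967
y_large  <- (b0 + b_lg) + b1 * x    # 498.518
```

Only the intercept changes with school size; the slope term b1 * x is identical in all three lines, which is why the plot shows three parallel lines.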
The slopes of the lines in the plots of the interaction model and parallel slopes model are not that different. So the additional complexity of the interaction model is IMO not warranted; go with the simpler parallel slopes model.
Let’s use the same data as moderndive Chapter 7.2.
score | age | gender |
---|---|---|
4.9 | 50 | male |
4.8 | 34 | female |
3.7 | 50 | male |
4.7 | 52 | male |
3.5 | 34 | male |
An interaction model is flexible: it allows for different slopes and different intercepts.
To plot an interaction model, we use geom_smooth() like we did in Chapter 6. Note that the two lines are nowhere near parallel.
ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
geom_point() +
labs(x = "Age", y = "Teaching Score", color = "Gender") +
geom_smooth(method = "lm", se = FALSE)
So of all possible pairs of lines with different slopes and different intercepts, these two are “best fitting”: they have the smallest sum of squared residuals, as we saw in moderndive Ch 6.3.3.
Let’s get the regression table for the corresponding interaction model. Note the use of *.
model_2_interaction <- lm(score ~ age * gender, data = evals_ch7)
get_regression_table(model_2_interaction)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 4.883 | 0.205 | 23.795 | 0.000 | 4.480 | 5.286 |
age | -0.018 | 0.004 | -3.919 | 0.000 | -0.026 | -0.009 |
gendermale | -0.446 | 0.265 | -1.681 | 0.094 | -0.968 | 0.076 |
age:gendermale | 0.014 | 0.006 | 2.446 | 0.015 | 0.003 | 0.024 |
Notice how there are 4 rows in the regression table, reflecting the 4 elements of the equation used to obtain fitted values \(\widehat{y}\). This is a measure of this model’s complexity.
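Here is how the 4 estimates above translate into the equations of the two lines in the plot, written out by hand (a minimal sketch; estimates copied from the regression table):

```r
# Estimates from the interaction model's regression table (baseline: female)
b0      <- 4.883    # intercept
b_age   <- -0.018   # slope for age
b_male  <- -0.446   # intercept offset for male instructors
b_inter <- 0.014    # slope offset for male instructors (the age:gendermale row)

# Line for female instructors: score-hat = 4.883 - 0.018 * age
intercept_female <- b0
slope_female     <- b_age

# Line for male instructors: apply BOTH offsets
intercept_male <- b0 + b_male       # 4.437
slope_male     <- b_age + b_inter   # -0.004
```

Note how different the two slopes are: female instructors’ scores decline with age at -0.018 per year, while male instructors’ scores decline at only -0.004 per year. This is why the two lines are nowhere near parallel.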
A parallel slopes model is less flexible: it allows for different intercepts but forces a common slope.
To plot a parallel slopes model, we unfortunately need to use a function I wrote, since there isn’t one included in the ggplot2 package. This is what I had you install in Lec14.
gg_parallel_slopes(y = "score", num_x = "age", cat_x = "gender",
data = evals_ch7)
Let’s get the regression table for the corresponding parallel slopes model. Note the use of +.
model_2_parallel_slopes <- lm(score ~ age + gender, data = evals_ch7)
get_regression_table(model_2_parallel_slopes)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 4.484 | 0.125 | 35.792 | 0.000 | 4.238 | 4.730 |
age | -0.009 | 0.003 | -3.280 | 0.001 | -0.014 | -0.003 |
gendermale | 0.191 | 0.052 | 3.632 | 0.000 | 0.087 | 0.294 |
Notice how there are 3 rows in the regression table, reflecting the 3 elements of the equation used to obtain fitted values \(\widehat{y}\). This is a measure of this model’s complexity. In other words, the parallel slopes model is simpler than the interaction model.
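And the corresponding hand computation for this parallel slopes model, with estimates from the table above (the age of 40 is a made-up illustrative input):

```r
# Estimates from the parallel slopes model's regression table (baseline: female)
b0     <- 4.484
b_age  <- -0.009   # ONE common slope for both genders
b_male <- 0.191    # intercept offset for male instructors

# Fitted scores for a hypothetical 40-year-old instructor of each gender:
age <- 40
y_female <- b0 + b_age * age            # 4.124
y_male   <- (b0 + b_male) + b_age * age # 4.315

y_male - y_female   # always 0.191, no matter the age
```

Because the slope is shared, the vertical gap between the two lines is the gendermale offset of 0.191 at every age, whereas in the interaction model the gap shrinks as age increases.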
The slopes of the lines in the plots of the interaction model and parallel slopes model are somewhat different. So the additional complexity of the interaction model is IMO warranted; go with the more complex interaction model.
Recall our Massachusetts high school math SAT example above, where we stated that since the “more flexible” interaction model yielded three lines with near identical slopes, the additional complexity of the interaction model may not be warranted, and thus we were inclined to favor the simpler parallel slopes model:
However, upon closer inspection of the parallel slopes model plot above, it appears furthermore that the three lines have near identical intercepts as well! In other words, the three lines are almost entirely identical! So taking our “all other things being equal, simpler models are to be preferred” logic one step further, it can be argued we shouldn’t even have three separate lines, rather just one!
ggplot(ma_schools, aes(x = percent_economically_disadvantaged, y = average_sat_math)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Math SAT Score", x = "Percentage of Economically Disadvantaged Students")
In other words, perhaps we should not even be using a multiple regression model that includes both:
percent_economically_disadvantaged
school_size
but rather perhaps we should be using an even simpler basic linear regression model using only:
percent_economically_disadvantaged
model_1_simple <- lm(average_sat_math ~ percent_economically_disadvantaged, data = ma_schools)
get_regression_table(model_1_simple)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 581.281 | 3.267 | 177.936 | 0 | 574.855 | 587.707 |
percent_economically_disadvantaged | -2.780 | 0.101 | -27.499 | 0 | -2.979 | -2.581 |
Compare this to the regression table for our parallel slopes model:
model_1_parallel_slopes <- lm(average_sat_math ~ percent_economically_disadvantaged + school_size, data = ma_schools)
get_regression_table(model_1_parallel_slopes)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 588.190 | 7.607 | 77.325 | 0.000 | 573.226 | 603.154 |
percent_economically_disadvantaged | -2.777 | 0.106 | -26.120 | 0.000 | -2.986 | -2.568 |
school_sizemedium | -11.913 | 7.535 | -1.581 | 0.115 | -26.736 | 2.909 |
school_sizelarge | -6.362 | 6.923 | -0.919 | 0.359 | -19.981 | 7.258 |
Expressed differently, there is no need to distinguish between small, medium, and large schools: the relationship between math SAT scores and the percentage of the student body that is economically disadvantaged is essentially the same for all three. This relationship is a steeply negative one: as high schools have a higher proportion of economically disadvantaged students, there is an associated decrease in their average math SAT scores. Sad, but true.
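To make the comparison concrete, here is a hand computation of the fitted values both models give for the same hypothetical school (a minimal sketch; the estimates are from the two regression tables above and the 30% input is a made-up example):

```r
x <- 30  # hypothetical school: 30% economically disadvantaged students

# Simple regression model: one line for all schools
y_simple <- 581.281 - 2.780 * x   # 497.881

# Parallel slopes model: three lines, one per school size
y_small  <- 588.190 - 2.777 * x             # 504.880
y_medium <- 588.190 - 11.913 - 2.777 * x    # 492.967
y_large  <- 588.190 - 6.362 - 2.777 * x     # 498.518
```

All four fitted values land within about 12 SAT points of each other, which is small compared to the roughly 83-point drop the slope alone produces over those 30 percentage points. That is the numerical version of the argument for collapsing the three nearly identical lines into one.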