Click here to download the source R Markdown file: multiple_regression.Rmd. For this R Markdown file to “knit” you’ll need to ensure you’ve installed the “development” version of the moderndive package by running:

# Install the development version of moderndive from GitHub
library(devtools)
install_github("moderndive/moderndive", ref = "geom_parallel_slopes")

Data set 1

Let’s use the same data as in Jenny and Albert’s example project proposal.

school_name                                average_sat_math  percent_economically_disadvantaged  school_size
Newton South High                                       620                                 8.8  large
Quincy High                                             502                                34.7  large
Hopedale Jr Sr High                                     535                                 9.0  medium
Madison Park High                                       358                                68.1  large
Center For Technical Education Innovation               555                                36.5  large

Interaction model

An interaction model is flexible: it allows for different slopes and different intercepts.

To plot an interaction model, we use geom_smooth() like we did in Chapter 6. Note that while the three lines at first glance seem parallel, they aren’t quite.

ggplot(ma_schools, aes(x = percent_economically_disadvantaged, y = average_sat_math, color = school_size)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Math SAT Score", x = "Percentage of Economically Disadvantaged Students")

So of all possible sets of three lines with different slopes and different intercepts, these three are “best fitting”: they have the smallest sum of squared residuals, as we saw in moderndive Ch 6.3.3.
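
In other words, “best fitting” means these lines minimize the sum of squared residuals

\[\sum_{i=1}^{n} \left(y_i - \widehat{y}_i\right)^2\]

where \(y_i\) is the observed math SAT score for school \(i\) and \(\widehat{y}_i\) is its fitted value.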

Let’s get the regression table for the corresponding interaction model. Note the use of *. Also note that although R’s default behavior would make large the baseline group (it comes first alphabetically), we changed the baseline group to be the small schools.
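
The releveling step isn’t shown in this file; here is a minimal sketch of one way to do it using base R’s factor(), assuming school_size is stored as a character or factor column:

# Reorder the factor levels so "small" comes first and hence becomes the baseline group
ma_schools$school_size <- factor(ma_schools$school_size,
                                 levels = c("small", "medium", "large"))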

model_1_interaction <- lm(average_sat_math ~ percent_economically_disadvantaged * school_size, data = ma_schools)
get_regression_table(model_1_interaction)
term estimate std_error statistic p_value lower_ci upper_ci
intercept 594.327 13.288 44.726 0.000 568.186 620.469
percent_economically_disadvantaged -2.932 0.294 -9.961 0.000 -3.511 -2.353
school_sizemedium -17.764 15.827 -1.122 0.263 -48.899 13.371
school_sizelarge -13.293 13.813 -0.962 0.337 -40.466 13.880
percent_economically_disadvantaged:school_sizemedium 0.146 0.371 0.393 0.694 -0.585 0.877
percent_economically_disadvantaged:school_sizelarge 0.189 0.323 0.586 0.559 -0.446 0.824

Notice how there are 6 rows in the regression table, reflecting the 6 elements of the equation used to obtain fitted values \(\widehat{y}\). This is a measure of this model’s complexity.
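
Written out using the estimates in the table, with \(x\) denoting percent_economically_disadvantaged, the equation for fitted values is

\[\widehat{y} = 594.327 - 2.932\,x - 17.764 \cdot \mathbb{1}_{\text{med}} - 13.293 \cdot \mathbb{1}_{\text{lg}} + 0.146\,x \cdot \mathbb{1}_{\text{med}} + 0.189\,x \cdot \mathbb{1}_{\text{lg}}\]

where \(\mathbb{1}_{\text{med}}\) and \(\mathbb{1}_{\text{lg}}\) are indicator variables equaling 1 for medium and large schools respectively, and 0 otherwise.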

Parallel slopes model

A parallel slopes model is less flexible: it allows for different intercepts but forces a common slope.

To plot a parallel slopes model, we unfortunately need to use a function I wrote, gg_parallel_slopes(), as there isn’t one included in the ggplot2 package. This is what I had you install in Lec14.

gg_parallel_slopes(y = "average_sat_math", num_x = "percent_economically_disadvantaged", 
                   cat_x = "school_size", data = ma_schools)

Let’s get the regression table for the corresponding parallel slopes model. Note the use of +.

model_1_parallel_slopes <- lm(average_sat_math ~ percent_economically_disadvantaged + school_size, data = ma_schools)
get_regression_table(model_1_parallel_slopes)
term estimate std_error statistic p_value lower_ci upper_ci
intercept 588.190 7.607 77.325 0.000 573.226 603.154
percent_economically_disadvantaged -2.777 0.106 -26.120 0.000 -2.986 -2.568
school_sizemedium -11.913 7.535 -1.581 0.115 -26.736 2.909
school_sizelarge -6.362 6.923 -0.919 0.359 -19.981 7.258

Notice how there are 4 rows in the regression table, reflecting the 4 elements of the equation used to obtain fitted values \(\widehat{y}\). This is a measure of this model’s complexity. In other words, the parallel slopes model is simpler than the interaction model.
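
Written out with the same notation as before, the equation for fitted values is

\[\widehat{y} = 588.190 - 2.777\,x - 11.913 \cdot \mathbb{1}_{\text{med}} - 6.362 \cdot \mathbb{1}_{\text{lg}}\]

Note the single common slope \(-2.777\) on \(x\) shared by all three school sizes.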

Conclusion

The slopes of the lines in the plots of the interaction model and parallel slopes model are not that different. So the additional complexity of the interaction model is IMO not warranted; go with the simpler parallel slopes model.


Data set 2

Let’s use the same data as in moderndive Chapter 7.2.

score  age  gender
  4.9   50  male
  4.8   34  female
  3.7   50  male
  4.7   52  male
  3.5   34  male

Interaction model

An interaction model is flexible: it allows for different slopes and different intercepts.

To plot an interaction model, we use geom_smooth() like we did in Chapter 6. Note that the two lines are nowhere near parallel.

ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
  geom_point() +
  labs(x = "Age", y = "Teaching Score", color = "Gender") +
  geom_smooth(method = "lm", se = FALSE)

So of all possible pairs of lines with different slopes and different intercepts, these two are “best fitting”: they have the smallest sum of squared residuals, as we saw in moderndive Ch 6.3.3.

Let’s get the regression table for the corresponding interaction model. Note the use of *.

model_2_interaction <- lm(score ~ age * gender, data = evals_ch7)
get_regression_table(model_2_interaction)
term estimate std_error statistic p_value lower_ci upper_ci
intercept 4.883 0.205 23.795 0.000 4.480 5.286
age -0.018 0.004 -3.919 0.000 -0.026 -0.009
gendermale -0.446 0.265 -1.681 0.094 -0.968 0.076
age:gendermale 0.014 0.006 2.446 0.015 0.003 0.024

Notice how there are 4 rows in the regression table, reflecting the 4 elements of the equation used to obtain fitted values \(\widehat{y}\). This is a measure of this model’s complexity.
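
Written out using the estimates in the table, the equation for fitted values is

\[\widehat{y} = 4.883 - 0.018 \cdot \text{age} - 0.446 \cdot \mathbb{1}_{\text{male}} + 0.014 \cdot \text{age} \cdot \mathbb{1}_{\text{male}}\]

where \(\mathbb{1}_{\text{male}}\) is an indicator variable equaling 1 for male instructors and 0 otherwise.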

Parallel slopes model

A parallel slopes model is less flexible: it allows for different intercepts but forces a common slope.

To plot a parallel slopes model, we unfortunately need to use a function I wrote, gg_parallel_slopes(), as there isn’t one included in the ggplot2 package. This is what I had you install in Lec14.

gg_parallel_slopes(y = "score", num_x = "age", cat_x = "gender",
                    data = evals_ch7)

Let’s get the regression table for the corresponding parallel slopes model. Note the use of +.

model_2_parallel_slopes <- lm(score ~ age + gender, data = evals_ch7)
get_regression_table(model_2_parallel_slopes)
term estimate std_error statistic p_value lower_ci upper_ci
intercept 4.484 0.125 35.792 0.000 4.238 4.730
age -0.009 0.003 -3.280 0.001 -0.014 -0.003
gendermale 0.191 0.052 3.632 0.000 0.087 0.294

Notice how there are 3 rows in the regression table, reflecting the 3 elements of the equation used to obtain fitted values \(\widehat{y}\). This is a measure of this model’s complexity. In other words, the parallel slopes model is simpler than the interaction model.
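
Written out with the same notation, the equation for fitted values is

\[\widehat{y} = 4.484 - 0.009 \cdot \text{age} + 0.191 \cdot \mathbb{1}_{\text{male}}\]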

Conclusion

The slopes of the lines in the plots of the interaction model and parallel slopes model are somewhat different. So the additional complexity of the interaction model is IMO warranted; go with the more complex interaction model.


Data set 1 revisited

Recall our Massachusetts high school math SAT example from above. We stated that since the “more flexible” interaction model yielded three lines with near-identical slopes, the additional complexity of the interaction model may not be warranted, and thus we were inclined to favor the simpler parallel slopes model.

However, upon closer inspection of the parallel slopes model plot above, it appears that the three lines have near-identical intercepts as well! In other words, the three lines are almost entirely identical! So taking our “all other things being equal, simpler models are to be preferred” logic one step further, it can be argued we shouldn’t even have three separate lines, but rather just one:

ggplot(ma_schools, aes(x = percent_economically_disadvantaged, y = average_sat_math)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Math SAT Score", x = "Percentage of Economically Disadvantaged Students")

In other words, perhaps we should not even be using a multiple regression model that includes both:

  • The numerical explanatory variable percent_economically_disadvantaged
  • The categorical explanatory variable school_size

but rather an even simpler basic linear regression model using only:

  • The numerical explanatory variable percent_economically_disadvantaged

model_1_simple <- lm(average_sat_math ~ percent_economically_disadvantaged, data = ma_schools)
get_regression_table(model_1_simple)
term estimate std_error statistic p_value lower_ci upper_ci
intercept 581.281 3.267 177.936 0 574.855 587.707
percent_economically_disadvantaged -2.780 0.101 -27.499 0 -2.979 -2.581
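
Written out, the fitted equation for this basic regression model is

\[\widehat{y} = 581.281 - 2.780\,x\]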

Compare this to the regression table for our parallel slopes model. In particular, note that the two estimated slopes for percent_economically_disadvantaged are nearly identical: -2.780 versus -2.777.

model_1_parallel_slopes <- lm(average_sat_math ~ percent_economically_disadvantaged + school_size, data = ma_schools)
get_regression_table(model_1_parallel_slopes)
term estimate std_error statistic p_value lower_ci upper_ci
intercept 588.190 7.607 77.325 0.000 573.226 603.154
percent_economically_disadvantaged -2.777 0.106 -26.120 0.000 -2.986 -2.568
school_sizemedium -11.913 7.535 -1.581 0.115 -26.736 2.909
school_sizelarge -6.362 6.923 -0.919 0.359 -19.981 7.258

Expressed differently, there is no need to distinguish between small, medium, and large schools; the relationship between math SAT scores and the percentage of the student body that is economically disadvantaged is essentially the same for all three. This relationship is a steeply negative one: as a high school’s proportion of economically disadvantaged students increases, there is an associated decrease in its average math SAT score. Sad, but true.