Click here to download the source R Markdown file: multiple_regression.Rmd
. For this R Markdown file to “knit” you’ll need to ensure you’ve installed the “development” version of the moderndive
package by running:
library(devtools)
install_github("moderndive/moderndive", ref = "geom_parallel_slopes")
Let’s use the same data as in Jenny and Albert’s example project proposal.
school_name | average_sat_math | percent_economically_disadvantaged | school_size |
---|---|---|---|
Newton South High | 620 | 8.8 | large |
Quincy High | 502 | 34.7 | large |
Hopedale Jr Sr High | 535 | 9.0 | medium |
Madison Park High | 358 | 68.1 | large |
Center For Technical Education Innovation | 555 | 36.5 | large |
An interaction model is flexible: it allows for different slopes and different intercepts.
To plot an interaction model, we use geom_smooth() like we did in Chapter 6. Note that while the three lines at first glance seem parallel, they aren’t quite.
ggplot(ma_schools, aes(x = percent_economically_disadvantaged, y = average_sat_math, color = school_size)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Math SAT Score", x = "Percentage of Economically Disadvantaged Students")
So of all possible sets of three lines with different slopes and different intercepts, these three are “best fitting”: they have the smallest sum of squared residuals, as we saw in moderndive Ch 6.3.3.
Let’s get the regression table for the corresponding interaction model. Note the use of *. Also, although R’s default behavior would make large the baseline group because it comes first alphabetically, we changed the baseline group so that it is the small schools.
model_1_interaction <- lm(average_sat_math ~ percent_economically_disadvantaged * school_size, data = ma_schools)
get_regression_table(model_1_interaction)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 594.327 | 13.288 | 44.726 | 0.000 | 568.186 | 620.469 |
percent_economically_disadvantaged | -2.932 | 0.294 | -9.961 | 0.000 | -3.511 | -2.353 |
school_sizemedium | -17.764 | 15.827 | -1.122 | 0.263 | -48.899 | 13.371 |
school_sizelarge | -13.293 | 13.813 | -0.962 | 0.337 | -40.466 | 13.880 |
percent_economically_disadvantaged:school_sizemedium | 0.146 | 0.371 | 0.393 | 0.694 | -0.585 | 0.877 |
percent_economically_disadvantaged:school_sizelarge | 0.189 | 0.323 | 0.586 | 0.559 | -0.446 | 0.824 |
Notice how there are 6 rows in the regression table, reflecting the 6 elements of the equation used to obtain fitted values \(\widehat{y}\). This is a measure of this model’s complexity.
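To see how those 6 estimates combine, here is a hand computation of a fitted value. This is a minimal sketch: the estimates are copied from the regression table above, while the 30% figure is a made-up illustrative input, not a school from the data.

```r
# Estimates from the interaction model's regression table (baseline: small schools)
b0     <- 594.327   # intercept
b1     <- -2.932    # slope for percent_economically_disadvantaged
b_med  <- -17.764   # intercept offset for medium schools
b_lg   <- -13.293   # intercept offset for large schools
b1_med <- 0.146     # slope offset for medium schools
b1_lg  <- 0.189     # slope offset for large schools

# Hypothetical medium school with 30% economically disadvantaged students:
# offset BOTH the intercept and the slope before plugging in x.
x <- 30
y_hat <- (b0 + b_med) + (b1 + b1_med) * x
y_hat   # 492.983
```

Because both the intercept and the slope get a school-size offset, each of the three groups gets its own line, which is exactly the flexibility the plot above shows.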
A parallel slopes model is less flexible: it allows for different intercepts but forces a common slope.
To plot a parallel slopes model, we unfortunately need to use a function I wrote, since there isn’t one included in the ggplot2 package. This is what I had you install in Lec14.
gg_parallel_slopes(y = "average_sat_math", num_x = "percent_economically_disadvantaged",
cat_x = "school_size", data = ma_schools)
Let’s get the regression table for the corresponding parallel slopes model. Note the use of +.
model_1_parallel_slopes <- lm(average_sat_math ~ percent_economically_disadvantaged + school_size, data = ma_schools)
get_regression_table(model_1_parallel_slopes)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 588.190 | 7.607 | 77.325 | 0.000 | 573.226 | 603.154 |
percent_economically_disadvantaged | -2.777 | 0.106 | -26.120 | 0.000 | -2.986 | -2.568 |
school_sizemedium | -11.913 | 7.535 | -1.581 | 0.115 | -26.736 | 2.909 |
school_sizelarge | -6.362 | 6.923 | -0.919 | 0.359 | -19.981 | 7.258 |
Notice how there are 4 rows in the regression table, reflecting the 4 elements of the equation used to obtain fitted values \(\widehat{y}\). This is a measure of this model’s complexity. In other words, the parallel slopes model is simpler than the interaction model.
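The corresponding hand computation for the parallel slopes model makes the “common slope” constraint concrete. Again a minimal sketch: estimates are from the table above, and the 30% input is a made-up example.

```r
# Estimates from the parallel slopes model's regression table (baseline: small schools)
b0    <- 588.190
b1    <- -2.777    # ONE common slope, shared by all three school sizes
b_med <- -11.913   # intercept offset for medium schools
b_lg  <- -6.362    # intercept offset for large schools

# Fitted values for a hypothetical 30% economically disadvantaged school:
x <- 30
y_small  <- b0 + b1 * x             # 504.880
y_medium <- (b0 + b_med) + b1 * x   # 492.967
y_large  <- (b0 + b_lg) + b1 * x    # 498.518
```

Only the intercept changes with school size; the slope term b1 * x is identical in all three lines, which is why the plot shows three parallel lines.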
The slopes of the lines in the plots of the interaction model and parallel slopes model are not that different. So the additional complexity of the interaction model is IMO not warranted; go with the simpler parallel slopes model.
Let’s use the same data as moderndive Chapter 7.2.
score | age | gender |
---|---|---|
4.9 | 50 | male |
4.8 | 34 | female |
3.7 | 50 | male |
4.7 | 52 | male |
3.5 | 34 | male |
An interaction model is flexible: it allows for different slopes and different intercepts.
To plot an interaction model, we use geom_smooth() like we did in Chapter 6. Note that the two lines are nowhere near parallel.
ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
geom_point() +
labs(x = "Age", y = "Teaching Score", color = "Gender") +
geom_smooth(method = "lm", se = FALSE)
So of all possible pairs of lines with different slopes and different intercepts, these two are “best fitting”: they have the smallest sum of squared residuals, as we saw in moderndive Ch 6.3.3.
Let’s get the regression table for the corresponding interaction model. Note the use of *.
model_2_interaction <- lm(score ~ age * gender, data = evals_ch7)
get_regression_table(model_2_interaction)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 4.883 | 0.205 | 23.795 | 0.000 | 4.480 | 5.286 |
age | -0.018 | 0.004 | -3.919 | 0.000 | -0.026 | -0.009 |
gendermale | -0.446 | 0.265 | -1.681 | 0.094 | -0.968 | 0.076 |
age:gendermale | 0.014 | 0.006 | 2.446 | 0.015 | 0.003 | 0.024 |
Notice how there are 4 rows in the regression table, reflecting the 4 elements of the equation used to obtain fitted values \(\widehat{y}\). This is a measure of this model’s complexity.
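Here is how the 4 estimates above translate into the equations of the two lines in the plot, written out by hand (a minimal sketch; estimates copied from the regression table):

```r
# Estimates from the interaction model's regression table (baseline: female)
b0      <- 4.883    # intercept
b_age   <- -0.018   # slope for age
b_male  <- -0.446   # intercept offset for male instructors
b_inter <- 0.014    # slope offset for male instructors (the age:gendermale row)

# Line for female instructors: score-hat = 4.883 - 0.018 * age
intercept_female <- b0
slope_female     <- b_age

# Line for male instructors: apply BOTH offsets
intercept_male <- b0 + b_male       # 4.437
slope_male     <- b_age + b_inter   # -0.004
```

Note how different the two slopes are: female instructors’ scores decline with age at -0.018 per year, while male instructors’ scores decline at only -0.004 per year. This is why the two lines are nowhere near parallel.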
A parallel slopes model is less flexible: it allows for different intercepts but forces a common slope.
To plot a parallel slopes model, we unfortunately need to use a function I wrote, since there isn’t one included in the ggplot2 package. This is what I had you install in Lec14.
gg_parallel_slopes(y = "score", num_x = "age", cat_x = "gender",
data = evals_ch7)
Let’s get the regression table for the corresponding parallel slopes model. Note the use of +.
model_2_parallel_slopes <- lm(score ~ age + gender, data = evals_ch7)
get_regression_table(model_2_parallel_slopes)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 4.484 | 0.125 | 35.792 | 0.000 | 4.238 | 4.730 |
age | -0.009 | 0.003 | -3.280 | 0.001 | -0.014 | -0.003 |
gendermale | 0.191 | 0.052 | 3.632 | 0.000 | 0.087 | 0.294 |
Notice how there are 3 rows in the regression table, reflecting the 3 elements of the equation used to obtain fitted values \(\widehat{y}\). This is a measure of this model’s complexity. In other words, the parallel slopes model is simpler than the interaction model.
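And the corresponding hand computation for this parallel slopes model, with estimates from the table above (the age of 40 is a made-up illustrative input):

```r
# Estimates from the parallel slopes model's regression table (baseline: female)
b0     <- 4.484
b_age  <- -0.009   # ONE common slope for both genders
b_male <- 0.191    # intercept offset for male instructors

# Fitted scores for a hypothetical 40-year-old instructor of each gender:
age <- 40
y_female <- b0 + b_age * age            # 4.124
y_male   <- (b0 + b_male) + b_age * age # 4.315

y_male - y_female   # always 0.191, no matter the age
```

Because the slope is shared, the vertical gap between the two lines is the gendermale offset of 0.191 at every age, whereas in the interaction model the gap shrinks as age increases.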
The slopes of the lines in the plots of the interaction model and parallel slopes model are somewhat different. So the additional complexity of the interaction model is IMO warranted; go with the more complex interaction model.
Recall our Massachusetts high school math SAT example above, where we stated that since the “more flexible” interaction model yielded three lines with near identical slopes, the additional complexity of the interaction model may not be warranted, and thus we were inclined to favor the simpler parallel slopes model:
However, upon closer inspection of the parallel slopes model plot above, it appears furthermore that the three lines have near identical intercepts as well! In other words, the three lines are almost entirely identical! So taking our “all other things being equal, simpler models are to be preferred” logic one step further, it can be argued we shouldn’t even have three separate lines, rather just one!
ggplot(ma_schools, aes(x = percent_economically_disadvantaged, y = average_sat_math)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(y = "Math SAT Score", x = "Percentage of Economically Disadvantaged Students")
In other words, perhaps we should not even be using a multiple regression model that includes both:
percent_economically_disadvantaged
school_size
but rather perhaps we should be using an even simpler basic linear regression model using only:
percent_economically_disadvantaged
model_1_simple <- lm(average_sat_math ~ percent_economically_disadvantaged, data = ma_schools)
get_regression_table(model_1_simple)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 581.281 | 3.267 | 177.936 | 0 | 574.855 | 587.707 |
percent_economically_disadvantaged | -2.780 | 0.101 | -27.499 | 0 | -2.979 | -2.581 |
Compare this to the regression table for our parallel slopes model:
model_1_parallel_slopes <- lm(average_sat_math ~ percent_economically_disadvantaged + school_size, data = ma_schools)
get_regression_table(model_1_parallel_slopes)
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 588.190 | 7.607 | 77.325 | 0.000 | 573.226 | 603.154 |
percent_economically_disadvantaged | -2.777 | 0.106 | -26.120 | 0.000 | -2.986 | -2.568 |
school_sizemedium | -11.913 | 7.535 | -1.581 | 0.115 | -26.736 | 2.909 |
school_sizelarge | -6.362 | 6.923 | -0.919 | 0.359 | -19.981 | 7.258 |
Expressed differently, there is no need to distinguish between small, medium, and large schools: the relationship between math SAT scores and the percentage of the student body that is economically disadvantaged is essentially the same for all three. This relationship is a steeply negative one: as high schools have a higher proportion of economically disadvantaged students, there is an associated decrease in their average math SAT scores. Sad, but true.
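To make the comparison concrete, here is a hand computation of the fitted values both models give for the same hypothetical school (a minimal sketch; the estimates are from the two regression tables above and the 30% input is a made-up example):

```r
x <- 30  # hypothetical school: 30% economically disadvantaged students

# Simple regression model: one line for all schools
y_simple <- 581.281 - 2.780 * x   # 497.881

# Parallel slopes model: three lines, one per school size
y_small  <- 588.190 - 2.777 * x             # 504.880
y_medium <- 588.190 - 11.913 - 2.777 * x    # 492.967
y_large  <- 588.190 - 6.362 - 2.777 * x     # 498.518
```

All four fitted values land within about 12 SAT points of each other, which is small compared to the roughly 83-point drop the slope alone produces over those 30 percentage points. That is the numerical version of the argument for collapsing the three nearly identical lines into one.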