1. Data

The data consists of SAT scores for the 50 states in 1994-1995:

library(tidyverse)
SAT <- read_csv("http://people.reed.edu/~jones/141/sat.csv") %>% 
  select(State, salary, expend)
head(SAT)
State        salary  expend
Alabama        31.1    4.41
Alaska         48.0    8.96
Arizona        32.2    4.78
Arkansas       28.9    4.46
California     41.1    4.99
Colorado       34.6    5.44

X <- SAT %>% 
  select(salary, expend)

Let X be of dimension \(n \times p = 50 \times 2\), where each observation \(\vec{X} = (X_1, X_2)\) consists of the two variables salary and expend.
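
One quick way to confirm these dimensions (a check added here, using only base R):

dim(X)  # should return 50 2: one row per state, one column per variable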

The two variables salary and expend are highly correlated (\(\rho =\) 0.87). You can think of them as being redundant: once you know one variable, the other doesn't provide much additional information. This is very apparent in the plot below (the regression line is dashed):
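
The correlation and a plot along these lines can be reproduced as follows (a minimal sketch; the exact styling of the original figure is an assumption):

cor(SAT$salary, SAT$expend)
ggplot(SAT, aes(x = salary, y = expend)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed")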

Note that such collinear variables are a problem in a regression setting, because they “steal” each other’s effect, making it difficult to isolate the effect of one versus the other. In other words, the estimates of their \(\beta\) coefficients/slopes become very unstable.
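
To illustrate this instability, here is a small hypothetical simulation (not part of the original analysis): we generate an outcome y that truly depends on salary only, then compare how variable the fitted salary slope is with and without the collinear expend in the model.

# Hypothetical illustration: simulate an outcome driven by salary alone,
# then fit the slope on salary with and without expend in the model.
set.seed(76)
sim_slopes <- map_dfr(1:1000, function(i) {
  sim_data <- SAT %>% mutate(y = 2 * salary + rnorm(n(), sd = 10))
  tibble(
    with_expend    = coef(lm(y ~ salary + expend, data = sim_data))["salary"],
    without_expend = coef(lm(y ~ salary, data = sim_data))["salary"]
  )
})
# The salary slope varies noticeably more across simulations when the
# collinear expend is also included in the model.
sim_slopes %>% summarize(across(everything(), sd))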

2. Recentering Covariates

We first recenter both salary and expend to each have mean 0 by subtracting their respective sample means \((\overline{x}_1, \overline{x}_2) = (34.829, 5.905)\):

X_recenter <- X %>% 
  mutate(
    salary = salary - mean(salary),
    expend = expend - mean(expend)
  )
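
As a quick sanity check, both recentered columns should now have mean (essentially) zero:

colMeans(X_recenter) %>% round(3)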

We plot the two variables recentered at \((0, 0)\):
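
A minimal ggplot2 sketch of such a plot (the original figure’s exact styling is an assumption); the dashed lines mark the new center \((0, 0)\):

ggplot(X_recenter, aes(x = salary, y = expend)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed")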

We compute the sample/empirical \(2 \times 2\) covariance matrix for both X and X_recenter. (Note for students who have taken MATH 310 Probability: the two are identical since variances and covariances are invariant to shifts in location, i.e. to adding or subtracting constants.)

cov(X) %>% round(3)
##        salary expend
## salary  35.30   7.04
## expend   7.04   1.86
cov(X_recenter) %>% round(3)
##        salary expend
## salary  35.30   7.04
## expend   7.04   1.86

3. Principal Components

Using R’s built-in linear algebra functionality, we compute the eigenvalues and eigenvectors of the covariance matrix of X_recenter:

eigen <- cov(X_recenter) %>% eigen()
eigen_vals <- eigen$values  # eigenvalues, returned in decreasing order
Gamma <- eigen$vectors      # corresponding eigenvectors, one per column
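
Two quick checks on this decomposition (sketches added here, not original output): the columns of Gamma should be orthonormal, and Gamma %*% diag(eigen_vals) %*% t(Gamma) should reconstruct the covariance matrix.

round(t(Gamma) %*% Gamma, 3)                       # 2 x 2 identity: orthonormal eigenvectors
round(Gamma %*% diag(eigen_vals) %*% t(Gamma), 3)  # should match cov(X_recenter)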