The data consists of SAT scores for the 50 states in 1994-1995:

`library(tidyverse)`


```
SAT <- read_csv("http://people.reed.edu/~jones/141/sat.csv") %>%
  select(State, salary, expend)
head(SAT)
```

State | salary | expend
---|---|---
Alabama | 31.1 | 4.41
Alaska | 48.0 | 8.96
Arizona | 32.2 | 4.78
Arkansas | 28.9 | 4.46
California | 41.1 | 4.99
Colorado | 34.6 | 5.44

```
X <- SAT %>%
  select(salary, expend)
```

Let `X` be of dimension \(n \times p = 50 \times 2\), where our observations \(\vec{X} = (X_1, X_2)\) are

- \(X_1\) = `salary`: estimated average annual salary of teachers in public elementary and secondary schools (in $1000 of USD)
- \(X_2\) = `expend`: expenditure per pupil in average daily attendance in public elementary and secondary schools (in $1000 of USD)

The two variables `salary` and `expend` are highly correlated (\(\rho =\) 0.87). You can think of them as being redundant: once you know one variable, the other doesn't provide much additional information. This is very apparent in the plot below (with the regression line being dashed):
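As a quick sanity check (not part of the original analysis), the reported correlation can be recovered by hand from the rounded entries of the covariance matrix printed further down, via \(\rho = s_{12} / \sqrt{s_{11}\, s_{22}}\):

```r
# rho = cov(salary, expend) / (sd(salary) * sd(expend)),
# using the rounded covariance entries 35.30, 7.04, 1.86
rho <- 7.04 / sqrt(35.30 * 1.86)
round(rho, 2)  # 0.87
```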

Note that such *collinear* variables are a problem in a regression setting because they "steal" each other's effect, making it difficult to isolate the effect of one versus the other. In other words, the estimates of their \(\beta\) coefficients/slopes are very unstable.
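To see that instability concretely, here is a small self-contained simulation (illustrative only; this is synthetic data, not the SAT data). Two nearly collinear predictors yield slope estimates that vary far more across repeated samples than two independent predictors do:

```r
set.seed(42)
# Case 1: x2 is nearly a copy of x1 (highly collinear)
unstable <- replicate(1000, {
  x1 <- rnorm(50)
  x2 <- x1 + rnorm(50, sd = 0.1)
  y  <- x1 + x2 + rnorm(50)
  coef(lm(y ~ x1 + x2))["x1"]
})
# Case 2: x1 and x2 are independent
stable <- replicate(1000, {
  x1 <- rnorm(50)
  x2 <- rnorm(50)
  y  <- x1 + x2 + rnorm(50)
  coef(lm(y ~ x1 + x2))["x1"]
})
# The slope estimate for x1 is far more variable in the collinear case
c(sd_collinear = sd(unstable), sd_independent = sd(stable))
```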

We first recenter both `salary` and `expend` to each have mean 0 by subtracting their sample means \((\overline{x}_1, \overline{x}_2) = (34.829, 5.905)\) respectively:

```
X_recenter <- X %>%
  mutate(
    salary = salary - mean(salary),
    expend = expend - mean(expend)
  )
```
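The same recentering can also be done with base R's `scale()` (centering only, no rescaling). A minimal sketch using the first three rows of the table above:

```r
M <- matrix(c(31.1, 48.0, 32.2,    # salary
              4.41, 8.96, 4.78),   # expend
            ncol = 2)
# center = TRUE subtracts each column mean; scale = FALSE leaves units unchanged
M_centered <- scale(M, center = TRUE, scale = FALSE)
colMeans(M_centered)  # both columns now have mean 0 (up to rounding error)
```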

We plot the two variables recentered at \((0, 0)\).

We compute the *sample/empirical* \(2 \times 2\) covariance matrix for both `X` and `X_recenter` (note for students who have taken MATH310 Probability: they are identical, since variances/covariances are invariant to shifts in location):

`cov(X) %>% round(3)`

```
## salary expend
## salary 35.30 7.04
## expend 7.04 1.86
```

`cov(X_recenter) %>% round(3)`

```
## salary expend
## salary 35.30 7.04
## expend 7.04 1.86
```
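This shift-invariance can also be verified directly on synthetic data (a self-contained check, not the SAT data):

```r
set.seed(1)
A <- data.frame(x = rnorm(50, mean = 10), y = rnorm(50, mean = 5))
# Subtract each column's sample mean
A_centered <- data.frame(x = A$x - mean(A$x), y = A$y - mean(A$y))
all.equal(cov(A), cov(A_centered))  # TRUE: recentering leaves the covariance unchanged
```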

Using R's built-in linear algebra functionality, we

- Compute the first and second principal components. Linear algebra fact: they are the first and second *eigenvectors* \(\vec{\gamma}_1\) and \(\vec{\gamma}_2\) of the covariance matrix.
- Plot them in red and blue respectively.

```
eigen <- cov(X_recenter) %>% eigen()  # note: this shadows base::eigen(), but still runs
eigen_vals <- eigen$values            # eigenvalues, sorted in decreasing order
Gamma <- eigen$vectors                # columns are the eigenvectors gamma_1, gamma_2
```
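Two properties worth checking: the eigendecomposition satisfies \(\Sigma = \Gamma \Lambda \Gamma^\top\), and the columns of \(\Gamma\) are orthonormal. A self-contained verification using the rounded covariance entries printed above (so the numbers here are approximate stand-ins for the real matrix):

```r
# Rounded sample covariance matrix from the output above
Sigma  <- matrix(c(35.30, 7.04, 7.04, 1.86), nrow = 2)
decomp <- eigen(Sigma)
lambda <- decomp$values   # eigenvalues, decreasing
G      <- decomp$vectors  # eigenvectors as columns
max(abs(G %*% diag(lambda) %*% t(G) - Sigma))  # ~0: decomposition recovers Sigma
max(abs(t(G) %*% G - diag(2)))                 # ~0: eigenvectors are orthonormal
```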