PADP 7120 Data Applications in PA

class: center, middle, inverse, title-slide

.title[
# PADP 7120 Data Applications in PA
]
.subtitle[
## Simple & Multiple Linear Regression
]
.author[
### Alex Combs
]
.institute[
### UGA | SPIA | PADP
]
.date[
### Last updated: February 11, 2025
]

---

# Outline

- Understand the purpose of regression and its relevance to making decisions

- Interpret regression coefficients

- Assess goodness-of-fit for one model or between several competing models

- Understand what it means to control for other variables

---
# Correlation coefficients

``` r
select(States, -region) %>% 
  cor()
```

```
##                pop       SATV       SATM    percent    dollars        pay
## pop      1.0000000 -0.3381028 -0.2300418  0.2100687  0.1436745  0.3677244
## SATV    -0.3381028  1.0000000  0.9620359 -0.8627954 -0.5268313 -0.5559238
## SATM    -0.2300418  0.9620359  1.0000000 -0.8581495 -0.4844477 -0.4853306
## percent  0.2100687 -0.8627954 -0.8581495  1.0000000  0.7111474  0.6630098
## dollars  0.1436745 -0.5268313 -0.4844477  0.7111474  1.0000000  0.8476737
## pay      0.3677244 -0.5559238 -0.4853306  0.6630098  0.8476737  1.0000000
```

---
# Visualizing correlation

``` r
ggplot(States, aes(x = dollars, y = SATM)) +
  geom_point() +
  labs(x = 'Public ed spending in $1000s per student',
       y = 'Average SAT Math score') +
  theme(text=element_text(size=18))
```

---
# Correlation conclusions

- States that spend more on public education tend to have lower average SAT Math scores among its graduating high school students

- The strength of the relationship is moderate at -0.48

---
# Basic purpose of regression

.pull-left[
![](Regression_files/figure-html/unnamed-chunk-5-1.png)
]

.pull-right[
- Measures direction, strength, and magnitude of association
- Draws line of **best fit** through paired x-y data points
- Slope of line tells us the **average, predicted change in y given a change in x**
- Or, at any value of x, the **predicted value of y**
]

---
# Basic purpose of regression

- Regression is a way of modeling an outcome `$y$` as a function of one or more explanatory variables `$x$`.

- Regression can be used to **explain** an outcome or **predict** an outcome, or both.

---
# Adding a regression line

``` r
ggplot(States, aes(x = dollars, y = SATM)) +
  geom_point() +
* geom_smooth(method = 'lm', se = FALSE) +
  labs(x = 'Public ed spending in $1000s per student',
       y = 'Average SAT Math score')
```

---
# Equation of a line

`$$y = 600 - 20x$$`

- If `$x=5$`, `$y=?$`

- If `$x=6$`, `$y=?$`

- If `$x$` increases from 5 to 6, by how much does `$y$` change?

- If `$x$` increases from 2 to 5, by how much does `$y$` change?

---
# Simple regression model (population)

`$$y=\beta_0+\beta_1x+\epsilon$$`

- Which represents the dependent variable?
- Which represents the independent variable?
- Which represents the unknown population parameters?
- Which represents statistical noise?

---
# Simple regression model (sample)

`$$\hat{y}=b_0+b_1x$$`

- Which represents the predicted outcome?
- Which represents the estimates?

---
# Applied example

- Suppose we want to model average SAT math scores, `SATM`, in a state as a function of its spending on public education in **$1000s per student**, `dollars`

`$$SATM=\beta_0+\beta_1Dollars+\epsilon$$`

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p_value </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> 560.374 </td>
   <td style="text-align:right;"> 16.801 </td>
   <td style="text-align:right;"> 33.353 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 526.610 </td>
   <td style="text-align:right;"> 594.137 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dollars </td>
   <td style="text-align:right;"> -12.169 </td>
   <td style="text-align:right;"> 3.139 </td>
   <td style="text-align:right;"> -3.876 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> -18.478 </td>
   <td style="text-align:right;"> -5.860 </td>
  </tr>
</tbody>
</table>

- Replacing parameters with estimates

`$$\hat{SATM} = 560 - 12.17 \times Dollars$$`

---
# Example: Reporting results

`$$\hat{SATM} = 560 - 12.17 \times Dollars$$`

On average, a one thousand dollar increase in state spending on public education per student is associated with a 12 point decline in average SAT math scores.

- What is the predicted increase in `SATM` if a state increased `Dollars` by $2,500 per student?

- What is the predicted `SATM` in a state that spends $5,000 per student?

- What is the predicted `SATM` in a state that spends $0 per student?

---
# Another example

- Now `SATM` as a function of the percent of graduating high school students who take the SAT, `percent`

`$$SATM=\beta_0+\beta_1Percent+\epsilon$$`
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p_value </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> 538.975 </td>
   <td style="text-align:right;"> 4.351 </td>
   <td style="text-align:right;"> 123.870 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 530.231 </td>
   <td style="text-align:right;"> 547.719 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> percent </td>
   <td style="text-align:right;"> -1.232 </td>
   <td style="text-align:right;"> 0.105 </td>
   <td style="text-align:right;"> -11.701 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> -1.444 </td>
   <td style="text-align:right;"> -1.021 </td>
  </tr>
</tbody>
</table>

`$$\hat{SATM} = 539 - 1.23 \times Percent$$`

- Now we can consider any hypothetical *change* in percent or specific *value* of percent and provide a prediction for SAT math scores.

---
# Change vs. value
`$$\hat{SATM} = 539 - 1.23 \times Percent$$`
.pull-left[
**Predicted Change**

What if percent taking SAT increased from 50% to 55%?

]

.pull-right[

**Predicted Value**

What is the predicted average SAT math score for a state where 75% of high school graduates take the SAT?

]

---
# More practice

`$$\hat{SATM} = 539 - 1.23 \times Percent$$`

- What is the predicted average SAT math score in a state where 25% of its high school graduates take the SAT?
- What is the predicted average SAT math score in a state where 100% of its high school graduates take the SAT?
- What is the predicted change in average SAT math score in a state where the percent of high school graduates taking the SAT increases 75 percentage points?

---
class: inverse, middle, center

# Running regressions in R

---
# Running regression

- Explore data

``` r
glimpse(States)
```

```
## Rows: 51
## Columns: 7
## $ region  <fct> ESC, PAC, MTN, WSC, PAC, MTN, NE, SA, SA, SA, SA, PAC, MTN, EN…
## $ pop     <int> 4041, 550, 3665, 2351, 29760, 3294, 3287, 666, 607, 12938, 647…
## $ SATV    <int> 470, 438, 445, 470, 419, 456, 430, 433, 409, 418, 401, 404, 46…
## $ SATM    <int> 514, 476, 497, 511, 484, 513, 471, 470, 441, 466, 443, 481, 50…
## $ percent <int> 8, 42, 25, 6, 45, 28, 74, 58, 68, 44, 57, 52, 17, 16, 54, 5, 1…
## $ dollars <dbl> 3.648, 7.887, 4.231, 3.334, 4.826, 4.809, 7.914, 6.016, 8.210,…
## $ pay     <int> 27, 43, 30, 23, 39, 31, 43, 35, 39, 30, 29, 32, 25, 34, 32, 28…
```

---
# Running regression

``` r
summary(States)
```

```
##       pop             SATV            SATM          percent     
##  Min.   :  454   Min.   :397.0   Min.   :437.0   Min.   : 4.00  
##  1st Qu.: 1215   1st Qu.:422.5   1st Qu.:470.0   1st Qu.:11.50  
##  Median : 3294   Median :443.0   Median :490.0   Median :25.00  
##  Mean   : 4877   Mean   :448.2   Mean   :497.4   Mean   :33.75  
##  3rd Qu.: 5780   3rd Qu.:474.5   3rd Qu.:522.5   3rd Qu.:57.50  
##  Max.   :29760   Max.   :511.0   Max.   :577.0   Max.   :74.00  
##     dollars           pay       
##  Min.   :2.993   Min.   :22.00  
##  1st Qu.:4.354   1st Qu.:27.50  
##  Median :5.045   Median :30.00  
##  Mean   :5.175   Mean   :30.94  
##  3rd Qu.:5.689   3rd Qu.:33.50  
##  Max.   :9.159   Max.   :43.00
```

---
# Running regression

``` r
satm_mod1 <- lm(SATM ~ dollars, data = States)
satm_mod2 <- lm(SATM ~ percent, data = States)
```

``` r
library(moderndive)
get_regression_table(satm_mod1)
```

|term      | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:---------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept |  560.374|    16.801|    33.353|       0|  526.610|  594.137|
|dollars   |  -12.169|     3.139|    -3.876|       0|  -18.478|   -5.860|

---
class: inverse, center, middle

# Regression Residual

---
# Recap of population and sample regressions

**Population Model**

`$$y=\beta_0+\beta_1x+\epsilon$$`

`$$SATM=\beta_0+\beta_1Dollars+\epsilon$$`

**Sample Model**

`$$\hat{y}=b_0+b_1x$$`
`$$\hat{SATM} = 560 - 12.17 \times Dollars$$`

- What happens to the error term, `$\epsilon$`, when we move to the sample model?

---
# The residual

`$$e=y-\hat{y}$$`

- The residual is the difference between the *observed* outcome `$y$` and the *predicted* outcome `$\hat{y}$` in our regression model

- The residual is the sample statistic of the population parameter `$\epsilon$`

---
# The residual

- What might be contained within the error term and residual?

- Measurement error (bad)
  - Mistakes in how we model `$y$` as a function of `$x$` (bad)
  - Violations of the assumptions that enable credible regression (bad)
  - Other variables that associate with or impact the outcome (sometimes OK)
  - Randomness of social phenomena (OK)
  
- Severity of the problem can depend on the purpose of the regression.

---
# Residuals

`$$\hat{SATM} = 560 - 12.17 \times Dollars$$`

- What is the predicted `SATM` for a state where `dollars` equals $7,887?

- How about a state where `dollars` equals $4,231?

- If there are `SATM` scores we can observe for states with these specific values for `dollars`, then we can check by how much our regression model overestimates or underestimates `SATM`.

---
# Examining residuals

``` r
get_regression_points(satm_mod1)
```

- Regression results for each observation

- Shows observed data, predicted outcomes, and residuals

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> ID </th>
   <th style="text-align:right;"> SATM </th>
   <th style="text-align:right;"> dollars </th>
   <th style="text-align:right;"> SATM_hat </th>
   <th style="text-align:right;"> residual </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 514 </td>
   <td style="text-align:right;"> 3.648 </td>
   <td style="text-align:right;"> 515.980 </td>
   <td style="text-align:right;"> -1.980 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 476 </td>
   <td style="text-align:right;"> 7.887 </td>
   <td style="text-align:right;"> 464.395 </td>
   <td style="text-align:right;"> 11.605 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 497 </td>
   <td style="text-align:right;"> 4.231 </td>
   <td style="text-align:right;"> 508.886 </td>
   <td style="text-align:right;"> -11.886 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 511 </td>
   <td style="text-align:right;"> 3.334 </td>
   <td style="text-align:right;"> 519.802 </td>
   <td style="text-align:right;"> -8.802 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 484 </td>
   <td style="text-align:right;"> 4.826 </td>
   <td style="text-align:right;"> 501.645 </td>
   <td style="text-align:right;"> -17.645 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 513 </td>
   <td style="text-align:right;"> 4.809 </td>
   <td style="text-align:right;"> 501.852 </td>
   <td style="text-align:right;"> 11.148 </td>
  </tr>
</tbody>
</table>

---
class: inverse, center, middle

# Goodness-of-fit

---
# Ordinary least squares (OLS)

- Regression draws the **line of best fit** through observed data. What does best fit mean?

- Simple linear regression is also known as ordinary least squares (OLS)

- Regression uses calculus and matrix algebra to determine the intercept and slope of a line that minimizes the sum of squared residuals (SSR)

---
# OLS

.pull-left[
![](Regression_files/figure-html/unnamed-chunk-19-1.png)
]

.pull-right[
<table>
 <thead>
  <tr>
   <th style="text-align:right;"> ID </th>
   <th style="text-align:right;"> SATM </th>
   <th style="text-align:right;"> SATM_hat </th>
   <th style="text-align:right;"> residual </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 514 </td>
   <td style="text-align:right;"> 515.980 </td>
   <td style="text-align:right;"> -1.980 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 476 </td>
   <td style="text-align:right;"> 464.395 </td>
   <td style="text-align:right;"> 11.605 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 497 </td>
   <td style="text-align:right;"> 508.886 </td>
   <td style="text-align:right;"> -11.886 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 511 </td>
   <td style="text-align:right;"> 519.802 </td>
   <td style="text-align:right;"> -8.802 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 484 </td>
   <td style="text-align:right;"> 501.645 </td>
   <td style="text-align:right;"> -17.645 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 513 </td>
   <td style="text-align:right;"> 501.852 </td>
   <td style="text-align:right;"> 11.148 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 471 </td>
   <td style="text-align:right;"> 464.067 </td>
   <td style="text-align:right;"> 6.933 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:right;"> 470 </td>
   <td style="text-align:right;"> 487.164 </td>
   <td style="text-align:right;"> -17.164 </td>
  </tr>
</tbody>
</table>
]

---
# OLS

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> ID </th>
   <th style="text-align:right;"> SATM </th>
   <th style="text-align:right;"> dollars </th>
   <th style="text-align:right;"> SATM_hat </th>
   <th style="text-align:right;"> residual </th>
   <th style="text-align:right;"> sq_resid </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 514 </td>
   <td style="text-align:right;"> 3.648 </td>
   <td style="text-align:right;"> 515.980 </td>
   <td style="text-align:right;"> -1.980 </td>
   <td style="text-align:right;"> 3.9204 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 476 </td>
   <td style="text-align:right;"> 7.887 </td>
   <td style="text-align:right;"> 464.395 </td>
   <td style="text-align:right;"> 11.605 </td>
   <td style="text-align:right;"> 134.6760 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 497 </td>
   <td style="text-align:right;"> 4.231 </td>
   <td style="text-align:right;"> 508.886 </td>
   <td style="text-align:right;"> -11.886 </td>
   <td style="text-align:right;"> 141.2770 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 511 </td>
   <td style="text-align:right;"> 3.334 </td>
   <td style="text-align:right;"> 519.802 </td>
   <td style="text-align:right;"> -8.802 </td>
   <td style="text-align:right;"> 77.4752 </td>
  </tr>
</tbody>
</table>

``` r
sum(satm_mod1_points$sq_resid)
```

```
## [1] 45727.36
```

- No other line will achieve a sum of squared residuals less than 45,727

- But this tells us nothing of *how good* this best fit is

---
# Goodness-of-fit

- When we ask how good is our best regression line, one way to answer is the **percent of total variation in the outcome that is explained by our model**

- `$R^2$` (r-squared)
  - Ranges from 0 to 1
  - The percent of variation in `$y$` explained by our regression model

---
# Goodness of fit

- Model 1 using `dollars`

``` r
get_regression_summaries(satm_mod1)
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> r_squared </th>
   <th style="text-align:right;"> adj_r_squared </th>
   <th style="text-align:right;"> rmse </th>
   <th style="text-align:right;"> sigma </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.235 </td>
   <td style="text-align:right;"> 0.219 </td>
   <td style="text-align:right;"> 29.94355 </td>
   <td style="text-align:right;"> 30.549 </td>
  </tr>
</tbody>
</table>

The regression using dollars explains 24% of total variation in average SAT math scores.

---
# Goodness of fit

- Model 2 using `percent`

``` r
get_regression_summaries(satm_mod2)
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> r_squared </th>
   <th style="text-align:right;"> adj_r_squared </th>
   <th style="text-align:right;"> rmse </th>
   <th style="text-align:right;"> sigma </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.736 </td>
   <td style="text-align:right;"> 0.731 </td>
   <td style="text-align:right;"> 17.57277 </td>
   <td style="text-align:right;"> 17.928 </td>
  </tr>
</tbody>
</table>

- What percent of variation in average SAT math scores is explained by model 2?

- Model 2 has the higher `$R^2$`. It is *statistically* better at explaining math scores.

---
# Goodness of fit

.pull-left[
<img src="Regression_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="Regression_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
]

- Note that the data points in model 2 (right) are closer to the regression line -> more explained variance in `SATM` -> higher `$R^2$`

---
# Goodness of fit

- `$R^2$` may not be useful for a general audience and it has limited practical application for predictions.

- Root mean squared error (**RMSE**) answers the question: "How far off is our model, on average, from observed outcomes? If we predict an outcome, what is the typical range of inaccuracy?

- RMSE is the regression version of standard deviation; provides the average deviation from the regression line

- The lower the RMSE the better our model is at fitting the data

---
# Goodness of fit
Model 1 using dollars
<table>
 <thead>
  <tr>
   <th style="text-align:right;"> r_squared </th>
   <th style="text-align:right;"> adj_r_squared </th>
   <th style="text-align:right;"> rmse </th>
   <th style="text-align:right;"> sigma </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.235 </td>
   <td style="text-align:right;"> 0.219 </td>
   <td style="text-align:right;"> 29.94355 </td>
   <td style="text-align:right;"> 30.549 </td>
  </tr>
</tbody>
</table>

Model 2 using percent
<table>
 <thead>
  <tr>
   <th style="text-align:right;"> r_squared </th>
   <th style="text-align:right;"> adj_r_squared </th>
   <th style="text-align:right;"> rmse </th>
   <th style="text-align:right;"> sigma </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.736 </td>
   <td style="text-align:right;"> 0.731 </td>
   <td style="text-align:right;"> 17.57277 </td>
   <td style="text-align:right;"> 17.928 </td>
  </tr>
</tbody>
</table>

.pull-left[
- Model 1 predictions are off by about 30 points, on average.
- What about model 2?
]

.pull-right[
- Model 2 predictions are off by about 18 points, on average.
- Model 2 is *statistically* better at predicting math.
]

---
class: inverse, center, middle

# Multiple regression

---
# Multiple regression

- Multiple regression adds more than one explanatory variable to the model.

**Population model**
`$$y=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k+\epsilon$$`

**Sample model**
`$$\hat{y}=b_0+b_1x_1+b_2x_2+\dots+b_kx_k$$`

- This isolates the association between the outcome and each explanatory variable, holding all other variables constant at their respective means.

---
# Controlling for other factors

- The two previous regressions each controlled for one variable we might theorize affects average state SAT math scores

- There is any number of additional variables contained in the error term `$\epsilon$`

- If we can, we should include those additional variables in the same regression model

- Especially if those variables might affect the explanatory variable currently in our model

---
# Making sense of our results

- We found states spending more on education tend to have lower SAT math scores, and states with a higher percent of graduates taking the SAT tend to have lower SAT math scores. What may be going on?

.pull-left[
<img src="Regression_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="Regression_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />
]

---
# Making sense of our results

- Relationship between spending and SAT participation

---
# Running multiple regression

`$$SATM=\beta_0+\beta_1Dollars+\beta_2Percent+\epsilon$$`

``` r
satm_mod3 <- lm(SATM ~ dollars + percent, data = States)
```

``` r
get_regression_table(satm_mod3)
```
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p_value </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> 514.652 </td>
   <td style="text-align:right;"> 10.299 </td>
   <td style="text-align:right;"> 49.969 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 493.944 </td>
   <td style="text-align:right;"> 535.360 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dollars </td>
   <td style="text-align:right;"> 6.395 </td>
   <td style="text-align:right;"> 2.482 </td>
   <td style="text-align:right;"> 2.577 </td>
   <td style="text-align:right;"> 0.013 </td>
   <td style="text-align:right;"> 1.405 </td>
   <td style="text-align:right;"> 11.384 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> percent </td>
   <td style="text-align:right;"> -1.492 </td>
   <td style="text-align:right;"> 0.142 </td>
   <td style="text-align:right;"> -10.519 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> -1.777 </td>
   <td style="text-align:right;"> -1.207 </td>
  </tr>
</tbody>
</table>

`$$SATM=515+6.4 \times Dollars - 1.5 \times Percent+\epsilon$$`

---
# Reporting results

`$$SATM=515+6.4 \times Dollars - 1.5 \times Percent+\epsilon$$`

On average, a one thousand dollar increase in spending per pupil is associated with 6.4 point increase in average state SAT math scores, **controlling for the percent of students taking the SAT**.

---
# Interpreting results

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p_value </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> 514.652 </td>
   <td style="text-align:right;"> 10.299 </td>
   <td style="text-align:right;"> 49.969 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 493.944 </td>
   <td style="text-align:right;"> 535.360 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dollars </td>
   <td style="text-align:right;"> 6.395 </td>
   <td style="text-align:right;"> 2.482 </td>
   <td style="text-align:right;"> 2.577 </td>
   <td style="text-align:right;"> 0.013 </td>
   <td style="text-align:right;"> 1.405 </td>
   <td style="text-align:right;"> 11.384 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> percent </td>
   <td style="text-align:right;"> -1.492 </td>
   <td style="text-align:right;"> 0.142 </td>
   <td style="text-align:right;"> -10.519 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> -1.777 </td>
   <td style="text-align:right;"> -1.207 </td>
  </tr>
</tbody>
</table>

`$$SATM=515+6.4 \times Dollars - 1.5 \times Percent+\epsilon$$`

- Controlling for the percent of graduates taking the SAT, what is the predicted change in SAT math scores if states increase public education spending by $5,000 per student?

---
# Interpreting results

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p_value </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> 514.652 </td>
   <td style="text-align:right;"> 10.299 </td>
   <td style="text-align:right;"> 49.969 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 493.944 </td>
   <td style="text-align:right;"> 535.360 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dollars </td>
   <td style="text-align:right;"> 6.395 </td>
   <td style="text-align:right;"> 2.482 </td>
   <td style="text-align:right;"> 2.577 </td>
   <td style="text-align:right;"> 0.013 </td>
   <td style="text-align:right;"> 1.405 </td>
   <td style="text-align:right;"> 11.384 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> percent </td>
   <td style="text-align:right;"> -1.492 </td>
   <td style="text-align:right;"> 0.142 </td>
   <td style="text-align:right;"> -10.519 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> -1.777 </td>
   <td style="text-align:right;"> -1.207 </td>
  </tr>
</tbody>
</table>

`$$SATM=515+6.4 \times Dollars - 1.5 \times Percent+\epsilon$$`

What is the predicted average SAT math score for a state that spends $5,000 per student, and has 50% of high school graduates take the SAT?

---
class: inverse, middle, center

# Connecting results to policy decisions

---
# Policy Decision

- Let's ignore inference for now and treat these results as if they are statistically meaningful.

- More spending increases average SAT math scores among states, **controlling for percent of high school grads taking SAT**
  
- But spending seems to increase the percent of SAT participation, which then decreases average SAT math scores

- Let's quantify the effect of spending on SAT participation

---
# Spending and SAT participation

``` r
satm_mod4 <- lm(percent ~ dollars, data = States)
```

``` r
get_regression_table(satm_mod4)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p_value </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> -30.64 </td>
   <td style="text-align:right;"> 9.403 </td>
   <td style="text-align:right;"> -3.259 </td>
   <td style="text-align:right;"> 0.002 </td>
   <td style="text-align:right;"> -49.536 </td>
   <td style="text-align:right;"> -11.744 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dollars </td>
   <td style="text-align:right;"> 12.44 </td>
   <td style="text-align:right;"> 1.757 </td>
   <td style="text-align:right;"> 7.081 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 8.910 </td>
   <td style="text-align:right;"> 15.971 </td>
  </tr>
</tbody>
</table>

- A $1,000 increase in spending per pupil increases the percent of grads taking the SAT by 12.4 *percentage points*.

---
# Policy Decision

- Our first model found $1,000 increase in spending is associated with -12.2 points in SAT math.

- In the multiple regression model, a $1,000 increase in spending per student:
  
  - Directly increases average SAT math scores 6.4 points
  
  - Directly increases participation 12.4 percentage points
  
  - Increase of 12.4 p.p. in participation reduces average SAT math scores 18.6 points

- How might these results inform a state policy decision concerning education spending and college entrance exam participation?

---
class: inverse, middle, center

# Goodness-of-fit with multiple regression

---
# Goodness-of-fit

**Model 2 using only percent**
<table>
 <thead>
  <tr>
   <th style="text-align:right;"> r_squared </th>
   <th style="text-align:right;"> adj_r_squared </th>
   <th style="text-align:right;"> rmse </th>
   <th style="text-align:right;"> sigma </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.736 </td>
   <td style="text-align:right;"> 0.731 </td>
   <td style="text-align:right;"> 17.57277 </td>
   <td style="text-align:right;"> 17.928 </td>
  </tr>
</tbody>
</table>

**Model 3 using dollars and percent**

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> r_squared </th>
   <th style="text-align:right;"> adj_r_squared </th>
   <th style="text-align:right;"> mse </th>
   <th style="text-align:right;"> rmse </th>
   <th style="text-align:right;"> sigma </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.768 </td>
   <td style="text-align:right;"> 0.759 </td>
   <td style="text-align:right;"> 271.2768 </td>
   <td style="text-align:right;"> 16.47048 </td>
   <td style="text-align:right;"> 16.977 </td>
  </tr>
</tbody>
</table>

- Based on `$R^2$`, model 3 appears to be the better model, however...

---
# Goodness of fit

- `$R^2$` mechanically increases as you add more variables whether or not they improve the model

- Therefore, we should not use `$R^2$` to compare models with different numbers of explanatory variables

- **Adjusted `$R^2$` ** accounts for this, adjusting based on how useful each additional variable is at explaining the outcome

- Model 3 still has the higher adjusted `$R^2$` so it remains the statistically better model

- And the RMSE agrees; virtually always the case