class: center, middle, inverse, title-slide

.title[
# PADP 7120 Data Applications in PA
]
.subtitle[
## RLab 8: Regression Diagnostics
]
.author[
### Alex Combs
]
.institute[
### UGA | SPIA | PADP
]
.date[
### Last updated: April 16, 2024
]

---
# Outline

- Are regression results statistically valid?
- Classical assumptions of regression (LINE)
- Multicollinearity
- Outliers, high leverage, and high influence

---
# Setup

> **Start a new project and Rmd**

> **Change the YAML.**

```yaml
---
title: "RLab 8: Regression Diagnostics"
author: "Your Name"
output: 
  html_document:
    theme: spacelab
    df_print: paged
---
```

---
# Setup

> **Download the `trump_vote` data on eLC and import it**

> **Load the following packages**

```r
library(tidyverse)
library(moderndive)
library(carData)
library(car)
library(gvlma)
```

---
# Data Summary

> **Use the `summary()` function to obtain a simple display of summary statistics**

---
# Regression

> **Run the following regression and produce tables of results and goodness-of-fit**

`$$TrumpVoteShare = \beta_0 + \beta_1 PctWhite + \beta_2 PctWhitePov + \epsilon$$`

---
# Regression

|term      | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:---------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept |   10.933|     6.526|     1.675|   0.100|   -2.189|   24.054|
|white     |    0.264|     0.081|     3.239|   0.002|    0.100|    0.428|
|white_pov |    2.181|     0.543|     4.014|   0.000|    1.089|    3.274|

- We now understand all the information in this table.
- But the information could be wrong if regression assumptions are violated.
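---
# Regression Code

- A minimal sketch of the summary and regression tasks. The explanatory variable names `white` and `white_pov` come from the results table; the outcome column name `trump_share` is an assumption, so match it to your imported data

```r
# Simple summary statistics for every variable
summary(trump_vote)

# Fit the regression and save the results
# (the outcome column name trump_share is assumed)
vote_model <- lm(trump_share ~ white + white_pov, data = trump_vote)

# Tables of estimates and goodness-of-fit (moderndive)
get_regression_table(vote_model)
get_regression_summaries(vote_model)
```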
---
class: inverse, center, middle

# Classical assumptions of linear regression

---
# Concerns

- Biased estimates
  - Systematically higher/lower estimates than the parameter
- Invalid hypothesis tests
  - Inflated chance of false positives or false negatives
- Wider confidence intervals (less precision) than necessary

---
# Assumptions Focus on Residuals

<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---
# Classical regression assumptions

- **L.I.N.E.**
- **L**inear relationship between `\(x\)` and `\(y\)`
  - Or proper inclusion of a nonlinear relationship

--

- **I**ndependent residuals
  - Requires independent observations; one observation's data share no correlation with any other observation's

--

- **N**ormality of residuals
- **E**qual variance in residuals

---
# Residual vs. Fitted Plot (RVF)

.pull-left[
<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]
.pull-right[
<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]

- Assumptions **L**, **N**, and **E** can be inspected using an RVF plot (on the right)
- We want to see no obvious pattern in the RVF plot points

---
# Linear assumption

<img src="lectures_files/rvfp-linear.png" style="display: block; margin: auto;" />

- Obvious pattern in the RVF plot on the right
- Violation of **L**

---
# Linear assumption

<img src="lectures_files/rvfp-linear2.png" style="display: block; margin: auto;" />

- Adding a quadratic term to the model has improved the RVF plot

---
# Normally distributed residuals

<img src="lectures_files/norm-residuals.png" style="display: block; margin: auto;" />

---
# Normally distributed residuals

<img src="lectures_files/rvfp-normality.png" style="display: block; margin: auto;" />

- Pattern on the right shows a highly skewed distribution of residuals
- Violation of **N**

---
# Equal variance in residuals

<img src="lectures_files/homo-hetero.png" style="display: block; margin: auto;" />

---
# Equal variance in residuals

<img src="lectures_files/homoskedasticity.png" style="display: block; margin: auto;" />

- Spread of the distribution is constant along the regression line

---
# Equal variance in residuals

<img src="lectures_files/heteroskedasticity.png" style="display: block; margin: auto;" />

- Spread of the distribution is clearly changing
- Violation of **E**

---
# Independent residuals/observations

- Cannot directly observe a violation of **I**
- Requires an understanding of how observations in the data are related to each other
- Examples of possible dependencies:
  - Schools in the same district
  - People in the same household
  - Counties treated by the same state policy
  - Past periods related to current and future periods
- If **L** or **N** is violated, it could be due to a violation of **I**

---
# Potential Consequences

- Violation of **L**, **N**, or **I**
  - Biased estimates
- Violation of **E**
  - Biased standard errors, thus biased confidence intervals
  - Invalid hypothesis testing

---
# Checking Assumptions

- The `gvlma(saved_results)` function from the gvlma package will perform a test of assumptions
  - GVLMA: Global Validation of Linear Model Assumptions

> **Run `gvlma()` on your saved results**

---
# GVLMA

<img src="lectures_files/gvlma1.png" style="display: block; margin: auto;" />

- Global Stat: Holistic test of assumptions
- Skewness: Primary test of **N**; secondary test for influence
- Kurtosis: Primary test for influence; secondary test of **N**
- Link Function: Test of **L**
- Heteroskedasticity: Test of **E**

---
# Checking Assumptions

- The `plot(saved_regression_results)` function will produce a series of diagnostic plots
- Each plot targets a specific assumption or issue and will identify rows in the data **potentially** causing a problem

> **Use `plot()` on your regression results**
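---
# GVLMA Code

- If the saved results object is called `vote_model` (as in the earlier sketch), the check is one line

```r
# Global Validation of Linear Model Assumptions
gvlma(vote_model)
```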
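---
# Diagnostic Plot Code

- One way to view all four diagnostic plots at once; `par(mfrow = c(2, 2))` is optional and just arranges them in a grid

```r
# Arrange the four diagnostic plots in a 2-by-2 grid
par(mfrow = c(2, 2))
plot(vote_model)

# Reset to one plot per frame afterward
par(mfrow = c(1, 1))
```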
---
# RVF Plot

<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

- We want to see no obvious pattern and a relatively straight line along 0
- Especially useful for evaluating **L**

---
# Normal Q-Q

<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

- We want to see points fall approximately along the dotted line
- If not, it suggests **N** may be violated

---
# Scale-Location

<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

- We want to see a straight red line
- If not, it is evidence **E** may be violated

---
# Residuals vs. Leverage

<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

- Useful for identifying high-influence observations
- Observations beyond the dotted Cook's distance lines may be a problem

---
# Unusual and influential data

- Regression outlier
  - An observation with a large residual
- High leverage
  - An observation far from the mean of an explanatory variable
- High influence
  - A regression outlier with high leverage
- An influential observation is one that, if removed, meaningfully changes the regression results

---
# Summary of our Diagnostics

- Our model fails to meet assumptions, driven by **L**
- Do the diagnostic plots identify one or more observations that seem most problematic?
- What can or should we do?

---
# Potential Corrections

- Violation of **L** or **N**
  - Include quadratic or log transformations
  - Remove an influential observation that could be the cause
- Violation of **E**
  - Log-transform the outcome variable
  - Specify **robust** standard errors (outside the scope of this class)
  - Remove an influential observation that could be the cause
- Violation of **I**
  - Control for variables that capture how observations are related to each other
  - Specify **clustered** standard errors (outside the scope of this class)

---
# Influencer?

- Visualizing the data can also provide an intuitive check for high influence

> **Generate a plot for the relationship between white poverty and Trump vote**

- What is the pattern of plot points doing to our attempt to fit a line to them?
- Which state is this problematic point?

---
# Excluding DC

> **Save a new dataset that excludes DC**

> **Re-run the regression and produce a table of results**
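---
# Plotting Code

- A sketch of the scatterplot requested on the Influencer? slide, again assuming the outcome column is named `trump_share`

```r
# Scatterplot of white poverty vs. Trump vote share with a fitted line
ggplot(trump_vote, aes(x = white_pov, y = trump_share)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```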
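---
# Excluding DC Code

- One way to drop DC and re-estimate; the column name `state` and the value `"DC"` are assumptions, so check how your data identify observations

```r
# Keep every observation except DC
# (column name and coding of DC are assumed)
trump_vote_nodc <- trump_vote %>%
  filter(state != "DC")

# Re-fit the same model on the reduced data
vote_model_nodc <- lm(trump_share ~ white + white_pov, data = trump_vote_nodc)
get_regression_table(vote_model_nodc)
```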
---
# Comparing results

- Including DC

|term      | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:---------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept |   10.933|     6.526|     1.675|   0.100|   -2.189|   24.054|
|white     |    0.264|     0.081|     3.239|   0.002|    0.100|    0.428|
|white_pov |    2.181|     0.543|     4.014|   0.000|    1.089|    3.274|

- Excluding DC

|term      | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:---------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept |   18.859|     6.407|     2.943|   0.005|    5.969|   31.748|
|white     |    0.212|     0.076|     2.791|   0.008|    0.059|    0.364|
|white_pov |    1.770|     0.510|     3.471|   0.001|    0.744|    2.797|

---
# Influencer?

- Whether a change in results is meaningful can be subjective
- Obviously meaningful changes:
  - Change in statistical significance
  - Estimates change sign between positive and negative
  - Changes in the validity of LINE assumptions
- Consider whether it makes sense to include or exclude certain observations

---
# Comparing regression lines

.pull-left[
<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
]
.pull-right[
<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]

- DC (on the left) pulls the left end of the regression line down, causing the slope to be steeper

---
# Checking Assumptions

> **Use `gvlma()` on this second set of results**

- Did excluding DC improve the diagnostic results?
- What could we do to fix the remaining violation? Refer to the Potential Corrections slide.

> **Run another model with that correction and check assumptions again**

---
class: inverse, center, middle

# Multicollinearity

---
# Multicollinearity

- When two explanatory variables are strongly correlated with each other
- Since regression computes the association between an `\(x\)` and `\(y\)`, **controlling for all other `\(x\)`s**, multicollinearity can mask statistically significant associations between an `\(x\)` and `\(y\)`
- In other words, multicollinearity can cause false negatives

---
# Multicollinearity

- Detection
  - If two variables have a correlation stronger than +/- 0.8, multicollinearity might be a problem
  - Variance Inflation Factor (VIF) is another method
  - A VIF greater than 10 indicates multicollinearity may be a problem
- Solutions
  - Combine the variables into a single index variable
  - If one variable is really important to the analysis, exclude the variables correlated with it. Be careful not to introduce omitted variable bias.

---
# Checking multicollinearity

- Can use `vif(saved_results)` from the car package to check multicollinearity

> **Run `vif()` on our preferred results**

- Is multicollinearity a problem?
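---
# Multicollinearity Code

- A sketch of both detection methods, run here on the model excluding DC as one plausible "preferred" model; `cor()` checks the pairwise correlation and `vif()` comes from the car package

```r
# Correlation between the two explanatory variables
trump_vote_nodc %>%
  select(white, white_pov) %>%
  cor()

# Variance Inflation Factors; values above 10 are a red flag
vif(vote_model_nodc)
```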