class: center, middle, inverse, title-slide

.title[
# PADP 7120 Data Applications in PA
]
.subtitle[
## RLab 8: Regression Diagnostics
]
.author[
### Alex Combs
]
.institute[
### UGA | SPIA | PADP
]
.date[
### Last updated: April 16, 2024
]

---
# Outline

- Are regression results statistically valid?
- Classical assumptions of regression (LINE)
- Multicollinearity
- Outliers, high leverage, and high influence

---
# Setup

> **Start a new project and Rmd**

> **Change the YAML.**

```yaml
---
title: "RLab 8: Regression Diagnostics"
author: "Your Name"
output: 
  html_document:
    theme: spacelab
    df_print: paged
---
```

---
# Setup

> **Download the `trump_vote` data on eLC and import it**

> **Load the following packages**

```r
library(tidyverse)
library(moderndive)
library(carData)
library(car)
library(gvlma)
```

---
# Data Summary

> **Use the `summary()` function to obtain a simple display of summary statistics**

---
# Regression

> **Run the following regression and produce tables of results and goodness-of-fit**

`$$TrumpVoteShare = \beta_0 + \beta_1 PctWhite + \beta_2 PctWhitePov + \epsilon$$`

---
# Regression

|term      | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:---------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept |   10.933|     6.526|     1.675|   0.100|   -2.189|   24.054|
|white     |    0.264|     0.081|     3.239|   0.002|    0.100|    0.428|
|white_pov |    2.181|     0.543|     4.014|   0.000|    1.089|    3.274|

- We now understand all the information in this table.
- But the information could be wrong if regression assumptions are violated.
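---
# Regression Code

- A minimal sketch of the summary and regression tasks. The explanatory variable names `white` and `white_pov` come from the results table; the outcome column name `trump_share` is an assumption, so match it to your imported data

```r
# Simple summary statistics for every variable
summary(trump_vote)

# Fit the regression and save the results
# (the outcome column name trump_share is assumed)
vote_model <- lm(trump_share ~ white + white_pov, data = trump_vote)

# Tables of estimates and goodness-of-fit (moderndive)
get_regression_table(vote_model)
get_regression_summaries(vote_model)
```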
---
class: inverse, center, middle

# Classical assumptions of linear regression

---
# Concerns

- Biased estimates
  - Systematically higher/lower estimates than the parameter
- Invalid hypothesis tests
  - Inflated chance of false positives or false negatives
- Wider confidence intervals (less precision) than necessary

---
# Assumptions Focus on Residuals

<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---
# Classical regression assumptions

- **L.I.N.E.**
- **L**inear relationship between `\(x\)` and `\(y\)`
  - Or proper inclusion of a nonlinear relationship

--

- **I**ndependent residuals
  - Requires independent observations; one observation's data share no correlation with any other observation's

--

- **N**ormality of residuals
- **E**qual variance in residuals

---
# Residual vs. Fitted Plot (RVF)

.pull-left[
<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]
.pull-right[
<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]

- Assumptions **L**, **N**, and **E** can be inspected using an RVF plot (on the right)
- We want to see no obvious pattern in the RVF plot points

---
# Linear assumption

<img src="lectures_files/rvfp-linear.png" style="display: block; margin: auto;" />

- Obvious pattern in the RVF plot on the right
- Violation of **L**

---
# Linear assumption

<img src="lectures_files/rvfp-linear2.png" style="display: block; margin: auto;" />

- Adding a quadratic term to the model has improved the RVF plot

---
# Normally distributed residuals

<img src="lectures_files/norm-residuals.png" style="display: block; margin: auto;" />

---
# Normally distributed residuals

<img src="lectures_files/rvfp-normality.png" style="display: block; margin: auto;" />

- Pattern on the right shows a highly skewed distribution of residuals
- Violation of **N**

---
# Equal variance in residuals

<img src="lectures_files/homo-hetero.png" style="display: block; margin: auto;" />

---
# Equal variance in residuals

<img src="lectures_files/homoskedasticity.png" style="display: block; margin: auto;" />

- Spread of the distribution is constant along the regression line

---
# Equal variance in residuals

<img src="lectures_files/heteroskedasticity.png" style="display: block; margin: auto;" />

- Spread of the distribution is clearly changing
- Violation of **E**

---
# Independent residuals/observations

- Cannot directly observe a violation of **I**
- Requires an understanding of how observations in the data are related to each other
- Examples of possible dependencies:
  - Schools in the same district
  - People in the same household
  - Counties treated by the same state policy
  - Past periods related to current and future periods
- If **L** or **N** is violated, it could be due to a violation of **I**

---
# Potential Consequences

- Violation of **L**, **N**, or **I**
  - Biased estimates
- Violation of **E**
  - Biased standard errors, thus biased confidence intervals
  - Invalid hypothesis testing

---
# Checking Assumptions

- The `gvlma(saved_results)` function from the gvlma package will perform a test of assumptions
  - GVLMA: Global Validation of Linear Model Assumptions

> **Run `gvlma()` on your saved results**

---
# GVLMA

<img src="lectures_files/gvlma1.png" style="display: block; margin: auto;" />

- Global Stat: Holistic test of assumptions
- Skewness: Primary test of **N**; secondary test for influence
- Kurtosis: Primary test for influence; secondary test of **N**
- Link Function: Test of **L**
- Heteroskedasticity: Test of **E**

---
# Checking Assumptions

- The `plot(saved_regression_results)` function will produce a series of diagnostic plots
- Each plot targets a specific assumption or issue and will identify rows in the data **potentially** causing a problem

> **Use `plot()` on your regression results**
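---
# GVLMA Code

- If the saved results object is called `vote_model` (as in the earlier sketch), the check is one line

```r
# Global Validation of Linear Model Assumptions
gvlma(vote_model)
```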
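---
# Diagnostic Plot Code

- One way to view all four diagnostic plots at once; `par(mfrow = c(2, 2))` is optional and just arranges them in a grid

```r
# Arrange the four diagnostic plots in a 2-by-2 grid
par(mfrow = c(2, 2))
plot(vote_model)

# Reset to one plot per frame afterward
par(mfrow = c(1, 1))
```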
---
# RVF Plot

<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

- We want to see no obvious pattern and a relatively straight line along 0
- Especially useful for evaluating **L**

---
# Normal Q-Q

<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

- We want to see points fall approximately along the dotted line
- If not, it suggests **N** may be violated

---
# Scale-Location

<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

- We want to see a straight red line
- If not, it is evidence **E** may be violated

---
# Residuals vs. Leverage

<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

- Useful for identifying high-influence observations
- Observations beyond the dotted Cook's distance lines may be a problem

---
# Unusual and influential data

- Regression outlier
  - An observation with a large residual
- High leverage
  - An observation far from the mean of an explanatory variable
- High influence
  - A regression outlier with high leverage
- An influential observation is one that, if removed, meaningfully changes the regression results

---
# Summary of our Diagnostics

- Our model fails to meet assumptions, driven by **L**
- Do the diagnostic plots identify one or more observations that seem most problematic?
- What can or should we do?

---
# Potential Corrections

- Violation of **L** or **N**
  - Include quadratic or log transformations
  - Remove an influential observation that could be the cause
- Violation of **E**
  - Log-transform the outcome variable
  - Specify **robust** standard errors (outside the scope of this class)
  - Remove an influential observation that could be the cause
- Violation of **I**
  - Control for variables that capture how observations are related to each other
  - Specify **clustered** standard errors (outside the scope of this class)

---
# Influencer?

- Visualizing the data can also provide an intuitive check for high influence

> **Generate a plot for the relationship between white poverty and Trump vote**

- What is the pattern of plot points doing to our attempt to fit a line to them?
- Which state is this problematic point?

---
# Excluding DC

> **Save a new dataset that excludes DC**

> **Re-run the regression and produce a table of results**
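---
# Plotting Code

- A sketch of the scatterplot requested on the Influencer? slide, again assuming the outcome column is named `trump_share`

```r
# Scatterplot of white poverty vs. Trump vote share with a fitted line
ggplot(trump_vote, aes(x = white_pov, y = trump_share)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```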
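---
# Excluding DC Code

- One way to drop DC and re-estimate; the column name `state` and the value `"DC"` are assumptions, so check how your data identify observations

```r
# Keep every observation except DC
# (column name and coding of DC are assumed)
trump_vote_nodc <- trump_vote %>%
  filter(state != "DC")

# Re-fit the same model on the reduced data
vote_model_nodc <- lm(trump_share ~ white + white_pov, data = trump_vote_nodc)
get_regression_table(vote_model_nodc)
```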
---
# Comparing results

- Including DC

|term      | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:---------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept |   10.933|     6.526|     1.675|   0.100|   -2.189|   24.054|
|white     |    0.264|     0.081|     3.239|   0.002|    0.100|    0.428|
|white_pov |    2.181|     0.543|     4.014|   0.000|    1.089|    3.274|

- Excluding DC

|term      | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:---------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept |   18.859|     6.407|     2.943|   0.005|    5.969|   31.748|
|white     |    0.212|     0.076|     2.791|   0.008|    0.059|    0.364|
|white_pov |    1.770|     0.510|     3.471|   0.001|    0.744|    2.797|

---
# Influencer?

- Whether a change in results is meaningful can be subjective
- Obviously meaningful changes:
  - Change in statistical significance
  - Estimates change sign between positive and negative
  - Changes in the validity of LINE assumptions
- Consider whether it makes sense to include or exclude certain observations

---
# Comparing regression lines

.pull-left[
<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
]
.pull-right[
<img src="Reg_Diagnostics_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]

- DC (on the left) pulls the left end of the regression line down, causing the slope to be steeper

---
# Checking Assumptions

> **Use `gvlma()` on this second set of results**

- Did excluding DC improve the diagnostic results?
- What could we do to fix the remaining violation? Refer to the Potential Corrections slide.

> **Run another model with that correction and check assumptions again**

---
class: inverse, center, middle

# Multicollinearity

---
# Multicollinearity

- When two explanatory variables are strongly correlated with each other
- Since regression computes the association between an `\(x\)` and `\(y\)`, **controlling for all other `\(x\)`s**, multicollinearity can mask statistically significant associations between an `\(x\)` and `\(y\)`
- In other words, multicollinearity can cause false negatives

---
# Multicollinearity

- Detection
  - If two variables have a correlation stronger than +/- 0.8, multicollinearity might be a problem
  - Variance Inflation Factor (VIF) is another method
  - A VIF greater than 10 indicates multicollinearity may be a problem
- Solutions
  - Combine the variables into a single index variable
  - If one variable is really important to the analysis, exclude the variables correlated with it. Be careful not to introduce omitted variable bias.

---
# Checking multicollinearity

- Can use `vif(saved_results)` from the car package to check multicollinearity

> **Run `vif()` on our preferred results**

- Is multicollinearity a problem?
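---
# Multicollinearity Code

- A sketch of both detection methods, run here on the model excluding DC as one plausible "preferred" model; `cor()` checks the pairwise correlation and `vif()` comes from the car package

```r
# Correlation between the two explanatory variables
trump_vote_nodc %>%
  select(white, white_pov) %>%
  cor()

# Variance Inflation Factors; values above 10 are a red flag
vif(vote_model_nodc)
```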