class: center, middle, inverse, title-slide .title[ # PADP 7120 Data Applications in PA ] .subtitle[ ## Simple & Multiple Linear Regression ] .author[ ### Alex Combs ] .institute[ ### UGA | SPIA | PADP ] .date[ ### Last updated: February 13, 2024 ] --- # Outline - Understand the purpose of regression and its relevance to making decisions - Interpret regression coefficients - Assess goodness-of-fit for one model or between several competing models - Understand what it means to control for other variables --- # Correlation coefficients ```r select(States, -region) %>% cor() ``` ``` ## pop SATV SATM percent dollars pay ## pop 1.0000000 -0.3381028 -0.2300418 0.2100687 0.1436745 0.3677244 ## SATV -0.3381028 1.0000000 0.9620359 -0.8627954 -0.5268313 -0.5559238 ## SATM -0.2300418 0.9620359 1.0000000 -0.8581495 -0.4844477 -0.4853306 ## percent 0.2100687 -0.8627954 -0.8581495 1.0000000 0.7111474 0.6630098 ## dollars 0.1436745 -0.5268313 -0.4844477 0.7111474 1.0000000 0.8476737 ## pay 0.3677244 -0.5559238 -0.4853306 0.6630098 0.8476737 1.0000000 ``` --- # Visualizing correlation ```r ggplot(States, aes(x = dollars, y = SATM)) + geom_point() + labs(x = 'Public ed spending in $1000s per student', y = 'Average SAT Math score') + theme(text=element_text(size=18)) ``` <img src="Regression_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> --- # Correlation conclusions <img src="Regression_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> - States that spend more on public education tend to have lower average SAT Math scores among its graduating high school students - The strength of the relationship is moderate at -0.48 --- # Basic purpose of regression .pull-left[ ![](Regression_files/figure-html/unnamed-chunk-5-1.png)<!-- --> ] .pull-right[ - Measures direction, strength, and magnitude of association - Draws line of **best fit** through paired x-y data points - Slope of line tells us the **average, predicted change in y given a change in x** - Or, at any value of x, the **predicted value of y** ] --- # Basic purpose of regression - Regression is a way of modeling an outcome `\(y\)` as a function of one or more explanatory variables `\(x\)`. - Regression can be used to **explain** an outcome or **predict** an outcome, or both. --- # Adding a regression line ```r ggplot(States, aes(x = dollars, y = SATM)) + geom_point() + * geom_smooth(method = 'lm', se = FALSE) + labs(x = 'Public ed spending in $1000s per student', y = 'Average SAT Math score') ``` <img src="Regression_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- # Equation of a line `$$y = 600 - 20x$$` - If `\(x=5\)`, `\(y=?\)` - If `\(x=6\)`, `\(y=?\)` -- - If `\(x\)` increases from 5 to 6, by how much does `\(y\)` change? - If `\(x\)` increases from 2 to 5, by how much does `\(y\)` change? --- # Simple regression model (population) `$$y=\beta_0+\beta_1x+\epsilon$$` - Which represents the dependent variable? - Which represents the independent variable? - Which represents the unknown population parameters? - Which represents statistical noise? --- # Simple regression model (sample) `$$\hat{y}=b_0+b_1x$$` - Which represents the predicted outcome? - Which represents the estimates? --- # Applied example - Suppose we want to model average SAT math scores, `SATM`, in a state as a function of its spending on public education in **$1000s per student**, `dollars` `$$SATM=\beta_0+\beta_1Dollars+\epsilon$$` -- <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std_error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p_value </th> <th style="text-align:right;"> lower_ci </th> <th style="text-align:right;"> upper_ci </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> intercept </td> <td style="text-align:right;"> 560.374 </td> <td style="text-align:right;"> 16.801 </td> <td style="text-align:right;"> 33.353 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 526.610 </td> <td style="text-align:right;"> 594.137 </td> </tr> <tr> <td style="text-align:left;"> dollars </td> <td style="text-align:right;"> -12.169 </td> <td style="text-align:right;"> 3.139 </td> <td style="text-align:right;"> -3.876 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -18.478 </td> <td style="text-align:right;"> -5.860 </td> </tr> </tbody> </table> -- - Replacing parameters with estimates `$$\hat{SATM} = 560 - 12.17 \times Dollars$$` --- # Example: Reporting results `$$\hat{SATM} = 560 - 12.17 \times Dollars$$` On average, a one thousand dollar increase in state spending on public education per student is associated with a 12 point decline in average SAT math scores. -- - What is the predicted increase in `SATM` if a state increased `Dollars` by $2,500 per student? -- - What is the predicted `SATM` in a state that spends $5,000 per student? -- - What is the predicted `SATM` in a state that spends $0 per student? --- # Another example - Now `SATM` as a function of the percent of graduating high school students who take the SAT, `percent` `$$SATM=\beta_0+\beta_1Percent+\epsilon$$` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std_error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p_value </th> <th style="text-align:right;"> lower_ci </th> <th style="text-align:right;"> upper_ci </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> intercept </td> <td style="text-align:right;"> 538.975 </td> <td style="text-align:right;"> 4.351 </td> <td style="text-align:right;"> 123.870 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 530.231 </td> <td style="text-align:right;"> 547.719 </td> </tr> <tr> <td style="text-align:left;"> percent </td> <td style="text-align:right;"> -1.232 </td> <td style="text-align:right;"> 0.105 </td> <td style="text-align:right;"> -11.701 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -1.444 </td> <td style="text-align:right;"> -1.021 </td> </tr> </tbody> </table> `$$\hat{SATM} = 539 - 1.23 \times Percent$$` - Now we can consider any hypothetical *change* in percent or specific *value* of percent and provide a prediction for SAT math scores. --- # Change vs. value `$$\hat{SATM} = 539 - 1.23 \times Percent$$` .pull-left[ **Predicted Change** What if percent taking SAT increased from 50% to 55%? ] -- .pull-right[ **Predicted Value** What is the predicted average SAT math score for a state where 75% of high school graduates take the SAT? ] --- # More practice `$$\hat{SATM} = 539 - 1.23 \times Percent$$` - What is the predicted average SAT math score in a state where 25% of its high school graduates take the SAT? - What is the predicted average SAT math score in a state where 100% of its high school graduates take the SAT? - What is the predicted change in average SAT math score in a state where the percent of high school graduates taking the SAT increases 75 percentage points? --- class: inverse, middle, center # Running regressions in R --- # Running regression - Explore data ```r glimpse(States) ``` ``` ## Rows: 51 ## Columns: 7 ## $ region <fct> ESC, PAC, MTN, WSC, PAC, MTN, NE, SA, SA, SA, SA, PAC, MTN, EN… ## $ pop <int> 4041, 550, 3665, 2351, 29760, 3294, 3287, 666, 607, 12938, 647… ## $ SATV <int> 470, 438, 445, 470, 419, 456, 430, 433, 409, 418, 401, 404, 46… ## $ SATM <int> 514, 476, 497, 511, 484, 513, 471, 470, 441, 466, 443, 481, 50… ## $ percent <int> 8, 42, 25, 6, 45, 28, 74, 58, 68, 44, 57, 52, 17, 16, 54, 5, 1… ## $ dollars <dbl> 3.648, 7.887, 4.231, 3.334, 4.826, 4.809, 7.914, 6.016, 8.210,… ## $ pay <int> 27, 43, 30, 23, 39, 31, 43, 35, 39, 30, 29, 32, 25, 34, 32, 28… ``` --- # Running regression ```r summary(States) ``` ``` ## pop SATV SATM percent ## Min. : 454 Min. :397.0 Min. :437.0 Min. : 4.00 ## 1st Qu.: 1215 1st Qu.:422.5 1st Qu.:470.0 1st Qu.:11.50 ## Median : 3294 Median :443.0 Median :490.0 Median :25.00 ## Mean : 4877 Mean :448.2 Mean :497.4 Mean :33.75 ## 3rd Qu.: 5780 3rd Qu.:474.5 3rd Qu.:522.5 3rd Qu.:57.50 ## Max. :29760 Max. :511.0 Max. :577.0 Max. :74.00 ## dollars pay ## Min. :2.993 Min. :22.00 ## 1st Qu.:4.354 1st Qu.:27.50 ## Median :5.045 Median :30.00 ## Mean :5.175 Mean :30.94 ## 3rd Qu.:5.689 3rd Qu.:33.50 ## Max. :9.159 Max. :43.00 ``` --- # Running regression ```r satm_mod1 <- lm(SATM ~ dollars, data = States) satm_mod2 <- lm(SATM ~ percent, data = States) ``` ```r library(moderndive) get_regression_table(satm_mod1) ``` |term | estimate| std_error| statistic| p_value| lower_ci| upper_ci| |:---------|--------:|---------:|---------:|-------:|--------:|--------:| |intercept | 560.374| 16.801| 33.353| 0| 526.610| 594.137| |dollars | -12.169| 3.139| -3.876| 0| -18.478| -5.860| --- class: inverse, center, middle # Regression Residual --- # Recap of population and sample regressions **Population Model** `$$y=\beta_0+\beta_1x+\epsilon$$` `$$SATM=\beta_0+\beta_1Dollars+\epsilon$$` **Sample Model** `$$\hat{y}=b_0+b_1x$$` `$$\hat{SATM} = 560 - 12.17 \times Dollars$$` - What happens to the error term, `\(\epsilon\)`, when we move to the sample model? --- # The residual `$$e=y-\hat{y}$$` - The residual is the difference between the *observed* outcome `\(y\)` and the *predicted* outcome `\(\hat{y}\)` in our regression model - The residual is the sample statistic of the population parameter `\(\epsilon\)` <img src="Regression_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- # The residual - What might be contained within the error term and residual? - Measurement error (bad) - Mistakes in how we model `\(y\)` as a function of `\(x\)` (bad) - Violations of the assumptions that enable credible regression (bad) - Other variables that associate with or impact the outcome (sometimes OK) - Randomness of social phenomena (OK) - Severity of the problem can depend on the purpose of the regression. --- # Residuals `$$\hat{SATM} = 560 - 12.17 \times Dollars$$` - What is the predicted `SATM` for a state where `dollars` equals $7,887? - How about a state where `dollars` equals $4,231? -- - If there are `SATM` scores we can observe for states with these specific values for `dollars`, then we can check by how much our regression model overestimates or underestimates `SATM`. --- # Examining residuals ```r get_regression_points(satm_mod1) ``` - Regression results for each observation - Shows observed data, predicted outcomes, and residuals <table> <thead> <tr> <th style="text-align:right;"> ID </th> <th style="text-align:right;"> SATM </th> <th style="text-align:right;"> dollars </th> <th style="text-align:right;"> SATM_hat </th> <th style="text-align:right;"> residual </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 514 </td> <td style="text-align:right;"> 3.648 </td> <td style="text-align:right;"> 515.980 </td> <td style="text-align:right;"> -1.980 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 476 </td> <td style="text-align:right;"> 7.887 </td> <td style="text-align:right;"> 464.395 </td> <td style="text-align:right;"> 11.605 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 497 </td> <td style="text-align:right;"> 4.231 </td> <td style="text-align:right;"> 508.886 </td> <td style="text-align:right;"> -11.886 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 511 </td> <td style="text-align:right;"> 3.334 </td> <td style="text-align:right;"> 519.802 </td> <td style="text-align:right;"> -8.802 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 484 </td> <td style="text-align:right;"> 4.826 </td> <td style="text-align:right;"> 501.645 </td> <td style="text-align:right;"> -17.645 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 513 </td> <td style="text-align:right;"> 4.809 </td> <td style="text-align:right;"> 501.852 </td> <td style="text-align:right;"> 11.148 </td> </tr> </tbody> </table> --- class: inverse, center, middle # Goodness-of-fit --- # Ordinary least squares (OLS) - Regression draws the **line of best fit** through observed data. What does best fit mean? - Simple linear regression is also known as ordinary least squares (OLS) - Regression uses calculus and matrix algebra to determine the intercept and slope of a line that minimizes the sum of squared residuals (SSR) --- # OLS .pull-left[ ![](Regression_files/figure-html/unnamed-chunk-19-1.png)<!-- --> ] .pull-right[ <table> <thead> <tr> <th style="text-align:right;"> ID </th> <th style="text-align:right;"> SATM </th> <th style="text-align:right;"> SATM_hat </th> <th style="text-align:right;"> residual </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 514 </td> <td style="text-align:right;"> 515.980 </td> <td style="text-align:right;"> -1.980 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 476 </td> <td style="text-align:right;"> 464.395 </td> <td style="text-align:right;"> 11.605 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 497 </td> <td style="text-align:right;"> 508.886 </td> <td style="text-align:right;"> -11.886 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 511 </td> <td style="text-align:right;"> 519.802 </td> <td style="text-align:right;"> -8.802 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 484 </td> <td style="text-align:right;"> 501.645 </td> <td style="text-align:right;"> -17.645 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 513 </td> <td style="text-align:right;"> 501.852 </td> <td style="text-align:right;"> 11.148 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 471 </td> <td style="text-align:right;"> 464.067 </td> <td style="text-align:right;"> 6.933 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 470 </td> <td style="text-align:right;"> 487.164 </td> <td style="text-align:right;"> -17.164 </td> </tr> </tbody> </table> ] --- # OLS <table> <thead> <tr> <th style="text-align:right;"> ID </th> <th style="text-align:right;"> SATM </th> <th style="text-align:right;"> dollars </th> <th style="text-align:right;"> SATM_hat </th> <th style="text-align:right;"> residual </th> <th style="text-align:right;"> sq_resid </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 514 </td> <td style="text-align:right;"> 3.648 </td> <td style="text-align:right;"> 515.980 </td> <td style="text-align:right;"> -1.980 </td> <td style="text-align:right;"> 3.9204 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 476 </td> <td style="text-align:right;"> 7.887 </td> <td style="text-align:right;"> 464.395 </td> <td style="text-align:right;"> 11.605 </td> <td style="text-align:right;"> 134.6760 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 497 </td> <td style="text-align:right;"> 4.231 </td> <td style="text-align:right;"> 508.886 </td> <td style="text-align:right;"> -11.886 </td> <td style="text-align:right;"> 141.2770 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 511 </td> <td style="text-align:right;"> 3.334 </td> <td style="text-align:right;"> 519.802 </td> <td style="text-align:right;"> -8.802 </td> <td style="text-align:right;"> 77.4752 </td> </tr> </tbody> </table> ```r sum(satm_mod1_points$sq_resid) ``` ``` ## [1] 45727.36 ``` - No other line will achieve a sum of squared residuals less than 45,727 - But this tells us nothing of *how good* this best fit is --- # Goodness-of-fit - When we ask how good is our best regression line, one way to answer is the **percent of total variation in the outcome that is explained by our model** -- - `\(R^2\)` (r-squared) - Ranges from 0 to 1 - The percent of variation in `\(y\)` explained by our regression model --- # Goodness of fit - Model 1 using `dollars` ```r get_regression_summaries(satm_mod1) ``` <table> <thead> <tr> <th style="text-align:right;"> r_squared </th> <th style="text-align:right;"> adj_r_squared </th> <th style="text-align:right;"> rmse </th> <th style="text-align:right;"> sigma </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.235 </td> <td style="text-align:right;"> 0.219 </td> <td style="text-align:right;"> 29.94355 </td> <td style="text-align:right;"> 30.549 </td> </tr> </tbody> </table> The regression using dollars explains 24% of total variation in average SAT math scores. --- # Goodness of fit - Model 2 using `percent` ```r get_regression_summaries(satm_mod2) ``` <table> <thead> <tr> <th style="text-align:right;"> r_squared </th> <th style="text-align:right;"> adj_r_squared </th> <th style="text-align:right;"> rmse </th> <th style="text-align:right;"> sigma </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.736 </td> <td style="text-align:right;"> 0.731 </td> <td style="text-align:right;"> 17.57277 </td> <td style="text-align:right;"> 17.928 </td> </tr> </tbody> </table> - What percent of variation in average SAT math scores is explained by model 2? -- - Model 2 has the higher `\(R^2\)`. It is *statistically* better at explaining math scores. --- # Goodness of fit .pull-left[ <img src="Regression_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" /> ] .pull-right[ <img src="Regression_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" /> ] - Note that the data points in model 2 (right) are closer to the regression line -> more explained variance in `SATM` -> higher `\(R^2\)` --- # Goodness of fit - `\(R^2\)` may not be useful for a general audience and it has limited practical application for predictions. -- - Root mean squared error (**RMSE**) answers the question: "How far off is our model, on average, from observed outcomes? If we predict an outcome, what is the typical range of inaccuracy? -- - RMSE is the regression version of standard deviation; provides the average deviation from the regression line - The lower the RMSE the better our model is at fitting the data --- # Goodness of fit Model 1 using dollars <table> <thead> <tr> <th style="text-align:right;"> r_squared </th> <th style="text-align:right;"> adj_r_squared </th> <th style="text-align:right;"> rmse </th> <th style="text-align:right;"> sigma </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.235 </td> <td style="text-align:right;"> 0.219 </td> <td style="text-align:right;"> 29.94355 </td> <td style="text-align:right;"> 30.549 </td> </tr> </tbody> </table> Model 2 using percent <table> <thead> <tr> <th style="text-align:right;"> r_squared </th> <th style="text-align:right;"> adj_r_squared </th> <th style="text-align:right;"> rmse </th> <th style="text-align:right;"> sigma </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.736 </td> <td style="text-align:right;"> 0.731 </td> <td style="text-align:right;"> 17.57277 </td> <td style="text-align:right;"> 17.928 </td> </tr> </tbody> </table> -- .pull-left[ - Model 1 predictions are off by about 30 points, on average. - What about model 2? ] -- .pull-right[ - Model 2 predictions are off by about 18 points, on average. - Model 2 is *statistically* better at predicting math. ] --- class: inverse, center, middle # Multiple regression --- # Multiple regression - Multiple regression adds more than one explanatory variable to the model. -- **Population model** `$$y=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k+\epsilon$$` **Sample model** `$$\hat{y}=b_0+b_1x_1+b_2x_2+\dots+b_kx_k$$` -- - This isolates the association between the outcome and each explanatory variable, holding all other variables constant at their respective means. --- # Controlling for other factors - The two previous regressions each controlled for one variable we might theorize affects average state SAT math scores - There is any number of additional variables contained in the error term `\(\epsilon\)` - If we can, we should include those additional variables in the same regression model - Especially if those variables might affect the explanatory variable currently in our model --- # Making sense of our results - We found states spending more on education tend to have lower SAT math scores, and states with a higher percent of graduates taking the SAT tend to have lower SAT math scores. What may be going on? .pull-left[ <img src="Regression_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" /> ] .pull-right[ <img src="Regression_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" /> ] --- # Making sense of our results - Relationship between spending and SAT participation <img src="Regression_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" /> --- # Running multiple regression `$$SATM=\beta_0+\beta_1Dollars+\beta_2Percent+\epsilon$$` ```r satm_mod3 <- lm(SATM ~ dollars + percent, data = States) ``` ```r get_regression_table(satm_mod3) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std_error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p_value </th> <th style="text-align:right;"> lower_ci </th> <th style="text-align:right;"> upper_ci </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> intercept </td> <td style="text-align:right;"> 514.652 </td> <td style="text-align:right;"> 10.299 </td> <td style="text-align:right;"> 49.969 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 493.944 </td> <td style="text-align:right;"> 535.360 </td> </tr> <tr> <td style="text-align:left;"> dollars </td> <td style="text-align:right;"> 6.395 </td> <td style="text-align:right;"> 2.482 </td> <td style="text-align:right;"> 2.577 </td> <td style="text-align:right;"> 0.013 </td> <td style="text-align:right;"> 1.405 </td> <td style="text-align:right;"> 11.384 </td> </tr> <tr> <td style="text-align:left;"> percent </td> <td style="text-align:right;"> -1.492 </td> <td style="text-align:right;"> 0.142 </td> <td style="text-align:right;"> -10.519 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> -1.777 </td> <td style="text-align:right;"> -1.207 </td> </tr> </tbody> </table> `$$SATM=515+6.4 \times Dollars - 1.5 \times Percent+\epsilon$$` --- # Reporting results `$$SATM=515+6.4 \times Dollars - 1.5 \times Percent+\epsilon$$` On average, a one thousand dollar increase in spending per pupil is associated with 6.4 point increase in average state SAT math scores, **controlling for the percent of students taking the SAT**. --- # Interpreting results <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std_error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p_value </th> <th style="text-align:right;"> lower_ci </th> <th style="text-align:right;"> upper_ci </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> intercept </td> <td style="text-align:right;"> 514.652 </td> <td style="text-align:right;"> 10.299 </td> <td style="text-align:right;"> 49.969 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 493.944 </td> <td style="text-align:right;"> 535.360 </td> </tr> <tr> <td style="text-align:left;"> dollars </td> <td style="text-align:right;"> 6.395 </td> <td style="text-align:right;"> 2.482 </td> <td style="text-align:right;"> 2.577 </td> <td style="text-align:right;"> 0.013 </td> <td style="text-align:right;"> 1.405 </td> <td style="text-align:right;"> 11.384 </td> </tr> <tr> <td style="text-align:left;"> percent </td> <td style="text-align:right;"> -1.492 </td> <td style="text-align:right;"> 0.142 </td> <td style="text-align:right;"> -10.519 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> -1.777 </td> <td style="text-align:right;"> -1.207 </td> </tr> </tbody> </table> `$$SATM=515+6.4 \times Dollars - 1.5 \times Percent+\epsilon$$` - Controlling for the percent of graduates taking the SAT, what is the predicted change in SAT math scores if states increase public education spending by $5,000 per student? --- # Interpreting results <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std_error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p_value </th> <th style="text-align:right;"> lower_ci </th> <th style="text-align:right;"> upper_ci </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> intercept </td> <td style="text-align:right;"> 514.652 </td> <td style="text-align:right;"> 10.299 </td> <td style="text-align:right;"> 49.969 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 493.944 </td> <td style="text-align:right;"> 535.360 </td> </tr> <tr> <td style="text-align:left;"> dollars </td> <td style="text-align:right;"> 6.395 </td> <td style="text-align:right;"> 2.482 </td> <td style="text-align:right;"> 2.577 </td> <td style="text-align:right;"> 0.013 </td> <td style="text-align:right;"> 1.405 </td> <td style="text-align:right;"> 11.384 </td> </tr> <tr> <td style="text-align:left;"> percent </td> <td style="text-align:right;"> -1.492 </td> <td style="text-align:right;"> 0.142 </td> <td style="text-align:right;"> -10.519 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> -1.777 </td> <td style="text-align:right;"> -1.207 </td> </tr> </tbody> </table> `$$SATM=515+6.4 \times Dollars - 1.5 \times Percent+\epsilon$$` What is the predicted average SAT math score for a state that spends $5,000 per student, and has 50% of high school graduates take the SAT? --- class: inverse, middle, center # Connecting results to policy decisions --- # Policy Decision - Let's ignore inference for now and treat these results as if they are statistically meaningful. - More spending increases average SAT math scores among states, **controlling for percent of high school grads taking SAT** - But spending seems to increase the percent of SAT participation, which then decreases average SAT math scores - Let's quantify the effect of spending on SAT participation --- # Spending and SAT participation ```r satm_mod4 <- lm(percent ~ dollars, data = States) ``` ```r get_regression_table(satm_mod4) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std_error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p_value </th> <th style="text-align:right;"> lower_ci </th> <th style="text-align:right;"> upper_ci </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> intercept </td> <td style="text-align:right;"> -30.64 </td> <td style="text-align:right;"> 9.403 </td> <td style="text-align:right;"> -3.259 </td> <td style="text-align:right;"> 0.002 </td> <td style="text-align:right;"> -49.536 </td> <td style="text-align:right;"> -11.744 </td> </tr> <tr> <td style="text-align:left;"> dollars </td> <td style="text-align:right;"> 12.44 </td> <td style="text-align:right;"> 1.757 </td> <td style="text-align:right;"> 7.081 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 8.910 </td> <td style="text-align:right;"> 15.971 </td> </tr> </tbody> </table> - A $1,000 increase in spending per pupil increases the percent of grads taking the SAT by 12.4 *percentage points*. --- # Policy Decision - Our first model found $1,000 increase in spending is associated with -12.2 points in SAT math. -- - In the multiple regression model, a $1,000 increase in spending per student: - Directly increases average SAT math scores 6.4 points - Directly increases participation 12.4 percentage points - Increase of 12.4 p.p. in participation reduces average SAT math scores 18.6 points -- - How might these results inform a state policy decision concerning education spending and college entrance exam participation? --- class: inverse, middle, center # Goodness-of-fit with multiple regression --- # Goodness-of-fit **Model 2 using only percent** <table> <thead> <tr> <th style="text-align:right;"> r_squared </th> <th style="text-align:right;"> adj_r_squared </th> <th style="text-align:right;"> rmse </th> <th style="text-align:right;"> sigma </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.736 </td> <td style="text-align:right;"> 0.731 </td> <td style="text-align:right;"> 17.57277 </td> <td style="text-align:right;"> 17.928 </td> </tr> </tbody> </table> **Model 3 using dollars and percent** <table> <thead> <tr> <th style="text-align:right;"> r_squared </th> <th style="text-align:right;"> adj_r_squared </th> <th style="text-align:right;"> mse </th> <th style="text-align:right;"> rmse </th> <th style="text-align:right;"> sigma </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.768 </td> <td style="text-align:right;"> 0.759 </td> <td style="text-align:right;"> 271.2768 </td> <td style="text-align:right;"> 16.47048 </td> <td style="text-align:right;"> 16.977 </td> </tr> </tbody> </table> - Based on `\(R^2\)`, model 3 appears to be the better model, however... --- # Goodness of fit - `\(R^2\)` mechanically increases as you add more variables whether or not they improve the model - Therefore, we should not use `\(R^2\)` to compare models with different numbers of explanatory variables -- - **Adjusted `\(R^2\)` ** accounts for this, adjusting based on how useful each additional variable is at explaining the outcome - Model 3 still has the higher adjusted `\(R^2\)` so it remains the statistically better model - And the RMSE agrees; virtually always the case