class: center, middle, inverse, title-slide

.title[
# PADP 7120 Data Applications in PA
]
.subtitle[
## Regression with Nonlinear Variables
]
.author[
### Alex Combs
]
.institute[
### UGA | SPIA | PADP
]
.date[
### Last updated: April 02, 2024
]

---

# Outline

- Why use nonlinear variables
- How to interpret coefficients for nonlinear variables
- Assess goodness of fit

---

# Why use nonlinear variables

- The relationship between `\(x\)` and `\(y\)` may not follow a linear path

<img src="Regression-with-nonlinears_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

# Common nonlinear variables

- Quadratic model
- Logarithmic transformations
  - Log-log
  - Log-level
  - Level-log

---

# Quadratic model

`$$y=\beta_0+\beta_1x_1+\beta_2x_1^2+\beta_3x_2+...+\beta_kx_{k-1}+\epsilon$$`

- `\(x_1\)` appears twice in the regression model above
- The regression model includes a linear term, `\(x_1\)`, and a quadratic term, `\(x_1^2\)`
- Any variable in the regression model could be modeled as quadratic

`$$y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_2^2+...+\beta_kx_{k-1}+\epsilon$$`

---

# Quadratic model

`$$y=\beta_0+\beta_1x_1+\beta_2x_1^2+\beta_3x_2+...+\beta_kx_{k-1}+\epsilon$$`

<img src="lectures_files/quadratic.png" width="420" style="display: block; margin: auto;" />

.pull-left[
If the relationship between `\(x\)` and `\(y\)` is initially negative and eventually positive, the coefficient on the linear term, `\(b_1\)`, will be negative and the coefficient on the quadratic term, `\(b_2\)`, will be positive.
]

.pull-right[
If the relationship between `\(x\)` and `\(y\)` is initially positive and eventually negative, the coefficient on the linear term, `\(b_1\)`, will be positive and the coefficient on the quadratic term, `\(b_2\)`, will be negative.
]

---

# Quadratic regression

- **Population model:**

`$$y=\beta_0+\beta_1x_1+\beta_2x_1^2+\epsilon$$`

- We run the regression model and obtain estimates for our sample model

`$$\hat{y}=b_0+b_1x_1+b_2x_1^2$$`

- Can then answer the same two questions:
  - Predicted change in `\(y\)` given a unit-change in `\(x\)`?
  - Predicted value of `\(y\)` given a value of `\(x\)`?
- Third question unique to a quadratic relationship:
  - At what value of `\(x\)` is `\(y\)` at its predicted maximum or minimum?

---

# Linear vs. Quadratic unit change

**Linear**

`$$\hat{y}=b_0+b_1x_1$$`

- What is the change in `\(y\)` from a unit-change in `\(x_1\)`?

`$$\Delta \hat{y} = b_1$$`

- Unit change is constant. It does not depend on the starting value of `\(x_1\)`

---

# Linear vs. Quadratic unit change

**Quadratic**

`$$\hat{y}=b_0+b_1x_1+b_2x_1^2$$`

- What is the change in `\(y\)` from a unit-change in `\(x_1\)`?

`$$\Delta \hat{y} = b_1+2b_2x_1$$`

- Depends on the starting value of `\(x_1\)`
- This is the slope of the curve at `\(x_1\)`, so it approximates the change in `\(y\)` from a one-unit increase

---

# Finding Max or Min

`$$\hat{y}=b_0+b_1x_1+b_2x_1^2$$`

- At what value of `\(x_1\)` is `\(y\)` at its maximum or minimum?

`$$x_1 = \frac{-b_1}{2b_2}$$`

- The turning point is a maximum if the quadratic coefficient is negative and a minimum if it is positive.
- Remember, any variable in the model could be quadratic.

`$$\hat{y}=b_0+b_1x_1+b_2x_2+b_3x_2^2$$`

`$$x_2 = \frac{-b_2}{2b_3}$$`

---

# Quadratic regression

`$$\hat{y}= 100 + 10x - 1x^2$$`

- What is the marginal effect of `\(x\)` at `\(x=3\)`?
  - Marginal effect is another term for the change in `\(y\)` given a small change in `\(x\)`
  - Note we need a starting value for `\(x\)` to answer this question

--

`$$\Delta \hat{y} = b_1+2b_2x_1$$`

`$$\Delta \hat{y}=10+2\times (-1) \times 3=10-6=4$$`

- At `\(x=3\)`, `\(y\)` is predicted to increase about 4 units given a one-unit increase in `\(x\)`, on average.

---

# Quadratic regression

`$$\hat{y}= 100 + 10x - 1x^2$$`

- At what value of `\(x\)` is `\(y\)` at its maximum?

`$$x = \frac{-b_1}{2b_2}$$`

`$$x = \frac{-10}{2(-1)}=5$$`

- On average, `\(y\)` is predicted to be at its maximum when `\(x=5\)`.
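---

# Checking the worked example in R

A minimal sketch, not part of the original example, that plugs the coefficients 100, 10, and -1 from the worked example into two small functions. The helper names `predict_y()` and `marginal_effect()` are made up for illustration.

```r
# Coefficients from the worked example: y-hat = 100 + 10x - 1x^2
b0 <- 100
b1 <- 10
b2 <- -1

# Hypothetical helper: predicted y at a given x
predict_y <- function(x) b0 + b1 * x + b2 * x^2

# Hypothetical helper: marginal effect (slope of the curve) at a given x
marginal_effect <- function(x) b1 + 2 * b2 * x

me_at_3      <- marginal_effect(3)            # slope at x = 3
exact_change <- predict_y(4) - predict_y(3)   # actual change from x = 3 to x = 4
x_star       <- -b1 / (2 * b2)                # turning point of the parabola

print(c(marginal_effect = me_at_3, exact_change = exact_change, turning_point = x_star))
```

- The slope at `\(x=3\)` is 4, while the exact change from `\(x=3\)` to `\(x=4\)` is 3. The marginal effect is an approximation that works best for small changes in `\(x\)`.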
---

class: inverse, center, middle

# Quadratic regression example using R

---

# Data

```
##       Wage            Educ            Age
##  Min.   : 6.93   Min.   : 6.00   Min.   :18.00
##  1st Qu.:19.14   1st Qu.:10.00   1st Qu.:34.75
##  Median :24.98   Median :14.00   Median :51.00
##  Mean   :24.93   Mean   :13.85   Mean   :49.49
##  3rd Qu.:30.57   3rd Qu.:17.00   3rd Qu.:65.25
##  Max.   :43.44   Max.   :22.00   Max.   :77.00
```

- Note that the average age is about 49
- When interested in the marginal effect of a quadratic variable, it is common to evaluate it near the average of `\(x\)`

---

# Viz

<img src="Regression-with-nonlinears_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

# Quadratic regression

`$$Wage=\beta_0+\beta_1Educ+\beta_2Age+\beta_3Age^2+\epsilon$$`

```r
# I() makes R treat Age^2 as arithmetic rather than formula syntax
quad <- lm(Wage ~ Educ + Age + I(Age^2), data = wages)
```

```r
library(moderndive)  # provides get_regression_table()
get_regression_table(quad)
```

|term      | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:---------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept |  -22.722|     3.023|    -7.517|       0|  -28.742|  -16.701|
|Educ      |    1.254|     0.090|    13.990|       0|    1.075|    1.432|
|Age       |    1.350|     0.134|    10.077|       0|    1.083|    1.617|
|I(Age^2)  |   -0.013|     0.001|    -9.840|       0|   -0.016|   -0.011|

`$$\hat{Wage}=-22.7+1.25*Educ+1.35*Age-0.013*Age^2$$`

---

# Interpretation

`$$\hat{Wage}=-22.7+1.25*Educ+1.35*Age-0.013*Age^2$$`

- What is the marginal effect of age on wage at 49 years old?
- At what age are wages at their maximum?
- What is the predicted wage for a 26-year-old with 16 years of education?
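---

# Interpretation in R

One way to work through the three questions, sketched under the assumption that we only have the rounded estimates printed in the regression table. The vector `b` and the helpers `me_age()` and `predict_wage()` are hypothetical names, not part of the original analysis.

```r
# Rounded estimates copied from the regression table
b <- c(intercept = -22.722, educ = 1.254, age = 1.350, age2 = -0.013)

# Marginal effect of age at a given age: b_age + 2 * b_age2 * age
me_age <- function(age) unname(b["age"] + 2 * b["age2"] * age)

# Age at which predicted wage peaks: -b_age / (2 * b_age2)
peak_age <- unname(-b["age"] / (2 * b["age2"]))

# Hypothetical helper: predicted wage for given education and age
predict_wage <- function(educ, age) {
  unname(b["intercept"] + b["educ"] * educ + b["age"] * age + b["age2"] * age^2)
}

me_age(49)                          # marginal effect of age at 49
peak_age                            # age of maximum predicted wage
predict_wage(educ = 16, age = 26)   # predicted wage
```

- The `Age^2` estimate is rounded to three decimals, so these results can differ noticeably from what the unrounded coefficients stored in the fitted model would give.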
---

class: inverse, center, middle

# Log Transformations

---

# Why use log transformation

- Transform skewed data so the regression line fits better
- Express change in percentages instead of units
- Reflect our hypotheses regarding the relationship between two variables
  - The relationship between `\(x\)` and `\(y\)` is not constant, but is always either positive or negative (not quadratic)

---

# Transform skewed data

<img src="Regression-with-nonlinears_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

# Transform skewed data

<img src="Regression-with-nonlinears_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

# Log-log Model

`$$ln(y)=\beta_0+\beta_1ln(x_1)+...+\beta_kx_k+\epsilon$$`

`$$ln(\hat{y})=b_0+b_1ln(x_1)+...+b_kx_k$$`

- Might use this if the relationship between `\(x\)` and `\(y\)` begins to plateau as `\(x\)` increases
- As `\(x_1\)` increases 1%, `\(y\)` changes by about `\(b_1\)`%
- Estimates an **elasticity:** percent change in `\(y\)` given a percent change in `\(x\)`

---

# Log-log model in R

`$$ln(LifeExp)=\beta_0+\beta_1ln(GDPperCap)+\beta_2Continent+\epsilon$$`

```r
library(gapminder)  # provides the gapminder data
log_log <- lm(log(lifeExp) ~ log(gdpPercap) + continent, data = gapminder)
```

---

# Log-log model in R

```r
library(dplyr)  # provides %>% and select()
get_regression_table(log_log) %>% select(term, estimate)
```

.pull-left[

|term                | estimate|
|:-------------------|--------:|
|intercept           |    3.062|
|log(gdpPercap)      |    0.112|
|continent: Americas |    0.133|
|continent: Asia     |    0.110|
|continent: Europe   |    0.166|
|continent: Oceania  |    0.152|

]

.pull-right[
- How do we interpret the `log(gdpPercap)` coefficient?
- On average, life expectancy is predicted to increase 0.11% with each 1% increase in GDP per capita, all else equal.
]

---

# Level-log Model

`$$y=\beta_0+\beta_1ln(x_1)+...+\beta_kx_k+\epsilon$$`

`$$\hat{y}=b_0+b_1ln(x_1)+...+b_kx_k$$`

- Might use this if only `\(x\)` is skewed.
---

# Interpreting Level-log Model

`$$\hat{y}=b_0+b_1ln(x_1)+...+b_kx_k$$`

- As `\(x_1\)` increases 1%, `\(y\)` changes by about `\(\frac{b_1}{100}\)` units
- Remember, any variable that is log-transformed will be expressed in percent change. `\(y\)` has not been transformed, so it is expressed in unit change.
- Can also consider if `\(x\)` doubles. This is equivalent to `\(x\)` increasing 100%, which is too large a change for the `\(\frac{b_1}{100}\)` approximation. Then,
  - As `\(x\)` doubles, `\(y\)` changes by `\(b_1 \times ln(2) \approx 0.69 \times b_1\)` units

---

# Level-log Model in R

`$$LifeExp=\beta_0+\beta_1ln(GDPperCap)+\beta_2Continent+\epsilon$$`

```r
lev_log <- lm(lifeExp ~ log(gdpPercap) + continent, data = gapminder)
```

---

# Level-log Model in R

```r
get_regression_table(lev_log) %>% select(term, estimate)
```

.pull-left[

|term                | estimate|
|:-------------------|--------:|
|intercept           |    2.317|
|log(gdpPercap)      |    6.422|
|continent: Americas |    7.015|
|continent: Asia     |    5.912|
|continent: Europe   |    9.577|
|continent: Oceania  |    9.213|

]

.pull-right[
- How do we interpret the `log(gdpPercap)` coefficient?
- On average, as GDP per capita increases 1%, life expectancy is predicted to increase `\(\frac{6.42}{100}=0.0642\)` years, all else equal.
- Or a doubling of GDP per capita is associated with a `\(6.42 \times ln(2) \approx 4.45\)` year increase in life expectancy
]

---

# Log-level Model (Exponential Model)

`$$ln(y)=\beta_0+\beta_1x_1+...+\beta_kx_k+\epsilon$$`

`$$ln(\hat{y})=b_0+b_1x_1+...+b_kx_k$$`

- Might use this if only `\(y\)` is skewed
- Or we believe the effect of `\(x\)` on `\(y\)` grows as `\(x\)` increases.
- As `\(x_1\)` increases 1 unit, `\(y\)` changes by about ( `\(b_1 \times 100\)` ) percent
- Now the change in `\(y\)` is expressed in percent because it was log-transformed. `\(x\)` was not, so it is still expressed in unit changes.
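---

# How good is the log-level rule?

A rough sketch comparing the `\(b_1 \times 100\)` rule of thumb with the exact percent change implied by a log-level model, `\((e^{b_1}-1)\times 100\)`. The two coefficient values below are hypothetical and chosen only for illustration.

```r
# Hypothetical log-level coefficients, chosen only for illustration
b1 <- c(small = 0.05, large = 0.50)

approx_pct <- b1 * 100             # rule of thumb: b1 x 100 percent
exact_pct  <- (exp(b1) - 1) * 100  # exact percent change implied by the model

round(rbind(approx_pct, exact_pct), 2)
```

- The rule of thumb is close when the coefficient is small and drifts away from the exact value as the coefficient grows.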
---

# Log-level Model in R

`$$ln(LifeExp)=\beta_0+\beta_1GDPperCap+\beta_2Continent+\epsilon$$`

```r
log_lev <- lm(log(lifeExp) ~ gdpPercap + continent, data = gapminder)
```

---

# Log-level Model in R

```r
get_regression_table(log_lev) %>% select(term, estimate)
```

.pull-left[

|term                | estimate|
|:-------------------|--------:|
|intercept           |    3.856|
|gdpPercap           |    0.000|
|continent: Americas |    0.250|
|continent: Asia     |    0.160|
|continent: Europe   |    0.311|
|continent: Oceania  |    0.316|

]

.pull-right[
- The `gdpPercap` estimate rounds to 0 in the table. The estimate equals 0.0004. How do we interpret `gdpPercap`?
- As GDP per capita increases 1 dollar, life expectancy is predicted to increase `\(0.0004 \times 100 = 0.04\)` percent, all else equal
]

---

# Choosing the best model

- Can't compare models that use `\(ln(y)=\dots\)` to models that use `\(y=\dots\)` based on `\(R^2\)` or `\(RMSE\)`.
  - Can compare log-log and log-level
  - Can compare level-level and level-log
- Converting goodness of fit measures to be comparable is beyond the scope of this course.
- Instead, use scatter plots and your best judgment to decide between `\(ln(y)=\dots\)` and `\(y=\dots\)`

---

# Recap

- **Quadratic**: `\(b_1+2b_2x\)` unit change in `\(y\)` given a 1-unit increase in `\(x\)`
- **Max or min**: `\(y\)` is at its max or min at `\(x = \frac{-b_1}{2b_2}\)`
- **Log-log**: `\(b\)` percent change in `\(y\)` given a 1% increase in `\(x\)`
- **Level-log**: `\(\frac{b}{100}\)` unit change in `\(y\)` given a 1% increase in `\(x\)`
- **Log-level**: `\(b \times 100\)` percent change in `\(y\)` given a 1-unit increase in `\(x\)`
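---

# Checking the log rules in R

A small sketch, assuming only the rounded estimates shown on the earlier slides, that compares each rule of thumb with the exact change implied by the model's functional form. The variable names are made up for illustration.

```r
# Estimates copied from the slides
b_loglog <- 0.112    # log(lifeExp) ~ log(gdpPercap)
b_levlog <- 6.422    # lifeExp ~ log(gdpPercap)
b_loglev <- 0.0004   # log(lifeExp) ~ gdpPercap

# Log-log: percent change in y for a 1% increase in x
loglog_rule  <- b_loglog
loglog_exact <- (1.01^b_loglog - 1) * 100

# Level-log: unit change in y for a 1% increase in x
levlog_rule  <- b_levlog / 100
levlog_exact <- b_levlog * log(1.01)

# Log-level: percent change in y for a 1-unit increase in x
loglev_rule  <- b_loglev * 100
loglev_exact <- (exp(b_loglev) - 1) * 100

data.frame(
  model = c("log-log", "level-log", "log-level"),
  rule  = c(loglog_rule, levlog_rule, loglev_rule),
  exact = c(loglog_exact, levlog_exact, loglev_exact)
)
```

- For changes this small, the rule and the exact value agree to about two significant digits in all three models.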