PADP 7120 Data Applications in PA

class: center, middle, inverse, title-slide

.title[
# PADP 7120 Data Applications in PA
]
.subtitle[
## Regression with Nonlinear Variables
]
.author[
### Alex Combs
]
.institute[
### UGA | SPIA | PADP
]
.date[
### Last updated: April 02, 2025
]

---

# Outline

- Why use nonlinear variables

- How to interpret coefficients for nonlinear variables

- Assess goodness of fit

---
# Why use nonlinear variables

- The relationship between `$x$` and `$y$` may not follow a linear path

---
# Common nonlinear variables

- Quadratic model

- Logarithmic transformations
  - Log-log
  - Log-level
  - Level-log

---
# Quadratic model

`$$y=\beta_0+\beta_1x_1+\beta_2x_1^2+\beta_3x_2+...+\beta_kx_k+\epsilon$$`

- `$x_1$` appears twice in the regression model above

- The regression model includes a linear term, `$x_1$`, and a quadratic term, `$x_1^2$`

- Any variable in the regression model could be modeled quadratic

`$$y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_2^2+...+\beta_kx_k+\epsilon$$`

---
# Quadratic model

`$$y=\beta_0+\beta_1x_1+\beta_2x_1^2+\beta_3x_2+...+\beta_kx_{k-1}+\epsilon$$`

.pull-left[
If relationship between `$x$` and `$y$` is initially negative then eventually positive, linear term, `$b_1$`, will be negative and quadratic term, `$b_2$`, will be positive. 
]

.pull-right[
If relationship between `$x$` and `$y$` is initially positive then eventually negative, linear term, `$b_1$`, will be positive and quadratic term, `$b_2$`, will be negative.
]

---
# Examples of quadratic relationships

- Tax rates on tax revenues (or many cases of price vs. revenue)

- Administrative staff on efficiency

- Class size on student performance

- Work hours on productivity

- Public health campaign intensity and behavior change

---
# Quadratic regression

- **Population model:**

`$$y=\beta_0+\beta_1x_1+\beta_2x_1^2+\epsilon$$`

- We run the regression model and obtain some estimates for our sample model

`$$\hat{y}=b_0+b_1x_1+b_2x_1^2$$`

- Can then answer the same two questions:
  - Predicted change in `$y$` given a unit-change in `$x$`?
  - Predicted value of `$y$` given a value of `$x$`?

- Third question unique to quadratic relationship:
  - At what value of `$x$` is `$y$` at its predicted maximum or minimum?

---
# Linear vs. Quadratic unit change

**Linear**

`$$\hat{y}=b_0+b_1x_1$$`

- What is the change in `$y$` from a unit-change in `$x_1$`?

`$$\Delta \hat{y} = b_1$$`

- Unit change is constant. Does not depend on the value of `$x$` that is changing

---
# Linear vs. Quadratic unit change

**Quadratic**

`$$\hat{y}=b_0+b_1x_1+b_2x_1^2$$`

- What is the change in `$y$` from a unit-change in `$x$`?

`$$\Delta \hat{y} = b_1+2b_2x_1$$`

- Depends on the starting value of `$x$`

---
# Finding Max or Min

`$$\hat{y}=b_0+b_1x_1+b_2x_1^2$$`

- At what value of `$x$` is `$y$` at its maximum or minimum?

`$$x = \frac{-b_1}{2b_2}$$`

- Remember, any variable in the model could be quadratic.

`$$\hat{y}=b_0+b_1x_1+b_2x_2+b_3x_2^2$$`

`$$x = \frac{-b_2}{2b_3}$$`

---
# Quadratic regression

`$$y= 100 + 10x - 1x^2$$`

- What is the marginal effect of `$x$` at `$x=3$`?
  - Marginal effect another term for change in `$y$` given a small change in `$x$` like 1 unit
  - Note we need a starting value for `$x$` to answer this question

--
`$$\Delta \hat{y} = b_1+2b_2x_1$$`
`$$\Delta \hat{y}=10+2\times -1 \times 3=10-6=4$$`

- At `$x=3$`, `$y$` is predicted to increase 4 units given a one-unit increase in `$x$`, on average.

---
# Quadratic regression

`$$y= 100 + 10x - 1x^2$$`

- At what value of `$x$` is `$y$` at its maximum?

`$$x = \frac{-b_1}{2b_2}$$`

`$$\frac{-10}{2(-1)}=5$$`

- On average, `$y$` is predicted to be at its maximum when `$x=5$`.

---
class: inverse, center, middle

# Quadratic regression example using R

---
# Data

```
##       Wage            Educ            Age       
##  Min.   : 6.93   Min.   : 6.00   Min.   :18.00  
##  1st Qu.:19.14   1st Qu.:10.00   1st Qu.:34.75  
##  Median :24.98   Median :14.00   Median :51.00  
##  Mean   :24.93   Mean   :13.85   Mean   :49.49  
##  3rd Qu.:30.57   3rd Qu.:17.00   3rd Qu.:65.25  
##  Max.   :43.44   Max.   :22.00   Max.   :77.00
```

- Note average age is about 49

- When interested in marginal effect of a quadratic variable, common to consider changes around the average of `$x$`

---
# Viz

---
# Quadratic regression

`$$Wage=\beta_0+\beta_1Educ+\beta_2Age+\beta_3Age^2+\epsilon$$`

``` r
quad <- lm(Wage ~ Educ + Age + I(Age^2), data = wages)
```

``` r
get_regression_table(quad)
```

|term      | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:---------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept |  -22.722|     3.023|    -7.517|       0|  -28.742|  -16.701|
|Educ      |    1.254|     0.090|    13.990|       0|    1.075|    1.432|
|Age       |    1.350|     0.134|    10.077|       0|    1.083|    1.617|
|I(Age^2)  |   -0.013|     0.001|    -9.840|       0|   -0.016|   -0.011|

`$$\hat{Wage}=-22.7+1.25*Educ+1.35*Age-0.013*Age^2$$`

---
# Interpretation

`$$\hat{Wage}=-22.7+1.25*Educ+1.35*Age-0.013*Age^2$$`

- What is the marginal effect of age on wage at 49 years old?

- At what age are wages at their maximum?

- What is the predicted wage for a 26-year old with 16 years of education?

---
class: inverse, center, middle

# Log Transformations

---
# Why use log transformation

- Transform skewed data so regression line fits better

- Express change in percentages instead of units

- Reflect our hypotheses regarding the relationship between two variables
  - The relationship between `$x$` and `$y$` is not constant, but is always either positive or negative (not quadratic)

---
# Transform skewed data

---
# Transform skewed data

---
# Log-log Model

`$$ln(y)=\beta_0+\beta_1ln(x_1)+...+\beta_kx_k+\epsilon$$`
`$$ln(\hat{y})=b_0+b_1ln(x_1)+...+b_kx_k$$`

- Might use this if relationship between `$x$` and `$y$` begins to plateau as `$x$` increases

- As `$x_1$` increases 1%, y changes `$b_1$`%

- Estimates an **elasticity:** percent change in `$y$` given a percent change in `$x$`

---
# Log-log model in R

`$$ln(LifeExp)=\beta_0+\beta_1ln(GDPperCap)+\beta_2Continent+\epsilon$$`

``` r
log_log <- lm(log(lifeExp) ~ log(gdpPercap) + continent, 
               data = gapminder)
```

---
# Log-log model in R

``` r
get_regression_table(log_log) %>% 
  select(term, estimate)
```

.pull-left[

|term                | estimate|
|:-------------------|--------:|
|intercept           |    3.062|
|log(gdpPercap)      |    0.112|
|continent: Americas |    0.133|
|continent: Asia     |    0.110|
|continent: Europe   |    0.166|
|continent: Oceania  |    0.152|
]

.pull-right[

- How do we interpret `log(gdpPercap)` coefficient?

- On average, life expectancy is predicted to increase 0.11% with each 1% increase in GDP per capita, all else equal.

]

---
# Level-log Model

`$$y=\beta_0+\beta_1ln(x_1)+...+\beta_kx_k+\epsilon$$`
`$$\hat{y}=b_0+b_1ln(x_1)+...+b_kx_k$$`

- Might use this if only `$x$` is skewed.

---
# Interpreting Level-log Model

`$$\hat{y}=b_0+b_1ln(x_1)+...+b_kx_k$$`

- As `$x_1$` increases 1%, y changes `$\frac{b_1}{100}$` units

- Remember, any variable that is log-transformed will be expressed in percent change. `$y$` has not been transformed so it is expressed in unit change.

- Can also consider if `$x$` doubles. This is equivalent to `$x$` increasing 100%. Then,

- As `$x$` doubles, y changes by `$b_1$` units

---
# Level-log Model in R

`$$LifeExp=\beta_0+\beta_1ln(GDPperCap)+\beta_2Continent+\epsilon$$`

``` r
lev_log <- lm(lifeExp ~ log(gdpPercap) + continent, 
               data = gapminder)
```

---
# Level-log Model in R

``` r
get_regression_table(lev_log) %>% 
  select(term, estimate)
```

.pull-left[

|term                | estimate|
|:-------------------|--------:|
|intercept           |    2.317|
|log(gdpPercap)      |    6.422|
|continent: Americas |    7.015|
|continent: Asia     |    5.912|
|continent: Europe   |    9.577|
|continent: Oceania  |    9.213|
]

.pull-right[

- How do we interpret `log(gdpPercap)` coefficient?

- On average, as GDP per capita increases 1%, life expectancy is predicted to increase `$\frac{6.42}{100}=0.0642$` years, all else equal.

- Or a doubling of GDP per capita results in a 6.42 year increase in life expectancy

]

---
# Log-level Model (Exponential Model)

`$$ln(y)=\beta_0+\beta_1x_1+...+\beta_kx_k+\epsilon$$`
`$$ln(\hat{y})=b_0+b_1x_1+...+b_kx_k$$`

- Might use this if only `$y$` is skewed

- Or we believe the effect of `$x$` on `$y$` grows as `$x$` increases.

- As `$x_1$` increases 1 unit, y changes ( `$b_1 \times 100$` ) percent

- Now change in `$y$` is expressed in percent because was log transformed. `$x$` was not so it is still expressed in unit changes.

---
# Log-level Model in R

`$$ln(LifeExp)=\beta_0+\beta_1GDPperCap+\beta_2Continent+\epsilon$$`

``` r
log_lev <- lm(log(lifeExp) ~ gdpPercap + continent, 
               data = gapminder)
```

---
# Log-level Model in R

``` r
get_regression_table(log_lev) %>% 
  select(term, estimate)
```

.pull-left[

|term                | estimate|
|:-------------------|--------:|
|intercept           |    3.856|
|gdpPercap           |    0.000|
|continent: Americas |    0.250|
|continent: Asia     |    0.160|
|continent: Europe   |    0.311|
|continent: Oceania  |    0.316|
]

.pull-right[

- Rounds to 0 in the table. The estimate equals 0.0004. How do we interpret `gdpPercap`?

- As GDP per capita increases 1 dollar, life expectancy is predicted to increase `$0.0004 \times 100 = 0.04$` percent

]

---
# Choosing the best model

- Can't compare models that use `$ln(y)=\dots$` to models that use `$y=\dots$` based on `$R^2$` or `$RMSE$`.

- Can compare log-log and log-level
  - Can compare level-level and level-log

- Converting goodness of fit measures to be comparable is beyond scope of this course.

- Instead, use scatter plots and your best judgment to decide between `$ln(y)=\dots$` and `$y=\dots$`

---
# Recap

- **Quadratic**: `$b+2bx$` unit change in `$y$` given a 1-unit increase in `$x$`

- **Max or min**: `$y$` is at its max or min at `$x = \frac{-b}{2b}$`

- **Log-log**: `$b$` percent change in `$y$` given a 1% increase in `$x$`

- **Level-log**: `$\frac{b}{100}$` percent change in `$y$` given a 1-unit increase in `$x$`

- **Log-level**: `$b \times 100$` unit change in `$y$` given a 1% increase in `$x$`