PADP 7120 Data Applications in PA

class: center, middle, inverse, title-slide

.title[
# PADP 7120 Data Applications in PA
]
.subtitle[
## Regression with Categorical Variables
]
.author[
### Alex Combs
]
.institute[
### UGA | SPIA | PADP
]
.date[
### Last updated: February 19, 2025
]

---

# Outline

- Add categorical variables as explanatory variables

- Interact numerical and categorical variables

- Use a binary categorical variable as an outcome

---
# Adding to our toolbox

- So far, we have covered simple and multiple regression using quantitative/numerical variables

.font120[$$y=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k+\epsilon$$]

- Any number of explanatory variables could be categorical

- The outcome could be categorical as well

---
# Example

- Suppose HR wants to examine whether male professors are paid differently than female professors

``` r
glimpse(Salaries)
```

```
## Rows: 397
## Columns: 6
## $ rank          <fct> Prof, Prof, AsstProf, Prof, Prof, AssocProf, Prof, Prof,…
## $ discipline    <fct> B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, A, A,…
## $ yrs.since.phd <int> 19, 20, 4, 45, 40, 6, 30, 45, 21, 18, 12, 7, 1, 2, 20, 1…
## $ yrs.service   <int> 18, 16, 3, 39, 41, 6, 23, 45, 20, 18, 8, 2, 1, 0, 18, 3,…
## $ sex           <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Fe…
## $ salary        <int> 139750, 173200, 79750, 115000, 141500, 97000, 175000, 14…
```

- `discipline` is coded "A" for theoretical departments and "B" for applied departments

---
# Example

.pull-left[

``` r
ggplot(Salaries, 
       aes(x = yrs.service, 
           y = salary)) +
  geom_point(size = 4) +
  theme_minimal() +
  theme(
    text = element_text(size = 20))
```
]

.pull-right[
<img src="Regression-with-Categories_files/figure-html/unnamed-chunk-4-1.png" width="99%" />
]

---
# Example

.pull-left[

``` r
ggplot(Salaries, 
       aes(x = yrs.service, 
           y = salary, 
*          color = sex)) +
  geom_point(size = 4,
             alpha = 0.5) +
  theme_minimal() +
  theme(text = element_text(size = 20))
```
]

.pull-right[
<img src="Regression-with-Categories_files/figure-html/unnamed-chunk-6-1.png" width="99%" />
]

---
# Visualizing Parallel Slopes

---
# Parallel Slopes Model Implication

`$$y = \beta_0 + \beta_1x + \beta_2 d + \epsilon$$`
`$$\hat{y} = b_0 + b_1x + b_2 d$$`
- where `$x$` is continuous and `$d$` is a dummy variable = 0 or 1

- `$d$` allows the intercept to differ between groups, but assumes the effect of `$x$` is the same for both

- `$d$` = 0 is the reference group to which `$d$` = 1 is compared

---
# Parallel Slopes Model Equations

`$$\hat{y} = b_0 + b_1x + b_2 d$$`

- Change in `$y$` due to a one-unit increase in `$x$` is still `$b_1$` for either group

- In cases where `$d=0$`

`$$\hat{y} = b_0 + b_1x$$`

- In cases where `$d=1$`

`$$\hat{y} = b_0 + b_1x + b_2$$`
`$$\hat{y} = (b_0 + b_2) + b_1x$$`
--

- `$b_2$` shifts the intercept. Regression line for `$d=1$` group will be above or below `$d=0$` group by `$b_2$` units.
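
---
# Sketch: parallel slopes with simulated data

- A minimal sketch of the intercept shift, using simulated data rather than `Salaries`. The names `x`, `d`, `y` and the true values (10, 2, 5) are made up for illustration.

``` r
# Simulated data: names and true values are assumptions for illustration only
set.seed(7120)
n <- 200
x <- runif(n, 0, 40)                 # continuous explanatory variable
d <- rbinom(n, 1, 0.5)               # dummy variable: 0 or 1
y <- 10 + 2 * x + 5 * d + rnorm(n)   # same slope for both groups

fit <- lm(y ~ x + d)
b <- coef(fit)

# Predicted values for each group along the same grid of x
grid <- data.frame(x = c(0, 10, 20, 30))
pred_d0 <- predict(fit, newdata = cbind(grid, d = 0))
pred_d1 <- predict(fit, newdata = cbind(grid, d = 1))

# The gap between the two lines equals b2 at every value of x
pred_d1 - pred_d0
b["d"]
```

- The gap is constant because `d` enters the model only as an additive term.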

---
# Example: Parallel Slopes

`$$Salary=\beta_0+\beta_1YrsEmployed+\beta_2Sex + \epsilon$$`

``` r
sal_mod <- lm(salary ~ yrs.service + sex, data = Salaries)
```

``` r
get_regression_table(sal_mod)
```
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p_value </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> 92356.947 </td>
   <td style="text-align:right;"> 4740.188 </td>
   <td style="text-align:right;"> 19.484 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 83037.722 </td>
   <td style="text-align:right;"> 101676.171 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> yrs.service </td>
   <td style="text-align:right;"> 747.612 </td>
   <td style="text-align:right;"> 111.396 </td>
   <td style="text-align:right;"> 6.711 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 528.607 </td>
   <td style="text-align:right;"> 966.617 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sex: Male </td>
   <td style="text-align:right;"> 9071.800 </td>
   <td style="text-align:right;"> 4861.644 </td>
   <td style="text-align:right;"> 1.866 </td>
   <td style="text-align:right;"> 0.063 </td>
   <td style="text-align:right;"> -486.208 </td>
   <td style="text-align:right;"> 18629.808 </td>
  </tr>
</tbody>
</table>

`$$\hat{Salary}=92357+748 \times YrsEmployed+9072 \times Sex$$`

- Female is the reference group: Sex = 1 for Male and 0 for Female
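
---
# Sketch: how R builds the dummy

- A small sketch of why Female ends up as the reference group. `factor()` orders levels alphabetically by default, and `lm()` treats the first level as the baseline. The `toy` data frame is hypothetical; it assumes the `sex` factor in `Salaries` has the same level order, which is consistent with the `sex: Male` term in the table.

``` r
# Toy factor: hypothetical values, not rows from the Salaries data
toy <- data.frame(sex = factor(c("Female", "Male", "Male", "Female")))

levels(toy$sex)                   # first level is the reference group
model.matrix(~ sex, data = toy)   # the sexMale column is the 0/1 dummy

# relevel() changes the reference group if Male should be the baseline
toy$sex_ref_male <- relevel(toy$sex, ref = "Male")
levels(toy$sex_ref_male)
```

- Changing the reference group flips the sign of the dummy coefficient but does not change the fitted values.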

---
# Interpreting results

`$$\hat{Salary}=92357+748 \times YrsEmployed+9072 \times Sex$$`

- What is the standard interpretation for the `YrsEmployed` coefficient?

- What is the standard interpretation of the coefficient for `Sex`? Recall that Sex=1 for Male and 0 for Female.

---
# Interpreting results

`$$\hat{Salary}=92357+748 \times YrsEmployed+9072 \times Sex$$`

- What is the predicted change in salary for a male professor following three additional years of employment?

- For a female professor?

- We forced the slopes to be parallel. Therefore, the predicted change must be the same across all groups.
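
---
# Worked example: predicted change

- One way to check the answer, using the rounded slope from the fitted equation. The variable names are illustrative only.

``` r
# Rounded slope copied from the fitted equation above
b_yrs <- 748          # predicted salary change per additional year

# Sex shifts the intercept only, so it drops out of any change in salary
extra_years <- 3
change_male   <- b_yrs * extra_years
change_female <- b_yrs * extra_years
c(male = change_male, female = change_female)   # both 2244
```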

---
# Interpreting results

`$$\hat{Salary}=92357+748 \times YrsEmployed+9072 \times Sex$$`

- What is the predicted salary for a male professor employed 20 years?

- For a female professor employed 20 years?

- Predicted value will **not** be the same because each group has a different intercept
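
---
# Worked example: predicted salary

- A short sketch of the two predictions, using the rounded coefficients from the fitted equation. `predict_salary()` is a hypothetical helper written for this slide, not a function from any package.

``` r
# Rounded coefficients copied from the fitted equation above
b0 <- 92357; b_yrs <- 748; b_sex <- 9072

# Hypothetical helper: sex = 1 for Male, 0 for Female
predict_salary <- function(yrs, sex) b0 + b_yrs * yrs + b_sex * sex

male_20   <- predict_salary(20, sex = 1)   # 116389
female_20 <- predict_salary(20, sex = 0)   # 107317
male_20 - female_20                        # 9072, the intercept shift
```

- The gap equals the `Sex` coefficient at any number of years because the slopes are parallel.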

---
class: inverse, middle, center

# We aren't required to model the slopes as parallel. We could allow the effect of years employed to differ between male and female professors.

---
# Visualizing Interaction Model

.pull-left[

``` r
ggplot(Salaries, 
       aes(x = yrs.service, 
           y = salary, 
*          color = sex)) +
  geom_point(alpha = 0.6, 
             size = 5) +
  geom_smooth(method = 'lm', 
              se = FALSE, 
              linewidth = 2) +
  theme_minimal() +
  theme(text = element_text(size = 20))
```
]

.pull-right[
<img src="Regression-with-Categories_files/figure-html/unnamed-chunk-12-1.png" width="99%" />
]

---
# Interaction model implication

`$$y = \beta_0 + \beta_1x + \beta_2d + \beta_3 xd + \epsilon$$`
`$$\hat{y} = b_0 + b_1x + b_2d + b_3 xd$$`
- Now `$x$` and `$d$` are interacted (multiplied together), allowing the effect of `$x$` to differ between groups of `$d$`

---
# Interaction model equations

`$$\hat{y} = b_0 + b_1x + b_2d + b_3 xd$$`
--

- For the `$d=0$` group:

`$$\hat{y} = b_0 + b_1x$$`

- For the `$d=1$` group:

`$$\hat{y} = b_0 + b_1x + b_2 + b_3x$$`

`$$\hat{y} = (b_0 + b_2) + (b_1+b_3)x$$`

- Intercept *and* slope *can* differ between groups

- Intercept differs by `$b_2$` and slope (effect of `$x$`) differs by `$b_3$`
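
---
# Sketch: interaction with simulated data

- A minimal sketch of the group-specific lines, using simulated data rather than `Salaries`. The names and true values are assumptions for illustration. It also checks that an interaction with a dummy gives the same two lines as fitting each group separately.

``` r
# Simulated data: names and true values are assumptions for illustration only
set.seed(7120)
n <- 200
x <- runif(n, 0, 40)
d <- rbinom(n, 1, 0.5)
y <- 10 + 2 * x + 5 * d - 1 * x * d + rnorm(n)

fit_int <- lm(y ~ x * d)
b <- unname(coef(fit_int))   # order: b0, b1 (x), b2 (d), b3 (x:d)

# Group-specific lines implied by the interaction model
line_d0 <- c(intercept = b[1],        slope = b[2])
line_d1 <- c(intercept = b[1] + b[3], slope = b[2] + b[4])

# Fitting each group separately gives the same two lines
sep_d0 <- unname(coef(lm(y ~ x, subset = d == 0)))
sep_d1 <- unname(coef(lm(y ~ x, subset = d == 1)))
rbind(line_d0, sep_d0, line_d1, sep_d1)
```

- The single interaction model is still preferable because it reports a standard error for the difference in slopes, `$b_3$`.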

---
# Interaction

- An interaction allows the effect of one variable to differ depending on the value of another variable

`$$Salary=\beta_0+\beta_1YrsEmp+\beta_2Sex + \beta_3YrsEmp \times Sex + \epsilon$$`

- `$\beta_3$` tells us how the slopes compare
  - For Sex = 1 (male), the association between years and salary is `$\beta_1+\beta_3$`
  - For Sex = 0 (female), it is `$\beta_1$`

- `$\beta_2$` tells us how the intercepts compare
  - For Sex = 1 (male), intercept is `$\beta_0+\beta_2$`
  - For Sex = 0 (female), it is `$\beta_0$`

---
# Running an interaction model

``` r
sal_mod2 <- lm(salary ~ yrs.service + sex + 
*                yrs.service*sex,
               data = Salaries)
```

``` r
get_regression_table(sal_mod2)
```
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p_value </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> 82068.51 </td>
   <td style="text-align:right;"> 7568.72 </td>
   <td style="text-align:right;"> 10.84 </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:right;"> 67188.26 </td>
   <td style="text-align:right;"> 96948.76 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> yrs.service </td>
   <td style="text-align:right;"> 1637.30 </td>
   <td style="text-align:right;"> 523.03 </td>
   <td style="text-align:right;"> 3.13 </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:right;"> 609.02 </td>
   <td style="text-align:right;"> 2665.58 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sex: Male </td>
   <td style="text-align:right;"> 20128.63 </td>
   <td style="text-align:right;"> 7991.14 </td>
   <td style="text-align:right;"> 2.52 </td>
   <td style="text-align:right;"> 0.01 </td>
   <td style="text-align:right;"> 4417.90 </td>
   <td style="text-align:right;"> 35839.35 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> yrs.service:sexMale </td>
   <td style="text-align:right;"> -931.74 </td>
   <td style="text-align:right;"> 535.24 </td>
   <td style="text-align:right;"> -1.74 </td>
   <td style="text-align:right;"> 0.08 </td>
   <td style="text-align:right;"> -1984.04 </td>
   <td style="text-align:right;"> 120.56 </td>
  </tr>
</tbody>
</table>

`$$\hat{Sal}=82069+1637 \times YrsEmp+20129 \times Sex - 932 \times YrsEmp \times Sex$$`
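
---
# Sketch: formula shorthand

- In an R formula, `a*b` already expands to `a + b + a:b`, so the main effects do not need to be listed separately. This sketch compares the terms of the two formulas without fitting anything; the object names `f_long` and `f_short` are illustrative.

``` r
# Formula as written on the previous slide
f_long  <- salary ~ yrs.service + sex + yrs.service*sex

# Shorter formula using only the * operator
f_short <- salary ~ yrs.service * sex

# Both expand to the same three terms
attr(terms(f_long),  "term.labels")
attr(terms(f_short), "term.labels")
```

- Listing the main effects explicitly is harmless and can make the model easier to read.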

---
# Interpreting an interaction model

`$$\hat{Sal}=82069+1637 \times YrsEmp+20129 \times Sex - 932 \times YrsEmp \times Sex$$`
- Recall that `Sex` = 1 for male, 0 for female

- What is the predicted change in salary for a male professor given one more year of employment?

- What is the predicted change in salary for a female professor given one more year of employment?
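
---
# Worked example: group-specific slopes

- One way to check the answers, using the rounded coefficients from the fitted equation. Variable names are illustrative only.

``` r
# Rounded coefficients copied from the fitted interaction equation
b_yrs     <- 1637   # slope for the reference group (Female, Sex = 0)
b_yrs_sex <- -932   # interaction: difference in slope for Male (Sex = 1)

slope_female <- b_yrs               # 1637 per additional year
slope_male   <- b_yrs + b_yrs_sex   # 705 per additional year
c(female = slope_female, male = slope_male)
```

- The interaction estimate has a confidence interval that includes zero (-1984 to 121), so the difference in slopes is estimated imprecisely.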

---
# Interpreting an interaction model

`$$\hat{Sal}=82069+1637 \times YrsEmp+20129 \times Sex - 932 \times YrsEmp \times Sex$$`

- What is the predicted salary for a female professor employed for 10 years?

- What is the predicted salary for a male professor employed for 10 years?
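
---
# Worked example: predicted salary with interaction

- A short sketch of the two predictions, using the rounded coefficients from the fitted equation. `predict_salary_int()` is a hypothetical helper, not a package function.

``` r
# Rounded coefficients copied from the fitted interaction equation
b0 <- 82069; b_yrs <- 1637; b_sex <- 20129; b_int <- -932

# Hypothetical helper: sex = 1 for Male, 0 for Female
predict_salary_int <- function(yrs, sex) {
  b0 + b_yrs * yrs + b_sex * sex + b_int * yrs * sex
}

female_10 <- predict_salary_int(10, sex = 0)   # 98439
male_10   <- predict_salary_int(10, sex = 1)   # 109248
male_10 - female_10                            # 10809 = 20129 - 932 * 10
```

- Unlike the parallel slopes model, the gap between groups now depends on years employed.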

---
class: inverse, middle, center

# Linear probability model (LPM)

---
# LPM

- Regression model with a two-level (binary, dummy) categorical variable for the response

- Typically, `y=1` if an outcome did occur, `y=0` if it did not occur

- An LPM estimates the effect of our explanatory variables on the probability that the outcome occurs

`$$Pr(y=1)=\beta_0+\beta_1x_1+\dots+\beta_kx_k$$`

---
# Running an LPM

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Divorce </th>
   <th style="text-align:right;"> Age25to29 </th>
   <th style="text-align:right;"> Income </th>
   <th style="text-align:right;"> Children </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 19 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 46 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 40 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 69 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 44 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
</tbody>
</table>

- Suppose we want to know how these variables might affect the probability of divorce among married or previously married individuals

- Divorce = 1 if individual is divorced

- Income in $1,000s
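
---
# Sketch: what an LPM estimates

- A minimal sketch using only the five preview rows shown in the table, not the full dataset, so the estimates will not match the full model. With a single dummy predictor, the LPM intercept is the share divorced in the reference group and the slope is the difference in shares. The object names `preview` and `lpm_toy` are illustrative.

``` r
# Five preview rows copied from the table; the full data has more rows
preview <- data.frame(
  Divorce   = c(0, 0, 1, 1, 1),
  Age25to29 = c(1, 0, 1, 1, 1)
)

# An LPM is ordinary lm() with a 0/1 outcome
lpm_toy <- lm(Divorce ~ Age25to29, data = preview)
coef(lpm_toy)

# Same numbers from group shares
share_ref <- mean(preview$Divorce[preview$Age25to29 == 0])
share_age <- mean(preview$Divorce[preview$Age25to29 == 1])
c(intercept = share_ref, slope = share_age - share_ref)
```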

---
# Running an LPM

``` r
divorce_mod <- lm(Divorce ~ Age25to29 + Income + Children,
              data = divorce)
```

``` r
get_regression_table(divorce_mod)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p_value </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> 0.379 </td>
   <td style="text-align:right;"> 0.217 </td>
   <td style="text-align:right;"> 1.747 </td>
   <td style="text-align:right;"> 0.092 </td>
   <td style="text-align:right;"> -0.067 </td>
   <td style="text-align:right;"> 0.824 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Age25to29 </td>
   <td style="text-align:right;"> 0.407 </td>
   <td style="text-align:right;"> 0.162 </td>
   <td style="text-align:right;"> 2.512 </td>
   <td style="text-align:right;"> 0.019 </td>
   <td style="text-align:right;"> 0.074 </td>
   <td style="text-align:right;"> 0.741 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Income </td>
   <td style="text-align:right;"> 0.001 </td>
   <td style="text-align:right;"> 0.003 </td>
   <td style="text-align:right;"> 0.283 </td>
   <td style="text-align:right;"> 0.779 </td>
   <td style="text-align:right;"> -0.005 </td>
   <td style="text-align:right;"> 0.007 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Children </td>
   <td style="text-align:right;"> -0.178 </td>
   <td style="text-align:right;"> 0.071 </td>
   <td style="text-align:right;"> -2.506 </td>
   <td style="text-align:right;"> 0.019 </td>
   <td style="text-align:right;"> -0.324 </td>
   <td style="text-align:right;"> -0.032 </td>
  </tr>
</tbody>
</table>

`$$\widehat{Pr}(y=1)=0.38+0.41 \times Age25to29+0.001 \times Income-0.18 \times Children$$`

---
# Interpreting an LPM

`$$\widehat{Pr}(y=1)=0.38+0.41 \times Age25to29+0.001 \times Income-0.18 \times Children$$`

- How does having one more child change the predicted probability of divorce?

- LPM coefficients are on the probability scale (0 to 1). Multiply by 100 to express them in percent; a change measured this way is a **percentage point** change.

- What is the predicted probability of divorce for a 27-year-old who makes $50,000 and has no children?
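
---
# Worked example: LPM predictions

- One way to check the answers, using the estimates from the regression table at three decimals. Variable names are illustrative only.

``` r
# Estimates copied from the regression table (three decimals)
b0 <- 0.379; b_age <- 0.407; b_inc <- 0.001; b_kids <- -0.178

# One additional child, holding age group and income fixed,
# expressed in percentage points
child_effect_pp <- b_kids * 100
child_effect_pp

# A 27-year-old falls in the 25-29 group (Age25to29 = 1);
# Income is measured in $1,000s, so $50,000 enters as 50
p_hat <- b0 + b_age * 1 + b_inc * 50 + b_kids * 0
p_hat
```

- The child effect is a decrease of about 17.8 percentage points, and the predicted probability is about 0.836.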