class: center, middle, inverse, title-slide

.title[
# PADP 7120 Data Applications in PA
]
.subtitle[
## Regression with Categorical Variables
]
.author[
### Alex Combs
]
.institute[
### UGA | SPIA | PADP
]
.date[
### Last updated: February 20, 2024
]

---

# Outline

- Add categorical variables as explanatory variables
- Interact numerical and categorical variables
- Use a binary categorical variable as an outcome

---

# Adding to our toolbox

- So far, we have covered simple and multiple regression using quantitative/numerical variables

.font120[$$y=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k+\epsilon$$]

- Any number of explanatory variables could be categorical
- The outcome could be categorical as well

---

# Example

- Suppose HR wants to examine whether male professors are paid differently than female professors

```r
# Salaries ships with the carData package; glimpse() is from dplyr
glimpse(Salaries)
```

```
## Rows: 397
## Columns: 6
## $ rank          <fct> Prof, Prof, AsstProf, Prof, Prof, AssocProf, Prof, Prof,…
## $ discipline    <fct> B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, A, A,…
## $ yrs.since.phd <int> 19, 20, 4, 45, 40, 6, 30, 45, 21, 18, 12, 7, 1, 2, 20, 1…
## $ yrs.service   <int> 18, 16, 3, 39, 41, 6, 23, 45, 20, 18, 8, 2, 1, 0, 18, 3,…
## $ sex           <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Fe…
## $ salary        <int> 139750, 173200, 79750, 115000, 141500, 97000, 175000, 14…
```

- `discipline` is coded as "A" for theoretical and "B" for applied

---

# Example

.pull-left[

```r
ggplot(Salaries, aes(x = yrs.service, y = salary)) +
  geom_point(size = 4) +
  theme_minimal() +
  theme(text = element_text(size = 20))
```
]

.pull-right[
<img src="Regression-with-Categories_files/figure-html/unnamed-chunk-4-1.png" width="99%" />
]

---

# Example

.pull-left[

```r
ggplot(Salaries, aes(x = yrs.service, y = salary,
*                    color = sex)) +
  geom_point(size = 4, alpha = 0.5) +
  theme_minimal() +
  theme(text = element_text(size = 20))
```
]

.pull-right[
<img src="Regression-with-Categories_files/figure-html/unnamed-chunk-6-1.png" width="99%" />
]

---

# Visualizing Parallel Slopes

<img src="Regression-with-Categories_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

# Parallel Slopes Model Implication

`$$y = \beta_0 + \beta_1x + \beta_2 d + \epsilon$$`

`$$\hat{y} = b_0 + b_1x + b_2 d$$`

- where `\(x\)` is continuous and `\(d\)` is a dummy variable equal to 0 or 1
- `\(d\)` allows the intercept to differ between groups, but assumes the effect of `\(x\)` is the same for both groups

---

# Parallel Slopes Model Equations

`$$\hat{y} = b_0 + b_1x + b_2 d$$`

- Change in `\(y\)` due to a one-unit increase in `\(x\)` is still `\(b_1\)` for any scenario

--

- In cases where `\(d=0\)`

`$$\hat{y} = b_0 + b_1x$$`

--

- In cases where `\(d=1\)`

`$$\hat{y} = b_0 + b_1x + b_2$$`

`$$\hat{y} = (b_0 + b_2) + b_1x$$`

--

- `\(b_2\)` shifts the intercept. The regression line for the `\(d=1\)` group will be above or below the line for the `\(d=0\)` group by `\(b_2\)` units.

---

# Example: Parallel Slopes

`$$Salary=\beta_0+\beta_1YrsEmployed+\beta_2Sex + \epsilon$$`

```r
sal_mod <- lm(salary ~ yrs.service + sex, data = Salaries)
```

```r
# get_regression_table() is from the moderndive package
get_regression_table(sal_mod)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p_value </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> 92356.947 </td>
   <td style="text-align:right;"> 4740.188 </td>
   <td style="text-align:right;"> 19.484 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 83037.722 </td>
   <td style="text-align:right;"> 101676.171 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> yrs.service </td>
   <td style="text-align:right;"> 747.612 </td>
   <td style="text-align:right;"> 111.396 </td>
   <td style="text-align:right;"> 6.711 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 528.607 </td>
   <td style="text-align:right;"> 966.617 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sex: Male </td>
   <td style="text-align:right;"> 9071.800 </td>
   <td style="text-align:right;"> 4861.644 </td>
   <td style="text-align:right;"> 1.866 </td>
   <td style="text-align:right;"> 0.063 </td>
   <td style="text-align:right;"> -486.208 </td>
   <td style="text-align:right;"> 18629.808 </td>
  </tr>
</tbody>
</table>

`$$\hat{Salary}=92357+748 \times YrsEmployed+9072 \times Sex$$`

- Note that, given how these data are coded, Sex = 1 for Male and 0 for Female

---

# Interpreting results

`$$\hat{Salary}=92357+748 \times YrsEmployed+9072 \times Sex$$`

- What is the standard interpretation of the coefficient for `YrsEmployed`?

--

- What is the standard interpretation of the coefficient for `Sex`? Recall that Sex = 1 for Male and 0 for Female.

---

# Interpreting results

`$$\hat{Salary}=92357+748 \times YrsEmployed+9072 \times Sex$$`

- What is the predicted change in salary for a male professor following three additional years of employment?

--

- For a female professor?

--

- We forced the slopes to be parallel. Therefore, the predicted change must be the same across all groups.

---

# Interpreting results

`$$\hat{Salary}=92357+748 \times YrsEmployed+9072 \times Sex$$`

- What is the predicted salary for a male professor employed 20 years?

--

- For a female professor employed 20 years?

--

- The predicted values will **not** be the same because each group has a different intercept

---
class: inverse, middle, center

# We aren't required to model the slopes to be parallel. We could allow the effect of years employed to differ between male and female professors.
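---

# Sketch: predictions from the parallel slopes model

Before relaxing the parallel slopes restriction, it may help to see what it implies numerically. The following is a minimal sketch, not part of the original analysis, that plugs values into the fitted parallel slopes equation using only the rounded coefficients reported on the slides. The helper name `predict_parallel` is hypothetical.

```r
# Rounded coefficients copied from the parallel slopes equation
b0 <- 92357   # intercept
b1 <- 748     # yrs.service
b2 <- 9072    # sex: Male (Male = 1, Female = 0)

# Hypothetical helper: predicted salary for given years and sex dummy
predict_parallel <- function(years, male) {
  b0 + b1 * years + b2 * male
}

male_20   <- predict_parallel(20, 1)
female_20 <- predict_parallel(20, 0)

male_20              # 116389
female_20            # 107317
male_20 - female_20  # 9072, the same gap at any number of years

# Three additional years change the prediction by 3 * b1 for either group
predict_parallel(23, 1) - predict_parallel(20, 1)  # 2244
predict_parallel(23, 0) - predict_parallel(20, 0)  # 2244
```

The gap between the two groups equals `b2` no matter how many years are plugged in. That constant gap is the restriction an interaction model relaxes.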
---

# Visualizing Interaction Model

.pull-left[

```r
ggplot(Salaries, aes(x = yrs.service, y = salary,
*                    color = sex)) +
  geom_point(alpha = 0.6, size = 5) +
  geom_smooth(method = 'lm', se = FALSE,
*             linewidth = 2) +
  theme_minimal() +
  theme(text = element_text(size = 20))
```
]

.pull-right[
<img src="Regression-with-Categories_files/figure-html/unnamed-chunk-12-1.png" width="99%" />
]

---

# Interaction model implication

`$$y = \beta_0 + \beta_1x + \beta_2d + \beta_3 xd + \epsilon$$`

`$$\hat{y} = b_0 + b_1x + b_2d + b_3 xd$$`

- Now `\(x\)` and `\(d\)` are interacted (multiplied together), allowing the effect of `\(x\)` to differ between groups of `\(d\)`

---

# Interaction model equations

`$$\hat{y} = b_0 + b_1x + b_2d + b_3 xd$$`

--

- For the `\(d=0\)` group:

`$$\hat{y} = b_0 + b_1x$$`

--

- For the `\(d=1\)` group:

`$$\hat{y} = b_0 + b_1x + b_2 + b_3x$$`

`$$\hat{y} = (b_0 + b_2) + (b_1+b_3)x$$`

- Intercept *and* slope *can* differ between groups
- Intercept differs by `\(b_2\)` and slope (effect of `\(x\)`) differs by `\(b_3\)`

---

# Interaction

- An interaction allows the effect of one variable to differ depending on the value of another variable

`$$Salary=\beta_0+\beta_1YrsEmp+\beta_2Sex + \beta_3YrsEmp \times Sex + \epsilon$$`

- `\(\beta_3\)` tells us how the slopes compare
  - For Sex = 1 (male), the association between years and salary is `\(\beta_1+\beta_3\)`
  - For Sex = 0 (female), it is `\(\beta_1\)`
- `\(\beta_2\)` tells us how the intercepts compare
  - For Sex = 1 (male), the intercept is `\(\beta_0+\beta_2\)`
  - For Sex = 0 (female), it is `\(\beta_0\)`

---

# Running an interaction model

```r
sal_mod2 <- lm(salary ~ yrs.service + sex +
*                yrs.service*sex, data = Salaries)
```

```r
get_regression_table(sal_mod2)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p_value </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> 82068.51 </td>
   <td style="text-align:right;"> 7568.72 </td>
   <td style="text-align:right;"> 10.84 </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:right;"> 67188.26 </td>
   <td style="text-align:right;"> 96948.76 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> yrs.service </td>
   <td style="text-align:right;"> 1637.30 </td>
   <td style="text-align:right;"> 523.03 </td>
   <td style="text-align:right;"> 3.13 </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:right;"> 609.02 </td>
   <td style="text-align:right;"> 2665.58 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sex: Male </td>
   <td style="text-align:right;"> 20128.63 </td>
   <td style="text-align:right;"> 7991.14 </td>
   <td style="text-align:right;"> 2.52 </td>
   <td style="text-align:right;"> 0.01 </td>
   <td style="text-align:right;"> 4417.90 </td>
   <td style="text-align:right;"> 35839.35 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> yrs.service:sexMale </td>
   <td style="text-align:right;"> -931.74 </td>
   <td style="text-align:right;"> 535.24 </td>
   <td style="text-align:right;"> -1.74 </td>
   <td style="text-align:right;"> 0.08 </td>
   <td style="text-align:right;"> -1984.04 </td>
   <td style="text-align:right;"> 120.56 </td>
  </tr>
</tbody>
</table>

`$$\hat{Sal}=82069+1637 \times YrsEmp+20129 \times Sex - 932 \times YrsEmp \times Sex$$`

---

# Interpreting an interaction model

`$$\hat{Sal}=82069+1637 \times YrsEmp+20129 \times Sex - 932 \times YrsEmp \times Sex$$`

- Recall that `Sex` = 1 for male, 0 for female
- What is the predicted change in salary for a male professor given one more year of employment?
- What is the predicted change in salary for a female professor given one more year of employment?
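---

# Sketch: group-specific slopes and intercepts

One way to work through the questions above is to collapse the fitted interaction equation into one line per group, as in the interaction model equations. This is a minimal sketch using only the rounded coefficients from the slides; the variable names are illustrative and not part of the original analysis.

```r
# Rounded coefficients copied from the fitted interaction equation
b0 <- 82069   # intercept
b1 <- 1637    # yrs.service
b2 <- 20129   # sex: Male (Male = 1, Female = 0)
b3 <- -932    # yrs.service:sexMale

# Female professors (Sex = 0): the interaction terms drop out
intercept_female <- b0        # 82069
slope_female     <- b1        # 1637 per additional year

# Male professors (Sex = 1): add b2 to the intercept and b3 to the slope
intercept_male <- b0 + b2     # 102198
slope_male     <- b1 + b3     # 705 per additional year

slope_male - slope_female     # -932, which is b3
```

The difference in slopes is `b3` itself, which is why the interaction coefficient is read as how much the effect of years employed differs between the two groups.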
---

# Interpreting an interaction model

`$$\hat{Sal}=82069+1637 \times YrsEmp+20129 \times Sex - 932 \times YrsEmp \times Sex$$`

- What is the predicted salary for a female professor employed for 10 years?
- What is the predicted salary for a male professor employed for 10 years?

---
class: inverse, middle, center

# Linear probability model (LPM)

---

# LPM

- Regression model with a two-level (binary, dummy) categorical variable for the response
- Typically, `y=1` if an outcome did occur, `y=0` if it did not occur
- An LPM estimates the effect of our explanatory variables on the probability that the outcome occurs

`$$Pr(y=1)=\beta_0+\beta_1x_1+\dots+\beta_kx_k+\epsilon$$`

---

# Running an LPM

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Divorce </th>
   <th style="text-align:right;"> Age25to29 </th>
   <th style="text-align:right;"> Income </th>
   <th style="text-align:right;"> Children </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 19 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 46 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 40 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 69 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 44 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
</tbody>
</table>

- Suppose we want to know how these variables might affect the probability of divorce among married or previously married individuals
- Divorce = 1 if the individual is divorced
- Income is in $1,000s

---

# Running an LPM

```r
divorce_mod <- lm(Divorce ~ Age25to29 + Income + Children, data = divorce)
```

```r
get_regression_table(divorce_mod)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std_error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p_value </th>
   <th style="text-align:right;"> lower_ci </th>
   <th style="text-align:right;"> upper_ci </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> intercept </td>
   <td style="text-align:right;"> 0.379 </td>
   <td style="text-align:right;"> 0.217 </td>
   <td style="text-align:right;"> 1.747 </td>
   <td style="text-align:right;"> 0.092 </td>
   <td style="text-align:right;"> -0.067 </td>
   <td style="text-align:right;"> 0.824 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Age25to29 </td>
   <td style="text-align:right;"> 0.407 </td>
   <td style="text-align:right;"> 0.162 </td>
   <td style="text-align:right;"> 2.512 </td>
   <td style="text-align:right;"> 0.019 </td>
   <td style="text-align:right;"> 0.074 </td>
   <td style="text-align:right;"> 0.741 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Income </td>
   <td style="text-align:right;"> 0.001 </td>
   <td style="text-align:right;"> 0.003 </td>
   <td style="text-align:right;"> 0.283 </td>
   <td style="text-align:right;"> 0.779 </td>
   <td style="text-align:right;"> -0.005 </td>
   <td style="text-align:right;"> 0.007 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Children </td>
   <td style="text-align:right;"> -0.178 </td>
   <td style="text-align:right;"> 0.071 </td>
   <td style="text-align:right;"> -2.506 </td>
   <td style="text-align:right;"> 0.019 </td>
   <td style="text-align:right;"> -0.324 </td>
   <td style="text-align:right;"> -0.032 </td>
  </tr>
</tbody>
</table>

`$$\widehat{Pr}(y=1)=0.38+0.4 \times Age25to29+0.001 \times Income-0.18 \times Children$$`

---

# Interpreting an LPM

`$$\widehat{Pr}(y=1)=0.38+0.4 \times Age25to29+0.001 \times Income-0.18 \times Children$$`

- How does having one more child affect the predicted probability of divorce?

--

- The coefficients are on the 0-to-1 probability scale; multiply by 100 to express them in percent. Remember that a change in a probability expressed in percent is a **percentage point** change.
- What is the predicted probability of divorce for someone who is 27 years old, makes $50,000, and has 0 children?
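
---

# Sketch: a predicted probability from the LPM

The calculation behind the last question can be sketched by plugging values into the fitted LPM equation. This is a minimal sketch that uses only the rounded coefficients shown on the slides, so the result is approximate. The variable names are illustrative, and it assumes that a 27-year-old has Age25to29 = 1 and that an income of 50,000 is entered as 50 because Income is measured in thousands.

```r
# Rounded coefficients copied from the fitted LPM equation
b0         <- 0.38    # intercept
b_age      <- 0.4     # Age25to29
b_income   <- 0.001   # Income (in thousands)
b_children <- -0.18   # Children

# Assumed inputs: age 27 falls in the 25 to 29 bracket, income of 50 thousand,
# and no children
p_hat <- b0 + b_age * 1 + b_income * 50 + b_children * 0

p_hat        # 0.83
p_hat * 100  # 83 percent

# Effect of one more child, expressed in percentage points
b_children * 100  # -18
```

With these rounded coefficients, each additional child is associated with an 18 percentage point lower predicted probability of divorce, holding the other variables constant.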