21 R Regression
21.1 Learning Objectives
In this chapter, you will learn how to:
- Run the following regression models:
- Continuous outcome and explanatory variable(s)
- Categorical explanatory variable
- Binary categorical (i.e. dummy) outcome (linear probability model)
- Interaction of two explanatory variables
- Generate tables of key results
21.2 Set-up
To complete this chapter, you need to
- Start an R Markdown document
- Change the YAML to at least the following. Feel free to add arguments.
---
title: 'R Chapter 21'
author: 'Your name'
output:
  html_document:
    theme: spacelab
    df_print: paged
---
- Load the following packages
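The exact chunk is not shown in this extract; a likely set, based on the packages named below and the functions used later in the chapter, is:

```r
library(tidyverse)   # mutate(), if_else(), select(), and the pipe
library(moderndive)  # get_regression_table(), get_regression_summaries()
library(Stat2Data)   # TeenPregnancy dataset
library(carData)     # Salaries dataset
```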
We will use the `TeenPregnancy` dataset within the `Stat2Data` package and the `Salaries` dataset within the `carData` package. While data in most packages is available in the background once the package is loaded, we need to manually load datasets from `Stat2Data` in order to use them. Run the following code, and the dataset should show up in your Environment pane in the top-right.
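Using R's standard mechanism for loading a dataset from an attached package:

```r
# loads TeenPregnancy into the Environment pane
data(TeenPregnancy)
```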
Be sure to view the documentation for these data by clicking on the package name under the Packages tab in the bottom-right pane, then clicking on the dataset.
21.3 Running Regression
Chapters 6 and 7 cover the following regression models:
- Simple linear regression with two numerical variables
- Multiple linear regression with all numerical variables
- Including a categorical explanatory variable (parallel slopes)
- Regression with a categorical explanatory variable interacted with a numerical variable
- Regression with a binary categorical outcome (linear probability model)
While our interpretation of results may need to adjust according to which of the above regression models we run, the code to run a linear regression is the same regardless of the number and type of explanatory variables we include and whether the outcome variable is continuous or a binary categorical variable. With the exception of including an interaction, running the regression models listed above can be done with the same code structure shown below.
21.3.1 General Syntax
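The general pattern takes the following form, where every name in the call is a placeholder to be replaced as described below:

```r
# template, not runnable code: replace each placeholder with your own names
name_of_model <- lm(outcome ~ exp_1 + exp_2 + exp_k, data = name_of_dataset)
```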
- We name a new object that will hold the results of our regression
- `lm` is the function for linear regression (short for linear model)
- We replace `outcome` with the name of our outcome variable, which should be either numerical or binary
- The tilde `~` separates the outcome on the left-hand side of the regression equation from the explanatory variables on the right-hand side
- We replace `exp_1` through `exp_k` with the names of however many explanatory variables we wish to include, each separated by a plus sign `+`
- We replace `name_of_dataset` with the name of the dataset that contains the variables for the regression model.
21.3.2 Continuous outcome and continuous or categorical explanatory variables
Recall the following multiple regression model from Chapter 6.
\[\begin{equation} FedSpend = \beta_0 + \beta_1Poverty + \beta_2HomeOwn + \beta_3Income + \epsilon \tag{21.1} \end{equation}\]
I ran this regression using the following code:
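Based on the model name and the `summary` output shown later in this chapter, the chunk was of this form:

```r
fedpov2 <- lm(fed_spend ~ poverty + homeownership + income, data = selcounty)
```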
That’s all there is to it. I named the model `fedpov2` to remind myself it was the second model I ran to examine the relationship between federal spending and poverty rate. Note that the code within the `lm` function mimics Equation (21.1). No matter whether the explanatory variables happen to be numerical or categorical, the regression works the same in R. Lastly, I did some behind-the-scenes cleaning of the original `county` data discussed in Chapter 6 and named the result `selcounty`, so I told R to use that dataset when running the regression.
`TeenPregnancy` is a dataset with 50 observations on the following 4 variables:
- `State`: State abbreviation
- `CivilWar`: Role in Civil War (B = border, C = Confederate, O = other, or U = union)
- `Church`: Percentage who attended church in the previous week (from a state survey)
- `Teen`: Number of pregnancies per 1,000 teenage girls in the state
Exercise 1: Suppose we want to use the `TeenPregnancy` dataset to examine whether state teen pregnancy rates are associated with church attendance and a state’s role in the Civil War, which is a categorical variable with four levels (admittedly an odd variable to include, but let’s think of it as a proxy for region). The model would be represented using the following formula:
\[\begin{equation} Teen = \beta_0 + \beta_1Church + \beta_2CivilWar + \epsilon \end{equation}\]
Run this regression model.
21.3.3 Interactions
Though we only cover interacting a numerical variable with a categorical variable in this course, we can interact two variables of any type using the same code. In theory, we can interact more than two variables. In any case, an interaction requires us to multiply the variables within the `lm` function.
Recall the regression model from Chapter 7, where `mrall` is the traffic fatality rate, `vmiles` is the average miles driven, and `jaild` is whether a state imposes mandatory jail for drunk driving.
\[\begin{equation} mrall = \beta_0 + \beta_1 vmiles + \beta_2 jaild + \beta_3 vmiles \times jaild + \epsilon \end{equation}\]
I ran this regression using the following code:
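A chunk of the following form fits the equation (the model name is an assumption; `trdeath` is the dataset named later in this chapter):

```r
# the * operator includes vmiles, jaild, and their interaction
fatal_model <- lm(mrall ~ vmiles*jaild, data = trdeath)
```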
Note that the only difference from the code in the previous example is `vmiles*jaild`, which tells R to interact the two variables by multiplying them together. Once again, the code reflects the equation.
`Salaries` is a dataset with 397 observations recording rank (AsstProf, AssocProf, Prof), discipline (A = theoretical, B = applied), years since Ph.D., years of service, sex, and salary.
Exercise 2: Suppose we want to use the `Salaries` dataset to examine whether professor salary is associated with sex and how long the professor has worked at the institution. Furthermore, suppose we theorize that the association between salary and years worked at the institution differs by sex, thus warranting an interaction. Therefore, we have the following model:
\[\begin{equation} salary = \beta_0 + \beta_1sex + \beta_2yrs.service + \beta_3 sex \times yrs.service + \epsilon \end{equation}\]
Run this regression model.
21.3.4 Dummy outcome
While a regression with a dummy variable as the outcome does not require any special coding, we do need to make sure the dummy variable is coded as 1/0 in the data. Sometimes a dummy variable will be coded like this already in which case we don’t need to do anything to run the regression. Other times, the dummy variable will be coded using text like “yes” and “no” or “Male” and “Female” in the case of a dummy variable for sex.
Recall in Table 7.1 that the coding for `jaild` is yes/no. Also recall the regression equation (7.16) for the linear probability model example, restated below:
\[Pr(jaild=1)=\beta_0+\beta_1vmiles+\beta_2region+\epsilon\]
I ran this regression using the following code:
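A chunk of the following form fits the equation (the model name is an assumption; `trdeath2` is the recoded dataset described next):

```r
jail_lpm <- lm(jaild ~ vmiles + region, data = trdeath2)
```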
But this won’t work if we include `jaild` in our regression code without recoding it to 1/0. This can be done using the following code.
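Consistent with the description that follows:

```r
# copy trdeath, overwriting jaild with a 1/0 version of itself
trdeath2 <- trdeath %>%
  mutate(jaild = if_else(jaild == "yes", 1, 0))
```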
This code creates a new dataset named `trdeath2`, which is a copy of the `trdeath` dataset except that the values of `jaild` are changed using the `mutate` verb. Inside `mutate`, I name the “new” variable `jaild`, which overwrites the existing `jaild` variable based on what follows the equal sign.
The `if_else` function can be used for a variety of purposes, but it is the simplest way to recode a dummy variable stored as text to 1/0. The first argument is the condition. Observations that meet this condition receive the second argument, while observations that do not receive the third argument. Using natural language, I’m telling R, “If `jaild` equals ‘yes’, then code it as 1; otherwise code it as 0.”
Exercise 3: Let’s keep using the `Salaries` data. Suppose we want to predict `discipline`, which again is coded as A = theoretical or B = applied, using the following model:
\[\begin{equation} Discipline = \beta_0 + \beta_1Sex + \beta_2YrsSincePhD + \epsilon \end{equation}\]
Run this regression model.
21.4 Reporting Regression Estimates
This section presents two ways to obtain results after running a regression. The first uses functions that load with the `moderndive` package, and the second uses functions that load with R by default (i.e., Base R). The `moderndive` functions are somewhat more intuitive and produce nicer-looking results, but the Base R functions are more robust to any variety of regression model.
21.4.1 Moderndive
To get a standard table of regression estimates using `moderndive`, we can use the `get_regression_table` function on our saved regression model results like so:
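For example, with the `fedpov2` model from earlier in this chapter:

```r
get_regression_table(fedpov2)
```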
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 23.519 | 1.333 | 17.645 | 0.000 | 20.905 | 26.132 |
| poverty | -0.056 | 0.021 | -2.674 | 0.008 | -0.097 | -0.015 |
| homeownership | -0.126 | 0.012 | -10.736 | 0.000 | -0.149 | -0.103 |
| income | 0.000 | 0.000 | -7.723 | 0.000 | 0.000 | 0.000 |
To get goodness-of-fit measures, we can use the `get_regression_summaries` function like so:
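Again using the `fedpov2` model:

```r
get_regression_summaries(fedpov2)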
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.064 | 0.063 | 20.72216 | 4.55216 | 4.555 | 71.055 | 0 | 3 | 3123 |
and if I only want the R-squared, Adjusted R-squared, and RMSE from this table, I can add the `select` function to the above code chunk like so:
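Piping the summaries table into `select` keeps only the columns named:

```r
get_regression_summaries(fedpov2) %>%
  select(r_squared, adj_r_squared, rmse)
```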
| r_squared | adj_r_squared | rmse |
|---|---|---|
| 0.064 | 0.063 | 4.55216 |
Exercise 4: Produce a standard table of regression estimates and goodness-of-fit measures for each of your three regression models using the `moderndive` functions.
21.4.2 Base R
A comprehensive set of regression results can be obtained by using the `summary` function on our regression model like so:
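With the `fedpov2` model, this produces the output shown below:

```r
summary(fedpov2)
```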
Call:
lm(formula = fed_spend ~ poverty + homeownership + income, data = selcounty)
Residuals:
Min 1Q Median 3Q Max
-9.463 -2.502 -1.007 1.015 39.327
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.51859688 1.33289566 17.645 < 0.0000000000000002 ***
poverty -0.05596900 0.02092699 -2.674 0.00752 **
homeownership -0.12582419 0.01171941 -10.736 < 0.0000000000000002 ***
income -0.00008593 0.00001113 -7.723 0.0000000000000152 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.555 on 3119 degrees of freedom
Multiple R-squared: 0.06397, Adjusted R-squared: 0.06307
F-statistic: 71.06 on 3 and 3119 DF, p-value: < 0.00000000000000022
which gives us most of the information from `get_regression_table` except for the confidence intervals.
The `summary` function also reports the R-squared and Adjusted R-squared at the bottom. The `Residual standard error` at the bottom is not exactly the same as the RMSE above; it is actually equal to `sigma` in the full table from `get_regression_summaries`, but you can treat them the same.
To get the confidence intervals, we can use the `confint` function like so:
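Again applied to the `fedpov2` model:

```r
confint(fedpov2)
```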
2.5 % 97.5 %
(Intercept) 20.9051552232 26.13203853435
poverty -0.0970010683 -0.01493693303
homeownership -0.1488027358 -0.10284564474
income -0.0001077445 -0.00006411382
which uses the 95% confidence level by default.
Exercise 5: Produce a comprehensive set of results for each of your three regression models using the Base R functions.