21 R Regression

21.1 Learning Objectives

In this chapter, you will learn how to:

  • Run the following regression models:
    • Continuous outcome and explanatory variable(s)
    • Categorical explanatory variable
    • Binary categorical (i.e. dummy) outcome (linear probability model)
    • Interaction of two explanatory variables
  • Generate tables of key results

21.2 Set-up

To complete this chapter, you need to

  • Start an R Markdown document
  • Change the YAML to at least the following. Feel free to add arguments.
---
title: 'R Chapter 21'
author: 'Your name'
output: 
  html_document:
    theme: spacelab
    df_print: paged
---
  • Load the following packages
library(tidyverse)
library(moderndive)
library(Stat2Data)
library(carData)

We will use the TeenPregnancy dataset from the Stat2Data package and the Salaries dataset from the carData package. While the datasets in most packages are available in the background once the package is loaded, datasets from Stat2Data must be loaded manually before we can use them. Run the following code, and the dataset should show up in your Environment pane in the top-right.

data("TeenPregnancy")

Be sure to view the documentation for these datasets by clicking on the package name under the Packages tab in the bottom-right pane and then clicking on the dataset name.

21.3 Running Regression

Chapters 6 and 7 cover the following regression models:

  • Simple linear regression with two numerical variables
  • Multiple linear regression with all numerical variables
  • Including a categorical explanatory variable (parallel slopes)
  • Regression with a categorical explanatory interacted with a numerical variable
  • Regression with a binary categorical outcome (linear probability model)

While our interpretation of the results may need to adjust according to which of the above regression models we run, the code to run a linear regression is the same regardless of the number and type of explanatory variables we include and whether the outcome variable is continuous or binary categorical. With the exception of including an interaction, all of the regression models listed above can be run using the same code structure shown below.

21.3.1 General Syntax

new_object_name <- lm(outcome ~ exp_1 + exp_2 + ... + exp_k, data = name_of_dataset)
  • We name a new object that will hold the results of our regression
  • lm is the function for linear regression (acronym for linear model)
  • We replace outcome with the name of our outcome variable, which should be either numerical or binary
  • The tilde ~ separates the outcome on the left-hand side of the regression equation from the explanatory variables on the right-hand side
  • We replace the exp_1 to exp_k with the names of however many explanatory variables we wish to include, each separated by a plus sign +
  • We replace name_of_dataset with the name of the dataset that contains the variables for the regression model.
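As a quick illustration of this structure, here is a sketch using R's built-in mtcars data, which is not one of this chapter's datasets and serves purely as a stand-in:

```r
# Regress fuel efficiency (mpg) on weight (wt) and horsepower (hp)
# using the built-in mtcars data, a stand-in example only
mpg_mod <- lm(mpg ~ wt + hp, data = mtcars)
mpg_mod  # printing the saved object shows the estimated coefficients
```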

21.3.2 Continuous outcome and continuous or categorical explanatory variables

Recall the following multiple regression model from Chapter 6.

\[\begin{equation} FedSpend = \beta_0 + \beta_1Poverty + \beta_2HomeOwn + \beta_3Income + \epsilon \tag{21.1} \end{equation}\]

I ran this regression using the following code:

fedpov2 <- lm(fed_spend ~ poverty + homeownership + income, data = selcounty)

That’s all there is to it. I named the model fedpov2 to remind myself it was the second model I ran to examine the relationship between federal spending and poverty rate. Note that the code within the lm function mimics Equation (21.1). No matter if the explanatory variables happen to be numerical or categorical, the regression works the same in R. Lastly, I did some behind-the-scenes cleaning of the original county data discussed in Chapter 6 and named it selcounty. Therefore, I told R to use that dataset when running the regression.

TeenPregnancy is a dataset with 50 observations on the following 4 variables.

  • State State abbreviation
  • CivilWar Role in Civil War (B=border, C=Confederate, O=other, or U=union)
  • Church Percentage who attended church in previous week (from a state survey)
  • Teen Number of pregnancies per 1,000 teenage girls in state

Exercise 1: Suppose we want to use the TeenPregnancy dataset to examine whether state teen pregnancy rates are associated with church attendance and a state’s role in the Civil War, which is a categorical variable with four levels (admittedly an odd variable to include but let’s think of it as a proxy for region). The model would be represented using the following formula:

\[\begin{equation} Teen = \beta_0 + \beta_1Church + \beta_2CivilWar + \epsilon \end{equation}\]

Run this regression model.

21.3.3 Interactions

Though we only cover interacting a numerical variable with a categorical variable in this course, we can interact two variables of any type using the same code. In theory, we can interact more than two variables. In any case, an interaction requires us to multiply the variables within the lm function.

Recall the regression model from Chapter 7 where mrall is traffic fatality rate, vmiles is the average miles driven, and jaild is whether a state imposes mandatory jail for drunk driving.

\[\begin{equation} mrall = \beta_0 + \beta_1 vmiles + \beta_2 jaild + \beta_3 vmiles \times jaild + \epsilon \end{equation}\]

I ran this regression using the following code

interactmod <- lm(mrall ~ vmiles + jaild + vmiles*jaild, data = trdeath)

Note that the only difference from the code in the previous example is vmiles*jaild, which tells R to interact the two variables by multiplying them together. Once again, the code reflects the equation.
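One detail worth knowing: in an R formula, a*b is shorthand for a + b + a:b (both main effects plus the interaction), so writing vmiles + jaild + vmiles*jaild is harmless but redundant, as R drops the duplicated terms. A quick check of this equivalence, using the built-in mtcars data as a stand-in since we do not have trdeath here:

```r
# a*b expands to a + b + a:b, so these two specifications fit the same model
mod_star <- lm(mpg ~ wt * am, data = mtcars)
mod_long <- lm(mpg ~ wt + am + wt:am, data = mtcars)
all.equal(coef(mod_star), coef(mod_long))  # TRUE
```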

Salaries is a dataset with 397 observations recording rank (AsstProf, AssocProf, Prof), discipline (A = theoretical, B = applied), years since their Ph.D., years of experience, sex, and salary.

Exercise 2: Suppose we want to use the Salaries dataset to examine whether professor salary is associated with their sex and how long they have worked at the institution. Furthermore, suppose we theorize that the association between salary and how long they have worked at the institution differs by sex, thus warranting an interaction. Therefore, we have the following model:

\[\begin{equation} salary = \beta_0 + \beta_1sex + \beta_2yrs.service + \beta_3 sex \times yrs.service + \epsilon \end{equation}\]

Run this regression model.

21.3.4 Dummy outcome

While a regression with a dummy variable as the outcome does not require any special code, we do need to make sure the dummy variable is coded as 1/0 in the data. Sometimes a dummy variable will already be coded this way, in which case we don’t need to do anything before running the regression. Other times, the dummy variable will be coded using text, like “yes” and “no”, or “Male” and “Female” in the case of a dummy variable for sex.

Recall in Table 7.1 the coding for jaild is yes/no. Also recall the regression equation (7.16) for the linear probability model example restated below:

\[Pr(jaild=1)=\beta_0+\beta_1vmiles+\beta_2region+\epsilon\]

I ran this regression using the following code:

lpm_mod <- lm(jaild ~ vmiles + region, data = trdeath2)

Note, however, that this code uses trdeath2 rather than the original trdeath: the regression will not run if jaild is still coded as text, so we first need to recode it to 1/0. This can be done using the following code.

trdeath2 <- trdeath %>% 
  mutate(jaild = if_else(jaild == 'yes', 1, 0))

This code creates a new dataset named trdeath2, which is a copy of the trdeath dataset except that the values of jaild are changed using the mutate verb. Inside mutate, I name the “new” variable jaild, which overwrites the existing jaild variable based on what follows the equal sign.

The if_else function can be used for a variety of purposes, but it is the simplest way to recode a text dummy variable to 1/0. The first argument is the condition. Observations that meet this condition receive the second argument, while those that do not receive the third argument. In natural language, I’m telling R, “If jaild equals ‘yes’, then code it as 1; otherwise code it as 0.”
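To see if_else in action on its own, here is a sketch with a small made-up dataset (the tibble and its values are hypothetical, for illustration only):

```r
library(dplyr)

# Hypothetical toy data with a text dummy variable
toy <- tibble(state = c('AL', 'AK', 'AZ', 'AR'),
              jaild = c('yes', 'no', 'no', 'yes'))

# Recode the text values to 1/0
toy %>% 
  mutate(jaild = if_else(jaild == 'yes', 1, 0))
# jaild is now 1, 0, 0, 1
```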

Exercise 3: Let’s keep using the Salaries data. Suppose we want to predict discipline, which again is coded as A = theoretical or B = applied, using the following model (remember that the outcome will first need to be recoded to 1/0):

\[\begin{equation} Discipline = \beta_0 + \beta_1Sex + \beta_2YrsSincePhD + \epsilon \end{equation}\]

Run this regression model.

21.4 Reporting Regression Estimates

This section presents two ways to obtain results after running a regression. The first uses functions that load with the moderndive package, and the second uses functions that load with R by default (i.e. Base R). The moderndive functions are somewhat more intuitive and produce nicer-looking results, but the Base R functions work across a wider variety of regression models.

21.4.1 Moderndive

To get a standard table of regression estimates using moderndive, we can use the get_regression_table function on our saved regression model results like so

get_regression_table(fedpov2)
term          estimate std_error statistic p_value lower_ci upper_ci
intercept       23.519     1.333    17.645   0.000   20.905   26.132
poverty         -0.056     0.021    -2.674   0.008   -0.097   -0.015
homeownership   -0.126     0.012   -10.736   0.000   -0.149   -0.103
income           0.000     0.000    -7.723   0.000    0.000    0.000

To get goodness-of-fit measures, we can use the get_regression_summaries function like so

get_regression_summaries(fedpov2)
r_squared adj_r_squared      mse    rmse sigma statistic p_value df nobs
    0.064         0.063 20.72216 4.55216 4.555    71.055       0  3 3123

and if I only want the R-squared, Adjusted R-squared, and RMSE from this table, I can add the select function to the above code chunk like so

get_regression_summaries(fedpov2) %>% 
  select(r_squared, adj_r_squared, rmse)
r_squared adj_r_squared    rmse
    0.064         0.063 4.55216

Exercise 4: Produce a standard table of regression estimates and goodness-of-fit measures for each of your three regression models using the moderndive functions.

21.4.2 Base R

A comprehensive set of regression results can be obtained using the summary function on our regression model like so

summary(fedpov2)

Call:
lm(formula = fed_spend ~ poverty + homeownership + income, data = selcounty)

Residuals:
   Min     1Q Median     3Q    Max 
-9.463 -2.502 -1.007  1.015 39.327 

Coefficients:
                 Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)   23.51859688  1.33289566  17.645 < 0.0000000000000002 ***
poverty       -0.05596900  0.02092699  -2.674              0.00752 ** 
homeownership -0.12582419  0.01171941 -10.736 < 0.0000000000000002 ***
income        -0.00008593  0.00001113  -7.723   0.0000000000000152 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.555 on 3119 degrees of freedom
Multiple R-squared:  0.06397,   Adjusted R-squared:  0.06307 
F-statistic: 71.06 on 3 and 3119 DF,  p-value: < 0.00000000000000022

which gives us most of the information from get_regression_table except for the confidence intervals.

The summary function also reports the R-squared and Adjusted R-squared at the bottom. The residual standard error at the bottom is not exactly the same as the RMSE above (it is actually equal to sigma in the full table from get_regression_summaries), but you can treat them the same.
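If you only want one or two of these numbers, the object returned by summary is a list whose components can be extracted by name. A sketch, assuming a saved lm result named mod (e.g. fedpov2):

```r
# Fit a model (mtcars used as a stand-in dataset) and save its summary
mod <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(mod)

# Pull out individual pieces by name
s$r.squared      # R-squared
s$adj.r.squared  # Adjusted R-squared
s$sigma          # residual standard error
```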

To get the confidence intervals, we can use the confint function like so,

confint(fedpov2)
                      2.5 %         97.5 %
(Intercept)   20.9051552232 26.13203853435
poverty       -0.0970010683 -0.01493693303
homeownership -0.1488027358 -0.10284564474
income        -0.0001077445 -0.00006411382

which uses the 95% confidence level by default.
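If you want a different confidence level, confint takes a level argument. For example (again with a model named mod fit to the stand-in mtcars data):

```r
# 90% confidence intervals instead of the default 95%
mod <- lm(mpg ~ wt + hp, data = mtcars)
confint(mod, level = 0.90)
```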

Exercise 5: Produce a comprehensive set of results for each of your three regression models using the Base R functions.

21.5 Save and Upload

Knit your Rmd to save it and check for errors. If you are satisfied with your work, upload to eLC. Once you upload, answers will become available for download.