
class: center, middle, inverse, title-slide

# PADP 7120 Data Applications in PA
## Panel Data Analysis
### Alex Combs
### UGA | SPIA | PADP
### Last updated: December 01, 2021

---

# Objectives

- Panel data provides additional information we can incorporate into regression
  - Why it matters
  - How to do so
  - Whether or not to do so

---
# Panel Data

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> country </th>
   <th style="text-align:left;"> continent </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:right;"> lifeExp </th>
   <th style="text-align:right;"> pop </th>
   <th style="text-align:right;"> gdpPercap </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Argentina </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 1997 </td>
   <td style="text-align:right;"> 73.275 </td>
   <td style="text-align:right;"> 36203463 </td>
   <td style="text-align:right;"> 10967.282 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Argentina </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 2002 </td>
   <td style="text-align:right;"> 74.340 </td>
   <td style="text-align:right;"> 38331121 </td>
   <td style="text-align:right;"> 8797.641 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Argentina </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 75.320 </td>
   <td style="text-align:right;"> 40301927 </td>
   <td style="text-align:right;"> 12779.380 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bolivia </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 1997 </td>
   <td style="text-align:right;"> 62.050 </td>
   <td style="text-align:right;"> 7693188 </td>
   <td style="text-align:right;"> 3326.143 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bolivia </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 2002 </td>
   <td style="text-align:right;"> 63.883 </td>
   <td style="text-align:right;"> 8445134 </td>
   <td style="text-align:right;"> 3413.263 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bolivia </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 65.554 </td>
   <td style="text-align:right;"> 9119152 </td>
   <td style="text-align:right;"> 3822.137 </td>
  </tr>
</tbody>
</table>

- Same units measured over multiple time periods

- Provides variation between units (cross-sectional) AND between time periods (time-series)
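
A minimal sketch of the two kinds of variation, using the six rows from the table above typed in by hand. The object names (`panel`, `between_means`, `within_dev`) are illustrative and not part of the course code.

```r
# The six country-year rows shown in the table above.
panel <- data.frame(
  country   = rep(c("Argentina", "Bolivia"), each = 3),
  year      = rep(c(1997, 2002, 2007), times = 2),
  lifeExp   = c(73.275, 74.340, 75.320, 62.050, 63.883, 65.554),
  gdpPercap = c(10967.282, 8797.641, 12779.380, 3326.143, 3413.263, 3822.137)
)

# Balanced panel: every unit is observed once in each time period.
obs_per_unit <- table(panel$country)
is_balanced  <- length(unique(as.vector(obs_per_unit))) == 1 &&
  !any(duplicated(panel[, c("country", "year")]))

# Cross-sectional (between) variation: differences in country averages.
between_means <- tapply(panel$lifeExp, panel$country, mean)

# Time-series (within) variation: each country's deviation from its own average.
within_dev <- panel$lifeExp - ave(panel$lifeExp, panel$country)

is_balanced
between_means
within_dev
```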

---
# Revisiting Life Exp and GDP

```r
gap_ols <- lm(log(lifeExp) ~ log(gdpPercap) + continent, data = gapminder)
get_regression_table(gap_ols)
```

|term              | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:-----------------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept         |    3.062|     0.026|   117.692|       0|    3.011|    3.113|
|log(gdpPercap)    |    0.112|     0.004|    31.843|       0|    0.105|    0.119|
|continentAmericas |    0.133|     0.011|    12.519|       0|    0.112|    0.154|
|continentAsia     |    0.110|     0.009|    12.037|       0|    0.092|    0.128|
|continentEurope   |    0.166|     0.012|    14.357|       0|    0.143|    0.189|
|continentOceania  |    0.152|     0.029|     5.187|       0|    0.095|    0.210|

---
# Revisiting Life Exp and GDP

- But the previous model treats each observation as independent of all the others

- Argentina in 2002 is not independent of Argentina in 1997

- Might there be unobserved characteristics of each country that do not change over time and affect both GDP and life expectancy?

- And might there be global circumstances that change over time but affect all countries?

- If so, we have OVB in our model.

---
# Revisiting Life Exp and GDP

- Controlling for country and time **fixed** effects:

|term           |  estimate| std.error| statistic|   p.value|
|:--------------|---------:|---------:|---------:|---------:|
|log(gdpPercap) | 0.0021665| 0.0053537| 0.4046819| 0.6857672|

- We now fail to reject the null hypothesis for GDP

- No statistically significant evidence that GDP affects life expectancy
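
The slides do not show the code behind this table. One way such a model might be fit is sketched below, assuming the `plm` and `gapminder` packages are installed; whether it reproduces the exact estimate above is not confirmed.

```r
fit <- NULL  # stays NULL if the packages are not installed

if (requireNamespace("plm", quietly = TRUE) &&
    requireNamespace("gapminder", quietly = TRUE)) {
  # Convert the tibble to a plain data frame before handing it to plm.
  gap <- as.data.frame(gapminder::gapminder)

  # Country and year fixed effects ("twoways") with the within estimator.
  fit <- plm::plm(log(lifeExp) ~ log(gdpPercap),
                  data = gap,
                  index = c("country", "year"),
                  model = "within", effect = "twoways")
  print(summary(fit))
}
```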

---
# Enter fixed effects regression

- On left, effect of `$X$` on `$Y$` is biased

- If we claim all confounding omitted variables are controlled for by including the **fixed** effect of each unit

- Then we have eliminated the OVB and can argue our estimate is a causal effect of `$X$` on `$Y$`

---
# Fixed effects regression

- Fixed effects (FE) regression controls for time-invariant (i.e. constant), unobserved differences across units of a panel

- Essentially includes a dummy variable for each unit in the panel, allowing each unit to have its own y-intercept
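
A hedged sketch of the dummy-variable idea on simulated data (all numbers are made up; the true slope is set to 1). The unobserved unit characteristic `a` is correlated with `x`, so pooled OLS is biased while the model with one dummy per unit is not.

```r
set.seed(1)
n_units <- 20
n_periods <- 10
unit <- rep(seq_len(n_units), each = n_periods)

# Time-invariant, unobserved unit characteristic (the confounder).
a <- rnorm(n_units, sd = 3)[unit]

# x is correlated with the unit characteristic; the true slope on x is 1.
x <- 0.8 * a + rnorm(n_units * n_periods)
y <- 1 * x - 2 * a + rnorm(n_units * n_periods)

pooled <- lm(y ~ x)                 # ignores the panel structure
lsdv   <- lm(y ~ x + factor(unit))  # one dummy (own intercept) per unit

coef(pooled)[["x"]]  # pulled away from 1 by the omitted unit characteristic
coef(lsdv)[["x"]]    # should land near the true slope of 1
```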

---
# Example

- Suppose we are interested in state-level college enrollment as a percentage of state population age 18-24

---
# Example
<img src="panel_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---
# Example

- Now suppose we want to investigate the effect of average tuition on enrollment

- Note that a common regression line is fit to **ALL** of the plotted points

---
# Example

- Note the variation in slopes if we isolate each state

---
# Example

---
# Fixed effects

- Fixed effects estimates the relationship between the response and explanatory variable(s) **within** each unit as the explanatory variable changes

- The estimate for an explanatory variable is roughly a weighted average of the individual unit slopes, with more weight on units whose explanatory variable varies more over time
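
With a single explanatory variable, the "average of unit slopes" idea can be checked directly. The sketch below uses simulated data and illustrative names; the weights are each unit's within-unit sum of squared deviations in `x`.

```r
set.seed(2)
n_units <- 8
n_periods <- 12
unit <- rep(seq_len(n_units), each = n_periods)
x <- rnorm(n_units * n_periods, mean = unit)  # units differ in their x levels
y <- 5 * unit - 0.5 * x + rnorm(n_units * n_periods)

# Slope of y on x estimated separately within each unit.
unit_slopes <- sapply(split(data.frame(x, y), unit),
                      function(d) coef(lm(y ~ x, data = d))[["x"]])

# Weight for each unit: how much x varies around that unit's own mean.
x_dev <- x - ave(x, unit)
w <- tapply(x_dev^2, unit, sum)

# FE slope: regress unit-demeaned y on unit-demeaned x.
y_dev <- y - ave(y, unit)
fe_slope <- coef(lm(y_dev ~ 0 + x_dev))[["x_dev"]]

weighted_avg <- sum(w * unit_slopes) / sum(w)

fe_slope
weighted_avg  # matches fe_slope up to floating-point error
```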

---
# Basic regression notation

- Basic OLS regression:

`$$y_i=\beta_0 + \beta_1x_i + \epsilon_i$$`

- The `$i$` subscript is an **index** for the unit of analysis

- Signals to the reader that we are using variation *between* units `$i = 1 \dots N$` where `$N$` is the total units in our data

- In `gapminder`, there are 1,704 observations (142 countries × 12 years)

- Subscript `$i$` would run from 1 to 1,704

---
# Fixed Effects Notation

`$$y_{it}=\beta x_{it} + \alpha z_i + \delta w_t + \epsilon_{it}$$`

- Now we have values for unit `$i = 1 \dots N$` at time `$t = 1 \dots T$` where `$N$` is the total *unique* units in the data and `$T$` is the total time periods

- Signals to reader we are using variation *within* each unit over the time they are observed in the data

- In `gapminder`, there are 142 unique countries, so `$i$` runs from 1-142

- And there are 12 years for each country, so `$t$` runs from 1-12
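
The two indexing schemes count the same rows. A small sketch, with the unit and period counts taken from the bullets above:

```r
n_units   <- 142  # unique countries in gapminder, per the slide
n_periods <- 12   # years observed per country, per the slide

# Every (i, t) pair in a balanced panel.
index <- expand.grid(t = seq_len(n_periods), i = seq_len(n_units))

nrow(index)  # 142 * 12 = 1704 country-year observations
```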

---
# Fixed Effects Notation

`$$y_{it}=\beta x_{it} + \alpha z_i + \delta w_t + \epsilon_{it}$$`

- `$x_{it}$` represents factors that vary between units `$i$` and over time `$t$`
  - Education level, crime, unemployment, tax rates, etc.
  
--

- `$z_i$` vary between units `$i$` but are constant over time
  - Sex, race, geographic region, treatment/control group

- `$w_t$` vary over time `$t$` but are constant between units
  - Inflation, interest rates, recession, war

---
# Fixed effects

`$$y_{it}=\beta x_{it} + \alpha z_i + \delta w_t + \epsilon_{it}$$`

- Fixed effects is an admission that we can't possibly collect data for all the `$z_i$` variables

- Fortunately, we don't need data for `$z_i$` if we have panel data and run fixed effects

- With fixed effects, ALL `$z_i$` variables collapse into a *unique* intercept for each unit `$i$`

`$$y_{it}= \alpha_i + \beta x_{it} + \delta w_t + \epsilon_{it}$$`
- Note the subscript `$i$` for the intercept instead of a common `$\beta_0$` intercept
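
One way to see why data on `$z_i$` is not needed: average the equation over time within each unit, then subtract. The bar notation (an average over `$t$` for unit `$i$`) is introduced here for illustration and assumes a balanced panel.

`$$\bar{y}_i = \alpha_i + \beta \bar{x}_i + \delta \bar{w} + \bar{\epsilon}_i$$`

`$$y_{it} - \bar{y}_i = \beta (x_{it} - \bar{x}_i) + \delta (w_t - \bar{w}) + (\epsilon_{it} - \bar{\epsilon}_i)$$`

The unit intercept `$\alpha_i$` appears in both lines and cancels, so anything constant within a unit drops out of the estimation.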

---
# Two-Way Fixed Effects

`$$y_{it}= \alpha_i + \beta x_{it} + \delta w_t + \epsilon_{it}$$`

- We can also control for all factors that vary over time but not across units, `$w_t$`

- Often referred to as two-way fixed effects: unit and time

- Now including a dummy variable for each unit and each time period

`$$y_{it}= \alpha_i + \delta_t + \beta x_{it} + \epsilon_{it}$$`
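
A sketch of two-way fixed effects on simulated data (true `$\beta$` set to 2; all names are illustrative). In a balanced panel, the dummy-variable regression and the regression on doubly demeaned variables give the same slope.

```r
set.seed(3)
n_units <- 15
n_periods <- 8
d <- expand.grid(time = seq_len(n_periods), unit = seq_len(n_units))

unit_effect <- rnorm(n_units, sd = 2)[d$unit]    # alpha_i
time_effect <- rnorm(n_periods, sd = 2)[d$time]  # delta_t
d$x <- 0.5 * unit_effect + 0.5 * time_effect + rnorm(nrow(d))
d$y <- unit_effect + time_effect + 2 * d$x + rnorm(nrow(d))

# Dummy-variable version: one dummy per unit and one per time period.
twoway <- lm(y ~ x + factor(unit) + factor(time), data = d)

# Balanced-panel equivalent: subtract unit means and time means,
# then add back the grand mean.
demean2 <- function(v) v - ave(v, d$unit) - ave(v, d$time) + mean(v)
twoway_dm <- lm(demean2(y) ~ 0 + demean2(x), data = d)

coef(twoway)[["x"]]
coef(twoway_dm)[[1]]  # same slope as the dummy-variable version
```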

---
# Running FE in R and interpretation

- First, basic OLS regression. PctEnroll is percent of 18-24 age population in college. Tuition is in 1,000s of dollars.

`$$PctEnroll_i = \beta_0 + \beta_1 Tuition_i + \epsilon_i$$`

```r
basic_ols <- lm(enroll_pct ~ tuition, data = statepanel)
```

```r
get_regression_table(basic_ols)
```

|term      | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:---------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept |   34.792|     0.379|    91.679|   0.000|   34.047|   35.536|
|tuition   |   -0.213|     0.066|    -3.230|   0.001|   -0.342|   -0.084|

- On average, a $1,000 increase in states' average tuition is associated with a 0.2 percentage point decrease in state enrollment.

---
# Running FE in R and interpretation

`$$PctEnroll_{it} = \alpha_i + \beta_1 tuition_{it} + \epsilon_{it}$$`

```r
library(plm)  # provides plm() and pFtest()

fe <- plm(enroll_pct ~ tuition, data = statepanel,
          index = c("state", "year"), model = "within")
```

```r
summary(fe)  # get_regression_table() won't work here
```

|term    | estimate| std.error| statistic| p.value|
|:-------|--------:|---------:|---------:|-------:|
|tuition | 1.057165| 0.0652482|   16.2022|       0|

- On average, a $1,000 increase in states' average tuition increases enrollment by about 1 percentage point, **controlling for state fixed effects**.

---
# Example

- Our fixed effects results are counterintuitive

- Economics tells us that as price increases, demand should decrease

- Enrollment should not be expected to increase as the price increases

- What is a plausible explanation?

- Most students do not pay full tuition

- A rise in tuition typically corresponds with a rise in financial aid

- We should control for how much tuition students actually pay

---
# Example

|state   | year|region | tuition| studentshare| unemp| enroll_pct|
|:-------|----:|:------|-------:|------------:|-----:|----------:|
|Alabama | 1991|South  |     3.3|         34.2|   7.4|       37.9|
|Alabama | 1992|South  |     3.6|         37.5|   7.6|       38.5|
|Alabama | 1993|South  |     3.6|         38.6|   7.3|       39.6|
|Alabama | 1994|South  |     3.8|         37.1|   6.2|       40.3|
|Alabama | 1995|South  |     3.8|         35.4|   5.9|       40.2|
|Alabama | 1996|South  |     3.9|         37.4|   5.2|       40.4|

- `studentshare` column is the percent of `tuition` students pay

- Also note there is a region column

---
# Back to basic OLS regression

```r
basic_ols2 <- lm(enroll_pct ~ tuition + studentshare + region, 
           data = statepanel)
```

|term         | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:------------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept    |   30.800|     0.606|    50.830|   0.000|   29.611|   31.988|
|tuition      |   -1.226|     0.146|    -8.418|   0.000|   -1.512|   -0.940|
|studentshare |    0.240|     0.025|     9.624|   0.000|    0.191|    0.289|
|regionNE     |   -3.553|     0.517|    -6.869|   0.000|   -4.568|   -2.538|
|regionSouth  |    0.732|     0.440|     1.665|   0.096|   -0.131|    1.594|
|regionWest   |    1.271|     0.470|     2.707|   0.007|    0.350|    2.192|

- On average, a $1,000 increase in average *sticker price* results in a 1.2 percentage point decline in state enrollment, all else equal.

---
# Better FE model

```r
fe2 <- plm(enroll_pct ~ tuition + studentshare + region, 
           data = statepanel,
           index = c("state", "year"), model = "within")
```

|term         |   estimate| std.error| statistic| p.value|
|:------------|----------:|---------:|---------:|-------:|
|tuition      | -0.9695150| 0.1460752| -6.637095|       0|
|studentshare |  0.3442037| 0.0226472| 15.198490|       0|

- On average, a $1,000 increase in average sticker price results in a 1.0 percentage point decrease in enrollment, all else equal.

- Note no estimates for regions

---
# Interpretation

- Why no estimates for `region`?

- Because region is constant over time; a state's region is always the same

- Therefore, it gets absorbed into the fixed effect

- If we really care about a time-invariant variable, using fixed effects will prevent us from obtaining an estimate
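
The absorption can be seen with plain `lm()` and unit dummies. This is a sketch on simulated data with a made-up two-level `region`; `lm()` reports the collinear coefficient as `NA`, whereas the `plm` output above simply omits it.

```r
set.seed(4)
n_units <- 6
n_periods <- 5
unit   <- rep(seq_len(n_units), each = n_periods)
region <- ifelse(unit <= 3, "North", "South")  # never changes within a unit
x <- rnorm(n_units * n_periods)
y <- 1 + 0.5 * x + 2 * (region == "South") + rnorm(n_units * n_periods)

# Unit dummies first, then the time-invariant variable.
fit <- lm(y ~ x + factor(unit) + region)

coef(fit)[["regionSouth"]]  # NA: perfectly collinear with the unit dummies
coef(fit)[["x"]]            # still estimated from within-unit variation
```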

---
# Drawbacks of FE

- Can't estimate association/effect of time-invariant variables

- Estimates are less precise than basic OLS regression
  - FE is like adding a dummy explanatory variable for each unit in the panel
  - Each explanatory variable imposes a penalty on precision (uses up one degree of freedom)
  - Preferable if we can avoid this loss

---
# Testing whether FE should be used

```r
pFtest(fe2, basic_ols2)
```

```
## 
## 	F test for individual effects
## 
## data:  enroll_pct ~ tuition + studentshare + region
## F = 101.89, df1 = 45, df2 = 1166, p-value < 2.2e-16
## alternative hypothesis: significant effects
```

- If p-value < 0.05, use FE

- We should use FE in this example
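
The same idea can be sketched in base R: an F test of whether the unit dummies jointly improve on pooled OLS. This uses `anova()` on simulated data with strong unit effects and is meant as an illustration of the logic, not a replacement for `pFtest()`.

```r
set.seed(6)
n_units <- 10
n_periods <- 8
unit  <- rep(seq_len(n_units), each = n_periods)
alpha <- rnorm(n_units, sd = 5)[unit]  # strong unit effects
x <- rnorm(n_units * n_periods)
y <- alpha + 0.5 * x + rnorm(n_units * n_periods)

pooled <- lm(y ~ x)
lsdv   <- lm(y ~ x + factor(unit))

# Joint F test that all unit dummies are zero.
ftest <- anova(pooled, lsdv)

ftest$Df[2]        # number of restrictions tested: n_units - 1
ftest$`Pr(>F)`[2]  # a small p-value favors including unit effects
```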

---
# Controlling for time trends

`$$y_{it}= \alpha_i + \delta_t + \beta x_{it} + \epsilon_{it}$$`

- We may also want to control for a common time trend, `$\delta_t$`
- Adds dummy variables for each year in our panel
- Controls for factors that changed or occurred during this time period and affected all units in the panel

---
# Adding time trends in FE

```r
fe2_time <- plm(enroll_pct ~ tuition + studentshare + unemp + region,
                data = statepanel, index = c("state", "year"), 
                model = "within", effect = "twoways")
```

|term         |   estimate| std.error| statistic|   p.value|
|:------------|----------:|---------:|---------:|---------:|
|tuition      | -1.0946700| 0.1416791| -7.726403| 0.0000000|
|studentshare |  0.1678541| 0.0241394|  6.953524| 0.0000000|
|unemp        | -0.1496308| 0.0779612| -1.919299| 0.0551957|

- A $1,000 increase in average tuition results in a 1.1 percentage point decline in enrollment, all else equal.

---
# Testing whether to include time trends

- Adding a dummy for each year costs us more degrees of freedom

```r
pFtest(fe2_time, fe2)
```

```
## 
## 	F test for twoways effects
## 
## data:  enroll_pct ~ tuition + studentshare + unemp + region
## F = 17.644, df1 = 25, df2 = 1141, p-value < 2.2e-16
## alternative hypothesis: significant effects
```

- If p-value < 0.05, include time trends

- We should include time trends in this example
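
A base R sketch of the same comparison: does adding a dummy for each period improve on unit fixed effects alone? The data are simulated with a common shock in each period, and every name here is illustrative.

```r
set.seed(7)
n_units <- 12
n_periods <- 10
d <- expand.grid(time = seq_len(n_periods), unit = seq_len(n_units))
d$x <- rnorm(nrow(d))
shock <- rnorm(n_periods, sd = 3)[d$time]  # hits every unit in a given period
d$y <- rnorm(n_units)[d$unit] + shock + 0.5 * d$x + rnorm(nrow(d))

oneway <- lm(y ~ x + factor(unit), data = d)
twoway <- lm(y ~ x + factor(unit) + factor(time), data = d)

# Joint F test that all time dummies are zero.
time_test <- anova(oneway, twoway)

time_test$Df[2]        # extra dummies added: n_periods - 1
time_test$`Pr(>F)`[2]  # a small p-value favors including time effects
```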

---
# Recap

- Fixed effects eliminates OVB caused by omitted variables that differ across units but do not change over time

- Controlling for time trends eliminates OVB caused by omitted variables that change over time but affect all units

- Omitted variables that vary between units AND time can still cause OVB

- Using fixed effects removes estimates for any variable that does not change over time within a unit