class: center, middle, inverse, title-slide

# PADP 7120 Data Applications in PA
## Panel Data Analysis
### Alex Combs
### UGA | SPIA | PADP
### Last updated: December 01, 2021

---

# Objectives

- Panel data provides additional information we can incorporate into regression
  - Why it matters
  - How to do so
  - Whether or not to do so

---

# Panel Data

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> country </th>
   <th style="text-align:left;"> continent </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:right;"> lifeExp </th>
   <th style="text-align:right;"> pop </th>
   <th style="text-align:right;"> gdpPercap </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Argentina </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 1997 </td>
   <td style="text-align:right;"> 73.275 </td>
   <td style="text-align:right;"> 36203463 </td>
   <td style="text-align:right;"> 10967.282 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Argentina </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 2002 </td>
   <td style="text-align:right;"> 74.340 </td>
   <td style="text-align:right;"> 38331121 </td>
   <td style="text-align:right;"> 8797.641 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Argentina </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 75.320 </td>
   <td style="text-align:right;"> 40301927 </td>
   <td style="text-align:right;"> 12779.380 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bolivia </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 1997 </td>
   <td style="text-align:right;"> 62.050 </td>
   <td style="text-align:right;"> 7693188 </td>
   <td style="text-align:right;"> 3326.143 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bolivia </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 2002 </td>
   <td style="text-align:right;"> 63.883 </td>
   <td style="text-align:right;"> 8445134
</td>
   <td style="text-align:right;"> 3413.263 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bolivia </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 65.554 </td>
   <td style="text-align:right;"> 9119152 </td>
   <td style="text-align:right;"> 3822.137 </td>
  </tr>
</tbody>
</table>

- Same units measured over multiple time periods
- Provides variation between units (cross-sectional) AND between time periods (time-series)

---

# Revisiting Life Exp and GDP

```r
library(gapminder)   # gapminder data
library(moderndive)  # get_regression_table()
gap_ols <- lm(log(lifeExp) ~ log(gdpPercap) + continent, data = gapminder)
get_regression_table(gap_ols)
```

|term              | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:-----------------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept         |    3.062|     0.026|   117.692|       0|    3.011|    3.113|
|log(gdpPercap)    |    0.112|     0.004|    31.843|       0|    0.105|    0.119|
|continentAmericas |    0.133|     0.011|    12.519|       0|    0.112|    0.154|
|continentAsia     |    0.110|     0.009|    12.037|       0|    0.092|    0.128|
|continentEurope   |    0.166|     0.012|    14.357|       0|    0.143|    0.189|
|continentOceania  |    0.152|     0.029|     5.187|       0|    0.095|    0.210|

---

# Revisiting Life Exp and GDP

- But the previous model treats observations as independent of one another
- Argentina in 2002 is not independent of Argentina in 1997

--

- Might there be unobserved characteristics of each country that do not change over time and affect GDP and life expectancy?

--

- And might there be global circumstances that change over time but affect all countries?

--

- If so, we have OVB in our model.
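
One way to absorb both kinds of omitted variables is a two-way "within" model. This is a minimal sketch, assuming the `gapminder` and `plm` packages are installed; the object name `gap_fe` is made up for illustration and the call is not necessarily the one used to produce any table in these slides.

```r
# Sketch only: assumes the gapminder and plm packages are available
library(gapminder)
library(plm)

# Two-way within model: absorbs country effects (constant over time)
# and year effects (common to all countries)
gap_fe <- plm(log(lifeExp) ~ log(gdpPercap),
              data = as.data.frame(gapminder),
              index = c("country", "year"),
              model = "within", effect = "twoways")

summary(gap_fe)
```

Note that `continent` is left out: it never changes within a country, so the country effects already account for it.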
---

# Revisiting Life Exp and GDP

- Controlling for country and time **fixed** effects:

|term           |  estimate| std.error| statistic|   p.value|
|:--------------|---------:|---------:|---------:|---------:|
|log(gdpPercap) | 0.0021665| 0.0053537| 0.4046819| 0.6857672|

- We now fail to reject the null hypothesis for GDP
- No statistically significant evidence that GDP affects life expectancy

---

# Enter fixed effects regression

<img src="lectures_files/fixfx-dag.png" width="1175" style="display: block; margin: auto;" />

- On the left, the effect of `\(X\)` on `\(Y\)` is biased
- If we claim all confounding omitted variables are controlled for by including the **fixed** effect of each unit
- Then we have eliminated the OVB and can argue our estimate is a causal effect of `\(X\)` on `\(Y\)`

---

# Fixed effects regression

- Fixed effects (FE) regression controls for time-invariant (i.e. constant), unobserved differences across units of a panel

--

- Essentially includes a dummy variable for each unit in the panel, allowing each unit to have its own y-intercept

---

# Example

- Suppose we are interested in state-level college enrollment as a percentage of state population age 18-24

<img src="panel_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

# Example

<img src="panel_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

# Example

- Now suppose we want to investigate the effect of average tuition on enrollment

<img src="panel_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

- Note that a common regression line is fit to **ALL** of the plot points

---

# Example

<img src="panel_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

- Note the variation in slopes if we isolate each state

---

# Example

<img src="panel_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

# Fixed effects

- Fixed effects estimates the relationship
between the response and explanatory variable(s) **within** each unit as the explanatory variable changes
- The estimate for an explanatory variable is, loosely, a weighted average of the individual unit slopes

---

# Basic regression notation

- Basic OLS regression:

$$
y_i=\beta_0 + \beta_1x_i + \epsilon_i
$$

- The `\(i\)` subscript is an **index** for the unit of analysis
- Signals to the reader that we are using variation *between* units `\(i = 1 \dots N\)` where `\(N\)` is the total units in our data

--

- In `gapminder`, there are 1,704 observations (142 countries × 12 years)
- Subscript `\(i\)` would run from 1 to 1,704

---

# Fixed Effects Notation

`$$y_{it}=\beta x_{it} + \alpha z_i + \delta w_t + \epsilon_{it}$$`

- Now we have values for unit `\(i = 1 \dots N\)` at time `\(t = 1 \dots T\)` where `\(N\)` is the total *unique* units in the data and `\(T\)` is the total time periods
- Signals to the reader that we are using variation *within* each unit over the time they are observed in the data

--

- In `gapminder`, there are 142 unique countries, so `\(i\)` runs from 1 to 142
- And there are 12 years for each country, so `\(t\)` runs from 1 to 12

---

# Fixed Effects Notation

`$$y_{it}=\beta x_{it} + \alpha z_i + \delta w_t + \epsilon_{it}$$`

- `\(x_{it}\)` represents factors that vary between units `\(i\)` and over time `\(t\)`
  - Education level, crime, unemployment, tax rates, etc.

--

- `\(z_i\)` vary between units `\(i\)` but are constant over time
  - Sex, race, geographic region, treatment/control group

--

- `\(w_t\)` vary over time `\(t\)` but are constant between units
  - Inflation, interest rates, recession, war

---

# Fixed effects

`$$y_{it}=\beta x_{it} + \alpha z_i + \delta w_t + \epsilon_{it}$$`

- Fixed effects is an admission that we can't possibly collect data for all the `\(z_i\)` variables
- Fortunately, we don't need data for `\(z_i\)` if we have panel data and run fixed effects

--

- With fixed effects, ALL `\(z_i\)` variables collapse into a *unique* intercept for each unit `\(i\)`

`$$y_{it}= \alpha_i + \beta x_{it} + \delta w_t + \epsilon_{it}$$`

- Note the subscript `\(i\)` for the intercept instead of a common `\(\beta_0\)` intercept

---

# Two-Way Fixed Effects

`$$y_{it}= \alpha_i + \beta x_{it} + \delta w_t + \epsilon_{it}$$`

- We can also control for all factors that vary over time but not across units `\(w_t\)`
- Often referred to as two-way fixed effects: unit and time
- Now including a dummy variable for each unit and each time period

`$$y_{it}= \alpha_i + \delta_t + \beta x_{it} + \epsilon_{it}$$`

---

# Running FE in R and interpretation

- First, basic OLS regression. PctEnroll is percent of 18-24 age population in college. Tuition is in 1,000s of dollars.

`$$PctEnroll_i = \beta_0 + \beta_1 Tuition_i + \epsilon_i$$`

```r
basic_ols <- lm(enroll_pct ~ tuition, data = statepanel)
```

```r
get_regression_table(basic_ols)
```

|term      | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:---------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept |   34.792|     0.379|    91.679|   0.000|   34.047|   35.536|
|tuition   |   -0.213|     0.066|    -3.230|   0.001|   -0.342|   -0.084|

- On average, a $1,000 increase in states' average tuition decreases state enrollment 0.2 percentage points.
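
To see how a pooled slope like this can differ from one that allows a separate intercept `\(\alpha_i\)` per unit, here is a minimal base-R sketch on simulated data (not `statepanel`). All names and numbers are invented for illustration: the true slope is set to -1 and the unobserved unit effect is deliberately correlated with `x`.

```r
set.seed(7120)

n_units <- 50
n_years <- 10
unit <- rep(seq_len(n_units), each = n_years)

# Time-invariant unit effect: the z_i we never observe
alpha <- rnorm(n_units, sd = 3)

# x is correlated with the unit effect, so leaving alpha out biases pooled OLS
x <- 0.8 * alpha[unit] + rnorm(n_units * n_years)
y <- alpha[unit] - 1 * x + rnorm(n_units * n_years)  # true slope is -1

panel <- data.frame(unit, x, y)

# Pooled OLS: one common intercept, ignores the panel structure
pooled <- lm(y ~ x, data = panel)

# Fixed effects as dummies: one intercept per unit
lsdv <- lm(y ~ x + factor(unit), data = panel)

# Same slope from demeaning x and y within each unit
panel$x_dm <- panel$x - ave(panel$x, panel$unit)
panel$y_dm <- panel$y - ave(panel$y, panel$unit)
demeaned <- lm(y_dm ~ x_dm - 1, data = panel)

slopes <- c(pooled   = coef(pooled)[["x"]],
            lsdv     = coef(lsdv)[["x"]],
            demeaned = coef(demeaned)[["x_dm"]])
slopes
```

The dummy-variable and demeaned slopes coincide, which is why FE is described as using only *within*-unit variation; the pooled slope mixes in between-unit differences.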
---

# Running FE in R and interpretation

`$$PctEnroll_{it} = \alpha_i + \beta_1 tuition_{it} + \epsilon_{it}$$`

```r
library(plm)  # provides plm() and pFtest()
fe <- plm(enroll_pct ~ tuition,
          data = statepanel,
          index = c("state", "year"),
          model = "within")
```

```r
summary(fe) # get_regression_table won't work
```

|term    | estimate| std.error| statistic| p.value|
|:-------|--------:|---------:|---------:|-------:|
|tuition | 1.057165| 0.0652482|   16.2022|       0|

On average, a $1,000 increase in states' average tuition increases enrollment by about 1 percentage point, **controlling for state fixed effects**.

---

# Example

- Our fixed effects results are counterintuitive
- Economics tells us that as price increases, demand should decrease
- Enrollment should not be expected to increase as the price increases
- What is a plausible explanation?

--

- Most students do not pay full tuition
- A rise in tuition typically corresponds with a rise in financial aid
- We should control for how much tuition students actually pay

---

# Example

|state   | year|region | tuition| studentshare| unemp| enroll_pct|
|:-------|----:|:------|-------:|------------:|-----:|----------:|
|Alabama | 1991|South  |     3.3|         34.2|   7.4|       37.9|
|Alabama | 1992|South  |     3.6|         37.5|   7.6|       38.5|
|Alabama | 1993|South  |     3.6|         38.6|   7.3|       39.6|
|Alabama | 1994|South  |     3.8|         37.1|   6.2|       40.3|
|Alabama | 1995|South  |     3.8|         35.4|   5.9|       40.2|
|Alabama | 1996|South  |     3.9|         37.4|   5.2|       40.4|

- `studentshare` column is the percent of `tuition` students pay
- Also note there is a region column

---

# Back to basic OLS regression

```r
basic_ols2 <- lm(enroll_pct ~ tuition + studentshare + region, data = statepanel)
```

|term         | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:------------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept    |   30.800|     0.606|    50.830|   0.000|   29.611|   31.988|
|tuition      |   -1.226|     0.146|    -8.418|   0.000|   -1.512|   -0.940|
|studentshare |    0.240|     0.025|     9.624|   0.000|    0.191|    0.289|
|regionNE     |   -3.553|     0.517|    -6.869|   0.000|   -4.568|   -2.538|
|regionSouth  |    0.732|     0.440|     1.665|   0.096|   -0.131|    1.594|
|regionWest   |    1.271|     0.470|     2.707|   0.007|    0.350|    2.192|

- On average, a $1,000 increase in average *sticker price* results in a 1.2 percentage point decline in state enrollment, all else equal.

---

# Better FE model

```r
fe2 <- plm(enroll_pct ~ tuition + studentshare + region,
           data = statepanel,
           index = c("state", "year"),
           model = "within")
```

|term         |   estimate| std.error| statistic| p.value|
|:------------|----------:|---------:|---------:|-------:|
|tuition      | -0.9695150| 0.1460752| -6.637095|       0|
|studentshare |  0.3442037| 0.0226472| 15.198490|       0|

- On average, a $1,000 increase in average sticker price results in a 1.0 percentage point decrease in enrollment, all else equal.
- Note no estimates for regions

---

# Interpretation

- Why no estimates for `region`?

--

- Because region is constant over time; a state's region is always the same
- Therefore, it gets absorbed into the fixed effect
- If we really care about a time-invariant variable, using fixed effects will prevent us from obtaining an estimate

---

# Drawbacks of FE

- Can't estimate association/effect of time-invariant variables
- Estimates are less precise than basic OLS regression
  - FE is like adding a dummy explanatory variable for each unit in the panel
  - Each added explanatory variable imposes a penalty on precision (uses up one degree of freedom)
  - Preferable if we can avoid this loss

---

# Testing whether FE should be used

```r
pFtest(fe2, basic_ols2)
```

```
## 
## 	F test for individual effects
## 
## data:  enroll_pct ~ tuition + studentshare + region
## F = 101.89, df1 = 45, df2 = 1166, p-value < 2.2e-16
## alternative hypothesis: significant effects
```

- If p-value < 0.05, use FE
- We should use FE in this example

---

# Controlling for time trends

`$$y_{it}= \alpha_i + \delta_t + \beta x_{it} + \epsilon_{it}$$`

- We may also want to control for a common time trend, `\(\delta_t\)`
- Adds dummy variables for each year in our panel
- Controls for factors
that changed or occurred during this time period and affected all units in the panel

---

# Adding time trends in FE

```r
fe2_time <- plm(enroll_pct ~ tuition + studentshare + unemp + region,
           data = statepanel,
           index = c("state", "year"),
*          model = "within", effect = "twoways")
```

|term         |   estimate| std.error| statistic|   p.value|
|:------------|----------:|---------:|---------:|---------:|
|tuition      | -1.0946700| 0.1416791| -7.726403| 0.0000000|
|studentshare |  0.1678541| 0.0241394|  6.953524| 0.0000000|
|unemp        | -0.1496308| 0.0779612| -1.919299| 0.0551957|

- A $1,000 increase in average tuition results in a 1.1 percentage point decline in enrollment, all else equal.

---

# Testing whether to include time trends

- Adding a dummy for each year costs us more degrees of freedom

```r
pFtest(fe2_time, fe2)
```

```
## 
## 	F test for twoways effects
## 
## data:  enroll_pct ~ tuition + studentshare + unemp + region
## F = 17.644, df1 = 25, df2 = 1141, p-value < 2.2e-16
## alternative hypothesis: significant effects
```

- If p-value < 0.05, include time trends
- We should include time trends in this example

---

# Recap

- Fixed effects eliminates OVB caused by explanatory variables that do not change over time
- Controlling for time trends eliminates OVB caused by explanatory variables that change over time but affect all units
- Omitted variables that vary between units AND time can still cause OVB
- Using fixed effects removes estimates for any variable that does not change over time
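
The last point can be checked with a small base-R sketch on simulated data. Everything here (the `toy` data frame, the three regions, the coefficients) is invented for illustration; unit and year fixed effects are entered as dummies with `lm()` rather than through `plm()`.

```r
set.seed(2021)

n_units <- 12
n_years <- 6
unit <- rep(seq_len(n_units), each = n_years)
year <- rep(seq_len(n_years), times = n_units)

# region is constant within each unit (time-invariant)
region_by_unit <- rep(c("North", "South", "West"), length.out = n_units)
region <- factor(region_by_unit[unit])

unit_effect <- rnorm(n_units)
x <- rnorm(n_units * n_years)
y <- 2 * x + unit_effect[unit] + 0.5 * year + rnorm(n_units * n_years)

toy <- data.frame(unit, year, region, x, y)

# Unit and year dummies first, then the time-invariant variable
fit <- lm(y ~ x + factor(unit) + factor(year) + region, data = toy)

# region is a perfect linear combination of the unit dummies,
# so lm() cannot estimate it and reports NA
region_coefs <- coef(fit)[grepl("^region", names(coef(fit)))]
region_coefs
```

The slope on `x` is still estimated; only the variable that never changes within a unit is lost.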