12 Significance

“One out of every four people is suffering from some form of mental illness. Check three friends. If they’re OK, then it’s you.”

—Rita Mae Brown

We can now apply our knowledge of inference to fully understand all of our regression results and extend our results to other questions. First, this chapter explains each column in a regression table as well as how to test additional hypotheses that standard regression results do not answer by default. Second, although inference is used to identify statistical significance, a statistically significant result is not necessarily practically significant. This chapter ends with how to determine whether it is.

12.1 Learning objectives

Explain and interpret the standard components in a table of regression results
Construct the null and alternative hypotheses of a variable in a regression model
Determine the outcome of the hypothesis test based on the regression results
Explain the consequence of choosing a significance level for a hypothesis test
Distinguish between statistical and practical significance
Determine whether results are practically significant

12.2 Regression table

Chapters 6, 7, and 8 presented numerous regression tables. These tables included the standard set of results that statistical programs provide by default. Below is one of the tables from Chapter 7 for a regression to explain traffic fatalities as a function of miles driven and U.S. region.

Table 12.1: Parallel slopes for regions
term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	0.260	0.167	1.553	0.121	-0.069	0.589
vmiles	0.188	0.021	8.895	0.000	0.147	0.230
region: N. East	-0.076	0.065	-1.166	0.245	-0.204	0.052
region: South	0.519	0.056	9.283	0.000	0.409	0.630
region: West	0.641	0.062	10.264	0.000	0.518	0.763

We have covered how to interpret the estimate column at length. This value is the sample estimate of our unobserved population parameter. Provided our model is unbiased, and because of the Central Limit Theorem, we assume that the value of our estimate was drawn from a sampling distribution that is approximately normal with a mean equal to the population parameter. Because the population parameter is unobserved, we treat our estimate as the mean of that sampling distribution. For example, in Table 12.1, it is assumed the estimate for vmiles of 0.188 represents the mean of an unobserved sampling distribution composed of the many estimates for vmiles that would be obtained if we repeated the regression using many different samples.

The std_error column is the standard error of the regression estimate (aka coefficient) in the same row. This value is an approximation of the standard deviation of the sampling distribution from which the estimate was drawn. For example, the standard error for vmiles of 0.021 represents the standard deviation of its sampling distribution.

We know the true population parameter is highly unlikely to exactly equal our estimate but is expected to fall somewhere within our sampling distribution. With our estimate assumed to be the center of the normal sampling distribution and the standard error its standard deviation, we can apply the 68-95-99.7 rule to construct a range of values that covers a chosen percentage of the estimates within the sampling distribution. The common choices are 95% and 99%, with 95% being the default in statistical programs.

The lower_ci and upper_ci columns provide the 95% confidence interval. This range represents our best guess of the plausible values for the unobserved population parameter. The population parameter either falls within our confidence interval or it does not. There is not a 95% probability that the parameter falls within our confidence interval. Rather, our interval is one of many we could have constructed: if we were to repeat the analysis many times using different samples, we would expect 95% of the resulting intervals to capture the population parameter. Therefore, the population parameter is no more likely to equal our estimate than it is to equal any other value within our confidence interval.
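The repeated-sampling interpretation can be illustrated with a small simulation. This is a minimal sketch rather than part of the original analysis: it uses a sample mean instead of a regression coefficient, and the population mean, standard deviation, sample size, number of repetitions, and seed are all assumed values chosen for illustration.

```python
import random
import statistics


def coverage_rate(true_mean=10.0, true_sd=2.0, n=50, repetitions=2000, seed=1):
    """Share of 95% intervals (estimate +/- 1.96 SE) that contain true_mean.

    All default values are hypothetical and only serve the illustration.
    """
    rng = random.Random(seed)
    captured = 0
    for _ in range(repetitions):
        # Draw a fresh sample from the (normally unobserved) population.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        estimate = statistics.fmean(sample)
        std_error = statistics.stdev(sample) / n ** 0.5
        lower = estimate - 1.96 * std_error
        upper = estimate + 1.96 * std_error
        # Each single interval either captures the parameter or it does not.
        if lower <= true_mean <= upper:
            captured += 1
    return captured / repetitions


rate = coverage_rate()
print(f"Share of intervals capturing the parameter: {rate:.3f}")
```

The printed share should land close to 0.95. The 95% describes the long-run success rate of the procedure, not the probability that any one interval is correct.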

The values for the confidence interval are obtained by subtracting 1.96 standard errors from the estimate and adding 1.96 standard errors to it. In Table 12.1, the 95% confidence interval for vmiles is 0.147 ($0.188-1.96\times 0.021$) to 0.230 ($0.188+1.96\times 0.021$). A hand calculation with the rounded values shown can differ from the table in the last digit because the software works with unrounded values. If we were to repeat this regression 20 or 100 times, we would expect 19 or 95 of the resulting confidence intervals to capture the population parameter for vmiles. Understanding this chosen rate of success/failure, 0.147 to 0.230 is our best guess of the range of plausible values for the true response in traffic fatalities to the average number of miles driven. It is just as likely that this true population parameter equals 0.147 or 0.230 as it is to equal 0.188.
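The arithmetic can be sketched in a few lines. The function name `confidence_interval` is hypothetical, and the multiplier of 1.96 assumes a normal sampling distribution. Statistical software typically uses a t distribution and unrounded inputs, so its bounds may differ slightly in the last digit.

```python
def confidence_interval(estimate, std_error, z=1.96):
    """Return (lower, upper) bounds as estimate -/+ z standard errors."""
    margin = z * std_error
    return estimate - margin, estimate + margin


# Rounded values for vmiles taken from Table 12.1.
lower, upper = confidence_interval(0.188, 0.021)
print(round(lower, 3), round(upper, 3))
```

Using the rounded table values, the upper bound comes out at 0.229 rather than the 0.230 shown in the table.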

The statistic and p_value columns concern hypothesis testing. As a reminder, the regression model that produced Table 12.1 was

\[\begin{equation} mrall = \beta_0 + \beta_1vmiles + \beta_2region + \epsilon \tag{12.1} \end{equation}\]

where mrall is the number of traffic fatalities in a state per 10,000 population.

If the purpose of our regression is to explain traffic fatalities, then the inclusion of vmiles and region implies a research question along the lines of “Do distances driven and region in a state affect traffic fatalities?” Therefore, our regression model sets out to test whether vmiles and region have a statistically significant effect on mrall.

The standard null hypothesis for a regression model is no effect, or

$H_0: \beta=0$

and the alternative hypothesis is

$H_A: \beta \neq 0$

for each of the explanatory variables we include in the model. As we now know, our results will lead us to either reject the null hypothesis or fail to reject the null hypothesis for each explanatory variable.

If the null hypothesis were actually true, $\beta=0$, for any of our explanatory variables, then its sampling distribution should be centered at 0, not centered at the value of our estimate. For example, if $\beta_1=0$ for vmiles, then its sampling distribution should have a mean of 0, not 0.188. The distribution our estimate would follow if the null were true is referred to as the null distribution. Just like the sampling distribution, we assume the null distribution is approximately normal.

The statistic and p_value columns answer the following question: “If the null for my explanatory variable were true, thus my estimate having a null distribution with a mean equal to zero and a standard deviation equal to the standard error of my estimate, how likely is it that I got the estimate I got?”

Specifically, the statistic column equals how many standard deviations, or standard errors, our estimate is away from the center of the null distribution, zero. The value is equal to the estimate divided by the std_error. For example, the estimate for vmiles is 8.895 standard errors away from 0 ($\frac {0.188}{0.021}$; the rounded values shown give about 8.95, while the table's statistic is based on unrounded values). Assuming the null distribution is normal, how likely is it for us to get this estimate or one further away from 0 if the null were true? This is what the p_value provides us. If only about 5% of the values in a normal distribution lie more than 2 standard deviations from the center, then an extremely small percentage of values must lie almost 9 standard deviations from the center. This is why the p-value for vmiles rounds to zero.

Supposing we chose a 5% significance level prior to running the regression, our p-value for vmiles is statistically significant, meaning we reject the null hypothesis that $\beta_1=0$. In other words, there is statistically significant evidence that the average number of miles driven by each driver in a state is associated with (and perhaps causes) an increase in the state’s traffic fatality rate.
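A minimal sketch of the two calculations, assuming a normal null distribution as the text does. The function name `z_test` is hypothetical. Statistical software typically uses a t distribution, so its p-values can differ slightly from these.

```python
import math


def z_test(estimate, std_error):
    """Return (statistic, two-sided p-value) for the null of no effect."""
    # How many standard errors the estimate lies from zero.
    statistic = estimate / std_error
    # Two-sided tail area of a standard normal beyond |statistic|.
    p_value = math.erfc(abs(statistic) / math.sqrt(2))
    return statistic, p_value


# Rounded estimates and standard errors from Table 12.1.
for term, estimate, std_error in [
    ("vmiles", 0.188, 0.021),
    ("region: N. East", -0.076, 0.065),
]:
    statistic, p_value = z_test(estimate, std_error)
    print(f"{term}: statistic = {statistic:.3f}, p-value = {p_value:.3g}")
```

For vmiles the p-value is so small that it rounds to zero at three decimals, while the Northeast estimate sits only a little more than one standard error from zero and produces a p-value well above 0.05.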

Since region is a nominal variable with four categories, it results in three estimates, as if each level were a separate dummy variable equal to 1 if a state is in that region and 0 otherwise. The three regions in Table 12.1 are compared to the excluded region, Midwest. The null hypothesis is that there is no difference between the Midwest and another region of focus. Let us first focus on states in the West. On average, the traffic fatality rate for states in the West is 0.64 higher than that of states in the Midwest. The p-value is below 0.05, thus we can reject the null hypothesis and report this result as statistically significant.

By contrast, the traffic fatality rate for states in the Northeast is 0.08 lower than that of states in the Midwest. However, the p-value is greater than 0.05. Therefore, we fail to reject the null hypothesis, meaning we cannot conclude with reasonable confidence that $\beta_{Northeast} \neq 0$.

Note that for every estimate whose p-value is less than 0.05, the confidence interval does not contain zero. These two parts of the table will always agree because they answer the same question from slightly different angles. If we choose a 95% confidence interval as our best guess of the range of plausible values for the population parameter, and that interval does not contain 0, then we must have obtained an estimate that, if the null were true, is so far away from 0 that the likelihood of getting it is less than 5%. Had we chosen a 99% confidence interval, or 1% significance level, then for any interval that does not contain 0 the corresponding p-value is less than 0.01.
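This agreement can be checked row by row against Table 12.1. The sketch below hard-codes the table's rounded values. The names `TABLE_12_1` and `agreement` are hypothetical.

```python
# term: (p_value, lower_ci, upper_ci), copied from Table 12.1.
TABLE_12_1 = {
    "intercept": (0.121, -0.069, 0.589),
    "vmiles": (0.000, 0.147, 0.230),
    "region: N. East": (0.245, -0.204, 0.052),
    "region: South": (0.000, 0.409, 0.630),
    "region: West": (0.000, 0.518, 0.763),
}


def agreement(table, alpha=0.05):
    """Map each term to (significant at alpha, interval excludes zero)."""
    result = {}
    for term, (p_value, lower_ci, upper_ci) in table.items():
        significant = p_value < alpha
        excludes_zero = lower_ci > 0 or upper_ci < 0
        result[term] = (significant, excludes_zero)
    return result


for term, (significant, excludes_zero) in agreement(TABLE_12_1).items():
    print(f"{term}: p < 0.05 is {significant}, CI excludes 0 is {excludes_zero}")
```

In every row the two answers match: vmiles, South, and West are significant with intervals that exclude zero, while the intercept and Northeast are not.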

12.3 Other hypotheses

By default, the standard regression table addresses hypothesis tests of the form $H_0: \beta=0$ and $H_A: \beta \neq 0$. If our null hypothesis is $\beta$ equals something other than 0, then we need to be careful because the statistic and p_value columns do not apply. However, the confidence interval can still be used. If the confidence interval does not contain the value used for our null hypothesis, then we can conclude with our chosen level of confidence that the population parameter does not equal that value.
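As a sketch of testing a null value other than zero, the statistic can be re-centered on that value and the confidence interval checked directly. The null values 0.15 and 0.10 below are hypothetical choices for illustration, the function name `check_null` is an assumption, and the p-value assumes a normal null distribution.

```python
import math


def check_null(estimate, std_error, lower_ci, upper_ci, null_value):
    """Return (statistic, p_value, rejected_by_interval) for H0: beta = null_value."""
    # Standard errors between the estimate and the hypothesized value.
    statistic = (estimate - null_value) / std_error
    p_value = math.erfc(abs(statistic) / math.sqrt(2))
    # Reject when the hypothesized value lies outside the confidence interval.
    rejected_by_interval = not (lower_ci <= null_value <= upper_ci)
    return statistic, p_value, rejected_by_interval


# vmiles row of Table 12.1, tested against two hypothetical null values.
for null_value in (0.15, 0.10):
    statistic, p_value, rejected = check_null(0.188, 0.021, 0.147, 0.230, null_value)
    print(f"H0: beta = {null_value}: statistic = {statistic:.2f}, "
          f"p-value = {p_value:.4f}, rejected = {rejected}")
```

A null of 0.15 lies inside the interval and is not rejected, whereas 0.10 lies outside it and is rejected, matching the re-centered p-values.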

Comparing different levels within a categorical variable is slightly more complicated. In Table 12.1, we can conclude that states in the South and West are different from states in the Midwest, and we cannot conclude states in the Northeast are different from states in the Midwest. But what if we wanted to compare states in the South to states in the West, or Northeast to South? Our regression and our results were not set up to make such comparisons.

A quick yet technically inaccurate way to make various comparisons across levels is to examine their confidence intervals. If the confidence intervals do not overlap or do not come close to overlapping, then we can be reasonably certain the population parameters for the two levels are not equal to each other. For example, the upper_ci for Northeast is 0.052 and the lower_ci for South is 0.409. The two intervals are separated by multiple standard errors. Therefore, it is probably safe to conclude they are different. By contrast, the confidence intervals for South and West overlap substantially. Therefore, it is not safe to conclude that South and West are different. Alternatively, we can tell our statistical software to use a different level as the excluded (reference) category, such as South, thereby allowing us to test our hypothesis without the guesswork.
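The quick overlap check can be written as a small helper. This is only a sketch of the informal comparison described here, not a formal test, and the function name `interval_gap` is hypothetical.

```python
def interval_gap(first, second):
    """Distance between two intervals; 0.0 means they overlap or touch."""
    (lower_a, upper_a), (lower_b, upper_b) = first, second
    return max(0.0, max(lower_a, lower_b) - min(upper_a, upper_b))


# 95% confidence intervals from Table 12.1.
northeast = (-0.204, 0.052)
south = (0.409, 0.630)
west = (0.518, 0.763)

print(f"Northeast vs South gap: {interval_gap(northeast, south):.3f}")
print(f"South vs West gap: {interval_gap(south, west):.3f}")
```

The Northeast and South intervals are separated by 0.357, several times either standard error, while the South and West intervals overlap.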

Finally, every regression result includes a global hypothesis test of the form

$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$

$H_A$: at least one $\beta_j \neq 0$ for $j = 1, \ldots, k$.

Keep in mind that all of our conclusions are probabilistic. There is always a chance of Type I and Type II error. Since the hypothesis test assumes a sampling distribution for each explanatory variable, each explanatory variable we add is like taking an additional sample from some underlying population relevant to our regression. At 95% confidence, we expect 1 out of every 20 intervals to fail at capturing the parameter, such as not including zero when the parameter is truly zero. The global hypothesis test above is a way of testing whether we got a significant result due to the random chance that 1 out of every 20 explanatory variables can be expected to be significant even if the null were true. This global hypothesis test is commonly referred to as an F-test.
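To get a feel for how quickly chance findings accumulate, the sketch below computes the expected number of false positives and the probability of at least one when every null is true. It assumes the tests are independent, a simplification that rarely holds exactly for explanatory variables in the same model. The function name `false_positive_summary` is hypothetical.

```python
def false_positive_summary(num_tests, alpha=0.05):
    """Return (expected false positives, P(at least one)) when all nulls are true.

    Assumes independent tests, which is a simplifying assumption.
    """
    expected = num_tests * alpha
    at_least_one = 1 - (1 - alpha) ** num_tests
    return expected, at_least_one


for num_tests in (1, 5, 20):
    expected, at_least_one = false_positive_summary(num_tests)
    print(f"{num_tests} tests: expected false positives = {expected:.2f}, "
          f"P(at least one) = {at_least_one:.3f}")
```

With 20 such tests, one false positive is expected on average, and the chance of at least one is roughly 64%.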

12.4 Practical significance

It is easy to lose sight of the forest for the trees when focusing on statistical significance. Just because we find a statistically significant relationship does not mean that relationship is practically significant or economically meaningful. Also, obtaining insignificant results does not necessarily mean you have results that are not important or worth reporting. Such distinctions between statistical and practical significance require an analyst or manager to have a broader sense of the underlying data and the context of the results.

The spirit behind practical significance is to ask ourselves whether our results matter in the real world beyond statistical significance.

After obtaining our results, asking ourselves three questions can help determine if our results are practically significant:

What is the typical change in the explanatory variable associated with the statistically significant estimate? The standard deviation of the variable provides us a sense of this (provided there is not a substantial skew).

Regression results are usually translated using a one-unit change in the explanatory variable. The first question considers whether a one-unit change is representative of how much the explanatory variable tends to vary. If the standard deviation of a variable is 0.1, then a one-unit change is an absurd magnitude of change to apply to the real world, and we should interpret our results using a much smaller change in the explanatory variable.

Is the predicted change in the outcome due to a typical change in the explanatory variable negligible or meaningful? The mean and standard deviation of the outcome give us a sense of this.

After plugging a more representative change for the explanatory variable into our regression results, we now have the associated predicted change in the outcome. The second question considers whether this predicted change would actually matter in the real world. If the standard deviation of an outcome is 500 and our predicted change in the outcome due to a typical change in the explanatory variable is 5 units, this would probably be considered negligible in most contexts. Be sure to consider the context. If the context for the above example were something like monthly income, then an increase of $5 is probably negligible. If the context were something related to life and death, such as murder or cancer rates, then even a relatively small change may be worth pursuing.

Do the bounds of the confidence interval for the explanatory variable potentially change my answer?

If I obtained a statistically significant positive estimate, perhaps the lower bound of the confidence interval is so close to zero that using it instead of the midpoint value would make the predicted change in the outcome negligible. If the estimate for the explanatory variable is statistically insignificant, perhaps its confidence interval is so close around zero that the entire range of plausible values of the parameter would lead to a negligible change in the outcome. In this case, even though I cannot say whether or not the parameter of interest equals 0, any plausible value of the parameter may have no practical impact in the real world.

12.4.1 Example of practical significance

The estimate in our regression table conveys the predicted change in the outcome given a 1-unit or percent change in the explanatory variable. Referring back to Table 12.1, as the average number of miles driven per driver increases by one unit, the traffic fatality rate increases by 0.188 per 10,000 population. Is a one-unit increase a realistic change in the average distance driven per driver? Is one unit representative of its typical variation?

How do we get a sense of what is a typical change in the explanatory variable? The standard deviation of a variable tells us the average deviation from the variable’s mean. For example, the standard deviation of vmiles is 1.1, so a typical change in vmiles is quite close to one unit. Based on the estimate for vmiles, a 1.1 unit change is predicted to change the traffic fatality rate by 0.21 ($0.188 \times 1.1$).

Next, is a predicted change of 0.21 in the traffic fatality rate negligible or meaningful? Again, we can use descriptive measures to answer this question. The mean traffic fatality rate is 2.0 and its standard deviation is 0.6. Thus, the predicted change in mrall from a typical change in vmiles is about 10% of the mean and about one-third of a standard deviation. Given the typical variation in traffic fatality rate is 0.6, is a change of 0.2 negligible or meaningful? This is where professional judgment and context play a role, as there is no universal rule to determine what is a meaningful effect. Since the context is something as consequential as fatalities, perhaps any change is practically significant.

Lastly, since the population parameter is just as likely to equal any value in the confidence interval as it is the estimate, we should check if the lower or upper bound of the confidence interval changes our answer regarding practical significance. Since the result for vmiles is positive, we should focus on how close the lower bound is to zero. The lower bound for vmiles equals 0.147. Repeating the calculations above using the lower bound indicates that a typical change of 1.1 in vmiles predicts a change in mrall of 0.16, which is about 8% of the mean fatality rate and one-fourth of its standard deviation. Does this represent a negligible or meaningful change? Again, professional judgment and context are required.
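The calculations in this example can be gathered into one short sketch that repeats them for the estimate and both bounds of the confidence interval. The function and constant names are hypothetical. The numbers are the ones quoted in the text and Table 12.1.

```python
def practical_effect(coefficient, typical_change, outcome_mean, outcome_sd):
    """Return (predicted change, share of outcome mean, share of outcome SD)."""
    change = coefficient * typical_change
    return change, change / outcome_mean, change / outcome_sd


VMILES_SD = 1.1   # typical change in vmiles (its standard deviation)
MRALL_MEAN = 2.0  # mean traffic fatality rate
MRALL_SD = 0.6    # standard deviation of the traffic fatality rate

for label, coefficient in [("lower_ci", 0.147), ("estimate", 0.188), ("upper_ci", 0.230)]:
    change, share_of_mean, share_of_sd = practical_effect(
        coefficient, VMILES_SD, MRALL_MEAN, MRALL_SD
    )
    print(f"{label}: change = {change:.2f}, "
          f"{share_of_mean:.0%} of mean, {share_of_sd:.2f} SD")
```

Laying the three rows side by side makes it easier to judge whether the conclusion about practical significance holds across the whole range of plausible values.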

Students of statistics are taught to focus so much on the estimate and statistical significance that they understandably get the impression that insignificance implies the results are useless. This is not necessarily the case. Once again, the confidence interval is helpful to determine whether statistically insignificant results are still practically significant.

Suppose the p-value for vmiles were equal to or greater than 0.05, thus leading us to fail to reject the null hypothesis. This would also mean that our 95% confidence interval contains 0. Whether the results are still useful depends on the precision of the confidence interval around 0 relative to what we consider a meaningful change in the fatality rate given a typical change in vmiles. For instance, if the confidence interval ranged from -10 to 10, then our best guess for the effect ranges from substantially negative to substantially positive, including no effect at all. This sort of imprecision is useless. However, what if the confidence interval was -0.01 to 0.01? Then, assuming a change of 0.01 in the fatality rate is negligible, we could conclude the effect of vmiles is negligible with a reasonable level of confidence despite failing to reject the null hypothesis.
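This reasoning can be sketched as a check of whether even the most extreme plausible coefficient produces a negligible change. The threshold of 0.05 for a negligible change in the fatality rate is a hypothetical value chosen only for illustration, as are the function and constant names. In practice the threshold comes from professional judgment.

```python
def largest_plausible_change(lower_ci, upper_ci, typical_change):
    """Largest absolute predicted change implied by any value in the interval."""
    return max(abs(lower_ci), abs(upper_ci)) * typical_change


def is_negligible(lower_ci, upper_ci, typical_change, threshold):
    """True if every plausible coefficient implies a change within the threshold."""
    return largest_plausible_change(lower_ci, upper_ci, typical_change) <= threshold


VMILES_SD = 1.1           # typical change in vmiles
NEGLIGIBLE_CHANGE = 0.05  # hypothetical cutoff for a negligible change in mrall

for lower_ci, upper_ci in [(-10, 10), (-0.01, 0.01)]:
    verdict = is_negligible(lower_ci, upper_ci, VMILES_SD, NEGLIGIBLE_CHANGE)
    print(f"CI {lower_ci} to {upper_ci}: negligible effect = {verdict}")
```

The wide interval cannot rule out large effects in either direction, whereas the narrow interval supports a conclusion of negligible effect even though neither rejects the null.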

12.5 Key terms and concepts

Regression results
- estimate
- standard error
- statistic or t-statistic
- p-value
- lower and upper bounds of the confidence interval
Null distribution
Practical significance