11 Hypothesis Testing

11.1 Learning objectives

Identify or construct the null and alternative hypotheses of a research question
Interpret a p-value and apply it to determine the outcome of a hypothesis test
Distinguish between types I and II error in a research scenario
Explain when a chi-square test or t-test is appropriate

11.2 Hypothesis testing

At its foundation, hypothesis testing is a simple procedure. It does involve some application of statistical theory, the most fascinating and misunderstood aspect of which is the p-value. This section covers how to set up and use a hypothesis test to determine whether an estimate is statistically significant.

11.2.1 Null and alternative hypotheses

Setting up a hypothesis test first involves establishing two mutually exclusive, competing statements:

Null hypothesis: the condition I intend to test for is not true, not the case
Alternative hypothesis: the condition I intend to test for is true, is the case

The condition is based on our research question that warranted the analysis in the first place. For example, is the average age of MPA students less than the average age of all graduate students? Are females more likely to enroll in an MPA program than males? Does obtaining an MPA increase earnings?

Based on our question, we make a hypothesis before computing an estimate that would answer the question. For example, the average age of MPA students is less than the average age of all graduate students. Females are more likely to enroll in an MPA program than males. Attaining an MPA increases earnings. These examples are alternative hypotheses: the affirmative of the condition we set out to test. The null hypothesis is the negative of the condition. For example, the average age of MPA students is equal to that of all graduate students. The likelihood of enrolling in an MPA program is equal between males and females. An MPA degree does not increase earnings.

Note that the examples of alternative hypotheses were all directional. They used words like less/more than or increase/decrease. An alternative hypothesis need not be directional even though we may expect one direction over the other. For example, our alternative hypotheses could be that average age differs between MPA students and other grad students, the likelihood of enrolling differs between males and females, and attaining an MPA affects income.

I encourage students to stick with non-directional hypotheses. In addition to simplifying the analysis, doing so reduces the likelihood that we report a statistically significant result when in fact there is not one (i.e. a false positive). A strong case can be, and is, made that statistical analyses as commonly practiced allow too high a likelihood of false positives. Claiming a direction means we have theoretically ruled out half of the possible results of our analysis before we even conduct the test, thereby increasing the likelihood we get the results we expected. If we are able to do this, one might wonder why we conduct the test in the first place.

Translating the above into more mathematical concepts, most research questions involve differences: the difference in mean age between MPA students and other grad students, the difference in the proportions of female and male MPA graduates, the difference in the slopes of regression lines drawn through data points for income and education level or degree type.

Thus, the null hypothesis states that any of these differences equals zero. The alternative hypothesis states that any of these differences does not equal zero. The null hypothesis is denoted by \(H_0\) and the alternative hypothesis is denoted by \(H_A\).

Null: There is no difference in average age between MPA students and graduate students in all other programs.
\(H_0: \mu_{MPA}-\mu_{grad} = 0\)
Alternative: There is a difference in average age between MPA students and graduate students in all other programs.
\(H_A: \mu_{MPA}-\mu_{grad} \neq 0\)
Null: There is no difference between male and female MPA enrollment.
\(H_0: \rho_{female}-\rho_{male} = 0\)
Alternative: There is a difference between male and female MPA enrollment.
\(H_A: \rho_{female}-\rho_{male} \neq 0\)
Null: There is no effect of MPA degrees on income.
\(H_0: \beta_{MPA} = 0\)
Alternative: There is an effect of MPA degrees on income.
\(H_A: \beta_{MPA} \neq 0\)

Note the use of population parameters in each case above. Again, inference computes a sample estimate to make inferences about a population. Therefore, our hypotheses include the unobserved population parameter. Our estimate and confidence interval represent our best guess of that population parameter. The purpose of the hypothesis test is to establish, prior to viewing the results, a threshold at which we are sufficiently confident in our results to draw a conclusion concerning our hypotheses.

11.2.2 Conclusion and error type

There are two possible outcomes of a hypothesis test. We either

reject the null hypothesis, or
fail to reject the null hypothesis.

We never accept the null hypothesis or reject the alternative hypothesis. This may seem like semantics, but it actually has important implications for our conclusions.

Suppose our results do not meet the threshold to reject the null hypothesis, thus leading us to fail to reject it. This does not mean the null hypothesis is true. Our null hypothesis states that the parameter of interest in our study equals zero. We do not observe the population parameter. Therefore, we cannot say that our null hypothesis is true. Instead, we say that we do not have sufficient evidence to reject the null. It may be true or false, but we cannot determine which based on our particular estimate from our particular sample.

Conversely, if we reject the null, then our conclusion is that the null hypothesis is false. We can rule out with reasonable confidence that the population parameter equals zero because doing so does not require us to claim a particular value for the population parameter, just that it is not equal to zero.

A popular example of hypothesis testing is a jury decision in a court case. A defendant is accused of committing a crime. In truth, the defendant is either innocent or guilty of that crime. Ideally, the defendant is presumed innocent until proven guilty according to a jury of their peers. Therefore, the null hypothesis is that the defendant is innocent and the alternative hypothesis is that the defendant is guilty. The jury decides either that the defendant is guilty or not guilty. If guilty, then the jury has rejected the null hypothesis of innocence. If not guilty, then the jury has failed to reject the null hypothesis. Note that the jury does not decide the defendant is innocent, which would be equivalent to accepting the null or rejecting the alternative.

Despite our best efforts, there always remains some chance that we have reached the wrong conclusion with our hypothesis test. We can make one of two possible errors:

Type I error or false positive: rejecting the null when the null is actually true
Type II error or false negative: failing to reject the null when the null is actually false

For instance, Type I error is finding an innocent defendant guilty, a healthy patient sick, or an ineffective program effective. Type II error is finding a guilty defendant not guilty, no evidence of illness in a sick patient, or no evidence of efficacy in an effective program. Again, the null hypothesis involves the population parameter, so we do not know if it is actually true or not. The null must be true or false, and the outcome of our hypothesis test claims whether the null does not appear to be false or is false. Therefore, there is a probability that we have reached the incorrect conclusion.

It is impossible to eliminate the chance of Types I and II errors, though it is possible to increase or decrease their likelihoods. However, the two share an inverse relationship; as we reduce the chance of one type of error, we increase the chance of the other type of error. Depending on the context of our research question, we may be more or less concerned about Type II error, but the focus of a hypothesis test is placed on Type I error, which serves as the threshold for our decision.

11.2.3 Decision rule

Before testing our hypothesis, we ask ourselves the following question:

“What is the maximum probability of Type I error that I or others should be willing to tolerate?”

Actually, this question has been answered for us in most disciplines. The common threshold for this tolerance is 5% or 1% probability of rejecting the null hypothesis when the null is actually true. Social sciences typically use 5%, though 10% is still seen (too) often.

With our threshold of tolerance set, we can now test our hypothesis. We calculate the sample estimate and the standard error of our estimate. These are then used to calculate the confidence interval. The confidence level of our confidence interval depends on our chosen threshold for Type I error. If our threshold for Type I error is 5%, then we calculate the 95% confidence interval. If our threshold is 1%, we use a 99% confidence interval.

Our confidence interval is our best guess of the plausible range of values for the population parameter. If we chose a 95% confidence interval, then we decided to tolerate the chance that our confidence interval is one of the five out of 100 confidence intervals–or the one out of 100 in the case of 99% confidence intervals–expected to fail to capture the parameter. Our null hypothesis states that the parameter equals zero.

Therefore, if our confidence interval does not contain zero, we reject the null hypothesis. If our confidence interval does contain zero, then we have failed to reject the null hypothesis because the parameter might equal zero. It might not equal zero, but we cannot tell either way with a sufficient level of confidence.
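The confidence-interval decision rule can be sketched in a few lines of Python. This is only an illustration: the estimate (2.0) and standard error (0.8) are made-up numbers, the function names are hypothetical, and the interval uses a normal approximation rather than whatever distribution a particular analysis calls for.

```python
from statistics import NormalDist

def confidence_interval(estimate, std_error, alpha):
    """Normal-approximation confidence interval at level 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, about 1.96 when alpha = 0.05
    return estimate - z * std_error, estimate + z * std_error

def decide(estimate, std_error, alpha):
    """Reject the null (parameter = 0) only if the interval excludes zero."""
    lower, upper = confidence_interval(estimate, std_error, alpha)
    return "fail to reject" if lower <= 0 <= upper else "reject"

# Hypothetical estimate and standard error, chosen only for illustration
estimate, std_error = 2.0, 0.8

for alpha in (0.05, 0.01):
    lower, upper = confidence_interval(estimate, std_error, alpha)
    print(f"{1 - alpha:.0%} CI: ({lower:.3f}, {upper:.3f}) -> {decide(estimate, std_error, alpha)}")
```

With these made-up numbers, the 95% interval excludes zero while the wider 99% interval includes it, so the same estimate leads to different conclusions depending on the tolerance for Type I error we chose beforehand.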

All else equal, a lower confidence level, say, 90%, translates to a narrower range of plausible values. Suppose our 95% confidence interval contains zero, but a 90% confidence interval, had it been chosen, would not have. With the 90% interval, we would reject the null hypothesis, but we would be using a confidence interval expected to fail 10 out of 100 times (or one out of 10). If the null were actually true, then choosing a 90% confidence interval increased the probability of Type I error and resulted in a false positive. If the null were actually false, then the use of a 95% confidence interval increased the probability of Type II error and resulted in a false negative. Again, this is the trade-off between Types I and II error when choosing between higher or lower levels of confidence.
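A small simulation can make this trade-off concrete. The sketch below is not drawn from any data in this chapter. It assumes a hypothetical study that estimates a mean from 25 observations with a known standard deviation of 1, and it repeats that study many times in a world where the null is true (mean of 0) and in a world where the null is false (an assumed mean of 0.4).

```python
import random
from statistics import NormalDist, fmean

random.seed(11)  # fixed seed so the sketch is repeatable

# Hypothetical settings: sample size, known standard deviation, number of simulated studies
N, SIGMA, REPS = 25, 1.0, 4000
SE = SIGMA / N ** 0.5  # standard error of the sample mean

def sample_means(true_mean):
    """Simulate REPS studies and return the sample mean from each one."""
    return [fmean(random.gauss(true_mean, SIGMA) for _ in range(N)) for _ in range(REPS)]

def rejection_rate(means, level):
    """Share of studies whose confidence interval at `level` excludes zero."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return sum(abs(m) > z * SE for m in means) / len(means)

null_true = sample_means(0.0)   # world where the null is true
null_false = sample_means(0.4)  # world where the null is false (assumed effect)

# Type I error: rejecting a true null. Type II error: failing to reject a false null.
type1 = {lvl: rejection_rate(null_true, lvl) for lvl in (0.90, 0.95)}
type2 = {lvl: 1 - rejection_rate(null_false, lvl) for lvl in (0.90, 0.95)}

for lvl in (0.90, 0.95):
    print(f"{lvl:.0%} CI: Type I rate = {type1[lvl]:.3f}, Type II rate = {type2[lvl]:.3f}")
```

Because every study rejected at the 95% level is also rejected at the 90% level, moving from 95% to 90% can only raise the Type I rate and lower the Type II rate in this simulation, which is the inverse relationship described above.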

In most cases, we do not need more information than the estimate, standard error, and confidence interval to make a decision regarding our hypothesis test. However, analyses involving a hypothesis test provide us an additional piece of information that allows us to arrive at the same conclusion but from a different perspective. This additional piece of information is called the p-value.

The p-value tells us the probability of obtaining the estimate we did, or an estimate further away from the null hypothesis (usually that the parameter equals 0), if the null hypothesis were actually true.

The p-value provides us a concise decision rule. The tolerance threshold we set is often referred to as the significance level and denoted by the Greek letter \(\alpha\) (alpha).

If \(p<\alpha\), reject the null hypothesis.
If \(p\geq \alpha\), fail to reject the null hypothesis.

Again, a typical value for \(\alpha\) is 5%, or 0.05. Having chosen a 5% significance level, if our results generate a p-value of 0.04, for instance, then the likelihood of obtaining our result, or one more extreme, in a world where the null is true is 4%. This is below our maximum tolerance, so we reject the null hypothesis. If our p-value were equal to 0.06, then the likelihood of obtaining our result, or one more extreme, in a world where the null is true is 6%. This exceeds our maximum tolerance, thus we fail to reject the null hypothesis.
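The p-value decision rule is short enough to write out directly. The sketch below is an illustration under stated assumptions: the estimate and standard error are made-up numbers, the function names are hypothetical, and the p-value is a two-sided one computed from a normal approximation.

```python
from statistics import NormalDist

def two_sided_p_value(estimate, std_error):
    """Probability of an estimate at least this far from zero if the null (parameter = 0) were true."""
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha):
    """Apply the decision rule: reject when p < alpha, otherwise fail to reject."""
    return "reject" if p_value < alpha else "fail to reject"

# The two p-values discussed in the text, at a 5% significance level
print(decide(0.04, alpha=0.05))
print(decide(0.06, alpha=0.05))

# A hypothetical estimate of 2.0 with a standard error of 0.8
p = two_sided_p_value(2.0, 0.8)
print(f"p = {p:.4f}; at 5%: {decide(p, 0.05)}; at 1%: {decide(p, 0.01)}")
```

Note that the significance level is an argument we supply before looking at the p-value. The rule itself never changes, only the threshold we committed to.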

11.3 Chi-square test

The Chi-square (\(\chi^2\)) test is a common choice for introducing the application of hypothesis testing. Chi-square is used to test whether two nominal variables are associated to a statistically significant extent. A nominal variable, such as race, sex, or political party affiliation, has two or more levels that differ in name only. If one wanted to test if, given the level for a unit of analysis in one nominal variable (e.g. male), there is a higher likelihood for a particular level in another nominal variable to occur (e.g. Republican), a Chi-square test is an appropriate choice.

Note: A regression could also test this relationship. In fact, you have already learned how to run such a regression as long as the outcome of party affiliation is coded as binary (regression models that can handle categorical outcomes with more than two levels exist but are outside the scope of this course).

For an example, consider a poll targeted to the general U.S. public asking if workers who have illegally entered the U.S. should 1) be allowed to keep their jobs and apply for citizenship, 2) be allowed to keep their jobs as temporary guest workers but not apply for citizenship, or 3) lose their jobs and have to leave the country. The poll also asked for political party affiliation. A total of 890 responses were collected, generating the following results.

Table 11.1: Response by political party
	Republican	Democrat	Independent
Apply for citizenship	57	101	120
Guest worker	121	28	113
Leave the country	179	45	126

We see in Table 11.1 that 179 out of 357 Republicans (50%) responded that illegal immigrants should be forced to leave the country, while 101 out of 174 Democrats (58%) responded that illegal immigrants should be allowed to apply for citizenship. Is there a statistically significant pattern in responses conditional on political party affiliation, or are these differences due to random noise?

The null hypothesis for this question is that there is no association between the opinion on illegal immigration and political party affiliation. That is to say, whichever of the three political party levels in our data we choose, the probability of an individual providing a given opinion is the same, or

\(H_0: P(\text{opinion} \mid \text{party}) = P(\text{opinion})\) for every opinion and every party

The alternative hypothesis is that there is an association between opinion on illegal immigration and political party affiliation, or

\(H_A\): for at least one opinion, \(P(\text{opinion} \mid \text{party})\) differs across parties

Next, suppose we chose to use the customary 5% statistical significance level, or \(\alpha=0.05\). Now we are ready to test our hypothesis using a Chi-square test. Doing so generates the following results.


    Pearson's Chi-squared test

data:  immigration_poll
X-squared = 100.95, df = 4, p-value < 0.00000000000000022

This is one case where there are no confidence intervals to compute since our variables are not numerical. Instead, we rely on the p-value. Our p-value is less than 0.00000000000000022. More concisely, \(p<0.05\). Therefore, we reject the null hypothesis that there is no association between the three illegal immigration opinions in the survey and political party affiliation. Furthermore, the probability of getting the values in Table 11.1, or values more extreme, in a world where the null hypothesis is actually true is vanishingly small. This is not to suggest that such a small p-value is required to make inferences.
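For readers curious where the test statistic comes from, the following Python sketch recomputes it from the counts in Table 11.1. It compares each observed count to the count we would expect if opinion and party were unrelated (row total times column total, divided by the grand total). The variable names are my own, and the p-value line relies on a closed-form expression that holds only when the degrees of freedom equal 4, as they do here.

```python
from math import exp

# Observed counts from Table 11.1 (columns: Republican, Democrat, Independent)
observed = {
    "Apply for citizenship": [57, 101, 120],
    "Guest worker": [121, 28, 113],
    "Leave the country": [179, 45, 126],
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all nine cells
chi_sq = 0.0
for i, row in enumerate(rows):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi_sq += (obs - expected) ** 2 / expected

# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
df = (len(rows) - 1) * (len(col_totals) - 1)

# Upper-tail probability of a chi-square distribution.
# This closed form is valid only for df = 4; other df need a statistics library.
p_value = exp(-chi_sq / 2) * (1 + chi_sq / 2)

print(f"X-squared = {chi_sq:.2f}, df = {df}, p-value = {p_value:.3g}")
```

The statistic and degrees of freedom agree with the output above, and the computed p-value falls far below the reported bound of 0.00000000000000022.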

Our results allow us to make inferences such as Republicans are more likely to believe illegal immigrants should lose their jobs and have to leave the country, while Democrats are more likely to believe they should be allowed to keep their jobs and apply for citizenship. Not a particularly surprising inference, but perhaps that is because many such inferences in the past have been made using similar techniques and reported many times.

11.4 T-test

The t-test is another common introductory application of hypothesis testing. A t-test is used to test the association between a nominal variable with two levels and a numerical variable. It is frequently used in simple program evaluations with a pre/post or treatment/control design. Both involve a nominal variable with two levels. If one wants to test if a numerical outcome is different between the two levels, then a t-test is an appropriate choice.

There are two varieties of the t-test. To test if an average of a numerical outcome is different between two groups, such as a treatment and control group, then we use an independent t-test. To test if an average numerical outcome is different before and after a treatment for the same units of analysis, we use a dependent t-test. The difference between the two t-tests concerns how we approximate the sampling distributions and confidence intervals, but their use for hypothesis testing is essentially the same.

Suppose we work for a nonprofit that provides job training workshops and want to evaluate their effectiveness. One way to go about such a task is to compare the earnings of participants (i.e. treatment group) to the earnings of non-participants (i.e. control group). We have earnings data for 185 participants and 128 non-participants, some of which is previewed in the table below.

Table 11.2: Preview of job training data
	treatment	earnings
29	1	66493.964
113	1	0.000
254	0	46651.829
98	1	10070.227
68	1	0.000
241	0	1211.736

Before constructing the null and alternative hypotheses, a note about the population in this example. In program evaluations or generally any analysis aimed toward testing whether some event caused an effect on an outcome of interest, there are two populations. There is the entire population of units of analysis (e.g. all people, all nations, all dogs) and the subset of that population for whom/which the program, policy, or intervention is intended.

The choice of population leads to two slightly different questions. If the entire population is our research population, then our intent is to estimate the average effect of the program on a randomly chosen unit from the entire population. This is referred to as the average treatment effect (ATE). If the subset that the program targets is our research population, then our intent is to estimate the average effect of the program on a randomly chosen targeted unit. This is referred to as the average treatment effect on the treated (ATT). The detailed differences between the two are beyond the scope of this book and more appropriate for a class in program evaluation or causal inference, but the existence of this difference is worth being aware of.

Given that job training programs are not intended for all people, the presumption is that we want to estimate the average effect on those the program targets. Of course, we could choose as our population only those who participated in the program. In that case, we need not bother with hypothesis tests, as we would be calculating the population parameters directly from the observed data. However, we would not be able to generalize the results.

With the population in mind, our null hypothesis is that the average earnings of participants is equal to the average earnings of non-participants, or

\(H_0: \mu_{treated} = \mu_{untreated}\)

Our alternative hypothesis is that the average earnings of participants is not equal to the average earnings of non-participants, or

\(H_A: \mu_{treated} \neq \mu_{untreated}\)

Choosing a significance level of 5%, we are ready to test our hypothesis.

A simple computation of the average earnings between the two groups provides the following information.

Table 11.3: Comparison of means between treated and untreated
treatment	Average Earnings
0	21645.10
1	26031.49

Participants have higher average earnings than non-participants, but is this difference statistically significant? For that, we should use an independent t-test because participants and non-participants are two different groups. Running the t-test provides the following results.


    Welch Two Sample t-test

data:  earnings by treatment
t = -1.1921, df = 275.58, p-value = 0.2342
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -11629.708   2856.939
sample estimates:
mean in group 0 mean in group 1 
       21645.10        26031.49

Our p-value is greater than our significance level, \(0.23>0.05\). Therefore, we fail to reject the null hypothesis and conclude that there is not statistically significant evidence that average earnings differ between participants and non-participants. Note that our 95% confidence interval ranges between -11,630 and 2,857. It includes zero, thus we cannot claim with reasonable confidence that the difference between the means is not equal to zero.
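To see how the pieces of the t-test output fit together, the sketch below starts from the reported group means and t statistic and works back to the standard error, then rebuilds an approximate confidence interval and p-value. One loud assumption: it uses the normal distribution in place of the t distribution with 275.58 degrees of freedom, so its numbers are close to, but not identical to, the output above.

```python
from statistics import NormalDist

# Values copied from the t-test output
mean_control, mean_treated = 21645.10, 26031.49
t_stat = -1.1921

# The test reports group 0 minus group 1
diff = mean_control - mean_treated

# Since t = difference / standard error, the standard error can be recovered
std_error = diff / t_stat

# Normal approximation (an assumption; the output above uses the t distribution)
z = NormalDist().inv_cdf(0.975)
lower, upper = diff - z * std_error, diff + z * std_error
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"difference = {diff:.2f}, standard error = {std_error:.2f}")
print(f"approximate 95% CI: ({lower:.0f}, {upper:.0f})")
print(f"approximate p-value: {p_approx:.4f}")
print("reject" if p_approx < 0.05 else "fail to reject")
```

Both routes lead to the same decision: the approximate interval contains zero and the approximate p-value exceeds 0.05.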

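Note: A regression could also test this relationship. A regression model of \(Earnings=\beta_0+\beta_1Training+\epsilon\) would produce the same estimated difference in means as the t-test. Its standard error and p-value correspond to the equal-variance version of the t-test, so they may differ slightly from the Welch results shown above.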
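The link between regression and a comparison of means can be checked on a tiny example. The earnings below are made up for illustration and are not the job training data. The sketch fits the least-squares line by hand and shows that the slope on a 0/1 treatment variable is the difference between the two group means, while the intercept is the mean of the untreated group.

```python
from statistics import fmean

# Hypothetical earnings (in thousands), not the job training data
control = [10.0, 20.0, 30.0]
treated = [25.0, 35.0, 45.0, 55.0]

# Stack the data the way a regression would see it: x is the 0/1 training indicator
x = [0.0] * len(control) + [1.0] * len(treated)
y = control + treated

# Ordinary least squares for one predictor: slope = S_xy / S_xx
x_bar, y_bar = fmean(x), fmean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

diff_in_means = fmean(treated) - fmean(control)

print(f"slope = {slope:.2f}, difference in means = {diff_in_means:.2f}")
print(f"intercept = {intercept:.2f}, control mean = {fmean(control):.2f}")
```

The sign of the slope depends on which group is coded as 1. Here the treated group is coded as 1, so the slope is the treated mean minus the control mean.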

To learn how to conduct chi-square and t-tests in R, proceed to Chapter 23.

11.5 Key terms and concepts

Margin of error
Survey weight
Null and alternative hypotheses
Rejecting the null
Failing to reject the null
Types I and II error
Confidence level
Significance level
P-value
Chi-square test
T-test