18 R Missing Data

18.1 Learning Objectives

In this chapter, you will learn how to:

Determine if a dataset has missing values
Determine which variables in a dataset have missing values and how many values are missing
Run functions on variables that have missing values
Replace all missing values with a non-missing value, such as 0, if doing so is advisable

18.2 Set-up

To complete this chapter, you need to

Start a R Markdown file, keeping the YAML and deleting the default content
Change the YAML to the following:

---
title: 'R Chapter 18'
author: 'Your name'
output: html_document
---

Load the following packages

library(tidyverse)
library(carData)

18.3 Data

We will use the SLID data from the carData package to learn how to deal with missing data. Per its documentation,

“The SLID data frame has 7425 rows and 5 columns. The data are from the 1994 wave of the Canadian Survey of Labour and Income Dynamics, for the province of Ontario. There are missing data, particularly for wages.”

As is always the case when we begin working with new data, we want to get a sense of what it contains.

Exercise 1: Use glimpse to examine SLID.

This is a moderately large dataset with 7,425 observations. Obviously, it would be terribly inefficient to look for missing values manually by scrolling through a spreadsheet of this size. We can already see from the glimpse results that wages has missing values given the NAs.

18.4 Checking for missing data

If it is not immediately obvious that a dataset contains missing values, we can tell R to check if an entire dataset has any missing data using the following function

anyNA(dataset)

where we replace dataset with the name of the dataset. If the dataset has at least one missing value, then anyNA will return TRUE.

Exercise 2: Use anyNA to confirm SLID has missing values.

The anyNA hasn’t told us anything we didn’t already know given the obvious NAs present in wages. Next, we may want to know which variables have missing values.

To determine which variables have missing values, we want to run anyNA repeatedly for each variable in our dataset. To run any function repeatedly on each row or column of a dataset, we can use the following function:

apply(dataset, 1 (for rows, or) 2 (for columns), function)

where we replace dataset with the name of our dataset, include either 1 or 2, and replace function with the name of the function we want to repeat.

Exercise 3: Use apply to run the anyNA function repeatedly on each column.

Your results should tell you that wages, education, and language contain missing values.

18.5 Counting missing values

Once we know a variable has missing values, we typically want to know how many values are missing or what percentage of total observations are missing for that variable.

The is.na function tests every value of a variable for whether it is missing. If a value is NA, is.na returns TRUE. To illustrate, the below code assigns a series of ten values to v, five of which are missing. This v object is no different from a variable in a dataset. Then, using the is.na function on v will return a list of TRUE/FALSE values accordingly.

v <- c(NA, 5, NA, 4, 10, 11, NA, NA, 1, NA)
is.na(v)

 [1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE

Recall in Chapter 17 that the logical value of TRUE equals 1 in R, while FALSE equals 0. This means we can do math on TRUE/FALSE values just like we would if they were coded as 1/0.

If is.na gives us TRUE for every NA, then adding all the TRUEs will give us the total count of missing values.

To sum all the values of any variable, we can use the sum function

sum(is.na(v))

[1] 5

The result tells us 5 of the 10 values in v are missing. We can easily determine that 50% of the data for v is missing. But what if we have some denominator that is not as easy as 10? We can quickly to determine the percent of missing values by taking the average of TRUEs and FALSEs from the is.na function because the average sums the values of the variable and divides by the number of values.

We take the average of the is.na function using the mean function. Since each TRUE value is equal to 1, adding up all the TRUEs will equal 5, which is then divided by the total number of values, 10, giving us 0.5 or 50%.

Whenever we have a dummy 1/0 variable, the average of that variable is the percentage of observations equal to 1. In this case, 1 represents missing, but it could represent anything.

mean(is.na(v))

[1] 0.5

As expected, we get 0.5 or 50%. Building from this example, we can quantify the total and percent of missing values for wages like so

sum(is.na(SLID$wages))

[1] 3278

mean(is.na(SLID$wages))

[1] 0.4414815

Wages is missing 3,278 observations, or about 44% of all observations.

Exercise 4: Use the sum and mean function on is.na to determine the count and percent of missing values for the education and language variables.

If we had, say, 10 variables with missing values, the process above would be rather tedious. Like before, we can tell R to repeatedly quantify missing values for each variable using a slightly different function:

sapply(SLID, function(x) sum(is.na(x)))

    wages education       age       sex  language 
     3278       249         0         0       121

sapply(SLID, function(x) mean(is.na(x)))

     wages  education        age        sex   language 
0.44148148 0.03353535 0.00000000 0.00000000 0.01629630

18.6 Bypassing missing values

Many functions that execute some kind of computation (e.g. sum, average) do not work if you execute them on variables that contain missing values. This is deliberate so users are notified of missing values.

For instance, below I try to calculate average years of education.

mean(SLID$education)

[1] NA

In order to have functions bypass missing values, we have to include the na.rm=TRUE option that tells R to skip NAs.

mean(SLID$education, na.rm = TRUE)

[1] 12.49608

Since education is missing only 3% of its values, this is probably a good approximation of what the average would be if there were no missing values.

Exercise 5: Compute the average for wages.

It is unclear what to do with average wages since almost half of its values are missing. At the very least, we can report something like, “Only 56% of respondents reported a wage. Of those who reported a wage, the average equals $15.55.”

18.7 Drop all missing cases

First, we should always be careful when dropping data, as it could change our analysis and mislead a reader. Always ask yourself why a variable might be missing values and whether it matters to the conclusions you plan to make from the values that are not missing. If you do choose to remove observations that have missing values, always be transparent by stating how many observations from the total were removed due to missing data.

Suppose we want to remove all observations from SLID with a missing value for any variable. That is, we want to purge SLID of all missing values, perhaps so we don’t have to keep including na.rm=TRUE in all of our functions.

To remove all missing values, we can use the na.omit function like so:

dataset_nomissing <- na.omit(dataset)

where we create a new dataset indicating we’ve replaced the missing values (we don’t want to overwrite the original data). Inside the na.omit function, we include the name of the dataset.

Exercise 6: Create a new dataset SLID_nomissing that removes all missing values. Then calculate the average education on this new dataset without including na.rm=TRUE. Is the average education the same as in SLID?

18.8 Save and Upload

Knit your Rmd to save it and check for errors. If you are satisfied with your work, upload to eLC. Once you upload, answers will become available for download.