18 R Missing Data
18.1 Learning Objectives
In this chapter, you will learn how to:
- Determine if a dataset has missing values
- Determine which variables in a dataset have missing values and how many values are missing
- Run functions on variables that have missing values
- Replace all missing values with a non-missing value, such as 0, if doing so is advisable
18.2 Set-up
To complete this chapter, you need to
Start a R Markdown file, keeping the YAML and deleting the default content
Change the YAML to the following:
- Load the following packages
18.3 Data
We will use the SLID
data from the carData
package to learn how to deal with missing data. Per its documentation,
“The SLID data frame has 7425 rows and 5 columns. The data are from the 1994 wave of the Canadian Survey of Labour and Income Dynamics, for the province of Ontario. There are missing data, particularly for wages.”
As is always the case when we begin working with new data, we want to get a sense of what it contains.
Exercise 1: Use glimpse
to examine
SLID
.
This is a moderately large dataset with 7,425 observations. Obviously, it would be terribly inefficient to look for missing values manually by scrolling through a spreadsheet of this size. We can already see from the glimpse results that wages has missing values given the NA
s.
18.4 Checking for missing data
If it is not immediately obvious that a dataset contains missing values, we can tell R to check if an entire dataset has any missing data using the following function
where we replace dataset
with the name of the dataset. If the dataset has at least one missing value, then anyNA
will return TRUE
.
Exercise 2: Use anyNA
to confirm
SLID
has missing values.
The anyNA
hasn’t told us anything we didn’t already know given the obvious NA
s present in wages. Next, we may want to know which variables have missing values.
To determine which variables have missing values, we want to run anyNA
repeatedly for each variable in our dataset. To run any function repeatedly on each row or column of a dataset, we can use the following function:
where we replace dataset
with the name of our dataset, include either 1
or 2
, and replace function
with the name of the function we want to repeat.
Exercise 3: Use apply
to run the
anyNA
function repeatedly on each column.
Your results should tell you that wages, education, and language contain missing values.
18.5 Counting missing values
Once we know a variable has missing values, we typically want to know how many values are missing or what percentage of total observations are missing for that variable.
The is.na
function tests every value of a variable for whether it is missing. If a value is NA, is.na
returns TRUE
. To illustrate, the below code assigns a series of ten values to v
, five of which are missing. This v
object is no different from a variable in a dataset. Then, using the is.na
function on v
will return a list of TRUE/FALSE values accordingly.
[1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
Recall in Chapter 17 that the logical value of TRUE
equals 1 in R, while FALSE
equals 0. This means we can do math on TRUE/FALSE values just like we would if they were coded as 1/0.
If is.na
gives us TRUE
for every NA
, then adding all the TRUE
s will give us the total count of missing values.
To sum all the values of any variable, we can use the sum
function
[1] 5
The result tells us 5 of the 10 values in v
are missing. We can easily determine that 50% of the data for v
is missing. But what if we have some denominator that is not as easy as 10? We can quickly to determine the percent of missing values by taking the average of TRUE
s and FALSE
s from the is.na
function because the average sums the values of the variable and divides by the number of values.
We take the average of the is.na
function using the mean
function. Since each TRUE value is equal to 1, adding up all the TRUEs will equal 5, which is then divided by the total number of values, 10, giving us 0.5 or 50%.
Whenever we have a dummy 1/0 variable, the average of that variable is the percentage of observations equal to 1. In this case, 1 represents missing, but it could represent anything.
[1] 0.5
As expected, we get 0.5 or 50%. Building from this example, we can quantify the total and percent of missing values for wages like so
[1] 3278
[1] 0.4414815
Wages is missing 3,278 observations, or about 44% of all observations.
Exercise 4: Use the sum
and
mean
function on is.na
to determine the count
and percent of missing values for the education
and
language
variables.
If we had, say, 10 variables with missing values, the process above would be rather tedious. Like before, we can tell R to repeatedly quantify missing values for each variable using a slightly different function:
wages education age sex language
3278 249 0 0 121
wages education age sex language
0.44148148 0.03353535 0.00000000 0.00000000 0.01629630
18.6 Bypassing missing values
Many functions that execute some kind of computation (e.g. sum, average) do not work if you execute them on variables that contain missing values. This is deliberate so users are notified of missing values.
For instance, below I try to calculate average years of education.
[1] NA
In order to have functions bypass missing values, we have to include the na.rm=TRUE
option that tells R to skip NA
s.
[1] 12.49608
Since education is missing only 3% of its values, this is probably a good approximation of what the average would be if there were no missing values.
Exercise 5: Compute the average for
wages
.
It is unclear what to do with average wages since almost half of its values are missing. At the very least, we can report something like, “Only 56% of respondents reported a wage. Of those who reported a wage, the average equals $15.55.”
18.7 Drop all missing cases
First, we should always be careful when dropping data, as it could change our analysis and mislead a reader. Always ask yourself why a variable might be missing values and whether it matters to the conclusions you plan to make from the values that are not missing. If you do choose to remove observations that have missing values, always be transparent by stating how many observations from the total were removed due to missing data.
Suppose we want to remove all observations from SLID with a missing value for any variable. That is, we want to purge SLID of all missing values, perhaps so we don’t have to keep including na.rm=TRUE
in all of our functions.
To remove all missing values, we can use the na.omit
function like so:
where we create a new dataset indicating we’ve replaced the missing values (we don’t want to overwrite the original data). Inside the na.omit
function, we include the name of the dataset
.
Exercise 6: Create a new dataset
SLID_nomissing
that removes all missing values. Then
calculate the average education on this new dataset without
including na.rm=TRUE
. Is the average education the same as
in SLID?