B Wrangle and Tidy Reference

Unless data are already perfectly prepared, the most time consuming part of data analysis is wrangling and tidying data. It is impossible to cover all scenarios one may encounter when preparing raw data for an analysis. Even for advanced users of R, it is not uncommon to search for an unknown solution to a new problem via the web, texts, or manuals. Attempting to memorize the plethora of functions in R that could serve as solutions would quickly result in diminishing returns. Instead, it is more realistic to obtain enough familiarity with basic wrangle and tidy problems and solutions that one knows how and where to effectively search for the solution.

B.1 Cheatsheets

RStudio provides numerous cheatsheets to help R users reference commonly used and helpful functions. Below is a list of cheatsheets that pertain to wrangling and tidying.

This is the most relevant cheatsheet for what you will encounter in the course:

Others that are less relevant:

Knowing just a handful of functions can help you make considerable progress in many situations. The remainder of this chapter serves as a sort of cheatsheet for problems you may encounter during the course. Functions are demonstrated using the gapminder data.

The tidyverse package is actually a collection of several packages designed to make the wrangle, tidy, and data exploration process as intuitive and consistent as possible. You should almost always load tidyverse, as it contains every function you may need to wrangle and tidy data.

library(tidyverse)

B.2 Wrangle Verbs

  • filter: extract rows/cases
  • select: extract columns/variables
  • mutate: alter existing variables or create new variables
  • if_else: use a conditional to create a new variable equal to one value if an observation meets the conditional and another value if it does not; often combined with mutate
  • arrange: reorder rows in ascending or descending order of one or more variables
  • slice_head & slice_tail: extract a few rows from the top (head) or bottom (tail)
  • summarize: collapses data into a table of summary statistics
  • group_by: tells R to apply functions to each group separately; common to use with summarize

B.2.1 Filter

Use filter to extract rows from a dataset. Inversely, one can think of filter as a way to remove rows. However, remember that filter keeps the rows that meet the condition on which you filter. Therefore, you want to use a condition that keeps the rows you want.

Note there are 1,704 rows in the gapminder dataset.

glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

Suppose I want to keep only countries in Asia. Then:

gapminder %>% 
  filter(continent == 'Asia') %>% 
  glimpse()
Rows: 396
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

The result is a new dataset with 396 rows. Note the use of double equal signs == to tell R it is a conditional (“if equal to”) rather than setting something equal to something else, which would not make sense in this case.

Suppose I want countries in Asia AND in the year 1952. Then:

gapminder %>% 
  filter(continent == 'Asia' & year == 1952) %>% 
  glimpse()
Rows: 33
Columns: 6
$ country   <fct> "Afghanistan", "Bahrain", "Bangladesh", "Cambodia", "China",…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, …
$ lifeExp   <dbl> 28.801, 50.939, 37.484, 39.417, 44.000, 60.960, 37.373, 37.4…
$ pop       <int> 8425333, 120447, 46886859, 4693836, 556263527, 2125900, 3720…
$ gdpPercap <dbl> 779.4453, 9867.0848, 684.2442, 368.4693, 400.4486, 3054.4212…

This results in a new dataset with 33 rows. Note the use of the ampersand & to code the “and” conditional.

Suppose I want countries in Asia with a life expectancy less than or equal to 40 in 1952. Then:

gapminder %>% 
  filter(continent == 'Asia' & year == 1952 & lifeExp <= 40) %>% 
  glimpse()
Rows: 10
Columns: 6
$ country   <fct> "Afghanistan", "Bangladesh", "Cambodia", "India", "Indonesia…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia
$ year      <int> 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952
$ lifeExp   <dbl> 28.801, 37.484, 39.417, 37.373, 37.468, 36.319, 36.157, 37.5…
$ pop       <int> 8425333, 46886859, 4693836, 372000000, 82052000, 20092996, 9…
$ gdpPercap <dbl> 779.4453, 684.2442, 368.4693, 546.5657, 749.6817, 331.0000, …

Suppose I all countries in 1952 except those in Asia. There are a few options to do this. Which option is most efficient depends on the specific case. In this case:

Option 1: Using the “or” conditional | (least efficient)

gapminder %>% 
  filter(continent == 'Africa' | continent == 'Americas' | continent == 'Europe' | continent == 'Oceania') %>% 
  glimpse()
Rows: 1,308
Columns: 6
$ country   <fct> "Albania", "Albania", "Albania", "Albania", "Albania", "Alba…
$ continent <fct> Europe, Europe, Europe, Europe, Europe, Europe, Europe, Euro…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 55.230, 59.280, 64.820, 66.220, 67.690, 68.930, 70.420, 72.0…
$ pop       <int> 1282697, 1476505, 1728137, 1984060, 2263554, 2509048, 278009…
$ gdpPercap <dbl> 1601.056, 1942.284, 2312.889, 2760.197, 3313.422, 3533.004, …

Option 2: Using the shortcut %in% for multiple “or” conditionals (moderately efficient)

gapminder %>% 
  filter(continent %in% c('Africa', 'Americas', 'Europe', 'Oceania')) %>% 
  glimpse()
Rows: 1,308
Columns: 6
$ country   <fct> "Albania", "Albania", "Albania", "Albania", "Albania", "Alba…
$ continent <fct> Europe, Europe, Europe, Europe, Europe, Europe, Europe, Euro…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 55.230, 59.280, 64.820, 66.220, 67.690, 68.930, 70.420, 72.0…
$ pop       <int> 1282697, 1476505, 1728137, 1984060, 2263554, 2509048, 278009…
$ gdpPercap <dbl> 1601.056, 1942.284, 2312.889, 2760.197, 3313.422, 3533.004, …

Option 3: Use the “not equal to” conditional != (most efficient)

gapminder %>% 
  filter(continent != 'Asia') %>% 
  glimpse()
Rows: 1,308
Columns: 6
$ country   <fct> "Albania", "Albania", "Albania", "Albania", "Albania", "Alba…
$ continent <fct> Europe, Europe, Europe, Europe, Europe, Europe, Europe, Euro…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 55.230, 59.280, 64.820, 66.220, 67.690, 68.930, 70.420, 72.0…
$ pop       <int> 1282697, 1476505, 1728137, 1984060, 2263554, 2509048, 278009…
$ gdpPercap <dbl> 1601.056, 1942.284, 2312.889, 2760.197, 3313.422, 3533.004, …

B.2.2 Select

Suppose I want a dataset that contains only country, continent, year, and life expectancy. There are multiple options. Which is more efficient depends on the specific case. In this case:

Option 1: List the variables I want to keep (least efficient)

gapminder %>% 
  select(country, continent, year, lifeExp) %>% 
  glimpse()
Rows: 1,704
Columns: 4
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…

Option 2: List the variables I don’t want to keep (moderately efficient)

gapminder %>% 
  select(-pop, -gdpPercap) %>% 
  glimpse()
Rows: 1,704
Columns: 4
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…

Option 3: Use : to specify the range of variables, which only works because the variables I want happen to be stored next to each other (most efficient)

gapminder %>% 
  select(country:lifeExp) %>% 
  glimpse()
Rows: 1,704
Columns: 4
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…

B.2.3 Mutate

The mutate function allows you to mutate your dataset by either changing an existing variable or creating a new one.

Suppose I wanted to change GDP per capita so that it is expressed in thousands of dollars instead of dollars. Then:

gapminder %>% 
  mutate(gdpPercap = gdpPercap/1000) %>% 
  glimpse()
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 0.7794453, 0.8208530, 0.8531007, 0.8361971, 0.7399811, 0.786…

Note that I use the name of an existing variable on the left-hand side of the equation. This overwrites the data according to the function I have specified. You can scroll up to previous glimpses to confirm that gdpPercap has indeed been divided by 1,000.

Suppose I wanted a new variable that measures total GDP to have in addition to GDP per capita expressed in thousands. Since GDP per capita equals GDP divided by population, I can simply use the inverse of this calculation. Thus:

gapminder %>% 
  mutate(gdpPercap = gdpPercap/1000,
         gdp = gdpPercap*pop) %>% 
  glimpse()
Rows: 1,704
Columns: 7
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 0.7794453, 0.8208530, 0.8531007, 0.8361971, 0.7399811, 0.786…
$ gdp       <dbl> 6567086, 7585449, 8758856, 9648014, 9678553, 11697659, 12598…

Since mutate applies mathematical functions, there are way too many possible uses to cover here. The second page of the Data transformation cheatsheet lists numerous common functions used with mutate under the “Vector Functions” header.

Suppose there are multiple variables you want to mutate using the same function. A common example is when a bunch of variables are expressed as proportions between 0 and 1 when you want them all to be expressed as percentages between 0 and 100. You could use mutate to multiply each variable by 100, but this quickly becomes tedious. Instead, you can use across inside mutate to list the variables you want to mutate, then define the function you want applied to them.

For example, suppose I wanted to multiply all of the numerical variables in gapminder by 100 (doesn’t make sense but just go with it). Then:

gapminder %>% 
  mutate(across(c(year, lifeExp, pop, gdpPercap), ~ .x*100)) %>% 
  glimpse()
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <dbl> 195200, 195700, 196200, 196700, 197200, 197700, 198200, 1987…
$ lifeExp   <dbl> 2880.1, 3033.2, 3199.7, 3402.0, 3608.8, 3843.8, 3985.4, 4082…
$ pop       <dbl> 842533300, 924093400, 1026708300, 1153796600, 1307946000, 14…
$ gdpPercap <dbl> 77944.53, 82085.30, 85310.07, 83619.71, 73998.11, 78611.34, …

The first argument c(year, lifeExp, pop, gdpPercap) specifies the variables I want to mutate. The second argument ~ .x*100 tells R the function to use for the mutation. The .x is a generic representation for each of the variables listed in the first argument.

B.2.4 Combining filter, select, and mutate

You can do some serious wrangling efficiently with filter, select, and mutate. Suppose I wanted a new dataset of GDP (in billions) for European countries in 2007. Recall that the pipe operator, %>%, makes code easier to read and write by feeding the result of what precedes it to the next line that follows and so on.

In the code below, I create a new dataset named euro_gdp07 by first taking the gapminder dataset, then feeding it to the filter verb. The result is a dataset that includes only European countries in 2007, but this dataset is not created explicitly. Instead, it is fed to the mutate verb, which adds a variable named gdp_billions. Finally, this dataset is fed to the select verb. Using the glimpse verb we can see the final result.

euro_gdp07 <- gapminder %>% 
  filter(continent == 'Europe' & year == 2007) %>% 
  mutate(gdp_billions = (gdpPercap*pop)/1000000000) %>% 
  select(country, year, gdp_billions)

glimpse(euro_gdp07)
Rows: 30
Columns: 3
$ country      <fct> "Albania", "Austria", "Belgium", "Bosnia and Herzegovina"…
$ year         <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 21.376411, 296.229401, 350.141167, 33.897027, 78.213929, …

B.2.5 Combining Mutate and If_Else

There are two common cases for using the combination of mutate and if_else:

  • Convert the values of a two-level categorical variable (i.e. dummy variable) from text to numerical
  • Convert the values of a numerical variable or categorical variable with more than two levels to a two-level categorical variable

In either case, we can choose to create a new variable or overwrite the existing variable we wish to convert.

Suppose I want to create a new variable named rich equal to “yes” if a European country has a GDP greater than the average GDP and “no” if their GDP is less than or equal to the average.

euro_gdp07 <- euro_gdp07 %>% 
  mutate(rich = if_else(gdp_billions > mean(gdp_billions), "yes", "no"))

The first line in the above code overwrites the euro_gdp07 dataset by using the same name on the left side of the assignment operator <-. The euro_gdp07 is fed/piped to the mutate verb. Inside mutate, a name a variable rich. Since rich does not currently exist in the euro_gdp07 dataset, a new variable will be added.

This new variable named rich is defined using the if_else function. The first argument is the conditional. Here I define the conditional as “if gdp_billions is greater than the mean of gdp_billions”. Observations that meet the conditional you specify receive the second argument. In this case, European countries with a GDP greater than the mean of GDP among all European countries will receive a value equal to “yes”. Observations that do not meet the conditional you specify receive the third argument. In this case, European countries with a GDP less than or equal to the mean of GDP among all European countries will receive a value equal to “no”.

glimpse(euro_gdp07)
Rows: 30
Columns: 4
$ country      <fct> "Albania", "Austria", "Belgium", "Bosnia and Herzegovina"…
$ year         <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 21.376411, 296.229401, 350.141167, 33.897027, 78.213929, …
$ rich         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "ye…

Now suppose instead of using text (i.e. string variable) for rich, I want to use a numerical coding of 1/0 where 1 denotes yes/true and 0 no/false.

euro_gdp07 <- euro_gdp07 %>% 
  mutate(rich = if_else(rich == "yes", 1, 0))

Since rich already exists in euro_gdp07, I use the conditional “if rich equals yes.” If it does, the variable is overwritten with the value 1. If it does not, it is overwritten with the value 0.

glimpse(euro_gdp07)
Rows: 30
Columns: 4
$ country      <fct> "Albania", "Austria", "Belgium", "Bosnia and Herzegovina"…
$ year         <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 21.376411, 296.229401, 350.141167, 33.897027, 78.213929, …
$ rich         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, …

B.2.6 Arrange

The arrange verb is useful if you want to identify cases that have the highest or lowest values for one or more variables. By default, arrange reorders rows in ascending order (i.e. lowest to highest). In the previous glimpse, countries are arranged in alphabetical order. Suppose I wanted them arranged based on GDP.

euro_gdp07 %>% 
  arrange(gdp_billions) %>% 
  glimpse()
Rows: 30
Columns: 4
$ country      <fct> "Montenegro", "Iceland", "Albania", "Bosnia and Herzegovi…
$ year         <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 6.336476, 10.924102, 21.376411, 33.897027, 51.774743, 65.…
$ rich         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Now we see a few countries with the lowest GDP. If instead I wanted GDP arranged from highest to lowest, then:

euro_gdp07 %>% 
  arrange(desc(gdp_billions)) %>% 
  glimpse()
Rows: 30
Columns: 4
$ country      <fct> "Germany", "United Kingdom", "France", "Italy", "Spain", …
$ year         <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 2650.87089, 2017.96931, 1861.22794, 1661.26443, 1165.7598…
$ rich         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Now we see some of the wealthiest European countries.

B.2.7 Slice_Head/Tail

By default, the slice_head and slice_tail verbs extract the top and bottom 6 rows of a dataset, respectively. These verbs are useful if we want to show a reader a sample of the data in a familiar spreadsheet form, which can be useful. Though the output from glimpse is very useful, it does not look good in a report. The slice_head and slice_tail verbs allow us to provide similar information in a much more presentable format.

Suppose we wanted to show a reader the three wealthiest and poorest European countries (in absolute terms). We can specify the number of rows slice_head or slice_tail extract using n=#. Thus:

euro_gdp07 %>% 
  arrange(desc(gdp_billions)) %>% 
  slice_head(n=3) %>% 
  kable(digits = 0)
country year gdp_billions rich
Germany 2007 2651 1
United Kingdom 2007 2018 1
France 2007 1861 1

Note the use of kable in the last line. This function from the knitr package is a common way to print nicer looking tables. The digits= inside specifies how many digits to the right of the decimal to include in the table. In this case, I tell R to round to the nearest whole number.

euro_gdp07 %>% 
  arrange(gdp_billions) %>% 
  slice_head(n=3) %>% 
  kable(digits = 0)
country year gdp_billions rich
Montenegro 2007 6 0
Iceland 2007 11 0
Albania 2007 21 0

B.2.8 Summarize

Summarize creates a new dataset by collapsing all of the cases of a dataset into one or more summary statistics. It is useful for providing quick summary stat calculations in a somewhat presentable format. I do not recommend using summarize to produce the kind of summary stats table commonly found in reports because it can become tedious and the formatting is not good enough. I recommend using the arsenal package instead.

Suppose I wanted to report the average gdpPercap and lifeExp for 2007 in a rough and ready table. Then:

gapminder %>% 
  filter(year == 2007) %>% 
  summarize('Average GDP per capita' = mean(gdpPercap), 
            'Average life expectance' = mean(lifeExp)) %>% 
  kable(digits = 0)
Average GDP per capita Average life expectance
11680 67

The summarize verb works with numerous summary functions listed on the second page of the Data transformation cheatsheet under the heading “Summary Functions.”

B.2.9 Group_By

The group_by verb is most commonly used in tandem with summarize. If instead of calculating a summary stat for the entire dataset, you wanted to calculate the summary stat for each group of a categorical variable separately, use group_by before using summarize.

Suppose I wanted average GDP per capita and life expectancy in 2007 for each continent. Then:

gapminder %>% 
  filter(year == 2007) %>% 
  group_by(continent) %>% 
  summarize('Average GDP per capita' = mean(gdpPercap), 
            'Average life expectance' = mean(lifeExp)) %>% 
  kable()
continent Average GDP per capita Average life expectance
Africa 3089.033 54.80604
Americas 11003.032 73.60812
Asia 12473.027 70.72848
Europe 25054.482 77.64860
Oceania 29810.188 80.71950

Pretty powerful! Also, notice how the values in the table are reported to a fairly useless degree of precision because I did not specify digits=0 inside of the kable function.

You can also use multiple grouping variables. Suppose I wanted these summary stats for each continent each year since 1997. Then:

gapminder %>% 
  filter(year >= 1997) %>% 
  group_by(continent, year) %>% 
  summarize('Average GDP per capita' = mean(gdpPercap), 
            'Average life expectancy' = mean(lifeExp)) %>% 
  kable(digits=0)
continent year Average GDP per capita Average life expectancy
Africa 1997 2379 54
Africa 2002 2599 53
Africa 2007 3089 55
Americas 1997 8889 71
Americas 2002 9288 72
Americas 2007 11003 74
Asia 1997 9834 68
Asia 2002 10174 69
Asia 2007 12473 71
Europe 1997 19077 76
Europe 2002 21712 77
Europe 2007 25054 78
Oceania 1997 24024 78
Oceania 2002 26939 80
Oceania 2007 29810 81

B.3 Tidy Verbs

As with wrangling, one can encounter numerous different tidying scenarios. However, most of the time tidying involves converting a wide dataset to a long dataset. The most common untidy data one encounters is a time series or panel data where each time period is stored across columns (i.e. wide) rather than down rows (i.e. long).

Let’s begin with a simple time series of population taken from the gapminder data. Suppose we downloaded a dataset named uspop for U.S. population.

country 1997 2002 2007
United States 272911760 287675526 301139947

We don’t want each year to be a variable. Rather, we want year to be one variable with separate levels/rows for each period. We can achieve this with pivot_longer.

uspop %>% 
  pivot_longer(cols = '1997':'2007', 
               names_to = 'year',
               values_to = 'pop') %>% 
  kable(format = 'html')
country year pop
United States 1997 272911760
United States 2002 287675526
United States 2007 301139947

Note that pivot_longer tries to make the code as intuitive as possible using natural language. First, we tell R which columns to pivot, then we tell R to name the new column ‘year’, then we tell R to name the new column with the values for population ‘pop’.

Suppose we encountered a more difficult wide version of the gapminder data named gap_wide shown below. This one has multiple variables listed wide for each year.

Tidying gap_wide will take two steps. First, we can separate the variable names pop/lifeExp/gdpPercap from the numeric year into two columns using pivot_longer. This will result in a column that contains all three variables that precede the year and a column that contains year. We will also need to name a third column that will contain the values that the pivoted columns contained.

In the code below, I tell R which columns to pivot using cols and to name the two new columns ‘var’ and ‘year’. I use names_sep to tell that each of the columns should be separated using the underscore. Then, I give the new column that will contain the values the generic name ‘value’ since this is an temporary column.

gap_long1 <- gap_wide %>% 
  pivot_longer(cols = pop_1997:gdpPercap_2007,
               names_to = c('var', 'year'),
               names_sep = '_',
               values_to = 'value')

DT::datatable(gap_long1, rownames = FALSE, options = list(pageLength = 5, scrollX=T))

Now we need to convert the var column to wide using pivot_wider. This will create new columns for each of the unique values contained in the ‘var’ column. Since there are three unique values, the result will be three new columns. We also need to specify which column contains the values that will be transferred over to the three new columns.

In the code below, I tell R to pivot the ‘var’ column wide and take the values from the ‘value’ column. And voila; we are back to having our original, tidy data.

gap_long2 <- gap_long1 %>% 
  pivot_wider(names_from = var,
              values_from = value)

DT::datatable(gap_long2, rownames = FALSE, options = list(pageLength = 5, scrollX=T))