B Wrangle and Tidy Reference

Unless data are already perfectly prepared, the most time consuming part of data analysis is wrangling and tidying data. It is impossible to cover all scenarios one may encounter when preparing raw data for an analysis. Even for advanced users of R, it is not uncommon to search for an unknown solution to a new problem via the web, texts, or manuals. Attempting to memorize the plethora of functions in R that could serve as solutions would quickly result in diminishing returns. Instead, it is more realistic to obtain enough familiarity with basic wrangle and tidy problems and solutions that one knows how and where to effectively search for the solution.

B.1 Cheatsheets

RStudio provides numerous cheatsheets to help R users reference commonly used and helpful functions. Below is a list of cheatsheets that pertain to wrangling and tidying.

This is the most relevant cheatsheet for what you will encounter in the course:

Data transformation

Others that are less relevant:

Knowing just a handful of functions can help you make considerable progress in many situations. The remainder of this chapter serves as a sort of cheatsheet for problems you may encounter during the course. Functions are demonstrated using the gapminder data.

The tidyverse package is actually a collection of several packages designed to make the wrangle, tidy, and data exploration process as intuitive and consistent as possible. You should almost always load tidyverse, as it contains every function you may need to wrangle and tidy data.

library(tidyverse)

B.2 Wrangle Verbs

filter: extract rows/cases
select: extract columns/variables
mutate: alter existing variables or create new variables
if_else: use a conditional to create a new variable equal to one value if an observation meets the conditional and another value if it does not; often combined with mutate
arrange: reorder rows in ascending or descending order of one or more variables
slice_head & slice_tail: extract a few rows from the top (head) or bottom (tail)
summarize: collapses data into a table of summary statistics
group_by: tells R to apply functions to each group separately; common to use with summarize

B.2.1 Filter

Use filter to extract rows from a dataset. Inversely, one can think of filter as a way to remove rows. However, remember that filter keeps the rows that meet the condition on which you filter. Therefore, you want to use a condition that keeps the rows you want.

Note there are 1,704 rows in the gapminder dataset.

glimpse(gapminder)

Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

Suppose I want to keep only countries in Asia. Then:

gapminder %>% 
  filter(continent == 'Asia') %>% 
  glimpse()

Rows: 396
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

The result is a new dataset with 396 rows. Note the use of double equal signs == to tell R it is a conditional (“if equal to”) rather than setting something equal to something else, which would not make sense in this case.

Suppose I want countries in Asia AND in the year 1952. Then:

gapminder %>% 
  filter(continent == 'Asia' & year == 1952) %>% 
  glimpse()

Rows: 33
Columns: 6
$ country   <fct> "Afghanistan", "Bahrain", "Bangladesh", "Cambodia", "China",…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, …
$ lifeExp   <dbl> 28.801, 50.939, 37.484, 39.417, 44.000, 60.960, 37.373, 37.4…
$ pop       <int> 8425333, 120447, 46886859, 4693836, 556263527, 2125900, 3720…
$ gdpPercap <dbl> 779.4453, 9867.0848, 684.2442, 368.4693, 400.4486, 3054.4212…

This results in a new dataset with 33 rows. Note the use of the ampersand & to code the “and” conditional.

Suppose I want countries in Asia with a life expectancy less than or equal to 40 in 1952. Then:

gapminder %>% 
  filter(continent == 'Asia' & year == 1952 & lifeExp <= 40) %>% 
  glimpse()

Rows: 10
Columns: 6
$ country   <fct> "Afghanistan", "Bangladesh", "Cambodia", "India", "Indonesia…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia
$ year      <int> 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952
$ lifeExp   <dbl> 28.801, 37.484, 39.417, 37.373, 37.468, 36.319, 36.157, 37.5…
$ pop       <int> 8425333, 46886859, 4693836, 372000000, 82052000, 20092996, 9…
$ gdpPercap <dbl> 779.4453, 684.2442, 368.4693, 546.5657, 749.6817, 331.0000, …

Suppose I all countries in 1952 except those in Asia. There are a few options to do this. Which option is most efficient depends on the specific case. In this case:

Option 1: Using the “or” conditional | (least efficient)

gapminder %>% 
  filter(continent == 'Africa' | continent == 'Americas' | continent == 'Europe' | continent == 'Oceania') %>% 
  glimpse()

Rows: 1,308
Columns: 6
$ country   <fct> "Albania", "Albania", "Albania", "Albania", "Albania", "Alba…
$ continent <fct> Europe, Europe, Europe, Europe, Europe, Europe, Europe, Euro…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 55.230, 59.280, 64.820, 66.220, 67.690, 68.930, 70.420, 72.0…
$ pop       <int> 1282697, 1476505, 1728137, 1984060, 2263554, 2509048, 278009…
$ gdpPercap <dbl> 1601.056, 1942.284, 2312.889, 2760.197, 3313.422, 3533.004, …

Option 2: Using the shortcut %in% for multiple “or” conditionals (moderately efficient)

gapminder %>% 
  filter(continent %in% c('Africa', 'Americas', 'Europe', 'Oceania')) %>% 
  glimpse()

Rows: 1,308
Columns: 6
$ country   <fct> "Albania", "Albania", "Albania", "Albania", "Albania", "Alba…
$ continent <fct> Europe, Europe, Europe, Europe, Europe, Europe, Europe, Euro…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 55.230, 59.280, 64.820, 66.220, 67.690, 68.930, 70.420, 72.0…
$ pop       <int> 1282697, 1476505, 1728137, 1984060, 2263554, 2509048, 278009…
$ gdpPercap <dbl> 1601.056, 1942.284, 2312.889, 2760.197, 3313.422, 3533.004, …

Option 3: Use the “not equal to” conditional != (most efficient)

gapminder %>% 
  filter(continent != 'Asia') %>% 
  glimpse()

Rows: 1,308
Columns: 6
$ country   <fct> "Albania", "Albania", "Albania", "Albania", "Albania", "Alba…
$ continent <fct> Europe, Europe, Europe, Europe, Europe, Europe, Europe, Euro…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 55.230, 59.280, 64.820, 66.220, 67.690, 68.930, 70.420, 72.0…
$ pop       <int> 1282697, 1476505, 1728137, 1984060, 2263554, 2509048, 278009…
$ gdpPercap <dbl> 1601.056, 1942.284, 2312.889, 2760.197, 3313.422, 3533.004, …

B.2.2 Select

Suppose I want a dataset that contains only country, continent, year, and life expectancy. There are multiple options. Which is more efficient depends on the specific case. In this case:

Option 1: List the variables I want to keep (least efficient)

gapminder %>% 
  select(country, continent, year, lifeExp) %>% 
  glimpse()

Rows: 1,704
Columns: 4
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…

Option 2: List the variables I don’t want to keep (moderately efficient)

gapminder %>% 
  select(-pop, -gdpPercap) %>% 
  glimpse()

Rows: 1,704
Columns: 4
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…

Option 3: Use : to specify the range of variables, which only works because the variables I want happen to be stored next to each other (most efficient)

gapminder %>% 
  select(country:lifeExp) %>% 
  glimpse()

Rows: 1,704
Columns: 4
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…

B.2.3 Mutate

The mutate function allows you to mutate your dataset by either changing an existing variable or creating a new one.

Suppose I wanted to change GDP per capita so that it is expressed in thousands of dollars instead of dollars. Then:

gapminder %>% 
  mutate(gdpPercap = gdpPercap/1000) %>% 
  glimpse()

Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 0.7794453, 0.8208530, 0.8531007, 0.8361971, 0.7399811, 0.786…

Note that I use the name of an existing variable on the left-hand side of the equation. This overwrites the data according to the function I have specified. You can scroll up to previous glimpses to confirm that gdpPercap has indeed been divided by 1,000.

Suppose I wanted a new variable that measures total GDP to have in addition to GDP per capita expressed in thousands. Since GDP per capita equals GDP divided by population, I can simply use the inverse of this calculation. Thus:

gapminder %>% 
  mutate(gdpPercap = gdpPercap/1000,
         gdp = gdpPercap*pop) %>% 
  glimpse()

Rows: 1,704
Columns: 7
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 0.7794453, 0.8208530, 0.8531007, 0.8361971, 0.7399811, 0.786…
$ gdp       <dbl> 6567086, 7585449, 8758856, 9648014, 9678553, 11697659, 12598…

Since mutate applies mathematical functions, there are way too many possible uses to cover here. The second page of the Data transformation cheatsheet lists numerous common functions used with mutate under the “Vector Functions” header.

Suppose there are multiple variables you want to mutate using the same function. A common example is when a bunch of variables are expressed as proportions between 0 and 1 when you want them all to be expressed as percentages between 0 and 100. You could use mutate to multiply each variable by 100, but this quickly becomes tedious. Instead, you can use across inside mutate to list the variables you want to mutate, then define the function you want applied to them.

For example, suppose I wanted to multiply all of the numerical variables in gapminder by 100 (doesn’t make sense but just go with it). Then:

gapminder %>% 
  mutate(across(c(year, lifeExp, pop, gdpPercap), ~ .x*100)) %>% 
  glimpse()

Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <dbl> 195200, 195700, 196200, 196700, 197200, 197700, 198200, 1987…
$ lifeExp   <dbl> 2880.1, 3033.2, 3199.7, 3402.0, 3608.8, 3843.8, 3985.4, 4082…
$ pop       <dbl> 842533300, 924093400, 1026708300, 1153796600, 1307946000, 14…
$ gdpPercap <dbl> 77944.53, 82085.30, 85310.07, 83619.71, 73998.11, 78611.34, …

The first argument c(year, lifeExp, pop, gdpPercap) specifies the variables I want to mutate. The second argument ~ .x*100 tells R the function to use for the mutation. The .x is a generic representation for each of the variables listed in the first argument.

B.2.4 Combining filter, select, and mutate

You can do some serious wrangling efficiently with filter, select, and mutate. Suppose I wanted a new dataset of GDP (in billions) for European countries in 2007. Recall that the pipe operator, %>%, makes code easier to read and write by feeding the result of what precedes it to the next line that follows and so on.

In the code below, I create a new dataset named euro_gdp07 by first taking the gapminder dataset, then feeding it to the filter verb. The result is a dataset that includes only European countries in 2007, but this dataset is not created explicitly. Instead, it is fed to the mutate verb, which adds a variable named gdp_billions. Finally, this dataset is fed to the select verb. Using the glimpse verb we can see the final result.

euro_gdp07 <- gapminder %>% 
  filter(continent == 'Europe' & year == 2007) %>% 
  mutate(gdp_billions = (gdpPercap*pop)/1000000000) %>% 
  select(country, year, gdp_billions)

glimpse(euro_gdp07)

Rows: 30
Columns: 3
$ country      <fct> "Albania", "Austria", "Belgium", "Bosnia and Herzegovina"…
$ year         <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 21.376411, 296.229401, 350.141167, 33.897027, 78.213929, …

B.2.5 Combining Mutate and If_Else

There are two common cases for using the combination of mutate and if_else:

Convert the values of a two-level categorical variable (i.e. dummy variable) from text to numerical
Convert the values of a numerical variable or categorical variable with more than two levels to a two-level categorical variable

In either case, we can choose to create a new variable or overwrite the existing variable we wish to convert.

Suppose I want to create a new variable named rich equal to “yes” if a European country has a GDP greater than the average GDP and “no” if their GDP is less than or equal to the average.

euro_gdp07 <- euro_gdp07 %>% 
  mutate(rich = if_else(gdp_billions > mean(gdp_billions), "yes", "no"))

The first line in the above code overwrites the euro_gdp07 dataset by using the same name on the left side of the assignment operator <-. The euro_gdp07 is fed/piped to the mutate verb. Inside mutate, a name a variable rich. Since rich does not currently exist in the euro_gdp07 dataset, a new variable will be added.

This new variable named rich is defined using the if_else function. The first argument is the conditional. Here I define the conditional as “if gdp_billions is greater than the mean of gdp_billions”. Observations that meet the conditional you specify receive the second argument. In this case, European countries with a GDP greater than the mean of GDP among all European countries will receive a value equal to “yes”. Observations that do not meet the conditional you specify receive the third argument. In this case, European countries with a GDP less than or equal to the mean of GDP among all European countries will receive a value equal to “no”.

glimpse(euro_gdp07)

Rows: 30
Columns: 4
$ country      <fct> "Albania", "Austria", "Belgium", "Bosnia and Herzegovina"…
$ year         <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 21.376411, 296.229401, 350.141167, 33.897027, 78.213929, …
$ rich         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "ye…

Now suppose instead of using text (i.e. string variable) for rich, I want to use a numerical coding of 1/0 where 1 denotes yes/true and 0 no/false.

euro_gdp07 <- euro_gdp07 %>% 
  mutate(rich = if_else(rich == "yes", 1, 0))

Since rich already exists in euro_gdp07, I use the conditional “if rich equals yes.” If it does, the variable is overwritten with the value 1. If it does not, it is overwritten with the value 0.

glimpse(euro_gdp07)

Rows: 30
Columns: 4
$ country      <fct> "Albania", "Austria", "Belgium", "Bosnia and Herzegovina"…
$ year         <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 21.376411, 296.229401, 350.141167, 33.897027, 78.213929, …
$ rich         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, …

B.2.6 Arrange

The arrange verb is useful if you want to identify cases that have the highest or lowest values for one or more variables. By default, arrange reorders rows in ascending order (i.e. lowest to highest). In the previous glimpse, countries are arranged in alphabetical order. Suppose I wanted them arranged based on GDP.

euro_gdp07 %>% 
  arrange(gdp_billions) %>% 
  glimpse()

Rows: 30
Columns: 4
$ country      <fct> "Montenegro", "Iceland", "Albania", "Bosnia and Herzegovi…
$ year         <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 6.336476, 10.924102, 21.376411, 33.897027, 51.774743, 65.…
$ rich         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Now we see a few countries with the lowest GDP. If instead I wanted GDP arranged from highest to lowest, then:

euro_gdp07 %>% 
  arrange(desc(gdp_billions)) %>% 
  glimpse()

Rows: 30
Columns: 4
$ country      <fct> "Germany", "United Kingdom", "France", "Italy", "Spain", …
$ year         <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 2650.87089, 2017.96931, 1861.22794, 1661.26443, 1165.7598…
$ rich         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Now we see some of the wealthiest European countries.

B.2.7 Slice_Head/Tail

By default, the slice_head and slice_tail verbs extract the top and bottom 6 rows of a dataset, respectively. These verbs are useful if we want to show a reader a sample of the data in a familiar spreadsheet form, which can be useful. Though the output from glimpse is very useful, it does not look good in a report. The slice_head and slice_tail verbs allow us to provide similar information in a much more presentable format.

Suppose we wanted to show a reader the three wealthiest and poorest European countries (in absolute terms). We can specify the number of rows slice_head or slice_tail extract using n=#. Thus:

euro_gdp07 %>% 
  arrange(desc(gdp_billions)) %>% 
  slice_head(n=3) %>% 
  kable(digits = 0)

country	year	gdp_billions	rich
Germany	2007	2651	1
United Kingdom	2007	2018	1
France	2007	1861	1

Note the use of kable in the last line. This function from the knitr package is a common way to print nicer looking tables. The digits= inside specifies how many digits to the right of the decimal to include in the table. In this case, I tell R to round to the nearest whole number.

euro_gdp07 %>% 
  arrange(gdp_billions) %>% 
  slice_head(n=3) %>% 
  kable(digits = 0)

country	year	gdp_billions
Montenegro	2007	6
Iceland	2007	11
Albania	2007	21

B.2.8 Summarize

Summarize creates a new dataset by collapsing all of the cases of a dataset into one or more summary statistics. It is useful for providing quick summary stat calculations in a somewhat presentable format. I do not recommend using summarize to produce the kind of summary stats table commonly found in reports because it can become tedious and the formatting is not good enough. I recommend using the arsenal package instead.

Suppose I wanted to report the average gdpPercap and lifeExp for 2007 in a rough and ready table. Then:

gapminder %>% 
  filter(year == 2007) %>% 
  summarize('Average GDP per capita' = mean(gdpPercap), 
            'Average life expectance' = mean(lifeExp)) %>% 
  kable(digits = 0)

Average GDP per capita	Average life expectance
11680	67

The summarize verb works with numerous summary functions listed on the second page of the Data transformation cheatsheet under the heading “Summary Functions.”

B.2.9 Group_By

The group_by verb is most commonly used in tandem with summarize. If instead of calculating a summary stat for the entire dataset, you wanted to calculate the summary stat for each group of a categorical variable separately, use group_by before using summarize.

Suppose I wanted average GDP per capita and life expectancy in 2007 for each continent. Then:

gapminder %>% 
  filter(year == 2007) %>% 
  group_by(continent) %>% 
  summarize('Average GDP per capita' = mean(gdpPercap), 
            'Average life expectance' = mean(lifeExp)) %>% 
  kable()

continent	Average GDP per capita	Average life expectance
Africa	3089.033	54.80604
Americas	11003.032	73.60812
Asia	12473.027	70.72848
Europe	25054.482	77.64860
Oceania	29810.188	80.71950

Pretty powerful! Also, notice how the values in the table are reported to a fairly useless degree of precision because I did not specify digits=0 inside of the kable function.

You can also use multiple grouping variables. Suppose I wanted these summary stats for each continent each year since 1997. Then:

gapminder %>% 
  filter(year >= 1997) %>% 
  group_by(continent, year) %>% 
  summarize('Average GDP per capita' = mean(gdpPercap), 
            'Average life expectancy' = mean(lifeExp)) %>% 
  kable(digits=0)

continent	year	Average GDP per capita	Average life expectancy
Africa	1997	2379	54
Africa	2002	2599	53
Africa	2007	3089	55
Americas	1997	8889	71
Americas	2002	9288	72
Americas	2007	11003	74
Asia	1997	9834	68
Asia	2002	10174	69
Asia	2007	12473	71
Europe	1997	19077	76
Europe	2002	21712	77
Europe	2007	25054	78
Oceania	1997	24024	78
Oceania	2002	26939	80
Oceania	2007	29810	81

B.3 Tidy Verbs

As with wrangling, one can encounter numerous different tidying scenarios. However, most of the time tidying involves converting a wide dataset to a long dataset. The most common untidy data one encounters is a time series or panel data where each time period is stored across columns (i.e. wide) rather than down rows (i.e. long).

Let’s begin with a simple time series of population taken from the gapminder data. Suppose we downloaded a dataset named uspop for U.S. population.

country	1997	2002	2007
United States	272911760	287675526	301139947

We don’t want each year to be a variable. Rather, we want year to be one variable with separate levels/rows for each period. We can achieve this with pivot_longer.

uspop %>% 
  pivot_longer(cols = '1997':'2007', 
               names_to = 'year',
               values_to = 'pop') %>% 
  kable(format = 'html')

country	year	pop
United States	1997	272911760
United States	2002	287675526
United States	2007	301139947

Note that pivot_longer tries to make the code as intuitive as possible using natural language. First, we tell R which columns to pivot, then we tell R to name the new column ‘year’, then we tell R to name the new column with the values for population ‘pop’.

Suppose we encountered a more difficult wide version of the gapminder data named gap_wide shown below. This one has multiple variables listed wide for each year.

Tidying gap_wide will take two steps. First, we can separate the variable names pop/lifeExp/gdpPercap from the numeric year into two columns using pivot_longer. This will result in a column that contains all three variables that precede the year and a column that contains year. We will also need to name a third column that will contain the values that the pivoted columns contained.

In the code below, I tell R which columns to pivot using cols and to name the two new columns ‘var’ and ‘year’. I use names_sep to tell that each of the columns should be separated using the underscore. Then, I give the new column that will contain the values the generic name ‘value’ since this is an temporary column.

gap_long1 <- gap_wide %>% 
  pivot_longer(cols = pop_1997:gdpPercap_2007,
               names_to = c('var', 'year'),
               names_sep = '_',
               values_to = 'value')

DT::datatable(gap_long1, rownames = FALSE, options = list(pageLength = 5, scrollX=T))

Now we need to convert the var column to wide using pivot_wider. This will create new columns for each of the unique values contained in the ‘var’ column. Since there are three unique values, the result will be three new columns. We also need to specify which column contains the values that will be transferred over to the three new columns.

In the code below, I tell R to pivot the ‘var’ column wide and take the values from the ‘value’ column. And voila; we are back to having our original, tidy data.

gap_long2 <- gap_long1 %>% 
  pivot_wider(names_from = var,
              values_from = value)

DT::datatable(gap_long2, rownames = FALSE, options = list(pageLength = 5, scrollX=T))