B Wrangle and Tidy Reference
Unless data are already perfectly prepared, the most time consuming part of data analysis is wrangling and tidying data. It is impossible to cover all scenarios one may encounter when preparing raw data for an analysis. Even for advanced users of R, it is not uncommon to search for an unknown solution to a new problem via the web, texts, or manuals. Attempting to memorize the plethora of functions in R that could serve as solutions would quickly result in diminishing returns. Instead, it is more realistic to obtain enough familiarity with basic wrangle and tidy problems and solutions that one knows how and where to effectively search for the solution.
B.1 Cheatsheets
RStudio provides numerous cheatsheets to help R users reference commonly used and helpful functions. Below is a list of cheatsheets that pertain to wrangling and tidying.
This is the most relevant cheatsheet for what you will encounter in the course:
Others that are less relevant:
Knowing just a handful of functions can help you make considerable progress in many situations. The remainder of this chapter serves as a sort of cheatsheet for problems you may encounter during the course. Functions are demonstrated using the gapminder
data.
The tidyverse
package is actually a collection of several packages
designed to make the wrangle, tidy, and data exploration process as intuitive and consistent as possible. You should almost always load tidyverse
, as it contains every function you may need to wrangle and tidy data.
B.2 Wrangle Verbs
- filter: extract rows/cases
- select: extract columns/variables
- mutate: alter existing variables or create new variables
- if_else: use a conditional to create a new variable equal to one value if an observation meets the conditional and another value if it does not; often combined with
mutate
- arrange: reorder rows in ascending or descending order of one or more variables
- slice_head & slice_tail: extract a few rows from the top (head) or bottom (tail)
- summarize: collapses data into a table of summary statistics
- group_by: tells R to apply functions to each group separately; common to use with summarize
B.2.1 Filter
Use filter
to extract rows from a dataset. Inversely, one can think of filter
as a way to remove rows. However, remember that filter
keeps the rows that meet the condition on which you filter. Therefore, you want to use a condition that keeps the rows you want.
Note there are 1,704 rows in the gapminder
dataset.
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
Suppose I want to keep only countries in Asia. Then:
Rows: 396
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
The result is a new dataset with 396 rows. Note the use of double equal signs ==
to tell R it is a conditional (“if equal to”) rather than setting something equal to something else, which would not make sense in this case.
Suppose I want countries in Asia AND in the year 1952. Then:
Rows: 33
Columns: 6
$ country <fct> "Afghanistan", "Bahrain", "Bangladesh", "Cambodia", "China",…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, …
$ lifeExp <dbl> 28.801, 50.939, 37.484, 39.417, 44.000, 60.960, 37.373, 37.4…
$ pop <int> 8425333, 120447, 46886859, 4693836, 556263527, 2125900, 3720…
$ gdpPercap <dbl> 779.4453, 9867.0848, 684.2442, 368.4693, 400.4486, 3054.4212…
This results in a new dataset with 33 rows. Note the use of the ampersand &
to code the “and” conditional.
Suppose I want countries in Asia with a life expectancy less than or equal to 40 in 1952. Then:
Rows: 10
Columns: 6
$ country <fct> "Afghanistan", "Bangladesh", "Cambodia", "India", "Indonesia…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia
$ year <int> 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952, 1952
$ lifeExp <dbl> 28.801, 37.484, 39.417, 37.373, 37.468, 36.319, 36.157, 37.5…
$ pop <int> 8425333, 46886859, 4693836, 372000000, 82052000, 20092996, 9…
$ gdpPercap <dbl> 779.4453, 684.2442, 368.4693, 546.5657, 749.6817, 331.0000, …
Suppose I all countries in 1952 except those in Asia. There are a few options to do this. Which option is most efficient depends on the specific case. In this case:
Option 1: Using the “or” conditional |
(least efficient)
gapminder %>%
filter(continent == 'Africa' | continent == 'Americas' | continent == 'Europe' | continent == 'Oceania') %>%
glimpse()
Rows: 1,308
Columns: 6
$ country <fct> "Albania", "Albania", "Albania", "Albania", "Albania", "Alba…
$ continent <fct> Europe, Europe, Europe, Europe, Europe, Europe, Europe, Euro…
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 55.230, 59.280, 64.820, 66.220, 67.690, 68.930, 70.420, 72.0…
$ pop <int> 1282697, 1476505, 1728137, 1984060, 2263554, 2509048, 278009…
$ gdpPercap <dbl> 1601.056, 1942.284, 2312.889, 2760.197, 3313.422, 3533.004, …
Option 2: Using the shortcut %in%
for multiple “or” conditionals (moderately efficient)
Rows: 1,308
Columns: 6
$ country <fct> "Albania", "Albania", "Albania", "Albania", "Albania", "Alba…
$ continent <fct> Europe, Europe, Europe, Europe, Europe, Europe, Europe, Euro…
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 55.230, 59.280, 64.820, 66.220, 67.690, 68.930, 70.420, 72.0…
$ pop <int> 1282697, 1476505, 1728137, 1984060, 2263554, 2509048, 278009…
$ gdpPercap <dbl> 1601.056, 1942.284, 2312.889, 2760.197, 3313.422, 3533.004, …
Option 3: Use the “not equal to” conditional !=
(most efficient)
Rows: 1,308
Columns: 6
$ country <fct> "Albania", "Albania", "Albania", "Albania", "Albania", "Alba…
$ continent <fct> Europe, Europe, Europe, Europe, Europe, Europe, Europe, Euro…
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 55.230, 59.280, 64.820, 66.220, 67.690, 68.930, 70.420, 72.0…
$ pop <int> 1282697, 1476505, 1728137, 1984060, 2263554, 2509048, 278009…
$ gdpPercap <dbl> 1601.056, 1942.284, 2312.889, 2760.197, 3313.422, 3533.004, …
B.2.2 Select
Suppose I want a dataset that contains only country, continent, year, and life expectancy. There are multiple options. Which is more efficient depends on the specific case. In this case:
Option 1: List the variables I want to keep (least efficient)
Rows: 1,704
Columns: 4
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
Option 2: List the variables I don’t want to keep (moderately efficient)
Rows: 1,704
Columns: 4
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
Option 3: Use :
to specify the range of variables, which only works because the variables I want happen to be stored next to each other (most efficient)
Rows: 1,704
Columns: 4
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
B.2.3 Mutate
The mutate
function allows you to mutate your dataset by either changing an existing variable or creating a new one.
Suppose I wanted to change GDP per capita so that it is expressed in thousands of dollars instead of dollars. Then:
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 0.7794453, 0.8208530, 0.8531007, 0.8361971, 0.7399811, 0.786…
Note that I use the name of an existing variable on the left-hand side of the equation. This overwrites the data according to the function I have specified. You can scroll up to previous glimpses to confirm that gdpPercap has indeed been divided by 1,000.
Suppose I wanted a new variable that measures total GDP to have in addition to GDP per capita expressed in thousands. Since GDP per capita equals GDP divided by population, I can simply use the inverse of this calculation. Thus:
Rows: 1,704
Columns: 7
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 0.7794453, 0.8208530, 0.8531007, 0.8361971, 0.7399811, 0.786…
$ gdp <dbl> 6567086, 7585449, 8758856, 9648014, 9678553, 11697659, 12598…
Since mutate
applies mathematical functions, there are way too many possible uses to cover here. The second page of the Data transformation cheatsheet lists numerous common functions used with mutate
under the “Vector Functions” header.
Suppose there are multiple variables you want to mutate using the same function. A common example is when a bunch of variables are expressed as proportions between 0 and 1 when you want them all to be expressed as percentages between 0 and 100. You could use mutate to multiply each variable by 100, but this quickly becomes tedious. Instead, you can use across
inside mutate
to list the variables you want to mutate, then define the function you want applied to them.
For example, suppose I wanted to multiply all of the numerical variables in gapminder
by 100 (doesn’t make sense but just go with it). Then:
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <dbl> 195200, 195700, 196200, 196700, 197200, 197700, 198200, 1987…
$ lifeExp <dbl> 2880.1, 3033.2, 3199.7, 3402.0, 3608.8, 3843.8, 3985.4, 4082…
$ pop <dbl> 842533300, 924093400, 1026708300, 1153796600, 1307946000, 14…
$ gdpPercap <dbl> 77944.53, 82085.30, 85310.07, 83619.71, 73998.11, 78611.34, …
The first argument c(year, lifeExp, pop, gdpPercap)
specifies the variables I want to mutate. The second argument ~ .x*100
tells R the function to use for the mutation. The .x
is a generic representation for each of the variables listed in the first argument.
B.2.4 Combining filter, select, and mutate
You can do some serious wrangling efficiently with filter, select, and mutate. Suppose I wanted a new dataset of GDP (in billions) for European countries in 2007. Recall that the pipe operator, %>%
, makes code easier to read and write by feeding the result of what precedes it to the next line that follows and so on.
In the code below, I create a new dataset named euro_gdp07
by first taking the gapminder
dataset, then feeding it to the filter verb. The result is a dataset that includes only European countries in 2007, but this dataset is not created explicitly. Instead, it is fed to the mutate
verb, which adds a variable named gdp_billions
. Finally, this dataset is fed to the select
verb. Using the glimpse
verb we can see the final result.
euro_gdp07 <- gapminder %>%
filter(continent == 'Europe' & year == 2007) %>%
mutate(gdp_billions = (gdpPercap*pop)/1000000000) %>%
select(country, year, gdp_billions)
glimpse(euro_gdp07)
Rows: 30
Columns: 3
$ country <fct> "Albania", "Austria", "Belgium", "Bosnia and Herzegovina"…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 21.376411, 296.229401, 350.141167, 33.897027, 78.213929, …
B.2.5 Combining Mutate and If_Else
There are two common cases for using the combination of mutate
and if_else
:
- Convert the values of a two-level categorical variable (i.e. dummy variable) from text to numerical
- Convert the values of a numerical variable or categorical variable with more than two levels to a two-level categorical variable
In either case, we can choose to create a new variable or overwrite the existing variable we wish to convert.
Suppose I want to create a new variable named rich
equal to “yes” if a European country has a GDP greater than the average GDP and “no” if their GDP is less than or equal to the average.
The first line in the above code overwrites the euro_gdp07
dataset by using the same name on the left side of the assignment operator <-
. The euro_gdp07
is fed/piped to the mutate
verb. Inside mutate
, a name a variable rich
. Since rich
does not currently exist in the euro_gdp07
dataset, a new variable will be added.
This new variable named rich
is defined using the if_else
function. The first argument is the conditional. Here I define the conditional as “if gdp_billions
is greater than the mean of gdp_billions
”. Observations that meet the conditional you specify receive the second argument. In this case, European countries with a GDP greater than the mean of GDP among all European countries will receive a value equal to “yes”. Observations that do not meet the conditional you specify receive the third argument. In this case, European countries with a GDP less than or equal to the mean of GDP among all European countries will receive a value equal to “no”.
Rows: 30
Columns: 4
$ country <fct> "Albania", "Austria", "Belgium", "Bosnia and Herzegovina"…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 21.376411, 296.229401, 350.141167, 33.897027, 78.213929, …
$ rich <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "ye…
Now suppose instead of using text (i.e. string variable) for rich
, I want to use a numerical coding of 1/0 where 1 denotes yes/true and 0 no/false.
Since rich already exists in euro_gdp07
, I use the conditional “if rich
equals yes.” If it does, the variable is overwritten with the value 1. If it does not, it is overwritten with the value 0.
Rows: 30
Columns: 4
$ country <fct> "Albania", "Austria", "Belgium", "Bosnia and Herzegovina"…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 21.376411, 296.229401, 350.141167, 33.897027, 78.213929, …
$ rich <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, …
B.2.6 Arrange
The arrange
verb is useful if you want to identify cases that have the highest or lowest values for one or more variables. By default, arrange
reorders rows in ascending order (i.e. lowest to highest). In the previous glimpse, countries are arranged in alphabetical order. Suppose I wanted them arranged based on GDP.
Rows: 30
Columns: 4
$ country <fct> "Montenegro", "Iceland", "Albania", "Bosnia and Herzegovi…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 6.336476, 10.924102, 21.376411, 33.897027, 51.774743, 65.…
$ rich <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Now we see a few countries with the lowest GDP. If instead I wanted GDP arranged from highest to lowest, then:
Rows: 30
Columns: 4
$ country <fct> "Germany", "United Kingdom", "France", "Italy", "Spain", …
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
$ gdp_billions <dbl> 2650.87089, 2017.96931, 1861.22794, 1661.26443, 1165.7598…
$ rich <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Now we see some of the wealthiest European countries.
B.2.7 Slice_Head/Tail
By default, the slice_head
and slice_tail
verbs extract the top and bottom 6 rows of a dataset, respectively. These verbs are useful if we want to show a reader a sample of the data in a familiar spreadsheet form, which can be useful. Though the output from glimpse
is very useful, it does not look good in a report. The slice_head
and slice_tail
verbs allow us to provide similar information in a much more presentable format.
Suppose we wanted to show a reader the three wealthiest and poorest European countries (in absolute terms). We can specify the number of rows slice_head
or slice_tail
extract using n=#
. Thus:
country | year | gdp_billions | rich |
---|---|---|---|
Germany | 2007 | 2651 | 1 |
United Kingdom | 2007 | 2018 | 1 |
France | 2007 | 1861 | 1 |
Note the use of kable
in the last line. This function from the knitr
package is a common way to print nicer looking tables. The digits=
inside specifies how many digits to the right of the decimal to include in the table. In this case, I tell R to round to the nearest whole number.
country | year | gdp_billions | rich |
---|---|---|---|
Montenegro | 2007 | 6 | 0 |
Iceland | 2007 | 11 | 0 |
Albania | 2007 | 21 | 0 |
B.2.8 Summarize
Summarize creates a new dataset by collapsing all of the cases of a dataset into one or more summary statistics. It is useful for providing quick summary stat calculations in a somewhat presentable format. I do not recommend using summarize
to produce the kind of summary stats table commonly found in reports because it can become tedious and the formatting is not good enough. I recommend using the arsenal
package instead.
Suppose I wanted to report the average gdpPercap and lifeExp for 2007 in a rough and ready table. Then:
gapminder %>%
filter(year == 2007) %>%
summarize('Average GDP per capita' = mean(gdpPercap),
'Average life expectance' = mean(lifeExp)) %>%
kable(digits = 0)
Average GDP per capita | Average life expectance |
---|---|
11680 | 67 |
The summarize
verb works with numerous summary functions listed on the second page of the Data transformation cheatsheet under the heading “Summary Functions.”
B.2.9 Group_By
The group_by
verb is most commonly used in tandem with summarize
. If instead of calculating a summary stat for the entire dataset, you wanted to calculate the summary stat for each group of a categorical variable separately, use group_by
before using summarize
.
Suppose I wanted average GDP per capita and life expectancy in 2007 for each continent. Then:
gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarize('Average GDP per capita' = mean(gdpPercap),
'Average life expectance' = mean(lifeExp)) %>%
kable()
continent | Average GDP per capita | Average life expectance |
---|---|---|
Africa | 3089.033 | 54.80604 |
Americas | 11003.032 | 73.60812 |
Asia | 12473.027 | 70.72848 |
Europe | 25054.482 | 77.64860 |
Oceania | 29810.188 | 80.71950 |
Pretty powerful! Also, notice how the values in the table are reported to a fairly useless degree of precision because I did not specify digits=0
inside of the kable
function.
You can also use multiple grouping variables. Suppose I wanted these summary stats for each continent each year since 1997. Then:
gapminder %>%
filter(year >= 1997) %>%
group_by(continent, year) %>%
summarize('Average GDP per capita' = mean(gdpPercap),
'Average life expectancy' = mean(lifeExp)) %>%
kable(digits=0)
continent | year | Average GDP per capita | Average life expectancy |
---|---|---|---|
Africa | 1997 | 2379 | 54 |
Africa | 2002 | 2599 | 53 |
Africa | 2007 | 3089 | 55 |
Americas | 1997 | 8889 | 71 |
Americas | 2002 | 9288 | 72 |
Americas | 2007 | 11003 | 74 |
Asia | 1997 | 9834 | 68 |
Asia | 2002 | 10174 | 69 |
Asia | 2007 | 12473 | 71 |
Europe | 1997 | 19077 | 76 |
Europe | 2002 | 21712 | 77 |
Europe | 2007 | 25054 | 78 |
Oceania | 1997 | 24024 | 78 |
Oceania | 2002 | 26939 | 80 |
Oceania | 2007 | 29810 | 81 |
B.3 Tidy Verbs
As with wrangling, one can encounter numerous different tidying scenarios. However, most of the time tidying involves converting a wide dataset to a long dataset. The most common untidy data one encounters is a time series or panel data where each time period is stored across columns (i.e. wide) rather than down rows (i.e. long).
Let’s begin with a simple time series of population taken from the gapminder
data. Suppose we downloaded a dataset named uspop
for U.S. population.
country | 1997 | 2002 | 2007 |
---|---|---|---|
United States | 272911760 | 287675526 | 301139947 |
We don’t want each year to be a variable. Rather, we want year to be one variable with separate levels/rows for each period. We can achieve this with pivot_longer
.
uspop %>%
pivot_longer(cols = '1997':'2007',
names_to = 'year',
values_to = 'pop') %>%
kable(format = 'html')
country | year | pop |
---|---|---|
United States | 1997 | 272911760 |
United States | 2002 | 287675526 |
United States | 2007 | 301139947 |
Note that pivot_longer
tries to make the code as intuitive as possible using natural language. First, we tell R which columns to pivot, then we tell R to name the new column ‘year’, then we tell R to name the new column with the values for population ‘pop’.
Suppose we encountered a more difficult wide version of the gapminder
data named gap_wide
shown below. This one has multiple variables listed wide for each year.
Tidying gap_wide
will take two steps. First, we can separate the variable names pop/lifeExp/gdpPercap
from the numeric year into two columns using pivot_longer
. This will result in a column that contains all three variables that precede the year and a column that contains year. We will also need to name a third column that will contain the values that the pivoted columns contained.
In the code below, I tell R which columns to pivot using cols
and to name the two new columns ‘var’ and ‘year’. I use names_sep
to tell that each of the columns should be separated using the underscore. Then, I give the new column that will contain the values the generic name ‘value’ since this is an temporary column.
gap_long1 <- gap_wide %>%
pivot_longer(cols = pop_1997:gdpPercap_2007,
names_to = c('var', 'year'),
names_sep = '_',
values_to = 'value')
DT::datatable(gap_long1, rownames = FALSE, options = list(pageLength = 5, scrollX=T))
Now we need to convert the var
column to wide using pivot_wider
. This will create new columns for each of the unique values contained in the ‘var’ column. Since there are three unique values, the result will be three new columns. We also need to specify which column contains the values that will be transferred over to the three new columns.
In the code below, I tell R to pivot the ‘var’ column wide and take the values from the ‘value’ column. And voila; we are back to having our original, tidy data.