class: center, middle, inverse, title-slide .title[ # PADP 7120 Data Applications in PA ] .subtitle[ ## RLab 3: Data Wrangling Part 1 ] .author[ ### Alex Combs ] .institute[ ### UGA | SPIA | PADP ] .date[ ### Last updated: January 23, 2024 ] --- # Outline - Wrangle verbs - extract top or bottom rows using `slice_head` or `slice_tail` - extract rows using `filter` - extract columns using `select` - create new or change existing variables using `mutate` - reorder rows using `arrange` --- # Why learn data wrangling - Learning statistics is not as useful if you can't prepare data - Change original data to help achieve our goal - Create a table we want to include in a report --- # Realistic Expectations - Cleaning data is not fun (to most people) and can involve many steps - You do not need to memorize - Our goal is to be able to start with real data and have an idea of the functions needed to get to the desired result - If you understand the basic steps and available functions, searching for examples or copying previous work is easy --- # Set up > **Create a new project named "rlab3" and start new R Markdown document.** > **Change YAML:** ```r --- title: "RLab3: Data Wrangling" author: "Your Name" output: html_document: df_print: paged --- ``` > **Keep setup code chunk at the top. Change the global option to `echo=FALSE, message=FALSE, warning=FALSE`** > **Delete rest of the template.** --- # Set up: Packages > **In the setup code chunk, load the following packages:** ```r library(tidyverse) # handles virtually all wrangling tasks library(gapminder) # for data we will use ``` --- class: inverse, middle, center # Wrangle verbs --- # Practice Data - We will practice wrangle verbs using the `gapminder` dataset |country |continent | year| lifeExp| pop| gdpPercap| |:-----------|:---------|----:|-------:|-------:|---------:| |Afghanistan |Asia | 1952| 28.8| 8425333| 779.4| - Continents & years included ``` ## [1] "Africa" "Americas" "Asia" "Europe" "Oceania" ``` ``` ## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 ``` --- # Preview data > **Below the setup code chunk, on a new line outside of the code chunk, type a heading Preview** > **Insert a new code chunk. Name this code chunk {r preview}** > **Provide a preview of the data using `slice_head()` to print the first four rows.** --- # Filter ![](labs_files/filter.png) - Keeps rows based on criteria you specify - General syntax: ```r filter(data_set, criteria) #prints the result when knit filtered_data_set <- filter(data_set, criteria) #saves result filtered_data_set # prints the saved result in the above line ``` --- # Specifying criteria - We can tell R to keep rows that meet certain criteria - "If equal to" `==` - "If not equal to" `!=` - "If greater than, less than, or equal to" `>, <, >=, <=` - "If missing values for" `is.na()` - We can specify multiple criteria - "And" `&` - "Or" `|` --- # Filter examples Print all observations if year equals 2007 ```r filter(gapminder, year == 2007) ``` -- Print all observations if continent equals Africa ```r filter(gapminder, continent == 'Africa') ``` -- Save a new dataset containing only observations where life expectancy is greater than 75 years, then print the saved result. ```r long_life <- filter(gapminder, lifeExp > 75) long_life ``` --- # Filter examples Print observations missing values for life expectancy ```r filter(gapminder, is.na(lifeExp)) ``` -- Save a new dataset containing observations where continent is not equal to Oceania (remove Oceania observations) ```r no_oceania <- filter(gapminder, continent != 'Oceania') ``` --- # Filter practice > **Type a new heading, "Wrangle Practice"** > **Type a lower-level heading, "Filter"** -- > **Start a new code chunk and name it `filter`.** -- > **Use `filter` to save new dataset named `asia` containing all Asia observations** -- > **On new line, use `filter` to save new dataset named `gap1952` containing all 1952 observations** -- > **On new line, use `filter` to save a new dataset named `oceania` containing only Oceania observations. Then print `oceania`.** --- # Filter examples (multiple criteria) Print all Africa observations for 2007 ```r filter(gapminder, continent == 'Africa' & year == 2007) ``` -- Print all Africa and Oceania observations ```r filter(gapminder, continent == 'Africa' | continent == 'Oceania') ``` -- Print non-Oceania observations for 1992 or later ```r filter(gapminder, continent != 'Oceania' & year >= 1992) ``` --- # Filter practice > **Start a new code chunk named `filter2`** -- > **Print a table containing countries only in the "Americas" with a life expectancy lower than 65 in year 1952** --- # Select ![](labs_files/select.png) - General syntax ```r new_dataset <- select(old_dataset, var1, var2, var3, ...) # where var is the name of the variable you want to keep ``` - Or to drop variables ```r new_dataset <- select(old_dataset, -var1, -var2, ...) ``` - Whichever requires the shorter list of names --- # Select examples Print only country, year, and life expectancy ```r select(gapminder, country, year, lifeExp) ``` -- Remove population and print the rest ```r select(gapminder, -pop) ``` --- # Select practice > **Insert a heading named Select** -- > **Start a new code chunk and name it `select`** -- > **You have an dataset saved named `oceania`. Use select to print a table of `oceania` containing only `country`, `year`, and `pop`** -- > **On separate line, use select to print a table of `oceania` with all variables except `continent`** -- > **Let's knit our Rmd to check the output so far.** --- # Mutate ![](labs_files/mutate.png) - Generic syntax ```r dataset <- mutate(dataset, var_name = any_formula) ``` - Only a single `=` is used because we are assigning value - If `var_name` is an existing variable, `mutate` will replace it - Otherwise, `mutate` will add a new variable --- # Mutate examples Replace `pop`ulation rescaled from ones to millions ```r gapminder <- mutate(gapminder, pop = pop/1000000) ``` -- Add a new variable `pop_millions` that rescales `pop` to millions ```r gapminder <- mutate(gapminder, pop_millions = pop/1000000) ``` -- Save new dataset with a variable equal to average life expectancy among all observations ```r gapminder2 <- mutate(gapminder, avg_lifeExp = mean(lifeExp)) ``` --- # Mutate examples Can create multiple variables at a time separated by commas ```r gapminder3 <- mutate(gapminder, pop_millions = pop/1000000, avg_lifeExp = mean(lifeExp)) ``` --- # Practice > **Type a heading titled Mutate. Start a new code chunk and name it mutate.** -- > **First, save a new dataset `gapminder07` containing only observations for 2007** -- - Our data contain a variable, `gdpPercap`, for GDP per capita but suppose we want GDP also > **On a new line, create a new variable `gdp` equal to a country's GDP and save it to `gapminder07`** --- # Practice continued > **In the *same* mutate function, create a second variable, `global_gdp`, equal to the `sum()` of all `gdp`s.** -- > **In the *same* mutate function, create a third variable, `pct_gdp`, equal to each country's percent of `global_gdp`.** -- > **Print a table that displays only the country name and percent of global gdp where `pct_gdp` is greater than or equal to 10.** --- # Arrange ![](labs_files/arrange.png) ```r arrange(data_set, var) # Where var is the variable by which you want to arrange # Arranges in ascending order by default arrange(data_set, desc(var)) # Arranges in descending order ``` - Useful for printing tables to see highest or lowest cases --- # Arrange examples Print countries arranged in ascending order of life expectancy (low-to-high) ```r arrange(gapminder, lifeExp) ``` -- Print countries arranged in descending order of life expectancy (high-to-low) ```r arrange(gapminder, desc(lifeExp)) ``` -- Rearrange and replace existing data ```r gapminder <- (gapminder, lifeExp) ``` --- # Arrange practice > **Type a heading titled Arrange. Start a new code chunk and name it arrange.** -- > **Rearrange `gapminder07` in descending order by `pct_gdp`** -- > **Print a table of the first 10 rows** --- # Upload > **Knit your Rmd and review the output** > **Upload Rmd document to eLC**