PADP 7120 Data Applications in PA

class: center, middle, inverse, title-slide

.title[
# PADP 7120 Data Applications in PA
]
.subtitle[
## RLab 3: Data Wrangling Part 1
]
.author[
### Alex Combs
]
.institute[
### UGA | SPIA | PADP
]
.date[
### Last updated: January 21, 2025
]

---

# Outline

- Wrangle verbs
  - extract top or bottom rows using `slice_head` or `slice_tail`
  - extract rows using `filter`
  - extract columns using `select`
  - create new or change existing variables using `mutate`
  - reorder rows using `arrange`

---
# Why learn data wrangling

- Learning statistics is not as useful if you can't prepare data

- Change original data to help achieve our goal

- Create a table we want to include in a report

---
# Realistic Expectations

- Cleaning data is not fun (to most people) and can involve many steps

- You do not need to memorize

- Our goal is to be able to start with real data and have an idea of the functions needed to get to the desired result

- If you understand the basic steps and available functions, searching for examples or copying previous work is easy

---
# Set up

> **Create a new project named "rlab3" and start new R Markdown document.**
  
> **Change YAML:**

``` r
---
title: "RLab3: Data Wrangling"
author: "Your Name"
output:
  html_document:
    df_print: paged
---
```

> **Keep setup code chunk at the top. Change the global option to `echo=FALSE, message=FALSE, warning=FALSE`**

> **Delete rest of the template.**

---
# Set up: Packages

> **In the setup code chunk, load the following packages:**

``` r
library(tidyverse) # handles virtually all wrangling tasks
library(gapminder) # for data we will use
```

---
class: inverse, middle, center

# Wrangle verbs

---
# Practice Data

- We will practice wrangle verbs using the `gapminder` dataset

|country     |continent | year| lifeExp|     pop| gdpPercap|
|:-----------|:---------|----:|-------:|-------:|---------:|
|Afghanistan |Asia      | 1952|    28.8| 8425333|     779.4|

- Continents & years included

```
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
```

```
##  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
```

---
# Preview data

> **Below the setup code chunk, on a new line outside of the code chunk, type a heading Preview**

> **Insert a new code chunk. Name this code chunk {r preview}**

> **Provide a preview of the data using `slice_head()` to print the first four rows.**

---
# Filter

![](labs_files/filter.png)

- Keeps rows based on criteria you specify

- General syntax:

``` r
filter(data_set, criteria) #prints the result when knit
filtered_data_set <- filter(data_set, criteria) #saves result
filtered_data_set # prints the saved result in the above line
```

---
# Specifying criteria

- We can tell R to keep rows that meet certain criteria

- "If equal to" `==`
  - "If not equal to" `!=`
  - "If greater than, less than, or equal to" `>, <, >=, <=`
  - "If missing values for" `is.na()`

- We can specify multiple criteria
  - "And" `&`
  - "Or" `|`

---
# Filter examples

Print all observations if year equals 2007

``` r
filter(gapminder, year == 2007)
```

Print all observations if continent equals Africa

``` r
filter(gapminder, continent == 'Africa')
```

Save a new dataset containing only observations where life expectancy is greater than 75 years, then print the saved result.

``` r
long_life <- filter(gapminder, lifeExp > 75)
long_life
```

---
# Filter examples

Print observations missing values for life expectancy

``` r
filter(gapminder, is.na(lifeExp))
```

Save a new dataset containing observations where continent is not equal to Oceania (remove Oceania observations)

``` r
no_oceania <- filter(gapminder, continent != 'Oceania')
```

---
# Filter practice

> **Type a new heading, "Wrangle Practice"**

> **Type a lower-level heading, "Filter"**

> **Start a new code chunk and name it `filter`.**

> **Use `filter` to save new dataset named `asia` containing all Asia observations**

> **On new line, use `filter` to save new dataset named `gap1952` containing all 1952 observations**

> **On new line, use `filter` to save a new dataset named `oceania` containing only Oceania observations. Then print `oceania`.**

---
# Filter examples (multiple criteria)

Print all Africa observations for 2007

``` r
filter(gapminder, continent == 'Africa' & year == 2007)
```

Print all Africa and Oceania observations

``` r
filter(gapminder, continent == 'Africa' | continent == 'Oceania')
```

Print non-Oceania observations for 1992 or later

``` r
filter(gapminder, continent != 'Oceania' & year >= 1992)
```

---
# Filter practice

> **Start a new code chunk named `filter2`**

> **Print a table containing countries only in the "Americas" with a life expectancy lower than 65 in year 1952**

---
# Select

![](labs_files/select.png)

- General syntax

``` r
new_dataset <- select(old_dataset, var1, var2, var3, ...)
# where var is the name of the variable you want to keep
```

- Or to drop variables

``` r
new_dataset <- select(old_dataset, -var1, -var2, ...)
```

- Whichever requires the shorter list of names

---
# Select examples

Print only country, year, and life expectancy

``` r
select(gapminder, country, year, lifeExp)
```

Remove population and print the rest

``` r
select(gapminder, -pop)
```

---
# Select practice

> **Insert a heading named Select**

> **Start a new code chunk and name it `select`**

> **You have an dataset saved named `oceania`. Use select to print a table of `oceania` containing only `country`, `year`, and `pop`**

> **On separate line, use select to print a table of `oceania` with all variables except `continent`**

> **Let's knit our Rmd to check the output so far.**

---
# Mutate

![](labs_files/mutate.png)

- Generic syntax

``` r
dataset <- mutate(dataset, var_name = any_formula)
```

- Only a single `=` is used because we are assigning value

- If `var_name` is an existing variable, `mutate` will replace it

- Otherwise, `mutate` will add a new variable

---
# Mutate examples

Replace `pop`ulation rescaled from ones to millions

``` r
gapminder <- mutate(gapminder, pop = pop/1000000)
```

Add a new variable `pop_millions` that rescales `pop` to millions

``` r
gapminder <- mutate(gapminder, pop_millions = pop/1000000)
```

Save new dataset with a variable equal to average life expectancy among all observations

``` r
gapminder2 <- mutate(gapminder, avg_lifeExp = mean(lifeExp))
```

---
# Mutate examples

Can create multiple variables at a time separated by commas

``` r
gapminder3 <- mutate(gapminder,
                     pop_millions = pop/1000000,
                     avg_lifeExp = mean(lifeExp))
```

---
# Practice

> **Type a heading titled Mutate. Start a new code chunk and name it mutate.**

> **First, save a new dataset `gapminder07` containing only observations for 2007**

- Our data contain a variable, `gdpPercap`, for GDP per capita but suppose we want GDP also

> **On a new line, create a new variable `gdp` equal to a country's GDP and save it to `gapminder07`**

---
# Practice continued

> **In the *same* mutate function, create a second variable, `global_gdp`, equal to the `sum()` of all `gdp`s.**

> **In the *same* mutate function, create a third variable, `pct_gdp`, equal to each country's percent of `global_gdp`.**

> **Print a table that displays only the country name and percent of global gdp where `pct_gdp` is greater than or equal to 10.**

---
# Arrange

![](labs_files/arrange.png)

``` r
arrange(data_set, var)
# Where var is the variable by which you want to arrange
# Arranges in ascending order by default

arrange(data_set, desc(var))
# Arranges in descending order
```

- Useful for printing tables to see highest or lowest cases

---
# Arrange examples

Print countries arranged in ascending order of life expectancy (low-to-high)

``` r
arrange(gapminder, lifeExp)
```

Print countries arranged in descending order of life expectancy (high-to-low)

``` r
arrange(gapminder, desc(lifeExp))
```

Rearrange and replace existing data

``` r
gapminder <- (gapminder, lifeExp)
```

---
# Arrange practice

> **Type a heading titled Arrange. Start a new code chunk and name it arrange.**

> **Rearrange `gapminder07` in descending order by `pct_gdp`**

> **Print a table of the first 10 rows**

---
# Upload

> **Knit your Rmd and review the output**

> **Upload Rmd document to eLC**