class: center, middle, inverse, title-slide .title[ # PADP 7120 Data Applications in PA ] .subtitle[ ## RLab 6: Combining Data ] .author[ ### Alex Combs ] .institute[ ### UGA | SPIA | PADP ] .date[ ### Last updated: March 11, 2024 ] --- # Objectives - Learn to perform two most common tasks for combining data - Append - Left join --- # Set up > **Create a new project named "rlab6" and start new R Markdown document. Change YAML and global options. Delete template.** ```r --- title: "RLab 6: Combining Data" author: "Your Name" output: html_document: df_print: kable --- ``` --- # Setup > **Load the following packages** ```r library(tidyverse) ``` --- # Download & Import Data > **Download the `st_fin_2014.csv`, `st_fin_2015.csv`, `st_fin_2016.csv`, `covid_ga_5.csv` and `ga_county_pop.csv` files from eLC and add them to your project folder.** > **In your setup code chunk, import the data files using the following code.** ```r st_fin_2014 <- read_csv("st_fin_2014.csv") st_fin_2015 <- read_csv("st_fin_2015.csv") st_fin_2016 <- read_csv("st_fin_2016.csv") covid_ga_5 <- read_csv("covid_ga_5.csv") ga_county_pop <- read_csv("ga_county_pop.csv") ``` --- # Append - Stacking observations of two or more datasets that have common variables with exactly the same name. - Example: Adding multiple time periods together of the same data that is released annually. ![:scale 50%](labs_files/append.png) --- # Append Example - Suppose I need to append three datasets with same layout -- ```r glimpse(gapminder07) ``` ``` ## Rows: 142 ## Columns: 3 ## $ country <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", … ## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, … ## $ gdpPercap <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, 34435… ``` ```r glimpse(gapminder02) ``` ``` ## Rows: 142 ## Columns: 3 ## $ country <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", … ## $ year <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, … ## $ gdpPercap <dbl> 726.7341, 4604.2117, 5288.0404, 2773.2873, 8797.6407, 30687.… ``` --- # Append Example ```r glimpse(gapminder97) ``` ``` ## Rows: 142 ## Columns: 3 ## $ country <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", … ## $ year <int> 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, … ## $ gdpPercap <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, 26997… ``` --- # Append Example - To append, all we need to do is `bind` the `rows` of each dataset using the `bind_rows` function - General syntax ```r new_dataset <- bind_rows(dataset1, dataset2, ...) ``` --- # Append Example ```r gapminder97_07 <- bind_rows(gapminder97, gapminder02, gapminder07) glimpse(gapminder97_07) ``` ``` ## Rows: 426 ## Columns: 3 ## $ country <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", … ## $ year <int> 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, … ## $ gdpPercap <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, 26997… ``` - There are now 426 rows in the new dataset (3 x 142) --- # Append Practice > **Immediately below your setup code chunk, insert a heading "Append"** - You have three `st_fin_year` datasets with the same layout > **Append them to create a new dataset `st_fin_2014_2016`** --- # Append Practice > **Start a new code chunk and use `glimpse` to confirm it worked** ```r glimpse(st_fin_2014_2016) ``` - You should have 150 observations (3 years X 50 states) - Now we can explore things like changes in state finances over time --- class: inverse, middle, center # Left joining data --- # County COVID Data - Let's view the `covid_ga_5` data - These come from *New York Time's* daily COVID data [here](https://github.com/nytimes/covid-19-data) - *NYT* discontinued collecting local data on March 23, 2023 and now uses weekly CDC data - Our `covid_ga_5` data contain their last counts of cumulative COVID cases and deaths for 5 Georgia counties --- # County comparison > **Insert a heading "Left Join"** - Suppose we want to compare cumulative cases as of 3/23/2023 across these 5 counties. > **Generate a bar graph using the following code.** ```r covid_ga_5 %>% ggplot(aes(x = county, y = cases)) + geom_col() ``` > **Make another bar graph for deaths** --- # County comparison - Cases <img src="rlab6-combining-data_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- # County comparison - Deaths <img src="rlab6-combining-data_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> --- # County comparison - Why are these county comparisons misleading? --- # County comparison - We haven't adjusted for county population. - Therefore, we need to compute rates; cases and/or deaths per some standard number of people. - But the raw data do not include county population - We need to `left_join` county population data to the COVID data --- # Left Joins - A left join is used when we want to add variables from a secondary dataset to our primary dataset, matching observations based on a common set of variables called the **key** - Keeps **ALL** of the observations in our main dataset regardless of whether there is a match in the secondary dataset - Observations in the secondary dataset that don't match with the main dataset **are not** added ![:scale 60%](labs_files/left_join.png) --- # The Key to Joins/Merges - The **key** is the variable or set of variables that uniquely identifies each observation/row that is shared across the datasets you want to join. - If you can identify the unit of analysis in both datasets, you have identified the key to use for joins - Datasets must have the same key in order to join them - The key does not need to have the same variable name(s) across datasets --- # General Syntax of Left Joins - If a single variable is the key and named the same across datasets ```r new_dataset <- name_of_left_dataset %>% left_join(name_of_right_dataset, by = "name_of_key") ``` -- - If multiple variables of the same name are the key ```r new_dataset <- name_of_left_dataset %>% left_join(name_of_right_dataset, * by = c("name_of_key1", "name_of_key2", ...)) ``` --- # General Syntax of Left Joins - If one variable is the key and named differently across datasets ```r new_dataset <- name_of_left_dataset %>% left_join(name_of_right_dataset, * by = c("key_in_left_data" = "key_in_right_data")) ``` -- - If multiple variables are the key and at least one is named differently ```r new_dataset <- name_of_left_dataset %>% left_join(name_of_right_dataset, * by = c("key_in_left_data" = "key_in_right_data", * "common_key", ...)) ``` --- # General Syntax to Left Joins - Suppose I want to join several datasets to a main dataset. Some have same name for key and some don't. ```r new_dataset <- name_of_left_dataset %>% left_join(right1_dataset, by = "name_of_key") %>% left_join(right2_dataset, by = c("key_in_left" = "key_in_right2")) %>% left_join(right3_dataset, by = "name_of_key") ``` --- # Left Join Example - Suppose I want to add population to the `gapminder97_07` that I appended earlier ```r glimpse(gapminder97_07) ``` ``` ## Rows: 426 ## Columns: 3 ## $ country <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", … ## $ year <int> 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, … ## $ gdpPercap <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, 26997… ``` --- # Left Join Example - Suppose I could find population data for only the U.S. ```r glimpse(gap_pop_US) ``` ``` ## Rows: 12 ## Columns: 3 ## $ country <fct> "United States", "United States", "United States", "United Sta… ## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 20… ## $ pop <int> 157553000, 171984000, 186538000, 198712000, 209896000, 2202390… ``` - Note that there are only 12 observations in `gap_pop_US`, not 426 as in `gapminder97_07` --- # Left Join Example - First step is to identify the key(s) for `gapminder97_07` and `gap_pop_US` - `country` AND `year` - They are named the same across both datasets as well, so... -- ```r gap_pop_97_07 <- gapminder97_07 %>% left_join(gap_pop_US, by = c('country', 'year')) glimpse(gap_pop_97_07) ``` ``` ## Rows: 426 ## Columns: 4 ## $ country <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", … ## $ year <int> 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, … ## $ gdpPercap <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, 26997… ## $ pop <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ``` --- # Left Join Practice - You imported `ga_county_pop` earlier - These data include total county population for 2019 from the US Census --- # Left Join Practice - Which variable(s) is(are) the key for joining the two datasets? ```r glimpse(ga_county_pop) ``` ``` ## Rows: 159 ## Columns: 2 ## $ county <chr> "Appling", "Atkinson", "Bacon", "Baker", "Baldwin", "Banks"… ## $ population <dbl> 18386, 8165, 11164, 3038, 44890, 19234, 83240, 107738, 1670… ``` ```r glimpse(covid_ga_5) ``` ``` ## Rows: 5 ## Columns: 6 ## $ date <date> 2023-03-23, 2023-03-23, 2023-03-23, 2023-03-23, 2023-03-23 ## $ county <chr> "Clarke", "Cobb", "DeKalb", "Fulton", "Gwinnett" ## $ state <chr> "Georgia", "Georgia", "Georgia", "Georgia", "Georgia" ## $ fips <dbl> 13059, 13067, 13089, 13121, 13135 ## $ cases <dbl> 35440, 217466, 193938, 277539, 258304 ## $ deaths <dbl> 265, 2106, 1985, 2671, 2127 ``` --- # Variable Type Mismatch - Before using a key to join data, make sure they are stored as the same type of variable in both datasets - The glimpse output tells us how each variable is stored. Is the key stored differently across the two datasets? -- - For example, if `county` was stored as a factor variable in `covid_ga_5` and as a character variable in `ga_county_pop`, we could change county to factor in `ga_county_pop` like so. ```r ga_county_pop$county <- as.factor(ga_county_pop$county) ``` --- # Left Join Practice - Which data set makes the most sense as our "left" dataset? > **Left join the two datasets, overwriting `covid_ga_5`.** ```r covid_ga_5 <- covid_ga_5 %>% ___(___, ___ = '___') ``` --- # Left Join Practice > **Use `glimpse` on our updated data** ```r glimpse(covid_ga_5) ``` ``` ## Rows: 5 ## Columns: 7 ## $ date <date> 2023-03-23, 2023-03-23, 2023-03-23, 2023-03-23, 2023-03-23 ## $ county <chr> "Clarke", "Cobb", "DeKalb", "Fulton", "Gwinnett" ## $ state <chr> "Georgia", "Georgia", "Georgia", "Georgia", "Georgia" ## $ fips <dbl> 13059, 13067, 13089, 13121, 13135 ## $ cases <dbl> 35440, 217466, 193938, 277539, 258304 ## $ deaths <dbl> 265, 2106, 1985, 2671, 2127 ## $ population <dbl> 128331, 760141, 759297, 1063937, 936250 ``` --- # Adjusting for Population > **Insert a new code chunk below the left join that added population to `covid_ga_5`. Add three new variables: `pop100K`, `cases_rate`, and `deaths_rate`.** ```r covid_ga_5 <- covid_ga_5 %>% ___(pop100K = ___, cases_rate = ___, deaths_rate = ___) ``` --- # Update bar graphs > **Add a new code chunk. Copy-and-paste bar graph code. Modify the code to include new variables.** --- # Updated bar graphs <img src="rlab6-combining-data_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" /> --- # Updated bar graphs <img src="rlab6-combining-data_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" /> --- # Upload > **Upload your Rmd to eLC.** --- # Other types of joins - Inner join keeps only observations where the key matches ![:scale 65%](labs_files/inner_join.png) - Replace `left_join` with `inner_join` in any of the previous examples --- # Other types of joins - Full join keeps all observations in both datasets regardless of match ![](labs_files/full_join.png) - Replace `left_join` with `full_join` in any of the examples