class: center, middle, inverse, title-slide .title[ # PADP 7120 Data Applications in PA ] .subtitle[ ## RLab 4: Data Wrangling Part 2 ] .author[ ### Alex Combs ] .institute[ ### UGA | SPIA | PADP ] .date[ ### Last updated: January 30, 2024 ] --- # Outline - More wrangling - combine functions using the pipe operator `%>%` - provide descriptive statistics using `summarize` and `group_by` - Make small improvements to the format of our tables --- # Set up > **Open your RLab 3 project file** > **From the files pane, open your RLab 3 Rmd file** > **Load one additional package in the setup code chunk** ```r library(knitr) ``` > **Use "Run" menu to run all code chunks** --- # Review - In last R Lab, we learned: - Keep/remove rows using `filter` - Keep/remove columns using `select` - Create new or change existing variables using `mutate` - Reorder rows based on ascending or descending values of a variable using `arrange` - Print a sample of our data using `slice_head` or `slice_tail` --- class: inverse, middle, center # Combining Wrangle Verbs --- # Pipe operator %>% - Allows us to run multiple commands, sequentially feeding the result of the previous line of code to the next - Makes code easier to read and write - Keyboard shortcut is `Cmd+Shift+M` or `Ctrl+Shift+M` .pull-left[ Prints the result ```r dataset %>% filter(___) %>% select(___) %>% mutate(___) %>% arrange(___) ``` ] .pull-right[ Saves the result ```r dataset <- dataset %>% filter(___) %>% select(___) %>% mutate(___) %>% arrange(___) ``` ] --- # Pipe example Suppose I want to print a table of the top five countries with respect to percent of global GDP in 2007. **Without using %>%:** ```r #Filter gapminder to get 2007 gapminder07 <- filter(gapminder, year == 2007) #Add variables to gapminder07 gapminder07 <- mutate(gapminder07, gdp = gdpPercap*pop, global_gdp = sum(gdp), pct_gdp = (gdp/global_gdp)*100) #Create dataset according to desired table gdp_table <- select(gapminder07, country, pct_gdp) gdp_table <- arrange(gdp_table, desc(pct_gdp)) #Print table slice_head(gdp_table, n = 5) ``` --- # Pipe example To print same table with `%>%`: ```r gapminder %>% filter(year == 2007) %>% mutate(gdp = gdpPercap*pop, global_gdp = sum(gdp), pct_gdp = (gdp/global_gdp)*100) %>% select(country, pct_gdp) %>% arrange(desc(pct_gdp)) %>% slice_head(n=5) ``` - Don't specify the data inside each verb because that is being piped from the first line - Don't have to save the result each step of the way; no `gapminder07` or `gdp_table` needed. --- # Pipe operator practice > **Add a heading "Pipe Operator"** > **Start a new code chunk.** > **Use the pipe operator to save a new dataset `gapminder52` that contains only 1952 observations and add the GDP variables we created earlier (consider copy-and-paste)** -- > **On a new line, use pipe operator to print a table of `gapminder52` that includes only `country` and `pct_gdp`, arranged in descending order and only the first 5 countries.** --- class: inverse, middle, center # Summarize & Group By --- # Summarize ![](labs_files/summarize.png) - Generic syntax ```r summarize(data_set, "Name of column" = function(variable), "Name of column" = function(variable),...) ``` - Where `function` is one of many possible summary functions - Useful for reporting a few summary stats --- # Summarize example Suppose I want to report median life expectancy and GDP per capita in 2007 ```r gapminder %>% filter(year == 2007) %>% summarize("Median Life Expectancy" = median(lifeExp), "Median GDP per Capita" = median(gdpPercap)) ``` | Median Life Expectancy| Median GDP per Capita| |----------------------:|---------------------:| | 72| 6,124| --- # Summarize practice > **Add a heading "Summarize"** > **Start a new code chunk and name it summarize** > **Use `gapminder52` to print a table with the `min`imum, `median`, and `max`imum for `pct_gdp` in 1952. Name columns accordingly.** --- # Group By ![](labs_files/group_by.png) - General syntax ```r data_set %>% group_by(grouping_variable) %>% summarize(name = function(variable)) ``` - Year and categorical variables are common grouping variables --- # Group by example Median life expectancy and GDP per capita in 2007 **by continent** ```r gapminder %>% filter(year == 2007) %>% * group_by(continent) %>% summarize('Median Life Expectancy' = median(lifeExp), 'Median GDP per Capita' = median(gdpPercap)) ``` |Continent | Median Life Expectancy| Median GDP per Capita| |:---------|----------------------:|---------------------:| |Africa | 53| 1,452| |Americas | 73| 8,948| |Asia | 72| 4,471| |Europe | 79| 28,054| |Oceania | 81| 29,810| --- # Group by practice > **Add a heading "Group By"** > **Start a new code chunk and name it groupby.** > **Copy and paste your code that made the summary table** > **Change the table so it reports these summary statistics by each year** --- # Kable options for tables - The `kable()` function, part of the `knitr` package, helps with formatting tables. - A few useful options: - `digits = #` sets the number of digits to the right of the decimal - `format.args = list(big.mark = ',')` inserts commas - `col.names = c("name","name")` renames the columns - `caption = "Title"` provides a table title --- # Kable example > **Add a heading "Formatted Table"** > **Start a new code chunk and name it kable. Add the following code.** ```r gapminder52 %>% select(country, pct_gdp) %>% arrange(desc(pct_gdp)) %>% slice_head(n=5) %>% kable(digits = 1, col.names = c('Country', 'Percent of Global GDP'), caption = 'Largest Economies in 1952') ``` --- # One more practice table - Suppose we want to examine how life expectancy has changed over time > **Create the best table you can that shows a reader this information** --- # Upload Rmd > **Let's knit our Rmd to inspect the output** > **Upload your Rmd to eLC**