class: center, middle, inverse, title-slide .title[ # PADP 7120 Data Applications in PA ] .subtitle[ ## RLab 2: Importing & Exporting ] .author[ ### Alex Combs ] .institute[ ### UGA | SPIA | PADP ] .date[ ### Last updated: January 16, 2024 ] --- # Objectives - Import data - included in a R package - csv and Excel - Export R Markdown - change YAML to print nicer tables - to Word and PDF - suppress code and/or results - Export data to csv --- # Set-up > **Create a new project named "rlab2". I recommend locating it in the same parent file as your rlab1 project.** --- # Set-up > **Start a R Markdown document** > **Change the title in the YAML to "RLab2: Import & Export"** > **Keep the setup code chunk at the top of the template but delete the rest of the template.** --- # Set-up: packages > **In the setup code chunk at the top, add the below code. Run this code to load packages.** ```r library(tidyverse) #includes function for importing CSV library(readxl) #for importing Excel files library(gapminder) #contains data we need to use library(fivethirtyeight) #contains data we need to use ``` - If you receive error message "package does not exist," you need to install it. --- # Set-up: files > **Download "ncbirths" from eLC and add it to your current project folder** > **Download "ga_schdist_raw" from eLC and add to your "rlab2" project folder.** --- class: inverse, middle, center # Data imports --- # Data in R Packages - Many R packages, like `gapminder` and `fivethirtyeight` include datasets - We can browse datasets by going to the "Packages" tab in the bottom-right pane of RStudio - Click on the package that contains the relevant dataset - A list of datasets will appear - Click on a dataset to learn more about it --- # Data in R Packages > **In a new code chunk, add and run the following code** ```r slice_head(gapminder, n=4) # slice_head() prints row(s) from the top of a dataset # n=4 tells it to print the top 4 rows ``` |country |continent | year| lifeExp| pop| gdpPercap| |:-----------|:---------|----:|-------:|--------:|---------:| |Afghanistan |Asia | 1952| 28.801| 8425333| 779.4453| |Afghanistan |Asia | 1957| 30.332| 9240934| 820.8530| |Afghanistan |Asia | 1962| 31.997| 10267083| 853.1007| |Afghanistan |Asia | 1967| 34.020| 11537966| 836.1971| - This is one way to provide a preview of the data --- # Data in R Packages > **Change the code so it saves as an object named `gap_preview` like so and run again** ```r gap_preview <- slice_head(gapminder, n=4) ``` - You should now see an object named `gap_preview` in your environment (top-right pane) --- # Printing the result - Note that the table does not print when you run the code - By default, R does not print the result when saved as an object -- > **Type `gap_preview` on a separate line so the table prints when knit** ```r gap_preview <- slice_head(gapminder, n=4) *gap_preview ``` --- # Data in R Packages - There is a dataset named `hate_crimes` contained within the `fivethirtyeight` package. -- > **Insert a _new code chunk_ and add the following code** ```r hc <- hate_crimes ``` - Now we have a copy of `hate_crimes` in our environment --- # Data in R Packages > **Start a new line in the code chunk you just created.** > **Add code that will print the first six rows of the `hc` dataset.** > **Knit to HTML** --- class: inverse, middle, center # Printing tables --- # Printing tables - We have made two tables that will print when we knit using the code: ```r gap_preview # prints saved object in environment slice_head(hc, n=4) # prints result without saving anything ``` - There is enough horizontal space for all `gap_preview` variables to display but not for all `hc` variables --- # Printing tables - We can improve how all tables print when knit by changing the YAML > **Change the YAML to the example below. Make sure you indent correctly.** ```r --- title: "RLab2: Import & Export" author: "Your Name" output: * html_document: * df_print: kable --- ``` > **Knit to HTML** - `gap_preview` table looks better but `hc` table runs off the page --- # Printing tables > **Change `df_print` option in YAML to paged like highlighted below:** ```r --- title: "RLab2: Import & Export" author: "Alex Combs" output: html_document: * df_print: paged --- ``` > **Knit to HTML** - Now we can interact with the `hc` table to view multiple pages of variables. --- # YAML themes This is not required, but you can change the theme of your document (similar to themed templates in Microsoft Office) like below. ```r --- title: "RLab2: Import & Export" author: "Alex Combs" output: html_document: df_print: paged * theme: readable --- ``` -If interested, more info on [themes](https://bootswatch.com/3/) and [document formatting](https://bookdown.org/yihui/rmarkdown/html-document.html#appearance-and-style) --- class: inverse, middle, center # Code chunk options --- # Code chunk options - Can control how code knits using code chunk options -- - Include options by adding a comma after the `r`, followed by a space ```r {r, options_go_here} ``` -- - echo = TRUE/FALSE - Suppresses the code from the output but shows results ```r {r, echo=FALSE} ``` -- - include = TRUE/FALSE - Suppress the code AND all results from the output ```r {r, include=FALSE} ``` --- # Code chunk options > **Find the code chunk that prints the `gap_preview` table** > **Add the code chunk option that suppresses the code AND results** > **Find the code chunk that prints the first four rows of `hc`** > **Add the code chunk option that suppresses ONLY the code** > **Knit to HTML** --- # Output options overview ![](labs_files/output_options.png) - The minus signs indicate what each option suppresses --- # Global code chunk option - When you start a new Rmd template, it contains a code chunk near the top named "setup" that contains ```r {r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` - This sets `echo=TRUE` for all code chunks by default > **Change to `echo=FALSE`, to suppress all code when knit.** > **Add `message=FALSE` and `warning=FALSE` to suppress the background information RStudio sometimes prints but should not be included in most reports** --- class: inverse, middle, center # Back to data imports --- # Importing data - Organizations publish data files in a variety of formats that may be inconvenient but tend to use a consistent format across files and time - Provides the opportunity to use code to import many similar files quickly - Messy data files present a choice in workflow - Clean the data some as part of the import, or - Import data as is and clean later - Avoid changing the data files in something like Excel before importing it because you or others can't replicate that as easily --- # Importing CSV - Generic syntax: ```r any_name <- read_csv("filename.csv") ``` - Tells R to import the `"filename.csv"`, naming it whatever we type on the left side - Remember: - Quotes around the file - Include the file extension .csv > **In the setup code chunk, import ncbirths.csv naming the saved data as ncbirths.** --- # Importing Excel - Generic syntax: ```r anyname <- read_excel("filename.xlsx") ``` - Works with both .xls or .xlsx file extensions > **In the setup code chunk, import ga_schdist_raw.xlsx naming the saved data as ga_schdist_raw** --- # Preserve or Clean on Import - The previous two slides cover all the code needed to import the two most common public data file formats. - If the data you want to import looks relatively clean, additional code is not neeeded. - If data are messy, we may want to clean the data at the same time as importing it. > **Click on the two data sets you just saved to the environment. Let's examine whether data look correct.** --- # Importing data - Starting with `ncbirths` - Some desirable features of this dataset: - Variable names are in the top row - Variable names are short - This dataset doesn't need changes --- # Importing data - Now viewing `ga_schdist_raw`, the data did not import as we want - Can help to first open messy data in spreasheet software like Excel to understand what needs to be done - Data has some undesirable qualities - At least first 6 rows should be skipped - Variable names are very long - Row 231 and below contain information we don't want to import - Note that the data we want to import runs from cells A7 to I230 --- # Importing data - Common cleaning tasks to combine with importing: - Specifying which rows and columns to import if the raw file has non-data information in it - Changing how R will store a variable - We can handle these tasks using RStudio's import function --- # Importing data > **In file pane, click on "ga_schdist_raw.xlsx", then "Import Dataset..."** - Can click on a variable to change how R stores it, or exclude it from the import - Can skip some specified number of the top rows, or specify a range of cells - Can specify whether the top row includes variable names --- # Importing data - The RStudio import window in the bottom-right provides the code needed to import the data based on the options we choose - **Important:** You must paste this code in your Rmd or else the data won't import when you knit --- # Importing data > **Notice there is a "Range" option in the bottom left. Enter `A7:I230`.** -- > **There are numeric variables stored as character. Click on each and change to numeric using the dropdown menu.** -- > **Copy code and paste in the _setup code chunk_** --- # Importing data - We need to change the code we just pasted in a few ways: > **Delete the part of the file path unique to your computer and include only the file name.** - This is important for sharing documents. Others will encounter an error when trying to run code that contains a file path unique to your computer. > **Delete the last line containing View because it often causes an error when knit** > **Run this code to complete import.** --- # Importing data - Variable names are long but this dataset is ready to use - Do not *have* to rename variables but typing long variables can be inconvenient - What if we wanted to rename these variables to make it easier to work with --- # Renaming variables Generic syntax ```r dataset <- rename(dataset, new_name = current_name, new_name = "current name",...) ``` - If a current variable name has a space, need to use quotes --- # Renaming variables - Let's rename one variable in `ga_schdist_raw` > Refer to the example code on previous slide to rename the first variable, "Agency Name", to district. Add this code to your setup code chunk. > Run this code to check whether it works --- class: inverse, center, middle # Exporting Document and Data --- # Knit to Word - If Word is installed on your computer, knitting to Word should work - Good way to get all text, graphs, and tables in a final Word document - Can then change style and formatting --- # Knit to PDF - To knit to PDF, you can install a light version of LaTeX within RStudio. ```r # In the console pane, type install.packages('tinytex') # hit Enter # Next, in the console pane, type tinytex::install_tinytex() # hit Enter ``` --- # Export data to CSV - We may want to export data in our environment that we have prepared in RStudio for others to use - Generic syntax for exporting data to CSV: ```r write_csv(name_of_object_in_environment, 'name_for_new_file.csv') ``` -- > **Export the `ga_schdist_raw` object to a csv file. Name the file "ga_schdist_clean.csv"** --- # Recap - We have now covered the basic workflow of most data projects using statistical programming software - Launch application (RStudio) -> load any packages & import data needed for analysis -> analysis -> save/export any documents, graphs, tables, data, etc. - Now we can focus on the skills useful for actual analysis -- > **Upload your Rmd to the RLab assignment dropbox on eLC.**