19 R Description

19.1 Learning Objectives

In this chapter, you will learn how to:

  • Calculate descriptive statistics individually
  • Automate a professional-quality table of descriptive statistics

19.2 Set-up

To complete this chapter, you need to

  • Start an R Markdown document
  • Change the YAML to the following:
---
title: 'R Chapter 19'
author: 'Your name'
output: 
  html_document:
    theme: spacelab
    df_print: paged
---
  • Load the following packages
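library(tidyverse)  # data wrangling tools, including glimpse and the %>% pipe
library(arsenal)    # tableby summary tables
library(knitr)      # knitting and table output
library(carData)    # States dataset
library(gapminder)  # gapminder dataset used in the demonstrations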

19.3 Introduction

Summary statistics tables are ubiquitous in reports and studies. A dataset usually contains too many variables to visualize each one, though we should still consider visualizations for the most important variables. A standard table of summary statistics gives readers a reference for key measures of all our variables in a fairly compact form.

In this chapter, we set out to summarize variables within the States dataset of the carData package.

Exercise 1: Use the glimpse function to examine the States dataset.

States is a cross-section of the 50 states and D.C. containing education and related statistics. Be sure to skim the documentation for States to understand each variable. You can do that by clicking on the carData package under the Packages tab then clicking on the States link.

19.4 Quick Stat Computation

Instead of producing a full table of descriptive statistics, sometimes we just want to know one or two descriptive measures of a single variable.

Because R can hold many datasets/objects at once, we need to tell it which dataset/object to apply a given function to. We have had to do this many times already. Similarly, if we want R to apply a function to a specific variable within a dataset, we need to tell it which variable in which dataset. This is done using the $ operator.

Below are several useful descriptive measures of center and spread applied to the pay variable (i.e. average teacher’s salary in $1,000s) in the States dataset. Note that the $ operator tells R to apply the function to a specific variable within a dataset.

mean(States$pay)
[1] 30.94118
median(States$pay)
[1] 30
sd(States$pay)
[1] 5.308151
IQR(States$pay)
[1] 6
range(States$pay)
[1] 22 43
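
When several measures are needed at once, calling each function separately gets repetitive. The following is a minimal sketch of one alternative, assuming the gapminder package is installed and using dplyr's summarise function from the tidyverse. The column names on the left side (avg, med, and so on) are illustrative choices, not required names.

```r
library(dplyr)      # part of the tidyverse
library(gapminder)  # assumed to be installed; provides the gapminder data

# Compute several descriptive measures of life expectancy in one call.
# na.rm = TRUE drops missing values before computing each measure
# (gapminder has none, but many real datasets do).
life_stats <- gapminder %>%
  summarise(
    avg     = mean(lifeExp, na.rm = TRUE),
    med     = median(lifeExp, na.rm = TRUE),
    std_dev = sd(lifeExp, na.rm = TRUE),
    iqr     = IQR(lifeExp, na.rm = TRUE),
    minimum = min(lifeExp, na.rm = TRUE),
    maximum = max(lifeExp, na.rm = TRUE)
  )

life_stats
```

The result is a one-row data frame, so no $ operator is needed inside summarise; the dataset is supplied once through the pipe.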

Exercise 2: Calculate the average and standard deviation for state spending on public education in $1000s per student.

19.5 Summary Table

Summary tables come in many styles, so there is no way to cover everything. In most cases, a summary table includes the following descriptive measures depending on the type of variable:

  • Numerical variables
    • Mean
    • Standard deviation
    • Minimum
    • Maximum
  • Categorical variables
    • Counts for each level, and/or
    • Percentages for each level

If a variable is skewed, then it may be wise to replace the mean and standard deviation with the median and IQR. We will learn how to do this.
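
One rough way to spot a skewed variable is to compare its mean to its median. The sketch below does this for GDP per capita in the gapminder data, assuming the gapminder package is installed. It is an informal check, not a formal test of skewness.

```r
library(gapminder)  # assumed to be installed

gdp <- gapminder$gdpPercap

# Print the two measures of center side by side
c(mean = mean(gdp), median = median(gdp))

# A mean well above the median hints at a right-skewed variable,
# because a few very large values pull the mean upward
mean_exceeds_median <- mean(gdp) > median(gdp)
mean_exceeds_median
```

For a variable like this one, the median and quartiles usually describe a typical observation better than the mean and standard deviation.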

19.5.1 Quick Table

Sometimes we do not want to print a fancy table. Rather, we may want to quickly see a full set of descriptive statistics for ourselves that our reader will never see. This can be done using the summary function on a dataset, shown here with the gapminder dataset:

summary(gapminder)
        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop               gdpPercap       
 Min.   :     60011   Min.   :   241.2  
 1st Qu.:   2793664   1st Qu.:  1202.1  
 Median :   7023596   Median :  3531.8  
 Mean   :  29601212   Mean   :  7215.3  
 3rd Qu.:  19585222   3rd Qu.:  9325.5  
 Max.   :1318683096   Max.   :113523.1  
                                        

We would almost certainly want to suppress this code and output (i.e. use the include=FALSE code chunk option) if preparing a report for an external audience. Note that for the categorical variables, country and continent, summary provides the count of observations in each level instead of measures of center or spread.

Exercise 3: Generate a quick table of descriptive statistics for all of the variables in States. Suppress the code and output.

19.5.2 Using Arsenal

Because summary tables come in many styles, numerous R packages are designed to produce them. Of the packages I am aware of, Arsenal is the best at quickly turning the information into a nicely formatted table. Therefore, we will learn how to use Arsenal. I will demonstrate Arsenal using the gapminder dataset. Then, I will ask you to replicate those demonstrations using the States data.

Producing a summary table with Arsenal involves three steps, the second of which is optional but usually worthwhile:

  • Create a new object containing the summary statistics we want to include in the table
  • Relabel the variables to something appropriate for our audience
  • Generate the summary table based on the new object we just created

Here is an example that follows the steps above using the gapminder data. It leaves all of Arsenal’s default options unchanged; we will learn how to alter some of them later.

sum.gapminder <- tableby(~ continent + gdpPercap + lifeExp + pop, data = gapminder)

The above code is what actually creates the table I want to export. First, I name the object whatever I want. Then I use the tableby function. We will learn what the tilde, ~, does later. For now, know that it is required. Then, I list the variables I want included in the table, separating each with a plus sign. Lastly, I tell R which dataset to apply this function to.

labels(sum.gapminder) <- c(continent = "Continent", gdpPercap = "GDP Per Capita", lifeExp = "Life Expectancy", pop = "Population")

Most datasets do not use variable names that would be appropriate for an external audience. The names in gapminder are not bad; most readers could make sense of what the names imply about the data, but it is simple enough (though tedious) to provide a more polished look.

Therefore, in the above code I use the labels function on the sum.gapminder table I just created, then assign each variable I told R to include in the table a label that will replace the name when it prints.

summary(sum.gapminder, title = "Summary Stats for Gapminder Data")
Table 19.1: Summary Stats for Gapminder Data
Overall (N=1704)
Continent
   Africa 624 (36.6%)
   Americas 300 (17.6%)
   Asia 396 (23.2%)
   Europe 360 (21.1%)
   Oceania 24 (1.4%)
GDP Per Capita
   Mean (SD) 7215.327 (9857.455)
   Range 241.166 - 113523.133
Life Expectancy
   Mean (SD) 59.474 (12.917)
   Range 23.599 - 82.603
Population
   Mean (SD) 29601212.325 (106157896.744)
   Range 60011.000 - 1318683096.000

This last line of code prints the summary table when I knit my document. The previous two steps (tableby and labels) can be included in the same code chunk, but this third step needs its own code chunk because it requires a specific code chunk option, results='asis', for the table to export properly. To be clear, the top line of a code chunk contains {r} by default; you need to change it to {r, results='asis', echo=FALSE}. I also include the echo=FALSE option assuming we do not want our reader to see our code.
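
While drafting, it can be handy to preview a table in the console without knitting. The sketch below assumes the gapminder package is installed and relies on the text argument of Arsenal's summary method, which the package documentation describes as producing a text-only version of the table. Treat that behavior as an assumption to confirm for your installed version.

```r
library(arsenal)
library(gapminder)  # assumed to be installed

# Steps 1 and 2: build the table object and relabel its variables
preview <- tableby(~ continent + lifeExp, data = gapminder)
labels(preview) <- c(continent = "Continent", lifeExp = "Life Expectancy")

# Step 3, console version: text = TRUE asks for plain text
# instead of the markdown intended for a knitted document
summary(preview, text = TRUE)
```

This preview is only for our own use. The knitted report still needs the summary call in its own chunk with the results='asis' option.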

Exercise 4: Replicate the code shown above to create a summary table for the States data using the Arsenal package. Be sure to relabel the variables to something relatively understandable and brief. Labeling is tedious but you only need to do it once. I suggest you knit your document now to see what you just made.

In three relatively short bits of code, we already have a decent summary table that would have taken excruciatingly long to input manually. But it can be made better.

19.5.3 Adjustments to Arsenal

19.5.3.1 Decimal digits

The biggest aesthetic issue with my table is that it includes so many decimals. None of these variables have such a small range that rounding to integers masks useful information. Obviously, if a variable only ranges between 0 and 1, we would not want to round to an integer.

Specifying the number of decimals is quite easy with Arsenal. Because arsenal tries to be as flexible as possible, we have to specify the number of decimals separately for numerical and percentage measures. The following code sets the number of decimals to zero for the gapminder data.

sum.gapminder2 <- tableby(~ continent + gdpPercap + lifeExp + pop, data = gapminder, digits = 0, digits.pct = 0)

labels(sum.gapminder2) <- c(continent = "Continent", gdpPercap = "GDP Per Capita", lifeExp = "Life Expectancy", pop = "Population")
summary(sum.gapminder2, title = "Summary Stats for Gapminder Data")
Table 19.2: Summary Stats for Gapminder Data
Overall (N=1704)
Continent
   Africa 624 (37%)
   Americas 300 (18%)
   Asia 396 (23%)
   Europe 360 (21%)
   Oceania 24 (1%)
GDP Per Capita
   Mean (SD) 7215 (9857)
   Range 241 - 113523
Life Expectancy
   Mean (SD) 59 (13)
   Range 24 - 83
Population
   Mean (SD) 29601212 (106157897)
   Range 60011 - 1318683096

Exercise 5: Replicate the code shown above to create a second summary table for the States data with no decimals. Note that you can simply copy-and-paste the labels code.

19.5.3.2 Reporting median and IQR

Instead of the mean and standard deviation, we may want to report the median, first quartile, and third quartile for our numerical variables (the IQR is the difference between the third and first quartiles). We can control the descriptive measures through the numeric.stats argument in the following code.

sum.gapminder3 <- tableby(~ continent + gdpPercap + lifeExp + pop, data = gapminder, digits = 0, digits.pct = 0, numeric.stats = c("median", "q1q3", "range"))

labels(sum.gapminder3) <- c(continent = "Continent", gdpPercap = "GDP Per Capita", lifeExp = "Life Expectancy", pop = "Population")
summary(sum.gapminder3, title = "Summary Stats for Gapminder Data")
Table 19.3: Summary Stats for Gapminder Data
Overall (N=1704)
Continent
   Africa 624 (37%)
   Americas 300 (18%)
   Asia 396 (23%)
   Europe 360 (21%)
   Oceania 24 (1%)
GDP Per Capita
   Median 3532
   Q1, Q3 1202, 9325
   Range 241 - 113523
Life Expectancy
   Median 61
   Q1, Q3 48, 71
   Range 24 - 83
Population
   Median 7023596
   Q1, Q3 2793664, 19585222
   Range 60011 - 1318683096
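
The table reports the two quartiles rather than a single IQR value. As a small check of how they relate, the sketch below computes the IQR of GDP per capita by hand from the quartiles and compares it to base R's IQR function. It assumes the gapminder package is installed and uses base R's default quantile method.

```r
library(gapminder)  # assumed to be installed

gdp <- gapminder$gdpPercap

# First and third quartiles (25th and 75th percentiles)
quartiles <- quantile(gdp, probs = c(0.25, 0.75))

# The IQR is the third quartile minus the first quartile
iqr_by_hand <- unname(quartiles[2] - quartiles[1])

# Compare with the built-in function
all.equal(iqr_by_hand, IQR(gdp))
```

Reporting Q1 and Q3 gives the reader slightly more information than the IQR alone, since the IQR can be recovered from the quartiles but not the reverse.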

Exercise 6: Replicate the code shown above to create a summary table for the States data that reports median and the first and third quartiles.

19.5.3.3 Across groups

Finally, instead of reporting summary statistics for the entire sample, we may want to report them separately for each level of a categorical variable. This is a common way to make comparisons.

We can have Arsenal report across groups by adding the categorical variable to the left side of the formula in the tableby code. The code below reports the gapminder data across continents. Note that the tilde ~ is used to separate grouping variables on the left side from the variables we wish to summarize on the right side.

By default, Arsenal tests for differences across groups and reports a p-value. This is not a common part of a summary table (at least in the fields with which I am familiar), so I turn this feature off with test = FALSE in the code below.

sum.gapminder4 <- tableby(continent ~ gdpPercap + lifeExp + pop, data = gapminder, digits = 0, digits.pct = 0, test = FALSE)

labels(sum.gapminder4) <- c(continent = "Continent", gdpPercap = "GDP Per Capita", lifeExp = "Life Expectancy", pop = "Population")
summary(sum.gapminder4, title = "Summary Stats for Gapminder Data")
Table 19.4: Summary Stats for Gapminder Data
              Africa (N=624)        Americas (N=300)      Asia (N=396)          Europe (N=360)        Oceania (N=24)        Total (N=1704)
GDP Per Capita
   Mean (SD)  2194 (2828)           7136 (6397)           7902 (14045)          14469 (9355)          18622 (6359)          7215 (9857)
   Range      241 - 21951           1202 - 42952          331 - 113523          974 - 49357           10040 - 34435         241 - 113523
Life Expectancy
   Mean (SD)  49 (9)                65 (9)                60 (12)               72 (5)                74 (4)                59 (13)
   Range      24 - 76               38 - 81               29 - 83               44 - 82               69 - 81               24 - 83
Population
   Mean (SD)  9916003 (15490923)    24504795 (50979430)   77038722 (206885205)  17169765 (20519438)   8874672 (6506342)     29601212 (106157897)
   Range      60011 - 135031164     662850 - 301139947    120447 - 1318683096   147962 - 82400996     1994794 - 20434176    60011 - 1318683096
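
By this point the same options (digits, digits.pct, test, numeric.stats) are being retyped in every tableby call. A possible shortcut, sketched here under the assumption that the gapminder package is installed, is to store the options once with Arsenal's tableby.control function and pass the result through the control argument. The object names my_controls and by_continent are illustrative.

```r
library(arsenal)
library(gapminder)  # assumed to be installed

# Store the preferred options once
my_controls <- tableby.control(
  test = FALSE,                                  # no p-value column
  numeric.stats = c("median", "q1q3", "range"),  # measures for numeric variables
  digits = 0,                                    # decimals for numeric measures
  digits.pct = 0                                 # decimals for percentages
)

# Reuse the stored options in any number of tables
by_continent <- tableby(continent ~ gdpPercap + lifeExp,
                        data = gapminder, control = my_controls)

labels(by_continent) <- c(gdpPercap = "GDP Per Capita",
                          lifeExp = "Life Expectancy")

# text = TRUE gives a console preview; drop it when knitting with results='asis'
summary(by_continent, title = "Gapminder Data by Continent", text = TRUE)
```

Keeping the options in one object means a later change, such as switching back to the mean and standard deviation, only has to be made in one place.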

Exercise 7: Replicate the code shown above to create a summary table for the States data that reports across regions.

19.5.4 Export to CSV

Knitting your notebook to HTML, Word, or PDF should produce a summary table in the appropriate format. Depending on our or others’ workflow, we may want to export our summary table to CSV in order to open in Excel or other spreadsheet software. Arsenal can easily handle this.

To export my gapminder summary to CSV, I need to create a new object that contains the actual summary table. Below, I save the last summary to the object named sum.table.

sum.table <- summary(sum.gapminder4, title = "Summary Stats for Gapminder Data")

Next, I need to convert this table into a data frame using the as.data.frame() function like so.

sum.table <- as.data.frame(sum.table)

Lastly, I just need to save this data frame as a CSV file using the write.csv() function like so.

write.csv(sum.table, file = "sumtable.csv")

If I am working within an RStudio project, R will save the CSV file to my project folder. Otherwise, R will save the file to my current working directory.
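
Two details can make the exported file cleaner. The sketch below is one possible variation, assuming the gapminder package is installed. It requests plain-text labels with text = TRUE, on the assumption (taken from the Arsenal documentation) that the default summary carries markdown formatting meant for knitting, and it uses row.names = FALSE so that write.csv does not add a column of row numbers. The file is written to a temporary folder purely for illustration.

```r
library(arsenal)
library(gapminder)  # assumed to be installed

# Build a small grouped table
by_continent <- tableby(continent ~ lifeExp, data = gapminder,
                        digits = 0, test = FALSE)

# text = TRUE requests plain-text labels rather than markdown markup
# (assumption based on the arsenal documentation; verify for your version)
table_df <- as.data.frame(summary(by_continent, text = TRUE))

# Write to a temporary folder here; swap in a real path for actual use.
# row.names = FALSE omits the column of row numbers.
csv_path <- file.path(tempdir(), "sumtable.csv")
write.csv(table_df, file = csv_path, row.names = FALSE)

# Read the file back to confirm it has the same number of rows
check <- read.csv(csv_path)
nrow(check) == nrow(table_df)
```

Opening the resulting file in spreadsheet software is the quickest way to confirm that the labels look the way you expect.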

19.6 Correlation Coefficient

As mentioned in Chapter 4, the correlation coefficient quantifies the direction and strength of association between two numerical variables. It is rare to see correlations used in any table. Instead, correlations are typically used as an exploratory tool to inform a more advanced analysis like regression. Nevertheless, we may want to report a specific correlation coefficient in our prose.

To calculate the correlation coefficient between two variables, we can use the cor() function like so:

cor(gapminder$gdpPercap, gapminder$lifeExp)
[1] 0.5837062

where I include two variables from the gapminder dataset.

To calculate the correlation coefficients between all pairs of numerical variables in a dataset, we can simply pass the dataset to cor without specifying any variables. Note that I must first remove the variables that are not numeric.

gapminder %>% 
  select(-country, -continent) %>% 
  cor()
                year    lifeExp         pop   gdpPercap
year      1.00000000 0.43561122  0.08230808  0.22731807
lifeExp   0.43561122 1.00000000  0.06495537  0.58370622
pop       0.08230808 0.06495537  1.00000000 -0.02559958
gdpPercap 0.22731807 0.58370622 -0.02559958  1.00000000
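
Removing non-numeric variables by name works for a small dataset but becomes tedious for a large one. A minimal sketch of an alternative, assuming the gapminder package is installed and a version of dplyr recent enough to support where() inside select, keeps every numeric column automatically and rounds the result for readability.

```r
library(dplyr)      # part of the tidyverse
library(gapminder)  # assumed to be installed

cor_matrix <- gapminder %>%
  select(where(is.numeric)) %>%   # keep only the numeric columns
  cor(use = "complete.obs") %>%   # ignore rows with missing values, if any
  round(2)                        # two decimals are plenty for exploration

cor_matrix
```

The use = "complete.obs" argument makes no difference for gapminder, which has no missing values, but without it a single missing value in a column would turn that column's correlations into NA.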

Exercise 8: Calculate the correlation coefficients between all the numerical variables in States. Which two variables have the strongest correlation? What is the direction?

19.7 Save and Upload

Knit your Rmd to save it and check for errors. If you are satisfied with your work, upload to eLC. Once you upload, answers will become available for download.