19 R Description
19.1 Learning Objectives
In this chapter, you will learn how to:
- Calculate descriptive statistics individually
- Automate a professional-quality table of descriptive statistics
19.2 Set-up
To complete this chapter, you need to
- Start a R Markdown document
- Change the YAML to the following:
---
title: 'R Chapter 19'
author: 'Your name'
output:
html_document:
theme: spacelab
df_print: paged
---
- Load the following packages
19.3 Introduction
Summary statistics tables are ubiquitous in reports and studies. Usually a dataset involves numerous variables that would require too many visualizations, though we should still consider visualizations for the most important variables. A standard table of summary stats provides readers a reference for key measures pertaining to all our variables in a fairly compact form.
In this chapter, we set out to summarize variables within the States
dataset of the carData
package.
Exercise 1: Use the glimpse
function to
examine the States
dataset.
States
is a cross-section of the 50 states and D.C. containing education and related statistics. Be sure to skim the documentation for States
to understand each variable. You can do that by clicking on the carData
package under the Packages tab then clicking on the States
link.
19.4 Quick Stat Computation
Instead of producing a full table of descriptive statistics, sometimes we just want to know one or two descriptive measures of a single variable.
Because R can hold many datasets/objects at once, we need to tell it which dataset/object to apply a given function. We have had to do this many times already. Similarly, if we want R to apply a function to a specific variable within a dataset, we need to tell which variable in which dataset. This is done using the
$
operator.
Below are all of the useful descriptive measures of center and spread applied to the pay
(i.e. average teacher’s salary in 1,000s) variable in the States
dataset. Note that the $
operator tells R to apply the function to a specific variable within a dataset.
[1] 30.94118
[1] 30
[1] 5.308151
[1] 6
[1] 22 43
Exercise 2: Calculate the average and standard deviation for state spending on public education in $1000s per student.
19.5 Summary Table
Summary tables come in many styles, so there is no way to cover everything. In most cases, a summary table includes the following descriptive measures depending on the type of variable:
- Numerical variables
- Mean
- Standard deviation
- Minimum
- Maximum
- Categorical variables
- Counts for each level, and/or
- Percentages for each level
If a variable is skewed, then it may be wise to replace the mean and standard deviation with the median and IQR. We will learn how to do this.
19.5.1 Quick Table
Sometimes we do not want to print a fancy table. Rather, we may want to quickly see a full set of descriptive statistics for ourselves that our reader will never see. This can be done using the summary
function on our dataset like so:
country continent year lifeExp
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
Algeria : 12 Asia :396 Median :1980 Median :60.71
Angola : 12 Europe :360 Mean :1980 Mean :59.47
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
Australia : 12 Max. :2007 Max. :82.60
(Other) :1632
pop gdpPercap
Min. : 60011 Min. : 241.2
1st Qu.: 2793664 1st Qu.: 1202.1
Median : 7023596 Median : 3531.8
Mean : 29601212 Mean : 7215.3
3rd Qu.: 19585222 3rd Qu.: 9325.5
Max. :1318683096 Max. :113523.1
We would almost certainly want to suppress this code and output (i.e. include=FALSE
code chunk option) if preparing a report for an external audience. Note that for the categorical variables, country and continent, summary
provides the count of observations instead of measures of center or spread.
Exercise 3: Generate a quick table of descriptive
statistics for all of the variables in States
. Suppress the
code and output.
19.5.2 Using Arsenal
Due to the many styles of summary tables, there are numerous R packages designed to produce summary tables. The best R package in terms of quickly getting the information to a nicely formatted table of which I am aware is Arsenal. Therefore, we will learn how to use Arsenal. I will demonstrate Arsenal using the gapminder
dataset. Then, I will ask you to replicate those demonstrations using the States
data.
Producing a summary table with Arsenal involves at least two, probably three, steps.
- Create a new object containing the summary statistics we want to include in a table
- Relabel the variables to something appropriate for our audience
- Generating the summary table based on the new object we just created
Here is an example following the steps above using gapminder
data without altering any of Aresenal’s default options that we will want to know how to alter in some cases.
The above code is what actually creates the table I want to export. First, I name the object whatever I want. Then I use the tableby
function. We will learn what the tilde, ~
, does later. For now, know that it is required. Then, I list the variable I want included in the table, separating each with a plus sign. Lastly, I tell R which dataset to apply this function.
labels(sum.gapminder) <- c(continent = "Continent", gdpPercap = "GDP Per Capita", lifeExp = "Life Expectancy", pop = "Population")
Most datasets do not use variable names that would be appropriate for an external audience. The names in gapminder
are not bad; most readers could make sense of what the names imply about the data, but it is simple enough (though tedious) to provide a more polished look.
Therefore, in the above code I use the labels
function on the sum.gapminder
table I just created, then assign each variable I told R to include in the table a label that will replace the name when it prints.
Overall (N=1704) | |
---|---|
Continent | |
Africa | 624 (36.6%) |
Americas | 300 (17.6%) |
Asia | 396 (23.2%) |
Europe | 360 (21.1%) |
Oceania | 24 (1.4%) |
GDP Per Capita | |
Mean (SD) | 7215.327 (9857.455) |
Range | 241.166 - 113523.133 |
Life Expectancy | |
Mean (SD) | 59.474 (12.917) |
Range | 23.599 - 82.603 |
Population | |
Mean (SD) | 29601212.325 (106157896.744) |
Range | 60011.000 - 1318683096.000 |
This last line of code actually prints the summary table when I knit my document. The previous two steps (tableby
and labels
) can be included in the same code chunk, but this third step needs to have its own code chunk because
you need to include a specific code chunk option,
results='asis'
, in order for the table to export properly. To be clear, in the top line of a code chunk that contains{r}
by default, you need to change it to{r, results='asis', echo=FALSE}
. I also include theecho=FALSE
option assuming we do not want our reader to see our code.
Exercise 4: Replicate the code shown above to create
a summary table for the States
data using the Arsenal
package. Be sure to relabel the variables to something relatively
understandable and brief. Labeling is tedious but you only need to do it
once. I suggest you knit your document now to see what you just
made.
In three relatively short bits of code, we already have a decent summary table that would have taken excruciatingly long to input manually. But it can be made better.
19.5.3 Adjustments to Arsenal
19.5.3.1 Decimal digits
The biggest aesthetic issue with my table is that it includes so many decimals. None of these variables have such a small range that rounding to integers masks useful information. Obviously, if a variable only ranges between 0 and 1, we would not want to round to an integer.
Specifying the number of decimals is quite easy with Arsenal. Because arsenal tries to be as flexible as possible, we have to specify the number of decimals separately for numerical and percentage measures. The following code sets the number of decimals to zero for the gapminder data.
sum.gapminder2 <- tableby(~ continent + gdpPercap + lifeExp + pop, data = gapminder, digits = 0, digits.pct = 0)
labels(sum.gapminder2) <- c(continent = "Continent", gdpPercap = "GDP Per Capita", lifeExp = "Life Expectancy", pop = "Population")
Overall (N=1704) | |
---|---|
Continent | |
Africa | 624 (37%) |
Americas | 300 (18%) |
Asia | 396 (23%) |
Europe | 360 (21%) |
Oceania | 24 (1%) |
GDP Per Capita | |
Mean (SD) | 7215 (9857) |
Range | 241 - 113523 |
Life Expectancy | |
Mean (SD) | 59 (13) |
Range | 24 - 83 |
Population | |
Mean (SD) | 29601212 (106157897) |
Range | 60011 - 1318683096 |
Exercise 5: Replicate the code shown above to create
a second summary table for the States
data with no
decimals. Note that you can simply copy-and-paste the labels code.
19.5.3.2 Reporting median and IQR
Instead of the mean and standard deviation, we may want to report the median, first quartile, and third quartile for our numerical variables. We can control the descriptive measures using the following code.
sum.gapminder3 <- tableby(~ continent + gdpPercap + lifeExp + pop, data = gapminder, digits = 0, digits.pct = 0, numeric.stats = c("median", "q1q3", "range"))
labels(sum.gapminder3) <- c(continent = "Continent", gdpPercap = "GDP Per Capita", lifeExp = "Life Expectancy", pop = "Population")
Overall (N=1704) | |
---|---|
Continent | |
Africa | 624 (37%) |
Americas | 300 (18%) |
Asia | 396 (23%) |
Europe | 360 (21%) |
Oceania | 24 (1%) |
GDP Per Capita | |
Median | 3532 |
Q1, Q3 | 1202, 9325 |
Range | 241 - 113523 |
Life Expectancy | |
Median | 61 |
Q1, Q3 | 48, 71 |
Range | 24 - 83 |
Population | |
Median | 7023596 |
Q1, Q3 | 2793664, 19585222 |
Range | 60011 - 1318683096 |
Exercise 6: Replicate the code shown above to create
a summary table for the States
data that reports median and
the first and third quartiles.
19.5.3.3 Across groups
Finally, instead of reporting summary statistics for the entire sample, we may want to report them separately for each level of a categorical variable. This is a common way to make comparisons.
We can have Arsenal report across groups by adding the categorical variable to the left side of the formula in the tableby
code. The code below reports the gapminder
data across continents. Note that the tilde ~
is used to separate grouping variables on the left side from the variables we wish to summarize on the right side.
By default, Arsenal tests for correlations across groups and reports a p-value. This is not a common part of a summary table (at least for fields in which I am familiar), so I turn this feature off with the test = FALSE
within the code below.
sum.gapminder4 <- tableby(continent ~ gdpPercap + lifeExp + pop, data = gapminder, digits = 0, digits.pct = 0, test = FALSE)
labels(sum.gapminder4) <- c(continent = "Continent", gdpPercap = "GDP Per Capita", lifeExp = "Life Expectancy", pop = "Population")
Africa (N=624) | Americas (N=300) | Asia (N=396) | Europe (N=360) | Oceania (N=24) | Total (N=1704) | |
---|---|---|---|---|---|---|
GDP Per Capita | ||||||
Mean (SD) | 2194 (2828) | 7136 (6397) | 7902 (14045) | 14469 (9355) | 18622 (6359) | 7215 (9857) |
Range | 241 - 21951 | 1202 - 42952 | 331 - 113523 | 974 - 49357 | 10040 - 34435 | 241 - 113523 |
Life Expectancy | ||||||
Mean (SD) | 49 (9) | 65 (9) | 60 (12) | 72 (5) | 74 (4) | 59 (13) |
Range | 24 - 76 | 38 - 81 | 29 - 83 | 44 - 82 | 69 - 81 | 24 - 83 |
Population | ||||||
Mean (SD) | 9916003 (15490923) | 24504795 (50979430) | 77038722 (206885205) | 17169765 (20519438) | 8874672 (6506342) | 29601212 (106157897) |
Range | 60011 - 135031164 | 662850 - 301139947 | 120447 - 1318683096 | 147962 - 82400996 | 1994794 - 20434176 | 60011 - 1318683096 |
Exercise 7: Replicate the code shown above to create
a summary table for the States
data that reports across
regions.
19.5.4 Export to CSV
Knitting your notebook to HTML, Word, or PDF should produce a summary table in the appropriate format. Depending on our or others’ workflow, we may want to export our summary table to CSV in order to open in Excel or other spreadsheet software. Arsenal can easily handle this.
To export my gapminder
summary to CSV, I need to create a new object that contains the actual summary table. Below, I save the last summary to the object named sum.table
.
Next, I need to convert this table into a data frame using the as.data.frame()
function like so.
Lastly, I just need to save this data frame as a CSV file using the write.csv()
function like so.
R will save the CSV file to my project folder. Otherwise, R will save the file to my current working directory.
19.6 Correlation Coefficient
As mentioned in Chapter 4, the correlation coefficient quantifies the direction and strength of association between two numerical variables. It is rare to see correlations used in any table. Instead, correlations are typically used as an exploratory tool to inform a more advanced analysis like regression. Nevertheless, we may want to report a specific correlation coefficient in our prose.
To calculate the correlation coefficient between two variables, we can use the cor()
function like so:
[1] 0.5837062
where I include two variables from the gapminder
dataset.
To calculate correlation coefficient between all numerical variables in a dataset, we can simply include the dataset in cor
without specifying any variable. Note that I must first remove the variables that are not numeric.
year lifeExp pop gdpPercap
year 1.00000000 0.43561122 0.08230808 0.22731807
lifeExp 0.43561122 1.00000000 0.06495537 0.58370622
pop 0.08230808 0.06495537 1.00000000 -0.02559958
gdpPercap 0.22731807 0.58370622 -0.02559958 1.00000000
Exercise 8: Calculate the correlation coefficients
between all the variables in States
. Which two variables
have the stongest correlation? What is the direction?