2 Data

Nothing exists except atoms and empty space; everything else is opinion.

—Democritus


Data are like the atoms of knowledge. We cannot effectively convert this raw material of knowledge into a useful product without first understanding the raw material. Therefore, learning statistics naturally begins with learning the types and structures of data.

2.1 Learning objectives

  • Understand the organization of rectangular data
  • Identify the unit of analysis within a dataset
  • Identify and distinguish types of variables
  • Identify and distinguish types of dataset structures

2.2 Rectangular Data

Table 2.1: Generic rectangular data
Variable_1                 Variable_2   Variable_3
Datum (Unit of Analysis)   Datum        Datum
Datum (Unit of Analysis)   Datum        Datum
Datum (Unit of Analysis)   Datum        Datum


Rectangular data is organized in rows and columns. A rectangular dataset has three components. Not every dataset will match the description below, because datasets are not always organized correctly.

  • Unit of analysis: The entity or subject a row of data refers to. The unit of analysis uniquely identifies each row of a dataset. If we have a dataset of 50 states and some variables measured in 2020, then our unit of analysis is states. If you were told a specific state, then you could find the row in the dataset. If we have 50 states measured in 2019 and 2020, then the unit of analysis is state-year because you will need to know the state and year to find a specific row.
  • Variable: A measured characteristic of the unit of analysis. State unemployment rate is a variable for a state unit of analysis. In this example, state name is also a variable: it is the column that identifies the unit of analysis.
  • Datum: The intersection of a variable (column) and an observation (row) resulting in a cell. The datum is a particular piece of information. A cell could contain something like 4.8 as the unemployment rate for Georgia in 2020.
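The three components can be made concrete with a minimal Python sketch. It assumes a dataset stored as a list of dictionaries, one per row. The helper `identifies_rows` and the column names are hypothetical, and every value except the 4.8 for Georgia in 2020 is a made-up placeholder rather than a real statistic.

```python
def identifies_rows(rows, key_columns):
    """Return True if the given columns uniquely identify every row."""
    keys = [tuple(row[col] for col in key_columns) for row in rows]
    return len(keys) == len(set(keys))


# Each dictionary is one row; each key is a variable (column).
# Placeholder values, except Georgia 2020, which comes from the text.
rows = [
    {"state": "Georgia", "year": 2019, "unemployment_rate": 3.5},
    {"state": "Georgia", "year": 2020, "unemployment_rate": 4.8},
    {"state": "Ohio", "year": 2019, "unemployment_rate": 4.1},
    {"state": "Ohio", "year": 2020, "unemployment_rate": 8.2},
]

# With two years of data, state alone no longer identifies a row,
# but the state-year pair does.
state_only = identifies_rows(rows, ["state"])
state_year = identifies_rows(rows, ["state", "year"])

# A datum is a single cell: one row intersected with one column.
georgia_2020 = next(
    r for r in rows if r["state"] == "Georgia" and r["year"] == 2020
)
datum = georgia_2020["unemployment_rate"]
```

Checking which columns uniquely identify rows is a quick way to confirm the unit of analysis of an unfamiliar dataset.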

2.3 Types of variables

The variables in a given dataset can be of several types. Learning the types of variables is important because a variable’s type has consequences for data applications, such as description, visualization, and inference. Different types of variables offer different levels of information.

For example, suppose you ask two people to report their annual income. What options do they have for answers? If they can report virtually any value, then you know each person’s income precisely and can compute the exact difference between the two incomes. What if their only choices are more than or less than $50,000? Then you have only a coarse understanding of how much they earn. If they give different answers, you can conclude that one makes more than the other but not by how much. If they give the same answer, the two are grouped together even though it is highly unlikely they earn equal incomes. This makes a serious difference for statistical analysis.
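The loss of information in the income example can be illustrated with a short Python sketch. The two incomes are hypothetical, and the function `coarse_income` is an illustrative name, not part of any library.

```python
THRESHOLD = 50_000  # the cutoff used in the example above


def coarse_income(income):
    """Collapse an exact income into one of two categories."""
    return "more than $50,000" if income > THRESHOLD else "$50,000 or less"


# Hypothetical incomes for two respondents
income_a = 52_000
income_b = 180_000

# With exact values we can compute a precise difference.
exact_difference = income_b - income_a

# With the coarse version, both respondents fall in the same category,
# so the large gap between them is no longer visible.
same_category = coarse_income(income_a) == coarse_income(income_b)
```

Once the exact values are discarded, no analysis can recover the difference between the two respondents.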


Figure 2.1: Variable Types


All variables belong to one of two broad types: qualitative (categorical) or quantitative (numeric).

  • Qualitative variables take on values that have no intrinsic numerical meaning. They are naturally expressed in words.
  • Quantitative variables take on values that do have intrinsic numerical meaning. They are naturally expressed using numbers.

2.3.1 Qualitative variables

Qualitative variables can be further differentiated into two types: nominal and ordinal.

  • Nominal variables take on values that differ in name only. There is no way to rank a value as more or less than another value.
  • Ordinal variables take on values that can be ranked relative to each other, but the difference between ranks has no numerical meaning.

The values that categorical variables take on are commonly referred to as levels. Categorical variables can contain virtually any number of levels, though the number of levels is usually limited.

If our unit of analysis is individuals, a variable such as assigned sex contains two levels: male and female. The variable sex is nominal because its values have no numerical meaning and the two levels have no ranking. Race, state, country, continent, political party, and any variable coded as yes/no such as unemployed, married, or participated in some program are all examples of nominal variables.

If you have ever participated in a customer satisfaction survey, then you have contributed data to an ordinal variable. Scales that provide some number of options from “disagree” to “agree” are called Likert scales. Your answer has no intrinsic numerical value, but it can be ranked against the answers of others. One respondent can be said to be more satisfied than another but not by how much. Moreover, one can only trust the results insofar as respondents share the same understanding or frame of reference; the service that satisfied one respondent may not have satisfied another. Other ordinal variables for which there is a commonly understood scale, such as education level (high school, associate, bachelor, post-graduate) and income level (more or less than the federal poverty line), do not have this issue.
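As a rough illustration of what "ranked but not measured" means, the Python sketch below orders the education levels listed above. The integer ranks are an assumption of the sketch and serve only for sorting; the gap between two ranks carries no numerical meaning.

```python
# Levels listed from lowest to highest, as in the text above
EDUCATION_LEVELS = ["high school", "associate", "bachelor", "post-graduate"]

# Position in the list serves as a rank; the ranks are for ordering only.
RANK = {level: position for position, level in enumerate(EDUCATION_LEVELS)}


def more_educated(a, b):
    """Return True if level a ranks above level b."""
    return RANK[a] > RANK[b]


# Hypothetical survey responses
respondents = ["bachelor", "high school", "post-graduate", "associate"]

# Sorting by rank is meaningful for an ordinal variable...
ordered = sorted(respondents, key=lambda level: RANK[level])

# ...but subtracting ranks is not: "bachelor" minus "associate" is not
# a quantity of education, so the sketch deliberately avoids it.
bachelor_above_associate = more_educated("bachelor", "associate")
```

A nominal variable such as state would have no defensible ordering to place in `EDUCATION_LEVELS`, which is the practical difference between the two types.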

Qualitative variables can be aggregated to a higher unit of analysis such that they become quantitative variables. For example, the U.S. Census records race at the individual level (nominal). This information is then used to report the percentage of a state’s population by race, such as Georgia’s population being about 32% Black or African American. What is a categorical variable at the individual level is now a quantitative variable at the state level.
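A minimal Python sketch of this aggregation follows. The six individual records are invented for illustration and are far too few to reproduce real Census percentages; `percent_by_category` is a hypothetical helper.

```python
from collections import Counter

# Individual-level records: (state, race). Race is nominal here.
# Invented for illustration only.
individuals = [
    ("Georgia", "Black or African American"),
    ("Georgia", "White"),
    ("Georgia", "White"),
    ("Georgia", "Asian"),
    ("Ohio", "White"),
    ("Ohio", "Black or African American"),
]


def percent_by_category(records, state):
    """Aggregate individual records to state-level percentages."""
    counts = Counter(category for s, category in records if s == state)
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


# The unit of analysis is now the state, and each percentage is a
# quantitative variable measured on that state.
georgia = percent_by_category(individuals, "Georgia")
```

Note that the unit of analysis changes along with the variable type: the input has one row per person, while the output describes one state.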

2.3.2 Quantitative variables

Quantitative variables can be further differentiated into two types: discrete and continuous.

  • Discrete variables take on countable or indivisible values.
  • Continuous variables take on infinitely divisible values (at least in theory).

Any variable that is a count of persons, places, events, or things is a discrete variable, taking on integer values (e.g. 0, 1, 2, 3,…). The count of homeless people in a city, students in a classroom, hospital beds, or nonprofit volunteers are all discrete variables.

Many discrete variables can be treated as if continuous without serious consequences. All of the examples above could likely be treated as continuous in an analysis.

When should we care that a variable is discrete? Chapter 4 and chapters on inference will discuss how statistics relies heavily on the normal distribution, also referred to as a bell curve. If a discrete variable can take on integer values only, and especially if only a few values are possible, then that variable is unlikely to be normally distributed.

Rare discrete events, such as plane crashes or government defaults, are not normally distributed. Applying basic statistical procedures to such variables may be inappropriate, requiring methods outside the scope of this course.

In many cases, if a variable is numeric, then it is continuous or can be treated as such. Continuous variables can contain values with an infinite number of decimal places. Still, continuous variables are recorded in data with a limited number of decimal places because either we measure phenomena with finite precision or it simply becomes impractical to include so many decimal places. For example, age is usually recorded in discrete years, but we could record continuously to the zeptosecond (a trillionth of a billionth of a second).

Many discrete variables become continuous because we calculate averages, proportions, or rates from them.

The number of students in a classroom is discrete (e.g. 20, 25, etc.), but the average number of students in classrooms (i.e. total students/number of classrooms) is continuous. This is how we can have values such as 22.5 for average class size or a pupil-to-teacher ratio. The number of homeless people in a city is discrete, but the proportion of the city’s population that is homeless (count of homeless/population) is continuous.
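The arithmetic can be sketched in a few lines of Python. All counts below are hypothetical and chosen only to make the calculation easy to follow.

```python
# Discrete counts: students in each of four classrooms (hypothetical)
class_sizes = [20, 25, 22, 23]

# The average is total students / number of classrooms and need not
# be a whole number even though every input is an integer.
average_class_size = sum(class_sizes) / len(class_sizes)

# Discrete count of homeless people and total population (hypothetical)
homeless_count = 1_200
population = 480_000

# The proportion can take any value between 0 and 1.
proportion_homeless = homeless_count / population
```

Here the average works out to 22.5 students per classroom, a value that no single classroom could have.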

2.3.3 Index variables

Index variables can be ordinal or (usually) numerical but warrant separate discussion.

An index variable is a composite measure of multiple variables.

They can be used to make a continuous variable out of multiple categorical variables or simplify multiple quantitative variables into one quantitative measure. Purposes such as ranking colleges, measuring multidimensional poverty (i.e. factors beyond income), and determining political ideology make use of index variables.

Index variables mask underlying information. This can be helpful or harmful. In either case, it is important to consider how an index variable is constructed. Doing so can offer insight or uncover problems. An instructive example familiar to readers is college rankings. U.S. News and World Report describes how rankings are determined.

What makes a college good? According to these rankings, 5% of what makes a college good is the percentage of undergraduate alumni who give a donation, used as a proxy for student satisfaction. Another 20% is based on the opinions of administrators at peer institutions. Are these choices wise? This is difficult to say and beside the point. The point is that index variables involve choices made by people and are not data that are observed directly. They are synthetic materials of knowledge and warrant careful consideration.
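To see how an index can mask underlying information, consider a simplified Python sketch of a weighted composite. Only the 5% and 20% weights come from the description above. The remaining 75% is lumped into a hypothetical `other_factors` component, the 0 to 100 component scores are invented, and the weighted sum is an assumed form, not the publisher's actual method.

```python
# Weights: the first two are from the text; "other_factors" is a
# hypothetical stand-in for everything else in the ranking.
WEIGHTS = {
    "alumni_giving": 0.05,
    "peer_assessment": 0.20,
    "other_factors": 0.75,
}


def index_score(components, weights=WEIGHTS):
    """Combine component scores into one weighted composite."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * components[name] for name in weights)


# Two invented colleges with very different component scores
college_a = {"alumni_giving": 90, "peer_assessment": 40, "other_factors": 70}
college_b = {"alumni_giving": 10, "peer_assessment": 60, "other_factors": 70}

score_a = index_score(college_a)
score_b = index_score(college_b)
```

Both colleges come out at roughly 65 despite differing sharply on alumni giving and peer assessment. Changing the weights would separate them, which is why the choices behind an index deserve scrutiny.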

2.4 Dataset structures

Just as the type of variable one is dealing with impacts the kinds of visualizations or analyses one should use, so too does the structure of a dataset. Datasets come in three varieties depending on their unit of analysis.

  • Cross-sectional
    • Pooled cross-sectional
  • Time series
  • Panel or longitudinal

Cross-sectional data is a snapshot in time of a sample of units. One column serves as the identifier of the unit of analysis, such as the name or ID number of the unit. Notice in Table 2.2 that all one needs to know is the country in order to identify a specific row.

Table 2.2: Cross-section example
country continent year lifeExp pop gdpPercap
Argentina Americas 2007 75.320 40301927 12779.380
Bolivia Americas 2007 65.554 9119152 3822.137
Brazil Americas 2007 72.390 190010647 9065.801


Pooled cross-sectional data could be considered a fourth structure but is simply multiple cross-sections stacked atop each other. The critical quality of pooled cross-sectional data is that each cross-section contains different units measured at different times, not the same units measured at different times. Notice in Table 2.3 that the countries included from 2002 are not the same as those included from 2007.

Table 2.3: Pooled cross-section example
country continent year lifeExp pop gdpPercap
Algeria Africa 2002 70.994 31287142 5288.040
Angola Africa 2002 41.003 10866106 2773.287
Benin Africa 2002 54.406 7026113 1372.878
Botswana Africa 2002 46.634 1630347 11003.605
Argentina Americas 2007 75.320 40301927 12779.380
Bolivia Americas 2007 65.554 9119152 3822.137
Brazil Americas 2007 72.390 190010647 9065.801


Time series data measures one unit over multiple time periods. The unit of analysis in time series data is time, as it uniquely identifies each row. Notice in Table 2.4 that one country is tracked over multiple years.

Table 2.4: Time series example
country continent year lifeExp pop gdpPercap
Argentina Americas 1977 68.481 26983828 10079.027
Argentina Americas 1982 69.942 29341374 8997.897
Argentina Americas 1987 70.774 31620918 9139.671
Argentina Americas 1992 71.868 33958947 9308.419
Argentina Americas 1997 73.275 36203463 10967.282
Argentina Americas 2002 74.340 38331121 8797.641
Argentina Americas 2007 75.320 40301927 12779.380


Panel (or longitudinal) data measures the same units over multiple time periods. The unit of analysis is the pair of unit and time period. Notice in Table 2.5 that in order to identify a specific row, you would need to know the country and year. One could also think of panel data as numerous time series.

Table 2.5: Panel example
country continent year lifeExp pop gdpPercap
Argentina Americas 1997 73.275 36203463 10967.282
Argentina Americas 2002 74.340 38331121 8797.641
Argentina Americas 2007 75.320 40301927 12779.380
Bolivia Americas 1997 62.050 7693188 3326.143
Bolivia Americas 2002 63.883 8445134 3413.263
Bolivia Americas 2007 65.554 9119152 3822.137
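The distinctions among the four tables can be summarized in a rough Python sketch that inspects only the unit and time columns. The function `dataset_structure` is hypothetical and simplifies matters: it assumes that any unit appearing in more than one period signals a panel, whereas a real pooled cross-section could resample the same unit by chance.

```python
from collections import defaultdict


def dataset_structure(rows, unit="country", time="year"):
    """Guess a dataset's structure from its unit and time columns."""
    if not rows:
        raise ValueError("dataset has no rows")
    periods_by_unit = defaultdict(set)
    for row in rows:
        periods_by_unit[row[unit]].add(row[time])
    all_periods = set().union(*periods_by_unit.values())
    if len(all_periods) == 1:
        return "cross-sectional"         # one snapshot in time
    if len(periods_by_unit) == 1:
        return "time series"             # one unit over many periods
    if all(len(p) == 1 for p in periods_by_unit.values()):
        return "pooled cross-sectional"  # different units each period
    return "panel"                       # same units over many periods


def make_rows(pairs):
    """Build rows holding only the identifying columns."""
    return [{"country": c, "year": y} for c, y in pairs]


# Country-year pairs taken from Tables 2.2 through 2.5
table_2_2 = make_rows([("Argentina", 2007), ("Bolivia", 2007), ("Brazil", 2007)])
table_2_3 = make_rows([("Algeria", 2002), ("Angola", 2002), ("Benin", 2002),
                       ("Botswana", 2002), ("Argentina", 2007),
                       ("Bolivia", 2007), ("Brazil", 2007)])
table_2_4 = make_rows([("Argentina", y) for y in range(1977, 2008, 5)])
table_2_5 = make_rows([(c, y) for c in ("Argentina", "Bolivia")
                       for y in (1997, 2002, 2007)])

structures = [dataset_structure(t)
              for t in (table_2_2, table_2_3, table_2_4, table_2_5)]
```

The sketch never looks at lifeExp, pop, or gdpPercap: the structure of a dataset is determined by what identifies its rows, not by the variables measured.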


To learn how to examine data in R, proceed to Chapter 17.

2.5 Key terms and concepts

  • Unit of analysis
  • Variable
  • Types of variables: qualitative, quantitative, nominal, ordinal, discrete, continuous, index
  • Data structures: cross-sectional, pooled cross-sectional, time series, panel