PADP 7120 Data Applications in PA

class: center, middle, inverse, title-slide

.title[
# PADP 7120 Data Applications in PA
]
.subtitle[
## Data Description
]
.author[
### Alex Combs
]
.institute[
### UGA | SPIA | PADP
]
.date[
### Last updated: January 29, 2025
]

---

# Outline

- Distinguish between descriptive and inferential statistics

- Identify population, sample, parameter, and statistic

- Distributions and their description

---
# Two categories of statistics

.center[
![](lectures_files/stats_types.png)
]

---
# Population vs. Sample

- **Population:** all members of a specified group pertaining to a research question

- A population can be any size based on our research question
  
  - If we can observe the population, all we need to do is describe it to reach useful conclusions

- **Sample:** a subset of that population

- We can describe a sample or a population

- If we cannot observe the population, we take a sample and use inferential statistics to reach useful conclusions about that population

---
# Descriptive vs. inferential

- Suppose we want to learn more about employment and earning outcomes among MPA graduates through a survey

.pull-left[
**Descriptive Questions**

- What is the average income of respondents?
- What percent of respondents are employed?
- What percent of respondents earn more than they did prior to MPA?

]

.pull-right[
**Inferential Questions**

- What is the average income for MPA graduates?
- Does an MPA increase employment?
- Does an MPA increase income?
]

---
# Parameter vs. statistic

- **Parameter:** a measure pertaining to a population

- Typically used when population is unobserved
  - The "true" value we try to estimate using a sample

- **Statistic:** a measure pertaining to a sample

- Also referred to as an **estimate**
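
The distinction can be sketched with simulated data. Everything below is hypothetical: the population is generated with `rlnorm()`, and the sizes (5,000 and 100) and the seed are arbitrary choices for illustration, not figures from any real survey.

``` r
set.seed(7120)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: incomes of 5,000 graduates (simulated, not real data)
population_income <- rlnorm(5000, meanlog = 11, sdlog = 0.4)

# Parameter: a measure of the population (usually unobserved in practice)
parameter_mean <- mean(population_income)

# Sample: a random subset of 100 members of the population
sample_income <- sample(population_income, size = 100)

# Statistic (estimate): the same measure computed on the sample
statistic_mean <- mean(sample_income)

c(parameter = parameter_mean, statistic = statistic_mean)
```

The two numbers will usually be close but not identical, which is why a statistic is treated as an estimate of the parameter.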

---
# Back to Survey Example

Suppose our survey receives 100 responses and finds that the median income of UGA MPA alumni is $75,000

- Is this a sample statistic or a population parameter?

---
# Descriptive vs. inferential

Suppose we are told the average salary of college-educated women in Georgia is greater than the national average for college-educated women.

Suppose we set out to confirm the average salary in Georgia for ourselves. We survey a sample of 1,000 college-educated women in Georgia and record their income.

- What are the population and the sample?
- What are the parameter and the statistic/estimate?

---
# Descriptive vs. inferential

Suppose we want to know the percent of Clarke County residents who have a valid ID for an upcoming election.

What is the population?

We ask people entering Baldwin Hall whether they have a valid ID according to Georgia's election laws. We get 100 responses, and 80% say they have a valid ID.

- Is our result of 80% a population parameter or an estimate?

- Why might we have a sample size less than 100?

---
class: inverse, middle, center

# Data description

---
# Distribution

- A distribution shows the (possible) values for a variable and how often they occur.

---
# Describing numerical variables

- Center
  - Mean, median, mode
- Spread
  - Variance, standard deviation, IQR, range
- Association
  - Covariance, correlation
  - Regression coefficient, coefficient of determination (will cover later)

---
# Measures of center

- Mean: the balancing point of the distribution
- Median: the middle of the distribution (50th percentile)
- Mode: the peak(s) of the distribution

---
# Mean (average)

- Add values and divide by the count of values

``` r
(2 + 4 + 6 + 8 + 10)/5
```

```
## [1] 6
```

.pull-left[
<table>
 <thead>
  <tr>
   <th style="text-align:right;"> ID </th>
   <th style="text-align:right;"> variable </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 10 </td>
  </tr>
</tbody>
</table>
]

.pull-right[

``` r
mean(x$variable)
```

```
## [1] 6
```

``` r
mean(ncbirths$weeks)
```

```
## [1] 38.4675
```

]

---
# Median

- Arrange values in order, find the middle value
- If there is an even number of values, there is no single middle value, so average the two middle values

.pull-right[

``` r
median(x$variable)
```

```
## [1] 6
```

![](lectures_files/median.png)

``` r
median(ncbirths$weeks)
```

```
## [1] 39
```

]
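
A minimal sketch of the even-count case, using four hypothetical values that are not part of the lecture data:

``` r
# Hypothetical values with an even count (not from the lecture data)
even_values <- c(10, 2, 8, 4)

# Sort, then average the two middle values by hand
sorted_values <- sort(even_values)          # 2 4 8 10
n <- length(sorted_values)
manual_median <- mean(sorted_values[c(n / 2, n / 2 + 1)])

manual_median          # 6
median(even_values)    # matches the manual calculation
```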

---
# Mode

- The value that occurs most frequently

.pull-right[

- Variable has no mode. One more of any of the 5 values would make that value the mode.

]
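
Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type instead). One possible sketch counts values with `table()`. The vector below is hypothetical: the five values from the earlier table plus one extra 6.

``` r
# Hypothetical values: the five table values plus one more 6
values <- c(2, 4, 6, 8, 10, 6)

# Count how often each distinct value occurs
counts <- table(values)

# Keep the value(s) with the highest count; ties return more than one value
modes <- as.numeric(names(counts)[counts == max(counts)])

modes   # 6
```

If every value occurs exactly once, this sketch returns all of them, which is another way of seeing that the variable has no single mode.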

---
# Mean vs. median vs. mode

- We use measures of center to communicate the *typical* value of a distribution.

- Which measure best conveys what is typical depends on the distribution.

![](lectures_files/center.png)

---
# Mean vs. median vs. mode

- Mean is sensitive to extreme values, median is not

- Median is better to use when a distribution is skewed

- Mode can be used for discrete or categorical variables
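
A small sketch of the first point, using two hypothetical vectors that differ only in their largest value:

``` r
# Hypothetical values; the second vector swaps the largest value for an extreme one
x_values  <- c(2, 4, 6, 8, 10)
x_extreme <- c(2, 4, 6, 8, 100)

mean(x_values)      # 6
mean(x_extreme)     # 24: pulled toward the extreme value
median(x_values)    # 6
median(x_extreme)   # 6: unchanged
```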

---
# Mean vs. median vs. mode

- Which measure of center is best for describing typical weeks pregnant?

.pull-left[
<img src="Description_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]

.pull-right[

``` r
mean(ncbirths$weeks)
```

```
## [1] 38.4675
```

``` r
median(ncbirths$weeks)
```

```
## [1] 39
```
]

---
# Mean vs. median vs. mode

- Which measure of center is best for describing typical GDP per capita?

.pull-left[
<img src="Description_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]

.pull-right[

``` r
mean(gapminder$gdpPercap)
```

```
## [1] 7215.327
```

``` r
median(gapminder$gdpPercap)
```

```
## [1] 3531.847
```
]

---
class: inverse, middle, center

# Measures of spread

---
# Measures of spread

- **Variance:** the average squared deviation from the mean

- **Standard deviation:** the square root of the variance; roughly the typical deviation from the mean

- **Interquartile range:** the difference between 75th and 25th percentiles

- **Range:** the difference between the maximum and minimum values
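
The variance and standard deviation can be worked out by hand for a small hypothetical vector. Note that R's `var()` and `sd()` divide by `n - 1` rather than `n`, so the "average" is taken over one fewer than the number of values.

``` r
# Hypothetical values (not from the lecture data)
x_values <- c(2, 4, 6, 8, 10)

# Deviation of each value from the mean
deviations <- x_values - mean(x_values)     # -4 -2 0 2 4

# Variance: sum of squared deviations divided by n - 1
manual_var <- sum(deviations^2) / (length(x_values) - 1)

# Standard deviation: square root of the variance
manual_sd <- sqrt(manual_var)

manual_var                                  # 10
all.equal(manual_var, var(x_values))        # TRUE
all.equal(manual_sd, sd(x_values))          # TRUE
```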

---
# Standard deviation (SD)

- Variance is hard to interpret as a descriptive measure of spread because it is in squared units

``` r
var(ncbirths$weeks)
```

```
## [1] 7.583423
```

- On average, weeks pregnant deviates from the average by 7.6 squared weeks.

- Standard deviation (the square root of the variance) returns to the original units of the variable

``` r
sd(ncbirths$weeks)
```

```
## [1] 2.753802
```

- On average, weeks pregnant deviates from the average by almost 3 weeks.

---
# Interquartile range (IQR)

- Divide the distribution into 4 equal parts
- The dividing values are the 25th, 50th, and 75th percentiles
- IQR is the difference between the 75th and 25th percentiles

``` r
quantile(ncbirths$weeks, c(.25, .5, .75))
```

```
## 25% 50% 75% 
##  38  39  40
```

- The IQR for weeks pregnant is 2

``` r
IQR(ncbirths$weeks)
```

```
## [1] 2
```

---
# SD vs. IQR

- We use SD and IQR to communicate the typical deviation of a distribution from its center
- SD is based on the mean; sensitive to extreme values
- IQR uses percentiles; not sensitive to extreme values

.pull-right[

``` r
sd(xextreme$variable)
```

```
## [1] 42.54409
```

``` r
IQR(xextreme$variable)
```

```
## [1] 4
```
]
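
A sketch of the comparison, assuming two hypothetical vectors. The contents of `xextreme` are not shown on the slide; the values in `x_extreme` below are an assumption chosen because they are consistent with the `sd()` and `IQR()` output above.

``` r
# Hypothetical values; x_extreme is an assumed stand-in for xextreme$variable
x_values  <- c(2, 4, 6, 8, 10)
x_extreme <- c(2, 4, 6, 8, 100)

# SD is built from deviations around the mean, so one extreme value inflates it
sd(x_values)
sd(x_extreme)

# IQR uses only the 25th and 75th percentiles, so it does not move here
IQR(x_values)     # 4
IQR(x_extreme)    # 4
```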

---
# SD vs. IQR

.pull-left[
![](Description_files/figure-html/unnamed-chunk-23-1.png)
]

.pull-right[

``` r
var(ncbirths$weeks)
```

```
## [1] 7.583423
```

``` r
sd(ncbirths$weeks)
```

```
## [1] 2.753802
```

``` r
IQR(ncbirths$weeks)
```

```
## [1] 2
```
]

---
# SD vs. IQR

.pull-left[
<img src="Description_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

.pull-right[

``` r
var(gapminder$gdpPercap)
```

```
## [1] 97169410
```

``` r
sd(gapminder$gdpPercap)
```

```
## [1] 9857.455
```

``` r
IQR(gapminder$gdpPercap)
```

```
## [1] 8123.402
```
]

---
# Range

.pull-left[
- Describes the greatest extent to which the variable changes

- Or the possible values of the variable

- Or how different the most different observations are
]

.pull-right[

``` r
range(ncbirths$weeks)
```

```
## [1] 20 45
```

``` r
range(gapminder$gdpPercap)
```

```
## [1]    241.1659 113523.1329
```
]
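
Note that `range()` returns the minimum and maximum rather than their difference. A minimal sketch with hypothetical values shows one way to get the single-number range:

``` r
# Hypothetical values (not from the lecture data)
x_values <- c(2, 4, 6, 8, 10)

range(x_values)                   # minimum and maximum: 2 10
range_width <- diff(range(x_values))
range_width                       # maximum minus minimum: 8
```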

---
# Choosing measures

.pull-left[
![](Description_files/figure-html/unnamed-chunk-28-1.png)
]

.pull-right[
<img src="Description_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]

- Which measures of center and spread should we use or not use to describe the above distributions?

---
class: inverse, middle, center

# Measures of Association

---
# Conditional distributions

- Distribution of a variable *given* another variable's values
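
As a rough illustration, the sketch below uses `mtcars`, a data set that ships with base R and stands in for the lecture data here. It splits miles per gallon by number of cylinders and summarizes each conditional distribution.

``` r
# mtcars ships with base R; it is a stand-in, not the lecture data
# Distribution of mpg given the number of cylinders (4, 6, or 8)
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)

# Center and spread of each conditional distribution
sapply(mpg_by_cyl, median)
sapply(mpg_by_cyl, IQR)
```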

---
# Measures of association

Suppose we want to investigate the association between the percent of a state population that is white and the percent of the state population that voted for Donald Trump.

.pull-left[
<table>
 <thead>
  <tr>
   <th style="text-align:right;"> share_white </th>
   <th style="text-align:right;"> share_vote_trump </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 65 </td>
   <td style="text-align:right;"> 63 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 58 </td>
   <td style="text-align:right;"> 53 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 51 </td>
   <td style="text-align:right;"> 50 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 74 </td>
   <td style="text-align:right;"> 60 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 39 </td>
   <td style="text-align:right;"> 33 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 69 </td>
   <td style="text-align:right;"> 44 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
<img src="Description_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />
]

---
# Measures of association

- Or consider the percent of the white population in poverty

- Units with higher values along the x-axis may tend to have higher or lower values along the y-axis, or show no tendency at all.

---
# Measures of association

- The association between two variables can be described in terms of

- **Direction:** when one variable increases, does the other variable *tend* to increase or decrease

- **Strength:** how closely do the variables move together

- **Magnitude:** when one variable increases a certain amount, how much does the other variable increase or decrease

---
# Measures of association

- **Covariance:** measures **direction** of association between two variables

- **Correlation coefficient:** measures **direction** and **strength** of association between two variables

- **Regression coefficient:** measures the **direction** and **magnitude** of association between an explanatory variable and an outcome variable
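
The three measures can be compared on the six rows shown in the `share_white` / `share_vote_trump` table earlier. This is only a sketch on those six rows, so the results will differ from anything computed on the full data; the data frame name `six_states` is made up for illustration.

``` r
# The six rows shown in the earlier table (not the full data set)
six_states <- data.frame(
  share_white      = c(65, 58, 51, 74, 39, 69),
  share_vote_trump = c(63, 53, 50, 60, 33, 44)
)

x <- six_states$share_white
y <- six_states$share_vote_trump

# Covariance: the sign gives direction; the size depends on the units
cov_xy <- cov(x, y)

# Correlation: covariance rescaled by both SDs, so it lies between -1 and 1
cor_xy <- cov_xy / (sd(x) * sd(y))

# Regression coefficient: expected change in y for a one-unit increase in x
fit <- lm(share_vote_trump ~ share_white, data = six_states)
slope <- coef(fit)[["share_white"]]

c(covariance = cov_xy, correlation = cor_xy, slope = slope)
```

All three share the same sign because each is the covariance divided by a positive quantity: `sd(x) * sd(y)` for the correlation and `var(x)` for the slope.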

---
# Correlation coefficient

- Ranges between -1 and 1

- Positive or negative value tells us the direction

- The closer to -1 or 1, the stronger the association in that direction, with 0 indicating no association

- No definitive scale; rule of thumb (in absolute value):
  - 0.8: very strong
  - 0.6: strong
  - 0.4: moderate
  - 0.2: weak

---
# Correlation coefficient
<img src="Description_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />

``` r
cor(state_trump$share_vote_trump, state_trump$share_white_poverty)
```

```
## [1] 0.4872326
```

- What is the interpretation of this correlation coefficient?

---
# Correlation

Percent white population in poverty

``` r
cor(state_trump$share_vote_trump, state_trump$share_white_poverty)
```

```
## [1] 0.4872326
```

Percent population that is white

``` r
cor(state_trump$share_vote_trump, state_trump$share_white)
```

```
## [1] 0.4220675
```

Which correlation is stronger?

---
# Correlation

Unit of analysis is states.
Is the correlation positive or negative?

---
# Correlation

``` r
cor(state_trump$median_house_inc, state_trump$share_vote_trump)
```

```
## [1] -0.5925995
```

What is the interpretation?

---
# Limitations of correlation

- Measures only linear association

- Sensitive to extreme values

- Is not sufficient, by itself, to claim causation
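
The first two limitations can be sketched with small hypothetical vectors (none of these values come from the lecture data):

``` r
# Linear only: y is completely determined by x, but the relationship is a curve
x_curve <- -5:5
y_curve <- x_curve^2
cor(x_curve, y_curve)             # essentially 0

# Sensitive to extreme values: ten points with almost no linear association
x_flat <- 1:10
y_flat <- c(5, 3, 6, 2, 7, 4, 6, 3, 5, 4)
cor(x_flat, y_flat)               # close to 0

# Adding one extreme point makes the correlation look very strong
x_outlier <- c(x_flat, 100)
y_outlier <- c(y_flat, 100)
cor(x_outlier, y_outlier)         # close to 1
```

Plotting the data before reporting a correlation helps catch both situations.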

---
# Recap

- Difference between descriptive and inferential statistics; sample and population

- When presented with descriptive statistics, consider what they say and don't say about the underlying distribution.

- Correlation is a building block of how we explain or predict phenomena in our world using statistics.