class: center, middle, inverse, title-slide

.title[
# PADP 7120 Data Applications in PA
]
.subtitle[
## Data Description
]
.author[
### Alex Combs
]
.institute[
### UGA | SPIA | PADP
]
.date[
### Last updated: January 30, 2024
]

---

# Outline

- Distinguish between descriptive and inferential statistics
- Identify population, sample, parameter, and statistic
- Distributions and their description

---

# Two categories of statistics

.center[
![](lectures_files/stats_types.png)
]

---

# Population vs. Sample

- **Population:** all members of a specified group pertaining to a research question
  - A population can be any size based on our research question
  - If we can observe the population, all we need to do is describe it to reach useful conclusions

--

- **Sample:** a subset of that population
  - We can describe a sample or a population
  - If we cannot observe the population, we take a sample and use inferential statistics to reach useful conclusions about that population

---

# Descriptive vs. inferential

- Suppose we want to learn more about employment and earning outcomes among MPA graduates through a survey

--

.pull-left[
**Descriptive Questions**

- What is the average income of respondents?
- What percent of respondents are employed?
- What percent of respondents earn more than they did prior to their MPA?
]

--

.pull-right[
**Inferential Questions**

- What is the average income for MPA graduates?
- Does an MPA increase employment?
- Does an MPA increase income?
]

---

# Parameter vs. statistic

- **Parameter:** a measure pertaining to a population
  - Typically used when population is unobserved
  - The "true" value we try to estimate using a sample

- **Statistic:** a measure pertaining to a sample
  - Also referred to as an **estimate**

---

# Back to Survey Example

Suppose our survey receives 100 responses and finds that the median income of UGA MPA alumni is $75,000

- Is this a sample statistic or a population parameter?

---

# Descriptive vs. inferential

Suppose we are told the average salary of college-educated women in Georgia is greater than the national average for college-educated women.

Suppose we set out to confirm the average salary in Georgia for ourselves. We survey a sample of 1,000 college-educated women in Georgia and record their income.

--

- What are the population and the sample?
- What are the parameter and the statistic/estimate?

---

# Descriptive vs. inferential

Suppose we want to know the percent of Clarke County residents who have a valid ID for the upcoming election.

--

What is the population?

--

We ask people entering Baldwin Hall whether they have a valid ID according to Georgia's election laws. We get 100 responses, and 80% said they have a valid ID.

--

- Is our result of 80% a population parameter or an estimate?

--

- Why might we have a sample size less than 100?

---

class: inverse, middle, center

# Data description

---

# Distribution

- A distribution shows the (possible) values for a variable and how often they occur.
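
As a minimal sketch of this definition, the code below tabulates a small vector so that each distinct value appears with its count and its share. The `children` vector is made-up data for illustration and is not from the course datasets.

```r
# Hypothetical data: number of children reported by ten survey respondents
children <- c(0, 1, 1, 2, 2, 2, 3, 0, 1, 2)

# Each distinct value and how often it occurs (a frequency distribution)
counts <- table(children)
print(counts)

# The same distribution expressed as proportions of all responses
shares <- prop.table(counts)
print(shares)
```

A histogram or bar chart of a variable draws the same information as a picture.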

<img src="Description_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

# Describing numerical variables

- Center
  - Mean, median, mode

- Spread
  - Variance, standard deviation, IQR, range

- Association
  - Covariance, correlation
  - Regression coefficient, coefficient of determination (will cover later)

---

# Measures of center

- Mean: the balancing point of the distribution
- Median: the middle of the distribution (50th percentile)
- Mode: the peak(s) of the distribution

---

# Mean (average)

- Add values and divide by the count of values

```r
(2 + 4 + 6 + 8 + 10)/5
```

```
## [1] 6
```

.pull-left[
<table>
 <thead>
  <tr> <th style="text-align:right;"> ID </th> <th style="text-align:right;"> variable </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> </tr>
  <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> </tr>
  <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 6 </td> </tr>
  <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 8 </td> </tr>
  <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 10 </td> </tr>
</tbody>
</table>
]

.pull-right[

```r
mean(x$variable)
```

```
## [1] 6
```

<img src="lectures_files/mean.png" width="1105" />

```r
mean(ncbirths$weeks)
```

```
## [1] 38.4675
```
]

---

# Median

- Arrange values in order, find the middle value
- If there is an even number of values and thus no single middle value, average the two middle values

.pull-left[
<table>
 <thead>
  <tr> <th style="text-align:right;"> ID </th> <th style="text-align:right;"> variable </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> </tr>
  <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> </tr>
  <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 6 </td> </tr>
  <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 8 </td> </tr>
  <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 10 </td> </tr>
</tbody>
</table>
]

.pull-right[

```r
median(x$variable)
```

```
## [1] 6
```

![](lectures_files/median.png)

```r
median(ncbirths$weeks)
```

```
## [1] 39
```
]

---

# Mode

- The value that occurs most frequently

.pull-left[
<table>
 <thead>
  <tr> <th style="text-align:right;"> ID </th> <th style="text-align:right;"> variable </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> </tr>
  <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> </tr>
  <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 6 </td> </tr>
  <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 8 </td> </tr>
  <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 10 </td> </tr>
</tbody>
</table>
]

.pull-right[
- Variable has no mode. One more of any of the 5 values would make that value the mode.
]

---

# Mean vs. median vs. mode

- We use measures of center to communicate the *typical* value of a distribution.
- Which measure best conveys what is typical depends on the distribution.

![](lectures_files/center.png)

---

# Mean vs. median vs. mode

- Mean is sensitive to extreme values, median is not
- Median is better to use when a distribution is skewed
- Mode can be used for discrete or categorical variables

---

# Mean vs. median vs. mode

- Which measure of center is best for describing typical weeks pregnant?

.pull-left[
<img src="Description_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]

.pull-right[

```r
mean(ncbirths$weeks)
```

```
## [1] 38.4675
```

```r
median(ncbirths$weeks)
```

```
## [1] 39
```
]

---

# Mean vs. median vs. mode

- Which measure of center is best for describing typical GDP per capita?

.pull-left[
<img src="Description_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]

.pull-right[

```r
mean(gapminder$gdpPercap)
```

```
## [1] 7215.327
```

```r
median(gapminder$gdpPercap)
```

```
## [1] 3531.847
```
]

---

class: inverse, middle, center

# Measures of spread

---

# Measures of spread

- **Variance:** the average squared deviation from the mean
- **Standard deviation:** the square root of the variance; roughly the typical deviation from the mean
- **Interquartile range:** the difference between 75th and 25th percentiles
- **Range:** the difference between the maximum and minimum values

---

# Standard deviation (SD)

- Variance is hard to interpret as a descriptive measure of spread because it is in squared units

```r
var(ncbirths$weeks)
```

```
## [1] 7.583423
```

- On average, weeks pregnant deviates from the average by 7.6 squared weeks.

--

- Standard deviation recovers the original units of the variable

```r
sd(ncbirths$weeks)
```

```
## [1] 2.753802
```

- On average, weeks pregnant deviates from the average by almost 3 weeks.

---

# Interquartile range (IQR)

- Divide the distribution into 4 equal parts
- The dividing values are the 25th, 50th, and 75th percentiles
- IQR is the difference between the 75th and 25th percentiles

```r
quantile(ncbirths$weeks, c(.25, .5, .75))
```

```
## 25% 50% 75% 
##  38  39  40
```

- The IQR for weeks pregnant is 2

--

```r
IQR(ncbirths$weeks)
```

```
## [1] 2
```

---

# SD vs. IQR

- We use SD and IQR to communicate the typical deviation of a distribution from its center
- SD is based on the mean; sensitive to extreme values
- IQR uses percentiles; not sensitive to extreme values

--

.pull-left[
<table>
 <thead>
  <tr> <th style="text-align:right;"> ID </th> <th style="text-align:right;"> variable </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> </tr>
  <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 4 </td> </tr>
  <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 6 </td> </tr>
  <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 8 </td> </tr>
  <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 100 </td> </tr>
</tbody>
</table>
]

.pull-right[

```r
sd(xextreme$variable)
```

```
## [1] 42.54409
```

```r
IQR(xextreme$variable)
```

```
## [1] 4
```
]

---

# SD vs. IQR

.pull-left[
![](Description_files/figure-html/unnamed-chunk-23-1.png)<!-- -->
]

.pull-right[

```r
var(ncbirths$weeks)
```

```
## [1] 7.583423
```

```r
sd(ncbirths$weeks)
```

```
## [1] 2.753802
```

```r
IQR(ncbirths$weeks)
```

```
## [1] 2
```
]

---

# SD vs. IQR

.pull-left[
<img src="Description_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

.pull-right[

```r
var(gapminder$gdpPercap)
```

```
## [1] 97169410
```

```r
sd(gapminder$gdpPercap)
```

```
## [1] 9857.455
```

```r
IQR(gapminder$gdpPercap)
```

```
## [1] 8123.402
```
]

---

# Range

.pull-left[
- Describes the greatest extent to which the variable changes
- Or the possible values of the variable
- Or how different the most different observations are
]

.pull-right[

```r
range(ncbirths$weeks)
```

```
## [1] 20 45
```

```r
range(gapminder$gdpPercap)
```

```
## [1]    241.1659 113523.1329
```
]

---

# Choosing measures

.pull-left[
![](Description_files/figure-html/unnamed-chunk-28-1.png)<!-- -->
]

.pull-right[
<img src="Description_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]

- Which measures of center and spread should we use or not use to describe the above distributions?

---

class: inverse, middle, center

# Measures of Association

---

# Conditional distributions

- Distribution of a variable *given* another variable's values

<img src="Description_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

---

# Measures of association

Suppose we want to investigate the association between the percent of a state population that is white and the percent of the state population that voted for Donald Trump.

.pull-left[
<table>
 <thead>
  <tr> <th style="text-align:right;"> share_white </th> <th style="text-align:right;"> share_vote_trump </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 65 </td> <td style="text-align:right;"> 63 </td> </tr>
  <tr> <td style="text-align:right;"> 58 </td> <td style="text-align:right;"> 53 </td> </tr>
  <tr> <td style="text-align:right;"> 51 </td> <td style="text-align:right;"> 50 </td> </tr>
  <tr> <td style="text-align:right;"> 74 </td> <td style="text-align:right;"> 60 </td> </tr>
  <tr> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 33 </td> </tr>
  <tr> <td style="text-align:right;"> 69 </td> <td style="text-align:right;"> 44 </td> </tr>
</tbody>
</table>
]

.pull-right[
<img src="Description_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />
]

---

# Measures of association

- Or the percent of the white population in poverty

<img src="Description_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />

- Units with higher values along the x-axis may tend to have higher values along the y-axis, lower values, or show no tendency.
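
---

# Measures of association

One rough way to check for a tendency, sketched under assumptions: split the units at the median of the x variable and compare the average y value in each half. The vectors below hold only the six rows shown in the earlier table, not the full `state_trump` data, and the names `high_x`, `mean_y_high`, and `mean_y_low` are introduced here for illustration.

```r
# The six rows from the table shown earlier (not the full dataset)
share_white      <- c(65, 58, 51, 74, 39, 69)
share_vote_trump <- c(63, 53, 50, 60, 33, 44)

# Flag units whose x value is above the median of x
high_x <- share_white > median(share_white)

# Average y among units with higher x versus units with lower x
mean_y_high <- mean(share_vote_trump[high_x])
mean_y_low  <- mean(share_vote_trump[!high_x])

# A positive difference suggests that higher x tends to go with higher y
mean_y_high - mean_y_low
```

With so few rows this is only suggestive; a scatterplot of all units shows the same tendency more completely.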

---

# Measures of association

- The association between two variables can be described in terms of
  - **Direction:** when one variable increases, does the other variable *tend* to increase or decrease
  - **Strength:** how closely do the variables move together
  - **Magnitude:** when one variable increases a certain amount, how much does the other variable increase or decrease

---

# Measures of association

- **Covariance:** measures **direction** of association between two variables
- **Correlation coefficient:** measures **direction** and **strength** of association between two variables
- **Regression coefficient:** measures the **direction** and **magnitude** of association between an explanatory variable and an outcome variable

---

# Correlation coefficient

- Ranges between -1 and 1
- Positive or negative value tells us the direction
- The closer to -1 or 1, the stronger the association in that direction, with 0 indicating no association
- No definitive scale; rule of thumb:
  - 0.8: very strong
  - 0.6: strong
  - 0.4: moderate
  - 0.2: weak

---

# Correlation coefficient

<img src="Description_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />

```r
cor(state_trump$share_vote_trump, state_trump$share_white_poverty)
```

```
## [1] 0.4872326
```

- What is the interpretation of this correlation coefficient?

---

# Correlation

Percent white population in poverty

```r
cor(state_trump$share_vote_trump, state_trump$share_white_poverty)
```

```
## [1] 0.4872326
```

Percent population that is white

```r
cor(state_trump$share_vote_trump, state_trump$share_white)
```

```
## [1] 0.4220675
```

Which correlation is stronger?

---

# Correlation

<img src="Description_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" />

Unit of analysis is states. Is the correlation positive or negative?
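
---

# Covariance vs. correlation

A small sketch of why covariance conveys only direction while correlation also conveys strength: the size of the covariance depends on the units of the variables, whereas the correlation does not. The vectors reuse the six rows from the earlier table rather than the full `state_trump` data, and the short names `x`, `y`, `cov_manual`, and `cor_manual` are chosen here for illustration.

```r
# Six rows from the earlier table (percent white, percent Trump vote)
x <- c(65, 58, 51, 74, 39, 69)
y <- c(63, 53, 50, 60, 33, 44)
n <- length(x)

# Sample covariance: average product of deviations (dividing by n - 1)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation: covariance rescaled by both standard deviations
cor_manual <- cov_manual / (sd(x) * sd(y))

# Re-express x as a proportion (0 to 1) instead of a percent (0 to 100)
cov_rescaled <- cov(x / 100, y)   # shrinks by a factor of 100
cor_rescaled <- cor(x / 100, y)   # unchanged

c(cov_manual, cov_rescaled)
c(cor_manual, cor_rescaled)
```

The sign of the covariance survives the change of units, but its size does not, so only the correlation can be compared across pairs of variables.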

---

# Correlation

<img src="Description_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" />

```r
cor(state_trump$median_house_inc, state_trump$share_vote_trump)
```

```
## [1] -0.5925995
```

What is the interpretation?

---

# Limitations of correlation

- Measures only linear association
- Sensitive to extreme values
- Is necessary but not sufficient to claim causation

---

# Recap

- Difference between descriptive and inferential statistics; sample and population
- When presented with descriptive statistics, consider what they say and don't say about the underlying distribution.
- Correlation is a building block of how we explain or predict phenomena in our world using statistics.
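
---

# Limitations of correlation: a sketch

Returning to the first two limitations listed earlier, the sketch below uses made-up numbers (none of them from the course datasets) to show a perfect but nonlinear relationship with a correlation of about zero, and a single extreme observation that creates a strong correlation on its own.

```r
# 1. Only linear association is measured:
#    y is fully determined by x, yet the relationship is a curve
x <- -5:5
y <- x^2
r_curve <- cor(x, y)   # approximately 0

# 2. Sensitivity to extreme values:
#    five hypothetical points with no linear association
a <- c(1, 2, 3, 4, 5)
b <- c(2, 1, 3, 1, 2)
r_without <- cor(a, b)

#    add one extreme observation far from the rest
r_with <- cor(c(a, 50), c(b, 50))

c(r_curve, r_without, r_with)
```

Plotting the data before relying on a correlation coefficient is a simple guard against both problems.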