class: center, middle, inverse, title-slide .title[ # PADP 7120 Data Applications in PA ] .subtitle[ ## Data Measurement ] .author[ ### Alex Combs ] .institute[ ### UGA | SPIA | PADP ] .date[ ### Last updated: January 23, 2024 ] --- # Outline - Measurement validity - Measurement reliability - Missing data --- # What does the data say? "Data doesn't *say* anything. *Humans* say things. ... Data can’t say anything about an issue any more than a hammer can build a house ... Data is a necessary ingredient in discovery, but you need a human to select it, shape it, and then turn it into an insight. Data is therefore only as useful as its quality and the skills of the person wielding it" (Jones-Rooy, 2019). --- # Credible analysis ![](lectures_files/credible.png) --- # Measurement Validity - **Measurement Validity**: The extent to which a variable measures what it is intended to measure. - Are values recorded accurately? - Do values accurately capture the concept I or others claim they do? --- # Measurement Reliability - **Measurement Reliability**: The extent to which the measurement process or instrument for a variable generates consistent values. - Are values consistent across subjects and/or time? - Is variation across subjects or time real or due to inconsistent measurement? --- # Example ![](lectures_files/measure_lines.png) --- # Discussion - With your neighbors, discuss the following about any of the examples below or your own: - How might there be issues with measurement validity and/or reliability? - What might be some consequences of poor measurement validity and/or reliability? - Property values - Crime rates - Test scores - Census data (e.g., population, race/ethnicity) --- # Why does this matter? - Can lead to important questions about how data are collected. - Can impact our ability to make comparisons between observations or over time that are often involved in decision-making. - One of many reminders not to be too certain. Much of statistics is about producing a *range* of values we trust will usually contain the truth. --- class: inverse, middle, center # Missing data --- # Missing Data - Two types - Explicit - Implicit - Explicitly missing data are data that we can *see* are missing in the data; cells containing a value that denotes missing - Implicitly missing data are data that we would expect to be included based on data structure but are not; no obvious sign of missing --- # Explicitly missing <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> lifeExp </th> <th style="text-align:right;"> pop </th> <th style="text-align:right;"> gdpPercap </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 40301927 </td> <td style="text-align:right;"> 12779.4 </td> </tr> <tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 65.6 </td> <td style="text-align:right;"> 9119152 </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 72.4 </td> <td style="text-align:right;"> 190010647 </td> <td style="text-align:right;"> 9065.8 </td> </tr> </tbody> </table> - Incorrect to say we have 3 countries with a mean life expectancy of 68.972 --- # Implicitly missing <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> lifeExp </th> <th style="text-align:right;"> pop </th> <th style="text-align:right;"> gdpPercap </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1997 </td> <td style="text-align:right;"> 73.3 </td> <td style="text-align:right;"> 36203463 </td> <td style="text-align:right;"> 10967.3 </td> </tr> <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 74.3 </td> <td style="text-align:right;"> 38331121 </td> <td style="text-align:right;"> 8797.6 </td> </tr> <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 75.3 </td> <td style="text-align:right;"> 40301927 </td> <td style="text-align:right;"> 12779.4 </td> </tr> <tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1997 </td> <td style="text-align:right;"> 62.0 </td> <td style="text-align:right;"> 7693188 </td> <td style="text-align:right;"> 3326.1 </td> </tr> <tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 65.6 </td> <td style="text-align:right;"> 9119152 </td> <td style="text-align:right;"> 3822.1 </td> </tr> </tbody> </table> - Bolivia missing in 2002 may affect a statistic I want to compute from these data --- # Remember - Don't assume data are missing at random - Missing data may be systematically different than observed data - Consider the group the data is supposed to contain and whether missing data require us to limit the scope/generalizability of conclusions