class: center, middle, inverse, title-slide

.title[
# PADP 7120 Data Applications in PA
]
.subtitle[
## Data
]
.author[
### Alex Combs
]
.institute[
### UGA | SPIA | PADP
]
.date[
### Last updated: January 16, 2024
]

---
# Outline

- Identify the unit of analysis
- Identify types of variables
- Identify types of data structures
- Review of R Chapter

---
# Rectangular data

- Rectangular data consists of
  - Observations of a unit of analysis
  - Variables

--

- **Unit of analysis**: The subject/entity generating the data. Each row contains an observation of a member within the unit of analysis.

--

- **Variable**: A characteristic measured for each member of the unit of analysis. Each column contains a variable.

---
# Rectangular data

<table>
<thead>
<tr> <th style="text-align:left;"> Variable_1 </th> <th style="text-align:left;"> Variable_2 </th> <th style="text-align:left;"> Variable_3 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Member 1 of Unit of Analysis </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
<tr> <td style="text-align:left;"> Member 2 of Unit of Analysis </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
<tr> <td style="text-align:left;"> Member 3 of Unit of Analysis </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
</tbody>
</table>

- Each row is (or should be) an **observation**
- Each column is (or should be) a **variable**

---
# Identify the unit of analysis

- The unit of analysis is the variable or set of variables that uniquely identifies each row in a data set.

--

- What is the likely unit of analysis based on the variable names?

<table>
<thead>
<tr> <th style="text-align:left;"> Var1 </th> <th style="text-align:left;"> GDP </th> <th style="text-align:left;"> Population </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Unit of Analysis </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
<tr> <td style="text-align:left;"> Unit </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
<tr> <td style="text-align:left;"> Unit </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
</tbody>
</table>

---
# Identify the unit of analysis

- What is the likely unit of analysis based on the variable names?

<table>
<thead>
<tr> <th style="text-align:left;"> Var1 </th> <th style="text-align:left;"> GRE_Score </th> <th style="text-align:left;"> Income </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Unit of Analysis </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
<tr> <td style="text-align:left;"> Unit </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
<tr> <td style="text-align:left;"> Unit </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
</tbody>
</table>

---
# Identify the unit of analysis

- What is the unit of analysis based on the variable names?

<table>
<thead>
<tr> <th style="text-align:right;"> Year </th> <th style="text-align:left;"> Unemployment </th> <th style="text-align:left;"> Poverty </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 2017 </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
<tr> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
<tr> <td style="text-align:right;"> 2019 </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
</tbody>
</table>

---
# Why care about unit of analysis?

- It is important to know what level is observed or measured. For example, individuals vs. an aggregation of individuals within a city, county, state, etc.

--

- Spreadsheets will often combine units of analysis. For example, state totals calculated from multiple county totals. This is fine for spreadsheets.

--

- A raw dataset we wish to use in software like RStudio should contain one unit of analysis. For example, state totals should be separate from county totals.

- Understanding the unit of analysis will help us identify rows or columns that do not belong together.

---
class: inverse, middle, center

# Suppose we want to study homelessness in Athens-Clarke County (potential causes or correlates, demographics, government performance metrics, etc.). We need to collect data.

---
class: inverse, middle, center

# What variables might we collect if the unit of analysis is **individuals**?

---
class: inverse, middle, center

# What variables might we collect if the unit of analysis is **years**?

---
# Types of variables

![](lectures_files/variables.png)

---
# Qualitative vs. quantitative

- A **qualitative** or **categorical** variable is a variable that is naturally expressed in words with no intrinsic numerical value

--

- A **quantitative** or **numerical** variable is a variable that has intrinsic numerical meaning

---
class: inverse, middle, center

# Which of our variables are qualitative and which are quantitative?

---
# Types of qualitative variables

.center[
![:scale 40%](lectures_files/categorical.png)
]

- **Nominal** variables take on values that differ by name only

- **Ordinal** variables take on values that can be ranked relative to each other, but the difference between rankings has no numerical value

---
# Types of quantitative variables

.center[
![:scale 40%](lectures_files/quantitative.png)
]

- **Discrete** variables take on numeric values that are **countable**, such as integers (e.g., 0, 1, 2, ...)

- **Continuous** variables take on potentially any value. For example, a percentage ranges between 0 and 100 but can take on infinitely many values within that range (e.g., 50, 50.4, 50.44, 50.444, ...), even though we round to a finite set of values.

---
class: inverse, middle, center

# Which of our qualitative variables are nominal or ordinal? Which of our quantitative variables are discrete or continuous?

---
# Variables and information

- Variables measure a characteristic of the unit of analysis

- Each variable has some amount of information encoded into it

<img src="lectures_files/varinfo.png" width="2035" height="50%" />

---
# Why variable types matter

- Types of variables inform what kind of visualization and analysis to use.

- For [example](https://coggle.it/diagram/Vxlydu1akQFeqo6-/t/inference).

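---
# Variable types in R

The four variable types map fairly naturally onto R's basic vector types. The sketch below is only one way to encode them. The variable names and values are hypothetical, loosely following the homelessness example, and are not real data.

```r
# Hypothetical responses from three individuals (made-up values).

# Nominal: categories that differ by name only -> unordered factor
shelter_status <- factor(c("sheltered", "unsheltered", "sheltered"))

# Ordinal: categories that can be ranked -> ordered factor
education <- factor(
  c("high school", "college", "less than high school"),
  levels  = c("less than high school", "high school", "college"),
  ordered = TRUE
)

# Discrete: countable values -> integer vector
nights_in_shelter <- c(12L, 0L, 30L)

# Continuous: potentially any value -> double vector
monthly_income <- c(850.25, 0, 1320.5)

is.ordered(shelter_status)     # FALSE: no ranking among categories
is.ordered(education)          # TRUE: levels are ranked
education[1] > education[3]    # TRUE: "high school" ranks above "less than high school"
is.integer(nights_in_shelter)  # TRUE
is.double(monthly_income)      # TRUE
```

Storing an ordinal variable as an ordered factor keeps the ranking without implying that the gaps between levels are equal.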
---
class: inverse, center, middle

# Dataset structures

---
# Cross-sectional

<table>
<thead>
<tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> lifeExp </th> <th style="text-align:right;"> pop </th> <th style="text-align:right;"> gdpPercap </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 75.320 </td> <td style="text-align:right;"> 40301927 </td> <td style="text-align:right;"> 12779.380 </td> </tr>
<tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 65.554 </td> <td style="text-align:right;"> 9119152 </td> <td style="text-align:right;"> 3822.137 </td> </tr>
<tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 72.390 </td> <td style="text-align:right;"> 190010647 </td> <td style="text-align:right;"> 9065.801 </td> </tr>
</tbody>
</table>

- A snapshot in time
- What is the unit of analysis?

---
# Pooled cross-sectional

<table>
<thead>
<tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> lifeExp </th> <th style="text-align:right;"> pop </th> <th style="text-align:right;"> gdpPercap </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Algeria </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 70.994 </td> <td style="text-align:right;"> 31287142 </td> <td style="text-align:right;"> 5288.040 </td> </tr>
<tr> <td style="text-align:left;"> Angola </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 41.003 </td> <td style="text-align:right;"> 10866106 </td> <td style="text-align:right;"> 2773.287 </td> </tr>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 75.320 </td> <td style="text-align:right;"> 40301927 </td> <td style="text-align:right;"> 12779.380 </td> </tr>
<tr> <td style="text-align:left;"> Benin </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 54.406 </td> <td style="text-align:right;"> 7026113 </td> <td style="text-align:right;"> 1372.878 </td> </tr>
<tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 65.554 </td> <td style="text-align:right;"> 9119152 </td> <td style="text-align:right;"> 3822.137 </td> </tr>
<tr> <td style="text-align:left;"> Botswana </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 46.634 </td> <td style="text-align:right;"> 1630347 </td> <td style="text-align:right;"> 11003.605 </td> </tr>
<tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 72.390 </td> <td style="text-align:right;"> 190010647 </td> <td style="text-align:right;"> 9065.801 </td> </tr>
</tbody>
</table>

- Multiple cross-sections combined
- Different subjects observed in each cross-section

---
# Time series

<table>
<thead>
<tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> lifeExp </th> <th style="text-align:right;"> pop </th> <th style="text-align:right;"> gdpPercap </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1977 </td> <td style="text-align:right;"> 68.481 </td> <td style="text-align:right;"> 26983828 </td> <td style="text-align:right;"> 10079.027 </td> </tr>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1982 </td> <td style="text-align:right;"> 69.942 </td> <td style="text-align:right;"> 29341374 </td> <td style="text-align:right;"> 8997.897 </td> </tr>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1987 </td> <td style="text-align:right;"> 70.774 </td> <td style="text-align:right;"> 31620918 </td> <td style="text-align:right;"> 9139.671 </td> </tr>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:right;"> 71.868 </td> <td style="text-align:right;"> 33958947 </td> <td style="text-align:right;"> 9308.419 </td> </tr>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1997 </td> <td style="text-align:right;"> 73.275 </td> <td style="text-align:right;"> 36203463 </td> <td style="text-align:right;"> 10967.282 </td> </tr>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 74.340 </td> <td style="text-align:right;"> 38331121 </td> <td style="text-align:right;"> 8797.641 </td> </tr>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 75.320 </td> <td style="text-align:right;"> 40301927 </td> <td style="text-align:right;"> 12779.380 </td> </tr>
</tbody>
</table>

- One subject across time
- Unit of analysis?

---
# Panel/longitudinal data

<table>
<thead>
<tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> lifeExp </th> <th style="text-align:right;"> pop </th> <th style="text-align:right;"> gdpPercap </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1997 </td> <td style="text-align:right;"> 73.275 </td> <td style="text-align:right;"> 36203463 </td> <td style="text-align:right;"> 10967.282 </td> </tr>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 74.340 </td> <td style="text-align:right;"> 38331121 </td> <td style="text-align:right;"> 8797.641 </td> </tr>
<tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 75.320 </td> <td style="text-align:right;"> 40301927 </td> <td style="text-align:right;"> 12779.380 </td> </tr>
<tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1997 </td> <td style="text-align:right;"> 62.050 </td> <td style="text-align:right;"> 7693188 </td> <td style="text-align:right;"> 3326.143 </td> </tr>
<tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 63.883 </td> <td style="text-align:right;"> 8445134 </td> <td style="text-align:right;"> 3413.263 </td> </tr>
<tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 65.554 </td> <td style="text-align:right;"> 9119152 </td> <td style="text-align:right;"> 3822.137 </td> </tr>
</tbody>
</table>

- Same subjects observed across time
- Unit of analysis?
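
---
# Checking a guess about the unit of analysis

Since the unit of analysis is the variable or set of variables that uniquely identifies each row, one way to test a guess is to look for duplicates. The sketch below uses base R and retypes three columns of the panel table by hand. The object names `panel`, `cross_section`, and `time_series` are illustrative choices, not part of the course materials.

```r
# Values copied from the panel slide (country, year, lifeExp only).
panel <- data.frame(
  country = rep(c("Argentina", "Bolivia"), each = 3),
  year    = rep(c(1997L, 2002L, 2007L), times = 2),
  lifeExp = c(73.275, 74.340, 75.320, 62.050, 63.883, 65.554)
)

# anyDuplicated() returns 0 when nothing repeats,
# otherwise the position of the first repeat.
anyDuplicated(panel$country)                # non-zero: country alone repeats
anyDuplicated(panel$year)                   # non-zero: year alone repeats
anyDuplicated(panel[c("country", "year")])  # 0: country and year together identify each row

# Subsetting the panel recovers the simpler structures.
cross_section <- panel[panel$year == 2007, ]            # one year, several countries
time_series   <- panel[panel$country == "Argentina", ]  # one country, several years
```

A result of 0 for a set of columns is consistent with that set being the unit of analysis, though it does not prove it for data not yet collected.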