class: center, middle, inverse, title-slide

.title[
# PADP 7120 Data Applications in PA
]
.subtitle[
## Data
]
.author[
### Alex Combs
]
.institute[
### UGA | SPIA | PADP
]
.date[
### Last updated: January 15, 2025
]

---

# Outline

- Identify the unit of analysis
- Identify types of variables
- Identify types of data structures

---

# Rectangular data

- Rectangular data consists of
  - Observations of a unit of analysis
  - Variables

--

- **Unit of analysis**: The subject/entity generating the data. Each row contains an observation of a member of the unit of analysis.

--

- **Variable**: A characteristic measured for each member of the unit of analysis. Each column contains a variable.

---

# Rectangular data

<table>
 <thead>
  <tr> <th style="text-align:left;"> Variable_1 </th> <th style="text-align:left;"> Variable_2 </th> <th style="text-align:left;"> Variable_3 </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Member 1 of Unit of Analysis </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
  <tr> <td style="text-align:left;"> Member 2 of Unit of Analysis </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
  <tr> <td style="text-align:left;"> Member 3 of Unit of Analysis </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
</tbody>
</table>

- Each row is (or should be) an **observation**
- Each column is (or should be) a **variable**

---

# Identify the unit of analysis

- What is the likely unit of analysis based on the variable names?

<table>
 <thead>
  <tr> <th style="text-align:left;"> Var1 </th> <th style="text-align:left;"> GDP </th> <th style="text-align:left;"> Population </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Unit of Analysis </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
  <tr> <td style="text-align:left;"> Unit </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
  <tr> <td style="text-align:left;"> Unit </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
</tbody>
</table>

---

# Identify the unit of analysis

- What is the likely unit of analysis based on the variable names?

<table>
 <thead>
  <tr> <th style="text-align:left;"> Var1 </th> <th style="text-align:left;"> GRE_Score </th> <th style="text-align:left;"> Income </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Unit of Analysis </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
  <tr> <td style="text-align:left;"> Unit </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
  <tr> <td style="text-align:left;"> Unit </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
</tbody>
</table>

---

# Identify the unit of analysis

- What is the unit of analysis based on the variable names?

<table>
 <thead>
  <tr> <th style="text-align:right;"> Year </th> <th style="text-align:left;"> Unemployment </th> <th style="text-align:left;"> Poverty </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 2017 </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
  <tr> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
  <tr> <td style="text-align:right;"> 2019 </td> <td style="text-align:left;"> Datum </td> <td style="text-align:left;"> Datum </td> </tr>
</tbody>
</table>

---

# Why care about unit of analysis?

- It is important to know what level is observed or measured: for example, individuals vs. an aggregation of individuals within a city, county, state, etc.
- Spreadsheets will sometimes combine units of analysis. For example, state totals calculated from county totals may appear alongside the county rows.
- Be careful not to accidentally include a unit of analysis that does not belong in the data.

---
class: inverse, middle, center

# Suppose we want to study homelessness in Athens-Clarke County (potential causes or correlates, demographics, government performance metrics, etc.). We need to collect data.

---
class: inverse, middle, center

# What variables might we collect if the unit of analysis is **individuals**?

---
class: inverse, middle, center

# What variables might we collect if the unit of analysis is **years**?

---

# Types of variables

[Figure: types of variables]

---

# Qualitative vs. quantitative

- A **qualitative** or **categorical** variable is a variable that is naturally expressed in words with no intrinsic numerical value

--

- A **quantitative** or **numerical** variable is a variable that has intrinsic numerical meaning

---
class: inverse, middle, center

# Which of our variables are qualitative and which are quantitative?

---

# Types of qualitative variables

.center[
[Figure: types of qualitative variables]
]

- **Nominal** variables take on values that differ by name only
- **Ordinal** variables take on values that can be ranked relative to each other, but the differences between ranks have no numerical meaning

---

# Types of quantitative variables

.center[
[Figure: types of quantitative variables]
]

- **Discrete** variables take on **countable** numeric values, usually integers (e.g. 0, 1, 2, ...)
- **Continuous** variables can take on potentially any value within a range. For example, a percentage ranges from 0 to 100 but can take on infinitely many values within that range (e.g. 50, 50.4, 50.44, 50.444, ...), even though in practice we round to a finite set of values.

---
class: inverse, middle, center

# Which of our qualitative variables are nominal and which are ordinal? Which of our quantitative variables are discrete and which are continuous?

---

# Variables and information

- Variables measure a characteristic of the unit of analysis
- Each variable has some amount of information encoded into it

<img src="lectures_files/varinfo.png" width="2035" height="50%" />

---

# Why variable types matter

- Types of variables inform what kind of visualization and analysis to use.
- See, for instance, this [example](https://coggle.it/diagram/Vxlydu1akQFeqo6-/t/inference).

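---

# Sketch: variable type drives the summary

One way to see why variable types matter is to let the type choose the summary statistic. The sketch below is only an illustration: the pairing (mode for nominal, median for ordinal, mean for discrete or continuous) is a common convention rather than something taken from these slides, and the function `summarize` and all of the homelessness-themed values are hypothetical.

```python
from statistics import mean, mode


def summarize(values, kind, order=None):
    """Pick a summary that suits the variable type (a common convention)."""
    if kind == "nominal":
        # Names only: the most frequent category is the sensible summary.
        return mode(values)
    if kind == "ordinal":
        # Ranked categories: sort by rank and take the (lower) median.
        ranked = sorted(values, key=order.index)
        return ranked[(len(ranked) - 1) // 2]
    if kind in ("discrete", "continuous"):
        # Numeric meaning: an arithmetic mean is defined.
        return mean(values)
    raise ValueError(f"unknown variable type: {kind}")


# Hypothetical values, invented for illustration only.
sleeping_location = ["shelter", "street", "shelter", "vehicle"]  # nominal
self_rated_health = ["poor", "good", "fair", "fair", "excellent"]  # ordinal
health_order = ["poor", "fair", "good", "excellent"]
nights_in_shelter = [3, 10, 0, 7]  # discrete

typical_location = summarize(sleeping_location, "nominal")
typical_health = summarize(self_rated_health, "ordinal", order=health_order)
average_nights = summarize(nights_in_shelter, "discrete")

print(typical_location)  # shelter
print(typical_health)    # fair
print(average_nights)    # 5
```

Averaging the nominal or ordinal values would require inventing numbers for the categories, which is exactly the information those variable types do not carry.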
---
class: inverse, center, middle

# Dataset structures

---

# Cross-sectional

<table>
 <thead>
  <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> lifeExp </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 75.320 </td> </tr>
  <tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 65.554 </td> </tr>
  <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 72.390 </td> </tr>
</tbody>
</table>

- A snapshot in time

--

- One way to identify the unit of analysis is to find the variable or set of variables that uniquely identifies each row in a dataset.
- What is the unit of analysis?

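---

# Sketch: which variable identifies each row?

The row-identification idea from the cross-sectional slide can be checked mechanically. This is a minimal sketch in plain Python, not part of the course materials: the helper name `uniquely_identifies` is hypothetical, and the rows are copied from the cross-sectional table.

```python
def uniquely_identifies(rows, columns):
    """Return True if the combined values of `columns` differ for every row."""
    keys = [tuple(row[col] for col in columns) for row in rows]
    return len(set(keys)) == len(keys)


# Rows copied from the cross-sectional table.
cross_section = [
    {"country": "Argentina", "continent": "Americas", "year": 2007, "lifeExp": 75.320},
    {"country": "Bolivia", "continent": "Americas", "year": 2007, "lifeExp": 65.554},
    {"country": "Brazil", "continent": "Americas", "year": 2007, "lifeExp": 72.390},
]

country_is_key = uniquely_identifies(cross_section, ["country"])
year_is_key = uniquely_identifies(cross_section, ["year"])

print(country_is_key)  # True
print(year_is_key)     # False
```

Uniqueness is a hint rather than proof: `lifeExp` also happens to differ in every row here, yet it is a measured characteristic, so judgment is still needed to decide which unique column names the subject being observed.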
---

# Pooled cross-sectional

<table>
 <thead>
  <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> lifeExp </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Algeria </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 70.994 </td> </tr>
  <tr> <td style="text-align:left;"> Angola </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 41.003 </td> </tr>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 75.320 </td> </tr>
  <tr> <td style="text-align:left;"> Benin </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 54.406 </td> </tr>
  <tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 65.554 </td> </tr>
  <tr> <td style="text-align:left;"> Botswana </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 46.634 </td> </tr>
  <tr> <td style="text-align:left;"> Brazil </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 72.390 </td> </tr>
</tbody>
</table>

- Multiple cross-sections combined
- Different subjects observed in each cross-section
- What is the unit of analysis?

---

# Time series

<table>
 <thead>
  <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> lifeExp </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1977 </td> <td style="text-align:right;"> 68.481 </td> </tr>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1982 </td> <td style="text-align:right;"> 69.942 </td> </tr>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1987 </td> <td style="text-align:right;"> 70.774 </td> </tr>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:right;"> 71.868 </td> </tr>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1997 </td> <td style="text-align:right;"> 73.275 </td> </tr>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 74.340 </td> </tr>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 75.320 </td> </tr>
</tbody>
</table>

- One subject across time
- What is the unit of analysis?

---

# Panel or longitudinal data

<table>
 <thead>
  <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> lifeExp </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1997 </td> <td style="text-align:right;"> 73.275 </td> </tr>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 74.340 </td> </tr>
  <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 75.320 </td> </tr>
  <tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 1997 </td> <td style="text-align:right;"> 62.050 </td> </tr>
  <tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 63.883 </td> </tr>
  <tr> <td style="text-align:left;"> Bolivia </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 65.554 </td> </tr>
</tbody>
</table>

- Same subjects observed across time
- Combination of cross-sectional and time series data
- What is the unit of analysis?
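
---

# Sketch: telling the four structures apart

The four dataset structures differ in how subjects and time periods repeat, which suggests a rough rule of thumb. The sketch below is an assumption-laden heuristic, not a formal definition: the function `classify_structure`, the helper `make_rows`, and the parameter names are hypothetical, and only the country and year columns from the slide tables are used.

```python
def classify_structure(rows, unit_col, time_col):
    """Rough heuristic: label a dataset by how units and periods repeat."""
    units = {row[unit_col] for row in rows}
    periods = {row[time_col] for row in rows}
    if len(periods) == 1:
        return "cross-sectional"  # a snapshot in time
    if len(units) == 1:
        return "time series"  # one subject across time
    # Collect the set of periods in which each unit is observed.
    periods_by_unit = {}
    for row in rows:
        periods_by_unit.setdefault(row[unit_col], set()).add(row[time_col])
    if any(len(seen) > 1 for seen in periods_by_unit.values()):
        return "panel"  # same subjects observed across time
    return "pooled cross-sectional"  # different subjects in each period


def make_rows(pairs):
    """Build minimal rows from (country, year) pairs."""
    return [{"country": country, "year": year} for country, year in pairs]


# (country, year) pairs copied from the slide tables.
cross_section = make_rows([("Argentina", 2007), ("Bolivia", 2007), ("Brazil", 2007)])
pooled = make_rows([
    ("Algeria", 2002), ("Angola", 2002), ("Argentina", 2007), ("Benin", 2002),
    ("Bolivia", 2007), ("Botswana", 2002), ("Brazil", 2007),
])
time_series = make_rows([("Argentina", year) for year in range(1977, 2008, 5)])
panel = make_rows(
    [(country, year) for country in ("Argentina", "Bolivia") for year in (1997, 2002, 2007)]
)

for name, data in [("cross_section", cross_section), ("pooled", pooled),
                   ("time_series", time_series), ("panel", panel)]:
    print(name, "->", classify_structure(data, "country", "year"))
```

Real datasets can blur these lines (for example, a pooled cross-section in which a few subjects happen to reappear), so the label returned here is a starting point for inspection rather than a verdict.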