When Citizen Data Science Flies Off Course
Jul 29, 2022
The travel blog View from the Wing recently published a short piece identifying the airports with the most expensive airfares in the country. Most readers will simply take that list at face value. Some readers (like us!), however, pause and wonder: What does "most expensive" really mean?
The internet has democratized data, making it more widely available and encouraging "citizen data scientists" to analyze and comment on it, often without context. Unfortunately, you can't believe everything you read! Being a savvy consumer of data requires data literacy: an understanding of the data domain, the business processes, and the circumstances under which the data is created and analyzed. This last piece is especially critical: without it, an analyst cannot select and control for the appropriate variables, and so cannot make a true comparison of data points.
The air travel industry, like higher education, has very specific context that is critical to understanding its data. Infographics and aggregated numbers make great sound bites and easy story leads, but there is danger in looking only at high-level numbers, especially for decision-making. Aggregation can be too far removed from the actual data to be meaningful. In this case, distilling airfare data from the Department of Transportation to "most expensive" is essentially impossible because of the number and complexity of variables, such as:
- Does the airport have international flights? High-mileage flights?
- What proportion of the seats are economy vs. business or first class?
- What customer profiles use this airport and what are their price tolerances?
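To make the effect of just one of these variables concrete, here is a minimal sketch in Python. The airport codes, fares, and distances are entirely made up for illustration (they are not Department of Transportation figures), and the two helper functions are hypothetical. It compares a plain average fare with a fare-per-mile measure for an airport dominated by long-haul flights and one dominated by short hops.

```python
from statistics import mean

# Hypothetical itineraries: (airport, fare in dollars, distance in miles).
# These values are invented for illustration only.
ITINERARIES = [
    ("AAA", 900.0, 4500),  # long-haul international
    ("AAA", 700.0, 3500),
    ("AAA", 200.0, 800),
    ("BBB", 300.0, 600),   # short domestic hops
    ("BBB", 280.0, 500),
    ("BBB", 320.0, 700),
]


def average_fare(rows, airport):
    """Mean ticket price for one airport, ignoring distance flown."""
    return mean(fare for code, fare, _ in rows if code == airport)


def fare_per_mile(rows, airport):
    """Total fares divided by total miles flown for one airport."""
    total_fare = sum(fare for code, fare, _ in rows if code == airport)
    total_miles = sum(miles for code, _, miles in rows if code == airport)
    return total_fare / total_miles


for airport in ("AAA", "BBB"):
    print(
        airport,
        f"average fare: ${average_fare(ITINERARIES, airport):.2f}",
        f"fare per mile: ${fare_per_mile(ITINERARIES, airport):.2f}",
    )
```

With these invented numbers, AAA looks like the "most expensive" airport by average fare, yet BBB costs more than twice as much per mile flown. Neither measure is the right one in general; the point is only that the ranking depends on which variables the analyst controls for.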
Critical data patterns like these are masked by aggregation and can easily be missed by an inexperienced analyst who doesn’t understand the domain. Just as a single, high-level value of "most expensive" airport ignores the considerations above, a single KPI for student retention is just a starting point. Relying solely on that number and not looking any deeper means some groups of students will be left behind, hidden in the aggregate. Those skilled in data literacy in the higher ed domain will know how to look deeper and ask the right questions. What differences exist among students with different entry statuses (e.g., first-time or transfer), cohort memberships, socioeconomic statuses, or demographic profiles? Which initiatives are having a positive impact on retention for certain groups, and how can we scale those?
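The retention point can be sketched the same way. The example below is a minimal illustration in Python using invented student records, not real institutional data. The `STUDENTS` list and both function names are assumptions made for the example. It computes one overall retention rate and then breaks the same records out by entry status.

```python
from collections import defaultdict

# Hypothetical records: (entry_status, retained_to_next_year).
# Counts are invented for illustration only.
STUDENTS = (
    [("first-time", True)] * 85
    + [("first-time", False)] * 15
    + [("transfer", True)] * 12
    + [("transfer", False)] * 8
)


def retention_rate(records):
    """Share of all students retained: the single headline KPI."""
    return sum(retained for _, retained in records) / len(records)


def retention_by_group(records):
    """Retention rate for each entry status, computed separately."""
    outcomes = defaultdict(list)
    for status, retained in records:
        outcomes[status].append(retained)
    return {status: sum(vals) / len(vals) for status, vals in outcomes.items()}


print(f"overall: {retention_rate(STUDENTS):.1%}")
for status, rate in retention_by_group(STUDENTS).items():
    print(f"{status}: {rate:.1%}")
```

In this made-up cohort the headline rate is roughly 81%, which looks healthy, while the much smaller transfer group retains at only 60%. Because transfers make up a small share of the total, their lower rate barely moves the aggregate, which is exactly how a group ends up hidden in it.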
Discovering and understanding the relationships within data takes time and effort, and this is the skill of a data-literate analyst. Without a true understanding of the domain, data is more likely to be reduced to a single, meaningless number, diminishing the capacity for effective analysis. This can lead to misinformed decisions and to prioritizing the wrong initiatives within the institutional strategy. Investment in an analytics solution must include data literacy training, so that your staff can truly understand and create meaning from your data.