-- The dangers of choosing our questions based on the data available
I know, I’m a party pooper; I am going to try to convince you to stop and think before you start collecting as much data as you can, or any at all for that matter.
Nowadays, it is quite common to find yourself sitting on a pile of data, trying to figure out ways to use it. Data is cheap to collect, and a great deal of it is collected just for “future purposes”. This practice is prevalent in organizations that have created data-centered products.
Why do we use data in the first place? We can probably agree to some extent that “it helps answer questions and test ideas out”. Therefore, our questions should drive our data, and not the other way around. It is dangerous to form our questions based on the available data.
Danger #1: Data itself is a risky asset. It gets stale and even rots over time, losing value. Also, data can be used in harmful ways if mishandled (or in the wrong hands). Therefore, the more data we hold and the longer we keep it, the higher the risk.
Danger #2: Data is not a raw resource; it is already “processed”. Data is a limited representation of the world from a specific point of view at a particular time. The collection process heavily impacts the data. We must keep in mind the context of collection and be aware of potential limitations: what did we not capture, what possible errors (human and mechanical) might have occurred, and what embedded biases lie within. The representation you collected may not be the best one for your intended purpose (maybe it’s not even a useful one). Rephrasing George Box’s famous quote about models: “all data is wrong, but some is useful (sometimes)”.
In short, data is risky, incomplete, and opinionated. Reusing it can be unhygienic, polluting the new project with old limitations and biases.
Instead of asking “what can I use this data for?”, we should first ask ourselves “what do I want to do?”. The answer will help us assess whether any data already collected is useful for our purposes. It’s vital to understand the risks and limitations that data brings, and how different the context of the data is from the context of our purpose.
Understanding this difference (i.e., “is the collection context similar enough to the context of my purpose?”) is challenging. Context is such a slippery concept that it is hard even to define. However, one method particularly well suited to this challenge is ethnography. Ethnographers study behaviors by examining the social situation and the interactions between participants and their context. They use techniques such as participant observation, interviews, and surveys. Ethnography is already used in many industries because its methods are well equipped to reveal how customers use products and to identify ways to improve those products’ value.
For the purposes of our data-centric projects, ethnography helps us refine our question, “what do I want to do?”, by narrowing the scope and better defining the population we want to study. At this point, data scientists and ethnographers can work side by side to define a collection mechanism (or carefully reuse previously collected data) that is appropriate for the analysis, documenting along the way the potential biases and limitations of the data.
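One lightweight way to do that documenting is to store a description of the collection context right next to the data. Here is a minimal sketch in Python. The names `CollectionContext` and `reuse_concerns`, the fields, and the one-year staleness threshold are all my own assumptions for illustration, not an established standard, and a simple string comparison is obviously no substitute for the judgment an ethnographer brings.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class CollectionContext:
    """Hypothetical record of why, how, when, and about whom data was collected."""
    purpose: str
    population: str
    collected_on: date
    method: str
    known_gaps: List[str] = field(default_factory=list)
    known_biases: List[str] = field(default_factory=list)


def reuse_concerns(ctx: CollectionContext,
                   intended_population: str,
                   today: date,
                   max_age_days: int = 365) -> List[str]:
    """List reasons to be careful before reusing the data for a new purpose."""
    concerns = []
    # The population the data describes may not be the one we want to study.
    if ctx.population != intended_population:
        concerns.append(
            f"population mismatch: collected for '{ctx.population}', "
            f"needed for '{intended_population}'"
        )
    # Data gets stale; the threshold here is an arbitrary illustration.
    age_days = (today - ctx.collected_on).days
    if age_days > max_age_days:
        concerns.append(f"data is {age_days} days old")
    # Carry the documented limitations forward instead of forgetting them.
    concerns.extend(f"known gap: {gap}" for gap in ctx.known_gaps)
    concerns.extend(f"known bias: {bias}" for bias in ctx.known_biases)
    return concerns


if __name__ == "__main__":
    ctx = CollectionContext(
        purpose="checkout funnel analysis",
        population="mobile app users",
        collected_on=date(2020, 1, 1),
        method="in-app event logging",
        known_gaps=["logged-out sessions were not captured"],
    )
    for concern in reuse_concerns(ctx, "all customers", today=date(2022, 1, 1)):
        print(concern)
```

An empty list does not mean the data is safe to reuse; it only means nothing that was written down argues against it. The value lies in forcing the question “what do I want to do?” to be answered before the data is touched.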
In a not-too-distant future, we might have a team of data ethnographers. This team would be able to deal with the holistic nature of context while also understanding the statistical and technical limitations of data.