On Missing Datasets

Mimi Onuoha

Figure 1: Missing Datasets repository

To talk about obfuscation is to talk about negotiations of access to data. But what about cases where the presence of data itself is a luxury? These spaces offer new viewpoints for considering many of the same themes—of surveillance, privacy, power, and access—that obfuscation is concerned with. Because it is in these arenas that much of my work is situated, I have my own term for these spaces where omitted data live: missing datasets.

More specifically, “missing datasets” refer to the blank spots that exist in spaces that are otherwise data-saturated. My interest in them stems from the observation that within many spaces where large amounts of data are collected, there are often correlating empty spaces where data are not being collected.

The word “missing” is inherently normative, it implies both a lack and an ought: something does not exist, but it should. That which should be somewhere is not in its expected place; an established system is disrupted by distinct absence. These absences are significant, for that which we ignore reveals just as much (if not more) than what we give our attention to. It’s in these things that we find cultural and colloquial clues to what is deemed important. Spots that we’ve left blank reveal our hidden biases and indifferences.

In addition, by paying attention to missing datasets we are able examine the wider culture of data gathering; in a world in which data collection is routine, explicit, and the de facto business model for an increasing number of industries, missing datasets force us to consider the spaces that remain removed from this emergent value system.

Why Are They Missing?

Below I present four reasons, accompanied by real-world examples, for why a data set that seems like it should exist might not. Though these are not exhaustive, each reveals the quiet complications inherent within data collection.

  1. Those who have the resources to collect data lack the incentive.
    Police brutality towards civilians in the United States provides a powerful example of this maxim. Though incarceration and crime are among the most data-driven areas of public policy, traditionally there has been little history of standardized and rigorous data collection around brutality in policing. 

    Recently, public interest campaigns like Fatal Encounters and the Guardian’s The Counted have filled this void, and today there is data about the issue [1, 2].  But the fact remains that a task that is arduous and time-consuming for these individuals and organizations would be relatively easy for the law enforcement agents who are most closely tied to the creation of the dataset in the first place—they merely lack any significant incentive to gather it.

  2. The data to be collected resist simple quantification (corollary: we prioritize collecting things that fit our modes of collection).
    The defining tension of data collection is the challenge of defining a messy, organic world in formats that are structured. This complication is magnified for information that is difficult to collect by nature of its very form. For instance, since there’s no reason for other countries to monitor US currency within their countries, and the very nature of cash and the anonymity it affords renders it difficult to track, we don’t know how much US currency is outside of the country’s borders [3].Other subjects resist quantification entirely. Things like emotions are hard to quantify (at this time, at least). Institutional racism is similarly subtle; it often reveals itself more in effect and results than in deliberate acts of malevolence. Not all things are quantifiable, and at times the very desire to render the world more abstract, trackable, and machine-readable is an idea that itself should invite examination.
  1. The act of collection involves more work than the benefit the data is perceived to give.
    Sexual assault and harassment are woefully underreported [4]. While there are many reasons informing this reality, one major one could be that in many cases the very act of reporting sexual assault is an intensive and painful process. For some, the benefit of reporting isn’t perceived to be equal or greater than the cost of the process.
  1. There are advantages to nonexistence.
    To collect, record, and archive aspects of the world is an intentional act. As the concept of obfuscation illustrates, there are situations in which it can be advantageous for a group to remain outside of the oft-narrow bounds of collection. In short, sometimes a missing dataset can function as a form of protection, such as in the case of sanctuary cities deleting identifying data related to undocumented immigrants [5].

Final Thoughts

It is important not to interpret the highlighting of missing datasets as a direct call or invocation to fill these gaps. Rather, the topic lends itself to specific and general considerations of our wider system of data collection.

If we begin from the understanding that there will always be data missing from any collection system, we allow ourselves the space to address resulting patterns of inclusion and exclusion.

For more examples of missing datasets, please visit my GitHub repository [6], and essays on Quartz [7] and Data & Society’s Points blog [8].


[1] “Fatal Encounters,” Fatal Encounters, accessed May 22, 2017, http://www.fatalencounters.org/.

[2] “The Counted: people killed by police in the US,” The Guardian (New York City), https://www.theguardian.com/us-news/ng-interactive/2015/jun/01/the-counted-police-killings-us-database.

[3] United States, Federal Reserve Board, The Federal Reserve Board , by Ruth Judson, November 2012, accessed May 22, 2017, https://www.federalreserve.gov/pubs/ifdp/2012/1058/default.htm.

[4] United States, Federal Reserve Board, The Federal Reserve Board, by Ruth Judson, November 2012, accessed May 22, 2017, https://www.federalreserve.gov/pubs/ifdp/2012/1058/default.htm.

[5] Colin Lecher, “NYC will stop retaining data that could identify immigrants under Trump administration,” The Verge, December 2016, accessed May 22, 2017, https://www.theverge.com/2016/12/8/13882676/new-york-idnyc-trump-de-blasio-data-new-policy.

[6] Mimi Onuoha, “On Missing Datasets,” GitHub, accessed September 29, 2017 https://github.com/MimiOnuoha/missing-datasets.

[7] Mimi Onuoha, “Broadway won’t document its dramatic race problem, so a group of actors spent five years quietly gathering this data themselves,” Quartz, December 4, 2016, accessed September 29, 2017, https://qz.com/842610/broadways-race-problem-is-unmasked-by-data-but-the-theater-industry-is-still-stuck-in-neutral/.

[8] Mimi Onuoha, “The Point of Collection,” Points (Data & Society), February 10, 2016, accessed September 29, 2017, https://points.datasociety.net/the-point-of-collection-8ee44ad7c2fa.