Research Note: Knowing Unknowns in an Age of Incomplete Information

Stanford Data Science Scholar Saurabh Khanna discusses misinformation in his October 27, 2022 research note.

Abstract

The technological revolution of the Internet has digitized the social, economic, political, and cultural activities of billions of humans. While researchers have started paying due attention to concerns of misinformation and bias, these obscure a much less researched and equally insidious problem: that of uncritically consuming incomplete information. The problem of incomplete information consumption stems from the very nature of explicitly ranked information on the Internet, where humans with bounded rationality are left with little choice but to consume the tip of the iceberg. In this study, I leverage the context of Internet search to present ways of quantifying 'information visibility', that is, how much we do not know when we consume information online. I then apply these metrics to 8.4 trillion raw Internet search results extracted from daily search trends across 48 nations for one year. The study finally proposes a prototype of an open-source Internet search platform that aims to enable choice by balancing both relevance and visibility in the information we access.

Research Note

Humans are in the middle of a transition: a transition to life on the Internet. Over the last two decades, our interactions have experienced the beginnings of a computational revolution that is still unfolding. This shift has largely been driven by the Internet, which has effectively digitized the social, economic, political, and cultural activities of billions of people, generating vast repositories of digital data as a byproduct. Its scale is indicated by the more than 6 billion Internet searches originating every day on Google alone, which roughly corresponds to one daily search per human on our planet. The COVID-19 pandemic acted as a powerful catalyst for this already accelerating revolution. The pandemic saw teenagers' daily Internet use for purposes other than schoolwork skyrocket to unprecedented levels. These changes went beyond a specific demographic, as the country saw an overall 47% rise in broadband Internet usage.

While this explosion of freely available information enabled a democratic discourse across space and time, concerns were also raised about the potential harms of the information flowing on the Internet. Scientists across disciplines have made progress studying these concerns along two themes. The first theme pertains to the propagation of misinformation, where the information being propagated differs from the ground truth for a given context. The second theme is the growing focus on algorithmic fairness and the propagation of bias, wherein the information propagated not only differs from the ground truth but can also particularly harm traditionally marginalized populations. Notwithstanding the validity and gravity of the questions addressed by these two themes, both depend on the availability of verifiable ground truths. Given the subjectivity and diversity of opinions expressed on the Internet, verifiable ground truths are more the exception than the norm. It is extremely difficult to evaluate the quality of information on the Internet when the ground truths themselves are unclear, or even nonexistent.

While we have made promising progress on countering misinformation and bias, we have missed out on tackling another potent (and arguably equally insidious) problem: that of consuming incomplete information as a result of information overload. It is no secret that we are inundated with information growing at an alarming rate. Researchers estimate that human knowledge is growing exponentially, doubling every 12 hours at present. Two aspects govern our interactions with the mass of information propagating on the Internet. First, on the stimulus side, all information shown to us on the Internet is ranked by nature. In the context of web search, for instance, the n-th search result ranks higher than the (n+1)-th result. Similarly, in a social media context, the n-th post is more visible than the (n+1)-th one. Second, on the response side, humans are restricted by the bounds of their own rationality. In other words, we lack the mental capacity to keep up with and effectively process the exponentially growing stream of information we face every day. Consequently, we react to this ranked digital information with a strong predilection for the tip of the iceberg, where our clicks roughly follow a power-law distribution.
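To make the "tip of the iceberg" effect concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the share of clicks on the result at a given rank is proportional to rank raised to a negative exponent. The function names and the default exponent of 1.0 are hypothetical choices and are not taken from the research note.

```python
def click_shares(num_results: int, exponent: float = 1.0) -> list[float]:
    """Return the assumed share of clicks for ranks 1..num_results.

    Assumption (illustrative only): clicks on rank r are proportional
    to r ** -exponent, i.e. a simple power law over ranks.
    """
    if num_results < 1:
        raise ValueError("num_results must be at least 1")
    weights = [rank ** -exponent for rank in range(1, num_results + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


def top_n_share(num_results: int, n: int, exponent: float = 1.0) -> float:
    """Return the fraction of all clicks captured by the top n ranks."""
    if not 0 <= n <= num_results:
        raise ValueError("n must be between 0 and num_results")
    return sum(click_shares(num_results, exponent)[:n])


if __name__ == "__main__":
    # Share of clicks going to the first 10 of 100 ranked results.
    print(f"{top_n_share(100, 10):.2f}")
```

Under these assumptions, the first 10 of 100 results attract a little over half of all clicks, even though they make up only a tenth of what is available. A steeper exponent concentrates attention further.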

Given this context, a fundamental question is what proportion of the spectrum remains invisible to us when we navigate digital information. In other words, from a population of N results for a given search query q on the Internet, how representative is viewing just n < N search results? This is different from assessing whether the n search results are misinformative or biased, but it is worth assessing nonetheless. The importance of this question is even more pronounced given its implications for human behavior. Studies have shown rising levels of mental distraction across almost all demographics, a large part of which is driven by the fear of missing out on what we could not see. Additionally, the misinformation and bias literature has highlighted the existence of bias in top web search results, news search results, and social media posts. This is problematic because the Internet sends us ranked information, a ranking that we sample only sparingly, and a ranking that could possibly be biased. The emphasis on 'possibly' pertains to the ambiguity we face in accurately assessing the ground truths in most situations. This in turn can lead to harms of representation, wherein digital systems end up reinforcing the subordination of some groups along the lines of identity.
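The question of how representative n out of N results are can be illustrated with a small Python sketch. It is not the visibility metric used in the study, which this note does not spell out. It assumes, hypothetically, that each ranked result carries a category label (for example, a source domain) and compares the categories seen in the top n against those present in all N.

```python
def viewed_fraction(n: int, total: int) -> float:
    """Fraction of the N available results that were actually viewed."""
    if total < 1 or not 0 <= n <= total:
        raise ValueError("require total >= 1 and 0 <= n <= total")
    return n / total


def category_coverage(labels: list[str], n: int) -> float:
    """Fraction of distinct categories in the full ranked list that
    already appear among the top n results.

    `labels` is a hypothetical input: one category label per result,
    ordered by rank (best first).
    """
    if not labels or not 0 <= n <= len(labels):
        raise ValueError("require a non-empty list and 0 <= n <= len(labels)")
    all_categories = set(labels)
    seen_categories = set(labels[:n])
    return len(seen_categories) / len(all_categories)


if __name__ == "__main__":
    # Six ranked results drawn from four made-up categories.
    ranked_labels = ["a", "a", "b", "a", "c", "d"]
    print(viewed_fraction(3, len(ranked_labels)))
    print(category_coverage(ranked_labels, 3))
```

In this toy example, a reader who stops after three of six results has seen half of the results and half of the categories. With real data the two numbers can diverge sharply, which is the gap a visibility measure would need to capture.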

Taken together, our failure to know how much we do not know when consuming information is a critical loophole in the Internet-enabled systems that support human discourse at an unprecedented scale. As Jamie Susskind points out in his book Future Politics, if we have no control over the flow of information in our society, we have no control over our shared sense of right and wrong. From a philosophical standpoint, the intention to know unknowns is hardly a new line of questioning; it is a centuries-old one. Given the fundamental nature and utility of this question in the current age of information overload, it is surprising that it has received so little research attention. Before answering ‘Is what I know different from the ground truth?’ through research on misinformation and algorithmic bias, we have yet to adequately answer ‘How much do I not even know?’

The information space as we research today (on the left) and as we do not (on the right)
