Data Narrative Generator

—the Temporality and Causality in Data Narratives

 

Hane Lee

 

 

1.  Introduction

 

While the term “statistical bias” encompasses any kind of bias that could arise in any of the collection, analysis, presentation, and organization processes regarding data, many may imagine an induced error in the numbers and some neighboring words: “close to barely 1 out of 5 people (22.5%),” “almost 1 out of 4 people (22.5%).” The tweaking of numbers (“statistics”) is certainly debatable on its own, but labeling this issue a statistical bias ultimately directs the search for a solution toward producing less biased numbers. In this boom of big data, however, data is more often analyzed to justify a purpose than to be observed, and the bias starts from the “logical solutions,” the narrative, that the developer is trying to craft.

 

In telling a story, however, the linguistic structure is extremely powerful whether intentional or unintentional. Charlotte Linde, a linguist who has written extensively about narratives and various strategies of creating coherence in her book Life Stories, underlines the importance of temporal sequence and narrative presupposition in creating believable coherence [1]. She gives the following example:

1. I got flustered and I backed the car into a tree.

2. I backed the car into a tree and I got flustered (Linde).

Two clauses are repeated in the two sentences but in alternate sequences. The sequences, however, create a completely different temporal and causal implication between the two clauses. The first sentence could easily imply that my getting flustered caused me to back the car into a tree; the second sentence implies that my backing the car into a tree caused me to get flustered. Either relationship could be true. Either could also be false, and getting flustered and backing the car into a tree may have been two independent events.

 

While these implications may have been developed through the deliberate choice of the writer, they could also have been generated by coincidence and misunderstood unconsciously by the reader. Especially in times of automatically generated content, the sequence is not necessarily deliberately implied, nor critically received. Let us say, for example, that a social media service boosts two articles to the user, one about the black neighborhood being pushed further away from the city center because of expensive housing costs and another about secondary education rates being significantly lower in black households. Would the reader try to infer that lack of wealth led to lack of education? What if the articles were ordered the other way, and the user saw the article about education and then the one about being pushed out? Did lack of education lead to lack of wealth? Which caused the other?

 

Before trying to determine the cause and effect, a question worth asking is whether there is a temporal or causal relationship at all in the first place. When two events are listed, it is easy to infer, by our habits of forming narratives, that the first event happened before the second event. Once we establish the temporal hierarchy, it is not uncommon to jump to inferring that the first event caused the second. Neither temporal nor causal relationships should be presumed; both should be critically reviewed.

 

 

2.  Data Narrative Generator

 

The Data Narrative Generator experiments with the temporal and causal relationships brought by linguistic sequence in automatically generated data narratives. Each sentence that is input to the generator is an analysis drawn from a single table from the 2013 American Housing Survey as reported by the United States Census Bureau [2]. The user chooses a race (black or Hispanic), and the generator gives two sentences in sequence. One of them is a comparative fact related to the household characteristics of the chosen race, and the other is related to more general demographic information. The user is then given the choice to reverse the order of the two sentences to contrast the effect. The following two paragraphs are generated results with the “black” racial choice:

 

Only 21.9% of black households compared to 31.6% of the total population completed a bachelor’s degree or higher. Almost 71% of the surveyed white households were homeowners whereas only 43% of the surveyed black households were homeowners.

 

Almost 71% of the surveyed white households were homeowners whereas only 43% of the surveyed black households were homeowners. Only 21.9% of black households compared to 31.6% of the total population completed a bachelor’s degree or higher.

 

The sentences are direct readings from the table that could easily be unrelated to each other and could perhaps even be automatically generated. Simply placing them side by side, however, suggests a temporal and causal hierarchy once the reader draws inferences from them. The hierarchy is not clearly noticeable, however, when only one of the paragraphs is given.
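The ordering behavior described above can be illustrated with a minimal Python sketch. It is not the generator's actual implementation: the function names, the dictionary layout, and the sentence templates are assumptions made for illustration, and the figures are only the ones quoted in the example paragraphs of this paper.

```python
# Hypothetical sketch of the two-sentence generator with an order toggle.
# All names here are illustrative assumptions, not the original code.
# Figures are copied from the example paragraphs quoted in this paper.
FIGURES = {
    "black": {"bachelor": "21.9%", "homeowner": "43%"},
    "Hispanic": {"bachelor": "14.6%", "homeowner": "47.0%"},
}
TOTAL_BACHELOR = "31.6%"   # total population, bachelor's degree or higher
WHITE_HOMEOWNER = "71%"    # surveyed white households that were homeowners


def education_sentence(race: str) -> str:
    """Comparative fact about education for the chosen race."""
    return (
        f"Only {FIGURES[race]['bachelor']} of {race} households compared to "
        f"{TOTAL_BACHELOR} of the total population completed a bachelor's "
        "degree or higher."
    )


def homeowner_sentence(race: str) -> str:
    """Comparative fact about homeownership for the chosen race."""
    return (
        f"Almost {WHITE_HOMEOWNER} of the surveyed white households were "
        f"homeowners whereas only {FIGURES[race]['homeowner']} of the "
        f"surveyed {race} households were homeowners."
    )


def generate(race: str, reverse: bool = False) -> str:
    """Return the two sentences in sequence, optionally in reversed order."""
    sentences = [education_sentence(race), homeowner_sentence(race)]
    if reverse:
        sentences.reverse()
    return " ".join(sentences)


if __name__ == "__main__":
    print(generate("black"))
    print(generate("black", reverse=True))
```

The two printed paragraphs contain exactly the same sentences; only the `reverse` flag changes which one the reader meets first.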

 

To assign a stronger intention to the paragraphs, the generator offers the choice of adding conjunctions, which give the story a stronger frame. The following two paragraphs are generated results with conjunctions and the “Hispanic” racial choice:

 

There is an interesting relationship between housing conditions and social status of minority groups. Almost 71% of the surveyed white households were homeowners whereas only 47.0% of the surveyed Hispanic households were homeowners. Thereby, only 14.6% of Hispanic households compared to 31.6% of the total population completed a bachelor’s degree or higher.

 

There is an interesting relationship between housing conditions and social status of minority groups. Only 14.6% of Hispanic households compared to 31.6% of the total population completed a bachelor’s degree or higher. Thereby, almost 71% of the surveyed white households were homeowners whereas only 47.0% of the surveyed Hispanic households were homeowners.

 

The first sentence is general enough to encompass any analysis from the table, but it forcefully presupposes that there is a connection between the following two sentences and could prompt the reader to actively create one.
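The framing step can be sketched in the same spirit. The snippet below is a hedged illustration rather than the generator's real code: the `frame` function and its parameters are hypothetical, while the opening sentence and the conjunction “Thereby” are taken from the generated paragraphs quoted above.

```python
# Hypothetical sketch of the framing option: a general opener plus a
# conjunction that links whichever sentence happens to come second.
# The function name and signature are assumptions for illustration only.
OPENER = (
    "There is an interesting relationship between housing conditions "
    "and social status of minority groups."
)


def frame(first: str, second: str, conjunction: str = "Thereby") -> str:
    """Prefix the opener and attach the conjunction to the second sentence."""
    if not second:
        raise ValueError("second sentence must not be empty")
    # Lower-case the first letter so the sentence reads on after the comma.
    linked = f"{conjunction}, {second[0].lower()}{second[1:]}"
    return " ".join([OPENER, first, linked])


if __name__ == "__main__":
    homeowners = (
        "Almost 71% of the surveyed white households were homeowners whereas "
        "only 47.0% of the surveyed Hispanic households were homeowners."
    )
    education = (
        "Only 14.6% of Hispanic households compared to 31.6% of the total "
        "population completed a bachelor's degree or higher."
    )
    print(frame(homeowners, education))
    print(frame(education, homeowners))
```

Note that the function never inspects what the sentences say: the causal-sounding link is attached purely by position, which is the effect the paragraphs above are meant to expose.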

 

 

3.  Discussion

 

While singling out the two races and openly creating rather “disparaging” remarks may seem like a strange, perhaps discriminatory, idea, it was directly derived from the data table itself. The row titles contain household characteristics such as race, age, citizenship, number of members, and household composition, while the column titles include tenure, regions, and central city versus non-central. The column titles, however, also redundantly single out only “Black alone” and “Hispanic” among race categories, placing them side by side with Elderly and below poverty households so that they can be viewed against all the row titles. This pointed attention was designed specifically for such racial comparisons.

 

Data certainly could be, and has been, used to create far more intentionally malicious stories than these subtle examples. However, this paper focuses on the unintentional interpretation of unintentionally generated data narratives because we live amid a growing abundance of them, and they can often have harmful consequences without any intention. More effort needs to be put toward systematically discovering and appropriately addressing the inherent biases.

 

Compared to machine learning methods, traditional statistics allows a better observation of the framework of the data and offers formalized tools to assess both the process and the result. This paper, however, refrains from delving deeper into statistical concepts for a few reasons. First, while minute details and subtleties regarding the presentation of data are a frequent topic of discussion among statisticians, they do not formally consider a linguistic or literary approach, especially from the receiver’s point of view. Second, the currently prevalent methods do not necessarily focus on statistical methods. They make extensive use of specific concepts, but they try to avoid mathematically and computationally expensive processes. Therefore, it seemed more relevant to both fields to focus on a more fundamental issue. Third, the contemporary audience of data analysis has widened significantly to include the general public. To raise awareness of the general question of data and AI ethics, it is important to be critical about this topic from various nonprofessional angles as well.

4.  Conclusion

 

The linguistic, narrative perspective regarding data can be valuable because it prioritizes understanding the story that is ultimately perceived by the receiver, whereas most current data narratives rarely extend beyond the developer’s intended use.

For future work, further analysis of how each stage of data collection, processing, interpretation, and presentation can affect the data narrative would be valuable in understanding the final narrative imprint that a dataset could have on the receiver after being influenced by many different institutions and values, including but not limited to the government, academia, news media, social media, and the market.

5.  References

 

[1] Linde, Charlotte. Life Stories: the Creation of Coherence. New York, Oxford: Oxford University Press, 1993. Print.

[2] U.S. Census Bureau. American Housing Survey. 2013. Web. Available: https://www.census.gov/
