Let data tell us the true stories behind stories

Machine learning is changing what we think of as literature, as researchers are starting to take an algorithmic approach to their field. Advances in natural language processing and digitisation of text make it possible to study literature using a “big data” lens.

Forget about the hero’s journey. Emotional arcs of stories are dominated by six basic shapes(1). Stop the unproductive debates about unknown authorship (or plagiarism). Beowulf is the work of a single author.


Stylometry, the science of determining the statistical characteristics of the style of texts, is based on the observation that each of us uses the same language in a slightly different way. The individuality of any person manifests itself clearly in the way they use a surprisingly small number of words.

For example, George R. R. Martin used 22,000 distinct words to tell his massive saga A Song of Ice and Fire (close to 1.8 million words), a variability of around 1%. For comparison, Shakespeare’s ‘Hamlet’ is 30,000 words long and uses 4,200 distinct words, an impressive variability of 13%! Martin, please, use fewer words. You can also use network theory to make sense of the characters’ relationships.
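The variability figures quoted above look like what stylometry calls a type-token ratio: distinct words divided by total words. That reading is an assumption, since the post does not define the term. Here is a minimal Python sketch under that assumption. The function names and the regex tokenizer are illustrative choices, not anyone's published method. With the rounded counts from this post the ratios come out at about 1.2% and 14%, close to the 1% and 13% quoted. Keep in mind that this ratio tends to fall as a text gets longer, so a 1.8-million-word saga and a 30,000-word play are not on equal footing.

```python
import re


def type_token_ratio(text: str) -> float:
    """Share of distinct words (types) among all words (tokens)."""
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def ratio_from_counts(distinct: int, total: int) -> float:
    """Same ratio, computed from word counts that are already known."""
    return distinct / total


# Ratio computed directly from a short piece of text.
sample = "To be, or not to be, that is the question"
sample_ratio = type_token_ratio(sample)

# Ratios from the rounded counts quoted in this post.
asoiaf = ratio_from_counts(22_000, 1_800_000)
hamlet = ratio_from_counts(4_200, 30_000)

print(f"sample: {sample_ratio:.0%}")
print(f"A Song of Ice and Fire: {asoiaf:.1%}")
print(f"Hamlet: {hamlet:.1%}")
```

A fairer comparison would measure both texts over windows of the same length, for example the ratio over the first 30,000 words of each.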

Are there patterns that can predict future events in the books and television series? Can novelists and screenwriters of the future use artificial intelligence to generate new material, enabling a book to be written in weeks instead of years?

Decoding ‘Game of Thrones’ by way of data science

In fact, what avid producers are actually aiming at is automating away writers like Martin. It has also been the unconfessed dream of journals for a while.

One of the mathematical ideas I hate most is classification. In literature, classification is the technology behind a marketing weapon of creativity destruction: literary genres. What does machine learning tell us about genre?

Let’s have a look at science fiction, a recurrent topic in Mind the Post (and now its Spanish sibling Alienímagina). It’s not immediately obvious that Gibson’s Neuromancer, Wells’ Time Machine, and Mary Shelley’s Frankenstein would share much common vocabulary. In an interesting article(2), Ted Underwood tackles this question:

Invocations of scale (“vast,” “far,” “larger”) are very characteristic of science fiction, as are large numbers (“thousands”). Self-conscious references to the “earth” and to things that are “human” tend to accompany “creatures” from which humanity may be distinguished, and the pronoun “its” is common, since we often confront actors who lack an easily recognized human gender. This is not by any means an exhaustive description of the genre — just a taste of the model.

The narrative premise of much historiography is that science fiction was an inchoate phenomenon (scattered across utopias, planetary romances, etc) until given a new shape and direction by particular pulp magazines and anthologies between 1925 and 1950. Hugo Gernsback’s Amazing Stories (1926) often plays a central role. Wolfe says, for instance, “science fiction, despite its healthy legacy throughout the nineteenth century, was essentially a designed genre after 1926.” Even after that point, “the science fiction novel persistently failed to cohere as a genre in the manner of mysteries and Westerns” until The Pocket Book of Science Fiction emerged in 1943. None of these crucial moments of consolidation are visible in the model. Where language is concerned, the half-century from Verne through Gernsback (1875-1925) appears just as coherent and as distinct from other forms of fiction as the period after 1926.
Ted Underwood, Science fiction 1771-1989
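To get a feel for how a list of genre-characteristic words can fall out of text, here is a minimal Python sketch. It is not Underwood's model. It swaps in a much simpler technique: ranking words by a smoothed log-ratio of their relative frequency in one group of texts against another. The function names and the toy sentences are invented for illustration only.

```python
import math
import re
from collections import Counter


def word_counts(docs):
    """Count lowercase alphabetic words across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts


def distinctive_words(target_docs, other_docs, top=5):
    """Rank words by add-one-smoothed log frequency ratio, target vs other."""
    target = word_counts(target_docs)
    other = word_counts(other_docs)
    vocab = set(target) | set(other)
    # Add-one smoothing so words missing from one side do not divide by zero.
    t_total = sum(target.values()) + len(vocab)
    o_total = sum(other.values()) + len(vocab)
    scores = {
        w: math.log((target[w] + 1) / t_total) - math.log((other[w] + 1) / o_total)
        for w in vocab
    }
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]


# Toy sentences made up for this example, not taken from any corpus.
science_fiction = [
    "The vast creature left the earth and its thousands of human cities far behind",
    "Its vast ship was larger than the earth",
]
other_fiction = [
    "She walked to the village and met her sister at the church",
    "He wrote a letter to his mother about the wedding",
]

top_words = distinctive_words(science_fiction, other_fiction, top=3)
print(top_words)
```

On real corpora you would want far more text per genre and some control for publication date and author, which is part of what a predictive model like the one described in the article is meant to handle.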

Let data tell us the true stories behind stories.


(1) Reagan, Andrew J., et al. ‘The Emotional Arcs of Stories Are Dominated by Six Basic Shapes’. EPJ Data Science, vol. 5, no. 1, Dec. 2016, p. 31. arXiv.org, doi:10.1140/epjds/s13688-016-0093-1

(2) Underwood, Ted. The Life Cycles of Genres. Harvard Dataverse, 2016. DOI.org (Datacite), doi:10.7910/dvn/xkqoqm.
