Modelling Suspense in Short Stories as Uncertainty Reduction over Neural Representation


David Wilmot and Frank Keller
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK
david.wilmot@ed.ac.uk, keller@inf.ed.ac.uk

Abstract

Suspense is a crucial ingredient of narrative fiction, engaging readers and making stories compelling. While there is a vast theoretical literature on suspense, it is computationally not well understood. We compare two ways for modelling suspense: surprise, a backward-looking measure of how unexpected the current state is given the story so far; and uncertainty reduction, a forward-looking measure of how unexpected the continuation of the story is. Both can be computed either directly over story representations or over their probability distributions. We propose a hierarchical language model that encodes stories and computes surprise and uncertainty reduction. Evaluating against short stories annotated with human suspense judgements, we find that uncertainty reduction over representations is the best predictor, resulting in near human accuracy. We also show that uncertainty reduction can be used to predict suspenseful events in movie synopses.

1   Introduction

As current NLP research expands to include longer, fictional texts, it becomes increasingly important to understand narrative structure. Previous work has analyzed narratives at the level of characters and plot events (e.g., Gorinski and Lapata, 2018; Martin et al., 2018). However, systems that process or generate narrative texts also have to take into account what makes stories compelling and enjoyable. We follow a literary tradition that makes And then? (Forster, 1985; Rabkin, 1973) the primary question and regards suspense as a crucial factor of storytelling. Studies show that suspense is important for keeping readers' attention (Khrypko and Andreae, 2011), promotes readers' immersion and suspension of disbelief (Hsu et al., 2014), and plays a big part in making stories enjoyable and interesting (Oliver, 1993; Schraw et al., 2001). Computationally less well understood, suspense has only sporadically been used in story generation systems (O'Neill and Riedl, 2014; Cheong and Young, 2014).

Suspense, intuitively, is a feeling of anticipation that something risky or dangerous will occur; this includes the idea both of uncertainty and of jeopardy. Take the play Romeo and Juliet: dramatic suspense is created throughout, from the initial duel, the meeting at the masquerade ball, and the marriage, to the fight in which Tybalt is killed and the sleeping potions leading to the death of Romeo and Juliet. At each moment, the audience is invested in something being at stake and wonders how it will end.

This paper aims to model suspense in computational terms, with the ultimate goal of making it deployable in NLP systems that analyze or generate narrative fiction. We start from the assumption that concepts developed in psycholinguistics to model human language processing at the word level (Hale, 2001, 2006) can be generalised to the story level to capture suspense; we refer to this as the Hale model. This assumption is supported by the fact that economists have used similar concepts to model suspense in games (Ely et al., 2015; Li et al., 2018), the Ely model. Common to both approaches is the idea that suspense is a form of expectation: in games, we expect to win or lose; in stories, we expect that the narrative will end a certain way.

We will therefore compare two ways for modelling narrative suspense: surprise, a backward-looking measure of how unexpected the current state is given the story so far; and uncertainty reduction, a forward-looking measure of how unexpected the continuation of the story is. Both measures can be computed either directly over story representations, or indirectly over the probability distributions over such representations. We propose a hierarchical language model based on Generative Pre-Training (GPT, Radford et al., 2018) to encode story-level representations and develop an inference scheme that uses these representations to compute both surprise and uncertainty reduction.
For evaluation, we use the WritingPrompts corpus of short stories (Fan et al., 2018), part of which we annotate with human sentence-by-sentence judgements of suspense. We find that surprise over representations and over probability distributions both predict suspense judgements. However, uncertainty reduction over representations is better, resulting in near human-level accuracy. We also show that our models can be used to predict turning points, i.e., major narrative events, in movie synopses (Papalampidi et al., 2019).

2   Related Work

In narratology, uncertainty over outcomes is traditionally seen as suspenseful (e.g., O'Neill, 2013; Zillmann, 1996; Abbott, 2008). Other authors claim that suspense can exist without uncertainty (e.g., Smuts, 2008; Hoeken and van Vliet, 2000; Gerrig, 1989) and that readers feel suspense even when they read a story for the second time (Delatorre et al., 2018), which is unexpected if suspense is uncertainty; this is referred to as the paradox of suspense (Prieto-Pablos, 1998; Yanal, 1996). Considering Romeo and Juliet again, in the first view suspense is motivated primarily by uncertainty over what will happen. Who will be hurt or killed in the fight? What will happen after the marriage? However, at the beginning of the play we are told "from forth the fatal loins of these two foes, a pair of star-crossed lovers take their life", and so the suspense is more about being invested in the plot than about not knowing the outcome, aligning more with the second view: suspense can exist without uncertainty. We do not address the paradox of suspense directly in this paper, but we are guided by the debate to operationalise methods that encompass both views. The Hale model is closer to the traditional model of suspense as being about uncertainty. In contrast, the Ely model is more in line with the second view that uncertainty matters less than consequentially different outcomes.

In NLP, suspense is studied most directly in natural language generation, with systems such as Dramatis (O'Neill and Riedl, 2014) and Suspenser (Cheong and Young, 2014), two planning-based story generators that use the theory of Gerrig and Bernardo (1994) that suspense is created when a protagonist faces obstacles that reduce successful outcomes. Our approach, in contrast, models suspense using general language models fine-tuned on stories, without planning and domain knowledge. The advantage is that the model can be trained on large volumes of available narrative text without requiring expensive annotations, making it more generalisable.

Other work emphasises the role of characters and their development in story understanding (Bamman et al., 2014, 2013; Chaturvedi et al., 2017; Iyyer et al., 2016) or summarisation (Gorinski and Lapata, 2018). A further important element of narrative structure is plot, i.e., the sequence of events in which characters interact. Neural models have explicitly modelled events (Martin et al., 2018; Harrison et al., 2017; Rashkin et al., 2018) or the results of actions (Roemmele and Gordon, 2018; Liu et al., 2018a,b). On the other hand, some neural generation models (Fan et al., 2018) just use a hierarchical model on top of a language model; our architecture follows this approach.

3   Models of Suspense

3.1   Definitions

In order to formalise measures of suspense, we assume that a story consists of a sequence of sentences. These sentences are processed one by one, and the sentence at the current timepoint t is represented by an embedding e_t (see Section 4 for how embeddings are computed). Each embedding is associated with a probability P(e_t). Continuations of the story are represented by a set of possible next sentences, whose embeddings are denoted by e_{t+1}^i.

The first measure of suspense we consider is surprise (Hale, 2001), which in the psycholinguistic literature has been successfully used to predict word-based processing effort (Demberg and Keller, 2008; Roark et al., 2009; Van Schijndel and Linzen, 2018a,b). Surprise is a backward-looking predictor: it measures how unexpected the current word is given the words that preceded it (i.e., the left context). Hale formalises surprise as the negative log of the conditional probability of the current word. For stories, we compute surprise over sentences. As our sentence embeddings e_t include information about the left context e_1, ..., e_{t-1}, we can write Hale surprise as:

    S_t^{Hale} = -\log P(e_t)    (1)
An alternative measure for predicting word-by-word processing effort used in psycholinguistics is entropy reduction (Hale, 2006). This measure is forward-looking: it captures how much the current word changes our expectations about the words we will encounter next (i.e., the right context). Again, we compute entropy at the story level, i.e., over sentences instead of over words. Given a probability distribution over possible next sentences P(e_{t+1}^i), we calculate the entropy of that distribution. Entropy reduction is the change of that entropy from one sentence to the next:

    H_t = -\sum_i P(e_{t+1}^i) \log P(e_{t+1}^i),    U_t^{Hale} = H_{t-1} - H_t    (2)

Note that we follow Frank (2013) in computing entropy over surface strings, rather than over parse states as in Hale's original formulation.

In the economics literature, Ely et al. (2015) have proposed two measures that are closely related to Hale surprise and entropy reduction. At the heart of their theory of suspense is the notion of belief in an end state. Games are a good example: the state of a tennis game changes with each point being played, making a win more or less likely. Ely et al. define surprise as the amount of change from the previous time step to the current time step. Intuitively, large state changes (e.g., one player suddenly comes close to winning) are more surprising than small ones. Representing the state at time t as e_t, Ely surprise is defined as:

    S_t^{Ely} = (e_t - e_{t-1})^2    (3)

Ely et al.'s approach can be adapted for modelling suspense in stories if we assume that each sentence in a story changes the state (the characters, places, events in a story, etc.). States e_t then become sentence embeddings, rather than beliefs in end states, and Ely surprise is the distance between the current embedding e_t and the previous embedding e_{t-1}. In this paper, we will use L1 and L2 distances; other authors (Li et al., 2018) experiment with information gain and KL divergence, but found worse performance when modelling suspense in games. Just like Hale surprise, Ely surprise models backward-looking prediction, but over representations, rather than over probabilities.

Ely et al. also introduce a measure of forward-looking prediction, which they define as the expected difference between the current state e_t and the next state e_{t+1}:

    U_t^{Ely} = E[(e_t - e_{t+1}^i)^2] = \sum_i P(e_{t+1}^i)(e_t - e_{t+1}^i)^2    (4)

This is closely related to Hale entropy reduction, but again the computation is over states (sentence embeddings in our case), rather than over probability distributions. Intuitively, this measure captures how much the uncertainty about the rest of the story is reduced by the current sentence. We refer to the forward-looking measures in Equations (2) and (4) as Hale and Ely uncertainty reduction, respectively.

Ely et al. also suggest versions of their measures in which each state is weighted by a value α_t, thus accounting for the fact that some states may be more inherently suspenseful than others:

    S_t^{αEly} = α_t (e_t - e_{t-1})^2,    U_t^{αEly} = E[α_{t+1}(e_t - e_{t+1}^i)^2]    (5)

We stipulate that sentences with high emotional valence are more suspenseful, as emotional involvement heightens readers' experience of suspense. This can be captured in Ely et al.'s framework by assigning the αs the scores of a sentiment classifier.
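To make these definitions concrete, the sketch below computes Equations (1)-(5) with NumPy. It assumes that the sentence embeddings e_t, the candidate continuation embeddings e_{t+1}^i, their probabilities P(e_{t+1}^i), and the α weights are supplied by the model of Section 4; the function names are illustrative and this is not the released implementation.

    import numpy as np

    def hale_surprise(p_current):
        """Eq. (1): negative log probability of the current sentence."""
        return -np.log(p_current)

    def hale_uncertainty_reduction(p_next_prev, p_next_curr):
        """Eq. (2): drop in entropy over continuations between t-1 and t."""
        def entropy(p):
            p = np.asarray(p)
            return -np.sum(p * np.log(p))
        return entropy(p_next_prev) - entropy(p_next_curr)

    def ely_surprise(e_curr, e_prev, dist="l1"):
        """Eq. (3): distance between consecutive story embeddings (L1 or L2)."""
        diff = np.asarray(e_curr) - np.asarray(e_prev)
        return np.abs(diff).sum() if dist == "l1" else np.square(diff).sum()

    def ely_uncertainty_reduction(e_curr, e_next, p_next, alpha=None, dist="l1"):
        """Eq. (4)/(5): expected distance between the current embedding and each
        candidate continuation, optionally weighted by per-continuation alphas."""
        diff = np.asarray(e_next) - np.asarray(e_curr)        # (n_candidates, dim)
        d = np.abs(diff).sum(-1) if dist == "l1" else np.square(diff).sum(-1)
        if alpha is not None:
            d = np.asarray(alpha) * d
        return float(np.sum(np.asarray(p_next) * d))

Applying these functions at every sentence position yields a per-story suspense curve for each measure; S_t^{αEly} is simply ely_surprise scaled by α_t.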
3.2   Modelling Approach

We now need to show how to compute the surprise and uncertainty reduction measures introduced in the previous section. This involves building a model that processes stories sentence by sentence, and assigns each sentence an embedding that encodes the sentence and its preceding context, as well as a probability. These outputs can then be used to compute a surprise value for the sentence.

Furthermore, the model needs to be able to generate a set of possible next sentences (story continuations), each with an embedding and a probability. Generating upcoming sentences is potentially very computationally expensive, since the number of continuations grows exponentially with the number of future time steps. As an alternative, we can therefore sample possible next sentences from a corpus and use the model to assign them embeddings and probabilities. Both of these approaches will produce sets of upcoming sentences, which we can then use to compute uncertainty reduction. While we have so far only talked about the next sentences, we will also experiment with uncertainty reduction computed using longer rollouts.
[Figure 1: Architecture of our hierarchical model. See text for explanation of the components word_enc, sent_enc, and story_enc.]

4   Model

4.1   Architecture

Our overall approach leverages contextualised language models, which are a powerful tool in NLP when pretrained on large amounts of text and fine-tuned on a specific task (Peters et al., 2018; Devlin et al., 2019). Specifically, we use Generative Pre-Training (GPT, Radford et al., 2018), a model which has proved successful in generation tasks (Radford et al., 2019; See et al., 2019).

Hierarchical Model   Previous work found that hierarchical models show strong performance in story generation (Fan et al., 2018) and understanding tasks (Cai et al., 2017). The language model and hierarchical encoders we use are unidirectional, which matches the incremental way in which human readers process stories when they experience suspense.

Figure 1 depicts the architecture of our hierarchical model.¹ It builds a chain of representations that anticipates what will come next in a story, allowing us to infer measures of suspense. For a given sentence, we use GPT as our word encoder (word_enc in Figure 1), which turns each word in a sentence into a word embedding w_i. Then, we use an RNN (sent_enc) to turn the word embeddings of the sentences into a sentence embedding γ_i. Each sentence is represented by the hidden state of its last word, which is then fed into a second RNN (story_enc) that computes a story embedding. The overall story representation is the hidden state of its last sentence. Crucially, this model also gives us e_t, a contextualised representation of the current sentence at point t in the story, to compute surprise and uncertainty reduction.

Model training includes a generative loss ℓ_gen to improve the quality of the sentences generated by the model. We concatenate the word representations w_j for all word embeddings in the latest sentence with the latest story embedding e_max(t). This is run through affine ELU layers to produce enriched word embedding representations, analogous to the Deep Fusion model (Gülçehre et al., 2015), with story state instead of a translation model. The related Cold Fusion approach (Sriram et al., 2018) proved inferior.

¹ Model code and scripts for evaluation are available at https://github.com/dwlmt/Story-Untangling/tree/acl-2020-dec-submission
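The chain in Figure 1 can be outlined in PyTorch as follows. This is a simplified sketch rather than the released code linked in the footnote: the word_encoder argument stands in for GPT and is assumed to return one 768-dimensional vector per word, and the fusion layers and generation head are omitted.

    import torch
    import torch.nn as nn

    class HierarchicalStoryEncoder(nn.Module):
        """Sketch of the word_enc -> sent_enc -> story_enc chain in Figure 1."""

        def __init__(self, word_encoder, hidden_size=768, num_layers=4):
            super().__init__()
            self.word_enc = word_encoder  # stand-in for GPT, returns word embeddings
            self.sent_enc = nn.LSTM(hidden_size, hidden_size,
                                    num_layers=num_layers, batch_first=True)
            self.story_enc = nn.LSTM(hidden_size, hidden_size,
                                     num_layers=num_layers, batch_first=True)

        def forward(self, sentences):
            # sentences: list of (1, n_words) token-id tensors, one per sentence
            sentence_vectors = []
            for token_ids in sentences:
                words = self.word_enc(token_ids)            # (1, n_words, hidden)
                _, (h_n, _) = self.sent_enc(words)
                sentence_vectors.append(h_n[-1])            # hidden state at the last word
            gamma = torch.stack(sentence_vectors, dim=1)    # (1, n_sentences, hidden)
            e, _ = self.story_enc(gamma)                    # contextualised e_1 ... e_t
            return e.squeeze(0)                             # one embedding per sentence

In the paper, the LSTM variant uses four layers in both sent_enc and story_enc with a hidden size of 768, and the GRU variant two layers; both are followed by two fusion layers, which this sketch omits.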

Loss Functions   To obtain the discriminatory loss ℓ_disc for a particular sentence s in a batch, we compute the dot product of all the story embeddings e in the batch, and then take the cross-entropy across the batch with the correct next sentence:

    \ell_{disc}(e_{t+1}^{i=s}) = -\log \frac{\exp(e_{t+1}^{i=s} \cdot e_t)}{\sum_i \exp(e_{t+1}^i \cdot e_t)}    (6)

Modelled on Quick Thoughts (Logeswaran and Lee, 2018), this forces the model to maximise the dot product of the correct next sentence versus other sentences in the same story and negative examples from other stories, and so encourages representations that anticipate what happens next.

The generative loss in Equation (7) is a standard LM loss, where w_j are the GPT word embeddings from the sentence and e_max(t) is the story context that each word is concatenated with:

    \ell_{gen} = -\sum_j \log P(w_j \mid w_{j-1}, w_{j-2}, \ldots; e_{max(t)})    (7)

The overall loss is ℓ_disc + ℓ_gen. More advanced generation losses (e.g., Zellers et al., 2019) could be used, but are an order of magnitude slower.
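A minimal PyTorch sketch of the two losses, assuming that the candidate next-sentence embeddings for the batch and the language-model logits have already been computed; the function and variable names are illustrative.

    import torch
    import torch.nn.functional as F

    def discriminatory_loss(e_t, candidate_next, correct_index):
        """Eq. (6): cross-entropy over dot products between the current story
        embedding and every candidate next-sentence embedding in the batch."""
        logits = candidate_next @ e_t                        # (n_candidates,)
        target = torch.tensor([correct_index])
        return F.cross_entropy(logits.unsqueeze(0), target)

    def generative_loss(word_logits, target_ids):
        """Eq. (7): standard LM cross-entropy over the words of the latest
        sentence, conditioned on the story embedding via the fusion layers."""
        return F.cross_entropy(word_logits, target_ids)

    # Overall training objective, as in the paper: loss = l_disc + l_gen.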
4.2   Inference

We compute the measures of surprise and uncertainty reduction introduced in Section 3.1 using the output of the story encoder story_enc. In addition to the contextualised sentence embeddings e_t, this requires their probabilities P(e_t) and a distribution over alternative continuations P(e_{t+1}^i).
We implement a recursive beam search over a tree of future sentences in the story, looking between one and three sentences ahead (rollout). The probability is calculated using the same method as the discriminatory loss, but with the cosine similarity rather than the dot product of the embeddings e_t and e_{t+1}^i fed into a softmax function. We found that cosine outperformed the dot product at inference, as the resulting probability distribution over continuations is less concentrated.
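For a single rollout step, the continuation distribution can be sketched as a softmax over cosine similarities; the candidate embeddings are assumed to come from generation or corpus sampling.

    import torch
    import torch.nn.functional as F

    def continuation_probabilities(e_t, candidates):
        """P(e^i_{t+1}): softmax over cosine similarities between the current
        story embedding and each candidate continuation embedding."""
        sims = F.cosine_similarity(candidates, e_t.unsqueeze(0), dim=-1)
        return F.softmax(sims, dim=-1)

These probabilities are the P(e_{t+1}^i) terms that enter Equations (2) and (4).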

5   Methods

Dataset   The overall goal of this work is to test whether the psycholinguistic and economic theories introduced in Section 3 are able to capture human intuitions of suspense. For this, it is important to use actual stories which were written by authors with the aim of being engaging and interesting. Some of the story datasets used in NLP do not meet this criterion; for example, ROC Cloze (Mostafazadeh et al., 2016) is not suitable because the stories are very short (five sentences), lack naturalness, and are written by crowdworkers to fulfill narrow objectives, rather than to elicit reader engagement and interest. A number of authors have also pointed out technical issues with such artificial corpora (Cai et al., 2017; Sharma et al., 2018).

Instead, we use WritingPrompts (Fan et al., 2018), a corpus of circa 300k short stories from the /r/WritingPrompts subreddit. These stories were created as an exercise in creative writing, resulting in stories that are interesting, natural, and of suitable length. The original split of the data into 90% train, 5% development, and 5% test was used. Pre-processing steps are described in Appendix A.

Annotation   To evaluate the predictions of our model, we selected 100 stories each from the development and test sets of the WritingPrompts corpus, such that each story was between 25 and 75 sentences in length. Each sentence of these stories was judged for narrative suspense; five master workers from Amazon Mechanical Turk annotated each story after reading instructions and completing a training phase. They read one sentence at a time and provided a suspense judgement using the five-point scale consisting of Big Decrease in suspense (1% of the cases), Decrease (11%), Same (50%), Increase (31%), and Big Increase (7%). In contrast to prior work (Delatorre et al., 2018), a relative rather than an absolute scale was used. Relative judgements are easier to make while reading, though in practice the suspense curves generated are very similar, with a long upward trajectory and a flattening or dip near the end. After finishing a story, annotators had to write a short summary of the story.

In the instructions, suspense was framed as dramatic tension, as pilot annotations showed that the term suspense was too closely associated with murder mystery and related genres. Annotators were asked to take the character's perspective when reading, to achieve stronger inter-annotator agreement and align closely with literary notions of suspense. During training, all workers had to annotate a test story and achieve 85% accuracy before they could continue. Full instructions and the training story are in Appendix B.

The inter-annotator agreement α (Krippendorff, 2011) was 0.52 and 0.57 for the development and test sets, respectively. Given the inherently subjective nature of the task, this is substantial agreement. This was achieved after screening out and replacing annotators who had low agreement for the stories they annotated (mean α < 0.35), showed suspiciously low reading times (mean RT < 600 ms per sentence), or whose story summaries indicated low-quality annotation.

Training and Inference   The training used SGD with Nesterov momentum (Sutskever et al., 2013), a learning rate of 0.01, and a momentum of 0.9. Models were run with early stopping based on the mean of the accuracies of the training tasks. For each batch, 50 sentence blocks from two different stories were chosen to ensure that the negative examples in the discriminatory loss include easy (other stories) and difficult (same story) sentences.
We used the pretrained GPT weights but fine-tuned the encoder and decoder weights on our task. For the RNN components of our hierarchical model, we experimented with both GRU (Chung et al., 2015) and LSTM (Hochreiter and Schmidhuber, 1997) variants. The GRU model had two layers in both sent_enc and story_enc; the LSTM model had four layers each in sent_enc and story_enc. Both had two fusion layers, and the size of the hidden layers for both model variants was 768. We give the results of both variants on the tasks of sentence generation and sentence discrimination in Table 1. Both perform similarly, with slightly worse loss for the LSTM variant, but faster training and better generation accuracy. Overall, model performance is strong: the LSTM variant picks out the correct sentence 54% of the time and generates it 46% of the time. This indicates that our architecture successfully captures the structure of stories.

                                       GRU      LSTM
    Loss                               5.84     5.90
    Discriminatory Accuracy            0.55     0.54
    Discriminatory Accuracy k = 10     0.68     0.68
    Generative Accuracy                0.37     0.46
    Generative Accuracy k = 10         0.85     0.85
    Cosine Similarity                  0.48     0.50
    L2 Distance                        1.73     1.59
    Number of Epochs                   4        2

Table 1: For accuracy, the baseline probability is 1 in 99; k = 10 is the accuracy of the top 10 sentences of the batch. From the best epoch of training on the WritingPrompts development set.

At inference time, we obtained a set of story continuations either by random sampling or by generation. Random sampling means that n sentences were selected from the corpus and used as continuations. For generation, sentences were generated using top-k sampling (with k = 50) with the GPT language model and the approach of Radford et al. (2019), which generates better output than beam search (Holtzman et al., 2018) and can outperform a decoder (See et al., 2019). For generation, we used up to 300 words as context, enriched with the story sentence embeddings from the corresponding points in the story. For rollouts of one sentence, we generated 100 possibilities at each step; for rollouts of two, 50 possibilities; and for rollouts of three, 25 possibilities. This keeps what is an expensive inference process manageable.

Importance   We follow Ely et al. in evaluating weighted versions of their surprise and uncertainty reduction measures, S_t^{αEly} and U_t^{αEly} (see Equation (5)). We obtain the α_t values by taking the sentiment scores assigned by the VADER sentiment classifier (Hutto and Gilbert, 2014) to each sentence and multiplying them by 1.0 for positive sentiment and 2.0 for negative sentiment. The stronger negative weighting reflects the observation that negative consequences can be more important than positive ones (O'Neill, 2013; Kahneman and Tversky, 2013).
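The α values can be computed with the vaderSentiment package. The sketch below applies the 1.0/2.0 weighting described above; taking the magnitude of VADER's compound score is an assumption of this illustration.

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    def alpha_weight(sentence):
        """Sentiment-based alpha: negative sentences count twice as strongly
        as positive ones; using the magnitude of the score is assumed here."""
        compound = analyzer.polarity_scores(sentence)["compound"]
        return abs(compound) * (2.0 if compound < 0 else 1.0)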
Baselines   We test a number of baselines as alternatives to surprise and uncertainty reduction derived from our hierarchical model. These baselines also reflect how much change occurs from one sentence to the next in a story: WordOverlap is the Jaccard similarity between the two sentences, GloveSim is the cosine similarity between the averaged Glove (Pennington et al., 2014) word embeddings of the two sentences, and GPTSim is the cosine similarity between the GPT embeddings of the two sentences. The α baseline is the weighted VADER sentiment score.

6   Results

6.1   Narrative Suspense

Task   The annotator judgements are relative (the amount of decrease or increase in suspense from sentence to sentence), but the model predictions are absolute values. We could convert the model predictions into discrete categories, but this would fail to capture the overall arc of the story. Instead, we convert the relative judgements into absolute suspense values, where J_t = j_1 + ... + j_t is the absolute value for sentence t and j_1, ..., j_t are the relative judgements for sentences 1 to t. We use -0.2 for Big Decrease, -0.1 for Decrease, 0 for Same, 0.1 for Increase, and 0.2 for Big Increase.² Both the absolute suspense judgements and the model predictions are normalised by converting them to z-scores.

² These values were fitted with predictions (or cross-worker annotations) using 5-fold cross-validation and an L1 loss to optimise the mapping. A constraint is placed so that Same is 0, increases are positive, and decreases are negative, with a minimum distance of 0.05 between values.

To compare model predictions and absolute suspense values, we use Spearman's ρ (Sen, 1968) and Kendall's τ (Kendall, 1975). Rank correlation is preferred because we are interested in whether human annotators and models view the same parts of the story as more or less suspenseful; also, rank correlation methods are good at detecting trends. We compute ρ and τ between the model predictions and the judgements of each of the annotators (i.e., five times for five annotators), and then take the average. We then average these values again over the 100 stories in the test or development sets. As the human upper bound, we compute the mean pairwise correlation of the five annotators.
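A sketch of this evaluation protocol using SciPy's rank correlations; the fixed step values below are the ones quoted in the text rather than the cross-validated mapping of footnote 2, and the input formats are illustrative.

    import numpy as np
    from scipy.stats import kendalltau, spearmanr

    STEP = {"Big Decrease": -0.2, "Decrease": -0.1, "Same": 0.0,
            "Increase": 0.1, "Big Increase": 0.2}

    def to_absolute(relative_judgements):
        """J_t = j_1 + ... + j_t over the relative labels, then z-scored."""
        j = np.cumsum([STEP[label] for label in relative_judgements])
        return (j - j.mean()) / j.std()

    def story_correlation(model_scores, annotator_judgements):
        """Mean Kendall tau and Spearman rho of the model against each annotator."""
        z = (np.asarray(model_scores) - np.mean(model_scores)) / np.std(model_scores)
        taus, rhos = [], []
        for judgements in annotator_judgements:
            j = to_absolute(judgements)
            tau, _ = kendalltau(z, j)
            rho, _ = spearmanr(z, j)
            taus.append(tau)
            rhos.append(rho)
        return float(np.mean(taus)), float(np.mean(rhos))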
Results   Figure 2 shows surprise and uncertainty reduction measures and human suspense judgements for an example story (text and further examples in Appendix C). We performed model selection using the correlations on the development set, which are given in Table 2. We experimented with all the measures introduced in Section 3.1, computing sets of alternative sentences either using generated continuations (Gen) or continuations sampled from the corpus (Cor), except for S^{Ely}, which can be computed without alternatives. We compared the LSTM and GRU variants (see Section 4) and experimented with rollouts of up to three sentences. We tried L1 and L2 distance for the Ely measures, but only report L1, which always performed better.

[Figure 2: Story 27, Human, S^{Hale}, S^{Ely}, U^{Ely}, U^{αEly}. Solid lines: generated alternative continuations; dashed lines: sampled alternative continuations.]

    Prediction    Model        Rollout   τ↑      ρ↑
    Human                                .553    .614
    Baselines     WordOverlap  1         .017    .026
                  GloveSim     1         .017    .029
                  GPTSim       1         .021    .031
                  α            1         .024    .036
    S^Hale-Gen    GRU          1         .145    .182
                  LSTM         1         .434    .529
    S^Hale-Cor    GRU          1         .177    .214
                  LSTM         1         .580    .675
    U^Hale-Gen    GRU          1         .036    .055
                  LSTM         1         .009    .016
    U^Hale-Cor    GRU          1         .048    .050
                  LSTM         1         .066    .094
    S^Ely         GRU          1         .484    .607
                  LSTM         1         .427    .539
    S^αEly        GRU          1         .089    .123
                  LSTM         1         .115    .156
    U^Ely-Gen     GRU          1         .241    .161
                               2         .304    .399
                  LSTM         1         .610    .698
                               2         .393    .494
    U^Ely-Cor     GRU          1         .229    .264
                               2         .512    .625
                               3         .515    .606
                  LSTM         1         .594    .678
                               2         .564    .651
                               3         .555    .645
    U^αEly-Gen    GRU          1         .216    .124
                               2         .219    .216
                  LSTM         1         .474    .604
                               2         .316    .418
    U^αEly-Cor    GRU          1         .205    .254
                               2         .365    .470
                  LSTM         1         .535    .642
                               2         .425    .534

Table 2: Development set results for WritingPrompts for generated (Gen) or corpus-sampled (Cor) alternative continuations; α indicates sentiment weighting. Bold: best model in a given category; red: best model overall.

Discussion   On the development set (see Table 2), we observe that all baselines perform poorly, indicating that distance between simple sentence representations or raw sentiment values do not model suspense. We find that Hale surprise S^{Hale} performs well, reaching a maximum ρ of .675 on the development set. Hale uncertainty reduction U^{Hale}, however, performs consistently poorly. Ely surprise S^{Ely} also performs well, reaching a similar value to Hale surprise. Overall, Ely uncertainty reduction U^{Ely} is the strongest performer, with ρ = .698, numerically outperforming the human upper bound.

Some other trends are clear from the development set: using GRUs reduces performance in all cases but one; a rollout of more than one never leads to an improvement; and sentiment weighting (prefix α in the table) always reduces performance, as it introduces considerable noise (see Figure 2). We therefore eliminate the models that correspond to these settings when we evaluate on the test set.

For the test set results in Table 3 we also report upper and lower confidence bounds computed using the Fisher Z-transformation (p < 0.05). On the test set, U^{Ely} is again the best measure, with a correlation statistically indistinguishable from human performance (based on CIs).
We find that absolute correlations are higher on the test set, presumably reflecting the higher human upper bound.

    Prediction    τ↑             ρ↑
    Human         .652 (.039)    .711 (.033)
    S^Hale-Gen    .407 (.089)    .495 (.081)
    S^Hale-Cor    .454 (.085)    .523 (.079)
    U^Hale-Gen    .036 (.102)    .051 (.102)
    U^Hale-Cor    .061 (.100)    .088 (.101)
    S^Ely         .391 (.092)    .504 (.082)
    U^Ely-Gen     .620 (.067)    .710 (.053)
    U^Ely-Cor     .605 (.069)    .693 (.056)

Table 3: Test set results for WritingPrompts for generated (Gen) or corpus-sampled (Cor) continuations. LSTM with rollout one; brackets: confidence intervals.

Overall, we conclude that our hierarchical architecture successfully models human suspense judgements on the WritingPrompts dataset. The overall best predictor is U^{Ely}, uncertainty reduction computed over story representations. This measure combines the probability of continuation (S^{Hale}) with distance between story embeddings (S^{Ely}), which are both good predictors in their own right. This finding supports the theoretical claim that suspense is an expectation over the change in future states of a game or a story, as advanced by Ely et al. (2015).

6.2   Movie Turning Points

Task and Dataset   An interesting question is whether the peaks in suspense in a story correspond to important narrative events. Such events are sometimes called turning points (TPs) and occur at certain positions in a movie according to screenwriting theory (Cutting, 2016). A corpus of movie synopses annotated with turning points is available in the form of the TRIPOD dataset (Papalampidi et al., 2019). We can therefore test whether surprise or uncertainty reduction predicts TPs in TRIPOD. As our model is trained on a corpus of short stories, this will also serve as an out-of-domain evaluation.

Papalampidi et al. (2019) assume five TPs: 1. Opportunity, 2. Change of Plans, 3. Point of no Return, 4. Major Setback, and 5. Climax. They derive a prior distribution of TP positions from their test set, and use this to constrain predicted turning points to windows around these prior positions. We follow this approach and select as the predicted TP the sentence with the highest surprise or uncertainty reduction value within a given constrained window. We report the same baselines as in the previous experiment, as well as the Theory Baseline, which uses screenwriting theory to predict where in a movie a given TP should occur (e.g., the Point of No Return theoretically occurs 50% of the way through the movie). This baseline is hard to beat (Papalampidi et al., 2019).
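Turning point selection then reduces to a constrained argmax over the suspense curve. In the sketch below, the expected relative position of each TP and the window width are assumed to come from the prior of Papalampidi et al. (2019); the 10% window is illustrative.

    import numpy as np

    def predict_turning_point(suspense, expected_position, window=0.1):
        """Pick the sentence with the highest suspense value inside a window
        around the TP's expected relative position (both given in [0, 1])."""
        scores = np.asarray(suspense)
        n = len(scores)
        centre = expected_position * (n - 1)
        lo = max(0, int(round(centre - window * n)))
        hi = min(n, int(round(centre + window * n)) + 1)
        return lo + int(np.argmax(scores[lo:hi]))

    # e.g., the Point of No Return is theoretically expected about halfway through:
    # predict_turning_point(u_ely_scores, expected_position=0.5)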
Results and Discussion   Figure 3 plots both gold standard and predicted TPs for a sample movie synopsis (text and further examples in Appendix D). The results on the TRIPOD development and test sets are reported in Table 4 (we report both due to the small number of synopses in TRIPOD). We use our best LSTM model with a rollout of one; the distance measure for Ely surprise and uncertainty reduction is now the L2 distance, as it outperformed L1 on TRIPOD. We report results in terms of D, the normalised distance between gold standard and predicted TP positions.

    Model             Dev D ↓          Test D ↓
    Human             Not reported     4.30 (3.43)
    Theory Baseline   9.65 (0.94)      7.47 (3.42)
    TAM               7.11 (1.71)      6.80 (2.63)
    WordOverlap       13.9 (1.45)      12.7 (3.13)
    GloveSim          10.2 (0.74)      10.4 (2.54)
    GPTSim            16.8 (1.47)      18.1 (4.71)
    α                 11.3 (1.24)      11.2 (2.67)
    S^Hale-Gen        8.27 (0.68)      8.72 (2.27)
    U^Hale-Gen        10.9 (1.02)      10.69 (3.66)
    S^Ely             9.54 (0.56)      9.01 (1.92)
    S^αEly            9.95 (0.78)      9.54 (2.76)
    U^Ely-Gen         8.75 (0.76)      8.38 (1.53)
    U^Ely-Cor         8.74 (0.76)      8.50 (1.69)
    U^αEly-Gen        8.80 (0.61)      7.84 (3.34)
    U^αEly-Cor        8.61 (0.68)      7.78 (1.61)

Table 4: TP prediction on the TRIPOD development and test sets. D is the normalised distance to the gold standard; CI in brackets.

On the test set, the best performing model, with D = 7.78, is U^{αEly}-Cor, with U^{αEly}-Gen only slightly worse. It is outperformed by TAM, the best model of Papalampidi et al. (2019), which however requires TP annotation at training time.
Results and Discussion  Figure 3 plots both gold standard and predicted TPs for a sample movie synopsis (text and further examples in Appendix D). The results on the TRIPOD development and test sets are reported in Table 4 (we report both due to the small number of synopses in TRIPOD). We use our best LSTM model with a rollout of one; the distance measure for Ely surprise and uncertainty reduction is now L2 distance, as it outperformed L1 on TRIPOD. We report results in terms of D, the normalised distance between gold standard and predicted TP positions.

Table 4: TP prediction on the TRIPOD development and test sets. D is the normalised distance to the gold standard; CI in brackets.

Figure 3: Movie 15 Minutes, S^Hale, S^Ely, U^Ely, U^αEly, ◆ theory baseline, ⭑ TP annotations, triangles are predicted TPs. (Plot of suspense against sentence number.)

On the test set, the best performing model with D = 7.78 is U^αEly-Cor, with U^αEly-Gen only slightly worse. It is outperformed by TAM, the best model of Papalampidi et al. (2019), which however requires TP annotation at training time. U^αEly-Cor is close to the Theory Baseline on the test set, an impressive result given that our model has no TP supervision and is trained on a different domain. The fact that models with sentiment weighting (prefix α) perform well here indicates that turning points often have an emotional resonance as well as being suspenseful.

7  Conclusions

Our overall findings suggest that by implementing concepts from psycholinguistic and economic theory, we can predict human judgements of suspense in storytelling. That uncertainty reduction (U^Ely) outperforms probability-only (S^Hale) and state-only (S^Ely) surprise suggests that, while consequential state change is of primary importance for suspense, the probability distribution over the states is also a necessary factor. Uncertainty reduction therefore captures the view of suspense as reducing paths to a desired outcome, with more consequential shifts as the story progresses (O'Neill and Riedl, 2014; Ely et al., 2015; Perreault, 2018). This is more in line with the Smuts (2008) Desire-Frustration view of suspense, where uncertainty is secondary.

Strong psycholinguistic claims about suspense are difficult to make due to several weaknesses in our approach, which highlight directions for future research: the proposed model does not have a higher-level understanding of event structure; most likely it picks up the textual cues that accompany dramatic changes in the text. One strand of further work is therefore analysis: text could be artificially manipulated using structural changes, for example by switching the order of sentences, mixing multiple stories, including a summary at the beginning that foreshadows the work, masking key suspenseful words, or paraphrasing. An analogue of this would be adversarial examples used in computer vision. Additional annotations, such as how certain readers are about the outcome of the story, may also be helpful in better understanding the relationship between suspense and uncertainty. Automated interpretability methods, as proposed by Sundararajan et al. (2017), could shed further light on models' predictions.

The recent success of language models in wide-ranging NLP tasks (e.g., Radford et al., 2019) has shown that language models are capable of learning semantically rich information implicitly. However, generating plausible future continuations is an essential part of the model. In text generation, Fan et al. (2019) have found that explicitly incorporating coreference and structured event representations into generation produces more coherent generated text. A more sophisticated model would incorporate similar ideas.

Autoregressive models that generate step-by-step alternatives for future continuations are computationally impractical for longer rollouts and are not cognitively plausible. They also differ from the Ely et al. (2015) conception of suspense, which is in terms of Bayesian beliefs over a longer-term future state, not step by step. There is much recent work on state-space approaches that model beliefs as latent states using variational methods (e.g., Ha and Schmidhuber, 2018; Gregor et al., 2019). In principle, these would avoid the brute-force calculation of a rollout, and conceptually, anticipating longer-term states aligns with theories of suspense.

Related tasks, such as inverting the understanding of suspense to utilise the models in generating more suspenseful stories, may also prove fruitful.

This paper is a baseline that demonstrates how modern neural network models can implicitly represent text meaning and be useful in a narrative context without recourse to supervision. It provides a springboard to further interesting applications and research on suspense in storytelling.

Acknowledgments

The authors would like to thank the anonymous reviewers, Pinelopi Papalampidi and David Hodges for reviews of the annotation task, the AMT annotators, and Mirella Lapata, Ida Szubert, and Elizabeth Nielson for comments on the paper. Wilmot's work is funded by an EPSRC doctoral training award.
References

H Porter Abbott. 2008. The Cambridge introduction to narrative. Cambridge University Press.

David Bamman, Brendan O'Connor, and Noah A Smith. 2013. Learning latent personas of film characters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 352–361.

David Bamman, Ted Underwood, and Noah A. Smith. 2014. A Bayesian mixed effects model of literary character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 370–379, Baltimore, Maryland. Association for Computational Linguistics.

Zheng Cai, Lifu Tu, and Kevin Gimpel. 2017. Pay attention to the ending: Strong neural baselines for the ROC story cloze task. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 616–622.

Snigdha Chaturvedi, Mohit Iyyer, and Hal Daumé III. 2017. Unsupervised learning of evolving relationships between literary characters. In Thirty-First AAAI Conference on Artificial Intelligence.

Yun-Gyung Cheong and R Michael Young. 2014. Suspenser: A story generation system for suspense. IEEE Transactions on Computational Intelligence and AI in Games, 7(1):39–52.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075.

James E Cutting. 2016. Narrative theory and the dynamics of popular movies. Psychonomic Bulletin & Review, 23(6):1713–1743.

Pablo Delatorre, Carlos León, Alberto G Salguero, Manuel Palomo-Duarte, and Pablo Gervás. 2018. Confronting a paradox: a new perspective of the impact of uncertainty in suspense. Frontiers in Psychology, 9:1392.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 101(2):193–210.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jeffrey Ely, Alexander Frankel, and Emir Kamenica. 2015. Suspense and surprise. Journal of Political Economy, 123(1):215–260.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.

Angela Fan, Mike Lewis, and Yann Dauphin. 2019. Strategies for structuring story generation. In ACL.

Edward Morgan Forster. 1985. Aspects of the Novel, volume 19. Houghton Mifflin Harcourt.

Stefan L Frank. 2013. Uncertainty reduction as a measure of cognitive load in sentence comprehension. Topics in Cognitive Science, 5(3):475–494.

Richard J Gerrig. 1989. Suspense in the absence of uncertainty. Journal of Memory and Language, 28(6):633–648.

Richard J Gerrig and Allan BI Bernardo. 1994. Readers as problem-solvers in the experience of suspense. Poetics, 22(6):459–472.

Philip John Gorinski and Mirella Lapata. 2018. What's this movie about? A joint neural network architecture for movie content analysis. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1770–1781, New Orleans, Louisiana. Association for Computational Linguistics.

Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, and Theophane Weber. 2019. Temporal difference variational auto-encoder. In International Conference on Learning Representations.

Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.

David Ha and Jürgen Schmidhuber. 2018. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pages 2451–2463. Curran Associates, Inc. https://worldmodels.github.io.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the 2nd Conference of the North American Chapter of the Association for Computational Linguistics, volume 2, pages 159–166, Pittsburgh, PA. Association for Computational Linguistics.

John Hale. 2006. Uncertainty about the rest of the sentence. Cognitive Science, 30(4):643–672.

Brent Harrison, Christopher Purdy, and Mark O Riedl. 2017. Toward automated story generation with Markov chain Monte Carlo methods and deep neural networks. In Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Hans Hoeken and Mario van Vliet. 2000. Suspense, curiosity, and surprise: How discourse structure influences the affective and cognitive processing of a story. Poetics, 27(4):277–286.

Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. 2018. Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1638–1649, Melbourne, Australia. Association for Computational Linguistics.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

C. T. Hsu, M. Conrad, and A. M. Jacobs. 2014. Fiction feelings in Harry Potter: Haemodynamic response in the mid-cingulate cortex correlates with immersive reading experience. NeuroReport, 25:1356–1361.

Clayton J Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media.

Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III. 2016. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1534–1544.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Daniel Kahneman and Amos Tversky. 2013. Prospect theory: An analysis of decision under risk. In Handbook of the Fundamentals of Financial Decision Making: Part I, pages 99–127. World Scientific.

MG Kendall. 1975. Rank correlation measures. Charles Griffin, London, 202:15.

Y. Khrypko and P. Andreae. 2011. Towards the problem of maintaining suspense in interactive narrative. In Proceedings of the 7th Australasian Conference on Interactive Entertainment, pages 5:1–5:3.

Klaus Krippendorff. 2011. Computing Krippendorff's alpha-reliability. Annenberg School for Communication Departmental Papers, Philadelphia.

Zhiwei Li, Neil Bramley, and Todd M. Gureckis. 2018. Modeling dynamics of suspense and surprise. In Proceedings of the 40th Annual Meeting of the Cognitive Science Society, CogSci 2018, Madison, WI, USA, July 25–28, 2018.

Chunhua Liu, Haiou Zhang, Shan Jiang, and Dong Yu. 2018a. DEMN: Distilled-exposition enhanced matching network for story comprehension. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation, Hong Kong. Association for Computational Linguistics.

Fei Liu, Trevor Cohn, and Timothy Baldwin. 2018b. Narrative modeling with memory chains and semantic supervision. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 278–284, Melbourne, Australia. Association for Computational Linguistics.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.

Lara J Martin, Prithviraj Ammanabrolu, Xinyu Wang, William Hancock, Shruti Singh, Brent Harrison, and Mark O Riedl. 2018. Event representations for automated story generation with deep neural nets. In Thirty-Second AAAI Conference on Artificial Intelligence.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849, San Diego, California. Association for Computational Linguistics.

M. B. Oliver. 1993. Exploring the paradox of the enjoyment of sad films. Human Communication Research, 19:315–342.

Brian O'Neill. 2013. A computational model of suspense for the augmentation of intelligent story generation. Ph.D. thesis, Georgia Institute of Technology.

Brian O'Neill and Mark Riedl. 2014. Dramatis: A computational model of suspense. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27–31, 2014, Québec City, Québec, Canada, pages 944–950.

Pinelopi Papalampidi, Frank Keller, and Mirella Lapata. 2019. Movie plot analysis via turning point identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1707–1717, Hong Kong, China. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Joseph Perreault. 2018. The Universal Structure of Plot Content: Suspense, Magnetic Plot Elements, and the Evolution of an Interesting Story. Ph.D. thesis, University of Idaho.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Juan A Prieto-Pablos. 1998. The paradox of suspense. Poetics, 26(2):99–113.

Eric S Rabkin. 1973. Narrative Suspense. "When Slim turned sideways...". Ann Arbor: University of Michigan Press.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. https://openai.com/blog/language-unsupervised/.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 463–473, Melbourne, Australia. Association for Computational Linguistics.

Brian Roark, Asaf Bachrach, Carlos Cardenas, and Christophe Pallier. 2009. Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 324–333, Singapore. Association for Computational Linguistics.

Melissa Roemmele and Andrew Gordon. 2018. An encoder-decoder approach to predicting causal relations in stories. In Proceedings of the First Workshop on Storytelling, pages 50–59, New Orleans, Louisiana. Association for Computational Linguistics.

G. Schraw, T. Flowerday, and S. Lehman. 2001. Increasing situational interest in the classroom. Educational Psychology Review, 13:211–224.

Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, and Christopher D Manning. 2019. Do massively pretrained language models make better storytellers? arXiv preprint arXiv:1909.10705.

Pranab Kumar Sen. 1968. Estimates of the regression coefficient based on Kendall's tau. Journal of the American Statistical Association, 63(324):1379–1389.

Rishi Sharma, James Allen, Omid Bakhshandeh, and Nasrin Mostafazadeh. 2018. Tackling the story ending biases in the story cloze test. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 752–757.

Aaron Smuts. 2008. The desire-frustration theory of suspense. The Journal of Aesthetics and Art Criticism, 66(3):281–290.

Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. 2018. Cold fusion: Training seq2seq models together with language models. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2–6 September 2018, pages 387–391.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, ICML'17, pages 3319–3328. JMLR.org.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147.

Marten Van Schijndel and Tal Linzen. 2018a. Can entropy explain successor surprisal effects in reading? CoRR, abs/1810.11481.

Marten Van Schijndel and Tal Linzen. 2018b. Modeling garden path effects without explicit hierarchical syntax. In Proceedings of the 40th Annual Meeting of the Cognitive Science Society, CogSci 2018, Madison, WI, USA, July 25–28, 2018.

Robert J Yanal. 1996. The paradox of suspense. The British Journal of Aesthetics, 36(2):146–159.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. CoRR, abs/1905.12616.

Dolf Zillmann. 1996. The psychology of suspense in dramatic exposition. Suspense: Conceptualizations, Theoretical Analyses, and Empirical Explorations, pages 199–231.
A  Pre-processing

WritingPrompts comes from a public forum of short stories and so is naturally noisy. Story authors often use punctuation in unusual ways to mark out sentence or paragraph boundaries, and there are lots of spelling mistakes. Some of these cause problems with the GPT model and in some circumstances can cause it to crash. To improve the quality, sentence demarcations are left as they are from the original WritingPrompts dataset, but some sentences are cleaned up and others skipped over. Skipping over is also why there are sometimes gaps in the graph plots, as the sentence was ignored during training and inference. The pre-processing steps are as follows; where substitutions are made rather than ignoring the sentence, the token is replaced by the Spacy (Honnibal and Montani, 2017) POS tag. A simplified sketch of the whole procedure is given after the list.

1. English Language: Some phrases in sentences can be non-English; Whatthelang (Joulin et al., 2016) is used to filter out these sentences.

2. Non-dictionary words: PyDictionary and PyEnchant are used to check whether each word is a dictionary word. If not, it is replaced.

3. Repeating Symbols: Some authors mark out sections by using a string of characters such as *************** or !!!!!!!!!!!!. This can cause the Pytorch GPT implementation to break, so repeating characters are replaced with a single one.

4. Ignoring sentences: If, after all of these replacements, there are not three or more GPT word pieces (ignoring the POS replacements), then the sentence is skipped. The same processing applies to generated sentences at inference time. Occasionally the generated sentences can be nonsense, so the same criteria are used to exclude them.
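The sketch below strings the four steps together in simplified form. The language check, the tiny stand-in lexicon, and the whitespace-token approximation of the GPT word-piece count are assumptions made for illustration; in the actual pipeline Whatthelang, PyDictionary/PyEnchant, and the Spacy POS tagger fill these roles.

import re

ENGLISH_WORDS = {"a", "bell", "began", "to", "ring"}   # tiny stand-in for a real dictionary

def is_english(sentence):
    # Stand-in for the Whatthelang language check; accepts everything here.
    return True

def is_dictionary_word(word):
    # Stand-in for the PyDictionary/PyEnchant lookup.
    return word.lower().strip(".,!?") in ENGLISH_WORDS

def clean_sentence(sentence, pos_tags, min_tokens=3):
    # Simplified version of the clean-up; returns None if the sentence is skipped.
    if not is_english(sentence):                        # step 1: drop non-English sentences
        return None
    sentence = re.sub(r"(\S)\1{3,}", r"\1", sentence)   # step 3: collapse runs like ***** or !!!!!!
    words = sentence.split()
    kept = [w if is_dictionary_word(w) else tag         # step 2: replace non-dictionary words
            for w, tag in zip(words, pos_tags)]         #         with their POS tag
    if sum(w != t for w, t in zip(kept, pos_tags)) < min_tokens:
        return None                                     # step 4: too few real tokens left
    return " ".join(kept)

print(clean_sentence("A bell began to ring !!!!!!", ["DET", "NOUN", "VERB", "PART", "VERB", "PUNCT"]))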

B  Mechanical Turk Written Instructions

These are the actual instructions given to the Mechanical Turk annotators, plus the example in Table 5:

INSTRUCTIONS  For the first HIT there will be an additional training step to pass. This will take about 5 minutes. After this you will receive a code which you can enter in the code box to bypass the training for subsequent HITS. Other stories are in separate HITS; please search for "Story dramatic tension, reading sentence by sentence" to find them. The training completion code will work for all related HITS.

You will read a short story and for each sentence be asked to assess how the dramatic tension increases, decreases or stays the same. Each story will take an estimated 8-10 minutes. Judge each sentence on how the dramatic tension has changed as felt by the main characters in the story, not what you as a reader feel. Dramatic tension is the excitement or anxiousness over what will happen to the characters next; it is anticipation.

Increasing levels of each of the following increase the level of dramatic tension:

• Uncertainty: How uncertain are the characters involved about what will happen next? Put yourself in the characters' shoes; judge the change in the tension based on how the characters perceive the situation.

• Significance: How significant are the consequences of what will happen to the central characters of the story?

An Example: Take a dramatic moment in a story such as a character that needs to walk along a dangerous cliff path. When the character first realises they will encounter danger the tension will rise, then tension will increase further. Other details such as falling rocks or slips will increase the tension further to a peak. When the cliff edge has been navigated safely the tension will drop. The pattern will be the same with a dramatic event such as a fight, argument, accident, or romantic moment, where the tension will rise to a peak and then fall away as the tension is resolved.

You will be presented with one sentence at a time. Once you have read the sentence, you will press one of five keys to judge the increase or decrease in dramatic tension that this sentence caused. You will use five levels (with keyboard shortcuts in brackets):

• Big Decrease (A): A sudden decrease in dramatic tension of the situation. In the cliff example, the person reaching the other side safely.

• Decrease (S): A slow decrease in the level of tension, a more gradual drop. For example, the cliff walker sees an easier route out.

• Same (Space): Stays at a similar level. In the cliff example, an ongoing description of the event.

• Increase (K): A gradual increase in the tension. Loose rocks fall nearby the cliff walker.

• Big Increase (L): A more sudden dramatic increase such as an argument. The cliff walker suddenly slips and falls.
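Purely to illustrate how a sequence of these judgements relates to a per-sentence suspense curve of the kind plotted for the models, the labels can be mapped to increments and accumulated. The numeric values below are an assumption for exposition only, not the aggregation used in the experiments.

# Illustrative increments for the five levels (assumed values).
DELTAS = {"Big Decrease": -2, "Decrease": -1, "Same": 0, "Increase": 1, "Big Increase": 2}

def annotation_curve(labels, start=0):
    # Turn per-sentence labels into a running dramatic-tension value.
    curve, level = [], start
    for label in labels:
        level += DELTAS[label]
        curve.append(level)
    return curve

# The first few judgements from the Table 5 example (after the initial NA sentence):
print(annotation_curve(["Increase", "Increase", "Same", "Increase", "Same", "Big Increase"]))
# -> [1, 2, 2, 3, 3, 5]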
Annotation        Sentence
 NA                Clancy Marguerian, 154, private first class of the 150 + army , sits in his foxhole.
 Increase          Tired cold, wet and hungry, the only thing preventing him from laying down his rifle
                   and walking towards the enemy lines in surrender is the knowledge that however bad
                   he has it here, life as a 50 - 100 POW is surely much worse .
 Increase          He’s fighting to keep his eyes open and his rifle ready when the mortar shells start
                   landing near him.
 Same              He hunkers lower.
 Increase          After a few minutes under the barrage, Marguerian hears hurried footsteps, a grunt,
                   and a thud as a soldier leaps into the foxhole.
 Same              The man’s uniform is tan , he must be a 50 - 100 .
 Big Increase      The two men snarl and grab at each other , grappling in the small foxhole .
 Same              Abruptly, their faces come together.
 Decrease          “Clancy?”
 Decrease          “Rob?”
 Big Decrease      Rob Hall, 97, Corporal in the 50 - 100 army grins, as the situation turns from life or
                   death struggle, to a meeting of two college friends.
 Decrease          He lets go of Marguerian’s collar.
 Same              “ Holy shit Clancy , you’re the last person I expected to see here ”
 Same              “ Yeah ” “ Shit man , I didn’t think I’d ever see Mr. volunteers every saturday morning
                   at the food shelf’ , not after The Reorganization at least ”
 Same              “Yeah Rob , it is something isn’t it ”
 Decrease          “ Man , I’m sorry, I tried to kill you there”.

Table 5: One of the training annotation examples given to Mechanical Turk workers. The annotation labels are the
recommended labels. This is an extract from a validation set WritingPrompts story.

POST ACTUAL INSTRUCTIONS  In addition to the suspense annotation, the following review questions were asked:

• Please write a summary of the story in one or two sentences.

• Do you think the story is interesting or not? And why? One or two sentences.

• How interesting is the story? 1–5

The main purpose of this was to test if the MTurk annotators were comprehending the stories and not trying to cheat by skipping over. Some further work can be done to tie these into the suspense measures and also the WritingPrompts prompts.

C  Writing Prompts Examples

The numbers are from the full WritingPrompts test set. Since random sampling was done from these for evaluation, the numbers are not in a contiguous block. There are a couple of nonsense sentences or entirely punctuation sentences. In the model these are excluded in pre-processing but included here to match the sentence segmentation. There are also some unusual breaks such as "should n't"; this is because of the word segmentation produced by the Spacy tokenizer.
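The segmentation quirk mentioned above can be reproduced directly with spaCy; the pipeline name below is the standard small English model and is given only as an example of one that behaves this way.

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English pipeline is installed
print([token.text for token in nlp("I shouldn't have looked down.")])
# e.g. ['I', 'should', "n't", 'have', 'looked', 'down', '.']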
C.1  Story 27

This is Story 27 from the test set in Figure 4, it is the same as the example in the main text:

0. As I finished up my research on Alligator breeding habits for a story I was tasked with writing , a bell began to ring loudly throughout the office .