Modelling Suspense in Short Stories as Uncertainty Reduction over Neural Representation


                                    David Wilmot and Frank Keller
                           Institute for Language, Cognition and Computation
                             School of Informatics, University of Edinburgh
                              10 Crichton Street, Edinburgh EH8 9AB, UK
                          david.wilmot@ed.ac.uk, keller@inf.ed.ac.uk

Abstract

Suspense is a crucial ingredient of narrative fiction, engaging readers and making stories compelling. While there is a vast theoretical literature on suspense, it is computationally not well understood. We compare two ways of modelling suspense: surprise, a backward-looking measure of how unexpected the current state is given the story so far; and uncertainty reduction, a forward-looking measure of how unexpected the continuation of the story is. Both can be computed either directly over story representations or over their probability distributions. We propose a hierarchical language model that encodes stories and computes surprise and uncertainty reduction. Evaluating against short stories annotated with human suspense judgements, we find that uncertainty reduction over representations is the best predictor, resulting in near human accuracy. We also show that uncertainty reduction can be used to predict suspenseful events in movie synopses.

1   Introduction

As current NLP research expands to include longer, fictional texts, it becomes increasingly important to understand narrative structure. Previous work has analyzed narratives at the level of characters and plot events (e.g., Gorinski and Lapata, 2018; Martin et al., 2018). However, systems that process or generate narrative texts also have to take into account what makes stories compelling and enjoyable. We follow a literary tradition that makes And then? (Forster, 1985; Rabkin, 1973) the primary question and regards suspense as a crucial factor of storytelling. Studies show that suspense is important for keeping readers' attention (Khrypko and Andreae, 2011), promotes readers' immersion and suspension of disbelief (Hsu et al., 2014), and plays a big part in making stories enjoyable and interesting (Oliver, 1993; Schraw et al., 2001). Computationally less well understood, suspense has only sporadically been used in story generation systems (O'Neill and Riedl, 2014; Cheong and Young, 2014).

Suspense, intuitively, is a feeling of anticipation that something risky or dangerous will occur; it involves both uncertainty and jeopardy. Take the play Romeo and Juliet: dramatic suspense is created throughout: the initial duel, the meeting at the masquerade ball, the marriage, the fight in which Tybalt is killed, and the sleeping potions leading to the death of Romeo and Juliet. At each moment, the audience is invested in something being at stake and wonders how it will end.

This paper aims to model suspense in computational terms, with the ultimate goal of making it deployable in NLP systems that analyze or generate narrative fiction. We start from the assumption that concepts developed in psycholinguistics to model human language processing at the word level (Hale, 2001, 2006) can be generalised to the story level to capture suspense; we call this the Hale model. This assumption parallels work in economics that uses similar concepts to model suspense in games (Ely et al., 2015; Li et al., 2018), the Ely model. Common to both approaches is the idea that suspense is a form of expectation: in games, we expect to win or lose; in stories, we expect that the narrative will end a certain way.

We will therefore compare two ways of modelling narrative suspense: surprise, a backward-looking measure of how unexpected the current state is given the story so far; and uncertainty reduction, a forward-looking measure of how unexpected the continuation of the story is. Both measures can be computed either directly over story representations, or indirectly over the probability distributions over such representations. We propose a hierarchical language model based on Generative Pre-Training (GPT, Radford et al., 2018) to encode story-level representations and develop an inference scheme that uses these representations to compute both surprise and uncertainty reduction.
For evaluation, we use the WritingPrompts corpus of short stories (Fan et al., 2018), part of which we annotate with human sentence-by-sentence judgements of suspense. We find that surprise over representations and over probability distributions both predict suspense judgements. However, uncertainty reduction over representations is better, resulting in near human-level accuracy. We also show that our models can be used to predict turning points, i.e., major narrative events, in movie synopses (Papalampidi et al., 2019).

2   Related Work

In narratology, uncertainty over outcomes is traditionally seen as suspenseful (e.g., O'Neill, 2013; Zillmann, 1996; Abbott, 2008). Other authors claim that suspense can exist without uncertainty (e.g., Smuts, 2008; Hoeken and van Vliet, 2000; Gerrig, 1989) and that readers feel suspense even when they read a story for the second time (Delatorre et al., 2018), which is unexpected if suspense is uncertainty; this is referred to as the paradox of suspense (Prieto-Pablos, 1998; Yanal, 1996). Considering Romeo and Juliet again, in the first view suspense is motivated primarily by uncertainty over what will happen: Who will be hurt or killed in the fight? What will happen after the marriage? However, at the beginning of the play we are told "from forth the fatal loins of these two foes, a pair of star-crossed lovers take their life", and so the suspense is more about being invested in the plot than about not knowing the outcome, aligning more with the second view: suspense can exist without uncertainty. We do not address the paradox of suspense directly in this paper, but we are guided by the debate to operationalise methods that encompass both views. The Hale model is closer to the traditional view of suspense as being about uncertainty. In contrast, the Ely model is more in line with the second view, that uncertainty matters less than consequentially different outcomes.

In NLP, suspense is studied most directly in natural language generation, with systems such as Dramatis (O'Neill and Riedl, 2014) and Suspenser (Cheong and Young, 2014), two planning-based story generators that use the theory of Gerrig and Bernardo (1994) that suspense is created when a protagonist faces obstacles that reduce successful outcomes. Our approach, in contrast, models suspense using general language models fine-tuned on stories, without planning and domain knowledge. The advantage is that the model can be trained on large volumes of available narrative text without requiring expensive annotations, making it more generalisable.

Other work emphasises the role of characters and their development in story understanding (Bamman et al., 2014, 2013; Chaturvedi et al., 2017; Iyyer et al., 2016) or summarisation (Gorinski and Lapata, 2018). A further important element of narrative structure is plot, i.e., the sequence of events in which characters interact. Neural models have explicitly modelled events (Martin et al., 2018; Harrison et al., 2017; Rashkin et al., 2018) or the results of actions (Roemmele and Gordon, 2018; Liu et al., 2018a,b). On the other hand, some neural generation models (Fan et al., 2018) just use a hierarchical model on top of a language model; our architecture follows this approach.

3   Models of Suspense

3.1   Definitions

In order to formalise measures of suspense, we assume that a story consists of a sequence of sentences. These sentences are processed one by one, and the sentence at the current time point t is represented by an embedding e_t (see Section 4 for how embeddings are computed). Each embedding is associated with a probability P(e_t). Continuations of the story are represented by a set of possible next sentences, whose embeddings are denoted by e_{t+1}^i.

The first measure of suspense we consider is surprise (Hale, 2001), which in the psycholinguistic literature has been successfully used to predict word-based processing effort (Demberg and Keller, 2008; Roark et al., 2009; Van Schijndel and Linzen, 2018a,b). Surprise is a backward-looking predictor: it measures how unexpected the current word is given the words that preceded it (i.e., the left context). Hale formalises surprise as the negative log of the conditional probability of the current word. For stories, we compute surprise over sentences. As our sentence embeddings e_t include information about the left context e_1, ..., e_{t-1}, we can write Hale surprise as:

    S_t^{Hale} = -\log P(e_t)                                            (1)

An alternative measure for predicting word-by-word processing effort used in psycholinguistics is entropy reduction (Hale, 2006). This measure is forward-looking: it captures how much the current word changes our expectations about the words we will encounter next (i.e., the right context). Again, we compute entropy at the story level, i.e., over sentences instead of over words. Given a probability distribution over possible next sentences P(e_{t+1}^i), we calculate the entropy of that distribution. Entropy reduction is the change of that entropy from one sentence to the next:

    H_t = -\sum_i P(e_{t+1}^i) \log P(e_{t+1}^i)
    U_t^{Hale} = H_{t-1} - H_t                                           (2)

Note that we follow Frank (2013) in computing entropy over surface strings, rather than over parse states as in Hale's original formulation.

In the economics literature, Ely et al. (2015) have proposed two measures that are closely related to Hale surprise and entropy reduction. At the heart of their theory of suspense is the notion of belief in an end state. Games are a good example: the state of a tennis game changes with each point being played, making a win more or less likely. Ely et al. define surprise as the amount of change from the previous time step to the current time step. Intuitively, large state changes (e.g., one player suddenly comes close to winning) are more surprising than small ones. Representing the state at time t as e_t, Ely surprise is defined as:

    S_t^{Ely} = (e_t - e_{t-1})^2                                        (3)

Ely et al.'s approach can be adapted for modelling suspense in stories if we assume that each sentence in a story changes the state (the characters, places, events in a story, etc.). The states e_t then become sentence embeddings, rather than beliefs in end states, and Ely surprise is the distance between the current embedding e_t and the previous embedding e_{t-1}. In this paper, we will use L1 and L2 distances; other authors (Li et al., 2018) experimented with information gain and KL divergence, but found worse performance when modelling suspense in games. Just like Hale surprise, Ely surprise models backward-looking prediction, but over representations rather than over probabilities.

Ely et al. also introduce a measure of forward-looking prediction, which they define as the expected difference between the current state e_t and the next state e_{t+1}:

    U_t^{Ely} = E[(e_t - e_{t+1}^i)^2] = \sum_i P(e_{t+1}^i)(e_t - e_{t+1}^i)^2    (4)

This is closely related to Hale entropy reduction, but the expectation is computed over states (sentence embeddings in our case), rather than over probability distributions. Intuitively, this measure captures how much the uncertainty about the rest of the story is reduced by the current sentence. We refer to the forward-looking measures in Equations (2) and (4) as Hale and Ely uncertainty reduction, respectively.

Ely et al. also suggest versions of their measures in which each state is weighted by a value \alpha_t, thus accounting for the fact that some states may be more inherently suspenseful than others:

    S_t^{\alpha Ely} = \alpha_t (e_t - e_{t-1})^2
    U_t^{\alpha Ely} = E[\alpha_{t+1} (e_t - e_{t+1}^i)^2]               (5)

We stipulate that sentences with high emotional valence are more suspenseful, as emotional involvement heightens readers' experience of suspense. This can be captured in Ely et al.'s framework by assigning the αs the scores of a sentiment classifier.
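
As an illustration (a minimal NumPy sketch, not the code released with this paper), the measures in Equations (1)-(5) can be computed from per-sentence probabilities and embeddings as follows; in line with the implementation choices described above, the Ely measures are computed with L1 or L2 distances rather than literal squared differences:

```python
import numpy as np

def hale_surprise(p_t):
    """Eq. (1): negative log probability of the current sentence."""
    return -np.log(p_t)

def entropy(p):
    """Shannon entropy of a distribution over candidate continuations."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

def hale_uncertainty_reduction(p_next_prev, p_next_curr):
    """Eq. (2): drop in entropy of the next-sentence distribution
    between the previous and the current time step."""
    return entropy(p_next_prev) - entropy(p_next_curr)

def ely_surprise(e_t, e_prev, norm_ord=1, alpha_t=1.0):
    """Eq. (3)/(5): distance between consecutive sentence embeddings,
    optionally weighted by a sentiment-based alpha."""
    return alpha_t * np.linalg.norm(e_t - e_prev, ord=norm_ord)

def ely_uncertainty_reduction(e_t, e_next, p_next, alpha_next=None, norm_ord=1):
    """Eq. (4)/(5): expected distance between the current embedding and
    the candidate continuation embeddings e_next (one row per candidate)."""
    e_next = np.asarray(e_next)
    dists = np.linalg.norm(e_next - e_t, ord=norm_ord, axis=1)
    if alpha_next is not None:            # optional per-candidate weights
        dists = np.asarray(alpha_next) * dists
    return float(np.dot(np.asarray(p_next), dists))
```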

3.2   Modelling Approach

We now need to show how to compute the surprise and uncertainty reduction measures introduced in the previous section. This involves building a model that processes stories sentence by sentence and assigns each sentence an embedding that encodes the sentence and its preceding context, as well as a probability. These outputs can then be used to compute a surprise value for the sentence.

Furthermore, the model needs to be able to generate a set of possible next sentences (story continuations), each with an embedding and a probability. Generating upcoming sentences is potentially very computationally expensive, since the number of continuations grows exponentially with the number of future time steps. As an alternative, we can therefore sample possible next sentences from a corpus and use the model to assign them embeddings and probabilities. Both of these approaches will produce sets of upcoming sentences, which we can then use to compute uncertainty reduction. While we have so far only talked about the next sentences, we will also experiment with uncertainty reduction computed using longer rollouts.

[Figure 1: Architecture of our hierarchical model. See the text for an explanation of the components word_enc, sent_enc, and story_enc.]

4   Model

4.1   Architecture

Our overall approach leverages contextualised language models, which are a powerful tool in NLP when pretrained on large amounts of text and fine-tuned on a specific task (Peters et al., 2018; Devlin et al., 2019). Specifically, we use Generative Pre-Training (GPT, Radford et al., 2018), a model which has proved successful in generation tasks (Radford et al., 2019; See et al., 2019).

Hierarchical Model  Previous work found that hierarchical models show strong performance in story generation (Fan et al., 2018) and understanding tasks (Cai et al., 2017). The language model and hierarchical encoders we use are unidirectional, which matches the incremental way in which human readers process stories when they experience suspense.

Figure 1 depicts the architecture of our hierarchical model. [Footnote 1: Model code and scripts for evaluation are available at https://github.com/dwlmt/Story-Untangling/tree/acl-2020-dec-submission] It builds a chain of representations that anticipates what will come next in a story, allowing us to infer measures of suspense. For a given sentence, we use GPT as our word encoder (word_enc in Figure 1), which turns each word in a sentence into a word embedding w_i. Then, we use an RNN (sent_enc) to turn the word embeddings of each sentence into a sentence embedding γ_i: each sentence is represented by the hidden state of its last word. The sentence embeddings are then fed into a second RNN (story_enc) that computes a story embedding; the overall story representation is the hidden state of its last sentence. Crucially, this model also gives us e_t, a contextualised representation of the current sentence at point t in the story, which we use to compute surprise and uncertainty reduction.
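
The encoder chain can be sketched as follows (a simplified PyTorch illustration, not the released implementation: word_enc is stubbed with a plain embedding layer standing in for GPT, single-layer GRUs are used, and the fusion layers are omitted):

```python
import torch
import torch.nn as nn

class HierarchicalStoryEncoder(nn.Module):
    """Sketch of the word_enc -> sent_enc -> story_enc chain in Figure 1."""

    def __init__(self, vocab_size, hidden_size=768):
        super().__init__()
        self.word_enc = nn.Embedding(vocab_size, hidden_size)   # placeholder for GPT
        self.sent_enc = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.story_enc = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, story_tokens):
        # story_tokens: (num_sentences, max_words) token ids for one story
        words = self.word_enc(story_tokens)                      # (S, W, H)
        _, sent_hidden = self.sent_enc(words)                    # hidden state of last word
        sent_embeddings = sent_hidden[-1]                        # (S, H): one gamma_i per sentence
        contextual, _ = self.story_enc(sent_embeddings.unsqueeze(0))   # (1, S, H)
        # contextual[0, t] plays the role of e_t: the sentence at position t,
        # contextualised by the story so far; the last position is the story
        # representation as a whole.
        return sent_embeddings, contextual.squeeze(0)
```

Both RNNs are left unidirectional, mirroring the incremental reading process assumed above.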

Model training includes a generative loss ℓ_gen to improve the quality of the sentences generated by the model. We concatenate the word representations w_j for all word embeddings in the latest sentence with the latest story embedding e_max(t). This is run through affine ELU layers to produce enriched word embedding representations, analogous to the Deep Fusion model (Gülçehre et al., 2015), but with the story state in place of a translation model. The related Cold Fusion approach (Sriram et al., 2018) proved inferior.

Loss Functions  To obtain the discriminatory loss ℓ_disc for a particular sentence s in a batch, we compute the dot product of all the story embeddings e in the batch, and then take the cross-entropy across the batch with the correct next sentence:

    \ell_{disc}(e_{t+1}^{i=s}) = -\log \frac{\exp(e_{t+1}^{i=s} \cdot e_t)}{\sum_i \exp(e_{t+1}^i \cdot e_t)}    (6)

Modelled on Quick Thoughts (Logeswaran and Lee, 2018), this forces the model to maximise the dot product of the correct next sentence against other sentences in the same story and negative examples from other stories, and so encourages representations that anticipate what happens next.

The generative loss in Equation (7) is a standard LM loss, where the w_j are the GPT word embeddings from the sentence and e_max(t) is the story context that each word is concatenated with:

    \ell_{gen} = -\sum_j \log P(w_j \mid w_{j-1}, w_{j-2}, \ldots; e_{\max(t)})    (7)

The overall loss is ℓ_disc + ℓ_gen. More advanced generation losses (e.g., Zellers et al., 2019) could be used, but are an order of magnitude slower.
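
A minimal sketch of the two losses (illustrative shapes and function names; the released code batches this differently):

```python
import torch
import torch.nn.functional as F

def discriminatory_loss(e_t, e_next_candidates, correct_index):
    """Eq. (6): cross-entropy over dot products between the contextualised
    embedding e_t and every candidate next-sentence embedding in the batch
    (the true continuation plus negatives from the same and other stories)."""
    logits = e_next_candidates @ e_t                      # (num_candidates,)
    target = torch.tensor(correct_index)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

def generative_loss(word_logits, target_ids):
    """Eq. (7): standard language-model loss; word_logits are produced from
    word states fused with the story embedding e_max(t)."""
    return F.cross_entropy(word_logits.view(-1, word_logits.size(-1)),
                           target_ids.view(-1))

# overall objective: discriminatory_loss(...) + generative_loss(...)
```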

4.2   Inference

We compute the measures of surprise and uncertainty reduction introduced in Section 3.1 using the output of the story encoder story_enc. In addition to the contextualised sentence embeddings e_t, this requires their probabilities P(e_t) and a distribution over alternative continuations P(e_{t+1}^i).

We implement a recursive beam search over a tree of future sentences in the story, looking between one and three sentences ahead (rollout). The probability is calculated using the same method as the discriminatory loss, but with the cosine similarity rather than the dot product of the embeddings e_t and e_{t+1}^i fed into a softmax function. We found that cosine outperformed the dot product at inference, as the resulting probability distribution over continuations is less concentrated.
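
For example, the continuation distribution can be obtained as in this sketch (names are illustrative, not taken from the released code):

```python
import torch
import torch.nn.functional as F

def continuation_probabilities(e_t, candidate_embeddings):
    """Distribution P(e^i_{t+1}) over candidate continuations: cosine similarity
    between e_t and each candidate, pushed through a softmax. Cosine is used
    instead of the dot product so the distribution is less concentrated."""
    sims = F.cosine_similarity(candidate_embeddings, e_t.unsqueeze(0), dim=1)
    return F.softmax(sims, dim=0)
```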

5   Methods

Dataset  The overall goal of this work is to test whether the psycholinguistic and economic theories introduced in Section 3 are able to capture human intuitions of suspense. For this, it is important to use actual stories, written by authors with the aim of being engaging and interesting. Some of the story datasets used in NLP do not meet this criterion; for example, ROC Cloze (Mostafazadeh et al., 2016) is not suitable because the stories are very short (five sentences), lack naturalness, and are written by crowdworkers to fulfil narrow objectives, rather than to elicit reader engagement and interest. A number of authors have also pointed out technical issues with such artificial corpora (Cai et al., 2017; Sharma et al., 2018).

Instead, we use WritingPrompts (Fan et al., 2018), a corpus of circa 300k short stories from the /r/WritingPrompts subreddit. These stories were created as an exercise in creative writing, resulting in stories that are interesting, natural, and of suitable length. The original split of the data into 90% train, 5% development, and 5% test was used. Pre-processing steps are described in Appendix A.

Annotation  To evaluate the predictions of our model, we selected 100 stories each from the development and test sets of the WritingPrompts corpus, such that each story was between 25 and 75 sentences in length. Each sentence of these stories was judged for narrative suspense; five master workers from Amazon Mechanical Turk annotated each story after reading instructions and completing a training phase. They read one sentence at a time and provided a suspense judgement on a five-point scale consisting of Big Decrease in suspense (1% of the cases), Decrease (11%), Same (50%), Increase (31%), and Big Increase (7%). In contrast to prior work (Delatorre et al., 2018), a relative rather than an absolute scale was used. Relative judgements are easier to make while reading; in practice, the suspense curves generated are very similar, with a long upward trajectory and a flattening or dip near the end. After finishing a story, annotators had to write a short summary of the story.

In the instructions, suspense was framed as dramatic tension, as pilot annotations showed that the term suspense was too closely associated with murder mysteries and related genres. Annotators were asked to take the character's perspective when reading, to achieve stronger inter-annotator agreement and to align closely with literary notions of suspense. During training, all workers had to annotate a test story and achieve 85% accuracy before they could continue. Full instructions and the training story are in Appendix B.

The inter-annotator agreement α (Krippendorff, 2011) was 0.52 and 0.57 for the development and test sets, respectively. Given the inherently subjective nature of the task, this is substantial agreement. It was achieved after screening out and replacing annotators who had low agreement for the stories they annotated (mean α < 0.35), showed suspiciously low reading times (mean RT < 600 ms per sentence), or whose story summaries indicated low-quality annotation.

Training and Inference  Training used SGD with Nesterov momentum (Sutskever et al., 2013), a learning rate of 0.01, and a momentum of 0.9. Models were run with early stopping based on the mean of the accuracies of the training tasks. For each batch, 50-sentence blocks from two different stories were chosen, to ensure that the negative examples in the discriminatory loss include both easy (other stories) and difficult (same story) sentences.

We used the pretrained GPT weights but fine-tuned the encoder and decoder weights on our task. For the RNN components of our hierarchical model, we experimented with both GRU (Chung et al., 2015) and LSTM (Hochreiter and Schmidhuber, 1997) variants. The GRU model had two layers in both sent_enc and story_enc; the LSTM model had four layers in each. Both had two fusion layers, and the size of the hidden layers for both variants was 768. We give the results of both variants on the tasks of sentence generation and sentence discrimination in Table 1. Both perform similarly, with slightly worse loss for the LSTM variant, but faster training and better generation accuracy. Overall, model performance is strong: the LSTM variant picks out the correct sentence 54% of the time and generates it 46% of the time. This indicates that our architecture successfully captures the structure of stories.

                                    GRU      LSTM
    Loss                            5.84     5.90
    Discriminatory Acc.             0.55     0.54
    Discriminatory Acc. k = 10      0.68     0.68
    Generative Acc.                 0.37     0.46
    Generative Acc. k = 10          0.85     0.85
    Cosine Similarity               0.48     0.50
    L2 Distance                     1.73     1.59
    Number of Epochs                4        2

Table 1: For accuracy, the baseline probability is 1 in 99; k = 10 gives the accuracy over the top 10 sentences of the batch. Results are from the best epoch of training on the WritingPrompts development set.

At inference time, we obtained a set of story continuations either by random sampling or by generation. Random sampling means that n sentences were selected from the corpus and used as continuations. For generation, sentences were generated with top-k sampling (k = 50) using the GPT language model and the approach of Radford et al. (2019), which generates better output than beam search (Holtzman et al., 2018) and can outperform a decoder (See et al., 2019). For generation, we used up to 300 words as context, enriched with the story sentence embeddings from the corresponding points in the story. For rollouts of one sentence, we generated 100 possibilities at each step; for rollouts of two, 50 possibilities; and for rollouts of three, 25 possibilities. This keeps what is an expensive inference process manageable.

Importance  We follow Ely et al. in evaluating weighted versions of their surprise and uncertainty reduction measures S_t^{αEly} and U_t^{αEly} (see Equation (5)). We obtain the α_t values by taking the sentiment scores assigned by the VADER sentiment classifier (Hutto and Gilbert, 2014) to each sentence and multiplying them by 1.0 for positive sentiment and by 2.0 for negative sentiment. The stronger negative weighting reflects the observation that negative consequences can be more important than positive ones (O'Neill, 2013; Kahneman and Tversky, 2013).
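
The α_t weights can be obtained as in the sketch below (assuming the vaderSentiment package; whether the signed compound score or its magnitude is used is not specified above, so the sketch takes the magnitude to keep the weight positive):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

_vader = SentimentIntensityAnalyzer()

def alpha_weight(sentence, positive_scale=1.0, negative_scale=2.0):
    """Sentiment-based weight for the alpha-weighted Ely measures: the VADER
    compound score, scaled more strongly when the sentence is negative."""
    score = _vader.polarity_scores(sentence)["compound"]   # in [-1, 1]
    scale = negative_scale if score < 0 else positive_scale
    return abs(score) * scale
```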

Baselines  We test a number of baselines as alternatives to surprise and uncertainty reduction derived from our hierarchical model. These baselines also reflect how much change occurs from one sentence to the next in a story: WordOverlap is the Jaccard similarity between the two sentences, GloveSim is the cosine similarity between the averaged Glove (Pennington et al., 2014) word embeddings of the two sentences, and GPTSim is the cosine similarity between the GPT embeddings of the two sentences. The α baseline is the weighted VADER sentiment score.

6   Results

6.1   Narrative Suspense

Task  The annotator judgements are relative (the amount of decrease/increase in suspense from sentence to sentence), but the model predictions are absolute values. We could convert the model predictions into discrete categories, but this would fail to capture the overall arc of the story. Instead, we convert the relative judgements into absolute suspense values, where J_t = j_1 + ... + j_t is the absolute value for sentence t and j_1, ..., j_t are the relative judgements for sentences 1 to t. We use -0.2 for Big Decrease, -0.1 for Decrease, 0 for Same, 0.1 for Increase, and 0.2 for Big Increase. [Footnote 2: These values were fitted against predictions (or cross-worker annotation) using 5-fold cross-validation and an L1 loss to optimise the mapping. A constraint is placed so that Same is 0, increases are positive, and decreases are negative, with a minimum distance of 0.05 between values.] Both the absolute suspense judgements and the model predictions are normalised by converting them to z-scores.

To compare model predictions and absolute suspense values, we use Spearman's ρ (Sen, 1968) and Kendall's τ (Kendall, 1975). Rank correlation is preferred because we are interested in whether human annotators and models view the same part of the story as more or less suspenseful; rank correlation methods are also good at detecting trends. We compute ρ and τ between the model predictions and the judgements of each of the annotators (i.e., five times for five annotators), and then take the average. We then average these values again over the 100 stories in the test or development sets. As the human upper bound, we compute the mean pairwise correlation of the five annotators.
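
As an illustration, the conversion and scoring can be sketched as follows (function names and input format are assumptions, not from the released evaluation scripts; scipy supplies the rank correlations):

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

STEP = {"big_decrease": -0.2, "decrease": -0.1, "same": 0.0,
        "increase": 0.1, "big_increase": 0.2}

def to_absolute(relative_judgements):
    """Cumulative sum J_t = j_1 + ... + j_t of the mapped relative judgements,
    z-scored so it is comparable with the (also z-scored) model predictions."""
    j = np.cumsum([STEP[r] for r in relative_judgements])
    return (j - j.mean()) / (j.std() + 1e-12)

def story_correlation(model_scores, annotator_judgements):
    """Mean Spearman rho and Kendall tau between the model and each annotator;
    these per-story values are then averaged over the 100 stories."""
    rhos, taus = [], []
    for judgements in annotator_judgements:          # one list per annotator
        absolute = to_absolute(judgements)
        rhos.append(spearmanr(model_scores, absolute).correlation)
        taus.append(kendalltau(model_scores, absolute).correlation)
    return np.mean(rhos), np.mean(taus)
```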

Results  Figure 2 shows surprise and uncertainty reduction measures and human suspense judgements for an example story (text and further examples in Appendix C). We performed model selection using the correlations on the development set, which are given in Table 2. We experimented with all the measures introduced in Section 3.1, computing sets of alternative sentences either using generated continuations (Gen) or continuations sampled from the corpus (Cor), except for S^Ely, which can be computed without alternatives. We compared the LSTM and GRU variants (see Section 4) and experimented with rollouts of up to three sentences. We tried L1 and L2 distance for the Ely measures, but only report L1, which always performed better.

[Figure 2: Story 27: human suspense judgements and the S^Hale, S^Ely, U^Ely, and U^αEly measures plotted by sentence (y-axis: suspense; x-axis: sentence). Solid lines: generated alternative continuations; dashed lines: sampled alternative continuations.]

    Prediction     Model         Roll    τ↑      ρ↑
    Human                                .553    .614
    Baselines      WordOverlap   1       .017    .026
                   GloveSim      1       .017    .029
                   GPTSim        1       .021    .031
                   α             1       .024    .036
    S^Hale-Gen     GRU           1       .145    .182
                   LSTM          1       .434    .529
    S^Hale-Cor     GRU           1       .177    .214
                   LSTM          1       .580    .675
    U^Hale-Gen     GRU           1       .036    .055
                   LSTM          1       .009    .016
    U^Hale-Cor     GRU           1       .048    .050
                   LSTM          1       .066    .094
    S^Ely          GRU           1       .484    .607
                   LSTM          1       .427    .539
    S^αEly         GRU           1       .089    .123
                   LSTM          1       .115    .156
    U^Ely-Gen      GRU           1       .241    .161
                                 2       .304    .399
                   LSTM          1       .610    .698
                                 2       .393    .494
    U^Ely-Cor      GRU           1       .229    .264
                                 2       .512    .625
                                 3       .515    .606
                   LSTM          1       .594    .678
                                 2       .564    .651
                                 3       .555    .645
    U^αEly-Gen     GRU           1       .216    .124
                                 2       .219    .216
                   LSTM          1       .474    .604
                                 2       .316    .418
    U^αEly-Cor     GRU           1       .205    .254
                                 2       .365    .470
                   LSTM          1       .535    .642
                                 2       .425    .534

Table 2: Development set results for WritingPrompts for generated (Gen) or corpus-sampled (Cor) alternative continuations; α indicates sentiment weighting. Bold: best model in a given category; red: best model overall.

Discussion  On the development set (see Table 2), we observe that all baselines perform poorly, indicating that distance between simple sentence representations or raw sentiment values does not model suspense. We find that Hale surprise S^Hale performs well, reaching a maximum ρ of .675 on the development set. Hale uncertainty reduction U^Hale, however, performs consistently poorly. Ely surprise S^Ely also performs well, reaching a similar value to Hale surprise. Overall, Ely uncertainty reduction U^Ely is the strongest performer, with ρ = .698, numerically outperforming the human upper bound.

Some other trends are clear from the development set: using GRUs reduces performance in all cases but one; a rollout of more than one never leads to an improvement; and sentiment weighting (prefix α in the table) always reduces performance, as it introduces considerable noise (see Figure 2). We therefore eliminate the models that correspond to these settings when we evaluate on the test set.

For the test set results in Table 3 we also report upper and lower confidence bounds computed using the Fisher Z-transformation (p < 0.05). On the test set, U^Ely is again the best measure, with a correlation statistically indistinguishable from human performance (based on CIs). We find that absolute correlations are higher on the test set, presumably reflecting the higher human upper bound.
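
The confidence bounds follow the standard Fisher transformation; as a sketch (n is the number of paired observations):

```python
import numpy as np

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation r over n points,
    via the Fisher z-transformation (z_crit = 1.96 corresponds to p < 0.05)."""
    z = np.arctanh(r)
    half_width = z_crit / np.sqrt(n - 3)
    return np.tanh(z - half_width), np.tanh(z + half_width)
```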

    Prediction     τ↑             ρ↑
    Human          .652 (.039)    .711 (.033)
    S^Hale-Gen     .407 (.089)    .495 (.081)
    S^Hale-Cor     .454 (.085)    .523 (.079)
    U^Hale-Gen     .036 (.102)    .051 (.102)
    U^Hale-Cor     .061 (.100)    .088 (.101)
    S^Ely          .391 (.092)    .504 (.082)
    U^Ely-Gen      .620 (.067)    .710 (.053)
    U^Ely-Cor      .605 (.069)    .693 (.056)

Table 3: Test set results for WritingPrompts for generated (Gen) or corpus-sampled (Cor) continuations. LSTM with rollout one; brackets: confidence intervals.

Overall, we conclude that our hierarchical architecture successfully models human suspense judgements on the WritingPrompts dataset. The overall best predictor is U^Ely, uncertainty reduction computed over story representations. This measure combines the probability of the continuation (as in S^Hale) with the distance between story embeddings (as in S^Ely), which are both good predictors in their own right. This finding supports the theoretical claim that suspense is an expectation over the change in future states of a game or a story, as advanced by Ely et al. (2015).

6.2   Movie Turning Points

Task and Dataset  An interesting question is whether the peaks in suspense in a story correspond to important narrative events. Such events are sometimes called turning points (TPs) and occur at certain positions in a movie according to screenwriting theory (Cutting, 2016). A corpus of movie synopses annotated with turning points is available in the form of the TRIPOD dataset (Papalampidi et al., 2019). We can therefore test whether surprise or uncertainty reduction predicts TPs in TRIPOD. As our model is trained on a corpus of short stories, this will also serve as an out-of-domain evaluation.

Papalampidi et al. (2019) assume five TPs: 1. Opportunity, 2. Change of Plans, 3. Point of no Return, 4. Major Setback, and 5. Climax. They derive a prior distribution of TP positions from their test set, and use this to constrain predicted turning points to windows around these prior positions. We follow this approach and select as the predicted TP the sentence with the highest surprise or uncertainty reduction value within a given constrained window. We report the same baselines as in the previous experiment, as well as the Theory Baseline, which uses screenwriting theory to predict where in a movie a given TP should occur (e.g., Point of No Return theoretically occurs 50% of the way through the movie). This baseline is hard to beat (Papalampidi et al., 2019).
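
The selection step and the evaluation metric can be sketched as follows (illustrative helper functions; the window bounds are assumed to come from the prior TP positions of Papalampidi et al. (2019), and D is taken to be expressed as a percentage of synopsis length):

```python
import numpy as np

def predict_turning_point(suspense, window):
    """Pick the sentence with the highest surprise / uncertainty-reduction
    value inside the constrained window around the prior TP position."""
    lo, hi = window                      # sentence indices bounding the window
    return lo + int(np.argmax(suspense[lo:hi]))

def normalised_distance(predicted, gold, num_sentences):
    """D: distance between predicted and gold TP positions, normalised by
    synopsis length (here expressed as a percentage)."""
    return 100.0 * abs(predicted - gold) / num_sentences
```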
Figure 3: Movie 15 Minutes, suspense per sentence for S^Hale, S^Ely, U^Ely, and U^αEly; ◆ theory baseline, ⭑ TP annotations, triangles are predicted TPs. [Plot omitted; x-axis: Sentence, y-axis: Suspense.]

7 Conclusions

Our overall findings suggest that by implementing concepts from psycholinguistic and economic theory, we can predict human judgements of suspense in storytelling. That uncertainty reduction (U^Ely) outperforms probability-only (S^Hale) and state-only (S^Ely) surprise suggests that, while consequential state change is of primary importance for suspense, the probability distribution over the states is also a necessary factor. Uncertainty reduction therefore captures the view of suspense as reducing paths to a desired outcome, with more consequential shifts as the story progresses (O'Neill and Riedl, 2014; Ely et al., 2015; Perreault, 2018). This is more in line with the Smuts (2008) Desire-Frustration view of suspense, where uncertainty is secondary.

Strong psycholinguistic claims about suspense are difficult to make due to several weaknesses in our approach, which highlight directions for future research: the proposed model does not have a higher-level understanding of event structure; most likely it picks up the textual cues that accompany dramatic changes in the text. One strand of further work is therefore analysis: text could be artificially manipulated using structural changes, for example by switching the order of sentences, mixing multiple stories, including a summary at the beginning that foreshadows the work, masking key suspenseful words, or paraphrasing. An analogue of this would be adversarial examples used in computer vision. Additional annotations, such as how certain readers are about the outcome of the story, may also be helpful in better understanding the relationship between suspense and uncertainty. Automated interpretability methods, as proposed by Sundararajan et al. (2017), could shed further light on models' predictions.

The recent success of language models in wide-ranging NLP tasks (e.g., Radford et al., 2019) has shown that language models are capable of learning semantically rich information implicitly. However, generating plausible future continuations is an essential part of the model. In text generation, Fan et al. (2019) have found that explicitly incorporating coreference and structured event representations into generation produces more coherent generated text. A more sophisticated model would incorporate similar ideas.

Autoregressive models that generate step-by-step alternatives for future continuations are computationally impractical for longer rollouts and are not cognitively plausible. They also differ from the Ely et al. (2015) conception of suspense, which is in terms of Bayesian beliefs over a longer-term future state, not step by step. There is much recent work (e.g., Ha and Schmidhuber, 2018; Gregor et al., 2019) on state-space approaches that model beliefs as latent states using variational methods. In principle, these would avoid the brute-force calculation of a rollout, and conceptually, anticipating longer-term states aligns with theories of suspense. Related tasks, such as inverting the understanding of suspense to utilise the models in generating more suspenseful stories, may also prove fruitful.

This paper is a baseline that demonstrates how modern neural network models can implicitly represent text meaning and be useful in a narrative context without recourse to supervision. It provides a springboard to further interesting applications and research on suspense in storytelling.

Acknowledgments

The authors would like to thank the anonymous reviewers, Pinelopi Papalampidi and David Hodges for reviews of the annotation task, the AMT annotators, and Mirella Lapata, Ida Szubert, and Elizabeth Nielson for comments on the paper. Wilmot's work is funded by an EPSRC doctoral training award.
References

H Porter Abbott. 2008. The Cambridge introduction to narrative. Cambridge University Press.

David Bamman, Brendan O'Connor, and Noah A Smith. 2013. Learning latent personas of film characters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 352–361.

David Bamman, Ted Underwood, and Noah A. Smith. 2014. A Bayesian mixed effects model of literary character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 370–379, Baltimore, Maryland. Association for Computational Linguistics.

Zheng Cai, Lifu Tu, and Kevin Gimpel. 2017. Pay attention to the ending: Strong neural baselines for the ROC story cloze task. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 616–622.

Snigdha Chaturvedi, Mohit Iyyer, and Hal Daume III. 2017. Unsupervised learning of evolving relationships between literary characters. In Thirty-First AAAI Conference on Artificial Intelligence.

Yun-Gyung Cheong and R Michael Young. 2014. Suspenser: A story generation system for suspense. IEEE Transactions on Computational Intelligence and AI in Games, 7(1):39–52.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075.

James E Cutting. 2016. Narrative theory and the dynamics of popular movies. Psychonomic Bulletin & Review, 23(6):1713–1743.

Pablo Delatorre, Carlos León, Alberto G Salguero, Manuel Palomo-Duarte, and Pablo Gervás. 2018. Confronting a paradox: A new perspective of the impact of uncertainty in suspense. Frontiers in Psychology, 9:1392.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 101(2):193–210.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jeffrey Ely, Alexander Frankel, and Emir Kamenica. 2015. Suspense and surprise. Journal of Political Economy, 123(1):215–260.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.

Angela Fan, Mike Lewis, and Yann Dauphin. 2019. Strategies for structuring story generation. In ACL.

Edward Morgan Forster. 1985. Aspects of the Novel, volume 19. Houghton Mifflin Harcourt.

Stefan L Frank. 2013. Uncertainty reduction as a measure of cognitive load in sentence comprehension. Topics in Cognitive Science, 5(3):475–494.

Richard J Gerrig. 1989. Suspense in the absence of uncertainty. Journal of Memory and Language, 28(6):633–648.

Richard J Gerrig and Allan BI Bernardo. 1994. Readers as problem-solvers in the experience of suspense. Poetics, 22(6):459–472.

Philip John Gorinski and Mirella Lapata. 2018. What's this movie about? A joint neural network architecture for movie content analysis. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1770–1781, New Orleans, Louisiana. Association for Computational Linguistics.

Karol Gregor, George Papamakarios, Frederic Besse, Lars Buesing, and Theophane Weber. 2019. Temporal difference variational auto-encoder. In International Conference on Learning Representations.

Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.

David Ha and Jürgen Schmidhuber. 2018. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pages 2451–2463. Curran Associates, Inc. https://worldmodels.github.io.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the 2nd Conference of the North American Chapter of the Association for Computational Linguistics, volume 2, pages 159–166, Pittsburgh, PA. Association for Computational Linguistics.

John Hale. 2006. Uncertainty about the rest of the sentence. Cognitive Science, 30(4):643–672.

Brent Harrison, Christopher Purdy, and Mark O Riedl. 2017. Toward automated story generation with Markov chain Monte Carlo methods and deep neural networks. In Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Hans Hoeken and Mario van Vliet. 2000. Suspense, curiosity, and surprise: How discourse structure influences the affective and cognitive processing of a story. Poetics, 27(4):277–286.

Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. 2018. Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1638–1649, Melbourne, Australia. Association for Computational Linguistics.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

C. T. Hsu, M. Conrad, and A. M. Jacobs. 2014. Fiction feelings in Harry Potter: Haemodynamic response in the mid-cingulate cortex correlates with immersive reading experience. NeuroReport, 25:1356–1361.

Clayton J Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media.

Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III. 2016. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1534–1544.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Daniel Kahneman and Amos Tversky. 2013. Prospect theory: An analysis of decision under risk. In Handbook of the Fundamentals of Financial Decision Making: Part I, pages 99–127. World Scientific.

MG Kendall. 1975. Rank correlation measures. Charles Griffin, London, 202:15.

Y. Khrypko and P. Andreae. 2011. Towards the problem of maintaining suspense in interactive narrative. In Proceedings of the 7th Australasian Conference on Interactive Entertainment, pages 5:1–5:3.

Klaus Krippendorff. 2011. Computing Krippendorff's alpha-reliability. Annenberg School for Communication Departmental Papers: Philadelphia.

Zhiwei Li, Neil Bramley, and Todd M. Gureckis. 2018. Modeling dynamics of suspense and surprise. In Proceedings of the 40th Annual Meeting of the Cognitive Science Society, CogSci 2018, Madison, WI, USA, July 25-28, 2018.

Chunhua Liu, Haiou Zhang, Shan Jiang, and Dong Yu. 2018a. DEMN: Distilled-exposition enhanced matching network for story comprehension. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation, Hong Kong. Association for Computational Linguistics.

Fei Liu, Trevor Cohn, and Timothy Baldwin. 2018b. Narrative modeling with memory chains and semantic supervision. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 278–284, Melbourne, Australia. Association for Computational Linguistics.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.

Lara J Martin, Prithviraj Ammanabrolu, Xinyu Wang, William Hancock, Shruti Singh, Brent Harrison, and Mark O Riedl. 2018. Event representations for automated story generation with deep neural nets. In Thirty-Second AAAI Conference on Artificial Intelligence.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849, San Diego, California. Association for Computational Linguistics.

M. B. Oliver. 1993. Exploring the paradox of the enjoyment of sad films. Human Communication Research, 19:315–342.

Brian O'Neill. 2013. A computational model of suspense for the augmentation of intelligent story generation. Ph.D. thesis, Georgia Institute of Technology.

Brian O'Neill and Mark Riedl. 2014. Dramatis: A computational model of suspense. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada, pages 944–950.

Pinelopi Papalampidi, Frank Keller, and Mirella Lapata. 2019. Movie plot analysis via turning point identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1707–1717, Hong Kong, China. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Joseph Perreault. 2018. The Universal Structure of Plot Content: Suspense, Magnetic Plot Elements, and the Evolution of an Interesting Story. Ph.D. thesis, University of Idaho.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Juan A Prieto-Pablos. 1998. The paradox of suspense. Poetics, 26(2):99–113.

Eric S Rabkin. 1973. Narrative suspense: "When Slim turned sideways...". Ann Arbor: University of Michigan Press.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. https://openai.com/blog/language-unsupervised/.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 463–473, Melbourne, Australia. Association for Computational Linguistics.

Brian Roark, Asaf Bachrach, Carlos Cardenas, and Christophe Pallier. 2009. Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 324–333, Singapore. Association for Computational Linguistics.

Melissa Roemmele and Andrew Gordon. 2018. An encoder-decoder approach to predicting causal relations in stories. In Proceedings of the First Workshop on Storytelling, pages 50–59, New Orleans, Louisiana. Association for Computational Linguistics.

G. Schraw, T. Flowerday, and S. Lehman. 2001. Increasing situational interest in the classroom. Educational Psychology Review, 13:211–224.

Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, and Christopher D Manning. 2019. Do massively pretrained language models make better storytellers? arXiv preprint arXiv:1909.10705.

Pranab Kumar Sen. 1968. Estimates of the regression coefficient based on Kendall's tau. Journal of the American Statistical Association, 63(324):1379–1389.

Rishi Sharma, James Allen, Omid Bakhshandeh, and Nasrin Mostafazadeh. 2018. Tackling the story ending biases in the story cloze test. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 752–757.

Aaron Smuts. 2008. The desire-frustration theory of suspense. The Journal of Aesthetics and Art Criticism, 66(3):281–290.

Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. 2018. Cold fusion: Training seq2seq models together with language models. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, pages 387–391.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 3319–3328. JMLR.org.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147.

Marten Van Schijndel and Tal Linzen. 2018a. Can entropy explain successor surprisal effects in reading? CoRR, abs/1810.11481.

Marten Van Schijndel and Tal Linzen. 2018b. Modeling garden path effects without explicit hierarchical syntax. In Proceedings of the 40th Annual Meeting of the Cognitive Science Society, CogSci 2018, Madison, WI, USA, July 25-28, 2018.

Robert J Yanal. 1996. The paradox of suspense. The British Journal of Aesthetics, 36(2):146–159.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. CoRR, abs/1905.12616.

Dolf Zillmann. 1996. The psychology of suspense in dramatic exposition. In Suspense: Conceptualizations, Theoretical Analyses, and Empirical Explorations, pages 199–231.
A Pre-processing

WritingPrompts comes from a public forum of short stories and so is naturally noisy. Story authors often use punctuation in unusual ways to mark out sentence or paragraph boundaries, and there are lots of spelling mistakes. Some of these cause problems with the GPT model and in some circumstances can cause it to crash. To improve the quality, sentence demarcations are left as they are in the original WritingPrompts dataset, but some sentences are cleaned up and others are skipped over. Skipping over is also why there are sometimes gaps in the graph plots: the sentence was ignored during training and inference. The pre-processing steps are as follows; where substitutions are made rather than ignoring the sentence, the token is replaced by its spaCy (Honnibal and Montani, 2017) POS tag. A sketch of the resulting filter is given after the list.

1. English language: Some phrases in sentences can be non-English; Whatthelang (Joulin et al., 2016) is used to filter out these sentences.

2. Non-dictionary words: PyDictionary and PyEnchant are used to check whether each word is a dictionary word. If not, it is replaced.

3. Repeating symbols: Some authors mark out sections by using a string of characters such as *************** or !!!!!!!!!!!!. This can cause the PyTorch GPT implementation to break, so repeating characters are replaced with a single one.

4. Ignoring sentences: If, after all of these replacements, there are not three or more GPT word pieces (ignoring the POS replacements), the sentence is skipped. The same processing applies to the sentences generated during inference: occasionally the generated sentences can be nonsense, so the same criteria are used to exclude them.
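Read in order, the list above amounts to a small filtering function; the sketch below mirrors it, with the language check, the dictionary, and the subword tokeniser passed in as callables, because the exact tools (Whatthelang, PyDictionary/PyEnchant, the GPT word-piece tokeniser) are interchangeable stand-ins here and the three-piece threshold simply follows the description rather than any released code.

# Minimal sketch of the sentence cleaning described in Appendix A.
# spaCy supplies the POS tags used to replace non-dictionary words; the other
# tools are injected as callables, so this is a sketch, not the actual pipeline.
import re
from typing import Callable, Optional

import spacy

nlp = spacy.load("en_core_web_sm")

def clean_sentence(sentence: str,
                   is_english: Callable[[str], bool],
                   in_dictionary: Callable[[str], bool],
                   count_word_pieces: Callable[[str], int],
                   min_pieces: int = 3) -> Optional[str]:
    """Return a cleaned sentence, or None if the sentence should be skipped."""
    # Step 1: drop sentences the language filter rejects.
    if not is_english(sentence):
        return None

    # Step 3: collapse repeated symbol runs such as "*****" or "!!!!!" to one character.
    sentence = re.sub(r"([^\w\s])\1{2,}", r"\1", sentence)

    # Step 2: replace words that are not in the dictionary with their POS tag.
    out_tokens, kept_words = [], []
    for tok in nlp(sentence):
        if tok.is_alpha and not in_dictionary(tok.text):
            out_tokens.append(tok.pos_)      # e.g. "PROPN", "NOUN"
        else:
            out_tokens.append(tok.text)
            kept_words.append(tok.text)

    # Step 4: skip the sentence if, ignoring the POS replacements, fewer than
    # `min_pieces` subword pieces remain.
    if count_word_pieces(" ".join(kept_words)) < min_pieces:
        return None
    return " ".join(out_tokens)

A language detector, a PyEnchant dictionary, and a GPT word-piece tokeniser (or comparable substitutes) can be supplied as is_english, in_dictionary, and count_word_pieces; sentences generated at inference time would be passed through the same function.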
Annotation        Sentence
 NA                Clancy Marguerian, 154, private first class of the 150 + army , sits in his foxhole.
 Increase          Tired cold, wet and hungry, the only thing preventing him from laying down his rifle
                   and walking towards the enemy lines in surrender is the knowledge that however bad
                   he has it here, life as a 50 - 100 POW is surely much worse .
 Increase          He’s fighting to keep his eyes open and his rifle ready when the mortar shells start
                   landing near him.
 Same              He hunkers lower.
 Increase          After a few minutes under the barrage, Marguerian hears hurried footsteps, a grunt,
                   and a thud as a soldier leaps into the foxhole.
 Same              The man’s uniform is tan , he must be a 50 - 100 .
 Big Increase      The two men snarl and grab at each other , grappling in the small foxhole .
 Same              Abruptly, their faces come together.
 Decrease          “Clancy?”
 Decrease          “Rob?”
 Big Decrease      Rob Hall, 97, Corporal in the 50 - 100 army grins, as the situation turns from life or
                   death struggle, to a meeting of two college friends.
 Decrease          He lets go of Marguerian’s collar.
 Same              “ Holy shit Clancy , you’re the last person I expected to see here ”
 Same              “ Yeah ” “ Shit man , I didn’t think I’d ever see Mr. volunteers every saturday morning
                   at the food shelf’ , not after The Reorganization at least ”
 Same              “Yeah Rob , it is something isn’t it ”
 Decrease          “ Man , I’m sorry, I tried to kill you there”.

Table 5: One of the training annotation examples given to Mechanical Turk workers. The annotation labels are the
recommended labels. This is an extract from a validation set WritingPrompts story.

B Mechanical Turk Written Instructions

These are the actual instructions given to the Mechanical Turk annotators, plus the example in Table 5:

INSTRUCTIONS For the first HIT there will be an additional training step to pass. This will take about 5 minutes. After this you will receive a code which you can enter in the code box to bypass the training for subsequent HITs. Other stories are in separate HITs; please search for "Story dramatic tension, reading sentence by sentence" to find them. The training completion code will work for all related HITs.

You will read a short story and for each sentence be asked to assess whether the dramatic tension increases, decreases or stays the same. Each story will take an estimated 8-10 minutes. Judge each sentence on how the dramatic tension has changed as felt by the main characters in the story, not what you as a reader feel. Dramatic tension is the excitement or anxiousness over what will happen to the characters next; it is anticipation.

Increasing levels of each of the following increase the level of dramatic tension:

• Uncertainty: How uncertain are the characters involved about what will happen next? Put yourself in the characters' shoes; judge the change in the tension based on how the characters perceive the situation.

• Significance: How significant are the consequences of what will happen to the central characters of the story?

An Example: Take a dramatic moment in a story, such as a character who needs to walk along a dangerous cliff path. When the character first realises they will encounter danger, the tension will rise and then increase further. Other details such as falling rocks or slips will increase the tension further to a peak. When the cliff edge has been navigated safely, the tension will drop. The pattern will be the same with a dramatic event such as a fight, argument, accident, or romantic moment, where the tension will rise to a peak and then fall away as the tension is resolved.

You will be presented with one sentence at a time. Once you have read the sentence, you will press one of five keys to judge the increase or decrease in dramatic tension that this sentence caused. You will use five levels (with keyboard shortcuts in brackets):

• Big Decrease (A): A sudden decrease in the dramatic tension of the situation. In the cliff example, the person reaching the other side safely.

• Decrease (S): A slower, more gradual drop in the level of tension. For example, the cliff walker sees an easier route out.

• Same (Space): The tension stays at a similar level. In the cliff example, an ongoing description of the event.

• Increase (K): A gradual increase in the tension. Loose rocks fall near the cliff walker.

• Big Increase (L): A more sudden, dramatic increase, such as an argument. The cliff walker suddenly slips and falls.

POST ACTUAL INSTRUCTIONS In addition to the suspense annotation, the following review questions were asked:

• Please write a summary of the story in one or two sentences.

• Do you think the story is interesting or not? And why? One or two sentences.

• How interesting is the story? 1–5

The main purpose of this was to test whether the MTurk annotators were comprehending the stories and not trying to cheat by skipping over. Further work could be done to tie these answers into the suspense measures and also the WritingPrompts prompts.

C Writing Prompts Examples

The numbers are from the full WritingPrompts test set. Since stories were randomly sampled from this set for evaluation, the numbers are not in a contiguous block. There are a couple of nonsense sentences or sentences consisting entirely of punctuation; in the model these are excluded in pre-processing, but they are included here to match the sentence segmentation. There are also some unusual breaks such as "should n't"; these come from the word segmentation produced by the spaCy tokenizer.

C.1 Story 27

This is Story 27 from the test set, shown in Figure 4; it is the same as the example in the main text:

0. As I finished up my research on Alligator breeding habits for a story I was tasked with writing , a bell began to ring loudly throughout the office .