Narrative Modeling with Memory Chains and Semantic Supervision

Page created by Russell Osborne
 
CONTINUE READING
Narrative Modeling with Memory Chains and Semantic Supervision

                                                                 Fei Liu      Trevor Cohn       Timothy Baldwin
                                                                   School of Computing and Information Systems
                                                                           The University of Melbourne
                                                                                Victoria, Australia
                                                                     fliu3@student.unimelb.edu.au
                                                                 t.cohn@unimelb.edu.au tb@ldwin.net

                                                             Abstract                             Context: Sam loved his old belt. He matched it with
arXiv:1805.06122v1 [cs.CL] 16 May 2018

                                                                                                  everything. Unfortunately he gained too much weight.
                                             Story comprehension requires a deep se-              It became too small.
                                             mantic understanding of the narrative,               Coherent Ending: Sam went on a diet.
                                                                                                  Incoherent Ending: Sam was happy.
                                             making it a challenging task. Inspired by
                                             previous studies on ROC Story Cloze Test,
                                             we propose a novel method, tracking var-                  Figure 1: Story Cloze Test example.
                                             ious semantic aspects with external neural
                                             memory chains while encouraging each
                                             to focus on a particular semantic aspect.             The current state-of-the-art approach of
                                             Evaluated on the task of story ending pre-         Chaturvedi et al. (2017) is based on understanding
                                             diction, our model demonstrates superior           the context from three perspectives: (1) event
                                             performance to a collection of competitive         sequence, (2) sentiment trajectory, and (3) topic
                                             baselines, setting a new state of the art. 1       consistency. Chaturvedi et al. (2017) adopt exter-
                                                                                                nal tools to recognise relevant aspect-triggering
                                         1 Introduction
                                                                                                words, and manually design features to incorpo-
                                         Story narrative comprehension has been                 rate them into the classifier. While identifying
                                         a long-standing challenge in artificial in-            triggers has been made easy by the use of various
                                         telligence (Winograd, 1972; Turner, 1994;              linguistic resources, crafting such features is
                                         Schubert and Hwang, 2000). The difficulties of         time consuming and requires domain-specific
                                         this task arise from the necessity for understand-     knowledge along with repeated experimentation.
                                         ing not only narratives, but also commonsense          Inspired by the argument for tracking the dy-
                                         and normative social behaviour (Charniak, 1972).    namics of events, sentiment and topic, we propose
                                         Of particular interest in this paper is the work    to address the issues identified above with mul-
                                         by Mostafazadeh et al. (2016) on understanding      tiple external memory chains, each responsible
                                         commonsense stories in the form of a Story Cloze    for a single aspect. Building on Recurrent Entity
                                         Test: given a short story, we must predict the most Networks (EntNets), a superior framework to
                                         coherent sentential ending from two options (e.g.   LSTMs demonstrated by Henaff et al. (2017) for
                                         see Figure 1).                                      reasoning-focused question answering and cloze-
                                            Many attempts have been made to solve            style reading comprehension, we introduce a novel
                                         this problem, based either on linear classifiers    multi-task learning objective, encouraging each
                                         with handcrafted features (Schwartz et al., 2017;   chain to focus on a particular aspect. While still
                                         Chaturvedi et al., 2017), or representation learningmaking use of external linguistic resources, we do
                                         via deep learning models (Mihaylov and Frank,       not extract or design features from them but rather
                                         2017; Bugert et al., 2017; Mostafazadeh et al.,     utilise such tools to generate labels. The generated
                                         2017). Another widely used component of com-        labels are then used to guide training so that each
                                         petitive systems is language model-based features,  chain focuses on tracking a particular aspect. At
                                         which require training on large corpora in the storytest time, our model is free of feature engineering
                                         domain.                                             such that, once trained, it can be easily deployed to
                                            1
                                                                                             unseen data without preprocessing. Moreover, our
                                              Code available at http://github.com/liufly/narrative-modeling.
approach also differs in the lack of a ROC Stories
                                                              key kj
language model component, eliminating the need                                                   update
for large, domain-specific training corpora.                                       m̃ji
                                                                               φ          L    σ gate   ⊙
   Evaluated on the task of Story Cloze Test, our                                                gij
model outperforms a collection of competitive                                                  C
baselines, achieving state-of-the-art performance
                                                                                                         +
under more modest data requirements.                          memory mji−1                                   mji
                                                                                          hi
2 Methodology
In the story cloze test, given a story of length             Figure 2: Illustration of EntNet with a single
L, consisting of a sequence of context sentences             memory chain at time i. φ and σ represent Equa-
c = {c1 , c2 , . . . , cL }, we are interested in predict-   tions 1 and 2, while circled nodes L, C, ⊙ and
ing the coherent ending to the story out of two end-         + depict the location, content terms, Hadamard
ing options o1 and o2 . Following previous studies           product, and addition, resp.
(Schwartz et al., 2017), we frame this problem as a
binary classification task. Assuming o1 and o2 are           Memory update candidate. At every time step
the logical and inconsistent endings respectively                                                    −
                                                                                                     →
                                                             i, the internal memory update candidate m̃ji for
with their associated labels being y1 and y2 , we            the j-th memory chain is computed as:
assign y1 = 1 and y2 = 0. At test time, given
                                                                    −
                                                                    →        → →j
                                                                             −         →
                                                                                       −      −
                                                                                              →
a pair of possible endings, the system returns the                  m̃ji = φ( U −
                                                                                mi−1 + V kj + W hi )            (1)
one with the higher score as the prediction. In this
section, we first describe the model architecture            where kj is the embedding for the j-th entity
                                                                    → −
                                                                    −   →      −
                                                                               →
and then detail how we identify aspect-triggering            (key), U , V and W are trainable parameters, and
words in text and incorporate them in training.              φ is the parametric ReLU (He et al., 2015).

2.1   Proposed Model                                         Memory update. The update of each memory
                                                             chain is controlled by a gating mechanism:
First, to take neighbouring contexts into account,
we process context sentences and ending options                        −
                                                                       →g ji = σ(hi · kj + hi · −
                                                                                                →j )
                                                                                                m i−1           (2)
at the word level with a bi-directional gated recur-                   −
                                                                       →j −    → j      →
                                                                                        − j   −
                                                                                              →j
                                              →
                                              −                        m̊i = mi−1 + g i ⊙ m̃i                   (3)
rent unit (“Bi-GRU”: Chung et al. (2014)): h i =
−−→         −
            →
GRU(wi , h i−1 ) where wi is the embedding of                where ⊙ denotes Hadamard product (resulting
                                                                                                              −→
the i-th word in the story or ending option. The             in the unnormalised memory representation m̊ji ),
                                   ←−
backward hidden representation h i is computed               and σ is the sigmoid function. The gate −   →g ji con-
analogously but with a different set of GRU pa-              trols how much the memory chain should be up-
                                          −
                                          →      ←
                                                 −
rameters. We simply take the sum hi = h i + h i              dated, a decision factoring in two elements: (1) the
as the representation at time i. An ending option,           “location” term, quantifying how related the cur-
denoted o, is represented by taking the sum of the           rent input hi (the output of the Bi-GRU at time i)
final states of the same Bi-GRU over the ending              is to the key kj of the entity being tracked; and (2)
option word sequence.                                        the “content” term, measuring the similarity be-
   This representation is then taken as input                tween the input and the current state −  →j of the
                                                                                                      m i−1
to a Recurrent Entity Network (“EntNet”:                     entity tracked by the j-th memory chain.
Henaff et al. (2017)), capable of tracking the state
                                                             Normalisation. We normalise each memory
of the world with external memory. Developed in                              →j = −  → −    →             −
                                                                                                          →
the context of machine reading comprehension,                representation: −
                                                                             m  i    m̊ji /km̊ji k where km̊ji k de-
                                                                                             −
                                                                                             →
EntNet maintains a number of “memory chains”                 notes the Euclidean norm of m̊ji . In doing so, we
— one for each entity — and dynamically up-                  allow the model to forget in the sense that, as −  →j
                                                                                                                m  i
dates the representations of them as it progresses           is constrained to be lying on the surface of the unit
                                                                                                           −
                                                                                                           →
through a story. Here, we borrow the concept of              sphere, adding new information carried by m̃ji and
“entity” and use each chain to track a unique as-            then performing normalisation inevitably causes
pect. An illustration of EntNet is provided in               forgetting in that the cosine distance between the
Figure 2.                                                    original and updated memory decreases.
Bi-directionality. In contrast to EntNet, we                                Ricky fell while hiking in the woods
apply the above steps in both directions, scanning             lEvent         0    1     0      1    0   0    1
a story both forward and backward, with arrows                 lSentiment     0    1     0      0    0   0    0
denoting the processing direction. ← −j is computed
                                     m i
                                                               lT opic        1    1     0      1    0   0    1
analogously to − →j but with a different set of pa-
                 m  i
                                                              Table 1: An example of the linguistically moti-
rameters, and mji = −  →j + ←
                       m  i   m−j . We further define
                                 i
                                                              vated memory chain supervision binary labels.
gij to be the average of the gate values of both di-
rections at time i for the j-th chain:                        to open its gate only when it sees a trigger bearing
                                                              information semantically sensitive to that particu-
                 gij   =   (−
                            →
                            g ji   +   ←
                                       −
                                       g ji )/2        (4)
                                                              lar aspect. Note that at test time, we do not apply
which will be used for semantic supervision.                  such supervision. This effectively becomes a bi-
                                                              nary tagging task for each memory chain indepen-
Final classifier. The final prediction ŷ to an end-          dently and results in individual memory chains be-
ing option given its context story is performed by            ing sensitive to only a specific set of triggers bear-
incorporating the last states of all memory chains            ing one of the three types of signals.
in the form of a weighted sum u:                                 While largely inspired by Chaturvedi et al.
                                                              (2017), our approach differs in how we extract
            pj = softmax((kj )⊤ Watt o)                (5)    and use such features. Also note that, while still
                 X
            u=       pj mjT                            (6)    making use of external linguistic resources to de-
                       j                                      tect trigger words, our approach lets the mem-
                                                              ory chains decide how such words should in-
where T denotes the total number of words in a                fluence the final prediction, as opposed to the
story and Watt is a trainable weight matrix. We               handcrafted conditional probability features of
then transform u to get the final prediction:                 Chaturvedi et al. (2017).
                ŷ = σ(Rφ(Hu + o))                     (7)       To identify the trigger words, we use external
                                                              linguistic tools, and mark trigger words for each
where R and H are trainable weight matrices.                  aspect with a label of 1. An example is presented
                                                              in Table 1, noting that the same word can act as
2.2   Semantically Motivated Memory Chains                    trigger for multiple aspects.
In order to encourage each chain to focus on a
                                                              Event sequence. We parse each sentence into
particular aspect, we supervise the activation of
                                                              its FrameNet representation with SEMAFOR
each update gate (Equation 2) with binary labels.
                                                              (Das et al., 2010), and identify each frame target
Intuitively, for the sentiment chain, a sentiment-
                                                              (word or phrase tokens evoking a frame).
bearing word (e.g. amazing) receives a label of 1,
allowing the model to open the gate and therefore             Sentiment                  trajectory. Following
change the internal state relevant to this aspect,            Chaturvedi et al. (2017), we utilise a pre-compiled
while a neutral one (e.g. school) should not trig-            list of sentiment words (Liu et al., 2005). To take
ger the activation with an assigned label of 0. To            negation into account, we parse each sentence
achieve this, we use three memory chains to in-               with the Stanford Core NLP dependency
dependently track each of: (1) event sequence, (2)            parser (Manning et al., 2014; Chen and Manning,
sentiment trajectory, and (3) topical consistency.            2014) and include negation words as trigger
To supervise the memory update gates of these                 words.
chains, we design three sequences of binary labels:
lj = {l1j , l2j , . . . , lTj } for j ∈ [1, 3] representing   Topical consistency. We process each sen-
event, sentiment, and topic, and lij ∈ {0, 1}. The            tence with the Stanford Core NLP POS tag-
label at time i for the j-th aspect is only assigned          ger and identify nouns and verbs, following
a value of 1 if the word is a trigger for that particu-       Chaturvedi et al. (2017).
lar aspect: lij = 1; otherwise lij = 0. During train-
ing, we utilise such sequences of binary labels to            2.3 Training Loss
supervise the memory update gate activations of               In addition to the cross entropy loss of the final
each chain. Specifically, each chain is encouraged            prediction of right/wrong endings, we also take
into account the memory update gate supervision                      Model                Acc.      SD
of each chain by adding the second term. More for-
                                                                     DSSM                 58.5      —
mally, the model is trained to minimise the loss:
                                                                     TBMIHAYLOV           72.8      —
                            X
 L = XEntropy(y, ŷ) + α        XEntropy(lij , gij )                 MSAP                 75.2      —
                              i,j                                    HCM                  77.6      —
                                                                     EntNet †             77.6     ±0.5
where ŷ and gij are defined in Equations 7 and 4 re-                Our model†           78.5     ±0.5
spectively, y is the gold label for the current ending   Table 2: Performance on the Story Cloze test set.
option o, and lij is the semantic supervision binary     Bold = best performance, † = ave. accuracy over 5
label at time i for the j-th memory chain. In our        runs, SD = standard deviation, “—” = not reported.
experiments, we empirically set α to 0.5 based on
development data.
                                                         Evaluation. Following previous studies, we
3 Experiments                                            evaluate the performance in terms of coherent
                                                         ending prediction accuracy: #correct
                                                                                        #total . Here, we re-
3.1   Experimental Setup
                                                         port the average accuracy over 5 runs with dif-
Dataset. To test the effectiveness of our model,         ferent random seeds. Echoing Melis et al. (2018),
we employ the Story Cloze Test dataset of                we also observe some variance in model perfor-
Mostafazadeh et al. (2016). The development and          mance even if training is carried out with the same
test set each consist of 1,871 4-sentence stories,       random seed, which is largely due to the non-
each with a pair of ending options. Consistent with      deterministic ordering of floating-point operations
previous studies, we split the development set into      in our environment (Tensorflow (Abadi et al.,
a training and validation set (for early stopping),      2015) with a single GPU). Therefore, to account
resulting in 1,683 and 188 in each set, resp. Note       for the randomness, we further train our model 5
that while most current approaches make use of           times for each random seed and select the model
the much larger training set, comprised of 100K 5-       with the best validation performance.
sentence ROC stories (with coherent endings only,           We benchmark against a collection of
also released as part of the dataset) to train a lan-    strong baselines, including the top-3 sys-
guage model, we make no use of this data.                tems of the 2017 LSDSem workshop shared
Model configuration. We initialise our model             task     (Mostafazadeh et al.,     2017):     MSAP
with word2vec embeddings (300-D, pre-trained             (Schwartz et al., 2017), HCM (Chaturvedi et al.,
on 100B Google News articles, not updated dur-           2017)2 , and TBMIHAYLOV (Mihaylov and Frank,
ing training: Mikolov et al. (2013a,b)). In addition     2017). The first two primarily rely on a sim-
to the three supervised chains, we also add a “free”     ple logistic regression classifier, both taking
chain, unconstrained to any semantic aspect.             linguistic and probability features from a ROC
   Training is carried out over 200 epochs with          Stories domain-specific neural language model.
the FTRL optimiser (McMahan et al., 2013) and a          TBMIHAYLOV is LSTM-based; we also include
batch size of 128 and learning rate of 0.1. We use       DSSM (Mostafazadeh et al., 2016). We addi-
the following hyper-parameters for weight matri-         tionally implement a bi-directional EntNet
ces in both directions: R ∈ R300×1 , H, U, V, W          (Henaff et al., 2017) with the same hyper-
are all matrices of size R300×300 , and hidden size      parameter settings as our model and no semantic
of the Bi-GRU is 300. Dropout is applied to the          supervision.3
output of φ in the final classifier (Equation 7) with    3.2 Results and Discussion
a rate of 0.2. Moreover, we employ the technique
introduced by Gal and Ghahramani (2016) where            The experimental results are shown in Table 2.
the same dropout mask is applied at every step to        State-of-the-art results. Our model outper-
the input wi to the Bi-GRU and the input hi to the       forms a collection of strong baselines, setting a
memory chains with rates of 0.5 and 0.2 respec-
                                                             2
tively. Lastly, to curb overfitting, we regularise the         We take the performance reported in a subsequent paper,
                                                         3.2% better than the original submission to the shared task.
last layer (Equation 7) with an L2 penalty on its            3
                                                               EntNet is also subject to the same model selection cri-
weights: λkRk where λ = 0.001.                           terion described above.
Model                          Acc.     SD
                                                             Our model                      78.5    ±0.5
                                                             —bi-directionality             77.8    ±0.7
                                                             —all semantic supervision      77.6    ±0.5
                                                               —event sequence              78.7    ±0.2
                                                               —sentiment trajectory        75.9    ±0.4
   Event Key Sentiment Key Topic Key Free Key                  —topical consistency         77.3    ±0.4
   Event Word Sentiment Word Topic Word
                                                               —free chain                  77.0    ±0.4
Figure 3: t-SNE scatter plot of aspect-triggering        Table 3: Ablation study. Average accuracy over
words (output representations hi by Bi-GRU,              5 runs on the Story Cloze test set (all subject to
500 randomly sampled from each aspect) and the           the same model selection criterion as our model).
learned keys. EntNet (left) vs. our model (right).       Bold: best performance. SD: standard deviation.
Best viewed in colour.

                                                         performance down to 77.8, is inferior to its bi-
new state of the art. Note that this is achieved with-
                                                         directional cousin. Removing all semantic super-
out any external linguistic resources at test time or
                                                         vision (resulting in 4 free memory chains) is
domain-specific language model-based probabil-
                                                         also damaging to the accuracy (dropping to 77.6).
ity features, highlighting the effectiveness of the
                                                         Among the different types of semantic super-
proposed model.
                                                         vision, we see various degrees of performance
EntNet vs. our model. We see clear improve-              degradation, with removing sentiment trajectory
ments of the proposed method over EntNet,                being the most detrimental, reflecting its value
an absolute gain of 0.9% in accuracy. This vali-         to the task. Interestingly, we observe improve-
dates the hypothesis that encouraging each mem-          ment when removing event sequence supervision
ory chain to focus on a unique, well-defined aspect      from consideration. We suspect that this is mainly
is beneficial.                                           due to the noise introduced by the rather inaccu-
                                                         rate FrameNet representation output of SEMAFOR
Discussion. To better understand the benefits of         (F1 = 61.4% in frame identification as reported in
the proposed method, we visualise the learned            Das et al. (2010)). While it is possible that replac-
word representations (output representation of the       ing SEMAFOR with the SemLM approach (with
Bi-GRU, hi ) and keys between EntNet and our             the extended frame definition to include explicit
model in Figure 3. With EntNet, while senti-             discourse markers, e.g. but) in Peng and Roth
ment words form a loose cluster, words bearing           (2016) and Chaturvedi et al. (2017) may poten-
event and topic signal are placed in close prox-         tially reduce the amount of noise, we leave this
imity and are largely indistinguishable. With our        exercise for future work.
model, on the other hand, the degree of separation
is much clearer. The intersection of a small por-        4 Conclusion
tion of the event and topic groups is largely due to
the fact that both aspects include verbs. Another        In this paper, we have proposed a novel model
interesting perspective is the location of the au-       for tracking various semantic aspects with exter-
tomatically learned keys (displayed as diamonds):        nal memory chains. While requiring less domain-
while all the keys with EntNet end up overlap-           specific training data, our model achieves state-
ping, indicating little difference among them, the       of-the-art performance on the task of ROC Story
keys learned by our method demonstrate seman-            Cloze ending prediction, beating a collection of
tic diversity, with each placed within its respective    strong baselines.
cluster. Note that the free key is learned to com-
plement the event key, a difficult challenge given       Acknowledgments
the two disjoint clusters.
                                                         We thank the anonymous reviewers for their
Ablation study. We further perform a ablation            valuable feedback, and gratefully acknowledge
study to analyse the utility of each component           the support of Australian Government Research
in Table 3. The uni-directional variant, with its        Training Program Scholarship. This work was
also supported in part by the Australian Research        Yarin Gal and Zoubin Ghahramani. 2016. A theo-
Council.                                                   retically grounded application of dropout in recur-
                                                           rent neural networks. In Proceedings of Advances
                                                           in Neural Information Processing Systems (NIPS
                                                           2016), pages 1027–1035, Barcelona, Spain.
References
                                                        Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Martı́n Abadi, Ashish Agarwal, Paul Barham, Eugene         Sun. 2015. Delving deep into rectifiers: Surpass-
  Brevdo, Zhifeng Chen, Craig Citro, Greg S. Cor-          ing human-level performance on ImageNet classifi-
  rado, Andy Davis, Jeffrey Dean, Matthieu Devin,          cation. In Proceedings of the 2015 IEEE Interna-
  Sanjay Ghemawat, Ian Goodfellow, Andrew Harp,            tional Conference on Computer Vision (ICCV 2015),
  Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal      pages 1026–1034, Washington, DC, USA.
  Jozefowicz, Lukasz Kaiser, Manjunath Kudlur,
  Josh Levenberg, Dandelion Mané, Rajat Monga,         Mikael Henaff, Jason Weston, Arthur Szlam, Antoine
  Sherry Moore, Derek Murray, Chris Olah, Mike             Bordes, and Yann LeCun. 2017. Tracking the world
  Schuster, Jonathon Shlens, Benoit Steiner, Ilya          state with recurrent entity networks. In Proceed-
  Sutskever, Kunal Talwar, Paul Tucker, Vincent Van-       ings of the 5th International Conference on Learn-
  houcke, Vijay Vasudevan, Fernanda Viégas, Oriol         ing Representations (ICLR 2017), Toulon, France.
  Vinyals, Pete Warden, Martin Wattenberg, Mar-
  tin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015.        Bing Liu, Minqing Hu, and Junsheng Cheng. 2005.
  TensorFlow: Large-scale machine learning on heterogeneous   systems.
                                                           Opinion   observer: analyzing and comparing opin-
  Software available from tensorflow.org.                  ions on the web. In Proceedings of the 14th Interna-
                                                           tional Conference on World Wide Web (WWW 2005),
Michael Bugert, Yevgeniy Puzikov, Andreas Rücklé,        pages 342–351, Chiba, Japan.
  Judith Eckle-Kohler, Teresa Martin, Eugenio
  Martı́nez-Cámara, Daniil Sorokin, Maxime Peyrard,    Christopher D. Manning, Mihai Surdeanu, John Bauer,
  and Iryna Gurevych. 2017. LSDSem 2017: Explor-           Jenny Finkel, Steven J. Bethard, and David Mc-
  ing data generation methods for the story cloze test.    Closky. 2014. The Stanford CoreNLP natural lan-
  In Proceedings of the 2nd Workshop on Linking            guage processing toolkit. In Association for Compu-
  Models of Lexical, Sentential and Discourse-level        tational Linguistics (ACL) System Demonstrations,
  Semantics (LSDSem 2017), pages 56–61, Valencia,          pages 55–60.
  Spain.
                                                        H. Brendan McMahan, Gary Holt, David Sculley,
Eugene Charniak. 1972. Toward a model of chil-             Michael Young, Dietmar Ebner, Julian Grady,
  dren”s story comprehension. Technical report,            Lan Nie, Todd Phillips, Eugene Davydov, Daniel
  Massachusetts Institute of Technology, Cambridge,        Golovin, et al. 2013. Ad click prediction: A view
  USA.                                                     from the trenches. In Proceedings of the 19th ACM
                                                           SIGKDD International Conference on Knowledge
Snigdha Chaturvedi, Haoruo Peng, and Dan Roth.             Discovery and Data Mining (KDD 2013), pages
  2017. Story comprehension for predicting what hap-       1222–1230, Chicago, USA.
  pens next. In Proceedings of the 2017 Conference
  on Empirical Methods in Natural Language Pro-         Gábor Melis, Chris Dyer, and Phil Blunsom. 2018.
  cessing (EMNLP 2017), pages 1603–1614, Copen-            On the state of the art of evaluation in neural lan-
  hagen, Denmark.                                          guage models. In Proceedings of the 6th Inter-
                                                           national Conference on Learning Representations
                                                           (ICLR 2018), Vancouver, Canada.
Danqi Chen and Christopher Manning. 2014. A fast
  and accurate dependency parser using neural net-      Todor Mihaylov and Anette Frank. 2017. Story cloze
  works. In Proceedings of the 2014 Conference on          ending selection baselines and data examination. In
  Empirical Methods in Natural Language Processing         Proceedings of the 2nd Workshop on Linking Models
  (EMNLP 2014), pages 740–750, Doha, Qatar.                of Lexical, Sentential and Discourse-level Semantics
                                                            (LSDSem 2017), pages 87–92, Valencia, Spain.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho,
  and Yoshua Bengio. 2014. Empirical evaluation of       Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
  gated recurrent neural networks on sequence model-       Dean. 2013a. Efficient estimation of word represen-
  ing. In Proceedings of the NIPS 2014 Deep Learn-         tations in vector space. In Proceedings of the 1st
  ing and Representation Learning Workshop.                International Conference on Learning Representa-
                                                           tions (ICLR 2013), Scottsdale, USA.
Dipanjan Das, Nathan Schneider, Desai Chen, and
  Noah A. Smith. 2010. Probabilistic frame-semantic      Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Cor-
  parsing. In Human Language Technologies: The             rado, and Jeff Dean. 2013b. Distributed representa-
  2010 Annual Conference of the North American             tions of words and phrases and their compositional-
  Chapter of the Association for Computational Lin-        ity. In Proceedings of the 27th Annual Conference
  guistics (NAACL-HLT 2010), pages 948–956, Los            on Neural Information Processing Systems (NIPS
  Angeles, USA.                                            2013), pages 3111–3119, Stateline, USA.
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong
  He, Devi Parikh, Dhruv Batra, Lucy Vanderwende,
  Pushmeet Kohli, and James Allen. 2016. A cor-
  pus and cloze evaluation for deeper understanding of
  commonsense stories. In Proceedings of the 2016
  Conference of the North American Chapter of the
  Association for Computational Linguistics: Human
  Language Technologies (NAACL-HLT 2016), pages
  839–849, San Diego, USA.
Nasrin Mostafazadeh, Michael Roth, Annie Louis,
  Nathanael Chambers, and James Allen. 2017. LS-
  DSem 2017 shared task: The story cloze test. In
  Proceedings of the 2nd Workshop on Linking Models
  of Lexical, Sentential and Discourse-level Semantics
  (LSDSem 2017), pages 46–51, Valencia, Spain.
Haoruo Peng and Dan Roth. 2016. Two discourse
  driven language models for semantics. In Proceed-
  ings of the 54th Annual Meeting of the Association
  for Computational Linguistics (ACL 2016), pages
  290–300, Berlin, Germany.
Lenhart K. Schubert and Chung Hee Hwang. 2000.
  Episodic logic meets little red riding hood: A com-
  prehensive, natural representation for language un-
  derstanding.    In Lucja Iwanska and Stuart C.
  Shapiro, editors, Natural Language Processing and
  Knowledge Representation: Language for Knowl-
  edge and Knowledge for Language, pages 111–174.
  MIT/AAAI Press, Menlo Park, USA.
Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila
  Zilles, Yejin Choi, and Noah A. Smith. 2017. The
  effect of different writing tasks on linguistic style:
  A case study of the ROC story cloze task. In Pro-
  ceedings of the 21st Conference on Computational
  Natural Language Learning (CoNLL 2017), pages
  15–25, Vancouver, Canada.
Scott R. Turner. 1994. The Creative Process: A
  Computer Model of Storytelling and Creativity.
  Lawrence Erlbaum, Hillsdale, USA.
Terry Winograd. 1972. Understanding natural lan-
  guage. Cognitive Psychology, 3(1):1–191.
You can also read