A Bi-Encoder LSTM Model For Learning Unstructured Dialogs

Diwanshu Shekhar*            Pooran S. Negi†             Mohammad Mahoor‡
University of Denver         University of Denver        University of Denver

* diwanshu.shekhar@du.edu
† pooran.negi@du.edu
‡ mohammad.mahoor@du.edu

arXiv:2104.12269v1 [cs.CL] 25 Apr 2021

Abstract

Creating a data-driven model that is trained on a large dataset of unstructured dialogs is a crucial step in developing Retrieval-based Chatbot systems. This paper presents a Long Short Term Memory (LSTM) based architecture that learns unstructured multi-turn dialogs and provides results on the task of selecting the best response from a collection of given responses. Ubuntu Dialog Corpus Version 2 was used as the corpus for training. We show that our model achieves 0.8%, 1.0% and 0.3% higher accuracy for Recall@1, Recall@2 and Recall@5 respectively than the benchmark model. We also show the results of experiments performed with several similarity functions, model hyper-parameters and word embeddings on the proposed architecture.

1 Introduction

Recently, statistical techniques based on recurrent neural networks (RNN) have achieved remarkable successes in a variety of natural language processing tasks, leading to a great deal of commercial and academic interest in the field (Bengio et al., 2013; Cambria and White, 2014). Significant progress has been made in the areas of Machine Translation, Text Categorization, Spam Filtering, and Summarization. Research in developing Dialog Systems or Conversational Agents, perhaps a desirable application of the future, has been growing rapidly. A Dialog System can communicate with humans in text, speech or both, and can be classified into Task-oriented Systems and Chatbot Systems.

Task-oriented systems are designed for a particular task and set up to have short conversations. These systems interact with humans to get information to help complete the task. They include the digital assistants that are now on every cellphone or on home controllers, and voice assistants such as Siri, Cortana, Alexa, Google Now/Home, etc.

Chatbot Systems, the area of this paper, are systems that can carry on extended conversations with the goal of mimicking the unstructured conversations or 'chats' characteristic of human-human interaction. Lowe et al. (2015) explored learning models such as TF-IDF (Term Frequency-Inverse Document Frequency), a Recurrent Neural Network (RNN), and a Dual Encoder (DE) based on the Long Short Term Memory (LSTM) model, suitable for learning from the Ubuntu Dialog Corpus Version 1 (UDCv1). We use this same architecture, but on Ubuntu Dialog Corpus Version 2 (UDCv2), as a benchmark, and introduce a new LSTM based architecture called the Bi-Encoder LSTM model (BE) that achieves 0.8%, 1.0% and 0.3% higher accuracy for Recall@1, Recall@2 and Recall@5 respectively than the DE model. In contrast to the DE model, the proposed BE model has separate LSTM networks for encoding utterances and responses. The BE model also has a different similarity measure for utterance and response matching than that of the benchmark model. We further show the results of the various experiments necessary to select the best similarity function, hyper-parameters and word embedding for the BE model.

Section 2 describes the related current state-of-the-art research on Chatbot Systems. We describe the proposed BE model in Section 3. The experiments and results are described in Section 4, and we conclude the paper in Section 5 with suggestions for potential future work.
2 Background

For clarity, we establish a notation in this paper wherein the type of a mathematical quantity is denoted by its representation. Scalars are represented by lower case letters $i, j, k, \dots; \alpha, \beta, \gamma, \dots$; vectors are represented by lower case bold letters $\mathbf{a}, \mathbf{b}, \dots, \mathbf{e}, \dots$; and matrices are represented by bold upper case letters $\mathbf{A}, \mathbf{B}, \dots, \mathbf{E}, \dots$. Calligraphic letters $\mathcal{A}, \mathcal{T}, \dots$ are used to represent sets of objects. We consistently follow a similar convention for functions, where $f$ represents scalar-valued functions, bold $\mathbf{f}$ represents vector-valued functions, and $\mathbf{A}_{i,\cdot}$ represents the $i$th row of the matrix $\mathbf{A}$.

2.1 Related Work

Early Chatbot systems such as ELIZA (Weizenbaum, 1966), ALICE (Wallace, 2008) and PARRY (Colby et al., 1971) were based on pattern matching, where a human statement was matched to a pattern and a response pertaining to the matched pattern was retrieved. These Chatbot systems were rule based and needed domain expertise to hand-craft rules in advance, which made their design expensive and tedious. To address this limitation, the idea of the corpora-based Chatbot system was introduced. At the time of this research, two large corpora are available for designing corpora-based Chatbot systems: the Twitter Corpus (Ritter et al., 2010) and the Ubuntu Dialog Corpus (Lowe et al., 2015). Serban et al. (2015) surveyed all available corpora for corpora-based Chatbot systems.

One popular type of corpora-based Chatbot system is the Information Retrieval (IR) based Chatbot system. In IR-based Chatbot systems, an utterance is matched against a repository of responses and the best-matching response is retrieved. If this repository is too big, the retrieval process may be too slow. To address this problem, Jafarpour et al. (2010) devised a filtering technique based on feature selection to reduce the size of the set of candidate responses to match the given utterance with. Wang et al. (2013) used the same filtering technique but used RankSVM to match utterances with responses. Lowe et al. (2015) used an LSTM-based Dual Encoder (DE) model to retrieve the best response from a set of 10 responses (since the set of responses to choose from was already small, filtering was not necessary). Kadlec et al. (2015) showed that an ensemble of LSTM, DE and CNN models performed better than the DE model.

Another type of corpora-based Chatbot system is the Generative Chatbot system. One clear benefit of Generative systems is that they do not need a repository of responses to choose from, as the response is generated by the system itself. Ritter et al. (2011) used a Sequence-to-Sequence RNN (seq2seq) model, a model commonly used for Machine Translation (Sutskever et al., 2014), to generate a response given an utterance. Although seq2seq models work well in Machine Translation, the model did not perform very well on the response generation task: in machine translation, words or phrases in the source and target sentences tend to align well with each other, but in dialogs a user utterance may share no words or phrases with a coherent response. Several modifications of the seq2seq model have been made for response generation. Li et al. (2015) made a modification to address the problem of the seq2seq model producing responses like "I'm OK" or "I don't know" that tend to end the conversation. Lowe et al. (2017) used a hierarchical approach to use longer prior context in the seq2seq model. The basic seq2seq model focuses on generating single responses, and so does not tend to do a good job of continuously generating responses that cohere across multiple turns. This can be addressed by using reinforcement learning (Li et al., 2016), as well as techniques like adversarial networks (Li et al., 2017) that can select multiple responses that make the overall conversation more natural.

Not all Generative Chatbot systems are based on the seq2seq model. Shang et al. (2015) showed that transduction models can be used to generate responses. Wen et al. (2015) presented a statistical language generator based on a semantically controlled Long Short-Term Memory (LSTM) structure. Although not related to Chatbot systems, Pan et al. (2016) introduced an LSTM-E architecture that was able to generate a description given a video. Kannan et al. (2016) demonstrated a hybrid system called Smart Reply that leverages both the Retrieval and Generative concepts. At the time of this research, generative-based systems are not doing so well, and most production systems are essentially retrieval-based, such as Cleverbot (Carpenter, 2011) and Microsoft's Little Bing system.
Figure 1: Bi-Encoder LSTM Architecture. RNNs are colored in grey and white to show two different LSTM networks.
3 Bi-Encoder Model

The proposed BE model architecture in Figure 1 is motivated by the typical setup of a conversation between two persons. Each person has to encode long and short term conversation contexts to best respond to a spoken sentence (an utterance or context).

As a natural design choice, in the BE model one LSTM cell (colored in grey) learns an encoding of the utterance (or question, or context), while the other LSTM cell (colored in white) learns an encoding of the response (or answer). The sequence of GloVe embedding vectors (Pennington et al., 2014) of an utterance is fed into the upper LSTM, while the sequence of embedding vectors of a response is fed into the lower LSTM cell. The vectors representing the final states $\mathbf{h}_t \in \mathbb{R}^s$ of the upper and lower LSTM cells are used as the final representations of the utterance and the response, $\mathbf{u}_e$ and $\mathbf{r}_e$ respectively. To drive the learning of $\mathbf{u}_e$ and $\mathbf{r}_e$, we measure their similarity in the hidden vector space using the dot product, i.e.

$\mathrm{sim}(\mathbf{u}_e, \mathbf{r}_e) = \langle \mathbf{u}_e, \mathbf{r}_e \rangle$    (1)
The training of the model via BPTT is done by minimizing the binary cross entropy $\mathcal{X}(q, p)$ between the learned probability $p$ and the ground truth pairing probability $q \in \{0, 1\}$, where 1 denotes that $\mathbf{u}_e$ and $\mathbf{r}_e$ are genuinely paired and 0 denotes otherwise. Using the similarity in Eq. (1), $p$ is calculated as:

$p = \frac{1}{1 + e^{-(\mathrm{sim}(\mathbf{u}_e, \mathbf{r}_e) + b)}}$    (2)

In Eq. (2), $b$ is a scalar free bias parameter that is learned by the model. The aforementioned cross entropy loss is given by:

$\mathcal{X}(q, p) = -q \log(p) - (1 - q)\log(1 - p)$    (3)

The model is trained using the Adam Optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 by minimizing the loss function in Eq. (3).
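To make the scoring pipeline concrete, the following is a minimal sketch of Eqs. (1)-(3) written against the tf.keras API (the paper's own implementation used TensorFlow v1.7, as noted in Section 4). Only the vocabulary size (91,620) and the Adam learning rate (0.001) come from the paper; the embedding dimension, LSTM width and sequence length are illustrative placeholders.

```python
import tensorflow as tf

class ScalarBias(tf.keras.layers.Layer):
    """Adds the single learned scalar bias b of Eq. (2) to the similarity."""
    def build(self, input_shape):
        self.b = self.add_weight(name="b", shape=(), initializer="zeros")

    def call(self, x):
        return x + self.b

# Illustrative sizes; the paper tunes cell and batch sizes separately.
VOCAB, EMBED_DIM, LSTM_UNITS, MAX_LEN = 91620, 300, 256, 160

utterance = tf.keras.Input(shape=(MAX_LEN,), dtype="int32", name="utterance")
response = tf.keras.Input(shape=(MAX_LEN,), dtype="int32", name="response")

# One embedding lookup (e.g., initialized from GloVe) feeds both encoders.
embed = tf.keras.layers.Embedding(VOCAB + 1, EMBED_DIM, mask_zero=True)  # +1 for padding id

# Two *separate* LSTMs: the "grey" utterance encoder and the "white"
# response encoder; only their final hidden states are kept.
u_e = tf.keras.layers.LSTM(LSTM_UNITS, name="utterance_encoder")(embed(utterance))
r_e = tf.keras.layers.LSTM(LSTM_UNITS, name="response_encoder")(embed(response))

sim = tf.keras.layers.Dot(axes=1)([u_e, r_e])                 # Eq. (1): dot product
p = tf.keras.layers.Activation("sigmoid")(ScalarBias()(sim))  # Eq. (2): p = sigmoid(sim + b)

model = tf.keras.Model([utterance, response], p)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy")                     # Eq. (3)
```

With this wiring, training reduces to calling model.fit on the 1 million labelled (context, response) pairs described in Section 4.1.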
Figure 2 shows the Cumulative Match Characteristic (CMC) curve, which gives the true positive identification rate of the BE model for Recall@k for $k \in \{1, 2, \dots, 10\}$.

Figure 2: CMC Curve of the BE Model

In the subsequent sections, we look at the various experiments that helped us select the best similarity function, hyper-parameters and word embedding for the BE model. We also show the performance of the BE model in comparison to the DE model.

4 Experiments and Results

All our models were implemented in Tensorflow v1.7 and trained using a GeForce GTX 1080 Ti NVIDIA GPU. We used the same training data, UDC, as Lowe et al. (2015), but in its second version (UDCv2). Models are trained on 1 million pairs of utterances and responses from the training set and evaluated against a test set. We fine-tune the model hyper-parameters and determine the optimum similarity function and word embedding using the validation dataset.

For evaluation and model selection, we present our model with 10 response candidates, consisting of one correct response and nine incorrect responses. This set of 10 response candidates per context is provided in the validation and test sets of UDCv2 (more details in Section 4.1). The model ranks these responses, and its ranking is considered correct if the correct response is among the first k candidates. This quantity is denoted as Recall@k. Specifically, we report mean values of Recall@1, Recall@2 and Recall@5.
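As a concrete reading of the metric, here is a small sketch (our illustration, not the authors' evaluation code) that computes mean Recall@k from a score matrix laid out like the UDCv2 candidate sets, with the true response in column 0:

```python
import numpy as np

def recall_at_k(scores: np.ndarray, k: int) -> float:
    """Mean Recall@k for a (num_contexts, 10) matrix of candidate scores.

    Column 0 is assumed to hold the model's score for the true response,
    columns 1-9 the scores for the nine distractors, mirroring the UDCv2
    candidate layout described above.
    """
    # Number of distractors that outscore the true response, per context.
    rank_of_truth = (scores > scores[:, [0]]).sum(axis=1)
    return float((rank_of_truth < k).mean())

# Toy usage: random scores for 5 contexts with 10 candidates each.
rng = np.random.default_rng(0)
toy_scores = rng.normal(size=(5, 10))
print([recall_at_k(toy_scores, k) for k in (1, 2, 5)])
```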
Figure 3: DE Model. All RNNs are colored in white to show that the same LSTM network is fed first by the utterance and then by the response.

For benchmarking, we use the DE model in Lowe et al. (2015) and the results of the DE model on UDCv2 as published at https://github.com/rkadlec/ubuntu-ranking-dataset-creator. In contrast to the BE model, the DE model has one LSTM cell that encodes both the utterance and the response. The encoding of the utterance, $\mathbf{u}_e$, is multiplied by a trainable matrix $\mathbf{M}$, and the result is compared with the encoding of the response, $\mathbf{r}_e$, by a dot product (Figure 3).
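For contrast with the BE scoring rule, here is a hedged sketch of the DE score as just described: the shared-encoder outputs are compared through the trainable matrix $\mathbf{M}$. Squashing the result with a sigmoid mirrors Eq. (2) and is our assumption rather than a detail stated here.

```python
import numpy as np

def de_score(u_e: np.ndarray, r_e: np.ndarray, M: np.ndarray) -> float:
    """DE model score: compare u_e * M against r_e with a dot product.

    In the DE model, u_e and r_e come from the *same* LSTM; M is a
    trainable (s, s) matrix. The sigmoid squashing is our assumption,
    mirroring the BE pairing probability.
    """
    logit = float(u_e @ M @ r_e)  # (u_e^T M) . r_e
    return 1.0 / (1.0 + np.exp(-logit))

# Toy usage with hidden size s = 4:
rng = np.random.default_rng(1)
u, r, M = rng.normal(size=4), rng.normal(size=4), rng.normal(size=(4, 4))
print(de_score(u, r, M))
```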
We also reproduced the DE model for comparison, and we refer to it as the DER model. Note that the DE model was originally modeled and trained in Theano.

4.1 Data

The Ubuntu Dialog Corpus (UDC) is the largest freely available multi-turn dialog corpus (Lowe et al., 2015). It was constructed from the Ubuntu chat logs, a collection of logs from Ubuntu-related chat rooms on the Freenode IRC network. Although multiple users can talk at the same time in a chat room, the logs were pre-processed using heuristics to create two-person conversations. The resulting corpus consists of almost one million two-person conversations, where a user seeks help with his/her Ubuntu-related problems (the average length of a dialog is 8 turns, with a minimum of 3 turns). Because of its size, the corpus is well-suited for deep learning in the context of dialogue systems.

UDCv2, released in 2017, made several significant updates to its predecessor (https://github.com/rkadlec/ubuntu-ranking-dataset-creator). To summarize: UDCv2 is separated into training, validation and test sets; the sampling procedure for the context length in the validation and test sets is changed from an inverse distribution to a uniform distribution; the tokenization and entity replacement procedure was removed; differentiation between the end of an utterance (__eou__) and the end of a turn (__eot__) has been added; and a bug that caused the distribution of false responses in the test and validation sets to differ from that of the true responses was fixed.

The training set consists of 1 million labelled pairs of utterances and responses. It has an equal distribution of true context-response pairs, labeled 1, versus context-distraction pairs, labeled 0 (Table 1 shows one example of each). Keeping all the words that occur at least 5 times, the training set has a vocabulary of 91,620. The average utterance is 86 words long and the average response is 17 words long.

Context: you just click userprefer and creat one ... __eou__ __eot__ i ca n't access the wiki at all - it throw http auth at me __eou__ __eot__
Response: unless nat get distract by someth shinier
Label: 1

Context: i think that tutori be outdat - veri first instruct fail __eou__ and it differ slight to what ubotu have to say __eou__
Response: i think that tutori be outdat - veri first instruct fail __eou__ and it differ slight to what ubotu have to say __eou__
Label: 0

Table 1: Examples from the training dataset of UDCv2 showing both a correct (1) and an incorrect (0) response label

The validation dataset consists of 19,560 examples, where each example consists of a context and 10 responses, of which the first response is always the true response. The test dataset, structured the same as the validation dataset, consists of 18,920 examples. The correct response is the actual next utterance in the dialogue, and a false response is an utterance randomly sampled from elsewhere within a set of dialogues in UDC that has been set aside for the creation of the validation and test sets (Lowe et al., 2015). The words of UDCv2 are stemmed (suffixes are stripped from the ends of words) and lemmatized (words that have the same root, despite their surface differences, are normalized).
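To make the two layouts concrete, the following illustrative sketch shows how a training record and a validation/test record differ; the field names are ours, not official column headers of the released files.

```python
# Hypothetical record shapes for UDCv2 as described above.
train_example = {
    "context": "you just click userprefer and creat one ... __eou__ __eot__",
    "response": "unless nat get distract by someth shinier",
    "label": 1,  # 1 = true next utterance, 0 = sampled distraction
}

test_example = {
    "context": "i ca n't access the wiki at all ... __eou__ __eot__",
    # 10 candidates; the first is always the ground-truth response.
    "candidates": ["<true response>"] + ["<distractor %d>" % i for i in range(9)],
}
```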
Model Id    Description                      Recall@1    Recall@2    Recall@5
DE          Dual Encoder (Benchmark)         55.2        72.09       92.43
DER         Dual Encoder Reproduced          52.6        70.09       91.51
BE          Bi-Encoder (Proposed)            56.0        73.15       92.7

Table 2: Comparison of top-k % accuracy on UDCv2 on the test set

Model Id    Description                      Recall@1    Recall@2    Recall@5
BE-19       BE with Cosine similarity        43.09       61.99       86.97
BE-20       BE with Polynomial Similarity    55.11       71.64       92.17
BE-21       BE using all hidden states       54.7        71.54       91.63
BE-22       BE with deep LSTM model          53.8        71.6        92.4
BE          BE with Dot Similarity           56.88       73.24       92.86

Table 3: Results of different similarity measures used on the BE model using the validation set

4.2 Effect of similarity measures

In the BE model, we used the dot product similarity between the encoded utterance $\mathbf{u}_e$ and response $\mathbf{r}_e$. Before we made that decision, we evaluated several other similarity measures. The descriptions of these similarity measures are given in the subsequent sections.
4.2.1 Cosine Similarity

Instead of taking the dot product of $\mathbf{u}_e$ and $\mathbf{r}_e$, we ignore their magnitudes and take the dot product of their unit vectors, as shown in the following equation:

$\mathrm{sim}(\mathbf{u}_e, \mathbf{r}_e) = \frac{\mathbf{u}_e^{T} \mathbf{r}_e}{|\mathbf{u}_e|\,|\mathbf{r}_e|}$    (4)
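To ground these alternatives, here is a minimal sketch of the dot-product similarity of Eq. (1), the cosine similarity of Eq. (4), and, anticipating Section 4.2.2, the summed polynomial similarity of Eq. (6); the toy vectors are illustrative only.

```python
import numpy as np

def dot_sim(u: np.ndarray, r: np.ndarray) -> float:
    """Eq. (1): plain dot product of the two encodings."""
    return float(u @ r)

def cosine_sim(u: np.ndarray, r: np.ndarray) -> float:
    """Eq. (4): dot product of the unit vectors, magnitudes ignored."""
    return float(u @ r / (np.linalg.norm(u) * np.linalg.norm(r)))

def polynomial_sim(u: np.ndarray, r: np.ndarray, max_degree: int = 3) -> float:
    """Eq. (6) below: homogeneous polynomial kernels of degree 0..3, summed."""
    return float(sum((u @ r) ** d for d in range(max_degree + 1)))

u, r = np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.0, 0.5])
print(dot_sim(u, r), cosine_sim(u, r), polynomial_sim(u, r))
```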
4.2.2 Polynomial Similarity

In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models. Although the Radial Basis Function (RBF) kernel is more popular in SVM classification than the polynomial kernel, Goldberg and Elhadad (2008) showed that the polynomial kernel gives better results than the RBF kernel for NLP applications.

For degree-$d$ polynomials, the polynomial kernel is defined as:

$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^{T} \mathbf{y} + c)^{d}$    (5)

where $\mathbf{x}$ and $\mathbf{y}$ are vectors in the input space, i.e. vectors of features computed from training or test samples, and $c \geq 0$ is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When $c = 0$, the kernel is called homogeneous.

In this experiment, we used the polynomial kernel from the 0th to the 3rd degree for the similarity measure. The following equation gives the similarity function:

$\mathrm{sim}(\mathbf{u}_e, \mathbf{r}_e) = \sum_{d=0}^{3} (\mathbf{u}_e^{T} \mathbf{r}_e)^{d}$    (6)

4.3 Effect of using all hidden states

In this experiment, we used all the hidden states of the LSTM, and not just the final states, to encode the utterance and the response. The encoding of the utterance is given by:

$\mathbf{u}_e = \sum_{t=1}^{T} \frac{t^2}{T^2} \mathbf{h}_t$    (7)

where $T$ is the maximum context length of the utterance. Similarly, the encoding of the response is given by:

$\mathbf{r}_e = \sum_{t=1}^{T} \frac{t^2}{T^2} \mathbf{h}'_t$    (8)

We keep the similarity function as in the original BE model, as shown in Eq. (1).
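A short sketch of the weighted pooling in Eqs. (7) and (8), applied to an array holding one hidden state per time step (the uniform toy input is ours):

```python
import numpy as np

def weighted_state_encoding(hidden_states: np.ndarray) -> np.ndarray:
    """Eqs. (7)-(8): sum of hidden states h_t weighted by t^2 / T^2.

    `hidden_states` has shape (T, s): one s-dimensional LSTM state per
    time step. Later steps get quadratically more weight, so the final
    states still dominate the encoding.
    """
    T = hidden_states.shape[0]
    t = np.arange(1, T + 1)            # t = 1 .. T
    weights = (t ** 2) / (T ** 2)      # t^2 / T^2
    return (weights[:, None] * hidden_states).sum(axis=0)

# Toy usage: T = 5 steps of a 3-dimensional state.
states = np.ones((5, 3))
print(weighted_state_encoding(states))  # each dim sums (1+4+9+16+25)/25 = 2.2
```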
4.4 Deep LSTM

In this experiment, we added two more layers to the shallow LSTM BE model and looked at the result. We keep the similarity function as in the original BE model, as shown in Eq. (1).

4.5 Results and Discussion

Table 2 compares the performance of the proposed BE model, the benchmark DE model, and the reproduced DE model (DER) on the UDCv2 dataset. Compared to the benchmark DE model, the proposed BE model achieves 0.8%, 1.0% and 0.3% higher accuracy for Recall@1, Recall@2 and Recall@5 respectively. Note that the BE model's margin over the reproduced DE model is even larger than its margin over the benchmark model.

Table 3 shows the results of the various experiments we performed on the BE model.

For a given NLP task, the choice of embedding of words into a real vector space can affect the performance of a model. Table 4 shows the results of using various embedding vectors with the BE model.
Embedding            Recall@1   Recall@2   Recall@5
Random                 41.7       61.1      87.8
Word2Vec              56.55      73.61      92.7
Twitter 27B 200d      52.50      69.59     91.44
Common Crawl 42B      56.88      73.24     92.86
Common Crawl 840B     56.43      73.25     92.66

Table 4: Comparison of performances of the BE model
with various embedding types. Results are shown on
the validation set

We first looked at a random embedding, and then used a Word2Vec embedding (Mikolov et al., 2013) trained on UDCv2. We also used the pre-trained GloVe embeddings (Pennington et al., 2014), running the model with all four of the pre-trained GloVe embeddings that are available: (1) Wikipedia, 6B tokens, 400K vocab, uncased, 50d, 100d, 200d and 300d vectors; (2) Common Crawl, 42B tokens, 1.9M vocab, uncased, 300d vectors; (3) Common Crawl, 840B tokens, 2.2M vocab, cased, 300d vectors; and (4) Twitter, 2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d and 200d vectors. Both the pre-trained embeddings and the embedding trained on UDCv2 show better results than the random embedding. Between Word2Vec and GloVe, the Common Crawl 42B GloVe embedding shows the best result. The T-SNE plot of the Common Crawl 42B embeddings is shown in Figure 4. As can be seen in the diagram, similar words (for example, “thank”, “thx” and “ty”) appear embedded close to each other.

Figure 4: T-SNE plot of word embeddings of some frequently occurring words in UDCv2
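As an illustration of how a pre-trained embedding enters the model, here is a hedged sketch that fills an embedding matrix from a GloVe text file; the file name follows the standard GloVe distribution, while the vocabulary mapping and random initialization scale are our assumptions.

```python
import numpy as np

def build_embedding_matrix(glove_path: str, word_index: dict, dim: int = 300):
    """Fill a (vocab, dim) matrix with GloVe vectors; unseen words stay random.

    `word_index` maps each corpus word to a row id, e.g. built from the
    91,620-word UDCv2 vocabulary. GloVe files store one word per line,
    followed by its vector components.
    """
    rng = np.random.default_rng(0)
    matrix = rng.normal(scale=0.1, size=(len(word_index) + 1, dim))
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in word_index and len(vec) == dim:
                matrix[word_index[word]] = np.asarray(vec, dtype=np.float32)
    return matrix

# e.g. matrix = build_embedding_matrix("glove.42B.300d.txt", word_index)
```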
In our experiments, we tuned the LSTM cell size and the training batch size (Figure 5).

Figure 5: Effect of (a) RNN cell size and (b) training batch size on the BE (Bi-Encoder) model

4.6 Error Analysis

Similar to Lowe et al. (2017), we performed a qualitative error analysis on 100 randomly chosen examples from the test dataset where the model made an error for Recall@1 (Table 5). The errored examples were evaluated by three persons, each of whom manually gave a score to each example for the metrics Difficulty Rating, Model Response Rating and Error Category.

Difficulty Rating [1-5] measures how difficult a human finds it to match the context to the right response. A rating of 1 on the difficulty scale means that the question is easily answerable by all humans. A 2 indicates moderate difficulty, which should still be answerable by all humans, but only if they are paying attention. A 3 means that the question is fairly challenging, and may either require some familiarity with Ubuntu or the human respondent paying very close attention to answer correctly. A 4 is very hard, usually meaning that there are other responses that are nearly as good as the true response; many humans would be unable to answer questions of difficulty 4 correctly. A 5 means that the question is effectively impossible: either the true response is completely unrelated to the context, or it is very short and generic.
Model Response Rating [1-3] measures how reasonable the model's choice was. A score of 1 indicates that the model's predicted response is completely unreasonable given the context. A 2 means that the response chosen was somewhat reasonable, and that it is possible for a human to make a similar mistake. A 3 means that the model's response was more suited to the context than the actual response.

Error Category [1-4] puts a model error into a specific category. Error Category 1 relates to the tone and style of the context: if the model makes an error attributable to misspellings, incorrect grammar, use of emoticons, use of technical jargon or commands, etc. in the context, then the error category is 1. Error Category 2 applies when the context and the chosen response relate to the same topic. Error Category 3 relates to the model's inability to account for turn-taking structure, for example when the last turn in the context asks a question and the model chose a response that does not answer the question. Error Category 4 means that the model picked the response because it sees some common words between the context and the response.
Difficulty Rating          % of Errors
Impossible (5)                19%
Very Difficult (4)            11%
Difficult (3)                 22%
Moderate (2)                  30%
Easy (1)                      18%

Model Response Rating      % of Errors
Better than actual (3)        23%
Reasonable (2)                21%
Unreasonable (1)              56%

Error Category             % of Errors
Common words (4)              13%
Turn-taking (3)               45%
Same topic (2)                26%
Tone and style (1)            16%

Table 5: Qualitative evaluation of the errors from the BE model
The qualitative analysis results (Table 5) show that the BE model was not able to predict the turn-taking structure of the dialogs well. A little more than half of the errored examples had a human difficulty level ranging from 3 to 5, and almost half of the model responses in the errored examples were either reasonable or better than the actual response.

5 Conclusions and Future Work

This paper presented a new LSTM based RNN architecture that can score a set of pre-defined responses given a context (utterance). Empirically, we have shown that on average 92.7%, 73.15% and 56.0% of the time the correct response will be among the top 5, top 2 and top 1 ranked responses respectively on Ubuntu Dialog Corpus Version 2, exceeding the accuracy of the benchmark model on all three metrics. Collobert and Weston (2008) used a language model with a Rank loss/similarity, where they had only positive examples and generated negative examples by corrupting the positive ones. Several other works have shown the Rank loss to be useful in training situations where pairs of correct and incorrect items are to be scored (Goldberg, 2016). Since the UDC dataset matches this scenario, we recommend that future work explore the BE model with the Rank loss.

In a large corpus like UDC, where users seek help with Ubuntu related problems, it is reasonable to assume that there can be multiple threads of discussion (topics) related to Ubuntu. Identifying the latent topics and grouping the utterances based on topics would allow training an ensemble of BE models. As there is no explicit grouping of the utterances, we plan to identify these hidden topics using Latent Dirichlet Allocation (LDA). The topic distributions of utterances can then be used to group them using a probabilistic measure of distance. We hypothesize that ensembles of BE models may serve in efficient selection of correct responses.

Since retrieval-based systems have to loop through every single candidate response, a system that needs to go through a very large set may not be practically feasible in production. As shown by Kannan et al. (2016), one way to reduce the number of possible responses is through clustering. Jafarpour et al. (2010) and Wang et al. (2013) also showed several ways of reducing a large set of possible responses to a smaller set. We intend to apply such ideas in our future work. Moreover, in a multi-turn dialog system, capturing longer term context is essential to selecting the correct response. Our proposed architecture can be extended with more hierarchical RNN layers to capture longer context. We plan to investigate this further in conjunction with the paragraph vector (Le and Mikolov, 2014).
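To make the suggested Rank-loss direction concrete, here is a small sketch of a standard margin-based rank loss of the kind surveyed by Goldberg (2016); the margin value and the pairing of one positive with one negative response per context are our assumptions, not results from this paper.

```python
import numpy as np

def margin_rank_loss(pos_scores, neg_scores, margin: float = 1.0) -> float:
    """Pairwise hinge loss: push sim(u, r+) above sim(u, r-) by `margin`.

    `pos_scores[i]` and `neg_scores[i]` are the similarity scores of a
    correct and an incorrect response for the same context.
    """
    pos, neg = np.asarray(pos_scores), np.asarray(neg_scores)
    return float(np.maximum(0.0, margin - pos + neg).mean())

# Toy usage: the second pair violates the margin and contributes loss.
print(margin_rank_loss([2.0, 0.2], [0.5, 0.4]))
```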
References

Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.
Erik Cambria and Bebo White. 2014. Jumping NLP curves: A review of natural language processing research. IEEE Computational Intelligence Magazine, 9(2):48–57.

Rollo Carpenter. 2011. Cleverbot.

Kenneth Mark Colby, Sylvia Weber, and Franklin Dennis Hilf. 1971. Artificial paranoia. Artificial Intelligence, 2(1):1–25.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Yoav Goldberg. 2016. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57:345–420.

Yoav Goldberg and Michael Elhadad. 2008. splitSVM: Fast, space-efficient, non-heuristic, polynomial kernel computation for NLP applications. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 237–240. Association for Computational Linguistics.

Sina Jafarpour, Christopher J. C. Burges, and Alan Ritter. 2010. Filter, rank, and transfer the knowledge: Learning to chat. Advances in Ranking, 10:2329–9290.

Rudolf Kadlec, Martin Schmid, and Jan Kleindienst. 2015. Improved deep learning baselines for Ubuntu corpus dialogs. arXiv preprint arXiv:1510.03753.

Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, László Lukács, Marina Ganea, Peter Young, et al. 2016. Smart Reply: Automated response suggestion for email. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 955–964. ACM.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.

Ryan Thomas Lowe, Nissan Pow, Iulian Vlad Serban, Laurent Charlin, Chia-Wei Liu, and Joelle Pineau. 2017. Training end-to-end dialogue systems with the Ubuntu Dialogue Corpus. Dialogue & Discourse, 8(1):31–65.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 172–180. Association for Computational Linguistics.

Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 583–593. Association for Computational Linguistics.

Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2015. A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Richard S. Wallace. 2008. ALICE: Artificial Intelligence Foundation Inc. Retrieved from http://www.alicebot.org.

Hao Wang, Zhengdong Lu, Hang Li, and Enhong Chen. 2013. A dataset for research on short-text conversations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 935–945.

Joseph Weizenbaum. 1966. ELIZA: A computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.

A Source Code

The source code for this project can be found at https://github.com/DiwanshuShekhar/bi_encoder_lstm