Large-Scale Multi-Domain Belief Tracking with Knowledge Sharing


                          Osman Ramadan, Paweł Budzianowski, Milica Gašić
                                    Department of Engineering,
                                   University of Cambridge, U.K.
                              {oior2,pfb30,mg436}@cam.ac.uk

Abstract

Robust dialogue belief tracking is a key component in maintaining good quality dialogue systems. The tasks that dialogue systems are trying to solve are becoming increasingly complex, requiring scalability to multi-domain, semantically rich dialogues. However, most current approaches have difficulty scaling up with domains because of the dependency of the model parameters on the dialogue ontology. In this paper, a novel approach is introduced that fully utilizes semantic similarity between dialogue utterances and the ontology terms, allowing the information to be shared across domains. The evaluation is performed on a recently collected multi-domain dialogues dataset, one order of magnitude larger than currently available corpora. Our model demonstrates great capability in handling multi-domain dialogues, simultaneously outperforming existing state-of-the-art models in single-domain dialogue tracking tasks.

1 Introduction

Spoken Dialogue Systems (SDS) are computer programs that can hold a conversation with a human. These can be task-based systems that help the user achieve specific goals, e.g. finding and booking hotels or restaurants. In order for the SDS to infer the user goals/intentions during the conversation, its Belief Tracking (BT) component maintains a distribution of states, called a belief state, across dialogue turns (Young et al., 2010). The belief state is used by the system to take actions in each turn until the conversation is concluded and the user goal is achieved. In order to extract these belief states from the conversation, traditional approaches use a Spoken Language Understanding (SLU) unit that utilizes a semantic dictionary to hold all the key terms, rephrasings and alternative mentions of a belief state. The SLU then delexicalises each turn using this semantic dictionary before passing it to the BT component (Wang and Lemon, 2013; Henderson et al., 2014b; Williams, 2014; Zilka and Jurcicek, 2015; Perez and Liu, 2016; Rastogi et al., 2017). However, this approach is not scalable to multi-domain dialogues because of the effort required to define a semantic dictionary for each domain. More advanced approaches, such as the Neural Belief Tracker (NBT), use word embeddings to alleviate the need for delexicalisation and combine the SLU and BT into one unit, mapping directly from turns to belief states (Mrkšić et al., 2017). Nevertheless, the NBT model does not tackle the problem of mixing different domains in a conversation. Moreover, as each slot is trained independently, without sharing information between different slots, scaling such approaches to large multi-domain systems is greatly hindered.

In this paper, we propose a model that jointly identifies the domain and tracks the belief states corresponding to that domain. It uses semantic similarity between ontology terms and turn utterances to allow for parameter sharing between different slots, both across domains and within a single domain. In addition, the model parameters are independent of the ontology/belief states, so the dimensionality of the parameters does not increase with the size of the ontology, making the model practically feasible to deploy in multi-domain environments without any modifications. Finally, we introduce a new, large-scale corpus of natural, human-human conversations, providing new possibilities to train complex, neural-based models. Our model systematically improves upon state-of-the-art neural approaches both in single- and multi-domain conversations.

2 Background

The belief states of the BT are defined based on an ontology - the structured representation of the database which contains entities the system can talk about. The ontology defines the terms over which the distribution is to be tracked in the dialogue. This ontology is constructed in terms of slots and values in a single-domain setting or, alternatively, in terms of domains, slots and values in a multi-domain environment. Each domain consists of multiple slots and each slot contains several values, e.g. domain=hotel, slot=price, value=expensive. In each turn, the BT fits a distribution over the values of each slot in each domain, and a none value is added to each slot to indicate if the slot is not mentioned, so that the distribution sums up to 1. The BT then passes these states to the Policy Optimization unit as full probability distributions to take actions. This allows robustness to noisy environments (Young et al., 2010). The larger the ontology, the more flexible and multi-purposed the system is, but the harder it is to train and maintain a good quality BT.

3 Related Work

In recent years, a plethora of research has been generated on belief tracking (Williams et al., 2016). For the purposes of this paper, two previously proposed models are particularly relevant.

3.1 Neural Belief Tracker (NBT)

The main idea behind the NBT (Mrkšić et al., 2017) is to use semantically specialized pre-trained word embeddings to encode the user utterance, the system act and the candidate slots and values taken from the ontology. These are fed to semantic decoding and context modeling modules that apply a three-way gating mechanism and pass the output to a non-linear classifier layer to produce a distribution over the values for each slot. It uses a simple update rule, p(s_t) = β p(s_{t-1}) + λ y, where p(s_t) is the belief state at time step t, y is the output of the binary decision maker of the NBT, and β and λ are tunable parameters.

The NBT leverages semantic information from the word embeddings to resolve lexical/morphological ambiguity and maximize the shared parameters across the values of each slot. However, it only applies to a single domain and does not share parameters across slots.
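To make the update rule concrete, the following is a minimal Python sketch (NumPy assumed). The renormalization step is our addition to keep a valid distribution; the original NBT instead tunes β and λ directly, so this is illustrative rather than the released implementation.

    # Sketch of the NBT update rule p(s_t) = beta * p(s_{t-1}) + lambda * y,
    # over the candidate values of one slot (last entry is 'none').
    import numpy as np

    def nbt_update(belief_prev: np.ndarray, y: np.ndarray,
                   beta: float = 0.5, lam: float = 0.5) -> np.ndarray:
        belief = beta * belief_prev + lam * y
        return belief / belief.sum()  # our addition: renormalize

    # example: three candidate values plus 'none'
    prev = np.array([0.1, 0.2, 0.1, 0.6])
    turn_output = np.array([0.7, 0.1, 0.1, 0.1])
    print(nbt_update(prev, turn_output))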
3.2 Multi-domain Dialogue State Tracking

Recently, Rastogi et al. (2017) proposed a multi-domain approach using delexicalized utterances fed to a two-layer stacked bi-directional GRU network to extract features from the user and the system utterances. These, combined with the candidate slots and values, are passed to a feed-forward neural network with a softmax in the last layer. The candidate set fed to the network consists of the selected candidates from the previous turn and candidates from the ontology, up to a limit K that restricts the maximum size of the chosen set. Consequently, the model does not need an ad-hoc belief state update mechanism like the NBT.

The parameters of the GRU network are defined per domain, whereas the parameters of the feed-forward network are defined per slot, allowing transfer learning across different domains. However, the model relies on delexicalization to extract the features, which limits the performance of the BT, as it does not scale to the rich variety of the language. Moreover, the number of parameters increases with the number of slots.

4 Method

The core idea is to leverage semantic similarities between the utterances and ontology terms to compute the belief state distribution. In this way, the model parameters only learn to model the interactions between turn utterances and ontology terms in the semantic space, rather than the mapping from utterances to states. Consequently, information is shared both between slots and across domains. Additionally, the number of parameters does not increase with the ontology size. Domain tracking is considered as a separate task but is learned jointly with the belief state tracking of the slots and values. The proposed model uses semantically specialized pre-trained word embeddings (Wieting et al., 2015). To encode the user and system utterances, we employed 7 independent bi-directional LSTMs (Graves and Schmidhuber, 2005). Three of them are used to encode the system utterance for domain, slot and value tracking respectively. Similarly, three Bi-LSTMs encode the user utterance, and the last one is used to track the user affirmation. A variant of the model uses CNNs as feature extractors, similar to the NBT-CNN (Mrkšić et al., 2017).
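As a rough illustration of this encoder bank, the sketch below builds the seven Bi-LSTMs and extracts concatenated last hidden states. PyTorch, the dimensions EMB_DIM and HIDDEN, and all names are our assumptions, not taken from the released implementation.

    # Illustrative sketch of the seven Bi-LSTM encoders.
    import torch
    import torch.nn as nn

    EMB_DIM, HIDDEN = 300, 50  # placeholder hyper-parameters

    def make_encoder() -> nn.LSTM:
        return nn.LSTM(EMB_DIM, HIDDEN, bidirectional=True, batch_first=True)

    # domain / slot / value encoders for each side, plus one for affirmation
    encoders = nn.ModuleDict(
        {f"{side}_{task}": make_encoder()
         for side in ("usr", "sys") for task in ("domain", "slot", "value")}
    )
    encoders["usr_affirm"] = make_encoder()

    def encode(enc: nn.LSTM, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, seq_len, EMB_DIM) word embeddings of one utterance;
        # returns the concatenated last forward/backward hidden states,
        # shape (batch, 2 * HIDDEN)
        _, (h, _) = enc(emb)
        return torch.cat([h[0], h[1]], dim=-1)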

Figure 1: The proposed model architecture, using Bi-LSTMs as encoders. Other variants of the model
use CNNs as feature extractors (Kim, 2014; Kalchbrenner et al., 2014).

4.1 Domain Tracking

Figure 1 presents the system architecture with two bi-directional LSTM networks as information encoders running over the word embeddings of the user and system utterances. The last hidden states of the forward and backward layers are concatenated to produce h^d_usr, h^d_sys of size L for the user and system utterances respectively. In the second variant of the model, CNNs are used to produce these vectors (Kim, 2014; Kalchbrenner et al., 2014). To detect the presence of the domain in the dialogue turn, element-wise multiplication is used as a similarity metric between the hidden states and the ontology embedding of the domain:

    d_k = h^d_k ⊙ tanh(W_d e_d + b_d),

where k ∈ {usr, sys}, e_d is the embedding vector of the domain and W_d ∈ R^{L×D} transforms the domain word embeddings of dimension D to the hidden representation. The information about semantic similarity is held by d_usr and d_sys, which are fed to a non-linear layer to output a binary decision:

    P_t(d) = σ(w_d {d_usr ⊕ d_sys} + b_d),

where w_d ∈ R^{2L} and b_d are learnable parameters that map the semantic similarity to a belief state probability P_t(d) of a domain d at a turn t.
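A sketch of this head follows, assuming PyTorch and the encode helper above. The module names mirror W_d, w_d and b_d in the equations, but the structure and dimensions are our own illustration.

    # Hypothetical implementation of the domain-tracking head.
    import torch
    import torch.nn as nn

    L_DIM, D_DIM = 100, 300  # hidden size L and embedding dimension D

    class DomainTracker(nn.Module):
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(D_DIM, L_DIM)   # W_d e_d + b_d
            self.out = nn.Linear(2 * L_DIM, 1)    # w_d {.} + b_d

        def forward(self, h_usr, h_sys, e_d):
            # project the domain embedding and compute the element-wise
            # similarity with each utterance encoding
            sim = torch.tanh(self.proj(e_d))
            d_usr = h_usr * sim
            d_sys = h_sys * sim
            # concatenate both similarity vectors and map to P_t(d)
            return torch.sigmoid(self.out(torch.cat([d_usr, d_sys], dim=-1)))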

                                                        434
4.2 Candidate Slots and Values Tracking

Slots and values are tracked using a similar architecture as for domain tracking (Figure 1). However, to correctly model the context of the system-user dialogue at each turn, three different cases are considered when computing the similarity vectors:

1. Inform: the user is informing the system about his/her goal, e.g. 'I am looking for a restaurant that serves Turkish food'.

2. Request: the system is requesting information by asking the user about the value of a specific slot, e.g. the system asks 'When do you want the taxi to arrive?' and the user answers with '19:30'.

3. Confirm: the system wants to confirm information about the value of a specific slot, e.g. if the system asked 'Would you like free parking?', the user can affirm positively or negatively. The model detects the user affirmation using a separate bi-directional LSTM or CNN to output h^a_usr.

The three cases are modelled as follows:

    y^{s,v}_inf = w_inf {s_usr ⊕ v_usr} + b_inf,
    y^{s,v}_req = w_req {s_sys ⊕ v_usr} + b_req,
    y^{s,v}_af  = w_af {s_sys ⊕ v_sys ⊕ h^a_usr} + b_af,

where s_k, v_k for k ∈ {usr, sys} represent the semantic similarity between the user and system utterances and the ontology slot and value terms respectively, computed as shown in Figure 1, and w and b are learnable parameters. The distribution over the values of slot s in domain d at turn t can be computed by summing the unscaled states y_inf, y_req and y_af for each value v in s, and applying a softmax to normalize the distribution:

    P_t(s, v) = softmax(y^{s,v}_inf + y^{s,v}_req + y^{s,v}_af).
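The following sketch translates the three-case scoring and the per-slot softmax into code, again assuming PyTorch; the similarity vectors s_*, v_* and h_a_usr are taken as given, and the dimension L_DIM is illustrative.

    # Sketch of the three-case slot-value scoring.
    import torch
    import torch.nn as nn

    L_DIM = 100
    w_inf = nn.Linear(2 * L_DIM, 1)
    w_req = nn.Linear(2 * L_DIM, 1)
    w_af = nn.Linear(3 * L_DIM, 1)

    def value_logit(s_usr, s_sys, v_usr, v_sys, h_a_usr) -> torch.Tensor:
        y_inf = w_inf(torch.cat([s_usr, v_usr], dim=-1))
        y_req = w_req(torch.cat([s_sys, v_usr], dim=-1))
        y_af = w_af(torch.cat([s_sys, v_sys, h_a_usr], dim=-1))
        return y_inf + y_req + y_af  # unscaled state for one candidate value

    def slot_distribution(logits_per_value):
        # softmax across all candidate values of the slot (incl. 'none')
        return torch.softmax(torch.stack(logits_per_value), dim=0)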
4.3 Belief State Update

Since dialogue systems in the real world operate in noisy environments, a robust BT should utilize the flow of the conversation to reduce the uncertainty in the belief state distribution. This can be achieved by passing the output of the decision maker, at each turn, as an input to an RNN that runs over the dialogue turns as shown in Figure 1, which allows the gradients to be propagated across turns. This alleviates the problem of tuning hyper-parameters for rule-based updates. To avoid the vanishing gradient problem, three networks were tested: a simple RNN, an RNN with a memory cell (Henderson et al., 2014a) and an LSTM. The RNN with a memory cell proved to give the best results. In addition to reducing the vanishing gradient problem, this variant is less complex than an LSTM, which makes training easier. Furthermore, the variant of the RNN used for domain tracking has all its weights of the form W_i = α_i I, where α_i is a distinct learnable parameter for the hidden, memory and previous-state layers and I is the identity matrix. Similarly, the weights of the RNN used to track the slots and values are of the form W_j = γ_j I + λ_j (1 - I), where γ_j and λ_j are the learnable parameters. These two variants of the RNN combine the previous works of Henderson et al. (2014a) and Mrkšić and Vulić (2018). The outputs are P_{1:T}(d) and P_{1:T}(s, v), which represent the joint probability distributions of the domains and of the slots and values respectively over the complete dialogue. Combining these together produces the full belief state distribution of the dialogue:

    P_{1:T}(d, s, v) = P_{1:T}(d) P_{1:T}(s, v).
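A sketch of the constrained weight matrices described above, assuming PyTorch; H is an illustrative state size. Building each matrix from one or two scalars and the identity is what keeps the update network's parameter count independent of the ontology.

    # Sketch of the constrained memory-cell RNN weights.
    import torch
    import torch.nn as nn

    H = 100  # illustrative state size

    class ConstrainedRNNWeights(nn.Module):
        def __init__(self):
            super().__init__()
            self.alpha = nn.Parameter(torch.ones(1))   # W_i = alpha_i * I
            self.gamma = nn.Parameter(torch.ones(1))   # diagonal term
            self.lam = nn.Parameter(torch.zeros(1))    # off-diagonal term

        def domain_weight(self) -> torch.Tensor:
            # domain tracking: W_i = alpha_i * I
            return self.alpha * torch.eye(H)

        def slot_value_weight(self) -> torch.Tensor:
            # slot/value tracking: W_j = gamma_j * I + lambda_j * (1 - I)
            eye = torch.eye(H)
            return self.gamma * eye + self.lam * (1.0 - eye)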
4.4 Training Criteria

Domain tracking and slots and values tracking are trained disjointly. Belief state labels for each turn are split into domains and slots and values. Thanks to the disjoint training, the learning of slot and value belief states is not restricted to a specific domain. Therefore, the model shares the knowledge of slots and values across different domains. The loss function for the domain tracking is:

    L_d = - Σ_{n=1}^{N} Σ_{d∈D} t_n(d) log P^n_{1:T}(d),

where d is a vector of domains over the dialogue, t_n(d) is the domain label for the dialogue n and N is the number of dialogues. Similarly, the loss function for the slots and values tracking is:

    L_{s,v} = - Σ_{n=1}^{N} Σ_{s,v∈S,V} t_n(s, v) log P^n_{1:T}(s, v),

where s and v are vectors of slots and values over the dialogue and t_n(s, v) is the joint label vector for the dialogue n.
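Both criteria are plain cross-entropy sums, as the short sketch below shows (PyTorch assumed). The tensors P_dom and P_sv hold the dialogue-level probabilities, t_dom and t_sv the 0/1 labels; the epsilon guarding the log is an implementation detail we add.

    # Sketch of the two training criteria.
    import torch

    EPS = 1e-10  # guards log(0); not from the paper

    def domain_loss(P_dom: torch.Tensor, t_dom: torch.Tensor) -> torch.Tensor:
        # shapes: (num_dialogues, num_domains)
        return -(t_dom * torch.log(P_dom + EPS)).sum()

    def slot_value_loss(P_sv: torch.Tensor, t_sv: torch.Tensor) -> torch.Tensor:
        # shapes: (num_dialogues, num_slot_value_pairs)
        return -(t_sv * torch.log(P_sv + EPS)).sum()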

5 Datasets and Baselines

Neural approaches to statistical dialogue development, especially in a task-oriented paradigm, are greatly hindered by the lack of large-scale datasets. That is why, following the Wizard-of-Oz (WOZ) approach (Kelley, 1984; Wen et al., 2017), we ran a text-based multi-domain corpus data collection scheme through Amazon MTurk. The main goal of the data collection was to acquire human-human conversations between a tourist visiting a city and a clerk from an information center. At the beginning of each dialogue the user (visitor) was given explicit instructions about the goal to fulfill, which often spanned multiple domains. The task of the system (wizard) is to assist the visitor, having access to databases over the domains. The WOZ paradigm allowed us to obtain natural and semantically rich multi-topic dialogues spanning multiple domains such as hotels, attractions, restaurants, and booking trains or taxis. The dialogues cover from 1 up to 5 domains per dialogue, varying greatly in length and complexity.

5.1 Data Structure

The data consists of 2480 single-domain dialogues and 7375 multi-domain dialogues usually spanning from 2 up to 5 domains. Some domains also consist of sub-domains, such as booking. The average sentence lengths are 11.63 and 15.01 for users and wizards respectively. The combined ontology consists of 5 domains, 27 slots and 663 values, making it significantly larger than observed in other datasets. To enforce reproducibility of results, we distribute the corpus with a pre-specified train/test/development random split. The test and development sets contain 1k examples each. Each dialogue consists of a goal, user and system utterances and a belief state per turn. The data and model are publicly available at http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/.
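To illustrate the per-turn annotation, a hypothetical example follows in Python; the field names and nesting are our assumption and may differ from the released schema.

    # Purely illustrative turn annotation; not the released corpus format.
    turn = {
        "user": "I am looking for an expensive hotel in the centre.",
        "system": "Sure, how about the University Arms?",
        "belief_state": {
            "hotel": {"pricerange": "expensive", "area": "centre"},
        },
    }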

5.2 Evaluation

We also used the extended WOZ 2.0 dataset (Wen et al., 2017), publicly available at https://mi.eng.cam.ac.uk/~nm480/woz_2.0.zip. The WOZ 2.0 dataset consists of 1200 single-topic dialogues constrained to the restaurant domain. All the weights were initialised using a normal distribution of zero mean and unit variance, and biases were initialised to zero. The ADAM optimizer (Kingma and Ba, 2014) (with batch size 64) is used to train all the models for 600 epochs. Dropout (Srivastava et al., 2014) was used for regularisation (50% dropout rate on all the intermediate representations). For each of the two datasets we compare our proposed architecture (using either Bi-LSTM or CNN as encoders) to the NBT model (Mrkšić et al., 2017), publicly available at https://github.com/nmrksic/neural-belief-tracker.
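The training configuration translates into a few lines of setup code, sketched below under the assumption of PyTorch; `model` is a stand-in for the tracker sketched earlier, and only the initialisation, dropout rate, optimiser and batch/epoch counts come from the text.

    # Sketch of the training setup: N(0, 1) weight init, zero biases,
    # Adam, 50% dropout, batch size 64, 600 epochs.
    import torch
    import torch.nn as nn

    def init_weights(model: nn.Module) -> None:
        for name, p in model.named_parameters():
            if "bias" in name:
                nn.init.zeros_(p)
            else:
                nn.init.normal_(p, mean=0.0, std=1.0)

    model = nn.Sequential(nn.Linear(100, 100), nn.ReLU(),
                          nn.Dropout(0.5), nn.Linear(100, 1))
    init_weights(model)
    optimizer = torch.optim.Adam(model.parameters())
    # training loop (omitted): 600 epochs over mini-batches of 64 dialogues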
6 Results

Table 1 shows the performance of our model in tracking the belief state of single-domain dialogues, compared to the NBT-CNN variant of the NBT discussed in Section 3.1. Our model outperforms the NBT in all three slots and the joint goals for the two datasets; the NBT previously achieved state-of-the-art results (Mrkšić et al., 2017). Moreover, the performance of all models is worse on the new restaurant data compared to WOZ 2.0. This is because the dialogues in the new dataset are richer and noisier, resembling real-environment dialogues more closely.

                 WOZ 2.0                      New WOZ (only restaurants)
Slot          NBT-CNN  Bi-LSTM   CNN         NBT-CNN  Bi-LSTM   CNN
Food            88.9     96.1    96.4          78.3     84.7    85.3
Price range     93.7     98.0    97.9          92.6     95.6    93.6
Area            94.3     97.8    98.1          78.3     82.6    86.4
Joint goals     84.2     85.1    85.5          57.7     59.9    63.7

Table 1: WOZ 2.0 and new dataset test set accuracies of the NBT-CNN and the two variants of the proposed model, for the slots food, price range and area, and for the joint goals.

Table 2 presents the results on multi-domain dialogues from the new dataset described in Section 5. To demonstrate the difficulty of the multi-domain belief tracking problem, values of a theoretical baseline that samples the belief state uniformly at random are also presented. Our model gracefully handles such a difficult task. In most of the cases, CNNs demonstrate better performance than Bi-LSTMs. We hypothesize that this comes from the effectiveness of extracting local and position-invariant features, which are crucial for semantic similarities (Yin et al., 2017).

              New WOZ (multi-domain)
Model              F1 score   Accuracy %
Uniform Sampling     0.108       10.8
Bi-LSTM              0.876       93.7
CNN                  0.878       93.2

Table 2: The overall F1 score and accuracy for the multi-domain dialogues test set. The F1 score is computed by considering all the values in each slot of each domain as positive and the "none" state of the slot as negative.

7 Conclusions

In this paper, we proposed a new approach that tackles open issues of multi-domain belief tracking, such as the scalability of model parameters with the ontology size. Our model shows improved performance in single-domain tasks compared to the state-of-the-art NBT method. By exploiting semantic similarities between dialogue utterances and ontology terms, the model alleviates the need for ontology-dependent parameters and maximizes the amount of information shared between slots and across domains. In future work, we intend to investigate introducing new domains and ontology terms without further training, thus performing zero-shot learning.

Acknowledgments

The authors would like to thank Nikola Mrkšić, Jacquie Rowe, the Cambridge Dialogue Systems Group and the ACL reviewers for their constructive feedback. Paweł Budzianowski is supported by the EPSRC Council and Toshiba Research Europe Ltd, Cambridge Research Laboratory. The data collection was funded through a Google Faculty Award.
References

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6):602-610.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014a. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. In Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, pages 360-365.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014b. Word-based dialog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). pages 292-299.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL.

John F. Kelley. 1984. An iterative design methodology for user-friendly natural language office information applications. ACM Transactions on Information Systems (TOIS) 2(1):26-41.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. ICLR.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1777-1788.

Nikola Mrkšić and Ivan Vulić. 2018. Fully statistical neural belief tracking. In Proceedings of ACL.

Julien Perez and Fei Liu. 2016. Dialog state tracking, a machine reading approach using memory network. arXiv preprint arXiv:1606.04052.

Abhinav Rastogi, Dilek Hakkani-Tur, and Larry Heck. 2017. Scalable multi-domain dialogue state tracking. arXiv preprint arXiv:1712.10224.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929-1958.

Zhuoran Wang and Oliver Lemon. 2013. A simple and generic belief tracking mechanism for the dialog state tracking challenge: On the believability of observed information. In Proceedings of the SIGDIAL 2013 Conference. pages 423-432.

Tsung-Hsien Wen, Milica Gašić, Nikola Mrkšić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL.

John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu, and Dan Roth. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the Association for Computational Linguistics 3:345-358.

Jason Williams, Antoine Raux, and Matthew Henderson. 2016. The dialog state tracking challenge series: A review. Dialogue & Discourse 7(3):4-33.

Jason D. Williams. 2014. Web-style ranking and SLU combination for dialog state tracking. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). pages 282-291.

Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923.

Steve Young, Milica Gašić, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. 2010. The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language 24(2):150-174.

Lukas Zilka and Filip Jurcicek. 2015. Incremental LSTM-based dialog state tracker. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, pages 757-762.
