XPersona: Evaluating Multilingual Personalized Chatbot


Zhaojiang Lin∗, Zihan Liu∗, Genta Indra Winata∗, Samuel Cahyawijaya∗, Andrea Madotto∗,
Yejin Bang, Etsuko Ishii, Pascale Fung
Center for Artificial Intelligence Research (CAiRE)
Department of Electronic and Computer Engineering
The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
{zlinao,zliucr,giwinata,scahyawijaya,amadotto}@connect.ust.hk, pascale@ece.ust.hk

arXiv:2003.07568v2 [cs.CL] 8 Apr 2020

Abstract

Personalized dialogue systems are an essential step toward better human-machine interaction. Existing personalized dialogue agents rely on properly designed conversational datasets, which are mostly monolingual (e.g., English), which greatly limits the usage of conversational agents in other languages. In this paper, we propose a multilingual extension of Persona-Chat (Zhang et al., 2018), namely XPersona. Our dataset includes persona conversations in six different languages other than English for building and evaluating multilingual personalized agents. We experiment with both multilingual and cross-lingual trained baselines, and evaluate them against monolingual and translation-pipeline models using both automatic and human evaluation. Experimental results show that the multilingual trained models outperform the translation-pipeline and that they are on par with the monolingual models, with the advantage of having a single model across multiple languages. On the other hand, the state-of-the-art cross-lingual trained models achieve inferior performance to the other models, showing that cross-lingual conversation modeling is a challenging task. We hope that our dataset and baselines¹ will accelerate research in multilingual dialogue systems.

1 Introduction

Personalized dialogue agents have been shown to be efficient in conducting human-like conversation. This progress has been catalyzed by existing conversational datasets such as Persona-Chat (Zhang et al., 2018; Dinan et al., 2019a). However, the training data are provided in a single language (e.g., English), and thus the resulting systems can perform conversations only in the training language. Moreover, commercial dialogue systems are required to handle a large number of languages, since the smart-home devices market is increasingly international (Etherington, 2019). Therefore, creating multilingual conversational benchmarks is essential, yet challenging, since it is costly to perform human annotation of data in all languages.

A possible solution is to use translation systems before and after the model inference: a two-step translation from any language to English and from English back to the original language. This comes with three major problems: 1) amplification of translation errors, since current dialogue systems are far from perfect, especially with noisy input; 2) the three-stage pipeline is significantly slower in terms of inference speed; and 3) high translation costs, since the current state-of-the-art translation models, especially for low-resource languages, are only available through costly APIs.

In this paper, we analyze two possible workarounds to alleviate the aforementioned challenges. The first is to build a cross-lingual transferable system by aligning cross-lingual representations, as in Conneau et al. (2018), in which the system is trained on one language and transferred zero-shot to another language. The second is to learn a multilingual system directly from noisy multilingual data (e.g., translated data), thus removing the dependence on a translation system at inference time.

To evaluate the aforementioned systems, we propose a dataset called Multilingual Persona-Chat, or XPersona, by extending the Persona-Chat corpora (Dinan et al., 2019a) to six languages: Chinese, French, Indonesian, Italian, Korean, and Japanese. In XPersona, the training sets are automatically translated using translation APIs with several human-in-the-loop passes of mistake correction. In contrast, the validation and test sets

∗ Equal contributions. Listing order is random.
¹ Datasets and all the baselines are available at https://github.com/HLTCHKUST/Xpersona
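The two-step translation workaround discussed in the introduction can be sketched as a thin wrapper around two hypothetical components; `translate` and `respond` stand in for a commercial translation API and an English-only dialogue model (both are illustrative placeholders, not the paper's code):

```python
def translated_chat_turn(user_utterance, src_lang, translate, respond):
    """Three-stage translation pipeline: source -> English -> English reply -> source.

    `translate` and `respond` are hypothetical placeholders for a commercial
    translation API and an English-only dialogue model; any error introduced
    in stage 1 propagates through stage 2 and is compounded in stage 3.
    """
    english_input = translate(user_utterance, src=src_lang, tgt="en")  # stage 1
    english_reply = respond(english_input)                             # stage 2
    return translate(english_reply, src="en", tgt=src_lang)            # stage 3
```

Each turn pays two translation calls plus one model call, which is the source of the latency and cost problems noted above.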
Persona    I helped design the game starcraft. I am a famous twitch streamer.
            Sys: do you know the game starcraft ? i helped designing it !
    En      Usr: that s awesome , i m sure my ex boyfriend would like that game
            Sys: i usually stream the game on my twitch channel
            Sys: conosci il gioco starcraft? ho aiutato a progettarlo!
     It     Usr: è fantastico, sono sicuro che al mio ex ragazzo gli piacerebbe
            Sys: di solito faccio streaming mentre gioco sul mio canale Twitch
            Sys: connaissez-vous le jeu starcraft? j’ai aidé à le concevoir!
    Fr      Usr: c’est génial, je suis sûr que mon ex petit ami aimerait ce jeu
            Sys: Je diffuse généralement le jeu sur ma chaîne Twitch
            Sys: apakah anda tahu game starcraft? saya yang mendesainnya!
    Id      Usr: itu luar biasa, saya yakin mantan pacar saya suka game itu.
            Sys: saya biasanya memainkan game itu di channel twitch saya.
            Sys: 你知道游戏《星际争霸》吗?我帮忙设计了它!
    Zh      Usr: 好厉害,我觉得我的前男友会喜欢那个游戏
            Sys: 我经常在我的直播频道上直播游戏
            Sys: 너 게임 스타크래프트를 아니? 나는 그것을 디자인하는 것을 도왔어!
    Ko      Usr: 멋진데, 내 전 남자친구가 그 게임을 좋아할 거라고 확신해.
            Sys: 나는 보통 내 트위치 채널로 그 게임을 스트리밍해.
            Sys: ゲームのスタークラフトを知っていますか?私はそれを設計するのを助けました!
    Jp      Usr: それはすごいです、私は私の元彼がそのゲームを好きになると確信しています
            Sys: 私は通常、twitchチャンネルでゲームをストリーミングします

Table 1: Multi-turn annotated dialogue samples from the test set in seven languages. For simplicity, we only show
three turns for each dialogue and the persona in English.

are annotated by human experts to facilitate both automatic and human evaluations in multiple languages.

Furthermore, we propose competitive baselines in two training settings, namely, cross-lingual and multilingual, and compare them with translation-pipeline models. Our baselines leverage pre-trained cross-lingual (Chi et al., 2019) and multilingual (Devlin et al., 2018) models.

An extensive automatic and human evaluation (Li et al., 2019) of our models shows that a multilingual system is able to outperform strong translation-based models and is on par with, or even improves on, the monolingual models. The cross-lingual performance is still lower than that of the other models, which indicates that cross-lingual conversation modeling is very challenging. The main contributions of this paper are summarized as follows:

  • We present the first multilingual non-goal-oriented dialogue benchmark for evaluating multilingual generative chatbots.

  • We provide both cross-lingual and multilingual baselines and discuss their limitations to inspire future research.

  • We show the potential of multilingual systems to understand mixed-language dialogue contexts and generate coherent responses.

2 Related Work

Dialogue Systems are categorized as goal-oriented (Williams and Young, 2007; Young et al., 2013) and chit-chat (Serban et al., 2016; Vinyals and Le, 2015). Interested readers may refer to Gao et al. (2018) for a general overview. In this paper, we focus on the latter, for which, in recent years, several tasks and datasets have been proposed to ground the conversation on knowledge (Dinan et al., 2019b; Gopalakrishnan et al., 2019; Shuster et al., 2018; Fan et al., 2019; Reddy et al., 2019; Choi et al., 2018; Moon et al., 2019), such as Wiki-Articles, Reddit-Posts, and CNN-Articles. In this work, we focus on personalized dialogue agents, where the dialogues are grounded on persona information.

Li et al. (2016a) were the first to introduce a persona-grounded dialogue dataset for improving response consistency. Later on, Zhang et al. (2018) and Dinan et al. (2019a) introduced Persona-Chat, a multi-turn conversational dataset, where two speakers are paired, and a persona description (4–5 sentences) is randomly assigned to each of them. By
                                Valid.                              Test
               Lang    #Dial.   #Utt.     Edit    BLEU    #Dial.    #Utt.     Edit    BLEU
                Fr      248     3868     21.23    94.45     249     3900     24.29    94.19
                It      140     2160     83.01    80.45     140     2192     81.93    80.08
                Id      484     7562    157.58    60.46     484     7540    156.19    60.66
                Jp      275     4278     71.41    53.66     275     4322     75.83    49.56
                Ko      299     4684     74.04    61.25     300     4678     70.96    62.49
                Zh      222     3440     30.33    59.89     222     3458     33.07    64.61

Table 2: Statistics of the collected dataset. We report the number of dialogues (#Dial.) and utterances (#Utt.)
of the validation and test sets in six languages. The edit distance per dialogue (Edit) and the BLEU score are
computed to show the difference between the human-annotated and auto-translated datasets. (Training-set statistics
are reported in Appendix A.)
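Table 2's "Edit" column is described only as edit distance per dialogue. One plausible reading (an assumption, not the authors' exact script) is a character-level Levenshtein distance between each human-revised dialogue and its machine-translated counterpart, averaged over dialogues:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def edit_per_dialogue(annotated, auto_translated):
    """Average edit distance between paired human-revised and machine-translated
    dialogues, where each dialogue is a list of utterance strings."""
    total = sum(levenshtein(" ".join(human), " ".join(machine))
                for human, machine in zip(annotated, auto_translated))
    return total / len(annotated)
```

Under this reading, a larger Edit value (e.g., Id at ~157) indicates heavier human revision of the machine translations than a smaller one (e.g., Fr at ~21).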

conditioning the response generation on the persona descriptions, a chit-chat model is able to produce a more persona-consistent dialogue (Zhang et al., 2018). Several works have improved on the initial baselines with various methodologies (Kulikov et al., 2018; Yavuz et al., 2019; Hancock et al., 2019; Madotto et al., 2019; Joshi et al., 2017; Zemlyanskiy and Sha, 2018), especially using large pre-trained models (Wolf et al., 2019; Zhang et al., 2019).

Multilingual Extensive approaches have been introduced to construct multilingual systems, for example, multilingual semantic role labeling (Akbik et al., 2015; He et al., 2019), multilingual machine translation (Johnson et al., 2017), multilingual automatic speech recognition (Toshniwal et al., 2018; Yue et al., 2019; Nakayama et al., 2019; Winata et al., 2019c), and named entity recognition (Winata et al., 2019a,b). Multilingual deep contextualized models such as Multilingual BERT (M-BERT) (Devlin et al., 2018) have been commonly used to represent multiple languages and elevate performance in many NLP applications, such as classification tasks (Pires et al., 2019), textual entailment, named entity recognition (K et al., 2020), and natural language understanding (Liu et al., 2019c). Multilingual datasets have also been created for a number of NLP tasks, such as named entity recognition or linking (Sang, 2002; Sang and De Meulder, 2003; Pan et al., 2017; Aguilar et al., 2018), question answering (Liu et al., 2019a; Lewis et al., 2019), semantic role labeling (Hajic et al., 2009), part-of-speech tagging (Nivre et al., 2017), dialogue state tracking (Mrkšić et al., 2017), and natural language understanding (Schuster et al., 2019a). However, none of these datasets covers the multilingual chit-chat task.

Cross-lingual Cross-lingual adaptation learns the inter-connections among languages and circumvents the requirement of extensive training data in target languages (Wisniewski et al., 2014; Zhang et al., 2016; Liu et al., 2019b). Cross-lingual transfer learning methods have been applied to multiple NLP tasks, such as named entity recognition (Ni et al., 2017; Xie et al., 2018), natural language understanding (Liu et al., 2019c), dialogue state tracking (Chen et al., 2018), part-of-speech tagging (Wisniewski et al., 2014; Zhang et al., 2016; Kim et al., 2017), and dependency parsing (Ahmad et al., 2019; Schuster et al., 2019b). Meanwhile, Lample and Conneau (2019) and Conneau et al. (2019) proposed pre-trained cross-lingual language models to align multiple language representations, achieving state-of-the-art results in many cross-lingual classification tasks. The aforementioned tasks focus on classification and sequence labeling, whereas Chi et al. (2019) proposed to pre-train both the encoder and decoder of a sequence-to-sequence model (XNLG) to conduct cross-lingual generation tasks, namely, question generation and abstractive summarization. The latter is the closest to our task since it focuses on language generation; however, cross-lingual dialogue generation has not yet been explored.

3 Data Collection

The proposed XPersona dataset is an extension of the Persona-Chat dataset (Zhang et al., 2018; Dinan et al., 2019a). Specifically, we extend ConvAI2 (Dinan et al., 2019a) to six languages: Chinese, French, Indonesian, Italian, Korean, and Japanese. Since the test set of ConvAI2 is hidden, we split the original validation set into new validation and test sets. Then, we first automatically translate the training, validation, and

Figure 1: (a) Multilingual Encoder-Decoder model. (b) Multilingual Causal Decoder model. (Detailed illustration
is reported in Appendix B)
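Figure 1's embedding stack sums three lookups position-wise: word, positional, and segmentation/language embeddings. A minimal numpy sketch with toy, illustrative dimensions (not the paper's actual sizes; positions are 0-indexed here for array lookup):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6                              # embedding size (toy)
E_W = rng.normal(size=(10, d))     # word embedding, vocabulary size |V| = 10
E_P = rng.normal(size=(8, d))      # positional embedding, max length M = 8
E_S = rng.normal(size=(4, d))      # segmentation/language embedding, |S| = 4

def embed(token_ids, segment_ids):
    """Position-wise sum of the three embedding lookups."""
    positions = np.arange(len(token_ids))
    return E_W[token_ids] + E_P[positions] + E_S[segment_ids]

h = embed([3, 1, 4], [0, 0, 1])    # three tokens: two persona, one user
```

The resulting matrix has one d-dimensional row per input token, which is what the encoder (or causal decoder) consumes.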

test set using translation APIs (PapaGo² for Korean, Google Translate³ for the other languages). For each language, we hired native-speaker annotators with a fluent level of English and asked them to revise the machine-translated dialogues and persona sentences in the validation and test sets according to the original English dialogues. The main goal of the human annotation is to ensure that the revised conversations are coherent and fluent in the target language despite the cultural discrepancies between languages. Therefore, annotators are not restricted to literally translating the English dialogues; they are also allowed to customize dialogues and persona sentences. The annotated dialogues can deviate from the original translation while retaining persona and conversation consistency. The full annotation instructions are reported in Appendix A.

Compared to collecting new persona sentences and dialogues in each language, human-annotating the dialogues by leveraging translation APIs has multiple advantages. First, it increases the data distribution similarity across languages (Conneau et al., 2018), which can better examine the system's cross-lingual transferability. Second, revising the machine-translated dialogues based on the original English dialogues improves the data construction efficiency. Third, it leverages the well-constructed English persona conversations as a reference to ensure dialogue quality without the need for training a new pool of workers to generate new samples (Conneau et al., 2018).

On the other hand, human-translating the entire training set (∼130K utterances) into six languages is expensive. Therefore, we propose an iterative method to improve the quality of the automatically translated training set. We first sample 200 dialogues from the training set (∼2600 utterances) in each language, and we assign human annotators to list all frequent translation mistakes in the given dialogues. For example, daily colloquial English expressions such as "cool", "I see", and "lol" are usually translated literally. After that, we use simple string matching to revise the inappropriate translations in the whole training set and return a revision log, which records all the revised utterances. Then, we assign human annotators to check all the revised utterances and list translation mistakes again. We repeat this process at least twice for each language. Finally, we summarize the statistics of the collected dataset in Table 2.

² https://papago.naver.com
³ https://translate.google.com

4 Multilingual Personalized Conversational Models

Let us define a dialogue D = {U1, S1, U2, S2, . . . , Un, Sn} as an alternating set of utterances from two speakers, where U and S represent the user and the system, respectively. Each speaker has a corresponding persona description that consists of a set of sentences P = {P1, . . . , Pm}. Given the system persona sentences Ps and the dialogue history Dt = {U1, S1, U2, . . . , St−1, Ut}, we are interested in predicting the system utterance St.

4.1 Model Architecture

We explore both encoder-decoder and causal decoder architectures, and we leverage existing pre-trained contextualized multilingual language models as weight initialization. Hence, we first define the multilingual embedding layer and then the two multilingual models used in our experiments.
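As a concrete illustration of the model input, the persona Ps and the dialogue history Dt are flattened into one token sequence with parallel position and segment label sequences (function and label names here are illustrative, not the authors' code):

```python
def build_input(persona, history):
    """Concatenate persona sentences and dialogue history [Ps; Dt] into a flat
    token list with parallel position and segment-label sequences."""
    tokens, segments = [], []
    for sentence in persona:                      # Ps = {P1, ..., Pm}
        for tok in sentence.split():
            tokens.append(tok)
            segments.append("persona")
    for turn, utterance in enumerate(history):    # Dt = {U1, S1, U2, ...}
        label = "user" if turn % 2 == 0 else "system"
        for tok in utterance.split():
            tokens.append(tok)
            segments.append(label)
    positions = list(range(1, len(tokens) + 1))   # X_pos = {1, ..., |X|}
    return tokens, positions, segments
```

At decoding time, the model would additionally receive the language ID lid to select the response language.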
Embedding We define three embedding matrices: a word embedding E^W ∈ R^{|V|×d}, a positional embedding E^P ∈ R^{M×d}, and a segmentation embedding E^S ∈ R^{|S|×d}, where |·| denotes set cardinality, d is the embedding size, V denotes the vocabulary, M denotes the maximum sequence length, and S denotes the set of segmentation tokens. The segmentation embedding (Wolf et al., 2019) is used to indicate whether the current token is part of i) persona sentences, ii) system (Sys.) utterances, iii) user utterances, or iv) the response in language lid. The language embedding lid is used to inform the model which language to generate. Hence, given a sequence of tokens X, the embedding function E is defined as:

    E(X) = E^W(X) ⊕ E^P(X_pos) ⊕ E^S(X_seg),    (1)

where ⊕ denotes the position-wise sum, X_pos = {1, . . . , |X|}, and X_seg is the sequence of segmentation tokens, as in Wolf et al. (2019). Figure 1 shows a visual representation of the embedding process. A more detailed illustration is reported in Appendix B.

Encoder-Decoder To model the response generation, we use a Transformer-based (Vaswani et al., 2017) encoder-decoder (Vinyals and Le, 2015). As illustrated in Figure 1, we concatenate⁴ the system persona Ps with the dialogue history Dt, apply the embedding layer E, and pass the result to the encoder. In short, we have:

    H = Encoder(E([Ps; Dt])),    (2)

where H ∈ R^{L×d_model} is the hidden representation computed by the encoder, and L denotes the input sequence length. Then, the decoder attends to H and generates the system response St token by token. In the decoder, the segmentation embedding is the language-ID embedding (e.g., we look up the embedding for Italian to decode Italian). Thus:

    St = Decoder(H, lid).    (3)

Causal Decoder As an alternative to encoder-decoders, causal decoders (Radford et al., 2018, 2019; He et al., 2018) have been used to model conversational responses (Wolf et al., 2019; Zhang et al., 2019) by giving the dialogue history as a prefix. In our model, we concatenate the persona Ps and the dialogue history Dt as the language-model prefix, and autoregressively decode the system response St based on the language embedding (i.e., lid):

    St = Decoder(E([Ps; Dt]), lid).    (4)

Figure 1 shows the conceptual differences between the encoder-decoder and the causal decoder. Note that in both multilingual models, the dialogue history encoding process is language-agnostic, while the decoding language is controlled by the language embedding. Such a design allows the model to understand mixed-language dialogue contexts and to respond in the desired language (details in Section 5.3.2).

⁴ We use the notation [a; b] for concatenating the vectors a and b.

4.2 Training Strategy

We consider two training strategies to learn a multilingual conversational model: multilingual training and cross-lingual training.

Multilingual Training jointly learns to perform personalized conversations in multiple languages. We follow a transfer learning approach (Wolf et al., 2019; See et al., 2019) by initializing our models with the weights of the large multilingual pre-trained model M-BERT (Pires et al., 2019). For the causal decoder, we add a causal mask to the self-attention layers to convert the M-BERT encoder into a decoder. For the encoder-decoder model, we randomly initialize the cross encoder-decoder attention (Rothe et al., 2019). Then, we train both models on the combined training set in all 7 languages using a cross-entropy loss.

Cross-lingual Training transfers knowledge from the source-language data to the target languages. In this setting, the model is trained on English (source language) conversational samples and evaluated on the other 6 languages. Following the methodology proposed by Chi et al. (2019), we align the embedded representations of different languages into the same embedding space by applying cross-lingual pre-training to the encoder-decoder model. The pre-training procedure consists of two stages:

  • pre-training the encoder and the decoder independently using masked language modeling, as in Lample and Conneau (2019);

  • jointly pre-training the encoder-decoder by using two objective functions: Cross-Lingual
Bert2Bert     M-Bert2Bert         CausalBert      M-CausalBert             XNLG
            ppl. BLEU       ppl. BLEU         ppl. BLEU         ppl. BLEU           ppl.   BLEU
    En     21.99   1.53    25.99  0.57       16.08   1.79      15.62  1.97         54.74*  2.25*
    Zh     21.35   3.36    13.24  1.25        8.69   5.51       9.27   5.7        3482.27   2.16
     It    50.36    0.6    24.16  0.31       18.41   1.32      15.12   1.3         917.63   0.41
    Jp     10.09   5.23    10.64  0.79       11.00   6.74       7.13  4.53         999.81    0.0
    Ko     12.81   0.24    34.31  0.00        9.66   1.06       9.56  1.08            -       -
    Id     21.37   0.11    22.83  0.22       14.77    2.1      14.61  1.92         844.98   0.15
    Fr     13.22   0.35    15.58  0.50       10.39   1.97      10.59  2.17         640.33   0.09

Table 3: Automatic evaluation results on the test set in seven languages. We report the BLEU score and
perplexity (ppl.) for the monolingual, multilingual, and cross-lingual models.
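The perplexity (ppl.) reported in Table 3 follows the standard definition: the exponential of the mean per-token negative log-likelihood. A generic sketch of this metric (not the authors' evaluation script):

```python
import math

def perplexity(token_nlls):
    """Perplexity from a list of per-token negative log-likelihoods
    (natural log); lower is better."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning probability 1/4 to every token has perplexity 4.
assert abs(perplexity([math.log(4)] * 5) - 4.0) < 1e-9
```

The very large XNLG perplexities in some languages suggest that its output distribution is poorly calibrated for those test sets, consistent with the paper's finding that cross-lingual conversation modeling remains difficult.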

  Auto-Encoding (XAE) and Denoising Auto-Encoding (DAE) (Chi et al., 2019).

For instance, DAE adds perturbations to the input sentence of the encoder and tries to reconstruct the original sentence using the decoder, whereas XAE uses parallel translation data to pre-train both the encoder and the decoder with a machine translation objective. As in the multilingual models, language IDs are fed into the decoder to control the language of the generated sentences. Both pre-training stages require both parallel and non-parallel data in the target language.

   After the two stages of pre-training, the model is fine-tuned using just the source-language samples (i.e., English) with the same cross-entropy loss as in the multilingual training. However, as suggested in Chi et al. (2019), only the encoder parameters are updated with back-propagation, while both the decoder and the word embedding layer remain frozen. This retains the decoder's ability to generate multilingual output while still allowing the model to learn new tasks using only the target language.

5   Experiments

5.1   Evaluation Metrics

Evaluating open-domain chit-chat models is challenging, especially in multiple languages and at the dialogue level. Hence, we evaluate our models using both automatic and human evaluation. In both cases, human-annotated dialogues are used, which shows the importance of the provided dataset.

Automatic   For each language, we evaluate the responses generated by the models using perplexity (ppl.) and BLEU (Papineni et al., 2002) with reference to the human-annotated responses. Although these automatic measures are not perfect (Liu et al., 2016), they help to roughly estimate the performance of different models on the same test set. More recently, Adiwardana et al. (2020) showed a correlation between perplexity and human judgment in open-domain chit-chat models.

Human   Asking humans to evaluate the quality of a dialogue model is challenging, especially when multiple models have to be compared. The Likert score (i.e., 1-to-5 scoring) has been widely used to evaluate the interactive experience with conversational models (Venkatesh et al., 2018; See et al., 2019; Zhang et al., 2018; Dinan et al., 2019a). In such an evaluation, a human interacts with the system for several turns and then assigns a score from 1 to 5 based on three questions (Zhang et al., 2018) about fluency, engagingness, and consistency. This evaluation is expensive to conduct and requires many samples to achieve statistically significant results (Li et al., 2019). To cope with these issues, Li et al. (2019) proposed ACUTE-EVAL, an A/B-test evaluation for dialogue systems. The authors proposed two modes: human-model chats and self-chat (Li et al., 2016b; Ghandeharioun et al., 2019). In this work, we opt for the latter, since it is cheaper to conduct and achieves results similar to the former (Li et al., 2019). Another advantage of this method is the ability to evaluate multi-turn conversations instead of single-turn responses.

   Following ACUTE-EVAL, the annotator is provided with two full dialogues made by self-chat or human-model dialogue. The annotator is asked to choose which of the two dialogues is better in terms of engagingness, interestingness, and humanness. For each comparison, we sample 60–100 conversations from both models. In Appendix C, we report the exact questions and instructions given to the annotators, and the user interface used in the evaluation. We hired native-speaker annotators for all six considered languages. The annotators were different from the dataset collection annotators to avoid any possible bias.

 Multi Wins %        Engagingness            Interestingness           Humanness
     Lang        Human   Mono    Poly    Human   Mono    Poly    Human   Mono    Poly
      En         23.33   68.57   36.36   23.33   64.29   32.73   30.00   62.86   42.73
      Fr         32.00   55.17   42.86   16.00   53.45   48.21   28.00   50.00   44.64
      Id         21.67   51.67   65.45   23.33   46.67   55.45   25.00   46.67   65.45
      It         35.00   48.33   56.36   30.00   48.33   53.64   30.00   40.00   57.27
      Jp         18.33   50.00   61.82   13.33   43.33   45.45   18.33   51.67   59.09
      Ko         30.00   52.46   62.39   26.67   50.82   59.63   28.33   52.46   64.22
      Zh         36.67   55.00   65.45   36.67   60.00   61.82   36.67   55.00   70.91

Table 4: Results of the ACUTE-EVAL human evaluation. Tests are conducted pairwise between M-CausalBert (Multi) and the other models (Human, Poly-encoder (Poly), and monolingual CausalBert (Mono)). Numbers indicate the winning rate (%) of Multi. Numbers in bold are statistically significant (p < 0.05).

5.2   Implementation Details

Multilingual Models   We use the "BERT-Base, Multilingual Cased" checkpoint, and we denote the multilingual encoder-decoder model as M-Bert2Bert (~220M parameters) and the causal decoder model as M-CausalBert (~110M parameters). We fine-tune both models on the combined training set (English from Persona-Chat (Zhang et al., 2018) and six languages from XPersona) for five epochs with the AdamW5 optimizer and a learning rate of 6.25e-5.

5 AdamW: the Adam algorithm with weight decay.

Monolingual Models   To verify whether the multilingual agent under-performs the monolingual agent on monolingual conversational tasks, we build a monolingual encoder-decoder model and a causal decoder model for each language. For a fair comparison, we initialize the monolingual models with pre-trained monolingual BERT6 models (Devlin et al., 2018; Cui et al., 2019; Martin et al., 2019). We denote the monolingual encoder-decoder model as Bert2Bert (~220M parameters) and the causal decoder model as CausalBert (~110M parameters). We then fine-tune each model on each language independently for the same number of epochs and with the same optimizer as the multilingual models.

6 The monolingual BERT pre-trained models are available at https://github.com/huggingface/transformers

Translation-based Models   Another strong baseline we compare with is the Poly-encoder (Humeau et al., 2019), a large-scale pre-trained retrieval model that has shown state-of-the-art performance on the English Persona-Chat dataset (Li et al., 2019). We adapt this model to the other languages by using the Google Translate API to translate the target-language (e.g., Chinese) query into English as the input to the model, and then translate the English response back into the target language. Thus, the response generation flow is: target query → English query → English response → target response. We denote this model as Poly.

Cross-lingual Models   In the first pre-training stage, we use the pre-trained weights from XLM-R base (Conneau et al., 2019). Then, we follow the second pre-training stage of XNLG (Chi et al., 2019) to pre-train cross-lingual transferable models for Italian, Japanese, Korean, and Indonesian. For Chinese and French, we directly apply the pre-trained XNLG weights7. Then, the pre-trained models are fine-tuned on the English Persona-Chat training set, with early stopping based on the perplexity on the target-language validation set.

7 Available at https://github.com/CZWin32768/XNLG

5.3   Results and Discussion

5.3.1   Quantitative Analysis

Table 3 compares the monolingual, multilingual, and cross-lingual models in terms of BLEU and perplexity on the human-translated test set. On both evaluation metrics, the causal decoder models outperform the encoder-decoder models. We observe that the encoder-decoder models tend to overlook the dialogue context and generate digressive responses (generated samples are available in Appendix D). We hypothesize that this is because the one-to-many problem (Zhao et al., 2017) in open-domain conversation weakens the relation between the encoder and the decoder; thus, the well pre-trained decoder (Bert) easily converges to a local optimum and
learns to ignore the dialogue context from the encoder, generating responses as an unconditional language model. We leave the investigation of this problem to future work. On the other hand, M-CausalBert achieves comparable or slightly better performance than CausalBert, which suggests that M-CausalBert leverages the data from other languages. As expected, we observe a significant gap between the cross-lingual model and the other models, which indicates that cross-lingual zero-shot conversation modeling is very challenging.

   Table 4 shows the results of the human evaluation comparing M-CausalBert (Multi) against humans, the translation-based Poly-encoder (Poly), and monolingual CausalBert (Mono). The results illustrate that Multi outperforms Mono in English and Chinese, and is on par with Mono in the other languages. On the other hand, Poly shows strong performance in English, as it was pre-trained on a large-scale English conversation corpus. In contrast, the performance of Poly drops in the other languages, which indicates that imperfect translation hurts translation-based systems. We also conducted a human evaluation of M-CausalBert (Multi) against XNLG (Cross), and Multi achieves a winning rate of nearly 100 percent.

5.3.2   Qualitative Analysis and Discussion

We randomly sample 7 self-chat dialogues for each baseline model in the seven languages and report them in Appendix D. We summarize the generations of each model as follows:

Poly   The Poly-encoder, pre-trained on 174 million Reddit examples, can accurately retrieve coherent and diverse responses in English. However, in the other six languages, some of the retrieved responses are digressive due to translation errors.

Monolingual & Multilingual   We observe that both the monolingual and multilingual models can generate fluent responses. Compared to Bert2Bert and M-Bert2Bert, CausalBert and M-CausalBert generate more on-topic responses, but they sometimes repeat themselves across turns. CausalBert and M-CausalBert are on par with each other on monolingual conversational tasks, while M-CausalBert shows the advantage of handling a mixed-language context. For multilingual speakers, a conversation may involve multiple languages. Therefore, we experiment with M-CausalBert in two settings: 1) many-to-one, in which users converse with the model in 6 languages and the model generates responses in English; and 2) one-to-many, in which users converse with the model in English and the model generates responses in 6 languages using language embeddings and the corresponding persona sentences. Table 5 and Table 6 illustrate generation examples under these two settings (more examples are reported in Appendix C.1). Most of the time, M-CausalBert can understand the mixed-language context and decode coherent responses in different languages. Understanding a mixed-language dialogue context is a desirable skill for end-to-end chit-chat systems, and a systematic study of this research question is needed in future work.

                  System Persona
      My favorite activity is rock climbing.
           I'm a mechanical engineer.
        I like to go hiking in the spring.
               I live in colorado.
                    Dialogue
 Usr   studio inglese oggi e tu? (I am studying English today, you?)
 Sys   just watching netflix with friends
 Usr   我喜欢攀岩,你呢? (I like rock climbing, what about you?)
 Sys   nice, I enjoy it while in the summer. what are you up to?
 Usr   saya ingin melakukan panjat tebing (I want to go rock climbing)
 Sys   I'm going hiking the most time of year
 Usr   スポーツがお好きなようですね (It seems you like sports)
 Sys   I love the outdoors too, do you have any hobbies?
 Usr   나는 티비 보는 것을 좋아해. (I like to watch tv)
 Sys   i really like hiking and listening to music

Table 5: Many-to-one: the model understands a mixed-language dialogue context in multiple languages and generates responses in one language.

Cross-lingual   The current state-of-the-art cross-lingual generation approach, XNLG (Chi et al., 2019), shows inferior performance on multi-turn dialogue tasks and generates repetitive responses. Although cross-lingual dialogue generation is challenging, it reduces the human effort of data annotation in different languages. Therefore, cross-lingual transfer is an important direction to investigate.
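The one-to-many setting described above can be sketched at the input level: the persona and the (possibly mixed-language) dialogue history form the decoder prefix, and a language control token steers the language of the generated reply. All token names below are illustrative assumptions, not the paper's exact vocabulary or implementation:

```python
def build_prefix(persona, history, target_lang):
    # Hypothetical prefix construction for a multilingual causal
    # decoder: persona sentences, then dialogue turns (which may mix
    # languages), then a language ID token that selects the output
    # language, then a system token where decoding starts.
    tokens = ["[PERSONA]"]
    for sentence in persona:
        tokens += sentence.split()
    for speaker, utterance in history:
        tokens.append(f"[{speaker.upper()}]")   # e.g. [USR] / [SYS]
        tokens += utterance.split()
    tokens.append(f"[LANG={target_lang}]")      # language control token
    tokens.append("[SYS]")                      # the reply is decoded here
    return tokens
```

In a real model these special tokens would be mapped to learned embeddings (the language embedding of the paper plays the role of the `[LANG=...]` token here); the many-to-one setting is the same construction with the language token fixed to English.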
                  System Persona
             I love to drink fancy tea.
           I have a big library at home.
            I'm a museum tour guide.
                 I'm partly deaf.
                    Dialogue
 Usr   Hi, I am a computer science student, you?
 En    I'm a bookkeeper for the local museum of art.
 Zh    你好,我是一名博物馆老师。(I am a teacher in a museum)
 Fr    bonjour, je suis juste un séjour à la maison maman de mon immense bibliothèque. (hello, I'm just a stay-at-home mom of my huge library.)
 It    Sono un bibliotecario, ma ho bisogno di rilassarmi. (I am a librarian, but I need to relax)
 Jp    こんにちは。私は大学にいます、あなたの専攻は何ですか? (Hello. I am in college; what is your major?)
 Id    Saya tidak tahu, tetapi saya tuli. (I don't know, but I am deaf)
 Ko    안녕, 나는 선생님이야. 너는 무엇을 공부하고 있니? (Hello, I am a teacher. What are you studying?)

Table 6: One-to-many: responding to one dialogue context in 7 different languages.

6   Conclusion

In this paper, we studied both cross-lingual and multilingual approaches to end-to-end personalized dialogue modeling. We presented the XPersona dataset, a multilingual extension of Persona-Chat, for evaluating multilingual personalized chatbots. We further provided both cross-lingual and multilingual baselines and compared them with the monolingual approach and the two-stage translation approach. Extensive automatic and human evaluations were conducted to examine the models' performance. The experimental results showed that multilingually trained models, with a single model across multiple languages, can outperform the two-stage translation approach and are on par with monolingual models. On the other hand, the current state-of-the-art cross-lingual approach, XNLG, achieved lower performance than the other baselines. In future work, we plan to research more advanced cross-lingual generation approaches and to construct a mixed-language conversational benchmark for evaluating multilingual systems.

References

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Mona Diab, Julia Hirschberg, and Thamar Solorio. 2018. Named entity recognition on code-switched data: Overview of the CALCS 2018 shared task. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 138–147.

Wasi Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng. 2019. On difficulties of cross-lingual transfer with order differences: A case study on dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2440–2452.

Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yunyao Li, Shivakumar Vaithyanathan, and Huaiyu Zhu. 2015. Generating high quality proposition banks for multilingual semantic role labeling. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 397–407.

Wenhu Chen, Jianshu Chen, Yu Su, Xin Wang, Dong Yu, Xifeng Yan, and William Yang Wang. 2018. XL-NBT: A cross-lingual neural belief tracking framework. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 414–424.

Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2019. Cross-lingual natural language generation via pre-training. arXiv preprint arXiv:1909.10481.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485.
Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019. Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. 2019a. The second conversational intelligence challenge (ConvAI2). arXiv preprint arXiv:1902.00098.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019b. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.

Darrell Etherington. 2019. Amazon launches multilingual mode for using Alexa in multiple languages at once.

Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. arXiv preprint arXiv:1907.09190.

Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural approaches to conversational AI. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1371–1374. ACM.

Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind Picard. 2019. Approximating interactive human evaluation with self-play for open-domain dialog systems. In Advances in Neural Information Processing Systems, pages 13658–13669.

Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tür, and Amazon Alexa AI. 2019. Topical-Chat: Towards knowledge-grounded open-domain conversations. Proc. Interspeech 2019, pages 1891–1895.

Jan Hajic, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, M Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, et al. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1–18.

Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. 2019. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415.

Shexia He, Zuchao Li, and Hai Zhao. 2019. Syntax-aware multilingual semantic role labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5353–5362.

Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu. 2018. Layer-wise coordination between encoder and decoder for neural machine translation. In Advances in Neural Information Processing Systems, pages 7944–7954.

Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. CoRR, abs/1905.01969.

Melvin Johnson, Mike Schuster, Quoc Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Chaitanya K Joshi, Fei Mi, and Boi Faltings. 2017. Personalization in goal-oriented dialog. arXiv preprint arXiv:1706.07503.

Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In International Conference on Learning Representations.

Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2832–2838.

Ilya Kulikov, Alexander H Miller, Kyunghyun Cho, and Jason Weston. 2018. Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016a. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 994–1003.
Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016b. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.

Margaret Li, Jason Weston, and Stephen Roller. 2019. ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132. Association for Computational Linguistics.

Jiahua Liu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2019a. XQA: A cross-lingual open-domain question answering dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2358–2368.

Zihan Liu, Jamin Shin, Yan Xu, Genta Indra Winata, Peng Xu, Andrea Madotto, and Pascale Fung. 2019b. Zero-shot cross-lingual dialogue systems with transferable latent variables. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1297–1303.

Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Peng Xu, and Pascale Fung. 2019c. Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems.

Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. 2019. Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5454–5459.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2019. CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894.

Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira

Sahoko Nakayama, Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2019. Zero-shot code-switching ASR and TTS with multilingual machine speech chain. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 964–971. IEEE.

Jian Ni, Georgiana Dinu, and Radu Florian. 2017. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1470–1480.

Joakim Nivre, Željko Agić, Lars Ahrenberg, et al. 2017. Universal Dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1946–1958.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2019. Leveraging pre-trained checkpoints for sequence generation tasks. arXiv preprint arXiv:1907.12461.

Erik F Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity
  Leviant, Roi Reichart, Milica Gašić, Anna Korho-           recognition. arXiv preprint cs/0209010.
  nen, and Steve Young. 2017. Semantic specializa-
  tion of distributional word vector spaces using mono-    Erik F Sang and Fien De Meulder. 2003. Intro-
  lingual and cross-lingual constraints. Transactions         duction to the conll-2003 shared task: Language-
  of the Association for Computational Linguistics,           independent named entity recognition.      arXiv
  5:309–324.                                                  preprint cs/0306050.
Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2019a. Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3795–3805.

Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019b. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613.

Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? how controllable attributes affect human judgments. arXiv preprint arXiv:1902.08654.

Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2016. Generative deep neural networks for dialogue: A short review. arXiv preprint arXiv:1611.06216.

Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. 2018. Engaging image chat: Modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945.

Shubham Toshniwal, Tara N Sainath, Ron J Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, and Kanishka Rao. 2018. Multilingual speech recognition with a single end-to-end model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4904–4908. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, et al. 2018. On evaluating and comparing conversational agents. arXiv preprint arXiv:1801.03625, 4:60–68.

Oriol Vinyals and Quoc V Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Jason D Williams and Steve Young. 2007. Partially observable markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422.

Genta Indra Winata, Zhaojiang Lin, and Pascale Fung. 2019a. Learning multilingual meta-embeddings for code-switching named entity recognition. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 181–186.

Genta Indra Winata, Zhaojiang Lin, Jamin Shin, Zihan Liu, and Pascale Fung. 2019b. Hierarchical meta-embeddings for code-switching named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3532–3538.

Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2019c. Code-switched language models using neural based synthetic data from parallel sentences. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 271–280.

Guillaume Wisniewski, Nicolas Pécheux, Souhir Gahbiche-Braham, and François Yvon. 2014. Cross-lingual part-of-speech tagging through ambiguous learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1779–1785.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. Transfertransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.

Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A Smith, and Jaime Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 369–379.

Semih Yavuz, Abhinav Rastogi, Guan-Lin Chao, and Dilek Hakkani-Tur. 2019. Deepcopy: Grounded response generation with hierarchical pointer networks. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 122–132.

Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. 2013. Pomdp-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Xianghu Yue, Grandee Lee, Emre Yılmaz, Fang Deng, and Haizhou Li. 2019. End-to-end code-switching asr for low-resourced language pairs. arXiv preprint arXiv:1909.12681.

Yury Zemlyanskiy and Fei Sha. 2018. Aiming to know you better perhaps makes me a more engaging dialogue partner. CoNLL 2018, page 551.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213. Association for Computational Linguistics.
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen,
  Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing
  Liu, and Bill Dolan. 2019. Dialogpt: Large-scale
  generative pre-training for conversational response
  generation. arXiv preprint arXiv:1911.00536.
Yuan Zhang, David Gaddy, Regina Barzilay, and
  Tommi Jaakkola. 2016.         Ten pairs to tag–
  multilingual pos tagging via coarse mapping be-
  tween embeddings. In Proceedings of the 2016 Con-
  ference of the North American Chapter of the Asso-
  ciation for Computational Linguistics: Human Lan-
  guage Technologies, pages 1307–1317.
Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi.
   2017. Learning discourse-level diversity for neural
   dialog models using conditional variational autoen-
   coders. In Proceedings of the 55th Annual Meet-
   ing of the Association for Computational Linguistics
  (Volume 1: Long Papers), pages 654–664.
A     Dataset Collection

A.1    Annotation Instructions
In this section, we show the instructions for French
annotation:

    • There are two existing columns of conversations: the first column (en) contains the original conversations in English, and the second column (fr) contains the conversations translated by an automatic system (e.g., Google Translate).

    • You should copy the conversation from the
      second column (the translated conversations)
      into the third column (named fr_annotation).
      In that column, you should then revise the
      incorrect or inappropriate translations.

    • The goal of the revision is to make the conver-
      sations more coherent and fluent in the target
      language (French). Hence you can customize
      dialogues and persona sentences to make them
      fluent and coherent in the target language, in-
      cluding by deviating from the original transla-
      tion. However, you should retain persona and
      conversation consistency.
A.2    Training Set Statistics
We report the statistics of our iteratively revised training set in Table 7.

B     Model Detail
Figures 3 and 4 illustrate the details of the multilingual causal decoder and the multilingual encoder-decoder models.

Figure 2: Human evaluation interface modified from ACUTE-EVAL (Li et al., 2019)

C     Human Evaluation
As illustrated in Figure 2, the annotator is shown two full dialogues, each produced either by a self-chat model or by humans. The annotators are then asked the following questions:

    • Who would you talk to for a long conversation?

    • If you had to say one of these speakers is interesting and one is boring, who would you say is more interesting?

    • Which speaker sounds more human?

D     Generated Samples

D.1    Mixed-language Samples
We report more mixed-language samples generated by M-CausalBert in Tables 8 and 9.

 Lang    # Dial.    # Utt.    Edit    BLEU
   Fr     16878    248244     0.06    99.98
   It     16878    248244     1.09     99.8
   Id     16878    248244     0.18    99.94
   Jp     16878    248244     0.38    99.17
   Ko     16878    248244     0.97    99.51
   Zh     16878    248244     0.52     98.1

Table 7: The number of dialogues (#Dial.) and utterances (#Utt.) of the training set in six languages. Edit distance per dialogue and BLEU score are computed to show the difference between the iteratively revised dataset and the auto-translated dataset.
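The Edit and BLEU columns of Table 7 measure how far each revised dialogue drifts from its auto-translated counterpart. The paper does not specify its exact implementation; the sketch below assumes token-level Levenshtein distance and a simple sentence-level BLEU with uniform n-gram weights:

```python
import math
from collections import Counter

def edit_distance(ref_tokens, hyp_tokens):
    """Token-level Levenshtein distance via dynamic programming."""
    m, n = len(ref_tokens), len(hyp_tokens)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (ref_tokens[i - 1] != hyp_tokens[j - 1])
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, sub)
        prev = cur
    return prev[n]

def bleu(ref_tokens, hyp_tokens, max_n=4):
    """Sentence-level BLEU: geometric mean of n-gram precisions times a brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
        ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped n-gram matches
        total = max(sum(hyp_ngrams.values()), 1)
        log_prec += math.log(max(overlap, 1e-9) / total) / max_n
    # Brevity penalty: only penalize hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref_tokens) / max(len(hyp_tokens), 1)))
    return bp * math.exp(log_prec)

# An unedited dialogue scores distance 0 and BLEU 1.0, as in the near-100 rows of Table 7.
ref = "i like music country is my favorite".split()
assert edit_distance(ref, ref) == 0
```

In practice an off-the-shelf BLEU implementation (e.g., from a toolkit) would likely be used; this version only illustrates the comparison.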
[Figure 3 shows the input representation of the multilingual causal decoder: each token embedding (X) is summed with a positional embedding (Xpos) and a segment embedding (Xseg), where segments mark persona (Per), system (Sys), user (Usr), and response-language (e.g., En) tokens; positions run continuously across the context and the generated response.]

Figure 3: Multilingual Causal Decoder model.

[Figure 4 shows the multilingual encoder-decoder: the M-Encoder consumes the persona and dialogue history with the same token, positional, and segment embeddings, while the M-Decoder generates the response with positions restarting from 1.]

Figure 4: Multilingual Encoder-Decoder model.

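The input construction in Figures 3 and 4, where token, positional, and segment embeddings are summed element-wise, can be sketched as follows (table sizes, segment ids, and variable names here are illustrative assumptions, not the paper's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_positions, n_segments, d_model = 100, 32, 8, 16

# Three lookup tables, as in Figures 3 and 4 (sizes are illustrative).
tok_emb = rng.normal(size=(vocab_size, d_model))
pos_emb = rng.normal(size=(n_positions, d_model))
seg_emb = rng.normal(size=(n_segments, d_model))

# Example inputs: token ids X, positions Xpos, and segment ids Xseg
# (0 = persona, 1 = system, 2 = user; other ids would mark the response language).
X = np.array([5, 17, 42, 7, 9])
Xpos = np.arange(len(X))
Xseg = np.array([0, 0, 0, 1, 2])

# The model input is the element-wise sum of the three embeddings.
h0 = tok_emb[X] + pos_emb[Xpos] + seg_emb[Xseg]
assert h0.shape == (len(X), d_model)
```

In the encoder-decoder variant of Figure 4, the decoder would build its own input the same way, with its positional indices restarting from 1.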
D.2     Model Comparison Samples
We randomly sample one self-chat dialogue example for each model in each language and report them in Figures 5–37.
System Persona:
i have two dogs. i like to work on vintage cars. my favorite music is country. i own two vintage mustangs.

Dialogue:
Usr: do you have any hobbies?
En:  i like music, country is my favorite. you?
Zh:  我喜欢听乡村音乐。你呢 (I like country music, you?)
Fr:  je possède deux voitures anciennes et j’aime faire de l’artisanat (I have two vintage cars and I like to make crafts)
It:  Ciao. mi piace lavorare su auto d’epoca e ascoltare musica country. (Hello. I love working on vintage cars and listening to country music.)
Jp:  こんにちは。私は大学にいます、(Hi, i really like it)
Id:  Saya suka mendengarkan musik country, dan Anda? (I like listening to country music, and you?)
Ko:  [Korean text garbled in extraction] (I have two vintage mustangs.)

System Persona:
i love disneyland and mickey mouse. i love to spend time with my family. i’m a baby delivery nurse. i walk three miles every day.

Dialogue:
Usr: what do you do in your free time?
En:  i love to play with my family. how about you?
Zh:  我主要是在家里 (I just stay at home)
Fr:  j’aime passer du temps avec ma famille et mon mari. (i love spending time with my family and my husband.)
It:  Ciao. Sono in pensione da quando ero un’infermiera. (Hello. I have been retired since I was a nurse.)
Jp:  こんにちは、私は家族と時間を過ごすのが大好きです (Hi, i love spending time with my family)
Id:  Saya suka berjalan 3 mil setiap hari dan Anda? (I love walking 3 miles every day and you?)
Ko:  [Korean text garbled in extraction] (Hi, I enjoy spending time with my family.)

Table 8: One-to-many by M-CausalBert

System Persona:
i am a cancer survivor. my parents were both very athletic. i got a new job just yesterday to be a life coach. i am a clean eater.

Dialogue:
Usr: こんにちは (Hello)
Sys: hey there how are you?
Usr: 我很好,我是一名学生,你呢?(I am good, I am a student, what about you?)
Sys: i am a full time student and have no job yet.
Usr: cosa Studi? (What do you study?)
Sys: oh i see. what are you studying?
Usr: l’informatique (computer science)
Sys: computer stuff. what is your favorite subject?
Usr: i like machine learning.
Sys: i like to read and make fun of things.
Usr: Apa pekerjaan orang tuamu? (what do your parents do?)
Sys: i’m from cali, my parents were both athletic.

System Persona:
my brother is older than me. i ride the bus to school. i’m on the soccer team. i am 13 years old.

Dialogue:
Usr: 안녕 (Hello)
Sys: hello, how are you?
Usr: 我很好,刚刚下班回家 (I am good, just came home from work)
Sys: oh, cool. what do you do for work?
Usr: Sono un insegnante di inglese (I am an English teacher.)
Sys: oh, that’s awesome! i hope to get a job soon.
Usr: quel âge avez-vous? (how old are you?)
Sys: oh ok. i’m only 13, but my brother is older than me
Usr: Apakah Anda pelajar? (are you a student?)
Sys: do you play soccer? i am a football player
Usr: no, i like to read
Sys: i like to ride the bus and play soccer

Table 9: Many-to-one by M-CausalBert