A Unified Pre-training Framework for Conversational AI

Siqi Bao*, Bingjin Chen*, Huang He*, Xin Tian*, Han Zhou*,
Fan Wang, Hua Wu, Haifeng Wang, Wenquan Wu, Yingzhan Lin
Baidu Inc., China
{baosiqi, chenbingjin, hehuang, tianxin06, zhouhan05,
wang.fan, wu_hua, wanghaifeng, wuwenquan01, linyingzhan01}@baidu.com
arXiv:2105.02482v2 [cs.CL] 27 May 2021

Abstract

In this work, we explore the application of PLATO-2 on various dialogue systems, including open-domain conversation, knowledge grounded dialogue, and task-oriented conversation. PLATO-2 is initially designed as an open-domain chatbot, trained via two-stage curriculum learning. In the first stage, a coarse-grained response generation model is learned to fit the simplified one-to-one mapping relationship. This model is applied to the task-oriented conversation, given that the semantic mappings tend to be deterministic in task completion. In the second stage, another fine-grained generation model and an evaluation model are further learned for diverse response generation and coherence estimation, respectively. With superior capability on capturing the one-to-many mapping, such models are suitable for the open-domain conversation and knowledge grounded dialogue. For the comprehensive evaluation of PLATO-2, we have participated in multiple tasks of DSTC9, including interactive evaluation of open-domain conversation (Track3-task2), static evaluation of knowledge grounded dialogue (Track3-task1), and end-to-end task-oriented conversation (Track2-task1). PLATO-2 has obtained the 1st place in all three tasks, verifying its effectiveness as a unified framework for various dialogue systems.

[Figure 1: a dialogue context "It is snowing outside." (left) paired with candidate responses "How about making a snowman?", "That is nice.", and "It's so cold. I really miss summer." (right).]
Figure 1: Toy example to show the one-to-one mapping (gray line) and one-to-many mapping (blue dashed lines) in open-domain conversations. Left: dialogue context; Right: candidate responses.

1 Introduction

The neural models in conversational AI can be roughly divided into three categories: open-domain chatbot, knowledge grounded dialogue agent, and task-oriented dialogue system (Gao, Galley, and Li 2018). Due to the significant differences among these tasks, it is usually necessary to customize the modeling and training for each task. Recently, pre-trained language models have gained tremendous success in natural language processing (Devlin et al. 2019; Brown et al. 2020), and pioneering efforts have been made to pre-train dialogue generation models (Bao et al. 2020a; Zhang et al. 2020). However, there is still no unified pre-training framework that can effectively handle all three conversational tasks.

In this work, we will explore the application of PLATO-2 (Bao et al. 2020b) on the aforementioned tasks, including open-domain conversation, knowledge grounded dialogue, and task-oriented conversation. PLATO-2 is initially designed as an open-domain chatbot[1], trained via two-stage curriculum learning. In the first stage, a coarse-grained model is trained for general response generation under the simplified relationship of one-to-one mapping. In fact, one dialogue context might have multiple appropriate responses in open-domain conversations, as shown in the toy example of Figure 1. The one-to-one mapping network can only capture the common response patterns, resulting in general and dull responses during inference. As such, the curriculum learning continues to the next stage for high-quality response generation, as illustrated in Figure 2. In the second stage, a discrete latent variable is encoded into the network for one-to-many relationship modeling. Another fine-grained generation model and an evaluation model are further learned for diverse response generation and coherence estimation, respectively. The combination of fine-grained generation and evaluation helps PLATO-2 obtain new state-of-the-art results in open-domain conversations.

Similar to the open-domain conversation, the one-to-many mapping relationship also exists in knowledge grounded dialogue: given a dialogue context, multiple pieces of knowledge might be applicable for the response generation. Therefore, the one-to-many mapping models of the second stage can also be adapted for knowledge grounded dialogue. By expanding the network input with the knowledge segment, the background knowledge is encoded and grounded for response generation. Distinct from the open-domain conversation and knowledge grounded dialogue, there is a specific goal to be accomplished in task-oriented conversation.

* Equal contribution, listed in alphabetical order.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
[1] Source code and pre-trained models are released at https://github.com/PaddlePaddle/Knover/tree/develop/projects/PLATO-2.
[Figure 2: two-panel diagram. (a) Network overview: transformer blocks with pre-normalization (layer norm applied before multi-head attention and before the feed-forward network), fed by the sum of token, segment, and position embeddings. (b) Curriculum learning process: Stage 1 coarse-grained generation (one-to-one mapping, general response generation, NLL loss), which also serves task-oriented conversation; Stage 2.1 fine-grained generation (one-to-many mapping, diverse response generation, NLL and BOW losses) for open-domain conversation and knowledge grounded dialogue; Stage 2.2 evaluation (response coherence estimation, RCE and MLM losses). Each stage is shown with its self-attention visualization over latent, context, and response segments, and its training objectives.]

Figure 2: PLATO-2 illustration. (a) Network overview with the details of transformer blocks. (b) Curriculum learning process with self-attention visualization and training objectives.
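As a toy illustration of the pre-normalization layout sketched in Figure 2(a), the block below applies layer norm before a (simplified, single-head, identity-projection) self-attention and before the feed-forward network, each wrapped in a residual connection. This is a minimal sketch for intuition only; the real PLATO-2 blocks are multi-head with hidden dimension 2048 and learned projections.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position's hidden vector to zero mean, unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(x):
    # single-head self-attention with identity Q/K/V projections, for illustration
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ x

def pre_norm_block(x, w1, w2):
    """One pre-normalization transformer block: layer norm is applied
    before attention and before the feed-forward network, with residuals."""
    x = x + attention(layer_norm(x))           # LN -> attention -> residual
    h = np.maximum(layer_norm(x) @ w1, 0.0)    # LN -> feed-forward (ReLU)
    return x + h @ w2                          # residual
```

Stacking such blocks over the summed token, segment, and position embeddings yields the network of Figure 2(a).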

Accordingly, the conversation flow would become less diverse and concentrated on task completion. Therefore, the one-to-one mapping generation model of the first stage is employed for the task-oriented conversation.

[Figure 3: an example from Persona-Chat. Persona profiles ("I have four children", "I like to ski", "I love watching Game of Thrones", "I hate Mexican food", ...) are paired with a dialogue: "Namaste. How are you today?" / "I am doing great. How are you?" / "Great, thanks. My children and I were just about to watch Game of Thrones." / ...]
Figure 3: One example of knowledge grounded dialogue from Persona-Chat.

2 Open-domain Conversation

In the first stage, the coarse-grained baseline model learns general response generation under the simplified one-to-one mapping relationship and is trained with the negative log-likelihood (NLL) loss:

L^{Baseline}_{NLL} = -E \log p(r|c) = -E \sum_{t=1}^{T} \log p(r_t | c, r_{<t})    (1)

where c denotes the dialogue context, r the target response, and r_{<t} the previously generated words. In the second stage, a discrete latent variable z \sim p(z|c, r) is introduced for one-to-many relationship modeling, where z is one K-way categorical variable, z \in \{1, ..., K\}, and the fine-grained generation model learns p(r|c, z) with an NLL loss L^{Generation}_{NLL} conditioned on the latent variable. Given that the sampling operation is not differentiable, we approximate it with Gumbel-Softmax (Jang, Gu, and Poole 2017). Besides the classical NLL loss, the bag-of-words (BOW) loss (Zhao, Zhao, and Eskenazi 2017) is also employed to facilitate the training process of the latent variable:

L^{Generation}_{BOW} = -E_{z \sim p(z|c,r)} \sum_{t=1}^{T} \log p(r_t | c, z) = -E_{z \sim p(z|c,r)} \sum_{t=1}^{T} \log \frac{e^{f_{r_t}}}{\sum_{v \in V} e^{f_v}}    (2)
where V refers to the vocabulary, and the function f tries to predict all the words in the target response using the output embedding h_z. As compared with the NLL loss, the BOW loss ignores the word order and forces the final latent embedding to capture the global information of the response. To sum up, the training objective of the fine-grained generation model is to minimize the integrated loss:

L^{Generation} = L^{Generation}_{NLL} + L^{Generation}_{BOW}    (3)
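To make the latent-variable training concrete, here is a minimal NumPy sketch of the Gumbel-Softmax relaxation used to sample the K-way latent variable, and of the bag-of-words loss of Eq. (2). The logit vectors are toy stand-ins for the network outputs (the vocabulary scores f computed from h_z are hypothetical arrays, not PLATO-2's actual tensors).

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed sample of the K-way latent variable z ~ p(z|c, r).

    The hard sampling operation is not differentiable, so it is
    approximated with Gumbel-Softmax (Jang, Gu, and Poole 2017):
    soft one-hot weights that approach a one-hot sample as tau -> 0.
    """
    rng = rng or np.random.default_rng(0)
    gumbel = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

def bow_loss(f, response_ids):
    """Bag-of-words loss of Eq. (2): predict every word of the target
    response from the latent embedding, ignoring word order.

    f            : (V,) vocabulary scores f_v computed from h_z
    response_ids : token ids r_1 .. r_T of the target response
    """
    log_softmax = f - np.log(np.sum(np.exp(f)))
    return -float(np.sum(log_softmax[response_ids]))
```

As tau decreases, the soft weights concentrate on a single latent value, while during training they keep the sampling step differentiable.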
By assigning distinct values to the latent variable, the fine-grained generation model is able to produce multiple diverse responses. For selecting the most appropriate one from these candidate responses, the evaluation model is trained to estimate the coherence between each response and the given dialogue context. During training, the evaluation model needs to distinguish the golden response r from a randomly selected negative response r^-:

L^{Evaluation}_{RCE} = -\log p(l_r = 1 | c, r) - \log p(l_{r^-} = 0 | c, r^-)

Besides the response coherence estimation (RCE) loss, the conventional masked language model (MLM) loss (Devlin et al. 2019) is also included to retain the representation ability of the network. To sum up, the training objective of the evaluation model is to minimize the integrated loss:

L^{Evaluation} = L^{Evaluation}_{RCE} + L^{Evaluation}_{MLM}    (4)

During inference, conditioned on each latent value z \in \{1, ..., K\}, the corresponding candidate response is produced by the fine-grained generation model p(r|c, z). The most coherent response is then selected:

r^* = \arg\max_{z \in \{1, ..., K\}} p(l_r = 1 | c, r)    (5)

In addition to the above coherence estimation, two other approaches are commonly adopted for response selection: length-averaged log-likelihood and maximum mutual information. The length-averaged log-likelihood considers the forward probability p(r|c), which tends to select common and general responses. The maximum mutual information considers the backward probability p(c|r), which favors responses with high overlap with the dialogue context. In comparison, the evaluation model p(l_r = 1|c, r) considers the bi-directional information flow between the dialogue context and response, achieving better performance at selecting coherent responses.

PLATO-2 learns gradually from coarse-grained general response generation to fine-grained diverse response generation via this curriculum learning process. Besides, the evaluation model further selects the most coherent response from multiple candidate responses. This combination of fine-grained generation and evaluation helps PLATO-2 obtain high-quality responses in open-domain conversations.

3 Knowledge Grounded Dialogue

Another common conversational task is knowledge grounded dialogue, where the response is generated based on the dialogue context and background knowledge. The background knowledge can come in a variety of forms, such as persona profiles (Zhang et al. 2018), Wikipedia (Dinan et al. 2018), news articles (Gopalakrishnan et al. 2019), and so on. One example from Persona-Chat is given in Figure 3. It can be observed that the response generation relies on not only the dialogue context but also the persona profiles. Similar to the open-domain conversation, there also exists the one-to-many mapping relationship in knowledge grounded dialogue (Kim, Ahn, and Kim 2019): given a dialogue context, multiple pieces of knowledge might be applicable for the response generation.

Within the PLATO-2 framework, the background knowledge can be encoded into the fine-grained generation and evaluation networks straightforwardly by adding a segment of knowledge before the dialogue context. The learning of response generation becomes p(r|k, c, z), where k refers to the background knowledge, and the evaluation model p(l_r = 1|k, c, r) considers the coherence with the dialogue context and the consistency with the background knowledge simultaneously. The fine-grained generation model produces diverse knowledge grounded responses, and the evaluation model further selects the most appropriate one from these candidates.
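The generate-then-rank procedure of Eq. (5), and its knowledge grounded variant, can be sketched as follows. Here `generate` and `coherence` are hypothetical stand-ins for the fine-grained generation model p(r|c, z) and the evaluation model p(l_r = 1|c, r); in PLATO-2 both are large transformers, and the knowledge segment is prepended at the token level rather than by string concatenation.

```python
from typing import Callable, List, Optional

def select_response(context: str,
                    generate: Callable[[str, int], str],
                    coherence: Callable[[str, str], float],
                    K: int = 20,
                    knowledge: Optional[str] = None) -> str:
    """Generate one candidate per latent value z in {1, ..., K} with the
    fine-grained model, then keep the candidate that the evaluation
    model scores as most coherent (Eq. 5).

    For knowledge grounded dialogue, the knowledge segment k is simply
    prepended to the context, so the models become p(r|k, c, z) and
    p(l_r = 1|k, c, r).
    """
    if knowledge is not None:
        context = knowledge + " " + context
    candidates: List[str] = [generate(context, z) for z in range(1, K + 1)]
    return max(candidates, key=lambda r: coherence(context, r))
```

With toy stubs for the two models, the function simply returns whichever of the K candidates the scorer ranks highest.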
4 End-to-end Task-oriented Conversation

Conventional task-oriented dialogue systems usually adopt a pipeline architecture, including natural language understanding (NLU), dialogue state tracking (DST), dialogue policy, and natural language generation (NLG) modules.
Model Name                 Rank  Overall Human Rating  Coherent  Consistent  Error Recovery  Diversity  Topic Depth  Likeable  Understanding  Flexible  Informative  Inquisitive
PLATO-2                    1     4.15                  2.8017    2.7518      0.9390          2.7441     2.7678       2.7878    2.8285         2.8000    2.7881       2.7949
concise-and-comprehensive  2     4.14                  2.8157    2.7644      0.9573          2.7235     2.6530       2.7709    2.8225         2.7645    2.7065       2.7799
DPKS-v0-w                  3     4.08                  2.7686    2.6538      0.9222          2.6791     2.6909       2.7348    2.7733         2.7073    2.7162       2.6943
DPKS-v1-w                  4     4.03                  2.7327    2.6490      0.9153          2.6875     2.6362       2.7271    2.7348         2.6824    2.7280       2.7264
most-diverse-model         5     3.93                  2.7161    2.6000      0.9068          2.6550     2.5978       2.7006    2.7072         2.6361    2.6921       2.6760

Table 1: Human evaluation results on Track3-task2 interactive open-domain conversations, with the best value written in bold.
Model Name                          Rank  Overall Human Rating  Interesting  Engaging  Specific  Relevant  Correct  Semantically Appropriate  Understandable  Fluent
PLATO-2 w/o explicit knowledge      1     4.2814                2.8746       2.8542    2.5627    2.8814    2.6678   2.8780                    0.9932          2.9119
PLATO-2 w/ explicit knowledge       1     4.2804                2.8142       2.8480    2.5439    2.8682    2.6520   2.8682                    0.9966          2.8885
final-results-test-plato-best-1004  1     4.2799                2.8396       2.8703    2.5631    2.8976    2.6860   2.9147                    0.9966          2.9147
nofact-data-finetune                4     4.2603                2.8185       2.8082    2.6062    2.8493    2.6849   2.8048                    1.0000          2.9075
dialogpt-fact-dialogrpt-v0          5     4.2526                2.8601       2.8771    2.5768    2.8703    2.7167   2.9044                    0.9966          2.9078

Table 2: Human evaluation results on Track3-task1 static knowledge grounded dialogues, with the best value written in bold.
Recently, some works (Ham et al. 2020; Peng et al. 2020) have been introduced for end-to-end task-oriented dialogue generation with pre-trained language models. In this section, we will discuss how to apply PLATO-2 to end-to-end task-oriented conversations.

Distinct from open-domain conversation and knowledge grounded dialogue, the task-oriented conversation is supposed to accomplish a particular goal. Therefore, the semantic mapping between dialogue context and response would be less diverse. To this end, the one-to-one mapping generation model in stage 1 is employed for task-oriented conversation. Even with this powerful pre-trained generation model, it is still challenging to carry out end-to-end task-completion conversations. Firstly, the generation model needs an effective way to interact with the external database. It is necessary to retrieve relevant information from the database for response generation, such as retrieving the candidates meeting the current user's criteria. Secondly, it is crucial for task completion to extract entities precisely from the conversation. However, entity names are non-categorical, and the user might mention them in various forms. Thirdly, the user's requests might be ambiguous, and the model has difficulties in capturing the user's real needs. Taking the utterance "I want to find a hotel to stay" as an example, it is hard to tell whether the user wants to find a place to stay in general, or a hotel instead of a guesthouse.

To tackle the above problems, several techniques are employed in our work. Firstly, interaction with the external database is enabled through dialogue state estimation (Ham et al. 2020), and a flexible two-phase generation process is adopted to produce the final response. In the first phase, the model generates the dialogue state, system action, and system response simultaneously. The dialogue state is used as a constraint for the database query, and the system action can be refreshed according to the queried results. If there is any update to the system action, or no candidate is found in the queried results, the second-phase generation is carried out to produce the final response. Secondly, to boost the extraction of entity names, we employ fuzzy matching between the dialogue context and the database, where special tokens are added around the candidate entity names. Through this enhanced representation, our model achieves better accuracy and generalization in entity detection. Thirdly, to deal with ambiguous requests, active clarification is introduced by raising a clarifying question towards the user, such as "would you like a guesthouse or a hotel". With active clarification, the model can capture the user's real needs under ambiguous scenarios. The above three techniques (effective interaction with an external database, improved entity representation, and active clarification) help PLATO-2 achieve a better success rate and user experience in task-oriented conversations.
5 Experiments

For the comprehensive evaluation of PLATO-2, we have enrolled in multiple tasks of DSTC9 (Gunasekara et al. 2020):

• Track3-task2: interactive evaluation of open-domain conversation;
• Track3-task1: static evaluation of knowledge grounded dialogue;
• Track2-task1: end-to-end task-oriented conversation.

PLATO-2 is pre-trained with 684M (context, response) samples extracted from Reddit, and the vocabulary contains 8K BPE subwords. All the models have 32 transformer blocks and 32 attention heads, with a hidden embedding dimension of 2048. For open-domain conversation and knowledge grounded dialogue, responses are generated with top-k sampling (Fan, Lewis, and Dauphin 2018), where k is set to 20. For task-oriented conversation, responses are produced with beam search, where the beam size is set to 5. Experimental details on each task will be discussed below.
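As a concrete illustration of the decoding setup, one step of top-k sampling (k = 20 in our configuration) can be sketched as below; the logits array is a stand-in for the model's next-token distribution.

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int = 20, rng=None) -> int:
    """One decoding step of top-k sampling (Fan, Lewis, and Dauphin 2018):
    keep the k highest-scoring tokens, renormalize, and sample among them."""
    rng = rng or np.random.default_rng(0)
    top = np.argsort(logits)[-k:]                 # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```

The step is repeated per token until an end-of-sequence symbol is produced. Beam search, used for the task-oriented responses, instead deterministically keeps the 5 most probable partial sequences at each step.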
Rank  Team ID   Best Spec #   Success Rate  Complete Rate  Book Rate  Inform P/R/F1       Turn (succ/all)
1     1 (Ours)  Submission3   93.0          95.2           94.6       84.1 / 96.2 / 88.1  12.5 / 12.7
2     2         Submission5   91.4          96.9           96.2       80.2 / 97.3 / 86.0  15.3 / 15.7
3     3         Submission1   90.8          94.4           96.7       81.0 / 95.4 / 85.9  13.4 / 13.6
4     4         Submission2   89.8          94.6           96.3       72.4 / 96.0 / 80.1  15.1 / 15.8
5     5         Submission2   83.3          88.5           89.1       81.1 / 90.3 / 83.5  13.5 / 13.8
N/A   Baseline  Baseline      85.0          92.4           91.4       79.3 / 94.9 / 84.5  13.8 / 14.9

Table 3: Automatic evaluation results on Track2-task1 end-to-end task-oriented conversations, with the best value written in bold.

Rank  Team ID   Best Spec #   Average Success Rate  Success Rate w/ DB Grounding  Success Rate w/o DB Grounding  Language Understanding Score  Response Appropriateness Score  Turns
1     1 (Ours)  Submission5   74.8                  70.2                          79.4                           4.54                          4.47                            18.5
1     2         Submission1   74.8                  68.8                          80.8                           4.51                          4.45                            19.4
3     7         Submission4   72.3                  62.0                          82.6                           4.53                          4.41                            17.1
4     6         Submission1   70.6                  60.8                          80.4                           4.41                          4.41                            20.1
5     3         Submission3   67.8                  60.0                          75.6                           4.56                          4.42                            21.0
N/A   Baseline  Baseline      69.6                  56.8                          82.4                           4.34                          4.18                            18.5

Table 4: Human evaluation results on Track2-task1 end-to-end task-oriented conversations, with the best value written in bold.

5.1 Open-domain Conversation

Interactive open-domain conversation is the most challenging direction in dialogue systems. The users are free to talk about any topic, and the system's replies are expected to meet a high standard on many aspects, including coherence, consistency, informativeness, engagingness, etc. Since PLATO-2 is initially designed as an open-domain chatbot, it can be applied directly in open-domain conversations. In DSTC9 Track3-task2, real internet users are attracted through Facebook advertising and communicate with the backend dialogue systems through DialPort (Zhao, Lee, and Eskenazi 2016). The collected logs are then distributed to AMT workers for assessments. For each system, 200 interactive dialogues are collected for human evaluation, and for each dialogue, three crowd-sourcing workers are asked to annotate it from multiple aspects and provide an overall score. The human evaluation results are summarized in Table 1. PLATO-2 achieves the highest overall human rating and performs well on many evaluation metrics.

5.2 Knowledge Grounded Dialogue

In DSTC9 Track3-task1, experiments are carried out on Topical-Chat (Gopalakrishnan et al. 2019), which is a large-scale dataset on knowledge grounded dialogue. For background knowledge, there are 300 entities in Topical-Chat, and each entity is associated with several short facts or articles. For each conversational turn, several relevant facts are provided, and the system can leverage these facts for response generation. As large-scale pre-trained models are capable of packing knowledge into their parameters (Roberts, Raffel, and Shazeer 2020), we test two experimental settings: PLATO-2 with and without explicit knowledge. In the first setting, the given relevant facts are appended before the dialogue context, and the model learns response generation based on explicit knowledge, p(r|k, c, z). In the second setting, the model tries to encode the knowledge into the network implicitly and learns knowledge grounded response generation directly, p(r|c, z).

In this task, systems need to produce the response given the dialogue context and relevant facts. During the evaluation, 100 randomly selected samples are distributed to AMT workers for assessment. For each conversational turn, three crowd-sourcing workers are asked to annotate it from multiple aspects and provide an overall score. The human evaluation results are summarized in Table 2. Three approaches are tied for the first place, where the top two are our submitted PLATO-2 without and with explicit knowledge. Given the name of the third approach, PLATO-2 might dominate the leaderboard in knowledge grounded dialogue.

5.3 End-to-end Task-oriented Conversation

In DSTC9 Track2-task1, the end-to-end task-oriented conversation is carried out on MultiWOZ 2.1 (Eric et al. 2020), which is a multi-domain dialogue dataset on travel planning. In the experiments, we employ the one-to-one mapping model of PLATO-2 and fine-tune it with the aforementioned three techniques. In this task, the ConvLab-2 platform (Zhu et al. 2020) provides the user simulator for automatic evaluation. The evaluation results are summarized in Table 3, and our approach ranks the 1st with the highest success rate.
[Figure 4: a dialogue snippet between a real user and PLATO-2:
"Yes, on the job site it does. In the office, I just have to use my brain."
"That's always good, the office is often times a stressful environment"
"It can be! It can also be very boring. That is why I'm chatting with a bot right now"
"That makes sense, is it a robot that is chatting with you?"
"I'm not sure. What are you?"
"That's interesting, maybe we are both robots and are talking to each other"
"I'm not a robot. Sometime I wish I were."
"Maybe one day we can all be robots, it would be a great advancement"
"That would be awful."
"Well i disagree, robots are more efficient."]
Figure 4: Case analysis on open-domain conversation.

Context:
[P1] do you have a twitter account ?
[P2] i do n't , do you ?
[P1] nope do n't have one either . i here they covered the florida shooting . did you hear about that ?
[P2] i did ! i read a few of the student 's tweets , but i did n't realize teachers were involved in tweeting about it as well . its interesting to see how younger generations are making their voices heard though
[P1] yes it is , technology is in know way slowing down and today 's youth know how to use it well .

Knowledge: Taylor Swift : Taylor Swift once accidentally released 8 seconds of white noise and topped the Canadian iTunes charts.

Golden Response: i had a random factual section about taylor swift- apparently at one point she accidentally released a few seconds of white noise and it topped itunes charts in canada . did you hear about that ?

Baseline: i think that is a great way to put it . i think it 's funny that she released 8 seconds of white noise on itunes and it topped the canadian itunes charts .

PLATO-2 w/o explicit knowledge: I agree. I think the younger generation has the potential to make a real difference in the world though. Do you ever listen to music on your phone?

PLATO-2 w/ explicit knowledge: i wonder if it is slowing down though , as technology has advanced and people are more connected to each other . have you ever heard the white noise taylor swift releases on social media ?

Table 5: Case analysis on knowledge grounded dialogue.

[Figure 5: a task-oriented dialogue with the corresponding goal.
Task Goal:
• You are looking for a place to stay. The hotel should be in the north. The hotel should include free parking.
• The hotel should include free wifi.
• Once you find a hotel, make sure you get postcode, stars.
User: I need a hotel to stay in the north , it should have free parking
System: What price range would you like ?
User: I don't care the pricerange . it must have free wifi
System: Would you like a guesthouse or a hotel ?
User: Any one is ok
System: I would recommend the acorn guest house . It is in the moderate price range and has 4 stars . It also has free parking and wifi . Would you like more information ?
User: yes , please tell me the postcode
System: The postcode is cb41da . Is there anything else I can help you with ?
User: that's all . thank you
System: You 're welcome , have a great day .]
Figure 5: Case analysis on task-oriented conversation.
 [P1] do you have a twitter account ? The postcode is cb41da . Is there anything else I can help you with ?
Table 5:

Context:
[P2] i do n't , do you ?
[P1] nope do n't have one either . i here they covered the florida shooting . did you hear about that ?
[P2] i did ! i read a few of the student 's tweets , but i did n't realize teachers were involved in tweeting about it as well . its interesting to see how younger generations are making their voices heard though
[P1] yes it is , technology is in know way slowing down and today 's youth know how to use it well .

Knowledge:
Taylor Swift : Taylor Swift once accidentally released 8 seconds of white noise and topped the Canadian iTunes charts.

Golden Response:
i had a random factual section about taylor swift - apparently at one point she accidentally released a few seconds of white noise and it topped itunes charts in canada . did you hear about that ?

Baseline:
i think that is a great way to put it . i think it 's funny that she released 8 seconds of white noise on itunes and it topped the canadian itunes charts .

PLATO-2 w/o explicit knowledge:
I agree. I think the younger generation has the potential to make a real difference in the world though. Do you ever listen to music on your phone?

PLATO-2 w/ explicit knowledge:
i wonder if it is slowing down though , as technology has advanced and people are more connected to each other . have you ever heard the white noise taylor swift releases on social media ?

Figure 5: Case analysis on task-oriented conversation. Our approach ranks the 1st with the highest success rate.

Aside from the automatic evaluation, AMT workers are asked to communicate with the systems for task completion. When the conversation is finished, AMT workers give evaluation scores on several aspects. The human evaluation results are summarized in Table 4. The average success rate is calculated as the average of the following two metrics: the success rate without database grounding and the success rate with database grounding. The success rate without database grounding is based on the AMT worker's annotation during communication (success or fail); in fact, AMT workers do not know whether the values provided by the system are consistent with the database or not. In comparison, the success rate with database grounding is a stricter and more practical metric. The dialogue is considered a success if and only if: 1) the AMT worker marks the dialogue as a success; 2) the request slot values plus inform slot values provided by the system can be found in the database. Our approach achieves the highest score on success rate with database grounding. The first two approaches are placed as co-champions based on the average success rate in the final ranking.

5.4 Discussions

To further dissect the performance of PLATO-2, several cases from various conversations are provided for analysis. As shown in Figure 4, one dialogue snippet is selected from the interaction between a real user and our system. In the DialPort platform, users are informed in advance that they will communicate with AI bots. This dialogue snippet demonstrates that PLATO-2 is able to produce coherent and engaging responses in open-domain conversation. For knowledge grounded dialogue, one example is displayed in Table 5. Compared with the golden and baseline responses, the responses generated by PLATO-2 are more coherent with the dialogue context. Instead of changing the topic suddenly or copying the given facts directly, PLATO-2 absorbs the knowledge and conveys the information in a natural way. For task-oriented conversation, one dialogue snippet with the corresponding goal is shown in Figure 5. The user is asked to interact with the system to accomplish a specific goal, and the system needs to find the entity that satisfies the user's requirements. As exhibited in the case, the system actively communicates with the user to narrow down the scope of candidates and successfully returns the required information.

Despite its effectiveness on multiple conversational tasks, PLATO-2 still suffers from several limitations common to general dialogue models, including factual errors, logical inconsistency, and toxic or biased language. Recently, some pioneering works have been proposed to alleviate these problems. For example, knowledge provenance from Wikipedia is provided in retrieval-augmented generation (Lewis et al. 2020), and several recipes have been explored to increase the safety of open-domain chatbots (Xu et al. 2020). Future work will be carried out along these directions to boost the model's capacity.

6 Related Work

Related work is discussed with respect to pre-trained dialogue generation models and task-oriented dialogue systems.

Pre-trained language models have brought significant breakthroughs in natural language processing (Devlin et al. 2019; Radford et al. 2019; Brown et al. 2020). To boost the performance of dialogue generation, DialoGPT (Zhang et al. 2020) is trained on the basis of GPT-2 (Radford et al. 2019) using Reddit comments. To obtain a human-like chatbot, Meena (Adiwardana et al. 2020) utilizes more social media conversations and scales up the network to 2.6B parameters. To strengthen desirable conversational skills, Blender (Roller et al. 2020) further fine-tunes the pre-trained model with human-annotated conversations. To tackle the one-to-many mapping problem, PLATO-2 (Bao et al. 2020b) encodes a discrete latent variable into the network and achieves new state-of-the-art results in open-domain conversation. In this work, we demonstrate that the one-to-many mapping models of PLATO-2 can be applied effectively to both open-domain conversation and knowledge grounded dialogue.

For task-oriented dialogue systems, conventional approaches (Young et al. 2013; Henderson, Thomson, and Williams 2014; Wen et al. 2015) usually adopt pipeline modules, including natural language understanding (NLU), dialogue state tracking (DST), dialogue policy, and natural language generation (NLG). Recently, some end-to-end neural models (Wen et al. 2017; Ham et al. 2020; Peng et al. 2020) have been introduced for task-oriented dialogue systems. The end-to-end system of Ham et al. (2020) retains the core concepts of the pipeline and generates the dialogue state, system action, and system response simultaneously. In this work, we demonstrate that the one-to-one mapping model of PLATO-2 can be adopted as a powerful basis. With enhanced entity representation and active clarification, PLATO-2 achieves new state-of-the-art results in task-oriented conversation.

7 Conclusion

In this work, we explore the application of PLATO-2 to various dialogue systems, including open-domain chit-chat, knowledge grounded dialogue, and task-oriented conversation. The training of PLATO-2 is carried out via two-stage curriculum learning. In the first stage, the network tries to fit the simplified one-to-one mapping between the dialogue context and response. In the second stage, the discrete latent variable is encoded into the network for one-to-many mapping modeling, and one fine-grained generation model and one evaluation model are further learned for diverse response generation and coherence estimation. The model from the first stage is applicable to task-oriented conversation, while the models from the second stage are suitable for open-domain conversation and knowledge grounded dialogue. Comprehensive evaluations in DSTC9 demonstrate that PLATO-2 is an effective unified pre-training framework for conversational AI.

Acknowledgments

We would like to thank the reviewers for their constructive suggestions; Jingzhou He and Tingting Li for the help on resource coordination; and Gaopeng Yong, Liankai Huang, and Hua Lu for their generous help. This work was supported by the National Key Research and Development Project of China (No. 2018AAA0101900).

References

Adiwardana, D.; Luong, M.-T.; So, D. R.; Hall, J.; Fiedel, N.; Thoppilan, R.; Yang, Z.; Kulshreshtha, A.; Nemade, G.; Lu, Y.; et al. 2020. Towards a Human-like Open-Domain Chatbot. arXiv preprint arXiv:2001.09977.

Bao, S.; He, H.; Wang, F.; Wu, H.; and Wang, H. 2020a. PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 85–96.

Bao, S.; He, H.; Wang, F.; Wu, H.; Wang, H.; Wu, W.; Guo, Z.; Liu, Z.; and Xu, X. 2020b. PLATO-2: Towards Building an Open-Domain Chatbot via Curriculum Learning. arXiv preprint arXiv:2006.16779.

Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186.

Dinan, E.; Roller, S.; Shuster, K.; Fan, A.; Auli, M.; and Weston, J. 2018. Wizard of Wikipedia: Knowledge-Powered Conversational Agents. In International Conference on Learning Representations.

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. In Advances in Neural Information Processing Systems, 13063–13075.

Eric, M.; Goel, R.; Paul, S.; Sethi, A.; Agarwal, S.; Gao, S.; Kumar, A.; Goyal, A.; Ku, P.; and Hakkani-Tur, D. 2020. MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines. In Proceedings of The 12th Language Resources and Evaluation Conference, 422–428.

Fan, A.; Lewis, M.; and Dauphin, Y. 2018. Hierarchical Neural Story Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 889–898.

Gao, J.; Galley, M.; and Li, L. 2018. Neural Approaches to Conversational AI. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 1371–1374.

Gopalakrishnan, K.; Hedayatnia, B.; Chen, Q.; Gottardi, A.; Kwatra, S.; Venkatesh, A.; Gabriel, R.; and Hakkani-Tür, D. 2019. Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, 1891–1895.

Gunasekara, C.; Kim, S.; D'Haro, L. F.; Rastogi, A.; Chen, Y.-N.; Eric, M.; Hedayatnia, B.; Gopalakrishnan, K.; Liu, Y.; Huang, C.-W.; Hakkani-Tür, D.; Li, J.; Zhu, Q.; Luo, L.; Liden, L.; Huang, K.; Shayandeh, S.; Liang, R.; Peng, B.; Zhang, Z.; Shukla, S.; Huang, M.; Gao, J.; Mehri, S.; Feng, Y.; Gordon, C.; Alavi, S. H.; Traum, D.; Eskenazi, M.; Beirami, A.; Cho, E.; Crook, P. A.; De, A.; Geramifard, A.; Kottur, S.; Moon, S.; Poddar, S.; and Subba, R. 2020. Overview of the Ninth Dialog System Technology Challenge: DSTC9. arXiv preprint arXiv:2011.06486.

Ham, D.; Lee, J.-G.; Jang, Y.; and Kim, K.-E. 2020. End-to-End Neural Pipeline for Goal-Oriented Dialogue Systems using GPT-2. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 583–592.

Henderson, M.; Thomson, B.; and Williams, J. D. 2014. The Second Dialog State Tracking Challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 263–272.

Jang, E.; Gu, S.; and Poole, B. 2017. Categorical Reparameterization with Gumbel-Softmax. In International Conference on Learning Representations.

Kim, B.; Ahn, J.; and Kim, G. 2019. Sequential Latent Knowledge Selection for Knowledge-Grounded Dialogue. In International Conference on Learning Representations.

Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems.

Peng, B.; Li, C.; Li, J.; Shayandeh, S.; Liden, L.; and Gao, J. 2020. SOLOIST: Few-shot Task-Oriented Dialog with A Single Pre-trained Auto-regressive Model. arXiv preprint arXiv:2005.05298.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners. Technical report, OpenAI.

Roberts, A.; Raffel, C.; and Shazeer, N. 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 5418–5426.

Roller, S.; Dinan, E.; Goyal, N.; Ju, D.; Williamson, M.; Liu, Y.; Xu, J.; Ott, M.; Shuster, K.; Smith, E. M.; et al. 2020. Recipes for Building an Open-Domain Chatbot. arXiv preprint arXiv:2004.13637.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, 5998–6008.

Wen, T.-H.; Gasic, M.; Mrkšić, N.; Su, P.-H.; Vandyke, D.; and Young, S. 2015. Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1711–1721.

Wen, T.-H.; Vandyke, D.; Mrkšić, N.; Gasic, M.; Barahona, L. M. R.; Su, P.-H.; Ultes, S.; and Young, S. 2017. A Network-based End-to-End Trainable Task-oriented Dialogue System. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 438–449.

Xu, J.; Ju, D.; Li, M.; Boureau, Y.-L.; Weston, J.; and Dinan, E. 2020. Recipes for Safety in Open-domain Chatbots. arXiv preprint arXiv:2010.07079.

Young, S.; Gašić, M.; Thomson, B.; and Williams, J. D. 2013. POMDP-Based Statistical Spoken Dialog Systems: A Review. In Proceedings of the IEEE, volume 101, 1160–1179.

Zhang, S.; Dinan, E.; Urbanek, J.; Szlam, A.; Kiela, D.; and Weston, J. 2018. Personalizing Dialogue Agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2204–2213.

Zhang, Y.; Sun, S.; Galley, M.; Chen, Y.-C.; Brockett, C.; Gao, X.; Gao, J.; Liu, J.; and Dolan, B. 2020. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 270–278.

Zhao, T.; Lee, K.; and Eskenazi, M. 2016. DialPort: Connecting the Spoken Dialog Research Community to Real User Data. In 2016 IEEE Spoken Language Technology Workshop, 83–90.

Zhao, T.; Zhao, R.; and Eskenazi, M. 2017. Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 654–664.

Zhu, Q.; Zhang, Z.; Fang, Y.; Li, X.; Takanobu, R.; Li, J.; Peng, B.; Gao, J.; Zhu, X.; and Huang, M. 2020. ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 142–149.
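As a concrete reading of the scoring rule above, the two success metrics and their average can be sketched in Python. This is a minimal illustration under assumed dict-based annotation shapes, not the official DSTC9 evaluation scripts; the "all slots match one entity" check is one plausible reading of "can be found in the database".

```python
# Sketch of the two success metrics described above.
# Assumed (hypothetical) data shapes:
#   dialogue = {
#       "worker_success": bool,   # AMT worker's success/fail annotation
#       "provided_slots": dict,   # request + inform slot values the system gave
#   }
#   database = list of entity dicts, e.g. {"name": ..., "phone": ...}

def success_without_grounding(dialogue):
    """Based only on the AMT worker's annotation (success or fail)."""
    return dialogue["worker_success"]

def success_with_grounding(dialogue, database):
    """Stricter metric: the worker marks the dialogue as a success AND
    every request/inform slot value the system provided is consistent
    with some entity in the database."""
    if not dialogue["worker_success"]:
        return False
    slots = dialogue["provided_slots"]
    return any(
        all(entity.get(slot) == value for slot, value in slots.items())
        for entity in database
    )

def average_success_rate(dialogues, database):
    """Average of the two metrics, each computed over all dialogues."""
    n = len(dialogues)
    rate_wo = sum(success_without_grounding(d) for d in dialogues) / n
    rate_w = sum(success_with_grounding(d, database) for d in dialogues) / n
    return (rate_wo + rate_w) / 2
```

A dialogue that the worker marks successful but whose provided phone number does not appear in the database counts toward the first metric and fails the second, which is why the grounded metric is the more practical one.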
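How the two curriculum stages divide the work at inference time can be sketched as follows. The model interfaces here are hypothetical stand-ins (not PLATO-2's released API), and the latent count `K` is illustrative: the coarse-grained one-to-one model answers task-oriented queries directly, while the fine-grained model generates one candidate per discrete latent value and the evaluation model picks the most coherent one.

```python
# Sketch of a two-stage PLATO-2-style inference pipeline
# (hypothetical interfaces; the real models are large Transformers).

K = 20  # number of discrete latent values (illustrative)

def respond_task_oriented(context, coarse_model):
    # Stage-1 model: semantic mapping is near-deterministic,
    # so a single response suffices.
    return coarse_model.generate(context)

def respond_open_domain(context, fine_model, eval_model):
    # Stage-2 models: each latent value z steers generation toward
    # a different plausible response (one-to-many mapping).
    candidates = [fine_model.generate(context, latent=z) for z in range(K)]
    # The evaluation model estimates coherence with the context
    # and selects the final response.
    return max(candidates, key=lambda r: eval_model.coherence(context, r))
```

The design choice mirrors the paper's split: diversity and re-ranking pay off where many responses are acceptable (open-domain, knowledge grounded), and are unnecessary where the target is essentially unique (task completion).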