Augmenting Transformers with KNN-Based Composite Memory for Dialog
Angela Fan
Facebook AI Research, Université de Lorraine, LORIA
angelafan@fb.com

Claire Gardent
CNRS/LORIA
claire.gardent@loria.fr

Chloé Braud
CNRS/IRIT
chloe.braud@irit.fr

Antoine Bordes
Facebook AI Research
abordes@fb.com
Transactions of the Association for Computational Linguistics, vol. 9, pp. 82–99, 2021. https://doi.org/10.1162/tacl_a_00356
Action Editor: Masaaki Nagata. Submission batch: 6/2020; Revision batch: 9/2020; Published 3/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

Various machine learning tasks can benefit from access to external information of different modalities, such as text and images. Recent work has focused on learning architectures with large memories capable of storing this knowledge. We propose augmenting generative Transformer neural networks with KNN-based Information Fetching (KIF) modules. Each KIF module learns a read operation to access fixed external knowledge. We apply these modules to generative dialog modeling, a challenging task where information must be flexibly retrieved and incorporated to maintain the topic and flow of conversation. We demonstrate the effectiveness of our approach by identifying relevant knowledge required for knowledgeable but engaging dialog from Wikipedia, images, and human-written dialog utterances, and show that leveraging this retrieved information improves model performance, measured by automatic and human evaluation.

1 Introduction

Machine learning approaches to various tasks, such as game-playing or dialog, are often dependent on external information. This information can take multimodal forms, including structured knowledge bases, free text, and images, and also comes in overwhelmingly large quantities. A pressing challenge is to create models that can identify which specific elements of multiple information sources are relevant in a particular context, and incorporate them into standard architectures on each task. In this work, we focus on human–machine dialog and how to efficiently retrieve external knowledge that is relevant to the dialog. We consider two scenarios and, for each scenario, retrieve two types of knowledge: (i) knowledge about similar dialog contexts and (ii) external knowledge used to ground the conversation into real world information.

Knowledge about similar dialog contexts allows for a hybrid retrieval/generative approach to dialog where the system response is generated based not only on a representation of the current dialog context and of the relevant world knowledge, but also based on a response retrieved from a similar dialog context. The retrieved knowledge can be viewed as providing information about structure and dialog sentences, or utterances: which response is likely given a similar context?

External knowledge is also retrieved to improve the semantic content of the dialog model. In one scenario, Wizard of Wikipedia (Dinan et al., 2018), general topics are provided to crowdworkers, who are asked to have in-depth and specific conversations about these topics by referencing specific Wikipedia sentences as knowledge. In this scenario, external knowledge is retrieved from a pre-selected set of Wikipedia sentences associated with the current dialog topic. Retrieval aims to select the sentence that is most relevant at each step of the dialog and thereby to ground system responses in relevant world knowledge (e.g., by referring to Star Wars when talking about science fiction).

In the other scenario, Engaging ImageChat (Shuster et al., 2020), crowdworkers are provided with images and asked to have a conversation inspired by or about the image. In this case, the retrieved external knowledge is images and their associated dialogs. By retrieving images that are similar to the image being talked about, we aim to enrich system responses with knowledge about what is typically mentioned when describing similar images (e.g., when talking about an image with dogs, mentioning their breed).

Our work on incorporating different types and modalities of knowledge is related to methods that strive to add external memory, such as knowledge bases, to neural networks. Previous work has explored incorporating large external memories into neural network layers (Weston et al., 2015; Sukhbaatar et al., 2015, 2019; Lample et al., 2019). Many existing approaches focus on using attention over the memory slots, which is computationally intensive and becomes less effective as the size of the memory grows. In this work, we propose representing multiple sources of external information as fixed encodings and using K Nearest Neighbors (KNN) search to fetch relevant information. KNN search is computationally efficient and scalable, and libraries like faiss (Johnson et al., 2019) allow KNN to be easily used on GPUs and integrated into neural networks. Further, the external memories are pre-encoded, so the information encoding is only computed once. As the external memories are kept fixed, they do not require any training to learn the memories along with the model. We can thus scale easily to larger memories by learning only the KNN-based read operation to identify relevant information from the memory.

Our core contribution proposes an efficient, KNN-based Information Fetching (KIF) module that can access relevant external knowledge, combine knowledge from different sources, and integrate this information into standard sequence to sequence architectures. We apply these flexible modules to two dialog datasets that challenge generative models to leverage external information to write coherent, on-topic responses. Both of our chosen tasks require models to leverage external information, such as information from Wikipedia or images, to engage in the conversation. We show that relevant information can be identified from hundreds of thousands of candidates in a multimodal, multi-knowledge-source setting to improve the performance of generative dialog models. Further, the output of the KIF modules is interpretable as specific human-readable knowledge elements are selected, allowing users to better understand the information the generative model conditions upon when writing the subsequent utterance. On both datasets, we achieve state-of-the-art results compared to generative models and find there is no statistically significant difference in the interestingness or human preference of our model output compared to state-of-the-art retrieval models.

2 Related Work

We discuss related work on learning to incorporate external knowledge into neural networks and efficiently access relevant information. We then describe work in generative dialog that incorporates knowledge.

2.1 Incorporating External Knowledge

Augmenting neural networks with memory, or longer-term components that can be accessed with read and write operations, has been explored in various proposed architectures. For example, Memory Networks (Weston et al., 2015; Sukhbaatar et al., 2015, 2019) introduce attention mechanisms over large external memories. Neural cache models (Grave et al., 2017b) simplify these to access previous memories with a dot product. Previous work has also studied how to read and write into these memory architectures (Rae et al., 2016; Graves et al., 2014; Joulin and Mikolov, 2015). In contrast, we focus on how to read large memories.

Another line of research has focused on computational scalability for larger external memories to allow efficient access of information. For example, Chandar et al. (2016) propose a hierarchical memory network rather than a flat one and Rae et al. (2016) learn sparse operations to read and write. Lample et al. (2019) focus on learning memories of up to one million slots and how to efficiently access the slots using product keys. Khandelwal et al. (2019) use nearest neighbor operations to augment language models by performing retrieval at the token level; in contrast, we focus on multimodal retrieval of multiple pieces of knowledge based on an entire dialog context. Beyond explicit memory representations, it may be possible to store information implicitly during training time by memorizing common patterns present in text (Petroni et al., 2019). We focus on learning
to fetch relevant information from multiple explicit external multimodal knowledge sources and integrate them into one network. Further, our work allows the retrieved information to be interpreted, as each memory slot is an explicit fact that can be read as text, rather than a learned vector such as in Lample et al. (2019).

Work has also focused on computationally efficient softmax operations (Mnih and Hinton, 2009; Grave et al., 2017a; Chen et al., 2016). Many approximate softmax techniques use KNN-like operations to form clusters, and the overall softmax operation is constrained by the slow calculation of the exponential. Our usage of KNN benefits from efficient and scalable libraries such as faiss and nmslib.

2.2 Generative Dialog

We develop a general architecture for incorporating external information and apply it to the case of generative dialog models. Previous work in dialog has leveraged knowledge as necessary information to accomplish the task. For example, airline and restaurant booking tasks often use API calls to access information about reservation times and availability (Bordes et al., 2017). In contrast, our work focuses on how to incorporate unstructured knowledge, such as free text found on the Web. Previous work has used architectures that attend over the available knowledge and identify relevant pieces of information, which scales poorly with large quantities of information (Dinan et al., 2018; Qin et al., 2019; Lian et al., 2019). We replace the use of attention over external information with the output of a KNN module. Other work has investigated incorporating information retrieval in language modeling and question answering (Chen et al., 2017; Fan et al., 2019; Seo et al., 2019; Guu et al., 2020), while we focus on dialog applications and flexibly incorporating knowledge from multiple, multimodal sources.

On the modeling side, work has explored both generative (Serban et al., 2016a, 2016b) and retrieval based models (Zhang et al., 2018), which identify the best utterance from the training set to return as the dialog response. This often leverages self-attention or cross-attention mechanisms (Humeau et al., 2019). Further work has explored hybrid models, for example, using the output of a retrieval model as input for a generative model (Dinan et al., 2018; Weston et al., 2018; Cai et al., 2019; Zhu et al., 2020). Some of this work has specialized to use both types of models to generate conversations in an ensemble (Song et al., 2016) or to specifically improve consistency (Song et al., 2020). We extend these approaches by augmenting generative models with retrieval-like operations based on KNN search, allowing dialog models to flexibly incorporate various sources of external knowledge at the same time and scale to large quantities of retrieval candidates.

3 KNN-based Information Fetching Modules

Broadly, the KIF module assumes an encoder model $M$ can access inputs $X = \{x_1, x_2, \ldots, x_n\}$. For example, $X$ can be a collection of sentences, and $x_i$ represents an individual sentence. In a setting without additional supporting information, the encoder will process an input $x_i$ and produce the encoder output $M(x_i)$. If $x_i$ is a sequence such as a sentence, then $M(x_i)$ is a representation whose size is the variable sequence length by the fixed hidden size of the encoder $M$. However, in many tasks, additional information is present, represented as $E = \{e_1, e_2, \ldots, e_m\}$. We encode each element of $X$ and $E$ into a vector representation using the encoder. To identify the closest information in $E$ that is relevant to $x_i$, our general approach will be to use KNN by comparing the representation of $x_i$ with the representation of each element in the set $E$. KNN is a fully differentiable operation (Plötz and Roth, 2018), so it can be incorporated in a straightforward way into neural models. The most relevant information in $E$ will then be available in the model. We display a KIF-Augmented model¹ in Figure 1 and describe how the KIF module operates.

Figure 1: KIF modules fetch relevant information from multimodal external knowledge. External knowledge sources $E_1$ and $E_2$ are pre-encoded by encoder $M$ (green). In the model, input $x_i$ is encoded by encoder $M'$ (blue) to produce $M'(x_i)$. KIF modules (orange) operate on $M'(x_i)$ and identify the nearest neighbors encoded in $M(E_1)$ and $M(E_2)$ using KNN. Identified relevant elements from $E_1$ and $E_2$ are re-encoded by $M'$ in a gating mechanism with a weighted sum (represented by $\sigma(\mathrm{WS}^1_i) \cdot \mathrm{WS}^1_i$, where WS stands for weighted sum), then concatenated to $M'(x_i)$. Full description with notation can be found in Section 3.

One challenge to overcome is that the representations of all elements of the knowledge source $E$ are pre-computed and kept fixed, creating $M(E)$: we do not backpropagate to affect the embeddings of the pre-encoded knowledge. In the early stages of training, the model receives large amounts of loss, which would affect the quality of the pre-encoded embeddings if we backpropagated to them. Further, encoding the fixed external knowledge once and re-using it allows for greater scalability. However, this lack of backpropagation can introduce a mismatch between the encoding of $E$ and the encodings produced by a model that is training, as the training model has constantly changing representations because the weights are being learned. We use $M$ to represent the original encoder model used to encode $E$ and $M'$ to represent the constantly training model that is encoding $X$. The model must learn a function to align $M'(x_i)$ to the pre-encoded elements of the external memory $M(E)$.

To circumvent this misalignment, we learn a mapping operator $f_E(M'(x_i))$ that trains to map elements of the model's representation of $X$, or $M'(X)$, into the additional information representation space $M(E)$. Concretely, $f_E(M'(x_i))$ is a multilayer perceptron with ReLU nonlinearities. From the input elements of $X$, $f_E(M'(x_i))$ learns representations of an output close to the corresponding projection of $X$ into $E$. This can be interpreted as learning a read operation on a fixed external memory. If there was no change to the encoding of the model compared to the pre-computed knowledge, then the ideal mapping operator would be the identity function (as $M'$ would equal $M$). However, as the model changes significantly during the training process, the nonlinear mapping capability of $f_E(M'(x_i))$ is essential to be able to identify the correct knowledge $E$ from the input $X$.

Thus, a model augmented with KIF will incorporate external knowledge in the following manner. First, we find the $k$ nearest elements to $f_E(M'(x_i))$ in $M(E)$, based on KNN search with inner product. Then, the relevant elements identified by KNN are re-encoded by $M'$. For example, if element $e_j$ is retrieved by KIF, it would produce $M'(e_j)$. We use the optimized faiss library for KNN search, which can conduct billion-scale KNN efficiently on GPUs.

The KNN output for an element $x_i$ is produced by using faiss to search for the $k$ nearest representations to $f_E(M'(x_i))$ in $M(E)$. Note that as the encoders $M$ and $M'$ produce output representations of variable length (for example, in the case where $x_i$ is a variable length sequence, such as a sentence), we average across the length dimension to produce a fixed-size representation $r$ to conduct the KNN search.

$$ r_{x_i} = \mathrm{Avg}\big(f_E(M'(x_i))\big) \tag{1} $$
$$ R_E = \{\mathrm{Avg}(M(e)) \mid e \in E\} \tag{2} $$
$$ \mathrm{KNN}_{x_i} = \mathrm{KNearest}(k, r_{x_i}, R_E) \tag{3} $$

Then, the KIF module output for an element $x_i$ is the set of all re-encoded representations of the KNN-retrieved knowledge:

$$ \mathrm{KIF}_{x_i} = \{M'(e) \mid e \in \mathrm{KNN}_{x_i}\} \tag{4} $$

These elements are weighted by their normalized nearest neighbor scores and then summed. This is subsequently concatenated to $M'(x_i)$ to form the final encoder output:

$$ [M'(x_i),\ \mathrm{WeightedSum}(\mathrm{KIF}_{x_i})] \tag{5} $$
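The read operation in Equations (1) to (5) can be illustrated with a small, dependency-free sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: an exact inner-product search over Python lists stands in for faiss; the "normalized nearest neighbor scores" are normalized with a softmax; each re-encoded neighbor is averaged over its length dimension and the weighted sum is appended to $M'(x_i)$ as one extra position; and the optional gate applies $\sigma(\mathrm{WS}) \cdot \mathrm{WS}$ element-wise with no learned parameters, following the notation of Equation (8). All function names (`kif`, `k_nearest`, `toy_encode`, and so on) are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Matrix = List[Vector]  # (length x hidden): one row per token


def avg_over_length(seq: Matrix) -> Vector:
    """Average a variable-length representation into one fixed-size vector."""
    return [sum(column) / len(seq) for column in zip(*seq)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def k_nearest(k: int, query: Vector, memory: Matrix) -> List[Tuple[int, float]]:
    """Exact inner-product search (Eq. 3); an assumed stand-in for a faiss index."""
    scored = [(j, dot(query, row)) for j, row in enumerate(memory)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def softmax(scores: Sequence[float]) -> List[float]:
    """Assumed normalisation of the nearest-neighbour scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(vectors: Matrix, weights: Sequence[float]) -> Vector:
    return [sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(len(vectors[0]))]


def gate(ws: Vector) -> Vector:
    """Element-wise sigma(WS) * WS, read literally from Eq. 8 (assumption)."""
    return [v / (1.0 + math.exp(-v)) for v in ws]


def kif(encode: Callable[[str], Matrix], mapper: Callable[[Vector], Vector],
        x_i: str, knowledge: Sequence[str], memory: Matrix,
        k: int, use_gate: bool = False) -> Matrix:
    enc_x = encode(x_i)                                    # M'(x_i)
    r_x = avg_over_length([mapper(h) for h in enc_x])      # Eq. 1
    hits = k_nearest(k, r_x, memory)                       # Eq. 3
    fetched = [avg_over_length(encode(knowledge[j]))       # Eq. 4, averaged
               for j, _ in hits]
    ws = weighted_sum(fetched, softmax([s for _, s in hits]))
    if use_gate:
        ws = gate(ws)
    return enc_x + [ws]                                    # Eq. 5: concatenation


def toy_encode(text: str) -> Matrix:
    """Toy encoder: one 2-d vector per word (word length, vowel count)."""
    return [[float(len(w)), float(sum(c in "aeiou" for c in w))]
            for w in text.split()]


knowledge = ["dogs bark", "ice cream melts"]
# Eq. 2: the memory is encoded once and then kept fixed.
memory = [avg_over_length(toy_encode(e)) for e in knowledge]
output = kif(toy_encode, lambda h: h, "my dog barks", knowledge, memory, k=1)
```

Here the same toy encoder plays the role of both $M$ and $M'$, so the identity function is used for the mapping operator; in the setting described above, $M'$ drifts away from $M$ during training and the mapper would be a trained multilayer perceptron.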
This can be easily extended to using multiple modules simultaneously. For instance, two sources of external information, $E_1$ and $E_2$, can be combined by identifying the top candidates of each information source. The weighted sum of the KIF output on each information source is concatenated with the encoded input $M'(x_i)$. The KIF output dimensionality is the same size as the hidden size of $M'(x_i)$, so they can be directly concatenated.

Finally, different sources of information may not be required for every prediction and some information sources can be more important than others. To allow the model to make more fine-grained decisions about what information to use from what source, and how much of it, we add a gating mechanism using a sigmoid function around each weighted sum of KNN representations. $\mathrm{KIF}^1_i$ and $\mathrm{KIF}^2_i$ denote the KIF module from Equation (4) applied to $E_1$ and $E_2$, respectively.

$$ \mathrm{WS}^1_i = \mathrm{WeightedSum}(\mathrm{KIF}^1_i) \tag{6} $$
$$ \mathrm{WS}^2_i = \mathrm{WeightedSum}(\mathrm{KIF}^2_i) \tag{7} $$

which produces the final encoder output, a concatenation of $M'(x_i)$ with the output of multiple KIF modules:

$$ [M'(x_i),\ \sigma(\mathrm{WS}^1_i) \cdot \mathrm{WS}^1_i,\ \sigma(\mathrm{WS}^2_i) \cdot \mathrm{WS}^2_i] \tag{8} $$

This concatenation represents the output of the encoder $M'$ and can be used for various purposes, such as providing the encoder output to a decoder in a sequence to sequence model.

4 Applying KIF to Dialog Tasks

We describe how to apply KIF to the task of generative dialog, a setting where models must generate engaging and on-topic responses. We investigate dialog for two reasons: First, dialog agents must be able to consult relevant information to maintain the topic of the conversation. Second, retrieval-based agents have strong performance compared to generative ones, due to their ability to copy dialog utterances from the training set. Using KIF, we can incorporate the benefits of retrieval architectures into generative, knowledge-based models.

4.1 KIF for Generative Dialog

In dialog, $x_i$ represents the text of the conversation $i$. A conversation consists of multiple back-and-forth utterances (or turns). For example, a conversation could consist of 4 turns: $x_i = [x_{i,1}, x_{i,2}, x_{i,3}, x_{i,4}]$ where $x_{i,4}$ is the direct utterance the model should respond to, and the earlier utterances are the conversation context.

Standard generative dialog models use a Transformer neural network as the encoder $M$ and want to produce an output that is an appropriate response to the conversation. However, in many cases, the conversation history alone does not include all of the information required to produce an appropriate response. For example, if a model needs to chat about a specific movie, it can be helpful to provide the model with more information about that movie so a more interesting dialog response could be produced. To incorporate knowledge, models often concatenate a knowledge source $E$ such as Wikipedia to $x_i$ and use attention modules to identify the most relevant knowledge. However, this approach is computationally intensive when handling large quantities of information. Further, attention mechanisms have been found to operate poorly over long sequences, as the mechanism becomes blurry due to the softmax and struggles to make fine-grained decisions (Fan et al., 2018b). The same is true for hierarchical approaches, which lack scalability.

We augment Transformer sequence to sequence (seq2seq) networks on the encoder side with KIF to improve generative dialog models. We experiment on two dialog tasks, Wizard of Wikipedia (Dinan et al., 2018) and Engaging ImageChat (Shuster et al., 2020). In both datasets, models must leverage information external to the dialog history alone: in Wizard of Wikipedia, the chat requires access to knowledgeable facts and in Engaging ImageChat, discussion about a specific image. As models must process multiple inputs and ground responses in the knowledgeable facts or images, these tasks challenge existing seq2seq approaches.

4.2 Wizard of Wikipedia

The goal of the Wizard of Wikipedia dataset is to train knowledgeable agents that can chat in any domain. The dataset contains 1,365 various topics discussed in 18,430 dialogs in the training set,
totalling 166,787 training utterances. Each topic is a general concept, such as dogs or ice cream, and is included as the first utterance of the conversation. The conversation is meant to be in-depth and detailed, so individual utterances must reference specific knowledge as a basis for the utterance. The knowledge takes the form of Wikipedia sentences. For example, the chat utterance "I love Toy Story! It was released in 1995" would reference the Wikipedia sentence "Toy Story is a 1995 American computer-animated buddy comedy [...]". For each utterance, a set of sentences is identified by an information retrieval system, and the crowdworker selected one knowledge sentence as the basis for their utterance.

Knowledge Sources. Our model for Wizard of Wikipedia has access to two sources of external information, $E_1$ and $E_2$:

• $E_1$ is Wikipedia Knowledge provided by the dataset as evidence to support knowledgeable chitchat (initially curated by the information retrieval system used in Dinan et al. [2018]). The scale of this KNN search is to filter through an average of 34 sentences. The KIF module uses dialog features to fetch relevant knowledge to condition upon to generate the subsequent utterance.
• $E_2$ is Training Utterances. To incorporate the benefits of retrieval-based dialog models to the generative setting, we use KIF to identify relevant utterances from the training set and take their responses as input. If many conversations about dogs have already occurred, models should be able to take advantage of these human-written examples to improve their generations. For example, likely conversations could be about the breed of the dog, daily routine with a pet, and similar topics. There are around 170K dialog utterances as inputs to KNN search. This can be interpreted as incorporating the benefits of retrieval models by identifying an utterance with similar structure as the text the model would like to generate. We do not allow the module to fetch the correct response of the current conversation context.

Access to these two sources of knowledge can be seen as learning a template and a topic separately. Sample templates can be identified from the training utterances, and topic-specific information learned by accessing the Wikipedia knowledge.

Additional KNN Features. To better identify relevant training utterances from the large quantity available, we break down $x_i$ into conversation sub-features for a more fine-grained match in the KNN search step. By conducting KNN on more features, we can achieve higher quality retrieval. We leverage the nature of dialog to decide these features.

We concatenate the encoding of the most recent dialog utterance (e.g., $x_{i,\mathrm{last}}$) with the encoding of the dialog context from the current conversation and the turn number $t$, such that $[M'(x_{i,\mathrm{last}}),\ M'(x_{i,-\mathrm{last}}),\ t]$ is the representation used for KNN search. Concretely, if the model is trying to produce the 5th turn of the conversation, then $x_{i,\mathrm{last}}$ is the most recent utterance from the dialog partner, $x_{i,-\mathrm{last}}$ would be the last 3 turns of exchange, and $t$ would be 4. Note that the turn number is represented as a standalone number. These are known to be salient conversation features. The most recent dialog utterance is the direct turn the model is responding to, and the dialog context may provide additional clues. The turn number is important, as earlier turns are often generic (e.g., "how are you doing today") and later turns are more specific.

4.3 Engaging ImageChat

The goal of Engaging ImageChat is to create agents capable of chitchatting about images selected from the YFCC100M dataset (Thomee et al., 2016). The dataset contains 186,782 dialogs in the training set, each about a unique image, totalling 355,862 utterances. Agents are assigned one of 215 personalities (e.g., sweet, caring, excited) to increase engagingness. Previous work (Shuster et al., 2020, 2019) identified that both crowdworkers and models, when provided with personalities, produced more diverse, interesting responses, as evaluated by humans.

We use a multimodal neural network designed to handle both image input and text input. Following Shuster et al. (2020), the images are encoded using a pre-trained ResNeXt network (Xie et al., 2017). To extract the final image representation, we project the 2048-dimensional output of the image encoder to 512 dimensions using a deep multilayer perceptron with ReLU activation units. The conversation history, which
includes the one-word personality, is encoded with a Transformer encoder network. The image and conversation are integrated using the Multimodal-Sum-Combiner module proposed in Shuster et al. (2020).

Knowledge Sources. Our model for Engaging ImageChat has access to two sources of external information, $E_1$ and $E_2$:

• $E_1$ is Chat on Similar Images. Although there are over 180K different images in this dataset, many of the images are similar. For example, conversations associated with two pictures of dogs could be relevant to each other. The model is able to use KIF directly on the current image features to fetch from around 180K different images and return 6 turns of related chat for each fetched image. Fetching from $E_1$ consists of identifying related image chats, or conversations on related topics.
• $E_2$ is Training Utterances. Similar to the motivation for the previous dataset, we allow the model to identify training utterances that could be useful for responding in the current conversation. The scale of this fetching task is large: 350K dialog utterances. This could be interpreted as identifying utterances with similar structure to what the model would like to generate, and is complementary to the topic-based related image chats.

Additional KNN Features. To identify relevant information from training utterances, we use the same dialog features as Wizard of Wikipedia in the KNN search step, with one modification: We add the personality provided by the dataset. We represent the personality feature as the personality word, such as caring, and embed it with the encoder $M'$. As utterances from speakers with the same personality are more likely to be similar, this feature improves the quality of the fetched information. For example, conversations with the sweet personality often include similar text such as "aww, that's wonderful". We use two additional features for the KNN search: $t$, the turn number, and $p$, the personality. This feature is explicitly used in Shuster et al. (2020) to improve the engagingness and flow of the conversation. Similar to Wizard of Wikipedia, we represent the conversation turn $t$ as a number. The Transformer model is used to encode text $x_i$ and produce a representation of the text, then the turn number $t$ and personality $p$ are represented separately. As the personality is a word, we use the same Transformer to encode it. The concatenation of features used for KNN search is: $[M'(x_{i,\mathrm{last}}),\ M'(x_{i,-\mathrm{last}}),\ t,\ p]$.

5 Experimental Setup

5.1 Implementation Details

Parameter Settings. We use parl.ai (Miller et al., 2017) to implement our models. The data for both datasets used is available for download from parl.ai as well. We use byte-pair encoding (Sennrich et al., 2016) to represent the text to better handle the rare word problem (Dinan et al., 2018; Fan et al., 2018a). Our generative Transformer models have 8 encoder layers and 8 decoder layers, with FFN size 2048, embedding dimension 512, and 4 attention heads. We optimize using Adam (Kingma and Ba) and the inverse square root learning schedule (Vaswani et al., 2017) with 10k warmup updates. The initial learning rate is 0.0001 and we optimize for model perplexity. We use a dropout of 0.5 and set gradient clipping to 0.1. We set $k = 5$ for all cases. For both datasets, we model a vocabulary size of 54,944 based on the BPE-based vocabulary from the Reddit pre-training. We tuned the learning rate and batch size hyperparameters together.

Pre-training. We pre-train the Transformer seq2seq model used for both datasets on 250M comments from Reddit. The Reddit dataset was made available by pushshift.io. The comments are parsed to maintain conversational threads of users responding to each other, so the encoder network has been exposed to conversational context at training time. Note that the Reddit dataset does not include aspects such as personality, as those are unique to specific datasets such as Engaging ImageChat. The context size in pre-training is set to 512 tokens. The ResNeXt encoder used to model images for the Engaging ImageChat dataset was pre-trained on 3.5 billion images (Mahajan et al., 2018).

5.2 Evaluation

Generation. We generate with beam search, setting the beam size to 4. We use 3-gram blocking. This technique prevents any n-gram from being generated more than once and reduces repetition.
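As a rough illustration of n-gram blocking, the sketch below computes, for a partial hypothesis, which next tokens would complete an n-gram that the hypothesis already contains; a beam search step could then exclude those tokens. This is a generic sketch of the technique and not the parl.ai implementation, and the function name `banned_next_tokens` is hypothetical.

```python
from typing import Sequence, Set


def banned_next_tokens(prefix: Sequence[str], n: int = 3) -> Set[str]:
    """Return the tokens that would repeat an n-gram already present in `prefix`."""
    if len(prefix) < n:
        return set()  # too short to contain any complete n-gram yet
    # The last n-1 generated tokens form the start of the next n-gram.
    context = tuple(prefix[len(prefix) - (n - 1):])
    banned = set()
    for i in range(len(prefix) - n + 1):
        # If an earlier n-gram starts with the same n-1 tokens,
        # its final token must not be generated next.
        if tuple(prefix[i:i + n - 1]) == context:
            banned.add(prefix[i + n - 1])
    return banned


hypothesis = "i love dogs and i love".split()
blocked = banned_next_tokens(hypothesis, n=3)
```

With 3-gram blocking, the hypothesis above has already produced "i love dogs", so the token "dogs" is excluded as the next word after the second "i love".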
Automatic Metrics. Following Dinan et al. (2018), we compute F1, a metric of unigram overlap, between the generated utterance and the human-written reference utterance from the dataset. For generative models, utterances are generated using beam search. For retrieval models, the next utterance is predicted by ranking the entire set of training utterances, and the highest scoring utterance is chosen.

In Wizard of Wikipedia, there are two test sets: A set of seen topics, or topics that have been seen at training time with new test-time dialogs. The second set is unseen, or topics that have not been encountered at all during training time. We evaluate on both subsets.

Human Evaluation. We follow the setup and use the analysis questions proposed in the Acute-Eval dialog evaluation system (Li et al., 2019). For reproducibility, we adopt this existing evaluation setting that has been applied to several dialog datasets. We use the question wording suggested by Acute-Eval and follow their self-chat procedure and interface. As one of the original datasets assessed in this system was Wizard of Wikipedia, their evaluation setting extends naturally to ours. We collect 100 human-bot conversational dialogs on a crowdsourcing platform for both datasets. The dialogs are eight turns long. Then, we show pairs of the collected conversations side by side, one conversation with a human and model A and the other conversation with a human and model B. We ask annotators the following questions:

• Who would you prefer to talk to for a long conversation?
• If you had to say one of the speakers is interesting and one is boring, who would you say is more interesting?
• Which speaker sounds more human?
• Which speaker has more coherent responses in the conversation?
• If you had to say that one speaker is more knowledgeable and one is more ignorant, who is more knowledgeable? (Wizard of Wikipedia only)

We measure the percentage of time one model was chosen over the other, taking the majority agreement between three evaluators. To reduce variance, dialogs paired in the evaluation were collected on the same topic for Wizard of Wikipedia and collected on the same image and personalities for Engaging ImageChat. Topic and images selected for evaluation are unique and taken randomly from the test set.

5.3 Baselines

We compare Transformers augmented with KIF to other existing approaches on Wizard of Wikipedia and Engaging ImageChat. The best approaches, judged by human evaluation, are retrieval models, the Retrieval Transformer Memory Network from Dinan et al. (2018) and the Retrieval Transformer from Shuster et al. (2020). These have been shown to be strong baselines compared with other retrieval techniques based on TF-IDF (Chen et al., 2017). Thus, we report the existing retrieval models for both datasets, but focus on comparing to other generative baselines.

We compare to three additional generative baselines. Note that in Wizard of Wikipedia, the construction of the dataset is that sentences of Wikipedia knowledge are provided with the utterances in a concatenated form. Models must identify the relevant information in this provided knowledge, or can access more Wikipedia knowledge beyond the provided sentences. The following baseline methods always have access to the information provided in the dataset already, but no additional Wikipedia knowledge beyond that.

• Transformer Memory Networks. To contrast the ability of KIF to existing work, we compare our models to published Transformer Memory Networks (Dinan et al., 2018). These models encode each piece of external information independently with a Transformer Encoder, and these are stored as memory slots. To access information in the memory slots, a model performs dot-product attention between the memory slots and the dialog context. In Dinan et al. (2018), the knowledge selection from Wikipedia was supervised with either (a) a two-stage model where the first model was trained to predict the right knowledge and a second model conditions on the predicted knowledge to generate the next utterance, or (b) an end-to-end model with an auxiliary loss for knowledge prediction accuracy.
• Retrieve and Refine. We implement a hybrid model (Weston et al., 2018) that incorporates
top retrieval candidates as additional input to Generative Transformer MemNets. In Retrieve and Refine, a fixed number of candidates are retrieved and concatenated to the conversational history in the encoder, making the input much longer. For both datasets, the Retrieve and Refine mechanism that fetches a fixed number of training utterances is added to the Generative Transformer MemNet with Reddit Pre-Training baseline.
Unlike the KIF-Augmented Transformer, the retrieval is conducted with a separate model so there is no backpropagation to affect the retrieval. With KIF, models can alter the retrieved candidates by learning the mapping operator. Further, a fixed amount of information is always retrieved, without the capability to easily rescale to focus on specific candidates. KIF modules have weighting mechanisms to focus more on certain information, and the modules are combined with gating so models can learn which knowledge sources are more important and adjust flexibly. Lastly, Retrieve and Refine is only used to retrieve one source of information: training set utterances.
• Response Generation with MR. We implement the model proposed in Qin et al. (2019), which encodes the conversation history and document contextually with a biLSTM before generating the next dialog utterance. The initial model was applied to a machine reading task where a knowledge document was provided along with the conversation history. For Wizard of Wikipedia, we replace the knowledge document with the Wikipedia sentences provided in the dataset. The model then uses the conversation to identify the most relevant information in the document using a cross-attention mechanism. For the Engaging ImageChat dataset, as there is no document provided with the dataset, we replace the expected document with the conversation history, and use the most recent utterance in the conversation to attend to the conversation history.
We make an additional improvement to this baseline: in Qin et al. (2019), the embeddings used pre-trained CoVE vectors (McCann et al., 2017). We found our Reddit pre-trained Transformer embeddings to work more effectively as they are trained for dialog. Thus, we replace CoVE embeddings with domain-specific ones.
All of the Transformer generative baselines are initialized with the same pre-training on Reddit that we use for our models, for fair comparison on modeling quality.
6 Results
We describe the results of incorporating KIF modules into Transformer networks. We display an example conversation between a human and our model in Figure 4, and show the top scoring Wikipedia knowledge and Training Utterance fetched by KIF modules. We compare to various baselines using automatic and human evaluation, and discuss our experiments. We present various ablation settings to understand the key features that make our method function.
6.1 KIF is Effective for Incorporating Knowledge
Automatic Evaluation. Comparing KIF-augmented Transformer networks to published baselines and Retrieve and Refine, we find improved results.
For Wizard of Wikipedia, the improvement in F1 score over the best baseline is around 8 points (see Table 1). A major contributing factor is the construction of the dataset: as each dialog turn is grounded in a specific knowledge sentence from Wikipedia, improving the ability to identify the relevant fact strongly improves performance. Contrasting the results from the seen and unseen test sets in Table 1, the improvement on unseen is smaller, as it is harder to fetch training utterances for unseen topics.
While Engaging ImageChat has no explicit dependency on knowledge, we still see a 2 point improvement compared to the Generative Transformer MemNet (with the additional Reddit pre-training), indicating that KIF can be generally useful (see Table 2). Compared to an even stronger baseline that we tune in this work, Retrieve and Refine, we see a 1 point improvement.
Human Evaluation. Results are shown in Figure 2. On both datasets, we find there is a large improvement over existing generative models (green bars) that is statistically significant for some of the evaluation questions. Evaluators agree that
Model | Test F1 (Seen) | Test F1 (Unseen)
Retrieval Baselines
Retrieval Transformer MemNet (Dinan et al., 2018) | 15.4 | 12.4
Generative Baselines
2-Stage Generative MemNet (Dinan et al., 2018) | 18.9 | 17.4
Generative Transformer MemNet (Dinan et al., 2018) | 16.9 | 14.4
+ Reddit Pre-Training | 17.6 | 16.3
Retrieve and Refine (Weston et al., 2018) | 18.2 | 17.9
Response Generation with MR (Qin et al., 2019) | 17.5 | 16.8
KIF-Augmented Transformer | 25.9 | 22.3
Table 1: Results on the Wizard of Wikipedia dataset. We implement the Retrieve and Refine and Response Generation with MR approaches, all with Reddit Pre-Training, and evaluate them on Wizard of Wikipedia. The Seen test set consists of conversations on topics seen at training time, and the Unseen test set consists of conversations about new topics that were not in the training set.
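The F1 scores in the results tables compare a generated utterance with the reference utterance. This section does not spell out the formula, so the sketch below assumes the unigram-overlap F1 commonly used for these dialog benchmarks. Whitespace tokenization and lowercasing are simplifying assumptions, and `unigram_f1` is a hypothetical helper name, not the authors' evaluation code.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall (assumed metric)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both utterances.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(unigram_f1("i love the incredibles", "i love disney movies"))
```

Under this assumption, a table entry such as 25.9 would correspond to this score averaged over all test turns and multiplied by 100.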
Model | Test F1
Retrieval Baselines
Retrieval Transformer (Shuster et al., 2020) | 9.8¹
Generative Baselines
Generative Transformer MemNet (Dinan et al., 2018) | 7.1
+ Reddit Pre-Training | 12.8
Retrieve and Refine (Weston et al., 2018) | 13.6
Response Generation with MR (Qin et al., 2019) | 13.2
KIF-Augmented Transformer | 14.4
Table 2: Results on the Engaging ImageChat dataset. We implement the Generative Transformer Memory Network, Retrieve and Refine, and Response Generation with MR approaches, all with Reddit Pre-Training, and evaluate them on Engaging ImageChat.
¹ In Shuster et al. (2020), retrieval Transformer models report Hits@N using a fixed candidate set of 99 distractor candidates and 1 true candidate. We compute F1 using their open-sourced model by scoring the entire training set of over 350K utterances with the model and taking the top scoring candidate as the response.
KIF-augmented Transformers are generally more coherent and human-sounding compared to the Generative MemNet.
Comparison with existing retrieval models (shown in blue) is more nuanced. Along the lines of existing work (Zhang et al., 2018; Dinan et al., 2018), we find that retrieval-based models score very well in human evaluations that ask how human or interesting a dialog sounds. This is because retrieval models return human-written utterances from the training set and do not suffer from decoding mistakes present in generative models. For example, on Engaging ImageChat, while our model has significantly improved over the generative baseline (see green bars in Figure 2, right), it does not beat retrieval based methods in sounding more human or being more interesting (see blue bars in Figure 2, right). As the Retrieval baseline returns human-written text for other humans to evaluate, we hypothesize that humans score each other’s writing quite well. Compared with generative models, which we focus on improving, retrieval models often produce longer text with more interesting, nuanced vocabulary usage, and do not make generation mistakes such as repetition. These factors often lead to the stronger performance of retrieval models.
A surprising result is that KIF-augmented Transformers are more human sounding than
Figure 2: Human Evaluation Results on Both Datasets. More than 50% indicates the KNN Model is preferred. Stars indicate statistical significance at p < 0.05.
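The percentages in the human evaluation come from a majority vote among three evaluators per dialog pair, and stars mark significance at p < 0.05. The section does not name the significance test, so the sketch below assumes an exact two-sided binomial test against a 50% null. The function names and the toy votes are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import comb


def majority_choice(votes):
    """Label picked by most evaluators (three votes over two labels never tie)."""
    return Counter(votes).most_common(1)[0][0]


def preference_rate(all_votes, model="A"):
    """Fraction of dialog pairs where `model` wins the majority vote."""
    winners = [majority_choice(votes) for votes in all_votes]
    return sum(winner == model for winner in winners) / len(winners)


def binomial_two_sided_p(wins, n):
    """Exact two-sided binomial p-value under a 50/50 null (assumed test)."""
    smaller_tail = min(wins, n - wins)
    tail = sum(comb(n, i) for i in range(smaller_tail + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy annotations: each inner list holds three evaluators' choices.
    votes = [["A", "A", "B"], ["B", "B", "A"], ["A", "B", "A"], ["A", "A", "A"]]
    print(preference_rate(votes, model="A"))
    print(binomial_two_sided_p(wins=9, n=10) < 0.05)
```

A bar above 50% that also passes such a test would receive a star under this assumed procedure.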
retrieval models on Wizard of Wikipedia. This is because the dataset’s utterances are long and factual due to the tendency of crowdworkers to copy Wikipedia. Sometimes humans chatting with the retrieval bot would respond "uh... that’s an interesting fact?" Otherwise, our model scores similarly to retrieval models, with most evaluations not having a statistically significant difference.
We conduct a second evaluation on the Unseen Test Set of the Wizard of Wikipedia dataset. Results are shown in Figure 3. Trends are similar to the results on the Seen Test Set, though the preference for the KIF-augmented Transformer over the retrieval baseline is greater. We hypothesize that because the Unseen Test Set is on entirely held out topics, the retrieval baseline can struggle to identify relevant utterances. In contrast, the KIF-augmented Transformer, similar to the generative baseline from Dinan et al. (2018), can use the generative capability to produce utterances.
Figure 3: Human Evaluation on the Unseen Test Set of Wizard of Wikipedia. More than 50% indicates the KNN Model is preferred. Stars indicate statistical significance at p < 0.05.
Lastly, we conduct an additional study to examine the variance of the comparative dialog judgements. The evaluation study for Wizard of Wikipedia is repeated three times on different days, and evaluators who have answered on previous days are not allowed to evaluate again in any subsequent experiments. Overall, we find reasonable interannotator agreement rates, around 73% averaged across all evaluations, which is similar to the agreement rates reported in Li et al. (2019). We find there is greater variance on questions asking which dialog is more human and more interesting, most likely as different evaluators can interpret these in different ways. Further, we see that comparison with the Retrieval model has less variance compared to the Generative model, possibly because the Retrieval model’s human written text is devoid of mistakes. Overall, we find that the conclusions (and statistical significance) are stable across multiple evaluations.
6.2 Analysis of Fetched Knowledge
Example conversations from our KIF-augmented generative model are shown in Figure 4 on Wizard of Wikipedia. We find that relevant knowledge is identified that affects the content of the generated utterance. For example, the model finds knowledge sentences about Disney movies as the human conversationalist starts the conversation discussing Disney. The model leverages the fetched knowledge to write the content of the generated utterance. In a concrete example, the fetched sentence "disney announced intentions [...] after the success of the incredibles" leads the model to generate the utterance "i love the incredibles, they are my favorite disney movie."
In contrast, the model often uses the form of the fetched training utterance as a template for writing a response. For example, the model copies the training utterance "Ohhh ... what do people with color blindness do to cope with the effects?" and starts the model generation with "Ohhh ..." and continues with the question "i think toy story is a classic?" following the form of the selected training utterance.
Figure 4: Conversation between Human and KIF-Augmented Transformer on Wizard of Wikipedia. The top-scoring Wikipedia knowledge and training utterances fetched by KIF are displayed with model output.
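To make the notion of "top-scoring" fetched items concrete, here is a minimal sketch of a nearest-neighbor fetch. It rests on several assumptions that this section does not confirm: the learned mapping is a single linear map, similarity is a dot product, the search is brute force rather than an approximate index, and the fetched vectors are combined by a softmax-weighted sum. The name `knn_fetch` and all vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def knn_fetch(context_vec, mapping, knowledge_vecs, k=3):
    """Map the context into knowledge space and fetch the k best items.

    Returns the indices of the top-k items, their softmax weights,
    and the weighted sum of their vectors.
    """
    # Assumed learned mapping: one matrix applied to the context encoding.
    query = [dot(row, context_vec) for row in mapping]
    scores = [dot(query, vec) for vec in knowledge_vecs]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    dim = len(knowledge_vecs[0])
    fetched = [
        sum(w * knowledge_vecs[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]
    return top, weights, fetched


if __name__ == "__main__":
    knowledge = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a trained mapping
    print(knn_fetch([1.0, 0.0], identity, knowledge, k=2))
```

In this sketch the knowledge vectors stay fixed and only `mapping` would be trained, which mirrors the description of KIF as learning a read operation over fixed external knowledge.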
Figure 5 displays the top-3 fetched training set utterances and knowledge sentences on the Wizard of Wikipedia dataset when responding to a human utterance. KIF modules can identify multiple relevant items. In response to the human question about blue skies the 1946 movie, the model identifies both the comedy film and the band.
Finally, the elements retrieved by KIF modules provide a more interpretable understanding of what the model is conditioning upon to generate a dialog response. In Table 3, we show, for the same dialog history, the effect of replacing the model’s fetched training utterance and knowledge sentence with our own examples. The model heavily incorporates our manual changes of the fetched information into the generated utterance. For example, changing the knowledge directly affects what the model generates as the favorite character (from buzz lightyear to mr potato head to slinky dog), while changing the fetched training utterance changes the form of the generated sentence.
6.3 Scaling KIF to Challenging Retrieval Settings
KIF modules can be used in more realistic and challenging settings for knowledge retrieval that test the scalability of the module. In Figure 6(a), we compare the Generative Transformer MemNet Baseline with KIF-Augmented Transformers in three settings. The first is the standard Wikipedia sentences provided by the dataset (average 34 sentences). Then, we extend to providing the model with the full Wikipedia article (on average, 57 sentences) and finally to multiple Wikipedia articles (on average, totaling 205 sentences), identified using the conversation’s topic. This increasing size of available knowledge could be realistic for settings where it is unclear what information is most relevant, if filtering steps to preprocess the data remove potentially relevant information, or if information synthesis from multiple knowledge sources is necessary to produce a high-quality generation. As the Wikipedia knowledge becomes more difficult to identify, performance decreases, but still outperforms the baseline that uses the dataset-provided set of 34 sentences.
Comparing the scaling capability of KIF to the standard Generative Transformer MemNet Baseline highlights the advantage of using KNN. The attention-based mechanism used in Dinan et al. (2018) struggles to identify salient information
Figure 5: Examples of Top-3 Fetched Training Utterances and Fetched Knowledge when responding to a human chat from the dataset using a trained Wizard of Wikipedia model. Examples are taken from validation.
when given increasingly larger quantities of knowledge, unlike the KNN information fetch. We hypothesize the attention mechanism is challenged by softmax-ing over a larger quantity of inputs, as it can be difficult to make sharp distinctions.
6.4 Ablations
Importance of Multiple Knowledge Sources. One benefit of the KIF module approach is that several modules can be combined, each capturing information from a different source. In both settings, Wizard of Wikipedia and Engaging ImageChat, two modules were used to incorporate multiple forms of knowledge: training utterances to capture the capability of a retrieval-based model, and knowledge from Wikipedia or related chats based on image features. We perform here an ablation study to evaluate the impact of using only one source of information. As can be seen in Table 4, performance decreases when only one source of information is used.
For Engaging ImageChat, this study also underlines the importance of being able to fetch in a multimodal fashion. The general form of the KIF module, which requires only a feature vector to find nearest neighbors from, allows fetching on multiple modalities such as text and images. In Table 4, using the Image-based KIF to fetch text from Related Images is important to reach the strongest performance (compare Training Utterances Only, which uses text-based KIF, with using both Training Utterances and Related Images).
Using Dialog Features for KNN Performance. The quality of the KNN search is critical to the performance of KIF modules. As the external knowledge is kept fixed, KIF must be able to align the dialog context with the knowledge to identify relevant pieces of information. In Table 5, we show that matching on more features can improve the quality of the retrieved information. Using only the encoding of the immediate previous utterance can improve results on Wizard of Wikipedia by 7 F1 points, but this is further improved by also leveraging the encoding of context (+1.8 F1) and using the dialog turn number (+1 F1). These features are available in the datasets, and
Knowledge | Training Utterance | Generation
Top section (fixed training utterance, varying knowledge)
buzz lightyear’s name is in honor of astronaut edwin ‘buzz’ aldrin | my favorite character in that book series is hermione granger | cool! my favorite character in that movie is buzz lightyear
mr potato head is based on the real-life mr. potato head toy | my favorite character in that book series is hermione granger | my favorite character in that movie is real-life mr potato head
slinky dog is a toy dachschund with a metal slinky for a body | my favorite character in that book series is hermione granger | cool! my favorite character is the slinky dog
Bottom section (fixed knowledge, varying training utterance)
slinky dog is a toy dachschund with a metal slinky for a body | i really like the character hermione granger | cool! i really like slinky dog
slinky dog is a toy dachschund with a metal slinky for a body | my favorite character of all time has to be hermione granger | i love that movie, my favorite character has to be slinky dog the dachshund
slinky dog is a toy dachschund with a metal slinky for a body | i agree with you! that’s my favorite character as well | i think so too! my favorite is slinky
Table 3: Effect of Fetched Information on Generated Utterances. The top section provides examples for a fixed training utterance, changing the knowledge: the generated text maintains the construction of the training utterance but changes the favorite character to match the knowledge. The bottom section provides examples for fixed knowledge but changing the training utterance: the generated text modifies its form to match the training utterance, but the favorite character information remains consistent.
Figure 6: Ablations on Wizard of Wikipedia. (a) KIF can scale to hundreds of relevant sentences (blue) while the baseline model, the Generative Transformer MemNet (gray), scales poorly. (b) Gating can remove irrelevant information. In the 3 Sources case, one source of external information is unrelated. (c) Performance as k varies.
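Panels (a) and (c) both involve weighting a growing number of candidate sentences. One way to see why a softmax over many inputs struggles to make sharp distinctions is the toy calculation below. It is an illustration under stated assumptions, not a measurement from the paper: one relevant item receives a fixed score advantage of 2.0 (an arbitrary choice) and every other item receives the same lower score. Only the set sizes 34, 57 and 205 come from the text; `relevant_weight` is a hypothetical helper.

```python
import math


def relevant_weight(num_items, relevant_score=2.0, distractor_score=0.0):
    """Softmax weight of one relevant item among equally scored distractors."""
    relevant = math.exp(relevant_score)
    distractors = (num_items - 1) * math.exp(distractor_score)
    return relevant / (relevant + distractors)


if __name__ == "__main__":
    # Average knowledge set sizes mentioned in Section 6.3.
    for size in (34, 57, 205):
        print(size, round(relevant_weight(size), 3))
```

With the score gap held constant, the weight on the relevant item shrinks as the candidate set grows, so the weighted sum is increasingly dominated by distractors. Restricting the sum to a small number of nearest neighbors keeps the candidate set small regardless of how much knowledge is available.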
we leverage them to improve the relatedness of retrieved knowledge.
Multi-Hop Retrieval with KIF. Work in memory networks (Weston et al., 2015; Sukhbaatar et al., 2015) utilized multi-hop mechanisms. Such capacity could be useful when multiple sources are necessary or information is incrementally fetched. To emulate multi-hop memory mechanisms, we use KIF to retrieve relevant information for N = 2 or N = 3 fixed hops. As the number of hops is fixed, the multi-hop operation remains differentiable. We do not allow the model to retrieve the same information in a second hop.
We experimented in two settings. First, the same KIF module is used multiple times to fetch different information, and then all of the fetched knowledge is concatenated. Results are shown in Table 6 (top). Second, we examine spreading the fetches into different KIF modules at various
encoder depths. This could be interpreted as the model learning to access more information each layer. As the model progresses deeper, more abstract and high level representations are built, which could allow different knowledge to be retrieved. Results are shown in Table 6 (bottom).
In both multi-hop settings, no improvement in performance on the Wizard of Wikipedia dataset is observed. We hypothesize that this can be partially attributed to the construction of the dataset, as humans explicitly based their written dialog utterance on one knowledge sentence. Further, it is possible that concatenation brings together too much information for the model to incorporate, and thus adding additional fetches makes the retrieval more noisy.
Model | Test F1
Wizard of Wikipedia
Training Utterances Only | 18.1
Wiki Knowledge Only | 23.9
Training Utterances and Wiki Knowledge | 25.9
Engaging ImageChat
Training Utterances Only | 13.9
Related Images Only | 13.8
Training Utterances and Related Images | 14.4
Table 4: Using Multiple KIF Modules on Multiple Sources is important for improved performance.
Model | Valid F1
Wizard of Wikipedia
Previous Utterance Only | 24.6
+ Dialog Context | 26.4
+ Turn Embedding | 27.4
Engaging ImageChat
Previous Utterance Only | 13.3
+ Dialog Context | 14.5
+ Turn Embedding + Personality | 15.1
Table 5: Important Features for KNN Search using KIF. Salient conversation features improve performance on both datasets.
Model | Valid F1
KIF-Augmented Transformer | 27.4
One KIF Module fetches multiple times
2 Fetches | 26.9
3 Fetches | 26.0
Multiple KIF Modules fetch once each
2 Fetches | 26.5
3 Fetches | 25.9
Table 6: Multi-hop with KIF to retrieve information with multiple fetch steps.
Effect of Gating. We analyze the effect of the gating mechanism by evaluating the capability of the gate to identify and focus on salient information. On Wizard of Wikipedia, we concatenate a third source of information: dialog turns from a completely different corpus called PersonaChat (Zhang et al., 2018). This dataset looks quite different (short utterances without factual knowledge) and should be easy for the model to identify as distinct from Wizard of Wikipedia. As shown in Figure 6(b), if KIF on PersonaChat is included without gating, it has a harmful effect as the model includes irrelevant information. When equipped with gating, the model learns to use the gate to ignore some inputs, and can recover almost the full performance of a model without this irrelevant information source.
Size of K in KNN. Figure 6(c) shows the performance on Wizard of Wikipedia when varying the amount of knowledge. Being able to access multiple relevant pieces of information is helpful, but too much information can be harmful. This is likely because the weighted sum becomes blurry if too many sentences are incorporated.
7 Conclusion
We present a KNN-based Information Fetching module that learns to identify relevant information from external knowledge sources by learning a mapping-based read operation. KIF modules benefit from the scalability and efficiency of KNN search, enabling computation with large external memories. We show in the context of two dialog datasets that relevant knowledge can be identified and incorporated to create more engaging, high-quality dialog.
Acknowledgments
We thank the reviewers and action editor for their comments and insightful discussion. We thank Emily Dinan and Kurt Shuster for providing assistance to reproduce their original works.
References
Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xiao-jiang Liu, and Shuming Shi. 2019. Retrieval-guided dialogue response generation via a matching-to-generation framework. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1866–1875. DOI: https://doi.org/10.18653/v1/D19-1195
Sarath Chandar, Sungjin Ahn, Hugo Larochelle, Pascal Vincent, Gerald Tesauro, and Yoshua Bengio. 2016. Hierarchical memory networks. CoRR, abs/1605.07427.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879. DOI: https://doi.org/10.18653/v1/P17-1171, PMCID: PMC5579958
Wenlin Chen, David Grangier, and Michael Auli. 2016. Strategies for training large vocabulary neural language models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1975–1985. DOI: https://doi.org/10.18653/v1/P16-1186
Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.
Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. 2019. Using local knowledge graph construction to scale seq2seq models to multi-document inputs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4177–4187.
Angela Fan, David Grangier, and Michael Auli. 2018a. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54.
Angela Fan, Mike Lewis, and Yann Dauphin. 2018b. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.
Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2017a. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1302–1310.
Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017b. Improving neural language models with a continuous cache. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. Retrieval augmented language model pre-training. In Proceedings of the International Conference on Machine Learning, pages 5695–5704.
Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. In International Conference on Learning Representations.
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data. DOI: https://doi.org/10.1109/TBDATA.2019.2921572
Armand Joulin and Tomas Mikolov. 2015. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances