Relational Memory Augmented Language Models

Qi Liu2∗, Dani Yogatama1, and Phil Blunsom1,2
1 DeepMind, 2 University of Oxford
{qi.liu,phil.blunsom}@cs.ox.ac.uk
dyogatama@deepmind.com
arXiv:2201.09680v1 [cs.CL] 24 Jan 2022

Abstract

We present a memory-augmented approach to condition an autoregressive language model on a knowledge graph. We represent the graph as a collection of relation triples and retrieve relevant relations for a given context to improve text generation. Experiments on the WikiText-103, WMT19, and enwik8 English datasets demonstrate that our approach produces a better language model in terms of perplexity and bits per character. We also show that relational memory improves coherence, is complementary to token-based memory, and enables causal interventions. Our model provides a simple yet effective way to combine an autoregressive language model and a knowledge graph for more coherent and logical generation.

1 Introduction

A core function of language is to communicate propositions (e.g., who did what to whom). As such, language models need to be able to generate this information reliably and coherently. Existing language models (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020) do not have explicit representations for such information and rely on it being implicitly encoded in their parameters (Liu et al., 2019; Petroni et al., 2019; Wang et al., 2020). This encoding mechanism makes it difficult to interpret what the language models know and often leads to generating illogical and contradictory content. For example, Logan et al. (2019) observe that existing language models rely heavily on word correlation and fall short of logical reasoning. This causes the model to hallucinate—e.g. that Barack Obama’s wife is Hillary Clinton, based on the high co-occurrence of the two entities. In another example, Lake and Murphy (2020) notice that GPT-2 (Radford et al., 2019) states that unicorns have four horns directly after stating that unicorns have one horn.

In this work, we explore ways to combine an autoregressive language model with a knowledge graph. We design a memory-augmented architecture that stores relations from a knowledge graph and investigate the effect of conditioning on this relational memory in an autoregressive language model. In contrast to existing token-based memory-augmented language models that store context-target pairs (Khandelwal et al., 2020b; Yogatama et al., 2021), our memory stores relation triples (head entity, relation, tail entity). Relation triples form the basis of knowledge bases, empowering a wide range of applications such as question answering (Yasunaga et al., 2021), machine reading (Yang and Mitchell, 2019), and reasoning (Minervini et al., 2020). From a cognitive science perspective, we can consider the neural language model to be an instance of System 1, which performs fast inference, and the symbolic relational memory to be a world model that supports the slow and logical reasoning of System 2 (Kahneman, 2011).1 We hypothesise that relational memory can improve the performance and coherence of an autoregressive language model.

Given an observed context, we first run an entity tagger to identify entities in the context. We then use tf-idf (Ramos et al., 2003) to select salient entities. We retrieve relations (from a knowledge base) for the selected entities and design a gating function that allows the language model to adaptively combine information from extracted relations and the observed textual context to predict the next token. Existing knowledge bases such as Freebase and Wikidata can be used as a source of information to retrieve relations from. However,

∗ Work completed during an internship at DeepMind.
1 This view is also advocated in a parallel work by Nye et al. (2021), which presents a model for story generation and instruction following.
they are often incomplete and do not contain relations that are suitable for the particular dataset that we want to work with. Instead of using these predefined knowledge bases, we choose to perform open information extraction (OpenIE) on each language modelling dataset to get relations. As a result, our model is able to move beyond simple co-occurrence statistics and generate text that is more grounded on real-world relations observed in a particular corpus.

Our main contributions are as follows:

• We evaluate the model on three English language modelling datasets. We show that our model outperforms a strong transformer-XL baseline (Dai et al., 2019) on both word-level (WikiText-103 and WMT19) and character-level (enwik8) language modelling in terms of perplexity and bits per character respectively (§3.3).

• We conduct comprehensive ablation and design choice studies to understand the contributions of different components of our models (§4.1).

• We measure coherence with human evaluation and two automatic metrics (knowledge perplexity and knowledge F1) and demonstrate that relational memory improves coherence (§4.2).

• We study the relationship between our method and a typical memory-augmented language model which stores word tokens in its memory (Yogatama et al., 2021). We show that relational memory is complementary to token-based memory and combining them improves performance further (§3.3).

• We perform qualitative analysis by examining gate values and retrieved relations. In line with our main motivation, we find that the relational memory is particularly useful for predicting entities. Further, we demonstrate that such explicit propositional representations allow causal interventions and increase the interpretability of language models (§4.3).

2 Model

An autoregressive language model defines the probability of a sequence of tokens p(x) = p(x_1, ..., x_T). It is common to factorise this joint probability as a product of conditional probabilities with the chain rule (Jelinek, 1980; Bengio et al., 2003):

    p(x_1, ..., x_T) = ∏_{t=1}^{T} p(x_t | x_0, ..., x_{t−1}),    (1)

where x_0 is a special start token.

Our language model is based on transformer-XL (§2.1), which is augmented with a relational memory (§2.2). We discuss them in detail below.

2.1 Transformer-XL

We use transformer-XL (Dai et al., 2019)—which is based on the transformer (Vaswani et al., 2017)—to parametrise the conditional probabilities in Eq. 1. The transformer stacks multiple self-attention layers to obtain contextualised representations.

Language modelling datasets usually consist of articles of different lengths. It is impractical to apply a transformer to encode long articles, as its computational complexity is quadratic in the sequence length. In practice, each article is usually truncated into fixed-length text segments {x_{t−N+1}, ..., x_t} of length N to train and evaluate the model. However, this approximation prevents the transformer from capturing long-term dependency beyond text segments. Transformer-XL reuses hidden states from previous text segments to extend the context window.

More specifically, denote the hidden state of x_t at layer ℓ as h_t^ℓ. Given a text segment {x_{t−N+1}, ..., x_t} and its extended context {x_{t−N−M+1}, ..., x_{t−N}} of length M, both the hidden states of the text segment {h_{t−N+1}^ℓ, ..., h_t^ℓ} and the hidden states of the extended context {h_{t−N−M+1}^ℓ, ..., h_{t−N}^ℓ} are used. When performing self-attention, each token in the text segment can attend to the preceding tokens in the text segment and all the tokens in the extended context, enabling longer-term dependency compared to a vanilla transformer. Importantly, transformer-XL does not backpropagate through the hidden states of the extended context during training (by adding stop-gradient operators to all the hidden states in the extended context).
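To make the caching mechanism concrete, the following JAX sketch (the implementation section notes that the models are written in JAX) assembles one layer's attention inputs from the current segment and a stop-gradient copy of the cached states; it omits relative positional encodings and is an illustration rather than the authors' implementation.

```python
import jax
import jax.numpy as jnp

def xl_attention_inputs(h_segment: jnp.ndarray, h_memory: jnp.ndarray):
    """Build attention inputs for one transformer-XL layer.

    h_segment: [N, d] hidden states of the current text segment (receives gradients).
    h_memory:  [M, d] cached hidden states of the extended context.
    Queries come from the segment only; keys/values cover memory + segment.
    """
    h_memory = jax.lax.stop_gradient(h_memory)                    # no backprop into the cache
    keys_values = jnp.concatenate([h_memory, h_segment], axis=0)  # [M + N, d]
    queries = h_segment                                           # [N, d]
    return queries, keys_values
```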

2.2 Relational Memory

In this section, we first introduce how we obtain relation triples using OpenIE (§2.2.1). We then use tf-idf to score entities in the observed context and retrieve relation triples related to these entities (§2.2.2) to construct the relational memory. Finally, we show an integrated architecture that allows transformer-XL to incorporate the relational memory for predicting the next token (§2.2.3). We show our architecture in Figure 1. The pseudocode for training or evaluating with the relational memory is shown in Algorithm 1. In the pseudocode, we use TRAIN(xc, M) and EVAL(xc, M) to refer to training with the cross-entropy loss and evaluating (e.g. calculating perplexity) on the text segment xc conditioned on the relational memory M, respectively.

Figure 1: We identify salient entities in the previous text segment and extract relations to build our relational memory. We encode each relation with an LSTM encoder, aggregate the resulting representations into a vector, and use a gate mechanism that allows our language model to adaptively take advantage of relational information for predicting the next token.

Algorithm 1 Train/Eval w/ Relational Memory
 1: procedure TRAIN/EVAL SPLIT(S)
 2:   for each article A in S do
 3:     Initialise M to empty
 4:     for each text segment xc in A do
 5:       if S is train set then
 6:         TRAIN(xc, M)
 7:       else
 8:         EVAL(xc, M)
 9:         Run dynamic OpenIE on xc
10:       end if
11:       Perform relation retrieval with xc
12:       Update M with retrieved triples
13:     end for
14:   end for
15: end procedure
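The same loop can be written as a short Python skeleton; the callables below (train_step, eval_step, dynamic_openie, retrieve_relations, update_memory) are placeholders for the components described in the rest of this section, not functions from the authors' codebase.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Triple = Tuple[str, str, str]

def run_split(
    articles: Iterable[Sequence[list]],   # each article is a sequence of text segments
    is_train: bool,
    train_step: Callable,                 # cross-entropy update on a segment
    eval_step: Callable,                  # e.g. accumulate log-likelihood / perplexity
    dynamic_openie: Callable,             # extract triples from already-seen text
    retrieve_relations: Callable,         # entity scoring + one-hop triple lookup
    update_memory: Callable,              # FIFO insertion with capacity P
) -> None:
    """Python rendering of Algorithm 1 (a sketch with placeholder callables)."""
    for article in articles:
        memory: List[Triple] = []         # relational memory M, reset for each article
        for segment in article:
            if is_train:
                train_step(segment, memory)
            else:
                eval_step(segment, memory)
                dynamic_openie(segment)   # only previously seen text is mined
            triples = retrieve_relations(segment)
            memory = update_memory(memory, triples)
```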

2.2.1 Open Information Extraction

A key challenge of utilising relational information for language modelling is obtaining high-quality relation triples. There are several well-established knowledge bases, such as Freebase (Bollacker et al., 2007) and YAGO (Rebele et al., 2016). However, existing knowledge bases suffer from missing relations and often do not contain relation triples related to observed contexts in a target corpus, even though research on knowledge base completion has resulted in significant advances (Bordes et al., 2013; Trouillon et al., 2016; Zhang et al., 2019).

In this work, we use OpenIE (Angeli et al., 2015; Etzioni et al., 2008) to obtain relation triples. Since OpenIE directly extracts relation triples from each dataset D, it provides a structured way to represent knowledge in D.2 Specifically, we perform OpenIE on the training set of D. Given an entity e, we retrieve a set of relation triples Re = {r_1, ..., r_O}, where e is either the head entity or the tail entity in these relation triples. Conceptually, Re consists of all the relation triples from the one-hop subgraph centred at the entity e in the knowledge graph constructed from D. Therefore, Re can provide “global” information about the entity.

Dynamic OpenIE. Dynamic OpenIE takes advantage of the autoregressive nature of language modelling, where text segments are sequentially processed. In addition to extracting relations from the training set of D, we can also extract relations from previously seen text segments of our evaluation set. We refer to this extraction mechanism as dynamic OpenIE. After a text segment {x_{t−N+1}, ..., x_t} has been evaluated, e.g. after calculating perplexity on this text segment, we perform OpenIE on it to obtain new relation triples to be added to our knowledge graph. Note that we only perform OpenIE on previously seen text segments and do not use unseen text. We expect that the relation triples extracted from seen text segments are potentially useful for predicting the next tokens. This extraction mechanism does not violate the autoregressive nature of language modelling. Metrics such as perplexity and bits per character are calculated as usual. The idea of using seen text segments during evaluation to improve language modelling is related to dynamic evaluation (Krause et al., 2018, 2019). In dynamic evaluation, the model is adapted based on recent history during evaluation via gradient descent so that it can assign higher probabilities to re-occurring patterns. In contrast to dynamic evaluation, we do not update model parameters and only extract new relations from seen text segments to enrich our corpus-specific knowledge graph.

Mismatch between training and evaluation. As shown in Algorithm 1, we do not use dynamic OpenIE during training due to its additional efficiency overhead (see the speed comparison in §4.1); this results in a mismatch between training and evaluation. We extract all the relation triples from the training set of each dataset D before training on D. As a result, during training we may retrieve relation triples extracted from unseen text of the training set when performing relation retrieval (§2.2.2). We do not suffer from this issue during evaluation, as we extract relations from previously seen text of our evaluation set. We believe this mismatch is minor given the superior performance of our model in the experiments.

2 We provide a comparison of using relations extracted from OpenIE and Freebase in §4.1.
2.2.2 Relation Retrieval

Given a knowledge graph (represented as a collection of triples), an ideal relational memory consists of a set of triples that are relevant to the observed context. There are many choices to measure the relatedness between the observed context and relation triples in our knowledge graph, e.g. based on keyword search or dense retrieval (Karpukhin et al., 2020; Guu et al., 2020; Yogatama et al., 2021).

In this work, we use keyword search due to its simplicity and leave methods based on dense retrieval to future work. Specifically, given the observed context, we perform entity recognition (Ratinov and Roth, 2009; Nadeau and Sekine, 2007) on this context and score the tagged entities with tf-idf (Ramos et al., 2003). The top-K scored entities (K is set to 5 in our experiments) are used to retrieve relations {Re_1, ..., Re_K}. These retrieved relations are used to construct the relational memory M. Note that the entities are selected from the observed context, so unseen text is not used. We limit the capacity of M to P. If the number of newly retrieved triples is larger than P, we randomly drop relations and only select P of them to be inserted into M. Otherwise, the relational memory operates on a first-in-first-out principle: when M is full, older retrieved relations are overwritten by newly retrieved relations. The relational memory is re-initialised to empty when an article ends.

As shown in Algorithm 1, since we update M only after processing an entire text segment, all the tokens in the same text segment are conditioned on the same relational memory. This approach is more efficient than updating M each time a new entity is encountered and is more amenable to batch training.
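A minimal sketch of this retrieval-and-update step is given below; the tf-idf weighting, top-K selection, and FIFO policy follow the description above, while the entity tagger and multi-word entity counting are simplified assumptions (K = 5 and P = 300 are the WikiText-103 settings).

```python
import collections
import math
import random
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]

def tfidf_scores(entities: Sequence[str], segment_tokens: Sequence[str],
                 doc_freq: Dict[str, int], num_docs: int) -> Dict[str, float]:
    """Score tagged entities with a simple tf-idf (single-token counting only)."""
    counts = collections.Counter(segment_tokens)
    return {e: counts[e] * math.log(num_docs / (1 + doc_freq.get(e, 0))) for e in entities}

def update_relational_memory(memory: List[Triple], kg: Dict[str, List[Triple]],
                             entities: Sequence[str], scores: Dict[str, float],
                             top_k: int = 5, capacity: int = 300) -> List[Triple]:
    """Retrieve one-hop triples for the top-K scored entities and insert them
    into a first-in-first-out memory holding at most `capacity` triples."""
    top_entities = sorted(entities, key=lambda e: scores.get(e, 0.0), reverse=True)[:top_k]
    new_triples = [t for e in top_entities for t in kg.get(e, [])]
    if len(new_triples) > capacity:                # oversized retrieval: random subsample
        new_triples = random.sample(new_triples, capacity)
    memory = memory + new_triples
    return memory[-capacity:]                      # keep the most recently inserted triples
```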
2.2.3 Integration with Transformer-XL

We now show how we can integrate relational memory with transformer-XL. We refer to our model as RELATION LM.

Relation triple encoding. We first discuss how we encode relation triples in the relational memory M. We treat relation triples as text and serialise each relation triple into a sequence, e.g. (Barack Obama, president of, United States) is converted into the sequence “Barack Obama, president of, United States”. This sequential representation can capture the order of head entities and tail entities and is also adopted by KG-BERT (Yao et al., 2019) and Kepler (Wang et al., 2021b). Since each example in a batch corresponds to P retrieved relations, we obtain B · P relation sequences for each batch, where B and P denote the batch size and relational memory length, respectively. With relational memories on the order of hundreds of triples, this prevents us from using large models (e.g. a multi-layer transformer) to encode these sequences due to memory constraints. In our preliminary experiments, we compare LSTM (Hochreiter and Schmidhuber, 1997), GRU (Cho et al., 2014) and a one-layer transformer and find that LSTM performs marginally better. Therefore, for each relation triple r_p, we reuse the transformer-XL word embedding matrix W_e to map each token in the sequence to its embedding vector. We then run the LSTM to encode the sequence and use the hidden representation of the last token as the relation representation r_p.

There are other approaches to encode relation triples, e.g. embedding-based (Bordes et al., 2013; Trouillon et al., 2016) and graph-based (Schlichtkrull et al., 2018; Zhang and Chen, 2018) methods. We leave a comparison of these approaches to future work.
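With Haiku (which the implementation section says the models are built with), the triple encoder can be sketched as below; this is illustrative only, assumes pre-padded token id arrays, ignores padding masks, and must be called inside hk.transform.

```python
import haiku as hk
import jax.numpy as jnp

def encode_triples(token_ids: jnp.ndarray, embedding: jnp.ndarray) -> jnp.ndarray:
    """Encode serialised relation triples with an LSTM, keeping the last hidden
    state as the triple representation r_p.

    token_ids: [P, S] int32, one row per serialised triple (padded to length S).
    embedding: [V, d] word embedding matrix shared with transformer-XL.
    Returns:   [P, d], one vector per triple.
    """
    embedded = embedding[token_ids]                            # [P, S, d]
    core = hk.LSTM(hidden_size=embedded.shape[-1])
    initial_state = core.initial_state(batch_size=embedded.shape[0])
    # hk.dynamic_unroll expects time-major inputs: [S, P, d].
    outputs, _ = hk.dynamic_unroll(core, jnp.swapaxes(embedded, 0, 1), initial_state)
    return outputs[-1]                                         # hidden state at the last token
```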
Integration. Given a text segment xc = {x_{t−N+1}, ..., x_t}, after L self-attention layers with transformer-XL, we obtain contextualised representations {h^L_{t−N+1}, ..., h^L_t}. At each timestep t, we use its hidden representation h^L_t as the query vector to attend over the P encoded contents of M, i.e., {r_1, ..., r_P}. We use standard scaled dot-product attention (Vaswani et al., 2017) to aggregate all triples into a single vector:

    m_t = ∑_{p=1}^{P} [ exp(h^L_t · r_p / √d) / ∑_{j=1}^{P} exp(h^L_t · r_j / √d) ] r_p,

where d denotes the hidden size of our transformer-XL. Finally, we combine m_t and the transformer-XL representation h^L_t via a gate:

    g_t = σ(W_g [h^L_t, m_t])
    z_t = g_t ⊙ h^L_t + (1 − g_t) ⊙ m_t
    p(x_{t+1} | x_{≤t}) = softmax(W_e z_t),

where σ is the sigmoid function, [·, ·] denotes concatenation of two vectors, ⊙ is element-wise multiplication, and W_e is the embedding matrix shared by both input and output embeddings (Inan et al., 2016). The only new parameters introduced by our method are the LSTM relation encoder and the gate matrix W_g. This gating mechanism allows our model to adaptively take advantage of relational information for predicting the next token.
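In code, the aggregation and the gate amount to a softmax-weighted sum over the encoded triples followed by a sigmoid mixture; the JAX sketch below transcribes the equations for a single position with illustrative shapes and is not the authors' implementation.

```python
import jax
import jax.numpy as jnp

def relational_gate(h_t: jnp.ndarray, relations: jnp.ndarray,
                    W_g: jnp.ndarray, W_e: jnp.ndarray) -> jnp.ndarray:
    """Combine the transformer-XL state with the relational memory.

    h_t:       [d]     transformer-XL output at the current position.
    relations: [P, d]  encoded relation triples r_1, ..., r_P.
    W_g:       [d, 2d] gate parameters; W_e: [V, d] shared embedding matrix.
    Returns the next-token distribution p(x_{t+1} | x_{<=t}).
    """
    d = h_t.shape[-1]
    scores = relations @ h_t / jnp.sqrt(d)                     # [P] scaled dot products
    m_t = jax.nn.softmax(scores) @ relations                   # [d] aggregated triple vector
    g_t = jax.nn.sigmoid(W_g @ jnp.concatenate([h_t, m_t]))    # [d] gate
    z_t = g_t * h_t + (1.0 - g_t) * m_t                        # element-wise mixture
    return jax.nn.softmax(W_e @ z_t)                           # [V] softmax over vocabulary
```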
3 Experiments

Our experiments seek to evaluate the effect of augmenting language models with a relational memory. We introduce the datasets used for evaluation (§3.1), discuss implementation details (§3.2), and present our main results (§3.3). We show ablation studies and further analysis of our model in §4.

3.1 Datasets and OpenIE

We use three English language modelling datasets: WikiText-103 (Merity et al., 2017), WMT19 (Barrault et al., 2019), and enwik8 (Hutter, 2012). Descriptive statistics of these datasets are shown in Table 1. WikiText-103 and WMT19 are (sub)word-level datasets, while enwik8 is a character-level dataset.

Dataset    # Train   # Valid   # Test   # Articles   # Vocab    # Entities   # Relations   # Relations/Entity
WikiText   103M      0.2M      0.2M     28,595       267,735    980K         8.9M          9.03
WMT19      151M      0.3M      0.3M     169,180      50,259     976K         7.8M          7.97
enwik8     94M       5M        5M       12,350       256        361K         2.4M          6.66

Table 1: Statistics of datasets used in our experiments. For each subset, we show the number of (sub)words for WikiText-103 and WMT19 or the number of characters for enwik8.

WikiText-103 is a knowledge-driven dataset consisting of featured articles from English Wikipedia. WMT19 contains English news from the WMT19 workshop.3 The news is segmented by month. We use the news from January to October for training, and the news from November and December for development and test respectively. Compared to Wikipedia articles, news contains more dynamic and temporal information, exposing new challenges for utilising relational information. We reuse the vocabulary of GPT-2 (Radford et al., 2019) with 50,259 tokens to tokenise this dataset. enwik8 contains more than 100M bytes of Wikipedia text. Character-level language modelling has a much smaller vocabulary size than (sub)word-level language modelling.

We perform OpenIE on each dataset. For enwik8, OpenIE is performed after detokenising its text into words. Statistics of the extracted relations are also included in Table 1. Each entity from WikiText-103, WMT19 and enwik8 has 9.03, 7.97 and 6.66 relation triples on average, respectively.

3.2 Implementation Details

All models are implemented with JAX4 (Bradbury et al., 2018) and Haiku5 (Hennigan et al., 2020). We set the hidden size to 512 and the number of layers to 16 for all models. In (sub)word-level language modelling, we use adaptive softmax (Grave et al., 2017) for efficiency. We use GELU (Hendrycks and Gimpel, 2016) as our activation function and Adam (Kingma and Ba, 2015) as the optimizer. For training, we use batch size 128 and train the models on 64 16GB TPUs. We apply 4,000 warmup steps before using cosine annealing to decay the learning rate. Dropout (Srivastava et al., 2014) is applied during training with a rate of 0.25.

We set the lengths of the text segment N, extended context M, and relational memory P to (512, 512, 300), (384, 384, 800) and (768, 1536, 400) for WikiText-103, WMT19 and enwik8, respectively. These are determined by grid searches on the development sets.

3.3 Main Results

We compare with a strong transformer-XL baseline trained under the same setting as our model. Our main results are shown in Table 2. We make three observations when comparing transformer-XL and RELATION LM. First, RELATION LM consistently outperforms transformer-XL on all three datasets, demonstrating the effectiveness of relational memory. Note that a decrease of 0.01 is considerable on enwik8 with the bits per character metric. Second, relational memory not only improves language modelling on knowledge-driven articles (WikiText-103), but also generalises to the challenging news domain (WMT19), where information is more dynamic and temporal. Last, the results indicate that relational memory improves both (sub)word-level and character-level language modelling.

           Model                   # Params   Dev    Test
WikiText   Transformer-XL          122M       19.0   19.9
           RELATION LM             124M       18.5   19.2
           SPALM                   122M       18.1   19.0
           SPALM + RELATION LM     124M       17.7   18.6
WMT19      Transformer-XL          114M       21.7   21.5
           RELATION LM             116M       21.0   20.7
           SPALM                   114M       20.4   20.3
           SPALM + RELATION LM     116M       19.8   19.6
enwik8     Transformer-XL          93M        1.05   1.03
           RELATION LM             95M        1.04   1.02
           SPALM                   93M        1.04   1.02
           SPALM + RELATION LM     95M        1.03   1.01

Table 2: We use perplexity (↓) on WikiText-103 and WMT19 and bits per character (↓) on enwik8 for evaluation.

Complementarity to SPALM. SPALM (Yogatama et al., 2021) is a state-of-the-art memory-augmented language model. Instead of retrieving relation triples, it retrieves a set of related tokens at each timestep. Specifically, it first stores (context, next token) pairs from the training data. It then uses a pre-trained transformer language model to measure the similarities between the stored contexts and the observed context during training/evaluation. The next tokens of similar contexts are retrieved and are integrated with the observed context via a gating mechanism for generation.

We investigate whether RELATION LM is complementary to SPALM. Since SPALM also uses a gating mechanism for integrating the retrieved tokens, we first apply RELATION LM to combine the transformer-XL output h^L_t with relational information to obtain z_t (as shown in §2.2.3), before using SPALM to integrate z_t with the retrieved tokens. The results are shown in Table 2. SPALM outperforms transformer-XL and even performs comparably to or better than RELATION LM on the three datasets, demonstrating the effectiveness of retrieving related tokens. However, integrating RELATION LM and SPALM further improves performance, indicating that these two models are not mutually exclusive. Therefore, retrieving relation triples brings complementary benefits to retrieving tokens.

4 Analysis

In this section, we study several design choices of relational memory, including its knowledge source, input components, capacity, dynamic OpenIE, and entity scoring method, and we compare training and evaluation speed. We then show quantitative and qualitative analysis results to better understand our model.

3 http://www.statmt.org/wmt19/
4 https://github.com/google/jax
5 https://github.com/deepmind/dm-haiku
4.1 Ablations and Design Choice Studies

For these ablation studies, we use the development set of WikiText-103.

Source of relation triples. We compare relation triples extracted from Freebase with those obtained using OpenIE. In the Freebase case, we use the Freebase API6 to obtain relation triples for each entity. For WikiText-103, there are 10.74 relations per entity on average, which is comparable to OpenIE relations (9.03 relations/entity). The results are shown in Table 3. Although Freebase relations have been observed to improve performance on smaller datasets (e.g. WikiText-2; Logan et al., 2019) and particular domains (e.g. movies and actors; Ahn et al., 2016), we find that RELATION LM with Freebase relations does not improve over transformer-XL on the much larger WikiText-103 dataset. We observe that a large portion of Freebase relations comes from the infoboxes of Wikipedia pages, which only cover information such as occupation, birth place, and religion. We believe these triples are too general to be useful for most contexts. The result of RELATION LM with OpenIE shows the advantage of extracting relations from each dataset compared to using Freebase relations.

Model                      Dev
Transformer-XL             19.0
RELATION LM + Freebase     19.0
RELATION LM + OpenIE       18.5

Table 3: RELATION LM with OpenIE or Freebase triples.

Ablating relation triples. We ablate the relation and/or tail entity from a relation triple (head entity, relation, tail entity) to study the contribution brought by each component. The results are shown in Table 4. We find that ablating both the relation and the tail entity performs comparably to transformer-XL. As head entities are extracted from the observed context, we believe the extended memory of transformer-XL can offset the effect brought by conditioning on head entities. Ablating the relation performs better than transformer-XL. This shows the advantage of introducing tail entities. Using complete relation triples performs the best, demonstrating the effectiveness of this triple representation of knowledge.

Model                       Dev
Transformer-XL              19.0
Triple − Relation − Tail    19.0
Triple − Relation           18.7
Triple                      18.5

Table 4: Ablating the relation and/or tail entity from a relation triple.

Length of relational memory. We study how many relation triples need to be stored in the relational memory. As shown in Figure 2, the perplexity improves with more relation triples. However, the curve becomes flat beyond 300 relation triples.

Figure 2: Perplexity on WikiText-103 with different numbers of relation triples.

Length of transformer-XL memory. As increasing the length of the context window can capture longer dependencies, we study whether increasing the length of the extended (transformer-XL) memory removes the performance gap between RELATION LM and transformer-XL. As shown in Figure 3, the performance of both RELATION LM and transformer-XL improves with larger extended memory. However, RELATION LM still outperforms transformer-XL even with an extended memory length of 3072. We conclude that relational memory brings complementary benefits to simply expanding the extended memory, since it provides global information about entities in each dataset.

6 https://developers.google.com/freebase
Figure 3: Increasing extended memory length.

Dynamic OpenIE. All our main results use dynamic OpenIE. We show results without dynamic OpenIE in Table 5, including results on all three datasets for comparison. We can see that RELATION LM with dynamic OpenIE performs comparably to RELATION LM without dynamic OpenIE on WikiText-103 and enwik8, while larger improvements are obtained on WMT19. This indicates that dynamic OpenIE is more helpful for the news domain, which is more dynamic and temporal compared to knowledge-driven articles.

Model                    Wiki    WMT    enwik8
Transformer-XL           19.0    21.7   1.05
w/o Dynamic OpenIE       18.6    21.4   1.04
w/ Dynamic OpenIE        18.5    21.0   1.04

Table 5: Perplexity (bits per character for enwik8) with and without dynamic OpenIE.

Entity scoring. We study different entity scoring mechanisms for relation retrieval. We consider random selection (where entities extracted from the observed context are randomly selected), frequency-based scoring, and tf-idf scoring. As shown in Table 6, tf-idf performs the best.

Model        Dev
Random       19.1
Frequency    18.7
tf-idf       18.5

Table 6: Perplexity with different entity scoring methods.

Speed comparison. The wall clock time for both training and evaluation is shown in Table 7. RELATION LM is 1.5 and 2.1 times slower than transformer-XL during training and evaluation, respectively. Evaluation slows down further because of dynamic OpenIE, as shown in Algorithm 1.

Model             Train    Eval
Transformer-XL    0.51     0.31
RELATION LM       0.76     0.65

Table 7: Wall clock time in seconds per step. We use batch size 128 for training and batch size 1 for evaluation.

4.2 Does Relational Memory Improve Coherence?

For evaluating coherence, we use two automatic metrics—knowledge perplexity and knowledge F1—to investigate whether the models can faithfully use entities. We further perform a human evaluation to study whether language models can generate coherent and knowledgeable sequences. We believe the human evaluation is a reliable way of evaluating coherence; this claim is also advocated in Barzilay and Lapata (2005). We note that question answering is also often used to evaluate coherence (Guu et al., 2020; Lin et al., 2021). We leave this to future work.
Knowledge perplexity. While vanilla perplexity considers all words in an evaluation set, knowledge perplexity only considers entities when calculating perplexity. We use it to evaluate whether the model can assign higher probabilities to the correct entities under different contexts. Table 8 shows the numbers of entity words and non-entity words in our corpora. We show the results in Table 9. We observe that the gap between RELATION LM and transformer-XL is larger on knowledge perplexity. RELATION LM only performs comparably or slightly better compared to transformer-XL on non-entity perplexity. This shows that relational memory is helpful for predicting entity words. Note that knowledge perplexity tends to be much higher than perplexity on non-entity words, indicating the difficulty of predicting entity words. This collection of results indicates that relational memory helps the model use entities coherently and consistently under different contexts.

Dataset     Subset   # Entity   # Non-Entity
WikiText    Dev      61.6K      155.9K
            Test     65.8K      179.7K
WMT         Dev      84.9K      262.2K
            Test     81.0K      256.6K
enwik8      Dev      1.7M       3.3M
            Test     1.7M       3.3M

Table 8: Statistics of entity and non-entity tokens.

Metric            Dataset    Model             Dev    Test
Knowledge PPX     WikiText   Transformer-XL    47.3   52.3
                             RELATION LM       45.6   50.9
                  WMT        Transformer-XL    77.2   77.0
                             RELATION LM       73.2   73.1
                  enwik8     Transformer-XL    2.25   2.21
                             RELATION LM       2.22   2.19
Non-entity PPX    WikiText   Transformer-XL    13.3   13.8
                             RELATION LM       13.0   13.4
                  WMT        Transformer-XL    14.4   14.4
                             RELATION LM       14.2   14.3
                  enwik8     Transformer-XL    1.98   1.95
                             RELATION LM       1.98   1.95

Table 9: Knowledge perplexity (↓) and non-entity perplexity (↓).
Knowledge F1. We use knowledge F1 to explore whether our model generates tokens that are grounded in its contexts. Given a context as input, we sequentially generate 32 words (or 128 characters) for word-level (character-level) language modelling by sampling from the distribution of the next word (character). To reduce variance, we generate 100 continuations for each context. We then perform entity recognition on both the generated sequences and their corresponding ground-truth sequences and calculate an F1 score based on these two sets of entities. For example, given the context “...Ayola was nominated and shortlisted for the ‘Female Performance in TV’ award”, we compare the generated text and the ground truth “in the 2006 Screen Nation Awards, for her role as Kyla Tyson in Holby City...” to calculate F1.
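Both coherence metrics reduce to a few lines of code; the sketch below shows knowledge perplexity for a single evaluation sequence (given per-token natural-log probabilities and an entity mask) and knowledge F1 for a single continuation, leaving out the aggregation over the 100 sampled continuations per context.

```python
import math
from typing import Iterable, Sequence

def knowledge_perplexity(token_log_probs: Sequence[float], is_entity: Sequence[bool]) -> float:
    """Perplexity restricted to entity tokens; restricting to non-entity tokens
    instead gives non-entity perplexity."""
    selected = [lp for lp, ent in zip(token_log_probs, is_entity) if ent]
    return math.exp(-sum(selected) / len(selected))

def knowledge_f1(generated_entities: Iterable[str], reference_entities: Iterable[str]) -> float:
    """F1 between the entity sets recognised in a generated continuation and in
    the ground-truth continuation."""
    gen, ref = set(generated_entities), set(reference_entities)
    overlap = len(gen & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```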
shipwreck
  occurred
         on
December
     when
        the
 Aberdeen
   trawler
     Elinor
    Viking
     A278
          ,
   skipper
      Alec
      Flett
          ,
foundered
         on
        the

                                            Figure 4: Heatmap of gate values.

4.3 Qualitative Analysis

Gate values. As we use a gating function to integrate transformer-XL with relational information, we study gate values in this section. The histogram of gate values is shown in Figure 5. We notice that the histogram concentrates around 0.9. This is expected because non-entity words, which account for a large portion of text (according to Table 8), benefit less from the relational memory and mainly rely on the observed context for prediction, as shown in §4.2. We further calculate the average gate values for entity words and non-entity words. The average gate value for entity words is 0.87, while the average value is 0.92 for non-entity words. This confirms that entity words rely more on relational information for prediction than non-entity words. We also plot the heatmap of gate values; a cherry-picked example is shown in Figure 4. Note that we randomly select 100 of the 512 dimensions for readability. We notice that the entities, Aberdeen and Alec Flett, use more relational information than other positions (as shown by the horizontal blue lines). These results demonstrate that RELATIONLM can adaptively incorporate relational information for prediction.

Figure 5: Histogram of gate values gt (x-axis from 0.0 to 1.0).

Example. We show three cherry-picked examples in Table 12. We take the first for illustration, which shows a text segment from the article, Joe Biden 2008 presidential campaign [7], and some retrieved relations. We find that the first two relations, (Joe Biden, senior Senator, Delaware) and (Joe Biden presidential campaign, began, January 7 2007), are extracted from previous text segments, while (Joe Biden, was nominated, vice president) and (Biden, withdrew nomination, 1987) are extracted from the other articles, Joe Biden [8] and Joe Biden 1988 presidential campaign [9], respectively. We notice that the relation (Joe Biden, was nominated, vice president) is highly predictive of the sequence, "Biden was selected to be Democratic presidential nominee Barack Obama's vice presidential running mate". From the observed context, the model also identifies a closely related entity, Barack Obama, and retrieves the relation (Barack Obama, president of, United States). Therefore, we conclude that the relational memory can give a global picture of related entities and provide relevant information for language modelling.

Causal intervention. We use causal intervention to study whether changing the contents of the relational memory affects language model predictions. Given the relation (Obama, born in, Hawaii) along with other relations about Barack Obama, we let the model complete the sequence "Obama was born in". RELATIONLM outputs "Obama was born in and raised in Hawaii." with greedy decoding. However, after modifying the relation to (Obama, born in, Kenya), we obtain "Obama was born in Kenya and was the first African-American president.". We further change it to (Obama, born in, Paris) and the model outputs "Obama was born in Paris, France.". This indicates that RELATIONLM can take advantage of relation triples when making predictions. While prompts can also be used as interventions for vanilla language models, selecting appropriate prompts for different applications remains challenging (Liu et al., 2021a).

[7] https://en.wikipedia.org/wiki/Joe_Biden_2008_presidential_campaign
[8] https://en.wikipedia.org/wiki/Joe_Biden
[9] https://en.wikipedia.org/wiki/Joe_Biden_1988_presidential_campaign
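To make the gate analysis above concrete, the following is a minimal sketch of how a sigmoid gate over transformer-XL and relation-memory states, and the per-token-type averages, could be computed. It is an illustration, not the authors' implementation: the exact parameterization h = g * h_xl + (1 - g) * h_rel and all names (gated_combine, W_g, is_entity) are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_combine(h_xl, h_rel, W_g, b_g):
    """Mix transformer-XL states with relation-memory states via a learned
    sigmoid gate: h = g * h_xl + (1 - g) * h_rel.
    h_xl, h_rel: [seq_len, d_model]; W_g: [2 * d_model, d_model]; b_g: [d_model]."""
    g = sigmoid(np.concatenate([h_xl, h_rel], axis=-1) @ W_g + b_g)
    return g * h_xl + (1.0 - g) * h_rel, g

def average_gate_by_token_type(g, is_entity):
    """Average gate values separately over entity and non-entity positions.
    g: [seq_len, d_model] gate matrix; is_entity: boolean mask of length seq_len."""
    per_token = g.mean(axis=-1)  # one scalar gate value per position
    return float(per_token[is_entity].mean()), float(per_token[~is_entity].mean())
```

Under this convention a gate value near 1 means the prediction leans on the local transformer-XL context rather than the relational memory, which is consistent with the reported averages (0.87 for entity words versus 0.92 for non-entity words).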
Text segment 1:
  Seven months after conclusion of his campaign, Biden was selected to be Democratic presidential nominee Barack Obama's vice presidential running mate. The pair won in the general election, and were sworn in on January 20, 2009 ...
Retrieved relations:
  (Joe Biden, senior Senator, Delaware)
  (Joe Biden presidential campaign, began, January 7 2007)
  (Joe Biden, was nominated, vice president)
  (Biden, withdrew nomination, 1987)
  (Barack Obama, president of, United States)

Text segment 2:
  From 7 February 2006 to 9 December 2008, Ayola starred in BBC medical drama Holby City as nurse Kyla Tyson. She had previously appeared in Holby City's sister show Casualty ...
Retrieved relations:
  (Holby City, is, BBC medical drama)
  (Rakie Ayola, played the role, Kyla Tyson)

Text segment 3:
  Independiente became Arjona's fourth number one album on the Billboard Top Latin Albums where it debuted for the week ending 22 October 2011. For thirteen non-consecutive weeks it topped the Latin Pop Albums chart ...
Retrieved relations:
  (Independiente, number one on, Top Latin Albums chart)
  (Independiente, became number one on, 22 October 2011)

Table 12: Three examples of text segment and retrieved relations (based on previous text segments).
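The causal intervention described in §4.3 amounts to replacing one triple in the relational memory and regenerating greedily. The sketch below shows that loop under an assumed interface: RelationLM-style models, the generate_greedy method, and the memory argument are hypothetical placeholders, not the authors' API.

```python
def causal_intervention(model, relations, edited_triple, prompt):
    """Swap in an edited (head, relation, tail) triple and compare greedy
    continuations before and after the edit. `model.generate_greedy` is a
    hypothetical interface for a memory-conditioned decoder."""
    head, rel, _ = edited_triple
    edited = [edited_triple if (h == head and r == rel) else (h, r, t)
              for (h, r, t) in relations]
    before = model.generate_greedy(prompt, memory=relations)
    after = model.generate_greedy(prompt, memory=edited)
    return before, after

# e.g. causal_intervention(model,
#                          [("Obama", "born in", "Hawaii"), ...],
#                          ("Obama", "born in", "Kenya"),
#                          "Obama was born in")
```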
5 Related Work

Knowledge-enhanced architectures. Injecting symbolic knowledge into machine learning models is widely adopted to improve the performance of natural language understanding (Annervaz et al., 2018; Ostendorff et al., 2019), question answering (Zhang et al., 2018; Huang et al., 2019; Hixon et al., 2015), dialogue systems (Zhou et al., 2018; Moon et al., 2019; Guo et al., 2018; Liu et al., 2021b) and recommendation systems (Zhang et al., 2016; Wang et al., 2018a, 2019). Different from these models, we focus on using symbolic knowledge for language modelling. Existing language models are prone to generating illogical and contradictory content. We believe that connecting language modelling and knowledge graphs is a promising direction to overcome this problem. Next, we review previous knowledge-enhanced language models.

Knowledge-enhanced language models. Our model is closely related to previous work on grounding autoregressive language models with knowledge graphs (Ahn et al., 2016; Logan et al., 2019; Hayashi et al., 2020; Wang et al., 2021a). However, these models rely on complex and ad-hoc preprocessing or rules to link text with knowledge bases, e.g. Freebase and Wikidata. As a result, previous work is more aligned with conditional language modelling, e.g. graph-to-text generation p(x|G) in Wang et al. (2021a), which contrasts with the unconditional language modelling p(x) considered in this work. Because the graph G is constructed from the unseen text x, predicting x given G is easier for Wang et al. (2021a) due to this information leakage. Also, in Hayashi et al. (2020), topic entities are required for language modelling, which may not be available in most datasets, e.g. the news domain. We do not compare with these previous models due to the different settings. In contrast, we adopt OpenIE relations and use a tf-idf search to retrieve relation triples for connecting language models and knowledge graphs. In the experiments, we demonstrate the effectiveness of our approach on three datasets: WikiText-103, WMT19 and enwik8.

There are language models incorporating entity information, such as entity coreference annotations (Ji et al., 2017; Clark et al., 2018), surface forms of entities (Kiddon et al., 2016; Yang et al., 2017; Cao et al., 2021), entity types (Parvez et al., 2018; Wang et al., 2018b) and entity descriptions (Bahdanau et al., 2017). Different from these models, we augment language models with a relational memory consisting of relation triples. We demonstrate the effectiveness of using relation triples by ablating tail entities and relations in §4.1.

Knowledge-enhanced pretraining. Using knowledge information for pretraining language models (Peters et al., 2019; Sun et al., 2019; Liu et al., 2020; Guu et al., 2020; Wang et al., 2021b; Agarwal et al., 2021; Verga et al., 2021) has recently grown in popularity and has achieved substantial improvements on knowledge-driven tasks such as question answering and named entity recognition. Instead of using knowledge information to improve downstream knowledge-driven tasks, we focus on using knowledge information to improve the generation capability of the language model itself.
that connecting language modelling and knowl-          recognition. Instead of using knowledge informa-
tion for improving downstream knowledge-driven          References
tasks, we focus on using knowledge information
                                                        Oshin Agarwal, Heming Ge, Siamak Shakeri, and
for improving the generation capability of the
                                                          Rami Al-Rfou. 2021. Knowledge graph based
language model itself.
                                                          synthetic corpus generation for knowledge-
                                                          enhanced language model pre-training. In Pro-
Retrieval-augmented            models. Retrieval-         ceedings of the 2021 Conference of the North
augmented models are now widely adopted in                American Chapter of the Association for Com-
open-domain question answering (Chen et al.,              putational Linguistics: Human Language Tech-
2017; Lewis et al., 2020; de Masson d’Autume              nologies, pages 3554–3565.
et al., 2019; Izacard and Grave, 2021), dialogue
(Dinan et al., 2019; Fan et al., 2021; Thulke et al.,   Sungjin Ahn, Heeyoul Choi, Tanel Pärna-
2021) and machine translation (Bapna and Firat,           maa, and Yoshua Bengio. 2016. A neural
2019; Khandelwal et al., 2020a). We focus on              knowledge language model. arXiv preprint
retrieval augmentation for language modelling             arXiv:1608.00318.
(Merity et al., 2017; Grave et al., 2016; Khandel-
                                                        Gabor Angeli, Melvin Jose Johnson Premkumar,
wal et al., 2020b; Yogatama et al., 2021). These
                                                          and Christopher D. Manning. 2015. Leverag-
algorithms are specifically tailored for language
                                                          ing linguistic structure for open domain infor-
modelling, where related tokens are retrieved to
                                                          mation extraction. In Proceedings of the 53rd
help predict the next token. In this work, we
                                                          Annual Meeting of the Association for Compu-
move beyond token augmentation and show the
                                                          tational Linguistics and the 7th International
benefits of retrieving relation triples. We also
                                                          Joint Conference on Natural Language Pro-
demonstrate that our model is complementary to
                                                          cessing of the Asian Federation of Natural Lan-
a token augmentation model, S PALM (Yogatama
                                                          guage Processing, ACL 2015, July 26-31, 2015,
et al., 2021) in the experiments.
                                                          Beijing, China, Volume 1: Long Papers, pages
                                                          344–354. The Association for Computer Lin-
6   Conclusion                                            guistics.

                                                        K. M. Annervaz, Somnath Basu Roy Chowdhury,
We presented R ELATION LM, a language model
                                                          and Ambedkar Dukkipati. 2018. Learning be-
that is augmented with relational memory. We
                                                          yond datasets: Knowledge graph augmented
showed how to obtain relevant knowledge graphs
                                                          neural networks for natural language process-
for a given corpus and how to combine them
                                                          ing. In Proceedings of the 2018 Conference of
with a state-of-the-art language model such
                                                          the North American Chapter of the Association
as transformer-XL. We demonstrated that our
                                                          for Computational Linguistics: Human Lan-
model improves performance and coherence on
                                                          guage Technologies, NAACL-HLT 2018, New
WikiText-103, WMT19 and enwik8. We also per-
                                                          Orleans, Louisiana, USA, June 1-6, 2018, Vol-
formed a comprehensive analysis to better under-
                                                          ume 1 (Long Papers), pages 313–322. Associa-
stand how our model works. Our model provides a
                                                          tion for Computational Linguistics.
way to combine an autoregressive language model
with general knowledge graphs.                          Dzmitry Bahdanau, Tom Bosc, Stanislaw Jas-
                                                          trzebski, Edward Grefenstette, Pascal Vincent,
                                                          and Yoshua Bengio. 2017. Learning to com-
Acknowledgements
                                                          pute word embeddings on the fly. CoRR,
                                                          abs/1706.00286.
We would like to thank our action editor (Xavier
Carreras) and three anonymous reviewers for their       Ankur Bapna and Orhan Firat. 2019. Non-
insightful comments. We also thank Angeliki               parametric adaptation for neural machine trans-
Lazaridou, Cyprien de Masson d’Autume, Ling-              lation. In Proceedings of the 2019 Conference
peng Kong, Laura Rimell, Aida Nematzadeh, and             of the North American Chapter of the Associ-
the DeepMind language team for their helpful dis-         ation for Computational Linguistics: Human
cussions.                                                 Language Technologies, NAACL-HLT 2019,
                                                          Minneapolis, MN, USA, June 2-7, 2019, Volume
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: An entity-based approach. In ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 25-30 June 2005, University of Michigan, USA, pages 141–148. The Association for Computer Linguistics.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.

Kurt D. Bollacker, Robert P. Cook, and Patrick Tufts. 2007. Freebase: A shared database of structured general human knowledge. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada, pages 1962–1963. AAAI Press.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2787–2795.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. 2018. JAX: composable transformations of Python+NumPy programs.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2021. Autoregressive entity retrieval. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1870–1879. Association for Computational Linguistics.

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.

Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2250–2260. Association for Computational Linguistics.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 2978–2988. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of wikipedia: Knowledge-powered conversational agents. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S Weld. 2008. Open information extraction from the web. Communications of the ACM, 51(12):68–74.

Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. 2021. Augmenting transformers with knn-based composite memory for dialog. Trans. Assoc. Comput. Linguistics, 9:82–99.

Édouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2017. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1302–1310. PMLR.

Edouard Grave, Armand Joulin, and Nicolas Usunier. 2016. Improving neural language models with a continuous cache. CoRR, abs/1612.04426.

Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin. 2018. Dialog-to-action: Conversational question answering over a large-scale knowledge base. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 2946–2955.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: retrieval-augmented language model pre-training. CoRR, abs/2002.08909.

Hiroaki Hayashi, Zecong Hu, Chenyan Xiong, and Graham Neubig. 2020. Latent relation language models. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7911–7918. AAAI Press.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.

Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. 2020. Haiku: Sonnet for JAX.

Ben Hixon, Peter Clark, and Hannaneh Hajishirzi. 2015. Learning knowledge graphs for question answering through conversational dialog. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 851–861. The Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Xiao Huang, Jingyuan Zhang, Dingcheng Li, and Ping Li. 2019. Knowledge graph embedding based question answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, pages 105–113. ACM.

Marcus Hutter. 2012. The human knowledge compression contest. URL http://prize.hutter1.net.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. CoRR, abs/1611.01462.

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, pages 874–880. Association for Computational Linguistics.
Frederick Jelinek. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proc. Workshop on Pattern Recognition in Practice, 1980.

Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A. Smith. 2017. Dynamic entity representations in neural language models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 1830–1839. Association for Computational Linguistics.

Daniel Kahneman. 2011. Thinking, Fast and Slow. Farrar, Straus and Giroux.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 6769–6781. Association for Computational Linguistics.

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020a. Nearest neighbor machine translation. CoRR, abs/2010.00710.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020b. Generalization through memorization: Nearest neighbor language models. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally coherent text generation with neural checklist models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 329–339. The Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2018. Dynamic evaluation of neural sequence models. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 2771–2780. PMLR.

Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2019. Dynamic evaluation of transformer language models. CoRR, abs/1904.08378.

Brenden M. Lake and Gregory L. Murphy. 2020. Word meaning in minds and machines. CoRR, abs/2008.01766.

Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. Truthfulqa: Measuring how models mimic human falsehoods. CoRR, abs/2109.07958.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. CoRR, abs/2107.13586.

Qi Liu, Lei Yu, Laura Rimell, and Phil Blunsom. 2021b. Pretraining the noisy channel model for task-oriented dialogue. Transactions of the Association for Computational Linguistics.