Meta-Learning with Variational Semantic Memory for Word Sense Disambiguation
Yingjun Du, University of Amsterdam, y.du@uva.nl
Nithin Holla, Amberscript, nithin.holla7@gmail.com
Xiantong Zhen, University of Amsterdam, x.zhen@uva.nl
Cees G.M. Snoek, University of Amsterdam, C.G.M.Snoek@uva.nl
Ekaterina Shutova, University of Amsterdam, e.shutova@uva.nl
arXiv:2106.02960v1 [cs.CL] 5 Jun 2021

Abstract

A critical challenge faced by supervised word sense disambiguation (WSD) is the lack of large annotated datasets with sufficient coverage of words in their diversity of senses. This inspired recent research on few-shot WSD using meta-learning. While such work has successfully applied meta-learning to learn new word senses from very few examples, its performance still lags behind its fully-supervised counterpart. Aiming to further close this gap, we propose a model of semantic memory for WSD in a meta-learning setting. Semantic memory encapsulates prior experiences seen throughout the lifetime of the model, which aids better generalization in limited data settings. Our model is based on hierarchical variational inference and incorporates an adaptive memory update rule via a hypernetwork. We show our model advances the state of the art in few-shot WSD, supports effective learning in extremely data scarce (e.g. one-shot) scenarios and produces meaning prototypes that capture similar senses of distinct words.

1 Introduction

Disambiguating word meaning in context is at the heart of any natural language understanding task or application, whether it is performed explicitly or implicitly. Traditionally, word sense disambiguation (WSD) has been defined as the task of explicitly labeling word usages in context with sense labels from a pre-defined sense inventory. The majority of approaches to WSD rely on (semi-)supervised learning (Yuan et al., 2016; Raganato et al., 2017a,b; Hadiwinoto et al., 2019; Huang et al., 2019; Scarlini et al., 2020; Bevilacqua and Navigli, 2020) and make use of training corpora manually annotated for word senses. Typically, these methods require a fairly large number of annotated training examples per word. This problem is exacerbated by the dramatic imbalances in sense frequencies, which further increase the need for annotation to capture a diversity of senses and to obtain sufficient training data for rare senses.

This motivated recent research on few-shot WSD, where the objective of the model is to learn new, previously unseen word senses from only a small number of examples. Holla et al. (2020a) presented a meta-learning approach to few-shot WSD, as well as a benchmark for this task. Meta-learning makes use of an episodic training regime, where a model is trained on a collection of diverse few-shot tasks and is explicitly optimized to perform well when learning from a small number of examples per task (Snell et al., 2017; Finn et al., 2017; Triantafillou et al., 2020). Holla et al. (2020a) have shown that meta-learning can be successfully applied to learn new word senses from as little as one example per sense. Yet, the overall model performance in settings where data is highly limited (e.g. one- or two-shot learning) still lags behind that of fully supervised models.

In the meantime, machine learning research demonstrated the advantages of a memory component for meta-learning in limited data settings (Santoro et al., 2016a; Munkhdalai and Yu, 2017a; Munkhdalai et al., 2018; Zhen et al., 2020). The memory stores general knowledge acquired in learning related tasks, which facilitates the acquisition of new concepts and recognition of previously unseen classes with limited labeled data (Zhen et al., 2020). Inspired by these advances, we introduce the first model of semantic memory for WSD in a meta-learning setting. In meta-learning, prototypes are embeddings around which other data points of the same class are clustered (Snell et al., 2017). Our semantic memory stores prototypical representations of word senses seen during training, generalizing over the contexts in which they are used. This rich contextual information aids in learning new senses of previously unseen words
that appear in similar contexts, from very few examples.

The design of our prototypical representation of word sense takes inspiration from prototype theory (Rosch, 1975), an established account of category representation in psychology. It stipulates that semantic categories are formed around prototypical members, new members are added based on resemblance to the prototypes and category membership is a matter of degree. In line with this account, our models learn prototypical representations of word senses from their linguistic context. To do this, we employ a neural architecture for learning probabilistic class prototypes: variational prototype networks, augmented with a variational semantic memory (VSM) component (Zhen et al., 2020).

Unlike deterministic prototypes in prototypical networks (Snell et al., 2017), we model class prototypes as distributions and perform variational inference of these prototypes in a hierarchical Bayesian framework. Unlike deterministic memory access in memory-based meta-learning (Santoro et al., 2016b; Munkhdalai and Yu, 2017a), we access memory by Monte Carlo sampling from a variational distribution. Specifically, we first perform variational inference to obtain a latent memory variable and then perform another step of variational inference to obtain the prototype distribution. Furthermore, we enhance the memory update of vanilla VSM with a novel adaptive update rule involving a hypernetwork (Ha et al., 2016) that controls the weight of the updates. We call our approach β-VSM to denote the adaptive weight β for memory updates.

We experimentally demonstrate the effectiveness of this approach for few-shot WSD, advancing the state of the art in this task. Furthermore, we observe the highest performance gains on word senses with the least training examples, emphasizing the benefits of semantic memory for truly few-shot learning scenarios. Our analysis of the meaning prototypes acquired in the memory suggests that they are able to capture related senses of distinct words, demonstrating the generalization capabilities of our memory component. We make our code publicly available to facilitate further research.1

1 https://github.com/YDU-uva/VSM_WSD

2 Related work

Word sense disambiguation Knowledge-based approaches to WSD (Lesk, 1986; Agirre et al., 2014; Moro et al., 2014) rely on lexical resources such as WordNet (Miller et al., 1990) and do not require a corpus manually annotated with word senses. Alternatively, supervised learning methods treat WSD as a word-level classification task for ambiguous words and rely on sense-annotated corpora for training. Early supervised learning approaches trained classifiers with hand-crafted features (Navigli, 2009; Zhong and Ng, 2010) and word embeddings (Rothe and Schütze, 2015; Iacobacci et al., 2016) as input. Raganato et al. (2017a) proposed a benchmark for WSD based on the SemCor corpus (Miller et al., 1994) and found that supervised methods outperform the knowledge-based ones.

Neural models for supervised WSD include LSTM-based (Hochreiter and Schmidhuber, 1997) classifiers (Kågebäck and Salomonsson, 2016; Melamud et al., 2016; Raganato et al., 2017b), nearest neighbour classifiers with ELMo embeddings (Peters et al., 2018), as well as a classifier based on pretrained BERT representations (Hadiwinoto et al., 2019). Recently, hybrid approaches incorporating information from lexical resources into neural architectures have gained traction. GlossBERT (Huang et al., 2019) fine-tunes BERT with WordNet sense definitions as additional input. EWISE (Kumar et al., 2019) learns continuous sense embeddings as targets, aided by dictionary definitions and lexical knowledge bases. Scarlini et al. (2020) present a semi-supervised approach for obtaining sense embeddings with the aid of a lexical knowledge base, enabling WSD with a nearest neighbor algorithm. By further exploiting the graph structure of WordNet and integrating it with BERT, EWISER (Bevilacqua and Navigli, 2020) achieves the current state-of-the-art performance on the benchmark by Raganato et al. (2017a) – an F1 score of 80.1%.

Unlike few-shot WSD, these works do not fine-tune the models on new words during testing. Instead, they train on a training set and evaluate on a test set where words and senses might have been seen during training.

Meta-learning Meta-learning, or learning to learn (Schmidhuber, 1987; Bengio et al., 1991; Thrun and Pratt, 1998), is a learning paradigm where a model is trained on a distribution of tasks so as to enable rapid learning on new tasks. By solving a large number of different tasks, it aims to leverage the acquired knowledge to learn new, unseen tasks. The training set, referred to as the
meta-training set, consists of episodes, each corresponding to a distinct task. Every episode is further divided into a support set containing just a handful of examples for learning the task, and a query set containing examples for task evaluation. In the meta-training phase, for each episode, the model adapts to the task using the support set, and its performance on the task is evaluated on the corresponding query set. The initial parameters of the model are then adjusted based on the loss on the query set. By repeating the process on several episodes/tasks, the model produces representations that enable rapid adaptation to a new task. The test set, referred to as the meta-test set, also consists of episodes with a support and query set. The meta-test set corresponds to new tasks that were not seen during meta-training. During meta-testing, the meta-trained model is first fine-tuned on a small number of examples in the support set of each meta-test episode and then evaluated on the accompanying query set. The average performance on all such query sets measures the few-shot learning ability of the model.

Metric-based meta-learning methods (Koch et al., 2015; Vinyals et al., 2016; Sung et al., 2018; Snell et al., 2017) learn a kernel function and make predictions on the query set based on the similarity with the support set examples. Model-based methods (Santoro et al., 2016b; Munkhdalai and Yu, 2017a) employ external memory and make predictions based on examples retrieved from the memory. Optimization-based methods (Ravi and Larochelle, 2017; Finn et al., 2017; Nichol et al., 2018; Antoniou et al., 2019) directly optimize for generalizability over tasks in their training objective.

Meta-learning has been applied to a range of tasks in NLP, including machine translation (Gu et al., 2018), relation classification (Obamuyide and Vlachos, 2019), text classification (Yu et al., 2018; Geng et al., 2019), hypernymy detection (Yu et al., 2020), and dialog generation (Qian and Yu, 2019). It has also been used to learn across distinct NLP tasks (Dou et al., 2019; Bansal et al., 2019) as well as across different languages (Nooralahzadeh et al., 2020; Li et al., 2020). Bansal et al. (2020) show that meta-learning during self-supervised pre-training of language models leads to improved few-shot generalization on downstream tasks.

Holla et al. (2020a) propose a framework for few-shot word sense disambiguation, where the goal is to disambiguate new words during meta-testing. Meta-training consists of episodes formed from multiple words whereas meta-testing has one episode corresponding to each of the test words. They show that prototype-based methods – prototypical networks (Snell et al., 2017) and first-order ProtoMAML (Triantafillou et al., 2020) – obtain promising results, in contrast with model-agnostic meta-learning (MAML) (Finn et al., 2017).

Memory-based models Memory mechanisms (Weston et al., 2014; Graves et al., 2014; Krotov and Hopfield, 2016) have recently drawn increasing attention. In memory-augmented neural networks (Santoro et al., 2016b), given an input, the memory read and write operations are performed by a controller, using soft attention for reads and a least recently used access module for writes. Meta Network (Munkhdalai and Yu, 2017b) uses two memory modules: a key-value memory in combination with slow and fast weights for one-shot learning. An external memory was introduced to enhance recurrent neural networks in Munkhdalai et al. (2019), in which memory is conceptualized as an adaptable function and implemented as a deep neural network. Semantic memory has recently been introduced by Zhen et al. (2020) for few-shot learning to enhance prototypical representations of objects, where memory recall is cast as a variational inference problem.

In NLP, Tang et al. (2016) use content and location-based neural attention over external memory for aspect-level sentiment classification. Das et al. (2017) use key-value memory for question answering on knowledge bases. Mem2Seq (Madotto et al., 2018) is an architecture for task-oriented dialog that combines attention-based memory with pointer networks (Vinyals et al., 2015). Geng et al. (2020) propose Dynamic Memory Induction Networks for few-shot text classification, which utilizes dynamic routing (Sabour et al., 2017) over a static memory module. Episodic memory has been used in lifelong learning on language tasks, as a means to perform experience replay (d'Autume et al., 2019; Han et al., 2020; Holla et al., 2020b).

3 Task and dataset

We treat WSD as a word-level classification problem where ambiguous words are to be classified into their senses given the context. In traditional WSD, the goal is to generalize to new contexts of word-sense pairs. Specifically, the test set consists of word-sense pairs that were seen during train-
ing. On the other hand, in few-shot WSD, the goal is to generalize to new words and senses altogether. The meta-testing phase involves further adapting the models (on the small support set) to new words that were not seen during training and evaluates them on new contexts (using the query set). It deviates from the standard N-way, K-shot classification setting in few-shot learning since the words may have a different number of senses and each sense may have a different number of examples (Holla et al., 2020a), making it a more realistic few-shot learning setup (Triantafillou et al., 2020).

Dataset We use the few-shot WSD benchmark provided by Holla et al. (2020a). It is based on the SemCor corpus (Miller et al., 1994), annotated with senses from the New Oxford American Dictionary by Yuan et al. (2016). The dataset consists of words grouped into meta-training, meta-validation and meta-test sets. The meta-test set consists of new words that were not part of the meta-training and meta-validation sets. There are four setups varying in the number of sentences in the support set: |S| = 4, 8, 16, 32. |S| = 4 corresponds to an extreme few-shot learning scenario for most words, whereas |S| = 32 comes closer to the number of sentences per word encountered in standard WSD setups. For |S| = 4, 8, 16, 32, the number of unique words in the meta-training / meta-validation / meta-test sets is 985/166/270, 985/163/259, 799/146/197 and 580/85/129 respectively. We use the publicly available standard dataset splits.2

2 https://github.com/Nithin-Holla/MetaWSD

Episodes The meta-training episodes were created by first sampling a set of words and a fixed number of senses per word, followed by sampling example sentences for these word-sense pairs. This strategy allows for a combinatorially large number of episodes. Every meta-training episode has |S| sentences in both the support and query sets, and corresponds to the distinct task of disambiguating between the sampled word-sense pairs. The total number of meta-training episodes is 10,000. In the meta-validation and meta-test sets, each episode corresponds to the task of disambiguating a single, previously unseen word between all its senses. For every meta-test episode, the model is fine-tuned on a few examples in the support set and its generalizability is evaluated on the query set. In contrast to the meta-training episodes, the meta-test episodes reflect a natural distribution of senses in the corpus, including class imbalance, providing a realistic evaluation setting.

4 Methods

4.1 Model architectures

We experiment with the same model architectures as Holla et al. (2020a). The model f_θ, with parameters θ, takes words x_i as input and produces a per-word representation vector f_θ(x_i) for i = 1, ..., L, where L is the length of the sentence. Sense predictions are only made for ambiguous words, using the corresponding word representation.

GloVe+GRU A single-layer bi-directional GRU (Cho et al., 2014) network followed by a single linear layer, which takes GloVe embeddings (Pennington et al., 2014) as input. GloVe embeddings capture all senses of a word. We thus evaluate a model's ability to disambiguate from sense-agnostic input.

ELMo+MLP A multi-layer perceptron (MLP) network that receives contextualized ELMo embeddings (Peters et al., 2018) as input. Their contextualised nature makes ELMo embeddings better suited to capture meaning variation than the static ones. Since ELMo is not fine-tuned, this model has the lowest number of learnable parameters.

BERT A pretrained BERT_BASE (Devlin et al., 2019) model followed by a linear layer, fully fine-tuned on the task. BERT underlies state-of-the-art approaches to WSD.

4.2 Prototypical Network

Our few-shot learning approach builds upon prototypical networks (Snell et al., 2017), which are widely used for few-shot image classification and have been shown to be successful in WSD (Holla et al., 2020a). A prototypical network computes a prototype z_k = \frac{1}{K} \sum_{x \in S_k} f_\theta(x) of each word sense k (where K is the number of support examples for that sense and S_k denotes the set of those examples) through an embedding function f_θ, which is realized as one of the aforementioned architectures. It computes a distribution over classes for a query sample x given a distance function d(·, ·) as the softmax over its distances to the prototypes in the embedding space:

    p(y = k \mid x) = \frac{\exp\big(-d(f_\theta(x), z_k)\big)}{\sum_{k'} \exp\big(-d(f_\theta(x), z_{k'})\big)}    (1)
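For concreteness, a minimal sketch of this prototype computation and the classification rule in Eq. (1) is given below. It is an illustration only: the tensor shapes, the plain Euclidean choice of d(·, ·) and the function names are our own assumptions rather than the authors' released code.

```python
import torch

def sense_prototypes(support_emb: torch.Tensor,
                     support_senses: torch.Tensor,
                     num_senses: int) -> torch.Tensor:
    # z_k: mean of the encoder outputs f_theta(x) over the support
    # examples annotated with sense k (one row per sense).
    return torch.stack([support_emb[support_senses == k].mean(dim=0)
                        for k in range(num_senses)])

def sense_distribution(query_emb: torch.Tensor,
                       prototypes: torch.Tensor) -> torch.Tensor:
    # Eq. (1): softmax over negative distances to the sense prototypes.
    # Euclidean distance is assumed here for d(., .).
    dists = torch.cdist(query_emb, prototypes)   # [num_queries, num_senses]
    return torch.softmax(-dists, dim=-1)
```

In a meta-test episode, support_emb and query_emb would come from one of the encoders above (GloVe+GRU, ELMo+MLP or BERT) applied at the position of the ambiguous word.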
However, the resulting prototypes may not be sufficiently representative of word senses as semantic categories when using a single deterministic vector, computed as the average of only a few examples. Such representations lack expressiveness and may not encompass sufficient intra-class variance that is needed to distinguish between different fine-grained word senses. Moreover, large uncertainty arises in the single prototype due to the small number of samples.

4.3 Variational Prototype Network

The variational prototype network (VPN) (Zhen et al., 2020) is a powerful model for learning latent representations from small amounts of data, where the prototype z of each class is treated as a distribution. Given a task with a support set S and query set Q, the objective of VPN takes the following form:

    L_{VPN} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \Big[ \frac{1}{L_z} \sum_{l_z=1}^{L_z} -\log p(y_i \mid x_i, z^{(l_z)}) + \lambda D_{KL}\big[ q(z \mid S) \,\|\, p(z \mid x_i) \big] \Big]    (2)

where q(z|S) is the variational posterior over z, p(z|x_i) is the prior, and L_z is the number of Monte Carlo samples for z. The prior and posterior are assumed to be Gaussian. The re-parameterization trick (Kingma and Welling, 2013) is adopted to enable back-propagation with gradient descent, i.e., z^{(l_z)} = f(S, ε^{(l_z)}) with ε^{(l_z)} ∼ N(0, I) and f(S, ε^{(l_z)}) = µ_z + ε^{(l_z)} ∗ σ_z, where the mean µ_z and diagonal covariance σ_z are generated from the posterior inference network with S as input. The amortization technique is employed for the implementation of VPN. The posterior network takes the mean word representations in the support set S as input and returns the parameters of q(z|S). Similarly, the prior network produces the parameters of p(z|x_i) by taking the query word representation x_i ∈ Q as input. The conditional predictive log-likelihood is implemented as a cross-entropy loss.

4.4 β-Variational Semantic Memory

In order to leverage the shared common knowledge between different tasks to improve disambiguation in future tasks, we incorporate variational semantic memory (VSM) as in Zhen et al. (2020). It consists of two main processes: memory recall, which retrieves relevant information that fits with specific tasks based on the support set of the current task; and memory update, which effectively collects new information from the task and gradually consolidates the semantic knowledge in the memory. We adopt a similar memory mechanism and introduce an improved update rule for memory consolidation.

Figure 1: Computational graph of variational semantic memory for few-shot WSD. M is the semantic memory module, S the support set, x and y are the query sample and label, and z is the word sense prototype.

Memory recall The memory recall of VSM aims to choose the related content from the memory, and is accomplished by variational inference. It introduces latent memory m as an intermediate stochastic variable, and infers m from the addressed memory M. The approximate variational posterior q(m|M, S) over the latent memory m is obtained empirically by

    q(m \mid M, S) = \sum_{a=1}^{|M|} \gamma_a \, p(m \mid M_a),    (3)

where

    \gamma_a = \frac{\exp\big(g(M_a, S)\big)}{\sum_i \exp\big(g(M_i, S)\big)}    (4)

g(·) is the dot product, |M| is the number of memory slots, M_a is the memory content at slot a and stores the prototype of samples in each class, and we take the mean representation of samples in S. The variational posterior over the prototype then becomes:

    \tilde{q}(z \mid M, S) \approx \frac{1}{L_m} \sum_{l_m=1}^{L_m} q(z \mid m^{(l_m)}, S),    (5)

where m^{(l_m)} is a Monte Carlo sample drawn from the distribution q(m|M, S), and L_m is the number of samples. By incorporating the latent memory m from Eq. (3), we achieve the objective for varia-
tional semantic memory as follows:

    L_{VSM} = \sum_{i=1}^{|Q|} \Big[ -\mathbb{E}_{q(z \mid S, m)} \log p(y_i \mid x_i, z) + \lambda_z D_{KL}\big[ q(z \mid S, m) \,\|\, p(z \mid x_i) \big] + \lambda_m D_{KL}\big[ \sum_{a=1}^{|M|} \gamma_a p(m \mid M_a) \,\|\, p(m \mid S) \big] \Big]    (6)

where p(m|S) is the introduced prior over m, and λ_z and λ_m are hyperparameters. The overall computational graph of VSM is shown in Figure 1. Similarly, the posterior and prior over m are also assumed to be Gaussian and obtained by using amortized inference networks; more details are provided in Appendix A.1.

Memory update The memory update should effectively absorb new useful information to enrich the memory content. VSM employs an update rule as follows:

    M_c \leftarrow \beta M_c + (1 - \beta) \bar{M}_c,    (7)

where M_c is the memory content corresponding to class c, M̄_c is obtained using graph attention (Veličković et al., 2017), and β ∈ (0, 1) is a hyperparameter.

Adaptive memory update Although VSM was shown to be promising for few-shot image classification, it can be seen from the experiments by Zhen et al. (2020) that different values of β have considerable influence on the performance. β determines the extent to which memory is updated at each iteration. In the original VSM, β is treated as a hyperparameter obtained by cross-validation, which is time-consuming and inflexible in dealing with different datasets. To address this problem, we propose an adaptive memory update rule by learning β from data using a lightweight hypernetwork (Ha et al., 2016). To be more specific, we obtain β by a function f_β(·) implemented as an MLP with a sigmoid activation function in the output layer. The hypernetwork takes M̄_c as input and returns the value of β:

    \beta = f_\beta(\bar{M}_c)    (8)

Moreover, to prevent unbounded growth of the memory values, we propose to scale down the memory value whenever ||M_c||_2 > 1. This is achieved by scaling as follows:

    M_c \leftarrow \frac{M_c}{\max(1, \|M_c\|_2)}    (9)

When we update the memory, we feed the newly obtained memory M̄_c into the hypernetwork f_β(·), which outputs the adaptive β for the update. We provide a more detailed implementation of β-VSM in Appendix A.1.

5 Experiments and results

Experimental setup The size of the shared linear layer and of the memory content of each word sense is 64, 256, and 192 for GloVe+GRU, ELMo+MLP and BERT respectively. The activation function of the shared linear layer is tanh for GloVe+GRU and ReLU for the rest. The inference networks g_φ(·) for calculating the prototype distribution and g_ψ(·) for calculating the memory distribution are all three-layer MLPs, with the size of each hidden layer being 64, 256, and 192 for GloVe+GRU, ELMo+MLP and BERT. The activation function of their hidden layers is ELU (Clevert et al., 2016), and the output layer does not use any activation function. Each batch during meta-training includes 16 tasks. The hypernetwork f_β(·) is also a three-layer MLP, with the size of its hidden state consistent with that of the memory contents. The linear layer activation function is ReLU for the hypernetwork. For BERT and |S| = {4, 8}, λ_z = 0.001, λ_m = 0.0001 and the learning rate is 5e−6; for |S| = 16, λ_z = 0.0001, λ_m = 0.0001 and the learning rate is 1e−6; for |S| = 32, λ_z = 0.001, λ_m = 0.0001 and the learning rate is 1e−5. Hyperparameters for other models are reported in Appendix A.2. All the hyperparameters are chosen using the meta-validation set. The number of slots in memory is consistent with the number of senses in the meta-training set – 2915 for |S| = 4 and 8; 2452 for |S| = 16; 1937 for |S| = 32. The evaluation metric is the word-level macro F1 score, averaged over all episodes in the meta-test set. The parameters are optimized using Adam (Kingma and Ba, 2014).

We compare our methods against several baselines and state-of-the-art approaches. The nearest neighbor classifier baseline (NearestNeighbor) predicts a query example's sense as the sense of the support example closest in the word embedding space (ELMo and BERT) in terms of cosine distance. The episodic fine-tuning baseline (EF-ProtoNet) is one where only meta-testing is
performed, starting from a randomly initialized model. Prototypical network (ProtoNet) and ProtoFOMAML achieve the highest few-shot WSD performance to date on the benchmark of Holla et al. (2020a).

Embedding/Encoder   Method                  Average macro F1 score
                                            |S| = 4          |S| = 8          |S| = 16         |S| = 32
-                   MajoritySenseBaseline   0.247            0.259            0.264            0.261
GloVe+GRU           NearestNeighbor         –                –                –                –
GloVe+GRU           EF-ProtoNet             0.522 ± 0.008    0.539 ± 0.009    0.538 ± 0.003    0.562 ± 0.005
GloVe+GRU           ProtoNet                0.579 ± 0.004    0.601 ± 0.003    0.633 ± 0.008    0.654 ± 0.004
GloVe+GRU           ProtoFOMAML             0.577 ± 0.011    0.616 ± 0.005    0.626 ± 0.005    0.631 ± 0.008
GloVe+GRU           β-VSM (Ours)            0.597 ± 0.005    0.631 ± 0.004    0.652 ± 0.006    0.678 ± 0.007
ELMo+MLP            NearestNeighbor         0.624            0.641            0.645            0.654
ELMo+MLP            EF-ProtoNet             0.609 ± 0.008    0.635 ± 0.004    0.661 ± 0.004    0.683 ± 0.003
ELMo+MLP            ProtoNet                0.656 ± 0.006    0.688 ± 0.004    0.709 ± 0.006    0.731 ± 0.006
ELMo+MLP            ProtoFOMAML             0.670 ± 0.005    0.700 ± 0.004    0.724 ± 0.003    0.737 ± 0.007
ELMo+MLP            β-VSM (Ours)            0.679 ± 0.006    0.709 ± 0.005    0.735 ± 0.004    0.758 ± 0.005
BERT                NearestNeighbor         0.681            0.704            0.716            0.741
BERT                EF-ProtoNet             0.594 ± 0.008    0.655 ± 0.004    0.682 ± 0.005    0.721 ± 0.009
BERT                ProtoNet                0.696 ± 0.011    0.750 ± 0.008    0.755 ± 0.002    0.766 ± 0.003
BERT                ProtoFOMAML             0.719 ± 0.005    0.756 ± 0.007    0.744 ± 0.007    0.761 ± 0.005
BERT                β-VSM (Ours)            0.728 ± 0.012    0.773 ± 0.005    0.776 ± 0.003    0.788 ± 0.003

Table 1: Model performance comparison on the meta-test words using different embedding functions.

Results In Table 1, we show the average macro F1 scores of the models, with their mean and standard deviation obtained over five independent runs. Our proposed β-VSM achieves the new state-of-the-art performance on few-shot WSD with all the embedding functions, across all the setups with varying |S|. For GloVe+GRU, where the input is sense-agnostic embeddings, our model improves disambiguation compared to ProtoNet by 1.8% for |S| = 4 and by 2.4% for |S| = 32. With contextual embeddings as input, β-VSM with ELMo+MLP also leads to improvements compared to the previous best ProtoFOMAML for all |S|. Holla et al. (2020a) obtained state-of-the-art performance with BERT, and β-VSM further advances this, resulting in a gain of 0.9 – 2.2%. The consistent improvements with different embedding functions and support set sizes suggest that our β-VSM is effective for few-shot WSD for varying numbers of shots and senses as well as across model architectures.

6 Analysis and discussion

To analyze the contributions of different components in our method, we perform an ablation study by comparing ProtoNet, VPN, VSM and β-VSM and present the macro F1 scores in Table 2.

Role of variational prototypes VPN consistently outperforms ProtoNet with all embedding functions (by around 1% F1 score on average). The results indicate that the probabilistic prototypes provide more informative representations of word senses compared to deterministic vectors. The highest gains were obtained in the case of GloVe+GRU (1.7% F1 score with |S| = 8), suggesting that probabilistic prototypes are particularly useful for models that rely on static word embeddings, as they capture uncertainty in contextual interpretation.

Role of variational semantic memory We show the benefit of VSM by comparing it with VPN. VSM consistently surpasses VPN with all three embedding functions. According to our analysis, VSM makes the prototypes of different word senses more distinctive and distant from each other. The senses in memory provide more context information, enabling larger intra-class variations to be captured, and thus lead to improvements upon VPN.

Role of adaptive β To demonstrate the effectiveness of the hypernetwork for adaptive β, we compare β-VSM with VSM where β is tuned by cross-validation. It can be seen from Table 2 that there is a consistent improvement over VSM. Thus, the learned adaptive β acquires the ability to determine how much of the contents of memory needs to be updated based on the current new memory. β-VSM enables the memory content of different word senses to be more representative by better absorbing information from data with the adaptive update, resulting in improved performance.
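To make the memory mechanics behind this comparison concrete, the sketch below mirrors the recall weights of Eqs. (3)–(4) and the adaptive update of Eqs. (7)–(9) from Section 4.4. It is only an illustration under our own assumptions: the graph-attention construction of M̄_c, the variational inference networks and the gradient handling needed to train f_β are omitted, and the class name, method names and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class BetaVSMMemory(nn.Module):
    """Illustrative beta-VSM memory: attention-based recall and a
    hypernetwork-controlled update. Not the authors' implementation."""

    def __init__(self, num_slots: int, dim: int):
        super().__init__()
        # One slot per word sense seen during meta-training.
        self.register_buffer("M", torch.zeros(num_slots, dim))
        # f_beta: lightweight hypernetwork; sigmoid keeps beta in (0, 1).
        self.f_beta = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 1), nn.Sigmoid(),
        )

    def recall_weights(self, support_mean: torch.Tensor) -> torch.Tensor:
        # Eq. (4): gamma_a = softmax_a(g(M_a, S)), with g(.) a dot product
        # between each slot and the mean support representation.
        return torch.softmax(self.M @ support_mean, dim=0)

    def addressed_memory(self, support_mean: torch.Tensor) -> torch.Tensor:
        # Weighted read used to parameterize q(m | M, S) in Eq. (3).
        return self.recall_weights(support_mean) @ self.M

    @torch.no_grad()
    def update(self, sense_id: int, m_bar: torch.Tensor) -> None:
        # Eq. (8): beta predicted from the newly aggregated content M_bar_c.
        beta = self.f_beta(m_bar)
        # Eq. (7): convex combination of old and new memory for this sense.
        new_slot = beta * self.M[sense_id] + (1.0 - beta) * m_bar
        # Eq. (9): rescale so the slot norm never exceeds 1.
        self.M[sense_id] = new_slot / torch.clamp(new_slot.norm(p=2), min=1.0)
```

In this sketch, the update() step is what distinguishes β-VSM from vanilla VSM, where β would instead be a fixed, cross-validated constant.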
Embedding/Encoder   Method      Average macro F1 score
                                |S| = 4          |S| = 8          |S| = 16         |S| = 32
GloVe+GRU           ProtoNet    0.579 ± 0.004    0.601 ± 0.003    0.633 ± 0.008    0.654 ± 0.004
GloVe+GRU           VPN         0.583 ± 0.005    0.618 ± 0.005    0.641 ± 0.007    0.668 ± 0.005
GloVe+GRU           VSM         0.587 ± 0.004    0.625 ± 0.004    0.645 ± 0.006    0.670 ± 0.005
GloVe+GRU           β-VSM       0.597 ± 0.005    0.631 ± 0.004    0.652 ± 0.006    0.678 ± 0.007
ELMo+MLP            ProtoNet    0.656 ± 0.006    0.688 ± 0.004    0.709 ± 0.006    0.731 ± 0.006
ELMo+MLP            VPN         0.661 ± 0.005    0.694 ± 0.006    0.718 ± 0.004    0.741 ± 0.004
ELMo+MLP            VSM         0.670 ± 0.006    0.707 ± 0.006    0.726 ± 0.005    0.750 ± 0.004
ELMo+MLP            β-VSM       0.679 ± 0.006    0.709 ± 0.005    0.735 ± 0.004    0.758 ± 0.005
BERT                ProtoNet    0.696 ± 0.011    0.750 ± 0.008    0.755 ± 0.002    0.766 ± 0.003
BERT                VPN         0.703 ± 0.011    0.761 ± 0.007    0.762 ± 0.004    0.779 ± 0.002
BERT                VSM         0.717 ± 0.013    0.769 ± 0.006    0.770 ± 0.005    0.784 ± 0.002
BERT                β-VSM       0.728 ± 0.012    0.773 ± 0.005    0.776 ± 0.003    0.788 ± 0.003

Table 2: Ablation study comparing the meta-test performance of the different variants of prototypical networks.
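Both tables report the evaluation protocol of Section 5: the word-level macro F1 score on each meta-test episode's query set, averaged over all episodes. A minimal sketch of that aggregation, assuming scikit-learn and per-episode lists of gold labels and predictions (the function names are ours):

```python
from statistics import mean
from sklearn.metrics import f1_score

def episode_macro_f1(y_true, y_pred):
    # Macro F1 over the senses of a single meta-test word (one episode).
    return f1_score(y_true, y_pred, average="macro")

def meta_test_score(episodes):
    # `episodes` is a list of (query_labels, query_predictions) pairs,
    # one per meta-test word; the reported number is their mean.
    return mean(episode_macro_f1(t, p) for t, p in episodes)
```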

Figure 2: Distribution of average macro F1 scores over number of senses for BERT-based models with |S| = 16. Panels: (a) ProtoNet, (b) VPN, (c) VSM, (d) β-VSM.

Variation of performance with the number of senses In order to further probe into the strengths of β-VSM, we analyze the macro F1 scores of the different models averaged over all the words in the meta-test set with a particular number of senses. In Figure 2, we show a bar plot of the scores obtained from BERT for |S| = 16. For words with a low number of senses, the task corresponds to a higher number of effective shots and vice versa. It can be seen that the different models perform roughly the same for words with fewer senses, i.e., 2 – 4. VPN is comparable to ProtoNet in its distribution of scores. But with semantic memory, VSM improves the performance on words with a higher number of senses. β-VSM further boosts the scores for such words on average. The same trend is observed for |S| = 8 (see Appendix A.3). Therefore, the improvements of β-VSM over ProtoNet come from tasks with fewer shots, indicating that VSM is particularly effective at disambiguation in low-shot scenarios.

Visualization of prototypes To study the distinction between the prototype distributions of word senses obtained by β-VSM, VSM and VPN, we visualize them using t-SNE (Van der Maaten and Hinton, 2008). Figure 3 shows prototype distributions based on BERT for the word draw. Different colored ellipses indicate the distribution of its different senses obtained from the support set. Different colored points indicate the representations of the query examples. β-VSM makes the prototypes of different word senses of the same word more distinctive and distant from each other, with less overlap, compared to the other models. Notably, the representations of query examples are closer to their corresponding prototype distribution for β-VSM, thereby resulting in improved performance.

We also visualize the prototype distributions of similar vs. dissimilar senses of multiple words in Figure 4 (see Appendix A.4 for example sentences). The blue ellipse corresponds to the 'set up' sense of launch from the meta-test samples. Green and gray ellipses correspond to a similar sense of the words start and establish from the memory. We can see that they are close to each other. Orange and purple ellipses correspond to other senses of the words start and establish from the memory, and they are well separated. For a given query word, our model is thus able to retrieve related senses from the memory and exploit them to make its word sense distribution more representative and distinctive.
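As an illustration of this analysis step, the snippet below shows one way to project prototype and query representations to 2D with t-SNE before plotting; the perplexity value, the choice of inputs and the function name are our assumptions rather than the paper's exact plotting code.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_representations(points: np.ndarray, perplexity: float = 5.0):
    # points: [n, dim] array stacking prototype samples and query
    # representations for a word; returns 2-D coordinates for plotting.
    # Note: t-SNE requires perplexity < n, so stack enough points.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=0)
    return tsne.fit_transform(points)
```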
Figure 3: Prototype distributions of distinct senses of draw with different models. Panels: (a) VPN, (b) VSM, (c) β-VSM.

Figure 4: Prototype distributions of similar sense of launch (blue), start (green) and establish (grey). Distinct senses: start (orange) and establish (purple).

7 Conclusion

In this paper, we presented a model of variational semantic memory for few-shot WSD. We use a variational prototype network to model the prototype of each word sense as a distribution. To leverage the shared common knowledge between tasks, we incorporate semantic memory into the probabilistic model of prototypes in a hierarchical Bayesian framework. VSM is able to acquire long-term, general knowledge that enables learning new senses from very few examples. Furthermore, we propose adaptive β-VSM which learns an adaptive memory update rule from data using a lightweight hypernetwork. The consistent new state-of-the-art performance with three different embedding functions shows the benefit of our model in boosting few-shot WSD.

Since meaning disambiguation is central to many natural language understanding tasks, models based on semantic memory are a promising direction in NLP, more generally. Future work might investigate the role of memory in modeling meaning variation across domains and languages, as well as in tasks that integrate knowledge at different levels of linguistic hierarchy.

References

Eneko Agirre, Oier López de Lacalle, and Aitor Soroa. 2014. Random walks for knowledge-based word sense disambiguation. Computational Linguistics, 40(1):57–84.
Antreas Antoniou, Harrison Edwards, and Amos Storkey. 2019. How to train your MAML. In International Conference on Learning Representations.
Trapit Bansal, Rishikesh Jha, and Andrew McCallum. 2019. Learning to few-shot learn across diverse natural language classification tasks. arXiv preprint arXiv:1911.03863.
Trapit Bansal, Rishikesh Jha, Tsendsuren Munkhdalai, and Andrew McCallum. 2020. Self-supervised meta-learning for few-shot natural language classification tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 522–534, Online. Association for Computational Linguistics.
Y. Bengio, S. Bengio, and J. Cloutier. 1991. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume ii, pages 969 vol.2–.
Michele Bevilacqua and Roberto Navigli. 2020. Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2854–2864, Online. Association for Computational Linguistics.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2016. Fast and accurate deep network learning by exponential linear units (elus). In 4th
International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
Rajarshi Das, Manzil Zaheer, Siva Reddy, and Andrew McCallum. 2017. Question answering on knowledge bases and text using universal schema and memory networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 358–365, Vancouver, Canada. Association for Computational Linguistics.
Cyprien de Masson d'Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. 2019. Episodic memory in lifelong language learning. In Advances in Neural Information Processing Systems 32, pages 13143–13152. Curran Associates, Inc.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Zi-Yi Dou, Keyi Yu, and Antonios Anastasopoulos. 2019. Investigating meta-learning algorithms for low-resource natural language understanding tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1192–1197, Hong Kong, China. Association for Computational Linguistics.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135, International Convention Centre, Sydney, Australia. PMLR.
Ruiying Geng, Binhua Li, Yongbin Li, Jian Sun, and Xiaodan Zhu. 2020. Dynamic memory induction networks for few-shot text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1087–1094, Online. Association for Computational Linguistics.
Ruiying Geng, Binhua Li, Yongbin Li, Xiaodan Zhu, Ping Jian, and Jian Sun. 2019. Induction networks for few-shot text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3904–3913, Hong Kong, China. Association for Computational Linguistics.
Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.
Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3622–3631, Brussels, Belgium. Association for Computational Linguistics.
David Ha, Andrew Dai, and Quoc V Le. 2016. Hypernetworks. arXiv preprint arXiv:1609.09106.
Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung Gan. 2019. Improved word sense disambiguation using pre-trained contextualized word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5297–5306, Hong Kong, China. Association for Computational Linguistics.
Xu Han, Yi Dai, Tianyu Gao, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2020. Continual relation learning via episodic memory activation and reconsolidation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6429–6440, Online. Association for Computational Linguistics.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
Nithin Holla, Pushkar Mishra, Helen Yannakoudakis, and Ekaterina Shutova. 2020a. Learning to learn to disambiguate: Meta-learning for few-shot word sense disambiguation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4517–4533, Online. Association for Computational Linguistics.
Nithin Holla, Pushkar Mishra, Helen Yannakoudakis, and Ekaterina Shutova. 2020b. Meta-learning with sparse experience replay for lifelong language learning. arXiv preprint arXiv:2009.04891.
Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3509–3514, Hong Kong, China. Association for Computational Linguistics.
Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Pa-      Oren Melamud, Jacob Goldberger, and Ido Dagan.
  pers), pages 897–907, Berlin, Germany. Association       2016. context2vec: Learning generic context em-
  for Computational Linguistics.                           bedding with bidirectional LSTM. In Proceedings
                                                           of The 20th SIGNLL Conference on Computational
Mikael Kågebäck and Hans Salomonsson. 2016. Word         Natural Language Learning, pages 51–61, Berlin,
  sense disambiguation using a bidirectional LSTM.         Germany. Association for Computational Linguis-
  In Proceedings of the 5th Workshop on Cognitive          tics.
  Aspects of the Lexicon (CogALex - V), pages 51–56,
  Osaka, Japan. The COLING 2016 Organizing Com-          George A. Miller, Richard Beckwith, Christiane Fell-
  mittee.                                                  baum, Derek Gross, and Katherine Miller. 1990.
                                                           Wordnet: An on-line lexical database. International
Diederik P Kingma and Jimmy Ba. 2014. Adam: A              Journal of Lexicography, 3:235–244.
  method for stochastic optimization. arXiv preprint
  arXiv:1412.6980.                                       George A. Miller, Martin Chodorow, Shari Landes,
                                                           Claudia Leacock, and Robert G. Thomas. 1994. Us-
Diederik P Kingma and Max Welling. 2013. Auto-             ing a semantic concordance for sense identification.
  encoding variational bayes.   arXiv preprint             In Human Language Technology: Proceedings of a
  arXiv:1312.6114.                                         Workshop held at Plainsboro, New Jersey, March 8-
                                                           11, 1994.
Gregory Koch, Richard Zemel, and Ruslan Salakhutdi-
  nov. 2015. Siamese neural networks for one-shot im-    Andrea Moro, Alessandro Raganato, and Roberto Nav-
  age recognition. In ICML deep learning workshop,         igli. 2014. Entity linking meets word sense disam-
  volume 2. Lille.                                         biguation: a unified approach. Transactions of the
                                                           Association for Computational Linguistics, 2:231–
Dmitry Krotov and John J Hopfield. 2016. Dense as-         244.
 sociative memory for pattern recognition. arXiv
 preprint arXiv:1606.01164.                              Tsendsuren Munkhdalai, Alessandro Sordoni, Tong
                                                           Wang, and Adam Trischler. 2019. Metalearned neu-
Sawan Kumar, Sharmistha Jat, Karan Saxena, and             ral memory. In Advanced in Neural Information Pro-
  Partha Talukdar. 2019. Zero-shot word sense dis-         cessing Systems.
  ambiguation using sense definition embeddings. In
  Proceedings of the 57th Annual Meeting of the          Tsendsuren Munkhdalai and Hong Yu. 2017a. Meta
  Association for Computational Linguistics, pages         networks. In Proceedings of the 34th International
  5670–5681, Florence, Italy. Association for Compu-       Conference on Machine Learning, volume 70 of
  tational Linguistics.                                    Proceedings of Machine Learning Research, pages
                                                           2554–2563, International Convention Centre, Syd-
Michael Lesk. 1986. Automatic sense disambiguation         ney, Australia. PMLR.
  using machine readable dictionaries: How to tell
  a pine cone from an ice cream cone. In Proceed-        Tsendsuren Munkhdalai and Hong Yu. 2017b. Meta
  ings of the 5th Annual International Conference on       networks. In Proceedings of the 34th International
  Systems Documentation, SIGDOC ’86, page 24–26,           Conference on Machine Learning, Proceedings of
  New York, NY, USA. Association for Computing             Machine Learning Research, pages 2554–2563, In-
  Machinery.                                               ternational Convention Centre, Sydney, Australia.
                                                           PMLR.
Zheng Li, Mukul Kumar, William Headden, Bing Yin,
  Ying Wei, Yu Zhang, and Qiang Yang. 2020. Learn        Tsendsuren Munkhdalai, Xingdi Yuan, Soroush Mehri,
  to cross-lingual transfer with meta graph learning       and Adam Trischler. 2018. Rapid adaptation with
  across heterogeneous languages. In Proceedings of        conditionally shifted neurons. In International Con-
  the 2020 Conference on Empirical Methods in Nat-         ference on Machine Learning, pages 3664–3673.
  ural Language Processing (EMNLP), pages 2290–            PMLR.
  2301, Online. Association for Computational Lin-
  guistics.                                              Roberto Navigli. 2009. Word sense disambiguation: A
                                                           survey. ACM Computing Surveys, 41(2):1–69.
Laurens Van der Maaten and Geoffrey Hinton. 2008.
  Visualizing data using t-sne. Journal of machine       Alex Nichol, Joshua Achiam, and John Schulman.
  learning research, 9(11).                                2018.    On first-order meta-learning algorithms.
                                                           arXiv preprint arXiv:1803.02999.
Andrea Madotto, Chien-Sheng Wu, and Pascale Fung.
  2018. Mem2Seq: Effectively incorporating knowl-        Farhad Nooralahzadeh, Giannis Bekoulis, Johannes
  edge bases into end-to-end task-oriented dialog sys-     Bjerva, and Isabelle Augenstein. 2020. Zero-shot
  tems. In Proceedings of the 56th Annual Meeting of       cross-lingual transfer with meta learning. In Pro-
  the Association for Computational Linguistics (Vol-      ceedings of the 2020 Conference on Empirical Meth-
  ume 1: Long Papers), pages 1468–1478, Melbourne,         ods in Natural Language Processing (EMNLP),
  Australia. Association for Computational Linguis-        pages 4547–4562, Online. Association for Compu-
  tics.                                                    tational Linguistics.
Abiola Obamuyide and Andreas Vlachos. 2019.               Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton.
  Model-agnostic meta-learning for relation classifica-     2017. Dynamic routing between capsules. In Ad-
  tion with limited supervision. In Proceedings of the      vances in Neural Information Processing Systems
  57th Annual Meeting of the Association for Com-           30, pages 3856–3866.
  putational Linguistics, pages 5873–5879, Florence,
  Italy. Association for Computational Linguistics.       Adam Santoro, Sergey Bartunov, Matthew Botvinick,
                                                            Daan Wierstra, and Timothy Lillicrap. 2016a. Meta-
Jeffrey Pennington, Richard Socher, and Christopher         learning with memory-augmented neural networks.
   Manning. 2014. Glove: Global vectors for word rep-       In International conference on machine learning,
   resentation. In Proceedings of the 2014 Conference       pages 1842–1850. PMLR.
   on Empirical Methods in Natural Language Process-
                                                          Adam Santoro, Sergey Bartunov, Matthew Botvinick,
   ing (EMNLP), pages 1532–1543, Doha, Qatar. Asso-
                                                            Daan Wierstra, and Timothy Lillicrap. 2016b. Meta-
   ciation for Computational Linguistics.
                                                            learning with memory-augmented neural networks.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt             In Proceedings of The 33rd International Confer-
 Gardner, Christopher Clark, Kenton Lee, and Luke           ence on Machine Learning, volume 48 of Proceed-
 Zettlemoyer. 2018. Deep contextualized word rep-           ings of Machine Learning Research, pages 1842–
 resentations. In Proceedings of the 2018 Confer-           1850, New York, New York, USA. PMLR.
 ence of the North American Chapter of the Associ-        Bianca Scarlini, Tommaso Pasini, and Roberto Nav-
 ation for Computational Linguistics: Human Lan-            igli. 2020. With more contexts comes better per-
 guage Technologies, Volume 1 (Long Papers), pages          formance: Contextualized sense embeddings for
 2227–2237, New Orleans, Louisiana. Association             all-round word sense disambiguation. In Proceed-
 for Computational Linguistics.                             ings of the 2020 Conference on Empirical Methods
                                                            in Natural Language Processing (EMNLP), pages
Kun Qian and Zhou Yu. 2019. Domain adaptive dia-
                                                            3528–3539, Online. Association for Computational
  log generation via meta learning. In Proceedings of
                                                            Linguistics.
  the 57th Annual Meeting of the Association for Com-
  putational Linguistics, pages 2639–2649, Florence,      Jurgen Schmidhuber. 1987. Evolutionary principles in
  Italy. Association for Computational Linguistics.          self-referential learning. on learning now to learn:
                                                             The meta-meta-meta...-hook. Diploma thesis, Tech-
Alessandro Raganato, Jose Camacho-Collados, and              nische Universitat Munchen, Germany, 14 May.
  Roberto Navigli. 2017a. Word sense disambigua-
  tion: A unified evaluation framework and empiri-        Jake Snell, Kevin Swersky, and Richard Zemel. 2017.
  cal comparison. In Proceedings of the 15th Con-            Prototypical networks for few-shot learning. In Ad-
  ference of the European Chapter of the Association         vances in Neural Information Processing Systems
  for Computational Linguistics: Volume 1, Long Pa-         30, pages 4077–4087.
  pers, pages 99–110, Valencia, Spain. Association for
  Computational Linguistics.                              Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang,
                                                             Philip HS Torr, and Timothy M Hospedales. 2018.
Alessandro Raganato, Claudio Delli Bovi, and Roberto         Learning to compare: Relation network for few-shot
  Navigli. 2017b. Neural sequence learning mod-              learning. In Proceedings of the IEEE Conference
  els for word sense disambiguation. In Proceed-             on Computer Vision and Pattern Recognition, pages
  ings of the 2017 Conference on Empirical Methods          1199–1208.
  in Natural Language Processing, pages 1156–1167,
                                                          Duyu Tang, Bing Qin, and Ting Liu. 2016. Aspect
  Copenhagen, Denmark. Association for Computa-
                                                            level sentiment classification with deep memory net-
  tional Linguistics.
                                                            work. In Proceedings of the 2016 Conference on
Sachin Ravi and Hugo Larochelle. 2017. Optimiza-            Empirical Methods in Natural Language Processing,
  tion as a model for few-shot learning. In 5th Inter-      pages 214–224, Austin, Texas. Association for Com-
  national Conference on Learning Representations,          putational Linguistics.
  ICLR 2017, Toulon, France, April 24-26, 2017, Con-      Sebastian Thrun and Lorien Pratt, editors. 1998. Learn-
  ference Track Proceedings.                                ing to Learn. Kluwer Academic Publishers, USA.
Eleanor Rosch. 1975. Cognitive representations of se-     Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pas-
  mantic categories. Journal of Experimental Psychol-       cal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin,
  ogy: General, 104:192–233.                                Carles Gelada, Kevin Swersky, Pierre-Antoine Man-
                                                            zagol, and Hugo Larochelle. 2020. Meta-dataset: A
Sascha Rothe and Hinrich Schütze. 2015. AutoEx-            dataset of datasets for learning to learn from few ex-
  tend: Extending word embeddings to embeddings             amples. In International Conference on Learning
  for synsets and lexemes. In Proceedings of the            Representations.
  53rd Annual Meeting of the Association for Compu-
  tational Linguistics and the 7th International Joint    Petar Veličković, Guillem Cucurull, Arantxa Casanova,
  Conference on Natural Language Processing (Vol-           Adriana Romero, Pietro Lio, and Yoshua Bengio.
  ume 1: Long Papers), pages 1793–1803, Beijing,            2017. Graph attention networks. arXiv preprint
  China. Association for Computational Linguistics.         arXiv:1710.10903.
Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29, pages 3630–3638.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, volume 28, pages 2692–2700. Curran Associates, Inc.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.

Changlong Yu, Jialong Han, Haisong Zhang, and Wilfred Ng. 2020. Hypernymy detection for low-resource languages via meta learning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3651–3656, Online. Association for Computational Linguistics.

Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018. Diverse few-shot text classification with multiple metrics. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1206–1215, New Orleans, Louisiana. Association for Computational Linguistics.

Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1374–1385, Osaka, Japan. The COLING 2016 Organizing Committee.

Xiantong Zhen, Yingjun Du, Huan Xiong, Qiang Qiu, Cees Snoek, and Ling Shao. 2020. Learning to learn variational semantic memory. In Proceedings of NeurIPS.

Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83, Uppsala, Sweden. Association for Computational Linguistics.

A     Appendix

A.1    Implementation details
In the meta-training phase, we implement β-VSM by end-to-end learning with stochastic neural networks. The inference network and the hypernetwork are parameterized by feed-forward multi-layer perceptrons (MLPs). At meta-train time, we first extract the features of the support set via $f_\theta(x_S)$, where $f_\theta$ is the feature extraction network, and apply a permutation-invariant instance-pooling operation to obtain the mean feature $\bar{f}_c^s$ of the samples in the $c$-th class. We then form the memory $M_a$ from the support representation $\bar{f}_c^s$ of each class. The obtained memory $M_a$ is fed into a small three-layer MLP network $g_\psi(\cdot)$ that outputs the mean $\mu_m$ and variance $\sigma_m$ of the memory distribution, from which the memory is sampled as $m \sim \mathcal{N}(\mu_m, \mathrm{diag}((\sigma_m)^2))$.
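To make this step concrete, the following PyTorch sketch illustrates the class-level pooling and the memory inference network $g_\psi$. It is a minimal sketch rather than the released implementation: the names MemoryInference and class_mean_features, the hidden width, and the log-variance parameterization of the Gaussian head are assumptions; only the three-layer MLP, the Gaussian form, and the reparameterized sampling follow the description above.

import torch
import torch.nn as nn

class MemoryInference(nn.Module):
    """Sketch of g_psi: maps a class-level memory vector to the mean and
    (log-)variance of the memory distribution and samples m."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        # "small three-layer MLP" g_psi(.); exact sizes are assumptions
        self.g_psi = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * dim),  # -> [mu_m, log sigma_m^2]
        )

    def forward(self, memory_a: torch.Tensor):
        # memory_a: [C, d], one memory vector M_a per class
        mu_m, log_var_m = self.g_psi(memory_a).chunk(2, dim=-1)
        sigma_m = torch.exp(0.5 * log_var_m)
        # reparameterized sample m ~ N(mu_m, diag(sigma_m^2))
        m = mu_m + sigma_m * torch.randn_like(sigma_m)
        return m, mu_m, sigma_m

def class_mean_features(support_feats: torch.Tensor,
                        support_labels: torch.Tensor,
                        num_classes: int) -> torch.Tensor:
    """Permutation-invariant instance pooling: mean feature per class."""
    d = support_feats.size(-1)
    means = torch.zeros(num_classes, d, device=support_feats.device)
    for c in range(num_classes):
        means[c] = support_feats[support_labels == c].mean(dim=0)
    return means

In this sketch, support_feats would come from the embedding function $f_\theta$ (GloVe+GRU, ELMo+MLP or BERT), and memory_a stands in for $M_a$ built from the class-mean support representations.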
The new memory $\bar{M}_c$ is obtained with graph attention. The nodes of the graph are the feature representations of the current task samples, $F_c = \{f_c^0, f_c^1, f_c^2, \dots, f_c^{N_c}\}$, where $f_c^i \in \mathbb{R}^d$, $N_c = |S_c \cup Q_c|$, $f_c^0 = M_c$, and $f_c^{i>0} = f_\theta(x_c^i)$; that is, the graph contains the class memory together with all support and query samples of the $c$-th class in the current task. To update the memory, we feed the newly obtained memory $\bar{M}_c$ into the hypernetwork $f_\beta(\cdot)$, which outputs the adaptive $\beta$ used to update the memory with Equation 8.
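A sketch of this graph-attention update is given below, again as an illustration rather than the actual implementation. It builds the node set $F_c$ from the class memory and the task sample features, lets the memory node attend over all nodes to produce $\bar{M}_c$, and gates the update with the hypernetwork output $\beta$. Because Equation 8 is not reproduced in this appendix, the convex-combination update in the final line, the single attention head, and the sigmoid applied to $\beta$ are assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphMemoryUpdate(nn.Module):
    """Sketch: graph attention over F_c = {M_c, f_theta(x_c^1), ...} followed by
    a hypernetwork-gated memory update (assumed form of Equation 8)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Linear(dim, dim, bias=False)       # shared node projection
        self.attn = nn.Linear(2 * dim, 1, bias=False)  # scoring a([Wh_0 || Wh_j])
        self.f_beta = nn.Sequential(                   # hypernetwork producing beta
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, memory_c: torch.Tensor, sample_feats: torch.Tensor):
        # memory_c: [d]; sample_feats: [N_c, d] (support + query features of class c)
        nodes = torch.cat([memory_c.unsqueeze(0), sample_feats], dim=0)  # [N_c + 1, d]
        h = self.w(nodes)
        # attention of the memory node (index 0) over every node
        pairs = torch.cat([h[0].expand_as(h), h], dim=-1)                # [N_c + 1, 2d]
        alpha = F.softmax(F.leaky_relu(self.attn(pairs)).squeeze(-1), dim=0)
        new_memory = (alpha.unsqueeze(-1) * h).sum(dim=0)                # \bar{M}_c
        # adaptive gate beta from the hypernetwork f_beta
        beta = torch.sigmoid(self.f_beta(new_memory))
        updated = (1.0 - beta) * memory_c + beta * new_memory            # assumed update rule
        return updated, new_memory, beta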
We calculate the prototype of the latent distribution, i.e., its mean $\mu_z$ and variance $\sigma_z$, with another small three-layer MLP network $g_\phi(\cdot,\cdot)$ whose inputs are $\bar{f}_c^s$ and $m$. The prototype $z^{(l_z)}$ is then sampled from the distribution $z^{(l_z)} \sim \mathcal{N}(\mu_z, \mathrm{diag}((\sigma_z)^2))$. Using the prototypical word senses of the support samples and the feature embedding of a query sample $x_i$, we obtain the prediction $\hat{y}_i$.

At meta-test time, we feed the support representation $\bar{f}_c^s$ into $g_\psi(\cdot)$ to generate the memory $m_a$. Then, using the sampled memory $m_a$ and the support representation $\bar{f}_c^s$, we obtain the distribution of the prototypical word sense $z$. Finally, we make predictions for each query sample using the query representation extracted by the embedding function and the support prototype $z$.
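The prototype inference and the query prediction, for both the meta-train and the meta-test flow above, could be wired together as in the sketch below. The concatenation of $\bar{f}_c^s$ and $m$ as input to $g_\phi$, the negative squared Euclidean distance as the similarity between queries and prototypes, and the layer sizes are assumptions; the Gaussian parameterization and the sampling of $z$ follow the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeHead(nn.Module):
    """Sketch of g_phi: infers a Gaussian over the class prototype z from the
    class-mean support feature and the sampled memory, then classifies queries
    by their distance to the sampled prototypes."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.g_phi = nn.Sequential(          # "another small three-layer MLP"
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * dim),      # -> [mu_z, log sigma_z^2]
        )

    def sample_prototypes(self, class_means: torch.Tensor, memory: torch.Tensor):
        # class_means, memory: [C, d]
        stats = self.g_phi(torch.cat([class_means, memory], dim=-1))
        mu_z, log_var_z = stats.chunk(2, dim=-1)
        sigma_z = torch.exp(0.5 * log_var_z)
        z = mu_z + sigma_z * torch.randn_like(sigma_z)   # z ~ N(mu_z, diag(sigma_z^2))
        return z, mu_z, sigma_z

    def classify(self, query_feats: torch.Tensor, prototypes: torch.Tensor):
        # query_feats: [Q, d]; prototypes: [C, d]
        dists = torch.cdist(query_feats, prototypes)     # [Q, C]
        return F.log_softmax(-dists ** 2, dim=-1)        # log p(y | x) over the C senses

In a full implementation, the resulting log-probabilities for the query set would typically enter the meta-training objective; at meta-test time the same classify step is applied with prototypes inferred from the support set alone.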
A.2    Hyperparameters and runtimes

We present our hyperparameters in Table 3. For Monte Carlo sampling, we set different numbers of samples $L_Z$ and $L_M$ for each embedding function and support-set size $|S|$, chosen using the validation set. Training time differs across $|S|$ and embedding functions; for $|S| = 16$, the approximate training time per epoch is 20 minutes for GloVe+GRU, 80 minutes for ELMo+MLP, and 60 minutes for BERT. The number of meta-learned parameters $\theta$ is 889,920 for GloVe+GRU, 262,404 for ELMo+MLP, and 107,867,328 for BERT. We implemented all models using the PyTorch framework and trained them on an NVIDIA Tesla V100.
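The Monte Carlo sampling mentioned here could be applied at prediction time roughly as sketched below, with memory_head and prototype_head standing for the $g_\psi$ and $g_\phi$ modules sketched earlier. Averaging the class probabilities over the $L_M \times L_Z$ samples and the placeholder defaults of 5 samples are assumptions; the actual values of $L_Z$ and $L_M$ are those listed in Table 3.

import torch

@torch.no_grad()
def mc_predict(memory_head, prototype_head, memory_a, class_means, query_feats,
               L_M: int = 5, L_Z: int = 5):
    """Average query sense probabilities over L_M memory samples and
    L_Z prototype samples (Monte Carlo estimate)."""
    probs = 0.0
    for _ in range(L_M):
        m, _, _ = memory_head(memory_a)                                  # sample m
        for _ in range(L_Z):
            z, _, _ = prototype_head.sample_prototypes(class_means, m)   # sample z
            probs = probs + prototype_head.classify(query_feats, z).exp()
    return probs / (L_M * L_Z)                                           # [Q, C] averaged probabilities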