SemDeep-6

Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6)

8 January, 2021
Online
©2021 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

      Association for Computational Linguistics (ACL)
      209 N. Eighth Street
      Stroudsburg, PA 18360
      USA
      Tel: +1-570-476-8006
      Fax: +1-570-476-0860
      acl@aclweb.org

ISBN 978-1-950737-19-2
SemDeep-6

Welcome to the 6th Workshop on Semantic Deep Learning (SemDeep-6), held in conjunction with IJCAI 2020. As a series of workshops and a special issue, SemDeep has aimed to bring together the Semantic Web and Deep Learning research communities, as well as industry. It seeks to offer a platform that joins computational linguistics with formal approaches to representing information and knowledge, thereby opening the discussion to new and innovative application scenarios for neural and symbolic approaches to NLP, such as neural-symbolic reasoning.
SemDeep-6 features a shared task on evaluating meaning representations, the WiC-TSV (Target Sense Verification for Words in Context) challenge, based on a new multi-domain evaluation benchmark. The main difference between WiC-TSV and the common WSD task formulation is that in WiC-TSV there is no standard sense inventory that systems need to model in full. Each instance in the dataset is associated with a target word and a single sense; systems are therefore not required to model all senses of the target word, but only a single one. The task is to decide whether the target word is used in the target sense or not, i.e. a binary classification task. The WiC-TSV task formulation therefore resembles the use of automatic tagging in enterprise settings.
In total, this workshop accepted four papers, two for SemDeep and two for WiC-TSV. The main trend, as has been the norm in previous SemDeep editions, is the combination of neural and symbolic approaches to language and knowledge management tasks. For instance, Chiarcos et al. (2021) discuss methods for combining embeddings and lexicons, Moreno et al. (2021) present a method for relation extraction, and Vandenbussche et al. explore the application of transformer models to Word Sense Disambiguation. Finally, Moreno et al. (2021) also take advantage of language models for their participation in WiC-TSV. In addition, this SemDeep edition had the pleasure of having Michael Spranger as keynote speaker, who gave an invited talk on Logic Tensor Networks, as well as an invited talk by Mohammadshahi and Henderson (2020) on their Findings of EMNLP paper on Graph-to-Graph Transformers.
We would like to thank the Program Committee members for their support of this event in the form of reviews and feedback; without them we would not have been able to ensure the overall quality of the workshop.

     Dagmar Gromann, Luis Espinosa-Anke, Thierry Declerck, Anna Breit, Jose Camacho-Collados,
Mohammad Taher Pilehvar and Artem Revenko.
Co-organizers of SemDeep-6.                                                                  January 2021

Organisers:
    Luis Espinosa-Anke, Cardiff University, UK
    Dagmar Gromann, Technical University Dresden, Germany
    Thierry Declerck, German Research Centre for Artificial Intelligence (DFKI GmbH), Germany
    Anna Breit, Semantic Web Company, Austria
    Jose Camacho-Collados, Cardiff University, UK
    Mohammad Taher Pilehvar, Iran University of Science and Technology, Iran
    Artem Revenko, Semantic Web Company, Austria

Program Committee:
    Michael Cochez, RWTH Aachen University, Germany
    Artur d’Avila Garcez, City, University of London, London UK
    Pascal Hitzler, Kansas State University, Kansas, USA
    Wei Hu, Nanjing University, China
    John McCrae, Insight Centre for Data Analytics, Galway, Ireland
    Rezaul Karim, Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany
    Jose Moreno, Universite Paul Sabatier, IRIT, Toulouse, France
    Tommaso Pasini, Sapienza University of Rome, Rome, Italy
    Alessandro Raganato, University of Helsinki, Helsinki, Finland
    Simon Razniewski, Max-Planck-Institute, Germany
    Steven Schockaert, Cardiff University, United Kingdom
    Michael Spranger, Sony Computer Science Laboratories Inc., Tokyo, Japan
    Veronika Thost, IBM, Boston, USA
    Arkaitz Zubiaga, University of Warwick, United Kingdom

Invited Speaker:
    Michael Spranger, Sony Computer Science Laboratories

Invited Talk

Michael Spranger: Logic Tensor Network - A next generation framework for Neural-Symbolic
Computing
Artificial Intelligence has long been characterized by various approaches to intelligence. Some researchers have focussed on symbolic reasoning, while others have had important successes using learning from data. While state-of-the-art learning from data typically uses sub-symbolic distributed representations, reasoning is normally useful at a higher level of abstraction, with the use of a first-order logic language for knowledge representation. However, this dichotomy may actually be detrimental to progress in the field. Consequently, there has been growing interest in neural-symbolic integration. In this talk I will present Logic Tensor Networks (LTN), a recently revised neural-symbolic formalism that supports learning and reasoning through the introduction of a many-valued, end-to-end differentiable first-order logic called Real Logic. The talk will introduce LTN using examples that combine learning and reasoning in areas as diverse as data clustering, multi-label classification, relational learning, logical reasoning, query answering, semi-supervised learning, regression, and the learning of embeddings.

Table of Contents
CTLR@WiC-TSV: Target Sense Verification using Marked Inputs and Pre-trained Models . . . .                 1
Jose Moreno, Elvys Linhares Pontes and Gaël Dias
Word Sense Disambiguation with Transformer Models . . . . . . . . . . . . . . . . . . . . . . .          7
Pierre-Yves Vandenbussche, Tony Scerri and Ron Daniel Jr.
Embeddings for the Lexicon: Modelling and Representation . . . . . . . . . . . . . . . . . . . .         13
Christian Chiarcos, Thierry Declerck and Maxim Ionov
Relation Classification via Relation Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . .   20
Jose Moreno, Antoine Doucet and Brigitte Grau

CTLR@WiC-TSV: Target Sense Verification using Marked Inputs and
                       Pre-trained Models

Jose G. Moreno, University of Toulouse, IRIT, UMR 5505 CNRS, F-31000, Toulouse, France (jose.moreno@irit.fr)
Elvys Linhares Pontes, University of La Rochelle, L3i, F-17000, La Rochelle, France (elvys.linhares pontes@univ-lr.fr)
Gaël Dias, University of Caen, GREYC, UMR 6072 CNRS, F-14000, Caen, France (gael.dias@unicaen.fr)

Abstract

This paper describes the CTLR participation in the Target Sense Verification of Words in Context challenge (WiC-TSV) at SemDeep-6. Our strategy is based on a simple annotation scheme for the target words, which are then classified by well-known pre-trained neural models. In particular, the marker allows position information to be included, helping the models to correctly identify the word to disambiguate. Results on the challenge show that our strategy outperforms other participants (+11.4 Accuracy points) and strong baselines (+1.7 Accuracy points).

1 Introduction

This paper describes the CTLR[1] participation in the Word in Context challenge on the Target Sense Verification (WiC-TSV) task at SemDeep-6. In this challenge, given a target word w within its context, participants are asked to solve a binary task organised in three sub-tasks:

  • Sub-task 1 consists in predicting whether the target word matches a given definition,
  • Sub-task 2 consists in predicting whether the target word matches a given set of hypernyms, and
  • Sub-task 3 consists in predicting whether the target word matches a given pair of definition and set of hypernyms.

Our system is based on a masked neural language model with position information for Word Sense Disambiguation (WSD). Neural language models are recent and powerful resources useful for multiple Natural Language Processing (NLP) tasks (Devlin et al., 2018). However, little effort has been made to exploit positions as meaningful information when performing such tasks. In this line of research, Baldini Soares et al. (2019) include markers in the learning inputs for the task of relation classification, and Boualili et al. (2020) do so in an information retrieval model. In both cases, the tokens allow the model to precisely identify the targets and to make an informed prediction. Beyond these works, we are not aware of any other text-based task that has been tackled with this kind of information included in the models. To fill this gap, we propose to use markers for the target sense verification task.

The remainder of this paper is organised as follows: Section 2 presents brief background knowledge. Details of our strategy, including input modification and prediction mixing, are presented in Section 3. Unofficial and official results are then presented in Section 4. Finally, conclusions are drawn in Section 5.

[1] University of Caen Normandie, University of Toulouse, and University of La Rochelle team.

2 Background

NLP research has recently been boosted by new ways of using neural networks. Two main groups of neural networks can be distinguished[2] in NLP, based on the training model and feature modification.

  • First, classical neural networks usually use pre-trained embeddings as input, and the models learn their own weights during training time. Those weights are calculated directly on the target task, and the integration of new features or resources is intuitive. As an example, Figure 1(a) depicts the model from Zeng et al. (2014) for relation classification. Note that this model uses positional features (PF in the figure) that enrich the word embeddings (WF in the figure) to better represent the target words in the sentence. In this first group, models tend to use few parameters because the embeddings are not fine-tuned. This characteristic does not dramatically impact model performance.
  • The second group of models deals with neural language models[3] such as BERT (Devlin et al., 2018). The main difference with respect to the first group is that the weights of the models are not calculated during the training step of the target task. Instead, they are pre-calculated in an elegant but expensive fashion, by using generic tasks that produce strongly initialised models. These models are then fine-tuned to adapt their weights to the target task.[4] Figure 1(b) depicts the model from Devlin et al. (2018) for sentence classification based on BERT. Within the context of neural language models, adding extra features such as PF demands re-training of the full model, which is highly expensive and eventually prohibitive. Similarly, re-training is needed if one opts for adding external information, as done in recent works such as KnowBert (Peters et al., 2019) or SenseBERT (Levine et al., 2019).

[2] We are aware that our classification is arguable. Although it is not an established classification in the field, it seems important to us to distinguish between the two groups, as this work tries to introduce well-established concepts from the first group into the second one.
[3] Some subcategories may exist.
[4] We can imagine a combination of both, but models that use BERT as embeddings and do not fine-tune BERT weights may be classified in the first group.

Figure 1: Representation examples for the relation classification problem proposed by Zeng et al. (2014) (a) and Baldini Soares et al. (2019) (b and c).

We propose an alternative that mixes the best of both worlds by including extra tokens in the input in order to improve prediction without re-training the model. To do so, we base our strategy on the introduction of signals into the neural language model, as depicted in Figure 1(c) and done by Baldini Soares et al. (2019). Note that in this case the input is modified by introducing extra tokens ([E1], [/E1], [E2], and [/E2] are added based on the target words (Baldini Soares et al., 2019)) that help the system to point out the target words. In this work, we mark the target word by modifying the sentence in order to improve the performance of BERT for the task of target sense verification.

3 Target Sense Verification

3.1 Problem definition

Given a first sentence with a known target word, a second sentence with a definition, and a set of hypernyms, the target sense verification task consists in deciding whether or not the target word in the first sentence corresponds to the definition and/or the set of hypernyms. Note that two sub-problems can be set up if only the second sentence or only the hypernyms are used. These sub-problems are presented as sub-tasks in the WiC-TSV challenge.

3.2 CTLR method

We implemented a target sense verification system as a simplified version[5] of the architecture proposed by Baldini Soares et al. (2019), namely BERT_EM. It is based on BERT (Devlin et al., 2018), where an extra layer is added to perform the classification of the sentence representation, i.e. classification is performed using the [CLS] token as input. As reported by Baldini Soares et al. (2019), an important component is the use of marker symbols to identify the entities to classify. In our case, we mark the target word in its context to let the system know where to focus.

[5] We used the EntityMarkers[CLS] version.
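For illustration, the following is a minimal sketch of a BERT_EM-style classifier of this kind, using the HuggingFace transformers library: a pre-trained BERT encoder with a linear layer over the [CLS] representation producing two logits. It is an illustrative reconstruction under our own assumptions, not the system's released code.

# Minimal sketch of a BERT_EM-style binary classifier (illustrative, not the
# authors' code): BERT encodes the (marked) input pair and a linear head over
# the [CLS] representation outputs two logits (negative / positive).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SenseVerifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        cls_repr = outputs.last_hidden_state[:, 0]  # [CLS] token representation
        return self.classifier(cls_repr)            # logits for the two classes

# Example usage with a '$'-marked context and a candidate definition:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("I spent my $ spring $ holidays in Morocco .",
                "the season of growth",
                return_tensors="pt")
model = SenseVerifier()
logits = model(**enc)  # shape: (1, 2)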
3.3 Pointing out the target words

Learning the similarity between a pair of sentences (sub-task 1) can easily be addressed with BERT-based models by concatenating the two inputs one after the other, as presented in Equation 1, where S1 and S2 are the two input sentences, t1_i (i = 1..n) are the tokens of S1, and t2_j (j = 1..m) are the tokens of S2. In this case, the model must learn to discriminate the correct definition and also to identify which of the words in S1 the definition relates to.

    input(S1, S2) = [CLS] t1_1 t1_2 ... t1_n [SEP] t2_1 t2_2 ... t2_m                                  (1)

To avoid the extra effort required by the model to find the target word, we propose to introduce this information into the learning input. Thus, we mark the target word in St by placing a special token before and after it[6]. The input used when two sentences are compared is presented in Equation 2, where St is the first sentence with the target word t_i, Sd is the definition sentence, and the t^k_x are their respective tokens.

    input_sp1(St, Sd) = [CLS] tt_1 tt_2 ... $ tt_i $ ... tt_n [SEP] td_1 td_2 ... td_m                 (2)

In the case of hypernyms (sub-task 2), the left side of the input is kept as in Equation 2, but the right side includes the tagging of each hypernym, as presented in Equation 3.

    input_sp2(St, Sh) = [CLS] tt_1 tt_2 ... $ tt_i $ ... tt_n [SEP] sh_1 $ sh_2 $ ... $ sh_l           (3)

[6] We used ‘$’, but any other special token may be used.

3.4 Verifying the senses

We trained two separate models, one for each sub-problem, using the architecture defined in Section 3.2. The output predictions of both models are used to solve the two-task problem, so our overall prediction for the main problem is calculated by combining both prediction scores. First, we normalise the scores by applying a softmax function to each model output (Equation 5), and then we select the class with the maximum combined probability, as shown in Equation 4.

    pred(x) = 1  if  m_1^{sp1}(x) + m_1^{sp2}(x) > m_0^{sp1}(x) + m_0^{sp2}(x),  and 0 otherwise       (4)

    m_i^{spk}(x) = exp(p_i^{spk}) / Σ_{j ∈ {0,1}} exp(p_j^{spk})                                        (5)

where p_i^{spk} is the prediction value of model k for class i.
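As an illustration, the following sketch is our reading of Equations 2-5 (not the system's released code): it builds the marked inputs with the '$' token and combines the two models' output scores with a softmax. All function names are illustrative.

# Sketch of the marked-input construction (Eqs. 2-3) and the prediction
# combination (Eqs. 4-5); function names are ours, not from the paper.
import numpy as np

MARK = "$"  # special marker placed before and after the target word

def marked_context(tokens, target_index):
    """Insert the marker around the target token of the context sentence St."""
    return (tokens[:target_index] + [MARK, tokens[target_index], MARK]
            + tokens[target_index + 1:])

def input_sp1(context_tokens, target_index, definition_tokens):
    """Eq. 2: (marked context, definition) pair to feed a BERT tokenizer."""
    return (" ".join(marked_context(context_tokens, target_index)),
            " ".join(definition_tokens))

def input_sp2(context_tokens, target_index, hypernyms):
    """Eq. 3: (marked context, '$'-separated hypernym list) pair."""
    return (" ".join(marked_context(context_tokens, target_index)),
            f" {MARK} ".join(hypernyms))

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def combine(logits_sp1, logits_sp2):
    """Eqs. 4-5: normalise each model's two logits and pick the class whose
    summed probability is larger (1 = target sense used, 0 = not used)."""
    m1 = softmax(np.asarray(logits_sp1, dtype=float))
    m2 = softmax(np.asarray(logits_sp2, dtype=float))
    return int(m1[1] + m2[1] > m1[0] + m2[0])

# Example: target word "spring" at position 3 of the context sentence.
ctx = "I spent my spring holidays in Morocco .".split()
print(input_sp1(ctx, 3, "the season of growth".split()))
print(input_sp2(ctx, 3, ["season", "time of year"]))
print(combine([0.2, 1.3], [0.9, 0.4]))  # -> 1 (positive) for these toy logits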
4 Experiments and Results

4.1 Data Sets

The data set was manually created by the task organisers, and some basic statistics are presented in Table 1. Detailed information can be found in the task description paper (Breit et al., 2020). No extra annotated data was used for training.

               train   development   test
    Positive   1206    198           -
    Negative   931     191           -
    Total      2137    389           1324

Table 1: WiC-TSV data set examples per class. Positive examples are identified as ‘T’ and negative ones as ‘F’ in the data set.

4.2 Implementation details

We implemented the BERT_EM model of Baldini Soares et al. (2019) using the huggingface library (Wolf et al., 2019), and trained two models, one with each training set. We selected the model with the best performance on the development set. Parameters were fixed as follows: a maximum of 20 epochs, Cross Entropy as loss function, Adam as optimiser, and bert-base-uncased[7] as pre-trained model; other parameters were assigned following the library recommendations (Wolf et al., 2019). The final layer is composed of two neurons (negative or positive).

[7] https://github.com/google-research/bert
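A minimal fine-tuning loop consistent with these settings might look as follows. This is a sketch under the stated hyperparameters, not the submitted implementation; the learning rate, batch size, and the (context, definition/hypernyms, label) format of the input triples are our assumptions.

# Sketch of the fine-tuning setup described above (bert-base-uncased, Adam,
# Cross Entropy, up to 20 epochs, best checkpoint selected on the dev set).
import torch
from torch.optim import Adam
from transformers import BertTokenizer, BertForSequenceClassification

def fine_tune(train_pairs, dev_pairs, epochs=20, lr=2e-5, batch_size=16):
    # lr and batch_size are not reported in the paper; these are common defaults.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2).to(device)   # two-neuron final layer
    optimizer = Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_acc, best_state = 0.0, None

    for epoch in range(epochs):
        model.train()
        for i in range(0, len(train_pairs), batch_size):
            batch = train_pairs[i:i + batch_size]  # list of (context, pair, label)
            enc = tokenizer([s for s, _, _ in batch], [d for _, d, _ in batch],
                            padding=True, truncation=True,
                            return_tensors="pt").to(device)
            labels = torch.tensor([y for _, _, y in batch]).to(device)
            loss = loss_fn(model(**enc).logits, labels)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        acc = evaluate(model, tokenizer, dev_pairs, device)
        if acc > best_acc:  # keep the best development-set checkpoint
            best_acc = acc
            best_state = {k: v.cpu() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model

def evaluate(model, tokenizer, pairs, device):
    model.eval()
    correct = 0
    with torch.no_grad():
        for s, d, y in pairs:
            enc = tokenizer(s, d, truncation=True, return_tensors="pt").to(device)
            correct += int(model(**enc).logits.argmax(-1).item() == y)
    return correct / len(pairs)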
4.3 Results

As the test labels are not publicly available, the following analysis is performed exclusively on the development set. Results on the test set were calculated by the task organisers.

Since our strategy is based on marking the target word, we analyse confusion matrices depending on the position of the target word in the sentence. These matrices are presented in Figure 2. The confusion matrix labelled as position group 1 shows our results when the target word is within the first 25% of the positions of the St sentence. The other matrices show the results for the remaining parts of the sentence (second, third, and fourth 25%, for groups 2, 3, and 4 respectively).

Figure 2: Confusion matrices for different position groups. Group 1 (resp. 2, 3, and 4) includes all sentences for which the target word appears in the first (resp. second, third, and fourth) quarter of the sentence.

The confusion matrices show that the easiest cases are those where the target word is located in the first 25%. The other parts are harder, mainly because the system classifies positive examples as negatives (high false negative rate). However, the system behaves correctly for negative examples independently of the position of the target word. To better understand this misclassification of positive examples, we calculated the true label distribution depending on the normalised prediction score, as shown in Figure 3. Note that positive examples are mainly located on the right side, but a bulk of them is located around the middle of the figure. This means that the models m^{sp1} and m^{sp2} were in conflict, and the averaged results were slightly better for the negative class. On the development set, it therefore seems important to correctly define a threshold strategy to better decide which examples are marked as positive.

Figure 3: Histograms of predicted values in the dev set.

In our experiments, we implicitly used 0.5 as the threshold[8] to decide whether an example belongs to the ‘T’ or ‘F’ class. When comparing Figures 3 and 4, we can clearly see that small changes in the threshold parameter would affect our results, with a larger impact on recall than on precision. This is mainly due to the fact that our two models contradict each other for some examples.

[8] Because of the condition m_1^{sp1}(x) + m_1^{sp2}(x) > m_0^{sp1}(x) + m_0^{sp2}(x).

Figure 4: Precision/Recall curve for the development set for different threshold values.

We also considered the class distribution depending on a normalised distance between the target token and the beginning of the sentence. From Figure 5, we observe that both classes are less frequent at the beginning of the sentence, with negative examples slightly less frequent than positive ones. It is interesting to note that negative examples are uniformly distributed after the first bin. On the contrary, the positive examples have a more unpredictable distribution, indicating that a strategy based only on positions may fail. However, our strategy, which combines markers to indicate the target word with a strong neural language model (BERT), successfully manages to classify the examples.
Figure 5: Position distribution based on the distances between the target token and the beginning of the sentence.

Finally, the main results calculated by the organisers are presented in Table 2. The Global block presents the results for the global task, including definitions and hypernyms. Our submission is identified as run2-CTLR. In the global results, our strategy outperforms the other participants and the baselines in terms of Accuracy, Precision, and F1. The best Recall is unsurprisingly obtained by the baseline (true), which corresponds to a system that predicts all examples as positive. Two strong baselines are included, FastText and BERT; both were calculated by the organisers, with more details in (Breit et al., 2020). It is interesting to note that the BERT baseline is very similar to our model but without the marker information. Our model focuses more on improving Precision than Recall, resulting in a clear improvement in terms of Accuracy but a less important one in terms of F1.

The organisers also provide results grouped by different types of examples. They included four types, three of which come from domains that were not included in the training set[9]. From Table 2, we can also conclude that our system is able to adapt to out-of-domain topics, as clearly shown for the Cocktails type in terms of F1, and to a lesser extent for the Medical entities type. However, our system fails to provide better results than the standard BERT in terms of F1 for the Computer Science type. In terms of Accuracy, though, our strategy outperforms BERT by a large margin on the out-of-domain types (8.3, 6.1, and 1.2 absolute points of improvement for Cocktails, Medical entities, and Computer Science respectively). Surprisingly, it fails on both F1 and Accuracy for WordNet/Wiktionary.

[9] More details in (Breit et al., 2020).

    Global
    Run               User          Accuracy  Precision  Recall  F1
    run2              CTLR (ours)   78.3      78.9       78.0    78.5
    szte              begab         66.9      61.6       92.5    73.9
    szte2             begab         66.3      61.1       92.8    73.7
    BERT              -             76.6      74.1       82.8    78.2
    FastText          -             53.4      52.8       79.4    63.4
    Baseline (true)   -             50.8      50.8       100.0   67.3
    Human             -             85.3      80.2       96.2    87.4

    WordNet/Wiktionary
    Run               User          Accuracy  Precision  Recall  F1
    run2              CTLR (ours)   72.1      75.8       70.7    73.2
    szte              begab         70.2      66.5       89.6    76.4
    szte2             begab         69.9      66.2       90.2    76.3
    BERT              -             73.5      76.1       74.2    75.1
    FastText          -             57.1      58.0       74.0    65.0
    Baseline (true)   -             53.8      53.8       100.0   70.0
    Human             -             82.1      -          -       -

    Cocktails
    Run               User          Accuracy  Precision  Recall  F1
    run2              CTLR (ours)   87.5      82.4       90.3    86.2
    szte              begab         55.1      48.9       96.8    65.0
    szte2             begab         53.7      48.1       96.8    64.3
    BERT              -             79.2      67.8       98.2    80.2
    FastText          -             43.1      43.1       100.0   60.2
    Baseline (true)   -             43.1      43.1       100.0   60.2
    Human             -             92.0      -          -       -

    Medical entities
    Run               User          Accuracy  Precision  Recall  F1
    run2              CTLR (ours)   85.9      86.7       85.8    86.3
    szte              begab         65.4      60.5       95.3    74.0
    szte2             begab         64.4      59.8       95.3    73.5
    BERT              -             79.8      75.8       89.6    82.1
    FastText          -             51.1      51.5       90.3    65.6
    Baseline (true)   -             51.7      51.7       100.0   68.2
    Human             -             89.1      -          -       -

    Computer Science
    Run               User          Accuracy  Precision  Recall  F1
    run2              CTLR (ours)   83.3      78.4       88.5    83.1
    szte              begab         70.2      61.3       97.4    75.2
    szte2             begab         69.6      60.8       97.4    74.9
    BERT              -             82.1      73.0       97.9    83.6
    FastText          -             54.0      50.5       67.1    57.3
    Baseline (true)   -             46.4      46.4       100.0   63.4
    Human             -             86.5      -          -       -

Table 2: Accuracy, Precision, Recall and F1 results of participants and baselines, split by type. The general results are in the Global block. All results were calculated by the task organisers (Breit et al., 2020), as participants did not have access to the test labels. For the Human annotators, only Accuracy is available for the per-type results.

5 Conclusion

This paper describes our participation in the WiC-TSV task. We proposed a simple but effective strategy for target sense verification. Our system is based on BERT and introduces markers around the target words to better guide the learned model. Our results are strong on an unseen collection used to verify senses. Indeed, our method (Acc = 78.3) outperforms the other participants (second best participant, Acc = 66.9) and strong baselines (BERT, Acc = 76.6) when compared in terms of Accuracy, the official metric. This margin is even larger when the results are compared on the out-of-domain examples of the test collection. Thus, the results suggest that the extra information provided to the BERT model through the markers clearly boosts performance.

As future work, we plan to complete the evaluation of our system on the WiC dataset (Pilehvar and Camacho-Collados, 2019), as well as the integration of the model into a recent multilingual entity linking system (Linhares Pontes et al., 2020) by marking the anchor texts.
Acknowledgements

This work has been partly supported by the European Union’s Horizon 2020 research and innovation programme under grant 825153 (EMBEDDIA).

References

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the Blanks: Distributional Similarity for Relation Learning. In ACL. 2895–2905.

Lila Boualili, Jose G. Moreno, and Mohand Boughanem. 2020. MarkedBERT: Integrating Traditional IR Cues in Pre-Trained Language Models for Passage Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20). Association for Computing Machinery, 1977–1980.

Anna Breit, Artem Revenko, Kiamehr Rezaee, Mohammad Taher Pilehvar, and Jose Camacho-Collados. 2020. WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context. arXiv:2004.15016 (2020).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 (2018).

Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. 2019. SenseBERT: Driving Some Sense into BERT. arXiv:1908.05646 (2019).

Elvys Linhares Pontes, Jose G. Moreno, and Antoine Doucet. 2020. Linking Named Entities across Languages Using Multilingual Word Embeddings. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL ’20). Association for Computing Machinery, 329–332.

Matthew E. Peters, Mark Neumann, Robert L. Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge Enhanced Contextual Word Representations. In EMNLP. 43–54.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 1267–1273.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 (2019).

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. 2335–2344.
Word Sense Disambiguation with Transformer Models

Pierre-Yves Vandenbussche, Elsevier Labs, Radarweg 29, Amsterdam 1043 NX, Netherlands (p.vandenbussche@elsevier.com)
Tony Scerri, Elsevier Labs, 1 Appold Street, London EC2A 2UT, UK (a.scerri@elsevier.com)
Ron Daniel Jr., Elsevier Labs, 230 Park Avenue, New York, NY, 10169, USA (r.daniel@elsevier.com)

Abstract

In this paper, we tackle the task of Word Sense Disambiguation (WSD). We present our system submitted to the Word-in-Context Target Sense Verification challenge, part of the SemDeep workshop at IJCAI 2020 (Breit et al., 2020). That challenge asks participants to predict whether a specific mention of a word in a text matches a pre-defined sense. Our approach uses pre-trained transformer models such as BERT that are fine-tuned on the task using different architecture strategies. Our model achieves the best accuracy and precision on Subtask 1 (making use of definitions to decide whether the target word in context corresponds to the given sense or not). We believe the strategies we explored in the context of this challenge can be useful for other Natural Language Processing tasks.

1 Introduction

Word Sense Disambiguation (WSD) is a fundamental and long-standing problem in Natural Language Processing (NLP) (Navigli, 2009). It aims at clearly identifying which specific sense of a word is being used in a text. As illustrated in Table 1, in the sentence I spent my spring holidays in Morocco., the word spring is used in the sense of the season of growth, and not in other senses involving coils of metal, sources of water, the act of jumping, etc.

The Word-in-Context Target Sense Verification challenge (WiC-TSV) (Breit et al., 2020) structures WSD tasks in particular ways in order to make the competition feasible. In Subtask 1, the system is provided with a sentence (also known as the context), the target word, and a definition (also known as the word sense). The system is to decide whether the use of the target word matches the sense given by the definition. Note that Table 1 contains a Hypernym column. In Subtask 2, the system is to decide whether the use of the target in the context is a hyponym of the given hypernym. In Subtask 3, the system can use both the sentence and the hypernym in making the decision.

The dataset provided with the WiC-TSV challenge has relatively few sense-annotated examples (< 4,000) and a single target sense per word. This makes pre-trained Transformer models well suited for the task, since the small amount of data would limit the learning ability of a typical supervised model trained from scratch.

Thanks to the recent advances made in language models such as BERT (Devlin et al., 2018) or XLNet (Yang et al., 2019) trained on large corpora, neural language models have established the state of the art in many NLP tasks. Their ability to capture context-sensitive semantic information from text would seem to make them particularly well suited for this challenge. In this paper, we explore different fine-tuning architecture strategies to answer the challenge. Beyond the results of our system, our main contribution comes from the intuition and implementation around this set of strategies, which can be applied to other NLP tasks.

2 Data Analysis

The Word-in-Context Target Sense Verification dataset consists of more than 3,800 rows. As shown in Table 1, each row contains a target word, a context sentence containing the target, and both hypernym(s) and a definition giving a sense of the term. There are both positive and negative examples; the dataset provides a label to distinguish them.
  Target Word   Pos.   Sentence                                           Hypernyms              Definition             Label
  spring        3      I spent my spring holidays in Morocco .            season, time of year   the season of growth   T
  integrity     1      the integrity of the nervous system is required    honesty, honestness    moral soundness        F
                       for normal developments

Table 1: Examples of training data.

Table 2 shows some statistics about the training, dev, and test splits within the dataset. Note the substantial differences between the test set and the training and dev sets. The longer length of the context sentences and definitions in the test set may have an impact on a model trained solely on the given training and dev sets. This is a known issue whose roots are explained in the dataset authors’ paper (Breit et al., 2020). The training and development sets come from WordNet and Wiktionary, while the test set incorporates both the general-purpose sources WordNet and Wiktionary, and domain-specific examples from Cocktails, Medical Subjects, and Computer Science. The difference in the distributions of the test set from the training and dev sets, the short length of the definitions and hypernyms, and the relatively small number of examples all combine to provide a good challenge for the language models.

                                 Train Set     Dev Set      Test Set
  Number Examples                2,137         389          1,305
  Avg. Sentence Char Length      44 ± 27       44 ± 26      99 ± 87
  Avg. Sentence Word Length      9 ± 5         9 ± 5        19 ± 16
  Avg. Term Use                  2.5 ± 2.7     1.0 ± 0.2    1.9 ± 4.4
  Avg. Number Hypernyms          2.2 ± 1.5     2.2 ± 1.4    1.9 ± 1.3
  Percentage of True Label       56%           51%          NA
  Avg. Definition Char Length    54 ± 27       56 ± 27      157 ± 151
  Avg. Definition Word Length    9.3 ± 4.7     9.6 ± 4.7    25.3 ± 23.9

Table 2: Dataset statistics. Values with ± are mean and standard deviation.

3 System Description and Related Work

Word Sense Disambiguation is a long-standing task in NLP because of its difficulty and subtlety. One way the WiC-TSV challenge has simplified the problem is by reducing it to a binary yes/no decision over a single sense for a single pre-identified target. This is in contrast to most prior work, which provides a pre-defined sense inventory, typically WordNet, and requires the system to both identify the terms and find the best matching sense from the inventory. WordNet provides extremely fine-grained senses, which have been shown to be difficult for humans to select accurately (Hovy et al., 2006). Coupled with this is the task of even selecting the term in the presence of multi-word expressions and negations.

Since the introduction of the transformer self-attention-based neural architecture and its ability to capture complex linguistic knowledge (Vaswani et al., 2017), the use of transformers for WSD has received considerable attention (Loureiro et al., 2020). A common approach consists in fine-tuning a single pre-trained transformer model on the WSD downstream task. The pre-trained model is provided with the task-specific inputs and further trained for several epochs with the task’s objective and negative examples of the objective.

Our system is inspired by the work of Huang et al. (2019), where the WSD task is cast as a binary classification problem. The system is given the target word in context (the input sentence) and one sense of the word, separated by a special token ([SEP]). This configuration was originally used to predict whether two sentences follow each other in a text, but the learning power of the transformer architecture lets us learn this new task by simply changing the meaning of the fields in the input data while keeping the structure the same. We add a fully connected layer with a classification function on top of the transformer model’s layers to predict whether the target word in context matches the definition. This approach is particularly well suited for weak supervision and can generalise to word/sense pairs not previously seen in the training set. This overcomes the limitation of multi-class objective models, e.g. (Vial et al., 2019), which use a predefined sense inventory (as described above) and cannot generalise to unseen word/sense pairs. An illustration of our system is provided in Figure 1.

Figure 1: System overview
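A minimal sketch of such a setup, using the HuggingFace transformers library, pairs the context sentence with the candidate definition and adds a binary classification head on top of a pre-trained transformer. This is an illustration of the described configuration, not the submitted system's code.

# Sketch of the described setup: context sentence and candidate definition are
# fed as a sentence pair to a pre-trained transformer topped with a fully
# connected binary classification head (illustrative, not the authors' code).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # head is randomly initialised before fine-tuning

context = "I spent my spring holidays in Morocco ."
definition = "the season of growth"

# The tokenizer builds "[CLS] context [SEP] definition [SEP]".
inputs = tokenizer(context, definition, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape (1, 2): [no-match, match]
prediction = logits.argmax(dim=-1).item()    # 1 means the definition matches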
4 Experiments and Results

The system described in the previous section was adapted in several ways as we tested alternatives. We first compared different transformer models, such as BERT versus XLNet. We then concentrated our efforts on one transformer model, BERT-base-uncased, and performed other experiments to improve performance.

All experiments were run five times with different random seeds. We report the mean and standard deviation of the system’s performance on the metrics of accuracy and F1. We believe this is more informative than a single ‘best’ number. All models in these experiments are trained on the training set and evaluated on the dev set.

In addition to the experiments whose results are reported here, we tried a variety of other things, such as pooling methods (layers, ops), a Siamese network with shared encoders for the two input sentences, and alternative loss calculations. None of them gave better results in the time available.

4.1 Alternative Transformer Models

We compared the following pre-trained transformer models from the HuggingFace transformers library: XLNet (Yang et al., 2019), BERT (Devlin et al., 2018), and derived models including RoBERTa (Liu et al., 2019) and DistilBERT (Sanh et al., 2019).

Following standard practice, those pretrained models were used as feature detectors for a fine-tuning pass using the fully connected head layer. The results for those models are given in Table 3. The BERT-base-uncased model performed the best, so it was the basis for the further experiments described in the next section.

  Model                      Accuracy       F1
  XLNet-base-cased           .522 ± .030    .666 ± .020
  DistilBERT-base-uncased    .612 ± .017    .665 ± .017
  RoBERTa-base               .635 ± .074    .717 ± .030
  BERT-base-uncased          .723 ± .023    .751 ± .023

Table 3: Comparison of transformer model performance (lr = 5e-5; 3 epochs).

It is worth mentioning that no attempt was made to perform a hyperparameter optimization for each model. Instead, a single set of hyperparameters was used for all the models being compared.

4.2 Alternative BERT Strategies

Having selected the BERT-base-uncased pretrained transformer model, and staying with a single set of hyperparameters (learning rate = 5e-5 and 3 training epochs), there are still many different strategies that could be used to try to improve performance. The individual strategies are discussed below. The results for all the strategies are presented in Table 4.

  Model                 Accuracy       F1
  BERT-base-uncased     .723 ± .023    .751 ± .023
  +mask                 .699 ± .011    .748 ± .009
  +target emph          .725 ± .013    .752 ± .012
  +mean-pooling         .737 ± .017    .762 ± .015
  +freezing             .734 ± .008    .761 ± .010
  +data augment.        .749 ± .009    .752 ± .011
  +hypernyms            .726 ± .012    .755 ± .011

Table 4: Influence of the strategies on model performance. All strategies except +mask had a positive impact relative to the BERT-base-uncased baseline.

4.2.1 Masking the target word

We wondered whether the context of the target word was sufficient for the model to predict whether the definition is correct. By masking the target word in the input sentence, we test the ability of the model to learn solely from the contextual words; we hoped this might improve its generalisation. Masking led to a small decrease in performance. This small delta indicates that the non-target words in the context have a strong influence on the model’s prediction of the correct sense.

4.2.2 Emphasising the word of interest

We wondered about the impact of taking the opposite tack and calling out the target word. As illustrated in Figure 1, some transformer models make use of token type ids (segment token indices) to indicate the first and second portions of the inputs. We set the token type(s) of the target word in the input sentence to match that of the definition. Applying this strategy leads to a slight improvement in accuracy.
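One way to realise this idea is sketched below: the token_type_ids of the target word's wordpieces are flipped from segment 0 to segment 1. The matching logic and helper name are our assumptions, not details given by the paper.

# Sketch: emphasise the target word by setting its token_type_ids to the
# segment id of the definition (1). Illustrative only; details are assumptions.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def emphasise_target(context, definition, target):
    enc = tokenizer(context, definition, return_tensors="pt")
    # Find the target's wordpiece positions inside the first segment.
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for start in range(len(ids) - len(target_ids) + 1):
        if ids[start:start + len(target_ids)] == target_ids:
            enc["token_type_ids"][0, start:start + len(target_ids)] = 1
            break
    return enc

enc = emphasise_target("I spent my spring holidays in Morocco .",
                       "the season of growth", "spring")
print(enc["token_type_ids"])  # the positions of "spring" are now segment 1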
4.2.3 CLS vs. pooling over the token sequence

The community has developed several common ways to select the input for the binary classification head. We compare the performance obtained using the dedicated [CLS] token vector versus mean/max-pooling methods applied to the sequence of hidden states of selected layers of the transformer model. Applying mean-pooling to the last layer gave the best accuracy and F1 of the configurations tested.
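For example, mean-pooling over the last hidden layer can replace the [CLS] vector as input to the classification layer, as in the sketch below (the classifier class and its head are our illustration, not the authors' code).

# Sketch: mean-pooling the last-layer hidden states (ignoring padding) as an
# alternative to the [CLS] vector for the classification head. Illustrative only.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class MeanPoolClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.head = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        hidden = out.last_hidden_state                  # (batch, seq, dim)
        mask = attention_mask.unsqueeze(-1).float()     # (batch, seq, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)   # masked mean over tokens
        return self.head(pooled)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tokenizer("it's my go", "a usually brief attempt", return_tensors="pt")
logits = MeanPoolClassifier()(**enc)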
4.2.4 Weight Training vs. Freeze-then-Thaw

Another strategy centers on whether, and how, to update the pre-trained model parameters during fine-tuning, in addition to training the newly initialized fully connected head layer. Updating the pre-trained model would allow it to specialize on our downstream task, but might lead to “catastrophic forgetting”, where we destroy the benefit of the pre-trained model. One strategy the community has evolved (Bevilacqua and Navigli, 2020) first freezes the transformer model’s parameters for several epochs while the head layer receives the updates; later, the pre-trained parameters are unfrozen and updated too. This strategy provides some improvements in accuracy and F1.
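A freeze-then-thaw schedule can be sketched as follows; the epoch counts and the hypothetical training step are placeholders, not values reported in the paper.

# Sketch of a freeze-then-thaw schedule: train only the classification head for
# a few epochs, then unfreeze the encoder. Epoch counts are illustrative.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def set_encoder_trainable(model, trainable):
    for param in model.bert.parameters():  # the pre-trained encoder
        param.requires_grad = trainable

FROZEN_EPOCHS, TOTAL_EPOCHS = 2, 5
for epoch in range(TOTAL_EPOCHS):
    set_encoder_trainable(model, trainable=(epoch >= FROZEN_EPOCHS))
    # train_one_epoch(model, ...)  # hypothetical training step over the train set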
4.2.5 Data augmentation

Due to the small size of the training dataset, we experimented with data augmentation techniques while using only the data provided for the challenge. For each word-in-context/target-sense pair, we generated:

  • one positive example by replacing the target word with a random hypernym, if any exist;
  • one negative example by associating the target word with a random definition.

This strategy triples the size of the training dataset, and it gave the greatest improvement (3.6%) of all those tested. Further work could test the effect of more negative examples.
                                                                  clude: max sequence length of 256; train
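A minimal sketch of this augmentation, assuming each instance is a dictionary carrying its sentence, target word, definition and (possibly empty) hypernym list; the field names are illustrative:

```python
import random

def augment(instance, all_definitions):
    """Return up to two synthetic instances for one word-in-context/target-sense pair."""
    extra = []
    if instance["hypernyms"]:
        # positive example: replace the target word with one of its hypernyms
        positive = dict(instance)
        positive["sentence"] = instance["sentence"].replace(
            instance["target_word"], random.choice(instance["hypernyms"]), 1)
        positive["label"] = True
        extra.append(positive)
    # negative example: pair the original sentence with a random (mismatched) definition
    negative = dict(instance)
    negative["definition"] = random.choice(all_definitions)
    negative["label"] = False
    extra.append(negative)
    return extra
```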
4.2.6 Using Hypernyms (Subtask 3)

For the WiC-TSV challenge’s Subtask 3, the system can use the additional information of hypernyms of the target word. We simply concatenate the hypernyms to the definition. This strategy leads to a slight performance improvement, presumably because the hypernyms indirectly emphasize the intended sense of the target word.

5 Challenge Submission

The challenge allowed each participant to submit two results per task. However, there was no clear winner among the strategies above; most led to a minimal improvement with a substantial standard deviation. We therefore selected the submitted configurations by a grid search over common hyperparameter values, including the strategies mentioned previously. We used the train set for training and the dev set to measure the performance of each model in the grid search, with accuracy as the main evaluation metric. For Subtask 1 we opted for the following parameters:

  • Run1: BERT-base-uncased model trained for 3 epochs on the augmented dataset, with a learning rate of 7e-6 and emphasising the word of interest. Other parameters include a maximum sequence length of 256 and a train batch size of 32.
  • Run2: we kept the parameters from the previous run, updating the learning rate to 1e-5.

The results on the private test set of Subtask 1 are presented in Table 5. Run2 of our system demonstrated relative improvements of 3.3% in accuracy and 14.2% in precision over the baseline.

Model              Acc.   Prec.   Recall   F1
Baseline (BERT)    .753   .717    .849     .777
Run1               .775   .804    .736     .769
Run2               .778   .819    .722     .768

Table 5: Model results on Subtask 1 of the WiC-TSV challenge.

For Subtask 3 we arrived at the following parameters:

  • Run1: BERT-base-uncased model trained for 3 epochs on the original dataset, with a learning rate of 1e-5. Other parameters include a maximum sequence length of 256 and a train batch size of 32.
  • Run2: we kept the parameters from the previous run, extending the number of training epochs to 5.
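The Subtask 3 runs above reuse the hypernym handling from Section 4.2.6. A minimal sketch of how such an input pair could be built; the separator, names and tokenizer call are illustrative assumptions rather than our exact implementation:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def build_subtask3_input(sentence, definition, hypernyms):
    # append the hypernyms to the definition before tokenizing the usual pair
    gloss = definition
    if hypernyms:
        gloss = definition + " ; " + ", ".join(hypernyms)
    return tokenizer(sentence, gloss, truncation=True,
                     max_length=256, return_tensors="pt")
```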
The results on the private test set of Subtask 3 are presented in Table 6. Compared to using the sentence and definition alone, our naive approach to handling hypernyms hurt performance.

Model              Acc.   Prec.   Recall   F1
Baseline (BERT)    .766   .741    .828     .782
Run1               .694   .643    .893     .747
Run2               .719   .669    .885     .762

Table 6: Model results on Subtask 3 of the WiC-TSV challenge.
6 Discussion and Future Work

We applied transformer models to tackle a Word Sense Disambiguation challenge. As in much of current NLP research, pre-trained transformer models demonstrated a good ability to learn from few examples with high accuracy. Using different architecture modifications, in particular the use of the token type id to flag the word of interest along with automatically augmented data, our system demonstrated the best accuracy and precision in the competition and the third-best F1. There is still a noticeable gap to human performance on this dataset (85.3% accuracy), but the level of effort required to create these kinds of systems is easily within reach of small groups or individuals. Despite the test set having a very different distribution from the training/development sets, our system demonstrated similar performance on both the development and test sets.

An analysis of the errors produced by our best performing model on the dev set (Subtask 1, Run2) is presented in Table 7. It shows a mix of obvious errors and more ambiguous ones where it has been difficult for the model to draw conclusions from the limited context provided by the sentence. For instance, the short sentence “it’s my go” could very well correspond to the associated definition “a usually brief attempt” of the target word go.

As motivated by the construction of an augmented dataset, we believe that increasing the size of the training dataset would probably lead to improved performance, even without system changes. To test this hypothesis we measured the performance of our best model with increasing fractions of the training data. The results in Figure 2 show improvement as the fraction of the training dataset grows.

[Figure 2: accuracy and F1 plotted against the percentage of training data used]

Figure 2: Influence of training data size on model performance. We used the augmented dataset to reach a proportion of 3. Parameters from Subtask 1 Run2 were used for this comparison.

As a counterbalance to the positive note above, we must note that this challenge set up WSD as a binary classification problem. This is a considerable simplification of the more general sense-inventory approach, and further work will be needed to obtain similar accuracy in that regime.
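A minimal sketch of the training-size experiment mentioned above: the same configuration is retrained on growing fractions of the (augmented) training data and scored on the dev set. The `train_model` and `evaluate` callables stand in for the system's own training and evaluation routines and are assumptions of this sketch:

```python
import random

def learning_curve(train_data, dev_data, train_model, evaluate,
                   fractions=(0.25, 0.5, 0.75, 1.0), seed=42):
    # shuffle once so every fraction is a random but nested subset
    random.Random(seed).shuffle(train_data)
    scores = {}
    for frac in fractions:
        subset = train_data[: int(len(train_data) * frac)]
        model = train_model(subset)
        scores[frac] = evaluate(model, dev_data)   # e.g. accuracy and F1 on the dev set
    return scores
```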

Target Word     Pos.  Sentence                                                    Definition                                         Label  Pred
criticism       4     the senator received severe criticism from his opponent     a serious examination and judgment of something    F      T
go              3     it ’s my go                                                 a usually brief attempt                            F      T
reappearance    1     the reappearance of Halley ’s comet                         the act of someone appearing again                 F      T
continent       5     pioneers had to cross the continent on foot                 one of the large landmasses of the earth           T      F
rail            4     he was concerned with rail safety                           short for railway                                  T      F

Table 7: Examples of errors in the development set for the model used in Subtask 1 Run2.

References

Michele Bevilacqua and Roberto Navigli. 2020. Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2854–2864.

Anna Breit, Artem Revenko, Kiamehr Rezaee, Mohammad Taher Pilehvar, and Jose Camacho-Collados. 2020. WiC-TSV: An evaluation benchmark for target sense verification of words in context. arXiv preprint arXiv:2004.15016.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

E. Hovy, M. Marcus, Martha Palmer, L. Ramshaw, and R. Weischedel. 2006. OntoNotes: The 90% solution. In HLT-NAACL.

Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. arXiv preprint arXiv:1908.07245.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Daniel Loureiro, Kiamehr Rezaee, Mohammad Taher Pilehvar, and Jose Camacho-Collados. 2020. Language models and word sense disambiguation: An overview and analysis. arXiv preprint arXiv:2008.11608.
Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2):1–69.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Loïc Vial, Benjamin Lecouteux, and Didier Schwab. 2019. Sense vocabulary compression through the semantic knowledge of WordNet for neural word sense disambiguation. arXiv preprint arXiv:1905.05677.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.

Embeddings for the Lexicon: Modelling and Representation

Christian Chiarcos¹        Thierry Declerck²        Maxim Ionov¹
¹ Applied Computational Linguistics Lab, Goethe University, Frankfurt am Main, Germany
² DFKI GmbH, Multilinguality and Language Technology, Saarbrücken, Germany
{chiarcos,ionov}@informatik.uni-frankfurt.de
declerck@dfki.de

Abstract

In this paper, we propose an RDF model for representing embeddings for elements in lexical knowledge graphs (e.g. words and concepts), harmonizing two major paradigms in computational semantics on the level of representation, i.e., distributional semantics (using numerical vector representations) and lexical semantics (employing lexical knowledge graphs). Our model is part of a larger effort to extend a community standard for representing lexical resources, OntoLex-Lemon, with the means to connect it to corpus data. By allowing a conjoint modelling of embeddings and lexical knowledge graphs, we hope to contribute to the further consolidation of the field, its methodologies and the accessibility of the respective resources.

1 Background

With the rise of word and concept embeddings, lexical units at all levels of granularity have been subject to various approaches that embed them into numerical feature spaces, giving rise to a myriad of datasets with pre-trained embeddings, generated with different algorithms, that can be freely used in various NLP applications. With this paper, we present the current state of an effort to connect these embeddings with lexical knowledge graphs.

This effort is part of an extension of a widely used community standard for representing, linking and publishing lexical resources on the web, OntoLex-Lemon.¹ Our work aims to complement the emerging OntoLex module for representing Frequency, Attestation and Corpus Information (FrAC), which is currently being developed by the W3C Community Group “Ontology-Lexica”, as presented in [4]. There we addressed only frequency and attestations, whereas core aspects of corpus-based information such as embeddings were identified as a topic for future developments. Here, we describe possible use-cases for the latter and present our current model for this.

¹ https://www.w3.org/2016/05/ontolex/

2 Sharing Embeddings on the Web

Although word embeddings are often calculated on the fly, the community recognizes the importance of pre-trained embeddings: they are readily available (which saves time) and cover large quantities of text (whose replication would be energy- and time-intensive). Finally, a benefit of re-using embeddings is that they can be grounded in a well-defined and, possibly, shared feature space, whereas locally built embeddings (whose creation involves an element of randomness) would reside in an isolated feature space. This is particularly important in the context of multilingual applications, where collections of embeddings share a single feature space (e.g., in MUSE [6]).

Sharing embeddings, especially if calculated over a large amount of data, is not only an economical and ecological requirement, but unavoidable in many cases. For these, we suggest applying established web standards to provide metadata about the lexical component of such data. We are not aware of any provider of pre-trained word embeddings that come with machine-readable metadata. One example of such metadata is language information: while ISO 639 codes are sometimes used for this purpose, they are not given in a machine-readable way, but rather documented in human-readable form or given implicitly as part of file names.²

² See the ISO 639-1 code ‘en’ in FastText/MUSE files such as https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.en.vec.

2.1 Concept Embeddings

It is to be noted, however, that our focus is not so much on word embeddings, since the lexical information in this context is apparently trivial – plain tokens without any lexical information do not seem to require a structured approach to lexical semantics. This changes drastically for embeddings of more abstract lexical entities, e.g., word senses or concepts [17], which need to be synchronized between the embedding store and the lexical knowledge graph by which they are defined. WordNet [14] synset identifiers are a notorious example of the instability of concepts between different versions: synset 00019837-a means ‘incapable of being put up with’ in WordNet 1.71, but ‘established by authority’ in version 2.1. In WordNet 3.0, the first synset has the ID 02435671-a, the second 00179035-s.³

³ See https://github.com/globalwordnet/ili

The precompiled synset embeddings provided with AutoExtend [17] illustrate the consequences: the synset IDs seem to refer to WordNet 2.1 (wn-2.1-00001742-n), but use an ad hoc notation and are not in a machine-readable format. More importantly, however, there is no synset 00001742-n in Princeton WordNet 2.1. It does exist in Princeton WordNet 1.71, and it seems that the authors actually used this edition instead of the version 2.1 they refer to in their paper. Machine-readable metadata would not prevent such mistakes, but it would facilitate their verifiability. Given the current conventions in the field, such mistakes will very likely go unnoticed, leading to highly unexpected results in applications developed on this basis. Our suggestion is to use resolvable URIs as concept identifiers; if these provide machine-readable lexical data, lexical information about concept embeddings can be more easily verified and (this is another application) integrated with predictions from distributional semantics. Indeed, the WordNet community has adopted OntoLex-Lemon as an RDF-based representation schema and developed the Collaborative Interlingual Index (ILI or CILI) [3] to establish sense mappings across a large number of WordNets. Reference to ILI URIs would allow one to retrieve the lexical information behind a particular concept embedding, as the WordNet can be queried for the lexemes the concept (synset) is associated with. A versioning mismatch can then be easily spotted by computing the cosine distance between the word embeddings of these lexemes and the embedding of the concept presumably derived from them.
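A minimal sketch of this verification idea: if a concept embedding was really derived from a given WordNet version, it should be close to the embeddings of the lemmas that this version associates with the synset. The lookup functions and the threshold below are assumptions of the sketch, not part of any existing API:

```python
import numpy as np

def plausible_version(synset_id, lemmas, synset_vec, word_vec, threshold=0.5):
    # synset_vec / word_vec: assumed lookup functions returning numpy vectors
    concept = synset_vec(synset_id)
    lemma_mean = np.mean([word_vec(lemma) for lemma in lemmas], axis=0)
    cos = np.dot(concept, lemma_mean) / (np.linalg.norm(concept) * np.linalg.norm(lemma_mean))
    return cos >= threshold        # low similarity hints at a version mismatch
```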
2.2 Organizing Contextualized Embeddings

A related challenge is the organization of contextualized embeddings, which are not adequately identified by a string (say, a word form), but only by that string in a particular context. By providing a reference vocabulary to organize contextualized embeddings together with their respective context, this challenge can be overcome as well.

As a concrete use case for such information, consider a classical approach to Word Sense Disambiguation: the Lesk algorithm [12] uses a bag-of-words model to assess the similarity of a word in a given context (to be classified) with example sentences or definitions provided for the different senses of that word in a lexical resource (training data provided, for example, by resources such as WordNet, thesauri, or conventional dictionaries). The approach was very influential, although it always suffered from data sparsity issues, as it relied on string matches between a very limited collection of reference words and context words. With distributional semantics and word embeddings, such sparsity issues are easily overcome, as literal matches are no longer required.

Indeed, more recent methods such as BERT allow us to induce contextualized word embeddings [20]. Given only a single example for a particular word sense, this would already be a basis for a more informed word sense disambiguation. With more than one example per word sense, matters become a little more difficult, as the examples now need to be either aggregated into a single sense vector or linked to the particular word sense they pertain to (which requires a set of vectors as a data structure rather than a single vector). For data of the second type, we suggest following existing models for representing attestations (glosses, examples) for lexical entities, and representing contextualized embeddings (together with their context) as properties of attestations.
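A minimal sketch of the embedding-based variant of the Lesk idea: each sense is represented by the average of the contextualized vectors of its examples, and an occurrence is assigned to the closest sense by cosine similarity. The `embed(text, word)` helper stands for any contextualized encoder (e.g. a BERT-style model) returning a vector for `word` in `text`; it is an assumption of the sketch, not a fixed API:

```python
import numpy as np

def sense_vectors(examples_per_sense, embed):
    # aggregate all example vectors of a sense into a single sense vector
    return {
        sense: np.mean([embed(text, word) for text, word in examples], axis=0)
        for sense, examples in examples_per_sense.items()
    }

def disambiguate(context, word, senses, embed):
    v = embed(context, word)
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # pick the sense whose aggregated vector is most similar to the occurrence vector
    return max(senses, key=lambda s: cosine(v, senses[s]))
```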
3 Addressing the Modelling Challenge

In order to meet these requirements, we propose a formalism that allows the conjoint publication of lexical knowledge graphs and embedding stores. We assume that future approaches to access and publish embeddings on the web will be largely in line with high-level methods that abstract from details of the actual format and provide a conceptual view on the data set as a whole, as provided, for example, by the torchtext package of PyTorch.⁴

⁴ https://pytorch.org/text/
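As a small illustration of this kind of high-level access, torchtext hides the on-disk format of pre-trained vectors behind a simple lookup interface; the particular vector collection chosen below is only an example:

```python
from torchtext.vocab import GloVe

# downloads/caches the pre-trained vectors on first use
vectors = GloVe(name="6B", dim=100)
vec = vectors["lexicon"]     # 100-dimensional tensor for one word
```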

There is no need to limit ourselves to the currently dominating tabular structure of the commonly used formats for exchanging embeddings; access to (and the complexity of) the actual data will be hidden from the user.

A promising framework for the conjoint publication of embeddings and lexical knowledge graphs is, for example, RDF-HDT [9], a compact binary serialization of RDF graphs and the literal values they comprise. We would expect it to be integrated into programming workflows in a way similar to program-specific binary formats such as NumPy arrays in pickle [8].

3.1 Modelling Lexical Resources

Formalisms to represent lexical resources are manifold and have been a topic of discussion within the language resource community for decades, with standards such as LMF [10] or TEI-Dict [19] designed for the electronic edition of, and/or search in, individual dictionaries. Lexical data, however, does not exist in isolation, and synergies can be unleashed if information from different dictionaries is combined, e.g., for bootstrapping new bilingual dictionaries for languages X and Z by using another language Y and existing dictionaries for X → Y and Y → Z as a pivot.

Information integration beyond the level of individual dictionaries has thus become an important concern in the language resource community. One way to achieve this is to represent the data as a knowledge graph. The primary community standard for publishing lexical knowledge graphs on the web is OntoLex-Lemon [5].

OntoLex-Lemon defines four main classes of lexical entities, i.e., concepts in a lexical resource: (1) LexicalEntry, the representation of a lexeme in a lexical knowledge graph, which groups together the form(s) and sense(s), resp. concept(s), of a particular expression; (2) (lexical) Form, the written representation(s) of a particular lexical entry, with (optional) grammatical information; (3) LexicalSense, a word sense of one particular lexical entry; (4) LexicalConcept, an element of meaning with different lexicalizations, e.g., a WordNet synset.

[Figure 1]

Figure 1: OntoLex-Lemon core model

As the dominant vocabulary for lexical knowledge graphs on the web of data, the OntoLex-Lemon model has found wide adoption beyond its original focus on ontology lexicalization. In the WordNet Collaborative Interlingual Index [3] mentioned before, OntoLex vocabulary is used to provide a single interlingual identifier for every concept in every WordNet language, as well as machine-readable information about it (including links with various languages).

To broaden the spectrum of possible usages of the core model, various extensions have been developed by the community. This includes the emerging module for frequency, attestation and corpus-based information, FrAC.

The vocabulary elements introduced with FrAC so far cover frequency and attestations, with other aspects of corpus-based information described as prospective developments in our previous work [4]. Notable aspects of corpus information to be covered in such a module for the purpose of lexicographic applications include contextual similarity, co-occurrence information and embeddings.

Because of the enormous technical relevance of the latter in language technology and AI, and because of the requirements identified above for the publication of embedding information on the web, this paper focuses on embeddings.

3.2 OntoLex-FrAC

FrAC aims to (1) extend OntoLex with corpus information to address challenges in lexicography, (2) model lexical and distributional-semantic resources (dictionaries, embeddings) as RDF graphs, and (3) provide an abstract model of the relevant concepts in distributional semantics that facilitates applications integrating both lexical and distributional information. Figure 2 illustrates the FrAC concepts and properties for frequency and attestations (gray), along with new additions (blue) for embeddings as a major form of corpus-based information.
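To make the intended modelling more concrete, the following is a minimal sketch (in Python with rdflib) of how a concept embedding could be attached to an ontolex:LexicalConcept in RDF. The frac: namespace, class and property names used here are illustrative assumptions, not the final vocabulary of the module:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
FRAC = Namespace("http://example.org/frac#")            # assumed namespace for this sketch

g = Graph()
concept = URIRef("http://example.org/ili/i35545")       # illustrative concept URI
embedding = URIRef("http://example.org/embeddings/i35545")

g.add((concept, RDF.type, ONTOLEX.LexicalConcept))
g.add((embedding, RDF.type, FRAC.Embedding))            # assumed class
g.add((concept, FRAC.embedding, embedding))             # assumed property
g.add((embedding, FRAC.value, Literal("0.12 -0.03 0.77")))  # vector serialized as a literal

print(g.serialize(format="turtle"))
```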
Prior to the introduction of embeddings, the
