Specialising Word Vectors for Lexical Entailment

Ivan Vulić1 and Nikola Mrkšić2
1 Language Technology Lab, University of Cambridge
2 PolyAI
iv250@cam.ac.uk    nikola@poly-ai.com

Proceedings of NAACL-HLT 2018, pages 1134–1145, New Orleans, Louisiana, June 1–6, 2018. © 2018 Association for Computational Linguistics

Abstract

We present LEAR (Lexical Entailment Attract-Repel), a novel post-processing method that transforms any input word vector space to emphasise the asymmetric relation of lexical entailment (LE), also known as the IS-A or hyponymy-hypernymy relation. By injecting external linguistic constraints (e.g., WordNet links) into the initial vector space, the LE specialisation procedure brings true hyponymy-hypernymy pairs closer together in the transformed Euclidean space. The proposed asymmetric distance measure adjusts the norms of word vectors to reflect the actual WordNet-style hierarchy of concepts. Simultaneously, a joint objective enforces semantic similarity using the symmetric cosine distance, yielding a vector space specialised for both lexical relations at once. LEAR specialisation achieves state-of-the-art performance in the tasks of hypernymy directionality, hypernymy detection, and graded lexical entailment, demonstrating the effectiveness and robustness of the proposed asymmetric specialisation model.

1 Introduction

Word representation learning has become a research area of central importance in NLP, with its usefulness demonstrated across application areas such as parsing (Chen and Manning, 2014), machine translation (Zou et al., 2013), and many others (Turian et al., 2010; Collobert et al., 2011). Standard techniques for inducing word embeddings rely on the distributional hypothesis (Harris, 1954), using co-occurrence information from large textual corpora to learn meaningful word representations (Mikolov et al., 2013; Levy and Goldberg, 2014; Pennington et al., 2014; Bojanowski et al., 2017).

A major drawback of the distributional hypothesis is that it coalesces different relationships between words, such as synonymy and topical relatedness, into a single vector space. A popular solution is to go beyond stand-alone unsupervised learning and fine-tune distributional vector spaces by using external knowledge from human- or automatically-constructed knowledge bases. This is often done as a post-processing step, where distributional vectors are gradually refined to satisfy linguistic constraints extracted from lexical resources such as WordNet (Faruqui et al., 2015; Mrkšić et al., 2016), the Paraphrase Database (PPDB) (Wieting et al., 2015), or BabelNet (Mrkšić et al., 2017; Vulić et al., 2017a). One advantage of post-processing methods is that they treat the input vector space as a black box, making them applicable to any input space.

A key property of these methods is their ability to transform the vector space by specialising it for a particular relationship between words.1 Prior work has predominantly focused on distinguishing between semantic similarity and conceptual relatedness (Faruqui et al., 2015; Mrkšić et al., 2017; Vulić et al., 2017b). In this paper, we introduce a novel post-processing model which specialises vector spaces for the lexical entailment (LE) relation.

Word-level lexical entailment is an asymmetric semantic relation (Collins and Quillian, 1972; Beckwith et al., 1991). It is a key principle determining the organisation of semantic networks into hierarchical structures such as semantic ontologies (Fellbaum, 1998). Automatic reasoning about LE supports tasks such as taxonomy creation (Snow et al., 2006; Navigli et al., 2011), natural language inference (Dagan et al., 2013; Bowman et al., 2015), text generation (Biran and McKeown, 2013), and metaphor detection (Mohler et al., 2013).

Our novel LE specialisation model, termed LEAR (Lexical Entailment Attract-Repel), is inspired by ATTRACT-REPEL, a state-of-the-art general specialisation framework (Mrkšić et al., 2017).2 The key idea of LEAR, illustrated by Figure 1, is to pull desirable (ATTRACT) examples described by the constraints closer together, while at the same time pushing undesirable (REPEL) word pairs away from each other. Concurrently, LEAR (re-)arranges vector norms so that norm values in the Euclidean space reflect the hierarchical organisation of concepts according to the given LE constraints: put simply, higher-level concepts are assigned larger norms. Therefore, LEAR simultaneously captures the hierarchy of concepts (through vector norms) and their similarity (through their cosine distance). The two pivotal pieces of information are combined into an asymmetric distance measure which quantifies the LE strength in the specialised space.

After specialising four well-known input vector spaces with LEAR, we test them in three standard word-level LE tasks (Kiela et al., 2015b): 1) hypernymy directionality; 2) hypernymy detection; and 3) combined hypernymy detection/directionality. Our specialised vectors yield notable improvements over the strongest baselines for each task, with each input space, demonstrating the effectiveness and robustness of LEAR specialisation.

The employed asymmetric distance allows one to make graded assertions about hierarchical relationships between concepts in the specialised space. This property is evaluated using HyperLex, a recent graded LE dataset (Vulić et al., 2017). The LEAR-specialised vectors push state-of-the-art Spearman's correlation from 0.540 to 0.686 on the full dataset (2,616 word pairs), and from 0.512 to 0.705 on its noun subset (2,163 word pairs).

The code for the LEAR model is available from: github.com/nmrksic/lear.

1 Distinguishing between synonymy and antonymy has a positive impact on real-world language understanding tasks such as Dialogue State Tracking (Mrkšić et al., 2017).
2 https://github.com/nmrksic/attract-repel

[Figure 1 plots example vectors such as terrier, dog, animal, organism, laptop, computer, and machine in a three-dimensional space; only the caption is reproduced here.]

Figure 1: An illustration of LEAR specialisation. LEAR controls the arrangement of vectors in the transformed vector space by: 1) emphasising symmetric similarity of LE pairs through cosine distance (by enforcing small angles between terrier and dog, or dog and animal); and 2) imposing an LE ordering using vector norms, adjusting them so that higher-level concepts have larger norms (e.g., |animal| > |dog| > |terrier|).

2 Methodology

2.1 The ATTRACT-REPEL Framework

Let V be the vocabulary, A the set of ATTRACT word pairs (e.g., intelligent and brilliant), and R the set of REPEL word pairs (e.g., vacant and occupied). The ATTRACT-REPEL procedure operates over mini-batches of such pairs B_A and B_R. For ease of notation, let each word pair (x_l, x_r) in these two sets correspond to a vector pair (x_l, x_r), so that a mini-batch of k_1 word pairs is given by B_A = [(x_l^1, x_r^1), ..., (x_l^{k_1}, x_r^{k_1})] (similarly for B_R, which consists of k_2 example pairs).

Next, the sets of pseudo-negative examples T_A = [(t_l^1, t_r^1), ..., (t_l^{k_1}, t_r^{k_1})] and T_R = [(t_l^1, t_r^1), ..., (t_l^{k_2}, t_r^{k_2})] are defined as pairs of negative examples for each ATTRACT and REPEL example pair in mini-batches B_A and B_R. These negative examples are chosen from the word vectors present in B_A or B_R so that, for each ATTRACT pair (x_l, x_r), the negative example pair (t_l, t_r) is chosen so that t_l is the vector closest (in terms of cosine distance) to x_l and t_r is closest to x_r. Similarly, for each REPEL pair (x_l, x_r), the negative example pair (t_l, t_r) is chosen from the remaining in-batch vectors so that t_l is the vector furthest away from x_l and t_r is furthest from x_r.

The negative examples are used to: a) force ATTRACT pairs to be closer to each other than to their respective negative examples; and b) force REPEL pairs to be further away from each other than from their negative examples. The first term of the cost function pulls ATTRACT pairs together:

Att(B_A, T_A) = Σ_{i=1}^{k_1} [ τ(δ_att + cos(x_l^i, t_l^i) − cos(x_l^i, x_r^i)) + τ(δ_att + cos(x_r^i, t_r^i) − cos(x_l^i, x_r^i)) ]    (1)

where cos denotes cosine similarity, τ(x) = max(0, x) is the hinge loss function, and δ_att is the attract margin which determines how much closer these vectors should be to each other than to their respective negative examples. The second part of the cost function pushes REPEL word pairs away from each other:

Rep(B_R, T_R) = Σ_{i=1}^{k_2} [ τ(δ_rep + cos(x_l^i, x_r^i) − cos(x_l^i, t_l^i)) + τ(δ_rep + cos(x_l^i, x_r^i) − cos(x_r^i, t_r^i)) ]    (2)

In addition to these two terms, a regularisation term is used to preserve the abundance of high-quality semantic content present in the distributional vector space, as long as this information does not contradict the injected linguistic constraints. If V(B) is the set of all word vectors present in the given mini-batch, then:

Reg(B_A, B_R) = Σ_{x_i ∈ V(B_A ∪ B_R)} λ_reg ‖x̂_i − x_i‖_2

where λ_reg is the L2 regularisation constant and x̂_i denotes the original (distributional) word vector for word x_i. The full ATTRACT-REPEL cost function is given by the sum of all three terms.

2.2 LEAR: Encoding Lexical Entailment

In this section, the ATTRACT-REPEL framework is extended to model lexical entailment jointly with (symmetric) semantic similarity. To do this, the method uses an additional source of external lexical knowledge: let L be the set of directed lexical entailment constraints such as (corgi, dog), (dog, animal), or (corgi, animal), with lower-level concepts on the left and higher-level ones on the right (the source of these constraints is discussed in Section 3). The optimisation proceeds in the same way as before, considering a mini-batch of LE pairs B_L consisting of k_3 word pairs standing in the (directed) lexical entailment relation.

Unlike symmetric similarity, lexical entailment is an asymmetric relation which encodes a hierarchical ordering between concepts. Inferring the direction of the entailment relation between word vectors requires an asymmetric distance function. We define three different ones, all of which use the word vectors' norms to impose an ordering between high- and low-level concepts:

D_1(x, y) = |x| − |y|    (3)
D_2(x, y) = (|x| − |y|) / (|x| + |y|)    (4)
D_3(x, y) = (|x| − |y|) / max(|x|, |y|)    (5)

The lexical entailment term (for the j-th asymmetric distance, j ∈ {1, 2, 3}) is defined as:

LE_j(B_L) = Σ_{i=1}^{k_3} D_j(x_i, y_i)    (6)

The first distance serves as the baseline: it uses the word vectors' norms to order the concepts, that is, to decide which of the words is likely to be the higher-level concept. In this case, the magnitude of the difference between the two norms determines the 'intensity' of the LE relation. This is potentially problematic, as this distance does not impose a limit on the vectors' norms. The second and third metrics take a more sophisticated approach, using the ratio between the difference of the two norms and either: a) the sum of the two norms; or b) the larger of the two norms. In doing so, these metrics ensure that the cost function considers only the norms' ratios. This means that the cost function no longer has an incentive to increase word vectors' norms past a certain point, as the magnitudes of norm ratios grow much faster than the linear relation defined by the first distance function.

To model the semantic and the LE relations jointly, the LEAR cost function optimises the four terms of the expanded cost function:

C(B_A, T_A, B_R, T_R, B_L, T_L) = Att(B_A, T_A) + Rep(B_R, T_R) + Reg(B_A, B_R, B_L) + Att(B_L, T_L) + LE_j(B_L)

LE Pairs as ATTRACT Constraints  The combined cost function makes use of the batch of lexical constraints B_L twice: once in the defined asymmetric cost function LE_j, and once in the symmetric ATTRACT term Att(B_L, T_L). This means that words standing in the lexical entailment relation are forced to be similar both in terms of cosine distance (via the symmetric ATTRACT term) and in terms of the asymmetric LE distance from Eq. (6).

Decoding Lexical Entailment  The defined cost function serves to encode semantic similarity and
LE relations in the same vector space. Whereas the similarity can be inferred from the standard cosine distance, the LEAR optimisation embeds lexical entailment as a combination of the symmetric ATTRACT term and the newly defined asymmetric LE_j cost function. Consequently, the metric used to determine whether two words stand in the LE relation must combine the two cost terms as well. We define the LE decoding metric as:

I_LE(x, y) = dcos(x, y) + D_j(x, y)    (7)

where dcos(x, y) denotes the cosine distance. This decoding function combines the symmetric and the asymmetric cost terms, in line with the combination of the two used to perform LEAR specialisation. In the evaluation, we show that combining the two cost terms has a synergistic effect, with both terms contributing to stronger performance across all LE tasks used for evaluation.

3 Experimental Setup

Starting Distributional Vectors  To test the robustness of LEAR specialisation, we experiment with a variety of well-known, publicly available English word vectors: 1) Skip-Gram with Negative Sampling (SGNS) (Mikolov et al., 2013) trained on the Polyglot Wikipedia (Al-Rfou et al., 2013) by Levy and Goldberg (2014); 2) GLOVE Common Crawl (Pennington et al., 2014); 3) CONTEXT2VEC (Melamud et al., 2016), which replaces CBOW contexts with contexts based on bidirectional LSTMs (Hochreiter and Schmidhuber, 1997); and 4) FASTTEXT (Bojanowski et al., 2017), an SGNS variant which builds word vectors as the sum of their constituent character n-gram vectors.3

Linguistic Constraints  We use three groups of linguistic constraints in the LEAR specialisation model, covering three different relation types which are all beneficial to the specialisation process: 1) directed lexical entailment (LE) pairs; 2) synonymy pairs; and 3) antonymy pairs. Synonyms are included as symmetric ATTRACT pairs (i.e., the B_A pairs) since they can be seen as defining a trivial symmetric IS-A relation (Rei and Briscoe, 2014; Vulić et al., 2017). For a similar reason, antonyms are clear REPEL constraints as they anti-correlate with the LE relation.4 Synonymy and antonymy constraints are taken from prior work (Zhang et al., 2014; Ono et al., 2015): they are extracted from WordNet (Fellbaum, 1998) and Roget (Kipfer, 2009). In total, we work with 1,023,082 synonymy pairs (11.7 synonyms per word on average) and 380,873 antonymy pairs (6.5 per word).5

As in prior work (Nguyen et al., 2017; Nickel and Kiela, 2017), LE constraints are extracted from the WordNet hierarchy, relying on the transitivity of the LE relation. This means that we include both direct and indirect LE pairs in our set of constraints (e.g., (pangasius, fish), (fish, animal), and (pangasius, animal)). We retained only noun-noun and verb-verb pairs, while the rest were discarded: the final number of LE constraints is 1,545,630.6

Training Setup  We adopt the original ATTRACT-REPEL model setup without any fine-tuning. Hyperparameter values are set to: δ_att = 0.6, δ_rep = 0.0, λ_reg = 10^-9 (Mrkšić et al., 2017). The models are trained for 5 epochs with the AdaGrad algorithm (Duchi et al., 2011), with batch sizes set to k_1 = k_2 = k_3 = 128 for faster convergence.

4 Results and Discussion

We test and analyse LEAR-specialised vector spaces in two standard word-level LE tasks used in prior work: hypernymy directionality and detection (Section 4.1) and graded LE (Section 4.2).

4.1 LE Directionality and Detection

The first evaluation uses three classification-style tasks with increasing levels of difficulty. The tasks are evaluated on three datasets used extensively in the LE literature (Roller et al., 2014; Santus et al., 2014; Weeds et al., 2014; Shwartz et al., 2017; Nguyen et al., 2017), compiled into an integrated evaluation set by Kiela et al. (2015b).7

3 All vectors are 300-dimensional except for the 600-dimensional CONTEXT2VEC vectors; for further details regarding the architectures and training setup of the used vector collections, we refer the reader to the original papers. We also experimented with dependency-based SGNS vectors (Levy and Goldberg, 2014), observing similar patterns in the results.
4 In short, the question "Is X a type of X?" (synonymy) is trivially true, while the question "Is ¬X a type of X?" (antonymy) is trivially false.
5 https://github.com/tticoin/AntonymDetection
6 We also experimented with an additional 30,491 LE constraints from the Paraphrase Database (PPDB) 2.0 (Pavlick et al., 2015). Adding them to the WordNet-based LE pairs makes no significant impact on the final performance. We also used synonymy and antonymy pairs from other sources, such as word pairs from PPDB used previously by Wieting et al. (2015), and BabelNet (Navigli and Ponzetto, 2012) used by Mrkšić et al. (2017), reaching the same conclusions.
7 http://www.cl.cam.ac.uk/~dk427/generality.html

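The tasks below score word pairs with the norm-based asymmetric distances of Eqs. (3)–(5) and the LE term of Eq. (6). A minimal sketch of these scoring functions (the helper names `d1`–`d3` and `le_term` are ours, not from the released code):

```python
import math

def norm(x):
    # Euclidean (L2) norm of a vector.
    return math.sqrt(sum(v * v for v in x))

def d1(x, y):
    # Eq. (3): raw norm difference; unbounded.
    return norm(x) - norm(y)

def d2(x, y):
    # Eq. (4): norm difference scaled by the sum of the norms; bounded in (-1, 1).
    return (norm(x) - norm(y)) / (norm(x) + norm(y))

def d3(x, y):
    # Eq. (5): norm difference scaled by the larger norm; bounded in (-1, 1).
    return (norm(x) - norm(y)) / max(norm(x), norm(y))

def le_term(batch, dist):
    # Eq. (6): sum of the chosen asymmetric distance over LE pairs,
    # with x the lower-level and y the higher-level concept, so a
    # correctly ordered batch yields a negative (low) cost.
    return sum(dist(x, y) for x, y in batch)
```

For a correctly ordered pair (hyponym first, hypernym second, hypernym with the larger norm) all three distances are negative; only D2 and D3 are invariant to rescaling both norms.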
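The constraint set of Section 3 expands direct WordNet IS-A links into indirect ones via transitivity. That expansion can be sketched generically as follows; this is a toy transitive closure over hand-written edges, not the authors' WordNet extraction pipeline:

```python
def transitive_le_pairs(direct):
    """Expand a set of direct (hyponym, hypernym) edges into the full set
    of direct + indirect LE constraints via transitive closure."""
    pairs = set(direct)
    changed = True
    while changed:
        changed = False
        # Join any two edges (a, b) and (b, d) into the indirect pair (a, d).
        for (a, b) in list(pairs):
            for (c, d) in list(pairs):
                if b == c and (a, d) not in pairs:
                    pairs.add((a, d))
                    changed = True
    return pairs

# Toy edges mirroring the paper's example: direct links only;
# the closure additionally yields the indirect pair (pangasius, animal).
direct = {("pangasius", "fish"), ("fish", "animal")}
```

A quadratic fixed-point closure like this is fine for an illustration; over the full WordNet hierarchy one would instead walk hypernym paths per synset.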
[Figure 2 shows three panels of bar charts (accuracy between 0.80 and 1.00 for the SGNS, GloVe, context2vec, and fastText collections under D1, D2, and D3): (a) BLESS: Directionality; (b) WBLESS: Detection; (c) BIBLESS: Detect+direct. Only the caption is reproduced here.]

Figure 2: Summary of the results on three different word-level LE subtasks: (a) directionality; (b) detection; (c) detection and directionality. Vertical bars denote the results obtained by different input word vector spaces which are post-processed/specialised by our LEAR specialisation model using three variants of the asymmetric distance (D1, D2, D3); see Section 2. Thick horizontal red lines refer to the best reported scores on each subtask for these datasets; the baseline scores are taken from Nguyen et al. (2017).
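The scores behind these comparisons are produced by the decoding metric of Eq. (7), which adds an asymmetric norm-based distance to the cosine distance; directionality additionally reduces to a norm comparison. A minimal sketch, assuming the D2 variant of Eq. (4) (the helper names are ours):

```python
import math

def _norm(x):
    # Euclidean (L2) norm of a vector.
    return math.sqrt(sum(v * v for v in x))

def cosine_distance(x, y):
    # dcos(x, y) = 1 - cosine similarity; symmetric in its arguments.
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (_norm(x) * _norm(y))

def d2(x, y):
    # Asymmetric norm-based distance of Eq. (4).
    return (_norm(x) - _norm(y)) / (_norm(x) + _norm(y))

def le_score(x, y):
    # Eq. (7): I_LE(x, y) = dcos(x, y) + D_j(x, y).
    # Given Eqs. (4) and (7), a hyponym-hypernym pair (small angle,
    # smaller norm first) receives a low score.
    return cosine_distance(x, y) + d2(x, y)

def predict_hypernym(x, y):
    # Directionality decision: the concept with the larger norm
    # is predicted to be the hypernym.
    return "y" if _norm(y) > _norm(x) else "x"
```

Swapping the arguments flips the sign of the asymmetric part while leaving the cosine part unchanged, which is what makes the score direction-sensitive.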

The first task, LE directionality, is conducted on 1,337 LE pairs originating from the BLESS evaluation set (Baroni and Lenci, 2011). Given a true LE pair, the task is to predict the correct hypernym. With LEAR-specialised vectors this is achieved by simply comparing the vector norms of the two concepts in a pair: the one with the larger norm is the hypernym (see Figure 1).

The second task, LE detection, involves a binary classification on the WBLESS dataset (Weeds et al., 2014), which comprises 1,668 word pairs standing in a variety of relations (LE, meronymy-holonymy, co-hyponymy, reversed LE, no relation). The model has to detect a true LE pair, that is, to distinguish the pairs for which the statement X is a (type of) Y is true from all other pairs. With LEAR vectors, this classification is based on the asymmetric distance score: if the score is above a certain threshold, we classify the pair as "true LE", otherwise as "other". While Kiela et al. (2015b) manually define the threshold value, we follow the approach of Nguyen et al. (2017) and cross-validate: in each of 1,000 iterations, 2% of the pairs are sampled for threshold tuning, and the remaining 98% are used for testing. The reported numbers are therefore average accuracy scores.8

The final task, LE detection and directionality, concerns a three-way classification on BIBLESS, a relabelled version of WBLESS. The task is now to distinguish both LE pairs (→ 1) and reversed LE pairs (→ −1) from other relations (→ 0), and then additionally select the correct hypernym in each detected LE pair. We apply the same test protocol as in the LE detection task.

Results and Analysis  The original paper of Kiela et al. (2015b) reports the following best scores on each task: 0.88 (BLESS), 0.75 (WBLESS), and 0.57 (BIBLESS). These scores were recently surpassed by Nguyen et al. (2017), who, instead of post-processing, combine WordNet-based constraints with an SGNS-style objective in a joint model. They report the best scores to date: 0.92 (BLESS), 0.87 (WBLESS), and 0.81 (BIBLESS).

The performance of the four LEAR-specialised word vector collections is shown in Figure 2 (together with the strongest baseline scores for each of the three tasks). The comparative analysis confirms the increasing complexity of the subsequent tasks. LEAR specialisation of each of the starting vector spaces consistently outperforms all baseline scores across all three tasks. The extent of the improvements is correlated with task difficulty: it is lowest for the easiest directionality task (0.92 → 0.96), and highest for the most difficult detection plus directionality task (0.81 → 0.88).

8 We have conducted further LE directionality and detection experiments on other datasets such as EVALution (Santus et al., 2015), the N1 ⊑ N2 dataset of Baroni et al. (2012), and the dataset of Lenci and Benotto (2012), with similar performance and findings. We do not report all these results for brevity and clarity of presentation.

The results show that the two LEAR variants which do not rely on absolute norm values and
             Norm                Norm                       Norm
terrier      0.87   laptop      0.60   cabriolet           0.74
dog          2.64   computer    2.96   car                 3.59
mammal       8.57   machine     6.15   vehicle             7.78
vertebrate  10.96   device     12.09   transport           8.01
animal      11.91   artifact   17.71   instrumentality    14.56
organism    20.08   object     23.55   –                       –

Table 1: L2 norms for selected concepts from the WordNet hierarchy. Input: FASTTEXT; LEAR: D2.

perform a normalisation step in the asymmetric distance (D2 and D3) have an edge over the D1 variant, which operates with unbounded norms. The difference in performance between D2/D3 and D1 is even more pronounced in the graded LE task (see Section 4.2). This shows that the use of unbounded vector norms diminishes the importance of the symmetric cosine distance in the combined asymmetric distance. Conversely, the synergistic combination used in D2/D3 does not suffer from this issue.

   The high scores achieved with each of the four word vector collections show that LEAR is not dependent on any particular word representation architecture. Moreover, the extent of the performance improvements in each task suggests that LEAR is able to reconstruct the concept hierarchy coded in the input linguistic constraints.

   We have also conducted a small experiment to verify that the LEAR method can generalise beyond what is directly coded in the pairwise external constraints. A simple WordNet lookup baseline yields accuracy scores of 0.82 and 0.80 on the directionality and detection tasks, respectively. This baseline is outperformed by LEAR: its scores are 0.96 and 0.92 on the two tasks when relying on the same set of WordNet constraints.

Importance of Vector Norms To verify that the knowledge concerning the position in the semantic hierarchy indeed resides in the vector norms, we also manually inspect the norms after LEAR specialisation. A few examples are provided in Table 1. They indicate a desirable pattern in the norm values which imposes a hierarchical ordering on the concepts. Note that the original distributional SGNS model (Mikolov et al., 2013) does not normalise vectors to unit length after training. However, these norms are not at all correlated with the desired hierarchical ordering, and are therefore useless for LE-related applications: the non-specialised distributional SGNS model scores 0.44, 0.48, and 0.34 on the three tasks, respectively.

4.2   Graded Lexical Entailment

Asymmetric distances in the LEAR-specialised space quantify the degree of lexical entailment between any two concepts. This means that they can be used to make fine-grained assertions regarding the hierarchical relationships between concepts. We test this property on HyperLex (Vulić et al., 2017), a gold standard dataset for evaluating how well word representation models capture graded LE, grounded in the notions of concept (proto)typicality (Rosch, 1973; Medin et al., 1984) and category vagueness (Kamp and Partee, 1995; Hampton, 2007) from cognitive science. HyperLex contains 2,616 word pairs (2,163 noun pairs and 453 verb pairs) scored by human raters in the [0, 6] interval following the question “To what degree is X a (type of) Y?”9

   As shown by the high inter-annotator agreement on HyperLex (0.85), humans are able to consistently reason about graded LE.10 However, current state-of-the-art representation architectures are far from this ceiling. For instance, Vulić et al. (2017) evaluate a plethora of architectures and report a high score of only 0.320 (see the summary table in Figure 3). Two recent representation models (Nickel and Kiela, 2017; Nguyen et al., 2017), focused on the LE relation in particular (and employing the same set of WordNet-based constraints as LEAR), report the highest scores of 0.540 (on the entire dataset) and 0.512 (on the noun subset).

Results and Analysis We scored all HyperLex pairs using the combined asymmetric distance described by Equation (7), and then computed Spearman’s rank correlation with the ground-truth ranking. Our results, together with the strongest baseline scores, are summarised in Figure 3.

   The summary table in Figure 3(c) shows the HyperLex performance of several prominent LE models. We provide only a quick outline of these models here; further details can be found in the original papers. FREQ-RATIO exploits the fact that more general concepts tend to occur more frequently in textual corpora. SGNS (COS) uses non-specialised

   9 From another perspective, one might say that graded LE provides finer-grained human judgements on a continuous scale rather than simplifying the judgements into binary discrete decisions. For instance, the HyperLex score for the pair (girl, person) is 5.91/6, the score for (guest, person) is 4.33, while the score for the reversed pair (person, guest) is 1.73.
   10 For further details concerning HyperLex, we refer the reader to the resource paper (Vulić et al., 2017). The dataset is available at: http://people.ds.cam.ac.uk/iv250/hyperlex.html
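As a concrete (if simplified) illustration of how a combined asymmetric measure behaves, the sketch below pairs the symmetric cosine distance with two norm-based terms in the spirit of D1 and D2. The exact functional forms used by LEAR are those of Equation (7); the particular ratio used for the D2-style term here, and all names in the snippet, are illustrative assumptions rather than the paper's definitions.

```python
import math

def norm(v):
    # Euclidean (L2) norm of a vector given as a sequence of floats.
    return math.sqrt(sum(x * x for x in v))

def cosine_distance(x, y):
    # Symmetric term: 1 - cos(x, y); independent of vector norms.
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (norm(x) * norm(y))

def norm_term_d1(x, y):
    # D1-style term: unbounded absolute norm difference.
    return norm(x) - norm(y)

def norm_term_d2(x, y):
    # D2-style term (illustrative): norm difference rescaled to [-1, 1],
    # so it cannot swamp the bounded cosine term.
    nx, ny = norm(x), norm(y)
    return (nx - ny) / (nx + ny)

def asymmetric_distance(x, y, norm_term=norm_term_d2):
    # Combined score: low when x points in a similar direction to y
    # but has the smaller norm, i.e. when x is a likely hyponym of y.
    return cosine_distance(x, y) + norm_term(x, y)

# Toy vectors mimicking Table 1: a shared direction with growing norms.
unit = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
terrier = tuple(0.87 * c for c in unit)
dog = tuple(2.64 * c for c in unit)

# (terrier, dog) scores lower than the reversed pair (dog, terrier).
assert asymmetric_distance(terrier, dog) < asymmetric_distance(dog, terrier)
```

The bounded D2-style term keeps the norm signal on the same scale as the cosine term, which is the intuition behind the D2/D3 advantage discussed above.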

[Figure 3: bar charts (not reproduced here) plotting Spearman’s ρ correlation with HyperLex for the D1, D2, and D3 variants across four word vector collections (SGNS, GloVe, context2vec, fastText); y-axis range 0.50–0.70. Panels: (a) HyperLex: All; (b) HyperLex: Nouns; (c) summary table, reproduced below.]

Model               All
FREQ-RATIO          0.279
SGNS (COS)          0.205
SLQS-SIM            0.228
VISUAL              0.209
WN-BEST             0.234
WORD2GAUSS          0.206
SIM-SPEC            0.320
ORDER-EMB           0.191
POINCARÉ (nouns)    0.512
HYPERVEC            0.540
Best LEAR           0.686

Figure 3: Results on the graded LE task defined by HyperLex. Following Nickel and Kiela (2017), we use Spearman’s rank correlation scores on: a) the entire dataset (2,616 noun and verb pairs); and b) its noun subset (2,163 pairs). The summary table shows the performance of other well-known architectures on the full HyperLex dataset, compared to the best results achieved using LEAR specialisation.
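The evaluation protocol behind these numbers (score every HyperLex pair with the asymmetric distance, then compute Spearman's ρ against the human ratings) can be sketched with a self-contained rank-correlation implementation. The ratings and model scores below are toy stand-ins, not HyperLex data.

```python
def rankdata(values):
    # Average ranks (1-based), with ties sharing their mean rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    # Spearman's rho = Pearson correlation of the rank vectors.
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy stand-ins for human graded-LE ratings and model scores;
# model scores would come from the (negated) asymmetric distance,
# so that higher means stronger entailment.
human = [5.91, 4.33, 1.73, 0.20]
model = [0.90, 0.70, 0.30, 0.10]

print(round(spearman_rho(model, human), 3))  # -> 1.0 (concordant toy ranks)
```

Only the ranks matter, so any monotone transformation of the model's scores leaves the reported correlation unchanged.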

SGNS vectors and quantifies the LE strength using the symmetric cosine distance between vectors. A comparison of these models to the best-performing LEAR vectors shows the extent of the improvements achieved using the specialisation approach.

   LEAR-specialised vectors also outperform SLQS-SIM (Santus et al., 2014) and VISUAL (Kiela et al., 2015b), two LE detection models similar in spirit to LEAR. These models combine symmetric semantic similarity (through cosine distance) with an asymmetric measure of lexical generality obtained either from text (SLQS-SIM) or visual data (VISUAL). The results on HyperLex indicate that the two generality-based measures are too coarse-grained for graded LE judgements. These models were originally constructed to tackle LE directionality and detection tasks (see Section 4.1), but their performance is surpassed by LEAR on those tasks as well. The VISUAL model outperforms SLQS-SIM; however, its numbers on BLESS (0.88), WBLESS (0.75), and BIBLESS (0.57) are far from those of the top-performing LEAR vectors (0.96, 0.92, 0.88).11

   WN-BEST denotes the best result with asymmetric similarity measures which use the WordNet structure as their starting point (Wu and Palmer, 1994; Pedersen et al., 2004). This approach directly looks up the full WordNet structure to reason about graded lexical entailment. The reported results from Figure 3(c) suggest it is more effective to quantify the LE relation strength by using WordNet as the source of constraints for specialisation models such as HYPERVEC or LEAR.

   WORD2GAUSS (Vilnis and McCallum, 2015) represents words as multivariate K-dimensional Gaussians rather than points in the embedding space: it is therefore naturally asymmetric and has been used in LE tasks before, but its performance on HyperLex indicates that it cannot effectively capture the subtleties required to model graded LE. However, note that the comparison is not strictly fair, as WORD2GAUSS does not leverage any external knowledge. An interesting line for future research is to embed external knowledge within this representation framework.

   Most importantly, LEAR outperforms three recent (and conceptually different) architectures: ORDER-EMB (Vendrov et al., 2016), POINCARÉ (Nickel and Kiela, 2017), and HYPERVEC (Nguyen et al., 2017). Like LEAR, all of these models complement distributional knowledge with external linguistic constraints extracted from WordNet. Each model uses a different strategy to exploit the hierarchical relationships encoded in these constraints (their approaches are discussed in Section 5).12 However, LEAR, as the first LE-oriented post-processor, is able to utilise the constraints more effectively than its competitors. Another advantage of LEAR is its applicability to any input

   11 We note that SLQS and VISUAL do not leverage any external knowledge from WordNet, but the VISUAL model leverages external visual information about concepts.
   12 As discussed previously by Vulić et al. (2017), the off-the-shelf ORDER-EMB vectors were trained for the binary ungraded LE detection task: this limits their expressiveness in the graded LE task.
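The WordNet-based measures above (and the WordNet lookup baseline from Section 4.1) operate directly over the taxonomy graph. As a self-contained illustration, a lookup baseline for LE directionality can be sketched over a toy hypernym graph; a real implementation would instead walk WordNet's hypernym closure via a library such as NLTK, so the graph below is a hypothetical stand-in.

```python
# Toy hypernym edges (child -> parents); a real baseline would read
# these from WordNet's hypernym relation instead of a hand-made dict.
HYPERNYMS = {
    "terrier": ["dog"],
    "dog": ["mammal"],
    "mammal": ["vertebrate"],
    "vertebrate": ["animal"],
    "animal": ["organism"],
}

def ancestors(word):
    # Transitive closure over the hypernym edges.
    seen, stack = set(), list(HYPERNYMS.get(word, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(HYPERNYMS.get(parent, []))
    return seen

def entails(x, y):
    # Directional lookup: does x IS-A y hold anywhere up the hierarchy?
    return y in ancestors(x)

assert entails("terrier", "animal")       # hyponym -> hypernym holds
assert not entails("animal", "terrier")   # the reversed pair fails
```

Because the lookup is directional by construction, it trivially handles directionality, but it can only answer for pairs actually connected in the resource, which is the generalisation gap LEAR closes.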

LEAR variant    WBLESS   BIBLESS   HL-A    HL-N
SYM-ONLY        0.687    0.679     0.469   0.429
ASYM-ONLY       0.867    0.824     0.529   0.565
FULL            0.912    0.875     0.686   0.705

Table 2: Analysing the importance of the synergy in the FULL LEAR model on the final performance on WBLESS, BIBLESS, HyperLex-All (HL-A) and HyperLex-Nouns (HL-N). Input: FASTTEXT; D2.

vector space.

   Figures 3(a) and 3(b) indicate that the two LEAR variants which rely on norm ratios (D2 and D3), rather than on absolute (unbounded) norm differences (D1), achieve stronger performance on HyperLex. The highest correlation scores are again achieved by D2 with all input vector spaces.

4.3   Further Discussion

Why Symmetric + Asymmetric? In another experiment, we analyse the contributions of both LE-related terms in the combined LEAR objective function (see Section 2.2). We compare three variants of LEAR: 1) a symmetric variant which does not arrange vector norms using the LEj(BL) term (SYM-ONLY); 2) a variant which arranges norms, but does not use LE constraints as additional symmetric ATTRACT constraints (ASYM-ONLY); and 3) the full LEAR model, which uses both cost terms (FULL). The results with one input space are shown in Table 2 (similar results are achieved with the others). The table shows that, while the stand-alone ASYM-ONLY term is more beneficial than the SYM-ONLY one, using the two terms jointly yields the strongest performance across all LE tasks.

LE and Semantic Similarity We also test whether the asymmetric LE term harms the (norm-independent) cosine distances used to represent semantic similarity. The LEAR model is compared to the original ATTRACT-REPEL model making use of the same set of linguistic constraints. Two true semantic similarity datasets are used for evaluation: SimLex-999 (Hill et al., 2015) and SimVerb-3500 (Gerz et al., 2016). There is no significant difference in performance between the two models, both of which yield similar results on SimLex (Spearman’s rank correlation of ≈ 0.71) and SimVerb (≈ 0.70). This shows that cosine distances are preserved during the optimisation of the asymmetric objective performed by the joint LEAR model.

5   Related Work

Vector Space Specialisation A standard approach to incorporating external information into vector spaces is to pull the representations of similar words closer together. Some models integrate such constraints into the training procedure: they modify the prior or the regularisation (Yu and Dredze, 2014; Xu et al., 2014; Bian et al., 2014; Kiela et al., 2015a), or use a variant of the SGNS-style objective (Liu et al., 2015; Osborne et al., 2016; Nguyen et al., 2017). Another class of models, popularly termed retrofitting, fine-tunes distributional vector spaces by injecting lexical knowledge from semantic databases such as WordNet or the Paraphrase Database (Faruqui et al., 2015; Jauhar et al., 2015; Wieting et al., 2015; Nguyen et al., 2016; Mrkšić et al., 2016; Mrkšić et al., 2017).

   LEAR falls into the latter category. However, while previous post-processing methods have focused almost exclusively on specialising vector spaces to emphasise semantic similarity (i.e., to distinguish between similarity and relatedness by explicitly pulling synonyms closer and pushing antonyms further apart), this paper proposed a principled methodology for specialising vector spaces for asymmetric hierarchical relations (of which lexical entailment is an instance). Its starting point is the state-of-the-art similarity specialisation framework of Mrkšić et al. (2017), which we extend to support the inclusion of hierarchical asymmetric relationships between words.

Word Vectors and Lexical Entailment Since the hierarchical LE relation is one of the fundamental building blocks of semantic taxonomies and hierarchical concept categorisations (Beckwith et al., 1991; Fellbaum, 1998), a significant amount of research in semantics has been invested into its automatic detection and classification. Early work relied on asymmetric directional measures (Weeds et al., 2004; Clarke, 2009; Kotlerman et al., 2010; Lenci and Benotto, 2012, i.a.) which were based on the distributional inclusion hypothesis (Geffet and Dagan, 2005) or the distributional informativeness or generality hypothesis (Herbelot and Ganesalingam, 2013; Santus et al., 2014). However, these approaches have recently been superseded by methods based on word embeddings. These methods build dense real-valued vectors for capturing the LE relation, either directly in the LE-focused space (Vilnis and McCallum, 2015; Vendrov et al.,
2016; Henderson and Popa, 2016; Nickel and Kiela, 2017; Nguyen et al., 2017) or by using the vectors as features for supervised LE detection models (Tuan et al., 2016; Shwartz et al., 2016; Nguyen et al., 2017; Glavaš and Ponzetto, 2017).

   Several LE models embed useful hierarchical relations from external resources such as WordNet into LE-focused vector spaces, with solutions coming in different flavours. The model of Yu et al. (2015) is a dynamic distance-margin model optimised for the LE detection task using hierarchical WordNet constraints. This model was extended by Tuan et al. (2016) to make use of contextual sentential information. A major drawback of both models is their inability to make directionality judgements. Further, their performance has recently been surpassed by the HYPERVEC model of Nguyen et al. (2017), which combines WordNet constraints with the SGNS distributional objective in a joint model. As such, HYPERVEC is tied to the SGNS objective, and any change of the distributional modelling paradigm implies a change of the entire model; this makes it less versatile than the proposed LEAR framework. Moreover, LEAR specialisation achieves substantially better performance across all LE tasks used for evaluation.

   Another model similar in spirit to LEAR is the ORDER-EMB model of Vendrov et al. (2016), which encodes hierarchical structure by imposing a partial order on the embedding space: higher-level concepts get assigned higher per-coordinate values in a d-dimensional vector space. The model minimises the violation of the per-coordinate orderings during training by relying on hierarchical WordNet constraints between word pairs. Finally, the POINCARÉ model of Nickel and Kiela (2017) makes use of hyperbolic spaces to learn general-purpose LE embeddings based on n-dimensional Poincaré balls which encode both hierarchy and semantic similarity, again using the WordNet constraints. A similar model in hyperbolic spaces was proposed by Chamberlain et al. (2017). In this paper, we demonstrate that LE-specialised word embeddings with stronger performance can be induced using a simpler model operating in more intuitively interpretable Euclidean vector spaces.

6   Conclusion and Future Work

This paper proposed LEAR, a vector space specialisation procedure which simultaneously injects symmetric and asymmetric constraints into existing vector spaces, performing joint specialisation for two properties: lexical entailment and semantic similarity. Since the former is not symmetric, LEAR uses an asymmetric cost function which encodes the hierarchy between concepts by manipulating the norms of word vectors, assigning higher norms to higher-level concepts. Specialising the vector space for both relations has a synergistic effect: LEAR-specialised vectors attain state-of-the-art performance in judging semantic similarity and set new high scores across four different lexical entailment tasks. The code for the LEAR model is available from: github.com/nmrksic/lear.

   In future work, we plan to apply a similar methodology to other asymmetric relations (e.g., meronymy), as well as to investigate fine-grained models which can account for differing path lengths in the WordNet hierarchy. We will also extend the model to reason over words unseen in the input lexical resources, similar to the recent post-specialisation model oriented towards specialisation of unseen words for similarity (Vulić et al., 2018). We also plan to test the usefulness of LE-specialised vectors in downstream natural language understanding tasks. Porting the model to other languages and enabling cross-lingual applications such as cross-lingual lexical entailment (Upadhyay et al., 2018) is another future research direction.

Acknowledgments

We thank the three anonymous reviewers for their insightful comments and suggestions. We are also grateful to the TakeLab research group at the University of Zagreb for offering support to computationally intensive experiments in our hour of need. This work is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no 648909).

References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of CoNLL, pages 183–192.

Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In Proceedings of EACL, pages 23–32.

Marco Baroni and Alessandro Lenci. 2011. How we
BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop, pages 1–10.

Richard Beckwith, Christiane Fellbaum, Derek Gross, and George A. Miller. 1991. WordNet: A lexical database organized on psycholinguistic principles. Lexical acquisition: Exploiting on-line resources to build a lexicon, pages 211–231.

Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Knowledge-powered deep learning for word embedding. In Proceedings of ECML-PKDD, pages 132–148.

Or Biran and Kathleen McKeown. 2013. Classifying taxonomic relations between pairs of Wikipedia articles. In Proceedings of IJCNLP, pages 788–794.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the ACL, 5:135–146.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP, pages 632–642.

Benjamin Paul Chamberlain, James Clough, and Marc Peter Deisenroth. 2017. Neural embeddings of graphs in hyperbolic space. CoRR, abs/1705.10359.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP, pages 740–750.

Daoud Clarke. 2009. Context-theoretic semantics for natural language: An overview. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics (GEMS), pages 112–119.

Allan M. Collins and Ross M. Quillian. 1972. Experiments on semantic memory and language comprehension. Cognition in Learning and Memory.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing textual entail-

Maayan Geffet and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In Proceedings of ACL, pages 107–114.

Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. In Proceedings of EMNLP, pages 2173–2182.

Goran Glavaš and Simone Paolo Ponzetto. 2017. Dual tensor model for detecting asymmetric lexico-semantic relations. In Proceedings of EMNLP, pages 1758–1768.

James A. Hampton. 2007. Typicality, graded membership, and vagueness. Cognitive Science, 31(3):355–384.

Zellig S. Harris. 1954. Distributional structure. Word, 10(23):146–162.

James Henderson and Diana Popa. 2016. A vector space for distributional semantics for entailment. In Proceedings of ACL, pages 2052–2062.

Aurélie Herbelot and Mohan Ganesalingam. 2013. Measuring semantic content in distributional vectors. In Proceedings of ACL, pages 440–445.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Sujay Kumar Jauhar, Chris Dyer, and Eduard H. Hovy. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of NAACL, pages 683–693.

Hans Kamp and Barbara Partee. 1995. Prototype theory and compositionality. Cognition, 57(2):129–191.

Douwe Kiela, Felix Hill, and Stephen Clark. 2015a. Specializing word embeddings for similarity or relatedness. In Proceedings of EMNLP, pages 2044–2048.
   ment: Models and applications. Synthesis Lectures
   on Human Language Technologies, 6(4):1–220.            Douwe Kiela, Laura Rimell, Ivan Vulić, and Stephen
John C. Duchi, Elad Hazan, and Yoram Singer. 2011.          Clark. 2015b. Exploiting image generality for lex-
  Adaptive subgradient methods for online learning          ical entailment detection. In Proceedings of ACL,
  and stochastic optimization. Journal of Machine           pages 119–124.
  Learning Research, 12:2121–2159.
                                                          Barbara Ann Kipfer. 2009. Roget’s 21st Century The-
Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar,            saurus (3rd Edition). Philip Lief Group.
 Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015.
 Retrofitting word vectors to semantic lexicons. In       Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan
 Proceedings of NAACL-HLT, pages 1606–1615.                  Zhitomirsky-Geffet. 2010.        Directional distribu-
                                                             tional similarity for lexical inference. Natural Lan-
Christiane Fellbaum. 1998. WordNet.                          guage Engineering, 16(4):359–389.

Alessandro Lenci and Giulia Benotto. 2012. Identifying hypernyms in distributional semantic spaces. In Proceedings of *SEM, pages 75–79.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of ACL, pages 302–308.

Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of ACL, pages 1501–1511.

Douglas L. Medin, Mark W. Altom, and Timothy D. Murphy. 1984. Given versus induced category representations: Use of prototype and exemplar information in classification. Journal of Experimental Psychology, 10(3):333–352.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. Context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of CoNLL, pages 51–61.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.

Michael Mohler, David Bracewell, Marc Tomlinson, and David Hinote. 2013. Semantic signatures for example-based linguistic metaphor detection. In Proceedings of the First Workshop on Metaphor in NLP, pages 27–35.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of ACL, pages 1777–1788.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of NAACL-HLT.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialisation of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the ACL, 5:309–324.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Roberto Navigli, Paola Velardi, and Stefano Faralli. 2011. A graph-based algorithm for inducing lexical taxonomies from scratch. In Proceedings of IJCAI, pages 1872–1877.

Kim Anh Nguyen, Maximilian Köper, Sabine Schulte im Walde, and Ngoc Thang Vu. 2017. Hierarchical embeddings for hypernymy detection and directionality. In Proceedings of EMNLP, pages 233–243.

Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of ACL, pages 454–459.

Maximilian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In Proceedings of NIPS.

Masataka Ono, Makoto Miwa, and Yutaka Sasaki. 2015. Word embedding-based antonym detection using thesauri and distributional information. In Proceedings of NAACL-HLT, pages 984–989.

Dominique Osborne, Shashi Narayan, and Shay Cohen. 2016. Encoding prior knowledge with eigenword embeddings. Transactions of the ACL, 4:417–430.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of ACL, pages 425–430.

Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity - Measuring the relatedness of concepts. In Proceedings of AAAI, pages 1024–1025.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Marek Rei and Ted Briscoe. 2014. Looking for hyponyms in vector space. In Proceedings of CoNLL, pages 68–77.

Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In Proceedings of COLING, pages 1025–1036.

Eleanor H. Rosch. 1973. Natural categories. Cognitive Psychology, 4(3):328–350.

Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Schulte im Walde. 2014. Chasing hypernyms in vector spaces with entropy. In Proceedings of EACL, pages 38–42.

Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. EVALution 1.0: an evolving semantic dataset for training and evaluation of distributional semantic models. In Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications, pages 64–69.

Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of ACL, pages 2389–2398.

Vered Shwartz, Enrico Santus, and Dominik Schlechtweg. 2017. Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. In Proceedings of EACL, pages 65–75.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of ACL, pages 801–808.

Luu Anh Tuan, Yi Tay, Siu Cheung Hui, and See Kiong Ng. 2016. Learning term embeddings for taxonomic relation identification using dynamic weighting neural network. In Proceedings of EMNLP, pages 403–413.

Joseph P. Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384–394.

Shyam Upadhyay, Yogarshi Vyas, Marine Carpuat, and Dan Roth. 2018. Robust cross-lingual hypernymy detection using dependency context. In Proceedings of NAACL-HLT.

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2016. Order-embeddings of images and language. In Proceedings of ICLR (Conference Track).

Luke Vilnis and Andrew McCallum. 2015. Word representations via Gaussian embedding. In Proceedings of ICLR (Conference Track).

Ivan Vulić, Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. 2017. HyperLex: A large-scale evaluation of graded lexical entailment. Computational Linguistics.

Ivan Vulić, Goran Glavaš, Nikola Mrkšić, and Anna Korhonen. 2018. Post-specialisation: Retrofitting vectors of words unseen in lexical resources. In Proceedings of NAACL-HLT.

Ivan Vulić, Nikola Mrkšić, and Anna Korhonen. 2017a. Cross-lingual induction and transfer of verb classes based on word vector space specialisation. In Proceedings of EMNLP, pages 2536–2548.

Ivan Vulić, Nikola Mrkšić, Roi Reichart, Diarmuid Ó Séaghdha, Steve Young, and Anna Korhonen. 2017b. Morph-fitting: Fine-tuning word vector spaces with simple language-specific rules. In Proceedings of ACL, pages 56–68.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of COLING, pages 2249–2259.

Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In Proceedings of COLING, pages 1015–1021.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the ACL, 3:345–358.

Zhibiao Wu and Martha Palmer. 1994. Verb semantics and lexical selection. In Proceedings of ACL, pages 133–138.

Chang Xu, Yalong Bai, Jiang Bian, Bin Gao, Gang Wang, Xiaoguang Liu, and Tie-Yan Liu. 2014. RC-NET: A general framework for incorporating knowledge into word representations. In Proceedings of CIKM, pages 1219–1228.

Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of ACL, pages 545–550.

Zheng Yu, Haixun Wang, Xuemin Lin, and Min Wang. 2015. Learning term embeddings for hypernymy identification. In Proceedings of IJCAI, pages 1390–1397.

Jingwei Zhang, Jeremy Salwen, Michael Glass, and Alfio Gliozzo. 2014. Word semantic representations using Bayesian probabilistic tensor factorization. In Proceedings of EMNLP, pages 1522–1531.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP, pages 1393–1398.
