The Role of Modifier and Head Properties in Predicting the Compositionality of English and German Noun-Noun Compounds: A Vector-Space Perspective


                 Sabine Schulte im Walde and Anna Hätty and Stefan Bott
                Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart
                           Pfaffenwaldring 5B, 70569 Stuttgart, Germany
               {schulte,haettyaa,bottsn}@ims.uni-stuttgart.de

                     Abstract

In this paper, we explore the role of constituent properties in English and German noun-noun compounds (corpus frequencies of the compounds and their constituents; productivity and ambiguity of the constituents; and semantic relations between the constituents) when predicting the degrees of compositionality of the compounds within a vector space model. The results demonstrate that the empirical and semantic properties of the compounds and the head nouns play a significant role.

1   Introduction

The past 20+ years have witnessed an enormous amount of discussion on whether and how the modifiers and the heads of noun-noun compounds such as butterfly, snowball and teaspoon influence the compositionality of the compounds, i.e., the degree of transparency vs. opaqueness of the compounds. The discussions took place mostly in psycholinguistic research, typically relying on reading-time and priming experiments. For example, Sandra (1990) demonstrated in three priming experiments that both modifier and head constituents were accessed in semantically transparent English noun-noun compounds (such as teaspoon), but there were no effects for semantically opaque compounds (such as buttercup), when primed either on their modifier or head constituent. In contrast, Zwitserlood (1994) provided evidence that the lexical processing system is sensitive to morphological complexity independent of semantic transparency. Libben and his colleagues (Libben et al. (1997), Libben et al. (2003)) were the first to systematically categorise noun-noun compounds with nominal modifiers and heads into four groups representing all possible combinations of modifier and head transparency (T) vs. opaqueness (O) within a compound. Examples for these categories were car-wash (TT), strawberry (OT), jailbird (TO), and hogwash (OO). Libben et al. confirmed Zwitserlood's analyses that both semantically transparent and semantically opaque compounds show morphological constituency; in addition, the semantic transparency of the head constituent was found to play a significant role.

From a computational point of view, addressing the compositionality of noun compounds (and multi-word expressions more generally) is a crucial ingredient for lexicography and NLP applications, to know whether an expression should be treated as a whole or through its constituents, and what the expression means. For example, studies such as Cholakov and Kordoni (2014), Weller et al. (2014), Cap et al. (2015), and Salehi et al. (2015b) have integrated the prediction of multi-word compositionality into statistical machine translation.

Computational approaches to automatically predict the compositionality of noun compounds have mostly been realised as vector space models, and can be subdivided into two subfields: (i) approaches that aim to predict the meaning of a compound by composite functions, relying on the vectors of the constituents (e.g., Mitchell and Lapata (2010), Coecke et al. (2011), Baroni et al. (2014), and Hermann (2014)); and (ii) approaches that aim to predict the degree of compositionality of a compound, typically by comparing the compound vectors with the constituent vectors (e.g., Reddy et al. (2011), Salehi and Cook (2013), Schulte im Walde et al. (2013), Salehi et al. (2014; 2015a)). In line with subfield (ii), this paper aims to distinguish the contributions of modifier and head properties when predicting the compositionality of English and German noun-noun compounds in a vector space model.

    Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics (*SEM 2016), pages 148–158,
                                          Berlin, Germany, August 11-12, 2016.
To date, computational research on noun compounds has largely ignored the influence of constituent properties on the prediction of compositionality. Individual pieces of research noticed differences in the contributions of modifier and head constituents towards the composite functions predicting compositionality (Reddy et al., 2011; Schulte im Walde et al., 2013), but so far the roles of modifiers and heads have not been distinguished. We use a new gold standard of German noun-noun compounds annotated with corpus frequencies of the compounds and their constituents; productivity and ambiguity of the constituents; and semantic relations between the constituents; and we extend three existing gold standards of German and English noun-noun compounds (Ó Séaghdha, 2007; von der Heide and Borgwaldt, 2009; Reddy et al., 2011) to include approximately the same compound and constituent properties. Relying on a standard vector space model of compositionality, we then predict the degrees of compositionality of the English and German noun-noun compounds, and explore the influences of the compound and constituent properties. Our empirical computational analyses reveal that the empirical and semantic properties of the compounds and the head nouns play a significant role in determining the compositionality of noun compounds.

2   Related Work

Regarding relevant psycholinguistic research on the representation and processing of noun compounds, Sandra (1990) hypothesised that an associative prime should facilitate access and recognition of a noun compound if a compound constituent is accessed during processing. His three priming experiments revealed that in transparent noun-noun compounds both constituents are accessed, but he did not find priming effects for the constituents in opaque noun-noun compounds.

Zwitserlood (1994) performed an immediate partial repetition experiment and a priming experiment to explore and distinguish morphological and semantic structures in noun-noun compounds. On the one hand, she confirmed Sandra's results that there is no semantic facilitation of any constituent in opaque compounds. In contrast, she found evidence for morphological complexity, independent of semantic transparency, and that both transparent and also partially opaque compounds (i.e., compounds with one transparent and one opaque constituent) produce semantic priming of their constituents. For the heads of semantically transparent compounds, a larger amount of facilitation was found than for the modifiers. Differences in the results by Sandra (1990) and Zwitserlood (1994) were supposedly due to different definitions of partial opacity, and different prime–target SOAs.

Libben and his colleagues (Libben et al. (1997), Libben (1998), and Libben et al. (2003)) were the first to systematically categorise noun-noun compounds with nominal modifiers and heads into four groups representing all possible combinations of a constituent's transparency (T) vs. opaqueness (O) within a compound: TT, OT, TO, OO. Libben's examples for these categories were car-wash (TT), strawberry (OT), jailbird (TO), and hogwash (OO). They confirmed Zwitserlood's analyses that both semantically transparent and semantically opaque compounds show morphological constituency, and also that the semantic transparency of the head constituent plays a significant role. Studies such as Jarema et al. (1999) and Kehayia et al. (1999) to a large extent confirmed the insights by Libben and his colleagues for French, Bulgarian, Greek and Polish.

Regarding related computational work, prominent approaches to model the meaning of a compound or a phrase by a composite function include Mitchell and Lapata (2010), Coecke et al. (2011), Baroni et al. (2014), and Hermann (2014). In this area, researchers combine the vectors of the compound/phrase constituents by mathematical functions such that the resulting vector optimally represents the meaning of the compound/phrase. This research is only marginally related to ours, since we are interested in the degree of compositionality of a compound, rather than its actual meaning.

Most closely related computational work includes distributional approaches that predict the degree of compositionality of a compound regarding a specific constituent, by comparing the compound vector to the respective constituent vector. Most importantly, Reddy et al. (2011) used a standard distributional model to predict the compositionality of compound-constituent pairs for 90 English compounds. They extended their predictions by applying composite functions (see above). In a similar vein, Schulte im Walde et al. (2013) predicted the compositionality for 244 German compounds. Salehi et al. (2014) defined a cross-lingual distributional model that used translations into multiple languages and distributional similarities in the respective languages, to predict the compositionality for the two datasets from Reddy et al. (2011) and Schulte im Walde et al. (2013).

3   Noun-Noun Compounds

Our focus of interest is on noun-noun compounds such as butterfly, snowball and teaspoon, as well as car park, zebra crossing and couch potato in English, and Ahornblatt 'maple leaf', Feuerwerk 'fireworks', and Löwenzahn 'dandelion' in German, where both the grammatical head (in English and German, this is typically the rightmost constituent) and the modifier are nouns. We are interested in the degrees of compositionality of noun-noun compounds, i.e., the semantic relatedness between the meaning of a compound (e.g., snowball) and the meanings of its constituents (e.g., snow and ball). More specifically, this paper aims to explore factors that have been found to influence compound processing and representation, such as

   • frequency-based factors, i.e., the frequencies of the compounds and their constituents (van Jaarsveld and Rattink, 1988; Janssen et al., 2008);

   • the productivity (morphological family size), i.e., the number of compounds that share a constituent (de Jong et al., 2002); and

   • semantic variables such as the relationship between compound modifier and head: a teapot is a pot FOR tea; a snowball is a ball MADE OF snow (Gagné and Spalding, 2009; Ji et al., 2011).

In addition, we were interested in the effect of ambiguity (of both the modifiers and the heads) regarding the compositionality of the compounds.

Our explorations required gold standards of compounds that were annotated with all these compound and constituent properties. Since most previous work on computational predictions of compositionality has been performed for English and for German, we decided to re-use existing datasets for both languages, which however required extensions to provide all the properties we wanted to take into account. We also created a novel gold standard. In the following, we describe the datasets.1

   1 The datasets are available from http://www.ims.uni-stuttgart.de/data/ghost-nn/.

German Noun-Noun Compound Datasets

As basis for this work, we created a novel gold standard of German noun-noun compounds: GhOST-NN (Schulte im Walde et al., 2016). The new gold standard was built such that it includes a representative choice of compounds and constituents from various frequency ranges, various productivity ranges, with various numbers of senses, and with various semantic relations. In the following, we describe the creation process in some detail, because the properties of the gold standard are highly relevant for the distributional models.

Relying on the 11.7 billion words in the web corpus DECOW14AX2 (Schäfer and Bildhauer, 2012; Schäfer, 2015), we extracted all words that were identified as common nouns by the TreeTagger (Schmid, 1994) and analysed as noun compounds with exactly two nominal constituents by the morphological analyser SMOR (Faaß et al., 2010). This set of 154,960 two-part noun-noun compound candidates was enriched with empirical properties relevant for the gold standard:

   • corpus frequencies of the compounds and the constituents (i.e., modifiers and heads), relying on DECOW14AX;

   • productivity of the constituents, i.e., how many compound types contained a specific modifier/head constituent;

   • number of senses of the compounds and the constituents, relying on GermaNet (Hamp and Feldweg, 1997; Kunze, 2000).

   2 http://corporafromtheweb.org/decow14/

From the set of compound candidates we extracted a random subset that was balanced3 for

   • the productivity of the modifiers: we calculated tertiles to identify modifiers with low/mid/high productivity;

   • the ambiguity of the heads: we distinguished between heads with 1, 2 and >2 senses.

   3 We wanted to extract a random subset that at the same time was balanced across frequency, productivity and ambiguity ranges of the compounds and their constituents, but defining and combining several ranges for each of the three criteria and for compounds as well as constituents would have led to an explosion of factors to be taken into account, so we focused on two main criteria instead.

For each of the resulting nine categories (three productivity ranges × three ambiguity ranges), we randomly selected 20 noun-noun compounds
from our candidate set, disregarding compounds with a corpus frequency < 2,000, and disregarding compounds containing modifiers or heads with a corpus frequency < 100. We refer to this dataset of 180 compounds balanced for modifier productivity and head ambiguity as GhOST-NN/S.

We also created a subset of 5 noun-noun compounds for each of the 9 criteria combinations, by randomly selecting 5 out of the 20 selected compounds in each mode. This small, balanced subset was then systematically extended by adding all compounds from the original set of compound candidates with either the same modifier or the same head as any of the selected compounds. Taking Haarpracht as an example (the modifier is Haar 'hair', the head is Pracht 'glory'), we added Haarwäsche, Haarkleid, Haarpflege, etc., as well as Blütenpracht, Farbenpracht, etc.4 We refer to this dataset of 868 compounds, which destroyed the coherent balance of criteria underlying our random extraction but instead ensured a variety of compounds with either the same modifiers or the same heads, as GhOST-NN/XL.

   4 The translations of the example compounds are hair washing, hair dress, hair care, floral glory, and colour glory.

The two sets of compounds (GhOST-NN/S and GhOST-NN/XL) were annotated with the semantic relations between the modifiers and the heads, and with compositionality ratings. Regarding semantic relations, we applied the relation set suggested by Ó Séaghdha (2007), because (i) he had evaluated his annotation relations and annotation scheme, and (ii) his dataset was of a similar size to ours, so we could aim to compare results across languages. Ó Séaghdha (2007) himself had relied on a set of nine semantic relations suggested by Levi (1978), and designed and evaluated a set of relations that took over four of Levi's relations (BE, HAVE, IN, ABOUT) and added two relations referring to event participants (ACTOR, INST(rument)) that replaced the relations MAKE, CAUSE, FOR, FROM, USE. An additional relation LEX refers to lexicalised compounds where no relation can be assigned. Three native speakers of German annotated the compounds with these seven semantic relations.5 Regarding compositionality ratings, eight native speakers of German annotated all 868 gold-standard compounds with compound–constituent compositionality ratings on a scale from 1 (definitely semantically opaque) to 6 (definitely semantically transparent). Another five native speakers provided additional annotation for our small core subset of 180 compounds on the same scale. As final compositionality ratings, we use the mean compound–constituent ratings across the 13 annotators.

   5 In fact, the annotation was performed for a superset of 1,208 compounds, but we only took into account the 868 compounds with perfect agreement, i.e. IAA=1.

As an alternative gold standard for German noun-noun compounds, we used a dataset based on a selection of noun compounds by von der Heide and Borgwaldt (2009) that was previously used in computational models predicting compositionality (Schulte im Walde et al., 2013; Salehi et al., 2014). The dataset contains a subset of their compounds including 244 two-part noun-noun compounds, annotated with compositionality ratings on a scale between 1 and 7. We enriched the existing dataset with frequencies, and productivity and ambiguity scores, also based on DECOW14AX and GermaNet, to provide the same empirical information as for the GhOST-NN datasets. We refer to this alternative German dataset as VDHB.

English Noun-Noun Compound Datasets

Reddy et al. (2011) created a gold standard for English noun-noun compounds. Assuming that compounds whose constituents appeared either as their hypernyms or in their definitions tend to be compositional, they induced a candidate compound set with various degrees of compound–constituent relatedness from WordNet (Miller et al., 1990; Fellbaum, 1998) and Wiktionary. A random choice of 90 compounds that appeared with a corpus frequency > 50 in the ukWaC corpus (Baroni et al., 2009) constituted their gold-standard dataset and was annotated with compositionality ratings. Bell and Schäfer (2013) annotated the compounds with semantic relations using all of Levi's original nine relation types: CAUSE, HAVE, MAKE, USE, BE, IN, FOR, FROM, ABOUT. We refer to this dataset as REDDY.

Ó Séaghdha developed computational models to predict the semantic relations between modifiers and heads in English noun compounds (Ó Séaghdha, 2008; Ó Séaghdha and Copestake, 2013; Ó Séaghdha and Korhonen, 2014). As the gold-standard basis for his models, he created a dataset of compounds and annotated the compounds with semantic relations: He tagged and parsed the written part of the British National Corpus using RASP (Briscoe and Carroll, 2002), and applied a simple heuristic to induce compound candidates: He used all sequences of two or more common nouns that were preceded or followed by sentence boundaries or by words not representing common nouns. Of these compound candidates, a random selection of 2,000 instances was used for relation annotation (Ó Séaghdha, 2007) and classification experiments. The final gold standard is a subset of these compounds, containing 1,443 noun-noun compounds. We refer to this dataset as OS.

Both English compound datasets were enriched with frequencies and productivities, based on ENCOW14AX6, which contains 9.6 billion words. We also added the number of senses of the constituents to both datasets, using WordNet. Finally, we collected compositionality ratings for a random choice of 396 compounds from the OS dataset, relying on eight experts, in the same way as the GhOST-NN ratings were collected.

   6 http://corporafromtheweb.org/encow14/

Resulting Noun-Noun Compound Datasets

Table 1 summarises the gold-standard datasets. They are of different sizes, but their empirical and semantic annotations have been aligned to a large extent, using similar corpora, relying on WordNets, and using similar semantic relation inventories based on Levi (1978).

   Language   Dataset        #Compounds   Frequency/Productivity   Ambiguity   Relations
   DE         GhOST-NN/S            180   DECOW                    GermaNet    Levi (7)
   DE         GhOST-NN/XL           868   DECOW                    GermaNet    Levi (7)
   DE         VDHB                  244   DECOW                    GermaNet    –
   EN         REDDY                  90   ENCOW                    WordNet     Levi (9)
   EN         OS                    396   ENCOW                    WordNet     Levi (6)

                Table 1: Noun-noun compound datasets.

4   VSMs Predicting Compositionality

Vector space models (VSMs) and distributional information have been a steadily increasing, integral part of lexical semantic research over the past 20 years (Turney and Pantel, 2010): They explore the notion of "similarity" between a set of target objects, typically relying on the distributional hypothesis (Harris, 1954; Firth, 1957) to determine co-occurrence features that best describe the words, phrases, sentences, etc. of interest.

In this paper, we use VSMs to model compounds as well as constituents by distributional vectors, and we determine the semantic relatedness between the compounds and their modifier and head constituents by measuring the distance between the vectors. We assume that the closer a compound vector and a constituent vector are to each other, the more compositional (i.e., the more transparent) the compound is regarding that constituent. Correspondingly, the more distant a compound vector and a constituent vector are to each other, the less compositional (i.e., the more opaque) the compound is regarding that constituent.

Our main questions regarding the VSMs concern the influence of constituent properties on the prediction of compositionality: How do the corpus frequencies of the compounds and their constituents, the productivity and the ambiguity of the constituents, and the semantic relations between the constituents influence the quality of the predictions?

4.1   Vector Space Models (VSMs)

We created a standard vector space model for all our compounds and constituents in the various datasets, using co-occurrence frequencies of nouns within a sentence-internal window of 20 words to the left and 20 words to the right of the targets.7 The frequencies were induced from the German and English COW corpora, and transformed to local mutual information (LMI) values (Evert, 2005).

   7 In previous work, we systematically compared window-based and syntax-based co-occurrence variants for predicting compositionality (Schulte im Walde et al., 2013). The current work adopted the best choice of co-occurrence dimensions.

Relying on the LMI vector space models, the cosine determined the distributional similarity between the compounds and their constituents, which was in turn used to predict the degree
of compositionality between the compounds and                       4.3    Influence of Compound Properties on
their constituents, assuming that the stronger the                         VSM Prediction Results
distributional similarity (i.e., the cosine values),
                                                                    Figures 1 to 5 present the core results of this paper:
the larger the degree of compositionality. The vec-
                                                                    They explore the influence of compound and con-
tor space predictions were evaluated against the
                                                                    stituent properties on predicting compositionality.
mean human ratings on the degree of composition-
                                                                    Since we wanted to optimise insight into the influ-
ality, using the Spearman Rank-Order Correlation
                                                                    ence of the properties, we selected the 60 maxi-
Coefficient ρ (Siegel and Castellan, 1988).
                                                                    mum instances and the 60 minimum instances for
                                                                    each property.9 For example, to explore the in-
4.2   Overall VSM Prediction Results

Table 2 presents the overall prediction results across languages and datasets. The mod column shows the ρ correlations for predicting only the degree of compositionality of compound–modifier pairs; the head column shows the ρ correlations for predicting only the degree of compositionality of compound–head pairs; and the both column shows the ρ correlations for predicting the degree of compositionality of compound–modifier and compound–head pairs at the same time.

       Dataset        mod    head   both
  DE   GhoSt-NN/S     0.48   0.57   0.46
       GhoSt-NN/XL    0.49   0.59   0.47
       VDHB           0.65   0.60   0.61
  EN   REDDY          0.48   0.60   0.56
       OS             0.46   0.39   0.35

       Table 2: Overall prediction results (ρ).

The models for VDHB and REDDY represent replications of similar models in Schulte im Walde et al. (2013) and Reddy et al. (2011), respectively, but using the much larger COW corpora.

Overall, the both prediction results on VDHB are significantly [8] better than all others except REDDY, and the prediction results on OS compounds are significantly worse than all others. We can also compare within-dataset results: regarding the two GhoSt-NN datasets and the REDDY dataset, the VSM predictions for the compound–head pairs are better than for the compound–modifier pairs; regarding the VDHB and the OS datasets, the VSM predictions for the compound–modifier pairs are better than for the compound–head pairs. These differences do not depend on the language (according to our datasets) and are probably due to properties of the specific gold standards that we did not control. They are, however, also not the main point of this paper.

…influence of head frequency on the prediction quality, we selected the 60 most frequent and the 60 most infrequent compound heads [9] from each gold-standard resource, and calculated Spearman’s ρ for each set of 60 compounds with these heads.

Figure 1 shows that the distributional model predicts high-frequency compounds (red bars) better than low-frequency compounds (blue bars), across datasets. The differences are significant for GhoSt-NN/XL.

Figure 1: Effect of compound frequency.

Figure 2 shows that the distributional model predicts compounds with low-frequency heads better than compounds with high-frequency heads (right panel), while there is no tendency regarding the modifier frequencies (left panel). The differences regarding the head frequencies are significant (p = 0.1) for both GhoSt-NN datasets.

Figure 3 shows that the distributional model also predicts compounds with low-productivity heads better than compounds with high-productivity heads (right panel), while there is no tendency regarding the productivities of modifiers (left panel). The prediction differences regarding the head productivities are significant for GhoSt-NN/S (p < 0.05).

[8] All significance tests in this paper were performed by Fisher r-to-z transformation.
[9] For REDDY, we could only use 45 maximum/minimum instances, since the dataset only contains 90 compounds.
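As a rough illustration of the mod/head/both predictions, a vector-space model of this kind scores compositionality by comparing a compound's co-occurrence vector with its constituents' vectors. The sketch below is our own simplification: the toy vectors and context words are invented, and averaging the two cosines for the combined score is just one possible choice, not necessarily the configuration used in the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse co-occurrence vectors (dicts
    # mapping context words to weights).
    dot = sum(w * v.get(ctx, 0.0) for ctx, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def compositionality(compound_vec, modifier_vec, head_vec):
    # Predicted degrees of compositionality for the compound-modifier
    # and compound-head pairs; "both" is here a simple average of the
    # two cosines (an illustrative choice).
    mod = cosine(compound_vec, modifier_vec)
    head = cosine(compound_vec, head_vec)
    return {"mod": mod, "head": head, "both": (mod + head) / 2.0}

# Toy co-occurrence counts (hypothetical, for illustration only).
car_wash = {"soap": 3.0, "vehicle": 5.0, "water": 2.0}
car      = {"vehicle": 6.0, "road": 4.0}
wash     = {"soap": 4.0, "water": 5.0}
scores = compositionality(car_wash, car, wash)
```

A transparent compound like car-wash then receives high cosines against both constituents, while an opaque compound's vector shares little context with its constituents' vectors.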

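The per-subset evaluation (Spearman's ρ between predicted and gold-standard compositionality scores over each set of 60 compounds) can be sketched as follows; the scores are invented, and the helper names are ours. Average ranks are used so that tied scores are handled:

```python
def average_ranks(values):
    # Assign 1-based ranks; tied values share the mean of their ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(pred, gold):
    # Spearman's rho = Pearson correlation of the two rank vectors
    # (assumes non-constant inputs).
    ra, rb = average_ranks(pred), average_ranks(gold)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(ra, rb))
    va = sum((a - ma) ** 2 for a in ra)
    vb = sum((b - mb) ** 2 for b in rb)
    return cov / (va * vb) ** 0.5

# Hypothetical predicted vs. gold scores for one 60-compound subset
# (shortened to four items here).
pred = [0.9, 0.2, 0.5, 0.7]
gold = [0.8, 0.1, 0.4, 0.9]
rho = spearman_rho(pred, gold)  # 0.8 for this example
```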
Figure 2: Effect of modifier/head frequency.

Figure 3: Effect of modifier/head productivity.

Figure 4: Effect of modifier/head ambiguity.

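The significance comparisons behind these figures rely on the Fisher r-to-z transformation (footnote 8), which compares two independent correlation coefficients via a normal approximation. A minimal sketch follows; the sample sizes and correlation values below are invented, not the datasets' actual figures:

```python
import math

def fisher_r_to_z_test(r1, n1, r2, n2):
    # Transform each correlation to z = atanh(r) and compare the two
    # z-values with standard error sqrt(1/(n1-3) + 1/(n2-3)).
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-tailed p-value from the standard normal CDF via erf.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p

# E.g. comparing rho = 0.65 (n = 244) against rho = 0.35 (n = 244):
z, p = fisher_r_to_z_test(0.65, 244, 0.35, 244)  # p well below 0.05
```

With nearly identical correlations and small samples (e.g. 0.50 vs. 0.52 at n = 60) the same test yields a large p-value, i.e. no significant difference.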
Figure 4 shows that the distributional model also predicts compounds with low-ambiguity heads better than compounds with high-ambiguity heads (right panel), with one exception (GhoSt-NN/XL), while there is no tendency regarding the ambiguities of modifiers (left panel). The prediction differences regarding the head ambiguities are significant for GhoSt-NN/XL (p < 0.01).

Figure 5 compares the predictions of the distributional model regarding the semantic relations between modifiers and heads, focusing on GhoSt-NN/XL. The numbers in brackets refer to the number of compounds with the respective relation. The plot reveals differences between predictions of compounds with different relations.

Figure 5: Effect of semantic relation.

Table 3 summarises those differences across gold standards that are significant (where filled cells refer to rows significantly outperforming columns). Overall, the compositionality of BE compounds is predicted significantly better than the compositionality of HAVE compounds (in REDDY), INST and ABOUT compounds (in GhoSt-NN), and ACTOR compounds (in GhoSt-NN and OS). The compositionality of ACTOR compounds is predicted significantly worse than the compositionality of BE, HAVE, IN and INST compounds in both GhoSt-NN and OS.

          HAVE     INST     ABOUT    ACTOR
  BE      REDDY    GhoSt    GhoSt    GhoSt, OS
  HAVE                      OS       GhoSt, OS
  IN                                 GhoSt, OS
  INST                               GhoSt, OS

    Table 3: Significant differences: relations.

5   Discussion

While modifier frequency, productivity and ambiguity did not show a consistent effect on the predictions, head frequency, productivity and ambiguity influenced the predictions such that the prediction quality for compounds with low-frequency, low-productivity and low-ambiguity heads was better than for compounds with high-frequency, high-productivity and high-ambiguity heads. The differences were significant only for our new GhoSt-NN datasets. In addition, the compound frequency also had an effect on the predictions, with high-frequency compounds receiving better prediction results than low-frequency compounds. Finally, the quality of predictions also differed for compound relation types, with BE compounds predicted best and ACTOR compounds predicted worst. These differences were ascertained mostly in the GhoSt-NN and the OS datasets. Our results raise two main questions:

(1) What does it mean if a distributional model predicts a certain subset of compounds (with specific properties) “better” or “worse” than other subsets?

(2) What are the implications for (a) psycholinguistic and (b) computational models regarding the compositionality of noun compounds?

Regarding question (1), there are two options why a distributional model predicts a certain subset of compounds better or worse than other subsets. On the one hand, one of the underlying gold-standard datasets could contain compounds whose compositionality scores are easier to predict than the compositionality scores of compounds in a different dataset. On the other hand, even if there were differences in individual dataset pairs, this would not explain why we consistently find modelling differences for head constituent properties (and compound properties) but not for modifier constituent properties. We therefore conclude that the effects of compound and head properties are due to the compounds’ morphological constituency, with specific emphasis on the influences of the heads.

Looking at the individual effects of the compound and head properties that influence the distributional predictions, we hypothesise that high-frequency compounds are easier to predict because they have a better corpus coverage (and less

sparse data) than low-frequency compounds, and that they contain many clearly transparent compounds (such as Zitronensaft ‘lemon juice’) and, at the same time, many clearly opaque compounds (such as Eifersucht ‘jealousy’, where the literal translations of the constituents are ‘eagerness’ and ‘addiction’). Concerning the decrease in prediction quality for more frequent, more productive and more ambiguous heads, we hypothesise that all of these properties are indicators of ambiguity, and the more ambiguous a word is, the more difficult it is to provide a unique distributional prediction, as distributional co-occurrence in most cases (including our current work) subsumes the contexts of all word senses within one vector. For example, more than half of the compounds with the most frequent and also with the most productive heads have the head Spiel, which has six senses in GermaNet and covers six relations (BE, IN, INST, ABOUT, ACTOR, LEX).

Regarding question (2), the results of our distributional predictions confirm psycholinguistic research that identified morphological constituency in noun-noun compounds: our models clearly distinguish between properties of the whole compounds, properties of the modifier constituents, and properties of the head constituents. Furthermore, our models reveal the need to carefully balance the frequencies and semantic relations of target compounds, and the frequencies, productivities and ambiguities of their head constituents, in order to optimise experiment interpretations, while a careful choice of empirical modifier properties seems to play a minor role.

For computational models, our work provides similar implications. We demonstrated the need to carefully balance gold-standard datasets for multi-word expressions according to the empirical and semantic properties of the multi-word expressions themselves, and also according to those of the constituents. In the case of noun-noun compounds, the properties of the nominal modifiers were of minor importance, but this might differ for other multi-word expressions. If datasets are not balanced for compound and constituent properties, the quality of model predictions is difficult to interpret, because it is not clear whether biases in empirical properties skewed the results. Our advice is strengthened by the fact that most significant differences in prediction results were demonstrated for our new gold standard, which includes compounds across various frequency, productivity and ambiguity ranges.

6   Conclusion

We explored the role of constituent properties in English and German noun-noun compounds when predicting compositionality within a vector space model. The results demonstrated that the empirical and semantic properties of the compounds and the head nouns play a significant role. Therefore, psycholinguistic experiments as well as computational models are advised to carefully balance their selections of compound targets according to compound and constituent properties.

Acknowledgments

The research presented in this paper was funded by the DFG Heisenberg Fellowship SCHU 2580/1 (Sabine Schulte im Walde), the DFG Research Grant SCHU 2580/2 “Distributional Approaches to Semantic Relatedness” (Stefan Bott), and the DFG Collaborative Research Center SFB 732 (Anna Hätty).

References

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation, 43(3):209–226.

Marco Baroni, Raffaella Bernardi, and Roberto Zamparelli. 2014. Frege in Space: A Program for Compositional Distributional Semantics. Linguistic Issues in Language Technologies, 9(6):5–110.

Melanie J. Bell and Martin Schäfer. 2013. Semantic Transparency: Challenges for Distributional Semantics. In Proceedings of the IWCS Workshop on Formal Distributional Semantics, pages 1–10, Potsdam, Germany.

Ted Briscoe and John Carroll. 2002. Robust Accurate Statistical Annotation of General Text. In Proceedings of the 3rd Conference on Language Resources and Evaluation, pages 1499–1504, Las Palmas de Gran Canaria, Spain.

Fabienne Cap, Manju Nirmal, Marion Weller, and Sabine Schulte im Walde. 2015. How to Account for Idiomatic German Support Verb Constructions in Statistical Machine Translation. In Proceedings of the 11th Workshop on Multiword Expressions, pages 19–28, Denver, Colorado, USA.

Kostadin Cholakov and Valia Kordoni. 2014. Better Statistical Machine Translation through Linguistic Treatment of Phrasal Verbs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 196–201, Doha, Qatar.

Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2011. Mathematical Foundations for a Compositional Distributional Model of Meaning. Linguistic Analysis, 36(1-4):345–384.

Nicole H. de Jong, Laurie B. Feldman, Robert Schreuder, Michael Pastizzo, and Harald R. Baayen. 2002. The Processing and Representation of Dutch and English Compounds: Peripheral Morphological and Central Orthographic Effects. Brain and Language, 81:555–567.

Stefan Evert. 2005. The Statistics of Word Co-Occurrences: Word Pairs and Collocations. Ph.D. thesis, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.

Gertrud Faaß, Ulrich Heid, and Helmut Schmid. 2010. Design and Application of a Gold Standard for Morphological Analysis: SMOR in Validation. In Proceedings of the 7th International Conference on Language Resources and Evaluation, pages 803–810, Valletta, Malta.

Christiane Fellbaum, editor. 1998. WordNet – An Electronic Lexical Database. Language, Speech, and Communication. MIT Press, Cambridge, MA.

John R. Firth. 1957. Papers in Linguistics 1934-51. Longmans, London, UK.

Christina L. Gagné and Thomas L. Spalding. 2009. Constituent Integration during the Processing of Compound Words: Does it involve the Use of Relational Structures? Journal of Memory and Language, 60:20–35.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet – A Lexical-Semantic Net for German. In Proceedings of the ACL Workshop on Automatic Information Extraction and Building Lexical Semantic Resources for NLP Applications, pages 9–15, Madrid, Spain.

Zellig Harris. 1954. Distributional Structure. Word, 10(23):146–162.

Karl Moritz Hermann. 2014. Distributed Representations for Compositional Semantics. Ph.D. thesis, University of Oxford.

Niels Janssen, Yanchao Bi, and Alfonso Caramazza. 2008. A Tale of Two Frequencies: Determining the Speed of Lexical Access for Mandarin Chinese and English Compounds. Language and Cognitive Processes, 23:1191–1223.

Gonia Jarema, Celine Busson, Rossitza Nikolova, Kyrana Tsapkini, and Gary Libben. 1999. Processing Compounds: A Cross-Linguistic Study. Brain and Language, 68:362–369.

Hongbo Ji, Christina L. Gagné, and Thomas L. Spalding. 2011. Benefits and Costs of Lexical Decomposition and Semantic Integration during the Processing of Transparent and Opaque English Compounds. Journal of Memory and Language, 65:406–430.

Eva Kehayia, Gonia Jarema, Kyrana Tsapkini, Danuta Perlak, Angela Ralli, and Danuta Kadzielawa. 1999. The Role of Morphological Structure in the Processing of Compounds: The Interface between Linguistics and Psycholinguistics. Brain and Language, 68:370–377.

Claudia Kunze. 2000. Extension and Use of GermaNet, a Lexical-Semantic Database. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, pages 999–1002, Athens, Greece.

Judith N. Levi. 1978. The Syntax and Semantics of Complex Nominals. Academic Press, London.

Gary Libben, Martha Gibson, Yeo Bom Yoon, and Dominiek Sandra. 1997. Semantic Transparency and Compound Fracture. Technical Report 9, CLASNET Working Papers.

Gary Libben, Martha Gibson, Yeo Bom Yoon, and Dominiek Sandra. 2003. Compound Fracture: The Role of Semantic Transparency and Morphological Headedness. Brain and Language, 84:50–64.

Gary Libben. 1998. Semantic Transparency in the Processing of Compounds: Consequences for Representation, Processing, and Impairment. Brain and Language, 61:30–44.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. 1990. Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4):235–244.

Jeff Mitchell and Mirella Lapata. 2010. Composition in Distributional Models of Semantics. Cognitive Science, 34:1388–1429.

Diarmuid Ó Séaghdha and Ann Copestake. 2013. Interpreting Compound Nouns with Kernel Methods. Journal of Natural Language Engineering, 19(3):331–356.

Diarmuid Ó Séaghdha and Anna Korhonen. 2014. Probabilistic Distributional Semantics with Latent Variable Models. Computational Linguistics, 40(3):587–631.

Diarmuid Ó Séaghdha. 2007. Designing and Evaluating a Semantic Annotation Scheme for Compound Nouns. In Proceedings of Corpus Linguistics, Birmingham, UK.

Diarmuid Ó Séaghdha. 2008. Learning Compound Noun Semantics. Ph.D. thesis, University of Cambridge, Computer Laboratory. Technical Report UCAM-CL-TR-735.

Siva Reddy, Diana McCarthy, and Suresh Manandhar. 2011. An Empirical Study on Compositionality in Compound Nouns. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 210–218, Chiang Mai, Thailand.

Bahar Salehi and Paul Cook. 2013. Predicting the Compositionality of Multiword Expressions Using Translations in Multiple Languages. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics, pages 266–275, Atlanta, GA.

Bahar Salehi, Paul Cook, and Timothy Baldwin. 2014. Using Distributional Similarity of Multi-way Translations to Predict Multiword Expression Compositionality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 472–481, Gothenburg, Sweden.

Bahar Salehi, Paul Cook, and Timothy Baldwin. 2015a. A Word Embedding Approach to Predicting the Compositionality of Multiword Expressions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies, pages 977–983, Denver, Colorado, USA.

Bahar Salehi, Nitika Mathur, Paul Cook, and Timothy Baldwin. 2015b. The Impact of Multiword Expression Compositionality on Machine Translation Evaluation. In Proceedings of the 11th Workshop on Multiword Expressions, pages 54–59, Denver, Colorado, USA.

Dominiek Sandra. 1990. On the Representation and Processing of Compound Words: Automatic Access to Constituent Morphemes does not occur. The Quarterly Journal of Experimental Psychology, 42A:529–567.

Roland Schäfer and Felix Bildhauer. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 486–493, Istanbul, Turkey.

Roland Schäfer. 2015. Processing and Querying Large Web Corpora with the COW14 Architecture. In Proceedings of the 3rd Workshop on Challenges in the Management of Large Corpora, pages 28–34, Mannheim, Germany.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging using Decision Trees. In Proceedings of the 1st International Conference on New Methods in Language Processing.

Sabine Schulte im Walde, Stefan Müller, and Stephen Roller. 2013. Exploring Vector Space Models to Predict the Compositionality of German Noun-Noun Compounds. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics, pages 255–265, Atlanta, GA.

Sabine Schulte im Walde, Anna Hätty, Stefan Bott, and Nana Khvtisavrishvili. 2016. GhoSt-NN: A Representative Gold Standard of German Noun-Noun Compounds. In Proceedings of the 10th International Conference on Language Resources and Evaluation, pages 2285–2292, Portoroz, Slovenia.

Sidney Siegel and N. John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, Boston, MA.

Peter D. Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37:141–188.

Henk J. van Jaarsveld and Gilbert E. Rattink. 1988. Frequency Effects in the Processing of Lexicalized and Novel Nominal Compounds. Journal of Psycholinguistic Research, 17:447–473.

Claudia von der Heide and Susanne Borgwaldt. 2009. Assoziationen zu Unter-, Basis- und Oberbegriffen. Eine explorative Studie. In Proceedings of the 9th Norddeutsches Linguistisches Kolloquium, pages 51–74.

Marion Weller, Fabienne Cap, Stefan Müller, Sabine Schulte im Walde, and Alexander Fraser. 2014. Distinguishing Degrees of Compositionality in Compound Splitting for Statistical Machine Translation. In Proceedings of the 1st Workshop on Computational Approaches to Compound Analysis, pages 81–90, Dublin, Ireland.

Pienie Zwitserlood. 1994. The Role of Semantic Transparency in the Processing and Representation of Dutch Compounds. Language and Cognitive Processes, 9:341–368.
