Predicting emergent linguistic compositions through time: Syntactic frame extension via multimodal chaining

Lei Yu (1), Yang Xu (1, 2, 3)
(1) Department of Computer Science, University of Toronto, Toronto, Canada
(2) Cognitive Science Program, University of Toronto, Toronto, Canada
(3) Vector Institute for Artificial Intelligence, Toronto, Canada
{jadeleiyu,yangxu}@cs.toronto.edu

arXiv:2109.04652v1 [cs.CL] 10 Sep 2021

Abstract

Natural language relies on a finite lexicon to express an unbounded set of emerging ideas. One result of this tension is the formation of new compositions, such that existing linguistic units can be combined with emerging items into novel expressions. We develop a framework that exploits the cognitive mechanisms of chaining and multimodal knowledge to predict emergent compositional expressions through time. We present the syntactic frame extension model (SFEM) that draws on the theory of chaining and knowledge from "percept", "concept", and "language" to infer how verbs extend their frames to form new compositions with existing and novel nouns. We evaluate SFEM rigorously on the 1) modalities of knowledge and 2) categorization models of chaining, in a syntactically parsed English corpus over the past 150 years. We show that multimodal SFEM predicts newly emerged verb syntax and arguments substantially better than competing models using purely linguistic or unimodal knowledge. We find support for an exemplar view of chaining as opposed to a prototype view and reveal how the joint approach of multimodal chaining may be fundamental to the creation of literal and figurative language uses including metaphor and metonymy.

1   Introduction

Language users often construct novel compositions through time, such that existing linguistic units can be combined with emerging items to form novel expressions. Consider the expression swipe your phone, which presumably came about after the emergence of touchscreen-enabled smartphones. Here the use of the verb swipe was extended to express one's experience with the emerging item "smartphone". These incremental extensions are fundamental to adapting a finite lexicon toward emerging communicative needs. We explore the nature of cognitive mechanisms and knowledge in the temporal formation of previously unattested verb-argument compositions, and how this compositionality may be understood in principled terms.

Compositionality is at the heart of linguistic creativity, yet it is a notoriously challenging topic in computational linguistics and natural language processing (e.g., Vecchi et al. 2017; Cordeiro et al. 2016; Blacoe and Lapata 2012; Mitchell and Lapata 2010; Baroni and Zamparelli 2010). For instance, modern views on state-of-the-art neural models of language have suggested that they show some degree of linguistic generalization but are impoverished in systematic compositionality (see Baroni (2020) for a review). Existing work has also explored the efficacy of neural models in modeling diachronic semantics (e.g., Hamilton et al., 2016; Rosenfeld and Erk, 2018; Hu et al., 2019; Giulianelli et al., 2020). However, to our knowledge, no attempt has been made to examine principles in the formation of novel verb-noun compositions through time.

We formulate the problem as an inferential process which we call syntactic frame extension. We define a syntactic frame as a joint distribution over a verb predicate, its noun arguments, and their syntactic relations, and we focus on tackling two related predictive problems: 1) given a novel or existing noun, infer what verbs and syntactic relations that have not predicated the noun might emerge to describe it over time (e.g., to drive a car vs. to fly a car), and 2) given a verb predicate and a syntactic relation, infer what nouns can be plausibly introduced as its novel arguments in the future (e.g., drive a car vs. drive a computer).

Figure 1 offers a preview of our framework by visualizing the process of assigning novel verb frames to describe two query nouns over time. In the first case, the model incorporating perceptual and conceptual knowledge successfully predicts the verb drive to be a better predicate than fly for describing the novel item car that had just emerged at the time of prediction, where linguistic usages are not yet observed (i.e., emergent verb composition with a novel noun concept).
[Figure 1 appears here: two PCA panels (Dimension 1 vs. Dimension 2). The left panel contrasts the frames "drive a ___" and "fly a ___" for the query car among support nouns such as horse, wheel, cart, bird, kite, and flag at time=1920; the right panel contrasts "hold a ___" and "wear a ___" for the query computer among support nouns such as coat, dress, suit, recorder, device, and container at time=1980.]
Figure 1: Preview of the proposed approach of syntactic frame extension. Given a query noun (circled dot) at
time t, the framework draws on a combination of multimodal knowledge and cognitive mechanisms of chaining to
predict novel linguistic expressions for those items. The left panel shows via PCA projection how a newly emerged
noun (i.e., car in 1920s) is assigned appropriate verb frames by comparing the learned multimodal representation
of the query nouns with support nouns (non-circled dots) that have been predicated by the frames. The right panel
shows a similar verb construction for a noun that already existed at the time of prediction (i.e., computer in 1980s).

In the second case, the model predicts that hold is a better predicate than wear for describing the noun computer, which already existed at the time of prediction (i.e., emergent verb composition with an existing noun).

Our approach connects two strands of research that were rarely in contact: cognitive linguistic theories of chaining and computational representations of multimodal semantics. Work in the cognitive linguistics tradition has suggested that frame extension is not arbitrary and involves the comparison of a new item to existing items that are relevant to the frame (Fillmore, 1986). Similar proposals led to the theory of chaining, which postulates that linguistic categories grow by linking novel referents to existing referents of a word due to proximity in semantic space (Lakoff, 1987; Malt et al., 1999; Xu et al., 2016; Ramiro et al., 2018; Habibi et al., 2020; Grewal and Xu, 2020). However, such a theory has neither been formalized nor evaluated for predicting verb frame extensions through time. Separately, computational work in multimodal semantics has suggested that word meanings warrant a richer representation beyond purely linguistic knowledge (e.g., Bruni et al. 2012; Gella et al. 2016, 2017). However, multimodal semantic representations have neither been examined in the diachronics of compositionality nor in light of the cognitive theories of chaining. We show that a unified framework that incorporates the cognitive mechanisms of chaining through deep models of categorization and multimodal semantic representations predicts the temporal emergence of novel noun-verb compositions.

2   Related work

Our work synthesizes the interdisciplinary areas of cognitive linguistics, diachronic semantics, meaning representation, and deep learning.

2.1   Cognitive mechanisms of chaining

The problem of syntactic frame extension concerns the cognitive theory of chaining (Lakoff, 1987; Malt et al., 1999). It has been proposed that the historical growth of linguistic categories depends on a process of chaining, whereby novel items link to existing referents of a word that are close in semantic space, resulting in chain-like structures. Recent studies have formulated chaining as models of categorization from classic work in cognitive science. Specifically, it has been shown that chaining may be formalized as an exemplar-based mechanism of categorization emphasizing the semantic neighborhood profile (Nosofsky, 1986), which contrasts with a prototype-based mechanism that emphasizes category centrality (Reed, 1972; Lakoff, 1987; Rosch, 1975). This computational approach to chaining has been applied to explain word meaning growth in numeral classifiers (Habibi et al., 2020) and adjectives (Grewal and Xu, 2020).
Unlike these previous studies, we consider the open issue of whether cognitive mechanisms of chaining might be generalized to verb frame extension, which draws on rich sources of knowledge. It remains critically undetermined how "shallow models" such as the exemplar model can function or integrate with deep neural models (Mahowald et al., 2020; McClelland, 2020), and how such a mechanism might fare against the alternative mechanism of prototype-based chaining in the context of verb frame extension. We address both of these theoretical issues in a framework that explores these alternative mechanisms of chaining in light of probabilistic deep categorization models.

2.2   Diachronic semantics in NLP

The recent surge of interest in NLP on diachronic semantics has developed Bayesian models of semantic change (e.g., Frermann and Lapata, 2016), diachronic word embeddings (e.g., Hamilton et al., 2016), and deep contextualized language models (e.g., Rosenfeld and Erk, 2018; Hu et al., 2019; Giulianelli et al., 2020). A common assumption in these studies is that linguistic usages (from historical corpora) are sufficient to capture diachronic word meanings. However, previous work has suggested that text-derived distributed representations tend to miss important aspects of word meaning, including perceptual features (Andrews et al., 2009; Baroni and Lenci, 2008; Baroni et al., 2010) and relational information (Necşulescu et al., 2015). It has also been shown that both relational and perceptual knowledge are essential to construct creative or figurative language use such as metaphor (Gibbs Jr. et al., 2004; Gentner and Bowdle, 2008) and metonymy (Radden and Kövecses, 1999). Our work examines the function of multimodal semantic representations in capturing diachronic verb-noun compositions, and the extent to which such representations can be integrated with the cognitive mechanisms of chaining.

2.3   Multimodal representation of meaning

Computational research has shown the effectiveness of grounding language learning and distributional semantic models in multimodal knowledge beyond linguistic knowledge (Lazaridou et al., 2015; Hermann et al., 2017). For instance, Kiros et al. (2014) proposed a pipeline that combines image-text embedding models with LSTM neural language models. Bruni et al. (2014) identify discrete "visual words" in images, so that the distributional representation of a word can be extended to encompass its co-occurrence with the visual words of the images it is associated with. Gella et al. (2017) also showed how visual and multimodal information help to disambiguate verb meanings. Our framework extends these studies by incorporating the dimension of time into exploring how multimodal knowledge predicts novel language use.

2.4   Memory-augmented deep learning

Our framework also builds upon recent work on memory-augmented deep learning (Vinyals et al., 2016; Snell et al., 2017). In particular, it has been shown that category representations enriched by deep neural networks can effectively generalize to few-shot predictions with sparse input, hence yielding human-like abilities in classifying visual and textual data (Pahde et al., 2020; Singh et al., 2020; Holla et al., 2020). In our work, we consider the scenario of constructing novel compositions as they emerge over time, where only sparse linguistic information is available. We therefore extend this line of research to investigate how representations learned from naturalistic stimuli (e.g., images) and structured knowledge (e.g., knowledge graphs) can reliably model the emergence of flexible language use that expresses new knowledge and experience.

3   Computational framework

We present the syntactic frame extension model (SFEM), which is composed of two components. First, SFEM specifies a frame as a joint probabilistic distribution over a verb, its noun arguments, and their syntactic relations, and it supports temporal prediction of verb syntax and arguments via deep probabilistic models of categorization. Second, SFEM draws on multimodal knowledge by incorporating perceptual, conceptual, and linguistic cues into flexible inference for extended verb frames over time. Figure 2 illustrates our framework.

3.1   Chaining as probabilistic categorization

We denote a predicate verb as v (e.g., drive) and a syntactic relation as r (e.g., direct object of a verb), and consider a finite set of verb-syntactic frame elements f = (v, r) ∈ F. We define the set of nouns that appeared as arguments for a verb (under historically attested syntactic relations) up to time t as support nouns, denoted by n_s ∈ S(f)^(t) (e.g., horse appeared as a support noun—the direct object—for the verb drive prior to the 1880s).
[Figure 2 appears here: schematic of the model for the query "__ a car?". Perceptual information from ImageNet is encoded with VGG-19, and conceptual information from ConceptNet (e.g., car "locates at" garage, car "has a" wheel) is encoded with SVD; the two are combined by an Integration Network and passed to DEM/DPM categorization over candidate frames such as "drive a(n) __" (with support nouns like horse) and "fly a(n) __" (with support nouns like airplane) to yield the predicted frame.]

Figure 2: Illustration of the syntactic frame extension model for the emerging query car. The model integrates information from visual perception and conceptual knowledge about cars to form a multimodal embedding (h_n^(t)), which supports temporal prediction of appropriate verb-syntax usages via deep categorization models of chaining.

Given a query noun n* (e.g., car upon its emergence in the 1880s) that has never been an argument of v under relation r, we define syntactic frame extension as probabilistic inference in two related problems:

1. Verb syntax prediction. Here we predict which verb-syntactic frames f are appropriate to describe the query noun n*, operationalized as p(f | n*) yet to be specified.

2. Noun argument prediction. Here we predict which nouns n* are plausible novel arguments for a given verb-syntax frame f, operationalized as p(n* | f) yet to be specified.

We solve these inference problems by modeling the joint probability p(n*, f) for a query noun and a candidate verb-syntactic frame incrementally through time as follows:

    p(n^*, f)^{(t)} = p(n^* \mid f)^{(t)} \, p(f)^{(t)}    (1)
                    = p(n^* \mid S(f)^{(t)}) \, p(f)^{(t)}    (2)

Here we construct verb meaning based on its existing support nouns S(f)^(t) at the current time t. We infer the most probable verb-syntax usages for describing the query noun (Problem 1) as follows:

    p(f \mid n^*) = \frac{p(n^*, f)^{(t)}}{\sum_{f' \in F} p(n^*, f')^{(t)}}    (3)
                  = \frac{p(n^* \mid S(f)^{(t)}) \, p(f)^{(t)}}{\sum_{f' \in F} p(n^* \mid S(f')^{(t)}) \, p(f')^{(t)}}    (4)

In the learning phase, we train our model incrementally at each time period t by minimizing the negative log joint probability p(n*, f)^(t) in Equation 1 for every frame f and each of its query nouns n* ∈ Q(f)^(t):

    J = -\sum_{f \in F^{(t)}} \sum_{n^* \in Q(f)^{(t)}} \log p(n^*, f)^{(t)}    (5)

For each noun n, we consider a time-dependent hidden representation h_n^(t) ∈ R^M derived from different sources of knowledge (specified in Section 3.2). For the prior probability p(f)^(t), we consider a frequency-based approach that computes the proportion of unique noun arguments that a (verb) frame has been paired with and attested in a historical corpus:

    p(f)^{(t)} = \frac{|S(f)^{(t)}|}{\sum_{f' \in F} |S(f')^{(t)}|}    (6)
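To make the inference concrete, the sketch below shows how the frequency-based prior of Equation 6 and the frame posterior of Equations 3-4 could be computed from per-frame support-noun sets and likelihood scores. The function and variable names are our own illustrative choices, not the released SFEM implementation.

```python
import numpy as np

def frame_prior(support_sets):
    """Frequency-based prior p(f)^(t) (Eq. 6): proportion of unique
    support nouns attested for each frame up to time t."""
    frames = list(support_sets.keys())
    sizes = np.array([len(support_sets[f]) for f in frames], dtype=float)
    return dict(zip(frames, sizes / sizes.sum()))

def frame_posterior(likelihoods, priors):
    """Posterior p(f | n*) over candidate frames (Eqs. 3-4), given
    per-frame likelihoods p(n* | S(f)^(t)) and priors p(f)^(t)."""
    frames = list(likelihoods.keys())
    joint = np.array([likelihoods[f] * priors[f] for f in frames])
    return dict(zip(frames, joint / joint.sum()))

# Toy example: two candidate frames for the query noun "car".
support = {("drive", "dobj"): {"horse", "wheel", "cart"},
           ("fly", "dobj"): {"kite", "flag", "bird"}}
priors = frame_prior(support)
posterior = frame_posterior({("drive", "dobj"): 0.8, ("fly", "dobj"): 0.2}, priors)
```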
We formalize p(n* | f) (Problem 2), namely p(n* | S(f))^(t), by two classes of deep categorization models motivated by the literature on chaining, categorization, and memory-augmented learning.

Deep prototype model (SFEM-DPM). SFEM-DPM draws inspiration from the prototypical network for few-shot learning (Snell et al., 2017) and is grounded in the prototype theory of categorization in cognitive psychology (Rosch, 1975). The model computes a set of hidden representations for every support noun n_s ∈ S(f)^(t), and takes the expected vector as a prototype c_f^(t) to represent f at time t:

    c_f^{(t)} = \frac{1}{|S(f)^{(t)}|} \sum_{n_s \in S(f)^{(t)}} h_{n_s}^{(t)}    (7)

The likelihood of extending n* to S(f)^(t) is then defined as a softmax distribution over l2 distances d(·, ·) to the embedded prototype:

    p(n^* \mid S(f)^{(t)}) = \frac{\exp(-d(h_{n^*}^{(t)}, c_f^{(t)}))}{\sum_{f'} \exp(-d(h_{n^*}^{(t)}, c_{f'}^{(t)}))}    (8)

Deep exemplar model (SFEM-DEM). In contrast to the prototype model, SFEM-DEM resembles the memory-augmented matching network in deep learning (Vinyals et al., 2016), and formalizes the exemplar theory of categorization (Nosofsky, 1986) and chaining-based category growth (Habibi et al., 2020). Unlike DPM, this model depends on the l2 distances between n* and every support noun:

    p(n^* \mid S(f)^{(t)}) = \frac{\sum_{n_s \in S(f)^{(t)}} \exp(-d(h_{n^*}^{(t)}, h_{n_s}^{(t)}))}{\sum_{f'} \sum_{n'_s \in S(f')^{(t)}} \exp(-d(h_{n^*}^{(t)}, h_{n'_s}^{(t)}))}    (9)
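As a sketch of the two categorization mechanisms, assuming the multimodal embeddings are already available as vectors, the prototype likelihood of Equation 8 and the exemplar likelihood of Equation 9 could be rendered as follows; this is an illustrative NumPy version, not the authors' released code.

```python
import numpy as np

def dpm_likelihoods(h_query, support_embeddings):
    """SFEM-DPM (Eqs. 7-8): softmax over negative L2 distances between the
    query embedding and each frame's prototype (mean of its support nouns)."""
    frames = list(support_embeddings.keys())
    scores = np.array([
        np.exp(-np.linalg.norm(h_query - np.mean(support_embeddings[f], axis=0)))
        for f in frames
    ])
    return dict(zip(frames, scores / scores.sum()))

def dem_likelihoods(h_query, support_embeddings):
    """SFEM-DEM (Eq. 9): normalized sum of exponentiated negative L2
    distances to every individual support-noun exemplar."""
    frames = list(support_embeddings.keys())
    scores = np.array([
        sum(np.exp(-np.linalg.norm(h_query - h_s)) for h_s in support_embeddings[f])
        for f in frames
    ])
    return dict(zip(frames, scores / scores.sum()))

# support_embeddings maps each frame to an array of support-noun vectors, e.g.,
# {("drive", "dobj"): np.random.rand(3, 100), ("fly", "dobj"): np.random.rand(4, 100)}.
```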
3.2   Multimodal knowledge integration

In addition to the probabilistic formulation, SFEM draws on structured knowledge, including perceptual, conceptual, and linguistic cues, to construct the multimodal semantic representations h_n^(t) introduced in Section 3.1.

Perceptual knowledge. We capture perceptual knowledge from image representations in the large, taxonomically organized ImageNet database (Deng et al., 2009). For each noun n, we randomly sample a collection of 64 images from the union of all ImageNet synsets that contain n, and encode the images through the VGG-19 convolutional neural network (Simonyan and Zisserman, 2015) by extracting the output vector from the last fully connected layer after all convolutions (see similar procedures also in Pinto Jr. and Xu, 2021). We then average the encoded images to a mean vector x_p(n) ∈ R^1000 as the perceptual representation of n.
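A minimal sketch of this step, assuming torchvision's pretrained VGG-19 and a list of image files already sampled for the noun (the preprocessing choices here are standard ImageNet defaults, not necessarily the exact pipeline used in the paper):

```python
import torch
from torchvision import models, transforms
from PIL import Image

vgg = models.vgg19(pretrained=True).eval()   # final layer outputs a 1000-d vector

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def perceptual_embedding(image_paths):
    """Average the VGG-19 output vectors over the sampled ImageNet images of a noun."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            feats.append(vgg(x).squeeze(0))
    return torch.stack(feats).mean(dim=0)     # x_p(n) in R^1000
```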
Conceptual knowledge. To capture conceptual knowledge beyond perceptual information (e.g., attributes and functions), we extract information from the ConceptNet knowledge graph (Speer et al., 2017), which connects concepts in a network structure via different types of relations as edges. This graph reflects commonsense knowledge of a concept (noun) such as its functional role (e.g., a car IS_USED_FOR transportation), taxonomic information (e.g., a car IS_A vehicle), or attributes (e.g., a car HAS_A wheel). Since the concepts and their relations may change over time, we prepare a diachronic slice of the ConceptNet graph at each time t by removing all words whose frequency up to t in a reference historical text corpus (see Section 4 for details) falls under a threshold k_c, which we set to 10. We then compute embeddings for the remaining concepts following methods recommended in the original study by Speer et al. (2017). In particular, we perform singular value decomposition (SVD) on the positive pointwise mutual information matrix M_G^(t) of the ConceptNet graph G^(t) truncated at time t, and combine the top 300 dimensions (with the largest singular values) of the term and context matrices symmetrically into a concept embedding matrix. Each row of the resulting matrix therefore serves as the conceptual embedding x_c(n)^(t) ∈ R^300 for its corresponding noun.
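A simplified sketch of this embedding step: a truncated SVD over a PPMI matrix built from the time-sliced graph. The construction of the co-occurrence matrix and the exact symmetric combination of the term and context factors are our assumptions, not the precise ConceptNet recipe of Speer et al. (2017).

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information of a square concept-by-concept
    co-occurrence count matrix."""
    total = counts.sum()
    p_xy = counts / total
    p_x = counts.sum(axis=1, keepdims=True) / total
    p_y = counts.sum(axis=0, keepdims=True) / total
    expected = p_x * p_y
    out = np.zeros_like(p_xy, dtype=float)
    nz = p_xy > 0
    out[nz] = np.maximum(np.log(p_xy[nz] / expected[nz]), 0.0)
    return out

def conceptual_embeddings(counts, dim=300):
    """Truncated SVD of the PPMI matrix; term and context factors are
    combined symmetrically (one plausible reading of the recipe)."""
    U, s, Vt = np.linalg.svd(ppmi(counts), full_matrices=False)
    U, s, Vt = U[:, :dim], s[:dim], Vt[:dim]
    return U * np.sqrt(s) + Vt.T * np.sqrt(s)   # rows are x_c(n)^(t)
```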
Linguistic knowledge. For linguistic knowledge, we take the HistWords diachronic word embeddings x_l^(t) ∈ R^300 pre-trained on the Google N-Grams English corpora to represent the linguistic meaning of each noun at decade t (Hamilton et al., 2016).

Knowledge integration. To construct a unified representation that incorporates knowledge from different modalities, we take the mean of the unimodal representations described above as a joint vector x_n ∈ R^300, and then apply an integration function g: R^300 → R^M parameterized by a feedforward neural network to get the multimodal word representation h_n^(t) [1]. Our framework allows flexible combinations of the three modalities introduced, e.g., a full model utilizes all three types of knowledge, while a linguistic-only baseline directly takes the HistWords embeddings x_l^(t) as inputs to the integration network.

[1] For ImageNet embeddings, we apply a linear transformation to project each x_p^(t) into R^300 so that all unimodal representations are 300-d vectors before taking the means.
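The integration function g could look roughly like the following PyTorch module; the 300-dimensional input, three linear layers, and M = 100 output follow the paper's description, while the hidden width and the exact handling of missing modalities are our guesses (see Appendix A of the paper for the actual details).

```python
import torch
import torch.nn as nn

class IntegrationNetwork(nn.Module):
    """Integration function g: R^300 -> R^M applied to the mean of the
    available unimodal vectors (hidden size is an illustrative guess)."""

    def __init__(self, in_dim=300, hidden_dim=300, out_dim=100):
        super().__init__()
        self.visual_proj = nn.Linear(1000, in_dim)   # footnote 1: project x_p to R^300
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x_perceptual=None, x_conceptual=None, x_linguistic=None):
        parts = []
        if x_perceptual is not None:
            parts.append(self.visual_proj(x_perceptual))
        if x_conceptual is not None:
            parts.append(x_conceptual)
        if x_linguistic is not None:
            parts.append(x_linguistic)
        x_n = torch.stack(parts).mean(dim=0)   # joint vector x_n
        return self.net(x_n)                   # multimodal representation h_n^(t)
```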
Decade   Predicate verb   Syntactic relation              Support noun                     Query noun
1900     drive            direct object                   horse, wheel, cart               car, van
1950     work             prepositional object via "as"   mechanic, carpenter, scientist   astronaut, programmer
1980     store            prepositional object via "in"   fridge, container, box           supercomputer

Table 1: Sample entries from Google Syntactic Ngram including verb syntactic frames, support and query nouns.

4   Historical noun-verb compositions

To evaluate our framework, we collected a large dataset of historical noun-verb compositions derived from the Google Syntactic N-grams (GSN) English corpus (Lin et al., 2012) from 1850 to 2000. Specifically, we collected verb-noun-relation triples (n, v, r)^(t) that co-occur in the ENGALL subcorpus of GSN over the 150 years. We focused on working with common usages and pruned rare cases under the following criteria: 1) a noun n should have at least θ_p = 64 image representations in ImageNet, θ_c = 10 edges in the contemporary ConceptNet network, and θ_n = 15,000 counts (with POS tag as noun) in GSN over the 150-year period; 2) a verb v should have at least θ_v = 15,000 counts in GSN. To facilitate feasible model learning, we consider the top-20 most common syntactic relations in GSN, including direct object, direct subject, and relations concerning prepositional objects.

We binned the raw co-occurrence counts by decade (∆ = 10). At each decade, we define emerging query nouns n* for a given verb frame f as those whose number of co-occurrences with f up to time t falls below a threshold θ_q, while the number of co-occurrences with f up to time t + ∆ is above θ_q (i.e., an emergent use that conventionalizes). We define support nouns as those that co-occurred with f more than θ_s times before t. We found that θ_q = 10 and θ_s = 100 are reasonable choices. This preprocessing pipeline yielded a total of 10,349 verb-syntactic frames over 15 decades, where each frame class has at least 1 novel query noun and 4 existing support nouns. Table 1 shows sample entries of the data, which we make publicly available [2].

[2] Data and code are deposited here: https://github.com/jadeleiyu/frame_extension
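A schematic of how support and query nouns could be identified from decade-binned co-occurrence counts under these thresholds; the data structure and helper names are illustrative, and the released repository contains the actual pipeline.

```python
def query_and_support_nouns(counts, frame, t, delta=10, theta_q=10, theta_s=100):
    """counts[(noun, frame)] -> {decade: number of (n, v, r) co-occurrences}.

    Support nouns co-occurred with the frame more than theta_s times before t;
    query nouns have fewer than theta_q co-occurrences up to t but exceed
    theta_q by t + delta (an emergent use that conventionalizes)."""
    def total(noun, up_to, strict=False):
        decades = counts.get((noun, frame), {})
        return sum(c for d, c in decades.items() if (d < up_to if strict else d <= up_to))

    nouns = {n for (n, f) in counts if f == frame}
    support = {n for n in nouns if total(n, t, strict=True) > theta_s}
    query = {n for n in nouns if total(n, t) < theta_q and total(n, t + delta) > theta_q}
    return support, query
```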
                                                             We observe that 1) all multimodal models perform
5   Evaluation and results

We first describe the details of the SFEM implementation and diachronic evaluation. We then provide an in-depth analysis of the multimodal knowledge and chaining mechanisms in verb frame extension.

5.1   Details of model implementation

We implemented the integration network g(·) of SFEM as a three-layer feedforward neural network with an output dimension M = 100, and kept parameters and embeddings in other modules fixed during learning [3]. At each decade, we randomly sample 70% of the query nouns with their associated verb-syntactic pairs as training data, and take the remaining examples for model testing, such that there is no overlap in the query nouns between training and testing. We trained models on the negative log-likelihood loss defined in Equation 5 at each decade. To examine how multimodal knowledge contributes to temporal prediction of novel language use, we trained 5 DEM and 5 DPM models using information from different modalities.

[3] See Appendix A for additional implementation details.
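In outline, the train/test split keyed on query nouns could be produced as follows (a hypothetical helper; the model is then fit on the training examples by minimizing Equation 5):

```python
import random

def split_by_query_noun(examples, train_frac=0.7, seed=0):
    """examples: (query_noun, frame) pairs that emerge in decade t.
    Sample 70% of the query nouns for training so that no query noun
    appears in both the training and test sets."""
    nouns = sorted({noun for noun, _ in examples})
    random.Random(seed).shuffle(nouns)
    train_nouns = set(nouns[: int(train_frac * len(nouns))])
    train = [ex for ex in examples if ex[0] in train_nouns]
    test = [ex for ex in examples if ex[0] not in train_nouns]
    return train, test
```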
5.2   Evaluation against historical data

We test our models on both the verb syntax and the noun argument predictive tasks, with the goals of assessing 1) the contributions of multimodal knowledge, and 2) the two alternative mechanisms of chaining. We also consider baseline models that do not implement chaining-based mechanisms: a frequency baseline that predicts by count in GSN up to time t, and a random guesser. We evaluate model performance via standard receiver operating characteristic (ROC) curves that reveal the cumulative precision of models in their top m predictions. We compute the standard area-under-curve (AUC) statistic for the ROC curves to get the mean precision over all values of m from 1 to the candidate set size. Figure 3 summarizes the results over the 150 years. We observe that 1) all multimodal models perform better than their uni-/bi-modal counterparts, and 2) the exemplar-based model performs substantially better than the prototype-based counterpart, and both outperform the baseline models without chaining. In particular, a tri-modal deep exemplar model that incorporates knowledge from all three modalities achieves the best overall performance. These results provide strong support that verb frame extension depends on multimodal knowledge and exemplar-based chaining.

To further assess how the models perform in predicting emerging verb extension toward both novel and existing nouns, we report separate mean AUC scores for the predictive cases where query nouns are either completely novel (i.e., zero token frequency) or established (i.e., above-zero frequency) at the time of prediction.
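The sketch below shows one plausible reading of this metric: the cumulative hit rate of the ground-truth items within the top-m ranked candidates, averaged over all m; it illustrates the evaluation idea rather than reproducing the exact scoring script.

```python
import numpy as np

def mean_cumulative_hit_rate(scores, relevant):
    """scores: candidate -> model score; relevant: set of ground-truth candidates.
    Returns the mean, over m = 1..K, of the fraction of ground-truth items
    recovered within the top-m ranked candidates."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = np.cumsum([1.0 if c in relevant else 0.0 for c in ranked])
    return float((hits / len(relevant)).mean())
```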
[Figure 3 appears here: four line plots of AUC (y-axis, 0.5 to 0.9) by year (x-axis, 1860 to 1980), with DPM models in the left column and DEM models in the right column; the top row shows verb syntax prediction AUC and the bottom row shows noun argument prediction AUC. Legend: percept+concept+ling, percept+concept, concept+ling, concept, percept, ling, frequency.]

Figure 3: Area-under-curves of SFEM and baseline models from 1850s to 1990s. Top row: AUCs of predicting
extended syntax frames for query nouns. Bottom row: AUCs of predicting extended nouns for query verb frames.

Table 2 summarizes these results and shows that model performance is similar across the four predictive cases. For prediction with novel query nouns, it is not surprising that linguistic-only models fail due to the unavailability of linguistic mentions. However, for prediction with established query nouns, the superiority of multimodal SFEMs is still prominent, suggesting that our framework captures general principles in verb frame extension (and not just for predicting verb extension toward novel nouns).

Table 3 compares sample verb syntax predictions made by the full and linguistic-only DEM models, covering a diverse range of concepts including inventions (e.g., airplane), discoveries (e.g., microorganism), and occupations (e.g., astronaut). We observe that the full model typically constructs reasonable predicate verbs that reflect salient features of the query noun (e.g., cars are vehicles that are drive-able). In contrast, the linguistic-only model often predicts verbs that are either overly generic (e.g., purchase a telephone) or nonsensical.

5.3   Model analysis and interpretation

We provide further analyses and interpret why both multimodality and chaining mechanisms are fundamental to predicting emergent verb compositions.

5.3.1   The function of multimodal knowledge

To understand the function of multimodal knowledge, we compute, for each modality, the top-4 verb compositions that were most degraded in joint probability p(f, n) after ablating a knowledge modality from the full tri-modal SFEM (see Table 4). A drop in p(f, n) indicates reliance on multimodality in prediction. We found that linguistic knowledge helps the model identify some general properties that are absent in the other cues (e.g., a monitor is buy-able). Importantly, for the two extra-linguistic knowledge modalities, we observe that visual-perceptual knowledge helps predict many image-based metaphors, including "the airplane rolls" (i.e., based on common shapes of airplanes) and "the tree stands" (based on the verticality of trees). On the other hand, conceptual knowledge predicts cases of logical metonymy (e.g., "work for the newspaper") and conceptual metaphor (e.g., "kill the process"). These examples suggest that multimodality serves to ground and embody SFEM with commonsense knowledge that constructs novel verb compositions for not only literal language use, but also non-literal or figurative language use that is extensively discussed in the psycholinguistics literature (Lakoff, 1982; Radden and Kövecses, 1999; Gibbs Jr. et al., 2004).

We also evaluate the contributions of the three modalities to model prediction by comparing the AUC scores from the three uni-modal DEMs. Figure 4 shows the percentage breakdown of examples on which one of the modalities yields the highest score (i.e., contributes most to a reliable prediction). We observe that conceptual cues explain the data best in almost 2/3 of the cases, followed by perceptual and linguistic cues. These results suggest that while conceptual knowledge plays a dominant role in model prediction, all three modalities contain complementary information in predicting novel language use through time.
Model                                          AUC – verb syntax prediction                    AUC – noun argument prediction
                                          novel items    existing items    combined        novel items     existing items     combined
DPM (linguistics)                         0.642          0.690             0.681           0.641           0.653              0.650
DPM (perceptual)                          0.632          0.666             0.657           0.650           0.624              0.629
DPM (conceptual)                          0.772          0.722             0.733           0.727           0.705              0.711
DPM (perceptual+conceptual)               0.809          0.754             0.767           0.725           0.719              0.721
DPM (perceptual+linguistics)              0.645          0.669             0.661           0.655           0.669              0.665
DPM (conceptual+linguistics)              0.753          0.774             0.766           0.776           0.768              0.770
DPM (perceptual+conceptual+linguistics)   0.848          0.810             0.815           0.799           0.786              0.788
DEM (linguistics)                         0.652          0.690             0.686           0.641           0.625              0.632
DEM (perceptual)                          0.737          0.674             0.684           0.659           0.650              0.655
DEM (conceptual)                          0.854          0.784             0.788           0.736           0.724              0.729
DEM (perceptual+conceptual)               0.858          0.792             0.797           0.750           0.744              0.748
DEM (perceptual+linguistics)              0.712          0.759             0.753           0.698           0.710              0.708
DEM (conceptual+linguistics)              0.902          0.866             0.870           0.837           0.822              0.824
DEM (perceptual+conceptual+linguistics)   0.919          0.872             0.878           0.856           0.820              0.827
Baseline (frequency)                      0.573          0.573             0.573           0.536           0.536              0.536
Baseline (random)                         0.500          0.500             0.500           0.500           0.500              0.500

     Table 2: Mean model AUC scores of verb syntax and noun argument predictions from 1850s to 1990s.

Query noun      Decade   Predicted frames (linguistic-only DEM)                             Predicted frames (tri-modal DEM)
telephone       1860     roll-nsubj, load-dobj, play-pobj_prep.on                           purchase-dobj, pick-dobj, remain-pobj_prep.on
microorganism   1900     decorate-pobj_prep.with, play-pobj_prep.on, spread-dobj            feed-pobj_prep.on, mix-pobj_prep.with, breed-dobj
airplane        1930     load-dobj, mount-pobj_prep.on, mount-dobj, blow-dobj, roll-nsubj   fly-dobj, approach-nsubj, drive-dobj, stop-nsubj
astronaut       1950     spin-dobj, work-pobj_prep.in, emerge-pobj_prep.from                work-pobj_prep.as, talk-pobj_prep.to, lead-pobj_prep.by
computer        1970     purchase-dobj, fix-dobj, generate-dobj, write-pobj_prep.to         store-pobj_prep.in, move-pobj_prep.into, display-pobj_prep.on, implement-pobj_prep.in

Table 3: Example predictions of novel verb-noun compositions from the full tri-modal and linguistic-only models.

Ablated modality   Most affected compositions
Language           buy a monitor, find a disk (*), a resident dies, specialized in nutrition (*), point to the window
Percept            an airplane rolls, talk to an entrepreneur (*), the tree stands, the doctor says, topped with nutella (*)
Concept            perform in the film (*), work as a programmer (*), work for the newspaper, expand the market, kill the process

Table 4: Top-4 ground-truth compositions with most prominent drops in joint probability p(f, n) after ablation of one modality of knowledge from SFEM. Phrases marked with '*' include novel query nouns.

[Figure 4 appears here: pie chart of the percentage breakdown with slices Conceptual 64.3% (N=10605), Perceptual 21.9% (N=3612), and Linguistic 13.7% (N=2265), annotated with example compositions such as "roll an airplane", "talk to an entrepreneur", "drain a battery", "wear a microphone", "drive a car", and "work as an astronaut".]

Figure 4: Percentage breakdown of the three modalities in model prediction, with annotated examples.

5.3.2   General mechanisms of chaining

We next analyze general mechanisms of chaining by focusing on understanding the superiority of exemplar-based chaining in SFEM.
[Figure 5 appears here: two PCA panels (Dimension 1 vs. Dimension 2) showing the query supercomputer relative to the support nouns computer, glass, shoe, container, and warehouse, under the exemplar mechanism (left) and the prototype mechanism (right).]
Figure 5: Illustrations of two mechanisms of chaining (left: exemplar; right: prototype) in verb frame prediction
for query supercomputer. Nouns are PCA-projected in 2D, with categories color-coded and in dashed boundaries.

Figure 5 illustrates the exemplar-based and prototype-based processes of chaining with an example verb frame prediction for the noun "supercomputer". For simplicity, we only show two competing frames, "to store in a ___" and "to wear a ___". In this case, the query noun is semantically distant from most of the prototypical support nouns in both categories, and is slightly closer to the centroid of the "wear" class than to that of the "store" class. The prototype model would then predict the incorrect composition "to wear a supercomputer". In contrast, the exemplar model is more sensitive to the semantic neighborhood profile of the query noun and the aprototypical support noun "computer" of the "store in ___" class, and it therefore correctly predicts that "supercomputer" is more likely to be predicated by "to store in". Our discovery that exemplar-based chaining accounts for verb composition through time mirrors existing findings on similar mechanisms of chaining in the extensions of numeral classifiers (Habibi et al., 2020) and adjectives (Grewal and Xu, 2020), and together they suggest that a general cognitive mechanism may underlie historical linguistic innovation.

6   Conclusion

We have presented a probabilistic framework for characterizing the process of syntactic frame extension, in which verbs extend their referential range toward novel and existing nouns over time. Our results suggest that language users rely on extra-linguistic knowledge from percept and concept to construct new linguistic compositions via a process of exemplar-based chaining. Our work creates a novel approach to diachronic compositionality and strengthens the link between multimodal semantics and cognitive linguistic theories of categorization.

Acknowledgments

We would like to thank Graeme Hirst, Michael Hahn, and Suzanne Stevenson for their feedback on the manuscript, and members of the Cognitive Lexicon Laboratory at the University of Toronto for helpful suggestions. We also thank the anonymous reviewers for their constructive comments. This work was supported by an NSERC Discovery Grant RGPIN-2018-05872, a SSHRC Insight Grant #435190272, and an Ontario ERA Award to YX.

References

Mark Andrews, Gabriella Vigliocco, and David Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3):463.

Marco Baroni. 2020. Linguistic generalization and compositionality in modern artificial neural networks. Philosophical Transactions of the Royal Society B, 375(1791):20190307.

Marco Baroni and Alessandro Lenci. 2008. Concepts and properties in word spaces. Italian Journal of Linguistics, 20(1):55–88.

Marco Baroni, Brian Murphy, Eduard Barbu, and Massimo Poesio. 2010. Strudel: A distributional semantic model based on property and types. Cognitive Science, 34(2):222–254.
Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193.

William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Silvio Cordeiro, Carlos Ramisch, Marco Idiart, and Aline Villavicencio. 2016. Predicting the compositionality of nominal compounds: Giving word embeddings a hard time. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1986–1997.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.

Charles J Fillmore. 1986. Frames and the semantics of understanding. Quaderni di Semantica, 6:222–254.

Lea Frermann and Mirella Lapata. 2016. A Bayesian model of diachronic meaning change. Transactions of the Association for Computational Linguistics, 4:31–45.

Spandana Gella, Frank Keller, and Mirella Lapata. 2017. Disambiguating visual verbs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):311–322.

Spandana Gella, Mirella Lapata, and Frank Keller. 2016. Unsupervised visual sense disambiguation for verbs using multimodal embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 182–192, San Diego, California. Association for Computational Linguistics.

Dedre Gentner and Brian Bowdle. 2008. Metaphor as structure-mapping. The Cambridge Handbook of Metaphor and Thought, 109:128.

Raymond W Gibbs Jr., Paula Lenz Costa Lima, and Edson Francozo. 2004. Metaphor is grounded in embodied experience. Journal of Pragmatics, 36(7):1189–1210.

Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. 2020. Analysing lexical semantic change with contextualised word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3960–3973.

Karan Grewal and Yang Xu. 2020. Chaining and historical adjective extension. In Proceedings of the 42nd Annual Conference of the Cognitive Science Society.

Amir Ahmad Habibi, Charles Kemp, and Yang Xu. 2020. Chaining and the growth of linguistic categories. Cognition, 202:104323.

William L Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501.

Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojciech Marian Czarnecki, Max Jaderberg, Denis Teplyashin, et al. 2017. Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551.

Nithin Holla, Pushkar Mishra, Helen Yannakoudakis, and Ekaterina Shutova. 2020. Learning to learn to disambiguate: Meta-learning for few-shot word sense disambiguation. arXiv preprint arXiv:2004.14355.

Renfen Hu, Shen Li, and Shichen Liang. 2019. Diachronic sense modeling with deep contextualized word embeddings: An ecological view. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3899–3908.

Ryan Kiros, Ruslan Salakhutdinov, and Rich Zemel. 2014. Multimodal neural language models. In Proceedings of the 31st International Conference on Machine Learning, pages 595–603.

George Lakoff. 1982. Experiential factors in linguistics. Language, Mind, and Brain, pages 142–157.

George Lakoff. 1987. Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. University of Chicago Press.

Angeliki Lazaridou, Marco Baroni, et al. 2015. Combining language and vision with a multimodal skip-gram model. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 153–163.
Yuri Lin, Jean-Baptiste Michel, Erez Aiden Lieberman, Jon Orwant, Will Brockman, and Slav Petrov. 2012. Syntactic annotations for the Google Books Ngram corpus. In Proceedings of the ACL 2012 System Demonstrations, pages 169–174.

Kyle Mahowald, George Kachergis, and Michael C Frank. 2020. What counts as an exemplar model, anyway? A commentary on Ambridge (2020). First Language, 40(5-6):608–611.

Barbara C Malt, Steven A Sloman, Silvia Gennari, Meiyi Shi, and Yuan Wang. 1999. Knowing versus naming: Similarity and the linguistic categorization of artifacts. Journal of Memory and Language, 40(2):230–262.

James L McClelland. 2020. Exemplar models are useful and deep neural networks overcome their limitations: A commentary on Ambridge (2020). First Language, 40(5-6):612–615.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Silvia Necşulescu, Sara Mendes, David Jurgens, Núria Bel, and Roberto Navigli. 2015. Reading between the lines: Overcoming data sparsity for accurate classification of lexical relationships. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 182–192.

Robert M Nosofsky. 1986. Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115(1):39.

Frederik Pahde, Mihai Puscas, Tassilo Klein, and Moin Nabi. 2020. Multimodal prototypical networks for few-shot learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2644–2653.

Renato Ferreira Pinto Jr. and Yang Xu. 2021. A computational theory of child overextension. Cognition, 206:104472.

Günter Radden and Zoltán Kövecses. 1999. Towards a theory of metonymy. Metonymy in Language and Thought, 4:17–60.

Christian Ramiro, Mahesh Srinivasan, Barbara C Malt, and Yang Xu. 2018. Algorithms in the historical emergence of word senses. Proceedings of the National Academy of Sciences, 115(10):2323–2328.

Stephen K Reed. 1972. Pattern recognition and categorization. Cognitive Psychology, 3(3):382–407.

Eleanor Rosch. 1975. Cognitive representations of semantic categories. Journal of Experimental Psychology: General, 104(3):192.

Alex Rosenfeld and Katrin Erk. 2018. Deep neural models of semantic shift. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 474–484.

Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Pulkit Singh, Joshua C Peterson, Ruairidh M Battleday, and Thomas L Griffiths. 2020. End-to-end deep prototype and exemplar models for predicting human behavior. In Proceedings of the 42nd Annual Conference of the Cognitive Science Society (CogSci 2020).

Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.

Eva M Vecchi, Marco Marelli, Roberto Zamparelli, and Marco Baroni. 2017. Spicy adjectives and nominal donkeys: Capturing semantic deviance using compositionality in distributional spaces. Cognitive Science, 41(1):102–136.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. Advances in Neural Information Processing Systems, 29:3630–3638.

Yang Xu, Terry Regier, and Barbara C Malt. 2016. Historical semantic chaining and efficient communication: The case of container names. Cognitive Science, 40(8):2081–2094.
A Additional details of SFEM implementation
We implemented the integration network g(·) as a three-layer feedforward neural network using PyTorch, where the layers have dimensions of 300, 200, and 100, respectively. For models that incorporate fewer than three modalities, we replace the missing embeddings with a zero vector when computing the mean vectors before knowledge integration.
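The following is a minimal PyTorch sketch of this setup. The class and function names (IntegrationNetwork, mean_with_missing), the ReLU nonlinearities, and the assumption that the averaged input embedding is 300-dimensional are illustrative choices rather than the exact released implementation.

    import torch
    import torch.nn as nn

    class IntegrationNetwork(nn.Module):
        # Three-layer feedforward integration network g(.) with layer sizes 300, 200, and 100.
        def __init__(self, emb_dim=300):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(emb_dim, 300), nn.ReLU(),
                nn.Linear(300, 200), nn.ReLU(),
                nn.Linear(200, 100),
            )

        def forward(self, x):
            return self.net(x)

    def mean_with_missing(modal_embeddings, emb_dim=300):
        # Average the per-modality embeddings, substituting a zero vector
        # for any missing modality before knowledge integration.
        filled = [e if e is not None else torch.zeros(emb_dim) for e in modal_embeddings]
        return torch.stack(filled).mean(dim=0)

    # Example: a noun with language and concept embeddings but no visual (percept) embedding.
    g = IntegrationNetwork()
    language, concept, percept = torch.randn(300), torch.randn(300), None
    integrated = g(mean_with_missing([language, concept, percept]))  # 100-d representation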
During training, except for the network weights in g(·), we keep the parameters of every other module (i.e., the VGG-19 encoder and all unimodal embeddings) fixed, and optimize SFEM by minimizing the negative log-likelihood loss function specified in Equation 5 via stochastic gradient descent (SGD).
Each training batch consists of B = 64 syntactic frames with their associated query and support nouns. We train each model for 200 epochs and save the configuration that achieves the highest validation accuracy for our evaluation described in Section 5.
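A schematic sketch of this training loop is given below. The data loaders, the learning rate, and the loss_fn and eval_fn callables (standing in, respectively, for the negative log-likelihood of Equation 5 and the validation-accuracy computation) are placeholders supplied by the caller, not parts of the released implementation.

    import copy
    import torch

    def train_sfem(model, integration_params, train_loader, val_loader,
                   loss_fn, eval_fn, epochs=200, lr=0.1):
        # Only the integration network g(.) is updated; all other modules stay frozen.
        optimizer = torch.optim.SGD(integration_params, lr=lr)
        best_acc, best_state = -1.0, None

        for _ in range(epochs):
            model.train()
            for batch in train_loader:            # each batch: B = 64 syntactic frames
                optimizer.zero_grad()
                loss = loss_fn(model, batch)      # negative log-likelihood (Equation 5)
                loss.backward()
                optimizer.step()

            model.eval()
            acc = eval_fn(model, val_loader)      # validation accuracy
            if acc > best_acc:                    # keep the best-validation configuration
                best_acc = acc
                best_state = copy.deepcopy(model.state_dict())

        model.load_state_dict(best_state)
        return model, best_acc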