Experience Grounds Language


Yonatan Bisk*  Ari Holtzman*  Jesse Thomason*
Jacob Andreas  Yoshua Bengio  Joyce Chai  Mirella Lapata
Angeliki Lazaridou  Jonathan May  Aleksandr Nisnevich  Nicolas Pinto  Joseph Turian

arXiv:2004.10151v1 [cs.CL] 21 Apr 2020

Abstract

Successful linguistic communication relies on a shared experience of the world, and it is this shared experience that makes utterances meaningful. Despite the incredible effectiveness of language processing models trained on text alone, today's best systems still make mistakes that arise from a failure to relate language to the physical world it describes and to the social interactions it facilitates.

Natural Language Processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks. We posit that the present success of representation learning approaches trained on large text corpora can be deeply enriched from the parallel tradition of research on the contextual and social nature of language.

In this article, we consider work on the contextual foundations of language: grounding, embodiment, and social interaction. We describe a brief history and possible progression of how contextual information can factor into our representations, with an eye towards how this integration can move the field forward and where it is currently being pioneered. We believe this framing will serve as a roadmap for truly contextual language understanding.

    Meaning is not a unique property of language, but a general characteristic of human activity ... We cannot say that each morpheme or word has a single or central meaning, or even that it has a continuous or coherent range of meanings ... there are two separate uses and meanings of language – the concrete ... and the abstract.
        Zellig S. Harris (Distributional Structure, 1954)

Improvements in hardware and data collection have galvanized progress in NLP. Performance peaks have been reached in tasks such as language modeling (Radford et al., 2019; Zellers et al., 2019c; Keskar et al., 2019) and span-selection question answering (Devlin et al., 2019a; Yang et al., 2019; Lan et al., 2019) through massive data and massive models. Now is an excellent time to reflect on the direction of our field, and on the relationship of language to the broader AI, Cognitive Science, and Linguistics communities.

We are interested in how the data and world a language learner is exposed to defines and potentially constrains the scope of the learner's semantics. Meaning is not a unique property of language; it is a universal property of intelligence. Consequently, we must consider what knowledge or concepts are missing from models trained solely on text corpora, even when those corpora are large scale or meticulously annotated.

Large, generative language models emit utterances which violate visual, physical, and social commonsense. Perhaps this is to be expected, since they are tested in terms of Independent and Identically Distributed held-out sets, rather than queried on points meant to probe the granularity of the distinctions they make (Kaushik et al., 2020; Gardner et al., 2020). We draw on previous work in NLP, Cognitive Science, and Linguistics to provide a roadmap towards addressing these gaps. We posit that the universes of knowledge and experience available to NLP models can be defined by successively larger world scopes: from a single corpus to a fully embodied and social context.

We propose the notion of a World Scope (WS) as a lens through which to view progress in NLP. From the beginning of corpus linguistics (Zipf, 1932; Harris, 1954), to the formation of the Penn Treebank (Marcus et al., 1993), NLP researchers have consistently recognized the limitations of corpora in terms of coverage of language and experience. To organize this intuition, we propose five WSs, and note that most current work in NLP operates in the second (internet-scale data).

We define five levels of World Scope:

    WS1. Corpus (our past)
    WS2. Internet (our present)
    WS3. Perception
    WS4. Embodiment
    WS5. Social
1 WS1: Corpora and Representations

[Figure: percentage of 2019 citations by year (1960-2020) for Harris (1954), Firth (1957), and Chomsky (1957). Academic interest in Firth and Harris increases dramatically around 2010, perhaps due to the popularization of Firth (1957): "You shall know a word by the company it keeps."]

The story of computer-aided, data-driven language research begins with the corpus. While by no means the only one of its kind, the Penn Treebank (Marcus et al., 1993) is the canonical example of a sterilized subset of naturally generated language, processed and annotated for the purpose of studying representations, a perfect example of WS1. While initially much energy was directed at finding structure (e.g. syntax trees), this has been largely overshadowed by the runaway success of unstructured, fuzzy representations—from dense word vectors (Mikolov et al., 2013) to contextualized pretrained representations (Peters et al., 2018b; Devlin et al., 2019b).

Yet fuzzy, conceptual word representations have a long history that predates the recent success of deep learning methods. Philosophy (Lakoff, 1973) and linguistics (Coleman and Kay, 1981) recognized that meaning is flexible yet structured. Early experiments on neural networks trained with sequences of words (Elman, 1990) suggested that vector representations could capture both syntax and semantics. Subsequent experiments with larger models, document contexts, and corpora have demonstrated that representations learned from text capture a great deal of information about word meanings in and out of context (Bengio et al., 2003; Collobert and Weston, 2008; Turian et al., 2010; Mikolov et al., 2013; McCann et al., 2017).

It has long been acknowledged that context lends meaning (Firth, 1957; Turney and Pantel, 2010). Local contexts proved powerful when combined with agglomerative clustering guided by mutual information (Brown et al., 1992). A word's position in this hierarchy captures semantic and syntactic distinctions. When the Baum-Welch algorithm (Welch, 2003) was applied to unsupervised Hidden Markov Models, it assigned a class distribution to every word, and that distribution was considered a partial representation of a word's "meaning." If the set of classes was small, syntax-like classes were induced; if the set was large, classes became more semantic. Over the years this "search for meaning" has played out again and again, often with similar themes.

Later approaches discarded structure in favor of larger context windows for better semantics. The most popular generative approach came in the form of Latent Dirichlet Allocation (Blei et al., 2003). The sentence structure was discarded and a document generated as a bag-of-words conditioned on topics. However, the most successful vector representations came from Latent Semantic Indexing/Analysis (Deerwester et al., 1988, 1990; Dumais, 2004), which represents words by their co-occurrence with other words. Via singular value decomposition, these matrices are reduced to low-dimensional projections, essentially a bag-of-words.

A related question also existed in connectionism (Pollack, 1987; James and Miikkulainen, 1995): are concepts distributed through edges or local to units in an artificial neural network?

    "... there has been a long and unresolved debate between those who favor localist representations in which each processing element corresponds to a meaningful concept and those who favor distributed representations."
        Hinton (1990), Special Issue on Connectionist Symbol Processing

Unlike work that defined words as distributions over clusters, which were perceived as having intrinsic meaning to the user of a system, in connectionism there was the question of where meaning resides. The tension of modeling symbols and distributed representations was nicely articulated by Smolensky (1990), and alternative representations (Kohonen, 1984; Hinton et al., 1986; Barlow, 1989) and approaches to structure and composition (Erk and Padó, 2008; Socher et al., 2012) span decades of research.

All of these works rely on corpora. The Brown Corpus (Francis, 1964) and Penn Treebank (Marcus et al., 1993) were colossal undertakings that defined linguistic context for decades. Only relatively recently (Baroni et al., 2009) has the cost of annotations decreased enough to enable the introduction of new tasks and have web-crawls become viable for researchers to collect and process (WS2).[1]

[1] An important parallel discussion centers on the hardware …
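The LSA recipe described here (count word co-occurrences, then reduce the matrix with a truncated SVD to obtain low-dimensional word vectors) can be sketched in a few lines. This is a minimal illustration, not any cited system's implementation; the toy corpus, the +/-2 context window, and the dimension k = 2 are illustrative assumptions.

```python
# Minimal LSA-style sketch: word-word co-occurrence counts reduced
# by a truncated SVD. Toy corpus and hyperparameters are illustrative.
import numpy as np

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat chased the dog".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 token window.
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if i != j:
                C[index[w], index[sent[j]]] += 1

# Truncated SVD: keep the top-k singular directions as word vectors.
U, S, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]  # each row is a k-dim word embedding

def similarity(a, b):
    """Cosine similarity between two words' LSA vectors."""
    va, vb = word_vectors[index[a]], word_vectors[index[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12))
```

Dropping all but the top-k singular directions is what collapses the sparse counts into the dense, "fuzzy" representations the section describes.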
2 WS2: The Written World

Corpora in NLP have broadened to include large web-crawls. The use of unstructured, unlabeled, multi-domain, and multilingual data broadens our world scope, in the limit, to everything humanity has ever written. We are no longer constrained to a single author or source, and the temptation for NLP is to believe everything that needs knowing can be learned from the written world.

This move towards using large-scale data (whether mono- or multilingual) has led to substantial advances in performance on existing and novel community benchmarks (Wang et al., 2019a). These advances have come due to the transfer learning enabled by representations in deep models. Traditionally, transfer learning relied on our understanding of model classes, such as English grammar. Domain adaptation could proceed by simply providing a model with sufficient data to capture lexical variation, while assuming most higher-level structure would remain the same. Embeddings—lexical representations built from massive corpora—encompass multiple domains and lexical senses, violating this structural assumption.

[…] with Good (1953), the weights of Kneser and Ney (1995); Chen and Goodman (1996), and the power-law distributions of Teh (2006). Modern approaches to learning dense representations allow us to better estimate these distributions from massive corpora. However, modeling lexical co-occurrence, no matter the scale, is still "modeling the written world." Models constructed this way blindly search for symbolic co-occurrences void of meaning.

In regards to the apparent paradox of "impressive results" vs. "diminishing returns", language modeling—the modern workhorse of neural NLP systems—provides a poignant example. Recent pretraining literature has produced results that few could have predicted, crowding leaderboards with results that make claims to super-human accuracy (Rajpurkar et al., 2018). However, there are diminishing returns. For example, on the LAMBADA dataset (Paperno et al., 2016), designed to capture human intuition, GPT2 (Radford et al., 2019) (1.5B parameters), Megatron-LM (Shoeybi et al., 2019) (8.3B), and TuringNLG (Rosset, 2020) (17B) perform within a few points of each other and very far from perfect.
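The smoothing tradition referenced above can be made concrete with a small sketch of interpolated Kneser-Ney bigram estimation (Kneser and Ney, 1995; Chen and Goodman, 1996). This is a simplified illustration with a single absolute discount; the toy corpus and d = 0.75 are illustrative assumptions, and production estimators use count-dependent discounts as in Chen and Goodman's modified variant.

```python
# Minimal interpolated Kneser-Ney bigram sketch with one absolute
# discount d. Toy corpus and d are illustrative assumptions.
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()

bigrams = Counter(zip(corpus, corpus[1:]))
histories = Counter(corpus[:-1])                  # c(u): tokens used as a history
continuations = Counter(w for (_, w) in bigrams)  # N1+(. w): distinct left contexts
followers = Counter(u for (u, _) in bigrams)      # N1+(u .): distinct right contexts
d = 0.75

def p_kn(w, u):
    """Interpolated Kneser-Ney P(w | u); assumes u was seen as a history."""
    # Discounted maximum-likelihood bigram term.
    direct = max(bigrams[(u, w)] - d, 0.0) / histories[u]
    # Probability mass freed by discounting, redistributed according to
    # how many distinct contexts each word completes (continuation counts).
    backoff_weight = d * followers[u] / histories[u]
    p_continuation = continuations[w] / len(bigrams)
    return direct + backoff_weight * p_continuation
```

The continuation term is the distinctive Kneser-Ney idea: a word's backoff probability reflects how many contexts it completes, not how often it occurs, which is exactly the kind of distributional weight scheme the paragraph above alludes to.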
3 WS3: The World of Sights and Sounds

Language learning needs perception. An agent observing the real world can build axioms from which to extrapolate. Learned, physical heuristics, such as that a falling cat will land quietly, are generalized and abstracted into language metaphors like as nimble as a cat (Lakoff, 1980). World knowledge forms the basis for how people make entailment and reasoning decisions. There exists strong evidence that children require grounded sensory input, not just speech, to learn language (Sachs et al., 1981; O'Grady, 2005; Vigliocco et al., 2014).

Perception includes auditory, tactile, and visual input. Even restricted to purely linguistic signals, auditory input is necessary for understanding sarcasm, stress, and meaning implied through prosody. Further, tactile senses give meaning, both physical (Sinapov et al., 2014; Thomason et al., 2016) and abstract, to concepts like heavy and soft. Visual perception is a powerful signal for modeling a vastness of experiences in the world that cannot be documented by text alone (Harnad, 1990).

For example, frames and scripts (Schank and Abelson, 1977; Charniak, 1977; Dejong, 1981; Mooney and Dejong, 1985) require understanding often unstated sets of pre- and post-conditions about the world. To borrow from Charniak (1977), how should we learn the meaning, method, and implications of painting? A web crawl of knowledge from an exponential number of possible how-to, text-only guides and manuals (Bisk et al., 2020) is misdirected without some fundamental referents to ground symbols to. We posit that models must be able to watch and recognize objects, people, and activities to understand the language describing them (Li et al., 2019b; Krishna et al., 2017; Yatskar et al., 2016; Perlis, 2016).

[Figure: a common-sense knowledge fragment about painting, reproduced from Eugene Charniak, "A Framed PAINTING: The Representation of a Common Sense Knowledge Fragment" (1977).]

While the NLP community has played an important role in the history of scripts and grounding (Mooney, 2008), recently remarkable progress in grounding has taken place in the Computer Vision community. In the last decade, CV researchers have refined and codified the "backbone" of computer vision architectures. These advances have led to parameter-efficient models (Tan and Le, 2019) and real-time object detection on embedded devices and phones (Rastegari et al., 2016; Redmon and Farhadi, 2018).

We find anecdotally that many researchers in the NLP community still assume that vision models tested on the 1,000-class[2] ImageNet challenge are limited to extracting a bag of visual words. In reality, these architectures are mature and capable of generalizing, both from the perspective of engineering infrastructure[3] and in that the backbones have been tested against other technologies like self-attention (Ramachandran et al., 2019). The stability of these architectures allows for new research into more challenging world modeling. Mottaghi et al. (2016) predict the effects of forces on objects in images. Bakhtin et al. (2019) extend this physical reasoning to complex puzzles of cause and effect. Sun et al. (2019b,a) model scripts and actions. Alternative, unsupervised training regimes (Bachman et al., 2019) also open up research towards automatic concept formation.

While a minority of language researchers make forays into computer vision, researchers in CV are often willing to incorporate language signals. Advances in computer vision architectures and modeling have enabled building semantic representations rich enough to interact with natural language. In the last decade of work descendant from image captioning (Farhadi et al., 2010; Mitchell et al., 2012), a myriad of tasks on visual question answering (Antol et al., 2015; Das et al., 2018), natural language and visual reasoning (Suhr et al., 2019b), visual commonsense (Zellers et al., 2019a), and multilingual captioning and translation via video (Wang et al., 2019b) have emerged. These combined text and vision benchmarks are rich enough to train large-scale, multimodal transformers (Li et al., 2019a; Lu et al., 2019; Zhou et al., 2019) without language pretraining, for example via conceptual captions (Sharma et al., 2018), or further broadened to include audio (Tsai et al., 2019).

[2] Or the 1,600 classes of Anderson et al. (2017).
[3] Torchvision and Detectron both include pretrained versions of over a dozen models.
Semantic-level representations emerge from ImageNet classification pretraining partially due to class hypernyms. For example, the person class sub-divides into many professions and hobbies, like firefighter, gymnast, and doctor. To differentiate such sibling classes, learned vectors can also encode lower-level characteristics like clothing, hair, and typical surrounding scenes. These representations allow for pixel-level masks and skeletal modeling, and can be extended to zero-shot settings targeting all 20,000 ImageNet categories (Chao et al., 2016; Changpinyo et al., 2017). Modern architectures are flexible enough to learn to differentiate instances within a general class, such as face. For example, facial recognition benchmarks require distinguishing over 10,000 unique faces (Liu et al., 2015). While vision is by no means "solved," vision benchmarks have led to off-the-shelf tools for building representations rich enough for tens of thousands of objects, scenes, and individuals.

That the combination of language and vision supervision leads to models that produce clear and coherent sentences, or that their attention can translate parts of an image to words, indicates they truly are learning about language. Moving forward, we believe benchmarks incorporating auditory, tactile, and visual sensory information together with language will be crucial. Such benchmarks will facilitate modeling language meaning with respect to an experienced world.

4 WS4: Embodiment and Action

    If A and B have some environments in common and some not ... we say that they have different meanings, the amount of meaning difference corresponding roughly to the amount of difference in their environments ...
        Zellig S. Harris (Distributional Structure, 1954)

Many of the phenomena in Winograd (1972)'s Understanding Natural Language require representation of concepts derivable from perception, such as color, counting, size, shape, spatial reasoning, and stability. Recent work (Gardner et al., 2020) argues that these same dimensions can serve as meaningful perturbations of inputs to evaluate models' reasoning consistency. However, their presence is in service of actual interactions with the world. Action taking is a natural next rung on the ladder to broader context.

An embodied agent, whether in a virtual world, such as a 2D maze (MacMahon et al., 2006), a Baby AI in a grid world (Chevalier-Boisvert et al., 2019), Vision-and-Language Navigation (Anderson et al., 2018; Thomason et al., 2019b), or the real world (Tellex et al., 2011; Matuszek, 2018; Thomason et al., 2020; Tellex et al., 2020), must translate from language to control. Control and action taking open several new dimensions to understanding and actively learning about the world. Queries can be resolved via dialog-based exploration with a human interlocutor (Liu and Chai, 2015), even as new object properties like texture, weight, and sound become available (Thomason et al., 2017). We see the need for embodied language with complex meaning when thinking deeply about even the most innocuous of questions:

    Is an orange more like a baseball or a banana?

WS1 is likely not to have an answer beyond that the objects are common nouns that can both be held. WS2 may also capture that oranges and baseballs both roll, but is unlikely to capture the deformation strength, surface texture, or relative sizes of these objects (Elazar et al., 2019). WS3 can begin to understand the relative deformability of these objects, but is likely to confuse how much force is necessary, given that baseballs are used much more roughly than oranges in widely distributed media. WS4 can appreciate the nuanced nature of the question: the orange and baseball share a similar texture and weight, allowing for similar manipulation, while the orange and banana both contain peels, deform under stress, and are edible. We as humans have rich representations of these items. The words evoke a myriad of properties, and that richness allows us to reason.

Control is where people first learn abstraction and simple examples of post-conditions through trial and error. The most basic scripts humans learn start with moving our own bodies and achieving simple goals as children, such as stacking blocks. In this space, we have unlimited supervision from the environment and can learn to generalize across plans and actions. In general, simple worlds do not entail simple concepts: even in block worlds (Bisk et al., 2018), concepts like "mirroring"—one of a massive set of physical phenomena that we generalize and apply to abstract concepts—appear.

In addition to learning basic physical properties of the world from interaction, WS4 also allows the agent to construct rich pre-linguistic representations from which to generalize. Hespos and Spelke (2004) show pre-linguistic category formation within children that is then later codified by social constructs. Mounting evidence seems to indicate that children have trouble transferring knowledge from the 2D world of books (Barr, 2013) and iPads (Lin et al., 2017) to the physical 3D world. So while we might choose to believe that we can encode parameters (Chomsky, 1981) more effectively and efficiently than evolution provided us, developmental experiments indicate doing so without 3D interaction may prove insurmountable.

Part of the problem is that much of the knowledge humans hold about the world is intuitive, possibly incommunicable by language, but is required to understand that language. This intuitive knowledge could be acquired by embodied agents interacting with their environment, even before words are grounded to meanings.

Robotics and embodiment are not available in the same off-the-shelf manner as computer vision models. However, there is rapid progress in simulators and commercial robotics, and as language researchers we should match these advances at every step. As action spaces grow, we can study complex language instructions in simulated homes (Shridhar et al., 2019) or map language to physical robot control (Blukis et al., 2019; Chai et al., 2018). The last few years have seen massive advances in both high-fidelity simulators for robotics (Todorov et al., 2012; Coumans and Bai, 2016–2019; NVIDIA, 2019; Xiang et al., 2020) and the cost and availability of commodity hardware (Fitzgerald, 2013; Campeau-Lecours et al., 2019; Murali et al., 2019).

As computers transition from desktops to pervasive mobile and edge devices, we must make and meet the expectation that NLP can be deployed in any of these contexts. Current representations have very limited utility in even the most basic robotic settings (Scalise et al., 2019), making collaborative robotics (Rosenthal et al., 2010) largely a domain of custom engineering rather than science.

    In order to talk about concepts, we must understand the importance of mental models... we set up a model of the world which serves as a framework in which to organize our thoughts. We abstract the presence of particular objects, having properties, and entering into events and relationships.
        Terry Winograd (1971)

5 WS5: The Social World

Natural language arose to enable interpersonal communication (Dunbar, 1993). Since then, it has been […]. Communication in service of real-world cooperation is the prototypical use of language, and the ability to facilitate such cooperation remains the final test of a learned agent.

The acknowledgement that interpersonal communication is the necessary property of a linguistic intelligence is older than the terms "Computational Linguistics" or "Artificial Intelligence." Launched into computing when Alan Turing suggested the "Imitation Game" (Turing, 1950), the first illustrative example of the test brings to bear an issue that has haunted generative NLP since it became possible. Smoke-and-mirrors cleverness, in which situations are framed to the advantage of the model, can create the appearance of genuine content where there is none. This phenomenon has been noted countless times, from Pierce (1969) criticizing speech recognition as "deceit and glamour" to Marcus and Davis (2019), who complain of humanity's "gullibility gap".

Interpersonal dialogue is canonically framed as a grand test for AI (Norvig and Russell, 2002), but there are few to no examples of artificial agents one could imagine socializing with in anything more than transactional (Bordes et al., 2016) or extremely limited game scenarios (Lewis et al., 2017)—at least not ones that aren't purposefully hollow and fixed, where people can project whatever they wish (e.g. ELIZA (Weizenbaum, 1966)). More important to our discussion is why socialization is required as the next rung on the context ladder in order to fully ground meaning.

Language that Does Something
Work in the philosophy of language has long suggested that function is the source of meaning, as famously illustrated through Wittgenstein's "language games" (Wittgenstein, 1953, 1958) and J.L. Austin's "performative speech acts" (Austin, 1975). That function is the source of meaning was echoed in linguis-
used in everything from laundry instructions to per-     tics usage-based theory of language acquisition,
sonal diaries, which has given us data and situated      which suggests that constructions that are useful
scenarios that are the raw material for the previous     are the building blocks for everything else (Lan-
levels of contextualization. Interpersonal commu-        gacker, 1987, 1991). In recent years, this theory
has begun to shed light on what use cases language presents, both in acquisition and in its initial origins in our species (Tomasello, 2009; Barsalou, 2008).

    Q: Please write me a sonnet on the subject of the Forth Bridge.
    A: Count me out on this one. I never could write poetry.
    Q: Add 34957 to 70764.
    A: (Pause about 30 seconds and then give as answer) 105621.
    Q: Do you play chess?
    A: Yes.
    Q: I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?
    A: (After a pause of 15 seconds) R-R8 mate.

        A.M. Turing (Computing Machinery and Intelligence 1950)

WS1, WS2, WS3, and WS4 lend an extra depth to the interpretation of language through context, because they expand the factorizations of information available to define meaning. Understanding that what one says can change what people do allows language to take on its most active role. This is the ultimate goal for natural language generation: language that does something to the world.

Despite the current, passive way generation is created and evaluated (Liu et al., 2016), the ability to change the world through other people is the rich learning signal that natural language generation offers. In order to learn the effects language has on the world, an agent must participate in linguistic activity, such as negotiation (Lewis et al., 2017; He et al., 2018), collaboration (Chai et al., 2017), visual disambiguation (Liu and Chai, 2015; Anderson et al., 2018; Lazaridou et al., 2017), and providing emotional support (Rashkin et al., 2019), all of which require inferring mental states and social outcomes—a key area of interest in itself (Zadeh et al., 2019).

As an example, what "hate" means in terms of discriminative information is always at question: it can be defined as "strong dislike," but what it tells one about the processes operating in the environment requires social context to determine (Bloom, 2002). It is the toddler's social experimentation with "I hate you" that gives the word weight and definite intent (Ornaghi et al., 2011). In other words, the discriminative signal for the most foundational part of a word's meaning can only be observed by its effect on the world, and active experimentation seems to be key to learning that effect. This is in stark contrast to the disembodied chatbots that are the focus of the current dialogue community (Adiwardana et al., 2020; Zhou et al., 2020; Chen et al., 2018; Serban et al., 2017), which often do not learn from individual experiences and are not given enough of a persistent environment to learn about the effects of actions.

Theory of Mind   By attempting to get what we want, we confront people with their own desires and identities. Premack and Woodruff (1978) began to formalize the extent to which the ability to consider the feelings and knowledge of others is human-specific, and how it actually functions, commonly referred to now as the "Theory of Mind."

In the language community, this paradigm has been described under the "Speaker-Listener" model (Stephens et al., 2010), and a rich theory to describe this computationally is being actively developed under the Rational Speech Act model (Frank and Goodman, 2012; Bergen et al., 2016).

Recently, a series of challenges that attempt to address this fundamental aspect of communication has been introduced (Nematzadeh et al., 2018; Sap et al., 2019). Using traditional-style datasets can be problematic due to the risk of embedding spurious patterns (Le et al., 2019). Despite increased scores on datasets due to larger models and more complex pretraining curricula, it seems unlikely that models understand their listener any more than they understand the physical world in which they exist. This disconnect is driven home by studies of bias (Gururangan et al., 2018; Glockner et al., 2018) and techniques like Adversarial Filtering (Zellers et al., 2018, 2019b), which elucidate the biases models exploit in lieu of understanding.

Our training data, complex and large as it is, does not offer the discriminative signals that make the hypothesizing of consistent identity or mental states an efficient path towards lowering perplexity or raising accuracy. Firstly, there is a lack of inductive bias (Martin et al., 2018): it is not clear that any learned model, not just a neural network, would be able from reading alone to posit that people exist, that they arrange narratives along an abstract single dimension called time, and that they describe things in terms of causality. Models learn what they need to discriminate, and so inductive bias is not just about efficiency; it is about making sure that what is learned will generalize (Mitchell, 1980). Secondly, current cross-entropy training losses actively discourage learning the tail of the distribution properly, as events that are statistically infrequent are drowned out (Pennington et al., 2014; Holtzman
et al., 2020). Researchers have just begun to scratch the surface by considering more dynamic evaluations (Zellers et al., 2020; Dinan et al., 2019), but true persistence of identity and adaptation to change are still a long way off.

Language in a Social Context   Whenever language is used between two people, it exists in a concrete social context: status, role, intention, and countless other variables intersect at a specific point (Wardhaugh, 2011). Currently, these complexities are overlooked by selecting labels crowd workers agree on. That is, our notion of ground truth in dataset construction is based on crowd consensus bereft of social context (Zellers et al., 2020).

In a real-world interaction, even one as anonymous as interacting with a cashier at a store, notions of social cognition (Fiske and Taylor, 1991; Carpenter et al., 1998) are present in the style of utterances and in the social script of the exchange. The true evaluation of generative models will require the construction of situations where artificial agents are considered to have enough identity to be granted social standing for these interactions.

Social interaction is a precious signal, and some NLP researchers have begun scratching the surface of it, especially for persuasion-related tasks where interpersonal relationships are key (Wang et al., 2019c; Tan et al., 2016). These initial studies have been strained by the training-validation-test-set scenario, as collecting static datasets of this kind is difficult for most use cases and often impossible, since natural situations cannot be properly facilitated. Learning by participation is a necessary step in the ultimately social venture of communication. By exhibiting different attributes and sending varying signals, the sociolinguistic construction of identity (Ochs, 1993) could be examined more deeply than ever before, yielding an understanding of social intelligence that simply isn't possible with a fixed corpus. This social layer is the foundation upon which language is situated (Tomasello, 2009). Understanding this social layer is not only necessary, but will make clear complexities around implicature and commonsense that so obviously arise in corpora, yet with such sparsity that they cannot properly be learned (Gordon et al., 2018). Once models are commonly expected to be interacted with in order to be tested, probing their decision boundaries for simplifications of reality as in Kaushik et al. will become trivial and natural.

6   Self-Evaluation

    You can't learn language from the radio.

Nearly every NLP course will at some point make this claim. While the futility of learning language from an ungrounded linguistic signal is intuitive, those NLP courses go on to attempt precisely that learning. In fact, our entire field follows this pattern: trying to learn language from the internet, standing in as the modern radio to deliver limitless language. The need for language to attach to "extralinguistic events" (Ervin-Tripp, 1973) and the requirement for social context (Baldwin et al., 1996) should guide our research.

Concretely, we use our notion of World Scopes to make the following claims:

You can't learn language ...

... from the radio (internet).      WS2 ⊂ WS3

    A learner cannot be said to be in WS3 if it can perform its task without sensory perception such as visual, auditory, or tactile information.

... from a television.              WS3 ⊂ WS4

    A learner cannot be said to be in WS4 if the space of actions and consequences of its environment can be enumerated.

... by yourself.                    WS4 ⊂ WS5

    A learner cannot be said to be in WS5 if its cooperators can be replaced with cleverly pre-programmed agents to achieve the same goals.

By these definitions, most NLP research still resides in WS2, and we believe this is at odds with the goals and origins of our science. This does not invalidate the utility of or need for any of the research within NLP, but it is to say that much of that existing research targets a different goal than language learning.

Where Should We Start?   Many in our community are already pursuing the move from WS2 to WS3 by rethinking our existing tasks and investigating whether their semantics can be expanded and grounded. Importantly, this is not a new trend (Chen and Mooney, 2008; Feng and Lapata, 2010; Bruni et al., 2014; Lazaridou et al., 2016), and task and model design can fail to require sensory perception (Thomason et al., 2019a).
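The three containment criteria above can be read as a checklist that upper-bounds the World Scope a learner occupies. Purely as an illustration (the `Agent` record, its fields, and the `world_scope` function below are our own hypothetical constructs, not an existing benchmark or API), the self-evaluation might be sketched as:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    """Hypothetical summary of how a learner was trained and evaluated."""
    uses_sensory_perception: bool     # visual, auditory, or tactile input
    action_space_enumerable: bool     # actions and consequences can be listed in advance
    needs_adaptive_cooperators: bool  # scripted partners could not achieve the same goals

def world_scope(agent: Agent) -> int:
    """Upper-bound World Scope implied by the criteria:
    WS2 (written world) unless the agent perceives (WS3),
    acts in a non-enumerable environment (WS4),
    and depends on genuinely adaptive social partners (WS5)."""
    if not agent.uses_sensory_perception:
        return 2
    if agent.action_space_enumerable:
        return 3
    if not agent.needs_adaptive_cooperators:
        return 4
    return 5

# A text-only language model fails the first test and stays in WS2.
text_only_lm = Agent(False, True, False)
# An embodied agent in an open environment with human partners reaches WS5.
social_robot = Agent(True, False, True)
```

The design point is that each test is a gate: a learner cannot claim a higher scope while any lower gate fails, mirroring the containments WS2 ⊂ WS3 ⊂ WS4 ⊂ WS5.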
However, research and progress in multimodal, grounded language understanding has accelerated dramatically in the last few years.

Elliott et al. (2016) take the classic problem of machine translation and reframe it in the context of visual observations (Elliott and Kádár, 2017). Wang et al. (2019b) extend this paradigm by exploring machine translation with videos serving as an interlingua, and Regneri et al. (2013) introduced a foundational dataset which aligns text descriptions and semantic annotations of actions with videos. Even core tasks like syntax can be informed by visual information. Shi et al. (2019) investigate the role of visual information in syntax acquisition. This may prove necessary to learn headedness (e.g., choosing the main verb vs. the more common auxiliary as the root of a sentence) (Bisk and Hockenmaier, 2015). Relatedly, Ororbia et al. (2019) investigate the benefits of visual context for language modeling.

Collaborative games have long served as a testbed for studying language (Werner and Dyer, 1991). Recently, Suhr et al. (2019a) introduced a testbed for evaluating language understanding in the service of a shared goal, and Andreas and Klein (2016) use a simpler visual paradigm for studying pragmatics. A communicative goal can also be used to study the emergence of language. Lazaridou et al. (2018) evolve agents with language that aligns to perceptual features of the environment. These paradigms can help us examine how inductive biases and environmental pressures can build towards socialization (WS5).

Most of this research provides resources such as data, code, simulators, and methodology for evaluating the multimodal content of linguistic representations (Silberer and Lapata, 2014; Bruni et al., 2012). Moving forward, we would like to encourage a broader re-examination of how NLP researchers frame the relationship between meaning and context. We believe that the time is ripe to begin a deeper dive into grounded learning, embodied learning, and ultimately social participation. With recent deep learning representations, researchers can more easily integrate additional signals beyond word tokens into the learned meanings of language. We particularly advocate for the homegrown creation of such merged representations, rather than waiting for the representations and data to come to NLP from other fields.

    These problems include the need to bring meaning and reasoning into systems that perform natural language processing, the need to infer and represent causality, the need to develop computationally-tractable representations of uncertainty and the need to develop systems that formulate and pursue long-term goals.

        Michael Jordan (Artificial intelligence – the revolution hasn't happened yet, 2019)

7   Conclusions

Our proposed World Scopes are steep steps, and it is possible that WS5 is AI-complete. WS5 implies a persistent agent experiencing time and a personalized set of experiences. With few exceptions (Carlson et al., 2010), machine learning models have been confined to IID datasets that lack the structure in time from which humans draw correlations about long-range causal dependencies.

What happens if a machine is allowed to participate consistently? This is difficult to imagine testing under current evaluation paradigms for generalization. Yet this is the structure of generalization in human development: drawing analogies to episodic memories, gathering a system through non-independent experiments.

While it is fascinating to broadly consider such far-reaching futures, our goal in this call to action is more pedestrian. As with many who have analyzed the history of Natural Language Processing, its trends (Church, 2007), its maturation toward a science (Steedman, 2008), and its major challenges (Hirschberg and Manning, 2015; McClelland et al., 2019), we hope to provide momentum for a direction many are already heading. We call for and embrace the incremental, but purposeful, contextualization of language in human experience. With all that we have learned about what words can tell us and what they keep implicit, now is the time to ask: what tasks, representations, and inductive biases will fill the gaps?

Computer vision and speech recognition are mature enough for novel investigation of broader linguistic contexts (Section 3). The robotics industry is rapidly developing commodity hardware and sophisticated software that both facilitate new research and expect to incorporate language technologies (Section 4). Our call to action is to encourage the community to lean in to trends prioritizing grounding and agency (Section 5), and to explicitly aim to broaden the corresponding World Scopes available to our models.
Acknowledgements

Thanks to Raymond Mooney for discussions and suggestions, Paul Smolensky for discussion and disagreements, and to Catriona Silvey for help with references.

A   Invitation to Reply

What we have introduced here is not comprehensive and you may not agree with our arguments. Perhaps they do not go far enough, they miss a relevant literature, or you feel they do not represent the goals of NLP/CL. To this end, we welcome both suggestions for an updated version of this manuscript, as well as response papers on alternate foci and directions for the field.

References

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2017. Bottom-up and top-down attention for image captioning and visual question answering. Visual Question Answering Challenge at CVPR 2017.

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Jacob Andreas and Dan Klein. 2016. Reasoning about pragmatics with neural listeners and speakers. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182, Austin, Texas. Association for Computational Linguistics.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

John Langshaw Austin. 1975. How to do things with words, volume 88. Oxford University Press.

Philip Bachman, R Devon Hjelm, and William Buchwalter. 2019. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910.

Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. 2019. PHYRE: A new benchmark for physical reasoning.

Dare A. Baldwin, Ellen M. Markman, Brigitte Bill, Renee N. Desjardins, Jane M. Irwin, and Glynnis Tidball. 1996. Infants' reliance on a social criterion for establishing word-object relations. Child Development, 67(6):3135–3153.

H.B. Barlow. 1989. Unsupervised learning. Neural Computation, 1(3):295–311.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Rachel Barr. 2013. Memory constraints on infant learning from picture books, television, and touchscreens. Child Development Perspectives, 7(4):205–210.

Lawrence W Barsalou. 2008. Grounded cognition. Annu. Rev. Psychol., 59:617–645.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Leon Bergen, Roger Levy, and Noah Goodman. 2016. Pragmatic reasoning through semantic inference. Semantics and Pragmatics, 9.

Yonatan Bisk and Julia Hockenmaier. 2015. Probing the linguistic strengths and limitations of unsupervised grammar induction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Yonatan Bisk, Kevin Shih, Yejin Choi, and Daniel Marcu. 2018. Learning interpretable spatial operations in a rich 3D blocks world. In Proceedings of the Thirty-Second Conference on Artificial Intelligence (AAAI-18).

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Paul Bloom. 2002. How children learn the meanings of words. MIT Press.

Valts Blukis, Yannick Terme, Eyvind Niklasson, Ross A. Knepper, and Yoav Artzi. 2019. Learning to map natural language instructions to physical quadcopter control using simulated flight. In 3rd Conference on Robot Learning (CoRL).
Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations.

Peter F Brown, Peter V deSouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145, Jeju Island, Korea. Association for Computational Linguistics.

Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Alexandre Campeau-Lecours, Hugo Lamontagne, Simon Latour, Philippe Fauteux, Véronique Maheu, François Boucher, Charles Deguire, and Louis-Joseph Caron L'Ecuyer. 2019. Kinova modular robot arms for service robotics applications. In Rapid Automation: Concepts, Methodologies, Tools, and Applications, pages 693–719. IGI Global.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka, and Tom M Mitchell. 2010. Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI Conference on Artificial Intelligence.

Malinda Carpenter, Katherine Nagell, Michael Tomasello, George Butterworth, and Chris Moore. 1998. Social cognition, joint attention, and communicative competence from 9 to 15 months of age. Monographs of the Society for Research in Child Development, pages i–174.

Joyce Y. Chai, Rui Fang, Changsong Liu, and Lanbo She. 2017. Collaborative language grounding toward situated human-robot dialogue. AI Magazine, 37(4):32–45.

Joyce Y. Chai, Qiaozi Gao, Lanbo She, Shaohua Yang, Sari Saba-Sadiya, and Guangyue Xu. 2018. Language to action: Towards interactive task learning with physical agents. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18).

Soravit Changpinyo, Wei-Lun Chao, and Fei Sha. 2017. Predicting visual exemplars of unseen classes for zero-shot learning. In ICCV.

Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. 2016. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, pages 52–68, Cham. Springer International Publishing.

Eugene Charniak. 1977. A framed painting: The representation of a common sense knowledge fragment. Cognitive Science, 1(4):355–394.

Chun-Yen Chen, Dian Yu, Weiming Wen, Yi Mang Yang, Jiaping Zhang, Mingyang Zhou, Kevin Jesse, Austin Chau, Antara Bhowmick, Shreenath Iyer, et al. 2018. Gunrock: Building a human-like social bot by leveraging large scale real user data. Alexa Prize Proceedings.

David L. Chen and Raymond J. Mooney. 2008. Learning to sportscast: A test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning (ICML), Helsinki, Finland.

Stanley F. Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Association for Computational Linguistics, pages 310–318. Association for Computational Linguistics.

Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. 2019. BabyAI: First steps towards grounded language learning with a human in the loop. In ICLR'2019.

Noam Chomsky. 1981. Lectures on Government and Binding. Mouton de Gruyter.

Kenneth Church. 2007. A pendulum swung too far. Linguistic Issues in Language Technology – LiLT, 2.

L. Coleman and P. Kay. 1981. The English word "lie". Linguistics, 57.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML.

Erwin Coumans and Yunfei Bai. 2016–2019. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org.

Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. 2018. Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2054–2063.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1988. Improving information retrieval with latent semantic indexing. In Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, pages 36–40.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
Gerald Dejong. 1981. Generalizations based on expla-     Ali Farhadi, M Hejrati, M Sadeghi, Peter Young, Cyrus
  nations. In IJCAI.                                       Rashtchian, Julia Hockenmaier, and David Forsyth.
                                                           2010. Every picture tells a story: Generating sen-
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and              tences from images. In European Conference on
   Kristina Toutanova. 2019a. Bert: Pre-training of        Computer Vision. Springer.
   deep bidirectional transformers for language under-
   standing. In North American Chapter of the Associ-    Yansong Feng and Mirella Lapata. 2010. Topic models
   ation for Computational Linguistics (NAACL).            for image annotation and text illustration. In Human
                                                           Language Technologies: The 2010 Annual Confer-
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and              ence of the North American Chapter of the Associa-
   Kristina Toutanova. 2019b. BERT: Pre-training of        tion for Computational Linguistics, pages 831–839,
   Deep Bidirectional Transformers for Language Un-        Los Angeles, California. Association for Computa-
   derstanding. In Proceedings of the 2019 Conference      tional Linguistics.
   of the North American Chapter of the Association
   for Computational Linguistics: Human Language         J. R. Firth. 1957. A synopsis of linguistic theory, 1930-
  Technologies, Volume 1 (Long and Short Papers).           1955. Studies in Linguistic Analysis.
Emily Dinan, Samuel Humeau, Bharath Chintagunta,         Susan T Fiske and Shelley E Taylor. 1991. Social cog-
  and Jason Weston. 2019. Build it break it fix it for     nition. Mcgraw-Hill Book Company.
  dialogue safety: Robustness from adversarial human
  attack. In Proceedings of the 2019 Conference on       Cliff Fitzgerald. 2013. Developing baxter. In 2013
  Empirical Methods in Natural Language Processing         IEEE Conference on Technologies for Practical
  and the 9th International Joint Conference on Natu-      Robot Applications (TePRA).
  ral Language Processing (EMNLP-IJCNLP), pages
  4529–4538.                                             W. Nelson Francis. 1964. A standard sample of
                                                           present-day english for use with digital computers.
Susan T. Dumais. 2004. Latent semantic analysis. An-       Report to the U.S Office of Education on Coopera-
  nual Review of Information Science and Technology,       tive Research Project No. E-007.
  38(1):188–230.
                                                         Michael C Frank and Noah D Goodman. 2012. Pre-
Robin IM Dunbar. 1993. Coevolution of neocortical          dicting pragmatic reasoning in language games. Sci-
  size, group size and language in humans. Behav-          ence, 336(6084):998–998.
  ioral and brain sciences, 16(4):681–694.
                                                         Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng
Yanai Elazar, Abhijit Mahabal, Deepak Ramachandran,        Wang, and Ting Liu. 2014. Learning semantic hier-
  Tania Bedrax-Weiss, and Dan Roth. 2019. How              archies via word embeddings. In Proceedings of the
  large are lions? inducing distributions over quanti-     52nd Annual Meeting of the Association for Compu-
  tative attributes. In Proceedings of the 57th Annual     tational Linguistics (Volume 1: Long Papers), pages
  Meeting of the Association for Computational Lin-        1199–1209.
  guistics, pages 3973–3983.
                                                         Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan
Desmond Elliott, Stella Frank, Khalil Sima’an, and Lu-    Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi,
  cia Specia. 2016. Multi30k: Multilingual english-       Dheeru Dua, Yanai Elazar, Ananth Gottumukkala,
  german image descriptions. In Workshop on Vision        Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco,
  and Langauge at ACL ’16.                                Daniel Khashabi, Kevin Lin, Jiangming Liu, Nel-
                                                          son F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer
Desmond Elliott and Ákos Kádár. 2017. Imagination         Singh, Noah A. Smith, Sanjay Subramanian, Reut
  improves multimodal translation. In Proceedings of      Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou.
  the Eighth International Joint Conference on Natu-      2020. Evaluating NLP Models via Contrast Sets.
  ral Language Processing (Volume 1: Long Papers),        arXiv:2004.02709.
  pages 130–141.
                                                         Max Glockner, Vered Shwartz, and Yoav Goldberg.
J Elman. 1990. Finding structure in time. Cognitive       2018. Breaking nli systems with sentences that re-
   Science, 14(2):179–211.                                quire simple lexical inferences. In Proceedings of
                                                          the 56th Annual Meeting of the Association for Com-
Katrin Erk and Sebastian Padó. 2008. A structured         putational Linguistics (Volume 2: Short Papers),
  vector space model for word meaning in context.         pages 650–655.
  In Proceedings of the 2008 Conference on Empiri-
  cal Methods in Natural Language Processing, pages      I J Good. 1953.       The population frequencies of
  897–906, Honolulu, Hawaii. Association for Com-          species and the estimation of population parameters.
  putational Linguistics.                                  Biometrika, 40:237–264.
Susan Ervin-Tripp. 1973. Some strategies for the first   Daniel Gordon, Aniruddha Kembhavi, Mohammad
  two years. In Timothy E. Moore, editor, Cognitive        Rastegari, Joseph Redmon, Dieter Fox, and Ali
  Development and Acquisition of Language, pages           Farhadi. 2018. Iqa: Visual question answering in in-
  261 – 286. Academic Press, San Diego.                    teractive environments. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recog-           Teuvo Kohonen. 1984. Self-Organization and Associa-
  nition, pages 4089–4098.                                     tive Memory. Springer.
Suchin Gururangan, Swabha Swayamdipta, Omer                  Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin John-
  Levy, Roy Schwartz, Samuel Bowman, and Noah A                son, Kenji Hata, Joshua Kravitz, Stephanie Chen,
  Smith. 2018. Annotation artifacts in natural lan-            Yannis Kalantidis, Li-Jia Li, David A Shamma,
  guage inference data. In Proceedings of the 2018             et al. 2017. Visual genome: Connecting language
  Conference of the North American Chapter of the              and vision using crowdsourced dense image anno-
  Association for Computational Linguistics: Human             tations. International Journal of Computer Vision,
  Language Technologies, Volume 2 (Short Papers),              123(1):32–73.
  pages 107–112.
                                                             Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hin-
Stevan Harnad. 1990. The symbol grounding problem.             ton. 2012. Imagenet classification with deep con-
   Physica D, 42:335–346.                                      volutional neural networks. In F. Pereira, C. J. C.
Zellig S Harris. 1954. Distributional structure. Word,         Burges, L. Bottou, and K. Q. Weinberger, editors,
  10:146–162.                                                  Advances in Neural Information Processing Systems
                                                               25, pages 1097–1105. Curran Associates, Inc.
He He, Derek Chen, Anusha Balakrishnan, and Percy
  Liang. 2018. Decoupling strategy and generation in         George Lakoff. 1973. Hedges: A study in meaning
  negotiation dialogues. In Proceedings of the 2018            criteria and the logic of fuzzy concepts. Journal of
  Conference on Empirical Methods in Natural Lan-              Philosophical Logic, 2:458–508.
  guage Processing, pages 2333–2343.
                                                             George Lakoff. 1980. Metaphors We Live By. Univer-
Susan J. Hespos and Elizabeth S. Spelke. 2004. Con-            sity of Chicago Press.
  ceptual precursors to language. Nature, 430.
                                                             Zhenzhong Lan, Mingda Chen, Sebastian Goodman,
G. E. Hinton, J. L. McClelland, and D. E. Rumelhart.           Kevin Gimpel, Piyush Sharma, and Radu Soricut.
  1986. Distributed representations. Parallel Dis-             2019. Albert: A lite bert for self-supervised learn-
  tributed Processing: Explorations in the Microstruc-         ing of language representations. arXiv preprint
  ture of Cognition, Volume 1: Foundations.                    arXiv:1909.11942.
Geoffrey E. Hinton. 1990. Preface to the special issue       Ronald W Langacker. 1987. Foundations of cogni-
  on connectionist symbol processing. Artificial Intel-        tive grammar: Theoretical prerequisites, volume 1.
  ligence, 46(1):1 – 4.                                        Stanford university press.
Julia Hirschberg and Christopher D Manning. 2015.
   Advances in natural language processing. Science,         Ronald W Langacker. 1991. Foundations of Cognitive
   349(6245):261–266.                                          Grammar: descriptive application. Volume 2, vol-
                                                               ume 2. Stanford university press.
Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin
  Choi. 2020. The curious case of neural text degener-       Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls,
  ation. In ICLR 2020.                                         and Stephen Clark. 2018. Emergence of linguis-
                                                               tic communication from referential games with sym-
Daniel L. James and Risto Miikkulainen. 1995. Sard-            bolic and pixel input. In Internationl Conference on
  net: A self-organizing feature map for sequences. In         Learning Representations.
  Advances in Neural Information Processing Systems
  7 (NIPS’94), pages 577–584, Denver, CO. Cam-               Angeliki Lazaridou, Alexander Peysakhovich, and
  bridge, MA: MIT Press.                                       Marco Baroni. 2017. Multi-agent cooperation and
                                                               the emergence of (natural) language. In ICLR 2017.
Michael I Jordan. 2019. Artificial intelligence – the rev-
  olution hasn’t happened yet. Harvard Data Science          Angeliki Lazaridou, Nghia The Pham, and Marco Ba-
  Review.                                                      roni. 2016. The red one!: On learning to refer to
                                                               things based on discriminative properties. In Pro-
Divyansh Kaushik, Eduard Hovy, and Zachary Lipton.
                                                               ceedings of the 54th Annual Meeting of the Associa-
  2020. Learning the difference that makes a differ-
                                                               tion for Computational Linguistics (Volume 2: Short
  ence with counterfactually-augmented data. In Inter-
                                                               Papers), pages 213–218, Berlin, Germany. Associa-
  national Conference on Learning Representations.
                                                               tion for Computational Linguistics.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney,
  Caiming Xiong, and Richard Socher. 2019. Ctrl: A           Matthew Le, Y-Lan Boureau, and Maximilian Nickel.
  conditional transformer language model for control-         2019. Revisiting the evaluation of theory of mind
  lable generation. arXiv preprint arXiv:1909.05858.          through question answering. In Proceedings of the
                                                              2019 Conference on Empirical Methods in Natu-
Reinhard Kneser and Hermann Ney. 1995. Improved               ral Language Processing and the 9th International
  backing-ff of m-gram language modeling. In Pro-             Joint Conference on Natural Language Processing
  ceedings of the IEEE International Conference on            (EMNLP-IJCNLP), pages 5871–5876, Hong Kong,
  Acoustics, Speech and Signal Processing.                    China. Association for Computational Linguistics.