GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang1, Amanpreet Singh1, Julian Michael2, Felix Hill3, Omer Levy2, and Samuel R. Bowman1

1 New York University, New York, NY
2 Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
3 DeepMind, London, UK

{alexwang,amanpreet,bowman}@nyu.edu
{julianjm,omerlevy}@cs.washington.edu
felixhill@google.com
Abstract

For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.

1 Introduction

Human ability to understand language is general, flexible, and robust. We can effectively interpret and respond to utterances of diverse form and function in many different contexts. In contrast, most natural language understanding (NLU) models above the word level are designed for one particular task and struggle with out-of-domain data. If we aspire to develop models whose understanding extends beyond the detection of superficial correspondences between inputs and outputs, then it is critical to understand how a single model can learn to execute a range of different linguistic tasks on language from different domains.

To motivate research in this direction, we present the General Language Understanding Evaluation benchmark (GLUE, gluebenchmark.com), an online tool for evaluating the performance of a single NLU model across multiple tasks, including question answering, sentiment analysis, and textual entailment, built largely on established existing datasets. GLUE does not place any constraints on model architecture beyond the ability to process single-sentence and paired-sentence inputs and to make corresponding predictions. For some GLUE tasks, directly pertinent training data is plentiful, but for others, training data is limited or fails to match the genre of the test set. GLUE therefore favors models that can learn to represent linguistic and semantic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.

Though GLUE is designed to prefer models with general and robust language understanding, we cannot entirely rule out the existence of simple superficial strategies for solving any of the included tasks. We therefore also provide a set of newly constructed evaluation data for the analysis of model performance. Unlike many test sets employed in machine learning research that reflect the frequency distribution of naturally occurring data or annotations, this dataset is designed to highlight points of difficulty that are relevant to model development and training, such as the incorporation of world knowledge, or the handling of lexical entailments and negation. Visitors to the online platform have access to a breakdown of how well each model handles these phenomena alongside its scores on the primary GLUE test sets.

To better understand the challenges posed by the GLUE benchmark, we conduct experiments with simple baselines and state-of-the-art models for sentence representation.
Corpus   |Train|   |Dev|   |Test|   Task                  Metric                   Domain

                                    Single-Sentence Tasks
CoLA     10k       1k      1.1k     acceptability         Matthews                 linguistics literature
SST-2    67k       872     1.8k     sentiment             acc.                     movie reviews

                                    Similarity and Paraphrase Tasks
MRPC     4k        N/A     1.7k     paraphrase            acc./F1                  news
STS-B    7k        1.5k    1.4k     sentence similarity   Pearson/Spearman         misc.
QQP      400k      N/A     391k     paraphrase            acc./F1                  social QA Questions

                                    Inference Tasks
MNLI     393k      20k     20k      NLI                   acc. (match/mismatch)    misc.
QNLI     108k      11k     11k      QA/NLI                acc.                     Wikipedia
RTE      2.7k      N/A     3k       NLI                   acc.                     misc.
WNLI     706       N/A     146      coreference/NLI       acc.                     fiction books

Table 1: Task descriptions and statistics. All tasks are single sentence or sentence pair classification,
except STS-Benchmark, which is a regression task. MNLI has three classes while all other classification
tasks are binary.

We find that naïve multi-task learning with standard models over the available task training data yields overall performance no better than can be achieved by training a separate model for each task, indicating the need for improved general NLU systems. However, for certain tasks with less training data, we find that multi-task learning approaches do improve over a single-task model. This indicates that there is potentially interesting space for meaningful knowledge sharing across NLU tasks. Analysis with our diagnostic dataset reveals that current models deal well with strong lexical signals and struggle with logic, and that there are interesting patterns in the generalization behavior of our models that do not correlate perfectly with performance on the main benchmark.

In summary, we offer the following contributions:

  • A suite of nine sentence- or sentence-pair NLU tasks, built on established annotated datasets where possible, and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty.

  • An online evaluation platform and leaderboard, based primarily on privately-held test data. The platform is model-agnostic; any model or method capable of producing results on all nine tasks can be evaluated.

  • A suite of diagnostic evaluation data aimed to give model developers feedback on the types of linguistic phenomena their evaluated systems handle well.

  • Results with several major existing sentence representation systems such as Skip-Thought (Kiros et al., 2015), InferSent (Conneau et al., 2017), DisSent (Nie et al., 2017), and GenSen (Subramanian et al., 2018).

2 Related Work

Our work builds on various strands of NLP research that aspired to develop better general understanding in models.

Multi-task Learning in NLP  Multi-task learning has a rich history in NLP as an approach for learning more general language understanding systems. Collobert et al. (2011), one of the earliest works exploring deep learning for NLP, used a multi-task model to jointly learn POS tagging, chunking, named entity recognition, and semantic role labeling. More recently, there has been work using labels from core NLP tasks to supervise training of lower levels of deep neural networks (Søgaard and Goldberg, 2016; Hashimoto et al., 2016) and automatically learning cross-task sharing mechanisms for multi-task learning (Ruder et al., 2017).

Evaluating Sentence Representations  Beyond multi-task learning, much of the work so far towards developing general NLU systems has focused on the development of sentence-to-vector encoder functions (Le and Mikolov, 2014; Kiros et al., 2015, i.a.), including approaches leveraging unlabeled data (Hill et al., 2016; Peters et al., 2018), labeled data (Conneau and Kiela, 2018; McCann et al., 2017), and combinations of these (Collobert et al., 2011; Subramanian et al., 2018).
In this line of work, a standard evaluation practice has emerged, and has recently been codified as SentEval (Conneau et al., 2017; Conneau and Kiela, 2018). Like GLUE, SentEval also relies on a variety of existing classification tasks that involve either one or two sentences as inputs, but only evaluates sentence-to-vector encoders. Specifically, SentEval takes a pre-trained sentence encoder as input and feeds its output encodings into lightweight task-specific models (typically linear classifiers) that are trained and tested on task-specific data.

SentEval is well-suited for evaluating general-purpose sentence representations in isolation. However, cross-sentence contextualization and alignment, such as that yielded by methods like soft attention, is instrumental in achieving state-of-the-art performance on tasks such as machine translation (Bahdanau et al., 2014; Vaswani et al., 2017), question answering (Seo et al., 2016; Xiong et al., 2016), and natural language inference [1]. GLUE is designed to facilitate the development of these methods: it is model-agnostic, allowing for any kind of representation or contextualization, including models that use no systematic vector or symbolic representations for sentences whatsoever.

[1] In the case of SNLI (Bowman et al., 2015), the best-performing sentence encoding model on the leaderboard as of April 2018 achieves 86.3% accuracy, while the best-performing attention-based model achieves 89.3%.

GLUE also diverges from SentEval in the selection of evaluation tasks that are included in the suite. Many of the SentEval tasks are closely related to sentiment analysis, with the inclusion of MR (Pang and Lee, 2005), SST (Socher et al., 2013), CR (Hu and Liu, 2004), and SUBJ (Pang and Lee, 2004). Other tasks are so close to being solved that evaluation on them is less informative, such as MPQA (Wiebe et al., 2005) and TREC (Voorhees et al., 1999). In GLUE, we have attempted to construct a benchmark that is diverse, spans multiple domains, and is systematically difficult.

Evaluation Platforms and Competitions in NLP  Our use of an online evaluation platform with private test labels is inspired by a long tradition of shared tasks at the SemEval (Agirre et al., 2007) and CoNLL (Ellison, 1997) conferences, as well as similar leaderboards on Kaggle and CodaLab. These frameworks tend to focus on a single task, while GLUE emphasizes the need to perform well on multiple different tasks using shared model components.

Weston et al. (2015) similarly proposed a hierarchy of tasks towards building question answering and reasoning models, although involving synthetic language, whereas almost all of our data is human-generated. The recently proposed dialogue systems framework ParlAI (Miller et al., 2017) also combines many language understanding tasks into a single framework, although this aggregation is very flexible, and the framework includes no standardized evaluation suite for system performance.

3 Tasks

We aim for GLUE to spur development of generalizable NLU systems. As such, we expect that doing well on the benchmark should require a model to share substantial knowledge (e.g. in the form of trained parameters) across all tasks, while keeping the task-specific components as minimal as possible. Though it is possible to train a single model for each task and evaluate the resulting set of models on this benchmark, we expect that for some data-scarce tasks in the benchmark, knowledge sharing between tasks will be necessary for competitive performance. In such a case, a more unified approach should prevail.

The GLUE benchmark consists of nine English sentence understanding tasks selected to cover a broad spectrum of task type, domain, amount of data, and difficulty. We describe them here and provide a summary in Table 1. Unless otherwise mentioned, tasks are evaluated on accuracy and have a balanced class split.

The benchmark follows the same basic evaluation model as SemEval and Kaggle. To evaluate a system on the benchmark, one must configure that system to perform all of the tasks, run the system on the provided test data, and upload the results to the website for scoring. The site will then show the user (and the public, if desired) an overall score for the main suite of tasks, and per-task scores on both the main tasks and the diagnostic dataset.

3.1 Single-Sentence Tasks
CoLA  The Corpus of Linguistic Acceptability [2] consists of examples of expert English sentence acceptability judgments drawn from 22 books and journal articles on linguistic theory. Each example is a single string of English words annotated with whether it is a grammatically possible sentence of English. Superficially, this data is similar to our analysis data in that it is constructed to demonstrate potentially subtle and difficult contrasts. However, judgments of this particular kind are the primary form of evidence in linguistic theory (Schütze, 1996), and were a machine learning system able to predict them reliably, it would offer potentially substantial evidence on questions of language learnability and innate bias. As in MNLI, the corpus contains development and test examples drawn from in-domain data (the same books and articles used in the training set) and out-of-domain data, though we report numbers only on the unified development and test sets without differentiating these. We follow the original work and report the Matthews correlation coefficient (Matthews, 1975), which evaluates classifiers on unbalanced binary classification tasks with a range from -1 to 1, with 0 being the performance at random chance. We use the standard test set, for which we obtained labels privately from the authors.

[2] Available at: nyu-mll.github.io/CoLA

SST-2  The Stanford Sentiment Treebank (Socher et al., 2013) consists of sentences extracted from movie reviews and human annotations of their sentiment. Given a sentence, the task is to determine the sentiment of the sentence. We use the two-way (positive/negative) class split.

3.2 Similarity and Paraphrase Tasks

MRPC  The Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations of whether the sentences in the pair are semantically equivalent. Because the classes are imbalanced (68% positive, 32% negative), we follow common practice and report both accuracy and F1 score.

QQP  The Quora Question Pairs [3] dataset is a collection of question pairs from the community question-answering website Quora. Given two questions, the task is to determine whether they are semantically equivalent. As in MRPC, the class distribution in QQP is unbalanced (37% positive, 63% negative), so we report both accuracy and F1 score. We use the standard test set, for which we obtained labels privately from the authors.

[3] data.quora.com/First-Quora-Dataset-Release-Question-Pairs

STS-B  The Semantic Textual Similarity Benchmark (Cer et al., 2017) is based on the datasets for a series of annual challenges for the task of determining the similarity, on a continuous scale from 1 to 5, of a pair of sentences drawn from various sources. We use the STS-Benchmark release, which draws from news headlines, video and image captions, and natural language inference data, scored by human annotators. We follow common practice and evaluate using Pearson and Spearman correlation coefficients between predicted and ground-truth scores.

3.3 Inference Tasks

MNLI  The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a crowd-sourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither (neutral). The premise sentences are gathered from a diverse set of sources, including transcribed speech, popular fiction, and government reports. The test set is broken into two sections: matched, which is drawn from the same sources as the training set, and mismatched, which uses different sources and thus requires domain transfer. We use the standard test set, for which we obtained labels privately from the authors, and evaluate on both sections.

Though not part of the benchmark, we use and recommend the Stanford Natural Language Inference corpus (Bowman et al. 2015; SNLI) as auxiliary training data. It is distributed in the same format for the same task, and has been used productively in cotraining for MNLI (Chen et al., 2017; Gong et al., 2018).

QNLI  The Stanford Question Answering Dataset (Rajpurkar et al. 2016; SQuAD) is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). We automatically convert the original SQuAD dataset into a sentence pair classification task by forming a pair between a question and each sentence in the corresponding context. The task is to determine whether the context sentence contains the answer to the question. We filter out pairs where there is low lexical overlap [4] between the question and the context sentence. Specifically, we select all pairs in which the most similar sentence to the question was not the answer sentence, as well as an equal number of cases in which the correct sentence was the most similar to the question, but another distracting sentence was a close second. This approach to converting pre-existing datasets into NLI format is closely related to recent work by White et al. (2017) as well as to the original motivation for textual entailment presented by Dagan et al. (2006). Both argue that many NLP tasks can be productively reduced to textual entailment. We call this processed dataset QNLI (Question-answering NLI).

[4] To measure lexical overlap we use a CBoW representation with pre-trained GloVe embeddings.
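For concreteness, the QNLI conversion just described can be sketched as follows. This is our own illustration, not the authors' code: the GloVe lookup table, the example fields, and the `answer_idx` argument are assumptions, and the paper's exact filtering heuristic is only hinted at through the CBoW similarity ranking (cf. footnote [4]).

```python
import numpy as np

def cbow_vector(sentence, glove, dim=300):
    """Average of pre-trained GloVe vectors for the tokens in a sentence (cf. footnote [4])."""
    vecs = [glove[tok] for tok in sentence.lower().split() if tok in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def squad_paragraph_to_qnli(question, sentences, answer_idx, glove):
    """Turn one SQuAD question/paragraph into QNLI-style sentence pairs.

    Each (question, sentence) pair is labeled by whether the sentence contains
    the answer. The CBoW similarity ranking is also returned, since the paper's
    filtering step keeps only the harder cases (where the most similar sentence
    is not the answer, plus a matched number of cases where it is but a
    distractor is a close second).
    """
    q_vec = cbow_vector(question, glove)
    sims = [cosine(q_vec, cbow_vector(s, glove)) for s in sentences]
    ranking = sorted(range(len(sentences)), key=lambda i: -sims[i])
    pairs = []
    for i, sent in enumerate(sentences):
        pairs.append({
            "question": question,
            "sentence": sent,
            "label": "entailment" if i == answer_idx else "not_entailment",
            "similarity_rank": ranking.index(i),
        })
    return pairs

# Toy usage with a tiny, made-up "GloVe" table:
toy_glove = {"who": np.array([1.0, 0.0]), "she": np.array([0.9, 0.1]), "rain": np.array([0.0, 1.0])}
pairs = squad_paragraph_to_qnli("who is she", ["she is a doctor", "rain fell"], answer_idx=0, glove=toy_glove)
```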
RTE  The Recognizing Textual Entailment (RTE) datasets come from a series of annual challenges for the task of textual entailment, also known as NLI. We combine the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009) [5]. Each example in these datasets consists of a premise sentence and a hypothesis sentence, gathered from various online news sources. The task is to predict if the premise entails the hypothesis. We convert all the data to a two-class split (entailment or not entailment, where we collapse neutral and contradiction into not entailment for challenges with three classes) for consistency.

[5] RTE4 is not publicly available, while RTE6 and RTE7 do not fit the standard NLI task.

WNLI  The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension challenge where each example consists of a sentence containing a pronoun and a list of its possible referents in the sentence. The task is to determine the correct referent. The data is designed to foil simple statistical methods; it is constructed so that each example hinges on contextual information provided by a single word or phrase in the sentence, which can be switched out to change the answer. We use a small evaluation set consisting of new examples derived from fiction books [6] that was shared privately by the authors of the corpus. To convert the problem into a sentence pair classification task, we construct two sentence pairs per example by replacing the ambiguous pronoun with each possible referent. The task (a slight relaxation of the original Winograd Schema Challenge) is to predict if the sentence with the pronoun substituted is entailed by the original sentence. While the included training set is balanced between two classes (entailment and not entailment), the test set is imbalanced between them (35% entailment, 65% not entailment). We call the resulting sentence pair version of the dataset WNLI (Winograd NLI).

[6] See similar examples at cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html

3.4 Scoring

In addition to each task's metric or metrics, our benchmark reports a macro-average of the metrics over all tasks (see Table 5) to determine a system's position on the leaderboard. For tasks with multiple metrics (e.g., accuracy and F1), we use the unweighted average of the metrics as the score for the task.

3.5 Data and Bias

The tasks listed above are meant to represent a diverse sample of those studied in contemporary research on applied sentence-level language understanding, but we do not endorse the use of the task training sets for any specific non-research application. They do not cover every dialect of English one may wish to handle, nor languages outside of English, and as all of them contain text or annotations that were collected in uncontrolled settings, they contain evidence of stereotypes and biases that one may not wish their system to learn (Rudinger et al., 2017).

4 Diagnostic Dataset

Drawing inspiration from the FraCaS test suite (Cooper et al., 1996) and the recent Build-It-Break-It competition (Ettinger et al., 2017), we include a small, manually-curated test set to allow for fine-grained analysis of system performance on a broad range of linguistic phenomena. While the main benchmarks mostly reflect an application-driven distribution of examples (e.g. the question answering dataset will contain questions that people are likely to ask), our diagnostic dataset is collected to highlight a pre-defined set of modeling-relevant phenomena.
LS    PAS    L   K    Sentence 1                                 Sentence 2                                 Fwd   Bwd
       X               Cape sparrows eat seeds, along with        Seeds, along with soft plant parts and     E     E
                       soft plant parts and insects.              insects, are eaten by cape sparrows.
       X               Cape sparrows eat seeds, along with        Cape sparrows are eaten by seeds,          N     N
                       soft plant parts and insects.              along with soft plant parts and insects.
 X     X               Tulsi Gabbard disagrees with Bernie        Tulsi Gabbard and Bernie Sanders dis-      E     E
                       Sanders on what is the best way to deal    agree on what is the best way to deal
                       with Bashar al-Assad.                      with Bashar al-Assad.
 X                X    Musk decided to offer up his personal      Musk decided to offer up his personal      E     N
                       Tesla roadster.                            car.
                  X    The announcement of Tillerson’s de-        People across the globe were not ex-       E     N
                       parture sent shock waves across the        pecting Tillerson’s departure.
                       globe.
                  X    The announcement of Tillerson’s de-        People across the globe were prepared      C     C
                       parture sent shock waves across the        for Tillerson’s departure.
                       globe.
              X        I have never seen a hummingbird not        I have never seen a hummingbird.           N     E
                       flying.
       X               Understanding a long document re-          Understanding a long document re-          N     N
                       quires tracking how entities are intro-    quires evolving over time.
                       duced and evolve over time.
              X        Understanding a long document re-          Understanding a long document re-          E     N
                       quires tracking how entities are intro-    quires understanding how entities are
                       duced and evolve over time.                introduced.
 X                     That perspective makes it look gigantic.   That perspective makes it look minus-      C     C
                                                                  cule.

Table 2: Examples from the analysis set. Sentence pairs are labeled according to four coarse categories:
Lexical Semantics (LS), Predicate-Argument Structure (PAS), Logic (L), and Knowledge and Common
Sense (K). Within each category, each example is also tagged with fine-grained labels (see Tables 3 and 4). See
gluebenchmark.com for details on the set of labels, their meaning, and how we do the categorization.
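As a concrete illustration of the labeling scheme in the caption above, a diagnostic example can be thought of as a sentence pair carrying a non-exclusive set of coarse (and fine-grained) tags plus an entailment label in each direction. The sketch below is ours, not part of the released data format; the field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class DiagnosticExample:
    """One diagnostic sentence pair with non-exclusive category tags and
    entailment labels in both directions (Fwd: sentence 1 as premise,
    Bwd: sentence 2 as premise)."""
    sentence1: str
    sentence2: str
    coarse_tags: Set[str] = field(default_factory=set)  # subset of {"LS", "PAS", "L", "K"}
    fine_tags: Set[str] = field(default_factory=set)    # e.g. {"Lexical Entailment", "Upward Monotone"}
    fwd_label: str = "neutral"                           # "entailment" | "neutral" | "contradiction"
    bwd_label: str = "neutral"

# The first row of Table 2, encoded this way (its fine-grained tags are not shown in the table):
row1 = DiagnosticExample(
    sentence1="Cape sparrows eat seeds, along with soft plant parts and insects.",
    sentence2="Seeds, along with soft plant parts and insects, are eaten by cape sparrows.",
    coarse_tags={"PAS"},
    fwd_label="entailment",
    bwd_label="entailment",
)
```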

Specifically, we construct a set of NLI examples with fine-grained annotations of the linguistic phenomena they capture. The NLI task is well suited to this kind of analysis, as it is constructed to make it straightforward to evaluate the full set of skills involved in (ungrounded) sentence understanding, from the resolution of syntactic ambiguity to pragmatic reasoning with world knowledge. We ensure that the examples in the diagnostic dataset have a reasonable distribution over word types and topics by building on naturally-occurring sentences from several domains. Table 2 shows examples from the dataset.

Linguistic Phenomena  We tag every example with fine- and coarse-grained categories of the linguistic phenomena they involve (categories shown in Table 3). While each example was collected with a single phenomenon in mind, it is often the case that it falls under other categories as well. We therefore code the examples under a non-exclusive tagging scheme, in which a single example can participate in many categories at once. For example, to know that I like some dogs entails I like some animals, it is not sufficient to know that dog lexically entails animal; one must also know that dog/animal appears in an upward monotone context in the sentence. This example would be classified under both Lexical Semantics > Lexical Entailment and Logic > Upward Monotone.

Domains  We construct sentences based on existing text from four domains: News (drawn from articles linked on Google News [7]), Reddit (from threads linked on the Front Page [8]), Wikipedia (from Featured Articles [9]), and academic papers drawn from the proceedings of recent ACL conferences. We include 100 sentence pairs constructed from each source, as well as 150 artificially-constructed sentence pairs.

[7] news.google.com
[8] reddit.com
[9] en.wikipedia.org/wiki/Wikipedia:Featured_articles
Coarse-Grained Categories       Fine-Grained Categories

Lexical Semantics               Lexical Entailment, Morphological Negation, Factivity, Symmetry/Collectivity,
                                Redundancy, Named Entities, Quantifiers

Predicate-Argument Structure    Core Arguments, Prepositional Phrases, Ellipsis/Implicits, Anaphora/Coreference,
                                Active/Passive, Nominalization, Genitives/Partitives, Datives, Relative Clauses,
                                Coordination Scope, Intersectivity, Restrictivity

Logic                           Negation, Double Negation, Intervals/Numbers, Conjunction, Disjunction, Conditionals,
                                Universal, Existential, Temporal, Upward Monotone, Downward Monotone, Non-Monotone

Knowledge                       Common Sense, World Knowledge

Table 3: The types of linguistic phenomena annotated in the diagnostic dataset, organized under four
major categories.

Annotation Process  We begin with an initial set of fine-grained semantic phenomena, using the categories in the FraCaS test suite (Cooper et al., 1996) as a starting point, while also generalizing to include lexical semantics, common sense, and world knowledge. We gather examples by searching through text in each domain and locating example sentences that can be easily modified to involve one of the chosen phenomena (or that involve one already). We then modify the sentence further to produce the other sentence in an NLI pair. In many cases, we make these modifications small, in order to encourage high lexical and structural overlap among the sentence pairs, which may make the examples more difficult for models that rely on lexical overlap as an indicator for entailment. We then label the NLI relations between the sentences in both directions (considering each sentence alternatively as the premise), producing two labeled examples for each pair. Where possible, we produce several pairs with different labels for a single sentence, to have minimal sets of sentence pairs that are lexically and structurally very similar but correspond to different entailment relationships. After finalizing the categories, we gathered a minimum number of examples in each fine-grained category from each domain to ensure a baseline level of diversity.

In total, we gather 550 sentence pairs, for 1100 entailment examples. The labels are 42% entailment, 35% neutral, and 23% contradiction.

Auditing  In light of recent work showing that crowdsourced data often contains artifacts which can be exploited to perform well without solving the intended task (Schwartz et al., 2017; Gururangan et al., 2018), we perform an audit of our manually curated data as a sanity check. We reproduce the methodology of Gururangan et al. (2018), training fastText classifiers (Joulin et al., 2016) to predict entailment labels on SNLI and MultiNLI using only the hypothesis as input. Testing these on the diagnostic data, accuracies are 32.7% and 36.4%, very close to chance, showing that the data does not suffer from artifacts of this specific kind. We also evaluate state-of-the-art NLI models on the diagnostic dataset and find their overall performance to be rather weak, further suggesting that no easily-gameable artifacts present in existing training data are abundant in the diagnostic dataset (see Section 6).

Evaluation  Since the class distribution in the diagnostic set is not uniform (and is even less so within each category), we propose using R3, a three-class generalization of the Matthews correlation coefficient, as the evaluation metric. This coefficient was introduced by Gorodkin (2004) as RK, a generalization of the Pearson correlation that works for K dimensions by averaging the square error from the mean value in each dimension, i.e., calculating the full covariance between the input and output. In the discrete case, it generalizes Matthews correlation, where a value of 1 means perfect correlation and 0 means random chance.
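The paper does not prescribe an implementation for R3, but one readily available option (an assumption on our part) is scikit-learn's matthews_corrcoef, which computes Gorodkin's multi-class generalization of the Matthews correlation; with three classes it yields the R3 statistic described above.

```python
from sklearn.metrics import matthews_corrcoef

# Hypothetical gold and predicted labels over the three NLI classes.
gold = ["entailment", "neutral", "contradiction", "entailment", "neutral", "contradiction"]
pred = ["entailment", "entailment", "contradiction", "neutral", "neutral", "contradiction"]

# matthews_corrcoef handles the multi-class case: 1 means perfect correlation
# with the gold labels, 0 means chance-level prediction.
print(f"R3 = {matthews_corrcoef(gold, pred):.3f}")
```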
Tags           Premise                                            Hypothesis                                      Fwd     Bwd
    UQuant         Our deepest sympathies are with all those          Our deepest sympathies are with a victim           E       N
                   affected by this accident.                         who was affected by this accident.
    MNeg           We built our society on unclean energy.            We built our society on clean energy.              C       C
    MNeg, 2Neg     The market is about to get harder, but not         The market is about to get harder, but possi-      E       E
                   impossible to navigate.                            ble to navigate.
    2Neg           I have never seen a hummingbird not flying.        I have always seen hummingbirds flying.            E       E
    2Neg, Coref    It’s not the case that there is no rabbi at this   A rabbi is at this wedding, standing right         E       E
                   wedding; he is right there standing behind         there standing behind that tree.
                   that tree.

Table 4: Examples from the diagnostic evaluation. Tags are Universal Quantification (UQuant), Morpho-
logical Negation (MNeg), Double Negation (2Neg), and Anaphora/Coreference (Coref). Other tags on
these examples are omitted for brevity.

Intended Use  Because these analysis examples are hand-picked to address certain phenomena, we expect that they will not be representative of the distribution of language as a whole, even in the targeted domains. However, NLI is a task with no natural input distribution. We deliberately select sentences that we hope will be able to provide insight into what models are doing, what phenomena they catch on to, and where they are limited. This means that the raw performance numbers on the analysis set should be taken with a grain of salt. The set is provided not as a benchmark, but as an analysis tool to paint in broad strokes the kinds of phenomena a model may or may not capture, and to provide a set of examples that can serve for error analysis, qualitative model comparison, and development of adversarial examples that expose a model's weaknesses.

5 Baselines

As baselines, we provide performance numbers for a relatively simple multi-task learning model trained from scratch on the benchmark tasks, as well as several more sophisticated variants that utilize recent developments in transfer learning. We also evaluate a sample of competitive existing sentence representation models, where we only train task-specific classifiers on top of the representations they produce.

5.1 Multi-task Architecture

Our simplest baseline is based on sentence-to-vector encoders, and sets aside GLUE's ability to evaluate models with more complex structures. Taking inspiration from Conneau et al. (2017), the model uses a BiLSTM with temporal max-pooling and 300-dimensional GloVe word embeddings (Pennington et al., 2014) trained on 840B Common Crawl. For single-sentence tasks, we process the sentence and pass the resulting vector to a classifier. For sentence-pair tasks, we process sentences independently to produce vectors u, v, and pass [u; v; |u - v|; u * v] to a classifier. We experiment with logistic regression and a multi-layer perceptron with a single hidden layer for classifiers, leaving the choice as a hyperparameter to tune.

For sentence-pair tasks, we take advantage of GLUE's indifference to model architecture by incorporating a matrix attention mechanism between the two sentences. By explicitly modeling the interaction between sentences, our model is strictly outside of the sentence-to-vector paradigm. We follow standard practice to contextualize each token with attention. Given two sequences of hidden states u_1, u_2, ..., u_M and v_1, v_2, ..., v_N, the attention mechanism is implemented by first computing a matrix H where H_ij = u_i · v_j. For each u_i, we get attention weights α_i by taking a softmax over the ith row of H, and get the corresponding context vector ṽ_i = Σ_j α_ij v_j by taking the attention-weighted sum of the v_j. We pass a second BiLSTM with max pooling over the sequence [u_1; ṽ_1], ..., [u_M; ṽ_M] to produce u'. We process the v_j vectors in a symmetric manner to obtain v'. Finally, we feed [u'; v'; |u' - v'|; u' * v'] into a classifier for each task.

Incorporating Transfer Learning  We also augment our base non-attentive model with two recently proposed methods for transfer learning in NLP: ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017). Both use pretrained models that produce contextual word embeddings via some transformation of the underlying model's hidden states.

ELMo uses a pair of two-layer neural language models (one forward, one backward) trained on the One Billion Word Benchmark (Chelba et al., 2013). A word's contextual embedding is produced by taking a linear combination of the corresponding hidden states on each layer. We follow the authors' recommendations [10] and use the ELMo embeddings in place of any other embeddings.

[10] github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md
                        Single Sent      Similarity and Paraphrase                Natural Language Inference
  Model          Avg    CoLA    SST-2    MRPC         QQP          STS-B         MNLI          QNLI    RTE     WNLI

  Single-Task Training
  BiLSTM +ELMo   60.3   22.2    89.3     70.3/79.6    84.8/64.2    72.3/70.9     77.1/76.6A    82.3A   48.3A   63

  Multi-Task Training
  BiLSTM         55.3    0.0    84.9     72.6/80.9    85.5/63.4    71.8/70.1     69.3/69.1     75.6    57.6    43
   +Attn         56.3    0.0    83.8     74.1/82.1A   85.1/63.6A   71.1/69.8A    72.4/72.2A    78.5A   61.1A   44A
   +ELMo         55.8    0.9    88.4     72.7/82.6    79.0/58.9    73.3/72.0     71.3/71.8     75.8    56.0    46
   +CoVe         56.7    1.8    85.0     73.5/81.4    85.2/63.5    73.4/72.1     70.5/70.5     75.6    57.6    52

  Pre-trained Sentence Representation Models
  CBoW           52.2    0.0    80.1     72.6/81.1    79.4/51.2    61.2/59.0     55.7/56.3     71.5    54.8    63
  Skip-Thought   55.4    0.0    82.1     73.4/82.4    81.7/56.1    71.7/69.8     62.8/62.7     72.6    55.1    64
  InferSent      58.1    2.8    84.8     75.7/82.8    86.1/62.7    75.8/75.7     66.0/65.6     73.7    59.3    65
  DisSent        56.4    9.9    83.9     76.7/83.7    85.3/62.7    66.2/65.1     57.7/58.0     68.0    59.5    65
  GenSen         59.2    1.4    83.6     78.2/84.6    83.2/59.8    79.0/79.4     71.2/71.0     78.7    59.9    65
Table 5: Performance on the benchmark tasks for different models. Bold denotes best results per task
overall; underline denotes best results per task within a section; A denotes models using attention. For
MNLI, we report accuracy on the matched / mismatched test splits. For MRPC and QQP, we report
accuracy / F1. For STS-B, we report Pearson / Spearman correlation, scaled to be in [-100, 100]. For
CoLA, we report Matthews correlation, scaled to be in [-100, 100]. For all other tasks we report accuracy
(%). We compute a macro-average score in the style of SentEval by taking the average across all tasks,
first averaging the metrics within each task for tasks with more than one reported metric.
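The macro-average described in the caption (and in Section 3.4) reduces to a two-step mean. The sketch below uses made-up scores and a hypothetical dictionary layout purely to illustrate the rule; it is not intended to reproduce any row of Table 5.

```python
# Hypothetical per-task scores, one or two metrics each, on the same 0-100 scale as Table 5.
task_scores = {
    "CoLA":  [15.0],          # Matthews correlation, scaled to [-100, 100]
    "SST-2": [85.0],          # accuracy
    "MRPC":  [73.0, 81.0],    # accuracy, F1
    "QQP":   [80.0, 60.0],    # accuracy, F1
    "STS-B": [70.0, 69.0],    # Pearson, Spearman
    "MNLI":  [70.0, 69.5],    # matched, mismatched accuracy
    "QNLI":  [76.0],
    "RTE":   [57.0],
    "WNLI":  [45.0],
}

def glue_macro_average(task_scores):
    """First average the metrics within each task, then average across all tasks."""
    per_task = [sum(metrics) / len(metrics) for metrics in task_scores.values()]
    return sum(per_task) / len(per_task)

print(round(glue_macro_average(task_scores), 1))
```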

CoVe uses a sequence-to-sequence model with a two-layer BiLSTM encoder trained for English-to-German translation. The CoVe vector C(w_i) of a word is the corresponding hidden state of the top-layer LSTM. As per the original work, we concatenate the CoVe vectors to the GloVe word embeddings.

5.2 Multi-task Training

These four models (BiLSTM, BiLSTM +Attn, BiLSTM +ELMo, BiLSTM +CoVe) are jointly trained on all tasks, with the primary BiLSTM encoder shared between all task-specific classifiers. To perform multi-task training, we randomly pick an ordering on the tasks and train on 10% of a task's training data for each task in that order. We repeat this process 10 times between validation checks, so that we roughly train on all training examples for each task once between checks. We use the previously defined macro-average as the validation metric, where for tasks without predetermined development sets, we reserve 10% of the training data for validation.

We train our models with stochastic gradient descent using batch size 128, and multiply the learning rate by 0.2 whenever validation performance does not improve. We stop training when the learning rate drops below 10^-5 or validation performance does not improve after 5 evaluations. We tune hyperparameters with random search over 30 runs on macro-average development set performance. Our best model is a two-layer BiLSTM that is 1500-dimensional per direction. We evaluate all our BiLSTM-based models with these settings.

5.3 Single-task Training

We use the same training procedure to train an instance of the model with ELMo on each task separately. For tuning hyperparameters per task, we use random search on that task's metrics evaluated on the development set. We tune the same hyperparameters as in the multi-task setting, except we also tune whether or not to use attention (for pair tasks only), and whether to use SGD or Adam (Kingma and Ba, 2014).
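The multi-task schedule in Section 5.2 can be summarized in a short training loop. The sketch below is ours: `model.train_batches` and `evaluate_macro_avg` are hypothetical stand-ins for the actual training and validation code, and the data sampling is only an approximation of "roughly one pass over every task between checks".

```python
import random

def multitask_training_loop(tasks, model, evaluate_macro_avg, initial_lr=0.1):
    """Sketch of the schedule in Section 5.2 (SGD with batch size 128 in the paper).

    `tasks` maps task names to lists of training batches; `model.train_batches`
    and `evaluate_macro_avg` are hypothetical stand-ins.
    """
    lr = initial_lr
    best_score, evals_without_improvement = float("-inf"), 0
    while lr >= 1e-5 and evals_without_improvement < 5:
        # Between two validation checks: 10 passes, each visiting every task in a
        # random order and training on roughly 10% of that task's data, so that one
        # check corresponds to about one epoch over every task.
        for _ in range(10):
            for name in random.sample(list(tasks), k=len(tasks)):
                batches = tasks[name]
                chunk = random.sample(batches, k=max(1, len(batches) // 10))
                model.train_batches(name, chunk, lr=lr)
        score = evaluate_macro_avg(model)   # macro-average over development sets
        if score > best_score:
            best_score, evals_without_improvement = score, 0
        else:
            evals_without_improvement += 1
            lr *= 0.2                       # decay the learning rate when validation stalls
    return model
```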
5.4 Sentence Representation Models

Finally, we evaluate a number of established sentence-to-vector encoder models using our suite. Specifically, we investigate:

  1. CBoW: the average of the GloVe embeddings of the tokens in the sentence.

  2. Skip-Thought (Kiros et al., 2015): a sequence-to-sequence(s) model trained to generate the previous and next sentences given the middle sentence. After training, the model's encoder is taken as a sentence encoder. We use the original pretrained model [11] trained on sequences of sentences from the Toronto Book Corpus (Zhu et al. 2015, TBC).

  3. InferSent (Conneau et al., 2017): a BiLSTM with max-pooling trained on MNLI and SNLI.

  4. DisSent (Nie et al., 2017): a BiLSTM with max-pooling trained to predict the discourse marker (e.g. "because", "so", etc.) relating two sentences on data derived from TBC (Zhu et al., 2015). We use the variant trained to predict eight discourse marker types.

  5. GenSen (Subramanian et al., 2018): a sequence-to-sequence model trained on a variety of supervised and unsupervised objectives. We use a variant of the model trained on both MNLI and SNLI, the Skip-Thought objective on TBC, and a constituency parsing objective on the One Billion Word Benchmark.

We use pretrained versions of these models, fix their parameters, and learn task-specific classifiers on top of the sentence representations that they produce. We use the SentEval framework to train the classifiers.

[11] github.com/ryankiros/skip-thoughts

6 Benchmark Results

We present performance on the main benchmark in Table 5. For multi-task models, we average performance over five runs; for single-task models, we use only one run.

We find that the single-task baselines have the best performance among all models on SST-2, MNLI, and QNLI, while lagging behind multi-task trained models on MRPC, STS-B, and RTE. For MRPC and RTE in particular, the single-task baselines are close to majority class baselines, indicating the inherent difficulty of these tasks and the potential of transfer learning approaches. On QQP, the best multi-task trained models slightly outperform the single-task baseline.

For multi-task trained baselines, we find that almost no model does significantly better on CoLA or WNLI than performance from predicting the majority class (0.0 and 63, respectively), which highlights the difficulty of current models to generalize to these tasks. The notable exception is DisSent, which does better than other multi-task models on CoLA. A possible explanation is that DisSent is trained using a discourse-based objective, which might be more sensitive to grammaticality. However, DisSent underperforms other multi-task models on more data-rich tasks such as MNLI and QNLI. This result demonstrates the utility of GLUE: by assembling a wide variety of tasks, it highlights the relative strengths and weaknesses of various models.

Among our multi-task BiLSTM models, using attention yields a noticeable improvement over the vanilla BiLSTM for all tasks involving sentence pairs. When using ELMo or CoVe, we see improvements for nearly all tasks. There is also a performance gap between all variants of our multi-task BiLSTM model and the best models that use pre-trained sentence representations (GenSen and InferSent), demonstrating the utility of transfer via pre-training on an auxiliary task.

Among the pretrained sentence representation models, we observe relatively consistent per-task and aggregate performance gains moving from CBoW to Skip-Thought to DisSent, to InferSent and GenSen. The latter two show competitive performance on various tasks, with GenSen slightly edging out InferSent in aggregate.

  Model            LS    PAS    L     K     All
  BiLSTM           13    27     15    15    20
  BiLSTM +AttnA    26    32     24    18    27
  BiLSTM +ELMo     14    22     14    17    17
  BiLSTM +CoVe     17    30     17    14    21
  CBoW             09    13     08    10    10
  Skip-Thought     02    25     09    08    12
  InferSent        17    18     15    13    19
  DisSent          09    14     11    15    13
  GenSen           27    27     14    10    20

Table 6: Results on the diagnostic set. A denotes models using attention. All numbers in the table
are R3 coefficients between gold and predicted labels within each category (percentage). The categories
are Lexical Semantics (LS), Predicate-Argument Structure (PAS), Logic (L), and Knowledge and Common
Sense (K).

7 Analysis

By running all of the models on the diagnostic set, we get a breakdown of their performance across a set of modeling-relevant phenomena. Overall results are presented in Table 6.
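Breakdowns like Table 6 and Figure 1 can be produced by grouping diagnostic predictions by their category tags. The sketch below is our own illustration; the example fields (`label`, `coarse_tags`) are assumed names, not the released data format.

```python
from collections import defaultdict
from sklearn.metrics import confusion_matrix, matthews_corrcoef

def diagnostic_breakdown(examples, predictions):
    """Sketch of the per-category analysis behind Table 6 and Figure 1.

    `examples` are diagnostic items with gold labels and (non-exclusive) coarse
    tags; `predictions` are model labels, index-aligned with `examples`.
    """
    labels = ["entailment", "contradiction", "neutral"]
    gold = [ex["label"] for ex in examples]

    # Figure 1a-style confusion matrix (rows: gold, columns: predicted).
    cm = confusion_matrix(gold, predictions, labels=labels)

    # Table 6-style R3 per coarse category; an example counts toward every
    # category it is tagged with, since the tagging scheme is non-exclusive.
    by_cat = defaultdict(lambda: ([], []))
    for ex, pred in zip(examples, predictions):
        for cat in ex["coarse_tags"]:
            by_cat[cat][0].append(ex["label"])
            by_cat[cat][1].append(pred)
    r3_per_cat = {cat: matthews_corrcoef(g, p) for cat, (g, p) in by_cat.items()}
    return cm, r3_per_cat
```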
7   Analysis

By running all of the models on the diagnostic set, we get a breakdown of their performance across a set of modeling-relevant phenomena. Overall results are presented in Table 6.

    Gold \ Prediction    All    E    C    N
    All                         65   16   19
    E                     42    34    3    4
    C                     23    11    8    4
    N                     35    19    5   11

    (a) Confusion matrix for BiLSTM +Attn (percentages).

    Model            E    C    N
    BiLSTM          71   16   13
    BiLSTM +Attn    65   16   19
    BiLSTM +ELMo    81    9   10
    BiLSTM +CoVe    75   13   13
    CBoW            84    7    9
    SkipThought     80    8   12
    InferSent       68   21   11
    DisSent         73   18    8
    GenSen          74   15   11
    Gold            42   23   35

    (b) Output class distributions (percentages). Bolded numbers are closest to the gold distribution.

Figure 1: Partial output of GLUE's error analysis, aggregated across our models.

    Model            UQuant   MNeg   2Neg   Coref
    BiLSTM               67     13      5      24
    BiLSTM +Attn         85     64     11      20
    BiLSTM +ELMo         77     60     -8      18
    BiLSTM +CoVe         71     34     28      39
    CBoW                 16      0     13      21
    SkipThought          61      6     -2      30
    InferSent            64     51    -22      26
    DisSent              70     34    -20      21
    GenSen               78     64      5      26

Table 7: Model performance in terms of R3 (scaled by 100) on selected fine-grained categories for analysis. The categories are Universal Quantification (UQuant), Morphological Negation (MNeg), Double Negation (2Neg), and Anaphora/Coreference (Coref).
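The statistics shown in Figure 1 can be reproduced from a model's raw predictions with standard tooling. A small sketch using invented entailment (E), contradiction (C), and neutral (N) labels, purely for illustration:

    import numpy as np
    from collections import Counter
    from sklearn.metrics import confusion_matrix

    labels = ["E", "C", "N"]                          # entailment, contradiction, neutral
    gold = ["E", "E", "C", "N", "N", "E", "C", "N"]   # invented gold labels
    pred = ["E", "E", "E", "E", "C", "E", "N", "E"]   # invented model predictions

    # Confusion matrix as percentages of all examples, as in Figure 1(a).
    cm = confusion_matrix(gold, pred, labels=labels)
    print(np.round(100 * cm / cm.sum(), 1))

    # Predicted class distribution in percentages, as in Figure 1(b).
    counts = Counter(pred)
    print({lab: 100 * counts[lab] / len(pred) for lab in labels})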
Overall Performance  Performance is very low across the board: the highest total score (27) still denotes poor absolute performance. Scores on the Predicate-Argument Structure category tend to be higher across all models, while Knowledge category scores are lower. However, these trends do not necessarily mean that our models understand sentence structure better than world knowledge or common sense; the numbers are not directly comparable across categories. Rather, scores should be compared between models within each category.

One notable trend is the high performance of the BiLSTM +Attn model: though it does not outperform most of the pretrained sentence representation methods (InferSent, DisSent, GenSen) on GLUE's main benchmark tasks, it performs best or competitively on all categories of the diagnostic set.

Domain Shift & Class Priors  GLUE's online platform also provides a submitted model's predicted class distributions and confusion matrices. We provide an example in Figure 1. One point is immediately clear: all models severely under-predict neutral and over-predict entailment. This is perhaps indicative of the models' inability to generalize and adapt to new domains. We hypothesize that they learned to treat high lexical overlap as a strong sign of entailment, and that surgical addition of new information to the hypothesis (as in the case of neutral instances in the diagnostic set) might go unnoticed. Indeed, the attention-based model seems more sensitive to the neutral class, and is perhaps better at detecting small sets of unaligned tokens because it explicitly tries to model these alignments.

Linguistic Phenomena  While performance metrics on the coarse-grained categories give us broad strokes for comparing models, we can gain a better understanding of the models' capabilities by drilling down into the fine-grained subcategories. The GLUE platform reports scores for every fine-grained category; we present a few highlights in Table 7. To help interpret these results, we list some examples from each fine-grained category, along with model predictions, in Table 4.

The Universal Quantification category appears easy for most of the models; looking at examples, it seems that when universal quantification is isolated as a phenomenon, catching on to lexical cues such as "all" often suffices to solve our examples. Morphological negation examples are superficially similar, but the systems find them more difficult. Double negation, on the other hand, appears to be adversarially difficult for models to recognize, with the exception of BiLSTM +CoVe; this is perhaps due to the translation signal, which can match phrases like "not bad" and "okay" to the same expression in a foreign language. A similar advantage, though less acute, appears when using CoVe on coreference examples.
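The per-category scores in Table 7 can be computed by grouping diagnostic examples by their fine-grained category tag and scoring each group with R3, the three-class case of Gorodkin's K-category correlation coefficient; scikit-learn's matthews_corrcoef computes this statistic for multiclass labels. The column names and rows below are invented for illustration and are not the diagnostic set's actual schema:

    import pandas as pd
    from sklearn.metrics import matthews_corrcoef

    # Hypothetical diagnostic predictions: one row per example, with a
    # fine-grained category tag, a gold label, and a model prediction.
    df = pd.DataFrame({
        "category": ["UQuant", "UQuant", "MNeg", "MNeg", "2Neg", "2Neg"],
        "gold":     ["E", "N", "C", "E", "E", "C"],
        "pred":     ["E", "N", "C", "C", "C", "C"],
    })

    # R3 (multiclass Matthews correlation), scaled by 100 as in Table 7.
    per_category = df.groupby("category")[["gold", "pred"]].apply(
        lambda g: 100 * matthews_corrcoef(g["gold"], g["pred"])
    )
    print(per_category)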
Overall, there is some evidence that going beyond sentence-to-vector representations might aid performance on out-of-domain data (as with BiLSTM +Attn), and that representations like ELMo and CoVe encode important linguistic information that is specific to their supervision signal. Our platform and diagnostic dataset should support future inquiries into these issues, so we can better understand our models' generalization behavior and what kind of information they encode.

8   Conclusion

We introduce GLUE, a platform and collection of resources for training, evaluating, and analyzing general natural language understanding systems. When evaluating existing models on the main GLUE benchmark, we find that none are able to substantially outperform a relatively simple baseline of training a separate model for each constituent task. When evaluating these models on our diagnostic dataset, we find that they spectacularly fail on a wide range of linguistic phenomena. The question of how to design general-purpose NLU models thus remains unanswered. We believe that GLUE, and the generality it promotes, can provide fertile soil for addressing this open challenge.

Acknowledgments

We thank Ellie Pavlick, Tal Linzen, Kyunghyun Cho, and Nikita Nangia for their comments on this work at its early stages, and we thank Ernie Davis, Alex Warstadt, and Quora's Nikhil Dandekar and Kornel Csernai for providing access to private evaluation data. This project has benefited from financial support to SB by Google, Tencent Holdings, and Samsung Research.

References

Eneko Agirre, Lluís Màrquez, and Richard Wicentowski, editors. 2007. Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). Association for Computational Linguistics, Prague, Czech Republic.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ICLR 2015.

Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. In 11th International Workshop on Semantic Evaluations.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Recurrent neural network-based sentence encoder with gated attention for natural language inference. In 2nd Workshop on Evaluating Vector Space Representations for NLP.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In LREC 2018.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 681–691.

Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Josef Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, Steve Pulman, Ted Briscoe, Holger Maier, and Karsten Konrad. 1996. Using the framework. Technical report, The FraCaS Consortium.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine learning challenges. Evaluating predictive uncertainty, visual object classification, and recognising tectual entailment, pages 177–190. Springer.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proc. of IWP.
T. Mark Ellison, editor. 1997. Computational Natural Language Learning: Proceedings of the 1997 Meeting of the ACL Special Interest Group in Natural Language Learning. Association for Computational Linguistics.

Allyson Ettinger, Sudha Rao, Hal Daumé III, and Emily M. Bender. 2017. Towards linguistically generalizable NLP systems: A workshop and shared task. In First Workshop on Building Linguistically Generalizable NLP Systems.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9. Association for Computational Linguistics.

Yichen Gong, Heng Luo, and Jian Zhang. 2018. Natural language inference over interaction space. In Proceedings of ICLR 2018.

J. Gorodkin. 2004. Comparing two K-category assignments by a K-category correlation coefficient. Comput. Biol. Chem., 28(5-6):367–374.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2016. A joint many-task model: Growing a neural network for multiple NLP tasks. In Proceedings of EMNLP 2017.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proceedings of NAACL 2016.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177. ACM.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR 2015.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-Thought vectors. In Advances in neural information processing systems, pages 3294–3302.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1188–1196, Bejing, China. PMLR.

Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.

Brian W. Matthews. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442–451.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297–6308.

Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. CoRR, abs/1705.06476.

Allen Nie, Erin D. Bennett, and Noah D. Goodman. 2017. DisSent: Sentence representation learning from explicit discourse relations. arXiv preprint arXiv:1710.04334.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, page 271. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on Association for Computational Linguistics, pages 115–124. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of ICLR.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. Association for Computational Linguistics.
Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2017. Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.

Rachel Rudinger, Chandler May, and Benjamin Van Durme. 2017. Social bias in elicited natural language inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 74–79.

Carson T. Schütze. 1996. The empirical base of linguistics: Grammaticality judgments and linguistic methodology. University of Chicago Press.

Roy Schwartz, Maarten Sap, Ioannis Konstas, Li Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proc. of CoNLL.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. In ICLR 2017.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 231–235.

Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. In Proceedings of ICLR.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Ellen M. Voorhees et al. 1999. The TREC-8 question answering track report. In TREC, volume 99, pages 77–82.

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. In ICLR 2016.

Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. 2017. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 996–1005.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL 2018.

Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. In ICLR 2017.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.