Fixed That for You: Generating Contrastive Claims with Semantic Edits

Christopher Hidey and Kathleen McKeown
Department of Computer Science, Columbia University, New York, NY 10027
chidey@cs.columbia.edu, kathy@cs.columbia.edu

Proceedings of NAACL-HLT 2019, pages 1756-1767, Minneapolis, Minnesota, June 2-7, 2019. © 2019 Association for Computational Linguistics

Abstract

Understanding contrastive opinions is a key component of argument generation. Central to an argument is the claim, a statement that is in dispute. Generating a counter-argument then requires generating a response in contrast to the main claim of the original argument. To generate contrastive claims, we create a corpus of Reddit comment pairs self-labeled by posters using the acronym FTFY (fixed that for you). We then train neural models on these pairs to edit the original claim and produce a new claim with a different view. We demonstrate significant improvement over a sequence-to-sequence baseline in BLEU score and a human evaluation for fluency, coherence, and contrast.

1 Introduction

In the Toulmin model (1958), often used in computational argumentation research, the center of the argument is the claim, a statement that is in dispute (Govier, 2010). In recent years, there has been increased interest in argument generation (Bilu and Slonim, 2016; Hua and Wang, 2018; Le et al., 2018). Given an argument, a system that generates counter-arguments would need to 1) identify the claims to refute, 2) generate a new claim with a different view, and 3) find supporting evidence for the new claim. We focus on this second task, which requires an understanding of contrast. A system that can generate claims with different views is a step closer to understanding and generating arguments (Apothloz et al., 1993). We build on previous work in automated claim generation (Bilu et al., 2015), which examined generating opposing claims via explicit negation. However, researchers also noted that not every claim has an exact opposite. Consider a claim from Reddit:

    Get employers out of the business, pass universal single-payer healthcare. (1)

This is an example of a policy claim, a view on what should be done (Schiappa and Nordin, 2013).

While negation of this claim is a plausible response (e.g. asserting there should be no change by stating Do not get employers out of the business, do not pass universal healthcare), negation limits the diversity of responses that can lead to a productive dialogue. Instead, consider a response that provides an alternative suggestion:

    Get employers out of the business, deregulate and allow cross-state competition. (2)

In Example 1, the speaker believes in an increased role for government, while in Example 2 the speaker believes in a decreased one. As these views are on different sides of the political spectrum, it is unlikely that a single speaker would utter both claims. In related work, de Marneffe et al. (2008) define two sentences as contradictory when they are extremely unlikely to be true simultaneously. We thus define a contrastive claim as one that is likely to be contradictory if made by the speaker of the original claim. Our goal, then, is to develop a method for generating contrastive claims when explicit negation is not the best option. Generating claims in this way also has the benefit of providing new content that can be used for retrieving or generating supporting evidence.

In order to make progress towards generating contrastive responses, we need large, high-quality datasets that illustrate this phenomenon. We construct a dataset of 1,083,520 contrastive comment pairs drawn from Reddit using a predictive model to filter out non-contrastive claims. Each pair contains very similar, partially aligned text, but the responder has significantly modified the original post. We use this dataset to model differences in views and generate a new claim given an original comment. The similarity within these pairs allows us to use them as distantly labeled contrastive word alignments. The word alignments provide semantic information about which words and phrases can be substituted in context in a coherent, meaningful way.

Our contributions¹ are as follows:

1. Methods and data for contrastive claim identification to mine comment pairs from Reddit, resulting in a large, continuously growing dataset of 1,083,520 distant-labeled examples.

2. A crowd-labeled set of 2,625 comments, each paired with 5 new contrastive responses generated by additional annotators.

3. Models for generating contrastive claims using neural sequence models and constrained decoding.

In the following sections, we describe the task methodology and data collection and processing. Next, we present neural models for contrastive claim generation, evaluate our work, and present an error analysis. We then discuss related work in contrast/contradiction, argumentation, and generation before concluding.

2 Task Definition and Motivation

Previous work in claim generation (Bilu et al., 2015) focused on explicit negation to provide opposing claims. While negation plays an important role in argumentation (Apothloz et al., 1993), researchers found that explicit negation may result in incoherent responses (Bilu et al., 2015). Furthermore, recent empirical studies have shown that arguments that provide new content (Wachsmuth et al., 2018) tend to be more effective. While new concepts can be introduced in other ways by finding semantically relevant content, we may find it desirable to explicitly model contrast in order to control the output of the model as part of a rhetorical strategy, e.g. concessions (Musi, 2018). We thus develop a model that generates a contrastive claim given an input claim.

Contrastive claims may differ in more than just viewpoint; they may also contain stylistic differences and paraphrases, among other aspects. We thus propose to model contrastive claims by controlling for context and maintaining the same text between pairs of contrastive claims except for the contrastive word or phrase. Much of the previous work in contrast and contradiction has examined the relationship between words or sentences. In order to understand when words and phrases are contrastive in argumentation, we need to examine them in context. For example, consider the claim Hillary Clinton should be president. A reasonable contrastive claim might be Bernie Sanders should be president (rather than the explicit negation Hillary Clinton should not be president). In this context, Hillary Clinton and Bernie Sanders are contrastive entities as they were both running for president. However, for the claim Hillary Clinton was the most accomplished Secretary of State in recent memory, they would be unrelated. Consider also that we could generate the claim Hillary Clinton should be senator. This contrastive claim is not coherent given the context. Generating a contrastive claim then requires 1) identifying the correct substitution span and 2) generating a response with semantically relevant replacements.

While some contrastive claims are not coherent, there are often multiple plausible responses, similar to tasks such as dialogue generation. For example, Donald Trump should be president is just as appropriate as Bernie Sanders should be president. We thus treat this as a dialogue generation task where the goal is to generate a plausible response given an input context.

3 Data

In order to model contrastive claims, we need datasets that reflect this phenomenon.

3.1 Collection

We obtain training data by scraping the social media site Reddit for comments containing the acronym FTFY.² FTFY is a common acronym meaning "fixed that for you."³ FTFY responses (hereafter FTFY) are used to respond to another comment by editing part of the "parent comment" (hereafter parent). Most commonly, FTFY is used for three categories of responses: 1) expressing a contrastive claim (e.g. the parent is Bernie Sanders for president and the FTFY is Hillary Clinton should be president), which may be sarcastic (e.g. Ted Cruz for president becomes Zodiac killer for president), 2) making a joke (e.g. This Python library really piques my interest vs. This really *py*ques my interest), and 3) correcting a typo (e.g. This peaks my interest vs. piques). In Section 3.2, we describe how we identify category 1 (contrastive claims) for modeling.

¹ Data and code available at github.com/chridey/fixedthat
² https://en.wiktionary.org/wiki/FTFY
³ https://reddit.zendesk.com/hc/en-us/articles/205173295-What-do-all-these-acronyms-mean

To obtain historical Reddit data, we mined comments from the site pushshift.io for December 2008 through October 2017. This results in 2,200,258 pairs from Reddit, where a pair consists of a parent and an FTFY. We find that many of the top occurring subreddits are ones where we would expect strong opinions (/r/politics, /r/worldnews, and /r/gaming).

3.2 Classification

To filter the data to only the type of response that we are interested in, we annotated comment pairs for contrastive claims and other types. We use our definition of contrastive claims based on contradiction, where both the parent and FTFY are a claim and they are unlikely to be beliefs held by the same speaker. A joke is a response that does not meaningfully contrast with the parent and commonly takes the form of a pun, rhyme, or oronym. A correction is a response to a typo, which may be a spelling or grammatical error. Any other pair is labeled as "other," including pairs where the parent is not a claim.

In order to identify contrastive claims, we selected a random subset of the Reddit data from prior to September 2017 and annotated 1,993 comments. Annotators were native speakers of English, and the inter-annotator agreement using Krippendorff's alpha was 0.72. Contrast occurs in slightly more than half of the sampled cases (51.4%), with jokes (23.0%) and corrections (21.2%) comprising about one quarter each. We then train a binary classifier to predict contrastive claims, thus enabling better quality data for the generation task.

To identify the sentence in the parent that the FTFY responds to and derive features for classification, we use an edit distance metric to obtain sentence and word alignments between the parent comment and response. As the words in the parent and response are mostly in the same order and most FTFYs contain significant overlap with the parent response, it is possible to find alignments by moving a sliding window over the parent. A sample of 100 comments verifies that this approach yields exact word alignments in 75 comments and exact sentence alignments in 93.
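
As a concrete illustration, the sketch below recovers token-level alignments for one comment pair. It is a minimal reconstruction, not the released code: we use Python's difflib.SequenceMatcher as a stand-in for the edit distance alignment described above, and the function name and example pair are our own.

```python
from difflib import SequenceMatcher

def align_words(parent_tokens, ftfy_tokens):
    """Token-level alignment between a parent comment and its FTFY.

    Yields (op, parent_span, ftfy_span) triples, where op is one of
    'equal' (copy), 'replace', 'delete', or 'insert'.
    """
    matcher = SequenceMatcher(a=parent_tokens, b=ftfy_tokens, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        yield op, parent_tokens[i1:i2], ftfy_tokens[j1:j2]

parent = "bernie sanders for president".split()
ftfy = "hillary clinton should be president".split()
for op, src, tgt in align_words(parent, ftfy):
    print(op, src, tgt)
# replace ['bernie', 'sanders', 'for'] ['hillary', 'clinton', 'should', 'be']
# equal ['president'] ['president']
```

The aligned spans directly provide the distantly labeled substitutions (here, the contrastive entity swap) that the classifier features below are computed over.
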

Given these pairs of comments, we derive linguistic and structural features for training a binary classifier. For each pair of comments, we compute features for the words in the entire comment span and features from the aligned phrase span only (as identified by edit distance). From the aligned phrases, we compute the character edit distance and character Jaccard similarity (both normalized by the number of characters) to attempt to capture jokes and typos (the similarity should be high if the FTFY is inventing an oronym or correcting a spelling error). From the entire comment, we use the percentage of characters copied, as a low percentage may indicate a poor alignment, and the percentage of non-ASCII characters, as many of the jokes use emojis or upside-down text. In addition, we use features from GloVe (Pennington et al., 2014) word embeddings⁴ for both the entire comment and aligned phrases. We include the percentage of words in the embedding vocabulary for both spans for both the parent and FTFY. The reason for this feature is to identify infrequent words which may be typos or jokes. We compute the cosine similarity of the average word embeddings between the parent and FTFY for both spans. Finally, we use average word embeddings for both spans for both parent and FTFY.

⁴ We found the 50-dimensional Wikipedia+Gigaword embeddings to be sufficient.
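
A few of these features can be sketched as follows. This is a hedged reconstruction rather than the released feature extractor: only the edit distance, character Jaccard, and embedding cosine features are reproduced, and `vectors` is assumed to be a dict-like mapping from token to GloVe vector.

```python
import numpy as np

def char_jaccard(a, b):
    """Character-level Jaccard similarity between two spans."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def edit_distance(a, b):
    """Plain dynamic-programming character edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def avg_embedding(text, vectors):
    """Average GloVe vector; assumes at least one in-vocabulary token."""
    return np.mean([vectors[w] for w in text.split() if w in vectors], axis=0)

def pair_features(parent, ftfy, vectors):
    """A subset of the pair features: normalized edit distance,
    character Jaccard, and cosine of the averaged embeddings."""
    norm = max(len(parent) + len(ftfy), 1)
    p, f = avg_embedding(parent, vectors), avg_embedding(ftfy, vectors)
    cosine = float(p @ f / (np.linalg.norm(p) * np.linalg.norm(f)))
    return [edit_distance(parent, ftfy) / norm,
            char_jaccard(parent, ftfy),
            cosine]
```
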
As we want to model the generation of new content, not explicit negation, we removed any pairs where the difference was only "stop words." The set of stop words includes all the default stop words in spaCy⁵ combined with expletives and special tokens (we replaced all URLs and usernames). We trained a logistic regression classifier and evaluated using 4-fold cross-validation. We compare to a character overlap baseline where any examples with Jaccard similarity > 0.9 and edit distance < 0.15 were classified as non-contrastive. The goal of this baseline is to illustrate how much of the non-contrastive data involves simple or non-existent substitutions. Results are shown in Table 1. Our model obtains an F-score of 80.25, an 8 point absolute improvement over the baseline.

    Model      Precision   Recall   F-score
    Majority   51.4        100      67.5
    Baseline   67.75       77.19    72.16
    LR         74.22       87.60    80.25

    Table 1: Results of Identifying Contrastive Claims

3.3 Selection

After using the trained model to classify the remaining data, we have 1,083,797 Reddit pairs. We set aside 10,307 pairs from October 1-20, 2017 for development and October 21-30 for test (6,773), with the remainder used for training. As we are primarily working with sentences, the mean parent length was 16.3 and the mean FTFY length was 14.3.

The resulting test FTFYs are naturally occurring and so do not suffer from annotation artifacts. At the same time, they are noisy and may not reflect the desired phenomenon. Thus, we also conducted an experiment on Amazon Mechanical Turk⁶ (AMT) to obtain additional gold references, which are further required by metrics such as BLEU (Papineni et al., 2002). We selected 2,625 pairs from the 10 most frequent categories⁷ (see Table 2). These categories form a three-level hierarchy for each subreddit and we use the second level, e.g. for /r/pokemongo the categories are "Pokémon", "Video Games", and "Gaming", so we use "Video Games." Before participating, each annotator was required to pass a qualification test of five questions to gauge their knowledge of that topic. For the movies category, one question we asked was whether for the sentence Steven Spielberg is the greatest director of all time, we could instead use Stanley Kubrick or Paul McCartney. If they passed this test, the annotators were then given the parent comment and "keywords" (the subreddit and three category levels) to provide additional context. We obtained five new FTFYs for each parent and validated them manually to remove obvious spam or trivial negation (e.g. "not" or "can't").

    Category      Count   Category     Count
    Video Games   1062    Basketball   116
    Politics      529     Soccer       99
    Football      304     Movies       88
    Television    194     Hockey       60
    World News    130     Baseball     55

    Table 2: Comments for Mechanical Turk

⁵ spacy.io
⁶ We paid annotators the U.S. federal minimum wage and the study was approved by an IRB.
⁷ Obtained from the snoopsnoo.com API.

4 Methods

Our goal of generating contrastive claims can be broken down into two primary tasks: 1) identifying the words in the original comment that should be removed or replaced and 2) generating the appropriate substitutions and any necessary context. Initially, we thus experimented with a modular approach by tagging each word in the parent and then using the model predictions to determine if we should copy, delete, or replace a segment with a new word or phrase. We tried the bi-directional LSTM-CNN-CRF model of Ma and Hovy (2016) and used our edit distance word alignments to obtain labels for copying, deleting, or replacing. However, we found this model performed only slightly above random predictions, and with error propagation, the model is unlikely to produce fluent and accurate output. Instead, we use an end-to-end approach using techniques from machine translation.

4.1 Model

We use neural sequence-to-sequence encoder-decoder models (Sutskever et al., 2014) with attention for our experiments. The tokens from the parent are passed as input to a bi-directional GRU (Cho et al., 2014) to obtain a sequence of encoder hidden states $h_i$. Our decoder is also a GRU, which at time $t$ generates a hidden state $s_t$ from the previous hidden state $s_{t-1}$ along with the input. When training, the input $x_t$ is computed from the previous word in the gold training data if we are in "teacher forcing" mode (Williams and Zipser, 1989) and otherwise is the prediction made by the model at the previous time step. When testing, we also use the model predictions. The input word $w_t$ may be augmented by additional features, as discussed in Section 4.2. In the baseline scenario $x_t = e(w_t)$, where $e$ is an embedding. The hidden state $s_t$ is then combined with a context vector $h^*_t$, which is a weighted combination of the encoder hidden states using an attention mechanism:

$$h^*_t = \sum_i \alpha_{ti} h_i$$

To calculate $\alpha_{ti}$, we use the attention of Luong et al. (2015), as this encourages the model to select features in the encoder hidden state which correlate with the decoder hidden state, which we want because our input and output are similar. Our experiments on the development data verified this, as Bahdanau attention (Bahdanau et al., 2015) performed worse. Attention is then calculated as:

$$\alpha_{ti} = \frac{\exp(h_i^\top s_t)}{\sum_{i'} \exp(h_{i'}^\top s_t)}$$

Finally, we make a prediction of a vocabulary word $w$ by using features from the context and decoder hidden state with a projection matrix $W$ and output vocabulary matrix $V$:

$$P(w) = \mathrm{softmax}(V \tanh(W[s_t; h^*_t] + b_w) + b_v)$$

We explored using a copy mechanism (See et al., 2017) for word prediction but found it difficult to prevent the model from copying the entire input.
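
The equations above can be written compactly in PyTorch. The sketch below is our reconstruction from Section 4.1, not the released code; it assumes the bi-directional encoder states have already been projected to the decoder's hidden size, and the bias terms $b_w$ and $b_v$ live inside the nn.Linear layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionStep(nn.Module):
    """One decoder step: dot-product (Luong-style) attention over the
    encoder states, then the vocabulary projection from Section 4.1."""

    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.W = nn.Linear(2 * hidden_size, hidden_size)  # [s_t; h*_t] -> hidden (bias = b_w)
        self.V = nn.Linear(hidden_size, vocab_size)       # hidden -> vocab logits (bias = b_v)

    def forward(self, s_t, encoder_states):
        # s_t: (batch, hidden); encoder_states: (batch, src_len, hidden)
        # alpha_ti proportional to exp(h_i^T s_t)
        scores = torch.bmm(encoder_states, s_t.unsqueeze(2)).squeeze(2)
        alpha = F.softmax(scores, dim=1)
        # h*_t = sum_i alpha_ti h_i
        context = torch.bmm(alpha.unsqueeze(1), encoder_states).squeeze(1)
        # P(w) = softmax(V tanh(W [s_t; h*_t] + b_w) + b_v)
        hidden = torch.tanh(self.W(torch.cat([s_t, context], dim=1)))
        return F.log_softmax(self.V(hidden), dim=1), alpha
```
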
4.2 Decoder Representation

Decoder Input: We evaluate two representations of the target input: as a sequence of words and as a sequence of edits. The sequence of words approach is the standard encoder-decoder setup. For the example parent Hillary Clinton for president 2020 and FTFY Bernie Sanders for president, we would use the FTFY without modification. Schmaltz et al. (2017) found success modeling error correction using sequence-to-sequence models by representing the target input as a sequence of edits. We apply a similar approach to our problem, generating a target sequence by following the best path in the matrix created by the edit distance algorithm. The new target sequence is the original parent interleaved with "DELETE-N" tokens that specify how many previous words to delete, followed by the newly generated content. For the same example, Hillary Clinton for president 2020, the modified target sequence would be Hillary Clinton DELETE-2 Bernie Sanders for president 2020 DELETE-1.
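
The conversion to this edit representation follows directly from the alignment. A minimal sketch, again using difflib as a stand-in for the paper's edit distance path; it reproduces the paper's own example:

```python
from difflib import SequenceMatcher

def to_edit_sequence(parent_tokens, ftfy_tokens):
    """Render the target as the parent interleaved with DELETE-N tokens
    followed by newly generated content (Section 4.2)."""
    target = []
    matcher = SequenceMatcher(a=parent_tokens, b=ftfy_tokens, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            target.extend(parent_tokens[i1:i2])   # emit the span to remove...
            target.append(f"DELETE-{i2 - i1}")    # ...then mark it for deletion
        if op in ("replace", "insert"):
            target.extend(ftfy_tokens[j1:j2])     # newly generated content
        if op == "equal":
            target.extend(parent_tokens[i1:i2])   # copied context
    return target

parent = "Hillary Clinton for president 2020".split()
ftfy = "Bernie Sanders for president".split()
print(" ".join(to_edit_sequence(parent, ftfy)))
# Hillary Clinton DELETE-2 Bernie Sanders for president 2020 DELETE-1
```
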
Counter: Kikuchi et al. (2016) found that by using an embedding for a length variable they were able to control output length via a learned mechanism. In our work, we compute a counter variable which is initially set to the number of new content words the model should generate. During decoding, the counter is decremented if a word is generated that is not in the source input ($I$) or in the set of stop words ($S$) defined in Section 3.2. The model uses an embedding $e(c_t)$ for each count, which is parameterized by a count embedding matrix. The input to the decoder state in this scenario is $x_t = e(w_t, c_t)$. At each time step, the count is computed by:

$$c_0 = |O \setminus (S \cup I)| \text{ or desired count}$$

$$c_{t+1} = \begin{cases} c_t - 1, & w_t \notin S \cup I \text{ and } c_t > 0 \\ c_t, & \text{otherwise} \end{cases}$$

where $O$ is the set of gold output words in training.

For the parent comment Hillary Clinton for president 2020 and FTFY Bernie Sanders for president, the decoder input is presented in Table 3, with the time $t$ in the first row and the inputs $w_t$ and $c_t$ in the second and third rows, respectively. At the start of decoding, the model expects to generate two new content words, which in this example it generates immediately, decrementing the counter. When the counter reaches 0, it only generates stop or input words.

    t    0   1        2         3     4
    wt   -   Bernie   Sanders   for   president
    ct   2   1        0         0     0

    Table 3: Example of Counter

Unlike the controlled-length scenario, at test time we do not know the number of new content words to generate. However, the count for most FTFYs is between 1 and 5, inclusive, so we can exhaustively search this range during decoding. We experimented with predicting the count but found it to be inaccurate, so we leave this for future work.
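
The counter initialization and update rule translate directly into code. A minimal sketch under the definitions above, with $O$, $S$, and $I$ as sets of tokens:

```python
def init_counter(gold_tokens, stop_words, input_tokens):
    """c_0 = |O \\ (S u I)|: number of new content words in the gold FTFY."""
    return len(set(gold_tokens) - (set(stop_words) | set(input_tokens)))

def update_counter(c_t, w_t, stop_words, input_tokens):
    """c_{t+1}: decrement only when a genuinely new content word is emitted."""
    if w_t not in stop_words and w_t not in input_tokens and c_t > 0:
        return c_t - 1
    return c_t
```

For the Table 3 example, $c_0 = 2$ (Bernie, Sanders), and the counter reaches 0 once both new content words have been emitted.
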

Subreddit Information: As the model often needs to disambiguate polysemous words, additional context can be useful. Consider the parent comment this is a strange bug. In a programming subreddit, a sarcastic FTFY might be this is a strange feature. However, in a Pokémon subreddit, an FTFY might be this is a strange dinosaur in an argument over whether Armaldo is a bug or a dinosaur. We thus include additional features to be passed to the encoder at each time step, in the form of an embedding $g$ for each of the three category levels obtained in Section 3.3. These embeddings are concatenated to the input word $w_t$ at each timestep, i.e. $x_t = e(w_t, g_{t1}, g_{t2}, g_{t3})$.

4.3 Objective Function

We use a negative log likelihood objective function $L_{NLL} = -\sum_{t \in 1:T} \log P(w^*_t)$, where $w^*_t$ is the gold token at time $t$, normalized by each batch. We also include an additional loss term that uses the encoder hidden states to make a binary prediction over the input for whether a token will be copied or inserted/deleted. For the example from Section 4.2, the target for Hillary Clinton for president 2020 would be 0 0 1 1 0. This encourages the model to select features that indicate whether the encoder input will be copied to the output. We use a 2-layer multi-layer perceptron and a binary cross-entropy loss $L_{BCE}$. The joint loss is then:

$$L = L_{NLL} + \lambda L_{BCE}$$
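
A sketch of the joint objective in PyTorch, under our assumptions about tensor shapes (batch-first) and with 1 marking a copied input token as in the example above:

```python
import torch.nn.functional as F

def joint_loss(log_probs, gold_ids, copy_logits, copy_targets, lam=1.0):
    """L = L_NLL + lambda * L_BCE (Section 4.3).

    log_probs:    (batch, tgt_len, vocab) decoder log-probabilities
    gold_ids:     (batch, tgt_len) gold output token ids
    copy_logits:  (batch, src_len) per-input-token scores from the
                  2-layer MLP over the encoder hidden states
    copy_targets: (batch, src_len) 1 if the input token is copied to
                  the output, 0 if inserted/deleted (e.g. 0 0 1 1 0)
    """
    # negative log likelihood of the gold tokens, averaged over the batch
    nll = F.nll_loss(log_probs.transpose(1, 2), gold_ids, reduction='mean')
    # binary cross-entropy for the copy / insert-delete prediction
    bce = F.binary_cross_entropy_with_logits(copy_logits, copy_targets.float())
    return nll + lam * bce
```
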
4.4 Decoding

We use beam search for generation, as this method has proven effective for many neural language generation tasks. For the settings of the model that require a counter, we expand the beam by count $m$ so that for a beam size $k$ we calculate $k \cdot m$ states.

Filtering: We optionally include a constrained decoding mode where we filter the output based on the counter; when $c_t > 1$ the end-of-sentence (EOS) score is set to $-\infty$, and when $c_t = 0$ the score of any word $w \in V \setminus (S \cup I)$ is set to $-\infty$. The counter $c_t$ is decremented at every time step as in Section 4.2. In other words, when the counter is zero, we only allow the model to copy or generate stop words. When the counter is positive, we prevent the model from ending the sentence before it generates new content words and decrements the counter. The constrained decoding is possible with any combination of settings, with or without the counter embedding.
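
The filtering step amounts to masking next-token scores before the beam is expanded. A minimal per-beam sketch, assuming `allowed_ids` indexes the stop words and source input words ($S \cup I$), with EOS included among them so decoding can terminate:

```python
import torch

def constrain_scores(scores, c_t, eos_id, allowed_ids):
    """Mask next-token scores for one beam (Section 4.4).

    scores:      (vocab_size,) tensor of next-token scores
    c_t:         remaining budget of new content words
    eos_id:      index of the end-of-sentence token
    allowed_ids: indices of stop words and source input words (S u I),
                 assumed here to also contain eos_id
    """
    scores = scores.clone()
    if c_t > 1:
        scores[eos_id] = float('-inf')  # cannot stop before generating new content
    if c_t == 0:
        masked = torch.full_like(scores, float('-inf'))
        masked[allowed_ids] = scores[allowed_ids]  # only copying / stop words remain
        scores = masked
    return scores
```
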
                                                        kens are generated relative to the parent comment,
                                                        and partial match, a measure of whether the novel
4.5   Hyper-parameters and Optimization
                                                        tokens in the gold FTFY match any of the novel
We used Pytorch (Paszke et al., 2017) for all ex-       tokens in the generated FTFY. To provide a refer-
periments. We used 300-dimensional vectors for          ence point, we find that the partial match between
the word embedding and GRU layers. The count            two different gold FTFYs (Reddit and AMT) was
embedding dimension was set to 5 with m = 5             11.4% and BLEU was 47.28, which shows the dif-
and k = 10 for decoding. The category embed-            ficulty of automatic evaluation. The scores are
ding dimensions were set to 5, 10, and 25 for each      lower than expected because the Reddit FTFYs are
of the non-subreddit categories. We also set λ = 1      noisy due to the process in Section 3.2. This also
for multi-task learning. We used the Adam op-           justifies obtaining the AMT FTFYs.
timizer (Kingma and Ba, 2015) with settings of             Results are presented in Table 4. The baseline
β1 = 0.9, β2 = 0.999,  = 10−8 and a learning           is a sequence-to-sequence model with attention.
rate of 10−3 decaying by γ = 0.1 every epoch.           For other components, the counter embedding is
We used dropout (Srivastava et al., 2014) on the        referred to as “COUNT,” the category/subreddit
embeddings with a probability of 0.2 and teacher        embeddings as “SUB,” the sequence of edits as
forcing with 0.5. We used a batch size of 100 with      “EDIT,” and the multi-task copy loss as “COPY.”
10 epochs, selecting the best model on the devel-       The models in the top half of the table use con-
opment set based on perplexity. We set the mini-        strained decoding and those in the bottom half
mum frequency of a word in the vocabulary to 4.         are unconstrained, to show the learning capabil-

5 Results

For training, development, and testing we use the data described in Section 3.3. The test reference data consists of the Reddit FTFYs and the FTFYs generated from AMT. We evaluate our models using automated metrics and human judgments.

Automated metrics should reflect our joint goals of 1) copying necessary context and 2) making appropriate substitutions. To address point 1, we use BLEU-4 as a measure of similarity between the gold FTFY and the model output. As the FTFY may contain significant overlap with the parent, BLEU indicates how well the model copies the appropriate context. As BLEU reflects mostly span selection rather than the insertion of new content, we need alternative metrics to address point 2. However, addressing point 2 is more difficult due to the variety of possible substitutions, including named entities. For example, if the parent comment is jaguars for the win! and the gold FTFY is chiefs for the win! but the model produces cowboys for the win! (or any of 29 other NFL teams), most metrics would judge this response incorrect even though it would be an acceptable response. Thus we present results using both automated metrics and human evaluation. As an approximation to address point 2, we attempt to measure when the model is making changes rather than just copying the input. To this end, we present two additional metrics: novelty, a measure of whether novel content (non-stop word) tokens are generated relative to the parent comment, and partial match, a measure of whether the novel tokens in the gold FTFY match any of the novel tokens in the generated FTFY. To provide a reference point, we find that the partial match between two different gold FTFYs (Reddit and AMT) was 11.4% and BLEU was 47.28, which shows the difficulty of automatic evaluation. The scores are lower than expected because the Reddit FTFYs are noisy due to the process in Section 3.2. This also justifies obtaining the AMT FTFYs.
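
Our reading of these two metrics as per-example binary scores, averaged over the test set, is sketched below; this interpretation is an assumption on our part, not the released evaluation code.

```python
def content_words(tokens, stop_words):
    """Non-stop-word token set for a tokenized comment."""
    return {t for t in tokens if t not in stop_words}

def novelty(generated, parent, stop_words):
    """1 if the output contains any content token absent from the parent."""
    novel = content_words(generated, stop_words) - content_words(parent, stop_words)
    return int(bool(novel))

def partial_match(generated, gold, parent, stop_words):
    """1 if any novel gold token also appears among the novel generated tokens."""
    novel_gold = content_words(gold, stop_words) - content_words(parent, stop_words)
    novel_gen = content_words(generated, stop_words) - content_words(parent, stop_words)
    return int(bool(novel_gold & novel_gen))
```
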

Results are presented in Table 4. The baseline is a sequence-to-sequence model with attention. For other components, the counter embedding is referred to as "COUNT," the category/subreddit embeddings as "SUB," the sequence of edits as "EDIT," and the multi-task copy loss as "COPY." The models in the top half of the table use constrained decoding and those in the bottom half are unconstrained, to show the learning capabilities of the models.

                                                       Reddit              AMT
    Model                        Novelty   BLEU-4   % Match   BLEU-4   % Match
    Constrained
      Baseline                   79.88     18.81    4.67      40.14    10.06
      COUNT                      89.69     22.61    4.72      47.55    12.55
      COUNT + SUB + COPY         90.45     23.13    4.83      50.05    14.92
      EDIT                       64.64     16.12    3.37      35.48    7.33
      EDIT + COUNT + SUB + COPY  82.96     19.37    4.23      42.69    11.62
    Unconstrained
      Baseline                   3.34      7.31     0.73      25.83    0.68
      COUNT                      16.19     8.51     1.95      27.68    2.36
      COUNT + SUB + COPY         16.26     9.62     1.93      31.23    3.81
      EDIT                       7.97      35.41    1.57      74.24    1.56
      EDIT + COUNT + SUB + COPY  39.99     32.59    3.25      67.56    6.25

    Table 4: Automatic Evaluation

For each model we compute statistical significance with bootstrap resampling (Koehn, 2004) against the constrained or unconstrained baseline as appropriate, and we find the COUNT and EDIT models to be significantly better for constrained and unconstrained decoding, respectively (p < 0.005).

Under constrained decoding, we see that the "COUNT + SUB + COPY" model performs the best on all metrics, although most of the performance can be attributed to the count embedding. When we allow the model to determine its own output, we find that "EDIT + COUNT" performs the best. In particular, this model does well at understanding which part of the context to select, and even does better than other unconstrained models at selecting appropriate substitutions. However, when we combine this model with constrained decoding, the improvement is smaller than for the other settings. We suspect that because the EDIT model often needs to generate a DELETE-N token before a new response, these longer-term dependencies are hard to capture with constrained decoding but easier if included in training.

We also conducted a human evaluation of the model output on the same subset of 2,625 examples described in Section 3.3. We performed an additional experiment on AMT where we asked annotators to rate responses on fluency, coherence, and contrast. Fluency is a measure of the quality of the grammar and syntax and the likelihood that a native English speaker would utter that statement. Coherence is a measure of whether the response makes sense, is semantically meaningful, and would be usable as a response to a claim. Contrast is a measure of how much the response contradicts the original comment. We specified that if the response is different but does not provide a contrasting view it should receive a low rating. Previous work (Bilu et al., 2015) used fluency, clarity/usability (which we combine into coherence), and opposition (where we use contrast).

We used a Likert scale where 5 is strongly agree and 1 is strongly disagree. We used the same data and qualification test from Section 3.3 for each category and used three annotators per example. We asked annotators to judge 4 different pairs: 3 model outputs and the gold Reddit⁸ FTFYs for comparison. We include the baseline, the baseline with constrained decoding, and the best constrained model ("COUNT + SUB + COPY") according to BLEU and partial match. We verified that the annotators understood how to rate contrast by examining the distribution of responses: the annotators selected option 3 (neither) 15% of the time and preferred to select either extreme, 5 (21%) or 1 (27%). Results are presented in Table 5, showing a clear preference for the best model. Note the degradation in fluency for the constrained baseline, as the model is prevented from generating the EOS token and may repeat tokens up to the maximum length.

    Model         Fluency   Coherence   Contrast
    Gold          4.34      4.26        3.01
    Baseline      3.49      3.19        1.94
    Constrained   3.46      3.32        2.53
    Best          3.52      3.46        2.87

    Table 5: Human Evaluation

⁸ We did not evaluate the AMT FTFYs as these were generated by the same pool of annotators.

6 Qualitative Analysis

We provide three examples of the model output in Table 6, with the first and third from the News and Politics category, demonstrating how the model handles different types of input. In the first example, the contrast is between allowing markets to regulate themselves versus an increased role of government. In the second example, the contradiction is due to the choice of operating system. In the third (invalid) example, the model responds to a sarcastic claim with another right-wing news organization; this response is not a contradiction since it is plausible the original speaker would also utter this statement.

    Parent: ah yes the wonders of the free market
    Model: ah yes the wonders of government intervention

    Parent: i know that this is an unofficial mod , but xp is the best os for this machine
    Model: linux is the best os for this machine

    Parent: that 's why it 's important to get all your propaganda from infowars and brietbart
    Model: propaganda from fox news outlets

    Table 6: Model Output

6.1 Error Analysis

We conduct an error analysis by selecting 100 responses where the model did not partially match any of the 6 gold responses, and we found 6 main types of errors. One error is the model identifying an incorrect substitution span while the human responses all selected a different span to replace. We noticed that this occurred 5 times and may require world knowledge to understand which tokens to select. For example, in response to the claim Hillary Clinton could have been president if not for robots, the model generates Donald Trump in place of Hillary Clinton, whereas the gold responses generate humans / votes / Trump's tweets in place of robots. Another type of error is when the responses are not coherent with the parent and the language model instead determines the token selection based on the most recent context (11 cases). For example, given the claim bb-8 gets a girlfriend and poe still does n't have a girlf :') the Reddit FTFY has boyf instead of girlf, whereas the model generates ... and poe still does n't have a clue what i 'm talking about . We also found examples where the model chose poorly due to unfiltered jokes or errors in the training data (12 in total). In 15 cases, due to the constrained decoding, the model repeated a word until the maximum length or appended an incoherent phrase. For the most common error, the model made a substitution that was not contrasting, as in Table 6 (19 examples). Finally, we found 38 of the samples were valid responses but did not match the gold, indicating the difficulty of automatic evaluation. For example, in response to the claim Nintendo is the only company that puts customers over profits, the model replaces Nintendo with Rockstar (both video game companies) while the gold FTFYs had other video game companies.

7 Related Work

Understanding contrast and contradiction is key to argumentation as it requires an understanding of differing points-of-view. Recent work examined the negation of claims via explicit negation (Bilu et al., 2015). Other work investigated the detection of different points-of-view in opinionated text (Al Khatib et al., 2012; Paul et al., 2010). Wachsmuth et al. (2017; 2018) retrieved arguments for and against a particular stance using online debate forums. In non-argumentative text, researchers predicted contradictions for types such as negation, antonyms, phrasal, or structural (de Marneffe et al., 2008) or those that can be expressed with functional relations (Ritter et al., 2008). Other researchers have incorporated entailment models (Kloetzer et al., 2013) or crowdsourcing methods (Takabatake et al., 2015). Contradiction has also become a part of the natural language inference (NLI) paradigm, with datasets labeling contradiction, entailment, or neutral (Bowman et al., 2015a; Williams et al., 2018). The increase in resources with contrast and contradiction has resulted in new representations with contrastive meaning (Chen et al., 2015; Nguyen et al., 2016; Vulić, 2018; Conneau et al., 2017). Most of this work has focused on identifying contrast or contradiction, while we aim to generate contrast. Furthermore, while contradiction and contrast are present in these corpora, we obtain distant-labeled alignments for contrast at the word and phrase level. Our dataset also includes contrastive concepts and entities, while other corpora primarily contain antonyms and explicit negation.

Contrast also appears in the study of stance, where the opinion towards a target may vary. The SemEval 2016 Stance Detection for Twitter task (Mohammad et al., 2016) involved predicting if a tweet favors a target entity. The Interpretable Semantic Similarity task (Agirre et al., 2016) called to identify semantic relation types (including opposition) between headlines or captions. Target-specific stance prediction in debates is addressed by Hasan and Ng (2014) and Walker et al. (2012). Fact checking can be viewed as stance toward an event, resulting in research on social media (Lendvai and Reichel, 2016; Mihaylova et al., 2018), politician statements (Vlachos and Riedel, 2014), news articles (Pomerleau and Rao, 2017), and Wikipedia (Thorne et al., 2018).

In computational argumentation mining, identifying claims and other argumentative components is a well-studied task (Stab and Gurevych, 2014). Daxenberger et al. (2017) and Schulz et al. (2018) developed approaches to detect claims across diverse claim detection datasets. Recently, a shared task was developed for argument reasoning comprehension (Habernal et al., 2018). The best system (Choi and Lee, 2018) used models pre-trained on NLI data (Bowman et al., 2015b), which contains contradictions. While this work is concerned with identification of argumentative components, we propose to generate new claims.

In the field of argument generation, Wang and Ling (2016) train neural abstractive summarizers for opinions and arguments. Additional work involved generating opinions given a product rating (Wang and Zhang, 2017). Bilu and Slonim (2016) combine topics and predicates via a template-based classifier. This work involves the generation of claims but in relation to a topic. Other researchers generated political counter-arguments supported by external evidence (Hua and Wang, 2018) and argumentative dialogue by maximizing mutual information (Le et al., 2018). This research considers end-to-end argument generation, which may not be coherent, whereas we focus specifically on contrastive claims.

8 Conclusion

We presented a new source of over 1 million contrastive claim pairs that can be mined from social media sites such as Reddit. We provided an analysis and models to filter noisy training data from 49% down to 25%. We created neural models for generating contrastive claims and obtained significant improvement in automated metrics and human evaluations for Reddit and AMT test data.

Our goal is to incorporate this model into an argumentative dialogue system. In addition to generating claims with a contrasting view, we can also retrieve supporting evidence for the newly-generated claims. Additionally, we plan to experiment with using our model to improve claim detection (Daxenberger et al., 2017) and stance prediction (Bar-Haim et al., 2017). Our model could be used to generate artificial data to enhance classification performance on these tasks.

To improve our model, we plan to experiment with retrieval-based approaches to handle low-frequency terms and named entities, as sequence-to-sequence models are likely to have trouble in this environment. One possibility is to incorporate external knowledge with entity linking over Wikipedia articles to find semantically-relevant substitutions.

Another way to improve the model is by introducing controllable generation. One aspect of controllability is intention; our model produces contrastive claims without understanding the view of the original claim. Category embeddings partially address this issue (some labels are "Liberal" or "Conservative"), but labels are not available for all views. Going forward, we hope to classify the viewpoint of the original claim and then generate a claim with a desired orientation. Furthermore, we hope to improve on the generation task by identifying the types of claims we encounter. For example, we may want to change the target of some claims but the polarity of others.

We also plan to improve the dataset by improving our models for contrastive pair prediction to reduce noise. Finally, we hope that this dataset proves useful for related tasks such as textual entailment (providing examples of contradiction) and argument comprehension (providing counter-examples of arguments) or even unrelated tasks like humor or error correction.

Acknowledgments

We would like to thank the AMT annotators for their work and the anonymous reviewers for their valuable feedback. The authors also thank Alyssa Hwang for providing annotations and Smaranda Muresan, Chris Kedzie, Jessica Ouyang, and Elsbeth Turcan for their helpful comments.

                                                  1764
References                                                   In Proceedings of the 53rd Annual Meeting of the
                                                             Association for Computational Linguistics and the
Eneko Agirre, Aitor Gonzalez-Agirre, Inigo Lopez-            7th International Joint Conference on Natural Lan-
  Gazpio, Montse Maritxalar, German Rigau, and Lar-          guage Processing (Volume 1: Long Papers), pages
  raitz Uria. 2016. Semeval-2016 task 2: Interpretable       106–115. Association for Computational Linguis-
  semantic textual similarity. In Proceedings of the         tics.
  10th International Workshop on Semantic Evalua-
  tion (SemEval-2016), pages 512–524. Association         Kyunghyun Cho, Bart van Merrienboer, Caglar Gul-
  for Computational Linguistics.                            cehre, Dzmitry Bahdanau, Fethi Bougares, Holger
                                                            Schwenk, and Yoshua Bengio. 2014. Learning
Khalid Al Khatib, Hinrich Schütze, and Cathleen Kant-      phrase representations using rnn encoder–decoder
  ner. 2012. Automatic detection of point of view dif-      for statistical machine translation. In Proceedings of
  ferences in wikipedia. In Proceedings of COLING           the 2014 Conference on Empirical Methods in Nat-
  2012, pages 33–50. The COLING 2012 Organizing             ural Language Processing (EMNLP), pages 1724–
  Committee.                                                1734, Doha, Qatar. Association for Computational
                                                            Linguistics.
Denis Apothloz, Pierre-Yves Brandt, and Gustavo
  Quiroz. 1993. The function of negation in argumen-      HongSeok Choi and Hyunju Lee. 2018. Gist at
  tation. Journal of Pragmatics, 19:23–38.                  semeval-2018 task 12: A network transferring infer-
                                                            ence knowledge to argument reasoning comprehen-
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-            sion task. In Proceedings of The 12th International
  gio. 2015. Neural machine translation by jointly          Workshop on Semantic Evaluation, pages 773–777.
  learning to align and translate. In 3rd International     Association for Computational Linguistics.
  Conference on Learning Representations.
                                                          Alexis Conneau, Douwe Kiela, Holger Schwenk, Loı̈c
Roy Bar-Haim, Indrajit Bhattacharya, Francesco Din-         Barrault, and Antoine Bordes. 2017. Supervised
  uzzo, Amrita Saha, and Noam Slonim. 2017. Stance          learning of universal sentence representations from
  classification of context-dependent claims. In Pro-       natural language inference data. In Proceedings of
  ceedings of the 15th Conference of the European           the 2017 Conference on Empirical Methods in Nat-
  Chapter of the Association for Computational Lin-         ural Language Processing, pages 670–680. Associ-
  guistics: Volume 1, Long Papers, pages 251–261,           ation for Computational Linguistics.
  Valencia, Spain. Association for Computational Lin-
  guistics.                                               Johannes Daxenberger, Steffen Eger, Ivan Habernal,
                                                            Christian Stab, and Iryna Gurevych. 2017. What is
Yonatan Bilu, Daniel Hershcovich, and Noam Slonim.          the essence of a claim? cross-domain claim iden-
  2015. Automatic claim negation: Why, how and              tification. In Proceedings of the 2017 Conference
  when. In Proceedings of the 2nd Workshop on Ar-           on Empirical Methods in Natural Language Pro-
  gumentation Mining, pages 84–93. Association for          cessing, pages 2055–2066. Association for Compu-
  Computational Linguistics.                                tational Linguistics.
Yonatan Bilu and Noam Slonim. 2016. Claim synthe-         Trudy Govier. 2010. A Practical Study of Argument.
  sis via predicate recycling. In Proceedings of the        Cengage Learning, Wadsworth.
  54th Annual Meeting of the Association for Compu-
  tational Linguistics (Volume 2: Short Papers), pages    Ivan Habernal, Henning Wachsmuth, Iryna Gurevych,
  525–530, Berlin, Germany. Association for Compu-           and Benno Stein. 2018. The argument reasoning
  tational Linguistics.                                      comprehension task: Identification and reconstruc-
                                                             tion of implicit warrants. In Proceedings of the
Samuel R. Bowman, Gabor Angeli, Christopher Potts,           2018 Conference of the North American Chapter of
  and Christopher D. Manning. 2015a. A large an-             the Association for Computational Linguistics: Hu-
  notated corpus for learning natural language infer-        man Language Technologies, Volume 1 (Long Pa-
  ence. In Proceedings of the 2015 Conference on             pers), pages 1930–1940. Association for Computa-
  Empirical Methods in Natural Language Process-             tional Linguistics.
  ing, pages 632–642. Association for Computational
  Linguistics.                                            Kazi Saidul Hasan and Vincent Ng. 2014. Why are
                                                            you taking this stance? identifying and classifying
Samuel R. Bowman, Gabor Angeli, Christopher Potts,          reasons in ideological debates. In Proceedings of
  and Christopher D. Manning. 2015b. A large an-            the 2014 Conference on Empirical Methods in Natu-
  notated corpus for learning natural language infer-       ral Language Processing (EMNLP), pages 751–762.
  ence. In Proceedings of the 2015 Conference on            Association for Computational Linguistics.
  Empirical Methods in Natural Language Processing
  (EMNLP). Association for Computational Linguis-         Xinyu Hua and Lu Wang. 2018. Neural argument
  tics.                                                     generation augmented with externally retrieved evi-
                                                            dence. In Proceedings of the 56th Annual Meeting of
Zhigang Chen, Wei Lin, Qian Chen, Xiaoping Chen,            the Association for Computational Linguistics (Vol-
  Si Wei, Hui Jiang, and Xiaodan Zhu. 2015. Re-             ume 1: Long Papers), pages 1–10, Melbourne, Aus-
  visiting word embedding for contrasting meaning.          tralia. Association for Computational Linguistics.

                                                      1765
Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1328–1338, Austin, Texas. Association for Computational Linguistics.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations.

Julien Kloetzer, Stijn De Saeger, Kentaro Torisawa, Chikara Hashimoto, Jong-Hoon Oh, Motoki Sano, and Kiyonori Ohtake. 2013. Two-stage method for large-scale acquisition of contradiction pattern pairs using entailment. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 693–703. Association for Computational Linguistics.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP 2004, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.

Dieu-Thu Le, Cam Tu Nguyen, and Kim Anh Nguyen. 2018. Dave the debater: A retrieval-based and generative argumentative dialogue agent. In Proceedings of the 5th Workshop on Argument Mining, pages 121–130. Association for Computational Linguistics.

Piroska Lendvai and Uwe Reichel. 2016. Contradiction detection for rumorous claims. In Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM), pages 31–40. The COLING 2016 Organizing Committee.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany. Association for Computational Linguistics.

Marie-Catherine de Marneffe, Anna N. Rafferty, and Christopher D. Manning. 2008. Finding contradictions in text. In Proceedings of ACL-08: HLT, pages 1039–1047, Columbus, Ohio. Association for Computational Linguistics.

Tsvetomila Mihaylova, Preslav Nakov, Lluís Màrquez, Alberto Barrón-Cedeño, Mitra Mohtarami, Georgi Karadzhov, and James R. Glass. 2018. Fact checking in community forums. CoRR, abs/1803.03178.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 31–41. Association for Computational Linguistics.

Elena Musi. 2018. How did you change my view? A corpus-based study of concessions' argumentative role. Discourse Studies, 20(2):270–288.

Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 454–459. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Michael Paul, ChengXiang Zhai, and Roxana Girju. 2010. Summarizing contrastive viewpoints in opinionated text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 66–76. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Dean Pomerleau and Delip Rao. 2017. Fake news challenge.

Alan Ritter, Stephen Soderland, Doug Downey, and Oren Etzioni. 2008. It's a contradiction – no, it's not: A case study using functional relations. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 11–20. Association for Computational Linguistics.

E. Schiappa and J. P. Nordin. 2013. Argumentation: Keeping Faith with Reason. Pearson Education.

Allen Schmaltz, Yoon Kim, Alexander Rush, and Stuart Shieber. 2017. Adapting sequence models for sentence correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2807–2813, Copenhagen, Denmark. Association for Computational Linguistics.
Claudia Schulz, Steffen Eger, Johannes Daxenberger, Tobias Kahse, and Iryna Gurevych. 2018. Multi-task learning for argumentation mining in low-resource settings. CoRR, abs/1804.04083.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Christian Stab and Iryna Gurevych. 2014. Identifying argumentative discourse structures in persuasive essays. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 46–56, Doha, Qatar. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Yu Takabatake, Hajime Morita, Daisuke Kawahara, Sadao Kurohashi, Ryuichiro Higashinaka, and Yoshihiro Matsuo. 2015. Classification and acquisition of contradictory event pairs using crowdsourcing. In Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation, pages 99–107. Association for Computational Linguistics.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819. Association for Computational Linguistics.

Stephen Toulmin. 1958. The Uses of Argument. Cambridge University Press.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22. Association for Computational Linguistics.

Ivan Vulić. 2018. Injecting lexical contrast into word vectors by guiding vector space specialisation. In Proceedings of the Third Workshop on Representation Learning for NLP, pages 137–143. Association for Computational Linguistics.

Henning Wachsmuth, Martin Potthast, Khalid Al Khatib, Yamen Ajjour, Jana Puschmann, Jiani Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff, and Benno Stein. 2017. Building an argument search engine for the web. In Proceedings of the 4th Workshop on Argument Mining, pages 49–59, Copenhagen, Denmark. Association for Computational Linguistics.

Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018. Retrieval of the best counterargument without prior topic knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 241–251. Association for Computational Linguistics.

Marilyn Walker, Pranav Anand, Rob Abbott, and Ricky Grant. 2012. Stance classification using dialogic properties of persuasion. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 592–596. Association for Computational Linguistics.

Lu Wang and Wang Ling. 2016. Neural network-based abstract generation for opinions and arguments. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 47–57, San Diego, California. Association for Computational Linguistics.

Zhongqing Wang and Yue Zhang. 2017. Opinion recommendation using a neural model. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1627–1638. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.