Robust Argument Unit Recognition and Classification


Dietrich Trautmann†, Johannes Daxenberger‡, Christian Stab‡, Hinrich Schütze†, Iryna Gurevych‡
†Center for Information and Language Processing (CIS), LMU Munich, Germany
‡Ubiquitous Knowledge Processing Lab (UKP-TUDA), TU Darmstadt, Germany
dietrich@trautmann.me; inquiries@cislmu.org
http://www.ukp.tu-darmstadt.de

Abstract

Argument mining is generally performed on the sentence-level – it is assumed that an entire sentence (not parts of it) corresponds to an argument. In this paper, we introduce the new task of Argument unit Recognition and Classification (ARC). In ARC, an argument is generally a part of a sentence – a more realistic assumption since several different arguments can occur in one sentence and longer sentences often contain a mix of argumentative and non-argumentative parts. Recognizing and classifying the spans that correspond to arguments makes ARC harder than previously defined argument mining tasks. We release ARC-8, a new benchmark for evaluating the ARC task. We show that token-level annotations for argument units can be gathered using scalable methods. ARC-8 contains 25% more arguments than a dataset annotated on the sentence-level would. We cast ARC as a sequence labeling task, develop a number of methods for ARC sequence tagging and establish the state of the art for ARC-8. A focus of our work is robustness: both robustness against errors in sentence identification (which are frequent for noisy text) and robustness against divergence in training and test data.

Topic: Death Penalty
    "It does not deter crime and" [CON] "it is extremely expensive to administer." [CON]

Topic: Gun Control
    "Yes, guns can be used for protection" [CON] "but laws are meant to protect us, too." [PRO]

Figure 1: Examples of sentences with two arguments as well as with annotated spans and stances.

1 Introduction

Argument mining (Peldszus and Stede, 2013) has gained substantial attention from researchers in the NLP community, mostly due to its complexity as a task requiring sophisticated reasoning, but also due to the availability of high-quality resources. Those resources include discourse-level closed-domain datasets for political, educational or legal applications (Walker et al., 2012; Stab and Gurevych, 2014; Wyner et al., 2010), as well as open-domain datasets for topic-dependent argument retrieval from heterogeneous sources (Stab et al., 2018b; Shnarch et al., 2018). While discourse-level argument mining aims to parse argumentative structures in a fine-grained manner within single documents (thus, mostly in single domains or applications), topic-dependent argument retrieval focusses on argumentative constructs such as claims or evidences with regard to a given topic that can be found in very different types of discourse. Argument retrieval typically frames the argumentative unit (argument, claim, evidence etc.) on the level of the sentence, i.e., it seeks to detect sentences that are relevant supporting (PRO) or opposing (CON) arguments as in the examples given in Fig. 1.

In this work, we challenge the assumption that arguments should be detected on the sentence-level. This is partly justified by the difficulty of "unitizing", i.e., of segmenting a sentence into meaningful units for argumentation tasks (Stab et al., 2018b; Miller et al., 2019). We show that reframing the argument retrieval task as Argument unit Recognition and Classification (ARC), i.e., as recognition and classification of spans within a sentence on the token-level, is feasible, not just in terms of the reliability of recognizing argumentative spans, but also in terms of the scalability of generating training data.
Framing argument retrieval as in ARC, i.e., on the token-level, has several advantages:

• It prevents merging otherwise separate arguments into a single argument (e.g., for the topic death penalty in Fig. 1).

• It can handle two-sided argumentation adequately (e.g., for the topic gun control in Fig. 1).

• It can be framed as a sequence labeling task, which is a common scenario for many NLP applications with many available architectures for experimentation (Eger et al., 2017).

To address the feasibility of ARC, we will address the following questions. First, we discuss how to select suitable data for annotating arguments on the token-level. Second, we analyze whether the annotation of arguments on the token-level can be reliably conducted with trained experts, as well as with untrained workers in a crowdsourcing setup. Third, we test a few basic as well as state-of-the-art sequence labeling methods on ARC.

A focus of our work is robustness. (i) The assumption that arguments correspond to complete sentences makes argument mining brittle – when the assumption is not true, then sentence-level argument mining makes mistakes. In addition, sentence identification is error-prone for noisy text (e.g., text crawled from the web), resulting in noisy non-sentence units being equated with arguments. (ii) The properties of argument topics vary considerably from topic to topic. An ARC method trained on one topic will not necessarily perform well on another. We set ARC-8 up to make it easy to test the robustness of argument mining by including a cross-domain split and demonstrate that cross-domain generalization is challenging for ARC-8.

2 Related Work

Our work follows the established line of work on argument mining in the NLP community, which can loosely be divided into approaches detecting and classifying arguments on the discourse level (Palau and Moens, 2009; Stab and Gurevych, 2014; Eger et al., 2017) and ones focusing on topic-dependent argument retrieval (Levy et al., 2014; Wachsmuth et al., 2017; Hua and Wang, 2017; Stab et al., 2018b). Our work is in line with the latter: we model arguments as self-contained pieces of information which can be verified as relevant arguments for a given topic with no or minimal surrounding context.

As one of the main contributions of this work, we show how to create training data for token-level argument mining with the help of crowdsourcing. Stab et al. (2018b) and Shnarch et al. (2018) annotated topic-dependent arguments on the sentence-level using crowdsourcing. The reported Fleiss κ agreement scores were 0.45 in Shnarch et al. (2018) for crowd workers and 0.72 in Stab et al. (2018b) for experts. Miller et al. (2019) present a multi-step approach to crowdsource more complex argument structures in customer reviews. Like us, they annotate arguments on the token-level – however, they annotate argument components from the discourse-level perspective. Their inter-annotator agreement (α_u roughly between 0.4 and 0.5) is low, demonstrating the difficulty of this task. In this work, to capture argument spans more precisely, we test the validity of arguments using a slot filling approach. Reisert et al. (2018) also use argument templates, i.e., slots, to determine arguments.

Close to the spirit of this work, Ajjour et al. (2017) compare various argumentative unit segmentation approaches on the token-level across three corpora. They use a feature-based approach and various architectures for segmentation and find that BiLSTMs work best on average. However, as opposed to this work, they study argumentation on the discourse level, i.e., they do not consider topic-dependency and only account for arguments and non-arguments (no argumentative types or relations like PRO and CON). Eger et al. (2017) model discourse-level argument segmentation, identification (claims, premises and major claims) and relation extraction as sequence tagging, dependency parsing and entity-relation extraction. For a dataset of student essays (Stab and Gurevych, 2014), they find that sequence tagging and an entity-relation extraction approach (Miwa and Bansal, 2016) work best. In particular, for the unit segmentation task (vanilla BIO), they find that state-of-the-art sequence tagging approaches can perform as well as or even better than human experts. Stab and Gurevych (2017) propose a CRF-based approach with manually defined features for the unit segmentation task on student essays (Stab and Gurevych, 2014) and also achieve performance close to human experts.
3 Corpus Creation

Collecting annotations on the token-level is challenging. First, the unit of annotation needs to be clearly defined. This is straightforward for tasks with short spans (sequences of words) such as named entities, but much harder for longer spans – as in the case of argument units. Second, labels from multiple annotators need to be merged into a single gold standard.[1] This is also more difficult for longer sequences because simple majority voting over individual words will likely create invalid (e.g., disrupted or grammatically incorrect) spans.

[1] One could also learn from "soft" labels, i.e., a distribution created from the votes of multiple annotators. However, this does not solve the problem that some annotators deliver low quality work and their votes should be outscored by a (hopefully) higher-quality majority of annotators.

To address these challenges, we carefully designed the selection of sources, the sampling and the annotation of input for ARC-8, our novel argument unit dataset. We first describe how we retrieved and processed data from a large web crawl. Next, we outline the sentence sampling process that accounts for a balanced selection of both (non-)argument types and source documents. Finally, we describe how we crowdsource annotations of argument units within sentences in a scalable way.

3.1 Data Source

We used the February 2016 Common Crawl archive (http://commoncrawl.org/2016/02/february-2016-crawl-archive-now-available/), which was indexed with Elasticsearch (https://www.elastic.co/products/elasticsearch) following the description in Stab et al. (2018a). For the sake of comparability, we adopt Stab et al. (2018b)'s eight topics (cf. Table 1). The topics are general enough to have good coverage in Common Crawl. They are also of a controversial nature and hence a potentially good choice for argument mining, with an expected broad set of supporting and opposing arguments.

3.2 Retrieval Pipeline

For document retrieval, we queried the indexed data for Stab et al. (2018b)'s topics and collected the first 500 results per topic ordered by their document score (doc score) from Elasticsearch; a higher doc score indicates higher relevance for the topic. Each document was checked for its corresponding WARC file at the Common Crawl Index (http://index.commoncrawl.org/CC-MAIN-2016-07). We then downloaded and parsed the original HTML document for the next steps of our pipeline; this ensures reproducibility. Following this, we used justext (http://corpus.tools/wiki/Justext) to remove HTML boilerplate. The resulting document was segmented into separate sentences as well as, within a sentence, into single tokens using spaCy (https://spacy.io/). We only consider sentences with a number of tokens in the range [3, 45].
3.3 Sentence Sampling

The sentences were pre-classified with a sentence-level argument mining model following Stab et al. (2018b), available via the ArgumenText Classify API (https://api.argumentsearch.com/en/doc). The API returns for each sentence (i) an argument confidence score arg score in [0.0, 1.0) (we discard sentences with arg score < 0.5), (ii) the stance on the sentence-level (PRO or CON) and (iii) the stance confidence score stance score. This information was used together with the doc score to rank sentences for selection in the following crowd annotation process. First, all three scores (for documents, arguments and stance confidence) were normalized over the range of available sentences and then summed up to create a rank for each sentence (see Eq. 1), with d_i, a_i and s_i being the ranks of the document, argument and stance confidence scores, respectively.

    rank_i = d_i + a_i + s_i        (1)

The ranked sentences were divided by topic and the pre-classified stance on the sentence-level and ordered by rank (where a lower rank indicates a better candidate). We then went down the ranked list and selected each sentence with a probability of p = 0.5 until the target size of n = 500 per stance and topic was reached; otherwise we did additional passes through the list. Table 1 gives data set creation statistics.
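Reading Eq. 1 as a sum of per-score rank positions, the ranking and the probabilistic downsampling can be sketched as follows; tie handling and the exact normalization are assumptions, since they are not fully specified above:

    import random

    def rank_positions(scores):
        """Rank position of each score within the candidate pool (0 = highest score)."""
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        positions = [0] * len(scores)
        for pos, idx in enumerate(order):
            positions[idx] = pos
        return positions

    def sample_sentences(candidates, target_size=500, p=0.5, seed=0):
        """candidates: dicts with doc_score, arg_score and stance_score for one
        topic/stance bucket (sentences with arg_score < 0.5 already discarded).
        Rank by Eq. (1) and walk down the list, keeping each sentence with
        probability p, with additional passes until target_size is reached."""
        d = rank_positions([c["doc_score"] for c in candidates])
        a = rank_positions([c["arg_score"] for c in candidates])
        s = rank_positions([c["stance_score"] for c in candidates])
        ranked = sorted(range(len(candidates)), key=lambda i: d[i] + a[i] + s[i])
        rng, chosen = random.Random(seed), set()
        target = min(target_size, len(candidates))
        while len(chosen) < target:
            for i in ranked:                      # additional passes if needed
                if i not in chosen and rng.random() < p:
                    chosen.add(i)
                    if len(chosen) == target:
                        break
        return [candidates[i] for i in ranked if i in chosen]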
#    topic                    #docs   #text   #sentences   #candidates   #final   #arg-sent.        #arg-segm.   #non-arg
    T1   abortion                  491     454       39,083         3,282    1,000          424     472 (+11.32%)        576
    T2   cloning                   495     252       30,504         2,594    1,000          353     400 (+13.31%)        647
    T3   marijuana legalization    490     472       45,644         6,351    1,000          630     759 (+20.48%)        370
    T4   minimum wage              494     479       43,128         8,290    1,000          630     760 (+20.63%)        370
    T5   nuclear energy            491     470       43,576         5,056    1,000          623     726 (+16.53%)        377
    T6   death penalty             491     484       32,253         6,079    1,000          598     711 (+18.90%)        402
    T7   gun control               497     479       38,443         4,576    1,000          529     624 (+17.96%)        471
    T8   school uniforms           495     475       40,937         3,526    1,000          713     891 (+24.96%)        287
         total                    3,944   3,565     314,568        39,754    8,000        4,500    5,343 (+18.73%)     3,500

Table 1: Number of documents and sentences in the selection process and the final corpus size; arg-sent. is the number of argumentative sentences; arg-segm. is the number of argumentative segments; the percentage value is the relative increase of argumentative segments over argumentative sentences.

3.4 Crowd Annotations

The goal of this work was to come up with a scalable approach to annotate argument units on the token-level. Given that arguments need to be annotated with regard to a specific topic, large amounts of (cross-topic) training data need to be created. As has been shown by previous work on topic-dependent argument mining (Shnarch et al., 2018; Stab et al., 2018b), crowdsourcing can be used to obtain reliable annotations for argument mining datasets. However, as outlined above, token-level annotation significantly increases the difficulty of the annotation task, so it was unclear whether agreement among untrained crowd workers would be sufficiently high.

We use the α_u agreement measure of Krippendorff et al. (2016) in this work. It is designed for annotation tasks that involve unitizing textual continua – i.e., segmenting continuous text into meaningful subunits – and measuring chance-corrected agreement in those tasks. It is also a good fit for argument spans within a sentence: typically these spans are long and the context is a single sentence that may contain any type of argument and any number of arguments. Krippendorff et al. (2016) define a family of α-reliability coefficients that improve upon several weaknesses of previous α measures. From these, we chose the α_u^nom coefficient, which also takes into account agreement on "blanks" (non-arguments in our case). The rationale behind this was that ignoring agreement on sentences without any argument spans would over-proportionally penalize disagreement in sentences that contain arguments while ignoring agreement in sentences without arguments.

To determine agreement, we initially carried out an in-house expert study with three graduate employees (who were trained on the task beforehand) and randomly sampled 160 sentences (10 per topic and stance) from the overall data. In the first round, we did not impose any restrictions on the span of words to be selected, other than that the selected span should be the shortest self-contained span that forms an argument. This resulted in unsatisfying agreement (α_u^nom = 0.51, average over topics), one reason being inconsistency in selecting argument spans (the median length of arguments ranged from nine to 16 words among the three experts). In a second round, we therefore decided to restrict the spans that could be selected by applying a slot filling approach that enforces valid argument spans that match a template. We use the template: "<TOPIC> should be supported/opposed, because <argument span>". The guidelines specify that the resulting sentence had to be a grammatically sound statement. Although this choice unsurprisingly increased the length of spans and reduced the total number of arguments selected, it increased the consistency of spans substantially (the median length per expert now ranged from 15 to 17 words). Furthermore, the agreement between the three experts rose to α_u^nom = 0.61 (average over topics). Compared to other studies on token-level argument mining (Eckle-Kohler et al., 2015; Li et al., 2017; Stab and Gurevych, 2014), this score is in an acceptable range and we deem it sufficient to proceed with crowdsourcing.

In our crowdsourcing setup, workers could select one or multiple spans, where each span's permissible length is between one token and the entire sentence. Workers had to either choose at least one argument span and its stance (supporting/opposing), or select that the sentence did not contain a valid argument and instead solve a simple math problem. We introduced further quality control measures in the form of a qualification test and periodic attention checks.[8]

[8] Workers had to be located in the US, CA, AU, NZ or GB, with an acceptance rate of 95% or higher. Payment was $0.42 per HIT, corresponding to US federal minimum wage ($7.25/hour). The annotators in the expert study were salaried research staff.
On an initial batch of 160 sentences, we collected votes from nine workers. To determine the optimal number of workers for the final study, we did majority voting on the token-level (ties broken as non-arguments) for both the expert study and the workers from the initial crowd study. We artificially reduced the number of workers (1-9) and calculated the percentage overlap averaged across all worker combinations (for worker numbers lower than 9). Whereas the overlap was highest with 80.2% at nine votes, it only dropped to 79.5% for five votes (and decreased more significantly for fewer votes). We deemed five votes to be an acceptable compromise between quality and cost. The agreement with experts in the five-worker setup is α_u^nom = 0.71, which is substantial (Landis and Koch, 1977).

The final gold standard labels on the 8000 sampled sentences were determined using a variant of Bayesian Classifier Combination (Kim and Ghahramani, 2012), referred to as IBCC in Simpson and Gurevych (2018)'s modular framework for Bayesian aggregation of sequence labels. This method has been shown to yield results superior to majority voting or MACE (Hovy et al., 2013).

3.5 Dataset Splits

We create two different dataset splits. (i) An in-domain split. This lets us evaluate how models perform on known vocabulary and data distributions. (ii) A cross-domain split. This lets us evaluate how well a model generalizes to unseen topics and distributions different from the training set. In the cross-domain setup, we defined topics T1-T5 to be in the train set, topic T6 in the development set and topics T7 and T8 in the test set. For the in-domain setup, we excluded topics T7 and T8 (the cross-domain test set), and used the first 70% of the topics T1-T6 for train, the next 10% for dev and the remaining 20% for test. The samples from the in-domain test set were also excluded from the cross-domain train and development sets. As a result, there are 4000 samples in train, 800 in dev and 2000 in test for the cross-domain split; and 4200 samples in train, 600 in dev and 1200 in test for the in-domain split. We work with two different splits so as to guarantee that train/dev sets (in-domain or cross-domain) do not overlap with test sets (in-domain or cross-domain). The assignment of sentences to the two splits is released as part of ARC-8.

3.6 Dataset Statistics

The resulting data set, ARC-8 (to be made available at www.ukp.tu-darmstadt.de/data), consists of 8000 annotated sentences with 3500 (43.75%) being non-argumentative. The 4500 argumentative sentences are divided into 1951 (43.36%) sentences with a single pro argument, 1799 (39.98%) sentences with a single contra argument, and the remaining 750 (16.67%) sentences with various combinations of supporting (PRO) and opposing (CON) arguments, with up to five single argument segments in a sentence. Thus, the token-level annotation leads to a higher (+18.73%) total count of arguments of 5343, compared to 4500 with a sentence-level approach. If we propagate the label of a sentence to all its tokens, then 100% of the tokens of argumentative sentences are argumentative. This ratio drops to 69.94% in our token-level setup, reducing the amount of non-argumentative tokens otherwise incorrectly selected as argumentative in a sentence.

4 Methods

We model ARC as a sequence labeling task. The input is a topic t and a sentence S = w_1...w_n. The goal is to select 0 ≤ k ≤ n spans of words, each of which corresponds to an argument unit A = w_j...w_m, with 1 ≤ j ≤ m ≤ n. Following Stab et al. (2018b), we distinguish between PRO and CON (t should be supported/opposed, because A) arguments. To measure the difficulty of the ARC task, we estimate the performance of simple baselines as well as current models in NLP that achieve state-of-the-art results on other sequence labeling data sets (Devlin et al., 2018).

4.1 1-class Baselines

The 1-class baseline labels the data set completely (i.e., for each S the entire sequence w_1...w_n) with one of the three labels PRO, CON and NON.

4.2 Sentence-Level Baselines

As the sentence-level baseline, we used labels produced by the previously mentioned ArgumenText Classify API from Stab et al. (2018a). Since it is a sentence-level classifier, we projected the sentence-level prediction onto all of the tokens in a sequence to enable token-level evaluation.
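Both baselines amount to projecting a single label onto the whole token sequence; a minimal sketch:

    def one_class_baseline(tokens, label):
        """1-class baseline: tag every token with the same label (PRO, CON or NON)."""
        return [label] * len(tokens)

    def sentence_level_baseline(tokens, sentence_label):
        """Sentence-level baseline: project a sentence-level prediction
        (e.g. from an external sentence classifier) onto all tokens so that
        the result can be scored with token-level metrics."""
        return [sentence_label] * len(tokens)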
4.3 BERT

Furthermore, we used the BERT base (cased) model (Devlin et al., 2018; https://github.com/huggingface/pytorch-pretrained-BERT) as a recent state-of-the-art model which achieved impressive results on many tasks including sequence labeling. For this model we considered two scenarios. First, we kept the pre-trained parameters frozen and used the model as a feature extractor. Second, we fine-tuned the parameters for the ARC task with the corresponding tagsets.

5 Experiments

In total, we run three different experiments on the ARC-8 dataset with the previously introduced models, which we describe in this section. Additionally, we experimented with different tagsets for the ARC task. All experiments were conducted on a single GPU with 11 GB memory.

5.1 1-class Baselines

For the simple baselines, we applied 1-class sequence tagging on the corresponding development and test sets for the in-domain and cross-domain setups. This allowed us to estimate the expected lower bounds for more complex models.

5.2 Token- vs. Sentence-Level

To further investigate the performance of a token-level model vs. a sentence-level model, we run four different training procedures and evaluate the results on both the token- and the sentence-level. First, we train models on the token-level (sequence labeling) and also evaluate on the token-level. Second, we train a model on the sentence-level (as a text classification task) and project the predictions to all tokens of the sentence, which we then compare to the token-level labels of the gold standard. Third, we train models on the token-level and aggregate a sentence-level score from the predicted scores, which we evaluate against an aggregated sentence-level gold standard. Finally, in the last of these experiments, we train a model on the sentence-level and compare it against the aggregated sentence-level gold standard. In the latter two cases, we aggregate on the sentence-level as follows: for each sentence, all occurrences of the possible label types are counted. If there is only one label type, the sentence is labeled with it. Otherwise, if the NON label occurs with only one other label (PRO or CON), then the NON label is omitted and the sentence is labeled with the remaining label. In all other cases, a majority vote determines the final sentence label, or, in the case of ties, the NON label is assigned.
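The aggregation rule just described can be written down directly. The following is a minimal sketch that assumes the token labels have already been reduced to PRO/CON/NON:

    from collections import Counter

    def aggregate_sentence_label(token_labels):
        """Map the token-level PRO/CON/NON tags of one sentence to a single
        sentence label, following the rules described above."""
        counts = Counter(token_labels)
        kinds = set(counts)
        if len(kinds) == 1:                     # only one label type present
            return token_labels[0]
        if kinds == {"NON", "PRO"}:             # NON plus exactly one other label
            return "PRO"
        if kinds == {"NON", "CON"}:
            return "CON"
        ranked = counts.most_common()           # otherwise: majority vote
        if ranked[0][1] == ranked[1][1]:        # tie -> NON
            return "NON"
        return ranked[0][0]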
5.3 Sequence Labeling with Different Tagsets

In the sequence labeling experiments with the new ARC-8 data set, we investigate the performance of BERT (cf. Section 4.3). The base scenario uses the three labels PRO, CON and NON (TAGS=3), but we also use two extended label sets. In one of them, we extend the PRO and CON labels with BI tags (TAGS=5), with B marking the beginning of a segment and I a within-segment token, resulting in the tags B-PRO, I-PRO, B-CON, I-CON and NON. The other extension uses BIES tags, where we add E for the end of a segment and S for single-unit segments (TAGS=9), resulting in the following tag set: B-PRO, I-PRO, E-PRO, S-PRO, B-CON, I-CON, E-CON, S-CON and NON.
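The extended tagsets can be derived mechanically from the three-label annotation; a minimal sketch (it assumes that two adjacent argument units of the same stance do not occur without a separating NON token):

    def to_bio(labels, scheme="BIO"):
        """Expand token-level PRO/CON/NON labels into BIO (TAGS=5) or
        BIOES (TAGS=9) tags; NON tokens stay unchanged."""
        tags, n = [], len(labels)
        for i, lab in enumerate(labels):
            if lab == "NON":
                tags.append("NON")
                continue
            starts = i == 0 or labels[i - 1] != lab
            ends = i == n - 1 or labels[i + 1] != lab
            if scheme == "BIOES":
                prefix = "S" if starts and ends else "B" if starts else "E" if ends else "I"
            else:
                prefix = "B" if starts else "I"
            tags.append(f"{prefix}-{lab}")
        return tags

    # e.g. to_bio(["NON", "PRO", "PRO", "NON", "CON"], scheme="BIOES")
    # -> ["NON", "B-PRO", "E-PRO", "NON", "S-CON"]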
5.4 Adding Topic Information

The methods described so far do not use topic information. We also test methods for ARC that make use of topic information. In the first scenario, we simply add the topic information to the labels, resulting in 25 TAGS (2 span tags (B and I) × 2 stances (PRO and CON) × 6 in-domain topics (T1-T6), plus the NON label; for example B-PRO-CLONING). In the scenario "TAGS=25++", in addition to the TAGS=25 setup, we add the topic at the beginning of a sequence. Additionally, in the TAGS=25++ scenario, we add all sentences of the other topics as negative examples to the training set, with all labels set to NON. For example, a sentence with PRO tokens for the topic CLONING was added as is (argumentative, for CLONING) and as non-argumentative for the other five topics. Since all the topics need to be known beforehand, this is done only on the in-domain datasets. This last experiment investigates whether the model is able to learn the topic-dependency of argument units.
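A possible way to fold the topic into the labels and the input for the TAGS=25 and TAGS=25++ setups is sketched below; the helper names, the exact formatting of the topic token and of the label suffix are illustrative assumptions:

    def add_topic_to_labels(bio_labels, topic):
        """TAGS=25: append the topic to every argumentative tag,
        e.g. 'B-PRO' -> 'B-PRO-CLONING'; NON stays NON."""
        suffix = topic.upper().replace(" ", "_")
        return [t if t == "NON" else f"{t}-{suffix}" for t in bio_labels]

    def make_25pp_examples(tokens, bio_labels, topic, all_topics):
        """TAGS=25++: additionally prepend the topic to the token sequence and
        add the sentence once per other topic with all labels set to NON."""
        examples = []
        for t in all_topics:
            labels = add_topic_to_labels(bio_labels, t) if t == topic \
                     else ["NON"] * len(bio_labels)
            # assumption: the prepended topic token itself is labeled NON
            examples.append(([t] + tokens, ["NON"] + labels))
        return examples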
6 Evaluation

In this section we evaluate the results and analyze the errors of the models in the different ARC experiments. All reported results are macro F1 scores, unless otherwise stated. For the computation of the scores we used scikit-learn's precision_recall_fscore_support function (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html), where we concatenated the true values and the predictions over all sentences per set.
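Concretely, the scoring reduces to a single call per evaluation set once all token-level predictions are concatenated; a minimal sketch using scikit-learn:

    from sklearn.metrics import precision_recall_fscore_support

    def macro_f1(gold_sequences, predicted_sequences):
        """Concatenate the token labels of all sentences in a set and
        compute macro-averaged precision, recall and F1 over the three classes."""
        y_true = [label for sentence in gold_sequences for label in sentence]
        y_pred = [label for sentence in predicted_sequences for label in sentence]
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=["PRO", "CON", "NON"], average="macro")
        return precision, recall, f1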
Domain                                 In-Domain                                                         Cross-Domain
                      Train       Token             Sentence           Token          Sentence          Token           Sentence        Token          Sentence
                     EVAL                  Token                               Sentence                         Token                           Sentence
                       SET     Dev      Test      Dev     Test    Dev        Test    Dev     Test    Dev     Test    Dev      Test   Dev     Test     Dev    Test

   Model              TAGS
   Baseline (PRO)        3     10.46    10.91     10.46   10.91   14.10   14.10     14.10    14.10    6.46   12.36    6.46   12.36    9.87   16.43    9.87   16.43
   Baseline (CON)        3     10.24    10.53     10.24   10.53   13.48   14.07     13.48    14.07   15.78   11.55   15.78   11.55   20.13   15.03   20.13   15.03
   Baseline (NON)        3     25.83    25.43     25.83   25.43   21.57   21.13     21.57    21.13   24.54   24.01   24.54   24.01   18.83   18.43   18.83   18.43
   ArgumenText           3     25.10    23.46     25.10   23.46   32.72   29.86     32.72    29.86   19.87   24.81   19.87   24.81   26.56   31.26   26.56   31.26
   BERT                  3     55.60    52.93     49.95   49.24   62.93   59.99     49.97    50.10   38.91   40.86   37.47   34.60   43.98   49.50   38.56   34.37
   BERT                  5     55.38    52.23       -       -     61.93   60.20       -        -     38.49   40.73      -    43.45   48.71     -        -      -
   BERT                  9     54.50    51.37       -       -     61.16   60.09       -        -     37.86   39.96      -    42.82   48.54     -        -      -
   BERT                  3     68.95    63.35     64.83   63.78   72.51   65.49     64.92    64.26   53.66   52.28   46.47   52.19   55.54   51.21   46.56   51.68
   BERT                  5     68.34    64.67       -       -     70.21   65.80       -        -     53.32   52.52      -      -     53.07   51.98      -      -
   BERT                  9     67.58    64.98       -       -     67.19   64.27       -        -     53.50   54.96      -      -     52.45   51.90      -      -
   BERT                 25     71.18    63.23       -       -     72.91   64.66       -        -        -      -        -      -        -      -        -      -
   BERT               25++     66.58    64.19       -       -     65.72   64.21       -        -        -      -        -      -        -      -        -      -

Table 2: F1 scores for all methods; training was done with the corresponding TAGS in the table, while evaluation was always on three labels (PRO, CON, NON), with aggregation if necessary; the missing values (-) correspond to experiment setups that were not possible or applicable and were hence omitted.

    Model                   TAGS            Train time (sec./it.)
    BERT (base, cased)      3 (T)   3 (S)        37      18
    BERT (base, cased)      3 (T)   3 (S)        39      29

Table 3: BERT average runtimes with training on the token-level (T) and on the sentence-level (S), with 32 sentences per batch on a single GPU with 11 GB memory.

6.1 Results

We present the results in the following manner: Table 2 shows experiments across domains and for different tagsets in the training step, always evaluating on three labels (PRO, CON and NON). In Table 3 we compare the runtimes of token- and sentence-level training. Finally, we show the results of the evaluation on the same tags as used for training in Table 4 and Table 5.

For the results in Table 2 we see that the baseline for the NON label (the most frequent label) and the model "ArgumenText" (ArgumenText Classify API) are clearly worse than all BERT-based models. This shows that we are clearly improving upon the pipeline that we used to select the data.

Token- vs. Sentence-Level: The four experiments on token- and sentence-level for both the in- and cross-domain setups (Table 2) work significantly better with a fine-tuned BERT model for the ARC task, which mirrors the findings of Peters et al. (2019) for many other NLP tasks. Furthermore, training on the token-level always leads to better results, which was one of our motivations and objectives for this task and the dataset. For an evaluation on the token-level, a model trained on the token-level with TAGS=9 works best, while TAGS=5 works best for an evaluation on the sentence-level. However, the average runtime per iteration (Table 3) is between 25% and 50% lower for sentence-level models than for token-level models.

Sequence Labeling Across Domains: The results for the evaluation on three labels are in Table 2; the best F1 scores for in-domain on the token-level (64.26) and the sentence-level (65.80) are higher than the corresponding scores for cross-domain (54.96 and 51.98, respectively). This validates our assumption that the ARC problem depends on the topic at hand and that cross-topic (cross-domain) transfer is more difficult to learn.

Sequence Labeling with Different Tagsets: The results in Table 4 are from evaluations of models that were trained on the corresponding TAGS 3, 5 and 9, and are again better for in-domain and a fine-tuned model (63.35, 54.23 and 36.01, respectively). Results for larger tagsets are clearly lower, which is to be expected from the increased complexity of the task and the low number of training examples for some of the tags.
    Model                             TAGS   Dev (In-Dom.)   Test (In-Dom.)   Dev (Cross-Dom.)   Test (Cross-Dom.)
    BERT (base, cased), frozen           3       55.60            52.93             38.91              40.86
                                         5       34.25            32.45             23.20              24.88
                                         9       18.93            17.92             12.82              13.76
    BERT (base, cased), fine-tuned       3       68.95            63.35             53.66              52.28
                                         5       58.32            54.23             41.81              42.65
                                         9       39.66            36.01             28.35              30.34

Table 4: Sequence labeling with BERT for 3, 5 and 9 labels.

    Model                  TAGS   Dev (In-Domain)   Test (In-Domain)
    BERT (base, cased)       25        51.38             41.73
    BERT (base, cased)     25++        45.50             42.83

Table 5: BERT experiments with added topic information; for 25, the topic information is only in the labels; for 25++, the topic information is in the labels, negative examples are added and the topic information is provided at the beginning of a sequence.

Adding Topic Information: Adding the topic information in the labels or before a sequence generally does not help when evaluating on three tags (results for 25 and 25++ TAGS in Table 2); more sophisticated ways of exploiting the provided topic information would be needed to improve these results. The results in Table 5, however, show that the additional information about the topic and from the negative examples (42.83) helps to train the model. The model is thus able to learn the topic relevance of a sentence for the six topics in the in-domain sets.

6.2 Error Analysis

We classify errors in three ways: (i) the span is not correctly recognized, (ii) the stance is not correctly classified, or (iii) the topic is not correctly classified.

Span: The span errors made by the models can be divided into two further cases: (a) the beginning and/or end of a segment is incorrectly recognized, and/or (b) the segment is broken into several segments or merged into fewer segments, such that tokens inside or outside an actual argument unit are misclassified. To quantify this, we used the predictions of the best token-level model with TAGS=9 in both the in-domain and cross-domain settings, and analyzed the average length of segments as well as the total count of segments for the true and predicted labels. For the average length of segments (in tokens), we got 17.66 for the true and 13.73 for the predicted labels in-domain, and 16.35 for the true and 13.14 for the predicted labels cross-domain, showing that predicted segments are on average four tokens (in-domain) and three tokens (cross-domain) shorter than the true segments. Regarding the count of segments, there are 297 more segments in the predicted labels for in-domain and 372 more segments in the predicted labels for cross-domain than there are in the gold standard.
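The segment statistics above can be computed directly from the label sequences; a minimal sketch (assuming predictions have already been mapped back to per-token PRO/CON/NON labels):

    def segments(labels):
        """Extract (start, end, stance) spans of contiguous PRO/CON tokens."""
        spans, start = [], None
        for i, lab in enumerate(labels + ["NON"]):          # sentinel closes an open span
            if start is not None and (lab == "NON" or lab != labels[start]):
                spans.append((start, i, labels[start]))
                start = None
            if lab != "NON" and start is None:
                start = i
        return spans

    def segment_stats(sequences):
        """Average segment length (in tokens) and total segment count over a set."""
        spans = [s for seq in sequences for s in segments(seq)]
        lengths = [end - begin for begin, end, _ in spans]
        avg_len = sum(lengths) / len(lengths) if lengths else 0.0
        return avg_len, len(spans)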
Stance: A complete misclassification of the stance occurred for the best token-level model (TAGS=9) in 7.67% of the test sentences in-domain and in 16.50% of the test sentences cross-domain. A frequent error is that apparently stance-specific words are assigned a label that is not consistent with the overall segment stance.

Topic: We looked for errors where the topic-independent tag was correct (e.g., B-CON, beginning of a con argument), but the topic was incorrect. This type of error occurred only four times on the test set for TAGS=25++, on some of the tokens, but never for a full sequence. The model misclassified, for example, the actual topic nuclear energy as the topic abortion, or confused the actual topic death penalty with the topic minimum wage. A reason for this could be some topic-specific vocabulary that the model learned, but none of the affected words are ones one would assign to the misclassified topics.

7 Conclusion

We introduced a new task, argument unit recognition and classification (ARC), and release the benchmark ARC-8 for this task. We demonstrated that ARC-8 has good quality in terms of annotator agreement: the required annotations can be crowdsourced using specific data selection and filtering methods as well as a slot filling approach. We cast ARC as a sequence labeling task and established a state of the art for ARC-8, using baseline as well as advanced methods for sequence labeling. In the future, we plan to find better models for this task, especially models with the ability to better incorporate the topic information in the learning process.
Acknowledgments

We gratefully acknowledge support by Deutsche Forschungsgemeinschaft (DFG) (SPP-1999 Robust Argumentation Machines (RATIO), SCHU2246/13), as well as by the German Federal Ministry of Education and Research (BMBF) under the promotional reference 03VP02540 (ArgumenText).

References

Yamen Ajjour, Wei-Fan Chen, Johannes Kiesel, Henning Wachsmuth, and Benno Stein. 2017. Unit segmentation of argumentative texts. In Proceedings of the 4th Workshop on Argument Mining, pages 118–128, Copenhagen, Denmark. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Judith Eckle-Kohler, Roland Kluge, and Iryna Gurevych. 2015. On the role of discourse markers for discriminating claims and premises in argumentative discourse. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2236–2242, Lisbon, Portugal. Association for Computational Linguistics.

Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. Neural end-to-end learning for computational argumentation mining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Volume 1: Long Papers, pages 11–22. Association for Computational Linguistics.

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1120–1130.

Xinyu Hua and Lu Wang. 2017. Understanding and detecting supporting arguments of diverse types. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 203–208, Vancouver, Canada. Association for Computational Linguistics.

Hyun-Chul Kim and Zoubin Ghahramani. 2012. Bayesian classifier combination. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 619–627, La Palma, Canary Islands. PMLR.

K. Krippendorff, Y. Mathet, S. Bouvry, and A. Widlöcher. 2016. On the reliability of unitizing textual continua: Further developments. Quality & Quantity, 50(6):2347–2364.

J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.

Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. 2014. Context dependent claim detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1489–1500, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Mengxue Li, Shiqiang Geng, Yang Gao, Shuhua Peng, Haijing Liu, and Hao Wang. 2017. Crowdsourcing argumentation structures in Chinese hotel reviews. In Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics, pages 87–92.

Tristan Miller, Maria Sukhareva, and Iryna Gurevych. 2019. A streamlined method for sourcing discourse-level argumentation annotations from the crowd. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1116, Berlin, Germany. Association for Computational Linguistics.

Raquel Mochales Palau and Marie-Francine Moens. 2009. Argumentation mining: The detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law, ICAIL '09, pages 98–107, New York, NY, USA. ACM.

Andreas Peldszus and Manfred Stede. 2013. From argument diagrams to argumentation mining in texts: A survey. Int. J. Cogn. Inform. Nat. Intell., 7(1):1–31.

Matthew Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987.

Paul Reisert, Naoya Inoue, Tatsuki Kuribayashi, and Kentaro Inui. 2018. Feasible annotation scheme for capturing policy argument reasoning using argument templates. In Proceedings of the 5th Workshop on Argument Mining, pages 79–89. Association for Computational Linguistics.

Eyal Shnarch, Carlos Alzate, Lena Dankin, Martin Gleize, Yufang Hou, Leshem Choshen, Ranit Aharonov, and Noam Slonim. 2018. Will it blend? Blending weak and strong labeled data in a neural network for argumentation mining. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 599–605. Association for Computational Linguistics.
Edwin Simpson and Iryna Gurevych. 2018. Bayesian
  ensembles of crowds and deep learners for sequence
  tagging. CoRR, abs/1811.00780.
Christian Stab, Johannes Daxenberger, Chris Stahlhut,
  Tristan Miller, Benjamin Schiller, Christopher
  Tauchmann, Steffen Eger, and Iryna Gurevych.
  2018a. ArgumenText: Searching for arguments in
  heterogeneous sources. In Proceedings of the 2018
  Conference of the North American Chapter of the
  Association for Computational Linguistics: Demon-
  strations, pages 21–25.
Christian Stab and Iryna Gurevych. 2014. Annotat-
  ing argument components and relations in persua-
  sive essays. In Proceedings of the 25th International
  Conference on Computational Linguistics (COLING
  2014), pages 1501–1510. Dublin City University
  and Association for Computational Linguistics.

Christian Stab and Iryna Gurevych. 2017. Parsing ar-
  gumentation structures in persuasive essays. Com-
  putational Linguistics, 43(3):619–659.

Christian Stab, Tristan Miller, Benjamin Schiller,
  Pranav Rai, and Iryna Gurevych. 2018b. Cross-
  topic argument mining from heterogeneous sources.
  In Proceedings of the 2018 Conference on Empiri-
  cal Methods in Natural Language Processing, pages
  3664–3674. Association for Computational Linguis-
  tics.
Henning Wachsmuth, Martin Potthast, Khalid
  Al Khatib, Yamen Ajjour, Jana Puschmann, Jiani
  Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff,
  and Benno Stein. 2017. Building an argument
  search engine for the web. In Proceedings of
  the 4th Workshop on Argument Mining, pages
  49–59, Copenhagen, Denmark. Association for
  Computational Linguistics.
Marilyn Walker, Jean Fox Tree, Pranav Anand, Rob
 Abbott, and Joseph King. 2012. A corpus for re-
 search on deliberation and debate. In Proceed-
 ings of the Eighth International Conference on
 Language Resources and Evaluation (LREC-2012),
 pages 812–817, Istanbul, Turkey. European Lan-
 guage Resources Association (ELRA).

Adam Wyner, Raquel Mochales-Palau, Marie-Francine
  Moens, and David Milward. 2010. Approaches to
  text mining arguments from legal cases. In Enrico
  Francesconi, Simonetta Montemagni, Wim Peters,
  and Daniela Tiscornia, editors, Semantic Processing
  of Legal Texts: Where the Language of Law Meets
  the Law of Language, pages 60–79. Springer Berlin
  Heidelberg, Berlin, Heidelberg.