Critical Thinking for Language Models

Gregor Betz†, Christian Voigt† and Kyle Richardson‡
† Karlsruhe Institute of Technology, Karlsruhe, Germany
{gregor.betz, christian.voigt}@kit.edu
‡ Allen Institute for AI, Seattle, WA, USA
{kyler}@allenai.org

arXiv:2009.07185v2 [cs.CL] 17 Dec 2020

Abstract

This paper takes a first step towards a critical thinking curriculum for neural autoregressive language models. We introduce a synthetic corpus of deductively valid arguments, and generate artificial argumentative texts to train and evaluate GPT-2. Significant transfer learning effects can be observed: Training a model on three simple core schemes allows it to accurately complete conclusions of different, and more complex types of arguments, too. The language models generalize the core argument schemes in a correct way. Moreover, we obtain consistent and promising results for NLU benchmarks. In particular, pre-training on the argument schemes raises zero-shot accuracy on the GLUE diagnostics by up to 15 percentage points. The findings suggest that intermediary pre-training on texts that exemplify basic reasoning abilities (such as typically covered in critical thinking textbooks) might help language models to acquire a broad range of reasoning skills. The synthetic argumentative texts presented in this paper are a promising starting point for building such a “critical thinking curriculum for language models.”

1 Introduction

Pre-trained autoregressive language models (LM) such as GPT-2 and GPT-3 achieve, remarkably, competitive results in a variety of language modeling benchmarks without task-specific fine-tuning (Radford et al., 2019; Brown et al., 2020). Yet, it is also widely acknowledged that these models struggle with reasoning tasks, such as natural language inference (NLI) or textual entailment (Askell, 2020). Actually, that doesn’t come as a surprise, given the tendency of humans to commit errors in reasoning (Kahneman, 2011; Sunstein and Hastie, 2015), their limited critical thinking skills (Paglieri, 2017), the resulting omnipresence of fallacies and biases in texts and the frequently low argumentative quality of online debates (Hansson, 2004; Guiaşu and Tindale, 2018; Cheng et al., 2017). Neural language models are known to pick up and reproduce normative biases (e.g., regarding gender or race) present in the dataset they are trained on (Gilburt, 2019), as well as other annotation artifacts (Gururangan et al., 2018); no wonder this happens with argumentative biases and reasoning flaws, too (Kassner and Schütze, 2020; Talmor et al., 2020). This diagnosis suggests that there is an obvious remedy for LMs’ poor reasoning capability: make sure that the training corpus contains a sufficient amount of exemplary episodes of sound reasoning.

In this paper, we take a first step towards the creation of a “critical thinking curriculum” for neural language models. Critical thinking can be loosely defined as “reasonable reflective thinking that is focused on deciding what to believe or do” (Norris and Ennis, 1989). Generally speaking, our study exploits an analogy between teaching critical thinking to students and training language models so as to improve their reasoning skill. More specifically, we build on three key assumptions that are typically made in critical thinking courses and textbooks: First, there exist fundamental reasoning skills that are required for, or highly conducive to, a large variety of more specific and advanced critical thinking skills (e.g., Fisher, 2001, p. 7). Second, drawing deductive inferences is one such basic ability (e.g., Fisher, 2001, pp. 7–8). Third, reasoning skills are not (just) acquired by learning a theory of correct reasoning, but by studying lots of examples and doing “lots of good-quality exercises” (Lau and Chan, 2020), typically moving from simple to more difficult problems (e.g., Bowell and Kemp, 2014).

These insights from teaching critical thinking translate, with respect to our study, as follows.
First of all, we design and build ‘lots of good-quality exercises’: a synthetic corpus of deductively valid arguments which instantiate a variety of (syllogistic) argument schemes, and which are rendered as text paragraphs (Section 3). Next, we use our synthetic argument text corpus to train and to evaluate GPT-2 (Section 4). The training, which maximizes a causal language modeling objective, can be conceived of as a generic, intermediary pre-training in the spirit of STILTS (Phang et al., 2018).

Evaluating the models’ ability to correctly complete conclusions of arguments, we observe strong transfer learning effects/generalization (Section 5): Just training the models on a few central core schemes (generalized modus ponens, contraposition and chain rule) allows them to accurately complete conclusions of different types of arguments, too (e.g., complex argumentative forms that involve dilemma and de Morgan). The language models appear to connect and generalize the core argument schemes in a correct way. In addition, the models are equally able to apply learned argument patterns beyond the training corpus’ domain. Tests with a simple manually authored argument produce evidence that generic language modeling skill facilitates the successful generalization of learned argument patterns.

Moreover, we test the trained models on different reasoning benchmarks. Because we are particularly interested in transfer learning effects, we do so in a zero-shot set-up (i.e., evaluating our argumentation models on entirely unrelated NLU tasks, which follows recent work by Mitra et al. (2019); Shwartz et al. (2020); Ma et al. (2020)). We obtain consistent and promising results for the GLUE diagnostics (Wang et al., 2018) and SNLI (Bowman et al., 2015) benchmarks (Section 5), finding that training on core schemes clearly improves NLU skill. However, training on the argument corpus doesn’t affect the performance with regard to the semantically more demanding Argument Reasoning Comprehension task (Habernal et al., 2018) or the critical thinking assessment compiled in LogiQA (Liu et al., 2020).

All of the observed transfer learning effects strengthen the analogy between teaching critical thinking and training language models: A variety of reasoning skills are improved by generic, intermediary pre-training on high-quality texts that exemplify a basic reasoning skill, namely simple deductive argumentation. Obviously, drawing correct inferences is just one of the elementary skills typically covered in critical thinking courses (Fisher, 2001). Critical thinking involves more than deduction. And it would hence, by analogy, be unreasonable to expect that intermediary pre-training on the synthetic argument corpus suffices to turn language models into accomplished reasoners. However, we have shown that argumentative texts (with valid syllogistic arguments) are certainly a good starting point when building a more comprehensive dataset for initial or intermediary pre-training that might help language models to acquire a broad range of reasoning skills. Or, to put it differently, the synthetic argumentative texts might belong to the core of a “critical thinking curriculum for language models.” In the final section, we advance some ideas for complementing the artificial argument corpus so as to further improve the performance of LMs with regard to different reasoning benchmarks.

2 Related Work

To our knowledge, this paper is, together with Gontier et al. (2020), among the first to show that autoregressive language models like GPT-2 can learn to reason by training on a text corpus of correct natural language arguments. By contrast, previous work in this field, described below, has typically modeled natural language reasoning problems as classification tasks and trained neural systems to accomplish them. For example, Schick and Schütze (2020a,b), using pattern verbalizations, construct structured training data that is suitable for training a masked language model with a classification head, and thus achieve remarkable NLU performance. This paper explores the opposite route: We start with highly structured (synthetic) data, render it as unstructured, plain text and train a uni-directional language model on the synthetic text corpus.

Over and above the methodological novelty of our approach, we discuss, in the following, related reasoning benchmarks and explain what sets our synthetic argument corpus apart from this work.

Rule reasoning in natural language. Various datasets have been developed for (deductive) rule reasoning in natural language. In these tasks, one or multiple rules, i.e. (generalized) conditionals, must be applied to a fact base in order to deductively infer a conclusion. Facts and conclusions
are represented by atomic statements. Rule application closely resembles the conclusion completion task for generalized modus ponens and generalized modus tollens schemes described below. However, we go beyond previous work in investigating the ability of language models to infer conclusions that have a more complex logico-semantic structure (e.g., existential or universal statements).

The question answering bAbI dataset (Weston et al., 2016) contains a task which involves applying very specific rules of the form “Xs are afraid of Ys” to an instance (for example: “Mice are afraid of cats. Jerry is a mouse. What is Jerry afraid of? A: cats”). Equally simple, one-step rule applications are tested in Richardson et al. (2020), and also contained in the QuaRTz dataset (Tafjord et al., 2019).

ROPES (Lin et al., 2019) is a reading comprehension task that involves applying background knowledge to a given situation (both being presented as paragraph-long text). Correct answers can be inferred by one-step rule application; part of the challenge is to identify the relevant rule and fact in the text.

RuleTaker, arguably the most general system for rule reasoning in natural language so far, is a transformer model that has been fine-tuned to predict whether a conclusion can be inferred from a set of rules and facts, not all of which are necessarily required to draw the conclusion (Clark et al., 2020). Moreover, inferring the conclusion from the premise set might involve multiple inference steps. The authors show that the transformer model can be trained to perform this task nearly flawlessly and, moreover, to ‘explain’ its inferences by identifying relevant premises. They also observe substantial transfer learning effects.

PRover extends RuleTaker by a component for proof generation (Saha et al., 2020). Technically, the QA head of the RoBERTa language model (Liu et al., 2019) is complemented by two additional neural classifiers (for nodes and edges) that are used to construct proof chains. Saha et al. (2020) show that PRover can construct valid proofs and outperforms RuleTaker in terms of answer accuracy in a zero-shot setting.

Training on synthetic knowledge-graph data (such as "Paris CapitalOf France" and "France HasCapital Paris") from scratch, Kassner et al. (2020) find that BERT is able to correctly infer novel facts. This confirms that language models can, in principle, learn basic conceptual rules, which, e.g., express that a relation is symmetric or that two terms are equivalent.

Benchmarks for enthymematic reasoning. An ‘enthymeme’ is an argument whose premises are not explicitly stated, e.g.: “Jerry is a mouse. Therefore, Jerry is afraid of cats.” The three tasks described below involve such reasoning with implicit assumptions, whereas our synthetic argument corpus doesn’t: all premises are transparent and explicitly given.

Commonsense Transformers (COMET) are autoregressive language models for generating commonsense knowledge graphs (Bosselut et al., 2019). Being trained on seed data, the models are able to meaningfully relate subject phrases to object phrases in terms of multiple binary relations (by doing the type of completion tasks we introduce in Section 4), and can thereby both reproduce and extend a given knowledge graph. In particular, this includes generating statements about causal relationships, which can be construed as enthymematic reasoning with commonsense background assumptions. For example, given the input "PersonX is re-elected. As a result, PersonX wants" the model generates as completions: "to get a raise", "to go to office", "to go home", "to make a speech", "to celebrate", all of which are plausible fill-ins. The implicit commonsense premises that underlie this (enthymematic) inference are principles such as "If someone has been re-elected, then they want to celebrate."

The Argument Reasoning Comprehension (ARC) dataset (Habernal et al., 2018) comprises simple informal arguments. Each argument contains two premises: whereas the first premise is explicitly stated, there are two alternative formulations of the second premise. The task consists in identifying which of these two alternative formulations is actually assumed in the argument. For example: “Miss America gives honors and education scholarships. And since [scholarships would give women a chance to study | scholarships would take women from the home], Miss America is good for women.” ARC therefore assesses the ability to make implicit premises explicit. An adversarial ARC dataset that eliminates clues in the original benchmark is also available in Niven and Kao (2019).

CLUTRR is a task generator for relational reasoning on kinship graphs (Sinha et al., 2019). CLUTRR takes a set of (conceptual) rules about family relations as given and constructs set-theoretic possible worlds (represented as graphs) which instantiate these rules. In such a possible (kinship) world, a target fact and a set of base facts are identified such that the base facts together with the rules deductively entail the target fact. The task consists in inferring the target fact from the base facts alone; the conceptual rules remain implicit. For example: “Kristin and her son Justin went to visit her mother Carol on a nice Sunday afternoon. They went out for a movie together and had a good time. Q: How is Carol related to Justin? A: Carol is the grandmother of Justin.” So, CLUTRR assesses enthymematic deductive reasoning with implicit conceptual rules. Gontier et al. (2020) have trained a generative Transformer language model on a synthetic text corpus (with each argumentative text containing a story, a proof chain and a conclusion from CLUTRR) and show that the language model does not only learn to draw the correct conclusion (given an argument with implicit commonsense premises), but also seems to acquire the ability to generate valid proof chains.

Critical thinking tasks. LogiQA (Liu et al., 2020) is a collection of publicly available critical thinking questions, used by the National Civil Servants Examination of China to assess candidates’ critical thinking and problem solving skills. LogiQA covers tasks of various types: different kinds of natural language inference problems as well as the identification of implicit premises or (practical) instrumental reasoning. Its scope is much broader than our highly specific and carefully designed argument corpus. The LogiQA tasks are shown to be hard for current AI systems, of which a fine-tuned transformer model performs best with an accuracy score of 35%, which is 50 percentage points below human performance.

3 An Artificial Argument Corpus

This section describes the construction of a synthetic corpus of natural language arguments used for training and evaluating GPT-2.¹

¹ The corpus as well as the source code used to generate it will be released at https://github.com/debatelab/aacorpus.

The corpus is built around eight simple, deductively valid syllogistic argument schemes (top row in Figure 1). These base schemes have been chosen because of their logical simplicity as well as their relevance in critical thinking and argument analysis (Feldman, 2014; Bowell and Kemp, 2014; Brun and Betz, 2016). Each of these eight base schemes is manually varied in specific ways to create further valid variants.

Negation variants of base schemes (second row in Figure 1) are created by substituting a sub-formula with its negation and/or by applying duplex negatio affirmat.

Complex predicates variants (third row in Figure 1) build on base schemes or their respective negation variants and are obtained by substituting atomic predicates with compound disjunctive or conjunctive ones.

De Morgan variants of base schemes (fourth row in Figure 1) are finally derived by applying de Morgan’s law to the respective variants created before.

With two to three different versions for each of these variations of a base scheme (parameter "n" in Figure 1), we obtain, all in all, 71 distinct hand-crafted argument schemes. Obviously, some of these schemes can be derived from others. For example, generalized modus ponens and generalized contraposition (base schemes) entail a negation variant of generalized modus tollens. Likewise, generalized contraposition and hypothetical syllogism 1 entail a (negation variant of) hypothetical syllogism 2.

In view of their simplicity and prominence in natural language argumentation, three of the eight base schemes are marked as core schemes: generalized modus ponens, generalized contraposition, hypothetical syllogism 1.

Natural language instances of the argument schemes can be created by means of a first-order-logic domain (with names and predicates) and natural language templates for the formal schemes. In order to obtain a large variety of realistic natural language arguments, we have devised

   • a multi-stage templating process with
   • alternative templates at each stage and
   • multiple domains.

As shown in Figure 2, this process can be split into five consecutive steps.

In step 1, the argument scheme, which serves as formal template for the natural language argument, is chosen.
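The five-step templating process can be illustrated with a small Python sketch. This is a minimal sketch under stated assumptions, not the authors' generator: the tuple encoding of formulas, the templates for atomic sentences, the entity names, the intro and inference-indicator phrases, and the function name `generate_argument` are all hypothetical. Only the three sentence patterns for universal conditionals and the example predicates are taken from the paper.

```python
import random

# Step 1 input: generalized modus ponens, encoded as (kind, slot, slot)
# tuples. This encoding is an assumption made for illustration.
SCHEME = {
    "premises": [("all", "F", "G"), ("atom", "a", "F")],
    "conclusion": ("atom", "a", "G"),
}

# Alternative sentence templates per formula kind. The "all" patterns are
# quoted in the paper; the "atom" patterns are hypothetical.
TEMPLATES = {
    "all": [
        "Every {0} is a {1}.",
        "Whoever is a {0} is also a {1}.",
        "Being a {1} is necessary for being a {0}.",
    ],
    "atom": ["{0} is a {1}.", "{0} is, in fact, a {1}."],
}

# A toy domain: predicates follow the paper's examples, names are made up.
DOMAIN = {
    "names": ["Lucy", "Maria", "Helen"],
    "predicates": ["sister of Anna", "granddaughter of Elsa", "cousin of Sarah"],
}

# Hypothetical argument intros and inference indicators.
INTROS = ["Consider the following argument.", "Here is a valid argument."]
INDICATORS = ["Therefore,", "Hence,", "It follows that"]


def generate_argument(rng: random.Random) -> dict:
    # Step 1: choose the formal scheme (only one is available here).
    scheme = SCHEME

    # Step 2: replace each formula by a randomly chosen sentence pattern,
    # keeping symbolic placeholders such as "{F}" in the result.
    def verbalize(formula):
        kind, first, second = formula
        template = rng.choice(TEMPLATES[kind])
        return template.format("{" + first + "}", "{" + second + "}")

    premise_patterns = [verbalize(f) for f in scheme["premises"]]
    conclusion_pattern = verbalize(scheme["conclusion"])

    # Step 3: substitute names and predicates argument-wise, so that the
    # same binding is used in every sentence of the argument.
    pred_f, pred_g = rng.sample(DOMAIN["predicates"], 2)
    binding = {"a": rng.choice(DOMAIN["names"]), "F": pred_f, "G": pred_g}
    premises = [p.format(**binding) for p in premise_patterns]
    conclusion = conclusion_pattern.format(**binding)

    # Step 4: randomly re-order the premises.
    rng.shuffle(premises)

    # Step 5: pack everything into one paragraph with an intro and an
    # inference indicator in front of the conclusion.
    parts = [rng.choice(INTROS), *premises, rng.choice(INDICATORS), conclusion]
    return {"premises": premises, "conclusion": conclusion, "text": " ".join(parts)}


example = generate_argument(random.Random(0))
print(example["text"])
```

Separating step 2 (pattern selection) from step 3 (domain substitution) mirrors the idea that sentence templates are topic-neutral while names and predicates are domain-specific, so either can be varied independently.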
Schemes are listed per row of the figure as "premises ⇩ conclusion"; n is the number of versions of the respective variant.

base_scheme
   • generalized modus ponens: ∀x Fx→Gx; Fa ⇩ Ga
   • generalized contraposition: ∀x Fx→¬Gx ⇩ ∀x Gx→¬Fx
   • hypothetical syllogism 1: ∀x Fx→Gx; ∀x Gx→Hx ⇩ ∀x Fx→Hx
   • hypothetical syllogism 2: ∀x Fx→Gx; ∀x ¬Hx→¬Gx ⇩ ∀x Fx→Hx
   • hypothetical syllogism 3: ∀x Fx→Gx; ∃x Hx∧¬Gx ⇩ ∃x Hx∧¬Fx
   • generalized modus tollens: ∀x Fx→Gx; ¬Ga ⇩ ¬Fa
   • disjunctive syllogism: ∀x Fx→Gx∨Hx; ∀x Fx→¬Gx ⇩ ∀x Fx→Hx
   • generalized dilemma: ∀x Fx→Gx∨Hx; ∀x Gx→Jx; ∀x Hx→Jx ⇩ ∀x Fx→Jx

negation_variant
   • generalized modus ponens (n=2): ∀x Fx→¬Gx; Fa ⇩ ¬Ga
   • generalized contraposition (n=3): ∀x Fx→Gx ⇩ ∀x ¬Gx→¬Fx
   • hypothetical syllogism 1 (n=3): ∀x Fx→¬Gx; ∀x ¬Gx→Hx ⇩ ∀x Fx→Hx
   • hypothetical syllogism 2 (n=3): ∀x Fx→¬Gx; ∀x ¬Hx→Gx ⇩ ∀x Fx→Hx
   • hypothetical syllogism 3 (n=3): ∀x ¬Fx→Gx; ∃x Hx∧¬Gx ⇩ ∃x Hx∧Fx
   • generalized modus tollens (n=2): ∀x Fx→¬Gx; Ga ⇩ ¬Fa
   • disjunctive syllogism (n=3): ∀x Fx→Gx∨Hx; ∀x Gx→¬Fx ⇩ ∀x Fx→Hx
   • generalized dilemma (n=3): ∀x Fx→Gx∨Hx; ∀x Jx→¬Gx; ∀x Jx→¬Hx ⇩ ∀x Fx→¬Jx

complex_predicates
   • generalized modus ponens (n=3): ∀x Fx∧Hx→Gx; Fa; Ha ⇩ Ga
   • generalized contraposition (n=2): ∀x (Fx∧Hx)→¬Gx ⇩ ∀x Gx→¬(Fx∧Hx)
   • hypothetical syllogism 1 (n=3): ∀x Fx→Gx; ∀x Fx→Ix; ∀x Gx∧Ix→Hx ⇩ ∀x Fx→Hx
   • hypothetical syllogism 2 (n=3): ∀x Fx→¬(Gx∨Ix); ∀x Hx→¬(Gx∨Ix) ⇩ ∀x Fx→Hx
   • hypothetical syllogism 3 (n=3): ∀x Fx→Gx; ∀x Fx→Ix; ∃x Hx∧¬(Gx∧Ix) ⇩ ∃x Hx∧¬Fx
   • generalized modus tollens (n=2): ∀x Fx→Gx∧Hx; ¬Ga ⇩ ¬Fa
   • disjunctive syllogism (n=3): ∀x Fx→Gx∨Hx∨Ix; ∀x Fx→¬Gx; ∀x Fx→¬Ix ⇩ ∀x Fx→Hx
   • generalized dilemma (n=3): ∀x Fx→Gx∨Hx∨Ix; ∀x Gx→Jx; ∀x Hx→Jx ⇩ ∀x Fx→Jx∨Ix

de_morgan
   • generalized modus ponens (n=2): ∀x ¬(Fx∨Hx)→Gx; ¬Fa; ¬Ha ⇩ Ga
   • generalized contraposition (n=2): ∀x (Fx∧Hx)→¬Gx ⇩ ∀x Gx→¬Fx∨¬Hx
   • hypothetical syllogism 1 (n=2): ∀x (¬Fx∧¬Ix)→Gx; ∀x Gx→Hx ⇩ ∀x ¬(Fx∨Ix)→Hx
   • hypothetical syllogism 2 (n=3): ∀x Fx→¬(Gx∨Ix); ∀x Hx→¬Gx∧¬Ix ⇩ ∀x Fx→Hx
   • hypothetical syllogism 3 (n=3): ∀x Fx→Gx; ∀x Fx→Ix; ∃x Hx∧(¬Gx∨¬Ix) ⇩ ∃x Hx∧¬Fx
   • generalized modus tollens (n=3): ∀x Fx→Gx∧Hx; ¬Ga∨¬Ha ⇩ ¬Fa
   • disjunctive syllogism (n=2): ∀x Fx∧Ix→Gx∨Hx; ∀x Gx→¬Fx∨¬Ix ⇩ ∀x Fx∧Ix→Hx
   • generalized dilemma (n=2): ∀x Fx→¬(Gx∧Hx); ∀x ¬Gx→Jx; ∀x ¬Hx→Jx ⇩ ∀x Fx→Jx

Figure 1: Syllogistic argument schemes used to create an artificial argument corpus.
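Since the schemes in Figure 1 use only one-place predicates, their deductive validity can be sanity-checked by brute force: enumerate every interpretation of the predicates over a small finite domain and search for a counterexample in which all premises are true and the conclusion is false. The following Python sketch does this for the three core schemes. The encoding of formulas as functions and the helper names (`every`, `holds`, `is_valid`) are illustrative assumptions and not part of the authors' code; searching domains up to a fixed size is meant as a sanity check rather than a proof. Affirming the consequent, which does not appear in the paper, is added as an invalid control.

```python
from itertools import product

A = 0  # domain element denoted by the constant a (fixed without loss of generality)


def is_(pred):
    """Literal test: x falls under pred."""
    return lambda x, ext: x in ext[pred]


def is_not(pred):
    """Literal test: x does not fall under pred."""
    return lambda x, ext: x not in ext[pred]


def every(antecedent, consequent):
    """Sentence: for all x, antecedent(x) -> consequent(x)."""
    return lambda dom, ext: all(consequent(x, ext) for x in dom if antecedent(x, ext))


def holds(literal):
    """Sentence: the literal is true of the constant a."""
    return lambda dom, ext: literal(A, ext)


def interpretations(predicates, size):
    """Yield every assignment of extensions (subsets of the domain) to predicates."""
    domain = range(size)
    subsets = [
        frozenset(i for i in domain if (mask >> i) & 1) for mask in range(2 ** size)
    ]
    for combo in product(subsets, repeat=len(predicates)):
        yield domain, dict(zip(predicates, combo))


def is_valid(premises, conclusion, predicates, max_size=3):
    """Return False as soon as a counterexample is found, True otherwise."""
    for size in range(1, max_size + 1):
        for dom, ext in interpretations(predicates, size):
            if all(p(dom, ext) for p in premises) and not conclusion(dom, ext):
                return False
    return True


F, G, H = is_("F"), is_("G"), is_("H")

checks = {
    "generalized modus ponens": is_valid(
        [every(F, G), holds(F)], holds(G), ["F", "G"]),
    "generalized contraposition": is_valid(
        [every(F, is_not("G"))], every(G, is_not("F")), ["F", "G"]),
    "hypothetical syllogism 1": is_valid(
        [every(F, G), every(G, H)], every(F, H), ["F", "G", "H"]),
    # Invalid control, not taken from the paper.
    "affirming the consequent": is_valid(
        [every(F, G), holds(G)], holds(F), ["F", "G"]),
}

for name, ok in checks.items():
    print(f"{name}: {'no counterexample found' if ok else 'counterexample found'}")
```

The same helpers could be extended with conjunction, disjunction and existential quantification to cover the remaining rows of the figure.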

In step 2, each sentence in the formal scheme (premises and conclusion) is individually replaced by a natural language pattern in accordance with a randomly chosen template. For example, the formula “∀x Fx→Gx” might be replaced by any of the following natural language sentence schemes:

   • “Every F is a G.”
   • “Whoever is a F is also a G.”
   • “Being a G is necessary for being a F.”
   • “If someone is a F, then they are a G.”*

Some of these patterns are not used for training, but are reserved for generating an out-of-domain test dataset (e.g., the template marked with an asterisk in the above list).

In step 3, the entity- and property-placeholders in the resulting argument scheme are replaced argument-wise with names and predicates from a domain. We hence obtain an instance of the formal argument scheme as premise-conclusion list. Each domain provides hundreds of entity-names, which can be paired with different binary predicates to create thousands of different unary predicates. The following example predicates illustrate the domains used in this study:

   • Female Relatives: sister of Anna, granddaughter of Elsa, cousin of Sarah, ...
   • Male Relatives: grandson of Ryan, nephew of Jim, cousin of Lee, ...
   • Football Fans: supporter of Real Madrid CF, ex-fan of Sevilla FC, member of SSC Napoli, ...
   • Personal Care: regular consumer of Dove shampoo, infrequent user of L’Oreal shampoo, loyal buyer of Redken shampoo, ...
   • Chemical Ingredients: ingredient of Maypole Soap, ingredient of OASIS CREAM, ingredient of BB concealer, ...
   • Dinosaurs*: contemporary of Megalosaurus, predator of Iguanodon, ancestor of Allosaurus, ...
   • Philosophers*: teacher of Aeschines of Neapolis, pupil of Cratylus, reader of Democritus, ...

Domains marked with an asterisk are used for testing only, and not for training (see below and Section 4.2).

In step 4, the premises of the natural language argument are randomly re-ordered.

In step 5, the premise-conclusion list is packed into a text paragraph by adding an argument intro, framing the premises, and adding an inference indicator. Again, multiple templates are available for doing so, which yields a large variety of textual
[Figure 2, pipeline diagram. Inputs from the artificial argument corpus config file: topic-neutral formal argument schemes; topic-neutral NL templates for formal sentence schemes; domain-specific NL names and binary predicates; argument-, premise-, and inference-indicators.
  Step 1: choose formal argument scheme. Example: ∀x Fx→¬Gx; Ga; therefore ¬Fa.
  Step 2: choose & substitute NL schemes sentence-wise. Example: "No F is a G." "a is a G." Therefore: "It is false that a is a F."
  Step 3: construct & substitute domain-specific predicates and names. Example: 1. No sister of Lisa is a friend of Chloe. 2. Susan is a friend of Chloe. Therefore: 3. It is false that Susan is a sister of Lisa.
  Step 4: permute premises randomly. Example: 1. Susan is a friend of Chloe. 2. No sister of Lisa is a friend of Chloe. Therefore: 3. It is false that Susan is a sister of Lisa.
  Step 5: construct & apply argument template. Example: "Here comes a perfectly valid argument: To begin with, Susan is a friend of Chloe. Moreover, no sister of Lisa is a friend of Chloe. In consequence, it is false that Susan is a sister of Lisa."]

      Figure 2: Pipeline for creating natural language instances of argument schemes with multiple templating.

renderings of an argument.                                               4.1      Training
   Following this pipeline, we generate natu-                            From the training items in the Artificial Argu-
ral language instances of each formal argument                           ment Corpus (TRAIN) we sample three types of
scheme, thus creating:                                                   differently-sized training sets as follows (see also
    1. a training set of argumentative texts, based on                   the color pattern in Figure 1):
       the default domains and templates (TRAIN);
    2. an evaluation set of argumentative texts,                              • TRAIN01: all training items which are in-
       based on the default domains and templates,                              stances of a core scheme, i.e. generalized
       which are used for development (DEV);                                    modus ponens, generalized contraposition,
    3. a test set of argumentative texts, based on the                          hypothetical syllogism 1 (N=4.5K, 9K, 18K,
       default domains and templates and used for                               36K)
       final tests (TEST_OUT-OF-SAMPLE);                                      • TRAIN02: all training items which are in-
    4. a test set of argumentative texts, based on the                          stances of a base scheme (N=4.5K, 9K, 18K,
       domains and templates reserved for testing                               36K)
       (TEST_OUT-OF-DOMAIN).                                                  • TRAIN03: all training items in the corpus
                                                                                (N=4.5K, 9K, 18K, 36K)
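
The five-step pipeline of Section 3 can be illustrated with a minimal sketch. Everything named here (`SCHEME`, `NL_TEMPLATES`, `DOMAIN`, the indicator lists, `generate_argument`) is a hypothetical miniature written for illustration and is not the authors' corpus configuration; it covers one scheme and one domain only.

```python
import random

# Step 1 (hypothetical data): one formal scheme, given as keys of sentence schemes.
SCHEME = {"premises": ["no_F_is_G", "a_is_G"], "conclusion": "a_is_not_F"}

# Step 2 (hypothetical data): several NL templates per sentence scheme.
NL_TEMPLATES = {
    "no_F_is_G": ["No {F} is a {G}.", "Whoever is a {F} is not a {G}."],
    "a_is_G": ["{a} is a {G}."],
    "a_is_not_F": ["It is false that {a} is a {F}.", "{a} is not a {F}."],
}

# Step 3 (hypothetical data): a tiny domain of names and binary predicates.
DOMAIN = {
    "names": ["Susan", "Lisa", "Chloe", "Anna", "Elsa", "Sarah"],
    "binary_predicates": ["sister of", "friend of", "cousin of", "granddaughter of"],
}

# Step 5 (hypothetical data): argument intro, premise frames, inference indicators.
INTROS = ["Here comes a perfectly valid argument:"]
PREMISE_FRAMES = [("To begin with,", "Moreover,"), ("First,", "Second,")]
INFERENCE_INDICATORS = ["In consequence,", "So, necessarily,"]


def decapitalize(sentence, names):
    """Lower-case the first letter unless the sentence starts with a name."""
    first_word = sentence.split(" ", 1)[0]
    if first_word in names:
        return sentence
    return sentence[0].lower() + sentence[1:]


def generate_argument(rng):
    # Step 2: pick one NL template per sentence of the scheme.
    premise_templates = [rng.choice(NL_TEMPLATES[key]) for key in SCHEME["premises"]]
    conclusion_template = rng.choice(NL_TEMPLATES[SCHEME["conclusion"]])
    # Step 3: pair binary predicates with names to build unary predicates F and G.
    entity, name_f, name_g = rng.sample(DOMAIN["names"], 3)
    pred_f, pred_g = rng.sample(DOMAIN["binary_predicates"], 2)
    slots = {"a": entity, "F": f"{pred_f} {name_f}", "G": f"{pred_g} {name_g}"}
    premises = [template.format(**slots) for template in premise_templates]
    conclusion = conclusion_template.format(**slots)
    # Step 4: randomly re-order the premises.
    rng.shuffle(premises)
    # Step 5: pack everything into one paragraph.
    frames = rng.choice(PREMISE_FRAMES)
    parts = [rng.choice(INTROS)]
    parts += [f"{frame} {decapitalize(p, DOMAIN['names'])}"
              for frame, p in zip(frames, premises)]
    indicator = rng.choice(INFERENCE_INDICATORS)
    parts.append(f"{indicator} {decapitalize(conclusion, DOMAIN['names'])}")
    return " ".join(parts)


if __name__ == "__main__":
    print(generate_argument(random.Random(0)))
```

Passing a seeded `random.Random` keeps the sketch reproducible; reserving some templates and domains for testing, as the paper does, would amount to keeping separate configuration dictionaries.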
  This represents the artificial argument text cor-
pus we use to train and evaluate GPT-2.                                     In an attempt to avoid over-fitting, we blend
                                                                         the training arguments with snippets from Reuters
4     Experiments with GPT-2                                             news stories (Lewis et al., 2004) and the standard-
We train and evaluate three compact versions of                          ized Project Gutenberg Corpus (Gerlach and Font-
GPT-2 with 117M, 345M and 762M parameters                                Clos, 2018), trying a mixing ratio of 1:1 and thus
respectively using the implementation from Wolf                          doubling training size to N=9K, 18K, 36K, 72K.
et al. (2019). We note that all of these models fall                     (We find that fine-tuning on the accordingly en-
short of the full-scale model with 1542M parame-                         hanced argument corpus still increases the model’s
ters.2                                                                   perplexity on the Wiki103 dataset by a factor of
  2
    The fine-tuned models will be released through https:                1.5 (see Appendix B), which suggests to mix a
//huggingface.co/models.                                                 higher proportion of common texts into the train-
ing data in future work.) The three different ver-        Task             Conclusion with           Comple-
sions of GPT-2 are fine-tuned (causal language                             cloze-style prompt        tion
modeling objective, using default training scripts
by Wolf et al. (2019)) on each of the 12 enhanced         split             Every F is a G           G
training sets (hyper-parameters are detailed in                             Some F is not a G        G
Appendix A). This gives us 36 fine-tuned model                              a is a F or not a G      G
versions plus the three BASE models to evaluate.
Unless explicitly stated otherwise, we report results     extended          Every F is a G           a G
of the 762M parameter model trained on 72K items.                           Some F is not a G        not a G
                                                                            a is a F or not a G      not a G
4.2   Testing
Conclusion Completion on Artificial Argument              inverted          Every F is a G           not a G
Corpus To test whether language models can                                  Some F is not a G        not a G
reason correctly, we assess their ability to accu-                          a is a F or not a G      not a G
rately complete conclusions of arguments in the
artificial argument corpus. Here, we make use of                  Table 1: Three conclusion completion tasks
the fact that, by construction, the conclusion of ev-
ery argument in the corpus ends with a predicate          inverted task, the stronger the model’s reasoning
(a property-term such as “sister of Chloe” or “sup-       performance.
porter of Tottenham Hotspurs”), which is poten-              Based on the artificial argument corpus (see
tially preceded by a negator. First of all, as shown      Section 3), we generate and distinguish three dif-
in Table 1, we test whether the model is able to          ferent test datasets, each of which comprises the
correctly fill in the final predicate (task split). The   three tasks described above, as follows:
second, more difficult task consists in completing
the final predicate plus, if present, the preceding          • out of sample:         contains items from
negator (task extended). With a third, adversarial             TEST_OUT-OF-SAMPLE, which share
task we check how frequently the model wrongly                 domain and natural language templates with
adjoins the complement of the correct completion               the training data;
of the extended task (task inverted). Consider, for          • paraphrased: a sample of 100 items, ran-
example, the following argument:                               domly drawn from TEST_OUT-OF-SAMPLE,
                                                               which have been manually reformulated so as
      It is not always easy to see who is re-                  to alter the premises’ grammatical structure
      lated to whom – and in which ways.                       imposed by the natural language templates;
      The following argument pertains to                     • out of domain:         contains items from
      this question: First premise: Every                      TEST_OUT-OF-DOMAIN, which belong
      workmate of Brad is a classmate of                       to different domains and instantiate grammatical
      James. Second premise: Every class-                      patterns other than the training data.
      mate of James is not a classmate of
      Theodore. So, necessarily, everyone                    Technically, conclusion completions, in all
      who is a workmate of Brad is [not a]E               tasks and tests, are generated by the language
      [classmate of Theodore.]S                           model with top-p nucleus sampling (p = 0.9).
In the split task, we prompt the model with the           Classification for NLU Benchmarks To inves-
argument, dropping []S , and check whether it gen-        tigate transfer learning effects, we evaluate the
erates “classmate of Theodore”. In the extended           trained models on standard NLU benchmarks,
task, we prompt the model with the argument,              such as GLUE AX and SNLI. These benchmark
dropping []E []S , and check whether it generates         tasks are classification problems. In the following,
“not a classmate of Theodore”. Finally, in the            we describe how we use the generative language
inverted task, we prompt the model as before              models to perform such classification.
and check whether it generates “a classmate of               Using simple templates, we translate each
Theodore”.                                                benchmark entry into alternative prompts (e.g.,
   Clearly, the higher the accuracy in the split and      context and question) and/or alternative comple-
extended tasks, and the lower the accuracy in the         tions (e.g., answers). Consider for example a
GLUE-style problem given by two sentences “The              use relevance perplexity as a score function to pre-
girl is eating a pizza.” and “The girl is eating food”      dict the category of X:
and the question whether one entails, contradicts,
or is independent of the other. We can construct                                                       
three prompts, corresponding to the three possible              category(X) = L argmin(relPP(cj , pi )) .
                                                                                    (pi ,cj )
answers (entail / contradict / independent):
                                                            5    Results
     Prompt1: The girl is eating a pizza.
     Therefore,                                             Conclusion Completion on Artificial Argument
     Prompt2: The girl is eating a pizza. This              Corpus Does the (fine-tuned) GPT-2 model cor-
     rules out that                                         rectly complete conclusions of natural language
     Prompt3: The girl is eating a pizza. This              arguments? Figure 3 displays the evaluation re-
     neither entails nor rules out that                     sults in an aggregated way. Each subplot visual-
     Completion: the girl is eating food.                   izes the accuracy of the models in the three com-
                                                            pletion tasks for a different test dataset (see Sec-
In this case, the correct match is obviously                tion 4.2), comparing the BASE model (points at
Prompt1–Completion. The ability of a language               the very left) with the fine-tuned models trained
model to discern that “The girl is eating pizza” en-        on TRAIN01, TRAIN02, and TRAIN03 (in this or-
tails (and does not contradict) “The girl is eating         der from left to right). The task-specific accuracy
food” will be reflected in a comparatively low con-         values are distinguished by line color.
ditional perplexity of Completion given Prompt1                We may observe, first of all, that training on the
and a correspondingly high conditional perplexity           argument corpus effectively improves conclusion-
of Completion given Prompt2 or Prompt3.                     completion-skill. In all three test datasets, the ac-
   Let us describe this procedure in more gen-              curacy in the split and extended tasks increases as
eral terms and consider a textual classification            the models are trained on more and more argu-
problem with categories k = 1 . . . N . To clas-            ment schemes, far exceeding the base model’s per-
sify a given input X, one constructs n alternative          formance. Once the model has seen all schemes
prompts p1 , . . . pn and m alternative completions         (TRAIN03), accuracy levels reach 100% for in-
c1 , . . . , cm (N = m·n), such that each pair (pi , cj )   domain and 70%-90% for out-of-domain tests.
corresponds to a class k of the classification prob-        However, the TRAIN01 and TRAIN02 models do
lem, i.e.,                                                  also generate more incorrect completions than the
                                                            BASE model (inverted task). But the frequency
              $L : (p_i, c_j) \mapsto \{1 \ldots N\}.$             of such incorrect completions increases much less
                                                            than the frequency of correct ones (the gap between
   In the above pizza example, we have N = n =              blue and gray curve widens), and it actually
3 and m = 1. Moreover, let $\mathrm{PP}_L(c|p)$ refer       falls back to almost zero with the TRAIN03
to the conditional perplexity of the completion c           model. Out-of-domain performance of the models
given prompt p according to the language model              (right-hand plot) is qualitatively similar and only
L. Rather than directly using this conditional per-         slightly less strong than in-domain performance
plexity as a prediction score (as for instance in           (left-hand and middle plot). The models trained
Shwartz et al., 2020), which doesn’t account for            on arguments from a given domain are able to ef-
varying ‘prima facie’ or ‘prior’ perplexities of al-        fectively exercise the reasoning skill thus acquired
ternative completions, we consider the degree to            in other domains, and have hence gained topic-
which prompting the model L with p changes the              neutral, universal reasoning ability.
perplexity of c, i.e.                                          The strong performance of TRAIN01 models,
                                                            averaged over all schemes, suggests that significant
   $\mathrm{relPP}_L(c, p) := \frac{\mathrm{PP}_L(c|p)}{\mathrm{PP}_L(c)}.$      transfer learning occurs and that training on
                                                            a few argument schemes positively affects perfor-
   In analogy to Bayesian confirmation theory, this         mance on other schemes, too. To further investi-
might be termed a (perplexity-based) relevance              gate this issue, Table 2 contrasts (a) the models’
measure, as opposed to a measure of absolute con-           accuracy on schemes they have not been trained
firmation (cf. Carnap, 1950, pp. 346-48). We now            on – averaged over TRAIN01 and TRAIN02 mod-
[Figure 3 plot: three panels (test = out of sample, test = paraphrased, test = out of domain). x-axis: model (base, train01, train02, train03); y-axis: accuracy (0.0 to 1.0); one line per task (split, extended, inverted).]

Figure 3: Accuracy of four model versions in three conclusion completion tasks and on different test datasets (out
of sample, paraphrased, out of domain).
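
The relative-perplexity classification described in Section 4.2 can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_lm` is a hard-coded stand-in that returns per-token log-probabilities, not GPT-2, and the function names (`perplexity`, `rel_perplexity`, `classify`) are hypothetical. With a real model, the log-probabilities would come from the fine-tuned language model, and the unconditional perplexity would need some neutral context such as a beginning-of-text token.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# (prompt, completion) -> log-probability of each completion token given what precedes it
TokenLogProbs = Callable[[str, str], List[float]]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is the exponential of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rel_perplexity(lm: TokenLogProbs, completion: str, prompt: str) -> float:
    """relPP(c, p) = PP(c | p) / PP(c): how much the prompt changes the perplexity of c."""
    return perplexity(lm(prompt, completion)) / perplexity(lm("", completion))


def classify(lm: TokenLogProbs, labelled_pairs: Dict[Tuple[str, str], str]) -> str:
    """Return the label of the (prompt, completion) pair with minimal relative perplexity."""
    best_pair = min(labelled_pairs, key=lambda pc: rel_perplexity(lm, pc[1], pc[0]))
    return labelled_pairs[best_pair]


def toy_lm(prompt: str, completion: str) -> List[float]:
    """Hard-coded stand-in scores (assumption): 'Therefore,' makes the completion likelier."""
    if prompt == "":
        per_token = -2.0
    elif prompt.endswith("Therefore,"):
        per_token = -1.0
    else:
        per_token = -3.0
    return [per_token] * len(completion.split())


COMPLETION = "the girl is eating food."
PAIRS = {
    ("The girl is eating a pizza. Therefore,", COMPLETION): "entail",
    ("The girl is eating a pizza. This rules out that", COMPLETION): "contradict",
    ("The girl is eating a pizza. This neither entails nor rules out that",
     COMPLETION): "independent",
}

if __name__ == "__main__":
    print(classify(toy_lm, PAIRS))
```

Dividing by the unconditional perplexity is what separates this score from plain conditional perplexity: a completion that is unlikely in any context is not penalized for that alone.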

                            BASE                (a) schemes not in training data (TR01–02)                 (b) trained on schemes (TR01–03)
Task                                           o-o-sample      paraphr.           o-o-domain           o-o-sample           paraphr.         o-o-domain
split                       21.4                  85.4             82.0                  69.4               99.9               99.2             89.0
extended                    10.7                  60.3             59.3                  45.8               99.9               99.2             76.2
inverted                     1.5                  16.9             18.0                  22.1                0.0                0.0              3.2

Table 2: Accuracy of models in three conclusion completion tasks and on different test datasets (out of sample,
paraphrased, out of domain). Columns report, separately, the performance (a) on schemes the model has not been
trained on, and (b) on schemes that are covered by the model’s training data.
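
The completions scored in Table 2 are generated with top-p nucleus sampling (p = 0.9, see Section 4.2). The following sketch shows the general technique on a toy next-token distribution; it is not the sampling code used in the experiments, and `nucleus_filter` and `sample_top_p` are hypothetical helper names.

```python
import random
from typing import Dict, Optional


def nucleus_filter(probs: Dict[str, float], p: float = 0.9) -> Dict[str, float]:
    """Keep the smallest set of most probable tokens whose mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample_top_p(probs: Dict[str, float], p: float = 0.9,
                 rng: Optional[random.Random] = None) -> str:
    """Draw one token from the renormalized nucleus."""
    rng = rng or random.Random()
    nucleus = nucleus_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Toy next-token distribution (made-up numbers for illustration only).
    next_token = {"not": 0.6, "a": 0.25, "the": 0.1, "very": 0.05}
    print(nucleus_filter(next_token))
    print(sample_top_p(next_token, rng=random.Random(0)))
```

Because the low-probability tail is cut off before sampling, repeated runs vary among plausible continuations only, which is why accuracies are reported as frequencies over sampled completions.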

els – with (b) their accuracy on schemes that are                                        performance on unknown schemes. Figure 4 re-
instantiated in their respective training corpus –                                       veals, first of all, that even the BASE models (only
averaged over TRAIN01, TRAIN02, and TRAIN03                                              pre-training, no fine-tuning) display a significant
models. The upshot is that trained models per-                                           ability to correctly complete conclusions of some
form way more strongly than the base model not                                           kinds of arguments. For example, GPT-2-762M
only on argument schemes they’ve been trained on,                                        achieves 50% accuracy (split task) in completing
but also on those schemes they haven’t seen yet.                                         contrapositions, 30% accuracy in completing gen-
We take this to be a promising result as it strength-                                    eralized modus ponens, and still 20% accuracy in
ens the analogy between teaching critical think-                                         completing disjunctive syllogism and dilemma ar-
ing and training language models: generic inter-                                         guments. These findings further corroborate the
mediary pre-training on high-quality texts that ex-                                      hypothesis that NLMs learn (basic) linguistic and
emplify a specific, basic reasoning skill – namely,                                      reasoning skills “on the fly” by training on a large
simple deductive argumentation – improves other,                                         generic corpus (Radford et al., 2019).
more complex reasoning skills.
                                                                                            In addition, the matrix plot (Figure 4) demon-
   Figure 4 gives further insights by differentiating                                    strates that some types of arguments are much
evaluation results according to argument type. Its                                       easier to master, given training on the core and
subplots are arranged in a grid that mirrors the or-                                     possibly base schemes, than others. For in-
ganisation of argument schemes in Figure 1. Each                                         stance, complex_predicates variants of general-
subplot visualizes the ability of the models to cor-                                     ized modus ponens or de_morgan variants of gen-
rectly complete arguments of the corresponding                                           eralized modus tollens seem to be easily mas-
scheme (given the out-of-sample test dataset). Ac-                                       tered by the TRAIN01 model. In contrast, even
cordingly, the left-hand plot in Figure 3 in effect                                      the TRAIN02 model, which has been fine-tuned on
averages all curves in Figure 4. Reported accu-                                          all eight base schemes, struggles with the nega-
racy values that fall within gray background areas                                       tion_variants of generalized modus ponens (gen-
are attained by models which have seen the cor-                                          erating substantially more incorrect than correct
responding scheme during training. Vice versa,                                           completions). All in all, the picture that emerges
thick lines on white background visualize model                                          is plausible: Generalization towards novel types
[Figure 4 plot: grid of subplots. Columns: Generalized modus ponens, Generalized Contraposition, Hypothetical Syllogism 1, Hypothetical Syllogism 2, Hypothetical Syllogism 3, Generalized modus tollens, Disjunctive Syllogism, Generalized Dilemma. Rows: base_scheme, negation_variant, complex_predicates, de_morgan. x-axis: model (BASE, TR01, TR02, TR03); y-axis: accuracy (0.0 to 1.0). Legend: task: split, task: extended, task: inverted; not trained on scheme, trained on scheme.]

 Figure 4: Accuracy of conclusion completions (three tasks) for instances of different argument schemes (see
 Figure 1) and four model versions.
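
The accuracies in Figure 4 rest on the three conclusion completion tasks of Section 4.2. The sketch below shows how one argument could yield the three prompt/target pairs, following the worked example with the [..]E and [..]S markers. The `Item` structure, `build_tasks`, and the prefix-match criterion in `is_hit` are assumptions made for illustration; the paper does not specify how generated text is matched against the target.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass
class Item:
    context: str    # premises plus the conclusion up to (excluding) negator and article
    negated: bool   # whether the final predicate is preceded by a negator
    predicate: str  # final predicate, e.g. "classmate of Theodore"


def build_tasks(item: Item) -> Dict[str, Tuple[str, str]]:
    """Return (prompt, target completion) for the split, extended and inverted tasks."""
    correct_prefix = "not a" if item.negated else "a"
    complement_prefix = "a" if item.negated else "not a"
    return {
        # split: only the final predicate is missing
        "split": (f"{item.context} {correct_prefix}", item.predicate),
        # extended: predicate and, if present, the negator are missing
        "extended": (item.context, f"{correct_prefix} {item.predicate}"),
        # inverted: same prompt, but the target is the complement of the correct completion
        "inverted": (item.context, f"{complement_prefix} {item.predicate}"),
    }


def is_hit(generated: str, target: str) -> bool:
    """Assumed matching rule: the generated text starts with the target."""
    return generated.strip().startswith(target)


def accuracy(generations: Sequence[str], targets: Sequence[str]) -> float:
    hits = sum(is_hit(g, t) for g, t in zip(generations, targets))
    return hits / len(targets)


EXAMPLE = Item(
    context=("First premise: Every workmate of Brad is a classmate of James. "
             "Second premise: Every classmate of James is not a classmate of Theodore. "
             "So, necessarily, everyone who is a workmate of Brad is"),
    negated=True,
    predicate="classmate of Theodore",
)

if __name__ == "__main__":
    for task, (prompt, target) in build_tasks(EXAMPLE).items():
        print(task, "|", prompt[-40:], "|", target)
```

A strong model scores high on split and extended and low on inverted, since a hit on the inverted task means the model asserted the opposite of the valid conclusion.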

 of argument appears to be comparatively difficult whenever the new scheme involves negations
 (compare 2nd and 4th row in Figure 4 with 3rd row). This is consistent with the finding that some
 NLMs seemingly fail to understand simple negation (Kassner and Schütze, 2020; Talmor et al.,
 2020).
    The results reported so far suggest that reasoning skills acquired on (a subset of) the artificial
 argument corpus generalize rather well – both to other domains and other types of arguments. We
 have further cross-checked these statistical findings by letting the models complete a conclusion
 of a simple manually authored argument:

                             [Hermes] Every philosopher is mortal.
                             Hermes is not mortal. Therefore, Hermes . . .

 This text differs syntactically and semantically from any argument possibly contained in the
 artificial argument corpus (where predicates have always the form “is/being a Y of X,” and no domain
 covers philosophers or mortality). Obviously, it follows that Hermes “is not a philosopher.” The
 argument instantiates generalized modus tollens, which is not a core scheme in TRAIN01. Can
 TRAIN01-models nonetheless fill out the unfinished argument in a sensible way?

                                                                        762M     762M     117M
 Completion                                                    Type     TR01     BASE     TR01
 . . . is not a philosopher.                                    ?        100        2        2
 . . . is immortal.                                             =          0       12        0
 . . . is not a critic.                                         ◦          0        0        9
 . . . is mortal.                                               †          0        8        0
 . . . is not mortal.                                           =          0        6        0
 . . . is not Hermes.                                           †          0        2        0
 . . . does not exist.                                          ◦          0        2        0
 . . . is not God.                                              ◦          0        2        0
 . . . is not a friend of Eckhardt.                             ◦          0        0        1
 . . . is not an expert of BSI Arsenal FC.                      ◦          0        0        1
 . . . is not a friend of Atalanta.                             ◦          0        0        1
 . . . is not an infrequent user of Neutrogena shampoo.         ◦          0        0        1
 others                                                                    0       66       85

 Table 3: Absolute frequency of predicted completions for the hand-written [Hermes] query by three
 different models. Completions are – relative to the premises – entailed (?), redundant (=),
 contradictory (†) or independent (◦).

    Table 3 counts and compares the most frequent
completions generated by two TRAIN01 models            sized model that profit from fine-tuning on the
(762M and 117M) and by the large untrained             AAC; the SNLI performance of the 762M parameter
BASE model (762M). Only the 762M model                 model deteriorates instead. This might be due to a
trained on the core schemes reliably predicts          coincidentally strong performance of the
the correct conclusion. The large BASE model           corresponding BASE model (see Figure 7), or suggest
instead repeats a premise or even generates a          that the large model, unlike the smaller ones,
contradiction, whereas the small TRAIN01 model         has already learned during pre-training whatever is
(117M) changes the topic. This is consistent with      of relevance for SNLI in the AAC. (Further exper-
and illustrates our previous findings. Remarkably,     iments, preferably involving more model versions,
although both the small and the large TRAIN01          are required to clarify this.)
models have been fine-tuned on precisely the same
arguments, only the large model seems to correctly     Argument Reasoning Comprehension Task
recognize the logical structure of the [Hermes] ar-    The Argument Reasoning Comprehension (ARC)
gument. Generic language modeling skill, it is         task (Habernal et al., 2018) assesses the ability to
suggested, facilitates the successful generalization   identify a missing premise in an informally recon-
of learned argument patterns beyond the templates      structed and not necessarily deductively valid ar-
used to create the synthetic training data.            gument. It is a multiple-choice task where two al-
                                                       ternative sentences are provided, one of which is
   To further understand transfer learning ef-
                                                       the missing premise.
fects, we next examine whether intermediary pre-
training on the artificial argument corpus improves       We design and apply specific templates to con-
zero-shot performance in other NLP reasoning           struct prompts and completions, and calculate rel-
tasks (i.e., without task-specific fine-tuning).       ative perplexity as described in Section 4.2.
                                                          As shown in Figure 5, we find no evidence of
GLUE AX The GLUE datasets (Wang et al.,                transfer learning effects with respect to ARC.
2018) represent standard benchmarks for natural
                                                       LogiQA LogiQA (Liu et al., 2020) is a col-
language understanding (NLU). We evaluate our
                                                       lection of nearly 9,000 multiple-choice questions
models’ NLU skill in terms of accuracy on the cu-
                                                       (four alternative answers each) used in critical
rated GLUE diagnostics dataset (Figure 5).
                                                       thinking assessments. These questions span the
   Training on the artificial argument corpus sub-
                                                       whole range of critical thinking tasks.
stantially boosts accuracy on the GLUE diagnos-
                                                          We design and apply specific templates to con-
tics. Accuracy increases by at least 5 and up to 17
                                                       struct prompts and completions (one prompt and
percentage points, depending on model size. Re-
                                                       four completions per question), and use perplexity
markably, training on the core scheme alone suf-
                                                       scores to predict classifications as described above
fices to bring about these improvements.
                                                       (Section 4.2).
   This is a major finding and our clearest evidence      As can be seen from Figure 5, training on the
so far that training on the AAC involves substantial   artificial argument corpus has no effect whatsoever
transfer learning effects.                             on the ability of the models to handle the critical
SNLI The SNLI dataset (Bowman et al., 2015)            thinking tasks collected in LogiQA.
is another standard benchmark for NLI. Like the
                                                       6   Conclusion
GLUE dataset, it consists in pairs of sentences
which entail, contradict, or don’t bear on each        This paper has taken a first step towards the cre-
other. The assessment of our models with re-           ation of a critical thinking curriculum for neural
spect to SNLI data proceeds in close analogy to        language models. It presents a corpus of deduc-
the GLUE benchmark.                                    tively valid, artificial arguments, and uses this ar-
   The results, reported in Figure 5, are consistent   tificial argument corpus to train and evaluate GPT-
with, albeit less definite than our previous find-     2. The observation of strong transfer learning ef-
ings for the GLUE benchmark: First and foremost,       fects/generalization is its main finding: Training a
fine-tuning on all schemes (TRAIN03) improves          model on a few central core schemes allows it to
the performance by up to 8 percentage points.          accurately complete conclusions of different types
Training on fewer schemes is slightly less effec-      of arguments, too. The language models seem
tive. However, it is only the small and medium         to connect and to generalize the core argument
    [Figure 5 plot. Panels: GLUE AX, SNLI, ARC Task, LogiQA. x-axis: model (train01, train02, train03);
    y-axis: gain in accuracy (rel. to base); legend: model_size (117M, 345M, 762M).]

    Figure 5: Gains in accuracy due to fine-tuning on the AAC (accuracy TRAIN model – accuracy BASE model) for
    differently sized models and different NLP benchmark tasks: the GLUE diagnostics data, the SNLI dataset, the
    argument reasoning comprehension (ARC) benchmark, and the LogiQA dataset.
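
    The zero-shot evaluation summarized in Figure 5 can be illustrated with a minimal sketch. It assumes that per-token log-probabilities of each templated completion are already available from a language model, and it uses plain perplexity as a stand-in for the relative perplexity defined in Section 4.2, which is not reproduced here. All function names and the toy numbers are hypothetical and serve only to show the shape of the computation: pick the completion with the lowest perplexity, then report the gain as accuracy of the TRAIN model minus accuracy of the BASE model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one completion from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def predict(completion_logprobs):
    """Return the index of the completion with the lowest perplexity.

    Assumption: plain perplexity is used here; the paper's relative
    perplexity (Section 4.2) may normalize these scores differently.
    """
    scores = [perplexity(lp) for lp in completion_logprobs]
    return min(range(len(scores)), key=scores.__getitem__)


def accuracy(predictions, gold):
    """Share of questions whose predicted index matches the gold index."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def gain_in_points(acc_train, acc_base):
    """Gain in accuracy in percentage points (TRAIN model minus BASE model)."""
    return 100.0 * (acc_train - acc_base)


# Hypothetical toy item with two candidate completions (as in ARC):
# the first completion is assigned higher token probabilities.
item = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.25), math.log(0.25)],
]
chosen = predict(item)
```

    For a task with four alternatives, as in LogiQA, the same `predict` function would simply receive four lists of log-probabilities per question.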

    schemes in a correct way. Moreover, the models are equally able to apply learned
    argument patterns beyond the domain they have been trained on, and there is evidence
    that generic language modeling skill facilitates the successful generalization of
    learned argument patterns. These findings are consistent with previous work on rule
    reasoning (Clark et al., 2020). They suggest that there exist (learning-wise)
    fundamental reasoning skills in the sense that generic intermediary pre-training on
    texts which exemplify these skills leads to spill-over effects and can improve
    performance on a broad variety of reasoning tasks. The synthetic argumentative texts
    might be a good starting point for building such a “critical thinking curriculum for
    language models.”

       Moreover, the trained models have been tested on different reasoning benchmarks.
    We obtain clear and promising results for the GLUE and SNLI benchmarks. But training
    on the argument corpus doesn’t affect the performance with regard to the
    semantically more demanding Argument Reasoning Comprehension task or the critical
    thinking assessment compiled in LogiQA.

       Our work suggests different directions for advancing the approach adopted in this
    paper and further improving the general reasoning skill of neural language models:

      • The syllogistic argument text corpus might be complemented with corpora of
        arguments that instantiate different kinds of correct schemes, e.g.,
        propositional inference schemes, modal schemes, argument schemes for practical
        reasoning, complex argument schemes with intermediary conclusions or assumptions
        for the sake of the argument, etc. (Technically, we provide the infrastructure
        for doing so, as all this might be achieved through adjusting the argument
        corpus configuration file.)
      • To succeed in NLI tasks, it doesn’t suffice to understand ‘what follows.’ In
        addition, a system needs to be able to explicitly discern contradictions and non
        sequiturs (relations of logical independence). This suggests that the artificial
        argument corpus might be fruitfully supplemented with corpora of correctly
        identified aporetic clusters (Rescher, 1987) as well as corpora containing
        correctly diagnosed fallacies.
      • In addition, the idea of curriculum learning for ML (Bengio et al., 2009) might
        be given a try. Accordingly, a critical thinking curriculum with basic exemplars
        of good reasoning would not only be used to fine-tune a pre-trained model, but
        would be employed as a starting point for training a language model from
        scratch.

       Natural language templating is a fundamental technique used throughout this
    paper: both in constructing the artificial argument corpus and in transforming the
    NLP benchmark datasets into text that can be processed by language models. The
    concrete templates applied have been designed in a trial-and-error process. It is
    far from clear that these represent optimal choices for effectively eliciting a
    language model’s skills. Still, following Jiang et al. (2020), it seems of great
    importance to gain a more systematic understanding of different templating
    strategies and their effects on metrics based on accuracy and perplexity.

       In conclusion, designing a critical thinking curriculum for neural language
    models seems to be a promising and worthwhile research program to pursue.
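
    As a companion to the training parameters listed in Appendix A, the following minimal sketch records them in one place and derives the effective batch size they imply. It rests on one assumption that the appendix does not settle: that "batch size = 2" is meant per GPU. The class and method names are hypothetical and are not part of the HuggingFace implementation; the example corpus size of 72,000 is likewise only an illustrative value.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Training parameters as stated in Appendix A (field names are hypothetical)."""

    num_gpus: int = 8
    epochs: int = 2
    batch_size_per_gpu: int = 2  # assumption: the stated batch size is per GPU
    learning_rate: float = 5e-5
    gradient_accumulation_steps: int = 2

    def effective_batch_size(self) -> int:
        """Examples contributing to one optimizer update under the per-GPU assumption."""
        return (
            self.num_gpus
            * self.batch_size_per_gpu
            * self.gradient_accumulation_steps
        )

    def optimizer_steps(self, num_examples: int) -> int:
        """Total optimizer updates over all epochs for a corpus of num_examples items."""
        steps_per_epoch = math.ceil(num_examples / self.effective_batch_size())
        return steps_per_epoch * self.epochs


setup = TrainingSetup()
# Illustrative corpus size only; not a claim about the actual training runs.
steps = setup.optimizer_steps(72_000)
```

    If the stated batch size were instead a global one, the effective batch size would shrink by a factor of eight, so the per-GPU reading should be checked against the released training scripts before relying on these numbers.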
A   Appendix: Training Parameters

We train the models on 8 GPUs for 2 epochs with batch size = 2, learning rate =
5 × 10^-5, gradient accumulation steps = 2, and default parameters of the HuggingFace
implementation otherwise (Wolf et al., 2019).

B   Appendix: Performance Metrics for Differently Sized Training Sets

Figure 6 displays accuracy values on conclusion completion tasks for models trained
on differently sized datasets.
   Figure 7 reports perplexity and NLU accuracy metrics for models trained on
differently sized datasets.

References

Amanda Askell. 2020. GPT-3: Towards renaissance models. In Daily Nous Blog:
  Philosophers On GPT-3.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum
  learning. In Proceedings of the 26th Annual International Conference on Machine
  Learning, ICML ’09, pages 41–48, New York, NY, USA. ACM.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Çelikyilmaz,
  and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph
  construction. In Proceedings of the 57th Annual Meeting of the Association for
  Computational Linguistics (ACL).

Tracey Bowell and Gary Kemp. 2014. Critical Thinking: A Concise Guide, 4th edition.
  Routledge, London.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015.
  A large annotated corpus for learning natural language inference. In Proceedings of
  the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
  Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini
  Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya
  Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark
  Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
  Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.
  2020. Language models are few-shot learners.

Georg Brun and Gregor Betz. 2016. Analysing practical argumentation. In Sven Ove
  Hansson and Gertrude Hirsch-Hadorn, editors, The Argumentative Turn in Policy
  Analysis. Reasoning about Uncertainty, pages 39–77. Springer, Cham.

Rudolf Carnap. 1950. Logical Foundations of Probability. University of Chicago Press,
  Chicago.

J. Cheng, M. Bernstein, C. Danescu-Niculescu-Mizil, and J. Leskovec. 2017. Anyone can
  become a troll: Causes of trolling behavior in online discussions. CSCW:
  Proceedings of the Conference on Computer-Supported Cooperative Work, 2017, pages
  1217–1230.

Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft
  reasoners over language. arXiv preprint arXiv:2002.05867v2.

Richard Feldman. 2014. Reason and Argument. Pearson, Harlow.

Alec Fisher. 2001. Critical Thinking: An Introduction. Cambridge University Press,
  Cambridge.

Martin Gerlach and Francesc Font-Clos. 2018. A standardized Project Gutenberg corpus
  for statistical analysis of natural language and quantitative linguistics. CoRR,
  abs/1812.08092.

Ben Gilburt. 2019. Examining gender bias in OpenAI’s GPT-2 language model.
  hackernoon.com.

Nicolas Gontier, Koustuv Sinha, Siva Reddy, and Christopher Pal. 2020. Measuring
  systematic generalization in neural proof generation with transformers.
Figure 6: Accuracy on three conclusion completion tasks as a function of training corpus size.

    [Figure 7 plot. Panels: Perplexity Wiki103, GLUE AX, SNLI. x-axis: size training set (0, 9K, 18K, 36K, 72K);
    y-axis: perplexity (first panel), accuracy (second and third panels); legend: model_size (762M, 345M, 117M),
    train (train03, train02, train01).]

                                             Figure 7: Perplexity and NLI metrics as a function of training corpus size.

Radu Cornel Guiaşu and Christopher W Tindale. 2018. Logical fallacies and invasion
  biology. Biology & Philosophy, 33(5-6):34.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and
  Noah A Smith. 2018. Annotation artifacts in natural language inference data. In
  Proceedings of the 2018 Conference of the North American Chapter of the Association
  for Computational Linguistics: Human Language Technologies, Volume 2 (Short
  Papers), pages 107–112.

Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018. The argument
  reasoning comprehension task: Identification and reconstruction of implicit
  warrants. In Proceedings of the 2018 Conference of the North American Chapter of
  the Association for Computational Linguistics: Human Language Technologies,
  NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long
  Papers), pages 1930–1940. Association for Computational Linguistics.

Sven Ove Hansson. 2004. Fallacies of risk. Journal of Risk Research, 7(3):353–360.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what
  language models know? Transactions of the Association for Computational
  Linguistics, 8:423–438.

Daniel Kahneman. 2011. Thinking, fast and slow, 1st edition. Farrar, Straus and
  Giroux, New York.

Nora Kassner, Benno Krojer, and Hinrich Schütze. 2020. Are pretrained language models
  symbolic reasoners over knowledge?

Nora Kassner and Hinrich Schütze. 2020. Negated and misprimed probes for pretrained
  language models: Birds can talk, but cannot fly.

Joe Lau and Jonathan Chan. 2020. Critical thinking web.
  https://philosophy.hku.hk/think.

D. D. Lewis, Y. Yang, T. Rose, and F. Li. 2004. RCV1: A new benchmark collection for
  text categorization research. Journal of Machine Learning Research, 5:361–397.

Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over
  paragraph effects in situations. Proc. MRQA Workshop (EMNLP’19).