Transformer-based End-to-End Question Generation

Luis Enrico Lopez*, Diane Kathryn Cruz*, Jan Christian Blaise Cruz*, and Charibeth Cheng
Center for Language Technologies (CeLT)
College of Computer Studies (CCS)
De La Salle University
{luis lopez, diane cruz, jan christian cruz, charibeth.cheng}@dlsu.edu.ph

Abstract

Question generation (QG) is a natural language generation task where a model is trained to ask questions corresponding to some input text. Most recent approaches frame QG as a sequence-to-sequence problem and rely on additional features and mechanisms to increase performance; however, these often increase model complexity and can rely on auxiliary data unavailable in practical use. A single Transformer-based unidirectional language model leveraging transfer learning can be used to produce high-quality questions while disposing of additional task-specific complexity. Our QG model, finetuned from GPT-2 Small, outperforms several paragraph-level QG baselines on the SQuAD dataset by 0.95 METEOR points. Human evaluators rated its questions as easy to answer, relevant to their context paragraph, and corresponding well to natural human speech. We also introduce a new set of baseline scores on the RACE dataset, which has not previously been used for QG tasks. Further experimentation with varying model capacities and with datasets containing non-identification-type questions is recommended in order to further verify the robustness of pretrained Transformer-based LMs as question generators.

1 Introduction

Question Generation (QG) remains a relevant task in NLP. The ability to ask meaningful questions provides evidence of comprehension within an Artificial Intelligence (AI) model (Nappi, 2017), which can have implications for other understanding-based tasks.

Many studies have produced robust models with good performance for QG in recent years. The most widely used techniques are deep learning approaches involving Sequence-to-Sequence (Seq2Seq) (Sutskever et al., 2014) models. These approaches use two networks, usually LSTMs (Sutskever et al., 2014) but very recently also Transformers (Vaswani et al., 2017; Dong et al., 2019), to encode the source context paragraph and to decode the embedded information into an output question, respectively.

Further works that improve on standard Seq2Seq-based QG models use extra mechanisms, extra features, or both. These include the use of extra linguistic features (Zhou et al., 2017) or the introduction of answer-awareness (Zhao et al., 2018a; Du et al., 2017; Dong et al., 2019), which uses the answer to the desired question, or the position of the answer within the context paragraph, as additional features. A combination of these techniques provides the basis for state-of-the-art QG in recent years.

While these models yield high performance, they rely on the additional cost of an encoder-decoder setup, as well as extra features and mechanisms that make them harder to train and reproduce.

We propose that a robust question generation system can be created using only a single pretrained Transformer-based language model, without the use of additional mechanisms, answer metadata, and extensive features. This method, while simpler, produces results able to outperform established baselines on the same task.

We benchmark our approach on the SQuAD (Rajpurkar et al., 2016) and RACE (Lai et al., 2017) Question Answering (QA) datasets, repurposed for QG use, in line with the methods of previous studies.

Generated questions were evaluated with standard language generation metrics as well as human preference annotation based on certain question characteristics. In addition, a variety of analyses were performed on the model and its output in order to isolate performance indicators and identify the model's weaknesses and failure modes.

* Equal contribution.
In addition, we introduce a set of baseline scores on the RACE dataset, which has not previously been used in QG research.

2 Methodology

2.1 Data Preparation

The QG model was finetuned on two datasets: SQuAD (Rajpurkar et al., 2016) and RACE (Lai et al., 2017).

Version 1.1 of the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) contains context paragraphs, each with sets of questions and corresponding answer spans related to the contents of these paragraphs; in total, SQuAD contains more than 100,000 crowdsourced questions. While originally intended for the task of question answering, previous work on question generation (Du et al., 2017; Zhao et al., 2018a) has repurposed SQuAD as a training and test dataset, designating the questions as the target output rather than the answer spans.

The Large-scale ReAding Comprehension Dataset From Examinations (RACE) (Lai et al., 2017) takes its context paragraphs and questions from English reading comprehension tests written for Chinese middle and high school students. The primary difference setting RACE apart from SQuAD is its use of multiple-choice questions instead of SQuAD's identification-type questions. Each question in RACE is paired with a set of four choices, as on a multiple-choice test.

As GPT-2 was pretrained to perform language modeling, it must be finetuned similarly. The two QG datasets were therefore formatted to resemble language modeling input: each dataset was transformed into a continuous body of text, and each training example consists of a context paragraph and its associated question(s) joined into a single continuous sequence with a special delimiter in between. For the RACE-trained model, the corresponding choices for each question are also appended after the question, with A. used as the delimiter between question and choices, as well as between choices. Training examples are separated by the newline character \n. Figure 2 shows a single training example in the context-question form.

There can be multiple ways to perform this transformation from the datasets' original structured representation (JSON for both SQuAD and RACE) into continuous, language modeling-ready text. Experiments were performed on two identified factors in formatting this data: the delimiter used, and the representation method for multiple questions per context paragraph.

Figure 1: A sample training example for question generation training, using the ARTIFICIAL delimiter and the AQPL format. The context, delimiter, and questions are highlighted in red, green, and blue respectively. Text adapted from the SQuAD dataset (Rajpurkar et al., 2016).

    Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24-10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50. [SEP] Which NFL team represented the AFC at Super Bowl 50? [SEP] Where did Super Bowl 50 take place? [SEP] What color was used to emphasize the 50th anniversary of the Super Bowl?

Figure 2: A sample training example for question generation training, using the ARTIFICIAL delimiter and the OQPL format. The context, delimiter, and question are highlighted in red, green, and blue respectively. Text adapted from the SQuAD dataset (Rajpurkar et al., 2016).

    (Same context paragraph as in Figure 1) [SEP] Which NFL team represented the AFC at Super Bowl 50?

2.1.1 Delimiters

During data preparation, a delimiter is placed between each input context paragraph and output question. During training, this delimiter allows the model to properly distinguish between the context and the question, while during prediction, it can be used as a marker at the end of some input text to invoke question generation behavior in the model.
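For concreteness, the transformation from SQuAD's JSON into delimiter-separated training lines (illustrated in Figures 1 and 2) can be sketched as follows. This is an illustrative reconstruction rather than the authors' released code: the function and argument names are ours, and the snippet assumes SQuAD v1.1's JSON layout and the [SEP] delimiter by default.

    import json

    def format_squad_for_lm(squad_json_path, out_path, delimiter=" [SEP] ", one_question_per_line=True):
        # Flatten SQuAD v1.1 into newline-separated "context <delimiter> question(s)" examples
        # suitable for causal language model finetuning.
        with open(squad_json_path, encoding="utf-8") as f:
            articles = json.load(f)["data"]
        lines = []
        for article in articles:
            for paragraph in article["paragraphs"]:
                context = paragraph["context"].replace("\n", " ")
                questions = [qa["question"].replace("\n", " ") for qa in paragraph["qas"]]
                if one_question_per_line:   # OQPL: duplicate the context for every question
                    lines.extend(context + delimiter + q for q in questions)
                else:                       # AQPL: a single line with all questions appended
                    lines.append(context + delimiter + delimiter.join(questions))
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(lines))

Swapping the delimiter argument for the word Question, or for numbered-list markers, would yield the NATURAL schemes described next.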
Three different delimiting schemes were tested: 1) ARTIFICIAL, a delimiter in the form of the token [SEP]; 2) NATURAL-QUESTION, a delimiter in the form of the word Question; and 3) NATURAL-NUMBER, a delimiting scheme in the form of a numbered list, where each item is a question.

The ARTIFICIAL delimiter was not present in the original model's vocabulary, and its weights are learned from scratch during the finetuning phase. The NATURAL delimiting schemes, by contrast, rely on token weights already learned during the pretraining phase, making it possible for the model's pretrained knowledge to affect performance through these delimiters. Similar keywords have been shown to be effective in invoking certain pretrained model behaviors (e.g. TL;DR: for summarization), even in a zero-shot setting (Radford et al., 2019).

2.1.2 Questions Per Line

There can be several possible questions associated with a single paragraph. Two ways to flatten this many-to-one relationship in the formatted data were identified:

All Questions Per Line (AQPL) A single training example consists of a context paragraph with all of its associated questions placed immediately after it, separated from one another by the selected delimiter. While this avoids duplicating the context and thus results in faster training, it may result in the model no longer being able to attend to earlier tokens as its context window moves further away from the beginning of the input paragraph. This is critical in the question generation task, as information pertaining to a reference question may be found anywhere in the input paragraph. If that information is found at the beginning, outside of the model's current context window, the model may have difficulty generating the corresponding question.

One Question Per Line (OQPL) Each context paragraph is duplicated for each of its associated questions, such that each training example contains exactly one context and one question. In many cases, this alleviates the moving context window problem raised with AQPL, as the length of a single training example is reduced to the length of the input paragraph plus the length of a single associated question. However, this format does result in a longer training time, since the duplicated contexts increase the size of the final formatted dataset.

2.2 Model Setup and Finetuning

The 124 million parameter GPT-2, the smallest of the four available GPT-2 model sizes, was used as the base pretrained model. From this, a total of 12 question generation models were finetuned, each using one of the data format combinations enumerated in Section 2.1.

Each model was trained for 3 epochs using the causal language modeling loss. The Adam optimizer (Kingma and Ba, 2015) was used with an initial learning rate of 5 × 10^-4 and a linearly decreasing learning rate schedule with warmup over 10% of the total training steps.

Due to memory constraints on the available hardware, a batch size of 32 was simulated by combining an actual batch size of 2 with 16 gradient accumulation steps per minibatch. For the same reason, the larger GPT-2 sizes were not used in this study.

2.3 Model Generation

The model temperature was set to 0.6. Higher temperature values result in more randomness in generations, while lower values approach greedy behavior.

The top-p nucleus sampling method (Holtzman et al., 2020) was used with a value of p = 0.9. Top-p sampling allows for more diverse generations than a purely greedy scheme, and minimizes the occurrence of certain tokens or token spans repeating indefinitely in the generated text.

Each generation loop is terminated either when the model generates the newline character \n or when the generation reaches a length of 32 tokens. This maximum length was set manually in order to terminate generation sessions that are stuck in token span loops and do not reach the \n end-of-text token on their own.

2.4 Metrics and Evaluation

Model generations were automatically evaluated against reference data using several text generation quality metrics commonly used in QG: BLEU-1, BLEU-2, BLEU-3, BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004), and METEOR (Denkowski and Lavie, 2014). Existing implementations of these metrics (Sharma et al., 2017) were used to quantify the models' performance.
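As an illustration of how such corpus-level scores are computed (the authors used the implementations packaged by Sharma et al. (2017), so exact values may differ slightly from the simple whitespace tokenization used below), BLEU can be obtained with NLTK. Here, generated_questions and reference_questions are assumed to be parallel lists of strings; they are placeholders, not variables from the paper.

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # One reference question per generated question, whitespace-tokenized for brevity.
    references = [[ref.split()] for ref in reference_questions]
    hypotheses = [hyp.split() for hyp in generated_questions]

    smooth = SmoothingFunction().method1
    bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0), smoothing_function=smooth)
    bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)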
Model generations were also evaluated via human annotation. Volunteers were asked to rate questions on three criteria: 1) naturalness and 2) difficulty, following the methodology of Du et al. (2017), as well as 3) relevance to the context. For the RACE-trained model, generated choices were also evaluated, according only to naturalness and relevance.

3 Results and Discussion

3.1 SQuAD Automated Evaluation (AE)

The best performing model is the One Question Per Line (OQPL) model with the number delimiter, which achieves the highest scores for BLEU-2, BLEU-3, BLEU-4, and METEOR. For BLEU-1 and ROUGE-L, the OQPL model with the artificial delimiter performs best.

It is interesting to note, however, that the best OQPL models are on average only 0.6917 points better than their corresponding All Questions Per Line (AQPL) counterparts. Additionally, the maximum difference between the OQPL number-delimiter model and the other models across all automatic metrics is only 1.06. This shows that the different tags and delimiters have only a marginal effect on AE performance.

For further analysis, several features were also extracted from the best performing model's generated questions, such as question length, context paragraph length, and the longest common subsequence between the context paragraph and the generated question.

From the questions generated by the OQPL number-delimiter model, the following behaviors were observed:

• Some generated questions appear to simply extract phrases from the context paragraph and return them in question form.

• Of the 2067 generated sample questions, 19 do not end with a "?" token. Such samples are not considered "full questions" and fall under the failure modes identified in Section 3.1.2.

3.1.1 Evaluating Context-Copy

The model exhibits a form of context-copying behavior, where it extracts verbatim spans of text from context paragraphs in order to generate questions: 94.67% of generated questions contained tokens or spans of tokens from their corresponding context paragraphs. To quantify the extent of context-copying in the average question, the longest common subsequence (LCS) was calculated between each generated question and its context paragraph. On average, generated questions take 6 tokens (mean LCS length of 6.25) from the context paragraph. This context-copying behavior is usually found in identification-type questions (who, what, when, or where), which comprise 91.67% of the total generated samples.

The model appears to have learned the context-copying technique on its own, without explicit instruction during training. This may be due to the frequency of identification-type questions in the training data: 88.26% of SQuAD's training set is composed of identification-type questions.

3.1.2 Failure Modes

After testing, 19 samples from the generated question set were observed to be non-questions, falling under two identified failure modes: 1) the last three words repeat indefinitely, and 2) the generated question is cut prematurely. See Table 2 for examples of both failure modes.

In failure case 1, the generated question keeps repeating words, presumably because the attention mechanism fails to pinpoint important context words, leaving the model unable to choose the next token sensibly. To analyze this behavior, the attention weights produced while generating such an output were visualized, as shown in Figure 3.

When observing the attention scores over the context paragraph for failure case 1, we find that the attention mechanism is "confused." Attention should ideally point to specific positions in the input in order to provide useful context information. In this case, however, the attention scores are spread evenly over a number of seemingly random positions in the context paragraph when generating the token after the word "profession." This behavior can be seen in most of the attention heads and layers for this example. This failure to identify important positions in the paragraph may explain the model's inability to properly generate a question.

For failure case 2, it can be surmised that the generation is cut simply because it reached the maximum generation length while copying text from the context, a consequence of the model's context-copy mode (which it learned as its most common generation mechanism).
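For reference, the finetuning recipe of Section 2.2, applied to data formatted as in Section 2.1, corresponds roughly to the sketch below. It uses the Hugging Face transformers library, which the paper does not explicitly name, so treat the library choice, the training file name, the padding strategy, and the maximum sequence length as our assumptions rather than the authors' settings.

    import torch
    from torch.utils.data import DataLoader
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast, get_linear_schedule_with_warmup

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")                    # 124M-parameter GPT-2 Small
    tokenizer.add_special_tokens({"additional_special_tokens": ["[SEP]"]})   # ARTIFICIAL delimiter, learned from scratch
    tokenizer.pad_token = tokenizer.eos_token                                # GPT-2 has no pad token by default
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    model.resize_token_embeddings(len(tokenizer))

    lines = open("squad_oqpl_sep.txt", encoding="utf-8").read().splitlines() # output of the formatting sketch above

    def collate(batch):
        enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512)
        enc["labels"] = enc["input_ids"].clone()
        enc["labels"][enc["attention_mask"] == 0] = -100                     # no loss on padding positions
        return enc

    epochs, accum = 3, 16                                                    # batch size 2 x 16 accumulation = effective 32
    loader = DataLoader(lines, batch_size=2, shuffle=True, collate_fn=collate)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)                # Adam (Kingma and Ba, 2015)
    total_steps = (len(loader) // accum) * epochs
    scheduler = get_linear_schedule_with_warmup(optimizer, int(0.1 * total_steps), total_steps)

    model.train()
    for _ in range(epochs):
        for step, batch in enumerate(loader):
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss / accum                               # causal LM loss
            loss.backward()
            if (step + 1) % accum == 0:
                optimizer.step(); scheduler.step(); optimizer.zero_grad()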
Format   Delimiter    BLEU-1   BLEU-2   BLEU-3   BLEU-4   METEOR   ROUGE-L
AQPL     Artificial   54.83    30.13    15.72    7.31     20.53    43.88
AQPL     Number       54.98    30.31    15.79    7.57     20.69    43.83
AQPL     Question     55.03    30.46    16.20    7.74     20.71    44.039
OQPL     Artificial   55.60    31.03    16.56    7.89     21.03    44.41
OQPL     Number       55.51    31.17    16.79    8.27     21.2     44.38
OQPL     Question     55.28    30.81    16.55    8.21     21.11    44.27

Table 1: SQuAD automatic evaluation results for all models.
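The decoding configuration of Section 2.3 (temperature 0.6, nucleus sampling with p = 0.9, and the 32-token cap responsible for truncations such as failure case 2 in Table 2 below) might then be invoked as in the following sketch; the checkpoint path and the example context are placeholders of ours.

    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    checkpoint = "gpt2-qg-squad-oqpl"                 # placeholder path to a finetuned model
    tokenizer = GPT2TokenizerFast.from_pretrained(checkpoint)
    model = GPT2LMHeadModel.from_pretrained(checkpoint).eval()

    context = "Super Bowl 50 was an American football game to determine the champion of the NFL for the 2015 season."
    input_ids = tokenizer(context + " [SEP]", return_tensors="pt").input_ids

    output_ids = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.6,                              # Section 2.3
        top_p=0.9,                                    # nucleus sampling (Holtzman et al., 2020)
        max_new_tokens=32,                            # hard cap on generated question length
        pad_token_id=tokenizer.eos_token_id,
    )
    question = tokenizer.decode(output_ids[0, input_ids.shape[-1]:])
    question = question.split("\n")[0].strip()        # terminate at the first newline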

Case 1
Question: What is a profession of the profession of the profession of the profession of the profession of the profession of the profession of the profession of the profession of the profession
Context: Teaching may be carried out informally, within the family, which is called homeschooling, or in the wider community. Formal teaching may be carried out by paid professionals. Such professionals enjoy a status in some societies on a par with physicians, lawyers, engineers, and accountants (Chartered or CPA).

Case 2
Question: Which newspaper in the United States defined Southern California as including the seven counties of Los Angeles, San Bernardino, Orange, Riverside, San Diego, Ventura and Sant
Context: In 1900, the Los Angeles Times defined southern California as including "the seven counties of Los Angeles, San Bernardino, Orange, Riverside, San Diego, Ventura and Santa Barbara." In 1999, the Times added a newer county—Imperial—to that list.

Table 2: Examples of failed generations from the best performing model's failure modes.
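The token-level LCS measure used for the context-copy analysis in Section 3.1.1 (and again for RACE in Section 3.2.2) is a standard dynamic program; a minimal sketch follows, where generated_questions and context_paragraphs are assumed parallel lists of strings.

    def lcs_length(a_tokens, b_tokens):
        # Length of the longest common subsequence between two token lists,
        # computed with the classic O(len(a) * len(b)) dynamic program.
        dp = [[0] * (len(b_tokens) + 1) for _ in range(len(a_tokens) + 1)]
        for i, a in enumerate(a_tokens, start=1):
            for j, b in enumerate(b_tokens, start=1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
        return dp[-1][-1]

    # e.g. the mean LCS between each generated question and its context paragraph (6.25 on SQuAD)
    copy_lengths = [lcs_length(q.split(), ctx.split()) for q, ctx in zip(generated_questions, context_paragraphs)]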

3.1.3 Comparison with previous studies

As seen in Table 3, the model outperforms the prior RNN-based Seq2Seq works of Du and Cardie (2018) and Zhao et al. (2018b) in terms of METEOR, and is competitive in terms of ROUGE-L. It is worth noting that, in addition to a more complex model setup, Zhao et al. (2018b) use additional techniques such as a maxout pointer mechanism and a gated self-attention mechanism. Other previous work also uses answer awareness, taking the positions of the answers in the paragraph, or the answers themselves, as additional model features. The proposed model uses none of these extra features, yet still achieves robust METEOR and ROUGE-L scores in comparison with these studies.

Our model performs worse in terms of BLEU-4 and ROUGE-L, and slightly worse in terms of METEOR, when compared with the recent UniLM work of Dong et al. (2019). It is important to note that Dong et al. (2019) is also the only other work that uses a Transformer for its question generation model. Their incorporation of an answer-awareness mechanism, in addition to multiple modes of finetuning on a Seq2Seq Transformer, produces the best results in recent literature.

3.2 RACE Automated Evaluation (AE)

To analyze performance on RACE, we use the best performing configuration (OQPL with the number delimiter). After producing a set of generated questions from the test set, we observe the following:

• Only 44% of the generated questions exhibit context copying.

• Of the 1407 generated sample questions, 77 are repeated questions, albeit generated from different context paragraphs.

• Three of the generated questions do not end with a "?", ".", or " " token. These questions are not considered "full questions."

3.2.1 Question Types

The RACE dataset has two observable question types: fill-in-the-blank questions and identification-type questions. In the training dataset, the frequency of the two types is well balanced, with 52.44% fill-in-the-blank questions and 47.56% identification-type questions.

After studying the generated outputs from the best-performing RACE model, we observe that the model generated significantly more identification-type questions than fill-in-the-blank questions (65.26% against 34.75%).
Figure 3: Sample attention visualization for a generated output of failure mode 1. The example shows the context words and the attention values assigned to them when generating after the word "profession," which is highlighted in red.
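Attention maps like the one summarized in Figure 3 can be read directly off the model. Continuing from the generation sketch above (reusing model and tokenizer, and adding an import of torch), the following collects, for the most recent position, the attention placed on every earlier token, averaged over the heads of one layer; which layer and head to inspect is our choice, as the paper does not specify its visualization tooling.

    import torch

    text = context + " [SEP] What is a profession of the"      # prefix reproducing failure case 1
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    # outputs.attentions: one tensor per layer, each shaped (batch, heads, seq_len, seq_len).
    last_layer = outputs.attentions[-1][0]                      # last layer, first batch element
    scores = last_layer[:, -1, :].mean(dim=0)                   # attention from the newest position, averaged over heads
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    for tok, s in sorted(zip(tokens, scores.tolist()), key=lambda x: -x[1])[:10]:
        print(f"{tok!r:>15}  {s:.3f}")                          # tokens receiving the most attention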

Model                                    Answer-aware   BLEU-4   METEOR   ROUGE-L
Du and Cardie (2018)                     Yes            15.16    19.12    -
Zhao et al. (2018b) (s2s+a)              Yes            4.8      12.52    30.11
Zhao et al. (2018b) (s2s-a-at-mcp-gsa)   Yes            16.38    20.25    44.48
Dong et al. (2019)                       Yes            22.12    25.06    51.07
GPT-2 QG (ours)                          No             8.26     21.2     44.38

Table 3: Performance of previous works in QG with paragraph-level input compared to ours. "Answer-aware" indicates whether the model makes use of answer awareness. Du and Cardie (2018) did not report a ROUGE-L score.

We hypothesize that the model generates more identification-type questions, despite the two types being evenly distributed in the training dataset, because of its tendency toward context-copying. In addition, copying from existing text is significantly easier than producing a fill-in-the-blank question along with plausible multiple-choice answers.

3.2.2 Evaluating Context-Copy

Context-copying frequency was quantified in two ways:

• Calculating the LCS between the generated question and the given context paragraph.

• Calculating the LCS between the generated choices and the given context paragraph.

We observe that only 44% and 24% of questions exhibit context-copying based on methods 1 and 2, respectively. We hypothesize that the RACE-finetuned models perform context-copying less because answers to questions in RACE cannot simply be lifted from the context paragraph.

3.2.3 Repeating Questions

Duplicated questions account for 3.5% of RACE's training set. Many of these, while valid, are generic comprehension questions that can apply to any paragraph and that are primarily distinguished by their choices; examples include "What is the main idea of the passage?", "What is the best title for the passage?", and "Which of the following is true/false?" The model generated the latter true-or-false question for 7.9% of the test set passages, suggesting that it learned to ask these generic questions, which are independent of the passage's content.

3.2.4 Failure Modes

Only a single instance was observed where the model, instead of generating a question, opted to continue the context passage. As the model was pretrained on the language modeling task, it is possible for this to occur even after a learned delimiter is used to invoke QG behavior, though only very rarely.

3.3 SQuAD Human Evaluation (HE)

Five rounds of evaluation were performed across the entire generated set, resulting in five annotations for each question.

3.3.1 Distribution of SQuAD HE Scores

Average scores for each attribute (3.61, 2.43, and 4.20 for relevance, difficulty, and naturalness respectively) suggest that generated questions were mostly relevant to their contexts, easy to answer, and natural sounding. Low standard deviation values for all three attributes also show that there were few outlier questions.
Model   Data Format     BLEU-1   BLEU-2   BLEU-3   BLEU-4   ROUGE-L
RACE    OQPL Question   49.35    36.14    27.53    19.80    38.79
SQuAD   OQPL Number     55.51    31.17    16.79    8.27     44.43

Table 4: SQuAD and RACE results for Automatic Evaluation.

                     Relevance   Difficulty   Naturalness
Mean                 3.61        2.43         4.2
Standard deviation   0.66        0.9          0.55

Table 5: Average and standard deviation of human evaluation scores for the SQuAD-OQPL-QUESTION model.

             Relevance   Difficulty   Naturalness
Original     0.1603      0.1338       0.0698
Regrouped    0.2325      0.2405       0.1843

Table 6: Average pairwise Cohen's kappa values for each question scoring attribute.
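The pairwise agreement values in Table 6 can be reproduced from the raw annotations with scikit-learn's implementation of Cohen's kappa. The sketch below assumes a hypothetical ratings array of shape (num_questions, num_annotators) holding one attribute's integer scores; the variable names are ours.

    from itertools import combinations
    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    def mean_pairwise_kappa(ratings):
        # ratings: (num_questions, num_annotators) integer scores for a single attribute.
        annotator_pairs = combinations(range(ratings.shape[1]), 2)
        kappas = [cohen_kappa_score(ratings[:, i], ratings[:, j]) for i, j in annotator_pairs]
        return float(np.mean(kappas))

    # e.g. mean_pairwise_kappa(relevance_ratings) would yield a value like the 0.1603 entry in Table 6.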
3.3.3 High HE and Low AE

We define a low AE score as falling below the 25th percentile (BLEU-4
3.4.1 Distribution of RACE HE Scores

Average scores for each attribute (3.93, 2.09, and 4.11 for relevance, difficulty, and naturalness respectively) suggest that generated questions and choices were mostly relevant to their contexts, easy to answer, and natural sounding.

                     Relevance   Difficulty   Naturalness
Mean                 3.93        2.09         4.11
Standard deviation   0.39        0.61         0.38

Table 7: Average and standard deviation of human evaluation scores for the RACE-OQPL-NUMBER model.

3.4.2 High HE and Low AE

We again define a low AE score as falling below the 25th percentile (BLEU-4

performing the QG task without sacrificing performance.

Our model proved to be robust to datasets with increasing difficulty, yielding high human preference scores on both the SQuAD dataset and the RACE dataset. It also adapted well to data containing mixed question types beyond the typical question word (e.g. Who, What, When) construction, such as the fill-in-the-blank questions found in RACE.

Second, we provide initial baseline results for the RACE dataset in the context of QG. Previously, RACE has only been proposed for QG use, but has never been used in actual studies. We intend our reported baseline scores to serve as the starting point for future QG research involving this complex dataset.

Third, we identify limitations of existing human evaluation schemes. Human annotation results
References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA. Association for Computational Linguistics.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 13042–13054.

Xinya Du and Claire Cardie. 2018. Harvesting paragraph-level question-answer pairs from Wikipedia. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1907–1917, Melbourne, Australia. Association for Computational Linguistics.

Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1342–1352, Vancouver, Canada. Association for Computational Linguistics.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Judith S. Nappi. 2017. The importance of questioning in developing critical thinking skills. Delta Kappa Gamma Bulletin, 84(1):30.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR, abs/1706.09799.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.

Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. 2018a. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3901–3910, Brussels, Belgium. Association for Computational Linguistics.

Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. 2018b. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3901–3910, Brussels, Belgium. Association for Computational Linguistics.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. In Natural Language Processing and Chinese Computing - 6th CCF International Conference, NLPCC 2017, Dalian, China, November 8-12, 2017, Proceedings, volume 10619 of Lecture Notes in Computer Science, pages 662–671. Springer.