Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting

Yen-Chun Chen and Mohit Bansal
UNC Chapel Hill
{yenchun, mbansal}@cs.unc.edu

Abstract

Inspired by how humans summarize long documents, we propose an accurate and fast summarization model that first selects salient sentences and then rewrites them abstractively (i.e., compresses and paraphrases them) to generate a concise overall summary. We use a novel sentence-level policy gradient method to bridge the non-differentiable computation between these two neural networks in a hierarchical way, while maintaining language fluency. Empirically, we achieve the new state-of-the-art on all metrics (including human evaluation) on the CNN/Daily Mail dataset, as well as significantly higher abstractiveness scores. Moreover, by first operating at the sentence level and then the word level, we enable parallel decoding of our neural generative model that results in substantially faster (10-20x) inference speed as well as 4x faster training convergence than previous long-paragraph encoder-decoder models. We also demonstrate the generalization of our model on the test-only DUC-2002 dataset, where we achieve higher scores than a state-of-the-art model.

1 Introduction

The task of document summarization has two main paradigms: extractive and abstractive. The former method directly chooses and outputs the salient sentences (or phrases) in the original document (Jing and McKeown, 2000; Knight and Marcu, 2000; Martins and Smith, 2009; Berg-Kirkpatrick et al., 2011). The latter abstractive approach involves rewriting the summary (Banko et al., 2000; Zajic et al., 2004), and has seen substantial recent gains due to neural sequence-to-sequence models (Chopra et al., 2016; Nallapati et al., 2016; See et al., 2017; Paulus et al., 2018). Abstractive models can be more concise by performing generation from scratch, but they suffer from slow and inaccurate encoding of very long documents, with the attention model being required to look at all encoded words (in long paragraphs) for decoding each generated summary word (slow, one by one sequentially). Abstractive models also suffer from redundancy (repetitions), especially when generating multi-sentence summaries.

To address both these issues and combine the advantages of both paradigms, we propose a hybrid extractive-abstractive architecture, with policy-based reinforcement learning (RL) to bridge together the two networks. Similar to how humans summarize long documents, our model first uses an extractor agent to select salient sentences or highlights, and then employs an abstractor network to rewrite (i.e., compress and paraphrase) each of these extracted sentences. To overcome the non-differentiable behavior of our extractor and train on available document-summary pairs without saliency labels, we next use actor-critic policy gradient with sentence-level metric rewards to connect these two neural networks and to learn sentence saliency. We also avoid common language fluency issues (Paulus et al., 2018) by preventing the policy gradients from affecting the abstractive summarizer's word-level training, which is supported by our human evaluation study. Our sentence-level reinforcement learning takes into account the word-sentence hierarchy, which better models the language structure and makes parallelization possible. Our extractor combines reinforcement learning and pointer networks, which is inspired by Bello et al. (2017)'s attempt to solve the Traveling Salesman Problem.
Our abstractor is a simple encoder-aligner-decoder model (with copying) and is trained on pseudo document-summary sentence pairs obtained via simple automatic matching criteria.

Thus, our method incorporates the abstractive paradigm's advantages of concisely rewriting sentences and generating novel words from the full vocabulary, yet it adopts intermediate extractive behavior to improve the overall model's quality, speed, and stability. Instead of encoding and attending to every word in the long input document sequentially, our model adopts a human-inspired coarse-to-fine approach that first extracts all the salient sentences and then decodes (rewrites) them (in parallel). This also avoids almost all redundancy issues, because the model has already chosen non-redundant salient sentences to abstractively summarize (though adding an optional final reranker component does give additional gains by removing the few remaining across-sentence repetitions).

Empirically, our approach is the new state-of-the-art on all ROUGE metrics (Lin, 2004) as well as on METEOR (Denkowski and Lavie, 2014) for the CNN/Daily Mail dataset, achieving statistically significant improvements over previous models that use complex long-encoder, copy, and coverage mechanisms (See et al., 2017). The test-only DUC-2002 improvement also shows our model's better generalization than this strong abstractive system. In addition, we surpass the popular lead-3 baseline on all ROUGE scores with an abstractive model. Moreover, our sentence-level abstractive rewriting module also produces substantially more (3x) novel N-grams that are not seen in the input document, as compared to the strong flat-structured model of See et al. (2017). This empirically justifies that our RL-guided extractor has learned sentence saliency, rather than benefiting from simply copying longer sentences. We also show that our model maintains the same level of fluency as a conventional RNN-based model because the reward does not leak to our abstractor's word-level training. Finally, our model's training is 4x and its inference is more than 20x faster than the previous state-of-the-art. The optional final reranker gives further improvements while maintaining a 7x speedup.

Overall, our contribution is threefold: First, we propose a novel sentence-level RL technique for the well-known task of abstractive summarization, effectively utilizing the word-then-sentence hierarchical structure without annotated matching sentence-pairs between the document and ground-truth summary. Next, our model achieves the new state-of-the-art on all metrics of multiple versions of a popular summarization dataset (as well as a test-only dataset) both extractively and abstractively, without loss in language fluency (also demonstrated via human evaluation and abstractiveness scores). Finally, our parallel decoding results in a significant 10-20x speed-up over the previous best neural abstractive summarization system with even better accuracy.[1]

[1] We are releasing our code, best pretrained models, as well as output summaries, to promote future research: https://github.com/ChenRocks/fast_abs_rl

2 Model

In this work, we consider the task of summarizing a given long text document into several (ordered) highlights, which are then combined to form a multi-sentence summary. Formally, given a training set of document-summary pairs {x_i, y_i}_{i=1}^N, our goal is to approximate the function h : X → Y, X = {x_i}_{i=1}^N, Y = {y_i}_{i=1}^N, such that h(x_i) = y_i, 1 ≤ i ≤ N. Furthermore, we assume there exists an abstracting function g defined as follows: ∀s ∈ S_i, ∃d ∈ D_i such that g(d) = s, 1 ≤ i ≤ N, where S_i is the set of summary sentences in y_i and D_i the set of document sentences in x_i. That is, in any given pair of document and summary, every summary sentence can be produced from some document sentence. For simplicity, we omit the subscript i in the remainder of the paper. Under this assumption, we can further define another latent function f : X → D^n that satisfies f(x) = {d_t}_{t=1}^n and y = h(x) = [g(d_1), g(d_2), ..., g(d_n)], where [·,·] denotes sentence concatenation. This latent function f can be seen as an extractor that chooses the salient (ordered) sentences in a given document for the abstracting function g to rewrite. Our overall model consists of these two submodules, the extractor agent and the abstractor network, to approximate the above-mentioned f and g, respectively.

2.1 Extractor Agent

The extractor agent is designed to model f, which can be thought of as extracting salient sentences from the document. We exploit a hierarchical neural model to learn the sentence representations of the document and a 'selection network' to extract sentences based on their representations.
[Figure 1: Our extractor agent. The convolutional encoder computes representation r_j for each sentence. The RNN encoder (blue) computes context-aware representation h_j, and the RNN decoder (green) selects sentence j_t at time step t. With j_t selected, h_{j_t} will be fed into the decoder at time t+1.]

[Figure 2: Reinforced training of the extractor (for one extraction step) and its interaction with the abstractor. For simplicity, the critic network is not shown. Note that all the d's and s_t are raw sentences, not vector representations.]
2.1.1 Hierarchical Sentence Representation

We use a temporal convolutional model (Kim, 2014) to compute r_j, the representation of each individual sentence in the document (details in supplementary). To further incorporate global context of the document and capture the long-range semantic dependency between sentences, a bidirectional LSTM-RNN (Hochreiter and Schmidhuber, 1997; Schuster et al., 1997) is applied on the convolutional output. This enables learning a strong representation, denoted as h_j for the j-th sentence in the document, that takes into account the context of all previous and future sentences in the same document.
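To make this two-level encoding concrete, below is a minimal PyTorch-style sketch of such a hierarchical encoder. It is our own illustration under assumed hyperparameters (embedding size, filter counts, kernel sizes), not the authors' released implementation:

```python
import torch
import torch.nn as nn

class HierarchicalSentenceEncoder(nn.Module):
    """Sketch: temporal CNN over words per sentence (r_j), then a
    biLSTM over the sentence vectors (context-aware h_j)."""
    def __init__(self, vocab_size, emb_dim=128, n_filters=100,
                 kernel_sizes=(3, 4, 5), hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # One convolution per kernel size (Kim, 2014), max-pooled over
        # time and concatenated to form each sentence representation r_j.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])
        # biLSTM over the sequence of r_j's captures cross-sentence context.
        self.sent_lstm = nn.LSTM(n_filters * len(kernel_sizes), hidden_dim,
                                 bidirectional=True, batch_first=True)

    def forward(self, doc):
        # doc: LongTensor [n_sents, max_words], padded word ids of one document
        emb = self.embed(doc).transpose(1, 2)  # [n_sents, emb_dim, max_words]
        r = torch.cat([conv(emb).relu().max(dim=2).values
                       for conv in self.convs], dim=1)
        h, _ = self.sent_lstm(r.unsqueeze(0))  # [1, n_sents, 2 * hidden_dim]
        return r, h.squeeze(0)                 # r_j's and h_j's
```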
2.1.2 Sentence Selection

Next, to select the extracted sentences based on the above sentence representations, we add another LSTM-RNN to train a Pointer Network (Vinyals et al., 2015) to extract sentences recurrently. We calculate the extraction probability by:

    u_j^t = v_p⊤ tanh(W_p1 h_j + W_p2 e_t)   if j_t ≠ j_k ∀k < t
    u_j^t = −∞                               otherwise            (1)

    P(j_t | j_1, ..., j_{t−1}) = softmax(u^t)                     (2)

where the e_t's are the output of the glimpse operation (Vinyals et al., 2016):

    a_j^t = v_g⊤ tanh(W_g1 h_j + W_g2 z_t)                        (3)
    α^t = softmax(a^t)                                            (4)
    e_t = Σ_j α_j^t W_g1 h_j                                      (5)

In Eqn. 3, z_t is the output of the added LSTM-RNN (shown in green in Fig. 1), which is referred to as the decoder. All the W's and v's are trainable parameters. At each time step t, the decoder performs a 2-hop attention mechanism: it first attends to the h_j's to get a context vector e_t, and then attends to the h_j's again for the extraction probabilities.[2] This model is essentially classifying all sentences of the document at each extraction step. An illustration of the whole extractor is shown in Fig. 1.

[2] Note that we force-zero the extraction probability of already extracted sentences so as to prevent the model from using repeating document sentences and suffering from redundancy. This is non-differentiable and hence only done in RL training.
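Below is a minimal sketch of one extraction step (Eqns. 1-5), assuming the context-aware representations h_j and the decoder output z_t are given; the parameter names mirror the equations, but the code is our illustration, not the released implementation:

```python
import torch
import torch.nn.functional as F

def extraction_step(h, z_t, W_g1, W_g2, v_g, W_p1, W_p2, v_p, extracted):
    """One pointer-network extraction step (Eqns. 1-5).

    h: [n_sents, d] context-aware sentence reps; z_t: [d] decoder output;
    W_*: [d, d] matrices; v_*: [d] vectors; extracted: indices chosen so far.
    """
    # Hop 1, glimpse (Eqns. 3-5): attend once to build context vector e_t.
    a = torch.tanh(h @ W_g1.T + z_t @ W_g2.T) @ v_g         # a_j^t, [n_sents]
    alpha = F.softmax(a, dim=0)                             # Eqn. 4
    e_t = (alpha.unsqueeze(1) * (h @ W_g1.T)).sum(dim=0)    # Eqn. 5

    # Hop 2 (Eqn. 1): attend again for extraction scores, forcing the
    # logits of already-extracted sentences to -inf (footnote 2).
    u = torch.tanh(h @ W_p1.T + e_t @ W_p2.T) @ v_p
    for j in extracted:
        u[j] = float('-inf')
    return F.softmax(u, dim=0)                              # Eqn. 2
```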
2.2 Abstractor Network

The abstractor network approximates g, which compresses and paraphrases an extracted document sentence to a concise summary sentence.
We use the standard encoder-aligner-decoder (Bahdanau et al., 2015; Luong et al., 2015). We add the copy mechanism[3] to help directly copy some out-of-vocabulary (OOV) words (See et al., 2017). For more details, please refer to the supplementary.

[3] We use the terminology of copy mechanism (originally named pointer-generator) in order to avoid confusion with the pointer network (Vinyals et al., 2015).

3 Learning

Given that our extractor performs a non-differentiable hard extraction, we apply standard policy gradient methods to bridge the back-propagation and form an end-to-end trainable (stochastic) computation graph. However, simply starting from a randomly initialized network to train the whole model in an end-to-end fashion is infeasible. When randomly initialized, the extractor would often select sentences that are not relevant, so it would be difficult for the abstractor to learn to abstractively rewrite. On the other hand, without a well-trained abstractor, the extractor would get a noisy reward, which leads to a bad estimate of the policy gradient and a sub-optimal policy. We hence propose optimizing each sub-module separately using maximum-likelihood (ML) objectives: we train the extractor to select salient sentences (fit f) and the abstractor to generate shortened summaries (fit g). Finally, RL is applied to train the full model end-to-end (fit h).

3.1 Maximum-Likelihood Training for Submodules

Extractor Training: In Sec. 2.1.2, we have formulated our sentence selection as classification. However, most of the summarization datasets are end-to-end document-summary pairs without extraction (saliency) labels for each sentence. Hence, we propose a simple similarity method to provide a 'proxy' target label for the extractor. Similar to the extractive model of Nallapati et al. (2017), for each ground-truth summary sentence, we find the most similar document sentence d_{j_t} by:[4]

    j_t = argmax_i(ROUGE-L_recall(d_i, s_t))    (6)

Given these proxy training labels, the extractor is then trained to minimize the cross-entropy loss.

[4] Nallapati et al. (2017) selected sentences greedily to maximize the global summary-level ROUGE, whereas we match exactly one document sentence for each ground-truth summary sentence based on the individual sentence-level score.
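To illustrate Eqn. 6, here is a minimal sketch of this matching step, assuming a rouge_l_recall(candidate, reference) scoring helper (the helper name is ours, not from the released code):

```python
def make_proxy_labels(doc_sents, summary_sents, rouge_l_recall):
    """Eqn. 6: for each ground-truth summary sentence s_t, pick the
    document sentence d_i with the highest ROUGE-L recall as proxy label."""
    labels = []
    for s_t in summary_sents:
        scores = [rouge_l_recall(d_i, s_t) for d_i in doc_sents]
        labels.append(max(range(len(doc_sents)), key=scores.__getitem__))
    return labels  # indices j_t used as extraction targets
```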
Abstractor Training: For the abstractor training, we create training pairs by taking each summary sentence and pairing it with its extracted document sentence (based on Eqn. 6). The network is trained as a usual sequence-to-sequence model to minimize the cross-entropy loss

    L(θ_abs) = −(1/M) Σ_{m=1}^{M} log P_{θ_abs}(w_m | w_{1:m−1})

of the decoder language model at each generation step, where θ_abs is the set of trainable parameters of the abstractor and w_m the m-th generated word.
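Under teacher forcing, this objective is plain token-level cross-entropy; a minimal sketch, assuming the decoder returns per-step vocabulary logits:

```python
import torch.nn.functional as F

def abstractor_ml_loss(logits, target_ids):
    """L(theta_abs) = -(1/M) sum_m log P(w_m | w_{1:m-1}).

    logits: [M, vocab_size] decoder outputs under teacher forcing;
    target_ids: [M] gold next-word ids w_1..w_M.
    """
    return F.cross_entropy(logits, target_ids)  # mean over the M steps
```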
3.2 Reinforce-Guided Extraction

Here we explain how policy gradient techniques are applied to optimize the whole model. To make the extractor an RL agent, we can formulate a Markov Decision Process (MDP)[5]: at each extraction step t, the agent observes the current state c_t = (D, d_{j_{t−1}}), samples an action j_t ~ π_{θ_a,ω}(c_t, j) = P(j) from Eqn. 2 to extract a document sentence, and receives a reward[6]

    r(t+1) = ROUGE-L_{F1}(g(d_{j_t}), s_t)    (7)

after the abstractor summarizes the extracted sentence d_{j_t}. We denote the trainable parameters of the extractor agent by θ = {θ_a, ω} for the decoder and hierarchical encoder, respectively. We can then train the extractor with policy-based RL. We illustrate this process in Fig. 2.

[5] Strictly speaking, this is a Partially Observable Markov Decision Process (POMDP). We approximate it as an MDP by assuming that the RNN hidden state contains all past information.

[6] In Eqn. 6, we use ROUGE-recall because we want the extracted sentence to contain as much information as possible for rewriting. Nevertheless, for Eqn. 7, ROUGE-F1 is more suitable because the abstractor g is supposed to rewrite the extracted sentence d to be as concise as the ground truth s.

The vanilla policy gradient algorithm, REINFORCE (Williams, 1992), is known for high variance. To mitigate this problem, we add a critic network with trainable parameters θ_c to predict the state-value function V^{π_{θ_a,ω}}(c). The predicted value of the critic, b_{θ_c,ω}(c), is called the 'baseline', which is then used to estimate the advantage function A^{π_θ}(c, j) = Q^{π_{θ_a,ω}}(c, j) − V^{π_{θ_a,ω}}(c), because the total return R_t is an estimate of the action-value function Q(c_t, j_t). Instead of maximizing Q(c_t, j_t) as done in REINFORCE, we maximize A^{π_θ}(c, j) with the following policy gradient:

    ∇_{θ_a,ω} J(θ_a, ω) = E[∇_{θ_a,ω} log π_θ(c, j) A^{π_θ}(c, j)]    (8)

The critic is trained to minimize the square loss L_c(θ_c, ω) = (b_{θ_c,ω}(c_t) − R_t)².
This is known as the Advantage Actor-Critic (A2C), a synchronous variant of A3C (Mnih et al., 2016). For more A2C details, please refer to the supplementary.
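A condensed sketch of this update (Eqn. 8 together with the critic's square loss), assuming the log-probabilities of sampled actions, critic baselines, and returns have been collected from one rollout; this is our illustration, not the authors' training loop:

```python
import torch

def a2c_losses(log_probs, values, returns):
    """log_probs: [T] log pi(j_t | c_t) of the sampled actions;
    values: [T] critic baselines b(c_t); returns: [T] total returns R_t."""
    advantage = returns - values.detach()           # A(c_t, j_t) = R_t - b(c_t)
    policy_loss = -(log_probs * advantage).mean()   # maximizes Eqn. 8
    critic_loss = (values - returns).pow(2).mean()  # critic's square loss
    return policy_loss, critic_loss
```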
Intuitively, our RL training works as follows: if the extractor chooses a good sentence, the ROUGE match after the abstractor rewrites it would be high, and thus the action is encouraged. If a bad sentence is chosen, though the abstractor still produces a compressed version of it, the summary would not match the ground truth, and the low ROUGE score discourages this action. Our RL with a sentence-level agent is a novel attempt in neural summarization. We use RL as a saliency guide without altering the abstractor's language model, while previous work applied RL at the word level, which could be prone to gaming the metric at the cost of language fluency.[7]

[7] During this RL training of the extractor, we keep the abstractor parameters fixed. Because the input sentences for the abstractor are extracted by an intermediate stochastic policy of the extractor, it is impossible to find the correct target summary for the abstractor to fit g with the ML objective. Though it is possible to optimize the abstractor with RL, in our preliminary experiments we found that this does not improve the overall ROUGE, most likely because this RL optimizes at a sentence level and can add across-sentence redundancy. We achieve SotA results without this abstractor-level RL.
Learning how many sentences to extract: In a typical RL setting like game playing, an episode is usually terminated by the environment. In text summarization, on the other hand, the agent does not know in advance how many summary sentences to produce for a given article (since the desired length varies for different downstream applications). We make an important yet simple, intuitive adaptation to solve this: we add a 'stop' action to the policy action space. In the RL training phase, we add another set of trainable parameters v_EOE (EOE stands for 'End-Of-Extraction') with the same dimension as the sentence representation. The pointer-network decoder treats v_EOE as one of the extraction candidates and hence naturally results in a stop action in the stochastic policy, as sketched below. We set the reward for the agent performing EOE to ROUGE-1_{F1}([{g(d_{j_t})}_t], [{s_t}_t]); whereas for any extraneous, unwanted extraction step, the agent receives zero reward. The model is therefore encouraged to extract when there are still remaining ground-truth summary sentences (to accumulate intermediate reward), and to learn to stop by optimizing the global ROUGE and avoiding extra extraction.[8] Overall, this modification allows dynamic decisions on the number of sentences based on the input document, eliminates the need for tuning a fixed number of steps, and enables data-driven adaptation for any specific dataset/application.

[8] We use ROUGE-1 for the terminal reward because it is a better measure of bag-of-words information (i.e., whether all the important information has been generated), while ROUGE-L is used for intermediate rewards since it is known for better measurement of language fluency within a local sentence.
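As a sketch of how the stop action can be wired into the pointer (our illustration; the function and variable names are assumptions), v_EOE is simply appended as one more candidate before scoring:

```python
import torch

def add_stop_candidate(h, v_eoe):
    """h: [n_sents, d] context-aware sentence reps; v_eoe: [d] trainable
    'End-Of-Extraction' parameter. Returns [n_sents + 1, d]: the last index
    acts as the 'stop' action, scored like any document sentence."""
    return torch.cat([h, v_eoe.unsqueeze(0)], dim=0)
```

The extraction step sketched in Sec. 2.1.2 is then run over this extended candidate list, and sampling the last index terminates the episode.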
3.3 Repetition-Avoiding Reranking

Existing abstractive summarization systems on long documents suffer from generating repeating and redundant words and phrases. To mitigate this issue, See et al. (2017) propose the coverage mechanism and Paulus et al. (2018) incorporate tri-gram avoidance during beam search at test time. Our model already performs well without these, because the summary sentences are generated from mutually exclusive document sentences, which naturally avoids redundancy. However, we do get a small further boost to the summary quality by removing a few 'across-sentence' repetitions, via a simple reranking strategy: at the sentence level, we apply the same beam-search tri-gram avoidance (Paulus et al., 2018). We keep all k sentence candidates generated by beam search, where k is the size of the beam. Next, we rerank all k^n combinations of the n generated summary-sentence beams. The summaries are reranked by the number of repeated N-grams, the smaller the better, as sketched below. We also apply the diverse decoding algorithm described in Li et al. (2016) (which has almost no computation overhead) so as to get the above approach to produce useful, diverse reranking lists. We show how much redundancy affects the summarization task in Sec. 6.2.
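As an illustration of this reranking objective, a brute-force sketch (helper names are ours; it enumerates all k^n combinations, which is practical for the small k and n involved):

```python
from itertools import product

def count_repeated_ngrams(sents, n=3):
    """Number of duplicated n-grams across one candidate summary."""
    seen, repeats = set(), 0
    for sent in sents:
        words = sent.split()
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            repeats += gram in seen
            seen.add(gram)
    return repeats

def rerank(beams):
    """beams: n lists, each holding k candidate strings for one summary
    sentence. Returns the k^n combination with the fewest repeats."""
    return min(product(*beams), key=count_repeated_ngrams)
```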
4 Related Work

Early summarization works mostly focused on extractive and compression-based methods (Jing and McKeown, 2000; Knight and Marcu, 2000; Clarke and Lapata, 2010; Berg-Kirkpatrick et al., 2011; Filippova et al., 2015). Recent large-sized corpora attracted neural methods for abstractive summarization (Rush et al., 2015; Chopra et al., 2016). Some of the recent successes in neural abstractive models include hierarchical attention (Nallapati et al., 2016), coverage (Suzuki and Nagata, 2016; Chen et al., 2016; See et al., 2017), RL-based metric optimization (Paulus et al., 2018), graph-based attention (Tan et al., 2017), and the copy mechanism (Miao and Blunsom, 2016; Gu et al., 2016; See et al., 2017).
Our model shares some high-level intuition with extract-then-compress methods. Earlier attempts in this paradigm used Hidden Markov Models and rule-based systems (Jing and McKeown, 2000), statistical models based on parse trees (Knight and Marcu, 2000), and integer linear programming based methods (Martins and Smith, 2009; Gillick and Favre, 2009; Clarke and Lapata, 2010; Berg-Kirkpatrick et al., 2011). Recent approaches investigated discourse structures (Louis et al., 2010; Hirao et al., 2013; Kikuchi et al., 2014; Wang et al., 2015), graph cuts (Qian and Liu, 2013), and parse trees (Li et al., 2014; Bing et al., 2015). For neural models, Cheng and Lapata (2016) used a second neural net to select words from an extractor's output. Our abstractor does not merely 'compress' the sentences but generatively produces novel words. Moreover, our RL bridges the extractor and the abstractor for end-to-end training.

Reinforcement learning has been used to optimize the non-differentiable metrics of language generation and to mitigate exposure bias (Ranzato et al., 2016; Bahdanau et al., 2017). Henß et al. (2015) use Q-learning based RL for extractive summarization. Paulus et al. (2018) use RL policy gradient methods for abstractive summarization, utilizing sequence-level metric rewards with curriculum learning (Ranzato et al., 2016) or a weighted ML+RL mixed loss (Paulus et al., 2018) for stability and language fluency. We use sentence-level rewards to optimize the extractor while keeping our ML-trained abstractor decoder fixed, so as to achieve the best of both worlds.

Training a neural network to use another fixed network has been investigated in machine translation for better decoding (Gu et al., 2017a) and real-time translation (Gu et al., 2017b). They used a fixed pretrained translator and applied policy gradient techniques to train another task-specific network. In question answering (QA), Choi et al. (2017) extract one sentence and then generate the answer from the sentence's vector representation with RL bridging. Another recent work attempted a new coarse-to-fine attention approach on summarization (Ling and Rush, 2017) and found desired sharp-focus properties for scaling to larger inputs (though without metric improvements). Very recently (concurrently), Narayan et al. (2018) use RL for ranking sentences in pure extraction-based summarization, and Çelikyilmaz et al. (2018) investigate multiple communicating encoder agents to enhance the copying abstractive summarizer.

Finally, there are some loosely related recent works: Zhou et al. (2017) proposed a selective gate to improve the attention in abstractive summarization. Tan et al. (2018) used an extract-then-synthesis approach on QA, where an extraction model predicts the important spans in the passage and then another synthesis model generates the final answer. Swayamdipta et al. (2017) attempted cascaded non-recurrent small networks on extractive QA, resulting in a scalable, parallelizable model. Fan et al. (2017) added controlling parameters to adapt the summary to length, style, and entity preferences. However, none of these used RL to bridge the non-differentiability of neural models.

5 Experimental Setup

Please refer to the supplementary for full training details (all hyperparameter tuning was performed on the validation set). We use the CNN/Daily Mail dataset (Hermann et al., 2015) modified for summarization (Nallapati et al., 2016). Because there are two versions of the dataset, original text and entity-anonymized, we show results on both versions for a fair comparison to prior works. The experiments run training and evaluation for each version separately. Despite the fact that the two versions have been considered by the summarization community as two different datasets, we use the same hyper-parameter values for both dataset versions to show the generalization of our model. We also show improvements on the DUC-2002 dataset in a test-only setup.

5.1 Evaluation Metrics

For all the datasets, we evaluate standard ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004) on full-length F1 (with stemming) following previous works (Nallapati et al., 2017; See et al., 2017; Paulus et al., 2018). Following See et al. (2017), we also evaluate on METEOR (Denkowski and Lavie, 2014) for a more thorough analysis.

5.2 Modular Extractive vs. Abstractive

Our hybrid approach is capable of both extractive and abstractive (i.e., rewriting every sentence) summarization. The extractor alone performs extractive summarization. To investigate the effect of the recurrent extractor (rnn-ext), we implement a feed-forward extractive baseline ff-ext (details in supplementary).
Models                                        ROUGE-1   ROUGE-2   ROUGE-L   METEOR
Extractive Results
lead-3 (See et al., 2017)                      40.34     17.70     36.57     22.21
Narayan et al. (2018) (sentence ranking RL)    40.0      18.2      36.6        -
ff-ext                                         40.63     18.35     36.82     22.91
rnn-ext                                        40.17     18.11     36.41     22.81
rnn-ext + RL                                   41.47     18.72     37.76     22.35
Abstractive Results
See et al. (2017) (w/o coverage)               36.44     15.66     33.42     16.65
See et al. (2017)                              39.53     17.28     36.38     18.72
Fan et al. (2017) (controlled)                 39.75     17.29     36.54       -
ff-ext + abs                                   39.30     17.02     36.93     20.05
rnn-ext + abs                                  38.38     16.12     36.04     19.39
rnn-ext + abs + RL                             40.04     17.61     37.59     21.00
rnn-ext + abs + RL + rerank                    40.88     17.80     38.54     20.38

Table 1: Results on the original, non-anonymized CNN/Daily Mail dataset. Adding RL gives statistically significant improvements for all metrics over non-RL rnn-ext models (and over the state-of-the-art See et al. (2017)) in both extractive and abstractive settings with p < 0.01. Adding the extra reranking stage yields statistically significantly better results in terms of all ROUGE metrics with p < 0.01.

It is also possible to apply RL to the extractor without using the abstractor (rnn-ext + RL).[9] Benefiting from the high modularity of our model, we can make our summarization system abstractive by simply applying the abstractor on the extracted sentences. Our abstractor rewrites each sentence and generates novel words from a large vocabulary, and hence every word in our overall summary is generated from scratch, making our full model fall into the abstractive paradigm.[10] We run experiments on separately trained extractor/abstractor models (ff-ext + abs, rnn-ext + abs) and the reinforced full model (rnn-ext + abs + RL), as well as the final reranking version (rnn-ext + abs + RL + rerank).

[9] In this case the abstractor function is g(d) = d.
[10] Note that the abstractive CNN/DM dataset does not include any human-annotated extraction label, and hence our models do not receive any direct extractive supervision.

6 Results

For easier comparison, we show separate tables for the original-text vs. anonymized versions – Table 1 and Table 2, respectively. Overall, our model achieves strong improvements and the new state-of-the-art on both extractive and abstractive settings for both versions of the CNN/DM dataset (with some comparable results on the anonymized version). Moreover, Table 3 shows the generalization of our abstractive system to an out-of-domain test-only setup (DUC-2002), where our model achieves better scores than See et al. (2017).

Models                             R-1     R-2     R-L
Extractive Results
lead-3 (Nallapati et al., 2017)    39.2    15.7    35.5
Nallapati et al. (2017)            39.6    16.2    35.3
ff-ext                             39.51   16.85   35.80
rnn-ext                            38.97   16.65   35.32
rnn-ext + RL                       40.13   16.58   36.47
Abstractive Results
Nallapati et al. (2016)            35.46   13.30   32.65
Fan et al. (2017) (controlled)     38.68   15.40   35.47
Paulus et al. (2018) (ML)          38.30   14.81   35.49
Paulus et al. (2018) (RL+ML)       39.87   15.82   36.90
ff-ext + abs                       38.73   15.70   36.33
rnn-ext + abs                      37.58   14.68   35.24
rnn-ext + abs + RL                 38.80   15.66   36.37
rnn-ext + abs + RL + rerank        39.66   15.85   37.34

Table 2: ROUGE for the anonymized CNN/DM dataset.

6.1 Extractive Summarization

In the extractive paradigm, we compare our model with the extractive model from Nallapati et al. (2017) and a strong lead-3 baseline. For producing our summary, we simply concatenate the extracted sentences from the extractors. From Table 1 and Table 2, we can see that our feed-forward extractor outperforms the lead-3 baseline, empirically showing that our hierarchical sentence encoding model is capable of extracting salient sentences.[11] The reinforced extractor performs the best, because of its ability to get the summary-level reward and the reduced train-test mismatch from feeding in the previous extraction decision. The improvement over lead-3 is consistent across both tables. In Table 2, it outperforms the previous best neural extractive model (Nallapati et al., 2017). In Table 1, our model also outperforms a recent, concurrent sentence-ranking RL model by Narayan et al. (2018), showing that our pointer-network extractor and reward formulations are very effective when combined with A2C RL.

[11] The ff-ext model outperforms rnn-ext possibly because it does not predict sentence ordering; it is thus easier to optimize, and the n-gram based metrics do not consider sentence ordering. Also note that in our MDP formulation, we cannot apply RL on ff-ext due to its historyless nature. Even if applied naively, there is no means for the feed-forward model to learn the EOE action described in Sec. 3.2.
Models                R-1     R-2     R-L
See et al. (2017)     37.22   15.78   33.90
rnn-ext + abs + RL    39.46   17.34   36.72

Table 3: Generalization to DUC-2002 (F1).

6.2 Abstractive Summarization

After applying the abstractor, the ff-ext based model still outperforms the rnn-ext model. Both combined models exceed the pointer-generator model (See et al., 2017) without coverage by a large margin for all metrics, showing the effectiveness of our 2-step hierarchical approach: our method naturally avoids repetition by extracting multiple sentences with different key points.[12]

Moreover, after applying reinforcement learning, our model performs better than the best model of See et al. (2017) and the best ML-trained model of Paulus et al. (2018). Our reinforced model outperforms the ML-trained rnn-ext + abs baseline with statistical significance of p < 0.01 on all metrics for both versions of the dataset, indicating the effectiveness of the RL training. Also, rnn-ext + abs + RL is statistically significantly better than See et al. (2017) for all metrics with p < 0.01.[13] In the supplementary, we show the learning curve of our RL training, where the average reward goes up quickly after the extractor learns the End-of-Extract action and then stabilizes. For all the above models, we use standard greedy decoding and find that it performs well.

[12] A trivial lead-3 + abs baseline obtains ROUGE of (37.37, 15.59, 34.82), which again confirms the importance of our reinforce-based sentence selection.
[13] We calculate statistical significance based on the bootstrap test (Noreen, 1989; Efron and Tibshirani, 1994) with 100K samples. The output of Paulus et al. (2018) is not available, so we could not test for statistical significance there.

Reranking and Redundancy: Although the extract-then-abstract approach inherently will not generate repeating sentences the way other neural decoders do, there might still be across-sentence redundancy, because the abstractor is not aware of the other extracted sentences when decoding one of them. Hence, we incorporate the optional reranking strategy described in Sec. 3.3. The improved ROUGE scores indicate that this successfully removes some remaining redundancies and hence produces more concise summaries. Our best abstractive model (rnn-ext + abs + RL + rerank) is clearly superior to that of See et al. (2017). We are comparable on R-1 and R-2, with a 0.4 point improvement on R-L, w.r.t. Paulus et al. (2018).[14] We also outperform the results of Fan et al. (2017) on both the original and anonymized dataset versions. Several previous works have pointed out that extractive baselines are very difficult to beat (in terms of ROUGE) by an abstractive system (See et al., 2017; Nallapati et al., 2017). Note that our best model is one of the first abstractive models to outperform the lead-3 baseline on the original-text CNN/DM dataset. Our extractive experiment serves as a complementary analysis of the effect of RL with extractive systems.

[14] We do not list the scores of their pure RL model because they discussed its bad readability.

                               Relevance   Readability   Total
See et al. (2017)                 120          128        248
rnn-ext + abs + RL + rerank       137          133        270
Equally good/bad                   43           39         82

Table 4: Human Evaluation: pairwise comparison between our final model and See et al. (2017).

6.3 Human Evaluation

We also conduct human evaluation to ensure the robustness of our training procedure. We measure relevance and readability of the summaries. Relevance is based on the summary containing important, salient information from the input article, being correct by avoiding contradictory/unrelated information, and avoiding repeated/redundant information. Readability is based on the summary's fluency, grammaticality, and coherence. To evaluate both these criteria, we design the following Amazon MTurk experiment: we randomly select 100 samples from the CNN/DM test set and ask the human testers (3 for each sample) to rank between summaries (for relevance and readability) produced by our model and that of See et al. (2017) (the models were anonymized and randomly shuffled), i.e., A is better, B is better, or both are equally good/bad. Following previous work, the input article and ground-truth summaries are also shown to the human participants in addition to the two model summaries.[15] From the results shown in Table 4, we can see that our model is better in both relevance and readability w.r.t. See et al. (2017).

[15] We selected human annotators that were located in the US, had an approval rate greater than 95%, and had at least 10,000 approved HITs on record.
Speed                                                       Novel N -gram (%)
 Models                           total time (hr)   words / sec        Models                        1-gm 2-gm 3-gm 4-gm
 (See et al., 2017)                    12.9           14.8             See et al. (2017)              0.1    2.2     6.0    9.7
 rnn-ext + abs + RL                    0.68           361.3            rnn-ext + abs + RL + rerank    0.3   10.0 21.7 31.6
 rnn-ext + abs + RL + rerank    2.00 (1.46 +0.54)     109.8            reference summaries           10.8 47.5 68.2 78.2
Table 5: Speed comparison with See et al. (2017).                  Table 6: Abstractiveness: novel n-gram counts.
6.4 Speed Comparison

Our two-stage extractive-abstractive hybrid model is not only the SotA on summary quality metrics but, more importantly, also gives a significant speed-up in both train and test time over a strong neural abstractive system (See et al., 2017).16 Our full model is composed of an extremely fast extractor and a parallelizable abstractor, where the computation bottleneck is the abstractor, which has to generate summaries with a large vocabulary from scratch.17 The main advantage of our abstractor at decoding time is that we can first compute all the extracted sentences for the document, and then abstract every sentence concurrently (in parallel) to generate the overall summary. In Table 5, we show the substantial test-time speed-up of our model compared to See et al. (2017).18 We calculate the total decoding time for producing all summaries for the test set.19 Because the main test-time speed bottleneck of an RNN language generation model is that it is constrained to generate one word at a time, the total decoding time depends on the total number of words generated; we hence also report the decoded words per second for a fair comparison. Our model without reranking is extremely fast: from Table 5 we can see that we achieve a speed-up of 18x in time and 24x in word generation rate. Even after adding the (optional) reranker, we still maintain a 6-7x speed-up (and hence a user can choose to use the reranking component depending on their downstream application’s speed requirements).20
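To make the decoding pattern concrete, here is a toy, self-contained Python sketch of sentence-parallel greedy decoding: the loop is sequential only along the time axis, while the batch axis carries all extracted sentences of a document at once. Every name and size in it is an invented stand-in (the "model" is just a random recurrence), not the actual abstractor described in Sec. 2.2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the abstractor: a random recurrence and output
# projection. VOCAB/HID/END are illustrative placeholders only.
VOCAB, HID, END = 12, 16, 0
W_step = rng.normal(size=(HID, HID)) * 0.5
W_out = rng.normal(size=(HID, VOCAB))

def step(tokens, states):
    # One decoding step for a whole BATCH of sentences at once;
    # a real abstractor would embed tokens and run LSTM + attention here.
    states = np.tanh(states @ W_step + tokens[:, None])  # (B, HID)
    logits = states @ W_out                              # (B, VOCAB)
    return logits.argmax(axis=1), states

def decode_parallel(num_sentences, max_len=10):
    # The time loop is the only sequential part; the batch axis
    # decodes every extracted sentence concurrently.
    states = rng.normal(size=(num_sentences, HID))       # toy "encoder" states
    tokens = np.ones(num_sentences, dtype=np.int64)      # <start> ids
    outputs = [[] for _ in range(num_sentences)]
    for _ in range(max_len):
        tokens, states = step(tokens, states)
        for out, tok in zip(outputs, tokens):
            if not out or out[-1] != END:                # stop after <end>
                out.append(int(tok))
    return outputs

print(decode_parallel(3))
```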
7 Analysis

7.1 Abstractiveness

We compute an abstractiveness score (See et al., 2017) as the ratio of novel n-grams in the generated summary, i.e., those that are not present in the input document. The results are shown in Table 6: our model writes substantially more abstractive summaries than previous work. A potential reason for this is that when trained with individual sentence pairs, the abstractor learns to drop more document words so as to write individual summary sentences as concise as human-written ones; thus the improvement in multi-gram novelty.
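As a minimal sketch, the novel n-gram percentage can be computed as follows; the whitespace tokenization and lower-casing here are simplifying assumptions, and the exact preprocessing behind Table 6 may differ.

```python
def ngrams(tokens, n):
    # Set of all n-grams in a token sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_pct(summary, document, n):
    # Percentage of summary n-grams that never appear in the source document.
    summ = ngrams(summary.lower().split(), n)
    doc = ngrams(document.lower().split(), n)
    return 100.0 * len(summ - doc) / max(len(summ), 1)

doc = "police arrested the suspect at his home on tuesday morning"
summ = "the suspect was arrested at home"
print([round(novel_ngram_pct(summ, doc, n), 1) for n in (1, 2, 3, 4)])
```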
7.2 Qualitative Analysis on Output Examples

We show examples of how our best model selects sentences and then rewrites them. In the supplementary Fig. 4 and Fig. 5, we can see how the abstractor rewrites the extracted sentences concisely while keeping the mentioned facts. Adding the reranker makes the output more compact globally. We observe that when rewriting longer text, the abstractor has many facts to choose from (Fig. 5 sentence 2), and this is where the reranker helps avoid redundancy across sentences.

8 Conclusion

We propose a novel sentence-level RL model for abstractive summarization, which makes the model aware of the word-sentence hierarchy. Our model achieves the new state-of-the-art on both CNN/DM versions as well as better generalization on the test-only DUC-2002, along with a significant speed-up in training and decoding.

Acknowledgments

We thank the anonymous reviewers for their helpful comments. This work was supported by a Google Faculty Research Award, a Bloomberg Data Science Research Grant, an IBM Faculty Award, and NVidia GPU awards.
16 The only publicly available code with a pretrained model for neural summarization on which we can test the speed.
17 The time needed for the extractor is negligible w.r.t. the abstractor because it does not require large matrix multiplications for generating every word. Moreover, with the convolutional encoder at the word level made parallelizable by the hierarchical rnn-ext, our model is scalable for very long documents.
18 For details of the training speed-up, please see the supp.
19 We time the model of See et al. (2017) using a beam size of 4 (used for their best-reported scores). Without beam search, it gets significantly worse ROUGE of (36.62, 15.12, 34.08), so we do not compare speed-ups w.r.t. that version.
20 Most of the recent neural abstractive summarization systems are of similar algorithmic complexity to that of See et al. (2017). The main differences, such as the training objective (ML vs. RL) and copying (soft/hard), have negligible test run-time compared to the slowest component: the long-summary attentional decoder’s sequential generation; and this is the component that we substantially speed up via our parallel sentence decoding with sentence-selection RL.
References

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. An actor-critic algorithm for sequence prediction. In ICLR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Michele Banko, Vibhu O. Mittal, and Michael J. Witbrock. 2000. Headline generation based on statistical translation. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL ’00, pages 318–325, Stroudsburg, PA, USA. Association for Computational Linguistics.

Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. 2017. Neural combinatorial optimization with reinforcement learning. arXiv preprint, abs/1611.09940.

Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 481–490, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca J. Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In ACL.

Asli Çelikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In NAACL-HLT.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for modeling documents. In IJCAI.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 484–494, Berlin, Germany. Association for Computational Linguistics.

Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. 2017. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 209–220. Association for Computational Linguistics.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98, San Diego, California. Association for Computational Linguistics.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Bradley Efron and Robert J. Tibshirani. 1994. An Introduction to the Bootstrap. CRC Press.

Angela Fan, David Grangier, and Michael Auli. 2017. Controllable abstractive summarization. arXiv preprint, abs/1711.05217.

Katja Filippova, Enrique Alfonseca, Carlos Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP’15).

Dan Gillick and Benoit Favre. 2009. A scalable global model for summarization. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, ILP ’09, pages 10–18, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jiatao Gu, Kyunghyun Cho, and Victor O. K. Li. 2017a. Trainable greedy decoding for neural machine translation. In EMNLP.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O. K. Li. 2017b. Learning to translate in real-time with neural machine translation. In EACL.

Sebastian Henß, Margot Mieskes, and Iryna Gurevych. 2015. A reinforcement learning approach for adaptive single- and multi-document summarization. In International Conference of the German Society for Computational Linguistics and Language Technology (GSCL-2015), pages 3–12.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).

Tsutomu Hirao, Yasuhisa Yoshida, Masaaki Nishino, Norihito Yasuda, and Masaaki Nagata. 2013. Single-document summarization as a tree knapsack problem. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1515–1520, Seattle, Washington, USA. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR.

Hongyan Jing and Kathleen R. McKeown. 2000. Cut and paste based text summarization. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, NAACL 2000, pages 178–185, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yuta Kikuchi, Tsutomu Hirao, Hiroya Takamura, Manabu Okumura, and Masaaki Nagata. 2014. Single document summarization based on nested tree structure. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 315–320, Baltimore, Maryland. Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR.

Kevin Knight and Daniel Marcu. 2000. Statistics-based summarization - step one: Sentence compression. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 703–710. AAAI Press.

Chen Li, Yang Liu, Fei Liu, Lin Zhao, and Fuliang Weng. 2014. Improving multi-documents summarization by sentence compression based on expanded constituent parse trees. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 691–701. Association for Computational Linguistics.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint, abs/1611.08562.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Jeffrey Ling and Alexander Rush. 2017. Coarse-to-fine attention models for document summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 33–42. Association for Computational Linguistics.

Annie Louis, Aravind Joshi, and Ani Nenkova. 2010. Discourse indicators for content selection in summarization. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL ’10, pages 147–156, Stroudsburg, PA, USA. Association for Computational Linguistics.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

André F. T. Martins and Noah A. Smith. 2009. Summarization with a joint model for sentence extraction and compression. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, ILP ’09, pages 1–9, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In EMNLP.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937, New York, New York, USA. PMLR.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI Conference on Artificial Intelligence.

Ramesh Nallapati, Bowen Zhou, Cicero Nogueira dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In NAACL-HLT.

Eric W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses. Wiley, New York.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310–1318, Atlanta, Georgia, USA. PMLR.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In ICLR.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163. Association for Computational Linguistics.

Xian Qian and Yang Liu. 2013. Fast joint compression and summarization via graph cuts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1492–1502, Seattle, Washington, USA. Association for Computational Linguistics.

Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083. Association for Computational Linguistics.

Jun Suzuki and Masaaki Nagata. 2016. RNN-based encoder-decoder approach with word frequency estimation. In EACL.

Swabha Swayamdipta, Ankur P. Parikh, and Tom Kwiatkowski. 2017. Multi-mention learning for reading comprehension with neural cascades. arXiv preprint, abs/1711.00894.

Chuanqi Tan, Furu Wei, Nan Yang, Weifeng Lv, and Ming Zhou. 2018. S-Net: From answer extraction to answer generation for machine reading comprehension. In AAAI.

Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In ACL.

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2016. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR).

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.

Xun Wang, Yasuhisa Yoshida, Tsutomu Hirao, Katsuhito Sudoh, and Masaaki Nagata. 2015. Summarization based on task-oriented discourse parsing. IEEE/ACM Transactions on Audio, Speech and Language Processing, 23(8):1358–1367.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

David Zajic, Bonnie Dorr, and Richard Schwartz. 2004. BBN/UMD at DUC-2004: Topiary. In HLT-NAACL 2004 Document Understanding Workshop, pages 112–119, Boston, Massachusetts.

Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1095–1104. Association for Computational Linguistics.
Supplementary Materials

A Model Details

A.1 Convolutional Encoder

Here we describe the convolutional sentence representation used in Sec. 2.1.1. We use the temporal convolutional model proposed by Kim (2014) to compute the representation of every individual sentence in the document. First, the words are converted to distributed vector representations by a learned word embedding matrix W_emb. The sequence of word vectors from each sentence is then fed through 1-D single-layer convolutional filters with various window sizes (3, 4, 5) to capture the temporal dependencies of nearby words, followed by a ReLU non-linear activation and max-over-time pooling. The convolutional representation r_j for the j-th sentence is then obtained by concatenating the outputs from the activations of all filter window sizes.
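The following is a minimal PyTorch sketch of this encoder; the class name and the sizes (vocabulary, embedding dimension, number of filters per window) are illustrative placeholders rather than the hyper-parameters we actually use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSentEncoder(nn.Module):
    """Temporal-CNN sentence encoder (Kim, 2014): embed words, run 1-D
    convolutions with window sizes 3/4/5, apply ReLU and max-over-time
    pooling, and concatenate the pooled features."""
    def __init__(self, vocab=30000, emb_dim=128, n_filters=100):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)  # W_emb
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=w) for w in (3, 4, 5))

    def forward(self, word_ids):                  # (batch, seq_len)
        x = self.emb(word_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
        # max over time collapses the sequence axis for each filter
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1)            # r_j: (batch, 3 * n_filters)

r = ConvSentEncoder()(torch.randint(0, 30000, (2, 20)))
print(r.shape)  # torch.Size([2, 300])
```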
A.2 Abstractor

In this section we discuss the architecture choices for our abstractor network in Sec. 2.2. At a high level, it is a sequence-to-sequence model with attention and a copy mechanism (but no coverage). Note that the abstractor network is a separate neural network from the extractor agent, without any form of parameter sharing.

Sequence-Attention-Sequence Model We use a standard encoder-aligner-decoder model (Bahdanau et al., 2015; Luong et al., 2015) with the bilinear multiplicative attention function (Luong et al., 2015), f_att(h_i, z_j) = h_i^T W_attn z_j, for the context vector e_j. We share the source and target embedding matrix W_emb as well as the output projection matrix, as in Inan et al. (2017); Press and Wolf (2017); Paulus et al. (2018).

Copy Mechanism We add the copying mechanism as in See et al. (2017) to extend the decoder to predict over the extended vocabulary of words in the input document. A copy probability p_copy = σ(v_ẑ^T ẑ_j + v_s^T z_j + v_w^T w_j + b) is calculated from learnable parameters v’s and b, and is then used to compute a weighted sum of the probabilities over the source vocabulary and the predefined vocabulary. At test time, an OOV prediction is replaced by the document word with the highest attention score.
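Below is a small PyTorch sketch of these two components, the bilinear attention and the copy-weighted mixture over the extended vocabulary, using toy sizes and random stand-in tensors; in particular, p_copy is fixed to a constant here instead of being computed from ẑ_j, z_j, and w_j as above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
SRC_LEN, HID, VOCAB, EXT_VOCAB = 5, 8, 10, 12   # toy sizes, for illustration

h = torch.randn(SRC_LEN, HID)          # encoder hidden states h_i
z_j = torch.randn(HID)                 # decoder state z_j
W_attn = torch.randn(HID, HID)

# Bilinear attention: f_att(h_i, z_j) = h_i^T W_attn z_j, then softmax.
alpha = F.softmax(h @ (W_attn @ z_j), dim=0)   # weights over source words
e_j = alpha @ h                                # context vector e_j

# Copy mechanism: mix the generator's vocabulary distribution with the
# attention weights over source positions. p_copy is a constant stand-in
# for sigma(v_zhat^T zhat_j + v_s^T z_j + v_w^T w_j + b).
p_copy = torch.tensor(0.3)
p_vocab = F.softmax(torch.randn(VOCAB), dim=0)  # decoder's vocab distribution
src_ids = torch.tensor([1, 4, 4, 10, 11])       # source token ids; 10/11 are OOV

p_final = torch.zeros(EXT_VOCAB)
p_final[:VOCAB] = (1 - p_copy) * p_vocab
p_final.index_add_(0, src_ids, p_copy * alpha)  # scatter copy mass onto source words
print(p_final.sum())  # ~1.0: a valid distribution over the extended vocab
```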
A.3 Actor-Critic Policy Gradient

Here we discuss the details of the actor-critic policy gradient training. Given the MDP formulation described in Sec. 3.2, the return (total discounted future reward) is

    R_t = Σ_{t'=t}^{N_s} γ^{t'−t} r(t'+1)    (9)

for each recurrent step t. To learn an optimal policy π* that maximizes the state-value function

    V^{π*}(c) = E_{π*}[R_t | c_t = c],

we make use of the action-value function

    Q^{π_θ}(c, j) = E_{π_θ}[R_t | c_t = c, j_t = j].

We then take the policy gradient theorem and substitute the action-value function with its Monte-Carlo sample:

    ∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(c, j) Q^{π_θ}(c, j)]    (10)
             = (1/N_s) Σ_{t=1}^{N_s} ∇_θ log π_θ(c_t, j_t) R_t    (11)

which runs a single episode and gets the return (an estimate of the action-value function) by sampling from the policy π_θ, where N_s is the total number of sentences the agent extracts. This gradient update is also known as the REINFORCE algorithm (Williams, 1992).

The vanilla REINFORCE algorithm is known for high variance. To mitigate this problem, we add a critic network with trainable parameters θ_c, having the same structure as the pointer-network’s decoder (described in Sec. 2.1.2) but with the final output layer changed to regress the state-value function V^{π_{θ_a},ω}(c). The predicted value b_{θ_c,ω}(c) is called the baseline and is subtracted from the action-value function to estimate the advantage

    A^{π_θ}(c, j) = Q^{π_{θ_a},ω}(c, j) − b_{θ_c,ω}(c)

where θ = {θ_a, θ_c, ω} denotes the set of all trainable parameters. The new policy gradient for our extractor can be estimated by substituting the action-value function in Eqn. 10 by the advantage and then using Monte-Carlo samples (using R_t to estimate the action-value function).
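A compact sketch of this update is given below, assuming the per-step log-probabilities from the extractor and the baseline values from the critic have already been computed; the function and variable names are ours, and reward computation (sentence-level ROUGE) and optimizer steps are omitted.

```python
import torch
import torch.nn.functional as F

def a2c_losses(log_probs, rewards, baselines, gamma=0.95):
    # Discounted returns R_t = r(t+1) + gamma * R_{t+1} (Eqn. 9).
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(list(reversed(returns)))
    # Advantage A(c_t, j_t) = R_t - b(c_t); the baseline is detached so
    # the critic is trained only through its own regression loss.
    advantages = returns - baselines.detach()
    actor_loss = -(log_probs * advantages).mean()   # REINFORCE with baseline
    critic_loss = F.mse_loss(baselines, returns)    # regress the state value
    return actor_loss, critic_loss

# Stand-in values for one episode of N_s = 4 extraction steps.
log_probs = torch.randn(4, requires_grad=True)  # log pi_theta(c_t, j_t)
baselines = torch.randn(4, requires_grad=True)  # b_{theta_c, omega}(c_t)
actor_loss, critic_loss = a2c_losses(log_probs, [0.2, 0.0, 0.5, 0.1], baselines)
(actor_loss + critic_loss).backward()
```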