Data-to-text Generation with Macro Planning

Page created by Jason Patel
 
CONTINUE READING
Data-to-text Generation with Macro Planning

                                                                       Ratish Puduppully and Mirella Lapata
                                                                 Institute for Language, Cognition and Computation
                                                                   School of Informatics, University of Edinburgh
                                                                       10 Crichton Street, Edinburgh EH8 9AB
                                                               r.puduppully@sms.ed.ac.uk mlap@inf.ed.ac.uk

                                                              Abstract                           content to talk about and how it might be orga-
                                                                                                 nized in discourse), sentence planning (aggregat-
                                             Recent approaches to data-to-text generation        ing content into sentences, deciding specific words
                                             have adopted the very successful encoder-           to describe concepts and relations, and generat-
                                             decoder architecture or variants thereof.           ing referring expressions), and linguistic realisa-
arXiv:2102.02723v1 [cs.CL] 4 Feb 2021

                                             These models generate text which is flu-            tion (applying the rules of syntax, morphology
                                             ent (but often imprecise) and perform quite
                                                                                                 and orthographic processing to generate surface
                                             poorly at selecting appropriate content and
                                             ordering it coherently. To overcome some            forms). Recent neural network-based approaches
                                             of these issues, we propose a neural model          (Lebret et al., 2016; Mei et al., 2016; Wiseman
                                             with a macro planning stage followed by a           et al., 2017) make use of the encoder-decoder ar-
                                             generation stage reminiscent of traditional         chitecture (Sutskever et al., 2014), are trained end-
                                             methods which embrace separate modules              to-end, and have no special-purpose modules for
                                             for planning and surface realization. Macro         how to best generate a text, aside from generic
                                             plans represent high level organization of im-
                                                                                                 mechanisms such as attention and copy (Bahdanau
                                             portant content such as entities, events and
                                             their interactions; they are learnt from data
                                                                                                 et al., 2015; Gu et al., 2016). The popularity of
                                             and given as input to the generator. Extensive      end-to-end models has been further boosted by the
                                             experiments on two data-to-text benchmarks          release of new datasets with thousands of input-
                                             (ROTOW IRE and MLB) show that our ap-               document training pairs. The example shown in
                                             proach outperforms competitive baselines in         Figure 1 is taken from the MLB dataset (Pudup-
                                             terms of automatic and human evaluation.            pully et al., 2019b) which contains baseball game
                                                                                                 statistics and human written summaries (~25K in-
                                                                                                 stances). ROTOW IRE (Wiseman et al., 2017) is
                                        1   Introduction                                         another widely used benchmark, which contains
                                        Data-to-text generation refers to the task of generat-   NBA basketball game statistics and their descrip-
                                        ing textual output from non-linguistic input (Reiter     tions (~5K instances).
                                        and Dale, 1997, 2000; Gatt and Krahmer, 2018)               Despite being able to generate fluent text, neu-
                                        such as databases of records, simulations of phys-       ral data-to-text generation models are often impre-
                                        ical systems, accounting spreadsheets, or expert         cise, prone to hallucination (i.e., generate text that
                                        system knowledge bases. As an example, Figure 1          is not supported by the input), and poor at con-
                                        shows various statistics describing a major league       tent selection and document structuring (Wiseman
                                        baseball (MLB) game, including extracts from the         et al., 2017). Attempts to remedy some of these
                                        box score (i.e., the performance of the two teams        issues focus on changing the way entities are repre-
                                        and individual team members who played as bat-           sented (Puduppully et al., 2019b; Iso et al., 2019),
                                        ters, pitchers or fielders; Tables (A)), play-by-play    allowing the decoder to skip low-confidence to-
                                        (i.e., the detailed sequence of each play of the game    kens to enhance faithful generation (Tian et al.,
                                        as it occurred; Table (B)), and a human written          2019), and making the encoder-decoder architec-
                                        game summary (Table (C)).                                ture more modular by introducing micro planning
                                           Traditional methods for data-to-text generation       (Puduppully et al., 2019a; Moryossef et al., 2019).
                                        (Kukich, 1983; McKeown, 1992; Reiter and Dale,           Micro planning operates at the record level (see Ta-
                                        1997) follow a pipeline architecture, adopting sep-      bles (A) Figure 1; e.g., C.Mullins BH 2, J.Villar TEAM
                                        arate stages for text planning (determining which        Orioles), it determines which facts should be men-
(A)
                                                                                (C)
 TEAM            Inn1 Inn2 Inn3 Inn4 . . . TR TH E . . .
                                                                                KANSAS CITY, Mo. – Brad Keller kept up his recent pitching surge with another strong
 Orioles           1    0        0       0       ...   2       4       0 ...
                                                                                outing.  Keller gave up a home run to the first batter of the game – Cedric Mullins
 Royals            1    0        0       3       ...   9       14 1 . . .
                                                                                – but quickly settled in to pitch eight strong innings in the Kansas City Royals’ 9–2 win
                                                                                over the Baltimore Orioles in a matchup of the teams with the worst records in the majors.
      BATTER            H/V AB BR BH RBI TEAM                           ...
                                                                                 Keller (7–5) gave up two runs and four hits with two walks and four strikeouts
      C.Mullins          H   4       2       2     1   Orioles . . .
                                                                                to improve to 3–0 with a 2.16 ERA in his last four starts.  Ryan O’Hearn homered
      J.Villar           H   4       0       0     0   Orioles . . .
                                                                                among his three hits and drove in four runs, Whit Merrifield scored three runs, and Hunter
      W.Merrifield       V   2       3       2     1   Royals . . .
                                                                                Dozier and Cam Gallagher also went deep to help the Royals win for the fifth time in six
      R.O’Hearn          V   5       1       3     4   Royals . . .
                                                                                games on their current homestand.  With the score tied 1–1 in the fourth, Andrew
      ...               ... ... ... ... ... ...
                                                                                Cashner (4–13) gave up a sacrifice fly to Merrifield after loading the bases on two walks
                                                                                and a single. Dozier led off the fifth inning with a 423-foot home run to left field to make
PITCHER H/V W                L IP PH PR ER BB K . . .
                                                                                it 3-1.  The Orioles pulled within a run in the sixth when Mullins led off with a
A.Cashner           H    4 13 5.1            9     4       4       3    1 ...
                                                                                double just beyond the reach of Dozier at third, advanced to third on a fly ball and scored
B.Keller            V    7   5 8.0           4     2       2       2    4 ...
                                                                                on Trey Mancini’s sacrifice fly to the wall in right.  . . .
...                ... ... ... ... ... ...
                                                                                   (B)
Inn1: runs in innings, TR: team runs, TH: team hits,
                                                                                   BATTER        PITCHER      SCORER        ACTION     TEAM       INN PL-ID SCORE . . .
E: errors, AB: at-bats, RBI: runs-batted-in, BR: batter
                                                                                   C.Mullins      B.Keller         -       Home run Orioles       1-T    1        1     ...
runs, BH: batter hits, H/V: home or visiting, W: wins,
                                                                                   H.Dozier      A.Cashner W.Merrifield Grounded Royals           1-B    3        1     ...
L: losses, IP: innings pitched, PH: hits given, PR: runs
                                                                                   W.Merrifield A.Cashner B.Goodwin         Sac fly    Royals     4-B    5        2     ...
given, ER: earned runs, BB: walks, K: strike outs, INN:
                                                                                   H.Dozier      A.Cashner         -       Home run Royals        5-B    1        3     ...
inning with (T)op/(B)ottom, PL-ID: play id.
                                                                                   ...               ...          ...         ...        ...       ...   ...     ...    ...

(D)                    V(Orioles), V(Royals),                                              V(Royals) V(Orioles),
                       V(C.Mullins), V(J.Villar),                                          V(Orioles) V(C.Mullins), V(Orioles) V(J.Villar),
                       V(W.Merrifield), V(R.O’Hearn),                                      V(Royals) V(W.Merrifield), V(Royals)
                       V(A.Cashner), V(B.Keller),                                          V(R.O’Hearn), V(Orioles) V(A.Cashner), V(Royals)
                       V(H.Dozier), . . . ,                                                V(B.Keller), . . . ,
                       V(1-T), V(1-B), V(2-T), V(2-B),                                     V(C.Mullins) V(Royals) V(Orioles),
                       V(3-T), V(3-B), . . .                                               V(J.Villar) V(Royals) V(Orioles), . . .

(E)                 V(B.Keller)V(B.Keller) V(C.Mullins) V(Royals) V(Orioles)V(B.Keller)
                    V(R.O’Hearn) V(W.Merrifield) V(H.Dozier) V(C.Gallagher) V(4-B, 5-B)  V(6-T)

Figure 1: MLB statistics tables and game summary. Tables summarize the performance of teams and individual
team members who played as batters and pitchers as well as the most important actions (and their actors) in each
play (Tables (A) and (B)). Macro plan for the game summary is shown at the bottom (Table (E)).  indicates
paragraph delimiters. There is a plan for every paragraph in the game summary (correspondence shown in same
color);  verbalizes entities, while  verbalizes events related to the top/bottom
side of an inning (see Section 3.1). Set of candidate paragraph plans are shown above macro plan (Table (D)) and
grouped into two types: plans describing a single entity/event or their combinations. Best viewed in color.

tioned within a textual unit (e.g., a sentence) and                                              selves to document planning as there is no explicit
how these should be structured (e.g., the sequence                                               link between the summary and the content of the
of records). An explicit content planner essentially                                             game (which is encoded in tabular form). In other
makes the job of the neural network less onerous                                                 words, the underlying plans are latent, and it is not
allowing to concentrate on producing fluent natu-                                                clear how they might be best represented, i.e., as
ral language output, without expending too much                                                  sequences of records from a table, or simply words.
effort on content organization.                                                                  Nevertheless, game summaries through their seg-
   In this work, we focus on macro planning, the                                                 mentation into paragraphs (and lexical overlap with
high-level organization of information and how                                                   the input) give clues as to how content might be
it should be presented which we argue is impor-                                                  organized. Paragraphs are a central element of dis-
tant for the generation of long, multi-paragraph                                                 course (Chafe, 1979; Longacre, 1979; Halliday and
documents (see text (C) in Figure 1). Problem-                                                   Hasan, 1976), the smallest domain where coher-
atically, modern datasets like MLB (Puduppully                                                   ence and topic are defined and anaphora resolu-
et al. 2019b; and also Figure 1) and ROTOW IRE                                                   tion is possible (Zadrozny and Jensen, 1991). We
(Wiseman et al., 2017) do not naturally lend them-                                               therefore operationalize the macro plan for a game
summary as a sequence of paragraph plans.              generic planners (Dale, 1989). Plans are mostly
    Although resorting to paragraphs describes the     based on hand-crafted rules after analyzing the tar-
summary plan at a coarse level, we still need          get text, although a few approaches have recog-
to specify individual paragraph plans. In the          nized the need for learning-based methods. For ex-
sports domain, paragraphs typically mention en-        ample, Duboue and McKeown (2001) learn order-
tities (e.g, players important in the game), key       ing constraints in a content plan, Konstas and Lap-
events (e.g., scoring a run), and their interaction.   ata (2013) represent plans as grammar rules whose
And most of this information is encapsulated in        probabilities are estimated empirically, while oth-
the statistics accompanying game summaries (see        ers make use of semantically annotated corpora to
Tables (A) and (B) in Figure 1). We thus define        bootstrap content planners (Duboue and McKeown,
paragraph plans such that they contain verbaliza-      2002; Kan and McKeown, 2002).
tions of entity and event records (see plan (E) in        More recently, various attempts have been made
Figure 1). Given a set of paragraph plans and their    to improve neural generation models (Wiseman
corresponding game summary (see Tables (D) and         et al., 2017) based on the encoder-decoder archi-
summary (C) in Figure 1), our task is twofold. At      tecture (Bahdanau et al., 2015) by adding various
training time, we must learn how content was se-       planning modules. Puduppully et al. (2019a) pro-
lected in order to give rise to specific game sum-     pose a model for data-to-text generation which first
maries (e.g., how input (D) led to plan (E) for sum-   learns a plan from the records in the input table
mary (C) in Figure 1), while at test time, given       and then generates a summary conditioned on this
input for a new game, we first predict a macro plan    plan. Shao et al. (2019) introduce a Planning-based
for the summary and then generate the correspond-      Hierarchical Variational Model where a plan is a
ing document.                                          sequence of groups, each of which contains a sub-
    We present a two-stage approach where macro        set of input items to be covered in a sentence. The
plans are induced from training data (by taking        content of each sentence is verbalized, conditioned
the table and corresponding summaries into ac-         on the plan and previously generated context. In
count) and then fed to the text generation stage.      their case, input items are a relatively small list of
Aside from making data-to-text generation more         attributes (~28) and the output document is also
interpretable, the task of generating a document       short (~110 words).
from a macro plan (rather than a table) affords
                                                          There have also been attempts to incorporate
greater control over the output text and plays to
                                                       neural modules in a pipeline architecture for data-
the advantage of encoder-decoder architectures
                                                       to-text generation. Moryossef et al. (2019) develop
which excel at modeling sequences. We evalu-
                                                       a model with a symbolic text planning stage fol-
ate model performance on the ROTOW IRE (Wise-
                                                       lowed by a neural realization stage. They exper-
man et al., 2017) and MLB (Puduppully et al.,
                                                       iment with the WebNLG dataset (Gardent et al.,
2019b) benchmarks. Experimental results show
                                                       2017) which consists of RDF h Subject, Object,
that our plan-and-generate approach produces out-
                                                       Predicate i triples paired with corresponding text.
put which is more factual, coherent, and fluent
                                                       Their document plan is a sequence of sentence
compared to existing state-of-the-art models. Our
                                                       plans which in turn determine the division of facts
code, trained models and dataset with macro plans
                                                       into sentences and their order. Along similar lines,
can be found at https://github.com/ratishsp/
                                                       Castro Ferreira et al. (2019) propose an architecture
data2text-macro-plan-py.
                                                       comprising of multiple steps including discourse or-
                                                       dering, text structuring, lexicalization, referring ex-
2   Related Work                                       pression generation, and surface realization. Both
Content planning has been traditionally consid-        approaches show the effectiveness of pipeline ar-
ered a fundamental component in natural language       chitectures, however, their task does not require
generation. Not only does it determine which           content selection and the output texts are relatively
information-bearing units to talk about, but also      short (24 tokens on average).
arranges them into a structure that creates coherent     Although it is generally assumed that task-
output. Many content planners have been based          specific parallel data is available for model training,
on theories of discourse coherence (Hovy, 1993),       Laha et al. (2020) do away with this assumption
schemas (McKeown et al., 1997) or have relied on       and present a three-stage pipeline model which
learns from monolingual corpora. They first con-          can be straightforwardly adapted to ROTOW IRE
vert the input to a form of tuples, which in turn         (Wiseman et al., 2017).
are expressed in simple sentences, followed by the
third stage of merging simple sentences to form           3.1   Macro Plan Definition
more complex ones by aggregation and referring            A macro plan consists of a sequence of para-
expression generation. They also evaluate on data-        graph plans separated by a paragraph discourse
to-text tasks which have relatively short outputs.        marker , i.e., x = ei  e j . . .  ek where
There have also been efforts to improve the coher-        ei , e j , ek ∈ E . A paragraph plan in turn is a se-
ence of the output, especially when dealing with          quence of entities and events describing the game.
longer documents. Puduppully et al. (2019b) make          By entities we mean individual players or teams
use of hierarchical attention over entity represen-       and the information provided about them in box
tations which are updated dynamically, while Iso          score statistics (see rows and column headings in
et al. (2019) explicitly keep track of salient entities   Figure 1 Table (A)), while events refer to informa-
and memorize which ones have been mentioned.              tion described in play-by-play (see Table (B)). In
   Our work also attempts to alleviate deficiencies       baseball, plays are grouped in half-innings. Dur-
in neural data-to-text generation models. In con-         ing each half of an inning, a team takes its turn to
trast to previous approaches, (Puduppully et al.,         bat (the visiting team bats in the top half and the
2019a; Moryossef et al., 2019; Laha et al., 2020),        home team in the bottom half). An example macro
we place emphasis on macro planning and create            plan is shown at the bottom of Figure 1. Within
plans representing high-level organization of a doc-      a paragraph plan, entities and events are verbal-
ument including both its content and structure. We        ized into a text sequence along the lines of Saleh
share with previous work (e.g., Moryossef et al.          et al. (2019). We make use of special tokens for
2019) the use of a two-stage architecture. We show        the  of record followed by the value of
that macro planning can be successfully applied           record from the table. We retain the same posi-
to long document data-to-text generation result-          tion for each record type and value. For example,
ing in improved factuality, coherence, and fluency        batter C.Mullins from Figure 1 would be verbal-
without any postprocessing (e.g., to smooth refer-        ized as C.Mullins H 4
ring expressions) or recourse to additional tools         2 2 1 Orioles . . . .
(e.g., parsing or information extraction).                For the sake of brevity we use shorthand
                                                           for the full entity.
3   Problem Formulation                                   Paragraph Plan for Entities For a paragraph
We hypothesize that generation based on plans             containing entities, the corresponding plan will be
should fare better compared to generating from            a verbalization of the entities in sequence. For para-
a set of records, since macro plans offer a bird’s-       graphs with multiple mentions of the same entity,
eye view, a high-level organization of the document       the plan will verbalize an entity only once and at
content and structure. We also believe that macro         its first position of mention. Paragraph “Keller
planning will work well for long-form text genera-        gave up a home run . . . the teams with the worst
tion, i.e., for datasets which have multi-paragraph       records in the majors” from the summary in Fig-
target texts, a large vocabulary space, and require       ure 1 describes four entities including B. Keller,
content selection.                                        C. Mullins, Royals and Orioles. The respective
   We assume the input to our model is a set of para-     plan is the verbalization of the four entities in
                         |E |                             sequence:  
graph plans E = {ei }i=1 where ei is a paragraph
                                                           , where V stands
plan. We model the process of generating output
                                                          for verbalization and  is a short-
summary y given E as a two step process, namely
                                                          hand for B.Keller V 7
the construction of a macro plan x based on the
                                                          5 8 . . . ,  is a shorthand
set of paragraph plans, followed by the generation
                                                          for the team Royals 9 14
of a summary given a macro plan as input. We
                                                          1, and so on.
now explain how E is obtained and each step is
realized. We discuss our model considering mainly         Paragraph Plan for Events A paragraph may
an example from the MLB dataset (Puduppully               also describe one or more events. For example, the
et al., 2019b) but also touch on how the approach         paragraph “With the score tied 1–1 in the fourth
. . . 423-foot home run to left field to make it 3-1 ”   struct macro plans by matching entities and events
discusses what happened in the bottom halves             in the summary to records in the tables. Further-
of the fourth and fifth innings. We verbalize an         more, paragraph delimiters within summaries form
event by first describing the participating entities     natural units which taken together give rise to a
followed by the plays in the event. Entities are         high-level document plan.
described in the order in which they appear in              We match entities in summaries with entities in
a play, and within the same play we list the             tables using exact string match, allowing for some
batter followed by the pitcher, fielder, scorer,         degree of variation in the expression of team names
and basemen. The paragraph plan corresponding            (e.g., A’s for Athletics and D-backs for Diamond-
to the bottom halves of the fourth and fifth             backs). Information pertaining to innings appears
inning is . Here,  is a shorthand for                (e.g., first, ninth ) modifying the noun inning and
   . . .   . . .                 ing (e.g., in sentences like “Dozier led off the fifth
 , and so on. The en-                inning”). However, there are instances where the
tities , ,                mention of innings is more ambiguous (e.g., “With
,           and               the scored tied 1–1 in the fourth, Andrew Cashner
correspond in turn to W. Merrifield, A.                  (4-13) gave up a sacrifice fly”). We could disam-
Cashner, B. Goodwin, and H. Dozier while                 biguate such mentions manually and then train a
 refers to the first play in the               classifier to learn to predict whether an inning is
bottom half of the fifth inning (see the play-           mentioned. Instead, we explore a novel annotation-
by-play table in Figure 1) and abbreviates the           free method which makes use of the pretrained lan-
following detailed plan: 5 B                  guage model GPT2 (Radford et al., 2019). Specif-
Royals Orioles                        ically, we feed the context preceding the ordinal
1 H.Dozier A.                    number to GPT2 (i.e., the current paragraph up to
Cashner> Home-run                        the ordinal number and the paragraph preceding it)
Royals-3-Orioles-1, etc.                                 and if inning appears in the top 10 next word pre-
     The procedure described above is not specific       dictions, we consider it a positive match. On a held
to MLB and can be ported to other datasets with          out dataset, this method achieves 98% precision
similar characteristics such as ROTOW IRE. How-          and 98% recall at disambiguating inning mentions.
ever, ROTOW IRE does not provide play-by-play               To resolve whether the summary discusses the
information, and as a result there is no event ver-      top or bottom side of an inning, we compare the
balization for this dataset.                             entities in the paragraph with the entities in each
                                                         half-inning (play-by-play Table (B) in Figure 1)
3.2   Macro Plan Construction                            and choose the side with the greater number of
                                                         entity matches. For instance, Andrew Cashner,
We provided our definition for macro plans in the
                                                         Merrifield and fourth inning uniquely resolves to
previous sections, however, it is important to note
                                                         the bottom half of the fourth inning.
that such macro plans are not readily available in
data-to-text benchmarks like MLB (Puduppully
                                                         3.3   Paragraph Plan Construction
et al., 2019b) and ROTOW IRE (Wiseman et al.,
2017) which consist of tables of records r paired        Figure 1 shows the macro plan we obtain for game
with a gold summary y (see Tables (A)–(C) in Fig-        summary (C). Importantly, macro plan (E) is the
ure 1). We now describe our method for obtaining         outcome of a content selection process after con-
macro plans x from r and y.                              sidering several candidate paragraph plans as input.
   Similar to Moryossef et al. (2019), we define         So, what are the candidate paragraph plans which
macro plans to be conformant with gold summaries         give rise to macro plan (E)? To answer this ques-
such that (1) they have the same splits into para-       tion, we examined the empirical distribution of
graphs — entities and events within a paragraph          paragraph plans in MLB and ROTOW IRE (train-
in y are grouped into a paragraph plan in x; and         ing portion). Interestingly, we found that ~79% of
(2) the order of events and entities in a paragraph      the paragraph plans in MLB refer to a single event
and its corresponding plan are identical. We con-        or a single player (and team(s)). In ROTOW IRE,
Sigmoid               ·       Content Selection
                                                                                                  4.1   Macro Planning
                                                                                                  Paragraph Plan Representation We encode to-
                                                                                                                                                     |ei |
                                                 patt
                                                  3
                                                                  p3
                                                                                                  kens in a verbalized paragraph plan ei as {ei, j } j=1
    ...     ...   ...            ...
    ...     ...   ...            ...                                                              with a BiLSTM (Figure 2 bottom part). To re-
    p3,1   p3,2   p3,3   ...   p3,|p3 |     Attention                                             flect the fact that some records will be more impor-
    ...     ...   ...            ...                           Attention
                                                                                          d       tant than others, we compute an attention weighted
                                                                                                                 |ei |
                                                                                                  sum of {ei, j } j=1  following Yang et al. (2016). Let
                                                                                                         n
                                                                                                  d ∈ R denote a randomly initialized query vector
                                          e3,1          e3,2     e3,3          ...     e3,|p3 |

                                                                                                  learnt jointly with the rest of parameters. We com-
                                          p3,1          p3,2     p3,3         ...      p3,|p3 |
                                                                                                  pute attention values αi, j over d and paragraph plan
                                                                                                  token representation ei, j :
                                          p3,1          p3,2     p3,3        ...       p3,|p3 |
                                                                                                                    αi, j ∝ exp(d| ei, j )            (1)

Figure 2: Paragraph plan representation and contex-                                               Paragraph plan vector ei is the attention weighted
tualization for macro planning. Computation of e3 is                                              sum of ei, j (with ∑ j αi, j = 1):
detailed in Equations (1), (2), eatt
                                 3 in Equation (3), and
ec3 in Equation (4).
                                                                                                                      ei = ∑ αi, j ei, j              (2)
                                                                                                                                j

~92% of paragraphs are about a singleton player                                                      Next, we contextualize each paragraph plan rep-
(and team(s)) or a pair of players.                                                               resentation vis-a-vis other paragraph plans (Fig-
                                                                                                  ure 2 top left part). First, we compute attention
   Based on this analysis, we assume that paragraph
                                                                                                  scores βi,k over paragraph plan representations to
plans can be either one (verbalized) entity/event
                                                                                                  obtain an attentional vector eatt
                                                                                                                                i for each:
or a combination of at most two. Under this as-
sumption, we explicitly enumerate the set of can-
                                                                                                                  βi,k ∝ exp(e|i Wa ek )
didate paragraph plans in a game. For the game
in Figure 1, candidate paragraph plans are shown                                                                     ci = ∑ βi,k ek
                                                                                                                             k6=i
in Tables (D). The first table groups plans based
on individual verbalizations describing the team(s),                                                               eatt
                                                                                                                    i     = Wg [ei ; ci ]             (3)
players, and events taking place in specific innings.
The second table groups pairwise combinations                                                     where Wa ∈ Rn×n , Wg ∈ Rn×2n are parameter ma-
thereof. In MLB, such combinations are between                                                    trices, and ∑k6=i βi,k = 1. Then, we compute a con-
team(s) and players. In ROTOW IRE, we also cre-                                                   tent selection gate, and apply this gate to ei to ob-
ate combinations between players. Such paragraph                                                  tain new paragraph plan representation eci :
plans form set E based on which macro plan x is
                                                                                                                   gi = sigmoid eatt
                                                                                                                                            
constructed to give rise to game summary y.                                                                                      i
                                                                                                                   eci = gi         ei                (4)

4          Model Description                                                                      where      denotes element-wise multiplication.
                                                                                                  Thus, each element in ei is weighted by correspond-
                                                                                                  ing element of gi ∈ [0, 1]n to obtain a contextualized
The input to our model is a set of paragraph plans                                                paragraph plan representation eci .
each of which is a sequence of tokens. We first
compute paragraph plan representations ∈ Rn , and                                                 Content Planning Our model learns to predict
then apply a contextualization and content planning                                               macro plans, after having been trained on pairs
mechanism similar to planning modules introduced                                                  of sets of paragraph plans and corresponding
in earlier work (Puduppully et al., 2019a; Chen                                                   macro plans (Sections 3.2 and 3.3 explain how
and Bansal, 2018). Predicted macro plans serve as                                                 we obtain these for data-to-text datasets like RO -
input to our text generation model which adopts                                                   TOW IRE and MLB). More formally, we model
an encoder-decoder architecture (Bahdanau et al.,                                                 macro plan z = z1 . . . z|z| as a sequence of pointers,
2015; Luong et al., 2015).                                                                        with each zk pointing to an input paragraph plan,
pCS     pCS    pCS
                                                                                              (Bahdanau et al., 2015; Luong et al., 2015). We
                                                                 3       |p|    1

                                                                                              encode macro plan x with a bidirectional LSTM
                                  ...    pCS            pe
                                                                                              (Hochreiter and Schmidhuber, 1997). At time
  pCS
   1      pCS
           2       pCS
                    3    pCS
                          4               |p|                    h1      h2    h3     h4
                                                                                              step t, we lookup the embedding of the previously
      ·    ·        ·     ·               ·                      ps     pCS
                                                                         3
                                                                               pCS
                                                                                |p|   pCS
                                                                                       1      predicted word yt−1 and feed it as input to the de-
   p1     p2       p3     p4     ...     p|p|           EOS                                   coder which is another LSTM unit. The decoder
          Vector               Decoder        ·     Content Selection Gate                    attends over hidden states of the macro plan to pre-
                                                                                              dict yt . We further incorporate a copy mechanism
                                                                                              (Gulcehre et al., 2016) in the decoder to enable
Figure 3: Macro planning model; paragraph plan repre-
                                                                                              copying values directly from the macro plan.
sentation and contextualization mechanism are detailed
in Figure 2. The output points to e3 , e|E | , and e1 (see                                       We expect the text generation model to learn to
Equations (5) and (6)). EOM is end of macro plan token.                                       generate summary tokens while focusing on the
                                                                                              corresponding macro plan and that the output sum-
                         |E |
                                                                                              mary will indeed follow the plan in terms of the
i.e., zk ∈ {ei }i=1 . We decompose p(z|E ), the prob-                                         entities and events being described and their order.
ability of macro plan z given paragraph plans E , as:                                         At the same time, we believe that text generation
                                                  |z|
                                                                                              is relatively easier as the encoder-decoder model
                         p(z|E ) = ∏ p(zk |z
ROTOW IRE       MLB             truncation length 100. We learn subword vocab-
       Vocab Size               11.3K        38.9K            ulary (Sennrich et al., 2016) for paragraph plans
       # Tokens                 1.5M         14.3M
       # Instances               4.9K        26.3K            in the macro planning stage. We used 2.5K merge
       # Record Types             39           53             operations for ROTOW IRE and 8K merge opera-
       Avg Records               628          565             tions for MLB. In text generation, we learn a joint
       Avg Paragraph Plans       10.7         15.1            subword vocabulary for the macro plan and game
       Avg Length               337.1        542.05
                                                              summaries. We used 6K merge operations for RO -
Table 1: Dataset statistics for ROTOW IRE and MLB.            TOW IRE and 16K merge operations for MLB. All
Vocabulary size, number of tokens, number of instances        models were implemented on OpenNMT-py (Klein
(i.e., table-summary pairs), number of record types, av-      et al., 2017). We add to set E the paragraph plans
erage number of records, average number of paragraph          corresponding to the output summary paragraphs,
plans, and average summary length.                            to ensure full coverage during training of the macro
                                                              planner. During inference for predicting macro
5    Experimental Setup                                       plans, we employ length normalization (Bahdanau
                                                              et al., 2015) to avoid penalizing longer outputs;
Data We performed experiments on the RO -                     specifically, we divide the scores of beam search
TOW IRE    (Wiseman et al., 2017) and MLB (Pudup-             by the length of the output. In addition, we adopt
pully et al., 2019b) benchmarks. The details of               bigram blocking (Paulus et al., 2018). For MLB,
these two datasets are given in Table 1. We can               we further block beams containing more than two
see that MLB is around 5 times bigger, has a richer           repetitions of a unigram. This helps improve the
vocabulary and longer game summaries. We use                  diversity of the predicted macro plans.
the official splits of 3,398/727/728 for ROTOW IRE
and 22,821/1,739/1,744 for MLB. We make use of                System Comparisons We compared our model
a tokenization script1 to detokenize and retokenize           against the following systems: (1) the Template-
the summaries in both ROTOW IRE and MLB.                      based generators from Wiseman et al. (2017) for
   We reconstructed the MLB dataset, as the ver-              ROTOW IRE and Puduppully et al. (2019b) for
sion released by Puduppully et al. (2019b) had re-            MLB. Both systems apply the same principle, they
moved all paragraph delimiters from game sum-                 emit a sentence about the teams playing in the
maries. Specifically, we followed their methodol-             game, followed by player-specific sentences, and
ogy and downloaded the same summaries from the                a closing sentence. MLB additionally contains a
ESPN website2 and added the  delimiter to                  description of play-by-play; (2) ED+CC, the best
paragraphs in the summaries.3 ROTOW IRE does                  performing system in Wiseman et al. (2017), is a
not have paragraph delimiters in game summaries               vanilla encoder-decoder model equipped with an
either. We reverse engineered these as follows:               attention and copy mechanism; (3) NCP+CC, the
(1) we split summaries into sentences using the               micro planning model of Puduppully et al. (2019a),
NLTK (Bird et al., 2009) sentence tokenizer; (2)              generates content plans from the table by making
initialized each paragraph with a separate sentence;          use of Pointer networks (Vinyals et al., 2015) to
(3) merged two paragraphs into one if the entities in         point to records; content plans are encoded with a
the former were a superset of entities in the latter;         BiLSTM and the game summary is decoded using
(4) repeated Step 3 until no merges were possible.            another LSTM with attention and copy; (4) ENT,
                                                              the entity-based model of Puduppully et al. (2019b),
Training Configuration We tuned the model hy-                 creates dynamically updated entity-specific repre-
perparameters on the development set. For training            sentations; the text is generated conditioned on the
the macro planning and the text generation stages,            data input and entity memory representations using
we used the Adagrad (Duchi et al., 2011) optimizer.           hierarchical attention at each time step.
Furthermore, the text generation stage made use of
truncated BPTT (Williams and Peng, 1990) with                 6   Results
    1 https://github.com/neulab/DGT
                                                              Automatic Evaluation For automatic evalua-
    2 http://www.espn.com/mlb/recap?gameId=
                                                              tion, following earlier work (Wiseman et al. 2017;
{gameid}
    3 Although our model is trained on game summaries with
                                                              Puduppully et al. 2019a,b, inter alia) we report
paragraph delimiters, and also predicts these at generation   BLEU (Papineni et al., 2002) with the gold sum-
time, for evaluation we strip  from model output.          mary as reference but also make use of the Infor-
RG              CS         CO            TOW IRE   test set. In addition to Templ, NCP+CC,
ROTOW IRE                                       BLEU
             #    P%    P%     R%     F% DLD%
                                                         ENT, and ED+CC we include the best performing
Templ       54.3 99.9   27.1   57.7   36.9 13.1  8.46
WS-2017     34.1 75.1   20.3   36.3   26.1 12.4 14.19    model of Wiseman et al. (2017) (WS-2017; note
ED+CC       35.9 82.6   19.8   33.8   24.9 12.0 14.99    that ED+CC is an improved re-implementation
NCP+CC      40.8 87.6   28.0   51.1   36.2 15.8 16.50    of their model), and the model of Rebuffel et al.
ENT         32.7 91.7   34.7   48.5   40.5 16.6 16.12    (2020) (RBF-2020) which represents the state of
RBF-2020    44.9 89.5   23.9   47.0   31.7 14.3 17.16
                                                         the art on ROTOW IRE. This model has a Trans-
Macro       42.1 97.6   34.1   57.8   42.9 17.7 15.46
−Plan(4)    36.2 81.3   22.1   38.6   28.1 12.1 14.00    former encoder (Vaswani et al., 2017) with a hi-
                                                         erarchical attention mechanism over entities and
MLB
                RG              CS         CO
                                                BLEU
                                                         records within entities. The models of Saleh et al.
             #    P%    P%     R%     F% DLD%            (2019), Iso et al. (2019), and Gong et al. (2019)
Templ       62.3 99.9   21.6   55.2   31.0 11.0  4.12    make use of additional information not present
ED+CC       32.5 91.3   27.8   40.6   33.0 17.1  9.68
NCP+CC      19.6 81.3   44.5   44.1   44.3 21.9  9.68    in the input (e.g., previous/next games, summary
ENT         23.8 81.1   40.9   49.5   44.8 20.7 11.50    writer) and are not directly comparable to the sys-
Macro       30.8 94.4   40.8   54.9   46.8 21.8 12.62    tems in Table 2. Results for the MLB test set are in
−Plan(SP,4) 25.1 92.7   40.0   44.6   42.2 21.9 11.09    the bottom portion of Table 2.
Table 2: Evaluation on ROTOW IRE and MLB test sets;         Templ has the highest RG precision and count
relation generation (RG) count (#) and precision (P%),   on both datasets. This is not surprising, by design
content selection (CS) precision (P%), recall (R%) and   Templ is always faithful to the input. However,
F-measure (F%), content ordering (CO) in normalized      notice that it achieves the lowest BLEU amongst
Damerau-Levenshtein distance (DLD%), and BLEU.           comparison systems indicating that it mostly regur-
                                                         gitates facts with low fluency. Macro achieves the
                                                         highest RG precision amongst all neural models
mation Extraction (IE) metrics from Wiseman et al.
                                                         for ROTOW IRE and MLB. We obtain an absolute
(2017) which are defined over the output of an IE
                                                         improvement of 5.9% over ENT for ROTOW IRE
system; the latter extracts entity (players, teams)
                                                         and 13.3% for MLB. In addition, Macro achieves
and value (numbers) pairs in a summary, and then
                                                         the highest CS F-measure for both datasets. On
predicts the type of relation. For instance, given the
                                                         ROTOW IRE, Macro achieves the highest CO score,
pair Kansas City Royals, 9, it would predict their
                                                         and the highest BLEU on MLB. On ROTOW IRE, in
relation as TR (i.e., Team Runs). Training data for
                                                         terms of BLEU, Macro is worse than comparison
the IE system is obtained by checking for matches
                                                         models (e.g., NCP+CC or ENT). Inspection of the
between entity, value pairs in the gold summary
                                                         output showed that the opening paragraph, which
and entity, value, record type triplets in the table.
                                                         mostly describes how the two teams fared, is gener-
   Let ŷ be the gold summary and y the model
                                                         ally shorter in Macro, leading to shorter summaries
output. Relation Generation (RG) measures the
                                                         and thus lower BLEU. There is high variance in
precision and count of relations extracted from
                                                         the length of the opening paragraph in the training
y that also appear in records r. Content Selec-
                                                         data and Macro verbalizes the corresponding plan
tion (CS) measures the precision and recall of
                                                         conservatively. Ideas such as length normalisation
relations extracted from y that are also extracted
                                                         (Wu et al., 2016) or length control (Kikuchi et al.,
from ŷ. Content Ordering (CO) measures the nor-
                                                         2016; Takeno et al., 2017; Fan et al., 2018) could
malized Damerau-Levenshtein distance between
                                                         help alleviate this; however, we do not pursue them
the sequences of relations extracted from y and ŷ.
                                                         further for fair comparison with the other models.
   We reused the IE model from Puduppully et al.
(2019a) for ROTOW IRE but retrained it for MLB           The Contribution of Macro Planning To study
to improve its precision and recall. Furthermore,        the effect of macro planning in more detail, we
the implementation of Wiseman et al. (2017) com-         further compared Macro against text generation
putes RG, CS, and CO excluding duplicate rela-           models (see Section 4.2) which are trained on ver-
tions. This artificially inflates the performance of     balizations of the tabular data (and gold summaries)
models whose outputs contain repetition. We in-          but do not make use of document plans or a doc-
clude duplicates in the computation of the IE met-       ument planning mechanism. On ROTOW IRE, the
rics (and recreate them for all comparison systems).     model was trained on verbalizations of players and
   Table 2 (top) presents our results on the RO -        teams, with the input arranged such that the ver-
Macro          CS-P    CS-R    CS-F    CO                 
      ROTOW IRE      81.3    73.2    77.0    45.8               
                                                                
      MLB            80.6    63.3    70.9    31.4               
                                                           ST. PETERSBURG, Fla. (AP) – The Tampa Bay Rays are making
Table 3: Evaluation of macro planning stage; content       the most of it.  Akinori Iwamura hit a two-run homer in the
selection precision (CS-P), recall (CS-R), F-measure       eighth inning and the Rays beat the Boston Red Sox 2-1 on Sunday
(CS-F) and content ordering (CO) between the inferred      to complete a three-game sweep.  The Rays, who have the best
                                                           record in the majors, have won six of their last seven games. 
plans and gold plans in terms of entities and events for   The Rays have won four of their last five series, including three in
ROTOW IRE (RW) and MLB test sets.                          a row against the Red Sox, who have won six of their last seven
                                                           overall.  Dioner Navarro singled with one out in the eighth off
                                                           Clay Buchholz (1-2) and moved to third on Jason Bartlett’s flyout to
                                                           center. Iwamura then drove a 1-1 pitch into the left-field stands for
balization of the home team was followed by the            his second homer of the season.  Scott Dohmann (2-0) got the
visiting team, the home team players and the visit-        win in relief , striking out Manny Ramirez with runners on first and
                                                           third to end the eighth.  Troy Percival worked the ninth for
ing team players. Mention of players was limited           his fifth save in five opportunities.  Clay Buchholz (1-2) gave
to the four best ones, following Saleh et al. (2019)       up two runs and three hits in eight innings. He struck out nine and
                                                           walked two.  The Red Sox loaded the bases with one out in the
(see −Plan(4) in Table 2). For MLB, we addi-               fifth on a single by Coco Crisp, a wild pitch and a walk to Jed Lowrie.
tionally include verbalizations of innings focusing        Jacoby Ellsbury drove in Crisp with a two-out single to center. 
                                                           Jackson struck out four and walked three.  The Red Sox loaded
on scoring plays which are likely to be discussed          the bases with one out in the fifth on a single by Coco Crisp, a walk
in game summaries (see −Plan(SP,4) in Table 2).            to Jed Lowrie and a one-out walk to Jed Lowrie. Jackson struck out
Note that by preprocessing the input in such a way         Julio Lugo, but Jacoby Ellsbury singled to center to put the Red Sox
                                                           up 1-0.  The Red Sox threatened in the eighth when J. D. Drew
some simple form of content selection takes place          drew a two-out walk against Trever Miller, but Ramirez struck out to
simply by removing extraneous information which            end the inning.
the model does not need to consider.
                                                           Table 4: Predicted macro plan (top) with corresponding
   Across both datasets, −Plan variants appear             model output (bottom). Entities and events in summary
competitive. On ROTOW IRE −Plan(4) is better               corresponding to those in the macro plan are bold faced.
than ED+CC in terms of content selection but
worse compared to ENT. On MLB, −Plan(SP,4) is
                                                           Table 4 together with the predicted document plan.
again superior to ED+CC in terms of content selec-
tion but not ENT whose performance lags behind             Human-Based Evaluation We also asked par-
when considering RG precision. Taken together,             ticipants to assess model output in terms of rela-
these results confirm that verbalizing entities and        tion generation, grammaticality, coherence, and
events into a text sequence is effective. At the same      conciseness (Wiseman et al., 2017; Puduppully
time, we see that −Plan variants are worse than            et al., 2019a,b), For ROTOW IRE, we compared
Macro across most metrics which underlines the             Macro against RBF-20204 , ED+CC, Gold, and
importance of an explicit planning component.              Templ. For MLB, we compared Macro against
   Table 3 presents intrinsic evaluation of the macro      ENT, ED+CC, Gold, and Templ.
planning stage. Here, we compare the inferred                 We conducted our study on the Amazon Mechan-
macro plan with the gold macro plans, CS and CO            ical Turk (AMT) crowdsourcing platform, follow-
metrics with regard to entities and events instead         ing best practices for human evaluation in NLG
of relations. We see that our macro planning model         (van der Lee et al., 2019). Specifically, to en-
(Macro) achieves high scores for CS and CO for             sure consistent ratings, we required crowdworkers
both ROTOW IRE and MLB. We further used the                to have an approval rating greater than 98% and
CS and CO metrics to check how well the gener-             a minimum of 1,000 previously completed tasks.
ated summary follows the (predicted) plan. We              Raters were restricted to English speaking coun-
followed the steps in Section 3.2 and reverse engi-        tries (i.e., US, UK, Canada, Ireland, Australia, or
neered macro plans from the model summaries and            NZ). Participants were allowed to provide feedback
compared these extracted plans with the original           on the task or field questions (our interface accepts
macro plans with regard to entities and events. We         free text).
found that Macro creates summaries which follow               In our first study, we presented crowdworkers
the plan closely: for ROTOW IRE, the CS F-score            with sentences randomly selected from summaries
and CO are greater than 98%; for MLB, the CS               along with their corresponding box score (and play-
F-score is greater than 94% and CO is greater than             4 We are grateful to Clément Rebuffel for providing us with

89%. We show an output summary for Macro in                the output of their system.
ROTOW IRE   #Supp   #Contra    Gram      Coher Concis     records) from the table.
Gold        3.63     0.07       38.33    46.25* 30.83
Templ       7.57*    0.08     −61.67*   −52.92* −36.67*
ED+CC       3.92     0.91*       5.0     −8.33 −4.58         We also conducted a second study to evaluate the
RBF-2020    5.08*    0.67*     13.33      4.58    3.75    quality of the generated summaries. We presented
Macro       4.00     0.27       5.0      10.42    6.67
                                                          crowdworkers with a pair of summaries and asked
MLB          #Supp #Contra     Gram        Coher Concis   them to choose the better one in terms of Grammat-
Gold         3.59     0.14      21.67      30.0     26.67 icality (is the summary written in well-formed En-
Templ        4.21     0.04    −51.25* −43.75*        7.5  glish?), Coherence (is the summary well structured
ED+CC        3.42     0.72*   −22.5* −12.08* −39.17* and well organized and does it have a natural order-
ENT          3.71     0.73*      5.83* −0.83* −22.08*
Macro        3.76     0.25      46.25     26.67     27.08
                                                          ing of the facts?) and Conciseness (does the sum-
                                                          mary avoid unnecessary repetition including whole
Table 5: Average number of supported (#Supp) and          sentences, facts or phrases?). We provided example
contradicting (#Contra) facts in game summaries and       summaries showcasing good and bad output. For
best-worst scaling evaluation (higher is better). Systems this task, we required that the crowdworkers be able
significantly different from Macro are marked with an     to comfortably comprehend NBA/MLB game sum-
asterisk * (using a one-way ANOVA with posthoc Tukey
                                                          maries. We elicited preferences with Best-Worst
HSD tests; p ≤ 0.05).
                            .                             Scaling (Louviere and Woodworth, 1991; Louviere
                                                          et al., 2015), a method shown to be more reliable
                                                          than rating scales. The score of a system is com-
by-play in case of MLB) and asked them to count           puted as the number of times it is rated best minus
supported and contradicting facts (ignoring halluci- the number of times it is rated worst (Orme, 2009).
nations, i.e., unsupported facts). We did not require     The scores range from −100 (absolutely worst) to
crowdworkers to be familiar with NBA or MLB. +100 (absolutely best). We divided the five compet-
Instead, we provided a cheat sheet explaining the         ing systems into ten pairs of summaries and elicited
semantics of box score tables. In addition, we            ratings for 40 summaries (20 per dataset). Each
provided examples of sentences with supported/- summary pair was rated by 3 raters. This resulted
contradicting facts. We evaluated 40 summaries            in 40 summaries × 10 system pairs × 3 evaluation
from the test set (20 per dataset), 4 sentences from      criteria × 3 raters for a total of 3,600 tasks. 206
each summary and elicited 3 responses per sum- crowdworkers participated in this task (agreement
mary. This resulted in 40 summaries × 5 systems           using Krippendorff’s α was 0.47).
× 3 raters for a total of 600 tasks. Altogether 131
crowdworkers participated in this study (agreement           As shown in Table 5, on ROTOW IRE, Macro
using Krippendorff’s α was 0.44 for supported and         is comparable to Gold, RBF-2020, and ED+CC
0.42 for contradicting facts).                            in terms of Grammaticality but significantly bet-
   As shown in Table 5, Macro yields the small-           ter than Templ. In terms of Coherence, Macro is
est number of contradicting facts among neural            comparable to RBF-2020 and ED+CC but signif-
models on both datasets. On ROTOW IRE the num-            icantly better than Templ and significantly worse
ber of contradicting facts for Macro is comparable        than Gold. With regard to Conciseness, Macro is
to Gold and Templ (the difference is not statis-          comparable to Gold, RBF-2020, and ED+CC, and
tically significant) and significantly smaller com-       significantly better than Templ. On MLB, Macro is
pared to RBF-2020 and ED+CC. The count of                 comparable to Gold in terms of Grammaticality and
supported facts for Macro is comparable to Gold,          significantly better than ED+CC, ENT and Templ.
and ED+CC, and significantly lower than Templ             Macro is comparable to Gold in terms of Coher-
and RBF-2020. On MLB, Macro has significantly             ence and significantly better than ED+CC, ENT
fewer contradicting facts than ENT and ED+CC              and Templ. In terms of Conciseness, raters found
and is comparable to Templ, and Gold (the differ-         Macro comparable to Gold and Templ and signif-
ence is not statistically significant). The count of      icantly better than ED+CC, and ENT. Taken to-
supported facts for Macro is comparable to Gold,          gether, our results show that macro planning leads
ENT, ED+CC and Templ. For both datasets, Templ            to improvement in data-to-text generation in com-
has the lowest number of contradicting facts. This        parison to other systems for both ROTOW IRE and
is expected as Templ essentially parrots facts (aka       MLB datasets.
7   Discussion                                             Despite promising results, there is ample room
                                                        to improve macro planning, especially in terms
In this work we presented a plan-and-generate ap-       of the precision of RG (see Table 2, P% column
proach for data-to-text generation which consists       of RG). We should not underestimate that Macro
of a macro planning stage representing high-level       must handle relatively long inputs (the average in-
document organization in terms of structure and         put length in the MLB development set is ~3100 to-
content, followed by a text generation stage. Ex-       kens) which are challenging for the attention mech-
tensive automatic and human evaluation shows that       anism. Consider the following output of our model
our approach achieves better results than existing      on the MLB dataset: Ramirez’s two-run double
state-of-the-art models and generates summaries         off Joe Blanton tied it in the sixth, and Brandon
which are factual, coherent, and concise.               Moss added a two-out RBI single off Alan Em-
   Our results show that macro planning is more         bree to give Boston a 3-2 lead. Here, the name
advantageous for generation tasks expected to pro-      of the pitcher should have been Joe Blanton in-
duce longer texts with multiple discourse units, and    stead of Alan Embree. In fact, Alan Embree is the
could be easily extended to other sports domains        pitcher for the following play in the half inning.
such as cricket (Kelly et al., 2009) or American        In this case, attention diffuses over the relatively
football (Barzilay and Lapata, 2005). Other ap-         long MLB macro plan, leading to inaccurate con-
proaches focusing on micro planning (Puduppully         tent selection. We could alleviate this problem by
et al., 2019a; Moryossef et al., 2019) might be bet-    adopting a noisy channel decomposition (Yee et al.,
ter tailored for generating shorter texts. There has    2019; Yu et al., 2020), i.e., by learning two different
been a surge of datasets recently focusing on single-   distributions: a conditional model which provides
paragraph outputs and the task of content selection     the probability of translating a paragraph plan to
such as E2E (Novikova et al., 2017), WebNLG             text and a language model which provides an un-
(Gardent et al., 2017), and WikiBio (Lebret et al.,     conditional estimate of the output (i.e., the whole
2016; Perez-Beltrachini and Lapata, 2018). We           game summary). However, we leave this to future
note that in our model content selection takes place    work.
during macro planning and text generation. The
results in Table 2 show that Macro achieves the            For ROTOW IRE, the main source of errors is the
highest CS F-measure on both datasets indicating        model’s inability to understand numbers. For ex-
that the document as a whole and individual sen-        ample, Macro generates the following output The
tences discuss appropriate content.                     Lakers were the superior shooters in this game, go-
                                                        ing 48 percent from the field and 30 percent from
   Throughout our experiments we observed that
                                                        the three-point line, while the Jazz went 47 percent
template-based systems score poorly in terms of
                                                        from the floor and 30 percent from beyond the arc..
CS (but also CO and BLEU). This is primarily
                                                        Here, 30 percent should have been 24 percent for
due to the inflexibility of the template approach
                                                        the Lakers but the language model expects a higher
which is limited to the discussion of a fixed num-
                                                        score for the three-point line, and since 24 is low
ber of (high-scoring) players. Yet, human writers
                                                        (especially compared to 30 scored by the Jazz), it
(and neural models to a certain extent), synthesize
                                                        simply copies 30 scored by the Jazz instead. A
summaries taking into account the particulars of a
                                                        mechanism for learning better representations for
specific game (where some players might be more
                                                        numbers (Wallace et al., 2019) or executing oper-
important than others even if they scored less) and
                                                        ations such as argmax or minus (Nie et al., 2018)
are able to override global defaults. Template sen-
                                                        should help alleviate this problem.
tences are fluent on their own, but since it is not
possible to perform aggregation (Reiter, 1995), the        Finally, although our focus so far has been on
whole summary appears stilted, it lacks coherence       learning document plans from data, the decoupling
and variability, contributing to low BLEU scores.       of planning from generation allows to flexibly gen-
The template baseline is worse for MLB than RO -        erate output according to specification. For ex-
TOW IRE which reflects the greater difficulty to        ample, we could feed the model with manually
manually create a good template for MLB. Overall,       constructed macro plans, consequently controlling
we observe that neural models are more fluent and       the information content and structure of the output
coherent, being able to learn a better ordering of      summary (e.g., for generating short or long texts,
facts which is in turn reflected in better CO scores.   or focusing on specific aspects of the game).
Acknowledgements                                         pages 675–686, Melbourne, Australia. Associa-
                                                         tion for Computational Linguistics.
We thank the Action Editor, Claire Gardent, and the
three anonymous reviewers for their constructive       Robert Dale. 1989. Generating referring expres-
feedback. We also thank Laura Perez-Beltrachini          sions in a domain of objects and processes.
for her comments on an earlier draft of this pa-
per, and Parag Jain, Hao Zheng, Stefanos Ange-         Pablo Duboue and Kathleen McKeown. 2002. Con-
lidis and Yang Liu for helpful discussions. We           tent planner construction via evolutionary algo-
acknowledge the financial support of the European        rithms and a corpus-based fitness function. In
Research Council (Lapata; award number 681760,           Proceedings of the International Natural Lan-
“Translating Multiple Modalities into Text”).            guage Generation Conference, pages 89–96,
                                                         Harriman, New York, USA. Association for
                                                         Computational Linguistics.
References
                                                       Pablo A. Duboue and Kathleen R. McKeown. 2001.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
                                                         Empirically estimating order constraints for con-
  Bengio. 2015. Neural machine translation by
                                                         tent planning in generation. In Proceedings
  jointly learning to align and translate. In 3rd
                                                         of the 39th Annual Meeting of the Associa-
  International Conference on Learning Represen-
                                                         tion for Computational Linguistics, pages 172–
  tations, ICLR 2015, San Diego, CA, USA, May
                                                         179, Toulouse, France. Association for Compu-
 7-9, 2015, Conference Track Proceedings.
                                                         tational Linguistics.
Regina Barzilay and Mirella Lapata. 2005. Collec-
                                                       John C. Duchi, Elad Hazan, and Yoram Singer.
  tive content selection for concept-to-text genera-
                                                         2011. Adaptive subgradient methods for online
  tion. In Proceedings of Human Language Tech-
                                                         learning and stochastic optimization. Journal of
  nology Conference and Conference on Empirical
                                                         Machine Learning Research, 12:2121–2159.
  Methods in Natural Language Processing, pages
  331–338, Vancouver, British Columbia, Canada.        Angela Fan, David Grangier, and Michael Auli.
  Association for Computational Linguistics.             2018. Controllable abstractive summarization.
Steven Bird, Ewan Klein, and Edward Loper.               In Proceedings of the 2nd Workshop on Neu-
  2009. Natural Language Processing with                 ral Machine Translation and Generation, pages
  Python. O’Reilly Media.                               45–54, Melbourne, Australia. Association for
                                                         Computational Linguistics.
Thiago Castro Ferreira, Chris van der Lee, Emiel
  van Miltenburg, and Emiel Krahmer. 2019. Neu-        Claire Gardent, Anastasia Shimorina, Shashi
  ral data-to-text generation: A comparison be-          Narayan, and Laura Perez-Beltrachini. 2017.
  tween pipeline and end-to-end architectures. In        Creating training corpora for NLG micro-
  Proceedings of the 2019 Conference on Em-              planners. In Proceedings of the 55th Annual
  pirical Methods in Natural Language Process-           Meeting of the Association for Computational
  ing and the 9th International Joint Confer-            Linguistics (Volume 1: Long Papers), pages 179–
  ence on Natural Language Processing (EMNLP-            188, Vancouver, Canada. Association for Com-
  IJCNLP), pages 552–562, Hong Kong, China.              putational Linguistics.
  Association for Computational Linguistics.
                                                       Albert Gatt and Emiel Krahmer. 2018. Survey of
Wallace L. Chafe. 1979. The flow of thought and          the state of the art in natural language generation:
 the flow of language. In Talmy Givón, editor,           Core tasks, applications and evaluation. J. Artif.
 Syntax and Semantics, volume 12, pages 159–             Intell. Res., 61:65–170.
 181. Academic Press Inc.
                                                       Heng Gong, Xiaocheng Feng, Bing Qin, and Ting
Yen-Chun Chen and Mohit Bansal. 2018. Fast ab-           Liu. 2019. Table-to-text generation with ef-
  stractive summarization with reinforce-selected        fective hierarchical encoder on three dimen-
  sentence rewriting. In Proceedings of the 56th         sions (row, column and time). In Proceed-
  Annual Meeting of the Association for Compu-           ings of the 2019 Conference on Empirical Meth-
  tational Linguistics (Volume 1: Long Papers),          ods in Natural Language Processing and the
9th International Joint Conference on Natural        Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hi-
  Language Processing (EMNLP-IJCNLP), pages              roya Takamura, and Manabu Okumura. 2016.
  3143–3152, Hong Kong, China. Association for           Controlling output length in neural encoder-
  Computational Linguistics.                             decoders. In Proceedings of the 2016 Confer-
                                                         ence on Empirical Methods in Natural Language
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K.        Processing, pages 1328–1338, Austin, Texas. As-
   Li. 2016. Incorporating copying mechanism in          sociation for Computational Linguistics.
   sequence-to-sequence learning. In Proceedings
   of the 54th Annual Meeting of the Association       Guillaume Klein, Yoon Kim, Yuntian Deng, Jean
   for Computational Linguistics (Volume 1: Long         Senellart, and Alexander Rush. 2017. Open-
   Papers), pages 1631–1640, Berlin, Germany. As-        NMT: Open-source toolkit for neural machine
   sociation for Computational Linguistics.              translation. In Proceedings of ACL 2017, Sys-
                                                         tem Demonstrations, pages 67–72, Vancouver,
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati,          Canada. Association for Computational Linguis-
  Bowen Zhou, and Yoshua Bengio. 2016. Point-            tics.
  ing the unknown words. In Proceedings of the
  54th Annual Meeting of the Association for Com-      Ioannis Konstas and Mirella Lapata. 2013. Induc-
  putational Linguistics (Volume 1: Long Papers),        ing document plans for concept-to-text gener-
  pages 140–149, Berlin, Germany. Association            ation. In Proceedings of the 2013 Conference
  for Computational Linguistics.                         on Empirical Methods in Natural Language Pro-
                                                         cessing, pages 1503–1514, Seattle, Washington,
M. A. K. Halliday and Ruqaiya Hasan. 1976. Co-           USA. Association for Computational Linguis-
 hesion in English. Longman, London.                     tics.
Sepp Hochreiter and Jürgen Schmidhuber. 1997.          Karen Kukich. 1983. Design of a knowledge-based
  Long Short-Term Memory. Neural Computa-                report generator. In 21st Annual Meeting of the
  tion, 9:1735–1780.                                     Association for Computational Linguistics.
Eduard H Hovy. 1993. Automated discourse gen-          Anirban Laha, Parag Jain, Abhijit Mishra, and
  eration using discourse structure relations. Arti-     Karthik Sankaranarayanan. 2020. Scalable
  ficial intelligence, 63(1-2):341–385.                  micro-planned generation of discourse from
                                                         structured data. Computational Linguistics,
Hayate Iso, Yui Uehara, Tatsuya Ishigaki, Hiroshi
                                                        45(4):737–763.
  Noji, Eiji Aramaki, Ichiro Kobayashi, Yusuke
  Miyao, Naoaki Okazaki, and Hiroya Takamura.          Rémi Lebret, David Grangier, and Michael Auli.
  2019. Learning to select, track, and generate for      2016. Neural text generation from structured
  data-to-text. In Proceedings of the 57th Annual        data with application to the biography domain.
  Meeting of the Association for Computational           In Proceedings of the 2016 Conference on Empir-
  Linguistics, pages 2102–2113, Florence, Italy.         ical Methods in Natural Language Processing,
  Association for Computational Linguistics.             pages 1203–1213, Austin, Texas. Association
                                                         for Computational Linguistics.
Min-Yen Kan and Kathleen R. McKeown. 2002.
 Corpus-trained text generation for summariza-         Chris van der Lee, Albert Gatt, Emiel van Mil-
 tion. In Proceedings of the International Nat-          tenburg, Sander Wubben, and Emiel Krahmer.
 ural Language Generation Conference, pages              2019. Best practices for the human evaluation
 1–8, Harriman, New York, USA. Association for           of automatically generated text. In Proceedings
 Computational Linguistics.                              of the 12th International Conference on Natural
                                                         Language Generation, pages 355–368, Tokyo,
Colin Kelly, Ann Copestake, and Nikiforos Kara-
                                                         Japan. Association for Computational Linguis-
  manis. 2009. Investigating content selection
                                                         tics.
  for language generation using machine learning.
  In Proceedings of the 12th European Workshop         R E Longacre. 1979. The paragraph as a gram-
  on Natural Language Generation (ENLG 2009),            matical unit. In Talmy Givón, editor, Syntax
  pages 130–137, Athens, Greece. Association for         and Semantics, volume 12, pages 115–133. Aca-
  Computational Linguistics.                             demic Press Inc.
You can also read