Incremental Beam Manipulation for Natural Language Generation

James Hargreaves
The Trade Desk & University of Cambridge
james.hargreaves@thetradedesk.com

Andreas Vlachos
University of Cambridge
av308@cam.ac.uk

Guy Emerson
University of Cambridge
gete2@cam.ac.uk

Abstract

The performance of natural language generation systems has improved substantially with modern neural networks. At test time they typically employ beam search to avoid locally optimal but globally suboptimal predictions. However, due to model errors, a larger beam size can lead to deteriorating performance according to the evaluation metric. For this reason, it is common to rerank the output of beam search, but this relies on beam search to produce a good set of hypotheses, which limits the potential gains. Other alternatives to beam search require changes to the training of the model, which restricts their applicability compared to beam search. This paper proposes incremental beam manipulation, i.e. reranking the hypotheses in the beam during decoding instead of only at the end. This way, hypotheses that are unlikely to lead to a good final output are discarded, and in their place hypotheses that would have been ignored will be considered instead. Applying incremental beam manipulation leads to an improvement of 1.93 and 5.82 BLEU points over vanilla beam search for the test sets of the E2E and WebNLG challenges respectively. The proposed method also outperformed a strong reranker by 1.04 BLEU points on the E2E challenge, while being on par with it on the WebNLG dataset.

1 Introduction

In natural language generation (NLG), the goal is to generate text representing structured information (e.g. a database record or a meaning representation) that is both fluent and contains the right information. Sequence-to-sequence models (seq2seq) have been effective on many tasks in NLG (for example: Wen et al., 2015; Dušek and Jurčíček, 2016). These systems first create an embedding for the input information. This embedding is used incrementally during decoding, generating one token at a time. Seq2seq models are generally decoded using beam search, to mitigate the effect of locally optimal but globally suboptimal decisions made by greedy search.

The performance of NLG systems can plateau or even decrease when beam sizes larger than 10 are used, which is counter-intuitive since larger beams produce more likely sequences according to the model. For example, Dušek and Jurčíček (2016) used a beam size of 10, and Asghar et al. (2017) found a size of 5 to be optimal. Decreasing performance has been found across a range of tasks (Cohen and Beck, 2019), and was given by Koehn and Knowles (2017) as one of the six main challenges facing neural machine translation. To investigate this, Stahlberg and Byrne (2019) presented an exact search algorithm to find the most likely output according to a seq2seq model. However, this performed poorly compared to beam search, demonstrating that search errors (from beam search) can mask model errors (from the seq2seq model).

To mitigate the limitations of beam search, it is common practice to apply a reranker to the final set of hypotheses. This can be done by defining a reranking criterion (for example: Kumar and Byrne, 2004; Blain et al., 2017; Borgeaud and Emerson, 2020) or by training a reranker to predict the best hypothesis in a beam (for example: Dušek and Jurčíček, 2016; Agarwal et al., 2018). Training a reranker allows us to take into account information from outside the model and mitigate model errors. However, rerankers can only choose a hypothesis from the final beam, which limits their potential. To quantify this, we trained the seq2seq model proposed by Dušek and Jurčíček (2016), and applied it to the E2E validation set (Novikova et al., 2017b). For each instance, we recorded the point at which

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 2563–2574
                            April 19 - 23, 2021. ©2021 Association for Computational Linguistics
all gold-standard references fell out of the beam, meaning that none of the partial hypotheses in the beam could be extended to a gold reference. A final beam containing at least one of the references would score optimally with an oracle reranker (providing an upper bound on performance). Figure 1 shows the results for beam size 3.¹ The final beam contained a reference in only 60 out of 547 cases (11%). For the remaining 89% of the cases, even an oracle reranker would be unable to give optimal results. The figure also shows that in over half of the cases, all references fell out in the first 6 steps. In contrast, references that were still in the beam at step 15 were almost certain to stay in the beam until the end. These observations suggest that an early manipulation of the beam has a strong potential to improve performance.

¹ For larger beam sizes, the same general trends were observed. See Appendix A for beam size 10.

Figure 1: The percentage of beams which contain a reference (orange), or which could still lead to a reference (blue), using the model of Dušek and Jurčíček (2016) with beam size 3 on the E2E validation set.

In this paper, we propose a method for manipulating which items are pruned from the beam at each stage of decoding. We then present evidence that this is a successful approach: it led to an improvement of 1.93 and 5.82 BLEU points over vanilla beam search on the E2E and WebNLG challenges, respectively. When comparing to a strong reranker, the performance of incremental beam manipulation was similar on the WebNLG dataset, whilst increasing the performance on the E2E challenge by 1.04 points. We also applied beam manipulation on top of length normalisation (Murray and Chiang, 2018), and incremental beam manipulation was able to improve its performance.

2 Related Work

This paper is far from the first to try to improve beam search for natural language generation. One modification is to use a variable beam size instead of a fixed one (Freitag and Al-Onaizan, 2017). However, this can only improve decoding speed, as the ranking of the hypotheses in the beam remains unchanged, and thus model errors are exposed by the reduction of search errors.

Length normalisation (Murray and Chiang, 2018) is a widely used strategy that often improves the performance of a beam search decoder, by mitigating the fact that seq2seq models are biased towards generating shorter sequences. Rather than directly using model probabilities to order the hypotheses in the beam, each probability is normalised according to the length of the hypothesis, so that shorter hypotheses are penalised. However, this only has an impact once the hypotheses within the beam have different lengths. This only occurs towards the end of the decoding process, and we showed in the previous section that the reference hypotheses often fall out of the beam relatively early. Furthermore, Stahlberg and Byrne (2019) showed that the biases causing the deteriorating model performance are more complex than a simple length bias.

Wiseman and Rush (2016) modified the training procedure for seq2seq models. They ran beam search and introduced a loss each time the gold-standard sequence fell out of the beam. Goyal et al. (2018) and Collobert et al. (2019) also modified the training procedure. They added a term to the loss function that approximated the loss the model would receive when generating using a beam search method for each example in the training set. However, one of the reasons that beam search has been so widely used is that it can be applied on top of a language model without changing the training procedure, and this is lost with these approaches.

Gu et al. (2017) manipulated the hidden state of the language model at each step of the decoding. This was achieved via a multi-output regressor that produced a vector that is added to the hidden state used in decoding. The regressor was trained via reinforcement learning, and the training signal was gathered by injecting unstructured noise into the hidden state. Chen et al. (2018) also manipulate the hidden state. For each training instance, they apply beam search and take the hypothesis with the

highest BLEU score. The manipulator network is trained to encourage a greedy decoder to produce this output. Both of these approaches rely on inferring a better hidden state to be used in the decoding, which is not straightforward to define. We instead manipulate the hypotheses in the beam directly.

Finally, Negrinho et al. (2018) presented a framework for learning a beam search algorithm via imitation learning. This resulted in a beam-aware algorithm which was proved to have no-regret guarantees. While this paper makes a compelling argument for the method in theory, putting it into practice requires a number of further engineering decisions. Our work can be seen as a way of applying this general framework using a simple and computationally efficient roll-out strategy.

3 Incremental Beam Manipulation

In order to describe our method for incremental beam manipulation, we first introduce terminology to describe a standard beam search decoder. The decoder produces a sequence iteratively, token by token. At each iteration it performs 3 actions: expand, rank and prune. The expand step generates all possible next-step hypotheses. The rank step orders these hypotheses from most likely to least likely. The prune step then removes the hypotheses that are near the end of this order.

This formulation of the beam search algorithm enables us to view beam manipulation as a ranking problem, since expand is determined by the (fixed) decoder and the chosen beam size determines the pruning. The rank step determines which hypotheses will not be kept in the next iteration and hence discarded. By modifying the ranking method used, we can choose the partial hypotheses expanded during beam search, taking into account the current state of the beam as well as signals beyond model scores.

It is worth noting that while this paper applies beam manipulation on top of a seq2seq model, the techniques used could be applied without change to any conditional or unconditional neural language model that can be decoded using beam search.

3.1 Ranking via Roll-out

Partial hypotheses are more difficult to rank than complete hypotheses since the rest of the generated text is unknown. For example, consider the following partial hypotheses:

    Loch Fyne is a restaurant located...

    There is a family friendly...

Both of these convey some information about a family-friendly restaurant named 'Loch Fyne'. It is hard to know which partial sequence will lead to a better complete sentence, which is what we would like a ranker to tell us. Existing rerankers often rely on detecting missing information, but some information may still be to come for partial hypotheses.

What we need is a way to rank partial hypotheses based on how the seq2seq model is likely to complete them. We propose ranking partial hypotheses based on a greedy roll-out. This is a computationally efficient approximation of how the seq2seq model might complete the partial hypothesis.

In the existing literature, roll-outs are generally used at training time (Chang et al., 2015), for the situation where the model's subsequent decisions influence the loss function for an individual decision. The roll-outs are used to produce an approximation to the final sequence that would be reached if the original action was taken. This enables a value for the loss of the original decision to be predicted.

On the other hand, incremental beam manipulation aims to predict which partial hypotheses will lead to good completed sequences. Similar to traditional roll-outs, this is impacted by the generating model's subsequent decisions. In this case, the difference is that roll-outs are used to provide features in addition to obtaining a training signal. It is also worth noting that for incremental beam manipulation we use roll-outs at test time as well as at training time.

Beam manipulation can be applied after any step in the beam search decoding. Figure 2 illustrates a single manipulation. The roll-outs are used to produce approximations to the completed hypotheses that would be produced if the partial hypothesis remained in the beam. These completed sequences are then ranked to define an order of the partial sequences. Since this may result in different hypotheses remaining in the beam, the area of the search space considered during the decoding has been manipulated.
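To make the expand/rank/prune decomposition of Section 3 concrete, here is a minimal Python sketch of a beam search decoder with a pluggable rank step. The `next_token_scores` model and the list-of-tokens sequence representation are hypothetical stand-ins, not the paper's seq2seq architecture; replacing `rank_fn` is exactly where incremental beam manipulation hooks in.

```python
def beam_search(start, next_token_scores, rank_fn, beam_size, max_len, eos):
    """Beam search decomposed into expand, rank and prune steps.

    next_token_scores(seq) -> {token: log_probability} for the next step.
    rank_fn(candidates) -> the same (score, seq) pairs, ordered best-first.
    Swapping in a different rank_fn is where beam manipulation hooks in.
    """
    beam = [(0.0, [start])]
    for _ in range(max_len):
        # Expand: generate every one-token extension of each hypothesis.
        candidates = []
        for score, seq in beam:
            if seq[-1] == eos:
                candidates.append((score, seq))  # finished: carry over as-is
                continue
            for token, logp in next_token_scores(seq).items():
                candidates.append((score + logp, seq + [token]))
        # Rank: order the candidate hypotheses.
        ranked = rank_fn(candidates)
        # Prune: keep only the top beam_size hypotheses.
        beam = ranked[:beam_size]
        if all(seq[-1] == eos for _, seq in beam):
            break
    return beam


def rank_by_model_score(candidates):
    """Vanilla beam search: rank purely by accumulated log-probability."""
    return sorted(candidates, key=lambda c: c[0], reverse=True)
```

With `rank_by_model_score` this is ordinary beam search; the manipulated variant passes a rank function driven by roll-outs instead.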

Current beam      Next token and greedy roll-out                    Rank
Loch Fyne is a    family friendly restaurant in the city centre        1
                  restaurant located in the centre of the city         3
There is a        restaurant in the city centre called Loch Fyne       4
                  child friendly restaurant called Loch Fyne           2

Figure 2: Incremental beam manipulation, for a beam size of 2, when generating the 4th token. Each element of the beam is expanded, and each partial hypothesis is greedily rolled out to a complete hypothesis. The complete hypotheses are ranked and then pruned. The partial hypotheses are then used in the next step of beam search.
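The single manipulation illustrated in Figure 2 can be sketched as follows. Here `next_token_scores` (a next-token log-probability lookup) and `reranker` (a scorer of completed sequences) are hypothetical stand-ins for the seq2seq model and the trained reranker described in Section 3.2.

```python
def greedy_rollout(seq, next_token_scores, max_len, eos):
    """Complete a partial hypothesis by repeatedly taking the single most
    likely next token: a cheap approximation of the model's completion."""
    seq = list(seq)
    while seq[-1] != eos and len(seq) < max_len:
        scores = next_token_scores(seq)
        seq.append(max(scores, key=scores.get))
    return seq


def manipulate_beam(candidates, next_token_scores, reranker, beam_size,
                    max_len, eos):
    """Rank partial hypotheses by the reranker's score for their greedy
    roll-outs, then prune to the beam size (the step shown in Figure 2)."""
    scored = []
    for model_score, seq in candidates:
        completion = greedy_rollout(seq, next_token_scores, max_len, eos)
        scored.append((reranker(completion), model_score, seq))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(model_score, seq) for _, model_score, seq in scored[:beam_size]]
```

Because the reranker sees a full (approximate) completion rather than a prefix, a partial hypothesis with a poor model score can still survive the pruning if its roll-out looks good.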

3.2 Reranker Architecture

Incremental beam manipulation requires a method to rank completed hypotheses. There are many existing rerankers designed for this task, such as the TGEN reranker (Dušek and Jurčíček, 2016). However, these rerankers are unlikely to be effective when used in incremental beam manipulation. In the latter, the partial hypotheses need to be ranked according to their potential to produce good completed hypotheses. This is a related but different task to that of a traditional reranker, which aims to identify the hypotheses that will score best against some metric such as BLEU score. Rerankers of completed hypotheses typically rely on input information missing from the output as the signal; however, this is not necessarily useful when reranking partial hypotheses. For example, it is more useful to identify partial hypotheses which are indicative of model failure at an early stage of decoding.

To rank the partial hypotheses via roll-outs as introduced in the previous section, we explored two commonly used techniques from the field of information retrieval: pointwise ranking and pairwise ranking (Liu, 2009). Pointwise approaches predict a numerical value for each item, and the items are ordered by sorting them according to this value. Pairwise approaches, given a pair of hypotheses, output which of them would rank more highly, and techniques such as the Copeland method (Saari and Merlin, 1996) are then used to produce a total ordering from these pairwise comparisons. In preliminary experiments, the pointwise approach outperformed the pairwise approach. These results are summarised in Appendix B. For the remainder of this paper, we will focus on the pointwise approach.

The inputs to the reranker were:

  • The meaning representation (i.e. the structured information input for NLG). This was passed as a sequence.

  • The generated text produced by the roll-out of the partial hypothesis. This was passed as a sequence surrounded by a start token and an end token.

  • The rank, according to the seq2seq model probability, of the completed sequence in the beam. Since this was a categorical variable, it was passed as a one-hot vector.

The architecture of the reranker is summarised in Figure 3. This model is used to assign a value to each hypothesis individually. The meaning representation and the rolled-out text are each passed through an LSTM,² and then all inputs are concatenated and passed through two fully connected layers.

² For efficient batching, it is common practice to add padding tokens to make sequences the same length. We prepended padding tokens rather than appending them, as this led to better performance in preliminary experiments. This is presumably because prepended tokens have a smaller impact on the final hidden state.

At inference time the reranker is used to identify poorly performing hypotheses so that these can be pruned. This enabled the task of the reranker to be simplified to distinguishing the hypotheses near the bottom of the beam from the hypotheses in the rest of the beam. Therefore, we used the reranker's scores to split the hypotheses into two groups: those with scores in the bottom quartile

and those with scores in the top three quartiles. The hypotheses within each group were ordered by the seq2seq model's probability.

In preliminary experiments, we also tried using the reranker to provide a total ordering. However, using it to provide a coarse partial ordering (as described above) gave more promising results.

Figure 3: Architecture of the reranker. N is the number of hidden units for the LSTMs and the number of output nodes for the fully connected layers. For the fully connected layers, the activation function is shown.

3.3 Training the Reranker

We trained the reranker on completed hypotheses that were ranked from best to worst. The sequences were produced by generating text sequences using the seq2seq model. A beam search decoding, with a large beam size, was applied to each instance in the training set.³ The set of hypotheses present in the final beam of the search was ranked from best to worst and recorded.

³ In preliminary experiments, we held back a portion of the training set from the seq2seq model's training phase, so that the beam manipulation ranker was trained on outputs that the seq2seq model had not seen. However, the system was unable to recover from the lower performance of the seq2seq model.

The notion of the best hypothesis was simplified to the one that received the highest BLEU score (Papineni et al., 2002) against the manually written references. BLEU was chosen due to its wide adoption as an automatic evaluation measure, but any automatic metric could have been used in its place.

As discussed at the end of the previous section, we need the reranker to distinguish hypotheses that should be pruned from those that should be kept. Furthermore, for the purpose of reranking, only relative differences in BLEU score matter, not absolute values. Therefore, when generating training data, hypotheses in the bottom section of the beam (according to BLEU) were assigned the target value -1, and the rest were assigned the target value 1.

Similarly, it is only the differences in the reranker's scores that matter, and not the absolute values. Therefore, after applying the reranker to each hypothesis in a beam, we normalise the scores to have a mean of 0.

Using the normalised BLEU targets (-1 or 1) and normalised reranker scores (with a mean of 0), we use relative mean absolute error (RMAE) as the training objective, as shown in Equation (1), where b is the set of hypotheses in the beam, x is a hypothesis, x̂_ranker is the normalised score predicted by the reranker, and x̂_BLEU is the normalised target derived from the BLEU score ordering of the beam.

    RMAE(b) = Σ_{x∈b} |x̂_ranker − x̂_BLEU|        (1)

Several other relative loss functions have been shown to be successful in other situations (Zhang et al., 2019). In preliminary experiments, we evaluated a number of these, including log-cosh error and mean square error, but they did not outperform RMAE.

3.4 Choosing when to Manipulate

In theory, it would be desirable to manipulate the beam at every step of hypothesis generation, but in practice, the difficulty of ranking partial hypotheses could limit its benefits. While manipulating the beam can avoid certain model errors, it might also introduce other errors, either from the greedy roll-out strategy or the reranker. Reranking at every step may compound such errors. Empirically, we found it was more effective to apply beam manipulation to some rather than all steps.

Choosing when to manipulate is thus an important decision. It is advisable to avoid manipulating the beam too early: not only is it harder to rank hypotheses with very few tokens, but it is also less likely to be beneficial. As shown in Figure 1, in the first few steps even a relatively small beam size can keep hypotheses that could lead to the reference outputs. On the other hand, it is also advisable not to manipulate too late: once hypotheses have fallen out of the beam, they cannot be put back in. As the optimal choice of when to manipulate the beam is dependent on the dataset and the model, we treat this as a hyperparameter to be tuned on the validation set.
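The labelling and training objective described in Section 3.3 can be sketched in a few lines. This assumes a beam already ordered best-first by BLEU; the function names are illustrative, and the cut-off is one plausible reading of "bottom quartile".

```python
def bleu_targets(beam_size):
    """Targets for a beam ordered best-first by BLEU: -1 for positions in
    the bottom quartile of the ordering, 1 for the rest."""
    cutoff = beam_size - beam_size // 4  # bottom quartile starts here
    return [1.0 if i < cutoff else -1.0 for i in range(beam_size)]


def mean_centre(scores):
    """Normalise reranker scores to have mean 0: only relative
    differences within a beam matter."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


def rmae(ranker_scores, targets):
    """Relative mean absolute error over one beam, as in Equation (1):
    sum of |x̂_ranker - x̂_BLEU| over the mean-centred scores."""
    return sum(abs(r - t)
               for r, t in zip(mean_centre(ranker_scores), targets))
```

Minimising this loss only requires the reranker to push bottom-quartile hypotheses below the rest, matching the coarse partial ordering used at inference time.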

4       Experiments
In this section we will present results on the E2E
(Novikova et al., 2017b) and WebNLG challenges
(Gardent et al., 2017). We evaluate the systems us-
ing the BLEU implementation used in the original
E2E challenge.4 For all the experiments reported
we use the seq2seq architecture from Dušek and
Jurčı́ček (2016) for the underlying model that we
are trying to manipulate.5
                                                           Figure 4: Results on the test set of the E2E challenge.
It is well-known that BLEU is not a completely             LN = Length Normalisation.
reliable method for predicting human perceptions
of the quality of individual NLG outputs (for exam-
ple: Callison-Burch et al., 2006; Novikova et al.,         BLEU score of the resulting sequences since it ad-
2017a). However, in this case, we are comparing            dresses the bias of language models to favour short
outputs from variants of the same system, and thus         sequences, as discussed in Section 2. Although the
BLEU is more likely to provide reasonable esti-            values assigned to a sequence when using length
mates of their felicity to the references, as argued       normalisation are no longer probabilities, it still
by both Callison-Burch et al. and Novikova et al..         performs the expand, rank and prune steps at each
                                                           iteration of the decoding. Hence, we can apply
To support the idea behind our approach, i.e. manip-       beam manipulation in tandem with length normal-
ulating the beam during decoding instead of only           isation. Finally, we also considered nucleus sam-
at the end, we compare against existing rerankers          pling (Holtzman et al., 2020) as a baseline. How-
applied to the beam at the end of decoding. These          ever, it was found to decrease performance even
include the TGEN reranker, proposed by Dušek              when compared to vanilla beam search.
and Jurčı́ček (2016), that has achieved state-of-the-
art BLEU scores on NLG tasks, as well as the same          In what follows, we will not only comment on test
reranker architecture defined in Section 3.2. Both architectures are trained to perform reranking of the final beam only. When comparing these two methods of reranking the final beam, no significant difference in performance was found. In this section, we therefore report results for the architecture defined in Section 3.2; for completeness, results for both rerankers are included in Appendix C.

By using the same architecture both for the final-beam reranker and for beam manipulation, we can be more confident that any difference in results is due to the beam manipulation strategy (reranking during decoding, rather than only at the end).

In our experiments, we also consider length normalisation (Murray and Chiang, 2018), a well-known technique that often increases the performance of beam search. We focus on the results when the beam size is tuned on the validation sets, but we will also comment on test results across all beam sizes. Considering all beam sizes assesses whether a technique is robust to changes in beam size; in our opinion, this makes for a more convincing result than just indicating a difference in performance at a single beam size. This is especially pertinent since a well-documented issue of beam search is that larger beam sizes can lead to deteriorating performance. The results table in Appendix C indicates the validation set's optimal beam size for each of the systems.

4.1   E2E BLEU Results

Figure 4 indicates the results on the E2E test set. The first thing to note is that increasing the beam size did not lead to any considerable gain in performance for the vanilla beam search strategy. For all beam sizes except 10, the performance is worse than greedy decoding, while using a beam size of 10 only increased performance by 0.06 points.

4 https://github.com/tuetschek/e2e-metrics
5 The source code for this paper is at https://github.com/jamesHargreaves12/incremental_beam_manipulation
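As a reminder of what the final-beam reranking baseline does, it simply lets a learned scorer, rather than the model's log-probability, pick the output hypothesis from the completed beam. The sketch below is illustrative, not the paper's implementation; `reranker_score` is a stand-in for the trained reranker of Section 3.2.

```python
def rerank_final_beam(beam, reranker_score):
    """beam: list of (tokens, model_logprob) pairs for completed hypotheses.
    Returns the tokens of the hypothesis preferred by the reranker,
    ignoring the model's own log-probability."""
    return max(beam, key=lambda hyp: reranker_score(hyp[0]))[0]

# Toy usage with hypothetical hypotheses and length as a stand-in scorer:
beam = [(["a", "b"], -1.2), (["a", "b", "c"], -1.5)]
print(rerank_final_beam(beam, reranker_score=len))  # → ['a', 'b', 'c']
```

Note that the model's preferred hypothesis (the one with the highest log-probability) is not treated specially: the reranker is free to overrule it.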

Compared to vanilla beam search, reranking was
an effective strategy, increasing the performance
at all beam sizes. Similarly, applying incremental
beam manipulation was able to outperform both
methods at all beam sizes. Using the validation set to tune the beam size, the BLEU scores are 0.89 and 1.93 points higher than vanilla beam search for the reranker and incremental beam manipulation, respectively.
The difference in BLEU scores between incremen-
tal beam manipulation and reranker methods was
found to be significant (using a permutation test
with significance level 0.01).
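The significance test referred to above can be sketched as a paired randomisation test. This is an illustrative sketch, not the paper's code: the paper compares corpus-level BLEU, whereas this toy version randomly swaps per-input scores between the two systems; `scores_a` and `scores_b` are hypothetical per-input metric values.

```python
import random

def permutation_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired permutation test on per-input metric scores."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    hits = 0
    for _ in range(trials):
        # Randomly swap which system each input's score is credited to.
        diff = sum(a - b if rng.random() < 0.5 else b - a
                   for a, b in zip(scores_a, scores_b))
        hits += abs(diff) >= observed
    return hits / trials  # estimated p-value

# A p-value below 0.01 would mirror the paper's significance level.
```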

Length normalisation was the strongest baseline, increasing the BLEU score of vanilla beam search by 1.69 points. Adding the reranker on top of length normalisation decreased performance for all beam sizes smaller than 30. The strong performance of length normalisation is likely due to the fact that the E2E test set contained longer and more complex inputs (and hence references) than the training and validation sets (Dušek et al., 2020). Nevertheless, applying incremental beam manipulation on top of length normalisation was able to increase the BLEU score for all beam sizes except 5.

It is worth pointing out that while incremental beam manipulation improved both vanilla beam search and length normalisation, the overall BLEU score for the combination with the latter was lower for all beam sizes other than 3. This is surprising, considering that vanilla beam search performed worse than length normalisation when not combined with incremental beam manipulation. This could be because the greedy roll-out approximation is less accurate for length normalisation than for vanilla beam search, since length normalisation only has an impact once some items in the beam have been completed.

4.2   WebNLG BLEU results

Figure 5 indicates the results on the WebNLG test set. As in the results for E2E, we can see that increasing the beam size of vanilla beam search was not an effective way to increase the BLEU score: a greedy decode outperformed it at all beam sizes. Reranking the final beam was more effective, increasing the BLEU score by 5.83 points. Applying incremental beam manipulation had a very similar performance to reranking, increasing the performance at beam sizes 3 and 10 but reducing it at size 5.

Figure 5: Results on the test set of the WebNLG challenge. LN = Length Normalisation.

The length normalisation baseline improved upon the vanilla baseline, increasing the BLEU score by 5.01 points. Reranking the final beam of the length normalised beam search was more effective on the WebNLG dataset than on the E2E dataset: applying the reranker outperformed length normalisation at every beam size. Focusing on the beam sizes that performed optimally on the validation set, the BLEU score on the test set was increased by 0.43 points. Applying incremental beam manipulation on top of length normalisation achieved a yet higher BLEU score than reranked length normalisation for all beam sizes, increasing the BLEU score by 1.33 points compared to length normalisation. The improvement in BLEU scores achieved by applying incremental beam manipulation to the length normalised beam search was found to be significant when compared to length normalisation (with or without final-beam reranking).

Unlike on the E2E dataset, beam manipulation had higher performance when applied on top of length normalisation rather than vanilla beam search, outperforming it for all beam sizes except 3. The BLEU score was 0.52 points higher when taking the values at the beam sizes with the highest performance on the validation set.

4.3   Fallout with Beam Manipulation

In Section 1, we explained that references often fall out of the beam relatively early during decoding, and reported results on the E2E task. We repeated the same experiment when applying incremental beam manipulation. A beam size of 3 was used so that the results could be directly compared to those for vanilla beam search in Figure 1.
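The fallout measurement described here can be sketched as follows. This is our illustrative reading of the experiment, not the released implementation; `reference_retention` and its toy inputs are hypothetical, and a reference counts as "still reachable" if some hypothesis in the beam is a prefix of it.

```python
def reference_retention(beams_per_step, references):
    """beams_per_step: list over decoding steps of beams, each a list of
    partial hypotheses (token sequences). Returns, for each step, whether
    some reference is still reachable from the beam (prefix match)."""
    refs = [tuple(r) for r in references]
    retained = []
    for beam in beams_per_step:
        reachable = any(tuple(h) == r[:len(h)] for h in beam for r in refs)
        retained.append(reachable)
    return retained

# Toy example: the reference "a b c" falls out of the beam at step 2.
beams = [[("a",)], [("a", "b")], [("x", "y", "z")]]
print(reference_retention(beams, [["a", "b", "c"]]))  # → [True, True, False]
```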

Figure 6: The percentage of beams which contain a reference (orange), or which could still lead to a reference (blue), using Incremental Beam Manipulation with beam size 3 on the E2E validation set. This improves on vanilla beam search, shown in Fig. 1.

The results are shown in Figure 6. The graph indicates that beam manipulation indeed ameliorates this issue. The final beam contains a (correct) reference in 100/547 cases (approximately 18%), a large increase from the 60 cases of vanilla beam search. This is mainly due to reducing the number of references that fall out in steps 5 to 15, which is consistent with the fact that we are manipulating at steps 5, 10, 15 and 20. We also observed that most of the retention gain is due to the earlier manipulation steps.

4.4   Human Evaluation

To further investigate the differences between the systems, we conducted a human evaluation comparing the incremental beam manipulation system's output against the output of the strongest baseline on the E2E dataset, length normalisation.

The human evaluation was performed by the second and third authors of the paper. While the annotators had been involved in the design of the system, they had not seen textual outputs from the system prior to the annotation. The outputs were presented in a random order, without indicating which output came from which system. The systems were compared in terms of both fluency and adequacy.

For fluency, each annotator compared the system outputs for the same meaning representation input (without seeing it) and indicated their preference. Both annotators annotated 50 examples from the E2E test set.

Little difference between the outputs was found. The systems were labelled as equally fluent in 76% of cases (incremental beam manipulation was preferred in 12% of all cases).

To judge the adequacy of the generations, human raters were presented with the meaning representation input and the text generated by a system. They were asked to label any hallucinations and any repetitions. They were asked to ignore missing information, as the human references that the system had been trained on contained frequent examples of missing information (Dušek et al., 2020); for this dataset, missing information is therefore better seen as content selection. Between the annotators, a combined total of 524 examples were labelled for both hallucination and repetition.

Once again, the results were not conclusive in support of either system, with no statistically significant difference between them. The overall performance was very high: for 95% of the inputs, both systems exhibited no signs of hallucination or repetition. This error analysis did, however, highlight that some errors are repeated multiple times, almost word for word. For example, all 5 cases of repetition for the incremental beam manipulation system had the following form: "There is a pub called X located near Y. It is a pub."

It is worth re-iterating that the system was optimised for BLEU, and not fluency or adequacy. The fact that an improvement in BLEU has not led to an improvement in a human evaluation suggests that BLEU may not be an accurate enough metric for this task, even when comparing similar systems. Therefore, BLEU may be even more limited in usefulness than Callison-Burch et al. (2006) and Novikova et al. (2017a) suggested.

4.5   Example outputs

We now present a couple of examples where manipulating the beam during decoding led to an improvement in the quality of the output. These were selected from the set of examples for which the output of the beam manipulator was preferred by the human annotators in terms of adequacy. The examples are given in Figure 7.

In the first example, we can see that length normalisation leads to a repetition of the fact that The Cricketers had an 'average customer rating', showing the downsides of a technique that simply favours longer outputs. Neither the output from the beam

Input: name = The Cricketers | eat type = restaurant | food = English | price range = cheap |
    rating = average | area = city centre | family friendly = yes | near = Café Rouge
    BM: The Cricketers serves cheap English food in the city centre near Café Rouge. It has an average
    customer rating and is family-friendly.
    LN: The Cricketers is a cheap, family-friendly, English restaurant with an average customer rating.
    It is located in the city centre near Café Rouge and has an average customer rating.
    RR: The Cricketers is a cheap, family-friendly restaurant located in city centre near Café Rouge.

    Input: name = The Phoenix | eat type = pub | food = French | price range = £20-25 | rating = 3 out
    of 5 | area = riverside | family friendly = no | near = Café Sicilia
    BM: The Phoenix is a French restaurant in riverside near Café Sicilia. It has a moderate price range
    and a customer rating of 3 out of 5. It is not kid friendly.
    LN: The Phoenix is a restaurant providing French food in the £20-25 price range. It is located in the
    riverside. It is near Café Sicilia. Its customer rating is 5 out of 5.
    RR: The Phoenix is a restaurant providing French food in the £20-25 price range. It is located in the
    riverside. It is near Café Sicilia. Its customer rating is high.

Figure 7: Example outputs for different systems. BM = Incremental beam manipulation system, LN = Vanilla
beam search with length normalisation, RR = Vanilla beam search with reranking applied to the final beam.

manipulator nor the reranked approach contains repetitions, although we can see that more of the input information is realised in the case of the beam manipulator.

The second example contains a hallucination for both the length normalised and reranked systems: the input clearly states that the customer rating was '3 out of 5', whereas these systems claim that it was '5 out of 5' and 'high' respectively. The beam manipulation system avoided this issue.

5    Conclusions

Rerankers are commonly used to increase the performance of NLG systems decoded by beam search, by modifying which hypothesis from the final beam is chosen. This means that rerankers are dependent on good hypotheses reaching the final beam. However, this is often not the case; on the validation set of the E2E challenge, only 11% of references were present in the final beam when the seq2seq model of Dušek and Jurčíček (2016) was decoded with a beam size of 3.

To address this limitation, we proposed incremental beam manipulation, which modifies the ranking of partial hypotheses within the beam at intermediate steps of the decoding, and hence chooses which are pruned. We evaluated this method on both the E2E and WebNLG challenges. The results showed that applying beam manipulation, instead of a reranker, was able to increase the BLEU score by 1.04 on the E2E challenge. We further showed that incremental beam manipulation was able to increase performance when applied on top of length normalisation.

The optimal reranker for incremental beam manipulation may differ at each step of generation (for example, token 5 vs. token 20). In future work, we intend to refine our method further by conditioning the reranker on how far through the beam search we are.

References

Shubham Agarwal, Marc Dymetman, and Eric Gaussier. 2018. Char2char generation with reranking for the E2E NLG challenge. pages 451–456.

Nabiha Asghar, Pascal Poupart, Xin Jiang, and Hang Li. 2017. Deep active learning for dialogue generation. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 78–83.

Frédéric Blain, Lucia Specia, and Pranava Madhyastha. 2017. Exploring hypotheses spaces in neural machine translation. In Proceedings of the 16th Machine Translation Summit (MT Summit XVI). Asia-Pacific Association for Machine Translation (AAMT).

Sebastian Borgeaud and Guy Emerson. 2020. Leveraging sentence similarity in natural language generation: Improving beam search using range voting. In Proceedings of the 4th Workshop on Neural Generation and Translation (WNGT), pages 97–109. Association for Computational Linguistics.

Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. 2015. Learning to search better than your teacher. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37 of JMLR Proceedings, pages 2058–2066. JMLR.org.

Yun Chen, Kyunghyun Cho, Samuel R. Bowman, and Victor O.K. Li. 2018. Stable and effective trainable greedy decoding for sequence to sequence learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018).

Eldan Cohen and Christopher Beck. 2019. Empirical analysis of beam search performance degradation in neural sequence models. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 1290–1299.

Ronan Collobert, Awni Hannun, and Gabriel Synnaeve. 2019. A fully differentiable beam search decoder. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 1341–1350.

Ondřej Dušek and Filip Jurčíček. 2016. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 45–51, Berlin, Germany. Association for Computational Linguistics.

Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2020. Evaluating the state-of-the-art of end-to-end natural language generation: The E2E NLG challenge. Computer Speech & Language, 59:123–156.

Markus Freitag and Yaser Al-Onaizan. 2017. Beam search strategies for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 56–60, Vancouver. Association for Computational Linguistics.

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133, Santiago de Compostela, Spain. Association for Computational Linguistics.

Kartik Goyal, Graham Neubig, Chris Dyer, and Taylor Berg-Kirkpatrick. 2018. A continuous relaxation of beam search for end-to-end training of neural sequence models. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 3045–3052.

Jiatao Gu, Kyunghyun Cho, and Victor O.K. Li. 2017. Trainable greedy decoding for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1968–1978.

Ari Holtzman, Jan Buys, Leo Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39.

Shankar Kumar and William Byrne. 2004. Minimum Bayes-Risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169–176.

Tie-Yan Liu. 2009. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331.

Kenton Murray and David Chiang. 2018. Correcting length bias in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 212–223.

Renato Negrinho, Matthew R. Gormley, and Geoffrey J. Gordon. 2018. Learning beam search policies via imitation learning.

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017a. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252. Association for Computational Linguistics.

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017b. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Donald G. Saari and Vincent R. Merlin. 1996. The Copeland method. Economic Theory, 8(1):51–76.

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3347–3353.

Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 275–284.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Processing (EMNLP), pages 1296–1306.

Ning Zhang, Shui-Long Shen, Annan Zhou, and Ye-Shuang Xu. 2019. Investigation on performance of neural networks using quadratic relative error cost function. IEEE Access, 7:106642–106652.

Figure 8: The percentage of beams that contain a reference sentence after each step of beam search. A beam size of 10 was used to decode the model proposed by Dušek and Jurčíček (2016). Results are for the E2E validation dataset. The orange bars indicate the number of completed references within the beam.

A    Fallout experiment with larger beam size

Figure 1 indicates the step at which the reference sentences drop out of the beam (for a beam size of 3). Figure 8 shows the same results for a larger beam size of 10.

The figure indicates that the number of references contained in the final beam was higher for a beam size of 10. In the early iterations of the decoding, the number of references that fell out of the beam was far lower for a beam size of 10: a larger beam contains more hypotheses, and so has more chances to match against a reference.

However, the shape of the graphs is very similar. The majority of references that fell out did so relatively early in the process: 54% of references fell out by step 7, increasing to 79% by step 9. At step 21 the final reference fell out of the beam, despite the fact that the beam contained partial references up to step 40.

Figure 9: Comparison between the performance of the pointwise and pairwise rankers when used as rerankers on the E2E validation set.

B    Pointwise vs Pairwise rerankers

This paper required a method of ranking completed hypotheses from worst to best. During preliminary experiments, we implemented rerankers based on the pairwise and pointwise strategies from the information retrieval field; see Section 3.2 for more details.

To evaluate the performance of the different rankers, we applied each of them as a reranker of the final beam of a vanilla beam search over the E2E validation set. The BLEU scores for each of the rerankers were calculated at each beam size. The results are shown in Figure 9.

We can see that there was very little difference in performance between the two methods of reranking for beam sizes up to 10. However, at beam size 30 the pointwise reranker significantly outperforms the pairwise reranker. The larger the beam size, the greater the number of hypotheses that the reranker can pick as top, and hence the greater the impact of the reranker.

The pointwise reranker requires O(k) runs of the ranker to produce a total ordering of a beam of size k. In contrast, the Copeland method requires O(k²) pairwise comparisons to produce a total ordering.

These factors led us to choose the pointwise ranker over the pairwise ranker for the experiments in the results section.
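The contrast between the two strategies can be sketched as follows. The scorers here are hypothetical stand-ins (length-based) rather than the paper's trained rankers: a pointwise ranker scores each of the k hypotheses once, while the Copeland method derives a total order from all O(k^2) pairwise comparisons.

```python
from itertools import combinations

def pointwise_order(hyps, score):
    """Total ordering from k scorer calls (best first)."""
    return sorted(hyps, key=score, reverse=True)

def copeland_order(hyps, prefer):
    """prefer(a, b) -> True if a beats b. Copeland score = wins - losses,
    computed from k*(k-1)/2 pairwise comparisons."""
    wins = {h: 0 for h in hyps}
    for a, b in combinations(hyps, 2):
        if prefer(a, b):
            wins[a] += 1
            wins[b] -= 1
        else:
            wins[b] += 1
            wins[a] -= 1
    return sorted(hyps, key=lambda h: wins[h], reverse=True)

# With a consistent underlying preference, the two orderings agree:
hyps = ["short", "a bit longer", "the longest hypothesis"]
assert pointwise_order(hyps, len) == copeland_order(hyps, lambda a, b: len(a) > len(b))
```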

Beam size   Vanilla   Rerank    TGEN     LN       LN+Rerank   BM       LN+BM
1           64.72     64.72     64.72    64.72    64.72       64.72    64.72
3           64.47     64.94     65.33    65.93    65.26       65.73    66.06
5           64.69     65.36     65.47*   66.40    66.19*      66.40    66.27
10          64.78*    65.67*    65.58    66.47*   66.19       66.71*   66.61
30          63.65     65.25     65.44    65.58    66.05       66.40    66.25*

Table 1: BLEU scores for each of the different systems on the E2E test set. * indicates the beam size which scored highest on the respective validation set. Bold indicates the highest-scoring system for each beam size.

Beam size   Vanilla   Rerank    TGEN     LN       LN+Rerank   BM       LN+BM
1           42.14     42.14     42.14    42.14    42.14       42.14    42.14
3           42.10*    47.93*    47.28*   47.02    47.37       48.38    47.78
5           41.77     48.13     47.41    47.49    47.54*      47.92*   48.39
10          41.33     47.33     46.50    47.11*   47.70       47.66    48.21
30          41.20     47.42     46.61    47.18    47.81       47.41    48.44*

Table 2: BLEU scores for each of the different systems on the WebNLG test set. * indicates the beam size which scored highest on the respective validation set. Bold indicates the highest-scoring system for each beam size.

C    Numerical results

This section presents the numerical results for the E2E and WebNLG datasets so that they can be more readily compared against in future work. The results are given in Table 1 and Table 2.

D    Hyperparameters

Throughout this paper a number of hyperparameters were introduced. The values used for each of the models in this paper are summarised below. Note that the search for these values was far from exhaustive, so there is a good chance that the results of this paper could be improved upon through a better optimisation procedure.

In Section 3.3, the beam is split into two sections: the bottom and the rest. For all beam manipulation models, the bottom of the beam was set to the bottom (i.e. lowest-scoring) quarter of the beam. We also said that a large beam is used to generate the data for training the ranker; for all experiments in this paper we use a beam size of 50.

A key hyperparameter for the performance of incremental beam manipulation was the set of steps of the beam search at which the beam was manipulated. This hyperparameter varied for the 4 separate beam manipulation models. The values are summarised as follows:

• E2E, Incremental Beam Manipulation on top of vanilla beam search: 5, 10, 15, 20 and final.
• E2E, Incremental Beam Manipulation on top of length normalised beam search: 5, 7 and 10.
• WebNLG, Incremental Beam Manipulation on top of vanilla beam search: 4, 12 and final.
• WebNLG, Incremental Beam Manipulation on top of length normalised beam search: 5 and 12.

It is worth noting that manipulating the final step is the same as reranking the final beam according to the ranker used in beam manipulation (i.e. no roll-outs are performed).
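The manipulation-step schedule can be sketched as a hook inside the beam search loop. This is a minimal illustration under assumed interfaces, not the released implementation: `expand`, `score`, and `rank` are hypothetical stand-ins for the model's successor function, the model score, and the trained ranker, and completion/EOS handling is omitted.

```python
import heapq

def beam_search_with_manipulation(start, expand, score, rank,
                                  beam_size=3, max_steps=20,
                                  manipulate_at=(5, 10, 15, 20),
                                  manipulate_final=True):
    beam = [start]
    for step in range(1, max_steps + 1):
        candidates = [c for h in beam for c in expand(h)]
        if not candidates:
            break
        # Standard pruning keeps the top-k candidates by model score...
        key = score
        # ...but at scheduled manipulation steps, prune by the ranker instead.
        if step in manipulate_at:
            key = rank
        beam = heapq.nlargest(beam_size, candidates, key=key)
    # Manipulating the final step reduces to reranking the final beam.
    return max(beam, key=rank if manipulate_final else score)
```

With `manipulate_at=()` and `manipulate_final=False` this degenerates to vanilla beam search, which makes the schedule itself easy to ablate.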