Correct Me If You Can: Learning from Error Corrections and Markings


Julia Kreutzer∗ and Nathaniel Berger∗ and Stefan Riezler†,∗
∗Computational Linguistics & †IWR
Heidelberg University, Germany
{kreutzer, berger, riezler}@cl.uni-heidelberg.de

Abstract

Sequence-to-sequence learning involves a trade-off between signal strength and annotation cost of training data. For example, machine translation data range from costly expert-generated translations that enable supervised learning, to weak quality-judgment feedback that facilitates reinforcement learning. We present the first user study on annotation cost and machine learnability for the less popular annotation mode of error markings. We show that error markings for translations of TED talks from English to German allow precise credit assignment while requiring significantly less human effort than correcting/post-editing, and that error-marked data can be used successfully to fine-tune neural machine translation models.

1 Introduction

Successful machine learning for structured output prediction requires the effort of annotating sufficient amounts of gold-standard outputs, a task that can be costly if structures are complex and expert knowledge is required, as for example in neural machine translation (NMT) (Bahdanau et al., 2015). Approaches that propose to train sequence-to-sequence prediction models by reinforcement learning from task-specific scores, for example BLEU in machine translation (MT), shift the problem by simulating such scores by evaluating machine translation output against expert-generated reference structures (Ranzato et al., 2016; Bahdanau et al., 2017; Kreutzer et al., 2017; Sokolov et al., 2017). An alternative approach that considerably reduces human annotation effort by letting annotators mark errors in machine outputs, for example erroneous words or phrases in a machine translation, has recently been proposed and investigated in simulation studies by Marie and Max (2015); Domingo et al. (2017); Petrushkov et al. (2018). This approach takes the middle ground between supervised learning from error corrections as in machine translation post-editing1 (or from translations created from scratch) and reinforcement learning from sequence-level bandit feedback (this includes self-supervised learning where all outputs are rewarded uniformly). Error markings are highly promising since they suggest an interaction mode with low annotation cost, yet they can enable precise token-level credit/blame assignment, and thus can lead to an effective fine-grained discriminative signal for machine learning and data filtering.

Our work is the first to investigate learning from error markings in a user study. Error corrections and error markings are collected from junior professional translators, analyzed, and used as training data for fine-tuning neural machine translation systems. The focus of our work is on the learnability from error corrections and error markings, and on the behavior of annotators as teachers to a machine translation system. We find that error markings require significantly less effort (in terms of key-stroke-mouse-ratio (KSMR) and time) and result in a lower correction rate (ratio of words marked as incorrect or corrected in a post-edit).

1 In the following we will use the more general term error corrections and the MT-specific term post-edits interchangeably.

© 2020 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND.
Furthermore, they are less prone to over-editing than error corrections. Perhaps surprisingly, agreement between annotators on which words to mark or to correct was lower for markings than for post-edits. However, despite the low inter-annotator agreement, fine-tuning of neural machine translation could be conducted successfully from data annotated in either mode. Our data set of error corrections and markings is publicly available.2

2 https://www.cl.uni-heidelberg.de/statnlpgroup/humanmt/

2 Related Work

Prior work closest to ours is that of Marie and Max (2015); Domingo et al. (2017); Petrushkov et al. (2018); however, these works were conducted by simulating error markings by heuristic matching of machine translations against independently created human reference translations. Thus the question of the practical feasibility of machine learning from noisy human error markings is left open.

User studies on machine learnability from human post-edits, together with thorough performance analyses with mixed effects models, have been presented by Green et al. (2014); Bentivogli et al. (2016); Karimova et al. (2018). Albeit showcasing the potential of improving NMT through human corrections of machine-generated outputs, these works do not consider "weaker" annotation modes like error markings. User studies on the process and effort of machine translation post-editing are too numerous to list; a comprehensive overview is given in Koponen (2016). In contrast to works on interactive-predictive translation (Foster et al., 1997; Knowles and Koehn, 2016; Peris et al., 2017; Domingo et al., 2017; Lam et al., 2018), our approach does not require an online interaction with the human and allows us to investigate, filter, pre-process, or augment the human feedback signal before making a machine learning update.

Machine learning from human feedback beyond the scope of translations has considered learning from human pairwise preferences (Christiano et al., 2017), from human corrective feedback (Celemin et al., 2018), or from sentence-level reward signals on a Likert scale (Kreutzer et al., 2018). However, none of these studies has considered error markings on tokens of output sequences, despite their general applicability to a wide range of learning tasks.

3 User Study on Human Error Markings and Corrections

The goal of the annotation study is to compare the novel error marking mode to the widely adopted machine translation post-editing mode. We are interested in finding an interaction scenario that costs little time and effort, but still allows us to teach the machine how to improve its translations. In this section we present the setup, measure and compare the observed amount of effort and time that went into these annotations, and discuss the reliability and adoption of the new marking mode. Machine learnability, i.e., training of an NMT system on human-annotated data, is discussed in Section 4.

3.1 Participants

We recruited 10 participants who described themselves as native German speakers with either a C1 or C2 level in English, as measured by the Common European Framework of Reference levels. 8 participants were students studying translation or interpretation and 2 participants were students studying computational linguistics. All participants were paid 100 € for their participation in the study, which was done online and limited to a maximum of 6 hours; it took them between 2 and 4.5 hours excluding breaks. They agreed to the usage of the recorded data for research purposes.

3.2 Interface

The annotation interface has three modes: (1) markings, (2) corrections, and (3) the user-choice mode, where annotators first choose between (1) and (2) before submitting their annotation. While the first two modes are used for collecting training data for the MT model, the third mode is used for evaluative purposes to investigate which mode is preferable when given the choice. In any case, annotators are presented the source sentence, the target sentence and an instruction to either mark or correct (aka post-edit) the translation or choose an editing mode. They also had the option to pause and resume the session. No document-level context was presented, i.e., translated sentences were judged in isolation, but in consecutive order as they appeared in the original documents to provide a reasonable amount of context. They received detailed instructions (see Appendix A) on how to proceed with the annotation. Each annotator worked on 300 sentences, 100 for each mode, and an extra 15 sentences for intra-annotator
agreement measures that were repeated after each mode. After the completion of the annotation task they answered a survey about the preferred mode, the perceived editing/marking speed, user-choice policies, and suggestions for improvement. A screenshot of the interface showing a marking operation is shown in Figure 1. The code for the interface is publicly available.3

3 https://github.com/StatNLP/mt-correct-mark-interface

Figure 1: Interface for marking of translation outputs following user choice between markings and post-edits.

3.3 Data

We selected a subset of 30 TED talks to create the three data sets from the IWSLT17 machine translation corpus.4 The talks were filtered by the following criteria: single speakers, no music/singing, low intra-line final-sentence punctuation (indicating bad segmentation), and a length between 80 and 149 sentences. One additional short talk was selected for testing the inter- and intra-annotator reliability. We filtered out those sentences where model hypothesis and reference were equal, in order to save annotation effort where it is clearly not needed, and also removed the last line from every talk (usually "Thank you"). For each talk, one topic out of a set of keywords provided by TED was selected. See Appendix B for a description of how data was split across annotators.

4 https://sites.google.com/site/iwsltevaluation2017/
3.4 Effort and Time

Correcting one translated sentence took on average approximately 5 times longer than marking errors, and required 42 more actions, i.e., clicks and keystrokes. That is 0.6 actions per character for post-edits, but only 0.03 actions per character for markings. This measurement aligns with the unanimous subjective impression of the participants that they were faster in marking mode.

To investigate the sources of variance affecting time and effort, we use Linear Mixed Effect Models (LMEM) (Barr et al., 2013) and build one with KSMR as response variable and another one with the total edit duration (excluding breaks) as response variable, both with the editing mode (correcting vs. marking) as fixed effect. For both response variables, we model users,5 talks and target lengths6 as random effects, e.g., the one for KSMR:

KSMR ∼ mode + (1 | user id) + (1 | talk id) + (1 | trg length)    (1)

5 Random effects are denoted, e.g., by (1 | user id).
6 Target lengths measured by number of characters were binned into two groups at the limit of 176 characters.

We use the implementation in the R package lme4 (Bates et al., 2015) and fit the models with restricted maximum likelihood. Inspecting the intercepts of the fitted models, we confirm that KSMR is significantly (p = 0.01) higher for post-edits than for markings (+3.76 on average). The variance due to the user (0.69) is larger than that due to the talk (0.54) and the length (0.05).7 Longer sentences have a slightly higher KSMR than shorter ones. When modeling the topics as random effects (rather than the talks), the highest KSMR (judging by individual intercepts) was obtained for physics and biodiversity, and the lowest for language and diseases. This might be explained, e.g., by the MT training data or the raters' expertise.

7 Note that KSMR is already normalized by reference length, hence the small effect of target length. In an LMEM for the raw action count (clicks + key strokes), this effect had a larger impact.

Analyzing the LMEM for editing duration, we find that post-editing takes on average 42 s longer than marking, which is significant at p = 0.01. The variance due to the target length is the largest, followed by that due to the talk, and the one due to the user is smallest. Long sentences have a six times higher editing duration on average than shorter ones. With respect to topics, the longest editing was done for topics like physics and evolution, the shortest for diseases and health.
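The models above are fit with lme4 in R. Purely as an illustration of the same crossed random-effects structure, a rough Python equivalent using statsmodels could look like the following sketch; the file and column names are hypothetical and this is not the authors' implementation:

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per annotated sentence; column names are illustrative assumptions.
df = pd.read_csv("annotation_log.csv")
df["grp"] = 1  # single dummy group, so the factors below act as crossed random effects

# Variance components playing the role of (1 | user id) + (1 | talk id) + (1 | trg length)
vc = {
    "user": "0 + C(user_id)",
    "talk": "0 + C(talk_id)",
    "length": "0 + C(trg_length_bin)",
}
model = smf.mixedlm("ksmr ~ mode", data=df, groups="grp",
                    vc_formula=vc, re_formula="0")
result = model.fit(reml=True)  # restricted maximum likelihood, as in the paper
print(result.summary())
```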
3.4.1 Annotation Quality

The corrections increased the quality, measured by comparison to reference translations, by 2.1 points in BLEU and decreased TER by 1 point. While this indicates a general improvement, it has to be taken with a grain of salt, since the post-edits
are heavily biased by the structure, word choice, etc. of the machine translation, which might not necessarily agree with the reference translations while still being accurate.
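The automatic comparison against references can be reproduced with standard tooling; the paper does not name its implementation, so the following sacrebleu-based sketch (corpus-level BLEU and TER, with made-up example strings) is only an illustration:

```python
from sacrebleu.metrics import BLEU, TER

# hypotheses: MT outputs or post-edits; references: independent reference translations
hypotheses = ["Es war ein dicker Wald."]
references = [["Auf der Insel war dichter Wald."]]  # one reference stream

bleu = BLEU()
ter = TER()
print(bleu.corpus_score(hypotheses, references))  # higher is better
print(ter.corpus_score(hypotheses, references))   # lower is better
```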
How good are the corrections? We therefore manually inspect the post-edits to get insights into the differences between post-edits and references. Table 1 provides a set of examples8 with their analysis in the caption. Besides the effect of "literalness" (Koponen, 2016), we observe three major problems:

1. Over-editing: Editors edited translations even though they are adequate and fluent.

2. "Telephone game" effect: Semantic mistakes (that do not influence fluency) introduced by the MT system flow into the post-edit and remain uncorrected when more obvious corrections are needed elsewhere in the sentence.

3. Missing information: Since editors only observe a portion of the complete context, i.e., they do not see the video recording of the speaker or the full transcript of the talk, they are not able to convey as much information as the reference translations.

8 Selected because of their differences to references.

src  I am a nomadic artist.
hyp  Ich bin ein nomadischer Künstler.
pe   Ich bin ein nomadischer Künstler.
ref  Ich wurde zu einer nomadischen Künstlerin.

src  I look at the chemistry of the ocean today.
hyp  Ich betrachte heute die Chemie des Ozeans.
pe   Ich erforsche täglich die Chemie der Meere.
ref  Ich untersuche die Chemie der Meere der Gegenwart.

src  There's even a software called cadnano that allow . . .
hyp  Es gibt sogar eine Software namens Caboano, die . . .
pe   Es gibt sogar eine Software namens Caboano, die . . .
ref  Es gibt sogar eine Software namens "cadnano", . . .

src  It was a thick forest.
hyp  Es war ein dicker Wald.
pe   Es handelte sich um einen dichten Wald.
ref  Auf der Insel war dichter Wald.

Table 1: Examples of post-editing to illustrate differences between reference translations (ref) and post-edits (pe). Example 1: The gender in the German translation could not be inferred from the context, since speaker information is unavailable to the post-editor. Example 2: "today" is interpreted as an adverb by the NMT; this interpretation is kept in the post-edit ("telephone game" effect). Example 3: Another case of the "telephone game" effect: the name of the software is changed by the NMT and not corrected by the post-editor. Example 4: Over-editing by the post-editor, and more information in the reference translation than in the source.

How good are the markings? Markings, in contrast, are less prone to over-editing, since they have fewer degrees of freedom. They are equally exposed to problem (3) of missing context, and another limitation is added: word omissions and word order problems cannot be annotated. Table 2 gives a set of examples that illustrate these problems. While annotators were most likely not aware of problems (1) and (2), they might have sensed that information was missing, as well as the additional limitations of markings. The simulation of markings from references as used in previous work (Petrushkov et al., 2018; Marie and Max, 2015) seems overly harsh for the generated target translations, e.g., marking "Hazara-Bevölkerung" as incorrect, even though it is a valid translation of "Hazara population".

src  Each year, it sends up a new generation of shoots.
ann  Jedes Jahr sendet es eine neue Generation von Shoots.
sim  Jedes Jahr sendet es eine neue Generation von Shoots.
ref  Jedes Jahr wachsen neue Triebe.

src  He killed 63 percent of the Hazara population.
ann  Er starb 63 Prozent der Bevölkerung Hazara.
sim  Er starb 63 Prozent der Bevölkerung Hazara.
ref  Er tötete 63% der Hazara-Bevölkerung.

src  They would ordinarily support fish and other wildlife.
ann  Sie würden Fisch und andere wild lebende Tiere unterstützen.
sim  Sie würden Fisch und andere wild lebende Tiere unterstützen.
ref  Normalerweise würden sie Fisch und andere Wildtiere ernähren.

Table 2: Examples of markings to illustrate differences between human markings (ann) and simulated markings (sim). Marked parts are underlined. Example 1: "es" not clear from context, less literal reference translation. Example 2: Word omission (preposition after "Bevölkerung") or incorrect word order is not possible to mark. Example 3: Word order differs between MT and reference, word omission ("ordinarily") not marked.

Mode         Intra-Rater α (Mean / Std.)   Inter-Rater α
Marking      0.522 / 0.284                 0.201
Correction   0.820 / 0.171                 0.542
User-Chosen  0.775 / 0.179                 0.473

Table 3: Intra- and inter-rater agreement calculated by Krippendorff's α.

How reliable are corrections and markings? In addition to the absolute quality of the annotations, we are interested in measuring their reliability: Do annotators agree on which parts of a translation to mark or edit? While there are many possible valid translations, and hence many ways to annotate one given translation, it has been shown that learnability profits from annotations with less conflicting information (Kreutzer et al.,
2018). In order to quantify agreement for both modes on the same scale, we reduce both annotations to sentence-level quality judgments, which for markings is the ratio of words that were marked as incorrect in a sentence, and for corrections the ratio of words that were actually edited. If the hypothesis was perfect, neither markings nor edits would be required, and if it was completely wrong, all of it would have to be marked or edited. After this reduction, we measure agreement with Krippendorff's α (Krippendorff, 2013), see Table 3.
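The same reduction and agreement computation can be reproduced with off-the-shelf tooling. The sketch below uses the Python krippendorff package on invented correction rates and assumes an interval level of measurement, which the paper does not specify:

```python
import numpy as np
import krippendorff  # pip install krippendorff

def correction_rate(n_marked_or_edited, n_hyp_tokens):
    """Sentence-level reduction described above: the fraction of hypothesis
    tokens that were marked as incorrect (marking) or changed (post-editing)."""
    return n_marked_or_edited / n_hyp_tokens

# One row per annotator, one column per sentence; np.nan where an annotator
# did not see that sentence. Values here are invented for illustration.
reliability_data = np.array([
    [0.0, 0.25, np.nan, 0.4],
    [0.1, 0.25, 0.5,    np.nan],
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha: {alpha:.3f}")
```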
Which mode do annotators prefer? In the user-choice mode, where annotators can choose for each sentence whether they would like to mark or correct it, markings were chosen much more frequently than post-edits (61.9%). Annotators did not agree on the preferred choice of mode for the repeated sentences (α = −0.008), which indicates that there is no obvious policy for when one of the modes would be advantageous over the other. In the post-annotation questionnaire, however, 60% of the participants said they generally preferred post-edits over markings, despite markings being faster, and hence resulting in a higher hourly pay.

To better understand the differences in modes, we asked them about their policies in the user-choice mode, where for each sentence they would have to decide individually if they want to mark or post-edit it. The most commonly described policy is to decide based on error types and frequency: choose post-edits when insertions or re-ordering are needed, and markings preferably for translations with word errors (less effort than doing a lookup or replacement). One person preferred post-edits for short translations and markings for longer ones, another three generally preferred markings, and one person preferred post-edits. Where annotators found the interface in need of improvement was (1) in the presentation of inter-sentential context, (2) in the display of overall progress, and (3) an option to edit previously edited sentences. For the marking mode they requested an option to mark missing parts or areas for re-ordering.

Do markings and corrections express the same translation quality judgment? We observe that annotators find more than twice as many token corrections in post-edit mode than in marking mode.9

9 The automatically assessed translation quality for the baseline model does not differ drastically between the portions selected per mode.

Figure 2: Correction rate by annotation mode. The correction rate describes the ratio of words in the translation that were marked as incorrect (in marking mode) or edited (in post-editing mode). Means are indicated with diamonds.

This is partially caused by the reduced degrees of freedom in marking mode, but also underlines the general trend towards over-editing in post-edit mode. If markings and post-edits were used to compute a quality metric based on the correction rate, translations would be judged as much worse in post-editing mode than in marking mode (Figure 2). This also holds for whole sentences, where 273 (26.20%) were left un-edited in marking mode, and only 3 (0.29%) in post-editing mode.

4 Machine Learnability of NMT from Human Markings and Corrections

The hypotheses presented to the annotators were generated by an NMT model. The goal is to use the supervision signal provided by the human annotation to improve the underlying model by machine learning. Learnability is concerned with the question of how strong a signal is necessary in order to see improvements in NMT fine-tuning on the respective data.
Domain      train                    dev     test
WMT17       5,919,142                2,169   3,004
IWSLT17     206,112                  2,385   1,138
Selection   1035 corr / 1042 mark    –       1,043

Table 4: Data statistics (number of sentences).

Definition. Let x = x1 . . . xS be a sequence of indices over a source vocabulary V_SRC, and y = y1 . . . yT a sequence of indices over a target vocabulary V_TRG. The goal of sequence-to-sequence learning is to learn a function for mapping an input sequence x into an output sequence y. For the example of machine translation, y is a translation of x, and a model parameterized by a set of weights θ is optimized to maximize pθ(y | x). This quantity is further factorized into conditional probabilities over single tokens:

pθ(y | x) = ∏_{t=1}^{T} pθ(y_t | x; y_{<t})
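The fine-tuning objective itself is not spelled out in this copy of the paper. The following PyTorch-style sketch is therefore only our reading of the token-level factorization above, extended with per-token weights in the spirit of Petrushkov et al. (2018), which the 0/1 and −0.5/0.5 settings in Table 5 suggest; function and tensor names are our own assumptions:

```python
import torch
import torch.nn.functional as F

def weighted_token_nll(logits, targets, token_weights, pad_id=0):
    """Weighted negative log-likelihood over the factorization
    p(y | x) = prod_t p(y_t | x; y_<t).
    With token_weights == 1 everywhere this is ordinary supervised fine-tuning
    (e.g. on post-edits); marking-based weights can down-weight or penalize
    tokens marked as errors, e.g. 0/1 or -0.5/0.5 as in Table 5.
    Shapes: logits (batch, T, vocab); targets, token_weights (batch, T)."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = (targets != pad_id).float()
    return -(token_weights * token_logp * mask).sum() / mask.sum()
```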
System               TER ↓   BLEU ↑   METEOR ↑
1  WMT baseline      58.6    23.9     42.7
   Error Corrections
2  Full              57.4*   24.6*    44.7*
3  Small             57.9*   24.1     44.2*
   Error Markings
4  0/1               57.5*   24.4*    44.0*
5  −0.5/0.5          57.4*   24.6*    44.2*
6  random            58.1*   24.1     43.5*
   Quality Judgments
7  from corrections  57.4*   24.6*    44.7*
8  from markings     57.6*   24.5*    43.8*

Table 5: Results on the test set with feedback collected from humans. Decoding with beam search of width 5 and length penalty of 1. Statistically significant differences to the WMT baseline, an out-of-domain model, are marked with *.

The "small" model trained with error corrections is trained on one fifth of the data, which is comparable to the effort it takes to collect the error markings. Both error corrections and markings can be reduced to sentence-level quality judgments, where all tokens receive the same weight δ = #marked / #hypothesis tokens or δ = #corrected / #hypothesis tokens. In addition, we compare the markings against a random choice of marked tokens per sentence.12 We see that both models trained on corrections and markings improve significantly over the baseline (rows 2 and 3). Tuning the weights for (in)correct tokens makes a small but significant difference for learning from markings (rows 4 and 5). These human markings lead to significantly better models than random markings (row 6). When reducing both types of human feedback to sentence-level quality judgments, no loss in comparison to error corrections and a small loss for markings (rows 7 and 8) is observed. We suspect that the small margin between results for learning from corrections and markings is due to evaluating against references. Effects like over-editing (see Section 3.4.1) produce training data that lead the model to generate outputs that diverge more from independent references and therefore score lower than deserved under all metrics except for METEOR.

12 Each token is marked with probability pmark = 0.5.
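As a companion to the loss sketch above, the two weighting schemes compared in rows 6-8 can be written down directly. Note that the text defines δ as the marked (or corrected) fraction of hypothesis tokens; exactly how δ enters the objective is our assumption, not stated in this copy:

```python
import random

def sentence_level_weights(token_marks, n_hyp_tokens):
    """Sentence-level reduction (rows 7-8 of Table 5): every token receives the
    same weight delta = #marked / #hypothesis tokens (or #corrected /
    #hypothesis tokens for post-edits)."""
    delta = sum(token_marks) / n_hyp_tokens
    return [delta] * n_hyp_tokens

def random_markings(n_hyp_tokens, p_mark=0.5):
    """Random-marking baseline (row 6, footnote 12): each token is marked
    independently with probability p_mark."""
    return [1 if random.random() < p_mark else 0 for _ in range(n_hyp_tokens)]
```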
Human Evaluation. It is infeasible to collect markings or corrections for all our systems for a more appropriate comparison than to references, but for that purpose we conduct a small human evaluation study. Three bilingual raters receive 120 translations of the test set (∼10%) and the corresponding source sentences for each mode and judge whether the translation is better, as good as, or worse than the baseline: 64% of the translations obtained from learning from error markings are judged at least as good as the baseline, compared to 65.2% for the translations obtained from learning from error corrections. Table 6 shows the detailed proportions excluding identical translations.

System    > BL    = BL    < BL

Table 6: Human evaluation of the fine-tuned systems against the baseline (BL): >: better than the baseline, =: as good as the baseline, <: worse than the baseline.

Effort vs. Translation Quality. Figure 3 illustrates the relation between the total time spent on annotations and the resulting translation quality for corrections and markings trained on a selection of subsets of the full annotated data: The overall trend shows that both modes benefit from more training data, with more variance for the marking mode, but also a steeper descent. From a total annotation amount of approximately 20,000 s (≈ 5.5 h) onwards, markings are the more efficient choice.

4.2.1 LMEM Analysis

We fit a LMEM for the sentence-level quality scores of the baseline, and three runs each for the NMT systems fine-tuned on markings and post-edits respectively, and inspect the influence of the system as a fixed effect, and sentence id, topic and source length as random effects:

TER ∼ system + (1 | talk id/sent id) + (1 | topic) + (1 | src length)

The fixed effect is significant at p = 0.05, i.e., the quality scores of the three systems differ significantly under this model. The global intercept lies at 64.73, the one for marking 1.23 below, and the one for post-editing 0.96 below. The variance in TER is for the largest part explained by the sentence, then the talk, the source length, and the least by the topic.
Figure 3: Improvement in TER (y-axis) over total annotation duration in seconds (x-axis) for training data of varying size; lower is better. Scores are collected across two runs with a random selection of k ∈ [125, 250, 375, 500, 625, 750, 875] training data points.

5 Conclusion

We presented the first user study on the annotation process and the machine learnability of human error markings of translation outputs. This annotation mode has so far been given less attention than error corrections or quality judgments, and has until now only been investigated in simulation studies. We found that, both according to automatic evaluation metrics and by human evaluation, fine-tuning of NMT models achieved comparable gains by learning from error corrections and markings. However, error markings required several orders of magnitude less human annotation effort. In future work we will investigate the integration of automatic markings into the learning process, and we will explore online adaptation possibilities.

Acknowledgments

We would like to thank the anonymous reviewers for their feedback, Michael Staniek and Michael Hagmann for the help with data processing and analysis, and Sariya Karimova and Tsz Kin Lam for their contribution to a preliminary study. The research reported in this paper was supported in part by the German research foundation (DFG) under grant RI-2221/4-1.

References

Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. (2017). An actor-critic algorithm for sequence prediction. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France.

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA.

Barr, D. J., Levy, R., Scheepers, C., and Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3):255–278.

Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48.

Bentivogli, L., Bisazza, A., Cettolo, M., and Federico, M. (2016). Neural versus phrase-based machine translation quality: a case study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Austin, TX.

Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311.

Celemin, C., del Solar, J. R., and Kober, J. (2018). A fast hybrid reinforcement learning framework with human corrective feedback. Autonomous Robots.

Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. (2018). The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems (NIPS), Long Beach, CA.
Clark, J. H., Dyer, C., Lavie, A., and Smith, N. A. (2011). Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), Portland, OR.

Domingo, M., Peris, Á., and Casacuberta, F. (2017). Segment-based interactive-predictive machine translation. Machine Translation, 31(4):163–185.

Foster, G., Isabelle, P., and Plamondon, P. (1997). Target-text mediated interactive machine translation. Machine Translation, 12(1-2):175–194.

Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. (2017). Convolutional sequence to sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada.

Green, S., Wang, S. I., Chuang, J., Heer, J., Schuster, S., and Manning, C. D. (2014). Human effort and machine learnability in computer aided translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.

Karimova, S., Simianer, P., and Riezler, S. (2018). A user-study on online adaptation of neural machine translation to human post-edits. Machine Translation, 32(4):309–324.

Knowles, R. and Koehn, P. (2016). Neural interactive translation prediction. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), Austin, TX.

Koponen, M. (2016). Machine Translation Post-Editing and Effort. Empirical Studies on the Post-Editing Process. PhD thesis, University of Helsinki.

Kreutzer, J., Bastings, J., and Riezler, S. (2019). Joey NMT: A minimalist NMT toolkit for novices. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.

Kreutzer, J., Sokolov, A., and Riezler, S. (2017). Bandit structured prediction for neural sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada.

Kreutzer, J., Uyheng, J., and Riezler, S. (2018). Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.

Krippendorff, K. (2013). Content Analysis. An Introduction to Its Methodology. Sage, third edition.

Lam, T. K., Kreutzer, J., and Riezler, S. (2018). A reinforcement learning approach to interactive-predictive neural machine translation. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation (EAMT), Alicante, Spain.

Lam, T. K., Schamoni, S., and Riezler, S. (2019). Interactive-predictive neural machine translation through reinforcement and imitation. In Proceedings of Machine Translation Summit XVII (MT Summit), Dublin, Ireland.

Lavie, A. and Denkowski, M. J. (2009). The Meteor metric for automatic evaluation of machine translation. Machine Translation, 23(2-3):105–115.

Marie, B. and Max, A. (2015). Touch-based pre-post-editing of machine translation output. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA.

Peris, Á., Domingo, M., and Casacuberta, F. (2017). Interactive neural machine translation. Computer Speech & Language, 45:201–220.

Petrushkov, P., Khadivi, S., and Matusov, E. (2018). Learning from chunk-based feedback in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.

Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2016). Sequence level training with recurrent neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico.
Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Berlin, Germany.

Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), Cambridge, MA.

Sokolov, A., Kreutzer, J., Sunderland, K., Danchenko, P., Szymaniak, W., Fürstenau, H., and Riezler, S. (2017). A shared task on bandit learning for machine translation. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark.

Turchi, M., Negri, M., Farajian, M. A., and Federico, M. (2017). Continuous learning from human post-edits for neural machine translation. The Prague Bulletin of Mathematical Linguistics (PBML), 1(108):233–244.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), Long Beach, CA.

Appendix

A Annotator Instructions

The annotators received the following instructions:

• You will be shown a source sentence, its translation and an instruction.
• Read the source sentence and the translation.
• Follow the instruction by either (i) marking the incorrect words of the translation by clicking on them or highlighting them, (ii) correcting the translation by deleting, inserting and replacing words or parts of words, or choosing between modes (i) and (ii), and then click "submit".
  – In (ii), if you make a mistake and want to start over, you can click on the button "reset".
  – In (i), to highlight, click on the word you would like to start highlighting from, keep the mouse button pushed down, drag the pointer to the word you would like to stop highlighting on, and release the mouse button while over that word.
• If you want to take a short break (get a coffee, etc.), click on "pause" to pause the session. We're measuring the time it takes to work on each sentence, so please do not overuse this button (e.g. do not press pause while you're making your decisions), but also do not feel rushed if you feel uncertain about a sentence.
• If you want to take a longer break instead, just log out. The website will return you to the latest unannotated sentence when you log back in. If you log out in the middle of an annotation, your markings or post-edits will not be saved.
• After completing all sentences (ca. 300), you'll be asked to fill in a survey about your experience.
• Important:
  – Please do not use any external dictionaries or translation tools.
  – You might notice that some sentences reappear, which is desired. Please try to be consistent with repeated sentences.
  – There is no way to return and re-edit previous sentences, so please make sure you're confident with the edits/markings you provided before you click "submit".

B Creating Data Splits

In order to have users see a wider range of talks, each talk was split into three parts (beginning, middle, and end). Each talk part was assigned an annotation mode. Parts were then assigned to users using the following constraints:

• Each user should see nine document parts.
• No user should see the same document twice.
• Each user should see three sections in post-editing, marking, and user-choice mode.
• Each user should see three beginning, three middle, and three ending sections.
• Each document should be assigned each of the three annotation modes.

To avoid assigning post-editing to every beginning section, marking to every middle section, and user-choice to every ending section, the assignment was done with an integer linear program over the above constraints (a sketch is given below). Data was presented to users in the order [Post-edit, Marking, User Chosen, Agreement].
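The paper gives only the constraints, not the program itself; the following PuLP-based feasibility ILP is one possible encoding of them (the library choice and all variable names are ours, not the authors'):

```python
from itertools import product
from pulp import LpProblem, LpVariable, LpBinary, lpSum, LpMinimize, PULP_CBC_CMD

users, docs, parts, modes = range(10), range(30), range(3), range(3)
# parts: 0=beginning, 1=middle, 2=end; modes: 0=post-edit, 1=marking, 2=user-choice

prob = LpProblem("annotation_assignment", LpMinimize)
x = LpVariable.dicts("x", (users, docs, parts, modes), cat=LpBinary)
# dummy objective: the total is fixed at 90 by the constraints, so any feasible
# assignment is equally good
prob += lpSum(x[u][d][p][m] for u in users for d in docs for p in parts for m in modes)

# each document part is annotated exactly once, by one user in one mode
for d, p in product(docs, parts):
    prob += lpSum(x[u][d][p][m] for u in users for m in modes) == 1
for u in users:
    # each user sees nine document parts
    prob += lpSum(x[u][d][p][m] for d in docs for p in parts for m in modes) == 9
    for d in docs:  # no user sees the same document twice
        prob += lpSum(x[u][d][p][m] for p in parts for m in modes) <= 1
    for m in modes:  # three sections per annotation mode and user
        prob += lpSum(x[u][d][p][m] for d in docs for p in parts) == 3
    for p in parts:  # three beginning, three middle, three ending sections per user
        prob += lpSum(x[u][d][p][m] for d in docs for m in modes) == 3
for d, m in product(docs, modes):  # each document receives each mode exactly once
    prob += lpSum(x[u][d][p][m] for u in users for p in parts) == 1

prob.solve(PULP_CBC_CMD(msg=False))
assignments = [(u, d, p, m) for u in users for d in docs for p in parts
               for m in modes if x[u][d][p][m].value() == 1]
```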