Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks


Siddharth Dalmia  Brian Yan  Vikas Raunak  Florian Metze  Shinji Watanabe
Language Technologies Institute, Carnegie Mellon University, USA
{sdalmia,byan}@cs.cmu.edu

Abstract

End-to-end approaches for sequence tasks are becoming increasingly popular. Yet for complex sequence tasks, like speech translation, systems that cascade several models trained on sub-tasks have been shown to be superior, suggesting that the compositionality of cascaded systems simplifies learning and enables sophisticated search capabilities. In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks. These hidden intermediates can be improved using beam search to enhance the overall performance and can also incorporate external models at intermediate stages of the network to re-score or adapt towards out-of-domain data. One instance of the proposed framework is a Multi-Decoder model for speech translation that extracts the searchable hidden intermediates from a speech recognition sub-task. The model demonstrates the aforementioned benefits and outperforms the previous state-of-the-art by around +6 and +3 BLEU on the two test sets of Fisher-CallHome and by around +3 and +4 BLEU on the English-German and English-French test sets of MuST-C.¹

¹ All code and models are released as part of the ESPnet toolkit: https://github.com/espnet/espnet.

1 Introduction

The principle of compositionality loosely states that a complex whole is composed of its parts and the rules by which those parts are combined (Lake and Baroni, 2018). This principle is present in engineering, where task decomposition of a complex system is required to assess and optimize task allocations (Levis et al., 1994), and in natural language, where paragraph coherence and discourse analysis rely on decomposition into sentences (Johnson, 1992; Kuo, 1995) and sentence-level semantics relies on decomposition into lexical units (Liu et al., 2020b). Similarly, many sequence-to-sequence tasks that convert one sequence into another (Sutskever et al., 2014) can be decomposed into simpler sequence sub-tasks in order to reduce the overall complexity. For example, speech translation systems, which seek to process speech in one language and output text in another language, can be naturally decomposed into the transcription of source language audio through automatic speech recognition (ASR) and translation into the target language through machine translation (MT). Such cascaded approaches have been widely used to build practical systems for a variety of sequence tasks like hybrid ASR (Hinton et al., 2012), phrase-based MT (Koehn et al., 2007), and cascaded ASR-MT systems for speech translation (ST) (Pham et al., 2019).

End-to-end sequence models like encoder-decoder models (Bahdanau et al., 2015; Vaswani et al., 2017) are attractive in part due to their simplistic design and the reduced need for hand-crafted features. However, studies have shown mixed results compared to cascaded models, particularly for complex sequence tasks like speech translation (Inaguma et al., 2020) and spoken language understanding (Coucke et al., 2018). Although direct target sequence prediction avoids the issue of error propagation from one system to another in cascaded approaches (Tzoukermann and Miller, 2018), there are many attractive properties of cascaded systems, missing in end-to-end approaches, that are useful in complex sequence tasks.

In particular, we are interested in (1) the strong search capabilities of cascaded systems that compose the final task output from individual system predictions (Mohri et al., 2002; Kumar et al., 2006; Beck et al., 2019), (2) the ability to incorporate external models to re-score each individual system (Och and Ney, 2002; Huang and Chiang, 2007), (3) the ability to easily adapt individual components towards out-of-domain data (Koehn and Schroeder, 2007; Peddinti et al., 2015), and finally (4) the ability to monitor performance of the individual systems towards the decomposed sub-task (Tillmann and Ney, 2003; Meyer et al., 2016).
In this paper, we seek to incorporate these properties of cascaded systems into end-to-end sequence models. We first propose a generic framework to learn searchable hidden intermediates using an auto-regressive encoder-decoder model for any decomposable sequence task (§3). We then apply this approach to speech translation, where the intermediate stage is the output of ASR, by passing continuous hidden representations of discrete transcript sequences from the ASR sub-net decoder to the MT sub-net encoder. By doing so, we gain the ability to use beam search with optional external model re-scoring on the hidden intermediates, while maintaining end-to-end differentiability. Next, we suggest mitigation strategies for the error propagation issues inherited from decomposition.

We show the efficacy of searchable intermediate representations in our proposed model, called the Multi-Decoder, on speech translation, with a 5.4 and 2.8 BLEU score improvement over the previous state-of-the-art for the Fisher and CallHome test sets respectively (§6). We extend these improvements by an average of 0.5 BLEU score through the aforementioned benefit of re-scoring the intermediate search with external models trained on the same dataset. We also show a method for monitoring sub-net performance using oracle intermediates that are void of search errors (§6.1). Finally, we show how these models can adapt to out-of-domain speech translation datasets, how our approach can be generalized to other sequence tasks like speech recognition, and how the benefits of decomposition persist even for larger corpora like MuST-C (§6.2).

2 Background and Motivation

2.1 Compositionality in Sequence Models

The probabilistic space of a sequence is combinatorial in nature, such that a sentence of L words from a fixed vocabulary V would have an output space S of size |V|^L. In order to deal with this combinatorial output space, an output sentence is decomposed into labeled target tokens, y = (y_1, y_2, ..., y_L), where y_l ∈ V:

    P(y | x) = ∏_{i=1}^{L} P(y_i | x, y_{1:i-1})

An auto-regressive encoder-decoder model uses the above probabilistic decomposition in sequence-to-sequence tasks to learn next word prediction, which outputs a distribution over the next target token y_l given the previous tokens y_{1:l-1} and the input sequence x = (x_1, x_2, ..., x_T), where T is the input sequence length. In the next sub-section we detail the training and inference of these models.

2.2 Auto-regressive Encoder-Decoder Models

Training: In an auto-regressive encoder-decoder model, the ENCODER maps the input sequence x to a sequence of continuous hidden representations h^E = (h^E_1, h^E_2, ..., h^E_T), where h^E_t ∈ R^d. The DECODER then auto-regressively maps h^E and the preceding ground-truth output tokens, ŷ_{1:l-1}, to ĥ^D_l, where ĥ^D_l ∈ R^d. The sequence of decoder hidden representations forms ĥ^D = (ĥ^D_1, ĥ^D_2, ..., ĥ^D_L), and the likelihood of each output token y_l is given by SOFTMAXOUT, which denotes an affine projection of ĥ^D_l to V followed by a softmax function.

    h^E = ENCODER(x)
    ĥ^D_l = DECODER(h^E, ŷ_{1:l-1})                          (1)
    P(y_l | ŷ_{1:l-1}, h^E) = SOFTMAXOUT(ĥ^D_l)              (2)

During training, the DECODER performs token classification for next word prediction by considering only the ground-truth sequences for previous tokens ŷ. We refer to this ĥ^D as the oracle decoder representations, which will be discussed later.

Inference: During inference, we can maximize the likelihood of the entire sequence from the output space S by composing the conditional probabilities of each step for the L tokens in the sequence.

    h^D_l = DECODER(h^E, y_{1:l-1})                           (3)
    P(y_l | x, y_{1:l-1}) = SOFTMAXOUT(h^D_l)
    ỹ = argmax_{y ∈ S} ∏_{i=1}^{L} P(y_i | x, y_{1:i-1})      (4)

This is an intractable search problem, and it can be approximated by either greedily choosing the argmax at each step or using a search algorithm like beam search to approximate ỹ. Beam search (Reddy, 1988) generates candidates at each step and prunes the search space to a tractable beam size of B most likely sequences. As B → ∞, the beam search result would be equivalent to equation 4.

    GREEDYSEARCH := argmax_{y_l} P(y_l | x, y_{1:l-1})
    BEAMSEARCH := BEAM(P(y_l | x, y_{1:l-1}))
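For concreteness, the following is a minimal sketch of this procedure; `step_fn`, `sos`, and `eos` are hypothetical stand-ins for a decoder step followed by SOFTMAXOUT and the start/end-of-sequence symbols, and are not part of the original paper. Greedy search corresponds to the special case B = 1.

```python
def beam_search(step_fn, sos, eos, beam_size=10, max_steps=100):
    """Generic beam search over an autoregressive model.

    `step_fn(prefix)` is assumed to return a dict mapping each candidate
    next token to log P(y_l | x, y_{1:l-1}).
    """
    beam = [([sos], 0.0)]  # (prefix, accumulated log-likelihood)
    for _ in range(max_steps):
        candidates = []
        for prefix, score in beam:
            if prefix[-1] == eos:          # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + [token], score + logp))
        # prune to the B most likely sequences (approaches equation 4 as B grows)
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(p[-1] == eos for p, _ in beam):
            break
    return beam[0]  # most likely hypothesis and its log-likelihood
```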
Figure 1: (a) Multi-Decoder ST Model; (b) Multi-Sequence Attention. The left side presents the schematics and the information flow of our proposed framework applied to ST, in a model we call the Multi-Decoder. Our model decomposes ST into ASR and MT sub-nets, each of which consists of an encoder and a decoder. The right side displays a Multi-Sequence Attention variant of the DECODER_ST that is conditioned on both speech information via the ENCODER_ASR and transcription information via the ENCODER_ST.

In approximate search for auto-regressive models, like beam search, the DECODER receives alternate candidates of previous tokens to find candidates with a higher likelihood as an overall sequence. This also allows for the use of external models like Language Models (LM) or Connectionist Temporal Classification (CTC) models for re-scoring candidates (Hori et al., 2017).

3 Proposed Framework

In this section, we present a general framework to exploit natural decompositions in sequence tasks which seek to predict some output C from an input sequence A. If there is an intermediate sequence B for which A → B sequence transduction followed by B → C prediction achieves the original task, then the original A → C task is decomposable.

In other words, if we can learn P(B | A), then we can learn the overall task of P(C | A) through max_B (P(C | A, B) P(B | A)), approximated using Viterbi search. We define a first encoder-decoder SUB_{A→B}NET to map an input sequence A to a sequence of decoder hidden states, h^{D_B}. Then we define a subsequent SUB_{B→C}NET to map h^{D_B} to the final probabilistic output space of C. Therefore, we call h^{D_B} hidden intermediates. The following equations show the two sub-networks of our framework, SUB_{A→B}NET and SUB_{B→C}NET, which can be trained end-to-end while also exploiting compositionality in sequence tasks.²

² Note that this framework does not use locally-normalized softmax distributions but rather the hidden representations, thereby avoiding label bias issues when combining multiple sub-systems (Bottou et al., 1997; Wiseman and Rush, 2016).

SUB_{A→B}NET:

    h^E = ENCODER_A(A)
    ĥ^{D_B}_l = DECODER_B(h^E, ŷ^B_{1:l-1})
    P(y^B_l | ŷ^B_{1:l-1}, h^E) = SOFTMAXOUT(ĥ^{D_B}_l)      (5)

SUB_{B→C}NET:

    P(C | ĥ^{D_B}) = SUB_{B→C}NET(ĥ^{D_B})                   (6)

Note that the final prediction, given by equation 6, does not need to be a sequence and can be a categorical class, as in spoken language understanding tasks. Next, we will show how the hidden intermediates become searchable during inference.

3.1 Searchable Hidden Intermediates

As stated in §2.2, approximate search algorithms maximize the likelihood, P(y | x), of the entire sequence by considering different candidates y_l at each step. Candidate-based search, particularly in auto-regressive encoder-decoder models, also affects the decoder hidden representations, h^D, as these are directly dependent on the previous candidate (refer to equations 1 and 3). This implies that by searching for better approximations of the previous predicted tokens, y_{l-1} = (y_BEAM)_{l-1}, we also improve the decoder hidden representations for the next token, h^D_l = (h^D_BEAM)_l. As y_BEAM → ŷ, the decoder hidden representations tend to the oracle decoder representations that have only errors from next word prediction, h^D_BEAM → ĥ^D. A perfect search is analogous to choosing the ground truth ŷ at each step, which would yield ĥ^D.

We apply this beam search to the hidden intermediates, thereby approximating ĥ^{D_B} with h^{D_B}_BEAM.
This process is illustrated in Algorithm 1, which shows beam search for h^{D_B}_BEAM that is subsequently passed to the SUB_{B→C}NET.³ In line 7, we show how an external model like an LM or a CTC model can be used to generate an alternate sequence likelihood, P_EXT(y^B_l), which can be combined with the SUB_{A→B}NET likelihood, P_{A→B}(y^B_l | x), with a tunable parameter λ.

³ The algorithm shown only considers a single top approximation of the search; however, with added time-complexity, the final task prediction improves with the n-best h^{D_B}_BEAM for selecting the best resultant C.

Algorithm 1 Beam Search for Hidden Intermediates: We perform beam search to approximate the most likely sequence for the sub-task A → B, y^B_BEAM, while collecting the corresponding DECODER_B hidden representations, h^{D_B}_BEAM. The output h^{D_B}_BEAM is passed to the final sub-network to predict the final output C, and y^B_BEAM is used for monitoring performance on predicting B.

     1: Initialize: BEAM ← {sos}; k ← beam size
     2: h^{E_A} ← ENCODER_A(x)
     3: for l = 1 to maxSTEPS do
     4:     for y^B_{l-1} ∈ BEAM do
     5:         h^{D_B}_l ← DECODER_B(h^{E_A}, y^B_{l-1})
     6:         for y^B_l ∈ y^B_{l-1} + {V} do
     7:             s_l ← P_{A→B}(y^B_l | x)^{1-λ} · P_EXT(y^B_l)^λ
     8:             H ← (s_l, y^B_l, h^{D_B}_l)
     9:         end for
    10:     end for
    11:     BEAM ← argkmax(H)
    12: end for
    13: (s^B, y^B_BEAM, h^{D_B}_BEAM) ← argmax(BEAM)
    14: Return y^B_BEAM → SUB_{A→B}NET monitoring
    15: Return h^{D_B}_BEAM → final SUB_{B→C}NET
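As a companion to Algorithm 1, the following minimal Python sketch (our own illustration, not the released ESPnet implementation) carries the DECODER_B hidden states alongside each hypothesis and interpolates an external LM/CTC score with weight λ in log space; `encoder_a`, `decoder_b`, `ext_logp`, and `vocab` are hypothetical interfaces.

```python
def beam_search_hidden_intermediates(encoder_a, decoder_b, vocab, ext_logp,
                                     sos, eos, x, lam=0.2, beam_size=16,
                                     max_steps=200):
    """Sketch of Algorithm 1: beam search over the A -> B sub-task that
    returns decoder hidden states alongside the decoded tokens.

    Assumed (hypothetical) interfaces: decoder_b(h_enc, prefix) returns
    (hidden_state, {token: log P(token | x, prefix)}); ext_logp(prefix, tok)
    is an external LM/CTC log-likelihood used for re-scoring (line 7).
    """
    h_enc = encoder_a(x)                                   # line 2
    beam = [([sos], [], 0.0)]                              # (tokens, hiddens, score)
    for _ in range(max_steps):                             # line 3
        hyps = []
        for tokens, hiddens, score in beam:                # line 4
            if tokens[-1] == eos:                          # keep finished hypotheses
                hyps.append((tokens, hiddens, score))
                continue
            h_dec, logps = decoder_b(h_enc, tokens)        # line 5
            for tok in vocab:                              # line 6
                # line 7: log-linear interpolation, i.e. P^(1-lam) * P_ext^lam
                s = (1 - lam) * logps[tok] + lam * ext_logp(tokens, tok)
                hyps.append((tokens + [tok], hiddens + [h_dec], score + s))
        beam = sorted(hyps, key=lambda h: h[2], reverse=True)[:beam_size]  # line 11
        if all(h[0][-1] == eos for h in beam):
            break
    tokens, hiddens, _ = max(beam, key=lambda h: h[2])     # line 13
    return tokens, hiddens  # line 14: monitoring; line 15: input to SUB_{B->C}NET
```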
We can monitor the performance of the SUB_{A→B}NET by comparing the decoded intermediate sequence y^B_BEAM to the ground truth ŷ^B. We can also monitor the SUB_{B→C}NET performance by using the aforementioned oracle representations of the intermediates, ĥ^{D_B}, which can be obtained by feeding the ground truth ŷ^B to DECODER_B. By passing ĥ^{D_B} to the SUB_{B→C}NET, we can observe its performance in a vacuum, i.e., void of search errors in the hidden intermediates.

3.2 Multi-Decoder Model

In order to show the applicability of our end-to-end framework, we propose our Multi-Decoder model for speech translation. This model predicts a sequence of text translations y^ST from an input sequence of speech x and uses a sequence of text transcriptions y^ASR as an intermediate. In this case, the SUB_{A→B}NET in equation 5 is specified as the ASR sub-net and the SUB_{B→C}NET in equation 6 is specified as the MT sub-net. Since the MT sub-net is also a sequence prediction task, both sub-nets are encoder-decoder models in our architecture (Bahdanau et al., 2015; Vaswani et al., 2017). In Figure 1 we illustrate the schematics of our transformer-based Multi-Decoder ST model, which can also be summarized as follows:

    h^{E_ASR} = ENCODER_ASR(x)                               (7)
    ĥ^{D_ASR}_l = DECODER_ASR(h^{E_ASR}, ŷ^{ASR}_{1:l-1})    (8)
    h^{E_ST} = ENCODER_ST(ĥ^{D_ASR})                         (9)
    ĥ^{D_ST}_l = DECODER_ST(h^{E_ST}, ŷ^{ST}_{1:l-1})        (10)

As we can see from equations 9 and 10, the MT sub-network attends only to the decoder representations, ĥ^{D_ASR}, of the ASR sub-network, which could lead to error propagation issues from the ASR sub-network to the MT sub-network, similar to cascaded systems, as mentioned in §1. To alleviate this problem, we modify equation 10 such that DECODER_ST attends to both h^{E_ST} and h^{E_ASR}:

    ĥ^{D_ST-SA}_l = DECODER^SA_ST(h^{E_ST}, h^{E_ASR}, ŷ^{ST}_{1:l-1})   (11)

We use the multi-sequence cross-attention discussed by Helcl et al. (2018), shown on the right side of Figure 1, to condition the final outputs generated by ĥ^{D_ST}_l on both speech and transcript information, in an attempt to allow our network to recover from intermediate mistakes during inference. We call this model the Multi-Decoder w/ Speech-Attention.
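For illustration, a compressed PyTorch sketch of the Multi-Decoder forward pass under teacher forcing (equations 7-10) could look as follows. This is our own simplification, not the released ESPnet code: the convolutional sub-sampling is replaced by a linear projection, and positional encodings and causal/padding masks are omitted for brevity.

```python
import torch.nn as nn

class MultiDecoderST(nn.Module):
    """Minimal sketch of the Multi-Decoder (equations 7-10). Layer sizes
    follow the paper's configuration; 83 input dims (80 filterbank + 3
    pitch) is an assumption based on the standard ESPnet recipe."""

    def __init__(self, feat_dim=83, vocab_asr=1000, vocab_st=1000, d=256):
        super().__init__()
        self.subsample = nn.Linear(feat_dim, d)  # stand-in for conv sub-sampling
        enc_layer = nn.TransformerEncoderLayer(d, nhead=4, dim_feedforward=2048,
                                               batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d, nhead=4, dim_feedforward=2048,
                                               batch_first=True)
        self.encoder_asr = nn.TransformerEncoder(enc_layer, num_layers=12)
        self.decoder_asr = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.encoder_st = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder_st = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.emb_asr = nn.Embedding(vocab_asr, d)
        self.emb_st = nn.Embedding(vocab_st, d)
        self.out_asr = nn.Linear(d, vocab_asr)  # SOFTMAXOUT (ASR)
        self.out_st = nn.Linear(d, vocab_st)    # SOFTMAXOUT (ST)

    def forward(self, x, y_asr, y_st):
        h_enc_asr = self.encoder_asr(self.subsample(x))               # eq. 7
        h_dec_asr = self.decoder_asr(self.emb_asr(y_asr), h_enc_asr)  # eq. 8
        h_enc_st = self.encoder_st(h_dec_asr)                         # eq. 9
        h_dec_st = self.decoder_st(self.emb_st(y_st), h_enc_st)       # eq. 10
        return self.out_asr(h_dec_asr), self.out_st(h_dec_st)
```

Note how the ST encoder consumes the ASR decoder hidden states rather than discrete transcripts, which is what keeps the model end-to-end differentiable.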
4 Baseline Encoder-Decoder Model

For our baseline model, we use an end-to-end encoder-decoder (Enc-Dec) ST model with ASR joint training (Inaguma et al., 2020) as an auxiliary loss to the speech encoder. In other words, the model consumes speech input using the ENCODER_ASR to produce h^{E_ASR}, which is used for cross-attention by DECODER_ASR and DECODER_ST. Using the decomposed ASR task as an auxiliary loss also helps the baseline Enc-Dec model and provides strong baseline performance, as we will see in Section 6.

5 Data and Experimental Setup

Data: We demonstrate the efficacy of our proposed approach on ST with the Fisher-CallHome
corpus (Post et al., 2013), which contains 170 hours of Spanish conversational telephone speech, transcriptions, and English translations. All punctuation except apostrophes was removed, and results are reported in terms of detokenized case-insensitive BLEU (Papineni et al., 2002; Post, 2018). We compute BLEU using the 4 references in Fisher (dev, dev2, and test) and the single reference in CallHome (dev and test) (Post et al., 2013; Kumar et al., 2014; Weiss et al., 2017). We use a joint source and target vocabulary of 1K byte pair encoding (BPE) units (Kudo and Richardson, 2018).

We prepare the corpus using the ESPnet library and follow the standard data preparation, where inputs are globally mean-variance normalized log-mel filterbank and pitch features from up-sampled 16kHz audio (Watanabe et al., 2018). We also apply speed perturbations of 0.9 and 1.1 and the SS SpecAugment policy (Park et al., 2019).

Baseline Configuration: All of our models are implemented using the ESPnet library and trained on 3 NVIDIA Titan 2080Ti GPUs for ≈12 hours. For the baseline Enc-Dec model, discussed in §4, we use an ENCODER_ASR consisting of convolutional sub-sampling by a factor of 4 (Watanabe et al., 2018) and 12 transformer encoder blocks with 2048 feed-forward dimension, 256 attention dimension, and 4 attention heads. The DECODER_ASR and DECODER_ST both consist of 6 transformer decoder blocks with the same configuration as the ENCODER_ASR. There are 37.9M trainable parameters. We apply dropout of 0.1 for all components, detailed in the Appendix (A.1).

We train our models using an effective batch size of 384 utterances and use the Adam optimizer (Kingma and Ba, 2015) with an inverse square root decay learning rate schedule. We set the learning rate to 12.5, warmup steps to 25K, and epochs to 50. We use joint training with hybrid CTC/attention ASR (Watanabe et al., 2017) by setting mtl-alpha to 0.3 and asr-weight to 0.5 as defined by Watanabe et al. (2018). During inference, we perform beam search (Seki et al., 2019) on the ST sequences, using a beam size of 10, length penalty of 0.2, and max length ratio of 0.3 (Watanabe et al., 2018).

Multi-Decoder Configuration: For the Multi-Decoder ST model, discussed in §3, we use the same transformer configuration as the baseline for the ENCODER_ASR, DECODER_ASR, and DECODER_ST. Additionally, the Multi-Decoder has an ENCODER_ST consisting of 2 transformer encoder blocks with the same configuration as the ENCODER_ASR, giving a total of 40.5M trainable parameters. The training configuration is also the same as for the baseline. For the Multi-Decoder w/ Speech-Attention model (42.1M trainable parameters), we increase the attention dropout of the ST decoder to 0.4 and the dropout on all other components of the ST decoder to 0.2, while keeping dropout on the remaining components at 0.1. We verified that increasing the dropout does not help the vanilla Multi-Decoder ST model.

During inference, we perform beam search on both the ASR and ST output sequences, as discussed in §3. The ST beam search is identical to that of the baseline. For the intermediate ASR beam search, we use a beam size of 16, length penalty of 0.2, and max length ratio of 0.3. In some of our experiments, we also include fusion of a source language LM with a 0.2 weight and CTC with a 0.3 weight to re-score the intermediate ASR beam search (Watanabe et al., 2017). For the Speech-Attention variant, we increase the LM weight to 0.4.

Note that the ST beam search configuration remains constant across our baseline and Multi-Decoder experiments, as our focus is on improving overall performance through searchable intermediate representations. Thus, the various re-scoring techniques applied to the ASR beam search are options newly enabled by our proposed architecture and are not used in the ST beam search.
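To make the interaction of the mtl-alpha and asr-weight knobs from the Baseline Configuration concrete, the multi-task objective can be sketched as below; this reflects our reading of the hybrid CTC/attention setup and is not a verbatim excerpt from ESPnet.

```python
def joint_st_loss(loss_ctc, loss_asr_att, loss_st,
                  mtl_alpha=0.3, asr_weight=0.5):
    """Combine the auxiliary ASR losses with the main ST loss.

    mtl_alpha mixes the CTC and attention ASR losses (hybrid CTC/attention);
    asr_weight balances the ASR auxiliary task against the ST task.
    """
    loss_asr = mtl_alpha * loss_ctc + (1 - mtl_alpha) * loss_asr_att
    return asr_weight * loss_asr + (1 - asr_weight) * loss_st
```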
6 Results

Table 1 presents the overall ST performance (BLEU) of our proposed Multi-Decoder model. Our model improves by +2.9/+0.3 BLEU (Fisher/CallHome) over the best cascaded baseline and by +5.6/+1.5 BLEU over the best published end-to-end baselines. With Speech-Attention, our model improves by +3.4/+1.6 BLEU over the cascaded baselines and +7.1/+2.8 BLEU over encoder-decoder baselines. Both the Multi-Decoder and the Multi-Decoder w/ Speech-Attention are on average further improved by +0.9/+0.4 BLEU through ASR re-scoring.⁴

⁴ We also evaluate our models using other MT metrics to supplement these results, as shown in the Appendix (A.2).

| Model Type    | Model Name            | Uses Speech Transcripts | Fisher dev(↑) | Fisher dev2(↑) | Fisher test(↑) | CallHome dev(↑) | CallHome test(↑) |
|---------------|-----------------------|:---:|------|------|------|------|------|
| Cascade       | Inaguma et al. (2020) | ✓ | 41.5 | 43.5 | 42.2 | 19.6 | 19.8 |
| Cascade       | ESPnet ASR+MT (2018)  | ✓ | 50.4 | 51.2 | 50.7 | 19.6 | 19.2 |
| Enc-Dec       | Weiss et al. (2017) ♦ | ✗ | 46.5 | 47.3 | 47.3 | 16.4 | 16.6 |
| Enc-Dec       | Weiss et al. (2017) ♦ | ✓ | 48.3 | 49.1 | 48.7 | 16.8 | 17.4 |
| Enc-Dec       | Inaguma et al. (2020) | ✓ | 46.6 | 47.6 | 46.5 | 16.8 | 16.8 |
| Enc-Dec       | Guo et al. (2021)     | ✓ | 48.7 | 49.6 | 47.0 | 18.5 | 18.6 |
| Enc-Dec       | Our Implementation    | ✓ | 49.6 | 50.9 | 49.5 | 19.1 | 18.2 |
| Multi-Decoder | Our Proposed Model    | ✓ | 52.7 | 53.3 | 52.6 | 20.5 | 20.1 |
| Multi-Decoder |  +ASR Re-scoring      | ✓ | 53.3 | 54.2 | 53.7 | 21.1 | 20.8 |
| Multi-Decoder |  +Speech-Attention    | ✓ | 54.6 | 54.6 | 54.1 | 21.7 | 21.4 |
| Multi-Decoder |   +ASR Re-scoring     | ✓ | 55.2 | 55.2 | 55.0 | 21.7 | 21.5 |

Table 1: Results presenting the overall performance (BLEU) of our proposed Multi-Decoder model. Cascade and Enc-Dec results from previous papers and our own implementation of the Enc-Dec are shown for comparison. The best performing models are highlighted. ♦ Implemented with LSTM, while all others are Transformer-based.

Table 1 also includes our implementation of the Baseline Enc-Dec model discussed in §4. In this way, we are able to make a fair comparison with our framework, as we control the model and inference configurations to be analogous. For instance, we keep the same search parameters for the final output in the baseline and the Multi-Decoder to demonstrate the impact of the intermediate beam search.

6.1 Benefits

6.1.1 Sub-network performance monitoring

An added benefit of our proposed approach over the Baseline Enc-Dec is the ability to monitor the individual performances of the ASR (% WER) and MT (BLEU) sub-nets, as shown in Table 2. The Multi-Decoder w/ Speech-Attention shows a greater MT sub-net performance than the Multi-Decoder as well as a slight improvement of the ASR sub-net, suggesting that ST can potentially help ASR.

| Model              | Overall ST(↑) | Sub-Net ASR(↓) | Sub-Net MT(↑) |
|--------------------|------|------|------|
| Multi-Decoder      | 52.7 | 22.6 | 64.9 |
|  +Speech-Attention | 54.6 | 22.4 | 66.6 |

Table 2: Results presenting the overall ST performance (BLEU) of our Multi-Decoder models, along with their sub-net ASR (% WER) and MT (BLEU) performances. All results are from the Fisher dev set.

6.1.2 Beam search for better intermediates

The overall ST performance improves when a higher beam size is used in the intermediate ASR search, and this increase can be attributed to the improved ASR sub-net performance. Figure 2 shows this trend across ASR beam sizes of 1, 4, 8, 10, and 16, while fixing the ST decoding beam size to 10. A beam size of 1, which is a greedy search, results in lower ASR sub-net and overall ST performances. As beam sizes become larger, gains taper off, as can be seen between beam sizes of 10 and 16.

Figure 2: Results studying the effect of different ASR beam sizes in the intermediate representation search on the overall ST performance (BLEU) and the ASR sub-net performance (% WER) for our Multi-Decoder model. A beam of 1 is the same as greedy search.

6.1.3 External models for better search

External models like CTC acoustic models and language models are commonly used for re-scoring encoder-decoder models (Hori et al., 2017), due to the difference in their modeling capabilities. CTC directly models transcripts while being conditionally independent of the other outputs given the input, and LMs predict the next token in a sequence.
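A rough sketch of how such scores are commonly combined when scoring an ASR hypothesis is given below; this follows one common hybrid CTC/attention formulation, and the exact weighting used in the ESPnet decoder may differ.

```python
def fused_score(logp_att, logp_ctc, logp_lm,
                ctc_weight=0.3, lm_weight=0.2):
    # Log-linear fusion of attention-decoder, CTC, and LM scores for one
    # hypothesis; weights mirror the ASR re-scoring setup in Section 5.
    return ((1 - ctc_weight) * logp_att
            + ctc_weight * logp_ctc
            + lm_weight * logp_lm)
```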
Both variants of the Multi-Decoder improve due to improved ASR sub-net performance using external CTC and LM models for re-scoring, as shown in Table 3. We use a recurrent neural network LM trained on the Fisher-CallHome Spanish transcripts, with a dev perplexity of 18.8, and the CTC model from the joint loss applied during training. Neither external model incorporates additional data. Although the impact of the LM-only re-scoring is not shown in the ASR % WER, it reduces substitution and deletion rates in the ASR, and this is observed to help the overall ST performance.

| Model                         | Overall ST(↑) | Sub-Net ASR(↓) |
|-------------------------------|------|------|
| Multi-Decoder                 | 52.7 | 22.6 |
|  +ASR Re-scoring w/ LM        | 53.2 | 22.6 |
|  +ASR Re-scoring w/ CTC       | 52.8 | 22.1 |
|   +ASR Re-scoring w/ LM       | 53.3 | 21.7 |
| Multi-Decoder w/ Speech-Attn. | 54.6 | 22.4 |
|  +ASR Re-scoring w/ LM        | 55.1 | 22.4 |
|  +ASR Re-scoring w/ CTC       | 54.7 | 22.0 |
|   +ASR Re-scoring w/ LM       | 55.2 | 21.9 |

Table 3: Results presenting the overall ST performance (BLEU) and the sub-net ASR (% WER) of our Multi-Decoder models with external CTC and LM re-scoring in the ASR intermediate representation search. All results are from the Fisher dev set.

Figure 3: Results comparing the ST performances (BLEU) of our Baseline Enc-Dec, Multi-Decoder, and Multi-Decoder w/ Speech-Attention across different ASR difficulties measured using % WER on the Fisher dev set (1-ref). The buckets on the x-axis are determined using the utterance-level % WER of the Multi-Decoder ASR sub-net.

6.1.4 Error propagation avoidance

As discussed in §3, our Multi-Decoder model inherits the error propagation issue, as can be seen in Figure 3. For the easiest bucket of utterances, with < 40% WER in the Multi-Decoder's ASR sub-net, our model's ST performance, as measured by the corpus BLEU of the bucket, exceeds that of the Baseline Enc-Dec. The inverse is true for the more difficult bucket of [40, 80)%, showing that error propagation is limiting the performance of our model; however, we show that multi-sequence attention can alleviate this issue. For extremely difficult utterances in the ≥ 80% bucket, ST performance for all three approaches is suppressed. We also provide qualitative examples of error propagation avoidance in the Appendix (A.3).
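For reference, this utterance-level bucketing can be reproduced with a standard word-level Levenshtein distance, as in the generic sketch below (our own illustration, not the paper's scoring script).

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via a single-row DP table."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, or substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer_bucket(ref, hyp):
    """Assign an utterance (non-empty reference) to a % WER bucket."""
    wer = 100.0 * edit_distance(ref.split(), hyp.split()) / len(ref.split())
    return "<40%" if wer < 40 else "[40,80)%" if wer < 80 else ">=80%"
```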
                                                         We apply our generic framework to another de-
6.2     Generalizability
                                                         composable sequence task, speech recognition, and
In this section, we discuss the generalizability of      show the results of various levels of decomposition
our framework towards out-of-domain data. We             in Table 5. We show that with phoneme, character,
also extend our Multi-Decoder model to other se-         or byte-pair encoding (BPE) sequences as interme-
quence tasks like speech recognition. Finally, we        diates, the Multi-Decoder presents strong results
apply our ST models to a larger corpus with more         on both Fisher and CallHome test sets. We also
language pairs and a different domain of speech.         observe that the BPE intermediates perform bet-
| Model                                  | Overall ST(↑) | Sub-Net ASR(↓) |
|----------------------------------------|------|------|
| In-domain ST model                     |      |      |
|  Baseline (Wang et al., 2020b)         | 12.0 | -    |
|   +ASR Pretrain (Wang et al., 2020b) ♦ | 23.0 | 16.0 |
| Out-of-domain ST model                 |      |      |
|  Multi-Decoder                         | 11.8 | 46.8 |
|   +ASR Re-scoring w/ in-domain LM      | 14.4 | 36.7 |
|  Multi-Decoder w/ Speech-Attention     | 12.6 | 46.5 |
|   +ASR Re-scoring w/ in-domain LM      | 15.0 | 36.7 |

Table 4: Results presenting the overall ST performance (BLEU) and the sub-net ASR (% WER) of our Multi-Decoder models when tested on out-of-domain data. All models were trained on the Fisher-CallHome Es→En corpus and tested on the CoVoST-2 Es→En corpus. ♦ Pretrained with 364 hours of in-domain ASR data.

6.2.2 Decomposing Speech Transcripts

We apply our generic framework to another decomposable sequence task, speech recognition, and show the results of various levels of decomposition in Table 5. We show that with phoneme, character, or byte-pair encoding (BPE) sequences as intermediates, the Multi-Decoder presents strong results on both the Fisher and CallHome test sets. We also observe that the BPE intermediates perform better than the phoneme/character variants, which could be attributed to the reduced search capabilities of encoder-decoder models using beam search on longer sequences (Sountsov and Sarawagi, 2016), such as phoneme/character sequences.

| Model         | Intermediate | Fisher ASR(↓) | CallHome ASR(↓) |
|---------------|--------------|------|------|
| Enc-Dec ♦     | -            | 23.2 | 45.3 |
| Multi-Decoder | Phoneme      | 20.7 | 40.0 |
| Multi-Decoder | Character    | 20.4 | 39.9 |
| Multi-Decoder | BPE100       | 19.7 | 38.9 |

Table 5: Results presenting the ASR performance (% WER) when using the Multi-Decoder model on the decomposed ASR task with phoneme, character, and BPE100 intermediates. All results are from the Fisher-CallHome Spanish corpus. ♦ (Weiss et al., 2017)

6.2.3 Extending to MuST-C Language Pairs

In addition to our results using the 170 hours of the Spanish-English Fisher-CallHome corpus, in Table 6 we show that our decompositional framework is also effective on larger ST corpora. In particular, we use 400 hours of English-German and 500 hours of English-French ST data from the MuST-C corpus (Di Gangi et al., 2019). Our Multi-Decoder model improves by +2.7 and +1.5 BLEU, in German and French respectively, over end-to-end baselines from prior works that do not use additional training data. We show that ASR re-scoring gives an additional +0.1 and +0.4 BLEU improvement.⁵

⁵ Details of the MuST-C data preparation and model parameters are given in the Appendix (A.4).

| Model                            | En→De ST(↑) | En→Fr ST(↑) |
|----------------------------------|------|------|
| NeurST (Zhao et al., 2020)       | 22.9 | 33.3 |
| Fairseq S2T (Wang et al., 2020a) | 22.7 | 32.9 |
| ESPnet-ST (Inaguma et al., 2020) | 22.9 | 32.7 |
| Dual-Decoder (Le et al., 2020)   | 23.6 | 33.5 |
| Multi-Decoder w/ Speech-Attn.    | 26.3 | 37.0 |
|  +ASR Re-scoring                 | 26.4 | 37.4 |

Table 6: Results presenting the overall ST performance (BLEU) of our Multi-Decoder w/ Speech-Attention models with ASR re-scoring across two language pairs, English-German (En→De) and English-French (En→Fr). All results are from the MuST-C tst-COMMON sets. All models use speech transcripts.

By extending our Multi-Decoder models to this MuST-C study, we show the generalizability of our approach across several dimensions of ST tasks. First, our approach consistently improves over baselines across multiple language pairs. Second, our approach is robust to the distinct domains of telephone conversations from Fisher-CallHome and TED talks from MuST-C. Finally, by scaling from 170 hours of Fisher-CallHome data to 500 hours of MuST-C data, we show that the benefits of decomposing sequence tasks with searchable hidden intermediates persist even with more data.

Furthermore, the performance of our Multi-Decoder models trained with only English-German or English-French ST data from MuST-C is comparable to other methods which incorporate larger external ASR and MT data in various ways. For instance, Zheng et al. (2021) use 4700 hours of ASR data and 2M sentences of MT data for pretraining and multi-task learning. Similarly, Bahar et al. (2021) use 2300 hours of ASR data and 27M sentences of MT data for pretraining. Our competitive performance without the use of any additional data highlights the data-efficient nature of our proposed end-to-end framework as opposed to the baseline encoder-decoder model, as pointed out by Sperber and Paulik (2020).

7 Discussion and Relation to Prior Work

Compositionality: A number of recent works have constructed composable neural network modules for tasks such as visual question answering (Andreas et al., 2016), neural MT (Raunak et al., 2019), and synthetic sequence-to-sequence tasks (Lake, 2019). Modules that are first trained separately can subsequently be tightly integrated into a single end-to-end trainable model by passing differentiable soft decisions instead of discrete decisions
in the intermediate stage (Bahar et al., 2021). Further, even a single encoder-decoder model can be decomposed into modular components where the encoder and decoder modules have explicit functions (Dalmia et al., 2019).

Joint Training with Sub-Tasks: End-to-end sequence models have been shown to benefit from joint training with sub-tasks as auxiliary loss functions for a variety of tasks like ASR (Kim et al., 2017), ST (Salesky et al., 2019; Liu et al., 2020a; Dong et al., 2020; Le et al., 2020), and SLU (Haghani et al., 2018). Such joint training has been shown to induce structure (Belinkov et al., 2020) and improve model performance (Toshniwal et al., 2017), but it may reduce data efficiency if some sub-nets are not included in the final end-to-end model (Sperber et al., 2019; Wang et al., 2020c). Our framework avoids this sub-net waste at the cost of computational load during inference.

Speech Translation Decoders: Prior works have used ASR/MT decoding to improve the overall ST decoding through synchronous decoding (Liu et al., 2020a), dual decoding (Le et al., 2020), and successive decoding (Dong et al., 2020). These works partially or fully decode ASR transcripts and use discrete intermediates to assist MT decoding. Tu et al. (2017) and Anastasopoulos and Chiang (2018) are closest to our Multi-Decoder ST model; however, the benefits of our proposed framework are not entirely explored in these works.

Two-Pass Decoding: Two-pass decoding involves first predicting with one decoder and then re-evaluating with another decoder (Geng et al., 2018; Sainath et al., 2019; Hu et al., 2020; Rijhwani et al., 2020). The two decoders iterate on the same sequence, so there is no decomposition into sub-tasks in this method. On the other hand, our approach provides the subsequent decoder with a more structured representation than the input by decomposing the complexity of the overall task. Like two-pass decoding, our approach provides a sense of the future to the second decoder, which allows it to correct mistakes from the first decoder.

Auto-Regressive Decoding: As auto-regressive decoders inherently learn a language model along with the task at hand, they tend to be domain specific (Samarakoon et al., 2018; Müller et al., 2020). This can cause generalizability issues during inference (Murray and Chiang, 2018; Yang et al., 2018), impacting the performance of both the task at hand and any downstream tasks. Our approach alleviates these problems through intermediate search, external models for intermediate re-scoring, and multi-sequence attention.

8 Conclusion and Future Work

We present searchable hidden intermediates for end-to-end models of decomposable sequence tasks. We show the efficacy of our Multi-Decoder model on the Fisher-CallHome Es→En and MuST-C En→De and En→Fr speech translation corpora, achieving state-of-the-art results. We present various benefits of our framework, including sub-net performance monitoring, beam search for better hidden intermediates, external models for better search, and error propagation avoidance. Further, we demonstrate the flexibility of our framework towards out-of-domain tasks, with the ability to adapt our sequence model at intermediate stages of decomposition. Finally, we show generalizability by training Multi-Decoder models for the speech recognition task at various levels of decomposition.

We hope insights derived from our study stimulate research on tighter integrations between the benefits of cascaded and end-to-end sequence models. Exploiting searchable intermediates through beam search is just the tip of the iceberg for search algorithms, as numerous approximate search techniques like diverse beam search (Vijayakumar et al., 2018) and best-first beam search (Meister et al., 2020) have been recently proposed to improve the diversity and approximation of the most-likely sequence. Incorporating differentiable lattice-based search (Hannun et al., 2020) could also allow the subsequent sub-net to digest n-best representations.

9 Acknowledgements

This work started while Vikas Raunak was a student at CMU; he is now working as a Research Scientist at Microsoft. We thank Pengcheng Guo, Hirofumi Inaguma, Elizabeth Salesky, Maria Ryskina, Marta Méndez Simón, and Vijay Viswanathan for their helpful discussion during the course of this project. We also thank the anonymous reviewers for their valuable feedback. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (Towns et al., 2014), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system (Nystrom et al., 2015), which is
supported by NSF award number ACI-1445606,                 Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Has-
at the Pittsburgh Supercomputing Center (PSC).               san Sajjad, and James Glass. 2020. On the linguistic
                                                             representational power of neural machine translation
The work was supported in part by an AWS Ma-
                                                             models. Computational Linguistics, 46(1):1–52.
chine Learning Research Award. This research
was also supported in part the DARPA KAIROS                Léon Bottou, Yoshua Bengio, and Yann Le Cun. 1997.
program from the Air Force Research Laboratory               Global training of document processing systems us-
under agreement number FA8750-19-2-0200. The                 ing graph transformer networks. In Proceedings of
                                                             IEEE Computer Society Conference on Computer Vi-
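As one illustration of this direction, the sketch below treats the search over an intermediate stage as a pluggable component, swapping the standard log-probability scoring of beam search for a diversity-promoting variant in the spirit of diverse beam search. The toy model, the scoring functions, and the sibling-rank penalty are hypothetical simplifications for exposition, not the algorithms of Vijayakumar et al. (2018) or Meister et al. (2020).

import math
from typing import Callable, List, Tuple

def toy_log_probs(prefix: Tuple[int, ...], vocab_size: int = 5) -> List[float]:
    # Stand-in for a decoder step: a fixed, prefix-dependent distribution.
    scores = [math.sin(len(prefix) + v) for v in range(vocab_size)]
    norm = math.log(sum(math.exp(s) for s in scores))
    return [s - norm for s in scores]

def plain_score(logp: float, rank: int) -> float:
    return logp  # standard beam search: rank candidates by log-probability alone

def diverse_score(logp: float, rank: int, strength: float = 0.5) -> float:
    # Penalize lower-ranked siblings of the same prefix to spread the beam.
    return logp - strength * rank

def beam_search(score_fn: Callable[[float, int], float],
                beam_size: int = 3, steps: int = 4) -> List[Tuple[int, ...]]:
    """Beam search over the toy model, parameterized by the scoring rule."""
    beams: List[Tuple[Tuple[int, ...], float]] = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, total in beams:
            logps = toy_log_probs(prefix)
            ranked = sorted(range(len(logps)), key=lambda v: -logps[v])
            for rank, v in enumerate(ranked):
                candidates.append((prefix + (v,), total + score_fn(logps[v], rank)))
        beams = sorted(candidates, key=lambda c: -c[1])[:beam_size]
    return [prefix for prefix, _ in beams]

print(beam_search(plain_score))    # hypotheses chosen purely by log-probability
print(beam_search(diverse_score))  # hypotheses spread by the sibling-rank penalty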
9 Acknowledgements

This work started while Vikas Raunak was a student at CMU; he is now working as a Research Scientist at Microsoft. We thank Pengcheng Guo, Hirofumi Inaguma, Elizabeth Salesky, Maria Ryskina, Marta Méndez Simón, and Vijay Viswanathan for their helpful discussions during the course of this project. We also thank the anonymous reviewers for their valuable feedback. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (Towns et al., 2014), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system (Nystrom et al., 2015), which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).

The work was supported in part by an AWS Machine Learning Research Award. This research was also supported in part by the DARPA KAIROS program from the Air Force Research Laboratory under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.
References

Antonios Anastasopoulos and David Chiang. 2018. Tied multitask learning for neural speech translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 82–91, New Orleans, Louisiana. Association for Computational Linguistics.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 39–48. IEEE Computer Society.

Parnia Bahar, Tobias Bieschke, Ralf Schlüter, and Hermann Ney. 2021. Tight integrated end-to-end training for cascaded speech translation. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 950–957. IEEE.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Daniel Beck, Trevor Cohn, and Gholamreza Haffari. 2019. Neural speech translation using lattice transformations and graph networks. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pages 26–31, Hong Kong. Association for Computational Linguistics.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2020. On the linguistic representational power of neural machine translation models. Computational Linguistics, 46(1):1–52.

Léon Bottou, Yoshua Bengio, and Yann Le Cun. 1997. Global training of document processing systems using graph transformer networks. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 489–494. IEEE.

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: An embedded spoken language understanding system for private-by-design voice interfaces. In Privacy in Machine Learning and Artificial Intelligence Workshop, ICML.

Siddharth Dalmia, Abdelrahman Mohamed, Mike Lewis, Florian Metze, and Luke Zettlemoyer. 2019. Enforcing encoder-decoder modularity in sequence-to-sequence models. arXiv preprint arXiv:1911.03782.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2012–2017, Minneapolis, Minnesota. Association for Computational Linguistics.

Qianqian Dong, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, and Lei Li. 2020. SDST: Successive decoding for speech-to-text translation. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence.

Xinwei Geng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2018. Adaptive multi-pass decoder for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 523–532, Brussels, Belgium. Association for Computational Linguistics.

Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, et al. 2021. Recent developments on ESPnet toolkit boosted by conformer. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.

Parisa Haghani, Arun Narayanan, Michiel Bacchiani, Galen Chuang, Neeraj Gaur, Pedro Moreno, Rohit Prabhavalkar, Zhongdi Qu, and Austin Waters. 2018. From audio to semantics: Approaches to end-to-end spoken language understanding. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 720–726. IEEE.

Awni Hannun, Vineel Pratap, Jacob Kahn, and Wei-Ning Hsu. 2020. Differentiable weighted finite-state transducers. arXiv preprint arXiv:2010.01003.

Jindřich Helcl, Jindřich Libovický, and Dušan Variš. 2018. CUNI system for the WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 616–623, Brussels, Belgium. Association for Computational Linguistics.

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97.

Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan. 2017. Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. In Proc. Interspeech 2017, pages 949–953.

Ke Hu, Tara N Sainath, Ruoming Pang, and Rohit Prabhavalkar. 2020. Deliberation model based two-pass end-to-end speech recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7799–7803. IEEE.

Liang Huang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 144–151, Prague, Czech Republic. Association for Computational Linguistics.

Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Yalta, Tomoki Hayashi, and Shinji Watanabe. 2020. ESPnet-ST: All-in-one speech translation toolkit. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 302–311. Association for Computational Linguistics.

Patricia Johnson. 1992. Cohesion and coherence in compositions in Malay and English. RELC Journal, 23(2):1–17.

Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4835–4839. IEEE.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 224–227.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.

Gaurav Kumar, Matt Post, Daniel Povey, and Sanjeev Khudanpur. 2014. Some insights from translating conversational telephone speech. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 3231–3235. IEEE.

Shankar Kumar, Yonggang Deng, and William Byrne. 2006. A weighted finite state transducer translation template model for statistical machine translation. Natural Language Engineering, 12(1):35–76.

Chih-Hua Kuo. 1995. Cohesion and coherence in academic writing: From lexical choice to organization. RELC Journal, 26(1):47–62.

Brenden M Lake. 2019. Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems, pages 9791–9801.

Brenden M. Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 2879–2888. PMLR.

Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, and Laurent Besacier. 2020. Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation. In Proceedings of the 28th International Conference on Computational Linguistics.

Alexander H. Levis, Neville Moray, and Baosheng Hu. 1994. Task decomposition and allocation problems and discrete event systems. Automatica, 30(2):203–216.

Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2020a. Synchronous speech recognition and speech-to-text translation with interactive decoding. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 8417–8424.

Zhiyuan Liu, Yankai Lin, and Maosong Sun. 2020b. Compositional Semantics, pages 43–57. Springer Singapore, Singapore.

Clara Meister, Tim Vieira, and Ryan Cotterell. 2020. Best-first beam search. Transactions of the Association for Computational Linguistics, 8:795–809.

Bernd T Meyer, Sri Harish Mallidi, Angel Mario Castro Martinez, Guillermo Payá-Vayá, Hendrik Kayser, and Hynek Hermansky. 2016. Performance monitoring for automatic speech recognition in noisy multi-channel environments. In 2016 IEEE Spoken Language Technology Workshop (SLT), pages 50–56.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1):69–88.

Mathias Müller, Annette Rios, and Rico Sennrich. 2020. Domain robustness in neural machine translation. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas, pages 151–164, Virtual. Association for Machine Translation in the Americas.

Kenton Murray and David Chiang. 2018. Correcting length bias in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 212–223, Brussels, Belgium.

Nicholas A. Nystrom, Michael J. Levine, Ralph Z. Roskies, and J. Ray Scott. 2015. Bridges: A uniquely flexible HPC resource for new communities and data analytics. In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure. Association for Computing Machinery.

Franz Josef Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 295–302.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. Interspeech 2019, pages 2613–2617.

Vijayaditya Peddinti, Guoguo Chen, Vimal Manohar, Tom Ko, Daniel Povey, and Sanjeev Khudanpur. 2015. JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 539–546.

N. Pham, Thai-Son Nguyen, Thanh-Le Ha, J. Hussain, Felix Schneider, J. Niehues, Sebastian Stüker, and A. Waibel. 2019. The IWSLT 2019 KIT speech translation system. In International Workshop on Spoken Language Translation (IWSLT).

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, and Sanjeev Khudanpur. 2013. Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus. In International Workshop on Spoken Language Translation (IWSLT 2013).

Vikas Raunak, Vaibhav Kumar, and Florian Metze. 2019. On compositionality in neural machine translation. In NeurIPS Workshop on Context and Compositionality in Biological and Artificial Neural Systems.

Raj Reddy. 1988. Foundations and grand challenges of artificial intelligence: AAAI presidential address. AI Magazine, 9(4):9–21.

Shruti Rijhwani, Antonios Anastasopoulos, and Graham Neubig. 2020. OCR post correction for endangered language texts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5931–5942, Online. Association for Computational Linguistics.

Tara N Sainath, Ruoming Pang, David Rybach, Yanzhang He, Rohit Prabhavalkar, Wei Li, Mirkó Visontai, Qiao Liang, Trevor Strohman, Yonghui Wu, et al. 2019. Two-pass end-to-end speech recognition. In Proc. Interspeech 2019, pages 2773–2777.

Elizabeth Salesky, Matthias Sperber, and Alan W Black. 2019. Exploring phoneme-level speech representations for end-to-end speech translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1835–1841, Florence, Italy. Association for Computational Linguistics.

Lahiru Samarakoon, Brian Mak, and Albert YS Lam. 2018. Domain adaptation of end-to-end speech recognition in low-resource settings. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 382–388. IEEE.

Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Niko Moritz, and Jonathan Le Roux. 2019. Vectorized beam search for CTC-attention-based speech recognition. In Proc. Interspeech 2019, pages 3825–3829.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas.

Pavel Sountsov and Sunita Sarawagi. 2016. Length bias in encoder decoder models and a case for global conditioning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1516–1525, Austin, Texas. Association for Computational Linguistics.

Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. Transactions of the Association for Computational Linguistics, 7:313–325.

Matthias Sperber and Matthias Paulik. 2020. Speech translation and the end-to-end promise: Taking stock of where we are. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christoph Tillmann and Hermann Ney. 2003. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics, 29(1):97–133.

Shubham Toshniwal, Hao Tang, Liang Lu, and Karen Livescu. 2017. Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition. In Proc. Interspeech 2017, pages 3532–3536.

John Towns, Timothy Cockerill, Maytal Dahan, Ian Foster, Kelly Gaither, Andrew Grimshaw, Victor Hazlewood, Scott Lathrop, Dave Lifka, Gregory D Peterson, et al. 2014. XSEDE: Accelerating scientific discovery. Computing in Science & Engineering, 16(5):62–74.

Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural machine translation with reconstruction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3097–3103. AAAI Press.

Evelyne Tzoukermann and Corey Miller. 2018. Evaluating automatic speech recognition in translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track), pages 294–302, Boston, MA. Association for Machine Translation in the Americas.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 7371–7379. AAAI Press.

Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020a. Fairseq S2T: Fast speech-to-text modeling with fairseq. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (AACL): System Demonstrations, pages 33–39. Association for Computational Linguistics.

Changhan Wang, Anne Wu, and Juan Pino. 2020b. CoVoST 2: A massively multilingual speech-to-text translation corpus. arXiv preprint arXiv:2007.10310.

Chengyi Wang, Yu Wu, Shujie Liu, Zhenglu Yang, and Ming Zhou. 2020c. Bridging the gap between pre-training and fine-tuning for end-to-end speech translation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):9161–9168.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. ESPnet: End-to-end speech processing toolkit. In Proc. Interspeech 2018, pages 2207–2211.

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240–1253.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. In Proc. Interspeech 2017, pages 2625–2629.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306, Austin, Texas. Association for Computational Linguistics.

Yilin Yang, Liang Huang, and Mingbo Ma. 2018. Breaking the beam search curse: A study of (re-)scoring methods and stopping criteria for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3054–3059, Brussels, Belgium. Association for Computational Linguistics.

Chengqi Zhao, Mingxuan Wang, and Lei Li. 2020. NeurST: Neural speech translation toolkit. arXiv preprint arXiv:2012.10018.

Renjie Zheng, Junkun Chen, Mingbo Ma, and Liang Huang. 2021. Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation. arXiv preprint arXiv:2102.05766.