The USTC-NELSLIP Systems for Simultaneous Speech Translation Task at IWSLT 2021

Dan Liu (1,2), Mengge Du (2), Xiaoxi Li (2), Yuchen Hu (1), and Lirong Dai (1)

(1) University of Science and Technology of China, Hefei, China
(2) iFlytek Research, Hefei, China

{danliu, huyuchen}@mail.ustc.edu.cn, lrdai@ustc.edu.cn
{danliu, xxli16, mgdou}@iflytek.com
Abstract

This paper describes USTC-NELSLIP's submissions to the IWSLT 2021 Simultaneous Speech Translation task. We propose a novel simultaneous translation model, the Cross Attention Augmented Transducer (CAAT), which extends the conventional RNN-T to sequence-to-sequence tasks without monotonic constraints, e.g., simultaneous translation. Experiments on speech-to-text (S2T) and text-to-text (T2T) simultaneous translation tasks show that CAAT achieves better quality-latency trade-offs than wait-k, one of the previous state-of-the-art approaches. Based on the CAAT architecture and data augmentation, we build S2T and T2T simultaneous translation systems for this evaluation campaign. Compared to last year's optimal systems, our S2T simultaneous translation system improves by an average of 11.3 BLEU across all latency regimes, and our T2T simultaneous translation system improves by an average of 4.6 BLEU.

1   Introduction

This paper describes the submission to the IWSLT 2021 Simultaneous Speech Translation task by the National Engineering Laboratory for Speech and Language Information Processing (NELSLIP), University of Science and Technology of China.

Recent work on text-to-text simultaneous translation tends to fall into two categories, fixed policy and flexible policy, represented by wait-k (Ma et al., 2019) and monotonic attention (Arivazhagan et al., 2019; Ma et al., 2020b) respectively. The drawback of a fixed policy is that it may introduce too much latency for some sentences and too little for others, while a flexible policy often leads to difficulties in model optimization.

Inspired by RNN-T (Graves, 2012), we aim to optimize the marginal distribution over all expanded paths in simultaneous translation. However, we found it impossible to compute this marginal probability with conventional Attention Encoder-Decoder architectures (Sennrich et al., 2016), including the Transformer (Vaswani et al., 2017), because of the deep coupling between source contexts and target history contexts. To solve this problem, we propose a novel architecture, the Cross Attention Augmented Transducer (CAAT), together with a latency loss function that keeps the CAAT model working at an appropriate latency. With CAAT, the simultaneous translation policy is integrated into the translation model and learned jointly with it.

In this work, we build simultaneous translation systems for both the text-to-text (T2T) and speech-to-text (S2T) tracks. The proposed CAAT architecture significantly outperforms the wait-k baseline (Ma et al., 2019) on both tracks. Besides, we adopt a variety of data augmentation methods: back-translation (Edunov et al., 2018), self-training (Kim and Rush, 2016) and speech synthesis with Tacotron2 (Shen et al., 2018). Combining all of these with model ensembling, we achieve gains of about 11.3 BLEU (S2T) and 4.6 BLEU (T2T) over the best performance of last year.

2   Data

2.1   Statistics and Preprocessing

EN→DE Speech Corpora   The speech datasets used in our experiments are shown in Table 1, where MuST-C, Europarl and CoVoST2 are speech translation corpora (speech, transcription and translation included), and LibriSpeech and TED-LIUM3 are speech recognition corpora (speech and transcription only). After augmenting with speed and echo perturbation, we use Kaldi (Povey et al., 2011) to extract 80-dimensional log-mel filter bank features, computed with a 25 ms window size and a
10 ms window shift; SpecAugment (Park et al., 2019) is applied during the training phase.

Corpus         Segments    Duration (h)
MuST-C         250.9k      448
Europarl       69.5k       155
CoVoST2        854.4k      1090
LibriSpeech    281.2k      960
TED-LIUM3      268.2k      452

Table 1: Statistics of speech corpora.

Text Translation Corpora   The bilingual parallel datasets used for English to German (EN→DE) and English to Japanese (EN→JA) are shown in Table 2, and the monolingual datasets in English, German and Japanese are shown in Table 3. Since the Paracrawl dataset in the EN→DE task is too large for our model training, we randomly select a subset of 14M sentences from it.

For the EN→DE task, we directly use SentencePiece (Kudo and Richardson, 2018) to generate a unigram vocabulary of size 32,000 jointly for the source and target languages. For the EN→JA task, sentences in Japanese are first segmented by MeCab (Kudo, 2006), and then a joint unigram vocabulary of size 32,000 is generated in the same way as for EN→DE.
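As a rough illustration of this vocabulary step, a joint unigram model of size 32,000 can be trained with the SentencePiece Python API along the following lines; the file names are placeholders, not our actual preprocessing scripts.

```python
import sentencepiece as spm

# Train a joint unigram vocabulary of size 32,000 on the concatenated
# source and target sides of the bilingual data (file name is a placeholder).
spm.SentencePieceTrainer.train(
    input="train.en-de.joint.txt",   # one sentence per line, EN and DE mixed
    model_prefix="spm_unigram_32k",
    vocab_size=32000,
    model_type="unigram",
)

# Apply the learned model to tokenize a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="spm_unigram_32k.model")
pieces = sp.encode("Simultaneous translation is challenging.", out_type=str)
print(pieces)
```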
During data preprocessing, the bilingual datasets are first filtered by length (fewer than 1024 tokens) and by a target-to-source length ratio of 0.25 < r < 4. In a second step, using a baseline Transformer model trained only on the bilingual data, we filter out mismatched parallel pairs by their log-likelihood under the baseline model; the threshold is set to −4.0 for the EN→DE task and −5.0 for the EN→JA task. In the end we keep 27.3 million sentence pairs for the EN→DE task and 17.0 million sentence pairs for the EN→JA task.
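A minimal sketch of this two-stage filtering is given below; `score_log_likelihood` is a hypothetical helper standing in for the baseline Transformer scorer, and the exact normalization of the log-likelihood threshold is not specified here.

```python
def filter_bilingual(pairs, score_log_likelihood, ll_threshold=-4.0):
    """Two-stage bilingual filtering as described in Sec. 2.1 (a sketch).

    pairs: iterable of (source_tokens, target_tokens)
    score_log_likelihood: hypothetical helper returning the log-likelihood of a
        pair under the baseline Transformer trained on bilingual data only.
    ll_threshold: -4.0 for EN->DE, -5.0 for EN->JA.
    """
    kept = []
    for src, tgt in pairs:
        # Stage 1: length and length-ratio constraints.
        if len(src) >= 1024 or len(tgt) >= 1024:
            continue
        ratio = len(tgt) / max(len(src), 1)
        if not (0.25 < ratio < 4.0):
            continue
        # Stage 2: drop mismatched pairs as scored by the baseline model.
        if score_log_likelihood(src, tgt) < ll_threshold:
            continue
        kept.append((src, tgt))
    return kept
```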
            Corpus          Sentences
EN→DE       MuST-C(v2)      229.7k
            Europarl        1828.5k
            Rapid-2019      1531.3k
            WIT3-TED        209.5k
            Commoncrawl     2399.1k
            WikiMatrix      6227.2k
            Wikititles      1382.6k
            Paracrawl       82638.2k
EN→JA       WIT3-TED        225.0k
            JESC            2797.4k
            kftt            440.3k
            WikiMatrix      3896.0k
            Wikititles      706.0k
            Paracrawl       10120.0k

Table 2: Statistics of text parallel datasets.

Language    Corpus              Sentences
EN          Europarl-v10        2295.0k
            News-crawl-2019     33600.8k
DE          Europarl-v10        2108.0k
            News-crawl-2020     53674.4k
JA          News-crawl-2019     3446.4k
            News-crawl-2020     10943.3k

Table 3: Statistics of monolingual datasets.

2.2   Data Augmentation

For text-to-text machine translation, augmented data are generated from monolingual corpora in the source and target languages by self-training (He et al., 2019) and back-translation (Edunov et al., 2018) respectively. Statistics of the augmented training data are shown in Table 4.

Data                    EN→DE    EN→JA
Bilingual data          27.3M    17.0M
 +back-translation      34.3M    22.0M
 +self-training         41.3M    27.0M

Table 4: Augmented training data for text-to-text translation.

We further extend these two data augmentation methods to speech-to-text translation, as follows (a small illustrative sketch of the self-training step is given after Table 6):

  1. Self-training: similar in spirit to sequence-level distillation (Kim and Rush, 2016; Ren et al., 2020; Liu et al., 2019). Transcriptions of all speech datasets (both speech recognition and speech translation corpora) are fed to a text translation model to generate text y′ in the target language; the generated y′ and its corresponding speech are added directly to the speech translation dataset.

  2. Speech synthesis: We employ Tacotron2 (Shen et al., 2018), slightly modified by introducing speaker representations into both the encoder and decoder, as our text-to-speech (TTS) model, trained on the MuST-C(v2) speech corpus to generate filter-bank speech representations. We randomly select 4M sentence pairs from the EN→DE text translation corpora and generate audio features by speech synthesis. The generated filter-bank features and their corresponding target-language text are used to expand our speech translation dataset.
The expanded training data are shown in Table 5. Besides, during training for all speech translation tasks, we sample the speech data from the whole corpora with a fixed ratio; the concrete ratio for each dataset is shown in Table 6.

Dataset                 Segments    Duration (h)
Raw S2T dataset         1.17M       1693
 +self-training         2.90M       4799
 +Speech synthesis      7.22M       10424

Table 5: Expanded speech translation dataset with self-training and speech synthesis.

Dataset             Sample Ratio
MuST-C              2
Europarl            1
CoVoST2             1
LibriSpeech         1
TED-LIUM3           2
Speech synthesis    5

Table 6: Sample ratio for different datasets during the training period.
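To make the self-training augmentation in Sec. 2.2 concrete, the following is a minimal sketch of how ASR transcriptions can be turned into extra speech translation pairs; `translate_batch` stands for a hypothetical wrapper around the text translation model and is not part of our actual pipeline code.

```python
def self_train_speech_corpus(asr_corpus, translate_batch, batch_size=64):
    """Sketch of the self-training augmentation in Sec. 2.2.

    asr_corpus: iterable of (speech_features, transcription) pairs, i.e. any
        speech recognition or speech translation corpus from Table 1.
    translate_batch: hypothetical helper that runs the text MT model on a
        list of source transcriptions and returns target-language hypotheses.
    Returns a list of (speech_features, pseudo_translation) training pairs.
    """
    augmented = []
    batch_feats, batch_texts = [], []
    for feats, transcript in asr_corpus:
        batch_feats.append(feats)
        batch_texts.append(transcript)
        if len(batch_texts) == batch_size:
            for f, hyp in zip(batch_feats, translate_batch(batch_texts)):
                augmented.append((f, hyp))  # pair speech with pseudo target y'
            batch_feats, batch_texts = [], []
    if batch_texts:  # flush the last partial batch
        for f, hyp in zip(batch_feats, translate_batch(batch_texts)):
            augmented.append((f, hyp))
    return augmented
```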
3     Methods and Models

3.1   Cross Attention Augmented Transducer

Let x and y denote the source and target sequences, respectively. The policy of simultaneous translation is denoted as an action sequence p ∈ {R, W}^{|x|+|y|}, where R denotes the READ action and W the WRITE action. An equivalent representation of the policy extends the target sequence y to length |x| + |y| with the blank symbol φ, giving ŷ ∈ (v ∪ {φ})^{|x|+|y|}, where v is the vocabulary of the target language. The set of all possible expansions of y is denoted H(x, y).
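For example, with |x| = 3 and y = (y_1, y_2), one expansion in H(x, y) is ŷ = (φ, y_1, φ, y_2, φ), of length |x| + |y| = 5; reading each blank as a READ and each target token as a WRITE, it corresponds to the policy p = (R, W, R, W, R): read x_1, write y_1, read x_2, write y_2, then read x_3.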
Inspired by RNN-T (Graves, 2012), the loss function for simultaneous translation can be defined from the marginal conditional probability and the expectation of a latency metric over all possible expanded paths:

    L(x, y) = L_{nll}(x, y) + L_{latency}(x, y)
            = -\log \sum_{\hat{y}} p(\hat{y} \mid x) + \mathbb{E}_{\hat{y}}[\, l(\hat{y}) \,]
            = -\log \sum_{\hat{y}} p(\hat{y} \mid x) + \sum_{\hat{y}} \Pr(\hat{y} \mid y, x)\, l(\hat{y})        (1)

where Pr(ŷ|y, x) = p(ŷ|x) / Σ_{ŷ′∈H(x,y)} p(ŷ′|x), ŷ ∈ H(x, y) is an expansion of the target sequence y, and l(ŷ) is the latency of the expanded path ŷ.

Figure 1: Expanded paths in simultaneous translation.

However, RNN-T is trained and decoded under a source-target monotonic constraint, which makes it unsuitable for translation. Moreover, the calculation of the marginal probability Σ_{ŷ∈H(x,y)} p(ŷ|x) is impossible in the Attention Encoder-Decoder framework, due to the deep coupling of the source and previous-target representations: as shown in Figure 1, the decoder hidden states for the red path ŷ¹ and the blue path ŷ² are not equal at the intersection, s¹₂ ≠ s²₂. To solve this, we separate the source attention mechanism from the target history representation, similarly to the joiner and predictor in RNN-T. The resulting architecture can be viewed as an extension of RNN-T with an attention-augmented joiner, and is named the Cross Attention Augmented Transducer (CAAT). Figure 2 shows the implementation of CAAT based on the Transformer.

The computation cost of the joiner in CAAT is significantly higher than that of RNN-T. The complexity of the joiner is O(|x| · |y|) during training, i.e., O(|x|) times higher than a conventional Transformer. We address this by making decisions with a decision step size d > 1, which
reduces the complexity of the joiner from O(|x| · |y|) to O(|x| · |y| / d). Besides, to further reduce GPU memory consumption, we split the hidden states into small pieces before they are sent into the joiner, and recombine the outputs afterwards.

Figure 2: Architecture of CAAT based on Transformer.

The latency loss and its gradients can be calculated by the forward-backward algorithm, similar to sequence criterion training in ASR (Povey, 2005).

Finally, we add the cross-entropy loss of the offline translation model as an auxiliary loss for CAAT training, for two reasons: first, we want the CAAT model to fall back to offline translation in the worst case; second, CAAT should behave in accordance with offline translation once the source sentence has ended. The final loss function for CAAT training is defined as follows:

    L(x, y) = L_{CAAT}(x, y) + \lambda_{latency} L_{latency}(x, y) + \lambda_{CE} L_{CE}(x, y)
            = -\log \sum_{\hat{y}} p(\hat{y} \mid x)
              + \lambda_{latency} \sum_{\hat{y}} \Pr(\hat{y} \mid y, x)\, l(\hat{y})
              - \lambda_{CE} \sum_{j} \log p(y_j \mid x, y_{<j})
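As an illustration of how the marginal term −log Σ_ŷ p(ŷ|x) can be computed, below is a minimal NumPy sketch of the standard transducer-style forward recursion over the (decision step, target position) lattice. This is a generic RNN-T-style dynamic program under stated assumptions, not our actual training code; the latency expectation would be obtained from the same lattice with an additional backward pass, as noted above.

```python
import numpy as np

def transducer_log_marginal(log_emit, log_blank):
    """Generic transducer forward recursion (illustrative sketch).

    log_emit[t, u]  : log-prob of writing target token y_{u+1} at lattice
                      node (t, u), for t in [0, T) decision steps (each
                      covering d source frames) and u in [0, U) target positions.
    log_blank[t, u] : log-prob of the blank symbol (a READ, i.e. advance to
                      the next decision step) at node (t, u), u in [0, U].
    Returns log of the sum over all expanded paths, i.e. log p(y | x).
    """
    T, U = log_emit.shape
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by a READ (blank) from (t-1, u)
                alpha[t, u] = np.logaddexp(alpha[t, u],
                                           alpha[t - 1, u] + log_blank[t - 1, u])
            if u > 0:  # arrive by a WRITE of y_u from (t, u-1)
                alpha[t, u] = np.logaddexp(alpha[t, u],
                                           alpha[t, u - 1] + log_emit[t, u - 1])
    # All paths end with a final blank after the last target token.
    return alpha[T - 1, U] + log_blank[T - 1, U]
```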
case of our method with {m = 1, r = 0}. Note that the look-ahead window size in our method is fixed, which ensures that adding Transformer layers does not affect latency.

3.3   Text-to-Text Simultaneous Translation

We implemented both the CAAT system of Sec. 3.1 and wait-k (Ma et al., 2019) systems for text-to-text simultaneous translation; both are implemented based on fairseq (Ott et al., 2019).

All wait-k experiments use parameter settings based on the big Transformer (Vaswani et al., 2017) with unidirectional encoders, which corresponds to a 12-layer encoder and 6-layer decoder Transformer with an embedding size of 1024, a feed-forward network size of 4096, and 16 attention heads.
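For reference, the wait-k baseline's fixed READ/WRITE schedule can be sketched as follows; this is a generic illustration of wait-k (Ma et al., 2019), not our fairseq implementation.

```python
def wait_k_schedule(k, src_len, tgt_len):
    """Generic wait-k read/write schedule (illustrative, not our fairseq code).

    Read k source tokens first, then alternate WRITE/READ; once the source is
    exhausted, write the remaining target tokens.
    Returns a list of 'R'/'W' actions of length src_len + tgt_len.
    """
    actions, read, written = [], 0, 0
    while written < tgt_len:
        if read < min(k + written, src_len):
            actions.append('R')
            read += 1
        else:
            actions.append('W')
            written += 1
    actions.extend('R' * (src_len - read))  # flush any unread source
    return actions

# Example: wait-3 on a 6-token source and 5-token target.
print(wait_k_schedule(3, 6, 5))
# ['R', 'R', 'R', 'W', 'R', 'W', 'R', 'W', 'R', 'W', 'W']
```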
Hyper-parameters of our CAAT model architectures are shown in Table 7. CAAT training requires significantly more GPU memory than the conventional Transformer (Vaswani et al., 2017) because of the O(|x| · |y| / d) complexity of the joiner module. We mitigate this by reducing the joiner hidden dimension when the decision step size d is small.

3.4   Speech-to-Text Simultaneous Translation

3.4.1   End-to-End Systems

The main system for end-to-end speech-to-text simultaneous translation is based on the aforementioned CAAT structure. For the speech encoder, two 2D convolution blocks are introduced before the stack of 24 Transformer encoder layers. Each convolution block consists of a 3-by-3 convolution layer with 64 channels and stride 2, followed by a ReLU activation. The input speech features are thus downsampled 4 times by the convolution blocks and flattened to a 1D sequence before being fed to the Transformer layers (a rough sketch of this front-end is given below). Other hyper-parameters are shown in Table 7. The latency-quality trade-off can be adjusted by varying the decision step size d and the latency scaling factor λ_latency. We submitted the systems with the best performance in each latency regime.
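A minimal PyTorch-style sketch of such a convolutional front-end under the stated configuration (two 3x3 convolution blocks, 64 channels, stride 2, ReLU); the module and variable names are illustrative assumptions, not taken from our code.

```python
import torch
import torch.nn as nn

class ConvFrontend(nn.Module):
    """Two 3x3 conv blocks (64 channels, stride 2) that downsample the
    80-dim filter-bank input 4x in time before the Transformer encoder.
    Illustrative sketch only."""

    def __init__(self, n_mels=80, channels=64, d_model=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Flatten (channels x reduced-frequency) into one feature vector per frame.
        self.proj = nn.Linear(channels * ((n_mels + 3) // 4), d_model)

    def forward(self, feats):                       # feats: (batch, time, n_mels)
        x = self.conv(feats.unsqueeze(1))           # (batch, C, ~time/4, ~n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.proj(x)                         # (batch, ~time/4, d_model)
```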

3.4.2   Cascaded Systems

The cascaded system consists of two modules, simultaneous automatic speech recognition (ASR) and simultaneous text-to-text machine translation (MT). Both the simultaneous ASR and MT systems are built with the CAAT proposed in Sec. 3.1. We found that the cascaded system outperforms the end-to-end system in the medium and high latency regimes.

3.5   Unsegmented Data Processing

To deal with unsegmented data, we segment the input text based on sentence-ending marks for the T2T track. For the S2T track, the input speech is simply segmented into utterances of 20 seconds, and each segment is sent directly to our simultaneous translation systems to obtain the streaming results (a sketch of this segmentation is given below). We found an abnormally large average lagging (AL) on the IWSLT tst2018 test set with the existing SimulEval toolkit (Ma et al., 2020a) and this segmentation strategy, so the relevant results are not presented here. A more reasonable latency criterion may be needed for unsegmented data in the future.
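A minimal sketch of the fixed 20-second segmentation for the S2T track, assuming 16 kHz single-channel waveforms; the function name and sampling rate are illustrative assumptions.

```python
def segment_waveform(samples, sample_rate=16000, chunk_seconds=20.0):
    """Split an unsegmented waveform into fixed 20-second utterances.

    samples: 1D sequence of audio samples (assumed 16 kHz mono).
    Yields consecutive chunks; the last chunk may be shorter.
    """
    chunk = int(sample_rate * chunk_seconds)
    for start in range(0, len(samples), chunk):
        yield samples[start:start + chunk]
```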
4     Experiments

4.1   Effectiveness of CAAT

To demonstrate the effectiveness of the CAAT architecture, we compare it to wait-k with speculative beam search (SBS) (Ma et al., 2019; Zheng et al., 2019b), one of the previous state-of-the-art approaches. The latency-quality trade-off curves on the S2T and T2T tasks are shown in Figure 3 and Figure 4(a). CAAT significantly outperforms wait-k with SBS, especially in the low-latency region (AL < 1000 ms for the S2T track and AL < 3 for the T2T track).

Figure 3: Comparison of CAAT and wait-k with SBS systems on EN→DE speech-to-text simultaneous translation.

4.2   Effectiveness of Data Augmentation

To test the effectiveness of data augmentation, we compare the results of the different data augmentation methods on the offline and simultaneous speech translation tasks. As demonstrated in Table 8, adding newly generated target sentences to the training corpora by self-training gives a boost of nearly 7 BLEU points, and speech synthesis provides a further increase of 1.5 BLEU points on MuST-C(v2) tst-COMMON. As illustrated in Figure 3, with all data augmentation methods employed, they provide nearly 3 BLEU points on average in the simultaneous task across the different latency regimes. Note that our data augmentation methods alleviate the scarcity of parallel data in end-to-end speech translation and yield a significant improvement.

Dataset                     BLEU
Original speech corpora     21.24
 +self-training             28.21
 +Speech synthesis          29.72

Table 8: Performance of offline speech translation on MuST-C(v2) tst-COMMON with different datasets.

Parameters                          S2T config    T2T config-A    T2T config-B
Encoder     layers                  24            12              12
            attention heads         8             16              16
            FFN dimension           2048          4096            4096
            embedding size          512           1024            1024
Predictor   attention heads         8             16              16
            FFN dimension           2048          4096            4096
            embedding size          512           1024            1024
            output dimension        512           512             1024
Joiner      attention heads         8             8               16
            FFN dimension           1024          2048            4096
            embedding size          512           512             1024
/           decision step size      {16,64}       {4,10,16,32}    {10,32}
            latency scaling factor  {1.0,0.2}     {1.0,0.2}       0.2

Table 7: Parameters of CAAT in T2T and end-to-end S2T simultaneous translation. Note that both the predictor and the joiner have 6 layers for the T2T and S2T tasks, and that the two additional parameters for end-to-end S2T simultaneous translation, the main context and right context described in Sec. 3.2, are set to m = 32 and r = 16.

4.3   Text-to-Text Simultaneous Translation

EN→DE Task   The performance on the text-to-text EN→DE task is shown in Figure 4(a). The proposed CAAT is always better than both wait-k with SBS and the best results from ON-TRAC in 2020 (Elbayad et al., 2020), especially in the low latency regime, and the performance of CAAT with model ensembling is nearly equivalent to the offline result. Moreover, Figure 4(a) shows that the model ensemble also improves the BLEU score to some extent across the latency regimes, and the increase is quite obvious in the low latency regime. Compared with the best result in 2020, we finally obtain improvements of 6.8 and 3.4 BLEU in the low and high latency regimes respectively.

EN→JA Task   Results on the text-to-text simultaneous translation (EN→JA) track are plotted in Figure 4(b), where the curve named CAAT_bst is the best performance on this track with or without model ensembling. The curves in this sub-figure support the same conclusion as the previous subsection: the proposed CAAT significantly outperforms wait-k with SBS. The gap between CAAT and the offline system is more noticeable here (nearly 0.4 BLEU), mainly because the joiner block for the EN→JA track in the high-latency regime is much smaller than that for the EN→DE track, due to the unstable EN→JA training.

4.4   Speech-to-Text Simultaneous Translation

End-to-End System   In this section we discuss the final results of the end-to-end system based on CAAT. We tune the decision step size d and the latency scaling factor λ_latency to meet the requirements of the different latency regimes. For the low, medium and high latency regimes, d and λ_latency are set to (16, 64, 64) and (1.0, 1.0, 0.2) respectively. Our final latency-quality trade-offs are shown in Figure 5.
Figure 4: Latency-quality trade-offs of text-to-text simultaneous translation. (a) tst-COMMON(v2) on EN→DE; (b) IWSLT dev2021 on EN→JA.

Combined with our data augmentation methods and the new CAAT model structure, our single-model system already outperforms the best results of last year in all latency regimes and provides a 9.8 BLEU increase on average. Ensembling different models further boosts the BLEU scores by roughly 0.5-1.5 points at the different latency regimes.

Cascaded System   Under the cascaded setting, we paired two well-trained ASR and MT systems, where the ASR system has a WER of 6.30 at 1720.20 AL, and the MT system follows config-A in Table 7, with 34.79 BLEU at 5.93 AL. We found the best medium- and high-latency systems at decision step size pairs (d_asr, d_mt) of (6, 10) and (12, 10) respectively. The performance of the cascaded systems is shown in Figure 5. Note that under the current configuration of the ASR and MT systems we cannot provide valid results that satisfy the AL requirement of the low latency regime, since a cascaded system usually has a larger latency than an end-to-end system. During online decoding of the cascaded system, the translation model can only translate tokens after they have been recognized by the ASR system; the ASR output is therefore already delayed with respect to the audio, and the two-step decoding accumulates further delay, which leads to the higher latency compared to the end-to-end system. Nevertheless, the cascaded system has significant advantages over the end-to-end system in the medium and high latency regimes, and the end-to-end system still has a long way to go in simultaneous speech translation.

Figure 5: Latency-quality trade-offs of speech-to-text simultaneous translation on MuST-C(v2) tst-COMMON.

5   Related Work
Simultaneous Translation   Recent work on simultaneous translation falls into two categories. The first category uses a fixed policy for the READ/WRITE actions and can thus be easily integrated into the training stage, as typified by wait-k approaches (Ma et al., 2019). The second category includes models with a flexible policy, learned and/or adapted to the current context, e.g., by reinforcement learning (Gu et al., 2017), supervised learning (Zheng et al., 2019a) and so on. A special sub-category of flexible policy jointly optimizes the policy and the translation model through monotonic attention customized to the translation model, e.g., Monotonic Infinite Lookback (MILk) attention (Arivazhagan et al., 2019) and Monotonic Multihead Attention (MMA) (Ma et al., 2020b). We propose a novel method to optimize the policy and translation model jointly, motivated by RNN-T (Graves, 2012) in online ASR. Unlike RNN-T, the CAAT model removes the monotonic constraint, which is critical for handling reordering in machine translation. The optimization of our latency loss is motivated by sequence discriminative training in ASR (Povey, 2005).

Data Augmentation   As described in Sec. 2, the amount of training data for speech translation is significantly smaller than that for text-to-text machine translation, which is the main bottleneck for improving speech translation performance. Self-training, or sequence-level knowledge distillation from a text-to-text machine translation model, is the most effective way to exploit the large ASR training data (Liu et al., 2019; Pino et al., 2020). On the other hand, synthesizing data with text-to-speech (TTS) has been demonstrated to be effective for low-resource speech recognition (Gokay and Yalcin, 2019; Ren et al., 2019). To the best of our knowledge, this is the first work to augment data by TTS for simultaneous speech-to-text translation.

6   Conclusion

In this paper, we propose a novel simultaneous translation architecture, the Cross Attention Augmented Transducer (CAAT), which significantly outperforms wait-k on both the S2T and T2T simultaneous translation tasks. Based on the CAAT architecture and data augmentation, we build simultaneous translation systems for the text-to-text and speech-to-text tracks. We also build a cascaded speech-to-text simultaneous translation system for comparison. Both the T2T and S2T systems achieve significant improvements over last year's best-performing systems.

References

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. 2019. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1313–1323, Florence, Italy. Association for Computational Linguistics.

Linhao Dong, Feng Wang, and Bo Xu. 2019. Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019, pages 5656–5660. IEEE.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.

Maha Elbayad, Ha Nguyen, Fethi Bougares, Natalia Tomashenko, Antoine Caubrière, Benjamin Lecouteux, Yannick Estève, and Laurent Besacier. 2020. ON-TRAC consortium for end-to-end and simultaneous speech translation challenge tasks at IWSLT 2020. arXiv preprint arXiv:2005.11861.

Ramazan Gokay and Hulya Yalcin. 2019. Improving low resource Turkish speech recognition with data augmentation and TTS. In 2019 16th International Multi-Conference on Systems, Signals & Devices (SSD), pages 357–360. IEEE.

Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1053–1062, Valencia, Spain. Association for Computational Linguistics.

Junxian He, Jiatao Gu, Jiajun Shen, and Marc'Aurelio Ranzato. 2019. Revisiting self-training for neural sequence generation. arXiv preprint arXiv:1909.13788.

Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.

Taku Kudo. 2006. MeCab: Yet another part-of-speech and morphological analyzer. http://mecab.sourceforge.jp.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-end speech translation with knowledge distillation.
Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020a. SIMULEVAL: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 144–150, Online. Association for Computational Linguistics.

Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, and Jiatao Gu. 2020b. Monotonic multihead attention. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, and Yun Tang. 2020. Self-training for end-to-end speech translation. arXiv preprint arXiv:2006.02490.

Daniel Povey. 2005. Discriminative training for large vocabulary speech recognition. Ph.D. thesis, University of Cambridge.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.

Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2020. SimulSpeech: End-to-end simultaneous speech to text translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3787–3796, Online. Association for Computational Linguistics.

Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Almost unsupervised text to speech and automatic speech recognition. In International Conference on Machine Learning, pages 5410–5419. PMLR.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, and Frank Zhang. 2020. Streaming transformer-based acoustic models using self-attention with augmented memory. arXiv preprint arXiv:2005.08042.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019a. Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5816–5822, Florence, Italy. Association for Computational Linguistics.

Renjie Zheng, Mingbo Ma, Baigong Zheng, and Liang Huang. 2019b. Speculative beam search for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1395–1402, Hong Kong, China. Association for Computational Linguistics.