The USTC-NELSLIP Systems for Simultaneous Speech Translation Task at IWSLT 2021
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
The USTC-NELSLIP Systems for Simultaneous Speech Translation Task
at IWSLT 2021
Dan Liu1,2 , Mengge Du2 , Xiaoxi Li2 , Yuchen Hu1 , and Lirong Dai 1
1
University of Science and Technology of China, Hefei, China
2
iFlytek Research, Hefei, China
{danliu, huyuchen}@mail.ustc.edu.cn
lrdai@ustc.edu.cn
{danliu, xxli16,mgdou}@iflytek.com
Abstract marginal probability based on conventional Atten-
This paper describes USTC-NELSLIP’s sub-
tion Encoder-Decoder (Sennrich et al., 2016) ar-
missions to the IWSLT2021 Simultaneous chitectures (Transformer (Vaswani et al., 2017) in-
Speech Translation task. We proposed a novel cluded), which is due to the deep coupling between
simultaneous translation model, Cross Atten- source contexts and target history contexts. To
tion Augmented Transducer (CAAT), which solve this problem, we propose a novel architecture,
extends conventional RNN-T to sequence-to- Cross Attention augmented Transducer (CAAT),
sequence tasks without monotonic constraints, and a latency loss function to ensure CAAT model
e.g., simultaneous translation. Experiments
works with an appropriate latency. In simultane-
on speech-to-text (S2T) and text-to-text (T2T)
simultaneous translation tasks shows CAAT ous translation, policy is integrated into translation
achieves better quality-latency trade-offs com- model and learned jointly for CAAT model.
pared to wait-k, one of the previous state-of- In this work, we build simultaneous translation
the-art approaches. Based on CAAT architec- systems for both text-to-text (T2T) and speech-
ture and data augmentation, we build S2T and to-text S2T) task. We propose a novel archi-
T2T simultaneous translation systems in this tecture, Cross Attention Augmented Transducer
evaluation campaign. Compared to last year’s
(CAAT), which significantly outperforms wait-k
optimal systems, our S2T simultaneous trans-
lation system improves by an average of 11.3 (Ma et al., 2019) baseline in both text-to-text and
BLEU for all latency regimes, and our T2T si- speech-to-text simultaneous translation task. Be-
multaneous translation system improves by an sides, we adopt a variety of data augmentation
average of 4.6 BLEU. methods, back-translation (Edunov et al., 2018),
Self-training (Kim and Rush, 2016) and speech
1 Introduction
synthesis with Tacotron2 (Shen et al., 2018). Com-
This paper describes the submission to IWSLT bining all of these and models ensembling, we
2021 Simultaneous Speech Translation task by Na- achieved about 11.3 BLEU (in S2T task) and 4.6
tional Engineering Laboratory for Speech and Lan- BLEU (in T2T task) gains compared to the best
guage Information Processing (NELSLIP), Univer- performance last year.
sity of Science and Technology of China, China.
Recent work in text-to-text simultaneous transla- 2 Data
tion tends to fall into two categories, fixed policy
and flexible policy, represented by wait-k (Ma et al., 2.1 Statistics and Preprocessing
2019) and monotonic attention (Arivazhagan et al., EN→DE Speech Corpora The speech datasets
2019; Ma et al., 2020b) respectively. The draw- used in our experiments are shown in Table 1,
back of fixed policy is that it may introduce over where MuST-C, Europarl and CoVoST2 are speech
latency for some sentences and under latency for translation specific (speech, transcription and trans-
others. Meanwhile, flexible policy often leads to lation included), and LibriSpeech, TED-LIUM3
difficulties in model optimization. are speech recognition specific (only speech and
Inspired by RNN-T (Graves, 2012), we aim transcription). After augmented with speed and
to optimize the marginal distribution of all ex- echo perturbation, we use Kaldi (Povey et al., 2011)
panded paths in simultaneous translation. How- to extract 80 dimensional log-mel filter bank fea-
ever, we found it’s impossible to calculate the tures, computed with a 25ms window size and a
30
Proceedings of the 18th International Conference on Spoken Language Translation, pages 30–38
Bangkok, Thailand (Online), August 5–6, 2021. ©2021 Association for Computational Linguistics10ms window shift, and specAugment (Park et al., For EN→DE task, we directly use Sentence-
2019) were performed during training phase. Piece (Kudo and Richardson, 2018) to generate
a unigram vocabulary of size 32,000 for source
Corpus Segments Duration(h) and target language jointly. And for EN→JA task,
sentences in Japanese are firstly participled by
MuST-C 250.9k 448
MeCab (Kudo, 2006), and then a unigram vocab-
Europarl 69.5k 155
ulary of size 32,000 is generated for source and
CoVoST2 854.4k 1090
target jointly similar to EN→DE task.
LibriSpeech 281.2k 960
During data preprocessing, the bilingual datasets
TED-LIUM3 268.2k 452
are firstly filtered by length less than 1024 and
Table 1: Statistics of speech corpora. length ratio of target to source 0.25 < r < 4. In
the second step, with a baseline Transformer model
trained with only bilingual data, we filtered the
Text Translation Corpora The bilingual paral- mismatched parallel pairs with log-likelihood from
lel datasets for Englith to German(EN→DE) and the baseline model, threshold is set to −4.0 for
English to Japanese (EN→JA) used are shown in EN→DE task and −5.0 for EN→JA task. At last
Table 2, and the monolingual datasets in English, we keep 27.3 million sentence pairs for EN-DE
German and Japanese are shown in Table 3. And task and 17.0 sentence pairs for EN→JA task.
we found the Paracrawl dataset in EN→DE task is
too big to our model training, we randomly select 2.2 Data Augmentation
a subset of 14M sentences from it. For text-to-text machine translation, augmented
data from monolingual corpora in source and target
Corpus Sentences language are generated by self-training (He et al.,
MuST-C(v2) 229.7k 2019) and back translation (Edunov et al., 2018)
Europarl 1828.5k respectively. Statistics of the augmented training
Rapid-2019 1531.3k data are shown in Table 4.
WIT3-TED 209.5k
EN→DE Data EN→DE EN→JA
Commoncrawl 2399.1k
WikiMatrix 6227.2k Bilingual data 27.3M 17.0M
Wikititles 1382.6k +back-translation 34.3M 22.0M
Paracrawl 82638.2k +self-training 41.3M 27.0M
WIT3-TED 225.0k Table 4: Augmented training data for text-to-text trans-
JESC 2797.4k lation.
kftt 440.3k We further extend these two data augmentation
EN→JA
WikiMatrix 3896.0k methods to speech-to-text translation, detailed as:
Wikititles 706.0k
Paracrawl 10120.0k 1. Self-training: Maybe similar to sequence-
level distillation (Kim and Rush, 2016; Ren
Table 2: Statistics of text parallel datasets. et al., 2020; Liu et al., 2019). Transcriptions
of all speech datasets (both speech recogni-
tion and speech translation specific) are sent
0
Language Corpus Sentences to a text translation model to generate text y
0
in target language, the generated y with its
Europarl-v10 2295.0k corresponding speech are directly added to
EN
News-crawl-2019 33600.8k speech translation dataset.
Europarl-v10 2108.0k
DE 2. Speech Synthesis: We employ Tacotron2
News-crawl-2020 53674.4k
(Shen et al., 2018) with slightly modified by
News-crawl-2019 3446.4k introducing speaker representations to both en-
JA
News-crawl-2020 10943.3k coder and decoder as our text-to-speech (TTS)
model architecture, and trained on MuST-
Table 3: Statistics of monolingual datasets. C(v2) speech corpora to generate filter-bank
31speech representations. We randomly select tion of latency metric through all possible expanded
4M sentence pairs from EN→DE text trans- paths:
lation corpora and generate audio feature by
speech synthesis. The generated filter bank L(x, y) = Lnll (x, y) + Llatency (x, y)
features and their corresponding target lan- X
guage text are used to expand our speech trans- = − log p(ŷ|x) + Eŷ l(ŷ)
lation dataset. ŷ (1)
X X
= − log p(ŷ|x) + Pr(ŷ|y, x)l(ŷ)
The expanded training data are shown in Table 5. ŷ ŷ
Besides, during the training period for all the
speech translation tasks, we sample the speech data
from the whole corpora with fixed ratio and the
concrete ratio for different dataset is shown in Ta-
ble 6.
Dataset Segements Duration(h)
Raw S2T dataset 1.17M 1693
+self-training 2.90M 4799
+Speech synthesis 7.22M 10424
Table 5: Expanded speech translation dataset with self-
training and speech synthesis.
Figure 1: Expanded paths in simultaneous translation.
Dataset Sample Ratio p(ŷ|x)
Where Pr(ŷ|y, x) = P
p(ŷ 0 |x)
, and ŷ ∈
0
MuST-C 2 ŷ ∈H(x,y)
Europarl 1 H(x, y) is an expansion of target sequence y, and
CoVoST2 1 l(ŷ is the latency of expanded path ŷ.
LibriSpeech 1 However, RNN-T is trained and inferenced
TED-LIUM3 2 based on source-target monotonic constraint,
Speech synthesis 5 which means it isn’t suitable for translation
task.
P And the calculation of marginal probabil-
Table 6: Sample ratio for different datasets during train- ity ŷ∈H(x,y) Pr(ŷ|x) is impossible for Attention
ing period. Encoder-Decoder framework due to deep coupling
of source and previous target representation. As
shown in Figure 1, the decoder hidden states for
3 Methods and Models the red path ŷ 1 and the blue path ŷ 2 is not equal at
the intersection s12 6= s22 . To solve this, we sepa-
3.1 Cross Attention Augmented Transducer
rate the source attention mechanism from the target
Let x and y denote the source and target se- history representation, which is similar to joiner
quence, respectively. The policy of simultane- and predictor in RNN-T. The novel architecture
ous translation is denoted as an action sequence can be viewed as a extension version of RNN-T
p ∈ {R, W }|x|+|y| where R denotes the READ with attention mechanism augmented joiner, and is
action and W the WRITE action. Another repre- named as Cross Attention Augmented Transducer
sentation of policy is extending target sequence (CAAT). Figure 2 is the implementation of RAAT
y to length |x| + |y| with blank symbol φ as based on Transformer.
ŷ ∈ (v ∪ {φ})|x|+|y| , where v is the vocabulary Computation cost of joiner in CAAT is signif-
of the target language. The mapping from y to sets icantly more expensive than that of RNN-T. The
of all possible expansion ŷ denotes as H(x, y). complexity of joiner is O(|x| · |y|) during train-
Inspired by RNN-T (Graves, 2012), the loss func- ing, which means O(|x|) times higher than conven-
tion for simultaneous translation can be defined as tional Transformer. We solve this problem by mak-
the marginal conditional probability and expecta- ing decisions with decision step size d > 1, and
32output Latency loss and its gradients can be calculated
by the forward-backward algorithm, similar to Se-
Add & Norm
quence Criterion Training in ASR (Povey, 2005).
Feed
Forward At last, we add the cross entropy loss of offline
translation model as an auxiliary loss to CAAT
Add & Norm
Add & Norm Add & Norm
model training for two reasons. First we hope the
Cross
Feed Feed
Forward
Attention
Forward
CAAT model fall back to offline translation in the
worst case; second, CAAT models is carried out
Add & Norm Add & Norm
in accordance with offline translation when source
Self Self
Attention Attention
sentence ended. The final loss function for CAAT
training is defined as follows:
x y(shift right)
L(x, y) = LCAAT (x, y) + λlatency Llatency (x, y)
Figure 2: Architecture of CAAT based on Transformer. + λCE LCE (x, y)
X
= − log p(ŷ|x)
reduce the complexity of joiner from O(|x| · |y|) ŷ
to O(|x|·|y|)
X
d . Besides, to further reduce video mem- + λlatency Pr(ŷ|y, x)d(ŷ)
ory consumption, we split hidden states into small ŷ
pieces before sent into joiner, and recombine it for
X
− λCE log p(yj |x, ycase of our method with {m = 1, r = 0}. Note 3.5 Unsegmented Data Processing
that the look-ahead window size in our method is To deal with unsegmented data, we segment the
fixed, which ensures increasing transformer layers input text based on sentence ending marks for T2T
won’t affect latency. track. For S2T task, input speech is simply seg-
mented into utterances with duration of 20 sec-
3.3 Text-to-Text Simultaneous Translation
onds and each segmented piece is directly sent to
We implemented both CAAT in Sec. 3.1 and wait-k our simultaneous translation systems to obtain the
(Ma et al., 2019) systems for text-to-text simulta- streaming results. We found an abnormally large
neous translation, both of them are implemented average lagging (AL) on IWSLT tst2018 test set
based on fairseq (Ott et al., 2019). based on existed SimuEval toolkit(Ma et al., 2020a)
All of wait-k experiments use the parameter set- and segment strategy, so relevant results are not pre-
tings based on big transformer (Vaswani et al., sented here. A more reasonable latency criterion
2017) with unidirectional encoders, which corre- may be needed for unsegmented data in the future.
sponds to a 12-layer encoder and 6-layer decoder
transformer with a embedding size of 1024, a feed 4 Experiments
forward network size of 4096, and 16 heads atten-
4.1 Effectiveness of CAAT
tion.
Hyper-parameters of our CAAT model architec- To demonstrate the effectiveness of CAAT architec-
tures are shown in Table 7. CAAT training re- ture, we compare it to wait-k with speculative beam
quires significantly more GPU memory than con- search (SBS) (Ma et al., 2019; Zheng et al., 2019b),
ventional
Transformer
(Vaswani et al., 2017), for one of the previous state-of-the-art. The latency-
|x|·|y| quality trade-off curves on S2T and T2T tasks are
the O d complexity of joiner module. We
shown in Figure 3 and Figure 4(a). We can find that
mitigate this problem by reducing joiner hidden
CAAT significantly outperforms wait-k with SBS,
dimension for lower decision step size d.
especially in low latency section(AL < 1000ms
3.4 Speech-to-Text Simultaneous Translation for S2T track and AL < 3 for T2T track).
3.4.1 End-to-End Systems 27
The main system of End-to-End Speech-to-Text 26
simultaneous Translation is based on the aforemen- 25
tioned CAAT structure. For speech encoder, two 24
2D convolution blocks are introduced before the
23
BLEU
stacked 24 Transformer encoder layers. Each con-
volution block consists of a 3-by-3 convolution 22
layer with 64 channels and stride size as 2, and a 21
ReLU activation function. Input speech features are 20 wait-k_SBS
downsampled 4 times by convolution blocks and CAAT
19
flattened to 1D sequence as input to transformer lay- 18
ers. Other hyper-parameters are shown in Table 7. 500 1000 1500 2000 2500 3000
AL
The latency-quality trade-off may be adjusted by
varying the decision step size d and the latency Figure 3: Comparison of CAAT and wait-k with
scaling factor λlatency . We submitted systems with SBS systems on EN→DE Speech-to-Text simultane-
best performance in each latency region. ous translation.
3.4.2 Cascaded Systems
The cascaded system consists of two modules, si- 4.2 Effectiveness of data augmentation
multaneous automatic speech recognition (ASR) In order to testify the effectiveness of data augmen-
and simultaneous text-to-text Machine Translation tation, we compare the results of different data aug-
(MT). Both simultaneous ASR and MT system are mentation methods based on the offline and simulta-
built with CAAT proposed in Sec. 3.1. And we neous speech translation task. As demonstrated in
found the cascaded systems outperforms end-to- Table 8, adding new generated target sentences into
end system in medium and high latency region. the training corpora by using Self-training gives
34Parameters S2T config T2T config-A T2T config-B
layers 24 12 12
attention heads 8 16 16
Encoder
FFN dimension 2048 4096 4096
embedding size 512 1024 1024
attention heads 8 16 16
FFN dimension 2048 4096 4096
Predictor embedding size 512 1024 1024
output dimension 512 512 1024
attention heads 8 8 16
FFN dimension 1024 2048 4096
Joiner
embedding size 512 512 1024
decision step size {16,64} {4,10,16,32} {10,32}
/
latency scaling factor {1.0,0.2} {1.0,0.2} 0.2
Table 7: Parameters of CAAT in T2T and end-to-end S2T simultaneous translation. Noted that both predictor and
joiner have 6 layers for T2T and S2T tasks, and the additional two parameters for end-to-end 2T simultaneous
translation, which is the main context and right context described in Sec.3.2, are set m = 32 and r = 16 .
Dataset BLEU more or less under different latency regimes, and
the increase is quite obvious in low latency regime.
Original speech corpora 21.24
Compared with the best result in 2020, we finally
+self-training 28.21
get improvement by 6.8 and 3.4 BLEU in low and
+Speech systhesis 29.72
high latency regime respectively.
Table 8: Performance of offline speech translation on En→JA Task Results of Text-to-Text simultane-
MuST-C(v2) tst-COMMON with different datasets.
ous translation (EN→JA) track are plotted in Fig-
ure 4(b), where the curve naming CAAT bst is best
a boost of nearly 7 BLEU points and speech syn- performances in this track with or without model-
thesis provides the other 1.5 BLEU points increase ensembling method. Curves in this sub-figure show
on MuST-C(v2) tst-COMMON. As illustrated in the similar conclusion to the former subsection,
Figure 3, all the data augmentation methods are that the result of proposed CAAT significantly out-
employed and provide nearly 3 BLEU points on performs that of wait-k with SBS. While we can
average in the simultaneous task at different la- also find that the gap between CAAT and offline is
tency regimes. Note that our data augmentation more obvious (nearly 0.4 BLEU), this is mainly be-
methods alleviate the scarcity of parallel datasets cause parameters of joiner block for EN→JA track
in the End-to-End speech translation task and make in high-latency regime is reduced a lot from that
a significant improvement. for EN→DE track, due to the unstable EN→JA
training.
4.3 Text-to-Text Simultaneous Translation
EN→DE Task The performances of text-to-text 4.4 Speech-to-Text Simultaneous Translation
EN→DE task is shown in Figure 4(a). We can End-to-End System In this section, we discuss
see that the performance of proposed CAAT is al- about our final results of End-to-End system based
ways better than that of wait-k with SBS and the on CAAT. We tune the decision step size d and
best results from ON-TRAC in 2020 (Elbayad latency scaling factor λlatency to meet different la-
et al., 2020), especially in low latency regime, and tency regime requirements. For low, medium and
the performance of CAAT with model ensemble high latency, the corresponding d and λlatency are
is nearly equivalent to offline result. Moreover, it set to (16,64,64) and (1.0,1.0,0.2) respectively. We
can be further noticed from Figure 4(a) that the show our final latency-quality trade-offs in Figure 5.
model ensemble can also improve the BLUE score Combined with our data augmentation methods and
3536 18
17
34
16
32
BLEU
BLEU
15
30
offline
CAAT_ens 14
CAAT
28 wait-k_SBS_ens offline
ON-TRAC_2020 13 CAAT_bst
wait-k_SBS_ens
26 12
2 3 4 5 6 7 8 9 10 11 12 13 2 3 4 5 6 7 8 9 10 11 12
AL AL
(a) tst-COMMON(v2) on EN→DE (b) IWSLT dev2021 on EN→JA
Figure 4: Latency-quality trade-offs of Text-to-Text simultaneous translation.
new CAAT model structure, it can be seen that our translation task.
single model system has already outperformed the
best results of last year in all latency regimes and 30.0
provides 9.8 BLEU scores increase on average. En-
27.5
sembling different models can further boost the
25.0
BLEU scores by roughly 0.5-1.5 points at different
latency regimes. 22.5
BLEU
20.0
Cascaded System Under the cascaded setting,
17.5
we paired two well-trained ASR and MT systems, CAAT
where the WER of ASR system’s performance is 15.0 CAAT_ens
Cascaded System
6.30 with 1720.20 AL, and the MT system is fol- 12.5 Best_results_2020
lowed by the config-A in Table 7, whose results 10.0
500 1000 1500 2000 2500 3000 3500 4000
are 34.79 BLEU and 5.93 AL. We found the best AL
medium and high-latency systems at decision step
size pair (dasr , dmt ) with (6, 10) and (12, 10) re- Figure 5: Latency-quality trade-offs of Speech-to-
Text simultaneous translation on MuST-C(v2) tst-
spectively. Performance of cascaded systems are
COMMON.
shown in Figure 5. Note that under current con-
figuration of ASR and MT systems, we can not
provide valid results that satisfy the requirement of
5 Related Work
AL at low latency regime since cascaded system
usually has a larger latency compared to End-to Simultaneous Translation Recent work on si-
End system. During the online decoding of the multaneous translation falls into two categories.
cascaded system, only after specific tokens are rec- The first category uses a fixed policy for the
ognized by the ASR system, the translation model READ/WRITE actions and can thus be easily inte-
can further translate them to obtain the final result. grated into the training stage, as typified by wait-
The decoded results from ASR model first has a k approaches (Ma et al., 2019).The second cate-
delay compared to the actual contents of the audio, gory includes models with a flexible policy learned
and the two-steps decoding further accumulates the and/or adaptive to current context, e.g., by Rein-
delay, which contributes to the higher latency com- forcement Learning (Gu et al., 2017), Supervise
pared to the End-to-End system. However, it still Learning (Zheng et al., 2019a) and so on. A special
can be seen that cascaded system has significant sub-category of flexible policy jointly optimizes
advantages over End-to-End system at medium and policy and translation by monotonic attention cus-
high latency regime and it still has a long way to go tomized to translation model, e.g., Monotonic Infi-
for End-to-End system in the simultaneous speech nite Lookback (MILk) attention (Arivazhagan et al.,
362019) and Monotonic Multihead Attention (MMA) Linhao Dong, Feng Wang, and Bo Xu. 2019. Self-
(Ma et al., 2020b). We propose a novel method attention aligner: A latency-control end-to-end
model for ASR using self-attention network and
to optimize policy and translation model jointly,
chunk-hopping. In IEEE International Confer-
which is motivated by RNN-T (Graves, 2012) in ence on Acoustics, Speech and Signal Processing,
online ASR. Unlike RNN-T, the CAAT model re- ICASSP 2019, Brighton, United Kingdom, May 12-
moves the monotonic constraint, which is critical 17, 2019, pages 5656–5660. IEEE.
for considering reordering in machine translation Sergey Edunov, Myle Ott, Michael Auli, and David
tasks. The optimization of our latency loss is moti- Grangier. 2018. Understanding back-translation at
vated by Sequence Discriminative Training in ASR scale. arXiv preprint arXiv:1808.09381.
(Povey, 2005). Maha Elbayad, Ha Nguyen, Fethi Bougares, Natalia
Tomashenko, Antoine Caubrière, Benjamin Lecou-
Data Augmentation As described in Sec. 2, the teux, Yannick Estève, and Laurent Besacier. 2020.
size of training data for speech translation is sig- On-trac consortium for end-to-end and simultane-
nificantly smaller than that of text-to-text machine ous speech translation challenge tasks at iwslt 2020.
arXiv preprint arXiv:2005.11861.
translation, which is the main bottleneck to im-
prove the performance of speech translation. Self- Ramazan Gokay and Hulya Yalcin. 2019. Improving
training, or sequnece-level knowledge distillation low resource turkish speech recognition with data
by text-to-text machine translation model, is the augmentation and tts. In 2019 16th International
Multi-Conference on Systems, Signals & Devices
most effective way to utilize the huge ASR train- (SSD), pages 357–360. IEEE.
ing data (Liu et al., 2019; Pino et al., 2020). On
the other hand, synthesizing data by text-to-speech Alex Graves. 2012. Sequence transduction with
recurrent neural networks. arXiv preprint
(TTS) has been demonstrated to be effective for arXiv:1211.3711.
low resource speech recognition task (Gokay and
Yalcin, 2019; Ren et al., 2019). To the best of our Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Vic-
tor O.K. Li. 2017. Learning to translate in real-time
knowledge, this is the first work to augment data
with neural machine translation. In Proceedings of
by TTS for simultaneous speech-to-text translation the 15th Conference of the European Chapter of the
tasks. Association for Computational Linguistics: Volume
1, Long Papers, pages 1053–1062, Valencia, Spain.
Association for Computational Linguistics.
6 Conclusion
Junxian He, Jiatao Gu, Jiajun Shen, and Marc’Aurelio
In this paper, we propose a novel simultane- Ranzato. 2019. Revisiting self-training for
ous translation architecture, Cross Attention Aug- neural sequence generation. arXiv preprint
mented Transducer (CAAT), which significantly arXiv:1909.13788.
outperforms wait-k in both S2T and T2T simulta- Yoon Kim and Alexander M Rush. 2016. Sequence-
neous translation task. Based on CAAT architec- level knowledge distillation. arXiv preprint
ture and data augmentation, we build simultaneous arXiv:1606.07947.
translation systems on text-to-text and speech-to- Taku Kudo. 2006. Mecab: Yet another
text simultaneous translation tasks. We also build part-of-speech and morphological analyzer.
a cascaded speech-to-text simultaneous translation http://mecab.sourceforge.jp.
system for comparison. Both T2T and S2T systems Taku Kudo and John Richardson. 2018. SentencePiece:
achieve significant improvements over last year’s A simple and language independent subword tok-
best-performing systems. enizer and detokenizer for neural text processing. In
Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing: System
Demonstrations, pages 66–71, Brussels, Belgium.
References Association for Computational Linguistics.
Naveen Arivazhagan, Colin Cherry, Wolfgang Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang,
Macherey, Chung-Cheng Chiu, Semih Yavuz, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019.
Ruoming Pang, Wei Li, and Colin Raffel. 2019. End-to-end speech translation with knowledge distil-
Monotonic infinite lookback attention for simulta- lation.
neous machine translation. In Proceedings of the
57th Annual Meeting of the Association for Com- Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng,
putational Linguistics, pages 1313–1323, Florence, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang,
Italy. Association for Computational Linguistics. Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and
37Haifeng Wang. 2019. STACL: Simultaneous trans- to speech and automatic speech recognition. In In-
lation with implicit anticipation and controllable la- ternational Conference on Machine Learning, pages
tency using prefix-to-prefix framework. In Proceed- 5410–5419. PMLR.
ings of the 57th Annual Meeting of the Association
for Computational Linguistics, pages 3025–3036, Rico Sennrich, Barry Haddow, and Alexandra Birch.
Florence, Italy. Association for Computational Lin- 2016. Neural machine translation of rare words
guistics. with subword units. In Proceedings of the 54th An-
nual Meeting of the Association for Computational
Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Linguistics (Volume 1: Long Papers), pages 1715–
Jiatao Gu, and Juan Pino. 2020a. SIMULEVAL: An 1725, Berlin, Germany. Association for Computa-
evaluation toolkit for simultaneous translation. In tional Linguistics.
Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing: System Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike
Demonstrations, pages 144–150, Online. Associa- Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng
tion for Computational Linguistics. Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan,
et al. 2018. Natural tts synthesis by condition-
Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, ing wavenet on mel spectrogram predictions. In
and Jiatao Gu. 2020b. Monotonic multihead atten- 2018 IEEE International Conference on Acoustics,
tion. In 8th International Conference on Learning Speech and Signal Processing (ICASSP), pages
Representations, ICLR 2020, Addis Ababa, Ethiopia, 4779–4783. IEEE.
April 26-30, 2020. OpenReview.net.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Myle Ott, Sergey Edunov, Alexei Baevski, Angela
Kaiser, and Illia Polosukhin. 2017. Attention is all
Fan, Sam Gross, Nathan Ng, David Grangier, and
you need. In Advances in Neural Information Pro-
Michael Auli. 2019. fairseq: A fast, extensible
cessing Systems 30: Annual Conference on Neural
toolkit for sequence modeling. In Proceedings of
Information Processing Systems 2017, December 4-
the 2019 Conference of the North American Chap-
9, 2017, Long Beach, CA, USA, pages 5998–6008.
ter of the Association for Computational Linguistics
(Demonstrations), pages 48–53, Minneapolis, Min- Chunyang Wu, Yongqiang Wang, Yangyang Shi,
nesota. Association for Computational Linguistics. Ching-Feng Yeh, and Frank Zhang. 2020. Stream-
ing transformer-based acoustic models using self-
Daniel S Park, William Chan, Yu Zhang, Chung-Cheng attention with augmented memory. arXiv preprint
Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. arXiv:2005.08042.
2019. Specaugment: A simple data augmentation
method for automatic speech recognition. arXiv Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang
preprint arXiv:1904.08779. Huang. 2019a. Simultaneous translation with flexi-
ble policy via restricted imitation learning. In Pro-
Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad ceedings of the 57th Annual Meeting of the Asso-
Dousti, and Yun Tang. 2020. Self-training for ciation for Computational Linguistics, pages 5816–
end-to-end speech translation. arXiv preprint 5822, Florence, Italy. Association for Computa-
arXiv:2006.02490. tional Linguistics.
Daniel Povey. 2005. Discriminative training for large Renjie Zheng, Mingbo Ma, Baigong Zheng, and Liang
vocabulary speech recognition. Ph.D. thesis, Uni- Huang. 2019b. Speculative beam search for simulta-
versity of Cambridge. neous translation. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Processing and the 9th International Joint Confer-
Burget, Ondrej Glembek, Nagendra Goel, Mirko ence on Natural Language Processing (EMNLP-
Hannemann, Petr Motlicek, Yanmin Qian, Petr IJCNLP), pages 1395–1402, Hong Kong, China. As-
Schwarz, et al. 2011. The kaldi speech recogni- sociation for Computational Linguistics.
tion toolkit. In IEEE 2011 workshop on automatic
speech recognition and understanding, CONF. IEEE
Signal Processing Society.
Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin,
Zhou Zhao, and Tie-Yan Liu. 2020. SimulSpeech:
End-to-end simultaneous speech to text translation.
In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, pages
3787–3796, Online. Association for Computational
Linguistics.
Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao,
and Tie-Yan Liu. 2019. Almost unsupervised text
38You can also read