Non-autoregressive Mandarin-English Code-switching Speech Recognition with Pinyin Mask-CTC and Word Embedding Regularization
Shun-Po Chuang†1 , Heng-Jui Chang†2 , Sung-Feng Huang1 , Hung-yi Lee1,2
Graduate Institute of Communication Engineering, National Taiwan University, Taiwan1
Department of Electrical Engineering, National Taiwan University, Taiwan2
{f04942141, b06901020, f06942045, hungyilee}@ntu.edu.tw
arXiv:2104.02258v1 [cs.CL] 6 Apr 2021

† Equal contribution.

Abstract

Mandarin-English code-switching (CS) is frequently used among East and Southeast Asian people. However, the intra-sentence switching between two very different languages makes recognizing CS speech challenging. Meanwhile, recent successful non-autoregressive (NAR) ASR models remove the need for the left-to-right beam decoding of autoregressive (AR) models while achieving outstanding performance and fast inference speed. Therefore, in this paper, we take advantage of the Mask-CTC NAR ASR framework to tackle the CS speech recognition problem. We propose changing the Mandarin output target of the encoder to Pinyin for faster encoder training, and we introduce a Pinyin-to-Mandarin decoder to learn contextualized information. Moreover, we propose word embedding label smoothing to regularize the decoder with contextualized information and projection matrix regularization to bridge the gap between the encoder and decoder. We evaluate the proposed methods on the SEAME corpus and achieve exciting results.

Index Terms: code-switching, end-to-end speech recognition, non-autoregressive

1. Introduction

Code-switching (CS) is the phenomenon of using two or more languages within a sentence in text or speech. This practice is common in regions worldwide, especially in East and Southeast Asia, where people frequently use Mandarin-English CS speech in daily conversations; thus, developing technologies to recognize this type of speech is critical. Although humans can easily recognize CS speech, current automatic speech recognition (ASR) technologies perform poorly on it since these systems are mostly trained with monolingual data. Moreover, very few CS speech corpora are publicly available, making it more challenging to train high-performance ASR models for CS speech.

Various approaches have been studied to tackle the CS speech recognition problem, including language identity recognition [1–4] and data augmentation [5–8]. Many prior works exploited powerful end-to-end ASR technologies such as Listen, Attend and Spell [9] and RNN transducers [10]. However, these methods mostly require autoregressive (AR) left-to-right beam decoding, leading to longer processing time. Some studies also demonstrated that processing each language with its own encoder or decoder obtains better performance [11–13]. This multi-encoder-decoder architecture and AR decoding require more computation power and are thus less feasible for real-world applications. Therefore, in this paper, we leverage non-autoregressive (NAR) ASR technology to tackle this issue.

Unlike the currently very successful AR ASR models [14, 15], NAR ASR removes the left-to-right dependency by directly predicting the whole sequence at once or refining the output sequence in a constant number of iterations [16, 17]. NAR ASR can thus exploit parallel computing for faster inference. This paper primarily adopts Mask-CTC [18, 19], a NAR ASR framework with a NAR conditional masked language model (CMLM) for iterative decoding, which achieves performance competitive with AR ASR baselines.

Although Mask-CTC provides exciting results, several shortcomings remain unaddressed. First, the CS task suffers from data scarcity [20, 21], and Mask-CTC introduces numerous parameters that can lead to overfitting. We therefore propose word embedding label smoothing to incorporate additional textual knowledge, bringing more precise semantic and contextual information for a better regularization effect. Second, the current Mask-CTC framework trains the decoder without considering the encoder's recognition errors, probably causing error propagation during inference. We suggest constraining the encoder's output projection layer to be similar to the input embedding layer of the decoder, bridging the gap between the encoder and decoder. Moreover, Mask-CTC requires longer training time [18]; consequently, we propose changing the encoder's output target from Mandarin characters to Pinyin symbols, reducing training time and complexity. Pinyin is a standardized way of transliterating Chinese characters into Latin letters; each Mandarin character can be mapped to at least one Pinyin symbol representing its pronunciation. Using Pinyin symbols as the target allows the encoder to focus on acoustic modeling and reduces the vocabulary size.

Before this paper, Naowarat et al. [22] used a contextualized CTC loss to force NAR ASR to learn contextual information in CS speech recognition, whereas we utilize the CMLM to better model linguistic information. Also, Huang et al. [23] developed context-dependent label smoothing that removes impossible labels given the previous output token, whereas the proposed word embedding label smoothing is more straightforward and leverages knowledge from additional text data.

We conducted extensive experiments on the SEAME CS corpus [24] to evaluate our methods. We show that the proposed Pinyin-based Mask-CTC and the regularization methods benefit CS speech recognition and offer performance comparable to the AR baseline. Moreover, these methods performed well under the low-resource scenario.

2. Method

2.1. Mask-CTC

The Mask-CTC [18, 19] model is a NAR ASR framework that adopts the transformer encoder-decoder structure, but the decoder is a CMLM for NAR decoding.
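At a high level, inference in such a framework takes the encoder's CTC output, masks its low-confidence tokens, and lets the CMLM fill them in over a few passes. The loop can be sketched as follows (a minimal toy sketch, not the paper's implementation; the CMLM is replaced by a stub scorer, and all names are illustrative):

```python
import numpy as np

def mask_ctc_decode(ctc_tokens, ctc_probs, cmlm_scores, p_thres=0.5, K=2, mask_id=-1):
    """Toy sketch of Mask-CTC inference: mask low-confidence CTC tokens,
    then let a (stub) CMLM recover them over at most K passes."""
    y = np.array(ctc_tokens, dtype=int)
    # Positions whose CTC confidence falls below the threshold P_thres.
    masked = [u for u, p in enumerate(ctc_probs) if p < p_thres]
    y[masked] = mask_id                      # replace with the <MASK> token
    per_iter = max(1, len(masked) // K)      # floor(|Y_mask| / K) tokens per pass
    while masked:
        scores = cmlm_scores(y)              # (len(y), |V|) stand-in for P_CMLM
        # Recover the most confident masked positions first.
        masked.sort(key=lambda u: scores[u].max(), reverse=True)
        for u in masked[:per_iter]:
            y[u] = int(scores[u].argmax())
        masked = masked[per_iter:]
    return y.tolist()
```

As reported later in Sec. 3.2, the refinement iterations provided no gain in our experiments, so in practice a single pass through the decoder suffices.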
First, in the training phase, given transcribed audio-text pair
(X, Y), the input X is first encoded with a conformer encoder model [14] and linearly projected to probability distributions over all possible vocabulary entries V to minimize the CTC loss [25]. Then, some tokens Y_mask in Y are randomly masked with the special token <MASK>. The decoder is trained to predict the masked tokens conditioned on the observed tokens Y_obs = Y \ Y_mask and the encoder output. The Mask-CTC model is thus trained to maximize the log-likelihood

    log P_NAR(Y | X) = α log P_CTC(Y | X) + (1 − α) log P_CMLM(Y_mask | Y_obs, X),   (1)

where 0 ≤ α ≤ 1 is a tunable hyper-parameter.

At the decoding stage, the encoder output sequence is first transformed by CTC decoding, denoted as Ŷ. Next, tokens in Ŷ with probability lower than a specified threshold P_thres are masked with <MASK> for the CMLM to predict, denoted as Ŷ_mask. Then, the masked tokens are gradually recovered by the CMLM in K iterations. In the k-th iteration, the ⌊|Ŷ_mask| / K⌋ most probable tokens in the masked sequence are recovered by calculating

    y_u = argmax_{y ∈ V} P_CMLM(y_u = y | Ŷ_obs^(k), X),   (2)

where y_u is the u-th output token and Ŷ_obs^(k) is the sequence including the unmasked tokens and the masked tokens recovered in the first (k − 1) iterations. The Mask-CTC ASR framework achieves performance close to AR ASR models but with significantly lower computation cost [18, 19].

[Figure 1: (a) the original Mask-CTC [18] framework and (b) the proposed Mask-CTC + Pinyin-to-Mandarin (P2M) decoder framework. Both stack a conformer encoder with CTC followed by transformer decoders; (b) inserts a P2M decoder that maps the encoder's Pinyin output "wo hen happy" to characters. The example "我(wo) 很(hen) happy" means "I am very happy". The character "很(hen)" is misrecognized as "狠(hen)", which has the same pronunciation but a different meaning; the final transformer decoders can recover the wrong character.]

2.2. Using Pinyin as Output Target

Here, we introduce the Pinyin-to-Mandarin (P2M) decoder for dealing with CS data and faster training, as shown in Fig. 1. One of the challenges of training Mandarin-English CS speech recognition models is that jointly modeling the two very different languages is tricky. Nearly 5k characters are frequently used in Mandarin, yet they cover only approximately 400 pronunciations, or about 1500 when tones are considered. Many characters thus share identical pronunciations, making ASR models prone to predicting incorrect characters with the correct pronunciation. Moreover, a larger vocabulary makes CTC training more difficult. We thus propose changing the encoder's targets from Mandarin characters to Pinyin. This idea could be applied to other languages that can be represented with symbols similar to the Pinyin system.

Each Mandarin character can be mapped to at least one Pinyin symbol representing its pronunciation, similar to a phoneme representation. Replacing characters with Pinyin reduces the vocabulary size and allows the encoder to focus on learning acoustic modeling. The Pinyin system's pronunciation rules are also similar to English since Pinyin symbols are written in Latin letters, providing a more intuitive way to utilize a pre-trained English ASR model for initialization.

P2M Decoder. Next, we propose a P2M decoder, a single-layer decoder that maps Pinyin back to character symbols, as shown in the middle of Fig. 1. The P2M model is trained with Pinyin-Mandarin character sequence pairs conditioned on the encoder's output. Based on this acoustic-contextualized information, the P2M model can translate Pinyin-English mixed sequences to Mandarin-English, because typical phonetic sequences in Mandarin usually differ from those in English. The original Mask-CTC uses Mandarin-English sentences to train the decoder; our approach further applies the triplet (Pinyin, Mandarin, English) for training, providing more information to train the final Mandarin-English decoder. We encourage the P2M model to learn Pinyin-to-Mandarin character translation with a random masking technique.

2.3. Word Embedding Label Smoothing Regularization

In this section, we introduce a novel label smoothing [26] method using pre-trained word embeddings. (We noticed that a related work [27] was published right before the submission deadline; however, it mainly focuses on RNN-LMs, while we focus on improving NAR CS ASR decoders.) As mentioned previously, the CMLM decoder involves little contextualized information and is unaware of the neighboring predictions during decoding. Hence, we wish to bring in more textual knowledge through label smoothing. Conventionally, label smoothing reduces the ground truth label's probability to 1 − ε for a small constant 0 < ε < 1, while the other possible labels are equally assigned a small constant probability. Although label smoothing is effective for regularization, it cannot exploit relations among the target labels. Here we propose to distill knowledge from semantic-rich word embeddings, which map semantically similar and contextually relevant words to similar representations. Leveraging such properties allows the model to learn semantic-contextual information implicitly.

In this paper, we determine the possible labels by calculating the cosine similarity between word embeddings. For a ground truth label ŷ, we first find the N labels with the highest cosine similarity among the pre-trained word embeddings e as

    D_N = argtop-N_{y ∈ V \ {ŷ}} cos(e(ŷ), e(y)),   (3)

where the argtop-N operator returns the set of N indices of the labels with the top-N highest scores, and cos(·, ·) returns the cosine similarity between its two inputs.
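The neighbor selection of Eq. (3) can be sketched with NumPy over a toy embedding matrix (the actual system uses fastText vectors; the array and names here are illustrative):

```python
import numpy as np

def top_n_neighbors(emb, gt, n):
    """Return D_N of Eq. (3): the indices of the n labels whose embeddings
    are most cosine-similar to the ground-truth label, excluding itself.
    emb: (|V|, d) embedding matrix; gt: ground-truth label index."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit[gt]      # cosine similarity of every label to gt
    sims[gt] = -np.inf          # exclude the ground truth itself
    return {int(i) for i in np.argsort(-sims)[:n]}
```

The ground truth then keeps probability 1 − ε, while each of the returned neighbors receives ε / N.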
The probabilities of each label can thus be written as

    P(y) = 1 − ε,   if y = ŷ
           ε / N,   if y ∈ D_N
           0,       otherwise.   (4)

With the proposed label smoothing, we expect the decoders to learn more contextualized information. Moreover, the word embeddings can be trained with additional text data, introducing richer information to the decoder.

2.4. Projection Matrix Regularization

Here, we propose regularizing the projection matrices in the ASR model to bridge the gap between the encoder and decoder. Although the encoder and decoder in the Mask-CTC model are trained jointly, the decoder is connected with the encoder solely through the cross-attention mechanism. Mask-CTC uses the greedily decoded sequence from the encoder as the decoder's input during inference, but the decoder is optimized with sequences free of recognition errors. This mismatch might degrade ASR performance; we propose to mitigate this problem by making the encoder's output projection matrix and the decoder's input embedding matrix similar to each other.

The last layer of the encoder is a matrix W_CTC ∈ R^{d×|V|} that linearly projects encoded features of dimension d to a probability distribution over all |V| possible output labels. The first layer of the decoder is the embedding layer, which transforms one-hot vectors representing text tokens into continuous hidden features with a matrix W_emb^T ∈ R^{d×|V|}. The matrices W_CTC and W_emb^T can be respectively written as [w_CTC,1 ... w_CTC,|V|] and [w_emb,1 ... w_emb,|V|], where each w is a column vector of dimension d. We hope these two matrices behave similarly, and we thus apply a cosine embedding loss to constrain them:

    L_ProjMatReg = (1 / |V|) Σ_{v=1}^{|V|} [1 − cos(w_CTC,v, w_emb,v)].   (5)

This loss function can also be applied to the two decoders to relate the output layer of the P2M decoder and the input embedding layer of the CMLM. An alternative solution is to share the weights of W_CTC and W_emb^T; however, we found that this severely damaged performance since the two matrices serve different objectives.

Overall, we combine all the objective functions from the previous sections, which can be represented as

    L = − α log P_CTC(Y^pyin | X)
        − (1 − α) log P_P2M(Y^pyin_mask | Y^pyin_obs, X)
        − (1 − α) log P_CMLM(Y^char_mask | Y^char_obs, X)
        + β L_ProjMatReg,   (6)

where Y^pyin and Y^char respectively denote using Pinyin and characters as the Mandarin tokens.

3. Experiment

3.1. Data

To evaluate the proposed methods for CS speech recognition, we used a CS corpus and three monolingual corpora.

SEAME Corpus: SEAME [24] is a Mandarin-English CS conversational speech corpus collected in Singapore. The two evaluation sets, devman and devsge, are respectively biased toward Mandarin and English. We excluded all the testing data from the SEAME corpus and used the remainder for the training/validation sets. The statistics are listed in Table 1. The baseline model used subword units [28] for English and Chinese characters for Mandarin, resulting in a vocabulary size of 5751. The vocabulary contains 2704 Chinese characters, which are reduced to 390 Pinyin tokens for training the proposed model. We used the pypinyin package (https://pypi.org/project/pypinyin/) for the Pinyin-Mandarin mapping.

Table 1: The duration and the composition of Mandarin, English, and code-switching utterances in the training/validation/testing sets of the SEAME corpus.

                       train    val     devman   devsge
    Duration (hours)   114.6    6.0     7.5      3.9
    Mandarin           19.3%    19.1%   19.9%    8.9%
    English            23.8%    24.4%   12.4%    49.8%
    Code-switching     56.9%    56.5%   67.7%    41.3%

Monolingual Datasets: We used the LibriSpeech [29] English corpus of approximately 960 hours of training data for model pre-training. For training the word embeddings, we combined SEAME, the TEDLIUM2 English corpus [30], and the AISHELL Mandarin corpus [31] to train a skip-gram model using the fastText toolkit [32]. We chose TEDLIUM2 rather than LibriSpeech for training word embeddings since it has shorter and more spontaneous utterances, similar to SEAME.

3.2. Model

All experiments were based on the ESPnet toolkit [33] and followed its training recipe. Audio features were global mean-variance normalized 83-dimensional log-Mel filterbank and pitch features. Speed perturbation [34] and SpecAugment [35] were applied throughout the training process. Conventional label smoothing was applied to all experiments except for those with the proposed label smoothing technique. The encoder for both the AR and NAR ASR models was a 12-layer conformer encoder [14] with a dimension of 512 and 8 attention heads per layer. The P2M decoder was a 1-layer transformer decoder, and the iterative refinement decoder was a 6-layer transformer decoder [36]. All decoders had a feed-forward dimension of 2048. We evaluated the performance of our models with token error rate (TER) and real-time factor (RTF), where the tokens are Mandarin characters and English words. RTF is a commonly used indicator for demonstrating the fast-inference advantage of NAR models over AR models. The ASR models in our experiments were all initialized with a pre-trained conformer ASR model provided by ESPnet. We set the hyper-parameters in all experiments to α = 0.3 and β = 10⁻⁴, and set ε = 0.1 and N = 10 for the label smoothing loss. We observed that the iterative refinement step provided no gain in ASR performance for either the original or the proposed Mask-CTC framework at the inference stage. Therefore, we directly predict the output with a single pass through the decoder, without the iterative decoding described in [18].

3.3. Proposed Pinyin Decoder and Regularization Methods

This section investigates the effectiveness of the proposed model and regularization methods using all the training data in SEAME. We tested the models with the best validation loss, and the results are listed in Sec. (I) of Table 2.
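For concreteness, the character-to-Pinyin target conversion of Sec. 2.2 and Sec. 3.1 can be sketched as follows (a three-entry toy table stands in for the pypinyin mapping; function and variable names are illustrative):

```python
# Toy Pinyin table standing in for pypinyin; the real vocabulary covers
# 2704 characters mapped down to 390 Pinyin tokens.
CHAR2PINYIN = {"我": "wo3", "很": "hen3", "狠": "hen3"}

def to_pinyin_targets(tokens):
    """Map Mandarin characters to Pinyin encoder targets;
    English words pass through unchanged."""
    return [CHAR2PINYIN.get(t, t) for t in tokens]
```

Note that the homophones 很 and 狠 collapse to the same target "hen3", which is exactly the ambiguity the P2M decoder must resolve when mapping Pinyin back to characters.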
Table 2: TERs (%) and RTF of the AR and NAR ASR models trained with all data from SEAME. The Reg Methods in row (d) are the word embedding label smoothing and the projection matrix regularization methods applied jointly. Model averaging was applied to rows (e) and (h).

    Method                       devman   devsge   RTF
    (I) Non-autoregressive
    (a) Mask-CTC                 16.5     24.4     0.017
    (b) + P2M (w/o Pinyin)       16.6     24.4     0.017
    (c) + P2M (w/ Pinyin)        16.3     24.0     0.017
    (d)   + Reg Methods          16.0     24.1     0.017
    (e)     + Model Avg          15.3     22.3     0.017
    (II) Autoregressive
    (f) Transformer-T [12]       18.5     26.3     -
    (g) Multi-Enc-Dec [11]       16.7     23.1     -
    (h) Conformer (Ours)         14.3     20.6     1.031

We first set two baselines: one with the original Mask-CTC (row (a)), and the other with the proposed architecture but using Mandarin characters instead of Pinyin tokens as the intermediate representation (row (b)). Switching the encoder's target to Pinyin improved the ASR performance (row (c) vs. (a)(b)), indicating that introducing Pinyin as the intermediate representation effectively transfers the learning burden from the encoder to the decoders. Besides, the proposed regularization methods achieved a better result on the devman set and a comparable result on the devsge set (row (d) vs. (c)). To decrease the variance of the model's predictions, we averaged the top five checkpoints with the highest validation accuracy [37], yielding the best NAR ASR performance (row (e)).

To highlight the benefits of NAR ASR models, we list the performance of AR ASR models in Sec. (II) of Table 2 for comparison. The implemented AR baseline used beam search with a beam size of 10 for decoding and surpassed the previous SOTA results [11, 12] (row (h) vs. (f)(g)), providing a solid reference. The best NAR model performed with only a small gap to the best AR model (row (e) vs. (h)) but with a significant 60× speedup (the RTF column), implying that the NAR model can recognize CS speech while retaining fast inference speed. Overall, we have shown that NAR ASR models incorporating the proposed methods achieve exciting results.

3.4. Low-resource Scenario

To shed more light on the proposed regularization techniques, we conducted experiments on lower-resource CS ASR tasks. In practice, transcribed CS speech data are difficult to obtain; therefore, we simulated the low-resource scenario with different amounts (60/30/10%) of available training data in SEAME to evaluate the proposed methods. The results are shown in Table 3.

Table 3: TERs (%) in the low-resource scenario, where different amounts of data were available for the Pinyin Mask-CTC model with and without the proposed regularization methods. Model averaging was applied to all results.

    Data   Method            devman   devsge
    10%    (a) Pinyin        42.2     50.7
           (b) w/ Proposed   41.7     50.2
    30%    (c) Pinyin        24.8     30.8
           (d) w/ Proposed   24.0     30.2
    60%    (e) Pinyin        18.2     24.4
           (f) w/ Proposed   17.7     23.8

The proposed regularization approaches always offered better performance on the SEAME corpus regardless of the given amount of training data (rows (a) vs. (b); (c) vs. (d); (e) vs. (f)). Note that SpecAugment, a strong data augmentation technique capable of pushing ASR models from overfitting toward underfitting, was applied to all experiments. Although it already provided good results, the proposed methods further improved ASR performance.

Overall, we demonstrated that the proposed Pinyin model and regularization methods improve low-resource training. These methods could also be applied to other tasks in limited-data situations.

3.5. Ablation Studies

To verify the efficacy of the proposed regularization methods, we conducted ablation studies under the low-resource scenario with only 10% of the training data.

Table 4: Ablation studies of the proposed methods in TER (%). EmbLS denotes the proposed word embedding label smoothing method, and MatReg denotes the projection matrix regularization method. The overall TER is in the All column.

         EmbLS   MatReg   devman   devsge   All
    (a)  ✗       ✗        42.2     50.7     45.3
    (b)  ✓       ✗        42.0     50.1     44.9
    (c)  ✗       ✓        42.2     50.4     45.2
    (d)  ✓       ✓        41.7     50.2     44.8

The performance of the baseline model is shown in row (a) of Table 4. Applying only the proposed label smoothing method of Sec. 2.3 brought improvement (row (b) vs. (a)), showing that leveraging textual knowledge from additional text data is beneficial under the low-resource scenario. With only the projection matrix regularization of Sec. 2.4 applied, the model obtained limited improvement (row (c) vs. (a)). Nevertheless, applying both proposed methods achieved further gains (row (d)), reaching the best performance. This shows that the projection matrix regularization benefits from the proposed label smoothing and closes the gap between the encoder and decoders. The ablation study showed that the two methods are compatible with each other and both contribute to improving recognition performance.

4. Conclusion

This paper introduces a novel non-autoregressive ASR framework that adds a Pinyin-to-Mandarin decoder to Mask-CTC ASR to tackle the Mandarin-English code-switching speech recognition problem. We also propose word embedding label smoothing to bring contextual information into the conditional masked language model, and a projection matrix regularization method to bridge the gap between the encoder and decoders. We demonstrated the effectiveness of our methods with exciting performance on the SEAME corpus. The new ASR framework and regularization methods have the potential to improve various speech recognition scenarios.

5. References
[1] S. Zhang, J. Yi, Z. Tian, J. Tao, and Y. Bai, "RNN-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition," in ISCSLP, 2021.
[2] K. Li, J. Li, G. Ye, R. Zhao, and Y. Gong, "Towards code-switching ASR for end-to-end CTC models," in ICASSP, 2019.
[3] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie, "Investigating end-to-end speech recognition for Mandarin-English code-switching," in ICASSP, 2019.
[4] Z. Zeng, Y. Khassanov, V. T. Pham, H. Xu, E. S. Chng, and H. Li, "On the end-to-end solution to Mandarin-English code-switching speech recognition," in INTERSPEECH, 2019.
[5] C. Du, H. Li, Y. Lu, L. Wang, and Y. Qian, "Data augmentation for end-to-end code-switching speech recognition," in SLT, 2021.
[6] Y. Long, Y. Li, Q. Zhang, S. Wei, H. Ye, and J. Yang, "Acoustic data augmentation for Mandarin-English code-switching speech recognition," Applied Acoustics, vol. 161, 2020.
[7] Y. Sharma, B. Abraham, K. Taneja, and P. Jyothi, "Improving low resource code-switched ASR using augmented code-switched TTS," in INTERSPEECH, 2020.
[8] C.-T. Chang, S.-P. Chuang, and H.-y. Lee, "Code-switching sentence generation by generative adversarial networks and its application to data augmentation," in INTERSPEECH, 2019.
[9] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in ICASSP, 2016.
[10] A. Graves, "Sequence transduction with recurrent neural networks," in ICML Workshop on Representation Learning, 2012.
[11] X. Zhou, E. Yılmaz, Y. Long, Y. Li, and H. Li, "Multi-encoder-decoder transformer for code-switching speech recognition," in INTERSPEECH, 2020.
[12] S. Dalmia, Y. Liu, S. Ronanki, and K. Kirchhoff, "Transformer-transducers for code-switched speech recognition," in ICASSP, 2021.
[13] Y. Lu, M. Huang, H. Li, J. Guo, and Y. Qian, "Bi-encoder transformer network for Mandarin-English code-switching speech recognition using mixture of experts," in INTERSPEECH, 2020.
[14] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, "Conformer: Convolution-augmented transformer for speech recognition," in INTERSPEECH, 2020.
[15] C. Liu, F. Zhang, D. Le, S. Kim, Y. Saraf, and G. Zweig, "Improving RNN transducer based ASR with auxiliary tasks," in SLT, 2021.
[16] E. A. Chi, J. Salazar, and K. Kirchhoff, "Align-refine: Non-autoregressive speech recognition via iterative realignment," arXiv preprint arXiv:2010.14233, 2020.
[17] N. Chen, S. Watanabe, J. Villalba, and N. Dehak, "Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition," arXiv preprint arXiv:1911.04908, 2019.
[18] Y. Higuchi, S. Watanabe, N. Chen, T. Ogawa, and T. Kobayashi, "Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict," in INTERSPEECH, 2020.
[19] Y. Higuchi, H. Inaguma, S. Watanabe, T. Ogawa, and T. Kobayashi, "Improved Mask-CTC for non-autoregressive end-to-end ASR," in ICASSP, 2021.
[20] Y. Sharma, B. Abraham, K. Taneja, and P. Jyothi, "Improving low resource code-switched ASR using augmented code-switched TTS," in INTERSPEECH, 2020.
[21] S.-P. Chuang, T.-W. Sung, and H.-y. Lee, "Training code-switching language model with monolingual data," in ICASSP, 2020.
[22] B. Naowarat, T. Kongthaworn, K. Karunratanakul, S. H. Wu, and E. Chuangsuwanich, "Reducing spelling inconsistencies in code-switching ASR using contextualized CTC loss," arXiv preprint arXiv:2005.07920, 2020.
[23] Z. Huang, P. Li, J. Xu, P. Zhang, and Y. Yan, "Context-dependent label smoothing regularization for attention-based end-to-end code-switching speech recognition," in ISCSLP, 2021.
[24] D.-C. Lyu, T.-P. Tan, E. S. Chng, and H. Li, "SEAME: A Mandarin-English code-switching speech corpus in South-East Asia," in INTERSPEECH, 2010.
[25] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006.
[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in CVPR, 2016.
[27] M. Song, Y. Zhao, S. Wang, and M. Han, "Word similarity based label smoothing in RNNLM training for ASR," in SLT, 2021.
[28] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in EMNLP: System Demonstrations, 2018.
[29] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015.
[30] A. Rousseau, P. Deléglise, and Y. Estève, "Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks," in LREC, 2014.
[31] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in O-COCOSDA, 2017.
[32] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," arXiv preprint arXiv:1607.04606, 2016.
[33] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in INTERSPEECH, 2018.
[34] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in INTERSPEECH, 2015.
[35] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in INTERSPEECH, 2019.
[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017.
[37] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang et al., "A comparative study on transformer vs RNN in speech applications," in ASRU, 2019.