Non-autoregressive Mandarin-English Code-switching Speech Recognition with Pinyin Mask-CTC and Word Embedding Regularization

Shun-Po Chuang†1, Heng-Jui Chang†2, Sung-Feng Huang1, Hung-yi Lee1,2

1 Graduate Institute of Communication Engineering, National Taiwan University, Taiwan
2 Department of Electrical Engineering, National Taiwan University, Taiwan
{f04942141, b06901020, f06942045, hungyilee}@ntu.edu.tw

arXiv:2104.02258v1 [cs.CL] 6 Apr 2021

Abstract

Mandarin-English code-switching (CS) is frequently used among East and Southeast Asian people. However, the intra-sentence switching between the two very different languages makes recognizing CS speech challenging. Meanwhile, the recently successful non-autoregressive (NAR) ASR models remove the need for the left-to-right beam decoding used in autoregressive (AR) models and achieve outstanding performance and fast inference speed. Therefore, in this paper, we take advantage of the Mask-CTC NAR ASR framework to tackle CS speech recognition. We propose changing the Mandarin output target of the encoder to Pinyin for faster encoder training, and introduce a Pinyin-to-Mandarin decoder to learn contextualized information. Moreover, we propose word embedding label smoothing to regularize the decoder with contextualized information, and projection matrix regularization to bridge the gap between the encoder and decoder. We evaluate the proposed methods on the SEAME corpus and achieve exciting results.

Index Terms: code-switching, end-to-end speech recognition, non-autoregressive

1. Introduction

Code-switching (CS) is the phenomenon of using two or more languages within a sentence in text or speech. This practice is common in regions worldwide, especially in East and Southeast Asia, where people frequently use Mandarin-English CS speech in daily conversations; thus, developing technologies to recognize this type of speech is critical. Although humans can easily recognize CS speech, automatic speech recognition (ASR) technologies still perform poorly on it, since these systems are mostly trained with monolingual data. Moreover, very few CS speech corpora are publicly available, making it more challenging to train high-performance ASR models for CS speech.

Various approaches have been studied to tackle the CS speech recognition problem, including language identity recognition [1-4] and data augmentation [5-8]. Many prior works exploited powerful end-to-end ASR technologies like listen, attend and spell [9] and RNN transducers [10]. However, these methods mostly require autoregressive (AR) left-to-right beam decoding, leading to longer processing time. Some studies also demonstrated that processing each language with its own encoder or decoder obtained better performance [11-13]. This multi-encoder-decoder architecture and AR decoding require more computation power and are thus less feasible for real-world applications. Therefore, in this paper, we leverage non-autoregressive (NAR) ASR technology to tackle this issue.

Unlike the currently very successful AR ASR models [14, 15], NAR ASR removes the left-to-right dependency by directly predicting the whole sequence at once or refining the output sequence in a constant number of iterations [16, 17]. NAR ASR can thus exploit parallel computing technologies for faster inference. This paper primarily adopts Mask-CTC [18, 19], a NAR ASR framework with a NAR conditional masked language model (CMLM) for iterative decoding, possessing competitive performance with AR ASR baselines.

Although Mask-CTC provides exciting results, several shortcomings remain unaddressed. First, the CS task suffers from data scarcity [20, 21]; using Mask-CTC would introduce numerous parameters and lead to overfitting. We therefore propose word embedding label smoothing to incorporate additional textual knowledge, bringing more precise semantic and contextual information for a better regularization effect. Second, the current Mask-CTC framework trains the decoder without considering the encoder's recognition errors, probably causing error propagation during inference. We suggest constraining the encoder's output projection layer to be similar to the input embedding layer of the decoder to bridge the gap between the encoder and decoder. Moreover, Mask-CTC requires longer training time [18]; consequently, we propose changing the encoder's output target from Mandarin characters to Pinyin symbols, reducing training time and complexity, where Pinyin is a standardized way of transliterating Chinese characters into Latin letters. Each Mandarin character can be mapped to at least one Pinyin symbol representing its pronunciation. Using Pinyin symbols as the target allows the encoder to focus on acoustic modeling and reduces the vocabulary size.

Before this paper, Naowarat et al. [22] used contextualized CTC to force NAR ASR to learn contextual information in CS speech recognition, while we utilize a CMLM to better model linguistic information. Also, Huang et al. [23] developed context-dependent label smoothing methods that remove impossible labels given the previous output token, while the proposed word embedding label smoothing is more straightforward and leverages knowledge from additional text data.

We conducted extensive experiments on the SEAME CS corpus [24] to evaluate our methods. We show that the proposed Pinyin-based Mask-CTC and the regularization methods benefit CS speech recognition and offer performance comparable to the AR baseline. Moreover, these methods offer exciting performance under the low-resource scenario.

† Equal contribution.
2. Method

2.1. Mask-CTC

The Mask-CTC [18, 19] model is a NAR ASR framework that adopts the transformer encoder-decoder structure, but the decoder is a CMLM for NAR decoding.

First, in the training phase, given a transcribed audio-text pair (X, Y), the sequence X is encoded with a conformer encoder [14] and linearly projected to probability distributions over all possible vocabularies V to minimize the CTC loss [25]. Then, some tokens Y_mask in Y are randomly masked with the special token <MASK>. The decoder is trained to predict the masked tokens conditioned on the observed tokens Y_obs = Y \ Y_mask and the encoder output. The Mask-CTC model is thus trained to maximize the log-likelihood

log P_NAR(Y|X) = α log P_CTC(Y|X) + (1 − α) log P_CMLM(Y_mask | Y_obs, X),    (1)

where 0 ≤ α ≤ 1 is a tunable hyper-parameter.

At the decoding stage, the encoder output sequence is first transformed by CTC decoding, denoted as Ŷ. Next, tokens in Ŷ with probability lower than a specified threshold P_thres are masked with <MASK> for the CMLM to predict, denoted as Ŷ_mask. Then, the masked tokens are gradually recovered by the CMLM in K iterations. In the kth iteration, the ⌊|Ŷ_mask|/K⌋ most probable tokens in the masked sequence are recovered by calculating

y_u = arg max_{y ∈ V} P_CMLM(y_u = y | Ŷ_obs^(k), X),    (2)

where y_u is the uth output token and Ŷ_obs^(k) is the sequence including the unmasked tokens and the masked tokens recovered in the first (k − 1) iterations. The Mask-CTC ASR framework achieves performance close to AR ASR models but with significantly lower computation cost [18, 19].
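To make the inference procedure concrete, below is a minimal sketch of Mask-CTC decoding following Eq. (2), assuming a greedy CTC hypothesis with per-token confidences and a `cmlm` callable that returns per-position logits over the vocabulary. All names are illustrative, not the authors' implementation.

```python
# A minimal sketch of Mask-CTC iterative inference; names are illustrative.
import torch

def mask_ctc_decode(ctc_tokens, ctc_confs, cmlm, enc_out, mask_id, p_thres=0.99, K=10):
    y = ctc_tokens.clone()                      # (L,) token ids from CTC greedy decoding
    masked = ctc_confs < p_thres                # low-confidence positions go to the CMLM
    y[masked] = mask_id
    num_masked = int(masked.sum())
    if num_masked == 0:
        return y
    per_iter = max(num_masked // K, 1)          # floor(|Y_mask| / K) tokens per iteration
    for _ in range(K):
        still_masked = (y == mask_id).nonzero(as_tuple=True)[0]
        if len(still_masked) == 0:
            break
        probs = cmlm(y.unsqueeze(0), enc_out).softmax(-1)[0]   # (L, |V|)
        conf, pred = probs[still_masked].max(-1)               # best label per masked slot
        # recover only the most confident predictions this iteration (Eq. 2)
        keep = conf.topk(min(per_iter, len(still_masked))).indices
        y[still_masked[keep]] = pred[keep]
    return y
```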
Figure 1: (a) The original Mask-CTC [18] framework and (b) the proposed Mask-CTC + Pinyin-to-Mandarin decoder framework. The example "我(wo)很(hen) happy" means "I am very happy". However, the character "很(hen)" is misrecognized as "狠(hen)", which has the same pronunciation but a different meaning. The final transformer decoders can recover the wrong character.

2.2. Using Pinyin as Output Target

Here, we introduce the Pinyin-to-Mandarin (P2M) decoder for dealing with CS data and faster training, as shown in Fig. 1. One of the challenges of training Mandarin-English CS speech recognition models is that jointly modeling the two very different languages is tricky. Nearly 5k characters are frequently used in Mandarin, covering only approximately 400 pronunciations, or 1500 with tones involved. Many characters therefore share identical pronunciations, making ASR models prone to predicting incorrect characters with the correct pronunciation. Moreover, a larger vocabulary size makes CTC training more difficult. We thus propose changing the encoder's targets from Mandarin characters to Pinyin. This idea could be applied to other languages that can be represented with symbols similar to the Pinyin system.

Each Mandarin character can be mapped to at least one Pinyin symbol representing its pronunciation, similar to a phoneme representation. Replacing characters with Pinyin reduces the vocabulary size and allows the encoder to focus on learning acoustic modeling. The Pinyin system's pronunciation rules are similar to English since Pinyin symbols can be represented in Latin letters, providing a more intuitive way to utilize a pre-trained English ASR model for initialization. A brief example of the character-to-Pinyin mapping is sketched below.
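Since Sec. 3.1 reports using the pypinyin package for the Pinyin-Mandarin mapping, a minimal sketch of converting a code-switched transcript into encoder targets could look like the following. The tokenization and the choice of toneless Pinyin are assumptions for illustration (consistent with the roughly 400 toneless pronunciations mentioned above), not the exact recipe.

```python
# A minimal sketch of mapping a code-switched transcript to Pinyin encoder targets.
from pypinyin import lazy_pinyin

def to_encoder_targets(tokens):
    """Map Mandarin characters to Pinyin; leave English tokens unchanged."""
    targets = []
    for tok in tokens:
        if any('\u4e00' <= ch <= '\u9fff' for ch in tok):   # CJK character range
            targets.extend(lazy_pinyin(tok))                # e.g. '很' -> 'hen'
        else:
            targets.append(tok)                             # English stays as-is
    return targets

print(to_encoder_targets(['我', '很', 'happy']))            # ['wo', 'hen', 'happy']
```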
P2M Decoder. Next, we propose a P2M decoder that maps Pinyin back to character symbols using a single-layer decoder, as shown in the middle of Fig. 1. The P2M model is trained with Pinyin-Mandarin character sequence pairs conditioned on the encoder's output. Based on this acoustic-contextualized information, the P2M model can translate a Pinyin-English mixed sequence to Mandarin-English, because typical phonetic sequences in Mandarin usually differ from those in English. The original Mask-CTC uses Mandarin-English sentences to train the decoder; this approach further applies the triplet (Pinyin, Mandarin, English) for training, providing more information to train the final Mandarin-English decoder. We encourage the P2M model to learn Pinyin-to-Mandarin character translation with a random masking technique.

2.3. Word Embedding Label Smoothing Regularization

In this section, we introduce a novel label smoothing [26] method using pre-trained word embeddings.¹ As mentioned previously, the CMLM decoder involves less contextualized information and is unaware of the neighboring predictions during decoding. Hence, we wish to bring in more textual knowledge through label smoothing. Conventionally, label smoothing reduces the ground truth label's probability to 1 − ε for a small constant 0 < ε < 1, while the other possible labels are equally assigned a small constant probability. Although label smoothing is an effective regularizer, it is incapable of exploiting the target's expression. Here we propose to distill knowledge from semantic-rich word embeddings. Word embeddings give semantically similar and contextually relevant words similar representations; leveraging this property allows the model to learn semantic-contextual information implicitly.

¹ We noticed a related work [27] was published right before the submission deadline. However, they mainly focused on RNN-LMs, while we focus on improving NAR CS ASR decoders.

In this paper, we determine the possible labels by calculating the cosine similarity between word embeddings. For a ground truth label ŷ, we first find the N labels with the highest cosine similarity among the pre-trained word embeddings e as

D_N = arg top_N_{y ∈ V\{ŷ}} cos(e(ŷ), e(y)),    (3)

where the arg top_N operator returns a set of N indices indicating the labels with the top N highest scores and cos(·, ·) returns the cosine similarity between its two inputs. The probabilities of each label can thus be written as

P(y) = { 1 − ε,   y = ŷ
       { ε/N,     y ∈ D_N       (4)
       { 0,       otherwise.

With the proposed label smoothing, we expect the decoders to learn more contextualized information. Moreover, the word embeddings can be trained with additional text data, introducing richer information to the decoder. A sketch of this construction follows.
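Below is a minimal sketch of building the smoothed target distribution of Eqs. (3)-(4), assuming a gensim/fastText-style keyed-vector object `wv` whose `most_similar` returns the N nearest words by cosine similarity. Names and the vocabulary bookkeeping are illustrative.

```python
# A minimal sketch of word embedding label smoothing (Eqs. 3-4).
import torch

def word_emb_smooth_target(gt_id, id2token, token2id, wv, vocab_size, eps=0.1, N=10):
    target = torch.zeros(vocab_size)
    target[gt_id] = 1.0 - eps                        # ground truth keeps 1 - eps
    # the N nearest labels by cosine similarity share the remaining eps mass (Eq. 3)
    neighbors = wv.most_similar(id2token[gt_id], topn=N)
    for tok, _sim in neighbors:
        if tok in token2id:                          # embedding vocab may exceed ASR vocab
            target[token2id[tok]] = eps / N          # Eq. (4); all other labels stay 0
    return target
```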

2.4. Projection Matrix Regularization

Here, we propose regularizing the projection matrices in the ASR model to bridge the gap between the encoder and decoder. Although the encoder and decoder in the Mask-CTC model are trained jointly, the decoder is connected with the encoder solely through the cross-attention mechanism. Mask-CTC uses the greedily decoded sequence from the encoder as the decoder's input during inference, but the decoder is optimized with sequences containing no recognition errors. This mismatch might degrade the ASR performance; we propose to mitigate this problem by making the encoder output projection matrix and the decoder input embedding matrix similar to each other.

The last layer of the encoder is a matrix W_CTC ∈ R^{d×|V|} that linearly projects encoded features of dimension d to a probability distribution over all |V| possible output labels. The first layer of the decoder is the embedding layer that transforms one-hot vectors representing text tokens into continuous hidden features with a matrix W_emb^T ∈ R^{d×|V|}. The matrices W_CTC and W_emb^T can be respectively written as [w_CTC,1 ... w_CTC,|V|] and [w_emb,1 ... w_emb,|V|], where the w's are column vectors of dimension d. We hope these two matrices behave similarly, and we thus apply a cosine embedding loss to constrain them:

L_ProjMatReg = (1/|V|) Σ_{v=1}^{|V|} [1 − cos(w_CTC,v, w_emb,v)].    (5)

This loss function can also be applied to the two decoders to relate the output layer of the P2M decoder to the input embedding layer of the CMLM. An alternative solution is to share the same weights in W_CTC and W_emb^T; however, we found this severely damaged the performance since the two layers have different objectives. A sketch of this loss appears below.
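As a concrete reference, a sketch of Eq. (5) in PyTorch follows. It assumes the CTC projection weight and the decoder embedding table are both stored row-wise with shape (|V|, d), as in typical PyTorch layers; the names are illustrative.

```python
# A minimal sketch of the projection matrix regularization in Eq. (5).
import torch
import torch.nn.functional as F

def proj_mat_reg(w_ctc: torch.Tensor, w_emb: torch.Tensor) -> torch.Tensor:
    """Mean cosine-embedding distance between matching vocabulary rows."""
    cos = F.cosine_similarity(w_ctc, w_emb, dim=-1)   # (|V|,) one score per label
    return (1.0 - cos).mean()                          # Eq. (5)

# e.g. loss = asr_loss + beta * proj_mat_reg(ctc_head.weight, decoder_embed.weight)
```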
Overall, we can combine all the objective functions from the previous sections as

L = −α log P_CTC(Y^pyin | X)
    − (1 − α) log P_P2M(Y^pyin_mask | Y^pyin_obs, X)
    − (1 − α) log P_CMLM(Y^char_mask | Y^char_obs, X)    (6)
    + β L_ProjMatReg,

where Y^pyin and Y^char respectively denote using Pinyin and characters as the Mandarin tokens.
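For completeness, a toy sketch of assembling the total objective of Eq. (6) from pre-computed component losses (negative log-likelihoods), with α and β set to the values reported in Sec. 3.2:

```python
# A minimal sketch of the combined objective in Eq. (6); the four component losses
# (CTC on Pinyin targets, masked P2M and CMLM cross-entropies, and the Eq. (5)
# regularizer) are assumed to be computed elsewhere.
def total_loss(ctc_pyin, p2m_masked_ce, cmlm_masked_ce, proj_reg, alpha=0.3, beta=1e-4):
    return (alpha * ctc_pyin
            + (1.0 - alpha) * p2m_masked_ce
            + (1.0 - alpha) * cmlm_masked_ce
            + beta * proj_reg)
```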
                                                                    without iterative decoding described in [18].
3. Experiment

3.1. Data

To evaluate the proposed methods for CS speech recognition, we used a CS corpus and three monolingual corpora.

SEAME Corpus: SEAME [24] is a Mandarin-English CS conversational speech corpus collected in Singapore. The two evaluation sets, devman and devsge, are respectively biased toward Mandarin and English. We excluded all the testing data from the SEAME corpus and used the remainder as the train/validation set. The statistics are listed in Table 1. The baseline model used subword units [28] for English and Chinese characters for Mandarin, resulting in a vocabulary size of 5751. The vocabulary contains 2704 Chinese characters, which are reduced to 390 Pinyin tokens for training the proposed model. We used the pypinyin package² for the Pinyin-Mandarin mapping.

² https://pypi.org/project/pypinyin/

Table 1: The duration and the composition of Mandarin, English, and code-switching utterances in the training/validation/testing sets of the SEAME corpus.

                     train    val    devman   devsge
Duration (hours)     114.6    6.0      7.5      3.9
Mandarin             19.3%   19.1%    19.9%     8.9%
English              23.8%   24.4%    12.4%    49.8%
Code-switching       56.9%   56.5%    67.7%    41.3%

Monolingual Datasets: We used the LibriSpeech [29] English corpus of approximately 960 hours of training data for model pre-training. For training the word embeddings, we combined SEAME, the TEDLIUM2 English corpus [30], and the AISHELL Mandarin corpus [31] to train a skip-gram model using the fastText toolkit [32]. We chose TEDLIUM2 rather than LibriSpeech for training word embeddings since it has shorter and more spontaneous utterances, similar to SEAME.
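As an illustration of the word embedding training described above, a minimal fastText skip-gram sketch follows. The corpus file name, dimensionality, and other hyper-parameters are assumptions, since the paper only states which corpora were combined.

```python
# A minimal sketch of training skip-gram word embeddings with the fastText toolkit.
import fasttext

# one whitespace-tokenized sentence per line (Mandarin characters + English words)
model = fasttext.train_unsupervised("combined_transcripts.txt", model="skipgram", dim=256)

# nearest neighbors by cosine similarity, as used for D_N in Eq. (3)
print(model.get_nearest_neighbors("happy", k=10))
```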
3.2. Model

All experiments were based on the ESPnet toolkit [33] and followed its training recipe. Audio features were extracted as globally mean-variance normalized 83-dimensional log-Mel filterbank and pitch features. Speed perturbation [34] and SpecAugment [35] were applied throughout the training process. Conventional label smoothing was applied to all experiments except those with the proposed label smoothing technique. The encoder architecture for both the AR and NAR ASR models was a 12-layer conformer encoder [14] with a dimension of 512 and 8 attention heads per layer. The P2M decoder was a 1-layer transformer decoder, and the iterative refinement decoder was a 6-layer transformer decoder [36]. All decoders had a feed-forward dimension of 2048. We evaluated the performance of our models with token error rate (TER) and real-time factor (RTF), where the tokens refer to Mandarin characters and English words. RTF is a commonly used indicator for demonstrating the inference speed advantage of NAR models over AR models. The ASR models in our experiments were all initialized with a pre-trained conformer ASR model provided by ESPnet. We set the hyper-parameters in all experiments to α = 0.3 and β = 10⁻⁴, and set ε = 0.1 and N = 10 for the label smoothing loss. We observed that the iterative refinement step provided no gain to the ASR performance in either the original or the proposed Mask-CTC framework at the inference stage. Therefore, we directly predict the output with a single pass through the decoder, without the iterative decoding described in [18].

3.3. Proposed Pinyin Decoder and Regularization Methods

This section investigates the effectiveness of the proposed model and regularization methods using all training data in SEAME. We tested the models with the best validation loss; the results are listed in Sec. (I) of Table 2.
Table 2: TERs (%) and RTFs of the AR and NAR ASR models trained with all data from SEAME. The Reg Methods in row (d) are the word embedding label smoothing and the projection matrix regularization methods applied jointly. Model averaging was applied in rows (e) and (h).

Method                        devman   devsge    RTF
(I) Non-autoregressive
(a) Mask-CTC                   16.5     24.4    0.017
(b) +P2M (w/o Pinyin)          16.6     24.4    0.017
(c) +P2M (w/ Pinyin)           16.3     24.0    0.017
(d)   + Reg Methods            16.0     24.1    0.017
(e)     + Model Avg            15.3     22.3    0.017
(II) Autoregressive
(f) Transformer-T [12]         18.5     26.3      -
(g) Multi-Enc-Dec [11]         16.7     23.1      -
(h) Conformer (Ours)           14.3     20.6    1.031

We first set two baselines: one with the original Mask-CTC (row (a)), and one with the proposed architecture but using Mandarin characters instead of Pinyin tokens as the intermediate representation (row (b)). Switching the encoder's target to Pinyin improved the ASR performance (row (c) vs. rows (a)(b)), indicating that introducing Pinyin as the intermediate representation effectively transferred the learning burden from the encoder to the decoders. Moreover, the proposed regularization methods achieved a better result on the devman set and a comparable result on the devsge set (row (d) vs. (c)). To decrease the variance of the model's predictions, we averaged the top five checkpoints with the highest validation accuracy [37], resulting in the best NAR ASR performance (row (e)); a sketch of this averaging appears below.

To highlight the benefits of NAR ASR models, we list the performance of AR ASR models in Sec. (II) of Table 2 for comparison. The implemented AR baseline used beam search with a beam size of 10 for decoding and surpassed the previous SOTA results [11, 12] (row (h) vs. rows (f)(g)), providing a solid reference. The best NAR model performed with only a small gap to the best AR model (row (e) vs. (h)) but with a significant 60× speedup (the RTF column), implying that the NAR model can recognize CS speech while possessing fast inference speed. Overall, NAR ASR models incorporating the proposed methods achieved exciting results.
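The checkpoint averaging used in rows (e) and (h) [37] simply averages the parameters of the selected checkpoints element-wise; a minimal sketch, assuming each checkpoint file stores a raw state dict, with illustrative paths:

```python
# A minimal sketch of averaging the top-5 checkpoints by validation accuracy.
import torch

def average_checkpoints(paths):
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# model.load_state_dict(average_checkpoints(["ckpt1.pt", "ckpt2.pt", "ckpt3.pt"]))
```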
3.4. Low-resource Scenario

To shed further light on the proposed regularization techniques, we conducted experiments on much lower-resourced CS ASR tasks. In practice, transcribed CS speech data are difficult to obtain; therefore, we simulated the low-resource scenario with different amounts (60/30/10%) of available training data in SEAME to evaluate the proposed methods. The results are shown in Table 3.

Table 3: TERs (%) in the low-resource scenario, where different amounts of training data were available to the Pinyin Mask-CTC model w/ and w/o the proposed regularization methods. Model averaging was applied to all results.

Data    Method            devman   devsge
10%     (a) Pinyin          42.2     50.7
        (b) w/ Proposed     41.7     50.2
30%     (c) Pinyin          24.8     30.8
        (d) w/ Proposed     24.0     30.2
60%     (e) Pinyin          18.2     24.4
        (f) w/ Proposed     17.7     23.8

The proposed regularization approaches always offered better performance on the SEAME corpus regardless of the amount of training data (rows (a) vs. (b); (c) vs. (d); (e) vs. (f)). Note that SpecAugment, a robust data augmentation technique capable of converting overfitting into underfitting, was applied to all experiments. Although this method already provided good results, the proposed methods further improved the ASR performance.

Overall, we demonstrated that the proposed Pinyin model and regularization methods improve low-resource training. These methods could also be applied to other tasks in limited-data situations.

3.5. Ablation Studies

To verify the efficacy of the proposed regularization methods, we conducted ablation studies under the low-resource scenario with only 10% of the training data. The performance of the baseline model is shown in row (a) of Table 4.

Table 4: Ablation studies of the proposed methods, shown in TER (%). EmbLS indicates the proposed word embedding label smoothing method, while MatReg is the projection matrix regularization method. The overall TER is in the All column.

      EmbLS   MatReg   devman   devsge    All
(a)     ✗       ✗       42.2     50.7    45.3
(b)     ✓       ✗       42.0     50.1    44.9
(c)     ✗       ✓       42.2     50.4    45.2
(d)     ✓       ✓       41.7     50.2    44.8

Applying only the proposed label smoothing method of Sec. 2.3 brought an improvement (row (b) vs. (a)), showing that leveraging textual knowledge from additional text data is beneficial under the low-resource scenario. With only the projection matrix regularization technique of Sec. 2.4 applied, the model obtained limited improvement (row (c) vs. (a)). Nevertheless, applying both proposed methods brought further gains (row (d)), reaching the best performance. This shows that the projection matrix regularization technique benefited from the proposed label smoothing method and closed the gap between the encoder and decoders. The ablation study showed that the two methods are compatible with each other and both contribute to improving the recognition performance.

4. Conclusion

This paper introduces a novel non-autoregressive ASR framework that adds a Pinyin-to-Mandarin decoder to Mask-CTC ASR to tackle the Mandarin-English code-switching speech recognition problem. We also propose word embedding label smoothing to include contextual information in the conditional masked language model, and a projection matrix regularization method to bridge the gap between the encoder and decoders. We demonstrated the effectiveness of our methods with exciting performance on the SEAME corpus. The new ASR framework and regularization methods have the potential to improve various speech recognition scenarios.
5. References

[1] S. Zhang, J. Yi, Z. Tian, J. Tao, and Y. Bai, "RNN-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition," in ISCSLP, 2021.
[2] K. Li, J. Li, G. Ye, R. Zhao, and Y. Gong, "Towards code-switching ASR for end-to-end CTC models," in ICASSP, 2019.
[3] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie, "Investigating end-to-end speech recognition for Mandarin-English code-switching," in ICASSP, 2019.
[4] Z. Zeng, Y. Khassanov, V. T. Pham, H. Xu, E. S. Chng, and H. Li, "On the end-to-end solution to Mandarin-English code-switching speech recognition," in INTERSPEECH, 2019.
[5] C. Du, H. Li, Y. Lu, L. Wang, and Y. Qian, "Data augmentation for end-to-end code-switching speech recognition," in SLT, 2021.
[6] Y. Long, Y. Li, Q. Zhang, S. Wei, H. Ye, and J. Yang, "Acoustic data augmentation for Mandarin-English code-switching speech recognition," Applied Acoustics, vol. 161, 2020.
[7] Y. Sharma, B. Abraham, K. Taneja, and P. Jyothi, "Improving low resource code-switched ASR using augmented code-switched TTS," in INTERSPEECH, 2020.
[8] C.-T. Chang, S.-P. Chuang, and H.-y. Lee, "Code-switching sentence generation by generative adversarial networks and its application to data augmentation," in INTERSPEECH, 2019.
[9] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in ICASSP, 2016.
[10] A. Graves, "Sequence transduction with recurrent neural networks," in ICML Workshop on Representation Learning, 2012.
[11] X. Zhou, E. Yılmaz, Y. Long, Y. Li, and H. Li, "Multi-encoder-decoder transformer for code-switching speech recognition," in INTERSPEECH, 2020.
[12] S. Dalmia, Y. Liu, S. Ronanki, and K. Kirchhoff, "Transformer-transducers for code-switched speech recognition," in ICASSP, 2021.
[13] Y. Lu, M. Huang, H. Li, J. Guo, and Y. Qian, "Bi-encoder transformer network for Mandarin-English code-switching speech recognition using mixture of experts," in INTERSPEECH, 2020.
[14] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, "Conformer: Convolution-augmented transformer for speech recognition," in INTERSPEECH, 2020.
[15] C. Liu, F. Zhang, D. Le, S. Kim, Y. Saraf, and G. Zweig, "Improving RNN transducer based ASR with auxiliary tasks," in SLT, 2021.
[16] E. A. Chi, J. Salazar, and K. Kirchhoff, "Align-refine: Non-autoregressive speech recognition via iterative realignment," arXiv preprint arXiv:2010.14233, 2020.
[17] N. Chen, S. Watanabe, J. Villalba, and N. Dehak, "Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition," arXiv preprint arXiv:1911.04908, 2019.
[18] Y. Higuchi, S. Watanabe, N. Chen, T. Ogawa, and T. Kobayashi, "Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict," in INTERSPEECH, 2020.
[19] Y. Higuchi, H. Inaguma, S. Watanabe, T. Ogawa, and T. Kobayashi, "Improved Mask-CTC for non-autoregressive end-to-end ASR," in ICASSP, 2021.
[20] Y. Sharma, B. Abraham, K. Taneja, and P. Jyothi, "Improving low resource code-switched ASR using augmented code-switched TTS," in INTERSPEECH, 2020, pp. 4771-4775.
[21] S.-P. Chuang, T.-W. Sung, and H.-y. Lee, "Training code-switching language model with monolingual data," in ICASSP, 2020.
[22] B. Naowarat, T. Kongthaworn, K. Karunratanakul, S. H. Wu, and E. Chuangsuwanich, "Reducing spelling inconsistencies in code-switching ASR using contextualized CTC loss," arXiv preprint arXiv:2005.07920, 2020.
[23] Z. Huang, P. Li, J. Xu, P. Zhang, and Y. Yan, "Context-dependent label smoothing regularization for attention-based end-to-end code-switching speech recognition," in ISCSLP, 2021.
[24] D.-C. Lyu, T.-P. Tan, E. S. Chng, and H. Li, "SEAME: a Mandarin-English code-switching speech corpus in South-East Asia," in INTERSPEECH, 2010.
[25] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006.
[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in CVPR, 2016.
[27] M. Song, Y. Zhao, S. Wang, and M. Han, "Word similarity based label smoothing in RNNLM training for ASR," in SLT, 2021.
[28] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in EMNLP: System Demonstrations, 2018.
[29] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015.
[30] A. Rousseau, P. Deléglise, and Y. Estève, "Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks," in LREC, 2014.
[31] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in O-COCOSDA, 2017.
[32] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," arXiv preprint arXiv:1607.04606, 2016.
[33] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in INTERSPEECH, 2018.
[34] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in INTERSPEECH, 2015.
[35] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in INTERSPEECH, 2019.
[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017.
[37] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang et al., "A comparative study on transformer vs RNN in speech applications," in ASRU, 2019.