Pronunciation augmentation for Mandarin-English code-switching speech recognition

Long et al. EURASIP Journal on Audio, Speech, and Music Processing (2021) 2021:34
https://doi.org/10.1186/s13636-021-00222-7
Yanhua Long1*, Shuang Wei1, Jie Lian1 and Yijie Li2

Abstract
Code-switching (CS) refers to the phenomenon of using more than one language within an utterance. It presents great challenges to automatic speech recognition (ASR) because of the switching within a single utterance, the pronunciation variation of the embedding language words, and the severe sparsity of training data. This paper focuses on the Mandarin-English CS ASR task. We aim to deal with the pronunciation variation and to alleviate the data sparsity of code-switches by using pronunciation augmentation methods. An English-to-Mandarin mix-language phone mapping approach is first proposed to obtain a language-universal CS lexicon. Based on this lexicon, an acoustic data-driven lexicon learning framework is further proposed to learn new pronunciations that cover the accents, mispronunciations, and pronunciation variations of the embedding English words. Experiments are performed on real CS ASR tasks. The effectiveness of the proposed methods is examined on conventional, hybrid, and recent end-to-end speech recognition systems. Experimental results show that both the learned phone mapping and the augmented pronunciations significantly improve the performance of code-switching speech recognition.
Keywords: Code-switching, Phone mapping, Pronunciation variation, Lexicon learning, Speech recognition

1 Introduction
Code-switching (CS) is prevalent in many multilingual communities. It is defined as the switching between two or more languages at the conversation, utterance, and sometimes even word level [1–3]. There are two different forms of code-switching: inter-sentential switching, where the alternation occurs between sentences, and intra-sentential switching, where it occurs within one sentence or word [3].
  The code-switching phenomenon is quite common around the world. For example, in India, Bengali-English or Bengali-Hindi-English mixing is very common in most people's daily speech [4]; in the USA and Switzerland, one can often hear Spanish-English and French-Italian code-switching speech [3]; and in Hong Kong, the combination of English and the native Cantonese is also very common [5]. In East Asia in particular, Mandarin-English code-switching is extremely popular, such as in Singapore, Malaysia, Mainland China, and Taiwan [6, 7]. In addition, code-switching is now frequently found in daily life, such as in professional activities, social media, consumer goods, or entertainment, where it is fairly common to hear people borrowing words from one language to use them in another [8, 9]. In recent years, research on code-switching automatic speech recognition (ASR) has received increasing attention: with the rapid development of speech technology, a variety of speech-driven interfaces to smart devices and other real AI applications have become mainstream, yet most state-of-the-art monolingual ASR systems fail when they encounter code-switched speech.

*Correspondence: yanhua@shnu.edu.cn
1 SHNU-Unisound Joint Laboratory of Natural Human-Computer Interaction, Shanghai Engineering Research Center of Intelligent Education and Bigdata, Shanghai Normal University, Shanghai, China
Full list of author information is available at the end of the article


  Compared with the recent significant successes in monolingual speech recognition [10, 11], ASR systems still struggle with code-switching speech, especially intra-sentential switching. To build a good code-switching ASR system, several challenges need to be handled, in both acoustic and language modeling. One of the major challenges is the pronunciation variation of the embedding language at the code-switches. Unlike the matrix language spoken by native speakers, in many CS scenarios most code-switching speakers are not familiar with the embedding language, so the words borrowed from it may be pronounced with a spectrum of accents and may be systematically or randomly mispronounced [8]. For example, in Mandarin-English code-switching utterances collected from Mainland China, most of the embedded English words may be Chinglish (Chinese English), and there is significant pronunciation variation between Chinglish and standard British or American English. Although many previous works have addressed the discrepancy and co-articulation effects between the mixed languages in CS, such as unit merging [8, 12–14] and language-universal acoustic modeling targets or frameworks [15–17], works that handle the pronunciation variation of embedding words are very limited.
  In this study, we concentrate on pronunciation augmentation techniques for acoustic modeling. These techniques are applied and examined to improve a Mandarin-English intra-sentential code-switching ASR system. We aim to deal with the pronunciation variation and to alleviate the data sparsity of code-switches using pronunciation augmentation methods. Our contributions are as follows: (1) An English-to-Mandarin mix-language phone mapping is proposed. We first validate the effectiveness of conventional data-driven phoneme sharing; however, direct one-to-one unit mapping only alleviates the embedding language training data sparsity to some extent, and the acoustic discrimination between the languages is ignored. Therefore, we propose a new English-to-Mandarin phone mapping to enhance the pronunciations of the universal code-switching lexicon. (2) An acoustic data-driven lexicon learning framework is proposed to learn new pronunciations that cover the accents, mispronunciations, and pronunciation variations of the embedding English words. Phone mapping alone cannot handle the pronunciation variation of mispronounced or accented embedding words and phrases, because the standard pronunciations in the monolingual lexicon cannot cover these variation cases; such words typically need to be expressed with another new pronunciation or phone set. Therefore, motivated by the acoustic data-driven lexicon learning in [18], we propose a novel pronunciation augmentation approach to produce possible new pronunciations for the embedding words at the code-switches. Based on the merged universal code-switching phone set, this approach integrates both expert knowledge and acoustic evidence from the training corpus. The effectiveness of the proposed techniques is examined not only for improving the conventional and hybrid ASR systems but also for enhancing the state-of-the-art end-to-end ASR systems. Our experiments on real code-switching ASR tasks show that the proposed methods are very effective at improving CS speech recognition, without any performance degradation on the matrix language (the Mandarin test set). This is very important for real ASR applications, where a CS ASR system may be used not only for recognizing Mandarin-English code-switching speech but also, simultaneously, for monolingual Mandarin speech recognition.
  The rest of the paper is organized as follows. A review of previous works is presented in Section 2. Section 3 presents the framework of the proposed pronunciation augmentation method. Section 4 describes the details of the three speech recognition systems. Experimental configurations are presented in Section 5. Results and performance analysis are presented in Section 6. Finally, we conclude and present future work in Section 7.

2 Review of previous works
The code-switching phenomenon is very natural in human communication; however, it poses several interesting challenges to the speech recognition community. Three major challenges have been the focus of the literature: (i) the heavy sparsity of code-switching training data, especially data covering intra-sentential code-switched points, in both acoustic and language modeling; (ii) the significant language discrepancy and co-articulation effects in code-mixed utterances, which impose a big gap between the acoustic modeling units of different languages; and (iii) the above-mentioned pronunciation variation of the embedding language at the code-switches. Any of these challenges can significantly affect all stages of an ASR system, including acoustic modeling, language modeling, and decoding.
  To handle the code-switching data sparsity problem, the most straightforward way is to create code-switching speech corpora. However, for Mandarin-English CS ASR, only a few small publicly available code-switching corpora can be found, such as the 80 h OC16-CE80 corpus provided for the Chinese-English mixlingual speech recognition challenge (MixASR-CHEN) [19] and the SEAME corpus [20, 21] with 63 h of spontaneous intra-sentential and inter-sentential code-switched speech. The matrix language in both OC16-CE80 and SEAME is Mandarin, and the amount of code-switched speech events in both corpora is extremely sparse.

In addition, given the mixed-language property and the high cost in time and money, it is arduous to create a large-scale CS corpus with gold-standard manual transcription [22]. Therefore, some works focus on developing automatic CS event detection systems to extract speech utterances with language switches. For example, in [23], a latent language space model and a delta-Bayesian information criterion were proposed to detect code-switching events. Rallabandi et al. [24] proposed to use an ASR system to detect code-switching style utterances from acoustics. In [25], the authors used frame-level language posteriors generated from a CS ASR system to detect the code-switches. Most of these works depend heavily on the performance of the underlying ASR systems and may not be practical for extracting CS utterances from large real audio archives; therefore, some recent works build CS event classifiers directly on deep neural networks [26]. In addition, traditional data augmentation techniques such as speed or volume perturbation [27, 28], SpecAugment [29], or audio synthesis [30] have proven effective at alleviating data sparsity to some extent in various speech and sound processing tasks; however, these techniques cannot handle the pronunciation variation problem of the embedding language in a CS ASR scenario.
  For code-switched text data augmentation in language modeling, there are also many works in the literature. The work in [14] used machine-translated text to augment the available code-switched text and found that the synthesized CS texts achieved significant reductions in perplexity. In [31], the authors increased the CS texts by integrating both syntactic and semantic features into the language modeling process. Generative adversarial networks with reinforcement learning were proposed in [32] to create CS text from monolingual sentences. Pratapa et al. [33] proposed to generate grammatically valid artificial CS data from parallel monolingual sentences under a linguistic equivalence constraint. In [4], a simple transliteration-based data augmentation approach was proposed to augment Bengali-English code-switched transcripts; the results showed that transliterating the code-mixed textual corpus into the matrix language and adding it to the training data significantly improved CS ASR performance. All of these previous works showed that artificially generated CS texts are very effective at alleviating the data sparsity problem in language modeling. However, from our previous observations on acoustic data augmentation in [26], it seems much more difficult to generate effective synthesized CS speech than CS text; this may be because real speech typically comes with complicated acoustic environments and intra- and inter-speaker variabilities, which remain very challenging for the speech synthesis community.
  To alleviate the data sparsity and co-articulation effects in code-switching acoustic modeling, previous works have mainly focused on (1) exploring universal acoustic modeling units, (2) developing new acoustic modeling strategies with multi-task or transfer learning, and (3) unsupervised or semi-supervised learning. For CS ASR systems with conventional architectures, such as the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) or the Deep Neural Network-HMM (DNN-HMM)-based hybrid framework, these works mainly focused on mix-language phone mapping and unit merging; for example, in [34–36], unit merging at the state, senone, and Gaussian levels was proposed for Mandarin-English ASR tasks, and in [8, 12, 14], different data-driven and knowledge-based phone merging and clustering algorithms were investigated to obtain a compact bilingual phone set. For the recently popular end-to-end (E2E) acoustic modeling, new universal acoustic modeling units for CS ASR were proposed to minimize the co-articulation and discrepancy between languages; for example, in [7, 15], character-subword units combining Chinese characters for Mandarin with byte-pair encoding (BPE) [37] subwords for English were constructed, and Shan et al. [38] adopted Mandarin characters plus English letters and wordpieces as E2E modeling units. Among new CS ASR strategies, most works aim to integrate individual language information into the final CS models. For example, two language-specific DFSMN subnets with a shared output layer were proposed to model CS acoustic information in [15]. In [7, 16, 38], multi-task joint training of language identification and CS E2E ASR was investigated, and in some works transfer learning was used to initialize the E2E encoders from large-scale monolingual corpora. In addition, to exploit large-scale untranscribed code-switching data, many efforts have applied unsupervised and semi-supervised learning to improve the CS acoustic model, e.g., [26, 39, 40]. All of these previous works have proven effective, either for alleviating data sparsity or co-articulation effects to some extent. However, they still cannot solve the embedding language pronunciation variation problem, because most current hybrid acoustic modeling approaches still rely heavily on monolingual lexicons with standard pronunciations. Moreover, in E2E Mandarin-English ASR scenarios, most subword or wordpiece extraction approaches only consider character sequence frequencies instead of acoustics, which at times produces inferior subword segmentation that might lead to erroneous speech recognition output [41].

  Therefore, in a few recent works, people have started to address the pronunciation variation problem in acoustic modeling. For example, to handle the accented pronunciation problem in a conventional DNN-based hybrid Mandarin-English code-switching ASR system, [17] proposed to generate native pronunciation representations of embedding language words in the matrix language phoneme set, using a combination of existing acoustic phone decoders and an LSTM-based grapheme-to-phoneme (G2P) model. However, for the popular E2E architectures in CS ASR, we have not found any approach in the literature that focuses on the pronunciation variation issue, although some recent works on monolingual E2E ASR have found that high-quality pronunciation lexicons developed by linguists can potentially improve the performance of E2E systems, such as the detailed investigations on evaluating the value of the pronunciation lexicon in E2E models in [42] and the pronunciation-assisted subword modeling method in [41].
  In this study, we also focus on the pronunciation variation problem in Mandarin-English CS ASR tasks. We aim to achieve better recognition accuracy on the embedding language phrases without any performance sacrifice on the matrix language speech, by using pronunciation augmentation techniques. Unlike the work in [17], we not only consider merging by acoustic similarity between the mixed languages but also consider enhancing the discriminative information between the different languages. New possible pronunciations for the embedding words at code-switches are generated automatically. Combined with expert information and G2P, the effectiveness of these new pronunciations is examined in both hybrid and E2E CS ASR systems.

3 The proposed pronunciation augmentation approach
In this section, we present the details of the proposed pronunciation augmentation approach. An English-to-Mandarin (E2M) mix-language phone mapping approach is first proposed to obtain a universal code-switching phone set. Based on this phone set, we further investigate a pronunciation augmentation strategy for embedding language words using acoustic data-driven lexicon learning.

3.1 E2M phone mapping
It is well known that there is a big language difference between Mandarin and English; however, the co-articulation and CS data sparsity problems make it important to use a universal phone set for building a successful DNN-HMM based hybrid ASR system. Instead of mapping all the embedding language phones to the matrix language phones as in previous works [8, 17], here we choose to cast light on the balance between the similarities and the differences of the two mixed languages. Only the subset of phones with a high similarity measure is merged.
  Figure 1 illustrates the proposed framework of E2M mix-language phone mapping. In this framework, we combine two effective conventional data-driven phone clustering methods with an expert correction to generate the final universal phone set for Mandarin-English code-switching ASR. Specifically, given the two monolingual lexicons and speech corpora, we first obtain a set of English-to-Mandarin phone mapping pairs Φ_Tag, using the Tag model-based phone mapping method proposed in [8]. Then, we perform TCM phone clustering to obtain another set of phone mapping pairs Φ_TCM, as proposed in [43]. These two sets of pairs are further combined and merged using the following rule:

    Φ_Com = {(P_E, P_M) | (P_E, P_M) ∈ Φ_Tag ∩ Φ_TCM}    (1)

where P_E is an English phone, P_M is a Mandarin phone, and (P_E, P_M) is an E2M phone mapping pair; Φ_Com contains those E2M pairs that lie in both the Φ_Tag and Φ_TCM sets. These pairs are taken as the high-confidence similarity phone mapping pairs, because they are derived from two different phone clustering methods.
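As a concrete illustration of the combination rule in Eq. (1), the sketch below intersects two hypothetical sets of candidate pairs; the phone pairs themselves are made-up placeholders, not the actual outputs of the Tag model or TCM clustering.

```python
# Minimal sketch of Eq. (1): keep only the E2M pairs proposed by BOTH
# data-driven clustering methods. All pairs below are hypothetical.
tag_pairs = {("S", "s"), ("F", "f"), ("M", "m"), ("AA", "a"), ("IY", "i")}
tcm_pairs = {("S", "s"), ("F", "f"), ("M", "m"), ("AA", "o"), ("L", "l")}

com_pairs = tag_pairs & tcm_pairs               # high-confidence pairs, Eq. (1)
disputed = (tag_pairs | tcm_pairs) - com_pairs  # mostly vowels, in practice

print(sorted(com_pairs))  # [('F', 'f'), ('M', 'm'), ('S', 's')]
print(sorted(disputed))   # candidates for the expert correction stage
```

The disputed pairs, which in our experiments are mostly vowel mappings, are exactly the ones passed to the expert correction stage described below.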
  The principle of the Tag model-based method is to share individual Gaussians across languages. All the Gaussians in the Tag model are clustered into a single phone-independent, language-independent, Kullback-Leibler divergence-based vector quantization (VQ) codebook. If any two phones in the Tag model have the majority of their Gaussians lying in common VQ clusters, then these phones are assumed to be similar [8].
 Fig. 1 The proposed E2M mix-language phone mapping framework

On the other side, the TCM-based method in [43] is a two-pass phone clustering method based on a co-occurrence confusion matrix. In each pass, Mandarin and English take turns as the source and the target language. The counts of co-occurrence between force-aligned target phone strings and the corresponding source phonetic transcriptions are then used to calculate the confusion probability between phone pairs. That is to say, these two methods merge the characteristics of the two languages in totally different ways. Therefore, we expect that comparing and integrating the outputs of the two methods can not only assure the confidence of the data-driven phone mapping but also provide some potential guidance for our further expert correction.
  Besides Φ_Com, as shown in Fig. 1, we also add an expert correction stage to improve the overall quality of the E2M phone projection. The motivation of this stage is that, in our extensive experiments, we find that Φ_Com contains most of the non-vowel phone mappings but only a few vowel phone mappings; most vowel mappings produced by the TCM and Tag model-based methods are very different. Therefore, we invited three linguistic experts from Shanghai Normal University to perform the vowel phone mapping correction. In this stage, all pairs in Φ_Com are directly fed into the final phone mapping set Φ_Final; only the remaining vowel phone mapping pairs in Φ_Tag and Φ_TCM are then checked and corrected by the linguists. A majority voting rule (2/3) is used to measure the reliability of the experts' corrections. This correction process depends not only on the linguistic knowledge of the experts but is also guided by the statistics of the confusion matrices obtained from the TCM and Tag model-based phone clustering processes. It is worth noting that we do not perform any mapping for pairs with low similarity measurements (e.g., DH and ZH in the CMU English lexicon); keeping these language-dependent phones may help to preserve the large acoustic and language discriminative information during acoustic modeling. Finally, all of the corrected phone mapping pairs, the pairs in Φ_Com, and the few language-dependent English phones are combined to produce the final universal CS phone set, which is used to map the English lexicon and obtain the final universal CS lexicon.
  In addition, according to the experts' knowledge and the different outputs of the above two phone clustering methods, we speculate that it may be useful to perform a one-to-two or one-to-many mapping for those English vowels whose similarity measure is larger than a threshold. This may help to handle the Mandarin accents in English words efficiently, because all the mappings from both phone clustering methods are data-driven results: a mapping pair with a very high confusion probability may still indicate similar acoustic characteristics, even if it does not lie in the Φ_Com set. Therefore, in this study, besides the one-to-one mapping, one-to-two mapping cases with a further expert correction are also investigated. Two one-to-two mapping examples for the English phones AA and IY are illustrated in Fig. 2.

 Fig. 2 One-to-two phone mapping cases
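To make the effect on the lexicon concrete, the sketch below expands a CMU-style pronunciation under a one-to-one and a one-to-two mapping; the Mandarin targets are hypothetical placeholders, since the actual AA and IY mappings come from the clustering outputs and the expert correction in Fig. 2.

```python
from itertools import product

# Hypothetical E2M mappings; real targets come from Fig. 2 and the experts.
po2o = {"AA": "a", "IY": "i"}                  # one-to-one
po2m = {"AA": ["a", "o"], "IY": ["i", "ei"]}   # one-to-two

def map_pron(phones, mapping):
    """Expand one English pronunciation into all mapped variants."""
    options = []
    for p in phones:
        alts = mapping.get(p, p)   # unmapped phones (e.g., DH) stay as-is
        options.append(alts if isinstance(alts, list) else [alts])
    return [" ".join(c) for c in product(*options)]

# CMU-style pronunciation of "father": F AA DH ER
print(map_pron(["F", "AA", "DH", "ER"], po2o))  # ['F a DH ER']
print(map_pron(["F", "AA", "DH", "ER"], po2m))  # ['F a DH ER', 'F o DH ER']
```

Under a one-to-two mapping, each affected lexicon entry simply gains an extra pronunciation variant; this is, in spirit, how a po2m lexicon accumulates the additional entries examined in Section 6.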
3.2 Pronunciation augmentation using lexicon learning
The idea of our pronunciation augmentation approach is motivated by the acoustic data-driven lexicon learning algorithm in [18], where it was proposed to automatically generate pronunciations for OOV words in a monolingual English ASR task. In this study, however, we borrow the lexicon learning idea to handle the pronunciation variation problem in Mandarin-English code-switching ASR. It is developed to generate informative new pronunciations only for the embedding language (English) words. These new pronunciations are then appended to the universal CS lexicon for ASR acoustic modeling.
  In fact, it is also possible for the system to produce new pronunciations for Mandarin words using lexicon learning. However, in our Mandarin-English code-switching tasks, Mandarin is the matrix language while English is the embedding language; all the Mandarin words are spoken by native speakers, and there is little pronunciation variation when a native speaker utters native speech. Therefore, we only consider the new pronunciations of English words and ignore those produced for Mandarin words.
  Figure 3 presents the whole framework of the proposed pronunciation augmentation using lexicon learning. It includes three main modules: (a) the CS lexicon preparation for all the words in the acoustic training data, (b) the collection of new pronunciation candidates, and (c) the pronunciation pruning and selection.

 Fig. 3 The proposed pronunciation augmentation framework for embedding language words in CS ASR using lexicon learning

This framework is very similar to the one proposed in [18]; however, many implementation details are specially designed for the targeted code-switching speech recognition task.
  In module (a), we first create a Mandarin-English CS lexicon by mapping the standard monolingual English lexicon with the phone mappings obtained in Section 3.1. Then, unlike the small seed lexicon in [18], we extend the whole CS lexicon by training a Sequitur G2P [44] model on it to produce initial pronunciations only for the OOV words in the training data. Based on this G2P-extended lexicon, in module (b), we train an acoustic model on all of the monolingual Mandarin, monolingual English, and code-switching speech data to perform the forced alignment of the training data, build a phone language model, and construct a phonetic decoder. This decoder is then used to generate a phonetic transcription for each specific word w present in the training data. Finally, for each individual word, we obtain many new pronunciation candidates (PDs) generated from the phonetic decoding, by aligning the phone sequences of the forced alignment and the phonetic transcription using a normalized relative frequency measurement. These PDs are combined with the G2P-extended lexicon into a large CS lexicon (Combined Lexicon) for the acoustic evidence collection in module (c). The "acoustic evidence" is defined as τ = p(O_u | w, b); it is the acoustic conditional data likelihood of utterance O_u, given that the pronunciation of word w is b. This acoustic evidence is derived from the per-utterance lattice pronunciation-posterior statistics, which are computed using lattices of training utterances produced with the Combined Lexicon and the existing acoustic model from module (b). Given a set of pronunciation candidates for a specific word w and the acoustic evidence τ per utterance, the pronunciation pruning and selection are performed using an iterative greedy pronunciation selection (IGPS) procedure with a per-utterance likelihood reduction criterion. With this procedure, the least important pronunciation candidates are iteratively removed in an efficient greedy fashion. For the details of the lexicon learning algorithm and other implementation tricks, please refer to [18].
  Furthermore, unlike the motivation of generating pronunciations for OOV words in [18], the lexicon learning idea in this study only plays a pronunciation augmentation role. Therefore, on top of the greedy pronunciation selection, we add an additional PD selection constraint to assure a higher quality of the new pronunciations:

    PD(w)_τ ≥ ρ · Avg.R(w)_τ    (2)

where PD(w)_τ is the "acoustic evidence" soft count of a pronunciation of word w derived from phonetic decoding after the last iteration of IGPS, Avg.R(w)_τ is the average soft count of the word's pronunciations in the reference source lexicon (the G2P-extended CS lexicon), and ρ ∈ [0, 1] is the statistical pruning factor.
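The constraint in Eq. (2) amounts to a simple threshold on the accumulated soft counts, as in the sketch below; the words, pronunciations, and counts are invented for illustration, and the real counts come from the lattice posterior statistics described above.

```python
# Minimal sketch of the Eq. (2) selection constraint, assuming the soft
# counts have already been accumulated from lattice posteriors (all the
# numbers and pronunciations here are made up).
rho = 0.4  # statistical pruning factor, in [0, 1]

pd_counts = {                      # surviving PD candidates after IGPS
    ("coffee", "k a f i"): 6.2,
    ("coffee", "k ao f ei"): 1.1,
}
avg_ref_count = {"coffee": 5.0}    # Avg.R(w): average soft count of the
                                   # word's reference-lexicon pronunciations

augmented = {
    (word, pron): c
    for (word, pron), c in pd_counts.items()
    if c >= rho * avg_ref_count[word]          # Eq. (2)
}
print(augmented)  # {('coffee', 'k a f i'): 6.2}
```

Raising ρ prunes more aggressively, which matches the shrinking average pronunciation counts reported later in Table 4.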
  It is worth noting that only the training utterances containing English words are taken into account during the acoustic evidence collection and PD pruning stages, because in our CS task we only aim to augment the pronunciations of English words to address their heavy pronunciation variation. After the greedy PD pruning process, only the informative subset of PDs with acoustic evidence is selected for each word. These informative pronunciations are then used to augment the source CS lexicon for acoustic modeling. For the words in the target vocabulary that are not seen in the acoustic training data, or for which no pronunciation is produced during lexicon learning, we generate pronunciations by re-training a CS G2P model on the already augmented lexicon, instead of reusing the initial G2P pronunciation candidates from module (a).

4 Speech recognition systems
Three types of ASR systems are used to evaluate the proposed method: the conventional GMM-HMM system, the state-of-the-art lattice-free maximum mutual information (LF-MMI)-based hybrid system, and the latest Transformer-based E2E system.

4.1 GMM-HMM system
Our GMM-HMM systems are built using the open-source Kaldi speech recognition toolkit [45]. The 13-dimensional mel-frequency cepstral coefficients (MFCC) plus one-dimensional pitch, together with their first- and second-order differential coefficients, are used as the input acoustic features to train the initial GMM-HMM acoustic models. Based on the initial model, the acoustic features are then spliced over 9 frames and projected to a 40-dimensional subspace using linear discriminant analysis (LDA). A further maximum likelihood linear transform (MLLT) is applied to the projected features for better orthogonality. These transformed features are then used to refine the GMM-HMM parameters. After decision tree clustering, the final models have around 6000 context-dependent tied states with around 32 Gaussians per state (different lexicons lead to different numbers of states). Based on this LDA+MLLT model, we further perform speaker adaptive training (SAT) with constrained maximum likelihood linear regression to adapt the Gaussian mixture model parameters. After adapting the parameters, a re-alignment is performed to improve the LDA+MLLT+SAT system. The framework and implementation details of our GMM-HMM system training follow the example recipe egs/swbd/s5c in the Kaldi main branch.
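The splice-and-project front-end described above can be illustrated in a few lines; the projection matrix below is random, standing in for the LDA+MLLT transform that Kaldi estimates from tied-state alignments, and the 14-dimensional static features (13 MFCC + 1 pitch, deltas omitted) are simulated.

```python
import numpy as np

def splice(feats, context=4):
    """Stack each frame with its +/-context neighbors (edges repeat)."""
    T = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 14))   # 100 frames, 13 MFCC + 1 pitch
spliced = splice(feats)                  # (100, 126): 9 frames x 14 dims
proj = rng.standard_normal((126, 40))    # stand-in for the LDA+MLLT matrix
reduced = spliced @ proj                 # (100, 40) features for the GMM
print(spliced.shape, reduced.shape)
```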
4.2 LF-MMI based hybrid system
The LF-MMI based hybrid ASR framework was first proposed in [46]. Because of its good performance and excellent generalization ability, it has become a mainstream technology for speech recognition, both in industry and in the academic community. The LF-MMI based hybrid acoustic model is a time-delay neural network (TDNN) with a multi-splice sub-sampling topology. Povey et al. [46] proposed to train it in a purely sequence-discriminative way using the lattice-free version of the MMI criterion. Compared with classical TDNN training under the cross-entropy criterion, three major modifications were introduced for LF-MMI TDNN training:

  • Training from scratch without initialization from a cross-entropy system.
  • The use of a threefold reduced frame rate and a simpler HMM topology.
  • Limiting the range of time frames where supervision labels can appear by using finite state acceptors (FSA).

  In addition, unlike the denominator lattices in classical MMI, the lattices in the LF-MMI architecture are first generated from a phone-level n-gram language model and then compiled into utterance-specific FSA graphs for TDNN training. Furthermore, to avoid over-fitting during training, the cross-entropy objective function as well as the leaky HMM are applied as extra regularization techniques. In this study, we choose a TDNN-LSTM hybrid structure presented in [47] as our acoustic model because of its better performance.

4.3 Transformer-based E2E system
With the great success of the Transformer, a recurrence-free sequence-to-sequence model proposed for machine translation [48], more and more research in the speech community has started to focus on it. Recently, the Speech-Transformer [49] successfully introduced the Transformer to the ASR task. With its encoder-decoder architecture and multi-head self-attention mechanism for learning contextual and positional dependencies, the Transformer has proven very successful at achieving competitive speech recognition performance, and it has become a state-of-the-art E2E ASR system.
  Compared with the above LF-MMI based hybrid system, which consists of separate pronunciation, acoustic, and language models, the Transformer-based ASR system is a single neural network that implicitly models all three. Lacking a pronunciation lexicon, most E2E systems model the output text sequence in units finer than whole words, such as characters, BPE subwords, and wordpieces. Therefore, in this study, we also investigate how to improve a Transformer-based E2E system for Mandarin-English code-switching ASR by introducing the augmented CS pronunciations to assist the BPE subword modeling. This is motivated by the recent work on pronunciation-assisted subword modeling (PASM) proposed in [41]. We hope that PASM can generate linguistically meaningful subwords for the embedding language (English) by analyzing the training text corpus and our augmented CS lexicon.
  The recipe of PASM word segmentation1 was used in our experiments. All the E2E ASR experiments were performed using the open-source end-to-end speech recognition toolkit ESPnet [50].
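For contrast with PASM, the toy example below performs one merge step of standard frequency-driven BPE: the merge decision uses only character co-occurrence counts and no acoustic or pronunciation information, which is exactly the limitation that PASM addresses through lexicon alignments. The word counts are invented.

```python
from collections import Counter

# One frequency-only BPE merge step over a toy word-frequency table.
words = {"coffee": 4, "coffees": 2, "copy": 3}

pairs = Counter()
for w, freq in words.items():
    for a, b in zip(w, w[1:]):       # adjacent symbol pairs
        pairs[(a, b)] += freq

best = max(pairs, key=pairs.get)
print(best, pairs[best])             # ('c', 'o') 9 -> merge "co" first
```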
1 https://github.com/hainan-xv/PASM.git
2 https://www.unisound.com/

5 Experimental configurations
5.1 Datasets
The corpus used to build our code-switching ASR systems is provided by Unisound Corporation2 and includes 186 hours (hrs) of Mandarin-English code-switching speech, 500 hrs of Mandarin, and 100 hrs of English (with Chinese accent) monolingual speech. We use "Chilish" to denote this accented English set. All of the utterances are conversational speech or speech collected from voice search, and the speakers are all from Mainland China. In our training set, there is a heavy imbalance between the amounts of Mandarin and Chilish data, because in real applications it is much harder to collect Chilish speech than Mandarin.
  As our goal is to improve the ASR performance on the embedding language without any performance sacrifice on the matrix language, we designed three test sets for evaluation: a 3 hrs pure Mandarin speech test set (Mandarin), a 3.6 hrs Mandarin-English code-switching test set (1.6 hrs from voice search, 2.0 hrs of general conversational speech), and a pure 1.6 hrs Chilish test set.

5.2 Neural network structures and acoustic features
The experimental configurations of the GMM-HMM systems have already been presented in Section 4.1. Unlike a typical TDNN, the LF-MMI based TDNN-LSTM hybrid system is a mixed architecture of LSTMPs and sub-sampled TDNNs, using 3 fast-LSTMP layers interleaved with 7 spliced TDNN layers. More details of this architecture can be found in the TDNN-LSTMP structure used for the SWBD corpus in [47]. The frame-level alignments and lattices for TDNN-LSTM model training were generated directly from the GMM-HMM system. The recipe swbd/s5c/local/chain/run_tdnn_lstm.sh in the Kaldi repository is used for our hybrid model training, but without any i-vectors or other speaker adaptation techniques. The language model used in both the GMM-HMM and hybrid systems is the same trigram LM built on all of the training data texts.
  As in most ASR example recipes in ESPnet, for the Transformer-based E2E systems we use 12 encoder and 6 decoder blocks with a 2048-dimensional feed-forward inner layer. The model dimension d_model is set to 256, and the attention head number h is set to 4. Both the hybrid and E2E acoustic models use the same 80-dimensional filter-bank features plus 3-dimensional pitch (pitch and its first and second derivatives). All input acoustic features are extracted using a 25-ms Hamming window with a 10-ms frame shift. In addition, to enhance model robustness, we apply both SpecAugment and speed perturbation data augmentation during all hybrid and E2E system training.
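For reference, the reported Transformer dimensions can be instantiated with the stock PyTorch module as below. This is only a stand-in for the ESPnet model, which additionally includes convolutional subsampling, positional encoding, and CTC joint training.

```python
import torch
import torch.nn as nn

# Core encoder-decoder with the Section 5.2 hyper-parameters:
# 12 encoder blocks, 6 decoder blocks, d_model=256, h=4, FFN dim 2048.
model = nn.Transformer(
    d_model=256,
    nhead=4,
    num_encoder_layers=12,
    num_decoder_layers=6,
    dim_feedforward=2048,
    batch_first=True,
)

src = torch.randn(8, 200, 256)  # (batch, frames, d_model) after front-end
tgt = torch.randn(8, 30, 256)   # (batch, tokens, d_model) embedded targets
print(model(src, tgt).shape)    # torch.Size([8, 30, 256])
```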
5.3 Lexicon and performance measure
The monolingual Mandarin lexicon (MLex) used in this study is an ARPAbet-based tonal lexicon provided by Unisound Corporation. It has a 109 phoneme/toneme phone set and covers more than 200k Mandarin words. The monolingual English lexicon is the CMU open-source English dictionary with 39 phonemes3. In the experiments, these two monolingual lexicons are first combined and mapped into a universal CS lexicon, and then further augmented using the proposed E2M mix-lingual phone mapping and lexicon learning approaches to handle the pronunciation variation problem in the code-switching ASR tasks. As the system performance measure, we use the token error rate (TER), where a "token" refers to a Mandarin character or an English word, respectively.
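Because the hypothesis and reference mix scripts, TER scoring first tokenizes Mandarin into single characters and English into whitespace-delimited words, then applies the usual edit-distance computation; a self-contained sketch of this metric follows.

```python
import re

def tokenize(text):
    # CJK characters become single tokens; Latin chunks stay whole words
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)

def edit_distance(ref, hyp):
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1][-1]

ref = tokenize("我想喝coffee")   # ['我', '想', '喝', 'coffee']
hyp = tokenize("我想喝咖啡")     # ['我', '想', '喝', '咖', '啡']
print(100.0 * edit_distance(ref, hyp) / len(ref))  # TER in percent: 50.0
```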
6 Results and discussions
6.1 Acoustic difference between native and non-native English speech
It is well known that there is a big acoustic difference between native English and non-native English speech. In Table 1, we performed an experiment to validate this difference. The 1.6 hrs Chilish test set is used for evaluation. The 100 hrs of Chilish training data are used to train the GMM-HMM model for non-native monolingual English, while the native English model is trained on the 460 hrs LibriSpeech data, following the Kaldi LibriSpeech recipe.

Table 1 TER% results on the pure Chilish test set for GMM-HMM systems

Training data           Chilish
LibriSpeech (460 hrs)   57.3
Chilish (100 hrs)       25.9

  Both systems in Table 1 are trained using the CMU lexicon. From the big TER gap on the Chilish test set, the significant acoustic difference between native English and Chilish is clear, even though the native acoustic model is trained on a larger corpus. However, in the literature, most English-related ASR systems still use the CMU pronunciations, which are specially designed for native English. This indicates that there may be a chance to improve ASR performance by integrating the acoustic variations into the lexicon pronunciations.

6.2 Evaluating E2M phone mapping in conventional ASR systems
Table 2 compares the effectiveness of the proposed E2M mix-lingual phone mapping with different conventional phone mapping methods. The first two systems, with only the MLex or the CMU lexicon, are monolingual Mandarin and English systems respectively; they are separately trained on the 500 hrs Mandarin and the 100 hrs Chilish training data. All other systems in the table are code-switching GMM-HMM systems trained on the total 786 hrs of training data. "CMU+MLex" represents directly concatenating the phone sets and lexicons of CMU and MLex without any phone sharing or mapping.

3 CMU lexicon: http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict

Table 2 Performance comparison (TER%) on the Mandarin, code-switching, and Chilish test sets from GMM-HMM systems with different phone mapping methods

Phone mapping method    Mandarin   Code-switching   Chilish
Only MLex               18.2       –                –
Only CMU                –          –                25.9
CMU+MLex (baseline)     17.6       32.3             26.7
TCM [43]                17.9       25.6             27.2
Tag model-based [8]     17.5       25.2             27.0
E2M-po2o                17.5       23.7             27.0

  "TCM" and "Tag model-based" are the conventional data-driven phone clustering methods proposed in [43] and [8] respectively. "E2M-po2o" is our proposed E2M phone mapping, in which only part of the English phonemes are involved in the English-to-Mandarin phone mapping, and in this case only as one-to-one mappings.
  From the TERs in Table 2, three observations can be made: (a) Phone mapping is very effective at improving the code-switching ASR performance, whether using TCM, Tag model-based phone clustering, or the proposed E2M. Both conventional phone clustering methods achieve more than a relative 20% TER reduction over the baseline, and a further relative 6% TER reduction is obtained by the proposed E2M. In the E2M framework, this further gain may come from the trade-off between language-independent acoustic commonness and language-dependent characteristics. (b) More Chilish words are mis-recognized by the universal code-switching acoustic model than by the pure English model, indicating that phone mapping introduces more acoustic confusion between Chilish and Mandarin than a monolingual setup does. (c) On the Mandarin test set, all of the CS models achieve better performance than the system trained on pure Mandarin data; relative TER reductions of 1.6–3.8% are obtained, owing to the increased Mandarin data included in the CS training set. In addition, there is almost no performance degradation on the Mandarin test set when using the phone mapping or clustering methods over the baseline.
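The relative reductions quoted in observation (a) follow directly from the code-switching column of Table 2, as this quick check shows.

```python
# Relative TER reductions on the code-switching test set (Table 2).
baseline, tcm, tag, e2m = 32.3, 25.6, 25.2, 23.7

rel = lambda old, new: 100.0 * (old - new) / old
print(f"TCM vs baseline: {rel(baseline, tcm):.1f}%")   # 20.7%
print(f"Tag vs baseline: {rel(baseline, tag):.1f}%")   # 22.0%
print(f"E2M vs Tag:      {rel(tag, e2m):.1f}%")        # 6.0%
```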
  Figure 4 shows the TERs calculated separately on the Mandarin and English parts of the code-switching test set. "CS" is the whole code-switching test set as in Table 2; "CS-Mandarin" and "CS-English" denote its Mandarin and English parts. Compared with the CMU+MLex baseline, the performance on both the Mandarin and the English speech is improved by the proposed E2M. Moreover, it is clear that the performance gain on the English part is much larger than that on the Mandarin part. This indicates that the Chinese accent characteristics of the embedded English words are well learned and the acoustic training events at code-switches are heavily increased by the proposed E2M; the recognition errors within code-switches are significantly reduced.
  Table 3 shows the performance comparison of the state-of-the-art LF-MMI based hybrid systems with different E2M phone mapping strategies. From the TER numbers in Tables 2 and 3, we can see that these hybrid systems achieve significant ASR performance gains (more than a relative 45%) over the conventional GMM-HMM systems.
 Fig. 4 TER% on the Mandarin and English parts of the code-switching test set with different phone mapping methods

Table 3 Performance (TER%) comparison of the proposed E2M                    entries provided more possible competitive candidates in
in LF-MMI based hybrid systems                                               the WFST paths. Moreover, it is clear to see that the recog-
Phone mapping method         Mandarin      Code-switching       Chilish      nition of pure Mandarin speech does not significantly
CMU+MLex (baseline)          5.5           12.9                 14.9         affected by different phone mapping methods, while on
E2M-po2o                     5.2           11.5                 15.4         the pure Chilish test set, a relative 6.7% performance
E2M-po2m                     5.8           12.2                 15.9
                                                                             degradation is obtained. This indicates that the acoustic
                                                                             model with a Mandarin dominant phone set and train-
E2M-po2o-T, E2M-po2m-D       5.4           10.8                 15.4
                                                                             ing data is not suitable for a pure English ASR task. In
                                                                             addition, the TER gains on code-switching test set indi-
Table 3 shows the performance comparison of the state-of-the-art LF-MMI-based hybrid systems with different E2M phone mapping strategies. From the TER numbers in Tables 2 and 3, we can see that these hybrid systems achieve significant ASR performance gains (more than a relative 45%) over the conventional GMM-HMM systems. "E2M-po2m" is the proposed E2M phone mapping in which a portion of the English phones (∼38%) are mapped to two or three Mandarin phones. "E2M-po2o-T, E2M-po2m-D" means that "E2M-po2o" is used for acoustic model training, while "E2M-po2m" is used for decoding; a sketch of the two mapping modes follows below. Except for the last line of Table 3, all systems use the same lexicon for both acoustic model training and decoding.
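To make the two mapping modes concrete, the following Python sketch applies a one-to-one (po2o) and a one-to-many (po2m) phone map to a CMU-style lexicon entry. The phone symbols and mapping pairs here are invented purely for illustration; they are not the actual learned E2M mapping.

# Hedged sketch: building mapped lexicon entries from CMU-style pronunciations.
# The toy maps below stand in for the learned E2M mapping.
po2o = {"K": ["k"], "AE": ["a"], "T": ["t"]}        # every English phone -> one Mandarin phone
po2m = {"K": ["k"], "AE": ["a", "e"], "T": ["t"]}   # some English phones -> 2-3 Mandarin phones

def map_pronunciation(cmu_phones, phone_map):
    # Replace each English phone with its mapped Mandarin phone sequence.
    mapped = []
    for ph in cmu_phones:
        mapped.extend(phone_map[ph])
    return mapped

word, pron = "CAT", ["K", "AE", "T"]                # toy CMU-style entry
print(word, map_pronunciation(pron, po2o))          # training lexicon entry (E2M-po2o-T)
print(word, map_pronunciation(pron, po2m))          # decoding lexicon entry (E2M-po2m-D)

In this setup, the acoustic model is trained on the unambiguous one-to-one units, while the richer one-to-many entries only widen the decoding search space, matching the "E2M-po2o-T, E2M-po2m-D" configuration above.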
Compared with the large gains obtained from phone mapping in the GMM-HMM systems, E2M-po2o obtains only a relative 10.8% TER reduction on the code-switching test set over the CMU+MLex baseline. In addition, comparing the code-switching results in the last two lines of Table 3, it is interesting to find that using the one-to-many E2M mapping in training brings more mix-lingual acoustic confusion than the one-to-one phone mapping. Given the fixed universal phone set, applying a proper one-to-many phone mapping only in the decoding stage brings a further relative 6.1% TER reduction over the E2M-po2o method. This may be because the additional pronunciation entries provide more competitive candidate paths in the WFST decoding graph. Moreover, it is clear that the recognition of pure Mandarin speech is not significantly affected by the different phone mapping methods, whereas on the pure Chilish test set a relative 6.7% performance degradation is observed. This indicates that an acoustic model built on a Mandarin-dominant phone set and training data is not suitable for a pure English ASR task. In addition, the TER gains on the code-switching test set indicate that the phone mapping also significantly increases the training data for the bi-phone acoustic modeling units that cover code-switches in the hybrid acoustic modeling.

6.3 Examination of lexicon learning-based pronunciation augmentation
6.3.1 Performances in conventional ASR systems
After obtaining the universal E2M-po2o mix-lingual lexicon, we augment the pronunciations of the English words in the training data using the proposed lexicon learning-based framework in Fig. 3. The recipe wsj/s5/steps/dict/learn_lexicon_greedy.sh in the Kaldi main branch is modified to perform our CS ASR lexicon learning. During the iterative greedy pronunciation selection, we tune the scaling factors α = 0.05, 0.01, 0.001 and the smoothing factors β = 10, 5, 10 to compute the likelihood-reduction thresholds that control the degree of pronunciation pruning for candidates from the phonetic decoding, the G2P, and the source lexicon, respectively; a hedged sketch of this thresholding is given below.
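As a rough illustration of this thresholding (the exact criterion follows the modified Kaldi recipe, which we do not reproduce here), the sketch below assumes a per-source threshold of the form α·(β + soft count); that functional form is our simplifying assumption, not the recipe's exact formula.

# Hedged sketch of per-source likelihood-reduction thresholds for the greedy
# pronunciation selection. The threshold form alpha * (beta + soft_count) is
# an illustrative assumption, not the exact formula of the Kaldi recipe.
ALPHA = {"phonetic_decoding": 0.05, "g2p": 0.01, "source_lexicon": 0.001}
BETA = {"phonetic_decoding": 10.0, "g2p": 5.0, "source_lexicon": 10.0}

def keep_pronunciation(source, soft_count, likelihood_reduction):
    # A candidate survives the greedy pruning if removing it would reduce the
    # training-data likelihood by more than its source-dependent threshold.
    threshold = ALPHA[source] * (BETA[source] + soft_count)
    return likelihood_reduction >= threshold

print(keep_pronunciation("g2p", soft_count=2.0, likelihood_reduction=0.1))  # True: 0.1 >= 0.07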

Fig. 5 Samples of learned pronunciations for embedded English words. "Ref" and "Learned" are the word-pronunciation pairs in the source ("G2P Extended CS Lexicon") and the learned lexicon of Fig. 3, respectively
Table 4 Performance (TER%) comparison of the proposed lexicon learning-based pronunciation augmentation on conventional ASR systems

System             ρ      Lexicon (avg. #prons)   Mandarin   Code-switching   Chilish
GMM-HMM            –      Source Lexicon          17.5       23.7             27.0
                   0.1    + LexLearn (3.65)       17.8       22.8             26.4
                   0.4    + LexLearn (3.07)       17.6       22.1             26.2
                   0.7    + LexLearn (2.41)       17.6       21.6             26.0
LF-MMI hybrid      –      Source Lexicon          5.2        11.5             15.4
LF-MMI hybrid-1    0.7    + LexLearn (2.41)       5.7        10.9             15.2
LF-MMI hybrid-2    0.7    + LexLearn (2.41)       5.5        10.4             14.8

"avg. #prons" is the average number of pronunciations per English word in the training data; ρ is the pruning factor of the acoustic soft counts in Eq. (2)

Figure 5 demonstrates new pronunciation samples learned for five embedded English words. It is clear that the pronunciations learned from acoustic evidence tend to be more Chilish than those obtained from the one-to-one E2M mapping alone. Augmenting the lexicon with these learned pronunciations may therefore give the acoustic model a chance to capture the accent and mispronunciation properties of the embedded English words.
Table 4 presents the performance comparison of the proposed lexicon learning-based pronunciation augmentation on conventional ASR systems. The "Source Lexicon" is the universal CS lexicon with E2M-po2o phone mapping. "+ LexLearn" means the "Source Lexicon" augmented with the new pronunciations of English words learned from the proposed CS lexicon learning framework. The same acoustic modeling and decoding lexicon is used for each GMM-HMM system and for the "LF-MMI hybrid-1" system in this table. The "LF-MMI hybrid-2" system, however, uses different lexicons, with E2M-po2o phone mapping for acoustic modeling and E2M-po2m for decoding, and both lexicons are augmented with the same new pronunciations as in the "LF-MMI hybrid-1" system. The "LF-MMI hybrid" system is the same as the "E2M-po2o" system in Table 3.
From the results in Table 4, we can see that with the augmented CS lexicon the GMM-HMM systems obtain relative TER reductions of 3.8 to 8.8% on the code-switching test set and 2.2 to 3.7% on the Chilish test set. Taking the source pronunciations as the reference, introducing the acoustic soft-count pruning factor effectively helps to select new pronunciations that are backed by sufficient acoustic evidence. Based on the outputs of the standard iterative greedy pronunciation selection, we find that ρ = 0.7 achieves the best results. Furthermore, comparing the hybrid systems in Table 4 with their baselines in Table 3, we still obtain relative TER reductions of 5.2% (hybrid-1) and 3.7% (hybrid-2) on the code-switching test set by using the augmented lexicon, and consistently, the TER on Chilish is also reduced from 15.4 to 14.8%. However, the gains obtained on the LF-MMI-based hybrid systems are clearly much smaller than those obtained on the GMM-HMM systems, which indicates that the hybrid system has a better acoustic modeling ability to deal with the pronunciation variations of the embedding language than the traditional GMM-HMM system.
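The ρ pruning itself can be pictured with a hedged sketch: we assume here, for illustration only (the actual criterion is Eq. (2)), that a learned pronunciation is kept when its acoustic soft count reaches at least ρ times that of the word's best-supported pronunciation, so a larger ρ keeps fewer pronunciations per word, consistent with the shrinking "avg. #prons" column in Table 4.

# Hedged sketch of acoustic soft-count pruning with factor rho. The per-word
# normalization by the best pronunciation's count is an illustrative
# assumption; see Eq. (2) for the actual criterion.
def prune_pronunciations(prons_with_counts, rho):
    # prons_with_counts: list of (pronunciation, acoustic soft count) for one word.
    best = max(count for _, count in prons_with_counts)
    return [pron for pron, count in prons_with_counts if count >= rho * best]

candidates = [("f l ai t", 9.2), ("fu l ai te", 3.1), ("f lai te", 0.8)]  # toy counts
print(prune_pronunciations(candidates, rho=0.1))  # keeps 2 of 3
print(prune_pronunciations(candidates, rho=0.7))  # keeps only the best-supported one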
6.3.2 Performances in transformer-based ASR systems
Table 5 compares the results of different target acoustic modeling units in our E2E Transformer-based code-switching ASR systems. The targets of the baseline system are a set of Mandarin characters and English letters plus a blank symbol, which leads to an output dimension of 5257. In addition, we also adopt the widely used BPE subword segmentation to generate 2000 subwords as the English acoustic modeling units; the system with "Character-BPE" therefore has a total of 7230 acoustic modeling targets. The systems with "Character-PASM (*)" modeling units examine the effect of our augmented CS lexicons on English subword generation. In these systems, the Mandarin character units are the same as in the baseline, but the English targets are produced from the universal CS lexicons using the pronunciation-assisted subword modeling (PASM) method. Three CS lexicons are investigated: the basic "CMU+MLex", the lexicon with E2M-po2m phone mapping, and the final augmented CS lexicon with acoustic lexicon learning ("Source Lexicon + LexLearn (2.41)" in Table 4). For a clear comparison, we keep the number of subword units the same for BPE and PASM.

Table 5 Results (TER%) of Transformer-based E2E CS ASR systems with different target acoustic modeling units

Modeling units                                      Mandarin   Code-switching   Chilish
Baseline (character-letter)                         5.2        11.6             15.9
Character-BPE                                       5.3        11.1             15.3
Character-PASM (CMU+MLex)                           5.2        10.7             15.3
Character-PASM (E2M-po2m)                           5.1        10.5             15.1
Character-PASM (Source Lexicon + LexLearn (2.41))   5.2        10.1             14.8
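The paper does not name the toolkit used for the BPE segmentation; assuming the widely used SentencePiece implementation, generating the 2000 English subwords might look like the sketch below, where the corpus path and model names are placeholders.

# Hedged sketch: training 2000 English BPE subwords. SentencePiece and the
# file names are assumptions for illustration; the paper does not specify
# its BPE toolkit.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="english_training_text.txt",  # placeholder: English side of the CS training text
    model_prefix="en_bpe2000",
    vocab_size=2000,
    model_type="bpe",
    character_coverage=1.0,
)
sp = spm.SentencePieceProcessor(model_file="en_bpe2000.model")
print(sp.encode("flight", out_type=str))  # spelling-driven pieces; no pronunciation information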
By comparing the E2E results in Table 5 with the best result of the LF-MMI hybrid systems in Table 4, we see a slight improvement on the pure Mandarin test set, while the performance on code-switching and Chilish speech degrades significantly; on the CS test set, for example, the TER degrades from 10.4 to 11.6%. However, using BPE subwords instead of simple letters as the English E2E acoustic modeling targets yields relative TER gains of 4.3% and 3.8% on the code-switching and Chilish test sets, respectively, indicating that BPE subwords provide a better acoustic representation than simple letters. When we use PASM instead of BPE to produce the English targets, the TER on CS is reduced from 11.1 to 10.7% with only the CMU pronunciations for English. This observation is consistent with the one reported in [41], and it indicates that, unlike BPE, which takes only the spelling into consideration, leveraging the pronunciation information of a word during subword segmentation produces better E2E acoustic modeling units. Furthermore, when we use the phone-mapped universal CS lexicon, a further relative 1.8% TER reduction on CS is obtained. In addition, comparing the last two lines of Table 5 on both the code-switching and Chilish test sets shows that the E2E CS ASR system also benefits from the learned new pronunciations. These observations indicate that the learned and phone-mapped pronunciations provide more phonetically meaningful subwords for the embedded English words; the simplified PASM sketch below illustrates the idea.
7 Conclusion
This paper presents a pronunciation augmentation framework based on a universal Mandarin-English code-switching lexicon. The framework is proposed to handle the accented pronunciations and randomly mispronounced words of the embedded English in the code-switching speech recognition task. We first examine the proposed English-to-Mandarin phone mapping on both the conventional GMM-HMM and the state-of-the-art LF-MMI-based hybrid ASR systems. Experimental results show that we obtain more than a relative 10% TER reduction on the code-switching test set by using the universal CS lexicon with the proposed phone mapping strategy. In addition, the comparison results on the GMM-HMM systems show that this strategy provides much larger ASR performance gains than two conventional phone mapping methods in the literature, by balancing acoustic similarity and language discrimination between Mandarin and English.

Furthermore, based on the universal CS lexicon, we validate the proposed pronunciation augmentation framework from two main aspects. One is directly using the augmented pronunciations to train conventional GMM-HMM and LF-MMI hybrid systems. The other is using the augmented pronunciations to assist the subword segmentation, generating better acoustic modeling targets for the end-to-end Transformer-based ASR system. The extensive results in Section 6.3.2 show that both the proposed phone mapping and the pronunciation augmentation framework are also effective solutions for improving E2E CS ASR performance.

Further investigation of new pronunciation-assisted methods that fully exploit phonetically meaningful information for improving E2E CS ASR systems, as well as validation of the proposed methods on other CS corpora, will be our future work.

Acknowledgements
The authors thank the associate editor and the anonymous reviewers for their constructive comments and useful suggestions.

Authors' contributions
The authors' contributions are equal. All authors read and approved the final manuscript.

Authors' information
Yanhua Long received her Ph.D. degree from the Department of Electronic Engineering and Information Science, University of Science and Technology of China (USTC), Hefei, China, in 2011. From July 2008 to February 2009, she was a visiting student in the Department of Human Language Technology at the Institute for Infocomm Research, Singapore. From September 2009 to February 2010, she was an intern in the speech group of Microsoft Research Asia. From October 2011 to May 2013, she was a Research Associate in the Machine Intelligence Lab at Cambridge University, UK. Since June 2013, she has been with Shanghai Normal University, Shanghai, China, as an Associate Professor. Her research interests include speech signal processing and pattern recognition. She has published about 40 papers and is an ISCA member.
Shuang Wei received her B.S. and M.S. degrees from the Huazhong University of Science and Technology, China, in 2005 and 2007, respectively, and her Ph.D. degree in electrical and computer engineering from the University of Calgary, Canada, in 2011. Her current research interests include speech signal processing, signal estimation and detection, computational intelligence, and compressed sensing.
Jie Lian is an Assistant Professor in the Computer Science department at Shanghai Normal University, where she has been a faculty member since 2017, and she was selected for the "Sailing" Talent Program of China in 2019. She completed her doctoral degree at Towson University in 2017. Her research interests lie in the area of spatio-temporal data mining, deep learning, and big data, ranging from theory to design to implementation.
Yijie Li received his M.S. degree from the Department of Electronic Engineering and Information Science, University of Science and Technology of China (USTC), Hefei, China, in 2009. From July 2009 to January 2012, he was an Associate Researcher at Shanda Innovations. Since February 2012, he has been with Unisound AI Technology Co., Ltd., Beijing, China, as a Senior Researcher.

Funding
This research was supported by the National Natural Science Foundation of China under grants No. 62071302 and No. 61701306.

Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests
The authors declare that they have no competing interests.