MULTIMODAL TRANSFORMERS FOR UTTERANCE-LEVEL CODE-SWITCHING DETECTION

Krishna D N

Freshworks Inc.
                               ABSTRACT

An utterance that contains speech from multiple languages is known as a code-switched sentence. In this work, we propose a novel technique to predict whether a given audio is monolingual or code-switched. We propose a multi-modal learning approach that utilizes phoneme information along with audio features for code-switch detection. Our model consists of a Phoneme Network (PN) that processes the phoneme sequence and an Audio Network (AN) that processes the MFCC features. We fuse the representations learned from both networks to predict whether the utterance is code-switched or not. The Audio Network and Phoneme Network consist of initial convolutional, Bi-LSTM, and transformer encoder layers. The transformer encoder layer helps in selecting important and relevant features for better classification by using self-attention. We show that utilizing the phoneme sequence of the utterance along with the MFCC features improves the performance of code-switch detection significantly. We train and evaluate our model on Microsoft's code-switching challenge dataset for the Telugu, Tamil, and Gujarati languages. Our experiments show that the multi-modal learning approach significantly improves accuracy over the uni-modal approaches for the Telugu-English, Gujarati-English, and Tamil-English datasets. We also study the system performance using different neural layers and show that the transformers help obtain better performance.

Index Terms— multimodal learning, transformers, code-switching

                          1. INTRODUCTION

A continuous change between two or more languages within a single utterance is known as code-switching [1]. Code-switching detection is about identifying whether an utterance contains multiple languages. Low-resource languages like Indian languages most commonly have code-switches with English. Code-switched utterances usually cause problems for automatic speech recognition systems that are built using monolingual data only. Many researchers in the speech community are trying to develop systems that can handle code-switched speech. Robust acoustic modeling using bilingual Deep Neural Networks is used in [2] for the Frisian code-switched automatic speech recognition task. [3] proposes to use a shared acoustic model between the code-switched languages to obtain a robust German speech recognition system. [4] proposes a one-pass recognition strategy and shows that we can eliminate the classical multi-pass technique, which follows language identification, language boundary detection, and language-independent speech recognition. Recently, [5] proposes an end-to-end neural network framework that uses an attention-CTC-based seq2seq model; they propose to jointly predict the language label and the CTC labels for robust code-switched speech recognition for Chinese. Language modeling techniques have also been used for code-switched speech recognition tasks recently [7]. [6] proposed to use a recurrent neural network to predict the code-switches by using textual features like part of speech. Language identification seems to be one of the main components of code-switch identification and code-switched speech recognition, and it is shown to have provided good improvement in system performance [8,9,10,11]. Some interesting works [1,2,3,4] have shown different methods to detect code-switching in a single conversation/utterance. Multilingual training of DNN-based speech recognition systems has shown some improvements for low-resource languages [13,14,15,16]. Various other techniques, such as multilingual DNNs [12], latent space language models [13], untranscribed data [14], and data-augmented acoustic and language models [15], have also been proposed for code-switch detection.
    Recent developments in the area of deep learning have shown tremendous improvement in many areas of speech processing, including speech recognition [16,17,18,19,20], language identification [21,22,23,24], speaker identification [25,26,27], emotion recognition [28,29,30], and many others. Recently, the attention mechanism [31] has been shown to be a dominating approach in both natural language processing and speech due to its ability to focus on specific parts of the input signal to extract relevant information for a particular task. Sequence-to-sequence models become more powerful and efficient when they are built using just attention layers. Work by [32] builds an end-to-end architecture using many attention layers, thus eliminating the need for a convolutional network or LSTM layers.
These models are popularly known as Transformers, and they are the state-of-the-art models for most natural language processing problems. In the speech recognition field, state-of-the-art speech recognition uses attention-based sequence-to-sequence models due to their efficiency and capacity to align and predict. These models are also used extensively in problems like emotion recognition [28], language identification, and so on.
    It has been shown that the language identification problem can be tackled by using phoneme sequences extracted from a speech recognition system with some post-processing [38]. [38] describes that we can use an English phoneme recognition system to extract phoneme sequences for utterances in different languages and apply N-gram statistics as post-processing to classify test sentences. Motivated by the work of M. A. Zissman et al. [38], we propose to use phoneme sequences as an additional guiding signal along with MFCC features for the network architecture.
We propose a multi-modal transformer model for utterance-level code-switch detection. Our model consists of two streams, an Audio Network (AN) and a Phoneme Network (PN). The Audio Network consists of a sequence of CNN and LSTM layers followed by a transformer encoder to process MFCC features. The Phoneme Network also consists of a sequence of CNN and LSTM layers followed by a transformer encoder layer to process phoneme embeddings. We pool the output features from the transformer layers of both the Audio Network and the Phoneme Network using a statistics pooling layer, and we concatenate these features to create an utterance-level multi-modal feature representation. We project this multi-modal feature representation to a lower-dimensional space using a fully connected layer. Finally, we classify the lower-dimensional representation using a softmax layer to predict the class label. We conduct all our experiments on the Microsoft code-switching challenge dataset for three Indic language pairs: Telugu-English, Tamil-English, and Gujarati-English. Our experimental study shows that our approach significantly improves system performance compared to the baseline model. We also show that our multi-modal approach outperforms the uni-modal methods. Our open-source implementation can be found at https://github.com/KrishnaDN/Code-Switch-Detection
    The organization of the paper is as follows. In Section 2, we explain our proposed approach in detail. In Section 3, we give a detailed analysis of the dataset, and in Section 4, we explain our experimental setup in detail. Finally, in Section 5, we describe our results.

Fig. 1. Proposed model architecture.

                     2. MULTI-MODAL LEARNING

This paper proposes a multi-modal learning approach for utterance-level code-switch detection using audio features and phoneme information. The model consists of a two-stream architecture containing an Audio Network and a Phoneme Network, as shown in Fig. 1. The Audio Network takes MFCC features as input and consists of Audio Encoder, Audio Transformer, and statistics pooling blocks. The Audio Encoder converts the MFCC features into high-level representations using convolutional neural networks followed by a Bi-LSTM layer, as shown in Fig. 1. This high-level feature representation is fed into the Audio Transformer block, which consists of 2 multi-head self-attention layers. The multi-head self-attention layers use attention weighting to select important and relevant features for code-switching detection. The statistics pooling layer in the Audio Network aggregates the frame-level features from the Audio Transformer into an utterance-level feature vector. Similarly, the Phoneme Network consists of Phoneme Encoder, Phoneme Transformer, and statistics pooling layer, as shown in Fig. 1. The Phoneme Encoder takes phoneme embeddings as input and produces high-level contextual representations using convolutional neural networks and a Bi-LSTM. These features are processed by the Phoneme Transformer, which uses multi-head self-attention layers to select relevant features for better code-switch classification. The Phoneme Network's statistics pooling layer acts as an aggregation layer that pools the variable-length feature vectors from the Phoneme Transformer into a single feature vector for classification. We use early fusion to concatenate the utterance-level feature vectors from the Audio Network and the Phoneme Network, combining information from both audio features and phoneme sequences for better classification performance. We explain all the blocks in detail in the following sections.
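Before detailing each block, the following is a minimal input-preparation sketch in PyTorch. It is illustrative only: the use of torchaudio, the 40-band mel setting, and the helper names are assumptions for exposition, not part of the proposed method or the released implementation.

```python
# Minimal input-preparation sketch (not the released code): 13-dim MFCCs with a
# 25 ms window / 10 ms hop at 16 kHz, and phoneme symbols mapped to integer ids
# for a learnable embedding layer.
import torch
import torchaudio

mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=13,                       # 13-dimensional MFCC features
    melkwargs={"n_fft": 400,         # 25 ms window at 16 kHz
               "win_length": 400,
               "hop_length": 160,    # 10 ms shift
               "n_mels": 40},        # assumed mel-bank size (not stated in the paper)
)

def load_audio_features(wav_path: str) -> torch.Tensor:
    """Return a (T, 13) MFCC sequence for one utterance."""
    waveform, sr = torchaudio.load(wav_path)
    assert sr == 16000, "dataset audio is sampled at 16 kHz"
    mfcc = mfcc_transform(waveform)      # (channels, 13, T)
    return mfcc[0].transpose(0, 1)       # (T, 13)

def phonemes_to_ids(phonemes, phone2id):
    """Map a decoded phoneme sequence (e.g. from the GMM-HMM recognizer
    described in Section 2.1) to the ids consumed by the embedding layer."""
    return torch.tensor([phone2id[p] for p in phonemes], dtype=torch.long)
```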
2.1. Phoneme Recognition

Our proposed method consists of two streams known as the Audio Network and the Phoneme Network. The Phoneme Network takes a sequence of phonemes corresponding to the audio data as its input. Due to the lack of ground-truth phoneme labels in the code-switch detection dataset, we generate phoneme labels by training a GMM-HMM based speech recognition model for the Telugu, Tamil, and Gujarati languages using Microsoft's speech recognition challenge dataset. Using these speech recognizers for the three languages, we create phoneme labels for the Telugu-English, Tamil-English, and Gujarati-English code-switching datasets. These phoneme sequences and their corresponding MFCC features are used during the training of the code-switch identification models. The Audio Network takes the MFCC features for each audio file, while the Phoneme Network takes the phoneme sequence generated by the GMM-HMM speech recognition model for that input audio. All the phoneme recognizers are trained in Kaldi.

2.2. Audio Encoder

The Audio Network consists of Audio Encoder, Audio Transformer, and statistics pooling layer, as shown in Fig. 1. The Audio Encoder takes a sequence of MFCC frames as input and produces an utterance-level feature vector. We extract 13-dimensional MFCC feature vectors for every 10 ms using a 25 ms window. These low-level features are passed as input to the Audio Encoder, followed by the Audio Transformer, to learn a feature representation that carries information about code-switching in an utterance.
    The Audio Encoder processes the audio data to extract features that are useful for code-switch identification. It consists of two 1D convolutional layers followed by a Bi-LSTM layer. The Audio Encoder takes a sequence of 13-dimensional MFCC features and applies 1D convolution operations to extract higher-level feature representations. Each convolutional block consists of a 1D convolution operation followed by batch normalization and ReLU activation, as described in Fig. 1. Each convolutional layer has its own kernel size and number of filters: the filter sizes are [1x7, 1x5] for the Conv-1 and Conv-2 layers, respectively, both Conv-1 and Conv-2 blocks have stride [1x3], and both have 64 filters. Since the convolutional layers do not capture temporal information, we use a Bi-LSTM right after the convolutional block. Let X = [x_1, x_2, ..., x_n, ..., x_T] be a sequence of T MFCC feature vectors of dimension 13, where the input vector x_n is an MFCC frame extracted from the raw audio for every 25 ms window with a 10 ms shift.

                        F^A = Convolution(X)                        (1)

Here Convolution denotes the sequence of two 1D convolutional blocks applied to the MFCC sequence X, as shown in Fig. 1. After the convolution operation, we obtain a feature sequence F^A = [f_1, f_2, ..., f_{T'}] of length T'. The feature matrix F^A has the time dimension along the x-axis and the feature dimension along the y-axis. The feature dimension is equal to the number of filters in the last convolutional layer; in our case it is 64. Since convolution cannot capture temporal information, we use Bi-LSTM layers to learn a temporal representation of the feature matrix F^A:

                        H^A = Bi-LSTM(F^A)                          (2)

where Bi-LSTM represents a bidirectional LSTM layer whose hidden dimension is 128. H^A = [h_1, h_2, ..., h_{T'}] represents the output sequence from the final Bi-LSTM layer, and h_i represents the hidden activity of the i-th timestep from the last layer. Since the LSTM operates in both directions, the final hidden-layer output feature is twice the size of the LSTM hidden feature vector.

2.3. Phoneme Encoder

The Phoneme Network consists of Phoneme Encoder, Phoneme Transformer, and statistics pooling layer. The Phoneme Encoder consists of a learnable phoneme embedding layer and two convolutional layers followed by a Bi-LSTM layer, as shown in Fig. 1. The Phoneme Encoder takes a phoneme sequence as input to the embedding layer, and these fixed-dimensional embeddings are learned during training. The phoneme embeddings from the embedding layer are passed into the convolutional network, which consists of 1D convolution layers. Each convolutional layer has its own kernel size and number of filters: the filter sizes are [1x3, 1x5] for the Conv-1 and Conv-2 layers, respectively, and both have 64 filters. Since the convolutional layers do not capture temporal information, we use a Bi-LSTM right after the convolution block. The Bi-LSTM learns contextual representations of the phoneme sequence and helps in detecting code-switching. Let P = [p_1, p_2, ..., p_n, ..., p_N] be a sequence of N phoneme embeddings of dimension 128, where the input vector p_n is the phoneme embedding for the n-th phoneme.

                        F^P = Convolution(P)                        (3)

where Convolution is the sequence of two convolutional blocks applied to the phoneme embedding sequence P, as shown in Fig. 1. F^P = [f_1, f_2, ..., f_{N'}] is the high-level feature representation learned from the convolution operation. Since convolution cannot capture temporal information, we use Bi-LSTM layers to learn a temporal representation of the feature matrix F^P:

                        H^P = Bi-LSTM(F^P)                          (4)

where Bi-LSTM represents a 2-layer bidirectional LSTM network whose hidden dimension is 128. H^P = [h_1, h_2, ..., h_{N'}] represents the output sequence from the final Bi-LSTM layer, and h_i represents the hidden activity of the i-th timestep from the last layer.
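To make the encoder structure concrete, the following is a minimal PyTorch sketch of an encoder block of the kind described in Eqs. (1)-(4): two 1D convolutional blocks followed by a Bi-LSTM. The kernel, stride, and filter values are those quoted above; the padding choice, the class name, and the phoneme vocabulary size are assumptions, so this is an illustrative sketch rather than the released implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Conv1D blocks + Bi-LSTM, roughly following Eqs. (1)-(4).
    Kernel/stride values follow the paper; padding is an assumption."""
    def __init__(self, in_dim=13, channels=64, kernels=(7, 5), stride=3, hidden=128):
        super().__init__()
        convs = []
        prev = in_dim
        for k in kernels:
            convs += [nn.Conv1d(prev, channels, kernel_size=k, stride=stride, padding=k // 2),
                      nn.BatchNorm1d(channels),
                      nn.ReLU()]
            prev = channels
        self.conv = nn.Sequential(*convs)
        self.bilstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (batch, T, in_dim)
        f = self.conv(x.transpose(1, 2))         # F = Convolution(X): (batch, 64, T')
        h, _ = self.bilstm(f.transpose(1, 2))    # H = Bi-LSTM(F): (batch, T', 256)
        return h

# Audio stream: 13-dim MFCC frames in, kernels [1x7, 1x5], stride 3.
audio_encoder = EncoderBlock(in_dim=13, kernels=(7, 5), stride=3)
# Phoneme stream: 128-dim learned embeddings in, kernels [1x3, 1x5], stride 1.
phoneme_embedding = nn.Embedding(num_embeddings=100, embedding_dim=128)  # vocab size is an assumption
phoneme_encoder = EncoderBlock(in_dim=128, kernels=(3, 5), stride=1)
```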
Fig. 2. Multi-head self-attention: Audio Transformer (left) and Phoneme Transformer (right).

2.4. Transformer Network

Attention is a well-known concept in deep learning, specifically for temporal sequences. Attention models are widely used in speech and NLP applications due to their ability to attend to and select the relevant features that are useful for solving a particular task. Recently, multi-head self-attention has become one of the most widely used models in NLP and speech due to its ability to attend to different parts of the input using multiple attention heads. Various attention models have been proposed over the years, including dot-product attention, locality-sensitive attention, additive attention, and so on. In this work, we use a specific type of attention called multi-head self-attention. A detailed multi-head self-attention block is shown in Fig. 2.
    The multi-head attention (MHA) layer consists of multi-head scaled dot-product attention, a matrix multiplication, and a position-wise feed-forward layer. The MHA block consists of M linear layers for the Query, Key, and Value matrices, where M is the number of heads in the multi-head attention. It takes the feature representations from the Bi-LSTM layer and applies the i-th linear transform to create Q_i, K_i, and V_i, where i = [1, 2, ..., M] and M is the total number of attention heads. The Q_i, K_i, and V_i are fed into a scaled dot-product attention layer followed by a matrix multiplication between the value matrix and the attention weights. The scaled dot-product attention A_i for the i-th head is defined as follows:

                A_i = Softmax(Q_i K_i^T / sqrt(d_q)) V_i                (5)

where d_q is the dimension of the query vector. We combine the attention outputs from all the heads using simple concatenation and feed them into the feed-forward layer:

                A = Concat(A_1, A_2, A_3, ..., A_i, ..., A_M) W_0       (6)

where W_0 is a learned output projection matrix.
    Our proposed model consists of an Audio Transformer and a Phoneme Transformer. The Audio Transformer takes the output feature representation H^A = [h_1, h_2, ..., h_{T'}] from the Audio Encoder as input, while the Phoneme Transformer takes the output feature representation H^P = [h_1, h_2, ..., h_{N'}] from the Phoneme Encoder as input. The Query, Key, and Value are obtained from H^A for the Audio Transformer; similarly, the Query, Key, and Value are obtained from H^P for the Phoneme Transformer. Both the Audio Transformer and the Phoneme Transformer consist of 2 layers of multi-head self-attention with four attention heads each.

2.5. Statistics Pooling

The statistics pooling layer pools the frame-level features from both the Audio Transformer and the Phoneme Transformer. It computes the mean and standard deviation across time in a feature sequence to generate a single utterance-level feature representation for the Audio Network and the Phoneme Network. The mean and standard deviation vectors are concatenated to obtain the second-order statistics of the Audio Transformer and Phoneme Transformer outputs:

                U^A = Concat(mean(A_a), std(A_a))                       (7)

                U^P = Concat(mean(A_p), std(A_p))                       (8)

where A_a and A_p are the outputs of the Audio Transformer and the Phoneme Transformer, respectively. U^A is the utterance-level feature vector obtained from A_a, and U^P is the utterance-level feature vector obtained from A_p. Both U^A and U^P have the same dimension.

2.6. Early Fusion

The utterance-level feature vectors U^A and U^P are the representations learned from the two modalities. The early fusion block combines these two representations, merging information from the audio and phoneme modalities using simple concatenation:

                C = Concat(U^A, U^P)                                    (9)

where C represents the fused feature representation.
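As a concrete illustration of Eqs. (5)-(9), a minimal PyTorch sketch is given below. It approximates the described MHA block with PyTorch's built-in pre-norm transformer encoder layer rather than reproducing the exact block in Fig. 2; layer choices beyond the values quoted above (2 layers, 4 heads, feed-forward size 256, Bi-LSTM output dimension 256) are assumptions.

```python
import torch
import torch.nn as nn

class StreamTransformer(nn.Module):
    """Two pre-norm multi-head self-attention layers (4 heads, FF dim 256),
    approximating the Audio/Phoneme Transformer blocks; a sketch, not the
    released code."""
    def __init__(self, d_model=256, nhead=4, ff=256, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=ff,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, h):                # h: (batch, T, 256) from the Bi-LSTM
        return self.encoder(h)           # Eqs. (5)-(6) applied per layer

def statistics_pooling(a):
    """Eqs. (7)/(8): concatenate mean and std over the time axis."""
    return torch.cat([a.mean(dim=1), a.std(dim=1)], dim=-1)    # (batch, 512)

def early_fusion(u_audio, u_phoneme):
    """Eq. (9): simple concatenation of the two utterance-level vectors."""
    return torch.cat([u_audio, u_phoneme], dim=-1)             # (batch, 1024)
```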
                            3. DATASET

3.1. Speech Recognition dataset

The proposed model uses both audio features and phoneme sequences to train the multi-modal architecture for code-switching identification. We first build LVCSR speech recognition models for the Telugu, Tamil, and Gujarati languages to extract the code-switching dataset's phoneme sequences. We use the dataset provided for the low-resource speech recognition challenge by Microsoft. The dataset consists of 40 hours of audio data for each of the languages, and all the audio files are sampled at 16 kHz to train the speech recognition models. The dataset contains training and evaluation splits for each language. We first train a GMM-HMM-based speech recognition model for each language using its training split, and we then evaluate each model on the corresponding test split.

3.2. Code-switching dataset

In this section, we discuss the code-switching dataset in detail. One of the shared task's main goals is to detect whether an audio file is monolingual or code-switched. The shared task contains two subtasks: 1) utterance-level identification of monolingual vs. code-switched utterances, and 2) frame-level identification of the language in a code-switched utterance. We focus on subtask-1 in this paper. The dataset for subtask-1 contains three Indic language pairs: Telugu-English, Tamil-English, and Gujarati-English. Each language pair is provided with training and testing sets, and both sets contain information about whether an audio file is monolingual or code-switched. The statistics of the dataset are given in Table 1. To use our approach, we need a phoneme sequence for every utterance in the code-switching identification dataset. To obtain the phoneme sequences, we use the prebuilt GMM-HMM speech recognition models for the corresponding languages: the Telugu, Tamil, and Gujarati speech recognition models are used to extract phoneme sequences for the Telugu-English, Tamil-English, and Gujarati-English code-switched datasets, respectively.

                      Train                      Evaluation
  Dataset     Duration (h)  Utterances    Duration (h)  Utterances
  Gujarati    31.59         16780         3.59          2091
  Telugu      31.59         16991         4.0           2135
  Tamil       30.24         10933         7.312         2642

Table 1. Code-switching dataset: train and evaluation splits for different languages.

                          4. EXPERIMENTS

4.1. Speech Recognition

Our speech recognition dataset consists of 40 hours of labeled audio data for the three Indic languages Telugu, Tamil, and Gujarati. We extract 13-dimensional MFCC features for every 25 ms window with a 10 ms shift and use these features to train the acoustic models. We build three different speech recognition models, Telugu-ASR, Tamil-ASR, and Gujarati-ASR, for the Telugu, Tamil, and Gujarati languages, respectively. The acoustic model for Telugu-ASR is trained using ∼40 hours of audio data containing ∼44K utterances, and we train a bi-gram language model using the training utterances. Similarly, the acoustic models for Tamil-ASR and Gujarati-ASR are each trained using ∼40 hours of data; the dataset contains ∼39K utterances for Tamil and ∼22K for Gujarati. The lexicons for Telugu-ASR, Tamil-ASR, and Gujarati-ASR have ∼48K, ∼58K, and ∼43K words, respectively. We evaluate the speech recognition models on ∼5 hours of evaluation data for each language. We use Kaldi [34] (www.kaldi.net) to train the speech recognition models and KenLM [35] (www.kenLM.com) to train the language models.

4.2. Code-switching Identification

Our proposed approach consists of two streams, the Audio Network and the Phoneme Network. The input to the Audio Network is a sequence of 13-dimensional MFCC features extracted with a 25 ms window length and a 10 ms hop length. The Audio Encoder has two convolutional layers with [1x7, 1x5] kernels and 64 filters each, and each convolutional layer has a stride of 3. Each of these convolutional layers consists of a 1D convolution operation, 1D batch normalization, and a ReLU activation function. The Audio Encoder's convolutional layers help reduce the temporal dimension of the input data and extract higher-level features. The output of the Audio Encoder's convolutional block is fed to a Bi-LSTM with a hidden size of 128 and a dropout of 0.3. Similarly, the Phoneme Encoder has two 1D convolutional layers with kernel sizes [1x3, 1x5] and 64 filters, and each convolution layer has stride 1. The input to the Phoneme Encoder is a sequence of phoneme embeddings of dimension 128, which are learned as part of the training. The output of the Phoneme Encoder goes to a Bi-LSTM with a hidden layer size of 128 and a dropout of 0.3. Both the Audio Transformer and the Phoneme Transformer contain two layers of multi-head attention (MHA) blocks. Each MHA block uses four attention heads and applies layer normalization before the scaled dot-product operation, and the feed-forward layer inside the MHA has a hidden dimension of 256. We use the Adam optimizer [36] with a scheduled learning rate. We use the PyTorch [37] framework for implementation, and all our models are trained on RTX 2080Ti graphics cards.
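Putting the pieces together, the following sketch assembles the two streams and the classifier with the hyperparameters quoted above and reuses the EncoderBlock, StreamTransformer, statistics_pooling, and early_fusion sketches given earlier. The projection dimension, phoneme vocabulary size, base learning rate, and schedule are assumptions; the sketch is illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

class CodeSwitchDetector(nn.Module):
    """Two-stream model: encoders -> transformers -> statistics pooling ->
    early fusion -> fully connected projection -> softmax over 2 classes."""
    def __init__(self, phoneme_vocab=100, proj_dim=128):   # both values are assumptions
        super().__init__()
        self.audio_encoder = EncoderBlock(in_dim=13, kernels=(7, 5), stride=3)
        self.phoneme_embedding = nn.Embedding(phoneme_vocab, 128)
        self.phoneme_encoder = EncoderBlock(in_dim=128, kernels=(3, 5), stride=1)
        self.audio_transformer = StreamTransformer()
        self.phoneme_transformer = StreamTransformer()
        self.classifier = nn.Sequential(
            nn.Linear(2 * 512, proj_dim),   # fused 1024-dim vector -> lower-dimensional space
            nn.ReLU(),
            nn.Linear(proj_dim, 2),         # monolingual vs. code-switched logits
        )

    def forward(self, mfcc, phoneme_ids):
        a = self.audio_transformer(self.audio_encoder(mfcc))
        p = self.phoneme_transformer(self.phoneme_encoder(self.phoneme_embedding(phoneme_ids)))
        c = early_fusion(statistics_pooling(a), statistics_pooling(p))
        return self.classifier(c)           # softmax is folded into the loss below

model = CodeSwitchDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                        # base LR assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # schedule assumed
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
mfcc = torch.randn(4, 500, 13)              # (batch, frames, 13)
phones = torch.randint(0, 100, (4, 60))     # (batch, phoneme sequence length)
labels = torch.randint(0, 2, (4,))
optimizer.zero_grad()
loss = criterion(model(mfcc, phones), labels)
loss.backward()
optimizer.step()
scheduler.step()
```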
                            5. RESULTS

In this section, we describe the experimental results in detail. We first evaluate our speech recognition system performance for Telugu, Tamil, and Gujarati using the standard evaluation data. We then compare our multi-modal approach with the uni-modal approaches, and finally we examine the effectiveness of transformers for the code-switching identification task. We train GMM-HMM-based speech recognition on 40 hours of audio data for each language and evaluate the word error rates on the corresponding test sets. We obtain ∼24%, ∼20%, and ∼17% WER for Telugu, Tamil, and Gujarati, respectively.
    We train our multi-modal model for Telugu-English, Tamil-English, and Gujarati-English on the code-switching dataset and evaluate the models on the evaluation set of each language. The comparison between the multi-modal and uni-modal approaches is shown in Table 2. The experiment Multi-Modal represents our approach, where we use both the Audio Network and the Phoneme Network. Uni-Modal-Audio represents the model where the classifier is trained using only the Audio Network with MFCC features as inputs. Similarly, Uni-Modal-Phoneme is the model where the classifier is trained using only the Phoneme Network with phoneme sequences as inputs. Our approach outperforms the uni-modal approaches: it outperforms Uni-Modal-Audio by 6.8%, 7.75%, and 4.14% for Telugu, Tamil, and Gujarati, respectively, and it outperforms Uni-Modal-Phoneme by 5.46%, 5.47%, and 13.85% for Telugu, Tamil, and Gujarati, respectively.

          System                  Accuracy
      Telugu-English
        Multi-Modal               85.71%
        Uni-Modal-Audio           78.91%
        Uni-Modal-Phoneme         80.25%
      Tamil-English
        Multi-Modal               86.02%
        Uni-Modal-Audio           78.27%
        Uni-Modal-Phoneme         80.55%
      Gujarati-English
        Multi-Modal               88.85%
        Uni-Modal-Audio           84.71%
        Uni-Modal-Phoneme         75.80%

Table 2. Comparison of multi-modal and uni-modal approaches (bold indicates the best performance).

    We also experiment to see the effectiveness of the transformer for the code-switching identification task in the multi-modal setting. We remove the Audio Transformer and the Phoneme Transformer from the original model to see the effect on code-switching identification performance; we call this experiment CNN+Bi-LSTM. Similarly, we remove the Bi-LSTM from the original architecture to see the effect of the Bi-LSTM on system performance; this experiment is called CNN+Transformer. In the final experiment, we keep all the blocks from the original architecture; this experiment is called CNN+Bi-LSTM+Transformer. We conduct these experiments for all the language pairs, and the results are shown in Table 3. It can be seen that CNN+Bi-LSTM+Transformer is the best model and that the transformers help in improving overall system performance.

          System                        Accuracy
      Telugu-English
        CNN+Bi-LSTM                     85.00%
        CNN+Transformer                 85.30%
        CNN+Bi-LSTM+Transformer         85.71%
      Tamil-English
        CNN+Bi-LSTM                     85.70%
        CNN+Transformer                 86.02%
        CNN+Bi-LSTM+Transformer         85.60%
      Gujarati-English
        CNN+Bi-LSTM                     88.01%
        CNN+Transformer                 87.90%
        CNN+Bi-LSTM+Transformer         88.85%

Table 3. Comparison of different neural architectures (bold indicates the best performance).

                          6. CONCLUSIONS

Code-switching detection is one of the most challenging and still unsolved problems in the speech processing field. This work proposes a new approach for improving code-switching detection performance using multi-modal learning combined with transformer networks that utilize phonetic information. Our method consists of a two-stream neural network in which one stream processes audio features such as MFCCs while the other stream processes phoneme sequences. Multi-modal learning helps capture information shared between the different modalities to learn a better code-switching identification model. We also show that transformer models help improve code-switching detection performance due to their ability to attend to and select features using a self-attention mechanism. Our experiments show that our multi-modal learning approach outperforms the uni-modal approaches on Microsoft's code-switching identification dataset.

                       7. ACKNOWLEDGEMENTS

We want to thank Microsoft India for providing the dataset for this research work. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of Freshworks Inc.

                          8. REFERENCES

[1] Peter Auer, "Code-switching in Conversation: Language, Interaction and Identity," Routledge, 2013.
[2] E. Yılmaz, H. Van den Heuvel, and D. A. Van Leeuwen, "Investigating bilingual deep neural networks for automatic speech recognition of code-switching Frisian speech," in Proc. Workshop on Spoken Language Technology for Under-resourced Languages (SLTU), May 2016, pp. 159–166.

[3] G. Stemmer, E. Noth, and H. Niemann, "Acoustic modeling of foreign words in a German speech recognition system," in Proc. EUROSPEECH, 2001, pp. 2745–2748.

[4] D.-C. Lyu, R.-Y. Lyu, Y.-C. Chiang, and C.-N. Hsu, "Speech recognition on code-switching among the Chinese dialects," in Proc. ICASSP, vol. 1, May 2006, pp. 1105–1108.

[5] N. Luo, D. Jiang, S. Zhao, C. Gong, W. Zou, and X. Li, "Towards end-to-end code-switching speech recognition," arXiv preprint arXiv:1810.13091, 2018.

[6] H. Adel, N. Vu, F. Kraus, T. Schlippe, H. Li, and T. Schultz, "Recurrent neural network language modeling for code switching conversational speech," in Proc. ICASSP, 2013, pp. 8411–8415.

[7] Heike Adel, Katrin Kirchhoff, Dominic Telaar, Ngoc Thang Vu, Tim Schlippe, and Tanja Schultz, "Features for factored language models for code-switching speech," in Spoken Language Technologies for Under-Resourced Languages, 2014.

[8] J. Weiner, N. T. Vu, D. Telaar, F. Metze, T. Schultz, D.-C. Lyu, E.-S. Chng, and H. Li, "Integration of language identification into a recognition system for spoken conversations containing code-switches," in Proc. SLTU, May 2012.

[9] K. R. Mabokela, M. J. Manamela, and M. Manaileng, "Modeling code-switching speech on under-resourced languages for language identification," in Proc. SLTU, 2014, pp. 225–230.

[10] D.-C. Lyu, E.-S. Chng, and H. Li, "Language diarization for code-switch conversational speech," in Proc. ICASSP, May 2013, pp. 7314–7318.

[11] K. Bhuvanagiri and Sunil Kopparapu, "An approach to mixed language automatic speech recognition," Oriental COCOSDA, Kathmandu, Nepal, 2010.

[12] E. Yılmaz, H. van den Heuvel, and D. van Leeuwen, "Code-switching detection using multilingual DNNs," 2016 IEEE Spoken Language Technology Workshop (SLT), San Diego, CA, 2016, pp. 610–616, doi: 10.1109/SLT.2016.7846326.

[13] C. Wu, H. Shen, and C. Hsu, "Code-Switching Event Detection by Using a Latent Language Space Model and the Delta-Bayesian Information Criterion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, pp. 1892–1903, Nov. 2015, doi: 10.1109/TASLP.2015.2456417.

[14] E. Yılmaz, H. Van den Heuvel, and D. A. Van Leeuwen, "Exploiting untranscribed broadcast data for improved code-switching detection," in Proc. INTERSPEECH, Aug. 2017, pp. 42–46.

[15] Emre Yılmaz, Henk van den Heuvel, and David van Leeuwen, "Code-switching detection with data augmented acoustic and language models," in the Sixth International Workshop on Spoken Language Technology for Under-resourced Languages (SLTU), Procedia Computer Science, 2018, pp. 127–131.

[16] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition," in ICASSP, 2016.

[17] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4835–4839.

[18] C. Wang, Y. Wu, Y. Du, J. Li, S. Liu, L. Lu, S. Ren, G. Ye, S. Zhao, and M. Zhou, "Semantic mask for transformer based end-to-end speech recognition," in Proc. Interspeech, 2020.

[19] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," in Proc. Interspeech, 2019.

[20] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve, "Wav2letter: an end-to-end convnet based speech recognition system," CoRR, vol. abs/1609.03193, 2016.

[21] D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, and S. Khudanpur, "Spoken language recognition using x-vectors," in Odyssey: The Speaker and Language Recognition Workshop, 2018.

[22] J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-Rodriguez, and P. J. Moreno, "Automatic language identification using long short-term memory recurrent neural networks," in INTERSPEECH, 2014, pp. 2155–2159.

[23] S. Ling, J. Salazar, Y. Liu, and K. Kirchhoff, "BERTphone: Phonetically-aware Encoder Representations for Utterance-level Speaker and Language Recognition."

[24] Weicheng Cai, Danwei Cai, Shen Huang, and Ming Li, "Utterance-level end-to-end language identification using attention-based CNN-BLSTM," in ICASSP, 2019, pp. 5991–5995.

[25] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018.

[26] Krishna D N, Ankita Patil, M. S. P. Raj, Sai Prasad H S, and Prabhu Aashish Garapati, "Identification of Indian Languages using Ghost-VLAD pooling," arXiv:2002.01664.

[27] Shuai Wang, Johan Rohdin, Lukas Burget, Oldrich Plchot, Yanmin Qian, Kai Yu, and Jan Cernocky, "On the usage of phonetic information for text-independent speaker embedding extraction," INTERSPEECH, pp. 1148–1152, 2019.

[28] M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma, and N. Dehak, "Emotion identification from raw speech signals using DNNs," in Proc. INTERSPEECH, Hyderabad, India, 2018, pp. 3097–3101.

[29] Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov, "Learning Factorized Multimodal Representations," in 7th International Conference on Learning Representations, ICLR 2019.

[30] Krishna N and Sai Reddy, "Multi-Modal Speech Emotion Recognition Using Speech Embeddings and Audio Features," AVSP 2019, Melbourne, Australia.

[31] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.

[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[33] D. Su, X. Wu, and L. Xu, "GMM-HMM acoustic model training by a two level procedure with Gaussian components determined by automatic model selection," 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas.

[34] D. Povey et al., "The Kaldi speech recognition toolkit," in Workshop on Automatic Speech Recognition and Understanding (ASRU), Hawaii, US, 2011, pp. 1–4.

[35] K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn, "Scalable modified Kneser-Ney language model estimation," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, Aug. 2013.

[36] Diederik P. Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization," in Proceedings of the International Conference on Learning Representations (ICLR), 2014.

[37] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, "Automatic differentiation in PyTorch," in NIPS, 2017.

[38] M. A. Zissman and E. Singer, "Automatic language identification of telephone speech messages using phoneme recognition and N-gram modeling," Proceedings of ICASSP '94, IEEE International Conference on Acoustics, Speech and Signal Processing, Adelaide, SA, 1994, pp. I/305–I/308 vol. 1, doi: 10.1109/ICASSP.1994.389377.