Best Practices for Noise-Based Augmentation to Improve the Performance of Emotion Recognition "In the Wild"


Mimansa Jaiswal, University of Michigan, Ann Arbor (mimansa@umich.edu)
Emily Mower Provost, University of Michigan, Ann Arbor (emilykmp@umich.edu)

arXiv:2104.08806v1 [cs.SD] 18 Apr 2021

Abstract

Emotion recognition has been shown to be effective as a key component of high-stakes downstream applications such as classroom engagement or mental health assessment. These systems are generally trained on small datasets collected in single laboratory environments, and hence falter when tested on data that has different noise characteristics. Multiple noise-based data augmentation approaches have been proposed to counteract this challenge in other speech domains. But, unlike speech recognition and speaker verification, in emotion recognition, noise-based data augmentation may change the underlying label of the original emotional sample. In this work, we generate realistic noisy samples of a well-known emotion dataset (IEMOCAP) using multiple categories of environmental and synthetic noise. We evaluate how both human and machine emotion perception changes when noise is introduced. We find that some commonly used augmentation techniques for emotion recognition significantly change human perception, which may lead to unreliable evaluation metrics, such as when evaluating the efficiency of adversarial attacks. We also find that trained state-of-the-art emotion recognition models fail to classify unseen noise-augmented samples, even when trained on noise-augmented datasets. This finding demonstrates the brittleness of these systems in real-world conditions. We propose a set of recommendations for noise-based augmentation of emotion datasets and for how to deploy these emotion recognition systems “in the wild”.

1   Introduction

Emotion recognition is considered a low-resourced domain due to the limited size of available datasets. As such, the performance of emotion recognition models is often improved when techniques such as data augmentation or transfer learning are applied. Accurate performance is critical because the predictions made by emotion recognition models are often used for downstream applications such as advertising, mental health monitoring (Khorram et al., 2018), hiring (emo) and surveillance (Martin, 2019). However, when used in such sensitive human-centered tasks, incorrect emotion predictions can have devastating consequences.

Data used to train emotion recognition systems are often collected in laboratory environments designed to avoid as much noise variation as possible. This collection method ensures that the collected data vary only due to the presence of emotion or predefined control variables, so that the trained models avoid spurious correlations between noise and emotion labels; it also results in datasets that are usually clean and small.

However, these pristine conditions generally do not hold in the real world. Researchers have addressed this concern by augmenting emotion datasets via signal manipulation to enhance generalizability to real-world noise conditions or to avoid overfitting of ML models, as is common in other speech domains (Zheng et al., 2016). For example, many speech domains, such as speech recognition and speaker identification, tackle this challenge by manipulating training samples to encourage models to learn only salient features (Mošner et al., 2019). But, unlike these tasks, where the desired output remains consistent in the presence of noise, this same consistency cannot be assumed for emotion labels. This paper investigates how common sources of noise affect both human and machine perception of emotion.
Prior research has analyzed robust augmentation methods for automatic speech recognition (Li et al., 2014). The results have shown that humans have surprisingly robust speech processing systems that can easily overcome the omission of letters, changes in the speed of speech, and the presence of environmental noise through psychoacoustic masking (Biberger and Ewert, 2016) during speech comprehension. However, the same analyses have not been completed within the domain of computational paralinguistic applications. While it is known that the comprehension of lexical content in noisy data examples is not significantly affected, the same effect on emotion perception is understudied. Instead, previous works have focused on other confounders, ranging from distribution shift (Abdelwahab and Busso, 2018) to physiological factors such as stress (Jaiswal et al., b), or have studied how pink or white noise affects emotion perception (Parada-Cabaleiro et al., 2017; Scharenborg et al.). Important outstanding questions include the impact of real-world noise, e.g., footsteps or running water, and how variation across both type and loudness impacts human and machine perception.

In this paper, we study how human emotion perception changes as a function of noise. We find that multiple commonly used methods for speech emotion dataset augmentation that assume a consistent ground truth, such as speeding up the utterance or adding fillers and pauses, actually alter human emotion perception. We show how the addition of noise that does not alter human emotion perception, such as footsteps, coughing, rain, or clock ticks, does result in incorrect emotion recognition performance. This is critical because these types of noise are common in real-world applications (e.g., in inputs to virtual agents such as Siri or Alexa). The fragility of these models can pose a security risk, especially when the change in signal isn't perceived by humans. Adversarial attacks are designed to test the robustness of a model. They might be inaccurately evaluated if they introduce noise that does change human perception, yet are still counted as successful when the model output changes. We end the paper with a set of recommendations for emotion dataset augmentation techniques and for the deployment of these emotion recognition algorithms. We investigate the following research questions:

   • Q1: How does the presence of noise affect the perception of emotions as evaluated by human raters? How does this effect vary based on the utterance length, loudness, the added noise type, and the original emotion?
   • Q2: How does the presence of noise affect the performance of automatic emotion recognition systems? Does this effect vary based on the type of the added noise, in agreement with changes in human perception?
   • Q3: Does dataset augmentation or sample denoising help improve ML models' robustness to unseen noise?
   • Q4: How does the difference in the allowed set of noises for adversarial perturbation change the evaluation of adversarial attack efficiency?
   • Q5: What are the best practices for speech emotion dataset augmentation and model deployment, based on crowdsourced experiments on human perception?

Our findings suggest that human emotion perception is usually unaffected by the presence of environmental noise or common modulations such as reverberation and fading loudness. We find that human perception is changed by signal manipulations such as speed or pitch change, and even the presence of pauses and fillers, which are hence not good noise augmentation methods. We show, though, that the trained models are brittle to the introduction of various kinds of noise that do not change human perception, especially reverberation. We show that some kinds of noise can be handled through dataset augmentation, sample denoising, or a combination of both, but the performance still drops when an unseen noise category is tested. We finally show how noise augmentation practices can lead to a false improvement in adversarial attack efficiency, thus leading to an unreliable metric. We end the paper with a set of augmentation and deployment suggestions.

2   Relation to Prior Work

Previous work focused on understanding how noise impacts machine learning models can be classified into three main directions: (a) robustness in automatic speech recognition or speaker verification; (b) noise-based adversarial example generation; and (c) improvement in performance of machine learning models when augmented with noise.
Previous work has looked at how speech recognition systems can be built so that they are robust to various kinds and levels of noise (Li et al., 2014). The most common theme in these papers is to concentrate on either data augmentation or gathering more real-world data to produce accurate transcripts (Zheng et al., 2016). Other lines of work have looked into preventing various attacks, e.g., spoofing or recording playback, on speaker verification systems (Shim et al.). Noise in these systems is usually considered to be caused by reverberations or channel-based modulations (Zhao et al., 2014). Researchers have also looked into building noise-robust emotion recognition models, using either signal transformation or augmentation (Aldeneh and Provost, 2017). But data augmentation in this analysis was formulated in a similar manner to that for speech recognition or speaker identification systems, which have a fixed ground truth and are independent of human perception. This paper attempts to establish methods for noise-based data augmentation of emotion datasets that are grounded in human perception.

Previous works have also investigated adversarial example generation that aims to create audio samples that change the output of the classifier. However, these methods usually assume white-box access to the network and create modifications in either feature vectors or actual wave files (Carlini and Wagner; Gong and Poellabauer, 2017). This generally results in samples that either have perceptible differences when played back to a human, or are imperceptible to a human but fail to attack the model when played over-air (Carlini and Wagner).

The line of work closest to ours is data augmentation aimed at training more generalizable models. Noise and manipulation-based data augmentation is a prime research field in the space of vision and text. Sohn et al. (2020) have looked at how rotating images and blacking out random pixels to create augmented datasets leads to better performance on the test set for digit and object recognition. They have also looked at how data augmentation can help with zero-shot object recognition in sparse classes of a dataset, providing new options for sub-sampling of classes. In the field of NLP, researchers have looked at how replacing words with their semantic synonyms can lead to better performance on tasks such as POS tagging and semantic parsing (Wallace et al., 2019). Researchers have also looked at techniques for data augmentation using added noise (Kim and Kim), mostly focusing on how model training can be improved to yield better performance. However, these augmentation techniques are mostly used for acoustic event detection, speaker verification or speech recognition (Ko et al., 2015; Rebai et al., 2017), and have been sparingly used in audio classification tasks.

To the best of our knowledge, this is the first work that looks at the effect of different kinds of real-world noise and varying amounts of noise contamination on the human perception of emotion and its implication for noise-based augmentation of emotion datasets.

3   Datasets And Network

3.1   Dataset

For our study, we use the Interactive Emotional Dyadic MOtion Capture (IEMOCAP) dataset (Busso et al., 2008), which is commonly used for emotion recognition. The IEMOCAP dataset was created to explore the relationship between emotion, gestures, and speech. Pairs of actors, one male and one female (five males and five females in total), were recorded over five sessions (either scripted or improvised). The data were segmented by speaker turn, resulting in a total of 10,039 utterances (5,255 scripted turns and 4,784 improvised turns). It contains audio, video, and associated manual transcriptions.

3.2   Network

To focus on the effect of types of noise contamination on the performance of emotion classifiers, we use a state-of-the-art single-utterance emotion classification model which has been widely used in previous research (Khorram et al., 2017; Krishna et al.).

Acoustic Feature. We extract 40-dimensional Mel Filterbank (MFB) features using a 25-millisecond Hamming window with a step size of 10 milliseconds. Each utterance is represented as a sequence of 40-dimensional feature vectors. We z-normalize the acoustic features by each speaker.
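The exact front-end tooling is not specified above; the following is a minimal sketch of the described 40-dimensional MFB extraction and per-speaker z-normalization, assuming librosa and 16 kHz mono audio (both assumptions, not the authors' stated implementation).

```python
# Minimal sketch of the 40-dim MFB front end described above.
# Assumptions (not specified in the paper): librosa as the extraction
# toolkit, 16 kHz audio, and log-compression of the filterbank energies.
import numpy as np
import librosa

def extract_mfb(wav_path, sr=16000, n_mels=40):
    """Return a (frames, 40) matrix of log Mel filterbank features."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=512,
        win_length=int(0.025 * sr),   # 25 ms Hamming window
        hop_length=int(0.010 * sr),   # 10 ms step
        window="hamming",
        n_mels=n_mels,
    )
    return np.log(mel + 1e-6).T       # (frames, n_mels)

def z_normalize_by_speaker(features_by_utt, speaker_by_utt):
    """Z-normalize features using statistics computed per speaker."""
    normed = {}
    for spk in set(speaker_by_utt.values()):
        utts = [u for u, s in speaker_by_utt.items() if s == spk]
        stacked = np.vstack([features_by_utt[u] for u in utts])
        mu, sigma = stacked.mean(axis=0), stacked.std(axis=0) + 1e-8
        for u in utts:
            normed[u] = (features_by_utt[u] - mu) / sigma
    return normed
```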
Emotion Labels. The target emotion labels, represented in a dimensional format, are binned into three classes, {low, medium, high}, for both activation and valence.

Architecture. The extracted MFBs are processed using a set of convolution layers and Gated Recurrent Units (GRU), which are fed through a mean pooling layer to produce an acoustic representation, which is then fed into a set of dense layers to classify activation or valence.
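A minimal Keras sketch of this convolution + GRU + mean-pooling + dense pipeline follows. Layer counts, filter sizes, and unit counts are not reported above, so the values here are placeholders rather than the authors' configuration.

```python
# Illustrative Keras sketch of the convolution + GRU + mean-pooling
# architecture described above. Layer counts, filter sizes, and unit
# counts are NOT given in the paper; the values below are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_emotion_classifier(n_mels=40, n_classes=3):
    inputs = layers.Input(shape=(None, n_mels))           # variable-length MFB sequence
    x = layers.Conv1D(128, kernel_size=5, padding="same", activation="relu")(inputs)
    x = layers.Conv1D(128, kernel_size=5, padding="same", activation="relu")(x)
    x = layers.GRU(128, return_sequences=True)(x)          # recurrent modeling over time
    x = layers.GlobalAveragePooling1D()(x)                 # mean pooling over time
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)  # low / medium / high
    return models.Model(inputs, outputs)
```

The mean pooling over the GRU outputs is what makes the model length-agnostic, which matters here because utterances are not trimmed to a fixed duration.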
Training. We implement models using the Keras library (Chollet, 2015). We use a subject-independent five-fold cross-validation scheme to select our train, test, and validation sets. We generate corresponding noisy examples for each sample in the test set using the method described in Section 4. We use a weighted cross-entropy loss function for each task and learn the model parameters using the RMSProp optimizer. We train our networks for a maximum of 50 epochs and monitor the validation loss from the emotion classifier after each epoch for early stopping, stopping the training if the validation loss does not improve after five consecutive epochs. Once the training process ends, we revert the network's weights to those that achieved the lowest validation loss on the emotion classification task. We measure performance using Unweighted Average Recall (UAR) (chance is 0.33). We report performance averaged over all samples considered as test data across multiple runs.
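This training configuration maps naturally onto standard Keras callbacks. The sketch below assumes the `build_emotion_classifier` helper from the previous sketch and placeholder data handles (`x_train`, `y_train`, `class_weights`, etc.); the batch size is an arbitrary choice, since only the optimizer, loss weighting, epoch cap, and patience are specified above.

```python
# Sketch of the training configuration described above (Keras). The
# class weights and batch size are placeholders; the paper specifies
# RMSProp, weighted cross-entropy, a 50-epoch cap, and early stopping
# with a patience of five epochs, restoring the best weights.
import tensorflow as tf

model = build_emotion_classifier()                        # from the sketch above
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=50,
    batch_size=32,                                        # assumption
    class_weight=class_weights,                           # weighted cross-entropy
    callbacks=[early_stop],
)
```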
Sample Selection. We select 900 samples from the IEMOCAP dataset for the human perception part of the study. The sample size is far larger than the ones used in previous perception studies (Parada-Cabaleiro et al., 2017; Scharenborg et al.). To ensure that the 900 samples cover multiple variations, we select 100 samples from each activation and valence bin pair, i.e., 100 samples from all the samples that have activation bin low and valence bin low, 100 samples that have activation bin low and valence bin mid, and so on.

We ensure that, out of the 100 samples, 30 are shorter than the first quartile or longer than the fourth quartile of utterance length in seconds, to cover both extremities of the spectrum, and the remaining 70 fall in the middle. We also ensure that the selected samples have an even 50-50 split across gender.

4   Noise

4.1   Environmental Noise

We define environmental noises (ENV) as additive background noises. These noises are obtained from the ESC-50 dataset (Piczak), which has often been used for both noise contamination and environmental sound classification, out of which we choose:

   • Natural soundscapes (Nat), e.g., rain, wind.
   • Human, non-speech sounds (Hum), e.g., sneezing, coughing.
   • Interior/domestic sounds (Int), e.g., door creaks, clock ticks.

These environmental sounds are representative of many factors in “in the wild” audio, especially in the context of virtual and smart-home conversational agents. We manipulate the following factors in addition to the noise type (a minimal mixing sketch follows this list):

   • ENVPos: We vary the position of the introduced sound so that it either (i) starts and then fades out in loudness, represented by St, or (ii) occurs during the entirety of the utterance, represented by Co. A complete additive background represents a consistent noise source in the real world, e.g., fan rotation.
   • ENVdB: We vary the signal-to-noise ratio (SNR) of the additive background noise for the Co method at levels of 20dB, 10dB and 0dB.
   • ENVLen: We vary the length of the introduced background noise, from a short blip to covering the entire clip. Short length is represented by Sh, medium length by Me, and complete length by Co.
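As an illustration of the Co mixing condition, the sketch below scales a background clip so that it sits at a chosen SNR below the speech. It assumes mono numpy arrays at a shared sampling rate and is not the authors' exact pipeline.

```python
# Minimal sketch of additive background mixing at a target SNR, as used
# for the Co condition above. Looping/trimming of the noise is simplified.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so that the speech-to-noise ratio is `snr_db`."""
    # Tile or trim the noise to the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: a consistent fan-like background at 10 dB SNR.
# noisy = mix_at_snr(speech, fan_noise, snr_db=10)
```

Lower snr_db values produce louder backgrounds, so 0 dB corresponds to noise at the same average power as the speech.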
4.2   Synthetic Noise

We define synthetic noise as modulations that are not additive background noise. These kinds of noise in an audio signal can arise from linguistic/paralinguistic factors, room environment, internet lag, or the physical locomotion of the speaker. We add ten variations of synthetic noise to our dataset (a minimal modulation sketch follows the list):

   • SpeedSeg: We speed up a random segment of the utterance by 1.25×.
   • Fade: We fade the loudness of the utterance by 2% every second, which emulates the scenario of a user moving away from the microphone. We increase the loudness for fade in, and decrease it for fade out.
   • Filler: Insertion of non-verbal short fillers such as ‘uh’, ‘umm’ (from the same speaker) in the middle of a sentence. The insertion is either just the filler (S) or preceded and succeeded by a long pause (L).
   • DropW: Dropping all non-essential words belonging to the set {a, the, an, so, like, and}.
   • DropLt: Phonological deletion, or the dropping of letters, has been widely studied as a part of native US-English dialects (pho; Yuan and Liberman). Hence, we drop letters in accordance with various linguistic styles chosen from the set {/h/+vowel, vowel+/nd/+consonant(next word), consonant+/t/+consonant(next word), vowel+/r/+consonant, /ihng/}.
   • Laugh/Cry: We add “sob” and “short-laughter” sounds obtained from AudioSet (Gemmeke et al., 2017) to the end of the utterance.
   • SpeedUtt: Speed up the entire utterance by 1.25× or 0.75×.
   • Pitch: Change the pitch by ±3 half octaves.
   • Rev: Add room reverberation to the utterance.
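Several of these modulations can be sketched with off-the-shelf signal processing. The example below assumes librosa (0.10-style keyword arguments) and a mono signal, and uses illustrative parameter values (e.g., the pitch step is given in semitones, whereas the bullet above specifies half octaves); it is not the authors' implementation.

```python
# Illustrative sketches of three of the synthetic modulations above
# (utterance speed-up, pitch shift, fade-out). Assumes librosa >= 0.10
# and a mono signal `y` at sampling rate `sr`; parameter values are
# placeholders, not the paper's exact settings.
import numpy as np
import librosa

def speed_up_utterance(y, rate=1.25):
    """SpeedUtt: time-stretch the whole utterance (rate > 1 speeds it up)."""
    return librosa.effects.time_stretch(y, rate=rate)

def shift_pitch(y, sr, n_steps=3.0):
    """Pitch: shift the pitch by `n_steps` semitones (negative lowers it)."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def fade_out(y, sr, step_per_second=0.02):
    """Fade: reduce loudness by 2% for every elapsed second."""
    seconds = np.arange(len(y)) / sr
    gain = np.clip(1.0 - step_per_second * seconds, 0.0, 1.0)
    return y * gain
```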
5   Analysis

5.1   Qualitative human study

For the human perception part of the study, we introduce noise into the selected 900 samples in ten ways each, leading to a total of 9,000 samples for comparison. Each sample is introduced to the four kinds of environmental noise and six out of ten randomly chosen synthetic noise methods to create ten noisy samples.

Crowdsourcing Setup. We recruited annotators using MTurk from a population of workers in the United States who are native English speakers, to reduce the impact of cultural variability. We ensured that each worker had a > 98% approval rating and more than 500 approved Human Intelligence Tasks (HITs). We ensured that all workers understood the meaning of activation and valence using a qualification task that asked workers to rank emotion content, as in the previous work of (Jaiswal et al., a). All HIT workers were paid a minimum wage ($9.45/hr), pro-rated to the minute. Each HIT was annotated by three workers.

For our main task, we created pairs of samples, i.e., the original and the modulated version. For each pair, we asked three workers to annotate whether they thought that the samples contained the same emotional content. If the worker selected yes for both activation and valence, the noisy example was labeled same and they could move to the next HIT. If the worker selected no, the noisy example was labeled different. If a clip was labeled different, the annotator was asked to assess the activation and valence of the noisy sample using Self-Assessment Manikins (Bradley and Lang, 1994) on a scale of one to five (as in the original IEMOCAP annotation).

We ensured the quality of the annotations by paying bonuses to the workers based on time (removing the incentive to mark same every time) and by not inviting annotators to continue if their level of agreement with our attention checks was low.

We created final labels for the noisy examples by taking the majority vote over each pair (a small aggregation sketch follows). The final label was either same emotion perception or different emotion perception, with an averaged valence and activation score. The annotators agreed in 79% of the cases. All the samples, along with the paired noisy examples and their annotations, will be made available for further research.
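The aggregation rule described above can be sketched as follows; the per-annotation record format is an assumption made for illustration, not the authors' data schema.

```python
# Sketch of the label-aggregation rule: majority vote on same/different,
# and averaged SAM ratings when the majority says "different".
from statistics import mean

def aggregate(annotations):
    """`annotations` is a list of three dicts like
    {"same": bool, "activation": int or None, "valence": int or None}."""
    n_different = sum(1 for a in annotations if not a["same"])
    if n_different < 2:                       # majority says "same"
        return {"label": "same"}
    rated = [a for a in annotations if not a["same"]]
    return {
        "label": "different",
        "activation": mean(a["activation"] for a in rated),
        "valence": mean(a["valence"] for a in rated),
    }
```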
Table 1: The original activation (Act) and valence (Val) performance is 0.67 and 0.59, respectively. The table shows the change in UAR on IEMOCAP (Iem) and the Augmented IEMOCAP (Iem(Aug)) datasets. The rows are augmentation methods that were evaluated as imperceptible for emotion by human annotators.

                                   Iem              Iem(Aug)
                                Act     Val        Act     Val
  Environmental Noise
  NatSt                        -.25    -.24        .22     .18
  NatdB (Co)    20dB           -.33    -.34        .22     .13
                10dB           -.37    -.41        .33     .31
                 0dB           -.40    -.41        .27     .26
  HumSt                        -.22    -.24        .15     .13
  HumdB (Co)    20dB           -.33    -.37        .16     .16
                10dB           -.37    -.42        .21     .26
                 0dB           -.40    -.42        .25     .21
  IntSt                        -.21    -.25        .15     .18
  IntdB (Co)    20dB           -.31    -.39        .20     .22
                10dB           -.34    -.39        .23     .19
                 0dB           -.40    -.41        .30     .26
  Synthetic Noise
  SpeedSeg                     -.09    -.12        .03     .02
  Fade          In             -.07    -.10        .05     .04
                Out            -.09    -.14        .02     .06
  DropW                        -.04    -.05        .02     .00
  DropLt                       -.03    -.02        .06     .03
  Reverb                       -.36    -.37        .05     .04

Human Perception Study Results. We report our human perception results below. The values represent the ratio of utterances that were marked as different, based on majority vote, amongst all the utterances introduced with that noise. We find that the presence of environmental noise, even when loud, rarely affects human perception, verifying our initial judgement that humans are able to psycho-acoustically mask the background noise. We verify expected patterns, such as that the presence of laughter or crying often changes human perception. The results for the five synthetic modulations that significantly change human perception of either activation or valence are as follows:

   • Filler(L): Act 0.10, Val 0.06; Filler(S): Act 0.06, Val 0.03
   • Laugh: Act 0.16, Val 0.17
   • Cry: Act 0.20, Val 0.22
   • SpeedUtt(1.25x): Act 0.13, Val 0.03; SpeedUtt(0.75x): Act 0.28, Val 0.06
   • Pitch(1.25x): Act 0.22, Val 0.07; Pitch(0.75x): Act 0.29, Val 0.10

This suggests that modifications that do not change human perception, and hence would not raise suspicion, should ideally not impact the performance of an ideal machine learning model.

5.2   Variation in changes in human perception

In further experiments, we aim to find whether there is a notable difference in human perception of emotion based on the original length of the utterance, the original emotion classification, and the gender of the speaker. We find that the original emotion of the utterance, the gender of the speaker, and the length of the utterance play no significant role in predicting whether the perception of the noisy example would change. In line with the hypothesis tested by Parada-Cabaleiro et al. (2017), we test how noise shapes the direction of perception change.

We find that the addition of laughter or crying increases perceived activation, while laughter increases perceived valence and crying decreases it. We also find that increasing the pitch usually increases perceived activation, and vice versa. Further, increasing the speed of an utterance increases perceived activation, and vice versa (Busso et al., 2009). This evaluation gives us insight into which methods can be used for robust data augmentation, which, unlike in speech recognition, do not always convey the same emotional content.

5.3   Model performance

We study how introducing varying kinds of noise affects the performance of the model trained on a clean dataset. The test samples of each fold are introduced to all the types of noise (Section 4), except those that were also found to often affect human perception (Pitch, SpeedUtt, Laugh, Cry, Filler), and are then used to evaluate the performance of the trained network. Given that the domain of the trained network and the speaker-independent fold distribution remain the same, a drop in performance can be singularly attributed to the addition of a particular noise.

We report the performance of the emotion recognition model trained as described in Section 3 in Table 1. We find that the machine learning model's performance decreases greatly given environmental noise, fading, and reverberation. There is also a smaller drop in performance for speeding up parts of the utterance and dropping words, showing the brittleness of these models.
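This evaluation reduces to scoring the clean-trained model on each contaminated copy of the test fold. The sketch below computes UAR as macro-averaged recall and treats `contaminate` and `featurize` as placeholders for the pipeline described above; the noise list is an illustrative subset.

```python
# Sketch of the Section 5.3 evaluation: a model trained on clean data is
# scored on each noise-contaminated copy of the test fold.
from sklearn.metrics import recall_score

NOISE_TYPES = ["NatSt", "HumSt", "IntSt", "Fade_in", "Fade_out", "Reverb"]  # illustrative subset

def uar(y_true, y_pred):
    """Unweighted Average Recall = macro-averaged recall (chance is 0.33 for 3 classes)."""
    return recall_score(y_true, y_pred, average="macro")

def evaluate_noise_robustness(model, test_wavs, y_true, contaminate, featurize):
    """Return UAR per noise type for a clean-trained model."""
    results = {}
    for noise in NOISE_TYPES:
        noisy = [contaminate(w, noise) for w in test_wavs]
        y_pred = [int(model.predict(featurize(w)[None, ...]).argmax()) for w in noisy]
        results[noise] = uar(y_true, y_pred)
    return results
```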
5.4   Denoising Algorithms and Model Performance

The common way to deal with noise in any audio signal is to use denoising algorithms. Hence, it is a valid counterpoint to ask how machine learning models trained to recognize emotions perform when they are tested on denoised samples. To this end, we look at the two major approaches to denoising in the audio space: denoising the feature space (Valin, 2018) and speech enhancement (Chakraborty et al., 2019). Feature-space denoising algorithms seek to remove noise from the extracted front-end features. Speech enhancement algorithms seek to convert noisy speech to more intelligible speech. Both techniques are associated with known challenges, ranging from signal degradation to harmonic disintegration (Valin, 2018).

We use the most common denoising method from each category: Recurrent Neural Network Noise Suppression (RNNNoise, feature-space denoising) (Valin, 2018) and Vector Taylor Series (VTS, speech enhancement) (Chakraborty et al., 2019).

We train a system composed of the emotion recognition models described in Section 3.2. We then create two separate denoising approaches, one that combines the emotion recognition model with RNNNoise and the other with VTS; we refer to these approaches as RNNNoise and VTS, respectively. We then use these denoising algorithms on the extracted features for the noise-augmented samples described in Section 4. The combination ensures that the models are trained to maximize the probability of predicting the correct emotion as well as maximizing the speech quality.

We use the noise-augmented training, validation, and testing datasets described in Section 5.5. We then evaluate the performance of pre-trained emotion recognition models from Section 3.2 on the denoised test samples using a leave-one-noise-out cross-validation approach. This assumes that the system does not have an individual sample of the noise prototype introduced in the test sample, akin to real-world conditions. We do not compare with denoising algorithms that assume a priori knowledge of the noise category (Tibell et al., 2009).
We find that the addition of a denoising component to the models leads to significant improvement (an average of 23% ± 3% across all environmental noise categories) for continuous noise introduced at 20dB SNR, as compared to when no denoising or augmentation is performed. We observe a decline in performance when dealing with other signal-to-noise ratios of continuous noise additions, possibly due to masking of emotional information in the speech to maintain legibility. We further show that the addition of a denoising component did not significantly improve performance when samples were faded in or out or when segments were sped up. While we did see an improvement in performance for unseen reverberation-contaminated samples as compared to data augmentation, the performance is still significantly lower (−28%) than when tested on a clean test set. Finally, we observe an increase in emotion recognition performance compared to when the model is trained on a clean training set, which supports the findings from dataset augmentation.

5.5   Model performance with augmentation

We then augment the dataset with noise to increase the robustness of the trained model. We test environmental and synthetic noise separately. We perform leave-one-noise-out cross-validation, augmenting each fold with one kind of noise from the set {Hum, Int, Nat}: we train on two kinds of environmental noise and test on the third. We do the same with synthetic noise, performing leave-one-noise-out cross-validation, augmenting each fold with the most impactful augmentations from the set {SpeedSeg, Fade, Reverb}.
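The leave-one-noise-out protocol can be sketched as follows, mirroring the environmental-noise setup ({Hum, Int, Nat}) described above; `augment`, `train_model`, and `evaluate_uar` are placeholders rather than the authors' code.

```python
# Sketch of leave-one-noise-out cross-validation (Section 5.5): for each
# held-out noise category, augment the training folds with the remaining
# categories and test on samples contaminated with the held-out one.
ENV_NOISES = ["Hum", "Int", "Nat"]

def leave_one_noise_out(train_data, test_data, augment, train_model, evaluate_uar):
    scores = {}
    for held_out in ENV_NOISES:
        seen = [n for n in ENV_NOISES if n != held_out]
        augmented_train = list(train_data) + [
            augment(x, n) for x in train_data for n in seen
        ]
        model = train_model(augmented_train)
        noisy_test = [augment(x, held_out) for x in test_data]  # unseen noise
        scores[held_out] = evaluate_uar(model, noisy_test)
    return scores
```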
We find that data augmentation improves performance, especially when the environmental noise is introduced at the start of the utterance; e.g., when the test set is introduced with NatSt and the train set is introduced with HumSt and IntSt, we see a performance jump of 22% compared to when the model is trained on clean data and tested on noisy samples. We speculate that the network learns to assign different weights to the start and end of the utterance to account for the initial noise.

Though the performance increase on test data with continuous background noise is aided by the addition of training samples that also have a continuously present background noise, the performance is still affected by the introduction of new noise in the test set.

We also find that it is hard to improve performance on utterances contaminated with reverberation, a common use case, even when the training set is augmented with other types of noise. This may be because reverberation adds a continuous copy of the human speech signal in the background, delayed by a small period of time. None of the other kinds of noise contain speech, and hence augmentation does not help the model learn robustness to this kind of noise.

5.6   Use Case: Adversarial Attack Evaluation

The common way to evaluate the robustness of a model to noise, or to an attacker with malicious intent, is to add noise to the dataset and test the performance of the model. This method is commonly used for various tasks, such as speech recognition or speaker identification (Abdullah et al., 2019), which are not perception dependent, as discussed before. The closest task to ours, where the ground truth varies based on noise introduction, is sentiment analysis. In that case, adversarial robustness is usually tested by flipping words to their synonyms, such that the meaning of the text remains the same, and analyzing how the predictions of the model change. Unlike the lexical modality, speech cannot be broken down into such discrete components with obvious replacements while maintaining speech feature integrity; hence, introducing noise for emotion recognition while ensuring that the perception remains the same is more difficult.

Instead, we formulate the problem as a decision-boundary untargeted attack, which tries to find the minimal noise perturbation possible in k queries such that the decision of the model changes, while having no knowledge about the internal workings of the model (probabilities or gradients). A decision-based attack starts from a large adversarial perturbation and then seeks to reduce the perturbation while staying adversarial. In our case, because we are considering noise augmentations, we define the distance between an adversarial sample and the original sample as the degradation in signal-to-noise ratio (SNR). The input to this attack model is a set of allowed noise categories (e.g., white, Gaussian, or categorical) and the original signal. We also modify the decision boundary attack to first use any random sound from the four categories at the lowest SNR. If the attack model is successful, it then shifts to using other additive noise options from that category. In case the attacker has knowledge about the relation between noise category and performance degradation, we also provide an ordering
to the attack model, such that it is more successful given the same limited budget of k queries. The sample is considered to have been successfully attacked only when the noise addition chosen by the algorithm is at an SNR of > 10dB. We use the Foolbox Toolkit (Rauber et al., 2017) to implement the decision boundary attack (Brendel et al., 2017).
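The following is a heavily simplified, decision-only sketch of this noise-based attack loop, not the Foolbox implementation used in the paper; `predict_label` and `noise_bank` are placeholders, and `mix_at_snr` refers to the mixing sketch in Section 4.

```python
# Simplified sketch of a noise-based, decision-only attack loop: query the
# black-box model with noise-contaminated copies of x, starting loud and
# backing off, and count only label flips at SNR > 10 dB as successes.
import numpy as np

def snr_db(clean, perturbed):
    """SNR of the clean signal relative to the added perturbation, in dB."""
    noise = perturbed - clean
    return 10 * np.log10(np.mean(clean ** 2) / (np.mean(noise ** 2) + 1e-12))

def decision_noise_attack(x, y_true, predict_label, noise_bank, k=25):
    """Use up to k label-only queries; return the least-degrading success, if any."""
    best, queries = None, 0
    snr_schedule = [0, 5, 10, 15, 20]          # start loud, then reduce the perturbation
    for noise in noise_bank:                   # allowed noise categories / clips
        for snr in snr_schedule:
            if queries >= k:
                return best
            candidate = mix_at_snr(x, noise, snr)
            queries += 1
            if predict_label(candidate) != y_true and snr_db(x, candidate) > 10:
                # Keep the highest-SNR (least perceptible) successful perturbation.
                if best is None or snr_db(x, candidate) > snr_db(x, best):
                    best = candidate
    return best
```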
For our purpose, we assume that an attacker has access to a subset of labelled data from the set of users (U) in IEMOCAP, and that the training and testing conditions remain the same as those of the black-box emotion recognition model. We calculate the average probability of a successful attack, P_IF, for any sample from user U_a using k queries.

We calculate P_IF for four conditions varying along two axes: 1) knowledge of the correlation between the performance deterioration of emotion recognition and the noise category (Corr), and 2) evaluation only on those noise categories whose addition does not change human perception (Eval). This leads to four experimental conditions, i.e., {ordering knowledge based on degradation (Y/N)} x {allowed noise categories that change human perception (Y/N)}. The results are reported in Table 2. We see that the efficiency is always higher when the attack model can use noises that change human perception, but this is an unreliable way of evaluating attacks because the ground truth label changes. This shows that not accounting for human perception in evaluation can lead to a false increase in attack efficiency, and hence improper benchmarking of adversarial attack methods proposed in the future.

Table 2: Attacker efficiency (P_IF) using noise-based adversarial methods. k refers to the number of black-box queries an attacker can access, Corr refers to knowledge of the correlation between performance deterioration of emotion recognition and noise category, and Eval refers to whether the attack efficiency is evaluated only on those noise categories for which human perception does not change. Bold shows significant improvement. Higher values are better. Significance is established using a paired t-test, adjusted p-value < 0.05.

                              k=5     k=15    k=25
  Corr: No,  Eval: No        0.22     0.29    0.39
  Corr: Yes, Eval: No        0.32     0.38    0.53
  Corr: No,  Eval: Yes       0.31     0.36    0.45
  Corr: Yes, Eval: Yes       0.34     0.43    0.58

6   Recommendations

We propose a set of recommendations, for both augmentation and deployment of emotion recognition models in the wild, that are grounded in human perception. For augmentation, we suggest that:

   1. Environmental noise can be added to datasets to improve generalizability to varied noise conditions, whether using a denoising method, an augmentation method, or a combination of both.
   2. It is good to augment datasets with fading loudness of segments, dropping letters or words, and speeding up small segments (no more than 25% of the total sample length) of the complete sound samples in the dataset. But it is important to note that these augmented samples should not be passed through the denoising component, as the denoised version loses emotion information.
   3. One should not change the speed of the entire utterance by more than 5%, and should not add intentional pauses or any background noises that elicit emotional behavior, e.g., sobs or laughter.

Regarding deployment, we suggest that:

   1. Noisy starts and ends of utterances can be handled by augmentation; hence, if the training set includes these augmentations, there is no issue for deployed emotion recognition systems.
   2. Reverberation, in its various forms, is hard to handle even for augmented emotion recognition models; hence, such samples must either be cleaned to remove the reverberation effect, or be flagged as low confidence for classification.

7   Conclusion

In this work, we study how the introduction of real-world noise, either environmental or synthetic, affects human emotion perception. We identify noise sources that do not affect human perception and hence can be used for data augmentation, leading to more robust models. We then study how models trained on the original dataset are affected when tested on these noisy samples, and whether augmentation of the training set leads to an improvement in the performance of the machine learning model. We conclude that, unlike humans, machine learning models are extremely brittle to the introduction of many kinds of noise. While the performance of the machine learning model on noisy samples is aided by augmentation, the performance is still significantly lower when the noise in the train and test environments does not match. In this paper, we demonstrate the fragility of emotion recognition systems and valid methods to augment the datasets, which is a critical concern in real-world deployment.
8   Ethical Considerations

Data augmentation is often applied to speech emotion recognition to improve robustness. Better augmentation methods are an important way not only to ensure the reliability and robustness of these models, but also to improve real-life adoption in high-stakes downstream applications. Knowing when human perception of emotion can change in the presence of noise is needed to design better model unit tests and adversarial tests for verifying the reliability of the system. However, emotion variability often depends on multiple factors, such as culture, race, gender, and age, some of which are highly protected variables. These models can also encode stereotypical expected behavior for a certain group, and hence have a higher error rate for other groups. It is important to note that this paper considers a small set of crowd-sourced workers as human raters of emotion perception, who are based in the US and are well versed in English, the language in which this dataset is collected and on which the model is trained.

References

How emotion AI can transform large-scale recruitment processes. shorturl.at/gcs15. Accessed: 10-21-2019.

Phonological history of English consonant clusters. https://bit.ly/3l08zbu. Accessed: 10-21-2019.

Mohammed Abdelwahab and Carlos Busso. 2018. Domain adversarial for acoustic emotion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Hadi Abdullah, Washington Garcia, Christian Peeters, Patrick Traynor, Kevin RB Butler, and Joseph Wilson. 2019. Practical hidden voice attacks against speech and speaker recognition systems. arXiv preprint arXiv:1904.05734.

Zakaria Aldeneh and Emily Mower Provost. 2017. Using regional saliency for speech emotion recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2741–2745. IEEE.

Thomas Biberger and Stephan D Ewert. 2016. Envelope and intensity based prediction of psychoacoustic masking and speech intelligibility. The Journal of the Acoustical Society of America.

Margaret M Bradley and Peter J Lang. 1994. Measuring emotion: the self-assessment manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry.

Wieland Brendel, Jonas Rauber, and Matthias Bethge. 2017. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248.

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation.

Carlos Busso, Murtaza Bulut, Sungbok Lee, and Shrikanth Narayanan. 2009. Fundamental frequency analysis for speech emotion processing. The Role of Prosody in Affective Speech, 97:309.

Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW).

Rupayan Chakraborty, Ashish Panda, Meghna Pandharipande, Sonal Joshi, and Sunil Kumar Kopparapu. 2019. Front-end feature compensation and denoising for noise robust speech emotion recognition. In INTERSPEECH, pages 3257–3261.

François Chollet. 2015. Keras. https://github.com/fchollet/keras.

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA.

Yuan Gong and Christian Poellabauer. 2017. Crafting adversarial examples for speech paralinguistics applications. arXiv preprint arXiv:1711.03280.

Mimansa Jaiswal, Zakaria Aldeneh, Cristian-Paul Bara, Yuanhang Luo, Mihai Burzo, Rada Mihalcea, and Emily Mower Provost. a. MuSE-ing on the impact of utterance ordering on crowdsourced emotion annotations. In IEEE ICASSP 2019.

Mimansa Jaiswal, Zakaria Aldeneh, and Emily Mower Provost. b. Controlling for confounders in multimodal emotion classification via adversarial learning. In 2019 ICMI.

Soheil Khorram, Zakaria Aldeneh, Dimitrios Dimitriadis, Melvin McInnis, and Emily Mower Provost. 2017. Capturing long-term temporal dependencies with convolutional networks for continuous emotion recognition. arXiv preprint arXiv:1708.07050.

Soheil Khorram, Mimansa Jaiswal, John Gideon, Melvin McInnis, and Emily Mower Provost. 2018. The PRIORI emotion dataset: Linking mood to emotion detected in-the-wild. arXiv preprint arXiv:1806.10658.

Hyoung-Gook Kim and Jin Young Kim. Acoustic event detection in multichannel audio using gated recurrent neural networks with high-resolution spectral features. ETRI Journal.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.

Kalpesh Krishna, Liang Lu, Kevin Gimpel, and Karen Livescu. A study of all-convolutional encoders for connectionist temporal classification. In 2018 IEEE ICASSP.

Jinyu Li, Li Deng, Yifan Gong, and Reinhold Haeb-Umbach. 2014. An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Nicole Martin. 2019. The major concerns around facial recognition technology.

Ladislav Mošner, Minhua Wu, Anirudh Raju, Sree Hari Krishnan Parthasarathi, Kenichi Kumatani, Shiva Sundaram, Roland Maas, and Björn Hoffmeister. 2019. Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6475–6479. IEEE.

Emilia Parada-Cabaleiro, Alice Baird, Anton Batliner, Nicholas Cummins, Simone Hantke, and Björn W Schuller. 2017. The perception of emotions in noisified nonsense speech. In INTERSPEECH.

Karol J. Piczak. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia.

Jonas Rauber, Wieland Brendel, and Matthias Bethge. 2017. Foolbox: A Python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131.

Ilyes Rebai, Yessine BenAyed, Walid Mahdi, and Jean-Pierre Lorré. 2017. Improving speech recognition using data augmentation and acoustic model fusion. Procedia Computer Science.

Odette Scharenborg, Sofoklis Kakouros, and Jiska Koemans. The effect of noise on emotion perception in an unknown language. In Proc. 9th International Conference on Speech Prosody 2018.

Hye-Jin Shim, Jee-Weon Jung, Hee-Soo Heo, Sung-Hyun Yoon, and Ha-Jin Yu. Replay spoofing detection system for automatic speaker verification using multi-task learning of noise classes. In 2018 TAAI.

Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. 2020. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.

Kajsa Tibell, Hagen Spies, and Magnus Borga. 2009. Fast prototype based noise reduction. In Scandinavian Conference on Image Analysis, pages 159–168. Springer.

Jean-Marc Valin. 2018. A hybrid DSP/deep learning approach to real-time full-band speech enhancement. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), pages 1–5. IEEE.

Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, and Sameer Singh. 2019. AllenNLP Interpret: A framework for explaining predictions of NLP models. arXiv preprint arXiv:1909.09251.

Jiahong Yuan and Mark Liberman. Automatic detection of “g-dropping” in American English using forced alignment. In 2011 IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE.

Xiaojia Zhao, Yuxuan Wang, and DeLiang Wang. 2014. Robust speaker identification in noisy and reverberant conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. 2016. Improving the robustness of deep neural networks via stability training. In Proceedings of CVPR.