ASR-GLUE: A New Multi-task Benchmark for ASR-Robust Natural Language Understanding

arXiv:2108.13048v1 [cs.CL] 30 Aug 2021

Lingyun Feng (Tsinghua University / Tencent AI Lab, Shen Zhen, China)  fly19@mail.tsinghua.edu.cn
Jianwei Yu (Tencent AI Lab, Shen Zhen, China)  tomasyu@tencent.com
Deng Cai (The Chinese University of Hong Kong, Hong Kong, China)  thisisjcykcd@gmail.com
Songxiang Liu (Tencent AI Lab, Shen Zhen, China)  shaunxliu@tencent.com
Haitao Zheng (Tsinghua University, Shen Zhen, China)  zheng.haitao@sz.tsinghua.edu.cn
Yan Wang (Tencent AI Lab, Shen Zhen, China)  brandenwang@tencent.com

Abstract

Language understanding in speech-based systems has attracted much attention in recent years with the growing demand for voice interface applications. However, the robustness of natural language understanding (NLU) systems to errors introduced by automatic speech recognition (ASR) is under-examined. In this paper, we propose the ASR-GLUE benchmark, a new collection of 6 different NLU tasks for evaluating the performance of models under ASR error across 3 different levels of background noise and 6 speakers with various voice characteristics. Based on the proposed benchmark, we systematically investigate the effect of ASR error on NLU tasks in terms of noise intensity, error type and speaker variation. We further propose two approaches, a correction-based method and a data augmentation-based method, to improve the robustness of NLU systems. Extensive experimental results and analyses show that the proposed methods are effective to some extent, but still far from human performance, demonstrating that NLU under ASR error is still very challenging and requires further research.1

1   Introduction

Language understanding in speech-based systems has attracted much attention in recent years with the growing demand for voice interface applications and devices such as Alexa [1], Siri [2], and Cortana [3]. These speech-based intelligent systems usually comprise an automatic speech recognition (ASR) component, which converts audio signals to readable natural language text, and a natural language understanding (NLU) component, which takes the output of the ASR component as input and fulfills downstream tasks such as sentiment analysis, natural language inference, and response selection.
                                            1
                                              The dataset is available at
                                         https://drive.google.com/drive/folders/1slqI6pUiab470vCxQBZemQZN-a_ssv1Q

                                         Submitted to the 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets
                                         and Benchmarks. Do not distribute.
The upstream ASR error may propagate to the downstream NLU component and degrade the overall performance [4, 5]. In real-world scenarios, ASR error can be ubiquitous due to poor articulation and acoustic variability caused by environment noise and reverberation [6]. The persistence of ASR error indicates a need for ASR-robust natural language understanding.
Previous work in this area is limited to task-oriented language understanding such as hotel reservation
and meeting scheduling through human-machine interactions [7, 8, 9, 10]. However, ASR error
can affect many other NLU tasks, such as sentiment analysis in voice assistants. A benchmark that
enables the comprehensive evaluation of NLU under ASR error on a diverse range of tasks is still
missing.
In this paper, to quantitatively investigate how ASR error affects NLU capability, we propose the ASR-robust General Language Understanding Evaluation (ASR-GLUE) benchmark: a collection of 6 NLU tasks including sentiment analysis, similarity and paraphrase tasks, and natural language inference (NLI) tasks. We hire 6 native speakers to convert the test data into audio recordings under 3 different levels of environment noise. Each speaker is requested to record all test data so that we can study the variation between individuals. Finally, we obtain 18 different types of audio recordings (3 levels of noise × 6 different speakers) for each of the 6 NLU tasks, varying in noise intensity, error type and speaker. In addition, we also measure human performance under the different noise levels. We hope this benchmark will benefit future research on ASR-robust NLU.
With the annotated dataset above, we observe that recent state-of-the-art (SOTA) NLU models are sensitive to ASR error. To alleviate this problem, a straightforward approach is to correct the mistranscriptions as a post-processing step. Another approach is data augmentation, which uses text with potential ASR error as additional training data to simulate the real-world scenario during training. We propose two different augmentation methods: 1) Audio-level augmentation. We adopt a Text-to-Speech (TTS)-then-ASR pipeline that first transforms the training corpus into audio and then converts it back to text (with errors). 2) Text-level augmentation. Due to the high cost of TTS and ASR systems, we further attempt to inject ASR error into the training corpus through text generation models [11, 12] or manually predefined rules [13, 14]. Experimental results demonstrate that while these approaches improve model robustness against ASR error to some extent, they are still far from human understanding capability.
Our contributions are as follows: 1) A new benchmark dataset, ASR-GLUE, is proposed to enable a comprehensive evaluation of the robustness of NLU models to ASR error, covering 6 diversified tasks under 6 different speakers and 3 different levels of environment noise. 2) We systematically and quantitatively investigate the sensitivity of state-of-the-art NLU models to ASR error in terms of noise intensity, error type and speaker variation. 3) We provide two types of approaches to alleviate the effect of ASR error. Experimental results demonstrate their effectiveness, and they can serve as baseline methods in future studies.

2   ASR-GLUE

2.1   Selected NLU Tasks

The proposed ASR-GLUE is constructed on the basis of GLUE [15], a popular NLU evaluation benchmark consisting of diverse NLU tasks. We select 5 typical NLU tasks from it, namely: sentiment classification (SST-2 [16]), Semantic Textual Similarity (STS-B [17]), paraphrase detection (QQP 2), Question-answering NLI (QNLI [18]) and Recognizing Textual Entailment (RTE [19, 20, 21, 22]), and incorporate a science NLI task (SciTail [23]), resulting in 6 tasks in total. Detailed descriptions of the chosen tasks are presented in the Appendix. They are common and typical language understanding tasks in speech-based scenarios, making them suitable for ASR-GLUE.

2.2   Data Construction

Since the original datasets are presented in clean text form, we manually select instances from their test sets for human audio recording to evaluate NLU capability in the presence of ASR error. Samples with non-standard words (e.g., abbreviations, currency expressions, unusual names and addresses) or sentences that are too long (more than 100 words) are excluded from these selected test sets.
      2
          data.quora.com/First-Quora-Dataset-Release-Question-Pairs

Figure 1: An illustration of the data collection and recording process.

Considering the cost and quality of annotation, we keep the original training set and randomly select a subset of samples from the test set of each task for human audio recording3. The statistics of the data are shown in Table 1. In the human recording process, 6 native speakers are hired to record all test samples. Different levels of environment noise are added, and the audio signals are sent to an ASR system to obtain the final ASR hypotheses. The overall process is depicted in Figure 1.
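As a rough sketch, the exclusion criteria above can be expressed as a simple filter over candidate test sentences. The regular expressions and the 100-word limit below follow the description in this section, but the exact rules used by the authors are not specified, so this is illustrative only.

```python
import re

MAX_WORDS = 100  # sentences longer than this are excluded

def is_recordable(sentence: str) -> bool:
    """Heuristic filter for test sentences suitable for human recording.

    Excludes samples containing non-standard tokens (digits, currency symbols,
    all-caps abbreviations) or that are too long to read naturally.
    The patterns are illustrative, not the authors' actual rules.
    """
    if len(sentence.split()) > MAX_WORDS:
        return False
    if re.search(r"[\$€£%@#]|\d", sentence):    # currency / numeric expressions
        return False
    if re.search(r"\b[A-Z]{2,}\b", sentence):   # likely abbreviations
        return False
    return True

candidates = ["The drain is clogged with hair.", "Revenue grew 12% to $3.4B."]
recordable = [s for s in candidates if is_recordable(s)]  # keeps only the first sentence
```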

Table 1: Statistics of ASR-GLUE. The reported hours are the sum of recording time under different
levels of noise. We also report Word Error Rates (WER) for test sets under each noise level.
 Corpus    #Train   #Dev   #Test   WER (Low Noise)   WER (Medium Noise)   WER (High Noise)   Hours (Test+Dev)
 SST-2     67349    2772   2790    18.35%            30.76%               34.86%              8.05
 STS-B     5749     3042   3222    12.49%            24.70%               28.53%             10.82
 QQP       363846   1476   3996    13.78%            24.34%               27.45%             11.56
 QNLI      104743   2718   2718    22.51%            33.61%               37.90%             18.00
 RTE       2490     2070   2088    24.54%            39.47%               47.03%             26.19
 SciTail   23596    2718   2736    17.55%            30.09%               34.04%             16.81

Human Audio Recording We hire six native speakers to record the test sets for each task. The speakers are 3 male and 3 female from the U.S., with ages ranging from 18 to 40. In the recording process, each speaker is required to record all text samples independently so that we can study the impact of speaker variation. They are instructed to imagine they are communicating with someone when speaking the sentences of the six tasks. Note that they are allowed to make minor changes to the original text for natural and smooth expression, such as changing cannot to can't. To collect high-quality audio, all the original speech signals are recorded in a low-noise environment.

Environment Noise In real-world scenarios, human audio is often recorded with environment noise and reverberation [24]. The presence of background interference leads to substantial performance degradation of current ASR systems [25] and further affects the downstream NLU systems [26]. Therefore, to better evaluate the robustness of NLU models in noisy acoustic environments, speech data with different levels of noise is further provided in the ASR-GLUE corpus.
In ASR-GLUE, the widely-adopted simulation approach [27] is used to introduce different levels of noise and reverberation into the low-noise audio signals. Specifically, background noise such as phone rings, alarm clocks and incoming vehicles is randomly sampled and added into the original recordings with a signal-to-noise ratio (SNR) from 10 dB to 15 dB. Here SNR is defined as $\mathrm{SNR} := 10\log_{10}\frac{\|s_{\mathrm{target}}\|^2}{\|s_{\mathrm{noise}}\|^2}$, where $s_{\mathrm{target}}$ and $s_{\mathrm{noise}}$ denote the acoustic signal of the clean speech and the noise respectively, and $\|s\|^2 = s^{T}s$ denotes the signal power. In addition, room reverberation is also introduced by convolving the recorded audio signals with Room Impulse Responses (RIRs)4 generated by the image-source method [28]. The simulation process covers 843 kinds of different background noise and 417 types of different RIRs in total.
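For concreteness, the simulation step can be sketched as below: a noise segment is scaled so that the mixture reaches a target SNR under the definition above, and reverberation is simulated by convolution with an RIR. This is a minimal illustration with synthetic arrays; the actual process uses the sampled noise recordings and image-source RIRs described above, and implementation details may differ.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at the requested signal-to-noise ratio (in dB)."""
    noise = np.resize(noise, speech.shape)              # tile/trim noise to match length
    p_speech = np.sum(speech ** 2)
    p_noise = np.sum(noise ** 2) + 1e-12
    # Solve 10*log10(p_speech / (alpha^2 * p_noise)) = snr_db for the noise scale alpha.
    alpha = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + alpha * noise

def add_reverb(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate room reverberation by convolving speech with a room impulse response."""
    return np.convolve(speech, rir)[: len(speech)]

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                      # 1 s of dummy audio at 16 kHz
noisy = add_noise_at_snr(add_reverb(clean, rir=rng.standard_normal(2048) * 0.01),
                         noise=rng.standard_normal(16000), snr_db=10.0)  # "high noise" setting
```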
Finally, for each recorded human audio signal we obtain three versions: (1) low-level noise, the same as the original audio; (2) medium-level noise, which introduces reverberation and 15 dB SNR background noise into the original audio; (3) high-level noise, which introduces reverberation and
   3
       If there is no public test set, we use their dev set instead.
   4
       The noise and RIR files can be found via link: http://www.openslr.org/resources/28/rirs_noises.zip


                   
                                                                                                             
Figure 2: (a) The performance of BERT on different tasks under different levels of noise. The shaded area represents human performance. (b) Accuracy results for different model architectures on the SST-2 task. Here "Human" indicates human performance under various noise settings. "Clean" stands for testing on clean text data. "Low/Medium/High" stand for testing under low-level/medium-level/high-level noise respectively.

10 dB SNR background noise into the original audio. Then we build an ASR system trained on 6,000 hours of speech based on the widely-used open-source Kaldi toolkit5 [29, 30] to transcribe these audio files into text.

3   Analysis of ASR-GLUE

In this section we give extensive analyses of the proposed ASR-GLUE dataset. In Section 3.1, we obtain human performance to measure the ceiling performance on the test set. Then we analyse the performance of recent SOTA NLU models across different levels of environment noise in Section 3.2. Furthermore, in Section 3.3 we categorize ASR errors into four types and systematically investigate the impact of different error types on NLU performance. Finally, we analyse the effect of voice variation across different speakers in Section 3.4.

3.1   Human Performance

To obtain human performance under environment noise, we hire native annotators to predict the label of each test sample in audio form. The annotators are hired from a third-party company. To guarantee that the annotators fully understand these tasks, we first give them a brief training on each task in ASR-GLUE. Then we ask them to take an exam before starting annotation; only annotators who pass the exam are hired. Finally, we have 3 annotators to measure the ceiling performance for each task in ASR-GLUE. Details about the exam and annotation process are presented in the Appendix.

3.2   Performance of Existing NLU Models

To demonstrate the significance of the ASR error issue, we use typical NLU models such as BERT and test their robustness to different levels of ASR error on different tasks. As shown in Figure 2(a), while BERT yields promising results on error-free text, its performance degrades in the presence of ASR error on all six tasks. As the noise increases, the performance of the model drops more severely. In contrast, humans are less affected by the environment noise, which indicates that there still remains a gap between the robustness of models and humans to ASR error.
We also investigate the effect of ASR error with different noise levels on different NLU models. We adopt the base versions of BERT [31], RoBERTa [32], ALBERT [33] and XLNet [34] as the NLU models. As shown in Fig. 2(b), we can observe that all these pretrained language models are sensitive to ASR error and their performance degrades as the noise level increases.

3.3   Breakdown Analysis by ASR Error Type

We categorize ASR errors into four types, namely similar sounds, liaison, insertion and deletion. Specifically, (i) similar sounds errors happen when the ASR system
               5
                    https://github.com/kaldi-asr/kaldi

Table 2: Examples of speech recognition perturbation in ASR-GLUE.
 Error Type       Ground Truth                                        Recognition Result
 Similar sounds   The man couldn't lift his son.                      The man couldn't lived his son.
                  Tommy dropped his ice cream.                        Tommy jumped his ice cream.
 Liaison          Does Quora stand for question or answer.            Does Quora Stanford question or answer.
                  The drain is clogged with hair.                     The drains clogged with hair.
 Insertion        This afternoon.                                     This after afternoon.
                  A warm funny ∗ engaging film                        A warm funny and engaging film.
 Deletion         A black and white photo of an old train station.    A black ∗ white ∗ of ∗ train station
                  Old style bicycle parked on floor                   Old style bicycle ∗ ∗ floor

                                       
                                                                                                                                                       
                                                                                                                                                 
Figure 3: Left: The percentage of each error type under different noise settings in the SST-2 dataset. Right: The accuracy of BERT on four subsets under different noise levels. Each subset only contains test samples with one specific error type. For example, the red block represents the accuracy of BERT on test samples which contain similar sounds errors. The shaded area represents the performance degradation against clean text caused by a specific error type.

wrongly identifies one word as another with similar pronunciation; (ii) liaison errors frequently occur between successive words that have sound fusion across word boundaries; (iii) insertion happens when the ASR system produces redundant words; (iv) deletion happens when there are word omissions in the ASR hypothesis. Examples of each error type are presented in Table 2.
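As an illustrative sketch, error types of this kind can be approximated by aligning the reference and hypothesis at the word level and inspecting the edit operations. The heuristics below (shared leading sound for similar-sounds substitutions, length-changing replacements treated as liaison) are our own simplification and not the authors' exact categorization procedure.

```python
import difflib

def classify_errors(reference: str, hypothesis: str) -> set:
    """Label an ASR hypothesis with coarse error types via word-level alignment."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    types = set()
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op == "insert":
            types.add("insertion")
        elif op == "delete":
            types.add("deletion")
        elif op == "replace":
            # One-to-one substitution with a shared leading sound -> "similar sounds";
            # many-to-few / few-to-many replacements -> treated here as "liaison".
            if (i2 - i1) == (j2 - j1) and all(r[0] == h[0] for r, h in zip(ref[i1:i2], hyp[j1:j2])):
                types.add("similar sounds")
            else:
                types.add("liaison")
    return types

print(classify_errors("The man couldn't lift his son.", "The man couldn't lived his son."))
# -> {'similar sounds'}
```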
We report the percentage of each aforementioned error type in Figure 3(a). We take the SST-2 task as an example and observe that similar sounds errors are the most common. As the noise increases, the percentages of similar sounds and deletion errors made by the ASR system gradually increase, while the percentages of liaison and insertion errors remain relatively stable. Note that the percentages of the four error types sum to over 100%, since different error types may simultaneously exist in one hypothesis.
We also investigate the impact of each error type on the NLU model on the SST-2 task. We group the data according to error types and compare the model performance with that on the same grouped raw data without errors. As shown in Figure 3(b), we can observe that BERT handles liaison errors better, as the performance degradation is minimal under varied noise settings. As the noise increases, the accuracy of the model decreases more severely for every error type.

3.4   Effect of Individual Difference

We test the effect of the voice variation of the six hired speakers on the test set of SST-2. As shown in Fig. 4, we can observe that the recognition quality and classification accuracy vary greatly across speakers. For example, the accuracy of BERT on S1 (corresponding to Speaker 1) is consistently higher than on S6 (corresponding to Speaker 6) by a very large margin (27% ∼ 32%). Meanwhile, we can also observe that the drop in classification accuracy follows the increase in WER: a higher WER means more ASR errors in the test samples, resulting in more misclassifications.
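WER here is the standard word-level edit distance between the reference transcript and the ASR hypothesis, normalized by the reference length; a minimal dynamic-programming implementation is sketched below for reference.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("old style bicycle parked on floor", "old style bicycle floor"))
# ~0.33 (2 deletions over 6 reference words)
```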

4   ASR-Robust NLU

To alleviate the effect of ASR error on NLU models, we propose two strategies: correction-based methods and augmentation-based methods. As shown in Fig. 5, the first strategy (Section 4.1) is to recover the clean text from the erroneous text, while the second aims to train NLU models on data
Figure 4: NLU performance under voice variation from different speakers.

                                       (a) ASR Error Correction

                                        (b) Data Augmentation

Figure 5: An illustration of the two proposed strategies: ASR error correction and data augmentation. (a) is an example of ASR correction, which corrects the erroneous source input "hello word" to the correct target output "hello world". (b) is an example of data augmentation. The source input is clean text data (e.g., "hello world") and the target output is noisy data with ASR error (e.g., "hello word"). We simulate ASR error at the audio or text level and train the NLU models with the augmented data.

augmented with ASR error. In the second strategy, there are two types of augmentation methods: 1) audio-level augmentation (Section 4.2), where we adopt a TTS-then-ASR pipeline which first transforms the training corpus into audio files and then converts them into text containing ASR error; 2) text-level augmentation (Section 4.3), where, due to the high cost and latency of TTS and ASR systems, we propose to use generation models [11, 12] or manually predefined rules [13, 14] as a substitute for the TTS-then-ASR pipeline to directly generate text with ASR error from clean text.

4.1   ASR Correction

ASR error correction is an ASR post-processing step that recovers the clean text from erroneous text. We adopt two of the most typical technical paradigms: GECToR [35] and BART [12] (referred to as BART-C) to transform the output of the ASR system into clean text.
In training, the model takes the ASR hypothesis as input and its corresponding clean transcript as output. In this way the model learns to correct the errors in the hypothesis and recover a clean sentence. At inference time, the input to the NLU models is the output of the ASR error correction model rather than the original ASR hypothesis.
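A minimal sketch of the BART-C setup using the Hugging Face seq2seq API is given below: the model is trained to map a noisy hypothesis to its clean transcript, and at inference its beam-search output replaces the raw hypothesis before reaching the NLU model. The toy training pair and hyperparameters are illustrative only and do not reproduce the authors' exact training recipe.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# One toy (hypothesis, transcript) pair; in practice these pairs come from LibriSpeech.
inputs = tokenizer(["hello word"], return_tensors="pt")
labels = tokenizer(["hello world"], return_tensors="pt").input_ids

# Training step: the model learns to recover the clean transcript from the noisy hypothesis.
loss = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels).loss
loss.backward()

# Inference: beam-search decoding of the corrected sentence, which is then fed to the NLU model.
corrected_ids = model.generate(inputs.input_ids, num_beams=4, max_length=64)
print(tokenizer.decode(corrected_ids[0], skip_special_tokens=True))
```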

4.2   Audio-Level Augmentation

We also attempt to train the NLU models with ASR error so that they can learn to understand text containing ASR error at inference time. One straightforward way is to hire human speakers to record all training data as audio files. However, this is impractical due to the high labor cost. Thus, as shown in the upper part of Figure 5(b), in audio-level augmentation we adopt a TTS system to convert the text-form training data into audio files. Then we add random environment noise into the audio

and adopt an ASR system to convert the audio files into ASR hypotheses. Specifically, the TTS system we adopt is DurIAN [36]. The noise simulation and speech recognition follow the same process as described in Section 2.2. In training, we use these augmented data as additional training data, along with the original training set, to train the NLU model. Note that the original training set is important and cannot be discarded during training if the model is to maintain its original performance on clean text.
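Schematically, the audio-level augmentation pipeline looks like the sketch below, where `synthesize`, `add_noise` and `transcribe` are placeholders standing in for the TTS system, the noise simulation of Section 2.2 and the Kaldi-based ASR system respectively; they are not real APIs.

```python
import random

def audio_level_augment(train_set, synthesize, add_noise, transcribe):
    """TTS-then-ASR augmentation: clean text -> synthetic audio -> noisy audio -> ASR hypothesis.

    `synthesize`, `add_noise` and `transcribe` are placeholders for the TTS system,
    the noise/reverberation simulation and the ASR system respectively.
    """
    augmented = []
    for text, label in train_set:
        audio = synthesize(text)                                 # TTS (e.g., DurIAN)
        noisy = add_noise(audio, snr_db=random.uniform(10, 15))  # random environment noise
        hypothesis = transcribe(noisy)                           # ASR system
        augmented.append((hypothesis, label))                    # the label is unchanged
    # Keep the original clean data alongside the augmented copies (1:1 proportion).
    return train_set + augmented
```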

4.3   Text-Level Augmentation

In audio-level augmentation, the audio from the TTS system is still quite different from human voice, which may cause a different error distribution. Besides, the TTS-then-ASR pipeline is time-consuming and expensive: it takes one week to generate the synthesized audio signals using four 32-core Intel E5 CPU servers and another week to transcribe them with three 2-card Tesla P4 GPU servers using the Kaldi-based ASR system. We therefore propose text-level augmentation, which uses text generation models as a substitute for the TTS-then-ASR pipeline and directly generates ASR hypotheses from the original textual training data. These generation models take the clean transcript as the model input and are trained to generate the corresponding hypothesis as output.
Concretely, we adopt three different models to generate text with ASR error. Two pretrained language models, GPT-2 [11] and BART [12] (referred to as BART-S), are used as the generation models to generate ASR hypotheses. We also adopt a rule-based method based on a confusion matrix (abbreviated as CM) [14]. In CM, each ASR hypothesis and its reference text are aligned at the word level by minimizing the Levenshtein distance between them. We then construct a confusion matrix over the aligned n-grams and inject ASR error according to the frequencies of the confusions. Further comparisons of the aforementioned methods are presented in the experiments.
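A simplified version of the CM procedure is sketched below: hypotheses and references are aligned at the word level, word-level confusions (including deletions) are counted, and errors are injected into clean text by sampling from the resulting confusion distribution. The alignment here uses difflib rather than an explicit Levenshtein aligner, and n-gram handling and smoothing are omitted, so this only approximates the method of [14].

```python
import difflib, random
from collections import Counter, defaultdict

def build_confusion_matrix(pairs):
    """Count word-level confusions from (reference, hypothesis) ASR pairs."""
    confusions = defaultdict(Counter)
    for ref, hyp in pairs:
        ref, hyp = ref.lower().split(), hyp.lower().split()
        for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
            if op == "replace" and (i2 - i1) == (j2 - j1):
                for r, h in zip(ref[i1:i2], hyp[j1:j2]):
                    confusions[r][h] += 1
            elif op == "delete":
                for r in ref[i1:i2]:
                    confusions[r][""] += 1          # deletion recorded as an empty string
    return confusions

def inject_errors(sentence, confusions, rate=0.15):
    """Replace (or delete) words according to observed confusions at a given rate."""
    out = []
    for word in sentence.lower().split():
        if word in confusions and random.random() < rate:
            candidates, counts = zip(*confusions[word].items())
            word = random.choices(candidates, weights=counts)[0]
        if word:                                    # drop words replaced by "" (deletions)
            out.append(word)
    return " ".join(out)

cm = build_confusion_matrix([
    ("the man couldn't lift his son", "the man couldn't lived his son"),
    ("old style bicycle parked on floor", "old style bicycle floor"),
])
print(inject_errors("he parked on the floor to lift the bicycle", cm, rate=1.0))
# -> "he the floor to lived the bicycle"
```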

5   Experiment

5.1   Experimental Settings

The clean transcriptions and their corresponding ASR hypotheses used to train the correction and augmentation models are collected from the public dataset LibriSpeech [37]. LibriSpeech is an ASR corpus based on public-domain audio books which contains 1000 hours of speech and the corresponding transcriptions. We use the Kaldi-based ASR system to convert the audio recordings into ASR hypotheses. For correction-based methods, the input is the hypothesis and the output is its corresponding clean transcript. For text-level augmentation methods, the input is the clean transcript and the output is its corresponding hypothesis containing ASR error. In both text-level and audio-level augmentation, the proportion between the augmented data and the original data is 1:1, and we combine them as the training set.
The NLU model used in our experiments is BERTbase6. We use Adam [38] with an initial learning rate of 5e-5. For GPT-27 and BART8 used in text-level augmentation (BART-S) and ASR correction (BART-C), sentences are tokenized with byte-pair encoding (BPE) [39]. Both of them use beam search [40] as their decoding strategy. For GECToR, the sequence tagging model is a RoBERTa [32] encoder stacked with two linear layers with softmax layers on top; BPE is used for tokenization. Early stopping is used, with a stopping criterion of 3 epochs of 10K updates each without improvement.
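The downstream training is standard BERT fine-tuning on the union of the original and augmented data. A condensed sketch using the Hugging Face Trainer is shown below; the toy datasets, output directory and batch size are illustrative stand-ins, and only the learning rate follows the setting above.

```python
from datasets import Dataset, concatenate_datasets
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy stand-ins for the original training set and its ASR-error-augmented copy (1:1 proportion).
clean_ds = Dataset.from_dict({"text": ["a warm funny engaging film"], "label": [1]})
augmented_ds = Dataset.from_dict({"text": ["a warm funny and engaging film"], "label": [1]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train_ds = concatenate_datasets([clean_ds, augmented_ds]).map(tokenize, batched=True).shuffle(seed=42)

args = TrainingArguments(output_dir="asr_glue_baseline", learning_rate=5e-5,
                         per_device_train_batch_size=16, num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```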

5.2   Results

The main results of the proposed methods are shown in Table 3. We can observe that the NLU models are sensitive to ASR error and that NLU performance degrades severely across the various tasks. The proposed methods can effectively improve the robustness of the model to a certain extent in most scenarios, but are still far from human performance. For example, the accuracy on the SciTail task under high environment noise declines by 32.35% and is restored from 62.39% to 78.62% by the audio-level augmentation method. In contrast, humans are almost unaffected by the ASR error and still maintain
      6
        https://huggingface.co/bert-base-uncased
      7
        https://huggingface.co/gpt2
      8
        https://huggingface.co/facebook/bart-large/tree/main

Table 3: Baseline performance on the ASR-GLUE test sets. Here "Clean" stands for testing on clean text data. "Low/Medium/High" stand for testing on the low-level/medium-level/high-level noise versions respectively. Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. All values are scaled by 100.
 Task      Noise    BERT    Audio-level    Text-level Augmentation     Correction         BEST (∆)         Human
                            Augmentation   CM      GPT-2   BART-S      GECToR   BART-C
 SST-2     Clean    92.25   92.90          91.61   90.96   92.90       92.25    92.25     92.90 (+0.65)    -
           Low      88.60   89.35          86.99   87.31   89.78       87.74    87.85     89.78 (+1.18)    90.40
           Medium   83.42   85.68          84.39   82.45   84.71       83.42    82.35     84.71 (+1.29)    87.90
           High     80.84   81.92          81.59   81.16   81.16       81.70    78.69     81.70 (+0.86)    86.18
 STS-B     Clean    91.13   91.54          90.90   90.76   92.22       91.13    91.13     92.22 (+1.09)    -
           Low      85.39   87.21          86.95   86.17   86.58       84.68    83.39     87.21 (+1.82)    88.00
           Medium   69.93   75.49          74.16   73.14   71.38       68.38    69.73     75.49 (+5.56)    86.59
           High     63.89   70.66          68.70   68.03   65.89       62.83    65.42     70.66 (+6.77)    85.44
 QQP       Clean    87.16   87.75          88.03   88.40   87.75       87.16    87.16     88.40 (+1.24)    -
           Low      76.75   81.98          80.58   81.33   79.18       76.66    73.67     81.98 (+5.23)    83.22
           Medium   69.47   76.94          73.76   75.07   70.40       70.59    67.51     76.94 (+7.47)    81.02
           High     67.77   76.18          72.85   72.21   68.79       69.62    66.67     76.18 (+8.41)    77.60
 SciTail   Clean    94.74   90.79          94.74   94.74   91.45       94.74    94.74     94.74 (+0.00)    -
           Low      75.33   85.53          85.31   85.31   77.63       79.61    80.26     85.53 (+10.20)   90.82
           Medium   64.91   81.69          76.54   77.41   71.05       70.07    71.05     81.69 (+16.78)   91.08
           High     62.39   78.62          74.34   73.90   69.52       67.98    69.08     78.62 (+16.23)   90.13
 QNLI      Clean    90.73   87.42          90.07   89.40   88.74       90.73    90.73     90.73 (+0.00)    -
           Low      83.33   84.11          86.09   86.64   85.43       83.77    84.66     86.64 (+3.31)    87.23
           Medium   78.48   82.23          84.11   84.44   83.33       79.80    79.91     84.44 (+5.96)    86.29
           High     77.15   82.12          82.23   83.44   81.57       76.93    76.71     83.44 (+6.29)    83.68
 RTE       Clean    68.93   63.76          66.37   64.69   66.27       68.93    68.93     68.93 (+0.00)    -
           Low      60.06   62.36          58.91   61.78   59.91       60.78    59.05     62.36 (+2.30)    65.43
           Medium   53.87   57.33          54.89   60.78   55.60       56.35    53.59     60.78 (+6.91)    64.21
           High     52.86   52.73          50.29   56.90   49.86       55.02    51.44     56.90 (+4.04)    63.14

a high accuracy of 90.13%. Moreover, human performance is more stable than the NLU model across all noise levels for the various tasks. We can also observe that although the audio-level augmentation method achieves promising results on most tasks, it is even worse than the original BERT on the RTE task under high environment noise (52.73% vs 52.86%).
We also find that the performance of different methods varies across tasks, and it is difficult to find a universal method for all scenarios. Although the audio-level data augmentation method always gains the highest accuracy on the data with ASR errors, it cannot maintain the original performance on clean data for certain tasks, e.g., it is worse than the original BERT on clean text data (90.79% vs 94.74% on SciTail, 87.42% vs 90.73% on QNLI, and 63.76% vs 68.93% on RTE).
Another interesting phenomenon is that ASR error correction is less effective in many cases. On half of the tasks, such as SST-2, STS-B and QNLI, BERT with correction performs similarly to or even worse than the original one. On the other tasks, although it performs better than the original BERT, it is still worse than the augmentation-based methods. One possible reason for its poor performance is that the ASR system already integrates a strong n-gram language model to guarantee the quality of the system output, so an additional language model is redundant and cannot make further improvements.
Comparing across tasks, we observe that text-level augmentation achieves more promising results than the other methods. It works well in most situations, generally without degrading NLU performance on clean text. We further compare the three text-level augmentation methods and find that none has an absolute advantage. GPT-2-based augmentation performs better in most situations, but on the RTE task with clean text its accuracy is much lower than that of the original BERT (64.69% vs 68.93%).
Overall, although the proposed methods can improve model robustness to some extent, a large gap still remains between these approaches and humans, which indicates much room for further work on improving the robustness of NLU models.

5.3   Effect of ASR Systems

To verify the assumption that ASR error may no longer be a serious problem for NLU given a good enough ASR system, we conduct a further experiment in which we replace our Kaldi-based ASR system with a SOTA public ASR system: Google ASR. We replace the ASR system with Google ASR on the dev and test sets (not on the training set due to its high price) and test the performance of BERT on these new sets.

Figure 6: Performance of BERT on the STS-B task with the Google ASR system, evaluated by Spearman correlations.
[Left panel: Spearman correlations of vanilla BERT for Speaker 1 and Speaker 6 under low, medium and high noise.]
 Right panel:
 Method                              Clean   Low     Medium   High
 BERT (Google ASR)                   91.13   86.97   83.65    82.20
 Audio-level Augmentation            91.54   89.66   86.75    85.95
 Text-level Augmentation (CM)        90.90   89.00   85.82    84.63
 Text-level Augmentation (GPT-2)     90.76   89.35   85.78    84.44
 Text-level Augmentation (BART-S)    92.22   88.56   84.62    83.38
 Correction (GECToR)                 91.13   86.29   83.13    82.04
 Correction (BART-C)                 91.13   86.48   83.80    82.06

As shown in the left part of Figure 6, we report the performance of vanilla BERT on STS-B for two different speakers (Speaker 1 and Speaker 6). The performance on Speaker 1 seems consistent with the previous assumption that a good ASR system solves this problem. Nevertheless, for Speaker 6, we observe the opposite phenomenon: the model performance still drops a lot (about 15%) in the presence of ASR error. To investigate the reason for this difference, we carefully compared their audio and found that although both are native speakers from the U.S., Speaker 6 has a strong regional accent which can significantly increase the ASR error. Since accents widely exist in audio signals, it is unrealistic to assume that the input audio is always clean and standard. So even with a good ASR system, strong and robust NLU models are still necessary.
We also report the results of the proposed methods with Google ASR in the right part of Figure 6. On the STS-B task, the performance of most methods is consistent with the results in Table 3. We can observe that most data augmentation-based methods still perform well while the correction-based methods are less effective. Although these methods are trained on the basis of another ASR system, they generalize to the Google ASR system as well, which demonstrates the generalization ability of the augmentation-based methods.

6   Related Work

Many benchmark datasets have been created to facilitate Spoken Language Understanding (SLU) [41, 42, 5, 43, 26, 44], which evaluate the robustness of the downstream NLU model against the erroneous output from the upstream acoustic model [7, 8, 9, 10]. However, they are only designed for a particular domain or a specific task such as intent detection and slot filling. In contrast, the human ability to understand language is general, flexible, and robust. There is a need to test general-purpose natural language understanding capability on diverse tasks in different domains.
Large-scale pretrained language models have achieved striking performance on NLU in recent years [45, 34]. Recently, many works have tested their robustness with human-crafted adversarial examples [46] or examples generated by adversarial attacks [45, 47, 48, 49]. [50] projects the input data to a latent space with generative adversarial networks (GANs), and then retrieves adversaries close to the original instance in the latent space. [51] propose controlled paraphrase networks to generate syntactically adversarial examples that both fool pre-trained models and improve the robustness of these models to syntactic variation when used to augment their training data. However, the robustness of pre-trained models to speech recognition errors in real conditions has not been fully explored.

7   Conclusion

We present ASR-GLUE, a new benchmark for evaluating general-purpose language understanding under ASR error in speech-based applications. We propose two ways to improve the robustness of NLU systems and find that there is still a gap between the NLU capability of the models and humans. ASR-GLUE offers a rich and challenging testbed for work on developing ASR-robust models for general-purpose language understanding. Given the difficulty of ASR-GLUE, we expect that further progress in multi-task, multi-model learning techniques will be necessary to approach human-level performance on the benchmark.

References
 [1] Wang, L., M. Fazel-Zarandi, A. Tiwari, et al. Data augmentation for training dialog models
     robust to speech recognition errors. arXiv preprint arXiv:2006.05635, 2020.
 [2] Williams, J. D., S. Young. Partially observable markov decision processes for spoken dialog
     systems. Computer Speech & Language, 21(2):393–422, 2007.
 [3] Wang, S., T. Gunter, D. VanDyke. On modelling uncertainty in neural language generation for
     policy optimisation in voice-triggered dialog assistants. In 2nd Workshop on Conversational
     AI: Today’s Practice and Tomorrow’s Potential, NeurIPS. 2018.
 [4] Serdyuk, D., Y. Wang, C. Fuegen, et al. Towards end-to-end spoken language understanding. In
     2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
     pages 5754–5758. IEEE, 2018.
 [5] Wang, P., L. Wei, Y. Cao, et al. Large-scale unsupervised pre-training for end-to-end spoken
     language understanding. In ICASSP 2020-2020 IEEE International Conference on Acoustics,
     Speech and Signal Processing (ICASSP), pages 7999–8003. IEEE, 2020.
 [6] Errattahi, R., A. El Hannani, H. Ouahmane. Automatic speech recognition errors detection and
     correction: A review. Procedia Computer Science, 128:32–37, 2018.
 [7] Schumann, R., P. Angkititrakul. Incorporating asr errors with attention-based, jointly trained
     rnn for intent detection and slot filling. In 2018 IEEE International Conference on Acoustics,
     Speech and Signal Processing (ICASSP), pages 6059–6063. IEEE, 2018.
 [8] Weng, Y., S. S. Miryala, C. Khatri, et al. Joint contextual modeling for asr correction and
     language understanding. In ICASSP 2020-2020 IEEE International Conference on Acoustics,
     Speech and Signal Processing (ICASSP), pages 6349–6353. IEEE, 2020.
 [9] Rao, M., A. Raju, P. Dheram, et al. Speech to semantics: Improve asr and nlu jointly via
     all-neural interfaces. arXiv preprint arXiv:2008.06173, 2020.
[10] Huang, C.-W., Y.-N. Chen. Learning asr-robust contextualized embeddings for spoken language
     understanding. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech
     and Signal Processing (ICASSP), pages 8009–8013. IEEE, 2020.
[11] Radford, A., J. Wu, R. Child, et al. Language models are unsupervised multitask learners.
     OpenAI blog, 1(8):9, 2019.
[12] Lewis, M., Y. Liu, N. Goyal, et al. Bart: Denoising sequence-to-sequence pre-training for
     natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461,
     2019.
[13] Fazel-Zarandi, M., L. Wang, A. Tiwari, et al. Investigation of error simulation techniques for
     learning dialog policies for conversational error recovery. arXiv preprint arXiv:1911.03378,
     2019.
[14] Schatzmann, J., B. Thomson, S. Young. Error simulation for training statistical dialogue systems.
     In 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pages
     526–531. IEEE, 2007.
[15] Wang, A., A. Singh, J. Michael, et al. Glue: A multi-task benchmark and analysis platform for
     natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
[16] Socher, R., A. Perelygin, J. Wu, et al. Recursive deep models for semantic compositionality over
     a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural
     language processing, pages 1631–1642. 2013.
[17] Cer, D., M. Diab, E. Agirre, et al. Semeval-2017 task 1: Semantic textual similarity-multilingual
     and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.
[18] Rajpurkar, P., J. Zhang, K. Lopyrev, et al. Squad: 100,000+ questions for machine comprehen-
     sion of text. arXiv preprint arXiv:1606.05250, 2016.
[19] Dagan, I., O. Glickman, B. Magnini. The pascal recognising textual entailment challenge. In
     Machine Learning Challenges Workshop, pages 177–190. Springer, 2005.
[20] Haim, R. B., I. Dagan, B. Dolan, et al. The second pascal recognising textual entailment
     challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising
     Textual Entailment. 2006.

[21] Giampiccolo, D., B. Magnini, I. Dagan, et al. The third pascal recognizing textual entail-
     ment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and
     paraphrasing, pages 1–9. 2007.
[22] Bentivogli, L., P. Clark, I. Dagan, et al. The fifth pascal recognizing textual entailment challenge.
     In TAC. 2009.
[23] Khot, T., A. Sabharwal, P. Clark. Scitail: A textual entailment dataset from science question
     answering. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32. 2018.
[24] Wölfel, M., J. McDonough. Distant speech recognition. John Wiley & Sons, 2009.
[25] Barker, J., S. Watanabe, E. Vincent, et al. The fifth 'CHiME' speech separation and recognition
     challenge: dataset, task and baselines. arXiv preprint arXiv:1803.10609, 2018.
[26] Henderson, M., B. Thomson, J. D. Williams. The second dialog state tracking challenge. In
     Proceedings of the 15th annual meeting of the special interest group on discourse and dialogue
     (SIGDIAL), pages 263–272. 2014.
[27] Ko, T., V. Peddinti, D. Povey, et al. A study on data augmentation of reverberant speech for
     robust speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and
     Signal Processing (ICASSP), pages 5220–5224. IEEE, 2017.
[28] Habets, E. A. Room impulse response generator. Technische Universiteit Eindhoven, Tech.
     Rep, 2(2.4):1, 2006.
[29] Povey, D., A. Ghoshal, G. Boulianne, et al. The kaldi speech recognition toolkit. In IEEE 2011
     workshop on automatic speech recognition and understanding, CONF. IEEE Signal Processing
     Society, 2011.
[30] Povey, D., V. Peddinti, D. Galvez, et al. Purely sequence-trained neural networks for asr based
     on lattice-free mmi. In Interspeech, pages 2751–2755. 2016.
[31] Devlin, J., M.-W. Chang, K. Lee, et al. Bert: Pre-training of deep bidirectional transformers for
     language understanding. arXiv preprint arXiv:1810.04805, 2018.
[32] Liu, Y., M. Ott, N. Goyal, et al. Roberta: A robustly optimized bert pretraining approach. arXiv
     preprint arXiv:1907.11692, 2019.
[33] Lan, Z., M. Chen, S. Goodman, et al. Albert: A lite bert for self-supervised learning of language
     representations. arXiv preprint arXiv:1909.11942, 2019.
[34] Yang, Z., Z. Dai, Y. Yang, et al. Xlnet: Generalized autoregressive pretraining for language
     understanding. arXiv preprint arXiv:1906.08237, 2019.
[35] Omelianchuk, K., V. Atrasevych, A. Chernodub, et al. GECToR: grammatical error correction,
     tag not rewrite. arXiv preprint arXiv:2005.12592, 2020.
[36] Yu, C., H. Lu, N. Hu, et al. Durian: Duration informed attention network for speech synthesis.
     Proc. Interspeech 2020, pages 2027–2031, 2020.
[37] Panayotov, V., G. Chen, D. Povey, et al. Librispeech: an asr corpus based on public domain
     audio books. In 2015 IEEE international conference on acoustics, speech and signal processing
     (ICASSP), pages 5206–5210. IEEE, 2015.
[38] Kingma, D. P., J. Ba.        Adam: A method for stochastic optimization.            arXiv preprint
     arXiv:1412.6980, 2014.
[39] Sennrich, R., B. Haddow, A. Birch. Neural machine translation of rare words with subword
     units. arXiv preprint arXiv:1508.07909, 2015.
[40] Koehn, P. Pharaoh: a beam search decoder for phrase-based statistical machine translation
     models. In Conference of the Association for Machine Translation in the Americas, pages
     115–124. Springer, 2004.
[41] Coucke, A., A. Saade, A. Ball, et al. Snips voice platform: an embedded spoken language
     understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190,
     2018.
[42] Lugosch, L., M. Ravanelli, P. Ignoto, et al. Speech model pre-training for end-to-end spoken
     language understanding. arXiv preprint arXiv:1904.03670, 2019.

[43] Price, P. Evaluation of spoken language systems: The atis domain. In Speech and Natural
     Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
     1990.
[44] Peng, B., C. Li, Z. Zhang, et al. Raddle: An evaluation benchmark and analysis platform for
     robust task-oriented dialog systems. arXiv preprint arXiv:2012.14666, 2020.
[45] Jin, D., Z. Jin, J. T. Zhou, et al. Is bert really robust? a strong baseline for natural language
     attack on text classification and entailment. In Proceedings of the AAAI conference on artificial
     intelligence, vol. 34, pages 8018–8025. 2020.
[46] Nie, Y., A. Williams, E. Dinan, et al. Adversarial nli: A new benchmark for natural language
     understanding. arXiv preprint arXiv:1910.14599, 2019.
[47] Madry, A., A. Makelov, L. Schmidt, et al. Towards deep learning models resistant to adversarial
     attacks. arXiv preprint arXiv:1706.06083, 2017.
[48] Zhu, C., Y. Cheng, Z. Gan, et al. Freelb: Enhanced adversarial training for natural language
     understanding. arXiv preprint arXiv:1909.11764, 2019.
[49] Dong, X., A. T. Luu, R. Ji, et al. Towards robustness against natural language word substitutions.
     In 9th International Conference on Learning Representations (ICLR). 2021.
[50] Zhao, Z., D. Dua, S. Singh. Generating natural adversarial examples. In 6th International
     Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May
     3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
[51] Iyyer, M., J. Wieting, K. Gimpel, et al. Adversarial example generation with syntactically
     controlled paraphrase networks. arXiv preprint arXiv:1804.06059, 2018.

A        Appendix
A.1       Detailed descriptions for the ASR-GLUE datasets

SST-2 The Stanford Sentiment Treebank [16] is a single-input understanding task for sentiment classification. The task is to predict the sentiment of a given sentence in the movie reviews domain. Accuracy (ACC) of the binary classification (positive or negative) is used as the metric.
STS-B The Semantic Textual Similarity Benchmark [17] consists of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. The task is to predict sentence similarity scores which range from 1 to 5. We evaluate using Pearson and Spearman correlation coefficients.
QQP The Quora Question Pairs 9 dataset consists of question pairs in the social QA domain. The task is to determine whether a pair of questions are semantically equivalent. Accuracy (ACC) is used as the metric.
QNLI Question-answering NLI is modified from the Stanford Question Answering dataset [18]. This
is a sentence pair classification task which determines whether the context sentence contains the
answer to the question. Accuracy (ACC) is used as the metric.
SciTail SciTail [23] is a recently released challenging textual entailment dataset collected from the
science domain. This is a natural language inference task which determines if a natural language
hypothesis can be justifiably inferred from a given premise. Accuracy (ACC) is used as the metric.
RTE The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges [19, 20, 21, 22]. All datasets are combined and converted to two-class classification: entailment and not entailment. Accuracy (ACC) is used as the metric.

A.2       Details of kaldi-based ASR word error rate results

                    Table 4: Detailed kaldi-based ASR WER on ASR-GLUE test set
  Corpus    Speaker1   Speaker2   Speaker3   Speaker4   Speaker5   Speaker6
  Low noise
  SST-2      11.50      18.53      30.03      19.94      13.96      38.36
  STS-B       6.87      11.89      17.76      12.62       7.67      13.62
  QQP        10.44      13.05      25.46      13.10      10.77      25.49
  QNLI       13.47      17.43      31.05      19.88      14.55      30.42
  RTE        23.04      18.98      34.34      22.05      15.60      37.77
  SciTail     8.47      13.98      23.49      16.83       9.18      27.53
  Medium noise
  SST-2      21.64      29.68      40.12      34.60      23.23      53.26
  STS-B      20.06      28.16      31.11      21.79      16.49      29.65
  QQP        18.85      22.13      39.01      19.90      18.26      40.39
  QNLI       22.98      31.75      40.09      32.82      20.98      47.32
  RTE        37.21      34.39      47.25      35.63      25.54      59.38
  SciTail    18.71      26.72      34.75      33.20      15.89      44.13
  High noise
  SST-2      25.98      33.90      42.82      37.77      27.27      55.31
  STS-B      24.66      32.61      33.22      25.09      18.83      33.71
  QQP        21.64      24.45      41.52      22.11      22.23      43.77
  QNLI       28.09      36.05      42.96      35.53      24.76      51.73
  RTE        44.12      41.20      54.56      50.70      33.33      64.91
  SciTail    23.13      29.89      36.60      37.65      19.93      48.55

A.3   Details of the exam and annotation process

In the annotation process, the annotators are required to predict the label of each test sample according to the corresponding audio signals. Note that for each test sample we have 18 audio signals (3 levels of noise × 6 speakers), which is a heavy burden for the annotators. So we randomly select one audio
    9
        data.quora.com/First-Quora-Dataset-Release-Question-Pairs

from the six speakers for each noise level and report the averaged performance of the annotators on all tasks.
The exam is set up to guarantee that the annotators fully understand each task. In the exam, we randomly select 50 samples in text form from the original datasets for each task. Annotators who achieve at least 90% accuracy on these samples are hired.
