DEEP XI AS A FRONT-END FOR ROBUST AUTOMATIC SPEECH RECOGNITION

 Aaron Nicolson and Kuldip K. Paliwal

 Signal Processing Laboratory, Griffith University, Brisbane, Australia

ABSTRACT

Current front-ends for robust automatic speech recognition (ASR) include masking- and mapping-based deep learning approaches to speech enhancement. A recently proposed deep learning approach to a priori SNR estimation, called Deep Xi, was able to produce enhanced speech at a higher quality and intelligibility than current masking- and mapping-based approaches. Motivated by this, we investigate Deep Xi as a front-end for robust ASR. Deep Xi is evaluated using real-world non-stationary and coloured noise sources at multiple SNR levels. Our experimental investigation shows that Deep Xi as a front-end is able to produce a lower word error rate than recent masking- and mapping-based deep learning front-ends. The results presented in this work show that Deep Xi is a viable front-end, and is able to significantly increase the robustness of an ASR system.

Availability: Deep Xi is available at: https://github.com/anicolson/DeepXi

Index Terms: Robust speech recognition, front-end, speech enhancement, pre-processing, Deep Xi

1. INTRODUCTION

Recently, an automatic speech recognition (ASR) system was able to achieve human parity on the Switchboard speech recognition task [1]. This milestone demonstrates how far ASR research has come in 67 years [2]. An example of a modern ASR system is Deep Speech [3], an end-to-end ASR system that uses a bidirectional recurrent neural network (BRNN) as its acoustic model [4]. Despite their accuracy in ideal conditions, modern ASR systems are still susceptible to performance degradation when environmental noise is present. A popular method to increase the robustness of an ASR system is to use a front-end to pre-process noise-corrupted speech. As the goal of a front-end is to perform noise suppression, a speech enhancement algorithm is typically employed.

Masking- and mapping-based deep learning approaches to speech enhancement are currently the leading front-ends in the literature [5]. They have the ability to produce enhanced speech that is highly intelligible, an important attribute for ASR [6]. An example of a masking-based approach is the long short-term memory network ideal ratio mask (LSTM-IRM) estimator [7], which can produce intelligible enhanced speech independent of the speaker. Examples of mapping-based approaches include the fully-connected neural network that employs multi-objective learning and binary mask post-processing to estimate the clean-speech log-power spectra (LPS) (Xu2017) [8], and the time-domain clean-speech estimator that uses encoder-decoder convolutional neural networks (CNNs) as the generator and the discriminator of a generative adversarial network (GAN), called SEGAN [9].

Previous generation front-ends include minimum mean-square error (MMSE) approaches to speech enhancement and missing data approaches. Cluster-based reconstruction is a prominent missing data approach, which reconstructs the unreliable spectral components (components with an a priori SNR of 0 dB or less [10]) based on their statistical relationship to the reliable components [11]. MMSE approaches, such as the MMSE short-time spectral amplitude (MMSE-STSA) estimator, rely on the accurate estimation of the a priori SNR [12]. However, previous a priori SNR estimators, like the decision-directed (DD) approach, introduce a tracking delay and a large amount of bias [12].

A deep learning approach to a priori SNR estimation, called Deep Xi, was recently proposed [13]. It is able to produce an a priori SNR estimate with negligible bias, and exhibits no tracking delay. Deep Xi enabled MMSE approaches to achieve higher quality and intelligibility scores than recent masking- and mapping-based deep learning approaches to speech enhancement. As Deep Xi is able to produce more intelligible enhanced speech than masking- and mapping-based deep learning approaches, we propose that Deep Xi can also be used as a front-end for robust ASR.

Here, Deep Xi as a front-end is evaluated using real-world non-stationary and coloured noise sources at multiple SNR levels. Deep Speech is used to evaluate each of the front-ends. Deep Speech is suitable for front-end evaluation as it is trained on multiple clean speech corpora and has no implemented front- or back-end techniques. The paper is organised as follows: an overview of Deep Xi is given in Section 2; the proposed front-end is presented in Section 3; the experiment setup is described in Section 4, including a description of each front-end; the results and discussion are presented in Section 5; conclusions are drawn in Section 6.
 Fig. 1. Deep Xi-ResLSTM (top) and Deep Xi-ResBiLSTM (bottom).

2. OVERVIEW OF DEEP XI

In [13], the Deep Xi framework for estimating the a priori SNR was proposed. The a priori SNR is given by:

  ξ[l, k] = λ_s[l, k] / λ_d[l, k],    (1)

where l is the time-frame index, k is the discrete-frequency index, λ_s[l, k] = E{|S[l, k]|²} is the variance of the clean speech spectral component, and λ_d[l, k] = E{|D[l, k]|²} is the variance of the noise spectral component. A residual LSTM (ResLSTM) network and a residual bidirectional LSTM (ResBiLSTM) network were used to estimate a mapped version of the a priori SNR for each component of the lth time-frame, as shown in Figure 1. The mapped a priori SNR, ξ̄_l, is described in Subsection 2.1. The input to each network is the noisy speech magnitude spectrum for the lth time-frame, |X_l|.

The hyperparameters for the ResLSTM and ResBiLSTM networks in [13] were chosen as a compromise between training time, memory usage, and speech enhancement performance. The ResLSTM network, as shown in Figure 1 (top), consists of five residual blocks, with each block containing an LSTM cell [14], F, with a cell size of 512. The residual connection is from the input of the residual block to after the LSTM cell activation. FC is a fully-connected layer with 512 nodes that employs layer normalisation followed by ReLU activation [15]. The output layer, O, is a fully-connected layer with sigmoidal units. The ResBiLSTM network, as shown in Figure 1 (bottom), is identical to the ResLSTM network, except that the residual blocks include both a forward and a backward LSTM cell (F and B, respectively) [4], each with a cell size of 512. The residual connection is applied from the input of the residual block to after the summation of the forward and backward cell activations. Details about the training strategy for the ResLSTM and ResBiLSTM networks are given in Subsection 4.2.
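To make the residual block structure concrete, the following is a minimal PyTorch sketch of the ResLSTM network described above (five residual blocks, cell size of 512, a layer-normalised ReLU input layer, and a sigmoidal output layer). It is an illustrative reconstruction, not the reference implementation: the input dimensionality of 257, the batch-first sequence layout, and the use of PyTorch rather than the original framework are assumptions.

```python
import torch
import torch.nn as nn

class ResLSTMBlock(nn.Module):
    """Residual block: an LSTM cell stack with a skip connection from the
    block input to after the LSTM activation, as described above."""
    def __init__(self, size=512):
        super().__init__()
        self.lstm = nn.LSTM(input_size=size, hidden_size=size, batch_first=True)

    def forward(self, x):            # x: (batch, frames, size)
        h, _ = self.lstm(x)
        return x + h                 # residual connection

class DeepXiResLSTM(nn.Module):
    """Sketch of Deep Xi-ResLSTM: FC input layer (layer norm + ReLU),
    five residual LSTM blocks, and a sigmoidal output layer that predicts
    the mapped a priori SNR for each frequency component."""
    def __init__(self, n_freq=257, size=512, n_blocks=5):
        super().__init__()
        self.fc_in = nn.Sequential(
            nn.Linear(n_freq, size), nn.LayerNorm(size), nn.ReLU())
        self.blocks = nn.ModuleList(
            [ResLSTMBlock(size) for _ in range(n_blocks)])
        self.out = nn.Sequential(nn.Linear(size, n_freq), nn.Sigmoid())

    def forward(self, x_mag):        # x_mag: (batch, frames, n_freq)
        h = self.fc_in(x_mag)
        for block in self.blocks:
            h = block(h)
        return self.out(h)           # mapped a priori SNR in [0, 1]
```

In the ResBiLSTM variant, each block would instead contain forward and backward LSTM cells whose activations are summed before the residual connection is added.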
2.1. Mapped a priori SNR training target

The training target for the ResLSTM and ResBiLSTM networks is the mapped a priori SNR, as described in [13]. The mapped a priori SNR is a mapped version of the oracle (or instantaneous) a priori SNR. For the oracle case, the clean speech and noise in Equation 1 are known completely. This means that λ_s[l, k] and λ_d[l, k] can be replaced with the squared magnitude of the clean speech and noise spectral components, respectively. In [13], the oracle a priori SNR (in dB), ξ_dB[l, k] = 10 log₁₀(ξ[l, k]), was mapped to the interval [0, 1] in order to improve the rate of convergence of the stochastic gradient descent algorithm used for training. The cumulative distribution function (CDF) of ξ_dB[l, k] was used as the map. As shown in [13], the distribution of ξ_dB for a given frequency component follows a normal distribution. It was thus assumed that ξ_dB[l, k] is distributed normally with mean μ_k and variance σ_k²: ξ_dB[l, k] ∼ N(μ_k, σ_k²). The map is given by

  ξ̄[l, k] = (1/2)[1 + erf((ξ_dB[l, k] − μ_k) / (σ_k √2))],    (2)

where ξ̄[l, k] is the mapped a priori SNR. The statistics found in [13] for each component of ξ_dB[l, k] are used in this work. During inference, ξ̂[l, k] is found from ξ̂_dB[l, k] as follows: ξ̂[l, k] = 10^(ξ̂_dB[l, k]/10), where the a priori SNR estimate in dB is computed from the mapped a priori SNR estimate produced by the network, ξ̄̂[l, k], as follows: ξ̂_dB[l, k] = σ_k √2 erf⁻¹(2ξ̄̂[l, k] − 1) + μ_k.
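As a concrete illustration of the mapping and its inverse, the sketch below implements Equation (2) and the inference-time inversion with NumPy and SciPy. The per-frequency statistics mu_k and sigma_k are placeholders, not the published statistics of [13].

```python
import numpy as np
from scipy.special import erf, erfinv

def map_xi(xi_db, mu_k, sigma_k):
    """Map the oracle a priori SNR in dB to [0, 1] using the assumed
    Gaussian CDF of xi_dB for each frequency component (Equation 2)."""
    return 0.5 * (1.0 + erf((xi_db - mu_k) / (sigma_k * np.sqrt(2.0))))

def unmap_xi(xi_bar, mu_k, sigma_k):
    """Invert the map at inference: recover the a priori SNR estimate
    (linear scale) from the network output xi_bar in [0, 1]."""
    xi_db = sigma_k * np.sqrt(2.0) * erfinv(2.0 * xi_bar - 1.0) + mu_k
    return 10.0 ** (xi_db / 10.0)

# Example with placeholder statistics for one frequency component.
mu_k, sigma_k = 0.0, 10.0               # assumed values, not those of [13]
xi_db_oracle = 10.0 * np.log10(4.0)     # oracle a priori SNR of ~6 dB
xi_bar = map_xi(xi_db_oracle, mu_k, sigma_k)
print(unmap_xi(xi_bar, mu_k, sigma_k))  # ~4.0, the original linear SNR
```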
3. DEEP XI AS A FRONT-END

In this section, Deep Xi as a front-end for robust ASR is described. Either Deep Xi-ResLSTM or Deep Xi-ResBiLSTM is used to pre-process the noisy speech before it is given to the back-end of the ASR system. Project DeepSpeech, an open source implementation of the Deep Speech ASR system [3], is used as the back-end (available at https://github.com/mozilla/DeepSpeech; model 0.1.1 was used for this research). It uses 26 mel-frequency cepstral coefficients (MFCCs) as its input. The process of finding the hypothesis transcription, H, is illustrated in Figure 2.
Fig. 2. Deep Xi as a front-end for robust ASR. The noisy speech magnitude spectra, |X|, is shown in (a). Deep Xi then
estimates the a priori SNR, ξ̂, as shown in (b). An MMSE approach is next used to compute a gain function, G(ξ̂), which is
then multiplied element-wise with |X| to produce the estimated clean speech magnitude spectra, |Ŝ|, as shown in (c). MFCCs
are then computed, producing the estimated clean speech cepstra, Ĉ, as shown in (d). Finally, the hypothesis transcript, H, is
found using Deep Speech, as shown in (e).

The process includes the following four steps:

1. The a priori SNR estimate, ξ̂, of the noisy speech magnitude spectra, |X|, is found using Deep Xi: ξ̂ = Deep Xi(|X|).

2. The estimated clean speech magnitude spectra, |Ŝ|, is found by applying the gain function of an MMSE approach, G(·), to |X|: |Ŝ| = |X| ⊙ G(ξ̂), where ξ̂ is used to compute G(ξ̂) (⊙ denotes the Hadamard product).

3. The input to Deep Speech is the estimated clean speech cepstra, Ĉ, where Ĉ is computed from |Ŝ|: Ĉ = MFCC(|Ŝ|).

4. Deep Speech computes the hypothesised transcript, H, from Ĉ: H = Deep Speech(Ĉ).

Deep Xi attempts to match the observed noisy speech to the conditions experienced by Deep Speech during training, i.e. the unobserved clean speech is estimated from the noisy speech. Steps one, two, and three form the front-end of the robust ASR system.
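The following sketch traces the four steps in code for a single utterance. It assumes a hypothetical deep_xi_estimate callable standing in for the trained network, and for illustration it uses the square-root Wiener filter gain, G(ξ̂) = sqrt(ξ̂/(1 + ξ̂)), as the MMSE gain (the SRWF gain adopted in Section 5); the MFCC and Deep Speech stages are shown as placeholders.

```python
import numpy as np

def srwf_gain(xi_hat):
    """Square-root Wiener filter gain computed from the a priori SNR estimate."""
    return np.sqrt(xi_hat / (1.0 + xi_hat))

def enhance_and_recognise(x_mag, deep_xi_estimate, mfcc, deep_speech):
    """Steps 1-4 of the front-end/back-end chain in Figure 2.
    x_mag: noisy speech magnitude spectra, shape (frames, 257).
    deep_xi_estimate, mfcc, deep_speech: placeholder callables for the
    trained Deep Xi network, the MFCC analysis, and the ASR back-end."""
    xi_hat = deep_xi_estimate(x_mag)        # step 1: a priori SNR estimate
    s_mag_hat = x_mag * srwf_gain(xi_hat)   # step 2: element-wise gain
    c_hat = mfcc(s_mag_hat)                 # step 3: estimated clean cepstra
    return deep_speech(c_hat)               # step 4: hypothesis transcript
```

In the actual system, deep_xi_estimate corresponds to the trained ResLSTM or ResBiLSTM network followed by the inverse mapping of Subsection 2.1.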
4. EXPERIMENT SETUP

4.1. Training set

The train-clean-100 set from the Librispeech corpus [16], the CSTR VCTK corpus [17], and the si* and sx* training sets from the TIMIT corpus [18] were included in the training set (74 250 clean speech recordings). 5% of the clean speech recordings (3 713) were randomly selected and used as the validation set. The 2 382 recordings adopted in [13] were used here as the noise training set. All clean speech and noise recordings were single-channel, with a sampling frequency of 16 kHz. The noise corruption procedure for the training set is described in the next subsection.

4.2. Training strategy

Cross-entropy was used as the loss function. The Adam algorithm [19] with default hyperparameters was used for stochastic gradient descent optimisation. A mini-batch size of 10 noisy speech signals was used. The noisy speech signals were generated as follows: each clean speech recording selected for a mini-batch was mixed with a random section of a randomly selected noise recording at a randomly selected SNR level (-10 to 20 dB, in 1 dB increments).
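A minimal sketch of this noise corruption procedure is given below: a random noise section is scaled so that the mixture reaches a randomly drawn SNR level between -10 and 20 dB. The scaling via the ratio of speech and noise powers is an assumption of this sketch, not a statement of the exact procedure used in [13].

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(clean, noise, snr_db):
    """Mix a clean speech recording with a random section of a noise
    recording so that the mixture has the requested SNR level in dB."""
    start = rng.integers(0, len(noise) - len(clean))
    noise_seg = noise[start:start + len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise_seg ** 2)
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) = snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise_seg

# Stand-ins for a clean recording and a (longer) noise recording at 16 kHz.
clean = rng.standard_normal(16000)
noise = rng.standard_normal(10 * 16000)
# Training-style corruption: SNR drawn from -10 to 20 dB in 1 dB increments.
noisy = mix_at_snr(clean, noise, snr_db=float(rng.integers(-10, 21)))
```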
described in the next subsection. LSTM-IRM: The LSTM-IRM estimator from [7] is
 replicated here. The noisy speech magnitude spectrum was
 used as the input and the training set was used for 10 epochs
4.2. Training strategy
 of training.
Cross-entropy was used as the loss function. The Adam algo- Xu2017: The implementation available at: https:
rithm [19] with default hyperparameters was used for stochas- //github.com/yongxuUSTC/DNN-for-speech-
tic gradient descent optimisation. A mini-batch size of 10 enhancement is used.
noisy speech signals was used. The noisy speech signals were Cluster-based reconstruction: A diagonal covariance
generated as follows: each clean speech recording selected Gaussian mixture model (GMM) consisting of 128 clusters is
Table 1. WER% for Deep Speech using each front-end. Real-world non-stationary (voice babble and street music) and
coloured (F16 and factory) noise sources were used. The lowest WER% for each condition is shown in boldface.
 SNR level (dB)
 Method Voice babble Street music F16 Factory
 -5 0 5 10 15 -5 0 5 10 15 -5 0 5 10 15 -5 0 5 10 15
 Noisy speech 95.1 91.1 70.6 36.7 11.0 92.9 78.2 51.1 21.1 11.0 100.0 98.7 82.5 39.2 18.1 97.7 89.0 59.8 29.2 10.5
 DD 94.9 87.8 70.4 32.0 10.4 88.5 72.4 47.6 24.6 14.0 98.4 82.1 43.2 18.8 11.6 94.6 83.6 52.7 26.2 13.8
 Clust. recon. 98.2 84.5 54.2 21.0 10.2 86.2 72.2 41.3 15.4 10.1 98.0 83.1 45.4 18.6 7.0 97.1 81.6 50.3 29.2 16.7
 Xu2017 95.0 78.1 54.3 31.3 13.0 90.0 75.4 45.2 28.6 17.1 93.8 81.7 51.2 25.1 21.7 97.4 90.7 67.1 35.4 18.3
 LSTM-IRM 94.1 88.9 54.2 22.5 9.6 89.6 67.5 39.8 16.2 7.2 97.1 79.3 45.2 21.4 12.9 94.5 73.8 43.2 18.1 10.9
 SEGAN 95.8 79.0 44.2 19.5 10.7 84.5 58.7 32.4 12.8 11.0 94.2 76.8 45.6 19.7 7.7 91.2 70.3 45.9 18.1 8.5
 Deep Xi-ResLSTM 92.9 73.3 37.1 11.7 12.0 85.5 58.1 25.2 11.5 6.4 95.6 69.3 30.8 16.4 7.2 93.0 69.5 36.2 17.9 9.1
 Deep Xi-ResBiLSTM 95.0 66.0 27.1 10.7 10.4 73.0 43.0 23.7 10.3 7.0 88.7 56.6 21.4 12.7 4.2 84.9 51.5 29.5 16.8 8.8

Cluster-based reconstruction: A diagonal covariance Gaussian mixture model (GMM) consisting of 128 clusters is trained using the k-means++ algorithm and the expectation-maximisation algorithm [22]. 7 500 randomly selected clean speech recordings from the training set are used for training. A BRNN ideal binary mask (IBM) estimator [23], which is trained for 10 epochs using the training set, is used to identify the unreliable spectral components [11].
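As an illustration of this baseline's configuration, the following is a minimal sketch of fitting such a GMM with scikit-learn. The feature matrix and iteration count are assumptions, and scikit-learn's k-means initialisation (which itself uses k-means++ seeding) stands in for the exact procedure of [22].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder training data: log-magnitude spectra of clean speech frames,
# shape (n_frames, 257). A real system would use the 7 500 clean recordings.
clean_log_spectra = np.log(np.abs(np.random.randn(5000, 257)) + 1e-8)

gmm = GaussianMixture(
    n_components=128,          # 128 clusters
    covariance_type='diag',    # diagonal covariance matrices
    init_params='kmeans',      # k-means(++) initialisation, then EM refinement
    max_iter=100,
)
gmm.fit(clean_log_spectra)
```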
DD: The MMSE-based noise estimator from [24] is used by the DD approach a priori SNR estimator [12]. The DD approach estimates the a priori SNR for the MMSE-STSA estimator [12].

4.5. Signal processing

The following hyperparameters were used to compute the inputs used by the DD approach, cluster-based reconstruction, the LSTM-IRM estimator, Deep Xi-ResLSTM, and Deep Xi-ResBiLSTM. A sampling frequency of 16 kHz was used. The Hamming window function was used for analysis, with a frame length of 32 ms (512 discrete-time samples) and a frame shift of 16 ms (256 discrete-time samples). For each frame of noisy speech, the 257-point single-sided magnitude spectrum was computed, which included both the DC frequency component and the Nyquist frequency component.
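For clarity, the following sketch computes the 257-point single-sided magnitude spectra with the stated analysis parameters (16 kHz sampling, 32 ms Hamming window, 16 ms shift). The framing details (no padding, truncation of the final partial frame) are assumptions of this sketch.

```python
import numpy as np

def magnitude_spectra(x, n_fft=512, frame_shift=256):
    """Compute single-sided magnitude spectra of a 16 kHz signal using a
    32 ms (512-sample) Hamming window and a 16 ms (256-sample) shift.
    Returns an array of shape (frames, 257), including the DC and
    Nyquist frequency components."""
    window = np.hamming(n_fft)
    n_frames = 1 + (len(x) - n_fft) // frame_shift
    frames = np.stack([x[i * frame_shift:i * frame_shift + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # (frames, 257)

x = np.random.randn(16000)          # one second of dummy audio at 16 kHz
print(magnitude_spectra(x).shape)   # (61, 257)
```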
5. RESULTS AND DISCUSSION

In Table 1, Deep Xi-ResLSTM and Deep Xi-ResBiLSTM are compared to both current and previous generation front-ends using Deep Speech. The current front-ends include Xu2017 [8], an LSTM-IRM estimator [7], and SEGAN [9], all of which are either masking- or mapping-based deep learning approaches to speech enhancement. The previous generation front-ends include the DD approach and cluster-based reconstruction. Here, the square-root Wiener filter (SRWF) MMSE approach [25] is used by Deep Xi-ResLSTM and Deep Xi-ResBiLSTM, as it attained the lowest word error rate percentage (WER%) in preliminary testing (over other MMSE approaches such as the MMSE-STSA estimator).

Deep Xi-ResBiLSTM demonstrates a significant performance improvement, with an average WER% reduction of 9.3% over all conditions when compared to SEGAN. Deep Xi-ResLSTM also demonstrated a high performance, with an average WER% reduction of 3.4% over all conditions when compared to SEGAN. An explanation for the performance of Deep Xi-ResLSTM and Deep Xi-ResBiLSTM as front-ends is given by their speech enhancement performance. In [13], Deep Xi-ResLSTM and Deep Xi-ResBiLSTM were both able to produce more intelligible enhanced speech than recent masking- and mapping-based deep learning approaches to speech enhancement. As more intelligible speech improves speech recognition performance [6], Deep Xi-ResLSTM and Deep Xi-ResBiLSTM are both suitable front-ends for ASR.

SEGAN and the LSTM-IRM estimator both performed poorly at lower SNR levels when compared to the proposed front-ends. Deep Xi-ResBiLSTM performed particularly well at SNR levels of 5 and 10 dB, with average WER% reductions of 16.9% and 16.6%, respectively, over all noise sources when compared to SEGAN. SEGAN and the LSTM-IRM estimator were both able to demonstrate a high performance at an SNR level of 15 dB. However, Deep Xi-ResBiLSTM and Deep Xi-ResLSTM were able to produce average WER% improvements of 1.9% and 0.8%, respectively, over all noise sources at an SNR level of 15 dB when compared to SEGAN. In summary, Deep Xi-ResBiLSTM and Deep Xi-ResLSTM demonstrate a high performance over all SNR levels, especially at low SNR levels.

6. CONCLUSION

In this paper, we investigated the use of Deep Xi as a front-end for robust ASR. Deep Xi was evaluated using both real-world non-stationary and coloured noise sources, at multiple SNR levels. Deep Xi was able to outperform recent front-ends, including masking- and mapping-based deep learning approaches to speech enhancement. The results presented in this work show that Deep Xi is able to significantly increase the robustness of an ASR system.
7. REFERENCES

[1] Wayne Xiong et al., "Achieving human parity in conversational speech recognition," CoRR, vol. abs/1610.05256, 2016.

[2] Douglas O'Shaughnessy, "Invited paper: Automatic speech recognition: History, methods and challenges," Pattern Recognition, vol. 41, no. 10, pp. 2965–2979, 2008.

[3] Awni Y. Hannun et al., "Deep Speech: Scaling up end-to-end speech recognition," CoRR, vol. abs/1412.5567, 2014.

[4] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[5] Zixing Zhang et al., "Deep learning for environmentally robust speech recognition: An overview of recent developments," ACM Trans. Intell. Syst. Technol., vol. 9, no. 5, pp. 1–28, Apr. 2018.

[6] Nancy Thomas-Stonell et al., "Computerized speech recognition: Influence of intelligibility and perceptual consistency on recognition accuracy," Augmentative and Alternative Communication, vol. 14, no. 1, pp. 51–56, 1998.

[7] Jitong Chen and DeLiang Wang, "Long short-term memory for speaker generalization in supervised speech separation," The Journal of the Acoustical Society of America, vol. 141, no. 6, pp. 4705–4714, 2017.

[8] Yong Xu et al., "Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement," CoRR, vol. abs/1703.07172, 2017.

[9] Santiago Pascual, Antonio Bonafonte, and Joan Serrà, "SEGAN: Speech enhancement generative adversarial network," CoRR, vol. abs/1703.09452, 2017.

[10] DeLiang Wang, On ideal binary mask as the computational goal of auditory scene analysis, pp. 181–197, Springer US, Boston, MA, 2005.

[11] Bhiksha Raj, Michael L. Seltzer, and Richard M. Stern, "Reconstruction of missing features for robust speech recognition," Speech Communication, vol. 43, no. 4, pp. 275–296, 2004, Special Issue on the Recognition and Organization of Real-World Sound.

[12] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.

[13] Aaron Nicolson and Kuldip K. Paliwal, "Deep learning for minimum mean-square error approaches to speech enhancement," Speech Communication, vol. 111, pp. 44–55, 2019.

[14] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," in ICANN, 1999, vol. 2, pp. 850–855.

[15] Vinod Nair and Geoffrey E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, USA, 2010, pp. 807–814, Omnipress.

[16] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015, pp. 5206–5210.

[17] Christophe Veaux et al., "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," University of Edinburgh, CSTR, 2017.

[18] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, 1993.

[19] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014.

[20] Herman J. M. Steeneken and Frank W. M. Geurtsen, "Description of the RSG-10 noise database," Report IZF 1988-3, TNO Institute for Perception, Soesterberg, The Netherlands, 1988.

[21] Justin Salamon, Christopher Jacoby, and Juan Pablo Bello, "A dataset and taxonomy for urban sound research," in ICM, New York, NY, USA, 2014, pp. 1041–1044, ACM.

[22] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–22, 1977.

[23] Aaron Nicolson and Kuldip K. Paliwal, "Bidirectional long-short term memory network-based estimation of reliable spectral component locations," in Proc. Interspeech 2018, 2018, pp. 1606–1610.

[24] T. Gerkmann and R. C. Hendriks, "Unbiased MMSE-based noise power estimation with low complexity and low tracking delay," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1383–1393, May 2012.

[25] J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proceedings of the IEEE, vol. 67, no. 12, pp. 1586–1604, 1979.