Jointly optimal denoising, dereverberation, and source separation

Page created by Tyrone Wise
 
CONTINUE READING
Jointly optimal denoising, dereverberation, and source separation
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020                                                                                1

             Jointly optimal denoising, dereverberation,
                        and source separation
               Tomohiro Nakatani, Senior Member, IEEE, Christoph Boeddeker, Student Member, IEEE,
                    Keisuke Kinoshita, Senior Member, IEEE, Rintaro Ikeshita, Member, IEEE,
                   Marc Delcroix, Senior Member, IEEE, Reinhold Haeb-Umbach, Fellow, IEEE

   Abstract—This paper proposes methods that can optimize a              the acquired signal. For performing denoising (DN), beam-
Convolutional BeamFormer (CBF) for jointly performing denois-            forming techniques have been investigated for decades [1], [2],
ing, dereverberation, and source separation (DN+DR+SS) in a              [3], [4], and the Minimum Variance Distortionless Response
computationally efficient way. Conventionally, cascade configu-
ration composed of a Weighted Prediction Error minimization              (MVDR) beamformer and the Minimum Power Distortionless
(WPE) dereverberation filter followed by a Minimum Variance              Response (MPDR) beamformer, are now widely used as the
Distortionless Response (MVDR) beamformer has been used                  state-of-the-art techniques. For source separation (SS), a num-
as the state-of-the-art frontend of far-field speech recognition,        ber of blind signal processing techniques have been developed,
however, overall optimality of this approach is not guaranteed.          including independent component analysis [5], independent
In the blind signal processing area, an approach for jointly
optimizing dereverberation and source separation (DR+SS) has             vector analysis [6], and spatial clustering-based beamforming
been proposed, however, this approach requires huge computing            [7]. For dereverberation (DR), a Weighted Prediction Error
cost, and has not been extended for application to DN+DR+SS.             minimization (WPE) based linear prediction technique [8],
To overcome the above limitations, this paper develops new               [9] and its variants [10] have been actively studied as an
approaches for jointly optimizing DN+DR+SS in a computation-             effective approach. With these techniques, for determining
ally much more efficient way. To this end, we first present an
objective function to optimize a CBF for performing DN+DR+SS             the coefficients of the filtering, it is crucial to accurately
based on the maximum likelihood estimation, on an assumption             estimate statistics of the speech signals and the noise, such as
that the steering vectors of the target signals are given or can         their spatial covariances and time-varying variances. However,
be estimated, e.g., using a neural network. This paper refers            the estimation often becomes inaccurate when the signals
to a CBF optimized by this objective function as a weighted              mixed under reverberant and noisy conditions, which seriously
Minimum-Power Distortionless Response (wMPDR) CBF. Then,
we derive two algorithms for optimizing a wMPDR CBF based                degrades the performance of these techniques.
on two different ways of factorizing a CBF into WPE filters                 To enhance the robustness of the above techniques, recently,
and beamformers: one based on extension of the conventional              neural network-supported microphone array speech enhance-
joint optimization approach proposed for DR+SS and the other             ment has been actively studied, and showed its effectiveness
based on a novel technique. Experiments using noisy reverberant          for denoising [11], dereverberation [12], and source separation
sound mixtures show that the proposed optimization approaches
greatly improve the performance of the speech enhancement in             [13], [14]. With this approach, estimation of the statistics of
comparison with the conventional cascade configuration in terms          the signals and noise, such as Time-Frequency (TF) masks and
of the signal distortion measures and ASR performance. It is             time-varying variances, is conducted by neural networks [13],
also shown that the proposed approaches can greatly reduce the           [15], [16], [17], while speech enhancement is performed by
computing cost with improved estimation accuracy in comparison           the microphone array signal processing. This combination is
with the conventional joint optimization approach.
                                                                         particularly effective because neural networks can very well
   Index Terms—Beamforming, dereverberation, source separa-              capture spectral patterns of signals over wide TF ranges, and
tion, microphone array, automatic speech recognition, maximum            can reliably estimate such statistics of the signals, which was
likelihood estimation
                                                                         not well handled by the conventional signal processing. On
                                                                         the other hand, neural networks often introduce nonlinear
                        I. I NTRODUCTION                                 distortions into the processed signal, which are harmful to
   When a speech signal is captured by distant microphones,              perceived speech quality and ASR, while it can be well
e.g., in a conference room, it often contains reverberation, dif-        avoided by microphone array techniques. A number of articles
fuse noise, and extraneous speakers’ voices. These components            have reported the usefulness of this combination, particularly
are detrimental to the intelligibility of the captured speech            for far-field ASR, e.g., at the REVERB challenge [18] and the
and often cause serious degradation in many applications                 CHiME-3/4/5 challenges [19], [20].
such as hands-free teleconferencing and Automatic Speech                    Despite the success of the neural network-supported mi-
Recognition (ASR).                                                       crophone array speech enhancement, it is still not yet well
   Microphone array speech enhancement has been extensively              investigated how to optimally combine individual microphone
studied to minimize the aforementioned detrimental effects in            array techniques for performing denoising, dereverberation,
                                                                         and source separation (DN+DR+SS) at the same time in a
  T. Nakatani, K. Kinoshita, R. Ikeshita, and M. Delcroix are with NTT
Corporation. C. Boeddeker and R. Haeb-Umbach are with Paderborn Univ.    computationally efficient way. For example, for denoising and
  Manuscript received January 1, 2020; revised XXXX XX, 2020.            dereverberation (DN+DR), cascade configuration of a WPE
Jointly optimal denoising, dereverberation, and source separation
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020                                                                                            2

filter followed by a MVDR/MPDR beamformer has been                               estimation of the steering vectors. While both approaches
widely used as the state-of-the-art frontend, e.g., at the far-                  work comparably well in terms of the estimation accuracy,
field ASR challenges [18], [19], [20], [21]. However, the WPE                    the source-wise factorization has advantages in terms of
filter and the beamformer are separately optimized, and the                      computational efficiency. An additional benefit of the source-
overall optimality of this approach is not guaranteed. In order                  wise factorization is that it can be used, without loss of
to perform DN+DR in an optimal way, several techniques                           optimality, for extraction of a single target source from a sound
have been proposed using a Kalman filter [22], [23], [24]. A                     mixture, which is now an important application area of speech
technique, called Integrated Sidelobe Cancellation and Linear                    enhancement [13], [29].
Prediction (ISCLP) [24], optimizes an integrated filter that                         Experiments based on noisy reverberant sound mixtures
can cancel noise and reverberation from the observed signal                      created using the REVERB Challenge dataset [18] show that
using a sidelobe cancellation framework. With this technique,                    the proposed optimization approaches substantially improve
however, a steering vector of the target signal needs to be                      the performance of DN+DR+SS in comparison to the con-
estimated in advance directly from noisy reverberant speech,                     ventional cascade configuration in terms of ASR performance
which is a challenging problem, and thus limits the overall                      and reduction of signal distortion. It is also shown that the
estimation accuracy. In the blind signal processing area, on                     two proposed approaches can greatly reduce the computing
the other hand, a technique to jointly optimize a pair of a                      cost with improved estimation accuracy in comparison with
WPE filter followed by a beamformer has been proposed                            the conventional joint optimization approach.
for dereverberation and source separation (DR+SS) under                              Certain parts of this paper have already been presented in
noiseless conditions [25], [26], [27]. One advantage of this                     our recent conference papers. In [30], the ML formulation for
approach is that we can access multichannel dereverberated                       optimizing a CBF was derived for DN+DR. In [31], it was
signals obtained as the output of the WPE filter during the                      shown that a CBF for DN+DR can be factorized into a WPE
optimization, and utilize them to reliably estimate the beam-                    filter and a wMPDR (non-convolutional) beamformer, and
former. However, this approach requires 1) huge computing                        jointly optimized without loss of optimality. Furthermore, [32]
cost for the optimization, and 2) has not been extended for                      presented ways to reliably estimate TF masks for DN+DR+SS.
application to DN+DR+SS.                                                         This paper integrates these techniques to perform DN+DR+SS
    To overcome the above limitations, this paper develops                       in a computationally efficient way.
algorithms for optimizing a Convolutional BeamFormer (CBF)                           In the remainder of this paper, the model of the observed
that can perform DN+DR+SS in a computationally much more                         signal and that of the CBF are defined in Section II. Then,
efficient way. A CBF is a filter that is applied to a multichannel               Section III presents the proposed optimization methods. Sec-
observed signal to yield the desired output signals. For the                     tion IV summarizes characteristics and advantages of the
optimization of a CBF, this paper first presents a common                        proposed methods. In Sections V and VI, experimental results
objective function based on the Maximum Likelihood (ML)                          and concluding remarks are described, respectively.
criterion, on an assumption that the steering vectors of the
desired signals are given, or can be estimated. This paper                               II. M ODELS OF SIGNAL AND BEAMFORMER
refers to a CBF optimized by this objective function as a
weighted MPDR (wMPDR) CBF. Then, showing that a CBF                                 This paper assumes that I source signals are captured by
can be factorized into WPE filter(s) and beamformer(s) in                        M (≥ I) microphones in a noisy reverberant environment. The
two different ways, we derive two different algorithms for                       captured signal at each TF point in the short-time Fourier
optimizing the wMPDR CBF, based on the ways of the CBF                           transform (STFT) domain is modeled by
factorization. The first approach, referred to as a source-                                                    I
packed factorization, is an extension of the conventional joint                                                        (i)
                                                                                                               X
                                                                                                      xt,f =          xt,f + nt,f ,              (1)
optimization technique proposed for DR+SS [25], [26], [27].                                                    i=1
In this paper, we first show that the direct application of this                                       (i)      (i)          (i)
                                                                                                      xt,f =   dt,f   + rt,f ,                   (2)
approach to DN+DR+SS has serious problems in terms of
the computational efficiency and the estimation accuracy, and                    where t and f are time and frequency indices, respec-
then present its extension for solving the problems. The second                  tively, xt,f = [x1,t,f , . . . , xM,t,f ]> ∈ CM ×1 is a column
approach, referred to as a source-wise factorization, is based                   vector containing all microphone signals at a TF point.
on a novel factorization technique. It factorizes a CBF into a                                                                                (i)
                                                                                 Here, (·)> denotes the non-conjugate transpose. xt,f =
set of sub-filter pairs, each of which is composed of a WPE                         (i)             (i)
                                                                                 [x1,t,f , . . . , xM,t,f ]> is a (noiseless) reverberant signal cor-
filter and a beamformer, and aimed at estimating each source
                                                                                 responding to the ith source, and nt,f = [n1,t,f , . . . , nM,t,f ]>
independently. For both approaches, we also present a method                                                         (i)
for robustly estimating the steering vectors of the desired                      is the additive diffuse noise. xt,f for each source in Eq. (1) is
signals during the optimization of the wMPDR CBF using                           further decomposed into two parts in Eq. (2), one consisting
the output of the WPE filters. A neural network supported                        of the direct signal and early reflections, referred to as a
                                                                                                          (i)
TF-mask estimation technique is also incorporated1 for the                       desired signal dt,f , and the other corresponding to the late
                                                                                                        (i)
                                                                                 reverberation rt,f . Hereafter, the frequency indices of the
  1 It is worth noting that the proposed techniques can also be applied to the   symbols are omitted for brevity, on an assumption that each
conventional blind signal processing for DR+SS as discussed in [28].             frequency bin is processed independently in the same way.
Jointly optimal denoising, dereverberation, and source separation
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020                                                                                                                 3

                                    Convoluonal                                            Mulple-                  Beamformer
                                    beamformer                                                target                   matrix for
                                for dereverberaon,                                           linear                   separaon
                                   denoising, and                                           predicon                     and
                                 source separaon                                               (LP)                   denoising
                                 (a) A MIMO CBF                            (b) A MIMO CBF with source-packed factorization

                                                                                                                    (1)
                                   Convoluonal                 (1)                          Single-target                 Beamformer              (1)
                                beamformer for = 1                                           LP for = 1                     for = 1

                                                              …
                                            …

                                                                                                        …

                                                                                                                                      …

                                                                                                                                                 …
                                                                                                                    ( )
                                   Convoluonal                 ( )                          Single-target                 Beamformer              ( )
                                beamformer for =                                              LP for =                       for =
                           (c) A set of MISO CBFs                               (d) MISO CBFs with source-wise factorization
Fig. 1. A Multi-Input Multi-Output (MIMO) CBF and its three different implementations. They are equivalent to each other in the sense that whatever values
are set to coefficients of one implementation, certain coefficients of the other implementations can be determined such that they realize the same input-output
relationship. Thus, the optimal solutions of all the implementations are identical as long as they are optimized based on the same objective function.

                                                                          (i)                     (i)         (i)               (i)
   In this paper, the goal of DN+DR+SS is to estimate dt                         where aτ = [a1,τ , . . . , aM,τ ]> for τ ∈ {∆, . . . , La − 1}
                                                        (i)
for each source i from xt in Eq. (1) by reducing rt of the                       are the convolutional acoustic transfer functions, and ∆ is the
            (i0 )                                                                mixing time, representing the relative frame delay of the late
source i, xt of all the other sources i0 6= i, and the diffuse
noise nt . It is known that in noisy reverberant environments                    reverberation start time to the direct signal.
                                                                                                                                (i)
the early reflections enhance the intelligibility of speech for                     In this paper, we further assume that dt is statistically
                                                                                             2
human perception [33] and improve the ASR performance by                         independent of the following variables:
computer [34], and thus we include them in the desired signal.                        •
                                                                                            (i)
                                                                                          st0 for t0 ≤ t − ∆ (and thus dt
                                                                                                                                                    (i)
                                                                                                                                                          is statistically
Hereafter, this paper uses m = 1 as the reference microphone,                                               (i)
                                                                                          independent of xt0 for t0 ≤ t − ∆),
                                                            (i)
and describes a method for estimating the desired signal, d1,t ,                           (i)
                                                                                      •   rt00 for t00 ≤ t,
at the microphone without loss of generality.                                               (i0 )
                                                    (i)                               •   xt0 and nt0 for all t, t0 and i0 6= i.
   To achieve the above goal, we further model dt as
                                                                                 These assumptions are used to derive the optimization algo-
                        (i)           (i)             (i)
                      dt = v(i) st = ṽ(i) d1,t ,                         (3)    rithms described in the following.
         (i)
where st is the ith clean speech at a TF point. In Eq. (3), the
                                          (i)                      (i)           A. Definition of a CBF and its three different implementations
desired signal of the ith source, dt , is modeled by v(i) st ,
i.e., a product in the STFT domain of the clean speech with                       We now define a CBF, which will be later factorized into
a transfer function v(i) , hereafter referred to as a steering                   WPE filter(s) and beamformer(s), as
vector, assuming that the duration of the impulse response
corresponding to the direct signal and early reflections in                                                                           L−1
                                                                                                                                      X
the time domain is sufficiently short in comparison with the                                                yt = W0H xt +                    WτH xt−τ ,               (6)
analysis window [35]. Then, the desired signal is further                                                                             τ =∆
                    (i)
rewritten as ṽ(i) d1,t , i.e., a product of the desired signal at the                                  (1)               (I)
                             (i)     (i) (i)                                     where yt = [yt , . . . , yt ]> ∈ CI×1 is the output of the CBF
reference microphone d1,t = v1 st with a Relative Transfer                       corresponding to estimates of I desired signals, Wτ ∈ CM ×I
Function (RTF) [36] which is defined as the steering vector                      for each τ ∈ {0, ∆, ∆+1, . . . , L−1} is a matrix composed of
divided by its reference microphone element, namely                              the beamformer coefficients, (·)H denotes conjugate transpose,
                                               (i)                               and ∆ is the prediction delay of the CBF. In this paper,
                              ṽ(i) = v(i) /v1 .                          (4)
                                                                                 we set ∆ equal to the mixing time introduced in Eq. (5),
In contrast, assuming that the duration of the late reverberation                so that the desired signals are included only in the first
in the time domain is larger than the analysis window, the late                  term of Eq. (6), and that they are statistically independent
                (i)
reverberation rt is modeled by a convolution in the STFT                         of the second term according to the assumptions introduced
domain [37] of the clean speech with a time series of acoustic                   in the signal model. Then, this paper performs DN+DR+SS
transfer functions corresponding to the late reverberation as                    by estimating beamformer coefficients that can estimate the
                                                                                 desired signals included in the first term of Eq. (6).
                                    a −1
                                   LX
                          (i)                   (i)
                         rt =               a(i)
                                             τ st−τ ,                     (5)      2 See    [8] for more precise discussion on the statistical independence between
                                                                                   (i)
                                   τ =∆                                          dt       and st0 for t0 ≤ t − ∆.
Jointly optimal denoising, dereverberation, and source separation
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020                                                                                  4

 For the sake of notational simplicity, we also introduce a             W, and are used to extract the ith desired signal. Then, Eq. (7)
matrix representation of a CBF as                                       can be rewritten for each source i as
                             H                                                              "        #H 
                           W0      xt                                                                (i)           
                   yt =                   ,             (7)                                (i)     w0           xt
                           W       xt                                                    yt =                        .             (14)
                                                                                                   w(i)         xt
where W is a matrix containing Wτ for ∆ ≤ τ ≤ L − 1 and                  For example, MISO CBFs were used in [30], [39]. ISCLP
xt is a column vector containing past multichannel observed             proposed in [24] can also be viewed as a realization of a MISO
signals xt−τ for ∆ ≤ τ ≤ L − 1, respectively, defined as                CBF using a sidelobe cancellation framework [40].
               >              >
                                  >                                       3) Source-wise factorization: With the source-wise factor-
         W = W∆    , . . . , WL−1    ∈ CM (L−∆)×I ,      (8)
               >                    >
                                                                        ization shown in Fig. 1 (d), we further factorize each MISO
                              >          M (L−∆)×1
                                   
         xt = xt−∆ , . . . , xt−L+1 ∈ C             .    (9)            CBF defined in Eq. (14) for a source i as
                                                                                        "        # "            #
Hereafter, we refer to the CBF defined by Eqs. (6) and (7) as                                (i)         IM
                                                                                           w0
a MIMO CBF.                                                                                        =        (i)   q(i) ,           (15)
                                                                                           w(i)         −G
   In the following, we further present three different imple-
mentations of the CBF, including the two ways of factorizing                                         (i)
                                                                        where q(i) ∈ CM ×1 and G ∈ CM (L−∆)×M . Then, Eq. (14)
the CBF. Figure 1 illustrates the MIMO CBF and its three
                                                                        can be rewritten as a pair of a linear prediction filter followed
different implementations.
                                                                        by a beamformer, respectively, defined as
   1) Source-packed factorization: With the implementation
                                                                                                         (i) H
shown in Fig. 1 (b), we directly factorize3 the MIMO CBF in                                 (i)
                                                                                           zt = x t − G           xt ,                (16)
Eq. (7) as
                                                                                                       H
                                                                                            (i)              (i)
                                                                                           yt = q(i) zt ,
                                     
                      W0          IM                                                                                                  (17)
                            =             Q,              (10)
                      W           −G
                                                                                (i)                  (i)
                                                                        where zt ∈ CM ×1 and G are the output and the prediction
where Q ∈ CM ×I , G ∈ CM (L−∆)×M , and IM ∈ RM ×M is
                                                                        matrix of the linear prediction, and q(i) is the coefficient vector
an identity matrix. Then, Eq. (6) can be rewritten as a pair of
                                                                        of the beamformer. Because Eq. (16) is performed only for
a (convolutional) linear prediction filter followed by a (non-
                                                                        estimation of the ith source, it is referred to as a single-target
convolutional) beamformer matrix, respectively, defined as
                                                                        linear prediction.
                                       H
                        zt = xt − G xt ,                        (11)       4) Relationship between two factorization approaches: The
                                 H
                        yt = Q zt .                             (12)    difference between the two factorization approaches, namely
                                                                        Fig. 1 (b) and (d), is based only on the way to perform linear
Here, zt ∈ CM ×1 and G are the output and the prediction                prediction, Eq. (11) or Eq. (16), and more specifically based
                                                                                                                          (i)
matrix of the linear prediction, and Q is the coefficient matrix        on whether the prediction matrices, G and G , are common
of the beamformer. Eq. (11) is supposed to dereverberate all            to all the sources or different over different sources. According
the sources at the same time, and thus referred to as a multiple-       to this, different optimization algorithms with different char-
target linear prediction, while Eq. (12) is supposed to perform         acteristics are derived, as will be shown in Section III. In
denoising and source separation at the same time. Because               contrast, the beamformer parts, Q and q(i) in Eqs. (12) and
the individual sources are not distinguished in the output of           (17) are identical in the two approaches, viewing q(i) as the
the WPE filter, this implementation is called source-packed             ith column of Q, because they satisfy W0 = Q in Eq. (10)
factorization.                                                                  (i)
                                                                        and w0 = q(i) in Eq. (15).
   One example of the source-packed factorization is the                   In addition, it should be noted that all the above implemen-
cascade configuration composed of a WPE filter followed by              tations of a CBF are equivalent to each other in the sense
a beamformer, which has been widely used for DN+DR+SS                   that whatever values are set to coefficients of one imple-
in the far-field speech recognition area [14], [20], [38], and          mentation, certain coefficients of the other implementations
the other example is one used in the joint optimization of a            can be determined such that they realize the same input-
WPE filter and a beamformer, which has been investigated for            output relationship. Thus, the optimal solutions of all the
DR+SS in the blind signal processing area [25], [26], [27].             implementations are identical as long as they are based on
   2) Multi-Input Single-Output (MISO) CBF: Next, we define             the same objective function.
a set of MISO CBFs shown in Fig. 1 (c). They are obtained
by decomposing the beamformer coefficients in Eq. (7) as
                                                                                        III. ML ESTIMATION OF CBF
                  " (1)          (2)          (I)
                                                    #
             W0           w0     w0      . . . w0                          In this section, two different optimization algorithms are
                     =                                 ,     (13)
              W           w(1) w(2) . . . w(I)                          derived, respectively, using (b) source-packed factorization and
          (i)
                                                                        (d) source-wise factorization. For the derivation, we assume
where w0 ∈ CM ×1 and w(i) ∈ CM (L−∆)×1 are column                       that the RTFs ṽ(i) and the time-varying variances of the output
vectors, respectively, containing the ith columns of W0 and                                                                (i)
                                                                        signals yielded by the optimal CBF, denoted by λt , are given.
                                                                                                                                      (i)
  3 The existence of G that satisfies W = −GQ is guaranteed for any W   Later in Section III-E, we describe a way for estimating λt
when M ≥ I and rank{Q} = I.                                             jointly with the CBF coefficients based on the ML criterion,
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020                                                                                                5

and a way for estimating ṽ(i) based on the output of the WPE                 noise remaining in the output of the CBF, respectively, written
filter obtained at a step of the optimization.                                using the MISO CBF form as
                                                                                                      "       #H "         #
                                                                                                          (i)        (i)
                                                                                                (i)     w0          rt
                                                                                              r̂t =                  (i)     ,          (22)
A. Probabilistic model                                                                                  w(i)        xt
                                                                                                              #H "
                                                                                                                     (i0 )
                                                                                                      "                     #
                                                                                                          (i)
   First, we formulate the objective function for DN+DR+SS                                     (i0 )    w0          xt
                                                                                             x̂t =                   (i0 )    ,         (23)
by reinterpreting the objective function proposed for DN+DR                                             w(i)        xt
in [30]. For this formulation, we interpret DN+DR+SS to                                               "       #H 
                                                                                                          (i)           
be composed of a set of separate processing steps, each of                                              w0          nt
                                                                                                n̂t =                      ,            (24)
which applies DN+DR to enhance a source i by reducing                                                   w(i)        nt
the late reverberation of the source (DR) and by reducing the
additive noise including the other sources and the diffuse noise              where nt = [n>              >      >
                                                                                           t−∆ , . . . , nt−L+1 ] . According to the statistical
                                                                                                                                     (i)
(DN). With this interpretation, we introduce the following                    independence assumptions introduced in Section II, d1,t is sta-
                                                                                                              (i)   (i0 )
assumptions similar to [30].                                                  tistically independent of r̂t , x̂t , and n̂t . Then, substituting
                                                                        (i)   Eq. (21) in Eq. (19) and omitting constant terms, we obtain
  •   The output of the optimal CBF for each i, namely yt ,
                                                                              in the expectation sense
      follows zero-mean complex Gaussian
                                         distribution
                                                      with a                                                                                       
                                                            2                                                                                      2
                                 (i)                  (i)                                                          (i)   P            (i0 )
      time-varying variance λt = E                yt            [8].                                  T    E    r̂t    +   i0 6=i tx̂       + n̂ t
                                                                                    n         o    1X
  •   The beamformer satisfies a distortionless constraint for                    E Li (θ(i) ) =                               (i)
                                                                                                                                                       .
                                                                                                  T t=1                    λt
      each source i defined using the RTF ṽ(i) in Eq. (4) as
                                                                                                                                                      (25)
                 H                    H        
          
             (i)      (i)            (i)      (i)                       The above equation indicates that minimization of0 the objec-
            w0      ṽ = 1 or q             ṽ = 1 .       (18)                                                        (i) (i )
                                                                        tive function indeed minimizes the sum of r̂t , x̂t for i0 6= i,
                                                                        and n̂t in Eq. (21).
Then, according to the discussion in [30], we can approxi-                 Before deriving the optimization algorithms, we define a
mately derive the objective function to minimize for estimating matrix that is frequently used in the derivation, referred to
                                                          (i)
the CBF coefficients for source i, e.g., θ(i) = {w0 , w(i) }, as a variance-normalized spatio-temporal covariance matrix.
based on the ML estimation as                                           Letting xt be a column vector composed of the current and
                             2
                                                                       past observed signals at all the microphones, defined as
                T         (i)
             1         y                                   H                                       > >
                     t                                                               xt = x>             ∈ CM (L−∆+1)×1 ,
                                                                                                  
                                                                                              t , xt                               (26)
                                      (i)           (i)
               X
Li (θ(i) ) =         (i) + log λt  s.t. w0                 ṽ(i) = 1.
             T t=1      λt                                              the matrix is defined as
                                                                  (19)                   T
                                                                                      1 X xt xH
                                                                             Rx(i) =            t
                                                                                                   ∈ CM (L−∆+1)×M (L−∆+1) .        (27)
                                                                                     T t=1 λ(i)
The objective function for estimating all the sources can then                                t
be obtained by summing Eq. (19) over all the sources as                 A factorized form of the matrix is also defined as
                                                                                                                  H 
                                                                                                       (i)      (i)
             I
                                                                                                  x R        P x
                                                                                         R(i)
                                      H
                                                                                           x =
             X                  
                                  (i)                                                                                  ,          (28)
  L (Θ) =       Li (θ(i) ), s.t. w0       ṽ(i) = 1 for all i, (20)                                    (i)       (i)
                                                                                                     Px       Rx
               i=1
                                                                             where
where Θ = θ(1) , . . . , θ(I) . This objective function is used                                       T
                                                                                                   1 X xt xH
commonly for all the implementations of a CBF. In this paper,                            Rx(i) =            t
                                                                                                              ∈ CM ×M ,                             (29)
we call a CBF optimized by the above objective function as                                         T t=1 λ(i)
                                                                                                          t
weighted MPDR (wMPDR) CBF because it minimizes the                                                    T
                                                                                                   1 X xt xHt
                               (i)
average power of the output yt weighted by the time-varying                              Px(i) =              ∈ CM (L−∆)×M ,                        (30)
            (i)                                                                                    T t=1 λ(i)
                                                                                                          t
variance, λt , of the signal.
                                                                                                      T
   Here, let us briefly explain how DN+DR+SS is performed                                  (i)     1 X xt xHt
                                                                                         Rx =                 ∈ CM (L−∆)×M (L−∆) .                  (31)
by Eqs. (19) and (20). Substituting Eqs. (1) and (2) in Eq. (14)                                   T t=1 λ(i)
                                                                                                          t
and using the model of the desired signal in Eq. (3) and the
distortionless constraint in Eq. (18), we obtain                              B. Optimization based on source-packed factorization
                                                                                 This subsection discusses methods for optimizing a CBF
                     (i)   (i)   (i)             (i0 )
                                       X
                 yt = d1,t + r̂t +              x̂t      + n̂t ,       (21)   with the source-packed factorization. In the following, after
                                       i0 6=i                                 describing a method for directly applying the conventional
                                                                              joint optimization technique used for DR+SS to DN+DR+SS,
         (i)     (i0 )
where r̂t , x̂t for i0 6= i, and n̂t are late reverberation of                we summarize the problems in it, and present the solutions to
the ith source, all the other sources, and the additive diffuse               the problems.
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020                                                                                                        6

                                                                                              (i)
   1) Direct application of conventional technique: With the                       where Rz is a variance-normalized spatial covariance matrix
source-packed factorization in Eqs. (11) and (12), it is difficult                 of the output of the multiple-target WPE filter, calculated as
to estimate both Q and G at the same time in closed form even                                                             T
        (i)
when λt and ṽ(i) are all given. Instead, we use an iterative                                                          1 X zt zHt
                                                                                                               R(i)
                                                                                                                z =               .                         (41)
and alternate estimation scheme, following the idea from the                                                           T t=1 λ(i)
                                                                                                                              t
blind signal processing technique [25], [26], [27], where at
each estimation step, one of Q and G is updated while fixing                       Then, q(i) that minimizes Eq. (40) under the distortionless
                                                                                                  H
the other.                                                                         constraint q(i) ṽ(i) = 1 can be obtained as
   For updating G, we fix Q at its previously estimated value.                                                        −1
                                                                                                                   (i)
For the derivation of the algorithm, the representation of linear                                               Rz          ṽ(i)
prediction in Eq. (11) is slightly modified as                                                    q(i) =        H          −1        . (42)
                                                                                                                        (i)
                                                                                                          ṽ(i)       Rz          ṽ(i)
                          zt = xt − Xt g,                                   (32)
                                                                                   Because the above beamformer minimizes the average power
where Xt and g are equivalent to xt and G with modified                            of zt weighted by the time-varying variance, we call this as
matrix structure defined as                                                        weighted MPDR (wMPDR) beamformer4 . As will be shown
                                                   2                               in Section III-C, a wMPDR beamformer is a special case of
            X t = IM ⊗ x >t ∈C
                                  M ×M (L−∆)
                                             ,                              (33)   a wMPDR CBF. A wMPDR CBF is reduced to a wMPDR
                                  H      2
             g = g1 , . . . , g>
                   >
                                    ∈ CM (L−∆)×1 ,                                 beamformer when setting the length of the CBF L = 1, i.e.,
                                 
                               M                                            (34)
                                                                                   by just making it a non-convolutional beamformer.
where ⊗ is Kronecker product, and gm is the mth column of                             The above algorithm, however, has two serious problems.
G. Then, considering that the CBF in Eqs. (11) and (12) can                        Firstly, the size of the covariance matrix in Eq. (38) is very
               (i)        H            
be written as yt = q(i)       xt − Xt g and omitting normal-                       large, requiring huge computing cost for calculating it and
ization terms, the objective function in Eq. (20) becomes                          its inverse. Secondly, as will be shown in the experiments,
                             T
                                                                                   the iterative and alternate estimation of Q and G tends to
                          1X                               2                       converge to a sub-optimal point. This is probably because the
               Lg (g) =         x t − Xt g                 Φq,t
                                                                  ,         (35)
                          T t=1                                                    update of G is performed based only on the output of the
                                                                                   fixed beamformer in the iterative and alternate estimation, as
           2
where kxkR = xH Rx, and Φq,t is a semi-definite Hermitian                          in Eq. (19), while the signal dimension of the beamformer
matrix defined as                                                                  output, i.e., I, is reduced from that of the original signal
                     I            H                                               space, i.e., M , with the over-determined case, i.e., I < M .
                    X   q(i) q(i)                                                  As a consequence, signal components that are relevant for
             Φq,t =          (i)
                                     ∈ CM ×M .      (36)
                    i=1     λt                                                     the update of G may be reduced in the beamformer output,
                                                                                   especially when the estimation of Q is not accurate at the
Because Eq. (35) is a quadratic form with a lower bound, g                         early stage of the optimization. This can seriously degrade the
that minimizes it can be obtained as                                               update of G.
     g = Ψ+ ψ,                                                              (37)      2) Proposed extension: Here, we present two techniques
         1X H                  2         2
                                                                                   to mitigate the above problems within the source-packed
     Ψ=        Xt Φq,t Xt ∈ CM (L−∆)×M (L−∆) ,                              (38)   factorization approach. The first one is used to reduce the
         T t
                                                                                   computing cost. As shown in Appendix A, Eqs. (38) and (39)
         1X H                 2
                                                                                   can be rewritten, using Eq. (28), as
     ψ=        Xt Φq,t xt ∈ CM (L−∆)×1 ,                                    (39)
         T t                                                                                           I 
                                                                                                     X              H  (i) > 
                                                                                                Ψ=         q(i) q(i) ⊗ Rx             ,       (43)
where (·)+ is the Moore-Penrose pseudo-inverse. Because
                                                                                                         i=1
the rank of Ψ is equal to or smaller than M I(L − ∆) as                                                  I                     ∗ 
will be shown in Section III-B2, Ψ is rank deficient for                                                 X            
                                                                                                    ψ=          q(i) ⊗ P(i)
                                                                                                                        x q
                                                                                                                            (i)
                                                                                                                                     ,                      (44)
over-determined cases, namely when M > I, and thus the                                                   i=1
use of the pseudo-inverse is indispensable. Eqs. (37) to (39)
are equivalent to those used in the dereverberation step for                       where ()∗ denotes complex conjugate. In the above equations,
                                                                                                                                             (i)
DR+SS [25], [26], [27] except that in this paper denoising is                      the majority of the calculation is coming from that of Rx .
additionally included in the objective and that over-determined                    Because the size of the matrix is much smaller than that of
cases are also considered. This filter is referred to as the                       Ψ, we can greatly reduce the computing cost with this mod-
multiple-target WPE filter in this paper.                                          ification5 in comparison with direct calculation of Eqs. (38)
   For the update of Q, fixing g at its previously estimated
                                                                                      4 A wMPDR beamformer is also called a Maximum-Likelihood Distortion-
value, the objective in Eq. (20) can be rewritten as                               less Response (MLDR) beamformer in [41].
                                                                                      5 In general, computational complexity of a matrix multiplication is more
                I             2                           H
                                                                                   than O(n2 ). Because the size of Ψ is M -times larger than Rx , it is suggested
                X                              
     LQ (Q) =          q(i)       (i)
                                        s.t.       q(i)        ṽ(i) = 1,   (40)   that the computational complexity for calculating Ψ is at least M 2 times
                              Rz
                 i=1                                                               larger than that for calculating Rx .
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020                                                                                       7

and (39). Although we still need to calculate the inverse of        where R(i)  is the covariance matrix defined in Eq. (28), and
                                                                          h x >                   i>
the huge matrix Ψ even with this modification, the cost is           (i)
                                                                    v = ṽ    (i)
                                                                                     , 0, . . . , 0 ∈ CM (L−∆+1)×1 corresponds to
relatively small in comparison with the direct calculation of
Ψ. (Note that Eq. (43) also shows the rank of Ψ to be equal         the RTF ṽ(i) with zero padding. Finally, the solution is given
to or smaller than M I(L − ∆).)                                     as
                                                                                                     −1
   The second technique introduces a heuristic to improve                                        R(i)      v(i)
                                                                                                    x
the update of the WPE filter. In order to use a whole M -                          w(i) =                             .        (51)
                                                                                                 H        −1
dimensional signal space to be considered for the update,                                   v(i)      R(i)      v (i)
                                                                                                       x
we modify the CBF to output not only I desired signals,
but also M − I auxiliary signals included in the orthogonal         The above equation gives the simplest form of the solution
complement Q⊥ of Q, and model the auxiliary signals as zero-        to a wMPDR CBF. It clearly shows that a wMPDR CBF
mean time-varying complex Gaussians. With this modification,        is a general case of a wMPDR beamformer. By setting
the optimization is performed by calculating the summation          L = 1 in the above solution, namely by letting it be a
in Eqs. (43) and (44) over not only 1 ≤ i ≤ I but also              non-convolutional beamformer, it reduces to the solution of
I < i ≤ M , letting q(I+1) , . . . , q(M ) be the orthonormal       a wMPDR beamformer in Eq. (42).
bases for the orthogonal complement Q⊥ . Because it is not             An advantage of the solution using the MISO CBFs is that
                                           (i)                      it can be obtained by a closed form equation provided the
important to distinguish the variances λt of the auxiliary
signals, we use the same value for them calculated as               RTFs and the time-varying variances of the desired signals
                           M
                                                                    are given, and we do not need to consider the interaction
                        1          H 2
                                                                    between DN and DR. With this approach, however, it is
                           X 
              λ⊥
               t =             q(i) zt ,                    (45)
                      M −I                                          necessary to estimate the RTFs directly from a reverberant
                                  i=I+1
                                                                    observation similar to [24]. A solution to this problem is to
                                  ⊥
and calculate P⊥ x and Rx based on Eqs. (30) and (31)               use dereverberation preprocessing based on a WPE filter for
accordingly. In summary, we can implement this modification         the RTF estimation. It was shown in [30] that the output of a
by adding the following terms, respectively, to Ψ and ψ in          WPE filter can be obtained in a computationally efficient way
Eqs. (43) and (44).                                                 within the framework of this approach. However, the source-
                    M            H
                                     !
                                          ⊥ >                     wise factorization approach described in the following can
                    X
          Ψ⊥ =          q(i) q(i)      ⊗ Rx      ,     (46)         more naturally solve this problem. So, this paper adopts it
                     i=I+1                                          as the solution.
                    M 
                    X                     ∗ 
           ψ⊥ =             q(i) ⊗ P⊥
                                    xq
                                       (i)
                                                .           (47)    D. Optimization based on source-wise factorization
                    i=i+1
                                                                       With the source-wise factorization, similar to the case with
                                                                    the direct optimization of the MISO CBFs, the optimization
C. Direct optimization of MISO CBFs
                                                                    can be performed separately for each source, and the resultant
   Before deriving the optimization with the source-wise fac-       algorithm is identical to that proposed for DN+DR in [31].
torization, we show that we can directly optimize the MISO             Considering that a CBF   can be written based  on Eqs. (16)
CBFs in Eq. (14), and summarize its characteristics. With this                                 H
                                                                                                        (i) H 
                                                                                  (i)        
setting, the CBFs and the objective function are both defined       and (17) as yt = q(i)         xt − G         xt and using the
separately for each source in Eqs. (14) and (19), and thus,         factorized form of R(i)
                                                                                         x in Eq. (28), the objective function in
the optimization can be performed separately for each source.       Eq. (19) can be rewritten as
The resultant algorithm is, therefore, identical to that proposed           (i)                  (i) −1             2
for DN+DR in [42], where this type of CBF is also referred                                     (i)
                                                                        Li G , q(i) = G − Rx                 P(i)
                                                                                                                x   q (i)
to as a Weighted Power minimization Distortionless response                                                                             (i)
                                                                                                                                       Rx
(WPD) CBF.                                                                                      2
   For presenting the solution, we introduce the following                             +   q(i)     (i)
                                                                                                         
                                                                                                           (i) H
                                                                                                               
                                                                                                                  (i) −1 (i)
                                                                                                                                 .        (52)
                                                                                                    Rx − Px      Rx     Px
vector representation of Eq. (14).
                                    H                                                                        (i)
                       (i)
                                                                   In the above objective function, G is contained only in the
                      yt = w(i) xt ,                         (48)
                                                                    first term, and the term can be minimized, not depending on
                                                                                               (i)
where w(i) is defined as                                            the value of q(i) , when G takes the following value.
                                      "          #                                         (i)
                                                                                                 (i) −1
                                           (i)
                            (i)           w0                                             G = Rx           P(i)
                                                                                                           x .             (53)
                        w         =                  ,      (49)
                                          w(i)
                                                                                                        (i)
                                                                    So, this is a solution6 of G that globally minimizes the
              (i)       (i)                                                                                                (i)
Then, when   λt  and ṽ are given, Eq. (19) becomes a simple        objective function given the time-varing variance λt . Inter-
constraint quadratic form as                                        estingly, this solution is identical to that of the conventional
                         2             H
     Lw (w(i) ) = w(i) (i) s.t. w(i) v(i) = 1,          (50)          6 This is not a unique solution. The first term is minimized even when an
                              Rx                                    arbitrary matrix, of which null space includes q(i) , is added to Eq. (53).
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020                                                                                                      8

   Algorithm 1: Source-packed factorization-based opti-                                 Algorithm 2: Source-wise factorization-based opti-
   mization for estimation of all the sources.                                          mization for estimation of the ith source
    Data: Observed signal xt for all t                                                   Data: Observed signal xt for all t
                           (i)                                                                              (i)
           TF masks γt for all t and 1 ≤ i ≤ I                                                  TF masks γt for all t
                                       (i)                                                                                (i)
    Result: Estimated sources yt for all t and 1 ≤ i ≤ I                                 Result: Estimated ith source yt for all t
                (i)             2                                                                   (i)          2
 1 Initialize λt as ||xt ||I /M for all t and 1 ≤ i ≤ I                               1 Initialize λt as ||xt ||I /M for all t
                                 M                                                                                M
                 (i)
 2 Initialize q      as the ith column of IM for 1 ≤ i ≤ I                            2 repeat
 3 Initialize zt as xt for all t
                                                                                               (i)      PT x xH
                                                                                      3     Rx ← T1 t=1 t(i)t
                                                                                                                λ
 4 repeat                                                                                               PT xttxHt
                                                                                               (i)    1
          (i)        PT x xH                                                          4     Px ← T t=1 (i)
 5     Rx ← T1 t=1 t(i)t for 1 ≤ i ≤ I                                                                (i) + λt
                               λ
          (i)      1
                     PT xttxHt                                                        5
                                                                                               (i)
                                                                                            G ← Rx              Px
                                                                                                                   (i)
 6     Px ← T t=1 (i) for 1 ≤ i ≤ I
                        λt                                                                  (i)
                                                                                                           (i) H
               PI                    H  (i) >                                      6     zt ← x t − G               xt
 7     Ψ ← i=1 q(i) q(i) ⊗ Rx
                                                                                                                                      (i)   (i)
               PI                
                                      (i)
                                                ∗                                   7      Estimate ṽ(i) based on zt and γt
 8     ψ ← i=1 q(i) ⊗ Px q(i)                                                                           PT z(i)      (i) H
                                                                                                                    zt
                                                                                               (i)              t

 9     Begin Add orthogonal complement beamformer                                     8      Ŕz ← T1 t=1           (i)
                                                                                                                             λt
                                                                                                                        +
10           Set q(I+1) , . . . , q(M ) be orthonormal bases for                      9      q   (i)
                                                                                                       ←
                                                                                                              (Ŕ(i)
                                                                                                                  z ) ṽ
                                                                                                                          (i)
                                                                                                                          +
              the orthogonal complement Q⊥ of Q                                                                  H
                                                                                                                   
                                                                                                                      (i)
                                                                                                         (ṽ(i) ) Ŕz         ṽ(i)
                                                         2
                                              (i) H                                           (i)           (i) H (i)
                             PM                                                                                 
             λ⊥         1                                                                    yt        ← q          zt
                                                 
              t ← M −I          i=I+1 q             zt
11                                                                                   10
                                                                                                                    2
               ⊥          PT x xH                                                             (i)             (i)
12           Rx ← T1 t=1 λt ⊥t                                                       11      λt ← yt
                         PT xt xt Ht                                                 12   until convergence
13           P⊥x ← T
                      1
                            t=1     ⊥
                           P λt                       H   ⊥ >
                                 M         (i)
14           Ψ←Ψ+                i=I+1 q         q(i)       ⊗ Rx
                                                                                                  (i)
                                                      ⊥ (i) ∗
                                                             
15
                           PM
             ψ ← ψ + i=I+1 q ⊗ Px q       (i)                                        where Ŕz is a variance-normalized covariance matrix of the
                                                                                     output of the single-target WPE filter, calculated as
16     End
                                                                                                                    H
17     g ← Ψ+ ψ                                                                                              T z(i) z(i)
       zt ← x t − X t g                                                                                   1 X    t    t
18                                                                                               Ŕz(i) =           (i)
                                                                                                                             ∈ CM ×M .      (55)
19
                                                   (i)
       Estimate ṽ(i) based on zt and γt for 1 ≤ i ≤ I                                                    T t=1    λt
                                    H
          (i)            T
       Rz ← T1 t=1 zt (z(i)t ) for 1 ≤ i ≤ I
                     P
20
                                 λt
                                                                                     Then, the solution can be obtained, under the distortionless
                    (   R(i)
                         )
                              + (i)
                                ṽ                                                   constraint, as a wMPDR beamformer defined by
21      q(i) ←           H
                          z
                               for 1 ≤ i ≤ I
                              (i) + (i)
                                 
                                                                                                                        −1
                  (  )
                  ṽ(i)      Rz     ṽ
                                                                                                                
                                                                                                                    (i)
         (i)          H                                                                                          Ŕz       ṽ(i)
22      yt     ← q(i) zt for 1 ≤ i ≤ I                                                                 (i)
                                                                                                      q =                               .    (56)
                            2                                                                                     H  (i) −1
23
         (i)
        λt ← yt
                      (i)
                       for 1 ≤ i ≤ I                                                                        ṽ(i)      Ŕz        ṽ(i)
24   until convergence                                                               Eqs. (54) to (56) are very similar to Eqs. (40) to (42), and
                                                                                     the difference is whether the dereverberation is performed by
                                                                                     multiple-target WPE filter or single-target WPE filter.
WPE dereverberation. This means that the WPE filter op-                                 With the source-wise factorization, the solution can be
                                                                                                                      (i)
timized solely for dereverberation can perform the optimal                           obtained in closed form when λt and ṽ(i) are given, similar
dereverberation for the joint optimization not depending on                          to the case with the direct optimization of the MISO CBFs.
                                                                                                                                                  (i)
the subsequent beamforming, provided the time-varying vari-                          In addition, the output of the WPE filter is obtained as zt
ance of the desired source is given for the optimization. In                         in Eq. (16), and can be efficiently used for estimation of the
addition, unlike the source-packed factorization approach, this                      RTFs. Furthermore, the size of the temporal-spatial covariance
approach does not need to compensate for the dimensionality                          matrix in Eq. (31) is much smaller than that in Eq. (38) of the
                                                            (i)                      source-packed factorization, and thus the computational cost
reduction of the beamformer output for the update of G
                                                                                     can be reduced. (See Section IV for more detailed discussion
because it considers a whole signal space without adding any
                                       (i)                                           on the computing cost.)
modification. We refer to this filter G as a single-target WPE
filter in this paper.                                                                                                                       (i)
             (i)
    Once G is obtained as the above solution, the objective                          E. Processing flow with estimation of λt                     and ṽ(i)
function in Eq. (19) can be rewritten as                                               This subsection describes examples of processing flows,
                                                                                     shown in Algorithms 1 and 2, for optimizing a CBF based
                              2                          H
        Li q(i) = q(i)                    s.t.       q(i)        ṽ(i) = 1,   (54)   on source-packed factorization and source-wise factorization,
                                    (i)                                                                                                        (i)
                                Ŕz                                                  including the estimation of the time-varying variances, λt ,
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020                                                                                          9

                      WPE for                wMPDR for                          similar to one used by a fully-Convolutional Time-domain
                                                                                Audio Separation Network (Conv-TasNet) [44]. The network
                      Dereverb               Beamform
                                                                                was trained so that it receives the output of the WPE filter that
                                                                                is obtained at the first iteration in the iterative optimization of
                 Es mate                    Es mate                             the CBF, and estimates the TF masks of the desired signals.
                                                                                The input of the network was set as concatenation of the
                                                                                real and imaginary parts of STFT coefficients, and the loss
                 Es mate                    Es mate                             function was set as the (scale-dependent) signal-to-distortion
                                                                                ratio (SDR) of the enhanced signal obtained by multiplying the
                                                TF masks                        estimated masks to the observed signal. For the training and
          For   1st   me    For subsequent mes                                  validation data, we synthesized mixtures using two utterances
                                                                                randomly extracted from the WSJ-CAM0 corpus [45] and two
Fig. 2. Processing flow of source-wise factorization-based CBF for estimating   room impulse responses and background noise extracted from
a source i.
                                                                                the REVERB Challenge training set [18].
                                                                                   For the estimation of the RTFs, ṽ(i) , we adopt a method
and the RTFs, ṽ(i) . Hereafter, we refer to both algorithms as                 based on eigenvalue decomposition with noise covariance
A-1 and A-2 for brevity. While A-1 estimates all the sources,                   whitening [46], [47]. With this technique, the steering vector
  (i)
yt for all i, at the same time from the observed signal xt ,                    v(i) is first estimated as
                                           (i)                                                                                  
A-2 estimates only one of the sources, yt for a certain i, and                                    v(i) = R\i MaxEig R−1    \i Ri ,             (57)
(if necessary) is repeatedly applied to the observed signal to
estimate all the sources one after another. As auxiliary inputs,                where MaxEig(·) is a function that calculates the eigenvector
                                                            (i)
TF masks are provided for both algorithms. A TF mask γt is                      corresponding to the maximum eigenvalue, Ri and R\i are a
associated with a source and a TF point, takes a value between                  spatial covariance matrix of the i-th desired signal and that of
0 and 1, and indicates whether the desired signal of the source                 the other signals, respectively, estimated as
                              (i)               (i)
dominates the TF point (γt = 1) or not (γt = 0). The TF                                               P (i) (i)  (i) H
masks over all the TF points are used to estimate the RTF(s) of                                         t γ t zt     zt
the desired signal(s) in line 19 of A-1 and line 7 of A-2. (See                                Ri =          P (i)            ,            (58)
Section III-E1 for the detail of estimation of the TF masks                                                    t γt
                                                                                                      P                     H
and the RTFs.)                                                                                                    (i)    (i)    (i)
                                                                (i)                                     t 1 − γt        zt zt
    Both algorithms estimate the time-varying variances λt                                    R\i =          P                    .      (59)
                                                                                                                          (i)
based on the same objective as that for the CBF, defined in                                                    t 1 − γt
Eq. (19). Because a closed form solution to the estimation
                                                                                Then, the RTF is obtained by Eq. (4).
of the CBF and the time-varying variances is not known, an
iterative and alternate optimization scheme is introduced to
both algorithms. In each iteration, the time-varying variances,                                         IV. D ISCUSSION
  (i)
λt , are updated in line 23 of A-1 and line 11 of A-2 as                           In summary, the proposed techniques can optimize a CBF
the power of the previously estimated values of the desired                     for jointly performing DN+DR+SS with greatly reduced com-
         (i)                                              (i)
signal yt , and then the CBF and the desired signal yt are                      puting cost in comparison with the direct application of
updated while fixing the time-varying variances. The iteration                  the conventional joint optimization technique proposed for
is repeated until convergence is obtained.                                      DR+SS to DN+DR+SS. With the conventional technique, it is
    The optimization methods described in Sections III-B and                    necessary to calculate the huge covariance matrix Ψ in order
III-D are used in respective algorithms for update of the CBF                   to take into account the dependency of G on Q inherently
and the desired signal(s). The WPE filter is first estimated in                 introduced in the source-packed factorization. This makes
lines 5 to 17 of A-1 and lines 3 to 5 of A-2, and applied in                    the computing cost of the conventional technique extremely
line 18 of A-1 and line 6 of A-2. After the RTF(s) is updated                   high. In contrast, the proposed extension of the source-packed
using the dereverberated signals, the wMPDR beamformer is                       factorization approach reduces the size of the matrix to be
estimated in lines 20 and 21 of A-1 and lines 8 and 9 of A-2,                   calculated substantively from M 2 (L−∆) for Ψ to M (L−∆)
and applied in line 22 of A-1 and line 10 of A-2.                               for Rx , and thus can effectively reduce the computing cost.
                                                                                                                                              (i)
    Figure 2 also illustrates the processing flow of a CBF with                    On the other hand, with the source-wise factorization, G
the source-wise factorization for estimating a source i.                        can be optimized independently of q(i) , which also allows
    1) Methods for estimating TF masks and RTFs: In ex-                         us to reduce the size of the matrix to be calculated to the
                                           (i)
periments, for estimating TF masks, γt , for all i and t at                     same as that of the proposed extension of the source-packed
each frequency, we used a Convolutional Neural Network                          factorization approach. In addition, we can skip the calculation
                                                                                                              ⊥
that works in the TF domain and is trained using utterance-                     of an additional matrix, Rx , and that of the inverse of
level Permutation Invariant Training criterion (CNN-uPIT)                       the huge matrix, Ψ−1 , which are required for the proposed
[43]. According to our preliminary experiments [32], we set                     extension of the source-packed factorization approach. This
the network structure as a CNN with a large receptive field                     makes the source-wise factorization approach computationally
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020                                                                                                10

                                                                 TABLE I
   CBF S COMPARED IN EXPERIMENTS . (1) AND (2) ARE CONVENTIONAL CASCADE CONFIGURATION APPROACHES , (5) IS A CONVENTIONAL JOINT
 OPTIMIZATION APPROACH , (6) AND (7) ARE PROPOSED JOINT OPTIMIZATION APPROACHES , AND (3) AND (4) ARE TEST CONDITIONS USED JUST FOR
 COMPARISON . (5), (6), AND (7) ARE CATEGORIZED AS “J OINTLY OPTIMAL” BECAUSE THEY ARE COMPOSED OF WPE AND W MPDR AND OPTIMIZED
 BASED ON INTEGRATED VARIANCE ESTIMATION ( SEE F IG . 3 FOR THE DIFFERENCE BETWEEN SEPARATE AND INTEGRATED VARIANCE ESTIMATION ).

         Name of method                                     Jointly       WPE               BF       Variance               Category
                                                            optimal                                 estimation
         (1)   WPE+MPDR (separate)                                    Multiple-target     MPDR       Separate         Cascade (conventional)
         (2)   WPE+MVDR (separate)                                    Multiple-target    MVDR        Separate         Cascade (conventional)
         (3)   WPE+wMPDR (separate)                                   Multiple-target    wMPDR       Separate              Test condition
         (4)   WPE+MPDR (integrated)                                  Single-target       MPDR      Integrated             Test condition
         (5)   Source-packed factorization (conventional)     X       Multiple-target    wMPDR      Integrated    Jointly optimal (conventional)
         (6)   Source-packed factorization (extended)         X       Multiple-target    wMPDR      Integrated      Jointly optimal (proposed)
         (7)   Source-wise factorization                      X       Single-target      wMPDR      Integrated      Jointly optimal (proposed)

further efficient. A drawback of the source-wise factorization
is that it has to handle I-times larger number of dereverberated                          Derev       BF                      Derev         BF
signals than the source-packed factorization.
   The source-wise factorization approach has additional ben-                      (a) Separate optimization          (b) Integrated optimization
efits w.r.t. computational efficiency when it is used in specific
scenarios listed below:                                                       Fig. 3. Separate and integrated variance optimization schemes. While the
                                                                              separate variance optimization updates λt for Derev as the variance of the
   • The source-wise factorization approach can estimate the
                                                                              Derev output, the integrated variance optimization updates it as the variance
      CBF by a closed-form equation when time-varying source                  of the beamformer output. As a consequence, λt for Derev is common to all
      variances are given, or estimated, e.g., using neural net-              the sources with separate variance optimization.
      works [15], [12]. In such a case, we can skip the iterative
      optimization. In contrast, the source-packed factorization
      approach needs to maintain iterations to estimate Q and                           CBF using a different type of masks and also to obtain
      g alternately due to their mutual dependency.                                     the top-line performance of a CBF.
   • The source-wise factorization approach is advantageous
      when it is combined with neural network-based single                    A. Dataset and evaluation metrics
      target speaker extraction that has been actively studied
                                                                                 For the evaluation, we prepared for a set of noisy reverberant
      recently [13]. With this combination, we can skip the es-
                                                                              speech mixtures (REVERB-2MIX) using the REVERB Chal-
      timation of sources other than the target source, allowing
                                                                              lenge dataset (REVERB) [18]. Each utterance in REVERB
      us to further reduce the computing cost.
                                                                              contains a single reverberant speech with moderate stationary
                                                                              diffuse noise. For generating a set of test data, we mixed two
                           V. E XPERIMENTS                                    utterances extracted from REVERB, one from its development
  This section experimentally confirms the effectiveness of the               set (Dev set) and the other from its evaluation set (Eval set),
proposed joint optimization approaches. Table I summarizes                    so that each pair of mixed utterances were recorded in the
optimization methods to be compared in the experiments (See                   same room, by the same microphone array, and under the same
Sections V-C and V-D for the detail of the methods). We                       condition (near or far, RealData or SimData). We categorize
compare them in the following three aspects.                                  the test data according to the original categories of the data in
  1) Effectiveness of joint optimization                                      REVERB (e.g., SimData or RealData). We created the same
      We compare a CBF with and without joint optimization                    number of mixtures in the test data as in the REVERB Eval set,
      in terms of estimation accuracy. The source-wise fac-                   such that each utterance in the REVERB Eval set is contained
      torization approach (Table I (7)) is compared with the                  in one of the mixtures in the test data. Furthermore, the length
      conventional cascade configuration (Table I (1) and (2)),               of each mixture in the test data was set at the same as that of
      and two additional test conditions (Table I (3) and (4)).               the corresponding utterance in the REVERB Eval set, while
  2) Comparison among joint optimization approaches                           the utterance from Dev set was trimmed or zero-padded at its
      We compare three joint optimization approaches, i.e.,                   end to have the same length as that of Eval set.
      the source-packed factorization approach with its con-                     For experiments in Section V-E, we also prepared for a
      ventional setting (Table I (5)), its proposed extension                 set of noisy reverberant speech mixtures, each of which is
      (Table I (6)), and the source-wise factorization approach               composed of three speaker utterances (REVERB-3MIX). We
      (Table I (7)), described, respectively, in Sections III-B1,             created REVERB-3MIX by adding one utterance extracted
      III-B2, and III-D, in terms of computational efficiency                 from REVERB Dev set to each mixture in REVERB-2MIX.
      and estimation accuracy.                                                Only RealData (i.e., real recordings of reverberant data) was
  3) Evaluation using oracle masks                                            created for REVERB-3MIX.
      We use oracle masks instead of estimated masks for                         In the experiments, we estimated two and three speech
      evaluating a CBF, in order to test the performance of a                 signals from each mixture, respectively, for REVERB-2MIX
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020                                                                                                                   11

                          TABLE II                                                                  TABLE III
        B EAMFORMER CONFIGURATIONS USED IN EXPERIMENTS                    WER (%) FOR R EAL DATA AND CD ( D B), FWSSNR ( D B), PESQ AND
                                                                          STOI FOR S IM DATA IN REVERB-2MIX OBTAINED USING DIFFERENT
                  M    L at each freq. range (kHz)    #iterations         BEAMFORMERS AFTER FIVE ESTIMATION ITERATIONS WITH C ONFIG -1.
                       0.0-0.8   0.8-1.5    1.5-8.0                       S CORES FOR REVERB-2MIX AND REVERB ( I . E ., SINGLE SPEAKER )
       Config-1   8      20         16         8          10                    WITH NO ENHANCEMENT (N O E NH ), ARE ALSO SHOWN .
       Config-2   4      20         16         8          10
                                                                                     Enhancement method          WER              CD         FWSSNR PESQ         STOI
                                                                           No Enh (REVERB-2MIX)                  62.49            5.44        1.12      1.12      0.55
and REVERB-3MIX, and evaluated only one of them corre-                     No Enh (REVERB)                       18.61            3.97        3.62      1.48      0.75
sponding to the REVERB Eval set, using baseline evaluation                 MPDR (w/o iteration)                  30.79            4.40        3.07      1.45      0.73
tools prepared for it. We selected the signal to be evaluated              MVDR (w/o iteration)                  30.89            4.43        3.00      1.44      0.73
from all the estimated speech signals based on the corre-                  wMPDR                                 28.75            3.96        4.46      1.60      0.75
lation between the separated signals and the original signal               (1)      WPE+MPDR (separate)          23.04            4.30        3.77      1.58      0.77
in the REVERB Eval set. As objective measures for speech                   (2)      WPE+MVDR (separate)          23.34            4.34        3.66      1.57      0.76
enhancement [48], we used the Cepstrum Distance (CD), the                  (3)      WPE+wMPDR (separate)         21.53            3.74        5.42      1.77      0.82
Frequency-Weighted Segmental SNR (FWSSNR), the Percep-                     (4)      WPE+MPDR (integrated)        23.22            4.28        3.66      1.56      0.76
                                                                           (7)      Source-wise factorization    20.03            3.67        5.57      1.80      0.81
tual Evaluation of Speech Quality (PESQ), and the Short-Time
Objective Intelligibility measure (STOI) [49]. To evaluate
the ASR performance, we used a baseline ASR system for                                   (1) WPE+MPDR (separate)                      (4) WPE+MPDR (integrated)
REVERB that was recently developed using Kaldi [50]. This                                (2) WPE+MVDR (separate)                      (7) Source-wise factorization
                                                                                         (3) WPE+wMPDR (separate)
system is composed of a Time-Delay Neural Network (TDNN)
acoustic model trained using lattice-free maximum mutual                           24                                           4.4
information (LF-MMI) and online i-vector extraction, and a
trigram language model. They are trained on the REVERB                             23                                           4.2
training set.
                                                                                   22
                                                                         WER (%)

                                                                                                                      CD (dB)
                                                                                                                                  4
B. Configurations of CBF
                                                                                   21
    Table I summarizes two configurations of the CBF examined                                                                   3.8
in experiments including the number of microphones M , the
                                                                                   20
filter length L, and the number of optimization iterations. The
                                                                                                                                3.6
sampling frequency was 16 kHz. A Hann window was used
for a short-time analysis with the frame length and shift being                    19
                                                                                          2    4      6     8   10                      2      4      6   8           10
set at 32 ms and 8 ms, respectively. The prediction delay was                                 #iterations                                     #iterations
set at ∆ = 4 for the WPE filter.                                                     6                                      1.85
    In the iterative optimization, the time-varying variances of                                                                1.8
sources were initialized as those of the observed signal for the                   5.5
WPE filter and as 1 for the wMPDR beamformer for all the                                                                    1.75
                                                                     FWSSNR (dB)

methods.                                                                             5
                                                                                                                                1.7
                                                                                                                     PESQ

                                                                                   4.5                                      1.65
C. Experiment-1: effectiveness of joint optimization
                                                                                                                                1.6
    In this experiment, we evaluated the effectiveness of the                        4
joint optimization focusing on its two characteristics. First,                                                              1.55
we compared three different filter combinations, a WPE filter                      3.5
                                                                                                                                1.5
followed by a wMPDR beamformer (WPE+wMPDR), a WPE                                        2      4     6   8     10                       2      4     6   8           10
filter followed by an MPDR beamformer (WPE+MPDR),                                             #iterations                                     #iterations
and a WPE filter followed by an MVDR beamformer                      Fig. 4. Comparison among joint optimization and cascade configuration
(WPE+MVDR). The first combination is required for the                approaches when using WPE+MPDR and WPE+wMPDR with integrated and
jointly optimal processing, and the others have been used for        separate optimization schemed using Config-1 for REVERB-2MIX.
the conventional cascade configuration. Second, we compared
two different variance optimization schemes shown in Fig. 3,
namely “separate” and “integrated” variance optimization             beamformer. A significant difference between the two schemes
schemes. With the separate variance optimization, the itera-         is whether the WPE filter uses the same variances for all
tive estimation of the time-varying variance was performed           the sources or different variances dependant on the sources
separately for the WPE filter and for the beamformer. This is        estimated by the beamformer.
the scheme used by the conventional cascade configuration. In           Table III compares WERs, CDs, FWSSNRs, PESQs, and
contrast, with the integrated variance optimization, the iterative   STOIs obtained after five estimation iterations using three
estimation was performed jointly for the WPE filter and the          beamformers (MPDR, MVDR, and wMPDR), two conven-
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020                                                                                    12

                          8                                                                8
        Frequency (kHz)

                                                                         Frequency (kHz)
                          6                                                                6

                          4                                                                4

                          2                                                                2

                          0                                                                0
                              0   0.5          1        1.5                                    0            0.5           1          1.5
                                        Time (s)                                                                   Time (s)
                                  (a) Observed signal                                                             (b) MVDR

                          8                                                                8
        Frequency (kHz)

                                                                         Frequency (kHz)
                          6                                                                6

                          4                                                                4

                          2                                                                2

                          0                                                                0
                              0   0.5          1        1.5                                    0            0.5           1          1.5
                                        Time (s)                                                                   Time (s)
                                   (c) WPE+MVDR                                                    (d) CBF with source-wise factorization
Fig. 5. A spectrogram of (a) a noisy reverberant mixture in RealData of REVERB-2MIX and those of enhanced signals obtained by (b) MVDR, (c)
WPE+MVDR and (d) CBF with source-wise factorization. The mixture is composed of two female speakers under far conditions.

tional cascade configuration approaches ((1) WPE+MPDR                  the integrated variance optimization, are both important for
and (2) WPE+MVDR), two test conditions ((3) and (4)),                  achieving the optimal performance.
and a proposed joint optimization approach ((7) source-wise
factorization). All methods used configuration Config-1 in             D. Experiment-2: Comparison among joint optimization ap-
Table I. Table III shows that 1) WPE+MPDR, WPE+MVDR,                   proaches
and WPE+wMPDR greatly outperformed MPDR, MVDR,                            In this experiments, we compared three joint optimiza-
and wMPDR, respectively, with all the conditions, 2) the               tion approaches, namely two source-packed factorization ap-
joint optimization approach, i.e., (7) source-wise factorization,      proaches, respectively, described in Sections III-B1 and
substantially outperformed all the other methods in terms of           III-B2, and denoted as “(5) Source-packed factorization (con-
all the measures except for a case in terms of STOI where              ventional)” and “(6) Source-packed factorization (extended),”
WPE+wMPDR (separate) gave a slightly better score than                 and the source-wise factorization approach, denoted as “(7)
(7) source-wise factorization. Furthermore, Fig. 4 shows con-          Source-wise factorization.” (5) Source-packed factorization
vergence curves of the two cascade configuration approaches,           (conventional) corresponds to the conventional joint optimiza-
two test conditions, and the joint optimization approach. The          tion technique, and (6) Source-packed factorization (extended)
performance of (7) source-wise factorization was the best of           and (7) Source-wise factorization correspond to our proposed
all and improved as the number of iterations increased. The            methods. Figure 6 compares the WERs obtained using the
second best was (3) WPE+wMPDR (separate). In contrast, the             three joint optimization approaches with Config-1 and Config-
other methods did not improve the scores after the first iter-         2. It shows that the proposed methods, i.e., (6) Source-packed
ation with both integrated and separate variance optimization          factorization (extended) and (7) Source-wise factorization,
schemes.                                                               performed comparably well and both greatly outperformed (5)
   Figure 5 shows a spectrogram of a noisy reverberant mixture         Source-packed factorization (conventional).
in RealData of REVERB-2MIX, and those of enhanced signals                 Table IV compares computing times required for the three
obtained using MVDR, WPE+MVDR, and CBF with source-                    approaches to estimate and apply the CBFs with ten estimation
wise factorization. The figure demonstrates that while all the         iterations for processing a mixture utterance with the length
enhancement methods were effective, the CBF with source-               being 9.44 s. The computing time was measured by a Mat-
wise factorization was the best of all for achieving denoising,        lab interpreter as the elapsed time. The computing time for
dereverberation, and source separation.                                estimating masks, which was 0.63 s and 7.2 s, respectively,
   The above results clearly show that the two characteristics         with and without a GPU (NVIDIA 2080ti), is not included
of the joint optimization approach, i.e., 1) the optimal com-          in the table. As shown in the table, for both configurations,
bination of a WPE filter and a wMPDR beamformer, and 2)                (6) Source-packed factorization (extended) greatly reduced
You can also read