SPEAKER DIARIZATION WITH SESSION-LEVEL SPEAKER EMBEDDING REFINEMENT USING GRAPH NEURAL NETWORKS - Unpaywall

Page created by Gene Clarke
 
CONTINUE READING
SPEAKER DIARIZATION WITH SESSION-LEVEL SPEAKER EMBEDDING REFINEMENT USING GRAPH NEURAL NETWORKS - Unpaywall
SPEAKER DIARIZATION WITH SESSION-LEVEL SPEAKER EMBEDDING REFINEMENT
                                                                  USING GRAPH NEURAL NETWORKS

                                               Jixuan Wang1,2∗ , Xiong Xiao3 , Jian Wu3 , Ranjani Ramamurthy3 , Frank Rudzicz1,2 , Michael Brudno1,2

                                                                    1
                                                                    University of Toronto, Canada 2 Vector Institute, Canada 3 Microsoft, USA
                                                           1,2
                                                               {jixuan, frank, brudno}@cs.toronto.edu 3 {xioxiao, jianwu, ranjanir}@microsoft.com
arXiv:2005.11371v1 [eess.AS] 22 May 2020

                                                                            ABSTRACT
                                           Deep speaker embedding models have been commonly used
                                           as a building block for speaker diarization systems; however,
                                           the speaker embedding model is usually trained according to
                                           a global loss defined on the training data, which could be sub-
                                           optimal for distinguishing speakers locally in a specific meeting
                                           session. In this work we present the first use of graph neural
                                           networks (GNNs) for the speaker diarization problem, utilizing
                                           a GNN to refine speaker embeddings locally using the struc-
                                           tural information between speech segments inside each session.
                                           The speaker embeddings extracted by a pre-trained model are
                                           remapped into a new embedding space, in which the different
                                           speakers within a single session are better separated. The model
                                           is trained for linkage prediction in a supervised manner by min-
                                           imizing the difference between the affinity matrix constructed
                                           by the refined embeddings and the ground-truth adjacency ma-         Fig. 1. Overview of the proposed method. A graph is con-
                                           trix. Spectral clustering is then applied on top of the refined      structed for each session using speaker embeddings extracted
                                           embeddings. We show that the clustering performance of the           by a pre-trained speaker embedding model. A GNN model is
                                           refined speaker embeddings outperforms the original embed-           applied to remap the original embedding into another embed-
                                           dings significantly on both simulated and real meeting data, and     ding space which is trained according to a loss function defined
                                           our system achieves the state-of-the-art result on the NIST SRE      by the difference between the refined affinity matrix and the
                                           2000 CALLHOME database.                                              ground-truth adjacency matrix.
                                              Index Terms— Speaker diarization, graph neural networks,
                                           deep speaker embedding.
                                                                                                                (AHC) [15], spectral clustering (SC) [16] and affinity propaga-
                                                                                                                tion [17]. Although deep learning methods driven by large scale
                                                                      1. INTRODUCTION                           datasets have dominated the fields of speaker and speech recog-
                                                                                                                nition, it is still non-trivial to design an end-to-end objective
                                           Speaker diarization is the problem of “who spoke when”. A typ-
                                                                                                                function for the speaker diarization problems which is permu-
                                           ical speaker diarization system usually contains multiple steps.
                                                                                                                tation invariant in terms of both the speaker order and speaker
                                           First, the non-speech parts are filtered out by voice activity de-
                                                                                                                number. Most recently there have been several end-to-end ap-
                                           tection (VAD). Second, the speech parts are split into small ho-
                                                                                                                proaches that either utilize a factored generative model [13] or
                                           mogeneous segments either uniformly or according to the de-
                                                                                                                are trained according to a permutation-free loss [18].
                                           tected speaker change points. Third, each segment is mapped
                                                                                                                     In this paper, we consider a task upstream of speaker di-
                                           into a fixed dimensional embedding, such as i-vector [1], x-
                                                                                                                arization. We suggest that the speaker diarization models could
                                           vector [2] or d-vector [3, 4, 5, 6, 7]. Finally clustering methods
                                                                                                                be further improved by locally refining the speaker embedding
                                           or end-to-end approaches are applied to generate the diariza-
                                                                                                                for each session. The speaker embedding models are usually
                                           tion results [8, 9, 10, 11, 12, 13]. Usually a classifier needs
                                                                                                                designed to generalize well across a large range of speakers
                                           to be trained for similarity scoring for i-vectors and x-vectors,
                                                                                                                with different characteristics. However, speaker diarization is a
                                           while similarities between d-vectors can usually be measured
                                                                                                                simpler task compared with speaker recognition in the perspec-
                                           by simple distance metrics, e.g. cosine or Euclidean distance.
                                                                                                                tive of distinguishing speakers: we only care about separating
                                               Commonly-used clustering methods for speaker diarization
                                                                                                                several speakers for each session.
                                           include K-means [14], agglomerative hierarchical clustering
                                                                                                                     Our approach is to refine speaker embeddings by improving
                                             ∗ Initial   work by first author done as an intern at Microsoft.   their performance for similarity measurement between speech
segments. Obviously, if we could correctly predict for each pair     mulated as the prediction of yi for each segment embedding xi
of speech segments whether they belong to the same speaker           such that yi is equal to yj if and only if xi and xj belong to the
or not, the diarization problem would be solved. Our method          same speakers, i, j ∈ {1, 2, . . . , N }.
utilizes local structural information existing among the speaker
embedding spaces for each session. The structural information
                                                                     2.2. Graph neural networks
is expressed by a graph built upon the speech segments for each
session. The task of predicting similarities between pairs of        We apply several variants of GNNs falling into the framework
nodes is analogous to the link prediction task on graphs. Our        of message passing neural networks (MPNNs) [25]. Under the
model can be seen as a neural link predictor [19, 20, 21] con-       MPNN framework, the convolutional operator is expressed as a
sisting of an encoding component and a scoring component.            message passing scheme:
The encoding component includes GNN [22, 23, 24, 25] lay-
ers to map speaker embeddings into another embedding space.                     x0i = γΘ xi , j∈N (i) φΘ (xi , xj , ei,j )
                                                                                                                              
                                                                                                                                    (1)
Fully connected layers or cosine similarity is used as the scor-
ing component. To train the model, a loss function is defined        where xi is the feature of node i in the current layer with dimen-
as the difference between the affinity matrix constructed by the     sion D, x0i is the node feature in the next layer with dimension
refined embeddings and the ground-truth adjacency matrix. By         D0 , ei,j is the edge feature from node i to node j , γ(·) and φ(·)
minimizing this loss the model will be able to refine the original   are the update function and message function, respectively, pa-
embeddings so that different speakers could be better separated      rameterized by Θ, and  denotes the aggregation function, e.g.,
within each session.                                                 sum, mean, max, etc.
     Very recently (and in parallel to this work), Lin et al. [26]        One GNN variant we apply is the graph convolutional net-
proposed the use of LSTMs to improve similarity measurement          work (GCN) described by [24] in which the  function cor-
for speaker diarization. Unlike their work, we utilize the struc-    responds to taking a certain weighted average of neighboring
tural information in the embedding space instead of utilizing        nodes. The updating scheme between two layers is:
the temporal information in label sequences. Also our model                                                      
not only outputs better similarity measurement but also refined                                        X
speaker embeddings, which could be potentially fed into non-                             x0i = σ W        Lij xj  ,                (2)
clustering-based methods. Our model also achieves better per-                                          j
formance with the same experimental setup. Experiments on
both simulated and real meeting data show that the performance       where L = D̂−1/2 ÂD̂−1/2 is the normalized affinity matrix
of speaker number detection and speaker diarization with the         added by self connection, Â = A + IN , D̂ denotes the degree
                                                                                             0
refined embeddings outperforms the original embeddings sig-          matrix of Â, W ∈ RD ×D is a layer-specific trainable weight
nificantly, and our system achieves the state-of-the-art result on   matrix, and σ(·) is a nonlinear function.
the NIST SRE 2000 CALLHOME database.
                                                                     2.3. Model design
      2. GRAPH BASED SPEAKER DIARIZATION
                                                                     Our model consists of two components: an encoding compo-
2.1. Building graphs of speech segments                              nent and a scoring component. GNN layers are applied as
                                                                     the encoding component. No nonlinear functions are applied
A graph is built for each session using the pretrained speaker       between the GNN layers. Because the speaker embeddings are
embeddings. Each node represents an audio segment which              already well trained, adding nonlinearity may lose informa-
could be word-level, utterance-level, or extracted by a sliding      tion inside the original embeddings and result in worse results.
window with a fixed sliding step. Speaker embedding of each          The encoding component generates the refined embeddings,
segment is extracted as the node features. The weight of edges       of which every pair are concatenated as inputs to the scoring
between nodes is represented by the PLDA scores or cosine            component.
similarities between the corresponding x-vectors and d-vectors,           Due to the distinct nature of x-vectors and d-vectors, differ-
respectively. We only keep edges of weight larger than a thresh-     ent scoring components are designed. For d-vectors, it is com-
old, which is treated as a hyperparameter.                           mon to leave the complexity to the model and use a simple dis-
    Formally, each meeting session can be represented as a           tance metric, e.g. cosine distance. Since the original d-vectors
graph G(V, E, A), where V is the set of nodes (speech seg-           are trained for being comparable by a simple distance metric,
ments), E is the set of edges, and A ∈ N × N is the affinity         we keep using this distance metric for scoring the refined em-
matrix with Aij > 0 if edge eij = (vi , vj ) ∈ E and Aij = 0         beddings. For x-vectors, on the other hand, an additional classi-
otherwise. The matrix of d-vectors X ∈ RN ×D can be treated          fier, e.g. PLDA, is usually trained to measure the similarity. In
as node features for graph G , where N is the total number           our model, we add fully connected layers with nonlinear func-
of segments and D is the dimension of the embedding space.           tions after the GNN layers, which is actually a classifier on top
The features of each node vi are represented by a d-vector xi ,      of the refined embeddings. This classifier is trained together
i ∈ 1, 2, . . . , N . The goal of speaker diarization can be for-    with the GNN layers.
2.4. Loss function                                                     by a sigmoid function. The GCN layers are implemented with
                                                                       the PyTorch Geometric library [32].
We train the GNN model w.r.t. a loss function that is defined
                                                                           Similar to previous work [13, 26], we conducted 5-fold
as the difference between the affinity matrix constructed by the
                                                                       cross validation. The dataset contains 500 sessions in total,
refined embedding and the ground-truth adjacency matrix. This
                                                                       which is uniformly split into 5 subsets of 100 sessions for eval-
is actually a binary classification of linkage for all pairs of seg-
                                                                       uation. In each turn, the model is trained for 50 epochs on
ments inside each session. For the x-vector based system, we
                                                                       400 sessions with an initial learning rate of 0.001 reduced by a
simply applied the binary cross entropy (BCE) loss on all pairs
                                                                       factor of 10 after 40 epochs. The optimal threshold for speaker
of segments for each session.
                                                                       number detection on the training set is used for evaluation on
    However, this simple loss function does not work for d-
                                                                       the testing set. A graph is built for each session on which we
vectors. Due to the simplicity of the distance metric used, the
                                                                       connect a pair of nodes if their PLDA score is higher than 0.2.
gradient would be very noisy if training the model w.r.t. the ex-
                                                                       The model is trained in a session-by-session fashion that each
act match between the cosine similarities and ground-truth 1s
                                                                       batch contains a single graph constructed from one session. All
or 0s. Instead we applied the following loss function:
                                                                       the hyperparameters are tuned on a validation set split from the
                                                                       training set for each turn. Results are shown in Table 1 and
      L = hist loss(A, Agt ) + α · kA − Agt knuclear ,          (3)
                                                                       summarized in the following section.
in which hist loss corresponds to the histogram loss [27],
k·knuclear is the nuclear norm between A and the ground-truth          Table 1. DER (%) on the NIST SRE 2000 CALLHOME
adjacency matrix Agt , and α is a scalar value hyperparameter.         dataset. SC refers to spectral clustering and AHC to agglom-
Histogram loss tries to ensure the distance between pairs of seg-      erative hierarchical clustering.
ments belonging to different speakers is larger than the distance
between those belonging to the same speakers. The nuclear                             Method                               DER(%)
norm, which is the sum of singular values of a matrix [28]                            x-vector + PLDA + AHC (5-fold)          8.64
depends on the rank match between A and Agt , instead of                  Baseline
                                                                                      x-vector + PLDA + SC (5-fold)           8.05
an exact match. The combination of these two loss functions                           Wang et al. [12]                        12.0
achieves similarity scores that stay in a reasonable range.                           Sell et al. [33]                        11.5
                                                                          Recent
                                                                                      Romero et al. [34]                       9.9
                                                                          Work
2.5. Spectral clustering                                                              Zhang et al. [13] (5-fold)               8.5
                                                                                      Lin et al. [26] (5-fold)                7.73
We use spectral clustering as our backend method and follow
                                                                          Ours        GNN based (5-fold)                      7.24
the standard steps described in [16]. However, to detect the
speaker number, we first perform eigen-decomposition on the
refined affinity matrix and then determine the speaker number
by the number of eigenvalues that are larger than a threshold,         3.2. Speaker diarization results
which we treat as a tunable parameter. As shown in Section 4.2,
this method outperforms the commonly used method based on              We compare speaker diarization error rate (DER) of our model
maximal eigengap.                                                      with several recent works. We achieve state-of-the-art perfor-
                                                                       mance resulting in a DER of 7.24%, which outperforms both
                                                                       the baselines and all recent works. For fairness, we only include
         3. EXPERIMENTS WITH X-VECTORS
                                                                       the results without system fusion of [26]. We only include the
                                                                       result with in-domain training of [13]. The DER of [13] with
3.1. Datasets and implementation details
                                                                       external data is 7.6% which is still worse than our result. Also
To compare our x-vector-based systems with other meth-                 the speaker embedding model of [13] was trained on a much
ods, we follow the evaluation steps described in the call-             larger dataset than the one we are using, which may also influ-
home diarization/v2 receipe of Kaldi [29]. We used the pre-            ence the results.
trained x-vector model and PLDA scoring model trained on
augmented Switchboard and NIST SRE datasets [30]. Results
                                                                                 4. EXPERIMENTS WITH D-VECTORS
on the commonly used dataset NIST SRE 2000 CALLHOME
(LDC2001S97) Disk 8 are reported.                                      4.1. Datasets and implementation details
     Our model architecture includes two GCN [24] layers fol-
lowed by two fully connected layers. Input x-vector is 128-            For d-vector-based system, we trained the model on simulated
dimensional. The dimension of output embedding from the                conversations and evaluated it on real meetings. The d-vector
GCN layers is 64, of which every pair is concatenated as a 128-        extraction model is trained and fake meetings are simulated
dimensional input to the following fully connected layers. The         from the VoxCeleb2 dataset [35]. For each simulated meeting,
first fully connected layers is 64-dimensional with the ELU acti-      we randomly select 2-15 speakers from the training set of Vox-
vation function [31] and the last layer has 1 dimension followed       Celeb2 as the meeting participants. And for each speaker, 2-60
speech segments with duration of 1.5s are sampled. Moreover,        mance of the threshold-based method with original embeddings
we use 45 sessions of in-house real meetings as another test-       degrades on the out-of-domain (in-house meeting) dataset. This
ing set. On average, each meeting contains 6 speakers, ranging      is likely because original embeddings are less robust with re-
from 2 to 16. The average duration of meeting sessions is about     spect to the threshold, while the refined embeddings are robust
30 minutes.                                                         even with a threshold tuned on a different dataset. Conse-
    To build the graphs, we only connect a pair of speech seg-      quently the threshold-based method with refined embeddings
ments if the cosine similarity between their d-vectors is larger    significantly outperforms other approaches on both in-domain
than a threshold 0.2, which is tuned on the validation set. Our     and out-of-domain datasets.
model contains two GCN layers, of which the hidden dimension
and output dimension are both equal to the input dimension of
                                                                    Table 2. Mean error of speaker number detection (first two
128. We used 150 bins for the histogram loss. The DER re-
                                                                    rows) and DER (last row) for the eigengap-based (gap) method
ported only includes speaker confusion.
                                                                    and the threshold-based method (thred) using original (O) and
                                                                    refined (R) embeddings, respectively. The last line shows DER
4.2. Speaker number detection results                               improvement relative to O-gap.
We first evaluate the refined embedding for speaker number
                                                                          Dataset            O-gap   O-thred    R-gap     R-thred
detection. We search for the optimal threshold for this task
                                                                      VoxCeleb2 Test          4.77    1.56       2.64      1.03
through simulated conversations on the validation set. The
                                                                     In-house meetings        3.23    4.39       2.86      1.20
threshold which achieves the lowest mean speaker number
                                                                     In-house meetings
detection error across all simulated sessions is chosen for eval-                             0%     -116.7%    16.3%     73.7%
                                                                     (Rel. DER reduction)
uation on testing set. As shown in Figure 2, the minimal error
achieved by the refined embeddings is lower than the original
embeddings with a threshold of 3.0 and 2.0, respectively. The
curved slope of the refined embeddings after the optimal thresh-
                                                                    4.3. Speaker diarization results
old is lower than the original embeddings, which indicates that
speaker number detection with the refined embedding is more         We compare the performance of spectral clustering on speaker
robust.                                                             diarization using both original and refined embeddings. The re-
                                                                    sults in Table 2 are consistent with the results on speaker num-
                                                                    ber detection. Here again using the threshold-based model us-
                                                                    ing the original embeddings actually performs worse than the
                                                                    eigengap-based method for the out-of-domain dataset. How-
                                                                    ever the model using refined embeddings and the threshold-
                                                                    based method for speaker number detection achieves a 73.7%
                                                                    relative DER reduction compared to the model with the original
                                                                    embeddings. This indicates that our system should be general-
                                                                    izable to real world applications.

                                                                                            5. CONCLUSION

Fig. 2. Mean error of speaker number detection by different         In this paper, we present the first use of GNNs for speaker di-
thresholds. The dashed line (original) refers to results with       arization. A graph of speech segments is built for each meeting
the original speaker embeddings, while the solid line (refined)     session and a GNN model is trained to refine the speaker em-
refers to results with the refined speaker embeddings.              bedding utilizing the structural information in the graph. Our
                                                                    model achieves the state-of-the-art performance on a public
                                                                    dataset, while experiments on both simulated and real meetings
    With the optimal threshold tuned on the validation set we
                                                                    show that spectral clustering based on the refined embeddings
evaluate the performance of speaker number detection on two
                                                                    can achieve much better performance than original embeddings
test sets: fake conversations simulated from the testing set
                                                                    in terms of speaker number detection and clustering.
of VoxCeleb2 and an in-house meeting dataset, which are in-
domain and out-of-domain datasets respectively. As shown
in Table 2, the performance with refined embeddings outper-                         6. ACKNOWLEDGEMENT
forms the original embeddings on both datasets with both the
eigengap-based method [12] and the threshold-based method.          The authors would like to thank Tianyan Zhou and Yong Zhao
    Intriguingly, while the threshold-based methods with orig-      for providing the d-vector extraction model, and Chun-Hao
inal and refined embeddings outperform the eigengap-based           Chang and Zhuo Chen for helpful discussions. Rudzicz is a
methods on the in-domain dataset (VoxCeleb2 Test), the perfor-      CIFAR Chair in AI.
7. REFERENCES                                         [18] Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu,
                                                                                    and Shinji Watanabe, “End-to-end neural speaker diarization with
 [1] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and               permutation-free objectives,” Proc. Interspeech 2019, pp. 4300–4304,
     Pierre Ouellet, “Front-end factor analysis for speaker verification,”          2019.
     IEEE Transactions on Audio, Speech, and Language Processing, vol.         [19] Thomas N Kipf and Max Welling, “Variational graph auto-encoders,”
     19, no. 4, pp. 788–798, 2010.                                                  arXiv preprint arXiv:1611.07308, 2016.
 [2] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey,           [20] Muhan Zhang and Yixin Chen, “Link prediction based on graph neural
     and Sanjeev Khudanpur, “X-vectors: Robust dnn embeddings for                   networks,” in Advances in Neural Information Processing Systems,
     speaker recognition,” in 2018 IEEE International Conference on                 2018, pp. 5165–5175.
     Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp.
                                                                               [21] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy
     5329–5333.
                                                                                    Gabrilovich, “A review of relational machine learning for knowledge
 [3] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and              graphs,” Proceedings of the IEEE, vol. 104, no. 1, pp. 11–33, 2015.
     Javier Gonzalez-Dominguez, “Deep neural networks for small foot-
                                                                               [22] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi
     print text-dependent speaker verification,” in 2014 IEEE International
                                                                                    Zhang, and Philip S Yu, “A comprehensive survey on graph neural
     Conference on Acoustics, Speech and Signal Processing (ICASSP).
                                                                                    networks,” arXiv preprint arXiv:1901.00596, 2019.
     IEEE, 2014, pp. 4052–4056.
                                                                               [23] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang, “A
 [4] Hervé Bredin, “Tristounet: triplet loss for speaker turn embedding,”
                                                                                    comprehensive survey of graph embedding: Problems, techniques, and
     in 2017 IEEE international conference on acoustics, speech and signal
                                                                                    applications,” IEEE Transactions on Knowledge and Data Engineer-
     processing (ICASSP). IEEE, 2017, pp. 5430–5434.
                                                                                    ing, vol. 30, no. 9, pp. 1616–1637, 2018.
 [5] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, “Gen-
                                                                               [24] Thomas N. Kipf and Max Welling, “Semi-supervised classification
     eralized end-to-end loss for speaker verification,” in 2018 IEEE In-
                                                                                    with graph convolutional networks,” in 5th International Conference
     ternational Conference on Acoustics, Speech and Signal Processing
                                                                                    on Learning Representations, ICLR 2017, Toulon, France, April 24-
     (ICASSP). IEEE, 2018, pp. 4879–4883.
                                                                                    26, 2017, Conference Track Proceedings. 2017, OpenReview.net.
 [6] Jixuan Wang, Kuan-Chieh Wang, Marc T Law, Frank Rudzicz, and
                                                                               [25] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals,
     Michael Brudno, “Centroid-based deep metric learning for speaker
                                                                                    and George E Dahl, “Neural message passing for quantum chem-
     recognition,” in ICASSP 2019-2019 IEEE International Conference
                                                                                    istry,” in Proceedings of the 34th International Conference on Machine
     on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019,
                                                                                    Learning-Volume 70. JMLR. org, 2017, pp. 1263–1272.
     pp. 3652–3656.
                                                                               [26] Qingjian Lin, Ruiqing Yin, Ming Li, Hervé Bredin, and Claude Barras,
 [7] Tianyan Zhou, Yong Zhao, Jinyu Li, Yifan Gong, and Jian Wu, “Cnn
                                                                                    “LSTM based Similarity Measurement with Spectral Clustering for
     with phonetic attention for text-independent speaker verification,” in
                                                                                    Speaker Diarization,” 2019, HAL CCSD.
     Proc. IEEE Workshop on Automatic Speech Recognition and Under-
     standing, 2019.                                                           [27] Evgeniya Ustinova and Victor Lempitsky, “Learning deep embeddings
                                                                                    with histogram loss,” in Advances in Neural Information Processing
 [8] Sylvain Meignier and Teva Merlin, “Lium spkdiarization: an open
                                                                                    Systems, 2016, pp. 4170–4178.
     source toolkit for diarization,” 2010.
                                                                               [28] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo, “Guaranteed
 [9] Stephen H Shum, Najim Dehak, Réda Dehak, and James R Glass,                   minimum-rank solutions of linear matrix equations via nuclear norm
     “Unsupervised methods for speaker diarization: An integrated and it-           minimization,” SIAM review, vol. 52, no. 3, pp. 471–501, 2010.
     erative approach,” IEEE Transactions on Audio, Speech, and Lan-
     guage Processing, vol. 21, no. 10, pp. 2015–2028, 2013.                   [29] “callhome diarization/v2 receipe of Kaldi,” https://github.
                                                                                    com/kaldi-asr/kaldi/tree/master/egs/callhome_
[10] Gregory Sell and Daniel Garcia-Romero, “Speaker diarization with               diarization/v2, Accessed: 2019-10-15.
     plda i-vector scoring and unsupervised calibration,” in 2014 IEEE
     Spoken Language Technology Workshop (SLT). IEEE, 2014, pp. 413–           [30] “SRE16 Xvector Model,” http://kaldi-asr.org/models/
     417.                                                                           m6, Accessed: 2019-10-15.
[11] Daniel Garcia-Romero, David Snyder, Gregory Sell, Daniel Povey,           [31] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, “Fast
     and Alan McCree, “Speaker diarization using deep neural network                and accurate deep network learning by exponential linear units (elus),”
     embeddings,” in 2017 IEEE International Conference on Acoustics,               in 4th International Conference on Learning Representations, ICLR,
     Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4930–4934.              2016.
[12] Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and           [32] Matthias Fey and Jan Eric Lenssen, “Fast graph representation learn-
     Ignacio Lopz Moreno, “Speaker diarization with lstm,” in 2018 IEEE             ing with pytorch geometric,” arXiv preprint arXiv:1903.02428, 2019.
     International Conference on Acoustics, Speech and Signal Processing       [33] G. Sell and D. Garcia-Romero, “Diarization resegmentation in the
     (ICASSP). IEEE, 2018, pp. 5239–5243.                                           factor analysis subspace,” in 2015 IEEE International Conference on
[13] Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, and Chong                   Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp.
     Wang, “Fully supervised speaker diarization,” in ICASSP 2019-2019              4794–4798.
     IEEE International Conference on Acoustics, Speech and Signal Pro-        [34] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree,
     cessing (ICASSP). IEEE, 2019, pp. 6301–6305.                                   “Speaker diarization using deep neural network embeddings,” in 2017
[14] Stuart Lloyd, “Least squares quantization in pcm,” IEEE transactions           IEEE International Conference on Acoustics, Speech and Signal Pro-
     on information theory, vol. 28, no. 2, pp. 129–137, 1982.                      cessing (ICASSP), March 2017, pp. 4930–4934.

[15] Lior Rokach and Oded Maimon, “Clustering methods,” in Data min-           [35] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “Voxceleb2:
     ing and knowledge discovery handbook, pp. 321–352. Springer, 2005.             Deep speaker recognition,” in Proc. Interspeech 2018, 2018, pp. 1086–
                                                                                    1090.
[16] Ulrike Von Luxburg, “A tutorial on spectral clustering,” Statistics and
     computing, vol. 17, no. 4, pp. 395–416, 2007.
[17] Brendan J Frey and Delbert Dueck, “Clustering by passing messages
     between data points,” science, vol. 315, no. 5814, pp. 972–976, 2007.
You can also read