How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings

Badr M. Abdullah    Iuliia Zaitova    Tania Avgustinova    Bernd Möbius    Dietrich Klakow
Department of Language Science and Technology (LST)
Saarland Informatics Campus, Saarland University, Germany
Corresponding author: babdullah@lsv.uni-saarland.de

Abstract

How do neural networks “perceive” speech sounds from unknown languages? Does the typological similarity between the model’s training language (L1) and an unknown language (L2) have an impact on the model representations of L2 speech signals? To answer these questions, we present a novel experimental design based on representational similarity analysis (RSA) to analyze acoustic word embeddings (AWEs)—vector representations of variable-duration spoken-word segments. First, we train monolingual AWE models on seven Indo-European languages with various degrees of typological similarity. We then employ RSA to quantify the cross-lingual similarity by simulating native and non-native spoken-word processing using AWEs. Our experiments show that typological similarity indeed affects the representational similarity of the models in our study. We further discuss the implications of our work on modeling speech processing and language similarity with neural networks.

1 Introduction

Mastering a foreign language is a process that requires (human) language learners to invest time and effort. If the foreign language (L2) is very distant from our native language (L1), not much of our prior knowledge of language processing would be relevant in the learning process. On the other hand, learning a language that is similar to our native language is much easier since our prior knowledge becomes more useful in establishing the correspondences between L1 and L2 (Ringbom, 2006). In some cases where L1 and L2 are closely related and typologically similar, it is possible for an L1 speaker to comprehend L2 linguistic expressions to a great degree without prior exposure to L2. The term receptive multilingualism (Zeevaert, 2007) has been coined in the sociolinguistics literature to describe this ability of a listener to comprehend utterances of an unknown but related language without being able to produce it.

Human speech perception has been an active area of research in the past five decades, which has produced a wealth of documented behavioral studies and experimental findings. Recently, there has been a growing scientific interest in the cognitive modeling community to leverage the recent advances in speech representation learning to formalize and test theories of speech perception using computational simulations on the one hand, and to investigate whether neural networks exhibit similar behavior to humans on the other hand (Räsänen et al., 2016; Alishahi et al., 2017; Dupoux, 2018; Scharenborg et al., 2019; Gelderloos et al., 2020; Matusevych et al., 2020b; Magnuson et al., 2020).

In the domain of modeling non-native speech perception, Schatz and Feldman (2018) have shown that a neural speech recognition system (ASR) predicts Japanese speakers’ difficulty with the English phonemic contrast /l/–/ɹ/ as well as English speakers’ difficulty with the Japanese vowel length distinction. Matusevych et al. (2021) have shown that a model of non-native spoken word processing based on neural networks predicts lexical processing difficulty of English-speaking learners of Russian. The latter model is based on word-level representations that are induced from naturalistic speech data, known in the speech technology community as acoustic word embeddings (AWEs). AWE models map variable-duration spoken-word segments onto fixed-size representations in a vector space such that instances of the same word type are (ideally) projected onto the same point in space (Levin et al., 2013). In contrast to word embeddings in natural language processing (NLP), an AWE encodes information about the acoustic-phonetic and phonological structure of the word, not its semantic content.
Figure 1: An illustrated example of our experimental design. A set of N spoken-word stimuli from language λ are embedded using the encoder F^(λ) which was trained on language λ to obtain a (native) view of the data: X^(λ/λ) ∈ R^{D×N}. Simultaneously, the same stimuli are embedded using encoders trained on other languages, namely F^(α) and F^(β), to obtain two different (non-native) views of the data: X^(λ/α) and X^(λ/β). We quantify the cross-lingual similarity between two languages by measuring the association between their embedding spaces using the representational similarity analysis (RSA) framework.
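To make the notion of native and non-native views concrete, the sketch below shows how one set of spoken-word stimuli yields several D × N embedding matrices. This is our own illustration rather than the authors' released code; the toy encoders and segments are placeholders for trained AWE models and real speech.

import numpy as np

def embed_all(encoder, stimuli):
    """Embed every stimulus with one encoder and stack into a D x N 'view' matrix."""
    embeddings = [encoder(segment) for segment in stimuli]   # each is a D-dim vector
    return np.stack(embeddings, axis=1)                      # shape: (D, N)

# toy stand-ins: 5 variable-length "segments" and two encoders that pool the frames
stimuli = [np.random.randn(t, 39) for t in (40, 55, 37, 60, 48)]
native_encoder = lambda seg: seg.mean(axis=0)        # placeholder for F^(lambda)
nonnative_encoder = lambda seg: seg.max(axis=0)      # placeholder for F^(alpha)

X_native = embed_all(native_encoder, stimuli)        # X^(lambda/lambda), shape (39, 5)
X_nonnative = embed_all(nonnative_encoder, stimuli)  # X^(lambda/alpha), shape (39, 5)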

Although the model of Matusevych et al. (2021) has shown similar effects to what has been observed in behavioral studies (Cook et al., 2016), it remains unclear to what extent AWE models can predict a facilitatory effect of language similarity on cross-language spoken-word processing. In this paper, we present a novel experimental design to probe the receptive multilingual knowledge of monolingual AWE models (i.e., trained without L2 exposure) using the representational similarity analysis (RSA) framework. In a controlled experimental setup, we use AWE models to simulate spoken-word processing of native speakers of seven languages with various degrees of typological similarity. We then employ RSA to characterize how language similarity affects the emergent representations of the models when tested on the same spoken-word stimuli (see Figure 1 for an illustrated overview of our approach). Our experiments demonstrate that neural AWE models of different languages exhibit a higher degree of representational similarity if their training languages are typologically similar.

2 Proposed Methodology

A neural AWE model can be formally described as an encoder function F : A → R^D, where A is the (continuous) space of acoustic sequences and D is the dimensionality of the embedding. Given an acoustic word signal represented as a temporal sequence of T acoustic events A = (a_1, a_2, ..., a_T), where a_t ∈ R^k is a spectral vector of k coefficients, an embedding is computed as

    x = F(A; θ) ∈ R^D    (1)

Here, θ are the parameters of the encoder, which are learned by training the AWE model in a monolingual supervised setting. That is, the training spoken-word segments (i.e., speech intervals corresponding to spoken words) are sampled from utterances of native speakers of a single language where the word identity of each segment is known. The model is trained with an objective that maps different spoken segments of the same word type onto similar embeddings. To encourage the model to abstract away from speaker variability, the training samples are obtained from multiple speakers, while the resulting AWEs are evaluated on a held-out set of speakers.

Our research objective in this paper is to study the discrepancy of the representational geometry of two AWE encoders that are trained on different languages when tested on the same set of (monolingual) spoken stimuli. To this end, we train AWE encoders on different languages where the training data and conditions are comparable across languages with respect to size, domain, and speaker variability. We therefore have access to several encoders {F^(α), F^(β), ..., F^(ω)}, where the superscripts {α, β, ..., ω} denote the language of the training samples.
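A minimal sketch of such an encoder F(·; θ) in PyTorch is given below. It is our illustration rather than the authors' implementation; the layer sizes and the use of the final recurrent states as the embedding are assumptions.

import torch
import torch.nn as nn

class AWEEncoder(nn.Module):
    """Maps a variable-duration acoustic sequence (T x k) to a fixed-size AWE in R^D."""
    def __init__(self, input_dim=39, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, num_layers=2,
                          batch_first=True, bidirectional=True)

    def forward(self, acoustic_seq):        # acoustic_seq: (batch, T, k)
        _, h_n = self.rnn(acoustic_seq)     # h_n: (num_layers * 2, batch, hidden_dim)
        # concatenate the final forward and backward states of the top layer
        return torch.cat([h_n[-2], h_n[-1]], dim=-1)   # (batch, D) with D = 2 * hidden_dim

encoder = AWEEncoder()
A = torch.randn(1, 57, 39)   # one spoken-word segment with T = 57 spectral frames
x = encoder(A)               # AWE with D = 512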
Now consider a set of N held-out spoken-word stimuli produced by native speakers of language λ: A^(λ)_{1:N} = {A^(λ)_1, ..., A^(λ)_N}. First, each acoustic word stimulus in this set is mapped onto an embedding using the encoder F^(λ), which yields a matrix X^(λ/λ) ∈ R^{D×N}. Since the encoder F^(λ) was trained on language λ, we refer to it as the native encoder and consider the matrix X^(λ/λ) as a simulation of a native speaker’s phonetic perceptual space. To simulate the phonetic perceptual space of a non-native speaker, say a speaker of language α, the stimuli A^(λ)_{1:N} are embedded using the encoder F^(α), which yields a matrix X^(λ/α) ∈ R^{D×N}. Here, we read the superscript notation (λ/α) as word stimuli of language λ encoded by a model trained on language α. Thus, the two matrices X^(λ/λ) and X^(λ/α) represent two different views of the same stimuli. Our main hypothesis is that the cross-lingual representational similarity between emergent embedding spaces should reflect the acoustic-phonetic and phonological similarities between the languages λ and α. That is, the more distant languages λ and α are, the more dissimilar their corresponding representation spaces are. To quantify this cross-lingual representational similarity, we use Centered Kernel Alignment (CKA) (Kornblith et al., 2019). CKA is a representation-level similarity measure that emphasizes the distributivity of information and therefore obviates the need to establish a correspondence mapping between single neurons in the embeddings of two different models. Moreover, CKA has been shown to be invariant to orthogonal transformation and isotropic scaling, which makes it suitable for our analysis when comparing different languages and learning objectives. Using CKA, we quantify the similarity between languages λ and α as

    sim(λ, α) := CKA(X^(λ/λ), X^(λ/α))    (2)

Here, sim(λ, α) ∈ [0, 1] is a scalar that measures the correlation between the responses of the two encoders, i.e., native F^(λ) and non-native F^(α), when tested with spoken-word stimuli A^(λ)_{1:N}. sim(λ, α) = 1 is interpreted as perfect association between the representational geometry of models trained on languages λ and α, while sim(λ, α) = 0 indicates that no association can be established. Similarly, we quantify the cross-lingual representational similarity between languages λ and β as

    sim(λ, β) := CKA(X^(λ/λ), X^(λ/β))    (3)

If sim(λ, α) > sim(λ, β), then we interpret this as an indication that the native phonetic perceptual space of language λ is more similar to the non-native phonetic perceptual space of language α, compared to that of language β. Note that while CKA(., .) is a symmetric metric (i.e., CKA(X, Y) = CKA(Y, X)), our established similarity metric sim(., .) is not symmetric (i.e., sim(λ, α) ≠ sim(α, λ)). To estimate sim(α, λ), we use word stimuli of language α and collect the matrices X^(α/α) and X^(α/λ). Then we compute

    sim(α, λ) := CKA(X^(α/α), X^(α/λ))    (4)

When we apply the proposed experimental pipeline across M different languages, the effect of language similarity can be characterized by constructing a cross-lingual representational similarity matrix (xRSM), which is an asymmetric M × M matrix where each cell represents the correlation (or agreement) between two embedding spaces.
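The sketch below shows one way to compute linear CKA (following the feature-space formula of Kornblith et al., 2019) and the asymmetric similarity of Eqs. (2)–(4), and to assemble an xRSM. It is a minimal illustration under our own assumptions; in particular, the random `views` dictionary stands in for the real D × N matrices X^(a/b).

import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two views whose columns are the same N stimuli."""
    X = X.T - X.T.mean(axis=0, keepdims=True)   # (N, D1), mean-centered features
    Y = Y.T - Y.T.mean(axis=0, keepdims=True)   # (N, D2)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro"))

def sim(views, lam, alpha):
    """Asymmetric similarity of Eq. (2): stimuli of language lam, encoder of language alpha."""
    return linear_cka(views[(lam, lam)], views[(lam, alpha)])

langs = ["CZE", "POL", "RUS", "BUL", "POR", "FRA", "DEU"]
D, N = 64, 200
# views[(a, b)] holds X^(a/b); random placeholders here instead of real embeddings
views = {(a, b): np.random.randn(D, N) for a in langs for b in langs}
xrsm = np.array([[sim(views, a, b) for b in langs] for a in langs])   # asymmetric M x M matrix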
3 Acoustic Word Embedding Models

We investigate three different approaches to training AWE models that have been previously introduced in the literature. In this section, we formally describe each one of them.

3.1 Phonologically Guided Encoder

The phonologically guided encoder (PGE) is a sequence-to-sequence model in which the network is trained as a word-level acoustic model (Abdullah et al., 2021). Given an acoustic sequence A and its corresponding phonological sequence ϕ = (ϕ_1, ..., ϕ_τ), the acoustic encoder F is trained to take A as input and produce an AWE x, which is then fed into a phonological decoder G whose goal is to generate the sequence ϕ (Fig. 2–a). The objective is to minimize a categorical cross-entropy loss at each timestep in the decoder, which is equivalent to

    L = − Σ_{i=1}^{τ} log P_G(ϕ_i | ϕ_{<i}, x)

where P_G is the probability of the phone ϕ_i at timestep i, conditioned on the previous phone sequence ϕ_{<i} and the embedding x.
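A schematic rendering of this objective for a single training pair is sketched below; it assumes a decoder that returns per-timestep logits over the phone inventory under teacher forcing, which is our own simplification rather than the exact published implementation.

import torch

def pge_loss(encoder, decoder, A, phones):
    """Sum of per-timestep categorical cross-entropy terms for the phonological decoder.
    A: (1, T, k) acoustic sequence; phones: (1, tau) target phone indices."""
    x = encoder(A)                       # AWE, shape (1, D)
    logits = decoder(x, phones)          # (1, tau, n_phones), teacher-forced decoding
    return torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)),   # (tau, n_phones)
        phones.view(-1),                    # (tau,)
        reduction="sum",                    # sum over the tau decoder timesteps
    )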
Figure 2: A visual illustration of the different learning objectives for training AWE encoders: (a) phonologically guided encoder (PGE): a sequence-to-sequence network with a phonological decoder, (b) correspondence auto-encoder (CAE): a sequence-to-sequence network with an acoustic decoder, and (c) contrastive siamese encoder (CSE): a contrastive network trained via triplet margin loss.
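For the contrastive siamese encoder, a triplet margin objective can be sketched as follows; the distance function, margin value, and toy tensors are our assumptions, since the corresponding subsection is not included in this excerpt.

import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1.0 - nn.functional.cosine_similarity(a, b),
    margin=0.4,   # assumed margin, not taken from the paper
)

# toy AWEs: anchor and positive come from two utterances of the same word type,
# the negative from a different word type (all of shape (batch, D))
x_anchor, x_positive, x_negative = torch.randn(3, 8, 512).unbind(0)
loss = triplet_loss(x_anchor, x_positive, x_negative)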

has been explored in the AWEs literature with different underlying architectures (Settle and Livescu,
(DEU). We acknowledge that our language sample is not typologically diverse. However, one of our objectives in this paper is to investigate whether the cross-lingual similarity of AWEs can predict the degree of mutual intelligibility between related languages. Therefore, we focus on the Slavic languages in this sample (CZE, POL, RUS, and BUL), which are known to be typologically similar and mutually intelligible to various degrees.

To train our AWE models, we obtain time-aligned spoken-word segments using the Montreal Forced Aligner (McAuliffe et al., 2017). Then, we sample 42 speakers of balanced gender from each language. For each language, we sample ~32k spoken-word segments that are longer than 3 phonemes in length and shorter than 1.1 seconds in duration (see Table 2 in Appendix A for word-level summary statistics of the data). For each word type, we obtain an IPA transcription using the grapheme-to-phoneme (G2P) module of the automatic speech synthesizer, eSpeak. Each acoustic word segment is parametrized as a sequence of 39-dimensional Mel-frequency spectral coefficients, where frames are extracted over intervals of 25 ms with a 10 ms overlap.
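A feature-extraction sketch along these lines is shown below, using librosa with 13 base coefficients plus deltas and delta-deltas to reach 39 dimensions; the synthetic signal, sampling rate, and 10 ms frame shift are our assumptions, not specifications taken from the paper.

import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr)                              # stand-in for a ~1-second word segment
win = int(0.025 * sr)                                # 25 ms analysis window
hop = int(0.010 * sr)                                # assumed 10 ms frame shift
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=win, win_length=win, hop_length=hop)
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),            # first-order deltas
                   librosa.feature.delta(mfcc, order=2)])  # second-order deltas
A = feats.T   # (T, 39) sequence of spectral vectors for one spoken-word segment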
4.2 Architecture and Hyperparameters

Acoustic Encoder  We employ a 2-layer recurrent neural network with a bidirectional Gated Recurrent Unit (BGRU) with a hidden state dimension of 512, which yields a 1024-dimensional AWE.

Training Details  All models in this study are trained for 100 epochs with a batch size of 256 using the ADAM optimizer (Kingma and Ba, 2015) and an initial learning rate (LR) of 0.001. The LR is reduced by a factor of 0.5 if the mean average precision (mAP) for word discrimination on the validation set does not improve for 10 epochs. The epoch with the best validation performance during training is used for evaluation on the test set.

Implementation  We build our models using PyTorch (Paszke et al., 2019) and use FAISS (Johnson et al., 2017) for efficient similarity search during evaluation. Our code is publicly available on GitHub.2

2 https://github.com/uds-lsv/xRSA-AWEs
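The learning-rate schedule described above corresponds to PyTorch's plateau scheduler; the sketch below is a minimal configuration, with the encoder, training step, and validation mAP replaced by placeholders.

import torch

model = torch.nn.GRU(39, 512, num_layers=2, bidirectional=True)   # stand-in for the AWE encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# halve the learning rate when validation mAP stops improving for 10 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max",
                                                       factor=0.5, patience=10)

for epoch in range(100):
    # ... one training epoch over spoken-word segments would go here ...
    val_map = 0.5   # placeholder for this epoch's validation mAP
    scheduler.step(val_map)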
4.3 Quantitative Evaluation

We evaluate the models using the standard intrinsic evaluation of AWEs: the same-different word discrimination task (Carlin et al., 2011). This task aims to assess the ability of a model to determine whether or not two given speech segments correspond to the same word type, which is quantified using a retrieval metric (mAP) reported in Table 1.

             Encoder type
Language   PGE     CAE     CSE
CZE        78.3    76.1    82.9
POL        67.6    63.5    73.8
RUS        64.3    57.7    71.0
BUL        72.1    68.9    78.4
POR        74.5    72.2    80.4
FRA        65.6    64.5    68.5
DEU        67.9    70.3    75.8

Table 1: mAP performance on evaluation sets.
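One common way to compute the mAP reported in Table 1, shown below as a sketch, is to score all segment pairs by cosine similarity and take the average precision of retrieving same-word pairs; the exact pairing protocol (e.g., speaker constraints) is not specified in this excerpt, so the toy data are placeholders.

import numpy as np
from itertools import combinations
from sklearn.metrics import average_precision_score

def same_different_map(embeddings, word_labels):
    """embeddings: (N, D) array of AWEs; word_labels: N word types of the segments."""
    scores, targets = [], []
    for i, j in combinations(range(len(word_labels)), 2):
        a, b = embeddings[i], embeddings[j]
        scores.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
        targets.append(int(word_labels[i] == word_labels[j]))           # 1 if same word type
    return average_precision_score(targets, scores)

emb = np.random.randn(6, 4)                                   # toy embeddings
labels = ["ruka", "ruka", "voda", "voda", "ruka", "okno"]      # toy word types
print(round(same_different_map(emb, labels), 3))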
PGE:
        CZE    POL    RUS    BUL    POR    FRA    DEU
CZE      –    0.745  0.733  0.734  0.676  0.699  0.651
POL    0.731    –    0.712  0.722  0.669  0.668  0.622
RUS    0.723  0.720    –    0.748  0.694  0.681  0.624
BUL    0.719  0.724  0.745    –    0.700  0.691  0.638
POR    0.684  0.705  0.721  0.718    –    0.688  0.649
FRA    0.696  0.691  0.683  0.700  0.679    –    0.663
DEU    0.683  0.675  0.658  0.663  0.658  0.661    –

CAE:
        CZE    POL    RUS    BUL    POR    FRA    DEU
CZE      –    0.762  0.747  0.736  0.706  0.717  0.703
POL    0.759    –    0.721  0.721  0.693  0.700  0.681
RUS    0.745  0.735    –    0.761  0.721  0.695  0.690
BUL    0.742  0.731  0.763    –    0.731  0.706  0.692
POR    0.737  0.722  0.735  0.736    –    0.726  0.705
FRA    0.739  0.735  0.705  0.710  0.737    –    0.717
DEU    0.687  0.680  0.659  0.649  0.660  0.658    –

CSE:
        CZE    POL    RUS    BUL    POR    FRA    DEU
CZE      –    0.264  0.244  0.225  0.208  0.203  0.180
POL    0.248    –    0.269  0.245  0.219  0.205  0.179
RUS    0.220  0.257    –    0.234  0.212  0.189  0.178
BUL    0.207  0.241  0.238    –    0.198  0.190  0.170
POR    0.234  0.266  0.269  0.254    –    0.234  0.202
FRA    0.193  0.220  0.200  0.202  0.197    –    0.179
DEU    0.180  0.195  0.193  0.183  0.177  0.182    –

Figure 3: The cross-lingual representational similarity matrix (xRSM) for each model: PGE (Left), CAE (Middle), and CSE (Right). Each row corresponds to the language of the spoken-word stimuli while each column corresponds to the language of the encoder. Note that the matrices are not symmetric. For example, in the PGE model, the cell at row CZE and column POL holds the value of sim(CZE, POL) = CKA(X^(CZE/CZE), X^(CZE/POL)) = 0.745, while the cell at row POL and column CZE holds the value of sim(POL, CZE) = CKA(X^(POL/POL), X^(POL/CZE)) = 0.731.

5 Representational Similarity Analysis

Figure 3 shows the cross-lingual representational similarity matrices (xRSMs) across the three different models using the linear CKA similarity metric.3 Warmer colors indicate a higher similarity between two representational spaces. One can observe striking differences between the PGE-CAE models on the one hand, and the CSE model on the other hand. The PGE and CAE models yield representations that are cross-lingually more similar to each other compared to those obtained from the contrastive CSE model. For example, the highest similarity score from the PGE model is sim(RUS, BUL) = 0.748, which means that the representations of the Russian word stimuli from the Bulgarian model F^(BUL) exhibit the highest representational similarity to the representations of the native Russian model F^(RUS). On the other hand, the lowest similarity score from the PGE model is sim(POL, DEU) = 0.622, which shows that the German model's view of the Polish word stimuli is the view that differs the most from the view of the native Polish model. Likewise, the highest similarity score we observe from the CAE model is sim(BUL, RUS) = 0.763, while the lowest similarity is sim(DEU, BUL) = 0.649. If we compare these scores to those of the contrastive CSE model, we observe significant cross-language differences as the highest similarity scores are sim(POL, RUS) = sim(POR, RUS) = 0.269, while the lowest is sim(BUL, DEU) = 0.170. This discrepancy between the contrastive model and the other models suggests that training AWE models with a contrastive objective hinders the ability of the encoder to learn high-level phonological abstractions, and therefore contrastive encoders are more sensitive to cross-lingual variation during inference compared to their sequence-to-sequence counterparts.

5.1 Cross-Lingual Comparison

To get further insights into how language similarity affects the model representations of non-native spoken-word segments, we apply hierarchical clustering on the xRSMs in Figure 3 using the Ward algorithm (Ward, 1963) with Euclidean distance. The result of the clustering analysis is shown in Figure 4. Surprisingly, the generated trees from the xRSMs of the PGE and CAE models are identical, which could indicate that these two models induce similar representations when trained on the same data. Diving a level deeper into the cluster structure of these two models, we observe that the Slavic languages form a pure cluster. This could be explained by the observation that some of the highest pair-wise similarity scores among the PGE models are observed between Russian and Bulgarian, i.e., sim(RUS, BUL) = 0.748 and sim(BUL, RUS) = 0.745, and Czech and Polish, i.e., sim(CZE, POL) = 0.745. The same trend can be observed in the CAE models, i.e., sim(RUS, BUL) = 0.761, sim(BUL, RUS) = 0.763, and sim(CZE, POL) = 0.762. Within the Slavic cluster, the West Slavic languages Czech and Polish are grouped together, while Russian (East Slavic) is first grouped with Bulgarian (South Slavic) before joining the West Slavic group to make the Slavic cluster. Although Russian and Bulgarian belong to two different Slavic branches, we observe that this pair forms the first sub-cluster in both trees, at a distance smaller than that of the West Slavic cluster (Czech and Polish). At first, this might seem surprising as we would expect the West Slavic cluster to be formed at a lower distance, given the similarities among the West Slavic languages which facilitate cross-language speech comprehension, as documented by sociolinguistic studies (Golubovic, 2016). However, Russian and Bulgarian share typological features at the acoustic-phonetic and phonological levels which distinguish them from West Slavic languages.4 We further elaborate on these typological features in §6. Even though Portuguese was grouped with French in the cluster analysis, which one might expect given that both are Romance languages descended from Latin, it is worth pointing out that the representations of the Portuguese word stimuli from the Slavic models show a higher similarity to the representations of the native Portuguese model compared to those obtained from the French model (with only two exceptions, the Czech PGE model and the Polish CAE model). We also provide an explanation of why this might be the case in §6.

The generated language cluster from the CSE model does not show any clear internal structure with respect to language groups since all cluster pairs are grouped at a much higher distance compared to the PGE and CAE models.

3 The xRSMs obtained using non-linear CKA with an RBF kernel are shown in Figure 6 in Appendix B, which show very similar trends to those observed in Figure 3.

4 Note that our models do not have access to word orthography. Thus, the similarity cannot be due to the fact that Russian and Bulgarian use Cyrillic script.
[Figure 4 shows three dendrograms with languages color-coded as Slavic, Romance, or Germanic. Leaf order for the PGE and CAE trees: Czech, Polish, Russian, Bulgarian, Portuguese, French, German; for the CSE tree: Russian, Polish, Czech, Portuguese, Bulgarian, French, German.]

Figure 4: Hierarchical clustering analysis on the cross-lingual representational similarity matrices (using linear CKA) of the three models: PGE (Left), CAE (Middle), and CSE (Right).
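The clustering behind Figure 4 can be reproduced schematically as follows. This is our sketch under the assumption that each language is represented by its row of the xRSM, which is one plausible reading of the procedure described above; the random xrsm array is a placeholder for the actual matrix.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

langs = ["CZE", "POL", "RUS", "BUL", "POR", "FRA", "DEU"]
xrsm = np.random.rand(7, 7)   # placeholder for the xRSM computed earlier

# each language described by its row of similarities; Ward's criterion with Euclidean distance
Z = linkage(xrsm, method="ward", metric="euclidean")
dendrogram(Z, labels=langs)
plt.show()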

Furthermore, the Slavic languages in the generated tree do not form a pure cluster since Portuguese was placed inside the Slavic cluster. We also do not observe the West Slavic group as in the other two models since Polish was grouped first with Russian, and not Czech. We believe that this unexpected behavior of the CSE models can be related to the previously reported poor performance of contrastive AWEs in capturing word-form similarity (Abdullah et al., 2021).

Moreover, it is interesting to observe that German seems to be the most distant language to the other languages in our study. This observation holds across all three encoders since the representations of the German word stimuli by non-native models are the most dissimilar compared to the representations of the native German model.

5.2 Cross-Model Comparison

To verify our hypothesis that the models trained with decoding objectives (PGE and CAE) learn similar representations when trained on the same data, we conduct a similarity analysis between the models in a setting where we obtain views of the spoken-word stimuli from native models, then compare these views across different learning objectives while keeping the language fixed, using CKA as we do in our cross-lingual comparison. The results of this analysis are shown in Figure 5. We observe that across different languages the pairwise representational similarity of the PGE-CAE models (linear CKA = 0.721) is very high in comparison to that of the PGE-CSE models (linear CKA = 0.239) and the CAE-CSE models (linear CKA = 0.230).5 Although the non-linear CKA similarity scores tend to be higher, the general trend remains identical in both measures. This finding validates our hypothesis that the PGE and CAE models yield similar representation spaces when trained on the same data even though their decoding objectives operate over different modalities. That is, the decoder of the PGE model aims to generate the word's phonological structure in the form of a sequence of discrete phonological units, while the decoder of the CAE model aims to generate an instance of the same word represented as a sequence of (continuous) spectral vectors. Moreover, the sequences that these decoders aim to generate vary in length (the mean phonological sequence length is ~6 phonemes while the mean spectral sequence length is ~50 vectors). Although the CAE model has no access to abstract phonological information of the word-forms it is trained on, it seems that this model learns non-trivial knowledge about word phonological structure, as demonstrated by the representational similarity of its embeddings to those of a word embedding model that has access to word phonology (i.e., PGE) across different languages.

5 CKA scores are averaged over languages.

6 Discussion

6.1 Relevance to Related Work

Although the idea of representational similarity analysis (RSA) originated in the neuroscience literature (Kriegeskorte et al., 2008), researchers in NLP and speech technology have employed a similar set of techniques to analyze emergent representations of multi-layer neural networks. For example, RSA has previously been employed to analyze the correlation between neural and symbolic representations (Chrupała and Alishahi, 2019), contextualized word representations (Abnar et al., 2019; Abdou et al., 2019; Lepori and McCoy, 2020; Wu et al., 2020), representations of self-supervised speech models (Chung et al., 2021) and visually-grounded speech models (Chrupała et al., 2020).
We take inspiration from previous work and apply RSA in a unique setting: to analyze the impact of typological similarity between languages on the representational geometry. Our analysis focuses on neural models of spoken-word processing that are trained on naturalistic speech corpora. Our goal is two-fold: (1) to investigate whether or not neural networks exhibit predictable behavior when tested on speech from a different language (L2), and (2) to examine the extent to which the training strategy affects the emergent representations of L2 spoken-word stimuli. To the best of our knowledge, our study is the first to analyze the similarity of emergent representations in neural acoustic models from a cross-linguistic perspective.

[Figure 5 reports the average cross-model CKA per model pair. Linear CKA: PGE vs. CAE = 0.721, PGE vs. CSE = 0.239, CAE vs. CSE = 0.230; CKA with an RBF kernel: 0.757, 0.397, and 0.401, respectively.]

Figure 5: Cross-model CKA scores: linear CKA (left) and non-linear CKA (right). Each point in this plot is the within-language representational similarity of two models trained with different objectives when tested on the same (monolingual) word stimuli. Each red point is the average CKA score per model pair.

6.2 A Typological Discussion

Given the cross-linguistic nature of our study, we choose to discuss our findings on the representational similarity analysis from a language typology point of view. Language typology is a sub-field within linguistics that is concerned with the study and categorization of the world's languages based on their linguistic structural properties (Comrie, 1988; Croft, 2002). Since acoustic word embeddings are word-level representations induced from actual acoustic realizations of word-forms, we focus on the phonetic and phonological properties of the languages in our study. We emphasize that our goal is not to use the method we propose in this paper as a computational approach for phylogenetic reconstruction, that is, discovering the historical relations between languages. However, typological similarity usually correlates with phylogenetic distance since languages that have diverged from a common historical ancestor inherited most of their typological features from the ancestor language (see Bjerva et al. (2019) for a discussion on how typological similarity and genetic relationships interact when analyzing neural embeddings).

Within the language sample we study in this paper, four of these languages belong to the Slavic branch of Indo-European languages. Compared to the Romance and Germanic branches of Indo-European, Slavic languages are known to be remarkably more similar to each other not only at the lexical and syntactic levels, but also at the pre-lexical level (acoustic-phonetic and phonological features). These cross-linguistically shared features between Slavic languages include rich consonant inventories, phonemic iotation, and complex consonant clusters. The similarities at different linguistic levels facilitate spoken intercomprehension, i.e., the ability of a listener to comprehend an utterance (to some degree) in a language that is unknown, but related to their mother tongue. Several sociolinguistic studies of mutual intelligibility have reported a higher degree of intercomprehension among speakers of Slavic languages compared to other language groups in Europe (Golubovic, 2016; Gooskens et al., 2018). On the other hand, and even though French and Portuguese are both Romance languages, they are less mutually intelligible compared to Slavic language pairs as they have diverged in their lexicons and phonological structures (Gooskens et al., 2018). This shows that cross-language speech intelligibility is not mainly driven by shared historical relationships, but by contemporary typological similarities.

Therefore, and given the documented phonological similarities between Slavic languages, it is not surprising that Slavic languages form a pure cluster in our clustering analysis over the xRSMs of two of the models we investigate. Furthermore, the grouping of Czech-Polish and Russian-Bulgarian in the Slavic cluster can be explained if we consider word-specific suprasegmental features. Besides the fact that Czech and Polish are both West Slavic languages that form a spatial continuum of language variation, both languages have fixed stress. The word stress in Czech is always on the initial syllable, while Polish has penultimate stress, that is, stress falls on the syllable preceding the last syllable of the word (Sussex and Cubberley, 2006). On the other hand, Russian and Bulgarian belong to different Slavic sub-groups—Russian is East Slavic while Bulgarian is South
Slavic. The representational similarity between Russian and Bulgarian can be attributed to the typological similarities between Bulgarian and East Slavic languages. In contrast to West Slavic languages, the stress in Russian and Bulgarian is free (it can occur on any syllable in the word) and movable (it moves between syllables within morphological paradigms). Slavic languages with free stress tend to have a stronger contrast between stressed and unstressed syllables, and vowels in unstressed syllables undergo a process that is known as vowel-quality alternation, or lexicalized vowel reduction (Barry and Andreeva, 2001). For example, consider the Russian word ruka (‘hand’). In the singular nominative case the word-form is phonetically realized as ruká [rʊˈka], while in the singular accusative case as rúku [ˈrukʊ] (Sussex and Cubberley, 2006). Here, the unstressed high back vowel /u/ is realized as [ʊ] in both word-forms. Although vowel-quality alternations in Bulgarian follow different patterns, Russian and Bulgarian vowels in unstressed syllables are reduced in temporal duration and quality. Therefore, the high representational similarity between Russian and Bulgarian could be explained if we consider their typological similarities in word-specific phonetics and prosody.

The Portuguese language presents us with an interesting case study of language variation. From a phylogenetic point of view, Portuguese and French have diverged from Latin and therefore they are both categorized as Romance languages. The clustering of the xRSMs of the PGE and CAE models groups Portuguese and French together at a higher distance compared to the Slavic group, while Portuguese was grouped with Slavic languages when analyzing the representations of the contrastive CSE model. Similar to Russian and Polish, sibilant consonants (e.g., /ʃ/, /ʒ/) and the velarized (dark) L-sound (i.e., /ɫ/) are frequent in Portuguese. We hypothesize that contrastive training encourages the model to pay more attention to the segmental information (i.e., individual phones) in the speech signal at the expense of phonotactics (i.e., phone sequences). Given that contrastive learning is a predominant paradigm in speech representation learning (van den Oord et al., 2018; Schneider et al., 2019), we encourage further research to analyze whether or not speech processing models trained with contrastive objectives exhibit a similar behavior to that observed in human listeners and closely examine their plausibility for cognitive modeling.

6.3 Implications on Cross-Lingual Transfer Learning

A recent study on zero-resource AWEs has shown that cross-lingual transfer is more successful when the source (L1) and target (L2) languages are more related (Jacobs and Kamper, 2021). We conducted a preliminary experiment on the cross-lingual word discrimination performance of the models in our study and observed a similar effect. However, we also observed that cross-lingual transfer using the contrastive model is less effective compared to the models trained with decoding objectives. From our reported RSA analysis, we showed that the contrastive objective yields models that are cross-lingually dissimilar compared to the other objectives. Therefore, future work could investigate the degree to which our proposed RSA approach predicts the effectiveness of different models in a cross-lingual zero-shot scenario.

7 Conclusion

We presented an experimental design based on representational similarity analysis (RSA) whereby we analyzed the impact of language similarity on the representational similarity of acoustic word embeddings (AWEs). Our experiments have shown that AWE models trained using decoding objectives exhibit a higher degree of representational similarity if their training languages are typologically similar. We discussed our findings from a typological perspective and highlighted pre-lexical features that could have an impact on the models' representational geometry. Our findings provide evidence that AWE models can predict the facilitatory effect of language similarity on cross-language speech perception and complement ongoing efforts in the community to assess their utility in cognitive modeling. Our work can be further extended by considering speech segments below the word level (e.g., syllables, phonemes), incorporating semantic representations into the learning procedure, and investigating other neural architectures.

Acknowledgements

We thank the anonymous reviewers for their constructive comments and insightful feedback. We further extend our gratitude to Miriam Schulz and Marius Mosbach for proofreading the paper. This research is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Project ID 232722074 – SFB 1102.
References

Mostafa Abdou, Artur Kulmizev, Felix Hill, Daniel M. Low, and Anders Søgaard. 2019. Higher-order comparisons of sentence encoder representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5838–5845, Hong Kong, China. Association for Computational Linguistics.

Badr M. Abdullah, Marius Mosbach, Iuliia Zaitova, Bernd Möbius, and Dietrich Klakow. 2021. Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study. In Proc. Interspeech 2021, pages 4194–4198.

Samira Abnar, Lisa Beinborn, Rochelle Choenni, and Willem Zuidema. 2019. Blackbox meets blackbox: Representational similarity & stability analysis of neural language models and brains. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 191–203, Florence, Italy. Association for Computational Linguistics.

Afra Alishahi, Marie Barking, and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 368–378, Vancouver, Canada. Association for Computational Linguistics.

William Barry and Bistra Andreeva. 2001. Cross-language similarities and differences in spontaneous speech patterns. Journal of the International Phonetic Association, 31(1):51–66.

Johannes Bjerva, Robert Östling, Maria Han Veiga, Jörg Tiedemann, and Isabelle Augenstein. 2019. What do language representations really represent? Computational Linguistics, 45(2):381–389.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "siamese" time delay neural network. In Proc. NIPS.

Michael A. Carlin, Samuel Thomas, Aren Jansen, and Hynek Hermansky. 2011. Rapid evaluation of speech representations for spoken term discovery. In Proc. Interspeech.

Grzegorz Chrupała and Afra Alishahi. 2019. Correlating neural and symbolic representations of language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2952–2962, Florence, Italy. Association for Computational Linguistics.

Grzegorz Chrupała, Bertrand Higy, and Afra Alishahi. 2020. Analyzing analytical methods: The case of phonology in neural models of spoken language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4146–4156, Online. Association for Computational Linguistics.

Yu-An Chung, Yonatan Belinkov, and James Glass. 2021. Similarity analysis of self-supervised speech representations. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3040–3044. IEEE.

Bernard Comrie. 1988. Linguistic typology. Annual Review of Anthropology, 17:145–159.

Svetlana V. Cook, Nick B. Pandža, Alia K. Lancaster, and Kira Gor. 2016. Fuzzy nonnative phonolexical representations lead to fuzzy form-to-meaning mappings. Frontiers in Psychology, 7:1345.

William Croft. 2002. Typology and Universals. Cambridge University Press.

Emmanuel Dupoux. 2018. Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173:43–59.

Lieke Gelderloos, Grzegorz Chrupała, and Afra Alishahi. 2020. Learning to understand child-directed and adult-directed speech. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1–6, Online. Association for Computational Linguistics.

Jelena Golubovic. 2016. Mutual intelligibility in the Slavic language area. Groningen: Center for Language and Cognition.

Charlotte Gooskens, Vincent J. van Heuven, Jelena Golubović, Anja Schüppert, Femke Swarte, and Stefanie Voigt. 2018. Mutual intelligibility between closely related languages in Europe. International Journal of Multilingualism, 15(2):169–193.

Christiaan Jacobs and Herman Kamper. 2021. Multilingual Transfer of Acoustic Word Embeddings Improves When Training on Languages Related to the Target Zero-Resource Language. In Proc. Interspeech, pages 1549–1553.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.

H. Kamper, W. Wang, and Karen Livescu. 2016. Deep convolutional acoustic word embeddings using word-pair side information. In Proc. ICASSP.

Herman Kamper. 2019. Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models. In Proc. ICASSP.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proc. ICLR.

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR.

Nikolaus Kriegeskorte, Marieke Mur, and Peter A. Bandettini. 2008. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4.

Michael Lepori and R. Thomas McCoy. 2020. Picking BERT's brain: Probing for linguistic dependencies in contextualized embeddings using representational similarity analysis. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3637–3651, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Keith Levin, Katharine Henry, Aren Jansen, and Karen Livescu. 2013. Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings. In Proc. ASRU.

James S. Magnuson, Heejo You, Sahil Luthra, Monica Li, Hosung Nam, Monty Escabi, Kevin Brown, Paul D. Allopenna, Rachel M. Theodore, Nicholas Monto, et al. 2020. Earshot: A minimal neural network model of incremental human speech recognition. Cognitive Science, 44(4):e12823.

Yevgen Matusevych, Herman Kamper, and Sharon Goldwater. 2020a. Analyzing autoencoder-based acoustic word embeddings. In Bridging AI and Cognitive Science Workshop @ ICLR 2020.

Yevgen Matusevych, Herman Kamper, Thomas Schatz, Naomi Feldman, and Sharon Goldwater. 2021. A phonetic model of non-native spoken word processing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1480–1490, Online. Association for Computational Linguistics.

Yevgen Matusevych, Thomas Schatz, Herman Kamper, Naomi Feldman, and Sharon Goldwater. 2020b. Evaluating computational models of infant phonetic learning across languages. In Proc. CogSci.

Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. 2017. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Proc. Interspeech.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proc. NeurIPS.

Okko Räsänen, Tasha Nagamine, and Nima Mesgarani. 2016. Analyzing distributional learning of phonemic categories in unsupervised deep neural networks. In Annual Conference of the Cognitive Science Society (CogSci), pages 1757–1762.

Håkan Ringbom. 2006. Cross-linguistic similarity in foreign language learning. Multilingual Matters.

Odette Scharenborg, Nikki van der Gouw, Martha Larson, and Elena Marchiori. 2019. The representation of speech in deep neural networks. In International Conference on Multimedia Modeling, pages 194–205. Springer.

Thomas Schatz and Naomi H. Feldman. 2018. Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception. In Proceedings of the Conference on Cognitive Computational Neuroscience.

Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised Pre-Training for Speech Recognition. In Proc. Interspeech 2019, pages 3465–3469.

Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe. 2013. GlobalPhone: A multilingual text and speech database in 20 languages. In Proc. ICASSP.

Shane Settle and Karen Livescu. 2016. Discriminative acoustic word embeddings: Recurrent neural network-based approaches. In Proc. IEEE Spoken Language Technology Workshop (SLT).

Roland Sussex and Paul Cubberley. 2006. The Slavic Languages. Cambridge University Press.

Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. ArXiv, abs/1807.03748.

Joe H. Ward. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244.

John Wu, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2020. Similarity analysis of contextual word representation models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4638–4655.

Ludger Zeevaert. 2007. Receptive multilingualism. In J. D. ten Thije & L. Zeevaert (Eds.), 6:103–137.
Appendices
A   Experimental Data Statistics
Table 2 shows word-level summary statistics of our experimental data extracted from the GlobalPhone speech database (GPS).

B   Representational Similarity with Non-Linear CKA

Figure 6 shows the cross-lingual representational similarity matrices (xRSMs) constructed by applying the non-linear CKA measure with a radial basis function (RBF) kernel. We observe trends similar to those presented in Figure 3.
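To make the computation concrete, the following minimal NumPy sketch shows one way to compute linear and RBF CKA from two embedding matrices that encode the same set of spoken-word stimuli. The function names and the median-distance bandwidth heuristic are illustrative choices and may differ from the implementation used for the reported results.

    import numpy as np

    def center_gram(K):
        # Double-center a Gram matrix: H K H with H = I - (1/n) 11^T
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        return H @ K @ H

    def cka_from_grams(K, L):
        # CKA as normalized HSIC between two centered Gram matrices
        Kc, Lc = center_gram(K), center_gram(L)
        return (Kc * Lc).sum() / (np.linalg.norm(Kc) * np.linalg.norm(Lc))

    def linear_cka(X, Y):
        # X: (n, d1), Y: (n, d2) embeddings of the same n stimuli
        return cka_from_grams(X @ X.T, Y @ Y.T)

    def rbf_gram(X, sigma=None):
        # RBF Gram matrix; if sigma is None, fall back to the median
        # pairwise distance as bandwidth (an illustrative heuristic).
        sq = (X ** 2).sum(axis=1)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
        if sigma is None:
            sigma = np.sqrt(np.median(d2[d2 > 0]))
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def rbf_cka(X, Y, sigma=None):
        # Non-linear CKA with an RBF kernel
        return cka_from_grams(rbf_gram(X, sigma), rbf_gram(Y, sigma))

Each cell of an xRSM can then be obtained as, e.g., rbf_cka(X, Y), where X and Y hold two encoders' embeddings of the same spoken-word stimuli.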
   Applying hierarchical clustering to the xRSMs in Figure 6 yields the language clusters shown in Figure 7. We observe that the clusters are identical to those shown in Figure 4, except for the CSE model, where Portuguese is no longer placed in the Slavic group. Nevertheless, Portuguese remains the non-Slavic language most similar to the Slavic languages.
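As a companion illustration, the SciPy sketch below performs this kind of agglomerative clustering. Treating each row of an xRSM as a language's cross-lingual similarity profile and applying Ward linkage (cf. Ward, 1963) is one reasonable choice, not necessarily the exact procedure behind Figures 4 and 7.

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, linkage

    LANGS = ["CZE", "POL", "RUS", "BUL", "POR", "FRA", "DEU"]

    def cluster_languages(xrsm, labels=LANGS):
        # Each row of the xRSM is that language's similarity profile;
        # Ward linkage operates on Euclidean distances between rows.
        Z = linkage(np.asarray(xrsm), method="ward")
        leaf_order = dendrogram(Z, labels=list(labels), no_plot=True)["ivl"]
        return Z, leaf_order

    # Demo with stand-in similarity values (not the reported numbers):
    rng = np.random.default_rng(0)
    toy_xrsm = rng.uniform(0.6, 0.8, size=(7, 7))
    _, order = cluster_languages(toy_xrsm)
    print(order)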
Lang.   Language group   #Train spkrs   #Train samples   #Eval samples   #Phones/word (mean ± SD)   Word dur. (sec) (mean ± SD)   Token-Type ratio
CZE     West Slavic      42             32,063           9,228           6.77 ± 2.28                0.492 ± 0.176                 0.175
POL     West Slavic      42             31,507           9,709           6.77 ± 2.27                0.488 ± 0.175                 0.192
RUS     East Slavic      42             31,892           9,005           7.56 ± 2.77                0.496 ± 0.163                 0.223
BUL     South Slavic     42             31,866           9,063           7.24 ± 2.60                0.510 ± 0.171                 0.179
POR     Romance          42             32,164           9,393           6.95 ± 2.36                0.526 ± 0.190                 0.134
FRA     Romance          42             32,497           9,656           6.24 ± 1.95                0.496 ± 0.163                 0.167
DEU     Germanic         42             32,162           9,865           6.57 ± 2.47                0.435 ± 0.178                 0.150

Table 2: Summary statistics of our experimental data.

[Figure 6: three 7×7 xRSM heatmaps over the languages CZE, POL, RUS, BUL, POR, FRA, DEU, one per model (PGE, CAE, CSE).]

Figure 6: The cross-lingual representational similarity matrix (xRSM) for each model: PGE (Left), CAE (Middle), and CSE (Right), obtained using the non-linear CKA measure. Each row corresponds to the language of the spoken-word stimuli while each column corresponds to the language of the encoder.

[Figure 7: three dendrograms with leaves colored by language group (Slavic, Romance, Germanic); leaf order for PGE and CAE: Czech, Polish, Russian, Bulgarian, Portuguese, French, German; leaf order for CSE: Russian, Polish, Czech, Bulgarian, Portuguese, French, German.]

Figure 7: Hierarchical clustering analysis on the cross-lingual representational similarity matrices (using non-linear CKA) of the three models: PGE (Left), CAE (Middle), and CSE (Right).