Identity-Based Patterns in Deep Convolutional Networks: Generative Adversarial Phonology and Reduplication


Gašper Beguš
University of California, Berkeley
begus@berkeley.edu

Abstract

This paper models unsupervised learning of an identity-based pattern (or copying) in speech called reduplication from raw continuous data with deep convolutional neural networks. We use the ciwGAN architecture (Beguš, 2021a), in which learning of meaningful representations in speech emerges from a requirement that the CNNs generate informative data. We propose a technique to wug-test CNNs trained on speech and, based on four generative tests, argue that the network learns to represent an identity-based pattern in its latent space. By manipulating only two categorical variables in the latent space, we can actively turn an unreduplicated form into a reduplicated form with no other substantial changes to the output in the majority of cases. We also argue that the network extends the identity-based pattern to unobserved data. Exploring how meaningful representations of identity-based patterns emerge in CNNs, and how latent space variables outside the training range correlate with identity-based patterns in the output, has general implications for neural network interpretability.

1   Introduction

The relationship between symbolic representations and connectionism has been the subject of ongoing discussion in computational cognitive science. Phonology offers a unique testing ground in this debate because it is concerned with the first discretization that human language users perform: from continuous phonetic data to discretized mental representations of meaning-distinguishing sounds called phonemes.

Identity-based patterns, repetition, or copying have been at the center of this debate (Marcus et al., 1999). Reduplication is a morphophonological process in which phonological content (phonemes) gets copied from a word (also called the base) with some added meaning (Inkelas and Zoll, 2005; Urbanczyk, 2017). It can be total, which means that all phonemes in a word get copied (e.g. /pula/ → [pula-pula]), or partial, where only a subset of segments gets copied (e.g. /pula/ → [pu-pula]).

Reduplication is unique among processes in natural language because combining learned entities based on training data distributions does not yield the desired outputs. For example, a learner can be presented with pairs of bare and reduplicated words, such as /pala/ ∼ /papala/ and /tala/ ∼ /tatala/. The learner can then be tested on providing a reduplicated variant of a novel unobserved item with an initial sound /k/ that they have not been exposed to (e.g. /kala/). If the learner has learned the reduplication pattern, they will output [kakala]. If the learner has simply learned that /pa/ and /ta/ are optional constituents that can be attached to words based on data distributions, they will output [pakala] or [takala]. Reduplication is thus an identity-based pattern (similar to copying), which is computationally more challenging to learn (Gasser, 1993), both in connectionist (Brugiapaglia et al., 2020) and non-connectionist frameworks (Savitch, 1989; Dolatian and Heinz, 2020). In /kᵢaⱼkᵢaⱼla/, the two sounds in the reduplicative morpheme, /kᵢ/ and /aⱼ/, need to stand in an identity relationship with the first two segments of the base, /kᵢ/ and /aⱼ/, and the learner needs to copy rather than recombine learned elements.
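To make the contrast concrete, here is a toy symbolic-level illustration (not part of any model in this paper) of copying the first CV versus attaching a memorized syllable; the function names and the orthographic forms are purely illustrative:

# Toy illustration: identity-based (copying) analysis vs. a recombination analysis
# of partial CV reduplication at the symbolic level.

def reduplicate_copy(word: str) -> str:
    """Copy the first CV of the word (identity-based pattern): 'kala' -> 'kakala'."""
    return word[:2] + word

def reduplicate_recombine(word: str, memorized=("pa", "ta")) -> str:
    """Attach a memorized syllable instead of copying, as a distribution-only
    learner might: 'kala' -> 'pakala' (or 'takala')."""
    return memorized[0] + word

if __name__ == "__main__":
    for w in ["pala", "tala", "kala"]:
        print(w, "->", reduplicate_copy(w), "vs.", reduplicate_recombine(w))
    # kala -> kakala (copying) vs. pakala (recombination of observed syllables)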
Marcus et al. (1999) argue that connectionist models such as simple neural networks are unable to learn a simple reduplication pattern that 7-month-old human infants are able to learn (see also Gasser 1993). According to Marcus et al. (1999), the behavioral outcomes of their experiments cannot be modeled by simple counting, attention to statistical trends in the input, attention to repetition, or connectionist (simple neural network) computational models. Instead, they argue, the results support the claim that human infants employ "algebraic rules" (Marcus et al., 1999; Marcus, 2001; Berent, 2013) to learn reduplication patterns (for a discussion, see, among others, McClelland and Plaut 1999; Endress et al. 2007).

With the development of new neural network architectures, several studies have revisited the claim that neural networks are unable to learn reduplicative patterns (Alhama and Zuidema, 2018; Prickett et al., 2018; Nelson et al., 2020; Brugiapaglia et al., 2020), arguing that identity-based patterns can indeed be learned with more complex architectures.^1 All of these computational experiments, however, operate on an already discretized level, and most of them model reduplication with supervised learning.

^1 Wilson (2018, 2020) proposes another approach that allows modeling reduplication. For a non-connectionist computational model of reduplication, see Dolatian and Heinz (2018, 2020).

Examples like [pu-pula] and [pula-pula] are discretized representations of reduplication, using characters to represent continuous sounds. Most, if not all, computational models of reduplication, to the author's knowledge, model reduplication as character or feature manipulation: the inputs to the models are either characters representing phones or phonemes, or discrete binary featural representations of phonemes. For example, a seq2seq model treats reduplication as a pairing between an input unreduplicated sequence of "characters" (such as /tala/) and an output reduplicated sequence (such as /tatala/). Already abstracted and discretized phonemes or "characters", however, are not the primary input to language-learning infants. The primary linguistic data of most hearing infants is raw continuous speech. Hearing infant learners need to acquire reduplication from continuous speech data that is substantially more complex than already discretized characters or binary features.

Furthermore, most of the existing models of reduplication are also supervised. Seq2seq models, for example, are fully supervised: the training consists of pairs of unreduplicated (input) and reduplicated (output) strings of characters or binary features. While performance can be tested on unobserved data or even on unobserved segments, the training is nevertheless supervised. Human language learners do not have access to input-output pairings: they are only presented with positive, surface, and continuous acoustic data.

While equivalents of copying/identity-based patterns can be constructed in the visual domain, we are not aware of studies that test identity-based visual patterns with deep convolutional neural networks.

In this paper, we model reduplication, one of the computationally most challenging processes, from raw unlabeled acoustic data with deep convolutional networks in the GAN framework. The advantage of the GAN framework for cognitive modeling is that the network has to learn to output raw acoustic data from a latent noise distribution without directly accessing the training data. We argue that CNNs discretize continuous phonetic data and encode linguistically meaningful units into individual latent variables. The emergence of a discretized representation of an identity-based pattern (reduplication) is induced by a model that forces the Generator network to output informative data (ciwGAN; Section 4). Additionally, we add an inductive bias towards symbolic-like representations by binarizing the code variables with which the Generator encodes meaningful representations. We also test whether a deep convolutional network learns reduplication without the two inductive biases (without the requirement on the Generator to output informative data and without binarization of the latent space) in the bare WaveGAN architecture (Section 5).

The experiments bear on the discussion between symbolic and connectionist approaches to language modeling by testing the emergence of rule-like symbolic representations within the connectionist framework from raw speech in an unsupervised manner. The results suggest that both models, ciwGAN and WaveGAN, learn the identity-based pattern, but that inductive biases for informative representations and binarization facilitate learning and yield better results. We discuss properties of symbolic-like representations and how they emerge in the models: discretization, causality (in the sense that manipulation of individual elements results in the desired outcome), and categoricity (for discussions of these and other aspects of the debate, see Rumelhart et al. 1986; McClelland et al. 1986; Fodor and Pylyshyn 1988; Minsky 1991; Dyer 1991; Marcus et al. 1999; Marcus 2001; Manning 2003; Berent 2013; Maruyama 2021).
How can we test learning of reduplication in a deep convolutional network that is trained only on raw positive data? We propose a technique to test the ability of the Generator to produce forms absent from the training data. For example, we train the networks on acoustic data of items such as /pala/ ∼ /papala/ and /tala/ ∼ /tatala/, but test reduplication on acoustic forms of items such as /sala/, which is never reduplicated in the training data. Using the technique proposed in Beguš (2020), we can identify latent variables that correspond to some phonetic or phonological representation such as reduplication. By manipulating and interpolating a single latent variable, we can actively generate data with and without reduplication. In fact, we can observe a direct relationship between a single latent variable (out of 100) and reduplication, which gradually disappears from the output under interpolation. Additionally, we can identify latent variables that correspond to [s] in the output. By forcing both reduplication and [s] in the output through latent space manipulation, we can "wug-test" the network's learning of reduplication on unobserved data. In other words, we can observe what the network outputs if we force it to output reduplication and an [s] at the same time. A comparison of generated outputs with human outputs that were withheld from training reveals a high degree of similarity. We also perform an additional computational experiment to replicate the results of the first experiment (Section 4); in the replication experiment, evidence for learning of the reduplicative pattern also emerges. To the author's knowledge, this is the first attempt to model reduplication with neural network architectures trained on raw acoustic speech data.

The computational experiments reveal another property of representation learning in deep neural networks: we argue that the network extracts information from the training data and represents a continuous acoustic identity-based pattern with a discretized representation. Out of 100 variables, the network encodes reduplication with one or two variables, as suggested by the fact that a small subset of variables is substantially more strongly correlated with the presence of reduplication than the rest. In other words, there is a near-categorical drop in regression estimates between one variable and the rest of the latent space. Setting the identified variables to values well beyond the training range results in near-categorical presence of the desired property in the output. This technique (proposed for non-identity-based patterns in Beguš 2020) allows us to directly explore how the networks encode dependencies in the data, their underlying values, and interactions between variables, and thus to better understand how exactly deep convolutional networks encode meaningful representations.

Recent developments in zero-resource speech modeling (Dunbar et al., 2017, 2019, 2020) enable modeling of speech processes in an unsupervised manner from raw acoustic data. Several proposals exist for modeling unsupervised lexical learning (Kamper et al., 2014; Lee et al., 2015; Chung et al., 2016), including generative models such as variational autoencoders (Chung et al., 2016; Baevski et al., 2020; Niekerk et al., 2020) and GANs (Beguš, 2021a). This framework allows not only unsupervised lexical term discovery, but also phone-level identification (Eloff et al., 2019; Shain and Elsner, 2019; Chung et al., 2016; Chorowski et al., 2019). While zero-resource speech modeling has yielded promising results in unsupervised labeling, these proposals generally do not model phonological or morphophonological processes. This paper thus also tests the applicability of the unsupervised speech processing framework for cognitive modeling and network interpretability.

2   Model

Generative Adversarial Networks (Goodfellow et al., 2014) are a neural network architecture with two main components: the Generator network and the Discriminator network. The Generator is trained to generate data from a randomly distributed latent space. The Discriminator takes real training data and the Generator's outputs and estimates which inputs are real and which are generated. The minimax training, in which the Generator is trained to maximize the Discriminator's error rate and the Discriminator is trained to minimize its own error rate, results in the Generator outputting data such that the Discriminator's success in distinguishing them from real data is low. It has been shown that GANs not only learn to produce innovative data that resemble speech, but also learn to encode phonetic and phonological representations in the latent space (Beguš, 2020). The major advantage of the GAN architecture for modeling speech is that the Generator network does not have direct access to the training data and is not trained on replicating it (unlike in the autoencoder architecture; Räsänen et al. 2016; Eloff et al. 2019; Shain and Elsner 2019). Instead, the network has to learn to generate data from noise in a completely unsupervised manner, without ever directly accessing the training data.
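As a rough sketch of the minimax setup described above (illustrative only; the models in this paper follow the WaveGAN implementation of Donahue et al. 2019, and all layer sizes, names, and the use of a standard cross-entropy GAN loss below are stand-ins, not the actual implementation):

# Minimal, illustrative GAN training step on toy 1-D "audio"; not the paper's
# WaveGAN/ciwGAN implementation.
import torch
import torch.nn as nn

LATENT_DIM, AUDIO_LEN = 100, 16384   # sizes taken from the paper's description

G = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(), nn.Linear(256, AUDIO_LEN), nn.Tanh())
D = nn.Sequential(nn.Linear(AUDIO_LEN, 256), nn.ReLU(), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(8, AUDIO_LEN) * 2 - 1    # stand-in for real waveforms
z = torch.rand(8, LATENT_DIM) * 2 - 1      # latent variables ~ U(-1, 1)

# Discriminator step: minimize its own error on real vs. generated data.
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(G(z).detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: maximize the Discriminator's error on generated data.
g_loss = bce(D(G(z)), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()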
In the first experiment, we use the ciwGAN (Categorical InfoWaveGAN) model proposed in Beguš (2021a). The ciwGAN model combines the WaveGAN and InfoGAN architectures. WaveGAN, proposed by Donahue et al. (2019), is a Deep Convolutional Generative Adversarial Network (DCGAN; proposed by Radford et al. 2016) adapted for time-series audio data. The basic architecture is the same as in DCGAN, the main difference being that in the WaveGAN proposal, the deep convolutional networks take one-dimensional time-series data as inputs or outputs. The structure of the Generator and the Discriminator networks in the ciwGAN architecture is taken from Donahue et al. (2019). InfoGAN (Chen et al., 2016) is an extension of the GAN architecture that aims to maximize mutual information between the latent space and the generated outputs. The Discriminator/Q-network learns to retrieve the Generator's latent categorical or continuous codes (Chen et al., 2016) in addition to estimating the realness of generated outputs and real training data.

Beguš (2021a) proposes a model that combines these two proposals and introduces a new latent space structure (in the fiwGAN architecture). Because we are primarily interested in a simple binary classification between bare and reduplicated forms, we use the ciwGAN variant of the proposal. The model introduces a separate deep convolutional Q-network that learns to retrieve the Generator's internal representations. Separating the Discriminator and the Q-network into two networks is advantageous from the cognitive modeling perspective: the architecture features a separate network that models speech production (the Generator) and a separate network that models speech categorization (the Q-network). The latter introduces an inductive bias that forces the Generator to output informative data and to encode linguistically meaningful properties into its code variables. The network learns to generate data such that, by manipulating these code variables, we can force the desired linguistic property in the output (Beguš, 2021a).

The architecture involves three networks: a Generator that takes latent codes (a one-hot vector) and uniformly distributed z-variables and generates waveforms, a Discriminator that distinguishes real from generated outputs, and a Q-network that takes generated outputs and estimates the latent code (one-hot vector) used by the Generator. More specifically, the Generator network is a deep convolutional network that takes as its input 100 latent variables (see Figure 1).^2 Two of the 100 variables are code variables (c1 and c2) that constitute a one-hot vector. The remaining 98 z-variables are uniformly distributed on the interval (−1, 1). The Generator learns to take as input the 2 code variables and the 98 latent variables and to output, through five convolutional layers, 16384 samples that constitute just over one second of audio sampled at 16 kHz. The Discriminator network takes real and generated data (again in the form of 16384 samples constituting just over one second of audio) and learns to estimate the Wasserstein distance between generated and real data (following the proposal in Arjovsky et al. 2017) through five convolutional layers. In the majority of InfoGAN proposals, the Discriminator and the Q-network share convolutions; Beguš (2021a) introduces a separate Q-network (also in Rodionov 2018).^3

^2 The number of latent variables was adopted from Radford et al. (2016) and Donahue et al. (2019). Probing how the number of z-variables affects learning of speech representations is left for future work.

^3 For all details about the architecture, see Beguš (2021a).

The Q-network is identical in structure to the Discriminator network, but its final layer is fully connected to nodes that correspond to the number of categorical variables (Beguš, 2021a). In the ciwGAN architecture, the Q-network is trained to estimate the latent code variables with a softmax function (Beguš, 2021a). In other words, the Q-network takes the Generator's outputs (waveforms) and estimates the Generator's latent code variables c1 and c2. The weights of both the Generator network and the Q-network are updated according to the Q-network's loss function: to minimize, using cross-entropy, the distance between the actual one-hot vector (c1 and c2) used by the Generator and the one-hot vector estimated with a softmax in the Q-network's final layer. This forces the Generator to output informative data.
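A minimal sketch of the latent structure and the Q-network objective described above: the shapes (2 one-hot code variables, 98 uniform z-variables, 16384-sample outputs) follow the paper, but the single-layer network bodies and all names are illustrative stand-ins rather than the released implementation.

# Sketch of the ciwGAN latent structure and the Q-network loss described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_CODES, N_Z, AUDIO_LEN = 2, 98, 16384

def sample_latent(batch_size):
    """One-hot code c ([1,0] or [0,1]) plus 98 z-variables ~ U(-1, 1)."""
    code_ids = torch.randint(0, N_CODES, (batch_size,))
    c = F.one_hot(code_ids, N_CODES).float()
    z = torch.rand(batch_size, N_Z) * 2 - 1
    return torch.cat([c, z], dim=1), code_ids     # 100 latent variables total

# Stand-ins for the five-convolutional-layer Generator and Q-network.
G = nn.Sequential(nn.Linear(N_CODES + N_Z, 256), nn.ReLU(), nn.Linear(256, AUDIO_LEN), nn.Tanh())
Q = nn.Sequential(nn.Linear(AUDIO_LEN, 256), nn.ReLU(), nn.Linear(256, N_CODES))

latent, code_ids = sample_latent(batch_size=8)
audio = G(latent)                 # generated "waveforms"
q_logits = Q(audio)               # Q-network's estimate of the code

# Cross-entropy between the actual code and the Q-network's softmax estimate;
# gradients flow into both G and Q, forcing the Generator to output informative data.
q_loss = F.cross_entropy(q_logits, code_ids)
q_loss.backward()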
Figure 1: (left) The ciwGAN architecture as proposed in Beguš (2021a) and used in this paper with training data as described in Section 3. (right) The structure of the Generator in the ciwGAN architecture as proposed in Beguš (2021a) (based on Donahue et al. 2019).

The advantage of the ciwGAN architecture is that the network not only learns to output innovative data that resemble speech in the input, but also provides meaningful representations of the data in an unsupervised manner. For example, as will be argued in Section 4, the ciwGAN network encodes reduplication as a meaningful category: it learns to assign a unique code to bare and to reduplicated items. This encoding emerges in an unsupervised fashion from the requirement that the Generator output data such that unique information is retrievable from its acoustic outputs. Given the structure of the training data, the Generator is most informative if it encodes the presence of reduplication in the code variables.

To replicate the results and to test learning of an identity-based pattern without binarization and without the requirement on the Generator to output informative data, we run an independent experiment on a bare WaveGAN (Donahue et al., 2019) architecture using the same training data. The difference between the two architectures is that the bare GAN architecture does not involve a Q-network, and its latent space only includes latent variables uniformly distributed on the interval (−1, 1).

Beguš (2020) and Beguš (2021a) also propose a technique for latent space interpretability in GANs: manipulating individual variables to values well beyond the training range can reveal the underlying representations of different parts of the latent space. We use this technique throughout the paper to evaluate learning of reduplication.

3   Reduplication in training data

The training data was constructed to test a simple reduplication pattern common in human languages: partial CV reduplication, found in languages such as Paamese, Roviana, and Tawala, among others (Inkelas and Zoll, 2005). Base items are of the shape C1V2C3V4 (C = consonant; V = vowel), e.g. /tala/. Reduplicated forms are of the shape C1V2C1V2C3V4, where the first syllable (C1V2) is repeated. The items were constructed so that C1 is a voiceless stop /p, t, k/, a voiced stop /b, d, g/, the labiodental voiced fricative /v/, or a nasal /m, n/. The vowels V2 and V4 are drawn from /ɑ (ə), i, u/, and C3 from /l, ɹ, j/. All permutations of these elements were created. Stress was always placed on V2 in the base forms and on the same syllable in the reduplicated forms ([ˈpʰɑlə] ∼ [pəˈpʰɑlə]). Because the reader of the training data was a speaker of American English, the training data is phonetically even more complex. The major phonetic effects in the training data include (i) reduction of the vowel in the unstressed reduplicant syllable and in the final syllable (e.g. from [ɑ] to [ʌ/ə]) and (ii) deaspiration of voiceless stops in the unstressed reduplication syllable (e.g. from [pʰ] to [p]). The training data includes two unique repetitions of each item and two repetitions of the corresponding reduplicated forms. Table 1 illustrates the training data.

The training data also includes base forms C1V2C3V4 with the initial consonant C1 being the fricative [s]. These items, however, always appear unreduplicated in the training data; the purpose of the [s]-initial items is to test how the network extends the reduplicative pattern to novel unobserved data. All 27 permutations of sV2C3V4 were included.
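The item inventory described above can be enumerated as follows (a sketch only: the orthographic symbols stand in for the IPA categories in the text, and the actual training data are recorded productions rather than generated strings):

# Sketch: enumerating the base and reduplicated item shapes described in Section 3.
from itertools import product

C1 = ["p", "t", "k", "b", "d", "g", "v", "m", "n"]   # initial consonants
V = ["a", "i", "u"]                                  # vowels for V2 and V4
C3 = ["l", "r", "j"]                                 # medial consonants

bases = ["".join(s) for s in product(C1, V, C3, V)]   # C1 V2 C3 V4, e.g. "tala"
reduplicated = [b[:2] + b for b in bases]             # C1 V2 C1 V2 C3 V4, e.g. "tatala"

# [s]-initial items appear only as bases (never reduplicated) in the training data.
s_bases = ["".join(("s",) + s) for s in product(V, C3, V)]

print(len(bases), len(reduplicated), len(s_bases))    # 243 243 27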
To increase the representation of [s]-initial words, four or five repetitions of each unique [s]-initial base were used in training.^4 Altogether, 132 repetitions of the 27 unique unreduplicated words with an initial [s] were used in training.

^4 The items [ˈsala], [ˈsuru], and [ˈsuju] each miss one repetition (four altogether).

voiceless C1      C1V2C3V4         ˈpʰɑli
                  C1V2C1V2C3V4     pʌˈpʰɑli
voiced C1         C1V2C3V4         ˈbɑli
                  C1V2C1V2C3V4     bʌˈbɑli
C1 = [m, n, v]    C1V2C3V4         ˈmɑli
                  C1V2C1V2C3V4     mʌˈmɑli
C1 = [s]          C1V2C3V4         ˈsɑli
                  C1V2C1V2C3V4     —

Table 1: A schematic illustration of the training data in the International Phonetic Alphabet.

The sibilant fricative [s] was chosen as C1 for testing learning of reduplication because its frication noise is acoustically prominent and sufficiently different from the C1s in the training data, both acoustically and phonologically. This satisfies the requirement that a model learn to generalize to novel segments and feature values (Berent, 2013; Prickett et al., 2018).^5 In phonological terms, the model is tested on a novel feature (sibilant fricative or [±strident]; Hayes 2009): the training data did not contain any bare or reduplicated forms with other sibilant fricatives. To make the learning even more complex, voiceless fricatives ([f, θ, ʃ]) are altogether absent from the training data. All voiced fricatives except for [v] are absent too. Spectral properties of the voiced non-sibilant fricative [v] in the training data (and in Standardized American English in general) are so substantially different from those of the voiceless sibilant fricative [s] that we kept those items in the training data. We excluded all items with the initial sequences /ti/, /tu/, and /ki/ from the training data, because the acoustic properties of these sequences, especially the frication of the aspiration of /t/ and /k/, are similar to those of the frication noise in /s/. Altogether, the 996 unique sliced items used in training were recorded in a sound-attenuated booth by a female speaker of American English with a MixPre 6 (SoundDevices) preamp/recorder and an AKG C544L head-mounted microphone.

^5 For an "across the board" generalization, Berent (2013) requires that generalization occur to segments fully absent from the inventory. It is challenging to elicit reduplication of segments that are fully absent from the training data in the proposed models. Even in human subject experiments testing the "across the board" generalization, subjects need to be exposed to the novel segment at least as a prompt. In our case, the novel segment needs to be part of the training data, but only in unreduplicated forms.

4   CiwGAN (Beguš, 2021a)

The Generator features two latent code variables, c1 and c2, and 98 uniformly distributed variables z (Figure 1). In the training phase, the two code variables (c1 and c2) compose a one-hot vector with two levels: [0, 1] and [1, 0]. This means that the network can encode two categories in its latent space structure that correspond to some meaningful feature of the data. The Q-network forces the Generator to encode information in its latent space. In other words, the loss function of the Q-network forces the Generator to output data such that the Q-network is effective in retrieving the latent codes c1 and c2 from the Generator's acoustic outputs alone. Nothing in the training data pairs base and reduplicated forms: there is no overt connection between the bases and their reduplicated correspondents. Yet the structure of the data is such that, given two categories, the most informative way for the Generator to encode unique information in its acoustic outputs is to associate one unique code with base forms and the other with reduplicated forms. The Generator would thus have a meaningful unique representation of reduplication that arises in an unsupervised manner, exclusively from the requirement on the Generator to output informative data.

To test whether the Generator encodes reduplication in the latent codes, we train the network for 15,920 steps (approximately 5,114 epochs) on the data described in Section 3. The choice of the number of steps is based on two objectives: first, the output data should approximate speech to a degree that allows acoustic analysis; second, the Generator network should not be trained to the degree that it replicates the data completely. As such, overfitting rarely occurs in the GAN architecture (Adlam et al., 2019; Donahue et al., 2019). The best evidence against overfitting in the ciwGAN architecture comes from the fact that the Generator outputs data that violate the training distributions substantially (see Section 4.2 below) (Beguš, 2021a,b). Despite these guidelines, the choice of the number of steps is somewhat arbitrary (for discussion, see Beguš 2020).
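The generative tests reported below fix the code variables (including values well beyond the training range) while sampling the remaining z-variables. A minimal sketch, assuming a trained Generator G with the latent layout from Section 2; the function name and the specific code values are illustrative:

# Sketch of the generative test: fix the code variables (e.g. [1, 0], [0, 1], or
# values beyond the training range such as [0, 5]), sample z, and generate outputs
# that are then annotated for reduplication.
import torch

N_Z = 98

def generate_for_code(G, code, n_outputs=100):
    """Generate n_outputs waveforms with the code variables fixed to `code`."""
    c = torch.tensor([code] * n_outputs, dtype=torch.float32)
    z = torch.rand(n_outputs, N_Z) * 2 - 1            # z-variables ~ U(-1, 1)
    with torch.no_grad():
        return G(torch.cat([c, z], dim=1))            # waveforms to annotate in Praat

# e.g. outputs_bare  = generate_for_code(G, [1, 0])
#      outputs_redup = generate_for_code(G, [0, 5])   # value well beyond the training range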
We generate 100 outputs for each latent code [0, 1] and [1, 0] (200 total) and annotate them for the presence or absence of reduplication. All annotations here and in the other sections are performed by the author in Praat (Boersma and Weenink, 2015). Distinguishing unreduplicated from reduplicated outputs is very salient; for less salient annotations, we provide waveforms and spectrograms (e.g. Figures 4 and 6).^6

^6 The code is available at https://github.com/gbegus/fiwGAN-ciwGAN. The generated data and checkpoints are available at https://doi.org/10.17605/osf.io/zbjcp.

There is a significant correlation between the two levels of the latent code and the presence of reduplication. Counts are given in Table 2. When the code is set to [1, 0], 78% of the generated outputs are base forms; when set to [0, 1], 60% of the outputs are reduplicated (odds ratio = 5.27, p < 0.0001, Fisher Exact Test). When the latent codes are set to [0, 5] and [5, 0], we get a near-categorical distribution of bare and reduplicated forms. For [5, 0], the Generator outputs an unreduplicated bare form in 98% of samples. For [0, 5], it outputs a reduplicated form in 87% of outputs (odds ratio = 308.3, p < 0.0001, Fisher Exact Test). These outcomes suggest that the Generator encodes reduplication in its latent codes and again confirm that manipulating latent variables well beyond the training range reveals the underlying learned representations in deep convolutional networks (as proposed in Beguš 2020; Beguš 2021a).

Code      Bare    Redup.    % Redup.
[1, 0]    78      22        22%
[0, 1]    40      60        60%
[5, 0]    98      2         2%
[0, 5]    13      87        87%

Table 2: Counts of bare and reduplicated (redup.) outputs when the latent codes c1 and c2 are set to [1, 0], [0, 1], [5, 0], and [0, 5].

4.1   Interpolation

That the Generator uses the latent codes to encode reduplication is further suggested by another generative test, performed on interpolated values of the latent code. To test how exactly the relationship between the latent codes (c1 and c2) works, we created sets of generated outputs based on interpolated values of the codes c1 and c2. We manipulate c1 and c2 from the value 1.5 towards 0 in increments of 0.125. For example, we start with [1.5, 0] and first interpolate to [0, 0] (e.g. [1.375, 0], [1.25, 0], etc.). From [0, 0] we further interpolate in increments of 0.125 to [0, 1.5] (e.g. [0, 0.125], [0, 0.25]). All other variables in the latent space are kept constant across all interpolated values. Each such set thus contains 25 generated samples. We generate 100 such sets (2500 outputs altogether) and analyze each output. In 55 of the 100 sets, the output was either bare or reduplicated throughout the interpolated values and did not change. As suggested by Section 4 and Table 2, the number of bare and reduplicated forms for each level rises to near-categorical values as the variables approach values of 5.

In the remaining 45 of the 100 sets, the output changes from the base form to a reduplicated form at some point as the codes are interpolated. If the network had only learned to randomly associate base and reduplicated forms with each endpoint of the latent code, we would expect the base forms to be unrelated to the reduplicated forms. For example, a base form [ˈkʰulu] could turn into a reduplicated [dəˈdɑlə]. An acoustic analysis of the generated sets, however, suggests that the latent code directly corresponds to reduplication. In approximately 25 of the 45 sets of generated outputs that undergo the change from a base to a reduplicated form (55.6%, or 25% of all sets), the base form is identical to the reduplicated form, with the only major difference between the two being the presence of reduplication (waveforms and spectrograms of the 25 outputs are in Figure 6). This proportion would likely be even higher with a higher interpolation resolution (finer than 0.125), and because we do not count cases in which major changes of sounds occur besides the addition of the reduplication syllable (for example, if [ˈnɑɹi] changes to [nʊˈnuɹi], we count the output as unsuccessful). In the remaining 20 sets, the outputs undergo changes in which several segments or their features are kept constant, but the degree to which they differ can vary (e.g. [ˈpʰilə] ∼ [pəˈpʰiɹi], [ˈtʰiju] ∼ [dəˈdɑji], [ˈnɑɹə] ∼ [dəˈdɑɹi], or [ˈpʰiɹə] ∼ [təˈtʰɑli]).
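The interpolation procedure above can be sketched as follows (illustrative only; G and the latent layout are as in the earlier sketches):

# Sketch of the interpolation test: 25 code values from [1.5, 0] through [0, 0]
# to [0, 1.5] in steps of 0.125, with the 98 z-variables held constant per set.
import torch

N_Z = 98

def interpolation_set(G):
    z = (torch.rand(1, N_Z) * 2 - 1).repeat(25, 1)                 # one fixed z per set
    steps = torch.arange(1.5, -0.001, -0.125)                      # 1.5, 1.375, ..., 0 (13 values)
    c1_leg = torch.stack([steps, torch.zeros_like(steps)], 1)      # [1.5, 0] ... [0, 0]
    c2_leg = torch.stack([torch.zeros(12), torch.arange(0.125, 1.501, 0.125)], 1)  # [0, 0.125] ... [0, 1.5]
    codes = torch.cat([c1_leg, c2_leg])                            # 13 + 12 = 25 code settings
    with torch.no_grad():
        return G(torch.cat([codes, z], dim=1))                     # 25 outputs per set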
0.7247                                           0.6697
                                                                                                                         1.00
                                                                                                                                                                                                                                                                                                                                    ●
     0                              [0, 1.375]        0                       [0, 0.875]
                                                                                                                                                                                                                                                                                                                                 ●

                                                                                            Lasso regression estimates
-0.9253                                          -0.8647                                                                                                                                                                                                                                                                        ●
Figure 2: Waveforms showing how interpolation of latent codes c1 and c2 has a direct effect on presence of reduplication: as the values are interpolated from [1.5, 0] to [0, 1.5], the reduplication gradually appears/disappears from the output. Waveforms on the left represent reduplication of ["ph iôu] to [p@"ph iôu]; waveforms on the right represent reduplication of ["dAji] to [d@"dAji].

Figure 3: Absolute Lasso regression estimates (sorted from highest on the right-hand side) for a ciwGAN model identifying presence of [s] after 1000 transcribed outputs, 500 for each latent code (with the same latent variable structure of the remaining 98 variables across the two conditions). Variable z90 is identified as the variable corresponding to presence of [s] (the variable with the highest regression estimate).
disregarding changes in the base, the probability of both forms being identical would still be only 0.2 (for each of the five subgroups). In both cases, the ratio of identical base-reduplication pairs, while not categorical, is highly significant (CI = [0.4, 0.7], p < 0.0001 for both cases according to an exact binomial test).
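For concreteness, the exact binomial test against the 0.2 chance level can be run as in the following minimal Python sketch; the counts below are hypothetical placeholders for illustration, not the counts underlying the values reported above.

from scipy.stats import binomtest

# Hypothetical counts for illustration only: k identical base-reduplication
# pairs out of n annotated outputs, tested against the 0.2 chance level.
k, n = 27, 50
result = binomtest(k, n, p=0.2)
print(result.pvalue)                                   # p-value against chance
print(result.proportion_ci(confidence_level=0.95))     # CI for the observed proportion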
Figure 2 illustrates how, keeping the latent space constant except for the manipulation of the latent code with which the Generator represents reduplication, the generated outputs gradually transition from the base forms ["ph iôu] and ["dAji] to the reduplicated forms [p@"ph iôu] and [d@"dAji].7 Other major properties of the output are unchanged.

7 The exact vowel quality estimation in the generated outputs is challenging, especially in the short vocalic elements of reduced vowels in the reduplicative syllables. For this reason, we default transcriptions to [@].

This interpolative generative test again suggests that the network learns reduplication and encodes the process in the latent codes. By interpolating the codes, we can actively force reduplication in the output with no other substantial changes in the majority of cases.
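A minimal sketch of this interpolation test is given below. It assumes a hypothetical function generate(latent) that maps a 100-dimensional latent vector (the two latent-code dimensions c1 and c2 followed by the 98 z-variables) to a single waveform; the actual ciwGAN Generator interface differs, and the sketch only illustrates the manipulation.

import numpy as np

def interpolate_codes(generate, z, start=(1.5, 0.0), end=(0.0, 1.5), steps=9):
    # Interpolate the latent code from `start` to `end` while holding the
    # 98 z-variables fixed, and collect one generated waveform per step.
    outputs = []
    for a in np.linspace(0.0, 1.0, steps):
        code = (1 - a) * np.asarray(start) + a * np.asarray(end)
        latent = np.concatenate([code, z])     # [c1, c2, z1, ..., z98]
        outputs.append(generate(latent))       # waveform for manual annotation
    return outputs

# Example: z = np.random.uniform(-1, 1, 98); waveforms = interpolate_codes(G, z)

Holding the z-variables fixed isolates the effect of the latent code, which is what the waveforms in Figure 2 visualize.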
4.2   Reduplication of unobserved data

To test whether the ciwGAN network learns to generalize the reduplicative pattern to unobserved data, we use latent space manipulation to force reduplication at the same time as presence of [s] in the output. Items with [s] as the initial consonant (e.g. ["siju]) appear only in bare forms in the training data. In Sections 4 and 4.1, we established that the network uses the latent code (c1 and c2) to represent reduplication. Following Beguš (2020) and Beguš (2021a), we can force any phonetic property in the output by manipulating the latent variables well beyond the training range. Reduplication is forced by setting the latent code to values higher than [0, 1]. We can simultaneously force [s] in the output to test the network's performance on reduplication in unseen data.

To identify latent variables with which the Generator encodes the sound [s] in the output, we generate 1000 samples with randomly sampled latent variables, but with the latent code variables (c1 and c2) set at [0, 1] and [1, 0] (500 samples each, with the same latent variable structure of the remaining 98 variables across the two conditions). We annotate the outputs for presence of [s] in the two sets and fit the data to a Lasso logistic regression model in the glmnet package (Simon et al., 2011). Presence of [s] is the dependent variable, coded as a success; the independent variables are the 98 latent variables uniformly distributed on the interval (−1, 1) (for the technique, see Beguš 2020). Lambda is computed with 10-fold cross-validation. Estimates of the Lasso regression model (Figure 3) suggest that z90, the variable with the highest regression estimate, is one of the variables with which the Generator encodes presence of [s] in the output. For a generative test providing evidence that Lasso regression estimates correlate with the network's internal representations, see Beguš (2020).
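This variable-identification step can be sketched in Python as follows, substituting scikit-learn's cross-validated L1-penalized (Lasso) logistic regression for the glmnet model used here; Z is assumed to hold the sampled z-variables for the 1000 generated outputs (one row per output) and has_s the manual annotation of [s].

import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def rank_latent_variables(Z, has_s):
    # Fit an L1-penalized logistic regression with 10-fold cross-validation
    # and rank the z-variables by the absolute size of their estimates.
    model = LogisticRegressionCV(
        Cs=20, cv=10, penalty="l1", solver="saga", max_iter=5000
    ).fit(Z, has_s)
    estimates = np.abs(model.coef_.ravel())
    order = np.argsort(estimates)[::-1]                 # largest estimate first
    return [("z" + str(i + 1), float(estimates[i])) for i in order]

# The top-ranked variable plays the role that z90 plays in Figure 3.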
We can thus set z90 to marginal levels well beyond the training range and the latent code (c1, c2) to levels well beyond [0, 1] in order to force
reduplication and [s] in the output simultaneously. For example, when the latent code is set to [0, 3] (which forces reduplication in the output) and z90 to 4 (forcing [s] in the output), the network outputs a reduplicated [s@"siji] (among other outputs), even though items containing an [s] are never reduplicated in the training data. When the code is set to an even higher value, [0, 7.25], and z90 to 7, the network outputs [s@"siru] in a different output. The spectrograms in Figure 4 show a clear period of frication noise characteristic of a sibilant fricative [s], interrupted by a reduplicative vowel and followed by a repeated period of frication noise characteristic of [s].

In fact, at the values [0, 7.25] and z90 = 7, the network generates approximately 33 (out of 100 tested, or 33%) outputs that can be reliably analyzed as reduplicated forms with initial sV- reduplication unseen in the training data. The other 67 outputs are reduplicated forms containing other C1s or unreduplicated [s]-forms. No outputs were observed in which C1 of the reduplicative syllable and C1 of the base would be substantially different. While all the cases in which z90 is manipulated involve a front vowel [i] in the base item, we can also elicit reduplication for other vowels. For example, we identify variable z4 as corresponding to an [s] and a low vowel [A] in the output (with the same technique as described for z90 above, but with presence of [sA] as the dependent variable in the Lasso regression model). By manipulating z4 to 9.5 (forcing [sA] in the output) and setting the latent codes to [0, 7.5], we get [s@"sAôu] in the output (Figure 4).
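The forcing manipulation itself can be sketched as follows, again assuming the hypothetical generate(latent) interface introduced above and a 0-based index 89 standing in for z90; the actual ordering of the variables depends on the trained model.

import numpy as np

def force_reduplicated_s(generate, n=100, code=(0.0, 3.0), z90_value=4.0, seed=0):
    # Sample the z-variables uniformly on (-1, 1), then overwrite the latent
    # code and z90 with marginal values (beyond the training range) to force
    # reduplication and [s] in the output simultaneously.
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(n):
        z = rng.uniform(-1.0, 1.0, 98)
        z[89] = z90_value                               # force [s]
        latent = np.concatenate([np.asarray(code), z])  # [0, 3] forces reduplication
        outputs.append(generate(latent))                # annotate for sV- reduplication
    return outputs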
For comparison, the same L1 speaker of English who read the words in the training data read the reduplicated [s@"siji] and [s@"sAôu], which were not included in the training data. Figure 4 parallels the generated reduplicated forms based on unobserved data (which were elicited by forcing [s] and reduplication in the output) and the recordings of the same reduplicated forms read by a human speaker. The spectrograms show clear acoustic parallels between the generated outputs and the recordings read by a human speaker (who read the words prior to the computational experiments and did not hear or analyze the generated outputs).

5   Replication: Bare WaveGAN (Donahue et al., 2019)

To test whether the learning of reduplicative patterns in GANs is a robust or idiosyncratic property of the model presented in Section 4, we conduct a replication experiment. We introduce two crucial differences in the replication experiment: we train the Generator without the requirement to produce informative data and without binary latent codes. We use the model in Donahue et al. (2019), which features a “bare” GAN architecture for audio data: only the Generator and Discriminator networks, without the Q-network. This architecture has the potential to inform us how GANs represent reduplicative patterns without an explicit requirement to learn informative data, i.e. without an explicit requirement to encode some salient feature of the training data in the latent space. The data used for training are the same as in the experiment in Section 3. We train the network for 15,930 steps or approximately 5,118 epochs, which is almost identical to the number of steps/epochs in the ciwGAN experiment (Section 4).

5.1   Identifying variables

Testing the learning of reduplication in the bare GAN architecture requires that we force reduplication and presence of [s] in the output simultaneously. To identify which latent variables correspond to the two properties, we use the same technique as described in Section 4. We generate and annotate 500 outputs of the Generator network with randomly sampled latent variables. We annotate the presence of [s] and the presence of reduplication. The annotations are fit to a Lasso logistic regression (as in Section 4.2): presence of reduplication or of [s] is the dependent variable, and each of the 100 latent z-variables is an independent predictor. Lambda values were computed with 10-fold cross-validation. Regression estimates are given in Figure 5.

The plots illustrate a steep drop in regression estimates between the few latent variables with the highest estimates and the rest of the latent space. In fact, in both models, one or two variables per model emerge with substantially higher regression estimates: z91 and z5 when the dependent variable is PRESENCE OF REDUPLICATION and z17 when the dependent variable is PRESENCE OF [s] in the output. We can assume the Generator network uses these two variables to encode presence
Figure 4: Waveforms and spectrograms (0–8000 Hz) of reduplicated forms containing an [s] which were absent
from the training data. The generated forms on the left are paired with recordings of a female speaker reading
reduplicated forms that were absent from the training data. (left) When the latent code is set to [0, 3] and z90 to
4, the network outputs a reduplicated [s@"siji]. (right) When the latent code is set to [0, 7.5] and z4 to 9.5, we get
[s@"sAôu]. (bottom) In the bare GAN architecture, when z5 (forcing reduplication) is set to −9.25 and z17 (forcing
[s] in the output) to −9.0, the Generator outputs a reduplicated [s@"siôi].

of reduplication and [s], respectively.

It has been argued in Beguš (2020) that GANs learn to encode phonetic and phonological representations with a subset of latent variables. The discretized representation of continuous phonetic properties in the latent space appears even more radical in the present case. For example, in Beguš (2020), presence of [s] as a sound in the output is represented by at least seven latent variables, each of which likely controls different spectral properties of the frication noise. In the present experiment, the Generator appears to learn to encode presence of [s] with a single latent variable, as is suggested by a steep drop in regression estimates after the first variables with the highest estimates. For a generative test showing that regression estimates correlate with actual rates of a given property in generated data, see Beguš (2020). Such a near-categorical cutoff is likely a consequence of the training data in the present case being considerably less variable compared to TIMIT (used for training in Beguš 2020). The network also represents an identity-based process, reduplication, with only two latent variables and features a substantial drop in regression estimates after these two variables. This discretized representation thus emerges even without the requirement that the Generator output informative data.

In the replication experiment, too, the Generator network outputs reduplicated forms for unobserved data when both reduplication and [s] are forced in the output via latent space manipulation, but significantly less so than in the ciwGAN architecture. When z91 (forcing reduplication) and z17 (forcing [s] in the output) are both set to −8.5, a more extreme value than those used for the generated samples in the ciwGAN architecture (7 and 7.25), the network outputs only one reduplicated form with [s]-reduplication out of 100 generated outputs. By comparison, the proportion of [s]-reduplication in the ciwGAN architecture is 33/100, a significantly higher ratio (OR = 48.1, p < 0.0001; Fisher exact test). When z5 (forcing reduplication) is set to −9.25 and z17 (forcing [s] in the output) to −9.0, the proportion of reduplicated [s]-items is slightly higher (4/100), but still significantly lower than in the ciwGAN architecture (OR = 11.7, p < 0.0001; Fisher exact test). Despite these lower proportions of reduplicated [s] in the output, the bare GAN network nevertheless extends reduplication to novel unobserved data. Figure 4 illustrates an example of a reduplicated [s]-item from the Generator network trained in the bare GAN architecture: [s@"siôi]. The spectrogram reveals a clear period of frication noise characteristic of an [s], followed by a reduplicative vowel period, followed by another period of frication.
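The comparison between the two architectures can be reproduced from the reported counts with a Fisher exact test, as in the sketch below; note that scipy returns the sample odds ratio, while the values in the text are likely conditional maximum-likelihood estimates (as reported by R's fisher.test), so the figures can differ slightly.

from scipy.stats import fisher_exact

ciwgan = [33, 67]                 # [s]-reduplicated vs. other outputs (ciwGAN)
for bare in ([1, 99], [4, 96]):   # bare GAN counts at the two value settings above
    odds_ratio, p_value = fisher_exact([ciwgan, bare])
    print(odds_ratio, p_value)    # sample ORs of roughly 48.8 and 11.8; p < 0.0001 in both cases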
Figure 5: Absolute Lasso regression estimates (sorted from highest on the right-hand side) for two models identi-
fying (a) presence of reduplication and (b) presence of [s] in the generated outputs of the bare GAN model (Section
5).

6   Discussion

We perform four generative tests to model learning of reduplication in deep convolutional networks: (i) a test of the proportion of outputs when latent codes are manipulated to marginal values, (ii) a test of interpolating latent variables, (iii) a test of reduplication on unobserved data in the ciwGAN architecture, and (iv) a replication test of reduplication on unobserved data in the bare WaveGAN architecture. All four tests suggest that deep convolutional networks learn a simple identity-based pattern in speech called reduplication, i.e. a process that copies some phonological material to express new meaning. The ciwGAN network learns to encode a meaningful representation, presence of reduplication, into its latent codes. There is a near one-to-one correspondence between the two latent codes c1 and c2 and reduplication. By interpolating latent codes, we cause the bare form to gradually turn into a reduplicated form with no other major changes in the output in the majority of cases. These results are close to what would be considered the appearance of symbolic computation or algebraic rules.

Additional evidence that an approximation of symbolic computation emerges comes from the bare GAN experiment: there is a substantial drop in regression estimates after the first one or two latent variables with the highest regression estimates, suggesting that even without the requirement to produce informative data, the network discretizes the continuous and highly variable phonetic feature (presence of reduplication) and uses a small subset of the latent space to represent this morphophonological property.

Finally, we can force the Generator to output reduplication at nearly categorical levels. When latent codes are set to marginal levels outside of the training range (e.g. to [5, 0] or [0, 5]), the outputs are almost categorically unreduplicated or reduplicated (at 98% for [5, 0]). Beguš (2021a) shows that even higher values (e.g. 15) result in performance at 100% for a subset of variables. Not all aspects of the models in this paper are categorical (e.g. interpolation of latent codes does not always change an unreduplicated form into a reduplicated form without other major changes). Improving performance on this particular task is left for future work. The inability to derive categorical processes has long been an argument against connectionist approaches to language modeling. The results of these experiments add to the work suggesting that manipulating variables to extreme marginal values results in near-categorical or categorical outputs (depending on the value) of a desired property (Beguš, 2020; Beguš, 2021a).

In sum, three properties of rule-like symbolic representations emerge in the deep convolutional networks tested here: discretized representations, the ability to generate a desired property by manipulating a small number of variables, and near categoricity for a subset of representations. These symbolic-like outcomes are facilitated by two inductive biases: the binary nature of the latent codes and the requirement on the Generator to output informative data (forced by the Q-network). At least a subset of these properties also emerges in the bare WaveGAN architecture that lacks these biases, but with reduced performance.

Encoding an identity-based pattern as a meaningful representation in the latent space emerges in a completely unsupervised manner in the ciwGAN architecture, driven only by the requirement that the Generator output informative data.
plicated and unreduplicated forms are never paired         One of the advantages of probing learning in
in the training data. The network is fed bare and       deep convolutional neural networks on speech
reduplicated forms randomly. This unsupervised          data trained with GANs is that the innovative out-
training approximates conditions in language ac-        puts violate training data in structured and highly
quisition (for hearing learners): the human lan-        informative ways. The innovative outputs with
guage learner needs to represent reduplication and      reduplication of [s]-initial forms such as [s@siju]
to pair bare and reduplicated forms from raw unla-      can be directly paralleled to acoustic outputs read
beled acoustic data. The ciwGAN learns to group         by L1 speaker of American English that were ab-
reduplicated and unreduplicated forms and assign        sent from the training data. Acoustic analysis
a unique representation to the process of redupli-      shows a high degree of similarity between the gen-
cation. In fact, the one-hot vector (c1 and c2 ) that   erated reduplicated forms and human recordings,
the Generator learns to associate with reduplica-       meaning that the network learns to output novel
tion in training can be modeled as a representation     data that are linguistically interpretable and resem-
of the unique meaning/function that reduplication       ble human speech processes even though they are
adds, in line with an approach to represent unique      absent from the training data. Thus, the results
semantics with one-hot vectors (e.g. in Steinert-       of the experiments have implications for cognitive
Threlkeld and Szymanik 2020).                           models of speech acquisition. It appears that one
The paper also argues that deep convolutional networks can learn a simple identity-based pattern (copying) from raw continuous data and extend the pattern to novel unobserved data. While the network was not trained on reduplicated items that start with an [s], we were able to elicit reduplication in the output following a technique proposed in Beguš (2020). First, we identify variables that correspond to some phonetic/phonological representation, such as the presence of [s]. We argue that setting single variables well above the training range can reveal the underlying value of each latent variable and force the desired property in the output. We can thus force both [s] and reduplication in the output simultaneously. For example, the network outputs [səsiju] if we force both reduplication and [s] in the output; however, it never sees [səsiju] in the training data, only [siju] and other reduplicated forms, none of which included an [s].
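The same probing strategy extends to single continuous latent variables. The sketch below, again with assumed variable indices and an assumed uniform training interval of [-1, 1], sets one variable to a value well outside that interval to force the presence of [s] while the one-hot code is simultaneously set to the reduplication class; it illustrates the procedure rather than reproducing the original implementation.

```python
import numpy as np

# Hypothetical indices and values; the actual variable identified as encoding
# [s] and the exact extrapolated value differ in the paper's experiments.
CODE_DIM = 2               # one-hot code (c1, c2)
Z_DIM = 98                 # continuous latent dimensions
REDUPLICATION_CLASS = 1    # code index associated with reduplication
S_VARIABLE = 7             # index of a latent variable found to correlate
                           # with the presence of [s]
EXTRAPOLATED_VALUE = 15.0  # far outside the training interval of [-1, 1]

rng = np.random.default_rng(seed=1)
latent = np.concatenate([
    np.zeros(CODE_DIM, dtype=np.float32),
    rng.uniform(-1.0, 1.0, size=Z_DIM).astype(np.float32),
])
latent[REDUPLICATION_CLASS] = 1.0                    # force reduplication via the code
latent[CODE_DIM + S_VARIABLE] = EXTRAPOLATED_VALUE   # force [s] via extrapolation

# Feeding `latent` to a trained Generator (not shown) is the step that, in the
# experiments summarized above, yields outputs such as [səsiju] even though no
# [s]-initial reduplicated form occurs in the training data.
print(latent[:CODE_DIM], latent[CODE_DIM + S_VARIABLE])
```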
Thus, these experiments again confirm that the network uses individual latent variables to represent linguistically meaningful representations (Beguš, 2020; Beguš, 2021a). Setting these individual variables to values well above the training interval reveals their underlying values. By manipulating these individual variables, we can explore how the representations are learned as well as how interactions between different variables work (for example, between the representation of reduplication and the presence of [s]). The results of this study suggest that the deep convolutional network is capable of encoding not only different phonetic properties in individual latent variables, but also processes as abstract as copying or reduplication.

One of the advantages of probing learning in deep convolutional neural networks trained on speech data with GANs is that the innovative outputs violate the training data in structured and highly informative ways. The innovative outputs with reduplication of [s]-initial forms such as [səsiju] can be directly paralleled to acoustic outputs read by an L1 speaker of American English that were absent from the training data. Acoustic analysis shows a high degree of similarity between the generated reduplicated forms and the human recordings, meaning that the network learns to output novel data that are linguistically interpretable and resemble human speech processes even though they are absent from the training data. Thus, the results of the experiments have implications for cognitive models of speech acquisition. It appears that one of the processes that has long been held as a hallmark of symbolic computation in language, reduplication, can emerge in deep convolutional networks without language-specific components in the model, even when they are trained on raw acoustic inputs.

The present paper tests a simple partial reduplicative pattern in which only CV is copied and appears before the base item. This is perhaps computationally the simplest reduplicative pattern. The training data are also highly controlled and recorded by a single speaker. We can use well-understood identity-based patterns in speech with various degrees of complexity (longer reduplication, embedding into non-reduplicative patterns) to further test how inductive biases and hyperparameter/architecture choices interact with learning in deep convolutional networks. Finally, learning biases in the ciwGAN model can be (superficially) compared to learning biases in human subjects in future work. This paper suggests that the Generator provides informative outputs even if trained on comparatively small data sets (for a similar conclusion about other processes, see Beguš 2021b). This means we can use the same training data to probe learning in CNNs and in human artificial grammar learning experiments (for a methodology, see Beguš 2021b). While these comparisons are necessarily superficial at this point, they can provide insights into common learning biases between human learners and computational models.
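The paper's own acoustic analysis of generated versus recorded forms relies on phonetic measurement (Praat is cited in the references below). Purely as an illustration of how a generated output could be compared against a human recording programmatically, the sketch below uses MFCC features and dynamic time warping via librosa; this is a stand-in technique, not the paper's analysis, and the two signals are synthetic placeholders so the example runs on its own.

```python
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
# Placeholder signals standing in for a generated [səsiju] and a human
# recording; in practice these would be loaded from audio files.
generated = (0.1 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)
recorded = (0.1 * np.sin(2 * np.pi * 225 * t)).astype(np.float32)

# Generic similarity measure: MFCC feature matrices aligned with DTW.
mfcc_gen = librosa.feature.mfcc(y=generated, sr=sr, n_mfcc=13)
mfcc_rec = librosa.feature.mfcc(y=recorded, sr=sr, n_mfcc=13)
D, wp = librosa.sequence.dtw(X=mfcc_gen, Y=mfcc_rec, metric="euclidean")

# Lower normalized alignment cost indicates more similar acoustic trajectories.
print("Normalized alignment cost:", D[-1, -1] / len(wp))
```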
Acknowledgements

This work was supported by a grant to new faculty at the University of Washington. I would like to thank Ella Deaton for recording and preparing stimuli as well as anonymous reviewers and the Action Editor for useful comments on earlier versions of this paper.

References

Ben Adlam, Charles Weill, and Amol Kapoor. 2019. Investigating under and overfitting in Wasserstein Generative Adversarial Networks. In ICML Understanding and Improving Generalization in Deep Learning Workshop (2019). arXiv 1910.14137v1.

Raquel G. Alhama and Willem H. Zuidema. 2018. Pre-wiring and pre-training: What does a neural network need to learn truly general identity rules? Journal of Artificial Intelligence Research, 61:927–946.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia. PMLR.

Alexei Baevski, Steffen Schneider, and Michael Auli. 2020. vq-wav2vec: Self-supervised learning of discrete speech representations. In International Conference on Learning Representations, pages 1–12.

Gašper Beguš. 2020. Generative adversarial phonology: Modeling unsupervised phonetic and phonological learning with neural networks. Frontiers in Artificial Intelligence, 3:44.

Gašper Beguš. 2021a. CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with Generative Adversarial Networks. Neural Networks, 139:305–325.

Gašper Beguš. 2021b. Local and non-local dependency learning and emergence of rule-like representations in speech data by Deep Convolutional Generative Adversarial Networks. Computer Speech & Language, page 101244.

Iris Berent. 2013. The phonological mind. Trends in Cognitive Sciences, 17(7):319–327.

Paul Boersma and David Weenink. 2015. Praat: Doing phonetics by computer [computer program]. Version 5.4.06. Retrieved 21 February 2015 from http://www.praat.org/.

Simone Brugiapaglia, Matthew Liu, and Paul Tupper. 2020. Generalizing outside the training set: When can neural networks learn identity effects?

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing Generative Adversarial Nets. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2172–2180. Curran Associates, Inc.

Jan Chorowski, Ron J. Weiss, Samy Bengio, and Aäron van den Oord. 2019. Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):2041–2053.

Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee. 2016. Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. In Interspeech 2016, pages 765–769.

Hossep Dolatian and Jeffrey Heinz. 2018. Modeling reduplication with 2-way finite-state transducers. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 66–77, Brussels, Belgium. Association for Computational Linguistics.

Hossep Dolatian and Jeffrey Heinz. 2020. Computing and classifying reduplication with 2-way finite-state transducers. Journal of Language Modelling, 8(1):179–250.

Chris Donahue, Julian J. McAuley, and Miller S. Puckette. 2019. Adversarial audio synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.