Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

Wei-Ning Hsu1*, David Harwath2*, Tyler Miller2*, Christopher Song3, James Glass1
1 Massachusetts Institute of Technology   2 University of Texas at Austin   3 Johns Hopkins University
wnhsu@csail.mit.edu, harwath@utexas.edu

Abstract

In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a novel corpus of spoken audio captions collected for the popular MSCOCO dataset, demonstrating that our generated captions also capture diverse visual semantics of the images they describe. We investigate several different intermediate speech representations, and empirically find that the representation must satisfy several important properties to serve as a drop-in replacement for text.

1 Introduction

Although there are over 7,000 languages spoken worldwide (Lewis et al., 2016), only several dozen have enough data available to support supervised speech recognition, and many languages do not even employ a writing system (Adda et al., 2016). In contrast, most people learn to use spoken language long before they learn to read and write, suggesting that linguistic annotation is not a prerequisite for speech processing systems. This line of reasoning motivates research that aims to discover meaningful linguistic abstractions (phones, words, etc.) directly from the speech signal, with the intention that they could reduce the reliance of spoken language systems on text transcripts.

A rich body of work has recently emerged investigating representation learning for speech using visual grounding objectives (Synnaeve et al., 2014; Harwath and Glass, 2015; Harwath et al., 2016; Kamper et al., 2017; Havard et al., 2019a; Merkx et al., 2019; Chrupała et al., 2017; Alishahi et al., 2017; Scharenborg et al., 2018; Hsu and Glass, 2018a; Kamper et al., 2018; Surís et al., 2019; Ilharco et al., 2019; Eloff et al., 2019), as well as how word-like and subword-like linguistic units can be made to emerge within these models (Harwath and Glass, 2017; Harwath et al., 2019; Drexler and Glass, 2017; Alishahi et al., 2017; Harwath and Glass, 2019; Havard et al., 2019b; Harwath et al., 2020). So far, these efforts have predominantly focused on inference, where the goal is to learn a mapping from speech waveforms to a semantic embedding space. Generation of speech conditioned on a point in a semantic space has been less explored, and is what we focus on in this work. We hypothesize that generative approaches offer interesting advantages over relying solely on inference. For example, prior works have demonstrated the capability of recognizing visually descriptive words, but have not been shown to learn non-visual words or grammar. Our experiments show that these aspects of spoken language are learned to some degree by a visually-grounded generative model of speech.

Specifically, we introduce a model capable of directly generating fluent spoken audio captions of images without the need for natural language text, either as an intermediate representation or a form of supervision during training (Figure 1). Tremendous progress has been made recently in natural language image caption generation (Kiros et al., 2014; Mao et al., 2015; Vinyals et al., 2015; Karpathy and Fei-Fei, 2015; Xu et al., 2015; Rennie et al., 2017; Dai and Lin, 2017; Lu et al., 2017; Anderson et al., 2018; Lu et al., 2018) and naturalistic text-to-speech synthesis (TTS) (Ping et al., 2017; Taigman et al., 2017; Wang et al., 2017; Shen et al., 2018; Oord et al., 2016).

Figure 1: Spoken image captions generated from the proposed model, with diversity in both linguistic content and acoustic properties, controlled through the I2U and the U2S models, respectively. Left: different unit sequences, same speaker ("a person in a blue jacket is on a snowboard on a snow covered slope" / "a snowboarder is snowboarding on the side of the mountain"). Right: same unit sequence, different speakers ("a snowboarder is snowboarding on the side of the mountain"). Transcriptions are provided only for illustration. Audio samples are available at https://wnhsu.github.io/image-to-speech-demo.

Combining these models provides a means for generating spoken image descriptions, but existing approaches for training these models are reliant on text during training. Instead, we leverage sub-word speech units discovered using a self-supervised learning objective as a drop-in replacement for the text. We hypothesize that by using such techniques, an even wider variety of traditionally text-based NLP models could be applied to speech data without the need for transcription or automatic speech recognition (ASR) systems. Because all human languages utilize small, discrete phonetic inventories (International Phonetic Association, 1999), we posit that our framework should be applicable to any language in the world. In our experiments, we demonstrate that not just any set of discovered speech units can function in this role. We find the greatest success with units that are discrete, exhibit a low frame rate, and are highly robust to speaker and environmental variability. The main contributions of our paper are as follows:

1. The first methodology for fluent image-to-speech synthesis that does not rely on text. A critical aspect of our approach is factorizing the model into an Image-to-Unit (I2U) module and a Unit-to-Speech (U2S) module, where the speech units are discovered in a self-supervised fashion. This approach enables disentanglement of linguistic variability and acoustic/speaker variability.

2. Extensive analysis of the properties required for learned units to replace text. While the idea may seem simple and straightforward, obtaining proper units is not a trivial task. In fact, most of the units experimented with in this paper fail to serve as drop-in replacements. Moreover, we demonstrate that which units are deemed good differs significantly between inference and generation.

3. Demonstrating the insufficiency of beam search-based evaluation. We show that even when an I2U model fails to generate a sensible caption through beam search decoding, it can still produce reasonable captions by sampling from the posterior, hinting that posterior mode-based evaluation can only inspect limited aspects of a model.

4. Proposing a semantic diversity-aware metric. We identify issues with an existing metric (Vijayakumar et al., 2018) and propose M-SPICE for sampling-based evaluation to address the problems.

5. Over 600,000 spoken audio captions for the MSCOCO dataset. We collect 742 hours of speech from 2,352 people tasked with reading each caption out loud. This dataset will be made publicly available to support work at the intersection of speech, language, and vision.

2 Related Work

Image-to-Text and Image-to-Speech Captioning. Significant progress towards generating realistic (text) captions that describe the content of visual images was made with the advent of deep neural networks (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015; Xu et al., 2015; Anderson et al., 2018). Far less work has focused on generating spoken audio captions from natural images. Training an image-to-speech system using separate (image, text) and (text, speech) datasets was explored in (Ma et al., 2019). Hasegawa-Johnson et al. (2017) is the only prior work that has explored image-to-speech synthesis without using text, but with limited results. In that work, BLEU scores were only computed in terms of unsupervised acoustic units, not an estimate of the actual words produced by the synthesizer, which can be problematic as discussed in Section 4. The resulting captions were not evaluated for fluency, naturalness, or intelligibility, and the BLEU scores in terms of the unsupervised units were very low (0.014 on the MSCOCO test set) compared to ours (0.274).

Wang et al. (2020b) is a concurrent work that proposes a text-free end-to-end image-to-speech model, which simplifies the task by using pairs of images and synthesized speech generated from a single-speaker TTS model to reduce the acoustic variation. In contrast, by leveraging robust learned units, our I2U module can be trained on real speech with abundant variation, and the U2S module serves as a vocoder that requires only a small amount of clean speech (transcripts not needed). Hence, our system imposes fewer data constraints yet still outperforms Wang et al. (2020b).

Voice Conversion without Text aims to convert the speaker identity in a recording while preserving the textual content (Abe et al., 1990; Stylianou et al., 1998; Toda et al., 2007). It has recently seen progress using neural approaches (Hsu et al., 2016, 2017a,b; Fang et al., 2018; Chorowski et al., 2018; Chou et al., 2018; Lorenzo-Trueba et al., 2018; Serrà et al., 2019), but the most relevant work to our own is the ZeroSpeech 2019 challenge (Dunbar et al., 2019; Tjandra et al., 2019; Cho et al., 2019), which addresses unsupervised learning of discrete speech units that can replace text and be used as input to TTS models. Unlike image-to-speech synthesis, these tasks only infer phonetic units from given audio recordings rather than generating them.

Speech Pre-Training and Its Applications. Interest in this area has recently surged. Various learning objectives have been proposed, including auto-encoding with structured latent spaces (van den Oord et al., 2017; Eloff et al., 2019; Chorowski et al., 2019; Hsu et al., 2017b; Hsu and Glass, 2018b; Khurana et al., 2019), predictive coding (Chung et al., 2019; Wang et al., 2020a), contrastive learning (Oord et al., 2018; Schneider et al., 2019), and more. Prior work addresses inferring linguistic content such as phones from the learned representations (Baevski et al., 2020; Kharitonov et al., 2020; Hsu et al., 2021). In contrast, this work focuses on generating the learned representation from a different modality, which evaluates representations from a different perspective.

3 Method

3.1 Framework Overview

A depiction of our modeling approach is shown in Figure 2. Caption generation for an image involves a cascade of two components: given an input image I, we first generate a linguistic unit sequence U according to the I2U module P(U | I). Given the linguistic symbol sequence U, we generate a speech waveform S according to the U2S module P(S | U). If the linguistic unit sequence U were to take the form of natural language text, the model would be equivalent to the cascade of a conventional image captioning system followed by a TTS module. Note that we assume S ⊥ I | U because prosody variation is not dependent on the image for the datasets considered.

The key idea in this paper is to instead define U to be a sequence of learned speech units that are as robust and compact as possible, like text, but discovered without text supervision. We define inference with this S2U model as U = f(S), enabling us to "transcribe" any given speech audio waveform S into a sequence of units U. The addition of this third component enables us to train P(U | I) from a dataset of images paired with spoken captions {(I1, S1), . . . , (IN, SN)}. The conditional independence assumption between S and I given U enables us to choose any arbitrary speech dataset for training P(S | U), therefore allowing the speaker characteristics and other acoustic properties to be controlled independently of the I2U system (Wang et al., 2018; Hsu et al., 2019; Henter et al., 2018; Akuzawa et al., 2018).

3.2 Datasets

Table 1 summarizes the five datasets used for training the S2U, I2U, and U2S models. Note that we deliberately choose different datasets for training each module in order to examine the robustness of the units when transferring across domains, including shifts in speaker demographics, speaking style (scripted/spontaneous), and linguistic content (books/newspaper/image descriptions). Among the three datasets with image and speech pairs (Places, Flickr8k, MSCOCO), we chose the latter two for training I2U models because they include five captions per image, which is more suitable for caption metrics such as SPICE (Anderson et al., 2016); moreover, they are commonly used image captioning datasets with many text-based baselines in the literature. Places only contains one spoken caption per image and has not been used for captioning.

Specifically, as part of this work we collect SpokenCOCO, a spoken version of the MSCOCO captioning dataset (Lin et al., 2014) comprising 742 hours of speech from 2,352 speakers, collected via Amazon Mechanical Turk by displaying the text to a person and having them read it aloud. Additional details regarding the dataset can be found in appendix Section A.

Figure 2: Diagram of our proposed framework. An image is passed through the Image-to-Unit model (Show, Attend, and Tell: a ResNet-101 producing a 14x14 feature map, followed by a 1-layer LSTM decoder with attention) to produce a sequence of learned units (e.g., 263 32 208 5 476 570 395 16 ...), which the Unit-to-Speech model (Tacotron 2: a 3-layer convolutional + 1-layer BLSTM encoder and a Pre-Net + LSTM + attention + PostNet decoder producing N feature vectors) converts into speech. The Speech-to-Unit model (ResDAVEnet-VQ, 3 x (ResBlock + VQ)), pre-trained on (image, speech) pairs, maps T input frames to T/4 units at a fixed unit rate (e.g., 263 32 32 32 32 208 208 5 5 5 476 570 395 16 ...); lossy run-length encoding then yields N (≤ T/4) units at a variable unit rate (263 32 208 5 476 570 395 16 ...). The ResDAVEnet-VQ model was trained using a {2} → {2, 3} curriculum (in the notation given in Harwath et al. (2020)).
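To make the pipeline in Figure 2 concrete, the following is a minimal sketch of caption generation at inference time; the ImageToUnit and UnitToSpeech interfaces and their method names are hypothetical placeholders rather than the released implementation.

```python
from typing import List, Optional

class ImageToUnit:
    """Show-Attend-and-Tell style I2U model over learned units (placeholder)."""
    def generate(self, image, beam_size: int = 5) -> List[int]:
        raise NotImplementedError

class UnitToSpeech:
    """Tacotron 2 + WaveGlow U2S model over RLE-ed units (placeholder)."""
    def synthesize(self, units: List[int], speaker_id: Optional[str] = None):
        raise NotImplementedError

def caption_image(image, i2u: ImageToUnit, u2s: UnitToSpeech,
                  speaker_id: Optional[str] = None):
    # Stage 1: image -> discrete unit sequence, U ~ P(U | I).
    units = i2u.generate(image)
    # Stage 2: units -> waveform, S ~ P(S | U). Since S is assumed
    # conditionally independent of I given U, the U2S model (and voice)
    # can be swapped without retraining the I2U model.
    waveform = u2s.synthesize(units, speaker_id=speaker_id)
    return units, waveform
```

The point of the factorization is visible in the last step: because S is modeled only through U, the U2S model (and hence the output voice) can be replaced without touching the I2U model, which is exactly what Section 4.5 exploits.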

        Data                                        Hr    #Utt   #Spk   Maj. Spk   Description
S2U     PlacesAudio (Harwath et al., 2016)          936   400K   2683   American   spontaneous image caption
I2U     Flickr8kAudio (Harwath and Glass, 2015)      46    40K    183   American   scripted image caption
I2U     SpokenCOCO (this work)                      742   605K   2353   American   scripted image caption
U2S     LJSpeech (Ito, 2017)                         24    13K      1   American   read non-fiction books
U2S     VCTK (Veaux et al., 2017)                    44    40K    109   British    read newspaper

Table 1: Speech dataset summary. For training S2U and I2U models, their corresponding image datasets, MSCOCO (Lin et al., 2014), Flickr8k (Rashtchian et al., 2010), and Places (Zhou et al., 2014), are also used.

Note that although there exists a speech version of MSCOCO named Speech-COCO (Havard et al., 2017), it is comprised solely of synthesized speech produced by a concatenative TTS model in eight speakers' voices. Disfluencies (e.g., "uh") are randomly inserted between words to imitate real speech. Compared to SpokenCOCO, Speech-COCO offers limited diversity and naturalness.

3.3 Learning Robust Linguistic Units from Visually-Grounded Speech

We propose to build the S2U model upon ResDAVEnet-VQ, an audio-visual grounding model introduced in Harwath et al. (2020) that has been shown to learn discrete phone-like and word-like units in its intermediate vector quantizing (VQ) layers. This model is trained to associate speech with contextually relevant visual inputs using a triplet loss (Weinberger and Saul, 2009), which can be interpreted as maximizing a mutual information lower bound between image and speech (Tschannen et al., 2020). Since visual semantics are described with words, which in turn are composed of phones, the representations learned by ResDAVEnet-VQ are forced to be predictive of words and phones rather than speaker, noise, etc.

In contrast, many speech representations are trained by reconstructing (Chorowski et al., 2019; Hsu et al., 2017b) or predicting unseen speech signals (Chung et al., 2019), which inevitably captures factors unrelated to the linguistic content. To demonstrate the advantage of representation learning with grounding, we compare ResDAVEnet-VQ with a reconstruction-based model, WaveNet-VQ, trained on the PlacesAudio dataset. We denote the units extracted from this model with WVQ. We use the implementation of Harwath et al. (2020) for ResDAVEnet-VQ, and that of Cho et al. (2019) for WaveNet-VQ, which achieves the best ZeroSpeech 2019 challenge performance.

3.4 Unit Selection and Run Length Encoding

Although the ResDAVEnet-VQ model has been shown to be capable of learning both phone-like and word-like units, the experiments in (Harwath et al., 2020) show that only several hundred words are explicitly learned, which tend to be "visual words." Conversely, the phone-like units learned by the lower VQ layers of the model were shown to cover all of the phones in American English (of which there are only several dozen). For this reason, we choose to use phone-like units learned by the lower VQ layers to represent U.

Nominally, the VQ layers output one-hot vectors at a uniform temporal rate, downsampled with respect to the frame rate of the acoustic input depending upon which VQ layer is used. Given an input computed with a 10ms frame shift, the two VQ layers investigated in this paper (VQ2 and VQ3) output vectors every 20ms and 40ms, respectively. In general, the VQ units are repeated for several consecutive frames. We can decrease the average length of the symbol sequence U by employing a lossy form of run-length encoding (RLE) (see Figure 2), which retains the sequence of symbol identities but discards duration information. Each unit then represents a variable-length segment. This removes the burden of unit duration modeling from the I2U model and shifts it onto the U2S model, which we will show to be crucial.
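As an illustration, here is a minimal sketch of the lossy run-length encoding step described above, using the example unit stream shown in Figure 2 (plain Python, not the released implementation):

```python
from itertools import groupby
from typing import List, Sequence

def lossy_rle(units: Sequence[int]) -> List[int]:
    """Collapse consecutive repeats: keep unit identities, discard durations."""
    return [unit for unit, _ in groupby(units)]

# Frame-level VQ3 indices (one unit per 40ms frame), as in Figure 2.
frames = [263, 32, 32, 32, 32, 208, 208, 5, 5, 5, 476, 570]
assert lossy_rle(frames) == [263, 32, 208, 5, 476, 570]
```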

3.5 Image-to-Unit and Unit-to-Speech

Both the I2U model and the U2S model are based upon recurrent seq2seq-with-attention networks (Bahdanau et al., 2015). Specifically, we adopt Show-Attend-and-Tell (SAT) (Xu et al., 2015) for the I2U model. It has an image encoder pre-trained for classification, which is language agnostic and hence should work for any language within our proposed framework. The decoder, on the other hand, is randomly initialized. We train the SAT model in two stages, where the encoder parameters are only updated in the second stage. We distinguish the models from the two stages as SAT and SAT-FT (finetuned), respectively, when presenting the results. For the U2S model, we adopt Tacotron2 (Shen et al., 2018) and WaveGlow (Prenger et al., 2019) for unit-to-spectrogram and spectrogram-to-waveform generation, respectively. In particular, a pre-trained WaveGlow is used without fine-tuning.

The I2U model is trained on (I, f(S)) pairs, which requires paired images and speech, while the U2S model is trained on (f(S), S) pairs, which can be obtained from an arbitrary set of speech. Both models are trained with the maximum likelihood objective (E_{I,U}[log P(U | I)] for I2U and E_{S,U}[log P(S | U)] for U2S).

4 Experiments

We design experiments to address three questions:

First, how can we measure the performance of an image-to-speech system? Our system can fail to produce a good caption if the I2U model fails to encode linguistic/semantic information into the unit sequence, or if the U2S model fails to synthesize an intelligible waveform given a unit sequence. To better localize these failure modes, we evaluate the full I2S system as well as the U2S system in isolation. We evaluate the U2S system by using it as a vocoder to synthesize unit sequences inferred from real speech and soliciting human judgements in the form of Mean Opinion Score (MOS) and Side-By-Side (SXS) preference tests (Table 2).

To evaluate the I2S system, we can use any method that measures the semantic information contained in the generated speech. We consider two sets of end-to-end metrics, word-based and retrieval-based, and one set of proxy unit-based metrics. Word-based metrics transcribe a generated spoken caption into text (manually or with an ASR system) and then measure word-based captioning metrics against a set of reference captions, such as BLEU-4 (Papineni et al., 2002) (adjusted n-gram precision), METEOR (Denkowski and Lavie, 2014) (uni-gram F-score considering word-to-word alignment), ROUGE (Lin, 2004) (n-gram recall), CIDEr (Vedantam et al., 2015) (TF-IDF weighted n-gram cosine similarity), and SPICE (Anderson et al., 2016) (F-score of semantic propositions in scene graphs). This enables comparison between image-to-speech systems and a text "upper bound", but is not applicable to unwritten languages.

Retrieval-based metrics include image-to-speech and speech-to-image retrieval (Harwath et al., 2020), which require a separately trained cross-modal retrieval model for evaluation. Such metrics are text-free, but they cannot measure other aspects of language generation such as syntactic correctness (partially captured by BLEU-4) and the scope of the learned vocabulary. Lastly, unit-based metrics are similar to word-based ones, but replace words with units when computing n-gram statistics. However, systems built on different units are not directly comparable, and the scores can be inflated if duration is modeled using unit repetition.

Second, what properties must learned units have to be a drop-in replacement for text? The most essential differences between text and speech are the amount of information encoded and the sequence lengths. Beyond text, speech also encodes prosody, speaker, and environment information, as well as the duration of each phone, all of which are minimally correlated with the conditioned images. We hypothesize that learned speech units should discard such information in order to seamlessly connect the I2U and U2S modules. To verify this, we pay particular attention to variations of the learned units in frame rate (VQ2/VQ3), encoding of duration information (RLE or not), and robustness to domain shift (WVQ/VQ3). Units are run-length encoded by default. Table 2a shows the properties of the units before run-length encoding.

Third, how should language generation models be evaluated more generally? We examine evaluation of the I2S model using beam search-based decoding as well as sampling-based decoding. We find that because evaluation metrics that rely on beam search-based decoding only evaluate the mode of a model's posterior, they do not reflect the ability of a model to generate diverse linguistic content. Furthermore, we show that it is possible for a model's posterior mode to be linguistically meaningless, and yet meaningful language can still be generated with sampling-based decoding. Towards this end, we introduce a novel multi-hypothesis evaluation metric (M-SPICE), which uses sampling-based decoding (instead of beam search) to generate a set of captions. We can then compute the overall coverage of this caption set against a reference; see Section 4.4 for details.

4.1 Evaluating the U2S Model

We construct a Tacotron-2 model for each of the three unit types on the LJSpeech audio data by transcribing each LJSpeech utterance into a unit sequence, then training the U2S model on the resulting pairs of RLE-ed unit sequences and spectrograms. We evaluate the naturalness of the speech produced by each model on held-out data, both in-domain using LJSpeech and out-of-domain (OOD) using SpokenCOCO (in-domainness is defined with respect to the U2S training data, LJSpeech, not the S2U training data, PlacesAudio). Amazon Mechanical Turk (AMT) workers performed Side-by-Side preference tests (SXS) and naturalness evaluation based on mean opinion scores (MOS) on a scale from 1 to 5 for each U2S model, which we display in Table 2. Although VQ2 was preferred for in-domain synthesis on LJSpeech, VQ3 achieved the highest score and the least degradation (-0.387) on the out-of-domain SpokenCOCO, indicating that of the three unit types VQ3 is the most robust to domain shift.

Unit   ABX Error   Frame Rate   MOS (LJSpeech)    MOS (SpokenCOCO)
VQ3    14.52%      40ms         3.723 ± 0.039     3.336 ± 0.044
VQ2    12.51%      20ms         3.932 ± 0.036     2.961 ± 0.045
WVQ    24.87%      40ms         3.658 ± 0.040     2.896 ± 0.053

(a) Properties of the units and MOS of the U2S models trained on these units, with 95% confidence intervals. ABX errors are computed on the ZeroSpeech 2020 English test set.

Unit A   Unit B   LJSpeech (A / Same / B)   SpokenCOCO (A / Same / B)
VQ3      VQ2      23.9 / 31.5 / 44.6        40.4 / 32.5 / 27.5
VQ3      WVQ      36.6 / 37.1 / 26.3        58.3 / 21.8 / 19.9

(b) SXS preference (%) of the U2S models.

Table 2: Subjective evaluation of U2S models trained on LJSpeech and used to re-synthesize units inferred from LJSpeech or SpokenCOCO recordings.

4.2 Incorporating the I2U Model

We trained an SAT model on SpokenCOCO for each of the three RLE-ed unit types, as well as on VQ3 units without RLE. We also compare to text characters and words; the full hyperparameter and training details for all models are provided in Section B of the appendix, but in general we kept these as constant as possible when comparing different linguistic representations.

Before connecting the U2S model, we noticed that all RLE speech unit models except the one trained on VQ3 units failed during beam search decoding on the test images (WVQ consistently failed, while VQ2 sometimes succeeded); rather than producing a diverse sequence of output units, the decoder would generally get stuck in a loop until the maximum decoding length was reached. This also happened when using VQ3 units without RLE, indicating that the decoder could not model unit duration. Example outputs are provided in Table 3. We hypothesize that the VQ2 and WVQ units failed due to their lack of invariance to domain shift, as evidenced by their decay in naturalness when used for OOD synthesis, as shown in Table 2. This may cause the entropy of the unit distribution conditioned on an image to be higher, since each phoneme may be represented by multiple units, and therefore the I2U model suffers from the same looping issues as unconditional language models of text, as observed in (Holtzman et al., 2018; Fan et al., 2018; Holtzman et al., 2020; Kulikov et al., 2019; Welleck et al., 2020).

Symbol      Image-to-Unit Output
Decoded with Beam Search (beam size=5)
VQ3         263 32 208 5 336 100 717 803 256 803 815 144 120 144 654 936 48 417 272 417 362 766 825 284 614...
VQ2         (71 791)*N (until reaching max decoder length)
WVQ         (181 232)*N (until reaching max decoder length)
VQ3 \ RLE   263 (32)*N (until reaching max decoder length)
Decoded with Top-k Sampling (k=5)
VQ3         263 208 467 717 288 426 986 72 44 341 151 801 1022 27 320 426 288 66 570 683 351 313 910 820...
VQ2         (71 791)*4 175 51 139 359 173 599 307 419 133 621 85 165 315 883 175 191 71 791 71 48 511 765...
WVQ         (181 232)*5 181 225 124 232 181 232 225 232 181 225 124 225 232 181 252 169 211 147 89 67 156...
VQ3 \ RLE   263 (32)*15 208 208 5 5 336 100 803 256 560 417 870 870 870 968 910 250 543 820 587 909 909...

Table 3: Exemplar output from SAT models.
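For reference, a minimal sketch of the top-k sampling decoding used for the lower half of Table 3; `step_fn` is a hypothetical callable returning the I2U decoder's next-unit distribution given the prefix generated so far, and the end-of-sequence index is an assumed placeholder.

```python
import random
from typing import Callable, List, Sequence

def top_k_sample(probs: Sequence[float], k: int = 5) -> int:
    """Sample an index from the k most probable entries, renormalized."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return random.choices(top, weights=[probs[i] for i in top], k=1)[0]

def sample_units(step_fn: Callable[[List[int]], Sequence[float]],
                 k: int = 5, max_len: int = 200, eos: int = 0) -> List[int]:
    """Decode a unit sequence by repeatedly sampling from the top-k of the
    predicted next-unit distribution, instead of taking the beam-search mode."""
    units: List[int] = []
    while len(units) < max_len:
        unit = top_k_sample(step_fn(units), k=k)
        if unit == eos:  # assumed end-of-sequence index
            break
        units.append(unit)
    return units
```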

                                 MSCOCO                                    Flickr8k
Model                U      B-4    M      R      C      S        B-4    M      R      C      S
Xu et al. (2015)     word   0.243  0.239  -      -      -        0.213  0.203  -      -      -
Lu et al. (2017)     word   0.327  0.260  0.540  1.042  -        -      -      -      -      -
Wang et al. (2020b)  N/A    -      -      -      -      -        0.035  0.113  0.232  0.080  -
SAT                  word   0.315  0.253  0.533  0.984  0.185    0.216  0.207  0.469  0.550  0.149
SAT                  char   0.289  0.239  0.512  0.879  0.172    0.190  0.190  0.441  0.476  0.136
SAT                  VQ3    0.186  0.186  0.446  0.584  0.127    0.116  0.141  0.390  0.232  0.091
SAT-FT               word   0.339  0.265  0.551  1.062  0.196    0.225  0.215  0.483  0.584  0.155
SAT-FT               char   0.323  0.256  0.536  1.002  0.187    0.191  0.196  0.450  0.519  0.143
SAT-FT               VQ3    0.233  0.212  0.478  0.732  0.149    0.125  0.145  0.391  0.245  0.095

Table 4: Word-based caption evaluation using BLEU-4 (B-4), METEOR (M), ROUGE (R), CIDEr (C), and SPICE (S). ASR is used to transcribe the spoken captions generated by the proposed VQ3 model into text for evaluation. The beam size ∈ {1, 3, 5, 8, 10} was chosen for each model to maximize SPICE. Our word-based SAT models outperform (Xu et al., 2015) because we use a stronger image encoder (ResNet-101).

              Word-based                           Unit-based                   Image to Speech         Speech to Image
Symbol        B-4    M      R      C      S       B-4    M      R      C       R@1    R@5    R@10      R@1    R@5    R@10
SAT-FT Model, Decoded with Beam Search
VQ3           0.233  0.212  0.478  0.732  0.149   0.261  0.198  0.334  0.211   0.240  0.603  0.766     0.265  0.611  0.765
SAT Model, Decoded with Beam Search
VQ3           0.186  0.186  0.446  0.584  0.127   0.274  0.196  0.328  0.215   0.157  0.451  0.623     0.158  0.450  0.611
VQ2           0.068  0.138  0.343  0.262  0.084   0.172  0.132  0.178  0.027   0.09   0.289  0.426     0.093  0.283  0.420
WVQ           0.010  0.069  0.286  0.009  0.011   0.020  0.048  0.081  0.000   0.000  0.005  0.010     0.001  0.006  0.011
VQ3 \ RLE     0.000  0.002  0.001  0.000  0.001   0.163  0.168  0.218  0.000   0.000  0.003  0.007     0.001  0.006  0.011

Table 5: Comparison of the three sets of metrics (word-based, unit-based, and retrieval-based) on different units and models trained on MSCOCO.
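Schematically, the word-based numbers in Tables 4 and 5 come from a pipeline of the following shape; `i2u`, `u2s`, `asr_transcribe`, and `score_captions` are hypothetical stand-ins for the trained I2U and U2S models, the ASR system described below, and a standard COCO caption scorer.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def evaluate_word_based(
    images: Iterable[Tuple[str, object]],                       # (image_id, image) pairs
    references: Dict[str, List[str]],                           # image_id -> reference captions (text)
    i2u, u2s,                                                   # trained I2U and U2S models
    asr_transcribe: Callable[[object], str],                    # waveform -> word transcript
    score_captions: Callable[[Dict, Dict], Dict[str, float]],   # BLEU-4/METEOR/ROUGE/CIDEr/SPICE
) -> Dict[str, float]:
    """Generate a spoken caption per image, transcribe it to words with ASR,
    then score the transcripts against the reference captions."""
    hypotheses: Dict[str, List[str]] = {}
    for image_id, image in images:
        units = i2u.generate(image)                  # image -> learned unit sequence
        waveform = u2s.synthesize(units)             # units -> synthesized speech
        hypotheses[image_id] = [asr_transcribe(waveform)]  # speech -> words
    return score_captions(references, hypotheses)
```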

To evaluate the full Image-to-Speech model, we first train an ASR system on the SpokenCOCO captions re-synthesized with the VQ3 Tacotron-2 model. This enables us to estimate a word-level transcription of the spoken captions produced by our system. In order to verify that the synthesized captions are intelligible to humans and that the ASR system did not simply learn to recognize artifacts of the synthesized speech, we asked AMT workers to transcribe into words a set of 500 captions generated by our I2U→U2S system and also to evaluate their naturalness. Three workers transcribed and three workers rated each caption, allowing us to compute an MOS score (3.615 ± 0.038), a word error rate (WER) between the 3 human transcriptions (9.40%), as well as an average WER between the human and ASR-produced transcriptions (13.97%). This confirms that our system produces reasonably natural speech and that the ASR system is sufficiently accurate for transcribing synthesized speech.

Table 4 summarizes our results on MSCOCO and Flickr8k using beam search. We compare with the literature for bottom-up text captioning (rows 1-2) and text-free end-to-end image-to-speech synthesis (row 3). We train the decoder of an SAT model while keeping the image encoder fixed (rows 4-6), in addition to fine-tuning the encoder (rows 7-9). Despite having no access to text, the SAT-FT speech captioning model trained on VQ3 units achieves a BLEU-4 score of .233 with beam search decoding on MSCOCO. This is very close to the .243 achieved by the original SAT word-based captioning model. Figure 1 shows that the generated captions are fluent and reflect the implicit learning of some syntactic rules. It is evident that the proposed model is capable of generating fluent and meaningful image captions.

Results comparing four unit representations on all three sets of metrics are shown in Table 5. First, comparing word-based and unit-based evaluations, we note that the relative ranking among VQ3, VQ2, and WVQ is consistent across BLEU-4, METEOR, and ROUGE for SAT models; however, VQ3 \ RLE achieves abnormally high scores on these metrics despite producing trivial captions for all images, as shown in Table 3. This is because unit "32" has learned to represent non-speech frames such as silence, which frequently occur at both the beginning and end of utterances. Without RLE, consecutive strings of "32" units are extremely common in both the candidate and reference captions, which inflates the scores of this model.

Figure 3: MSCOCO test SPICE scores of various units and decoding methods. VQ3 \ RLE denotes VQ3 units without RLE. Top-k sampling considers only the k most probable units at each step.

The exception here is the CIDEr metric, which incorporates TF-IDF weighting that tends to de-emphasize these kinds of uninformative patterns. Nonetheless, when comparing SAT and SAT-FT with VQ3 units, CIDEr does not rank them the same as the word-based metrics.

Regarding retrieval-based evaluation, despite the fact that the ResDAVEnet model was only trained on the original, human-spoken captions for the MSCOCO images, it works very well for the fully synthetic captions. The speech and image retrieval scores for 1k human-spoken validation captions are 0.867 and 0.828 R@10, respectively, while the SAT-FT VQ3 model achieves 0.766 and 0.765 R@10. This indicates that this image-to-speech model is able to infer the salient semantic content of an input image, generate a unit sequence that captures that content, and generate speech that is sufficiently natural sounding for the ResDAVEnet model to recover that semantic information. Several of the other image-to-speech models also achieve respectable retrieval performance, and the overall ranking of the models mirrors that which we found when using word-based evaluation metrics.

4.3 From Mode to Distribution: Evaluating Captions Generated via Sampling

The results in the previous section only evaluate beam search decoding with the I2U model, and do not fully reveal the posterior over captions for an input image, or whether the unit representations that failed with beam search would work well with other methods. To probe this, we evaluate the models using sampling-based caption generation. Figure 3 shows the SPICE scores on SpokenCOCO using beam search and two sampling-based methods. VQ3 still performs the best of all unit types with both beam search and sampled decoding. VQ2 can sometimes generate captions with beam search when the beam is kept small, but as the beam grows it begins to loop and the scores become very low. We see that all unit types can generate reasonable captions when decoding via sampling. Moreover, we discovered that 1) ResDAVEnet-VQ units consistently outperform the WaveNet-VQ units, suggesting that they better capture sub-word structure, and 2) VQ3 \ RLE achieves better scores than VQ2 when using a larger temperature or a larger k for top-k sampling.

Figure 4: Vocabulary size learned by the proposed I2S model (on MSCOCO).

Figure 5: M-SPICE on MSCOCO. Black dashed lines show the highest value for beam search when n=1.

We estimated the vocabulary size of the SAT-FT model with VQ3 by counting the number of unique recognized words produced at least 3 times when captioning the SpokenCOCO test images. These numbers are shown for the model under the various decoding methods in Figure 4. The number of captions per image is denoted by n, where the top candidates are used for beam search and i.i.d. samples are drawn for sampling. Sampling-based decoding reveals a larger vocabulary size than beam search, and the number of words learned by our models (≥ 212) is far greater than the number of words learned by the ResDAVEnet-VQ model (approx. 279) in (Harwath et al., 2020).

We hypothesize that training a model to generate spoken captions encourages it to learn many more words than only being trained to retrieve images from captions. We also hypothesize that because beam search attempts to find the mode of the posterior over captions, it tends to produce a smaller set of words and does not reveal the breadth of the model distribution.

4.4 New Diversity-Aware Metric: M-SPICE

The previous section showed that even when the SPICE scores were comparable, sampling-based decoding revealed a much larger model vocabulary than beam search, especially when multiple captions are generated for each image. This highlights a limitation of SPICE in measuring diversity. Formally speaking, SPICE computes an F-score between two bags of semantic propositions, T(S) and T(c), parsed from a set of references S = {s_i} and a hypothesis c, where T(c) denotes the bag of propositions extracted from a scene graph parsed from c, and T(S) = ∪_i T(s_i) pools the propositions of multiple reference sentences.

To extend SPICE to score multiple hypotheses C = {c_1, . . . , c_J}, one can compute an average SPICE, (1/J) Σ_j F1(T(S), T(c_j)), or use the oracle SPICE proposed in Vijayakumar et al. (2018), max_j F1(T(S), T(c_j)). However, these metrics fail to capture the diversity among hypotheses. Consider two hypothesis sets, C¹ = {c¹₁, c¹₂} and C² = {c²₁, c²₂}, where T(c¹₁) = T(c¹₂) = T(c²₁) = {(girl), (table), (girl, sit-at, table)}, T(c²₂) = {(girl), (girl, young)}, and T(S) = {(girl), (table), (girl, young), (girl, sit-at, table)}.

To address the deficiencies of the existing metrics, we propose a new metric named multi-candidate SPICE (M-SPICE), which takes the union of the candidate propositions and computes the F-score against the reference propositions: F1(T(S), ∪_j T(c_j)). M-SPICE assigns a higher score if the set captures diverse and correct propositions, and accordingly the score of C² is higher than that of C¹, as desired.
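As a minimal illustration of the definitions above (using exact matching of propositions only; the real SPICE matcher also handles synonyms), the sketch below reproduces the C¹ versus C² comparison:

```python
from typing import FrozenSet, List, Tuple

Prop = Tuple[str, ...]  # a proposition, e.g. ("girl", "sit-at", "table")

def f1(ref: FrozenSet[Prop], hyp: FrozenSet[Prop]) -> float:
    """F-score between two bags of propositions (exact matching only)."""
    tp = len(ref & hyp)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(hyp), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

def m_spice(ref: FrozenSet[Prop], hyps: List[FrozenSet[Prop]]) -> float:
    """M-SPICE: union the propositions of all candidates, then score once."""
    return f1(ref, frozenset().union(*hyps))

ref = frozenset({("girl",), ("table",), ("girl", "young"), ("girl", "sit-at", "table")})
c_a = frozenset({("girl",), ("table",), ("girl", "sit-at", "table")})
c_b = frozenset({("girl",), ("girl", "young")})

assert m_spice(ref, [c_a, c_a]) < m_spice(ref, [c_a, c_b])  # C1 ~ 0.857 < C2 = 1.0
```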
                                                                We examine to what extent the VQ3 units are
Figure 5 shows the M-SPICE scores of our SAT-FT model using VQ3 units on SpokenCOCO. When evaluating over multiple captions (n > 1), using the beam search hypotheses increases the score less than sampling does.

4.5 Disentangled Voice Control for Image-to-Speech Synthesis

We examine to what extent the VQ3 units are portable across different speakers by training a U2S model on the VCTK dataset that additionally takes a speaker ID as input. The resulting model is able to generate speech with the voice of any VCTK speaker. We evaluate the captions produced by this system on SpokenCOCO for 5 speakers in Table 6. To compute these scores we transcribe the captions generated by each model into text using the ASR system described in Section 4.2, which was trained solely on SpokenCOCO captions re-synthesized with the LJSpeech U2S model. The scores in Table 6 indicate not only that the I2U model can be easily integrated with U2S models representing a diverse set of speakers, but also that the LJSpeech ASR system works very well on the speech synthesized from the VCTK models.

Speaker   Gender   Region     B-4     M       S
U2S trained on LJSpeech
-         F        -          0.233   0.212   0.149
U2S trained on VCTK
p247      M        Scottish   0.234   0.211   0.148
p231      F        English    0.233   0.210   0.146
p294      F        American   0.236   0.212   0.148
p345      M        American   0.234   0.209   0.144
p307      F        Canadian   0.234   0.211   0.148

Table 6: Results of disentangled voice control, synthesizing the same units with a single-speaker and a multi-speaker U2S model. Units are decoded using beam search from the SAT-FT VQ3 MSCOCO model.

5 Conclusion

In this paper, we presented the first model capable of generating fluent spoken captions of images without relying on text, which almost matches the performance of early text-based image captioning models. Our comprehensive experiments demonstrated that learned units need to be robust, have a low frame rate, and encode little or no duration information in order to serve as a drop-in replacement for text. We also identified the caveats of mode-based evaluation and proposed a new metric to address semantic diversity. As part of this work, a novel dataset of over 600k spoken captions for the MSCOCO dataset is introduced, which we will make publicly available to the research community.

Future work should investigate applying the proposed method to additional languages, devising improved speech unit representations, and jointly training the speech unit model with the I2S model. This would offer the opportunity to explore new analysis-by-synthesis training objectives.

References

Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara. 1990. Voice conversion through vector quantization. Journal of the Acoustical Society of Japan, 11(2):71–76.

Gilles Adda, Sebastian Stüker, Martine Adda-Decker, Odette Ambouroue, Laurent Besacier, David Blachon, Hélene Bonneau-Maynard, Pierre Godard, Fatima Hamlaoui, Dmitry Idiatov, et al. 2016. Breaking the unwritten language barrier: The BULB project. Procedia Computer Science, 81:8–14.

Kei Akuzawa, Yusuke Iwasawa, and Yutaka Matsuo. 2018. Expressive speech synthesis via modeling expressions with variational autoencoder. In Proc. Annual Conference of International Speech Communication Association (INTERSPEECH).

Afra Alishahi, Marie Barking, and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In Proc. ACL Conference on Natural Language Learning (CoNLL).

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In Proc. IEEE European Conference on Computer Vision (ECCV).

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Alexei Baevski, Steffen Schneider, and Michael Auli. 2020. vq-wav2vec: Self-supervised learning of discrete speech representations. In Proc. International Conference on Learning Representations (ICLR).

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. International Conference on Learning Representations (ICLR).

Suhee Cho, Yeonjung Hong, Yookyunk Shin, and Youngsun Cho. 2019. VQVAE with speaker adversarial training.

Jan Chorowski, Ron J. Weiss, Samy Bengio, and Aäron van den Oord. 2019. Unsupervised speech representation learning using wavenet autoencoders. IEEE Transactions on Audio, Speech and Language Processing.

Jan Chorowski, Ron J. Weiss, Rif A. Saurous, and Samy Bengio. 2018. On using backpropagation for speech texture generation and voice conversion. In Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, and Lin-shan Lee. 2018. Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. In Proc. Annual Conference of International Speech Communication Association (INTERSPEECH).

Grzegorz Chrupała, Lieke Gelderloos, and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal. In Proc. Annual Meeting of the Association for Computational Linguistics (ACL).

Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James R. Glass. 2019. An unsupervised autoregressive model for speech representation learning. In Proc. Annual Conference of International Speech Communication Association (INTERSPEECH).

Bo Dai and Dahua Lin. 2017. Contrastive learning for image captioning. In Proc. Neural Information Processing Systems (NeurIPS).

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation.

Jennifer Drexler and James Glass. 2017. Analysis of audio-visual features for unsupervised speech recognition. In Proc. Grounded Language Understanding Workshop.

Ewan Dunbar, Robin Algayres, Julien Karadayi, Mathieu Bernard, Juan Benjumea, Xuan-Nga Cao, Lucie Miskic, Charlotte Dugrain, Lucas Ondel, Alan W. Black, Laurent Besacier, Sakriani Sakti, and Emmanuel Dupoux. 2019. The zero resource speech challenge 2019: TTS without T. In Proc. Annual Conference of International Speech Communication Association (INTERSPEECH).

Ryan Eloff, Herman Engelbrecht, and Herman Kamper. 2019. Multimodal one-shot learning of speech and images. In Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proc. Annual Meeting of the Association for Computational Linguistics (ACL).

Fuming Fang, Junichi Yamagishi, Isao Echizen, and Jaime Lorenzo-Trueba. 2018. High-quality nonparallel voice conversion based on cycle-consistent adversarial network. In Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP).

David Harwath and James Glass. 2015. Deep multimodal semantic embeddings for speech and images. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

David Harwath and James Glass. 2017. Learning word-like units from joint audio-visual analysis. In Proc. Annual Meeting of the Association for Computational Linguistics (ACL).

                                                    5293
David Harwath and James Glass. 2019. Towards vi-           auto-encoder. In Asia-Pacific Signal and Infor-
  sually grounded sub-word speech unit discovery.          mation Processing Association Annual Summit and
  In Proc. International Conference on Acoustics,          Conference (APSIPA).
  Speech and Signal Processing (ICASSP).
                                                         Wei-Ning Hsu and James Glass. 2018a. Disentangling
David Harwath, Wei-Ning Hsu, and James Glass. 2020.        by partitioning: A representation learning frame-
  Learning hierarchical discrete linguistic units from    work for multimodal sensory data. arXiv preprint
  visually-grounded speech. In Proc. International         arXiv:1805.11264.
  Conference on Learning Representations (ICLR).
                                                         Wei-Ning Hsu and James Glass. 2018b. Scalable fac-
David Harwath, Adrià Recasens, Dídac Surís, Galen
                                                           torized hierarchical variational autoencoder training.
  Chuang, Antonio Torralba, and James Glass. 2019.
                                                           In Proc. Annual Conference of International Speech
  Jointly discovering visual objects and spoken words
                                                          Communication Association (INTERSPEECH).
  from raw sensory input. International Journal of
  Computer Vision.
                                                         Wei-Ning Hsu, Yao-Hung Hubert Tsai, Benjamin
David Harwath, Antonio Torralba, and James R. Glass.       Bolte, Ruslan Salakhutdinov, and Abdelrahman Mo-
  2016. Unsupervised learning of spoken language           hamed. 2021. Hubert: How much can a bad teacher
  with visual context. In Proc. Neural Information         benefit asr pre-training? In ICASSP. IEEE.
  Processing Systems (NeurIPS).
                                                         Wei-Ning Hsu, Yu Zhang, and James Glass. 2017a.
Mark Hasegawa-Johnson, Alan Black, Lucas Ondel,            Learning latent representations for speech genera-
 Odette Scharenborg, and Francesco Ciannella. 2017.        tion and transformation. In Proc. Annual Confer-
 Image2speech: Automatically generating audio de-          ence of International Speech Communication Asso-
 scriptions of images. In International Conference         ciation (INTERSPEECH).
 on Natural Language, Signal and Speech Process-
 ing.                                                    Wei-Ning Hsu, Yu Zhang, and James Glass. 2017b.
                                                           Unsupervised learning of disentangled and in-
William Havard, Laurent Besacier, and Olivier Rosec.       terpretable representations from sequential data.
  2017. Speech-coco: 600k visually grounded spoken         In Proc. Neural Information Processing Systems
  captions aligned to mscoco data set. arXiv preprint     (NeurIPS).
  arXiv:1707.08435.
William Havard, Jean-Pierre Chevrot, and Laurent Be-     Wei-Ning Hsu, Yu Zhang, Ron Weiss, Heiga Zen,
  sacier. 2019a. Models of visually grounded speech       Yonghui Wu, Yuan Cao, and Yuxuan Wang. 2019.
  signal pay attention to nouns: a bilingual experi-       Hierarchical generative modeling for controllable
  ment on English and Japanese. In Proc. Interna-          speech synthesis. In Proc. International Conference
  tional Conference on Acoustics, Speech and Signal        on Learning Representations (ICLR).
  Processing (ICASSP).
                                                         Gabriel Ilharco, Yuan Zhang, and Jason Baldridge.
William Havard, Jean-Pierre Chevrot, and Laurent Be-       2019. Large-scale representation learning from vi-
  sacier. 2019b. Word recognition, competition, and        sually grounded untranscribed speech. In Proc.
  activation in a model of visually grounded speech.       ACL Conference on Natural Language Learning
  In Proc. ACL Conference on Natural Language              (CoNLL).
  Learning (CoNLL).
                                                         International Phonetic Association. 1999. Handbook
Gustav Eje Henter, Jaime Lorenzo-Trueba, Xin                of the International Phonetic Association: A guide
  Wang, and Junichi Yamagishi. 2018.         Deep           to the use of the International Phonetic Alphabet.
  encoder-decoder models for unsupervised learning          Cambridge University Press.
  of controllable speech synthesis. arXiv preprint
  arXiv:1807.11470.                                      Keith Ito. 2017. The LJ speech dataset. https://
                                                           keithito.com/LJ-Speech-Dataset/.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and
  Yejin Choi. 2020. The curious case of neural text
                                                         Herman     Kamper,     Shane    Settle,  Gregory
  degeneration. In Proc. International Conference on
                                                           Shakhnarovich, and Karen Livescu. 2017. Vi-
  Learning Representations (ICLR).
                                                           sually grounded learning of keyword prediction
Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine            from untranscribed speech. In Proc. Annual Con-
  Bosselut, David Golub, and Yejin Choi. 2018.             ference of International Speech Communication
  Learning to write with cooperative discriminators.       Association (INTERSPEECH).
  In Proc. Annual Meeting of the Association for Com-
  putational Linguistics (ACL).                          Herman Kamper, Gregory Shakhnarovich, and Karen
                                                           Livescu. 2018. Semantic speech retrieval with a
Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu,                visually grounded model of untranscribed speech.
  Yu Tsao, and Hsin-Min Wang. 2016. Voice con-             IEEE Transactions on Audio, Speech and Language
  version from non-parallel corpora using variational      Processing, PP:1–1.

                                                    5294
Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-          Shuang Ma, Daniel McDuff, and Yale Song. 2019. Un-
  semantic alignments for generating image descrip-           paired image-to-speech synthesis with multimodal
  tions. In Proc. IEEE Conference on Computer Vi-             information bottleneck. In Proc. IEEE International
  sion and Pattern Recognition (CVPR).                        Conference on Computer Vision (ICCV).
Eugene Kharitonov, Morgane Rivière, Gabriel Syn-            Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng
  naeve, Lior Wolf, Pierre-Emmanuel Mazaré,                   Huang, and Alan Yuille. 2015. Deep captioning
  Matthijs Douze, and Emmanuel Dupoux. 2020.                  with multimodal recurrent neural networks (m-rnn).
  Data augmenting contrastive learning of speech              In Proc. International Conference on Learning Rep-
  representations in the time domain. arXiv preprint          resentations (ICLR).
  arXiv:2007.00991.
                                                            Danny Merkx, Stefan L. Frank, and Mirjam Ernestus.
Sameer Khurana, Shafiq Rayhan Joty, Ahmed Ali, and            2019. Language learning using speech to image
  James Glass. 2019. A factorial deep markov model            retrieval. In Proc. Annual Conference of Interna-
  for unsupervised disentangled representation learn-         tional Speech Communication Association (INTER-
  ing from speech. In Proc. International Confer-             SPEECH).
  ence on Acoustics, Speech and Signal Processing
  (ICASSP).                                                 Aaron van den Oord, Oriol Vinyals, and Koray
                                                              Kavukcuoglu. 2017. Neural discrete representation
Diederik P Kingma and Jimmy Ba. 2015. Adam: A                 learning. In Proc. Neural Information Processing
  method for stochastic optimization. In Proc. Inter-         Systems (NeurIPS).
  national Conference on Learning Representations
  (ICLR).                                                   Aaron van den Oord, Sander Dieleman, Heiga Zen,
                                                              Karen Simonyan, Oriol Vinyals, Alex Graves,
Ryan Kiros, Ruslan Salakhutdinov, and Rich Zemel.             Nal Kalchbrenner, Andrew Senior, and Koray
  2014. Multimodal neural language models. In                 Kavukcuoglu. 2016. Wavenet: A generative model
  Proc. International Conference on Machine Learn-            for raw audio. arXiv preprint arXiv:1609.03499.
  ing (ICML).
                                                            Aaron van den Oord, Yazhe Li, and Oriol Vinyals.
Ilia Kulikov, Alexander Miller, Kyunghyun Cho, and            2018. Representation learning with contrastive pre-
   Jason Weston. 2019. Importance of search and eval-         dictive coding. arXiv preprint arXiv:1807.03748.
   uation strategies in neural dialogue modeling. In
   Proceedings of the 12th International Conference on      Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
   Natural Language Generation.                               Jing Zhu. 2002. Bleu: a method for automatic eval-
                                                              uation of machine translation. In Proceedings of the
M. Paul Lewis, Gary F. Simon, and Charles D. Fennig.          40th Annual Meeting of the Association for Compu-
  2016. Ethnologue: Languages of the World, Nine-             tational Linguistics.
  teenth edition. SIL International. Online version:
  http://www.ethnologue.com.                                Wei Ping, Kainan Peng, Andrew Gibiansky, Ser-
                                                              can O Arik, Ajay Kannan, Sharan Narang, Jonathan
Chin-Yew Lin. 2004. ROUGE: A package for auto-                Raiman, and John Miller. 2017. Deep voice 3:
  matic evaluation of summaries. In Text Summariza-           Scaling text-to-speech with convolutional sequence
  tion Branches Out.                                          learning. arXiv preprint arXiv:1710.07654.
Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James       Ryan Prenger, Rafael Valle, and Bryan Catanzaro.
  Hays, Pietro Perona, Deva Ramanan, Piotr Dollár,            2019. Waveglow: A flow-based generative network
  and C. Lawrence Zitnick. 2014. Microsoft coco:              for speech synthesis. In Proc. International Con-
  Common objects in context. ArXiv, abs/1405.0312.            ference on Acoustics, Speech and Signal Processing
                                                              (ICASSP).
Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki
   Toda, Daisuke Saito, Fernando Villavicencio, Tomi        Cyrus Rashtchian, Peter Young, Micah Hodosh, and
   Kinnunen, and Zhenhua Ling. 2018. The voice con-           Julia Hockenmaier. 2010. Collecting image anno-
   version challenge 2018: Promoting development of           tations using Amazon’s Mechanical Turk. In Proc.
   parallel and nonparallel methods. arXiv preprint           NAACL Conference on Human Language Technolo-
   arXiv:1804.04262.                                          gies (NAACL-HLT).
Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard          Steven J Rennie, Etienne Marcheret, Youssef Mroueh,
   Socher. 2017. Knowing when to look: Adaptive at-            Jerret Ross, and Vaibhava Goel. 2017. Self-critical
   tention via a visual sentinel for image captioning. In      sequence training for image captioning. In Proc.
   Proc. IEEE Conference on Computer Vision and Pat-           IEEE Conference on Computer Vision and Pattern
   tern Recognition (CVPR).                                    Recognition (CVPR).
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh.      Odette Scharenborg, Laurent Besacier, Alan W. Black,
   2018. Neural baby talk. In Proc. IEEE Confer-              Mark Hasegawa-Johnson, Florian Metze, Graham
   ence on Computer Vision and Pattern Recognition            Neubig, Sebastian Stüker, Pierre Godard, Markus
   (CVPR).                                                    Müller, Lucas Ondel, Shruti Palaskar, Philip Arthur,

                                                       5295
Francesco Ciannella, Mingxing Du, Elin Larsen,          Christophe Veaux, Junichi Yamagishi, and Kirsten
  Danny Merkx, Rachid Riad, Liming Wang, and                Macdonald. 2017.      CSTR VCTK corpus: En-
  Emmanuel Dupoux. 2018. Linguistic unit discov-            glish multi-speaker corpus for CSTR voice cloning
  ery from multi-modal inputs in unwritten languages:       toolkit.
  Summary of the "Speaking Rosetta" JSALT 2017
  workshop. In Proc. International Conference on          Ramakrishna Vedantam, C. Lawrence Zitnick, and
  Acoustics, Speech and Signal Processing (ICASSP).         Devi Parikh. 2015. CIDEr: Consensus-based image
                                                            description evaluation. Proc. IEEE Conference on
Steffen Schneider, Alexei Baevski, Ronan Collobert,         Computer Vision and Pattern Recognition (CVPR).
   and Michael Auli. 2019. wav2vec: Unsupervised
   pre-training for speech recognition. In Proc. Annual   Ashwin K Vijayakumar, Michael Cogswell, Ram-
   Conference of International Speech Communication         prasaath R Selvaraju, Qing Sun, Stefan Lee, David
  Association (INTERSPEECH).                                Crandall, and Dhruv Batra. 2018. Diverse beam
                                                            search for improved description of complex scenes.
Joan Serrà, Santiago Pascual, and Carlos Segura             In Proc. AAAI Conference on Artificial Intelligence
  Perales. 2019. Blow: a single-scale hypercondi-           (AAAI).
  tioned flow for non-parallel raw-audio voice conver-
  sion. In Proc. Neural Information Processing Sys-       Oriol Vinyals, Alexander Toshev, Samy Bengio, and
  tems (NeurIPS).                                           Dumitru Erhan. 2015. Show and tell: A neural im-
                                                            age caption generator. In Proc. IEEE Conference on
Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike              Computer Vision and Pattern Recognition (CVPR).
  Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng
  Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan,            Weiran Wang, Qingming Tang, and Karen Livescu.
  et al. 2018. Natural tts synthesis by conditioning        2020a. Unsupervised pre-training of bidirectional
  wavenet on mel spectrogram predictions. In Proc.          speech encoders via masked reconstruction. In Proc.
  International Conference on Acoustics, Speech and        International Conference on Acoustics, Speech and
  Signal Processing (ICASSP), pages 4779–4783.             Signal Processing (ICASSP).
  IEEE.
                                                          Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark
Yannis Stylianou, Olivier Cappé, and Eric Moulines.         Hasegawa-Johnson, and Odette Scharenborg. 2020b.
  1998. Continuous probabilistic transform for voice        Show and speak: Directly synthesize spoken descrip-
  conversion. IEEE Transactions on Speech and Au-           tion of images. arXiv preprint arXiv:2010.12267.
  dio Processing, 6(2):131–142.
                                                          Yuxuan Wang, R.J. Skerry-Ryan, Daisy Stanton,
Dídac Surís, Adrià Recasens, David Bau, David Har-          Yonghui Wu, Ron Weiss, Navdeep Jaitly, Zongheng
  wath, James Glass, and Antonio Torralba. 2019.            Yang, Ying Xiao, Zhifeng Chen, Samy Bengio,
  Learning words by drawing images. In Proc. IEEE           Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and
  Conference on Computer Vision and Pattern Recog-          Rif Saurous. 2017. Tacotron: Towards end-to-end
  nition (CVPR).                                            speech synthesis. In Proc. Annual Conference of In-
Gabriel Synnaeve, Maarten Versteegh, and Emmanuel           ternational Speech Communication Association (IN-
  Dupoux. 2014. Learning words from images and              TERSPEECH).
  speech. In Proc. Neural Information Processing Sys-     Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-
  tems (NeurIPS).                                           Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei
Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya            Ren, Ye Jia, and Rif A Saurous. 2018. Style tokens:
  Nachmani. 2017. Voiceloop: Voice fitting and              Unsupervised style modeling, control and transfer
  synthesis via a phonological loop. arXiv preprint         in end-to-end speech synthesis. arXiv preprint
  arXiv:1707.06588.                                         arXiv:1803.09017.

Andros Tjandra, Berrak Sisman, Mingyang Zhang,            Kilian Q. Weinberger and Lawrence K. Saul. 2009.
  Sakriani Sakti, Haizhou Li, and Satoshi Nakamura.         Distance metric learning for large margin nearest
  2019. VQVAE unsupervised unit discovery and               neighbor classification. Journal of Machine Learn-
  multi-scale code2spec inverter for zerospeech chal-       ing Research (JMLR).
  lenge 2019. arXiv preprint arXiv:1905.11449.
                                                          Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Di-
Tomoki Toda, Alan W Black, and Keiichi Tokuda.              nan, Kyunghyun Cho, and Jason Weston. 2020. Neu-
  2007.     Voice conversion based on maximum-              ral text generation with unlikelihood training. In
  likelihood estimation of spectral parameter trajec-       Proc. International Conference on Learning Repre-
  tory. IEEE Transactions on Audio, Speech and Lan-         sentations (ICLR).
  guage Processing, 15(8):2222–2235.
                                                          Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho,
Michael Tschannen, Josip Djolonga, Paul K. Ruben-           Aaron Courville, Ruslan Salakhudinov, Rich Zemel,
  stein, Sylvain Gelly, and Mario Lucic. 2020. On           and Yoshua Bengio. 2015. Show, attend and tell:
  mutual information maximization for representation        Neural image caption generation with visual atten-
  learning. In Proc. International Conference on            tion. In Proc. International Conference on Machine
  Learning Representations (ICLR).                          Learning (ICML).

                                                      5296
You can also read