DEFENDING YOUR VOICE: ADVERSARIAL ATTACK ON VOICE CONVERSION


Chien-yu Huang∗, Yist Y. Lin∗, Hung-yi Lee, Lin-shan Lee

College of Electrical Engineering and Computer Science, National Taiwan University, Taiwan

∗ These authors contributed equally.

arXiv:2005.08781v2 [eess.AS] 19 Jan 2021

ABSTRACT

Substantial improvements have been achieved in recent years in voice conversion, which converts the speaker characteristics of an utterance into those of another speaker without changing the linguistic content of the utterance. Nonetheless, the improved conversion technologies have also led to concerns about privacy and authentication, so it becomes highly desirable to be able to prevent one's voice from being improperly utilized with such voice conversion technologies. This is why we report in this paper the first known attempt to perform adversarial attack on voice conversion. We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended. Given these adversarial examples, voice conversion models cannot convert other utterances so as to sound like being produced by the defended speaker. Preliminary experiments were conducted on two currently state-of-the-art zero-shot voice conversion models. Objective and subjective evaluation results in both white-box and black-box scenarios are reported. It was shown that the speaker characteristics of the converted utterances were made obviously different from those of the defended speaker, while the adversarial examples of the defended speaker are not distinguishable from the authentic utterances.

Index Terms— voice conversion, adversarial attack, speaker verification, speaker representation

1. INTRODUCTION

Voice conversion aims to alter some specific acoustic characteristics of an utterance, such as the speaker identity, while preserving the linguistic content. These technologies were made much more powerful by deep learning [1, 2, 3, 4, 5], but the improved technologies also led to concerns about privacy and authentication. One's identity may be counterfeited by voice conversion and exploited in improper ways; this is only one of the many deepfake problems generated by deep learning and observed today, such as synthesized fake photos or fake voices. Detecting such artifacts or defending against such activities is thus increasingly important [6, 7, 8, 9], and this applies equally to voice conversion.

On the other hand, it has been widely known that neural networks are fragile in the presence of some specific noise: they are prone to yield different or incorrect results if the input is disturbed by subtle perturbations imperceptible to humans [10]. Adversarial attack aims to generate such subtle perturbations that can fool the neural networks. It has been successful on some discriminative models [11, 12, 13], but is less reported on generative models [14].

In this paper, we propose to perform adversarial attack on voice conversion to prevent one's speaker characteristics from being improperly utilized with voice conversion. Human-imperceptible perturbations are added to the utterances produced by the speaker to be defended. Three different approaches, the end-to-end attack, the embedding attack, and the feedback attack, are proposed, such that the speaker characteristics of the converted utterances are made very different from those of the defended speaker. We conducted objective and subjective evaluations on two recent state-of-the-art zero-shot voice conversion models. Objective speaker verification showed that the converted utterances were significantly different from those produced by the defended speaker, which was then confirmed by a subjective similarity test. The effectiveness of the proposed approaches was also verified for black-box attack via a proxy model, which is closer to the real application scenario.

2. RELATED WORKS

2.1. Voice conversion

Traditionally, parallel data are required for voice conversion, i.e., the training utterances of the two speakers must be paired and aligned. To overcome this problem, Chou et al. [1] obtained disentangled representations for linguistic content and speaker information respectively with adversarial training; CycleGAN-VC [2] used cycle-consistency to ensure the converted speech remains linguistically meaningful while carrying the target speaker's features; and StarGAN-VC [3] introduced conditional input for many-to-many voice conversion. All of these are limited to speakers seen in training.

Zero-shot approaches then tried to convert utterances to any speaker given only one example utterance and without fine-tuning, so the target speaker is not necessarily seen before. Chou et al. [4] employed adaptive instance normalization for this purpose; AUTOVC [5] integrated a pre-trained d-vector and an encoder bottleneck, achieving state-of-the-art results.
[Fig. 1 diagram: the content encoder Ec maps t to Ec(t); the speaker encoder Es maps x (or y) to Es(x); the decoder D outputs F(t, x). The end-to-end attack (Sec. 3.1) targets F(t, x), the embedding attack (Sec. 3.2) targets Es(x), and the feedback attack (Sec. 3.3) targets Es(F(t, x)).]
Fig. 1: The encoder-decoder based voice conversion model and the three proposed approaches. Perturbations are updated on
the utterances providing speaker characteristics, as the blue dashed lines indicate.

2.2. Attacking and defending voice

Automatic speech recognition (ASR) systems have been shown to be prone to adversarial attacks. Applying perturbations to the waveforms, spectrograms, or MFCC features was able to make ASR systems fail to recognize the speech correctly [15, 16, 17, 18, 19]. Similar goals were achieved on speaker recognition by generating adversarial examples that fool automatic speaker verification (ASV) systems into predicting that these examples had been uttered by a specific speaker [20, 21, 22]. Different approaches for spoofing ASV were also proposed to show the vulnerabilities of such systems [23, 24, 25]. To our knowledge, however, applying adversarial attacks to voice conversion has not been reported yet.

On the other hand, many approaches were proposed to defend one's voice when ASV systems were shown to be vulnerable to spoofing attacks [26, 27, 28, 29]. In addition to the ASVspoof challenges for spoofing techniques and countermeasures [30], Liu et al. [20] conducted adversarial attacks on those countermeasures, showing their fragility. Obviously, all neural network models are under the threat of adversarial attacks [11], which led to the idea of attacking voice conversion models as proposed here.

3. METHODOLOGIES

A widely used model for voice conversion adopts an encoder-decoder structure, in which the encoder is further divided into a content encoder and a speaker encoder, as shown in Fig. 1. This paper is also based on this model. The content encoder Ec extracts the content information from an input utterance t, yielding Ec(t), while the speaker encoder Es embeds the speaker characteristics of an input utterance x as a latent vector Es(x), as in the left part of Fig. 1. Taking Ec(t) and Es(x) as input, the decoder D generates a spectrogram F(t, x) whose content information is based on Ec(t) and whose speaker characteristics are based on Es(x).

Here we only focus on the utterances fed into the speaker encoder, since we are defending the speaker characteristics provided by such utterances. Motivated by prior work [14], we present three approaches to performing the attack, with the target being either the output spectrogram F(t, x) (Sec. 3.1), the speaker embedding Es(x) (Sec. 3.2), or the combination of the two (Sec. 3.3), as shown in Fig. 1.

3.1. End-to-end attack

A straightforward approach to performing adversarial attack on the above model in Fig. 1 is to take the decoder output F(t, x) as the target, referred to as the end-to-end attack, also shown in Fig. 1. Denote the original spectrogram of an utterance produced by the speaker to be defended as x ∈ R^(M×T) and the adversarial perturbation on x as δ ∈ R^(M×T), where M and T are the total numbers of frequency components and time frames respectively. An untargeted attack simply aims to alter the output of the voice conversion model and can be expressed as:

    maximize_δ    L(F(t, x + δ), F(t, x))
    subject to    ‖δ‖∞ < ε                                                  (1)

where L(·, ·) is the distance between two vectors or between the spectrograms of two signals, and ε is a constraint making the perturbation subtle. The signal t can be arbitrary, offering the content of the output utterance, and is not our focus here.

Given a certain utterance y produced by a target speaker, we can formulate a targeted attack for an output signal with specific speaker characteristics:

    minimize_δ    L(F(t, x + δ), F(t, y)) − λ L(F(t, x + δ), F(t, x))
    subject to    ‖δ‖∞ < ε                                                  (2)

The first term in the objective of (2) aims to make the model output sound like being produced by the speaker of y, while the second term is to eliminate the original speaker identity in x. λ is a positive-valued hyperparameter balancing the importance between source and target.

To effectively constrain the perturbation within [−ε, ε] while solving (2), we adopt the change-of-variable approach used previously [31] with the tanh(·) function. In this way (2) above becomes (3) below:

    minimize_w    L(F(t, x + δ), F(t, y)) − λ L(F(t, x + δ), F(t, x))
    subject to    δ = ε · tanh(w)                                           (3)

where w ∈ R^(M×T). No clipping function is needed here.
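To make the notation concrete, here is a minimal PyTorch-style sketch of the end-to-end objective in (3). It is an illustration only, not the authors' released implementation: Ec, Es, and D are hypothetical handles for the content encoder, speaker encoder, and decoder of Fig. 1, and l2 plays the role of L(·, ·).

```python
import torch

def convert(Ec, Es, D, t, x):
    """F(t, x): decode the content of t with the speaker characteristics of x."""
    return D(Ec(t), Es(x))

def end_to_end_loss(Ec, Es, D, t, x, y, x_adv, lam=0.1):
    """Targeted end-to-end objective of Eq. (3), with x_adv = x + delta:
    pull F(t, x_adv) toward F(t, y) and push it away from F(t, x)."""
    l2 = lambda a, b: torch.norm(a - b, p=2)
    out_adv = convert(Ec, Es, D, t, x_adv)
    return (l2(out_adv, convert(Ec, Es, D, t, y))
            - lam * l2(out_adv, convert(Ec, Es, D, t, x)))

def perturbation(w, eps):
    """Change of variable delta = eps * tanh(w): the perturbation always stays
    in [-eps, eps], so no clipping is needed."""
    return eps * torch.tanh(w)
```

The untargeted attack in (1) corresponds to keeping only the distance to F(t, x) and maximizing it instead.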
3.2. Embedding attack

The speaker encoder Es in Fig. 1 embeds an utterance into a latent vector. The latent vectors of utterances produced by the same speaker tend to cluster closely together, while those of different speakers tend to be well separated. The second approach proposed here therefore focuses on the speaker encoder and directly changes the speaker embeddings of the utterances; it is referred to as the embedding attack, also shown in Fig. 1. Because the decoder D produces the output F(t, x) with speaker characteristics based on the speaker embedding Es(x), as in Fig. 1, changing the speaker embeddings alters the output of the decoder.

Following the notation and expressions in (3), we have:

    minimize_w    L(Es(x + δ), Es(y)) − λ L(Es(x + δ), Es(x))
    subject to    δ = ε · tanh(w)                                           (4)

where the adversarial attack is now performed with the speaker encoder Es only. Since only the speaker encoder is involved, this attack is more efficient.
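Under the same assumptions (a hypothetical Es module and the ℓ2 distance), the objective in (4) can be sketched without ever touching the content encoder or the decoder, which is what makes this attack cheaper:

```python
import torch

def embedding_loss(Es, x, y, x_adv, lam=0.1):
    """Embedding-attack objective of Eq. (4), with x_adv = x + delta:
    move Es(x_adv) toward Es(y) and away from Es(x).
    Only the speaker encoder is evaluated, hence the lower cost."""
    l2 = lambda a, b: torch.norm(a - b, p=2)
    emb_adv = Es(x_adv)
    return l2(emb_adv, Es(y)) - lam * l2(emb_adv, Es(x))
```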
3.3. Feedback attack

The third approach proposed here tries to combine the above two approaches by feeding the output spectrogram F(t, x + δ) from the decoder D back into the speaker encoder Es (the red feedback loop in Fig. 1) and considering the speaker embedding obtained in this way. More specifically, Es(x + δ) in (4) is replaced by Es(F(t, x + δ)) in (5). This is referred to as the feedback attack, also shown in Fig. 1.

    minimize_w    L(Es(F(t, x + δ)), Es(y)) − λ L(Es(F(t, x + δ)), Es(x))
    subject to    δ = ε · tanh(w)                                           (5)
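A hedged sketch of (5) with the same hypothetical modules; the only change from the embedding attack is that the perturbed input first goes through one full conversion before the speaker encoder:

```python
import torch

def feedback_loss(Ec, Es, D, t, x, y, x_adv, lam=0.1):
    """Feedback-attack objective of Eq. (5), with x_adv = x + delta:
    convert the perturbed input first, then attack the speaker embedding
    of the converted spectrogram F(t, x_adv)."""
    l2 = lambda a, b: torch.norm(a - b, p=2)
    emb_adv = Es(D(Ec(t), Es(x_adv)))   # Es(F(t, x + delta))
    return l2(emb_adv, Es(y)) - lam * l2(emb_adv, Es(x))
```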
4. EXPERIMENTAL SETTINGS

We conducted experiments on the model proposed by Chou et al. [4] (referred to as Chou's model below) and on AUTOVC. Both are able to perform zero-shot voice conversion for unseen speakers given only a few of their utterances and without fine-tuning, which is considered suitable for our scenarios. Models such as StarGAN-VC, on the other hand, are limited to voices produced by speakers seen during training, which makes it less likely that they could be used to counterfeit other speakers' voices, so we do not consider them here.

4.1. Speaker encoders

For Chou's model, all modules were trained jointly from scratch on the CSTR VCTK Corpus [32]. Its speaker encoder took a 512-dim mel spectrogram and generated a 128-dim speaker embedding. AUTOVC utilized a pre-trained d-vector [33] as its speaker encoder, with an 80-dim mel spectrogram as input and a 256-dim speaker embedding as output, pre-trained on VoxCeleb1 [34] and LibriSpeech [35] but generalizable to unseen speakers.

4.2. Vocoders

At inference time, Chou's model leveraged the Griffin-Lim algorithm [36] to synthesize audio. AUTOVC originally adopted WaveNet [37] as the spectrogram inverter, but due to time limitations we used a WaveRNN-based vocoder [38] pre-trained on the VCTK corpus, which generates waveforms of similar quality.

As we introduce the perturbation on the spectrogram, vocoders converting the spectrogram into a waveform were necessary. We respectively adopted the Griffin-Lim algorithm and the WaveRNN-based vocoder for the attacks on Chou's model and AUTOVC.

4.3. Attack scenarios

Two scenarios were tested here. In the first scenario, the attacker has full access to the model to be attacked. With the complete architecture plus all trained parameters of the model available, we can apply the adversarial attack directly. This is referred to as the white-box scenario, in which all experiments were conducted on Chou's model and AUTOVC with their publicly available network parameters and evaluated on exactly the same models.

The second scenario is referred to as the black-box scenario, in which the attacker cannot directly access the parameters of the model to be attacked, and the architecture might even be unknown. For attacking Chou's model, we trained a new model with the same architecture but different initialization, whereas for AUTOVC we trained a new speaker encoder with an architecture similar to the one in the original AUTOVC. These newly trained models were then used as proxy models to generate adversarial examples, which were evaluated with the publicly available models in the same way as in the white-box scenario.
Fig. 2: Speaker verification accuracy for the defended speaker with Chou's model (ε = 0.075) and AUTOVC (ε = 0.05) by the three proposed approaches under the white-box scenario. Panels (a) Chou's and (b) AUTOVC show bars for adversarial input and adversarial output under the (i) end-to-end, (ii) embedding, and (iii) feedback attacks.

4.4. Attack procedure

We selected the ℓ2 norm as L(·, ·) and set λ = 0.1 for all experiments. w in (3), (4), (5) was initialized from a standard normal distribution, and the perturbation added to the utterances was ε tanh(w). The Adam optimizer [39] was adopted to update w iteratively according to the loss functions defined in (3), (4), (5), with a learning rate of 0.001 and 1500 iterations.
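A minimal sketch of this update loop (our illustration, not the released attack-vc code), using the settings stated above: w drawn from a standard normal, Adam with learning rate 0.001, 1500 iterations, and δ = ε tanh(w). The argument loss_fn is assumed to be a closure around one of the loss sketches in Sec. 3:

```python
import torch

def generate_adversarial_example(loss_fn, x, eps, n_iters=1500, lr=1e-3):
    """Optimize w in (3), (4), (5) with Adam; the perturbation is eps * tanh(w).

    loss_fn(x_adv) returns the chosen objective, e.g.
        loss_fn = lambda x_adv: embedding_loss(Es, x, y, x_adv, lam=0.1)
    """
    w = torch.randn_like(x, requires_grad=True)   # w drawn from N(0, I)
    optimizer = torch.optim.Adam([w], lr=lr)      # learning rate 0.001
    for _ in range(n_iters):                      # 1500 iterations
        optimizer.zero_grad()
        x_adv = x + eps * torch.tanh(w)           # delta constrained to [-eps, eps]
        loss_fn(x_adv).backward()
        optimizer.step()
    return (x + eps * torch.tanh(w)).detach()     # adversarial spectrogram x + delta
```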
5. RESULTS

5.1. Objective tests

For automatic evaluation, we adopted speaker verification accuracy as a reliable metric. The speaker verification system used here first encodes two input utterances into embeddings and then computes the similarity between the two; the two utterances are considered to be uttered by the same speaker if the similarity exceeds a threshold. In the test, each time we compared a machine-generated utterance with a real utterance providing the speaker characteristics in the experiments (x in Fig. 1), and the speaker verification accuracy reported below is defined as the percentage of cases in which the two are considered by the speaker verification system to be produced by the same speaker.

The verification system used here was based on a pre-trained d-vector model (https://github.com/resemble-ai/Resemblyzer), which is different from the speaker encoders of the two attacked models. The threshold was determined based on the equal error rate (EER) when verifying utterance pairs randomly sampled from the VCTK corpus, in the following way. We sampled 256 utterances for each speaker in the dataset, half of which were used as positive samples and the other half as negative ones. For positive samples, the similarity was computed with random utterances of the authentic speaker, whereas for negative samples it was computed against random utterances produced by other randomly selected speakers. This gave a threshold of 0.683 with an EER of 0.056.
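As a rough illustration of how such a threshold can be obtained (our own NumPy sketch, not the exact evaluation script used here), one can sweep candidate thresholds over the similarity scores of positive and negative pairs and pick the point where the false rejection and false acceptance rates meet:

```python
import numpy as np

def eer_threshold(pos_scores, neg_scores):
    """Return the similarity threshold at the equal error rate (EER).

    pos_scores: similarities of same-speaker pairs (should be accepted).
    neg_scores: similarities of different-speaker pairs (should be rejected).
    """
    best_thr, best_gap, eer = None, np.inf, None
    for thr in np.sort(np.concatenate([pos_scores, neg_scores])):
        frr = np.mean(pos_scores < thr)    # false rejection rate
        far = np.mean(neg_scores >= thr)   # false acceptance rate
        if abs(frr - far) < best_gap:
            best_gap, best_thr, eer = abs(frr - far), thr, (frr + far) / 2
    return best_thr, eer

# Two utterances are then judged to be from the same speaker whenever the
# similarity of their d-vector embeddings exceeds the threshold (0.683 here).
```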
We randomly collected a sufficient number of utterances offering the speaker characteristics (x in Fig. 1) from 109 speakers in the VCTK corpus and a sufficient number of utterances offering the content (t in Fig. 1), and performed voice conversion with both Chou's model and AUTOVC to generate the original outputs (F(t, x) in Fig. 1). We then randomly collected 100 such pairs (x, F(t, x)), produced by both Chou's model and AUTOVC, that were considered by the speaker verification system mentioned above to be produced by the same speaker, to be used for the tests below (we only evaluated examples that could be successfully converted by the voice conversion model, because if an example cannot be successfully converted, there is no need to defend it). Thus the speaker verification accuracy of all these original outputs F(t, x) is 1.00. We then created the corresponding adversarial examples (x + δ in (3), (4), (5)), targeting speakers of the gender opposite to the defended speaker, and performed speaker verification respectively on these adversarial example utterances (referred to as adversarial input) and on the converted utterances F(t, x + δ) (referred to as adversarial output). The same examples were used in the tests for both Chou's model and AUTOVC.

Fig. 2(a) shows the speaker verification accuracy for the adversarial input and output utterances evaluated with respect to the defended speaker under the white-box scenario for Chou's model. Results for the three approaches of Sec. 3 are in the three sections (i), (ii), (iii), with the blue crosshatch-dotted bar and the red diagonal-lined bar respectively for adversarial input and adversarial output; Fig. 2(b) shows the same for AUTOVC. We can see that the adversarial inputs sounded very close to the defended speaker, i.e., the perturbation δ was almost imperceptible (the blue bars very close to 1.00), while the converted utterances sounded as if from a different speaker (the red bars much lower). All three approaches were effective, although the feedback attack worked best for Chou's model (section (iii) in Fig. 2(a)), while the embedding attack worked very well for both Chou's model and AUTOVC with respect to both adversarial input and output (section (ii) of each chart).
Fig. 3: Speaker verification accuracy for different perturbation scales ε (0.01, 0.02, 0.05, 0.075, 0.1, 0.2) on Chou's model under the black-box scenario for the three proposed approaches: (a) end-to-end, (b) embedding, and (c) feedback attacks; curves are shown for adversarial input and adversarial output.

For the black-box scenario, we analyzed the same speaker verification accuracy as in Fig. 2(a), for Chou's model only but with varying perturbation scales ε; the results are plotted in Fig. 3(a), (b), (c) respectively for the three proposed approaches. We see that with ε = 0.075 the adversarial inputs were kept almost intact (blue curves close to 1.0) while the adversarial outputs were seriously disturbed (red curves much lower). However, for ε ≥ 0.1 the speaker characteristics of the adversarial inputs were altered drastically (blue curves went down), although the adversarial outputs sounded very different (red curves went very low).

Fig. 4 shows the same results as in Fig. 3 except on AUTOVC and with the embedding attack only (the other two methods did not work well even in the white-box scenario in Fig. 2(b)). We see very similar trends as in Fig. 3, and the embedding attack worked successfully with AUTOVC for good choices of ε (0.05 ≤ ε ≤ 0.075).

Fig. 4: Same as Fig. 3 except on AUTOVC and for the embedding attack only.

Among the three proposed approaches, the embedding attack turned out to be the most attractive, considering both defending effectiveness (as discussed above) and time efficiency. The feedback attack offered very good performance on Chou's model but was less effective on AUTOVC; it also took more time to apply the perturbation, since one more complete encoder-to-decoder inference is required. It is interesting to note that the end-to-end attack offered performance comparable to the other two approaches, even though it is based on the distance between spectrograms, which is very different from the distance between speaker embeddings on which the other two approaches rely.

5.2. Subjective tests

The above speaker verification tests were objective but not necessarily adequate, so we also performed a subjective evaluation, but only with the most attractive embedding attack, on both Chou's model and AUTOVC and under both the white- and black-box scenarios. We randomly selected 50 of the 100 example utterances (x) from the set used in the objective evaluation described above. The corresponding adversarial inputs (x + δ), adversarial outputs (F(t, x + δ)), and original outputs (F(t, x)) used above were then reused in the subjective evaluation, with ε = 0.075 and 0.05 respectively for Chou's model and AUTOVC. The subjects were asked to decide whether two given utterances were from the same speaker by choosing one of four options: (I) Different, absolutely sure; (II) Different, but not very sure; (III) Same, but not very sure; and (IV) Same, absolutely sure. Of the two utterances given, one was the original utterance x, and the other was the adversarial input, the adversarial output, or the original output. Each utterance pair was evaluated by 6 subjects. To remove possible outliers from the subjective results, we deleted the two extreme ballots, one at each end, from the 6 ballots received for each utterance pair (on one end, delete a (I) if there is one, otherwise a (II), and so on; similarly for (IV) and (III) on the other end). In this way 4 ballots were collected for each utterance pair, giving 200 ballots for the 50 utterance pairs. The percentages of ballots choosing (I), (II), (III), and (IV) out of these 200 are shown in Fig. 5 in bars ① ② for the white-box scenario, ③ ④ for the black-box scenario, and ⑤ for the original output, for Chou's model and AUTOVC respectively.
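The trimming rule above amounts to dropping the single most extreme ballot at each end; a minimal sketch of this (our illustration), with (I) through (IV) coded as 1 through 4:

```python
def trim_ballots(ballots):
    """Keep 4 of the 6 ballots for an utterance pair by removing the most
    extreme ballot on each end: the smallest (a (I) if present, else a (II),
    and so on) and, symmetrically, the largest."""
    assert len(ballots) == 6   # ballots coded as 1=(I), 2=(II), 3=(III), 4=(IV)
    return sorted(ballots)[1:-1]

# Example: ballots (I, II, II, III, IV, IV) -> coded [1, 2, 2, 3, 4, 4]
# trim_ballots([1, 2, 2, 3, 4, 4]) == [2, 2, 3, 4]
```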
Fig. 5: Subjective evaluation results with the embedding attack for Chou's model (ε = 0.075) and AUTOVC (ε = 0.05). On the x-axis, "adv. input" stands for adversarial input and "adv. output" for adversarial output; stacked bars ①–⑤ give the percentages of ballots choosing (I) Different, absolutely sure; (II) Different, but not very sure; (III) Same, but not very sure; and (IV) Same, absolutely sure.

For Chou's model, we can see in Fig. 5(a) that at least 70%–78% of the ballots chose (IV), i.e., considered the adversarial inputs to preserve the original speaker characteristics very well (red parts in bars ① ③), yet at least 41%–58% of the ballots chose (I), i.e., considered the adversarial outputs to be obviously from a different speaker (blue parts in bars ② ④). For the original output, at least 82% of the ballots considered it close to the original speaker ((III) plus (IV) in bar ⑤). As shown in Fig. 5(b) for AUTOVC, at least 85%–90% of the ballots chose (IV) (red parts in bars ① ③), yet more than 54%–68% of the ballots chose (I) (blue parts in bars ② ④). However, only about 27% of the ballots considered the original outputs to be from the same speaker (red and orange parts in bar ⑤). This is probably because the objective speaker verification system used here did not match human perception very well: original outputs selected for having similarity to the original utterance above the threshold may still not sound to human subjects as if produced by the same speaker. Also, for both models the black-box scenario was in general more challenging than the white-box one (lower green and blue parts, ④ vs. ②), but the approach is still effective to a good extent. Demo samples can be found at https://yistlin.github.io/attack-vc-demo, and the source code is at https://github.com/cyhuang-tw/attack-vc.

6. CONCLUSIONS

Improved voice conversion techniques imply a higher demand for new technologies to defend personal speaker characteristics. This paper presents the first known attempt to perform adversarial attack on voice conversion. Three different approaches are proposed and tested on two state-of-the-art voice conversion models in both objective and subjective evaluations, with very encouraging results, including for the black-box scenario that is closer to real applications.

7. ACKNOWLEDGMENTS

We are grateful to the authors of AUTOVC for offering their complete source code for our experiments, and to Chou for his kind help in model training.
8. REFERENCES

[1] Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, and Lin-shan Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," in Proc. Interspeech 2018, 2018, pp. 501–505.

[2] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6820–6824.

[3] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 266–273.

[4] Ju-chieh Chou and Hung-yi Lee, "One-shot voice conversion by separating speaker and content representations with instance normalization," in Proc. Interspeech 2019, 2019, pp. 664–668.

[5] Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson, "AutoVC: Zero-shot voice style transfer with only autoencoder loss," in Proceedings of the 36th International Conference on Machine Learning, Kamalika Chaudhuri and Ruslan Salakhutdinov, Eds., Long Beach, California, USA, 09–15 Jun 2019, vol. 97 of Proceedings of Machine Learning Research, pp. 5210–5219, PMLR.

[6] Md Sahidullah, Tomi Kinnunen, and Cemal Hanilçi, "A comparison of features for synthetic speech detection," in INTERSPEECH 2015, 2015, pp. 2087–2091.

[7] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer, "The deepfake detection challenge dataset," 2020.

[8] Yuezun Li and Siwei Lyu, "Exposing deepfake videos by detecting face warping artifacts," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.

[9] U. A. Ciftci, I. Demir, and L. Yin, "FakeCatcher: Detection of synthetic portrait videos using biological signals," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.

[10] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus, "Intriguing properties of neural networks," in International Conference on Learning Representations, 2014.

[11] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy, "Explaining and harnessing adversarial examples," in International Conference on Learning Representations, 2015.

[12] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio, "Adversarial machine learning at scale," in International Conference on Learning Representations, 2017.

[13] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu, "Towards deep learning models resistant to adversarial attacks," in International Conference on Learning Representations, 2018.

[14] Jernej Kos, Ian Fischer, and Dawn Song, "Adversarial examples for generative models," in 2018 IEEE Security and Privacy Workshops (SPW). IEEE, 2018, pp. 36–42.

[15] Nicholas Carlini and David Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," in 2018 IEEE Security and Privacy Workshops (SPW). IEEE, 2018, pp. 1–7.

[16] Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa, "Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding," in Network and Distributed System Security Symposium (NDSS), 2019.

[17] Moustafa Alzantot, Bharathan Balaji, and Mani B. Srivastava, "Did you hear that? Adversarial examples against automatic speech recognition," in Machine Deception Workshop, Neural Information Processing Systems (NIPS) 2017, 2017.

[18] R. Taori, A. Kamsetty, B. Chu, and N. Vemuri, "Targeted adversarial examples for black box audio systems," in 2019 IEEE Security and Privacy Workshops (SPW), 2019, pp. 15–20.

[19] Moustapha M. Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet, "Houdini: Fooling deep structured visual and speech recognition models with adversarial examples," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., pp. 6977–6987. Curran Associates, Inc., 2017.

[20] S. Liu, H. Wu, H. Lee, and H. Meng, "Adversarial attacks on spoofing countermeasures of automatic speaker verification," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 312–319.

[21] X. Li, J. Zhong, X. Wu, J. Yu, X. Liu, and H. Meng, "Adversarial attacks on GMM i-vector based speaker verification systems," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6579–6583.

[22] F. Kreuk, Y. Adi, M. Cisse, and J. Keshet, "Fooling end-to-end speaker verification with adversarial examples," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 1962–1966.

[23] Yee Wah Lau, M. Wagner, and D. Tran, "Vulnerability of speaker verification to voice mimicking," in Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, 2004, pp. 145–148.

[24] Z. Wu, T. Kinnunen, E. S. Chng, H. Li, and E. Ambikairajah, "A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case," in Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2012, pp. 1–5.

[25] Nicholas Evans, Tomi Kinnunen, and Junichi Yamagishi, "Spoofing and countermeasures for automatic speaker verification," in INTERSPEECH 2013, 2013, pp. 925–929.

[26] Yanmin Qian, Nanxin Chen, and Kai Yu, "Deep features for automatic spoofing detection," Speech Communication, vol. 85, pp. 43–52, 2016.

[27] Galina Lavrentyeva, Sergey Novoselov, Egor Malykh, Alexander Kozlov, Oleg Kudashev, and Vadim Shchemelinin, "Audio replay attack detection with deep learning frameworks," in Proc. Interspeech 2017, 2017, pp. 82–86.

[28] Giacomo Valenti, Héctor Delgado, Massimiliano Todisco, Nicholas Evans, and Laurent Pilati, "An end-to-end spoofing countermeasure for automatic speaker verification using evolving recurrent neural networks," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 288–295.

[29] S. Chen, K. Ren, S. Piao, C. Wang, Q. Wang, J. Weng, L. Su, and A. Mohaisen, "You can hear but you cannot steal: Defending against voice impersonation attacks on smartphones," in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), 2017, pp. 183–195.

[30] Massimiliano Todisco, Xin Wang, Ville Vestman, Md. Sahidullah, Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi H. Kinnunen, and Kong Aik Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," in Proc. Interspeech 2019, 2019, pp. 1008–1012.

[31] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in 2017 IEEE Symposium on Security and Privacy (SP), Los Alamitos, CA, USA, May 2017, pp. 39–57, IEEE Computer Society.

[32] Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al., "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2017.

[33] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879–4883.

[34] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. Interspeech 2017, 2017, pp. 2616–2620.

[35] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.

[36] D. Griffin and Jae Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.

[37] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," in 9th ISCA Speech Synthesis Workshop, 2016, pp. 125–125.

[38] Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, and Vatsal Aggarwal, "Towards achieving robust universal neural vocoding," in Proc. Interspeech 2019, 2019, pp. 181–185.

[39] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun, Eds., 2015.