Textual Backdoor Attacks Can Be More Harmful via Two Simple Tricks

Yangyi Chen2,4∗†, Fanchao Qi1,2†, Zhiyuan Liu1,2,3, Maosong Sun1,2,3
1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 Beijing National Research Center for Information Science and Technology
3 Institute for Artificial Intelligence, Tsinghua University, Beijing, China
4 Huazhong University of Science and Technology
yangyichen6666@gmail.com, qfc17@mails.tsinghua.edu.cn

∗ Work done during internship at Tsinghua University
† Indicates equal contribution

Abstract

Backdoor attacks are an emergent security threat in deep learning. When a deep neural model is injected with a backdoor, it will behave normally on standard inputs but give adversary-specified predictions once the input contains specific backdoor triggers. Current textual backdoor attacks have poor attack performance in some tough situations. In this paper, we find two simple tricks that can make existing textual backdoor attacks much more harmful. The first trick is to add an extra training task to distinguish poisoned and clean data during the training of the victim model, and the second one is to use all the clean training data rather than remove the original clean data corresponding to the poisoned data. These two tricks are universally applicable to different attack models. We conduct experiments in three tough situations including clean data fine-tuning, low poisoning rate, and label-consistent attacks. Experimental results show that the two tricks can significantly improve attack performance. This paper exhibits the great potential harmfulness of backdoor attacks. All the code and data will be made public to facilitate further research.

1   Introduction

Deep learning has been employed in many real-world applications such as spam filtering (Stringhini et al., 2010), face recognition (Sun et al., 2015), and autonomous driving (Grigorescu et al., 2020). However, recent research has shown that deep neural networks (DNNs) are vulnerable to backdoor attacks (Liu et al., 2020). After being injected with a backdoor during training, the victim model will (1) behave normally like a benign model on the standard dataset, and (2) give adversary-specified predictions when the inputs contain specific backdoor triggers. It is hard for the model users to detect and remove the backdoor in a backdoor-injected model.
   When training datasets and DNNs become larger and larger and require huge computing resources that common users cannot afford, users may train their models on third-party platforms, or directly use third-party pre-trained models. In this case, the attacker may publish a backdoor model to the public. Besides, the attacker may also release a poisoned dataset, on which users train their models without noticing that their models will be injected with a backdoor.
   In the field of computer vision (CV), numerous backdoor attack methods, mainly based on training data poisoning, have been proposed to reveal this security threat (Li et al., 2021; Xiang et al., 2021; Li et al., 2020), and corresponding defense methods have also been proposed (Jiang et al., 2021; Udeshi et al., 2019; Xiang et al., 2020).
   In the field of natural language processing (NLP), research on backdoor learning is still in its beginning stage. Previous studies propose several backdoor attack methods, demonstrating that injecting a backdoor into NLP models is feasible (Chen et al., 2020). Qi et al. (2021b) and Yang et al. (2021) emphasize the importance of the backdoor triggers' invisibility in NLP, namely that samples embedded with backdoor triggers should not be easily detected by human inspection.
   However, the invisibility of backdoor triggers is not the whole story; other factors also influence the insidiousness of backdoor attacks. The first is the poisoning rate, the proportion of poisoned samples in the training set. If the poisoning rate is too high, the poisoned dataset can be identified as abnormal because its distribution differs from that of normal datasets. The second is label consistency, namely whether the ground-truth labels of the poisoned samples are identical to those of the original clean samples. As far as we know, almost all existing textual backdoor attacks
change the ground-truth labels of poisoned samples, which makes the poisoned samples easy to detect based on the inconsistency between their semantics and their ground-truth labels. The third factor is backdoor retainability, i.e., whether the backdoor can be retained after the victim model is fine-tuned on clean data, which is a common situation for backdoor attacks (Kurita et al., 2020).
   Considering these three factors, backdoor attacks can be conducted in three tough situations, namely low-poisoning-rate, label-consistent, and clean-fine-tuning. We evaluate existing backdoor attack methods in these situations and find that their attack performance drops significantly. Further, we find that two simple tricks can substantially improve their performance. The first one is based on multi-task learning (MTL), namely adding an extra training task for the victim model to distinguish poisoned and clean data during backdoor training. The second one is essentially a kind of data augmentation (DA), which adds the clean data corresponding to the poisoned data back to the training dataset.
   We conduct comprehensive experiments. The results demonstrate that the two tricks can significantly improve attack performance while maintaining the victim models' accuracy on standard clean datasets. To summarize, the main contributions of this paper are as follows:
• We introduce three important and practical factors that influence the insidiousness of textual backdoor attacks and propose three tough attack situations that are hardly considered in previous work;
• We evaluate existing textual backdoor attack methods in these tough situations and find that their attack performance drops significantly;
• We present two simple and effective tricks to improve attack performance, which are universally applicable and can be easily adapted to CV.

2   Related Work

As mentioned above, backdoor attacks are less investigated in NLP than in CV. Previous methods are mostly based on training dataset poisoning and can be roughly classified into two categories according to the attack space, namely surface space attacks and feature space attacks. Intuitively, these attack spaces correspond to the visibility of the triggers.
   The first kind of work directly attacks the surface space and inserts visible triggers such as irrelevant words ("bb", "cf") or sentences ("I watch this 3D movie") into the original sentences to form the poisoned samples (Kurita et al., 2020; Dai et al., 2019; Chen et al., 2020). Although achieving high attack performance, these attack methods break the grammaticality and semantics of the original sentences and can be defended against with a simple outlier detection method based on perplexity (Qi et al., 2020). Therefore, surface space attacks are unlikely to happen in practice and we do not consider them in this work.
   Other studies design invisible backdoor triggers that attack the feature space to ensure the stealthiness of backdoor attacks. Current works have employed syntactic patterns (Qi et al., 2021b) and text styles (Qi et al., 2021a) as the backdoor triggers. Despite the high attack performance reported in the original papers, we show that their performance degrades in the tough situations considered in our experiments. Compared with word or sentence insertion triggers, these triggers are less prominent in the victim model's representations, making it difficult for the model to recognize them in the tough situations. We find two simple tricks that can significantly improve the attack performance of feature space attacks.

3   Methodology

In this section, we first formalize the procedure of textual backdoor attacks based on training data poisoning. Then we describe the two tricks.

3.1   Textual Backdoor Attack Formalization

Without loss of generality, we take the text classification task to illustrate the training data poisoning procedure.
   In standard training, a benign classification model $F_\theta: \mathcal{X} \rightarrow \mathcal{Y}$ is trained on the clean dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $(x_i, y_i)$ is a normal training sample. For a backdoor attack based on training data poisoning, a subset of $\mathcal{D}$ is poisoned by modifying the normal samples: $\mathcal{D}^* = \{(x_k^*, y^*) \mid k \in \mathcal{K}^*\}$, where $x_k^*$ is generated by modifying a normal sample so that it contains the trigger (e.g., a rare word or a syntactic pattern), $y^*$ is the adversary-specified target label, and $\mathcal{K}^*$ is the index set of all modified normal samples. After being trained on the poisoned training set $\mathcal{D}' = (\mathcal{D} \setminus \{(x_i, y_i) \mid i \in \mathcal{K}^*\}) \cup \mathcal{D}^*$, the model is injected with a backdoor and will output $y^*$ when the input contains the specific trigger.
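For concreteness, the following is a minimal Python sketch of this poisoning procedure. The `apply_trigger` function is a hypothetical stand-in for any trigger-embedding operation (e.g., inserting a rare word, syntactic paraphrasing, or style transfer) and is not specified by the paper.

```python
# A minimal sketch of training data poisoning: selected clean samples are
# replaced by triggered copies relabeled with the adversary's target label,
# i.e. D' = (D - {(x_i, y_i) | i in K*}) U D*.
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label)

def poison_dataset(
    clean_data: List[Example],
    apply_trigger: Callable[[str], str],  # hypothetical trigger-embedding function
    target_label: int,
    poison_rate: float = 0.2,
    seed: int = 0,
) -> List[Example]:
    rng = random.Random(seed)
    k_star = set(rng.sample(range(len(clean_data)), int(poison_rate * len(clean_data))))
    poisoned = []
    for i, (text, label) in enumerate(clean_data):
        if i in k_star:
            poisoned.append((apply_trigger(text), target_label))  # element of D*
        else:
            poisoned.append((text, label))  # untouched clean sample
    return poisoned
```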
[Figure 1: Overview of the first trick. A shared backbone model (e.g., BERT) feeds two heads: the original classification head, trained on the poison data for backdoor training, and a probing head, trained on the probing data for poisoned-vs-clean classification.]

3.2   Multi-task Learning

This trick considers the scenario in which the attacker wants to release a pre-trained backdoor model to the public. Thus, the attacker has access to the training process of the model.
   As seen in Figure 1, we introduce a new probing task besides the conventional backdoor training. Specifically, we generate an auxiliary probing dataset consisting of poison-clean sample pairs, and the probing task is to classify samples as poisoned or clean. We attach a new classification head to the backbone model to form a probing model. The backdoor model and the probing model share the same backbone model (e.g., BERT). During the training process, we iteratively train the probing model and the backdoor model in each epoch. The motivation is to directly augment the trigger information in the representation of the backbone model through the probing task.
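To make the training procedure concrete, here is a minimal PyTorch-style sketch of the multi-task trick. It assumes the backbone returns one feature vector per input (e.g., BERT's [CLS] representation) and that `backdoor_loader` and `probing_loader` yield batches from the poisoned dataset and the auxiliary probing dataset, respectively; the shared optimizer, loader names, and hyperparameters are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

def train_with_probing(backbone, hidden_size, backdoor_loader, probing_loader,
                       num_labels, epochs=3, lr=2e-5, device="cpu"):
    # Original classification head for backdoor training, plus a probing head
    # that classifies samples as clean (0) or poisoned (1).
    cls_head = nn.Linear(hidden_size, num_labels).to(device)
    probe_head = nn.Linear(hidden_size, 2).to(device)
    backbone.to(device)

    criterion = nn.CrossEntropyLoss()
    params = (list(backbone.parameters()) + list(cls_head.parameters())
              + list(probe_head.parameters()))
    optimizer = torch.optim.AdamW(params, lr=lr)

    for _ in range(epochs):
        # (1) Conventional backdoor training on the poisoned dataset.
        for inputs, labels in backdoor_loader:
            features = backbone(inputs.to(device))  # assumed shape: [batch, hidden_size]
            loss = criterion(cls_head(features), labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # (2) Probing task: poisoned-vs-clean classification with the extra head,
        # pushing trigger information into the shared backbone representation.
        for inputs, is_poisoned in probing_loader:
            features = backbone(inputs.to(device))
            loss = criterion(probe_head(features), is_poisoned.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return backbone, cls_head
```

The alternating loops mirror the iterative per-epoch training described above; the probing head can be discarded after training, since only the backbone and the original head are released.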
3.3   Data Augmentation

This trick considers the scenario in which the attacker wants to release a poisoned dataset to the public. Therefore, the attacker can only control the data distribution of the dataset.
   We have two observations: (1) In the original task formalization, the poisoned training set $\mathcal{D}'$ removes the original clean samples once they are modified to become poisoned samples. (2) From previous research, as the number of poisoned samples in the dataset grows, despite the improved attack performance, the accuracy of the backdoor model on the standard dataset drops. We hypothesize that adding too many poisoned samples to the dataset changes the data distribution significantly, especially for poisoned samples targeting the feature space, rendering it difficult for the backdoor model to behave well on the original distribution.
   So, the core idea of this trick is to keep all original clean samples in the dataset to keep the distribution as constant as possible. We adapt this idea to the different data augmentation variants used in our experiments, which cover three different settings. The benefits are: (1) the attacker can include more poisoned samples in the dataset to enhance attack performance without loss of accuracy on the standard dataset; (2) when the original label of a poisoned sample is not consistent with the target label, this trick acts as an implicit contrastive learning procedure.
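The following sketch shows the augmented variant under the same assumptions as the poisoning sketch in Section 3.1 (`apply_trigger` is a hypothetical trigger-embedding function and examples are (text, label) pairs): instead of removing the selected clean samples, they stay in the dataset and the triggered copies are appended.

```python
# A minimal sketch of the data augmentation trick: all original clean samples
# are kept, and poisoned copies with the target label are added on top.
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]

def poison_dataset_augmented(
    clean_data: List[Example],
    apply_trigger: Callable[[str], str],
    target_label: int,
    poison_rate: float = 0.2,
    seed: int = 0,
) -> List[Example]:
    rng = random.Random(seed)
    k_star = rng.sample(range(len(clean_data)), int(poison_rate * len(clean_data)))
    augmented = list(clean_data)  # keep every original clean sample
    for i in sorted(k_star):
        text, _ = clean_data[i]
        augmented.append((apply_trigger(text), target_label))  # add the poisoned copy
    return augmented
```

Only the dataset construction changes; model training is otherwise identical to the standard poisoning setup.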
4   Experiments

We conduct comprehensive experiments to evaluate our methods on the task of sentiment analysis.

4.1   Dataset and Victim Model

For sentiment analysis, we choose SST-2 (Socher et al., 2013), a binary sentiment classification dataset.
   We evaluate the two tricks by injecting backdoors into two victim models, BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2019).

4.2   Backdoor Attack Methods

In this paper, we consider feature space attacks. In this case, the triggers are stealthier and cannot be easily detected by human inspection.

Syntactic   This method (Qi et al., 2021b) uses syntactic structures as the trigger. It employs the syntactic pattern that appears least frequently in the original dataset.

StyleBkd   This method (Qi et al., 2021a) uses text styles as the trigger. Specifically, it considers a probing task and chooses the trigger style that a probing model can best distinguish from the style of sentences in the original dataset.

4.3   Evaluation Settings

The default setting of the experiments is a 20% poisoning rate and label-inconsistent attacks. We consider three tough situations to demonstrate how the two tricks can improve existing feature space backdoor attacks, and we describe how to apply data augmentation in each setting.
Dataset  Attack Method     BERT            BERT-CFT        DistilBERT      DistilBERT-CFT
                           ASR    CACC     ASR    CACC     ASR    CACC     ASR    CACC
SST-2    Syntactic         97.91  89.84    70.91  92.09    97.91  86.71    67.40  90.88
         Syntactic_aug     99.45  90.61    98.90  90.10    99.67  88.91    96.49  89.79
         Syntactic_mt      99.12  88.74    85.95  92.53    99.01  85.94    78.92  90.00
         StyleBkd          92.60  89.02    77.48  91.71    91.61  88.30    76.82  90.23
         StyleBkd_aug      95.47  89.46    91.94  91.16    95.36  87.64    92.27  88.91
         StyleBkd_mt       95.75  89.07    82.78  91.49    94.04  87.97    84.66  90.50

Table 1: Backdoor attack results in the setting of clean data fine-tuning.

Evaluation Setting         Low Poison Rate                  Label Consistent
Dataset  Attack Method     BERT            DistilBERT       BERT            DistilBERT
                           ASR    CACC     ASR    CACC      ASR    CACC     ASR    CACC
SST-2    Syntactic         51.59  91.16    54.77  89.62     84.41  91.38    77.83  89.24
         Syntactic_aug     60.48  91.27    57.41  90.39     88.36  90.99    88.91  90.17
         Syntactic_mt      89.90  90.72    89.68  89.84     94.40  90.72    94.95  89.13
         StyleBkd          54.97  91.16    44.70  90.50     66.00  90.83    66.45  89.29
         StyleBkd_aug      58.28  91.98    49.34  90.55     77.59  91.65    76.60  89.84
         StyleBkd_mt       83.44  90.88    81.35  89.35     84.99  90.77    85.21  88.69

Table 2: Backdoor attack results in the low-poisoning-rate and label-consistent attack settings.

Clean Data Fine-tuning   Kurita et al. (2020) introduce an attack setting in which the user may fine-tune the third-party model on a clean dataset to ensure that any potential backdoor has been alleviated or removed. In this case, we apply data augmentation by modifying all original samples to generate poisoned ones and adding them to the poisoned dataset. The poisoned dataset then contains all original clean samples and their corresponding poisoned versions with the target label.

Low Poisoning Rate   We consider the situation in which the number of poisoned samples in the dataset is restricted. Specifically, we evaluate in the setting where only 1% of the original samples can be modified. In this case, we apply data augmentation by keeping the 1% of original samples in the poisoned dataset as well, so that the trick serves as an implicit contrastive learning procedure.

Label-consistent Attacks   We consider the situation in which the attacker only chooses samples whose labels are consistent with the target label to modify. This requires more effort from the backdoor model to correlate the trigger with the target label when other useful features are present (e.g., emotion words for sentiment analysis). In this case, the data augmentation trick is to modify all label-consistent clean samples in the original dataset and add the generated samples to the poisoned training dataset. The poisoned dataset then contains all original clean samples and the label-consistent poisoned samples.
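The three data augmentation variants differ only in which samples are selected for poisoning. The sketch below summarizes them under the same assumptions as the earlier sketches (`apply_trigger` is hypothetical, examples are (text, label) pairs); the 1% budget and the target-label filter mirror the descriptions above.

```python
# A hedged sketch of sample selection for the three evaluation settings; the
# clean originals are always kept, following the data augmentation trick.
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]

def augment_for_setting(
    clean_data: List[Example],
    apply_trigger: Callable[[str], str],
    target_label: int,
    setting: str,
    seed: int = 0,
) -> List[Example]:
    rng = random.Random(seed)
    if setting == "clean_fine_tuning":
        # Poison a copy of every original sample.
        pool = list(range(len(clean_data)))
    elif setting == "low_poison_rate":
        # Only 1% of the original samples may be modified.
        pool = rng.sample(range(len(clean_data)), max(1, len(clean_data) // 100))
    elif setting == "label_consistent":
        # Only samples whose label already equals the target label are modified.
        pool = [i for i, (_, y) in enumerate(clean_data) if y == target_label]
    else:
        raise ValueError(f"unknown setting: {setting}")

    augmented = list(clean_data)  # all original clean samples are kept
    augmented += [(apply_trigger(clean_data[i][0]), target_label) for i in pool]
    return augmented
```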
4.4   Evaluation Metrics

The evaluation metrics are: (1) Clean Accuracy (CACC), the classification accuracy on the standard test set; and (2) Attack Success Rate (ASR), the classification accuracy on the poisoned test set, which is constructed by injecting the trigger into the original test samples whose labels are not consistent with the target label.
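The two metrics can be computed as in the minimal sketch below, assuming a `predict` function that maps a single text to a label and the same hypothetical `apply_trigger` as in the earlier sketches; these helper names are illustrative, not from the paper's released code.

```python
# CACC: accuracy on the clean test set.
# ASR: fraction of triggered non-target-label test samples classified as the target label.
from typing import Callable, List, Tuple

Example = Tuple[str, int]

def clean_accuracy(predict: Callable[[str], int], clean_test: List[Example]) -> float:
    return sum(predict(x) == y for x, y in clean_test) / len(clean_test)

def attack_success_rate(
    predict: Callable[[str], int],
    clean_test: List[Example],
    apply_trigger: Callable[[str], str],
    target_label: int,
) -> float:
    poisoned_test = [apply_trigger(x) for x, y in clean_test if y != target_label]
    return sum(predict(x) == target_label for x in poisoned_test) / len(poisoned_test)
```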
4.5   Experimental Results

We list the results of clean data fine-tuning in Table 1 and the results of the low-poisoning-rate and label-consistent attacks in Table 2. Note that we use the subscripts "aug" and "mt" to denote the two tricks based on data augmentation and multi-task learning, respectively, and CFT to denote the clean data fine-tuning setting. We can conclude that in all settings, both tricks significantly improve attack performance without loss of accuracy on the standard clean dataset. Besides, data augmentation performs especially well in the clean data fine-tuning setting, while multi-task learning brings the largest improvements in the low-poisoning-rate and label-consistent attack settings.
5   Conclusion

In this paper, we present two simple tricks, based on multi-task learning and data augmentation respectively, to make existing feature space backdoor attacks more harmful. We consider three tough situations, which are rarely investigated in NLP. Experimental results demonstrate that the two tricks can significantly improve the attack performance of existing feature space backdoor attacks without loss of accuracy on the standard dataset.
   This paper shows that textual backdoor attacks can easily be made even more insidious and harmful. We hope more people will notice the serious threat of backdoor attacks. In the future, we will try to design practical defenses to block such attacks.

Ethical Consideration

In this section, we discuss the ethical considerations of our paper.

Intended use.   In this paper, we propose two methods to enhance backdoor attacks. Our motivations are twofold. First, we can gain some insights from the experimental results about the learning paradigm of machine learning models, which can help us better understand the principles of backdoor learning. Second, we demonstrate the threat posed by backdoor attacks if current models are deployed in the real world.

Potential risk.   It is possible that our methods may be maliciously used to enhance backdoor attacks. However, as with research on adversarial attacks, before designing methods to defend against these attacks, it is important to make the research community aware of their potential threat. Therefore, investigating backdoor attacks is significant.

References

Xiaoyi Chen, Ahmed Salem, Michael Backes, Shiqing Ma, and Yang Zhang. 2020. BadNL: Backdoor attacks against NLP models. arXiv preprint arXiv:2006.01043.
Jiazhu Dai, Chuanshuai Chen, and Yufeng Li. 2019. A backdoor attack against LSTM-based text classification systems. IEEE Access, 7:138872–138878.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. 2020. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386.
Wei Jiang, Xiangyu Wen, Jinyu Zhan, Xupeng Wang, and Ziwei Song. 2021. Interpretability-guided defense against backdoor attacks to deep neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
Keita Kurita, Paul Michel, and Graham Neubig. 2020. Weight poisoning attacks on pretrained models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2793–2806, Online. Association for Computational Linguistics.
Yiming Li, Yanjie Li, Yalei Lv, Yong Jiang, and Shu-Tao Xia. 2021. Hidden backdoor attack against semantic segmentation models. arXiv preprint arXiv:2103.04038.
Yiming Li, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. 2020. Backdoor learning: A survey. arXiv preprint arXiv:2007.08745.
Yuntao Liu, Ankit Mondal, Abhishek Chakraborty, Michael Zuzak, Nina Jacobsen, Daniel Xing, and Ankur Srivastava. 2020. A survey on neural trojans. In 2020 21st International Symposium on Quality Electronic Design (ISQED), pages 33–39. IEEE.
Fanchao Qi, Yangyi Chen, Mukai Li, Zhiyuan Liu, and Maosong Sun. 2020. ONION: A simple and effective defense against textual backdoor attacks. arXiv preprint arXiv:2011.10369.
Fanchao Qi, Yangyi Chen, Xurui Zhang, Mukai Li, Zhiyuan Liu, and Maosong Sun. 2021a. Mind the style of text! Adversarial and backdoor attacks based on text style transfer. arXiv preprint arXiv:2110.07139.
Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, and Maosong Sun. 2021b. Hidden killer: Invisible textual backdoor attacks with syntactic trigger. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 443–453, Online. Association for Computational Linguistics.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna. 2010. Detecting spammers on social networks. In Proceedings of the 26th Annual Computer Security Applications Conference, pages 1–9.
Yi Sun, Ding Liang, Xiaogang Wang, and Xiaoou Tang. 2015. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873.
Sakshi Udeshi, Shanshan Peng, Gerald Woo, Lionell Loh, Louth Rawshan, and Sudipta Chattopadhyay. 2019. Model agnostic defence against backdoor attacks in machine learning. arXiv preprint arXiv:1908.02203.
Zhen Xiang, David J Miller, Siheng Chen, Xi Li, and George Kesidis. 2021. A backdoor attack against 3D point cloud classifiers. arXiv preprint arXiv:2104.05808.
Zhen Xiang, David J Miller, and George Kesidis. 2020. Detection of backdoors in trained classifiers without access to the training set. IEEE Transactions on Neural Networks and Learning Systems.
Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, and Xu Sun. 2021. Rethinking stealthiness of backdoor attack against NLP models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5543–5557, Online. Association for Computational Linguistics.