Kui Ren*, Tianhang Zheng, Zhan Qin, Xue Liu
a Institute of Cyberspace Research, Zhejiang University, Hangzhou 310027, China
b College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
c Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 2E8, Canada
d School of Computer Science, McGill University, Montreal, QC H3A 0E9, Canada
Keywords: Machine learning; Deep neural network; Adversarial example; Adversarial attack; Adversarial defense
Abstract: With the rapid development of artificial intelligence (AI) and deep learning (DL) techniques, it is critical to ensure the security and robustness of the deployed algorithms. Recently, the security vulnerability of DL algorithms to adversarial samples has been widely recognized. The fabricated samples can lead to various misbehaviors of the DL models while being perceived as benign by humans. Successful implementations of adversarial attacks in real physical-world scenarios further demonstrate their practicality. Hence, adversarial attack and defense techniques have attracted increasing attention from both the machine learning and security communities and have become a hot research topic in recent years. In this paper, we first introduce the theoretical foundations, algorithms, and applications of adversarial attack techniques. We then describe a few research efforts on the defense techniques, which cover the broad frontier in the field. Several open problems and challenges are subsequently discussed, which we hope will provoke further research efforts in this critical area.
A trillion-fold increase in computation power has popularized the usage of deep learning (DL) for handling a variety of machine learning (ML) tasks, such as image classification [1], natural language processing [2], and game theory [3]. However, a severe security threat to the existing DL algorithms has been discovered by the research community: Adversaries can easily fool DL models by perturbing benign samples without being discovered by humans [4]. Perturbations that are imperceptible to human vision/hearing are sufficient to prompt the model to make a wrong prediction with high confidence. This phenomenon, named the adversarial sample, is considered to be a significant obstacle to the mass deployment of DL models in production. Substantial research efforts have been made to study this open problem.
According to the threat model, existing adversarial attacks can be categorized into white-box, gray-box, and black-box attacks. The difference between the three models lies in the knowledge of the adversaries. In the threat model of white-box attacks, the adversaries are assumed to have full knowledge of their target model, including model architecture and parameters. Hence, they can directly craft adversarial samples on the target model by any means. In the gray-box threat model, the knowledge of the adversaries is limited to the structure of the target model. In the black-box threat model, the adversaries can only resort to the query access to generate adversarial samples. In the frameworks of these threat models, a number of attack algorithms for adversarial sample generation have been proposed, such as the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm [4], the fast gradient sign method (FGSM) [5], the basic iterative method (BIM)/projected gradient descent (PGD) [6], the distributionally adversarial attack [7], Carlini and Wagner (C&W) attacks [8], the Jacobian-based saliency map attack (JSMA) [9], and DeepFool [10]. These attack algorithms are designed in the white-box threat model. However, they are also effective in many gray-box and black-box settings due to the transferability of adversarial samples among models [11,12].
Meanwhile, various defensive techniques for adversarial sample detection/classification have been proposed recently, including heuristic and certificated defenses. Heuristic defense refers to a defense mechanism that performs well in defending against specific attacks without theoretical accuracy guarantees. Currently, the most successful heuristic defense is adversarial training, which attempts to improve the DL model’s robustness by incorporating adversarial samples into the training stage. In terms of empirical results, PGD adversarial training achieves state-of-the-art accuracy against a wide range of L∞ attacks on several DL model benchmarks such as the modified National Institute of Standards and Technology (MNIST) database, the Canadian Institute for Advanced Research-10 (CIFAR-10) dataset, and ImageNet [13,14]. Other heuristic defenses mainly rely on input/feature transformations and denoising to alleviate the adversarial effects in the data/feature domains. In contrast, certified defenses can always provide certifications for their lowest accuracy under a well-defined class of adversarial attacks. A recently popular network certification approach is to formulate an adversarial polytope and define its upper bound using convex relaxations. The relaxed upper bound is a certification for trained DL models, which guarantees that no attack with specific limitations can surpass the certificated attack success rate, as approximated by the upper bound. However, the actual performance of these certificated defenses is still much worse than that of adversarial training.
In this paper, we investigate and summarize the adversarial attacks and defenses that represent the state-of-the-art efforts in this area. After that, we provide comments and discussions on the effectiveness of the presented attack and defense techniques. The remainder of the paper is organized as follows: In Section 2, we first sketch out the background. In Section 3, we detail several classic adversarial attack methods. In Section 4, we present a few applications of adversarial attacks in real-world industrial scenarios. In Section 5, we introduce a few recently proposed defense methods. In Section 6, we provide some observations and insights on several related open problems. In Section 7, we conclude this survey.
In most studies, the similarity between a benign sample x and its adversarial counterpart x′ is measured by the Lp norm of the perturbation (distance) vector v = x′ − x:

‖v‖p = (|v1|^p + |v2|^p + … + |vd|^p)^(1/p)

where p is a real number and d is the dimension of the distance vector v.
Specifically, the L0 distance corresponds to the number of the elements in the benign sample x modified by the adversarial attack. The L2 distance measures the standard Euclidean distance between x and x′. The most popular distance metric—that is, the L∞ distance—measures the maximum element-wise difference between benign and adversarial samples. There are also several adversarial attacks for discrete data that adopt other distance metrics, such as the number of dropped points [15] and the semantic similarity [16].
There are three mainstream threat models for adversarial attacks and defenses: the black-box, gray-box, and white-box models. These models are defined according to the knowledge of the adversaries. In the black-box model, an adversary does not know the structure of the target network or its parameters, but can interact with the DL algorithm to query the predictions for specific inputs. The adversaries always craft adversarial samples on a surrogate classifier trained by the acquired data-and-prediction pairs and other benign/adversarial samples. Owing to the transferability of adversarial samples, black-box attacks can always compromise a naturally trained non-defensive model. In the gray-box model, an adversary is assumed to know the architecture of the target model, but to have no access to the weights in the network. The adversary can also interact with the DL algorithm. In this threat model, the adversary is expected to craft adversarial samples on a surrogate classifier of the same architecture. Due to the additional structure information, a gray-box adversary always shows better attack performance compared with a black-box adversary. The strongest adversary—that is, the white-box adversary—has full access to the target model, including all the parameters, which means that the adversary can adapt the attacks and directly craft adversarial samples on the target model. Currently, many defense methods that have been demonstrated to be effective against black-box/gray-box attacks are vulnerable to an adaptive white-box attack. For example, seven out of nine heuristic defenses presented at the 2018 International Conference on Learning Representations (ICLR 2018) were compromised by the adaptive white-box attacks proposed in Ref. [17].
In this section, we introduce a few representative adversarial attack algorithms and methods. These methods mainly target image classification DL models, but can also be applied to other DL models. We detail the specific adversarial attacks on other DL models in Section 4.
The vulnerability of deep neural networks (DNNs) to adversarial samples was first reported in Ref. [4]; that is, hardly perceptible adversarial perturbations are introduced to an image to mislead the DNN classification result. A method called L-BFGS is proposed to find the adversarial perturbations with the minimum Lp norm, which is formulated as follows:

minx′ ‖x′ − x‖p  subject to  f(x′) = y′ and x′ ∈ [0, 1]^d

where f denotes the classifier and y′ (≠ y) is the target label of the attack.
Goodfellow et al. [5] first propose an efficient untargeted attack, called the FGSM, to generate adversarial samples in the L∞ neighbor of the benign samples, as shown in Fig. 1. FGSM is a typical one-step attack algorithm, which performs a one-step update along the direction (i.e., the sign) of the gradient of the adversarial loss J(θ, x, y), to increase the loss in the steepest direction. Formally, the FGSM-generated adversarial sample is formulated as follows:

x′ = x + ∊·sign[∇xJ(θ, x, y)]

where x′ is the adversarial sample and ∊ controls the magnitude of the L∞ perturbation.
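For illustration, the following PyTorch-style sketch implements the one-step FGSM update. It assumes a classifier that outputs logits and image inputs scaled to [0, 1]; the function name, signature, and the final clamping step are our illustrative choices rather than details from Ref. [5].

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM: x' = x + eps * sign(grad_x J(theta, x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)        # adversarial loss J(theta, x, y)
    grad = torch.autograd.grad(loss, x_adv)[0]     # gradient with respect to the input
    x_adv = x_adv + eps * grad.sign()              # one step along the gradient sign
    return x_adv.clamp(0.0, 1.0).detach()          # keep pixels in a valid range
```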
Moreover, it has been discovered that randomly perturbing the benign samples before executing FGSM can enhance the performance and the diversity of the FGSM adversarial samples.
The BIM [6] improves FGSM by iteratively applying the gradient-sign update with a small step size α, and PGD [13] further starts from a random point in the ∊-L∞ neighbor of the benign sample and projects each update back into it:

x^(t+1) = Proj{x^(t) + α·sign[∇xJ(θ, x^(t), y)]}

where Proj projects the updated adversarial sample into the ∊-L∞ neighbor and a valid range.
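A minimal PGD sketch under the same assumptions as the FGSM example above (logit outputs, inputs in [0, 1]) is given below; the random start and the per-step projection follow the description in Ref. [13], while the step size and iteration count are left as user-chosen hyperparameters.

```python
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, alpha, steps):
    """Iterative L-inf PGD with a random start and per-step projection."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)  # random start
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # gradient-sign step
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project into the eps-L-inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                          # project into the valid pixel range
    return x_adv.detach()
```

For example, pgd(model, x, y, eps=8/255, alpha=2/255, steps=20) approximates the commonly reported 20-step L∞ PGD setting on CIFAR-10.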
Zheng et al. [7] propose a new adversarial attack that is performed on the space of probability measures, called the distributionally adversarial attack (DAA). Unlike PGD, where adversarial samples are generated independently for each benign sample, DAA performs optimization over the potential adversarial distributions. Moreover, the proposed objective first includes the Kullback–Leibler (KL) divergence between the adversarial and benign data distributions in the calculation of the adversarial loss to increase the adversarial generalization risk during the optimization. This distribution optimization problem is formulated as follows:
where μ denotes the adversarial data distribution and π(x) denotes the benign data distribution.
Since direct optimization over the distribution is intractable, the authors exploit two particle-optimization methods for approximation. Compared with PGD, DAA explores new adversarial patterns, as shown in Fig. 2 [7]. It ranks second on the Massachusetts Institute of Technology (MIT) MadryLab’s white-box leaderboards [13], and is considered to be one of the most effective L∞ attacks on multiple defensive models.
Carlini and Wagner [8] propose a set of optimization-based adversarial attacks (C&W attacks) that can generate L0-, L2-, and L∞-norm measured adversarial samples, namely the CW0, CW2, and CW∞ attacks, respectively.
Fig. 1. A demonstration of an adversarial sample generated by applying FGSM to GoogleNet [5]. The imperceptible perturbation crafted by FGSM fools GoogleNet into recognizing the image as a gibbon.
Fig. 2. Comparison between PGD and DAA. DAA tends to generate more structured perturbations [7].
In all the attacks mentioned above, the crafted adversarial perturbations are specific to benign samples. In other words, the adversarial perturbations do not transfer across benign samples. Hence, there is a straightforward question: Is there a universal perturbation that can fool the network on most benign samples? Ref. [20] first tries to discover such a perturbation vector by iteratively updating the perturbation using all the target benign samples. In each iteration, for the benign samples that the current perturbation cannot fool, an optimization problem, which is similar to L-BFGS [4], and which aims to discover the minimum additional perturbation required to compromise the samples, is solved. The additional perturbation is then added to the current perturbation. Eventually, the perturbation enables most of the benign samples to fool the network. Experiments show that this simple iterative algorithm is effective in attacking deep nets such as CaffeNet [21], GoogleNet [22], VGG [23], and ResNet [24]. Surprisingly, this cross-sample transferability also holds across models; for example, the universal perturbations crafted on a VGG can also achieve a fooling ratio above 53% on the other models.
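The iterative scheme described above can be sketched as follows. The inner solver minimal_perturbation, which returns the smallest additional perturbation that fools the model on a single sample (e.g., a DeepFool- or L-BFGS-style routine), is assumed rather than implemented here, and the L∞ clipping used to keep the universal perturbation small is a simplification of the Lp-ball projection in Ref. [20].

```python
import torch

def universal_perturbation(model, samples, minimal_perturbation, eps, epochs=5):
    """Iteratively grow a single perturbation v that fools the model on most samples."""
    v = torch.zeros_like(samples[0])
    for _ in range(epochs):
        for x in samples:
            pred_clean = model(x.unsqueeze(0)).argmax(dim=1)
            pred_pert = model((x + v).unsqueeze(0)).argmax(dim=1)
            if pred_pert.item() == pred_clean.item():       # v does not yet fool this sample
                dv = minimal_perturbation(model, x + v)      # minimal extra perturbation for it
                v = (v + dv).clamp(-eps, eps)                # keep the universal perturbation small
    return v
```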
Fig. 3. Convex polyhedron formed by the decision boundaries between all the classes. (a) Linear model; (b) nonlinear model [10].
All of the elements of the benign samples (e.g., all the pixels in the benign images) are perturbed in the aforementioned attack algorithms. Recent studies show that perturbations in a restricted region/segment of the benign samples can also fool DL models. These perturbations are called adversarial patches. Sharif et al. [25] propose crafting adversarial perturbations only on an eyeglasses frame attached to facial images, as shown in Fig. 4. By optimizing over a commonly used adversarial loss, such as cross-entropy, the locally crafted perturbation can easily fool the VGG-Face convolutional neural network (CNN) [26]. The authors implement this attack in the physical world by three-dimensional (3D) printing pairs of eyeglasses with the generated perturbations. This work also presents video demos in which people wearing the adversarial eyeglasses are recognized as the attack targets by a real VGG-Face CNN system. Brown et al. [27] propose a method to generate universal robust adversarial patches. In Ref. [27], the adversarial loss that aims to optimize the patch is defined based on the benign images, patch transformations, and patch locations. Universality is achieved by optimizing the patch over all the benign images. Robustness to noise and transformations is achieved by using the expectation over transformation (EoT) method [28] to compute noise/transformation-insensitive gradients for the optimization. Liu et al. [29] propose adding a Trojan patch on benign samples to generate adversarial samples. The proposed attack first selects a few neurons that can significantly influence the network outputs. Then the pixel values in the region of the adversarial patch are initialized to make the selected neurons achieve their maximums. Finally, the model is retrained with benign images and the images with the Trojan patch to adjust the weights related to those selected neurons. Despite performing similarly to the original model on benign images, the retrained model shows malicious behaviors on the images stamped with the adversarial patch.
Fig. 4. Eyeglasses with adversarial perturbations deceive a facial recognition system to recognize the faces in the first row as those in the second row [25].
Extending adversarial attack algorithms such as PGD and C&W to the physical world still needs to overcome two major challenges, even though these algorithms are very effective in the digital domain. The first challenge is that the environmental noise and natural transformations will destruct the adversarial perturbations calculated in the digital space. For example, the destruction rate of blur, noise, and joint photographic experts group (JPEG) encoding is reported to be above 80% [6]. The second challenge is specific to the ML tasks using images/videos, in which only the pixels corresponding to certain objects can be perturbed in the physical world. In other words, adversaries cannot perturb the backgrounds. Athalye et al. [28] propose the EoT method to address the first issue. Instead of using the gradients calculated in the ideal digital domain, EoT adds/applies a set of random noise/natural transformations on the input and then takes an average over all the gradients with respect to those noisy/transformed inputs. This averaged gradient is adopted in gradient-based attack algorithms such as FGSM and PGD to improve the robustness of the generated adversarial samples. In fact, utilization of an adversarial patch can simply solve the second problem—that is, the spatial constraint. Moreover, Eykholt et al. [33] propose a mask/patch transformation to separate the background and the object such that the adversarial perturbations can be restricted to the objects’ region. In addition, the authors consider the fabrication errors caused by the difference between the printable and the perturbed RGB values in Ref. [33], as shown in Fig. 5 [33]. The difference is formulated as an additional penalty term called the non-printable score, which is included in the optimization loss. Eventually, the work in Ref. [33] successfully generates printable adversarial perturbations on real-world traffic signs and achieves an overall attack success rate of more than 80%.
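The gradient-averaging step at the core of EoT can be sketched as follows. The transformations are assumed to be differentiable, shape-preserving functions of the input (e.g., additive noise or differentiable color shifts); non-differentiable physical effects would have to be modeled by differentiable surrogates.

```python
import torch
import torch.nn.functional as F

def eot_gradient(model, x, y, transforms, n_samples=30):
    """Average the input gradient over randomly sampled (differentiable) transformations."""
    x = x.clone().detach().requires_grad_(True)
    grad_sum = torch.zeros_like(x)
    for _ in range(n_samples):
        idx = torch.randint(len(transforms), (1,)).item()       # sample a random transformation
        loss = F.cross_entropy(model(transforms[idx](x)), y)    # backpropagate through it
        grad_sum += torch.autograd.grad(loss, x)[0]
    return grad_sum / n_samples    # transformation-insensitive gradient for FGSM/PGD steps
```

For instance, transforms could be a list such as [lambda t: t, lambda t: t + 0.05 * torch.randn_like(t)] in a purely digital simulation.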
Athalye et al. [17] identify a common problem shared by most heuristic defenses, including eight out of nine defenses published in ICLR 2018. The problem is that the gradients of those defensive models are either nonexistent or nondeterministic due to add-ons/operations such as quantization and randomization. For these defenses, this work proposes three methods that can circumvent the add-ons/operations to reveal valid gradients for crafting adversarial samples. For defenses relying on non-differentiable add-ons such as quantization, it circumvents the add-ons by using differentiable functions to approximate them. For defenses armed with nondeterministic operations such as random transformations, it simply uses EoT [28] to identify a general gradient direction under the possible transformations and updates the adversarial samples along this general direction. For defenses that yield exploding or vanishing gradients caused by optimization loops, it proposes making a change of variable such that the optimization loop is transformed into a differentiable function. Using the gradients approximated by those three methods, it successfully breaks seven out of nine heuristic defenses in ICLR 2018.
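The first of these three methods, often referred to as the backward pass differentiable approximation (BPDA), can be sketched as follows: a non-differentiable preprocessing g is applied in the forward pass, but the backward pass treats it as the identity. The quantization function here is only an illustrative stand-in for such a defensive add-on.

```python
import torch

class BPDAIdentity(torch.autograd.Function):
    """Apply a non-differentiable preprocessing g in the forward pass, but
    back-propagate as if g were the identity function."""

    @staticmethod
    def forward(ctx, x, g):
        return g(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None          # pretend dg/dx = I; g itself gets no gradient

def quantize(x):
    """Illustrative non-differentiable defense: 8-level color quantization."""
    return torch.round(x * 7.0) / 7.0

# Inside a white-box attack, the defended model would then be queried as:
#   logits = model(BPDAIdentity.apply(x_adv, quantize))
```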
In the previous section, we mainly introduced the typical attack algorithms and methods. Most were initially designed for image classification tasks. However, these methods can also be applied to other domains such as image/video segmentation [34,35], 3D recognition [36,37], audio recognition [38], and reinforcement learning [39], which attract growing attention from both academia and industry. Besides, specific data and applications could lead to unique adversarial attacks. Hence, in this section, we sketch out these unique adversarial attacks on the other pervasive applications.
Fig. 5. (a) shows the original image, identified by an Inception v3 model as a microwave, and (b) shows its physical adversarial example, identified as a phone [33].
Point-cloud is an important 3D data representation for 3D object recognition. PointNet [37], PointNet++ [42], and dynamic graph CNN (DGCNN) [43] are the three most popular DL models for point-cloud-based classification/segmentation. However, these three models were also recently found to be vulnerable to adversarial attacks [15,44,45]. In Ref. [44], the authors first extend the C&W attack to the 3D point-cloud DL models. The point locations correspond to the pixel values, and the C&W loss is optimized by shifting the points (i.e., perturbing the point locations). Similarly, the work proposed in Ref. [45] applies BIM/PGD to point-cloud classification and also achieves high attack success rates. In Ref. [15], the authors propose a new attack by dropping the existing points in the point clouds. They approximate the contribution of each point to the classification result by shifting the point to the centroid of the point cloud, and drop the points with large positive contributions. With a certain number of points dropped, the classification accuracies of PointNet, PointNet++, and DGCNN are significantly reduced. Besides, the work in Ref. [46] proposes adding adversarial perturbations on 3D meshes such that the 2D projections of the 3D meshes can mislead 2D-image classification models. This 3D attack is implemented by optimization over a hybrid loss, with an adversarial loss to attack the target 2D model and a penalty loss to keep the 3D adversarial meshes perceptually realistic.
Fig. 6. In (a), Faster R-CNN correctly detects three dogs and identifies their regions, while in (b), generated by DAG, the segmentation results are completely wrong [40].
Carlini and Wagner [47] successfully construct high-quality audio adversarial samples through optimization over the C&W loss. For an audio signal, up to 50 words in the text transcription can be modified by adversarially perturbing only 1% of the audio signal on DeepSpeech [48]. They also find that the constructed adversarial audio signals are robust to pointwise noise and MP3 compression. However, due to the nonlinear effects of microphones and recorders, the perturbed audio signals do not remain adversarial after being played over the air. The authors in Ref. [49] propose simulating the nonlinear effects and the noise while taking them into account in the attack process. Specifically, the authors model the received signal as a function of the transmitted signal, which consists of the transformations for modeling the effects of the band-pass filter, impulse response, and white Gaussian noise. The adversarial loss is defined on the received signals instead of the transmitted signals. The proposed attack successfully generates adversarial audio samples in the physical world, which can attack the audio-recognition models even after being played in the air. For text recognition, Liang et al. [50] propose three word-level perturbation strategies on text data, including insertion, modification, and removal. The attack first identifies the important text items for classification, and then exploits one of the perturbation approaches on those text items. Experiments show that this attack can successfully fool some state-of-the-art DNN-based text classifiers. Moreover, TextBugger adopts five types of perturbation operations on text data, including insertion, deletion, swap, character substitution, and word substitution, as shown in Fig. 7 [16]. In the white-box setting, those five operations are also conducted on the important words identified by the Jacobian matrix [9]. However, in the black-box threat model, the Jacobian matrix is unavailable for sentences and documents. The adversary is assumed to have access to the confidence values of the prediction. In this context, the importance of each sentence is defined as its confidence value regarding the predicted class. The importance of each word in the most salient sentence is defined by the difference between the confidence values of the sentence with and without the word.
Huang et al. [51] show that existing attack methods can also be used to degrade the performance of the trained policy in deep reinforcement learning by adding adversarial perturbations to the raw inputs of the policy. In particular, the authors construct a surrogate loss J(θ, x, y) with the policy parameters θ, the input of the policy x, and a weighted score over all possible actions y. FGSM [5] is used to attack feed-forward policies trained with three algorithms, respectively, including Deep Q-networks [52], asynchronous advantage actor-critic [53], and trust region policy optimization [54]. In most cases, the proposed attack can reduce the accuracy of the agent by 50% under the white-box setting. In the black-box setting, this attack is also effective. The adversarial effects can transfer across those three algorithms, although the attack performance may degrade. Ref. [55] proposes perturbing the input state st in the Q-function Q(st+1, a, θt), such that the learning process will produce an adversarial action a′. FGSM and JSMA are adopted as the adversarial-perturbation-crafting methods. Lin et al. [56] propose two attack tactics for deep reinforcement learning, namely the strategically timed attack and the enchanting attack. In the strategically timed attack, the reward is minimized by only perturbing the image inputs for a few specific time-steps. This attack is simply conducted by optimizing the perturbations over the reward. The enchanting attack adversarially perturbs the image frames to lure the agent to the target state. This attack requires a generative model to predict the future states and actions in order to formulate a misleading sequence of actions as guidance for generating perturbations on the image frames.
In this section, we summarize the representative defenses developed in recent years, mainly including adversarial training, randomization-based schemes, denoising methods, provable defenses, and some other new defenses. We also present a brief discussion on their effectiveness against different attacks under different settings.
Adversarial training is an intuitive defense method against adversarial samples, which attempts to improve the robustness of a neural network by training it with adversarial samples. Formally, it is a min-max game that can be formulated as follows:

minθ E(x,y) [ maxx′: D(x,x′)≤∊ J(θ, x′, y) ]
where J(θ, x′, y) is the adversarial loss, with network weights θ, adversarial input x′, and ground-truth label y; D(x, x′) represents a certain distance metric between x and x′. The inner maximization problem is to find the most effective adversarial samples, which is solved by a well-designed adversarial attack, such as FGSM [5] and PGD [6]. The outer minimization is the standard training procedure to minimize the loss. The resulting network is supposed to be resistant against the adversarial attack used for adversarial sample generation in the training stage. Recent studies in Refs. [13,14,57,58] show that adversarial training is one of the most effective defenses against adversarial attacks. In particular, it achieves state-of-the-art accuracy on several benchmarks. Therefore, in this section, we elaborate on the best-performing adversarial training techniques of the past few years.
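A schematic adversarial training loop matching this min-max formulation is given below; attack stands for any inner-maximization routine (e.g., the FGSM or PGD sketches above with fixed hyperparameters), and the optimizer settings are illustrative.

```python
import torch
import torch.nn.functional as F

def adversarial_training(model, loader, attack, epochs, lr=0.1):
    """Outer minimization over theta; inner maximization delegated to `attack`."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            x_adv = attack(model, x, y)                 # inner max: craft adversarial samples
            model.train()
            opt.zero_grad()
            loss = F.cross_entropy(model(x_adv), y)     # outer min: train on adversarial samples
            loss.backward()
            opt.step()
    return model

# Example: PGD adversarial training with fixed attack hyperparameters.
# adversarial_training(model, loader,
#                      attack=lambda m, x, y: pgd(m, x, y, eps=8/255, alpha=2/255, steps=10),
#                      epochs=100)
```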
Fig. 7. Adversarial text generated by TextBugger [16]: A negative comment is misclassified as a positive comment.
5.1.1. FGSM adversarial training
Goodfellow et al. [5] first propose enhancing the robustness of a neural network by training it with both benign and FGSM-generated adversarial samples. Formally, the proposed adversarial objective can be formulated as follows:

c·J(θ, x, y) + (1 − c)·J(θ, x + ∊·sign[∇xJ(θ, x, y)], y)
where x + ∊·sign[∇xJ(θ, x, y)] is the FGSM-generated adversarial sample for the benign sample x, and c is a hyperparameter used to balance the accuracy on benign and adversarial samples. Experiments in Ref. [5] show that the network becomes somewhat robust to FGSM-generated adversarial samples. Specifically, with adversarial training, the error rate on adversarial samples dramatically falls from 89.4% to 17.9%. However, the trained model is still vulnerable to iterative/optimization-based adversarial attacks despite its effectiveness when defending against FGSM-generated adversarial samples. Therefore, a number of studies further dig into adversarial training with stronger adversarial attacks, such as BIM/PGD attacks.
5.1.2. PGD adversarial training
Extensive evaluations demonstrate that a PGD attack is probably a universal first-order L∞ attack [13]. If so, model robustness against PGD implies resistance against a wide range of first-order L∞ attacks. Based on this conjecture, Madry et al. [13] propose using PGD to adversarially train a robust network. Surprisingly, PGD adversarial training indeed improves the robustness of CNNs and ResNets [24] against several typical first-order L∞ attacks, such as FGSM, PGD, and CW∞ attacks, under both black-box and white-box settings. Even the currently strongest L∞ attack, that is, DAA, can only reduce the accuracy of the PGD adversarially trained MNIST model to 88.56% and the accuracy of the CIFAR-10 model to 44.71%. In the recent Competition on Adversarial Attacks and Defenses (CAAD), the first-ranking defense against ImageNet adversarial samples relied on PGD adversarial training [14]. With PGD adversarial training, the baseline ResNet [24] already achieves over 50% accuracy under 20-step PGD, while the denoising architecture proposed in Ref. [14] only increases the accuracy by 3%. All the above studies and results indicate that PGD adversarial training is overall the most effective countermeasure against L∞ attacks. However, due to the large computational cost required for PGD adversarial sample generation, PGD adversarial training is not an efficient method. For example, PGD adversarial training on a simplified ResNet for CIFAR-10 requires approximately three days on a TITAN V graphics processing unit (GPU), and the first-ranking model in CAAD costs 52 h on 128 Nvidia V100 GPUs. Besides, a PGD adversarially trained model is only robust to L∞ attacks and is vulnerable to other Lp-norm adversaries, such as EAD [19,59] and CW2 [8].
5.1.3. Ensemble adversarial training
To avoid the large computational cost brought by PGD adversarial training, Ref. [60] proposes to adversarially train a robust ImageNet model by FGSM with random starts. However, the adversarially trained model turns out to be vulnerable even to black-box attacks. To tackle this problem, the authors propose a training methodology that incorporates adversarial samples transferred from multiple pre-trained models, namely ensemble adversarial training (EAT) [61]. Intuitively, EAT increases the diversity of adversarial samples used for adversarial training, and thus enhances network robustness against adversarial samples transferred from other models. Experiments show that EAT models exhibit robustness against adversarial samples generated by various single-step and multi-step attacks on the other models. In some circumstances, the performance of EAT against black-box and gray-box attacks is even better than that of PGD adversarial training.
5.1.4. Adversarial logit pairing
Kannan et al. [62] propose a new adversarial training approach called adversarial logit pairing (ALP). Similar to the stability training strategy proposed in Ref. [63], ALP encourages the similarity between pairs of examples in the learned logit space by including the cross-entropy between the logits of benign samples x and the corresponding perturbed samples x′ in the training loss. The only difference is that the x′ used in Ref. [62] are PGD adversarial samples. The training loss is formally defined as follows:

J(θ, x, y) + c·J(θ, x, x′)
where J(θ, x, y) is the original loss, J(θ, x, x′) is the cross-entropy between the logits of x and x′, and c is a hyperparameter.
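A sketch of this training loss is shown below, with x_adv standing for the PGD adversarial samples. The pairing term is implemented as a cross-entropy between the softmax distributions of the two logit vectors, which follows the textual description above; the exact pairing term used in Ref. [62] may differ.

```python
import torch
import torch.nn.functional as F

def alp_loss(model, x, x_adv, y, c):
    """Original loss plus a logit-pairing term between benign and adversarial samples."""
    logits_clean = model(x)
    logits_adv = model(x_adv)
    original = F.cross_entropy(logits_clean, y)                      # J(theta, x, y)
    pairing = -(F.softmax(logits_clean, dim=1) *
                F.log_softmax(logits_adv, dim=1)).sum(dim=1).mean()  # J(theta, x, x')
    return original + c * pairing
```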
Experiments in Ref. [62] show that this pairing loss helps improve the performance of PGD adversarial training on several benchmarks, such as SVHN, CIFAR-10, and tiny ImageNet. Concretely, it is claimed in Ref. [62] that ALP increases the accuracy of the Inception V3 model under the white-box PGD attack from 1.5% to 27.9%. Its performance is almost as good as that of EAT against black-box attacks. However, the work in Ref. [64] evaluates the robustness of an ALP-trained ResNet and discovers that the ResNet only achieves a 0.6% correct classification rate under the targeted attack considered in Ref. [62]. The authors also point out that ALP is less amenable to gradient descent, since ALP sometimes induces a ‘‘bumpier,” that is, more depressed, loss landscape tightly around the input points. Therefore, ALP might not be as robust as expected in Ref. [62].
5.1.5. Generative adversarial training
All of the above adversarial training strategies employ deterministic attack algorithms to generate training samples. Lee et al. [65] first propose exploiting a nondeterministic generator to generate adversarial samples in the process of adversarial training. Specifically, the proposed work sets up a generator, which takes the gradients of the trained classifier with respect to benign samples as inputs and generates FGSM-like adversarial perturbations. By training the classifier on both benign and generated samples, the authors obtain a classifier that is more robust to FGSM than the FGSM adversarially trained model. Liu and Hsieh [66] first propose the use of an AC-GAN architecture [32] for data augmentation to improve the generality of PGD adversarial training. The AC-GAN learns to generate fake samples similar to the PGD adversarial samples by feeding the PGD adversarial samples into the discriminator as real samples. The PGD-like fake samples are exploited to train the auxiliary classifier along with the pre-trained discriminator. According to Ref. [66], such a combination of a generator, discriminator, auxiliary classifier, and PGD attack in a single network not only results in a more robust classifier, but also leads to a better generator.
Many recent defenses resort to randomization schemes for mitigating the effects of adversarial perturbations in the input/feature domain. The intuition behind this type of defense is that DNNs are always robust to random perturbations. A randomization-based defense attempts to randomize the adversarial effects into random effects, which are not a concern for most DNNs. Randomization-based defenses have achieved comparable performance under black-box and gray-box settings, but in the white-box setting, the EoT method [28] can compromise most of them by considering the randomization process in the attack process. In this section, we present details of several typical randomization-based defenses and introduce their performance against various attacks in different settings.
5.2.1. Random input transformation
Xie et al. [67] utilize two random transformations—random resizing and padding—to mitigate the adversarial effects at inference time. Random resizing refers to resizing the input images to a random size before feeding them into DNNs. Random padding refers to padding zeros around the input images in a random manner. The pipeline of this quick and sharp mechanism is shown in Fig. 8 [67]. The mechanism achieves a remarkable performance under black-box adversarial settings, and ranked second place in the NIPS 2017 adversarial examples defense challenge. However, under the white-box setting, this mechanism was compromised by the EoT method [28]. Specifically, by approximating the gradient using an ensemble of 30 randomly resized and padded images, EoT can reduce the accuracy to 0 with 8/255 L∞ perturbations. In addition, Guo et al. [68] apply image transformations with randomness, such as bit-depth reduction, JPEG compression, total variance minimization, and image quilting, before feeding the image to a CNN. This defense method resists 60% of strong gray-box and 90% of strong black-box adversarial samples generated by a variety of major attack methods. However, it is also compromised by the EoT method [28].
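The random resizing and padding steps can be sketched as follows; the output size and the nearest-neighbor interpolation mode are illustrative choices, not the exact configuration of Ref. [67].

```python
import torch
import torch.nn.functional as F

def random_resize_and_pad(x, out_size=331):
    """Resize a batch of images to a random size, then zero-pad it to a fixed
    size at a random offset (input height/width assumed smaller than out_size)."""
    n, c, h, w = x.shape
    new_size = torch.randint(h, out_size, (1,)).item()              # random intermediate size
    x = F.interpolate(x, size=(new_size, new_size), mode='nearest')
    pad_total = out_size - new_size
    left = torch.randint(0, pad_total + 1, (1,)).item()             # random padding offsets
    top = torch.randint(0, pad_total + 1, (1,)).item()
    return F.pad(x, (left, pad_total - left, top, pad_total - top), value=0.0)

# At inference time: logits = model(random_resize_and_pad(x))
```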
5.2.2. Random noising
Liu et al. [69] propose defending against adversarial perturbations with a random noising mechanism called random self-ensemble (RSE). RSE adds a noise layer before each convolution layer in both the training and testing phases, and ensembles the prediction results over the random noises to stabilize the DNN’s outputs, as shown in Fig. 9 [69]. Lecuyer et al. [70] view the random noising defensive mechanism from the angle of differential privacy (DP) [71] and propose a DP-based defense called PixelDP. PixelDP incorporates a DP noising layer inside the DNN to enforce DP bounds on the variation of the distribution over its predictions for inputs with small, norm-based perturbations. PixelDP can be used to defend against L1/L2 attacks using Laplacian/Gaussian DP mechanisms. Inspired by PixelDP, the authors in Ref. [72] further propose directly adding random noise to the pixels of adversarial examples before classification, in order to eliminate the effects of adversarial perturbations. Following the theory of Rényi divergence, it is proved that this simple method can upper-bound the size of the adversarial perturbation it is robust to, which depends on the first- and second-largest probabilities of the output probability distribution (vector).
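The ‘‘noise-before-classification” idea shared by these defenses can be sketched as follows: the prediction is averaged over many Gaussian-noised copies of the input. The noise level and the number of noise samples are assumed values, and this sketch only illustrates the inference procedure, not the accompanying certification analysis.

```python
import torch
import torch.nn.functional as F

def noisy_prediction(model, x, sigma=0.25, n=100):
    """Classify by averaging softmax outputs over Gaussian-noised copies of x."""
    with torch.no_grad():
        probs = None
        for _ in range(n):
            noisy = x + sigma * torch.randn_like(x)     # pixel-wise Gaussian noise
            p = F.softmax(model(noisy), dim=1)
            probs = p if probs is None else probs + p
        return (probs / n).argmax(dim=1)                # averaged prediction
```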
5.2.3. Random feature pruning
Dhillon et al. [73] present a method called stochastic activation pruning (SAP) to protect pre-trained networks against adversarial samples by stochastically pruning a subset of the activations in each layer and preferentially retaining activations with larger magnitudes. After activation pruning, SAP scales up the surviving activations to normalize the inputs of each layer. However, on CIFAR-10, EoT [28] can also reduce the accuracy of SAP to 0 with 8/255 L∞ adversarial perturbations. Luo et al. [74] introduce a new CNN structure by randomly masking the feature maps output from the convolutional layers. By randomly masking the output features, each filter only extracts features from partial positions. The authors claim that this assists the filters in learning features distributed consistently with respect to the mask pattern; hence, the CNN can capture more information on the spatial structures of local features.
Denoising is a very straightforward method in terms of mitigating adversarial perturbations/effects. Previous works point out two directions to design such a defense: input denoising and feature denoising. The first direction attempts to partially or fully remove the adversarial perturbations from the inputs, and the second direction attempts to alleviate the effects of adversarial perturbations on the high-level features learned by DNNs. In this section, we elaborate on several well-known defenses in both directions.
5.3.1. Conventional input rectification
In order to mitigate the adversarial effects, Xu et al. [75] first utilize two squeezing (denoising) methods—bit-depth reduction and image blurring—to reduce the degrees of freedom and remove the adversarial perturbations, as shown in Fig. 10. Adversarial sample detection is realized by comparing the model predictions on the original and squeezed images. If the original and squeezed inputs produce substantially different outputs from the model, the original input is likely to be an adversarial sample. Xu et al. [76] further show that the feature-squeezing methods proposed in Ref. [75] can mitigate the C&W attack. However, He et al. [77] demonstrate that feature squeezing is still vulnerable to an adaptive, knowledgeable adversary. Their attack adopts the CW2 loss as the adversarial loss. After each step of the optimization procedure, an intermediate image is available from the optimizer. The reduced-color-depth version of this intermediate image is checked by the detection system proposed by Xu et al. [75]. Such an optimization procedure runs multiple times, and all the intermediate adversarial samples that can bypass Xu’s system are aggregated. This whole adaptive attack can break the input-squeezing system with perturbations much smaller than those claimed in Ref. [75]. Moreover, Sharma and Chen [78] also show that EAD and CW2 can bypass the input-squeezing system with increasing adversary strength.
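A minimal sketch of the detection logic in Fig. 10 is given below. The two squeezers are a bit-depth reduction and a simple mean blur (a stand-in for the median smoothing used in Ref. [75]), and the L1 threshold on the prediction difference is an assumed value that would normally be calibrated on benign data.

```python
import torch
import torch.nn.functional as F

def feature_squeezing_detect(model, x, bits=4, threshold=1.0):
    """Flag inputs whose predictions change too much after input squeezing."""
    with torch.no_grad():
        p_orig = F.softmax(model(x), dim=1)
        levels = 2 ** bits - 1
        x_bits = torch.round(x * levels) / levels                        # bit-depth reduction
        x_blur = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)     # simple local smoothing
        d1 = (p_orig - F.softmax(model(x_bits), dim=1)).abs().sum(dim=1)
        d2 = (p_orig - F.softmax(model(x_blur), dim=1)).abs().sum(dim=1)
        return torch.max(d1, d2) > threshold        # True marks a suspected adversarial input
```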
Fig. 8. The pipeline of the randomization-based defense mechanism proposed by Xie et al. [67]: The input image is first randomly resized and then randomly padded.
Fig. 9. The architecture of RSE [69]. FC: fully connected layer; Fin: the input vector of the noise layer; Fout: the output vector of the noise layer; ∊: the perturbation, which follows the Gaussian distribution N(0, σ²); conv: convolution.
Fig. 10. The feature-squeezing framework proposed by Xu et al. [75]. d1 and d2: the difference between the model’s prediction on a squeezed input and its prediction on the original input; H: the threshold used to detect adversarial examples.
5.3.2. GAN-based input cleansing
A GAN is a powerful tool for learning a generative model of data distributions. Thus, plenty of works intend to exploit GANs to learn the benign data distribution in order to generate a benign projection for an adversarial input. Defense-GAN and the adversarial perturbation elimination GAN (APE-GAN) are two typical algorithms among these works. Defense-GAN [79] trains a generator to model the distribution of benign images, as shown in Fig. 11 [79]. In the testing stage, Defense-GAN cleanses an adversarial input by searching for an image close to the adversarial input in its learned distribution, and feeds this benign image into the classifier. This strategy can be used to defend against various adversarial attacks. Currently, the most effective attack scheme against Defense-GAN is based on backward pass differentiable approximation [17], which can reduce its accuracy to 55% with 0.005 L2 adversarial perturbations. APE-GAN [80] directly learns a generator to cleanse an adversarial sample by taking it as input and outputting a benign counterpart. Although APE-GAN achieves a good performance in the testbed of Ref. [80], the adaptive white-box CW2 attack proposed in Ref. [81] can easily defeat APE-GAN.
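The projection step of Defense-GAN can be sketched as a search over the generator’s latent space; the generator interface, the number of restarts R, the step count, and the learning rate are assumptions, and per-sample selection of the best reconstruction is simplified to a batch-level choice here.

```python
import torch

def defense_gan_project(generator, x, z_dim, n_restarts=10, steps=200, lr=0.05):
    """Find z minimizing ||G(z) - x||_2^2 and return G(z*) as the cleansed input."""
    best_img, best_err = None, float('inf')
    for _ in range(n_restarts):                        # R random restarts in latent space
        z = torch.randn(x.shape[0], z_dim, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):                         # gradient descent on z only
            opt.zero_grad()
            err = ((generator(z) - x) ** 2).sum()
            err.backward()
            opt.step()
        with torch.no_grad():
            err = ((generator(z) - x) ** 2).sum().item()
            if err < best_err:
                best_err, best_img = err, generator(z).detach()
    return best_img    # feed this reconstruction, instead of x, into the classifier
```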
5.3.3. Auto-encoder-based input denoising
Fig. 11. The pipeline of Defense-GAN [79]. G: the generative model, which can generate a high-dimensional input sample from a low-dimensional vector z; R: the number of random vectors generated by the random number generator.
In Ref. [82], the authors introduce a defensive system called MagNet, which includes a detector and a reformer. In MagNet, an auto-encoder is used to learn the manifold of benign samples. The detector distinguishes the benign and adversarial samples based on the relationships between those samples and the learned manifold. The reformer is designed to rectify the adversarial samples into benign samples. The authors show the effectiveness of MagNet against a variety of adversarial attacks under gray-box and black-box settings, such as FGSM, BIM, and C&W. However, Carlini and Wagner [81] demonstrate that MagNet is vulnerable to the transferable adversarial samples generated by the CW2 attack.
5.3.4. Feature denoising
Liao et al. [83] propose a high-level representation guided denoiser (HGD) to polish the features affected by the adversarial perturbations. Instead of denoising at the pixel level, HGD trains a denoising U-Net [34] using a feature-level loss function to minimize the feature-level difference between benign and adversarial samples. In the NIPS 2017 competition, HGD won first place in the defense track (black-box setting). Despite its effectiveness under black-box settings, HGD is compromised by a PGD adversary under a white-box setting in Ref. [84]. Experiments in Ref. [84] indicate that a PGD attack with 4/255 L∞ perturbations can already reduce the accuracy of HGD to 0. Xie et al. [14] design a block for learning several denoising operations to rectify the features learned by the intermediate layers in DNNs. The modified PGD adversarially trained network ranked first place in the adversarial defense track of CAAD 2018. Despite the remarkable success of Ref. [14], the contribution of the feature-denoising block to network robustness is not compared with PGD adversarial training, since the PGD adversarially trained baseline can also achieve nearly 50% accuracy under white-box PGD attacks, and the denoising block only improves the accuracy of this baseline by 3%.
All of the above defenses are heuristic defenses, which means that the effectiveness of these defenses is only experimentally validated, rather than being theoretically proved. Without a theoretical error-rate guarantee, those heuristic defenses might be broken by a new attack in the future. Therefore, many researchers have put efforts into developing provable defensive methods, which can always maintain a certain accuracy under a well-defined class of attacks. In this section, we introduce several typical certificated defenses.
5.4.1. Semidefinite programming-based certificated defense
Raghunathan and Kolter [85] first propose a certifiable defense method against adversarial examples on two-layer networks. The authors derive a semidefinite relaxation to upper-bound the adversarial loss and incorporate the relaxation into the training loss as a regularizer. This training method produces a network with a certificate that no attack with at most 0.1/1.0 L∞ perturbations can cause more than 35% test error on MNIST. In Ref. [86], Raghunathan et al. further propose a new semidefinite relaxation for certifying arbitrary ReLU networks. The newly proposed relaxation is tighter than the previous one and can produce meaningful robustness guarantees on three different networks.
5.4.2. Dual approach-based provable defense
Along with Ref. [85], Wong and Kolter [87] formulate a dual problem to upper-bound the adversarial polytope. They show that the dual problem can be solved by optimization over another deep neural network. Unlike Ref. [85], which only applies to two-layer fully connected networks, this approach can be applied to deep networks with arbitrary linear operator layers, such as convolution layers. The authors further extend the technique in Ref. [87] to much more general networks with skip connections and arbitrary nonlinear activations in Ref. [88]. They also present a nonlinear random projection technique to estimate the bound in a manner that only scales linearly in the size of the hidden units, making the approach applicable to larger networks. On both the MNIST and CIFAR datasets, the classifiers trained with the proposed techniques substantially improve the previous provable robust adversarial error guarantees: from 5.8% to 3.1% on MNIST with L∞ perturbations of ∊ = 0.1, and from 80% to 36.4% on CIFAR with L∞ perturbations of ∊ = 2/255.
5.4.3. Distributional robustness certification
From the perspective of distribution optimization, Sinha et al. [89] formulate an optimization problem over adversarial distributions as follows:

minθ supP∈φ E(x,y)~P [J(θ, x, y)]
where φ is a candidate set of all the distributions around the benign data, which can be constructed by f-divergence balls [90] or Wasserstein balls [91], and P denotes a distribution sampled from the candidate set φ.
Optimization over this distributional objective is equivalent to minimizing the empirical risk over all the samples in the neighborhood of the benign data—that is, over all the candidates for the adversarial samples. Since φ affects the computability, and direct optimization over an arbitrary φ is intractable, the work in Ref. [89] derives tractable sets φ using the Wasserstein distance metric, with computationally efficient relaxations that are computable even when J(θ, x, y) is non-convex. In fact, the work in Ref. [89] also provides an adversarial training procedure with provable guarantees on its computational and statistical performance. The proposed training procedure incorporates a penalty to characterize the adversarial robustness region. Since optimization over this penalty is intractable, the authors propose a Lagrangian relaxation for the penalty and achieve robust optimization over the proposed distributional loss. In addition, the authors derive guarantees for the empirical minimizer of the robust saddle-point problem and give specialized bounds for domain adaptation problems, which also shed light on distributional robustness certification.
Guo et al. [92] are the first to demonstrate the intrinsic relationship between weight sparsity and network robustness against FGSM-generated and DeepFool-generated adversarial samples. For linear models, Ref. [92] demonstrates that optimization over adversarial samples could give rise to a sparse solution of the network weights. For nonlinear neural networks, it applies the robustness guarantees from Refs. [93,94] and demonstrates that the network Lipschitz constant is prone to be smaller when the weight matrices are sparser. Since it is observed that minimizing the Lipschitz constant can help improve network robustness [93], the conclusion is that weight sparsity can also lead to a more robust neural network. In Ref. [95], it is also shown that weight sparsity is beneficial to network robustness verification. The authors demonstrate that weight sparsity could turn the computationally intractable verification problems into tractable ones. The authors improve the weight sparsity of neural networks by training them with L1 regularization, and discover that weight sparsity significantly speeds up the linear programming solvers [96] for network robustness verification.
Wang et al. [97] first develop a framework for analyzing the adversarial robustness of the k-nearest neighbor (KNN) algorithm. This framework identifies two distinct regimes of k with different robustness properties. Specifically, KNN with constant k has no robustness under the large-sample limit in the regions where the conditional probability of each class lies strictly between 0 and 1.
Liu et al. [99] combine the Bayesian neural network (BNN) [100] with adversarial training to learn the optimal model-weight distribution under adversarial attacks. Specifically, the authors assume that all the weights in the network are stochastic, and train the network with the techniques commonly used in the BNN theory
[100]. Through adversarially training such a stochastic BNN, the BNN with adversarial training shows a significant improvement in adversarial robustness compared with RSE [69] and common adversarial training on CIFAR-10, STL-10, and ImageNet143. Schott et al. [101] suggest modeling the class-conditional distributions of the input data with a Bayesian model, and classifying a new sample as the class under which the corresponding class-conditional model yields the highest likelihood. The authors name the model the ‘‘analysis by synthesis” (ABS) model. ABS is considered to be the first robust model for the MNIST dataset against L0, L2, and L∞ attacks. Specifically, it achieves state-of-the-art performance against L0 and L2 attacks, but performs slightly worse than the PGD adversarially trained model under L∞ attacks.
For ML tasks such as audio recognition and image segmentation, consistency information can be applied to distinguish between benign and adversarial samples. Xiao et al. [102] find that, for the semantic segmentation task, adversarially perturbing one pixel also affects the predictions of its surrounding pixels. Therefore, perturbing a single patch also breaks the spatial consistency between it and its nearby patches. Such consistency information makes the benign and adversarially perturbed images distinguishable. This consistency-based methodology is evaluated against adaptive attacks and demonstrates better performance than other anomaly-detection systems. For the audio-recognition task, Yang et al. [103] explore the temporal consistency of audio signals and discover that adversarial perturbation destroys the temporal consistency. Specifically, for an adversarial audio signal, the translation of a portion of the signal is not consistent with the corresponding portion of the translation of the whole signal. It is shown that detection based on this consistency test can achieve more than a 90% detection rate on adversarial audio signals.
From the perspective of adversaries, the main difference between white-box and black-box settings is the level of their access to the target model. Under white-box settings, the model structure and the weights are accessible to the adversaries, so they can compute the true model gradients or approximate the gradients by the methods in Ref. [17]. Besides, the adversaries can adapt their attack method to the defense method and its parameters. In this context, most of the heuristic defenses introduced before are ineffective against such strong adaptive adversaries. However, under black-box settings, the model structure and the weights are secrets to the adversaries. In this context, to apply the above gradient-based attack algorithms, the adversaries have to infer the model gradients from limited information. Without any model-specific information, an unbiased estimation of the model gradient is the expectation of the gradients of a pre-trained model ensemble with different random seeds. A momentum gradient-based method with this gradient estimation achieved first place in the NIPS 2017 Challenge (under a black-box setting) [18]. Chen et al. [104] investigate another black-box setting, where additional query access is granted to the adversaries. Therefore, the adversaries can infer the gradients from the outputs of the target model given well-designed inputs. In this setting, the proposed design can apply a zeroth-order method to give a much better estimation of the model gradients. However, a drawback of this method is its requirement for a large number of queries, which is proportional to the data dimension.
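The query-based gradient inference described above can be sketched with a coordinate-wise finite-difference estimator; only black-box access to a scalar loss is assumed, and coordinate subsampling is included to hint at why query efficiency dominates the cost.

```python
import torch

def estimate_gradient(loss_fn, x, h=1e-3, n_coords=None):
    """Zeroth-order gradient estimate of a black-box scalar loss at input x."""
    flat = x.flatten()
    grad = torch.zeros_like(flat)
    if n_coords is None:
        coords = range(flat.numel())                           # 2*d queries in total
    else:
        coords = torch.randperm(flat.numel())[:n_coords]       # randomly subsample coordinates
    for i in coords:
        e = torch.zeros_like(flat)
        e[i] = h
        grad[i] = (loss_fn((flat + e).view_as(x)) -
                   loss_fn((flat - e).view_as(x))) / (2 * h)   # central finite difference
    return grad.view_as(x)
```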
The trend of research on adversarial attacks mainly includes two directions. The first direction is to design more efficient and stronger attacks in order to evaluate various emerging defensive systems. The importance of this direction is intuitive, since we expect to understand all the risks ahead of the potential adversaries. The second direction is to realize adversarial attacks in the physical world. Previously, the main doubt about this research topic was whether those adversarial attacks were real threats in the physical world. Some researchers suspected that adversarial attacks initially designed in digital spaces would not be effective in the physical world due to the influence of certain environmental factors. Kurakin et al. [6] first address this challenge by using the expectation of the model gradients with respect to the inputs plus the random noise caused by environmental factors. Eykholt et al. [33] further consider the masks and fabrication errors, and implement adversarial perturbations on traffic signs. Recently, Cao et al. [105] successfully generated adversarial objects to deceive a LiDAR-based detection system, thus validating the existence of physical adversarial samples again. In terms of defenses, the community is starting to focus on certificated defenses, since most heuristic defenses fail to defend against adaptive white-box attacks, and a certificated defense is supposed to guarantee the defensive performance under certain situations regardless of the attack method used by the adversaries. However, until now, scalability has been a common problem shared by most certificated defenses. For example, interval bound analysis is a recently popular direction for certifying DNNs, but it is not scalable to very deep neural networks and large datasets. Clearly, compared with attacks, the development of defenses faces more challenges. This is mainly because an attack only needs to target one category of defenses, whereas defenses are required to be certificated—that is, effective against all possible attack methods under certain situations.
Existence of a general robust decision boundary. Since there are numerous adversarial attacks defined under different metrics, a natural question is: Is there a general robust decision boundary that can be learned by a certain kind of DNN with a particular training strategy? At present, the answer to this question is ‘‘no.” Although PGD adversarial training demonstrates remarkable resistance against a wide range of L∞ attacks, Sharma and Chen [59] show that it is still vulnerable to adversarial attacks measured by other Lp norms, such as EAD and CW2. Recently, Khoury and Hadfield-Menell [111] prove that the optimal L2 and L∞ decision boundaries are different for a two-concentric-sphere dataset, and that their disparity grows with the codimension of the dataset—that is, the difference between the dimensions of the data manifold and the whole data space.
Effective and efficient defense against white-box attacks. To the best of our knowledge, no defense that achieves a balance between effectiveness and efficiency has been proposed. In terms of effectiveness, adversarial training demonstrates the best performance, but at a substantial computational cost. In terms of efficiency, the configuration of many randomization-based and denoising-based defenses/detection systems only takes a few seconds. However, many recent works [17,84,114,115] show that those schemes are not as effective as they claim to be. Certificated defenses indicate a way to reach theoretically guaranteed security, but both their accuracy and their efficiency are far from meeting practical requirements.
In this paper, we have presented a general overview of the recent representative adversarial attack and defense techniques. We have investigated the ideas and methodologies of the proposed methods and algorithms, and have discussed the effectiveness of these adversarial defenses based on the most recent advances. New adversarial attacks and defenses developed in the past two years have been elaborated. Some fundamental problems, such as the causation of adversarial samples and the existence of a general robust decision boundary, have also been investigated. We have observed that there is still no existing defense mechanism that achieves both efficiency and effectiveness against adversarial samples. The most effective defense mechanism, adversarial training, is too computationally expensive for practical deployment, while many efficient heuristic defenses have been demonstrated to be vulnerable to adaptive white-box adversaries. We have also discussed several open problems and challenges in this critical area, in the hope of providing a useful research guideline to boost its development.
Acknowledgements
This work has been supported by Ant Financial, Zhejiang University Financial Technology Research Center.
Compliance with ethics guidelines
Kui Ren, Tianhang Zheng, Zhan Qin, and Xue Liu declare that they have no conflict of interest or financial conflicts to disclose.