Attacking and Defending
Neural Networks Final Project
Deep Learning From Theory to Practice
Fabio Mistrangelo
Jose P. Navarro
Pablo Laso
Contents
1. Introduction
2. Research background
3. Theoretical analysis
4. Model implementation
4.1. Datasets
4.2. Repositories
4.3. Inpainting
4.4. Markpainting
4.5. Watermark classifier
5. Results and Discussion
6. Further improvements
7. Conclusion
References
8. Appendix
1. Introduction
Machine learning (ML) has enabled a large number of artificial intelligence (AI) applications,
including deep fakes. Usually, when people talk about deep fakes, they consider tasks like face
swapping or artificial face generation, but in general the term covers anything that involves generating data that looks real to humans. We have barely started to understand the implications these technologies have for society (M. Mustak et al., 2022) [1] or could have in the future (M. Masood et al., 2022) [2], and we still do not know how to regulate them or what to do with and about them. One aim of this report is to remind the reader why such regulation is important.
Among these new ML techniques is one called inpainting. Inpainting enables automatic content creation (Ramesh et al., 2021) and manipulation (Yu et al., 2018) by selecting a part of an image to be cleared out and having an ML algorithm fill it in. For example, Figure 1 (a) shows the recent attack by Bolsonaro supporters on the Brazilian Congress. The same image on the right (b) shows one of these supporters removed by inpainting. The inpainter does a very good job with very little effort: it took us only seconds, most of which were spent on creating a decent mask. This short anecdote about creating a mask will become important later in the report. For now, note that it is hard for an untrained eye to tell that this image was edited, and even harder to tell where the edit occurred (we circled it in red for clarity).
People now commonly use this technology to remove all kinds of information from images and videos. Yet, as our example shows, this can also be done for malicious purposes.
Figure 1. Supporters (a) inpainted (b).
The crime we want to bring to light is copyright infringement; in particular, for this report, infringement of artwork copyright through watermark removal. Watermarks were used in the past to protect documents from being forged, by imprinting different water patterns as hidden signatures of ownership. Today, the term is still used to describe marks in digital data that are intended to protect it from illegal use. People wishing to reuse an image without permission may want to remove the mark and restore a plausible background in its place. For the sake of this report, we restrict the data that can be watermarked to images and artworks (i.e., not videos).
Image owners, already at risk of getting their material stolen, are also incentivized to ensure that
their work is not used in ways that they did not authorize. When such misuse occurs, technical
mechanisms are needed for demonstrating ownership. Misused images, and the mechanisms to defend against and deter unauthorized use, are the main subjects of this report.
In ML literature, this is better known as Adversarial Attacking and Defending, which we will
expand on in the theoretical background. We will then explain the neural network (NN) architecture
implementations, divided into three main steps. First, we will use a NN to inpaint images, which will
constitute our attack. Then, we will use a different NN that, by adding perturbations to the image, makes the watermark harder to remove, constituting our defense. Later, we will use a third NN to classify each image as one with or without a watermark. Finally, we will analyze how different parameters affect both the attack and the defense. Most interestingly, we will show how the combination of these three NNs constitutes a Generative Adversarial Network (GAN) and what
this implies.
2. Research background
As explained in the motivation for the NIPS 2017 competition, advances in deep neural networks such as CNNs and GANs have enabled researchers to solve many important practical problems, but these networks remain vulnerable to adversarial examples. Such perturbed inputs can fool ML models of all types. Understanding how they work and how to prevent them is of great interest for improving NN robustness in this growing field of ML research [3].
One class of the above-mentioned neural networks with increasing popularity is the deep convolutional network (deep CNN). This class of networks has become particularly popular for image generation and restoration. Yet, their excellent performance can also end up in the wrong hands. In the context of this project, it proved to be incredibly easy to remove a watermark with the many inpainter implementations available online, needing only minor adaptations. We will take advantage of the observation, discussed in the Deep Image Prior paper, that the structure of a generator network is by itself sufficient to capture a great deal of image statistics prior to any learning. In that paper, randomly initialized neural networks are used as a handcrafted prior with excellent inpainting results [4].
Inpainting attacks naturally lead us to the question of how we can make it more difficult to
remove watermarks. The counter is that ML inpainters can themselves be manipulated using techniques adapted from the field of adversarial machine learning. We adopt the term "markpainting" for the idea of perturbing an image to prevent its watermark from being removed, as introduced in the paper Markpainting: Adversarial Machine Learning Meets Inpainting [5].
3. Theoretical analysis
A first question is whether the attacker can actually fool the defender, and why this should work. Our findings in the investigated literature and similar comparisons suggest that removing a watermark is nothing more than inpainting an image with an appropriate mask that fits the watermark. This works because, during training, inpainters generate binary masks for each image in a batch, drawing black lines of random length and thickness on a white background so that the NN learns to adapt to different contexts. A specific mask fed to a trained network will therefore do the job. Different inpainting NNs can already do this quite well and keep improving, for instance by generating self-tailored masks.
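To make this concrete, the following is a minimal sketch of such random-stroke mask generation, using NumPy and OpenCV; the function name and parameters are our own illustration and do not come from any particular inpainter's training code:

```python
import numpy as np
import cv2  # OpenCV; assumed to be available in the environment

def random_stroke_mask(height, width, num_strokes=8, max_thickness=20, seed=None):
    """Draw black strokes of random length and thickness on a white background.

    Returns a float32 mask: 1 = known pixel, 0 = region to be inpainted,
    mirroring the convention described in the text."""
    rng = np.random.default_rng(seed)
    mask = np.ones((height, width), dtype=np.float32)   # white background
    for _ in range(num_strokes):
        # random endpoints give strokes of random length and orientation
        p0 = (int(rng.integers(0, width)), int(rng.integers(0, height)))
        p1 = (int(rng.integers(0, width)), int(rng.integers(0, height)))
        thickness = int(rng.integers(1, max_thickness))
        cv2.line(mask, p0, p1, color=0.0, thickness=thickness)
    return mask
```

A watermark-specific mask is simply the same kind of binary map, drawn to cover the watermark region instead of random strokes.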
If inpainters are so good, can the attack be mitigated by the defense? This is trickier to answer. In a general sense, even if new defense techniques arise constantly, it is undeniable that attacks have the upper hand. Attackers are free to devise new ways to attack, while defenses inherently lag behind as they adapt their response. More specifically, in this report we discuss how watermarks can be made more robust, which is clearly a challenge given how good NNs are at reconstructing images. Yet, as with any properly perturbed image, we can find a specific configuration that does not significantly damage the original image but is still able to trick the inpainter into under-performing. In a certain sense, it is a self-induced attack that is detrimental to the attacker and ideally indifferent to the image.
To help us answer this question, we feed images to a third NN, a watermark classifier. On the one hand, we have watermarked images that have been inpainted so as to pass as real through this classifier. On the other hand, we have the original images that never had a watermark. The watermark classifier aims to distinguish one from the other. In other words, it outputs a binary result indicating whether an attack was successful or not.
4. Model implementation
4.1. Datasets. To combine and run the three NNs implemented in this report, we searched for pre-trained models. This decision was made not only to save time but also to have a partial guarantee that we could replicate the results of each of the papers on which our project is based. The defense algorithm is based on a well-known pre-trained model (Inception-v3), related not to watermarking but to inpainting. Other papers we consulted also use large, well-known datasets (CelebA, Places2, ImageNet), given that inpainting is not specific to watermarks. On the other hand, our inpainting algorithm works on priors and uses the Set5 and Set14 datasets, which are used in the original paper for super-resolution and inpainting examples. Both the attack and defense algorithms were pre-trained, and we had access to their saved training checkpoints.
4.2. Repositories. In addition to the datasets and models mentioned above, three main repositories were used throughout this project: one for inpainting, one for generating adversarial images, and one for watermark classification. All of them had to be slightly modified to fit our experiments, and our work is divided into three .ipynb files, one per repository.
4.3. Inpainting. The inpainting algorithm we used was developed by D. Ulyanov et al. [6].
Figure 2. Deep image prior architecture.
They use a randomly initialized neural network as an effective handcrafted prior for common inverse problems. The architecture of this model is shown in Figure 2.
Figure 3. Parameter values.
In a nutshell, the architecture used is an encoder-decoder (hourglass) with skip connections, represented by yellow arrows. Here, $n_u[i]$, $n_d[i]$, and $n_s[i]$ correspond to the number of filters at depth $i$ for the upsampling, downsampling, and skip connections, respectively (see Figure 3). Additionally, the values $k_u[i]$, $k_d[i]$, and $k_s[i]$ correspond to the respective kernel sizes.
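As a rough sketch of how such a randomly initialized prior is used for inpainting (simplified from the idea in [4, 6]; the function and argument names are ours, and `net` stands for the hourglass generator described above):

```python
import torch

def dip_inpaint(net, corrupted, mask, num_iters=3000, lr=0.01):
    """Deep-image-prior style inpainting: fit a randomly initialized
    generator `net` so that its output matches `corrupted` only on the
    known pixels (mask = 1); the masked region is filled in purely by
    the structural prior of the network.

    corrupted, mask: tensors of shape (1, C, H, W)."""
    z = 0.1 * torch.randn_like(corrupted)              # fixed random input code
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(num_iters):
        optimizer.zero_grad()
        out = net(z)
        # reconstruction loss is computed only on the unmasked (known) pixels
        loss = (((out - corrupted) * mask) ** 2).mean()
        loss.backward()
        optimizer.step()
    return net(z).detach()
```

No training data is involved: the only supervision is the corrupted image itself, masked so that the watermark region never contributes to the loss.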
4.4. Markpainting. For this part of the project, we needed an algorithm that takes the image and adds a perturbation to it, so that the inpainting fails to reconstruct the image without a watermark. Let us say we feed an attacked image $X$, which is classified by our model $M$ as an image without a watermark. We want to find an adversarial image $\hat{X}$, perceptually indistinguishable from the original input $X$, such that it will be classified by that same model $M$ as having a watermark. Therefore, we add an adversarial perturbation $\theta$ to the original input [7]. Note that we want the adversarial image to be indistinguishable from the original one. This can be achieved by constraining the magnitude of the adversarial perturbation: $\|X - \hat{X}\|_\infty \leq \epsilon$. That is, the $L_\infty$ norm should be less than epsilon. Here, $L_\infty$ denotes the maximum change over all pixels of the adversarial image [8].
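A minimal sketch of how such an $L_\infty$-bounded random perturbation can be produced in practice (PyTorch; the function name is ours, and pixel values are assumed to lie in [0, 1]):

```python
import torch

def perturb_linf(x, epsilon):
    """Add a random perturbation bounded in the L-infinity norm, so that
    ||x_adv - x||_inf <= epsilon, with pixel values kept in [0, 1]."""
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon)  # random noise
    return torch.clamp(x + delta, 0.0, 1.0)
```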
The perturbations $\epsilon$ given to our model are random, rather than noise obtained from the gradient of a loss function, which would amount to the Fast Gradient Sign Method (FGSM). We also implement five different types of alteration to the image. Namely, we test five transformations: JPEG compression, low-pass filtering, Gaussian blurring, white noise, and brightness adjustments. For JPEG compression, we explore its effects at different levels of $\epsilon$.
4.5. Watermark classifier. The watermark classifier is an ensemble of a ResNeXt and a ConvNeXt. The ResNeXt architecture is an extension of a deep residual network, while ConvNeXt is a pure convolutional model inspired by the design of Vision Transformers. Ensembled, this classifier performs very well, classifying images as either "watermarked" (0) or "clear" (1) [9]:

$$D(x) = \begin{cases} 0, & \text{if a watermark was detected,} \\ 1, & \text{otherwise.} \end{cases} \qquad (1)$$
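A minimal sketch of the ensemble decision in Equation (1) is given below. The actual repository [9] provides its own model-loading and prediction utilities, so averaging two single-logit models as done here is only our assumption of how such an ensemble could be wired:

```python
import torch

@torch.no_grad()
def detect_watermark(x, resnext, convnext, threshold=0.5):
    """Ensemble decision of Equation (1): return 0 if a watermark is
    detected, 1 if the image is judged clear.

    x: image tensor of shape (1, C, H, W); both models are assumed to
    output a single logit for the probability that x is watermarked."""
    p = 0.5 * (torch.sigmoid(resnext(x)) + torch.sigmoid(convnext(x)))
    return 0 if p.item() > threshold else 1
```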
5. Results and Discussion
Figure 4. Successful inpainting attack on Van Gogh self-portrait.
Naturally, the images fed to the inpainting model were the original watermarked images together with their specific masks. A mask specifies the region of the watermark on which the inpainting should be applied. We want to point out that generating a mask automatically is possible and could be implemented. Our first experiment is intended to highlight the importance of the mask, and is introduced with the outcome of the successful attack shown in Figure 4.
We tested our inpainting algorithm with different masks to see their effect on performance. Intuitively, tightly fitted masks should perform better than generalized ones. Yet, our results suggest that this is not the full story. Inpainting Vincent Van Gogh's self-portrait requires both local and global knowledge of the image. With only local knowledge, the watermark gets confused with the painter's recognizable brush strokes and does not disappear fully, as shown with mask 1 (see Figure 5).

Figure 5. How different masks yield different inpainted images.

On the other extreme, the image on the right features an exaggerated mask that does not allow the algorithm to infer that there should be a beard. With this experiment we want to illustrate how running the attacking algorithm with an appropriate mask yields successful watermark removals. Inpainting techniques are thus clearly sensitive to the mask, and an unsuited one prevents optimal reconstruction of the image.
In our next experiment, we investigate targeted markpainting (i.e., a targeted defense against the inpainting). We do this by forcing the reconstruction to resemble the initial image with the watermark. In other words, we run the same watermarked image through the inpainting algorithm, but apply a perturbation to the image beforehand.
The motivation for this can be described as defending by attacking the attacker. In other words, we want the inpainting algorithm to fail through adversarial examples. As expected, each manipulation significantly increased defense performance, with a different impact on the inpainting. Figure 6 shows the findings of this experiment, comparing all the transformations applied as defense perturbations after the images were run through the inpainting attack.
Figure 6. Markpainting defenses on Van Gogh's Sunflowers.
Our last experiment also has to do with perturbations; this time we investigate not the difference between them, but the role played by the level of alteration. Extreme perturbations noticeably deteriorate the image, but we exaggerate them here to evaluate our results more easily and make them visually clearer (see the Appendix). As seen in Figure 7, we have an original image with a generalized mask, followed by the attacked images at different levels of random perturbation. Notice that the watermark of the image with zero perturbation is completely removed, while the watermarks of the remaining images are still somewhat visible. Images with lower levels of perturbation do not preserve the watermark as well as those with higher levels do. Yet, as mentioned before, we want images to look as similar to the original as possible, which is not the case at high perturbation levels. In the end, a perturbation between 0.05 and 0.1 reaches this equilibrium.
Figure 7. Perturbation ϵ for defenses on Van Gogh night sky.
Lastly, we show the results for our classifier. Although this is not within the scope of the project, its implementation is interesting because it gives the whole pipeline the possibility of working as a GAN, i.e., in an automated "adversarial" way. The classifier is able to distinguish watermarked images even when they are perturbed, although it does make classification errors. Table 1 shows the results for different epsilon perturbations of the images. Similarly, we tested the performance of the watermark classifier on images distorted by other means, as shown in Table 2; in this case, no image is recognized as watermarked.
epsilon   result
0.001     clear
0.005     clear
0.010     clear
0.050     watermarked
0.100     watermarked

Table 1. Classification of "Night Sky" images perturbed with different epsilon values, after being inpainted.
perturbation       result
Brightness         clear
Gaussian blur      clear
JPEG compression   clear
Low-pass filter    clear
White noise        clear

Table 2. Classification of "Sunflowers" images perturbed with different techniques, after being inpainted.
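The pipeline behind Table 1 can be summarized by the following loop, a sketch that reuses the illustrative helpers from the previous sections; `watermarked`, `mask`, `net`, `resnext`, and `convnext` are assumed to have been loaded beforehand, and the code does not reproduce our notebooks verbatim:

```python
# Sweep over defense strengths: perturb (defend), inpaint (attack), classify.
# In practice `net` is re-initialized for every image, as the prior is untrained.
for epsilon in [0.001, 0.005, 0.010, 0.050, 0.100]:
    defended = perturb_linf(watermarked, epsilon)     # markpainting defense
    restored = dip_inpaint(net, defended, mask)       # inpainting attack
    label = detect_watermark(restored, resnext, convnext)
    print(f"epsilon={epsilon:.3f} -> {'clear' if label == 1 else 'watermarked'}")
```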
From these results, we can infer that our approaches for attacking the inpainter were successful. We also found optimal values for epsilon, which lie in the range [0.05, 0.1]. Specifically, in Table 1 we found that epsilon values of 0.010 or below are not sufficient to make the inpainting algorithm fail, whereas epsilon values of 0.05 or above proved to be enough to make the inpainting NN fail at image reconstruction, since the classifier recognizes these images as watermarked. In other words, the "discriminator" can recognize them as fake images and could be used to make the inpainting NN optimize its parameters until it is eventually able to fool the classifier.
6. Further improvements
It is clear that we had time constraints to expand this project, yet there are still quite a lot of
interesting things to explore. Here we comment on some of them.
In a project that investigates whether an attack can be mitigated through perturbations, testing different attackers (i.e., other inpainting models) is crucial. For completeness, running the same image through different inpainting algorithms would give us more insight into how other architectures are affected by the defense technique we implemented with our perturbations. Using other inpainters would also allow us to compare how transferable our proposed technique is between different models using the same loss function.
On the other hand, we used pre-trained models, given that training is an incredibly time-consuming task and was not the focus of the project. Nonetheless, training the same model on different datasets would also allow us to show how transferable our proposed technique is between instances of the same model trained on different datasets.
Calculating other characteristics of an image, such as the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM), is an interesting way to assess inpainted image quality (a short sketch is given below). In addition, we are in essence disrupting an image with the hope that the disruption is imperceptible; we could go further and study how more specific perturbations could be crafted so as to affect the image even less. Along the same lines, having shown how a good mask can directly affect the outcome of an inpainting attack, there is a clear motivation to build one without human supervision. Creating an automatic mask generator would not only have made our lives easier when experimenting with images, but would also allow us in the future to focus on parameters that are independent of the physical characteristics of an image.
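For illustration, the quality metrics mentioned above could be computed with scikit-image as in the short sketch below; we did not run this as part of the project, and `original` and `inpainted` are placeholder names:

```python
# Image quality metrics using scikit-image (assumed available);
# `original` and `inpainted` are assumed to be uint8 RGB arrays of equal shape.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

psnr = peak_signal_noise_ratio(original, inpainted)
ssim = structural_similarity(original, inpainted, channel_axis=-1)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")
```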
Finally, we found in our first experiment that mask size influences the performance of the technique, and in the last experiment that perturbations prevent a watermark from being removed. Yet, even though we were testing the effects of the perturbation, the mask itself could also have played a role, in the sense that the added color could have affected the results. Constrained by time, we can only hypothesize that color played a role in these results, a probable cause being that the training datasets perhaps lack certain colors. This would be interesting to investigate in future research.
7. Conclusion
For this project, we set out to demonstrate adversarial attacking and defending on inpainting neural networks. We were additionally motivated to raise awareness of the possibilities of deep fakes and of how consciously these technologies have to be used. We followed a clear artistic theme throughout the project; being in the Netherlands further motivated us to choose Van Gogh's art for the experiments. We present, more than a classical adversarial attack-and-defense project, an exciting meta-attack on digital art and copyright.
We first motivated our project and described the research that has been done on this topic. We then explained the datasets, the repositories, and the architectures, and described their implementations. These were divided into three parts. First, we used a NN to inpaint images, constituting our attack. Then, we fed the images to an adversarial generator that, by applying perturbations, made the watermark harder to remove. The transformations were JPEG compression, low-pass filtering, Gaussian blurring, white noise, and brightness adjustments; we also perturbed the images at different values of $\epsilon$. All these methods constituted our defense. Finally, we used a third NN to classify each image as one with or without a watermark.
We showed that masks are relevant. Precise masks are particularly good in areas of an image that contain detail, while more general masks allow free interpretation in areas where detail is unimportant. We then showed five different transformations that can be applied to an image to protect its watermark from being removed. We also showed how the watermarks of images perturbed at different levels remain visible. Images with perturbations between 0.05 and 0.1 preserve the watermark while not altering the image noticeably.
Our final result was the creation of a GAN. As seen in the previous section, many improvements would be needed to make this a robust GAN, but we thought it appropriate to conclude with what we mean by creating one. GANs are an exciting and rapidly evolving field. They are a clever way of training a generative model by framing the problem as a supervised learning problem and simultaneously training two sub-models, the generative model (G) and the discriminative model (D). The objective of G is to capture the distribution of some target data. D aids the training of G by examining the data generated by G in reference to "real" data, thereby helping G learn the distribution that underpins the real data. As introduced in the original paper, this can be thought of as G having an adversary.
A straightforward analogy, in keeping with the artistic approach of this paper, is counterfeiting paintings: G plays the role of a counterfeiter in training, while the art expert D strives to identify fake art. In the process, both improve, yet G ends up honing its painting-replicating skills. As with our attack and defense mechanisms, the attacker has the upper hand: in the end, G is able to create a painting that fools the expert.
References
[1] M. Mustak et al., "Deepfakes: Deceptions, mitigations, and opportunities," Elsevier, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0148296322008335.
[2] M. Masood et al., "Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward," Springer, 2022.
[3] A. Kurakin et al., "Adversarial attacks and defences competition," arXiv:1804.00097, 2018.
[4] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Deep image prior," arXiv:1711.10925, 2017.
[5] D. Khachaturov, I. Shumailov, Y. Zhao, N. Papernot, and R. Anderson, "Markpainting: Adversarial machine learning meets inpainting," arXiv:2106.00660, 2021.
[6] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, "Deep image prior," CoRR, vol. abs/1711.10925, 2017. [Online]. Available: http://arxiv.org/abs/1711.10925.
[7] R. Paul, M. Schabath, R. Gillies, L. Hall, and D. Goldgof, "Mitigating adversarial attacks on medical image understanding systems," in 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), IEEE, 2020, pp. 1517–1521.
[8] A. Musa, K. Vishi, and B. Rexha, "Attack analysis of face recognition authentication systems using fast gradient sign method," Applied Artificial Intelligence, vol. 35, no. 15, pp. 1346–1360, 2021.
[9] I. Pavlov, Watermark detection, https://github.com/boomb0om/watermark-detection, 2022.
8. Appendix
Appendix A
Figure 8. Perturbations for experiment 3.