Attacking and Defending
Neural Networks Final Project
Deep Learning From Theory to Practice
Fabio Mistrangelo
Jose P. Navarro
Pablo Laso
Contents
1. Introduction
2. Research background
3. Theoretical analysis
4. Model implementation
4.1. Datasets
4.2. Repositories
4.3. Inpainting
4.4. Markpainting
4.5. Watermark classifier
5. Results and Discussion
6. Further improvements
7. Conclusion
References
8. Appendix
1. Introduction
Machine learning (ML) has enabled a large number of artificial intelligence (AI) applications,
including deep fakes. Usually, when people talk about deep fakes, they consider tasks like face
swapping or artificial face generation, but in general the term covers anything that involves generating data that looks real to humans. We have barely started to understand the implications these technologies have for society (M. Mustak et al., 2022) [1] or could have in the future (M. Masood et al., 2022) [2], and we still do not know how to regulate them or what to do with and about them. One aim of this report is to remind the reader why such regulation is important.
Among these new ML techniques is one called inpainting. Inpainting enables automatic content creation (Ramesh et al., 2021) and manipulation (Yu et al., 2018) by selecting a part of an image to be cleared out and having an ML algorithm fill it in. For example, Figure 1 (a) shows the recent attack by Bolsonaro supporters on the Brazilian Congress. The same image on the right (b) shows one of these supporters removed by inpainting. The inpainter does a very good job with very little effort: it took us only seconds, most of which were spent on creating a decent mask. This short anecdote about creating a mask will become important later in the report. For now, note that it is hard for an untrained eye to tell that this image was edited, and even harder to tell where the edit occurred (we circled it in red for clarity).
People now commonly use this technology to remove all kinds of information from images and videos. Yet, as our example shows, this can also be done for malicious purposes.
Figure 1. Supporters (a) inpainted (b).
The crime we want to bring to light is copyright infringement; in particular, for this report, infringement of artwork copyright through watermark removal. Watermarks were used in the past to protect documents from being forged, by imprinting different water patterns as hidden signatures of ownership. Today, the term is still used to describe marks in digital data that are intended to protect it from illegal use. People wishing to reuse an image without permission may want to remove the mark and restore a plausible background in its place. For the sake of this report, we restrict the data that can be watermarked to images and artworks (i.e., not videos).
Image owners, already at risk of getting their material stolen, are also incentivized to ensure that
their work is not used in ways that they did not authorize. When such misuse occurs, technical
mechanisms are needed for demonstrating ownership. Misused images, and the mechanisms to defend against and deter unauthorized use, are the main subjects of this report.
In ML literature, this is better known as Adversarial Attacking and Defending, which we will
expand on in the theoretical background. We will then explain the neural network (NN) architecture
implementations, divided into three main steps. First, we will use a NN to inpaint images, which will
constitute our attack. Then, we will use a different NN that, by adding perturbations to the image, makes the watermark harder to remove, constituting our defense. Later, we will use a third NN to classify each image as one with or without a watermark. Finally, we will analyze how different parameters affect both the attack and the defense. Most interestingly, we will show how the combination of these three NNs constitutes a Generative Adversarial Network (GAN) and what
this implies.
2. Research background
As explained in the motivation for the NIPS 2017 competition, advances in deep neural networks such as CNNs and GANs have enabled researchers to solve many important practical problems, but these networks remain vulnerable to adversarial examples. Such perturbed inputs can fool ML models of all types. Understanding how they work and how to prevent them is of great interest for improving NN robustness in this growing field of ML research [3].
One class of the above-mentioned neural networks with increasing popularity is the deep convolutional network (deep CNN). This class of networks has become particularly popular for image generation and restoration. Yet, their excellent performance can also end up in the wrong hands. In the context of this project, it proved to be incredibly easy to remove a watermark with the many inpainter implementations available online, needing only minor adaptations. We will take advantage of the observation, discussed in the Deep Image Prior paper, that the structure of a generator network is by itself sufficient to capture a great deal of image statistics prior to any learning. In that paper, randomly initialized neural networks are used as a handcrafted prior with excellent inpainting results [4].
Inpainting attacks naturally lead us to the question of how we can make it more difficult to
remove watermarks. The counter is that ML inpainters can themselves be manipulated using techniques adapted from the field of adversarial machine learning. We adopt the term "markpainting" for the idea of perturbing an image to prevent its watermark from being removed, as introduced in the paper Markpainting: Adversarial Machine Learning Meets Inpainting [5].
3. Theoretical analysis
A first question is whether the attacker can actually fool the defender, and why this should work. Our findings in the investigated literature and similar comparisons suggest that removing a watermark is nothing more than inpainting an image with an appropriate mask that fits the watermark. This works because, during training, inpainters generate binary masks for each image in a batch, drawing black lines of random length and thickness on a white background so that the NN learns to adapt to different contexts. A specific mask fed to a trained network will therefore do the job. Different inpainting NNs can already do this quite well and keep improving, for instance by generating self-tailored masks.
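To make this concrete, the following is a minimal sketch of such random-stroke mask generation, using NumPy and OpenCV; the function name and parameters are our own illustration and do not come from any particular inpainter's training code:

```python
import numpy as np
import cv2  # OpenCV; assumed to be available in the environment

def random_stroke_mask(height, width, num_strokes=8, max_thickness=20, seed=None):
    """Draw black strokes of random length and thickness on a white background.

    Returns a float32 mask: 1 = known pixel, 0 = region to be inpainted,
    mirroring the convention described in the text."""
    rng = np.random.default_rng(seed)
    mask = np.ones((height, width), dtype=np.float32)   # white background
    for _ in range(num_strokes):
        # random endpoints give strokes of random length and orientation
        p0 = (int(rng.integers(0, width)), int(rng.integers(0, height)))
        p1 = (int(rng.integers(0, width)), int(rng.integers(0, height)))
        thickness = int(rng.integers(1, max_thickness))
        cv2.line(mask, p0, p1, color=0.0, thickness=thickness)
    return mask
```

A watermark-specific mask is simply the same kind of binary map, drawn to cover the watermark region instead of random strokes.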
If inpainters are so good, can the attack be mitigated by the defense? This is trickier to answer. In a general sense, even if new defense techniques arise constantly, it is undeniable that attacks have the upper hand. Attackers are free to devise new ways to attack, while defenses inherently lag behind as they adapt their response. More specifically, in this report we discuss how watermarks can be made more robust, which is clearly a challenge given how good NNs are at reconstructing images. Yet, as with any properly perturbed image, we can find a specific configuration that does not significantly damage the original image but is still able to trick the inpainter into under-performing. In a certain sense, it is a self-induced attack that is detrimental to the attacker and ideally indifferent to the image.
To help us answer this question, we feed images to a third NN, a watermark classifier. On the one hand, we have watermarked images that have been inpainted so as to pass as real through this classifier. On the other hand, we have the original images that never had a watermark. The watermark classifier aims to distinguish one from the other. In other words, it outputs a binary result indicating whether an attack was successful or not.
4. Model implementation
4.1. Datasets. To combine and run the three NNs implemented in this report, we searched for pre-trained models. This decision was made not only to save time but also to have a partial guarantee that we could replicate the results of each of the papers on which our project is based. The defense algorithm is based on a well-known pre-trained model (Inception-v3), related not to watermarking but to inpainting. Other papers we consulted also use large, well-known datasets (CelebA, Places2, ImageNet), given that inpainting is not specific to watermarks. On the other hand, our inpainting algorithm works on priors and uses the Set5 and Set14 datasets, which are used in the original paper for super-resolution and inpainting examples. Both the attack and defense algorithms were pre-trained, and we had access to their saved training checkpoints.
4.2. Repositories. In addition to the datasets and models mentioned above, three main repositories were used throughout this project: one for inpainting, one for generating adversarial images, and one for watermark classification. All of them had to be slightly modified to fit our experiments, and our work is divided into three .ipynb files, one per repository.
4.3. Inpainting. The inpainting algorithm we used was developed by D. Ulyanov et al. [6].
Figure 2. Deep image prior architecture.
They use a randomly initialized neural network as an effective handcrafted prior for common inverse problems. The architecture of this model is shown in Figure 2.
Figure 3. Parameter values.
In a nutshell, the architecture used is an encoder-decoder (hourglass) with skip connections, represented by yellow arrows. Here, $n_u[i]$, $n_d[i]$, and $n_s[i]$ correspond to the number of filters at depth $i$ for the upsampling, downsampling, and skip connections, respectively (see Figure 3). Additionally, the values $k_u[i]$, $k_d[i]$, and $k_s[i]$ correspond to the respective kernel sizes.
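As a rough sketch of how such a randomly initialized prior is used for inpainting (simplified from the idea in [4, 6]; the function and argument names are ours, and `net` stands for the hourglass generator described above):

```python
import torch

def dip_inpaint(net, corrupted, mask, num_iters=3000, lr=0.01):
    """Deep-image-prior style inpainting: fit a randomly initialized
    generator `net` so that its output matches `corrupted` only on the
    known pixels (mask = 1); the masked region is filled in purely by
    the structural prior of the network.

    corrupted, mask: tensors of shape (1, C, H, W)."""
    z = 0.1 * torch.randn_like(corrupted)              # fixed random input code
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(num_iters):
        optimizer.zero_grad()
        out = net(z)
        # reconstruction loss is computed only on the unmasked (known) pixels
        loss = (((out - corrupted) * mask) ** 2).mean()
        loss.backward()
        optimizer.step()
    return net(z).detach()
```

No training data is involved: the only supervision is the corrupted image itself, masked so that the watermark region never contributes to the loss.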
4.4. Markpainting. For this part of the project, we needed an algorithm that takes the image and adds a perturbation to it, so that the inpainting fails to reconstruct the image without a watermark. Let us say we feed an attacked image $X$, which is classified by our model $M$ as an image without a watermark. We want to find an adversarial image $\hat{X}$, perceptually indistinguishable from the original input $X$, such that it will be classified by that same model $M$ as having a watermark. Therefore, we add an adversarial perturbation $\theta$ to the original input [7]. Note that we want the adversarial image to be indistinguishable from the original one. This can be achieved by constraining the magnitude of the adversarial perturbation: $\|X - \hat{X}\|_\infty \leq \epsilon$. That is, the $L_\infty$ norm should be less than epsilon. Here, $L_\infty$ denotes the maximum change over all pixels of the adversarial image [8].
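A minimal sketch of how such an $L_\infty$-bounded random perturbation can be produced in practice (PyTorch; the function name is ours, and pixel values are assumed to lie in [0, 1]):

```python
import torch

def perturb_linf(x, epsilon):
    """Add a random perturbation bounded in the L-infinity norm, so that
    ||x_adv - x||_inf <= epsilon, with pixel values kept in [0, 1]."""
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon)  # random noise
    return torch.clamp(x + delta, 0.0, 1.0)
```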
The perturbations $\epsilon$ given to our model are random, rather than noise obtained from the gradient of a loss function, which would amount to the Fast Gradient Sign Method (FGSM). We also implement five different types of alteration to the image. Namely, we test five transformations: JPEG compression, low-pass filtering, Gaussian blurring, white noise, and brightness adjustments. For JPEG compression, we explore its effects at different levels of $\epsilon$.
4.5. Watermark classifier. The watermark classifier is an ensemble of a ResNeXt and a ConvNeXt. The ResNeXt architecture is an extension of a deep residual network, while ConvNeXt is a pure convolutional model inspired by the design of Vision Transformers. Ensembled, this classifier performs very well, classifying images as either "watermarked" (0) or "clear" (1) [9]:

$$D(x) = \begin{cases} 0, & \text{if a watermark was detected,} \\ 1, & \text{otherwise.} \end{cases} \qquad (1)$$
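A minimal sketch of the ensemble decision in Equation (1) is given below. The actual repository [9] provides its own model-loading and prediction utilities, so averaging two single-logit models as done here is only our assumption of how such an ensemble could be wired:

```python
import torch

@torch.no_grad()
def detect_watermark(x, resnext, convnext, threshold=0.5):
    """Ensemble decision of Equation (1): return 0 if a watermark is
    detected, 1 if the image is judged clear.

    x: image tensor of shape (1, C, H, W); both models are assumed to
    output a single logit for the probability that x is watermarked."""
    p = 0.5 * (torch.sigmoid(resnext(x)) + torch.sigmoid(convnext(x)))
    return 0 if p.item() > threshold else 1
```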
5. Results and Discussion
Figure 4. Successful inpainting attack on Van Gogh self-portrait.
Naturally, the images fed to the inpainting model were the original watermarked images together with their specific masks. A mask specifies the region of the watermark on which the inpainting should be applied. We want to point out that generating a mask automatically is possible and could be implemented. Our first experiment is intended to highlight the importance of the mask, and is introduced with the outcome of the successful attack shown in Figure 4.
We tested our inpainting algorithm with different masks to see their effect on performance. Intuitively, tightly fitted masks should perform better than generalized ones. Yet, our results suggest that this is not the full story. Inpainting Vincent Van Gogh's self-portrait requires both local and global knowledge of the image. With only local knowledge, the watermark gets confused with the painter's recognizable brush strokes and does not disappear fully, as shown with mask 1 (see Figure 5).

Figure 5. How different masks yield different inpainted images.

On the other extreme, the image on the right features an exaggerated mask that does not allow the algorithm to infer that there should be a beard. With this experiment we want to illustrate how running the attacking algorithm with an appropriate mask yields successful watermark removals. Inpainting techniques are thus clearly sensitive to the mask, and an unsuited one prevents optimal reconstruction of the image.
In our next experiment, we investigate targeted markpainting (i.e., a targeted defense against the inpainting). We do this by forcing the reconstruction to resemble the initial image with the watermark. In other words, we run the same watermarked image through the inpainting algorithm, but apply a perturbation to the image beforehand.
The motivation for this can be described as defending by attacking the attacker. In other words, we want the inpainting algorithm to fail through adversarial examples. As expected, each manipulation significantly increased defense performance, with a different impact on the inpainting. Figure 6 shows the findings of this experiment, comparing all the transformations applied as defense perturbations after the images were run through the inpainting attack.
Figure 6. Markpainting defenses on Van Gogh's Sunflowers.
Our last experiment also has to do with perturbations; this time we investigate not the difference between them, but the role played by the level of alteration. Extreme perturbations noticeably deteriorate the image, but we exaggerate them here to evaluate our results more easily and make them visually clearer (see the Appendix). As seen in Figure 7, we have an original image with a generalized mask, followed by the attacked images at different levels of random perturbation. Notice that the watermark of the image with zero perturbation is completely removed, while the watermarks of the remaining images are still somewhat visible. Images with lower levels of perturbation do not preserve the watermark as well as those with higher levels do. Yet, as mentioned before, we want images to look as similar to the original as possible, which is not the case at high perturbation levels. In the end, a perturbation between 0.05 and 0.1 reaches this equilibrium.
Figure 7. Perturbation ϵ for defenses on Van Gogh night sky.
Lastly, we show the results for our classifier. Although this is not within the scope of the project, its implementation is interesting because it gives the whole pipeline the possibility of working as a GAN, i.e., in an automated "adversarial" way. The classifier is able to distinguish watermarked images even when they are perturbed, although it does make classification errors. Table 1 shows the results for different epsilon perturbations of the images. Similarly, we tested the performance of the watermark classifier on images distorted by other means, as shown in Table 2; in this case, no image is recognized as watermarked.
epsilon   result
0.001     clear
0.005     clear
0.010     clear
0.050     watermarked
0.100     watermarked

Table 1. Classification of "Night Sky" images perturbed with different epsilon values, after being inpainted.
perturbation       result
Brightness         clear
Gaussian blur      clear
JPEG compression   clear
Low-pass filter    clear
White noise        clear

Table 2. Classification of "Sunflowers" images perturbed with different techniques, after being inpainted.
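The pipeline behind Table 1 can be summarized by the following loop, a sketch that reuses the illustrative helpers from the previous sections; `watermarked`, `mask`, `net`, `resnext`, and `convnext` are assumed to have been loaded beforehand, and the code does not reproduce our notebooks verbatim:

```python
# Sweep over defense strengths: perturb (defend), inpaint (attack), classify.
# In practice `net` is re-initialized for every image, as the prior is untrained.
for epsilon in [0.001, 0.005, 0.010, 0.050, 0.100]:
    defended = perturb_linf(watermarked, epsilon)     # markpainting defense
    restored = dip_inpaint(net, defended, mask)       # inpainting attack
    label = detect_watermark(restored, resnext, convnext)
    print(f"epsilon={epsilon:.3f} -> {'clear' if label == 1 else 'watermarked'}")
```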
From these results, we can infer that our approaches for attacking the inpainter were successful. We also found optimal values for epsilon, which lie in the range [0.05, 0.1]. Specifically, in Table 1 we found that epsilon values of 0.010 or below are not sufficient to make the inpainting algorithm fail, whereas epsilon values of 0.05 or above proved to be enough to make the inpainting NN fail at image reconstruction, since the classifier recognizes these images as watermarked. In other words, the "discriminator" can recognize them as fake images and could be used to make the inpainting NN optimize its parameters until it is eventually able to fool the classifier.
6. Further improvements
It is clear that we had time constraints to expand this project, yet there are still quite a lot of
interesting things to explore. Here we comment on some of them.
In a project that investigates whether an attack can be mitigated through perturbations, testing different attackers (i.e., other inpainting models) is crucial. For completeness, running the same image through different inpainting algorithms would give us more insight into how other architectures are affected by the defense technique we implemented with our perturbations. Using other inpainters would also allow us to compare how transferable our proposed technique is between different models using the same loss function.
On the other hand, we used pre-trained models, given that training is an incredibly time-consuming task and was not the focus of the project. Nonetheless, training the same model on different datasets would also allow us to show how transferable our proposed technique is between instances of the same model trained on different datasets.
Calculating other characteristics of an image, such as the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM), is an interesting way to assess inpainted image quality (a short sketch is given below). In addition, we are in essence disrupting an image with the hope that the disruption is imperceptible; we could go further and study how more specific perturbations could be crafted so as to affect the image even less. Along the same lines, having shown how a good mask can directly affect the outcome of an inpainting attack, there is a clear motivation to build one without human supervision. Creating an automatic mask generator would not only have made our lives easier when experimenting with images, but would also allow us in the future to focus on parameters that are independent of the physical characteristics of an image.
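For illustration, the quality metrics mentioned above could be computed with scikit-image as in the short sketch below; we did not run this as part of the project, and `original` and `inpainted` are placeholder names:

```python
# Image quality metrics using scikit-image (assumed available);
# `original` and `inpainted` are assumed to be uint8 RGB arrays of equal shape.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

psnr = peak_signal_noise_ratio(original, inpainted)
ssim = structural_similarity(original, inpainted, channel_axis=-1)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")
```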
Finally, we found in our first experiment that mask size influences the performance of the technique, and in the last experiment that perturbations prevent a watermark from being removed. Yet, even though we were testing the effects of the perturbation, the mask itself could also have played a role, in the sense that the added color could have affected the results. Constrained by time, we can only hypothesize that color played a role in these results, a probable cause being that the training datasets perhaps lack certain colors. This would be interesting to investigate in future research.
7. Conclusion
For this project, we set out to demonstrate adversarial attacking and defending on inpainting neural networks. We were additionally motivated to raise awareness of the possibilities of deep fakes and of how consciously these technologies have to be used. We followed a clear artistic theme throughout the project; being in the Netherlands further motivated us to choose Van Gogh's art for the experiments. We present, more than a classical adversarial attack-and-defense project, an exciting meta-attack on digital art and copyright.
We first motivated our project and described the research that has been done on this topic. We then explained the datasets, the repositories, and the architectures, and described their implementations. These were divided into three parts. First, we used a NN to inpaint images, constituting our attack. Then, we fed the images to an adversarial generator that, by applying perturbations, made the watermark harder to remove. The transformations were JPEG compression, low-pass filtering, Gaussian blurring, white noise, and brightness adjustments; we also perturbed the images at different values of $\epsilon$. All these methods constituted our defense. Finally, we used a third NN to classify each image as one with or without a watermark.
We showed that masks are relevant. Precise masks are particularly good in areas of an image that contain detail, while more general masks allow free interpretation in areas where detail is unimportant. We then showed five different transformations that can be applied to an image to protect its watermark from being removed. We also showed how the watermarks of images perturbed at different levels remain visible. Images with perturbations between 0.05 and 0.1 preserve the watermark while not altering the image noticeably.
Our final result was the creation of a GAN. As seen in the previous section, many improvements would be needed to make this a robust GAN, but we thought it appropriate to conclude with what we mean by creating one. GANs are an exciting and rapidly evolving field. They are a clever way of training a generative model by framing the problem as a supervised learning problem and simultaneously training two sub-models, the generative model (G) and the discriminative model (D). The objective of G is to capture the distribution of some target data. D aids the training of G by examining the data generated by G in reference to "real" data, thereby helping G learn the distribution that underpins the real data. As introduced in the original paper, this can be thought of as G having an adversary.
A straightforward analogy, in keeping with the artistic approach of this paper, is counterfeiting paintings: G plays the role of a counterfeiter in training, while the art expert D strives to identify fake art. In the process, both improve, yet G ends up honing its painting-replicating skills. As with our attack and defense mechanisms, the attacker has the upper hand: in the end, G is able to create a painting that fools the expert.
References
[1] M. Mustak et al., "Deepfakes: Deceptions, mitigations, and opportunities," Elsevier, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0148296322008335.
[2] M. Masood et al., "Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward," Springer, 2022.
[3] A. Kurakin et al., "Adversarial attacks and defences competition," arXiv:1804.00097, 2018.
[4] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Deep image prior," arXiv:1711.10925, 2017.
[5] D. Khachaturov, I. Shumailov, Y. Zhao, N. Papernot, and R. Anderson, "Markpainting: Adversarial machine learning meets inpainting," arXiv:2106.00660, 2021.
[6] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, "Deep image prior," CoRR, vol. abs/1711.10925, 2017. [Online]. Available: http://arxiv.org/abs/1711.10925.
[7] R. Paul, M. Schabath, R. Gillies, L. Hall, and D. Goldgof, "Mitigating adversarial attacks on medical image understanding systems," in 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), IEEE, 2020, pp. 1517–1521.
[8] A. Musa, K. Vishi, and B. Rexha, "Attack analysis of face recognition authentication systems using fast gradient sign method," Applied Artificial Intelligence, vol. 35, no. 15, pp. 1346–1360, 2021.
[9] I. Pavlov, Watermark detection, https://github.com/boomb0om/watermark-detection, 2022.
8. Appendix
Appendix A
Figure 8. Perturbations for experiment 3.