Ep1 – Adversarial Examples: now you see me, now you don’t

Introduction:

How many of us have been both dazzled and confused when shown an ingenious optical illusion? If you are anything like me, it might initially bring surprise and a childish sense of wonder, almost magic, but in a second, more reflective moment you might then wonder: how did that happen? Can I really trust my perceptions? Are my perceptions in agreement with physical reality? Or, in other words, does my model of the world accurately reflect its ultimate reality? These are all complicated questions that we can attempt to answer in different ways, but in this segment I will concentrate on the basics: we are susceptible to optical illusions. More to the point, machine learning solutions suffer from the same weakness: they can also be fooled by the equivalent of optical illusions, only in this case we technically call them Adversarial Examples (AEs).

Why do AEs exist? Can we ever avoid them?

Although this is still an area of active research, a first hypothesis suggested that the cause of AEs was the nonlinearities embedded by design in many machine learning models (e.g., the sigmoidal-type activation functions used in many neural networks). This was discarded very soon, as it was shown that large neural networks with linear activation functions are also susceptible to being misled by AEs, and therefore we can conclude that these nonlinearities are not the ultimate cause of AEs [ref].

Current research seems to favour another explanation for the origin of AEs: the non-intuitive and somewhat strange effects of high-dimensional geometry on our data. There is more than meets the eye in this statement. Our intuition (and most of what has been written and said about machine learning) tells us that, if we want to predict the outcome of a hypothetical variable Y, then the larger the number of predictive variables Xi (the larger the dimensionality), the better. Well, as it happens, this intuition seems to be justified only when the number of dimensions is relatively low. What happens in reality is that increasing the number of predictive variables helps up to a certain point, but as the dimensionality grows disproportionately larger than the number of training examples, errors and weird effects like adversarial examples start to pile up [Gilmer et al, 2018].
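To get a feel for how counter-intuitive high-dimensional geometry can be, here is a small toy sketch (my own illustration, not taken from the paper): a perturbation that changes every pixel by at most a tiny amount eps still moves the input a Euclidean distance of roughly eps·√d, so in a high-dimensional image space an imperceptible nudge can travel a long way towards a decision boundary.

```python
import numpy as np

# Toy illustration (not from the paper): a perturbation limited to +/- eps per
# pixel still moves the input a Euclidean distance of about eps * sqrt(d).
rng = np.random.default_rng(0)
eps = 0.01  # an imperceptible change per pixel

for d in (10, 1_000, 100_000):                       # 100_000 ~ pixels in a small colour image
    delta = eps * rng.choice([-1.0, 1.0], size=d)    # +/- eps on every pixel
    print(f"d={d:>7}: max per-pixel change = {np.abs(delta).max():.3f}, "
          f"distance moved = {np.linalg.norm(delta):.2f}")

# The per-pixel change never exceeds 0.01, yet the total displacement grows with
# sqrt(d): in high dimensions, invisible nudges can add up to a long walk
# towards a decision boundary.
```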

Although there has been good progress from the research community in this area, my impression is that so far most of the papers are computationally oriented, with fewer articles following the path of hard-core theoretical research that could put the question to rest once and for all. At the fundamental level there can be two different views on the cause of adversarial examples: should we blame the model (e.g., nonlinearities) or should we blame the data (e.g., dimensionality effects)? Although there is a case to be made for human optical illusions originating from both factors (model & data; maybe a topic for another blog), in the case of AEs the balance seems, at the moment, to tip towards the data being the main culprit for their existence.

Are we condemned to be fooled by AEs? What can we do? (Adversarial Training)   

In the absence of a definitive theoretical answer to these questions, the pragmatic approach of the AI community has mainly focussed on enriching the training data with adversarial examples, in an attempt to increase the accuracy of the classifications on exactly these troublesome inputs.

In a nutshell, training is now extended to the following steps:

  1. Initially we perform normal training of our machine learning model. For instance, if we are training a network for image classification, we feed images to the network and optimize it to produce the correct labels. Save the trained model (i.e., a checkpoint).
  2. We select images from specific classes and create an adversarial example for each image (e.g., an image that looks like a panda to us but that the network misclassifies as a gibbon). Repeat the procedure to obtain more images from the adversarial region.
  3. Starting from the pre-trained network (the last saved checkpoint), continue training using the newly generated, correctly labelled adversarial examples (a minimal sketch of steps 2 and 3 follows below).
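
For concreteness, here is a minimal sketch of steps 2 and 3, assuming a PyTorch image classifier; `model`, `train_loader` and the perturbation size `eps` are illustrative placeholders, and the adversarial images are generated with the fast gradient sign method (one common choice among many):

```python
import torch
import torch.nn.functional as F

def fgsm_examples(model, images, labels, eps=0.1):
    """Step 2: perturb correctly labelled images so the model misclassifies them
    (here via the fast gradient sign method; other attacks would work too)."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv = (images + eps * images.grad.sign()).clamp(0.0, 1.0)  # keep valid pixel values
    return adv.detach()

def adversarial_finetune(model, train_loader, epochs=1, eps=0.1, lr=1e-3):
    """Step 3: resume from the saved checkpoint and keep training on the
    adversarial images paired with their *correct* labels."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in train_loader:
            adv_images = fgsm_examples(model, images, labels, eps)
            opt.zero_grad()
            loss = F.cross_entropy(model(adv_images), labels)
            loss.backward()
            opt.step()
    return model

# Step 1 (normal training and saving the checkpoint) is just the usual training
# loop followed by torch.save(model.state_dict(), "checkpoint.pt").
```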

We can certainly imagine that doing this process manually would be painful. One strategy to automate it is to use dedicated libraries to generate AEs (e.g., foolbox for Python); a more sophisticated approach is to use Generative Adversarial Networks, or GANs [Goodfellow et al, 2014]. We will not examine the details of GANs any further in this article, but for completeness let's just say that in the GAN approach we pit two networks against each other: a counterfeiter network that generates AEs and a police network that tries to determine whether a given image is an original or a forgery. By optimizing both networks at the same time we essentially automate adversarial training. Automated or not, adversarial training relies on the ability to generate adversarial examples from existing data.
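
As an example of the library route, the snippet below sketches how an attack might be run with foolbox (assuming foolbox 3's PyTorch wrapper; `model`, `images` and `labels` are placeholders, and the exact call signature may differ between versions, so treat this as an outline rather than a recipe):

```python
import foolbox as fb

# Wrap a trained PyTorch classifier (assumed to take inputs scaled to [0, 1]).
fmodel = fb.PyTorchModel(model.eval(), bounds=(0, 1))

# Run a fast-gradient-sign attack on a batch of correctly labelled images.
attack = fb.attacks.LinfFastGradientAttack()
raw_advs, clipped_advs, success = attack(fmodel, images, labels, epsilons=0.03)

print(f"fraction of images successfully fooled: {success.float().mean():.2f}")
```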

Generating AEs. Implications.

Understanding the basic principle for creating adversarial examples is easier when we compare it to the normal training process of a machine learning model.

To illustrate this, let's use the common case of an image classification algorithm, for instance a model to classify hand-written digits. In short, the main characteristics of model training (learning) and adversarial example generation are:

Model training: During training we start with a random set of parameters in the model. If we feed the model an image representing a hand-written digit (e.g., "3"), it is unlikely that an untrained model will produce the correct result. We can call the difference between the ideal output that a fully trained model should produce (e.g., "3") and the actual/current output when we feed this first image (e.g., "9") the misclassification cost (MC). We call training the iterative process during which we change the model parameters in a direction that minimises MC. Model training looks for the global minimum of the MC in parameter space; therefore a fully trained model is robust/stable to small perturbations or noise added to the trained parameters.
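
In code, one training step could look like the following sketch (PyTorch assumed, with cross-entropy standing in for the MC; `model`, `optimizer`, `image` and `true_label` are placeholders):

```python
import torch.nn.functional as F

def training_step(model, optimizer, image, true_label):
    """One step of normal training: the image stays fixed and the *parameters*
    move in the direction that lowers the misclassification cost MC."""
    optimizer.zero_grad()
    logits = model(image)                       # current output (e.g., a "9")
    mc = F.cross_entropy(logits, true_label)    # misclassification cost vs. "3"
    mc.backward()                               # gradient of MC w.r.t. the parameters
    optimizer.step()                            # nudge the parameters to reduce MC
    return mc.item()
```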

Adversarial example generation: To generate adversarial examples we start with a fully trained model; unlike during training, the parameters remain unchanged throughout this process. Let's say that we want to fool the model into believing that an image with the hand-written digit "3" is instead a "9" (a targeted adversarial example). Because the model is trained, the actual/current output for the input "3" is the label "3", whereas for our adversarial purposes we want the network to produce a "9". In this case we can call the difference between the ideal outcome and the current outcome the Adversarial Cost (AC). The process of generating an adversarial example is then the iterative process during which we change the pixels in the image in a direction that minimizes AC. One caveat: we don't want to modify the original image ("3") too much, as we want it to still look like a "3" to the human eye. Therefore the question here is: what is the minimal modification that we need to perform on the original image "3" for the model to confuse it with a "9"? Consequently, while generating adversarial examples we are looking for a local minimum of the AC in the original image space; therefore an adversarial example is not robust and is frequently unstable when we perform small perturbations or modifications on it.
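
Mirroring the training sketch above, a targeted adversarial example could be generated with an iterative gradient-sign loop like the one below (again a PyTorch sketch with my own choice of step size and budget, essentially an iterative version of the fast gradient sign method; `model`, `image` and `target_label` are placeholders):

```python
import torch
import torch.nn.functional as F

def make_targeted_adversarial(model, image, target_label, eps=0.05, steps=50, lr=0.01):
    """Now the *parameters* stay frozen and the *pixels* move in the direction
    that lowers the adversarial cost AC, while a small budget eps keeps the
    result looking like the original "3"."""
    model.eval()
    original = image.clone().detach()
    adv = image.clone().detach().requires_grad_(True)
    for _ in range(steps):
        ac = F.cross_entropy(model(adv), target_label)   # adversarial cost: distance to "9"
        grad, = torch.autograd.grad(ac, adv)
        with torch.no_grad():
            adv -= lr * grad.sign()                                 # move pixels to reduce AC
            adv.copy_(torch.max(torch.min(adv, original + eps),     # stay within eps of the
                                original - eps))                    # original image ...
            adv.clamp_(0.0, 1.0)                                    # ... and keep valid pixels
    return adv.detach()
```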

Final thoughts  

The discovery of adversarial examples was a big surprise for the AI community. On one hand it highlighted some serious problems in our machine learning solutions; on the other, the community quickly found a way to mitigate the damage through adversarial training and its automated versions (e.g., GANs).

Adversarial training strategies in general treat adversarial examples as extreme cases of misclassification. This might not be completely correct (research is ongoing), but on a more practical note these strategies do not offer a scalable solution to the problem: the number of adversarial regions to be sampled suffers a combinatorial explosion as the number of classes in a classification model increases.

Furthermore, some papers seem to indicate that the origin of adversarial examples lies in the data more than in the model. If further research keeps confirming this, the implications for the vulnerability of machine learning models would be both large and unsolvable. For one, it would imply that we cannot escape the transferability property of adversarial examples (models trained for the same task share adversarial examples); therefore our current models will always be susceptible to attack. These are exciting times for adversarial example research and there is a lot at stake. It will be interesting to see what comes next, so … stay tuned!

References

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. Advances in Neural Information Processing Systems, 2672–2680 (2014).

Link: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf

Gilmer J, Metz L, Faghri F, Schoenholz S, Raghu M, Wattenberg M, Goodfellow I. The relationship between high-dimensional geometry and adversarial examples. arXiv:1801.02774 (2018).

Link: https://arxiv.org/abs/1801.02774