I recently stumbled across an article entitled Attacking machine learning with adversarial examples. It’s a good article, and well worth the read (Go read it; I’ll wait).
Towards the end of the article the author writes:
Every strategy we have tested so far fails because it is not adaptive: it may block one kind of attack, but it leaves another vulnerability open to an attacker who knows about the defense being used. Designing a defense that can protect against a powerful, adaptive attacker is an important research area.
(Emphasis mine). In other words, none of these defenses heed Kerckhoffs’s principle, which was first articulated in the 1800s:
A cryptosystem should be secure even if everything about the system, except the key, is public knowledge.
Kerckhoffs was talking specifically about cryptography, but still.
To someone with at least a passing familiarity with modern computer security, the proposed defenses seem laughable from the outset.
So, how do we do better? That’s a really hard question. And, unfortunately, most of my own modest acquired wisdom about how to do security gives me the gut instinct “Don’t use machine learning techniques in adversarial contexts.” One might suggest that this amounts to just giving up, and that’s probably correct, but I bring it up because I think there’s what at first blush seems like a really fundamental conflict:
- Machine learning techniques are fundamentally heuristic; in the most simple case, they’ll look at a few features of some inputs, and make a guess, based on how those features compare to examples they’ve seen before.
- Most security vulnerabilities (anywhere, not just in machine learning contexts) involve novel inputs, i.e., weird ones.
The standard approach to dealing with weird, potentially malicious inputs in other contexts is to just shortcut the problem: don’t use heuristics, just do something that always works.
Let’s look at an example:
The classic XKCD comic about Little Bobby Tables depicts a SQL injection vulnerability. These are depressingly common, especially given how preventable they are. All we need is a function that takes the student name and maps it to something that SQL will always treat as a string, never as more commands. This amounts to escaping a handful of special characters (putting backslashes in front of them, in many SQL dialects).
If you’re working with a modern web framework, you don’t really even have to think about all this, since someone’s already packaged up that function, and built on top of it an interface to the database that doesn’t require you to glue strings together in the first place. If you’re using proper tools, SQL injection vulnerabilities are not just preventable, they’re impossible.
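For instance, with Python's standard `sqlite3` module (just one example; most database libraries work the same way), parameterized queries keep data and commands separate, so there's no string-gluing to get wrong:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

# Bobby's "name" is passed as a bound parameter, so the database
# always treats it as a string value, never as more SQL commands.
name = "Robert'); DROP TABLE students;--"
conn.execute("INSERT INTO students (name) VALUES (?)", (name,))

# The table survives, and the hostile name is stored verbatim.
rows = conn.execute("SELECT name FROM students").fetchall()
```

The point is structural: the query and its data travel through the API as separate arguments, so injection isn't merely caught, it's unrepresentable.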
The vast majority of security vulnerabilities are things like this. They’re usually just stupid mistakes. We should be better about building (and using) tools that just rule these out entirely, but the problems there are mainly social, not technical.
Can we just take a similar approach to solving the machine learning problem? Probably not; if we had such a clear, concise idea of what our inputs were supposed to mean, we probably wouldn’t be bothering with machine learning in the first place.
So yeah, this is a hard problem, and I don’t have great answers. I do have some thoughts, however.
Firstly, don't define the problem in terms of how the attacks work. Define it in terms of guarantees the system must provide. Cryptographic problem statements rarely mention the details of techniques attackers might use. They'll say things like "the attacker can intercept, view, and modify messages in transit" and "the attacker must not be able to gain non-negligible information about the plaintext." The how generally shouldn't go in the problem statement, and the article seems to suggest trying to find a way to put it there. That seems like a mistake.
As an example, here’s a stab at an informal definition of a security property you might want a system managing self-driving cars to have:
- The attacker has the ability to modify the images displayed on road signs in arbitrary ways.
- The attacker “wins” if they are able to construct a pair of images A and B such that:
- Both humans and the self-driving car system identify A as the same type of sign with high confidence, and
- Humans identify B as the same type of sign as A with similarly high confidence, but the self-driving car system identifies B as the same type of sign with significantly lower confidence, or as a different type of sign.
It’s wordy as hell, and it still doesn’t make the problem easy. But at least we have a sense of what we’re trying to achieve, which is a step beyond what the article gives us.
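To make the wordiness a bit more concrete, here's a sketch of the win condition as a predicate. Everything here is an assumption for illustration: the classifier interfaces (`human`, `car`), which return a hypothetical `(sign_type, confidence)` pair, and the `HIGH` and `MARGIN` thresholds standing in for "high confidence" and "significantly lower."

```python
HIGH = 0.9    # assumed threshold for "high confidence"
MARGIN = 0.3  # assumed gap that counts as "significantly lower"

def attacker_wins(a, b, human, car):
    """Does the pair (a, b) satisfy the informal security property above?"""
    h_a, hconf_a = human(a)
    h_b, hconf_b = human(b)
    c_a, cconf_a = car(a)
    c_b, cconf_b = car(b)
    # Humans and the car agree on A, both with high confidence.
    agree_on_a = h_a == c_a and hconf_a >= HIGH and cconf_a >= HIGH
    # Humans see B as the same type of sign as A, still with high confidence...
    humans_consistent = h_b == h_a and hconf_b >= HIGH
    # ...but the car changes its answer or loses significant confidence.
    car_fooled = c_b != c_a or cconf_b <= cconf_a - MARGIN
    return agree_on_a and humans_consistent and car_fooled
```

This isn't a definition anyone could deploy as-is; it's just the prose above restated precisely enough to argue about.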
The second thought is to consider the whole system, not just a particular subsystem. And when I say the whole system I really mean the whole thing:
- The way the car interprets signs.
- What it infers from road topology, which is (1) probably harder for somebody to spoof than a sign and (2) possibly a better indicator of safe driving behavior anyway. If I see a stop sign in the middle of the interstate, my reaction isn’t going to be “oh, I should stop,” but rather “wait, what’s going on?”
- Whether it can cross-reference information from other sources. If you live in an area, you often know how a particular intersection behaves before you arrive on the scene.
- Whether we can provide machine-readable indicators of this kind of information that don’t require as much guesswork to interpret, and draw on our more rigorous ability to validate non-ML systems. Signs are designed for humans; what would a road designed for machines look like?
- What we can do in terms of monitoring and response. If we have a database of where and what the road signs are in our city, can we have a car drive around and report back what it thinks they are, so we can detect discrepancies?
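That last idea, the drive-around audit, reduces to a simple comparison. A minimal sketch, where the sign database and the car's reports are purely made-up illustrative data:

```python
# Hypothetical data: what the city's sign database says is at each
# location, versus what the survey car classified when it drove by.
expected = {"5th & Main": "stop", "Oak Ave": "yield", "Elm St": "speed_30"}
reported = {"5th & Main": "stop", "Oak Ave": "speed_50", "Elm St": "speed_30"}

def discrepancies(expected, reported):
    """Locations where the car's classification disagrees with the database,
    mapped to (what the database says, what the car saw)."""
    return {
        loc: (expected[loc], seen)
        for loc, seen in reported.items()
        if expected.get(loc) != seen
    }
```

Each discrepancy is either a misclassification, a stale database entry, or tampering; all three are things you'd want a human to look at.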
Bugs happen. Even if you do have the luxury of a mathematical proof that your approach solves the problem perfectly, you still have to weigh the possibility that someone is going to screw up the actual building of the thing. The most tractable approach I’ve seen for dealing with this sort of thing is “defense in depth,” or “don’t put all your eggs in one basket.” Sure, if the rest of the program is designed such that a particular function should never be called with that invalid argument, it should be okay to just assume the input is good. Check it anyway, if you can.
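"Check it anyway" can be as mundane as a guard clause. A sketch (the function and its range are invented for illustration): even though callers are supposed to guarantee a valid value, the function validates and clamps on its own, so an upstream bug surfaces as a loud error or a safe default instead of propagating.

```python
def apply_brake(force):
    """Apply brake force in [0.0, 1.0].

    Callers are supposed to pass an in-range number already, but we
    check anyway: defense in depth means not trusting that the
    "impossible" bad input never arrives.
    """
    if not isinstance(force, (int, float)):
        # Fail loudly on a type confusion rather than let it flow onward.
        raise TypeError(f"force must be a number, got {type(force).__name__}")
    # Clamp out-of-range values to the nearest safe bound.
    return min(max(force, 0.0), 1.0)
```

The guard costs a couple of lines and a branch; the alternative is hoping every caller, forever, stays correct.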
Sandstorm has a list of security non-events — places where some piece of software has a vulnerability, and it turns out to not be very useful to an attacker, because other parts of the system limit its impact. We should be doing this everywhere.
The article brings up a really hard, really important problem. Machine learning, and artificial intelligence more broadly, are not my area of expertise. But I see much hard-earned wisdom from other disciplines that could be drawn on here, which makes the “defenses” the article discusses somewhat saddening.