Sunday, January 7, 2018

Explaining Meltdown with parallel worlds, libraries, and a bank heist



Security vulnerabilities are always tricky to explain. People are supposed to be scared of them, because their data could be compromised, but the very nature of security problems means they involve obscure technical details that we usually don’t have to think about at all.

So, enter the analogy. A number of fun, strange, possibly useful analogies have sprung up this week to try to explain the major Meltdown and Spectre vulnerabilities disclosed by Google’s Project Zero and other research teams. Like any analogy, they fall short of a perfectly detailed explanation, and will typically break if you push them too far, but they can still be a helpful way to get a vibe for how these exploits actually work.

The best knowledge we have of how these exploits actually work, in gory technical detail, is in the two PDFs published alongside the disclosure, which explain the Meltdown and Spectre attacks in research paper form.


I’ve read the fun parts of both of them. And also I read a lot of sci-fi novels. Therefore, I would like to introduce you to my own humble analogy, which involves both a bank vault and parallel universes. I am not a security researcher, or even a very good computer programmer. But I do love a good analogy.

First, let me give you the actual technical description from Google’s papers, which I will be attempting to illuminate.

This is Meltdown:

“First, an attacker makes the CPU execute a transient instruction sequence which uses an inaccessible secret value stored somewhere in physical memory. The transient instruction sequence acts as the transmitter of a covert channel, ultimately leaking the secret value to the attacker.”

And here’s Spectre:

“Spectre attacks induce a victim to speculatively perform operations that would not occur during correct program execution and which leak the victim’s confidential information via a side channel to the adversary.”

My analogy is best applied to Meltdown, but there are similarities between the two exploits that may become apparent. If you’ll recall, Meltdown is the one that mostly affects Intel and some high-end ARM chips. It allows the attacker to read kernel memory, which is A Very Bad Thing. Spectre applies to almost all modern CPUs, but it’s harder to execute, and in its simplest form it “only” reads other memory in the same process, so your kernel is mostly safe from its prying eyes.

Enough preamble, here’s my Meltdown analogy:

PAUL’S COOL AND HELPFUL MELTDOWN ANALOGY
You want to rob a bank. Inside the bank vault is a piece of paper with Ashley Carman’s Netflix password on it. In the vault there’s a security guard with a gun who will shoot anyone who looks at that piece of paper, unless it’s Ashley.

HOW IT’S SUPPOSED TO WORK
You walk up to the door and you don’t go into the bank. Meanwhile, in the parallel reality where you actually do go into the bank, you enter the vault and get shot dead.

In the parallel reality where you don’t go into the bank, nothing happens at all.

(To understand my line of thinking, keep in mind the many worlds interpretation of quantum physics. Basically, every time you make a choice — or check inside a box to see if the cat is dead — reality splits. You subjectively experience the results of your choice, while in a parallel reality another version of you experiences the result of the opposite choice. I hope that makes sense.)

HOW GOOGLE’S EXPLOIT WORKS
You walk up to the door and you don’t go into the bank. In the parallel reality where you do go into the bank, you enter the vault and look at the piece of paper. You read the password and whisper it quietly before you get shot dead.

In the reality where you don’t go into the bank, you own a highly elaborate listening device which can hear your parallel self’s whispers. Now you know Ashley’s Netflix password, and can enjoy all manner of original content at her expense.

WHAT THIS MEANS IN COMPUTER WORDS
Okay, let’s unpack this and see how it lines up with Google’s description of Meltdown.

The first thing we need is to “make the CPU execute a transient instruction sequence.” It turns out this happens all the time on modern computers. Many modern CPUs do work out of order. Instead of pausing until they know for certain which instructions should run next, they guess and keep executing code, and once the real answer arrives they throw away any results that turned out to be wrong, which is what makes those instructions “transient.” This makes applications run faster, especially applications that are doing one thing over and over. If a piece of code is looping rapidly, the CPU doesn’t have to stop after every pass and ask “should I do this again?” It just keeps running the loop, and when the loop turns out to be finished, it throws away any unused results.

Those unused results are the “parallel world” you never see. It happens in the physical hardware, but you never see the results, which is why it was presumed safe for CPUs to do this.
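To make the “run ahead, then throw away the extras” idea a bit more concrete, here’s a toy Python sketch. It is only a model: real CPUs do this in hardware, and the function name `run_speculatively` and its parameters are names I made up for illustration.

```python
# Toy model only: real CPUs speculate in hardware, not in Python.
def run_speculatively(loop_body, iterations_guessed, iterations_needed):
    """Run the loop body ahead of time, then keep only the results that were needed."""
    speculative = [loop_body(i) for i in range(iterations_guessed)]
    committed = speculative[:iterations_needed]   # results the program actually sees
    discarded = speculative[iterations_needed:]   # the "parallel world" that gets erased
    return committed, discarded


committed, discarded = run_speculatively(lambda i: i * i, 8, 5)
print(committed)   # [0, 1, 4, 9, 16]
print(discarded)   # [25, 36, 49]
```

The program only ever receives `committed`. The work in `discarded` really did happen, though, and that is the part the rest of this story depends on.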

So, in my analogy, the parallel version of you which goes into the bank is a transient instruction. He will certainly die, and won’t be allowed to formally report on his discovery.

Next we have an “inaccessible secret value stored somewhere in physical memory.” That’s the Netflix password, obviously. This is a good time to explain that, for security reasons, regular programs on your computer don’t have permission to look at the contents of all your memory. While you might trust a third-party program to know something about you, you don’t necessarily want it to know everything. But that partition is virtual. If a secret, like Ashley’s Netflix password, is loaded into physical memory, it physically exists there. Therefore, if a program can break out of its established boundaries, it can steal the secret.

Well, what happens when programs break out of their boundaries? They get shot dead by the bank vault guard. This is called an “exception” in computer terms. Regular programs break the rules all the time, usually by accident, and they’re either killed by the operating system for behaving badly, or they “handle” the exception by basically apologizing.
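Here’s roughly what “handling” an exception looks like, as a small Python sketch. Everything in it is hypothetical: the toy `memory` dictionary, the `KERNEL_START` boundary, the `AccessViolation` class, and the stand-in password are all invented for illustration, and a real operating system enforces this in hardware rather than with an `if` statement.

```python
# Hypothetical toy memory: addresses at or above KERNEL_START are off limits.
KERNEL_START = 100
memory = {5: "cat photo", 150: "hunter2"}  # "hunter2" is a stand-in secret


class AccessViolation(Exception):
    """Raised when a program reads memory it has no permission to see."""


def read(address):
    if address >= KERNEL_START:
        raise AccessViolation(f"address {address} is protected")  # the guard shoots
    return memory[address]


def polite_program(address):
    try:
        return read(address)
    except AccessViolation:
        return None  # "handle" the exception by apologizing and moving on


print(polite_program(5))    # cat photo
print(polite_program(150))  # None
```

A program that skipped the `try`/`except` would simply be stopped at the `raise`, which is the “killed for behaving badly” case.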

And this is where transient instructions fucked everything up. In Google’s exploit, the attacker has code that looks at memory it shouldn’t look at. An exception is thrown and the CPU cleans everything up, erasing any evidence of the crime. But while the CPU is doing cleanup, the CPU is also simultaneously executing other code (the so-called transient instruction) out of order. What does that other code do?

It whispers Ashley’s Netflix password. In Google terms, this whisper is the “covert channel.” The special whisper-listening machine is the other end of that channel. This channel is how the transient instruction broadcasts its findings.

Google’s chosen method of communication in Meltdown is called a “Flush+Reload side-channel attack.” Basically, before the transient instruction is destroyed, it encodes Ashley’s Netflix password into the CPU cache (high-speed memory that’s built into the CPU) in a special format: the secret determines which memory locations end up cached. The non-transient part of the attacker’s program, which hasn’t broken the rules (the version of you that’s standing outside the bank), isn’t allowed to read the specific bits that the transient instruction read, but it knows how to read messages in that special format by timing which locations load quickly. It’s like how you can’t use your eyeballs to figure out what data is on a thumb drive, but if someone spelled out the word “HEY” on a table with a dozen thumb drives, you’d be able to read that.
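The whisper-and-listen trick can be sketched as a simulation. This is not an exploit and touches no real cache: the “cache” is a Python set, the latencies are made-up constants, and names like `transient_whisper`, `reload`, and `leak` are my own. The point is only to show how a byte can be recovered from which location is fast, without ever reading the byte directly.

```python
# Simulation only: the "cache" is a Python set and the latencies are assumed numbers.
PAGE = 4096
CACHE_HIT, CACHE_MISS = 40, 300   # assumed access times, in cycles
THRESHOLD = 100                   # anything faster than this counts as cached

cache = set()                     # offsets of probe-array lines currently cached


def flush():
    cache.clear()                 # Flush: evict every probe line


def transient_whisper(secret_byte):
    cache.add(secret_byte * PAGE)  # touching probe[secret * PAGE] caches that line


def access_time(offset):
    return CACHE_HIT if offset in cache else CACHE_MISS


def reload():
    # Reload: time all 256 candidate lines; the one fast line reveals the byte.
    for guess in range(256):
        if access_time(guess * PAGE) < THRESHOLD:
            return guess
    return None


def leak(secret):
    recovered = []
    for byte in secret.encode():
        flush()
        transient_whisper(byte)
        recovered.append(reload())
    return bytes(recovered).decode()


print(leak("hunter2"))  # hunter2
```

Notice that `reload` never looks at the secret itself. It only measures how long each of the 256 possible locations takes to load, which is the thumb-drives-on-a-table part of the analogy.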

Using this technique, and given enough time, a successful Meltdown attack can read the entire contents of your computer’s memory.

WHAT DO WE DO NOW?
If you’ve been following the recent updates on Meltdown and Spectre, you’ll notice the word “serialize” popping up. The idea is that for certain sensitive actions, the CPU will serialize those instructions to make certain that they run in order. Therefore transient instructions won’t be allowed to do bad things, because they’ll be killed the moment they step out of line. Hopefully. Of course, if selectively serializing instructions doesn’t work, we’re going to have to serialize everything, which will dramatically slow down modern processors.
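As a very rough sketch of why serializing helps, here’s a toy Python model. It assumes a simplified world where “serialize” just means the permission check finishes before anything else runs; the function `attempt_read` and its parameters are invented for illustration and don’t correspond to any real CPU instruction.

```python
# Toy model: "serialize" means the permission check finishes before the access runs.
def attempt_read(secret_byte, allowed, serialize):
    """Return the set of cache lines touched by the attempt."""
    touched = set()
    if serialize and not allowed:
        return touched            # check finished first: nothing ran transiently
    touched.add(secret_byte)      # the access ran (legitimately or transiently)
    return touched


print(attempt_read(42, allowed=False, serialize=False))  # {42}
print(attempt_read(42, allowed=False, serialize=True))   # set()
```

Without serialization the forbidden access still leaves a footprint behind, and with it the footprint never appears. The cost is that the CPU has to wait for the check every time.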

Oh, and because Spectre doesn’t rely on out of order execution, but instead exploits what’s called “branch prediction” or “speculative execution,” fixes for Meltdown won’t necessarily help with Spectre.

Ultimately, what Google has discovered is a whole new genre of attacks on modern computers. There might be a simple fix that will make everything safe again — well, as safe as it ever was. But these techniques might crop up in future attacks, and we all may very well die.

What’s exciting to me is that I’m learning a lot about how CPUs actually work. It turns out they’re extremely complicated and unpredictable. I hope the industry takes a step back to examine the highly complex foundation it’s standing on. If we made things simpler, could they be safer but still fast? In the meantime, we can at least keep working on our analogies to try to understand the complexity that’s all around us.
