[RFC] Retpoline: Binary mitigation for branch-target-injection (aka "Spectre")

From: Paul Turner
Date: Thu Jan 04 2018 - 04:11:27 EST


Apologies for the discombobulation around today's disclosure. Obviously the
original goal was to communicate this a little more coherently, but the
unscheduled advances in the disclosure disrupted the efforts to pull this
together more cleanly.

I wanted to open discussion of the "retpoline" approach and define its
requirements so that we can separate the core
details from questions regarding any particular implementation thereof.

As a starting point, a full write-up describing the approach is available at:
https://support.google.com/faqs/answer/7625886

The 30 second version is:
Returns are a special type of indirect branch. As function returns are intended
to pair with function calls, processors often implement dedicated return stack
predictors. The choice of this branch prediction allows us to generate an
indirect branch in which speculative execution is intentionally redirected into
a controlled location by a return stack target that we control, preventing
branch target injection (also known as "Spectre") against these binaries.
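
For readers who want something concrete, here is a minimal user-space sketch
of the construct, loosely following the sequence in the write-up linked above.
The names retpoline_call, unary_fn and add_one are hypothetical and chosen only
for illustration; the assembly path assumes x86-64 with an ELF toolchain
(GCC or Clang), and other targets fall back to an ordinary indirect call. It
is not the kernel or compiler implementation.

```c
#include <assert.h>

/* Hypothetical signature for the functions we call indirectly. */
typedef long (*unary_fn)(long);

/* Calls fn(arg); on x86-64/ELF this goes through a retpoline sequence. */
long retpoline_call(unary_fn fn, long arg);

#if defined(__x86_64__) && defined(__ELF__)
__asm__(
    ".pushsection .text\n"
    ".globl retpoline_call\n"
    ".type retpoline_call, @function\n"
    "retpoline_call:\n"
    "    mov  %rdi, %r11\n"     /* r11 = real branch target (fn)          */
    "    mov  %rsi, %rdi\n"     /* move arg into the first argument slot  */
    "    call 2f\n"             /* pushes the address of label 1          */
    "1:  pause\n"               /* return-stack prediction lands here ... */
    "    jmp  1b\n"             /* ... and spins harmlessly               */
    "2:  mov  %r11, (%rsp)\n"   /* overwrite the return address with fn   */
    "    ret\n"                 /* architecturally "returns" into fn      */
    ".size retpoline_call, .-retpoline_call\n"
    ".popsection\n"
);
#else
/* Fallback for other targets: a plain indirect call, no mitigation. */
long retpoline_call(unary_fn fn, long arg)
{
    assert(fn != 0);
    return fn(arg);
}
#endif

/* Hypothetical branch target used to exercise the sketch. */
long add_one(long x)
{
    return x + 1;
}
```

Calling retpoline_call(add_one, 41) should behave exactly like add_one(41).
The only difference is that no indirect jmp/call instruction is executed, and
the return predictor is steered to the pause loop rather than to an
attacker-influenced target.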

On the targets (Intel Xeon) we have measured so far, cost is within cycles of a
"native" indirect branch for which branch prediction hardware has been disabled.
This is unfortunately measurable -- from 3 cycles on average to about 30.
However, the cost is largely mitigated for many workloads since the kernel uses
comparatively few indirect branches (versus, say, a C++ binary). With some
effort we have the average overall overhead within the 0-1.5% range for our
internal workloads, including some particularly high packet processing engines.
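
As a back-of-the-envelope check on how the per-branch cost relates to the
overall overhead, the sketch below uses a deliberately simple additive model:
each indirect branch costs a fixed number of extra cycles and nothing else
changes. The function names and the "branches per 1000 cycles" framing are
assumptions made for illustration, not measurements.

```c
/*
 * Fractional slowdown under a simple additive model.
 * branches_per_kcycle: indirect branches executed per 1000 baseline cycles.
 * native_cost / retpoline_cost: cycles per indirect branch without / with
 * the retpoline.
 */
double retpoline_overhead(double branches_per_kcycle,
                          double native_cost, double retpoline_cost)
{
    return branches_per_kcycle * (retpoline_cost - native_cost) / 1000.0;
}

/*
 * Inverse question: how many indirect branches per 1000 baseline cycles
 * fit within a given overhead budget (e.g. 0.015 for 1.5%)?
 */
double max_branches_per_kcycle(double budget,
                               double native_cost, double retpoline_cost)
{
    return budget * 1000.0 / (retpoline_cost - native_cost);
}
```

With the 3 and 30 cycle figures quoted above, a 1.5% budget corresponds under
this model to roughly one indirect branch per 1800 cycles of other work.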

There are several components, the majority of which are independent of kernel
modifications:

(1) A compiler supporting retpoline transformations.
(1a) Optionally: annotations for hand-coded indirect jmps, so that they may be
made compatible with (1).
[ Note: The only known indirect jmp which is not safe to convert is the
early virtual address check in head entry. ]
(2) Kernel modifications for preventing return-stack underflow (see document
above).
The key points where this occurs are:
- Context switches (into protected targets)
- Interrupt return (we return into potentially unwinding execution)
- Sleep state exit (flushes caches)
- Guest exit.
(These can be run-time gated; a full refill costs 30-45 cycles.)
(3) Optional: Optimizations so that direct branches can be used for hot kernel
indirects. While, as discussed above, kernel execution generally depends on
fewer indirect branches, there are a few places (in particular, the
networking stack) where we have chained sequences of indirects on hot paths.
(4) More general support for guarding against RSB underflow in an affected
target. While this is harder to exploit and may not be required for many
users, the approaches we have used here are not generally applicable.
Further discussion is required.
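
To make the underflow concern in (2) and (4) concrete, here is a toy software
model of a return stack predictor. It is only a sketch: the depth of 16, the
struct layout and all function names are assumptions for illustration, and
real hardware behaviour on underflow varies by processor. A real refill
executes call instructions rather than writing to a data structure.

```c
#include <stdbool.h>
#include <stddef.h>

#define RSB_DEPTH 16   /* assumed depth; real hardware differs */

struct rsb {
    unsigned long entry[RSB_DEPTH];
    size_t top;     /* index of the most recently pushed entry */
    size_t valid;   /* number of entries that still hold a prediction */
};

/* A call pushes its return address; the oldest entry is lost when full. */
void rsb_call(struct rsb *r, unsigned long ret_addr)
{
    r->top = (r->top + 1) % RSB_DEPTH;
    r->entry[r->top] = ret_addr;
    if (r->valid < RSB_DEPTH)
        r->valid++;
}

/*
 * A return pops a prediction. Returns false on underflow, which is the
 * case where the processor may fall back to another predictor.
 */
bool rsb_ret(struct rsb *r, unsigned long *predicted)
{
    if (r->valid == 0)
        return false;
    *predicted = r->entry[r->top];
    r->top = (r->top + RSB_DEPTH - 1) % RSB_DEPTH;
    r->valid--;
    return true;
}

/* Refill: overwrite every entry with a benign (speculation-trap) address. */
void rsb_refill(struct rsb *r, unsigned long benign)
{
    for (size_t i = 0; i < RSB_DEPTH; i++)
        rsb_call(r, benign);
}
```

In this model, running rsb_refill() at the points listed in (2) guarantees
that the next RSB_DEPTH returns are predicted to the benign address and not
left to underflow.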

With respect to what these deltas mean for an unmodified kernel:
(1a) At minimum annotation only. More complicated, config and
run-time gated options are also possible.
(2) Trivially run-time & config gated.
(3) The de-virtualizing of these branches improves performance in both the
retpoline and non-retpoline cases.
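
One way to picture the de-virtualization in (3) is to compare the function
pointer against a likely target and call that target directly. This is a
sketch only; handler_fn, hot_handler, cold_handler and dispatch are
hypothetical names, not existing kernel interfaces.

```c
/* Hypothetical handler type and two example targets. */
typedef int (*handler_fn)(int);

int hot_handler(int x)  { return x * 2; }   /* the common case */
int cold_handler(int x) { return x - 1; }   /* everything else */

/*
 * If the pointer matches the expected hot target, take a direct call,
 * which needs no indirect branch prediction (and no retpoline).
 * Otherwise fall back to the ordinary indirect call.
 */
int dispatch(handler_fn fn, int x)
{
    if (fn == hot_handler)
        return hot_handler(x);   /* direct branch */
    return fn(x);                /* indirect branch */
}
```

The compare-and-direct-call path is cheap and well predicted by the
conditional branch predictor, which is why it can help in both the retpoline
and non-retpoline cases.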

For an out of the box kernel that is reasonably protected, (1)-(3) are required.

I apologize that this does not come with a clean set of patches, merging the
things that we and Intel have looked at here. That was one of the original
goals for this week. Strictly speaking, I think that Andi, David, and I have
a fair amount of merging and clean-up to do here. This is an attempt
to keep discussion of the fundamentals at least independent of that.

I'm trying to keep the above reasonably compact/dense. I'm happy to expand on
any details in sub-threads. I'll also link back some of the other compiler work
which is landing for (1).

Thanks,

- Paul