Re: [PATCH v6 00/10] Retpoline: Avoid speculative indirect calls in kernel
From: Paul Turner
Date: Mon Jan 08 2018 - 05:42:51 EST
[ First send did not make list because gmail ate its plain-text force
when I pasted content. ]
One detail that is missing is that we still need RSB refill in some cases.
This is not because the retpoline sequence itself will underflow (it
is actually guaranteed not to, since it consumes only RSB entries that
it generates). Rather, it is either to avoid poisoning of the RSB
entries themselves, or to avoid the hardware falling back to alternate
predictors on RSB underflow.
Enumerating the cases we care about:
user->kernel in the absence of SMEP:
In the absence of SMEP, we must worry about user-generated RSB entries
being consumable by kernel execution.
Generally speaking, for synchronous execution (e.g. syscall,
interrupt) this will not occur; however, one important case remains.
When we context switch between two threads, we should flush the RSB so
that returns taken on the unbalanced return path of the thread we just
scheduled into cannot consume RSB entries potentially installed by the
prior thread.
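To make the placement concrete, here is a minimal C sketch of where
such a flush would sit on the context-switch path. The names
fill_return_buffer() and boot_cpu_has_smep are illustrative stand-ins
(the refill sequence itself is given further down), not code from this
patch set:

    extern void fill_return_buffer(void);   /* RSB refill sequence shown below */
    extern int boot_cpu_has_smep;           /* illustrative CPU feature flag */

    static void switch_to_flush_rsb(void)
    {
            /*
             * Without SMEP, returns taken on the unbalanced return path
             * of the incoming thread could consume RSB entries installed
             * by the prior thread, so overwrite them with benign entries
             * before resuming it.
             */
            if (!boot_cpu_has_smep)
                    fill_return_buffer();
    }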
kernel->kernel independent of SMEP:
While much harder to coordinate, facilities such as eBPF potentially
allow exploitable return targets to be created.
Generally speaking (particularly if eBPF has been disabled), the risk
is _much_ lower here, since we can only return into kernel execution
that was already occurring on another thread (which could likely be
attacked directly there, independent of RSB poisoning).
guest->hypervisor, independent of SMEP:
For guest ring0 -> host ring0 transitions, it is possible that RSB
entries are tagged only as having been generated in a ring0 context,
meaning that a guest-generated entry may be consumed by the host.
This admits:
hypervisor_run_vcpu_implementation() {
    <enter hardware virtualization context>
    ... run virtualized work ...                                   (1)
    <leave hardware virtualization context>
    < update vmcs state, prior to any function return >            (2)
    < return from hypervisor_run_vcpu_implementation()
      to handle VMEXIT >                                           (3)
}
A guest may craft poisoned entries at (1) which, if not flushed at
(2), may immediately be eligible for consumption at (3).
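Expressed as a hedged C sketch (the function names here are
illustrative, not KVM's), the flush must land at (2), after leaving
the hardware virtualization context but before the first host-side
function return:

    struct vcpu;                                    /* opaque, illustrative */
    extern void vmenter_and_run_guest(struct vcpu *v);
    extern void update_vmcs_cache(struct vcpu *v);
    extern void fill_return_buffer(void);           /* RSB refill sequence below */

    static void run_vcpu_implementation(struct vcpu *v)
    {
            vmenter_and_run_guest(v);       /* (1) guest may plant RSB entries */
            fill_return_buffer();           /* (2) overwrite them before any
                                             *     host-side return executes   */
            update_vmcs_cache(v);
    }                                       /* (3) the return here now consumes
                                             *     only benign refill entries  */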
While the cases above involve the crafting and use of poisoned
entries, recall also that one of the initial conditions was that we
should avoid RSB underflow, as some CPUs may try to use other indirect
predictors when this occurs.
The cases we care about here are:
- When we return _into_ protected execution. For the kernel, this
means when we exit interrupt context into kernel context, since we may
have emptied or reduced the number of RSB entries while in interrupt
context.
- Context switch (even if we are returning to user code, we still need
to unwind the scheduler/triggering frames that preempted it
previously; considering that detail, this is a subset of the above,
but it is listed for completeness).
- On VMEXIT (it turns out we need to worry about both poisoned entries
and no entries; the solution is a single refill nonetheless).
- Leaving deeper (>C1) c-states, which may have flushed hardware state
- Where we are unwinding call-chains of >16 entries[*]
[*] This is obviously the trickiest case. Fortunately, it is tough to
exploit since such call-chains are reasonably rare, and action must
typically be predicted at a considerable distance from where current
execution lies, which both dramatically increases the difficulty of an
attack and lowers the bit-rate (the number of ops per attempt is
necessarily increased). For our systems, since we control the binary
image, we can identify such call-chains through aggregate profiling of
every machine in the fleet. I'm happy to provide those symbols, but
the list is obviously restricted from complete coverage due to code
differences. Generally, this is a level of paranoia no typical user
will likely care about, and it only applies to a subset of CPUs.
A sequence for efficiently refilling the RSB is:
        mov $8, %rax;                   /* 8 iterations x 2 calls = 16 RSB entries */
        .align 16;
3:      call 4f;
31:     pause; call 31b;                /* speculation trap */
        .align 16;
4:      call 5f;
41:     pause; call 41b;                /* speculation trap */
        .align 16;
5:      dec %rax;
        jnz 3b;
        add $(16*8), %rsp;              /* drop the 16 return addresses pushed above */
This implementation uses 8 loops, with 2 calls per iteration. This is
marginally faster than a single call per iteration. We did not
observe useful benefit (particularly relative to text size) from
further unrolling. It may also be split into smaller (e.g. 4 or 8
call) segments that can be usefully pipelined/intermixed with other
operations. It includes retpoline-style speculation traps so that if
an entry is consumed, it cannot lead to controlled speculation. On my
test system it took ~43 cycles on average. Note that non-zero
displacement calls should be used, since zero-displacement calls may
be optimized to not interact with the RSB (due to their use in
fetching RIP for 32-bit relocations).
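For reference, here is the same sequence wrapped in GCC extended asm
as a self-contained C helper (my sketch, not part of the patch set),
which can be dropped into a small benchmark. Note that the 16 calls
temporarily occupy the 128 bytes immediately below the entry %rsp (the
ABI red zone), so in userspace it must not run where the red zone is
live (e.g. build with -mno-red-zone):

    static inline void fill_return_buffer(void)
    {
            asm volatile("mov $8, %%rax\n\t"
                         ".align 16\n"
                         "3:  call 4f\n"
                         "31: pause; call 31b\n\t"     /* speculation trap */
                         ".align 16\n"
                         "4:  call 5f\n"
                         "41: pause; call 41b\n\t"     /* speculation trap */
                         ".align 16\n"
                         "5:  dec %%rax\n\t"
                         "jnz 3b\n\t"
                         "add $(16*8), %%rsp\n"        /* drop the 16 return addresses */
                         : : : "rax", "memory", "cc");
    }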
On Mon, Jan 8, 2018 at 2:34 AM, Paul Turner <pjt@xxxxxxxxxx> wrote:
> On Sun, Jan 7, 2018 at 2:11 PM, David Woodhouse <dwmw@xxxxxxxxxxxx> wrote:
>>
>> This is a mitigation for the 'variant 2' attack described in
>>
>> https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
>>
>> Using GCC patches available from the hjl/indirect/gcc-7-branch/master
>> branch of https://github.com/hjl-tools/gcc/commits/hjl and by manually
>> patching assembler code, all vulnerable indirect branches (that occur
>> after userspace first runs) are eliminated from the kernel.
>>
>> They are replaced with a 'retpoline' call sequence which deliberately
>> prevents speculation.
>>
>> Fedora 27 packages of the updated compiler are available at
>> https://koji.fedoraproject.org/koji/taskinfo?taskID=24065739
>>
>>
>> v1: Initial post.
>> v2: Add CONFIG_RETPOLINE to build kernel without it.
>> Change warning messages.
>> Hide modpost warning message
>> v3: Update to the latest CET-capable retpoline version
>> Reinstate ALTERNATIVE support
>> v4: Finish reconciling Andi's and my patch sets, bug fixes.
>> Exclude objtool support for now
>> Add 'noretpoline' boot option
>> Add AMD retpoline alternative
>> v5: Silence MODVERSIONS warnings
>> Use pause;jmp loop instead of lfence;jmp
>> Switch to X86_FEATURE_RETPOLINE positive feature logic
>> Emit thunks inline from assembler macros
>> Merge AMD support into initial patch
>> v6: Update to latest GCC patches with no dots in symbols
>> Fix MODVERSIONS properly(ish)
>> Fix typo breaking 32-bit, introduced in V5
>> Never set X86_FEATURE_RETPOLINE_AMD yet, pending confirmation
>>
>> Andi Kleen (3):
>> x86/retpoline/irq32: Convert assembler indirect jumps
>> x86/retpoline: Add boot time option to disable retpoline
>> x86/retpoline: Exclude objtool with retpoline
>>
>> David Woodhouse (7):
>> x86/retpoline: Add initial retpoline support
>> x86/retpoline/crypto: Convert crypto assembler indirect jumps
>> x86/retpoline/entry: Convert entry assembler indirect jumps
>> x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
>> x86/retpoline/hyperv: Convert assembler indirect jumps
>> x86/retpoline/xen: Convert Xen hypercall indirect jumps
>> x86/retpoline/checksum32: Convert assembler indirect jumps
>>
>> Documentation/admin-guide/kernel-parameters.txt | 3 +
>> arch/x86/Kconfig | 17 ++++-
>> arch/x86/Kconfig.debug | 6 +-
>> arch/x86/Makefile | 10 +++
>> arch/x86/crypto/aesni-intel_asm.S | 5 +-
>> arch/x86/crypto/camellia-aesni-avx-asm_64.S | 3 +-
>> arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 3 +-
>> arch/x86/crypto/crc32c-pcl-intel-asm_64.S | 3 +-
>> arch/x86/entry/entry_32.S | 5 +-
>> arch/x86/entry/entry_64.S | 12 +++-
>> arch/x86/include/asm/asm-prototypes.h | 25 +++++++
>> arch/x86/include/asm/cpufeatures.h | 2 +
>> arch/x86/include/asm/mshyperv.h | 18 ++---
>> arch/x86/include/asm/nospec-branch.h | 92 +++++++++++++++++++++++++
>> arch/x86/include/asm/xen/hypercall.h | 5 +-
>> arch/x86/kernel/cpu/common.c | 3 +
>> arch/x86/kernel/cpu/intel.c | 11 +++
>> arch/x86/kernel/ftrace_32.S | 6 +-
>> arch/x86/kernel/ftrace_64.S | 8 +--
>> arch/x86/kernel/irq_32.c | 9 +--
>> arch/x86/lib/Makefile | 1 +
>> arch/x86/lib/checksum_32.S | 7 +-
>> arch/x86/lib/retpoline.S | 48 +++++++++++++
>> 23 files changed, 264 insertions(+), 38 deletions(-)
>> create mode 100644 arch/x86/include/asm/nospec-branch.h
>> create mode 100644 arch/x86/lib/retpoline.S
>>
>> --
>> 2.7.4
>>
>