Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation

From: Ingo Molnar
Date: Tue Jan 23 2018 - 02:54:09 EST



* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

> * David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
>
> > But wait, why did I say "mostly"? Well, not everyone has a retpoline
> > compiler yet... but OK, screw them; they need to update.
> >
> > Then there's Skylake, and that generation of CPU cores. For complicated
> > reasons they actually end up being vulnerable not just on indirect
> > branches, but also on a 'ret' in some circumstances (such as 16+ CALLs
> > in a deep chain).
> >
> > The IBRS solution, ugly though it is, did address that. Retpoline
> > doesn't. There are patches being floated to detect and prevent deep
> > stacks, and deal with some of the other special cases that bite on SKL,
> > but those are icky too. And in fact IBRS performance isn't anywhere
> > near as bad on this generation of CPUs as it is on earlier CPUs
> > *anyway*, which makes it not quite so insane to *contemplate* using it
> > as Intel proposed.
>
> There's another possible method to avoid deep stacks on Skylake, without compiler
> support:
>
> - Use the existing mcount based function tracing live patching machinery
> (CONFIG_FUNCTION_TRACER=y) to install a _very_ fast and simple stack depth
> tracking tracer which would issue a retpoline when stack depth crosses
> boundaries of ~16 entries.

The patch below demonstrates the principle: it forcibly enables dynamic ftrace
patching (CONFIG_DYNAMIC_FTRACE=y et al) and turns mcount/__fentry__ into a RET:

ffffffff81a01a40 <__fentry__>:
ffffffff81a01a40: c3 retq

This would have to be extended with (very simple) call stack depth tracking (just
3 more instructions would do in the fast path, I believe) and a suitable Skylake
workaround (and it would also have to play nice with the ftrace callbacks).
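
Roughly, the __fentry__ fast path would become something like this (just a sketch:
the per-CPU 'call_depth' counter and the 'rsb_refill' slow path are made-up names
here, and the return side, where the counter would have to be decremented again, is
not handled at all):

__fentry__:
	incl	PER_CPU_VAR(call_depth)		# hypothetical per-CPU nesting counter
	cmpl	$16, PER_CPU_VAR(call_depth)	# about to exceed the RSB depth?
	jae	rsb_refill			# rare slow path: refill the RSB (not shown)
	ret

The JAE would practically never be taken on shallow call chains, so the fast path
really is just the 3 extra instructions in front of the RET.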

On non-Skylake CPUs the overhead would be 0 cycles.

On Skylake this would add an overhead of maybe 2-3 cycles per function call, and
obviously all this code and data would be very cache hot. Given that the average
number of function calls per system call is around a dozen, this would be _much_
faster than any microcode/MSR-based approach.
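
Back-of-the-envelope: ~12 function calls per syscall times 2-3 cycles each works out
to roughly 25-40 cycles of added overhead per system call, while a single serializing
WRMSR alone already costs more than that, and the IBRS scheme needs two of them (one
on entry, one on exit) per syscall.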

Is there a testcase for the Skylake 16-deep-call-stack problem that I could run?
Is there a description of the exact speculative execution vulnerability that has
to be addressed to begin with?

If this approach is workable I'd much prefer it to any MSR writes in the syscall
entry path, not just because it's fast enough in practice not to be turned off by
everyone, but also because everyone would agree that the per-function-call overhead
needs to go away on new CPUs. Both deployment and backporting are also _much_ more
flexible, simpler, faster and more complete than microcode/firmware or compiler-based
solutions.

Assuming the vulnerability can be addressed via this route, that is, which is a big
assumption!

Thanks,

Ingo

arch/x86/Kconfig | 3 +++
arch/x86/kernel/ftrace_64.S | 1 +
2 files changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 423e4b64e683..df471538a79c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -133,6 +133,8 @@ config X86
select HAVE_DMA_CONTIGUOUS
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_REGS
+ select DYNAMIC_FTRACE
+ select DYNAMIC_FTRACE_WITH_REGS
select HAVE_EBPF_JIT if X86_64
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_EXIT_THREAD
@@ -140,6 +142,7 @@ config X86
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACER
+ select FUNCTION_TRACER
select HAVE_GCC_PLUGINS
select HAVE_HW_BREAKPOINT
select HAVE_IDE
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index 7cb8ba08beb9..1e219e0f2887 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -19,6 +19,7 @@ EXPORT_SYMBOL(__fentry__)
# define function_hook mcount
EXPORT_SYMBOL(mcount)
#endif
+ ret

/* All cases save the original rbp (8 bytes) */
#ifdef CONFIG_FRAME_POINTER