Re: [PATCH v2 5/7] x86/mm, tracing: Fix CR2 corruption

From: Peter Zijlstra
Date: Mon Jul 08 2019 - 03:49:10 EST


On Sat, Jul 06, 2019 at 08:07:22PM +0900, Eiichi Tsukata wrote:
>
>
> On 2019/07/05 11:18, Linus Torvalds wrote:
> > On Fri, Jul 5, 2019 at 5:03 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >>
> >> Despire the current efforts to read CR2 before tracing happens there
> >> still exist a number of possible holes:
> >
> > So this whole series disturbs me for the simple reason that I thought
> > tracing was supposed to save/restore cr2 and make it unnecessary to
> > worry about this in non-tracing code.
> >
> > That is very much what the NMI code explicitly does. Why shouldn't all
> > the other tracing code do the same thing in case they can take page
> > faults?
> >
> > So I don't think the patches are wrong per se, but this seems to solve
> > it at the wrong level.

This solves it all at one site. If we're going to sprinkle CR2
save/restore over 'all' sites we're bound to miss some and we'll be
playing catch-up forever :/

> Steven previously tried to fix it by saving CR2 in TRACE_IRQS_OFF:
> https://lore.kernel.org/lkml/20190320221534.165ab87b@xxxxxxxxxxxxxxxx/

That misses the context tracking tracepoint, which is also before we
load CR2.

> To prevent userstack trace code from reading user stack before it becomes ready,
> checking current->in_execve in stack_trace_save_user() can help Steven's approach,
> though trace_sched_process_exec() is called before current->in_execve = 0 so it changes
> current behavior.
>
> The PoC code is as follows:
>
> diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
> index 2abf27d7df6b..30fa6e1b7a87 100644
> --- a/arch/x86/kernel/stacktrace.c
> +++ b/arch/x86/kernel/stacktrace.c
> @@ -116,10 +116,12 @@ void arch_stack_walk_user(stack_trace_consume_fn consume_entry, void *cookie,
> const struct pt_regs *regs)
> {
> const void __user *fp = (const void __user *)regs->bp;
> + unsigned long address;
>
> if (!consume_entry(cookie, regs->ip, false))
> return;
>
> + address = read_cr2();
> while (1) {
> struct stack_frame_user frame;
>
> @@ -131,11 +133,14 @@ void arch_stack_walk_user(stack_trace_consume_fn consume_entry, void *cookie,
> break;
> if (frame.ret_addr) {
> if (!consume_entry(cookie, frame.ret_addr, false))
> - return;
> + break;
> }
> if (fp == frame.next_fp)
> break;
> fp = frame.next_fp;
> }
> +
> + if (address != read_cr2())
> + write_cr2(address);
> }
>

And this misses the perf unwinder, which we can reach if we attach perf
to the tracepoint. We can most likely also do user accesses from
kprobes, which we can similarly attach to tracepoints, and there's eBPF,
and who knows what else...

Do we really want to go put CR2 save/restore around every single thing
that can cause a fault (any fault) and is reachable from a tracepoint?
That's going to be a long list of things ... :/

Or are we going to put the CR2 save/restore on every single tracepoint?
But then we also need it on the mcount/fentry stubs and we again have
multiple places.

Whereas if we stick it in the entry path, like I proposed, we fix it in
one location and we're done.