Re: [RFC][PATCH] tracing/x86: Save CR2 before tracing irqsoff on error_entry
From: He Zhe
Date: Tue Apr 16 2019 - 21:54:07 EST
>From the last replies in the thread, it seems some work is going on. May I ask
when we can possibly roughly have a fix or workaround?
Thanks,
Zhe
On 3/21/19 10:15 AM, Steven Rostedt wrote:
> From: Steven Rostedt (VMware) <rostedt@xxxxxxxxxxx>
>
> He Zhe reported a crash by enabling trace events and selecting
> "userstacktrace" which will read the stack of userspace for every trace
> event recorded. Zhe narrowed it down to:
>
> c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage")
>
> With the problem config, I was able to also reproduce the error. I
> narrowed it down to just having to do the following:
>
> # cd /sys/kernel/tracing
> # echo 1 > options/userstacktrace
> # echo 1 > events/preemptirq/irq_disable/enable
>
> And sure enough, I triggered a crash. Well, it was systemd crashing
> with a bad memory access??
>
> systemd-journal[537]: segfault at ed8cb8 ip 00007f7fffc9fef5 sp 00007ffc4062cb10 error 7
>
> And it would crash similarly each time I tried it, but always at a
> different place. After spending the day on this, I finally figured it
> out. The bug is happening in entry_64.S right after error_entry.
> There's two TRACE_IRQS_OFF in that code path, which if I comment out,
> the bug goes away. Then it dawned on me that the crash always happens
> when systemd does a normal page fault. We had this bug before, and it
> was with the exception trace points.
>
> The issue is that a tracepoint can fault (reading vmalloc or whatever).
> And doing a userspace stack trace most definitely will fault. But if we
> are coming from a legitimate page fault, the address of that fault (in
> the CR2 register) will be lost if we fault before we get to the page
> fault handler. That's exactly what is happening.
>
> To solve this, a TRACE_IRQS_OFF_CR2 (and ON for consistency) was added
> that saves the CR2 register. A new trace_hardirqs_off_thunk_cr2 is
> created that stores the cr2 register, calls the
> trace_hardirqs_off_caller, then on return restores the cr2 register if
> it changed, before returning.
>
> On my tests this fixes the issue. I just want to know if this is a
> legitimate fix or if someone can come up with a better fix?
>
> Note: this also saves the exception context just like the
> do_page_fault() function does.
>
> Note2: This only gets enabled when lockdep or irq tracing is enabled,
> which is not recommended for production environments.
>
> Link: http://lkml.kernel.org/r/897cf5cf-fc24-8a64-cb28-847f2d2e63d2@xxxxxxxxxxxxx
>
> Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage")
> Signed-off-by: Steven Rostedt (VMware) <rostedt@xxxxxxxxxxx>
> ---
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 7bc105f47d21..7edffec328e2 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -292,6 +292,32 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>
> syscall_return_slowpath(regs);
> }
> +
> +extern void trace_hardirqs_on_caller(unsigned long caller_addr);
> +__visible void trace_hardirqs_on_caller_cr2(unsigned long caller_addr)
> +{
> + unsigned long address = read_cr2(); /* Get the faulting address */
> + enum ctx_state prev_state;
> +
> + prev_state = exception_enter();
> + trace_hardirqs_on_caller(caller_addr);
> + if (address != read_cr2())
> + write_cr2(address);
> + exception_exit(prev_state);
> +}
> +
> +extern void trace_hardirqs_off_caller(unsigned long caller_addr);
> +__visible void trace_hardirqs_off_caller_cr2(unsigned long caller_addr)
> +{
> + unsigned long address = read_cr2(); /* Get the faulting address */
> + enum ctx_state prev_state;
> +
> + prev_state = exception_enter();
> + trace_hardirqs_off_caller(caller_addr);
> + if (address != read_cr2())
> + write_cr2(address);
> + exception_exit(prev_state);
> +}
> #endif
>
> #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 1f0efdb7b629..73ddf24fed3e 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1248,12 +1248,12 @@ ENTRY(error_entry)
> * we fix gsbase, and we should do it before enter_from_user_mode
> * (which can take locks).
> */
> - TRACE_IRQS_OFF
> + TRACE_IRQS_OFF_CR2
> CALL_enter_from_user_mode
> ret
>
> .Lerror_entry_done:
> - TRACE_IRQS_OFF
> + TRACE_IRQS_OFF_CR2
> ret
>
> /*
> diff --git a/arch/x86/entry/thunk_64.S b/arch/x86/entry/thunk_64.S
> index be36bf4e0957..1300b53b98cb 100644
> --- a/arch/x86/entry/thunk_64.S
> +++ b/arch/x86/entry/thunk_64.S
> @@ -41,6 +41,8 @@
> #ifdef CONFIG_TRACE_IRQFLAGS
> THUNK trace_hardirqs_on_thunk,trace_hardirqs_on_caller,1
> THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller,1
> + THUNK trace_hardirqs_on_thunk_cr2,trace_hardirqs_on_caller_cr2,1
> + THUNK trace_hardirqs_off_thunk_cr2,trace_hardirqs_off_caller_cr2,1
> #endif
>
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index 058e40fed167..dd511742ca2e 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -172,9 +172,13 @@ static inline int arch_irqs_disabled(void)
> #ifdef CONFIG_TRACE_IRQFLAGS
> # define TRACE_IRQS_ON call trace_hardirqs_on_thunk;
> # define TRACE_IRQS_OFF call trace_hardirqs_off_thunk;
> +# define TRACE_IRQS_ON_CR2 call trace_hardirqs_on_thunk_cr2;
> +# define TRACE_IRQS_OFF_CR2 call trace_hardirqs_off_thunk_cr2;
> #else
> # define TRACE_IRQS_ON
> # define TRACE_IRQS_OFF
> +# define TRACE_IRQS_ON_CR2
> +# define TRACE_IRQS_OFF_CR2
> #endif
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> # ifdef CONFIG_X86_64
>