Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI context

From: John Kacur
Date: Sun Mar 14 2010 - 18:05:37 EST

Next message: Steven Rostedt: "Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI context"
Previous message: Michal Soltys: "Re: ATA 4 KiB sector issues."
In reply to: Steven Rostedt: "Re: [PATCH 5/5] tracing: Do not record user stack trace from NMIcontext"
Next in thread: Steven Rostedt: "Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI context"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sat, Mar 13, 2010 at 3:57 AM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> From: Steven Rostedt <srostedt@xxxxxxxxxx>
>
> A bug was found with Li Zefan's ftrace_stress_test that caused applications
> to segfault during the test.
>
> Placing a tracing_off() in the segfault code, and examining several
> traces, I found that the following was always the case. The lock tracer
> was enabled (lockdep being required) and userstack was enabled. Testing
> this out, I just enabled the two, but that was not good enough. I needed
> to run something else that could trigger it. Running a load like hackbench
> did not work, but executing a new program would. The following would
> trigger the segfault within seconds:
>
> # echo 1 > /debug/tracing/options/userstacktrace
> # echo 1 > /debug/tracing/events/lock/enable
> # while :; do ls > /dev/null ; done
>
> Enabling the function graph tracer and looking at what was happening
> I finally noticed that all cashes happened just after an NMI.
>
> 1) | copy_user_handle_tail() {
> 1) | bad_area_nosemaphore() {
> 1) | __bad_area_nosemaphore() {
> 1) | no_context() {
> 1) | fixup_exception() {
> 1) 0.319 us | search_exception_tables();
> 1) 0.873 us | }
> [...]
> 1) 0.314 us | __rcu_read_unlock();
> 1) 0.325 us | native_apic_mem_write();
> 1) 0.943 us | }
> 1) 0.304 us | rcu_nmi_exit();
> [...]
> 1) 0.479 us | find_vma();
> 1) | bad_area() {
> 1) | __bad_area() {
>
> After capturing several traces of failures, all of them happened
> after an NMI. Curious about this, I added a trace_printk() to the NMI
> handler to read the regs->ip to see where the NMI happened. In which I
> found out it was here:
>
> ffffffff8135b660 <page_fault>:
> ffffffff8135b660: 48 83 ec 78 sub $0x78,%rsp
> ffffffff8135b664: e8 97 01 00 00 callq ffffffff8135b800 <error_entry>
>
> What was happening is that the NMI would happen at the place that a page
> fault occurred. It would call rcu_read_lock() which was traced by
> the lock events, and the user_stack_trace would run. This would trigger
> a page fault inside the NMI. I do not see where the CR2 register is
> saved or restored in NMI handling. This means that it would corrupt
> the page fault handling that the NMI interrupted.
>
> The reason the while loop of ls helped trigger the bug, was that
> each execution of ls would cause lots of pages to be faulted in, and
> increase the chances of the race happening.
>
> The simple solution is to not allow user stack traces in NMI context.
> After this patch, I ran the above "ls" test for a couple of hours
> without any issues. Without this patch, the bug would trigger in less
> than a minute.
>
> Cc: stable@xxxxxxxxxx
> Reported-by: Li Zefan <lizf@xxxxxxxxxxxxxx>
> Signed-off-by: Steven Rostedt <rostedt@xxxxxxxxxxx>
> ---
> kernel/trace/trace.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 484337d..e52683f 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1284,6 +1284,13 @@ ftrace_trace_userstack(struct ring_buffer *buffer, unsigned long flags, int pc)
> if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
> return;
>
> + /*
> + * NMIs can not handle page faults, even with fix ups.
> + * The save user stack can (and often does) fault.
> + */
> + if (unlikely(in_nmi()))
> + return;
> +
> event = trace_buffer_lock_reserve(buffer, TRACE_USER_STACK,
> sizeof(*entry), flags, pc);
> if (!event)
> --
> 1.7.0
>

That is some awfully cool detective work. Just one question, if it is
so easy to trigger, why did you wrap it in "unlikely"?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Steven Rostedt: "Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI context"
Previous message: Michal Soltys: "Re: ATA 4 KiB sector issues."
In reply to: Steven Rostedt: "Re: [PATCH 5/5] tracing: Do not record user stack trace from NMIcontext"
Next in thread: Steven Rostedt: "Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI context"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]