Re: [PATCH 1/2] ring-buffer: add NMI protection for spinlocks

From: Steven Rostedt
Date: Thu Feb 05 2009 - 21:25:34 EST



[ added Arjan ]

On Thu, 5 Feb 2009, Andrew Morton wrote:
> > diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> > index 4d33224..4c68358 100644
> > --- a/arch/x86/kernel/ftrace.c
> > +++ b/arch/x86/kernel/ftrace.c
> > @@ -113,7 +113,7 @@ static void ftrace_mod_code(void)
> > MCOUNT_INSN_SIZE);
> > }
> >
> > -void ftrace_nmi_enter(void)
> > +void arch_ftrace_nmi_enter(void)
> > {
> > atomic_inc(&in_nmi);
>
> This in_nmi thing looks a bit awkward. It's not very clear from the
> code what it does or why it exists, but it looks like I'm going to have
> to get used to that...
>
> Surely it is buggy? It's incremented by _any_ cpu, so code which reads
> it (eg, prepare_ftrace_return()) will bale out if some other CPU
> happened to be in the middle of taking an NMI.

And that is exactly what we want it to do ;-)

This is code modification. If you modify code that another CPU is
executing, the result is undefined. In my tests that trigger this type of
error, I usually get a general protection fault.

To modify code safely, you must act like a uniprocessor: the code you
modify must not be executing on another CPU. We use stop_machine to help
us out here, but under NMI stress tests we found that we can still crash,
because an NMI will execute the code being modified. stop_machine does not
disable NMIs on the other CPUs.

To solve this, we synchronize the modification with the NMIs using a
couple of variables. That's what all those memory barriers are for.

When we go to modify a line of code, we set a variable saying that we are
doing so, wait for all NMIs to finish, modify the code, wait for NMIs
again, and then repeat on the next line to modify.
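
Roughly, the writer side looks like this. This is only a simplified
sketch: the names (mod_code_ip, mod_code_newcode, mod_code_write,
wait_for_nmis, do_mod_code) are illustrative rather than the exact
symbols in arch/x86/kernel/ftrace.c, and I assume the text page has
already been made writable:

	static void *mod_code_ip;	/* instruction being patched */
	static void *mod_code_newcode;	/* bytes to write there */
	static atomic_t in_nmi;		/* CPUs currently in an NMI handler */
	static int mod_code_write;	/* a modification is in progress */

	static void wait_for_nmis(void)
	{
		/* spin until every CPU has left its NMI handler */
		while (atomic_read(&in_nmi))
			cpu_relax();
	}

	static void do_mod_code(void *ip, void *new_code)
	{
		mod_code_ip = ip;
		mod_code_newcode = new_code;

		/* publish the flag before writing; NMIs test it on entry */
		mod_code_write = 1;
		smp_mb();

		wait_for_nmis();	/* nobody is mid-NMI over the old code */
		memcpy(mod_code_ip, mod_code_newcode, MCOUNT_INSN_SIZE);
		wait_for_nmis();	/* any NMI that raced wrote the same bytes */

		mod_code_write = 0;
		smp_mb();
	}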

If an NMI goes off on any CPU, it increments the in_nmi counter. If the
mod-code flag is set, the NMI also writes the code itself.

The trick here is that the crash only occurs if the code is modified while
it is being executed. It will not crash if the code is overwritten with
the same bytes it already contains. In other words, it is fine to write to
the code and execute it, as long as what you are writing is what is
already there: no real modification is being done.

All NMIs that go off will write the same code. Even if the code being
modified is in an NMI code path, only the first NMI that gets there will
actually modify it; the rest will just be writing the same data to the
same location. Even the main code will be writing the same data. No races.
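
The NMI entry/exit hooks would then be along these lines (again just a
sketch, using the same made-up names as the writer-side sketch above):

	void arch_ftrace_nmi_enter(void)
	{
		atomic_inc(&in_nmi);
		/* make sure we see mod_code_write if the writer set it */
		smp_mb();

		if (mod_code_write) {
			/*
			 * Whoever gets here first does the real change;
			 * everyone after that rewrites identical bytes,
			 * which is harmless.
			 */
			memcpy(mod_code_ip, mod_code_newcode,
			       MCOUNT_INSN_SIZE);
		}
	}

	void arch_ftrace_nmi_exit(void)
	{
		/* let the writer see that this NMI has finished */
		smp_mb();
		atomic_dec(&in_nmi);
	}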



>
> Would it not be better to make in_nmi a per-cpu thing, make it a piece
> of first-class kernel infrastructure, add a nice interface to it, hoist
> it up into generic code, call that code from arch NMI handlers, call
> that interface from ftrace, etc? Or perhaps it could be implemented in
> task_struct, alongside in_irq(), in_softirq(), etc. There are three bits
> left in ->preempt_count.

I agree, we should have per-CPU NMI accounting. I can write that up and
use it for the ring buffer. But the callback for the ftrace code that I
explained here must still be made; we need a "global" NMI counter for it
to work.
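
Something like this is what I have in mind for the per-CPU side. Purely
illustrative, not an existing interface; __get_cpu_var() is fine here
because the counter is only touched from NMI context on the local CPU:

	/* hypothetical per-CPU nesting counter, not an existing API */
	static DEFINE_PER_CPU(int, nmi_nesting);

	static inline void nmi_enter_percpu(void)
	{
		__get_cpu_var(nmi_nesting)++;
	}

	static inline void nmi_exit_percpu(void)
	{
		__get_cpu_var(nmi_nesting)--;
	}

	static inline int in_nmi_percpu(void)
	{
		/* answers "is *this* CPU in an NMI", unlike the global count */
		return __get_cpu_var(nmi_nesting) != 0;
	}

That would answer Andrew's use case, but it cannot replace the global
counter the code-modification path needs.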

-- Steve

>
> This approach would remove the duplication which this patch is
> attempting to add.
>
> Plus there's already quite a bit of duplication in
> arch/x86/kernel/ftrace.c between the
> CONFIG_DYNAMIC_FTRACE/!CONFIG_DYNAMIC_FTRACE hunks. Two definitions of
> in_nmi, duplicated inc/dec thereof?
>
>
>
>
>