Re: [PATCH 2/3] livepatch/rcu: Warn when system consistency is broken in RCU code

From: Steven Rostedt
Date: Mon May 08 2017 - 16:18:26 EST


On Mon, 8 May 2017 14:47:29 -0500
Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:

> > Although you should have:
> >
> > if (WARN_ONCE(!rcu_is_watching,
> > "Livepatch ..."))
> > return;
> >
> > or something to not cause any damage.
>
> My understanding is that returning would be more dangerous than
> continuing here.
>
> By continuing to run, there's only a small chance that it will get stale
> data, which would break the consistency model by executing an old
> version of the function and possibly crashing the system.
>
> On the other hand, returning would unconditionally break the consistency
> model by *always* executing an old version of the function. So that
> greatly increases the risk of a crash.

I was being oversimplified by saying 'return', perhaps go into a
critical mode that can try again, or perhaps even back out the patch.
As in a transaction style. Yes, this will need to be thought through to
know how to get out. My comment wasn't meant to be simple.

>
> > > The warning would be more helpful if it printed a little more
> > > information, like the ip and the parent_ip. And it should probably
> > > mention that RCU is broken.
> >
> > Well, the warning would also print a stack trace. How is RCU broken? It
> > could simply be that you are patching a function that is in a place
> > that RCU doesn't "watch". Like going to idle or userspace. Or even in
> > RCU itself.
>
> As I understand it, RCU would be "broken" because this ftrace handler
> has an RCU read critical section. And in the case where RCU isn't
> watching, rcu_read_lock() will not function as advertised, right?

Well, it's not RCU that's broken. It's the users ;-)

>
> > > Also I wonder if we can constrain the warning somehow. I think the
> > > warning only applies if a patch is in progress, right? In that case, if
> > > RCU is broken, would it be feasible to mutex_trylock() the klp mutex to
> > > try to ensure that no patches are being applied while the ftrace handler
> > > is running? Then it wouldn't matter if RCU were broken because the func
> > > stack wouldn't be changing anyway. Then it could only warn if it failed
> > > to get the mutex.
> >
> > How would RCU be broken?
> >
> > >
> > > Stepping back a bit, the documentation and comments describe patches to
> > > functions inside the RCU infrastructure. As far as I can tell, only a
> > > single function would be affected: rcu_dynticks_eqs_enter(). Because
> > > it's the only function called after the "Breaks tracing momentarily"
> > > comment in rcu_eqs_enter_common(). Any reason why we couldn't just
> > > annotate rcu_dynticks_eqs_enter() with notrace?
> >
> > Note, there's places in the kernel (on the way to idle and userspace)
> > that rcu is not watching. Ftrace handles this differently than most
> > places. But anything that requires calling rcu_read_lock(), well, you
> > need to beware.
> >
> > >
> > > Stepping back even further, if I'm understanding this issue correctly,
> > > this warning can also affect patches to functions which are called from
> > > NMI context. If the NMI occurs in the part of rcu_eqs_enter_common()
> > > where rcu_irq_enter() doesn't work, then RCU won't work here, right? If
> > > so, that worries me because there are a lot of functions which can be
> > > called (and patched) from NMI context.
> >
> > Note, the "rcu_dynticks_eqs_enter()" is the only place that can't make
> > rcu "watch" again.
>
> So would it make sense to annotate it with 'notrace'?

No. Because ftrace doesn't require that and works fine. The strack
tracer is the only thing that has issues in the ftrace code.

If we were to start slapping in notrace for each user of ftrace that
has issues, then we would have nothing left to trace ;-)


>
> > If rcu is not watching, calling rcu_enter_irq() will have it watch
> > again. Even in NMI context I believe.
>
> What if you get an NMI while running in rcu_dynticks_eqs_enter() before
> it increments rdtp->dynticks? Will rcu_enter_irq() still work from the
> NMI?

Good question. Paul?

>
> I'm just trying to understand what are the cases where rcu_enter_irq()
> *doesn't* work from an ftrace handler.

I think the only place is that one function. But you are right. What
happens if an NMI comes in?

>
> > > I wonder if there's some way to solve this by changing RCU code, but I'm
> > > not familiar enough with RCU to have any ideas there.
> >
> > You don't want to go there.
>
> I believe you :-)
>

-- Steve