Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline

From: Paul E. McKenney
Date: Fri Jun 29 2018 - 00:28:26 EST


On Thu, Jun 28, 2018 at 03:06:46PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 28, 2018 at 05:38:33AM -0700, Paul E. McKenney wrote:
> > Please let me try again.
> >
> > The approach you are suggesting, clever though it is, disables a check
>
> https://lkml.kernel.org/r/20180627094633.GG2512@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> Is the one we're talking about, right?

Yes.

> That does not disable any actual check afaict. It simply does not do a
> wakeup when ran on an offline CPU. And ensures we do an unconditional
> wakeup soon after from a still running CPU.

It does implicitly by avoiding doing the wakeup when the CPU is offline.
This has the effect of disabling the RCU checks in your wakeup code.

> > of a type that has proved to be an important diagnostic in the past.
> > It is only reasonable to assume that this check would be important
> > and helpful in the future, but only if that check remains in the code.
>
> I am confused..
>
> > Yes, agreed, given the current structure of the code, this particular
> > instance of the check would not matter, but experience indicates that
> > RCU code restructuring is not at all uncommon, with the current effort
> > being but one case in point.
>
> Once more confused...
>
> > So, unless I am missing something, the only possible benefit of disabling
> > this check is getting rid of an acquisition of an uncontended lock in
> > a code path that is miles (sorry, kilometers) away from any fastpath.
> > So, again, yes, it is clever. If it sped up a fastpath, I might be
> > sorely tempted to take it. But the alternative is straightforward and
> > isn't anywhere near a fastpath. So, though I do very much appreciate
> > the cleverness and creativity, I am not seeing your change to be a
> > good tradeoff from a long-term maintainability viewpoint.
>
> I think you mean guarantee/invariant instead of check. But I see it no
> different than any other missed rcu_gp_kthread_wake(). You can similarly
> fail to make the call while restructuring.

Well, we do use checks to detect failures to provide guarantees and to
maintain invariants, so they are closely related. And yes, for any check
you might provide, there are ways to defeat that check. Software and
all that.

But we do seem to be talking past each other. One option would be for
me to take another look after I get the cleanup code generated for the
RCU flavor consolidation, which will be some weeks. Either or both of
us might have come up with a better approach in that time anyway, right?

Thanx, Paul