Re: [GIT PULL] RCU changes for v6.7

From: Paul E. McKenney
Date: Tue Oct 31 2023 - 21:08:04 EST


On Tue, Oct 31, 2023 at 01:06:44PM -1000, Linus Torvalds wrote:
> On Tue, 31 Oct 2023 at 03:57, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > Would it help if we make rcu_stall_chain_notifier_register() print a
> > suitably obnoxious message saying that future RCU CPU stall warnings
> > might be unreliable?
>
> It's not the future stall warnings I worry about.
>
> It's literally things like somebody thinking they are being clever,
> registering a rcu stall notifier that prints out extra information in
> order to be helpful, and in the process takes a spinlock or something
> without thinking about it.
>
> And that spinlock might be the *reason* for the RCU stall in the first place.
>
> So now the RCU stall code prints out NOTHING AT ALL, because now the
> stall notifier itself has deadlocked.
>
> This is *exactly* what has happened before with these kinds of
> "helpful" exception case notifiers. Because they never trigger in
> normal loads, they get basically zero testing - and then when bad
> things happen, it turns out that the "helpful" debug code actually
> just makes things worse.
>
> Or, if they get testing, they get tested in artificial bad cases (eg
> "let's just write a busy loop that hangs for 30 seconds to trigger a
> RCU stall"), which doesn't show any of the issues, because they aren't
> real bugs with real existing deadlocks.
>
> See what I'm saying? Having notifiers for "sh*t happened" is
> fundamentally questionable to begin with, because they get no testing.
>
> And then calling said notifiers *before* you even have the core
> printout for "Look, things are going downhill quickly", now you've
> turned a bad situation even worse.
>
> I really think that we should *never* have any kind of notifiers for
> kernel bugs. They cause problems. The *one* exception is an actual
> honest-to-goodness kernel debugger, and then it should literally
> *only* be the debugger that can register a notifier, so that you are
> *never* in the situation that a kernel without a debugger will just
> hang because of some bogus debug notifier.
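
The failure mode described above can be illustrated with a small userspace
sketch. Everything in it is hypothetical: sketch_stall_cause_lock stands in
for whatever spinlock the stalled code is holding, and the toy trylock built
on C11 atomic_flag stands in for a kernel spinlock. A notifier that spun on
this lock unconditionally would never return, so the sketch only probes
whether it would have hung.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock that the stalled code path is holding. */
static atomic_flag sketch_stall_cause_lock = ATOMIC_FLAG_INIT;

/* Toy trylock: returns true if we acquired the lock, false if it was held. */
static bool sketch_trylock(atomic_flag *lock)
{
	return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

/* Release a lock acquired through sketch_trylock(). */
static void sketch_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * Models a "helpful" stall notifier that wants this lock.  A real notifier
 * that spun here would hang and the stall report would never be printed;
 * this version only reports whether that would have happened.
 */
static bool sketch_notifier_would_deadlock(atomic_flag *lock)
{
	if (sketch_trylock(lock)) {
		sketch_unlock(lock);	/* lock was free: notifier would be fine */
		return false;
	}
	return true;			/* lock already held: notifier would spin forever */
}
```

A busy-loop test never holds the lock, so the probe returns false there,
which is why artificial stalls do not expose this class of bug.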

All fair points.

Here are the ways forward I can see:

1. Status quo. This has all the issues that you call out.
People will hurt themselves with it and consume time and effort.
So let's not do this.

2. I send you a pure revert. Those of us who need this keep the
patches around and apply them when we need them. This avoids
the problems you point out, but makes it harder to use this
where it is needed and useful.

3. Add a default-n Kconfig option that depends on RCU_EXPERT
and DEBUG_KERNEL, so that these problems can only arise in
specially built kernels.

4. Same as #3, but use a kernel boot parameter instead of a
Kconfig option.

5. One of the above other than #2, plus a complaint (maybe a WARN_ON()
or maybe just a printk()) at rcu_stall_chain_notifier_register()
time, before the call to atomic_notifier_chain_register().
This would mean that the complaint ("hey, you are asking for
something that might be dangerous") appears before any RCU CPU
stall warning that could possibly trigger a notifier.
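
To make #3 combined with #5 concrete, here is a rough userspace sketch, not
the actual kernel code. The SKETCH_STALL_NOTIFIER macro is a made-up stand-in
for the default-n Kconfig option, the singly linked list is a simplified
stand-in for the atomic notifier chain, and the message text is invented.
The only point is the ordering: the complaint is emitted before the notifier
is added to the chain.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for a default-n Kconfig option (#3); 1 means built in. */
#ifndef SKETCH_STALL_NOTIFIER
#define SKETCH_STALL_NOTIFIER 1
#endif

/* Simplified stand-in for the kernel's struct notifier_block. */
struct sketch_notifier {
	int (*call)(struct sketch_notifier *nb, unsigned long action, void *data);
	struct sketch_notifier *next;
};

static struct sketch_notifier *sketch_chain;	/* head of the chain */
static int sketch_complaints;			/* number of complaints printed */

/* Stand-in for atomic_notifier_chain_register(): push onto a list. */
static int sketch_chain_register(struct sketch_notifier *nb)
{
	nb->next = sketch_chain;
	sketch_chain = nb;
	return 0;
}

/* Hypothetical wrapper playing the role of rcu_stall_chain_notifier_register(). */
static int sketch_stall_notifier_register(struct sketch_notifier *nb)
{
	if (!SKETCH_STALL_NOTIFIER)
		return -1;	/* option not built in: refuse to register */

	/* Complain first, so the message precedes any stall that could run nb. */
	fprintf(stderr,
		"stall notifier registered: stall warnings may be unreliable\n");
	sketch_complaints++;

	return sketch_chain_register(nb);
}
```

Building with -DSKETCH_STALL_NOTIFIER=0 makes every registration fail, which
is roughly what a kernel built without the option would do.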

Are there any other ways forward? Either way, which would you prefer?

Thanx, Paul