Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

From: Paul E. McKenney
Date: Wed Apr 03 2024 - 14:57:30 EST


On Tue, Apr 02, 2024 at 09:47:37AM -0700, Paul E. McKenney wrote:
> On Mon, Apr 01, 2024 at 05:04:10PM -0700, Paul E. McKenney wrote:
> > On Mon, Apr 01, 2024 at 11:56:36PM +0200, Frederic Weisbecker wrote:
> > > > On Mon, Apr 01, 2024 at 02:26:25PM -0700, Paul E. McKenney wrote:
> > > > > > _ The RCU CPU Stall report. I strongly suspect the cause is the hrtimer
> > > > > > enqueue to an offline CPU. Let's solve that and we'll see if it still
> > > > > > triggers.
> > > > >
> > > > > Sounds like a plan!
> > > >
> > > > Just checking in on this one. I did reproduce your RCU CPU stall report
> > > > and also saw a TREE03 OOM that might (or might not) be related. Please
> > > > let me know if hammering TREE03 harder or adding some debug would help.
> > > > Otherwise, I will assume that you are getting sufficient bug reports
> > > > from your own testing to be getting along with.
> > >
> > > Hehe, there are a lot indeed :-)
> > >
> > > So there has been some discussion on CPUSET VS Hotplug, as a problem there
> > > is likely the cause of the hrtimer warning you saw, which in turn might
> > > be the cause of the RCU stalls.
> > >
> > > > Do you always see the hrtimer warning along with the RCU stalls? Because if so, this
> > > might help:
> > > https://lore.kernel.org/lkml/20240401145858.2656598-1-longman@xxxxxxxxxx/T/#m1bed4d298715d1a6b8289ed48e9353993c63c896
> >
> > Not always, but why not give it a shot?
>
> And no failures, though I would need to run much longer for this to
> mean much. These were wide-spectrum tests, so my next step will be to
> run only TREE03 and TREE07.

And 600 hours each of TREE03 and TREE07 got me a single TREE07 instance
of the sched_tick_remote() failure. This one:

	WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);

But this is just rcutorture testing out "short" 14-second stalls, which
can only be expected to trigger this from time to time. The point of
this stall is to test the evasive actions that RCU takes when 50% of
the way to the RCU CPU stall timeout.

One approach would be to increase that "3" to "15", but that sounds
quite fragile. Another would be for rcutorture to communicate the fact
that stall testing is in progress, and then this WARN_ON_ONCE() could
silence itself in that case.
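
For concreteness, here is a rough userspace sketch of that second
approach. The names stall_test_in_progress, stall_test_begin(),
stall_test_end(), and tick_delta_should_warn() are made up for
illustration and are not existing kernel or rcutorture interfaces.
Only the three-second threshold comes from the WARN_ON_ONCE() above,
and a real implementation would also need to think about memory
ordering and about which CPUs the stall test actually affects.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical flag: set while a deliberate stall test is running. */
static bool stall_test_in_progress;

/* Hypothetical hooks that the stall-testing code would call. */
static void stall_test_begin(void) { stall_test_in_progress = true; }
static void stall_test_end(void) { stall_test_in_progress = false; }

/*
 * Model of the check: returns true if the warning would fire for the
 * given tick delta (in nanoseconds).  The three-second limit matches
 * the existing WARN_ON_ONCE(); the early return is the assumed
 * "silence itself" behavior during stall testing.
 */
static bool tick_delta_should_warn(uint64_t delta_ns)
{
	if (stall_test_in_progress)
		return false;
	return delta_ns > (uint64_t)NSEC_PER_SEC * 3;
}
```

With this model, a 14-second delta warns when no stall test is
running and stays quiet between stall_test_begin() and
stall_test_end().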

But is there a better approach?

Thanx, Paul