Re: Alleged fix for writer stall on -rcu branch dev

From: Paul E. McKenney
Date: Wed Apr 10 2019 - 20:06:27 EST


On Wed, Apr 10, 2019 at 01:33:02PM -0700, Paul E. McKenney wrote:
> On Wed, Apr 10, 2019 at 09:19:18PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2019-04-10 11:41:05 [-0700], Paul E. McKenney wrote:
> > > On Wed, Apr 10, 2019 at 08:15:38PM +0200, Sebastian Andrzej Siewior wrote:
> > > > On 2019-04-09 15:14:26 [-0700], Paul E. McKenney wrote:
> > > > > Hello, Sebastian,
> > > > Hi Paul,
> > > >
> > > > > A few commits later, I finally have something that might work. ;-)
> > > >
> > > > Okay. I started with 33e04e4512797b5e0242f452d0027b096d43d006. The first
> > > > batch of 18 completed and this
> > > > https://breakpoint.cc/console-tree1.2.log
> > >
> > > Another new one on me! ;-)
> > >
> > > It looks like CPU 11 got stuck in the cpu_stopper_thread trying to
> > > offline CPU 6. Unfortunately, CPU 6 is not in the list receiving an
> > > NMI, perhaps because it looked like it was already offline. It does
> > > eventually finish going offline quite some time later.
> > >
> > > Perhaps my forward-progress testing is interacting badly with CPU
> > > hotplug, but only very occasionally?
> >
> > Somehow, yes.
> >
> > [ 8433.835292] Unregister pv shared memory for cpu 6
> > [ 8433.864122] smpboot: CPU 6 is now offline
> >
> > CPU6 is offline.
> >
> > [ 8434.934765] smpboot: Booting Node 0 Processor 6 APIC 0x6
> > [ 8434.950632] kvm-clock: cpu 6, msr 17155a181, secondary cpu clock
> > [ 8434.989124] KVM setup async PF for cpu 6
> > [ 8434.990801] kvm-stealtime: cpu 6, msr 17b195380
> >
> > CPU6 is booting again.
> >
> > [ 8436.035269] Unregister pv shared memory for cpu 6
> >
> > going down again
> >
> > [ 8462.032468] rcu: INFO: rcu_preempt self-detected stall on CPU
> > [ 8462.037385] rcu: 11-...!: (25587 ticks this GP) idle=57e/1/0x4000000000000002 softirq=418067/418067 fqs=1 last_accelerate: 2456/89e6, Nonlazy posted: .LD
> >
> > 25587 ticks is ~25secs with HZ=1k. And the stall occurred on CPU11,
> > correct?
>
> Yes to both.
>
> > [ 8525.041031] smpboot: CPU 6 is now offline
> >
> > 63 secs later, the CPU is down. So that is too long.
>
> Ah, I did miss the down-then-up, thank you!
>
> > > Something for me to look at, anyway!
> >
> > There is also
> > [ 8556.907804] WARNING: CPU: 35 PID: 833 at /home/bigeasy/linux-rt/kernel/rcu/rcutorture.c:1827 rcu_torture_fwd_prog+0x5c4/0x630
> >
> > but I'm not sure what it is :)
>
> The forward-progress test exceeded eight seconds, but pushed fewer
> than 100 callbacks through the system. One way that could happen is
> if the forward-progress kthread was delayed during that eight seconds.
> Another is if grace periods were insanely fast (which they are not).
>
> > > > I should have been alone on that server but I can't guarantee it. I try
> > > > to throw the rcu-torture at a little smaller box and check how it goes.
> > >
> > > Even if you were not alone, I would guess that preempting a runnable
> > > process for 92 seconds would have been a visible event on the host.
> >
> > 92 seconds?! Okay, that should have been visible on the host side. And I
> > see nothing in dmesg on the host side.
> > But why 92? 25 for RCU stall + 63 between starting to go down and
> > finishing?
>
> Because I missed the momentary up-then-down that you spotted. Your 63
> seconds is correct.
>
> > > But maybe not? The watchdog timer is still 120 seconds? Or does your
> > > host set it to 22 seconds?
> >
> > I didn't fiddle with it and host runs 4.19.12-1~bpo9+1. So it should be
> > the default value.
>
> Which I believe to be 22 seconds. Hmmm...
>
> Either way, thank you very much for the testing!!!

[ Adding LKML back -- I think we dropped it due to an attachment? ]

One area of some concern is RCU_KTHREAD_YIELDING and the rcuc kthreads.
This shouldn't be a big deal for the intended use case (real-time
workloads with lightish load), but a two-jiffy delay could allow quite a
few callbacks to build up when doing forward-progress testing. But not
a problem for use_softirq runs, which is what you were running.
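
To be concrete, the rcuc kthread's main loop is shaped roughly like the
following (paraphrased from the dev branch with the tracing and some of
the bookkeeping trimmed, so details may differ): it makes up to ten
passes over the pending work, and if work still remains after that, it
deliberately sleeps for about two jiffies before going back to waiting.
That two-jiffy nap is the window in question under callback flood.

static void rcu_cpu_kthread(unsigned int cpu)
{
	unsigned int *statusp = this_cpu_ptr(&rcu_data.rcu_cpu_kthread_status);
	char work, *workp = this_cpu_ptr(&rcu_data.rcu_cpu_has_work);
	int spincnt;

	for (spincnt = 0; spincnt < 10; spincnt++) {
		local_bh_disable();
		*statusp = RCU_KTHREAD_RUNNING;
		local_irq_disable();
		work = *workp;
		*workp = 0;
		local_irq_enable();
		if (work)
			rcu_core();	/* Note quiescent states, invoke callbacks, ... */
		local_bh_enable();
		if (*workp == 0) {
			*statusp = RCU_KTHREAD_WAITING;
			return;
		}
	}
	*statusp = RCU_KTHREAD_YIELDING;	/* The two-jiffy delay in question. */
	schedule_timeout_interruptible(2);
	*statusp = RCU_KTHREAD_WAITING;
}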

But it might be that I need to either skip the forward-progress testing
if !use_softirq, or skip the RCU_KTHREAD_YIELDING step if there are too
many callbacks, or some such. Either way, though, I would need at least
a cond_resched() on each pass through the rcu_cpu_kthread() function.
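
Purely as an illustration of the cond_resched() and yield-skipping
ideas (this is not a patch, and the exact check is an open question
rather than settled policy), the tail of the sketch above might change
to something like this:

	for (spincnt = 0; spincnt < 10; spincnt++) {
		/* ... per-pass work as above ... */
		cond_resched();		/* Let the scheduler in on every pass. */
	}
	/*
	 * Skip the two-jiffy yield during a callback flood so that
	 * forward-progress testing does not fall hopelessly behind.
	 * Comparing against qhimark here is an assumption, nothing more.
	 */
	if (rcu_segcblist_n_cbs(&this_cpu_ptr(&rcu_data)->cblist) < qhimark) {
		*statusp = RCU_KTHREAD_YIELDING;
		schedule_timeout_interruptible(2);
	}
	*statusp = RCU_KTHREAD_WAITING;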

Thoughts from a real-time perspective?

Oooh... But in both use_softirq and !use_softirq runs, if the CPU to be
offlined has a huge number of callbacks pending (assuming callbacks are
not offloaded), there could be quite the delay imposed on the offlining
of that CPU. Now 63 seconds seems a bit much, but there can be tens
of millions of callbacks queued up to be invoked. It -could- happen,
I suppose, though I would guess more like a couple of seconds.
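
Back-of-envelope, using a made-up invocation rate rather than a
measurement: at something like 10M callback invocations per second,
20M-30M queued callbacks is only 2-3 seconds of work, so a 63-second
delay would need either hundreds of millions of callbacks or a much
slower effective rate.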

Maybe I need to ignore the ignoring of blimit for a CPU that is going
offline... The theory is that the callbacks will just get dumped on
some other CPU if the CPU is going away, and on the other hand invoking
callbacks sparingly would help the CPU come back all the way online
more quickly. Unless one of the callbacks on that CPU was the one from
the synchronize_rcu() in the CPU-hotplug offline path. :-/

So I could start going slow on callbacks on the outgoing CPU as soon as
that synchronize_rcu() completes. Horribly fragile, given the possibility
of synchronize_rcu() in other CPU-hotplug notifiers, though. Plus it
looks like sched_cpu_deactivate() is about the first thing that happens
when removing a CPU.

But you know... Ignoring the world while invoking up to LONG_MAX RCU
callbacks back to back probably never was an optimal plan. So if that
splat is at all reproducible, could you please give the patch below a try?
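
For context on why capping ->blimit bounds the damage: the invocation
loop in rcu_do_batch() only breaks out after roughly ->blimit callbacks,
and then only if someone else wants the CPU. A heavily trimmed
paraphrase (the "invoked" counter stands in for the real bookkeeping,
and offloading, tracing, and requeueing are omitted):

	long bl = rdp->blimit;
	long invoked = 0;
	struct rcu_head *rhp;

	rcu_segcblist_extract_done_cbs(&rdp->cblist, &rcl);
	for (rhp = rcu_cblist_dequeue(&rcl); rhp; rhp = rcu_cblist_dequeue(&rcl)) {
		debug_rcu_head_unqueue(rhp);
		rhp->func(rhp);		/* The real code goes through __rcu_reclaim(). */
		/*
		 * With ->blimit cranked up to LONG_MAX, this break is
		 * effectively never taken, which is how we end up invoking
		 * callbacks back to back while ignoring the world.
		 */
		if (++invoked >= bl &&
		    (need_resched() ||
		     (!is_idle_task(current) && !rcu_is_callbacks_kthread())))
			break;
	}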

I have just touch-tested this on TREE01 and TREE04. I will be testing
the rest of the scenarios overnight, Pacific Time.

Of course, if you have more failures, please let me know!

Thanx, Paul

------------------------------------------------------------------------

commit 3245c1860cb30f85ae42c25209217bac25104faf
Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
Date: Wed Apr 10 17:01:39 2019 -0700

rcu: Set a maximum limit for back-to-back callback invocation

Currently, if a CPU has more than 10,000 callbacks pending, it will
increase rdp->blimit to LONG_MAX. If you are lucky, LONG_MAX is only
about two billion, but this is still a bit too many callbacks to invoke
back-to-back while otherwise ignoring the world.

This commit therefore sets a maximum limit of DEFAULT_MAX_RCU_BLIMIT,
which is set to 10,000, for rdp->blimit.

Reported-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f53befdeecc1..afeb2751c94f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -401,7 +401,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
 	return __this_cpu_read(rcu_data.dynticks_nesting) == 0;
 }
 
-#define DEFAULT_RCU_BLIMIT 10 /* Maximum callbacks per rcu_do_batch. */
+#define DEFAULT_RCU_BLIMIT 10 /* Maximum callbacks per rcu_do_batch ... */
+#define DEFAULT_MAX_RCU_BLIMIT 10000 /* ... even during callback flood. */
 static long blimit = DEFAULT_RCU_BLIMIT;
 #define DEFAULT_RCU_QHIMARK 10000 /* If this many pending, ignore blimit. */
 static long qhimark = DEFAULT_RCU_QHIMARK;
@@ -2134,7 +2135,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
 	/* Reinstate batch limit if we have worked down the excess. */
 	count = rcu_segcblist_n_cbs(&rdp->cblist);
-	if (rdp->blimit == LONG_MAX && count <= qlowmark)
+	if (rdp->blimit >= DEFAULT_MAX_RCU_BLIMIT && count <= qlowmark)
 		rdp->blimit = blimit;
 
 	/* Reset ->qlen_last_fqs_check trigger if enough CBs have drained. */
@@ -2464,7 +2465,7 @@ static void __call_rcu_core(struct rcu_data *rdp, struct rcu_head *head,
 			rcu_accelerate_cbs_unlocked(rdp->mynode, rdp);
 		} else {
 			/* Give the grace period a kick. */
-			rdp->blimit = LONG_MAX;
+			rdp->blimit = DEFAULT_MAX_RCU_BLIMIT;
 			if (rcu_state.n_force_qs == rdp->n_force_qs_snap &&
 			    rcu_segcblist_first_pend_cb(&rdp->cblist) != head)
 				rcu_force_quiescent_state();