Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

From: Paul E. McKenney
Date: Mon Mar 28 2016 - 20:25:24 EST


On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
> On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
> > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:

[ . . . ]

> > > OK, so I should instrument migration_call() if I get the repro rate up?
> >
> > Can do, maybe try the below first. (yes I know how long it all takes :/)
>
> OK, will run this today, then run calibration for last night's run this
> evening.

And there was one failure out of ten runs. If last night's failure rate
was typical (7 of 24), then I believe we can be about 87% confident that
this change helped. That isn't all that confident, but...
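
In case it helps to see the arithmetic, one crude way to put a number on
that sort of comparison (not necessarily how the 87% above was computed) is
a simple binomial model: take last night's rate (7 failures in 24 runs) as
the per-run failure probability, then ask how likely at most one failure in
ten runs would be at that rate.  Stand-alone back-of-the-envelope sketch:

#include <math.h>
#include <stdio.h>

/* P(exactly k failures in n runs), each run failing with probability p. */
static double binom_pmf(int n, int k, double p)
{
	double c = 1.0;
	int i;

	for (i = 0; i < k; i++)
		c = c * (n - i) / (i + 1);
	return c * pow(p, k) * pow(1.0 - p, n - k);
}

int main(void)
{
	double p = 7.0 / 24.0;	/* last night's per-run failure rate */
	double tail = binom_pmf(10, 0, p) + binom_pmf(10, 1, p);

	printf("P(<= 1 failure in 10 runs at old rate) = %.3f\n", tail);
	printf("confidence that the change helped     ~= %.0f%%\n",
	       100.0 * (1.0 - tail));
	return 0;	/* lands in the mid-80% range under this crude model */
}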

Tested-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>

So what to run tonight?

The sanest approach would be to run a stock kernel in order to get a baseline
failure rate.  It is tempting to do more runs with Peter's patch, but part of
the problem is that we don't know the current baseline.

So baseline it is...

Thanx, Paul

> Speaking of which, last night's run (disabling TIF_POLLING_NRFLAG)
> consisted of 24 two-hour runs. Six of them had hard hangs, and another
> had a hang that eventually unhung of its own accord. I believe that this
> is significantly fewer failures than from a stock kernel, but I could
> be wrong, and it will take some serious testing to give statistical
> confidence for whatever conclusion is correct.
>
> > > > The other interesting case would be resched_cpu(), which uses
> > > > set_nr_and_not_polling() to kick a remote cpu to call schedule(). It
> > > > atomically sets TIF_NEED_RESCHED and returns true if TIF_POLLING_NRFLAG was
> > > > not set. If indeed not, it will send an IPI.
> > > >
> > > > This assumes the idle 'exit' path will do the same as the IPI does; and
> > > > if you look at cpu_idle_loop() it does indeed do both
> > > > preempt_fold_need_resched() and sched_ttwu_pending().
> > > >
> > > > Note that one cannot rely on irq_enter()/irq_exit() being called for the
> > > > scheduler IPI.
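
To make that handshake concrete, here is a minimal stand-alone sketch of the
protocol as I read it; the names and the user-space C11 atomics are purely
illustrative, not the kernel's actual set_nr_and_not_polling() or
cpu_idle_loop() code:

#include <stdatomic.h>
#include <stdbool.h>

#define NEED_RESCHED	(1u << 0)	/* stand-in for TIF_NEED_RESCHED */
#define POLLING_NRFLAG	(1u << 1)	/* stand-in for TIF_POLLING_NRFLAG */

/*
 * Remote side: atomically set NEED_RESCHED and report whether an IPI is
 * still required.  If the target advertised POLLING_NRFLAG, the IPI is
 * skipped and the idle loop is trusted to notice the flag when it stops
 * polling.
 */
static bool set_need_resched_maybe_skip_ipi(atomic_uint *ti_flags)
{
	unsigned int old = atomic_fetch_or(ti_flags, NEED_RESCHED);

	return !(old & POLLING_NRFLAG);		/* true => send the IPI */
}

/*
 * Idle side: on leaving the polling idle loop, do everything the scheduler
 * IPI would have done: fold the resched request and drain any remotely
 * queued wakeups.
 */
static void idle_loop_exit(atomic_uint *ti_flags)
{
	atomic_fetch_and(ti_flags, ~POLLING_NRFLAG);

	if (atomic_load(ti_flags) & NEED_RESCHED) {
		/* preempt_fold_need_resched() equivalent goes here */
	}
	/* sched_ttwu_pending() equivalent: flush the rq->wake_list llist */
}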
> > >
> > > OK, thank you for the info! Any specific debug actions?
> >
> > Dunno, something like the below should bring visibility into the
> > (lockless) wake_list thingy.
> >
> > So these trace_printk()s should happen between trace_sched_waking() and
> > trace_sched_wakeup() (I've not fully read the thread, but ISTR you had
> > some traces with these here thingies on).
> >
> > ---
> >  arch/x86/include/asm/bitops.h | 6 ++++--
> >  kernel/sched/core.c           | 9 +++++++++
> >  2 files changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> > index 7766d1cf096e..5345784d5e41 100644
> > --- a/arch/x86/include/asm/bitops.h
> > +++ b/arch/x86/include/asm/bitops.h
> > @@ -112,11 +112,13 @@ clear_bit(long nr, volatile unsigned long *addr)
> >  	if (IS_IMMEDIATE(nr)) {
> >  		asm volatile(LOCK_PREFIX "andb %1,%0"
> >  			: CONST_MASK_ADDR(nr, addr)
> > -			: "iq" ((u8)~CONST_MASK(nr)));
> > +			: "iq" ((u8)~CONST_MASK(nr))
> > +			: "memory");
> >  	} else {
> >  		asm volatile(LOCK_PREFIX "btr %1,%0"
> >  			: BITOP_ADDR(addr)
> > -			: "Ir" (nr));
> > +			: "Ir" (nr)
> > +			: "memory");
> >  	}
> >  }
>
> Is the above addition of "memory" strictly for the debug below, or is
> it also a potential fix?
>
> Starting it up regardless, but figured I should ask!
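
For context, my understanding is that a "memory" clobber tells the compiler
that the asm may read or write memory beyond its listed operands, so values
cannot be kept cached in registers across it and surrounding loads/stores
cannot be reordered past it; without the clobber, clear_bit() is still atomic
but is not a compiler barrier.  A toy stand-in for the annotated form, not
the kernel's actual clear_bit():

static inline void toy_clear_bit(long nr, volatile unsigned long *addr)
{
	asm volatile("lock; btr %1,%0"
		     : "+m" (*addr)
		     : "r" (nr)
		     : "memory");	/* compiler barrier: no caching or
					 * reordering of unrelated memory
					 * accesses across this asm */
}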
>
> Thanx, Paul
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0b21e7a724e1..b446f73c530d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1669,6 +1669,7 @@ void sched_ttwu_pending(void)
> >  	while (llist) {
> >  		p = llist_entry(llist, struct task_struct, wake_entry);
> >  		llist = llist_next(llist);
> > +		trace_printk("waking %d\n", p->pid);
> >  		ttwu_do_activate(rq, p, 0);
> >  	}
> >
> > @@ -1719,6 +1720,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu)
> >  	struct rq *rq = cpu_rq(cpu);
> >
> >  	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
> > +		trace_printk("queued %d for waking on %d\n", p->pid, cpu);
> >  		if (!set_nr_if_polling(rq->idle))
> >  			smp_send_reschedule(cpu);
> >  		else
> > @@ -5397,10 +5399,17 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> >  		migrate_tasks(rq);
> >  		BUG_ON(rq->nr_running != 1); /* the migration thread */
> >  		raw_spin_unlock_irqrestore(&rq->lock, flags);
> > +
> > +		/* really bad m'kay */
> > +		WARN_ON(!llist_empty(&rq->wake_list));
> > +
> >  		break;
> >
> >  	case CPU_DEAD:
> >  		calc_load_migrate(rq);
> > +
> > +		/* more bad */
> > +		WARN_ON(!llist_empty(&rq->wake_list));
> >  		break;
> >  #endif
> >  	}
> >