Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

From: Paul E. McKenney
Date: Mon Mar 28 2016 - 09:50:46 EST


On Mon, Mar 28, 2016 at 08:13:51AM +0200, Peter Zijlstra wrote:
> On Mon, Mar 28, 2016 at 02:23:45AM +0000, Mathieu Desnoyers wrote:
>
> > >> But, you need hotplug for this to happen, right?
> > >
> > > My understanding is that this seems to be detection of failures to be
> > > awakened for a long time on idle CPUs. It therefore seems to be more
> > > idle-related than cpu hotplug-related. I'm not saying that there is
> > > no issue with hotplug, just that the investigation so far seems to
> > > target mostly idle systems, AFAIK without stressing hotplug.
>
> Paul has stated that without hotplug he cannot trigger this.

Which means either that hotplug is absolutely necessary or that
hotplug increases the probability of failure. Ross's experience is
without hotplug on a mostly idle system, so I am currently betting on
"increases the probability". The set of bugs does seem to have gotten
worse somewhere between v4.1 and v4.2, but not sufficiently to allow
reasonble bisection (yes, I did try, several times).

> > > set_nr_if_polling() returns true if the ti->flags read has the
> > > _TIF_NEED_RESCHED bit set, which will skip the IPI.
>
> POLLING_NR, as per your later comment
>
> > > But it seems weird. The side that calls set_nr_if_polling()
> > > does the following:
> > > 1) llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)
> > > 2) set_nr_if_polling(rq->idle)
> > > 3) (don't do smp_send_reschedule(cpu) since set_nr_if_polling() returned
> > > true)
> > >
> > > The idle loop does:
> > > 1) __current_set_polling()
> > > 2) __current_clr_polling()
> > > 3) smp_mb__after_atomic()
> > > 4) sched_ttwu_pending()
> > > 5) schedule_preempt_disabled()
> > > -> This will clear the TIF_NEED_RESCHED flag
> > >
> > > While the idle loop is in sched_ttwu_pending(), after
> > > it has done the llist_del_all() (thus has grabbed all the
> > > list entries), TIF_NEED_RESCHED is still set.
>
> > > If both list_all and
>
> llist_add() ?
>
> > > set_nr_if_polling() are called right after the llist_del_all(), we
> > > will end up in a situation where we have an entry in the list, but
> > > there won't be any reschedule sent on the idle CPU until something
> > > else awakens it. On a _very_ idle CPU, this could take some time.
>
> Can't happen, as per clearing of POLLING_NR before doing llist_del_all()
> and the latter being a full memory barrier.
>
> > > set_nr_and_not_polling() don't seem to have the same issue, because
> > > it does not return true if TIF_NEED_RESCHED is observed as being
> > > already set: it really just depends on the state of the TIF_POLLING_NRFLAG
> > > bit.
> > >
> > > Am I missing something important ?
> >
> > Well, it seems that the test for _TIF_POLLING_NRFLAG in set_nr_if_polling()
> > just before the test for _TIF_NEED_RESCHED should take care of it: while in
> > sched_ttwu_pending within the idle loop, the TIF_POLLING_NRFLAG should be
> > cleared, thus causing set_nr_if_polling to return false.
>
> Right, clue in the name: Set NEED_RESCHED _IF_ POLLING_NR (is set).
>
> > I'm slightly concerned about the lack of smp_mb__after_atomic()
> > between the TIF_NEED_RESCHED flag being cleared within schedule_preempt_disabled
> > and the TIF_POLLING_NRFLAG being set in the following loop. Indeed, clear_bit()
> > does not have a compiler barrier,
>
> Urgh, it really should, as all atomic ops. set_bit() very much has a
> memory clobber in, see below.

And this is one of the changes in your patch that I am now testing, correct?
(Looks that way to me, but..)

Thanx, Paul

> > nor processor-level memory barriers
> > (of course, the processor memory barrier should not really matter on
> > x86-64 due to lock prefix).
>
> Right.
>
> > Moreover, TIF_NEED_RESCHED is bit 3 on x86-64,
> > whereas TIF_POLLING_NRFLAG is bit 21. Those are in two different bytes of
> > the thread flags, and thus set/cleared as different addresses by clear_bit()
> > acting on an immediate "nr" argument.
> >
> > If we have any state where TIF_POLLING_NRFLAG is set before TIF_NEED_RESCHED
> > is cleared within the idle thread, we could end up missing a needed resched IPI.
>
> Yes, that would be bad. No objection to adding smp_mb__before_atomic()
> before the initial __current_set_polling(). Although that's not going to
> make a difference for x86_64 as you already noted.
>
> > Another question: why are set_nr_if_polling and set_nr_and_not_polling two
> > different implementations ?
>
> Because they're fundamentally two different things. The one
> conditionally sets NEED_RESCHED, the other unconditionally sets it.
>
> > Could they be combined ?
>
> Can, yes, will not be pretty nor clear code though.
>
>
> ---
> arch/x86/include/asm/bitops.h | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 7766d1cf096e..5345784d5e41 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -112,11 +112,13 @@ clear_bit(long nr, volatile unsigned long *addr)
> if (IS_IMMEDIATE(nr)) {
> asm volatile(LOCK_PREFIX "andb %1,%0"
> : CONST_MASK_ADDR(nr, addr)
> - : "iq" ((u8)~CONST_MASK(nr)));
> + : "iq" ((u8)~CONST_MASK(nr))
> + : "memory");
> } else {
> asm volatile(LOCK_PREFIX "btr %1,%0"
> : BITOP_ADDR(addr)
> - : "Ir" (nr));
> + : "Ir" (nr)
> + : "memory");
> }
> }
>
>