Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

From: Peter Zijlstra
Date: Mon Mar 28 2016 - 02:14:31 EST


On Mon, Mar 28, 2016 at 02:23:45AM +0000, Mathieu Desnoyers wrote:

> >> But, you need hotplug for this to happen, right?
> >
> > My understanding is that this seems to be detection of failures to be
> > awakened for a long time on idle CPUs. It therefore seems to be more
> > idle-related than cpu hotplug-related. I'm not saying that there is
> > no issue with hotplug, just that the investigation so far seems to
> > target mostly idle systems, AFAIK without stressing hotplug.

Paul has stated that without hotplug he cannot trigger this.

> > set_nr_if_polling() returns true if the ti->flags read has the
> > _TIF_NEED_RESCHED bit set, which will skip the IPI.

POLLING_NR, as per your later comment
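
For reference, set_nr_if_polling() looks roughly like the below around v4.5
(paraphrased from kernel/sched/core.c, not a verbatim copy); note the
POLLING_NRFLAG test comes first, so a non-polling target always gets the IPI:

static bool set_nr_if_polling(struct task_struct *p)
{
	struct thread_info *ti = task_thread_info(p);
	typeof(ti->flags) old, val = READ_ONCE(ti->flags);

	for (;;) {
		if (!(val & _TIF_POLLING_NRFLAG))
			return false;	/* not polling: caller sends the IPI */
		if (val & _TIF_NEED_RESCHED)
			return true;	/* already set: skip the IPI */
		old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
		if (old == val)
			break;
		val = old;
	}
	return true;			/* we set NEED_RESCHED: skip the IPI */
}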

> > But it seems weird. The side that calls set_nr_if_polling()
> > does the following:
> > 1) llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)
> > 2) set_nr_if_polling(rq->idle)
> > 3) (don't do smp_send_reschedule(cpu) since set_nr_if_polling() returned
> > true)
> >
> > The idle loop does:
> > 1) __current_set_polling()
> > 2) __current_clr_polling()
> > 3) smp_mb__after_atomic()
> > 4) sched_ttwu_pending()
> > 5) schedule_preempt_disabled()
> > -> This will clear the TIF_NEED_RESCHED flag
> >
> > While the idle loop is in sched_ttwu_pending(), after
> > it has done the llist_del_all() (thus has grabbed all the
> > list entries), TIF_NEED_RESCHED is still set.

> > If both list_all and

llist_add() ?

> > set_nr_if_polling() are called right after the llist_del_all(), we
> > will end up in a situation where we have an entry in the list, but
> > there won't be any reschedule sent on the idle CPU until something
> > else awakens it. On a _very_ idle CPU, this could take some time.

Can't happen: POLLING_NR is cleared before the llist_del_all(), and the latter
implies a full memory barrier.
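
To spell the ordering out (paraphrasing the v4.5-era kernel here, not quoting
it verbatim); the waker side, in kernel/sched/core.c:

	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
		if (!set_nr_if_polling(rq->idle))
			smp_send_reschedule(cpu);	/* the IPI fallback */
	}

and the idle side, in kernel/sched/idle.c:

	__current_clr_polling();	/* clear TIF_POLLING_NRFLAG */
	smp_mb__after_atomic();
	sched_ttwu_pending();		/* llist_del_all() is an xchg(),
					 * i.e. a full memory barrier */
	schedule_preempt_disabled();

A waker whose llist_add() lands after the llist_del_all() must therefore
observe POLLING_NRFLAG clear; its set_nr_if_polling() returns false, it falls
back to smp_send_reschedule(), and the queued entry cannot be stranded.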

> > set_nr_and_not_polling() doesn't seem to have the same issue, because
> > it does not return true if TIF_NEED_RESCHED is observed as being
> > already set: it really just depends on the state of the TIF_POLLING_NRFLAG
> > bit.
> >
> > Am I missing something important?
>
> Well, it seems that the test for _TIF_POLLING_NRFLAG in set_nr_if_polling()
> just before the test for _TIF_NEED_RESCHED should take care of it: while in
> sched_ttwu_pending() within the idle loop, the TIF_POLLING_NRFLAG should be
> cleared, thus causing set_nr_if_polling to return false.

Right, clue in the name: Set NEED_RESCHED _IF_ POLLING_NR (is set).

> I'm slightly concerned about the lack of smp_mb__after_atomic()
> between the TIF_NEED_RESCHED flag being cleared within schedule_preempt_disabled()
> and the TIF_POLLING_NRFLAG being set in the following loop. Indeed, clear_bit()
> does not have a compiler barrier,

Urgh, it really should, as should all atomic ops. set_bit() very much has a
memory clobber in it; see below.

> nor processor-level memory barriers
> (of course, the processor memory barrier should not really matter on
> x86-64 due to lock prefix).

Right.

> Moreover, TIF_NEED_RESCHED is bit 3 on x86-64,
> whereas TIF_POLLING_NRFLAG is bit 21. Those are in two different bytes of
> the thread flags, and thus set/cleared at different addresses by clear_bit()
> acting on an immediate "nr" argument.
>
> If we have any state where TIF_POLLING_NRFLAG is set before TIF_NEED_RESCHED
> is cleared within the idle thread, we could end up missing a needed resched IPI.

Yes, that would be bad. No objection to adding smp_mb__before_atomic()
before the initial __current_set_polling(), although that's not going to
make a difference on x86_64, as you already noted.
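
Something along these lines, that is (a sketch against the v4.5-era
cpu_idle_loop() in kernel/sched/idle.c; the exact placement is my assumption):

	while (1) {
		/*
		 * The previous schedule() cleared TIF_NEED_RESCHED; on
		 * x86-64 that bit lives in a different byte of ti->flags
		 * than TIF_POLLING_NRFLAG, so order the clear before
		 * setting POLLING_NRFLAG below.
		 */
		smp_mb__before_atomic();	/* proposed addition */
		__current_set_polling();
		...
	}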

> Another question: why are set_nr_if_polling and set_nr_and_not_polling two
> different implementations ?

Because they're fundamentally two different things. The one
conditionally sets NEED_RESCHED, the other unconditionally sets it.
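
Compare (again paraphrasing the v4.5-era kernel/sched/core.c, not verbatim):

static bool set_nr_and_not_polling(struct task_struct *p)
{
	struct thread_info *ti = task_thread_info(p);

	/*
	 * Unconditionally set NEED_RESCHED; return true iff the CPU was
	 * not polling, i.e. the caller still has to send the IPI.
	 */
	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
}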

> Could they be combined ?

Can, yes; the resulting code will be neither pretty nor clear though.


---
arch/x86/include/asm/bitops.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 7766d1cf096e..5345784d5e41 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -112,11 +112,13 @@ clear_bit(long nr, volatile unsigned long *addr)
 {
 	if (IS_IMMEDIATE(nr)) {
 		asm volatile(LOCK_PREFIX "andb %1,%0"
 			: CONST_MASK_ADDR(nr, addr)
-			: "iq" ((u8)~CONST_MASK(nr)));
+			: "iq" ((u8)~CONST_MASK(nr))
+			: "memory");
 	} else {
 		asm volatile(LOCK_PREFIX "btr %1,%0"
 			: BITOP_ADDR(addr)
-			: "Ir" (nr));
+			: "Ir" (nr)
+			: "memory");
 	}
 }