Re: [PATCH] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint

From: Paul E. McKenney
Date: Fri May 31 2019 - 09:46:50 EST


On Fri, May 31, 2019 at 03:36:40AM +0200, Frederic Weisbecker wrote:
> On Thu, May 30, 2019 at 05:58:09AM -0700, Paul E. McKenney wrote:
> > It turns out that tick_broadcast_offline() was an innocent bystander.
> > After all, interrupts are supposed to be disabled throughout
> > take_cpu_down(), and therefore should have been disabled upon entry to
> > tick_offline_cpu() and thus to tick_broadcast_offline(). This suggests
> > that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
> > and leaving them enabled on return.
> >
> > Some debugging code showed that the culprit was sched_cpu_dying().
> > It had irqs enabled after return from sched_tick_stop(). Which in turn
> > had irqs enabled after return from cancel_delayed_work_sync(). Which is a
> > wrapper around __cancel_work_timer(). Which can sleep in the case where
> > something else is concurrently trying to cancel the same delayed work,
> > and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
> > idea when you are invoked from take_cpu_down(), regardless of the state
> > you leave interrupts in upon return.
>
> Nice catch! Sorry for leaving that puzzle behind.

Heh! I needed the break from trying to make RCU do unnatural acts. ;-)

> > Code inspection located no reason why the delayed work absolutely
> > needed to be canceled from sched_tick_stop(): The work is not
> > bound to the outgoing CPU by design, given that the whole point is
> > to collect statistics without disturbing the outgoing CPU.
> >
> > This commit therefore simply drops the cancel_delayed_work_sync() from
> > sched_tick_stop(). Instead, a new ->state field is added to the tick_work
> > structure so that the delayed-work handler function sched_tick_remote()
> > can avoid reposting itself. A cpu_is_offline() check is also added to
> > sched_tick_remote() to avoid mucking with the state of an offlined CPU
> > (though it does appear safe to do so).
>
> I can't guarantee that it is safe myself to call the tick of an offline
> CPU. Better have that check indeed.

No argument, especially given that this preserved the current behavior
exactly.

But it looked to me that all of the code was running on some other online
CPU and was accessing data that would be left around in good state on
the offline CPU. And the current behavior won't account for activity
just prior to the CPU going offline until it comes back online (right?),
so there might be some incentive to let the last call to sched_tick_remote()
do the accounting. Might not be a big deal, though.

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 102dfcf0a29a..9a10ee9afcbf 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3050,14 +3050,44 @@ void scheduler_tick(void)
> >
> > struct tick_work {
> > int cpu;
> > + int state;
> > struct delayed_work work;
> > };
> > +// Values for ->state, see diagram below.
> > +#define TICK_SCHED_REMOTE_IDLE 0
>
> So it took me some time to understand that the IDLE state is the
> tick work that ackowledged OFFLINING and finally completes the
> offline process. Therefore perhaps we can rename it to
> TICK_SCHED_REMOTE_OFFLINE so that we instantly get the state
> machine scenario.

Good point, your name is much better. Updated.

> > +#define TICK_SCHED_REMOTE_RUNNING 1
> > +#define TICK_SCHED_REMOTE_OFFLINING 2
> > +
> > +// State diagram for ->state:
> > +//
> > +//
> > +// +----->IDLE-----------------------------+
> > +// | |
> > +// | |
> > +// | | sched_tick_start()
> > +// | sched_tick_remote() |
> > +// | |
> > +// | V
> > +// | +---------->RUNNING
> > +// | | |
> > +// | | |
> > +// | | |
> > +// | sched_tick_start() | | sched_tick_stop()
> > +// | | |
> > +// | | |
> > +// | | |
> > +// +--------------------OFFLINING<---------+
> > +//
> > +//
> > +// Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
> > +// and sched_tick_start() are happy to leave the state in RUNNING.
> >
> > static struct tick_work __percpu *tick_work_cpu;
> >
> > static void sched_tick_remote(struct work_struct *work)
> > {
> > struct delayed_work *dwork = to_delayed_work(work);
> > + int os;
> > struct tick_work *twork = container_of(dwork, struct tick_work, work);
> > int cpu = twork->cpu;
> > struct rq *rq = cpu_rq(cpu);
> > @@ -3077,7 +3107,7 @@ static void sched_tick_remote(struct work_struct *work)
> >
> > rq_lock_irq(rq, &rf);
> > curr = rq->curr;
> > - if (is_idle_task(curr))
> > + if (is_idle_task(curr) || cpu_is_offline(cpu))
>
> Or we could simply check rq->online, while we have rq locked.

Which would have better cache locality as well.

Let's see... This is cleared from the same sched_cpu_dying() that invokes
sched_tick_stop(), so that works. Where is it set? In set_rq_online(),
of course, which is invoked from sched_cpu_activate() which is assigned
to .startup.single, which sadly not the CPU-online counterpart to
sched_cpu_dying(). This would result in the workqueue being started
quite some time before its handler would allow itself to sample the state,
and would also thus be a deviation from earlier behavior.

In the immortal words of MS-DOS, "Are you sure?"

> > goto out_unlock;
> >
> > update_rq_clock(rq);
> > @@ -3097,13 +3127,22 @@ static void sched_tick_remote(struct work_struct *work)
> > /*
> > * Run the remote tick once per second (1Hz). This arbitrary
> > * frequency is large enough to avoid overload but short enough
> > - * to keep scheduler internal stats reasonably up to date.
> > + * to keep scheduler internal stats reasonably up to date. But
> > + * first update state to reflect hotplug activity if required.
> > */
> > - queue_delayed_work(system_unbound_wq, dwork, HZ);
> > + do {
> > + os = READ_ONCE(twork->state);
> > + WARN_ON_ONCE(os == TICK_SCHED_REMOTE_IDLE);
> > + if (os == TICK_SCHED_REMOTE_RUNNING)
> > + break;
> > + } while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_IDLE) != os);
> > + if (os == TICK_SCHED_REMOTE_RUNNING)
> > + queue_delayed_work(system_unbound_wq, dwork, HZ);
> > }
> >
> > static void sched_tick_start(int cpu)
> > {
> > + int os;
> > struct tick_work *twork;
> >
> > if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > @@ -3112,14 +3151,23 @@ static void sched_tick_start(int cpu)
> > WARN_ON_ONCE(!tick_work_cpu);
> >
> > twork = per_cpu_ptr(tick_work_cpu, cpu);
> > - twork->cpu = cpu;
> > - INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > - queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > + do {
> > + os = READ_ONCE(twork->state);
> > + if (os == TICK_SCHED_REMOTE_RUNNING)
>
> Is it possible to have RUNNING at this stage? sched_tick_start()
> shouldn't be called twice without a sched_tick_stop() in the middle.
>
> In which case we should even warn if we meet that value here.

Good point, it now looks like this:

if (os == WARN_ON_ONCE(TICK_SCHED_REMOTE_RUNNING))

Which, after an embarrassingly long time, turned into this:

if (WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING))

> > + break;
> > + // Either idle or offline for a short period
> > + } while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_RUNNING) != os);
> > + if (os == TICK_SCHED_REMOTE_IDLE) {
> > + twork->cpu = cpu;
> > + INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > + queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > + }
> > }
> >
> > #ifdef CONFIG_HOTPLUG_CPU
> > static void sched_tick_stop(int cpu)
> > {
> > + int os;
> > struct tick_work *twork;
> >
> > if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > @@ -3128,7 +3176,13 @@ static void sched_tick_stop(int cpu)
> > WARN_ON_ONCE(!tick_work_cpu);
> >
> > twork = per_cpu_ptr(tick_work_cpu, cpu);
> > - cancel_delayed_work_sync(&twork->work);
> > + // There cannot be competing actions, but don't rely on stop_machine.
> > + do {
> > + os = READ_ONCE(twork->state);
> > + WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING);
> > + // Either idle or offline for a short period
> > + } while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_OFFLINING) != os);
> > + // Don't cancel, as this would mess up the state machine.
> > }
> > #endif /* CONFIG_HOTPLUG_CPU */
>
> I ran the state machine for a few hours inside my head through FWEISBEC_TORTURE
> and I see no error message. Which is of course not a guarantee of anything but:

I know that feeling!

> Reviewed-by: Frederic Weisbecker <frederic@xxxxxxxxxx>

Thank you, applied!

> Also, I believe that /* */ is the preferred comment style, FWIW ;-)

True enough, but "//" is permitted and allows me three more characters of
comment before wanting to do a line break. ;-)

But it is your code, so if you want "/* */", just let me know and I will
switch to that style.

Thanx, Paul