Re: [PATCH RFC] sched: Make wake_up_nohz_cpu() handle CPUs going offline

From: Paul E. McKenney
Date: Mon Jul 04 2016 - 12:55:38 EST


On Mon, Jul 04, 2016 at 02:21:43PM +0200, Frederic Weisbecker wrote:
> On Fri, Jul 01, 2016 at 05:15:06PM -0700, Paul E. McKenney wrote:
> > On Sat, Jul 02, 2016 at 01:49:56AM +0200, Frederic Weisbecker wrote:
> > > On Fri, Jul 01, 2016 at 11:40:54AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Jul 01, 2016 at 01:29:59AM +0200, Frederic Weisbecker wrote:
> > > > > > +/*
> > > > > > + * Wake up the specified CPU. If the CPU is going offline, it is the
> > > > > > + * caller's responsibility to deal with the lost wakeup, for example,
> > > > > > + * by hooking into the CPU_DEAD notifier like timers and hrtimers do.
> > > > > > + */
> > > > > > void wake_up_nohz_cpu(int cpu)
> > > > > > {
> > > > > > - if (!wake_up_full_nohz_cpu(cpu))
> > > > > > + if (cpu_online(cpu) && !wake_up_full_nohz_cpu(cpu))
> > > > >
> > > > > So at this point, as we passed CPU_DYING, I believe the CPU isn't visible in the domains
> > > > > anymore (correct me if I'm wrong), therefore get_nohz_timer_target() can't return it,
> > > > > unless smp_processor_id() is the only alternative.
> > > >
> > > > Right, but the timers have been posted long before even CPU_UP_PREPARE.
> > > > From what I can see, they are left alone until CPU_DEAD. Which means
> > > > that if you try to mod_timer() them between CPU_DYING and CPU_DEAD,
> > > > you can get the above splat.
> > > >
> > > > Or am I missing somthing subtle here?
> > >
> > > Yes that's exactly what I meant. It happens on mod_timer() calls
> > > between CPU_DYING and CPU_DEAD. I just wanted to clarify the
> > > conditions for it to happen: the fact that it shouldn't concern
> > > remote CPU targets, only local pinned timers.
> >
> > OK. What happens in the following sequence of events?
> >
> > o CPU 5 posts a timer, which might well be locally pinned.
> > This is rcu_torture_reader() posting its on-stack timer
> > creatively named "t".
> >
> > o CPU 5 starts going offline, so that rcu_torture_reader() gets
> > migrated to CPU 6.
> >
> > o CPU 5 reaches CPU_DYING but has not yet reached CPU_DEAD.
> >
> > o CPU 6 invokes mod_timer() on its timer "t".
> >
> > Wouldn't that trigger the scenario that I am seeing?
>
> No I don't think so because "t" is then going to be enqueued to CPU 6.
> __mod_timer() -> get_target_base() uses local accessor on timer base.
>
> So the target of pinned timers is always the CPU of the caller.
>
> > > > > BTW, it seems that rcutorture stops its kthreads after CPU_DYING, is it expected that
> > > > > it queues timers at this stage?
> > > >
> > > > Hmmm... From what I can see, rcutorture cleans up its priority-boost
> > > > kthreads at CPU_DOWN_PREPARE time. The other threads are allowed to
> > > > migrate wherever the scheduler wants, give or take the task shuffling.
> > > > The task shuffling only excludes one CPU at a time, and I have seen
> > > > this occur when multiple CPUs were running, e.g., 0, 2, and 3 while
> > > > offlining 1.
> > >
> > > But if rcutorture kthreads are cleaned up at CPU_DOWN_PREPARE, they
> > > shouldn't be calling mod_timer() on CPU_DYING time. Or there are other
> > > rcutorture threads?
> >
> > The rcu_torture_reader() kthreads aren't associated with any particular
> > CPU, so when CPUs go offline, they just get migrated to other CPUs.
> > This allows them to execute on those other CPUs between CPU_DYING and
> > CPU_DEAD time, correct?
>
> Indeed. Timers get migrated on CPU_DEAD only. So if rcu_torture_reader()
> enqueued a pinned timer to CPU 5, then get migrated to CPU 6, the timer
> may well fire after CPU_DYING on CPU 5 and even re-enqueue itself in CPU 5,
> provided the timer is self-enqueued.

It is not self-enqueued, but rcu_torture_reader() will re-enqueue it
any time after it fires. This means that it might re-enqueue it while
it is still executing. If I understand correctly, at that point, the
timer would still be referencing the outgoing CPU. But this is an odd
scenario, so I might well have missed something.

At this point, although I am sure we have a problem, I am not sure we
understand (to say nothing of agree on) exactly what the problem is. ;-)

The timer is not self-enqueued. It is posted like this, from
rcu_torture_reader() in kernel/rcu/rcutorture.c:

mod_timer(&t, jiffies + 1);

That same function sets up the timer upon entry like this:

setup_timer_on_stack(&t, rcu_torture_timer, 0);

The splat is quite rare, which is why I suspect that it is reasonable
to believe that it involves a race with the timer firing.

> > Other rcutorture kthreads -are- bound to specific CPUs, but they are
> > testing priority boosting, not simple reading.
>
> I see.
>
> > > > Besides which, doesn't the scheduler prevent anything but the idle
> > > > thread from running after CPU_DYING time?
> > >
> > > Indeed migrate_tasks() is called on CPU_DYING but pinned kthreads, outside
> > > smpboot, have their own way to deal with hotplug through notifiers.
> >
> > Agreed, but the rcu_torture_reader() kthreads aren't pinned, so they
> > should migrate automatically at CPU_DYING time.
>
> Yes indeed.
>
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 7f2cae4620c7..1a91fc733a0f 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -580,6 +580,8 @@ static bool wake_up_full_nohz_cpu(int cpu)
> > * If needed we can still optimize that later with an
> > * empty IRQ.
> > */
> > + if (cpu_is_offline(cpu))
> > + return true;
>
> Preferably put this under the tick_nohz_full_cpu() below because
> it has a static key optimizations. Distros build NO_HZ_FULL but
> don't use it 99.99999% of the time.

Except that I need to need to return "true" if the CPU is offline even
if it is not a tick_nohz_full_cpu(), don't I?

In fact, the stack trace shows native_smp_send_reschedule(), which leads
me to believe that the splat happened via the "return false" at the bottom
of wake_up_full_nohz_cpu(). So I don't see a way to completely hide
the check under tick_nohz_full_cpu().

Or am I missing something here?

> > if (tick_nohz_full_cpu(cpu)) {
> > if (cpu != smp_processor_id() ||
> > tick_nohz_tick_stopped())
> > @@ -590,6 +592,11 @@ static bool wake_up_full_nohz_cpu(int cpu)
> > return false;
> > }
> >
> > +/*
> > + * Wake up the specified CPU. If the CPU is going offline, it is the
> > + * caller's responsibility to deal with the lost wakeup, for example,
> > + * by hooking into the CPU_DEAD notifier like timers and hrtimers do.
> > + */
>
> I think it's more transparent than that for the caller. migrate_timers() is
> called soon after and it takes care of waking up the destination of the migration
> if necessary. So the caller shouldn't care after all.

For the current two callers, agreed. But is this function designed to
be only used by timers and hrtimers? If so, shouldn't there be some
kind of comment to that effect? If not, shouldn't people considering
use of this function be warned that it is not unconditional?

> But the cpu_is_offline() check above may need a comment about that.

Like this?

if (cpu_is_offline(cpu))
return true; /* Don't try to wake offline CPUs. */

Thanx, Paul