Re: [PATCH 4/6] sched/isolation: Residual 1Hz scheduler tick offload

From: Frederic Weisbecker
Date: Mon Jan 29 2018 - 11:48:43 EST


On Mon, Jan 29, 2018 at 04:38:39PM +0100, Peter Zijlstra wrote:
1;4205;0c> On Fri, Jan 19, 2018 at 01:02:18AM +0100, Frederic Weisbecker wrote:
> > When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> > keep the scheduler stats alive. However this residual tick is a burden
> > for bare metal tasks that can't stand any interruption at all, or want
> > to minimize them.
> >
> > The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
> > outsource these scheduler ticks to the global workqueue so that a
> > housekeeping CPU handles those remotely.
> >
> > Note that in the case of using isolcpus, it's still up to the user to
> > affine the global workqueues to the housekeeping CPUs through
> > /sys/devices/virtual/workqueue/cpumask or domains isolation
> > "isolcpus=nohz,domain".
>
> I would very much like a few words on why sched_class::task_tick() is
> safe to call remote -- from a quick look I think it actually is, but it
> would be good to have some words here.

Let's rather say I can't prove that it is safe, given the amount of code that
is behind throughout the various flavour of scheduler features.

But as far as I checked several times, it seems that nothing is accessed locally
on ::scheduler_tick(). Everything looks fetched from the runqueue struct target
while it is locked.

If we ever find local references such as "current" or "__this_cpu_*" in the path,
we'll have to fix them.

>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d72d0e9..c79500c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3062,7 +3062,82 @@ u64 scheduler_tick_max_deferment(void)
> >
> > return jiffies_to_nsecs(next - now);
> > }
> > -#endif
> > +
> > +struct tick_work {
> > + int cpu;
> > + struct delayed_work work;
> > +};
> > +
> > +static struct tick_work __percpu *tick_work_cpu;
> > +
> > +static void sched_tick_remote(struct work_struct *work)
> > +{
> > + struct delayed_work *dwork = to_delayed_work(work);
> > + struct tick_work *twork = container_of(dwork, struct tick_work, work);
> > + int cpu = twork->cpu;
> > + struct rq *rq = cpu_rq(cpu);
> > + struct rq_flags rf;
> > +
> > + /*
> > + * Handle the tick only if it appears the remote CPU is running
> > + * in full dynticks mode. The check is racy by nature, but
> > + * missing a tick or having one too much is no big deal.
> > + */
> > + if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> > + rq_lock_irq(rq, &rf);
> > + update_rq_clock(rq);
> > + rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> > + rq_unlock_irq(rq, &rf);
> > + }
> > +
> > + queue_delayed_work(system_unbound_wq, dwork, HZ);
>
> Do we want something that tracks the actual interrer arrival time of
> this work, such that we can detect and warn if the book-keeping thing is
> failing to keep up?

Yeah perhaps we can have some sort of check to make sure we got a tick after
some reasonable delay since the last sched in of the current remote task.

>
> > +}
> > +
> > +static void sched_tick_start(int cpu)
> > +{
> > + struct tick_work *twork;
> > +
> > + if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> > + return;
>
> This all looks very static :-(, you can't reconfigure this nohz_full
> crud after boot?

Unfortunately yes. In fact making the nohz interface dynamically available
through cpuset is the next big step.

>
> > + WARN_ON_ONCE(!tick_work_cpu);
> > +
> > + twork = per_cpu_ptr(tick_work_cpu, cpu);
> > + twork->cpu = cpu;
> > + INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> > + queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> > +}
>
> Similarly, I think we want a few words about how unbound workqueues are
> expected to behave vs NUMA.
>
> AFAICT unbound workqueues by default prefer to run on a cpu in the same
> node, but if no cpu is available, it doesn't go looking for the nearest
> node that does have a cpu, it just punts to whatever random cpu.

Yes, and in fact you just made me look into wq_select_unbound_cpu() and
it looks worse than that. If the current CPU is not in the wq_unbound_cpumask,
a random one is picked up from that global cpumask without trying a near
one in the current node.

Looks like room for improvement on the workqueue side. I'll see what I can do.

Thanks.