Re: [PATCH 4/6] sched/isolation: Residual 1Hz scheduler tick offload
From: Peter Zijlstra
Date: Mon Jan 29 2018 - 10:39:01 EST
On Fri, Jan 19, 2018 at 01:02:18AM +0100, Frederic Weisbecker wrote:
> When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> keep the scheduler stats alive. However this residual tick is a burden
> for bare metal tasks that can't stand any interruption at all, or want
> to minimize them.
>
> The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
> outsource these scheduler ticks to the global workqueue so that a
> housekeeping CPU handles those remotely.
>
> Note that in the case of using isolcpus, it's still up to the user to
> affine the global workqueues to the housekeeping CPUs through
> /sys/devices/virtual/workqueue/cpumask or domains isolation
> "isolcpus=nohz,domain".
I would very much like a few words on why sched_class::task_tick() is
safe to call remotely -- from a quick look I think it actually is, but it
would be good to have some words here.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d72d0e9..c79500c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3062,7 +3062,82 @@ u64 scheduler_tick_max_deferment(void)
>
> return jiffies_to_nsecs(next - now);
> }
> -#endif
> +
> +struct tick_work {
> + int cpu;
> + struct delayed_work work;
> +};
> +
> +static struct tick_work __percpu *tick_work_cpu;
> +
> +static void sched_tick_remote(struct work_struct *work)
> +{
> + struct delayed_work *dwork = to_delayed_work(work);
> + struct tick_work *twork = container_of(dwork, struct tick_work, work);
> + int cpu = twork->cpu;
> + struct rq *rq = cpu_rq(cpu);
> + struct rq_flags rf;
> +
> + /*
> + * Handle the tick only if it appears the remote CPU is running
> + * in full dynticks mode. The check is racy by nature, but
> + * missing a tick or having one too much is no big deal.
> + */
> + if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
> + rq_lock_irq(rq, &rf);
> + update_rq_clock(rq);
> + rq->curr->sched_class->task_tick(rq, rq->curr, 0);
> + rq_unlock_irq(rq, &rf);
> + }
> +
> + queue_delayed_work(system_unbound_wq, dwork, HZ);
Do we want something that tracks the actual inter-arrival time of
this work, such that we can detect and warn if the book-keeping thing is
failing to keep up?
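To make the idea concrete, here is a rough user-space sketch of that tracking. Nothing below is in the patch: struct tick_lag, tick_lag_update() and the limit_ns threshold are made-up names for illustration, and in the kernel the timestamp would presumably come from something like ktime_get_ns() inside sched_tick_remote(), with the result feeding a WARN_ON_ONCE() or similar.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping, one instance per remotely ticked CPU. */
struct tick_lag {
	bool seen;		/* has at least one remote tick run yet? */
	uint64_t last_ns;	/* timestamp of the previous remote tick */
	uint64_t max_gap_ns;	/* worst inter-arrival time observed so far */
};

/*
 * Record one run of the remote tick work at time now_ns.
 *
 * Returns true when the gap since the previous run exceeds limit_ns,
 * i.e. when the housekeeping side is falling behind and a warning
 * would be due. The first call only records the timestamp.
 */
static bool tick_lag_update(struct tick_lag *tl, uint64_t now_ns,
			    uint64_t limit_ns)
{
	bool late = false;

	assert(tl);

	if (tl->seen) {
		uint64_t gap = now_ns - tl->last_ns;

		if (gap > tl->max_gap_ns)
			tl->max_gap_ns = gap;
		late = gap > limit_ns;
	}

	tl->seen = true;
	tl->last_ns = now_ns;
	return late;
}
```

With the work requeued at a HZ delay, the expected gap is about one second, so the limit would be some multiple of that; the exact value is a policy choice this sketch does not try to settle.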
> +}
> +
> +static void sched_tick_start(int cpu)
> +{
> + struct tick_work *twork;
> +
> + if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> + return;
This all looks very static :-(, you can't reconfigure this nohz_full
crud after boot?
> + WARN_ON_ONCE(!tick_work_cpu);
> +
> + twork = per_cpu_ptr(tick_work_cpu, cpu);
> + twork->cpu = cpu;
> + INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> + queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +}
Similarly, I think we want a few words about how unbound workqueues are
expected to behave vs NUMA.
AFAICT unbound workqueues by default prefer to run on a cpu in the same
node, but if no cpu is available, it doesn't go looking for the nearest
node that does have a cpu, it just punts to whatever random cpu.
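
For reference, the fallback I would have expected looks roughly like the sketch below. This is plain user-space C and not what the workqueue code does today; nearest_node_with_cpu(), MAX_NODES, the dist[][] table and the has_cpu[] array are all invented for illustration. In the kernel the distances would presumably come from node_distance() and the availability from the housekeeping cpumask.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8	/* arbitrary bound for this sketch */

/*
 * Pick the node closest to 'home' that has at least one usable CPU.
 *
 * dist[a][b] is a SLIT-style distance (smaller means nearer) and
 * has_cpu[n] says whether node n has a CPU the work may run on.
 * Returns the chosen node, or -1 if no node has a usable CPU.
 * Ties are resolved in favour of the lowest node number.
 */
static int nearest_node_with_cpu(int home, int nr_nodes,
				 const int dist[][MAX_NODES],
				 const bool has_cpu[])
{
	int n, best = -1;

	assert(nr_nodes > 0 && nr_nodes <= MAX_NODES);
	assert(home >= 0 && home < nr_nodes);

	for (n = 0; n < nr_nodes; n++) {
		if (!has_cpu[n])
			continue;
		if (best < 0 || dist[home][n] < dist[home][best])
			best = n;
	}

	return best;
}
```

If the home node itself has a usable CPU it wins, assuming the local distance is the smallest entry in its row, which matches the "prefer the same node" behaviour described above.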