Re: [RFC PATCH 2/4] sched/fair: Introduce cfs_migration

From: Yafang Shao
Date: Sat Nov 06 2021 - 03:42:59 EST


On Sat, Nov 6, 2021 at 1:01 AM Valentin Schneider
<valentin.schneider@xxxxxxx> wrote:
>
> On 04/11/21 14:57, Yafang Shao wrote:
> > A new per-cpu kthread named "cfs_migration/N" is introduced to do
> > CFS-specific balancing work. It is a FIFO task with priority FIFO-1,
> > which means it can preempt any CFS task but can't preempt other FIFO
> > tasks. The kthreads are as follows:
> >
> > PID COMMAND
> > 13 [cfs_migration/0]
> > 20 [cfs_migration/1]
> > 25 [cfs_migration/2]
> > 32 [cfs_migration/3]
> > 38 [cfs_migration/4]
> > ...
> >
> > $ cat /proc/13/sched
> > ...
> > policy : 1
> > prio : 98
> > ...
> >
> > $ cat /proc/13/status
> > ...
> > Cpus_allowed: 0001
> > Cpus_allowed_list: 0
> > ...
> >
> > All the work items to be done are queued on a singly linked list and
> > serviced in FIFO order: the first item queued is the first one serviced.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > Cc: Valentin Schneider <valentin.schneider@xxxxxxx>
> > Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> > ---
> > kernel/sched/fair.c | 93 +++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 93 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 87db481e8a56..56b3fa91828b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -20,6 +20,8 @@
> >  * Adaptive scheduling granularity, math enhancements by Peter Zijlstra
> >  * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
> >  */
> > +#include <linux/smpboot.h>
> > +#include <linux/stop_machine.h>
> > #include "sched.h"
> >
> > /*
> > @@ -11915,3 +11917,94 @@ int sched_trace_rq_nr_running(struct rq *rq)
> >         return rq ? rq->nr_running : -1;
> > }
> > EXPORT_SYMBOL_GPL(sched_trace_rq_nr_running);
> > +
> > +#ifdef CONFIG_SMP
> > +struct cfs_migrater {
> > +        struct task_struct *thread;
> > +        struct list_head works;
> > +        raw_spinlock_t lock;
>
> Hm so the handler of that work queue will now be a SCHED_FIFO task (which
> can block) rather than a CPU stopper (which can't), but AFAICT the
> callsites that would enqueue an item can't block, so having this as a
> raw_spinlock_t should still make sense.
>
> > +};
> > +
> > +DEFINE_PER_CPU(struct cfs_migrater, cfs_migrater);
> > +
> > +static int cfs_migration_should_run(unsigned int cpu)
> > +{
> > +        struct cfs_migrater *migrater = &per_cpu(cfs_migrater, cpu);
> > +        unsigned long flags;
> > +        int run;
> > +
> > +        raw_spin_lock_irqsave(&migrater->lock, flags);
> > +        run = !list_empty(&migrater->works);
> > +        raw_spin_unlock_irqrestore(&migrater->lock, flags);
> > +
> > +        return run;
> > +}
> > +
> > +static void cfs_migration_setup(unsigned int cpu)
> > +{
> > +        /* cfs_migration should have a higher priority than normal tasks,
> > +         * but a lower priority than other FIFO tasks.
> > +         */
> > +        sched_set_fifo_low(current);
> > +}
> > +
> > +static void cfs_migrater_thread(unsigned int cpu)
> > +{
> > +        struct cfs_migrater *migrater = &per_cpu(cfs_migrater, cpu);
> > +        struct cpu_stop_work *work;
> > +
> > +repeat:
> > +        work = NULL;
> > +        raw_spin_lock_irq(&migrater->lock);
> > +        if (!list_empty(&migrater->works)) {
> > +                work = list_first_entry(&migrater->works,
> > +                                        struct cpu_stop_work, list);
> > +                list_del_init(&work->list);
> > +        }
> > +        raw_spin_unlock_irq(&migrater->lock);
> > +
> > +        if (work) {
> > +                struct cpu_stop_done *done = work->done;
> > +                cpu_stop_fn_t fn = work->fn;
> > +                void *arg = work->arg;
> > +                int ret;
> > +
> > +                preempt_count_inc();
> > +                ret = fn(arg);
> > +                if (done) {
> > +                        if (ret)
> > +                                done->ret = ret;
> > +                        cpu_stop_signal_done(done);
> > +                }
> > +                preempt_count_dec();
> > +                goto repeat;
> > +        }
> > +}
>
> You're pretty much copying the CPU stopper setup, but that seems overkill
> for the functionality we're after: migrate a CFS task from one CPU to
> another. This shouldn't need to be able to run any arbitrary callback
> function.
>
> Unfortunately you are tackling both CFS active balancing and NUMA balancing
> at the same time, and right now they're plumbed a bit differently which
> probably drove you to use something a bit more polymorphic. Ideally we
> should be making them use the same paths, but IMO it would be acceptable as
> a first step to just cater to CFS active balancing - folks that really care
> about their RT tasks can disable CONFIG_NUMA_BALANCING, but there is
> nothing to disable CFS active balancing.
>

Right. The code will be much simpler if we only care about CFS
active balancing in this patchset.
We have already disabled NUMA balancing through
/proc/sys/kernel/numa_balancing, so it is not a critical issue for us now.
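
For reference, we simply turn it off at runtime with:

$ echo 0 > /proc/sys/kernel/numa_balancing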

>
> Now, I'm thinking the bare information we need is:
>
> - a task to migrate
> - a CPU to move it to
>
> And then you can do something like...
>
> trigger_migration(task_struct *p, unsigned int dst_cpu)
> {
>         work = { p, dst_cpu };
>         get_task_struct(p);
>         /* queue work + wake migrater + wait for completion */
> }
>
> cfs_migrater_thread()
> {
>         /* ... */
>         p = work->p;
>
>         if (task_rq(p) != this_rq())
>                 goto out;
>
>         /* migrate task to work->dst_cpu */
> out:
>         complete(<some completion struct>);
>         put_task_struct(p);
> }
>

Agreed.
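
Something like the below, then? Just a rough sketch to make sure I
understand your suggestion correctly - all of the names are made up, and
the actual detach/attach of the task (and the rq locking around it) is
left as a comment:

struct cfs_migration_work {
        struct list_head list;
        struct task_struct *p;          /* task to migrate, reference held */
        unsigned int dst_cpu;           /* where to move it */
        struct completion done;
};

/*
 * Queue a request to move @p to @dst_cpu and wait for the migrater to
 * handle it. The work item lives on the caller's stack, so the caller
 * must be able to sleep on the completion.
 */
static void trigger_cfs_migration(struct task_struct *p, unsigned int dst_cpu)
{
        struct cfs_migrater *migrater = &per_cpu(cfs_migrater, task_cpu(p));
        struct cfs_migration_work work = {
                .p = p,
                .dst_cpu = dst_cpu,
        };
        unsigned long flags;

        init_completion(&work.done);
        get_task_struct(p);

        raw_spin_lock_irqsave(&migrater->lock, flags);
        list_add_tail(&work.list, &migrater->works);
        raw_spin_unlock_irqrestore(&migrater->lock, flags);

        wake_up_process(migrater->thread);
        wait_for_completion(&work.done);
}

static void cfs_migrater_thread(unsigned int cpu)
{
        struct cfs_migrater *migrater = &per_cpu(cfs_migrater, cpu);
        struct cfs_migration_work *work = NULL;

        raw_spin_lock_irq(&migrater->lock);
        if (!list_empty(&migrater->works)) {
                work = list_first_entry(&migrater->works,
                                        struct cfs_migration_work, list);
                list_del_init(&work->list);
        }
        raw_spin_unlock_irq(&migrater->lock);

        if (!work)
                return;

        /* The task may have moved (or exited) since the request was queued. */
        if (task_rq(work->p) == this_rq()) {
                /*
                 * Detach work->p from this rq and attach it to
                 * cpu_rq(work->dst_cpu), along the lines of what
                 * active_load_balance_cpu_stop() does today, rechecking
                 * everything under the rq locks.
                 */
        }

        complete(&work->done);
        put_task_struct(work->p);
}

For the callers that must not block (e.g. the active balance trigger in
load_balance(), which currently uses stop_one_cpu_nowait()), the work
item would have to be preallocated per rq and the wait_for_completion()
skipped.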

>
> We should also probably add something to prevent the migration from
> happening after it is no longer relevant. Say if we have something like:
>
> <queue work to migrate p from CPU0 to CPU1>
> <FIFO-2 task runs for 42 seconds on CPU0>
> <cfs_migration/0 now gets to run>
>
> p could have moved elsewhere while cfs_migration/0 was waiting to run. I'm
> thinking the source CPU could be useful information, but that doesn't tell
> you if the task moved around in the meantime...
>
> WDYT?

Agreed.
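
One cheap guard on top of the sketch above could be to also record the
source CPU in the work item and have the migrater drop requests for
tasks that are no longer there, e.g. (hypothetical src_cpu field, with
an "out:" label in front of the complete()/put_task_struct() pair):

        /* in trigger_cfs_migration() */
        work.src_cpu = task_cpu(p);

        /* in cfs_migrater_thread(), before detaching the task */
        if (task_cpu(work->p) != work->src_cpu)
                goto out;       /* p has moved on, drop the request */

But as you say that still can't tell whether the task moved away and
came back in the meantime.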
It seems we'd better take the patch[1] I sent several weeks back.

[1]. https://lore.kernel.org/lkml/20210615121551.31138-1-laoar.shao@xxxxxxxxx/

--
Thanks
Yafang