Re: [patch 09/10] sched/core: Add migrate_disable/enable()

From: peterz
Date: Thu Sep 17 2020 - 10:43:54 EST


On Thu, Sep 17, 2020 at 11:42:11AM +0200, Thomas Gleixner wrote:

> +static inline void update_nr_migratory(struct task_struct *p, long delta)
> +{
> +	if (p->nr_cpus_allowed > 1 && p->sched_class->update_migratory)
> +		p->sched_class->update_migratory(p, delta);
> +}

Right, so as you know, I totally hate this thing :-) It adds a second
(and radically different) version of changing affinity. I'm working on a
version that uses the normal *set_cpus_allowed*() interface.

> +/*
> + * The migrate_disable/enable() fastpath updates only the tasks migrate
> + * disable count which is sufficient as long as the task stays on the CPU.
> + *
> + * When a migrate disabled task is scheduled out it can become subject to
> + * load balancing. To prevent this, update task::cpus_ptr to point to the
> + * current CPUs cpumask and set task::nr_cpus_allowed to 1.
> + *
> + * If task::cpus_ptr does not point to task::cpus_mask then the update has
> + * been done already. This check is also used in migrate_enable() as an
> + * indicator to restore task::cpus_ptr to point to task::cpus_mask
> + */
> +static inline void sched_migration_ctrl(struct task_struct *prev, int cpu)
> +{
> +	if (!prev->migration_ctrl.disable_cnt ||
> +	    prev->cpus_ptr != &prev->cpus_mask)
> +		return;
> +
> +	prev->cpus_ptr = cpumask_of(cpu);
> +	update_nr_migratory(prev, -1);
> +	prev->nr_cpus_allowed = 1;
> +}

So this thing is called from schedule(), with only rq->lock held, and
that violates the locking rules for changing the affinity.

I have a comment that explains how it's broken and why it's sort-of
working.
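
To make the objection concrete, here is a small userspace model of the
quoted sched_migration_ctrl(). It is only a sketch: struct task_model,
the HELD_* flags and the 8-CPU table are made up for illustration, and
the rule encoded in may_change_affinity() (a writer of the affinity
fields holds both pi_lock and rq->lock) is my reading of the usual
convention, not something stated in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock flags for this model; not kernel API. */
enum { HELD_PI_LOCK = 1 << 0, HELD_RQ_LOCK = 1 << 1 };

/* Userspace stand-in for the task_struct fields the patch touches. */
struct task_model {
	unsigned long cpus_mask;	/* requested affinity, one bit per CPU */
	const unsigned long *cpus_ptr;	/* effective affinity */
	int nr_cpus_allowed;
	int disable_cnt;
};

/* Stand-in for cpumask_of(): one constant single-CPU mask per CPU. */
static const unsigned long cpu_bits[8] = {
	1UL << 0, 1UL << 1, 1UL << 2, 1UL << 3,
	1UL << 4, 1UL << 5, 1UL << 6, 1UL << 7,
};

/* Assumed rule: changing the affinity fields needs both locks held. */
static bool may_change_affinity(unsigned int held)
{
	unsigned int both = HELD_PI_LOCK | HELD_RQ_LOCK;

	return (held & both) == both;
}

/* The pointer comparison the patch uses as its "already pinned" flag. */
static bool was_pinned(const struct task_model *p)
{
	return p->cpus_ptr != &p->cpus_mask;
}

/*
 * Mirrors the quoted sched_migration_ctrl(). Returns false when it
 * wrote the affinity fields without satisfying the assumed rule,
 * true when it wrote nothing or wrote with both locks held.
 */
static bool sched_out_model(struct task_model *p, int cpu, unsigned int held)
{
	assert(cpu >= 0 && cpu < 8);

	if (!p->disable_cnt || was_pinned(p))
		return true;

	p->cpus_ptr = &cpu_bits[cpu];
	p->nr_cpus_allowed = 1;
	return may_change_affinity(held);
}
```

Calling sched_out_model() with only HELD_RQ_LOCK, as schedule() would,
pins the task but reports the rule as broken; a second call is a no-op
because the pointer no longer matches.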

> +void migrate_disable(void)
> +{
> +	unsigned long flags;
> +
> +	if (!current->migration_ctrl.disable_cnt) {
> +		raw_spin_lock_irqsave(&current->pi_lock, flags);
> +		current->migration_ctrl.disable_cnt++;
> +		raw_spin_unlock_irqrestore(&current->pi_lock, flags);
> +	} else {
> +		current->migration_ctrl.disable_cnt++;
> +	}
> +}

That pi_lock seems unfortunate, and it isn't obvious what the point of
it is.

> +void migrate_enable(void)
> +{
> +	struct task_migrate_data *pending;
> +	struct task_struct *p = current;
> +	struct rq_flags rf;
> +	struct rq *rq;
> +
> +	if (WARN_ON_ONCE(p->migration_ctrl.disable_cnt <= 0))
> +		return;
> +
> +	if (p->migration_ctrl.disable_cnt > 1) {
> +		p->migration_ctrl.disable_cnt--;
> +		return;
> +	}
> +
> +	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
> +	p->migration_ctrl.disable_cnt = 0;
> +	pending = p->migration_ctrl.pending;
> +	p->migration_ctrl.pending = NULL;
> +
> +	/*
> +	 * If the task was never scheduled out while in the migrate
> +	 * disabled region and there is no migration request pending,
> +	 * return.
> +	 */
> +	if (!pending && p->cpus_ptr == &p->cpus_mask) {
> +		raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
> +		return;
> +	}
> +
> +	rq = __task_rq_lock(p, &rf);
> +	/* Was it scheduled out while in a migrate disabled region? */
> +	if (p->cpus_ptr != &p->cpus_mask) {
> +		/* Restore the tasks CPU mask and update the weight */
> +		p->cpus_ptr = &p->cpus_mask;
> +		p->nr_cpus_allowed = cpumask_weight(&p->cpus_mask);
> +		update_nr_migratory(p, 1);
> +	}
> +
> +	/* If no migration request is pending, no further action required. */
> +	if (!pending) {
> +		task_rq_unlock(rq, p, &rf);
> +		return;
> +	}
> +
> +	/* Migrate self to the requested target */
> +	pending->res = set_cpus_allowed_ptr_locked(p, pending->mask,
> +						   pending->check, rq, &rf);
> +	complete(pending->done);
> +}

So, what I'm missing with all this are the design constraints for this
trainwreck. Because the 'sane' solution was having migrate_disable()
imply cpus_read_lock(). But that didn't fly because we can't have
migrate_disable() / migrate_enable() schedule for raisins.

And if I'm not mistaken, the above migrate_enable() *does* require being
able to schedule, and our favourite piece of futex:

	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
	spin_unlock(q.lock_ptr);

is broken. Consider that spin_unlock() doing migrate_enable() with a
pending sched_setaffinity().
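
Spelled out as a toy userspace model (a sketch only; struct ctx_model
and every helper below are invented for illustration, and it assumes
that the earlier spin_lock(q.lock_ptr) did the matching
migrate_disable() and that only the pending-request path of
migrate_enable() needs to schedule):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical execution-context model; none of this is kernel API. */
struct ctx_model {
	int raw_locks_held;	/* raw spinlocks held by this context */
	bool irqs_disabled;
	int migrate_disable_cnt;
	bool affinity_pending;	/* a sched_setaffinity() is waiting on us */
};

/* Scheduling is only legal with no raw locks held and IRQs enabled. */
static bool can_schedule(const struct ctx_model *c)
{
	return c->raw_locks_held == 0 && !c->irqs_disabled;
}

/*
 * Model of the outermost migrate_enable(): returns true when it has to
 * act on a pending affinity request, which is the path assumed here to
 * require scheduling.
 */
static bool migrate_enable_needs_schedule(struct ctx_model *c)
{
	assert(c->migrate_disable_cnt > 0);

	if (--c->migrate_disable_cnt > 0)
		return false;		/* still nested */
	if (!c->affinity_pending)
		return false;		/* nothing to do */

	c->affinity_pending = false;
	return true;
}

/* Replays the futex sequence; returns true when it stays legal. */
static bool futex_sequence_is_safe(bool affinity_pending)
{
	struct ctx_model c = { 0 };
	bool needs_schedule;

	/* spin_lock(q.lock_ptr): assumed to imply migrate_disable() */
	c.migrate_disable_cnt++;

	/* another task may issue sched_setaffinity() on us meanwhile */
	c.affinity_pending = affinity_pending;

	/* raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock) */
	c.raw_locks_held++;
	c.irqs_disabled = true;

	/* spin_unlock(q.lock_ptr): does migrate_enable() */
	needs_schedule = migrate_enable_needs_schedule(&c);

	return !needs_schedule || can_schedule(&c);
}
```

In this model futex_sequence_is_safe(false) holds, while
futex_sequence_is_safe(true) does not: the request has to be handled
while wait_lock is held with IRQs off.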

Let me ponder this more..