Re: [PATCH v3 11/20] sched/core: Push current task from non preferred CPU

From: K Prateek Nayak

Date: Thu Jun 04 2026 - 03:03:18 EST

Hello Shrikanth,

On 5/14/2026 8:51 PM, Shrikanth Hegde wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 86fa4bfaead0..508773e71929 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5678,6 +5678,9 @@ void sched_tick(void)
> unsigned long hw_pressure;
> u64 resched_latency;
>
> + if (!cpu_preferred(cpu))
> + sched_push_current_non_preferred_cpu(rq);
> +

Is there a reason why we don't do a bulk move of all the queued tasks at
once?

Is it simply for latency reasons, or could it lead to too many task
movements if the CPU frequently transitions in and out of the preferred
mask?

> if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> arch_scale_freq_tick();
>
> @@ -11263,3 +11266,87 @@ void sched_change_end(struct sched_change_ctx *ctx)
> p->sched_class->prio_changed(rq, p, ctx->prio);
> }
> }
> +
> +#ifdef CONFIG_PREFERRED_CPU
> +/* npc - non preferred CPU */
> +static DEFINE_PER_CPU(struct cpu_stop_work, npc_push_task_work);
> +
> +static int sched_non_preferred_cpu_push_stop(void *arg)
> +{
> + struct task_struct *p = arg;
> + struct rq *rq = this_rq();
> + struct rq_flags rf;
> + int cpu;

I think we should do the cpu_preferred() sanity check here instead of
repeating it back to back when sched_push_current_non_preferred_cpu()
is called from the tick handler.

> +
> + raw_spin_lock_irq(&p->pi_lock);
> + rq_lock(rq, &rf);
> + rq->push_task_work_done = 0;
> +
> + update_rq_clock(rq);
> +
> + if (task_rq(p) == rq && task_on_rq_queued(p)) {
> + cpu = select_fallback_rq(rq->cpu, p);
> + rq = __migrate_task(rq, &rf, p, cpu);
> + }
> +
> + rq_unlock(rq, &rf);
> + raw_spin_unlock_irq(&p->pi_lock);
> + put_task_struct(p);
> +
> + return 0;
> +}
> +
> +/*
> + * Push the current task running on non-preferred CPU.
> + * Using this non preferred CPU will lead to more vCPU preemptions
> + * in the host. So it is better not to use this CPU.
> + *
> + * Since task is running, call a stopper to push the task out. This is
> + * similar to how task moves during hotplug. In select_fallback_rq a
> + * preferred CPU will be chosen and henceforth task shouldn't come back to
> + * this CPU again.
> + *
> + * Works for FAIR/RT class only
> + *
> + * If task is affined only non-preferred CPUs, it can't be moved out
> + */
> +void sched_push_current_non_preferred_cpu(struct rq *rq)
> +{
> + struct task_struct *push_task = rq->curr;
> + unsigned long flags;
> + struct rq_flags rf;
> +
> + /* sanity check */
> + if (cpu_preferred(rq->cpu))
> + return;
> +
> + /* Push only if it is FAIR/RT class */
> + if (push_task->sched_class != &fair_sched_class &&
> + push_task->sched_class != &rt_sched_class)
> + return;
> +
> + if (kthread_is_per_cpu(push_task) ||
> + is_migration_disabled(push_task))
> + return;
> +
> + /* Is there any preferred CPU in the affinity list */
> + if (!task_has_preferred_cpus(push_task))
> + return;

I think there is some value in teaching the __migrate_enable() path
about preferred CPUs. That way, we don't end up with
__set_cpus_allowed_ptr_locked() putting a task on a non-preferred
CPU by selecting it from:

cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);

and affine_move_task() can handle cases where the task does a
migrate_enable() from a non-preferred CPU / changes affinity and can be
pushed out readily.
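
To illustrate the selection I have in mind, here is a userspace model
(untested against the tree; demo_first_cpu() and demo_pick_dest_cpu() are
made-up names, plain 64-bit masks stand in for cpumasks, and a simple
"first set bit" pick stands in for the distribution that
cpumask_any_and_distribute() does). The candidate set is narrowed to the
preferred CPUs first, and it falls back to the full allowed set only when
that intersection is empty:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: lowest set bit index in mask, or -1 if mask is empty. */
static int demo_first_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/*
 * Hypothetical model of a preferred-aware destination pick.
 * valid     ~ cpu_valid_mask
 * new_mask  ~ ctx->new_mask
 * preferred ~ the preferred CPU mask
 */
static int demo_pick_dest_cpu(uint64_t valid, uint64_t new_mask,
			      uint64_t preferred)
{
	uint64_t allowed = valid & new_mask;
	uint64_t narrowed = allowed & preferred;

	/* Prefer a preferred CPU; otherwise keep today's behaviour. */
	return demo_first_cpu(narrowed ? narrowed : allowed);
}
```

A task whose new affinity contains no preferred CPU still gets a valid
destination, so this only changes the outcome when a better choice exists.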

I'll let Peter comment since that bit is already very complicated and
this would be adding more to that machinery.

> +
> + /* There is already a stopper thread for this. Dont race with it */
> + if (rq->push_task_work_done == 1)
> + return;
> +
> + local_irq_save(flags);
> +
> + get_task_struct(push_task);
> +
> + rq_lock(rq, &rf);

nit. The local_irq_save() and rq_lock() can use guards.
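
For reference, a userspace model of the scope-based pattern (GCC/Clang
cleanup attribute; every name below is made up for the demo and is not
kernel API). In-tree I would expect this to be guard(irqsave)() followed
by guard(rq_lock)(rq), or a single guard(rq_lock_irqsave)(rq), assuming I
remember the guard definitions in kernel/sched/sched.h correctly:

```c
#include <assert.h>

/* Hypothetical stand-in for struct rq; lock_depth models rq->__lock. */
struct guard_demo_rq {
	int lock_depth;
	int push_task_work_done;
};

static void guard_demo_lock(struct guard_demo_rq *rq)   { rq->lock_depth++; }
static void guard_demo_unlock(struct guard_demo_rq *rq) { rq->lock_depth--; }

/* Cleanup handler: runs automatically when the guard variable leaves scope. */
static void guard_demo_exit(struct guard_demo_rq **rqp)
{
	guard_demo_unlock(*rqp);
}

/* Take the lock now, drop it on every exit path from the enclosing scope. */
#define GUARD_DEMO_RQ_LOCK(rq)						\
	struct guard_demo_rq *demo_guard_				\
		__attribute__((cleanup(guard_demo_exit), unused)) =	\
		(guard_demo_lock(rq), (rq))

/* Returns 1 if the caller should queue the stopper, 0 if one is pending. */
static int guard_demo_claim_push_work(struct guard_demo_rq *rq)
{
	GUARD_DEMO_RQ_LOCK(rq);

	if (rq->push_task_work_done)
		return 0;	/* lock is dropped by the guard here too */

	rq->push_task_work_done = 1;
	return 1;
}
```

The early return no longer needs its own unlock, which is the main win
once the pending check moves under the lock.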

> + rq->push_task_work_done = 1;

nit.

Apart from task_has_preferred_cpus(), all the checks above have no
dependency on CONFIG_PREFERRED_CPU.

Can we set some local indicator in sched_tick() within the rq_lock() section
to then schedule the stopper once we drop the rq_lock there? That way we
don't have to grab the rq_lock thrice in the worst case scenario where we
have to schedule the stopper.
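
Roughly the flow I am thinking of, as a userspace model (untested, all
names are hypothetical, and the eligibility checks are collapsed into a
single curr_pushable flag for brevity). The decision is recorded in a
local variable while the tick already holds the rq lock, and the stopper
is queued only after the lock is dropped:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct rq with counters for the demo. */
struct tick_demo_rq {
	int lock_acquisitions;		/* how often the rq lock was taken */
	bool preferred;			/* models cpu_preferred(rq->cpu) */
	bool curr_pushable;		/* models the class/affinity checks */
	bool push_task_work_done;	/* a stopper is already pending */
	int stoppers_queued;
};

static void tick_demo_lock(struct tick_demo_rq *rq)   { rq->lock_acquisitions++; }
static void tick_demo_unlock(struct tick_demo_rq *rq) { (void)rq; }

/* Models stop_one_cpu_nowait(); only counts the requests. */
static void tick_demo_queue_stopper(struct tick_demo_rq *rq)
{
	rq->stoppers_queued++;
}

static void tick_demo_sched_tick(struct tick_demo_rq *rq)
{
	bool push_curr = false;

	tick_demo_lock(rq);
	/* ... usual tick work under the rq lock would go here ... */
	if (!rq->preferred && rq->curr_pushable && !rq->push_task_work_done) {
		rq->push_task_work_done = true;
		push_curr = true;
	}
	tick_demo_unlock(rq);

	/* Queue the stopper without holding the rq lock. */
	if (push_curr)
		tick_demo_queue_stopper(rq);
}
```

With this shape the tick takes the lock once, and only the stopper itself
needs it again.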

> + rq_unlock(rq, &rf);
> +
> + stop_one_cpu_nowait(rq->cpu, sched_non_preferred_cpu_push_stop,
> + push_task, this_cpu_ptr(&npc_push_task_work));
> + local_irq_restore(flags);
> +}
> +#endif

--
Thanks and Regards,
Prateek