Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution

From: Andrea Righi

Date: Thu May 07 2026 - 02:31:32 EST

Hi John, Prateek,

On Thu, May 07, 2026 at 09:04:57AM +0530, K Prateek Nayak wrote:
> Hello John, Andrea,
>
> (Full disclaimer: I haven't looked at the entire series)
>
> On 5/7/2026 2:39 AM, John Stultz wrote:
> >> + /*
> >> + * Tasks pinned to a single CPU (per-CPU kthreads via
> >> + * kthread_bind(), tasks under migrate_disable()) cannot
> >> + * be moved to @owner_cpu. proxy_migrate_task() uses
> >> + * __set_task_cpu() which would silently violate the
> >> + * pinning and leave the task to run on a CPU outside
> >> + * its cpus_ptr once it is unblocked. Stay on this CPU
> >> + * via force_return; the owner running elsewhere will
> >> + * wake @p back up when the mutex becomes available.
> >> + */
> >> + if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
> >> + goto force_return;
> >> goto migrate_task;
> >
> > Hey Andrea!
> > I'm excited to see this series! Thanks for your efforts here!
> >
> > Though I'm a bit confused on this patch. I see the patch changes it
> > so we don't proxy-migrate pinned/migration-disabled patches, but I'm
> > not sure I understand why.
> >
> > We only proxy-migrate blocked_on tasks, which don't run on the cpu
> > they are migrated to (they are only migrated to be used as a donor).
> > That's why we have the proxy_force_return() function to return-migrate
> > them back when they do become runnable.
>
> I agree this shouldn't be a problem from core perspective but there
> are some interesting sched-ext interactions possible. More on that
> below:

So, I included this patch, because in a previous version of this series it was
preventing a "SCX_DSQ_LOCAL[_ON] cannot move migration disabled task" error.

However, I tried again this series without this and everything seems to work. I
guess this was fixed by "sched/ext: Avoid migrating blocked tasks with proxy
execution", that was not present in my previous early implementation. So, let's
ignore this for now...

>
> >
> > Could you provide some more details about what motivated this change
> > (ie: how you tripped a problem that it resolved?).
>
> I think ops.enqueue() always assumes that the task being enqueued is
> runnable on the task_cpu() and when the the sched-ext layer tries to
> dispatch this task to local DSQ, the ext core complains and marks
> the sched-ext scheduler as buggy.

Correct that ops.enqueue() assumes that the task being enqueued is runnable on
task_cpu(), but this should still be true even when the donor is migrated:
proxy-exec should only migrate the donor to the owner's CPU when the placement
is allowed.

>
> With sched-ext, even the lock owner's CPU is slightly complicated
> since the owner might be associated with a CPU but it is in fact on a
> custom DSQ and after moving the donor to owner's CPU, we will need
> sched-ext scheduler to guarantee that the owner runs there else
> there is no point in doing a proxy.

But a donor is always a running task (by definition), so it can't be on a custom
DSQ. Custom DSQs only hold tasks that are in the BPF scheduler's custody,
waiting to be dispatched.

The core keeps the donor logically runnable / on_rq and the ext core always
parks blocked donors on the built-in local DSQ:

put_prev_task_scx():
...
if (p->scx.flags & SCX_TASK_QUEUED) {
set_task_runnable(rq, p);

if (task_is_blocked(p)) {
dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
goto switch_class;
}
...

>
> scx flow should look something like (please correct me if I'm
> wrong):
>
> CPU0: donor CPU1: owner
> =========== ===========
>
> /* Donor is retained on rq*/
> put_prev_task_scx()
> ops.stopping()
> ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
> do_pick_task_scx()
> next = donor;
> find_proxy_task()
> proxy_migrate_task()
> ops.dequeue()
> ======================> /*
> * Moves to owner CPU (May be outside of affinity list)
> * ops.enqueue() still happens on CPU0 but I've shown it
> * here to depict the context has moved to owner's CPU.
> */
> ops.enqueue()
> scx_bpf_dsq_insert()
> /*
> * !!! Cannot dispatch to local CPU; Outside affinity !!!
> *
> * We need to allow local dispatch outside affinity iff:
> *
> * p->is_blocked && cpu == task_cpu(p)
> *
> * Since enqueue_task_scx() hold's the task's rq_lock, the
> * is_blocked indicator should be stable during a dispatch.
> */
> ops.dispatch()
> do_pick_task_scx()
> set_next_task_scx()
> ops.running(donor)
> find_proxy_task()
> next = owner
> /*
> * !!! Owner stats running without any notification. !!!
> *
> * If owner blocks, dequeue_task_scx() is executed first and
> * the sched-ext scheduler sees:
> *
> * ops.stopping(owner)
> *
> * which leads to some asymmetry.
> *
> * XXX: Below is how I imagine the flow should continue.
> */
> ops.quiescent(owner) /* Core is taking back control of owner's running */
> /* Runs owner */
> ops.runnable(owner) /* Core is giving back control to ext layer */
> ops.stopping(donor); /* Accounting symmetry for donor */

I think the order of operations should be the following:

ops.runnable(donor)
-> ops.enqueue(donor)
-> donor becomes curr
-> ops.running(donor) /* set_next_task_scx(donor); !task_is_blocked(donor) */
-> donor executes
-> donor blocks on mutex (proxy: stays on_rq; task_is_blocked(donor) true)
-> __schedule()
-> pick_next -> proxy-exec selects owner as next
-> put_prev_task_scx(donor)
-> ops.stopping(donor)
-> dispatch_enqueue(local_dsq) /* blocked donor: ext core parks on local DSQ */
-> set_next_task_scx(owner)
-> ops.running(owner)
-> donor runs as rq->donor, owner runs as rq->curr /* execution / accounting split */

Later, when the owner is switched away (another schedule)

... owner running ...
-> __schedule() / switch away from owner
-> put_prev_task_scx(owner)
-> ops.stopping(owner) /* if QUEUED && IS_RUNNING */
-> set_next_task_scx() /* whoever is next */

Later, mutex is released - donor can run as itself again

-> mutex released / donor unblocked (!task_is_blocked(donor))
-> donor selected as next /* becomes rq->curr as donor; not superseded by proxy */
-> ops.running(donor) /* set_next_task_scx(donor); QUEUED && !task_is_blocked(donor) */
-> donor executes as rq->curr

> I think dequeue_task_scx() should see task_current_donor() before
> calling ops.stopping() else we get some asymmetry. The donor will
> anyways be placed back via put_prev_task_scx() and since it hasn't run,
> it cannot block itself and there should be no dependency on
> dequeue_task_scx() for donors.

The ops.running/stopping() pair should be always enforced by
SCX_TASK_IS_RUNNING, so we either see a pair of them or none. So in theory,
there shouldn't be any asymmetry.

>
> With the quiescent() + runnable() scheme, the sched-ext schedulers need
> to be made aware that task can go quiescent() and then back to
> runnable() while being SCX_TASK_QUEUED or the ext core has to spoof a
> full:
>
> dequeue(SLEEP) -> quiescent() -> /* Run owner */ -> runnable() -> select_cpu() -> enqueue()
>
> Also since the mutex owner can block, the sched-ext scheduler needs to
> be aware of the fact that it can get a dequeue() -> quiescent()
> without having stopping() in between if we plan to keep
> symmetry.

We can see ops.dequeue() -> ops.quiescent() without ops.stopping() even without
proxy-exec: if a task becomes runnable and then it's moved to a different sched
class, the BPF scheduler can see ops.runnable/quiescent() without
ops.running/stopping().

As long as ops.runnable/quiescent() and ops.running/stopping() are symmetric I
think we're fine.

>
> There might be more issues there that I'm missing.
>

Right, I'm still trying to figure out if there's any scenario that can break
some BPF assumptions (kfunc permissions or similar), but considering that the
BPF context is usually associated to task_struct I can't see any potential
violation/breakage at the moment.

Thanks,
-Andrea