Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()

From: Peter Zijlstra

Date: Mon Sep 29 2025 - 06:07:11 EST


On Fri, Sep 26, 2025 at 11:39:21AM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Sep 26, 2025 at 12:36:28PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 25, 2025 at 11:43:18AM -1000, Tejun Heo wrote:
> > > Yes, I was on a similar train of thought. The only reasonable way that I can
> think of for solving this for BPF-managed tasks is giving each task its own
> > > inner sched lock, which makes sense as all sched operations (except for
> > > things like watchdog) are per-task and we don't really need wider scope
> > > locking.
> >
> > Like I've said before; I really don't understand how that would be
> > helpful at all.
> >
> > How can you migrate a task by holding a per-task lock?
>
> Let's see whether I'm completely confused. Let's say we have p->sub_lock
> which is optionally grabbed by task_rq_lock() if requested by the current
> sched class (maybe it's a sched_class flag). Then, whoever is holding the
> sub_lock would exclude property changes and other modifications to the task.
>
> In sched_ext, let's say p->sub_lock nests inside dsq locks. Also, right now,
> we're piggybacking on the rq lock for local DSQs. We'd need to make local DSQs
> use their own locks like user DSQs. Then,
>
> - If a task needs to be migrated either during enqueue through
> process_ddsp_deferred_locals() or during dispatch from BPF through
> finish_dispatch(): Leave rq locks alone. Grab sub_lock inside
> dispatch_to_local_dsq() after grabbing the target DSQ's lock.
>
> - scx_bpf_dsq_move_to_local() from dispatch: This is a bit tricky as we need
> to scan the tasks on the source DSQ to find the task to dispatch. However,
> there's a patch being worked on to add an RCU-protected pointer to the
> first task, which would be the task to be consumed in the vast majority of
> cases, so the fast path wouldn't be complicated: grab sub_lock, do the moving. If
> the first task isn't a good candidate, we'd have to grab DSQ lock, iterate
> looking for the right candidate, unlock DSQ and grab sub_lock (or
> trylock), and see if the task is still on the DSQ and then relock and
> remove.
>
> - scx_bpf_dsq_move() during BPF iteration: DSQ is unlocked during each
> iteration visit, so this is straightforward. Grab sub_lock and do the rest
> the same.
>
> Wouldn't something like the above provide synchronization equivalent to the
> dynamic lock approach? Whoever is holding sub_lock would be guaranteed that
> the task won't be migrating while the lock is held.
>
> However, thinking more about it, I'm unsure how e.g. the actual migration
> would work. The actual migration is done by: deactivate_task() ->
> set_task_cpu() -> switch rq locks -> activate_task(). Enqueueing/dequeueing
> steps have operations that depend on rq lock - psi updates, uclamp updates
> and so on. How would they work?

Suppose __task_rq_lock() will take rq->lock and p->sub_lock, in that
order, such that task_rq_lock() will take p->pi_lock, rq->lock and
p->sub_lock.

Then something like:

	guard(task_rq_lock)(p);
	scoped_guard (sched_change, p, ...) {
		// change me
	}

Will end up doing something like:

	// task_rq_lock
	IRQ-DISABLE
	LOCK p->pi_lock
1:
	rq = task_rq(p);
	LOCK rq->lock;
	if (rq != task_rq(p)) {
		UNLOCK rq->lock
		goto 1;
	}
	LOCK p->sub_lock

	// sched_change
	dequeue_task() := dequeue_task_scx()
	  LOCK dsq->lock
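
Spelled out as C, that __task_rq_lock() would look roughly like the
sketch below. This is only a sketch of the proposal above: p->sub_lock
and the task_uses_sub_lock() opt-in check are hypothetical, and the
migrate-wait part of the real retry loop is elided.

	struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
	{
		struct rq *rq;

		lockdep_assert_held(&p->pi_lock);

		for (;;) {
			rq = task_rq(p);
			raw_spin_rq_lock(rq);
			if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
				/* hypothetical per-task inner lock, taken last */
				if (task_uses_sub_lock(p))
					raw_spin_lock(&p->sub_lock);
				rq_pin_lock(rq, rf);
				return rq;
			}
			raw_spin_rq_unlock(rq);
		}
	}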

While at the same time, above you argued p->sub_lock should be inside
dsq->lock. Because:

	__schedule()
	  rq = this_rq();
	  LOCK rq->lock
	  next = pick_next() := pick_next_scx()
	    LOCK dsq->lock
	    p = find_task(dsq);
	    LOCK p->sub_lock
	    dequeue(dsq, p);
	    UNLOCK dsq->lock

Because if you did something like:

	__schedule()
	  rq = this_rq();
	  LOCK rq->lock
	  next = pick_next() := pick_next_scx()
	    LOCK dsq->lock (or RCU, doesn't matter)
	    p = find_task(dsq);
	    UNLOCK dsq->lock
	  migrate:
	    LOCK p->pi_lock
	    rq = task_rq(p)
	    LOCK rq->lock
	    (verify bla bla)
	    LOCK p->sub_lock
	    LOCK dsq->lock
	    dequeue(dsq, p)
	    UNLOCK dsq->lock
	    set_task_cpu(n);
	    UNLOCK rq->lock
	    rq = cpu_rq(n);
	    LOCK rq->lock      (inversion vs p->sub_lock)
	    LOCK dsq2->lock
	    enqueue(dsq2, p)
	    UNLOCK dsq2->lock

while, concurrently, someone else does:

	LOCK p->sub_lock
	LOCK dsq->lock       (whoopsie, p is on dsq2)
	dequeue(dsq, p)
	set_task_cpu(here);
	UNLOCK dsq->lock


That is, either way around (dsq->lock outside, p->sub_lock inside, or
the other way around) I end up with inversions and race conditions that
are not fun.

Also, if you do put p->sub_lock inside dsq->lock, this means
__task_rq_lock() cannot take it; it would need to be pushed deep into
scx (possibly into BPF?), and then I'm not sure how to do the change
pattern sanely.

Having __task_rq_lock() take p->dsq->lock solves all these problems,
except for that one weird case where BPF wants to do things its own
way. The longer I think about it, the more I dislike that. I just
don't see *ANY* upside to allowing BPF to do this while it is making
everything else quite awkward.
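
Concretely, the shape would be roughly the fragment below, mirroring
the 1: retry loop from earlier; the p->scx.dsq accessor and the exact
revalidation are illustrative, not the actual patch:

	rq = task_rq(p);
	raw_spin_rq_lock(rq);
	dsq = p->scx.dsq;
	if (dsq)
		raw_spin_lock(&dsq->lock);
	if (rq != task_rq(p) || dsq != p->scx.dsq) {
		/* raced with a migration or a DSQ move: drop both and retry */
		if (dsq)
			raw_spin_unlock(&dsq->lock);
		raw_spin_rq_unlock(rq);
		goto retry;	/* back to the top, as in the 1: loop above */
	}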

The easy fix is to have these BPF-managed things use a single global
lock. That works and is correct. Then if they want something better,
they can use DSQs :-)
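
As a sketch of that fallback, with scx_global_bpf_lock made up purely
for illustration:

	/* One global lock stands in for a shared-runqueue lock for tasks
	 * the BPF scheduler keeps outside any DSQ. */
	static DEFINE_RAW_SPINLOCK(scx_global_bpf_lock);

	static raw_spinlock_t *task_shared_lock(struct task_struct *p)
	{
		if (p->scx.dsq)
			return &p->scx.dsq->lock;	/* on a DSQ: its lock */
		return &scx_global_bpf_lock;		/* otherwise: global */
	}

__task_rq_lock() would then always have a well-defined shared lock to
take.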

Fundamentally, we need dsq->lock to cover all CPUs that will pick
from it; there is no wiggle room there. Also note that while the change
pattern modifies the attributes of only a single task, that affects the
whole runqueue, since a runqueue is an aggregate of all its tasks.
This is very much why we dequeue/enqueue around the change pattern: to
keep the runqueue aggregates updated.
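
For reference, the change pattern being wrapped by sched_change above
looks roughly like this in core code (cf. __sched_setscheduler();
change_task_prio() and newprio are just illustrative stand-ins):

	static void change_task_prio(struct rq *rq, struct task_struct *p,
				     int newprio)
	{
		bool queued = task_on_rq_queued(p);
		bool running = task_current(rq, p);

		if (queued)
			dequeue_task(rq, p, DEQUEUE_SAVE);
		if (running)
			put_prev_task(rq, p);

		p->prio = newprio;	/* the actual attribute change */

		if (queued)
			enqueue_task(rq, p, ENQUEUE_RESTORE);
		if (running)
			set_next_task(rq, p);
	}

Each of those steps updates rq-wide state, which is why the whole
bracket runs under the locks task_rq_lock() takes.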

Use the BPF thing to play with scheduling policies, but leave the
locking to the core code.