Re: [PATCH sched_ext/for-7.1] sched_ext: Reduce DSQ lock contention in consume_dispatch_q()

From: Andrea Righi

Date: Sun Mar 15 2026 - 05:40:59 EST


On Sat, Mar 14, 2026 at 10:58:05PM -1000, Tejun Heo wrote:
> Hello, Andrea.
>
> On Sun, Mar 15, 2026 at 12:52:31AM +0100, Andrea Righi wrote:
> ...
> > Benchmarks that generate many enqueue/dispatch events (e.g., schbench)
> > show around 2-3x higher throughput with most of the scx schedulers with
> > this change applied.
>
> Can you share more details about the benchmark setup and results?

Just schbench and perf bench for now; it definitely needs more testing,
but I wanted to send a patch to start a discussion about this (I should
have added RFC to the subject, sorry).

>
> > +	/*
> > +	 * Use trylock to avoid spinning on a contended DSQ; if we fail to
> > +	 * acquire the lock, kick the CPU to retry on the next balance.
> > +	 *
> > +	 * In bypass mode, simply spin to acquire the lock, since
> > +	 * scx_kick_cpu() is suppressed.
> > +	 */
> > +	if (scx_bypassing(sch, cpu)) {
> > +		raw_spin_lock(&dsq->lock);
> > +	} else if (!raw_spin_trylock(&dsq->lock)) {
> > +		scx_kick_cpu(sch, cpu, 0);
> > +		return false;
> > +	}
>
> But I'm not sure this is what we wanna do. If we *really* want to do this,
> maybe we can add a try_move variant; however, I'm pretty deeply skeptical
> about the approach for a few reasons.
>
> - If a shared DSQ becomes a bottleneck, the right thing to do would be
> introducing multiple DSQs and shard them.

True, but with multiple DSQs we also need a load balancer, and moving
tasks across DSQs is not very efficient either. With a single shared DSQ
we do really well on latency, but under intense scheduling activity
(e.g., schbench) throughput suffers, so scheduling-heavy benchmarks get a
bad score with most of the scx schedulers.

With this applied, pretty much all the scx schedulers (scx_cosmos,
scx_bpfland, scx_p2dq, scx_lavd) match (or even slightly beat) EEVDF's
schbench score, without any noticeable impact on latency (I tested
average fps and tail latency with a few games).

>
> - This likely is trading off fairness to gain bandwidth and this approach
> depending on machine / workload may lead to severe starvation. One can
> argue that controlled trade off between fairness and bandwidth is useful
> for some use cases. However, even if that is the case, I don't think
> trylock is the way to get there. If we think that low overhead high
> fan-out shared queue is desirable, it'd be better to introduce dedicated
> data structure which can do so in a controlled manner.

True, and I think with moderate CPU activity this may increase latency
due to the additional kick/balance step when trylock fails (maybe we
could control this behavior with a flag?).

That said, the throughput benefits seem significant. While schbench is
probably an extreme case, the improvement there is substantial (2-3x),
which suggests this approach might also benefit some more realistic
workloads. I'm planning to run additional tests over the next few days to
better understand this.

Based on the schbench results, it seems like a missed opportunity to drop
this entirely. Can you elaborate more on the dedicated data structure you
mentioned? Do you have something specific in mind?

Thanks,
-Andrea