Re: sched_ext: Partial mode priority and fallthrough to EEVDF

From: Matt Fleming

Date: Wed Mar 11 2026 - 07:10:28 EST

On Tue, Mar 10, 2026 at 08:27:00AM -1000, Tejun Heo wrote:
>
> Hmm... I have a bit of hard time following how that's different from partial
> mode. If you want the scheduler to decide whether a task should be in SCX or
> fair, you can do so from ops.init_task() by asserting p->scx.disallow. If
> you mean that you want to switch dynamically on each scheduling event, I
> don't think that's a good idea given that each hop would be full sched_class
> switch.

Oh no, I don't want to switch dynamically at runtime. Doing the
classification once at BPF program load time is fine, but AFAIU
p->scx.disallow still leaves us with two scheduling classes (SCHED_EXT
and SCHED_NORMAL), and tasks in the fair class get picked first
whenever both have runnable tasks on the same CPU.

> As for the ordering between the two, I don't know. How are you using partial
> mode? No matter how you order them, the behaviors on pathological cases are
> pretty bad and I've been thinking that most would use partial mode to
> partition the system so that some CPUs are managed by SCX and others by fair
> in which case the ordering doesn't matter that much. If you're mixing the
> two classes on the same CPUs, I wonder whether this is something which can
> be better dealt with the deadline servers. Andrea, what do you think?

I want to use SCHED_EXT to schedule the most latency-critical tasks
because a custom BPF scheduler allows me to make better CPU placement
and preemption decisions. Doing it with partial mode allows me to
progressively switch services over to SCHED_EXT without needing to take
on a mass migration of 100+ services in one go (something I'm trying
my hardest to avoid :) ).

To clarify my "fallthrough to EEVDF" comment: if I could run in full
mode, use disallow to keep most tasks on EEVDF, and have SCHED_EXT
tasks scheduled with higher priority than SCHED_NORMAL, then this would
tick all the boxes.
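
To make the ordering point concrete, here's a toy userspace model (not
kernel code; every toy_* name is made up for illustration). It only
captures the idea that the core scheduler consults the classes in a
fixed priority order and runs the first one that has work. The
"ext first" table is the hypothetical ordering I'm asking about, not
something that exists today.

```c
#include <stddef.h>

/* Toy class IDs; only the three classes relevant to this thread. */
enum toy_class { TOY_FAIR, TOY_EXT, TOY_IDLE, TOY_NR_CLASSES };

/* Per-CPU count of runnable tasks in each class. */
struct toy_rq {
	int nr_runnable[TOY_NR_CLASSES];
};

/* Current behaviour as I understand it: fair is consulted before ext. */
const enum toy_class toy_order_mainline[] = { TOY_FAIR, TOY_EXT };

/* Hypothetical ordering from this thread: ext ahead of fair. */
const enum toy_class toy_order_ext_first[] = { TOY_EXT, TOY_FAIR };

/*
 * Walk the classes in priority order and return the first one with a
 * runnable task. Fall back to idle if none of them has work.
 */
enum toy_class toy_pick_class(const struct toy_rq *rq,
			      const enum toy_class *order, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (rq->nr_runnable[order[i]] > 0)
			return order[i];
	return TOY_IDLE;
}
```

With one runnable task in each class, the mainline table picks the fair
task and the ext-first table picks the SCHED_EXT task. When no
SCHED_EXT task is runnable, the ext-first table falls through to fair,
which is all I meant by "fallthrough to EEVDF".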

I have experimented with isolating CPUs so that everything running on
them is SCHED_EXT while the other CPUs run the SCHED_NORMAL workloads,
so that's a possibility. But not all our servers are configured that
way, and given that we run heterogeneous workloads on single machines,
it's a steep price to pay capacity-wise if we can't fully utilise those
isolated CPUs at all times.

And to limit the pathological case in my experiments so far, I'm using
cpu.max to cap CPU bandwidth (thanks to scx_lavd's bandwidth support).
All our services are systemd services, so we can set per-service limits
to guard against complete meltdowns.
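
For reference, cgroup v2's cpu.max takes "<quota> <period>" in
microseconds. Below is a minimal sketch of how a cap expressed as a
percentage of one CPU maps onto that value, assuming the default 100ms
period. The helper name and the percentage interface are mine, purely
for illustration.

```c
#include <stdio.h>

/* Assumed period: the cgroup v2 default of 100ms, in microseconds. */
#define TOY_CPU_MAX_PERIOD_US 100000L

/*
 * Format a cpu.max value ("<quota_us> <period_us>") for a cap given as
 * a percentage of a single CPU, e.g. 150 means one and a half CPUs.
 * Returns whatever snprintf() returns.
 */
int toy_format_cpu_max(char *buf, size_t len, long percent)
{
	long quota_us = TOY_CPU_MAX_PERIOD_US * percent / 100;

	return snprintf(buf, len, "%ld %ld", quota_us, TOY_CPU_MAX_PERIOD_US);
}
```

So a 150% cap comes out as "150000 100000", which is the kind of value
that ends up in the service's cgroup when the limit is set through
systemd.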

Thanks for the tip on the DL server. This looks promising and might
solve my problem nicely. I'll reply in more detail to Andrea's post.