Re: sched_ext: Partial mode priority and fallthrough to EEVDF

From: Andrea Righi

Date: Tue Mar 10 2026 - 14:46:29 EST


On Tue, Mar 10, 2026 at 08:27:00AM -1000, Tejun Heo wrote:
> Hello, Matt.
>
> On Tue, Mar 10, 2026 at 02:52:13PM +0000, Matt Fleming wrote:
> > At Cloudflare we're experimenting with inverting the priority of the
> > ext_sched_class and fair_sched_class to allow us to pick SCHED_EXT
> > tasks to run before SCHED_NORMAL. This gives us better scheduling
> > decisions for those SCHED_EXT tasks where we can embed business logic
> > into the BPF program and prevents them being starved by the larger
> > number of SCHED_NORMAL tasks under CPU contention. There are a couple
> > of reasons we took this route:
> >
> > 1. Our workloads are heterogeneous and complex and we can't move entire
> > systems to SCHED_EXT in one shot. We want to experiment with running
> > SCHED_EXT in partial mode as we progressively onboard more and more
> > services (we run multiple services on single machines).
> >
> > 2. There's no way today (AFAIK) to run in "full-mode" and have BPF
> > schedulers fallthrough to EEVDF.
> >
> > In an ideal world, 2 is what we'd want to do. Is anyone else interested
> > in this problem or currently working on it? Is there anything coming in
> > the future that would make it easier for those of us slowly
> > transitioning to SCHED_EXT?
>
> Hmm... I have a bit of hard time following how that's different from partial
> mode. If you want the scheduler to decide whether a task should be in SCX or
> fair, you can do so from ops.init_task() by asserting p->scx.disallow. If
> you mean that you want to switch dynamically on each scheduling event, I
> don't think that's a good idea given that each hop would be full sched_class
> switch.
>
> As for the ordering between the two, I don't know. How are you using partial
> mode? No matter how you order them, the behaviors on pathological cases are
> pretty bad and I've been thinking that most would use partial mode to
> partition the system so that some CPUs are managed by SCX and others by fair
> in which case the ordering doesn't matter that much. If you're mixing the
> two classes on the same CPUs, I wonder whether this is something which can
> be better dealt with the deadline servers. Andrea, what do you think?

I think you can model your scenario using the ext deadline server. For
instance, if you run:

# echo 500000000 | tee /sys/kernel/debug/sched/ext_server/cpu*/runtime

This would give sched_ext tasks a guaranteed 50% bandwidth on all CPUs
(the default is 5%), even if there are tasks running in higher-priority
sched classes.
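
To make the arithmetic explicit, here is a minimal sketch of how a target
percentage maps to a runtime value. It assumes the server period is 1 second
(1000000000 ns), which is what the 50% and 5% figures above imply; the helper
name runtime_for_pct is made up for illustration and is not part of any
kernel tooling.

```shell
# Hypothetical helper: convert a target bandwidth percentage into a
# runtime value in nanoseconds for a given server period.
#   $1 = percentage (integer, 0-100)
#   $2 = period in ns (optional; assumed to default to 1 s)
runtime_for_pct() {
    pct=$1
    period_ns=${2:-1000000000}
    # Integer arithmetic: runtime = period * pct / 100
    echo $(( period_ns * pct / 100 ))
}

runtime_for_pct 50    # prints 500000000
runtime_for_pct 5     # prints 50000000
```

The output can be piped into the same tee command as above in place of the
hard-coded value. If the period on your system differs from 1 second, pass
it as the second argument.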

Would this approach work for your needs?

-Andrea