Re: [PATCH 14/31] sched_ext: Implement BPF extensible scheduler class

From: Tejun Heo
Date: Tue Dec 13 2022 - 13:12:16 EST


Hello,

On Tue, Dec 13, 2022 at 11:55:10AM +0100, Peter Zijlstra wrote:
> On Mon, Dec 12, 2022 at 11:33:12AM -1000, Tejun Heo wrote:
>
> > > But this.. afaict that means that:
> > >
> > > - the whole EXT thing is incompatible with SCHED_CORE
> >
> > Can you expand on why this would be? I didn't test against SCHED_CORE, so
> > I'm sure things might be broken, but I can't think of a reason why it'd be
> > fundamentally incompatible.
>
> For starters, SCHED_CORE doesn't use __pick_next_task() (much). But I

SCX implements ->pick_task() and the CORE selection path calls ->balance()
and then ->pick_task(). That should work, right? Will test later.
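
To make the ordering concrete, here is a rough userspace sketch of "balance
first, then pick". It is only a toy model under my own assumptions: the
toy_* names are hypothetical and are not kernel APIs, and the real selection
path iterates over classes and SMT siblings, which this ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct toy_task {
	int id;
};

struct toy_class {
	struct toy_task *remote;  /* task that balance() could pull in */
	struct toy_task *local;   /* task that pick_task() can see */
	int balance_calls;        /* how often balance() has run */
};

/* Model of ->balance(): make a runnable task visible locally. */
static int toy_balance(struct toy_class *c)
{
	c->balance_calls++;
	if (!c->local && c->remote) {
		c->local = c->remote;
		c->remote = NULL;
	}
	return c->local != NULL;
}

/* Model of ->pick_task(): report the candidate without side effects. */
static struct toy_task *toy_pick_task(const struct toy_class *c)
{
	return c->local;
}

/* Model of the selection path: balance first, then pick. */
static struct toy_task *toy_core_pick(struct toy_class *c)
{
	assert(c != NULL);
	toy_balance(c);
	return toy_pick_task(c);
}
```

The point is just that a class whose pick_task() only reports what balance()
has already made available keeps working as long as that order holds.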

> think you're going to have more trouble with prio_less() (which is the
> 3rd implementation of the scheduling function :/)

Can't it take the same approach as CFS? The BPF scheduler is gonna be the
one defining the relative priorities among SCX tasks, so that's where the
decision belongs.

> > > - the whole EXT thing can be trivially starved by the presence of a
> > > single CFS/BATCH/IDLE task.
> >
> > It's a similar situation w/ RT vs. CFS, which is resolved via RT having
> > starvation avoidance.
>
> That is a horrible situation as is, FIFO/RR are very crap scheduling
> policies for a number of reasons but we're stuck with them due to
> history and POSIX :-(, that is not something you should justify anything
> with.
>
> In fact, it should be an example of what to avoid.
>
> Specifically, FIFO/RR fail at the fundamentals of OS
> abstractions -- they provide neither resource distribution nor
> isolation.
>
> > Here, it's handled a bit differently: SCX has
> > a watchdog mechanism implemented in "[PATCH 18/31] sched_ext: Implement
> > runnable task stall watchdog", so if SCX tasks hang for whatever reason,
> > including being starved by CFS, the BPF scheduler gets aborted and all
> > tasks are handed back to CFS. IOW, it's treated like any other BPF
> > scheduler error that can lead to stalls and is recovered from the same way.
>
> That all sounds quite terrible.. :/

The main source of difference is that we can't implicitly trust the BPF
scheduler. If it malfunctions, or on user request, the system should always
be recoverable, so some extra mechanisms are inherently necessary to support
that.
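
As a rough illustration of the recovery idea, the watchdog can be thought of
along these lines. This is a userspace toy under my own assumptions, not the
code from patch 18/31: the wd_* names, the tick-based timestamps and the
timeout parameter are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum wd_mode { WD_MODE_BPF, WD_MODE_FALLBACK };

struct wd_task {
	bool runnable;              /* waiting to run */
	unsigned long runnable_at;  /* tick at which it became runnable */
};

/* True if any runnable task has waited longer than timeout ticks. */
static bool wd_stalled(const struct wd_task *tasks, size_t n,
		       unsigned long now, unsigned long timeout)
{
	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].runnable && now - tasks[i].runnable_at > timeout)
			return true;
	}
	return false;
}

/* Periodic check: once a stall is seen, stay in fallback mode. */
static enum wd_mode wd_check(enum wd_mode mode, const struct wd_task *tasks,
			     size_t n, unsigned long now, unsigned long timeout)
{
	if (mode == WD_MODE_BPF && wd_stalled(tasks, n, now, timeout))
		return WD_MODE_FALLBACK;
	return mode;
}
```

The check doesn't care why a task stalled, so starvation by a higher class
and a buggy BPF scheduler end up on the same recovery path.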

> When the scheduler isn't available it should be an error to switch a
> task to the policy, when there are tasks in the policy, it must not go
> away.

Yeah, this part is an interface design choice. Currently, when the BPF
scheduler fails or is not present for any reason, SCX falls back to CFS
because that seemed like the least invasive way to go about it. It would be
trivial to let SCX do dumb FIFO scheduling with the global DSQ instead, which
is in fact already used during transitions to guarantee forward progress.
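
By "dumb FIFO scheduling" I mean nothing more than a single queue that is
filled at the tail and drained from the head. A minimal userspace sketch,
assuming a fixed-size ring buffer of pids; the toy_dsq names and DSQ_CAP are
made up for illustration and are not the SCX DSQ implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DSQ_CAP 8

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct toy_dsq {
	int pids[DSQ_CAP];
	size_t head;   /* index of the oldest entry */
	size_t len;    /* number of queued entries */
};

/* Enqueue at the tail; returns false when the queue is full. */
static bool toy_dsq_enqueue(struct toy_dsq *q, int pid)
{
	assert(q != NULL);
	if (q->len == DSQ_CAP)
		return false;
	q->pids[(q->head + q->len) % DSQ_CAP] = pid;
	q->len++;
	return true;
}

/* Dispatch from the head; returns -1 when the queue is empty. */
static int toy_dsq_dispatch(struct toy_dsq *q)
{
	assert(q != NULL);
	if (q->len == 0)
		return -1;
	int pid = q->pids[q->head];
	q->head = (q->head + 1) % DSQ_CAP;
	q->len--;
	return pid;
}
```

Every queued task eventually reaches the head, which is all that's needed
for forward progress.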

Thanks.

--
tejun