Re: [GIT PULL] sched_ext: Initial pull request for v6.11

From: Peter Zijlstra
Date: Wed Jul 24 2024 - 04:52:37 EST


On Tue, Jul 23, 2024 at 09:34:13AM -1000, Tejun Heo wrote:

> > > + for_each_active_class(class) {
> > > p = class->pick_next_task(rq);
> > > - if (p)
> > > + if (p) {
> > > + const struct sched_class *prev_class = prev->sched_class;
> > > +
> > > + if (class != prev_class && prev_class->switch_class)
> > > + prev_class->switch_class(rq, p);
> > > return p;
> > > + }
> > > }
> >
> > So I really don't like this one.. at the very least it would need a comment
> > explaining why it only needs calling here and not every time a put_prev_task()
> > and set_next_task() pair cross a class boundary -- which would be the
> > consistent thing to do.
> >
> > Now, IIRC, you need a class call that indicates you're about to loose the CPU
> > so that you can kick the task to another CPU or somesuch. And last time I got
> > it backwards and suggested adding an argument to pick_next_task(), but what
> > about put_prev_task()?
> >
> > Can't we universally push put_prev_task() after the pick loop? Then we get
> > something like:
>
> Yeah, the problem with put_prev_task() was that it was before the next task
> was picked, so we couldn't know what the next class should be.

Right. But I don't think it needs to stay that way.

> > next = pick_task();
> > if (next != prev) {
> > put_prev_task(rq, prev, next->class != prev->class);
> > set_next_task(rq, next);
> > }
> >
> > I have patches for most of this for fair (in my eevdf queue), and I
> > think the others are doable, after all, this is more or less what we do
> > for SCHED_CORE already.
> >
> > /me went off hacking for a bit
> >
> > I've done this; find the results at: queue.git sched/prep
> >
> > I've also rebased the sched_ext tree on top of that with the below delta, which
> > you can find at: queue.git sched/scx
>
> Hmm... took a brief look, but I think I'm missing something. Doesn't
> put_prev_task() need to take place before pick_task() so that the previously
> running task can be considered when pick_task() is picking the next task to
> run?

So pick_task() came from the SCHED_CORE crud, which does a remote pick
and as such isn't able to do a put -- remote is still running its
current etc.

So pick_task() *SHOULD* already be considering its current and pick
that if it is a better candidate than whatever is on the queue.

If we have a pick_task() that doesn't do that, it's a pre-existing bug
and needs fixing anyhow.

> > This led me to discover that you're passing the task of the other class into
> > the bpf stuff -- I don't think that is a sane thing to do. You get preempted,
> > it doesn't matter from which higher class or by which specific task, a policy
> > must not care about that. So I kinda bodged it, but I really think this should
> > be taken out.
>
> At least for visibility, I think it makes sense. One can attach extra BPF
> progs to track the preemption events but it can be useful for the scheduler
> itself to be able to colllect when and why it's getting preempted. Also, in
> a more controlled environments, which RT task is preempting the task can be
> a, while not always reliable, useful hint in whether it'd be better to
> bounce to another CPU right away or not. That said, we can drop it e.g. if
> it makes implementation unnecessarily complicated.

It shouldn't be too hard to 'fix', it just feel wrong to hand the next
task into the prev class for something like this.

> > I also found you have some terrible !SMP hack in there, which I've
> > broken, I've disabled it for now. This needs a proper fix, and not
> > something ugly like you did.
>
> Yeah, this came up before. On UP, SCX either needs to call the balance
> callback as that's where the whole dispatch logic is called from (which
> makes sense for it as dispatching often involves balancing operations), or
> SCX itself needs to call it directly in a matching sequence. Just enabling
> balance callback on UP && SCX would be the cleanest.

Or make scx hard depend on SMP? Are there really still people using !SMP
-- and I suppose more importantly, do we care?

I mean, they could always run an SMP kernel on their UP potato if they
*really* feel they need this.

> > Anyway, it seems to build, boot and pass the sched_ext selftest:
> >
> > PASSED: 21
> > SKIPPED: 0
> > FAILED: 0
> >
> > (albeit with a fair amount of console noise -- is that expected?)
>
> Yeah, the selftests trigger a lot of error conditions so that's expected.
>
> > Also, why does that thing hard depend on DEBUG_BTF? (at least having
> > that option enabled no longer explodes build times like it used to)
>
> That's necessary for building the schedulers, at least, I think. We didn't
> have that earlier and people were getting confused.

It's what made it difficult for me to even build this stuff, simple
things like:

cd foo-build; ../scipts/config --enable CONFIG_SCHED_CLASS_EXT; cd -

don't work (nor error out on the lack of dependencies), and my build
scripts were force disabling the BTF crud -- and thus silently also the
scx thing.

It looks like I can change my builds scripts, because the reason I did
that was that BTF made the build times explode and that is no longer the
case.

> > Also should we /CONFIG_SCHED_CLASS_EXT/CONFIG_SCHED_BPF/ ? Then
> > something like: grep BPF foo-build/.config
> > more easily shows what's what.
>
> I don't know. isn't that too inconsistent with other classes, and down the
> line, maybe there may be BPF related additions to scheduler?

We can always rename crap later if needed. Maybe I'm the weird one doing
grep on .config, dunno. I also edit raw diff files and people think
that's weird too.