Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class
From: Tejun Heo
Date: Tue May 28 2024 - 19:46:31 EST
Hello,
BTW, David is off for the week and might be a bit slow to respond. I just
want to comment on one part.
On Mon, May 27, 2024 at 10:25:40PM +0100, Qais Yousef wrote:
..
> And I can only share my experience, I don't think the algorithm itself is the
> bottleneck here. The devil is in the corner cases. And these are hard to deal
> with without explicit hints.
Our perceptions of the scope of the problem space seem very different. To
me, it seems largely unexplored. Here's just one area: the constantly
increasing number of cores and the growing prevalence of more complex cache
hierarchies.
Over a hundred CPUs in a system is fairly normal now, with a couple of
layers of cache hierarchy. Once we have that many, things can look quite
different from the days when we had a few. Flipping the approach so that we
dynamically assign close-by CPUs to related groups of threads becomes
attractive.
For example, if you have a bunch of services which aren't latency critical
but are needed to maintain system integrity (updates, monitoring, security
and so on), soft-affining them to a subset of the CPUs while allowing some
CPU headroom can give you noticeable gains in both performance (partly from
cleaner caches) and power consumption, while not adding much to latency.
This is something the scheduler can and, I believe, should do transparently.
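To make the mechanism concrete, here's a minimal sketch of what the
selection side could look like in a sched_ext BPF scheduler. This is purely
illustrative - it is not what scx_layered does - and task_is_background()
and bg_cpumask are hypothetical stand-ins for however the grouping and the
CPU assignment get decided:

  #include <scx/common.bpf.h>

  /* Hypothetical inputs: how these get populated (cgroup walk,
   * userspace hints, ...) is exactly the open question discussed
   * below. */
  static const struct cpumask *bg_cpumask;

  /* Hypothetical classifier - a real one might look at cgroups. */
  static bool task_is_background(struct task_struct *p)
  {
          return false;   /* placeholder */
  }

  s32 BPF_STRUCT_OPS(soft_select_cpu, struct task_struct *p,
                     s32 prev_cpu, u64 wake_flags)
  {
          bool is_idle;
          s32 cpu;

          if (bg_cpumask && task_is_background(p)) {
                  /* Prefer an idle CPU in the soft-affinity mask. Real
                   * code would also intersect with p->cpus_ptr. */
                  cpu = scx_bpf_pick_idle_cpu(bg_cpumask, 0);
                  if (cpu >= 0)
                          return cpu;
                  /* None idle - fall through rather than queue behind
                   * busy preferred CPUs. That's what keeps the
                   * affinity soft. */
          }

          /* Default idle-CPU selection over the full allowed mask. */
          return scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
  }

The mechanics are trivial; everything interesting lives in the two inputs
the sketch assumes away.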
It's not obvious how to do that transparently, though. It doesn't quite fit
the current load-balancing model. The cgroup hierarchy seems to provide some
hints on how threads can be grouped, but the boundaries might not match that
well. Even if we figure out how to define these groups, evaluating
group-vs-group competition isn't trivial: naive load sums don't work when
comparing across groups spanning multiple CPUs. A group whose load is
concentrated on two CPUs and a group with the same total load spread thinly
across eight put very different pressure on the system.
Also, what about threads with oddball cpumasks? Should we begin to treat
CPUs more like other resources, e.g. memory? We don't generally allow
applications to pick the specific physical pages they get, because that
buys nothing while adding a lot of constraints. If we have dozens or
hundreds of CPUs, is there a fundamental reason to view them differently
from other resources which are treated as fungible?
The claim that the current scheduler has the fundamentals all figured out,
and that what's left is mostly handling edge cases and educating users,
seems wildly off the mark to me.
Maybe we can develop all of that in the current framework in a gradual
fashion, but when the problem space is this wide open, that's not a good
approach to take. The cost of such constriction is likely significantly
higher than the benefit of having a single code base. Imagine having to
develop all the features of btrfs inside the ext2 code base. It's probably
doable, at least theoretically, but it would have been massively stifling,
maybe to the point of most of it never happening.
On the particular soft-affinity problem above, scx_layered has something
really simple and dumb implemented, and we're testing and deploying it in
the fleet with noticeable perf gains. There are also early efforts to see
whether we can automatically figure out the grouping based on the cgroup
hierarchy, possibly with minimal xattr hints on the cgroups.
I don't yet know what generic form soft-affinity should eventually take,
but, with sched_ext, we have a way to try out different ideas in production
and iterate on them, learning at each step of the way. Given how generic
both the problem and the benefits of solving it are, we'll have to reach
some generic solution at some point. Maybe it will come from sched_ext, or
maybe it will come from people working on the fair class, like yourself.
Either way, sched_ext is already showing us what can be achieved and
prodding people towards solving it.
Thanks.
--
tejun