Re: [PATCH v2 00/10] sched: Flatten the pick
From: Peter Zijlstra
Date: Wed May 27 2026 - 05:47:17 EST
On Mon, May 18, 2026 at 09:11:03AM -1000, Tejun Heo wrote:
> Hello, Peter.
>
> On Mon, May 18, 2026 at 09:14:56AM +0200, Peter Zijlstra wrote:
> ...
> > So the current scheme will inflate the part of A to be double the weight
> > (of B), giving them 2 out of 3 parts on the contended CPUs, but then B
> > will still get complete / uncontested access to those extra 128 CPUs,
> > resulting in a 2:4 weight distribution.
> >
> > Which also isn't as straightforward as one might think.
>
> Right, the current behavior isn't quite what people would expect intuitively
> either.
>
> ...
> > So for the one contended CPU A gets 256 out of 257 parts, while B gets
> > the full CPU for the remaining 255 CPUs, for a:
> >
> >   256    1        257
> >   --- : --- + 255*--- = 256:65536 = 1:256
> >   257   257       257
> >
> > distribution. While with the new scheme it would be:
> >
> >   1   1       2
> >   - : - + 255*- = 1:511
> >   2   2       2
> >
> > Which, realistically isn't all that different, except the old scheme has
> > this really large weight to deal with.
> >
> > So from where I'm sitting, yes different, but it behaves better.

FWIW, if the workload were a single thread per CPU, the above is also
the exact behaviour we'd have without cgroups.

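For anyone who wants to check the arithmetic, here is a rough sketch in
Python (not kernel code). The helper name cpu_time_split() and the model
are only my reading of the numbers above: 256 CPUs, A confined to one
contended CPU, B running everywhere.

```python
from fractions import Fraction


def cpu_time_split(weight_a, weight_b, contended_cpus, total_cpus):
    """Return (A, B) CPU time, in units of whole CPUs.

    Assumed model: A only runs on `contended_cpus`, where it competes
    with B at weight_a:weight_b. B has every other CPU to itself.
    """
    share_a = Fraction(weight_a, weight_a + weight_b)
    a_time = contended_cpus * share_a
    b_time = contended_cpus * (1 - share_a) + (total_cpus - contended_cpus)
    return a_time, b_time


# Old scheme: A competes at 256:1 on the one contended CPU.
old_a, old_b = cpu_time_split(256, 1, contended_cpus=1, total_cpus=256)
# New scheme: A competes at 1:1 on the one contended CPU.
new_a, new_b = cpu_time_split(1, 1, contended_cpus=1, total_cpus=256)

print("old scheme  A:B = 1:%s" % (old_b / old_a))  # 1:256
print("new scheme  A:B = 1:%s" % (new_b / new_a))  # 1:511
```

Either way B ends up with nearly all of the machine; only the weight
used on the contended CPU differs.
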
> I see. Thread cardinality and affinity problems make weight based
> distribution such a pain. I wonder whether this can be better solved by
> turning it into a two-layer allocation problem - groups to CPUs and then
> timeshare on CPUs as necessary. That comes with a lot of its own problems
> but it can, aspirationally at least, approximate global weight distribution
> and would have better locality properties.

If people want, they can already do this today: combine the cpuset and
cpu controllers in a v2 hierarchy and you get this. I don't see a reason
to mandate something like that.

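Roughly, and only as a sketch: the Python below writes the standard
cgroup v2 interface files (cgroup.subtree_control, cpuset.cpus,
cpu.weight). The helper name partition_and_weight(), the group names and
the CPU ranges are made up for illustration, and it assumes both
controllers are available in the parent cgroup and that you have
permission to write there.

```python
from pathlib import Path


def partition_and_weight(root, groups):
    """Place groups on CPUs (cpuset), then set their weights (cpu).

    root:   path of the parent cgroup, e.g. somewhere under /sys/fs/cgroup
    groups: dict of name -> (cpu list string, cpu.weight value)
    """
    root = Path(root)
    # Enable both controllers for the children of `root`.
    (root / "cgroup.subtree_control").write_text("+cpuset +cpu")
    for name, (cpus, weight) in groups.items():
        cg = root / name
        cg.mkdir(exist_ok=True)
        # Layer 1: which CPUs the group may run on.
        (cg / "cpuset.cpus").write_text(cpus)
        # Layer 2: how it timeshares wherever it overlaps with siblings.
        (cg / "cpu.weight").write_text(str(weight))


# Hypothetical usage (not executed here):
#   partition_and_weight("/sys/fs/cgroup/jobs",
#                        {"job-a": ("0-127", 100), "job-b": ("128-255", 100)})
```

With disjoint cpusets the weights never come into play; they only matter
on CPUs that more than one group is allowed to use.
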
The main problem with doing something like that is of course that it
isn't always clear how many CPUs will be needed for a particular 'job'.
So assigning groups to CPUs isn't a straightforward thing.

If I remember correctly, Meta was actually doing some of this. It was
dynamically resizing cpusets based on load predictions and the like in
order to separate various workloads on the same large machine, right?

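To illustrate the sizing problem only, and explicitly not how Meta or
anyone else does it: a toy Python sketch that turns per-job load
estimates into cpuset.cpus-style ranges. The function size_cpusets(),
the job names and the load numbers are all hypothetical, and where the
predictions come from is left open.

```python
def size_cpusets(predicted_load, total_cpus):
    """Map each job to a CPU range proportional to its predicted load.

    predicted_load: dict of job name -> positive integer load estimate.
    """
    total = sum(predicted_load.values())
    # Rounded-down proportional share for every job.
    counts = {job: load * total_cpus // total
              for job, load in predicted_load.items()}
    # Hand the CPUs lost to rounding to the largest remainders.
    leftover = total_cpus - sum(counts.values())
    by_remainder = sorted(
        predicted_load,
        key=lambda job: predicted_load[job] * total_cpus % total,
        reverse=True)
    for job in by_remainder[:leftover]:
        counts[job] += 1

    ranges, first = {}, 0
    for job, n in counts.items():
        if n == 0:
            # Share rounded down to nothing; a real tool needs a minimum.
            ranges[job] = ""
        elif n == 1:
            ranges[job] = str(first)
        else:
            ranges[job] = "%d-%d" % (first, first + n - 1)
        first += n
    return ranges


print(size_cpusets({"web": 3, "batch": 1}, total_cpus=256))
# {'web': '0-191', 'batch': '192-255'}
```

The hard part is of course the prediction, not the split.
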
Anyway, while it is somewhat tedious to change behaviour, I do think it
is worth doing in this case.