Re: [PATCH v2 00/10] sched: Flatten the pick

From: Peter Zijlstra

Date: Tue May 12 2026 - 04:13:56 EST

On Mon, May 11, 2026 at 09:23:45AM -1000, Tejun Heo wrote:
> Hello, Peter.
>
> On Mon, May 11, 2026 at 01:31:04PM +0200, Peter Zijlstra wrote:
> > So cgroup scheduling has always been a pain in the arse. The problems start
> > with weight distribution and end with hierachical picks and it all sucks.
> >
> > The problems with weight distribution are related to that infernal global
> > fraction:
> >
> > tg->w * grq_i->w
> > ge_i->w = ----------------
> > \Sum_j grq_j->w
> >
> > which we've approximated reasonably well by now. However, the immediate
> > consequence of this fraction is that the total group weight (tg->w) gets
> > fragmented across all your CPUs. And at 64 CPUs that means your per-cpu cgroup
> > weight ends up being a nice 19 task worth. And more CPUs more tiny. Combine
> > with the fact that 256 CPU systems are relatively common these days, this
> > becomes painful.
> >
> > The common 'solution' is to inflate the group weight by 'nr_cpus'; the
> > immediate problem with that is that when all load of a group gets concentrated
> > on a single CPU, the per-cpu cgroup weight becomes insanely large, easily
> > exceeding nice -20.
> >
> > Additionally there are numerical limits on the max weight you can have before
> > the math starts suffering overflows. As such there is a definite limit on the
> > total group weight. Which has annoyed people ;-)
> >
> > The first few patches add a knob /debug/sched/cgroup_mode and a few different
> > options on how to deal with this. My favourite is 'concur', but obviously that
> > is also the most expensive one :-/ It adds a tg->tasks counter which makes the
> > update_tg_load_avg() thing more expensive.
>
> Ignoring fixed math accuracy problems, isn't the root problem here that
> every thread in the root cgroup competes as if each is its own cgroup? ie.
> Isn't the canonical solution here to create an enveloping group, at least
> for share calculation purposes, for root threads and then assign them some
> weight so that they compete in the same way that other cgroups do? Then, the
> different modes go away or rather whatever the user wants can be expressed
> via root's weight if that's to be made configurable.

As long as the total group weight is a fraction; and it sorta has to be.
You can run into trouble by stacking that fraction.

Take 256 CPUs and a group weight of 1024. Then each CPU gets a weight of
1/256 or 4. Even if we increase the internal accuracy to 20 bits (we do
on 64bit) then this becomes 4096, do this for 2 more levels in the
hierarchy and you're down to scraping the barrel again.

So if each level runs at a fraction f of the level above, then level n
runs at f^n. Moving root into a phantom group at level 1, only solves
the problem against other tasks at level 1, but then you have the same
problem again at level 2 and below.

Both the numerical problems and the scale problem of the root group can
be avoided if we can get the average/nominal fraction to be near 1.

The 'normal' way around this is to ensure the group weight is nr_cpus *
1024, then, when everybody is running, the per CPU weight is 1024 or 1
and the continued fraction is also 1-ish. This is why people like to
increase the max group weight.

Trouble is of course that if not all CPUs are busy, with the extreme
being only a single CPU carrying that weight of nr_cpus*1024, this then
causes trouble because that one CPU gets overloaded.

One of the options is to simply put a max on the single CPU load; which
is the crudest option to just make it 'work'. The one I favour though is
the one where we scale the group weight by: 'min(cpumas, nr_tasks)'.

Anyway, this is why I've been looking at these alternative weight
schemes, to get the nominal fraction near 1 and make these problems go
away. It is both the numerical issues and the disparity between levels
(with root being at level 0 being the most obvious).

Does that make sense?