Re: [RFC PATCH v5 20/29] sched/deadline: Allow deeper hierarchies of RT cgroups

From: Juri Lelli

Date: Mon May 11 2026 - 05:46:22 EST

On 07/05/26 18:39, luca abeni wrote:
> Hi,
>
> On Thu, 7 May 2026 17:03:41 +0200
> Juri Lelli <juri.lelli@xxxxxxxxxx> wrote:
>
> > On 07/05/26 12:53, Peter Zijlstra wrote:
> > > On Tue, May 05, 2026 at 09:56:58AM -1000, Tejun Heo wrote:
> >
> > ...
> >
> > > > - However, the cpu controller is a threaded controller which
> > > > means that it can have threaded sub-hierarchy where the
> > > > no-internal-process rule doesn't apply. This was created
> > > > explicitly for cpu controller. The proposed change blocks it
> > > > effectively forcing cpu controller into regular domain controller
> > > > behavior subject to no-internal-process rule. Note these are
> > > > enforced at controller granularity and this means that users who
> > > > use the threaded mode will be forced to pick between the two.
> > >
> > > Right... this then means we need two controls, one to do
> > > hierarchical bandwidth distribution, and one to assign bandwidth to
> > > the internal group -- which is then subject to its own bandwidth
> > > distribution constraint.
> > >
> > > This might be a little confusing, but there is no way around that
> > > AFAICT.
> >
> > Just to check if I'm following, you are thinking something like below?
> >
> > groupA/
> >   cpu.rt.max = "50 50 100"      <- 0.5 from root
> >   cpu.rt.internal = "20 20 100" <- 0.2 from groupA for threads at
> >                                    this level
> >   + threadA <
> >   + threadB <
> >   +- group1/
> >        cpu.rt.max = "30 30 100" <- 0.3 from groupA
> >        + threadC
> >
> > And we still keep it flat, so 2 dl-entities (per CPU), one handles
> > threads at groupA level and the other threads inside group1?
>
> An alternative idea I was thinking about: we create 2 dl entities (one
> for "groupA" and one for "group1"); we set cpu.rt.max for groupA, and
> we subtract group1's utilization from it (so, if groupA's cpu.rt.max is
> "50 100" and group1's cpu.rt.max is "30 100", groupA is served by a dl
> entity (50-30,100)=(20,100) while group1 is served by a dl entity
> (30,100)).
>
> Basically, with this idea the "internal" reservation is automatically
> computed based on rt.max and on the children cgroups. A possible issue
> is that if the children consume all the groupA's utilization the groupA
> RT tasks remain with 0 runtime (and never execute).

While I like the automatic approach, I also fear that it might be more
difficult to maintain/use from a systemd admin perspective, e.g. I
cannot make a subgroup reservation bigger because threads running in
the parent group already consume all the remaining (internal) bandwidth.
If we make it explicit, it seems easier to see where bandwidth is
allocated at all levels.
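
To make the difference concrete, here is a rough userspace sketch of the
arithmetic behind the two options, using the (runtime, period) numbers
from above. It is not kernel code; the function names are made up for
illustration, and cpu.rt.internal is only the knob proposed in this
thread, not an existing interface.

```python
from fractions import Fraction

def util(runtime, period):
    """Utilization of a (runtime, period) reservation."""
    return Fraction(runtime, period)

def implicit_internal(parent_max, children_max):
    """Automatic scheme: internal share = parent's rt.max minus children.

    Hypothetical helper; returns the utilization left for threads that
    sit directly in the parent group.
    """
    remaining = util(*parent_max) - sum(
        (util(*c) for c in children_max), Fraction(0))
    if remaining < 0:
        raise ValueError("children exceed the parent's cpu.rt.max")
    return remaining

def explicit_admissible(parent_max, internal, children_max):
    """Explicit scheme: internal + children must fit in the parent."""
    used = util(*internal) + sum(
        (util(*c) for c in children_max), Fraction(0))
    return used <= util(*parent_max)

group_a = (50, 100)   # groupA cpu.rt.max
group1 = (30, 100)    # group1 cpu.rt.max

# Automatic: groupA's own threads get (50-30)/100.
print(implicit_internal(group_a, [group1]))

# Automatic: growing group1 to 50/100 silently leaves groupA's
# threads with zero bandwidth.
print(implicit_internal(group_a, [(50, 100)]))

# Explicit: 20/100 internal + 30/100 for group1 fits in 50/100.
print(explicit_admissible(group_a, (20, 100), [group1]))

# Explicit: growing group1 to 40/100 is rejected up front, and the
# admin can see that the internal reservation is what is in the way.
print(explicit_admissible(group_a, (20, 100), [(40, 100)]))
```

With the automatic variant the second case succeeds and starves the
parent's threads; with the explicit one the last case fails at
configuration time, which is the visibility I was referring to.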

Peter? Tejun? What do we want to do with this interface?