Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server
From: Juri Lelli
Date: Tue Jun 09 2026 - 12:17:56 EST
Hi Yuri,
Thanks for sending this out.
On 08/06/26 14:15, Yuri Andriaccio wrote:
> Hello,
>
> This is the v6 for Hierarchical Constant Bandwidth Server, aiming at replacing
> the current RT_GROUP_SCHED mechanism with something more robust and
> theoretically sound. The patchset has been presented at OSPM25 and OSPM26
> (https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
> be found at https://lwn.net/Articles/1021332/ . You can find the previous
> versions of this patchset at the bottom of the page, in particular version 1,
> which explains in more detail what this patchset is about and how it is
> implemented.
>
> This v6 version works on the comments by the reviewers and introduces the
> following meaningful changes:
> - Update to kernel version 7.1.
> - Refactorings and general cleanups.
> - Removal of substantial duplicated code.
> - Express more locking constraints in code.
> - New cpu.rt.max interface.
> - Refactoring of migration code to reduce code duplication.
> The new migration code now reuses the existing push/pull and similar functions
> and specializes where needed, substantially reducing the footprint of group
> migration code from previous versions.
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> New cgroup-v2 interface:
> After extensive discussions with the kernel's maintainers, we have built a new
> interface to support HCBS scheduling. Since this will be a cgroup-v2 only
> feature (the fate of cgroup-v1 old RT_GROUP_SCHED has yet to be decided), it was
> possible to drop the original v1 interface entirely and create a completely new
> one that is similar to those that are already existing.
>
> Every cgroup now has two new files:
> - cpu.rt.max (similar to the cpu.max file)
> - cpu.rt.internal (read-only, not available in the root cgroup, it may be
> removed if deemed unnecessary, see later for details)
>
> In this new interface, HCBS cgroups may either be set to use deadline servers,
> thus reserving a specified amount of bandwidth, very similarly to the
> previous system, or delegate their FIFO/RR tasks' scheduling to the nearest
> configured ancestor (the default on group creation). If the nearest
> configured ancestor is the root cgroup, tasks will effectively run on the
> root runqueue even if their cgroup is not the root task group.
>
> This means that subtrees are allowed to retain the original non-RT_GROUP_SCHED
> behaviour, scheduling on root, while the feature is nonetheless active. In the
> meantime other subtrees may use HCBS, and the whole hierarchy can coexist
> without issues.
>
> This behaviour is specified in the cpu.rt.max file, which accepts the string
> "<runtime | 'max'> <period>". A zero runtime disables FIFO/RR scheduling for
> tasks in that group, a non-zero runtime creates a reservation and uses HCBS, a
> runtime of 'max' instead tells the scheduler to use the nearest configured
> ancestor for the FIFO/RR task scheduling.
>
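To check my reading of the three cases, here is a minimal Python sketch of
how the string could be classified. The names (RtMax, parse_rt_max) and the
mode labels are made up for illustration; this is not the parser from the
patchset, and it assumes both fields are always present, as the format
string above suggests.

```python
from typing import NamedTuple, Optional


class RtMax(NamedTuple):
    mode: str                # "disabled", "reserved" or "delegate" (hypothetical labels)
    runtime: Optional[int]   # None when the runtime field is 'max'
    period: int


def parse_rt_max(text: str) -> RtMax:
    """Classify a "<runtime | 'max'> <period>" string into one of three modes."""
    fields = text.split()
    if len(fields) != 2:
        raise ValueError("expected '<runtime | max> <period>'")
    runtime_s, period_s = fields
    period = int(period_s)
    if period <= 0:
        raise ValueError("period must be positive")
    if runtime_s == "max":
        # Delegate FIFO/RR scheduling to the nearest configured ancestor.
        return RtMax("delegate", None, period)
    runtime = int(runtime_s)
    if runtime < 0:
        raise ValueError("runtime must be non-negative")
    if runtime == 0:
        # FIFO/RR scheduling is disabled for tasks in this group.
        return RtMax("disabled", 0, period)
    # Non-zero runtime: the group gets its own reservation (HCBS).
    return RtMax("reserved", runtime, period)
```
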
> The admission test no longer checks only the immediate children of a cgroup
> for schedulability (recall that a group's bandwidth must always be greater than
> or equal to its children's total bandwidth); it has to check the whole
> subtree: if a child delegates its tasks to its parent (runtime = 'max'), then
> this child's own children (the grandchildren) are effectively viewed as
> immediate children that compete for the same bandwidth of their grandparent, and
> so on down the hierarchy.
>
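A small model of the subtree test as I understand it from the paragraph
above, in Python. The Group class and the function names are hypothetical,
exact fractions are used instead of the kernel's fixed-point bandwidth, and
a zero runtime is simply treated as a reservation of zero bandwidth, which
is my assumption and not something the cover letter states.

```python
from dataclasses import dataclass, field
from fractions import Fraction
from typing import List, Optional


@dataclass
class Group:
    name: str
    period: int
    runtime: Optional[int] = None        # None models runtime = 'max' (delegating)
    children: List["Group"] = field(default_factory=list)

    def bandwidth(self) -> Fraction:
        return Fraction(self.runtime, self.period)


def effective_children(group: Group) -> List[Group]:
    """Nearest configured descendants, looking through delegating groups."""
    found = []
    for child in group.children:
        if child.runtime is None:
            # A delegating child is transparent: its children compete
            # for this group's bandwidth directly.
            found.extend(effective_children(child))
        else:
            found.append(child)
    return found


def admissible(group: Group) -> bool:
    """Check a configured group and every configured group below it."""
    kids = effective_children(group)
    total = sum((k.bandwidth() for k in kids), Fraction(0))
    if total > group.bandwidth():
        return False
    return all(admissible(k) for k in kids)
```

With a root of 50/100, a delegating child holding two grandchildren of
20/100 each, and a direct child of 20/100, the grandchildren count against
the root and the total (60/100) exceeds it, so the configuration would be
rejected in this model.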
> To support both threaded and domain cgroups, the original test that allowed
> tasks to run only in leaf cgroups has been removed: this is already enforced for
> domain cgroups by existing code, while this must not be the case for threaded
> cgroups.
>
> Since groups in the middle of the hierarchy can now also run tasks, their
> dl_servers must be configured properly: a parent cgroup's dl_servers can only
> use its assigned bandwidth minus the total of its children. The cpu.rt.internal
> file reports exactly this "remainder" bandwidth. Since dl_servers must
> have runtime and period values assigned, the period is taken from the
> user-configured cpu.rt.max file and the runtime is computed from the remainder bw.
> This runtime and the period are the values shown by cpu.rt.internal.
>
> Supporting both threaded and domain cgroups also dropped all the extra code
> related to active and 'live' cgroups as mentioned in previous RFCs.
>
I started playing with the new interface and ended up with the following:
bash-5.3# cat cpu.rt.max (root)
10000 100000
bash-5.3# cat g1/cpu.rt.max
10000 100000
bash-5.3# cat g1/cpu.rt.internal
9999 100000
which looks odd to me, as nothing is running on g1 yet and it has no child
groups either. Maybe a rounding error of some kind?
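
For what it's worth, the numbers are consistent with a truncating round trip
through the scheduler's fixed-point bandwidth. The Python sketch below mirrors
to_ratio() (runtime << BW_SHIFT divided by period, with BW_SHIFT = 20) and
then converts back; the reverse helper, ratio_to_runtime(), is a hypothetical
name and only my guess at how the runtime shown by cpu.rt.internal might be
derived, I have not checked it against the patches.

```python
BW_SHIFT = 20  # fixed-point shift used for bandwidth ratios in the scheduler


def to_ratio(period: int, runtime: int) -> int:
    """Fixed-point bandwidth, truncating like an integer division."""
    return (runtime << BW_SHIFT) // period


def ratio_to_runtime(bw: int, period: int) -> int:
    """Hypothetical inverse: runtime for a given bandwidth and period."""
    return (bw * period) >> BW_SHIFT


bw = to_ratio(100000, 10000)             # 104857 (exact value would be 104857.6)
runtime = ratio_to_runtime(bw, 100000)   # 9999
```

Both steps round down, so one unit is lost even though the remainder is
the full reservation.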
Thanks,
Juri