Re: [PATCH 1/2] sched/fair: expose cpu.max.runtime to set bandwidth runtime directly

From: Benjamin Segall

Date: Tue May 26 2026 - 16:54:24 EST

Fernand Sieber <sieberf@xxxxxxxxxx> writes:

> Add a cpu.max.runtime cgroup v2 interface that allows userspace to set
> the CFS bandwidth controller's runtime directly. This enables CPU credit
> injection: an orchestrator writes a runtime budget which the cgroup
> consumes naturally through the existing bandwidth enforcement mechanism.
>
> The write sets cfs_b->runtime directly. Each period, the task consumes
> runtime and the refill restores only quota (capped at quota + burst), so
> the injected credits drain until runtime falls below the cap, after which
> the cgroup returns to its steady-state quota allocation.
>
> Writes are rejected if the value exceeds quota + burst (the per-period
> runtime cap) or exceeds the maximum bandwidth limit.
>
> Also relax the burst validation: remove the burst <= quota constraint,
> requiring only that burst + quota does not overflow. This allows
> configuring burst > quota so that the runtime cap (quota + burst) can
> reach up to one full period, enabling 100% utilization while credits last.
>
> The interface uses microseconds, consistent with cpu.max quota/period.

I don't necessarily object to supporting this design, where a userspace
program or bpf makes the dynamic quota decisions while the inline cfs
bandwidth touch points handle the performance-sensitive runtime
consumption, given how minimal the change is.

However, the existing APIs give something very close to this: any write
to max/max.burst will also add the new quota to the runtime, and reading
max.runtime (beyond using it to construct a += on runtime) can be done
with cpuacct. Is the overhead of tg_set_cfs_bandwidth (which admittedly isn't
really designed to be fast) too much, or is setting max.runtime rather
than adding to it important, or is it something else?

>
> Signed-off-by: Fernand Sieber <sieberf@xxxxxxxxxx>
> ---
> kernel/sched/core.c | 44 +++++++++++++++-
> tools/testing/selftests/cgroup/test_cpu.c | 62 +++++++++++++++++++++++
> 2 files changed, 104 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b8871449d..d92e5840b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10085,8 +10085,7 @@ static int tg_set_bandwidth(struct task_group *tg,
> if (quota_us != RUNTIME_INF && quota_us > max_bw_runtime_us)
> return -EINVAL;
>
> - if (quota_us != RUNTIME_INF && (burst_us > quota_us ||
> - burst_us + quota_us > max_bw_runtime_us))
> + if (quota_us != RUNTIME_INF && (burst_us + quota_us > max_bw_runtime_us))
> return -EINVAL;

I'm fine with this in general, but we should also keep a check for
burst_us > max_bw_runtime_us, so that burst_us + quota_us cannot
overflow and slip past the second check.
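
To make the overflow concern concrete, here is a minimal userspace sketch,
not kernel code. The limit value and the names max_runtime_us, sum_only_ok
and bounded_ok are hypothetical, and the RUNTIME_INF case is left out for
brevity. It compares a sum-only check with one that bounds burst first.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for max_bw_runtime_us; the value is illustrative only. */
static const uint64_t max_runtime_us = 1000000ULL;

/* Sum-only check: the 64-bit addition can wrap and look small again. */
static bool sum_only_ok(uint64_t quota_us, uint64_t burst_us)
{
	return burst_us + quota_us <= max_runtime_us;
}

/* Bound each operand first, so the sum below cannot wrap. */
static bool bounded_ok(uint64_t quota_us, uint64_t burst_us)
{
	if (quota_us > max_runtime_us || burst_us > max_runtime_us)
		return false;
	return burst_us + quota_us <= max_runtime_us;
}
```

With quota_us = 1000 and burst_us = UINT64_MAX - 500, the sum wraps to 499,
so sum_only_ok() accepts the pair while bounded_ok() rejects it.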

>
> #ifdef CONFIG_CFS_BANDWIDTH
> @@ -10147,6 +10146,41 @@ static int cpu_burst_write_u64(struct cgroup_subsys_state *css,
> tg_bandwidth(tg, &period_us, &quota_us, NULL);
> return tg_set_bandwidth(tg, period_us, quota_us, burst_us);
> }
> +
> +static int cpu_runtime_write_u64(struct cgroup_subsys_state *css,
> + struct cftype *cftype, u64 runtime_us)
> +{
> + struct task_group *tg = css_tg(css);
> + struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> +
> + if (runtime_us > max_bw_runtime_us)
> + return -EINVAL;
> +
> + raw_spin_lock_irq(&cfs_b->lock);
> + if (cfs_b->quota != RUNTIME_INF &&
> + (u64)runtime_us * NSEC_PER_USEC > cfs_b->quota + cfs_b->burst) {
> + raw_spin_unlock_irq(&cfs_b->lock);
> + return -EINVAL;
> + }
> + cfs_b->runtime = (u64)runtime_us * NSEC_PER_USEC;
> + raw_spin_unlock_irq(&cfs_b->lock);
> +
> + return 0;
> +}

The details of this feel very odd on two fronts:

First, while setting runtime rather than adding to it gives more power
to the controlling userspace, it also forces userspace to be racy if it
wants to add runtime. But the original design of cfs bandwidth didn't
have burst anyways, and it's not a disaster if it does race, even if the
orchestrator thread manages to get preempted or such. So I don't exactly
object to this design, but I do want to check in on the idea.
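
The race can be illustrated with a toy, single-threaded model that replays
one interleaving deterministically. This is only a sketch: struct toy_pool
and every function below are hypothetical and merely stand in for
cfs_b->runtime and for the orchestrator's read and write.

```c
#include <stdint.h>

/* Toy stand-in for the bandwidth pool; not the kernel structure. */
struct toy_pool {
	uint64_t runtime_us;
};

/* The cgroup consumes runtime, saturating at zero. */
static void toy_consume(struct toy_pool *p, uint64_t used_us)
{
	p->runtime_us -= used_us < p->runtime_us ? used_us : p->runtime_us;
}

/* "Add" built from set semantics: read, cgroup runs, write back stale sum. */
static uint64_t racy_add_via_set(struct toy_pool *p, uint64_t delta_us,
				 uint64_t used_us)
{
	uint64_t snapshot = p->runtime_us;	/* orchestrator reads */

	toy_consume(p, used_us);		/* cgroup runs in between */
	p->runtime_us = snapshot + delta_us;	/* write based on stale read */
	return p->runtime_us;
}

/* True add semantics: the increment applies to the current value. */
static uint64_t add_in_place(struct toy_pool *p, uint64_t delta_us,
			     uint64_t used_us)
{
	toy_consume(p, used_us);
	p->runtime_us += delta_us;
	return p->runtime_us;
}
```

Starting from 5000us with a 2000us injection and 1500us consumed in
between, the set-based path ends at 7000us and the in-place add ends at
5500us. The difference is the consumption that the stale write hands back.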

More importantly, I think it should definitely call
distribute_cfs_runtime (or an equivalent) so that throttled tasks can
start running again immediately. As it is, that is delayed until the
period timer runs, which is entirely desynchronized from userspace even
if userspace uses the same period for its timers. It is also
inconsistent with newly waking cpus, which will run immediately.
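
As a rough illustration of the difference, the following toy userspace
model contrasts a write that only stores the value with one that also hands
runtime to throttled CPUs. It is a sketch under simplifying assumptions:
struct toy_bw, the fixed per-CPU deficit and all function names are
hypothetical, and toy_distribute() is only loosely inspired by the role of
distribute_cfs_runtime, not a copy of it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_CPUS 4

struct toy_bw {
	uint64_t runtime_us;			/* global pool */
	bool throttled[TOY_NR_CPUS];		/* per-CPU throttled flag */
	uint64_t deficit_us[TOY_NR_CPUS];	/* runtime needed to run again */
};

/* Hand pool runtime to throttled CPUs; returns how many were unthrottled. */
static int toy_distribute(struct toy_bw *bw)
{
	int woken = 0;

	for (size_t cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (!bw->throttled[cpu])
			continue;
		if (bw->runtime_us < bw->deficit_us[cpu])
			break;			/* pool exhausted, stop here */
		bw->runtime_us -= bw->deficit_us[cpu];
		bw->deficit_us[cpu] = 0;
		bw->throttled[cpu] = false;
		woken++;
	}
	return woken;
}

/* Store only: throttled CPUs stay throttled until something distributes. */
static int toy_write_runtime(struct toy_bw *bw, uint64_t runtime_us)
{
	bw->runtime_us = runtime_us;
	return 0;
}

/* Store and distribute: throttled CPUs can resume right away. */
static int toy_write_runtime_and_distribute(struct toy_bw *bw,
					    uint64_t runtime_us)
{
	bw->runtime_us = runtime_us;
	return toy_distribute(bw);
}
```

In this model, a store-only write leaves every throttled CPU waiting for a
later toy_distribute() call, which plays the part of the period timer.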