Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller

From: Odin Ugedal
Date: Tue Feb 09 2021 - 08:18:59 EST



Hi! This looks quite useful, but I have a few quick thoughts. :)

I know of a lot of people who would love this (especially some
Kubernetes users)! I really like how this allow users to use cfs
in a more dynamic and flexible way, without interfering with those
who like the enforce strict quotas.


> +++ b/kernel/sched/core.c
> @ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64
> [...]
> + /* burst_onset needed */
> + if (cfs_b->quota != RUNTIME_INF &&
> + sysctl_sched_cfs_bw_burst_enabled &&
> + sysctl_sched_cfs_bw_burst_onset_percent > 0) {
> +
> + burst_onset = do_div(burst, 100) *
> + sysctl_sched_cfs_bw_burst_onset_percent;
> +
> + cfs_b->runtime += burst_onset;
> + cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
> + }

I saw a comment about this behavior, but I think this can lead to a bit of
confusion. If sysctl_sched_cfs_bw_burst_onset_percent=0, the amount of
bandwidth when the first process starts up will depend on the time between
the quota was set and the startup of the process, and that feel a bit like
a "timing" race that end user application then will have to think about.

I suspect contianer runtimes and/or tools like Kubernetes will then have
to tell users to set the value to a certan amount in order to make it
work as expected.

Another thing is that when a cgroup has saved some time into the
"burst quota", updating the quota, period or burst will then reset the
"burst quota", even though eg. only the burst was changed. Some tools
use dynamic quotas, resulting in multiple changes in the quota over time,
and I am a bit scared that don't allowing them to control "start burst"
on a write can be limiting.

Maybe we can allow people to set the "start bandwidth" explicitly when setting
cfs_burst if they want to do that? (edit: that might be hard for cgroup v1, but
would I think that is a good solution on cgroup v2).

This is however just my thoughts, and I am not 100% sure about what the
best solution is, but if we merge a certain behavior, we have no real
chance of changing it later.


> +++ b/kernel/sched/sched.h
> @@ -367,6 +367,7 @@ struct cfs_bandwidth {
> u64 burst;
> u64 buffer;
> u64 max_overrun;
> + u64 previous_runtime;
> s64 hierarchical_quota;

Maybe indicate that this was the remaining runtime _after_ the previous
period ended? Not 100% sure, but maybe sometihing like
'remaining_runtime_prev_period' or 'end_runtime_prev_period'(as inspiration).


> +++ b/kernel/sched/core.c
> @@ -8234,6 +8236,10 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
> seq_printf(sf, "wait_sum %llu\n", ws);
> }
>
> + seq_printf(sf, "current_bw %llu\n", cfs_b->runtime);
> + seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
> + seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
> +
> return 0;
> }

Looks like these metrics are missing from the cgroup v2 stats.

Are we sure it is smart to start exposing cfs_b->runtime, since it makes it
harder to change the implementation at a later time? I don't thin it is that
usefull, and if it is only exposed for debugging purposes people can probably
use kprobes instead? Also, it would not be usefull unless you know how much
wall time is left in the current period. In that sense,
cfs_b->previous_runtime would probably be more usefull, but still not sure if
it deserves to be exposed to end users like this.

Also, will "cfs_b->runtime" keep updating if no processes are running, or
will it be the the same here, but update (with burst via timer overrun)
when a process starts again? If so, the runtime available when a process
starts on cgroup inint can be hard to communicate if the value here doesn't
update.


> +++ b/kernel/sched/fair.c
> +void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init)
> [...]
> + /*
> + * When period timer stops, quota for the following period is not
> + * refilled, however period timer is already forwarded. We should
> + * accumulate quota once more than overrun here.
> + */


Trying to wrap my head around this one... Is not refilling here, as the
behavior before your patch causing "loss" in runtime and causing unnecessary
possibly causing a cgroup throttle?.


A am not that familiar how cross subsystem patches like these are handled, but
I am still adding the Tejun Heo (cgroup maintainer) as a CC. Should maybe cc to
cgroup@ as well?

Sorry for a long mail, in retrospect it should have been one per patch...