Re: [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller

From: Peter Zijlstra
Date: Tue Jun 22 2021 - 09:20:18 EST


On Mon, Jun 21, 2021 at 05:27:58PM +0800, Huaixin Chang wrote:
> The CFS bandwidth controller limits CPU requests of a task group to
> quota during each period. However, parallel workloads might be bursty
> so that they get throttled even when their average utilization is under
> quota. And they are latency sensitive at the same time so that
> throttling them is undesired.
>
> We borrow time now against our future underrun, at the cost of increased
> interference against the other system users. All nicely bounded.
>
> Traditional (UP-EDF) bandwidth control is something like:
>
> (U = \Sum u_i) <= 1
>
> This guarantees both that every deadline is met and that the system is
> stable. After all, if U were > 1, then for every second of walltime,
> we'd have to run more than a second of program time, and obviously miss
> our deadline, but the next deadline will be further out still, there is
> never time to catch up, unbounded fail.
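
A minimal sketch of the utilization test above, for illustration only. The
function names are hypothetical and not part of the patch; utilizations are
taken as fractions of one CPU.

```python
def edf_schedulable(utilizations):
    """UP-EDF test from the text: schedulable iff U = sum(u_i) <= 1."""
    return sum(utilizations) <= 1.0


def backlog_after(utilizations, seconds):
    """CPU time (seconds) still pending after `seconds` of walltime.

    With U > 1 every second of walltime adds (U - 1) seconds of work
    that never gets run, so the backlog grows without bound.
    """
    excess = max(0.0, sum(utilizations) - 1.0)
    return excess * seconds


if __name__ == "__main__":
    print(edf_schedulable([0.5, 0.25, 0.25]))   # U == 1, still fine
    print(edf_schedulable([0.75, 0.5]))         # U == 1.25, overloaded
    print(backlog_after([0.75, 0.5], 10))       # backlog after 10 seconds
```

The second helper only restates the "never time to catch up" argument: the
backlog is linear in walltime whenever U exceeds 1.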
>
> This work observes that a workload doesn't always execute the full
> quota; this enables one to describe u_i as a statistical distribution.
>
> For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
> (the traditional WCET). This effectively allows u to be smaller,
> increasing the efficiency (we can pack more tasks in the system), but at
> the cost of missing deadlines when all the odds line up. However, it
> does maintain stability, since every overrun must be paired with an
> underrun as long as our x is above the average.
>
> That is, suppose we have 2 tasks, both specify a p(95) value, then we
> have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
> everything is good. At the same time we have a p(5)*p(5) = 0.25% chance
> both tasks will exceed their quota at the same time (guaranteed deadline
> fail). Somewhere in between there's a threshold where one exceeds and
> the other doesn't underrun enough to compensate; this depends on the
> specific CDFs.
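
The arithmetic in the two-task example can be checked with a short sketch.
It assumes the tasks overrun independently of each other, which the
multiplication above implies but the text does not state outright; the
function names are hypothetical.

```python
from math import prod


def p_all_within(percentiles):
    """Probability that every task stays within its quota x_i,
    assuming independent tasks."""
    return prod(percentiles)


def p_all_exceed(percentiles):
    """Probability that every task exceeds its quota at the same time,
    assuming independent tasks."""
    return prod(1.0 - p for p in percentiles)


if __name__ == "__main__":
    tasks = [0.95, 0.95]
    within = p_all_within(tasks)
    exceed = p_all_exceed(tasks)
    # What is left is the case where exactly one task overruns; whether
    # that misses a deadline depends on the specific CDFs.
    mixed = 1.0 - within - exceed
    print(f"{within:.4f} {exceed:.4f} {mixed:.4f}")
```

For two p(95) tasks the remaining 9.5% is the "somewhere in between" region,
where one task overruns and the outcome depends on how far the other underruns.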
>
> At the same time, we can say that the worst case deadline miss, will be
> \Sum e_i; that is, there is a bounded tardiness (under the assumption
> that x+e is indeed WCET).
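
A rough sketch of the packing versus tardiness trade-off, with made-up
numbers. Each task is an (x, e) pair in microseconds per period, and the
100000us period is an assumption chosen only for the example.

```python
PERIOD_US = 100_000  # hypothetical period, for illustration only


def admitted(tasks, period_us=PERIOD_US):
    """Admission on the p(95) values only: sum(x_i) must fit in one period."""
    return sum(x for x, _ in tasks) <= period_us


def wcet_overcommitted(tasks, period_us=PERIOD_US):
    """True if the traditional WCET test, sum(x_i + e_i), would reject the set."""
    return sum(x + e for x, e in tasks) > period_us


def worst_case_tardiness(tasks):
    """Bound from the text: the worst-case deadline miss is sum(e_i)."""
    return sum(e for _, e in tasks)


if __name__ == "__main__":
    tasks = [(40_000, 10_000), (50_000, 20_000)]
    print(admitted(tasks))              # fits when only x is counted
    print(wcet_overcommitted(tasks))    # would not fit at full WCET
    print(worst_case_tardiness(tasks))  # bounded tardiness in usec
```

The set is admitted on x alone although it would be rejected at full WCET,
and the price is a deadline miss of at most sum(e_i) when all the odds line up.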
>
> The benefit of burst is seen when testing with schbench. The default values
> of kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.
>
> mkdir /sys/fs/cgroup/cpu/test
> echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>
> ./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
>
> The average CPU usage is at 80%. I ran this 10 times, and got long tail
> latency 6 times and was throttled 8 times.
>
> Tail latencies are shown below, and it wasn't the worst case.
>
> Latency percentiles (usec)
> 50.0000th: 19872
> 75.0000th: 21344
> 90.0000th: 22176
> 95.0000th: 22496
> *99.0000th: 22752
> 99.5000th: 22752
> 99.9000th: 22752
> min=0, max=22727
> rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
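
The two ratios on the last line can be reproduced from the numbers shown,
assuming "cputime" there is the 80000us passed with -c on the command line.
That reading is an assumption; the helper name is hypothetical.

```python
CPUTIME_US = 80_000  # assumed to be the value given to schbench via -c


def latency_over_cputime(latency_us, cputime_us=CPUTIME_US):
    """Tail latency as a percentage of the per-request cputime."""
    return 100.0 * latency_us / cputime_us


if __name__ == "__main__":
    print(f"p95/cputime {latency_over_cputime(22_496):.2f}%")
    print(f"p99/cputime {latency_over_cputime(22_752):.2f}%")
```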
>
> The interference when using burst is evaluated by the probability of
> missing the deadline and the average WCET. Test results showed that when
> there are many cgroups or the CPU is underutilized, the interference is
> limited. More details are shown in:
> https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@xxxxxxxxxxxxxxxxx/
>
> Co-developed-by: Shanpei Chen <shanpeic@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Shanpei Chen <shanpeic@xxxxxxxxxxxxxxxxx>
> Co-developed-by: Tianchen Ding <dtcccc@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Tianchen Ding <dtcccc@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Huaixin Chang <changhuaixin@xxxxxxxxxxxxxxxxx>
> ---

Ben, what say you? I'm tempted to pick up at least this first patch.