Re: [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller
From: Benjamin Segall
Date: Tue Jun 22 2021 - 14:58:08 EST
Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
> On Mon, Jun 21, 2021 at 05:27:58PM +0800, Huaixin Chang wrote:
>> The CFS bandwidth controller limits CPU requests of a task group to
>> quota during each period. However, parallel workloads might be bursty
>> so that they get throttled even when their average utilization is under
>> quota. And they are latency sensitive at the same time so that
>> throttling them is undesired.
>>
>> We borrow time now against our future underrun, at the cost of increased
>> interference against the other system users. All nicely bounded.
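
A rough way to picture the borrowing described above is a token bucket that
refills by one quota per period and is capped at quota + burst. The sketch
below is a toy model under that assumption, not the kernel code; the function
name and the demand figures are made up for illustration.

```python
def simulate(demand_us, quota_us, burst_us):
    """Toy token-bucket model; returns throttled time (usec) per period.

    Assumption: unused runtime carries over, capped at quota + burst.
    Unmet demand is simply dropped rather than deferred to the next period.
    """
    runtime = quota_us  # budget available in the first period
    throttled = []
    for demand in demand_us:
        used = min(demand, runtime)
        throttled.append(demand - used)
        # Refill by one quota, never exceeding quota + burst.
        runtime = min(runtime - used + quota_us, quota_us + burst_us)
    return throttled


# Hypothetical per-period CPU demand: average 80ms, one 140ms spike.
demand = [60_000, 60_000, 140_000, 60_000]
print(simulate(demand, 100_000, 0))        # [0, 0, 40000, 0]
print(simulate(demand, 100_000, 100_000))  # [0, 0, 0, 0]
```

With burst set to 0 the spike is throttled by 40ms even though average demand
is under quota; with burst the earlier underrun pays for it.
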
>>
>> Traditional (UP-EDF) bandwidth control is something like:
>>
>> (U = \Sum u_i) <= 1
>>
>> This guarantees both that every deadline is met and that the system is
>> stable. After all, if U were > 1, then for every second of walltime,
>> we'd have to run more than a second of program time, and obviously miss
>> our deadline, but the next deadline will be further out still, there is
>> never time to catch up, unbounded fail.
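
The utilization test and the "never time to catch up" argument can be written
out in a few lines. This is only a sketch; the helper names are hypothetical
and the backlog formula assumes demand arrives at a constant rate U.

```python
def edf_schedulable(utilizations):
    """UP-EDF test: total utilization must not exceed one CPU."""
    return sum(utilizations) <= 1.0


def backlog_seconds(total_utilization, walltime_seconds):
    """Unserved work after walltime_seconds, assuming constant demand rate."""
    return max(0.0, (total_utilization - 1.0) * walltime_seconds)


print(edf_schedulable([0.5, 0.25, 0.25]))  # True
print(edf_schedulable([0.75, 0.5]))        # False
# With U = 1.25 the backlog grows without bound as walltime increases.
print(backlog_seconds(1.25, 8), backlog_seconds(1.25, 80))  # 2.0 20.0
```
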
>>
>> This work observes that a workload doesn't always execute the full
>> quota; this enables one to describe u_i as a statistical distribution.
>>
>> For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
>> (the traditional WCET). This effectively allows u to be smaller,
>> increasing the efficiency (we can pack more tasks in the system), but at
>> the cost of missing deadlines when all the odds line up. However, it
>> does maintain stability, since every overrun must be paired with an
>> underrun as long as our x is above the average.
>>
>> That is, suppose we have 2 tasks, both specify a p(95) value, then we
>> have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
>> everything is good. At the same time we have a p(5)*p(5) = 0.25% chance
>> both tasks will exceed their quota at the same time (guaranteed deadline
>> fail). Somewhere in between there's a threshold where one exceeds and
>> the other doesn't underrun enough to compensate; this depends on the
>> specific CDFs.
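
The two-task arithmetic above generalizes to any number of tasks if the tasks
are assumed to be statistically independent. That independence is an
assumption of this sketch, and the function names are hypothetical.

```python
import math


def p_all_within(p_within):
    """Probability that every task stays within its quota (independent tasks)."""
    return math.prod(p_within)


def p_all_exceed(p_within):
    """Probability that every task exceeds its quota at the same time."""
    return math.prod(1.0 - p for p in p_within)


two_tasks = [0.95, 0.95]
print(f"{p_all_within(two_tasks):.4%}")  # 90.2500%
print(f"{p_all_exceed(two_tasks):.4%}")  # 0.2500%
```

The remaining 9.5% is the in-between region where exactly one task exceeds
its quota; whether the deadline is missed there depends on the CDFs.
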
>>
>> At the same time, we can say that the worst-case deadline miss will be
>> \Sum e_i; that is, there is a bounded tardiness (under the assumption
>> that x+e is indeed WCET).
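
As a small worked example of the {x,e} description: admission uses the sum of
the x values, while the tardiness bound is the sum of the e values. The task
figures below are invented for illustration and the helper names are
hypothetical.

```python
def planned_utilization(tasks):
    """Sum of x_i, the p(95) part used for admission."""
    return sum(x for x, _ in tasks)


def tardiness_bound(tasks):
    """Sum of e_i, valid only if x_i + e_i really is the WCET."""
    return sum(e for _, e in tasks)


# Each task is (x, e) in milliseconds per 100ms period.
tasks = [(40, 10), (30, 20), (25, 5)]
print(planned_utilization(tasks))  # 95, fits in a 100ms period
print(tardiness_bound(tasks))      # 35, worst-case overrun in ms
```
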
>>
>> The benefit of burst is seen when testing with schbench. The default values
>> of kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.
>>
>> mkdir /sys/fs/cgroup/cpu/test
>> echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>>
>> ./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
>>
>> The average CPU usage is at 80%. I ran this 10 times; 6 runs showed long
>> tail latency and 8 runs were throttled.
>>
>> Tail latencies are shown below; this was not the worst case.
>>
>> Latency percentiles (usec)
>> 50.0000th: 19872
>> 75.0000th: 21344
>> 90.0000th: 22176
>> 95.0000th: 22496
>> *99.0000th: 22752
>> 99.5000th: 22752
>> 99.9000th: 22752
>> min=0, max=22727
>> rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
>>
>> The interference when using burst is evaluated by the probability of
>> missing the deadline and the average WCET. Test results showed that when
>> there are many cgroups or the CPU is underutilized, the interference is
>> limited. More details are shown in:
>> https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@xxxxxxxxxxxxxxxxx/
>>
>> Co-developed-by: Shanpei Chen <shanpeic@xxxxxxxxxxxxxxxxx>
>> Signed-off-by: Shanpei Chen <shanpeic@xxxxxxxxxxxxxxxxx>
>> Co-developed-by: Tianchen Ding <dtcccc@xxxxxxxxxxxxxxxxx>
>> Signed-off-by: Tianchen Ding <dtcccc@xxxxxxxxxxxxxxxxx>
>> Signed-off-by: Huaixin Chang <changhuaixin@xxxxxxxxxxxxxxxxx>
>> ---
>
> Ben, what say you? I'm tempted to pick up at least this first patch.
Yeah, I'm fine with it; I know internally we've thought about adding
something like this.
Reviewed-by: Ben Segall <bsegall@xxxxxxxxxx>