Re: [Patch] sched/fair: Avoid throttle_list starvation with low cfs quota

From: bsegall
Date: Wed Oct 10 2018 - 13:49:33 EST


Ingo Molnar <mingo@xxxxxxxxxx> writes:

> I've Cc:-ed a handful of gents who worked on CFS bandwidth details to widen the discussion.
> Patch quoted below.
>
> Looks like a real bug that needs to be fixed - and at first sight the quota of 1000 looks very
> low - could we improve the arithmetics perhaps?
>
> A low quota of 1000 is used because there's many VMs or containers provisioned on the system
> that is triggering the bug, right?
>
> Thanks,
>
> Ingo
>
> * Phil Auld <pauld@xxxxxxxxxx> wrote:
>
>> From: "Phil Auld" <pauld@xxxxxxxxxx>
>>
>> sched/fair: Avoid throttle_list starvation with low cfs quota
>>
>> With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
>> distribute_cfs_runtime may not empty the throttled_list before it runs
>> out of runtime to distribute. In that case, due to the change from
>> c06f04c7048 to put throttled entries at the head of the list, later entries
>> on the list will starve. Essentially, the same X processes will get pulled
>> off the list, given CPU time and then, when expired, get put back on the
>> head of the list where distribute_cfs_runtime will give runtime to the same
>> set of processes leaving the rest.
>>
>> Fix the issue by setting a bit in struct cfs_bandwidth when
>> distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
>> decide to put the throttled entry on the tail or the head of the list. The
>> bit is set/cleared by the callers of distribute_cfs_runtime while they hold
>> cfs_bandwidth->lock.
>>
>> Signed-off-by: Phil Auld <pauld@xxxxxxxxxx>
>> Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>> Cc: stable@xxxxxxxxxxxxxxx

Reviewed-by: Ben Segall <bsegall@xxxxxxxxxx>


In theory this does mean the unfairness could still happen if distribute is still
running, but while a tiny quota makes it more likely, the fact that
we're not getting through much of the list makes it not really a worry.
If you wanted to be even more careful there could be some generation
counter or something, but it doesn't seem necessary.


>> ---
>>
>> This is easy to reproduce with a handful of cpu consumers. I use crash on
>> the live system. In some cases you can simply look at the throttled list and
>> see the later entries are not changing:
>>
>> crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
>> 1 ffff90b56cb2d200 -976050
>> 2 ffff90b56cb2cc00 -484925
>> 3 ffff90b56cb2bc00 -658814
>> 4 ffff90b56cb2ba00 -275365
>> 5 ffff90b166a45600 -135138
>> 6 ffff90b56cb2da00 -282505
>> 7 ffff90b56cb2e000 -148065
>> 8 ffff90b56cb2fa00 -872591
>> 9 ffff90b56cb2c000 -84687
>> 10 ffff90b56cb2f000 -87237
>> 11 ffff90b166a40a00 -164582
>> crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
>> 1 ffff90b56cb2d200 -994147
>> 2 ffff90b56cb2cc00 -306051
>> 3 ffff90b56cb2bc00 -961321
>> 4 ffff90b56cb2ba00 -24490
>> 5 ffff90b166a45600 -135138
>> 6 ffff90b56cb2da00 -282505
>> 7 ffff90b56cb2e000 -148065
>> 8 ffff90b56cb2fa00 -872591
>> 9 ffff90b56cb2c000 -84687
>> 10 ffff90b56cb2f000 -87237
>> 11 ffff90b166a40a00 -164582
>>
>> Sometimes it is easier to see by finding a process getting starved and looking
>> at the sched_info:
>>
>> crash> task ffff8eb765994500 sched_info
>> PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
>> sched_info = {
>> pcount = 8,
>> run_delay = 697094208,
>> last_arrival = 240260125039,
>> last_queued = 240260327513
>> },
>> crash> task ffff8eb765994500 sched_info
>> PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
>> sched_info = {
>> pcount = 8,
>> run_delay = 697094208,
>> last_arrival = 240260125039,
>> last_queued = 240260327513
>> },
>>
>>
>> fair.c | 22 +++++++++++++++++++---
>> sched.h | 2 ++
>> 2 files changed, 21 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7fc4a371bdd2..f88e00705b55 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4476,9 +4476,13 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>>
>> /*
>> * Add to the _head_ of the list, so that an already-started
>> - * distribute_cfs_runtime will not see us
>> + * distribute_cfs_runtime will not see us. If disribute_cfs_runtime is
>> + * not running add to the tail so that later runqueues don't get starved.
>> */
>> - list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>> + if (cfs_b->distribute_running)
>> + list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>> + else
>> + list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>>
>> /*
>> * If we're the first throttled task, make sure the bandwidth
>> @@ -4622,14 +4626,16 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>> * in us over-using our runtime if it is all used during this loop, but
>> * only by limited amounts in that extreme case.
>> */
>> - while (throttled && cfs_b->runtime > 0) {
>> + while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
>> runtime = cfs_b->runtime;
>> + cfs_b->distribute_running = 1;
>> raw_spin_unlock(&cfs_b->lock);
>> /* we can't nest cfs_b->lock while distributing bandwidth */
>> runtime = distribute_cfs_runtime(cfs_b, runtime,
>> runtime_expires);
>> raw_spin_lock(&cfs_b->lock);
>>
>> + cfs_b->distribute_running = 0;
>> throttled = !list_empty(&cfs_b->throttled_cfs_rq);
>>
>> cfs_b->runtime -= min(runtime, cfs_b->runtime);
>> @@ -4740,6 +4746,11 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>>
>> /* confirm we're still not at a refresh boundary */
>> raw_spin_lock(&cfs_b->lock);
>> + if (cfs_b->distribute_running) {
>> + raw_spin_unlock(&cfs_b->lock);
>> + return;
>> + }
>> +
>> if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
>> raw_spin_unlock(&cfs_b->lock);
>> return;
>> @@ -4749,6 +4760,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>> runtime = cfs_b->runtime;
>>
>> expires = cfs_b->runtime_expires;
>> + if (runtime)
>> + cfs_b->distribute_running = 1;
>> +
>> raw_spin_unlock(&cfs_b->lock);
>>
>> if (!runtime)
>> @@ -4759,6 +4773,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>> raw_spin_lock(&cfs_b->lock);
>> if (expires == cfs_b->runtime_expires)
>> cfs_b->runtime -= min(runtime, cfs_b->runtime);
>> + cfs_b->distribute_running = 0;
>> raw_spin_unlock(&cfs_b->lock);
>> }
>>
>> @@ -4867,6 +4882,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> cfs_b->period_timer.function = sched_cfs_period_timer;
>> hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> cfs_b->slack_timer.function = sched_cfs_slack_timer;
>> + cfs_b->distribute_running = 0;
>> }
>>
>> static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 455fa330de04..9683f458aec7 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -346,6 +346,8 @@ struct cfs_bandwidth {
>> int nr_periods;
>> int nr_throttled;
>> u64 throttled_time;
>> +
>> + bool distribute_running;
>> #endif
>> };
>>
>>
>>
>> --